
Local inference

Optional Tier-1 local LLM inference via llama.cpp, behind the InferenceBackend trait seam. Models load from a GGUF file by path; GPU offload uses Metal on Apple silicon, and CUDA is cloud-only.

Local model execution is opt-in (Tier 1). The default kx install is FFI-free and does not include it; enable it when you want a model running on your own machine.

What it is

  • llama.cpp behind the InferenceBackend trait seam — the runtime talks to an interface, not directly to llama.cpp, so the backend is swappable.
  • GGUF by path — point the runtime at a local GGUF model file.
  • A Qwen3-4B agent is the reference model.
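Since models are referenced by path, it can help to sanity-check a file before handing it to the backend. The sketch below is not part of the runtime; `looks_like_gguf` is a hypothetical helper that relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// GGUF files start with the four ASCII bytes "GGUF".
const GGUF_MAGIC: [u8; 4] = *b"GGUF";

/// Hypothetical helper (not a runtime API): returns Ok(true) when the file
/// at `path` starts with the GGUF magic bytes, Ok(false) when it does not,
/// and Err when the file cannot be opened or read.
pub fn looks_like_gguf(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut magic = [0u8; 4];
    match file.read_exact(&mut magic) {
        Ok(()) => Ok(magic == GGUF_MAGIC),
        // A file shorter than four bytes cannot be a GGUF model.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

This checks only the header bytes. It does not validate the GGUF version, the tensor data, or whether the model fits in memory.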

Build requirements

Local inference compiles llama.cpp, so it needs a C++ toolchain. The FFI-free default build does not.

GPU offload

  • Metal on Apple silicon.
  • CUDA is cloud-only (not part of the local build).
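As a rough illustration of the offload rules above, the following sketch picks an offload mode from the compile target. The `GpuOffload` type and `local_offload` function are hypothetical names, and treating every non-Apple-silicon target as CPU-only is an assumption based on CUDA not being part of the local build.

```rust
/// Hypothetical type for this sketch: offload modes available to a local build.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuOffload {
    /// Metal, used on Apple silicon.
    Metal,
    /// No GPU offload. Assumed here for every other local target,
    /// since CUDA is cloud-only.
    CpuOnly,
}

/// Chooses the offload mode from the compile target.
/// Apple silicon is detected as macOS on aarch64.
pub fn local_offload() -> GpuOffload {
    if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        GpuOffload::Metal
    } else {
        GpuOffload::CpuOnly
    }
}
```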

Swapping the backend

Because inference sits behind the InferenceBackend seam, you can implement your own backend (a different local engine, or a remote one) without touching the rest of the runtime. See Extending.
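To make the seam concrete, here is a minimal sketch of what a swappable backend could look like. The trait methods, `InferenceError`, `EchoBackend`, and `Runtime` are all assumptions made for illustration; the real InferenceBackend trait may have a different shape, so check Extending for the actual signatures.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug)]
pub struct InferenceError(pub String);

impl fmt::Display for InferenceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "inference error: {}", self.0)
    }
}

impl std::error::Error for InferenceError {}

/// Assumed shape of the seam. The real trait may differ.
pub trait InferenceBackend {
    fn name(&self) -> &str;
    fn complete(&mut self, prompt: &str, max_tokens: usize) -> Result<String, InferenceError>;
}

/// Stand-in backend that echoes the prompt. A real implementation would
/// call a local engine or a remote service here.
pub struct EchoBackend;

impl InferenceBackend for EchoBackend {
    fn name(&self) -> &str {
        "echo"
    }

    fn complete(&mut self, prompt: &str, max_tokens: usize) -> Result<String, InferenceError> {
        if prompt.is_empty() {
            return Err(InferenceError("empty prompt".to_string()));
        }
        // Whitespace-separated words stand in for tokens in this sketch.
        let words: Vec<&str> = prompt.split_whitespace().take(max_tokens).collect();
        Ok(words.join(" "))
    }
}

/// The runtime side depends only on the trait, never on a concrete engine.
pub struct Runtime {
    backend: Box<dyn InferenceBackend>,
}

impl Runtime {
    pub fn new(backend: Box<dyn InferenceBackend>) -> Self {
        Self { backend }
    }

    pub fn backend_name(&self) -> &str {
        self.backend.name()
    }

    pub fn ask(&mut self, prompt: &str) -> Result<String, InferenceError> {
        // Fixed budget of 8 "tokens" keeps the example small.
        self.backend.complete(prompt, 8)
    }
}
```

Swapping the backend then means constructing the runtime with a different boxed implementation, for example `Runtime::new(Box::new(EchoBackend))`. Nothing else in the caller changes.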

Frontier on demand

Local-first does not mean local-only. The runtime is designed to reach for hosted frontier models when a task needs more depth than the local model provides, while keeping the default path local and FFI-free.
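One way to picture this policy is a small routing function that defaults to local and escalates only on demand. This is a sketch under stated assumptions: `Task`, its `needs_depth` flag, `Route`, and `choose_route` are hypothetical names, and the documentation does not say how the runtime actually decides when to escalate.

```rust
/// Hypothetical routing targets for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Local,
    Frontier,
}

/// Hypothetical task description. `needs_depth` stands in for whatever
/// signal the runtime uses to decide that a task exceeds the local model.
pub struct Task {
    pub prompt: String,
    pub needs_depth: bool,
}

/// Local is the default. A hosted frontier model is chosen only when the
/// task asks for it and a frontier backend is configured.
pub fn choose_route(task: &Task, frontier_configured: bool) -> Route {
    if task.needs_depth && frontier_configured {
        Route::Frontier
    } else {
        Route::Local
    }
}
```

In this sketch a task that needs depth still runs locally when no frontier backend is configured, which keeps the default path usable offline.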