Local inference
Optional Tier-1 local LLM inference via llama.cpp behind the InferenceBackend trait seam — GGUF by path, Metal on Apple, CUDA cloud-only.
Local model execution is opt-in (Tier 1). The default kx install is FFI-free and does not include it; enable it when you want a model running on your own machine.
What it is
- llama.cpp behind the InferenceBackend trait seam: the runtime talks to an interface, not directly to llama.cpp, so the backend is swappable.
- GGUF by path: point the runtime at a local GGUF model file.
- A Qwen3-4B agent is the reference model.
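Since the model is loaded by path, it can help to confirm that a file really is GGUF before handing it to the runtime. The sketch below is not part of kx; `gguf_version` is a hypothetical helper. It relies only on the GGUF file layout: the four ASCII bytes "GGUF" followed by a 32-bit version, assumed here to be little-endian (the format's default).

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Every GGUF file starts with these four ASCII bytes.
const GGUF_MAGIC: &[u8] = b"GGUF";

/// Hypothetical helper: returns the GGUF format version stored in the header,
/// or an error if the file is too short or does not start with the magic.
fn gguf_version(path: &Path) -> io::Result<u32> {
    let mut header = [0u8; 8];
    // read_exact fails with UnexpectedEof if the file has fewer than 8 bytes.
    File::open(path)?.read_exact(&mut header)?;
    if !header.starts_with(GGUF_MAGIC) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }
    // Assumption: little-endian header, the default for GGUF files.
    Ok(u32::from_le_bytes([header[4], header[5], header[6], header[7]]))
}
```

A check like this catches a wrong path or a partially downloaded file early, before the backend spends time trying to load it.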
Build requirements
Local inference compiles llama.cpp, so it needs a C++ toolchain. The FFI-free default build does not.
GPU offload
- Metal on Apple silicon.
- CUDA is cloud-only (not part of the local build).
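The offload rule above can be sketched as a small selection function. Everything named here (`GpuOffload`, `local_offload`, `host_offload`) is hypothetical and not taken from kx. The CPU-only fallback for non-Apple machines is an assumption, inferred from CUDA not being part of the local build.

```rust
use std::env::consts::{ARCH, OS};

/// Hypothetical offload choices for a local build.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuOffload {
    Metal,
    /// Assumption: without Metal, a local build runs on the CPU.
    CpuOnly,
}

/// Picks an offload mode from an OS/architecture pair.
/// Apple silicon is macOS on aarch64; everything else gets no local GPU offload.
fn local_offload(os: &str, arch: &str) -> GpuOffload {
    if os == "macos" && arch == "aarch64" {
        GpuOffload::Metal
    } else {
        GpuOffload::CpuOnly
    }
}

/// Convenience wrapper using the platform this binary was compiled for.
fn host_offload() -> GpuOffload {
    local_offload(OS, ARCH)
}
```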
Swapping the backend
Because inference sits behind the InferenceBackend seam, you can implement your own backend (a different local engine, or a remote one) without touching the rest of the runtime. See Extending.
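To illustrate the seam, here is a minimal sketch of what a custom backend could look like. The trait signature is an assumption made for illustration; the real InferenceBackend trait may have different methods and types. `EchoBackend` and `run_task` are hypothetical names.

```rust
/// Hypothetical shape of the seam. The real trait in kx may differ.
trait InferenceBackend {
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String, String>;
}

/// A toy backend standing in for a different local engine or a remote one.
struct EchoBackend;

impl InferenceBackend for EchoBackend {
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String, String> {
        // Whitespace-separated words stand in for tokens in this sketch.
        let words: Vec<&str> = prompt.split_whitespace().take(max_tokens).collect();
        Ok(words.join(" "))
    }
}

/// Runtime-side code depends only on the trait object, never on a concrete engine.
fn run_task(backend: &mut dyn InferenceBackend, prompt: &str) -> Result<String, String> {
    backend.generate(prompt, 4)
}
```

Because `run_task` takes `&mut dyn InferenceBackend`, swapping `EchoBackend` for another implementation requires no change to the calling code, which is the point of putting inference behind a trait.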
Frontier on demand
Local-first does not mean local-only. The runtime is designed to reach for hosted frontier models when a task needs the depth, while keeping the default path local and FFI-free.
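One way to picture this policy is a routing function that defaults to local and escalates only when asked. This is a sketch under stated assumptions, not the runtime's actual logic: `Route`, `Task`, `needs_depth`, and `frontier_available` are hypothetical names, and how the runtime decides that a task "needs the depth" is not specified here.

```rust
/// Hypothetical routing targets.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    Local,
    Frontier,
}

/// Hypothetical task descriptor; `needs_depth` stands in for whatever
/// signal the runtime uses to decide a task exceeds the local model.
struct Task {
    needs_depth: bool,
}

/// Local is the default path. A hosted frontier model is chosen only when
/// the task asks for it and one is actually configured.
fn choose_route(task: &Task, frontier_available: bool) -> Route {
    if task.needs_depth && frontier_available {
        Route::Frontier
    } else {
        Route::Local
    }
}
```

Note that a task that needs depth still runs locally when no hosted model is configured, which keeps the FFI-free default install usable on its own.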
Recipes
Five reusable, parameterized workflow shapes that compile to a mote DAG — map_reduce, fan_out_gather, retry_until_critic, react_tool_loop, image_batch_describe_reduce.
Architecture
A layered DAG of 39 crates; single-system by default, with an optional coordinator/worker layer that keeps the same guarantees.