
Performance Engineering on Jetson Orin NX

ThoxOS is custom-tuned for inference workloads on Jetson Orin NX. Three concurrent workstreams target tail latency, UMA contention, and hot-path predictability: a PREEMPT_RT kernel sprint to bound scheduler jitter, a memory broker to keep Ollama, TensorRT-LLM, and cuQuantum from starving each other, and a Rust inference daemon that retires Ollama from the hot path.

Active Workstreams

Three concurrent fronts, one objective: predictable inference.

Status as of May 2026
Sprint in flight
Workstream 01

Kernel Tuning

PREEMPT_RT kernel + CPU isolation for inference threads, api-server and cluster-agent pinned to dedicated cores, blocking I/O paths moved to io_uring. Measurable wins on tail latency; see the core-pinning sketch after the list below.

  • PREEMPT_RT — bounded scheduling latency under load.
  • isolcpus — cores 4–7 reserved for inference; kernel housekeeping cannot steal them.
  • io_uring — batched submission queues for KV cache spill + log flush.
  • Target: cut p99 enqueue jitter from ~8ms to <1ms on saturated nodes.
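
A minimal Rust sketch of how an inference worker could pin itself to one of the isolated cores and request a real-time scheduling class. The function name, core number, and priority value are illustrative assumptions, not the shipped ThoxOS configuration.

```rust
// Hypothetical worker-thread setup: pin to an isolcpus-reserved core and
// ask for SCHED_FIFO so PREEMPT_RT's bounded scheduling latency applies.
// Requires the `libc` crate; core/priority values are illustrative only.
use std::mem;

fn pin_to_isolated_core(core: usize, priority: i32) -> std::io::Result<()> {
    unsafe {
        // Build a CPU set containing only the requested core.
        let mut set: libc::cpu_set_t = mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // pid 0 == the calling thread.
        if libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // SCHED_FIFO keeps housekeeping tasks from preempting the hot loop.
        let param = libc::sched_param { sched_priority: priority };
        if libc::sched_setscheduler(0, libc::SCHED_FIFO, &param) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Example: dedicate core 5 (inside the isolated 4-7 range) to this worker.
    pin_to_isolated_core(5, 40)?;
    // ... inference hot loop runs here ...
    Ok(())
}
```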
Memory broker WIP
Workstream 02

UMA Discipline

Orin's biggest performance trap is the unified memory pool: every byte the GPU grabs is a byte the CPU loses. The pipelining doc's managed aggregation is correct; what's missing is a hard memory broker between Ollama, TensorRT-LLM, and cuQuantum so they cannot starve each other. The admission check is sketched after the list below.

  • Hard per-runtime ceilings (Ollama 5.5GB, TensorRT-LLM 6GB, cuQuantum 2.5GB) enforced at allocation time.
  • Broker daemon refuses allocations that would trip OOM and surfaces back-pressure to callers.
  • LRU model + KV-cache eviction with NVMe spillover when a higher-priority runtime needs headroom.
  • Per-runtime telemetry in the GPU Dashboard with pre-ceiling alerts.
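
A minimal sketch of the broker's admission check, under illustrative assumptions: the names (MemoryBroker, try_reserve, Admission) and the byte-level accounting are ours for the example, not the shipped broker's API; the ceilings mirror the figures listed above.

```rust
// Hypothetical admission check: a reservation either fits under the
// runtime's hard ceiling or is refused with explicit back-pressure,
// so the UMA pool never reaches the OOM killer.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Runtime { Ollama, TensorRtLlm, CuQuantum }

enum Admission {
    Granted,
    // Caller must evict (LRU model, KV cache) or spill to NVMe, then retry.
    Rejected { deficit_bytes: u64 },
}

struct MemoryBroker {
    ceiling: HashMap<Runtime, u64>, // hard per-runtime ceilings, in bytes
    in_use: HashMap<Runtime, u64>,  // bytes currently reserved per runtime
}

impl MemoryBroker {
    fn try_reserve(&mut self, rt: Runtime, bytes: u64) -> Admission {
        let ceiling = self.ceiling[&rt];
        let used = self.in_use.get(&rt).copied().unwrap_or(0);
        if used + bytes > ceiling {
            // Refuse at allocation time instead of letting the GPU starve the CPU.
            return Admission::Rejected { deficit_bytes: used + bytes - ceiling };
        }
        *self.in_use.entry(rt).or_insert(0) += bytes;
        Admission::Granted
    }
}

fn main() {
    let gib = |g: f64| (g * 1024.0 * 1024.0 * 1024.0) as u64;
    let mut broker = MemoryBroker {
        ceiling: HashMap::from([
            (Runtime::Ollama, gib(5.5)),
            (Runtime::TensorRtLlm, gib(6.0)),
            (Runtime::CuQuantum, gib(2.5)),
        ]),
        in_use: HashMap::new(),
    };
    // A 6 GB request from Ollama exceeds its 5.5 GB ceiling and is refused.
    match broker.try_reserve(Runtime::Ollama, gib(6.0)) {
        Admission::Granted => println!("granted"),
        Admission::Rejected { deficit_bytes } => {
            println!("back-pressure: {deficit_bytes} bytes over ceiling")
        }
    }
}
```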
Replacing Ollama
Workstream 03

Rust Inference Daemon

Ollama is the largest non-Rust process in the hot path, and its Go GC pauses show up under load. Wrapping llama-cpp-2 in our own Axum-based Rust HTTP server gives us deterministic memory behavior without touching the kernel; the drop-in surface is sketched after the list below.

  • llama-cpp-2 — direct Rust bindings to llama.cpp; same GGUF models, no Go runtime in the hot path.
  • Axum HTTP — already powers cluster-agent and the inference gateway.
  • Arena-allocated request buffers; tokenizer + KV cache in long-lived pools; no tracing GC in the hot path.
  • Drop-in /api/generate + /v1/chat/completions surface so existing clients (incl. NullClaw) need zero changes.
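
A rough skeleton of the drop-in surface, assuming axum 0.7, tokio, and serde. The request/response fields mirror the Ollama-style /api/generate shape, and the llama-cpp-2 call is stubbed out since that binding layer is the piece still under construction.

```rust
// Hypothetical skeleton of the Rust inference daemon's HTTP surface.
// Only the route shape is meaningful here; the completion itself is a stub.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct GenerateRequest {
    model: String,
    prompt: String,
}

#[derive(Serialize)]
struct GenerateResponse {
    model: String,
    response: String,
    done: bool,
}

async fn generate(Json(req): Json<GenerateRequest>) -> Json<GenerateResponse> {
    // In the real daemon this hands the prompt to a long-lived llama-cpp-2
    // context drawn from an arena-backed pool, so no request path touches a
    // tracing GC. Here we just echo a stub completion.
    let response = format!("[stub completion for model {}]", req.model);
    Json(GenerateResponse { model: req.model, response, done: true })
}

#[tokio::main]
async fn main() {
    // Same route (and default Ollama port) that existing clients already use,
    // so they need zero configuration changes.
    let app = Router::new().route("/api/generate", post(generate));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:11434").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```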

Target Outcomes

What the three workstreams are measured against. Numbers shown are targets for the in-flight sprint, not shipped benchmarks.

  • p99 enqueue jitter: <1ms (down from ~8ms)
  • UMA contention OOMs: 0 (enforced at allocation time)
  • Hot-path GC pauses: none (Rust + arena allocation)
  • Client API breakage: 0 (drop-in surface)

Built into the platform, not bolted on.

Performance work lands directly in ThoxOS and ships to every Nova device and MagStack cluster. Read the OS deep-dive or browse hardware specs for more context.