
Performance Engineering on Jetson Orin NX

ThoxOS is custom-tuned for inference workloads on Jetson Orin NX. Three concurrent workstreams target tail latency, UMA contention, and hot-path predictability: a PREEMPT_RT kernel sprint to bound scheduler jitter, a memory broker to keep Ollama, TensorRT-LLM, and cuQuantum from starving each other, and a Rust inference daemon that retires Ollama from the hot path.

Active Workstreams

Three concurrent fronts, one objective: predictable inference.

Status as of May 2026
Sprint in flight
Workstream 01

Kernel Tuning

PREEMPT_RT kernel + CPU isolation for inference threads, api-server and cluster-agent pinned to dedicated cores, blocking I/O paths moved to io_uring. Measurable wins on tail latency; see the core-pinning sketch after the list below.

  • PREEMPT_RT — bounded scheduling latency under load.
  • isolcpus — cores 4–7 reserved for inference; kernel housekeeping cannot steal them.
  • io_uring — batched submission queues for KV cache spill + log flush.
  • Target: cut p99 enqueue jitter from ~8ms to <1ms on saturated nodes.
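
A minimal Rust sketch of how an inference worker could pin itself to one of the isolated cores and request a real-time scheduling class. The function name, core number, and priority value are illustrative assumptions, not the shipped ThoxOS configuration.

```rust
// Hypothetical worker-thread setup: pin to an isolcpus-reserved core and
// ask for SCHED_FIFO so PREEMPT_RT's bounded scheduling latency applies.
// Requires the `libc` crate; core/priority values are illustrative only.
use std::mem;

fn pin_to_isolated_core(core: usize, priority: i32) -> std::io::Result<()> {
    unsafe {
        // Build a CPU set containing only the requested core.
        let mut set: libc::cpu_set_t = mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(core, &mut set);
        // pid 0 == the calling thread.
        if libc::sched_setaffinity(0, mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // SCHED_FIFO keeps housekeeping tasks from preempting the hot loop.
        let param = libc::sched_param { sched_priority: priority };
        if libc::sched_setscheduler(0, libc::SCHED_FIFO, &param) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Example: dedicate core 5 (inside the isolated 4-7 range) to this worker.
    pin_to_isolated_core(5, 40)?;
    // ... inference hot loop runs here ...
    Ok(())
}
```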
Memory broker WIP
Workstream 02

UMA Discipline

Orin's biggest performance trap is the unified memory pool: every byte the GPU grabs is a byte the CPU loses. The pipelining doc's managed aggregation is correct; what's missing is a hard memory broker between Ollama, TensorRT-LLM, and cuQuantum so they cannot starve each other. The admission check is sketched after the list below.

  • Hard per-runtime ceilings (Ollama 5.5GB, TensorRT-LLM 6GB, cuQuantum 2.5GB) enforced at allocation time.
  • Broker daemon refuses allocations that would trip OOM and surfaces back-pressure to callers.
  • LRU model + KV-cache eviction with NVMe spillover when a higher-priority runtime needs headroom.
  • Per-runtime telemetry in the GPU Dashboard with pre-ceiling alerts.
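
A minimal sketch of the broker's admission check, under illustrative assumptions: the names (MemoryBroker, try_reserve, Admission) and the byte-level accounting are ours for the example, not the shipped broker's API; the ceilings mirror the figures listed above.

```rust
// Hypothetical admission check: a reservation either fits under the
// runtime's hard ceiling or is refused with explicit back-pressure,
// so the UMA pool never reaches the OOM killer.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Runtime { Ollama, TensorRtLlm, CuQuantum }

enum Admission {
    Granted,
    // Caller must evict (LRU model, KV cache) or spill to NVMe, then retry.
    Rejected { deficit_bytes: u64 },
}

struct MemoryBroker {
    ceiling: HashMap<Runtime, u64>, // hard per-runtime ceilings, in bytes
    in_use: HashMap<Runtime, u64>,  // bytes currently reserved per runtime
}

impl MemoryBroker {
    fn try_reserve(&mut self, rt: Runtime, bytes: u64) -> Admission {
        let ceiling = self.ceiling[&rt];
        let used = self.in_use.get(&rt).copied().unwrap_or(0);
        if used + bytes > ceiling {
            // Refuse at allocation time instead of letting the GPU starve the CPU.
            return Admission::Rejected { deficit_bytes: used + bytes - ceiling };
        }
        *self.in_use.entry(rt).or_insert(0) += bytes;
        Admission::Granted
    }
}

fn main() {
    let gib = |g: f64| (g * 1024.0 * 1024.0 * 1024.0) as u64;
    let mut broker = MemoryBroker {
        ceiling: HashMap::from([
            (Runtime::Ollama, gib(5.5)),
            (Runtime::TensorRtLlm, gib(6.0)),
            (Runtime::CuQuantum, gib(2.5)),
        ]),
        in_use: HashMap::new(),
    };
    // A 6 GB request from Ollama exceeds its 5.5 GB ceiling and is refused.
    match broker.try_reserve(Runtime::Ollama, gib(6.0)) {
        Admission::Granted => println!("granted"),
        Admission::Rejected { deficit_bytes } => {
            println!("back-pressure: {deficit_bytes} bytes over ceiling")
        }
    }
}
```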
Replacing Ollama
Workstream 03

Rust Inference Daemon

Ollama is the largest non-Rust process in the hot path, and its Go GC pauses show up under load. Wrapping llama-cpp-2 in our own Axum-based Rust HTTP server gives us deterministic memory behavior without touching the kernel; the drop-in surface is sketched after the list below.

  • llama-cpp-2 — direct Rust bindings to llama.cpp; same GGUF models, no Go runtime in the hot path.
  • Axum HTTP — already powers cluster-agent and the inference gateway.
  • Arena-allocated request buffers; tokenizer + KV cache in long-lived pools; no tracing GC in the hot path.
  • Drop-in /api/generate + /v1/chat/completions surface so existing clients (incl. NullClaw) need zero changes.
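
A rough skeleton of the drop-in surface, assuming axum 0.7, tokio, and serde. The request/response fields mirror the Ollama-style /api/generate shape, and the llama-cpp-2 call is stubbed out since that binding layer is the piece still under construction.

```rust
// Hypothetical skeleton of the Rust inference daemon's HTTP surface.
// Only the route shape is meaningful here; the completion itself is a stub.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct GenerateRequest {
    model: String,
    prompt: String,
}

#[derive(Serialize)]
struct GenerateResponse {
    model: String,
    response: String,
    done: bool,
}

async fn generate(Json(req): Json<GenerateRequest>) -> Json<GenerateResponse> {
    // In the real daemon this hands the prompt to a long-lived llama-cpp-2
    // context drawn from an arena-backed pool, so no request path touches a
    // tracing GC. Here we just echo a stub completion.
    let response = format!("[stub completion for model {}]", req.model);
    Json(GenerateResponse { model: req.model, response, done: true })
}

#[tokio::main]
async fn main() {
    // Same route (and default Ollama port) that existing clients already use,
    // so they need zero configuration changes.
    let app = Router::new().route("/api/generate", post(generate));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:11434").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```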

Target Outcomes

What the three workstreams are measured against. Numbers shown are targets for the in-flight sprint, not shipped benchmarks.

  • p99 enqueue jitter: <1ms (down from ~8ms)
  • UMA contention OOMs: 0 (enforced at allocation time)
  • Hot-path GC pauses: none (Rust + arena allocation)
  • Client API breakage: 0 (drop-in surface)

Built into the platform, not bolted on.

Performance work lands directly in ThoxOS and ships to every Nova device and MagStack cluster. Read the OS deep-dive or browse hardware specs for more context.