Features & Usage

Learn how to make the most of your Thox.ai device's capabilities.

Popular Guides

AI-powered code completion

Get intelligent suggestions as you type.

How it works

Thox.ai analyzes your code context in real-time to provide relevant completions. It considers your current file, open files, and project structure to suggest accurate code.

Triggering completions

Completions appear automatically as you type. Press Tab to accept, Escape to dismiss. In VS Code, you can also use Ctrl+Space to manually trigger suggestions.

Multi-line completions

For longer suggestions, Thox.ai can complete entire functions or code blocks. These appear with a preview showing what will be inserted.

Language support

Best results with Python, JavaScript, TypeScript, Go, Rust, Java, and C++. Other languages are supported but may have reduced accuracy.

Customization

Adjust completion behavior in settings: delay before suggestions, maximum suggestion length, and languages to enable/disable.

Choosing the right model

Select optimal models for your use case.

thox-coder (7B)

Optimized for code completion and generation. 7B parameters, balanced speed and quality. Best for most development workflows. Runs on Ollama backend (45-72 tok/s).

thox-coder-pro (14B)

Enhanced 14B model for complex development tasks. Automatically routes to TensorRT-LLM for 60-100% faster inference. Ideal for system design and complex refactoring.

thox-coder-max (32B)

Maximum capability 32B model for enterprise workloads. Uses TensorRT-LLM backend for production performance. Best for architecture design and security auditing.

Hybrid Inference

Thox.ai automatically routes requests to the optimal backend: Ollama for smaller models (7B) and TensorRT-LLM for larger models (14B+). This provides up to 100% performance improvement for large models.

Switching models

Change the active model via the web interface (/admin/models) or the CLI: "thox models switch [name]". The smart router automatically selects the best backend for your model.

Interactive chat and Q&A

Ask questions and get explanations.

Accessing chat

Use the web interface at /chat or the chat panel in the IDE extensions. Ask questions about your code, request explanations, or get help with debugging.

Context-aware responses

The chat understands your codebase. Reference files with @filename and it will include them in context. Ask about specific functions or classes.

Code generation

Request new code with a plain-language prompt such as "Write a function that validates email addresses" and receive complete, ready-to-use code blocks.
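
For example, asking "Write a function that validates email addresses" might produce something like the sketch below; the exact output varies by model and prompt.

    import re

    # Simple pattern-based check: accepts most common addresses,
    # but it is not a full RFC 5322 validator.
    EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

    def is_valid_email(address: str) -> bool:
        """Return True if the address looks like a valid email."""
        return bool(EMAIL_PATTERN.match(address))

    print(is_valid_email("user@example.com"))  # True
    print(is_valid_email("not-an-email"))      # False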

Conversation history

Chat maintains context within a session. Follow up on previous responses without repeating context. Start a new session to reset.

System prompts

Customize behavior with system prompts in settings. Define coding style preferences, language preferences, or specialized instructions.
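
For example, one possible system prompt for a Python-focused workflow:

    You are a senior Python developer. Prefer standard-library solutions,
    include type hints, and briefly explain any non-obvious design decisions.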

Context and project understanding

How Thox.ai understands your codebase.

Automatic indexing

On first connection, Thox.ai indexes your project structure. This enables smart completions that reference other files and understand project layout.

Context window

The model can process thousands of tokens of context. It automatically selects relevant code from open files, imports, and related files.

Project configuration

Add a .thoxignore file to exclude files from indexing (similar to .gitignore). Exclude build directories, node_modules, and large binary files.
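
A minimal .thoxignore might look like the following; the patterns are only examples, so adjust them to your project:

    # Dependencies and build output
    node_modules/
    dist/
    build/
    target/

    # Large binary and data files
    *.bin
    *.zip
    *.sqlite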

Re-indexing

Trigger a manual re-index after major project changes with "thox index refresh" or via the web interface at /admin/index.

API and integrations

Integrate Thox.ai with your tools.

OpenAI-compatible API

Thox.ai exposes an OpenAI-compatible API at /v1. Use existing OpenAI client libraries by pointing them to your Thox.ai device.
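
For example, with the official OpenAI Python client you can set base_url to your device's /v1 endpoint. The hostname, API key, and prompt below are placeholders; substitute your own values.

    from openai import OpenAI

    # Point the standard OpenAI client at the Thox.ai device instead of api.openai.com.
    client = OpenAI(
        base_url="http://thox.local/v1",  # replace with your device's address
        api_key="your-api-key",           # generated in /admin/api-keys
    )

    response = client.chat.completions.create(
        model="thox-coder",
        messages=[{"role": "user", "content": "Explain what a context manager is in Python."}],
    )
    print(response.choices[0].message.content)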

Endpoints

Use /v1/completions for text completion, /v1/chat/completions for chat, and /v1/embeddings for vector embeddings. The full API reference is available at /docs/api-reference.
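
As a sketch, an embeddings request through the same client looks like this; the embedding model name is a placeholder, so check the API reference for the models your device actually exposes.

    # Reusing the client configured above.
    result = client.embeddings.create(
        model="your-embedding-model",  # placeholder; see /docs/api-reference
        input="def add(a, b): return a + b",
    )
    print(len(result.data[0].embedding))  # dimensionality of the returned vector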

Authentication

Generate API keys in /admin/api-keys and pass them via the Authorization header: "Bearer your-api-key". Keys can have scopes and rate limits.
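
If you are not using an OpenAI client library, send the header yourself, for example with the requests package (URL and key below are placeholders):

    import requests

    resp = requests.post(
        "http://thox.local/v1/chat/completions",  # replace with your device's address
        headers={"Authorization": "Bearer your-api-key"},
        json={
            "model": "thox-coder",
            "messages": [{"role": "user", "content": "Hello"}],
        },
    )
    print(resp.json())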

Rate limits

The defaults are 60 requests/minute and 100k tokens/hour. Adjust per-key limits in the admin interface. Requests from the local network can be exempted from these limits.
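
A simple way to stay within the limits is to back off and retry when a request is rejected. The sketch below assumes the device signals rate limiting with a standard HTTP 429 status, which this guide does not confirm.

    import time
    import requests

    def post_with_retry(url, headers, payload, retries=3):
        """Retry on HTTP 429 (assumed rate-limit status) with exponential backoff."""
        for attempt in range(retries):
            resp = requests.post(url, headers=headers, json=payload)
            if resp.status_code != 429:
                return resp
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s before retrying
        return resp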

Webhooks

Configure webhooks in /admin/webhooks to receive notifications on completion events, errors, or model changes.
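
A webhook target only needs to accept HTTP POST requests from the device. The sketch below is a minimal standard-library receiver; the event payload format is not documented here, so it is simply parsed and printed.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class WebhookHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            print(json.loads(body or b"{}"))  # inspect the event payload
            self.send_response(200)
            self.end_headers()

    # Listen on port 8080, then register http://<your-host>:8080/ in /admin/webhooks.
    HTTPServer(("", 8080), WebhookHandler).serve_forever()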

Getting the best performance

Optimize speed and quality of responses.

Hybrid Architecture

Thox.ai uses a hybrid Ollama + TensorRT-LLM architecture. Smaller models (7B) use Ollama for simplicity, while larger models (14B+) automatically route to TensorRT-LLM for 60-100% faster inference.

TensorRT-LLM Benefits

TensorRT-LLM provides custom attention kernels, paged KV caching, and INT8/INT4 quantization. This delivers significantly higher tokens/second for production workloads with large models.

Use Ethernet

Wired connections provide the lowest latency. Wi-Fi adds 20-50ms per request. For real-time completions, Ethernet is strongly recommended.

Smart Routing

The smart router automatically selects the optimal backend based on model size, latency requirements, and backend availability. Check router status at /router/status.
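
You can also poll that endpoint programmatically; the sketch below assumes it returns JSON, since the exact response fields are not documented here.

    import requests

    status = requests.get("http://thox.local/router/status")  # replace with your device's address
    print(status.json())  # e.g. active backends and current routing decisions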

Thermal management

Keep the device cool for sustained performance. TensorRT-LLM is more GPU-intensive but also more efficient. Allow cool-down periods during intensive sessions.

Explore More