LMX Overview
Opta LMX is an MLX-native inference server built for Apple Silicon. It delivers OpenAI-compatible APIs for local model inference, running entirely on your Mac Studio's unified memory with no cloud dependency.
What Is LMX
LMX (Local Model eXecution) is a headless Python daemon that loads large language models into Apple Silicon unified memory using the MLX framework. It exposes an OpenAI-compatible HTTP API on port 1234, making it a drop-in replacement for LM Studio, Ollama, or any other local inference server.
LMX is designed to run 24/7 on a dedicated Apple Silicon machine (recommended: Mac Studio with M3 Ultra and 192GB unified memory). The Opta CLI daemon connects to LMX over LAN to perform inference.
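Whether a given model fits alongside the OS comes down to simple arithmetic on the unified memory budget. A rough sizing sketch (the 20% overhead factor for KV cache and activations is an assumption for illustration, not an LMX constant):

```python
def model_footprint_gb(params_billion: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough unified-memory footprint of a quantized model in GB.

    overhead (assumed 20%) approximates KV cache, activations, and
    tokenizer state on top of the raw weights.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 30B-parameter model at 4-bit quantization:
print(model_footprint_gb(30, 4))  # → 18.0 (GB)
```

On a 192GB Mac Studio, an 18GB footprint leaves generous headroom for larger models or several models loaded at once.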
Why MLX Native
Most local inference tools use GGUF quantized models through llama.cpp. LMX uses Apple's MLX framework directly, which provides significant advantages on Apple Silicon:
- 15-30% faster inference compared to GGUF on the same hardware, because MLX uses the GPU and Neural Engine natively without translation layers.
- Unified memory efficiency — MLX models sit directly in unified memory shared by CPU and GPU, avoiding data copies.
- Native Metal acceleration — All matrix operations run on Metal shaders optimized for Apple GPU architectures.
- GGUF fallback — LMX can still load GGUF models when an MLX-native variant is not available, so you are never locked out of a model.
OpenAI Compatibility
LMX implements the OpenAI Chat Completions API spec. Any tool that supports a custom OpenAI base URL works with LMX out of the box:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.188.11:1234/v1",
    api_key="not-needed",  # LMX does not require API keys on LAN
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in response:
    # The final streamed chunk may carry no content; guard against None.
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Core Capabilities
LMX provides 12 non-negotiable capabilities that define its operational contract:
- Streaming inference — Server-Sent Events (SSE) for real-time token streaming
- WebSocket streaming — Alternative low-latency streaming via WebSocket
- Model hot-swap — Load and unload models without restarting the server
- Multi-model inventory — Scan and list all available models on disk
- Health probes — Separate liveness (/healthz) and readiness (/readyz) endpoints
- Admin API — Model management, metrics, and configuration endpoints
- SSE metrics stream — Real-time throughput and VRAM telemetry via /admin/events
- Embedding generation — /v1/embeddings for RAG and semantic search
- Reranking — /v1/rerank for search result reranking
- Agent skills — Registered tool functions callable by models
- Graceful degradation — Auto-unload on OOM instead of crashing
- Launchd integration — Runs as a macOS service with auto-restart
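The first capability, SSE streaming, delivers each token as a `data:` line in the OpenAI streaming format, with the stream terminated by `data: [DONE]`. A minimal client-side parser sketch (the function name is illustrative, not part of LMX):

```python
import json

def parse_sse_stream(raw: str) -> str:
    """Collect streamed delta tokens from an OpenAI-style SSE body.

    Each event is a line `data: <json>`; `data: [DONE]` ends the stream.
    """
    out = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        out.append(delta.get("content") or "")
    return "".join(out)
```

In practice the OpenAI SDK does this parsing for you (as in the example above); the sketch only shows what crosses the wire.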
Never-Crash Guarantee

When memory pressure would otherwise cause an out-of-memory crash, LMX unloads the active model instead and answers inference requests with 503 Service Unavailable and a no-model-loaded error code. The model can be reloaded once memory is available.

Architecture
LMX is a FastAPI application with an MLX inference backend. It runs on the Mac Studio and listens on all interfaces (0.0.0.0:1234) to accept connections from any device on the LAN.
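The OOM guardian in the engine layer can be thought of as a headroom policy: unload before free unified memory gets dangerously low, rather than letting an allocation fail. A sketch under assumed thresholds (the 10% headroom figure is illustrative, not LMX's actual setting):

```python
def should_unload(used: float, total: float, headroom: float = 0.10) -> bool:
    """OOM-guard policy sketch: unload the model when free unified memory
    drops below a headroom fraction of the total.

    `used` and `total` may be in any consistent unit (bytes, GB, ...).
    The 10% default is an assumed value for illustration.
    """
    free = total - used
    return free < headroom * total

# 180GB wired out of 192GB leaves 12GB free, below the 19.2GB floor:
print(should_unload(180, 192))  # → True
```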
```
┌─────────────────────────────────────────┐
│ Opta LMX (192.168.188.11:1234)          │
│                                         │
│ FastAPI Server                          │
│ ├── /v1/chat/completions (inference)    │
│ ├── /v1/embeddings (embedding)          │
│ ├── /v1/rerank (reranking)              │
│ ├── /healthz (liveness)                 │
│ ├── /readyz (readiness)                 │
│ └── /admin/* (management)               │
│                                         │
│ MLX Inference Engine                    │
│ ├── Model loader (MLX native + GGUF)    │
│ ├── Streaming tokenizer                 │
│ ├── VRAM monitor                        │
│ └── OOM guardian                        │
│                                         │
│ Apple Silicon Hardware                  │
│ └── M3 Ultra / 192GB unified memory     │
└─────────────────────────────────────────┘
```

Drop-In Replacement
LMX intentionally listens on port 1234 — the same default port as LM Studio. This makes it a seamless replacement for any workflow that already targets a local OpenAI-compatible server. Point your tools at http://192.168.188.11:1234/v1 and they will work without configuration changes.
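For tools built on the official OpenAI SDKs, the redirect can often be done without touching code at all, since those SDKs honor the standard environment variables (assuming your tool reads OPENAI_BASE_URL):

```shell
# Point any OpenAI-SDK-based tool at LMX (address from this page's examples)
export OPENAI_BASE_URL="http://192.168.188.11:1234/v1"
export OPENAI_API_KEY="not-needed"   # accepted but ignored by LMX
```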
The api_key parameter is accepted but ignored; LMX does not enforce authentication on the LAN. This simplifies setup for local-only deployments.

Quick Start
Verify that LMX is running and a model is loaded:
```shell
curl http://192.168.188.11:1234/healthz
# {"status":"ok"}

curl http://192.168.188.11:1234/readyz
# {"ready":true,"model":"qwen3-30b-a3b"}
```

Send a test completion:

```shell
curl http://192.168.188.11:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-30b-a3b","messages":[{"role":"user","content":"Hello"}]}'
```

See the Setup page for full installation and configuration instructions.
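Scripts that launch work against LMX often need to block until /readyz reports a loaded model, for example right after a reboot while launchd is still bringing the service up. A polling sketch (the helper name, retry count, and delay are assumptions, not part of LMX):

```python
import json
import time
from urllib.request import urlopen

def wait_until_ready(base_url: str, fetch=None,
                     attempts: int = 30, delay: float = 1.0) -> str:
    """Poll /readyz until a model is loaded; return the model name.

    `fetch` is injectable for testing; by default it performs a real HTTP GET.
    """
    fetch = fetch or (lambda url: urlopen(url, timeout=2).read().decode())
    for _ in range(attempts):
        try:
            body = json.loads(fetch(base_url + "/readyz"))
            if body.get("ready"):
                return body["model"]
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(delay)
    raise TimeoutError("LMX did not become ready in time")
```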