LMX Overview

Opta LMX is an MLX-native inference server built for Apple Silicon. It delivers OpenAI-compatible APIs for local model inference, running entirely in your Mac's unified memory with no cloud dependency.

What Is LMX

LMX (Local Model eXecution) is a headless Python daemon that loads large language models into Apple Silicon unified memory using the MLX framework. It exposes an OpenAI-compatible HTTP API on port 1234, making it a drop-in replacement for LM Studio, Ollama, or any other local inference server.

LMX is designed to run 24/7 on a dedicated Apple Silicon machine (recommended: Mac Studio with M3 Ultra and 192GB unified memory). The Opta CLI daemon connects to LMX over LAN to perform inference.

Why MLX Native

Most local inference tools use GGUF quantized models through llama.cpp. LMX uses Apple's MLX framework directly, which provides significant advantages on Apple Silicon:

  • 15-30% faster inference compared to GGUF on the same hardware, because MLX targets the Apple GPU natively through Metal without translation layers.
  • Unified memory efficiency — MLX models sit directly in unified memory shared by CPU and GPU, avoiding data copies.
  • Native Metal acceleration — All matrix operations run on Metal shaders optimized for Apple GPU architectures.
  • GGUF fallback — LMX can still load GGUF models when an MLX-native variant is not available, so you are never locked out of a model.

OpenAI Compatibility

LMX implements the OpenAI Chat Completions API spec. Any tool that supports a custom OpenAI base URL works with LMX out of the box:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.188.11:1234/v1",
    api_key="not-needed",  # LMX does not require API keys on LAN
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:  # final chunk carries no content
        print(chunk.choices[0].delta.content, end="")
```

Core Capabilities

LMX provides 12 non-negotiable capabilities that define its operational contract:

  • Streaming inference — Server-Sent Events (SSE) for real-time token streaming
  • WebSocket streaming — Alternative low-latency streaming via WebSocket
  • Model hot-swap — Load and unload models without restarting the server
  • Multi-model inventory — Scan and list all available models on disk
  • Health probes — Separate liveness (/healthz) and readiness (/readyz) endpoints
  • Admin API — Model management, metrics, and configuration endpoints
  • SSE metrics stream — Real-time throughput and VRAM telemetry via /admin/events
  • Embedding generation — /v1/embeddings for RAG and semantic search
  • Reranking — /v1/rerank for search result reranking
  • Agent skills — Registered tool functions callable by models
  • Graceful degradation — Auto-unload on OOM instead of crashing
  • Launchd integration — Runs as a macOS service with auto-restart
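Both token streaming and the /admin/events metrics stream use Server-Sent Events. The framing below is the standard SSE wire format with the OpenAI-style `data: [DONE]` terminator, not LMX-internal code; a minimal parser sketch:

```python
import json

def iter_sse_events(lines):
    """Yield the decoded data payload of each SSE event from an iterable of lines.

    Events are separated by blank lines; each `data:` line carries one chunk
    of the payload. OpenAI-style streams terminate with `data: [DONE]`.
    """
    data_parts = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            data_parts.append(line[5:].strip())
        elif line == "" and data_parts:
            payload = "\n".join(data_parts)
            data_parts = []
            if payload == "[DONE]":
                return
            yield json.loads(payload)

# Example: extract streamed tokens from raw SSE lines
raw = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
    "",
]
tokens = [e["choices"][0]["delta"].get("content", "") for e in iter_sse_events(raw)]
print("".join(tokens))  # → Hello
```

The same loop works for metrics events from /admin/events, since SSE framing is identical regardless of the payload schema.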

Never-Crash Guarantee

OOM protection
LMX must never crash on out-of-memory conditions. When VRAM is exhausted, LMX automatically unloads the current model and returns a 503 Service Unavailable with a no-model-loaded error code. The model can be reloaded once memory is available.
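A client can cooperate with this behavior by treating a 503 carrying the no-model-loaded code as "retry later" rather than a hard failure. A minimal sketch, assuming the code appears under an `error.code` field in the JSON body (that layout is an assumption, not documented here):

```python
import time

def call_with_retry(send_request, retries=3, backoff_s=2.0, sleep=time.sleep):
    """Call send_request(); on a 503 no-model-loaded response, wait and retry.

    send_request must return (status_code, body_dict). The error-code
    location in the body is a hypothetical layout.
    """
    for attempt in range(retries + 1):
        status, body = send_request()
        if status != 503:
            return status, body  # success or a non-retryable error
        code = body.get("error", {}).get("code")
        if code != "no-model-loaded" or attempt == retries:
            return status, body
        sleep(backoff_s * (attempt + 1))  # linear backoff while the model reloads

# Example with a stubbed request sequence (no server needed)
responses = iter([
    (503, {"error": {"code": "no-model-loaded"}}),
    (200, {"choices": [{"message": {"content": "Hello"}}]}),
])
status, body = call_with_retry(lambda: next(responses), sleep=lambda _: None)
print(status)  # → 200
```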

Architecture

LMX is a FastAPI application with an MLX inference backend. It runs on the Mac Studio and listens on all interfaces (0.0.0.0:1234) to accept connections from any device on the LAN.

```text
┌─────────────────────────────────────────┐
│  Opta LMX (192.168.188.11:1234)         │
│                                          │
│  FastAPI Server                          │
│  ├── /v1/chat/completions  (inference)   │
│  ├── /v1/embeddings        (embedding)   │
│  ├── /v1/rerank            (reranking)   │
│  ├── /healthz              (liveness)    │
│  ├── /readyz               (readiness)   │
│  └── /admin/*              (management)  │
│                                          │
│  MLX Inference Engine                    │
│  ├── Model loader (MLX native + GGUF)    │
│  ├── Streaming tokenizer                 │
│  ├── VRAM monitor                        │
│  └── OOM guardian                        │
│                                          │
│  Apple Silicon Hardware                  │
│  └── M3 Ultra / 192GB unified memory     │
└─────────────────────────────────────────┘
```
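The VRAM monitor and OOM guardian amount to a watchdog over unified-memory headroom. Their internals are not documented here; purely as an illustration of the idea, with `headroom_fraction` as a hypothetical tuning knob rather than an LMX setting, a threshold check might look like:

```python
def should_unload(used_bytes, total_bytes, headroom_fraction=0.05):
    """Return True when free unified memory drops below the headroom threshold.

    headroom_fraction is a hypothetical tuning knob, not a documented
    LMX configuration option.
    """
    free = total_bytes - used_bytes
    return free < total_bytes * headroom_fraction

GiB = 1024 ** 3
# 185 GiB used of 192 GiB leaves ~7 GiB free, below the 5% (~9.6 GiB) headroom
print(should_unload(used_bytes=185 * GiB, total_bytes=192 * GiB))  # → True
```

When the check fires, the guardian's job is to unload the model and flip /readyz to not-ready, so clients see the 503 described above instead of a crashed process.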

Drop-In Replacement

LMX intentionally listens on port 1234 — the same default port as LM Studio. This makes it a seamless replacement for any workflow that already targets a local OpenAI-compatible server. Point your tools at http://192.168.188.11:1234/v1 and they will work without configuration changes.

No API key required
LMX does not require API keys for LAN connections. The api_key parameter is accepted but ignored. This simplifies setup for local-only deployments.

Quick Start

Verify that LMX is running and a model is loaded:

```shell
# Check LMX is alive
curl http://192.168.188.11:1234/healthz
# → {"status":"ok"}

# Check a model is loaded and ready
curl http://192.168.188.11:1234/readyz
# → {"ready":true,"model":"qwen3-30b-a3b"}
```
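In scripts it is often handy to block until /readyz reports a loaded model. A minimal polling sketch using only the standard library; the probe is injected as a callable so the logic can be shown (and tested) without a live server:

```python
import json
import time
import urllib.request

def fetch_readyz(base_url="http://192.168.188.11:1234"):
    """Fetch and decode the /readyz payload from a running LMX instance."""
    with urllib.request.urlopen(f"{base_url}/readyz") as resp:
        return json.loads(resp.read())

def wait_until_ready(probe=fetch_readyz, timeout_s=60.0, interval_s=2.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll the readiness probe until it reports ready or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            payload = probe()
        except OSError:
            payload = {"ready": False}  # server not reachable yet
        if payload.get("ready"):
            return payload
        sleep(interval_s)
    raise TimeoutError("LMX did not become ready in time")

# Example with a stub probe (no server needed)
states = iter([{"ready": False}, {"ready": True, "model": "qwen3-30b-a3b"}])
info = wait_until_ready(probe=lambda: next(states), sleep=lambda _: None)
print(info["model"])  # → qwen3-30b-a3b
```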

Send a test completion:

```shell
# Send a non-streaming chat completion
curl http://192.168.188.11:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-30b-a3b","messages":[{"role":"user","content":"Hello"}]}'
```

See the Setup page for full installation and configuration instructions.