API Reference
LMX exposes an OpenAI-compatible API for inference, alongside admin endpoints for model management and monitoring. All endpoints are served without authentication, so expose them only on a trusted LAN.
Base URL
`http://192.168.188.11:1234`

When running locally on the Mac Studio, use `http://localhost:1234`. From other devices on the LAN, use the Mac Studio's IP address.
The `api_key` field in OpenAI client libraries can be set to any non-empty string.
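For example, a minimal setup with the official `openai` JavaScript client (a sketch; `qwen3-30b-a3b` is one of the model ids that appears later in this reference):

```js
import OpenAI from "openai";

// Point the standard OpenAI client at LMX; any non-empty key works.
const client = new OpenAI({
  baseURL: "http://192.168.188.11:1234/v1",
  apiKey: "lmx",
});

const completion = await client.chat.completions.create({
  model: "qwen3-30b-a3b",
  messages: [{ role: "user", content: "Hello" }],
});
console.log(completion.choices[0].message.content);
```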
Inference Endpoints

Chat Completions
POST `/v1/chat/completions`

Generate a chat completion. Supports both streaming (SSE) and non-streaming modes. Follows the OpenAI Chat Completions API spec.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model identifier |
| `messages` | Message[] | yes | Array of chat messages with `role` and `content` |
| `stream` | boolean | no | Enable SSE streaming (default: `false`) |
| `temperature` | number | no | Sampling temperature, 0-2 (default: `0.7`) |
| `max_tokens` | number | no | Maximum tokens to generate (default: `4096`) |
| `top_p` | number | no | Nucleus sampling threshold (default: `1.0`) |
| `stop` | string[] | no | Stop sequences |
Response

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen3-30b-a3b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
```
Streaming (SSE)

Set `stream: true` to receive Server-Sent Events. Each event is a JSON chunk following the OpenAI streaming format:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```
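Most OpenAI client libraries handle the SSE framing for you. A minimal sketch with the `openai` JavaScript client, reusing the `client` instance from the base URL section:

```js
const stream = await client.chat.completions.create({
  model: "qwen3-30b-a3b",
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});

// Each chunk mirrors the chat.completion.chunk events shown above.
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```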
WebSocket Streaming

`/v1/chat/stream`

Alternative streaming endpoint using WebSocket. Send the same request body as `/v1/chat/completions`. The server sends individual token messages followed by a final stats message.
```js
const ws = new WebSocket("ws://192.168.188.11:1234/v1/chat/stream");

ws.onopen = () => {
  // Same request body as /v1/chat/completions.
  ws.send(JSON.stringify({
    model: "qwen3-30b-a3b",
    messages: [{ role: "user", content: "Hello" }]
  }));
};

ws.onmessage = (e) => {
  const data = JSON.parse(e.data);
  if (data.token) {
    // Per-token message: print incrementally.
    process.stdout.write(data.token);
  } else if (data.done) {
    // Final message carries throughput stats.
    console.log("\nTokens/s:", data.stats.tokens_per_second);
    ws.close();
  }
};
```
Embeddings

POST `/v1/embeddings`

Generate embeddings for the given input text. Uses the currently loaded embedding model.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Embedding model identifier |
| `input` | string \| string[] | yes | Text to embed (single string or array) |
Response

```json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0012, -0.0034, 0.0056, ...]
  }],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 5, "total_tokens": 5 }
}
```
Reranking

POST `/v1/rerank`

Rerank a list of documents against a query. Returns documents sorted by relevance score.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Reranker model identifier |
| `query` | string | yes | The query to rank against |
| `documents` | string[] | yes | Documents to rerank |
| `top_n` | number | no | Return only the top N results |
Response

```json
{
  "results": [
    { "index": 2, "relevance_score": 0.95, "document": "..." },
    { "index": 0, "relevance_score": 0.72, "document": "..." },
    { "index": 1, "relevance_score": 0.31, "document": "..." }
  ]
}
```
Health Endpoints

GET `/healthz`

Liveness probe. Returns 200 if the LMX process is running, regardless of whether a model is loaded.
Response
{"status":"ok"}/readyzReadiness probe. Returns 200 only if a model is loaded and ready for inference. Returns 503 if no model is loaded.
{"ready":true,"model":"qwen3-30b-a3b","vram_gb":18.4}Admin Endpoints
Admin Endpoints

List Models
GET `/admin/models`

Returns all available models on disk, with their load status and size information.
Response
```json
{
  "models": [
    {
      "id": "qwen3-30b-a3b",
      "path": "mlx-community/Qwen3-30B-A3B-4bit",
      "loaded": true,
      "vram_gb": 18.4,
      "format": "mlx"
    },
    {
      "id": "llama-3.3-70b",
      "path": "mlx-community/Llama-3.3-70B-Instruct-4bit",
      "loaded": false,
      "vram_gb": null,
      "format": "mlx"
    }
  ]
}
```
Load Model

POST `/admin/models/load`

Loads a model into memory. If another model is currently loaded, it is unloaded first. Returns the load time and VRAM usage.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model path or HuggingFace identifier |

Response
```json
{
  "success": true,
  "model": "llama-3.3-70b",
  "load_time_ms": 4200,
  "vram_gb": 38.7
}
```
Unload Model

POST `/admin/models/unload`

Unloads the currently loaded model, freeing VRAM. After unloading, `/readyz` returns 503 until a new model is loaded.
Response
```json
{
  "success": true,
  "freed_vram_gb": 18.4
}
```
Metrics Stream

GET `/admin/events`

Server-Sent Events stream of real-time metrics. Emits throughput, VRAM usage, and inference status events. Used by the Opta Local Web dashboard for live monitoring.

```
data: {"type":"throughput","tokens_per_second":45.2,"active_requests":1}
data: {"type":"memory","vram_used_gb":18.4,"vram_total_gb":192.0,"pct":9.6}
data: {"type":"heartbeat","uptime":3600}Error Codes
Error Codes

LMX returns standard HTTP status codes with a JSON error body:
```json
{
  "error": {
    "code": "no-model-loaded",
    "message": "No model is currently loaded. Call POST /admin/models/load first."
  }
}
```

| Code | HTTP | Description |
|---|---|---|
| `no-model-loaded` | 503 | No model in memory. Load one first. |
| `model-not-found` | 404 | Requested model not found on disk. |
| `storage-full` | 507 | Insufficient disk space for model download. |
| `lmx-timeout` | 504 | Inference timed out (model took too long). |
| `oom-unloaded` | 503 | Model was unloaded due to OOM pressure. |
| `invalid-request` | 400 | Malformed request body or missing fields. |
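A sketch of handling these codes on the client (endpoint and error fields as documented above):

```js
const res = await fetch("http://192.168.188.11:1234/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen3-30b-a3b",
    messages: [{ role: "user", content: "Hello" }],
  }),
});

if (!res.ok) {
  const { error } = await res.json();
  if (error.code === "no-model-loaded") {
    // Load a model via POST /admin/models/load, then retry.
  }
  throw new Error(`${res.status} ${error.code}: ${error.message}`);
}
```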