API Reference

LMX exposes an OpenAI-compatible API for inference alongside admin endpoints for model management and monitoring. All endpoints are unauthenticated on LAN.

Base URL

http://192.168.188.11:1234

When running locally on the Mac Studio, use http://localhost:1234. From other devices on the LAN, use the Mac Studio's IP address.

No API key required
LMX does not enforce API key authentication on LAN. The api_key field in OpenAI client libraries can be set to any non-empty string.

Inference Endpoints

Chat Completions

POST /v1/chat/completions

Generate a chat completion. Supports both streaming (SSE) and non-streaming modes. Follows the OpenAI Chat Completions API spec.

Parameters

model (string, required): Model identifier.
messages (Message[], required): Array of chat messages with role and content.
stream (boolean): Enable SSE streaming (default: false).
temperature (number): Sampling temperature, 0 to 2 (default: 0.7).
max_tokens (number): Maximum tokens to generate (default: 4096).
top_p (number): Nucleus sampling threshold (default: 1.0).
stop (string[]): Stop sequences.

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen3-30b-a3b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
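A minimal non-streaming client can be sketched in Node 18+, where fetch is built in. The base URL and model id below are the examples used throughout this document; extractReply and chat are hypothetical helper names, not part of LMX.

```javascript
const BASE_URL = "http://192.168.188.11:1234";

// Pull the assistant text out of a chat.completion response body.
function extractReply(body) {
  return body.choices[0].message.content;
}

// POST a non-streaming chat completion and return the reply text.
async function chat(messages, model = "qwen3-30b-a3b") {
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return extractReply(await res.json());
}

// Usage: const reply = await chat([{ role: "user", content: "Hello" }]);
```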

Streaming (SSE)

Set stream: true to receive Server-Sent Events. Each event is a JSON chunk following the OpenAI streaming format:

SSE response
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
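Reassembling the streamed deltas on the client takes only a few lines. The sketch below (accumulateSSE is a hypothetical helper) walks the data: lines shown above, concatenating delta.content until the [DONE] sentinel:

```javascript
// Accumulate the assistant message from an array of SSE "data:" lines.
function accumulateSSE(lines) {
  let text = "";
  for (const line of lines) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    const delta = chunk.choices[0].delta;
    if (delta.content) text += delta.content;
  }
  return text;
}
```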

WebSocket Streaming

WS /v1/chat/stream

Alternative streaming endpoint using WebSocket. Send the same request body as /v1/chat/completions. The server sends individual token messages followed by a final stats message.

ws-stream.js
const ws = new WebSocket("ws://192.168.188.11:1234/v1/chat/stream");

ws.onopen = () => {
  ws.send(JSON.stringify({
    model: "qwen3-30b-a3b",
    messages: [{ role: "user", content: "Hello" }]
  }));
};

ws.onmessage = (e) => {
  const data = JSON.parse(e.data);
  if (data.token) {
    process.stdout.write(data.token);
  } else if (data.done) {
    console.log("\nTokens/s:", data.stats.tokens_per_second);
    ws.close();
  }
};

Embeddings

POST /v1/embeddings

Generate embeddings for the given input text. Uses the currently loaded embedding model.

Parameters

model (string, required): Embedding model identifier.
input (string | string[], required): Text to embed (single string or array).

Response

{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0012, -0.0034, 0.0056, ...]
  }],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 5, "total_tokens": 5 }
}
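A common next step is comparing two returned vectors. This is a generic cosine-similarity sketch, not an LMX API; it assumes both vectors come from the same embedding model and have equal length.

```javascript
// Cosine similarity between two embedding vectors from /v1/embeddings.
// Returns a value in [-1, 1]; higher means more semantically similar.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```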

Reranking

POST /v1/rerank

Rerank a list of documents against a query. Returns documents sorted by relevance score.

Parameters

model (string, required): Reranker model identifier.
query (string, required): The query to rank against.
documents (string[], required): Documents to rerank.
top_n (number): Return only the top N results.

Response

{
  "results": [
    { "index": 2, "relevance_score": 0.95, "document": "..." },
    { "index": 0, "relevance_score": 0.72, "document": "..." },
    { "index": 1, "relevance_score": 0.31, "document": "..." }
  ]
}
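Each result's index refers to the position in the original documents array, so truncation can also be done client-side. The sketch below (topDocuments is a hypothetical helper) sorts by relevance_score and maps the indices back:

```javascript
// Map rerank results back to the original document strings, keeping the
// top N by relevance_score. `results` is the "results" array returned by
// /v1/rerank; `documents` is the array sent in the request.
function topDocuments(results, documents, topN = results.length) {
  return results
    .slice() // avoid mutating the caller's array
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .slice(0, topN)
    .map((r) => ({ text: documents[r.index], score: r.relevance_score }));
}
```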

Health Endpoints

GET /healthz

Liveness probe. Returns 200 if the LMX process is running, regardless of whether a model is loaded.

Response

{"status":"ok"}

GET /readyz

Readiness probe. Returns 200 only if a model is loaded and ready for inference. Returns 503 if no model is loaded.

{"ready":true,"model":"qwen3-30b-a3b","vram_gb":18.4}
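Since model loading takes several seconds, clients that start alongside LMX can poll /readyz before sending inference traffic. A possible sketch (waitUntilReady is a hypothetical helper; the attempt count and delay are arbitrary):

```javascript
// Poll /readyz until the server reports a loaded model, or give up.
// Resolves with the readiness body ({ ready, model, vram_gb }).
async function waitUntilReady(baseUrl, { attempts = 30, delayMs = 1000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(`${baseUrl}/readyz`);
      if (res.ok) return await res.json(); // 200 => model loaded
      // 503 => process alive but no model loaded yet; keep polling
    } catch {
      // connection refused: server not up yet
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("LMX not ready");
}
```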

Admin Endpoints

List Models

GET /admin/models

Returns all available models on disk, with their load status and size information.

Response

{
  "models": [
    {
      "id": "qwen3-30b-a3b",
      "path": "mlx-community/Qwen3-30B-A3B-4bit",
      "loaded": true,
      "vram_gb": 18.4,
      "format": "mlx"
    },
    {
      "id": "llama-3.3-70b",
      "path": "mlx-community/Llama-3.3-70B-Instruct-4bit",
      "loaded": false,
      "vram_gb": null,
      "format": "mlx"
    }
  ]
}

Load Model

POST /admin/models/load

Loads a model into memory. If another model is currently loaded, it is unloaded first. Returns the load time and VRAM usage.

Parameters

model (string, required): Model path or HuggingFace identifier.

Response

{
  "success": true,
  "model": "llama-3.3-70b",
  "load_time_ms": 4200,
  "vram_gb": 38.7
}
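Swapping models from a script is a single POST. In this sketch, buildLoadRequest and loadModel are hypothetical helper names; the model id matches the example response above.

```javascript
// Build the fetch options for POST /admin/models/load.
function buildLoadRequest(model) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model }),
  };
}

// Load (or swap to) a model; any currently loaded model is unloaded first.
// Resolves with { success, model, load_time_ms, vram_gb }.
async function loadModel(baseUrl, model) {
  const res = await fetch(`${baseUrl}/admin/models/load`, buildLoadRequest(model));
  if (!res.ok) throw new Error(`load failed: HTTP ${res.status}`);
  return res.json();
}

// Usage: await loadModel("http://192.168.188.11:1234", "llama-3.3-70b");
```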

Unload Model

POST /admin/models/unload

Unloads the currently loaded model, freeing VRAM. After unloading, /readyz will return 503 until a new model is loaded.

Response

{
  "success": true,
  "freed_vram_gb": 18.4
}

Metrics Stream

GET /admin/events

Server-Sent Events stream of real-time metrics. Emits throughput, VRAM usage, and inference status events. Used by the Opta Local Web dashboard for live monitoring.

SSE metrics
data: {"type":"throughput","tokens_per_second":45.2,"active_requests":1}

data: {"type":"memory","vram_used_gb":18.4,"vram_total_gb":192.0,"pct":9.6}

data: {"type":"heartbeat","uptime":3600}

Error Codes

LMX returns standard HTTP status codes with a JSON error body:

{
  "error": {
    "code": "no-model-loaded",
    "message": "No model is currently loaded. Call POST /admin/models/load first."
  }
}
no-model-loaded (503): No model in memory. Load one first.
model-not-found (404): Requested model not found on disk.
storage-full (507): Insufficient disk space for model download.
lmx-timeout (504): Inference timed out (model took too long).
oom-unloaded (503): Model was unloaded due to OOM pressure.
invalid-request (400): Malformed request body or missing fields.
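A client can treat some of these codes as transient. The sketch below is one possible policy, not prescribed by LMX (isRetryable is a hypothetical helper): the 503/504 conditions can succeed once a model is loaded or reloaded, while the remaining codes mean the request or environment itself must change.

```javascript
// Error codes that can clear up on their own (e.g. after a model reload).
const RETRYABLE = new Set(["no-model-loaded", "oom-unloaded", "lmx-timeout"]);

// Given a parsed LMX error body, decide whether retrying makes sense.
function isRetryable(errorBody) {
  return RETRYABLE.has(errorBody.error.code);
}
```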