# API Reference

LMX exposes an OpenAI-compatible API for inference, alongside admin endpoints for model management and monitoring. All endpoints are unauthenticated; access control relies on keeping the server on a trusted LAN.
## Base URL

When running locally on the dedicated Apple Silicon host, use `http://localhost:1234`. From other devices on the LAN, substitute the host's IP address for `localhost`.
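A minimal sketch of resolving the base URL in a client. The `LMX_BASE_URL` environment variable is a hypothetical convention for this example, not an LMX feature:

```python
import os

# Default to the local listener; override with LMX_BASE_URL when calling
# from another device on the LAN (the variable name is this example's
# convention, not something LMX itself reads).
BASE_URL = os.environ.get("LMX_BASE_URL", "http://localhost:1234")

def endpoint(path: str) -> str:
    """Join the base URL with an endpoint path, normalizing slashes."""
    return BASE_URL.rstrip("/") + "/" + path.lstrip("/")
```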
## Inference Endpoints
### Chat Completions
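A sketch of a chat completion call using only the standard library. The `/v1/chat/completions` path follows the OpenAI convention the document names; whether `model` may be omitted when a single model is loaded is an assumption:

```python
import json
import urllib.request

def build_chat_request(messages, model=None, stream=False, **params):
    """Build an OpenAI-style chat completion payload.

    Omitting `model` and letting the server route to the currently
    loaded model is an assumption; pass it explicitly when in doubt.
    """
    body = {"messages": messages, "stream": stream, **params}
    if model is not None:
        body["model"] = model
    return body

def chat(base_url, payload, timeout=120):
    """POST the payload to the OpenAI-compatible chat endpoint."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

payload = build_chat_request(
    [{"role": "user", "content": "Hello"}],
    model="my-model",  # placeholder name
    temperature=0.7,
)
# resp = chat("http://localhost:1234", payload)
# print(resp["choices"][0]["message"]["content"])
```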
#### Streaming (SSE)

Set `"stream": true` in the request body to receive Server-Sent Events. Each event is a JSON chunk following the OpenAI streaming format:
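A sketch of consuming that stream: each wire line is `data: {...}` and the stream terminates with `data: [DONE]`, per the OpenAI streaming format. The sample chunk values are illustrative:

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative sample of the wire format:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
text = "".join(iter_sse_deltas(sample))  # "Hello"
```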
### WebSocket Streaming
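A sketch of deriving the WebSocket URL from the HTTP base URL. The `/v1/ws` path is purely an assumption for illustration; check your LMX build for the actual route:

```python
def ws_url(base_url: str, path: str = "/v1/ws") -> str:
    """Derive a WebSocket URL from the HTTP base URL.

    The default `/v1/ws` path is an assumption, not a documented route.
    """
    scheme = "wss" if base_url.startswith("https") else "ws"
    host = base_url.split("://", 1)[1].rstrip("/")
    return f"{scheme}://{host}{path}"

# With a WebSocket client library, send the same JSON body used for
# HTTP chat completions, one message per request, e.g.:
# async with websockets.connect(ws_url("http://localhost:1234")) as ws:
#     await ws.send(json.dumps(payload))
#     async for message in ws:
#         ...  # each message is a streaming chunk
```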
### Embeddings
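A sketch of the OpenAI-compatible embeddings shape (`/v1/embeddings`): the request takes one or more inputs, and the response carries one vector per input, in order. The sample values are illustrative:

```python
def build_embeddings_request(inputs, model=None):
    """Payload for an OpenAI-style /v1/embeddings request."""
    body = {"input": inputs}
    if model is not None:
        body["model"] = model
    return body

# Illustrative response: one embedding per input, index-aligned.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
        {"object": "embedding", "index": 1, "embedding": [0.03, 0.04]},
    ],
}
vectors = [item["embedding"] for item in sample_response["data"]]
```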
### Reranking
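Reranking has no OpenAI-standard shape, so this sketch follows the common Cohere/Jina-style convention; the path, field names, and response shape are all assumptions to verify against your LMX build:

```python
def build_rerank_request(query, documents, top_n=None, model=None):
    """Payload in the common Cohere/Jina rerank shape (an assumption)."""
    body = {"query": query, "documents": documents}
    if top_n is not None:
        body["top_n"] = top_n
    if model is not None:
        body["model"] = model
    return body

# Illustrative response: (index, relevance_score) pairs, best first.
sample_response = {
    "results": [
        {"index": 2, "relevance_score": 0.91},
        {"index": 0, "relevance_score": 0.34},
    ]
}
ranked = [(r["index"], r["relevance_score"]) for r in sample_response["results"]]
```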
## Health Endpoints
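A sketch of polling a health endpoint. The `/health` path and the response fields (`status`, `model_loaded`) are assumptions; a health check should be cheap enough for a load balancer or watchdog to poll every few seconds:

```python
import json
import urllib.request

def get_health(base_url, path="/health", timeout=5):
    """GET a health endpoint (path and response shape are assumptions)."""
    with urllib.request.urlopen(base_url.rstrip("/") + path,
                                timeout=timeout) as resp:
        return json.load(resp)

def is_ready(health: dict) -> bool:
    """Ready = server up AND a model loaded (assumed response fields)."""
    return health.get("status") == "ok" and bool(health.get("model_loaded"))

# health = get_health("http://localhost:1234")
# if not is_ready(health): ...  # e.g. trigger a model load
```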
## Admin Endpoints
### List Models
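A sketch of parsing a model list in the OpenAI `GET /v1/models` shape; whether LMX's admin listing adds extra fields is unknown, and the model names below are illustrative:

```python
def model_ids(list_response: dict) -> list:
    """Extract model ids from an OpenAI-style model list response."""
    return [m["id"] for m in list_response.get("data", [])]

# Illustrative response (model names are placeholders):
sample = {
    "object": "list",
    "data": [
        {"id": "llama-3.1-8b", "object": "model"},
        {"id": "qwen2.5-7b", "object": "model"},
    ],
}
ids = model_ids(sample)
```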
### Load Model
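A sketch of issuing a load request. The `/admin/models/load` path and the `{"model": ...}` body are assumptions; the generous timeout reflects that loading a large model can take minutes:

```python
import json
import urllib.request

def build_load_request(model_id: str) -> dict:
    """Body for a load request (the field name is an assumption)."""
    return {"model": model_id}

def load_model(base_url: str, model_id: str, timeout: float = 600):
    """POST to the admin load endpoint (path is an assumption).

    Long timeout: large models can take minutes to load into memory.
    """
    req = urllib.request.Request(
        base_url.rstrip("/") + "/admin/models/load",
        data=json.dumps(build_load_request(model_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A failed load surfaces through the error codes below, e.g. `model-not-found` (404) or `storage-full` (507).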
### Unload Model
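A sketch of the unload call, split so the request can be built without sending it. The `/admin/models/unload` path and body shape are assumptions; after unloading, inference requests fail with `no-model-loaded` (503) until another model is loaded:

```python
import json
import urllib.request

def build_unload(base_url: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) the unload request; path/body are assumptions."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/admin/models/unload",
        data=json.dumps({"model": model_id}).encode(),
        headers={"Content-Type": "application/json"},
    )

def unload_model(base_url: str, model_id: str, timeout: float = 60):
    """POST the unload request and return the parsed response."""
    with urllib.request.urlopen(build_unload(base_url, model_id),
                                timeout=timeout) as resp:
        return json.load(resp)
```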
### Metrics Stream
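A sketch of consuming a metrics stream, assuming it is delivered as SSE like the inference stream; the endpoint path and the field names (`tokens_per_second`, `memory_used_gb`) are assumptions for illustration:

```python
import json

def iter_metrics(lines):
    """Parse SSE `data:` lines from a metrics stream into dicts.

    Field names in the sample below are assumptions.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Illustrative wire sample (blank lines separate SSE events):
sample = [
    'data: {"tokens_per_second": 42.5, "memory_used_gb": 18.2}',
    '',
    'data: {"tokens_per_second": 41.9, "memory_used_gb": 18.2}',
]
readings = list(iter_metrics(sample))
```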
## Error Codes
LMX returns standard HTTP status codes with a JSON error body:
| Code | HTTP | Description |
|---|---|---|
| `no-model-loaded` | 503 | No model in memory. Load one first. |
| `model-not-found` | 404 | Requested model not found on disk. |
| `storage-full` | 507 | Insufficient disk space for model download. |
| `lmx-timeout` | 504 | Inference timed out (model took too long). |
| `oom-unloaded` | 503 | Model was unloaded due to OOM pressure. |
| `invalid-request` | 400 | Malformed request body or missing fields. |
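A sketch of client-side error handling against this table. The exact JSON error body shape (`{"error": {"code": ..., "message": ...}}`) is an assumption; which codes are worth retrying is a judgment call based on the descriptions above:

```python
import json
import urllib.error

# Codes from the table above that describe transient conditions:
# retry after loading a model or backing off.
RETRYABLE = {"no-model-loaded", "oom-unloaded", "lmx-timeout"}

def parse_error(body_bytes: bytes):
    """Extract (code, message) from a JSON error body (assumed shape)."""
    err = json.loads(body_bytes)["error"]
    return err.get("code"), err.get("message")

def should_retry(code: str) -> bool:
    """True for transient conditions; invalid-request etc. never succeed."""
    return code in RETRYABLE

# Typical use with urllib:
# try:
#     resp = chat(base_url, payload)
# except urllib.error.HTTPError as e:
#     code, msg = parse_error(e.read())
#     if should_retry(code): ...

sample = b'{"error": {"code": "no-model-loaded", "message": "Load a model first."}}'
code, msg = parse_error(sample)
```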