Monitoring
LMX provides health probes, a real-time metrics stream, and detailed error reporting to help you monitor inference performance and diagnose issues.
Health Probes
LMX exposes two health endpoints following the Kubernetes probe pattern. These are useful for monitoring tools, load balancers, and the Opta daemon's preflight check.
Liveness (/healthz)
`GET /healthz`

Returns 200 if the LMX process is running and the HTTP server is accepting connections. Does not check whether a model is loaded.
Response:

```json
{"status":"ok"}
```

Use this endpoint for basic process monitoring. If /healthz returns an error or times out, the LMX process has crashed and needs to be restarted.
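From a monitoring script, the same probe can be made with an explicit timeout. This is an illustrative sketch, not part of LMX; the host and port are the example values used throughout this guide:

```python
import urllib.error
import urllib.request

def lmx_alive(base_url: str = "http://192.168.188.11:1234", timeout: float = 2.0) -> bool:
    """Return True if /healthz answers 200 within the timeout.
    Any connection error, timeout, or non-200 status counts as down."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If this returns False, restart the LMX process (for example via launchd).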
```shell
curl -f http://192.168.188.11:1234/healthz && echo 'LMX alive' || echo 'LMX down'
```

Readiness (/readyz)
`GET /readyz`

Returns 200 only if a model is loaded and the server is ready to handle inference requests. Returns 503 if no model is loaded.

```json
{
  "ready": true,
  "model": "qwen3-30b-a3b",
  "vram_gb": 18.4
}
```

The Opta daemon checks /readyz before every inference request. The result is cached for 10 seconds to reduce overhead. If /readyz returns 503, the daemon reports an LMX_UNREACHABLE error to the client.

Real-Time Metrics
`GET /admin/events`

Server-Sent Events stream of real-time metrics. The stream emits throughput, memory, and heartbeat events continuously. Connect to this endpoint for live monitoring dashboards.
Connect with any SSE client to receive metrics in real time:
```shell
curl -N http://192.168.188.11:1234/admin/events
```

SSE Event Types
`throughput`

Emitted during active inference. Reports tokens per second and active request count.
```json
{
  "type": "throughput",
  "tokens_per_second": 45.2,
  "active_requests": 1,
  "model": "qwen3-30b-a3b"
}
```

Memory Monitoring
VRAM Usage
Monitor VRAM usage through the /admin/events SSE stream (memory events) or by querying the admin models endpoint:
```shell
curl http://192.168.188.11:1234/admin/models
```

```json
{
  "models": [{
    "id": "qwen3-30b-a3b",
    "loaded": true,
    "vram_gb": 18.4,
    "format": "mlx"
  }]
}
```

On a Mac Studio with 192GB unified memory, you can monitor system-level memory alongside LMX:
```shell
memory_pressure
```

OOM Alerts
When memory pressure reaches the configured thresholds, LMX emits special events via the SSE stream:
```json
{
  "type": "memory_warning",
  "vram_used_gb": 163.2,
  "vram_total_gb": 192.0,
  "pct": 85.0,
  "message": "Memory usage exceeds warning threshold"
}
```

```json
{
  "type": "memory_critical",
  "vram_used_gb": 172.8,
  "vram_total_gb": 192.0,
  "pct": 90.0,
  "message": "OOM threshold reached, unloading model",
  "action": "model_unloaded"
}
```

After a critical unload, subsequent inference requests receive a 503 response with the oom-unloaded error code.

Error Codes
These error codes appear in API responses and SSE events. Use them for automated monitoring and alerting:
| Code | Severity | Action |
|---|---|---|
| `no-model-loaded` | Warning | Load a model with POST /admin/models/load |
| `storage-full` | Error | Free disk space or remove cached models |
| `lmx-timeout` | Warning | Reduce context length or switch to a smaller model |
| `oom-unloaded` | Critical | Free memory and reload a model (possibly smaller) |
| `model-not-found` | Warning | Check model path or download the model |
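For automated alerting, the table maps naturally onto a severity lookup. The snippet below is an illustrative sketch; the paging policy is an assumption, not part of LMX:

```python
# Error codes and severities taken from the table above.
SEVERITY = {
    "no-model-loaded": "warning",
    "storage-full": "error",
    "lmx-timeout": "warning",
    "oom-unloaded": "critical",
    "model-not-found": "warning",
}

def should_page(code: str) -> bool:
    """Example policy: page on error/critical codes, just log warnings.
    Unknown codes are treated as warnings."""
    return SEVERITY.get(code, "warning") in {"error", "critical"}
```

A monitoring agent can call `should_page` on every error code it sees in API responses or SSE events.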
Log Files
LMX writes logs to stdout/stderr, which are captured by launchd when running as a service:

```shell
tail -f /tmp/opta-lmx.stdout.log
tail -f /tmp/opta-lmx.stderr.log
```

Key log patterns to watch for:
```
# Normal startup
INFO: LMX starting on 0.0.0.0:1234
INFO: Model loaded: qwen3-30b-a3b (18.4 GB, 2.3s)

# Inference activity
INFO: Completion request model=qwen3-30b-a3b tokens=342 speed=45.2tok/s

# Memory warning
WARNING: Memory usage at 85.0% (163.2/192.0 GB)

# OOM protection triggered
ERROR: Memory critical at 90.0% — unloading model
INFO: Model unloaded, freed 18.4 GB

# Client connection errors (usually harmless)
WARNING: SSE client disconnected: ConnectionResetError
```

Performance Benchmarks
Throughput Targets
Expected token generation speeds on a Mac Studio M3 Ultra (192GB) with MLX-native 4-bit quantized models:
| Model | Prompt (tok/s) | Generate (tok/s) | Time to First Token |
|---|---|---|---|
| Qwen3-30B-A3B (4-bit) | ~200 | ~65 | <200ms |
| Llama-3.3-70B (4-bit) | ~120 | ~25 | <500ms |
| DeepSeek-V3-0324 (4-bit) | ~40 | ~8 | <2s |
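These figures give a back-of-envelope latency estimate: total response time is roughly the time to first token plus output length divided by generation speed. A sketch using the approximate numbers from the table:

```python
def estimated_response_seconds(output_tokens: int, gen_tps: float, ttft_s: float) -> float:
    """Rough latency estimate: time to first token plus steady-state
    generation time. (TTFT itself grows with prompt length.)"""
    return ttft_s + output_tokens / gen_tps

# Llama-3.3-70B (4-bit): ~25 tok/s generation, <500ms to first token,
# so a 500-token reply takes on the order of 20 seconds.
t = estimated_response_seconds(500, 25.0, 0.5)  # 20.5
```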
Monitoring from Local Web
The Opta Local Web dashboard at http://localhost:3004 provides a visual interface for monitoring LMX. It connects to the /admin/events SSE stream and displays:
- VRAM gauge — Real-time memory usage ring with percentage
- Throughput graph — Tokens per second over time (300-sample circular buffer)
- Model list — Currently loaded model and available models on disk
- Health indicator — Connection status badge with heartbeat monitoring
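The dashboard's event handling can be approximated in a few lines: read the /admin/events stream, decode each `data:` payload, and fan out by event type. The class below is a hypothetical sketch, not the dashboard's actual source; it assumes each event's JSON arrives on a single `data:` line, and its 300-sample deque mirrors the throughput graph's circular buffer:

```python
import json
from collections import deque

class LMXMonitor:
    """Minimal consumer for the /admin/events SSE stream."""

    def __init__(self, samples: int = 300):
        self.throughput = deque(maxlen=samples)  # rolling tok/s buffer
        self.last_memory_alert = None            # most recent warning/critical event

    def feed_line(self, line: str) -> None:
        """Feed one raw line from the SSE stream."""
        line = line.strip()
        if not line.startswith("data:"):
            return  # ignore comments, event-name lines, and blank keep-alives
        event = json.loads(line[len("data:"):])
        if event.get("type") == "throughput":
            self.throughput.append(event["tokens_per_second"])
        elif event.get("type") in ("memory_warning", "memory_critical"):
            self.last_memory_alert = event

    def avg_tokens_per_second(self) -> float:
        return sum(self.throughput) / len(self.throughput) if self.throughput else 0.0

mon = LMXMonitor()
mon.feed_line('data: {"type": "throughput", "tokens_per_second": 45.2}')
mon.feed_line('data: {"type": "memory_warning", "pct": 85.0}')
```

Feeding it the live stream is then just a matter of iterating over the HTTP response line by line.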
See the Local Web documentation for setup and usage details.