Monitoring

LMX provides health probes, a real-time metrics stream, and detailed error reporting to help you monitor inference performance and diagnose issues.

Health Probes

LMX exposes two health endpoints following the Kubernetes probe pattern. These are useful for monitoring tools, load balancers, and the Opta daemon's preflight check.

Liveness (/healthz)

GET /healthz

Returns 200 if the LMX process is running and the HTTP server is accepting connections. Does not check whether a model is loaded.

Response

{"status":"ok"}

Use this endpoint for basic process monitoring. If /healthz returns an error or times out, the LMX process is down or unresponsive and should be restarted.

Quick health check script
curl -f http://192.168.188.11:1234/healthz && echo 'LMX alive' || echo 'LMX down'

Readiness (/readyz)

GET /readyz

Returns 200 only if a model is loaded and the server is ready to handle inference requests. Returns 503 if no model is loaded.

Response
{
  "ready": true,
  "model": "qwen3-30b-a3b",
  "vram_gb": 18.4
}
Daemon preflight
The Opta daemon calls /readyz before every inference request. The result is cached for 10 seconds to reduce overhead. If /readyz returns 503, the daemon reports an LMX_UNREACHABLE error to the client.
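Deploy and restart scripts can gate on the same probe. A minimal sketch that polls /readyz until a model is loaded or a timeout expires (the host, port, and timing values are examples, not defaults):

```shell
# wait_ready: poll /readyz until LMX reports a loaded model, or give up.
# Exits 0 once ready, 1 on timeout. Poll interval and timeout are examples.
wait_ready() {
  url="$1"
  timeout="${2:-60}"   # seconds to keep polling
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -w prints the HTTP status; a 503 (no model loaded) keeps us polling.
    code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$url/readyz")
    [ "$code" = "200" ] && return 0
    sleep 2
    waited=$((waited + 2))
  done
  return 1
}

# Live usage:
#   wait_ready http://192.168.188.11:1234 30 && echo ready || echo "not ready"
```

Because /readyz returns 503 while a model is still loading, this loop doubles as a "wait for model load" step after POST /admin/models/load.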

Real-Time Metrics

GET /admin/events

Server-Sent Events stream of real-time metrics. The stream emits throughput, memory, and heartbeat events continuously. Connect to this endpoint for live monitoring dashboards.

Connect with any SSE client to receive metrics in real time:

Stream metrics (press Ctrl+C to stop)
curl -N http://192.168.188.11:1234/admin/events

SSE Event Types

throughput
Emitted during active inference. Reports tokens per second and the active request count.

{
  "type": "throughput",
  "tokens_per_second": 45.2,
  "active_requests": 1,
  "model": "qwen3-30b-a3b"
}
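A monitoring script can watch the stream for degraded throughput. A rough sketch that flags throughput events below a floor; the threshold is an example value, and the script assumes each event arrives as a single `data:` line containing one JSON object, as in the example above:

```shell
# slow_tokens: read SSE lines on stdin and flag throughput events whose
# tokens_per_second falls below a floor (tok/s). Floor is an example value.
slow_tokens() {
  floor="${1:-20}"
  grep --line-buffered '"type": *"throughput"' |
  while IFS= read -r line; do
    # Crude field extraction; a real deployment would use jq.
    tps=$(printf '%s\n' "$line" | sed -n 's/.*"tokens_per_second": *\([0-9.]*\).*/\1/p')
    [ -n "$tps" ] || continue
    if awk -v t="$tps" -v f="$floor" 'BEGIN { exit !(t + 0 < f + 0) }'; then
      echo "SLOW: ${tps} tok/s"
    fi
  done
}

# Live usage:
#   curl -N http://192.168.188.11:1234/admin/events | slow_tokens 20
```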

Memory Monitoring

VRAM Usage

Monitor VRAM usage through the /admin/events SSE stream (memory events) or by querying the admin models endpoint:

Query loaded models
curl http://192.168.188.11:1234/admin/models
{
  "models": [{
    "id": "qwen3-30b-a3b",
    "loaded": true,
    "vram_gb": 18.4,
    "format": "mlx"
  }]
}
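For scripting, the vram_gb field can be pulled out of that response. With jq installed you would select `.models[].vram_gb`; a portable sed-based sketch (field name taken from the response above) works without it:

```shell
# vram_of_loaded: print each vram_gb value from an /admin/models response
# read on stdin. A rough sed sketch; prefer jq in a real deployment.
vram_of_loaded() {
  sed -n 's/.*"vram_gb": *\([0-9.]*\).*/\1/p'
}

# Live usage:
#   curl -s http://192.168.188.11:1234/admin/models | vram_of_loaded
```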

On a Mac Studio with 192GB unified memory, you can monitor system-level memory alongside LMX:

macOS system memory pressure report
memory_pressure

OOM Alerts

When memory pressure reaches the configured thresholds, LMX emits special events via the SSE stream:

Warning event (85% threshold)
{
  "type": "memory_warning",
  "vram_used_gb": 163.2,
  "vram_total_gb": 192.0,
  "pct": 85.0,
  "message": "Memory usage exceeds warning threshold"
}
Critical event (90% threshold)
{
  "type": "memory_critical",
  "vram_used_gb": 172.8,
  "vram_total_gb": 192.0,
  "pct": 90.0,
  "message": "OOM threshold reached, unloading model",
  "action": "model_unloaded"
}
Automatic unload
When the critical threshold is hit, LMX unloads the model automatically to prevent a crash. All in-flight requests receive a 503 response with the oom-unloaded error code.
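These events make shell-level alerting straightforward. A minimal sketch that surfaces only the memory events from the SSE stream; how the alert is delivered (pager, Slack, etc.) is deployment-specific and left as a comment:

```shell
# alert_on_memory: read SSE lines on stdin and surface memory alert events.
alert_on_memory() {
  while IFS= read -r line; do
    case "$line" in
      *'"type": "memory_warning"'*)  echo "WARN: $line" ;;
      *'"type": "memory_critical"'*) echo "CRIT: $line" ;;  # page an operator here
    esac
  done
}

# Live usage:
#   curl -N http://192.168.188.11:1234/admin/events | alert_on_memory
```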

Error Codes

These error codes appear in API responses and SSE events. Use them for automated monitoring and alerting:

Code              Severity   Action
----------------  ---------  --------------------------------------------------
no-model-loaded   Warning    Load a model with POST /admin/models/load
storage-full      Error      Free disk space or remove cached models
lmx-timeout       Warning    Reduce context length or switch to a smaller model
oom-unloaded      Critical   Free memory and reload a model (possibly smaller)
model-not-found   Warning    Check model path or download the model
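For alert routing, the table maps directly onto a small lookup. A sketch using only the codes and severities listed above:

```shell
# severity_of: map an LMX error code (from the table above) to its severity,
# e.g. to pick an alert channel. Unknown codes fall through to "unknown".
severity_of() {
  case "$1" in
    oom-unloaded)                                echo critical ;;
    storage-full)                                echo error ;;
    no-model-loaded|lmx-timeout|model-not-found) echo warning ;;
    *)                                           echo unknown ;;
  esac
}
```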

Log Files

LMX writes logs to stdout/stderr, which are captured by launchd when running as a service:

Follow LMX stdout logs
tail -f /tmp/opta-lmx.stdout.log
Follow LMX error logs
tail -f /tmp/opta-lmx.stderr.log

Key log patterns to watch for:

# Normal startup
INFO:     LMX starting on 0.0.0.0:1234
INFO:     Model loaded: qwen3-30b-a3b (18.4 GB, 2.3s)

# Inference activity
INFO:     Completion request model=qwen3-30b-a3b tokens=342 speed=45.2tok/s

# Memory warning
WARNING:  Memory usage at 85.0% (163.2/192.0 GB)

# OOM protection triggered
ERROR:    Memory critical at 90.0% — unloading model
INFO:     Model unloaded, freed 18.4 GB

# Client connection errors (usually harmless)
WARNING:  SSE client disconnected: ConnectionResetError
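The patterns above can drive a simple filter that keeps actionable lines and drops the harmless disconnect noise; a sketch:

```shell
# actionable: keep WARNING/ERROR log lines, dropping the harmless SSE
# client-disconnect noise shown above.
actionable() {
  grep -E 'WARNING|ERROR' | grep -v 'SSE client disconnected'
}

# Live usage:
#   tail -f /tmp/opta-lmx.stderr.log | actionable
```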

Performance Benchmarks

Throughput Targets

Expected token generation speeds on a Mac Studio M3 Ultra (192GB) with MLX-native 4-bit quantized models:

Model                      Prompt (tok/s)   Generate (tok/s)   Time to First Token
-------------------------  ---------------  -----------------  -------------------
Qwen3-30B-A3B (4-bit)      ~200             ~65                <200ms
Llama-3.3-70B (4-bit)      ~120             ~25                <500ms
DeepSeek-V3-0324 (4-bit)   ~40              ~8                 <2s
MLX vs GGUF
These numbers are for MLX-native models. GGUF models loaded through the compatibility layer typically show 10-20% lower throughput. Always prefer MLX-native models for best performance.

Monitoring from Local Web

The Opta Local Web dashboard at http://localhost:3004 provides a visual interface for monitoring LMX. It connects to the /admin/events SSE stream and displays:

  • VRAM gauge — Real-time memory usage ring with percentage
  • Throughput graph — Tokens per second over time (300-sample circular buffer)
  • Model list — Currently loaded model and available models on disk
  • Health indicator — Connection status badge with heartbeat monitoring

See the Local Web documentation for setup and usage details.