LMX provides health probes, a real-time metrics stream, and detailed error reporting to help you monitor inference performance and diagnose issues.
Health Probes
LMX exposes two health endpoints following the Kubernetes probe pattern. These are useful for monitoring tools, load balancers, and the Opta daemon's preflight check.
Liveness (/healthz)
GET /healthz
Returns 200 if the LMX process is running and the HTTP server is accepting connections. Does not check whether a model is loaded.
Response
{"status":"ok"}
Use this endpoint for basic process monitoring. If /healthz returns an error or times out, the LMX process has crashed and needs to be restarted.
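A liveness check is easy to script as a poll. The sketch below (Python; the fetch callable is injected so the logic can run without a live server, and the host name is taken from the curl examples in this page) treats anything other than a 200 with {"status":"ok"} as a failure:

```python
import json

HEALTHZ_URL = "http://lmx-host.local:1234/healthz"  # host from the examples in this doc

def is_alive(fetch):
    """Return True if /healthz answers 200 with {"status": "ok"}.

    `fetch` is any callable returning (status_code, body_bytes); in
    production it would wrap urllib or requests with a short timeout.
    """
    try:
        status, body = fetch(HEALTHZ_URL)
    except Exception:
        # Connection refused or timed out: the LMX process is down.
        return False
    if status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False
```

If `is_alive` returns False, restart the LMX process as described above; there is no in-band recovery from a dead server.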
Readiness (/readyz)
GET /readyz
Checks whether a model is loaded and ready to serve inference requests; returns 503 if not.
The Opta daemon calls /readyz before every inference request. The result is cached for 10 seconds to reduce overhead. If /readyz returns 503, the daemon reports an LMX_UNREACHABLE error to the client.
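The daemon's 10-second cache amounts to a small TTL wrapper around the probe. A sketch of the idea (Python; the probe callable and clock are injected for testability, and the class name is illustrative, not the daemon's actual implementation):

```python
import time

class CachedReadiness:
    """Cache the result of a readiness probe for `ttl` seconds."""

    def __init__(self, probe, ttl=10.0, clock=time.monotonic):
        self._probe = probe          # callable: True = ready, False = 503
        self._ttl = ttl
        self._clock = clock
        self._cached = None
        self._stamp = -float("inf")  # force a real probe on first call

    def is_ready(self):
        now = self._clock()
        if now - self._stamp >= self._ttl:
            self._cached = self._probe()  # hit /readyz for real
            self._stamp = now
        return self._cached
```

The trade-off is the usual one for probe caching: a model unload can go unnoticed for up to 10 seconds, in exchange for one probe per window instead of one per request.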
Real-Time Metrics
GET /admin/events
Server-Sent Events stream of real-time metrics. The stream emits throughput, memory, and heartbeat events continuously. Connect to this endpoint for live monitoring dashboards.
Connect with any SSE client to receive metrics in real time:
Stream metrics (press Ctrl+C to stop)
curl -N http://lmx-host.local:1234/admin/events
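To consume the stream programmatically rather than with curl, a minimal SSE parser only needs to split on blank lines and collect the event: and data: fields. A sketch in Python (the field handling follows the SSE specification; the payload fields in the sample are illustrative, not a documented LMX schema):

```python
import json

def parse_sse(lines):
    """Yield (event_type, payload_dict) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":                          # blank line terminates an event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        # comment lines (starting with ':') and unknown fields are ignored

sample = [
    "event: throughput\n",
    'data: {"tokens_per_sec": 45.2, "active_requests": 1}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In practice you would feed this the line iterator of a streaming HTTP response instead of a fixed list.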
SSE Event Types
throughput — Emitted during active inference. Reports tokens per second and active request count.
memory — Reports current memory usage relative to the warning and critical thresholds.
heartbeat — Emitted periodically so clients can confirm the stream is still alive.
When the critical memory threshold is hit, LMX unloads the model automatically to prevent a crash. All in-flight requests receive a 503 response with the oom-unloaded error code.
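A client can distinguish an OOM unload from an ordinary transient 503 by checking the error code on the response. A sketch of that decision (Python; the {"error": {"code": ...}} response shape is an assumption for illustration, not a documented LMX contract):

```python
def next_action(status, payload):
    """Decide how a client should react to an inference response.

    Assumes error responses carry {"error": {"code": ...}}; that shape
    is illustrative and should be checked against real LMX responses.
    """
    if status == 200:
        return "ok"
    code = (payload.get("error") or {}).get("code")
    if code == "oom-unloaded":
        # Model was evicted to prevent a crash: reload, possibly smaller.
        return "reload-smaller-model"
    if code == "no-model-loaded":
        return "load-model"
    return "retry-later"
```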
Error Codes
These error codes appear in API responses and SSE events. Use them for automated monitoring and alerting:
Code              Severity   Action
no-model-loaded   Warning    Load a model with POST /admin/models/load
storage-full      Error      Free disk space or remove cached models
lmx-timeout       Warning    Reduce context length or switch to a smaller model
oom-unloaded      Critical   Free memory and reload a model (possibly smaller)
model-not-found   Warning    Check the model path or download the model
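For automated alerting, the table above can be encoded directly. A sketch (Python; the paging policy is a hypothetical example, only the code-to-severity mapping comes from the table):

```python
# Severity per error code, transcribed from the table above.
SEVERITY = {
    "no-model-loaded": "warning",
    "storage-full":    "error",
    "lmx-timeout":     "warning",
    "oom-unloaded":    "critical",
    "model-not-found": "warning",
}

def should_page(code):
    """Illustrative policy: page a human only on critical codes."""
    return SEVERITY.get(code) == "critical"
```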
Log Files
LMX writes logs to stdout/stderr, which are captured by launchd when running as a service:
Follow LMX stdout logs
tail -f /tmp/opta-lmx.stdout.log
Follow LMX error logs
tail -f /tmp/opta-lmx.stderr.log
Key log patterns to watch for:
# Normal startup
INFO: LMX starting on 0.0.0.0:1234
INFO: Model loaded: qwen3-30b-a3b (18.4 GB, 2.3s)
# Inference activity
INFO: Completion request model=qwen3-30b-a3b tokens=342 speed=45.2tok/s
# Memory warning
WARNING: Memory usage at 85.0% (163.2/192.0 GB)
# OOM protection triggered
ERROR: Memory critical at 90.0% — unloading model
INFO: Model unloaded, freed 18.4 GB
# Client connection errors (usually harmless)
WARNING: SSE client disconnected: ConnectionResetError
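If you ship these logs to an alerting pipeline, the patterns above can be matched with a small classifier. A sketch in Python (the regexes follow the sample lines above; adjust them if your LMX build formats log lines differently):

```python
import re

# Ordered rules: first match wins. Patterns mirror the sample log lines.
RULES = [
    ("oom_unload",  re.compile(r"ERROR: Memory critical")),
    ("memory_warn", re.compile(r"WARNING: Memory usage at")),
    ("sse_noise",   re.compile(r"WARNING: SSE client disconnected")),
    ("startup",     re.compile(r"INFO: LMX starting")),
]

def classify(line):
    """Return a label for a known log pattern, or None for routine lines."""
    for label, pattern in RULES:
        if pattern.search(line):
            return label
    return None
```

Treat `oom_unload` as actionable, `memory_warn` as a capacity signal, and `sse_noise` as safe to drop.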
Performance Benchmarks
Throughput Targets
Expected token generation speeds on a dedicated Apple Silicon host (M3 Ultra, 192 GB) with MLX-native 4-bit quantized models:
Model                      Prompt (tok/s)   Generate (tok/s)   Time to First Token
Qwen3-30B-A3B (4-bit)      ~200             ~65                <200 ms
Llama-3.3-70B (4-bit)      ~120             ~25                <500 ms
DeepSeek-V3-0324 (4-bit)   ~40              ~8                 <2 s
MLX vs GGUF
These numbers are for MLX-native models. GGUF models loaded through the compatibility layer typically show 10-20% lower throughput. Always prefer MLX-native models for best performance.
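The throughput figures combine into a rough end-to-end estimate: prefill time plus generation time. A sketch (Python; the example uses the Qwen3-30B-A3B row from the table, and deliberately ignores the fixed TTFT overhead, so real timings will vary with context length and load):

```python
def estimate_seconds(prompt_tokens, output_tokens, prompt_tps, generate_tps):
    """Rough request time: prompt prefill plus token generation.

    Ignores fixed overhead (the sub-second TTFT figures above); treat
    the result as a planning estimate, not a guarantee.
    """
    return prompt_tokens / prompt_tps + output_tokens / generate_tps

# Qwen3-30B-A3B (4-bit): ~200 tok/s prefill, ~65 tok/s generation.
t = estimate_seconds(1000, 500, 200, 65)   # roughly 12.7 seconds
```

Estimates like this are useful for choosing client timeouts: the lmx-timeout guidance above (reduce context or pick a smaller model) follows directly from the two divisors.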
Monitoring from Local Web
The Opta Local Web dashboard at http://localhost:3004 provides a visual interface for monitoring LMX. It connects to the /admin/events SSE stream and displays:
VRAM gauge — Real-time memory usage ring with percentage
Throughput graph — Tokens per second over time (300-sample circular buffer)
Model list — Currently loaded model and available models on disk
Health indicator — Connection status badge with heartbeat monitoring
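The throughput graph's 300-sample circular buffer is easy to reproduce if you build your own dashboard on top of /admin/events; Python's collections.deque provides one directly (a sketch, not the Local Web dashboard's actual code):

```python
from collections import deque

class ThroughputBuffer:
    """Keep only the most recent `size` tokens-per-second samples."""

    def __init__(self, size=300):
        self._samples = deque(maxlen=size)  # oldest samples fall off the front

    def push(self, tok_per_sec):
        self._samples.append(tok_per_sec)

    def snapshot(self):
        """Return samples oldest-first, ready for plotting."""
        return list(self._samples)
```

With `maxlen` set, appends past capacity silently evict the oldest sample, so memory stays bounded no matter how long the SSE stream runs.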