Model Management
LMX supports hot-swapping models without restarting the server. This page covers loading, unloading, downloading, and managing models on disk.
Loading Models
LMX loads one model at a time into unified memory. When you load a new model, the previously loaded model is automatically unloaded first.
Via API
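For example, with curl. The port, endpoint path, and payload field name below are illustrative assumptions, not the documented LMX API; check your deployment's API reference for the actual routes:

```shell
# Load a model via the admin API (hypothetical endpoint).
# Any previously loaded model is unloaded automatically first.
curl -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-30B-A3B-4bit"}'
```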
Via CLI
The Opta CLI provides a convenient wrapper for model management:
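A sketch of what this might look like. These subcommand names are assumptions for illustration, not the documented Opta interface; run the CLI's help output for the real names:

```shell
# Hypothetical opta subcommands for model management.
opta model load mlx-community/Qwen3-30B-A3B-4bit   # load (downloads on first use)
opta model status                                  # show the currently loaded model
opta model unload                                  # free unified memory
```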
Unloading Models
Unloading a model frees its VRAM. After unloading, the /readyz endpoint returns 503 until a new model is loaded.
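A minimal sketch, assuming the server listens on localhost:8080 and exposes an unload route under `/admin` (the exact path is an assumption; `/readyz` is the documented readiness endpoint):

```shell
# Unload the current model (hypothetical endpoint path).
curl -X POST http://localhost:8080/admin/models/unload

# Readiness now reports 503 until a new model is loaded.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/readyz
```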
Downloading Models
HuggingFace Hub
LMX downloads models from HuggingFace Hub on first load. Models are cached in the standard HuggingFace cache directory at ~/.cache/huggingface/hub/.
1. **Find a model on HuggingFace.** Browse the mlx-community organization for pre-converted MLX models. Look for models with the `-4bit` suffix for 4-bit quantization.
2. **Load the model.** The first load request for an uncached model triggers the download.
3. **Wait for the download to complete.** Large models (70B+) can take several minutes to download depending on your internet speed. The API blocks until the model is fully loaded.
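The steps above can be sketched as a single blocking call (the port and endpoint path are illustrative assumptions):

```shell
# First load of an uncached model triggers the HuggingFace download.
# Because the API blocks until the model is fully loaded, allow a
# generous timeout for large models.
curl --max-time 3600 -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.3-70B-Instruct-4bit"}'
```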
Model Formats
MLX Native
MLX-native models are the preferred format. They are pre-converted to MLX's internal representation and load directly into unified memory without conversion overhead.
The mlx-community organization on HuggingFace maintains converted versions of popular models. Look for repositories with names like:
- `mlx-community/Qwen3-30B-A3B-4bit`
- `mlx-community/Llama-3.3-70B-Instruct-4bit`
- `mlx-community/Mistral-Large-Instruct-2411-4bit`
GGUF Fallback
When an MLX-native version is not available, LMX can load GGUF quantized models as a fallback. GGUF models are loaded through a compatibility layer, which adds slight overhead (10-20% slower than native MLX).
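A GGUF model is loaded through the same load call; the repository name below is only an illustrative example of a GGUF quantization, and the endpoint path is an assumption:

```shell
# Fall back to a GGUF quantization when no MLX-native conversion exists.
# Expect roughly 10-20% lower throughput via the compatibility layer.
curl -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "bartowski/Mistral-Large-Instruct-2411-GGUF"}'
```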
VRAM Estimation
Estimating VRAM requirements before loading a model helps prevent OOM conditions. As a rule of thumb, the weights need roughly params × (quantization bits / 8) bytes, plus 10-20% overhead for the KV cache and activations.
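As a quick estimate, assuming weight memory of params × bits/8 bytes plus ~15% overhead for the KV cache and activations:

```shell
# VRAM_GB ≈ params_in_billions * (quant_bits / 8) * overhead_factor
awk 'BEGIN { params = 70; bits = 4; overhead = 1.15
             printf "estimated VRAM: %.1f GB\n", params * bits / 8 * overhead }'
```

For a 70B model at 4-bit this lands in the same ballpark as the ~39 GB figure in the table below.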
Check your current VRAM usage via the admin API:
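For example (the exact admin path is an assumption; substitute your deployment's memory endpoint):

```shell
# Query current memory usage and pretty-print the JSON response.
curl -s http://localhost:8080/admin/memory | python3 -m json.tool
```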
OOM Prevention
LMX includes an OOM guardian that prevents crashes when memory pressure is too high. The guardian operates at two thresholds:
- Warning threshold (85% VRAM) — LMX logs a warning and emits a memory pressure event via the `/admin/events` SSE stream.
- Critical threshold (90% VRAM) — LMX automatically unloads the model to prevent the process from being killed by the OS. Subsequent inference requests return `503` with code `oom-unloaded`.
The thresholds are configurable in the LMX config file:
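For example (the file name and key names below are assumptions; consult your LMX configuration reference for the actual schema):

```yaml
# lmx.yaml — OOM guardian thresholds, as fractions of total unified memory
oom_guardian:
  warn_threshold: 0.85      # log a warning and emit a memory pressure event
  critical_threshold: 0.90  # automatically unload the model
```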
Recommended Models
These models are tested and recommended for use with LMX on a dedicated Apple Silicon host (M3 Ultra, 192 GB unified memory):
| Model | Params | VRAM | tok/s | Use Case |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B (3B active) | 18 GB | ~65 | Daily driver, fast MoE |
| Llama-3.3-70B | 70B | 39 GB | ~25 | Code generation, analysis |
| DeepSeek-V3-0324 | 685B MoE | ~160 GB | ~8 | Maximum quality reasoning |
| nomic-embed-text | 137M | 0.3 GB | n/a | Embeddings for RAG |