Model Management
LMX supports hot-swapping models without restarting the server. This page covers loading, unloading, downloading, and managing models on disk.
Loading Models
LMX loads one model at a time into unified memory. When you load a new model, the previously loaded model is automatically unloaded first.
Via API
POST /admin/models/load
Load a model into memory by HuggingFace identifier or local path.

Parameters

- model (string, required) — HuggingFace model ID or local path

Response
```json
{
  "success": true,
  "model": "qwen3-30b-a3b",
  "load_time_ms": 2340,
  "vram_gb": 18.4
}
```

```bash
curl -X POST http://192.168.188.11:1234/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-30B-A3B-4bit"}'
```

Via CLI
The Opta CLI provides a convenient wrapper for model management:
```bash
opta models load qwen3-30b-a3b
```

```
Loading qwen3-30b-a3b...
Model loaded in 2.3s (18.4 GB VRAM)
```

```bash
opta models list
```

```
qwen3-30b-a3b       18.4 GB  MLX  ● loaded
llama-3.3-70b       38.7 GB  MLX  ○ available
deepseek-v3-0324    42.1 GB  MLX  ○ available
nomic-embed-text     0.3 GB  MLX  ○ available
```
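If you prefer scripting the HTTP API instead of using the CLI, a minimal standard-library Python sketch is below. The helper names are illustrative; the endpoint and JSON payload match the curl example above:

```python
import json
import urllib.request

BASE = "http://192.168.188.11:1234"  # LMX server from the examples above

def build_load_request(model: str) -> urllib.request.Request:
    """Build the POST /admin/models/load request (no network I/O here)."""
    body = json.dumps({"model": model}).encode()
    return urllib.request.Request(
        f"{BASE}/admin/models/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def load_model(model: str) -> dict:
    """Send the load request and return the parsed JSON response."""
    with urllib.request.urlopen(build_load_request(model)) as resp:
        return json.load(resp)

# load_model("mlx-community/Qwen3-30B-A3B-4bit")
```

Remember that the call blocks until the model is fully loaded, so set a generous socket timeout when loading large models.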
Unloading Models
Unloading a model frees its VRAM. After unloading, the /readyz endpoint returns 503 until a new model is loaded.
```bash
curl -X POST http://192.168.188.11:1234/admin/models/unload
```

```json
{"success":true,"freed_vram_gb":18.4}
```

Downloading Models
HuggingFace Hub
LMX downloads models from HuggingFace Hub on first load. Models are cached in the standard HuggingFace cache directory at ~/.cache/huggingface/hub/.
1. Find a model on HuggingFace

   Browse the mlx-community organization for pre-converted MLX models. Look for models with the '-4bit' suffix for 4-bit quantization.

2. Load the model (triggers download)

   ```bash
   curl -X POST http://192.168.188.11:1234/admin/models/load \
     -H "Content-Type: application/json" \
     -d '{"model":"mlx-community/Llama-3.3-70B-Instruct-4bit"}'
   ```

3. Wait for download to complete

   Large models (70B+) can take several minutes to download depending on your internet speed. The API blocks until the model is fully loaded.

To avoid waiting during load, pre-download models with the huggingface-cli tool:

```bash
huggingface-cli download mlx-community/Qwen3-30B-A3B-4bit
```

Model Formats
MLX Native
MLX-native models are the preferred format. They are pre-converted to MLX's internal representation and load directly into unified memory without conversion overhead.
The mlx-community organization on HuggingFace maintains converted versions of popular models. Look for repositories with names like:
- mlx-community/Qwen3-30B-A3B-4bit
- mlx-community/Llama-3.3-70B-Instruct-4bit
- mlx-community/Mistral-Large-Instruct-2411-4bit
GGUF Fallback
When an MLX-native version is not available, LMX can load GGUF quantized models as a fallback. GGUF models are loaded through a compatibility layer, which adds slight overhead (10-20% slower than native MLX).
VRAM Estimation
Estimating VRAM requirements before loading a model helps prevent OOM conditions. Use this rule of thumb:
VRAM (GB) ≈ Parameters (B) × Bits / 8 × 1.2
Examples:
- 7B model @ 4-bit: 7 × 4 / 8 × 1.2 ≈ 4.2 GB
- 30B model @ 4-bit: 30 × 4 / 8 × 1.2 ≈ 18.0 GB
- 70B model @ 4-bit: 70 × 4 / 8 × 1.2 ≈ 42.0 GB
- 120B model @ 4-bit: 120 × 4 / 8 × 1.2 ≈ 72.0 GB
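The rule of thumb is easy to script. A sketch (the estimate_vram_gb helper is illustrative, not part of LMX):

```python
def estimate_vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rule of thumb: parameters (billions) x bits / 8 x overhead factor."""
    return params_b * bits / 8 * overhead

for params in (7, 30, 70, 120):
    print(f"{params}B @ 4-bit: ~{estimate_vram_gb(params, 4):.1f} GB")
```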
The 1.2× multiplier accounts for KV cache and runtime overhead.

Check your current VRAM usage via the admin API:
```bash
curl http://192.168.188.11:1234/admin/models
```

```json
{"models":[{"id":"qwen3-30b-a3b","loaded":true,"vram_gb":18.4}]}
```

OOM Prevention
LMX includes an OOM guardian that prevents crashes when memory pressure is too high. The guardian operates at two thresholds:
- Warning threshold (85% VRAM) — LMX logs a warning and emits a memory pressure event via the /admin/events SSE stream.
- Critical threshold (90% VRAM) — LMX automatically unloads the model to prevent the process from being killed by the OS. Subsequent inference requests return 503 with code oom-unloaded.
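The two-threshold logic can be expressed compactly. A sketch of the classification only (the guardian_action helper is illustrative, not LMX internals):

```python
def guardian_action(used_gb: float, total_gb: float,
                    warn_pct: float = 85, crit_pct: float = 90) -> str:
    """Classify memory pressure against the two guardian thresholds."""
    pct = used_gb / total_gb * 100
    if pct >= crit_pct:
        return "unload"  # critical: model is auto-unloaded, requests get 503
    if pct >= warn_pct:
        return "warn"    # warning: log + event on the /admin/events stream
    return "ok"
```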
The thresholds are configurable in the LMX config file:
```toml
[memory]
max_memory_pct = 85     # Warning threshold
oom_threshold_pct = 90  # Auto-unload threshold
```

Recommended Models
These models are tested and recommended for use with LMX on the Mac Studio M3 Ultra (192GB):
| Model | Params | VRAM | tok/s | Use Case |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B (3B active) | 18 GB | ~65 | Daily driver, fast MoE |
| Llama-3.3-70B | 70B | 39 GB | ~25 | Code generation, analysis |
| DeepSeek-V3-0324 | 685B MoE | ~160 GB | ~8 | Maximum quality reasoning |
| nomic-embed-text | 137M | 0.3 GB | n/a | Embeddings for RAG |