Model Management

LMX supports hot-swapping models without restarting the server. This page covers loading, unloading, downloading, and managing models on disk.

Loading Models

LMX loads one model at a time into unified memory. When you load a new model, the previously loaded model is automatically unloaded first.

Via API

POST /admin/models/load

Load a model into memory by HuggingFace identifier or local path.

Parameters

model (string, required): HuggingFace model ID or local path

Response

{
  "success": true,
  "model": "qwen3-30b-a3b",
  "load_time_ms": 2340,
  "vram_gb": 18.4
}
Load a model via the API
curl -X POST http://192.168.188.11:1234/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/Qwen3-30B-A3B-4bit"}'

Via CLI

The Opta CLI provides a convenient wrapper for model management:

Load a model by alias
opta models load qwen3-30b-a3b
Loading qwen3-30b-a3b...
Model loaded in 2.3s (18.4 GB VRAM)
List all available models
opta models list
  qwen3-30b-a3b       18.4 GB   MLX   ● loaded
  llama-3.3-70b       38.7 GB   MLX   ○ available
  deepseek-v3-0324    42.1 GB   MLX   ○ available
  nomic-embed-text     0.3 GB   MLX   ○ available

Unloading Models

Unloading a model frees its VRAM. After unloading, the /readyz endpoint returns 503 until a new model is loaded.

curl -X POST http://192.168.188.11:1234/admin/models/unload
{"success":true,"freed_vram_gb":18.4}
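To confirm the unload took effect, you can probe the readiness endpoint directly; a quick sketch using curl's `-w` flag to print only the HTTP status:

```shell
# Probe readiness after an unload; expect 503 until a model is loaded again.
# -s: silent, -o /dev/null: discard body, -w: print the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.188.11:1234/readyz
```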

Downloading Models

HuggingFace Hub

LMX downloads models from HuggingFace Hub on first load. Models are cached in the standard HuggingFace cache directory at ~/.cache/huggingface/hub/.
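To see what is already cached locally, you can inspect that directory; a quick sketch (directory names follow the Hub's models--&lt;org&gt;--&lt;name&gt; layout):

```shell
# List cached repos and their disk usage in the HuggingFace hub cache
du -sh ~/.cache/huggingface/hub/models--*/ 2>/dev/null
```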

1. Find a model on HuggingFace

   Browse the mlx-community organization for pre-converted MLX models. Look for models with the -4bit suffix for 4-bit quantization.

2. Load the model (triggers download)

   First load will download the model:

   curl -X POST http://192.168.188.11:1234/admin/models/load \
     -H "Content-Type: application/json" \
     -d '{"model":"mlx-community/Llama-3.3-70B-Instruct-4bit"}'

3. Wait for download to complete

   Large models (70B+) can take several minutes to download depending on your internet speed. The API blocks until the model is fully loaded.

Pre-download models
You can pre-download models using the huggingface-cli tool to avoid waiting during load:
Pre-download a model to the local cache
huggingface-cli download mlx-community/Qwen3-30B-A3B-4bit

Model Formats

MLX Native

MLX-native models are the preferred format. They are pre-converted to MLX's internal representation and load directly into unified memory without conversion overhead.

The mlx-community organization on HuggingFace maintains converted versions of popular models. Look for repositories with names like:

  • mlx-community/Qwen3-30B-A3B-4bit
  • mlx-community/Llama-3.3-70B-Instruct-4bit
  • mlx-community/Mistral-Large-Instruct-2411-4bit

GGUF Fallback

When an MLX-native version is not available, LMX can load GGUF quantized models as a fallback. GGUF models are loaded through a compatibility layer, which adds slight overhead (10-20% slower than native MLX).

Prefer MLX native
Always use MLX-native models when available. GGUF fallback exists for compatibility, but native MLX delivers 15-30% better performance on the same hardware.
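Loading a GGUF model presumably goes through the same load endpoint; a hedged sketch, assuming the endpoint accepts a local .gguf path (the path below is hypothetical, adjust to your setup):

```shell
# Load a local GGUF file as a fallback via the same load endpoint
curl -X POST http://192.168.188.11:1234/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model":"/models/qwen2.5-7b-instruct-q4_k_m.gguf"}'
```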

VRAM Estimation

Estimating VRAM requirements before loading a model helps prevent OOM conditions. Use this rule of thumb:

VRAM (GB) ≈ Parameters (B) × Bits / 8 × 1.2

Examples:
  7B  model @ 4-bit:   7 × 4 / 8 × 1.2 ≈   4.2 GB
  30B model @ 4-bit:  30 × 4 / 8 × 1.2 ≈  18.0 GB
  70B model @ 4-bit:  70 × 4 / 8 × 1.2 ≈  42.0 GB
 120B model @ 4-bit: 120 × 4 / 8 × 1.2 ≈  72.0 GB

The 1.2× multiplier accounts for KV cache and runtime overhead.
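The rule of thumb above is easy to script; a small awk sketch:

```shell
# Estimate VRAM from parameter count (billions) and quantization bits,
# using the rule of thumb: params * bits / 8 * 1.2
estimate_vram() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f GB\n", p * b / 8 * 1.2 }'; }

estimate_vram 30 4    # 30B model at 4-bit -> 18.0 GB
estimate_vram 70 4    # 70B model at 4-bit -> 42.0 GB
```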

Check your current VRAM usage via the admin API:

curl http://192.168.188.11:1234/admin/models
{"models":[{"id":"qwen3-30b-a3b","loaded":true,"vram_gb":18.4}]}
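If jq is installed, the response is easy to filter; a sketch pulling the loaded model's VRAM figure:

```shell
# Print vram_gb for whichever model is currently loaded (assumes jq is available)
curl -s http://192.168.188.11:1234/admin/models \
  | jq '.models[] | select(.loaded) | .vram_gb'
```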

OOM Prevention

LMX includes an OOM guardian that prevents crashes when memory pressure is too high. The guardian operates at two thresholds:

  • Warning threshold (85% VRAM) — LMX logs a warning and emits a memory pressure event via the /admin/events SSE stream.
  • Critical threshold (90% VRAM) — LMX automatically unloads the model to prevent the process from being killed by the OS. Subsequent inference requests return 503 with code oom-unloaded.
Never-crash guarantee
LMX never crashes on OOM. When the critical threshold is hit, the model is unloaded gracefully and the server continues running. Reload the model (or a smaller one) once memory is available.
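Memory pressure warnings can be observed live on the events stream mentioned above; a sketch assuming a standard text/event-stream response (the exact event names are not specified here):

```shell
# Follow the admin SSE stream; -N disables output buffering so events
# print as they arrive. Filter for memory-related events with grep.
curl -N -s http://192.168.188.11:1234/admin/events | grep --line-buffered "memory"
```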

The thresholds are configurable in the LMX config file:

config.toml
[memory]
max_memory_pct = 85      # Warning threshold
oom_threshold_pct = 90   # Auto-unload threshold

Recommended Models

These models are tested and recommended for use with LMX on a Mac Studio M3 Ultra (192 GB):

  Model              Params            VRAM      tok/s   Use Case
  Qwen3-30B-A3B      30B (3B active)   18 GB     ~65     Daily driver, fast MoE
  Llama-3.3-70B      70B               39 GB     ~25     Code generation, analysis
  DeepSeek-V3-0324   685B MoE          ~160 GB   ~8      Maximum quality reasoning
  nomic-embed-text   137M              0.3 GB    n/a     Embeddings for RAG
MoE models
Mixture-of-Experts (MoE) models like Qwen3-30B-A3B and DeepSeek-V3 only activate a fraction of their parameters per token, delivering much higher throughput relative to their total parameter count.