Model Management
LMX supports hot-swapping models without restarting the server. This page covers loading, unloading, downloading, and managing models on disk.
Loading Models
LMX loads one model at a time into unified memory. When you load a new model, the previously loaded model is automatically unloaded first.
Via API
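For example, with curl. The port, endpoint path, and payload field name below are illustrative assumptions, not the documented LMX API; check your deployment's API reference for the actual routes:

```shell
# Load a model via the admin API (hypothetical endpoint).
# Any previously loaded model is unloaded automatically first.
curl -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-30B-A3B-4bit"}'
```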
Via CLI
The Opta CLI provides a convenient wrapper for model management:
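A sketch of what this might look like. These subcommand names are assumptions for illustration, not the documented Opta interface; run the CLI's help output for the real names:

```shell
# Hypothetical opta subcommands for model management.
opta model load mlx-community/Qwen3-30B-A3B-4bit   # load (downloads on first use)
opta model status                                  # show the currently loaded model
opta model unload                                  # free unified memory
```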
Unloading Models
Unloading a model frees its VRAM. After unloading, the /readyz endpoint returns 503 until a new model is loaded.
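A minimal sketch, assuming the server listens on localhost:8080 and exposes an unload route under `/admin` (the exact path is an assumption; `/readyz` is the documented readiness endpoint):

```shell
# Unload the current model (hypothetical endpoint path).
curl -X POST http://localhost:8080/admin/models/unload

# Readiness now reports 503 until a new model is loaded.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/readyz
```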
Downloading Models
HuggingFace Hub
LMX downloads models from HuggingFace Hub on first load. Models are cached in the standard HuggingFace cache directory at ~/.cache/huggingface/hub/.
1. **Find a model on HuggingFace.** Browse the mlx-community organization for pre-converted MLX models. Look for models with the `-4bit` suffix for 4-bit quantization.
2. **Load the model.** The first load request for an uncached model triggers the download.
3. **Wait for the download to complete.** Large models (70B+) can take several minutes to download depending on your internet speed. The API blocks until the model is fully loaded.
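The steps above can be sketched as a single blocking call (the port and endpoint path are illustrative assumptions):

```shell
# First load of an uncached model triggers the HuggingFace download.
# Because the API blocks until the model is fully loaded, allow a
# generous timeout for large models.
curl --max-time 3600 -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Llama-3.3-70B-Instruct-4bit"}'
```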
Model Formats
MLX Native
MLX-native models are the preferred format. They are pre-converted to MLX's internal representation and load directly into unified memory without conversion overhead.
The mlx-community organization on HuggingFace maintains converted versions of popular models. Look for repositories with names like:
- `mlx-community/Qwen3-30B-A3B-4bit`
- `mlx-community/Llama-3.3-70B-Instruct-4bit`
- `mlx-community/Mistral-Large-Instruct-2411-4bit`
GGUF Fallback
When an MLX-native version is not available, LMX can load GGUF quantized models as a fallback. GGUF models are loaded through a compatibility layer, which adds slight overhead (10-20% slower than native MLX).
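A GGUF model is loaded through the same load call; the repository name below is only an illustrative example of a GGUF quantization, and the endpoint path is an assumption:

```shell
# Fall back to a GGUF quantization when no MLX-native conversion exists.
# Expect roughly 10-20% lower throughput via the compatibility layer.
curl -X POST http://localhost:8080/admin/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "bartowski/Mistral-Large-Instruct-2411-GGUF"}'
```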
VRAM Estimation
Estimating VRAM requirements before loading a model helps prevent OOM conditions. As a rule of thumb, the weights need roughly params × (quantization bits / 8) bytes, plus 10-20% overhead for the KV cache and activations.
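As a quick estimate, assuming weight memory of params × bits/8 bytes plus ~15% overhead for the KV cache and activations:

```shell
# VRAM_GB ≈ params_in_billions * (quant_bits / 8) * overhead_factor
awk 'BEGIN { params = 70; bits = 4; overhead = 1.15
             printf "estimated VRAM: %.1f GB\n", params * bits / 8 * overhead }'
```

For a 70B model at 4-bit this lands in the same ballpark as the ~39 GB figure in the table below.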
Check your current VRAM usage via the admin API:
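For example (the exact admin path is an assumption; substitute your deployment's memory endpoint):

```shell
# Query current memory usage and pretty-print the JSON response.
curl -s http://localhost:8080/admin/memory | python3 -m json.tool
```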
OOM Prevention
LMX includes an OOM guardian that prevents crashes when memory pressure is too high. The guardian operates at two thresholds:
- Warning threshold (85% VRAM) — LMX logs a warning and emits a memory pressure event via the `/admin/events` SSE stream.
- Critical threshold (90% VRAM) — LMX automatically unloads the model to prevent the process from being killed by the OS. Subsequent inference requests return `503` with code `oom-unloaded`.
The thresholds are configurable in the LMX config file:
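For example (the file name and key names below are assumptions; consult your LMX configuration reference for the actual schema):

```yaml
# lmx.yaml — OOM guardian thresholds, as fractions of total unified memory
oom_guardian:
  warn_threshold: 0.85      # log a warning and emit a memory pressure event
  critical_threshold: 0.90  # automatically unload the model
```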
Recommended Models
These models are tested and recommended for use with LMX on a dedicated Apple Silicon host (M3 Ultra, 192 GB unified memory):
| Model | Params | VRAM | tok/s | Use Case |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B (3B active) | 18 GB | ~65 | Daily driver, fast MoE |
| Llama-3.3-70B | 70B | 39 GB | ~25 | Code generation, analysis |
| DeepSeek-V3-0324 | 685B MoE | ~160 GB | ~8 | Maximum quality reasoning |
| nomic-embed-text | 137M | 0.3 GB | n/a | Embeddings for RAG |