Manage the models running on your LMX inference server. Load, swap, browse, and monitor models directly from the CLI.
Overview
The opta models command group lets you control which models are loaded on your LMX server, view VRAM usage and throughput metrics, browse available models from HuggingFace, and configure model aliases for quick access. All model operations communicate with the LMX server over your LAN.
Show current model status
opta models
Currently loaded:
qwen3-30b-a3b (4-bit, 18.2 GB VRAM)
Available commands:
opta models load <name> Load a model
opta models swap <name> Unload current, load new
opta models browse-library Browse HuggingFace models
opta models dashboard Live VRAM and throughput view
Listing Models
Running opta models with no subcommand shows the currently loaded model, its quantization level, and VRAM usage. If no model is loaded, it shows available models that have been previously downloaded to the LMX server.
Show loaded model and status
opta models
Loading Models
Use opta models load to load a model into memory. If the model has been previously downloaded, it loads from the local cache. If not, LMX downloads it from HuggingFace first.
Load a specific model
opta models load qwen3-30b-a3b
Loading qwen3-30b-a3b...
Model loaded in 4.2s (18.2 GB VRAM)
VRAM management
Loading a model that exceeds available VRAM causes LMX to unload the current model first, automatically. LMX is designed never to crash on out-of-memory conditions -- it degrades gracefully by unloading models.
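The unload-then-load behavior can be sketched as follows. This is illustrative only: the state dict, model sizes, and total-VRAM figure are assumptions for the example, not the LMX API.

```python
TOTAL_VRAM_GB = 192.0  # illustrative: total unified memory on the host

def load_model(state, name, size_gb, total_gb=TOTAL_VRAM_GB):
    """Load `name`, unloading the current model first if it would not fit.

    Sketch of LMX-style graceful degradation: rather than crash on an
    out-of-memory condition, free the currently loaded model and retry.
    """
    if state["loaded"] is not None and state["used_gb"] + size_gb > total_gb:
        # Degrade gracefully: evict the current model instead of OOM-crashing.
        state["loaded"], state["used_gb"] = None, 0.0
    state["loaded"] = name
    state["used_gb"] += size_gb
    return state
```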
Swapping Models
opta models swap is a convenience command that unloads the current model and loads a new one in a single operation. This is the recommended way to switch between models during a session.
Swap to a different model
opta models swap deepseek-r1-0528
Unloading qwen3-30b-a3b...
Loading deepseek-r1-0528...
Model swapped in 6.1s (42.8 GB VRAM)
Browsing the Library
opta models browse-library opens an interactive TUI browser that lets you search HuggingFace for MLX-compatible models. You can filter by size, quantization, and task type, then download directly to your LMX server.
Performance Dashboard
opta models dashboard opens a live terminal dashboard showing VRAM usage, throughput (tokens per second), and other performance metrics for the currently loaded model.
Open live model performance dashboard
opta models dashboard
Dashboard output
Model: qwen3-30b-a3b (4-bit)
VRAM: 18.2 / 192.0 GB [████░░░░░░░░░░░░] 9.5%
Throughput: 42.3 tok/s (avg over last 60s)
Requests: 1,247 total | 3 active
Uptime: 4h 23m
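The VRAM gauge in the output above is plain text; a similar bar can be rendered in a few lines. This is a formatting sketch only, not the dashboard's actual implementation.

```python
def vram_bar(used_gb, total_gb, width=16):
    """Render a text gauge like the dashboard's VRAM line."""
    filled = round(width * used_gb / total_gb)
    bar = "█" * filled + "░" * (width - filled)
    pct = 100.0 * used_gb / total_gb
    return f"{used_gb:.1f} / {total_gb:.1f} GB [{bar}] {pct:.1f}%"
```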
Model Aliases
Model aliases let you use short, memorable names instead of full model identifiers. Aliases are configured in your CLI config and resolve to full model names when used in commands.
Alias       Resolves To
qwen        mlx-community/Qwen3-30B-A3B-MLX-4bit
deepseek    mlx-community/DeepSeek-R1-0528-MLX-4bit
codestral   mlx-community/Codestral-25.01-MLX-4bit
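Alias resolution is a simple name-table lookup. A minimal sketch, assuming unknown names pass through unchanged (the table mirrors the aliases above; the function name is illustrative, not part of the CLI):

```python
# Alias table mirroring the defaults above.
ALIASES = {
    "qwen": "mlx-community/Qwen3-30B-A3B-MLX-4bit",
    "deepseek": "mlx-community/DeepSeek-R1-0528-MLX-4bit",
    "codestral": "mlx-community/Codestral-25.01-MLX-4bit",
}

def resolve_model(name):
    """Expand a short alias to its full model identifier; pass through otherwise."""
    return ALIASES.get(name, name)
```

With this table in place, opta models load qwen would resolve to the full mlx-community identifier before the request is sent to LMX.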
Fallback Chain
Opta CLI uses a two-tier fallback chain for inference. It first attempts to use the local LMX server for fast, private inference. If LMX is unreachable or the request fails, it falls back to the Anthropic cloud API (if an API key is configured).
Inference fallback chain
Request Flow:
1. LMX (local, lmx-host.local:1234)
├─ Success → Use local response
└─ Fail → Fallback to Anthropic
2. Anthropic (cloud, api.anthropic.com)
├─ Success → Use cloud response
└─ Fail → Error reported to user
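The flow above amounts to trying providers in order and returning the first success. A sketch of that logic; the function, provider names, and callables are illustrative, not Opta CLI internals:

```python
def run_with_fallback(prompt, providers):
    """Try each (name, call) provider in order; return the first success.

    `providers` might look like [("lmx", lmx_call), ("anthropic", cloud_call)].
    If every tier fails, raise with all collected errors so the user sees
    why both local and cloud inference were unavailable.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```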
Staying local
If you want to ensure all inference stays on your local network, run opta config set provider.fallback false to disable the cloud fallback. The CLI will return an error if LMX is unavailable instead of falling back to Anthropic.