Model Management
Manage the models running on your LMX inference server. Load, swap, browse, and monitor models directly from the CLI.
Overview
The opta models command group lets you control which models are loaded on your LMX server, view VRAM usage and throughput metrics, browse available models from HuggingFace, and configure model aliases for quick access. All model operations communicate with the LMX server over your LAN.
```
$ opta models
Currently loaded: qwen3-30b-a3b (4-bit, 18.2 GB VRAM)

Available commands:
  opta models load <name>     Load a model
  opta models swap <name>     Unload current, load new
  opta models browse-library  Browse HuggingFace models
  opta models dashboard       Live VRAM and throughput view
```
Listing Models
Running opta models with no subcommand shows the currently loaded model, its quantization level, and VRAM usage. If no model is loaded, it shows available models that have been previously downloaded to the LMX server.
```
$ opta models
```

Loading Models
Use opta models load to load a model into memory. If the model has been previously downloaded, it loads from the local cache. If not, LMX downloads it from HuggingFace first.
```
$ opta models load qwen3-30b-a3b
Loading qwen3-30b-a3b...
Model loaded in 4.2s (18.2 GB VRAM)
```
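The cache-or-download decision described above can be sketched as follows. This is an illustrative sketch only: the cache path and function name are assumptions, not the real LMX layout or API.

```python
from pathlib import Path

def resolve_model_source(name: str, cache_dir: str = "~/.lmx/models") -> str:
    """Return 'cache' if the model is already downloaded, else 'huggingface'.

    Hypothetical sketch of LMX's load behavior: serve from the local
    cache when the model directory exists, otherwise fetch from
    HuggingFace first. The cache_dir default is illustrative.
    """
    model_path = Path(cache_dir).expanduser() / name
    return "cache" if model_path.exists() else "huggingface"
```

Either way, the model ends up resident in VRAM; only the first load pays the download cost.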
Swapping Models
opta models swap is a convenience command that unloads the current model and loads a new one in a single operation. This is the recommended way to switch between models during a session.
```
$ opta models swap deepseek-r1-0528
Unloading qwen3-30b-a3b...
Loading deepseek-r1-0528...
Model swapped in 6.1s (42.8 GB VRAM)
```
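Conceptually, swap is just unload-then-load bundled into one step. A minimal sketch, with a made-up `ModelServer` class standing in for the real LMX server:

```python
class ModelServer:
    """Toy stand-in for the LMX server's model slot (illustrative only)."""

    def __init__(self):
        self.loaded = None  # name of the currently resident model, if any

    def load(self, name):
        self.loaded = name

    def unload(self):
        self.loaded = None

    def swap(self, name):
        """Unload the current model (if any), then load the new one."""
        if self.loaded is not None:
            self.unload()
        self.load(name)
        return self.loaded
```

Doing this as one operation avoids the window where you might forget to unload and run two large models at once.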
Browsing the Library
opta models browse-library opens an interactive TUI browser that lets you search HuggingFace for MLX-compatible models. You can filter by size, quantization, and task type, then download directly to your LMX server.
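The size and quantization filters work like a simple predicate over the model list. A hedged sketch, where the entry fields (`size_gb`, `quant`) are assumptions about how the browser represents results, not the real data model:

```python
def filter_models(models, max_size_gb=None, quant=None):
    """Return models under the optional size ceiling with matching quantization.

    Illustrative only; mirrors the kind of filtering the TUI browser
    applies to HuggingFace search results.
    """
    results = []
    for m in models:
        if max_size_gb is not None and m["size_gb"] >= max_size_gb:
            continue  # too large for the requested ceiling
        if quant is not None and m["quant"] != quant:
            continue  # wrong quantization level
        results.append(m)
    return results
```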
```
$ opta models browse-library
HuggingFace Model Browser
━━━━━━━━━━━━━━━━━━━━━━━━
Search: coding models < 30GB

Model                          Size   Quant  Downloads
mlx-community/Qwen3-30B-A3B    18.2G  4-bit  12.4k
mlx-community/DeepSeek-R1      42.8G  4-bit  8.7k
mlx-community/Codestral-25.01  12.1G  4-bit  6.2k

[Enter] Download  [/] Search  [q] Quit
```

Model Dashboard
opta models dashboard opens a live terminal dashboard showing VRAM usage, throughput (tokens per second), and other performance metrics for the currently loaded model.
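The VRAM gauge is a straightforward text rendering of used-over-total memory. A minimal sketch of that display logic; the real dashboard's rendering may differ:

```python
def vram_bar(used_gb: float, total_gb: float, width: int = 16) -> str:
    """Render a text progress bar like the dashboard's VRAM gauge (sketch)."""
    fraction = used_gb / total_gb
    filled = round(fraction * width)          # blocks to fill
    bar = "█" * filled + "░" * (width - filled)
    return f"{used_gb:.1f} / {total_gb:.1f} GB [{bar}] {fraction * 100:.1f}%"
```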
```
$ opta models dashboard
Model: qwen3-30b-a3b (4-bit)
VRAM: 18.2 / 192.0 GB [████░░░░░░░░░░░░] 9.5%
Throughput: 42.3 tok/s (avg over last 60s)
Requests: 1,247 total | 3 active
Uptime: 4h 23m
```

Model Aliases
Model aliases let you use short, memorable names instead of full model identifiers. Aliases are configured in your CLI config and resolve to full model names when used in commands.
| Alias | Resolves To |
|---|---|
| qwen | mlx-community/Qwen3-30B-A3B-MLX-4bit |
| deepseek | mlx-community/DeepSeek-R1-0528-MLX-4bit |
| codestral | mlx-community/Codestral-25.01-MLX-4bit |
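Resolution is a simple lookup: a short name maps to its full identifier, and anything not in the alias table passes through unchanged. A sketch mirroring the table above (the real CLI reads these from its config file, and the config format is not shown here):

```python
# Alias table mirroring the documentation above (illustrative).
ALIASES = {
    "qwen": "mlx-community/Qwen3-30B-A3B-MLX-4bit",
    "deepseek": "mlx-community/DeepSeek-R1-0528-MLX-4bit",
    "codestral": "mlx-community/Codestral-25.01-MLX-4bit",
}

def resolve_alias(name: str) -> str:
    """Expand a configured alias; unknown names pass through unchanged."""
    return ALIASES.get(name, name)
```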
Fallback Chain
Opta CLI uses a two-tier fallback chain for inference. It first attempts to use the local LMX server for fast, private inference. If LMX is unreachable or the request fails, it falls back to the Anthropic cloud API (if an API key is configured).
Request Flow:

```
1. LMX (local, 192.168.188.11:1234)
   ├─ Success → Use local response
   └─ Fail    → Fallback to Anthropic
2. Anthropic (cloud, api.anthropic.com)
   ├─ Success → Use cloud response
   └─ Fail    → Error reported to user
```

Run opta config set provider.fallback false to disable the cloud fallback. The CLI will then return an error if LMX is unavailable instead of falling back to Anthropic.
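The two-tier flow above can be sketched as try-local-first logic. The callables stand in for real LMX and Anthropic clients; the function name and return shape are illustrative, not the CLI's actual internals:

```python
def run_inference(prompt, lmx_call, anthropic_call, fallback_enabled=True):
    """Try the local LMX server first; optionally fall back to the cloud.

    Returns (provider, response). If the local call fails and fallback
    is disabled, the original error propagates to the user.
    """
    try:
        return ("lmx", lmx_call(prompt))
    except Exception:
        if not fallback_enabled:
            raise  # surface the LMX error instead of going to the cloud
        return ("anthropic", anthropic_call(prompt))
```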