Manage the models running on your LMX inference server. Load, swap, browse, and monitor models directly from the CLI.
Overview
The opta models command group lets you control which models are loaded on your LMX server, view VRAM usage and throughput metrics, browse available models from HuggingFace, and configure model aliases for quick access. All model operations communicate with the LMX server over your LAN.
Show current model status
opta models
Currently loaded:
qwen3-30b-a3b (4-bit, 18.2 GB VRAM)
Available commands:
opta models load <name> Load a model
opta models swap <name> Unload current, load new
opta models browse-library Browse HuggingFace models
opta models dashboard Live VRAM and throughput view
Listing Models
Running opta models with no subcommand shows the currently loaded model, its quantization level, and VRAM usage. If no model is loaded, it shows available models that have been previously downloaded to the LMX server.
Show loaded model and status
opta models
Loading Models
Use opta models load to load a model into memory. If the model has been previously downloaded, it loads from the local cache. If not, LMX downloads it from HuggingFace first.
Load a specific model
opta models load qwen3-30b-a3b
Loading qwen3-30b-a3b...
Model loaded in 4.2s (18.2 GB VRAM)
VRAM management
Loading a model that exceeds available VRAM causes LMX to unload the current model first, automatically. LMX is designed never to crash on out-of-memory conditions -- it degrades gracefully by unloading models.
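The unload-then-load behavior can be sketched as follows. This is illustrative only: the state dict, model sizes, and total-VRAM figure are assumptions for the example, not the LMX API.

```python
TOTAL_VRAM_GB = 192.0  # illustrative: total unified memory on the host

def load_model(state, name, size_gb, total_gb=TOTAL_VRAM_GB):
    """Load `name`, unloading the current model first if it would not fit.

    Sketch of LMX-style graceful degradation: rather than crash on an
    out-of-memory condition, free the currently loaded model and retry.
    """
    if state["loaded"] is not None and state["used_gb"] + size_gb > total_gb:
        # Degrade gracefully: evict the current model instead of OOM-crashing.
        state["loaded"], state["used_gb"] = None, 0.0
    state["loaded"] = name
    state["used_gb"] += size_gb
    return state
```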
Swapping Models
opta models swap is a convenience command that unloads the current model and loads a new one in a single operation. This is the recommended way to switch between models during a session.
Swap to a different model
opta models swap deepseek-r1-0528
Unloading qwen3-30b-a3b...
Loading deepseek-r1-0528...
Model swapped in 6.1s (42.8 GB VRAM)
Browsing the Library
opta models browse-library opens an interactive TUI browser that lets you search HuggingFace for MLX-compatible models. You can filter by size, quantization, and task type, then download directly to your LMX server.
Performance Dashboard
opta models dashboard opens a live terminal dashboard showing VRAM usage, throughput (tokens per second), and other performance metrics for the currently loaded model.
Open live model performance dashboard
opta models dashboard
Dashboard output
Model: qwen3-30b-a3b (4-bit)
VRAM: 18.2 / 192.0 GB [████░░░░░░░░░░░░] 9.5%
Throughput: 42.3 tok/s (avg over last 60s)
Requests: 1,247 total | 3 active
Uptime: 4h 23m
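The VRAM gauge in the output above is plain text; a similar bar can be rendered in a few lines. This is a formatting sketch only, not the dashboard's actual implementation.

```python
def vram_bar(used_gb, total_gb, width=16):
    """Render a text gauge like the dashboard's VRAM line."""
    filled = round(width * used_gb / total_gb)
    bar = "█" * filled + "░" * (width - filled)
    pct = 100.0 * used_gb / total_gb
    return f"{used_gb:.1f} / {total_gb:.1f} GB [{bar}] {pct:.1f}%"
```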
Model Aliases
Model aliases let you use short, memorable names instead of full model identifiers. Aliases are configured in your CLI config and resolve to full model names when used in commands.
Alias       Resolves To
qwen        mlx-community/Qwen3-30B-A3B-MLX-4bit
deepseek    mlx-community/DeepSeek-R1-0528-MLX-4bit
codestral   mlx-community/Codestral-25.01-MLX-4bit
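Alias resolution is a simple name-table lookup. A minimal sketch, assuming unknown names pass through unchanged (the table mirrors the aliases above; the function name is illustrative, not part of the CLI):

```python
# Alias table mirroring the defaults above.
ALIASES = {
    "qwen": "mlx-community/Qwen3-30B-A3B-MLX-4bit",
    "deepseek": "mlx-community/DeepSeek-R1-0528-MLX-4bit",
    "codestral": "mlx-community/Codestral-25.01-MLX-4bit",
}

def resolve_model(name):
    """Expand a short alias to its full model identifier; pass through otherwise."""
    return ALIASES.get(name, name)
```

With this table in place, opta models load qwen would resolve to the full mlx-community identifier before the request is sent to LMX.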
Fallback Chain
Opta CLI uses a two-tier fallback chain for inference. It first attempts to use the local LMX server for fast, private inference. If LMX is unreachable or the request fails, it falls back to the Anthropic cloud API (if an API key is configured).
Inference fallback chain
Request Flow:
1. LMX (local, lmx-host.local:1234)
├─ Success → Use local response
└─ Fail → Fallback to Anthropic
2. Anthropic (cloud, api.anthropic.com)
├─ Success → Use cloud response
└─ Fail → Error reported to user
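The flow above amounts to trying providers in order and returning the first success. A sketch of that logic; the function, provider names, and callables are illustrative, not Opta CLI internals:

```python
def run_with_fallback(prompt, providers):
    """Try each (name, call) provider in order; return the first success.

    `providers` might look like [("lmx", lmx_call), ("anthropic", cloud_call)].
    If every tier fails, raise with all collected errors so the user sees
    why both local and cloud inference were unavailable.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```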
Staying local
If you want to ensure all inference stays on your local network, run opta config set provider.fallback false to disable the cloud fallback. The CLI will return an error if LMX is unavailable instead of falling back to Anthropic.