Dashboard

The Opta Local Web dashboard provides real-time visibility into your LMX inference server -- VRAM utilization, loaded models, token throughput, and server health at a glance.

Dashboard Overview

The dashboard is the default landing page of Opta Local Web. It is organized into a grid of status panels, each updating in real time through the SSE connection to LMX. The layout is designed to give you immediate situational awareness: Is the server healthy? How much memory is available? Which models are loaded, and how fast is inference running?

VRAM Gauge

The VRAM gauge is a circular progress indicator that shows the current unified memory utilization of your Apple Silicon GPU. It displays:

  • Current usage -- how many GB of unified memory are allocated to loaded models
  • Total capacity -- the total unified memory pool (e.g., 192GB on a Mac Studio Ultra)
  • Percentage fill -- rendered as a circular arc with smooth animation

The gauge color shifts from green to amber to red as utilization increases, providing instant visual feedback on memory pressure. When VRAM is nearly full, loading additional models risks OOM conditions -- the gauge helps you decide when to unload before loading.
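The green-to-amber-to-red shift can be sketched as a simple threshold mapping. The 70% and 90% cutoffs below are illustrative assumptions, not values defined by Opta Local Web:

```typescript
// Map VRAM utilization to a gauge color.
// The 70% / 90% thresholds are assumed for illustration.
type GaugeColor = "green" | "amber" | "red";

function gaugeColor(usedGb: number, totalGb: number): GaugeColor {
  const pct = (usedGb / totalGb) * 100;
  if (pct >= 90) return "red";   // near-full: loading more risks OOM
  if (pct >= 70) return "amber"; // memory pressure building
  return "green";                // comfortable headroom
}

// e.g. 87.2GB used of 192GB is under 50% utilization
console.log(gaugeColor(87.2, 192.0)); // "green"
```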

Memory headroom
Keep at least 10-15% of unified memory free for system processes and inference scratch space. Running at 100% utilization will cause model loads to fail or trigger automatic unloading.

Active Models

The active models panel lists every model currently loaded into GPU memory. Each entry shows:

  • Model name -- the identifier used for inference requests
  • Memory footprint -- how many GB this model occupies
  • Quantization -- the quantization level (e.g., Q4_K_M, Q8_0, F16)
  • Status indicator -- whether the model is idle, actively inferring, or loading

You can load and unload models directly from this panel. Loading a model sends a request to the LMX admin API, and the panel updates in real time as the model initializes and enters the ready state. Unloading frees the GPU memory immediately.
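A load/unload action from the panel amounts to a POST against the LMX admin API. The `/admin/models/load` and `/admin/models/unload` routes below are hypothetical placeholders -- consult the LMX admin API reference for the actual paths:

```typescript
// Sketch of the panel's load/unload action. The route shapes are
// assumptions for illustration, not documented LMX endpoints.
interface ModelRequest {
  url: string;
  method: "POST";
  body: string;
}

function buildModelRequest(
  baseUrl: string,
  model: string,
  action: "load" | "unload",
): ModelRequest {
  return {
    url: `${baseUrl}/admin/models/${action}`, // hypothetical route
    method: "POST",
    body: JSON.stringify({ model }),
  };
}

// Fire the request; the panel then updates as model status events
// arrive over the SSE stream.
async function setModelState(
  baseUrl: string,
  model: string,
  action: "load" | "unload",
): Promise<void> {
  const req = buildModelRequest(baseUrl, model, action);
  const res = await fetch(req.url, {
    method: req.method,
    headers: { "Content-Type": "application/json" },
    body: req.body,
  });
  if (!res.ok) throw new Error(`${action} ${model} failed: HTTP ${res.status}`);
}
```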

Throughput Chart

The throughput panel shows a rolling history of inference speed measured in tokens per second. Data arrives via the SSE stream from /admin/events and is stored in a client-side circular buffer of 300 entries.

The chart renders as a line graph showing throughput over time. Spikes indicate active inference requests; flat lines indicate idle periods. This is useful for:

  • Verifying that inference is running at expected speeds
  • Identifying throughput degradation under concurrent load
  • Comparing performance between different models or quantizations

SSE Event Stream
GET /admin/events
Accept: text/event-stream

data: {"type":"throughput","tokens_per_sec":42.3,"model":"qwen3-72b"}
data: {"type":"vram","used_gb":87.2,"total_gb":192.0}
data: {"type":"model_status","model":"qwen3-72b","state":"ready"}
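The 300-entry client-side buffer described above can be sketched as a small ring buffer fed by the SSE stream. The event shape follows the sample payloads shown; the wiring via `EventSource` (a standard browser API) is shown as comments:

```typescript
// Rolling history of throughput samples -- oldest entries are evicted
// once capacity (300, matching the chart's history) is exceeded.
class RingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity: number) {}

  push(item: T): void {
    this.buf.push(item);
    if (this.buf.length > this.capacity) this.buf.shift();
  }

  values(): readonly T[] {
    return this.buf;
  }
}

const history = new RingBuffer<number>(300);

// Wiring to the SSE stream (browser-only, shown for context):
// const es = new EventSource("/admin/events");
// es.onmessage = (e) => {
//   const ev = JSON.parse(e.data);
//   if (ev.type === "throughput") history.push(ev.tokens_per_sec);
// };
```

`Array.prototype.shift` is O(n), which is negligible at 300 entries; a fixed-size array with a write index would avoid it if the buffer were much larger.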

Server Status

Status badges at the top of the dashboard indicate the health of each component in the stack:

  • LMX Server -- connected or disconnected, with latency
  • Active Requests -- number of concurrent inference requests
  • Uptime -- how long the LMX server has been running

Each badge is color-coded: green for healthy, amber for degraded, and red for offline. The badges update every heartbeat cycle.

Health Polling

In addition to the SSE event stream, the dashboard runs a periodic heartbeat check against the LMX health endpoint. This independent polling mechanism ensures the dashboard can detect when the SSE connection silently drops (e.g., due to network interruption) and display an accurate offline state.

The heartbeat interval is configurable but defaults to 10 seconds. If three consecutive heartbeats fail, the dashboard transitions to a disconnected state and displays a reconnection indicator.
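The failure-counting logic is small: a success resets the counter, and the third consecutive failure (the default threshold above) flips the dashboard to disconnected. A minimal sketch:

```typescript
// Heartbeat state tracker: three consecutive failed health checks
// transition the dashboard to a disconnected state.
class HeartbeatMonitor {
  private failures = 0;
  connected = true;

  constructor(private threshold = 3) {}

  // Call once per heartbeat cycle with the health-check result.
  record(ok: boolean): void {
    if (ok) {
      this.failures = 0;
      this.connected = true;
    } else if (++this.failures >= this.threshold) {
      this.connected = false;
    }
  }
}
```

In the dashboard this would be driven by a 10-second timer polling the LMX health endpoint; any successful check immediately restores the connected state.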

Automatic reconnection
When the SSE connection drops, the dashboard automatically attempts to reconnect with exponential backoff. Once the LMX server becomes reachable again, the dashboard restores live updates without requiring a page refresh.
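The backoff schedule can be expressed as a one-line delay function. The 1-second base and 30-second cap below are illustrative assumptions, not documented defaults:

```typescript
// Reconnection delay with exponential backoff and a cap.
// Base delay and maximum are assumed values for illustration.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// attempt 0 -> 1000ms, 1 -> 2000ms, 2 -> 4000ms, ... capped at 30000ms
```

Capping the delay keeps recovery prompt after long outages, while the exponential growth avoids hammering an unreachable server.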