GPU, host, and workload metrics collected by the Standalone Agent and how to use them
The Standalone Agent is in beta. Metric names and labels may change.
The Standalone Agent collects GPU metrics (via DCGM Exporter), host metrics (via psutil), per-process GPU metrics (via NVML), and application-level service metrics (via the built-in OTLP receiver). All metrics are viewable on the Dashboard > Services page at app.usechamber.io/dashboard?tab=services.
A composite metric reflecting actual GPU compute efficiency.
Standard GPU utilization (chamber_gpu_utilization_percent) can be misleading — a single-threaded kernel running on 1 of 132 streaming multiprocessors (SMs) still reports 100% utilization. GPU Usage provides a more accurate picture.

How it's calculated: GPU Usage is a composite of three DCGM profiling metrics:

| Component | DCGM field | Meaning |
|---|---|---|
| SM Active | DCGM_FI_PROF_SM_ACTIVE | Fraction of SMs that are actively executing a warp (0.0–1.0) |
| Tensor Active | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Fraction of time the tensor cores are active (0.0–1.0) |
| DRAM Active | DCGM_FI_PROF_DRAM_ACTIVE | Fraction of time GPU memory (HBM) is being accessed (0.0–1.0) |
Power gate: If the GPU's power draw is below 10% of its power limit, GPU Usage is reported as 0 regardless of the profiling metrics. This filters out noise from idle GPUs.

Why it matters: GPU Usage tells you whether your workload is actually keeping the GPU busy. A training job might show 100% standard utilization but only 30% GPU Usage, meaning most SMs are idle and you could optimize kernel launches, batch size, or parallelism.
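The power gate and composite described above can be sketched as follows. This is an illustration only: the 10% power gate matches the behavior documented here, but the way the three components are combined (taking the maximum) is an assumption, not the agent's documented formula.

```python
# Hypothetical sketch of a power-gated "GPU Usage" composite.
# ASSUMPTION: the three profiling components are combined with max();
# the agent's actual weighting is not specified in this document.

def gpu_usage(sm_active: float, tensor_active: float, dram_active: float,
              power_draw_w: float, power_limit_w: float) -> float:
    """Return a 0.0-1.0 usage score, gated to 0 for near-idle GPUs."""
    if power_draw_w < 0.10 * power_limit_w:
        return 0.0  # power gate: below 10% of the limit, report idle
    return max(sm_active, tensor_active, dram_active)

# An idle GPU reporting residual profiling noise is gated to 0:
print(gpu_usage(0.05, 0.0, 0.02, power_draw_w=45.0, power_limit_w=700.0))   # 0.0
# A busy GPU reports its dominant activity component:
print(gpu_usage(0.30, 0.62, 0.25, power_draw_w=420.0, power_limit_w=700.0)) # 0.62
```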
In addition to infrastructure metrics, the Standalone Agent tracks services — applications running on your GPU hosts that are identified by the service_name label. This is what powers the Services tab on the dashboard, giving you a per-service view of resource consumption and application health.
The agent associates a service_name with metrics through three mechanisms, in priority order:
OTEL_SERVICE_NAME environment variable — If your GPU process has OTEL_SERVICE_NAME set, the agent reads it directly and uses it to tag all GPU and per-process metrics for that workload. This is the most reliable method. For Docker containers where the process environment may not be directly readable, the agent automatically falls back to querying the Docker Engine API via /var/run/docker.sock to read container environment variables — no configuration needed beyond ensuring the agent user has Docker socket access (the installer handles this by default).
OTLP service.name resource attribute — If your application exports OpenTelemetry metrics to the agent’s OTLP receiver (port 4317), the service.name resource attribute is extracted and attached to every metric in that export. When exactly one service is actively sending OTLP metrics, the agent also uses it to label GPU infrastructure metrics for the running workload.
Workload discovery — The agent classifies GPU processes by inspecting command lines and environment variables to identify ML frameworks (PyTorch, TensorFlow, vLLM, etc.) and launchers (torchrun, deepspeed, accelerate). This provides job_name and workload_type labels even without OTEL_SERVICE_NAME.
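The three mechanisms form a priority chain, which can be sketched roughly as below. The function name and argument shapes are hypothetical, chosen only to illustrate the fallback order described above; they are not the agent's actual internals.

```python
# Hypothetical sketch of the service_name resolution order described
# above. Names and data shapes are illustrative, not the agent's API.

def resolve_service_name(proc_env: dict, otlp_services: set, discovered_job):
    """Pick a service_name for a GPU process, highest-priority source first."""
    # 1. OTEL_SERVICE_NAME from the process (or container) environment
    if proc_env.get("OTEL_SERVICE_NAME"):
        return proc_env["OTEL_SERVICE_NAME"]
    # 2. service.name from OTLP exports -- only unambiguous when exactly
    #    one service is actively sending
    if len(otlp_services) == 1:
        return next(iter(otlp_services))
    # 3. Workload discovery (framework/launcher classification) as fallback
    return discovered_job

print(resolve_service_name({"OTEL_SERVICE_NAME": "llama2-finetune"}, set(), None))
# llama2-finetune
print(resolve_service_name({}, {"vllm-serving"}, "torchrun-job"))
# vllm-serving
```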
```
# Option 1: Set on your training process
OTEL_SERVICE_NAME="llama2-finetune" torchrun --nproc_per_node=4 train.py

# Option 2: Set in your OTel SDK resource (see OTLP Integration page)
Resource.create({"service.name": "llama2-finetune"})
```
Any metrics your application exports over OpenTelemetry are forwarded to the Chamber dashboard with the service_name label. These appear alongside GPU and host metrics on the Services tab, letting you correlate application behavior with infrastructure state.

Common examples of application metrics you can send:
| Metric | What it tells you |
|---|---|
| training_loss | Whether the model is converging — a sudden spike may indicate a data issue or corrupted checkpoint |
| training_throughput_samples_total | Samples processed over time — a drop signals a bottleneck (data loading, GPU contention, network) |
| inference_request_duration_seconds | End-to-end latency per request — rising latency under stable load may point to memory pressure or thermal throttling |
| inference_requests_total | Request volume — correlate with GPU utilization to understand per-request cost |
| model_load_time_seconds | Time to load model weights — useful for tracking cold-start performance across deployments |
| batch_queue_depth | Pending items in your processing queue — growing depth means your GPUs can't keep up with inbound work |
These are examples — you can send any metric your application produces. See OTLP Integration for setup instructions.
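As an example of the kind of correlation the table suggests (per-request GPU cost from inference_requests_total and GPU utilization), here is a rough standard-library sketch. The sampling window and all numbers are invented for illustration.

```python
# Estimate GPU-seconds consumed per inference request over a sampling
# window, by combining a cumulative request counter with average GPU
# utilization. All numbers below are invented for illustration.

def gpu_seconds_per_request(requests_start: int, requests_end: int,
                            avg_gpu_util: float, window_s: float) -> float:
    delta = requests_end - requests_start  # counter delta over the window
    if delta <= 0:
        return 0.0
    # utilization (0-1) * wall-clock seconds = GPU-seconds of busy time
    return (avg_gpu_util * window_s) / delta

# 600 requests in a 60 s window at 80% average utilization:
print(gpu_seconds_per_request(10_000, 10_600, 0.80, 60.0))  # 0.08
```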
On the Services tab you can see:

- GPU utilization and GPU Usage attributed to each service — see which services are using their GPUs efficiently and which are underutilizing them
- GPU memory per service — identify which service is consuming the most VRAM and whether it's at risk of OOM
- Per-process breakdowns — drill into individual worker processes within a service (e.g., each rank in a distributed training job)
- Application metrics — any custom OTLP metrics your service exports, displayed alongside the infrastructure data
- Host CPU and memory — spot cases where a service is bottlenecked on CPU or host memory rather than GPU
This per-service view makes it straightforward to answer questions like: “Is my training job actually using the GPUs I allocated?” or “Why did inference latency spike at 2 PM?” — without having to cross-reference multiple monitoring systems.
For the best experience on the Services dashboard, set OTEL_SERVICE_NAME on every GPU process. This ensures GPU metrics are correctly attributed even when multiple services share a host.
The GPU Usage metric and profiling metrics (gpu_sm_active, gpu_tensor_active, gpu_dram_active) require DCGM Exporter to expose profiling-level fields. The installer sets this up automatically, but if you are running DCGM Exporter manually or troubleshooting missing metrics, verify the following:
```
# Check if DCGM Exporter is serving metrics
curl -s http://localhost:9400/metrics | head -20
```
You should see Prometheus-formatted metrics including lines starting with DCGM_FI_. If this returns nothing, DCGM Exporter is not running or not reachable.
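To go one step further and check specifically for the profiling-level fields that GPU Usage needs (not just any DCGM output), you can filter the scraped text. A minimal standard-library sketch; the sample payload below is fabricated:

```python
# Check whether a DCGM Exporter scrape includes the profiling fields
# the GPU Usage composite needs. The sample payload is fabricated.

REQUIRED = ("DCGM_FI_PROF_SM_ACTIVE",
            "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
            "DCGM_FI_PROF_DRAM_ACTIVE")

def missing_profiling_fields(metrics_text: str) -> list:
    """Return the required DCGM profiling fields absent from a scrape."""
    present = {line.split("{")[0].split(" ")[0]
               for line in metrics_text.splitlines()
               if line.startswith("DCGM_FI_")}
    return [f for f in REQUIRED if f not in present]

sample = """DCGM_FI_DEV_GPU_UTIL{gpu="0"} 97
DCGM_FI_PROF_SM_ACTIVE{gpu="0"} 0.31
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0"} 0.22
"""
print(missing_profiling_fields(sample))  # ['DCGM_FI_PROF_PIPE_TENSOR_ACTIVE']
```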
For the profiling fields to be exposed, you also need:

- A GPU that supports profiling (Volta, Turing, Ampere, Hopper, Blackwell — i.e., V100, T4, A100, H100, B200, etc.)
- No other profiling tool (e.g., Nsight Systems) actively holding a profiling session
If profiling metrics are unavailable, the core GPU metrics (utilization, memory, temperature, power) still collect normally. Only the GPU Usage composite metric will be missing.