The Standalone Agent is in beta. Metric names and labels may change.
The Standalone Agent collects GPU metrics (via DCGM Exporter), host metrics (via psutil), per-process GPU metrics (via NVML), and application-level service metrics (via the built-in OTLP receiver). All metrics are viewable on the Dashboard > Services page at app.usechamber.io/dashboard?tab=services.

GPU Metrics

These metrics are scraped from DCGM Exporter every 30 seconds.

Core GPU Metrics

| Metric | Unit | Description | What to look for |
| --- | --- | --- | --- |
| chamber_gpu_utilization_percent | % | Percentage of time the GPU had at least one kernel running. | Sustained 0% may indicate idle GPUs wasting capacity. Sustained 100% is normal for training workloads. |
| chamber_gpu_memory_used_bytes | bytes | GPU memory (VRAM) in use. | Approaching total memory indicates risk of OOM errors. |
| chamber_gpu_memory_utilization_percent | % | GPU memory in use as a percentage of total. | Above 90% for extended periods suggests memory pressure. |
| chamber_gpu_temperature_celsius | °C | GPU die temperature. | Consistently above 80°C may indicate thermal throttling. Check airflow and cooling. |
| chamber_gpu_power_usage_watts | watts | Current power draw. | At or near the power limit can cause power throttling, reducing clock speeds and throughput. |

GPU Usage (Composite Metric)

| Metric | Unit | Description |
| --- | --- | --- |
| chamber_gpu_usage | % | A composite metric reflecting actual GPU compute efficiency. |
Standard GPU utilization (chamber_gpu_utilization_percent) can be misleading — a single-threaded kernel running on 1 of 132 streaming multiprocessors (SMs) still reports 100% utilization. GPU Usage provides a more accurate picture.

How it’s calculated:

GPU Usage = max(SM Active, Tensor Active, DRAM Active) × 100
| Input | DCGM Source | What it measures |
| --- | --- | --- |
| SM Active | DCGM_FI_PROF_SM_ACTIVE | Fraction of SMs that are actively executing a warp (0.0–1.0) |
| Tensor Active | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Fraction of time the tensor cores are active (0.0–1.0) |
| DRAM Active | DCGM_FI_PROF_DRAM_ACTIVE | Fraction of time GPU memory (HBM) is being accessed (0.0–1.0) |
Power gate: If the GPU’s power draw is below 10% of its power limit, GPU Usage is reported as 0 regardless of the profiling metrics. This filters out noise from idle GPUs.

Why it matters: GPU Usage tells you whether your workload is actually keeping the GPU busy. A training job might show 100% standard utilization but only 30% GPU Usage, meaning most SMs are idle and you could optimize kernel launches, batch size, or parallelism.
GPU Usage requires DCGM profiling metrics to be enabled. See Ensuring DCGM Profiling Metrics below.
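The composite calculation can be sketched in a few lines of Python. This is an illustration of the formula and power gate described above, not the agent's actual implementation; the function name and argument values are ours:

```python
def gpu_usage(sm_active: float, tensor_active: float, dram_active: float,
              power_usage_watts: float, power_limit_watts: float) -> float:
    """Composite GPU Usage (%) from DCGM profiling ratios (each 0.0-1.0)."""
    # Power gate: below 10% of the power limit, report 0 to filter idle noise.
    if power_usage_watts < 0.10 * power_limit_watts:
        return 0.0
    # Take the busiest of the three pipelines and express it as a percentage.
    return max(sm_active, tensor_active, dram_active) * 100.0

# 25% of SMs busy dominates tensor and DRAM activity here:
print(gpu_usage(0.25, 0.05, 0.20, 250.0, 400.0))  # 25.0
# Same ratios but the GPU is drawing almost no power -> gated to 0:
print(gpu_usage(0.25, 0.05, 0.20, 30.0, 400.0))   # 0.0
```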

Profiling Metrics

These are the raw profiling metrics behind the GPU Usage composite. They are useful for advanced debugging.
| Metric | Unit | Description |
| --- | --- | --- |
| gpu_sm_active | ratio (0.0–1.0) | Streaming Multiprocessor activity — fraction of SMs executing at least one warp |
| gpu_tensor_active | ratio (0.0–1.0) | Tensor core pipeline activity — high during mixed-precision training (FP16/BF16) |
| gpu_dram_active | ratio (0.0–1.0) | GPU memory (HBM) bus activity — high during data-loading-bound workloads |

Host Metrics

Collected via psutil every 30 seconds.
| Metric | Unit | Description | What to look for |
| --- | --- | --- | --- |
| chamber_node_cpu_usage_percent | % | Host CPU utilization | High CPU can bottleneck data loading pipelines, starving GPUs of data |
| chamber_node_cpu_usage_cores | cores | Number of CPU cores in use (float) | Compare against total cores to gauge headroom |
| chamber_node_cpu_capacity_cores | cores | Total logical CPU count | |
| chamber_node_memory_usage_bytes | bytes | Host memory (RAM) in use | Near capacity can cause OOM kills by the Linux kernel |
| chamber_node_memory_capacity_bytes | bytes | Total host memory | |
| chamber_node_memory_usage_percent | % | Host memory utilization | Above 90% sustained warrants investigation |
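The cores metric is derived from the percentage and the capacity. A small sketch of that relationship (function name is ours; the agent's collection code is not shown here):

```python
import os

def cpu_usage_cores(cpu_percent: float, capacity_cores: int) -> float:
    """chamber_node_cpu_usage_cores: logical cores in use, as a float."""
    return cpu_percent / 100.0 * capacity_cores

# chamber_node_cpu_capacity_cores corresponds to the logical CPU count:
capacity = os.cpu_count() or 1

# 62.5% CPU on a 16-core host means ~10 cores' worth of work:
print(cpu_usage_cores(62.5, 16))  # 10.0
```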

Per-Process GPU Metrics

Collected via NVML for each process using a GPU. These are attributed to discovered workloads in the dashboard.
| Metric | Unit | Description |
| --- | --- | --- |
| chamber_gpu_process_memory_bytes | bytes | GPU memory allocated by a specific process |
| chamber_gpu_process_memory_utilization_percent | % | Process GPU memory as a percentage of total GPU memory |
| chamber_gpu_process_sm_utilization_percent | % | Process-level SM utilization (requires NVIDIA driver 450+ and Volta or newer GPUs; the agent skips this metric gracefully on unsupported hardware) |
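You can reproduce the per-process memory numbers yourself with NVML. The sketch below assumes the nvidia-ml-py package (`pip install nvidia-ml-py`) on a host with NVIDIA drivers; on machines without a GPU it falls back to example numbers:

```python
def process_memory_utilization_percent(proc_bytes: int, total_bytes: int) -> float:
    # chamber_gpu_process_memory_utilization_percent = process VRAM / total VRAM
    return 100.0 * proc_bytes / total_bytes

try:
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    # One entry per compute process currently using GPU 0:
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(p.pid, process_memory_utilization_percent(p.usedGpuMemory, total))
    pynvml.nvmlShutdown()
except Exception:
    # No GPU / NVML available: demo with 20 GiB used out of 80 GiB total.
    print(process_memory_utilization_percent(20 << 30, 80 << 30))  # 25.0
```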

Service-Level Metrics

In addition to infrastructure metrics, the Standalone Agent tracks services — applications running on your GPU hosts that are identified by the service_name label. This is what powers the Services tab on the dashboard, giving you a per-service view of resource consumption and application health.

How Services Are Identified

The agent associates a service_name with metrics through three mechanisms, in priority order:
  1. OTEL_SERVICE_NAME environment variable — If your GPU process has OTEL_SERVICE_NAME set, the agent reads it directly and uses it to tag all GPU and per-process metrics for that workload. This is the most reliable method. For Docker containers where the process environment may not be directly readable, the agent automatically falls back to querying the Docker Engine API via /var/run/docker.sock to read container environment variables — no configuration needed beyond ensuring the agent user has Docker socket access (the installer handles this by default).
  2. OTLP service.name resource attribute — If your application exports OpenTelemetry metrics to the agent’s OTLP receiver (port 4317), the service.name resource attribute is extracted and attached to every metric in that export. When exactly one service is actively sending OTLP metrics, the agent also uses it to label GPU infrastructure metrics for the running workload.
  3. Workload discovery — The agent classifies GPU processes by inspecting command lines and environment variables to identify ML frameworks (PyTorch, TensorFlow, vLLM, etc.) and launchers (torchrun, deepspeed, accelerate). This provides job_name and workload_type labels even without OTEL_SERVICE_NAME.
# Option 1: Set on your training process
OTEL_SERVICE_NAME="llama2-finetune" torchrun --nproc_per_node=4 train.py

# Option 2: Set in your OTel SDK resource (see OTLP Integration page)
Resource.create({"service.name": "llama2-finetune"})
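The priority order can be sketched as a simple first-match resolver. This is an illustration of the described behavior, not the agent's internal code; the function and argument names are ours:

```python
def resolve_service_name(otel_env, otlp_resource, discovered_job_name):
    """Return the first available name: env var > OTLP attribute > discovery."""
    for candidate in (otel_env, otlp_resource, discovered_job_name):
        if candidate:
            return candidate
    return None  # no service_name label; only job_name/workload_type apply

# OTEL_SERVICE_NAME is unset, so the OTLP resource attribute wins:
print(resolve_service_name(None, "llama2-finetune", "pytorch-job"))  # llama2-finetune
```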

Application Metrics via OTLP

Any metrics your application exports over OpenTelemetry are forwarded to the Chamber dashboard with the service_name label. These appear alongside GPU and host metrics on the Services tab, letting you correlate application behavior with infrastructure state. Common examples of application metrics you can send:
| Metric | What it tells you |
| --- | --- |
| training_loss | Whether the model is converging — a sudden spike may indicate a data issue or corrupted checkpoint |
| training_throughput_samples_total | Samples processed over time — a drop signals a bottleneck (data loading, GPU contention, network) |
| inference_request_duration_seconds | End-to-end latency per request — rising latency under stable load may point to memory pressure or thermal throttling |
| inference_requests_total | Request volume — correlate with GPU utilization to understand per-request cost |
| model_load_time_seconds | Time to load model weights — useful for tracking cold-start performance across deployments |
| batch_queue_depth | Pending items in your processing queue — growing depth means your GPUs can’t keep up with inbound work |
These are examples — you can send any metric your application produces. See OTLP Integration for setup instructions.

What You See on the Services Dashboard

The Services tab (app.usechamber.io/dashboard?tab=services) groups metrics by service_name, giving you a unified view per service:
  • GPU utilization and GPU Usage attributed to each service — see which services are efficiently using their GPUs and which are underutilizing
  • GPU memory per service — identify which service is consuming the most VRAM and whether it’s at risk of OOM
  • Per-process breakdowns — drill into individual worker processes within a service (e.g., each rank in a distributed training job)
  • Application metrics — any custom OTLP metrics your service exports, displayed alongside the infrastructure data
  • Host CPU and memory — spot cases where a service is bottlenecked on CPU or host memory rather than GPU
This per-service view makes it straightforward to answer questions like: “Is my training job actually using the GPUs I allocated?” or “Why did inference latency spike at 2 PM?” — without having to cross-reference multiple monitoring systems.
For the best experience on the Services dashboard, set OTEL_SERVICE_NAME on every GPU process. This ensures GPU metrics are correctly attributed even when multiple services share a host.

Workload Discovery

The agent automatically discovers GPU-using processes and classifies them by framework:
  • PyTorch (including torchrun, torch.distributed)
  • DeepSpeed
  • vLLM
  • Other GPU processes
Discovered workloads appear in the dashboard with:
  • Process name and command line
  • Framework type
  • GPU memory usage per process
  • Lifecycle state (running, terminated)
  • Duration
For distributed training (e.g., torchrun --nproc_per_node=4), the agent groups worker processes under a single workload entry.
You can enrich workload metadata with environment variables. See Workload Labels.
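The grouping behavior for distributed jobs can be illustrated with a small pure function: workers whose parent is a known launcher collapse into one workload keyed by the launcher's PID. This is a hypothetical sketch of the described behavior, not the agent's discovery code:

```python
LAUNCHERS = {"torchrun", "deepspeed", "accelerate"}

def group_workers(processes):
    """processes: list of (pid, ppid, name) tuples. Returns {workload_key: [pids]}."""
    by_pid = {pid: (ppid, name) for pid, ppid, name in processes}
    groups = {}
    for pid, ppid, name in processes:
        if name in LAUNCHERS:
            continue  # the launcher itself is the group anchor, not a worker
        parent = by_pid.get(ppid)
        if parent and parent[1] in LAUNCHERS:
            key = f"launcher:{ppid}"       # grouped under the launcher
        else:
            key = f"process:{pid}"         # standalone GPU process
        groups.setdefault(key, []).append(pid)
    return groups

# torchrun (pid 100) spawned four workers -> one workload entry:
procs = [(100, 1, "torchrun")] + [(100 + i, 100, "python") for i in range(1, 5)]
print(group_workers(procs))  # {'launcher:100': [101, 102, 103, 104]}
```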

Ensuring DCGM Profiling Metrics Are Collected

The GPU Usage metric and profiling metrics (gpu_sm_active, gpu_tensor_active, gpu_dram_active) require DCGM Exporter to expose profiling-level fields. The installer sets this up automatically, but if you are running DCGM Exporter manually or troubleshooting missing metrics, verify the following:

1. DCGM Exporter Must Be Running

# Check if DCGM Exporter is serving metrics
curl -s http://localhost:9400/metrics | head -20
You should see Prometheus-formatted metrics including lines starting with DCGM_FI_. If this returns nothing, DCGM Exporter is not running or not reachable.

2. Profiling Fields Must Be Enabled

DCGM Exporter is configured by a CSV file that lists which fields to export. The profiling fields required are:
| DCGM Field ID | Field Name | Required For |
| --- | --- | --- |
| 1002 | DCGM_FI_PROF_SM_ACTIVE | GPU Usage (SM Active) |
| 1004 | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | GPU Usage (Tensor Active) |
| 1005 | DCGM_FI_PROF_DRAM_ACTIVE | GPU Usage (DRAM Active) |
Verify these fields are in the DCGM Exporter counters file (commonly /etc/dcgm-exporter/default-counters.csv or passed via the -f flag):
# Check if profiling fields are configured
grep -E "DCGM_FI_PROF_SM_ACTIVE|DCGM_FI_PROF_PIPE_TENSOR_ACTIVE|DCGM_FI_PROF_DRAM_ACTIVE" \
  /etc/dcgm-exporter/default-counters.csv
If these entries are missing, add them:
DCGM_FI_PROF_SM_ACTIVE, gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge
DCGM_FI_PROF_DRAM_ACTIVE, gauge
Then restart DCGM Exporter.

3. Profiling Must Be Supported by Your GPU

DCGM profiling metrics require:
  • NVIDIA driver 450+ (most modern drivers)
  • A GPU that supports profiling (Volta, Turing, Ampere, Hopper, Blackwell — i.e., V100, T4, A100, H100, B200, etc.)
  • No other profiling tool (e.g., Nsight Systems) actively holding a profiling session
If profiling metrics are unavailable, the core GPU metrics (utilization, memory, temperature, power) still collect normally. Only the GPU Usage composite metric will be missing.

4. Verify Profiling Metrics Are Flowing

curl -s http://localhost:9400/metrics | grep DCGM_FI_PROF
You should see lines like:
DCGM_FI_PROF_SM_ACTIVE{gpu="0",...} 0.45
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",...} 0.38
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",...} 0.22
If these lines are absent or all values are 0 while a workload is running, check the troubleshooting steps above.

Metric Labels

All metrics include these standard labels:
| Label | Description |
| --- | --- |
| cluster_id | Identifier for this host (defaults to hostname, configurable) |
| node_name | Hostname of the machine |
| organization_id | Your Chamber organization ID |
| managed_by | Always external for standalone agents |
GPU metrics additionally include:
| Label | Description |
| --- | --- |
| gpu_index | GPU index (0, 1, 2, …) |
| gpu_type | GPU model name (e.g., NVIDIA A100-SXM4-80GB) |
Process and service metrics additionally include:
| Label | Description |
| --- | --- |
| job_id | Unique workload identifier |
| job_name | Workload name (auto-detected or from CHAMBER_JOB_NAME) |
| pod_name | Process identifier (process:pid-<N>) |
| team_name | Team label (from CHAMBER_TEAM_NAME or default) |
| service_name | Service identifier (from OTEL_SERVICE_NAME env var or OTLP service.name resource attribute). Used to group metrics on the Services dashboard. |
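Put together, a fully labeled GPU metric might look like the following (label values are illustrative):

```
chamber_gpu_utilization_percent{cluster_id="gpu-host-1",node_name="gpu-host-1",organization_id="org-abc123",managed_by="external",gpu_index="0",gpu_type="NVIDIA A100-SXM4-80GB"} 87.5
```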

What to Look For on the Dashboard

The Services tab (app.usechamber.io/dashboard?tab=services) gives you a per-host and per-GPU breakdown. Here are common patterns to watch for:
| Pattern | What it means | Action |
| --- | --- | --- |
| High utilization but low GPU Usage | Kernels are running but underutilizing the GPU’s compute capacity | Profile with Nsight, increase batch size, or check for serialization bottlenecks |
| GPU memory near 100% | Risk of OOM errors that crash training | Reduce batch size, enable gradient checkpointing, or use a larger GPU |
| High CPU, low GPU utilization | Data loading is the bottleneck — GPUs are starved | Add more DataLoader workers, use faster storage, or pre-process data |
| Temperature consistently above 80°C | Thermal throttling may be reducing clock speeds | Check cooling, airflow, or ambient temperature |
| Power at limit | Power throttling is reducing clock speeds | Expected under heavy load, but sustained power capping may indicate undersized infrastructure |
| GPU idle (0% utilization, 0 GPU Usage) | GPU is allocated but not in use | Check for crashed jobs, queued-but-not-started workloads, or misconfigured resource requests |

Next Steps

OTLP Integration

Send your own application metrics to Chamber

Configuration

Tune collection intervals and other settings