The Standalone Agent is in beta. Metric names and labels may change.
The Standalone Agent collects GPU metrics (via DCGM Exporter), host metrics (via psutil), per-process GPU metrics (via NVML), and application-level service metrics (via the built-in OTLP receiver). All metrics are viewable on the Dashboard > Services page at app.usechamber.io/dashboard?tab=services.

GPU Metrics

These metrics are scraped from DCGM Exporter every 30 seconds.

Core GPU Metrics

| Metric | Unit | Description | What to look for |
| --- | --- | --- | --- |
| chamber_gpu_utilization_percent | % | Percentage of time the GPU had at least one kernel running. | Sustained 0% may indicate idle GPUs wasting capacity. Sustained 100% is normal for training workloads. |
| chamber_gpu_memory_used_bytes | bytes | GPU memory (VRAM) in use. | Approaching total memory indicates risk of OOM errors. |
| chamber_gpu_memory_utilization_percent | % | GPU memory in use as a percentage of total. | Above 90% for extended periods suggests memory pressure. |
| chamber_gpu_temperature_celsius | °C | GPU die temperature. | Consistently above 80°C may indicate thermal throttling. Check airflow and cooling. |
| chamber_gpu_power_usage_watts | watts | Current power draw. | At or near the power limit can cause power throttling, reducing clock speeds and throughput. |

GPU Usage (Composite Metric)

| Metric | Unit | Description |
| --- | --- | --- |
| chamber_gpu_usage | % | A composite metric reflecting actual GPU compute efficiency. |
Standard GPU utilization (chamber_gpu_utilization_percent) can be misleading — a single-threaded kernel running on 1 of 132 streaming multiprocessors (SMs) still reports 100% utilization. GPU Usage provides a more accurate picture.

How it’s calculated:

GPU Usage = max(SM Active, Tensor Active, DRAM Active) × 100
| Input | DCGM Source | What it measures |
| --- | --- | --- |
| SM Active | DCGM_FI_PROF_SM_ACTIVE | Fraction of SMs that are actively executing a warp (0.0–1.0) |
| Tensor Active | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Fraction of time the tensor cores are active (0.0–1.0) |
| DRAM Active | DCGM_FI_PROF_DRAM_ACTIVE | Fraction of time GPU memory (HBM) is being accessed (0.0–1.0) |
Power gate: If the GPU’s power draw is below 10% of its power limit, GPU Usage is reported as 0 regardless of the profiling metrics. This filters out noise from idle GPUs.

Why it matters: GPU Usage tells you whether your workload is actually keeping the GPU busy. A training job might show 100% standard utilization but only 30% GPU Usage, meaning most SMs are idle and you could optimize kernel launches, batch size, or parallelism.
GPU Usage requires DCGM profiling metrics to be enabled. See Ensuring DCGM Profiling Metrics below.
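The composite calculation can be sketched in a few lines of Python. This is an illustration of the formula and power gate described above, not the agent's actual implementation; the function name and argument values are ours:

```python
def gpu_usage(sm_active: float, tensor_active: float, dram_active: float,
              power_usage_watts: float, power_limit_watts: float) -> float:
    """Composite GPU Usage (%) from DCGM profiling ratios (each 0.0-1.0)."""
    # Power gate: below 10% of the power limit, report 0 to filter idle noise.
    if power_usage_watts < 0.10 * power_limit_watts:
        return 0.0
    # Take the busiest of the three pipelines and express it as a percentage.
    return max(sm_active, tensor_active, dram_active) * 100.0

# 25% of SMs busy dominates tensor and DRAM activity here:
print(gpu_usage(0.25, 0.05, 0.20, 250.0, 400.0))  # 25.0
# Same ratios but the GPU is drawing almost no power -> gated to 0:
print(gpu_usage(0.25, 0.05, 0.20, 30.0, 400.0))   # 0.0
```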

Profiling Metrics

These are the raw profiling metrics behind the GPU Usage composite. They are useful for advanced debugging.
| Metric | Unit | Description |
| --- | --- | --- |
| gpu_sm_active | ratio (0.0–1.0) | Streaming Multiprocessor activity — fraction of SMs executing at least one warp |
| gpu_tensor_active | ratio (0.0–1.0) | Tensor core pipeline activity — high during mixed-precision training (FP16/BF16) |
| gpu_dram_active | ratio (0.0–1.0) | GPU memory (HBM) bus activity — high during data-loading-bound workloads |

Host Metrics

Collected via psutil every 30 seconds.
| Metric | Unit | Description | What to look for |
| --- | --- | --- | --- |
| chamber_node_cpu_usage_percent | % | Host CPU utilization | High CPU can bottleneck data loading pipelines, starving GPUs of data |
| chamber_node_cpu_usage_cores | cores | Number of CPU cores in use (float) | Compare against total cores to gauge headroom |
| chamber_node_cpu_capacity_cores | cores | Total logical CPU count | |
| chamber_node_memory_usage_bytes | bytes | Host memory (RAM) in use | Near capacity can cause OOM kills by the Linux kernel |
| chamber_node_memory_capacity_bytes | bytes | Total host memory | |
| chamber_node_memory_usage_percent | % | Host memory utilization | Above 90% sustained warrants investigation |
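The cores metric is derived from the percentage and the capacity. A small sketch of that relationship (function name is ours; the agent's collection code is not shown here):

```python
import os

def cpu_usage_cores(cpu_percent: float, capacity_cores: int) -> float:
    """chamber_node_cpu_usage_cores: logical cores in use, as a float."""
    return cpu_percent / 100.0 * capacity_cores

# chamber_node_cpu_capacity_cores corresponds to the logical CPU count:
capacity = os.cpu_count() or 1

# 62.5% CPU on a 16-core host means ~10 cores' worth of work:
print(cpu_usage_cores(62.5, 16))  # 10.0
```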

Per-Process GPU Metrics

Collected via NVML for each process using a GPU. These are attributed to discovered workloads in the dashboard.
| Metric | Unit | Description |
| --- | --- | --- |
| chamber_gpu_process_memory_bytes | bytes | GPU memory allocated by a specific process |
| chamber_gpu_process_memory_utilization_percent | % | Process GPU memory as a percentage of total GPU memory |
| chamber_gpu_process_sm_utilization_percent | % | Process-level SM utilization (requires NVIDIA driver 450+ and Volta or newer GPUs; the agent skips this metric gracefully on unsupported hardware) |
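You can reproduce the per-process memory numbers yourself with NVML. The sketch below assumes the nvidia-ml-py package (`pip install nvidia-ml-py`) on a host with NVIDIA drivers; on machines without a GPU it falls back to example numbers:

```python
def process_memory_utilization_percent(proc_bytes: int, total_bytes: int) -> float:
    # chamber_gpu_process_memory_utilization_percent = process VRAM / total VRAM
    return 100.0 * proc_bytes / total_bytes

try:
    import pynvml
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    # One entry per compute process currently using GPU 0:
    for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(p.pid, process_memory_utilization_percent(p.usedGpuMemory, total))
    pynvml.nvmlShutdown()
except Exception:
    # No GPU / NVML available: demo with 20 GiB used out of 80 GiB total.
    print(process_memory_utilization_percent(20 << 30, 80 << 30))  # 25.0
```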

Service-Level Metrics

In addition to infrastructure metrics, the Standalone Agent tracks services — applications running on your GPU hosts that are identified by the service_name label. This is what powers the Services tab on the dashboard, giving you a per-service view of resource consumption and application health.

How Services Are Identified

The agent associates a service_name with metrics through three mechanisms, in priority order:
  1. OTEL_SERVICE_NAME environment variable — If your GPU process has OTEL_SERVICE_NAME set, the agent reads it directly and uses it to tag all GPU and per-process metrics for that workload. This is the most reliable method. For Docker containers where the process environment may not be directly readable, the agent automatically falls back to querying the Docker Engine API via /var/run/docker.sock to read container environment variables — no configuration needed beyond ensuring the agent user has Docker socket access (the installer handles this by default).
  2. OTLP service.name resource attribute — If your application exports OpenTelemetry metrics to the agent’s OTLP receiver (port 4317), the service.name resource attribute is extracted and attached to every metric in that export. When exactly one service is actively sending OTLP metrics, the agent also uses it to label GPU infrastructure metrics for the running workload.
  3. Workload discovery — The agent classifies GPU processes by inspecting command lines and environment variables to identify ML frameworks (PyTorch, TensorFlow, vLLM, etc.) and launchers (torchrun, deepspeed, accelerate). This provides job_name and workload_type labels even without OTEL_SERVICE_NAME.
# Option 1: Set on your training process
OTEL_SERVICE_NAME="llama2-finetune" torchrun --nproc_per_node=4 train.py

# Option 2: Set in your OTel SDK resource (see OTLP Integration page)
Resource.create({"service.name": "llama2-finetune"})
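The priority order can be sketched as a simple first-match resolver. This is an illustration of the described behavior, not the agent's internal code; the function and argument names are ours:

```python
def resolve_service_name(otel_env, otlp_resource, discovered_job_name):
    """Return the first available name: env var > OTLP attribute > discovery."""
    for candidate in (otel_env, otlp_resource, discovered_job_name):
        if candidate:
            return candidate
    return None  # no service_name label; only job_name/workload_type apply

# OTEL_SERVICE_NAME is unset, so the OTLP resource attribute wins:
print(resolve_service_name(None, "llama2-finetune", "pytorch-job"))  # llama2-finetune
```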

Application Metrics via OTLP

Any metrics your application exports over OpenTelemetry are forwarded to the Chamber dashboard with the service_name label. These appear alongside GPU and host metrics on the Services tab, letting you correlate application behavior with infrastructure state. Common examples of application metrics you can send:
| Metric | What it tells you |
| --- | --- |
| training_loss | Whether the model is converging — a sudden spike may indicate a data issue or corrupted checkpoint |
| training_throughput_samples_total | Samples processed over time — a drop signals a bottleneck (data loading, GPU contention, network) |
| inference_request_duration_seconds | End-to-end latency per request — rising latency under stable load may point to memory pressure or thermal throttling |
| inference_requests_total | Request volume — correlate with GPU utilization to understand per-request cost |
| model_load_time_seconds | Time to load model weights — useful for tracking cold-start performance across deployments |
| batch_queue_depth | Pending items in your processing queue — growing depth means your GPUs can’t keep up with inbound work |
These are examples — you can send any metric your application produces. See OTLP Integration for setup instructions.

What You See on the Services Dashboard

The Services tab (app.usechamber.io/dashboard?tab=services) groups metrics by service_name, giving you a unified view per service:
  • GPU utilization and GPU Usage attributed to each service — see which services are efficiently using their GPUs and which are underutilizing
  • GPU memory per service — identify which service is consuming the most VRAM and whether it’s at risk of OOM
  • Per-process breakdowns — drill into individual worker processes within a service (e.g., each rank in a distributed training job)
  • Application metrics — any custom OTLP metrics your service exports, displayed alongside the infrastructure data
  • Host CPU and memory — spot cases where a service is bottlenecked on CPU or host memory rather than GPU
This per-service view makes it straightforward to answer questions like: “Is my training job actually using the GPUs I allocated?” or “Why did inference latency spike at 2 PM?” — without having to cross-reference multiple monitoring systems.
For the best experience on the Services dashboard, set OTEL_SERVICE_NAME on every GPU process. This ensures GPU metrics are correctly attributed even when multiple services share a host.

Workload Discovery

The agent automatically discovers GPU-using processes and classifies them by framework:
  • PyTorch (including torchrun, torch.distributed)
  • DeepSpeed
  • vLLM
  • Other GPU processes
Discovered workloads appear in the dashboard with:
  • Process name and command line
  • Framework type
  • GPU memory usage per process
  • Lifecycle state (running, terminated)
  • Duration
For distributed training (e.g., torchrun --nproc_per_node=4), the agent groups worker processes under a single workload entry.
You can enrich workload metadata with environment variables. See Workload Labels.
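The grouping behavior for distributed jobs can be illustrated with a small pure function: workers whose parent is a known launcher collapse into one workload keyed by the launcher's PID. This is a hypothetical sketch of the described behavior, not the agent's discovery code:

```python
LAUNCHERS = {"torchrun", "deepspeed", "accelerate"}

def group_workers(processes):
    """processes: list of (pid, ppid, name) tuples. Returns {workload_key: [pids]}."""
    by_pid = {pid: (ppid, name) for pid, ppid, name in processes}
    groups = {}
    for pid, ppid, name in processes:
        if name in LAUNCHERS:
            continue  # the launcher itself is the group anchor, not a worker
        parent = by_pid.get(ppid)
        if parent and parent[1] in LAUNCHERS:
            key = f"launcher:{ppid}"       # grouped under the launcher
        else:
            key = f"process:{pid}"         # standalone GPU process
        groups.setdefault(key, []).append(pid)
    return groups

# torchrun (pid 100) spawned four workers -> one workload entry:
procs = [(100, 1, "torchrun")] + [(100 + i, 100, "python") for i in range(1, 5)]
print(group_workers(procs))  # {'launcher:100': [101, 102, 103, 104]}
```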

Ensuring DCGM Profiling Metrics Are Collected

The GPU Usage metric and profiling metrics (gpu_sm_active, gpu_tensor_active, gpu_dram_active) require DCGM Exporter to expose profiling-level fields. The installer sets this up automatically, but if you are running DCGM Exporter manually or troubleshooting missing metrics, verify the following:

1. DCGM Exporter Must Be Running

# Check if DCGM Exporter is serving metrics
curl -s http://localhost:9400/metrics | head -20
You should see Prometheus-formatted metrics including lines starting with DCGM_FI_. If this returns nothing, DCGM Exporter is not running or not reachable.

2. Profiling Fields Must Be Enabled

DCGM Exporter is configured by a CSV file that lists which fields to export. The profiling fields required are:
| DCGM Field ID | Field Name | Required For |
| --- | --- | --- |
| 1002 | DCGM_FI_PROF_SM_ACTIVE | GPU Usage (SM Active) |
| 1004 | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | GPU Usage (Tensor Active) |
| 1005 | DCGM_FI_PROF_DRAM_ACTIVE | GPU Usage (DRAM Active) |
Verify these fields are in the DCGM Exporter counters file (commonly /etc/dcgm-exporter/default-counters.csv or passed via the -f flag):
# Check if profiling fields are configured
grep -E "DCGM_FI_PROF_SM_ACTIVE|DCGM_FI_PROF_PIPE_TENSOR_ACTIVE|DCGM_FI_PROF_DRAM_ACTIVE" \
  /etc/dcgm-exporter/default-counters.csv
If these entries are missing, add them:
DCGM_FI_PROF_SM_ACTIVE, gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge
DCGM_FI_PROF_DRAM_ACTIVE, gauge
Then restart DCGM Exporter.

3. Profiling Must Be Supported by Your GPU

DCGM profiling metrics require:
  • NVIDIA driver 450+ (most modern drivers)
  • A GPU that supports profiling (Volta, Turing, Ampere, Hopper, Blackwell — i.e., V100, T4, A100, H100, B200, etc.)
  • No other profiling tool (e.g., Nsight Systems) actively holding a profiling session
If profiling metrics are unavailable, the core GPU metrics (utilization, memory, temperature, power) still collect normally. Only the GPU Usage composite metric will be missing.

4. Verify Profiling Metrics Are Flowing

curl -s http://localhost:9400/metrics | grep DCGM_FI_PROF
You should see lines like:
DCGM_FI_PROF_SM_ACTIVE{gpu="0",...} 0.45
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",...} 0.38
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",...} 0.22
If these lines are absent or all values are 0 while a workload is running, check the troubleshooting steps above.

Metric Labels

All metrics include these standard labels:
| Label | Description |
| --- | --- |
| cluster_id | Identifier for this host (defaults to hostname, configurable) |
| node_name | Hostname of the machine |
| organization_id | Your Chamber organization ID |
| managed_by | Always external for standalone agents |
GPU metrics additionally include:
| Label | Description |
| --- | --- |
| gpu_index | GPU index (0, 1, 2, …) |
| gpu_type | GPU model name (e.g., NVIDIA A100-SXM4-80GB) |
Process and service metrics additionally include:
| Label | Description |
| --- | --- |
| job_id | Unique workload identifier |
| job_name | Workload name (auto-detected or from CHAMBER_JOB_NAME) |
| pod_name | Process identifier (process:pid-<N>) |
| team_name | Team label (from CHAMBER_TEAM_NAME or default) |
| service_name | Service identifier (from OTEL_SERVICE_NAME env var or OTLP service.name resource attribute). Used to group metrics on the Services dashboard. |
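Put together, a fully labeled GPU metric might look like the following (label values are illustrative):

```
chamber_gpu_utilization_percent{cluster_id="gpu-host-1",node_name="gpu-host-1",organization_id="org-abc123",managed_by="external",gpu_index="0",gpu_type="NVIDIA A100-SXM4-80GB"} 87.5
```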

What to Look For on the Dashboard

The Services tab (app.usechamber.io/dashboard?tab=services) gives you a per-host and per-GPU breakdown. Here are common patterns to watch for:
| Pattern | What it means | Action |
| --- | --- | --- |
| High utilization but low GPU Usage | Kernels are running but underutilizing the GPU’s compute capacity | Profile with Nsight, increase batch size, or check for serialization bottlenecks |
| GPU memory near 100% | Risk of OOM errors that crash training | Reduce batch size, enable gradient checkpointing, or use a larger GPU |
| High CPU, low GPU utilization | Data loading is the bottleneck — GPUs are starved | Add more DataLoader workers, use faster storage, or pre-process data |
| Temperature consistently above 80°C | Thermal throttling may be reducing clock speeds | Check cooling, airflow, or ambient temperature |
| Power at limit | Power throttling is reducing clock speeds | Expected under heavy load, but sustained power capping may indicate undersized infrastructure |
| GPU idle (0% utilization, 0 GPU Usage) | GPU is allocated but not in use | Check for crashed jobs, queued-but-not-started workloads, or misconfigured resource requests |

Next Steps

OTLP Integration

Send your own application metrics to Chamber

Configuration

Tune collection intervals and other settings