This guide covers common issues with the Chamber agent and how to resolve them.

Quick Checks

# Check agent pod status
kubectl get pods -n chamber-system -l app.kubernetes.io/name=chamber-agent

# View recent logs
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent --tail=50

# Check installed version
helm list -n chamber-system

Cluster Not Appearing in Dashboard

Symptoms: Agent pod is running but cluster doesn’t appear in Chamber
Verify the token is correct and not expired:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "token\|auth"
If you see authentication errors, generate a new token from Settings > Security > API Tokens > New Token in the Chamber dashboard.
The agent needs outbound HTTPS access to Chamber:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "connect\|websocket"
If behind a corporate firewall, you may need to configure proxy settings.
Verify the cluster name was set correctly during installation:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "cluster"

GPUs Not Detected

Symptoms: Cluster appears but shows 0 GPUs
The NVIDIA device plugin must be running:
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
If not running, install it from NVIDIA’s documentation.
Check that nodes report GPU resources:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
Nodes should show a GPU count. If showing <none>, the NVIDIA drivers or device plugin may not be configured correctly.
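If the device plugin is running but nodes still report no GPUs, a quick sanity check is to run nvidia-smi in a one-off pod. This is a sketch: the CUDA image tag is an example, so pick one compatible with your installed driver.

```shell
# Launch a one-off pod that requests a GPU and prints nvidia-smi output.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Wait for completion, print the nvidia-smi output, then clean up
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
  pod/gpu-smoke-test --timeout=120s
kubectl logs pod/gpu-smoke-test
kubectl delete pod gpu-smoke-test
```

If the pod stays Pending, the scheduler sees no allocatable GPUs; if nvidia-smi fails inside the container, the driver or container toolkit is the likely culprit.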

Workloads Not Tracked

Symptoms: Workloads run but don’t appear in Chamber
Workloads must have the team label to be tracked:
metadata:
  labels:
    chamber.io/team: your-team-slug
If you configured watchNamespaces, verify your workload’s namespace is included. By default, the agent watches all namespaces.
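To add the team label to an existing Deployment without editing its manifest, you can patch the pod template. This is a sketch: the Deployment name "ml-training" and the namespace are placeholders, and patching the pod template triggers a rolling restart.

```shell
# Add the team label to a Deployment's pod template so the agent
# tracks its pods. "ml-training" is a placeholder Deployment name.
kubectl patch deployment ml-training -n default --type merge -p '
spec:
  template:
    metadata:
      labels:
        chamber.io/team: your-team-slug
'
```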

Agent Not Starting

Symptoms: Pod in CrashLoopBackOff or Error state
# Check logs from crashed pod
kubectl logs -l app.kubernetes.io/name=chamber-agent --previous
Error               Solution
Token/auth errors   Generate a new token from the Chamber dashboard
Connection errors   Check that the firewall allows outbound HTTPS
Permission errors   Reinstall the agent to restore RBAC
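For permission errors, you can check the agent's RBAC directly before reinstalling. The service account name "chamber-agent" below is an assumption; confirm it with kubectl get serviceaccounts -n chamber-system.

```shell
# Check whether the agent's service account can list pods cluster-wide.
# "chamber-agent" is an assumed service account name.
kubectl auth can-i list pods --all-namespaces \
  --as=system:serviceaccount:chamber-system:chamber-agent
```

If this prints "no", the agent's ClusterRole or binding is missing, and reinstalling the chart should restore it.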

GPU Metrics Not Appearing

Symptoms: Dashboard shows no GPU utilization data
The agent discovers GPU metrics via DCGM-Exporter:
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent | grep -i dcgm
Look for “DCGM-Exporter detected” messages.
Check that metrics are being sent:
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent | grep -i "metrics"
Look for “Sent X metrics to Control Plane” messages.
If DCGM-Exporter is not installed, the agent can’t collect GPU metrics. Install via NVIDIA GPU Operator or standalone:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install dcgm-exporter nvidia/dcgm-exporter -n gpu-operator --create-namespace

GPU Usage Metric Not Available

Symptoms: Basic GPU metrics (utilization, memory, temperature) appear in the dashboard, but the GPU Usage metric is missing. Chamber’s GPU Usage metric requires the following five DCGM metrics to be collected:
  • Profiling metrics: DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, DCGM_FI_PROF_DRAM_ACTIVE — not enabled in the default DCGM-Exporter configuration and must be added explicitly.
  • Power metrics: DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_POWER_MGMT_LIMIT — included in the default DCGM-Exporter configuration but may be missing if your cluster uses a custom metrics configuration.
All five metrics must be present for GPU Usage to be computed. If any are missing, the agent logs a warning indicating which metrics are unavailable.
If you already have a custom DCGM-Exporter metrics ConfigMap, ensure all five metrics above are included in your existing configuration. A common issue is that custom configurations may omit DCGM_FI_DEV_POWER_MGMT_LIMIT, which is required for the power gate calculation.
1. Create a metrics ConfigMap

Create a ConfigMap that includes both standard and profiling metrics:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: gpu-operator
data:
  dcgm-metrics.csv: |
    # Standard GPU metrics
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
    DCGM_FI_DEV_FB_USED, gauge, GPU memory used (MiB)
    DCGM_FI_DEV_FB_FREE, gauge, GPU memory free (MiB)
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (Celsius)
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage (Watts)
    DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Power management limit (Watts)

    # Profiling metrics (required for GPU Usage)
    DCGM_FI_PROF_SM_ACTIVE, gauge, SM activity ratio
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core activity ratio
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Memory interface activity ratio
EOF
2. Configure DCGM-Exporter to use the ConfigMap
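If DCGM-Exporter is managed by the NVIDIA GPU Operator, the ClusterPolicy can reference the ConfigMap by name. This is a sketch based on the dcgmExporter.config.name field described in NVIDIA's GPU Operator documentation; verify the field and the ClusterPolicy name against your operator version.

```shell
# Point the GPU Operator's managed DCGM-Exporter at the custom metrics
# ConfigMap. The ClusterPolicy is typically named "cluster-policy".
kubectl patch clusterpolicy/cluster-policy --type merge -p '
spec:
  dcgmExporter:
    config:
      name: dcgm-exporter-metrics
'
```

The DCGM-Exporter pods restart to pick up the new configuration. For a standalone Helm install, consult the dcgm-exporter chart's values for mounting a custom metrics file instead.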

3. Verify profiling metrics

Confirm the profiling metrics are being collected:
# Get a DCGM-Exporter pod name
DCGM_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter \
  -o jsonpath='{.items[0].metadata.name}')

# Port-forward and check metrics
kubectl port-forward -n gpu-operator pod/$DCGM_POD 9400:9400 &
PF_PID=$!
sleep 3
curl -s localhost:9400/metrics | grep -E "DCGM_FI_PROF_(SM|PIPE_TENSOR|DRAM)_ACTIVE|DCGM_FI_DEV_POWER_MGMT_LIMIT"
kill $PF_PID 2>/dev/null
You should see all four metrics in the output:
  • DCGM_FI_PROF_SM_ACTIVE
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
  • DCGM_FI_PROF_DRAM_ACTIVE
  • DCGM_FI_DEV_POWER_MGMT_LIMIT
If any are missing, update your DCGM-Exporter metrics ConfigMap to include them. The GPU Usage metric will appear in the Chamber dashboard shortly after all required metrics are available.
Profiling metrics require NVIDIA Volta architecture or newer data center GPUs (V100, T4, A100, H100, etc.). Consumer GPUs (GeForce series) generally do not support DCGM profiling metrics.
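To confirm which GPU model each node carries, GPU Feature Discovery (installed by default with the GPU Operator) exposes it as a node label; this assumes GFD is running in your cluster.

```shell
# Shows the GPU model per node. The column is empty if GPU Feature
# Discovery is not running and the label is absent.
kubectl get nodes -L nvidia.com/gpu.product
```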

Proxy Configuration

If your cluster uses an HTTP proxy:
# values.yaml
env:
  - name: HTTPS_PROXY
    value: "http://proxy.example.com:8080"
  - name: NO_PROXY
    value: "10.0.0.0/8,172.16.0.0/12,.cluster.local"
Then upgrade the agent:
helm upgrade chamber-agent oci://public.ecr.aws/chamber/chamber-agent-chart \
  -n chamber-system -f values.yaml

Getting Help

If you’re still having issues:
  1. Collect logs: kubectl logs -l app.kubernetes.io/name=chamber-agent > agent-logs.txt
  2. Contact support with logs and your agent version
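The collection steps above can be sketched as a small script; the output filenames are just suggestions.

```shell
# Gather a basic support bundle for Chamber support.
kubectl get pods -n chamber-system -o wide > chamber-pods.txt
kubectl describe pods -n chamber-system \
  -l app.kubernetes.io/name=chamber-agent > agent-describe.txt
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent \
  --tail=500 > agent-logs.txt
helm list -n chamber-system > chamber-version.txt   # includes agent version
```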