This guide covers common issues with the Chamber agent and how to resolve them.

Quick Checks

# Check agent pod status
kubectl get pods -n chamber-system -l app.kubernetes.io/name=chamber-agent

# View recent logs
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent --tail=50

# Check installed version
helm list -n chamber-system

Cluster Not Appearing in Dashboard

Symptoms: Agent pod is running but cluster doesn’t appear in Chamber
Verify the token is correct and not expired:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "token\|auth"
If you see authentication errors, generate a new token from Settings > Security > API Tokens > New Token in the Chamber dashboard.
The agent needs outbound HTTPS access to Chamber:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "connect\|websocket"
If behind a corporate firewall, you may need to configure proxy settings.
Verify the cluster name was set correctly during installation:
kubectl logs -l app.kubernetes.io/name=chamber-agent | grep -i "cluster"

GPUs Not Detected

Symptoms: Cluster appears but shows 0 GPUs
The NVIDIA device plugin must be running:
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
If not running, install it from NVIDIA’s documentation.
Check that nodes report GPU resources:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
Nodes should show a GPU count. If showing <none>, the NVIDIA drivers or device plugin may not be configured correctly.
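If the device plugin is running but nodes still report no GPUs, a quick sanity check is to run nvidia-smi in a one-off pod. This is a sketch: the CUDA image tag is an example, so pick one compatible with your installed driver.

```shell
# Launch a one-off pod that requests a GPU and prints nvidia-smi output.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Wait for completion, print the nvidia-smi output, then clean up
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
  pod/gpu-smoke-test --timeout=120s
kubectl logs pod/gpu-smoke-test
kubectl delete pod gpu-smoke-test
```

If the pod stays Pending, the scheduler sees no allocatable GPUs; if nvidia-smi fails inside the container, the driver or container toolkit is the likely culprit.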

Workloads Not Tracked

Symptoms: Workloads run but don’t appear in Chamber
Workloads must have the team label to be tracked:
metadata:
  labels:
    chamber.io/team: your-team-slug
If you configured watchNamespaces, verify your workload’s namespace is included. By default, the agent watches all namespaces.
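To add the team label to an existing Deployment without editing its manifest, you can patch the pod template. This is a sketch: the Deployment name "ml-training" and the namespace are placeholders, and patching the pod template triggers a rolling restart.

```shell
# Add the team label to a Deployment's pod template so the agent
# tracks its pods. "ml-training" is a placeholder Deployment name.
kubectl patch deployment ml-training -n default --type merge -p '
spec:
  template:
    metadata:
      labels:
        chamber.io/team: your-team-slug
'
```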

Agent Not Starting

Symptoms: Pod in CrashLoopBackOff or Error state
# Check logs from crashed pod
kubectl logs -l app.kubernetes.io/name=chamber-agent --previous
Error               Solution
Token/auth errors   Generate a new token from the Chamber dashboard
Connection errors   Check that the firewall allows outbound HTTPS
Permission errors   Reinstall the agent to restore RBAC
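For permission errors, you can check the agent's RBAC directly before reinstalling. The service account name "chamber-agent" below is an assumption; confirm it with kubectl get serviceaccounts -n chamber-system.

```shell
# Check whether the agent's service account can list pods cluster-wide.
# "chamber-agent" is an assumed service account name.
kubectl auth can-i list pods --all-namespaces \
  --as=system:serviceaccount:chamber-system:chamber-agent
```

If this prints "no", the agent's ClusterRole or binding is missing, and reinstalling the chart should restore it.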

GPU Metrics Not Appearing

Symptoms: Dashboard shows no GPU utilization data
The agent discovers GPU metrics via DCGM-Exporter:
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent | grep -i dcgm
Look for “DCGM-Exporter detected” messages.
Check that metrics are being sent:
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent | grep -i "metrics"
Look for “Sent X metrics to Control Plane” messages.
If DCGM-Exporter is not installed, the agent can’t collect GPU metrics. Install via NVIDIA GPU Operator or standalone:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install dcgm-exporter nvidia/dcgm-exporter -n gpu-operator --create-namespace

GPU Usage Metric Not Available

Symptoms: Basic GPU metrics (utilization, memory, temperature) appear in the dashboard, but the GPU Usage metric is missing. Chamber’s GPU Usage metric requires the following five DCGM metrics to be collected:
  • Profiling metrics: DCGM_FI_PROF_SM_ACTIVE, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, DCGM_FI_PROF_DRAM_ACTIVE — not enabled in the default DCGM-Exporter configuration and must be added explicitly.
  • Power metrics: DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_POWER_MGMT_LIMIT — included in the default DCGM-Exporter configuration but may be missing if your cluster uses a custom metrics configuration.
All five metrics must be present for GPU Usage to be computed. If any are missing, the agent logs a warning indicating which metrics are unavailable.
If you already have a custom DCGM-Exporter metrics ConfigMap, ensure all five metrics above are included in your existing configuration. A common issue is that custom configurations may omit DCGM_FI_DEV_POWER_MGMT_LIMIT, which is required for the power gate calculation.
1. Create a metrics ConfigMap

Create a ConfigMap that includes both standard and profiling metrics:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: gpu-operator
data:
  dcgm-metrics.csv: |
    # Standard GPU metrics
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
    DCGM_FI_DEV_FB_USED, gauge, GPU memory used (MiB)
    DCGM_FI_DEV_FB_FREE, gauge, GPU memory free (MiB)
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (Celsius)
    DCGM_FI_DEV_POWER_USAGE, gauge, Power usage (Watts)
    DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Power management limit (Watts)

    # Profiling metrics (required for GPU Usage)
    DCGM_FI_PROF_SM_ACTIVE, gauge, SM activity ratio
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor core activity ratio
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Memory interface activity ratio
EOF
2. Configure DCGM-Exporter to use the ConfigMap
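If DCGM-Exporter is managed by the NVIDIA GPU Operator, the ClusterPolicy can reference the ConfigMap by name. This is a sketch based on the dcgmExporter.config.name field described in NVIDIA's GPU Operator documentation; verify the field and the ClusterPolicy name against your operator version.

```shell
# Point the GPU Operator's managed DCGM-Exporter at the custom metrics
# ConfigMap. The ClusterPolicy is typically named "cluster-policy".
kubectl patch clusterpolicy/cluster-policy --type merge -p '
spec:
  dcgmExporter:
    config:
      name: dcgm-exporter-metrics
'
```

The DCGM-Exporter pods restart to pick up the new configuration. For a standalone Helm install, consult the dcgm-exporter chart's values for mounting a custom metrics file instead.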

3. Verify profiling metrics

Confirm the profiling metrics are being collected:
# Get a DCGM-Exporter pod name
DCGM_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter \
  -o jsonpath='{.items[0].metadata.name}')

# Port-forward and check metrics
kubectl port-forward -n gpu-operator pod/$DCGM_POD 9400:9400 &
PF_PID=$!
sleep 3
curl -s localhost:9400/metrics | grep -E "DCGM_FI_PROF_(SM|PIPE_TENSOR|DRAM)_ACTIVE|DCGM_FI_DEV_POWER_MGMT_LIMIT"
kill $PF_PID 2>/dev/null
You should see all four metrics in the output:
  • DCGM_FI_PROF_SM_ACTIVE
  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
  • DCGM_FI_PROF_DRAM_ACTIVE
  • DCGM_FI_DEV_POWER_MGMT_LIMIT
If any are missing, update your DCGM-Exporter metrics ConfigMap to include them. The GPU Usage metric will appear in the Chamber dashboard shortly after all required metrics are available.
Profiling metrics require NVIDIA Volta architecture or newer data center GPUs (V100, T4, A100, H100, etc.). Consumer GPUs (GeForce series) generally do not support DCGM profiling metrics.
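To confirm which GPU model each node carries, GPU Feature Discovery (installed by default with the GPU Operator) exposes it as a node label; this assumes GFD is running in your cluster.

```shell
# Shows the GPU model per node. The column is empty if GPU Feature
# Discovery is not running and the label is absent.
kubectl get nodes -L nvidia.com/gpu.product
```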

Proxy Configuration

If your cluster uses an HTTP proxy:
# values.yaml
env:
  - name: HTTPS_PROXY
    value: "http://proxy.example.com:8080"
  - name: NO_PROXY
    value: "10.0.0.0/8,172.16.0.0/12,.cluster.local"
Then upgrade the agent:
helm upgrade chamber-agent oci://public.ecr.aws/chamber/chamber-agent-chart \
  -n chamber-system -f values.yaml

Getting Help

If you’re still having issues:
  1. Collect logs: kubectl logs -l app.kubernetes.io/name=chamber-agent > agent-logs.txt
  2. Contact support with logs and your agent version
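The collection steps above can be sketched as a small script; the output filenames are just suggestions.

```shell
# Gather a basic support bundle for Chamber support.
kubectl get pods -n chamber-system -o wide > chamber-pods.txt
kubectl describe pods -n chamber-system \
  -l app.kubernetes.io/name=chamber-agent > agent-describe.txt
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent \
  --tail=500 > agent-logs.txt
helm list -n chamber-system > chamber-version.txt   # includes agent version
```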