# Quick Checks
## Cluster Not Appearing in Dashboard
**Symptoms:** The agent pod is running, but the cluster doesn't appear in Chamber.

### Check the cluster token
Verify that the token is correct and has not expired. If you see authentication errors, generate a new token from **Settings > Security > API Tokens > New Token** in the Chamber dashboard.
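One way to surface authentication failures is to scan the agent logs. This is a sketch: the label selector matches the one used in the log-collection step of this guide, but the exact error strings are assumptions.

```shell
# Tail recent agent logs and filter for token/authentication problems.
kubectl logs -l app.kubernetes.io/name=chamber-agent --tail=200 \
  | grep -iE "auth|token|401|403"
```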
### Check network connectivity
The agent needs outbound HTTPS access to Chamber. If you are behind a corporate firewall, you may need to configure proxy settings (see Proxy Configuration below).
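A quick connectivity probe from inside the cluster can rule out network policy and firewall issues. The Chamber endpoint below is a placeholder; substitute the control-plane host from your agent configuration.

```shell
# Run a one-off pod and attempt an HTTPS connection to the Chamber endpoint.
# Replace chamber.example.com with the host your agent is configured to use.
kubectl run net-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sv https://chamber.example.com -o /dev/null
```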
### Check cluster name
Verify the cluster name was set correctly during installation:
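If the agent was installed with Helm, the configured name can be inspected from the release values. The release name and namespace below are assumptions based on a typical install; adjust them to match yours.

```shell
# Show the values the release was installed with; look for the cluster name setting.
helm get values chamber-agent -n chamber

# Alternatively, inspect the rendered agent Deployment for a cluster-name setting.
kubectl describe deployment chamber-agent -n chamber | grep -i cluster
```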
## GPUs Not Detected
**Symptoms:** Cluster appears but shows 0 GPUs.

### Check NVIDIA device plugin
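A typical first check is to confirm the device plugin is actually running on every GPU node. The DaemonSet name and label below match the standard NVIDIA device plugin manifests; adjust if your install differs.

```shell
# The device plugin runs as a DaemonSet; every GPU node should have a Running pod.
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```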
### Verify GPUs are allocatable
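One way to list per-node GPU capacity, assuming GPUs are advertised under the standard `nvidia.com/gpu` resource name:

```shell
# Print each node's allocatable GPU count; nodes without the resource show <none>.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```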
Check that nodes report GPU resources. Nodes should show a GPU count; if a node shows `<none>`, the NVIDIA drivers or device plugin may not be configured correctly.

## Workloads Not Tracked
**Symptoms:** Workloads run but don't appear in Chamber.

### Check workload labels
Workloads must have the `team` label to be tracked:
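For example, to add the label to an existing Deployment (the workload name and team value are illustrative):

```shell
# Label an existing workload. Note that labeling the Deployment does not relabel
# pods created from its template, so for new workloads prefer setting the label
# in the pod template spec.
kubectl label deployment my-training-job team=ml-research
```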
### Check namespace filtering
If you configured `watchNamespaces`, verify that your workload's namespace is included. By default, the agent watches all namespaces.

## Agent Not Starting
**Symptoms:** Pod is in `CrashLoopBackOff` or `Error` state.

| Error | Solution |
|---|---|
| Token/auth errors | Generate new token from Chamber dashboard |
| Connection errors | Check firewall allows outbound HTTPS |
| Permission errors | Reinstall agent to fix RBAC |
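To identify which case applies, inspect the pod events and the crashed container's logs:

```shell
# Recent events often reveal image, RBAC, or scheduling problems.
kubectl describe pod -l app.kubernetes.io/name=chamber-agent

# The previous container's logs show why it crashed.
kubectl logs -l app.kubernetes.io/name=chamber-agent --previous
```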
## GPU Metrics Not Appearing
**Symptoms:** Dashboard shows no GPU utilization data.

### Check DCGM-Exporter
The agent discovers GPU metrics via DCGM-Exporter. Check the agent logs for "DCGM-Exporter detected" messages.
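For example (the log message is quoted from the description above; the label selector follows the agent's standard labels):

```shell
# Search recent agent logs for the DCGM discovery message.
kubectl logs -l app.kubernetes.io/name=chamber-agent --tail=500 \
  | grep -i "DCGM-Exporter detected"
```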
### Verify metrics collection
Check the agent logs to confirm that metrics are being sent; look for "Sent X metrics to Control Plane" messages.
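A sketch of that check, again using the agent's standard labels:

```shell
# Confirm the agent is shipping metrics to the control plane.
kubectl logs -l app.kubernetes.io/name=chamber-agent --tail=500 \
  | grep "Sent"
```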
### Install DCGM-Exporter
If DCGM-Exporter is not installed, the agent can’t collect GPU metrics. Install via NVIDIA GPU Operator or standalone:
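A standalone install might look like this. The repository URL and chart name follow NVIDIA's published DCGM-Exporter Helm charts; the namespace is illustrative, so verify against the current DCGM-Exporter documentation.

```shell
# Add NVIDIA's DCGM-Exporter chart repository and install it.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  -n gpu-monitoring --create-namespace
```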
## GPU Usage Metric Not Available
**Symptoms:** Basic GPU metrics (utilization, memory, temperature) appear in the dashboard, but the GPU Usage metric is missing.

Chamber's GPU Usage metric requires the following five DCGM metrics to be collected:

- Profiling metrics (`DCGM_FI_PROF_SM_ACTIVE`, `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`, `DCGM_FI_PROF_DRAM_ACTIVE`): not enabled in the default DCGM-Exporter configuration; they must be added explicitly.
- Power metrics (`DCGM_FI_DEV_POWER_USAGE`, `DCGM_FI_DEV_POWER_MGMT_LIMIT`): included in the default DCGM-Exporter configuration, but may be missing if your cluster uses a custom metrics configuration.
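As a sketch, the required counters can be supplied in a ConfigMap using DCGM-Exporter's CSV counter format (`field, prometheus type, help text`). The ConfigMap name and namespace below are illustrative; the `dcgm-metrics.csv` data key follows the convention the NVIDIA GPU Operator expects.

```shell
# Create a ConfigMap holding a custom DCGM counter list.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: chamber-dcgm-metrics
  namespace: gpu-monitoring
data:
  dcgm-metrics.csv: |
    DCGM_FI_PROF_SM_ACTIVE,          gauge, Ratio of cycles at least one SM is active.
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor pipes are active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the memory interface is active.
    DCGM_FI_DEV_POWER_USAGE,         gauge, Power draw in watts.
    DCGM_FI_DEV_POWER_MGMT_LIMIT,    gauge, Power management limit in watts.
EOF
```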
### Configure DCGM-Exporter to use the ConfigMap
You can apply the ConfigMap reference either through a Helm upgrade (recommended) or a manual `kubectl` patch.
If DCGM-Exporter was installed via the NVIDIA GPU Operator, upgrade the operator release so it references the ConfigMap. The GPU Operator automatically restarts the DCGM-Exporter pods and handles the volume mount configuration, and the change persists across future upgrades.
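A sketch of that upgrade. The `dcgmExporter.config.name` value key matches the GPU Operator chart's custom-metrics option; the release name, namespace, and ConfigMap name are assumptions, so substitute your own.

```shell
# Point the GPU Operator's DCGM-Exporter at the custom metrics ConfigMap.
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set dcgmExporter.config.name=chamber-dcgm-metrics
```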
### Verify profiling metrics
Confirm the profiling metrics are being collected. You should see all four metrics in the output:

- `DCGM_FI_PROF_SM_ACTIVE`
- `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`
- `DCGM_FI_PROF_DRAM_ACTIVE`
- `DCGM_FI_DEV_POWER_MGMT_LIMIT`
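One way to check is to query the exporter's metrics endpoint directly. Port 9400 is DCGM-Exporter's default; the service name and namespace are assumptions.

```shell
# Port-forward the DCGM-Exporter service and look for the required metrics.
kubectl port-forward -n gpu-monitoring svc/dcgm-exporter 9400:9400 &
sleep 2
curl -s localhost:9400/metrics \
  | grep -E "DCGM_FI_PROF_|DCGM_FI_DEV_POWER_MGMT_LIMIT"
kill %1
```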
Profiling metrics require NVIDIA Volta architecture or newer GPUs (V100, T4, A100, H100, etc.). Consumer GPUs (GeForce series) may not support profiling metrics.
## Proxy Configuration
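One common pattern is to set the standard proxy environment variables on the agent at install time. The Helm value keys, release name, and proxy address below are assumptions, not confirmed Chamber chart options, so check your chart's documented values.

```shell
# Re-deploy the agent with proxy environment variables set.
helm upgrade chamber-agent chamber/chamber-agent --reuse-values \
  --set env.HTTPS_PROXY=http://proxy.internal:3128 \
  --set env.NO_PROXY=10.0.0.0/8\,.svc\,.cluster.local
```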
If your cluster uses an HTTP proxy, the agent must be configured with your proxy settings so it can reach Chamber.

## Getting Help
If you're still having issues:

- Collect logs: `kubectl logs -l app.kubernetes.io/name=chamber-agent > agent-logs.txt`
- Contact support with the logs and your agent version

