The Chamber AI Ops Agent handles over 200 GPU infrastructure scenarios across training failures, hardware faults, scheduling issues, distributed computing problems, and capacity planning. This page covers the major categories and how the agent addresses each one.

Scenario Overview

| Category | What It Covers | Agent Response |
| --- | --- | --- |
| GPU Memory (OOM) | Out-of-memory errors during training | Auto-fix configuration, resubmit |
| Distributed Training | NCCL timeouts, rank failures, communication errors | Diagnose network/config issues, resubmit with fixes |
| Hardware Faults | ECC errors, PCIe failures, thermal throttling | Isolate faulty hardware, migrate jobs, escalate for repair |
| Scheduling & Preemption | Capacity exhaustion, spot reclamation, taint mismatches | Resubmit, adjust priority, or wait for capacity |
| Ray Workloads | Ray Train, Serve, Data, and cluster-level failures | Framework-aware diagnosis and remediation |
| Stuck and Cascading Failures | Hanging jobs, multi-step failures, dependency chains | Multi-turn investigation, correlated diagnosis |
| Capacity Planning | Oversubscription, demand forecasting, utilization gaps | Analysis and recommendations for team leads |

GPU Memory (OOM)

Out-of-memory errors are the most common GPU training failure. The agent doesn’t just detect them — it identifies the specific cause and applies the right fix.
Signal: Job fails with CUDA out of memory during the forward pass. GPU memory at 98%+ utilization.
Diagnosis: Batch size too large for the model size and available GPU memory.
Action: Reduce batch size, enable gradient checkpointing, resubmit with corrected configuration.
Example: LLaMA 7B fine-tuning on an A100-80GB with batch_size=64 fails. The agent reduces to batch_size=16 with gradient checkpointing enabled and resubmits.
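The remediation above amounts to a configuration rewrite before resubmission. A minimal sketch of that logic (the function name, config fields, and shrink factor are illustrative, not the agent's actual API):

```python
def fix_oom_config(job_config, shrink_factor=4):
    """Sketch of an OOM auto-fix: shrink the batch size and enable
    gradient checkpointing before resubmitting. All names illustrative."""
    fixed = dict(job_config)
    fixed["batch_size"] = max(1, job_config["batch_size"] // shrink_factor)
    fixed["gradient_checkpointing"] = True
    return fixed

# The failing fine-tuning run from the example: batch_size 64 -> 16,
# with gradient checkpointing turned on.
fixed = fix_oom_config({"batch_size": 64, "model": "llama-7b"})
```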
Signal: OOM occurs during data loading, not the model forward pass. Container memory limit hit.
Diagnosis: Too many DataLoader workers consuming host memory.
Action: Reduce num_workers, resubmit.
Example: 16 DataLoader workers on a node with limited host memory. The agent reduces to 4 workers and resubmits.
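The worker count can be derived from a host-memory budget rather than guessed. A hedged sketch (the per-worker and reserve figures are example values, not measured constants):

```python
def safe_num_workers(requested, host_mem_gb, per_worker_gb=2.0, reserve_gb=8.0):
    """Cap DataLoader workers so their combined footprint fits the host
    memory budget. per_worker_gb and reserve_gb are illustrative defaults;
    real values depend on the dataset and prefetch settings."""
    budget_gb = max(0.0, host_mem_gb - reserve_gb)
    return max(1, min(requested, int(budget_gb // per_worker_gb)))

# The example scenario: 16 requested workers on a 16 GB host -> 4 workers.
workers = safe_num_workers(16, host_mem_gb=16.0)
```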
Signal: RuntimeError: DataLoader worker is killed with shared memory errors.
Diagnosis: Default /dev/shm size (64MB) insufficient for multi-worker data loading.
Action: Add an emptyDir volume mount for /dev/shm with adequate size, resubmit.
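The /dev/shm fix is a pod-spec change. A sketch of the kind of patch involved (container name and size are example values):

```yaml
# Illustrative pod-spec fragment: back /dev/shm with a memory-backed
# emptyDir large enough for multi-worker data loading.
spec:
  containers:
    - name: trainer
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi   # example size; scale to your DataLoader traffic
```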
Signal: OOM during the backward pass with large sequence lengths.
Diagnosis: Activation memory exceeds GPU capacity without checkpointing.
Action: Enable gradient checkpointing and mixed precision training (--fp16), resubmit.
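Why sequence length dominates here: activation memory grows linearly with batch size, sequence length, hidden size, and (without checkpointing) layer count. A back-of-envelope estimator, assuming a simple transformer and ignoring attention maps and optimizer state:

```python
def activation_gib(batch, seq_len, hidden, layers,
                   bytes_per_value=2, checkpointing=False):
    """Rough activation footprint in GiB. With gradient checkpointing,
    only about one layer of activations is live at a time (the rest are
    recomputed in backward). This is a rule of thumb, not a profiler."""
    per_layer = batch * seq_len * hidden * bytes_per_value
    live_layers = 1 if checkpointing else layers
    return per_layer * live_layers / 2**30

full = activation_gib(8, 8192, 4096, 32)                      # all layers held
ckpt = activation_gib(8, 8192, 4096, 32, checkpointing=True)  # ~1 layer held
```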

Distributed Training

Multi-GPU and multi-node training introduces communication complexity. The agent diagnoses network, synchronization, and hardware issues across distributed training setups.
Signal: NCCL AllReduce timeout. Training hangs after initialization.
Diagnosis: NCCL defaulting to the loopback interface instead of the cluster network.
Action: Set the NCCL_SOCKET_IFNAME=eth0 environment variable, enable NCCL debug logging, resubmit.
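The fix is two environment variables on the training processes. A minimal sketch ("eth0" is site-specific; check the NIC names on your nodes):

```python
import os

# Pin NCCL to the cluster NIC instead of loopback; the right interface
# name varies per cluster (eth0 here is an example).
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# Surface NCCL's ring/transport selection in the job logs so the choice
# of interface can be verified after resubmission.
os.environ["NCCL_DEBUG"] = "INFO"
```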
Signal: Training hangs at a specific step. Some ranks complete while others wait.
Diagnosis: Conditional code paths causing rank divergence in DDP training: not all ranks execute the same collective operations.
Action: Identify the divergent code path, recommend a fix, escalate to the developer if a code change is required.
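The deadlock arises because collectives must be issued in the same order on every rank. A toy model of the check (real diagnosis compares stack traces or NCCL logs, not neat schedules like these):

```python
def ranks_diverge(schedules):
    """True if the ranks' collective-call sequences differ, which in real
    DDP training means some ranks block forever on an unmatched collective.
    `schedules` maps rank -> ordered list of collective ops (toy model)."""
    return len({tuple(ops) for ops in schedules.values()}) > 1

# Rank 0 took a conditional branch that issued an extra broadcast:
hung = ranks_diverge({0: ["all_reduce", "broadcast"], 1: ["all_reduce"]})
```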
Signal: NCCL falls back to TCP with 100x slower communication. Training continues but at unacceptable speed.
Diagnosis: InfiniBand link failure on a specific node.
Action: Cancel the job, resubmit with node exclusion, escalate for IB hardware repair.

Signal: Distributed training slows progressively. NCCL timeouts occur after hours of training.
Diagnosis: GPU on one node throttling due to overheating (92C+), creating a straggler that slows all ranks.
Action: Cancel the job, resubmit excluding the overheating node, escalate for cooling system inspection.

Hardware Faults

GPU hardware issues require quick detection and isolation to prevent cascading failures.
Signal: CUDA errors during computation. Xid errors in kernel logs.
Diagnosis: ECC uncorrectable memory errors indicating GPU memory degradation.
Action: Cancel affected jobs, migrate to healthy GPUs, mark the node for maintenance, escalate for hardware replacement.

Signal: Xid 79: GPU has fallen off the bus in system logs.
Diagnosis: PCIe link failure: the GPU is no longer communicating with the host.
Action: Drain the node, resubmit affected jobs to other nodes, escalate for hardware inspection.
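Both hardware scenarios start from the same signal: an Xid code in dmesg-style logs. A sketch of the triage step, assuming the usual NVRM log format (the table below covers only the two codes mentioned here; NVIDIA's Xid documentation lists the rest):

```python
import re

# Subset of NVIDIA Xid codes mapped to triage actions (illustrative).
XID_TRIAGE = {
    48: "double-bit ECC error: migrate jobs, mark node for maintenance",
    79: "GPU fallen off the bus: drain node, escalate PCIe inspection",
}

def triage_xid(log_line):
    """Pull an Xid code out of a dmesg-style line such as
    'NVRM: Xid (PCI:0000:3b:00): 79, ...' (format assumed) and map it
    to a triage action. Returns None when no Xid is present."""
    m = re.search(r"Xid\s*\([^)]*\):\s*(\d+)", log_line)
    if not m:
        return None
    return XID_TRIAGE.get(int(m.group(1)), "unrecognized Xid: escalate")
```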

Scheduling and Preemption

The agent handles the full lifecycle of workload scheduling issues.
Signal: Job stuck in the PENDING state. No available GPUs matching the request.
Diagnosis: All matching GPUs allocated. Checks for elastic workloads that could be preempted.
Action: If a reserved workload: preempt eligible elastic workloads. If elastic: queue with an estimated wait time, notify the team.
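The decision splits on the workload's tier. A minimal sketch of that branch (tier names and the return shape are hypothetical, not the agent's real interface):

```python
def scheduling_action(job_tier, preemptible_elastic_jobs, est_wait_min):
    """Toy version of the PENDING-job decision: reserved jobs may preempt
    elastic ones; elastic jobs queue with an ETA. Names are illustrative."""
    if job_tier == "reserved" and preemptible_elastic_jobs:
        return ("preempt", preemptible_elastic_jobs)
    return ("queue", f"estimated wait ~{est_wait_min} min; notifying team")
```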
Signal: Job preempted due to cloud provider spot instance reclamation.
Diagnosis: Spot interruption notice followed by pod termination.
Action: Resubmit the job with the same configuration. If reclamation is frequent: recommend switching to reserved capacity.
Signal: Pod unschedulable despite available GPU resources.
Diagnosis: Node taints don’t match pod tolerations in the workload manifest.
Action: Add the correct tolerations to the manifest, resubmit.
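The fix is a manifest change. A sketch of a common pattern, assuming the GPU nodes carry the stock nvidia.com/gpu taint (read the actual taint with `kubectl describe node` before copying this):

```yaml
# Illustrative fix: give the pod a toleration matching the GPU node taint.
spec:
  tolerations:
    - key: nvidia.com/gpu   # example key; must match the node's taint
      operator: Exists
      effect: NoSchedule
```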
Signal: ImagePullBackOff or ErrImagePull in Kubernetes events.
Diagnosis: Registry authentication expired or image tag not found.
Action: If an auth issue: escalate (requires a credential update). If a tag issue: recommend the correct tag.

Ray Workloads

The agent provides first-class support for the Ray distributed computing framework, covering Ray Train, Ray Serve, Ray Data, and cluster management.
Signal: Ray worker killed by the memory monitor during distributed training.
Diagnosis: Worker memory limit too low for the model size, or missing memory optimization (DeepSpeed ZeRO, gradient checkpointing).
Action: Increase the worker memory allocation or enable memory optimization, resubmit the RayJob.

Signal: Ray Serve replica fails to start. Model loading exceeds replica memory.
Diagnosis: Model too large for the configured replica resources.
Action: Increase replica GPU memory, or configure model parallelism across replicas.
Signal: RayJob CR created but the Ray cluster never reaches the running state.
Diagnosis: nodeSelector targeting the wrong level (e.g., pod-level instead of node-level) or an autoscaler deadlock.
Action: Fix the YAML structure, resubmit. For autoscaler issues: diagnose scaling constraints and recommend configuration changes.
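A sketch of where the selector belongs, following KubeRay's CRD layout (the group name, label, and resource values are examples):

```yaml
# Illustrative RayJob worker-group fragment: nodeSelector goes inside the
# pod template's spec, not on the container.
workerGroupSpecs:
  - groupName: gpu-workers
    template:
      spec:
        nodeSelector:
          node.kubernetes.io/instance-type: a100-node   # example label
        containers:
          - name: ray-worker
            resources:
              limits:
                nvidia.com/gpu: 1
```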
Signal: All Ray workers disconnect. The cluster becomes unresponsive.
Diagnosis: Global Control Store (GCS) crash on the head node, typically from head node OOM.
Action: If GCS fault tolerance is enabled: wait for recovery. If not: cancel and resubmit with GCS FT configuration and larger head node memory.

Signal: GPU utilization drops to near zero despite the data pipeline running.
Diagnosis: Ray Data operator producing data faster than GPU training can consume, filling the object store.
Action: Configure backpressure limits, adjust data pipeline parallelism to match GPU throughput.

Stuck and Cascading Failures

Some failures require multi-step investigation across multiple data sources and time windows.
Signal: Job status is RUNNING but loss metrics haven’t updated in 30+ minutes.
Diagnosis: The agent checks GPU utilization (if near 0%, likely a deadlock or I/O stall), network metrics (if communication stalled), and recent log entries to pinpoint where the job is stuck.
Action: Based on diagnosis: may fix an I/O bottleneck, restart communication, or cancel and resubmit with a corrected config.
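The diagnosis above is essentially a small decision tree over three metrics. A toy version (thresholds and messages are illustrative; the real agent correlates many more signals):

```python
def diagnose_stuck(gpu_util_pct, nccl_bytes_per_s, minutes_since_log):
    """Toy triage tree for a RUNNING job whose loss has flatlined.
    Checks are ordered from most to least specific; thresholds are
    example values, not calibrated cutoffs."""
    if gpu_util_pct < 5:
        return "likely deadlock or I/O stall: inspect data pipeline and locks"
    if nccl_bytes_per_s == 0:
        return "communication stalled: check NCCL and network health"
    if minutes_since_log > 30:
        return "compute active but logging stalled: check log shipping"
    return "still progressing: keep watching"
```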
Signal: Multiple jobs fail simultaneously with different error messages.
Diagnosis: The agent correlates the failures to a single node going offline, rather than treating each as an independent issue.
Action: Identify the root cause (node crash), resubmit all affected jobs with node exclusion, escalate for node repair.

Signal: Training job blocked, but the error traces back through preprocessing to an expired credential.
Diagnosis: Multi-hop investigation: training blocked by a preprocessing failure, preprocessing blocked by expired S3 credentials.
Action: Escalate the credential refresh, then resubmit the full pipeline.

Signal: The same job fails repeatedly with different error messages at roughly the same training step.
Diagnosis: The agent correlates across attempts and discovers the underlying cause (e.g., thermal throttling causing GPU clock reduction) is the same despite different surface errors.
Action: Address the root cause rather than chasing individual error messages.

Capacity Planning

Beyond reactive incident handling, the agent provides proactive capacity intelligence.
Signal: Team consistently hitting queue limits during peak hours.
Diagnosis: Analyzes submission patterns over time: identifies peak demand windows, utilization gaps, and oversubscription trends.
Recommendations: GPU allocation adjustments, time-slot scheduling for peak hours, priority queuing for short-duration jobs. Escalates to team leads for capacity decisions.
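The core of the oversubscription analysis is comparing peak hourly demand against allocated capacity. A toy aggregation (the hourly-bucket input shape is an assumption made for illustration):

```python
def peak_demand(gpu_requests_by_hour, capacity):
    """Return (peak_hour, oversubscription_ratio) from a list of hourly
    GPU-request totals. A ratio above 1.0 means the team's peak window
    exceeds its allocation. Toy model of the pattern analysis."""
    hour = max(range(len(gpu_requests_by_hour)),
               key=gpu_requests_by_hour.__getitem__)
    return hour, gpu_requests_by_hour[hour] / capacity

# Example: demand peaks in bucket 2 at 80 requested GPUs vs 64 allocated.
hour, ratio = peak_demand([10, 40, 80, 30], capacity=64)
```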
Signal: Heartbeat detects GPUs sitting idle while jobs are queued in other teams.
Diagnosis: Identifies underutilized allocations and teams that could benefit from elastic burst capacity.
Recommendations: Rebalancing suggestions, elastic capacity sharing policies, right-sizing team allocations.

Next Steps

Safety & Governance

How the agent keeps your infrastructure safe with five layers of guardrails

Getting Started

Enable AI Ops for your organization