Scenario Overview
| Category | What It Covers | Agent Response |
|---|---|---|
| GPU Memory (OOM) | Out-of-memory errors during training | Auto-fix configuration, resubmit |
| Distributed Training | NCCL timeouts, rank failures, communication errors | Diagnose network/config issues, resubmit with fixes |
| Hardware Faults | ECC errors, PCIe failures, thermal throttling | Isolate faulty hardware, migrate jobs, escalate for repair |
| Scheduling & Preemption | Capacity exhaustion, spot reclamation, taint mismatches | Resubmit, adjust priority, or wait for capacity |
| Ray Workloads | Ray Train, Serve, Data, and cluster-level failures | Framework-aware diagnosis and remediation |
| Stuck and Cascading Failures | Hanging jobs, multi-step failures, dependency chains | Multi-turn investigation, correlated diagnosis |
| Capacity Planning | Oversubscription, demand forecasting, utilization gaps | Analysis and recommendations for team leads |
GPU Memory (OOM)
Out-of-memory errors are the most common GPU training failure. The agent doesn't just detect them; it identifies the specific cause and applies the right fix.
Batch Size OOM
**Signal:** Job fails with `CUDA out of memory` during the forward pass. GPU memory at 98%+ utilization.
**Diagnosis:** Batch size too large for the model size and available GPU memory.
**Action:** Reduce the batch size, enable gradient checkpointing, and resubmit with the corrected configuration.
**Example:** LLaMA 7B fine-tuning on an A100-80GB with `batch_size=64` fails. The agent reduces to `batch_size=16` with gradient checkpointing enabled and resubmits.
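As a rough sketch, the batch-size remediation above can be modeled as a halving heuristic. The function name, the memory model, and the checkpointing savings factor are all illustrative assumptions, not the agent's actual logic:

```python
def fit_batch_size(batch_size: int, mem_per_sample_gb: float,
                   fixed_mem_gb: float, gpu_mem_gb: float,
                   ckpt_factor: float = 0.4) -> tuple[int, bool]:
    """Shrink the batch size (trying gradient checkpointing first) until the
    estimated footprint fits GPU memory. Returns (batch_size, use_checkpointing).
    The linear memory model and 0.4 checkpointing factor are illustrative."""
    def est(bs: int, ckpt: bool) -> float:
        # fixed cost (weights, optimizer state) + per-sample activation cost
        return fixed_mem_gb + bs * mem_per_sample_gb * (ckpt_factor if ckpt else 1.0)

    use_ckpt = False
    while est(batch_size, use_ckpt) > gpu_mem_gb and batch_size > 1:
        if not use_ckpt:
            use_ckpt = True       # cheapest fix: trade compute for memory
        else:
            batch_size //= 2      # then halve the batch until it fits
    return batch_size, use_ckpt
```

With numbers mimicking the example above (50 GB fixed, 2.5 GB/sample, 80 GB GPU), `fit_batch_size(64, 2.5, 50, 80)` lands on batch size 16 with checkpointing enabled.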
DataLoader Memory Exhaustion
**Signal:** OOM occurs during data loading, not the model forward pass. Container memory limit hit.
**Diagnosis:** Too many DataLoader workers consuming host memory.
**Action:** Reduce `num_workers` and resubmit.
**Example:** 16 DataLoader workers on a node with limited host memory. The agent reduces to 4 workers and resubmits.
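A hedged sketch of the worker-count adjustment, assuming a simple per-worker host-memory estimate (the helper name and all numbers are hypothetical):

```python
def safe_num_workers(host_mem_gb: float, per_worker_gb: float,
                     reserved_gb: float = 4.0, max_workers: int = 16) -> int:
    """Cap DataLoader workers so their combined host-memory footprint
    (per_worker_gb each, an assumed estimate) leaves reserved_gb free
    for the rest of the container."""
    budget = host_mem_gb - reserved_gb
    return max(1, min(max_workers, int(budget // per_worker_gb)))
```

For a node with 36 GB of host memory and workers estimated at 8 GB each, `safe_num_workers(36, 8)` yields 4 workers, matching the example above.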
Shared Memory Exhaustion
Activation Memory Overflow
**Signal:** OOM during the backward pass with large sequence lengths.
**Diagnosis:** Activation memory exceeds GPU capacity without checkpointing.
**Action:** Enable gradient checkpointing and mixed precision training (`--fp16`), then resubmit.
Distributed Training
Multi-GPU and multi-node training introduces communication complexity. The agent diagnoses network, synchronization, and hardware issues across distributed training setups.
Network Interface Mismatch
**Signal:** NCCL AllReduce timeout. Training hangs after initialization.
**Diagnosis:** NCCL defaulting to the loopback interface instead of the cluster network.
**Action:** Set the `NCCL_SOCKET_IFNAME=eth0` environment variable, enable NCCL debug logging, and resubmit.
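The fix can be sketched as an environment patch. `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are real NCCL environment variables; the helper itself and the `eth0` default are illustrative:

```python
def patch_nccl_env(env: dict[str, str], iface: str = "eth0") -> dict[str, str]:
    """Return a copy of the job's environment with NCCL pinned to the
    cluster-facing interface and debug logging enabled."""
    fixed = dict(env)
    fixed["NCCL_SOCKET_IFNAME"] = iface   # stop NCCL from picking loopback
    fixed["NCCL_DEBUG"] = "INFO"          # surface interface selection in logs
    return fixed
```

The patched environment would then be applied to the resubmitted job spec.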
Rank Synchronization Deadlock
**Signal:** Training hangs at a specific step. Some ranks complete while others wait.
**Diagnosis:** Conditional code paths causing rank divergence in DDP training: not all ranks execute the same collective operations.
**Action:** Identify the divergent code path, recommend a fix, and escalate to a developer if a code change is required.
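One way such divergence can be spotted, sketched here with a hypothetical per-rank trace of collective-operation counts (not the agent's actual mechanism):

```python
from collections import Counter

def divergent_ranks(collective_counts: dict[int, int]) -> list[int]:
    """Flag ranks whose number of issued collective ops disagrees with the
    majority, a symptom of conditional code paths skipping an all-reduce
    on some ranks. The trace format is a hypothetical example."""
    majority, _ = Counter(collective_counts.values()).most_common(1)[0]
    return sorted(r for r, n in collective_counts.items() if n != majority)
```

A rank that skipped one collective stands out immediately: `divergent_ranks({0: 120, 1: 120, 2: 119, 3: 120})` returns `[2]`.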
InfiniBand Failure
**Signal:** NCCL falls back to TCP with 100x slower communication. Training continues but at unacceptable speed.
**Diagnosis:** InfiniBand link failure on a specific node.
**Action:** Cancel the job, resubmit with node exclusion, and escalate for IB hardware repair.
Thermal Throttling Stragglers
**Signal:** Distributed training slows progressively. NCCL timeouts occur after hours of training.
**Diagnosis:** A GPU on one node is throttling due to overheating (92°C+), creating a straggler that slows all ranks.
**Action:** Cancel the job, resubmit excluding the overheating node, and escalate for cooling system inspection.
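A minimal sketch of straggler detection from per-node step times; the node names and the 1.5x threshold are assumptions:

```python
def find_stragglers(step_times_s: dict[str, float], slack: float = 1.5) -> list[str]:
    """Flag nodes whose per-step time exceeds the median by `slack`x.
    In synchronous data parallelism every rank waits for the slowest one,
    so a single throttling GPU drags the whole job down to its pace."""
    times = sorted(step_times_s.values())
    median = times[len(times) // 2]
    return [n for n, t in step_times_s.items() if t > slack * median]
```

The flagged node would then be excluded on resubmission, as described above.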
Hardware Faults
GPU hardware issues require quick detection and isolation to prevent cascading failures.
GPU ECC Memory Errors
**Signal:** CUDA errors during computation. Xid errors in kernel logs.
**Diagnosis:** ECC uncorrectable memory errors indicating GPU memory degradation.
**Action:** Cancel affected jobs, migrate them to healthy GPUs, mark the node for maintenance, and escalate for hardware replacement.
PCIe Bus Failure (Xid 79)
**Signal:** `Xid 79: GPU has fallen off the bus` in system logs.
**Diagnosis:** PCIe link failure; the GPU is no longer communicating with the host.
**Action:** Drain the node, resubmit affected jobs to other nodes, and escalate for hardware inspection.
NVLink Bridge Errors
**Signal:** Degraded GPU-to-GPU communication bandwidth on multi-GPU nodes.
**Diagnosis:** NVLink bridge errors causing fallback to PCIe for inter-GPU communication.
**Action:** Migrate multi-GPU workloads to healthy nodes and schedule node maintenance.
Scheduling and Preemption
The agent handles the full lifecycle of workload scheduling issues.
Cluster Capacity Exhaustion
**Signal:** Job stuck in the `PENDING` state. No available GPUs match the request.
**Diagnosis:** All matching GPUs are allocated. The agent checks for elastic workloads that could be preempted.
**Action:** If the stuck workload is reserved: preempt eligible elastic workloads. If elastic: queue it with an estimated wait time and notify the team.
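A toy version of the preemption choice, assuming a smallest-first policy over elastic jobs (the job names and the policy itself are illustrative, not the agent's documented behavior):

```python
def pick_preemptions(elastic_jobs: list[tuple[str, int]], gpus_needed: int) -> list[str]:
    """Choose elastic jobs to preempt, smallest GPU count first, until enough
    GPUs are freed for a reserved workload. Returns [] if the target
    cannot be met, in which case the workload must queue instead."""
    freed, chosen = 0, []
    for name, gpus in sorted(elastic_jobs, key=lambda j: j[1]):
        if freed >= gpus_needed:
            break
        chosen.append(name)
        freed += gpus
    return chosen if freed >= gpus_needed else []
```

Smallest-first minimizes disruption per freed GPU; other policies (e.g., preempting the youngest job) are equally plausible.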
Spot Instance Reclamation
**Signal:** Job preempted due to cloud provider spot instance reclamation.
**Diagnosis:** Spot interruption notice followed by pod termination.
**Action:** Resubmit the job with the same configuration. If reclamation is frequent: recommend switching to reserved capacity.
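The switch-to-reserved recommendation might look like a simple frequency rule; the 24-hour window and threshold here are invented policy knobs, not documented defaults:

```python
from datetime import datetime, timedelta

def recommend_reserved(reclaim_times: list[datetime],
                       window: timedelta = timedelta(hours=24),
                       threshold: int = 3) -> bool:
    """Recommend moving off spot capacity if a job was reclaimed more than
    `threshold` times inside the trailing window."""
    if not reclaim_times:
        return False
    cutoff = max(reclaim_times) - window
    return sum(t >= cutoff for t in reclaim_times) > threshold
```

Below the threshold, the agent simply resubmits with the same configuration.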
Taint/Toleration Mismatch
**Signal:** Pod unschedulable despite available GPU resources.
**Diagnosis:** Node taints don't match pod tolerations in the workload manifest.
**Action:** Add the correct tolerations to the manifest and resubmit.
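A simplified toleration check over plain dicts. Real Kubernetes matching also handles the `Exists` operator and empty keys; this sketch covers only `Equal`-style matches:

```python
def untolerated_taints(node_taints: list[dict], tolerations: list[dict]) -> list[dict]:
    """Return node taints not covered by the pod's tolerations, using a
    simplified key/value/effect comparison."""
    def tolerated(taint: dict) -> bool:
        return any(t.get("key") == taint["key"]
                   and t.get("value") == taint.get("value")
                   # a toleration with no effect tolerates any effect
                   and t.get("effect") in (taint["effect"], None)
                   for t in tolerations)
    return [t for t in node_taints if not tolerated(t)]
```

Any taints this returns are candidates for the tolerations the agent would add to the manifest.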
Image Pull Failures
**Signal:** `ImagePullBackOff` or `ErrImagePull` in Kubernetes events.
**Diagnosis:** Registry authentication expired or image tag not found.
**Action:** If it's an auth issue: escalate (requires a credential update). If it's a tag issue: recommend the correct tag.
Ray Workloads
First-class support for the Ray distributed computing framework, covering Ray Train, Ray Serve, Ray Data, and cluster management.
Ray Train Worker OOM
**Signal:** Ray worker killed by the memory monitor during distributed training.
**Diagnosis:** Worker memory limit too low for the model size, or missing memory optimization (DeepSpeed ZeRO, gradient checkpointing).
**Action:** Increase worker memory allocation or enable memory optimization, then resubmit the RayJob.
Ray Serve Model Loading OOM
**Signal:** Ray Serve replica fails to start. Model loading exceeds replica memory.
**Diagnosis:** Model too large for the configured replica resources.
**Action:** Increase replica GPU memory, or configure model parallelism across replicas.
RayJob Stuck in Pending
**Signal:** RayJob CR created but the Ray cluster never reaches the running state.
**Diagnosis:** `nodeSelector` targeting the wrong level (e.g., pod-level instead of node-level) or an autoscaler deadlock.
**Action:** Fix the YAML structure and resubmit. For autoscaler issues: diagnose scaling constraints and recommend configuration changes.
Ray GCS Crash
**Signal:** All Ray workers disconnect. The cluster becomes unresponsive.
**Diagnosis:** Global Control Store (GCS) crash on the head node, typically from head-node OOM.
**Action:** If GCS fault tolerance is enabled: wait for recovery. If not: cancel and resubmit with GCS FT configured and more head-node memory.
Ray Data Backpressure
**Signal:** GPU utilization drops to near zero despite the data pipeline running.
**Diagnosis:** A Ray Data operator is producing data faster than GPU training can consume it, filling the object store.
**Action:** Configure backpressure limits and adjust data pipeline parallelism to match GPU throughput.
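A naive sketch of matching producer parallelism to consumer throughput. This is a simple proportional rule under assumed throughput numbers; real Ray Data backpressure tuning has more inputs:

```python
import math

def match_parallelism(producer_rows_per_s: float, consumer_rows_per_s: float,
                      current_workers: int) -> int:
    """Scale the producer stage down so its throughput roughly matches what
    the GPU consumers can absorb, instead of filling the object store."""
    if producer_rows_per_s <= consumer_rows_per_s:
        return current_workers  # consumers keep up; no change needed
    ratio = consumer_rows_per_s / producer_rows_per_s
    return max(1, math.floor(current_workers * ratio))
```

For example, a producer at 4000 rows/s feeding consumers at 1000 rows/s with 16 workers would be scaled to 4.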
Stuck and Cascading Failures
Some failures require multi-step investigation across multiple data sources and time windows.
Training Job Stalled
**Signal:** Job status is `RUNNING` but loss metrics haven't updated in 30+ minutes.
**Diagnosis:** The agent checks GPU utilization (near 0% suggests a deadlock or I/O stall), network metrics (for stalled communication), and recent log entries to pinpoint where the job is stuck.
**Action:** Based on the diagnosis: fix the I/O bottleneck, restart communication, or cancel and resubmit with a corrected config.
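The checks above can be caricatured as a small decision rule. The thresholds and returned labels are illustrative, not the agent's actual policy:

```python
def triage_stall(gpu_util_pct: float, net_mbps: float,
                 minutes_since_update: float) -> str:
    """Classify a possibly-stalled job from coarse telemetry."""
    if minutes_since_update < 30:
        return "healthy"                 # metrics still fresh enough
    if gpu_util_pct < 5 and net_mbps < 1:
        return "deadlock-or-io-stall"    # nothing computing, nothing moving
    if gpu_util_pct < 5:
        return "communication-stall"     # network active but GPUs idle
    return "slow-progress"               # GPUs busy; likely throughput issue
```

Each label would map to a different remediation path (restart, resubmit, or deeper log inspection).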
Multi-Job Cascade on Node Crash
**Signal:** Multiple jobs fail simultaneously with different error messages.
**Diagnosis:** The agent correlates the failures to a single node going offline rather than treating each as an independent issue.
**Action:** Identify the root cause (node crash), resubmit all affected jobs with node exclusion, and escalate for node repair.
Dependency Chain Failures
**Signal:** Training job blocked, but the error traces back through preprocessing to an expired credential.
**Diagnosis:** Multi-hop investigation: training is blocked by a preprocessing failure, and preprocessing is blocked by expired S3 credentials.
**Action:** Escalate the credential refresh, then resubmit the full pipeline.
Intermittent Failures with Shifting Symptoms
**Signal:** The same job fails repeatedly with different error messages at roughly the same training step.
**Diagnosis:** The agent correlates across attempts and discovers the underlying cause (e.g., thermal throttling reducing GPU clocks) is the same despite the different surface errors.
**Action:** Address the root cause rather than chasing individual error messages.
Capacity Planning
Beyond reactive incident handling, the agent provides proactive capacity intelligence.
Demand Forecasting
**Signal:** Team consistently hitting queue limits during peak hours.
**Diagnosis:** The agent analyzes submission patterns over time, identifying peak demand windows, utilization gaps, and oversubscription trends.
**Recommendations:** GPU allocation adjustments, time-slot scheduling for peak hours, and priority queuing for short-duration jobs. Escalates to team leads for capacity decisions.
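A toy version of the peak-window analysis over submission timestamps, using hour-of-day counts only; real forecasting would also weight by requested GPUs and job duration:

```python
from collections import Counter

def peak_hours(submission_hours: list[int], top_n: int = 3) -> list[int]:
    """Identify the busiest submission hours from a history of job
    submissions (each entry is an hour of day, 0-23)."""
    counts = Counter(submission_hours)
    return sorted(h for h, _ in counts.most_common(top_n))
```

The resulting windows could feed time-slot scheduling recommendations like those listed above.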
Utilization Optimization
**Signal:** Heartbeat detects GPUs sitting idle while jobs are queued in other teams.
**Diagnosis:** Identifies underutilized allocations and teams that could benefit from elastic burst capacity.
**Recommendations:** Rebalancing suggestions, elastic capacity-sharing policies, and right-sizing of team allocations.
Next Steps
- **Safety & Governance**: How the agent keeps your infrastructure safe with five layers of guardrails
- **Getting Started**: Enable AI Ops for your organization

