Scenario Overview
| Category | What It Covers | Agent Response |
|---|---|---|
| GPU Memory (OOM) | Out-of-memory errors during training | Auto-fix configuration, resubmit |
| Distributed Training | NCCL timeouts, rank failures, communication errors | Diagnose network/config issues, resubmit with fixes |
| Hardware Faults | ECC errors, PCIe failures, thermal throttling | Isolate faulty hardware, migrate jobs, escalate for repair |
| Scheduling & Preemption | Capacity exhaustion, spot reclamation, taint mismatches | Resubmit, adjust priority, or wait for capacity |
| Ray Workloads | Ray Train, Serve, Data, and cluster-level failures | Framework-aware diagnosis and remediation |
| Stuck and Cascading Failures | Hanging jobs, multi-step failures, dependency chains | Multi-turn investigation, correlated diagnosis |
| Capacity Planning | Oversubscription, demand forecasting, utilization gaps | Analysis and recommendations for team leads |
GPU Memory (OOM)
Out-of-memory errors are the most common GPU training failure. The agent doesn't just detect them; it identifies the specific cause and applies the right fix.
Batch Size OOM
**Signal:** Job fails with `CUDA out of memory` during the forward pass. GPU memory at 98%+ utilization.
**Diagnosis:** Batch size too large for the model size and available GPU memory.
**Action:** Reduce the batch size, enable gradient checkpointing, and resubmit with the corrected configuration.
**Example:** LLaMA 7B fine-tuning on an A100-80GB with `batch_size=64` fails. The agent reduces to `batch_size=16` with gradient checkpointing enabled and resubmits.
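As a rough sketch, the batch-size remediation above can be modeled as a halving heuristic. The function name, the memory model, and the checkpointing savings factor are all illustrative assumptions, not the agent's actual logic:

```python
def fit_batch_size(batch_size: int, mem_per_sample_gb: float,
                   fixed_mem_gb: float, gpu_mem_gb: float,
                   ckpt_factor: float = 0.4) -> tuple[int, bool]:
    """Shrink the batch size (trying gradient checkpointing first) until the
    estimated footprint fits GPU memory. Returns (batch_size, use_checkpointing).
    The linear memory model and 0.4 checkpointing factor are illustrative."""
    def est(bs: int, ckpt: bool) -> float:
        # fixed cost (weights, optimizer state) + per-sample activation cost
        return fixed_mem_gb + bs * mem_per_sample_gb * (ckpt_factor if ckpt else 1.0)

    use_ckpt = False
    while est(batch_size, use_ckpt) > gpu_mem_gb and batch_size > 1:
        if not use_ckpt:
            use_ckpt = True       # cheapest fix: trade compute for memory
        else:
            batch_size //= 2      # then halve the batch until it fits
    return batch_size, use_ckpt
```

With numbers mimicking the example above (50 GB fixed, 2.5 GB/sample, 80 GB GPU), `fit_batch_size(64, 2.5, 50, 80)` lands on batch size 16 with checkpointing enabled.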
DataLoader Memory Exhaustion
**Signal:** OOM occurs during data loading, not the model forward pass. Container memory limit hit.
**Diagnosis:** Too many DataLoader workers consuming host memory.
**Action:** Reduce `num_workers` and resubmit.
**Example:** 16 DataLoader workers on a node with limited host memory. The agent reduces to 4 workers and resubmits.
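A hedged sketch of the worker-count adjustment, assuming a simple per-worker host-memory estimate (the helper name and all numbers are hypothetical):

```python
def safe_num_workers(host_mem_gb: float, per_worker_gb: float,
                     reserved_gb: float = 4.0, max_workers: int = 16) -> int:
    """Cap DataLoader workers so their combined host-memory footprint
    (per_worker_gb each, an assumed estimate) leaves reserved_gb free
    for the rest of the container."""
    budget = host_mem_gb - reserved_gb
    return max(1, min(max_workers, int(budget // per_worker_gb)))
```

For a node with 36 GB of host memory and workers estimated at 8 GB each, `safe_num_workers(36, 8)` yields 4 workers, matching the example above.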
Shared Memory Exhaustion
Activation Memory Overflow
**Signal:** OOM during the backward pass with large sequence lengths.
**Diagnosis:** Activation memory exceeds GPU capacity without checkpointing.
**Action:** Enable gradient checkpointing and mixed precision training (`--fp16`), then resubmit.
Distributed Training
Multi-GPU and multi-node training introduces communication complexity. The agent diagnoses network, synchronization, and hardware issues across distributed training setups.
Network Interface Mismatch
**Signal:** NCCL AllReduce timeout. Training hangs after initialization.
**Diagnosis:** NCCL defaulting to the loopback interface instead of the cluster network.
**Action:** Set the `NCCL_SOCKET_IFNAME=eth0` environment variable, enable NCCL debug logging, and resubmit.
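The fix can be sketched as an environment patch. `NCCL_SOCKET_IFNAME` and `NCCL_DEBUG` are real NCCL environment variables; the helper itself and the `eth0` default are illustrative:

```python
def patch_nccl_env(env: dict[str, str], iface: str = "eth0") -> dict[str, str]:
    """Return a copy of the job's environment with NCCL pinned to the
    cluster-facing interface and debug logging enabled."""
    fixed = dict(env)
    fixed["NCCL_SOCKET_IFNAME"] = iface   # stop NCCL from picking loopback
    fixed["NCCL_DEBUG"] = "INFO"          # surface interface selection in logs
    return fixed
```

The patched environment would then be applied to the resubmitted job spec.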
Rank Synchronization Deadlock
**Signal:** Training hangs at a specific step. Some ranks complete while others wait.
**Diagnosis:** Conditional code paths causing rank divergence in DDP training: not all ranks execute the same collective operations.
**Action:** Identify the divergent code path, recommend a fix, and escalate to a developer if a code change is required.
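One way such divergence can be spotted, sketched here with a hypothetical per-rank trace of collective-operation counts (not the agent's actual mechanism):

```python
from collections import Counter

def divergent_ranks(collective_counts: dict[int, int]) -> list[int]:
    """Flag ranks whose number of issued collective ops disagrees with the
    majority, a symptom of conditional code paths skipping an all-reduce
    on some ranks. The trace format is a hypothetical example."""
    majority, _ = Counter(collective_counts.values()).most_common(1)[0]
    return sorted(r for r, n in collective_counts.items() if n != majority)
```

A rank that skipped one collective stands out immediately: `divergent_ranks({0: 120, 1: 120, 2: 119, 3: 120})` returns `[2]`.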
InfiniBand Failure
**Signal:** NCCL falls back to TCP with 100x slower communication. Training continues but at unacceptable speed.
**Diagnosis:** InfiniBand link failure on a specific node.
**Action:** Cancel the job, resubmit with node exclusion, and escalate for IB hardware repair.
Thermal Throttling Stragglers
**Signal:** Distributed training slows progressively. NCCL timeouts occur after hours of training.
**Diagnosis:** A GPU on one node is throttling due to overheating (92°C+), creating a straggler that slows all ranks.
**Action:** Cancel the job, resubmit excluding the overheating node, and escalate for cooling system inspection.
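A minimal sketch of straggler detection from per-node step times; the node names and the 1.5x threshold are assumptions:

```python
def find_stragglers(step_times_s: dict[str, float], slack: float = 1.5) -> list[str]:
    """Flag nodes whose per-step time exceeds the median by `slack`x.
    In synchronous data parallelism every rank waits for the slowest one,
    so a single throttling GPU drags the whole job down to its pace."""
    times = sorted(step_times_s.values())
    median = times[len(times) // 2]
    return [n for n, t in step_times_s.items() if t > slack * median]
```

The flagged node would then be excluded on resubmission, as described above.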
Hardware Faults
GPU hardware issues require quick detection and isolation to prevent cascading failures.
GPU ECC Memory Errors
**Signal:** CUDA errors during computation. Xid errors in kernel logs.
**Diagnosis:** ECC uncorrectable memory errors indicating GPU memory degradation.
**Action:** Cancel affected jobs, migrate them to healthy GPUs, mark the node for maintenance, and escalate for hardware replacement.
PCIe Bus Failure (Xid 79)
**Signal:** `Xid 79: GPU has fallen off the bus` in system logs.
**Diagnosis:** PCIe link failure; the GPU is no longer communicating with the host.
**Action:** Drain the node, resubmit affected jobs to other nodes, and escalate for hardware inspection.
NVLink Bridge Errors
**Signal:** Degraded GPU-to-GPU communication bandwidth on multi-GPU nodes.
**Diagnosis:** NVLink bridge errors causing fallback to PCIe for inter-GPU communication.
**Action:** Migrate multi-GPU workloads to healthy nodes and schedule node maintenance.
Scheduling and Preemption
The agent handles the full lifecycle of workload scheduling issues.
Cluster Capacity Exhaustion
**Signal:** Job stuck in the `PENDING` state. No available GPUs match the request.
**Diagnosis:** All matching GPUs are allocated. The agent checks for elastic workloads that could be preempted.
**Action:** If the stuck workload is reserved: preempt eligible elastic workloads. If elastic: queue it with an estimated wait time and notify the team.
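A toy version of the preemption choice, assuming a smallest-first policy over elastic jobs (the job names and the policy itself are illustrative, not the agent's documented behavior):

```python
def pick_preemptions(elastic_jobs: list[tuple[str, int]], gpus_needed: int) -> list[str]:
    """Choose elastic jobs to preempt, smallest GPU count first, until enough
    GPUs are freed for a reserved workload. Returns [] if the target
    cannot be met, in which case the workload must queue instead."""
    freed, chosen = 0, []
    for name, gpus in sorted(elastic_jobs, key=lambda j: j[1]):
        if freed >= gpus_needed:
            break
        chosen.append(name)
        freed += gpus
    return chosen if freed >= gpus_needed else []
```

Smallest-first minimizes disruption per freed GPU; other policies (e.g., preempting the youngest job) are equally plausible.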
Spot Instance Reclamation
**Signal:** Job preempted due to cloud provider spot instance reclamation.
**Diagnosis:** Spot interruption notice followed by pod termination.
**Action:** Resubmit the job with the same configuration. If reclamation is frequent: recommend switching to reserved capacity.
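The switch-to-reserved recommendation might look like a simple frequency rule; the 24-hour window and threshold here are invented policy knobs, not documented defaults:

```python
from datetime import datetime, timedelta

def recommend_reserved(reclaim_times: list[datetime],
                       window: timedelta = timedelta(hours=24),
                       threshold: int = 3) -> bool:
    """Recommend moving off spot capacity if a job was reclaimed more than
    `threshold` times inside the trailing window."""
    if not reclaim_times:
        return False
    cutoff = max(reclaim_times) - window
    return sum(t >= cutoff for t in reclaim_times) > threshold
```

Below the threshold, the agent simply resubmits with the same configuration.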
Taint/Toleration Mismatch
**Signal:** Pod unschedulable despite available GPU resources.
**Diagnosis:** Node taints don't match pod tolerations in the workload manifest.
**Action:** Add the correct tolerations to the manifest and resubmit.
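A simplified toleration check over plain dicts. Real Kubernetes matching also handles the `Exists` operator and empty keys; this sketch covers only `Equal`-style matches:

```python
def untolerated_taints(node_taints: list[dict], tolerations: list[dict]) -> list[dict]:
    """Return node taints not covered by the pod's tolerations, using a
    simplified key/value/effect comparison."""
    def tolerated(taint: dict) -> bool:
        return any(t.get("key") == taint["key"]
                   and t.get("value") == taint.get("value")
                   # a toleration with no effect tolerates any effect
                   and t.get("effect") in (taint["effect"], None)
                   for t in tolerations)
    return [t for t in node_taints if not tolerated(t)]
```

Any taints this returns are candidates for the tolerations the agent would add to the manifest.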
Image Pull Failures
**Signal:** `ImagePullBackOff` or `ErrImagePull` in Kubernetes events.
**Diagnosis:** Registry authentication expired or image tag not found.
**Action:** If it's an auth issue: escalate (requires a credential update). If it's a tag issue: recommend the correct tag.
Ray Workloads
First-class support for the Ray distributed computing framework, covering Ray Train, Ray Serve, Ray Data, and cluster management.
Ray Train Worker OOM
**Signal:** Ray worker killed by the memory monitor during distributed training.
**Diagnosis:** Worker memory limit too low for the model size, or missing memory optimization (DeepSpeed ZeRO, gradient checkpointing).
**Action:** Increase worker memory allocation or enable memory optimization, then resubmit the RayJob.
Ray Serve Model Loading OOM
**Signal:** Ray Serve replica fails to start. Model loading exceeds replica memory.
**Diagnosis:** Model too large for the configured replica resources.
**Action:** Increase replica GPU memory, or configure model parallelism across replicas.
RayJob Stuck in Pending
**Signal:** RayJob CR created but the Ray cluster never reaches the running state.
**Diagnosis:** `nodeSelector` targeting the wrong level (e.g., pod-level instead of node-level) or an autoscaler deadlock.
**Action:** Fix the YAML structure and resubmit. For autoscaler issues: diagnose scaling constraints and recommend configuration changes.
Ray GCS Crash
**Signal:** All Ray workers disconnect. The cluster becomes unresponsive.
**Diagnosis:** Global Control Store (GCS) crash on the head node, typically from head-node OOM.
**Action:** If GCS fault tolerance is enabled: wait for recovery. If not: cancel and resubmit with GCS FT configured and more head-node memory.
Ray Data Backpressure
**Signal:** GPU utilization drops to near zero despite the data pipeline running.
**Diagnosis:** A Ray Data operator is producing data faster than GPU training can consume it, filling the object store.
**Action:** Configure backpressure limits and adjust data pipeline parallelism to match GPU throughput.
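A naive sketch of matching producer parallelism to consumer throughput. This is a simple proportional rule under assumed throughput numbers; real Ray Data backpressure tuning has more inputs:

```python
import math

def match_parallelism(producer_rows_per_s: float, consumer_rows_per_s: float,
                      current_workers: int) -> int:
    """Scale the producer stage down so its throughput roughly matches what
    the GPU consumers can absorb, instead of filling the object store."""
    if producer_rows_per_s <= consumer_rows_per_s:
        return current_workers  # consumers keep up; no change needed
    ratio = consumer_rows_per_s / producer_rows_per_s
    return max(1, math.floor(current_workers * ratio))
```

For example, a producer at 4000 rows/s feeding consumers at 1000 rows/s with 16 workers would be scaled to 4.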
Stuck and Cascading Failures
Some failures require multi-step investigation across multiple data sources and time windows.
Training Job Stalled
**Signal:** Job status is `RUNNING` but loss metrics haven't updated in 30+ minutes.
**Diagnosis:** The agent checks GPU utilization (near 0% suggests a deadlock or I/O stall), network metrics (for stalled communication), and recent log entries to pinpoint where the job is stuck.
**Action:** Based on the diagnosis: fix the I/O bottleneck, restart communication, or cancel and resubmit with a corrected config.
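The checks above can be caricatured as a small decision rule. The thresholds and returned labels are illustrative, not the agent's actual policy:

```python
def triage_stall(gpu_util_pct: float, net_mbps: float,
                 minutes_since_update: float) -> str:
    """Classify a possibly-stalled job from coarse telemetry."""
    if minutes_since_update < 30:
        return "healthy"                 # metrics still fresh enough
    if gpu_util_pct < 5 and net_mbps < 1:
        return "deadlock-or-io-stall"    # nothing computing, nothing moving
    if gpu_util_pct < 5:
        return "communication-stall"     # network active but GPUs idle
    return "slow-progress"               # GPUs busy; likely throughput issue
```

Each label would map to a different remediation path (restart, resubmit, or deeper log inspection).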
Multi-Job Cascade on Node Crash
**Signal:** Multiple jobs fail simultaneously with different error messages.
**Diagnosis:** The agent correlates the failures to a single node going offline rather than treating each as an independent issue.
**Action:** Identify the root cause (node crash), resubmit all affected jobs with node exclusion, and escalate for node repair.
Dependency Chain Failures
**Signal:** Training job blocked, but the error traces back through preprocessing to an expired credential.
**Diagnosis:** Multi-hop investigation: training is blocked by a preprocessing failure, and preprocessing is blocked by expired S3 credentials.
**Action:** Escalate the credential refresh, then resubmit the full pipeline.
Intermittent Failures with Shifting Symptoms
**Signal:** The same job fails repeatedly with different error messages at roughly the same training step.
**Diagnosis:** The agent correlates across attempts and discovers the underlying cause (e.g., thermal throttling reducing GPU clocks) is the same despite the different surface errors.
**Action:** Address the root cause rather than chasing individual error messages.
Capacity Planning
Beyond reactive incident handling, the agent provides proactive capacity intelligence.
Demand Forecasting
**Signal:** Team consistently hitting queue limits during peak hours.
**Diagnosis:** The agent analyzes submission patterns over time, identifying peak demand windows, utilization gaps, and oversubscription trends.
**Recommendations:** GPU allocation adjustments, time-slot scheduling for peak hours, and priority queuing for short-duration jobs. Escalates to team leads for capacity decisions.
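A toy version of the peak-window analysis over submission timestamps, using hour-of-day counts only; real forecasting would also weight by requested GPUs and job duration:

```python
from collections import Counter

def peak_hours(submission_hours: list[int], top_n: int = 3) -> list[int]:
    """Identify the busiest submission hours from a history of job
    submissions (each entry is an hour of day, 0-23)."""
    counts = Counter(submission_hours)
    return sorted(h for h, _ in counts.most_common(top_n))
```

The resulting windows could feed time-slot scheduling recommendations like those listed above.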
Utilization Optimization
**Signal:** Heartbeat detects GPUs sitting idle while jobs are queued in other teams.
**Diagnosis:** Identifies underutilized allocations and teams that could benefit from elastic burst capacity.
**Recommendations:** Rebalancing suggestions, elastic capacity-sharing policies, and right-sizing of team allocations.
Next Steps
- **Safety & Governance**: How the agent keeps your infrastructure safe with five layers of guardrails
- **Getting Started**: Enable AI Ops for your organization

