Three-Trigger Proactivity Model
Most monitoring tools are reactive — they wait for an alert, then notify you. The AI Ops Agent is proactive. It uses three complementary trigger mechanisms to ensure nothing slips through the cracks.

Heartbeat — Catching What Alerts Miss
Every 5 minutes, the agent checks in on your infrastructure and asks: Is anything wrong? This catches the failures that don’t produce alerts — GPUs sitting idle, training jobs running slower than expected, costs drifting above budget, or utilization gradually degrading. If everything is healthy, the heartbeat is silent. No noise, no unnecessary messages.

The heartbeat interval is configurable per organization. 5 minutes is the default — you can set it anywhere from 1 to 30 minutes.
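As a minimal sketch (function and parameter names here are hypothetical, not the agent's actual API), one heartbeat tick amounts to: run the health check, and speak only when something is wrong:

```python
def heartbeat_tick(check_health, notify):
    """One heartbeat: run the health check, notify only when issues exist.

    check_health and notify are illustrative callables, not the agent's real API.
    """
    issues = check_health()  # e.g. idle GPUs, slow jobs, budget drift
    if issues:
        notify(issues)       # one message listing the problems
        return True
    return False             # silent when everything is healthy
```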
Scheduled Tasks — Reports and Reviews on Your Schedule
Recurring operational tasks run on a schedule you define:

| Task | Default Schedule | What It Does |
|---|---|---|
| Daily cost report | 9:00 AM | GPU spend breakdown, trends, anomalies |
| Weekly capacity review | Monday 10:00 AM | Utilization trends, demand forecast, optimization recommendations |
| Training loop check | Every 5 minutes | Monitors active training jobs for failures or stalls |
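The defaults above can be pictured as a simple schedule table. This sketch (a hypothetical structure, not the product's configuration format) checks whether a task is due at a given time:

```python
from datetime import datetime

# Illustrative encoding of two defaults from the table above
SCHEDULES = {
    "daily_cost_report":      {"weekday": None, "hour": 9,  "minute": 0},
    "weekly_capacity_review": {"weekday": 0,    "hour": 10, "minute": 0},  # 0 = Monday
}

def is_due(task, now):
    """Return True if the task's schedule matches the given datetime."""
    s = SCHEDULES[task]
    if s["weekday"] is not None and now.weekday() != s["weekday"]:
        return False
    return (now.hour, now.minute) == (s["hour"], s["minute"])
```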
Event-Driven — Real-Time Response
When something happens in your cluster, the agent responds immediately:

| Event | Response Time | Agent Action |
|---|---|---|
| Job failure | < 30 seconds | Diagnose root cause, attempt auto-fix or escalate |
| GPU hardware fault | < 30 seconds | Assess blast radius, migrate affected jobs, request maintenance |
| Budget threshold (80%) | < 1 minute | Summarize spending, alert team, recommend adjustments |
| Pod eviction | < 30 seconds | Determine cause, resubmit if appropriate |
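The table above is essentially a dispatch map. A hedged sketch (event type names and the fallback are assumptions):

```python
def handle_event(event):
    """Route a cluster event to the agent action from the table above (sketch)."""
    actions = {
        "job_failure":      "diagnose root cause, attempt auto-fix or escalate",
        "gpu_fault":        "assess blast radius, migrate jobs, request maintenance",
        "budget_threshold": "summarize spending, alert team, recommend adjustments",
        "pod_eviction":     "determine cause, resubmit if appropriate",
    }
    # Unknown events still get a response: escalate rather than drop silently
    return actions.get(event["type"], "escalate to a human")
```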
Diagnosis Flow
When the agent detects an issue — whether from a heartbeat, scheduled check, or real-time event — it follows a structured diagnosis flow.

Observe
The agent collects evidence before making any decisions:

- Workload details — job configuration, GPU type, resource requests
- Logs — recent error messages and stack traces
- Metrics — GPU memory, utilization, temperature over time
- Kubernetes events — pod status, node conditions, scheduling events
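The four evidence sources can be gathered into one snapshot before diagnosis begins; the field names here are illustrative, not the agent's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Everything the agent observes before it diagnoses (illustrative fields)."""
    workload: dict = field(default_factory=dict)    # job config, GPU type, requests
    logs: list = field(default_factory=list)        # recent errors, stack traces
    metrics: dict = field(default_factory=dict)     # GPU memory, utilization, temps
    k8s_events: list = field(default_factory=list)  # pod status, node conditions

    def is_complete(self):
        """Diagnosis should wait until at least logs and metrics are in hand."""
        return bool(self.logs) and bool(self.metrics)
```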
Diagnose
Using the collected evidence, the agent performs root-cause analysis — not just pattern matching on error strings. It correlates across data sources to identify the actual cause. For example, an “NCCL timeout” error might be caused by:

- A network interface misconfiguration (fix: set NCCL_SOCKET_IFNAME)
- An InfiniBand hardware failure (fix: migrate to healthy nodes)
- A thermal-throttled GPU creating a straggler (fix: exclude overheating node)
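The NCCL-timeout example above can be expressed as correlation over evidence rather than string matching. The thresholds and dictionary keys in this sketch are assumptions:

```python
def diagnose_nccl_timeout(evidence):
    """Pick the most likely root cause for an NCCL timeout (illustrative rules)."""
    if evidence.get("ib_link_errors", 0) > 0:        # InfiniBand counters rising
        return "infiniband_failure", "migrate to healthy nodes"
    hot = [n for n, t in evidence.get("gpu_temps", {}).items() if t >= 90]
    if hot:                                          # throttled GPU creates a straggler
        return "thermal_throttle", f"exclude overheating node {hot[0]}"
    if not evidence.get("nccl_socket_ifname"):       # env var not set
        return "network_misconfig", "set NCCL_SOCKET_IFNAME"
    return "unknown", "escalate for human review"
```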
Recommend, Act, Verify
Based on the diagnosis and the action’s risk tier, the agent takes one of three paths:

- Auto-executes low-risk fixes (e.g., resubmit with corrected batch size)
- Requests approval for high-risk actions (e.g., cancel a running job)
- Escalates when human judgment is needed (e.g., hardware replacement)
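The three outcomes map directly onto risk tiers. A minimal sketch, assuming a simple low/high/other classification:

```python
def route_action(risk_tier):
    """Map an action's risk tier to how it is carried out (illustrative)."""
    routes = {
        "low":  "auto_execute",      # e.g. resubmit with corrected batch size
        "high": "request_approval",  # e.g. cancel a running job
    }
    # Anything outside the known tiers needs human judgment
    return routes.get(risk_tier, "escalate")
```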
Tiered Intelligence
The agent uses different AI model tiers depending on the complexity of the task, optimizing for both speed and cost:

| Tier | Used For | Examples |
|---|---|---|
| Fast | Heartbeat checks, event triage, status queries | “Are all jobs healthy?” — answered in milliseconds |
| Reasoning | Root-cause analysis, pattern matching, planning | “Why did this distributed training job fail?” — deep multi-source analysis |
| Critical | Destructive decisions, novel failures, complex multi-step remediation | “Should we cancel this 64-GPU job and reallocate?” — careful reasoning with full context |
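Tier selection can be read as a complexity heuristic. The task categories come from the table above; the function itself is a sketch, not the agent's routing logic:

```python
def pick_model_tier(task):
    """Choose a model tier by task complexity (heuristic sketch)."""
    fast = {"heartbeat_check", "event_triage", "status_query"}
    reasoning = {"root_cause_analysis", "pattern_matching", "planning"}
    if task in fast:
        return "fast"       # millisecond answers
    if task in reasoning:
        return "reasoning"  # deep multi-source analysis
    return "critical"       # destructive or novel decisions get full context
```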
Memory and Learning
The AI Ops Agent doesn’t just fix problems — it learns from them. Over time, it builds a knowledge base specific to your infrastructure.

Pattern Bank
Every resolved incident becomes a reusable pattern.

Failure Journal
When a fix doesn’t work, the agent records what went wrong and adds a prevention rule.

How Expertise Compounds
After several months of operation, the pattern bank becomes a comprehensive runbook specific to your infrastructure — covering the failure modes, GPU types, and workload patterns unique to your environment.

Notification Philosophy
The agent communicates through Slack with a clear philosophy: inform without overwhelming.

- 3-5 messages per incident, not dozens
- Batched updates — status changes are grouped, not streamed
- Silent when healthy — heartbeats produce no output when everything is fine
- Structured reports — daily and weekly summaries go to designated channels
- Approval requests — interactive Slack buttons for actions that need human sign-off
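The "batched updates" point can be made concrete: group status changes per incident so each incident produces one message instead of a stream. A sketch, assuming each update carries an incident id:

```python
from itertools import groupby

def batch_by_incident(updates):
    """Group status updates by incident id so each incident yields one message."""
    # groupby only groups adjacent items, so sort by incident first
    ordered = sorted(updates, key=lambda u: u["incident"])
    return {incident: [u["status"] for u in group]
            for incident, group in groupby(ordered, key=lambda u: u["incident"])}
```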
Next Steps
- Supported Scenarios — see the full range of failures the agent detects and resolves
- Safety & Governance — learn about the five-layer safety model and risk-tiered actions

