Private Beta — The AI Ops Agent is currently in private beta and available to select customers. To request access, contact us or reach out to your account representative.
Prerequisites
- A Chamber account with admin access
- At least one Kubernetes cluster with the Chamber agent installed
- A Slack workspace for your team
Setup
Enable AI Ops in the Chamber Console
Navigate to Settings > AI Ops Agent in the Chamber dashboard. Toggle Enable AI Ops Agent to on.
Connect to Slack
Click Add to Slack to authorize the AI Ops Agent in your workspace. This creates the @chambie-ops bot in your Slack, separate from any existing Chamber integrations.
Select the channels where the agent should post:
- Alerts channel: real-time incident notifications (e.g., #ml-ops-alerts)
- Daily reports channel: scheduled summaries (e.g., #ml-ops-daily)
- Approvals channel: approval requests for high-risk actions (e.g., #ml-ops-approvals)
- Escalation channel: unresolved issues that need human attention (e.g., #ml-ops-escalation)
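The channel split above amounts to a routing table from notification type to Slack channel. A minimal Python sketch of that idea follows, using the example channel names from this page. The `CHANNELS` mapping and the `channel_for` helper are hypothetical illustrations, not part of Chamber's API; the real selection is made in the console.

```python
# Hypothetical routing table: notification kind -> Slack channel.
# Channel names are the examples used on this page; yours may differ.
CHANNELS = {
    "alert": "#ml-ops-alerts",
    "daily_report": "#ml-ops-daily",
    "approval": "#ml-ops-approvals",
    "escalation": "#ml-ops-escalation",
}


def channel_for(kind: str) -> str:
    """Return the channel for a notification kind; raise on unknown kinds."""
    try:
        return CHANNELS[kind]
    except KeyError:
        raise ValueError(f"unknown notification kind: {kind!r}") from None
```

Keeping approvals and escalations in their own channels means the people who must act on them are not competing with routine alert traffic.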
Configure Your Tenant Policy
Set the operational boundaries for the agent. The default policy allows read-only operations and requires approval for all write actions, which is a safe starting point.
You can customize:
| Setting | Default | Description |
|---|---|---|
| Heartbeat interval | 5 minutes | How often the agent checks your infrastructure |
| Daily budget limit | $500 | Maximum GPU spend the agent can authorize per day |
| Monthly budget limit | $10,000 | Maximum GPU spend per month |
| Max GPUs per auto-submit | 4 | Jobs up to this size can be auto-submitted (Tier 2) |
| Tier 3 approval channel | — | Slack channel for high-risk approval requests |
| Tier 4 approvers | — | Specific users who can approve destructive actions |
| Approval timeout | 30 minutes | How long before unanswered approvals escalate |
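To make the interaction between these settings concrete, here is a rough Python sketch that models the defaults from the table and checks whether a job would fall inside the auto-submit size and budget limits. The `TenantPolicy` class, its field names, and both helper functions are assumptions made for illustration; this page does not show Chamber's actual policy schema or how the agent evaluates it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TenantPolicy:
    """Hypothetical model of the settings table; field names are assumptions."""
    heartbeat_interval: timedelta = timedelta(minutes=5)
    daily_budget_usd: float = 500.0
    monthly_budget_usd: float = 10_000.0
    max_gpus_per_auto_submit: int = 4
    approval_timeout: timedelta = timedelta(minutes=30)


def within_auto_submit_limits(
    policy: TenantPolicy,
    gpus: int,
    est_cost_usd: float,
    spent_today_usd: float,
    spent_month_usd: float,
) -> bool:
    """True if the job fits the GPU cap and both budget limits (assumed logic)."""
    return (
        gpus <= policy.max_gpus_per_auto_submit
        and spent_today_usd + est_cost_usd <= policy.daily_budget_usd
        and spent_month_usd + est_cost_usd <= policy.monthly_budget_usd
    )


def approval_timed_out(
    policy: TenantPolicy, requested_at: datetime, now: datetime
) -> bool:
    """True once an unanswered approval has waited the full timeout (assumed >=)."""
    return now - requested_at >= policy.approval_timeout
```

For example, with the defaults, a 4-GPU job estimated at $100 fits when $300 has already been spent that day, while an 8-GPU job exceeds the GPU cap regardless of cost.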
What Happens Next
Once running, the AI Ops Agent immediately begins:
- Monitoring your active workloads, GPU health, and capacity utilization
- Posting daily cost reports to your configured channel
- Responding to failures as they occur — diagnosing, recommending, and acting within your policy
- Building its knowledge base — learning the patterns specific to your infrastructure
The AI Ops Agent is a separate Slack bot (@chambie-ops) from Chamber’s conversational assistant (@chambie). The conversational assistant handles ad-hoc questions and manual tasks. The AI Ops Agent handles autonomous monitoring and remediation.
Adjusting the Policy Over Time
As the agent proves itself, you can gradually expand its autonomous capabilities:
| Phase | Recommended Policy | What Changes |
|---|---|---|
| Week 1 | Read-only auto, all writes need approval | Observe how the agent diagnoses issues |
| Week 2-4 | Auto-submit small elastic jobs (up to 4 GPUs) | Agent handles routine OOM resubmissions automatically |
| Month 2+ | Auto-submit up to 16 GPUs, auto-resubmit preempted jobs | Agent handles the majority of routine incidents |
| Ongoing | Tune based on your comfort level | Review the audit trail and expand or restrict as needed |
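The phased rollout in the table can be written down as data, which makes it easier to review alongside the audit trail. The Python sketch below is illustrative only: the `Phase` type, the `PHASES` list, and the day boundaries (week 1 as days 0 to 6, weeks 2-4 as days 7 to 27, month 2 starting at day 28) are assumptions, and the open-ended "Ongoing" row is not modeled.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    name: str
    start_day: int                  # first day (0-based) since enabling the agent
    max_auto_submit_gpus: int       # 0 means every write needs approval
    auto_resubmit_preempted: bool


# Hypothetical encoding of the recommended phases; boundaries are assumptions.
PHASES = [
    Phase("Week 1", 0, 0, False),
    Phase("Week 2-4", 7, 4, False),
    Phase("Month 2+", 28, 16, True),
]


def phase_for_day(day: int) -> Phase:
    """Return the latest phase whose start_day is on or before the given day."""
    if day < 0:
        raise ValueError("day must be non-negative")
    current = PHASES[0]
    for phase in PHASES:
        if day >= phase.start_day:
            current = phase
    return current
```

These are recommendations rather than automatic transitions, so treat the lookup as a reminder of what to consider at each stage, not as something that should change the policy on its own.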
Next Steps
AI Ops Overview
Learn what the AI Ops Agent does and why it matters
Supported Scenarios
See the full range of failures the agent can handle
Safety & Governance
Deep dive into the five-layer safety model
Slack Integration
Configure your existing Slack integration

