Private Beta — The AI Ops Agent is currently in private beta and available to select customers. To request access, contact us or reach out to your account representative.
Get the Chamber AI Ops Agent running for your organization in minutes. The agent connects to your existing Chamber deployment — no additional cluster-side components required.

Prerequisites

  • A Chamber account with admin access
  • At least one Kubernetes cluster with the Chamber agent installed
  • A Slack workspace for your team

Setup

1. Enable AI Ops in the Chamber Console

Navigate to Settings > AI Ops Agent in the Chamber dashboard. Toggle Enable AI Ops Agent to on.

2. Connect to Slack

Click Add to Slack to authorize the AI Ops Agent in your workspace. This creates the @chambie-ops bot, which is separate from any existing Chamber integrations.

Select the channels where the agent should post:
  • Alerts channel — real-time incident notifications (e.g., #ml-ops-alerts)
  • Daily reports channel — scheduled summaries (e.g., #ml-ops-daily)
  • Approvals channel — approval requests for high-risk actions (e.g., #ml-ops-approvals)
  • Escalation channel — unresolved issues that need human attention (e.g., #ml-ops-escalation)
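
The channel selection above amounts to a mapping from notification type to Slack channel. The following is a minimal sketch of that mapping, not Chamber's actual configuration format: the event-type keys (`alert`, `daily_report`, `approval`, `escalation`) and the `channel_for` helper are hypothetical, and the channel names are the examples from the list.

```python
# Hypothetical routing table. The keys are illustrative names, not Chamber
# identifiers; the channel names are the examples given in the list above.
CHANNELS = {
    "alert": "#ml-ops-alerts",
    "daily_report": "#ml-ops-daily",
    "approval": "#ml-ops-approvals",
    "escalation": "#ml-ops-escalation",
}


def channel_for(event_type: str) -> str:
    """Return the Slack channel for an event type, or raise on unknown types."""
    try:
        return CHANNELS[event_type]
    except KeyError:
        raise ValueError(f"unknown event type: {event_type!r}") from None
```

Keeping approvals and escalations in their own channels means high-risk requests are not buried under routine alerts.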

3. Configure Your Tenant Policy

Set the operational boundaries for the agent. The default policy allows read-only operations and requires approval for all write actions, which is a safe starting point.

You can customize:
| Setting | Default | Description |
| --- | --- | --- |
| Heartbeat interval | 5 minutes | How often the agent checks your infrastructure |
| Daily budget limit | $500 | Maximum GPU spend the agent can authorize per day |
| Monthly budget limit | $10,000 | Maximum GPU spend per month |
| Max GPUs per auto-submit | 4 | Jobs up to this size can be auto-submitted (Tier 2) |
| Tier 3 approval channel | | Slack channel for high-risk approval requests |
| Tier 4 approvers | | Specific users who can approve destructive actions |
| Approval timeout | 30 minutes | How long before unanswered approvals escalate |
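
To show how these limits interact, here is a rough Python sketch of the numeric settings and two checks that could be derived from them. Everything in it is an assumption made for illustration: the `TenantPolicy` class, its field names, and both helper functions are hypothetical, and the sketch models only the GPU cap, budget limits, and approval timeout, not the approval tiers themselves.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TenantPolicy:
    # Defaults mirror the settings table; the field names are hypothetical.
    heartbeat_interval: timedelta = timedelta(minutes=5)
    daily_budget_usd: float = 500.0
    monthly_budget_usd: float = 10_000.0
    max_gpus_per_auto_submit: int = 4
    approval_timeout: timedelta = timedelta(minutes=30)


def can_auto_submit(
    policy: TenantPolicy,
    gpus: int,
    est_cost_usd: float,
    spent_today_usd: float,
    spent_month_usd: float,
) -> bool:
    """True if the job fits the GPU cap and both budget limits."""
    return (
        gpus <= policy.max_gpus_per_auto_submit
        and spent_today_usd + est_cost_usd <= policy.daily_budget_usd
        and spent_month_usd + est_cost_usd <= policy.monthly_budget_usd
    )


def approval_should_escalate(
    policy: TenantPolicy, requested_at: datetime, now: datetime
) -> bool:
    """True once an approval request has gone unanswered past the timeout."""
    return now - requested_at >= policy.approval_timeout
```

Under these assumptions, a 4-GPU job estimated at $100 passes when $300 has been spent today, while the same job estimated at $300 does not, because it would push the day's total past $500.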

4. Verify the Agent is Running

Once enabled, the agent starts its first heartbeat check within 5 minutes. You’ll see a message in your alerts channel:
@chambie-ops connected

Monitoring: 3 capacity pools, 48 GPUs, 12 active workloads
Policy: Tier 0-2 auto, Tier 3+ approval required
Heartbeat: every 5 minutes
Daily report: 9:00 AM to #ml-ops-daily

All systems healthy. HEARTBEAT_OK
If everything is healthy, the agent will be silent until something needs attention.
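
If you want to check the connection message from a script rather than by eye, the following Python sketch extracts the counts and the health marker. It assumes you already have the message text as a string (fetching it from Slack is not covered), and the pattern is based only on the example message above; the real format may vary. The function name `parse_connected_message` is hypothetical.

```python
import re

# Pattern derived from the example "Monitoring:" line; the actual message
# format is an assumption and is not guaranteed to be stable.
MONITORING_RE = re.compile(
    r"Monitoring:\s*(\d+) capacity pools?,\s*(\d+) GPUs?,\s*(\d+) active workloads?"
)


def parse_connected_message(text: str) -> dict:
    """Extract monitored counts and health status from the startup message."""
    match = MONITORING_RE.search(text)
    if match is None:
        raise ValueError("no 'Monitoring:' line found")
    pools, gpus, workloads = (int(group) for group in match.groups())
    return {
        "capacity_pools": pools,
        "gpus": gpus,
        "active_workloads": workloads,
        # Treat the HEARTBEAT_OK marker as the health signal.
        "healthy": "HEARTBEAT_OK" in text,
    }
```

Comparing the parsed counts against what you expect in the Chamber dashboard is a quick way to confirm the agent sees all of your clusters.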
Start with the default policy (read-only auto, approval for writes) and gradually expand autonomous permissions as you build confidence in the agent’s behavior.

What Happens Next

Once running, the AI Ops Agent immediately begins:
  1. Monitoring your active workloads, GPU health, and capacity utilization
  2. Posting daily cost reports to your configured channel
  3. Responding to failures as they occur — diagnosing, recommending, and acting within your policy
  4. Building its knowledge base — learning the patterns specific to your infrastructure
The AI Ops Agent is a separate Slack bot (@chambie-ops) from Chamber’s conversational assistant (@chambie). The conversational assistant handles ad-hoc questions and manual tasks. The AI Ops Agent handles autonomous monitoring and remediation.

Adjusting the Policy Over Time

As the agent proves itself, you can gradually expand its autonomous capabilities:
| Phase | Recommended Policy | What Changes |
| --- | --- | --- |
| Week 1 | Read-only auto, all writes need approval | Observe how the agent diagnoses issues |
| Week 2-4 | Auto-submit small elastic jobs (up to 4 GPUs) | Agent handles routine OOM resubmissions automatically |
| Month 2+ | Auto-submit up to 16 GPUs, auto-resubmit preempted jobs | Agent handles the majority of routine incidents |
| Ongoing | Tune based on your comfort level | Review the audit trail and expand or restrict as needed |
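
The phased rollout can be summarized as a lookup from how long the agent has been enabled to a suggested auto-submit GPU cap. The sketch below is one way to encode it. The function name and the day boundaries (Week 1 as days 0-6, Week 2-4 as days 7-27, Month 2+ as day 28 onward) are assumptions, and the "Ongoing" row is left out because it depends on your own review of the audit trail.

```python
def recommended_auto_submit_gpus(days_since_enabled: int) -> int:
    """Map rollout age to the auto-submit GPU cap suggested by the phases.

    Assumed boundaries: Week 1 = days 0-6, Week 2-4 = days 7-27,
    Month 2+ = day 28 onward. Returns 0 while all writes need approval.
    """
    if days_since_enabled < 0:
        raise ValueError("days_since_enabled must be non-negative")
    if days_since_enabled < 7:
        return 0  # Week 1: read-only auto, all writes need approval
    if days_since_enabled < 28:
        return 4  # Week 2-4: small elastic jobs
    return 16  # Month 2+: larger jobs


print(recommended_auto_submit_gpus(10))
```

Treat the result as a suggestion to review, not a schedule to apply automatically; the phases are recommendations, and you may prefer to hold at an earlier phase for longer.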

Next Steps

  • AI Ops Overview: Learn what the AI Ops Agent does and why it matters
  • Supported Scenarios: See the full range of failures the agent can handle
  • Safety & Governance: Deep dive into the five-layer safety model
  • Slack Integration: Configure your existing Slack integration