Autonomous agents operating on GPU infrastructure must be safe by design. The Chamber AI Ops Agent uses a five-layer defense-in-depth model — every action passes through multiple independent safety checks before execution. No single layer’s failure can result in an unauthorized action.

Five-Layer Defense-in-Depth

| Layer | What It Does | Cannot Be Bypassed By |
| --- | --- | --- |
| Layer 1: Credential Isolation | Each organization’s agent runs with its own scoped credentials. No cross-tenant access is possible at the infrastructure level. | Any client-side code |
| Layer 2: Tenant Policy | A deny-by-default policy defines exactly which actions the agent can take for your organization. Anything not explicitly allowed is blocked. | Agent reasoning or prompts |
| Layer 3: CLI Guardrails | Budget checks, scope validation, and rate limits built into the Chamber CLI. Every write action is validated before execution. | Policy misconfiguration |
| Layer 4: Human Approval | High-risk actions require explicit human approval via Slack before execution. Approvals are cryptographically bound and single-use. | Automated processes |
| Layer 5: Server Guardrails | Chamber’s API server enforces final budget limits, scope validation, and rate limits. This is the ultimate safety net: even if all client-side layers fail, the server rejects unauthorized actions. | Nothing (this is the final gate) |

Layers 1, 2, and 5 are always active and cannot be disabled. Layers 3 and 4 provide additional defense for write operations.
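
The layering idea can be illustrated with a minimal Python sketch. This is not Chamber's implementation: the `Action` type, the three check functions, the tenant name `acme`, and the limits inside them are all hypothetical stand-ins for the always-on layers (1, 2, and 5). The point is only that each layer decides independently, so an action must pass every one of them.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Action:
    tenant: str
    tool: str        # e.g. "workloads.submit"
    gpus: int = 0


# A check returns None to pass, or a string naming why it rejected the action.
Check = Callable[[Action], Optional[str]]


def credential_isolation(action: Action) -> Optional[str]:
    # Layer 1 (hypothetical): this agent only holds credentials for "acme".
    return None if action.tenant == "acme" else "layer 1: wrong tenant"


def tenant_policy(action: Action) -> Optional[str]:
    # Layer 2 (hypothetical allow-list): anything not listed is denied.
    allowed = {"workloads.list", "workloads.submit"}
    return None if action.tool in allowed else "layer 2: tool not allowed"


def server_guardrails(action: Action) -> Optional[str]:
    # Layer 5 (hypothetical cap): the server re-checks limits on its own.
    return None if action.gpus <= 64 else "layer 5: GPU cap exceeded"


ALWAYS_ON: Sequence[Check] = (credential_isolation, tenant_policy, server_guardrails)


def first_rejection(action: Action, layers: Sequence[Check] = ALWAYS_ON) -> Optional[str]:
    """Run every layer in order; return the first rejection, or None if all pass."""
    for check in layers:
        reason = check(action)
        if reason is not None:
            return reason
    return None
```

Calling `first_rejection` with a layer list that omits `tenant_policy` simulates one layer failing open; an oversized request is still rejected by the server-side check in this sketch.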

Risk-Tiered Action Model

Every action the agent can take is classified into a risk tier. Higher tiers require progressively more safety checks.

| Tier | Category | Examples | Policy |
| --- | --- | --- | --- |
| 0: Observe | Read-only queries | List workloads, get metrics, check GPU status, read logs | Auto-execute |
| 1: Inform | Analysis and insights | Explain failures, analyze capacity trends, generate reports | Auto-execute, notify team |
| 2: Low Write | Safe mutations | Submit small jobs (up to 4 GPUs), create allocations (up to 8 instances) | Dry-run first, then execute |
| 3: High Write | Significant mutations | Submit large jobs (over 4 GPUs), transfer allocations, create reservations | Human approval required |
| 4: Destructive | Irreversible actions | Cancel running jobs, release allocations | Dual approval required, mandatory audit |
| 5: Prohibited | Dangerous operations | Delete pools, destroy infrastructure | Hard blocked; not available to the agent |
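
As a rough illustration of how the tiers map to handling, here is a small Python sketch. The `Tier` enum and both helper functions are hypothetical names. The 4-GPU boundary comes from the tier descriptions on this page, and the sketch assumes "human approval" means one approver while "dual approval" means two.

```python
from enum import IntEnum


class Tier(IntEnum):
    OBSERVE = 0
    INFORM = 1
    LOW_WRITE = 2
    HIGH_WRITE = 3
    DESTRUCTIVE = 4
    PROHIBITED = 5


def classify_submit(gpus: int) -> Tier:
    """Job submissions: up to 4 GPUs is Tier 2, anything larger is Tier 3."""
    return Tier.LOW_WRITE if gpus <= 4 else Tier.HIGH_WRITE


def required_approvals(tier: Tier) -> int:
    """Number of human approvals needed before the action may execute."""
    if tier == Tier.PROHIBITED:
        # Tier 5 is hard blocked, so no number of approvals unlocks it.
        raise PermissionError("Tier 5 actions are not available to the agent")
    if tier == Tier.DESTRUCTIVE:
        return 2  # assumption: dual approval means two distinct approvers
    if tier == Tier.HIGH_WRITE:
        return 1  # assumption: a single approver suffices for Tier 3
    return 0      # Tiers 0 to 2 run without human approval
```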

Dry-Run Enforcement

For Tier 2 actions, the agent always performs a dry-run first. The dry-run shows exactly what will happen — resources consumed, configuration applied, estimated cost — before any mutation occurs. Only after the dry-run passes validation does the agent execute.
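
The dry-run-then-execute pattern might look like the following Python sketch. Everything here is illustrative: `DryRunPlan`, `dry_run_submit`, `submit_with_dry_run`, and the price of 2.0 USD per GPU-hour are assumptions, not part of the Chamber CLI.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DryRunPlan:
    gpus: int
    hours: float
    estimated_cost_usd: float
    errors: List[str] = field(default_factory=list)


def dry_run_submit(gpus: int, hours: float,
                   usd_per_gpu_hour: float = 2.0, max_gpus: int = 4) -> DryRunPlan:
    """Compute what a submission would consume, without mutating anything."""
    plan = DryRunPlan(gpus, hours, gpus * hours * usd_per_gpu_hour)
    if gpus > max_gpus:
        plan.errors.append(f"{gpus} GPUs exceeds the Tier 2 cap of {max_gpus}")
    return plan


def submit_with_dry_run(gpus: int, hours: float,
                        execute: Callable[[DryRunPlan], None]) -> Tuple[DryRunPlan, bool]:
    """Always dry-run first; call execute only if the plan has no errors."""
    plan = dry_run_submit(gpus, hours)
    if plan.errors:
        return plan, False
    execute(plan)
    return plan, True
```

The `execute` callback is where a real mutation would happen. Passing it in keeps the validation step free of side effects.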

Human Approval Workflow

For Tier 3 and Tier 4 actions, the agent sends an approval request to your team via Slack:
@chambie-ops requests approval:

Action:    Cancel job llama-finetune-v3 (job-a1b2c3d4)
Reason:    GPU ECC errors detected on assigned node. Job producing
           corrupted gradients. Recommend cancel and resubmit on
           healthy hardware.
Risk Tier: 4 (Destructive)
Evidence:  12 uncorrectable ECC errors in last 10 minutes.
           Loss spiked to NaN at step 4,521.

[Approve]  [Deny]

Expires in 30 minutes. If no response, escalates to #ml-ops-escalation.
Approval security:
  • Approval buttons are cryptographically signed — they cannot be forged or replayed
  • Each approval is single-use — clicking Approve twice has no effect
  • Approvals expire after 30 minutes (configurable) — unresolved requests escalate automatically
  • Approver identity is verified — only authorized team members can approve Tier 4 actions
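
One common way to obtain these four properties is an HMAC-signed token with an expiry and a single-use nonce. The Python sketch below shows that general technique only. This page does not say which signing scheme Chamber uses, and the names `issue`, `redeem`, `SECRET`, and the token fields are hypothetical.

```python
import hashlib
import hmac
import secrets
import time

SECRET = secrets.token_bytes(32)   # hypothetical server-side signing key
_used_nonces = set()               # in-memory for the sketch; real systems persist this


def _sign(action_id: str, approver: str, expires_at: int, nonce: str) -> str:
    # Assumes the fields themselves never contain the "|" separator.
    msg = f"{action_id}|{approver}|{expires_at}|{nonce}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue(action_id: str, approver: str, ttl_seconds: int = 30 * 60, now=None) -> dict:
    """Create a signed approval token bound to one action and one approver."""
    now = time.time() if now is None else now
    expires_at = int(now) + ttl_seconds
    nonce = secrets.token_hex(16)
    return {
        "action_id": action_id, "approver": approver,
        "expires_at": expires_at, "nonce": nonce,
        "sig": _sign(action_id, approver, expires_at, nonce),
    }


def redeem(token: dict, clicked_by: str, authorized: set, now=None) -> bool:
    """Return True only for a genuine, unexpired, unused token from an authorized approver."""
    now = time.time() if now is None else now
    expected = _sign(token["action_id"], token["approver"],
                     token["expires_at"], token["nonce"])
    if not hmac.compare_digest(expected, token["sig"]):
        return False                      # forged or tampered
    if now > token["expires_at"]:
        return False                      # expired
    if clicked_by != token["approver"] or clicked_by not in authorized:
        return False                      # wrong or unauthorized approver
    if token["nonce"] in _used_nonces:
        return False                      # replay: already redeemed once
    _used_nonces.add(token["nonce"])
    return True
```

Because the signature covers the action ID, a token issued for one job cannot be reused to approve a different one, and the nonce set makes a second click a no-op.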

Deny-by-Default Tenant Policy

Your organization’s tenant policy defines exactly what the agent is allowed to do. Everything else is denied.
# Example tenant policy
version: "1.0"

allowed_tools:
  # Read-only operations — always allowed
  tier_0:
    - workloads.list
    - workloads.get
    - metrics.get
    - failure_analysis.diagnose
    - gpus.status
    - logs.get

  # Analysis — allowed with notification
  tier_1:
    - workloads.explain
    - insights.analyze
    - capacity.plan

  # Low-risk writes — allowed with constraints
  tier_2:
    - workloads.submit:
        max_gpus: 4
        job_type: [elastic]
    - allocations.create:
        max_instances: 8

  # High-risk writes — require approval
  tier_3:
    - workloads.submit:
        max_gpus: 64
    - allocations.transfer
    - reservations.create

  # Destructive — require dual approval
  tier_4:
    - workloads.cancel
    - allocations.release

budget:
  daily_limit_usd: 500
  monthly_limit_usd: 10000

approval:
  tier_3_channel: "#ml-ops-approvals"
  tier_4_approvers: ["@alice", "@bob"]
  timeout_minutes: 30
  escalation_channel: "#ml-ops-escalation"
The tenant policy is version-controlled and auditable. Changes to the policy require manual configuration — the agent cannot modify its own permissions.
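
To make deny-by-default concrete, the following Python sketch evaluates requests against a dictionary that mirrors part of the example policy above. The `resolve_tier` function and its resolution rule (pick the lowest tier whose constraints admit the request) are assumptions made for illustration; the actual policy engine may resolve tiers differently.

```python
from typing import Optional

# A subset of the example policy, written as a plain dict to avoid a YAML dependency.
POLICY = {
    "tier_0": {"workloads.list": {}, "workloads.get": {}, "metrics.get": {}},
    "tier_2": {
        "workloads.submit": {"max_gpus": 4, "job_type": ["elastic"]},
        "allocations.create": {"max_instances": 8},
    },
    "tier_3": {"workloads.submit": {"max_gpus": 64}, "allocations.transfer": {}},
    "tier_4": {"workloads.cancel": {}, "allocations.release": {}},
}


def resolve_tier(tool: str, gpus: int = 0, instances: int = 0,
                 job_type: Optional[str] = None) -> Optional[str]:
    """Return the lowest tier that admits the request, or None if it is denied."""
    for tier in sorted(POLICY):
        constraints = POLICY[tier].get(tool)
        if constraints is None:
            continue                                  # tool not listed in this tier
        if gpus > constraints.get("max_gpus", float("inf")):
            continue
        if instances > constraints.get("max_instances", float("inf")):
            continue
        allowed_types = constraints.get("job_type")
        if allowed_types is not None and job_type not in allowed_types:
            continue
        return tier
    return None                                       # deny by default
```

Note that there is no "deny" list anywhere: a tool such as `pools.delete` is blocked simply because no tier mentions it.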

Audit Trail

Every action the agent takes is logged with full context:

| Field | Description |
| --- | --- |
| Action | What the agent did (e.g., workloads.submit) |
| Risk Tier | The action’s risk classification |
| Reason | Why the agent took this action (root-cause description) |
| Evidence | The logs, metrics, and events that informed the decision |
| Approver | Who approved the action (for Tier 3 and Tier 4 actions) |
| Outcome | Whether the action succeeded and its effect |
| Timestamp | When the action occurred |

Audit records are retained for compliance review and can be exported for integration with your existing governance tools.
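
For teams planning an export integration, an audit record with these fields could be modeled roughly as below. This is a sketch under assumptions: the actual export schema is not documented on this page, so the `AuditRecord` class, its snake_case field names, and the JSON-lines encoding are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    action: str                    # e.g. "workloads.cancel"
    risk_tier: int
    reason: str
    evidence: List[str]
    outcome: str
    timestamp: str                 # ISO 8601, UTC
    approvers: List[str] = field(default_factory=list)  # empty below Tier 3


def to_json_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```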

Incident Deduplication

When multiple signals detect the same issue — for example, a real-time event and the next heartbeat both notice a failed job — the agent deduplicates them. Only one remediation runs per incident. If two team members try to address the same issue from different Slack threads, the second is told the issue is already being handled and is offered updates on the resolution.
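
A simple way to picture deduplication is a tracker keyed by the affected resource and the kind of problem. The Python sketch below is illustrative only; `IncidentTracker`, the key shape, and the source labels are hypothetical, and the real agent may key incidents differently.

```python
from typing import Dict, List, Tuple


class IncidentTracker:
    """Tracks open incidents so each one triggers at most one remediation."""

    def __init__(self) -> None:
        self._open: Dict[Tuple[str, str], List[str]] = {}

    def report(self, resource_id: str, kind: str, source: str) -> bool:
        """Record a signal. Return True if the caller should start remediation."""
        key = (resource_id, kind)
        if key in self._open:
            # Already being handled: remember the extra signal, start nothing new.
            self._open[key].append(source)
            return False
        self._open[key] = [source]
        return True

    def sources(self, resource_id: str, kind: str) -> List[str]:
        """All signals attached to an open incident, in arrival order."""
        return list(self._open.get((resource_id, kind), []))

    def resolve(self, resource_id: str, kind: str) -> None:
        """Close the incident so a later recurrence is treated as new."""
        self._open.pop((resource_id, kind), None)
```

A `False` return is the cue to tell the second reporter that the issue is already being handled.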

Guardrail Enforcement Examples

  • A resubmission would exceed the daily budget limit. The agent blocks the action, notifies the team, and suggests waiting until the next budget cycle or requesting a budget increase.
  • A job requests 128 GPUs but the tenant policy caps individual jobs at 64. The agent blocks the request locally, so it never reaches the API.
  • Releasing an allocation that has running workloads is a Tier 4 action. The agent sends a dual-approval request showing which workloads would be affected. Both designated approvers must approve before execution.
  • The agent encounters a failure it hasn’t seen before and cannot confidently diagnose. Rather than guessing, it escalates to your team with the evidence it has collected (logs, metrics, and its best hypothesis) for human review.
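
The budget scenario can be sketched as a pre-flight check in Python. The function name `budget_violation` is hypothetical, and the default limits are taken from the example tenant policy on this page (500 USD daily, 10,000 USD monthly). How the real guardrails account for spend is not specified here.

```python
from typing import Optional


def budget_violation(cost_usd: float, spent_today_usd: float, spent_month_usd: float,
                     daily_limit_usd: float = 500.0,
                     monthly_limit_usd: float = 10000.0) -> Optional[str]:
    """Return a reason if the action would exceed a budget limit, else None."""
    if spent_today_usd + cost_usd > daily_limit_usd:
        return "daily budget limit exceeded"
    if spent_month_usd + cost_usd > monthly_limit_usd:
        return "monthly budget limit exceeded"
    return None
```

In this sketch an action that lands exactly on the limit is allowed; only spend strictly above the limit is blocked.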

Next Steps

  • Getting Started: Enable AI Ops and configure your tenant policy
  • How It Works: Learn about the three-trigger proactivity model and diagnosis flow