Five-Layer Defense-in-Depth
| Layer | What It Does | Cannot Be Bypassed By |
|---|---|---|
| Layer 1: Credential Isolation | Each organization’s agent runs with its own scoped credentials. No cross-tenant access is possible at the infrastructure level. | Any client-side code |
| Layer 2: Tenant Policy | A deny-by-default policy defines exactly which actions the agent can take for your organization. Anything not explicitly allowed is blocked. | Agent reasoning or prompts |
| Layer 3: CLI Guardrails | Budget checks, scope validation, and rate limits built into the Chamber CLI. Every write action is validated before execution. | Policy misconfiguration |
| Layer 4: Human Approval | High-risk actions require explicit human approval via Slack before execution. Approvals are cryptographically bound and single-use. | Automated processes |
| Layer 5: Server Guardrails | Chamber’s API server enforces final budget limits, scope validation, and rate limits. This is the ultimate safety net — even if all client-side layers fail, the server rejects unauthorized actions. | Nothing — this is the final gate |
Layers 1, 2, and 5 are always active and cannot be disabled. Layers 3 and 4 provide additional defense for write operations.
Risk-Tiered Action Model
Every action the agent can take is classified into a risk tier. Higher tiers require progressively more safety checks.

| Tier | Category | Examples | Policy |
|---|---|---|---|
| 0 — Observe | Read-only queries | List workloads, get metrics, check GPU status, read logs | Auto-execute |
| 1 — Inform | Analysis and insights | Explain failures, analyze capacity trends, generate reports | Auto-execute, notify team |
| 2 — Low Write | Safe mutations | Submit small jobs (up to 4 GPUs), create allocations (up to 8 instances) | Dry-run first, then execute |
| 3 — High Write | Significant mutations | Submit large jobs (over 4 GPUs), transfer allocations, create reservations | Human approval required |
| 4 — Destructive | Irreversible actions | Cancel running jobs, release allocations | Dual approval required, mandatory audit |
| 5 — Prohibited | Dangerous operations | Delete pools, destroy infrastructure | Hard blocked — not available to the agent |
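
The tier table above can be read as a lookup from tier to policy. The following is a minimal sketch of that reading, not Chamber's implementation: the names `RiskTier`, `TIER_POLICY`, `classify_job_submission`, and `approvals_required` are hypothetical, and only the tier numbers, the 4-GPU threshold, and the policy wording come from the table.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    OBSERVE = 0
    INFORM = 1
    LOW_WRITE = 2
    HIGH_WRITE = 3
    DESTRUCTIVE = 4
    PROHIBITED = 5


# Policy per tier, paraphrased from the table above.
TIER_POLICY = {
    RiskTier.OBSERVE: "auto-execute",
    RiskTier.INFORM: "auto-execute, notify team",
    RiskTier.LOW_WRITE: "dry-run first, then execute",
    RiskTier.HIGH_WRITE: "human approval required",
    RiskTier.DESTRUCTIVE: "dual approval required, mandatory audit",
    RiskTier.PROHIBITED: "hard blocked",
}

SMALL_JOB_MAX_GPUS = 4  # "up to 4 GPUs" is Tier 2; anything larger is Tier 3


def classify_job_submission(gpu_count: int) -> RiskTier:
    """Classify a job submission by GPU count, per the Tier 2 and Tier 3 examples."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    if gpu_count <= SMALL_JOB_MAX_GPUS:
        return RiskTier.LOW_WRITE
    return RiskTier.HIGH_WRITE


def approvals_required(tier: RiskTier) -> int:
    """Number of human approvals needed before an action in this tier may run."""
    if tier is RiskTier.PROHIBITED:
        raise PermissionError("Tier 5 actions are not available to the agent")
    # Tier 3 needs one approval, Tier 4 needs two, lower tiers need none.
    return {RiskTier.HIGH_WRITE: 1, RiskTier.DESTRUCTIVE: 2}.get(tier, 0)
```

Under these assumptions, a 4-GPU job maps to Tier 2 and runs after a dry-run, while a 5-GPU job maps to Tier 3 and waits for one approval.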
Dry-Run Enforcement
For Tier 2 actions, the agent always performs a dry-run first. The dry-run shows exactly what will happen (resources consumed, configuration applied, estimated cost) before any mutation occurs. The agent executes only after the dry-run passes validation.
Human Approval Workflow
For Tier 3 and Tier 4 actions, the agent sends an approval request to your team via Slack:

- Approval buttons are cryptographically signed, so they cannot be forged or replayed
- Each approval is single-use: clicking Approve twice has no effect
- Approvals expire after 30 minutes (configurable), and unresolved requests escalate automatically
- Approver identity is verified: only authorized team members can approve Tier 4 actions
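
The four approval properties listed above (signed, single-use, expiring, identity-checked) can be illustrated with an HMAC over the request details. This is a sketch under assumptions: the documentation does not say which signing scheme Chamber uses, and `SECRET_KEY`, `sign_approval`, `redeem_approval`, and the in-memory store are all hypothetical stand-ins.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"example-signing-key"  # hypothetical; a real key would come from a secret store
APPROVAL_TTL_SECONDS = 30 * 60  # the 30-minute default mentioned above

_used_request_ids = set()  # in-memory stand-in for a durable single-use store


def sign_approval(request_id: str, action: str, issued_at: int) -> str:
    """Bind the signature to one request, one action, and one issue time."""
    message = f"{request_id}|{action}|{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def redeem_approval(request_id, action, issued_at, signature, approver, authorized, now=None):
    """Return True only for a valid, unexpired, unused approval from an authorized user."""
    now = time.time() if now is None else now
    expected = sign_approval(request_id, action, issued_at)
    if not hmac.compare_digest(expected, signature):
        return False  # forged, or the action or timestamp was altered
    if now - issued_at > APPROVAL_TTL_SECONDS:
        return False  # expired
    if approver not in authorized:
        return False  # approver is not on the authorized list
    if request_id in _used_request_ids:
        return False  # replay or second click
    _used_request_ids.add(request_id)
    return True
```

Because the action name is part of the signed message, an approval issued for one action cannot be redeemed for a different one, and the used-ID set makes a second click a no-op.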
Deny-by-Default Tenant Policy
Your organization’s tenant policy defines exactly what the agent is allowed to do. Everything else is denied.
Audit Trail
Every action the agent takes is logged with full context:

| Field | Description |
|---|---|
| Action | What the agent did (e.g., workloads.submit) |
| Risk Tier | The action’s risk classification |
| Reason | Why the agent took this action (root-cause description) |
| Evidence | The logs, metrics, and events that informed the decision |
| Approver | Who approved the action (for Tier 3+) |
| Outcome | Whether the action succeeded and its effect |
| Timestamp | When the action occurred |
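
As an illustration of the fields in the table above, an audit entry could be modeled as an immutable record. This is a hypothetical sketch: the class name `AuditRecord`, the field types, and the JSON encoding are assumptions, not Chamber's actual log format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AuditRecord:
    action: str            # e.g. "workloads.submit"
    risk_tier: int         # 0 through 5
    reason: str            # root-cause description
    evidence: List[str]    # logs, metrics, and events behind the decision
    outcome: str           # whether the action succeeded and its effect
    approver: Optional[str] = None  # expected for Tier 3 and above
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # The table says Approver applies to Tier 3+, so require it there.
        if self.risk_tier >= 3 and not self.approver:
            raise ValueError("Tier 3+ actions must record an approver")

    def to_json(self) -> str:
        """Serialize the record as one JSON object with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps a record from being reassigned field by field after it is written, which matches the intent of an audit trail.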
Incident Deduplication
When multiple signals detect the same issue (for example, a real-time event and the next heartbeat both notice a failed job), the agent deduplicates them. Only one remediation runs per incident. If two team members try to address the same issue from different Slack threads, the second is told that the issue is already being handled and is offered updates on the resolution.
Guardrail Enforcement Examples
Budget limit exceeded
A resubmission would exceed the daily budget limit. The agent blocks the action, notifies the team, and suggests waiting until the next budget cycle or requesting a budget increase.
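
The budget check in this example amounts to comparing projected spend against the daily limit. A minimal sketch, assuming the agent knows the spend so far and an estimated cost for the action; `BudgetDecision` and `check_daily_budget` are hypothetical names, and whether spend exactly at the limit is allowed is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class BudgetDecision:
    allowed: bool
    message: str


def check_daily_budget(spent_today: float, estimated_cost: float, daily_limit: float) -> BudgetDecision:
    """Block any action whose estimated cost would push spend past the daily limit."""
    projected = spent_today + estimated_cost
    if projected > daily_limit:
        overage = projected - daily_limit
        return BudgetDecision(
            allowed=False,
            message=(
                f"Blocked: projected spend {projected:.2f} exceeds daily limit "
                f"{daily_limit:.2f} by {overage:.2f}. Wait for the next budget "
                "cycle or request a budget increase."
            ),
        )
    return BudgetDecision(allowed=True, message="Within daily budget.")
```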
GPU count exceeds per-job limit
A job requests 128 GPUs but the tenant policy caps individual jobs at 64. The agent blocks the request locally — it never reaches the API.
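
The local check in this example can be pictured as a deny-by-default lookup followed by a limit comparison. The sketch below is illustrative only: `TenantPolicy`, `max_gpus_per_job`, and `evaluate` are hypothetical names, and the real policy schema is not shown in this document.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class TenantPolicy:
    # An empty allow-list means the agent may do nothing at all.
    allowed_actions: FrozenSet[str] = frozenset()
    max_gpus_per_job: int = 0


def evaluate(policy: TenantPolicy, action: str, gpu_count: int = 0) -> Tuple[bool, str]:
    """Deny unless the action is explicitly allowed and within the GPU cap."""
    if action not in policy.allowed_actions:
        return False, f"{action} is not explicitly allowed"
    if gpu_count > policy.max_gpus_per_job:
        return False, (
            f"requested {gpu_count} GPUs, policy caps jobs at "
            f"{policy.max_gpus_per_job}"
        )
    return True, "allowed"
```

With `TenantPolicy(allowed_actions=frozenset({"workloads.submit"}), max_gpus_per_job=64)`, a 128-GPU submission is rejected before any API call is made, and any action missing from the allow-list is rejected regardless of its parameters.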
Destructive action on running workloads
Releasing an allocation that has running workloads is a Tier 4 action. The agent sends a dual-approval request showing which workloads would be affected. Both designated approvers must approve before execution.
Unknown failure pattern
The agent encounters a failure it hasn’t seen before and cannot confidently diagnose. Rather than guessing, it escalates to your team with the evidence it has collected — logs, metrics, and its best hypothesis — for human review.
Next Steps
- Getting Started: Enable AI Ops and configure your tenant policy
- How It Works: Learn about the three-trigger proactivity model and diagnosis flow

