The client.run() method is designed for scientists and ML practitioners who want to submit GPU workloads without needing expertise in Docker or Kubernetes. Point it at your training project and let the SDK handle the rest.
Install with the run extra to use this feature:
pip install chamber-sdk[run]

Why Use client.run()?

Traditional GPU workload submission requires:
  1. Writing a Dockerfile optimized for your ML framework
  2. Choosing the right base image (CUDA version, framework version, etc.)
  3. Building and pushing the container image
  4. Authenticating to your container registry
  5. Writing a Kubernetes manifest
  6. Submitting via the API
With client.run(), all of this happens automatically:
from chamber_sdk import ChamberClient

client = ChamberClient.from_config()

# Configure your registries once
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")

# Then submit with just a name
job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="ml-research",
    registry="prod",  # Uses named registry
)
print(f"Submitted: {job.id}")

What It Does Automatically

  1. Framework Detection: Analyzes your requirements.txt to identify PyTorch, TensorFlow, JAX, or other frameworks and selects the optimal NVIDIA NGC base image.
  2. Dockerfile Generation: Creates an optimized Dockerfile with proper CUDA configuration, dependency installation, and entrypoint setup.
  3. Container Build & Push: Uses docker buildx build --push to build and push in a single efficient step. For AWS ECR and Google Artifact Registry, authentication and repository creation are handled automatically. Pulls the :latest tag to seed the layer cache for faster rebuilds.
  4. Kubernetes Manifest: Generates the appropriate manifest (Job or RayJob) with correct GPU resource requests, environment variables, and volume mounts.
  5. Workload Submission: Submits the workload to Chamber with proper workload class, priority, and team assignment.

Supported Container Registries

Chamber automatically handles authentication and repository creation for major cloud registries. Just provide your registry URL and the SDK does the rest.

Google Artifact Registry

Full auto-authentication via gcloud CLI. Repositories are created automatically if they don’t exist.

AWS ECR

Full auto-authentication via AWS CLI. Repositories are created automatically if they don’t exist.
| Registry | URL Pattern | Auto-Auth | Auto-Create Repo |
| --- | --- | --- | --- |
| Google Artifact Registry | {region}-docker.pkg.dev/{project}/{repo} | Yes | Yes |
| AWS ECR | {account}.dkr.ecr.{region}.amazonaws.com | Yes | Yes |
| Other registries | Any Docker-compatible registry | Manual | Manual |
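The URL patterns above are enough to tell the registry types apart. As a rough sketch (a hypothetical helper, not the SDK's internal detection code), classification could look like:

```python
import re

def detect_registry_type(url: str) -> str:
    """Classify a registry URL by its hostname pattern.
    Hypothetical helper; the SDK's internal detection logic may differ."""
    if re.match(r"^[a-z0-9-]+-docker\.pkg\.dev/", url):
        return "gar"   # Google Artifact Registry
    if re.match(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com", url):
        return "ecr"   # AWS Elastic Container Registry
    return "other"     # any Docker-compatible registry; manual auth

print(detect_registry_type("us-central1-docker.pkg.dev/my-project/ml-images"))  # gar
print(detect_registry_type("123456789012.dkr.ecr.us-east-1.amazonaws.com"))     # ecr
print(detect_registry_type("registry.example.com/team"))                        # other
```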

Prerequisites

Docker

Docker must be installed and running on your machine. The SDK uses Docker to build and push images.

gcloud CLI

For Google Artifact Registry. Install and run gcloud auth login.

AWS CLI

For AWS ECR. Install and run aws configure.
Google Artifact Registry setup:
# Install gcloud CLI: https://cloud.google.com/sdk/docs/install
gcloud auth login
Required IAM Permissions:
  • artifactregistry.repositories.get
  • artifactregistry.repositories.create
  • artifactregistry.repositories.uploadArtifacts
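Before a first submission, it can be worth verifying the local tooling is in place. A best-effort preflight check (illustrative only; the SDK performs its own checks) might look like:

```python
import shutil
import subprocess

def preflight() -> list[str]:
    """Report missing local tools that client.run() relies on."""
    problems = []
    if shutil.which("docker") is None:
        problems.append("Docker CLI not found on PATH")
    else:
        # `docker info` fails if the daemon is not running
        r = subprocess.run(["docker", "info"], capture_output=True)
        if r.returncode != 0:
            problems.append("Docker daemon not running")
    if shutil.which("gcloud") is None:
        problems.append("gcloud CLI not found (needed for Google Artifact Registry)")
    if shutil.which("aws") is None:
        problems.append("aws CLI not found (needed for AWS ECR)")
    return problems

for p in preflight():
    print("WARNING:", p)
```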

Basic Usage

Minimal Example

from chamber_sdk import ChamberClient

client = ChamberClient.from_config()

job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="my-team-id",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
)

With Progress Callbacks

Monitor each stage of the pipeline:
def on_progress(stage: str, message: str):
    print(f"[{stage}] {message}")

job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
    on_progress=on_progress,
)
Example output:
[config] Resolving configuration...
[detect] Detecting project...
[detect] Framework: pytorch
[detect] Entrypoint: train.py
[dockerfile] Generating Dockerfile...
[manifest] Generating K8s manifest...
[docker] Authenticating to Google Artifact Registry (us-central1)...
[docker] GAR authentication successful
[docker] Creating repository ml-images...
[docker] Repository ready
[docker] Pulling cache image: us-central1-docker.pkg.dev/my-project/ml-images/my-project:latest
[docker] Cache image pulled successfully
[docker] Build context size: 42.5 MB
[docker] Building and pushing image for linux/amd64 (x86_64)
[docker] Image tag: us-central1-docker.pkg.dev/my-project/ml-images/my-project:a1b2c3d4
[docker] Build and push complete
[submit] Submitting workload...
[submit] Workload submitted: wl_abc123

Wait for Completion

Block until the workload finishes:
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    wait=True,
    poll_interval=30,
    timeout=7200,  # 2 hours
)

print(f"Final status: {job.status}")
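Conceptually, wait=True is a polling loop with a deadline. A minimal sketch (the terminal status names and the zero-argument get_status callable are assumptions for illustration, not the SDK's actual internals):

```python
import time

def wait_for_completion(get_status, poll_interval=30.0, timeout=7200.0):
    """Poll until a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workload still {status} after {timeout}s")
        time.sleep(poll_interval)

# Example with a stubbed status source:
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_completion(lambda: next(statuses), poll_interval=0))  # SUCCEEDED
```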

Registry Configuration

Configure your container registries once and reference them by name. This makes it easy to switch between dev/staging/prod environments.

Setting Up Named Registries

Add registries to ~/.chamber/config.json:
{
    "registries": {
        "prod": "us-east1-docker.pkg.dev/my-project/prod",
        "dev": "us-east1-docker.pkg.dev/my-project/dev",
        "staging": "us-central1-docker.pkg.dev/my-project/staging",
        "ecr": "123456789012.dkr.ecr.us-east-1.amazonaws.com"
    },
    "default_registry": "prod"
}
Or configure programmatically:
from chamber_sdk import ChamberClient

# Add registries (persisted to ~/.chamber/config.json)
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")
ChamberClient.add_registry("ecr", "123456789012.dkr.ecr.us-east-1.amazonaws.com")

# List configured registries
print(ChamberClient.list_registries())
# {'prod': 'us-east1-docker.pkg.dev/my-project/prod', 'dev': '...', 'ecr': '...'}

# Check default
print(ChamberClient.get_default_registry())
# ('prod', 'us-east1-docker.pkg.dev/my-project/prod')

# Change default
ChamberClient.set_default_registry("dev")

Using Named Registries

Once configured, use registries by name:
# Uses default registry (no registry parameter needed)
job = client.run("./my-project", gpus=4, team="ml-team")

# Use a specific registry by name
job = client.run("./my-project", gpus=4, team="ml-team", registry="dev")
job = client.run("./my-project", gpus=4, team="ml-team", registry="ecr")

# Full URLs still work
job = client.run("./my-project", gpus=4, team="ml-team",
                 registry="us-west1-docker.pkg.dev/other-project/images")

Configuration File

Create a .chamber.yaml file in your project directory to avoid repeating parameters:
# .chamber.yaml
name: my-training-job
gpus: 4
gpu_type: H100
team: my-team-id
registry: prod  # Use named registry, or full URL

# Python entrypoint
entrypoint: train.py
entrypoint_args: --batch-size 32 --epochs 10

# Distributed training mode
distributed: auto  # "auto", "ray", "deepspeed", or "none"

# Workload class
job_class: ELASTIC  # or "RESERVED"

# Environment variables
env:
  CUDA_VISIBLE_DEVICES: "0,1,2,3"
  NCCL_DEBUG: INFO

# Forward secrets from local environment
forward_env:
  - WANDB_API_KEY
  - HF_TOKEN

# Additional packages
extra_pip_packages:
  - wandb
  - tensorboard

extra_apt_packages:
  - ffmpeg

# Build customization
pre_build_commands:
  - pip install flash-attn --no-build-isolation

post_build_commands:
  - python -c "import torch; print(torch.cuda.is_available())"

# Resource overrides
cpu: "24"
memory: 200Gi
shm_size: 64Gi

# Files to exclude from container
ignore:
  - "*.ckpt"
  - "wandb/"
  - "outputs/"
Then submit with minimal arguments:
job = client.run("./my-project")  # Uses .chamber.yaml settings

Dry Run (Preview Mode)

Preview exactly what will happen without building or submitting:
result = client.run("./my-project", dry_run=True)

print("=== Detected Profile ===")
print(f"Framework: {result.profile.framework}")
print(f"Entrypoint: {result.profile.entrypoint}")
print(f"Python version: {result.profile.python_version}")

print("\n=== Generated Dockerfile ===")
print(result.dockerfile)

print("\n=== K8s Manifest ===")
print(result.manifest)

print("\n=== Submit Payload ===")
for key, value in result.submit_payload.items():
    if key != "k8s_manifest":
        print(f"  {key}: {value}")

Save Generated Files

Inspect the generated Dockerfile and manifest:
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    save_dockerfile=True,  # Writes Dockerfile.chamber
    save_manifest=True,    # Writes manifest.chamber.yaml
)

Framework Detection

The SDK automatically detects your ML framework from requirements.txt and selects the optimal base image:
| Framework | Detected From | Base Image |
| --- | --- | --- |
| PyTorch | torch, pytorch | nvcr.io/nvidia/pytorch:24.04-py3 |
| TensorFlow | tensorflow, keras | nvcr.io/nvidia/tensorflow:24.04-tf2-py3 |
| JAX | jax, jaxlib | nvcr.io/nvidia/jax:24.04-py3 |
| Generic | (fallback) | nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 |
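The detection rules in the table can be sketched as a simple requirements.txt scan. This is a hypothetical reconstruction; the SDK's real parser also has to handle version pins, extras, and comments more robustly:

```python
import re

BASE_IMAGES = {
    "pytorch": "nvcr.io/nvidia/pytorch:24.04-py3",
    "tensorflow": "nvcr.io/nvidia/tensorflow:24.04-tf2-py3",
    "jax": "nvcr.io/nvidia/jax:24.04-py3",
    "generic": "nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04",
}

MARKERS = {
    "pytorch": {"torch", "pytorch"},
    "tensorflow": {"tensorflow", "keras"},
    "jax": {"jax", "jaxlib"},
}

def detect_framework(requirements_text: str) -> str:
    """Return the first framework whose marker package appears in requirements."""
    pkgs = set()
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip().lower()  # strip comments
        if line:
            pkgs.add(re.split(r"[<>=!\[ ]", line)[0])  # strip version specifiers
    for framework, markers in MARKERS.items():
        if pkgs & markers:
            return framework
    return "generic"

reqs = "torch==2.3.0\nnumpy\nwandb  # logging"
fw = detect_framework(reqs)
print(fw, "->", BASE_IMAGES[fw])  # pytorch -> nvcr.io/nvidia/pytorch:24.04-py3
```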
Override the base image if needed:
job = client.run(
    "./my-project",
    base_image="nvcr.io/nvidia/pytorch:24.01-py3",
    # ... other args
)

Distributed Training

The SDK auto-detects distributed training frameworks and configures the appropriate launch command:
| Framework | Detected From | Launch Command |
| --- | --- | --- |
| DeepSpeed | deepspeed in requirements | deepspeed --num_gpus N train.py |
| Accelerate | accelerate in requirements | accelerate launch train.py |
| Ray | ray in requirements | Creates RayJob K8s manifest |
| Horovod | horovod in requirements | Uses standard launcher |
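Mapping a detected mode to its launch command is a small dispatch. A hypothetical sketch (the SDK's actual command lines may differ):

```python
def launch_command(mode: str, entrypoint: str, gpus: int) -> list[str]:
    """Assemble a launch command for the given distributed mode."""
    if mode == "deepspeed":
        return ["deepspeed", "--num_gpus", str(gpus), entrypoint]
    if mode == "accelerate":
        return ["accelerate", "launch", entrypoint]
    # "ray" produces a RayJob manifest rather than a launch command;
    # "none" (and Horovod) fall back to the standard Python launcher.
    return ["python", entrypoint]

print(launch_command("deepspeed", "train.py", 8))
```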
Force a specific distributed mode:
# DeepSpeed
job = client.run(
    "./deepspeed-project",
    gpus=8,
    distributed="deepspeed",
    # ...
)

# Ray (creates RayJob for large-scale distributed training)
job = client.run(
    "./ray-project",
    gpus=32,
    distributed="ray",
    # ...
)

# Disable distributed detection
job = client.run(
    "./single-gpu-project",
    gpus=1,
    distributed="none",
    # ...
)

Using an Existing Dockerfile

Skip Dockerfile generation and use your own:
job = client.run(
    "./my-project",
    dockerfile="./Dockerfile.custom",
    # ...
)

Registry Auto-Authentication

The SDK automatically detects your registry type and handles authentication seamlessly. No manual docker login required.
When using Google Artifact Registry, the SDK automatically:
  1. Detects GAR registry URLs (pattern: {region}-docker.pkg.dev/{project}/{repo})
  2. Authenticates Docker using your gcloud CLI credentials
  3. Creates the repository if it doesn’t exist
# GAR authentication happens automatically!
job = client.run(
    "./my-project",
    registry="us-central1-docker.pkg.dev/my-gcp-project/ml-images",
    team="ml-team",
    gpus=4,
    gpu_type="H100",
)
Supported regions: All GCP regions (us-central1, us-east1, europe-west1, asia-northeast1, etc.)
Ensure your gcloud CLI is authenticated (gcloud auth login) before running.

How It Works

When you call client.run(), the SDK inspects the registry URL, selects the matching authentication flow, and creates the repository if needed. This means you can switch between registries just by changing the URL—no code changes required.

All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| directory | str | required | Path to project directory |
| gpus | int | 1 | Number of GPUs |
| gpu_type | str | "H100" | GPU type (H100, A100, L40S, etc.) |
| team | str | None | Team ID (required) |
| name | str | directory name | Workload name |
| entrypoint | str | auto-detect | Python entry point |
| entrypoint_args | str | None | CLI arguments for entrypoint |
| registry | str | None | Registry name (e.g., "prod") or URL. Uses default if not specified. |
| base_image | str | auto-select | Override base Docker image |
| dockerfile | str | None | Path to existing Dockerfile |
| distributed | str | "auto" | "auto", "ray", "deepspeed", or "none" |
| job_class | str | "ELASTIC" | "RESERVED" or "ELASTIC" |
| env | dict | None | Environment variables |
| no_cache | bool | False | Force rebuild even if image exists |
| dry_run | bool | False | Preview without executing |
| save_dockerfile | bool | False | Save generated Dockerfile |
| save_manifest | bool | False | Save generated K8s manifest |
| wait | bool | False | Wait for workload completion |
| poll_interval | float | 10.0 | Seconds between status checks |
| timeout | float | None | Max wait time in seconds |
| on_progress | callable | None | Progress callback(stage, message) |

Error Handling

from chamber_sdk import ChamberClient, DockerError
from chamber_sdk.run import RegistryAuthError

client = ChamberClient.from_config()

try:
    job = client.run(
        "./my-project",
        registry="123456.dkr.ecr.us-east-1.amazonaws.com",
        team="my-team",
        gpus=4,
        gpu_type="H100",
    )
except RegistryAuthError as e:
    # Registry authentication failed (401/403 from registry)
    # Useful for detecting expired ECR tokens (12-hour validity)
    print(f"Registry auth failed for {e.registry}")
    print(f"Details: {e.detail}")
except DockerError as e:
    # Docker not installed, daemon not running, build failed, or push failed
    print(f"Docker error: {e}")
except FileNotFoundError as e:
    # No Python entrypoint found in project
    print(f"Project error: {e}")
except ValueError as e:
    # Missing required parameters (registry, team)
    print(f"Configuration error: {e}")

Configuration Priority

Settings are resolved in this order (highest priority first):
  1. Function arguments — client.run(..., gpus=8) overrides everything
  2. Project .chamber.yaml — Project-specific settings
  3. Global ~/.chamber/config.json — Default registry and other global settings
  4. Auto-detection — Framework, entrypoint, Python version
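This layered resolution behaves like a chain of lookups where earlier sources win. A minimal sketch using Python's ChainMap (the keys and values here are illustrative, not the SDK's internal representation):

```python
from collections import ChainMap

# Earlier maps take priority over later ones, mirroring the order above.
call_args   = {"gpus": 8}                                    # 1. function arguments
project_cfg = {"gpus": 4, "gpu_type": "H100", "team": "ml"}  # 2. .chamber.yaml
global_cfg  = {"registry": "prod"}                           # 3. ~/.chamber/config.json
detected    = {"entrypoint": "train.py"}                     # 4. auto-detection

resolved = ChainMap(call_args, project_cfg, global_cfg, detected)
print(resolved["gpus"])        # 8  (function argument wins over .chamber.yaml)
print(resolved["registry"])    # prod
print(resolved["entrypoint"])  # train.py
```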

Registry Resolution

The registry parameter can be a name or URL:
  1. If it looks like a URL (contains . or /), it’s used as-is
  2. Otherwise, it’s looked up in ~/.chamber/config.json under registries
  3. If not specified, the default_registry from config is used
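The three rules above can be written out directly. A sketch (the config dict mirrors the shape of ~/.chamber/config.json; the error message is illustrative):

```python
def resolve_registry(value, config):
    """Resolve a registry argument (name, URL, or None) to a registry URL."""
    registries = config.get("registries", {})
    if value is None:
        name = config.get("default_registry")
        if name is None:
            raise ValueError("no registry given and no default_registry configured")
        return registries[name]
    if "." in value or "/" in value:  # looks like a URL: use as-is
        return value
    return registries[value]          # otherwise: named-registry lookup

cfg = {"registries": {"prod": "us-east1-docker.pkg.dev/p/prod"}, "default_registry": "prod"}
print(resolve_registry(None, cfg))                      # default registry URL
print(resolve_registry("prod", cfg))                    # same, by name
print(resolve_registry("registry.example.com/x", cfg))  # URL used as-is
```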

Caching

The SDK computes a content hash of your project and uses it as the image tag. If the image already exists in the registry, the build and push steps are skipped:
[docker] Image already exists: 123456.dkr.ecr.us-east-1.amazonaws.com/my-project:a1b2c3d4 (skipping build & push)
Force a rebuild with no_cache=True:
job = client.run(
    "./my-project",
    no_cache=True,  # Always rebuild
    # ...
)
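The essence of content-addressed tagging is that identical project contents always hash to the same tag, so an unchanged project maps to an image that already exists. A sketch (illustrative; the SDK's real hashing, e.g. how it applies the ignore list, may differ):

```python
import hashlib
import tempfile
from pathlib import Path

def content_tag(project_dir: str) -> str:
    """Hash every file's relative path and bytes into a short, stable tag."""
    h = hashlib.sha256()
    root = Path(project_dir)
    for path in sorted(root.rglob("*")):  # sorted for a deterministic order
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()[:8]

# Identical content always yields the same tag, so rebuilds can be skipped.
with tempfile.TemporaryDirectory() as d:
    Path(d, "train.py").write_text("print('hi')")
    tag = content_tag(d)
    assert tag == content_tag(d)  # deterministic
    print(f"image tag: :{tag}")
```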

Docker Build Optimizations

The SDK automatically applies several optimizations to make builds fast:
| Optimization | What it does |
| --- | --- |
| Single build+push | Uses docker buildx build --push to build and push in one step. This is significantly faster than separate build+push because BuildKit pushes layers directly as they complete |
| BuildKit | Enabled by default (DOCKER_BUILDKIT=1) for parallel build stages and advanced caching |
| Pip cache mounts | Uses --mount=type=cache,target=/root/.cache/pip so pip packages are cached across builds |
| Remote layer caching | Pulls the :latest tag before building to seed the layer cache. After a successful build, creates a :latest tag alias using docker buildx imagetools create (fast manifest aliasing, no layer re-upload) |
| Content-addressed tags | Images are tagged with a content hash. If the image already exists, build and push are skipped entirely |
| Platform targeting | Explicitly builds for linux/amd64 to ensure consistent images |
| Reduced metadata | Uses --provenance=false --sbom=false to skip unnecessary metadata generation |
| Context size warnings | Reports build context size and warns if it exceeds 500MB |
Why single build+push matters: With Docker Desktop’s containerd image store, a separate docker push can push ALL manifests from multi-platform base images (e.g., 6 manifests for NVIDIA images), causing each layer to be checked 6 times. The buildx build --push approach only pushes what it built — making pushes dramatically faster, especially for large ML images.

Low-Level Build Functions

For advanced use cases, you can access the optimized build functions directly:
from chamber_sdk.run import (
    build_and_push_image,
    pull_image_for_cache,
    create_image_tag,
    latest_tag,
    compute_context_size,
)

# Check context size before building
size = compute_context_size("./my-project")
print(f"Build context: {size / 1024 / 1024:.1f} MB")

# Pull cache image
cache_from = None
latest = latest_tag("registry.example.com/my-project:abc123")
if pull_image_for_cache(latest):
    cache_from = latest

# Dockerfile / .dockerignore content would normally come from the SDK's
# generator; shown inline here so the example is self-contained
dockerfile = "FROM nvcr.io/nvidia/pytorch:24.04-py3\nWORKDIR /workspace\nCOPY . .\n"
dockerignore = "*.ckpt\nwandb/\noutputs/\n"

# Build and push in one step
build_and_push_image(
    context_dir="./my-project",
    tag="registry.example.com/my-project:abc123",
    dockerfile_content=dockerfile,
    dockerignore_content=dockerignore,
    cache_from=cache_from,
    on_progress=lambda msg: print(f"[docker] {msg}"),
)

# Create :latest alias (fast, no layer re-upload)
create_image_tag("registry.example.com/my-project:abc123", latest)