The client.run() method is designed for scientists and ML practitioners who want to submit GPU workloads without needing expertise in Docker or Kubernetes. Point it at your training project and let the SDK handle the rest.
Install with the run extra to use this feature:
pip install chamber-sdk[run]

Why Use client.run()?

Traditional GPU workload submission requires:
  1. Writing a Dockerfile optimized for your ML framework
  2. Choosing the right base image (CUDA version, framework version, etc.)
  3. Building and pushing the container image
  4. Authenticating to your container registry
  5. Writing a Kubernetes manifest
  6. Submitting via the API
With client.run(), all of this happens automatically:
from chamber_sdk import ChamberClient

client = ChamberClient.from_config()

# Configure your registries once
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")

# Then submit with just a name
job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="ml-research",
    registry="prod",  # Uses named registry
)
print(f"Submitted: {job.id}")

What It Does Automatically

  1. Framework Detection: Analyzes your requirements.txt to identify PyTorch, TensorFlow, JAX, or other frameworks and selects the optimal NVIDIA NGC base image.
  2. Dockerfile Generation: Creates an optimized Dockerfile with proper CUDA configuration, dependency installation, and entrypoint setup.
  3. Container Build & Push: Uses docker buildx build --push to build and push in a single efficient step. For AWS ECR and Google Artifact Registry, authentication and repository creation are handled automatically. Pulls the :latest tag to seed the layer cache for faster rebuilds.
  4. Kubernetes Manifest: Generates the appropriate manifest (Job or RayJob) with correct GPU resource requests, environment variables, and volume mounts.
  5. Workload Submission: Submits the workload to Chamber with proper workload class, priority, and team assignment.

Supported Container Registries

Chamber automatically handles authentication and repository creation for major cloud registries. Just provide your registry URL and the SDK does the rest.

Google Artifact Registry

Full auto-authentication via gcloud CLI. Repositories are created automatically if they don’t exist.

AWS ECR

Full auto-authentication via AWS CLI. Repositories are created automatically if they don’t exist.
| Registry | URL Pattern | Auto-Auth | Auto-Create Repo |
| --- | --- | --- | --- |
| Google Artifact Registry | {region}-docker.pkg.dev/{project}/{repo} | Yes | Yes |
| AWS ECR | {account}.dkr.ecr.{region}.amazonaws.com | Yes | Yes |
| Other registries | Any Docker-compatible registry | Manual | Manual |
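The URL patterns above are enough to tell the registry types apart. As a rough sketch (a hypothetical helper, not the SDK's internal detection code), classification could look like:

```python
import re

def detect_registry_type(url: str) -> str:
    """Classify a registry URL by its hostname pattern.
    Hypothetical helper; the SDK's internal detection logic may differ."""
    if re.match(r"^[a-z0-9-]+-docker\.pkg\.dev/", url):
        return "gar"   # Google Artifact Registry
    if re.match(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com", url):
        return "ecr"   # AWS Elastic Container Registry
    return "other"     # any Docker-compatible registry; manual auth

print(detect_registry_type("us-central1-docker.pkg.dev/my-project/ml-images"))  # gar
print(detect_registry_type("123456789012.dkr.ecr.us-east-1.amazonaws.com"))     # ecr
print(detect_registry_type("registry.example.com/team"))                        # other
```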

Prerequisites

Docker

Docker must be installed and running on your machine. The SDK uses Docker to build and push images.

gcloud CLI

For Google Artifact Registry. Install and run gcloud auth login.

AWS CLI

For AWS ECR. Install and run aws configure.
Google Artifact Registry setup:
# Install gcloud CLI: https://cloud.google.com/sdk/docs/install
gcloud auth login
Required IAM Permissions:
  • artifactregistry.repositories.get
  • artifactregistry.repositories.create
  • artifactregistry.repositories.uploadArtifacts
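Before a first submission, it can be worth verifying the local tooling is in place. A best-effort preflight check (illustrative only; the SDK performs its own checks) might look like:

```python
import shutil
import subprocess

def preflight() -> list[str]:
    """Report missing local tools that client.run() relies on."""
    problems = []
    if shutil.which("docker") is None:
        problems.append("Docker CLI not found on PATH")
    else:
        # `docker info` fails if the daemon is not running
        r = subprocess.run(["docker", "info"], capture_output=True)
        if r.returncode != 0:
            problems.append("Docker daemon not running")
    if shutil.which("gcloud") is None:
        problems.append("gcloud CLI not found (needed for Google Artifact Registry)")
    if shutil.which("aws") is None:
        problems.append("aws CLI not found (needed for AWS ECR)")
    return problems

for p in preflight():
    print("WARNING:", p)
```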

Basic Usage

Minimal Example

from chamber_sdk import ChamberClient

client = ChamberClient.from_config()

job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="my-team-id",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
)

With Progress Callbacks

Monitor each stage of the pipeline:
def on_progress(stage: str, message: str):
    print(f"[{stage}] {message}")

job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
    on_progress=on_progress,
)
Example output:
[config] Resolving configuration...
[detect] Detecting project...
[detect] Framework: pytorch
[detect] Entrypoint: train.py
[dockerfile] Generating Dockerfile...
[manifest] Generating K8s manifest...
[docker] Authenticating to Google Artifact Registry (us-central1)...
[docker] GAR authentication successful
[docker] Creating repository ml-images...
[docker] Repository ready
[docker] Pulling cache image: us-central1-docker.pkg.dev/my-project/ml-images/my-project:latest
[docker] Cache image pulled successfully
[docker] Build context size: 42.5 MB
[docker] Building and pushing image for linux/amd64 (x86_64)
[docker] Image tag: us-central1-docker.pkg.dev/my-project/ml-images/my-project:a1b2c3d4
[docker] Build and push complete
[submit] Submitting workload...
[submit] Workload submitted: wl_abc123

Wait for Completion

Block until the workload finishes:
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    wait=True,
    poll_interval=30,
    timeout=7200,  # 2 hours
)

print(f"Final status: {job.status}")
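Conceptually, wait=True is a polling loop with a deadline. A minimal sketch (the terminal status names and the zero-argument get_status callable are assumptions for illustration, not the SDK's actual internals):

```python
import time

def wait_for_completion(get_status, poll_interval=30.0, timeout=7200.0):
    """Poll until a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"workload still {status} after {timeout}s")
        time.sleep(poll_interval)

# Example with a stubbed status source:
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_completion(lambda: next(statuses), poll_interval=0))  # SUCCEEDED
```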

Registry Configuration

Configure your container registries once and reference them by name. This makes it easy to switch between dev/staging/prod environments.

Setting Up Named Registries

Add registries to ~/.chamber/config.json:
{
    "registries": {
        "prod": "us-east1-docker.pkg.dev/my-project/prod",
        "dev": "us-east1-docker.pkg.dev/my-project/dev",
        "staging": "us-central1-docker.pkg.dev/my-project/staging",
        "ecr": "123456789012.dkr.ecr.us-east-1.amazonaws.com"
    },
    "default_registry": "prod"
}
Or configure programmatically:
from chamber_sdk import ChamberClient

# Add registries (persisted to ~/.chamber/config.json)
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")
ChamberClient.add_registry("ecr", "123456789012.dkr.ecr.us-east-1.amazonaws.com")

# List configured registries
print(ChamberClient.list_registries())
# {'prod': 'us-east1-docker.pkg.dev/my-project/prod', 'dev': '...', 'ecr': '...'}

# Check default
print(ChamberClient.get_default_registry())
# ('prod', 'us-east1-docker.pkg.dev/my-project/prod')

# Change default
ChamberClient.set_default_registry("dev")

Using Named Registries

Once configured, use registries by name:
# Uses default registry (no registry parameter needed)
job = client.run("./my-project", gpus=4, team="ml-team")

# Use a specific registry by name
job = client.run("./my-project", gpus=4, team="ml-team", registry="dev")
job = client.run("./my-project", gpus=4, team="ml-team", registry="ecr")

# Full URLs still work
job = client.run("./my-project", gpus=4, team="ml-team",
                 registry="us-west1-docker.pkg.dev/other-project/images")

Configuration File

Create a .chamber.yaml file in your project directory to avoid repeating parameters:
# .chamber.yaml
name: my-training-job
gpus: 4
gpu_type: H100
team: my-team-id
registry: prod  # Use named registry, or full URL

# Python entrypoint
entrypoint: train.py
entrypoint_args: --batch-size 32 --epochs 10

# Distributed training mode
distributed: auto  # "auto", "ray", "deepspeed", or "none"

# Workload class
job_class: ELASTIC  # or "RESERVED"

# Environment variables
env:
  CUDA_VISIBLE_DEVICES: "0,1,2,3"
  NCCL_DEBUG: INFO

# Forward secrets from local environment
forward_env:
  - WANDB_API_KEY
  - HF_TOKEN

# Additional packages
extra_pip_packages:
  - wandb
  - tensorboard

extra_apt_packages:
  - ffmpeg

# Build customization
pre_build_commands:
  - pip install flash-attn --no-build-isolation

post_build_commands:
  - python -c "import torch; print(torch.cuda.is_available())"

# Resource overrides
cpu: "24"
memory: 200Gi
shm_size: 64Gi

# Files to exclude from container
ignore:
  - "*.ckpt"
  - "wandb/"
  - "outputs/"
Then submit with minimal arguments:
job = client.run("./my-project")  # Uses .chamber.yaml settings

Dry Run (Preview Mode)

Preview exactly what will happen without building or submitting:
result = client.run("./my-project", dry_run=True)

print("=== Detected Profile ===")
print(f"Framework: {result.profile.framework}")
print(f"Entrypoint: {result.profile.entrypoint}")
print(f"Python version: {result.profile.python_version}")

print("\n=== Generated Dockerfile ===")
print(result.dockerfile)

print("\n=== K8s Manifest ===")
print(result.manifest)

print("\n=== Submit Payload ===")
for key, value in result.submit_payload.items():
    if key != "k8s_manifest":
        print(f"  {key}: {value}")

Save Generated Files

Inspect the generated Dockerfile and manifest:
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    save_dockerfile=True,  # Writes Dockerfile.chamber
    save_manifest=True,    # Writes manifest.chamber.yaml
)

Framework Detection

The SDK automatically detects your ML framework from requirements.txt and selects the optimal base image:
| Framework | Detected From | Base Image |
| --- | --- | --- |
| PyTorch | torch, pytorch | nvcr.io/nvidia/pytorch:24.04-py3 |
| TensorFlow | tensorflow, keras | nvcr.io/nvidia/tensorflow:24.04-tf2-py3 |
| JAX | jax, jaxlib | nvcr.io/nvidia/jax:24.04-py3 |
| Generic | (fallback) | nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 |
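The detection rules in the table can be sketched as a simple requirements.txt scan. This is a hypothetical reconstruction; the SDK's real parser also has to handle version pins, extras, and comments more robustly:

```python
import re

BASE_IMAGES = {
    "pytorch": "nvcr.io/nvidia/pytorch:24.04-py3",
    "tensorflow": "nvcr.io/nvidia/tensorflow:24.04-tf2-py3",
    "jax": "nvcr.io/nvidia/jax:24.04-py3",
    "generic": "nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04",
}

MARKERS = {
    "pytorch": {"torch", "pytorch"},
    "tensorflow": {"tensorflow", "keras"},
    "jax": {"jax", "jaxlib"},
}

def detect_framework(requirements_text: str) -> str:
    """Return the first framework whose marker package appears in requirements."""
    pkgs = set()
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip().lower()  # strip comments
        if line:
            pkgs.add(re.split(r"[<>=!\[ ]", line)[0])  # strip version specifiers
    for framework, markers in MARKERS.items():
        if pkgs & markers:
            return framework
    return "generic"

reqs = "torch==2.3.0\nnumpy\nwandb  # logging"
fw = detect_framework(reqs)
print(fw, "->", BASE_IMAGES[fw])  # pytorch -> nvcr.io/nvidia/pytorch:24.04-py3
```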
Override the base image if needed:
job = client.run(
    "./my-project",
    base_image="nvcr.io/nvidia/pytorch:24.01-py3",
    # ... other args
)

Distributed Training

The SDK auto-detects distributed training frameworks and configures the appropriate launch command:
| Framework | Detected From | Launch Command |
| --- | --- | --- |
| DeepSpeed | deepspeed in requirements | deepspeed --num_gpus N train.py |
| Accelerate | accelerate in requirements | accelerate launch train.py |
| Ray | ray in requirements | Creates RayJob K8s manifest |
| Horovod | horovod in requirements | Uses standard launcher |
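Mapping a detected mode to its launch command is a small dispatch. A hypothetical sketch (the SDK's actual command lines may differ):

```python
def launch_command(mode: str, entrypoint: str, gpus: int) -> list[str]:
    """Assemble a launch command for the given distributed mode."""
    if mode == "deepspeed":
        return ["deepspeed", "--num_gpus", str(gpus), entrypoint]
    if mode == "accelerate":
        return ["accelerate", "launch", entrypoint]
    # "ray" produces a RayJob manifest rather than a launch command;
    # "none" (and Horovod) fall back to the standard Python launcher.
    return ["python", entrypoint]

print(launch_command("deepspeed", "train.py", 8))
```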
Force a specific distributed mode:
# DeepSpeed
job = client.run(
    "./deepspeed-project",
    gpus=8,
    distributed="deepspeed",
    # ...
)

# Ray (creates RayJob for large-scale distributed training)
job = client.run(
    "./ray-project",
    gpus=32,
    distributed="ray",
    # ...
)

# Disable distributed detection
job = client.run(
    "./single-gpu-project",
    gpus=1,
    distributed="none",
    # ...
)

Using an Existing Dockerfile

Skip Dockerfile generation and use your own:
job = client.run(
    "./my-project",
    dockerfile="./Dockerfile.custom",
    # ...
)

Registry Auto-Authentication

The SDK automatically detects your registry type and handles authentication seamlessly. No manual docker login required.
When using Google Artifact Registry, the SDK automatically:
  1. Detects GAR registry URLs (pattern: {region}-docker.pkg.dev/{project}/{repo})
  2. Authenticates Docker using your gcloud CLI credentials
  3. Creates the repository if it doesn’t exist
# GAR authentication happens automatically!
job = client.run(
    "./my-project",
    registry="us-central1-docker.pkg.dev/my-gcp-project/ml-images",
    team="ml-team",
    gpus=4,
    gpu_type="H100",
)
Supported regions: All GCP regions (us-central1, us-east1, europe-west1, asia-northeast1, etc.)
Ensure your gcloud CLI is authenticated (gcloud auth login) before running.

How It Works

When you call client.run(), the SDK inspects the registry URL, selects the matching authentication flow, and creates the repository if needed. This means you can switch between registries just by changing the URL—no code changes required.

All Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| directory | str | required | Path to project directory |
| gpus | int | 1 | Number of GPUs |
| gpu_type | str | "H100" | GPU type (H100, A100, L40S, etc.) |
| team | str | None | Team ID (required) |
| name | str | directory name | Workload name |
| entrypoint | str | auto-detect | Python entry point |
| entrypoint_args | str | None | CLI arguments for entrypoint |
| registry | str | None | Registry name (e.g., "prod") or URL. Uses default if not specified. |
| base_image | str | auto-select | Override base Docker image |
| dockerfile | str | None | Path to existing Dockerfile |
| distributed | str | "auto" | "auto", "ray", "deepspeed", or "none" |
| job_class | str | "ELASTIC" | "RESERVED" or "ELASTIC" |
| env | dict | None | Environment variables |
| no_cache | bool | False | Force rebuild even if image exists |
| dry_run | bool | False | Preview without executing |
| save_dockerfile | bool | False | Save generated Dockerfile |
| save_manifest | bool | False | Save generated K8s manifest |
| wait | bool | False | Wait for workload completion |
| poll_interval | float | 10.0 | Seconds between status checks |
| timeout | float | None | Max wait time in seconds |
| on_progress | callable | None | Progress callback(stage, message) |

Error Handling

from chamber_sdk import ChamberClient, DockerError
from chamber_sdk.run import RegistryAuthError

client = ChamberClient.from_config()

try:
    job = client.run(
        "./my-project",
        registry="123456.dkr.ecr.us-east-1.amazonaws.com",
        team="my-team",
        gpus=4,
        gpu_type="H100",
    )
except RegistryAuthError as e:
    # Registry authentication failed (401/403 from registry)
    # Useful for detecting expired ECR tokens (12-hour validity)
    print(f"Registry auth failed for {e.registry}")
    print(f"Details: {e.detail}")
except DockerError as e:
    # Docker not installed, daemon not running, build failed, or push failed
    print(f"Docker error: {e}")
except FileNotFoundError as e:
    # No Python entrypoint found in project
    print(f"Project error: {e}")
except ValueError as e:
    # Missing required parameters (registry, team)
    print(f"Configuration error: {e}")

Configuration Priority

Settings are resolved in this order (highest priority first):
  1. Function arguments — client.run(..., gpus=8) overrides everything
  2. Project .chamber.yaml — Project-specific settings
  3. Global ~/.chamber/config.json — Default registry and other global settings
  4. Auto-detection — Framework, entrypoint, Python version
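This layered resolution behaves like a chain of lookups where earlier sources win. A minimal sketch using Python's ChainMap (the keys and values here are illustrative, not the SDK's internal representation):

```python
from collections import ChainMap

# Earlier maps take priority over later ones, mirroring the order above.
call_args   = {"gpus": 8}                                    # 1. function arguments
project_cfg = {"gpus": 4, "gpu_type": "H100", "team": "ml"}  # 2. .chamber.yaml
global_cfg  = {"registry": "prod"}                           # 3. ~/.chamber/config.json
detected    = {"entrypoint": "train.py"}                     # 4. auto-detection

resolved = ChainMap(call_args, project_cfg, global_cfg, detected)
print(resolved["gpus"])        # 8  (function argument wins over .chamber.yaml)
print(resolved["registry"])    # prod
print(resolved["entrypoint"])  # train.py
```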

Registry Resolution

The registry parameter can be a name or URL:
  1. If it looks like a URL (contains . or /), it’s used as-is
  2. Otherwise, it’s looked up in ~/.chamber/config.json under registries
  3. If not specified, the default_registry from config is used
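The three rules above can be written out directly. A sketch (the config dict mirrors the shape of ~/.chamber/config.json; the error message is illustrative):

```python
def resolve_registry(value, config):
    """Resolve a registry argument (name, URL, or None) to a registry URL."""
    registries = config.get("registries", {})
    if value is None:
        name = config.get("default_registry")
        if name is None:
            raise ValueError("no registry given and no default_registry configured")
        return registries[name]
    if "." in value or "/" in value:  # looks like a URL: use as-is
        return value
    return registries[value]          # otherwise: named-registry lookup

cfg = {"registries": {"prod": "us-east1-docker.pkg.dev/p/prod"}, "default_registry": "prod"}
print(resolve_registry(None, cfg))                      # default registry URL
print(resolve_registry("prod", cfg))                    # same, by name
print(resolve_registry("registry.example.com/x", cfg))  # URL used as-is
```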

Caching

The SDK computes a content hash of your project and uses it as the image tag. If the image already exists in the registry, the build and push steps are skipped:
[docker] Image already exists: 123456.dkr.ecr.us-east-1.amazonaws.com/my-project:a1b2c3d4 (skipping build & push)
Force a rebuild with no_cache=True:
job = client.run(
    "./my-project",
    no_cache=True,  # Always rebuild
    # ...
)
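The essence of content-addressed tagging is that identical project contents always hash to the same tag, so an unchanged project maps to an image that already exists. A sketch (illustrative; the SDK's real hashing, e.g. how it applies the ignore list, may differ):

```python
import hashlib
import tempfile
from pathlib import Path

def content_tag(project_dir: str) -> str:
    """Hash every file's relative path and bytes into a short, stable tag."""
    h = hashlib.sha256()
    root = Path(project_dir)
    for path in sorted(root.rglob("*")):  # sorted for a deterministic order
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()[:8]

# Identical content always yields the same tag, so rebuilds can be skipped.
with tempfile.TemporaryDirectory() as d:
    Path(d, "train.py").write_text("print('hi')")
    tag = content_tag(d)
    assert tag == content_tag(d)  # deterministic
    print(f"image tag: :{tag}")
```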

Docker Build Optimizations

The SDK automatically applies several optimizations to make builds fast:
| Optimization | What it does |
| --- | --- |
| Single build+push | Uses docker buildx build --push to build and push in one step. This is significantly faster than separate build+push because BuildKit pushes layers directly as they complete |
| BuildKit | Enabled by default (DOCKER_BUILDKIT=1) for parallel build stages and advanced caching |
| Pip cache mounts | Uses --mount=type=cache,target=/root/.cache/pip so pip packages are cached across builds |
| Remote layer caching | Pulls the :latest tag before building to seed the layer cache. After a successful build, creates a :latest tag alias using docker buildx imagetools create (fast manifest aliasing, no layer re-upload) |
| Content-addressed tags | Images are tagged with a content hash. If the image already exists, build and push are skipped entirely |
| Platform targeting | Explicitly builds for linux/amd64 to ensure consistent images |
| Reduced metadata | Uses --provenance=false --sbom=false to skip unnecessary metadata generation |
| Context size warnings | Reports build context size and warns if it exceeds 500MB |
Why single build+push matters: With Docker Desktop’s containerd image store, a separate docker push can push ALL manifests from multi-platform base images (e.g., 6 manifests for NVIDIA images), causing each layer to be checked 6 times. The buildx build --push approach only pushes what it built — making pushes dramatically faster, especially for large ML images.

Low-Level Build Functions

For advanced use cases, you can access the optimized build functions directly:
from chamber_sdk.run import (
    build_and_push_image,
    pull_image_for_cache,
    create_image_tag,
    latest_tag,
    compute_context_size,
)

# Check context size before building
size = compute_context_size("./my-project")
print(f"Build context: {size / 1024 / 1024:.1f} MB")

# Pull cache image
cache_from = None
latest = latest_tag("registry.example.com/my-project:abc123")
if pull_image_for_cache(latest):
    cache_from = latest

# Dockerfile / .dockerignore content would normally come from the SDK's
# generator; shown inline here so the example is self-contained
dockerfile = "FROM nvcr.io/nvidia/pytorch:24.04-py3\nWORKDIR /workspace\nCOPY . .\n"
dockerignore = "*.ckpt\nwandb/\noutputs/\n"

# Build and push in one step
build_and_push_image(
    context_dir="./my-project",
    tag="registry.example.com/my-project:abc123",
    dockerfile_content=dockerfile,
    dockerignore_content=dockerignore,
    cache_from=cache_from,
    on_progress=lambda msg: print(f"[docker] {msg}"),
)

# Create :latest alias (fast, no layer re-upload)
create_image_tag("registry.example.com/my-project:abc123", latest)