Coming Soon — Fully Managed GPU Infrastructure

We're building automated infrastructure management across cloud providers so your research and MLE teams can run GPU workloads directly in their own cloud accounts without ever thinking about infrastructure. If you'd like early access or want to learn more, reach out at support@usechamber.com.
The client.run() method is designed for scientists and ML practitioners who want to submit GPU workloads without needing expertise in Docker or Kubernetes. Point it at your training project and let the SDK handle the rest.
Install with the run extra to use this feature:

```shell
pip install chamber-sdk[run]
```
Why Use client.run()?
Traditional GPU workload submission requires:
- Writing a Dockerfile optimized for your ML framework
- Choosing the right base image (CUDA version, framework version, etc.)
- Building and pushing the container image
- Authenticating to your container registry
- Writing a Kubernetes manifest
- Submitting via the API
With client.run(), all of this happens automatically:
```python
from chamber_sdk import ChamberClient

client = ChamberClient.from_config()

# Configure your registries once
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")

# Then submit with just a name
job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="ml-research",
    registry="prod",  # Uses named registry
)
print(f"Submitted: {job.id}")
```
What It Does Automatically
Framework Detection
Analyzes your requirements.txt to identify PyTorch, TensorFlow, JAX, or other frameworks and selects the optimal NVIDIA NGC base image.
Dockerfile Generation
Creates an optimized Dockerfile with proper CUDA configuration, dependency installation, and entrypoint setup.
Container Build & Push
Uses docker buildx build --push to build and push in a single efficient step. For AWS ECR and Google Artifact Registry, authentication and repository creation are handled automatically. Pulls the :latest tag to seed the layer cache for faster rebuilds.
Kubernetes Manifest
Generates the appropriate manifest (Job or RayJob) with correct GPU resource requests, environment variables, and volume mounts.
Workload Submission
Submits the workload to Chamber with proper workload class, priority, and team assignment.
Supported Container Registries
Chamber automatically handles authentication and repository creation for major cloud registries. Just provide your registry URL and the SDK does the rest.
Google Artifact Registry Full auto-authentication via gcloud CLI. Repositories are created automatically if they don’t exist.
AWS ECR Full auto-authentication via AWS CLI. Repositories are created automatically if they don’t exist.
| Registry | URL Pattern | Auto-Auth | Auto-Create Repo |
|---|---|---|---|
| Google Artifact Registry | `{region}-docker.pkg.dev/{project}/{repo}` | ✅ | ✅ |
| AWS ECR | `{account}.dkr.ecr.{region}.amazonaws.com` | ✅ | ✅ |
| Other registries | Any Docker-compatible registry | Manual | Manual |
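The URL patterns above are distinctive enough to match with simple regular expressions. A sketch of how a registry URL could be classified (the function name is illustrative, not the SDK's actual API):

```python
import re

def detect_registry_type(url: str) -> str:
    """Classify a registry URL as 'gar', 'ecr', or 'other' by its host pattern."""
    # {region}-docker.pkg.dev/{project}/{repo}
    if re.match(r"^[a-z0-9-]+-docker\.pkg\.dev/", url):
        return "gar"
    # {account}.dkr.ecr.{region}.amazonaws.com
    if re.match(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com", url):
        return "ecr"
    return "other"  # any Docker-compatible registry; manual auth
```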
Prerequisites
- **Docker** — must be installed and running on your machine. The SDK uses Docker to build and push images.
- **gcloud CLI** — for Google Artifact Registry. Install and run `gcloud auth login`.
- **AWS CLI** — for AWS ECR. Install and run `aws configure`.
Google Artifact Registry
AWS ECR
Setup:

```shell
# Install gcloud CLI: https://cloud.google.com/sdk/docs/install
gcloud auth login
```
Required IAM Permissions:
- `artifactregistry.repositories.get`
- `artifactregistry.repositories.create`
- `artifactregistry.repositories.uploadArtifacts`
Setup:

```shell
# Install AWS CLI: https://aws.amazon.com/cli/
aws configure
```
Required IAM Permissions:
- `ecr:GetAuthorizationToken`
- `ecr:CreateRepository`
- `ecr:BatchCheckLayerAvailability`
- `ecr:PutImage`
- `ecr:InitiateLayerUpload`
- `ecr:UploadLayerPart`
- `ecr:CompleteLayerUpload`
Basic Usage
Minimal Example
Google Artifact Registry
AWS ECR
```python
from chamber_sdk import ChamberClient

client = ChamberClient.from_config()
job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="my-team-id",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
)
```
```python
from chamber_sdk import ChamberClient

client = ChamberClient.from_config()
job = client.run(
    "./my-training-project",
    gpus=4,
    gpu_type="H100",
    team="my-team-id",
    registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",
)
```
With Progress Callbacks
Monitor each stage of the pipeline:
```python
def on_progress(stage: str, message: str):
    print(f"[{stage}] {message}")

job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="us-central1-docker.pkg.dev/my-project/ml-images",
    on_progress=on_progress,
)
```
Example output (Google Artifact Registry):

```text
[config] Resolving configuration...
[detect] Detecting project...
[detect] Framework: pytorch
[detect] Entrypoint: train.py
[dockerfile] Generating Dockerfile...
[manifest] Generating K8s manifest...
[docker] Authenticating to Google Artifact Registry (us-central1)...
[docker] GAR authentication successful
[docker] Creating repository ml-images...
[docker] Repository ready
[docker] Pulling cache image: us-central1-docker.pkg.dev/my-project/ml-images/my-project:latest
[docker] Cache image pulled successfully
[docker] Build context size: 42.5 MB
[docker] Building and pushing image for linux/amd64 (x86_64)
[docker] Image tag: us-central1-docker.pkg.dev/my-project/ml-images/my-project:a1b2c3d4
[docker] Build and push complete
[submit] Submitting workload...
[submit] Workload submitted: wl_abc123
```
Example output (AWS ECR):

```text
[config] Resolving configuration...
[detect] Detecting project...
[detect] Framework: pytorch
[detect] Entrypoint: train.py
[dockerfile] Generating Dockerfile...
[manifest] Generating K8s manifest...
[docker] Authenticating to ECR (us-east-1)...
[docker] ECR authentication successful
[docker] Pulling cache image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-project:latest
[docker] Cache image pulled successfully
[docker] Build context size: 42.5 MB
[docker] Building and pushing image for linux/amd64 (x86_64)
[docker] Image tag: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-project:a1b2c3d4
[docker] Build and push complete
[submit] Submitting workload...
[submit] Workload submitted: wl_abc123
```
Wait for Completion
Block until the workload finishes:
```python
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    wait=True,
    poll_interval=30,
    timeout=7200,  # 2 hours
)
print(f"Final status: {job.status}")
```
Registry Configuration
Configure your container registries once and reference them by name. This makes it easy to switch between dev/staging/prod environments.
Setting Up Named Registries
Add registries to ~/.chamber/config.json:
```json
{
  "registries": {
    "prod": "us-east1-docker.pkg.dev/my-project/prod",
    "dev": "us-east1-docker.pkg.dev/my-project/dev",
    "staging": "us-central1-docker.pkg.dev/my-project/staging",
    "ecr": "123456789012.dkr.ecr.us-east-1.amazonaws.com"
  },
  "default_registry": "prod"
}
```
Or configure programmatically:
```python
from chamber_sdk import ChamberClient

# Add registries (persisted to ~/.chamber/config.json)
ChamberClient.add_registry("prod", "us-east1-docker.pkg.dev/my-project/prod", set_default=True)
ChamberClient.add_registry("dev", "us-east1-docker.pkg.dev/my-project/dev")
ChamberClient.add_registry("ecr", "123456789012.dkr.ecr.us-east-1.amazonaws.com")

# List configured registries
print(ChamberClient.list_registries())
# {'prod': 'us-east1-docker.pkg.dev/my-project/prod', 'dev': '...', 'ecr': '...'}

# Check default
print(ChamberClient.get_default_registry())
# ('prod', 'us-east1-docker.pkg.dev/my-project/prod')

# Change default
ChamberClient.set_default_registry("dev")
```
Using Named Registries
Once configured, use registries by name:
```python
# Uses default registry (no registry parameter needed)
job = client.run("./my-project", gpus=4, team="ml-team")

# Use a specific registry by name
job = client.run("./my-project", gpus=4, team="ml-team", registry="dev")
job = client.run("./my-project", gpus=4, team="ml-team", registry="ecr")

# Full URLs still work
job = client.run("./my-project", gpus=4, team="ml-team",
                 registry="us-west1-docker.pkg.dev/other-project/images")
```
Configuration File
Create a .chamber.yaml file in your project directory to avoid repeating parameters:
```yaml
# .chamber.yaml
name: my-training-job
gpus: 4
gpu_type: H100
team: my-team-id
registry: prod  # Use named registry, or full URL

# Python entrypoint
entrypoint: train.py
entrypoint_args: --batch-size 32 --epochs 10

# Distributed training mode
distributed: auto  # "auto", "ray", "deepspeed", or "none"

# Workload class
job_class: ELASTIC  # or "RESERVED"

# Environment variables
env:
  CUDA_VISIBLE_DEVICES: "0,1,2,3"
  NCCL_DEBUG: INFO

# Forward secrets from local environment
forward_env:
  - WANDB_API_KEY
  - HF_TOKEN

# Additional packages
extra_pip_packages:
  - wandb
  - tensorboard
extra_apt_packages:
  - ffmpeg

# Build customization
pre_build_commands:
  - pip install flash-attn --no-build-isolation
post_build_commands:
  - python -c "import torch; print(torch.cuda.is_available())"

# Resource overrides
cpu: "24"
memory: 200Gi
shm_size: 64Gi

# Files to exclude from container
ignore:
  - "*.ckpt"
  - "wandb/"
  - "outputs/"
```
Then submit with minimal arguments:
```python
job = client.run("./my-project")  # Uses .chamber.yaml settings
```
Dry Run (Preview Mode)
Preview exactly what will happen without building or submitting:
```python
result = client.run("./my-project", dry_run=True)

print("=== Detected Profile ===")
print(f"Framework: {result.profile.framework}")
print(f"Entrypoint: {result.profile.entrypoint}")
print(f"Python version: {result.profile.python_version}")

print("\n=== Generated Dockerfile ===")
print(result.dockerfile)

print("\n=== K8s Manifest ===")
print(result.manifest)

print("\n=== Submit Payload ===")
for key, value in result.submit_payload.items():
    if key != "k8s_manifest":
        print(f"{key}: {value}")
```
Save Generated Files
Inspect the generated Dockerfile and manifest:
```python
job = client.run(
    "./my-project",
    gpus=4,
    gpu_type="H100",
    team="my-team",
    registry="123456.dkr.ecr.us-east-1.amazonaws.com",
    save_dockerfile=True,  # Writes Dockerfile.chamber
    save_manifest=True,    # Writes manifest.chamber.yaml
)
```
Framework Detection
The SDK automatically detects your ML framework from requirements.txt and selects the optimal base image:
| Framework | Detected From | Base Image |
|---|---|---|
| PyTorch | `torch`, `pytorch` | `nvcr.io/nvidia/pytorch:24.04-py3` |
| TensorFlow | `tensorflow`, `keras` | `nvcr.io/nvidia/tensorflow:24.04-tf2-py3` |
| JAX | `jax`, `jaxlib` | `nvcr.io/nvidia/jax:24.04-py3` |
| Generic | (fallback) | `nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04` |
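A simplified picture of that detection logic, assuming it works by normalizing package names from requirements.txt and checking them against the marker packages in the table (the helper function is illustrative, not the SDK's actual code):

```python
import re

FRAMEWORK_MARKERS = {
    "pytorch": {"torch", "pytorch"},
    "tensorflow": {"tensorflow", "keras"},
    "jax": {"jax", "jaxlib"},
}

def detect_framework(requirements_text: str) -> str:
    """Return the first framework whose marker package appears in requirements.txt."""
    pkgs = set()
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if line:
            # strip version specifiers and extras: "torch==2.3" -> "torch"
            pkgs.add(re.split(r"[<>=\[~!]", line)[0].strip().lower())
    for framework, markers in FRAMEWORK_MARKERS.items():
        if pkgs & markers:
            return framework
    return "generic"  # falls back to the plain CUDA base image
```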
Override the base image if needed:
```python
job = client.run(
    "./my-project",
    base_image="nvcr.io/nvidia/pytorch:24.01-py3",
    # ... other args
)
```
Distributed Training
The SDK auto-detects distributed training frameworks and configures the appropriate launch command:
| Framework | Detected From | Launch Command |
|---|---|---|
| DeepSpeed | `deepspeed` in requirements | `deepspeed --num_gpus N train.py` |
| Accelerate | `accelerate` in requirements | `accelerate launch train.py` |
| Ray | `ray` in requirements | Creates RayJob K8s manifest |
| Horovod | `horovod` in requirements | Uses standard launcher |
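The launch-command selection amounts to a small dispatch on the detected mode; a sketch (the function name and fallback behavior are assumptions, not the SDK's actual implementation):

```python
def build_launch_command(entrypoint: str, gpus: int, distributed: str,
                         args: str = "") -> str:
    """Map a distributed mode to the corresponding launcher invocation."""
    suffix = f" {args}" if args else ""
    if distributed == "deepspeed":
        return f"deepspeed --num_gpus {gpus} {entrypoint}{suffix}"
    if distributed == "accelerate":
        return f"accelerate launch {entrypoint}{suffix}"
    # "ray" is handled at the manifest level (RayJob), not via the launch command;
    # "none" and unrecognized modes fall back to plain python
    return f"python {entrypoint}{suffix}"
```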
Force a specific distributed mode:
```python
# DeepSpeed
job = client.run(
    "./deepspeed-project",
    gpus=8,
    distributed="deepspeed",
    # ...
)

# Ray (creates RayJob for large-scale distributed training)
job = client.run(
    "./ray-project",
    gpus=32,
    distributed="ray",
    # ...
)

# Disable distributed detection
job = client.run(
    "./single-gpu-project",
    gpus=1,
    distributed="none",
    # ...
)
```
Using an Existing Dockerfile
Skip Dockerfile generation and use your own:
```python
job = client.run(
    "./my-project",
    dockerfile="./Dockerfile.custom",
    # ...
)
```
Registry Auto-Authentication
The SDK automatically detects your registry type and handles authentication seamlessly. No manual docker login required.
Google Artifact Registry
AWS ECR
When using Google Artifact Registry, the SDK automatically:
Detects GAR registry URLs (pattern: {region}-docker.pkg.dev/{project}/{repo})
Authenticates Docker using your gcloud CLI credentials
Creates the repository if it doesn’t exist
```python
# GAR authentication happens automatically!
job = client.run(
    "./my-project",
    registry="us-central1-docker.pkg.dev/my-gcp-project/ml-images",
    team="ml-team",
    gpus=4,
    gpu_type="H100",
)
```
Supported regions: all GCP regions (us-central1, us-east1, europe-west1, asia-northeast1, etc.).

Ensure your gcloud CLI is authenticated (`gcloud auth login`) before running.
When using AWS ECR, the SDK automatically:
Detects ECR registry URLs (pattern: {account}.dkr.ecr.{region}.amazonaws.com)
Authenticates Docker using your AWS CLI credentials
Creates the ECR repository if it doesn’t exist
```python
# ECR authentication happens automatically!
job = client.run(
    "./my-project",
    registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",
    team="ml-team",
    gpus=4,
    gpu_type="H100",
)
```
Supported regions: all AWS regions (us-east-1, us-west-2, eu-west-1, ap-northeast-1, etc.).

Ensure your AWS CLI is configured (`aws configure`) before running.
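For ECR, the auto-authentication boils down to the standard `aws ecr get-login-password | docker login` flow. A sketch that constructs those two commands without executing them (the helper is illustrative; the CLI commands themselves are the real AWS/Docker invocations):

```python
def ecr_login_commands(registry: str) -> tuple[list[str], list[str]]:
    """Build the AWS CLI and docker login argv pairs for an ECR registry URL."""
    # URL shape: {account}.dkr.ecr.{region}.amazonaws.com
    region = registry.split(".")[3]
    get_token = ["aws", "ecr", "get-login-password", "--region", region]
    # The token from the first command is piped to docker login via stdin
    login = ["docker", "login", "--username", "AWS", "--password-stdin", registry]
    return get_token, login
```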
How It Works

When you call client.run(), the SDK inspects the registry URL, matches it against the patterns above, and runs the corresponding authentication and repository-creation flow. This means you can switch between registries just by changing the URL—no code changes required.
All Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `directory` | str | required | Path to project directory |
| `gpus` | int | 1 | Number of GPUs |
| `gpu_type` | str | "H100" | GPU type (H100, A100, L40S, etc.) |
| `team` | str | None | Team ID (required) |
| `name` | str | directory name | Workload name |
| `entrypoint` | str | auto-detect | Python entry point |
| `entrypoint_args` | str | None | CLI arguments for entrypoint |
| `registry` | str | None | Registry name (e.g., "prod") or URL. Uses default if not specified. |
| `base_image` | str | auto-select | Override base Docker image |
| `dockerfile` | str | None | Path to existing Dockerfile |
| `distributed` | str | "auto" | "auto", "ray", "deepspeed", or "none" |
| `job_class` | str | "ELASTIC" | "RESERVED" or "ELASTIC" |
| `env` | dict | None | Environment variables |
| `no_cache` | bool | False | Force rebuild even if image exists |
| `dry_run` | bool | False | Preview without executing |
| `save_dockerfile` | bool | False | Save generated Dockerfile |
| `save_manifest` | bool | False | Save generated K8s manifest |
| `wait` | bool | False | Wait for workload completion |
| `poll_interval` | float | 10.0 | Seconds between status checks |
| `timeout` | float | None | Max wait time in seconds |
| `on_progress` | callable | None | Progress callback(stage, message) |
Error Handling
```python
from chamber_sdk import ChamberClient, DockerError
from chamber_sdk.run import RegistryAuthError

client = ChamberClient.from_config()

try:
    job = client.run(
        "./my-project",
        registry="123456.dkr.ecr.us-east-1.amazonaws.com",
        team="my-team",
        gpus=4,
        gpu_type="H100",
    )
except RegistryAuthError as e:
    # Registry authentication failed (401/403 from registry).
    # Useful for detecting expired ECR tokens (12-hour validity).
    print(f"Registry auth failed for {e.registry}")
    print(f"Details: {e.detail}")
except DockerError as e:
    # Docker not installed, daemon not running, build failed, or push failed
    print(f"Docker error: {e}")
except FileNotFoundError as e:
    # No Python entrypoint found in project
    print(f"Project error: {e}")
except ValueError as e:
    # Missing required parameters (registry, team)
    print(f"Configuration error: {e}")
```
Configuration Priority
Settings are resolved in this order (highest priority first):
1. **Function arguments** — `client.run(..., gpus=8)` overrides everything
2. **Project `.chamber.yaml`** — project-specific settings
3. **Global `~/.chamber/config.json`** — default registry and other global settings
4. **Auto-detection** — framework, entrypoint, Python version
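That resolution order is equivalent to a chained dict merge, applied lowest priority first so that later sources win. A sketch of the idea (the helper name and None-skipping behavior are assumptions):

```python
def resolve_settings(auto: dict, global_cfg: dict, project_yaml: dict,
                     kwargs: dict) -> dict:
    """Merge settings so later (higher-priority) sources override earlier ones."""
    merged = {}
    # Lowest priority first: auto-detection, global config, .chamber.yaml, call args
    for source in (auto, global_cfg, project_yaml, kwargs):
        merged.update({k: v for k, v in source.items() if v is not None})
    return merged
```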
Registry Resolution
The registry parameter can be a name or URL:
- If it looks like a URL (contains `.` or `/`), it's used as-is
- Otherwise, it's looked up in `~/.chamber/config.json` under `registries`
- If not specified, the `default_registry` from config is used
Caching
The SDK computes a content hash of your project and uses it as the image tag. If the image already exists in the registry, the build and push steps are skipped:
```text
[docker] Image already exists: 123456.dkr.ecr.us-east-1.amazonaws.com/my-project:a1b2c3d4 (skipping build & push)
```
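A content hash like this can be computed deterministically by hashing every file's relative path and bytes in sorted order, so unchanged projects always map to the same tag. A sketch of one way to do it (the SDK's exact scheme, e.g. what it excludes, may differ):

```python
import hashlib
from pathlib import Path

def content_hash(directory: str, length: int = 8) -> str:
    """Hash each file's relative path and contents, in sorted order, into a short tag."""
    h = hashlib.sha256()
    root = Path(directory)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(str(path.relative_to(root)).encode())  # path matters, not just bytes
        h.update(path.read_bytes())
    return h.hexdigest()[:length]
```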
Force a rebuild with no_cache=True:
```python
job = client.run(
    "./my-project",
    no_cache=True,  # Always rebuild
    # ...
)
```
Docker Build Optimizations
The SDK automatically applies several optimizations to make builds fast:
| Optimization | What it does |
|---|---|
| Single build+push | Uses `docker buildx build --push` to build and push in one step. This is significantly faster than separate build+push because BuildKit pushes layers directly as they complete |
| BuildKit | Enabled by default (`DOCKER_BUILDKIT=1`) for parallel build stages and advanced caching |
| Pip cache mounts | Uses `--mount=type=cache,target=/root/.cache/pip` so pip packages are cached across builds |
| Remote layer caching | Pulls the `:latest` tag before building to seed the layer cache. After a successful build, creates a `:latest` tag alias using `docker buildx imagetools create` (fast manifest aliasing, no layer re-upload) |
| Content-addressed tags | Images are tagged with a content hash. If the image already exists, build and push are skipped entirely |
| Platform targeting | Explicitly builds for `linux/amd64` to ensure consistent images |
| Reduced metadata | Uses `--provenance=false --sbom=false` to skip unnecessary metadata generation |
| Context size warnings | Reports build context size and warns if it exceeds 500 MB |
Why single build+push matters: With Docker Desktop’s containerd image store, a separate docker push can push ALL manifests from multi-platform base images (e.g., 6 manifests for NVIDIA images), causing each layer to be checked 6 times. The buildx build --push approach only pushes what it built — making pushes dramatically faster, especially for large ML images.
Low-Level Build Functions
For advanced use cases, you can access the optimized build functions directly:
```python
from chamber_sdk.run import (
    build_and_push_image,
    pull_image_for_cache,
    create_image_tag,
    latest_tag,
    compute_context_size,
)

# Check context size before building
size = compute_context_size("./my-project")
print(f"Build context: {size / 1024 / 1024:.1f} MB")

# Pull cache image
cache_from = None
latest = latest_tag("registry.example.com/my-project:abc123")
if pull_image_for_cache(latest):
    cache_from = latest

# Build and push in one step
# (dockerfile and dockerignore are the generated file contents as strings)
build_and_push_image(
    context_dir="./my-project",
    tag="registry.example.com/my-project:abc123",
    dockerfile_content=dockerfile,
    dockerignore_content=dockerignore,
    cache_from=cache_from,
    on_progress=lambda msg: print(f"[docker] {msg}"),
)

# Create :latest alias (fast, no layer re-upload)
create_image_tag("registry.example.com/my-project:abc123", latest)
```