The terraform-google-chamber-gke module deploys a production-ready Google GKE cluster with GPU autoscaling, NVIDIA drivers, and the Chamber Agent — all in a single terraform apply.

Prerequisites

Install Terraform from developer.hashicorp.com/terraform/install, then verify the installation with `terraform version`.
Install the gcloud CLI and authenticate:
```shell
gcloud auth login
gcloud auth application-default login
```
Enable the required APIs in your project:
```shell
gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  iam.googleapis.com \
  --project=YOUR_PROJECT_ID
```
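If you want to confirm the APIs are active before running Terraform, one quick check is to list enabled services and filter for the three required ones (the `grep` pattern here is just an illustration):

```shell
gcloud services list --enabled --project=YOUR_PROJECT_ID \
  | grep -E 'container|compute|iam'
```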
You need a cluster token and cluster ID from the Chamber Console. See Getting a Cluster Token for instructions.

Quick Start

The GKE module requires explicit provider configuration because the kubernetes, helm, and kubectl providers need the cluster endpoint and credentials from the module outputs. The full configuration is shown below.
Step 1: Create main.tf

Create a new directory for your Terraform configuration and add a main.tf file:
```hcl
provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

provider "google-beta" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

data "google_client_config" "default" {}

provider "kubernetes" {
  host                   = "https://${module.chamber_gke.cluster_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.chamber_gke.cluster_ca_certificate)
}

provider "helm" {
  kubernetes {
    host                   = "https://${module.chamber_gke.cluster_endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(module.chamber_gke.cluster_ca_certificate)
  }
}

provider "kubectl" {
  host                   = "https://${module.chamber_gke.cluster_endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.chamber_gke.cluster_ca_certificate)
  load_config_file       = false
}

module "chamber_gke" {
  source = "github.com/ChamberOrg/terraform-google-chamber-gke"

  gcp_project_id        = var.gcp_project_id
  gcp_region            = var.gcp_region
  cluster_name          = "my-gpu-cluster"
  chamber_cluster_token = var.chamber_cluster_token
  chamber_cluster_id    = var.chamber_cluster_id
}

variable "gcp_project_id" {
  type = string
}

variable "gcp_region" {
  type = string
}

variable "chamber_cluster_token" {
  type      = string
  sensitive = true
}

variable "chamber_cluster_id" {
  type = string
}

output "configure_kubectl" {
  value = module.chamber_gke.configure_kubectl
}
```
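Note that unlike the `kubernetes` and `helm` providers, the `kubectl` provider is community-maintained, so `terraform init` needs an explicit source for it. A minimal sketch, assuming the widely used gavinbunney/kubectl provider (check the module README for the exact requirement):

```hcl
terraform {
  required_providers {
    kubectl = {
      source = "gavinbunney/kubectl"
    }
  }
}
```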
Step 2: Create terraform.tfvars

```hcl
gcp_project_id        = "my-gcp-project"
gcp_region            = "us-central1"
chamber_cluster_token = "your-token-here"
chamber_cluster_id    = "your-cluster-id"
```
Do not commit terraform.tfvars to version control; add it to .gitignore. For CI/CD pipelines, pass secrets as environment variables such as TF_VAR_chamber_cluster_token instead.
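For example, in a CI/CD pipeline the sensitive inputs can be supplied entirely through the environment (placeholder values shown, matching the tfvars example above):

```shell
# Terraform automatically reads TF_VAR_* environment variables
# for input variables of the same name, so no tfvars file is needed.
export TF_VAR_chamber_cluster_token="your-token-here"
export TF_VAR_chamber_cluster_id="your-cluster-id"
```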
Step 3: Deploy

```shell
terraform init
terraform plan
terraform apply
```
Deployment takes approximately 15-20 minutes.
Step 4: Configure kubectl

Run the gcloud command returned by the configure_kubectl output:

```shell
$(terraform output -raw configure_kubectl)
```
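If you prefer to run the command explicitly rather than through command substitution, it typically has this shape (cluster name, region, and project taken from the values used earlier in this guide):

```shell
gcloud container clusters get-credentials my-gpu-cluster \
  --region us-central1 \
  --project my-gcp-project
```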
Step 5: Verify

```shell
# Verify system nodes are ready
kubectl get nodes -l purpose=system

# Verify Karpenter is running
kubectl get pods -n karpenter

# Verify the Chamber Agent is connected
kubectl get pods -n chamber-system
```
Your cluster should appear in the Chamber Console under Capacity Pools.

Using an Existing VPC

To deploy into an existing VPC instead of creating a new one:
```hcl
module "chamber_gke" {
  source = "github.com/ChamberOrg/terraform-google-chamber-gke"

  gcp_project_id        = var.gcp_project_id
  gcp_region            = var.gcp_region
  cluster_name          = "my-gpu-cluster"
  chamber_cluster_token = var.chamber_cluster_token
  chamber_cluster_id    = var.chamber_cluster_id

  create_vpc        = false
  network_name      = "my-existing-vpc"
  subnetwork_name   = "my-existing-subnet"
  ip_range_pods     = "pods"
  ip_range_services = "services"
}
```
The existing subnet must have secondary IP ranges for pods and services, matching the names given in ip_range_pods and ip_range_services. Cloud NAT must be configured for private node egress.
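If the subnet is missing the secondary ranges, they can be added with gcloud; a sketch in which the range names match the module inputs above and the CIDRs are placeholders to adapt:

```shell
gcloud compute networks subnets update my-existing-subnet \
  --region=us-central1 \
  --add-secondary-ranges=pods=10.1.0.0/16,services=10.2.0.0/20
```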

Key Variables

The table below covers the most commonly configured variables. For the complete list, see the module README on GitHub.

Required

| Variable | Description | Type |
|---|---|---|
| gcp_project_id | GCP project ID | string |
| cluster_name | Name of the GKE cluster | string |
| chamber_cluster_token | Cluster token from Chamber Console | string |
| chamber_cluster_id | Cluster ID from Chamber Console | string |

GCP

| Variable | Description | Default |
|---|---|---|
| gcp_region | GCP region for the GKE cluster | "us-central1" |
| gcp_zones | Zones within the region (defaults to first 3) | [] |

VPC

| Variable | Description | Default |
|---|---|---|
| create_vpc | Create a new VPC or use existing | true |
| network_name | Existing VPC name (required when create_vpc = false) | null |
| subnetwork_name | Existing subnet name (required when create_vpc = false) | null |
| vpc_cidr | CIDR block for new VPC primary subnet | "10.0.0.0/16" |

GKE

| Variable | Description | Default |
|---|---|---|
| cluster_version | Kubernetes version | "1.32" |
| system_node_machine_type | Machine type for system node pool | "e2-standard-4" |
| enable_private_endpoint | Private endpoint only (no public API access) | false |

GPU

| Variable | Description | Default |
|---|---|---|
| create_default_gpu_nodepool | Create Terraform-managed GPU NodePool | false |
| gpu_machine_families | GPU machine families for NodePool | ["a2", "g2", "a3"] |
| gpu_accelerator_types | GPU accelerator types | ["nvidia-l4", "nvidia-a100-80gb", "nvidia-h100-80gb"] |
| capacity_types | Capacity types (on-demand, spot) | ["on-demand", "spot"] |
| gpu_limits | Maximum GPUs for NodePool | 100 |

Chamber

| Variable | Description | Default |
|---|---|---|
| chamber_agent_version | Chamber Agent version | "latest" |
| enable_kai_scheduler | Enable KAI fractional GPU scheduler | true |

Key Outputs

| Output | Description |
|---|---|
| cluster_name | GKE cluster name |
| cluster_endpoint | GKE API server endpoint |
| network_name | VPC network name |
| configure_kubectl | gcloud command to configure kubectl |
| verification_commands | Commands to verify the deployment |
| karpenter_service_account_email | Karpenter controller service account email |
For all outputs, see the module README on GitHub.

GPU Pool Management

After deployment, Karpenter needs at least one GPU pool to know which GPU nodes to provision. There are two approaches: manage pools dynamically through the Chamber Console, or have Terraform create a default GPU NodePool by setting create_default_gpu_nodepool = true.

To manage GPU pools through the Chamber Console:

1. Go to Capacity Pools > Create Dynamic Pool
2. Select your cluster and configure the GPU type, limits, and capacity types
3. The pool syncs to your cluster automatically, and Karpenter provisions GPU nodes on demand

The Console approach is recommended for most teams because it allows per-GPU-type management with real-time limit adjustments.
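The Terraform-managed alternative uses the GPU variables from the tables above; a minimal sketch, with illustrative accelerator, capacity, and limit choices:

```hcl
module "chamber_gke" {
  source = "github.com/ChamberOrg/terraform-google-chamber-gke"

  # ... required variables as in the Quick Start ...

  create_default_gpu_nodepool = true
  gpu_accelerator_types       = ["nvidia-l4"]
  capacity_types              = ["spot"]
  gpu_limits                  = 100
}
```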

Troubleshooting

Agent not connecting? Check the Chamber Agent logs:

```shell
kubectl logs -n chamber-system -l app.kubernetes.io/name=chamber-agent --tail=50
```

Verify that your chamber_cluster_token and chamber_cluster_id are correct.

GPU nodes not provisioning?

1. Verify a GPU pool exists:

   ```shell
   kubectl get nodepools.karpenter.sh
   ```

   If none exists, create one via the Chamber Console or set create_default_gpu_nodepool = true.

2. Check the Karpenter logs:

   ```shell
   kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100
   ```

3. Verify the GCENodeClass exists:

   ```shell
   kubectl get gcenodeclasses
   ```

Fractional GPU pods not scheduling? Check the KAI Scheduler logs and the pod events:

```shell
kubectl logs -n chamber-system -l app=kai-scheduler --tail=50
kubectl describe pod <pod-name>
```

Cleanup

Before destroying, ensure all GPU workloads are terminated to avoid orphaned resources, then run:

```shell
terraform destroy
```
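One cautious teardown order is to remove the GPU pools first so Karpenter deprovisions its nodes before the cluster itself is destroyed; a sketch, assuming nothing else depends on the cluster:

```shell
kubectl delete nodepools.karpenter.sh --all   # Karpenter scales GPU nodes down
terraform destroy
```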

Next Steps