Via Pathways
This guide walks through running MaxText workloads on a Google Kubernetes Engine (GKE) cluster using Pathways. Pathways orchestrates large-scale JAX jobs on AI Hypercomputer infrastructure.
This document assumes you have already created a Pathways GKE cluster using xpk. If you haven’t, follow the instructions in the Google Cloud Pathways & XPK documentation.
We will cover two primary modes of operation:
Batch workload: Ideal for long-running, non-interactive training jobs.
Headless workload: Ideal for interactive development, debugging, and running code from a local machine or CPU VM.
1. Prerequisites
Before you can run a MaxText workload, you must complete the following setup steps.
Install XPK and its dependencies. Ensure that the xpk command-line tool is installed.
Create a GKE cluster configured for Pathways.
Build and upload a MaxText Docker image to your project’s Artifact Registry. For instructions on building and uploading the MaxText Docker image, please refer to the official documentation.
2. Environment configuration
The following commands use placeholder variables. Before running them, set these environment variables in your shell.
# -- Google Cloud Configuration --
# Your GCP project ID. Find it on the [Cloud Console Dashboard](https://console.cloud.google.com/home/dashboard).
export PROJECT_ID=<GCP project ID>
# The GCP location (listed as "Location" in the UI) and name of your
# TPU-enabled GKE cluster. Both can be found on the
# [Cloud Console](https://console.cloud.google.com/kubernetes/list).
export ZONE=<GCP location> # e.g., 'us-central1'
export GKE_CLUSTER=<cluster name>
# -- Workload Configuration --
# An arbitrary string to identify this specific run.
# Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
export RUN_NAME="maxtext-run-$(date +%Y%m%d-%H%M%S)"
# For a full list of MaxText-supported TPU types, see: `src/maxtext/utils/accelerator_to_spec_map.py`. To see the TPU type
# of your cluster:
# 1. Connect to the cluster (required for kubectl commands later):
# gcloud container clusters get-credentials ${GKE_CLUSTER?} --location ${ZONE?} --project ${PROJECT_ID?}
# 2. Find your TPU type (e.g., 'v5p-128') by checking the accelerator labels on your nodes:
# kubectl get nodes -l cloud.google.com/gke-tpu-accelerator -o jsonpath='{.items[*].metadata.labels.cloud\.google\.com/gke-tpu-accelerator}' | tr ' ' '\n' | sort -u
export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
export NUM_SLICES=1 # Number of TPU slices for your job
# -- MaxText & Storage Configuration --
# Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
# region as your TPUs to minimize latency and costs.
# You can list your buckets and their locations in the
# [Cloud Console](https://console.cloud.google.com/storage/browser).
export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs
# The Docker image you pushed in the prerequisite step
export CLOUD_IMAGE_NAME=<image name>
export DOCKER_IMAGE="gcr.io/${PROJECT_ID?}/${CLOUD_IMAGE_NAME?}"
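A run name that is not a valid DNS label may only be rejected once the workload is submitted, so it can help to check it locally first. The following is a minimal sketch, not part of MaxText or XPK: `is_dns_label` is a hypothetical helper that tests a string against the general Kubernetes DNS label rules (lowercase letters, digits, and hyphens, at most 63 characters, starting and ending with a letter or digit). XPK may enforce a shorter length limit, so treat a pass here as necessary rather than sufficient.

```shell
# Hypothetical helper (not part of XPK or MaxText).
# Returns 0 if $1 looks like a valid DNS label:
#   - 1 to 63 characters
#   - only lowercase letters, digits, and hyphens
#   - starts and ends with a letter or digit
is_dns_label() {
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$'
}

# Same naming scheme as in the configuration block above.
RUN_NAME="maxtext-run-$(date +%Y%m%d-%H%M%S)"
if is_dns_label "${RUN_NAME}"; then
  echo "RUN_NAME is a valid DNS label: ${RUN_NAME}"
else
  echo "RUN_NAME is not a valid DNS label: ${RUN_NAME}" >&2
fi
```

The check is most useful when you replace the generated name with a hand-written one, where uppercase letters or underscores are easy to introduce by accident.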
3. Running a batch workload
A batch workload runs entirely within the GKE cluster. You submit the job definition, and Pathways manages its execution.
Submit the batch workload
Use the xpk workload create-pathways command to start the job.
xpk workload create-pathways \
--workload=${RUN_NAME?} \
--cluster=${GKE_CLUSTER?} \
--num-slices=${NUM_SLICES?} \
--tpu-type=${TPU_TYPE?} \
--project=${PROJECT_ID?} \
--zone=${ZONE?} \
--docker-image=${DOCKER_IMAGE?} \
--command="python3 -m maxtext.trainers.pre_train.train \
base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1 \
enable_checkpointing=false \
dataset_type=synthetic \
enable_single_controller=True \
run_name=${RUN_NAME?}-pathways-batch"
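The commands in this guide write variables as `${VAR?}` rather than `$VAR`. This is standard POSIX parameter expansion: if the variable is unset, the shell prints an error and aborts the command instead of silently substituting an empty string. A short demonstration, using a throwaway variable named `DEMO_VAR` that is not part of this guide:

```shell
unset DEMO_VAR
# The subshell aborts at the expansion, so the fallback message is printed.
demo_unset=$( (echo "value: ${DEMO_VAR?}") 2>/dev/null || echo "aborted: DEMO_VAR is unset" )
echo "$demo_unset"

DEMO_VAR="v5p-8"
# With the variable set, the expansion behaves like a plain ${DEMO_VAR}.
demo_set=$(echo "value: ${DEMO_VAR?}")
echo "$demo_set"
```

Note that `${VAR?}` only checks whether the variable is set. A variable that is set to an empty string passes; the `${VAR:?}` form rejects empty values as well.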
Verify the workload
You can check the status of your running workloads with the xpk workload list command.
xpk workload list --cluster=${GKE_CLUSTER?} --project=${PROJECT_ID?} --zone=${ZONE?}
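To look at the pods behind a single run, one option is to filter kubectl output by workload name. The sketch below assumes that pod names begin with the workload name, as the head pod `${RUN_NAME?}-pathways-head-0` does; `pods_for_workload` is a hypothetical helper, not an XPK command, and the sample pod names are made up for illustration.

```shell
# Hypothetical helper: read pod names (one per line, as printed by
# `kubectl get pods -o name`) on stdin and keep those whose name starts
# with the given workload name followed by a hyphen.
pods_for_workload() {
  grep -F -- "pod/$1-" || true   # `|| true`: finding no pods is not an error
}

# Offline demonstration with made-up pod names.
printf '%s\n' "pod/demo-run-pathways-head-0" "pod/other-run-worker-0" \
  | pods_for_workload "demo-run"

# Against a live cluster (requires kubectl credentials):
#   kubectl get pods -o name | pods_for_workload "${RUN_NAME?}"
```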
4. Running a headless (interactive) workload
A headless workload reserves TPUs on the cluster and sets up a controller, but the Python script itself runs on a separate machine, such as a local laptop or a Compute Engine VM. This is useful for rapid development and debugging. “Headless” refers to launching the Pathways backend services, such as the resource manager and IFRT proxy, without a predefined user-workload container.
Step 1: Start the headless service
This command reserves the TPUs and starts the Pathways head service on the cluster. It will wait until the resources are ready.
xpk workload create-pathways \
--headless \
--workload=${RUN_NAME?} \
--num-slices=${NUM_SLICES?} \
--tpu-type=${TPU_TYPE?} \
--project=${PROJECT_ID?} \
--zone=${ZONE?} \
--cluster=${GKE_CLUSTER?}
Step 2: Connect to the cluster via port forwarding
On the machine where you will run your Python script, open a new terminal and create a secure tunnel to the cluster’s Pathways controller.
This command forwards local port 29000 to the controller pod in the cluster. It runs in the background.
kubectl port-forward \
"$(kubectl get pods -o name | grep ${RUN_NAME?}-pathways-head)" \
29000:29000 &> /dev/null &
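Because the port-forward command runs in the background with its output discarded, a failure is easy to miss. One way to check that the tunnel is listening before starting your script is to probe the local port. The sketch below assumes bash (it relies on bash's `/dev/tcp` redirection); `wait_for_port` is a hypothetical helper. A successful probe only shows that something accepts connections on the port, not that the Pathways services behind it are healthy.

```shell
# Hypothetical helper: wait until a TCP port accepts connections.
# Usage: wait_for_port HOST PORT TIMEOUT_SECONDS
# Returns 0 as soon as a connection succeeds, 1 if the timeout expires.
wait_for_port() {
  host="$1"
  port="$2"
  deadline=$(( $(date +%s) + $3 ))
  while [ "$(date +%s)" -le "$deadline" ]; do
    # The subshell opens a connection and closes it again on exit.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example: give the tunnel up to 30 seconds to come up.
#   wait_for_port 127.0.0.1 29000 30 \
#     || echo "nothing is listening on port 29000" >&2
```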
Step 3: Run your MaxText script locally
With the port forward active, you can now run your MaxText script. The JAX environment variables direct it to connect to the TPUs through the tunnel.
# Set these environment variables to tell JAX how to connect to the TPUs
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000
# Run the training script
python3 -m maxtext.trainers.pre_train.train \
base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
per_device_batch_size=1 \
enable_checkpointing=false \
dataset_type=synthetic \
enable_single_controller=True \
run_name=${RUN_NAME?}-pathways-headless
The output streams directly to your terminal, just as if you were running on a local accelerator.
Troubleshooting
Permission denied errors for Cloud Storage bucket: Check that the service account used by your GKE nodes has the “Storage Object Admin” role on your GCS bucket.
Image not found or ImagePullBackOff:
  Verify your DOCKER_IMAGE variable is correct.
  Ensure you have successfully pushed the image to your project’s Artifact Registry.
  Check that your GKE cluster has permissions to pull from the registry.
kubectl port-forward fails:
  Confirm that the pod from Step 1 is running (kubectl get pods). The name should match ${RUN_NAME?}-pathways-head-0.
  Ensure you are authenticated with kubectl and have the correct context set for your GKE cluster.
Make sure you import the pathwaysutils package and call pathwaysutils.initialize() in your script when running the workload.
More information
For more advanced configurations and a deeper dive into the Pathways architecture, see the official Pathways on Cloud documentation.