Via Pathways#
This guide walks through running MaxText workloads on a Google Kubernetes Engine (GKE) cluster using Pathways. Pathways orchestrates large-scale JAX jobs on AI Hypercomputer infrastructure.
This document assumes you have already created a Pathways GKE cluster using xpk. If you haven’t, follow the instructions at the Google Cloud Pathways & XPK documentation.
We will cover two primary modes of operation:
Batch workload: Ideal for long-running, non-interactive training jobs.
Headless workload: Ideal for interactive development, debugging, and running code from a local machine or CPU VM.
1. Prerequisites#
Before you can run a MaxText workload, you must complete the following setup steps.
Install XPK and its dependencies. Ensure that the xpk command-line tool is installed.
Create a GKE cluster configured for Pathways.
Build and upload a MaxText Docker image to your project’s Artifact Registry.
Step 1: Build the Docker image for a TPU device. This image contains MaxText and its dependencies.
bash dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable
Step 2: Configure Docker to authenticate with Google Cloud
gcloud auth configure-docker
Step 3: Upload the image to your project’s registry. Replace ${USER}_runner with your desired image name.
bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
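One detail worth checking when composing an image name from your user name: bash reads `$USER_runner` as a variable called `USER_runner`, not as `$USER` followed by `_runner`. The sketch below illustrates the difference; the value `alice` is only a placeholder.

```shell
#!/usr/bin/env bash
# Placeholder user name, for illustration only.
USER="alice"
unset USER_runner

# Without braces, bash looks up a variable named USER_runner (unset here).
unbraced="$USER_runner"
# With braces, bash expands USER and then appends the literal suffix.
braced="${USER}_runner"

echo "unbraced: '${unbraced}'"
echo "braced:   '${braced}'"
```

Writing the name with braces keeps it consistent with the DOCKER_IMAGE value used when submitting the workload.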
2. Environment configuration#
The following commands use placeholder variables. Before running them, set these environment variables in your shell.
# -- Google Cloud Configuration --
export PROJECT="your-gcp-project-id"
export ZONE="your-gcp-zone"
export CLUSTER="your-gke-cluster-name"
# -- Workload Configuration --
export WORKLOAD_NAME="maxtext-job-$(date +%Y%m%d-%H%M%S)"
export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job
# -- MaxText & Storage Configuration --
export BUCKET_NAME="your-gcs-bucket-name"
export RUN_NAME="maxtext-run-1"
# The Docker image you pushed in the prerequisite step
export DOCKER_IMAGE="gcr.io/${PROJECT}/${USER}_runner"
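Because later commands fail in confusing ways when a variable is empty or still holds a placeholder, it can help to check them first. The following is a minimal sketch, not part of MaxText or xpk; the function name `require_vars` is hypothetical, and treating values that start with `your-` as unreplaced placeholders is an assumption based on the sample values above.

```shell
#!/usr/bin/env bash
# Report every named variable that is unset, empty, or still a placeholder.
# Returns non-zero if any variable fails the check.
require_vars() {
  local missing=0 name value
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    value="${!name:-}"
    if [ -z "$value" ] || [[ "$value" == your-* ]]; then
      echo "not configured: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the variable names exported above:
# require_vars PROJECT ZONE CLUSTER WORKLOAD_NAME TPU_TYPE \
#   WORKLOAD_NODEPOOL_COUNT BUCKET_NAME RUN_NAME DOCKER_IMAGE || exit 1
```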
3. Running a batch workload#
A batch workload runs entirely within the GKE cluster. You submit the job definition, and Pathways manages its execution.
Submit the batch workload#
Use the xpk workload create-pathways command to start the job.
xpk workload create-pathways \
--workload=$WORKLOAD_NAME \
--cluster=$CLUSTER \
--num-slices=$WORKLOAD_NODEPOOL_COUNT \
--tpu-type=$TPU_TYPE \
--project=$PROJECT \
--zone=$ZONE \
--docker-image=${DOCKER_IMAGE} \
--command="python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
base_output_directory=gs://${BUCKET_NAME} \
per_device_batch_size=1 \
enable_checkpointing=false \
dataset_type=synthetic \
enable_single_controller=True \
run_name=${RUN_NAME}-pathways-batch"
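The quoted multi-line `--command` string is easy to break when overrides are added or removed. One possible approach, sketched here, assembles the string from a bash array so it can be printed and inspected before it is handed to xpk. The function name `build_train_command` is hypothetical; the module path and overrides are the ones used above.

```shell
#!/usr/bin/env bash
# Print the MaxText training command as a single line.
build_train_command() {
  local IFS=' '
  local overrides=(
    "base_output_directory=gs://${BUCKET_NAME:?BUCKET_NAME is not set}"
    "per_device_batch_size=1"
    "enable_checkpointing=false"
    "dataset_type=synthetic"
    "enable_single_controller=True"
    "run_name=${RUN_NAME:?RUN_NAME is not set}-pathways-batch"
  )
  # "${overrides[*]}" joins the array elements with the first character of IFS.
  echo "python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml ${overrides[*]}"
}

# Usage:
# TRAIN_COMMAND="$(build_train_command)"
# echo "${TRAIN_COMMAND}"   # inspect, then pass --command="${TRAIN_COMMAND}" to xpk
```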
Verify the workload#
You can check the status of your running workloads with the xpk workload list command.
xpk workload list --cluster=$CLUSTER --project=$PROJECT --zone=$ZONE
4. Running a headless (interactive) workload#
A headless workload reserves TPUs on the cluster and sets up a controller, but the Python script itself runs on a separate machine, such as a local laptop or a Compute Engine VM. This is useful for rapid development and debugging. Headless mode refers to launching the Pathways backend services, such as the resource manager and IFRT proxy, without a predefined user-workload container.
Step 1: Start the headless service#
This command reserves the TPUs and starts the Pathways head service on the cluster. It will wait until the resources are ready.
xpk workload create-pathways \
--headless \
--workload=${WORKLOAD_NAME} \
--num-slices=${WORKLOAD_NODEPOOL_COUNT} \
--tpu-type=${TPU_TYPE} \
--project=${PROJECT} \
--zone=${ZONE} \
--cluster=${CLUSTER}
Step 2: Connect to the cluster via port forwarding#
On the machine where you will run your Python script, open a new terminal and create a secure tunnel to the cluster’s Pathways controller.
This command forwards local port 29000 to the controller pod in the cluster. It runs in the background.
kubectl port-forward \
"$(kubectl get pods -o name | grep ${WORKLOAD_NAME}-pathways-head)" \
29000:29000 &> /dev/null &
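Because the port forward runs in the background with its output discarded, a failure is silent. A minimal sketch for waiting until the local port accepts connections follows. The helper names `retry` and `port_open` are hypothetical, and `port_open` assumes a bash build with `/dev/tcp` network redirections enabled. It confirms only that something is listening locally, not that the Pathways controller is healthy.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds or ATTEMPTS tries have failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Usage, after starting kubectl port-forward:
# retry 30 1 port_open 127.0.0.1 29000 || echo "tunnel is not listening" >&2
```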
Step 3: Run your MaxText script locally#
With the port forward active, you can now run your MaxText script. The JAX environment variables direct it to connect to the TPUs through the tunnel.
# Set these environment variables to tell JAX how to connect to the TPUs
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000
# Run the training script
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
base_output_directory=gs://${BUCKET_NAME} \
per_device_batch_size=1 \
enable_checkpointing=false \
dataset_type=synthetic \
enable_single_controller=True \
run_name=${RUN_NAME}-pathways-headless
The output streams directly to your terminal, just as if you were running on a local accelerator.
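If local port 29000 is already in use, kubectl port-forward accepts a different local port in its LOCAL:REMOTE argument, and JAX_BACKEND_TARGET must then point at that local port. The sketch below keeps the two settings together; the function name `use_pathways_proxy` is hypothetical, and port 29001 in the usage note is only an example.

```shell
#!/usr/bin/env bash
# Export the JAX settings for a Pathways proxy reachable on a local port.
# The port defaults to 29000, the value used in this guide.
use_pathways_proxy() {
  local port="${1:-29000}"
  export JAX_PLATFORMS=proxy
  export JAX_BACKEND_TARGET="grpc://127.0.0.1:${port}"
}

# Usage with a non-default local port:
# kubectl port-forward <head-pod> 29001:29000 &
# use_pathways_proxy 29001
```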
Troubleshooting#
Permission denied errors for Cloud Storage bucket: Check that the service account used by your GKE nodes has “Storage Object Admin” permissions on your GCS bucket.
Image not found or ImagePullBackOff:
  Verify your DOCKER_IMAGE variable is correct.
  Ensure you have successfully pushed the image to your project’s Artifact Registry.
  Check that your GKE cluster has permissions to pull from the registry.
kubectl port-forward fails:
  Confirm that the pod from Step 1 is running (kubectl get pods). The name should match ${WORKLOAD_NAME}-pathways-head-0.
  Ensure you are authenticated with kubectl and have the correct context set for your GKE cluster.
Make sure you import the pathwaysutils package and call pathwaysutils.initialize() in your script when running the workload.
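For your own scripts, a rough text check can catch a missing pathwaysutils initialization before a run is launched. This is only a heuristic sketch: the function name `has_pathways_init` is hypothetical, and it matches the literal forms `import pathwaysutils` and `pathwaysutils.initialize()`, so aliased imports would not be detected.

```shell
#!/usr/bin/env bash
# Succeed if the given Python file imports pathwaysutils and calls
# pathwaysutils.initialize(). This is a plain text match, not a Python parse.
has_pathways_init() {
  local script="$1"
  grep -Eq '^[[:space:]]*import[[:space:]]+pathwaysutils' "$script" &&
    grep -Fq 'pathwaysutils.initialize()' "$script"
}

# Usage (my_script.py is a placeholder file name):
# has_pathways_init my_script.py || echo "pathwaysutils.initialize() not found" >&2
```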
More information#
For more advanced configurations and a deeper dive into the Pathways architecture, see the official Pathways on Cloud documentation.