Via Pathways

This guide walks through running MaxText workloads on a Google Kubernetes Engine (GKE) cluster using Pathways, which orchestrates large-scale JAX jobs on AI Hypercomputer infrastructure.

This document assumes you have already created a Pathways GKE cluster using xpk. If you haven’t, follow the instructions in the Google Cloud Pathways & XPK documentation.

We will cover two primary modes of operation:

  • Batch workload: Ideal for long-running, non-interactive training jobs.

  • Headless workload: Ideal for interactive development, debugging, and running code from a local machine or CPU VM.

1. Prerequisites

Before you can run a MaxText workload, you must complete the following setup steps.

  1. Install XPK and its dependencies, and confirm that the xpk command-line tool is available in your shell.

  2. Create a GKE cluster configured for Pathways.

  3. Build and upload a MaxText Docker image to your project’s Artifact Registry. For instructions, refer to the official documentation.

2. Environment configuration

The following commands use placeholder variables. Before running them, set these environment variables in your shell.

# -- Google Cloud Configuration --
# Your GCP project ID. Find it on the [Cloud Console Dashboard](https://console.cloud.google.com/home/dashboard).
export PROJECT_ID=<GCP project ID>

# The GCP location (listed as "Location" in the UI) and name of your
# TPU-enabled GKE cluster. Both can be found on the
# [Cloud Console](https://console.cloud.google.com/kubernetes/list).
export ZONE=<GCP location> # e.g., 'us-central1'
export GKE_CLUSTER=<cluster name>

# -- Workload Configuration --
# An arbitrary string to identify this specific run.
# Note: Kubernetes requires workload names to be valid DNS labels (lowercase, no underscores or periods).
export RUN_NAME="maxtext-run-$(date +%Y%m%d-%H%M%S)"

# For a full list of MaxText-supported TPU types, see: `src/maxtext/utils/accelerator_to_spec_map.py`. To see the TPU type
# of your cluster:

# 1. Connect to the cluster (required for kubectl commands later):
# gcloud container clusters get-credentials ${GKE_CLUSTER?} --location ${ZONE?} --project ${PROJECT_ID?}

# 2. Find your TPU type (e.g., 'v5p-128') by checking the accelerator labels on your nodes:
# kubectl get nodes -l cloud.google.com/gke-tpu-accelerator -o jsonpath='{.items[*].metadata.labels.cloud\.google\.com/gke-tpu-accelerator}' | tr ' ' '\n' | sort -u
export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
export NUM_SLICES=1 # Number of TPU slices for your job

# -- MaxText & Storage Configuration --
# Use a GCS bucket you own to store logs and checkpoints. Ideally in the same
# region as your TPUs to minimize latency and costs.
# You can list your buckets and their locations in the
# [Cloud Console](https://console.cloud.google.com/storage/browser).
export BASE_OUTPUT_DIRECTORY=<gcs bucket path> # e.g., gs://my-bucket/maxtext-runs

# The Docker image you pushed in the prerequisite step
export CLOUD_IMAGE_NAME=<image name>
export DOCKER_IMAGE="gcr.io/${PROJECT_ID?}/${CLOUD_IMAGE_NAME?}"
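Before submitting anything, it can help to confirm that the variables above are set and that RUN_NAME is a valid DNS label. The following is a minimal sketch, not part of xpk or MaxText: `require_vars` and `is_dns_label` are hypothetical helper names, and the label check follows the general RFC 1123 rule (lowercase alphanumerics and hyphens, at most 63 characters). xpk may enforce a stricter limit on workload names.

```shell
# Hypothetical helper (not part of xpk): fail if any named variable is unset or empty.
require_vars() {
  local name value missing
  missing=0
  for name in "$@"; do
    # Indirect lookup by variable name, without relying on bash-only ${!name}.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: succeed only if $1 is a valid RFC 1123 DNS label.
is_dns_label() {
  # At most 63 characters, lowercase alphanumerics or '-', alphanumeric at both ends.
  [ "${#1}" -le 63 ] && printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}
```

For example, `require_vars PROJECT_ID ZONE GKE_CLUSTER RUN_NAME TPU_TYPE NUM_SLICES BASE_OUTPUT_DIRECTORY DOCKER_IMAGE && is_dns_label "$RUN_NAME"` returns a non-zero status if a variable is missing or the run name is malformed.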

3. Running a batch workload

A batch workload runs entirely within the GKE cluster. You submit the job definition, and Pathways manages its execution.

Submit the batch workload

Use the xpk workload create-pathways command to start the job.

xpk workload create-pathways \
  --workload=${RUN_NAME?} \
  --cluster=${GKE_CLUSTER?} \
  --num-slices=${NUM_SLICES?} \
  --tpu-type=${TPU_TYPE?} \
  --project=${PROJECT_ID?} \
  --zone=${ZONE?} \
  --docker-image=${DOCKER_IMAGE?} \
  --command="python3 -m maxtext.trainers.pre_train.train \
    base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
    per_device_batch_size=1 \
    enable_checkpointing=false \
    dataset_type=synthetic \
    enable_single_controller=True \
    run_name=${RUN_NAME?}-pathways-batch"

Verify the workload

You can check the status of your running workloads with the xpk workload list command.

xpk workload list --cluster=${GKE_CLUSTER?} --project=${PROJECT_ID?} --zone=${ZONE?}
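If you would rather block until the batch job finishes than poll the list by hand, the list command can be wrapped as in the sketch below. This assumes your xpk version supports the `--wait-for-job-completion` and `--timeout` options of `xpk workload list` (check `xpk workload list --help` to confirm). `wait_for_workload` is a hypothetical helper name, and the 300-second default is an arbitrary choice.

```shell
# Hypothetical wrapper: block until the workload named by RUN_NAME completes
# or the timeout (in seconds, first argument) expires.
# Assumes your xpk version supports --wait-for-job-completion and --timeout.
wait_for_workload() {
  local timeout_s
  timeout_s="${1:-300}"
  xpk workload list \
    --cluster="${GKE_CLUSTER?}" \
    --project="${PROJECT_ID?}" \
    --zone="${ZONE?}" \
    --wait-for-job-completion="${RUN_NAME?}" \
    --timeout="${timeout_s}"
}
```

Calling `wait_for_workload 3600` would wait up to an hour. Long training runs will likely need a larger value.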

4. Running a headless (interactive) workload

A headless workload reserves TPUs on the cluster and sets up a controller, but the Python script itself runs on a separate machine, such as a local laptop or a Compute Engine VM. This is useful for rapid development and debugging. Headless mode launches the Pathways backend services, such as the resource manager and IFRT proxy, without a predefined user-workload container.

Step 1: Start the headless service

This command reserves the TPUs and starts the Pathways head service on the cluster. It will wait until the resources are ready.

xpk workload create-pathways \
  --headless \
  --workload=${RUN_NAME?} \
  --num-slices=${NUM_SLICES?} \
  --tpu-type=${TPU_TYPE?} \
  --project=${PROJECT_ID?} \
  --zone=${ZONE?} \
  --cluster=${GKE_CLUSTER?}

Step 2: Connect to the cluster via port forwarding

On the machine where you will run your Python script, open a new terminal and create a secure tunnel to the cluster’s Pathways controller.

This command forwards local port 29000 to the controller pod in the cluster. It runs in the background with its output discarded, so a failed tunnel produces no visible error.

kubectl port-forward \
  "$(kubectl get pods -o name | grep "${RUN_NAME?}-pathways-head")" \
  29000:29000 &> /dev/null &
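Because the port-forward command runs in the background, it can be useful to wait until the local port accepts connections before starting your script. The sketch below is one way to do that. `wait_for_port` is a hypothetical helper name, and it relies on bash's `/dev/tcp` pseudo-device, so it needs bash rather than plain sh. A successful connection only shows that the local end of the tunnel is listening, not that the Pathways services behind it are healthy.

```shell
# Hypothetical helper: return 0 once host:port accepts a TCP connection,
# or 1 after the given number of one-second attempts (default 30).
# Uses bash's /dev/tcp pseudo-device; it is not available in plain sh.
wait_for_port() {
  local host port tries
  host="$1"
  port="$2"
  tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    # Open and immediately close a connection inside a subshell.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_port 127.0.0.1 29000 || echo "tunnel is not up" >&2` reports a problem if the tunnel never comes up.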

Step 3: Run your MaxText script locally

With the port forward active, you can now run your MaxText script. The JAX environment variables direct it to connect to the TPUs through the tunnel.

# Set these environment variables to tell JAX how to connect to the TPUs
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000

# Run the training script
python3 -m maxtext.trainers.pre_train.train \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  per_device_batch_size=1 \
  enable_checkpointing=false \
  dataset_type=synthetic \
  enable_single_controller=True \
  run_name=${RUN_NAME?}-pathways-headless

The output streams directly to your terminal, just as if you were running on a local accelerator.
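When you are done, you will probably want to stop the tunnel and release the TPUs. The sketch below assumes the headless workload keeps its TPUs reserved until it is deleted, and that you captured the port-forward PID (for example with `PF_PID=$!` right after starting it). `cleanup_headless` is a hypothetical helper name. It uses `xpk workload delete` with the same variables as the earlier steps.

```shell
# Hypothetical helper: stop the background tunnel (if a PID is given as the
# first argument) and delete the headless workload named by RUN_NAME.
cleanup_headless() {
  local pf_pid
  pf_pid="${1:-}"
  # Stop the port-forward process; ignore the error if it has already exited.
  if [ -n "$pf_pid" ]; then
    kill "$pf_pid" 2>/dev/null || true
  fi
  xpk workload delete \
    --workload="${RUN_NAME?}" \
    --cluster="${GKE_CLUSTER?}" \
    --project="${PROJECT_ID?}" \
    --zone="${ZONE?}"
}
```

Calling `cleanup_headless "$PF_PID"` stops the tunnel and then deletes the workload. Calling it with no argument only deletes the workload.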

Troubleshooting

  • Permission denied errors for a Cloud Storage bucket: Check that the service account used by your GKE nodes has the “Storage Object Admin” role on your GCS bucket.

  • Image not found or ImagePullBackOff:

    • Verify your DOCKER_IMAGE variable is correct.

    • Ensure you have successfully pushed the image to your project’s Artifact Registry.

    • Check that your GKE cluster has permissions to pull from the registry.

  • kubectl port-forward fails:

    • Confirm that the pod from Step 1 is running (kubectl get pods). The name should match ${RUN_NAME?}-pathways-head-0.

    • Ensure you are authenticated with kubectl and have the correct context set for your GKE cluster.

  • Make sure your script imports the pathwaysutils package and calls pathwaysutils.initialize() when running the workload.
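For the port-forward checks above, a small helper can look up the head pod and print its phase in one step. This is a sketch under the same naming assumption as the earlier steps (the pod name contains `<run name>-pathways-head`). `head_pod_phase` is a hypothetical helper name.

```shell
# Hypothetical helper: print the phase (for example, Pending or Running) of the
# first pod whose name contains "<run name>-pathways-head".
head_pod_phase() {
  local pod
  pod="$(kubectl get pods -o name | grep -- "$1-pathways-head" | head -n 1)"
  if [ -z "$pod" ]; then
    echo "No pod matching $1-pathways-head in the current context" >&2
    return 1
  fi
  # "$pod" already has the form pod/<name>, which kubectl get accepts directly.
  kubectl get "$pod" -o jsonpath='{.status.phase}'
}
```

Running `head_pod_phase "${RUN_NAME?}"` and seeing anything other than Running suggests the tunnel cannot work yet. An error about a missing pod usually points to the wrong kubectl context or run name.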

More information

For more advanced configurations and a deeper dive into the Pathways architecture, see the official Pathways on Cloud documentation.