Via Pathways

This guide walks through running MaxText workloads on a Google Kubernetes Engine (GKE) cluster using Pathways, which orchestrates large-scale JAX jobs on AI Hypercomputer infrastructure.

This document assumes you have already created a Pathways GKE cluster using xpk. If you haven’t, follow the instructions at the Google Cloud Pathways & XPK documentation.

We will cover two primary modes of operation:

  • Batch workload: Ideal for long-running, non-interactive training jobs.

  • Headless workload: Ideal for interactive development, debugging, and running code from a local machine or CPU VM.

1. Prerequisites

Before you can run a MaxText workload, you must complete the following setup steps.

  1. Install XPK and its dependencies. Ensure that the xpk command-line tool is installed.

  2. Create a GKE cluster configured for Pathways.

  3. Build and upload a MaxText Docker image to your project’s Artifact Registry.

    Step 1: Build the Docker image for a TPU device. This image contains MaxText and its dependencies.

    bash dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable
    

    Step 2: Configure Docker to authenticate with Google Cloud.

    gcloud auth configure-docker
    

    Step 3: Upload the image to your project’s registry. Replace ${USER}_runner with your desired image name.

    bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
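
A note on the image name: underscores are valid characters in shell variable names, so an unbraced `$USER_runner` is looked up as a variable called `USER_runner` rather than `$USER` followed by `_runner`. The sketch below illustrates the difference. The values `alice` and `unrelated-value` are made up, and `show_expansion` is a hypothetical helper that runs in a subshell so it does not touch your real `USER`.

```shell
# Demonstration only: "alice" and "unrelated-value" are made-up values.
# The body runs in a subshell, so the caller's USER is left unchanged.
show_expansion() (
  USER="alice"
  USER_runner="unrelated-value"  # a different variable that shares the prefix

  # No braces: the shell reads the longest valid name, USER_runner.
  echo "unbraced: $USER_runner"

  # Braces end the variable name at USER, then append the literal suffix.
  echo "braced: ${USER}_runner"
)

show_expansion
```

Writing `${USER}_runner` with braces yields the per-user image name that the DOCKER_IMAGE variable later in this guide expects.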
    

2. Environment configuration

The following commands use placeholder variables. Before running them, set these environment variables in your shell.

# -- Google Cloud Configuration --
export PROJECT="your-gcp-project-id"
export ZONE="your-gcp-zone"
export CLUSTER="your-gke-cluster-name"

# -- Workload Configuration --
export WORKLOAD_NAME="maxtext-job-$(date +%Y%m%d-%H%M%S)"
export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job

# -- MaxText & Storage Configuration --
export BUCKET_NAME="your-gcs-bucket-name"
export RUN_NAME="maxtext-run-1"
# The Docker image you pushed in the prerequisite step
export DOCKER_IMAGE="gcr.io/${PROJECT}/${USER}_runner"
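
An unset variable expands to an empty string in the commands that follow, which can produce confusing errors. The following is a minimal sketch of a pre-flight check. It is not part of xpk or MaxText, and `check_required_vars` is a hypothetical helper name.

```shell
# Report every listed variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_required_vars() {
  _crv_missing=0
  for _crv_name in "$@"; do
    # Indirect lookup of the variable named by $_crv_name (POSIX-compatible).
    eval "_crv_value=\${$_crv_name-}"
    if [ -z "$_crv_value" ]; then
      echo "Missing or empty: $_crv_name" >&2
      _crv_missing=1
    fi
  done
  return "$_crv_missing"
}

# Example call with the variables defined in this section:
# check_required_vars PROJECT ZONE CLUSTER WORKLOAD_NAME TPU_TYPE \
#   WORKLOAD_NODEPOOL_COUNT BUCKET_NAME RUN_NAME DOCKER_IMAGE
```

The function only tests that each variable is non-empty. It does not verify that the project, zone, or cluster actually exists.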

3. Running a batch workload

A batch workload runs entirely within the GKE cluster. You submit the job definition, and Pathways manages its execution.

Submit the batch workload

Use the xpk workload create-pathways command to start the job.

xpk workload create-pathways \
  --workload=$WORKLOAD_NAME \
  --cluster=$CLUSTER \
  --num-slices=$WORKLOAD_NODEPOOL_COUNT \
  --tpu-type=$TPU_TYPE \
  --project=$PROJECT \
  --zone=$ZONE \
  --docker-image=${DOCKER_IMAGE} \
  --command="python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
    base_output_directory=gs://${BUCKET_NAME} \
    per_device_batch_size=1 \
    enable_checkpointing=false \
    dataset_type=synthetic \
    enable_single_controller=True \
    run_name=${RUN_NAME}-pathways-batch"

Verify the workload

You can check the status of your running workloads with the xpk workload list command.

xpk workload list --cluster=$CLUSTER --project=$PROJECT --zone=$ZONE

4. Running a headless (interactive) workload

A headless workload reserves TPUs on the cluster and sets up a controller, but the Python script itself runs on a separate machine, such as a local laptop or a Compute Engine VM. This is useful for rapid development and debugging. Headless mode launches the Pathways backend services, such as the resource manager and IFRT proxy, without a predefined user-workload container.

Step 1: Start the headless service

This command reserves the TPUs and starts the Pathways head service on the cluster. It will wait until the resources are ready.

xpk workload create-pathways \
  --headless \
  --workload=${WORKLOAD_NAME} \
  --num-slices=${WORKLOAD_NODEPOOL_COUNT} \
  --tpu-type=${TPU_TYPE} \
  --project=${PROJECT} \
  --zone=${ZONE} \
  --cluster=${CLUSTER}

Step 2: Connect to the cluster via port forwarding

On the machine where you will run your Python script, open a new terminal and create a secure tunnel to the cluster’s Pathways controller.

This command forwards local port 29000 to the controller pod in the cluster. It runs in the background.

kubectl port-forward \
  "$(kubectl get pods -o name | grep ${WORKLOAD_NAME}-pathways-head)" \
  29000:29000 &> /dev/null &
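
If the grep matches no pod, or more than one, the port forward fails without a visible error because its output is discarded. The sketch below shows one way to make that case explicit before forwarding. `select_head_pod` is a hypothetical helper, not a kubectl or xpk feature, and it assumes pod names arrive one per line as printed by `kubectl get pods -o name`.

```shell
# Read pod names from stdin (one per line) and print the single pod whose
# name contains "<workload>-pathways-head". Fail on zero or several matches.
select_head_pod() {
  # "|| true" keeps an empty grep result from aborting under `set -e`.
  _shp_matches="$(grep -F -- "$1-pathways-head" || true)"
  if [ -z "$_shp_matches" ]; then
    echo "No head pod found for workload '$1'" >&2
    return 1
  fi
  _shp_count="$(printf '%s\n' "$_shp_matches" | wc -l | tr -d ' ')"
  if [ "$_shp_count" -ne 1 ]; then
    echo "Expected one head pod, found $_shp_count" >&2
    return 1
  fi
  printf '%s\n' "$_shp_matches"
}

# Example use with the port-forward command from this step:
# pod="$(kubectl get pods -o name | select_head_pod "${WORKLOAD_NAME}")" &&
#   kubectl port-forward "$pod" 29000:29000 > /dev/null 2>&1 &
```

The match is a plain substring test, the same as the grep in the command above, so a workload name that is a suffix of another workload's name can still produce several matches.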

Step 3: Run your MaxText script locally

With the port forward active, you can now run your MaxText script. The JAX environment variables direct it to connect to the TPUs through the tunnel.

# Set these environment variables to tell JAX how to connect to the TPUs
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000

# Run the training script
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=gs://${BUCKET_NAME} \
  per_device_batch_size=1 \
  enable_checkpointing=false \
  dataset_type=synthetic \
  enable_single_controller=True \
  run_name=${RUN_NAME}-pathways-headless

The output streams directly to your terminal, just as if you were running on a local accelerator.
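
Before starting a long run, it can be worth confirming that the two JAX variables agree with the forwarded port. This is a minimal sketch under the assumptions of this guide (platform `proxy`, target of the form `grpc://<host>:<port>`). `check_proxy_env` is a hypothetical helper name, and it checks only the variable values, not whether the tunnel is actually reachable.

```shell
# Verify JAX_PLATFORMS and JAX_BACKEND_TARGET before launching.
# $1 is the local port used for port forwarding (29000 in this guide).
check_proxy_env() {
  if [ "${JAX_PLATFORMS-}" != "proxy" ]; then
    echo "JAX_PLATFORMS should be 'proxy', got '${JAX_PLATFORMS-}'" >&2
    return 1
  fi
  case "${JAX_BACKEND_TARGET-}" in
    grpc://*:"$1")
      return 0
      ;;
    *)
      echo "JAX_BACKEND_TARGET should look like grpc://<host>:$1" >&2
      return 1
      ;;
  esac
}

# Example: check_proxy_env 29000 && echo "proxy settings look consistent"
```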

Troubleshooting

  • Permission denied errors for a Cloud Storage bucket: Check that the service account used by your GKE nodes has the “Storage Object Admin” role on your GCS bucket.

  • Image not found or ImagePullBackOff:

    • Verify your DOCKER_IMAGE variable is correct.

    • Ensure you have successfully pushed the image to your project’s Artifact Registry.

    • Check that your GKE cluster has permissions to pull from the registry.

  • kubectl port-forward fails:

    • Confirm that the pod from Step 1 is running (kubectl get pods). The name should match ${WORKLOAD_NAME}-pathways-head-0.

    • Ensure you are authenticated with kubectl and have the correct context set for your GKE cluster.

  • Make sure your script imports the pathwaysutils package and calls pathwaysutils.initialize() when running the workload.

More information

For more advanced configurations and a deeper dive into the Pathways architecture, see the official Pathways on Cloud documentation.