Monitoring and debugging


🕵️ Features & Diagnostics

Diagnostic tools and features for monitoring MaxText.

Features and diagnostics
☁️ GCP Observability

Observability for workloads running on Google Cloud Platform.

Enable GCP workload observability
🚫 Hang Playbook

Troubleshooting guide for training hangs at megascale.

Troubleshooting: Megascale hangs
📈 Goodput

Measuring Goodput, the share of total job time spent making useful training progress.

ML Goodput measurement
📊 Logs & Metrics

Understanding MaxText logs and performance metrics.

Understand logs and metrics
📉 TensorBoard

Using Vertex AI TensorBoard for visualization.

Use Vertex AI TensorBoard
⏱️ XProf

Profiling performance with XProf.

Profiling with XProf