Data pipelines

Currently MaxText has three data input pipelines:

| Pipeline | Dataset formats | Features | Limitations |
| --- | --- | --- | --- |
| Grain (recommended) | ArrayRecord (random access, available through TensorFlow Datasets or by conversion); Parquet (sequential access) | With ArrayRecord: fully deterministic, resilient to preemption; global shuffle. With Parquet: performant; fully deterministic, resilient to preemption; hierarchical shuffle | |
| Hugging Face | Datasets in the Hugging Face Hub; local/Cloud Storage datasets in json, parquet, arrow, csv, txt (sequential access) | No download needed, convenient; multiple formats | Limited scalability when using the Hugging Face Hub (no limit when using Cloud Storage); non-deterministic with preemption (deterministic without preemption) |
| TFDS | TFRecord (sequential access), available through TensorFlow Datasets | Performant | Only supports TFRecords; non-deterministic with preemption (deterministic without preemption) |

Multihost dataloading best practice

Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:

  1. Concurrent access: Multiple hosts need to read from the same dataset simultaneously without causing conflicts.

  2. Data uniqueness: Each host must be fed a unique, non-overlapping subset of the data so that no example is processed by more than one host.

  3. Uneven completion: Some hosts may run out of data before others, which can cause training to hang.

The approaches to solve these challenges depend on whether your dataset supports random access or is limited to sequential access.

Sequential access dataset

  • Concurrent access and uniqueness: Sequential-access datasets (e.g., Parquet, JSON, TFRecord) cannot be accessed by index, so they require a different strategy: file-based sharding, in which each host is given exclusive access to a specific subset of the data files. Key requirement: (Number of data files) % (Number of data-loading hosts) == 0. If the file count is not a multiple of the host count, the files are distributed unevenly. For example, with 10 files and 8 hosts, two hosts get two files each while the other six get one, which significantly worsens the “uneven completion” problem. If you have fewer files than hosts, performance degrades severely because all hosts concurrently access all the files.

  • Uneven completion: As with random-access datasets, you can use the generate_padding_batch_train and generate_padding_batch_eval flags to handle hosts that finish their file shards early.
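The two points above can be illustrated with a small, self-contained Python sketch. It is not MaxText's implementation: the function names (files_for_host, is_evenly_sharded, pad_to_steps), the round-robin assignment, the all-zero padding batch, and the choice to raise an error when there are fewer files than hosts are all assumptions made for illustration. The sketch only shows the general idea of file-based sharding and of emitting padding batches once a host's shard is exhausted.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def files_for_host(files: Sequence[str], host_index: int, host_count: int) -> List[str]:
    """Return the files owned exclusively by one host (round-robin sharding).

    Hypothetical helper: this sketch rejects the fewer-files-than-hosts case
    instead of letting every host read every file.
    """
    if len(files) < host_count:
        raise ValueError("fewer files than hosts; exclusive sharding is not possible")
    return list(files[host_index::host_count])


def is_evenly_sharded(file_count: int, host_count: int) -> bool:
    """Check the key requirement: file count is a multiple of host count."""
    return file_count % host_count == 0


def pad_to_steps(
    real_batches: Iterable[List[int]], total_steps: int, batch_size: int
) -> Iterator[Tuple[List[int], bool]]:
    """Yield (batch, is_padding) pairs for exactly total_steps steps.

    Real batches come first; once they run out, all-zero padding batches
    (an assumed placeholder) keep this host in step with the others.
    """
    step = 0
    for batch in real_batches:
        if step >= total_steps:
            return
        yield batch, False
        step += 1
    while step < total_steps:
        yield [0] * batch_size, True
        step += 1


# The 10-files / 8-hosts example from the text: the split is uneven.
files = [f"shard-{i:02d}.parquet" for i in range(10)]
files_per_host = [len(files_for_host(files, h, 8)) for h in range(8)]
print(files_per_host)            # [2, 2, 1, 1, 1, 1, 1, 1]
print(is_evenly_sharded(10, 8))  # False
print(is_evenly_sharded(16, 8))  # True

# A host with only two real batches still produces four steps.
steps = list(pad_to_steps([[1, 2], [3, 4]], total_steps=4, batch_size=2))
print(steps)
```

In this sketch the hosts holding two files would produce roughly twice as many real batches as the others, which is why the uneven split makes the uneven-completion problem worse. A training loop built this way would typically skip or mask the batches flagged as padding when computing the loss.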