TFDS pipeline
Download the AllenAI C4 dataset in TFRecord format to a Cloud Storage bucket. For information about cost, see this discussion.
bash download_dataset.sh {GCS_PROJECT} {GCS_BUCKET_NAME}
In src/maxtext/configs/base.yml or through the command line, set the following parameters:
dataset_type: tfds
dataset_name: 'c4/en:3.0.1'
# set eval_interval > 0 to use the specified eval dataset. Otherwise, only metrics on the train set will be calculated.
eval_interval: 10000
eval_dataset_name: 'c4/en:3.0.1'
eval_split: 'validation'
# The TFDS input pipeline only supports tokenizers in SentencePiece (spm) format
tokenizer_path: 'src/maxtext/assets/tokenizers/tokenizer.llama2'
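
If you set these parameters through the command line instead of editing the YAML file, one option is to pass each one as a key=value override. The following is a minimal sketch, assuming the training entry point accepts overrides in key=value form. The helper name to_overrides and the dictionary tfds_settings are hypothetical and used only for illustration; they are not part of MaxText.

```python
# Hypothetical helper: render the TFDS settings shown above as
# key=value strings that could be appended to a training command.
# Assumption: the entry point accepts overrides in key=value form.

tfds_settings = {
    "dataset_type": "tfds",
    "dataset_name": "c4/en:3.0.1",
    # eval_interval > 0 enables evaluation on the eval dataset
    "eval_interval": 10000,
    "eval_dataset_name": "c4/en:3.0.1",
    "eval_split": "validation",
    # the tokenizer must be in SentencePiece (spm) format
    "tokenizer_path": "src/maxtext/assets/tokenizers/tokenizer.llama2",
}


def to_overrides(settings):
    """Return one 'key=value' string per setting, in insertion order."""
    return [f"{key}={value}" for key, value in settings.items()]


overrides = to_overrides(tfds_settings)
# Print the overrides as a single space-separated argument string.
print(" ".join(overrides))
```

Keeping the settings in one dictionary makes it easy to check that the train and eval dataset names stay in sync before a run is launched.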