Hugging Face pipeline
The Hugging Face pipeline supports streaming directly from the Hugging Face Hub, or from a Cloud Storage bucket in formats supported by Hugging Face (parquet, json, etc.). It uses the Hugging Face datasets.load_dataset API with streaming=True, configured through the hf_* parameters.
Example config for streaming from Hugging Face Hub (no download needed)
In src/maxtext/configs/base.yml or through the command line, set the following parameters:
dataset_type: hf
hf_path: 'allenai/c4' # for using https://huggingface.co/datasets/allenai/c4
hf_data_dir: 'en'
hf_train_files: ''
# set eval_interval > 0 to use the specified eval dataset. Otherwise, only metrics on the train set will be calculated.
eval_interval: 10000
hf_eval_split: 'validation'
hf_eval_files: ''
# for the Hugging Face pipeline, tokenizer_path can be a path in Hugging Face Hub,
# or a local path containing a tokenizer in a format supported by transformers.AutoTokenizer
tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large
hf_access_token: '' # provide token if using gated dataset or tokenizer
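As a rough illustration of how the hf_* parameters above could reach datasets.load_dataset, here is a minimal sketch. The helper names build_load_dataset_kwargs and open_stream are hypothetical, and the exact mapping used inside the pipeline is an assumption; only the load_dataset keyword arguments (path, data_dir, data_files, split, streaming, token) come from the datasets library.

```python
from typing import Any, Dict


def build_load_dataset_kwargs(config: Dict[str, Any], split: str = "train") -> Dict[str, Any]:
    """Translate hf_* settings into keyword arguments for datasets.load_dataset.

    Assumption: this mapping is illustrative and may differ from the real pipeline.
    """
    kwargs: Dict[str, Any] = {
        "path": config["hf_path"],
        "split": split,
        "streaming": True,  # iterate lazily instead of downloading the dataset
    }
    # Empty strings in the config mean "not set", so they are skipped.
    if config.get("hf_data_dir"):
        kwargs["data_dir"] = config["hf_data_dir"]
    if config.get("hf_train_files"):
        kwargs["data_files"] = config["hf_train_files"]
    if config.get("hf_access_token"):
        # Recent datasets releases accept `token`; older ones used `use_auth_token`.
        kwargs["token"] = config["hf_access_token"]
    return kwargs


def open_stream(config: Dict[str, Any], split: str = "train"):
    """Open a streaming dataset (needs the datasets package and network access)."""
    # Imported lazily so the mapping above can be inspected without the library.
    from datasets import load_dataset

    return load_dataset(**build_load_dataset_kwargs(config, split))


hub_config = {
    "hf_path": "allenai/c4",
    "hf_data_dir": "en",
    "hf_train_files": "",
    "hf_access_token": "",
}
hub_kwargs = build_load_dataset_kwargs(hub_config)
print(hub_kwargs)
```

Calling open_stream(hub_config) would return an iterable dataset whose examples are fetched on demand, so nothing is downloaded up front.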
Example config for streaming from downloaded data in a Cloud Storage bucket
In src/maxtext/configs/base.yml or through the command line, set the following parameters:
dataset_type: hf
hf_path: 'parquet' # or json, arrow, etc.
hf_data_dir: ''
hf_train_files: 'gs://<bucket>/<folder>/*-train-*.parquet' # match the train files
# set eval_interval > 0 to use the specified eval dataset. Otherwise, only metrics on the train set will be calculated.
eval_interval: 10000
hf_eval_split: ''
hf_eval_files: 'gs://<bucket>/<folder>/*-validation-*.parquet' # match the val files
# for the Hugging Face pipeline, tokenizer_path can be a path in Hugging Face Hub,
# or a local path containing a tokenizer in a format supported by transformers.AutoTokenizer
tokenizer_path: 'google-t5/t5-large' # for using https://huggingface.co/google-t5/t5-large
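Because hf_train_files and hf_eval_files are glob patterns, it can help to check that they select disjoint sets of files before starting a run. The sketch below uses Python's fnmatch on hypothetical object names; the file names are invented for illustration, and fnmatch only approximates the matching done by the storage backend.

```python
from fnmatch import fnmatch

# Hypothetical object names, as they might appear under gs://<bucket>/<folder>/.
file_names = [
    "c4-train-00000-of-01024.parquet",
    "c4-train-00001-of-01024.parquet",
    "c4-validation-00000-of-00008.parquet",
]

train_pattern = "*-train-*.parquet"       # file-name part of hf_train_files
eval_pattern = "*-validation-*.parquet"   # file-name part of hf_eval_files

train_files = [name for name in file_names if fnmatch(name, train_pattern)]
eval_files = [name for name in file_names if fnmatch(name, eval_pattern)]

# Both sets should be non-empty, and no file should land in both.
overlap = set(train_files) & set(eval_files)
print(len(train_files), len(eval_files), len(overlap))
```

If the overlap is non-empty, or either list is empty, the patterns probably need adjusting so that evaluation data does not leak into training.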
Limitations and recommendations
Streaming data directly from Hugging Face Hub can be affected by server traffic. During peak hours you may encounter “504 Server Error: Gateway Time-out”. For the most stable experience, download the Hugging Face dataset to a Cloud Storage bucket or to disk.
Streaming data directly from Hugging Face Hub works in multi-host settings with a small number of hosts. With more than 16 hosts, you might encounter a “read time out” error.
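One possible way to follow the download recommendation is snapshot_download from the huggingface_hub package. The sketch below is an assumption about workflow rather than a documented part of the pipeline: the helper names build_snapshot_kwargs and download_snapshot and the local directory /tmp/c4 are hypothetical.

```python
from typing import Any, Dict


def build_snapshot_kwargs(repo_id: str, subdir: str, local_dir: str, token: str = "") -> Dict[str, Any]:
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs: Dict[str, Any] = {
        "repo_id": repo_id,
        "repo_type": "dataset",  # the default repo type is "model"
        "local_dir": local_dir,
    }
    if subdir:
        # Restrict the download to one subdirectory of the dataset repository.
        kwargs["allow_patterns"] = [f"{subdir}/*"]
    if token:
        kwargs["token"] = token  # only needed for gated datasets
    return kwargs


def download_snapshot(**kwargs: Any) -> str:
    """Download the files (needs the huggingface_hub package and network access)."""
    # Imported lazily so the arguments can be inspected without the library.
    from huggingface_hub import snapshot_download

    return snapshot_download(**kwargs)


snapshot_kwargs = build_snapshot_kwargs("allenai/c4", "en", "/tmp/c4")
print(snapshot_kwargs)
# To actually download: local_path = download_snapshot(**snapshot_kwargs)
```

After the download, the files can be copied to a bucket with a tool such as gcloud storage cp, and hf_path and hf_train_files can then be set to the file format and the matching gs:// pattern, as in the Cloud Storage example.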