maxtext.input_pipeline.tfds_data_processing module
Input pipeline for an LM1B dataset.
- maxtext.input_pipeline.tfds_data_processing.get_datasets(dataset_name, data_split, shuffle_files, shuffle_seed, dataloading_host_index, dataloading_host_count, dataset_path=None)
Load a TFDS dataset.
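The signature takes `dataloading_host_index` and `dataloading_host_count`, but this page does not say how they are used. One common convention in multi-host input pipelines is for each host to read a disjoint, strided slice of the data. The pure-Python sketch below illustrates only that idea; `shard_for_host` is a hypothetical helper, not a MaxText function, and the actual behavior of `get_datasets` may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_host(items: Sequence[T], host_index: int, host_count: int) -> List[T]:
    """Hypothetical helper: return the strided slice of `items` owned by one host.

    Assumption: host `i` of `n` takes elements i, i + n, i + 2n, ...
    This is an illustration only, not the MaxText implementation.
    """
    if host_count <= 0:
        raise ValueError("host_count must be positive")
    if not 0 <= host_index < host_count:
        raise ValueError("host_index must be in [0, host_count)")
    return list(items[host_index::host_count])


# Ten stand-in records split across four data-loading hosts.
records = list(range(10))
shards = [shard_for_host(records, i, 4) for i in range(4)]
```

Under this convention every record belongs to exactly one host, so the union of all shards reproduces the full dataset without duplication.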
- maxtext.input_pipeline.tfds_data_processing.preprocessing_pipeline(dataset, tokenizer_path, tokenizer_type, global_batch_size, max_target_length, data_column_names, shuffle=False, data_shuffle_seed=0, tokenize=True, add_bos=True, add_eos=True, num_epochs=1, pack_examples=True, shuffle_buffer_size=1024, shift=True, drop_remainder=True, prefetch_size=-1, use_dpo=False, hf_access_token='')
Pipeline for preprocessing a TFDS dataset.
- Parameters:
tokenizer_type (str)
global_batch_size (int)
max_target_length (int)
shuffle (bool)
tokenize (bool)
add_bos (bool)
add_eos (bool)
num_epochs (None | int)
pack_examples (bool)
shuffle_buffer_size (int)
shift (bool)
drop_remainder (bool)
use_dpo (bool)
hf_access_token (str)
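Because `preprocessing_pipeline` takes many keyword arguments, it can help to collect them in one place before the call. The sketch below is one possible way to do that: `PreprocessingArgs` is a hypothetical container, not part of MaxText, and its defaults are copied from the signature documented above. The types of `data_shuffle_seed` and `prefetch_size` are inferred from their default values, and the values passed for the required fields are placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PreprocessingArgs:
    """Hypothetical container for preprocessing_pipeline keyword arguments."""

    # Required by the documented signature (no default values).
    tokenizer_type: str
    global_batch_size: int
    max_target_length: int
    # Defaults copied from the documented signature.
    shuffle: bool = False
    data_shuffle_seed: int = 0  # type inferred from the default
    tokenize: bool = True
    add_bos: bool = True
    add_eos: bool = True
    num_epochs: Optional[int] = 1
    pack_examples: bool = True
    shuffle_buffer_size: int = 1024
    shift: bool = True
    drop_remainder: bool = True
    prefetch_size: int = -1  # type inferred from the default
    use_dpo: bool = False
    hf_access_token: str = ""


# Placeholder values for the required fields; adjust to your setup.
args = PreprocessingArgs(
    tokenizer_type="sentencepiece",
    global_batch_size=8,
    max_target_length=128,
)
kwargs = asdict(args)
```

The resulting `kwargs` dictionary could then be unpacked into the call with `**kwargs`, alongside `dataset`, `tokenizer_path`, and `data_column_names`, which are not covered by this container.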