maxtext.input_pipeline.tfds_data_processing module


Input pipeline for an LM1B dataset.

maxtext.input_pipeline.tfds_data_processing.get_datasets(dataset_name, data_split, shuffle_files, shuffle_seed, dataloading_host_index, dataloading_host_count, dataset_path=None)

Load a TFDS dataset.
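The `dataloading_host_index` and `dataloading_host_count` arguments suggest that each data-loading host reads only its own slice of the dataset. The following is a minimal pure-Python sketch of round-robin sharding under that assumption. The helper `shard_for_host` is hypothetical and is not part of MaxText; the real function may shard at the file or dataset level through TFDS or `tf.data`.

```python
def shard_for_host(examples, dataloading_host_index, dataloading_host_count):
    """Return the examples one host would read under round-robin sharding.

    Hypothetical helper for illustration only; not the MaxText implementation.
    """
    if not 0 <= dataloading_host_index < dataloading_host_count:
        raise ValueError("dataloading_host_index must be in [0, dataloading_host_count)")
    # Host i takes elements i, i + count, i + 2 * count, ...
    return examples[dataloading_host_index::dataloading_host_count]


examples = list(range(10))
shards = [shard_for_host(examples, i, 3) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Under this scheme the shards are disjoint and together cover the whole dataset, so no example is read by two hosts.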

maxtext.input_pipeline.tfds_data_processing.preprocessing_pipeline(dataset, tokenizer_path, tokenizer_type, global_batch_size, max_target_length, data_column_names, shuffle=False, data_shuffle_seed=0, tokenize=True, add_bos=True, add_eos=True, num_epochs=1, pack_examples=True, shuffle_buffer_size=1024, shift=True, drop_remainder=True, prefetch_size=-1, use_dpo=False, hf_access_token='')

Pipeline for preprocessing a TFDS dataset.

Parameters:
  • tokenizer_type (str)

  • global_batch_size (int)

  • max_target_length (int)

  • shuffle (bool)

  • tokenize (bool)

  • add_bos (bool)

  • add_eos (bool)

  • num_epochs (None | int)

  • pack_examples (bool)

  • shuffle_buffer_size (int)

  • shift (bool)

  • drop_remainder (bool)

  • use_dpo (bool)

  • hf_access_token (str)
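To make the roles of `add_bos`, `add_eos`, `max_target_length`, and `shift` more concrete, here is a small pure-Python sketch of how one tokenized example could be turned into next-token-prediction inputs and targets. It is an illustration under stated assumptions, not the MaxText implementation: the helper `prepare_example`, the token ids `BOS_ID` and `EOS_ID`, and the shift-by-truncation strategy are all hypothetical choices made for this example.

```python
BOS_ID = 1  # hypothetical token ids, chosen only for this illustration
EOS_ID = 2


def prepare_example(token_ids, max_target_length, add_bos=True, add_eos=True, shift=True):
    """Build inputs/targets for one example (illustrative sketch only)."""
    tokens = list(token_ids)
    if add_bos:
        tokens = [BOS_ID] + tokens
    if add_eos:
        tokens = tokens + [EOS_ID]
    if shift:
        # Keep one extra token so inputs and targets both fit max_target_length.
        tokens = tokens[: max_target_length + 1]
        # The input at position t is paired with the token at position t + 1.
        return {"inputs": tokens[:-1], "targets": tokens[1:]}
    tokens = tokens[:max_target_length]
    return {"inputs": list(tokens), "targets": list(tokens)}


example = prepare_example([11, 12, 13], max_target_length=4)
print(example)  # {'inputs': [1, 11, 12, 13], 'targets': [11, 12, 13, 2]}
```

With `shift=False` the sketch returns identical inputs and targets, which leaves any offsetting to a later stage such as the model or the loss.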

maxtext.input_pipeline.tfds_data_processing.make_tfds_train_iterator(config, global_mesh, process_indices_train)

Load the training dataset, preprocess it, and return iterators.

Parameters:
  • config (ConfigDict)
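One plausible way the `process_indices_train` argument relates to the `dataloading_host_index` and `dataloading_host_count` arguments of `get_datasets` is sketched below. This assumes `process_indices_train` is a list of the process indices that load training data; the helper `host_shard_args` is hypothetical, and in a real run the current process index would typically come from `jax.process_index()`.

```python
def host_shard_args(process_index, process_indices_train):
    """Map a process index to per-host sharding arguments (illustrative sketch only)."""
    if process_index not in process_indices_train:
        raise ValueError(f"process {process_index} does not load training data")
    return {
        # Position of this process among the data-loading processes.
        "dataloading_host_index": process_indices_train.index(process_index),
        # Total number of processes that load data.
        "dataloading_host_count": len(process_indices_train),
    }


shard_args = host_shard_args(process_index=4, process_indices_train=[0, 2, 4, 6])
print(shard_args)  # {'dataloading_host_index': 2, 'dataloading_host_count': 4}
```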

maxtext.input_pipeline.tfds_data_processing.make_tfds_eval_iterator(config, global_mesh, process_indices_eval)

Load the eval dataset, preprocess it, and return iterators.

Parameters:
  • config (ConfigDict)