maxtext.input_pipeline.tfds_data_processing_c4_mlperf module

Input pipeline for the GPT-3 C4 MLPerf dataset.

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.rekey(ds, key_map=None)

Normalizes the feature keys of each example in ds by applying the mapping in key_map.
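As a rough illustration of key remapping, the sketch below applies a mapping to plain Python dictionaries rather than to a tf.data.Dataset. It assumes that key_map maps each new key to the existing key it is read from. That direction, the helper name rekey_example, and the sample feature names are assumptions made for illustration, not details taken from this module.

```python
def rekey_example(example, key_map=None):
  """Return a copy of `example` whose keys follow `key_map`.

  Assumption: `key_map` maps new_key -> old_key. With no mapping,
  the example is returned unchanged.
  """
  if not key_map:
    return dict(example)
  # Build a new dict that contains only the keys named in the mapping.
  return {new_key: example[old_key] for new_key, old_key in key_map.items()}


# Hypothetical raw example with made-up feature names.
raw = {"text": "hello world", "url": "https://example.com"}
rekeyed = rekey_example(raw, key_map={"inputs": "text", "targets": "text"})
print(rekeyed)
```

Under these assumptions, features that are not named in key_map (here, url) do not appear in the result.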

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.reduce_concat_tokens(dataset, feature_key='targets', batch_size=128)

Token preprocessor that concatenates multiple unrelated documents. To generate examples of exactly the right length (and avoid wasting space on padding), use this function followed by split_tokens.

Parameters:
  • dataset – a tf.data.Dataset of dictionaries containing the key feature_key

  • feature_key – a string

  • batch_size – an integer, the number of documents to concatenate into one

Returns:

a dataset
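The concatenation idea can be sketched on plain Python lists. This is only an illustration of grouping batch_size consecutive documents into one long token sequence. It is not the module's tf.data implementation, and the helper name concat_documents and the token values are hypothetical.

```python
def concat_documents(docs, batch_size=128):
  """Concatenate every `batch_size` consecutive token lists into one list."""
  merged_docs = []
  for start in range(0, len(docs), batch_size):
    merged = []
    for doc in docs[start:start + batch_size]:
      merged.extend(doc)
    merged_docs.append(merged)
  return merged_docs


# Five short "documents", each ending with a made-up EOS token id of 1.
docs = [[5, 6, 1], [7, 1], [8, 9, 10, 1], [11, 1], [12, 13, 1]]
merged = concat_documents(docs, batch_size=2)
print(merged)  # [[5, 6, 1, 7, 1], [8, 9, 10, 1, 11, 1], [12, 13, 1]]
```

The last group may contain fewer than batch_size documents, which is why the final output here holds a single document.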

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.split_tokens(dataset, max_tokens_per_segment=128, feature_key='targets')

Split each example into multiple examples. The intended use case is to break up long examples for use in unsupervised transfer learning. This function is generally preceded by select_random_chunk.

Parameters:
  • dataset – a tf.data.Dataset of dictionaries containing the key feature_key

  • max_tokens_per_segment – an integer, the maximum number of tokens in each segment. Only the final segment may be shorter.

  • feature_key – a string, the feature to split

Returns:

a dataset
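The splitting rule described above (fixed-size segments, with only the final one allowed to be shorter) can be sketched on a plain Python list. The helper name split_into_segments is hypothetical, and the sketch ignores the tf.data machinery the real function operates on.

```python
def split_into_segments(tokens, max_tokens_per_segment=128):
  """Cut `tokens` into consecutive segments of at most the given length."""
  if max_tokens_per_segment <= 0:
    raise ValueError("max_tokens_per_segment must be positive")
  return [
      tokens[start:start + max_tokens_per_segment]
      for start in range(0, len(tokens), max_tokens_per_segment)
  ]


tokens = list(range(10))
segments = split_into_segments(tokens, max_tokens_per_segment=4)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Applied to the output of a concatenation step, this yields examples that are all exactly max_tokens_per_segment long except possibly the last.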

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.split_tokens_to_targets_length(dataset, sequence_length)

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.get_dataset(dataset_name, split, dataloading_host_index, dataloading_host_count, enable_data_shuffling=False, data_shuffle_seed=0, shard_in_read=False)

Load and return a dataset of examples.

Parameters:
  • dataset_name (str)

  • split (str)

  • dataloading_host_index (int)

  • dataloading_host_count (int)

  • enable_data_shuffling (bool)

  • data_shuffle_seed (int)

  • shard_in_read (bool)

Return type:

DatasetV2
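The dataloading_host_index and dataloading_host_count parameters suggest that each data-loading host reads its own slice of the dataset. A minimal sketch of one common scheme, strided sharding over a plain list, is shown below. Whether get_dataset shards in exactly this way is an assumption, and shard_for_host is a hypothetical helper, not part of this module.

```python
def shard_for_host(examples, host_index, host_count):
  """Return the examples whose position modulo `host_count` is `host_index`."""
  if not 0 <= host_index < host_count:
    raise ValueError("host_index must be in [0, host_count)")
  return examples[host_index::host_count]


examples = list(range(10))
shards = [shard_for_host(examples, i, 4) for i in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

In this scheme the shards are disjoint and together cover every example, so no example is read by two hosts.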

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.format_fn(x, eos_id=1, pad_id=0)

Format function for c4_mlperf.

Parameters:
  • eos_id (int)

  • pad_id (int)

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.preprocess_train_dataset(train_ds, sp_tokenizer, train_global_batch_size_to_load, max_target_length, shuffle_buffer_size, data_shuffle_seed)

Preprocess the training dataset.

Parameters:
  • train_ds (DatasetV2)

  • train_global_batch_size_to_load (int)

  • max_target_length (int)

  • shuffle_buffer_size (int)

  • data_shuffle_seed (int)

Return type:

DatasetV2

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.preprocess_eval_dataset(eval_ds, sp_tokenizer, eval_global_batch_size_to_load, max_target_length, num_examples=None, is_tokenized_dataset=True)

Preprocess the evaluation dataset.

Parameters:
  • eval_ds (DatasetV2)

  • eval_global_batch_size_to_load (int)

  • max_target_length (int)

  • num_examples (None | int)

  • is_tokenized_dataset (bool)

Return type:

DatasetV2

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.make_c4_mlperf_train_iterator(config, global_mesh, process_indices)

Make the train iterator for the customized C4 dataset used in MLPerf GPT-3 training.

Parameters:

config (ConfigDict)

maxtext.input_pipeline.tfds_data_processing_c4_mlperf.make_c4_mlperf_eval_iterator(config, global_mesh, process_indices)

Make the eval iterator for the customized C4 dataset used in MLPerf GPT-3 training.

Parameters:

config (ConfigDict)