maxtext.input_pipeline.tfds_data_processing_c4_mlperf module
Input pipeline for the GPT-3 C4 MLPerf dataset.
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.rekey(ds, key_map=None)
Normalize the dataset's feature keys using a key mapping.
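As a rough illustration of key remapping, the sketch below applies a mapping to a plain Python dictionary. It assumes that key_map maps each new key to the old key it reads from, and that key_map=None leaves examples unchanged. Both points are assumptions made for illustration, and rekey_example is a hypothetical helper, not the module's function, which operates on a tf.data.Dataset.

```python
from typing import Any, Dict, Optional


def rekey_example(
    example: Dict[str, Any], key_map: Optional[Dict[str, str]] = None
) -> Dict[str, Any]:
  """Hypothetical per-example helper: rename features according to key_map.

  Assumption: key_map maps new_key -> old_key. Without a key_map the example
  is returned unchanged.
  """
  if not key_map:
    return example
  return {new_key: example[old_key] for new_key, old_key in key_map.items()}


example = {"text": "hello world", "url": "https://example.com"}
rekeyed = rekey_example(example, key_map={"targets": "text"})
print(rekeyed)  # {'targets': 'hello world'}
```

Features not named in the mapping are dropped in this sketch, which keeps downstream steps working with a single, predictable key such as 'targets'.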
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.reduce_concat_tokens(dataset, feature_key='targets', batch_size=128)
Token preprocessor that concatenates multiple unrelated documents. To generate examples of exactly the right length (and avoid wasting space on padding), use this function followed by split_tokens.
- Parameters:
dataset – a tf.data.Dataset with dictionaries containing the key feature_key
feature_key – a string
batch_size – an integer, how many documents to concatenate into one
- Returns:
a dataset
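The concatenation idea can be sketched in plain Python, without tf.data. The helper name concat_token_batches is hypothetical, and the handling of the final partial group (kept here rather than dropped or padded) is an assumption, since the entry above does not describe it.

```python
from typing import Dict, Iterable, Iterator, List


def concat_token_batches(
    examples: Iterable[Dict[str, List[int]]],
    feature_key: str = "targets",
    batch_size: int = 128,
) -> Iterator[Dict[str, List[int]]]:
  """Hypothetical helper: merge every batch_size documents into one token list."""
  buffer: List[int] = []
  count = 0
  for example in examples:
    buffer.extend(example[feature_key])
    count += 1
    if count == batch_size:
      yield {feature_key: buffer}
      buffer, count = [], 0
  if count:  # Emit the final, possibly smaller, group of documents.
    yield {feature_key: buffer}


docs = [{"targets": [1, 2]}, {"targets": [3]}, {"targets": [4, 5, 6]}]
merged = list(concat_token_batches(docs, batch_size=2))
print(merged)  # [{'targets': [1, 2, 3]}, {'targets': [4, 5, 6]}]
```

Each output example is one long token stream, which a later splitting step can cut into fixed-length segments with little or no padding.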
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.split_tokens(dataset, max_tokens_per_segment=128, feature_key='targets')
Split each example into multiple examples. The intended use case is to break up long examples for use in unsupervised transfer learning. This function is generally preceded by select_random_chunk.
- Parameters:
dataset – a tf.data.Dataset with dictionaries containing the key feature_key
max_tokens_per_segment – an integer, the maximum number of tokens in each segment. Only the final segment may be shorter.
feature_key – a string, the feature to split
- Returns:
a dataset
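The splitting rule described above (fixed-size segments, with only the final one allowed to be shorter) can be sketched on a plain Python list. The helper split_into_segments is hypothetical and only illustrates the rule for a single example; the module's function applies it across a tf.data.Dataset.

```python
from typing import List


def split_into_segments(
    tokens: List[int], max_tokens_per_segment: int = 128
) -> List[List[int]]:
  """Hypothetical helper: cut one token list into consecutive segments."""
  if max_tokens_per_segment <= 0:
    raise ValueError("max_tokens_per_segment must be positive")
  return [
      tokens[start:start + max_tokens_per_segment]
      for start in range(0, len(tokens), max_tokens_per_segment)
  ]


segments = split_into_segments(list(range(10)), max_tokens_per_segment=4)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```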
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.split_tokens_to_targets_length(dataset, sequence_length)
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.get_dataset(dataset_name, split, dataloading_host_index, dataloading_host_count, enable_data_shuffling=False, data_shuffle_seed=0, shard_in_read=False)
Load and return a dataset of examples.
- Parameters:
dataset_name (str)
split (str)
dataloading_host_index (int)
dataloading_host_count (int)
enable_data_shuffling (bool)
data_shuffle_seed (int)
shard_in_read (bool)
- Return type:
DatasetV2
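The dataloading_host_index and dataloading_host_count parameters suggest that each host loads its own slice of the data. The sketch below shows one common scheme, round-robin sharding, on a plain Python list; it mirrors the selection rule of tf.data.Dataset.shard(num_shards, index). Whether get_dataset shards exactly this way is an assumption, and shard_for_host is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_host(examples: Sequence[T], host_index: int, host_count: int) -> List[T]:
  """Hypothetical helper: keep every host_count-th example, starting at host_index."""
  if not 0 <= host_index < host_count:
    raise ValueError("host_index must satisfy 0 <= host_index < host_count")
  return list(examples[host_index::host_count])


examples = list(range(10))
shards = [shard_for_host(examples, i, 4) for i in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The shards are disjoint and together cover every example, so no two hosts process the same record.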
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.format_fn(x, eos_id=1, pad_id=0)
Format function for c4_mlperf.
- Parameters:
eos_id (int)
pad_id (int)
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.preprocess_train_dataset(train_ds, sp_tokenizer, train_global_batch_size_to_load, max_target_length, shuffle_buffer_size, data_shuffle_seed)
Preprocess the training dataset.
- Parameters:
train_ds (DatasetV2)
train_global_batch_size_to_load (int)
max_target_length (int)
shuffle_buffer_size (int)
data_shuffle_seed (int)
- Return type:
DatasetV2
- maxtext.input_pipeline.tfds_data_processing_c4_mlperf.preprocess_eval_dataset(eval_ds, sp_tokenizer, eval_global_batch_size_to_load, max_target_length, num_examples=None, is_tokenized_dataset=True)
Preprocess the evaluation dataset.
- Parameters:
eval_ds (DatasetV2)
eval_global_batch_size_to_load (int)
max_target_length (int)
num_examples (None | int)
is_tokenized_dataset (bool)
- Return type:
DatasetV2