maxtext.input_pipeline.grain_data_processing module

Input pipeline using Grain.

maxtext.input_pipeline.grain_data_processing.find_data_files(data_file_pattern)

Find data files matching the pattern.
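As a rough illustration of pattern-based file discovery, the sketch below resolves a glob pattern to a sorted list of local paths. The helper name `find_files_sketch` is hypothetical and this is not the module's actual implementation, which may handle remote storage paths or other pattern syntax that this sketch does not.

```python
import glob
import os
import tempfile


def find_files_sketch(pattern):
    """Hypothetical stand-in: return sorted local paths matching a glob pattern."""
    files = sorted(glob.glob(pattern))
    if not files:
        # Failing early is usually easier to debug than an empty dataset later.
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    return files


# Build a throwaway directory so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
for name in ("shard-00000.array_record", "shard-00001.array_record", "notes.txt"):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("")

matched = find_files_sketch(os.path.join(demo_dir, "*.array_record"))
matched_names = [os.path.basename(p) for p in matched]
print(matched_names)
```

Sorting the matches keeps the file order deterministic across hosts, which matters when each host later selects its own subset of the files.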

maxtext.input_pipeline.grain_data_processing.get_datasets(data_file_pattern, data_file_type, shuffle, shuffle_seed, shuffle_buffer_size, num_epoch, dataloading_host_index, dataloading_host_count, grain_worker_count, grain_num_threads, grain_prefetch_buffer_size, grain_data_source_max_workers, mixture_config_path=None, elastic=False)

Load a dataset from array_record files for use with Grain.
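The `dataloading_host_index` and `dataloading_host_count` parameters suggest that each host reads only its own portion of the data. The sketch below shows one common way to do that, strided sharding over a list of items. The helper `shard_for_host` is hypothetical, and the strided scheme is an assumption made for illustration; `get_datasets` may partition the data differently.

```python
def shard_for_host(items, host_index, host_count):
    """Hypothetical helper: give host `host_index` every `host_count`-th item."""
    if not 0 <= host_index < host_count:
        raise ValueError("host_index must be in [0, host_count)")
    return items[host_index::host_count]


# Ten record indices split across three data-loading hosts.
records = list(range(10))
host_count = 3
shards = [shard_for_host(records, i, host_count) for i in range(host_count)]

for i, shard in enumerate(shards):
    print(f"host {i}: {shard}")
```

With this scheme the shards are disjoint and together cover every record, so no example is read twice or dropped within an epoch.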

maxtext.input_pipeline.grain_data_processing.pretrain_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)

Use a Grain pipeline to preprocess the dataset and return iterators for pretraining.

When config.grain_use_elastic_iterator is True, the pipeline stops before batching and multiprocessing (which ElasticIterator performs itself) and applies shift pre-batch on axis 0 rather than post-batch on axis 1.
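The axis difference follows from the shapes involved: an unbatched example is a single sequence, so its token axis is axis 0, while a batched array has shape (batch, sequence), so the token axis is axis 1. The sketch below uses plain Python lists and a hypothetical shift-left-by-one with padding to show that both orderings give the same result. The shift direction and the `pad_id` value are assumptions for illustration, not necessarily what the pipeline applies.

```python
def shift_left_1d(seq, pad_id=0):
    """Shift a single sequence left by one token (its only axis is axis 0)."""
    return seq[1:] + [pad_id]


def shift_left_batched(batch, pad_id=0):
    """Shift every row of a [batch, seq] nested list along the sequence axis (axis 1)."""
    return [shift_left_1d(row, pad_id) for row in batch]


examples = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Pre-batch: shift each example on axis 0, then group the examples into a batch.
pre_batch_result = [shift_left_1d(example) for example in examples]

# Post-batch: group the examples first, then shift the batch on axis 1.
batch = [list(example) for example in examples]
post_batch_result = shift_left_batched(batch)

print(pre_batch_result)
print(post_batch_result)
```

Because the shift acts independently on each sequence, moving it ahead of batching changes only the axis it is applied on, not the values produced.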

maxtext.input_pipeline.grain_data_processing.dpo_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)

Use Grain to preprocess the dataset and return iterators for DPO fine-tuning.

maxtext.input_pipeline.grain_data_processing.sft_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)

Use a Grain pipeline to preprocess the dataset and return iterators for SFT fine-tuning.

maxtext.input_pipeline.grain_data_processing.make_grain_train_iterator(config, global_mesh, process_indices)

Load and preprocess the training dataset, then return iterators.

Parameters:

config (ConfigDict)

maxtext.input_pipeline.grain_data_processing.make_grain_eval_iterator(config, global_mesh, process_indices)

Load and preprocess the evaluation dataset, then return iterators.

Parameters:

config (ConfigDict)