maxtext.input_pipeline.grain_data_processing module
Input pipeline using Grain.
- maxtext.input_pipeline.grain_data_processing.find_data_files(data_file_pattern)
Find data files matching the pattern.
- maxtext.input_pipeline.grain_data_processing.get_datasets(data_file_pattern, data_file_type, shuffle, shuffle_seed, shuffle_buffer_size, num_epoch, dataloading_host_index, dataloading_host_count, grain_worker_count, grain_num_threads, grain_prefetch_buffer_size, grain_data_source_max_workers, mixture_config_path=None, elastic=False)
Load a dataset from array_record files for use with Grain.
- maxtext.input_pipeline.grain_data_processing.pretrain_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)
Use a Grain pipeline to preprocess the dataset and return iterators for pretraining.
When config.grain_use_elastic_iterator is True, the pipeline stops before batching and multiprocessing (which ElasticIterator performs itself) and applies shift pre-batch on axis 0 rather than post-batch on axis 1.
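The axis change described above can be illustrated with a small NumPy sketch. It assumes the shift is a one-position right shift that fills the vacated slot with 0; the `shift_right` helper here is hypothetical and written only for illustration, not taken from MaxText. Under that assumption, shifting each unbatched example on axis 0 and then batching gives the same result as batching first and shifting on axis 1.

```python
import numpy as np


def shift_right(x, axis):
    """Hypothetical helper: shift x one position right along `axis`, filling with 0."""
    # Pad one zero at the front of the chosen axis only.
    pad_widths = [(0, 0)] * x.ndim
    pad_widths[axis] = (1, 0)
    padded = np.pad(x, pad_widths, mode="constant", constant_values=0)
    # Drop the last element along that axis so the shape is unchanged.
    slices = [slice(None)] * x.ndim
    slices[axis] = slice(0, -1)
    return padded[tuple(slices)]


# Two unbatched examples, each of shape (seq_len,).
examples = [np.array([5, 6, 7, 8]), np.array([1, 2, 3, 4])]

# Pre-batch: shift each example on axis 0 (the sequence axis), then batch.
pre_batch = np.stack([shift_right(e, axis=0) for e in examples])

# Post-batch: batch first, then shift on axis 1 (the sequence axis of a batch).
post_batch = shift_right(np.stack(examples), axis=1)

print(np.array_equal(pre_batch, post_batch))
```

In both orderings the shift runs along the sequence dimension; only its axis index changes, because batching adds a leading batch dimension.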
- maxtext.input_pipeline.grain_data_processing.dpo_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)
Use a Grain pipeline to preprocess the dataset and return iterators for DPO fine-tuning.
- maxtext.input_pipeline.grain_data_processing.sft_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_worker_count, grain_per_worker_buffer_size)
Use a Grain pipeline to preprocess the dataset and return iterators for SFT fine-tuning.