maxtext.input_pipeline.hf_data_processing module
Input pipeline using Hugging Face datasets.
- maxtext.input_pipeline.hf_data_processing.vision_sft_preprocessing_pipeline(dataset, config, dataloading_host_index, dataloading_host_count, global_mesh, text_columns, image_column, global_batch_size)
Pipeline for multimodal supervised fine-tuning (SFT) with a Hugging Face dataset.
- maxtext.input_pipeline.hf_data_processing.preprocessing_pipeline(dataloading_host_index, dataloading_host_count, global_mesh, dataset, config, data_column_names, tokenize, tokenizer_path, hf_access_token, global_batch_size, max_target_length, shuffle, data_shuffle_seed, chat_template_path='', add_bos=True, add_eos=True, packing=True, shift=True, num_threads=1, drop_remainder=True, generate_padding_batch=False, use_dpo=None, use_sft=None, use_tunix_gradient_accumulation=False, num_microbatches=1, sft_train_on_completion_only=True, grain_worker_count=1, max_segments_per_seq=None, num_epoch=1, chat_template=None, formatting_func_path=None, formatting_func_kwargs=None)
Pipeline for preprocessing a Hugging Face dataset.
- Parameters:
chat_template (str | None)
formatting_func_path (str | None)
formatting_func_kwargs (dict | None)
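The signature of preprocessing_pipeline above has thirteen arguments without defaults. The sketch below collects them into a keyword dictionary so the call site stays readable. It is an illustration only: build_preprocessing_kwargs and run_pipeline are hypothetical helpers, every literal value (column name, tokenizer path, batch size, sequence length, seed) is a placeholder, and the types passed for data_column_names and hf_access_token are assumptions, since this page does not document them.

```python
from typing import Any


def build_preprocessing_kwargs(
    dataset: Any,
    config: Any,
    global_mesh: Any,
    host_index: int,
    host_count: int,
) -> dict[str, Any]:
    """Hypothetical helper: gather the arguments that have no default
    in the documented preprocessing_pipeline signature."""
    return {
        "dataloading_host_index": host_index,
        "dataloading_host_count": host_count,
        "global_mesh": global_mesh,
        "dataset": dataset,
        "config": config,
        # Everything below is a placeholder value, not a recommended setting.
        "data_column_names": ("text",),  # assumed to be a sequence of column names
        "tokenize": True,
        "tokenizer_path": "path/to/tokenizer",
        "hf_access_token": "",  # assumed to accept an empty string for public data
        "global_batch_size": 8,
        "max_target_length": 2048,
        "shuffle": True,
        "data_shuffle_seed": 0,
    }


def run_pipeline(dataset: Any, config: Any, global_mesh: Any,
                 host_index: int, host_count: int) -> Any:
    """Hypothetical wrapper: call the documented function by keyword."""
    # Imported lazily so the helper above works without MaxText installed.
    from maxtext.input_pipeline.hf_data_processing import preprocessing_pipeline

    kwargs = build_preprocessing_kwargs(
        dataset, config, global_mesh, host_index, host_count
    )
    # Defaulted parameters can be overridden by name; packing=False is an example.
    return preprocessing_pipeline(**kwargs, packing=False)
```

Passing every argument by keyword guards against the positional order, which differs between the two functions on this page: preprocessing_pipeline starts with dataloading_host_index, while vision_sft_preprocessing_pipeline starts with dataset.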