maxtext.input_pipeline.hf_data_processing module

Input pipeline using Hugging Face datasets.

maxtext.input_pipeline.hf_data_processing.vision_sft_preprocessing_pipeline(dataset, config, dataloading_host_index, dataloading_host_count, global_mesh, text_columns, image_column, global_batch_size)

Preprocessing pipeline for multimodal supervised fine-tuning (SFT) on a Hugging Face dataset.

maxtext.input_pipeline.hf_data_processing.preprocessing_pipeline(dataloading_host_index, dataloading_host_count, global_mesh, dataset, config, data_column_names, tokenize, tokenizer_path, hf_access_token, global_batch_size, max_target_length, shuffle, data_shuffle_seed, chat_template_path='', add_bos=True, add_eos=True, packing=True, shift=True, num_threads=1, drop_remainder=True, generate_padding_batch=False, use_dpo=None, use_sft=None, use_tunix_gradient_accumulation=False, num_microbatches=1, sft_train_on_completion_only=True, grain_worker_count=1, max_segments_per_seq=None, num_epoch=1, chat_template=None, formatting_func_path=None, formatting_func_kwargs=None)

Preprocessing pipeline for a Hugging Face dataset.

Parameters:
  • chat_template (str | None)

  • formatting_func_path (str | None)

  • formatting_func_kwargs (dict | None)
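The add_bos, add_eos, shift and max_target_length arguments suggest a conventional next-token setup. The following is a minimal sketch of that general technique, not the MaxText implementation: the helper name build_inputs_and_targets, the bos_id and eos_id values, and the truncation rule are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def build_inputs_and_targets(
    token_ids: Sequence[int],
    *,
    bos_id: int,
    eos_id: int,
    max_target_length: int,
    add_bos: bool = True,
    add_eos: bool = True,
    shift: bool = True,
) -> Dict[str, List[int]]:
    """Hypothetical helper: turn one tokenized example into inputs/targets."""
    ids = list(token_ids)
    if add_bos:
        ids = [bos_id] + ids  # prepend the beginning-of-sequence token
    if add_eos:
        ids = ids + [eos_id]  # append the end-of-sequence token
    if shift:
        # Assumption: keep one extra token so that both sides can still
        # reach max_target_length after shifting by one position.
        ids = ids[: max_target_length + 1]
        inputs, targets = ids[:-1], ids[1:]
    else:
        ids = ids[:max_target_length]
        inputs, targets = list(ids), list(ids)
    return {"inputs": inputs, "targets": targets}


example = build_inputs_and_targets(
    [5, 6, 7], bos_id=1, eos_id=2, max_target_length=8
)
```

With shift enabled, each position in inputs is paired with the following token in targets, which is the usual layout for teacher-forced language-model training.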

maxtext.input_pipeline.hf_data_processing.make_hf_train_iterator(config, global_mesh, process_indices_train)

Load and preprocess the training dataset, then return an iterator over it.

Parameters:
  • config (ConfigDict)
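The process_indices_train argument, together with the dataloading_host_index, dataloading_host_count and global_batch_size arguments of the preprocessing pipelines, points to a per-host sharding scheme. The sketch below shows one plausible way such values could relate. It is an assumption for illustration only: host_shard_info and shard_examples are hypothetical helpers, not MaxText functions, and the strided split is just one common sharding choice.

```python
from typing import List, Sequence, Tuple


def host_shard_info(
    process_index: int,
    process_indices: Sequence[int],
    global_batch_size: int,
) -> Tuple[int, int, int]:
    """Hypothetical helper: derive (host_index, host_count, per_host_batch)."""
    if process_index not in process_indices:
        raise ValueError(f"process {process_index} does not load data")
    host_count = len(process_indices)
    if global_batch_size % host_count != 0:
        raise ValueError("global_batch_size must divide evenly across hosts")
    # Position of this process among the data-loading processes.
    host_index = list(process_indices).index(process_index)
    return host_index, host_count, global_batch_size // host_count


def shard_examples(examples: Sequence[int], host_index: int, host_count: int) -> List[int]:
    """Strided split: host i takes examples i, i + host_count, ..."""
    return list(examples[host_index::host_count])


# Processes 0 and 2 load data; this is process 2 with a global batch of 8.
host_index, host_count, per_host_batch = host_shard_info(2, [0, 2], 8)
my_examples = shard_examples(list(range(6)), host_index, host_count)
```

In a real multi-host job the current process index would typically come from jax.process_index() rather than being passed in by hand.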

maxtext.input_pipeline.hf_data_processing.make_hf_eval_iterator(config, global_mesh, process_indices_eval)

Make the Hugging Face evaluation iterator: load and preprocess the eval dataset, then return an iterator over it.

Parameters:
  • config (ConfigDict)