maxtext.input_pipeline.distillation_data_processing module#

Input pipeline to generate knowledge distillation dataset from conversational dataset. The conversational dataset should conform to one of the two schemas: 1. Contains a messages column: Typically holding a list of message

(e.g., [{‘role’: ‘user’, ‘content’: ‘…’}, {‘role’: ‘assistant’, ‘content’: ‘…’}, …]).

  1. Contains prompt and completion columns: Separating the input

    query [{‘role’: ‘user’, ‘content’: ‘…’}] from the target output [{‘role’: ‘assistant’, ‘content’: ‘…’}].

class maxtext.input_pipeline.distillation_data_processing.InputRequest(prompt: str = '', prompt_token_ids: list[int] = <factory>, actual_completion: str = '', max_output_tokens: int = 0)[source]#

Bases: object

Parameters:
  • prompt (str)

  • prompt_token_ids (list[int])

  • actual_completion (str)

  • max_output_tokens (int)

prompt: str = ''#
prompt_token_ids: list[int]#
actual_completion: str = ''#
max_output_tokens: int = 0#
maxtext.input_pipeline.distillation_data_processing.map_to_prompt_completion(example)[source]#
example = {
“messages”: [

{“role”: “user”, “content”: “prompt_1”}, {“role”: “assistant”, “content”: “completion_1”}, {“role”: “user”, “content”: “prompt_2”}, {“role”: “assistant”, “content”: “completion_2”}

]

} map_to_prompt_completion(example) returns:

{

“prompt”: [{“role”: “user”, “content”: “prompt_1”}, {“role”: “user”, “content”: “prompt_2”}], “completion”: [{“role”: “assistant”, “content”: “completion_1”}, {“role”: “assistant”, “content”: “completion_2”}]

}

maxtext.input_pipeline.distillation_data_processing.extract_content(example, data_column_names)[source]#
example = {

“prompt”: [{“role”: “user”, “content”: “prompt_1”}, {“role”: “user”, “content”: “prompt_2”}], “completion”: [{“role”: “assistant”, “content”: “completion_1”}, {“role”: “assistant”, “content”: “completion_2”}]

} extract_content(example, [“prompt”, “completion”]) returns:

{

“prompt”: [“prompt_1”, “prompt_2”], “completion”: [“completion_1”, “completion_2”]

}

maxtext.input_pipeline.distillation_data_processing.process_dataset(config, dataset)[source]#

Pipeline for preprocessing dataset.

maxtext.input_pipeline.distillation_data_processing.load_dataset(config)[source]#

Loads dataset from Hugging Face.

maxtext.input_pipeline.distillation_data_processing.filter_dataset(config, dataset, tokenizer)[source]#

Filter out samples from the dataset.