maxtext.input_pipeline.distillation_data_processing module#
Input pipeline to generate knowledge distillation dataset from conversational dataset. The conversational dataset should conform to one of the two schemas: 1. Contains a messages column: Typically holding a list of message
(e.g., [{‘role’: ‘user’, ‘content’: ‘…’}, {‘role’: ‘assistant’, ‘content’: ‘…’}, …]).
- Contains prompt and completion columns: Separating the input
query [{‘role’: ‘user’, ‘content’: ‘…’}] from the target output [{‘role’: ‘assistant’, ‘content’: ‘…’}].
- class maxtext.input_pipeline.distillation_data_processing.InputRequest(prompt: str = '', prompt_token_ids: list[int] = <factory>, actual_completion: str = '', max_output_tokens: int = 0)[source]#
Bases:
object- Parameters:
prompt (str)
prompt_token_ids (list[int])
actual_completion (str)
max_output_tokens (int)
- prompt: str = ''#
- prompt_token_ids: list[int]#
- actual_completion: str = ''#
- max_output_tokens: int = 0#
- maxtext.input_pipeline.distillation_data_processing.map_to_prompt_completion(example)[source]#
- example = {
- “messages”: [
{“role”: “user”, “content”: “prompt_1”}, {“role”: “assistant”, “content”: “completion_1”}, {“role”: “user”, “content”: “prompt_2”}, {“role”: “assistant”, “content”: “completion_2”}
]
} map_to_prompt_completion(example) returns:
- {
“prompt”: [{“role”: “user”, “content”: “prompt_1”}, {“role”: “user”, “content”: “prompt_2”}], “completion”: [{“role”: “assistant”, “content”: “completion_1”}, {“role”: “assistant”, “content”: “completion_2”}]
}
- maxtext.input_pipeline.distillation_data_processing.extract_content(example, data_column_names)[source]#
- example = {
“prompt”: [{“role”: “user”, “content”: “prompt_1”}, {“role”: “user”, “content”: “prompt_2”}], “completion”: [{“role”: “assistant”, “content”: “completion_1”}, {“role”: “assistant”, “content”: “completion_2”}]
} extract_content(example, [“prompt”, “completion”]) returns:
- {
“prompt”: [“prompt_1”, “prompt_2”], “completion”: [“completion_1”, “completion_2”]
}
- maxtext.input_pipeline.distillation_data_processing.process_dataset(config, dataset)[source]#
Pipeline for preprocessing dataset.