maxtext.multimodal.processor_gemma3 module
Gemma3-specific utilities for multimodal features.
- class maxtext.multimodal.processor_gemma3.Gemma3PreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None)
Bases: PreprocessorOutput
Holds the output of the Gemma3 image preprocessor.
- Parameters:
pixel_values (None | ndarray)
pixel_mask (None | ndarray)
aspect_ratios (None | ndarray)
num_images (int)
audio_values (None | ndarray)
audio_mask (None | ndarray)
- Inherited from `mm_utils.PreprocessorOutput`.
- num_images: int = 0
- pixel_values: None | ndarray = None
- pixel_mask: None | ndarray = None
- maxtext.multimodal.processor_gemma3.preprocess_mm_data_gemma3(images)
Preprocesses multimodal data for Gemma3 models.
- maxtext.multimodal.processor_gemma3.get_image_offsets_gemma3(processor_output)
Gets the increase in total token count after inserting image token placeholders.
- Parameters:
processor_output (PreprocessorOutput | None)
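The offset can be reasoned about as simple arithmetic: each image's single `<start_of_image>` token is replaced by a longer block, so the token count grows by the block length minus one per image. The following is a minimal sketch under stated assumptions. The block size (256 soft tokens plus four framing tokens) and the helper name `image_token_offset` are illustrative and are not taken from this module, which receives a `PreprocessorOutput` rather than a bare count.

```python
# Assumed values for illustration only; the module's real constants may differ.
NUM_SOFT_TOKENS_PER_IMAGE = 256
NUM_FRAMING_TOKENS = 4  # "\n\n", <start_of_image>, <end_of_image>, "\n\n"


def image_token_offset(num_images: int) -> int:
  """Returns how many tokens are added when `num_images` placeholders expand."""
  tokens_per_image = NUM_SOFT_TOKENS_PER_IMAGE + NUM_FRAMING_TOKENS
  # Subtract 1 because the original <start_of_image> token is already counted
  # in the unexpanded prompt.
  return (tokens_per_image - 1) * num_images
```

Under these assumed constants, a prompt with two images would grow by `2 * 259` tokens.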
- maxtext.multimodal.processor_gemma3.reformat_prompt_gemma3(prompt, image_placeholder, num_images)
Reformat prompt for Gemma3 models by inserting image placeholders.
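One plausible reading of "inserting image placeholders" is sketched below: user-facing placeholders are rewritten to the model's image token, and any images without a placeholder get one prepended. The helper name `reformat_prompt_sketch`, the `model_placeholder` default, and the prepend policy are assumptions made for illustration. The real function may also apply a chat template, which this sketch leaves out.

```python
def reformat_prompt_sketch(
    prompt: str,
    image_placeholder: str,
    num_images: int,
    model_placeholder: str = "<start_of_image>",  # assumed model-side token
) -> str:
  """Rewrites user placeholders and prepends any that are missing."""
  # Map the caller's placeholder (e.g. "<image>") to the model-side token.
  prompt = prompt.replace(image_placeholder, model_placeholder)
  # If the prompt mentions fewer placeholders than there are images,
  # prepend the remainder so every image has a slot.
  missing = num_images - prompt.count(model_placeholder)
  if missing > 0:
    prompt = model_placeholder * missing + prompt
  return prompt
```

For example, calling it with `"Describe <image> now"`, `"<image>"`, and `num_images=2` would yield a prompt containing two `<start_of_image>` markers, one of them prepended.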
- maxtext.multimodal.processor_gemma3.insert_sequence(tokens, *, at, sequence, max_num_images)
Inserts a sequence of tokens at every occurrence of a specific token, `at`. This function is fully vectorized and operates on a batch of token sequences.
- Parameters:
tokens (ndarray) – A 1D or 2D array of input tokens.
at (int) – The token ID to find and replace with the sequence.
sequence (list[int]) – The list of new token IDs to insert.
max_num_images (int) – The maximum number of times `at` can appear.
- Returns:
The modified token array with the sequences inserted.
- Return type:
ndarray
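The vectorized idea can be illustrated with a cumulative sum: every earlier occurrence of `at` shifts later tokens right by `len(sequence) - 1`. This is a minimal NumPy sketch for the 1D case only. The name `insert_sequence_1d` and the choice to pad unused trailing slots with zeros are assumptions, and the sketch raises an `IndexError` if `at` appears more than `max_num_images` times. How the real function handles padding and overflow is not stated in this page.

```python
import numpy as np


def insert_sequence_1d(tokens, *, at, sequence, max_num_images):
  """Replaces each `at` in a 1D token array with `sequence`, without loops."""
  tokens = np.asarray(tokens)
  seq = np.asarray(sequence, dtype=tokens.dtype)
  is_at = tokens == at
  # Number of `at` tokens strictly before each position.
  num_before = np.cumsum(is_at) - is_at
  # Each earlier `at` shifts later tokens right by len(seq) - 1.
  new_pos = np.arange(len(tokens)) + num_before * (len(seq) - 1)
  # Fixed output length, sized for the maximum number of insertions.
  out_len = len(tokens) + max_num_images * (len(seq) - 1)
  out = np.zeros(out_len, dtype=tokens.dtype)  # zero padding is an assumption
  # Scatter ordinary tokens to their shifted positions.
  out[new_pos[~is_at]] = tokens[~is_at]
  # Fill each inserted block: block starts broadcast against sequence offsets.
  starts = new_pos[is_at]
  block_idx = (starts[:, None] + np.arange(len(seq))).ravel()
  out[block_idx] = np.tile(seq, len(starts))
  return out
```

Sizing the output from `max_num_images` rather than from the actual count keeps the result shape independent of the input contents, which is the usual reason for passing such a bound to array code.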
- maxtext.multimodal.processor_gemma3.add_extra_tokens_for_images_gemma3(tokens, *, max_num_images=1)
Add the extra image tokens to the text tokens.
If the model has images, we expand each <start_of_image> token by the image placeholder tokens.
Example:
```python
input = […, x, <start_of_image>, y, …]
output = [
    …, x, \n\n, <start_of_image>, SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, …, SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, <end_of_image>, \n\n, y, …
]
```
The `\n\n` tokens are added to match how the model was trained.
- Parameters:
tokens (ndarray | list) – The text tokens.
max_num_images (int) – The maximum number of images in the batch.
num_tokens_per_image – The number of soft tokens per image. (Described in the docstring; it does not appear in the signature shown above.)
- Returns:
The text tokens with the extra image tokens.
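The expansion described above can be sketched with plain Python lists. All token IDs and the block size below are hypothetical values chosen for readability, and `expand_image_tokens` is an illustrative name, not part of this module. The real function operates on arrays and bounds the work with `max_num_images`.

```python
# Hypothetical token IDs and block size, for illustration only.
NEWLINES = 1001         # stands in for the framing newline token
START_OF_IMAGE = 1002
END_OF_IMAGE = 1003
SOFT_TOKEN_PLACEHOLDER = -2
NUM_SOFT_TOKENS = 4     # kept tiny so the output is easy to read


def expand_image_tokens(tokens: list[int]) -> list[int]:
  """Replaces each START_OF_IMAGE token with the full image block."""
  block = (
      [NEWLINES, START_OF_IMAGE]
      + [SOFT_TOKEN_PLACEHOLDER] * NUM_SOFT_TOKENS
      + [END_OF_IMAGE, NEWLINES]
  )
  out: list[int] = []
  for token in tokens:
    if token == START_OF_IMAGE:
      out.extend(block)  # one token becomes NUM_SOFT_TOKENS + 4 tokens
    else:
      out.append(token)
  return out
```

With these assumed values, a three-token input containing one image marker grows to ten tokens, since the single marker is replaced by an eight-token block.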