maxtext.multimodal.processor_gemma3 module

Gemma3-specific utilities for multimodal features.

class maxtext.multimodal.processor_gemma3.Gemma3PreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None)[source]#

Bases: PreprocessorOutput

Holds the output of the Gemma3 image preprocessor.

Parameters:

pixel_values (None | ndarray)
pixel_mask (None | ndarray)
aspect_ratios (None | ndarray)
num_images (int)
audio_values (None | ndarray)
audio_mask (None | ndarray)

Attributes are inherited from `mm_utils.PreprocessorOutput`.

num_images: int = 0#

pixel_values: None | ndarray = None#

pixel_mask: None | ndarray = None#

maxtext.multimodal.processor_gemma3.preprocess_mm_data_gemma3(images)[source]#: Preprocesses multimodal data for Gemma3 models.

maxtext.multimodal.processor_gemma3.get_image_offsets_gemma3(processor_output)[source]#

Get the increase in total token count after inserting image token placeholders.

Parameters:

processor_output (PreprocessorOutput | None)
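
To make the idea of a token-count increase concrete, here is a minimal sketch of the arithmetic. It is not the library's implementation: the constants and the helper name `image_token_offset` are hypothetical, and the assumption that each image contributes four non-soft tokens is inferred from the expansion example in `add_extra_tokens_for_images_gemma3`, not confirmed by this page.

```python
# Hypothetical constants, chosen for illustration only; the real values
# live in the MaxText source and may differ.
SOFT_TOKENS_PER_IMAGE = 256  # assumed number of soft placeholder tokens
EXTRA_TOKENS_PER_IMAGE = 4  # assumed: newline, <start_of_image>, <end_of_image>, newline


def image_token_offset(num_images: int) -> int:
    """Net growth in sequence length once each placeholder is expanded."""
    tokens_per_image = SOFT_TOKENS_PER_IMAGE + EXTRA_TOKENS_PER_IMAGE
    # Each image starts out as a single placeholder token in the prompt,
    # so expanding it adds (tokens_per_image - 1) tokens.
    return (tokens_per_image - 1) * num_images
```

Under these assumed values, a prompt with no images grows by zero tokens, and the growth scales linearly with the number of images.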

maxtext.multimodal.processor_gemma3.reformat_prompt_gemma3(prompt, image_placeholder, num_images)[source]#: Reformat prompt for Gemma3 models by inserting image placeholders.
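
The page does not say where the placeholders are inserted, so the following is only a sketch of one plausible behavior: make sure the prompt carries one placeholder per image by prepending any that are missing. The function name `reformat_prompt_sketch` and the prepend strategy are assumptions, and any chat-template wrapping the real function may apply is left out.

```python
def reformat_prompt_sketch(prompt: str, image_placeholder: str, num_images: int) -> str:
    """Ensure the prompt contains one placeholder per image (illustrative only)."""
    missing = num_images - prompt.count(image_placeholder)
    if missing > 0:
        # Assumption: missing placeholders are prepended to the prompt.
        prompt = image_placeholder * missing + prompt
    return prompt


# Usage: a prompt without placeholders gains one per image.
print(reformat_prompt_sketch("Describe this.", "<start_of_image>", 2))
```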

maxtext.multimodal.processor_gemma3.insert_sequence(tokens, *, at, sequence, max_num_images)[source]#

Inserts a sequence of tokens at every occurrence of the token `at`. This function is fully vectorized and operates on a batch of token sequences.

Parameters:

tokens (ndarray) – A 1D or 2D array of input tokens.
at (int) – The token ID to find and replace with the sequence.
sequence (list[int]) – The list of new token IDs to insert.
max_num_images (int) – The maximum number of times `at` can appear.

Returns:

The modified token array with the sequences inserted.

Return type:

ndarray
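
The semantics described above can be illustrated with a plain, loop-based reference. This is a sketch on Python lists, not the vectorized implementation: the name `insert_sequence_reference` is hypothetical, and raising an error when `at` appears more than `max_num_images` times is an assumption about how the limit might be enforced.

```python
def insert_sequence_reference(tokens, *, at, sequence, max_num_images):
    """Loop-based reference: replace each `at` token with `sequence`."""
    # Accept either a single sequence or a batch of sequences.
    is_batch = bool(tokens) and isinstance(tokens[0], list)
    rows = tokens if is_batch else [tokens]

    out_rows = []
    for row in rows:
        if row.count(at) > max_num_images:
            # Assumption: exceeding the limit is treated as an error here.
            raise ValueError("`at` appears more often than max_num_images")
        out = []
        for tok in row:
            if tok == at:
                out.extend(sequence)  # the `at` token itself is replaced
            else:
                out.append(tok)
        out_rows.append(out)

    return out_rows if is_batch else out_rows[0]
```

A vectorized version has to produce arrays of a fixed shape, which is presumably why an upper bound such as `max_num_images` is part of the signature; the loop above ignores that constraint and lets each row grow as needed.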

maxtext.multimodal.processor_gemma3.add_extra_tokens_for_images_gemma3(tokens, *, max_num_images=1)[source]#

Add the extra image tokens to the text tokens.

If the model has images, each <start_of_image> token is expanded into the image placeholder tokens.

Example:

```python
input = [..., x, <start_of_image>, y, ...]
output = [
    ..., x, \n\n, <start_of_image>, SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, ..., SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, <end_of_image>, \n\n, y, ...
]
```

The `\n\n` tokens are added to match how the model was trained.

Parameters:

tokens (ndarray | list) – The text tokens.
max_num_images (int) – The maximum number of images in the batch.
num_tokens_per_image – The number of soft tokens per image (described in the docstring but absent from the signature above).

Returns:

The text tokens with the extra image tokens.
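
The expansion can be sketched on a plain Python list. Everything below is illustrative: the token IDs are made up (real IDs come from the tokenizer), the soft-token count is kept at 4 for readability, and `expand_image_tokens` is a hypothetical helper, not the function documented here.

```python
# Hypothetical token IDs, for illustration only.
NEWLINES = 10  # stands in for the newline token placed around the image span
START_OF_IMAGE = 20
END_OF_IMAGE = 21
SOFT_TOKEN_PLACEHOLDER = -1
NUM_SOFT_TOKENS = 4  # assumed small value; the real count is larger


def expand_image_tokens(tokens):
    """Replace each START_OF_IMAGE token with the full image token span."""
    image_span = (
        [NEWLINES, START_OF_IMAGE]
        + [SOFT_TOKEN_PLACEHOLDER] * NUM_SOFT_TOKENS
        + [END_OF_IMAGE, NEWLINES]
    )
    out = []
    for tok in tokens:
        if tok == START_OF_IMAGE:
            out.extend(image_span)
        else:
            out.append(tok)
    return out


# Usage: one image placeholder between two text tokens.
print(expand_image_tokens([5, START_OF_IMAGE, 6]))
```

Each placeholder therefore grows from one token to `NUM_SOFT_TOKENS + 4` tokens in this sketch, and text tokens pass through unchanged.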

maxtext.multimodal.processor_gemma3.get_dummy_image_shape_for_init_gemma3(batch_size=1, num_image_per_sequence=1)[source]#: Return the shape of the dummy image for Gemma3 model’s initialization.
