maxtext.multimodal.processor_gemma3 module

Gemma3-specific utilities for multimodal features.

class maxtext.multimodal.processor_gemma3.Gemma3PreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None)

Bases: PreprocessorOutput

Holds the output of the Gemma3 image preprocessor.

Parameters:
  • pixel_values (None | ndarray)

  • pixel_mask (None | ndarray)

  • aspect_ratios (None | ndarray)

  • num_images (int)

  • audio_values (None | ndarray)

  • audio_mask (None | ndarray)

Inherited from `mm_utils.PreprocessorOutput`.
num_images: int = 0

pixel_values: None | ndarray = None

pixel_mask: None | ndarray = None
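
The field list above can be illustrated with a small stand-in container. This is only a sketch: `PreprocessorOutputSketch` is a hypothetical dataclass that mirrors the documented field names and defaults, it is not the MaxText class, and the array shape used below is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class PreprocessorOutputSketch:
    """Hypothetical stand-in mirroring the documented fields; not the MaxText class."""

    pixel_values: Optional[np.ndarray] = None
    pixel_mask: Optional[np.ndarray] = None
    aspect_ratios: Optional[np.ndarray] = None
    num_images: int = 0
    audio_values: Optional[np.ndarray] = None
    audio_mask: Optional[np.ndarray] = None


# Text-only input: every field keeps its default.
text_only = PreprocessorOutputSketch()

# Two images. The (2, 8, 8, 3) shape is an arbitrary illustration,
# not the shape the real preprocessor produces.
images = np.zeros((2, 8, 8, 3), dtype=np.float32)
with_images = PreprocessorOutputSketch(pixel_values=images, num_images=images.shape[0])
```

Callers can then branch on `pixel_values is None` to tell text-only inputs from inputs that carry images.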
maxtext.multimodal.processor_gemma3.preprocess_mm_data_gemma3(images)

Preprocesses multimodal data for Gemma3 models.

maxtext.multimodal.processor_gemma3.get_image_offsets_gemma3(processor_output)

Get the increase in total token count after inserting image token placeholders.

Parameters:
  • processor_output (PreprocessorOutput | None)
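
The arithmetic behind such an offset can be sketched as follows. `image_token_offset` and the block size of 8 are hypothetical, and the sketch assumes that each image is represented in the prompt by a single token that is later replaced by a fixed-size block; whether `get_image_offsets_gemma3` computes its result exactly this way is not stated here.

```python
def image_token_offset(num_images: int, tokens_per_image_block: int) -> int:
    """Extra tokens added when each single image token becomes a block.

    Hypothetical helper: one token per image is already counted in the
    prompt, so each image contributes `tokens_per_image_block - 1` more.
    """
    return num_images * (tokens_per_image_block - 1)


# Assumed block size for illustration: 4 soft tokens plus 4 framing tokens.
print(image_token_offset(num_images=2, tokens_per_image_block=8))  # 14
```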

maxtext.multimodal.processor_gemma3.reformat_prompt_gemma3(prompt, image_placeholder, num_images)

Reformat prompt for Gemma3 models by inserting image placeholders.
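
One way a prompt could be brought in line with the number of images is sketched below. `ensure_image_placeholders` is a hypothetical helper, and prepending the missing placeholders is an assumption made for illustration; where `reformat_prompt_gemma3` actually places them, and any chat template it applies, is not described here.

```python
def ensure_image_placeholders(prompt: str, image_placeholder: str, num_images: int) -> str:
    """Hypothetical helper: prepend placeholders until the prompt has `num_images`."""
    missing = max(0, num_images - prompt.count(image_placeholder))
    return image_placeholder * missing + prompt


print(ensure_image_placeholders("Describe <image> and compare.", "<image>", 2))
# <image>Describe <image> and compare.
```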

maxtext.multimodal.processor_gemma3.insert_sequence(tokens, *, at, sequence, max_num_images)

Inserts `sequence` at every occurrence of the token ID `at`. The function is fully vectorized and operates on a batch of token sequences.

Parameters:
  • tokens (ndarray) – A 1D or 2D array of input tokens.

  • at (int) – The token ID to find and replace with the sequence.

  • sequence (list[int]) – The list of new token IDs to insert.

  • max_num_images (int) – The maximum number of times `at` can appear.

Returns:

The modified token array with the sequences inserted.

Return type:

ndarray
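
A vectorized replacement of this kind can be sketched in NumPy as below. `insert_sequence_sketch` is a hypothetical re-implementation, not the MaxText code. It assumes that each occurrence of `at` is replaced by `sequence`, that every row is widened by `max_num_images * (len(sequence) - 1)` positions, and that unused trailing positions hold a padding ID (`pad_id`, an added parameter).

```python
import numpy as np


def insert_sequence_sketch(tokens, *, at, sequence, max_num_images, pad_id=0):
    """Replace each `at` token with `sequence`, without Python loops over tokens."""
    arr = np.asarray(tokens)
    was_1d = arr.ndim == 1
    arr = np.atleast_2d(arr)
    seq = np.asarray(sequence, dtype=arr.dtype)
    batch, length = arr.shape
    grow = len(seq) - 1  # each match widens the row by this many tokens

    is_at = arr == at
    hits = is_at.astype(np.int64)
    if hits.sum(axis=1).max() > max_num_images:
        raise ValueError("a row contains more `at` tokens than max_num_images")

    # Matches strictly before each position decide how far it shifts right.
    prior = np.cumsum(hits, axis=1) - hits
    start = np.arange(length) + prior * grow

    out = np.full((batch, length + max_num_images * grow), pad_id, dtype=arr.dtype)
    rows = np.broadcast_to(np.arange(batch)[:, None], arr.shape)

    # Copy ordinary tokens to their shifted positions.
    keep = ~is_at
    out[rows[keep], start[keep]] = arr[keep]

    # Write the whole replacement sequence at each match.
    r, s = rows[is_at], start[is_at]
    out[r[:, None], s[:, None] + np.arange(len(seq))] = seq
    return out[0] if was_1d else out


batch_out = insert_sequence_sketch(
    [[1, 9, 2, 0], [9, 3, 9, 4]], at=9, sequence=[7, 8, 8, 7], max_num_images=2
)
print(batch_out.tolist())
# [[1, 7, 8, 8, 7, 2, 0, 0, 0, 0], [7, 8, 8, 7, 3, 7, 8, 8, 7, 4]]
```

Sizing the output from `max_num_images` rather than from the actual match count keeps the result shape static, which is the usual reason a bound like this appears in array code that is meant to be compiled.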

maxtext.multimodal.processor_gemma3.add_extra_tokens_for_images_gemma3(tokens, *, max_num_images=1)

Add the extra image tokens to the text tokens.

If the input contains images, each <start_of_image> token is expanded into the image placeholder tokens.

Example:

```python
input = […, x, <start_of_image>, y, …]
output = [
    …, x, \n\n, <start_of_image>, SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, …, SOFT_TOKEN_PLACEHOLDER,
    SOFT_TOKEN_PLACEHOLDER, <end_of_image>, \n\n, y, …
]
```

The `\n\n` tokens are added to match how the model was trained.

Parameters:
  • tokens (ndarray | list) – The text tokens.

  • max_num_images (int) – The maximum number of images in the batch.

  • num_tokens_per_image – The number of soft tokens per image. (Listed in the docstring; not part of the signature above.)

Returns:

The text tokens with the extra image tokens.
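
The expansion described in the example can be sketched on a plain Python list. All token IDs and the soft-token count below are hypothetical values chosen for readability, and `expand_image_tokens` is an illustrative helper, not the MaxText function; the real implementation operates on arrays and bounds the result with `max_num_images`.

```python
# Hypothetical token IDs and soft-token count, chosen only for readability.
START_OF_IMAGE = 100
END_OF_IMAGE = 101
DOUBLE_NEWLINE = 102
SOFT_TOKEN_PLACEHOLDER = -1
NUM_SOFT_TOKENS = 4


def expand_image_tokens(tokens: list[int]) -> list[int]:
    """Replace each START_OF_IMAGE with the framed block of soft tokens."""
    block = (
        [DOUBLE_NEWLINE, START_OF_IMAGE]
        + [SOFT_TOKEN_PLACEHOLDER] * NUM_SOFT_TOKENS
        + [END_OF_IMAGE, DOUBLE_NEWLINE]
    )
    expanded = []
    for token in tokens:
        if token == START_OF_IMAGE:
            expanded.extend(block)
        else:
            expanded.append(token)
    return expanded


print(expand_image_tokens([5, START_OF_IMAGE, 6]))
# [5, 102, 100, -1, -1, -1, -1, 101, 102, 6]
```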

maxtext.multimodal.processor_gemma3.get_dummy_image_shape_for_init_gemma3(batch_size=1, num_image_per_sequence=1)

Return the shape of the dummy image for Gemma3 model’s initialization.