maxtext.multimodal.processor_qwen3_omni module

Qwen3-Omni-specific preprocessing utilities for multimodal features.

Original implementation from HuggingFace: Qwen/Qwen3-Omni-30B-A3B-Instruct.

class maxtext.multimodal.processor_qwen3_omni.Qwen3OmniPreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None, pixel_grid_thw=None, num_videos=0, video_values=None, video_grid_thw=None, video_second_per_grid=None, num_audios=0, audio_lengths=None)

Bases: PreprocessorOutput

Holds the output of the Qwen3-Omni preprocessor: image, video, and audio fields.

Parameters:

pixel_values (None | ndarray)
pixel_mask (None | ndarray)
aspect_ratios (None | ndarray)
num_images (int)
audio_values (None | ndarray)
audio_mask (None | ndarray)
pixel_grid_thw (None | ndarray)
num_videos (int)
video_values (None | ndarray)
video_grid_thw (None | ndarray)
video_second_per_grid (None | ndarray)
num_audios (int)
audio_lengths (None | ndarray)

Inherited from `mm_utils.PreprocessorOutput`.

num_images: int = 0

pixel_values: None | ndarray = None

pixel_grid_thw: None | ndarray = None

num_videos: int = 0

video_values: None | ndarray = None

video_grid_thw: None | ndarray = None

video_second_per_grid: None | ndarray = None

num_audios: int = 0

audio_values: None | ndarray = None

audio_mask: None | ndarray = None

audio_lengths: None | ndarray = None

maxtext.multimodal.processor_qwen3_omni.smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=1003520)

Rescales the image so that the following conditions are met:

Both dimensions (height and width) are divisible by ‘factor’.
The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].
The aspect ratio of the image is maintained as closely as possible.

Parameters:

height (int)
width (int)
factor (int)
min_pixels (int)
max_pixels (int)
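The three conditions above can be sketched as follows. This is a minimal illustration, not the module's code: the name `smart_resize_sketch` is hypothetical, and the rounding strategy (snap to the nearest multiple of `factor`, then rescale by a common ratio if the pixel budget is violated) is an assumption modeled on the approach commonly used for Qwen-style vision preprocessing.

```python
import math


def smart_resize_sketch(height, width, factor=28, min_pixels=3136, max_pixels=1003520):
    """Return (new_height, new_width); hypothetical stand-in for smart_resize."""
    # Snap each side to the nearest multiple of `factor`, never below one factor.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)

    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down
        # so the pixel count stays within the budget.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor

    return h_bar, w_bar


new_h, new_w = smart_resize_sketch(1080, 1920)
print(new_h % 28, new_w % 28, new_h * new_w <= 1003520)
```

Scaling both sides by the same ratio is what keeps the aspect ratio close to the original; the rounding to multiples of `factor` is the only source of distortion.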

maxtext.multimodal.processor_qwen3_omni.pre_process_qwen3_image(image, config)

Performs a bi-linear resize (with anti-aliasing) and normalizes the image.

Parameters:

image (ndarray | list[ndarray])

maxtext.multimodal.processor_qwen3_omni.calculate_video_frame_range(ele, total_frames, video_fps)

Calculate the start and end frame indices based on the given time range.

Parameters:

ele (dict) – A dictionary containing optional ‘video_start’ and ‘video_end’ keys (in seconds).
total_frames (int) – Total number of frames in the video.
video_fps (float) – Frames per second of the video.

Returns:

A tuple containing (start_frame, end_frame, frame_count).

Return type:

tuple

Raises:

ValueError – If input parameters are invalid or the time range is inconsistent.
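A rough sketch of how a time range in seconds might map to frame indices is shown below. The function name `frame_range_sketch`, the clamping of times to the video duration, and the choice of rounding (ceil for the start, floor for the end) are all assumptions for illustration; the module's actual validation rules may differ.

```python
import math


def frame_range_sketch(ele, total_frames, video_fps):
    """Hypothetical stand-in: returns (start_frame, end_frame, frame_count)."""
    if video_fps <= 0:
        raise ValueError("video_fps must be positive")
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")

    video_start = ele.get("video_start")
    video_end = ele.get("video_end")
    duration = total_frames / video_fps

    # Assumed behavior: clamp times to [0, duration] before converting to frames.
    if video_start is None:
        start_frame = 0
    else:
        start_frame = math.ceil(max(0.0, min(video_start, duration)) * video_fps)

    if video_end is None:
        end_frame = total_frames - 1
    else:
        end_frame = math.floor(max(0.0, min(video_end, duration)) * video_fps)
        end_frame = min(end_frame, total_frames - 1)

    if start_frame > end_frame:
        raise ValueError("video_start must not come after video_end")

    # Both endpoints are inclusive, hence the +1.
    return start_frame, end_frame, end_frame - start_frame + 1


print(frame_range_sketch({"video_start": 1.0, "video_end": 2.0}, 100, 25.0))
```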

maxtext.multimodal.processor_qwen3_omni.smart_nframes(ele, total_frames, video_fps)

Calculate the number of frames for video used for model inputs.

Parameters:

ele (dict) – A dict containing the video configuration. Supports either fps or nframes:
- nframes: the number of frames to extract for model inputs.
- fps: the frame rate at which to extract frames for model inputs.
  - min_frames: the minimum number of frames of the video; only used when fps is provided.
  - max_frames: the maximum number of frames of the video; only used when fps is provided.
total_frames (int) – the original total number of frames of the video.
video_fps (int | float) – the original fps of the video.

Returns:

the number of frames for video used for model inputs.

Return type:

int
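The fps-based path described above amounts to scaling the frame count by the ratio of target to source frame rate and clamping the result. The sketch below illustrates that idea only. The name `smart_nframes_sketch`, the default of 2.0 fps, the fallback bounds, and the rule that `fps` and `nframes` are mutually exclusive are assumptions, and any rounding to a frame multiple that the real function may apply is omitted.

```python
DEFAULT_FPS = 2.0  # hypothetical default, not taken from the module


def smart_nframes_sketch(ele, total_frames, video_fps):
    """Hypothetical stand-in: number of frames to sample from a video."""
    if "fps" in ele and "nframes" in ele:
        raise ValueError("Provide only one of 'fps' or 'nframes'.")

    if "nframes" in ele:
        nframes = int(ele["nframes"])
    else:
        fps = ele.get("fps", DEFAULT_FPS)
        min_frames = ele.get("min_frames", 1)
        max_frames = ele.get("max_frames", total_frames)
        # Duration in seconds times the target sampling rate.
        nframes = total_frames / video_fps * fps
        # Clamp to [min_frames, max_frames] and never exceed the source length.
        nframes = int(min(max(nframes, min_frames), max_frames, total_frames))

    if not 1 <= nframes <= total_frames:
        raise ValueError(f"nframes must be in [1, {total_frames}], got {nframes}")
    return nframes


# A 10-second clip at 30 fps sampled at 2 fps would give 20 frames,
# but max_frames caps the result.
print(smart_nframes_sketch({"fps": 2, "min_frames": 4, "max_frames": 16}, 300, 30))
```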

maxtext.multimodal.processor_qwen3_omni.preprocess_video(video, config)

Preprocess the video for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.pre_process_audio_qwen3_omni(audio_array)

Preprocess audio for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.preprocess_mm_data_qwen3_omni(config)

Placeholder for multimodal data preprocessing.

maxtext.multimodal.processor_qwen3_omni.add_extra_tokens_for_qwen3_omni(tokens, config, processor_output)

Add extra tokens for Qwen3-Omni multimodal sequences.

For audio-in-video mode, interleaves audio and video tokens based on temporal ordering.

Parameters:

tokens – Input token sequence (1D array or list).
image_grid_thw – Image dimensions (num_images, 3) with [temporal, height, width].
video_grid_thw – Video dimensions (num_videos, 3) with [temporal, height, width].
audio_lengths – Pre-computed audio token counts (num_audios,).
spatial_merge_size – Number of patches merged spatially (e.g., 2 for 2x2→1).
use_audio_in_video – If True, interleave audio and video tokens.
second_per_grids – Time interval per temporal grid (num_videos,).
position_id_per_seconds – Temporal granularity (tokens per second).

Returns:

Expanded token sequence with correct number of image/video/audio tokens.
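To make the expansion concrete, here is a small sketch covering images only. It assumes each image placeholder is replaced by `t * h * w / spatial_merge_size**2` copies of the image token, which follows from the grid and merge-size descriptions above. The function name and the token ID `99` are hypothetical, and audio/video interleaving and any boundary tokens are deliberately left out.

```python
import numpy as np


def expand_image_tokens_sketch(tokens, image_grid_thw, image_token_id, spatial_merge_size=2):
    """Hypothetical stand-in: repeat each image placeholder once per merged patch."""
    merge_area = spatial_merge_size**2
    # One token per merged patch: (t * h * w) / merge_area for each image.
    counts = [int(np.prod(grid)) // merge_area for grid in image_grid_thw]

    expanded = []
    image_idx = 0
    for tok in tokens:
        if tok == image_token_id:
            expanded.extend([image_token_id] * counts[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded


# One image with a 1 x 4 x 4 patch grid and 2x2 merging yields 4 image tokens.
print(expand_image_tokens_sketch([1, 99, 2], np.array([[1, 4, 4]]), image_token_id=99))
```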

maxtext.multimodal.processor_qwen3_omni.get_dummy_image_shape_for_init_qwen3_omni(batch_size)

Return the shape of the dummy image used to initialize the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_dummy_audio_shape_for_init_qwen3_omni(config)

Return the shape of the dummy audio used to initialize the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_llm_pos_ids_for_vision(start_idx, vision_idx, spatial_merge_size, t_index, grid_hs, grid_ws)

Computes 3D position IDs (temporal, height, width) for vision tokens.

Creates position embeddings for a grid of vision tokens representing an image or video. For each temporal frame, generates a spatial grid of (height, width) positions.

Parameters:

start_idx (int | Array) – Starting position ID value to add as offset.
vision_idx (int) – Index of the current image/video being processed.
spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 means 2x2 patches → 1 token).
t_index (Array) – Temporal position for each frame. Shape: (num_frames,).
grid_hs (Array) – Height dimensions for all images/videos. Shape: (num_images,).
grid_ws (Array) – Width dimensions for all images/videos. Shape: (num_images,).

Returns:

3D position IDs with shape (3, num_vision_tokens), where:

dim 0: temporal positions
dim 1: height positions
dim 2: width positions

Example

If spatial_merge_size=2, grid_h=4, grid_w=4, num_frames=2:

After merge: 2x2 grid per frame
Total tokens: 2 frames x 2 x 2 = 8 tokens
Output shape: (3, 8)
t_index: [0, 0, 0, 0, 50, 50, 50, 50]
h_index: [0, 0, 1, 1, 0, 0, 1, 1]
w_index: [0, 1, 0, 1, 0, 1, 0, 1]
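The example above can be reproduced with a few NumPy calls. This is a simplified sketch, not the module's implementation: `vision_pos_ids_sketch` is a hypothetical name, it takes a single image's `grid_h` and `grid_w` as scalars rather than indexing into `grid_hs`/`grid_ws` with `vision_idx`, and it assumes the input `t_index` is `[0, 50]` with `start_idx=0`.

```python
import numpy as np


def vision_pos_ids_sketch(start_idx, t_index, grid_h, grid_w, spatial_merge_size):
    """Hypothetical stand-in: (3, num_vision_tokens) position IDs for one image/video."""
    # Grid size after merging spatial patches.
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size
    t_index = np.asarray(t_index)
    num_frames = t_index.shape[0]

    # Each frame's temporal position is shared by all of its spatial tokens.
    t_ids = np.repeat(t_index, llm_h * llm_w)
    # Row index repeats across a row; the row pattern repeats for every frame.
    h_ids = np.tile(np.repeat(np.arange(llm_h), llm_w), num_frames)
    # Column index cycles fastest.
    w_ids = np.tile(np.arange(llm_w), num_frames * llm_h)

    return np.stack([t_ids, h_ids, w_ids]) + start_idx


pos = vision_pos_ids_sketch(0, [0, 50], grid_h=4, grid_w=4, spatial_merge_size=2)
print(pos.shape)  # (3, 8)
print(pos)
```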

maxtext.multimodal.processor_qwen3_omni.get_chunked_index(token_indices, tokens_per_chunk, remove_index)

Splits token index list into chunks based on token value ranges.

Given a list of monotonically increasing token indices, returns a list of (start, end) index tuples representing slices where token values fall within successive ranges of tokens_per_chunk.

Parameters:

token_indices (Array) – Monotonically increasing array of token index values. Shape: (seq_len,).
tokens_per_chunk (int) – Chunk size threshold (e.g., 100 means first chunk has values < 100).
remove_index (int) – Offset to subtract from token_indices before chunking.

Returns:

List of (start_idx, end_idx) tuples, each representing a chunk.

Return type:

list[tuple[int, int]]

Example

token_indices = [5, 10, 52, 105, 150, 250]
tokens_per_chunk = 100
remove_index = 0

Result: [(0, 3), (3, 5), (5, 6)]

Chunk 0: indices 0-3 (values 5, 10, 52 are < 100)
Chunk 1: indices 3-5 (values 105, 150 are >= 100 and < 200)
Chunk 2: indices 5-6 (value 250 is >= 200)
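One way to obtain this chunking is a single pass that closes the current chunk whenever a value crosses the next multiple of `tokens_per_chunk`. The sketch below uses a hypothetical name and plain Python lists, and it reproduces the example above; the module's version may differ in details such as array types.

```python
def chunked_index_sketch(token_indices, tokens_per_chunk, remove_index):
    """Hypothetical stand-in: list of (start, end) slices, one per value range."""
    chunks = []
    start = 0
    current_chunk = 1
    for i, value in enumerate(token_indices):
        # Crossing the upper bound of the current range closes the chunk at i.
        if value - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    # The final chunk always runs to the end of the sequence.
    chunks.append((start, len(token_indices)))
    return chunks


print(chunked_index_sketch([5, 10, 52, 105, 150, 250], 100, 0))
# [(0, 3), (3, 5), (5, 6)]
```

Each tuple is a half-open slice, so `token_indices[start:end]` yields the values belonging to that chunk.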

maxtext.multimodal.processor_qwen3_omni.get_rope_index(input_ids, image_grid_thw=None, video_grid_thw=None, attention_mask=None, use_audio_in_video=False, audio_lengths=None, second_per_grids=None, spatial_merge_size=2, position_id_per_seconds=25)

Calculate 3D RoPE position indices for multimodal sequences.

This function computes position IDs that encode both sequential (text) and spatial-temporal (vision/audio) structure for Qwen3-Omni multimodal inputs.

For pure text sequences:

All 3 dimensions receive the same sequential positions: [0, 1, 2, 3, 4]

For multimodal sequences with vision:

Vision tokens get 3D positions (temporal, height, width)
Text tokens continue sequentially from max(vision_pos) + 1
Example with video (3 temporal patches, 2x2 spatial):
Vision temporal: [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
Vision height: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
Vision width: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
Text positions: [101, 102, 103, 104, 105]

Parameters:

input_ids (ndarray) – Input token IDs. Shape: (batch, seq_len).
image_grid_thw (ndarray | None) – Image dimensions (temporal, height, width). Shape: (num_images, 3).
video_grid_thw (ndarray | None) – Video dimensions (temporal, height, width). Shape: (num_videos, 3).
attention_mask (ndarray | None) – Padding mask (1 = real token, 0 = padding). Shape: (batch, seq_len).
use_audio_in_video (bool) – If True, audio tokens are interleaved with video tokens.
audio_lengths (ndarray | None) – Audio sequence lengths. Shape: (num_audios,).
second_per_grids (ndarray | None) – Time interval per temporal grid (for videos). Shape: (num_videos,).
spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 for 2x2→1).
position_id_per_seconds (int) – Temporal granularity (tokens per second, typically 25).

Returns:

A tuple of:

position_ids: 3D position IDs. Shape: (3, batch, seq_len).
mrope_position_deltas: Position offset for each sequence. Shape: (batch, 1).

Raises:

ValueError – If multimodal tokens are present but grid info is missing.
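For the pure-text case, the documented output shapes can be illustrated with a short NumPy sketch. It covers text-only inputs and nothing else: the function name is hypothetical, and both the treatment of padded positions (set to 1) and the definition of the delta as `max_position + 1 - seq_len` are assumptions rather than confirmed behavior of `get_rope_index`.

```python
import numpy as np


def text_only_rope_index_sketch(input_ids, attention_mask=None):
    """Hypothetical stand-in for the text-only branch of a 3D RoPE index."""
    if attention_mask is None:
        attention_mask = np.ones_like(input_ids)

    # Sequential positions over real tokens: 0, 1, 2, ...
    positions = np.cumsum(attention_mask, axis=-1) - 1
    # Assumed convention: padded slots receive a dummy position of 1.
    positions = np.where(attention_mask == 0, 1, positions)

    # All three axes (temporal, height, width) share the same positions.
    position_ids = np.broadcast_to(positions[None], (3,) + positions.shape).copy()

    # Assumed definition: offset between the next position and the padded length.
    seq_len = input_ids.shape[-1]
    deltas = positions.max(axis=-1, keepdims=True) + 1 - seq_len
    return position_ids, deltas


ids = np.array([[11, 12, 13, 14, 15]])
position_ids, deltas = text_only_rope_index_sketch(ids)
print(position_ids.shape, deltas.shape)  # (3, 1, 5) (1, 1)
print(position_ids[:, 0, :])
```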

maxtext.multimodal.processor_qwen3_omni.reformat_prompt_qwen3_omni(prompt, image_placeholder='<|image|>', num_images=0, video_placeholder='<|video|>', num_videos=0)

Reformat the prompt for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_mm_offsets_qwen3_omni(config, processor_output)

Calculate the token offsets for multimodal tokens in the Qwen3-Omni model.
