maxtext.multimodal.processor_qwen3_omni module
Qwen3-Omni-specific preprocessing utilities for multimodal features.
Original implementation from HuggingFace: Qwen/Qwen3-Omni-30B-A3B-Instruct.
- class maxtext.multimodal.processor_qwen3_omni.Qwen3OmniPreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None, pixel_grid_thw=None, num_videos=0, video_values=None, video_grid_thw=None, video_second_per_grid=None, num_audios=0, audio_lengths=None)
Bases: PreprocessorOutput
Holds the output of the Qwen3-Omni preprocessor (image, video, and audio fields).
- Parameters:
pixel_values (None | ndarray)
pixel_mask (None | ndarray)
aspect_ratios (None | ndarray)
num_images (int)
audio_values (None | ndarray)
audio_mask (None | ndarray)
pixel_grid_thw (None | ndarray)
num_videos (int)
video_values (None | ndarray)
video_grid_thw (None | ndarray)
video_second_per_grid (None | ndarray)
num_audios (int)
audio_lengths (None | ndarray)
- Inherited from `mm_utils.PreprocessorOutput`.
- num_images: int = 0
- pixel_values: None | ndarray = None
- pixel_grid_thw: None | ndarray = None
- num_videos: int = 0
- video_values: None | ndarray = None
- video_grid_thw: None | ndarray = None
- video_second_per_grid: None | ndarray = None
- num_audios: int = 0
- audio_values: None | ndarray = None
- audio_mask: None | ndarray = None
- audio_lengths: None | ndarray = None
- maxtext.multimodal.processor_qwen3_omni.smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=1003520)
Rescales the image so that the following conditions are met:
Both dimensions (height and width) are divisible by ‘factor’.
The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].
The aspect ratio of the image is maintained as closely as possible.
- Parameters:
height (int)
width (int)
factor (int)
min_pixels (int)
max_pixels (int)
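The three conditions above can be illustrated with a minimal sketch. It is not the module's code: the function name `sketch_smart_resize` and the rounding strategy (round to the nearest multiple, then rescale with floor or ceil when the pixel budget is violated) are assumptions chosen for illustration.

```python
import math


def sketch_smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=1003520):
    """Hypothetical sketch; returns (new_height, new_width)."""
    # Snap each side to the nearest multiple of `factor` (at least one factor).
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)
    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
    elif new_h * new_w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor
    return new_h, new_w


print(sketch_smart_resize(1000, 2000))  # (700, 1400)
```

In this sketch a 1000x2000 input keeps its 1:2 aspect ratio, both sides are multiples of 28, and the area (980,000 pixels) fits under the default `max_pixels`.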
- maxtext.multimodal.processor_qwen3_omni.pre_process_qwen3_image(image, config)
Performs a bilinear resize (with anti-aliasing) and normalizes the image.
- Parameters:
image (ndarray | list[ndarray])
- maxtext.multimodal.processor_qwen3_omni.calculate_video_frame_range(ele, total_frames, video_fps)
Calculate the start and end frame indices based on the given time range.
- Parameters:
ele (dict) – A dictionary containing optional ‘video_start’ and ‘video_end’ keys (in seconds).
total_frames (int) – Total number of frames in the video.
video_fps (float) – Frames per second of the video.
- Returns:
A tuple containing (start_frame, end_frame, frame_count).
- Return type:
tuple
- Raises:
ValueError – If input parameters are invalid or the time range is inconsistent.
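A minimal sketch of how a time range in seconds could map to frame indices follows. The name `sketch_frame_range`, the clamping to the video duration, and the inclusive end frame are assumptions for illustration, not the module's confirmed behavior.

```python
import math


def sketch_frame_range(ele, total_frames, video_fps):
    """Hypothetical sketch; returns (start_frame, end_frame, frame_count)."""
    if video_fps <= 0 or total_frames <= 0:
        raise ValueError("video_fps and total_frames must be positive")
    start_s = ele.get("video_start")
    end_s = ele.get("video_end")
    duration = total_frames / video_fps
    # Missing bounds default to the first and last frame of the video.
    start_frame = 0
    end_frame = total_frames - 1
    if start_s is not None:
        # Clamp to [0, duration], then round up to a whole frame.
        start_frame = math.ceil(min(max(start_s, 0.0), duration) * video_fps)
    if end_s is not None:
        # Clamp to [0, duration], round down, and stay inside the video.
        end_frame = math.floor(min(max(end_s, 0.0), duration) * video_fps)
        end_frame = min(end_frame, total_frames - 1)
    if start_frame > end_frame:
        raise ValueError("video_start must not come after video_end")
    # The end frame is treated as inclusive (an assumption of this sketch).
    return start_frame, end_frame, end_frame - start_frame + 1


print(sketch_frame_range({"video_start": 1.0, "video_end": 2.0}, 100, 25.0))  # (25, 50, 26)
```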
- maxtext.multimodal.processor_qwen3_omni.smart_nframes(ele, total_frames, video_fps)
Calculate the number of video frames to use as model input.
- Parameters:
ele (dict) – Video configuration. Supports either fps or nframes:
nframes: the number of frames to extract for model inputs.
fps: the frame rate at which to extract frames for model inputs.
min_frames: the minimum number of frames, used only when fps is provided.
max_frames: the maximum number of frames, used only when fps is provided.
total_frames (int) – the original total number of frames of the video.
video_fps (int | float) – the original fps of the video.
- Returns:
The number of video frames to use as model input.
- Return type:
int
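The interplay of `nframes`, `fps`, `min_frames`, and `max_frames` can be sketched as below. The function name and the four module-level constants are hypothetical values chosen for this example; the source does not state the real defaults or the rounding rule.

```python
import math

# Assumed defaults, chosen only for this sketch.
FRAME_FACTOR = 2          # frame counts are kept as multiples of this value
DEFAULT_FPS = 2.0
DEFAULT_MIN_FRAMES = 4
DEFAULT_MAX_FRAMES = 768


def sketch_smart_nframes(ele, total_frames, video_fps):
    """Hypothetical sketch of choosing how many frames to sample."""
    if "nframes" in ele and "fps" in ele:
        raise ValueError("provide either nframes or fps, not both")
    if "nframes" in ele:
        # Explicit count: round to the nearest multiple of FRAME_FACTOR.
        nframes = round(ele["nframes"] / FRAME_FACTOR) * FRAME_FACTOR
    else:
        fps = ele.get("fps", DEFAULT_FPS)
        min_frames = math.ceil(ele.get("min_frames", DEFAULT_MIN_FRAMES) / FRAME_FACTOR) * FRAME_FACTOR
        max_default = min(DEFAULT_MAX_FRAMES, total_frames)
        max_frames = math.floor(ele.get("max_frames", max_default) / FRAME_FACTOR) * FRAME_FACTOR
        # Duration in seconds times the target sampling rate.
        nframes = total_frames / video_fps * fps
        # Apply the lower bound first, then cap by max_frames and total_frames.
        nframes = min(max(nframes, min_frames), max_frames, total_frames)
        nframes = math.floor(nframes / FRAME_FACTOR) * FRAME_FACTOR
    if not FRAME_FACTOR <= nframes <= total_frames:
        raise ValueError(f"nframes must be in [{FRAME_FACTOR}, {total_frames}], got {nframes}")
    return int(nframes)


print(sketch_smart_nframes({"fps": 2.0}, total_frames=100, video_fps=25.0))  # 8
```

Here a 4-second clip (100 frames at 25 fps) sampled at 2 fps yields 8 frames; `min_frames` and `max_frames` only matter on the `fps` branch, as the parameter description states.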
- maxtext.multimodal.processor_qwen3_omni.preprocess_video(video, config)
Preprocess the video for Qwen3-Omni model.
- maxtext.multimodal.processor_qwen3_omni.pre_process_audio_qwen3_omni(audio_array)
Preprocess audio for Qwen3-Omni model.
- maxtext.multimodal.processor_qwen3_omni.preprocess_mm_data_qwen3_omni(config)
Placeholder for multimodal data preprocessing.
- maxtext.multimodal.processor_qwen3_omni.add_extra_tokens_for_qwen3_omni(tokens, config, processor_output)
Add extra tokens for Qwen3-Omni multimodal sequences.
Expands special tokens (<|image_pad|>, <|video_pad|>, <|audio_pad|>) into the correct number of placeholder tokens based on grid dimensions and merge size.
For audio-in-video mode, interleaves audio and video tokens based on temporal ordering.
- Parameters:
tokens – Input token sequence (1D array or list).
image_grid_thw – Image dimensions (num_images, 3) with [temporal, height, width].
video_grid_thw – Video dimensions (num_videos, 3) with [temporal, height, width].
audio_lengths – Pre-computed audio token counts (num_audios,).
spatial_merge_size – Number of patches merged spatially (e.g., 2 for 2x2→1).
use_audio_in_video – If True, interleave audio and video tokens.
second_per_grids – Time interval per temporal grid (num_videos,).
position_id_per_seconds – Temporal granularity (tokens per second).
- Returns:
Expanded token sequence with correct number of image/video/audio tokens.
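The expansion step for image placeholders can be sketched as follows. This is an illustration only: `IMAGE_PAD_ID` is a made-up token ID, `sketch_expand_image_tokens` is a hypothetical helper, and the count formula (temporal x height x width divided by the square of the merge size) is an assumption consistent with the "2x2→1" merge described above.

```python
IMAGE_PAD_ID = 9001  # hypothetical token id, for illustration only


def sketch_expand_image_tokens(tokens, image_grid_thw, spatial_merge_size=2):
    """Replace each image placeholder with one token per merged patch."""
    expanded = []
    image_idx = 0
    for tok in tokens:
        if tok == IMAGE_PAD_ID:
            t, h, w = image_grid_thw[image_idx]
            # Assumed count: grid cells divided by the spatial merge area.
            n_placeholders = (t * h * w) // (spatial_merge_size ** 2)
            expanded.extend([IMAGE_PAD_ID] * n_placeholders)
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded


# One image with a 1 x 4 x 4 grid becomes 4 placeholder tokens.
print(sketch_expand_image_tokens([1, IMAGE_PAD_ID, 2], [(1, 4, 4)]))
# [1, 9001, 9001, 9001, 9001, 2]
```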
- maxtext.multimodal.processor_qwen3_omni.get_dummy_image_shape_for_init_qwen3_omni(batch_size)
Return the shape of the dummy image for Qwen3-Omni model’s initialization.
- maxtext.multimodal.processor_qwen3_omni.get_dummy_audio_shape_for_init_qwen3_omni(config)
Return the shape of the dummy audio for Qwen3-Omni model’s initialization.
- maxtext.multimodal.processor_qwen3_omni.get_llm_pos_ids_for_vision(start_idx, vision_idx, spatial_merge_size, t_index, grid_hs, grid_ws)
Computes 3D position IDs (temporal, height, width) for vision tokens.
Creates position embeddings for a grid of vision tokens representing an image or video. For each temporal frame, generates a spatial grid of (height, width) positions.
- Parameters:
start_idx (int | Array) – Starting position ID value to add as offset.
vision_idx (int) – Index of the current image/video being processed.
spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 means 2x2 patches → 1 token).
t_index (Array) – Temporal position for each frame. Shape: (num_frames,).
grid_hs (Array) – Height dimensions for all images/videos. Shape: (num_images,).
grid_ws (Array) – Width dimensions for all images/videos. Shape: (num_images,).
- Returns:
3D position IDs with shape (3, num_vision_tokens), where:
dim 0: temporal positions
dim 1: height positions
dim 2: width positions
Example
- If spatial_merge_size=2, grid_h=4, grid_w=4, num_frames=2:
After merge: 2x2 grid per frame
Total tokens: 2 frames x 2 x 2 = 8 tokens
Output shape: (3, 8)
t_index: [0, 0, 0, 0, 50, 50, 50, 50]
h_index: [0, 0, 1, 1, 0, 0, 1, 1]
w_index: [0, 1, 0, 1, 0, 1, 0, 1]
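The example above can be reproduced with a short sketch. It uses plain Python lists rather than arrays to stay dependency-free, and `sketch_vision_pos_ids` is a hypothetical name; the real function may be vectorized differently.

```python
def sketch_vision_pos_ids(start_idx, vision_idx, spatial_merge_size, t_index, grid_hs, grid_ws):
    """Hypothetical sketch; returns [t_ids, h_ids, w_ids], each of length num_vision_tokens."""
    # Grid size after spatial merging (e.g. 4x4 patches -> 2x2 tokens).
    merged_h = grid_hs[vision_idx] // spatial_merge_size
    merged_w = grid_ws[vision_idx] // spatial_merge_size
    t_ids, h_ids, w_ids = [], [], []
    for t in t_index:                  # one pass per temporal frame
        for h in range(merged_h):      # rows of the merged grid
            for w in range(merged_w):  # columns of the merged grid
                # start_idx is added as an offset to every dimension.
                t_ids.append(t + start_idx)
                h_ids.append(h + start_idx)
                w_ids.append(w + start_idx)
    return [t_ids, h_ids, w_ids]


pos = sketch_vision_pos_ids(0, 0, 2, t_index=[0, 50], grid_hs=[4], grid_ws=[4])
for name, row in zip(("t_index", "h_index", "w_index"), pos):
    print(name, row)
```

The width index varies fastest, then height, then time, which gives the row-major ordering shown in the example.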
- maxtext.multimodal.processor_qwen3_omni.get_chunked_index(token_indices, tokens_per_chunk, remove_index)
Splits token index list into chunks based on token value ranges.
Given a list of monotonically increasing token indices, returns a list of (start, end) index tuples representing slices where token values fall within successive ranges of tokens_per_chunk.
- Parameters:
token_indices (Array) – Monotonically increasing array of token index values. Shape: (seq_len,).
tokens_per_chunk (int) – Chunk size threshold (e.g., 100 means first chunk has values < 100).
remove_index (int) – Offset to subtract from token_indices before chunking.
- Returns:
List of (start_idx, end_idx) tuples, each representing a chunk.
- Return type:
list[tuple[int, int]]
Example
token_indices = [5, 10, 52, 105, 150, 250]
tokens_per_chunk = 100
remove_index = 0
- Result: [(0, 3), (3, 5), (5, 6)]
Chunk 0: indices 0-3 (values 5, 10, 52 are < 100)
Chunk 1: indices 3-5 (values 105, 150 are >= 100 and < 200)
Chunk 2: indices 5-6 (value 250 is >= 200)
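A minimal sketch that reproduces this example is shown below. The name `sketch_chunked_index` is hypothetical, and the choice to advance by exactly one chunk each time a boundary is crossed is an assumption of the sketch.

```python
def sketch_chunked_index(token_indices, tokens_per_chunk, remove_index):
    """Hypothetical sketch; returns half-open (start, end) slices into token_indices."""
    chunks = []
    start = 0
    current_chunk = 1
    for i, value in enumerate(token_indices):
        # Close the current chunk when a value reaches the next boundary.
        if value - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start, i))
            start = i
            current_chunk += 1
    # The final chunk runs to the end of the sequence.
    chunks.append((start, len(token_indices)))
    return chunks


print(sketch_chunked_index([5, 10, 52, 105, 150, 250], tokens_per_chunk=100, remove_index=0))
# [(0, 3), (3, 5), (5, 6)]
```

Each tuple is a half-open slice, so `(0, 3)` covers positions 0, 1, and 2 of `token_indices`.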
- maxtext.multimodal.processor_qwen3_omni.get_rope_index(input_ids, image_grid_thw=None, video_grid_thw=None, attention_mask=None, use_audio_in_video=False, audio_lengths=None, second_per_grids=None, spatial_merge_size=2, position_id_per_seconds=25)
Calculate 3D RoPE position indices for multimodal sequences.
This function computes position IDs that encode both sequential (text) and spatial-temporal (vision/audio) structure for Qwen3-Omni multimodal inputs.
- For pure text sequences:
All 3 dimensions receive the same sequential positions: [0, 1, 2, 3, 4]
- For multimodal sequences with vision:
Vision tokens get 3D positions (temporal, height, width)
Text tokens continue sequentially from max(vision_pos) + 1
- Example with video (3 temporal patches, 2x2 spatial):
Vision temporal: [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
Vision height: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
Vision width: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
Text positions: [101, 102, 103, 104, 105]
- Parameters:
input_ids (ndarray) – Input token IDs. Shape: (batch, seq_len).
image_grid_thw (ndarray | None) – Image dimensions (temporal, height, width). Shape: (num_images, 3).
video_grid_thw (ndarray | None) – Video dimensions (temporal, height, width). Shape: (num_videos, 3).
attention_mask (ndarray | None) – Padding mask (1 = real token, 0 = padding). Shape: (batch, seq_len).
use_audio_in_video (bool) – If True, audio tokens are interleaved with video tokens.
audio_lengths (ndarray | None) – Audio sequence lengths. Shape: (num_audios,).
second_per_grids (ndarray | None) – Time interval per temporal grid (for videos). Shape: (num_videos,).
spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 for 2x2→1).
position_id_per_seconds (int) – Temporal granularity (tokens per second, typically 25).
- Returns:
A tuple of:
position_ids: 3D position IDs. Shape: (3, batch, seq_len).
mrope_position_deltas: Position offset for each sequence. Shape: (batch, 1).
- Raises:
ValueError – If multimodal tokens are present but grid info is missing.
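The video example in the description of get_rope_index (3 temporal patches, 2x2 spatial, followed by 5 text tokens) can be rebuilt with a small sketch. The helper name is hypothetical, it handles a single unbatched sequence, and `second_per_grid=2.0` is an assumption picked so that the temporal step equals 50 at 25 positions per second, as in the example.

```python
def sketch_video_then_text_positions(num_frames, merged_h, merged_w,
                                     second_per_grid, position_id_per_seconds,
                                     num_text_tokens):
    """Hypothetical sketch; returns [temporal, height, width] position lists."""
    temporal, height, width = [], [], []
    for frame in range(num_frames):
        # Temporal position grows with elapsed time, not with the token count.
        t = int(frame * second_per_grid * position_id_per_seconds)
        for h in range(merged_h):
            for w in range(merged_w):
                temporal.append(t)
                height.append(h)
                width.append(w)
    # Text resumes after the largest vision position in any dimension.
    next_pos = max(temporal + height + width) + 1
    text = list(range(next_pos, next_pos + num_text_tokens))
    # Text tokens share the same position in all three dimensions.
    return [temporal + text, height + text, width + text]


ids = sketch_video_then_text_positions(3, 2, 2, 2.0, 25, 5)
print(ids[0])  # [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100, 101, 102, 103, 104, 105]
```

Because the text positions start at 101 while only 12 vision tokens precede them, the positions run ahead of the token count; the per-sequence offset returned as mrope_position_deltas presumably captures this kind of gap.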