maxtext.multimodal.processor_qwen3_omni module

Qwen3-Omni-specific preprocessing utilities for multimodal features.

Original implementation from HuggingFace: Qwen/Qwen3-Omni-30B-A3B-Instruct.

class maxtext.multimodal.processor_qwen3_omni.Qwen3OmniPreprocessorOutput(pixel_values=None, pixel_mask=None, aspect_ratios=None, num_images=0, audio_values=None, audio_mask=None, pixel_grid_thw=None, num_videos=0, video_values=None, video_grid_thw=None, video_second_per_grid=None, num_audios=0, audio_lengths=None)

Bases: PreprocessorOutput

Holds the output of the Qwen3-Omni multimodal preprocessor (image, video, and audio features).

Parameters:
  • pixel_values (None | ndarray)

  • pixel_mask (None | ndarray)

  • aspect_ratios (None | ndarray)

  • num_images (int)

  • audio_values (None | ndarray)

  • audio_mask (None | ndarray)

  • pixel_grid_thw (None | ndarray)

  • num_videos (int)

  • video_values (None | ndarray)

  • video_grid_thw (None | ndarray)

  • video_second_per_grid (None | ndarray)

  • num_audios (int)

  • audio_lengths (None | ndarray)

Inherited from `mm_utils.PreprocessorOutput`.
num_images: int = 0
pixel_values: None | ndarray = None
pixel_grid_thw: None | ndarray = None
num_videos: int = 0
video_values: None | ndarray = None
video_grid_thw: None | ndarray = None
video_second_per_grid: None | ndarray = None
num_audios: int = 0
audio_values: None | ndarray = None
audio_mask: None | ndarray = None
audio_lengths: None | ndarray = None
maxtext.multimodal.processor_qwen3_omni.smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=1003520)

Rescales the image so that the following conditions are met:

  1. Both dimensions (height and width) are divisible by ‘factor’.

  2. The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].

  3. The aspect ratio of the image is maintained as closely as possible.

Parameters:
  • height (int)

  • width (int)

  • factor (int)

  • min_pixels (int)

  • max_pixels (int)
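
The three conditions above can be sketched as follows. This is an illustrative sketch only, modeled on the commonly used Qwen-VL resizing approach; the name `smart_resize_sketch` is hypothetical and the module's actual implementation may differ (for example, in rounding or in how extreme aspect ratios are handled).

```python
import math


def smart_resize_sketch(height, width, factor=28, min_pixels=3136, max_pixels=1003520):
    """Hypothetical sketch: pick (height, width) that are multiples of `factor`
    with a pixel count inside [min_pixels, max_pixels]."""
    # Snap each side to the nearest multiple of `factor` (at least one factor).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio and round down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio and round up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


new_h, new_w = smart_resize_sketch(1080, 1920)
print(new_h, new_w)  # both are multiples of 28; the product is at most 1003520
```

Scaling both sides by the same ratio before rounding is what keeps the aspect ratio approximately unchanged.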

maxtext.multimodal.processor_qwen3_omni.pre_process_qwen3_image(image, config)

Performs a bilinear resize (with anti-aliasing) and normalizes the image.

Parameters:

image (ndarray | list[ndarray])

maxtext.multimodal.processor_qwen3_omni.calculate_video_frame_range(ele, total_frames, video_fps)

Calculate the start and end frame indices based on the given time range.

Parameters:
  • ele (dict) – A dictionary containing optional ‘video_start’ and ‘video_end’ keys (in seconds).

  • total_frames (int) – Total number of frames in the video.

  • video_fps (float) – Frames per second of the video.

Returns:

A tuple containing (start_frame, end_frame, frame_count).

Return type:

tuple

Raises:

ValueError – If input parameters are invalid or the time range is inconsistent.
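
A minimal sketch of the time-to-frame conversion described above. The name `frame_range_sketch` is hypothetical, and the details are assumptions rather than the module's confirmed behavior: the end frame is treated as inclusive, the start time is rounded up, the end time is rounded down, and both are clamped to the video duration.

```python
import math


def frame_range_sketch(ele, total_frames, video_fps):
    """Hypothetical sketch: map optional 'video_start'/'video_end' (seconds)
    to (start_frame, end_frame, frame_count), with an inclusive end frame."""
    if video_fps <= 0:
        raise ValueError("video_fps must be positive")
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")

    video_start = ele.get("video_start")
    video_end = ele.get("video_end")
    if video_start is None and video_end is None:
        # No time range requested: use the whole video.
        return 0, total_frames - 1, total_frames

    max_duration = total_frames / video_fps
    start_frame = 0
    end_frame = total_frames - 1
    if video_start is not None:
        clamped = max(0.0, min(video_start, max_duration))
        start_frame = math.ceil(clamped * video_fps)
    if video_end is not None:
        clamped = max(0.0, min(video_end, max_duration))
        end_frame = min(math.floor(clamped * video_fps), total_frames - 1)

    if start_frame >= end_frame:
        raise ValueError("inconsistent time range")
    return start_frame, end_frame, end_frame - start_frame + 1


# Seconds 1.0 to 3.0 of a 100-frame clip recorded at 25 fps.
print(frame_range_sketch({"video_start": 1.0, "video_end": 3.0}, 100, 25.0))
# (25, 75, 51)
```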

maxtext.multimodal.processor_qwen3_omni.smart_nframes(ele, total_frames, video_fps)

Calculate the number of frames for video used for model inputs.

Parameters:
  • ele (dict) –

    A dict containing the video configuration. Supports either fps or nframes:

    • nframes: the number of frames to extract for model inputs.

    • fps: the frame rate at which to extract frames for model inputs.
      • min_frames: the minimum number of frames to extract; only used when fps is provided.

      • max_frames: the maximum number of frames to extract; only used when fps is provided.

  • total_frames (int) – the original total number of frames of the video.

  • video_fps (int | float) – the original fps of the video.

Returns:

the number of frames for video used for model inputs.

Return type:

int
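
The fps-based path can be sketched as below. The name `nframes_sketch` and the default values (`default_fps`, `default_min`, `default_max`) are assumptions for illustration; this page does not state the module's defaults, and the real function may additionally round the result to a multiple of some frame factor.

```python
def nframes_sketch(ele, total_frames, video_fps,
                   default_fps=2.0, default_min=4, default_max=768):
    """Hypothetical sketch: choose how many frames to sample from a video.
    The default_* values are illustrative assumptions, not documented values."""
    if "fps" in ele and "nframes" in ele:
        raise ValueError("provide either 'fps' or 'nframes', not both")

    if "nframes" in ele:
        nframes = int(ele["nframes"])
    else:
        fps = ele.get("fps", default_fps)
        min_frames = ele.get("min_frames", default_min)
        max_frames = ele.get("max_frames", default_max)
        # Video duration (seconds) times the target sampling rate.
        nframes = total_frames / video_fps * fps
        nframes = int(min(max(nframes, min_frames), max_frames))

    # Never request more frames than the video contains.
    nframes = min(nframes, total_frames)
    if nframes <= 0:
        raise ValueError("nframes must be positive")
    return nframes


# A 10-second clip (300 frames at 30 fps) sampled at 2 fps.
print(nframes_sketch({"fps": 2}, total_frames=300, video_fps=30))  # 20
```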

maxtext.multimodal.processor_qwen3_omni.preprocess_video(video, config)

Preprocess the video for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.pre_process_audio_qwen3_omni(audio_array)

Preprocess audio for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.preprocess_mm_data_qwen3_omni(config)

Placeholder for multimodal data preprocessing.

maxtext.multimodal.processor_qwen3_omni.add_extra_tokens_for_qwen3_omni(tokens, config, processor_output)

Add extra tokens for Qwen3-Omni multimodal sequences.

Expands special tokens (<|image_pad|>, <|video_pad|>, <|audio_pad|>) into the correct number of placeholder tokens based on grid dimensions and merge size.

For audio-in-video mode, interleaves audio and video tokens based on temporal ordering.

Parameters:
  • tokens – Input token sequence (1D array or list).

  • image_grid_thw – Image dimensions (num_images, 3) with [temporal, height, width].

  • video_grid_thw – Video dimensions (num_videos, 3) with [temporal, height, width].

  • audio_lengths – Pre-computed audio token counts (num_audios,).

  • spatial_merge_size – Number of patches merged spatially (e.g., 2 for 2x2→1).

  • use_audio_in_video – If True, interleave audio and video tokens.

  • second_per_grids – Time interval per temporal grid (num_videos,).

  • position_id_per_seconds – Temporal granularity (tokens per second).

Returns:

Expanded token sequence with correct number of image/video/audio tokens.
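
To illustrate the expansion for images only, here is a minimal sketch. It assumes each pad token in the input stands for one image and that an image contributes temporal x height x width / spatial_merge_size² placeholder tokens. The helper name `expand_pad_tokens` and the token ID 99 are hypothetical, and audio-in-video interleaving is not covered.

```python
def expand_pad_tokens(tokens, pad_id, grid_thw, spatial_merge_size=2):
    """Hypothetical sketch: replace each single pad token with the number of
    placeholder tokens implied by the matching (t, h, w) grid entry."""
    expanded = []
    media_idx = 0
    for tok in tokens:
        if tok == pad_id:
            t, h, w = grid_thw[media_idx]
            # Merging s x s patches into one token divides the count by s**2.
            num_placeholders = (t * h * w) // (spatial_merge_size ** 2)
            expanded.extend([pad_id] * num_placeholders)
            media_idx += 1
        else:
            expanded.append(tok)
    return expanded


# One image with a 1 x 4 x 4 patch grid and 2x2 merging -> 4 placeholders.
print(expand_pad_tokens([1, 99, 2], pad_id=99, grid_thw=[(1, 4, 4)]))
# [1, 99, 99, 99, 99, 2]
```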

maxtext.multimodal.processor_qwen3_omni.get_dummy_image_shape_for_init_qwen3_omni(batch_size)

Return the shape of the dummy image used to initialize the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_dummy_audio_shape_for_init_qwen3_omni(config)

Return the shape of the dummy audio used to initialize the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_llm_pos_ids_for_vision(start_idx, vision_idx, spatial_merge_size, t_index, grid_hs, grid_ws)

Computes 3D position IDs (temporal, height, width) for vision tokens.

Creates position IDs for a grid of vision tokens representing an image or video. For each temporal frame, generates a spatial grid of (height, width) positions.

Parameters:
  • start_idx (int | Array) – Starting position ID value to add as offset.

  • vision_idx (int) – Index of the current image/video being processed.

  • spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 means 2x2 patches → 1 token).

  • t_index (Array) – Temporal position for each frame. Shape: (num_frames,).

  • grid_hs (Array) – Height dimensions for all images/videos. Shape: (num_images,).

  • grid_ws (Array) – Width dimensions for all images/videos. Shape: (num_images,).

Returns:

3D position IDs with shape (3, num_vision_tokens), where:

  • dim 0: temporal positions

  • dim 1: height positions

  • dim 2: width positions

Example

If spatial_merge_size=2, grid_h=4, grid_w=4, num_frames=2:
  • After merge: 2x2 grid per frame

  • Total tokens: 2 frames x 2 x 2 = 8 tokens

  • Output shape: (3, 8)

  • t_index: [0, 0, 0, 0, 50, 50, 50, 50]

  • h_index: [0, 0, 1, 1, 0, 0, 1, 1]

  • w_index: [0, 1, 0, 1, 0, 1, 0, 1]
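
The example above can be reproduced with a small pure-Python sketch. The name `vision_pos_ids_sketch` is hypothetical; unlike the documented function, it takes the grid size of a single image or video directly instead of indexing into grid_hs and grid_ws with vision_idx, and it returns nested lists rather than an array.

```python
def vision_pos_ids_sketch(start_idx, t_index, grid_h, grid_w, spatial_merge_size):
    """Hypothetical sketch: build (temporal, height, width) position IDs for
    one image/video, returned as three lists of equal length."""
    # Spatial grid size after merging s x s patches into one token.
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size

    t_ids, h_ids, w_ids = [], [], []
    for t in t_index:               # one entry per temporal frame
        for h in range(llm_h):      # rows of the merged grid
            for w in range(llm_w):  # columns of the merged grid
                t_ids.append(t + start_idx)
                h_ids.append(h + start_idx)
                w_ids.append(w + start_idx)
    return [t_ids, h_ids, w_ids]


pos = vision_pos_ids_sketch(0, t_index=[0, 50], grid_h=4, grid_w=4, spatial_merge_size=2)
print(pos[0])  # [0, 0, 0, 0, 50, 50, 50, 50]
print(pos[1])  # [0, 0, 1, 1, 0, 0, 1, 1]
print(pos[2])  # [0, 1, 0, 1, 0, 1, 0, 1]
```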

maxtext.multimodal.processor_qwen3_omni.get_chunked_index(token_indices, tokens_per_chunk, remove_index)

Splits token index list into chunks based on token value ranges.

Given a list of monotonically increasing token indices, returns a list of (start, end) index tuples representing slices where token values fall within successive ranges of tokens_per_chunk.

Parameters:
  • token_indices (Array) – Monotonically increasing array of token index values. Shape: (seq_len,).

  • tokens_per_chunk (int) – Chunk size threshold (e.g., 100 means first chunk has values < 100).

  • remove_index (int) – Offset to subtract from token_indices before chunking.

Returns:

List of (start_idx, end_idx) tuples, each representing a chunk.

Return type:

list[tuple[int, int]]

Example

token_indices = [5, 10, 52, 105, 150, 250]
tokens_per_chunk = 100
remove_index = 0

Result: [(0, 3), (3, 5), (5, 6)]
  • Chunk 0: indices 0-3 (values 5, 10, 52 are < 100)

  • Chunk 1: indices 3-5 (values 105, 150 are >= 100 and < 200)

  • Chunk 2: indices 5-6 (value 250 is >= 200)
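
A minimal sketch that reproduces the example above with plain Python lists. The name `chunked_index_sketch` is hypothetical, and the loop is one straightforward way to implement the described behavior; the module's version operates on arrays and may differ in detail.

```python
def chunked_index_sketch(token_indices, tokens_per_chunk, remove_index):
    """Hypothetical sketch: split monotonically increasing values into
    half-open (start, end) slices, one per range of `tokens_per_chunk`."""
    chunks = []
    start_idx = 0
    current_chunk = 1
    for i, value in enumerate(token_indices):
        # A value at or beyond the current threshold closes the running chunk.
        if value - remove_index >= current_chunk * tokens_per_chunk:
            chunks.append((start_idx, i))
            start_idx = i
            current_chunk += 1
    # The final chunk runs to the end of the sequence.
    chunks.append((start_idx, len(token_indices)))
    return chunks


print(chunked_index_sketch([5, 10, 52, 105, 150, 250], 100, 0))
# [(0, 3), (3, 5), (5, 6)]
```

Each tuple is a half-open slice, so `token_indices[start:end]` yields the values in that chunk.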

maxtext.multimodal.processor_qwen3_omni.get_rope_index(input_ids, image_grid_thw=None, video_grid_thw=None, attention_mask=None, use_audio_in_video=False, audio_lengths=None, second_per_grids=None, spatial_merge_size=2, position_id_per_seconds=25)

Calculate 3D RoPE position indices for multimodal sequences.

This function computes position IDs that encode both sequential (text) and spatial-temporal (vision/audio) structure for Qwen3-Omni multimodal inputs.

For pure text sequences:
  • All 3 dimensions receive the same sequential positions: [0, 1, 2, 3, 4]

For multimodal sequences with vision:
  • Vision tokens get 3D positions (temporal, height, width)

  • Text tokens continue sequentially from max(vision_pos) + 1

  • Example with video (3 temporal patches, 2x2 spatial):

    Vision temporal: [0, 0, 0, 0, 50, 50, 50, 50, 100, 100, 100, 100]
    Vision height: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
    Vision width: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
    Text positions: [101, 102, 103, 104, 105]

Parameters:
  • input_ids (ndarray) – Input token IDs. Shape: (batch, seq_len).

  • image_grid_thw (ndarray | None) – Image dimensions (temporal, height, width). Shape: (num_images, 3).

  • video_grid_thw (ndarray | None) – Video dimensions (temporal, height, width). Shape: (num_videos, 3).

  • attention_mask (ndarray | None) – Padding mask (1 = real token, 0 = padding). Shape: (batch, seq_len).

  • use_audio_in_video (bool) – If True, audio tokens are interleaved with video tokens.

  • audio_lengths (ndarray | None) – Audio sequence lengths. Shape: (num_audios,).

  • second_per_grids (ndarray | None) – Time interval per temporal grid (for videos). Shape: (num_videos,).

  • spatial_merge_size (int) – Number of patches merged spatially (e.g., 2 for 2x2→1).

  • position_id_per_seconds (int) – Temporal granularity (tokens per second, typically 25).

Returns:

A tuple of:

  • position_ids: 3D position IDs. Shape: (3, batch, seq_len).

  • mrope_position_deltas: Position offset for each sequence. Shape: (batch, 1).

Raises:

ValueError – If multimodal tokens are present but grid info is missing.
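
The pure-text and video examples in the description can be reproduced with the simplified sketch below. It handles a single unbatched sequence consisting of one optional video followed by text, and it assumes the temporal position of frame t is t x second_per_grid x position_id_per_seconds (a step of 50 corresponds to second_per_grid=2 at 25 positions per second). The name `mrope_sketch` and this layout are hypothetical; the real function also handles batching, padding masks, images, and audio.

```python
def mrope_sketch(grid_t, grid_h, grid_w, num_text,
                 second_per_grid=2.0, position_id_per_seconds=25,
                 spatial_merge_size=2):
    """Hypothetical sketch: 3D position IDs for [video tokens][text tokens].
    Returns three lists: temporal, height, and width positions."""
    llm_h = grid_h // spatial_merge_size
    llm_w = grid_w // spatial_merge_size

    t_pos, h_pos, w_pos = [], [], []
    for t in range(grid_t):
        # Assumed temporal spacing: seconds per grid step times IDs per second.
        t_val = int(t * second_per_grid * position_id_per_seconds)
        for h in range(llm_h):
            for w in range(llm_w):
                t_pos.append(t_val)
                h_pos.append(h)
                w_pos.append(w)

    # Text continues from the largest vision position + 1 (or 0 if no vision).
    if t_pos:
        next_pos = max(t_pos + h_pos + w_pos) + 1
    else:
        next_pos = 0
    text = list(range(next_pos, next_pos + num_text))

    # Text tokens share the same position in all three dimensions.
    return [t_pos + text, h_pos + text, w_pos + text]


video_then_text = mrope_sketch(grid_t=3, grid_h=4, grid_w=4, num_text=5)
print(video_then_text[0])  # temporal: [0]*4 + [50]*4 + [100]*4 + [101..105]

text_only = mrope_sketch(grid_t=0, grid_h=0, grid_w=0, num_text=5)
print(text_only[0])  # [0, 1, 2, 3, 4]
```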

maxtext.multimodal.processor_qwen3_omni.reformat_prompt_qwen3_omni(prompt, image_placeholder='<|image|>', num_images=0, video_placeholder='<|video|>', num_videos=0)

Reformat the prompt for the Qwen3-Omni model.

maxtext.multimodal.processor_qwen3_omni.get_mm_offsets_qwen3_omni(config, processor_output)

Calculate the token offsets for multimodal tokens in the Qwen3-Omni model.