maxtext.input_pipeline.grain_tokenizer module

Tokenization ops used by Grain.

class maxtext.input_pipeline.grain_tokenizer.TokenizerTransformBase(feature_names, sequence_length, tokenizer)

Bases: object

Base class for tokenizer transforms with common functionality.

Parameters:

feature_names (str | Sequence[str])

sequence_length (int | Sequence[int])

tokenizer (SentencePieceTokenizer | HFTokenizer | TikTokenTokenizer)
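
The union types on feature_names and sequence_length suggest that a single name or a single length can be given in place of a sequence. The sketch below shows one plausible way to broadcast such scalars into parallel lists. The helper normalize_features is hypothetical and is not part of MaxText; the actual base class may handle this differently.

```python
from __future__ import annotations

from collections.abc import Sequence


def normalize_features(
    feature_names: str | Sequence[str],
    sequence_length: int | Sequence[int],
) -> tuple[list[str], list[int]]:
    """Hypothetical helper: turn scalar arguments into parallel lists."""
    # A bare string is treated as one feature name, not as a sequence of characters.
    names = [feature_names] if isinstance(feature_names, str) else list(feature_names)
    # A single int is assumed to apply to every feature.
    if isinstance(sequence_length, int):
        lengths = [sequence_length] * len(names)
    else:
        lengths = list(sequence_length)
    if len(names) != len(lengths):
        raise ValueError("feature_names and sequence_length must have the same length")
    return names, lengths


single = normalize_features("text", 8)
paired = normalize_features(["inputs", "targets"], 4)
```

Checking isinstance(feature_names, str) first matters because a Python string is itself a sequence, so list("text") would otherwise split the name into characters.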

class maxtext.input_pipeline.grain_tokenizer.TokenizeAndTrim(*args, **kwargs)

Bases: TokenizerTransformBase, MapTransform

Tokenize and trim features to sequence length.

map(element)

Tokenize and trim the configured features of a single element.

Parameters:

element (dict[str, Any])

Return type:

dict[str, Any]
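
As a rough illustration of the tokenize-and-trim behavior described above, the sketch below uses a toy stand-in tokenizer instead of a real SentencePiece, Hugging Face, or TikToken tokenizer. The names toy_encode and tokenize_and_trim are hypothetical, and details such as the output array type are not taken from the MaxText source.

```python
from typing import Any


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one ID per whitespace-separated word."""
    return [len(word) for word in text.split()]


def tokenize_and_trim(
    element: dict[str, Any],
    feature_names: list[str],
    max_lengths: list[int],
) -> dict[str, Any]:
    """Tokenize each named feature and keep at most max_len tokens."""
    out = dict(element)  # copy so the input element is left untouched
    for name, max_len in zip(feature_names, max_lengths):
        out[name] = toy_encode(element[name])[:max_len]
    return out


example = {"text": "the quick brown fox jumps", "id": 7}
trimmed = tokenize_and_trim(example, ["text"], [3])
```

Features not listed in feature_names (here, "id") pass through unchanged, and each input element yields exactly one output element, which is what distinguishes a map-style transform from a flat-map one.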

class maxtext.input_pipeline.grain_tokenizer.TokenizeAndChunk(*args, **kwargs)

Bases: TokenizerTransformBase, FlatMapTransform

Tokenize and chunk features into multiple examples of sequence length.

Parameters:

max_fan_out (int, default 2048)

flat_map(element)

Tokenize and chunk text into multiple examples of sequence length.

Parameters:

element (dict[str, Any])

Return type:

list[dict[str, Any]]
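
The chunking behavior can be sketched in the same spirit. The example below again uses a toy tokenizer and a hypothetical function tokenize_and_chunk. It assumes a single feature, treats max_fan_out as an upper bound on the number of chunks produced per input element, and raises when that bound is exceeded; how the real transform enforces the bound is not stated in this reference.

```python
from typing import Any


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one ID per whitespace-separated word."""
    return [len(word) for word in text.split()]


def tokenize_and_chunk(
    element: dict[str, Any],
    feature_name: str,
    sequence_length: int,
    max_fan_out: int = 2048,
) -> list[dict[str, Any]]:
    """Tokenize one feature and split the tokens into fixed-size chunks."""
    token_ids = toy_encode(element[feature_name])
    # Step through the tokens in strides of sequence_length; the last chunk may be shorter.
    chunks = [
        {feature_name: token_ids[start:start + sequence_length]}
        for start in range(0, len(token_ids), sequence_length)
    ]
    # Assumption: exceeding the fan-out bound is treated as an error in this sketch.
    if len(chunks) > max_fan_out:
        raise ValueError(f"{len(chunks)} chunks exceeds max_fan_out={max_fan_out}")
    return chunks


chunks = tokenize_and_chunk({"text": "a bb ccc dddd eeeee"}, "text", sequence_length=2)
empty = tokenize_and_chunk({"text": ""}, "text", sequence_length=2)
```

Unlike trimming, chunking keeps every token: a long document becomes several training examples, and an empty input produces no examples at all.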