maxtext.input_pipeline.grain_tokenizer module
Tokenize Op used by Grain
- class maxtext.input_pipeline.grain_tokenizer.TokenizerTransformBase(feature_names, sequence_length, tokenizer)
Bases: object
Base class for tokenizer transforms with common functionality.
- Parameters:
feature_names (str | Sequence[str])
sequence_length (int | Sequence[int])
tokenizer (SentencePieceTokenizer | HFTokenizer | TikTokenTokenizer)
- feature_names: str | Sequence[str]
- sequence_length: int | Sequence[int]
- tokenizer: SentencePieceTokenizer | HFTokenizer | TikTokenTokenizer
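The union types above suggest that either a single feature name and length or one of each per feature may be supplied. The following is a minimal sketch of how such arguments might be normalized. It assumes a scalar length is broadcast to every feature; the helper `normalize_features` is hypothetical and is not part of MaxText.

```python
from __future__ import annotations

from collections.abc import Sequence


def normalize_features(
    feature_names: str | Sequence[str],
    sequence_length: int | Sequence[int],
) -> dict[str, int]:
    """Pair each feature name with a maximum length (hypothetical helper)."""
    # A bare string is one feature, not a sequence of characters.
    names = [feature_names] if isinstance(feature_names, str) else list(feature_names)
    if isinstance(sequence_length, int):
        # Assumption: a single scalar length applies to every feature.
        lengths = [sequence_length] * len(names)
    else:
        lengths = list(sequence_length)
    if len(names) != len(lengths):
        raise ValueError("feature_names and sequence_length must have the same length")
    return dict(zip(names, lengths))


print(normalize_features("text", 8))
print(normalize_features(["inputs", "targets"], [4, 6]))
```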
- class maxtext.input_pipeline.grain_tokenizer.TokenizeAndTrim(*args, **kwargs)
Bases: TokenizerTransformBase, MapTransform
Tokenize and trim features to sequence length.
- Parameters:
feature_names (str | Sequence[str])
sequence_length (int | Sequence[int])
tokenizer (SentencePieceTokenizer | HFTokenizer | TikTokenTokenizer)
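To illustrate what "tokenize and trim" means for a map-style transform (one example in, one example out), here is a self-contained sketch. It does not use MaxText or Grain: `ToyTokenizer` and `tokenize_and_trim` are hypothetical stand-ins, and the real class may differ in details such as return types or handling of multiple lengths.

```python
from __future__ import annotations


class ToyTokenizer:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        # Ids are assigned in order of first appearance, starting at 1.
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]


def tokenize_and_trim(example, feature_names, sequence_length, tokenizer):
    """Return a copy of `example` with each named feature tokenized and truncated."""
    out = dict(example)
    for name in feature_names:
        # Keep at most `sequence_length` tokens; shorter inputs are left as-is.
        out[name] = tokenizer.encode(example[name])[:sequence_length]
    return out


result = tokenize_and_trim({"text": "a b a c d"}, ["text"], 3, ToyTokenizer())
print(result)
```

Tokens beyond the sequence length are simply discarded in this sketch, which is the key difference from the chunking transform.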
- class maxtext.input_pipeline.grain_tokenizer.TokenizeAndChunk(*args, **kwargs)
Bases: TokenizerTransformBase, FlatMapTransform
Tokenize and chunk features into multiple examples of sequence length.
- Parameters:
feature_names (str | Sequence[str])
sequence_length (int | Sequence[int])
tokenizer (SentencePieceTokenizer | HFTokenizer | TikTokenTokenizer)
max_fan_out (int)
- max_fan_out: int = 2048
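The chunking behavior can be sketched in plain Python as follows. This is an illustration only: `chunk_tokens` is a hypothetical helper, and raising an error when a document would produce more than `max_fan_out` chunks is an assumption about how the limit is enforced, not confirmed behavior of this class.

```python
from __future__ import annotations


def chunk_tokens(
    token_ids: list[int],
    sequence_length: int,
    max_fan_out: int = 2048,
) -> list[list[int]]:
    """Split a token list into consecutive chunks of at most `sequence_length`."""
    chunks = [
        token_ids[start : start + sequence_length]
        for start in range(0, len(token_ids), sequence_length)
    ]
    # Assumption: exceeding the fan-out limit is treated as an error here.
    if len(chunks) > max_fan_out:
        raise ValueError(
            f"input produced {len(chunks)} chunks, more than max_fan_out={max_fan_out}"
        )
    return chunks


print(chunk_tokens(list(range(7)), sequence_length=3))
```

Unlike trimming, no tokens are dropped: a long input yields several output examples, and the last chunk may be shorter than the sequence length.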