maxtext.input_pipeline.tokenizer module#

Provides op for tokenizing a dataset.

class maxtext.input_pipeline.tokenizer.TikTokenTokenizer(model_path, add_bos, add_eos)[source]#

Bases: object

Tokenizing and encoding/decoding text using the Tiktoken tokenizer.

Parameters:
  • model_path (str)

  • add_bos (bool)

  • add_eos (bool)

num_reserved_special_tokens = 256#
pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"#
special_tokens: dict[str, int]#
encode(s, *, allowed_special=(), disallowed_special=())[source]#

Encodes a string into a list of token IDs.

Parameters:
  • s (str) – The input string to be encoded.

  • bos (bool) – Whether to prepend the beginning-of-sequence token.

  • eos (bool) – Whether to append the end-of-sequence token.

  • allowed_tokens (“all”|set[str]) – allowed special tokens in string

  • disallowed_tokens (“all”|set[str]) – special tokens that raise an error when in string

  • allowed_special (Literal['all'] | ~typing.Collection[str])

  • disallowed_special (Literal['all'] | ~typing.Collection[str])

Returns:

A list of token IDs.

Return type:

list[int]

By default, setting disallowed_special=() encodes a string by ignoring special tokens. Specifically: - Setting disallowed_special to () will cause all text corresponding

to special tokens to be encoded as natural text (insteading of raising an error).

  • Setting allowed_special to “all” will treat all text corresponding to special tokens to be encoded as special tokens.

decode(t)[source]#

Decodes a list of token IDs into a string.

Parameters:

t (list[int]) – The list of token IDs to be decoded.

Returns:

The decoded string.

Return type:

str

class maxtext.input_pipeline.tokenizer.SentencePieceTokenizer(model_path, add_bos, add_eos)[source]#

Bases: object

Tokenizing and encoding/decoding text using the native sentencepiece library. Supports both local and GCS (gs://) model paths.

Parameters:
  • model_path (str)

  • add_bos (bool)

  • add_eos (bool)

encode(s)[source]#
Parameters:

s (str)

Return type:

list[int]

decode(t)[source]#
Parameters:

t (Sequence[int])

Return type:

str

class maxtext.input_pipeline.tokenizer.HFTokenizer(model_path, add_bos, add_eos, hf_access_token)[source]#

Bases: object

Tokenizing using huggingface tokenizer

Parameters:
  • model_path (str)

  • add_bos (bool)

  • add_eos (bool)

  • hf_access_token (str)

encode(s)[source]#
Parameters:

s (str)

Return type:

list[int]

decode(t)[source]#
Parameters:

t (Sequence[int])

Return type:

str

maxtext.input_pipeline.tokenizer.build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token)[source]#

Loads the tokenizer at tokenizer_path