maxtext.input_pipeline.tokenizer module

maxtext.input_pipeline.tokenizer module#

Provides op for tokenizing a dataset.

class maxtext.input_pipeline.tokenizer.TikTokenTokenizer(model_path, add_bos, add_eos)[source]#

Bases: object

Tokenizing and encoding/decoding text using the Tiktoken tokenizer.

Parameters:

model_path (str)
add_bos (bool)
add_eos (bool)

num_reserved_special_tokens = 256#

pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"#

special_tokens: dict[str, int]#

encode(s, *, allowed_special=(), disallowed_special=())[source]#

Encodes a string into a list of token IDs.

Parameters:

s (str) – The input string to be encoded.
bos (bool) – Whether to prepend the beginning-of-sequence token.
eos (bool) – Whether to append the end-of-sequence token.
allowed_tokens (“all”|set[str]) – allowed special tokens in string
disallowed_tokens (“all”|set[str]) – special tokens that raise an error when in string
allowed_special (Literal['all'] | ~typing.Collection[str])
disallowed_special (Literal['all'] | ~typing.Collection[str])

Returns:

A list of token IDs.

Return type:

list[int]

By default, setting disallowed_special=() encodes a string by ignoring special tokens. Specifically: - Setting disallowed_special to () will cause all text corresponding

to special tokens to be encoded as natural text (insteading of raising an error).

Setting allowed_special to “all” will treat all text corresponding to special tokens to be encoded as special tokens.

decode(t)[source]#

Decodes a list of token IDs into a string.

Parameters:: t (list[int]) – The list of token IDs to be decoded.
Returns:: The decoded string.
Return type:: str

class maxtext.input_pipeline.tokenizer.SentencePieceTokenizer(model_path, add_bos, add_eos)[source]#

Bases: object

Tokenizing and encoding/decoding text using the native sentencepiece library. Supports both local and GCS (gs://) model paths.

Parameters:

model_path (str)
add_bos (bool)
add_eos (bool)

encode(s)[source]#

Parameters:: s (str)
Return type:: list[int]

decode(t)[source]#

Parameters:: t (Sequence[int])
Return type:: str

class maxtext.input_pipeline.tokenizer.HFTokenizer(model_path, add_bos, add_eos, hf_access_token)[source]#

Bases: object

Tokenizing using huggingface tokenizer

Parameters:

model_path (str)
add_bos (bool)
add_eos (bool)
hf_access_token (str)

encode(s)[source]#

Parameters:: s (str)
Return type:: list[int]

decode(t)[source]#

Parameters:: t (Sequence[int])
Return type:: str

maxtext.input_pipeline.tokenizer.build_tokenizer(tokenizer_path, tokenizer_type, add_bos, add_eos, hf_access_token)[source]#: Loads the tokenizer at tokenizer_path

maxtext.input_pipeline.tokenizer module

Contents

maxtext.input_pipeline.tokenizer module#