maxtext.input_pipeline.tokenizer module#
Provides op for tokenizing a dataset.
- class maxtext.input_pipeline.tokenizer.TikTokenTokenizer(model_path, add_bos, add_eos)[source]#
Bases:
objectTokenizing and encoding/decoding text using the Tiktoken tokenizer.
- Parameters:
model_path (str)
add_bos (bool)
add_eos (bool)
- num_reserved_special_tokens = 256#
- pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"#
- special_tokens: dict[str, int]#
- encode(s, *, allowed_special=(), disallowed_special=())[source]#
Encodes a string into a list of token IDs.
- Parameters:
s (str) – The input string to be encoded.
bos (bool) – Whether to prepend the beginning-of-sequence token.
eos (bool) – Whether to append the end-of-sequence token.
allowed_tokens (“all”|set[str]) – allowed special tokens in string
disallowed_tokens (“all”|set[str]) – special tokens that raise an error when in string
allowed_special (Literal['all'] | ~typing.Collection[str])
disallowed_special (Literal['all'] | ~typing.Collection[str])
- Returns:
A list of token IDs.
- Return type:
list[int]
By default, setting disallowed_special=() encodes a string by ignoring special tokens. Specifically: - Setting disallowed_special to () will cause all text corresponding
to special tokens to be encoded as natural text (insteading of raising an error).
Setting allowed_special to “all” will treat all text corresponding to special tokens to be encoded as special tokens.
- class maxtext.input_pipeline.tokenizer.SentencePieceTokenizer(model_path, add_bos, add_eos)[source]#
Bases:
objectTokenizing and encoding/decoding text using the native sentencepiece library. Supports both local and GCS (gs://) model paths.
- Parameters:
model_path (str)
add_bos (bool)
add_eos (bool)