maxtext.input_pipeline.olmo_data module#
Shared utilities for OLMo-core-style numpy FSL datasets.
Dependency-free layer. AI2’s mix files describe a virtual concatenation of
flat token-ID arrays; instances are non-overlapping sequence_length-token
windows of that stream. This module builds the index that maps a global
instance index to (file, token-offset), and ports OLMo-core’s repeated-n-gram
filter (olmo_core/data/utils.py::find_periodic_sequences).
- class maxtext.input_pipeline.olmo_data.OlmoNpyFileEntry(path, label, n_tokens, n_instances, instance_offset)[source]#
Bases: object
One file in the mix: n_tokens // sequence_length instances starting at global index instance_offset. Trailing tokens are dropped (matches OLMo-core).
- Parameters:
path (str)
label (str)
n_tokens (int)
n_instances (int)
instance_offset (int)
- path: str#
- label: str#
- n_tokens: int#
- n_instances: int#
- instance_offset: int#
- class maxtext.input_pipeline.olmo_data.OlmoNpyIndex(format_version, sequence_length, dtype, tokenizer, files, total_instances, total_tokens, fingerprint='', _instance_offset_starts=None)[source]#
Bases: object
Index over the files in an OLMo data mix. Build via build_index(), persist via save(), restore via load_index(). Mutating fields invalidates fingerprint.
- Parameters:
format_version (str)
sequence_length (int)
dtype (str)
tokenizer (str)
files (Tuple[OlmoNpyFileEntry, ...])
total_instances (int)
total_tokens (int)
fingerprint (str)
_instance_offset_starts (List[int] | None)
- format_version: str#
- sequence_length: int#
- dtype: str#
- tokenizer: str#
- files: Tuple[OlmoNpyFileEntry, ...]#
- total_instances: int#
- total_tokens: int#
- fingerprint: str = ''#
- maxtext.input_pipeline.olmo_data.load_index(path)[source]#
Load an index from JSON written by OlmoNpyIndex.save().
- Parameters:
path (str) – Local filesystem path to the JSON file.
- Returns:
The materialized OlmoNpyIndex.
- Raises:
ValueError – If format_version doesn’t match this code’s expectation.
- Return type:
OlmoNpyIndex
- maxtext.input_pipeline.olmo_data.global_to_local(index, instance_id)[source]#
Global instance index → (file_idx, token_offset). token_offset is in tokens (not bytes); the slice arr[token_offset : token_offset + sequence_length] is the instance.
- Parameters:
index (OlmoNpyIndex)
instance_id (int)
- Return type:
Tuple[int, int]
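The lookup described above can be sketched with a binary search over per-file start offsets. This is a minimal illustration, not the module's implementation: `global_to_local_sketch`, `instance_offsets`, and `total_instances` are hypothetical names, and the list of offsets is assumed to be sorted in mix order.

```python
from bisect import bisect_right
from typing import List, Tuple


def global_to_local_sketch(
    instance_offsets: List[int],
    sequence_length: int,
    instance_id: int,
    total_instances: int,
) -> Tuple[int, int]:
    """Map a global instance id to (file_idx, token_offset).

    instance_offsets[i] is the global index of file i's first instance.
    Files with zero instances share a start with the next file, and
    bisect_right steps past them.
    """
    if not 0 <= instance_id < total_instances:
        raise IndexError(f"instance_id {instance_id} out of range")
    file_idx = bisect_right(instance_offsets, instance_id) - 1
    local_instance = instance_id - instance_offsets[file_idx]
    # The offset is counted in tokens, not bytes.
    return file_idx, local_instance * sequence_length


# Three hypothetical files holding 2, 0 and 3 instances of 4 tokens each.
offsets = [0, 2, 2]
print(global_to_local_sketch(offsets, 4, 1, 5))  # (0, 4)
print(global_to_local_sketch(offsets, 4, 2, 5))  # (2, 0)
```

Multiplying the token offset by the dtype's item size gives a byte offset if one is needed for a raw read.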
- maxtext.input_pipeline.olmo_data.compute_fingerprint(sequence_length, dtype, tokenizer, files)[source]#
Stable hash over the fields a restart must preserve.
If any of these change, the global instance ordering changes and resuming training from a checkpoint would silently produce different batches.
- Parameters:
sequence_length (int)
dtype (str)
tokenizer (str)
files (Sequence[OlmoNpyFileEntry])
- Return type:
str
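One way such a fingerprint could be computed is to hash a canonical JSON encoding of the ordering-relevant fields. The sketch below is an assumption for illustration only: the actual hash function, the serialization, and the per-file fields used by `compute_fingerprint` are not stated here, and `fingerprint_sketch` with its `(path, n_tokens)` pairs is hypothetical.

```python
import hashlib
import json
from typing import Sequence, Tuple


def fingerprint_sketch(
    sequence_length: int,
    dtype: str,
    tokenizer: str,
    files: Sequence[Tuple[str, int]],  # hypothetical (path, n_tokens) pairs
) -> str:
    """Hash the fields that determine the global instance ordering."""
    payload = {
        "sequence_length": sequence_length,
        "dtype": dtype,
        "tokenizer": tokenizer,
        # File order is kept as given, because order defines instance ids.
        "files": [[path, n_tokens] for path, n_tokens in files],
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fp_a = fingerprint_sketch(4, "uint32", "tok", [("a.npy", 10), ("b.npy", 7)])
fp_b = fingerprint_sketch(4, "uint32", "tok", [("b.npy", 7), ("a.npy", 10)])
print(fp_a != fp_b)  # True: reordering files changes the fingerprint
```

Comparing the stored fingerprint with a freshly computed one at restart is a cheap way to detect a changed mix before any batch is produced.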
- maxtext.input_pipeline.olmo_data.parse_npy_header(stream)[source]#
Parse a .npy v1/v2/v3 header. Returns (dtype_str, shape).
- Parameters:
stream (BinaryIO)
- Return type:
Tuple[str, Tuple[int, …]]
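For orientation, a header parse along these lines can be written directly against the published NPY layout: a 6-byte magic string, two version bytes, a little-endian header length (2 bytes in v1, 4 bytes in v2 and v3), then a Python dict literal. The function below is a sketch with the hypothetical name `parse_npy_header_sketch`; it is not the module's code and skips the validation a production parser would want.

```python
import ast
import io
import struct
from typing import BinaryIO, Tuple

import numpy as np

NPY_MAGIC = b"\x93NUMPY"


def parse_npy_header_sketch(stream: BinaryIO) -> Tuple[str, Tuple[int, ...]]:
    """Return (dtype_str, shape) from the start of a .npy stream."""
    if stream.read(6) != NPY_MAGIC:
        raise ValueError("not a .npy file: bad magic bytes")
    major, _minor = stream.read(2)
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
        encoding = "latin1"
    elif major in (2, 3):
        (header_len,) = struct.unpack("<I", stream.read(4))
        encoding = "utf-8" if major == 3 else "latin1"
    else:
        raise ValueError(f"unsupported .npy major version {major}")
    # The header is a dict literal with 'descr', 'fortran_order' and 'shape'.
    header = ast.literal_eval(stream.read(header_len).decode(encoding))
    return str(header["descr"]), tuple(header["shape"])


# Round-trip through np.save to exercise the parser without touching disk.
buf = io.BytesIO()
np.save(buf, np.zeros((3, 4), dtype=np.uint16))
buf.seek(0)
dtype_str, shape = parse_npy_header_sketch(buf)
print(shape)  # (3, 4)
```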
- maxtext.input_pipeline.olmo_data.read_npy_header_from_path(path)[source]#
Convenience wrapper for parse_npy_header() on a local file.
- Parameters:
path (str)
- Return type:
Tuple[str, Tuple[int, …]]
- maxtext.input_pipeline.olmo_data.read_raw_metadata_from_path(path, dtype)[source]#
Headerless raw binary: n_tokens = file_size // itemsize.
AI2’s .npy-extension files are actually raw uint32 dumps, no header; olmo-core reads them with np.memmap and a known dtype.
- Parameters:
path (str)
dtype (str)
- Return type:
Tuple[str, Tuple[int, …]]
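The size arithmetic and the memmap read can be sketched as follows. `read_raw_metadata_sketch` is a hypothetical stand-in, and returning a one-dimensional shape `(n_tokens,)` is an assumption based on the declared return type.

```python
import os
import tempfile
from typing import Tuple

import numpy as np


def read_raw_metadata_sketch(path: str, dtype: str) -> Tuple[str, Tuple[int, ...]]:
    """Infer the token count of a headerless dump from its size on disk."""
    itemsize = np.dtype(dtype).itemsize
    n_tokens = os.path.getsize(path) // itemsize
    return dtype, (n_tokens,)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part-0.npy")
    # tofile() writes raw bytes with no header, like the files described above.
    np.arange(10, dtype=np.uint32).tofile(path)

    raw_dtype, raw_shape = read_raw_metadata_sketch(path, "uint32")
    tokens = np.memmap(path, dtype=raw_dtype, mode="r", shape=raw_shape)
    first_instance = np.array(tokens[0:4])  # copy out one 4-token window
    del tokens  # release the mapping before the directory is removed

print(raw_shape)                # (10,)
print(first_instance.tolist())  # [0, 1, 2, 3]
```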
- maxtext.input_pipeline.olmo_data.has_npy_magic(first_bytes)[source]#
Quick check: does this look like a real .npy file?
- Parameters:
first_bytes (bytes)
- Return type:
bool
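A check of this kind usually compares the leading bytes against the NPY magic string `\x93NUMPY`. The sketch below uses the hypothetical name `has_npy_magic_sketch`; whether the real function also inspects the version bytes is not stated here.

```python
import io

import numpy as np


def has_npy_magic_sketch(first_bytes: bytes) -> bool:
    """True if the buffer starts with the 6-byte NPY magic string."""
    return first_bytes[:6] == b"\x93NUMPY"


tokens = np.arange(8, dtype=np.uint32)

# A real .npy file written by np.save carries the magic prefix.
buf = io.BytesIO()
np.save(buf, tokens)
real_npy = buf.getvalue()

# A raw dump of the same array does not.
raw_dump = tokens.tobytes()

print(has_npy_magic_sketch(real_npy))  # True
print(has_npy_magic_sketch(raw_dump))  # False
```

A caller could use the result to choose between a header parse and the size-based raw path.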
- maxtext.input_pipeline.olmo_data.build_index(paths_and_labels, sequence_length, *, tokenizer, header_reader=<function read_npy_header_from_path>)[source]#
Build an OlmoNpyIndex from (path, label) entries.
Order matters: global instance ordering is the concatenation in this order.
header_reader is the seam tests use to avoid disk; production paths pass a GCS-aware reader.
- Parameters:
paths_and_labels (Sequence[Tuple[str, str]])
sequence_length (int)
tokenizer (str)
- Return type:
OlmoNpyIndex
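The core of the build step, as described, is a running sum of per-file instance counts, with the header reader injected so no file is opened. The sketch below illustrates that idea with hypothetical names (`FileEntrySketch`, `build_entries_sketch`, `fake_headers`); it assumes the reader returns `(dtype_str, shape)` with a one-dimensional shape and leaves out dtype checks, totals, and the fingerprint.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

HeaderReader = Callable[[str], Tuple[str, Tuple[int, ...]]]


@dataclass(frozen=True)
class FileEntrySketch:
    path: str
    label: str
    n_tokens: int
    n_instances: int
    instance_offset: int


def build_entries_sketch(
    paths_and_labels: Sequence[Tuple[str, str]],
    sequence_length: int,
    header_reader: HeaderReader,
) -> List[FileEntrySketch]:
    """Assign each file a contiguous block of global instance ids."""
    entries: List[FileEntrySketch] = []
    next_offset = 0
    for path, label in paths_and_labels:
        _dtype, shape = header_reader(path)
        n_tokens = shape[0]
        # Trailing tokens that do not fill a window are dropped.
        n_instances = n_tokens // sequence_length
        entries.append(
            FileEntrySketch(path, label, n_tokens, n_instances, next_offset)
        )
        next_offset += n_instances
    return entries


# An in-memory reader stands in for disk or GCS access.
fake_headers = {"a.npy": ("<u4", (10,)), "b.npy": ("<u4", (7,))}
entries = build_entries_sketch(
    [("a.npy", "web"), ("b.npy", "code")], 4, fake_headers.__getitem__
)
print([(e.n_instances, e.instance_offset) for e in entries])  # [(2, 0), (1, 2)]
```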
- class maxtext.input_pipeline.olmo_data.RepetitionTuple(start, end, period, times)[source]#
Bases: NamedTuple
arr[start:end] is a periodic span whose repeating unit has length period; times = (end - start) // period.
- Parameters:
start (int)
end (int)
period (int)
times (int)
- start: int#
Alias for field number 0
- end: int#
Alias for field number 1
- period: int#
Alias for field number 2
- times: int#
Alias for field number 3
- maxtext.input_pipeline.olmo_data.find_periodic_sequences(arr, max_period, min_period=1, mask_value=-1)[source]#
Yield RepetitionTuple for periodic spans of length ≥ 3 in arr.
mask_value is reshape padding and must not appear in arr. The default -1 corresponds to the maximum uint32 value (4294967295) once cast, which is above any realistic vocab; pass an out-of-vocab sentinel if your vocab hits that id.
- Parameters:
arr (ndarray)
max_period (int)
min_period (int)
mask_value (int)
- Return type:
Generator[RepetitionTuple, None, None]
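To make the notion of a periodic span concrete, the sketch below finds them with a shifted self-comparison. It is deliberately a different, simpler technique than the reshape-and-pad approach the `mask_value` note implies, so it needs no sentinel. `Repetition`, `find_periodic_spans_sketch`, and the `min_times=3` threshold are assumptions made for the example.

```python
from typing import Iterator, NamedTuple

import numpy as np


class Repetition(NamedTuple):
    start: int
    end: int
    period: int
    times: int


def find_periodic_spans_sketch(
    arr: np.ndarray, max_period: int, min_period: int = 1, min_times: int = 3
) -> Iterator[Repetition]:
    """Yield maximal spans in which arr[i] == arr[i + period] throughout."""
    n = len(arr)
    for period in range(min_period, max_period + 1):
        if n < period * min_times:
            continue
        # same[i] is True when arr[i] == arr[i + period].
        same = arr[period:] == arr[:-period]
        # Pad with False so every run of True has both a start and an end edge.
        padded = np.concatenate(([False], same, [False]))
        edges = np.flatnonzero(padded[1:] != padded[:-1])
        for run_start, run_end in zip(edges[::2], edges[1::2]):
            start = int(run_start)
            end = int(run_end) + period
            times = (end - start) // period
            if times >= min_times:
                yield Repetition(start, end, period, times)


arr = np.array([7, 1, 2, 1, 2, 1, 2, 9], dtype=np.uint32)
spans = list(find_periodic_spans_sketch(arr, max_period=4))
print(spans)  # [Repetition(start=1, end=7, period=2, times=3)]

# The default sentinel: -1 cast to uint32 wraps to the largest uint32 value.
wrapped = int(np.array([-1]).astype(np.uint32)[0])
print(wrapped == np.iinfo(np.uint32).max)  # True
```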
- maxtext.input_pipeline.olmo_data.is_clean_instance(input_ids, *, repetition_max_period=13, repetition_min_period=1, repetition_max_count=32, mask_value=-1)[source]#
False iff input_ids has any periodic span (period ∈ [min, max]) that repeats ≥ repetition_max_count times. Defaults match OLMo-core’s _validate_instance.
- Parameters:
input_ids (ndarray)
repetition_max_period (int)
repetition_min_period (int)
repetition_max_count (int)
mask_value (int)
- Return type:
bool
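The filter rule can be sketched on its own: for each period in range, measure the longest back-to-back repetition and reject the instance once it reaches the threshold. The helper names below (`max_repeat_count`, `is_clean_sketch`) are hypothetical, and the plain Python loop favors readability over speed.

```python
import numpy as np


def max_repeat_count(arr: np.ndarray, period: int) -> int:
    """Largest number of consecutive repeats of any `period`-token unit."""
    if len(arr) <= period:
        return 1
    same = arr[period:] == arr[:-period]
    longest = current = 0
    for flag in same.tolist():
        current = current + 1 if flag else 0
        longest = max(longest, current)
    # A run of k matches spans k + period tokens, i.e. (k + period) // period units.
    return (longest + period) // period


def is_clean_sketch(
    input_ids: np.ndarray,
    *,
    max_period: int = 13,
    min_period: int = 1,
    max_count: int = 32,
) -> bool:
    """False if any period in [min_period, max_period] repeats >= max_count times."""
    for period in range(min_period, max_period + 1):
        if max_repeat_count(input_ids, period) >= max_count:
            return False
    return True


unit = np.array([3, 4, 5], dtype=np.uint32)
varied = np.arange(256, dtype=np.uint32)
looping = np.concatenate([varied, np.tile(unit, 40)])

print(is_clean_sketch(varied))   # True
print(is_clean_sketch(looping))  # False: the 3-token unit repeats 40 times
```

Instances that fail the check are typically skipped or masked by the caller; which of the two this pipeline does is not specified here.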