maxtext.input_pipeline.olmo_data module

Shared utilities for OLMo-core-style numpy FSL datasets.

Dependency-free layer. AI2’s mix files describe a virtual concatenation of flat token-ID arrays; instances are non-overlapping windows of sequence_length tokens over that stream. This module builds the index that maps a global instance index to a (file, token-offset) pair (see global_to_local(); the offset is counted in tokens, not bytes), and ports OLMo-core’s repeated-n-gram filter (olmo_core/data/utils.py::find_periodic_sequences).

class maxtext.input_pipeline.olmo_data.OlmoNpyFileEntry(path, label, n_tokens, n_instances, instance_offset)

Bases: object

One file in the mix: n_tokens // sequence_length instances starting at global index instance_offset. Trailing tokens are dropped (matches OLMo-core).

Parameters:

path (str)
label (str)
n_tokens (int)
n_instances (int)
instance_offset (int)

path: str

label: str

n_tokens: int

n_instances: int

instance_offset: int

class maxtext.input_pipeline.olmo_data.OlmoNpyIndex(format_version, sequence_length, dtype, tokenizer, files, total_instances, total_tokens, fingerprint='', _instance_offset_starts=None)

Bases: object

Index over the files in an OLMo data mix. Build via build_index(), persist via save(), restore via load_index(). Mutating fields invalidates fingerprint.

Parameters:

format_version (str)
sequence_length (int)
dtype (str)
tokenizer (str)
files (Tuple[OlmoNpyFileEntry, ...])
total_instances (int)
total_tokens (int)
fingerprint (str)
_instance_offset_starts (List[int] | None)

format_version: str

sequence_length: int

dtype: str

tokenizer: str

files: Tuple[OlmoNpyFileEntry, ...]

total_instances: int

total_tokens: int

fingerprint: str = ''

to_json_dict()

Return a JSON-serializable view (drops cached lookup helpers).

Return type: dict

save(path)

Write the index as JSON to path (local filesystem).

Parameters: path (str)
Return type: None

maxtext.input_pipeline.olmo_data.load_index(path)

Load an index from JSON written by OlmoNpyIndex.save().

Parameters: path (str) – Local filesystem path to the JSON file.
Returns: The materialized OlmoNpyIndex.
Raises: ValueError – If format_version doesn’t match this code’s expectation.
Return type: OlmoNpyIndex
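
The save/load round trip, including the format_version check, might look roughly like the sketch below. It is illustrative only: SketchIndex, EXPECTED_FORMAT_VERSION, sketch_save, sketch_load and roundtrip are hypothetical names, the version value is made up, and the real JSON layout is not documented on this page.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass

# Hypothetical version tag; the value the real module expects is not shown here.
EXPECTED_FORMAT_VERSION = "1"


@dataclass(frozen=True)
class SketchIndex:
    """Hypothetical stand-in carrying a small subset of OlmoNpyIndex's fields."""

    format_version: str
    sequence_length: int
    total_instances: int


def sketch_save(index: SketchIndex, path: str) -> None:
    """Write the index to a local JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(index), f)


def sketch_load(path: str) -> SketchIndex:
    """Read the JSON back, rejecting files written with another format version."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    if raw.get("format_version") != EXPECTED_FORMAT_VERSION:
        raise ValueError(f"unsupported format_version: {raw.get('format_version')!r}")
    return SketchIndex(**raw)


def roundtrip(index: SketchIndex) -> SketchIndex:
    """Save to a temporary directory and load the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.json")
        sketch_save(index, path)
        return sketch_load(path)
```

Failing loudly on a version mismatch is preferable to guessing at an old layout, since a misread index would silently change which tokens each instance maps to.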

maxtext.input_pipeline.olmo_data.global_to_local(index, instance_id)

Global instance index → (file_idx, token_offset).

token_offset is in tokens (not bytes); the slice arr[token_offset : token_offset + sequence_length] is the instance.

Parameters:

index (OlmoNpyIndex)
instance_id (int)

Return type:

Tuple[int, int]
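
One plausible way to implement this lookup is a binary search over the per-file starting offsets, which would also explain the cached _instance_offset_starts field. The sketch below is an assumption about the approach, not the module's code; sketch_global_to_local and its instance_counts argument are hypothetical.

```python
import bisect
from typing import List, Sequence, Tuple


def sketch_global_to_local(
    instance_counts: Sequence[int], sequence_length: int, instance_id: int
) -> Tuple[int, int]:
    """Map a global instance index to (file_idx, token_offset).

    instance_counts[i] is the number of instances in file i.
    """
    # starts[i] is the global index of file i's first instance.
    starts: List[int] = []
    total = 0
    for count in instance_counts:
        starts.append(total)
        total += count
    if not 0 <= instance_id < total:
        raise IndexError(f"instance_id {instance_id} out of range [0, {total})")
    # Rightmost file whose first instance is <= instance_id. A file with zero
    # instances shares its start with the next file, so bisect_right skips it.
    file_idx = bisect.bisect_right(starts, instance_id) - 1
    # Offsets are counted in tokens from the start of the file, not in bytes.
    token_offset = (instance_id - starts[file_idx]) * sequence_length
    return file_idx, token_offset
```

For example, with files holding 2, 0 and 3 instances and sequence_length 4, global instance 2 is the first instance of the third file and instance 4 starts at token 8 of that file.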

maxtext.input_pipeline.olmo_data.compute_fingerprint(sequence_length, dtype, tokenizer, files)

Stable hash over the fields a restart must preserve.

If any of these change, the global instance ordering changes and resuming training from a checkpoint would silently produce different batches.

Parameters:

sequence_length (int)
dtype (str)
tokenizer (str)
files (Sequence[OlmoNpyFileEntry])

Return type:

str
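
The hash construction itself is not described here. As a minimal sketch of the idea, assuming SHA-256 over a canonical JSON serialization, a fingerprint that is sensitive to file order could be built as follows; sketch_fingerprint and the (path, n_tokens) pair representation are hypothetical.

```python
import hashlib
import json
from typing import Sequence, Tuple


def sketch_fingerprint(
    sequence_length: int,
    dtype: str,
    tokenizer: str,
    files: Sequence[Tuple[str, int]],  # hypothetical (path, n_tokens) pairs
) -> str:
    """Hash the fields that determine the global instance ordering."""
    payload = {
        "sequence_length": sequence_length,
        "dtype": dtype,
        "tokenizer": tokenizer,
        # Kept as a list so that file order is part of the hash.
        "files": [[path, n_tokens] for path, n_tokens in files],
    }
    # sort_keys and fixed separators make the serialization deterministic.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

A resume path can then compare the stored fingerprint with a freshly computed one and refuse to continue if they differ.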

maxtext.input_pipeline.olmo_data.parse_npy_header(stream)

Parse a .npy v1/v2/v3 header. Returns (dtype_str, shape).

Parameters: stream (BinaryIO)
Return type: Tuple[str, Tuple[int, ...]]
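
For orientation, the .npy layout is a 6-byte magic string, two version bytes, a little-endian header length (2 bytes in v1, 4 bytes in v2 and v3), and a Python dict literal describing the array. A standalone sketch of such a parser, not the module's own code, could read as follows; sketch_parse_npy_header is a hypothetical name and structured dtypes are deliberately not handled.

```python
import ast
import struct
from typing import BinaryIO, Tuple

NPY_MAGIC = b"\x93NUMPY"


def sketch_parse_npy_header(stream: BinaryIO) -> Tuple[str, Tuple[int, ...]]:
    """Return (dtype_str, shape) and leave the stream at the first data byte."""
    if stream.read(6) != NPY_MAGIC:
        raise ValueError("not a .npy file: bad magic string")
    major, _minor = struct.unpack("<BB", stream.read(2))
    if major == 1:
        # v1 stores the header length as an unsigned 16-bit integer.
        (header_len,) = struct.unpack("<H", stream.read(2))
        encoding = "latin1"
    elif major in (2, 3):
        # v2 and v3 widen the length to 32 bits; v3 also switches to UTF-8.
        (header_len,) = struct.unpack("<I", stream.read(4))
        encoding = "utf-8" if major == 3 else "latin1"
    else:
        raise ValueError(f"unsupported .npy version: {major}")
    # The header is a dict literal padded with spaces and ending in a newline.
    header = ast.literal_eval(stream.read(header_len).decode(encoding))
    descr = header["descr"]
    if not isinstance(descr, str):
        raise ValueError("structured dtypes are not handled in this sketch")
    return descr, tuple(header["shape"])
```

Only the header is read, so this works on a stream opened over a remote object without fetching the token data.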

maxtext.input_pipeline.olmo_data.read_npy_header_from_path(path)

Convenience wrapper for parse_npy_header() on a local file.

Parameters: path (str)
Return type: Tuple[str, Tuple[int, ...]]

maxtext.input_pipeline.olmo_data.read_raw_metadata_from_path(path, dtype)

Headerless raw binary: n_tokens = file_size // itemsize.

AI2’s .npy-extension files are actually raw uint32 dumps with no header; OLMo-core reads them with np.memmap and a known dtype.

Parameters:

path (str)
dtype (str)

Return type:

Tuple[str, Tuple[int, ...]]
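
The size arithmetic is simple enough to sketch. The helpers below are hypothetical (sketch_read_raw_metadata, sketch_open_raw), and returning the caller's dtype string unchanged is an assumption about the return value.

```python
import os
from typing import Tuple

import numpy as np


def sketch_read_raw_metadata(path: str, dtype: str) -> Tuple[str, Tuple[int, ...]]:
    """Derive a 1-D shape for a headerless token dump from its file size."""
    itemsize = np.dtype(dtype).itemsize  # e.g. 4 bytes for uint32
    n_tokens = os.path.getsize(path) // itemsize
    return dtype, (n_tokens,)


def sketch_open_raw(path: str, dtype: str) -> np.memmap:
    """Memory-map the dump read-only, using the shape computed above."""
    _, shape = sketch_read_raw_metadata(path, dtype)
    return np.memmap(path, dtype=np.dtype(dtype), mode="r", shape=shape)
```

Because there is no header to validate against, the caller has to supply the correct dtype; a wrong itemsize yields a wrong token count without any error.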

maxtext.input_pipeline.olmo_data.has_npy_magic(first_bytes)

Quick check: does this look like a real .npy file?

Parameters: first_bytes (bytes)
Return type: bool
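
Real .npy files begin with the 6-byte magic string \x93NUMPY, so a check of this kind presumably reduces to a prefix comparison. A minimal sketch, with sketch_has_npy_magic and sketch_pick_format as hypothetical names:

```python
NPY_MAGIC = b"\x93NUMPY"


def sketch_has_npy_magic(first_bytes: bytes) -> bool:
    """True if the buffer starts with the .npy magic string."""
    return first_bytes[: len(NPY_MAGIC)] == NPY_MAGIC


def sketch_pick_format(first_bytes: bytes) -> str:
    """Choose between header parsing and raw-dump handling for a file."""
    return "npy" if sketch_has_npy_magic(first_bytes) else "raw"
```

Reading only the first few bytes keeps the check cheap enough to run once per file while the index is being built.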

maxtext.input_pipeline.olmo_data.build_index(paths_and_labels, sequence_length, *, tokenizer, header_reader=read_npy_header_from_path)

Build an OlmoNpyIndex from (path, label) entries.

Order matters: the global instance ordering is the concatenation of the files in this order. header_reader is the seam tests use to avoid disk I/O; production paths pass a GCS-aware reader.

Parameters:

paths_and_labels (Sequence[Tuple[str, str]])
sequence_length (int)
tokenizer (str)

Return type:

OlmoNpyIndex
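
The accumulation that build_index performs, and the way header_reader lets a test bypass the filesystem, can be pictured with the standalone sketch below. It does not call the real function: sketch_build_entries, FAKE_HEADERS and fake_header_reader are hypothetical, and plain dicts stand in for OlmoNpyFileEntry.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Same shape as read_npy_header_from_path: path -> (dtype_str, shape).
HeaderReader = Callable[[str], Tuple[str, Tuple[int, ...]]]


def sketch_build_entries(
    paths_and_labels: Sequence[Tuple[str, str]],
    sequence_length: int,
    header_reader: HeaderReader,
) -> List[Dict[str, object]]:
    """Assign each file its instance count and starting global index."""
    entries: List[Dict[str, object]] = []
    instance_offset = 0
    for path, label in paths_and_labels:
        _dtype, shape = header_reader(path)
        n_tokens = shape[0]  # assumes flat 1-D token arrays
        n_instances = n_tokens // sequence_length  # trailing tokens are dropped
        entries.append(
            {
                "path": path,
                "label": label,
                "n_tokens": n_tokens,
                "n_instances": n_instances,
                "instance_offset": instance_offset,
            }
        )
        instance_offset += n_instances
    return entries


# An in-memory reader, so no file is ever opened.
FAKE_HEADERS = {"a.npy": ("uint32", (10,)), "b.npy": ("uint32", (7,))}


def fake_header_reader(path: str) -> Tuple[str, Tuple[int, ...]]:
    return FAKE_HEADERS[path]
```

With sequence_length 4, a.npy contributes 2 instances (2 trailing tokens dropped) and b.npy contributes 1 instance starting at global index 2. Swapping the two input entries would renumber every instance, which is why the order is part of the fingerprint.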

class maxtext.input_pipeline.olmo_data.RepetitionTuple(start, end, period, times)

Bases: NamedTuple

arr[start:end] is a periodic span whose repeating unit is period tokens long; times = (end - start) // period is the number of repetitions.

Parameters:

start (int)
end (int)
period (int)
times (int)

start: int – Alias for field number 0

end: int – Alias for field number 1

period: int – Alias for field number 2

times: int – Alias for field number 3

maxtext.input_pipeline.olmo_data.find_periodic_sequences(arr, max_period, min_period=1, mask_value=-1)

Yield RepetitionTuple for periodic spans of length ≥ 3 in arr.

mask_value is used as reshape padding and must not appear in arr. The default -1 wraps to the maximum uint32 value (2^32 - 1) when cast to uint32, which is above any realistic vocabulary size; pass a different out-of-vocab sentinel if your vocab uses that id.

Parameters:

arr (ndarray)
max_period (int)
min_period (int)
mask_value (int)

Return type:

Generator[RepetitionTuple, None, None]
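
To make the output concrete, here is a naive scan that reports the same kind of tuple. It is not the ported OLMo-core algorithm (which works via reshaping and padding, hence mask_value); it simply compares each element with the one period positions later. Repetition, sketch_find_periodic and the min_times=3 threshold are assumptions for illustration.

```python
from typing import Iterator, NamedTuple, Sequence


class Repetition(NamedTuple):
    """Hypothetical mirror of RepetitionTuple."""

    start: int
    end: int
    period: int
    times: int


def sketch_find_periodic(
    arr: Sequence[int], max_period: int, min_period: int = 1, min_times: int = 3
) -> Iterator[Repetition]:
    """Yield maximal periodic spans found by direct comparison."""
    n = len(arr)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period < n:
            if arr[i] != arr[i + period]:
                i += 1
                continue
            # Extend the run of positions that equal their period-shifted twin.
            start = i
            while i + period < n and arr[i] == arr[i + period]:
                i += 1
            end = i + period  # exclusive end of the periodic span
            times = (end - start) // period
            if times >= min_times:
                yield Repetition(start, end, period, times)
```

On [1, 2, 3, 1, 2, 3, 1, 2, 3, 9] with max_period=3 this yields a single span covering indices 0 to 9 with period 3 and times 3. Unlike the real function, the sketch needs no sentinel because it never pads the array.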

maxtext.input_pipeline.olmo_data.is_clean_instance(input_ids, *, repetition_max_period=13, repetition_min_period=1, repetition_max_count=32, mask_value=-1)

Returns False iff input_ids contains a periodic span, with period in [repetition_min_period, repetition_max_period], that repeats ≥ repetition_max_count times. Defaults match OLMo-core’s _validate_instance.

Parameters:

input_ids (ndarray)
repetition_max_period (int)
repetition_min_period (int)
repetition_max_count (int)
mask_value (int)

Return type:

bool
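
The filter's decision rule can be illustrated with a self-contained sketch that measures the longest repeat count for each period directly. This is a simplified stand-in rather than the module's implementation; longest_repeat_count and sketch_is_clean are hypothetical names, and only the default thresholds are taken from the signature above.

```python
from typing import Sequence


def longest_repeat_count(ids: Sequence[int], period: int) -> int:
    """Largest number of back-to-back repeats of any period-length block."""
    best_run = run = 0
    for i in range(len(ids) - period):
        # Count consecutive positions that equal the token one period later.
        run = run + 1 if ids[i] == ids[i + period] else 0
        best_run = max(best_run, run)
    # A run of r matches spans r + period tokens, i.e. (r + period) // period repeats.
    return (best_run + period) // period


def sketch_is_clean(
    ids: Sequence[int],
    max_period: int = 13,
    min_period: int = 1,
    max_count: int = 32,
) -> bool:
    """False if any period in [min_period, max_period] repeats >= max_count times."""
    for period in range(min_period, max_period + 1):
        if longest_repeat_count(ids, period) >= max_count:
            return False
    return True
```

With the defaults, a window containing the pair (1, 2) repeated 32 times is rejected, while 31 repetitions of the same pair pass.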
