maxtext.input_pipeline.olmo_data module

Shared utilities for OLMo-core-style numpy FSL datasets.

Dependency-free layer. AI2’s mix files describe a virtual concatenation of flat token-ID arrays; instances are non-overlapping windows of sequence_length tokens over that stream. This module builds the index that maps a global instance index to a (file, token offset) pair, and ports OLMo-core’s repeated-n-gram filter (olmo_core/data/utils.py::find_periodic_sequences).

class maxtext.input_pipeline.olmo_data.OlmoNpyFileEntry(path, label, n_tokens, n_instances, instance_offset)

Bases: object

One file in the mix. It contributes n_tokens // sequence_length instances, the first of which has global index instance_offset. Trailing tokens that do not fill a whole window are dropped (this matches OLMo-core).

Parameters:
  • path (str)

  • label (str)

  • n_tokens (int)

  • n_instances (int)

  • instance_offset (int)

path: str
label: str
n_tokens: int
n_instances: int
instance_offset: int
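
To make the windowing concrete, here is a minimal sketch of how one file splits into instances. It is not MaxText code: the helper name instance_token_ranges is hypothetical, and it only restates the n_tokens // sequence_length rule described above.

```python
from typing import List, Tuple


def instance_token_ranges(n_tokens: int, sequence_length: int) -> List[Tuple[int, int]]:
    """Return the [start, end) token range of each instance in one file.

    Hypothetical helper for illustration. Tokens past the last full window
    are dropped, as the entry description states.
    """
    n_instances = n_tokens // sequence_length
    return [(k * sequence_length, (k + 1) * sequence_length) for k in range(n_instances)]


# A 10-token file with sequence_length=4 holds 2 instances; tokens 8 and 9 are dropped.
ranges = instance_token_ranges(n_tokens=10, sequence_length=4)
print(ranges)  # [(0, 4), (4, 8)]
```

A file shorter than sequence_length contributes no instances at all under this rule.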
class maxtext.input_pipeline.olmo_data.OlmoNpyIndex(format_version, sequence_length, dtype, tokenizer, files, total_instances, total_tokens, fingerprint='', _instance_offset_starts=None)

Bases: object

Index over the files in an OLMo data mix. Build via build_index(), persist via save(), restore via load_index(). Mutating fields invalidates fingerprint.

Parameters:
  • format_version (str)

  • sequence_length (int)

  • dtype (str)

  • tokenizer (str)

  • files (Tuple[OlmoNpyFileEntry, ...])

  • total_instances (int)

  • total_tokens (int)

  • fingerprint (str)

  • _instance_offset_starts (List[int] | None)

format_version: str
sequence_length: int
dtype: str
tokenizer: str
files: Tuple[OlmoNpyFileEntry, ...]
total_instances: int
total_tokens: int
fingerprint: str = ''
to_json_dict()

Return a JSON-serializable view (drops cached lookup helpers).

Return type:

dict

save(path)

Write the index as JSON to path (local filesystem).

Parameters:

path (str)

Return type:

None

maxtext.input_pipeline.olmo_data.load_index(path)

Load an index from JSON written by OlmoNpyIndex.save().

Parameters:

path (str) – Local filesystem path to the JSON file.

Returns:

The materialized OlmoNpyIndex.

Raises:

ValueError – If format_version doesn’t match this code’s expectation.

Return type:

OlmoNpyIndex
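
The save/load round trip, including the version check that load_index() is documented to perform, could look roughly like the following. This is a minimal standard-library sketch: the helper names, the dict layout, and the version string "1" are assumptions for illustration, not the module's actual values.

```python
import json
import os
import tempfile

EXPECTED_FORMAT_VERSION = "1"  # assumed value; the real constant is not shown on this page


def save_index_json(index: dict, path: str) -> None:
    """Write a JSON-serializable index dict to a local path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2, sort_keys=True)


def load_index_json(path: str) -> dict:
    """Read the dict back and reject an unexpected format_version."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    found = data.get("format_version")
    if found != EXPECTED_FORMAT_VERSION:
        raise ValueError(f"unsupported format_version {found!r}")
    return data


index = {
    "format_version": EXPECTED_FORMAT_VERSION,
    "sequence_length": 4,
    "dtype": "uint32",
    "files": [{"path": "a.npy", "n_tokens": 10, "n_instances": 2, "instance_offset": 0}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")
    save_index_json(index, path)
    restored = load_index_json(path)

print(restored == index)  # True
```

Failing fast on a version mismatch avoids silently interpreting an index written with a different layout.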

maxtext.input_pipeline.olmo_data.global_to_local(index, instance_id)

Global instance index → (file_idx, token_offset).

token_offset is in tokens (not bytes); the slice arr[token_offset : token_offset + sequence_length] is the instance.

Parameters:
  • index (OlmoNpyIndex)

  • instance_id (int)

Return type:

Tuple[int, int]
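
One plausible way to implement this lookup is a binary search over each file's starting instance offset. The sketch below assumes that approach; the function name and its flat-list arguments are hypothetical and stand in for the fields of an OlmoNpyIndex.

```python
import bisect
from typing import List, Tuple


def global_to_local_sketch(
    instance_offsets: List[int],
    n_instances: List[int],
    sequence_length: int,
    instance_id: int,
) -> Tuple[int, int]:
    """Map a global instance index to (file_idx, token_offset).

    instance_offsets[i] is the global index of file i's first instance.
    Hypothetical signature, for illustration only.
    """
    total = instance_offsets[-1] + n_instances[-1] if instance_offsets else 0
    if not 0 <= instance_id < total:
        raise IndexError(f"instance_id {instance_id} out of range [0, {total})")
    # Rightmost file whose starting offset is <= instance_id.
    file_idx = bisect.bisect_right(instance_offsets, instance_id) - 1
    local_instance = instance_id - instance_offsets[file_idx]
    # The offset is counted in tokens, not bytes.
    return file_idx, local_instance * sequence_length


# Two files holding 2 and 3 instances, sequence_length=4.
offsets, counts = [0, 2], [2, 3]
print(global_to_local_sketch(offsets, counts, 4, 1))  # (0, 4)
print(global_to_local_sketch(offsets, counts, 4, 3))  # (1, 4)
```

Using bisect_right rather than bisect_left means a file that contributes zero instances is skipped, since the search lands on the last file sharing that offset.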

maxtext.input_pipeline.olmo_data.compute_fingerprint(sequence_length, dtype, tokenizer, files)

Stable hash over the fields a restart must preserve.

If any of these change, the global instance ordering changes and resuming training from a checkpoint would silently produce different batches.

Parameters:
  • sequence_length (int)

  • dtype (str)

  • tokenizer (str)

  • files (Sequence[OlmoNpyFileEntry])

Return type:

str
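
A fingerprint of this kind can be sketched as a hash over a canonical serialization of the order-defining fields. The page does not say which hash or serialization the module uses, so SHA-256 over sorted-key JSON is an assumption here, as are the function name and the (path, n_tokens) file tuples.

```python
import hashlib
import json
from typing import Sequence, Tuple


def fingerprint_sketch(
    sequence_length: int,
    dtype: str,
    tokenizer: str,
    files: Sequence[Tuple[str, int]],
) -> str:
    """Hash the fields that determine global instance ordering.

    files is a sequence of (path, n_tokens) pairs; list order is preserved
    in the payload because file order defines instance order.
    """
    payload = {
        "sequence_length": sequence_length,
        "dtype": dtype,
        "tokenizer": tokenizer,
        "files": [[path, n_tokens] for path, n_tokens in files],
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


files = [("a.npy", 10), ("b.npy", 7)]
fp = fingerprint_sketch(4, "uint32", "example-tokenizer", files)
fp_reordered = fingerprint_sketch(4, "uint32", "example-tokenizer", files[::-1])
print(fp != fp_reordered)  # True: reordering files changes the fingerprint
```

Comparing the stored fingerprint with a freshly computed one at restart is one way to detect that the mix changed underneath a checkpoint.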

maxtext.input_pipeline.olmo_data.parse_npy_header(stream)

Parse a .npy v1/v2/v3 header. Returns (dtype_str, shape).

Parameters:

stream (BinaryIO)

Return type:

Tuple[str, Tuple[int, …]]
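
For readers unfamiliar with the .npy layout, the following is a minimal standard-library sketch of header parsing based on the published NumPy format (magic bytes, a two-byte version, a little-endian header length, then a Python dict literal). It is not the module's implementation; the function name is hypothetical and it assumes a simple, non-structured dtype so that descr is a string.

```python
import ast
import io
import struct
from typing import BinaryIO, Tuple

NPY_MAGIC = b"\x93NUMPY"


def parse_npy_header_sketch(stream: BinaryIO) -> Tuple[str, Tuple[int, ...]]:
    """Return (dtype_str, shape) from the start of a .npy stream."""
    if stream.read(6) != NPY_MAGIC:
        raise ValueError("not a .npy file: bad magic bytes")
    version = stream.read(2)
    if len(version) != 2:
        raise ValueError("truncated .npy version field")
    major = version[0]
    if major == 1:
        # v1 stores the header length as an unsigned 16-bit little-endian int.
        (header_len,) = struct.unpack("<H", stream.read(2))
        encoding = "latin1"
    elif major in (2, 3):
        # v2 and v3 widen the header length to 32 bits; v3 switches to UTF-8.
        (header_len,) = struct.unpack("<I", stream.read(4))
        encoding = "latin1" if major == 2 else "utf-8"
    else:
        raise ValueError(f"unsupported .npy major version {major}")
    header = ast.literal_eval(stream.read(header_len).decode(encoding))
    return header["descr"], tuple(header["shape"])


# Hand-built v1 header describing ten little-endian uint32 values.
header_bytes = b"{'descr': '<u4', 'fortran_order': False, 'shape': (10,), }\n"
blob = NPY_MAGIC + bytes([1, 0]) + struct.pack("<H", len(header_bytes)) + header_bytes
print(parse_npy_header_sketch(io.BytesIO(blob)))  # ('<u4', (10,))
```

Only the header is read, so the token count of a large file can be obtained without loading its data.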

maxtext.input_pipeline.olmo_data.read_npy_header_from_path(path)

Convenience wrapper for parse_npy_header() on a local file.

Parameters:

path (str)

Return type:

Tuple[str, Tuple[int, …]]

maxtext.input_pipeline.olmo_data.read_raw_metadata_from_path(path, dtype)

Headerless raw binary: n_tokens = file_size // itemsize.

AI2’s .npy-extension files are actually raw uint32 dumps with no header; OLMo-core reads them with np.memmap and a known dtype.

Parameters:
  • path (str)

  • dtype (str)

Return type:

Tuple[str, Tuple[int, …]]
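
The file_size // itemsize rule can be illustrated with a small standard-library sketch. The function name and the hard-coded item-size table are hypothetical, and returning (dtype, (n_tokens,)) is an assumption inferred from the documented return type.

```python
import os
import tempfile
from typing import Tuple

# Item sizes in bytes for a few unsigned integer dtypes (illustrative subset).
ITEMSIZE = {"uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8}


def read_raw_metadata_sketch(path: str, dtype: str) -> Tuple[str, Tuple[int, ...]]:
    """Derive the token count of a headerless file from its size on disk."""
    n_tokens = os.path.getsize(path) // ITEMSIZE[dtype]
    return dtype, (n_tokens,)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.npy")
    with open(path, "wb") as f:
        f.write(b"\x00" * 40)  # 40 bytes = 10 uint32 tokens
    meta = read_raw_metadata_sketch(path, "uint32")

print(meta)  # ('uint32', (10,))
```

Because there is no header to consult, the caller must supply the correct dtype; a wrong dtype yields a wrong token count without any error.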

maxtext.input_pipeline.olmo_data.has_npy_magic(first_bytes)

Quick check: does this look like a real .npy file?

Parameters:

first_bytes (bytes)

Return type:

bool
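
A check like this typically compares the leading bytes against the .npy magic prefix. The sketch below assumes that, and also shows how a caller might use the result to choose between header parsing and the raw-binary path; both function names are hypothetical and the routing is an assumption, not documented behavior.

```python
NPY_MAGIC = b"\x93NUMPY"  # magic prefix defined by the .npy format


def looks_like_npy(first_bytes: bytes) -> bool:
    """True if the buffer starts with the .npy magic prefix."""
    return first_bytes.startswith(NPY_MAGIC)


def pick_reader(first_bytes: bytes) -> str:
    """Name the metadata path a caller might take for this file."""
    return "npy-header" if looks_like_npy(first_bytes) else "raw"


print(pick_reader(b"\x93NUMPY\x01\x00"))          # npy-header
print(pick_reader(b"\x01\x00\x00\x00\x02\x00"))   # raw
```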

maxtext.input_pipeline.olmo_data.build_index(paths_and_labels, sequence_length, *, tokenizer, header_reader=read_npy_header_from_path)

Build an OlmoNpyIndex from (path, label) entries.

Order matters: the global instance ordering is the concatenation of the files in this order. header_reader is the seam that tests use to avoid touching disk; production callers pass a GCS-aware reader.

Parameters:
  • paths_and_labels (Sequence[Tuple[str, str]])

  • sequence_length (int)

  • tokenizer (str)

Return type:

OlmoNpyIndex
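
The header_reader seam can be illustrated with a self-contained sketch that mirrors the documented behavior: walk the (path, label) pairs in order, ask the reader for each file's shape, and accumulate instance offsets. FileEntrySketch, build_entries_sketch, and fake_header_reader are hypothetical names; the real build_index() also records the tokenizer, totals, and fingerprint, which are omitted here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

HeaderReader = Callable[[str], Tuple[str, Tuple[int, ...]]]


@dataclass(frozen=True)
class FileEntrySketch:
    path: str
    label: str
    n_tokens: int
    n_instances: int
    instance_offset: int


def build_entries_sketch(
    paths_and_labels: Sequence[Tuple[str, str]],
    sequence_length: int,
    header_reader: HeaderReader,
) -> List[FileEntrySketch]:
    """Assign each file its instance count and starting global index."""
    entries: List[FileEntrySketch] = []
    offset = 0
    for path, label in paths_and_labels:
        _dtype, shape = header_reader(path)  # no disk access if the reader is a fake
        n_tokens = shape[0]
        n_instances = n_tokens // sequence_length
        entries.append(FileEntrySketch(path, label, n_tokens, n_instances, offset))
        offset += n_instances
    return entries


# In-memory stand-in for a real header reader, as a test might supply.
fake_shapes = {"a.npy": 10, "b.npy": 7}


def fake_header_reader(path: str) -> Tuple[str, Tuple[int, ...]]:
    return "<u4", (fake_shapes[path],)


entries = build_entries_sketch([("a.npy", "web"), ("b.npy", "code")], 4, fake_header_reader)
for entry in entries:
    print(entry.path, entry.n_instances, entry.instance_offset)
```

Swapping the two input pairs would give b.npy offset 0 and a.npy offset 1, which is why the input order is part of what a restart must preserve.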

class maxtext.input_pipeline.olmo_data.RepetitionTuple(start, end, period, times)

Bases: NamedTuple

arr[start:end] is a periodic span whose repeating unit has length period; times = (end - start) // period is the number of repetitions.

Parameters:
  • start (int)

  • end (int)

  • period (int)

  • times (int)

start: int

Alias for field number 0

end: int

Alias for field number 1

period: int

Alias for field number 2

times: int

Alias for field number 3
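
A small worked example of the field relationship, using a stand-in NamedTuple with the same four fields (RepetitionTupleSketch is a hypothetical name, defined here so the snippet runs without MaxText installed):

```python
from typing import NamedTuple


class RepetitionTupleSketch(NamedTuple):
    start: int
    end: int
    period: int
    times: int


arr = [7, 1, 2, 1, 2, 1, 2, 9]

# The pattern [1, 2] occupies arr[1:7] and repeats three times.
start, end, period = 1, 7, 2
rep = RepetitionTupleSketch(start, end, period, (end - start) // period)

print(rep.times)  # 3
print(arr[rep.start:rep.end] == arr[rep.start:rep.start + rep.period] * rep.times)  # True
```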

maxtext.input_pipeline.olmo_data.find_periodic_sequences(arr, max_period, min_period=1, mask_value=-1)

Yield RepetitionTuple for periodic spans of length ≥ 3 in arr.

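mask_value is the padding value used when arr is reshaped, so it must not appear in arr. The default of -1 corresponds to the maximum uint32 value (4294967295) when interpreted as unsigned, which is above any realistic vocabulary size; pass a different out-of-vocabulary sentinel if your vocabulary uses that id.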

Parameters:
  • arr (ndarray)

  • max_period (int)

  • min_period (int)

  • mask_value (int)

Return type:

Generator[RepetitionTuple, None, None]
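
To show what a periodic span looks like in practice, here is a deliberately naive pure-Python scanner. It is not the reshape-and-pad algorithm this function ports from OLMo-core (so it needs no mask_value), and it may report spans with different boundaries than the real function. The names are hypothetical, and the min_times=3 threshold is an assumption that reads "length ≥ 3" above as "at least three repetitions".

```python
from typing import Iterator, NamedTuple, Sequence


class Rep(NamedTuple):
    start: int
    end: int
    period: int
    times: int


def find_periodic_spans_naive(
    arr: Sequence[int], max_period: int, min_period: int = 1, min_times: int = 3
) -> Iterator[Rep]:
    """Yield maximal runs where arr[i] == arr[i + period] holds consecutively."""
    n = len(arr)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period < n:
            if arr[i] != arr[i + period]:
                i += 1
                continue
            run_start = i
            while i + period < n and arr[i] == arr[i + period]:
                i += 1
            # arr[run_start:end] repeats with this period.
            end = i + period
            times = (end - run_start) // period
            if times >= min_times:
                yield Rep(run_start, end, period, times)


spans = list(find_periodic_spans_naive([7, 1, 2, 1, 2, 1, 2, 9], max_period=3))
print(spans)  # [Rep(start=1, end=7, period=2, times=3)]
```

The scan costs O(len(arr) × number of periods) in pure Python, which is fine for illustration but far slower than a vectorized approach.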

maxtext.input_pipeline.olmo_data.is_clean_instance(input_ids, *, repetition_max_period=13, repetition_min_period=1, repetition_max_count=32, mask_value=-1)

Returns False if and only if input_ids contains a periodic span, with period in [repetition_min_period, repetition_max_period], that repeats at least repetition_max_count times. The defaults match OLMo-core’s _validate_instance.

Parameters:
  • input_ids (ndarray)

  • repetition_max_period (int)

  • repetition_min_period (int)

  • repetition_max_count (int)

  • mask_value (int)

Return type:

bool
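
The filtering rule can be sketched without NumPy as follows. This is a naive approximation of the described behavior, not the module's code: the helper names are hypothetical, the repetition count comes from a simple consecutive-match scan rather than from find_periodic_sequences(), and only the default values (13, 1, 32) are taken from the signature above.

```python
from typing import Sequence


def longest_repeat_count(ids: Sequence[int], period: int) -> int:
    """Largest number of back-to-back repetitions of any length-`period` unit."""
    best_run = 0
    run = 0
    for i in range(len(ids) - period):
        if ids[i] == ids[i + period]:
            run += 1
            best_run = max(best_run, run)
        else:
            run = 0
    # A run of m consecutive matches spans m + period tokens.
    return (best_run + period) // period


def is_clean_sketch(
    ids: Sequence[int], *, max_period: int = 13, min_period: int = 1, max_count: int = 32
) -> bool:
    """False if any period in [min_period, max_period] repeats >= max_count times."""
    for period in range(min_period, max_period + 1):
        if longest_repeat_count(ids, period) >= max_count:
            return False
    return True


varied = list(range(100))
degenerate = [1, 2, 3] + [4, 5] * 32 + [6]  # the pair (4, 5) repeats 32 times

print(is_clean_sketch(varied))      # True
print(is_clean_sketch(degenerate))  # False
```

An instance with the pair repeated only 31 times would pass under the default threshold, since the comparison is against repetition_max_count itself.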