OLMo numpy pipeline (`dataset_type=olmo_grain`)

Grain-based input pipeline for AI2’s pre-tokenized OLMo data mixes (e.g. OLMo-mix-0925-official.txt). Reads headerless flat .npy token streams from a gcsfuse mount, shards across hosts, optionally masks repeated-n-gram instances, and yields the shapes the MaxText pretrain trainer expects.
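
To make the read path concrete, here is a minimal sketch of cutting a headerless flat token file into fixed-length instances. It is not the pipeline's actual code: the `read_instances` name is hypothetical, the `uint32` dtype follows the default described in the Notes below, and dropping the trailing partial window is an assumption.

```python
import numpy as np


def read_instances(path, sequence_length, dtype=np.uint32):
    """Yield fixed-length token windows from a headerless flat token file.

    Hypothetical helper for illustration; the real pipeline may differ.
    """
    # Use np.memmap rather than np.load: the file holds raw token IDs only,
    # with no .npy header to parse.
    tokens = np.memmap(path, dtype=dtype, mode="r")
    # Assumption: a trailing partial window is dropped.
    num_instances = len(tokens) // sequence_length
    for i in range(num_instances):
        start = i * sequence_length
        # Windows are cut by position only, so a document may straddle
        # two neighbouring instances. np.array copies the window into memory.
        yield np.array(tokens[start:start + sequence_length])
```

Because the file is memory-mapped, only the windows that are actually read get paged in, which is why a local (gcsfuse) path is needed.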

Quick start

Download the data to a GCS bucket. --mix-file is a local AI2 manifest listing relative npy paths to fetch from AI2’s public bucket (e.g. OLMo-mix-0925-official.txt for the 6T pretrain mix or OLMo-midtraining-mix-0625-100B.txt for the 100B midtraining mix).
```
python tools/data_generation/download_olmo_data_to_gcs.py \
    --mix-file ./OLMo-mix-0925-official.txt \
    --gcs-dest gs://my-bucket/dataset/ \
    --staging-dir /mnt/local-ssd/olmo-staging \
    --workers 16
```

Mount it read-only with gcsfuse (np.memmap needs a local path):

```
gcsfuse --implicit-dirs --o ro my-bucket /mnt/olmo-readonly
```

Build the index:

```
python tools/data_generation/build_olmo_npy_index.py \
    --mix-file /path/to/OLMo-mix-0925-official.txt \
    --gcs-base gs://my-bucket/dataset/ \
    --tokenizer allenai/dolma3-tokenizer \
    --sequence-length 8192 \
    --output /path/to/olmo_index_seq8192.json
```

Configure + run the trainer:

```
dataset_type: olmo_grain
olmo_index_path: /path/to/olmo_index_seq8192.json
olmo_path_remap_from: "gs://my-bucket/"
olmo_path_remap_to:   "/mnt/olmo-readonly/"
max_target_length: 8192        # must equal index sequence_length
tokenizer_type: huggingface
tokenizer_path: allenai/Olmo-3-7B-Instruct
```
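
The two remap settings and the sequence-length constraint can be read as the following sketch. It assumes the remap is a plain prefix swap applied to each path in the index. The `remap_path` and `check_sequence_length` helpers and the `example.npy` file name are hypothetical, not the trainer's actual code.

```python
def remap_path(path: str, remap_from: str, remap_to: str) -> str:
    """Swap a leading prefix; paths without that prefix are left unchanged."""
    if remap_from and path.startswith(remap_from):
        return remap_to + path[len(remap_from):]
    return path


def check_sequence_length(index_sequence_length: int, max_target_length: int) -> None:
    """Reject a config whose max_target_length differs from the index."""
    if index_sequence_length != max_target_length:
        raise ValueError(
            f"OLMo index sequence_length={index_sequence_length} "
            f"but config.max_target_length={max_target_length}"
        )


# Hypothetical index entry: a gs:// path becomes a path under the gcsfuse mount.
local = remap_path(
    "gs://my-bucket/dataset/example.npy",
    "gs://my-bucket/",
    "/mnt/olmo-readonly/",
)
check_sequence_length(8192, 8192)  # passes: index and config agree
```

Under this reading, the remap prefix is the bucket root rather than `gs://my-bucket/dataset/`, because gcsfuse mounts the whole bucket at `/mnt/olmo-readonly`.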

See scripts/run_olmo3_7b_grain_smoke.sh for a runnable smoke launcher, or src/maxtext/trainers/pre_train/scripts/olmo/ for end-to-end stage-1 pretraining launchers (single-host + XPK).

Resume

The sampler is stateless: the record at step k is a pure function of (seed, shard, k). On startup, the trainer adapter reads the latest step from config.checkpoint_dir and shifts the sampler so the data stream picks up where it left off. No Grain iterator state is stored in the checkpoint.

scripts/run_olmo3_7b_grain_resume_test.sh validates this end-to-end.
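
The idea of a stateless sampler can be illustrated with a small sketch. This is not the MaxText sampler: the `record_at_step` name, the per-epoch permutation, and the `num_records` parameter are assumptions chosen only to show why resuming needs nothing beyond the step number.

```python
import numpy as np


def record_at_step(seed: int, shard: int, k: int, num_records: int) -> int:
    """Return the record index for step k; depends only on the arguments."""
    epoch, offset = divmod(k, num_records)
    # One generator per (seed, shard, epoch): the same inputs always rebuild
    # the same permutation, so no iterator state has to be saved.
    rng = np.random.default_rng([seed, shard, epoch])
    return int(rng.permutation(num_records)[offset])


# A fresh run and a run resumed at step 4 see the same records from step 4 on.
full_run = [record_at_step(0, 1, k, 5) for k in range(10)]
resumed = [record_at_step(0, 1, k, 5) for k in range(4, 10)]
```

Rebuilding the permutation on every call keeps the sketch short but is wasteful; a real sampler would compute the index more cheaply.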

Notes

- Files are headerless raw uint32 by default (matches AI2’s published format). The numpy .npy extension is misleading.
- Documents may span instance boundaries; this matches OLMo-core.
- olmo_apply_ngram_filter: True (default) zeroes loss on instances with ≥ 32 repetitions of any 1–13-gram, per OLMo-core.
- For mixing pretraining + midtraining, build a combined index by concatenating the two .txt mix files.
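
The repeated-n-gram rule above can be sketched as follows. This assumes "repetitions" means back-to-back repeats of the same 1 to 13 token pattern; OLMo-core's exact logic may differ. The function names are hypothetical.

```python
import numpy as np


def longest_true_run(mask: np.ndarray) -> int:
    """Length of the longest run of consecutive True values in a boolean array."""
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the value flips; they alternate run start, run end.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    if changes.size == 0:
        return 0
    return int((changes[1::2] - changes[0::2]).max())


def has_repeated_ngram(tokens, max_period: int = 13, min_repetitions: int = 32) -> bool:
    """True if some pattern of 1..max_period tokens repeats back to back
    at least min_repetitions times."""
    tokens = np.asarray(tokens)
    for period in range(1, max_period + 1):
        # tokens[i] == tokens[i + period] holds throughout a periodic span.
        same = tokens[period:] == tokens[:-period]
        # A run of r matches covers r + period tokens, i.e. at least
        # min_repetitions copies once r >= period * (min_repetitions - 1).
        if longest_true_run(same) >= period * (min_repetitions - 1):
            return True
    return False


def loss_mask(tokens) -> np.ndarray:
    """All-zero mask for a flagged instance, all-one otherwise."""
    n = len(tokens)
    return np.zeros(n, dtype=np.int32) if has_repeated_ngram(tokens) else np.ones(n, dtype=np.int32)
```

In this sketch a flagged instance still flows through the pipeline; only its loss contribution is zeroed, which keeps batch shapes unchanged.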

Troubleshooting

| Symptom | Fix |
| --- | --- |
| `OLMo index sequence_length=N but config.max_target_length=M` | Rebuild the index with `--sequence-length M`. |
| `q_block_size=512 should divide q_seq_len=…` | Set `max_target_length` to a multiple of 512. |
| OOM during compile on a small TPU | Shrink with `override_model_config=True base_num_decoder_layers=N`, use `weight_dtype=bfloat16`. |
| Resume restarts at step 0 | Iterator log should print `resumed_step=N initial_step=…`; if both 0, `checkpoint_dir` is empty or wrong. |