OLMo numpy pipeline (dataset_type=olmo_grain)#
Grain-based input pipeline for AI2’s pre-tokenized OLMo data mixes (e.g.
OLMo-mix-0925-official.txt). Reads headerless flat .npy token streams
from a gcsfuse mount, shards across hosts, optionally masks repeated-n-gram
instances, and yields the shapes the MaxText pretrain trainer expects.
Quick start#
Download the data to a GCS bucket.
--mix-file is a local AI2 manifest listing relative npy paths to fetch from AI2’s public bucket (e.g. OLMo-mix-0925-official.txt for the 6T pretrain mix or OLMo-midtraining-mix-0625-100B.txt for the 100B midtraining mix).

    python tools/data_generation/download_olmo_data_to_gcs.py \
      --mix-file ./OLMo-mix-0925-official.txt \
      --gcs-dest gs://my-bucket/dataset/ \
      --staging-dir /mnt/local-ssd/olmo-staging \
      --workers 16

Mount it read-only with gcsfuse (np.memmap needs a local path):

    gcsfuse --implicit-dirs --o ro my-bucket /mnt/olmo-readonly

Build the index:

    python tools/data_generation/build_olmo_npy_index.py \
      --mix-file /path/to/OLMo-mix-0925-official.txt \
      --gcs-base gs://my-bucket/dataset/ \
      --tokenizer allenai/dolma3-tokenizer \
      --sequence-length 8192 \
      --output /path/to/olmo_index_seq8192.json

Configure + run the trainer:

    dataset_type: olmo_grain
    olmo_index_path: /path/to/olmo_index_seq8192.json
    olmo_path_remap_from: "gs://my-bucket/"
    olmo_path_remap_to: "/mnt/olmo-readonly/"
    max_target_length: 8192  # must equal index sequence_length
    tokenizer_type: huggingface
    tokenizer_path: allenai/Olmo-3-7B-Instruct

See scripts/run_olmo3_7b_grain_smoke.sh for a runnable smoke launcher, or src/maxtext/trainers/pre_train/scripts/olmo/ for end-to-end stage-1 pretraining launchers (single-host + XPK).
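
As a rough illustration of what the olmo_path_remap_from / olmo_path_remap_to settings accomplish, the sketch below rewrites a gs:// path from the index onto the gcsfuse mount and memory-maps the file as fixed-length instances. This is not the pipeline's actual code: the helper names `remap_path` and `open_instances` are hypothetical, and dropping a trailing partial instance is an assumption made for the example.

```python
import numpy as np


def remap_path(path: str, remap_from: str, remap_to: str) -> str:
    """Rewrite a gs:// prefix to the local gcsfuse mount (hypothetical helper)."""
    if not path.startswith(remap_from):
        raise ValueError(f"{path!r} does not start with {remap_from!r}")
    return remap_to + path[len(remap_from):]


def open_instances(local_path: str, sequence_length: int, dtype=np.uint32) -> np.ndarray:
    """Memory-map a headerless token file and view it as fixed-length rows."""
    # np.memmap reads raw bytes, so no .npy header is expected or parsed.
    tokens = np.memmap(local_path, dtype=dtype, mode="r")
    # Assumption for this sketch: a trailing partial instance is dropped.
    n_instances = len(tokens) // sequence_length
    return tokens[: n_instances * sequence_length].reshape(n_instances, sequence_length)
```

With the settings shown above, `remap_path("gs://my-bucket/dataset/a.npy", "gs://my-bucket/", "/mnt/olmo-readonly/")` would give `/mnt/olmo-readonly/dataset/a.npy`, which can then be passed to `open_instances` with `sequence_length=8192`.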
Resume#
The sampler is stateless: the record read at step k is a pure function of (seed, shard, k). On startup, the trainer adapter reads the latest step from
config.checkpoint_dir and shifts the sampler so the data stream picks
up where it left off; no Grain iterator state is stored in the checkpoint.
scripts/run_olmo3_7b_grain_resume_test.sh validates this end-to-end.
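
The idea of a sampler that is a pure function of (seed, shard, k) can be sketched as follows. This is a simplified illustration, not the pipeline's sampler: the function names are hypothetical, one record per step is assumed, and the per-epoch permutation is recomputed on every call for clarity rather than speed.

```python
import numpy as np


def record_index(seed: int, shard: int, num_shards: int, step: int, num_records: int) -> int:
    """Return the global record index a shard reads at `step` (illustrative only)."""
    # Records each shard sees per epoch; any remainder is dropped in this sketch.
    per_shard = num_records // num_shards
    epoch, offset = divmod(step, per_shard)
    # The permutation depends only on (seed, epoch), never on prior calls.
    perm = np.random.default_rng([seed, epoch]).permutation(num_records)
    return int(perm[shard * per_shard + offset])


def resumed_stream(seed, shard, num_shards, num_records, start_step):
    """Yield record indices from `start_step` onward; no saved iterator state needed."""
    step = start_step
    while True:
        yield record_index(seed, shard, num_shards, step, num_records)
        step += 1
```

Because nothing is carried between calls, starting the stream at the checkpointed step reproduces exactly the records a non-interrupted run would have seen from that step on.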
Notes#
Files are headerless raw uint32 by default (matches AI2’s published format). The numpy .npy extension is misleading.
Documents may span instance boundaries; this matches OLMo-core.
olmo_apply_ngram_filter: True (default) zeroes loss on instances with ≥ 32 repetitions of any 1–13-gram, per OLMo-core.
For mixing pretraining + midtraining, build a combined index by concatenating the two .txt mix files.
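
To make the n-gram filter concrete, here is a minimal sketch of one way such a check could work. It assumes "repetitions" means back-to-back repeats of the same n-gram; OLMo-core's actual implementation may differ in detail, and the function names `has_repeated_ngram` and `loss_mask` are hypothetical.

```python
import numpy as np


def has_repeated_ngram(tokens: np.ndarray, max_period: int = 13, min_repeats: int = 32) -> bool:
    """True if some n-gram (1 <= n <= max_period) repeats back-to-back >= min_repeats times."""
    tokens = np.asarray(tokens)
    for period in range(1, max_period + 1):
        if len(tokens) < period * min_repeats:
            break  # longer periods need even more tokens
        # same[i] is True when token i equals the token one period later.
        same = tokens[:-period] == tokens[period:]
        # A pattern repeated r times yields a run of period * (r - 1) True values.
        needed = period * (min_repeats - 1)
        run = 0
        for flag in same:
            run = run + 1 if flag else 0
            if run >= needed:
                return True
    return False


def loss_mask(instance: np.ndarray) -> np.ndarray:
    """All-ones mask, or all-zeros when the instance trips the filter."""
    keep = not has_repeated_ngram(instance)
    return np.full(len(instance), int(keep), dtype=np.int32)
```

In this sketch a flagged instance still flows through the batch with its shape unchanged; only its loss contribution is zeroed.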
Troubleshooting#
| Symptom | Fix |
|---|---|
| | Rebuild the index with |
| | Set |
| OOM during compile on a small TPU | Shrink with |
| Resume restarts at step 0 | Iterator log should print |