OLMo numpy pipeline (dataset_type=olmo_grain)

Grain-based input pipeline for AI2’s pre-tokenized OLMo data mixes (e.g. OLMo-mix-0925-official.txt). Reads headerless flat .npy token streams from a gcsfuse mount, shards across hosts, optionally masks repeated-n-gram instances, and yields the shapes the MaxText pretrain trainer expects.
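As a rough illustration of what reading a headerless token stream involves, the sketch below memory-maps a file as raw uint32 and views it as fixed-length instances. The function names, the decision to drop a trailing partial instance, and the strided per-host assignment are assumptions made for illustration; the actual pipeline may shard differently.

```python
import numpy as np


def load_instances(path, sequence_length, dtype=np.uint32):
    """Memory-map a headerless token file and view it as (num_instances, sequence_length)."""
    # np.memmap reads raw bytes, so no .npy header is expected or parsed.
    tokens = np.memmap(path, dtype=dtype, mode="r")
    num_instances = len(tokens) // sequence_length
    # Assumption: tokens that do not fill a whole instance are dropped.
    usable = tokens[: num_instances * sequence_length]
    return usable.reshape(num_instances, sequence_length)


def host_instance_indices(num_instances, host_index, host_count):
    """Hypothetical strided sharding: host h takes instances h, h + H, h + 2H, ..."""
    return range(host_index, num_instances, host_count)
```

Because the array is memory-mapped, only the instances a host actually indexes are read through the gcsfuse mount.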

Quick start

  1. Download the data to a GCS bucket. --mix-file is a local AI2 manifest listing relative npy paths to fetch from AI2’s public bucket (e.g. OLMo-mix-0925-official.txt for the 6T pretrain mix or OLMo-midtraining-mix-0625-100B.txt for the 100B midtraining mix).

    python tools/data_generation/download_olmo_data_to_gcs.py \
        --mix-file ./OLMo-mix-0925-official.txt \
        --gcs-dest gs://my-bucket/dataset/ \
        --staging-dir /mnt/local-ssd/olmo-staging \
        --workers 16
    
  2. Mount it read-only with gcsfuse (np.memmap needs a local path):

    gcsfuse --implicit-dirs --o ro my-bucket /mnt/olmo-readonly
    
  3. Build the index:

    python tools/data_generation/build_olmo_npy_index.py \
        --mix-file /path/to/OLMo-mix-0925-official.txt \
        --gcs-base gs://my-bucket/dataset/ \
        --tokenizer allenai/dolma3-tokenizer \
        --sequence-length 8192 \
        --output /path/to/olmo_index_seq8192.json
    
  4. Configure + run the trainer:

    dataset_type: olmo_grain
    olmo_index_path: /path/to/olmo_index_seq8192.json
    olmo_path_remap_from: "gs://my-bucket/"
    olmo_path_remap_to:   "/mnt/olmo-readonly/"
    max_target_length: 8192        # must equal index sequence_length
    tokenizer_type: huggingface
    tokenizer_path: allenai/Olmo-3-7B-Instruct
    

    See scripts/run_olmo3_7b_grain_smoke.sh for a runnable smoke launcher, or src/maxtext/trainers/pre_train/scripts/olmo/ for end-to-end stage-1 pretraining launchers (single-host + XPK).
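The two olmo_path_remap_* settings in step 4 translate the gs:// URIs stored in the index into paths under the gcsfuse mount. A minimal sketch of that prefix rewrite, assuming a plain string replacement (remap_path and the file name part-000.npy are hypothetical, not the trainer's own code):

```python
def remap_path(path: str, remap_from: str, remap_to: str) -> str:
    """Swap a leading remap_from prefix for remap_to; leave other paths untouched."""
    if remap_from and path.startswith(remap_from):
        return remap_to + path[len(remap_from):]
    return path


# Hypothetical file name, using the bucket and mount point from the steps above.
print(remap_path(
    "gs://my-bucket/dataset/part-000.npy",
    "gs://my-bucket/",
    "/mnt/olmo-readonly/",
))
# → /mnt/olmo-readonly/dataset/part-000.npy
```

Note that both prefixes end with a slash, so the bucket root maps to the mount root and the dataset/ subdirectory is preserved.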

Resume

The sampler is stateless: the record served at step k is a pure function of (seed, shard, k). On startup, the trainer adapter reads the latest step from config.checkpoint_dir and offsets the sampler by that step, so the data stream picks up where it left off and no Grain iterator state has to be saved in the checkpoint.

scripts/run_olmo3_7b_grain_resume_test.sh validates this end-to-end.
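To make the "pure function of (seed, shard, k)" idea concrete, here is a toy sampler with that property. It is a sketch only: record_index, the per-epoch permutation, and the way the seed is combined are assumptions, and it recomputes a full permutation on every call, which a real sampler would avoid.

```python
import numpy as np


def record_index(seed: int, shard: int, step: int, num_records: int) -> int:
    """Return the record served at `step`; depends only on the arguments."""
    epoch, offset = divmod(step, num_records)
    # A fresh generator keyed on (seed, shard, epoch) reproduces the same
    # permutation on every call, so nothing has to be carried between steps.
    rng = np.random.default_rng([seed, shard, epoch])
    return int(rng.permutation(num_records)[offset])


# Resuming at step 5 yields the same records as an uninterrupted run.
full = [record_index(0, 0, k, 8) for k in range(8)]
resumed = [record_index(0, 0, k, 8) for k in range(5, 8)]
assert resumed == full[5:]
```

With a sampler of this shape, the only thing a restart needs is the step number, which the checkpoint directory already provides.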

Notes

  • Files are headerless raw uint32 by default (matches AI2’s published format). Despite the .npy extension they carry no NumPy header, so read them with np.memmap or np.fromfile rather than np.load.

  • Documents may span instance boundaries; this matches OLMo-core.

  • olmo_apply_ngram_filter: True (default) zeroes loss on instances with ≥ 32 repetitions of any 1–13-gram, per OLMo-core.

  • For mixing pretraining + midtraining, build a combined index by concatenating the two .txt mix files.
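The repeated-n-gram filter mentioned above can be sketched as follows. This is one reading of the rule, not OLMo-core's code: it treats "repetitions" as back-to-back repeats of a pattern of length 1 to 13, and the function names and the all-or-nothing float32 loss mask are assumptions.

```python
import numpy as np


def longest_true_run(mask: np.ndarray) -> int:
    """Length of the longest run of consecutive True values in a 1-D boolean array."""
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.flatnonzero(np.diff(padded))  # alternating run starts and run ends
    if edges.size == 0:
        return 0
    return int((edges[1::2] - edges[0::2]).max())


def has_repeated_ngram(tokens, max_period=13, min_repetitions=32) -> bool:
    """True if some n-gram (1 <= n <= max_period) repeats back-to-back often enough."""
    tokens = np.asarray(tokens)
    for period in range(1, max_period + 1):
        if tokens.size < period * min_repetitions:
            break  # longer periods need even more tokens
        run = longest_true_run(tokens[period:] == tokens[:-period])
        # A run of `run` matches at lag `period` spans run + period tokens.
        if (run + period) // period >= min_repetitions:
            return True
    return False


def loss_mask_for(tokens) -> np.ndarray:
    """All-zero mask for flagged instances, all-one otherwise."""
    tokens = np.asarray(tokens)
    fill = 0.0 if has_repeated_ngram(tokens) else 1.0
    return np.full(tokens.shape, fill, dtype=np.float32)
```

Masking the loss rather than dropping the instance keeps batch shapes fixed, which is convenient for a compiled training step.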

Troubleshooting

| Symptom | Fix |
| --- | --- |
| OLMo index sequence_length=N but config.max_target_length=M | Rebuild the index with --sequence-length M. |
| q_block_size=512 should divide q_seq_len=… | Set max_target_length to a multiple of 512. |
| OOM during compile on a small TPU | Shrink with override_model_config=True base_num_decoder_layers=N, use weight_dtype=bfloat16. |
| Resume restarts at step 0 | Iterator log should print resumed_step=N initial_step=…; if both 0, checkpoint_dir is empty or wrong. |
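The first two symptoms can be caught before launching a run. The helper below is a hypothetical pre-flight check, not part of MaxText: it takes the index's sequence length and the configured max_target_length as plain integers, and the default block size of 512 is taken from the error message above.

```python
def preflight_problems(index_sequence_length: int,
                       max_target_length: int,
                       q_block_size: int = 512) -> list[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    if index_sequence_length != max_target_length:
        problems.append(
            f"index sequence_length={index_sequence_length} but "
            f"max_target_length={max_target_length}: rebuild the index with "
            f"--sequence-length {max_target_length}"
        )
    if max_target_length % q_block_size != 0:
        problems.append(
            f"max_target_length={max_target_length} is not a multiple of "
            f"q_block_size={q_block_size}"
        )
    return problems


for problem in preflight_problems(8192, 4096):
    print(problem)
```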