maxtext.layers.engram module#

DeepSeek-AI, "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" (https://arxiv.org/pdf/2601.07372), 2026

Reference implementation: deepseek-ai/Engram

class maxtext.layers.engram.CompressedTokenizer(tokenizer)[source]#

Bases: object

A canonicalizing wrapper that reduces vocabulary sparsity for n-gram lookup.

This class maps semantically equivalent tokens (e.g., “Apple”, “ apple”, “APPLE”) to a single unified ID. This many-to-one mapping significantly reduces the combinatorial size of the n-gram space.

Parameters:

tokenizer (HFTokenizer)

lookup_table#

Array mapping original_id -> compressed_id.

num_new_token#

Size of the compressed vocabulary.
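
To illustrate the many-to-one mapping described for this class, here is a minimal sketch in plain Python. The helper name `build_lookup_table`, the toy `vocab` list, and the canonicalization rule (Unicode normalization, whitespace stripping, lowercasing) are assumptions made for illustration; the actual class may canonicalize differently.

```python
import unicodedata


def build_lookup_table(vocab):
    """Map each original token ID to a compressed ID shared by equivalent tokens.

    `vocab` is a hypothetical list where vocab[i] is the decoded text of token i.
    """
    canonical_to_new = {}
    lookup_table = []
    for text in vocab:
        # Assumed canonical form: NFKC-normalize, strip whitespace, lowercase.
        key = unicodedata.normalize("NFKC", text).strip().lower()
        if key not in canonical_to_new:
            canonical_to_new[key] = len(canonical_to_new)
        lookup_table.append(canonical_to_new[key])
    return lookup_table, len(canonical_to_new)


vocab = ["Apple", " apple", "APPLE", "pie", " Pie"]
lookup_table, num_new_token = build_lookup_table(vocab)
```

In this toy vocabulary the three spellings of "apple" share one compressed ID, so five original tokens collapse to two. The result plays the same role as the `lookup_table` and `num_new_token` attributes above.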

class maxtext.layers.engram.NgramHashMapping(engram_vocab_bases, max_ngram_size, engram_num_heads, layer_ids, tokenizer, pad_id, seed)[source]#

Bases: object

Deterministically maps token indices to n-gram hash indices for embedding lookups.

This class implements Multi-Head Hashing to bypass the combinatorial memory requirements of explicit n-gram vocabularies. Specifically, it applies multiplicative-XOR hashing to each n-gram window.

Key mechanisms for collision mitigation:

  • Multi-Head Factorization: Uses K distinct hash heads per n-gram order to increase effective capacity within fixed memory constraints.

  • Unique Prime Moduli: Assigns a unique prime vocabulary size to each head to minimize simultaneous collisions.

Parameters:
  • engram_vocab_bases (List[int])

  • max_ngram_size (int)

  • engram_num_heads (int)

  • layer_ids (List[int])

  • tokenizer (HFTokenizer)

  • pad_id (int)

  • seed (int)
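
The multiplicative-XOR hashing described above can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the module's implementation: the function name `ngram_hash`, the way multipliers are drawn from the seed, and the left-padding with `pad_id` are all illustrative choices.

```python
import random


def ngram_hash(tokens, n, multipliers, modulus, pad_id):
    """Hash the length-n window ending at each position into [0, modulus)."""
    # Assumed: left-pad so early positions still have a full window.
    padded = [pad_id] * (n - 1) + list(tokens)
    hashes = []
    for t in range(len(tokens)):
        window = padded[t : t + n]
        mix = 0
        for tok, mult in zip(window, multipliers):
            # Multiply each token by its per-position multiplier, then XOR-fold.
            mix ^= tok * mult
        hashes.append(mix % modulus)
    return hashes


rng = random.Random(0)  # hypothetical seed value
# One odd multiplier per position in the window (assumed scheme).
multipliers = [2 * rng.randrange(1, 1 << 30) + 1 for _ in range(2)]
moduli = [10007, 10009]  # one distinct prime modulus per hash head
tokens = [5, 7, 5, 7]
per_head = [ngram_hash(tokens, 2, multipliers, m, pad_id=0) for m in moduli]
```

Because the mapping is deterministic, the bigram (5, 7) hashes to the same index at positions 1 and 3 within each head. Two different n-grams that collide under one modulus are unlikely to collide under the other as well.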

get_vocab_sizes(layer_id)[source]#

Returns a flattened list of prime vocabulary sizes for a specific layer.

Parameters:

layer_id (int)

Return type:

List[int]
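
As a rough illustration of how unique prime vocabulary sizes could be assigned and then flattened, consider the sketch below. The helper names `is_prime` and `unique_primes` and the search strategy (scan upward from each base, skipping primes already used) are assumptions; the module may select primes differently.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small illustrative sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def unique_primes(bases, num_heads):
    """For each n-gram order's base size, pick num_heads primes not used before."""
    seen = set()
    sizes = []
    for base in bases:
        head_sizes = []
        candidate = base
        while len(head_sizes) < num_heads:
            if is_prime(candidate) and candidate not in seen:
                seen.add(candidate)
                head_sizes.append(candidate)
            candidate += 1
        sizes.append(head_sizes)
    return sizes


sizes = unique_primes([100, 100], num_heads=2)
# Flatten across n-gram orders, analogous to what get_vocab_sizes returns.
flat = [p for order in sizes for p in order]
```

Even though both n-gram orders start from the same base of 100, every head ends up with its own prime.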

class maxtext.layers.engram.StaticWrapper(val)[source]#

Bases: object

Wrapper to prevent nnx from treating the value as a variable.

class maxtext.layers.engram.MultiHeadEmbedding(*args, **kwargs)[source]#

Bases: Module

A flattened table representation for multi-head embedding spaces across n-gram orders.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any
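
One common way to realize a flattened table for several heads is to concatenate the per-head tables and add a per-head row offset to each hash index. The sketch below shows that idea in plain Python; the names `head_offsets` and `flat_lookup` and the toy table are hypothetical, and the real module operates on arrays rather than lists.

```python
from itertools import accumulate


def head_offsets(vocab_sizes):
    """Starting row of each head inside the concatenated table."""
    return [0] + list(accumulate(vocab_sizes))[:-1]


def flat_lookup(table, offsets, head_indices):
    """Fetch one row per head by shifting each head-local index by its offset."""
    return [table[off + idx] for off, idx in zip(offsets, head_indices)]


vocab_sizes = [3, 5, 7]  # per-head vocabulary sizes (toy values)
offsets = head_offsets(vocab_sizes)
# Toy table: 15 rows of dimension 2, where row r holds the value r.
table = [[float(r)] * 2 for r in range(sum(vocab_sizes))]
rows = flat_lookup(table, offsets, [2, 0, 6])
```

A single table of `sum(vocab_sizes)` rows lets one gather serve all heads, instead of one lookup per head.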

class maxtext.layers.engram.ShortConv(*args, **kwargs)[source]#

Bases: Module

Depthwise causal 1D convolution, with multi-branch integration.

Applies local temporal smoothing:

  • Independent RMSNorms to each branch

  • Convolution to mix time steps [t-k, t]

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any
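
The core operation, a depthwise causal 1D convolution, can be sketched in plain Python as below. The sketch covers only the convolution over time steps; the per-branch RMSNorm and the multi-branch handling are omitted, and the function name and list-based layout are assumptions for illustration.

```python
def depthwise_causal_conv1d(x, kernels):
    """Apply a per-channel causal filter along the time axis.

    x:       list of T time steps, each a list of C channel values.
    kernels: list of C filters, each with K taps; the last tap weights the
             current step t and the first tap weights step t - (K - 1).
    """
    num_steps, num_channels, num_taps = len(x), len(x[0]), len(kernels[0])
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            acc = 0.0
            for j in range(num_taps):
                src = t - (num_taps - 1) + j
                if src >= 0:  # positions before the sequence start count as zero
                    acc += kernels[c][j] * x[src][c]
            out[t][c] = acc
    return out


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
kernels = [[0.5, 0.5], [0.0, 1.0]]  # channel 0 averages; channel 1 is identity
y = depthwise_causal_conv1d(x, kernels)
```

"Depthwise" means each channel is filtered independently with its own kernel. "Causal" means the output at step t depends only on inputs at steps up to t.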

class maxtext.layers.engram.Engram(*args, **kwargs)[source]#

Bases: Module

Engram memory layer with n-gram embeddings and multi-branch integration.

Main components:

  • Context-independent Retrieval: Fetch static n-gram embeddings via Multi-Head Hashing.

  • Context-aware Gating: Compute similarity between memory (Key) and context (Query) to determine relevance.

  • Mix: Apply local temporal smoothing via convolution.

Parameters:
  • args (Any)

  • kwargs (Any)

Return type:

Any
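
The context-aware gating step can be pictured with the small sketch below. The specific form used here (a sigmoid of a scaled dot product between query and key, applied as a scalar gate on the retrieved value) is an assumption chosen for illustration. The function name `gate_memory` is hypothetical, and the layer's actual projections and normalization are not shown.

```python
import math


def gate_memory(query, key, value):
    """Scale the retrieved memory `value` by a scalar gate in (0, 1)."""
    dim = len(query)
    # Similarity between the context (query) and the memory (key).
    score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
    gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    return gate, [gate * v for v in value]


value = [4.0, -2.0, 0.0, 1.0]
# Orthogonal query and key: similarity 0, so the gate sits at 0.5.
neutral_gate, neutral_out = gate_memory([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], value)
# Aligned query and key: positive similarity opens the gate further.
aligned_gate, aligned_out = gate_memory([2.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], value)
```

Under this assumed form, a memory that matches the current context passes through with more weight, while an unrelated one is attenuated.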