# Conditional Memory for Small Language Models
*An Empirical Evaluation of N-Gram Hash Lookup Tables*
## Abstract
We investigate whether conditional memory mechanisms—specifically, N-gram hash lookup tables ("Engram" modules)—can improve the downstream performance of small language models at consumer hardware scale. Engram modules augment transformer layers with trainable embedding tables indexed by rolling N-gram hashes, providing the model with direct access to token-sequence statistics. Across four experiments spanning four model sizes (0.5B, 1.5B, 2B, and 4B parameters), two model families (Qwen2.5 and Qwen3.5), and two training regimes (frozen base model with Engram-only training, and full joint fine-tuning), we observe a consistent and striking pattern: Engram modules reduce perplexity by 22-30% on held-out text, yet produce no measurable improvement on downstream benchmarks (HellaSwag, PIQA, ARC-Challenge). All downstream deltas fall within ±2 percentage points of baseline, with no consistent directional trend. We analyze why perplexity gains fail to transfer, discuss implications for conditional memory research at small scale, and provide practical guidance for future work.
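As a concrete illustration, the lookup path described above can be sketched in a few lines: a rolling hash over the last N token ids selects a row of a trainable table, whose vector would be added to a layer's hidden state. All names, hash constants, and the table layout below are our own assumptions for illustration; the evaluated module may differ.

```python
# Minimal sketch of an Engram-style n-gram hash lookup.
# N, TABLE_SIZE, and MULT are illustrative assumptions, not reported values.

N = 3                 # n-gram order (assumption)
TABLE_SIZE = 2 ** 20  # number of hash buckets (assumption)
MULT = 1_000_003      # rolling-hash multiplier (assumption)

def ngram_hash(token_ids, n=N, table_size=TABLE_SIZE, mult=MULT):
    """Hash the last n token ids into a bucket index in [0, table_size)."""
    h = 0
    for t in token_ids[-n:]:
        h = (h * mult + t) % table_size
    return h

# A trainable embedding table would map each bucket to a vector;
# a plain dict stands in for it in this sketch.
table = {}

def engram_vector(token_ids, dim=4):
    """Fetch (or lazily create) the vector for the current n-gram bucket."""
    idx = ngram_hash(token_ids)
    return table.setdefault(idx, [0.0] * dim)
```

Note that only the trailing `N` tokens influence the bucket, so `ngram_hash([5, 1, 2, 3])` and `ngram_hash([1, 2, 3])` land in the same row: the table can only ever encode local token-sequence statistics, which is relevant to the transfer failure discussed below.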
## Key Findings
### Perplexity drops sharply, downstream stays flat
Across all four experiments, Engram modules reduced perplexity by 22-30% on held-out text. Yet HellaSwag, PIQA, and ARC-Challenge accuracy moved within ±2 percentage points of baseline — well within noise.
### The effect is surface-level
Layer ablation on Qwen2.5-1.5B showed that a single Engram layer at position 1 captures 86% of the total perplexity gain. This suggests the module primarily captures local token statistics rather than building transferable representations.
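The 86% figure can be read as the single-layer share of the full-stack perplexity gain. A small sketch using the 1.5B numbers from the results table; the single-layer perplexity of 11.56 is back-derived by us for illustration, not a reported measurement:

```python
base_ppl = 14.84    # Qwen2.5-1.5B, no Engram (from the results table)
full_ppl = 11.02    # all Engram layers active (from the results table)
single_ppl = 11.56  # only the layer-1 Engram active (assumed for illustration)

# Fraction of the total perplexity gain captured by layer 1 alone.
share = (base_ppl - single_ppl) / (base_ppl - full_ppl)
print(f"layer-1 share of gain: {share:.0%}")
```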
### Scale may be the deciding factor
DeepSeek-V3 (671B parameters, 14.8T training tokens) saw meaningful benefits from Engram. Our experiments at 0.5B-4B parameters did not. The mechanism's effectiveness may be fundamentally tied to model scale.
### Full fine-tuning doesn't help
Joint fine-tuning of both the base model and Engram parameters (Experiment 4, Qwen3.5-2B) showed the same pattern: 29.9% perplexity reduction, zero downstream improvement. Co-adaptation doesn't solve the transfer problem.
## Results
| Model | Mode | Base PPL | Engram PPL | ΔPPL (%) | HellaSwag Δ (pp) | PIQA Δ (pp) | ARC-C Δ (pp) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | Frozen, 100K | 20.75 | 15.73 | -24.2 | +0.35 | -1.14 | +0.34 |
| Qwen2.5-1.5B | Frozen, 5K | 14.84 | 11.02 | -25.7 | +0.30 | n/a | -0.61 |
| Qwen3.5-4B | Frozen, 10K | 12.87 | 9.76 | -24.2 | -0.02 | -0.33 | +1.79 |
| Qwen3.5-2B | Full FT, 10K | 16.31 | 11.44 | -29.9 | +0.17 | -0.71 | -0.85 |
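The perplexity-reduction column follows directly from the two perplexity columns; a quick check of the table's arithmetic:

```python
# (base PPL, Engram PPL) pairs copied from the results table.
rows = {
    "Qwen2.5-0.5B": (20.75, 15.73),
    "Qwen2.5-1.5B": (14.84, 11.02),
    "Qwen3.5-4B":   (12.87, 9.76),
    "Qwen3.5-2B":   (16.31, 11.44),
}

for name, (base, engram) in rows.items():
    reduction = (engram / base - 1.0) * 100  # negative = improvement
    print(f"{name}: {reduction:+.1f}%")
```

The printed values reproduce the table: -24.2, -25.7, -24.2, and -29.9.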
## Analysis
A 30% perplexity reduction would be a headline result in most papers, yet in our experiments it corresponded to no measurable downstream improvement. The most compelling explanation combines two factors: the modules improve local prediction accuracy where N-gram statistics are directly predictive, rather than conferring global reasoning capability; and the perplexity gains reflect memorization of training patterns rather than transferable representations.
This has broader implications for conditional memory research. Perplexity is widely used as a proxy for model quality, but our results suggest it can be deeply misleading when evaluating architectural interventions that inject static memory. A model can become much better at predicting the next token without becoming better at reasoning, understanding, or answering questions.
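A toy numeric illustration of how this divergence can arise (the loss split below is invented, not measured): if a memory module cuts cross-entropy only on tokens that n-gram statistics already predict, mean loss and hence perplexity drop sharply, while the hard tokens that benchmarks probe are untouched.

```python
import math

def perplexity(losses):
    """Perplexity is exp of the mean per-token cross-entropy."""
    return math.exp(sum(losses) / len(losses))

# Invented per-token losses: 70 "n-gram predictable" tokens, 30 "hard" tokens.
easy, hard = [2.0] * 70, [5.0] * 30
base_ppl = perplexity(easy + hard)

# An Engram-style table only makes the predictable tokens cheaper.
easy_memorized = [1.5] * 70
engram_ppl = perplexity(easy_memorized + hard)

reduction = (1.0 - engram_ppl / base_ppl) * 100
print(f"perplexity: {base_ppl:.1f} -> {engram_ppl:.1f} ({reduction:.0f}% lower)")
# Hard-token loss, the part downstream benchmarks stress, is unchanged.
```

In this contrived setup perplexity falls by roughly 30%, in the range we observed, even though nothing about the hard-token distribution has changed.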
For practitioners working with small models on consumer hardware, the practical takeaway is clear: Engram-style conditional memory is not a viable path to improving downstream performance at this scale. Resources may be better spent on larger base models, better training data, or established techniques like knowledge distillation.
## References
- [1] DeepSeek-AI et al. DeepSeek-V3 Technical Report. 2024.
- [2] Grave et al. Improving neural language models with a continuous cache. ICLR, 2017.
- [3] Khandelwal et al. Generalization through memorization: Nearest neighbor language models. ICLR, 2020.
- [4] Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020.
- [5] Zellers et al. HellaSwag: Can a machine really finish your sentence? ACL, 2019.
- [6] Bisk et al. PIQA: Reasoning about physical commonsense in natural language. AAAI, 2020.
- [7] Clark et al. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. 2018.