Tate Hertel
April 21, 2026 · Machine Learning · NLP · Efficient Attention · Conditional Routing · Negative Result

Testing Conditional Attention Routing at Consumer Scale

An Evaluation of Learning When to Attend Under Resource Constraints


Abstract

We evaluate "Learning When to Attend" (L2A), a conditional attention routing mechanism that learns token-wise routing between local sliding-window attention and global full attention, under severe resource constraints. Testing on Qwen2.5 models at 0.5B, 1.5B, and 3B scale on a single RTX 4090 with 1000x less training data than the original paper, we identify a collapse threshold at approximately 1 token per trainable parameter. Router-only training (65K-130K params) produces healthy 60-87% sparsity, but all attention fine-tuning regimes (88M-1.7B params) catastrophically collapse: loss drops to zero, sparsity reaches 100%, and generation degrades to repetitive loops. Even the successful router-only regime degrades downstream benchmarks by 1-11 percentage points. This study characterizes L2A at one specific operating point and does not refute the original paper's claims at full scale.

Key Findings

What we found

01

The collapse threshold is ~1 token per trainable parameter

Router-only training (65K-130K params, 62-184 tokens/param) produces healthy routing with 60-87% learned sparsity, while every attention fine-tuning regime (88M-1.7B params, 0.005-0.14 tokens/param) catastrophically collapses, placing the threshold near one token per trainable parameter.
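The routing mechanism can be pictured as a tiny per-token learned gate sitting in front of the two attention paths. The sketch below is our own minimal reading of token-wise routing, not the L2A authors' implementation; the gate weights, bias, threshold, and toy dimensions are all illustrative:

```python
import math
import random

# Minimal sketch of token-wise conditional attention routing (our reading
# of the mechanism, not the authors' code). A learned gate scores each
# token's hidden state; tokens below the threshold take the cheap local
# sliding-window path, the rest take global attention. "Sparsity" is the
# fraction routed to the local path.

random.seed(0)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def route_tokens(hidden, w, b, threshold=0.5):
    """Per-token routing decisions: True = global attention, False = local."""
    scores = [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]
    return [s >= threshold for s in scores]

def sparsity(decisions):
    """Fraction of tokens routed to the local sliding-window path."""
    return 1.0 - sum(decisions) / len(decisions)

# Toy example: 8-dim hidden states for 1000 tokens, random gate weights.
# The negative bias biases tokens toward the local path.
dim, n = 8, 1000
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
w = [random.gauss(0, 0.5) for _ in range(dim)]
decisions = route_tokens(hidden, w, b=-1.0)
print(f"local-attention sparsity: {sparsity(decisions):.2f}")
```

A healthy router in our experiments lands in the 0.60-0.87 sparsity band; a collapsed one drives this quantity to 1.0, routing every token locally.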

02

Even healthy routing degrades downstream performance

HellaSwag accuracy drops 1.4-11.0 percentage points, PIQA drops 0-4.6pp, and ARC-Challenge drops 0.8-7.4pp across all router-only experiments. Only the 3B model with window equal to context shows minimal degradation.

03

L2A destroys long-range retrieval at small scale

Needle-in-a-haystack retrieval collapses to 0-20% for 0.5B and 1.5B models, versus 60-80% baseline. The mechanism designed to improve long-range retrieval instead destroys it at consumer scale.

04

Diverse data helps routing but cannot prevent collapse

4-domain training data (narrative, math, code, science) improves router-only perplexity by 32%, but cannot rescue attention fine-tuning from catastrophic collapse.

Results

Main results

| Scale | Regime                 | Data     | Params | PPL   | HellaSwag | PIQA  | ARC-C | NIAH      |
|-------|------------------------|----------|--------|-------|-----------|-------|-------|-----------|
| 0.5B  | Baseline (full causal) | ---      | ---    | 11.52 | 0.391     | 0.700 | 0.289 | 60-80%    |
| 0.5B  | Router                 | Diverse  | 65K    | 22.38 | 0.354     | 0.688 | 0.234 | 0-20%     |
| 0.5B  | L2A FT                 | Diverse  | 88M    | 60.83 | 0.368     | 0.674 | 0.236 | 0%        |
| 0.5B  | Full FT                | Diverse  | 538M   | 39.21 | 0.362     | 0.668 | 0.232 | 0%        |
| 1.5B  | Baseline (full causal) | ---      | ---    | 8.47  | 0.492     | 0.750 | 0.358 | 80%       |
| 1.5B  | Router                 | Diverse  | 130K   | 13.91 | 0.382     | 0.718 | 0.284 | 0%        |
| 1.5B  | L2A FT                 | Diverse  | 308M   | 21.21 | ---       | ---   | ---   | Collapsed |
| 3B    | Router                 | WikiText | 65K    | 8.00  | 0.452     | 0.766 | 0.354 | 80%       |

Analysis

What it means

The central finding is a collapse threshold determined by the token-to-parameter ratio rather than the training regime itself. With 8-12M training tokens, only router-only training (65K-130K params) receives sufficient data per parameter. The original paper's positive results used 16.7-25B tokens, over 1000x more, which would place every regime well above the collapse threshold.
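The arithmetic behind the threshold is worth making explicit. In the sketch below, the parameter counts come from the experiments above, while the per-regime token budgets are illustrative values chosen within the stated 8-12M range (the exact per-run budgets are an assumption here):

```python
# Tokens-per-trainable-parameter arithmetic behind the collapse threshold.
# Parameter counts are from the experiments above; token budgets are
# illustrative values within the stated 8-12M range, not exact per-run
# figures.

REGIMES = {
    # name: (trainable params, training tokens)
    "0.5B router":  (65_000,      12_000_000),
    "1.5B router":  (130_000,      8_000_000),
    "0.5B L2A FT":  (88_000_000,  12_000_000),
    "1.5B L2A FT":  (308_000_000,  8_000_000),
    "0.5B full FT": (538_000_000, 12_000_000),
}

THRESHOLD = 1.0  # ~1 token per trainable parameter, the observed cutoff

def tokens_per_param(params: int, tokens: int) -> float:
    return tokens / params

for name, (params, tokens) in REGIMES.items():
    ratio = tokens_per_param(params, tokens)
    status = "ok" if ratio >= THRESHOLD else "collapse expected"
    print(f"{name:>12}: {ratio:10.3f} tok/param -> {status}")
```

Only the two router regimes clear the ~1 tok/param bar; every fine-tuning regime sits orders of magnitude below it.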

The downstream degradation under healthy router-only training likely stems from three factors: the dual-path attention disrupts residual stream representations through 24-36 layers, the 5x RoPE frequency scaling shifts all positional encodings, and the routing decision itself introduces noise at small scale. The 3B model's minimal degradation suggests larger models can absorb these perturbations more readily.
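To see why rescaling RoPE frequencies shifts every positional encoding: RoPE rotates each channel pair by an angle proportional to both position and frequency, so multiplying the frequencies by 5 makes position p look exactly like position 5p to the attention mechanism. A minimal sketch using the standard RoPE convention (base 10000; L2A's exact parameterization may differ):

```python
# Why 5x RoPE frequency scaling shifts all positional encodings.
# Each channel pair i of a d-dim head is rotated by
#   angle = pos * base**(-2i/d),
# so scaling the frequency by 5 is indistinguishable from moving the
# token from position p to position 5p. Base 10000 is the standard RoPE
# convention; L2A's exact parameterization may differ.

def rope_angles(pos: int, dim: int = 8, base: float = 10000.0, scale: float = 1.0):
    """Rotation angles for each channel pair at a given position."""
    return [scale * pos * base ** (-2 * i / dim) for i in range(dim // 2)]

orig = rope_angles(100)
scaled = rope_angles(100, scale=5.0)
# Every channel's rotation angle is exactly 5x larger: the token at
# position 100 "appears" to sit at position 500.
print(orig)
print(scaled)
```

This is why the scaling perturbs all positions at once, not just long-range ones, and plausibly contributes to the uniform downstream drops.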

The NIAH failure is particularly striking: L2A is explicitly designed to improve long-range retrieval, yet it destroys that capability at the scales where we can test it. Only the 3B model preserves NIAH at 80%, suggesting the mechanism requires sufficient model capacity to maintain retrieval alongside routing.
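A needle-in-a-haystack evaluation of the kind reported here is simple to harness. In the sketch below, `answer` is a hypothetical stand-in for a model call (here a trivial string-search oracle, so the harness itself can be checked end to end); in practice you would swap in the model's `generate` call:

```python
import random

# Minimal needle-in-a-haystack harness of the kind behind the NIAH
# numbers above. `answer` is a hypothetical stub for a model call; it is
# a trivial string-search oracle here so the harness runs end to end.

random.seed(0)
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(needle: str, n_sentences: int, depth: float) -> str:
    """Embed the needle at a relative depth in filler text."""
    sents = [FILLER] * n_sentences
    sents.insert(int(depth * n_sentences), needle + " ")
    return "".join(sents)

def answer(context: str, question: str) -> str:
    # Hypothetical model stub: returns the sentence containing the key.
    for sent in context.split(". "):
        if "magic number" in sent:
            return sent
    return ""

def niah_accuracy(n_trials: int = 10) -> float:
    hits = 0
    for _ in range(n_trials):
        secret = str(random.randint(1000, 9999))
        ctx = build_haystack(f"The magic number is {secret}.", 200, random.random())
        if secret in answer(ctx, "What is the magic number?"):
            hits += 1
    return hits / n_trials

print(f"NIAH accuracy: {niah_accuracy():.0%}")
```

With a real model in place of the stub, averaging over needle depths and context lengths yields the retrieval percentages in the table above.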

For practitioners, the practical takeaway is that L2A's benefits are strongly scale-dependent. At consumer hardware scale with limited training budgets, only the router-only regime is viable, and it comes with measurable downstream costs. The paper's full fine-tuning approach requires training budgets proportional to the trainable parameter count, which is impractical on a single GPU.

References

Cited works

[1] L2A Authors. Learning When to Attend: Conditional Memory Access for Long-Context LLMs. arXiv:2603.17484, 2026.
[2] Qwen Team. Qwen2.5 Technical Report. arXiv:2412.15115, 2024.
[3] Zellers et al. HellaSwag: Can a Machine Really Finish Your Sentence? ACL, 2019.
[4] Bisk et al. PIQA: Reasoning about Physical Commonsense in Natural Language. AAAI, 2020.
[5] Clark et al. Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. 2018.