When Less is Less: The LIMO Hypothesis Fails at Small Scale
A Replication Study of 'Less is More for Reasoning'
Abstract
The LIMO hypothesis (Ye et al., 2025) posits that sophisticated mathematical reasoning can emerge from minimal but high-quality demonstrations in sufficiently capable language models. At 32B parameters, fine-tuning on just 817 curated examples reportedly improves AIME 2024 performance from 16.5% to 63.3% and MATH500 accuracy to 95.6%. We test whether this "less is more" effect survives at 1.5B scale using Qwen2.5-1.5B-Instruct. Surprisingly, we find the opposite: fine-tuning on either the LIMO dataset (817 examples) or the comparable s1K dataset (1,000 examples) catastrophically degrades performance. On MATH500, accuracy drops from 49.4% to 26.4% (LIMO) and 27.8% (s1K) — a loss of over 20 percentage points (p < 10⁻¹¹). On GSM8K, accuracy drops from 72.5% to 49.8% and 47.4% respectively (p ≈ 0). Both datasets cause statistically indistinguishable degradation (p = 0.62), suggesting the failure is scale-dependent rather than data-dependent.
Key Findings
Fine-tuning destroys reasoning at 1.5B scale
At 32B, LIMO training improves MATH500 by +46pp. At 1.5B, it drops accuracy from 49.4% to 26.4% — a catastrophic loss of 23 percentage points. GSM8K drops from 72.5% to 49.8%. Both degradations are highly significant (p < 10⁻¹¹).
Both datasets cause statistically indistinguishable degradation
LIMO (817 examples) and s1K (1,000 examples) produce statistically indistinguishable results (p = 0.62 on MATH500). The failure is driven by model scale, not data quality.
The model learns format, loses knowledge
Training loss converged normally, and the model generates long chain-of-thought reasoning traces after fine-tuning. It learned the format of reasoning but lost the mathematical knowledge needed to solve problems correctly — a textbook case of catastrophic forgetting.
Results
| Condition | MATH500 | GSM8K | AIME 2024 (pass@1) |
|---|---|---|---|
| Baseline (Qwen2.5-1.5B-Instruct) | 49.4% | 72.5% | 2.5% |
| + LIMO fine-tuning (817 ex) | 26.4%*** | 49.8%*** | 0.8% |
| + s1K fine-tuning (1,000 ex) | 27.8%*** | 47.4%*** | 0.0% |
| LIMO (Qwen2.5-32B, published) | 95.6% | 97.8% | 63.3% |

*** Significant degradation vs. baseline (p < 10⁻¹¹ on MATH500; p ≈ 0 on GSM8K).
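The reported p-values can be approximately reproduced from the accuracies alone. Below is a minimal sketch, assuming a two-sided two-proportion z-test over the 500 MATH500 problems; the test the study actually used is not specified here, and the correct-answer counts are inferred from the reported percentages:

```python
from math import sqrt, erf

def two_prop_z_test(k1, n1, k2, n2):
    """Two-sided two-proportion z-test: are two accuracies distinguishable?"""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal survival function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# MATH500 correct counts inferred from the reported accuracies (n = 500):
baseline, limo, s1k = 247, 132, 139   # 49.4%, 26.4%, 27.8%

print(two_prop_z_test(baseline, 500, limo, 500))  # baseline vs. LIMO: well below 1e-11
print(two_prop_z_test(limo, 500, s1k, 500))       # LIMO vs. s1K: ~0.62
```

Under these assumptions the baseline-vs-LIMO comparison comes out around 10⁻¹³ and LIMO-vs-s1K around 0.62, consistent with the figures in the text.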
Analysis
The most natural explanation is catastrophic forgetting. The 1.5B model has limited representational capacity, and when fine-tuned on 817-1,000 long reasoning traces, it appears to overwrite its pre-trained mathematical knowledge with the patterns of the training data. The model successfully learned the training data (loss converged to 0.82 for LIMO, 0.44 for s1K) and generates reasoning traces in the correct format — but the underlying knowledge is gone.
This has direct implications for the knowledge elicitation hypothesis. The LIMO paper frames their results as evidence that reasoning capabilities are latent in large models and need only be demonstrated. Our results qualify this: the model must be large enough for the knowledge to exist with sufficient redundancy to survive the fine-tuning process. Below that threshold, the "elicitation" mechanism breaks down and is replaced by destructive overwriting.
The biggest limitation is our use of LoRA rather than the full-parameter fine-tuning of the original paper, so we cannot fully separate the effect of model scale from the effect of training method. However, LoRA at rank 64 is a relatively expressive adapter, and the fact that both datasets produce statistically indistinguishable degradation suggests the issue lies in model capacity rather than in the training method.
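To give a rough sense of that expressiveness, a rank-64 adapter carries a non-trivial fraction of each adapted projection's degrees of freedom. A back-of-the-envelope sketch (the hidden size of 1536 for Qwen2.5-1.5B is an assumption here; LoRA trains only the low-rank factors A and B per adapted weight matrix):

```python
# Trainable parameters of a rank-64 LoRA adapter on one d x d projection,
# compared with full fine-tuning of that same projection.
d, r = 1536, 64          # d: assumed hidden size of Qwen2.5-1.5B; r: LoRA rank used here

full_params = d * d      # full fine-tuning updates every weight in the projection
lora_params = 2 * d * r  # LoRA trains only A (r x d) and B (d x r)
ratio = lora_params / full_params

print(full_params, lora_params, round(ratio, 3))  # 2359296 196608 0.083
```

Roughly 8% of each projection's parameters is plenty to impose large behavioral changes, which is consistent with the adapter being expressive enough for the observed forgetting; whether full-parameter fine-tuning would behave differently at this scale remains an open question, as noted above.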
For practitioners, the takeaway is clear: reasoning-focused fine-tuning with small datasets is not safe for models below approximately 7B parameters. The model's ability to generate plausible reasoning traces can be deeply misleading — it may have lost the knowledge needed for those traces to be correct.
References
- [1] Ye et al. LIMO: Less is More for Reasoning. arXiv:2502.03387, 2025.
- [2] Muennighoff et al. s1: Simple Test-Time Scaling. arXiv:2501.19393, 2025.
- [3] Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
- [4] Hendrycks et al. Measuring Mathematical Problem Solving with the MATH Dataset. arXiv:2103.03874, 2021.
- [5] Cobbe et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
- [6] Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.