May 10, 2025
The Illusion of Progress? Rethinking Reasoning in LLMs
For nearly two years, GPT-4 held the crown. It felt like the party was over — again. Just like after Deep Blue and Watson's Jeopardy! win, or when convolutional nets and autonomous driving plateaued, progress seemed to stall. But then came OpenAI's O1 and DeepSeek's R1 — so-called "reasoning models" — and everything felt possible again. These models didn't just predict text; they appeared to think. Or did they?
The Calm Before the Breakthrough
While competitors emerged and open-source projects flourished, none clearly outperformed GPT-4. Some came close, others were smaller but impressive. Still, no one shattered the ceiling. It looked like a familiar AI winter was looming.
Then Came the Reasoners
OpenAI introduced "O1," and shortly after, DeepSeek released "R1." Both models made headlines for their apparent reasoning capabilities. They weren't just mimicking human output — they were solving complex problems through multi-step thinking. Or at least that was the impression.
What Do We Mean by "Reasoning"?
In the early days, prompting models with instructions like "think step-by-step" noticeably improved their performance. This technique, known as Chain-of-Thought (CoT), effectively made the model talk to itself, using its own autoregressive outputs as a kind of working memory. But earlier models lacked the context length to sustain meaningful CoT chains.
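To make the mechanism concrete, here is a minimal sketch of how a CoT prompt differs from a direct one. The `generate` function is a hypothetical stand-in for whatever completion API you happen to use, not part of any specific library; only the prompt construction matters.

```python
# Minimal sketch of Chain-of-Thought prompting.
# `generate` is a hypothetical placeholder for an LLM completion call.

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM completion endpoint."""
    # A real implementation would call a model; here we just echo the prompt.
    return f"[model output for: {prompt!r}]"

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompt: the model must produce the answer in one shot.
direct_prompt = f"{question}\nAnswer:"

# CoT prompt: the instruction nudges the model to spell out intermediate
# steps, which then sit in the context window and act as working memory
# for the final answer.
cot_prompt = f"{question}\nLet's think step by step."

print(generate(direct_prompt))
print(generate(cot_prompt))
```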
Advancements like FlashAttention removed that limit, enabling much longer context windows and allowing models to "think out loud" at length. This shifted computational effort from training to inference, which was a meaningful evolution in its own right.
OpenAI capitalized on this with O1. The model was likely fine-tuned using long CoT examples, learning to emulate structured internal dialogues. It worked. O1 broke performance barriers and seemed to move us closer to AGI.
DeepSeek and RLVR
DeepSeek's contribution went even further. Among many innovations, it introduced (or at least popularized) Reinforcement Learning with Verifiable Rewards (RLVR). The idea echoes AlphaGo's success: train by self-play, using the outcome as feedback. This works well when it's easy to tell who won — like in Go.
But language is messier. DeepSeek focused on domains where verifying correctness is tractable — like coding and math. Their R1 model, trained with RLVR, eventually began generating CoT outputs without being prompted. This looked like an emergent property. Had we trained models into true reasoning?
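What makes a reward "verifiable" is easiest to see in code. The sketch below is not DeepSeek's actual reward function, just an illustration of the general shape for math-style problems: the final answer is extracted from the completion and checked mechanically against a known ground truth, so no human judge or learned reward model is needed.

```python
# Illustrative sketch of a verifiable reward for math-style RLVR.
# Not DeepSeek's implementation; just the general idea: the reward is
# computed by comparing the extracted answer with a known reference.

import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# Reward signal for one sampled completion during an RL update.
sample = "... so the area is \\boxed{42}"
print(verifiable_reward(sample, "42"))  # 1.0
```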
RLVR was intended to create new reasoning capabilities — to train models not just to find correct answers, but to reason their way to them. That was the promise. But whether it delivered on that promise is another story.
The Paper That Changed the Narrative
The paper in question, by Yue et al., took a different angle. LLMs are usually benchmarked via one-shot tasks — a single prompt, a single answer. But models are probabilistic and can generate many valid answers with different likelihoods. Some benchmarks therefore count a problem as solved if any of the top k samples is correct ("pass@k", e.g. pass@5). Yue et al. pushed further: they analyzed up to 200 samples per problem.
Surprisingly, in most cases, the correct answer and a valid CoT chain were already present in the 200 outputs from the base model — even without RLVR fine-tuning. The difference? Reasoning models were better at surfacing those outputs as top-1.
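Multi-sample evaluations like this are usually scored with the unbiased pass@k estimator introduced by Chen et al. (2021); the exact protocol of Yue et al. may differ, but the sketch below shows how the metric behaves when 200 completions are sampled.

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n sampled
# completions of which c are correct, estimate the probability that at
# least one of k samples would be correct.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, only 3 of them correct: pass@1 is tiny, pass@200 is 1.0.
print(pass_at_k(200, 3, 1))    # ~0.015
print(pass_at_k(200, 3, 200))  # 1.0
```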
What That Means
This might sound like a footnote — but it isn't. The much-touted reasoning capabilities of RLVR-trained models weren't entirely new. The reasoning was latent in the base model. The RL process merely adjusted probabilities to favor it.
And there's a tradeoff. While reasoning models pick better answers more consistently, they tend to be less creative. RLVR, by optimizing for verifiability and correctness, narrows the model's exploratory reasoning space. It improves consistency and accuracy — but at the cost of diversity in thought paths.
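A toy example makes the tradeoff tangible: if you view RLVR as sharpening the base model's distribution over candidate solution paths, top-1 probability rises while entropy, a crude proxy for diversity, falls. The numbers below are invented purely for illustration.

```python
# Toy illustration: "RL adjusts probabilities" seen as sharpening the
# base model's distribution over candidate solution paths. Top-1 mass
# goes up, entropy (diversity of reasoning paths) goes down.
# All numbers are made up for illustration.

import numpy as np

def entropy(p):
    p = np.asarray(p)
    return float(-(p * np.log(p)).sum())

# Base model: mass spread fairly evenly over six candidate paths.
base = np.array([0.20, 0.18, 0.17, 0.16, 0.15, 0.14])

# "RLVR-tuned" model: same support, mass shifted toward the verified path.
tuned = np.array([0.70, 0.08, 0.07, 0.06, 0.05, 0.04])

print("base  top-1:", base.max(),  "entropy:", round(entropy(base), 3))
print("tuned top-1:", tuned.max(), "entropy:", round(entropy(tuned), 3))
```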
So, Are We Getting Smarter Models?
You could argue that picking the right answer is what matters. After all, we care about outcomes, not internal monologues. But the real hope behind reasoning models was something more: new ways of thinking, new chains of thought that base models wouldn't invent.
This paper throws cold water on that dream — at least for now. We're not yet seeing models that reason beyond their training. We're just getting better at guiding them to reason when it helps.
Still, that's not nothing. Extending context, shifting effort to inference, and pushing model alignment through techniques like RLVR are essential steps. The path to AGI might not be a sudden leap in reasoning, but a steady uncovering of what was already there — hidden in the probabilities all along.
Is There Hope?
Interestingly, the same paper suggests that other training methods — like knowledge distillation — may offer a way forward. Unlike RLVR, which reweights existing reasoning paths, distillation can actually introduce new ones by synthesizing broader patterns from multiple teacher models. This could allow future systems to truly expand their reasoning capabilities — not just rearrange them. So while RLVR might not be the silver bullet we hoped, the search for smarter, more original models is far from over.
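For intuition on why distillation can add rather than merely reweight, here is a minimal sketch of soft-target distillation: the student is trained against the teacher's full token distribution, so it receives signal about paths it would not have sampled on its own. The shapes, temperature, and random tensors are placeholders, not a recipe from the paper.

```python
# Minimal sketch of knowledge distillation with soft targets.
# The student matches the teacher's (temperature-scaled) token
# distribution via KL divergence; all shapes here are placeholders.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Random tensors stand in for real model outputs (seq_len x vocab_size).
vocab, seq = 32000, 16
student_logits = torch.randn(seq, vocab, requires_grad=True)
teacher_logits = torch.randn(seq, vocab)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```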