The (Dis)Illusion of Thinking: Beyond the Collapse

Date: Jun 10, 2025

Author: Rubén Castillo Sánchez - Clintell Data Lead

The Illusion of Thinking has reignited debate on LLMs. But what if it signals a beginning, not an end?

In recent days, The Illusion of Thinking has become one of the most discussed papers in the debate about the limits of artificial intelligence. Its conclusions have been taken by many as definitive proof of a growing critique: that language models cannot be the foundation for general artificial intelligence because they merely predict the next word from statistical patterns. Perhaps, however, we are reading the paper as a verdict when we should be reading it as a symptom. This article offers a different perspective: not to deny the limits the study exposes, but to understand them in context and, perhaps, reinterpret what it means to “think” within these architectures.

What the Paper Actually Says

The study begins with a provocative hypothesis: if we want to understand the true reasoning capabilities of LLMs, it’s not enough to evaluate correct answers on contaminated benchmarks. That’s why the authors design controlled puzzle environments with transparent logic and adjustable complexity. Their results reveal a clear three-phase structure: on simple problems, “non-reasoning” models (those without intermediate reasoning steps) perform better than those that incorporate them; on medium-complexity tasks, reasoning helps noticeably; but once complexity passes a certain threshold, all models, including the most advanced, collapse. They not only fail more often but also “think less,” using fewer reasoning tokens precisely on the hardest problems. The conclusion is unsettling: there seems to be no generalizable reasoning, only a fragile illusion of thought.
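
To make the setup concrete, here is a minimal sketch, in Python and purely illustrative rather than the authors’ code, of what such a controlled environment looks like for a Tower of Hanoi puzzle, one of the families used in the paper: a single parameter sets the difficulty, and a verifier replays a model’s proposed moves against transparent rules, so both the final answer and every intermediate step can be checked.

# Illustrative sketch (not the paper's code): a Tower-of-Hanoi-style
# environment with one complexity knob (number of disks) and a verifier
# that replays a model's proposed move sequence step by step.

def initial_state(n_disks):
    # Three pegs; disks stacked largest (bottom) to smallest (top) on peg 0.
    return [list(range(n_disks, 0, -1)), [], []]

def is_legal(state, src, dst):
    if not state[src]:
        return False                              # nothing to move from src
    if state[dst] and state[dst][-1] < state[src][-1]:
        return False                              # larger disk on a smaller one
    return True

def verify_solution(n_disks, moves):
    """Replay a proposed move sequence; reject it at the first illegal move."""
    state = initial_state(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not is_legal(state, src, dst):
            return False, f"illegal move at step {i}"
        state[dst].append(state[src].pop())
    if state[2] == list(range(n_disks, 0, -1)):
        return True, "solved"
    return False, "all moves legal, but the puzzle is not solved"

# Example: the optimal 7-move solution for 3 disks passes the check.
print(verify_solution(3, [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]))

Because the rules are explicit, difficulty can be dialed up simply by increasing the number of disks, which is exactly the kind of controlled complexity scaling the paper relies on.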

Earlier Critiques: Theoretical vs. Empirical

Yann LeCun, among others, has long argued that LLMs cannot be the foundation for general intelligence because they are limited by their design principle: they are text prediction models. According to this view, they will never develop true understanding or causal reasoning because they are not built for it. While these critiques are relevant, they have largely been conceptual rather than experimental. What this paper contributes is empirical validation of those suspicions: it not only describes the theoretical limitations, it measures them precisely in contexts where reasoning should shine.

A Rare and Rigorous Experiment

Unlike many prior studies focused on benchmarks contaminated by training (like MATH500 or GSM8K), this paper uses puzzles that are novel to the models and governed by clear rules. It analyzes both the final answers and the intermediate traces of thought. Very few studies have done this with such control. Notable exceptions include Faith and Fate (Dziri et al., 2023), which shows the compositional limits of transformers, and Embers of Autoregression (McCoy et al., 2023), which examines how LLM performance depends on how closely a task matches their autoregressive training. But The Illusion of Thinking goes further: it doesn’t just criticize; it dissects the collapse.

The End of LLMs as a Path Toward AGI?

The paper’s results reinforce the possibility that LLMs alone may not be enough to achieve AGI. We may be reaching a ceiling where the Transformer architecture simply can’t go any further. If so, new structural approaches will be needed: neuro-symbolic models that integrate planning and explicit memory, architectures with separate modules for perception, deliberation, and action, or even hybrid systems that combine LLMs with external logic engines. That wouldn’t be a failure, but an evolution.
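
As a purely hypothetical illustration of that last option, consider a propose-and-verify loop: the LLM drafts a candidate answer and an external, rule-based checker either accepts it or returns a concrete error that is fed back into the next attempt. In the sketch below, call_llm and verify are placeholders for a real model API and a real logic engine; nothing here is a specific system from the paper.

# Hypothetical sketch of a hybrid loop: the LLM proposes, an external
# symbolic checker verifies, and failures are fed back as corrective context.
# `call_llm` and `verify` are placeholders for a real model API and a real
# logic engine (e.g., a planner, SAT solver, or rule checker).

from typing import Callable, Optional

def solve_with_verifier(task: str,
                        call_llm: Callable[[str], str],
                        verify: Callable[[str], Optional[str]],
                        max_rounds: int = 3) -> Optional[str]:
    """Ask the model for an answer, check it externally, retry with feedback."""
    prompt = task
    for _ in range(max_rounds):
        candidate = call_llm(prompt)
        error = verify(candidate)       # None means every rule was satisfied
        if error is None:
            return candidate            # verified answer
        prompt = f"{task}\nYour previous answer failed a check: {error}\nTry again."
    return None                         # no verified answer within the budget

The design point is that correctness no longer rests on the model’s internal chain of thought alone: whatever the LLM produces has to survive an explicit, inspectable check before it is accepted.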

But in the Meantime...

It’s hard to deny that, even if they don’t think the way we’d like, LLMs have changed the world. Since the introduction of the Transformer architecture in 2017 and of GPT-1 in 2018, we’ve gone from models that complete sentences to assistants that generate scientific hypotheses, spot coding errors, summarize papers, and suggest experiments. In fields like computational biology, legal research, or creative ideation, they are already producing useful knowledge. Maybe they don’t think, but sometimes it seems like they do. And that’s enough to stretch the boundaries of science.

Structural Limit or Just an Emergent One?

The collapse revealed by the paper is real, but it’s still unclear whether it’s structural or merely a matter of scale. The difference between the problems LLMs can solve and those they can’t seems to be a matter of compositional depth, not type. That suggests that—even though they collapse today—we can’t rule out that larger models, with more context, better training, or new forms of internal verification, might cross that threshold. The study of emergent properties is, by definition, nonlinear. No one expected models to start reasoning with chains of thought just from scaling. Something similar might happen with complex problems in the future.

Beyond the Collapse

The Illusion of Thinking confronts us with an uncomfortable truth: LLMs don’t reason as we thought. But it also opens a door. Knowing where they fail isn’t the end of the road—it’s the beginning of a new one. Maybe LLMs won’t be the foundation of AGI, or maybe we just haven’t crossed the right threshold of scale and architecture. In any case, studies like this help us map out what is possible today and what still isn’t. And that map will be key to building the intelligences of tomorrow.

Interested in working with us?

hello@clintell.io

Clintell Technology, S.L. has received funding from Sherry Ventures Innovation I FCR, co-financed by the European Union through the European Regional Development Fund (ERDF), for the implementation of the project "Development of Innovative Technologies for AI Algorithm Training" with the aim of promoting technological development, innovation, and high-quality research.

© 2025 Clintell Technology. All Rights Reserved
