Rebuttal Challenges Apple’s “Reasoning Collapse” Claim in Large Language Models

Apple’s recent study, “The Illusion of Thinking,” asserted that even advanced Large Reasoning Models (LRMs) fail on complex tasks, sparking considerable debate within the AI research community. A detailed rebuttal by Alex Lawsen of Open Philanthropy, co-authored with Anthropic’s Claude Opus model, challenges this conclusion, arguing that the original paper’s findings are largely attributable to experimental design flaws rather than inherent limitations in LRM reasoning capabilities.

Lawsen’s counter-argument, titled “The Illusion of the Illusion of Thinking,” doesn’t deny the struggles LRMs face with complex planning. Instead, it posits that Apple’s research conflates practical output constraints and flawed evaluation methods with genuine reasoning failure. The rebuttal highlights three key issues:

Firstly, Lawsen points out that Apple’s interpretation disregarded token budget limits. In the Tower of Hanoi puzzles, where Apple claimed model “collapse” at eight or more disks, models such as Claude were running into their output token limits and, in some cases, said so explicitly in their responses. The perceived failure was therefore a consequence of output constraints rather than an inability to reason (see the move-count sketch after these three points).

Secondly, the rebuttal criticizes the inclusion of unsolvable puzzle instances in Apple’s River Crossing test; Lawsen notes that instances with six or more actor/agent pairs and a boat that holds only three passengers have no solution at all. Models were penalized for correctly identifying these scenarios as impossible and declining to solve them, which misrepresents their actual reasoning abilities: the experimental design conflated failure to produce a solution with an inability to reason.

Thirdly, Lawsen argues against Apple’s automated evaluation pipelines, which judged models solely on complete move lists. This approach unfairly penalized models that generated partial or strategic solutions, failing to account for output truncation due to token limits. The rigid evaluation criteria obscured the models’ underlying reasoning processes.
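
To put the token-budget point in concrete terms: the shortest Tower of Hanoi solution for n disks requires 2^n - 1 moves, so a complete move list grows exponentially while a model’s output budget stays fixed. The short Lua sketch below (illustrative only, not code from Apple’s paper or Lawsen’s rebuttal) prints the minimum move counts for the disk counts in question.

    -- The optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves.
    local function min_moves(n)
      return math.floor(2 ^ n) - 1
    end

    for _, n in ipairs({ 8, 10, 12, 15 }) do
      print(string.format("%d disks require at least %d moves", n, min_moves(n)))
    end
    -- 8 disks already need 255 moves; 15 disks need 32767.

Each of those moves has to be serialized as several output tokens, on top of the model’s intermediate reasoning, which is the effect Lawsen attributes the apparent collapse to.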

To support his claims, Lawsen re-ran a subset of the Tower of Hanoi tests, asking models to generate a recursive Lua function that solves the puzzle rather than a complete move list. With this representation, models produced correct solutions for 15-disk problems, well beyond the complexity at which Apple reported complete failure. This suggests that, once the artificial output constraint is removed, LRMs can handle far more complex instances, at least at the level of generating the solving algorithm.
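
The article does not reproduce Lawsen’s exact prompt or the models’ outputs; the snippet below is only a sketch of what such a recursive Lua solution typically looks like, to show how compact the algorithmic representation is compared with an exhaustive move list.

    -- Illustrative sketch (not Lawsen's prompt or any model's verbatim output):
    -- a recursive solver that returns the full move sequence as a table,
    -- so the program stays about a dozen lines regardless of disk count.
    local function hanoi(n, from, to, via, moves)
      moves = moves or {}
      if n == 1 then
        moves[#moves + 1] = { disk = 1, from = from, to = to }
      else
        hanoi(n - 1, from, via, to, moves)  -- park the n-1 smaller disks on the spare peg
        moves[#moves + 1] = { disk = n, from = from, to = to }  -- move the largest disk
        hanoi(n - 1, via, to, from, moves)  -- stack the smaller disks back on top
      end
      return moves
    end

    local moves = hanoi(15, "A", "C", "B")
    print(#moves)  --> 32767 moves (2^15 - 1) from roughly fifteen lines of code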

The implications of this debate extend beyond typical research nitpicking. The original Apple paper has been widely cited as evidence of fundamental limitations in LLM reasoning, a potentially misleading interpretation. Lawsen’s rebuttal suggests a more nuanced reality: while LLMs may struggle with long-form token enumeration under current deployment constraints, their underlying reasoning mechanisms may be more robust than initially suggested.

While Lawsen’s findings don’t entirely exonerate LRMs – true algorithmic generalization remains a challenge – they underscore the importance of rigorous evaluation methodologies. He proposes several improvements for future research, including designing evaluations that distinguish between reasoning capability and output constraints, verifying puzzle solvability, using complexity metrics that reflect computational difficulty, and considering multiple solution representations. Ultimately, the question isn’t whether LRMs can reason, but whether our evaluations accurately measure their reasoning abilities.
