Apple Researchers Challenge the “Reasoning” Capabilities of Large Language Models

A recent research paper from Apple casts doubt on the widely touted “reasoning” abilities of leading large language models (LLMs). The study, authored by a team of Apple machine learning researchers including Samy Bengio, Senior Director of AI and Machine Learning Research, challenges claims made by companies such as OpenAI, Anthropic, and Google about the advanced reasoning capabilities of models like OpenAI’s o3-mini, Anthropic’s Claude 3.7 Sonnet, and Google’s Gemini.
The researchers argue that the industry’s assessment of LLM reasoning is significantly overstated, characterizing it as an “illusion of thinking.” Their analysis focuses on the methodology used to benchmark these models, highlighting concerns about data contamination and a lack of insight into the structure and quality of the models’ reasoning processes. Using “controllable puzzle environments,” synthetic tasks whose complexity can be scaled precisely and which are unlikely to appear in training data, the Apple team conducted extensive experiments to evaluate the models’ actual reasoning capabilities.
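To make the methodology concrete, the sketch below shows what a controllable puzzle environment of this kind might look like, using Tower of Hanoi with a single complexity knob (the number of disks) and a verifier for proposed solutions. This is an illustrative assumption, not the paper’s actual benchmark code, and all function names here are hypothetical.

```python
# Illustrative sketch only: a "controllable puzzle environment" in the spirit of
# the paper's description, using Tower of Hanoi with the disk count as the
# complexity knob. Not Apple's benchmark code; structure and names are assumed.

def hanoi_moves(n, source=0, target=2, spare=1):
    """Generate the optimal move sequence (2^n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_moves(n - 1, spare, target, source))

def is_valid_solution(n, moves):
    """Check a proposed move list against the rules: only top disks move,
    and a larger disk is never placed on a smaller one."""
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # success: all disks on the target peg

# Scaling the single complexity parameter lets an evaluator watch where a
# model's accuracy collapses, e.g. by scoring model-proposed moves at each n.
for n in range(3, 11):
    assert is_valid_solution(n, hanoi_moves(n))  # sanity check on the verifier
```

Because the optimal solution length grows as 2^n − 1, increasing the disk count gives a precise complexity axis along which a model’s accuracy can be tracked, which is the kind of control standard math and coding benchmarks do not offer.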
The results revealed a concerning trend: a “complete accuracy collapse” in LLMs beyond a certain complexity threshold, with reasoning accuracy falling to near zero even when the models had ample computational budget. The paper also describes an “overthinking” phenomenon, in which models expend excessive reasoning effort on simpler problems yet reduce that effort as problems grow harder. These findings align with broader observations of an increased propensity for hallucinations in newer-generation reasoning models, suggesting potential limitations in current development approaches.
The Apple researchers further highlight inconsistencies in how LLMs approach problem-solving. They found that the models fail to apply explicit algorithms reliably, even when a correct procedure is supplied, and that they reason inconsistently across structurally similar puzzles. The team concludes that these findings raise critical questions about the true reasoning capabilities of current LLMs, particularly given the substantial financial investment and computational power dedicated to their development.
This research adds to the growing debate surrounding the limitations of current LLM technology. While companies continue to invest heavily in developing increasingly powerful models, Apple’s findings suggest that fundamental challenges remain in achieving truly generalizable reasoning capabilities. The implications of this research are significant, particularly for the future development and application of LLMs across various sectors.
The timing of the publication is also noteworthy, given Apple’s relatively cautious approach to integrating AI into its consumer products. While the company has promised a suite of Apple Intelligence tools, this research can be read as a sober assessment of the current state of the technology, and it suggests that development strategies across the AI industry as a whole may need re-evaluation.