I advocate for AI in education openly, but advocacy without evidence is empty. Every few months a new “reasoning” model lands and the marketing language gets bolder. We’re told these systems think, plan, reflect, self-correct. So when a research team systematically tests those claims against controlled puzzles and finds they don’t hold up, that matters a lot.
Shojaee et al. (2025) ran that test. The results give educators something concrete to work with when AI vendors start using the word “reasoning” as if it carries the meaning it has in a classroom.
The team built four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) where they could turn complexity up or down without changing the underlying logic. Existing math benchmarks have a contamination problem. Models may have seen the answers during training. Puzzles let researchers test what the model actually does when no memorized answer is available.

Three Regimes of AI Reasoning Models
Shojaee and colleagues compared frontier reasoning models like Claude 3.7 Sonnet (thinking), DeepSeek-R1, and OpenAI’s o3-mini against their non-thinking siblings under matched compute budgets. Three patterns emerged.
At low complexity, the standard non-thinking models did just as well or better and used fewer tokens. The “thinking” mode added cost without value. At medium complexity, the thinking versions pulled ahead, which is the regime AI vendors love to showcase. Push complexity higher, and both crashed to zero accuracy.
That collapse isn’t gradual. Performance holds, then falls off a cliff. And the cliff isn’t a budget issue. The models had access to up to 64,000 tokens. They simply stopped using them.
When the AI Stops Thinking
Reasoning effort is supposed to scale with problem difficulty. A harder problem should mean more thought and more tokens. That’s the pitch behind chain-of-thought models. The research team found something stranger. As puzzles got harder, thinking models initially used more tokens, but right before the accuracy collapse point, they started using fewer. The harder it got, the less they tried.
Shojaee and colleagues call this “a fundamental scaling limitation in the thinking capabilities of current reasoning models relative to problem complexity” (p. 9). My read is that this points to a structural problem, not a tuning problem. A budget explanation doesn’t fit the data. The models are giving up at exactly the moment they should be working harder, which undermines the central claim that these systems “think” in any meaningful sense.
The Algorithm Experiment
The most damning finding has nothing to do with how clever the models are. It’s about whether they can follow instructions. The researchers gave the models the Tower of Hanoi puzzle together with the recursive algorithm that solves it, so the models only had to execute the steps. Performance didn’t improve. The collapse happened at roughly the same complexity point.
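To make concrete how small the algorithm in question is, here is a minimal sketch of the standard recursive Tower of Hanoi solution. It is not the prompt or code from Shojaee et al.; the function name `hanoi_moves` and the `(disk, from_peg, to_peg)` move format are my own illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical move format for illustration: (disk, from_peg, to_peg),
# with pegs numbered 0, 1, 2 and disk 1 the smallest.
Move = Tuple[int, int, int]


def hanoi_moves(n: int, source: int = 0, target: int = 2, spare: int = 1) -> List[Move]:
    """Return the full move list for n disks using the textbook recursion."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear the n-1 smaller disks out of the way
        + [(n, source, target)]                      # move the largest disk to the target peg
        + hanoi_moves(n - 1, spare, target, source)  # stack the smaller disks back on top
    )


# The solution length is 2**n - 1, so it roughly doubles with each added disk.
for n in (3, 5, 10):
    print(n, len(hanoi_moves(n)))
# prints:
# 3 7
# 5 31
# 10 1023
```

The recursion is three lines long, yet the move list it produces grows exponentially. Executing it faithfully is a bookkeeping task, not a creative one, which is why the lack of improvement is so striking.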
That result raises serious questions about what reasoning models are actually doing. If a system can’t reliably follow a known algorithm it’s been handed, it isn’t searching, planning, or verifying anything. It’s predicting plausible next tokens. Rudolph and colleagues (2023), in their early commentary on ChatGPT as a “bullshit spewer” in higher education, made roughly this argument before the reasoning model wave arrived. They look more right with each new study.
The team also found wild inconsistency across puzzle types. Claude 3.7 Sonnet (thinking) produced around 100 correct moves in Tower of Hanoi with ten disks before its first error, but in River Crossing with three actor-agent pairs it failed after only four moves. The authors suggest training data as a likely cause: River Crossing variants with larger N are probably scarce on the web, so the model may rarely have seen or memorized them.
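Counting correct moves before the first error only works if every move can be checked mechanically. The sketch below shows one way such a check could look for Tower of Hanoi. It is an assumption-laden illustration, not the authors’ simulator: `first_invalid_move` and the `(disk, from_peg, to_peg)` move format are hypothetical names chosen for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical move format for illustration: (disk, from_peg, to_peg),
# with pegs numbered 0, 1, 2 and disk 1 the smallest.
Move = Tuple[int, int, int]


def first_invalid_move(n: int, moves: List[Move]) -> Optional[int]:
    """Return the 1-based index of the first illegal move, or None if every move is legal."""
    # Peg 0 starts with all n disks, largest at the bottom; the end of each list is the top.
    pegs = [list(range(n, 0, -1)), [], []]
    for i, (disk, src, dst) in enumerate(moves, start=1):
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return i  # peg index out of range, or a move that goes nowhere
        if not pegs[src] or pegs[src][-1] != disk:
            return i  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return i  # a larger disk would land on a smaller one
        pegs[dst].append(pegs[src].pop())
    return None


legal = [(1, 0, 2), (2, 0, 1), (1, 2, 1)]  # moves both disks onto peg 1
illegal = [(1, 0, 2), (2, 0, 2)]           # second move puts disk 2 on disk 1

print(first_invalid_move(2, legal))    # prints None
print(first_invalid_move(2, illegal))  # prints 2
```

A checker like this only tests legality, not whether the puzzle ends up solved. It is enough, though, to locate the point where a proposed solution first goes wrong, which is the kind of measurement the comparison above depends on.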
What This Means for Classroom Practice
I want to be careful here. The puzzles in this study, which comes from researchers at Apple, aren’t classroom tasks. The authors say so themselves. But the findings should still shape how we talk to teachers and students about AI.
Language matters most. When a vendor says a model is “reasoning,” ask what kind. There’s a difference between fluent text generation and step-by-step problem-solving, and Shojaee et al. show that reasoning models can do the first but break down on the second. Teachers don’t need to become AI researchers, but they do need a working sense of where these tools fail. This is core AI literacy, the kind Kalantzis and Cope (2025) describe in their work on literacy in the time of AI.
The findings also reshape how we design AI use in learning. If models collapse on tasks that require sustained logical execution, then assignments that ask students to lean on AI for multi-step problem solving are likely to produce confident-sounding nonsense. Students need to be taught to treat AI output as a draft, not an answer, and to verify each step. That’s the same pedagogical move the metacognitive laziness research from Fan et al. (2025) has been calling for.
Most importantly, the Apple findings remind us that “thinking” tokens are not thinking. Shaw and Nave (2026) describe the cognitive surrender that happens when students assume the AI has done the cognitive work for them. This study shows that, past a certain point, neither the student nor the model has done it. Pedagogy still does the heavy lifting. What students learn to do with their own minds, against tools that hit a ceiling like this one, is the only durable lesson.
References
- Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489–530. https://doi.org/10.1111/bjet.13544
- Gerlich, M. (2025). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), Article 6. https://doi.org/10.3390/soc15010006
- Kalantzis, M., & Cope, B. (2025). Literacy in the time of artificial intelligence. Reading Research Quarterly, 60, e591. https://doi.org/10.1002/rrq.591
- Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning & Teaching, 6(1), 342–354. https://doi.org/10.37074/jalt.2023.6.1.9
- Shaw, S. D., & Nave, G. (2026). Thinking fast, slow, and artificial: How AI is reshaping human reasoning and the rise of cognitive surrender. Working paper, The Wharton School, University of Pennsylvania. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
- Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. Apple Machine Learning Research. https://arxiv.org/abs/2506.06941
