I’ve been arguing for intentional AI use in education for a long time now. Widely, aggressively, unapologetically. But intentional is the key word. And this study from Bastani et al. (2025), published in the Proceedings of the National Academy of Sciences, is one of the strongest pieces of evidence I’ve seen for why design matters as much as access.
The paper, “Generative AI Without Guardrails Can Harm Learning: Evidence From High School Mathematics,” reports results from a large randomized controlled trial with nearly 1,000 high school students in Turkey. Three conditions were compared: no AI access (control), a standard ChatGPT-style interface built on GPT-4 (GPT Base), and a guarded version designed with teacher input (GPT Tutor). GPT Base worked like a typical open chat tool. GPT Tutor included teacher-written prompts, embedded correct solutions, and instructions to provide hints rather than full answers.
The results tell two very different stories depending on when you look.
With AI access, students performed dramatically better on practice problems. GPT Base increased assisted practice scores by 48%. GPT Tutor increased them by 127% (Table 1, p. 4). Those are massive gains. If you stopped the study here, the headline would be simple: AI helps students do math.
But the study didn’t stop there.
On a closed-book exam with no AI access, students who had used GPT Base scored 17% lower than students who never had AI at all. Bastani et al. are direct about this: “on the exam, students in the GPT Base arm perform statistically significantly worse than students in the control arm by 17%” (p. 2).
GPT Tutor erased that negative effect. Students in the tutoring condition performed roughly on par with the control group. But it didn’t produce gains beyond baseline either.
So the open tool actively harmed learning. The guarded tool prevented that harm but didn’t boost learning either. No condition produced durable learning gains beyond what students achieved on their own.
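To make the percentages concrete, here is one way to read them. This is a sketch under an assumption: that each figure is a relative difference in mean score against the control arm. The notation is mine, not the paper's.

```latex
% Assumed reading: each reported percentage is a relative difference in
% mean score between a treatment arm and the control arm.
\[
  \text{relative effect}
  = \frac{\bar{s}_{\text{arm}} - \bar{s}_{\text{control}}}{\bar{s}_{\text{control}}}
\]
% Reported values under that reading:
%   assisted practice, GPT Base:   +0.48
%   assisted practice, GPT Tutor:  +1.27
%   unassisted exam,   GPT Base:   -0.17
```

Under that reading, the GPT Base arm's mean exam score is about 0.83 times the control arm's mean, while its assisted practice score is about 1.48 times the control's.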
Why GPT Base Hurt Learning
You might assume the problem was accuracy. GPT Base gave correct answers only 51% of the time, with logical errors 42% of the time and arithmetic errors 8% (p. 5). That’s a lot of wrong math. But Bastani et al. found that logical error rates did not significantly predict exam performance (Table 2, p. 6). Bad answers weren’t the main problem.
How students used the tool was the real problem. Message analysis showed that students in GPT Base most frequently asked for answers directly. Bastani et al. conclude that “the vast majority of students are using GPT Base to obtain solutions” (p. 6). They weren’t working through problems. They were outsourcing them.
GPT Tutor conversations looked different. Students attempted solutions, requested hints, and engaged with the material. The guardrails shaped the interaction, and the interaction shaped the learning.

This is exactly what Shaw and Nave (2026) described as cognitive surrender: deferring to AI outputs without critical engagement. Fan et al. (2025) found a similar pattern in their writing study. ChatGPT improved essays but triggered metacognitive laziness. The thinking stopped because AI was doing the work.
One of the most unsettling findings: students in GPT Base “did not perceive that they performed worse or learned less” (p. 4). They thought they were doing fine. Perceived learning and actual learning diverged completely.
Kosmyna et al. (2025) found a similar disconnect at MIT. Students using ChatGPT for writing showed reduced neural engagement but didn’t report feeling less engaged. Using AI can mask what’s actually happening cognitively. Students feel productive because the output looks good. Internally, learning tells a different story.
What GPT Tutor Got Right (and Where It Fell Short)
GPT Tutor’s guardrails worked. Hints pushed students to think before receiving answers. Teacher-written prompts kept the tool aligned with learning objectives. The negative effect disappeared.
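For readers who want to picture what a guardrail like this might look like in practice, here is a minimal sketch in Python. It is not the prompt used in the study. The `PracticeProblem` structure, the `build_tutor_prompt` function, and the rule wording are all hypothetical. They only illustrate the three ingredients described above: teacher-written guidance, an embedded correct solution, and an instruction to give hints rather than answers.

```python
from dataclasses import dataclass


@dataclass
class PracticeProblem:
    """A practice problem with teacher-supplied material (hypothetical structure)."""
    statement: str          # the problem as shown to the student
    teacher_solution: str   # correct solution, embedded so the model need not derive it
    teacher_notes: str      # teacher-written guidance for the tutor


def build_tutor_prompt(problem: PracticeProblem) -> str:
    """Assemble a system prompt that embeds the solution but forbids revealing it.

    The rule wording is an illustrative assumption, not the study's prompt.
    """
    return (
        "You are a math tutor for a high school student.\n"
        f"Problem: {problem.statement}\n"
        f"Reference solution (do not reveal): {problem.teacher_solution}\n"
        f"Teacher notes: {problem.teacher_notes}\n"
        "Rules:\n"
        "1. Never state the final answer or a full worked solution.\n"
        "2. Ask the student to show an attempt before you help.\n"
        "3. Respond with one small hint at a time.\n"
    )


example = PracticeProblem(
    statement="Solve 2x + 3 = 11",
    teacher_solution="x = 4",
    teacher_notes="Students often forget to subtract 3 before dividing.",
)
prompt = build_tutor_prompt(example)
```

The resulting `prompt` string would be passed as the system message to whatever chat model sits behind the tool. Embedding the reference solution means the model does not have to solve the problem itself, and the rules push the conversation toward attempts and hints.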
But Bastani et al. are careful not to oversell it. GPT Tutor was reactive and passive. It responded to student queries but didn’t proactively identify misconceptions or generate tailored guidance. The authors suggest future systems should do more: diagnose errors, adapt in real time, and guide students toward understanding.
I agree, and I think we’re already seeing models move in that direction. The AI Assessment Scale from Perkins, Roe, and Furze (2024) provides a framework for calibrating how much AI does at each stage of a task. Level 2 and 3 uses, where students collaborate with AI but retain critical judgment, align well with what GPT Tutor was trying to achieve. Sperber et al.’s (2025) PAIRR model takes it further by building reflection into the feedback loop. Students compare human and AI feedback and articulate their revision decisions.
Bastani et al. frame their contribution carefully: “While generative AI has been shown to enhance productivity, its influence on learning new skills remains unclear” (p. 6).
I want to be clear about what this study does and doesn’t say. It doesn’t say AI is bad for learning. It says unguarded AI, without pedagogical design, without scaffolding, without intentionality, can harm learning. And that the harm is invisible to students, which makes it worse.
For anyone still debating whether to allow or ban AI in classrooms, this study moves the conversation past that binary. Access isn’t the question. Design is. Give students an open chat tool and let them figure it out, and many will take the path of least resistance. Build guardrails that prompt thinking, require engagement, and resist answer-giving, and you protect the learning process.
Pedagogy determines whether AI helps or hurts. That’s been true in every study I’ve covered. Bastani et al. just demonstrated it with nearly 1,000 students and a controlled experiment in PNAS.
References
- Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakcı, Ö., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26), e2422633122. https://doi.org/10.1073/pnas.2422633122
- Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489–530. https://doi.org/10.1111/bjet.13544
- Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X.-H., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing tasks. MIT Media Lab. https://www.media.mit.edu/publications/your-brain-on-chatgpt/
- Perkins, M., Roe, J., & Furze, L. (2024). The AI Assessment Scale revisited: A framework for educational assessment. arXiv preprint. https://arxiv.org/abs/2412.09029
- Shaw, S. D., & Nave, G. (2026). Thinking fast, slow, and artificial: How AI is reshaping human reasoning and the rise of cognitive surrender. Working paper, The Wharton School, University of Pennsylvania. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
