Why AI Detection in Education Does Not Work

If there’s one argument that keeps building momentum in the AI and education literature, it’s this: detection is not the answer. And this new paper from Bassett et al. (2026), published in the Journal of Higher Education Policy and Management, may be the most comprehensive and uncompromising articulation of that argument I’ve read. The authors don’t call for better AI detectors or more thoughtful implementation. They call for abolition. AI detection, they argue, is conceptually broken, procedurally unfair, and methodologically indefensible.

I agree with the core thesis. Every educator and administrator who still relies on Turnitin’s AI Writing Indicator or GPTZero scores to build misconduct cases needs to read this paper carefully.

The Unverifiable Foundations of AI Detection

The fundamental problem, as Bassett et al. frame it, is that AI detectors produce probabilistic estimates that cannot be independently verified in real-world conditions. In a controlled lab, you can test a detector against a known corpus of human and AI-generated texts. You know the ground truth. But in a classroom, nobody knows whether a student’s essay was written by a human, by AI, or, most commonly, with some combination of both. There is no external evidence that can conclusively confirm what a detector flags. The output is a number. That number looks authoritative. It is not.

The data behind the detectors compounds the problem. Bassett et al. point out that Turnitin tested its detector on approximately 700,000 papers submitted before 2019, all predating generative AI entirely. The assumption is that student writing from the pre-AI era looks like student writing today. It doesn’t. Students in 2026 have spent years exposed to AI-generated text across social media, academic tools, and everyday communication. Their linguistic patterns have shifted. A detector built and evaluated on pre-2019 writing tells us very little about how it performs on a 2026 essay.

Confirmation Bias and the Illusion of Validation

What struck me most in this paper is the systematic takedown of every method institutions currently use to “validate” an AI detector’s result. Bassett et al. walk through five of them, and none survives scrutiny. Three are worth spelling out here.

Linguistic markers are the first to fall. When a detector flags a text, staff often search for features they associate with AI: formulaic prose, semicolons, certain transition words. The problem is that these features appear in all academic writing. A post-flag search for these features is textbook confirmation bias, not independent verification. The authors cite GPTZero’s own published list of “common AI words,” which includes phrases like “it is essential to recognise” and “in conclusion.” These have been staples of student writing since long before ChatGPT existed.

Submitting the same text to multiple detectors doesn’t help either. Bassett et al. argue that this approach amplifies shared methodological flaws, creating a misleading appearance of consensus. They compare it, memorably, to asking a group of phrenologists for a second opinion. The consensus reflects shared assumptions, not factual accuracy.

Student confessions are equally unreliable. When a student admits to using AI after being flagged, the institution treats that as proof the detector was right. Bassett et al. call this the post hoc ergo propter hoc fallacy: the confession came after the flag, so the flag must have been accurate. A confession under the pressure of a disciplinary process, where students often lack legal support and face serious academic consequences, may say more about power dynamics than about what actually happened.

I’ve written before about how Corbin, Bearman, Boud and Dawson (2025) framed AI and assessment as a wicked problem with no clean technical fix. Bassett et al. make that case even more forcefully here: the technical fix itself is the problem.

The Base Rate Fallacy and False Positives

The statistical argument in this paper is devastating and simple. Bassett et al. walk through a base rate fallacy example that every educator should see. Imagine a detector with a 1% false positive rate and a 90% true positive rate. Sounds reliable, right? Now submit 1,000 papers, 10 of which are actually AI-generated. The detector will correctly flag 9 of the 10 AI papers. But it will also falsely flag about 10 of the 990 human-written papers. That means roughly half of all flagged papers are false positives. A flagged paper has only a 47.6% chance of being correctly identified. And remember: no institution knows its actual base rate of AI use, which means this probability can never be calculated in practice.
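
To make that arithmetic easy to reproduce, here is a minimal sketch in Python of the same calculation. The numbers (a 1% false positive rate, a 90% true positive rate, 10 AI-generated papers per 1,000) are the illustrative figures from the example above, not measured properties of any real detector, and the helper name is mine, not anything from the paper or from a vendor.

```python
# Minimal sketch of the base rate arithmetic, using the illustrative figures
# from the example above (not measured properties of any real detector).

def flagged_paper_accuracy(total_papers, ai_papers, true_positive_rate, false_positive_rate):
    """Probability that a flagged paper really is AI-generated (positive predictive value)."""
    human_papers = total_papers - ai_papers
    true_positives = ai_papers * true_positive_rate        # AI papers correctly flagged
    false_positives = human_papers * false_positive_rate   # human papers wrongly flagged
    return true_positives / (true_positives + false_positives)

ppv = flagged_paper_accuracy(total_papers=1000, ai_papers=10,
                             true_positive_rate=0.90, false_positive_rate=0.01)
print(f"Chance a flagged paper is actually AI-generated: {ppv:.1%}")  # ~47.6%
```

Change the assumed base rate and the result swings wildly: at 50 AI papers per 1,000 the figure climbs above 80%, at 2 per 1,000 it collapses below 20%. That is exactly the authors’ point: since no institution knows its base rate, the one number that matters can never be pinned down.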

I covered Dawson, Bearman, Dollinger and Boud (2024) on this blog, and their argument that assessment validity should be the priority over cheating detection lands with full force here. A detection tool that cannot tell you, with any defensible confidence, whether a given paper was AI-generated does not meet the balance-of-probabilities standard required for academic misconduct cases.

The False Dichotomy That Breaks Everything

The deepest problem Bassett et al. identify is the binary at the center of AI detection: text is either human-written or AI-generated. That binary does not reflect how students write in 2026. Students use AI at every stage of the writing process, from first brainstorm to final polish. The work is hybrid. A student who generates structured notes using AI and then rewrites them into an essay has created work with AI, not by AI. The attempt to draw a sharp line between the two is, as the authors put it, “an inherently absurd” exercise.

This connects directly to the boundary confusion that Corbin, Dawson, Nicola-Richmond and Partridge (2025) documented. When does assessment begin? If a student uses AI to brainstorm before they start writing, is that a violation? If they use it to proofread their final draft, is that different? Most institutional policies don’t answer these questions. Students are left to self-regulate against rules that nobody has defined clearly enough to follow.

Bassett et al. also challenge the AI Assessment Scale proposed by Furze (2024a), which I’ve covered on this blog. The AIAS offers five levels of AI integration, from full prohibition to unrestricted collaboration. The authors argue that its categories assume AI use can be segmented into discrete stages, but a student who moves fluidly between AI brainstorming, AI-assisted drafting, and independent revision doesn’t fit neatly into Level 2 or Level 3. The scale is useful for classroom conversation. It is less useful as an enforcement mechanism.

Burden of Proof and Student Rights

The final sections shift from technical to procedural, and they’re just as damaging. Bassett et al. note that the burden of proof belongs to the institution, not the student. But disciplinary processes in most universities create a power imbalance that effectively reverses this. Students feel compelled to prove their innocence. Some universities explicitly require students to demonstrate the “originality” of their work. That is a reversal of the standard, and it should trouble anyone who cares about procedural fairness.

The authors also raise the right to silence, and it’s a point that reframes the entire conversation about misconduct processes. Students under investigation for academic misconduct have the right not to respond. A refusal to explain or defend their work cannot be treated as evidence of guilt. Bassett et al. draw a critical line between academic requests, where a student is asked to explain their work as part of an assessment, and integrity investigations, where silence is a protected right.

Where This Leaves Us

This paper is a full-stop argument against AI detection in education. The tools can’t be verified, the validation methods are riddled with bias, the binary they operate on is a fiction, and the institutional processes they feed into routinely violate basic procedural standards. Bassett et al. don’t overstate their case. The evidence is thorough and the logic is tight.

If detection doesn’t work, then assessment has to be redesigned around learning, not surveillance. That means oral components, process-based evaluation, reflective tasks, and assignment designs that make AI collaboration visible and pedagogically productive. The research keeps pointing in the same direction.

References

  • Bassett, M. A., Bradshaw, W., Bornsztejn, H., Hogg, A., Murdoch, K., Pearce, B., & Webber, C. (2026). Heads we win, tails you lose: AI detectors in education. Journal of Higher Education Policy and Management. https://doi.org/10.1080/1360080X.2026.2622146
  • Corbin, T., Bearman, M., Boud, D., & Dawson, P. (2025). The wicked problem of AI and assessment. Assessment & Evaluation in Higher Education, 1–17. https://doi.org/10.1080/02602938.2025.2553340
  • Corbin, T., Dawson, P., & Liu, D. (2025). Talk is cheap: Why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 50(7), 1087–1097. https://doi.org/10.1080/02602938.2025.2503964 
  • Dawson, P., Bearman, M., Dollinger, M., & Boud, D. (2024). Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7), 1005–1016. https://doi.org/10.1080/02602938.2024.2386662
