AI Text Detection: Why Humans Still Beat the Tools

For two years, schools have thrown money at AI detection tools that don’t work. These tools claim to detect AI-generated student writing, but they routinely flag human-written essays, miss actual AI text, and collapse the moment students paraphrase. A new study from Russell, Karpinska, and Iyyer (2025) gives us the empirical case for what teachers have long suspected: the most reliable AI detector isn’t a tool. It’s a human who uses AI every day.

The Numbers Are Striking

The setup is straightforward. The researchers hired annotators to read 300 nonfiction English articles, half human-written and half AI-generated by GPT-4o, Claude-3.5-Sonnet, and o1-Pro. The annotators didn’t know the source. They had to label each article as human or AI and explain their reasoning.

Two populations emerged. People who rarely or never use LLMs detected AI text at roughly chance level: a 56.7% true positive rate (AI articles correctly flagged) and a 51.7% false positive rate (human articles wrongly flagged), with high confidence in judgments that were wrong about half the time. People who frequently use LLMs for writing tasks (editors, copywriters, content writers) hit a 92.7% true positive rate with a 4% false positive rate.
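For readers who want to see how those two rates are computed, here is a minimal sketch in Python. The function name and the toy labels are my own, invented for illustration; none of it comes from the study's code or data.

```python
def rates(labels, predictions):
    """Return (true positive rate, false positive rate), with "ai" as the positive class."""
    # AI articles that the annotator correctly called AI
    tp = sum(1 for y, p in zip(labels, predictions) if y == "ai" and p == "ai")
    # Human articles that the annotator wrongly called AI
    fp = sum(1 for y, p in zip(labels, predictions) if y == "human" and p == "ai")
    return tp / labels.count("ai"), fp / labels.count("human")


# Toy data, invented for illustration (not from the study)
labels      = ["ai", "ai", "ai", "ai", "human", "human", "human", "human"]
predictions = ["ai", "ai", "ai", "human", "human", "human", "ai", "human"]

tpr, fpr = rates(labels, predictions)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")  # prints TPR=0.75, FPR=0.25
```

A good detector needs both numbers to look right at once: a high true positive rate means little if the false positive rate is also near 50%, which is exactly the nonexpert pattern.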

The killer number is the majority vote. Russell et al. (2025) write that “the majority vote of five expert annotators—without any specialized training—misclassified only 1 out of 300 articles, on par with the most accurate commercial detector (Pangram), even under adversarial paraphrasing and humanization” (p. 8). That’s one miss in 300, achieved without training and without feedback loops. The experts’ detection ability came from regular AI use alone.
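A majority vote is simple to express in code. The sketch below is my own illustration, not the authors' implementation. The second function adds a back-of-the-envelope calculation under a strong assumption that I am introducing here: that annotators make mistakes independently and at the same rate. Real annotators' errors are probably correlated, so treat the result as intuition only.

```python
from collections import Counter
from math import comb


def majority_vote(votes):
    """Return the most common label; an odd number of voters avoids ties."""
    return Counter(votes).most_common(1)[0][0]


def majority_error_rate(p, n=5):
    """Chance that more than half of n annotators are wrong,
    ASSUMING each errs independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


# Hypothetical votes from five annotators on one article
print(majority_vote(["ai", "ai", "human", "ai", "ai"]))  # prints ai

# If each annotator were wrong 10% of the time (an illustrative figure)
print(f"{majority_error_rate(0.10):.4f}")
```

Under those assumptions, five annotators who are each wrong 10% of the time would be collectively wrong less than 1% of the time, which is why pooling several good readers can beat any one of them.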

What Experts Actually See

The qualitative analysis is where the paper is most useful. The authors coded every expert explanation and built a taxonomy of detection clues. Vocabulary led the way: specific overused words like “crucial,” “vibrant,” “testament,” “delve,” and “tapestry” showed up in 53.1% of explanations. After that came sentence structure (35.9%), grammar patterns (24.8%), originality (23.7%), quotes (22.3%), and clarity (19.5%).

But the higher-order signals are where the real argument lives. Experts noticed AI’s tendency to write quotes that all sound the same, to use the same names (Emily, Sarah) and titles (Dr., Prof.), and to wrap articles in optimistic, vague conclusions. Russell et al. (2025) point out that “many of these categories (e.g., originality, factuality, tone) are much more difficult to assess automatically than others (e.g., vocabulary), and these may currently be areas where humans have an advantage over automatic detectors” (p. 8).

That sentence is the structural argument of the paper. Vocabulary patterns can be paraphrased away. Originality, voice, and tonal coherence can’t.
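To make that concrete, here is a deliberately naive sketch of a vocabulary check in Python. It is not how any detector in the study works; the word list is just the five examples quoted above, and both sample sentences are ones I wrote for illustration.

```python
import re

# The five overused words quoted earlier in this post
FLAGGED = {"crucial", "vibrant", "testament", "delve", "tapestry"}


def flagged_word_count(text):
    """Count exact matches against FLAGGED (no stemming, so "delves" is missed)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in words if w in FLAGGED)


# Invented example sentences, not taken from the study
original = "The festival is a vibrant testament to the town's rich tapestry of cultures."
paraphrased = "The festival is a lively sign of the town's many cultures."

print(flagged_word_count(original))     # prints 3
print(flagged_word_count(paraphrased))  # prints 0
```

One light rewrite takes the count from three to zero. There is no equally simple counter for whether a piece has a real voice or an original idea, which is the part experts were picking up on.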

Why the Detection Tools Keep Failing

Open-source detectors collapsed under the harder conditions. On humanized o1-Pro articles, Binoculars hit a 6.7% true positive rate. Fast-DetectGPT managed 23.3%. The expert human majority vote on the same articles was 100%. Even Pangram, the only commercial tool that matched human experts on accuracy, slipped slightly on humanized text.

Schools using these tools to police student writing are running an enforcement regime built on probabilistic guessing. Eaton’s (2023) postplagiarism argument, which I’ve covered before, lands harder after reading this paper. Detection isn’t the answer. The answer is building students who can think critically with and about AI.

What This Means for Teachers

Russell et al. (2025) didn’t set out to write a paper for educators, but the implication is direct. The teachers most likely to spot AI-generated student work are the teachers who use AI themselves. The teachers refusing to touch the tools are in the same position as the nonexperts in the study, who confidently flagged human writing as AI and missed the actual AI cases.

This isn’t a moral argument. It’s a calibration argument. You can’t develop an eye for what AI writing looks like without spending real time with these tools. Hawkins, Taylor-Griffiths, and Lodge’s (2025) work on feedback literacy and AI-enhanced essay writing makes a related point: students need to evaluate AI output against their own judgment, and that judgment only develops through use.

The same logic applies to faculty. If you refuse to use AI, you don’t get protection from it. You just become a worse detector.

The Limits Worth Naming

The study has real limits. It covers only American English nonfiction articles under 1000 words. Five experts is a small sample. Factual accuracy didn’t show up as a major cue, which probably reflects the article topics, not a general feature of AI text. The findings can’t be assumed to transfer to scientific papers, multilingual contexts, or social media posts. Hysaj, Dean, and Freeman’s (2025) work on multicultural students and AI-flagged academic writing, which I’ve covered before, shows how badly detector errors land when context shifts.

The cost structure also matters. The expert annotators were paid roughly $2.82 per article. For high-stakes academic integrity cases, that’s workable. For routine essay grading at scale, it isn’t.

The Real Argument

The paper’s strongest line comes at the end. Russell et al. (2025) write that “a population of ‘expert’ annotators—those who frequently use LLMs for writing-related tasks—are highly accurate and robust detectors of AI-generated text without any additional training” (p. 9). There’s no training involved, no feedback loop, just the daily habit of using AI.

For teachers reading this: the AI tool you’ve been avoiding is the same tool you’d need to use to spot AI-generated student work. The detector industry isn’t going to fix that for you. The eye comes from use.

References

  • Eaton, S. E. (2023). Postplagiarism: Transdisciplinary ethics and integrity in the age of artificial intelligence and neurotechnology. International Journal for Educational Integrity, 19(23). https://doi.org/10.1007/s40979-023-00144-1 
  • Hawkins, B., Taylor-Griffiths, D., & Lodge, J. M. (2025). Summarise, elaborate, try again: Exploring the effect of feedback literacy on AI-enhanced essay writing. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2025.2492070 
  • Hysaj, A., Dean, B. A., & Freeman, M. (2025). Exploring the purposes and uses of generative artificial intelligence tools in academic writing for multicultural students. Higher Education Research & Development, 44(7), 1686–1700. https://doi.org/10.1080/07294360.2025.2488862 
  • Russell, J., Karpinska, M., & Iyyer, M. (2025). People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text (arXiv:2501.15654). arXiv. https://arxiv.org/abs/2501.15654
