Cotton, Cotton, and Shipway (2024) did something unusual with their paper on ChatGPT and academic integrity. They let ChatGPT write half of it. The entire first half of the article, from the introduction through the conclusion, was generated by ChatGPT with minimal human input. The authors provided prompts, rearranged the output, swapped out ChatGPT’s fabricated references for real ones, and added subheadings. The actual human analysis only begins in the Discussion section, where they reflect on what the experiment revealed about the state of AI-generated academic writing.
It’s a clever design, and it produced a paper that reads like two documents stitched together. The ChatGPT-written sections are competent, organized, and completely generic. They read like a decent undergraduate essay: solid structure, no original insight, and a tendency to repeat the same points in slightly different language. If you’ve ever graded a stack of essays where students clearly understood the topic but had nothing interesting to say about it, you’ll recognize the tone.
This paper was published online in March 2023, barely three months after ChatGPT launched. That makes it one of the very first academic responses to the tool, and reading it now, in 2026, is a fascinating exercise in time travel. Some of what Cotton et al. predicted has come true. Some of it looks almost quaint. And some of their assumptions about detection have aged poorly in ways that are instructive for how we think about academic integrity today.
The Contract Cheating Connection
Cotton et al. (2024) frame ChatGPT as an escalation of existing academic integrity challenges, not a new category of problem. The numbers were already concerning before ChatGPT existed. About 22% of Austrian university students admitted to plagiarism (Hopp & Speil, 2021, cited in Cotton et al., 2024), and UK data estimated that one in seven graduates may have paid someone to complete their assignments (QAA, 2020, cited in Cotton et al., 2024).
What ChatGPT changed, according to Cotton et al. (2024), is the accessibility of that shortcut. As they put it, “ChatGPT multiplies the risks which already exist around contract cheating in potentially opening up these services to a wider range of students who may not see using AI as cheating or who may not have the funds to use essay mill sites previously” (p. 235). The tool is free, requires no specialized knowledge, and produces output in seconds. The barrier to generating a passable essay dropped to near zero.
And the detection problem was already severe before AI arrived. Cotton et al. (2024) cite a study where 26 purchased assignments were submitted through normal marking processes. None were flagged as suspicious by markers. Only three were caught by Turnitin. A follow-up study where markers were specifically primed to look for contract cheating found they detected it correctly 62% of the time. Unassisted markers did even worse, at 48%. If human graders couldn’t reliably spot purchased essays, the chances of catching AI-generated ones looked slim.
Rudolph, Tan, and Tan (2023) published their own early ChatGPT paper around the same time, calling it a “bullshit spewer” and raising similar concerns about detection. Both papers arrived at the same moment of academic panic, and both reflect the field’s initial response: how do we catch this?

Detection Optimism That Hasn’t Aged Well
Cotton et al. (2024) tested OpenAI’s GPT-2 Output Detector and found results that were, at the time, encouraging. Ten genuine student essays all scored below 1% likelihood of being AI-generated. ChatGPT essays on the same topics scored close to 100%. And when the authors tried to disguise AI output by prompting ChatGPT to “reference, use a varied sentence structure and transitions, and to emulate the writing of an undergraduate student,” the attempt “only reduced the score down to around 97%” (p. 236).
In early 2023, that looked promising. In 2026, we know it was a snapshot of a moment that passed quickly. AI detection tools have been widely discredited, with documented bias against non-native English speakers and false positive rates that make them unreliable. The arms race Cotton et al. (2024) predicted has mostly been won by the generators. Dawson, Bearman, Dollinger, and Boud (2024) made a persuasive case that the obsession with catching cheating has distracted us from a more fundamental question: whether our assessments are actually measuring what they claim to measure.
The authors also noted that ChatGPT’s output tends to be formulaic and repetitive across similar prompts. If multiple students used comparable prompts for the same assignment, their submissions would look nearly identical, which Turnitin could flag for high similarity. That observation was accurate for early ChatGPT. It’s far less true of the models available now, which produce more varied and nuanced output. The consistency that was supposed to be AI’s weakness has largely disappeared.
What the AI-Written Sections Actually Reveal
The ChatGPT-generated text itself is the most instructive part of the paper. It’s technically correct but entirely hollow, covering opportunities, challenges, and recommendations without saying anything a reader couldn’t predict from the section headings alone. Every reference ChatGPT inserted was plausible but completely fictional. The AI couldn’t distinguish between real and invented scholarship, which in 2023 was a genuine surprise. By now, hallucinated references are a well-known limitation, but this paper was among the first to document them in an academic context.
Cotton et al. (2024) also considered listing ChatGPT as a co-author but decided against it: ChatGPT couldn’t agree to submission, couldn’t review the final manuscript, and couldn’t take responsibility for the article’s contents. Cleland et al. (2025) have since built on these questions with a more developed framework for AI disclosure in academic publishing.
Assessment Redesign Was the Strongest Recommendation
Cotton et al. (2024) close with a call to action: “Whatever happens on the technology side, this should serve as a wake-up call to university staff to think very carefully about the design of their assessments and ways to ensure that academic dishonesty is clearly explained to students and minimised” (p. 236). Their practical recommendations lean toward assessment redesign: tasks that require critical thinking, problem-solving, and original argumentation; rubrics; draft submissions; and transparency with students about what counts as misconduct.
This is where the paper holds up best. The detection arguments have mostly collapsed, but the assessment redesign recommendations have been validated by almost everything published since. Moorhouse, Yeo, and Wan (2023) studied how top universities responded to ChatGPT in those early months and found a similar pattern: the institutions that adapted fastest redesigned their assessments rather than doubling down on detection. Eaton (2023) took the argument further with her postplagiarism framework, arguing that the detection-and-punishment paradigm needs to give way to a more nuanced understanding of academic integrity.
Reading This Paper in 2026
Cotton et al. (2024) deserve credit for speed and for the creative experimental design. Publishing a serious academic paper with ChatGPT-authored content just months after launch took a willingness to engage with the technology when most of the field was still figuring out what to make of it.
What the authors couldn’t know is how fast everything would move. The detection tool that still scored disguised ChatGPT output at around 97% likely to be AI-generated belongs to a class of tools now regarded as unreliable. The formulaic AI output the authors described has given way to writing that’s far harder to distinguish from human work. And the fundamental question has shifted from “can we detect AI use” to “how do we build assessment systems that produce genuine learning, regardless of what tools students have access to.”
That shift is the lasting contribution of early papers like this one. They forced the conversation to start, even if the field has moved well past the answers they offered.
References
- Cleland, J., Driessen, E., Masters, K., Lingard, L., & Maggio, L. A. (2025). When and how to disclose AI use in academic publishing: AMEE Guide No. 192. Medical Teacher. https://doi.org/10.1080/0142159X.2025.2607513
- Cotton, D. R. E., Cotton, P. A., & Shipway, J. R. (2024). Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innovations in Education and Teaching International, 61(2), 228–239. https://doi.org/10.1080/14703297.2023.2190148
- Dawson, P., Bearman, M., Dollinger, M., & Boud, D. (2024). Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7), 1005–1016. https://doi.org/10.1080/02602938.2024.2386662
- Eaton, S. E. (2023). Postplagiarism: Transdisciplinary ethics and integrity in the age of artificial intelligence and neurotechnology. International Journal for Educational Integrity, 19(23). https://doi.org/10.1007/s40979-023-00144-1
- Hopp, C., & Speil, A. (2021). How prevalent is plagiarism among college students? Anonymity preserving evidence from Austrian undergradua
- Moorhouse, B. L., Yeo, M. A., & Wan, Y. (2023). Generative AI tools and assessment: Guidelines of the world’s top-ranking universities. Computers and Education Open, 5, 100151. https://doi.org/10.1016/j.caeo.2023.100151
- Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning & Teaching, 6(1), 342–354. https://doi.org/10.37074/jalt.2023.6.1.9
