I’ve written about AI and assessment from several angles now, including the wicked problem framing, the AI Assessment Scale, and the future of exams. But this new paper from Corbin, Dawson, and Liu (2025) sharpens the conversation in a way I think the field urgently needs.
Their argument is simple, and it cuts deep: most of what universities are doing to respond to AI in assessment is cosmetic. The rules look good on paper, but they change nothing about what students actually do.
The paper, “Talk Is Cheap: Why Structural Assessment Changes Are Needed for a Time of GenAI,” published in Assessment & Evaluation in Higher Education, introduces a conceptual split between two types of assessment responses. Discursive changes modify the instructions around a task, telling students what they should or shouldn’t do with AI.
Structural changes modify the task itself so that its validity doesn’t depend on whether students follow those instructions. Higher education has overwhelmingly chosen the discursive path, and the authors argue convincingly that this path creates an illusion of security without delivering any of its substance.
The Problem with Telling Students What to Do
Corbin et al. define a discursive change as any modification that works purely through communication. It could be a line added to an assignment brief saying “GenAI use is not permitted,” or instructions telling students they can brainstorm with AI but not use it for drafting, or a required declaration form. In every case, the assessment task stays exactly the same. Only the language around it changes, and students remain free to follow or ignore it.

Traffic light systems are the most widespread version of this approach. Universities across the UK, Australia, and beyond have adopted color-coded categories where red prohibits AI, amber allows limited use, and green encourages full integration. Corbin et al. are clear about why it fails:
We are borrowing the language of structural change to describe what are merely discursive. This creates a dangerous illusion of control and safety, as educators might assume these frameworks carry the same force as actual traffic lights, when in fact they lack any meaningful enforcement mechanism. (p. 1094)
The authors unpack this analogy effectively. Real traffic lights work through infrastructure: cameras, patrols, and penalties. Authorities don’t just post guidelines about when drivers should stop; they install physical systems that force compliance. The educational version borrows the metaphor’s authority without any of that infrastructure, and that gap is exactly what Corbin et al. call the “enforcement illusion” (p. 1094).
The AI Assessment Scale from Perkins, Roe, and Furze (2024) receives a similar critique from Corbin et al. I’ve written favorably about the AIAS and I still think it has genuine value for helping educators conceptualize how AI might fit into different assignments.
But the authors make a valid point: the scale communicates levels of permitted use, and its effectiveness depends entirely on students honoring those levels. The AIAS authors themselves acknowledged this in their 2024 revision.
Corbin et al. note that declaration-based approaches run into the same wall. At King’s College London, 74% of students didn’t complete the AI declaration form properly, with many worrying that admitting AI use would count against them.
Corbin et al. trace the failure of discursive approaches to three assumptions that don’t hold up. They assume students will clearly understand the rules, but the boundary between “AI for editing” and “AI for drafting” blurs quickly across multiple rounds of revision.
They assume students will comply voluntarily even when non-compliance offers obvious advantages, and Corbin’s own earlier research shows that student and educator views on appropriate AI use often diverge. And they assume educators can verify compliance, which is the most damaging assumption because AI detection tools remain unreliable.
The authors name this dynamic the “discursive paradox”: “The more detailed and specific our instructions become about ‘acceptable’ AI use, the more we highlight the gap between what we can specify and what we can verify” (p. 1092). As they see it, the more carefully you write the rules, the more obvious it becomes that no one can check whether anyone followed them.
I think this connects directly to what Bastani et al. (2025) found in their PNAS study on math learning. When students had open access to ChatGPT without guardrails, they used it as a shortcut, and once access was removed their exam scores were 17% lower than those of students who never had it. The guardrailed GPT Tutor version produced different results because it changed how the tool itself responded. The rules in the open condition described desired behavior; the structure in the tutoring condition actually shaped it.
Building Validity into the Task Itself
If discursive changes can’t protect assessment validity, Corbin et al. argue that the focus needs to move to structural redesign. Their examples are concrete. A supervised in-class writing session restricts AI access because of how the room is set up, not because of what the instructions say. A live oral follow-up to a quiz requires students to explain their reasoning in real time. A lab checkpoint where a tutor signs off on data before the report is written makes fabrication structurally difficult.
Corbin et al. propose two strategies that deserve particular attention. The first is reorienting assessment from evaluating finished products to evaluating the development process. When you build in authenticated checkpoints where students demonstrate how their thinking evolved, the value of AI-generated shortcuts drops considerably. Sperber et al.’s (2025) PAIRR model offers a working example: students receive AI feedback and peer feedback side by side, then articulate their revision decisions. The assessment captures how thinking developed, not just what was submitted.
The second strategy the authors propose is to think about validity at the unit or module level. No single assignment needs to be AI-proof if the sequence of tasks across a course builds visibly on earlier work. Corbin et al. (citing St-Onge et al. 2017) describe this as creating “an argument-based evidentiary-chain” (p. 1095) where validity comes from coherent demonstration of learning across multiple connected touchpoints.
Where I’d Push Back Slightly
I agree with almost everything in this paper, and I think the traffic light critique alone makes it essential reading for anyone writing institutional AI policy. But I want to defend the pedagogical value of discursive work, even if it can’t solve the enforcement problem.
The AIAS may not prevent misuse, but it has helped thousands of educators think intentionally about what role AI should play in their courses. Clear expectations matter for learning culture, even when they can’t guarantee behavior. We need both structural redesign and better discourse about AI, because either one without the other leaves the job incomplete.
But on validity, Corbin et al. are right that structure wins. Their closing line deserves to be read by every assessment committee in higher education: current frameworks “say much but change little. They direct behaviour they cannot monitor. They prohibit actions they cannot detect. In other words, when it comes to appropriate assessment change for a time of AI, talk is cheap” (p. 1096). Assessment in the age of AI needs to be built differently, not just explained differently.
References
- Corbin, T., Dawson, P., & Liu, D. (2025). Talk is cheap: Why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 50(7), 1087–1097. https://doi.org/10.1080/02602938.2025.2503964
- Desai, H. (2025, May 19). What’s worth measuring? The future of assessment in the AI age. UNESCO Courier. https://www.unesco.org/en/articles/whats-worth-measuring-future-assessment-ai-age
- Perkins, M., Roe, J., & Furze, L. (2024). The AI Assessment Scale revisited: A framework for educational assessment [Preprint]. arXiv. https://arxiv.org/abs/2412.09029
- Sperber, L., MacArthur, M., Minnillo, S., Stillman, N., & Whithaus, C. (2025). Peer and AI Review + Reflection (PAIRR): A human-centered approach to formative assessment. Computers and Composition, 76, 102921. https://doi.org/10.1016/j.compcom.2025.102921
- St-Onge, C., Young, M., Eva, K. W., & Hodges, B. (2017). Validity: One word with a plurality of meanings. Advances in Health Sciences Education, 22(4), 853–867. https://doi.org/10.1007/s10459-016-9716-3
