Assessment Validity in the Age of AI: Why We Need to Agree on What We Mean

Every debate I’ve followed about AI and assessment eventually hits the same wall. Someone defends a traditional exam by calling it “validated.” Someone else argues the exam has lost its meaning now that AI can complete it. A third person wants to throw out the whole model and assess different competencies entirely. The conversation stalls. And it stalls, I think, because the people in the room are using the same word to mean very different things.

St-Onge, Young, Eva, and Hodges (2017) wrote about exactly this problem in a paper published in Advances in Health Sciences Education. Their context is health professions education, not AI. The paper is from 2017, years before ChatGPT existed. But the argument maps onto the current AI assessment conversation with uncomfortable precision, and I think it exposes a structural reason why so many of these debates go in circles.

The authors used discourse analysis to examine how the word “validity” actually functions in the assessment literature. They built an archive of 68 peer-reviewed articles, two books, a research report, and seven commentaries, then coded the language systematically. What they found is that validity doesn’t operate as a single shared concept. It functions through three distinct discourses, each with its own logic, its own power dynamics, and its own blind spots.

Three Discourses, Three Different Conversations

The first discourse treats validity as a property of a test. A tool gets labeled “valid,” and that label sticks. As St-Onge et al. describe it, “Validity as a test characteristic is underpinned by the notion that validity is an intrinsic property of a tool and could, therefore, be seen as content and context independent” (p. 853). You hear this constantly in education. “We use a validated rubric.” “MCQs are a valid measure of knowledge.” The tool carries a kind of gold seal, and once it has that seal, there’s little incentive to revisit how it performs in new settings or under new conditions.

The second discourse reframes validity as an evidentiary argument. Validity doesn’t belong to the tool. It belongs to the interpretation of scores. St-Onge et al. cite Downing’s (2003) definition of validity: “the evidence presented to support or refute the meaning or interpretation assigned to assessment results” (p. 859).

Each new administration, each new context requires fresh evidence. The work of validation never ends. This view aligns with Messick (1995) and Kane (2013), with contemporary measurement theory, and with the kind of rigor that testing organizations promote. It’s also demanding. Most programs lack the psychometric expertise or the time to maintain ongoing validation arguments for every assessment they run.

The third discourse treats validity as a social imperative. The focus shifts from technical properties to consequences. What does this assessment do to learners? Who benefits? Who gets excluded? What signal does it send about what matters? St-Onge et al. note that “this discourse shifts the focus of attention from the properties of the tool or the validation process to the desired purpose of assessment for the learner and for society” (p. 862). The concern is programmatic and systemic, not instrument-level.

The authors are careful to note these discourses aren’t mutually exclusive. They found instances where more than one discourse appeared in a single publication. But the discourses are distinct enough that people operating from different ones will talk past each other without realizing it.

Why This Framework Matters for AI and Assessment

St-Onge et al. didn’t write about AI. But their framework explains a pattern I’ve been watching for two years across the AI assessment literature.

When a faculty member defends their multiple-choice exam by saying “this is a validated instrument,” they’re working within discourse one. Validity belongs to the tool. The tool has been proven. End of story. When someone counters that the same exam no longer means what it used to, because students can use AI to prepare or because AI can generate correct answers, they’re working from discourse two. The conditions of administration have changed. The evidence supporting the original score interpretations may no longer hold. Same word, completely different argument.

And when someone argues, as I’ve seen more frequently in recent months, that we should stop measuring recall and analytical writing and start assessing things like critical evaluation of AI output, ethical reasoning, or creative problem-solving that builds on AI-generated material, they’ve moved into discourse three. The question isn’t whether the test works. The question is whether we’re measuring what actually matters for students who will work alongside AI for the rest of their careers.

I’ve covered the AI assessment debate in various posts on this blog. For instance, Corbin, Bearman, Boud, and Dawson (2025) framed AI and assessment as a wicked problem precisely because stakeholders define the problem differently depending on their starting assumptions. St-Onge et al.’s framework helps explain why those definitions diverge.

If your conception of validity is “the test works,” you’ll respond to AI by trying to protect the test. If your conception is “the interpretation needs evidence,” you’ll ask whether the evidence still supports the same inferences. If your conception is “assessment should serve learners and society,” you’ll want to redesign what gets assessed altogether. None of those responses is wrong. But they lead to very different actions, and the friction comes from people arguing across discourses without naming the gap.

The Power Dynamics Are Worth Noting

One of the most interesting parts of the paper, and one I didn’t fully engage with on my first reading, is the analysis of who benefits from each discourse. St-Onge et al. observe that the first discourse, validity as a test property, creates a consumer-producer dynamic.

Producers develop and market “validated” tools. Consumers, often faculty with limited psychometric training, adopt those tools because they need ready-made solutions. The label “validated” functions as a shortcut. It answers a real need in resource-constrained programs. But it also means the people using the assessment may never feel the need to question it, even when context changes dramatically, as it has with AI.

That dynamic is directly relevant right now. Faculty who adopted “validated” assessments years ago are now facing a context that has fundamentally shifted. The tool hasn’t changed. The environment has. And if your understanding of validity is that it lives inside the instrument, you may not see the problem until the scores stop making sense.

The authors also note that the second discourse, the evidentiary chain, concentrates power among measurement specialists. The language is technical. The frameworks are complex. The implicit message is that only certain experts can legitimately validate assessments. That’s a real barrier in the AI conversation, because many of the people redesigning assessments right now are classroom teachers and department chairs, not psychometricians. St-Onge et al.’s analysis suggests the field needs to make space for practitioners to participate in validation work, particularly when the conditions around assessment are changing as fast as they are now.

What Should Change

St-Onge et al. close with a recommendation that sounds simple but would change a lot if people actually followed it: “explicitly describe one’s conceptualizations before making statements of truth about the worth and/or appropriateness of assessment tools and programs” (p. 866). In an AI context, that means naming which version of validity you’re working with before you start redesigning your course or arguing with colleagues about whether exams still mean anything.

I think the AI assessment field has spent most of its energy in discourses one and two. We debate whether existing tools still “work.” We argue about detection and proctoring. We discuss whether take-home essays are still viable. Hartmann (2025), whose recent paper on oral exams I covered, redesigned an entire course around a different format precisely because the written take-home model had lost its evidentiary strength.

All of that is necessary work. But the third discourse, the one focused on consequences and purpose, deserves more room. Perkins and Roe (2025) argued in their chapter on the future of assessment that we need to rethink what kinds of knowledge we prioritize and validate. That’s a discourse-three argument. And I think it’s where the most productive energy in the AI assessment conversation needs to go.

St-Onge et al. published this paper in 2017, well before generative AI forced assessment into crisis mode. But the framework they built is more useful now than when it was written. Name the discourse. Make your assumptions visible. Then argue. The conversations will be sharper, and the decisions that follow will be better for it.

References

  • Corbin, T., Bearman, M., Boud, D., & Dawson, P. (2025). The wicked problem of AI and assessment. Assessment & Evaluation in Higher Education. Advance online publication. https://doi.org/10.1080/02602938.2025.2553340
  • Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37, 830–837.
  • Hartmann, C. (2025). Oral exams for a generative AI world: Managing concerns and logistics for undergraduate humanities instruction. College Teaching. Advance online publication. https://doi.org/10.1080/87567555.2025.2558563
  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
  • Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
  • Perkins, M., & Roe, J. (2025). The end of assessment as we know it: GenAI, inequality and the future of knowing. In AI and the future of education: Disruptions, dilemmas and directions (pp. 76–80). https://durham-repository.worktribe.com/output/4472558/the-end-of-assessment-as-we-know-it-genai-inequality-and-the-future-of-knowing
  • St-Onge, C., Young, M., Eva, K. W., & Hodges, B. (2017). Validity: One word with a plurality of meanings. Advances in Health Sciences Education, 22, 853–867. https://doi.org/10.1007/s10459-016-9716-3