If AI can pass your exam, your exam has a problem. That’s the core of what Fatima, Sheikh, and Osama (2024) argue in their commentary on authentic assessment in medical education. They cite ChatGPT scoring near the passing threshold on the USMLE without any clinical training, and ChatGPT-4 hitting 78.77% on a radiation oncology in-training exam. If a language model can clear those bars, the assessments aren’t testing what they should be testing.
I’ve been making a version of this argument across several posts on this blog, most recently when covering Perkins and Roe (2025) on the end of assessment as we’ve known it. Fatima et al. come at the same problem from a medical education angle, and they bring two ideas I think are worth engaging with: AI-powered authentic assessment and the Student-as-Partners (SaP) model. This is a commentary paper, not an empirical study, and that’s worth noting upfront. But the proposals here are concrete enough to be genuinely useful, which is rare for the genre.

Why Traditional Medical Assessments Fall Short
Fatima et al. open with a familiar critique: MCQs, essays, and short-answer questions test memorization in predictable formats. They don’t measure what actually makes a competent physician, things like clinical reasoning under pressure, communication with patients, or the ability to adapt when a case takes an unexpected turn. I agree with this starting point. It’s well established and the AI performance data makes it harder to ignore.
The authors ground their argument in two frameworks for authentic assessment: Ashford-Rowe et al.’s (2014) eight critical elements (challenge, product, transfer, metacognition, accuracy, fidelity, feedback, collaboration) and Villarroel et al.’s (2020) three dimensions (realism, cognitive challenge, evaluative judgment).
Naming these frameworks matters because too many papers toss around “authentic assessment” as a buzzword without defining what it actually requires. Fatima et al. are specific about the standards they’re working with, and that gives the rest of the paper a foundation to build on.
They then identify three AI capabilities converging in medical education: generative AI for personalized simulations and learning materials, semantic search tools for research navigation, and immersive AR/VR for placing students in realistic clinical environments before they see real patients.
The Student-as-Partners Angle
The part of this commentary I found most interesting is the argument for Student-as-Partners (SaP) in assessment design. Fatima et al. propose that students shouldn’t just take assessments. They should help build them. The research they cite shows SaP initiatives increase student motivation, deepen content understanding, and develop skills like leadership and critical thinking. Faculty benefit too, gaining insight into how students actually learn and where the gaps are.
The Tuckamore Simulation Research Collaborative in Newfoundland is a strong example. Student teams, guided by expert mentors, created simulation cases using the Royal College CanMEDS framework. The collaboration crossed medical, academic, and technology sectors.
That’s the kind of interdisciplinary work that produces assessments rooted in clinical complexity, not just theoretical ideals. And it reveals something that faculty-driven design processes routinely miss: what students think a realistic clinical challenge actually looks like from their side of the table.
Most AI assessment discussions I’ve seen on this blog and in the literature treat assessment as something done to students. The SaP model flips that. When students co-design assessments, they bring knowledge of their own learning gaps, firsthand experience with the tools, and a stake in making the process meaningful. That’s a corrective to the top-down approach most institutions default to, and I think the AI assessment conversation needs more of it.
The Gap Between Ambition and Evidence
I should be clear about what this paper doesn’t do. It proposes ideas, compiles tools, and suggests frameworks, but it doesn’t test any of them. The assessment scenarios provided are interesting, but we don’t know whether they produce better learning outcomes, whether faculty can implement them without massive training investments, or whether they generate valid evidence of clinical competence. Commentary papers are valuable for opening up possibilities. They can’t close the loop on whether those possibilities work.
The authors are upfront about practical barriers: resource constraints (particularly in low- and middle-income countries), subjectivity in grading complex tasks, difficulty scaling authentic assessments across disciplines, and the risk that heavy AI reliance might erode the hands-on clinical skills it’s supposed to complement.
I think that last point is the most pressing one. Sperber et al. (2025) have shown with their PAIRR framework that AI-powered formative assessment can work when it’s designed with pedagogical intention. But “designed with intention” is doing a lot of work in that sentence. AI tools dropped into medical curricula without alignment to learning goals won’t produce better physicians. They’ll produce fancier assessments.
Fatima et al. also call for comprehensive faculty development covering test construction, alignment with competencies, blueprinting, assessor training, and scoring standards. I agree completely. And I’d add that faculty development is where most ambitious AI assessment proposals fall apart.
Institutions announce new tools with enthusiasm. They’re far less willing to invest in the training that makes those tools pedagogically sound. Dawson, Bearman, Dollinger, and Boud (2024) have been building the case that assessment design should start from validity questions, not from technology (see also Bearman et al., 2023). That principle applies here with full force.
What This Commentary Adds to the Conversation
This isn’t a paper that reshapes the field, and it doesn’t claim to be. It’s a commentary that moves the conversation in a useful direction. The tool tables give medical educators something concrete to evaluate and adapt. The SaP model introduces a collaborator that most AI assessment papers ignore entirely. And the medical education context grounds everything in clinical specifics that generic higher education research often lacks.
What’s missing is the evidence. Someone needs to take these proposals into a medical school, build one of those AI-infused assessments, run it with students, and measure what happens. The ideas here are worth testing. They just haven’t been tested yet. That’s the next step, and it’s the one that will tell us whether this vision holds up or stays aspirational.
References
- Ashford-Rowe, K., Herrington, J., & Brown, C. (2014). Establishing the critical elements that determine authentic assessment. Assessment & Evaluation in Higher Education, 39(2), 205–222.
- Bearman, M., Nieminen, J. H., & Ajjawi, R. (2023). Designing assessment in a digital world: An organising framework. Assessment & Evaluation in Higher Education, 48(3), 291–304. https://doi.org/10.1080/02602938.2022.2069674
- Dawson, P., Bearman, M., Dollinger, M., & Boud, D. (2024). Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7), 1005–1016. https://doi.org/10.1080/02602938.2024.2386662
- Fatima, S. S., Sheikh, N. A., & Osama, A. (2024). Authentic assessment in medical education: Exploring AI integration and student-as-partners collaboration. Postgraduate Medical Journal, 100(1190), 959–967. https://doi.org/10.1093/postmj/qgae088
- Perkins, M., & Roe, J. (2025). The end of assessment as we know it: GenAI, inequality and the future of knowing. In AI and the future of education: Disruptions, dilemmas and directions (pp. 76–80). https://durham-repository.worktribe.com/output/4472558/the-end-of-assessment-as-we-know-it-genai-inequality-and-the-future-of-knowing
- Sperber, L., MacArthur, M., Minnillo, S., Stillman, N., & Whithaus, C. (2025). Peer and AI Review + Reflection (PAIRR): A human-centered approach to formative assessment. Computers and Composition, 76, 102921. https://doi.org/10.1016/j.compcom.2025.102921
- Villarroel, V., Boud, D., Bloxham, S., Bruna, D., & Bruna, C. (2020). Using principles of authentic assessment to redesign written examinations and tests. Innovations in Education and Teaching International, 57(1), 38–49.
