AI Assessment in Higher Education: Why a Multi-Stakeholder Framework Matters

I’ve been writing about AI and assessment for a while, and one pattern keeps showing up across the research: most frameworks focus on a single stakeholder at a time. They’ll tell teachers what to do, or outline what students should know, or explain what institutions need to regulate. Rarely does a framework try to hold all three together in one coherent model.

Ilieva, Yankova, Ruseva, and Kabaivanov (2025) attempt exactly that. Their paper, published in the journal Information, proposes a framework for generative AI-supported assessment in higher education built around three branches: instructors, students, and quality assurance bodies. And it does something I think the field needs far more of. It treats assessment as a shared responsibility across all three groups, not a problem any one of them can solve alone.

Why Existing Frameworks Leave Gaps

The paper reviews several existing models, and the gaps are telling. Some focus narrowly on specific task types, such as quantitative economics problems or essay-based assignments, without addressing the broader curriculum. Others rely on perception surveys that tell us how people feel about AI in assessment but don’t operationalize anything at the task level. A few strong conceptual models exist, but they remain theoretical and haven’t been tested against real assessment data.

I’ve seen this tension across the research I cover. The AI Assessment Scale from Perkins, Roe, and Furze (2024) gives instructors five levels of AI integration tied to specific learning goals. It’s practical, flexible, and grounded in learning theory. But it’s primarily a tool for instructor-level decisions. The institutional layer stays outside its scope: how departments coordinate, how quality assurance bodies audit AI-supported evaluation, and how accreditation standards adapt.

Corbin, Bearman, Boud, and Dawson (2025) framed AI and assessment as a wicked problem. I’ve covered that paper on this blog, and the framing is powerful. But wicked problems still need structured responses. You can acknowledge complexity and still try to build something that works across multiple levels. Ilieva et al. are trying to do both at once.

Three Branches, One Assessment Process

The framework organizes assessment around three stakeholder groups, each with specific responsibilities at every stage.

Instructors handle planning, preparation, delivery, evaluation, and reflection. AI tools support them in generating rubrics, building adaptive question banks, personalizing feedback at scale, and analyzing grading patterns over time. The framework keeps final grading authority with the instructor. As Ilieva et al. put it:

While generative AI tools offer significant benefits, such as immediate feedback, reduced administrative workload, and individualised student support, it is essential to emphasise that all critical decisions regarding instructional design, assessment criteria, and course outcomes remain the responsibility of the instructor. The tool function is to complement, not replace, the academic judgment and pedagogical expertise of educators. (p. 14)

That language matches a position I’ve held across many posts on this blog, to the point where it now feels like a cliché: AI should support professional judgment, not substitute for it.

Students, in this model, are active participants. They access AI-assisted learning archives, use chatbots to clarify assignment requirements, practice with AI-generated formative assessments, and receive personalized feedback tied to rubrics. The framework also asks students to reflect on that feedback and use it to guide future learning.

I think that element is crucial. Fan et al. (2025) showed in their metacognitive laziness study that AI can improve the final product while leaving the actual learning process untouched. Structured reflection is one of the few interventions that disrupts that pattern.

Quality assurance bodies form the third branch. They monitor grading consistency, audit assessment materials for fairness and alignment with learning outcomes, and use AI-powered dashboards to track performance trends across departments. The framework even suggests AI tools that evaluate the cognitive level of test items using Bloom’s taxonomy and flag problems like redundancy or bias.

I like the three-way structure. Assessment doesn’t happen in isolation. A rubric designed by an instructor, completed by a student with AI support, and audited by a quality assurance committee involves all three groups. Any model that ignores one of them is, I think, working with an incomplete picture.

Real Exam Data, Not Just Theory

Too many conceptual papers stop at the proposal stage. Ilieva et al. go further. They tested their framework using a final exam in a university-level data analysis course, comparing scores from a team of three lecturers with grades generated by ChatGPT 4.5, both using the same rubric.

The results are interesting, though the scale is small. Across fifteen students and six open-ended questions, ChatGPT matched or closely approximated lecturer scores in most cases. Twelve out of fifteen final grades were identical. The overall mean absolute error was 2.93 points on a 40-point scale, roughly one grade level of difference.

But the exceptions are where the real lessons are. According to the authors, “ChatGPT achieved the closest agreement with lecturer scoring on Q3, with a MAE of just 0.53, suggesting very high consistency” (p. 18). On the other end, “Q4 presented the greatest challenge for ChatGPT, with the highest MAE (1.87) and RMSE (2.71)” (p. 18), because that question required students to submit an Excel file with embedded solver configurations, a format the AI couldn’t meaningfully evaluate at the time.
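
For readers less familiar with the two error metrics quoted here, this is a minimal Python sketch of how MAE and RMSE are computed from paired scores. The score lists are made up for illustration and are not data from Ilieva et al.; the function names are illustrative too, and the paper does not say which tooling the authors used.

```python
from math import sqrt


def mae(human, ai):
    """Mean absolute error: the average size of the scoring gap."""
    if len(human) != len(ai) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(h - a) for h, a in zip(human, ai)) / len(human)


def rmse(human, ai):
    """Root mean squared error: like MAE, but large gaps weigh more."""
    if len(human) != len(ai) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sqrt(sum((h - a) ** 2 for h, a in zip(human, ai)) / len(human))


# Hypothetical scores for one question (NOT taken from the paper).
lecturer_scores = [4, 3, 5, 2, 4]
ai_scores = [4, 3, 4, 2, 1]

print(f"MAE:  {mae(lecturer_scores, ai_scores):.2f}")
print(f"RMSE: {rmse(lecturer_scores, ai_scores):.2f}")
```

With these made-up scores, MAE comes out at 0.80 and RMSE at about 1.41. RMSE squares each difference before averaging, so a single large miss (the 4 versus 1 in the last pair) pulls it up far more than it pulls MAE. That is one reason RMSE can sit well above MAE, as it does in the Q4 figures quoted above.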

One case was particularly revealing. A student received a score of 1 from ChatGPT but 4 from the lecturers. The human evaluators rewarded unconventional thinking, even though the student’s response was technically incomplete. The AI couldn’t recognize the value of a creative approach that didn’t fit neatly into the rubric. And that’s exactly the kind of judgment call that keeps human evaluators essential in summative assessment.

Where the Framework Falls Short

I want to be direct about two things.

First, the empirical test is very small. Fifteen students, one course, one discipline (data analysis), one university. The authors describe it as a pilot, and that’s fair. But we should be cautious about how far we extend the findings. AI grading consistency in a structured data analysis course may look very different from what you’d see in a philosophy seminar, a creative writing workshop, or a clinical placement evaluation. The framework is ambitious in scope. The evidence behind it, so far, is narrow.

Second, equity. Ilieva et al. mention fairness, transparency, and inclusion as quality assurance criteria. Those are the right words. But the paper doesn’t really grapple with what happens when institutions lack the digital infrastructure, the training capacity, or the funding to implement AI-supported assessment at any meaningful scale. Perkins and Roe (2025) warned in their chapter on the future of AI and assessment that AI risks becoming another layer of exclusion if access isn’t addressed. I’ve been tracking that concern across the research I cover, and it applies directly here. In well-resourced universities, this framework is actionable. In under-resourced contexts, it risks staying aspirational.

What This Means for Educators

If you’re teaching in higher education and trying to figure out where AI fits into your assessment practice, Ilieva et al. offer something concrete. Use AI to support rubric design and question generation during the planning stage. Let students practice with AI-generated formative activities and receive personalized feedback. Use AI for initial scoring on structured tasks. But keep your professional judgment at the center of summative decisions, particularly for complex, interpretive, or creative work.

The authors themselves frame this clearly:

These findings suggest that generative AI is well suited for evaluating structured tasks and providing formative feedback but still requires human oversight for complex or high-impact assessments. Its integration into the assessment process is most effective when used to complement, rather than replace, academic judgment. (p. 18)

The framework’s strongest contribution is its multi-stakeholder structure. It reminds us that assessment is never just an instructor problem or a student problem or an institutional problem. It’s all three at once. And any serious attempt to integrate AI into assessment needs to account for all three. The field has been slow to build models at that level of coordination. Ilieva et al. have given us a starting architecture, even if the evidence base needs to grow considerably before we can call it proven.

References

  • Corbin, T., Bearman, M., Boud, D., & Dawson, P. (2025). The wicked problem of AI and assessment. Assessment & Evaluation in Higher Education. Advance online publication. https://doi.org/10.1080/02602938.2025.2553340
  • Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489-530. https://doi.org/10.1111/bjet.13544
  • Ilieva, G., Yankova, T., Ruseva, M., & Kabaivanov, S. (2025). A framework for generative AI-driven assessment in higher education. Information, 16(6), 472. https://doi.org/10.3390/info16060472
  • Perkins, M., Roe, J., & Furze, L. (2024). The AI Assessment Scale revisited: A framework for educational assessment. Preprint. https://arxiv.org/abs/2412.09029
  • Perkins, M., & Roe, J. (2025). The end of assessment as we know it: GenAI, inequality and the future of knowing. In AI and the future of education: Disruptions, dilemmas and directions (pp. 76-80). https://durham-repository.worktribe.com/output/4472558
