Most conversations about AI and assessment focus on whether students are using ChatGPT to cheat. That’s the wrong conversation. The harder question, and I believe the one that will shape the next decade of educational measurement, is whether AI can score student work in ways that actually mean something. Kaldaras, Akaeze, and Reckase (2024) take that question seriously in a conceptual analysis published in Frontiers in Education, and they arrive at a conclusion I think more educators need to hear: we don’t yet have the guidelines to answer it.
Their paper lands at an interesting moment. Education systems in Germany, Finland, China, and the United States have all been moving away from memorization-based testing toward knowledge application, the ability to use what you know to solve real problems and explain real phenomena. Kaldaras et al. point out that this shift requires open-ended assessments where students build models, write explanations, and demonstrate complex reasoning.
Those kinds of assessments are expensive to create and even more expensive to score at scale. That’s exactly what makes GAI tools attractive. They could, in theory, speed up every part of the process, from item creation to rubric-based scoring. The catch is that nobody has established standards for evaluating whether AI-generated assessments or AI-produced scores are actually valid.
AI Assessment Validity
Kaldaras et al. ground their framework in the Standards for Educational and Psychological Testing published by the American Educational Research Association (Eignor, 2013). Citing Messick (1980) and Eignor (2013), they define validity as “the degree to which evidence and theory support interpretations of assessment results (for example, test scores) for the proposed uses of a given assessment” (p. 2).
That’s not a technicality. It’s the foundation everything else rests on. If a GAI tool scores a student’s response as proficient, and that score gets used to guide instruction or report progress, there needs to be evidence that the score accurately reflects what the student knows and can do. Right now, that evidence doesn’t exist in any systematic form for GAI-based scoring.
The framework Kaldaras et al. propose evaluates GAI through five existing sources of validity evidence: test content, response process, internal structure, relation to other variables, and validity generalization. Each one maps onto specific concerns about how GAI performs in educational contexts.

On test content, Kaldaras et al. suggest using elements from evidence-centered design (ECD), specifically claims and evidence statements, as a basis for generating GAI prompts. The logic is straightforward: if the prompt is grounded in the construct the test is supposed to measure, the resulting items are more likely to align with what they should be assessing.
They also note that previously validated test questions can serve as models for GAI-based item generation. I think this is the most immediately useful recommendation in the paper, because it gives practitioners something concrete. You don’t need a new theory. You need to feed your GAI tool the right inputs and then check the outputs against what you already know works.
The most original contribution is what Kaldaras et al. call “GAI scoring process-based validity evidence, which relates to assessing the alignment between human and GAI-scored response features” (p. 6). That’s a new category they’re proposing. Before GAI, the standard method for validating AI scores was to compare them against human scores and measure agreement.
Kaldaras et al. argue that agreement alone isn’t enough. A GAI model might assign the same score a human would, but for completely different reasons. If the model is keying on word count or vocabulary level and a trained human scorer is keying on conceptual understanding, the match is coincidental. The scores look valid but aren’t. Their recommendation: ask the GAI to explain its rationale for each score, then compare that rationale to what a trained human scorer would consider. It’s a simple idea, and I’m surprised it wasn’t standard practice already.
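The gap between matching scores and matching reasons can be made concrete. The Python sketch below is illustrative only: the scores, the feature labels, and the choice of Cohen's kappa and Jaccard overlap are my assumptions, not metrics the paper prescribes. It shows a case where the human and the GAI agree perfectly on scores while citing almost none of the same response features.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length lists of scores."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each scorer's marginal distribution.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


def jaccard(x, y):
    """Overlap between two sets of cited features (1.0 means identical)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 1.0


# Hypothetical data: four responses scored by a human and by a GAI tool.
human_scores = [1, 2, 3, 2]
gai_scores = [1, 2, 3, 2]

# Hypothetical features each scorer named in its rationale, per response.
human_features = [
    {"claim"},
    {"claim", "evidence"},
    {"claim", "evidence", "reasoning"},
    {"claim", "evidence"},
]
gai_features = [
    {"length"},
    {"length", "vocabulary"},
    {"length", "vocabulary", "evidence"},
    {"length", "vocabulary"},
]

kappa = cohens_kappa(human_scores, gai_scores)
overlaps = [jaccard(h, g) for h, g in zip(human_features, gai_features)]
mean_overlap = sum(overlaps) / len(overlaps)

print(f"score agreement (kappa): {kappa:.2f}")
print(f"mean rationale overlap:  {mean_overlap:.2f}")
```

In this made-up case the agreement statistic looks perfect while the rationale overlap is close to zero, which is exactly the situation score-only validation would miss.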
The bias finding is the part that should concern anyone working with diverse student populations. Kaldaras et al. report from their own work using ChatGPT to score math-science sensemaking assessments: “We discovered that responses consistent with sophisticated reasoning but use non-standard language or show evidence of responders being non-native English speakers tended to be mis-scored by GAI to lower LP levels” (p. 7).
Students who demonstrated strong reasoning but expressed it in non-standard English got penalized. That’s a validity threat, and it falls hardest on multilingual students and communities already underrepresented in training data. I’ve covered related findings about how multicultural students experience GenAI tools in academic writing (see, e.g., Hysaj et al., 2025), and the pattern holds: AI systems built on standard English norms systematically undervalue the reasoning of students who communicate differently.
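One way to look for this kind of pattern in your own data is to compare the direction of scoring error across language groups. The sketch below is a minimal illustration under my own assumptions: the group labels, the records, and the use of mean signed error (GAI score minus human score) are hypothetical and are not taken from the paper.

```python
from statistics import mean


def mean_signed_error(records):
    """Average (GAI score - human score) per group.

    records: iterable of (group, human_score, gai_score) tuples.
    A negative value means the GAI tends to score that group lower
    than the human scorer did.
    """
    diffs_by_group = {}
    for group, human, gai in records:
        diffs_by_group.setdefault(group, []).append(gai - human)
    return {group: mean(diffs) for group, diffs in diffs_by_group.items()}


# Hypothetical records: (language group, human score, GAI score).
records = [
    ("standard", 3, 3),
    ("standard", 2, 2),
    ("standard", 3, 2),
    ("standard", 1, 2),
    ("non_standard", 3, 2),
    ("non_standard", 3, 1),
    ("non_standard", 2, 2),
    ("non_standard", 2, 1),
]

errors = mean_signed_error(records)
for group, err in errors.items():
    print(f"{group}: mean signed error = {err}")
```

A signed measure matters here because errors that cancel out within one group can still be consistently negative within another.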
Kaldaras et al. suggest one possible fix: developing a vocabulary of non-standard language and incorporating it into the GAI training process. They also recommend running GAI scoring multiple times on the same dataset to evaluate consistency. In their own project, inconsistent scores across trials flagged where the prompt needed revision. Better agreement followed after those revisions. That iterative process of prompting, checking, and revising is something I think teachers will recognize. It’s not fundamentally different from refining a rubric after the first round of scoring reveals problems.
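The repeated-scoring check is easy to prototype. Here is a minimal sketch, assuming you have already collected scores from several runs of the same prompt on the same responses; the response IDs, the scores, and the `min_agreement` threshold are hypothetical choices of mine, not values from the paper.

```python
from collections import Counter


def flag_unstable(runs, min_agreement=1.0):
    """Flag responses whose scores vary across repeated scoring runs.

    runs: dict mapping response id -> list of scores, one per run.
    Returns a dict of response id -> share of runs that gave the most
    common score, for responses where that share is below min_agreement.
    """
    flagged = {}
    for response_id, scores in runs.items():
        modal_count = Counter(scores).most_common(1)[0][1]
        share = modal_count / len(scores)
        if share < min_agreement:
            flagged[response_id] = share
    return flagged


# Hypothetical scores from three runs of the same prompt.
runs = {
    "r1": [2, 2, 2],  # stable
    "r2": [1, 2, 3],  # every run disagrees
    "r3": [3, 3, 2],  # one run disagrees
}

unstable = flag_unstable(runs)
for response_id, share in unstable.items():
    print(f"{response_id}: modal score in {share:.0%} of runs")
```

The flagged responses are the ones worth reading by hand, since they point at the parts of the prompt or rubric the model is interpreting inconsistently.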
The generalizability argument is where I think the paper could have gone further. Kaldaras et al. correctly flag two serious problems: GAI models can hallucinate, meaning they produce incorrect predictions, and they can drift, losing accuracy on new inputs compared to their performance during training. Both of those phenomena mean that a validated model doesn’t stay validated forever. The authors argue for ongoing human monitoring even after release. Fine. But they don’t give much guidance on what that monitoring should look like at scale, or who pays for it, or how schools and testing agencies with limited budgets are supposed to implement it. The recommendation is sound. The infrastructure question is wide open.

The paper also acknowledges something uncomfortable about human scoring itself: “Research has shown that humans are biased toward longer than shorter responses, and therefore, human scores also represent nonperfect criteria” (p. 7). That line complicates the whole discussion. If humans aren’t a perfect benchmark, then treating human scores as the gold standard for validating AI is already flawed.
Kaldaras et al. suggest holding AI to the same training standards as human scorers, including seeding in previously scored responses to check for drift, which is already standard practice in large-scale human scoring operations. I think the comparison is useful precisely because it lowers the temperature. It’s not about AI being good or bad. It’s about whether we’re applying the same rigor we’ve always been supposed to apply, and whether AI gives us new tools to do it better.
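Seeding can be sketched in a few lines. The example below assumes a small set of previously scored responses is mixed into each scoring batch and that exact agreement with the reference scores is tracked over time; the seed IDs, batch names, scores, and the 0.8 threshold are all hypothetical.

```python
def seed_agreement(reference, gai_scores):
    """Share of seeded responses where the GAI score matches the reference."""
    shared = reference.keys() & gai_scores.keys()
    if not shared:
        raise ValueError("no seeded responses found in this batch")
    return sum(reference[k] == gai_scores[k] for k in shared) / len(shared)


def drift_alerts(reference, batches, threshold=0.8):
    """Return the names of batches whose seed agreement is below threshold."""
    return [
        name
        for name, gai_scores in batches.items()
        if seed_agreement(reference, gai_scores) < threshold
    ]


# Hypothetical reference scores for five seeded responses.
reference = {"s1": 2, "s2": 3, "s3": 1, "s4": 2, "s5": 3}

# Hypothetical GAI scores for the same seeds in two scoring batches.
batches = {
    "week1": {"s1": 2, "s2": 3, "s3": 1, "s4": 2, "s5": 3},
    "week8": {"s1": 2, "s2": 2, "s3": 1, "s4": 1, "s5": 3},
}

alerts = drift_alerts(reference, batches)
print("batches needing human review:", alerts)
```

The threshold is a policy decision rather than a technical one, which is part of why human oversight stays in the loop.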
One idea from the paper that deserves more attention: using multiple GAI algorithms to cross-validate each other. Kaldaras et al. propose having one algorithm identify patterns in a dataset and a second one score the same data. The results across algorithms could reveal inconsistencies that a single-model approach would miss. I haven’t seen this discussed much in the educational AI literature, and it’s a practical strategy that testing organizations could implement now.
Kaldaras et al. close with a firm position: “We also believe that while GAI can perform many of the tasks outlined above, the end judge of the validity of GAI actions should always be humans” (p. 9). I agree with that conclusion, and I’d add that it isn’t just a safety precaution. It’s a statement about what assessment is for. Assessments exist to help people learn and to tell us something true about what students understand. If we hand the entire process over to models that can’t explain their own reasoning, we’ve lost something essential. The tools are getting faster, no question. The question is whether our standards are keeping up.
References
- Eignor, D. R. (2013). The standards for educational and psychological testing. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J.-I. C. Hansen, N. R. Kuncel, S. P. Reise, et al. (Eds.), APA handbook of testing and assessment in psychology, Vol. 1. Test theory and testing and assessment in industrial and organizational psychology (pp. 245–250). American Psychological Association.
- Hysaj, A., Dean, B. A., & Freeman, M. (2025). Exploring the purposes and uses of generative artificial intelligence tools in academic writing for multicultural students. Higher Education Research & Development, 44(7), 1686-1700. https://doi.org/10.1080/07294360.2025.2488862
- Kaldaras, L., Akaeze, H. O., & Reckase, M. D. (2024). Developing valid assessments in the era of generative artificial intelligence. Frontiers in Education, 9, 1399377. https://doi.org/10.3389/feduc.2024.1399377
- Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35(11), 1012–1027. https://doi.org/10.1037/0003-066X.35.11.1012
