Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP
This paper evaluates LLM and statistical NLP methods for automated grading of nationwide school-leaving essay exams in Estonia, comparing their performance against human raters using curriculum-based rubrics. The study examines two full national cohorts of trial exams, assessing reliability, validity, bias, and the viability of human-in-the-loop automated scoring for high-stakes K-12 assessments.
Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important when large numbers of exams must be graded within a limited time frame, as is the case for nationwide graduation exams in various countries. Here, we examine the applicability of automated scoring to two large datasets of trial exam essays covering two full national cohorts.
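A standard way to quantify agreement between automated and human raters on ordinal rubric scores is quadratic weighted kappa (QWK), which is widely used in automated essay scoring evaluations. The sketch below is illustrative only: the score range and rater lists are hypothetical placeholders, not values from this study.

```python
from collections import Counter

def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two raters on an ordinal scale.

    Returns 1.0 for perfect agreement, 0.0 for chance-level
    agreement, and negative values for systematic disagreement.
    Scores are assumed to be integers in [min_score, max_score].
    """
    n_cat = max_score - min_score + 1
    n = len(human)

    # Observed agreement matrix: counts of (human score, model score) pairs.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal score distributions, used to build the matrix
    # expected under rater independence.
    hist_h = Counter(h - min_score for h in human)
    hist_m = Counter(m - min_score for m in model)

    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Quadratic penalty: disagreements farther apart weigh more.
            w = (i - j) ** 2 / (n_cat - 1) ** 2
            expected = hist_h[i] * hist_m[j] / n
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den
```

In practice the same quantity is available as `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")`; the hand-rolled version above only makes the weighting explicit.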