Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP
This paper evaluates LLM-based and statistical NLP methods for the automated scoring of nationwide high school graduation essay exams in Estonia, comparing machine-generated scores against human raters across multiple rubric dimensions, including content, argumentation, and language quality. The study demonstrates that automated scoring achieves reliability comparable to that of human raters, while also examining bias, prompt-injection risks, and the capacity to provide personalized feedback.
Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important when a large number of exams must be graded within a limited time frame, as is the case for nationwide graduation exams in various countries. Here, we examine the applicability of automated scoring to two large datasets of trial exam essays from two full national c