4.1 Research Report

Scoring and Grading: Benchmarking AI in Automated Educational Assessment

Benchmarks evaluating automated scoring, grading, and rubric application.

How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this category, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, and then used Claude to synthesise the findings into a structured analysis. The report below reflects both what the research covers and what it doesn't.

Automated scoring and grading represents one of the most mature and extensively researched areas in educational AI, with 139 papers spanning automated essay scoring (AES), short answer grading, programming assessment, and domain-specific evaluation across multiple languages and educational contexts. The field has undergone a significant methodological evolution — from hand-crafted linguistic features and vector space models through to deep learning architectures and, most recently, large language models (LLMs) used in zero-shot and few-shot configurations. Neural network models now routinely achieve human-level agreement on established benchmarks, with quadratic weighted kappa (QWK) scores exceeding 0.80 on standard datasets such as ASAP.
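As a rough illustration of how that agreement figure is computed, the sketch below derives QWK using scikit-learn's cohen_kappa_score with quadratic weights. The score arrays and the 0-3 rubric scale are hypothetical placeholders, not actual ASAP data.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between human and model scores.
# The score vectors below are invented placeholders, not real ASAP data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical essay scores on a 0-3 rubric scale (one entry per essay).
human_scores = [3, 2, 2, 1, 0, 3, 2, 1, 1, 2]
model_scores = [3, 2, 1, 1, 0, 3, 3, 1, 2, 2]

# QWK is Cohen's kappa with quadratic weights, so large disagreements
# (e.g. scoring a 0 essay as a 3) are penalised more than off-by-one errors.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")  # values above ~0.80 are typically read as human-level agreement
```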

However, a critical tension sits at the heart of this body of research. The overwhelming emphasis on statistical agreement with human raters, measured through QWK, Pearson correlation, and root mean square error (RMSE), has come at the expense of deeper questions about pedagogical validity, fairness, and real-world impact on learning. Most systems treat human scores as perfect ground truth, yet inter-rater disagreement is well documented. LLMs frequently assign scores that differ systematically from those of human raters, often scoring more harshly, and most systems remain opaque in their scoring rationales. Only 5 of 139 papers meaningfully address cognitive offloading or learning science concerns. The field now stands at a juncture where technical capability has outpaced our understanding of whether these systems support genuine learning, serve all students equitably, or hold up under adversarial scrutiny. This represents both a significant gap and a major opportunity for the education sector.
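The distinction between agreement and systematic bias can be made concrete with a short sketch: it reports the Pearson correlation and RMSE named above alongside the mean signed model-minus-human difference, where a persistently negative value corresponds to the harsher-than-human scoring pattern described here. The score vectors are hypothetical, chosen only to illustrate the calculation.

```python
# Minimal sketch: agreement metrics vs. systematic score bias (hypothetical data).
import numpy as np
from scipy.stats import pearsonr

human = np.array([4, 3, 5, 2, 4, 3, 5, 4], dtype=float)
model = np.array([3, 3, 4, 2, 3, 3, 4, 4], dtype=float)  # a model that tends to score lower

pearson_r, _ = pearsonr(human, model)            # linear association with human raters
rmse = np.sqrt(np.mean((model - human) ** 2))    # average magnitude of disagreement
mean_diff = np.mean(model - human)               # signed bias: negative = harsher than humans

print(f"Pearson r:       {pearson_r:.3f}")
print(f"RMSE:            {rmse:.3f}")
print(f"Mean difference: {mean_diff:+.3f}")
# High correlation can coexist with a consistent downward offset, which is exactly
# the kind of systematic harshness that agreement metrics alone can mask.
```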