Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education
This paper evaluates GPT-4's ability to automatically grade short answer questions in K-12 education (ages 5-16) across Science and History using a novel dataset from the Carousel quizzing platform, finding performance (Kappa 0.70) close to human-level (0.75). The study demonstrates that LLMs can reliably perform formative assessment grading tasks across multiple subjects and grade levels with minimal prompt engineering.
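Agreement between LLM and human markers of the kind reported above is typically quantified with Cohen's Kappa. The sketch below, assuming binary correct/incorrect labels and scikit-learn (the paper's exact evaluation pipeline is not shown here), illustrates how such an agreement score is computed.

```python
# Minimal sketch: LLM-vs-human marking agreement via Cohen's Kappa.
# The labels below are illustrative, not from the Carousel dataset.
from sklearn.metrics import cohen_kappa_score

# 1 = answer marked correct, 0 = answer marked incorrect
human_marks = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
llm_marks   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_marks, llm_marks)
print(f"Cohen's Kappa: {kappa:.2f}")  # chance-corrected agreement
```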
This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e., grade) open-text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategy performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16), using a new, never-used-before dataset from the Carousel quizzing platform.
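To illustrate the kind of marking setup such experiments evaluate, the sketch below shows a minimal zero-shot grading call against the OpenAI chat completions API. The prompt wording, model identifier, and binary grading scale are assumptions for illustration only, not the paper's actual prompt engineering strategies.

```python
# Minimal sketch of LLM-based short answer marking (illustrative only;
# the paper's actual prompts and GPT versions are not reproduced here).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def mark_answer(question: str, mark_scheme: str, student_answer: str) -> str:
    """Ask the model to mark one student answer as Correct or Incorrect."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a teacher marking short answer questions. "
                    "Reply with exactly one word: Correct or Incorrect."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Mark scheme: {mark_scheme}\n"
                    f"Student answer: {student_answer}"
                ),
            },
        ],
        temperature=0,  # deterministic marking
    )
    return response.choices[0].message.content.strip()

# Example call (hypothetical item, not from the Carousel dataset):
# mark_answer("What process do plants use to make food?",
#             "Accept: photosynthesis", "photosynthesis")
```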