Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs' Ability To Mark Short Answer Questions in K-12 Education
This paper evaluates how well Large Language Models (specifically GPT-4) can automatically grade short answer questions across different K-12 subjects (Science and History) and grade levels (ages 5-16), using a novel dataset from Carousel Learning. The best configuration performs close to human-level marking (Cohen's kappa of 0.70, versus 0.75 for human raters). The study tests various prompt engineering strategies to assess LLM capabilities for formative assessment tasks.
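For context (our note, assuming the reported statistic is the standard Cohen's kappa): kappa measures agreement between two raters after correcting for agreement expected by chance,

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) is the proportion expected by chance. A kappa of 0.70 for the LLM against 0.75 for human-human agreement thus indicates near-human consistency on this marking task.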
This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e., grade) open-text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategy performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16), using a new, never-before-used dataset from Carousel Learning.