Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education

Benchmark (Not Published) · Relevance: 9/10 · 25 citations · 2024 paper

This paper evaluates how well Large Language Models (specifically GPT-4) can automatically grade short answer questions across different K-12 subjects (Science and History) and grade levels (ages 5-16) using a novel dataset from Carousel Learning, finding performance close to human-level marking (Kappa 0.70 vs 0.75). The study tests various prompt engineering strategies to assess LLM capabilities for formative assessment tasks.

This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e. grade) open-text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategy performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16), using a new, never-used-before dataset from Carousel Learning.
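The marking setup described above can be sketched as a prompt-construction and response-parsing step. This is a minimal, hypothetical illustration of how an LLM marker might be wired up, not the paper's actual prompts; the question, mark-scheme text, and the CORRECT/INCORRECT output convention are assumptions for the example.

```python
def build_marking_prompt(question, mark_scheme, answer, examples=()):
    """Assemble a marking prompt, optionally with few-shot examples.

    `examples` is an iterable of (student_answer, mark) pairs used as
    few-shot demonstrations -- one of the prompt engineering strategies
    a study like this might compare against zero-shot prompting.
    """
    lines = [
        "You are a teacher marking short answer questions.",
        f"Question: {question}",
        f"Mark scheme: {mark_scheme}",
    ]
    for ex_answer, ex_mark in examples:
        lines.append(f"Student answer: {ex_answer}")
        lines.append(f"Mark: {ex_mark}")
    lines.append(f"Student answer: {answer}")
    lines.append("Reply with exactly one word: CORRECT or INCORRECT.")
    return "\n".join(lines)


def parse_mark(completion):
    """Map the model's free-text reply onto a binary mark (1/0).

    Token-level comparison avoids matching the substring "CORRECT"
    inside "INCORRECT".
    """
    tokens = completion.strip().upper().split()
    return 1 if "CORRECT" in tokens else 0
```

The prompt string would be sent to the model API, and `parse_mark` applied to the completion; binary marks from the model could then be compared against human marks with an agreement statistic such as Kappa, as the study does.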

Study Type

Benchmark (Not Published)

Framework Categories

Tool Types

Teacher Support Tools Tools that assist teachers — lesson planning, content generation, grading, analytics.

Tags

LLM evaluation · K-12 education · computer-science