Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs' Ability To Mark Short Answer Questions in K-12 Education
This paper evaluates how well Large Language Models (specifically GPT-4) can automatically grade short answer questions across different K-12 subjects (Science and History) and grade levels (ages 5-16), using a novel dataset from Carousel Learning. The best configuration performs close to human-level marking (Cohen's kappa of 0.70, versus 0.75 for human raters). The study tests various prompt engineering strategies to assess LLM capabilities for formative assessment tasks.
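For context (our note, assuming the reported statistic is the standard Cohen's kappa): kappa measures agreement between two raters after correcting for agreement expected by chance,

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) is the proportion expected by chance. A kappa of 0.70 for the LLM against 0.75 for human-human agreement thus indicates near-human consistency on this marking task.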
This paper reports on a series of experiments with a novel dataset evaluating how well Large Language Models (LLMs) can mark (i.e., grade) open-text responses to short answer questions. Specifically, we explore how well different combinations of GPT version and prompt engineering strategy performed at marking real student answers to short answer questions across different domain areas (Science and History) and grade levels (spanning ages 5-16), using a new, never-before-used dataset from Carousel Learning.