Are Large Language Models Good Essay Graders?
This paper evaluates Large Language Models (ChatGPT and Llama) for automated essay scoring by comparing their grades to human ratings on the ASAP dataset, finding that the LLMs assign lower scores that correlate poorly with human evaluations, although they reliably detect spelling and grammar mistakes.
We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama on the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning, as well as different prompting approaches. We compare the numeric grades provided by the LLMs to scores assigned by human raters, using the ASAP dataset, a well-known benchmark for the AES task.
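As a rough illustration of the comparison described above, the sketch below prompts an LLM for a numeric grade and measures agreement with human scores using Pearson correlation and quadratic weighted kappa, two metrics commonly reported on ASAP. The prompt wording, the score range, and the `query_llm` helper are illustrative placeholders, not the paper's exact setup.

```python
# Sketch: zero-shot essay scoring with an LLM and agreement with human raters.
# The prompt text, score range, and query_llm() stub are illustrative assumptions,
# not the exact protocol used in the paper.
import re

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

PROMPT = (
    "You are an essay grader. Read the essay below and return only an integer "
    "score from {lo} to {hi}.\n\nEssay:\n{essay}\n\nScore:"
)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT or Llama (via an API or a local model)."""
    raise NotImplementedError


def llm_score(essay: str, lo: int = 1, hi: int = 6) -> int:
    """Ask the LLM for a numeric grade and parse the first integer in its reply."""
    reply = query_llm(PROMPT.format(lo=lo, hi=hi, essay=essay))
    match = re.search(r"\d+", reply)
    score = int(match.group()) if match else lo
    return max(lo, min(hi, score))  # clamp to the valid score range


def agreement(llm_scores: list[int], human_scores: list[int]) -> dict:
    """Agreement between LLM-assigned and human-assigned scores."""
    r, _ = pearsonr(llm_scores, human_scores)
    qwk = cohen_kappa_score(llm_scores, human_scores, weights="quadratic")
    return {"pearson_r": r, "qwk": qwk}
```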