Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

Benchmark (Published & Automated) · Relevance: 9/10 · 2 citations · 2024 paper

This paper introduces AMMORE, a dataset of 53,000 middle-school math open-response question-answer pairs from an African WhatsApp-based tutoring platform, and evaluates LLM-based approaches (including chain-of-thought prompting) for automated grading of challenging student answers. The study demonstrates that LLM grading improves overall accuracy from 98.7% to 99.9% and significantly reduces misclassification of student mastery status in a Bayesian Knowledge Tracing model.
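The chain-of-thought approach described above can be illustrated with a minimal prompt-construction sketch. This is a hypothetical illustration of the technique, not the prompt actually used in the paper; the function name, wording, and verdict format are all assumptions.

```python
def build_cot_grading_prompt(question: str, student_answer: str,
                             reference_answer: str) -> str:
    """Assemble a chain-of-thought grading prompt for an LLM.

    Hypothetical sketch: asks the model to reason step by step about
    mathematical equivalence before committing to a verdict, which is
    the general idea behind chain-of-thought grading of open responses.
    """
    return (
        "You are grading a middle-school math open response.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        "First, reason step by step about whether the student answer is "
        "mathematically equivalent to the reference (e.g. '1/2' vs '0.5' "
        "or an unsimplified but correct fraction).\n"
        "Then output exactly one final line: "
        "VERDICT: correct or VERDICT: incorrect."
    )

prompt = build_cot_grading_prompt(
    question="What is 3/6 in simplest form?",
    student_answer="0.5",
    reference_answer="1/2",
)
```

The "reason first, verdict last" structure makes the final label easy to parse while still eliciting the intermediate reasoning that helps on edge cases like equivalent-but-differently-formatted answers.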

This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries, and conducts two experiments to evaluate the use of large language models (LLMs) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world educational contexts.
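Since the summary reports that grading errors propagate into mastery misclassification in a Bayesian Knowledge Tracing model, a minimal BKT update step helps show why. The parameter values below (slip, guess, transition) are illustrative defaults, not the ones fitted in the paper.

```python
def bkt_update(p_mastery: float, correct: bool,
               slip: float = 0.1, guess: float = 0.2,
               transit: float = 0.15) -> float:
    """One Bayesian Knowledge Tracing step.

    Computes the posterior probability of mastery given an observed
    (graded) answer, then applies the learning-transition probability.
    Standard BKT equations; parameter values here are assumptions.
    """
    if correct:
        # P(mastered and no slip) vs P(not mastered and guessed right)
        num = p_mastery * (1 - slip)
        denom = num + (1 - p_mastery) * guess
    else:
        # P(mastered but slipped) vs P(not mastered and answered wrong)
        num = p_mastery * slip
        denom = num + (1 - p_mastery) * (1 - guess)
    posterior = num / denom
    # Chance of having learned the skill by the next opportunity.
    return posterior + (1 - posterior) * transit

# A correct answer misgraded as incorrect drags the mastery estimate
# down, which is why small gains in grading accuracy can meaningfully
# change mastery classification.
p = 0.5
p_if_graded_correct = bkt_update(p, correct=True)
p_if_graded_wrong = bkt_update(p, correct=False)
```

Running both branches from the same prior makes the stakes concrete: a single flipped grade moves the estimate in opposite directions, so systematic grading errors can push a student across the mastery threshold.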

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.
Teacher Support Tools: tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

LLM evaluation · K-12 education · computer-science