Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy
This paper introduces AMMORE, a new dataset of 53,000 middle-school math open-response question-answer pairs from Rori, a WhatsApp-based tutoring platform used by students in several African countries, and conducts two experiments to evaluate large language models (LLMs) for grading particularly challenging student answers. LLM-based grading, including chain-of-thought prompting, improves overall accuracy from 98.7% to 99.9% and significantly reduces misclassification of student mastery status in a Bayesian Knowledge Tracing model. The AMMORE dataset enables a range of analyses and provides an important resource for researching student math acquisition in understudied, real-world educational contexts. In experiment