Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy
This paper introduces the AMMORE dataset of 53,000 middle-school math open-response answers from students in Africa using the Rori WhatsApp AI tutor, and evaluates LLM-based grading approaches, including chain-of-thought prompting, that improve automated scoring accuracy from 98.7% to 99.9%, demonstrating consequential validity through their impact on Bayesian Knowledge Tracing estimates of student mastery.
This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries, and conducts two experiments evaluating the use of large language models (LLMs) to grade particularly challenging student answers. The AMMORE dataset supports a range of analyses and provides an important resource for studying students' mathematics learning in understudied, real-world educational contexts. In experiment