Improving the Validity of Automatically Generated Feedback via Reinforcement Learning
This paper proposes a reinforcement learning framework based on direct preference optimization (DPO) to generate and evaluate automated feedback for incorrect student answers in math education. GPT-4 annotations are used to train smaller models, such as Llama 2, to produce pedagogically valid feedback that explains misconceptions and encourages students.
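To make the preference-optimization step concrete, the standard DPO objective scores a preferred response against a dispreferred one relative to a frozen reference model. The sketch below is illustrative only: the function name and the scalar log-probability interface are assumptions, not the paper's implementation.

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.

    pi_logp_w / pi_logp_l: policy log-probabilities of the preferred (w)
    and dispreferred (l) feedback under the model being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta controls deviation from the reference.
    """
    # Implicit reward margin between the preferred and dispreferred feedback
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # prefers the chosen feedback more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; increasing the policy's preference for the chosen feedback drives the loss toward zero.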
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback also has to be pedagogically valid to reflect effective tutoring strategies.