Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors
This paper evaluates the ability of GPT-3.5 and GPT-4 to assess human tutors' responses to students making math errors, specifically measuring whether tutors appropriately guide students to self-correct rather than directly pointing out mistakes. The study analyzes 50 real-life tutoring dialogues, using the LLMs to automate tutor performance assessment against pedagogical criteria from the 'Reacting to Errors' lesson.
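To make the assessment setup concrete, the following is a minimal sketch of how an LLM could label a single tutor reply against such a rubric. It assumes the OpenAI Python client; the prompt wording, the 'desired'/'undesired' labels, and the assess_tutor_response helper are illustrative assumptions, not the authors' exact prompts or pipeline.

```python
# Sketch of LLM-based assessment of one tutor reply against a
# "Reacting to Errors" rubric. Prompt text, labels, and helper names
# are illustrative assumptions, not the paper's published setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a tutor's reply to a student who just made a math error. "
    "Desired strategy (from the 'Reacting to Errors' lesson): guide the student "
    "to identify and correct the mistake on their own; do not point out the "
    "error directly. Answer with exactly one word: 'desired' or 'undesired'."
)

def assess_tutor_response(dialogue_context: str, tutor_reply: str,
                          model: str = "gpt-4") -> str:
    """Ask the LLM to label one tutor reply against the rubric."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the labeling as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Dialogue so far:\n{dialogue_context}\n\n"
                f"Tutor's reply:\n{tutor_reply}"
            )},
        ],
    )
    return completion.choices[0].message.content.strip().lower()

# Usage on a made-up exchange:
label = assess_tutor_response(
    "Student: 3/4 + 1/4 = 4/8, right?",
    "Tutor: Hmm, walk me through how you added the numerators and denominators.",
)
print(label)  # expected: 'desired'
```

Repeating such a call over each tutor turn in the 50 dialogues would yield per-response labels that can then be compared against human annotations.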
Research suggests that tutors should adopt a strategic approach when addressing math errors made by low-efficacy students. Rather than drawing direct attention to the error, tutors should guide students to identify and correct their mistakes on their own. While tutor lessons have introduced this pedagogical skill, human evaluation of tutors applying this strategy is arduous and time-consuming. Large language models (LLMs) show promise in providing real-time assessment to tutors during their actual tutoring sessions.