Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors
This paper evaluates the ability of GPT-3.5-Turbo and GPT-4 to assess human tutors' performance in responding to K-12 students' math errors, specifically measuring whether tutors use indirect guidance strategies rather than direct error correction. The study analyzes 50 real tutoring dialogues to determine whether LLMs can provide automated feedback on tutoring quality.
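As a purely illustrative sketch (not the paper's actual pipeline), assessing a tutor's reply with an LLM amounts to pairing a rubric with a tutoring exchange and asking the model to choose a label. The rubric wording, helper function, and example dialogue below are all hypothetical; the study's real prompts and criteria may differ:

```python
# Hypothetical sketch: build a prompt asking an LLM to classify a tutor's
# response to a student's math error. The rubric labels and the example
# dialogue are invented for illustration only.

RUBRIC = (
    "Label the tutor's response to the student's math error as:\n"
    "  (a) indirect guidance -- prompts the student to find the mistake, or\n"
    "  (b) direct correction -- points out or fixes the error outright."
)

def build_assessment_prompt(dialogue: str) -> str:
    """Combine the rubric with a tutoring exchange into a single prompt."""
    return f"{RUBRIC}\n\nDialogue:\n{dialogue}\n\nAnswer with (a) or (b)."

example = (
    "Student: 3/4 + 1/4 = 4/8\n"
    "Tutor: Interesting -- can you walk me through how you added those?"
)
print(build_assessment_prompt(example))
```

The resulting string would then be sent to a chat-completion endpoint; the model's single-letter answer can be compared against a human rater's label to score agreement.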
Research suggests that tutors should adopt a strategic approach when addressing math errors made by low-efficacy students. Rather than drawing direct attention to the error, tutors should guide students to identify and correct their mistakes on their own. While tutor training lessons have introduced this pedagogical skill, human evaluation of tutors applying this strategy is arduous and time-consuming. Large language models (LLMs) show promise in providing real-time assessment to tutors during their