Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
This paper presents the BEA 2025 Shared Task, which evaluates the pedagogical abilities of LLM-powered AI tutors in educational dialogues, specifically assessing mistake identification, guidance provision, and feedback actionability in math tutoring contexts. The benchmark comprises five evaluation tracks with publicly released datasets and annotation guidelines, with macro F1 scores ranging from 58.34 to 71.81 reported across the pedagogical dimensions.
This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on the quality of tutor responses that address student mistakes within educational dialogues. The task consisted of five tracks designed to automatically evaluate AI tutor performance across key dimensions: mistake identification, precise mistake location, guidance provision, and feedback actionability, grounded in learning science principles.