Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Relevance: 10/10 (39 citations, 2024 paper)

This paper proposes a unified evaluation taxonomy with eight pedagogical dimensions for assessing how well LLM-powered AI tutors remediate student mistakes in mathematics, and releases MRBench, a benchmark containing 192 conversations and 1,596 tutor responses from seven tutors, each human-annotated across all eight dimensions. The taxonomy covers pedagogical qualities such as mistake identification, guidance provision, actionability, and tutor tone, directly measuring whether AI tutors exhibit effective pedagogical behavior rather than simply revealing answers.

In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous evaluation efforts have relied on subjective protocols and limited benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, designed to assess the pedagogical abilities of AI tutors in remediating student mistakes.
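
Since the benchmark is released as annotated data rather than code, the following is a minimal Python sketch of how an MRBench-style record and a per-tutor, per-dimension score aggregation might be represented. All names here (TutorResponse, dimension_scores, the dimension labels, the "Yes" positive label) are hypothetical, and only the four dimensions named in the summary above are spelled out; the paper defines the full set of eight.

```python
from dataclasses import dataclass, field
from collections import defaultdict

# Four of the eight taxonomy dimensions are named in the summary above;
# the remaining four are defined in the paper. These labels are
# illustrative, not the paper's exact schema.
DIMENSIONS = [
    "mistake_identification",
    "providing_guidance",
    "actionability",
    "tutor_tone",
    # ... plus the paper's remaining four dimensions
]

@dataclass
class TutorResponse:
    """One tutor response from a conversation, with per-dimension labels."""
    conversation_id: str
    tutor: str                                   # one of the seven tutors
    text: str
    annotations: dict[str, str] = field(default_factory=dict)

def dimension_scores(responses, positive_label="Yes"):
    """Per-tutor fraction of responses judged positive on each dimension."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in responses:
        for dim, label in r.annotations.items():
            cell = tallies[r.tutor][dim]
            cell[0] += label == positive_label   # positive count
            cell[1] += 1                         # total annotated
    return {
        tutor: {dim: pos / total for dim, (pos, total) in dims.items()}
        for tutor, dims in tallies.items()
    }
```

Given a list of such records, dimension_scores would return, for each of the seven tutors, the fraction of its responses annotated positively on each dimension, which is one plausible way to tabulate benchmark results like these.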

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

tutoring, dialogue evaluation, computer-science