Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Benchmark (Published & Automated) | Relevance: 9/10 | Cited by 39 | 2024 paper

This paper presents MRBench, a benchmark for evaluating the pedagogical abilities of LLM-powered AI tutors in mathematical dialogues. It comprises 192 conversations whose tutor responses carry gold human annotations along eight pedagogical dimensions (mistake identification, providing guidance, revealing the answer, etc.). The taxonomy and dataset enable systematic assessment of whether AI tutors follow sound pedagogical practices when remediating student mistakes.

In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes in the mathematical domain.
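To make the benchmark structure concrete, below is a minimal sketch of how an MRBench-style record and per-dimension tutor scoring could be represented. This is not the benchmark's actual API: the class names, field names, annotation scale, and the exact dimension labels beyond those named above (mistake identification, providing guidance, revealing the answer) are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Eight-dimension taxonomy; labels beyond the three named in the summary
# are paraphrased assumptions, not the paper's exact wording.
DIMENSIONS = [
    "mistake_identification", "mistake_location", "revealing_of_the_answer",
    "providing_guidance", "actionability", "coherence", "tutor_tone", "humanlikeness",
]

@dataclass
class TutorResponse:
    tutor: str              # e.g. an LLM name or "human"
    text: str               # the tutor's reply to the student's mistake
    labels: dict[str, str]  # gold human annotation per dimension; a
                            # "Yes"/"To some extent"/"No" scale is assumed here

@dataclass
class Conversation:
    dialogue: list[str]     # preceding student-tutor turns, ending in a student mistake
    responses: list[TutorResponse]

def per_dimension_scores(conversations: list[Conversation]) -> dict[str, dict[str, float]]:
    """Fraction of each tutor's responses judged desirable ("Yes") on each dimension."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # tutor -> dim -> [yes, total]
    for conv in conversations:
        for resp in conv.responses:
            for dim in DIMENSIONS:
                yes, total = counts[resp.tutor][dim]
                counts[resp.tutor][dim] = [yes + (resp.labels.get(dim) == "Yes"), total + 1]
    return {
        tutor: {dim: yes / total for dim, (yes, total) in dims.items() if total}
        for tutor, dims in counts.items()
    }
```

A tabulation like this, one score per tutor per dimension, is the natural way to compare LLM tutors against human tutors on such a benchmark, since a tutor can excel on one dimension (e.g. coherence) while failing another (e.g. prematurely revealing the answer).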

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

tutoring dialogue evaluation, computer-science