TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Relevance: 10/10 · 2025 paper (1 citation)

TutorBench is a purpose-built benchmark of 1,490 expert-curated samples that evaluates LLMs on three core tutoring skills: generating adaptive explanations, providing actionable feedback, and creating effective hints for high-school and AP-level content. Using sample-specific rubrics and an LLM judge, it assesses 16 frontier models and finds that none exceeds 56% overall performance and that all score below a 60% pass rate on criteria related to guiding, diagnosing, and supporting students.

As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level content.
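
As a rough illustration of how rubric-based, LLM-judge scoring of this kind can work, the sketch below grades one tutor response against its sample-specific criteria and returns a per-sample pass rate. The `Criterion` structure, the prompt wording, and the `ask_judge` callable are hypothetical placeholders, not TutorBench's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One sample-specific rubric criterion (hypothetical structure)."""
    description: str


def judge_response(
    tutor_response: str,
    rubric: List[Criterion],
    ask_judge: Callable[[str], str],
) -> float:
    """Score a tutor response against its rubric with an LLM judge.

    `ask_judge` is any callable that sends a prompt to a judge model and
    returns its text reply; the prompt wording here is illustrative only.
    Returns the fraction of criteria the judge marks as satisfied.
    """
    if not rubric:
        return 0.0
    passed = 0
    for criterion in rubric:
        prompt = (
            "You are grading a tutoring response.\n"
            f"Criterion: {criterion.description}\n"
            f"Response: {tutor_response}\n"
            "Answer strictly YES or NO: does the response satisfy the criterion?"
        )
        verdict = ask_judge(prompt).strip().upper()
        if verdict.startswith("YES"):
            passed += 1
    return passed / len(rubric)
```

Per-criterion verdicts like these can then be averaged across samples to produce overall and per-criterion pass rates of the sort reported above.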

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

LLM-as-judge evaluation, computer-science