TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

Benchmark (Published & Automated) · Relevance: 10/10 · 1 cited 2025 paper

TutorBench is a benchmark dataset of 1,490 expert-curated samples evaluating LLMs' tutoring capabilities across three core tasks: generating adaptive explanations, providing actionable feedback, and creating effective hints for high-school and AP-level curricula. The benchmark uses sample-specific rubrics and LLM-judge evaluation to assess 16 frontier models, finding none achieve above 56% overall performance.

As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, be adaptive, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously evaluate the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and AP-level curricula.
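To make the evaluation protocol concrete, below is a minimal sketch of rubric-based LLM-judge scoring in the style the benchmark describes: each sample carries its own rubric, a judge model checks each criterion independently, and the sample score is the fraction of criteria met. The rubric format, judge prompt, model name, and per-criterion YES/NO scheme are assumptions for illustration, not TutorBench's actual implementation.

```python
# Minimal sketch of sample-specific rubric scoring with an LLM judge.
# Assumes an OpenAI-style chat API and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_response(student_prompt: str, tutor_response: str,
                   rubric: list[str], judge_model: str = "gpt-4o") -> float:
    """Score one tutor response against its sample-specific rubric.

    Each criterion is judged independently as satisfied or not;
    the sample score is the fraction of criteria satisfied.
    """
    satisfied = 0
    for criterion in rubric:
        reply = client.chat.completions.create(
            model=judge_model,
            messages=[
                {"role": "system",
                 "content": "You are grading a tutor's reply to a student. "
                            "Answer YES or NO only."},
                {"role": "user",
                 "content": f"Student prompt:\n{student_prompt}\n\n"
                            f"Tutor reply:\n{tutor_response}\n\n"
                            f"Criterion: {criterion}\n"
                            "Is the criterion met?"},
            ],
        )
        if reply.choices[0].message.content.strip().upper().startswith("YES"):
            satisfied += 1
    return satisfied / len(rubric)

# Example: a hint-generation sample with a hypothetical three-item rubric.
score = judge_response(
    "I keep getting x = 3 for 2x + 4 = 12. What am I doing wrong?",
    "Check the step where you moved the 4. What should 12 - 4 give you?",
    rubric=[
        "Identifies the student's arithmetic error without giving the answer",
        "Offers a hint rather than a full solution",
        "Is factually accurate",
    ],
)
print(f"Rubric score: {score:.2f}")
```

Scoring each criterion as a separate binary judgment keeps the judge's task simple and makes partial credit interpretable as a fraction of rubric items met.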

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

LLM as judge evaluation, computer-science