TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
TutorBench is a benchmark dataset of 1,490 expert-curated samples evaluating LLMs' tutoring capabilities across three core tasks: generating adaptive explanations, providing actionable feedback, and creating effective hints for high-school and AP-level curricula. The benchmark uses sample-specific rubrics and LLM-judge evaluation to assess 16 frontier models, finding none achieve above 56% overall performance.
As students increasingly adopt large language models (LLMs) as learning aids, it is crucial to build models that are adept at handling the nuances of tutoring: they need to identify the core needs of students, adapt to each learner, provide personalized guidance, and be accurate. To this end, we introduce TutorBench, a dataset and evaluation benchmark designed to rigorously assess the core tutoring skills of LLMs. The dataset comprises 1,490 samples curated by human experts, focused on high-school and A