EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
EduBench is a comprehensive benchmark of 18,821 data points covering 9 educational scenarios (e.g., assignment grading, study planning, and psychological counseling) across more than 4,000 educational contexts. It provides 12 multi-dimensional evaluation metrics for assessing LLM performance in diverse educational roles and tasks, supports automated evaluation via both human annotation and LLM-based assessment, and its code and dataset are publicly available.
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data spanning 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics covering 12 critical aspects relevant to both teachers and students.
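As a minimal sketch of how multi-dimensional evaluation results might be aggregated, the snippet below averages per-dimension scores across responses and then computes an overall mean. The dimension names and score scale here are illustrative assumptions, not the benchmark's actual metric list.

```python
from statistics import mean

def aggregate_scores(per_dimension: dict[str, list[float]]) -> dict[str, float]:
    """Average each dimension's scores across responses, then add an
    overall mean across dimensions. Dimension names are hypothetical."""
    dim_means = {dim: mean(scores) for dim, scores in per_dimension.items()}
    dim_means["overall"] = mean(dim_means.values())
    return dim_means

# Illustrative scores on a 1-5 scale for two hypothetical dimensions.
scores = {
    "accuracy": [4.0, 5.0, 3.0],
    "clarity": [4.0, 4.0, 4.0],
}
result = aggregate_scores(scores)
# result["accuracy"] == 4.0, result["clarity"] == 4.0, result["overall"] == 4.0
```

Averaging keeps every dimension visible before collapsing to a single score, which matters when different educational roles weight aspects differently.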