EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
EduBench introduces a comprehensive benchmark dataset with 18,821 data points spanning 9 educational scenarios and over 4,000 educational contexts (covering K-12 and higher-education subjects, difficulty levels, and question types), evaluated with 12 multi-dimensional metrics covering scenario adaptation, factual/reasoning accuracy, and pedagogical application. The benchmark assesses LLM capabilities on diverse educational tasks, including assignment grading, study planning, tutoring, and psychological counseling, using both human and automated evaluation.
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data covering 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics covering 12 critical aspects relevant to both teachers and students.
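To make the multi-dimensional evaluation concrete, here is a minimal sketch of how per-dimension scores could be aggregated across evaluated responses. The record format, dimension names, and scores below are illustrative assumptions for this sketch, not EduBench's actual schema.

```python
# Hypothetical aggregation of multi-dimensional evaluation scores.
# Record fields ("scenario", "dimension", "score") are assumptions,
# not the benchmark's real data format.
from collections import defaultdict

def aggregate_scores(records):
    """Return the mean score for each evaluation dimension."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        totals[r["dimension"]] += r["score"]
        counts[r["dimension"]] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}

records = [
    {"scenario": "tutoring", "dimension": "factual_accuracy", "score": 4.0},
    {"scenario": "grading", "dimension": "factual_accuracy", "score": 5.0},
    {"scenario": "tutoring", "dimension": "pedagogical_application", "score": 3.0},
]
print(aggregate_scores(records))
# {'factual_accuracy': 4.5, 'pedagogical_application': 3.0}
```

In practice, such per-dimension means would be reported separately for each scenario so that strengths and weaknesses (e.g., strong factual accuracy but weak pedagogical application) remain visible rather than being averaged away.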