EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
EduBench is a comprehensive benchmark of 18,821 data points covering 9 educational scenarios (e.g., assignment grading, study planning, and psychological counseling) across more than 4,000 educational contexts. It provides 12 multi-dimensional evaluation metrics for assessing LLM performance in diverse educational roles and tasks, supports automated evaluation via both human annotation and LLM-based assessment, and its code and dataset are publicly available.
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data spanning 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics covering 12 critical aspects relevant to both teachers and students.
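As a minimal sketch of how multi-dimensional evaluation results might be aggregated, the snippet below averages per-dimension scores across responses and then computes an overall mean. The dimension names and score scale here are illustrative assumptions, not the benchmark's actual metric list.

```python
from statistics import mean

def aggregate_scores(per_dimension: dict[str, list[float]]) -> dict[str, float]:
    """Average each dimension's scores across responses, then add an
    overall mean across dimensions. Dimension names are hypothetical."""
    dim_means = {dim: mean(scores) for dim, scores in per_dimension.items()}
    dim_means["overall"] = mean(dim_means.values())
    return dim_means

# Illustrative scores on a 1-5 scale for two hypothetical dimensions.
scores = {
    "accuracy": [4.0, 5.0, 3.0],
    "clarity": [4.0, 4.0, 4.0],
}
result = aggregate_scores(scores)
# result["accuracy"] == 4.0, result["clarity"] == 4.0, result["overall"] == 4.0
```

Averaging keeps every dimension visible before collapsing to a single score, which matters when different educational roles weight aspects differently.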