Benchmarking the Pedagogical Knowledge of Large Language Models
This paper introduces The Pedagogy Benchmark, a novel dataset of 920 multiple-choice questions drawn from Chilean teacher training exams and designed to evaluate large language models' cross-domain pedagogical knowledge (CDPK) and special education needs and disability (SEND) knowledge. The authors evaluate 97 models on the benchmark, which covers teaching strategies, assessment methods, and pedagogical concepts; accuracies range from 28% to 89%.
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy, the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Spe