2.1 Research Report

Pedagogical Knowledge: How Well Do LLMs Understand Teaching?

Benchmarks measuring knowledge about teaching — instructional strategies, learning theories, curriculum design.

How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this category, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, then used Claude to synthesise findings into a structured analysis. The report below reflects what the research covers — and what it doesn't.


Research into the pedagogical knowledge of large language models (LLMs) represents one of the most consequential — and contested — areas of AI-in-education benchmarking. Across the 43 papers reviewed in this category, a fundamental tension emerges: while LLMs demonstrate increasing competence in generating pedagogically informed content such as lesson plans, feedback, and assessments, they frequently lack the deep understanding of learning-science principles required for nuanced instructional decision-making. Leading benchmarks such as The Pedagogy Benchmark (920 questions drawn from Chilean teacher certification exams) and EduGuardBench show models achieving 28–89% accuracy on pedagogical knowledge tests, with reasoning-enabled models such as GPT-4 and Gemini 2.5 Pro substantially outperforming smaller alternatives. However, these quantitative scores frequently mask qualitative deficiencies: teachers consistently report that AI-generated materials require significant human refinement.

Perhaps the most striking finding across this body of work concerns cognitive offloading. A field experiment with approximately 1,000 high school students found that those using GPT-4 for practice solved 48% more problems correctly during practice but scored 17% lower on subsequent unassisted tests — suggesting that AI-supported practice may actually impede the development of independent problem-solving skills. This single result encapsulates the central challenge: pedagogical AI that improves surface-level performance while undermining genuine learning is not merely unhelpful — it is actively harmful. The field is now moving beyond simple content generation towards complex multi-agent systems and frameworks grounded in learning science, but critical gaps remain in longitudinal evidence, cross-cultural validation, and deployment in authentic classroom settings — particularly in low- and middle-income countries (LMICs).