3.1 Research Report

Content Knowledge: Benchmarking LLM Mastery of K-12 Subject Matter

Benchmarks measuring mastery of subject-matter content (STEM, humanities, etc.).

How this was produced: We identified the papers classified under this category that scored as high-relevance (≥7/10), extracted the key sections (abstract, introduction, results, discussion, conclusions) from each, and then used Claude to synthesise the findings into a structured analysis. The report below reflects what the research covers, and what it doesn't.
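To make the selection and synthesis steps concrete, a minimal sketch of this kind of pipeline is shown below. The file layout, field names (`relevance_score`, `category`, `sections`), and prompt wording are hypothetical placeholders rather than the actual tooling used for this report; the final call to Claude is only indicated in a comment.

```python
import json
from pathlib import Path

RELEVANCE_THRESHOLD = 7  # papers scored >= 7/10 are kept
KEY_SECTIONS = ["abstract", "introduction", "results", "discussion", "conclusions"]

def load_papers(corpus_dir: Path) -> list[dict]:
    """Load one JSON record per paper; the record layout here is illustrative."""
    return [json.loads(p.read_text()) for p in corpus_dir.glob("*.json")]

def select_relevant(papers: list[dict], category: str) -> list[dict]:
    """Keep papers classified under the target category with a high relevance score."""
    return [
        p for p in papers
        if p.get("category") == category and p.get("relevance_score", 0) >= RELEVANCE_THRESHOLD
    ]

def extract_key_sections(paper: dict) -> str:
    """Concatenate only the sections used for synthesis, skipping any that are missing."""
    parts = [f"## {name}\n{paper['sections'][name]}"
             for name in KEY_SECTIONS if name in paper.get("sections", {})]
    return f"# {paper.get('title', 'Untitled')}\n" + "\n\n".join(parts)

def build_synthesis_prompt(excerpts: list[str]) -> str:
    """Assemble a single prompt asking the model for a structured analysis."""
    joined = "\n\n---\n\n".join(excerpts)
    return (
        "Synthesise the following paper excerpts into a structured analysis of "
        "content-knowledge benchmarking for K-12 subject matter:\n\n" + joined
    )

if __name__ == "__main__":
    papers = select_relevant(load_papers(Path("papers/")), category="content_knowledge")
    prompt = build_synthesis_prompt([extract_key_sections(p) for p in papers])
    # The prompt would then be sent to Claude (e.g. via the Anthropic Messages API);
    # the exact model and prompt wording used for this report are not specified here.
    print(f"{len(papers)} papers selected; prompt length: {len(prompt)} characters")
```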


Content knowledge — the ability of large language models (LLMs) to demonstrate mastery of K-12 subject matter across science, mathematics, humanities, and languages — is the most extensively benchmarked dimension of AI in education. Our analysis covers 211 papers spanning a remarkable breadth of evaluation approaches, from static standardised test performance to adaptive intelligent tutoring systems that track student knowledge over time. The field has produced an impressive infrastructure of benchmarks, datasets, and evaluation frameworks. Yet a striking paradox emerges: models that excel at solving problems consistently fail at teaching them, and high accuracy on standard benchmarks frequently masks reliance on shallow heuristics rather than genuine understanding.

The numbers tell a compelling story of both capability and limitation. GPT-4o, among the leading commercial models during the period covered by most of these studies, achieves only 31% accuracy on the MM-MATH multimodal mathematics benchmark, against human performance of 82%. On the GSM-PLUS robustness benchmark, models lose up to 20% accuracy when problems are only slightly rephrased. On the TutorBench evaluation of tutoring quality, no frontier LLM exceeds 56% overall, and all score below 60% on the criteria for guiding, diagnosing, and supporting students. These findings carry significant implications for deployment in low- and middle-income countries (LMICs), where the promise of AI-EdTech as a scalable complement to limited teaching capacity depends on the technology doing more than producing correct answers.
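For readers reproducing this kind of robustness check, the sketch below shows one way to compute the accuracy drop between original and rephrased problem variants, in the spirit of GSM-PLUS. The record format and field names are assumptions rather than the benchmark's actual schema, and the sketch reports an absolute (percentage-point) difference, whereas published figures may be stated differently.

```python
from collections import defaultdict

def accuracy(results: list[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(results) / len(results) if results else 0.0

def robustness_drop(records: list[dict]) -> float:
    """Compare accuracy on original problems with accuracy on rephrased variants.

    Each record is assumed to look like
    {"problem_id": ..., "variant": "original" | "rephrased", "correct": bool}.
    Returns the drop (original minus rephrased) in percentage points.
    """
    by_variant: dict[str, list[bool]] = defaultdict(list)
    for r in records:
        by_variant[r["variant"]].append(bool(r["correct"]))
    return 100 * (accuracy(by_variant["original"]) - accuracy(by_variant["rephrased"]))

# Toy example: 9/10 originals correct vs 7/10 rephrased -> a 20-point drop.
records = (
    [{"problem_id": i, "variant": "original", "correct": i < 9} for i in range(10)]
    + [{"problem_id": i, "variant": "rephrased", "correct": i < 7} for i in range(10)]
)
print(f"accuracy drop: {robustness_drop(records):.0f} percentage points")
```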

Methodologically, the papers cluster around three approaches: static benchmark evaluation using existing K-12 test datasets, intelligent tutoring systems that assess content knowledge through dialogue and problem-solving, and adaptive learning platforms that track knowledge states over time. A dominant trend is the creation of large-scale, hierarchical datasets — such as the STEM benchmark (448 skills, over 1 million questions), CMMaTH (23,000 Chinese multimodal maths problems), and MDK12-Bench (141,000 multimodal exam instances) — that systematically organise questions by grade level, difficulty, and cognitive taxonomy. However, the overwhelming focus remains on evaluating what AI systems can do rather than measuring what students actually learn when using them.
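As an illustration of what such hierarchical organisation looks like in practice, the sketch below defines a generic benchmark item carrying grade-level, difficulty, and cognitive-taxonomy fields, and aggregates exact-match accuracy along any one of those axes. The schema and function names are invented for illustration and do not correspond to the actual formats of the STEM benchmark, CMMaTH, or MDK12-Bench.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchmarkItem:
    """One question in a hierarchically organised K-12 benchmark (illustrative schema)."""
    item_id: str
    subject: str          # e.g. "mathematics", "physics"
    grade_level: int      # e.g. 1-12
    difficulty: str       # e.g. "easy" / "medium" / "hard"
    cognitive_level: str  # e.g. a Bloom's-taxonomy label such as "apply" or "analyse"
    question: str
    answer: str

def accuracy_by(items: list[BenchmarkItem], predictions: dict[str, str], key) -> dict:
    """Aggregate exact-match accuracy along one axis of the hierarchy
    (grade level, difficulty, or cognitive level)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for item in items:
        bucket = key(item)
        totals[bucket] += 1
        if predictions.get(item.item_id, "").strip() == item.answer.strip():
            correct[bucket] += 1
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}

# Usage: report accuracy per grade level, then per cognitive level.
# items, predictions = load_benchmark(...), run_model(...)   # not shown
# print(accuracy_by(items, predictions, key=lambda i: i.grade_level))
# print(accuracy_by(items, predictions, key=lambda i: i.cognitive_level))
```

Reporting accuracy per axis in this way, rather than as a single headline number, is what allows the hierarchical benchmarks above to separate grade-level mastery from difficulty and cognitive demand.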