Multimodal Capabilities: Benchmarking AI Systems That See, Hear, and Reason Across K-12 Education
Benchmarks evaluating vision, audio, diagram understanding, and multimodal reasoning for education.
How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this category, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, then used Claude to synthesise findings into a structured analysis. The report below reflects what the research covers — and what it doesn't.
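As a rough illustration of that pipeline, the sketch below filters papers by relevance score, concatenates the extracted sections, and asks Claude for a synthesis. It is a minimal sketch under stated assumptions only: the Paper schema, field names, prompt wording, and model ID are hypothetical placeholders, not the actual tooling behind this report.

```python
from dataclasses import dataclass, field

import anthropic  # official Anthropic Python SDK

# Hypothetical paper record; the schema and field names are assumptions for illustration.
@dataclass
class Paper:
    title: str
    relevance_score: int  # 0-10 relevance rating assigned during screening
    sections: dict[str, str] = field(default_factory=dict)  # e.g. {"abstract": "..."}

KEY_SECTIONS = ("abstract", "introduction", "results", "discussion", "conclusions")

def synthesise(papers: list[Paper], min_score: int = 7) -> str:
    """Keep high-relevance papers, gather their key sections, and request a synthesis."""
    selected = [p for p in papers if p.relevance_score >= min_score]
    excerpts = "\n\n".join(
        f"## {p.title}\n" + "\n\n".join(p.sections.get(s, "") for s in KEY_SECTIONS)
        for p in selected
    )
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "Synthesise these paper extracts into a structured analysis:\n\n" + excerpts,
        }],
    )
    return response.content[0].text
```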
Multimodal AI — systems that process and reason across text, images, diagrams, audio, and video — represents one of the most active frontiers in educational technology research, with 128 papers identified in this category. The field is producing increasingly ambitious benchmarks, from MDK12-Bench's 141,000 questions across six K-12 disciplines to KidsArtBench's expert-annotated evaluation of children's artwork across nine rubric-aligned dimensions. Yet a striking paradox emerges from this body of work: while researchers are investing heavily in measuring whether AI models can solve visual mathematics problems and interpret educational diagrams, almost none are measuring whether students learn better when supported by these multimodal systems. The overwhelming focus remains on predictive accuracy metrics — AUC scores, classification accuracy, next-item correctness — rather than learning science outcomes such as conceptual understanding, long-term retention, or the development of spatial reasoning skills.
The dominant research strand, comprising at least 28 papers, evaluates how well large multimodal models (LMMs) interpret mathematical diagrams, geometric figures, and scientific visualisations alongside textual problem statements. These benchmarks span elementary through secondary mathematics across multiple languages, including English, Chinese, Vietnamese, and French, and reveal that even models regarded as leading at the time of testing, such as GPT-4V and Gemini, struggle with fine-grained visual discrimination, spatial reasoning, and integrating multiple images with text. A secondary cluster of at least 15 papers explores knowledge tracing systems that predict student performance by modelling learning trajectories, with several recent studies incorporating multimodal signals such as code submissions, facial expressions, and speech. Critically, the research is overwhelmingly concentrated in East Asian and Western educational contexts, with very limited coverage of African, Latin American, or South Asian settings; this is a significant gap given the educational priorities of low- and middle-income countries (LMICs).
This report examines what is being measured, what is being missed, and what the education sector should prioritise to ensure that multimodal AI serves genuine learning rather than simply demonstrating technical capability.