MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams
MDK12-Bench is a large-scale multimodal benchmark comprising 141K instances drawn from real-world K-12 exams across six disciplines (math, physics, chemistry, biology, geography, and information science), with 6,225 knowledge points organized into a hierarchical taxonomy. It evaluates MLLMs across difficulty levels, temporal shifts, and contextual shifts, and probes knowledge-driven reasoning through a dynamic evaluation framework and knowledge-point reference-augmented generation (KP-RAG).
Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines (math, physics, chemistry, biology, geography, and information science).
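To make the KP-RAG idea concrete, the following is a minimal Python sketch: retrieve the knowledge-point references most relevant to an exam question and prepend them to the prompt before querying an MLLM. The keyword-overlap retriever, the knowledge_base mapping, and all function names here are illustrative assumptions, not the benchmark's actual implementation.

    # A minimal sketch of knowledge-point reference-augmented generation (KP-RAG).
    # Assumes a toy keyword-overlap retriever; names are illustrative only.

    from typing import Dict, List

    # Hypothetical knowledge base: knowledge-point name -> reference text.
    knowledge_base: Dict[str, str] = {
        "newton second law": "F = ma relates net force, mass, and acceleration.",
        "photosynthesis": "Plants convert light energy into chemical energy.",
    }

    def retrieve_knowledge_points(question: str, k: int = 3) -> List[str]:
        """Rank knowledge points by keyword overlap with the question."""
        q_tokens = set(question.lower().split())
        scored = sorted(
            knowledge_base.items(),
            key=lambda kv: len(q_tokens & set(kv[0].split())),
            reverse=True,
        )
        return [f"{name}: {ref}" for name, ref in scored[:k]]

    def kp_rag_prompt(question: str) -> str:
        """Prepend retrieved knowledge-point references to the exam question."""
        references = "\n".join(retrieve_knowledge_points(question))
        return (
            "Relevant knowledge points:\n"
            f"{references}\n\n"
            f"Question: {question}\nAnswer step by step."
        )

    # Example: the augmented prompt would then be sent to the MLLM under test.
    print(kp_rag_prompt("A 2 kg block accelerates at 3 m/s^2; what net force acts on it?"))

In an actual evaluation, the retriever would draw on the benchmark's 6,225-point taxonomy and the augmented prompt would be paired with the question's image before being passed to the model.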