MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams

Relevance: 10/10 · 3 citations · 2025 paper

MDK12-Bench is a large-scale benchmark built from 141K real-world K-12 exam questions across six disciplines (Math, Physics, Chemistry, Biology, Geography, and Information Science), organized around 6,225 structured knowledge points. It evaluates multimodal large language models on problem-solving across difficulty levels, temporal shifts, contextual shifts, and knowledge-driven reasoning. The benchmark also includes a dynamic evaluation framework and knowledge-point reference-augmented generation (KP-RAG) to assess model generalization and the role of knowledge in problem-solving.
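To make the KP-RAG idea concrete, the sketch below shows one plausible shape of such a pipeline: retrieve the knowledge points most relevant to a question and prepend them to the prompt as references. The function names and the naive word-overlap retrieval are illustrative assumptions, not the paper's actual implementation.

```python
import re

def retrieve_knowledge_points(question, knowledge_points, k=2):
    """Rank knowledge points by naive word overlap with the question.
    (Illustrative stand-in for whatever retriever KP-RAG actually uses.)"""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    scored = sorted(
        knowledge_points,
        key=lambda kp: len(q_words & set(re.findall(r"[a-z0-9]+", kp.lower()))),
        reverse=True,
    )
    return scored[:k]

def build_kp_rag_prompt(question, knowledge_points, k=2):
    """Prepend retrieved knowledge points to the question as references."""
    refs = retrieve_knowledge_points(question, knowledge_points, k)
    ref_block = "\n".join(f"- {kp}" for kp in refs)
    return f"Relevant knowledge points:\n{ref_block}\n\nQuestion: {question}"

# Hypothetical knowledge points and question for demonstration.
kps = [
    "Newton's second law relates force, mass, and acceleration (F = ma).",
    "Photosynthesis converts light energy into chemical energy.",
    "The Pythagorean theorem relates the sides of a right triangle.",
]
prompt = build_kp_rag_prompt(
    "A 2 kg mass is pushed with a 6 N force; find its acceleration.", kps
)
print(prompt)
```

A production retriever would use embedding similarity over the 6,225 structured knowledge points rather than word overlap, but the prompt-augmentation step would look much the same.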

Multimodal large language models (MLLMs), which integrate language and visual cues for problem-solving, are crucial for advancing artificial general intelligence (AGI). However, current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge, offering only static and undifferentiated evaluations. To bridge this gap, we introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.

Tags: K-12, AI benchmark, computer-science