VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning
VisScience is a benchmark dataset of 3,000 K-12 multi-modal questions spanning mathematics, physics, and chemistry (elementary through high school), designed to evaluate multi-modal large language models' scientific reasoning capabilities across 21 subjects and 5 difficulty levels. The paper evaluates 25 MLLMs on this benchmark, finding closed-source models generally outperform open-source ones, with best accuracies of 53.4% (math), 38.2% (physics), and 47.0% (chemistry).
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks by integrating textual and visual information to achieve visual understanding in complex scenarios. Despite the availability of several benchmarks aimed at evaluating MLLMs on tasks ranging from visual question answering to complex problem-solving, most focus predominantly on mathematics or general visual understanding tasks. This reveals a critical gap in current benchmarks, which often overlook the inclusion of other key scientific disciplines such as physics and chemistry.