EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
EXAMS-V is a multilingual multimodal benchmark dataset containing 20,932 multiple-choice exam questions across 20 school disciplines in 11 languages, designed to evaluate vision-language models on their ability to reason over integrated text, images, tables, diagrams, and scientific symbols from real school exams.
We introduce EXAMS-V, a new, challenging multi-discipline, multimodal, multilingual exam benchmark for evaluating vision language models. It consists of 20,932 multiple-choice questions across 20 school disciplines covering natural science, social science, and other miscellaneous studies such as religion, fine arts, and business. EXAMS-V includes a variety of multimodal features such as text, images, tables, figures, diagrams, maps, scientific symbols, and equations. The questions come in 11 langua