Measuring Vision-Language STEM Skills of Neural Models

Relevance: 8/10 · 13 citations · 2024 paper

This paper introduces STEM, a large-scale multimodal vision-language benchmark with 1,073,146 questions covering 448 skills across science, technology, engineering, and math, designed around the K-12 curriculum (Pre-K through grade 8). The benchmark evaluates state-of-the-art models (GPT-3.5-Turbo, CLIP) on fundamental STEM skills and finds that they perform well below elementary-student levels (averaging 54.7%).

We introduce a new challenge to test the STEM skills of neural models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, ours requires understanding multimodal vision-language information across STEM. It is one of the largest and most comprehensive datasets for this challenge, comprising 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to exi…
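A zero-shot CLIP-style evaluation on such a benchmark typically scores each answer choice by its embedding similarity to the question image and picks the best match. The sketch below illustrates that selection step with toy vectors; the embeddings and function names are illustrative assumptions, not taken from the paper's code or data.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_answer(image_emb, choice_embs):
    # Return the index of the answer choice whose text embedding
    # is most similar to the image embedding (CLIP-style scoring).
    scores = [cosine(image_emb, c) for c in choice_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical; a real run would use CLIP's image
# and text encoders on the question image and each answer string).
image_emb = [0.9, 0.1, 0.0]
choices = {
    "triangle": [0.8, 0.2, 0.1],
    "circle":   [0.1, 0.9, 0.3],
}
names = list(choices)
best = names[pick_answer(image_emb, list(choices.values()))]
```

With these toy vectors, "triangle" aligns most closely with the image embedding and is selected; on the real benchmark this per-question choice would be compared against the labeled answer to compute accuracy.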

Tool Types

Tags

elementary math, benchmark, computer-science