Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories

Benchmark (Not Published) | Relevance: 7/10 | 19 cited | 2025 paper

This paper evaluates GPT-4o's multimodal and multilingual performance on physics concept inventories (standardized assessments of conceptual understanding) across multiple subjects and languages, comparing AI results to undergraduate student performance. The study uses existing concept inventory datasets uploaded as images to test the AI's ability to interpret visual information and answer physics questions in various languages.

We investigate the multilingual and multimodal performance of a large language model-based artificial intelligence (AI) system, GPT-4o, using a diverse set of physics concept inventories spanning multiple languages and subject categories. The inventories, sourced from the PhysPort website, cover classical physics topics such as mechanics, electromagnetism, optics, and thermodynamics, as well as relativity, quantum mechanics, astronomy, mathematics, and laboratory skills. Unlike previous text-onl

Study Type

Benchmark (Not Published)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

large language model evaluation education computer-science physics