Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment
This study evaluates the physics problem-solving capabilities of GPT-4o and o1-preview on German Physics Olympiad problems, comparing their performance with that of top-performing high school students and assessing solution quality beyond final-answer correctness. The research examines implications for both summative and formative assessment in physics education, finding that LLMs can outperform human participants on advanced physics problems.
Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible use of LLMs in physics instruction and assessment.