Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Benchmark (Published & Automated) · Relevance: 8/10 · Cited by 6 · 2025 paper

This paper introduces PBLBench, a benchmark that evaluates how well multimodal large language models (MLLMs) can assess STEM project-based learning outcomes. It is built on a new dataset (PBL-STEM) of over 500 projects, with expert-validated evaluation criteria whose weights are derived through the Analytic Hierarchy Process (AHP). The benchmark tests 15 leading MLLMs on long-context, cross-modal STEM project evaluation aimed at assisting teachers with grading.
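The Analytic Hierarchy Process mentioned above turns expert pairwise judgments ("criterion A matters this much more than criterion B") into numeric criterion weights via the principal eigenvector of a comparison matrix. A minimal sketch follows; the three criteria and the comparison values are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

# Hypothetical pairwise-comparison matrix over three evaluation criteria
# (e.g. technical rigor, design quality, presentation). Entry A[i, j] says
# how much more important criterion i is than criterion j on Saaty's 1-9
# scale; the matrix is reciprocal: A[j, i] = 1 / A[i, j].
A = np.array([
    [1.0, 3.0,   5.0],
    [1/3, 1.0,   3.0],
    [1/5, 1/3,   1.0],
])

# The principal eigenvector of A, normalized to sum to 1, gives the weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                 # index of the largest eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency check: CI = (lambda_max - n) / (n - 1), CR = CI / RI,
# where RI is Saaty's random index (0.58 for n = 3).
n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
cr = ci / 0.58

print("weights:", weights.round(3))
print("consistency ratio:", round(cr, 3))
```

A consistency ratio below 0.1 is the conventional threshold for accepting the expert judgments; above it, the pairwise comparisons are usually revised.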

Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human exp…

Study Type

Benchmark (Published & Automated)

Tool Types

Teacher Support Tools: tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

teacher knowledge evaluation · AI · computer-science