Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning
This paper introduces PBLBench, a benchmark for evaluating how well multimodal large language models (MLLMs) can assess STEM Project-Based Learning (PBL) outcomes, spanning research reports, code, experimental data, and videos, against expert-validated criteria derived through the Analytic Hierarchy Process (AHP). The benchmark evaluates 15 MLLMs on grading and ranking PBL projects across multiple STEM disciplines using long-context multimodal inputs.
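As context for how AHP-derived criteria weights are typically computed, the sketch below shows the standard procedure: experts fill a reciprocal pairwise-comparison matrix, the normalized principal eigenvector gives the criterion weights, and a consistency ratio checks whether the judgments are acceptably coherent. The criterion names and the numeric judgments here are hypothetical placeholders for illustration, not values taken from PBLBench.

```python
# Minimal AHP weight-derivation sketch (standard Saaty procedure), assuming a
# hypothetical 4-criterion rubric: report, code, experimental data, video.
import numpy as np

def ahp_weights(pairwise: np.ndarray) -> tuple[np.ndarray, float]:
    """Return criterion weights and the consistency ratio (CR)
    for a reciprocal pairwise-comparison matrix."""
    n = pairwise.shape[0]
    eigvals, eigvecs = np.linalg.eig(pairwise)
    k = np.argmax(eigvals.real)              # principal eigenvalue index
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                          # normalized weight vector
    lam_max = eigvals[k].real
    ci = (lam_max - n) / (n - 1)             # consistency index
    # Saaty's random index values for matrix sizes 1..9
    ri = [0.0, 0.0, 0.58, 0.90, 1.12, 1.24, 1.32, 1.41, 1.45][n - 1]
    cr = ci / ri if ri > 0 else 0.0          # consistency ratio
    return w, cr

# Hypothetical expert judgments (row criterion vs. column criterion)
A = np.array([
    [1,   3,   2,   4],
    [1/3, 1,   1/2, 2],
    [1/2, 2,   1,   3],
    [1/4, 1/2, 1/3, 1],
])
weights, cr = ahp_weights(A)
print("weights:", np.round(weights, 3), "CR:", round(cr, 3))  # CR < 0.1 is conventionally acceptable
```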
Project-Based Learning (PBL) involves a wide variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short of providing both a free-form output structure and a rigorous human exp