Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Benchmark (Published & Automated) · Relevance: 8/10 · Cited by 6 · 2025 paper

This paper introduces PBLBench, a benchmark that evaluates how well multimodal large language models (MLLMs) can assess STEM project-based learning outcomes. It is built on a new dataset (PBL-STEM) of over 500 projects, with expert-validated evaluation criteria whose weights are derived through the Analytic Hierarchy Process (AHP). The benchmark tests 15 leading MLLMs on long-context, cross-modal STEM project evaluation aimed at assisting teachers with grading.
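The Analytic Hierarchy Process mentioned above turns expert pairwise judgments ("criterion A matters this much more than criterion B") into numeric criterion weights via the principal eigenvector of a comparison matrix. A minimal sketch follows; the three criteria and the comparison values are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

# Hypothetical pairwise-comparison matrix over three evaluation criteria
# (e.g. technical rigor, design quality, presentation). Entry A[i, j] says
# how much more important criterion i is than criterion j on Saaty's 1-9
# scale; the matrix is reciprocal: A[j, i] = 1 / A[i, j].
A = np.array([
    [1.0, 3.0,   5.0],
    [1/3, 1.0,   3.0],
    [1/5, 1/3,   1.0],
])

# The principal eigenvector of A, normalized to sum to 1, gives the weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                 # index of the largest eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Consistency check: CI = (lambda_max - n) / (n - 1), CR = CI / RI,
# where RI is Saaty's random index (0.58 for n = 3).
n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
cr = ci / 0.58

print("weights:", weights.round(3))
print("consistency ratio:", round(cr, 3))
```

A consistency ratio below 0.1 is the conventional threshold for accepting the expert judgments; above it, the pairwise comparisons are usually revised.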

Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human exp…

Study Type

Benchmark (Published & Automated)

Tool Types

Teacher Support Tools: tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

teacher knowledge evaluation · AI · computer-science