Benchmarking Educational Program Repair

Benchmark (Published & Automated) · Relevance: 7/10 · 8 citations · 2024 paper

This paper presents a benchmark for evaluating educational program repair systems that use LLMs to automatically fix bugs in student code, introducing a novel rouge@k evaluation metric and establishing baseline performance across five recent models on two curated datasets of introductory programming problems.
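The rouge@k metric is framed as an analogue of the standard pass@k estimator (Chen et al., 2021), replacing a binary pass/fail test with a text-similarity score. As a minimal sketch — assuming rouge@k takes the expected best ROUGE score over k repairs sampled without replacement from n candidates, which is an assumption about its exact form, not the paper's definition — both estimators can be computed in closed form:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n candidates is correct,
    given that c of the n candidates are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def rouge_at_k(scores: list[float], k: int) -> float:
    """Hypothetical best-of-k analogue for a similarity metric:
    expected maximum ROUGE score over k candidates sampled without
    replacement from the n scored repairs. (An assumed generalization
    of pass@k; the paper defines the exact metric.)"""
    scores = sorted(scores)
    n = len(scores)
    denom = math.comb(n, k)
    total = 0.0
    # P(max of the k-subset equals scores[i]) = C(i, k-1) / C(n, k),
    # since the other k-1 samples must come from the i smaller scores.
    for i, s in enumerate(scores):
        if i >= k - 1:
            total += s * math.comb(i, k - 1) / denom
    return total
```

For example, `pass_at_k(10, 0, 1)` is 0.0 and `rouge_at_k([0.0, 1.0], 2)` is 1.0, since sampling both candidates always includes the better repair.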

The emergence of large language models (LLMs) has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor that limits progress within the field is that much of the research uses bespoke datasets and different evaluation metrics, making direct comparisons between results unreliable.

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors — 1-to-1 conversational tutoring systems.
Teacher Support Tools Tools that assist teachers — lesson planning, content generation, grading, analytics.

Tags

benchmark dataset education learning computer-science