Benchmarking Educational Program Repair
This paper presents a benchmark for evaluating educational program repair systems, in which LLMs automatically fix bugs in student code. It introduces a novel rouge@k evaluation metric and establishes baseline performance for five recent models on two curated datasets of introductory programming problems.
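To give a concrete sense of what a rouge@k-style metric computes, the sketch below takes the best ROUGE-L F1 score among k sampled repairs measured against a reference fix. This is only an illustrative assumption about the metric's shape (mirroring how pass@k takes the best outcome over k samples); the precise definition, sampling procedure, and any unbiased estimator are those given in the paper. The function names and token-level ROUGE-L implementation here are our own.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens of a candidate repair vs. a reference fix."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    precision, recall = lcs / len(cand), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_at_k(candidates: List[str], reference: str, k: int) -> float:
    """Illustrative rouge@k: best ROUGE-L F1 among the first k sampled repairs."""
    return max(rouge_l_f1(c, reference) for c in candidates[:k])
```

For example, if one of two sampled repairs exactly matches the reference fix, `rouge_at_k(samples, reference, k=2)` returns 1.0, while partially overlapping repairs score between 0 and 1 in proportion to their token-level similarity.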
The emergence of large language models (LLMs) has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor limiting progress in the field is that much of this research relies on bespoke datasets and differing evaluation metrics, making direct comparisons between results unreliable. Thus,