Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
This paper systematically benchmarks ChatGPT (GPT-3.5) and GPT-4 against human tutors across six programming education scenarios (program repair, hint generation, grading feedback, pair programming, contextualized explanation, and task synthesis) using introductory Python problems and real-world buggy student code. Expert annotations are used to assess the quality of each method's outputs; together, the scenarios cover the roles of generative AI as a digital tutor for students, a digital assistant for educators, and a digital peer for collaborative learning.
Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI’s ChatGPT [8] and GPT-4 [9] could enhance programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, and a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark state-of-the-art