Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
This paper systematically evaluates ChatGPT (based on GPT-3.5) and GPT-4 against human tutors across six programming education scenarios: program repair, hint generation, grading feedback, pair programming, contextualized explanation, and task synthesis. The evaluation uses introductory Python problems and real-world buggy student programs from an online platform, with expert annotators assessing the pedagogical quality of each model's output in every scenario.
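To make the program repair scenario concrete, below is a minimal sketch of the kind of input and output involved; the factorial problem, the off-by-one bug, and the function names are illustrative assumptions, not examples drawn from the paper's dataset.

# Hypothetical buggy student submission: compute n! (factorial).
# The loop bound excludes n itself, a common off-by-one error.
def factorial_buggy(n):
    result = 1
    for i in range(1, n):  # bug: stops at n - 1
        result *= i
    return result

# A minimal repair of the kind a tutor or model is asked to produce,
# changing as little of the student's code as possible.
def factorial_fixed(n):
    result = 1
    for i in range(1, n + 1):  # fix: include n in the product
        result *= i
    return result

assert factorial_buggy(4) == 6   # wrong: returns 3! instead of 4!
assert factorial_fixed(4) == 24  # repaired program is correct

Minimality matters in this setting: a pedagogically useful repair preserves the student's own approach rather than replacing the program wholesale.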
Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI’s ChatGPT [8] and GPT-4 [9] could support programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, and a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark these state-of-the-art models against human tutors across a broad set of programming education scenarios.