Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

Benchmark (Not Published) · Relevance: 7/10 · 122 citations · 2023 paper

This paper systematically evaluates ChatGPT (GPT-3.5) and GPT-4 against human tutors across six programming education scenarios (program repair, hint generation, grading feedback, pair programming, contextualized explanation, and task synthesis) using introductory Python problems and real-world buggy student code from an online platform. The evaluation uses expert-based annotations to assess pedagogical quality across different tutoring functions.

Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI's ChatGPT [8] and GPT-4 [9] could enhance programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, or a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark state-of-the-art […]

Study Type

Benchmark (Not Published)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

large-language-model, evaluation, education, computer-science, highly-cited