Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
This paper systematically evaluates ChatGPT (based on GPT-3.5) and GPT-4 against human tutors across six programming education scenarios: program repair, hint generation, grading feedback, pair programming, contextualized explanation, and task synthesis. The evaluation uses introductory Python problems and real-world buggy student programs from an online platform, with expert annotators assessing the pedagogical quality of each model's output in every scenario.
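To make the program repair scenario concrete, below is a minimal sketch of the kind of input and output involved; the factorial problem, the off-by-one bug, and the function names are illustrative assumptions, not examples drawn from the paper's dataset.

# Hypothetical buggy student submission: compute n! (factorial).
# The loop bound excludes n itself, a common off-by-one error.
def factorial_buggy(n):
    result = 1
    for i in range(1, n):  # bug: stops at n - 1
        result *= i
    return result

# A minimal repair of the kind a tutor or model is asked to produce,
# changing as little of the student's code as possible.
def factorial_fixed(n):
    result = 1
    for i in range(1, n + 1):  # fix: include n in the product
        result *= i
    return result

assert factorial_buggy(4) == 6   # wrong: returns 3! instead of 4!
assert factorial_fixed(4) == 24  # repaired program is correct

Minimality matters in this setting: a pedagogically useful repair preserves the student's own approach rather than replacing the program wholesale.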
Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI’s ChatGPT [8] and GPT-4 [9] could support programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, and a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark these state-of-the-art models against human tutors across a broad set of programming education scenarios.