Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
This paper benchmarks whether large language models (LLMs) can replicate the adaptivity of intelligent tutoring systems by systematically removing context components from 75 real-world tutoring scenarios and evaluating how well three LLMs adapt to student errors, knowledge states, and pedagogical requirements. The study finds that even the best-performing LLM only marginally mimics ITS adaptivity, with concerning patterns such as GPT-4o providing overly direct feedback instead of effective Socratic questioning.
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS), where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess the adaptivity and pedagogical soundness of LLM-generated instructional moves across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge states) to evaluate how sensitive each model's tutoring moves are to this pedagogical context.
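To make the prompt variation framework more concrete, the sketch below shows one way such a context-ablation loop could be structured: each tutoring scenario is rendered into prompts with different context components removed, and each prompt is sent to each model under evaluation. This is a minimal illustration under stated assumptions, not the authors' actual implementation; the Scenario fields, build_prompt, ablation_conditions, and the query_llm callback are hypothetical names chosen for illustration.

```python
"""Minimal sketch of a prompt-variation (context-ablation) benchmark loop.

Assumes a generic chat-completion client wrapped by a user-supplied
query_llm(model, prompt) -> str function; all names here are illustrative.
"""
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Scenario:
    problem: str            # the tutoring problem statement
    student_error: str      # the student's observed error
    knowledge_state: str    # estimated mastery of the relevant skills
    pedagogical_goal: str   # e.g., "elicit the step via Socratic questioning"

CONTEXT_FIELDS = ["student_error", "knowledge_state", "pedagogical_goal"]

def build_prompt(s: Scenario, omit: set[str]) -> str:
    """Assemble the tutoring prompt, leaving out the ablated context fields."""
    parts = [f"Problem: {s.problem}"]
    if "student_error" not in omit:
        parts.append(f"Student error: {s.student_error}")
    if "knowledge_state" not in omit:
        parts.append(f"Student knowledge state: {s.knowledge_state}")
    if "pedagogical_goal" not in omit:
        parts.append(f"Pedagogical goal: {s.pedagogical_goal}")
    parts.append("Respond with your next tutoring move.")
    return "\n".join(parts)

def ablation_conditions(fields):
    """Yield every subset of context fields to remove, including the empty
    set (i.e., the full-context condition)."""
    for k in range(len(fields) + 1):
        yield from (set(c) for c in combinations(fields, k))

def run_benchmark(scenarios, models, query_llm):
    """Collect one generated tutoring move per (scenario, ablation, model)."""
    results = []
    for s in scenarios:
        for omit in ablation_conditions(CONTEXT_FIELDS):
            prompt = build_prompt(s, omit)
            for model in models:
                move = query_llm(model, prompt)
                results.append({"scenario": s.problem,
                                "omitted": sorted(omit),
                                "model": model,
                                "move": move})
    return results
```

The generated moves collected by run_benchmark would then be scored for adaptivity and pedagogical soundness across ablation conditions; that scoring step is the substance of the paper's evaluation and is not sketched here.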