Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
This paper benchmarks whether large language models (LLMs) can replicate the adaptivity of intelligent tutoring systems by systematically removing context components from 75 real-world tutoring scenarios and evaluating how well three LLMs adapt to student errors, knowledge states, and pedagogical requirements. The study finds that even the best-performing LLM only marginally mimics ITS adaptivity, with concerning patterns such as GPT-4o providing overly direct feedback instead of effective Socratic questioning.
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS), where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess the adaptivity and pedagogical soundness of LLM-generated instructional moves across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge states) to evaluate how sensitive each model's tutoring moves are to this pedagogical context.
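To make the prompt variation framework more concrete, the sketch below shows one way such a context-ablation loop could be structured: each tutoring scenario is rendered into prompts with different context components removed, and each prompt is sent to each model under evaluation. This is a minimal illustration under stated assumptions, not the authors' actual implementation; the Scenario fields, build_prompt, ablation_conditions, and the query_llm callback are hypothetical names chosen for illustration.

```python
"""Minimal sketch of a prompt-variation (context-ablation) benchmark loop.

Assumes a generic chat-completion client wrapped by a user-supplied
query_llm(model, prompt) -> str function; all names here are illustrative.
"""
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Scenario:
    problem: str            # the tutoring problem statement
    student_error: str      # the student's observed error
    knowledge_state: str    # estimated mastery of the relevant skills
    pedagogical_goal: str   # e.g., "elicit the step via Socratic questioning"

CONTEXT_FIELDS = ["student_error", "knowledge_state", "pedagogical_goal"]

def build_prompt(s: Scenario, omit: set[str]) -> str:
    """Assemble the tutoring prompt, leaving out the ablated context fields."""
    parts = [f"Problem: {s.problem}"]
    if "student_error" not in omit:
        parts.append(f"Student error: {s.student_error}")
    if "knowledge_state" not in omit:
        parts.append(f"Student knowledge state: {s.knowledge_state}")
    if "pedagogical_goal" not in omit:
        parts.append(f"Pedagogical goal: {s.pedagogical_goal}")
    parts.append("Respond with your next tutoring move.")
    return "\n".join(parts)

def ablation_conditions(fields):
    """Yield every subset of context fields to remove, including the empty
    set (i.e., the full-context condition)."""
    for k in range(len(fields) + 1):
        yield from (set(c) for c in combinations(fields, k))

def run_benchmark(scenarios, models, query_llm):
    """Collect one generated tutoring move per (scenario, ablation, model)."""
    results = []
    for s in scenarios:
        for omit in ablation_conditions(CONTEXT_FIELDS):
            prompt = build_prompt(s, omit)
            for model in models:
                move = query_llm(model, prompt)
                results.append({"scenario": s.problem,
                                "omitted": sorted(omit),
                                "model": model,
                                "move": move})
    return results
```

The generated moves collected by run_benchmark would then be scored for adaptivity and pedagogical soundness across ablation conditions; that scoring step is the substance of the paper's evaluation and is not sketched here.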