Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study

Benchmark (Published & Automated) · Relevance: 9/10 · 13 citations · 2025 paper

This paper presents a benchmarking framework that systematically evaluates whether LLMs can replicate the adaptivity of intelligent tutoring systems. It tests three LLMs (Llama3-8B, Llama3-70B, GPT-4o) across 75 real-world tutoring scenarios, measuring both adaptivity to student context and the pedagogical soundness of generated instructional moves. The authors use prompt variations that remove key contextual features (student errors, knowledge components) to assess whether LLMs adjust their responses appropriately, finding that current LLMs only marginally match ITS adaptivity.

Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS), where student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt variation framework to assess the adaptivity and pedagogical soundness of LLM-generated instructional moves across 75 real-world tutoring scenarios from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) to test whether LLMs adjust their responses appropriately.
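The core of the evaluation method is prompt ablation: build a full-context tutoring prompt, then variants with individual context components removed, and compare model responses across variants. A minimal sketch of that idea follows; the scenario, field names, and prompt wording are illustrative assumptions, not taken from the paper's actual framework.

```python
# Sketch of a prompt-variation (ablation) harness.
# The scenario and field names below are hypothetical examples.

FULL_CONTEXT = {
    "problem": "Solve 3x + 5 = 20 for x.",
    "student_error": "Student subtracted 5 from only the left side.",
    "knowledge_component": "maintaining equality when isolating a variable",
}

def build_prompt(context, ablate=()):
    """Render a tutoring prompt, omitting any ablated context fields."""
    lines = ["You are a math tutor. Suggest the next instructional move."]
    for key, value in context.items():
        if key not in ablate:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

# One full-context prompt plus one variant per removed component.
variants = {"full": build_prompt(FULL_CONTEXT)}
for field in ("student_error", "knowledge_component"):
    variants[f"no_{field}"] = build_prompt(FULL_CONTEXT, ablate=(field,))

# Each variant would be sent to the model under test; responses are then
# scored for whether they change appropriately as context is removed.
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

In the paper's setup, the responses to each variant are judged for adaptivity (does the move actually use the available context?) and pedagogical soundness, rather than simply diffed.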

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.
Personalised Adaptive Learning: systems that adapt content and difficulty to individual learners.

Tags

intelligent tutoring system evaluation · computer-science