2.3 Research Report

Pedagogical Interactions: How AI Tutoring Systems Engage Learners and Where They Fall Short

Benchmarks evaluating interactive teaching behaviours: Socratic questioning, scaffolding, and adaptive dialogue.

How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this category, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, then used Claude to synthesise the findings into a structured analysis. The report below reflects both what the research covers and what it doesn't.
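For readers who want to reproduce a pipeline of this shape, the sketch below shows the three steps described above: filtering papers by relevance score and category, extracting the key sections, and asking Claude for a structured synthesis. The 7/10 threshold, the section names, and the use of Claude come from the text; the field names, file layout, category label, and model string are assumptions for illustration.

```python
# Minimal sketch of the report-generation pipeline described above.
# Record fields, file paths, and the model name are assumptions, not the authors' code.
import json
import anthropic

RELEVANCE_THRESHOLD = 7
KEY_SECTIONS = ["abstract", "introduction", "results", "discussion", "conclusions"]


def load_papers(path: str) -> list[dict]:
    """Load scored, categorised paper records from a JSON Lines file (assumed format)."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def select_relevant(papers: list[dict], category: str) -> list[dict]:
    """Keep papers in this category that scored >= 7/10 on relevance."""
    return [
        p for p in papers
        if p["category"] == category and p["relevance_score"] >= RELEVANCE_THRESHOLD
    ]


def extract_key_sections(paper: dict) -> str:
    """Concatenate the key sections used for synthesis, skipping any that are missing."""
    parts = [
        f"## {name.title()}\n{paper['sections'][name]}"
        for name in KEY_SECTIONS
        if name in paper.get("sections", {})
    ]
    return f"# {paper['title']}\n" + "\n\n".join(parts)


def synthesise(papers: list[dict], topic: str) -> str:
    """Ask the model for a structured synthesis of the extracted sections."""
    corpus = "\n\n---\n\n".join(extract_key_sections(p) for p in papers)
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name; any recent Claude model works
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": (
                f"Synthesise the findings below into a structured analysis of {topic}, "
                f"noting both what the research covers and what it does not.\n\n{corpus}"
            ),
        }],
    )
    return response.content[0].text


if __name__ == "__main__":
    papers = select_relevant(load_papers("papers.jsonl"), "pedagogical-interactions")
    print(synthesise(papers, "pedagogical interactions"))
```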


Pedagogical interactions, encompassing Socratic questioning, adaptive scaffolding, dialogue-based tutoring, and formative feedback, represent one of the most active and consequential areas of AI-in-education research. Our analysis covers 332 papers examining how large language model (LLM)-powered systems engage K–12 and undergraduate learners in instructional dialogue, and the findings reveal a field grappling with a fundamental tension. While LLMs demonstrate impressive linguistic fluency and can generate plausible tutoring moves at scale, they frequently fail to replicate the adaptive, theory-grounded pedagogical reasoning that characterises effective human tutoring. Current systems tend toward cognitive shortcuts (revealing answers too early, providing overly direct feedback, and struggling to maintain coherent multi-turn scaffolding) rather than sustaining the guided questioning essential for deep learning.

Critically, the evidence on cognitive offloading is substantial and concerning. Across 52 papers addressing the issue directly, researchers document that students using AI tutors without proper instructional framing develop problematic dependencies, accept AI outputs uncritically, and demonstrate reduced independent problem-solving ability when support is withdrawn. One large-scale study found that students using GPT-4 for homework solved 48% more practice problems but scored 17% lower on unassisted tests, suggesting that AI reliance may actively impede the development of lasting competence. The most striking finding across this body of work is that most systems are not explicitly trained to maximise student learning outcomes; they are trained to mimic tutor utterances or follow surface-level pedagogical principles, leading to interactions that feel helpful but may be pedagogically suboptimal.

The research base is methodologically diverse, spanning randomised controlled trials (RCTs) in authentic classrooms, large-scale dialogue corpus analysis, reinforcement learning-based policy optimisation, and theory-driven benchmark development. However, significant gaps persist: the vast majority of studies measure short-term performance rather than long-term retention, domain transfer, or metacognitive development. Mathematics dominates as a subject domain, with limited coverage of humanities, creative reasoning, or open-ended inquiry. Evaluation also remains heavily reliant on automated metrics or single-session experiments rather than longitudinal classroom deployments.