AI Tutors

1-to-1 conversational tutoring systems.


Research Summary

AI tutors represent one of the most mature and actively researched areas in educational AI, and one of the most contested. Our analysis covers 348 papers spanning intelligent tutoring systems (ITS), large language model (LLM)-powered conversational tutors, and adaptive learning platforms, primarily targeting K-12 mathematics and STEM education. The field demonstrates impressive technical progress: systems like ASSISTments, Reasoning Mind Genie 2, and newer LLM-based platforms such as Khanmigo and Duolingo Max can now deliver fluent, personalised instruction at scale. A rigorous randomised controlled trial (RCT) in UK classrooms found that human-supervised AI tutoring achieved efficacy comparable to that of human tutors, with knowledge transfer rates of 66.2% for AI tutoring versus 60.7% for human instruction.

Yet beneath this progress lies a fundamental tension. Research consistently reveals that AI tutors — particularly those powered by LLMs — risk undermining the very learning they are designed to support. Studies show that students with unrestricted ChatGPT access scored 17% lower on independent tests despite solving 48% more practice problems. One large-scale study found cognitive engagement scores were significantly lower (mean 2.95/5) for ChatGPT users compared with controls (4.19/5). The field's most authoritative benchmark, TutorBench, demonstrates that no frontier LLM exceeds 56% overall performance on core tutoring skills. These findings point to a critical gap between what AI tutors can do technically and what they achieve pedagogically.

The methodological landscape is shifting rapidly — from rule-based systems toward LLM-powered approaches, and from evaluating answer correctness toward assessing the quality of the tutoring process itself. However, the vast majority of studies measure immediate post-test performance rather than long-term retention, transfer, or metacognitive development. For funders and policymakers in low- and middle-income countries (LMICs), this evidence base demands careful interpretation: AI tutors hold genuine promise for scaling personalised instruction, but deployment without pedagogical safeguards risks creating what researchers have termed a "Zone of No Development" — where permanent AI scaffolding replaces, rather than supports, cognitive growth.

1,976 benchmarks across 11 categories
Pedagogical interactions: 1,243 benchmarks
Pedagogy of generated outputs: 501 benchmarks
Feedback with reasoning: 290 benchmarks
Content knowledge: 562 benchmarks
General reasoning: 613 benchmarks