AI Tutors Landscape Summary

AI Tutors: Benchmarking Personalised Learning Systems in K-12 Education

Tool type: 1-to-1 conversational tutoring systems.

How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this tool type, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, and then used Claude to synthesise the findings into a structured evidence summary. The focus is on what benchmarks and evaluation methods exist to measure whether these tools work in the lab.
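For illustration only, here is a minimal sketch of how a pipeline like the one described above could be wired together, assuming a JSON file of scored paper records and the Anthropic Python SDK. The file name, field names (relevance_score, sections, title), and model alias are assumptions for the sketch, not the actual tooling used to produce this summary.

```python
import json

import anthropic

# Hypothetical input: one JSON record per paper, with a reviewer-assigned
# relevance score (0-10) and the extracted key sections.
# Field names here are illustrative, not the real schema.
with open("papers.json") as f:
    papers = json.load(f)

# Keep only high-relevance papers for this tool type (score >= 7/10).
relevant = [p for p in papers if p["relevance_score"] >= 7]

# Concatenate the extracted key sections into one digest per paper.
SECTIONS = ["abstract", "introduction", "results", "discussion", "conclusions"]
digests = []
for paper in relevant:
    body = "\n".join(f"{s.upper()}: {paper['sections'].get(s, '')}" for s in SECTIONS)
    digests.append(f"### {paper['title']}\n{body}")

prompt = (
    "Synthesise the following paper extracts into a structured evidence summary "
    "of benchmarks and evaluation methods for AI tutors:\n\n" + "\n\n".join(digests)
)

# Single synthesis call via the Anthropic SDK; the model alias is an assumption.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```

In practice a corpus of several hundred papers would exceed a single context window, so the synthesis step would likely be chunked or run per-theme before a final pass; the single-call version above is kept deliberately simple.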


AI tutors represent one of the most mature and actively researched areas in educational AI — and one of the most contested. Our analysis covers 348 papers spanning intelligent tutoring systems (ITS), large language model (LLM)-powered conversational tutors, and adaptive learning platforms, primarily targeting K-12 mathematics and STEM education. The field demonstrates impressive technical progress: systems like ASSISTments, Reasoning Mind Genie 2, and newer LLM-based platforms such as Khanmigo and Duolingo Max can now deliver fluent, personalised instruction at scale. A rigorous randomised controlled trial (RCT) in UK classrooms found that human-supervised AI tutoring achieved efficacy comparable to that of human tutors, with knowledge transfer rates of 66.2% versus 60.7% for human instruction.

Yet beneath this progress lies a fundamental tension. Research consistently reveals that AI tutors — particularly those powered by LLMs — risk undermining the very learning they are designed to support. Studies show that students with unrestricted ChatGPT access scored 17% lower on independent tests despite solving 48% more practice problems. One large-scale study found cognitive engagement scores were significantly lower (mean 2.95/5) for ChatGPT users compared with controls (4.19/5). The field's most authoritative benchmark, TutorBench, demonstrates that no frontier LLM exceeds 56% overall performance on core tutoring skills. These findings point to a critical gap between what AI tutors can do technically and what they achieve pedagogically.

The methodological landscape is shifting rapidly — from rule-based systems toward LLM-powered approaches, and from evaluating answer correctness toward assessing the quality of the tutoring process itself. However, the vast majority of studies measure immediate post-test performance rather than long-term retention, transfer, or metacognitive development. For funders and policymakers in low- and middle-income countries (LMICs), this evidence base demands careful interpretation: AI tutors hold genuine promise for scaling personalised instruction, but deployment without pedagogical safeguards risks creating what researchers have termed a "Zone of No Development" — where permanent AI scaffolding replaces, rather than supports, cognitive growth.