Multilingual Capabilities: How AI Systems Perform Across Languages in Education
Benchmarks evaluating performance across languages and cross-lingual educational tasks.
How this was produced: We identified high-relevance papers (scored ≥7/10) classified under this category, extracted key sections (abstract, introduction, results, discussion, conclusions) from each, then used Claude to synthesise findings into a structured analysis. The report below reflects what the research covers — and what it doesn't.
The ability of large language models (LLMs) to operate effectively across multiple languages is arguably the single most consequential factor determining whether AI-EdTech can deliver equitable educational benefits in low- and middle-income countries (LMICs). Our analysis of 15 papers focussed on multilingual capabilities in educational contexts reveals a consistent and concerning finding: LLMs exhibit systematic performance gaps of 10–40% between English and other languages, with degradation most severe for low-resource languages and non-Latin scripts. This means that learners who stand to benefit most from AI-powered educational tools — those in multilingual LMIC settings — are precisely those receiving the lowest-quality AI support.
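To make the scale of these gaps concrete, the short sketch below shows one common way to summarise them: accuracy per language, the absolute gap against English in percentage points, and the relative degradation. The accuracy figures are invented placeholders for illustration, not results from the reviewed papers.

```python
# Hypothetical per-language accuracies (placeholders, not figures from the reviewed studies).
accuracy_by_language = {
    "English": 0.82,
    "Indonesian": 0.66,
    "Swahili": 0.55,
    "Amharic": 0.49,
}

english_accuracy = accuracy_by_language["English"]

for language, accuracy in sorted(accuracy_by_language.items(), key=lambda kv: -kv[1]):
    gap_points = (english_accuracy - accuracy) * 100   # absolute gap in percentage points
    relative_drop = 1 - accuracy / english_accuracy    # degradation relative to English
    print(f"{language:<11} accuracy={accuracy:.0%}  gap={gap_points:4.1f} pts  relative drop={relative_drop:.0%}")
```

Note that a "30% gap" can mean either percentage points or relative degradation; the sketch reports both to keep the distinction explicit when comparing figures across papers.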
The research base is growing in sophistication. Multiple studies now construct authentic multilingual benchmarks drawn from real national examinations and curricula, moving beyond the earlier, deeply flawed practice of simply translating English-language datasets. Landmark benchmarks such as EXAMS (covering 16 languages and 24 subjects) and IndoMMLU (nearly 15,000 questions drawn from the Indonesian education system) provide granular evidence of where models succeed and fail. A particularly striking finding from the IndoMMLU study is that earlier-generation models such as GPT-3.5 achieved only primary school-level performance when tested on Indonesian educational content, despite performing competently on equivalent English-language assessments.
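For readers less familiar with how exam-derived benchmarks of this kind are scored, the sketch below shows a typical multiple-choice evaluation loop that reports accuracy per language and education level. It is written in the spirit of EXAMS- and IndoMMLU-style evaluations, but the item fields and the model-querying callable are assumptions made for illustration, not the actual benchmark or model interfaces.

```python
# Illustrative multiple-choice evaluation loop for an exam-derived benchmark.
# Item fields ("language", "level", "question", "options", "answer") and the
# ask_model callable are placeholders, not the real EXAMS/IndoMMLU interfaces.
from collections import defaultdict
from typing import Callable

def format_prompt(item: dict) -> str:
    """Render one exam question with lettered answer options."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["options"]))
    return f"{item['question']}\n{options}\nAnswer with a single letter."

def evaluate(items: list[dict], ask_model: Callable[[str], str]) -> dict:
    """Return accuracy keyed by (language, education level), e.g. ('Indonesian', 'primary')."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        key = (item["language"], item["level"])
        prediction = ask_model(format_prompt(item)).strip().upper()[:1]
        total[key] += 1
        correct[key] += int(prediction == item["answer"])
    return {key: correct[key] / total[key] for key in total}

# Usage with a trivial stand-in model that always answers "A":
sample_items = [
    {"language": "Indonesian", "level": "primary", "question": "2 + 3 = ?",
     "options": ["5", "6", "4", "7"], "answer": "A"},
]
print(evaluate(sample_items, lambda prompt: "A"))  # {('Indonesian', 'primary'): 1.0}
```

Grouping scores by education level is what makes findings like the GPT-3.5 result above visible: a model can clear primary-level items in a language while failing the more advanced ones.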
However, the field remains heavily focussed on measuring answer correctness rather than the quality of educational interaction. Almost no research examines long-term learning outcomes for students using multilingual AI tutors, the impact of language-dependent performance gaps on student motivation, or how code-switching — a natural feature of multilingual classrooms — is handled by these systems. These gaps represent urgent priorities for the education research and funding community.