Multilingual Performance Biases of Large Language Models in Education

Relevance: 8/10 · Cited by 10 · 2025 paper

This paper evaluates the multilingual performance of large language models (GPT-4o, Gemini, Claude, Llama, Mistral) on four K-12 educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations. Each task is evaluated across nine languages: English, Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, and Czech.

Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain whether their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in eight languages (Mandarin, Hindi, Arabic, German, Farsi, Telugu, Ukrainian, and Czech).
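
The abstract describes a model × task × language evaluation grid. Below is a minimal sketch of how such a grid could be scored; the model names, task labels, and the `query_model`/`score_response` helpers are placeholders for illustration, not the paper's actual harness.

```python
from itertools import product

# Hypothetical evaluation grid: every model is scored on every task in every language.
MODELS = ["gpt-4o", "gemini", "claude", "llama", "mistral"]
TASKS = ["misconception_identification", "targeted_feedback",
         "interactive_tutoring", "translation_grading"]
LANGUAGES = ["en", "zh", "hi", "ar", "de", "fa", "te", "uk", "cs"]


def query_model(model: str, task: str, language: str, prompt: str) -> str:
    """Placeholder for a call to the model under test (assumed API)."""
    return f"[{model} response to a {task} prompt in {language}]"


def score_response(task: str, response: str) -> float:
    """Placeholder for task-specific scoring (e.g. a rubric or reference match)."""
    return 0.0


def evaluate_grid(prompts: dict) -> dict:
    """Return the mean score for each (model, task, language) cell.

    `prompts` maps (task, language) to a list of prompt strings.
    """
    results = {}
    for model, task, lang in product(MODELS, TASKS, LANGUAGES):
        items = prompts.get((task, lang), [])
        if not items:
            continue
        scores = [score_response(task, query_model(model, task, lang, p))
                  for p in items]
        results[(model, task, lang)] = sum(scores) / len(scores)
    return results
```

Comparing each cell against the corresponding English cell is one straightforward way to surface the per-language performance gaps the paper studies.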

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.
Teacher Support Tools: Tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

large language model, evaluation, education, computer-science