Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs

Relevance: 8/10 · 3 citations · 2025 paper

This paper presents an automated multilingual pipeline for generating, solving, and evaluating K-10 math problems using three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus) across English, German, and Arabic. The study reveals consistent linguistic bias, with English solutions rated highest and Arabic lowest, highlighting equity concerns in AI-based educational tools.

Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies with the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. Evaluation of these solutions revealed consistent linguistic bias: English solutions were rated highest and Arabic solutions lowest.
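
The generate-solve-evaluate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the OpenAI Python SDK with GPT-4o-mini only (the study also uses Gemini 2.5 Flash and Qwen-plus), and the prompt wording, helper names (ask, solve, evaluate), and 1-5 rubric are hypothetical placeholders for the pipeline's actual prompts and scoring scheme.

```python
# Minimal sketch of a multilingual solve-and-evaluate loop.
# Assumptions (not from the paper): OpenAI Python SDK, GPT-4o-mini only,
# illustrative prompts and an illustrative 1-5 rating rubric.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LANGUAGES = ["English", "German", "Arabic"]

def ask(prompt: str) -> str:
    """Send a single-turn prompt to GPT-4o-mini and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def solve(problem: str, language: str) -> str:
    """Ask for a step-by-step solution written entirely in the target language."""
    return ask(
        f"Solve this K-10 math problem step by step. "
        f"Answer entirely in {language}.\n\n{problem}"
    )

def evaluate(problem: str, solution: str) -> str:
    """Ask the model to rate a solution's correctness and clarity (illustrative rubric)."""
    return ask(
        "Rate the following step-by-step solution for correctness and clarity "
        "on a 1-5 scale, and briefly justify the score.\n\n"
        f"Problem:\n{problem}\n\nSolution:\n{solution}"
    )

if __name__ == "__main__":
    problem = "A rectangle is 7 cm long and 4 cm wide. What is its area?"
    for lang in LANGUAGES:
        solution = solve(problem, lang)
        print(lang, "->", evaluate(problem, solution))
```

In the study, comparing the ratings collected per language across the 628 exercises is what surfaces the reported gap between English and Arabic solutions.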

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

multilingual, evaluation, education, computer-science