Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs
This paper presents an automated multilingual pipeline that generates, solves, and evaluates 628 math problems aligned with the German K-10 curriculum across English, German, and Arabic using three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus), finding consistent linguistic bias with English solutions rated highest and Arabic lowest. The pipeline includes automated generation, translation, solving, and LLM-judge evaluation to measure quality disparities in educational AI outputs across languages.
Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions.
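The generate → translate → solve → judge workflow described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: every function body here is a placeholder, and in a real run `call_llm` would issue API requests to the three commercial models (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus) and `judge` would prompt an LLM with a grading rubric.

```python
# Hypothetical sketch of the multilingual evaluation pipeline.
# All function bodies are stubs; a real run would replace call_llm
# with actual API requests to the three commercial models.

LANGUAGES = ["en", "de", "ar"]
MODELS = ["gpt-4o-mini", "gemini-2.5-flash", "qwen-plus"]


def call_llm(model: str, prompt: str) -> str:
    # Placeholder standing in for a commercial-API request.
    return f"[{model}] response to: {prompt[:40]}"


def generate_problem(topic: str) -> str:
    # Step 1: generate a curriculum-aligned exercise.
    return call_llm(MODELS[0], f"Write a K-10 math problem on {topic}")


def translate(problem: str, lang: str) -> str:
    # Step 2: translate the exercise into the target language.
    return call_llm(MODELS[0], f"Translate into {lang}: {problem}")


def solve(model: str, problem: str, lang: str) -> str:
    # Step 3: prompt each model for a step-by-step solution.
    return call_llm(model, f"Solve step by step in {lang}: {problem}")


def judge(solution: str) -> float:
    # Step 4: LLM-as-judge scoring; stubbed here with a dummy score
    # in [0, 1] instead of a rubric-based grading prompt.
    return 1.0 if solution else 0.0


def run_pipeline(topics: list[str]) -> dict:
    # One score per (topic, language, model) combination.
    scores = {}
    for topic in topics:
        problem = generate_problem(topic)
        for lang in LANGUAGES:
            localized = translate(problem, lang)
            for model in MODELS:
                solution = solve(model, localized, lang)
                scores[(topic, lang, model)] = judge(solution)
    return scores


scores = run_pipeline(["linear equations"])
print(len(scores))  # 3 languages x 3 models = 9 scored solutions
```

Aggregating the judge scores per language is then what exposes the quality gap the paper reports, with English-language solutions rated highest and Arabic lowest.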