Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs

Benchmark (Published & Automated) · Relevance: 8/10 · 3 citations · 2025 paper

This paper presents an automated multilingual pipeline that generates, solves, and evaluates 628 math problems aligned with the German K-10 curriculum across English, German, and Arabic, using three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus). It finds consistent linguistic bias, with English solutions rated highest and Arabic lowest. The pipeline chains automated generation, translation, solving, and LLM-judge evaluation to measure quality disparities in educational AI outputs across languages.

Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions.
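The generate → translate → solve → judge stages described above can be sketched as a simple loop. This is a minimal illustration, not the paper's code: the model calls are stubbed with placeholder functions, and the judge's scoring rubric and any function names here are assumptions.

```python
# Sketch of the paper's pipeline: generate a problem, translate it into
# each target language, collect a solution from each model, score it
# with a judge. All LLM calls are stubbed for illustration.

LANGUAGES = ["en", "de", "ar"]
MODELS = ["gpt-4o-mini", "gemini-2.5-flash", "qwen-plus"]


def generate_problem(topic: str) -> str:
    # Stub: the real pipeline prompts an LLM with a K-10 curriculum topic.
    return f"Solve a {topic} exercise."


def translate(problem: str, lang: str) -> str:
    # Stub: the real pipeline uses an LLM to translate the problem.
    return f"[{lang}] {problem}"


def solve(model: str, problem: str) -> str:
    # Stub: the real pipeline asks each model for a step-by-step solution.
    return f"{model} solution to: {problem}"


def judge(solution: str) -> int:
    # Stub: the real pipeline uses an LLM judge to rate solution quality.
    return len(solution) % 10 + 1


def run_pipeline(topics):
    # scores maps (model, language) -> list of judge ratings
    scores = {}
    for topic in topics:
        problem = generate_problem(topic)
        for lang in LANGUAGES:
            localized = translate(problem, lang)
            for model in MODELS:
                solution = solve(model, localized)
                scores.setdefault((model, lang), []).append(judge(solution))
    return scores


scores = run_pipeline(["fractions", "linear equations"])
```

Aggregating the per-(model, language) score lists would then expose the cross-language quality gaps the paper reports.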

Study Type

Benchmark (Published & Automated)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

multilingual evaluation, education, computer-science