Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs
This paper presents an automated multilingual pipeline that generates, solves, and evaluates 628 math problems aligned with the German K-10 curriculum across English, German, and Arabic using three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus), finding consistent linguistic bias with English solutions rated highest and Arabic lowest. The pipeline includes automated generation, translation, solving, and LLM-judge evaluation to measure quality disparities in educational AI outputs across languages.
Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions.
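The generate → translate → solve → judge workflow described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: every function body here is a placeholder, and in a real run `call_llm` would issue API requests to the three commercial models (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus) and `judge` would prompt an LLM with a grading rubric.

```python
# Hypothetical sketch of the multilingual evaluation pipeline.
# All function bodies are stubs; a real run would replace call_llm
# with actual API requests to the three commercial models.

LANGUAGES = ["en", "de", "ar"]
MODELS = ["gpt-4o-mini", "gemini-2.5-flash", "qwen-plus"]


def call_llm(model: str, prompt: str) -> str:
    # Placeholder standing in for a commercial-API request.
    return f"[{model}] response to: {prompt[:40]}"


def generate_problem(topic: str) -> str:
    # Step 1: generate a curriculum-aligned exercise.
    return call_llm(MODELS[0], f"Write a K-10 math problem on {topic}")


def translate(problem: str, lang: str) -> str:
    # Step 2: translate the exercise into the target language.
    return call_llm(MODELS[0], f"Translate into {lang}: {problem}")


def solve(model: str, problem: str, lang: str) -> str:
    # Step 3: prompt each model for a step-by-step solution.
    return call_llm(model, f"Solve step by step in {lang}: {problem}")


def judge(solution: str) -> float:
    # Step 4: LLM-as-judge scoring; stubbed here with a dummy score
    # in [0, 1] instead of a rubric-based grading prompt.
    return 1.0 if solution else 0.0


def run_pipeline(topics: list[str]) -> dict:
    # One score per (topic, language, model) combination.
    scores = {}
    for topic in topics:
        problem = generate_problem(topic)
        for lang in LANGUAGES:
            localized = translate(problem, lang)
            for model in MODELS:
                solution = solve(model, localized, lang)
                scores[(topic, lang, model)] = judge(solution)
    return scores


scores = run_pipeline(["linear equations"])
print(len(scores))  # 3 languages x 3 models = 9 scored solutions
```

Aggregating the judge scores per language is then what exposes the quality gap the paper reports, with English-language solutions rated highest and Arabic lowest.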