MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
MuBench is a multilingual benchmark that evaluates LLM capabilities across 61 languages, covering natural language understanding, commonsense reasoning, factual recall, and academic reasoning, with cross-lingually aligned data to enable fair comparisons. The paper assesses performance gaps between languages and studies cross-lingual transfer dynamics through pretraining experiments.
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs on MuBench.
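To make the role of cross-lingual alignment concrete, the sketch below shows one way a fair per-language comparison can be computed when every benchmark item has parallel versions across languages: each language is scored on exactly the same set of item IDs, so score gaps reflect language ability rather than differing question pools. The record layout, field names (`item_id`, `lang`, `correct`), and the sample data are illustrative assumptions, not MuBench's actual schema.

```python
from collections import defaultdict

# Hypothetical evaluation records: each benchmark item appears once per
# language (cross-lingually aligned), with a boolean correctness flag.
# Field names and values are illustrative, not MuBench's actual format.
results = [
    {"item_id": 0, "lang": "en", "correct": True},
    {"item_id": 0, "lang": "sw", "correct": False},
    {"item_id": 1, "lang": "en", "correct": True},
    {"item_id": 1, "lang": "sw", "correct": True},
]

def per_language_accuracy(records):
    """Compute accuracy per language over the shared (aligned) item set."""
    # Keep only item IDs present in every language, so each language is
    # scored on exactly the same questions.
    langs = {r["lang"] for r in records}
    items_by_lang = defaultdict(set)
    for r in records:
        items_by_lang[r["lang"]].add(r["item_id"])
    shared = set.intersection(*(items_by_lang[l] for l in langs))

    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        if r["item_id"] in shared:
            totals[r["lang"]] += 1
            hits[r["lang"]] += int(r["correct"])
    return {l: hits[l] / totals[l] for l in langs}

print(per_language_accuracy(results))  # e.g. {'en': 1.0, 'sw': 0.5}
```

Restricting the score to the intersection of item IDs is what makes the comparison "aligned": without it, a language with an easier or smaller question pool could appear artificially strong.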