INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
INCLUDE is a multilingual language understanding benchmark comprising 197,243 QA pairs from local exam sources across 44 languages, designed to evaluate LLM performance in regional and cultural contexts. The benchmark draws from educational, professional, and practical tests from different countries to capture authentic regional knowledge rather than translated content.
The performance differential of large language models (LLMs) across languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, which fail to capture the regional and cultural knowledge of the environments in which these models would be deployed.