MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages

Relevance: 3/10 · 1 citation · 2025 paper

MENLO is a framework for evaluating native-like quality of LLM responses across 47 languages by breaking down quality into four dimensions (language quality, cultural/linguistic alignment, factual correctness, and writing style). The paper creates a dataset of 6,423 human-annotated preference pairs and develops LLM judges that can evaluate multilingual response quality.

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges […]
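To make the evaluation setup concrete, here is a minimal sketch of pairwise LLM-as-judge scoring over the four quality dimensions named in the summary. This is an illustration only, not the paper's implementation: the prompt wording, the judge_pair/call_llm helpers, and the single-letter verdict format are assumptions.

```python
# Hypothetical sketch of pairwise LLM-as-judge evaluation over MENLO's
# four quality dimensions. Dimension names come from the summary above;
# everything else (prompt template, helper names) is assumed.

DIMENSIONS = [
    "language quality",
    "cultural/linguistic alignment",
    "factual correctness",
    "writing style",
]

JUDGE_TEMPLATE = """You are judging which of two responses to the same prompt
reads more like it was written by a native speaker of {language}.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

For the dimension "{dimension}", answer with a single letter: A or B."""


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (swap in any chat-completion client)."""
    raise NotImplementedError


def judge_pair(prompt: str, response_a: str, response_b: str, language: str) -> dict:
    """Return a per-dimension preference (A or B) for one preference pair."""
    verdicts = {}
    for dim in DIMENSIONS:
        judge_prompt = JUDGE_TEMPLATE.format(
            language=language,
            prompt=prompt,
            response_a=response_a,
            response_b=response_b,
            dimension=dim,
        )
        answer = call_llm(judge_prompt).strip().upper()
        verdicts[dim] = "A" if answer.startswith("A") else "B"
    return verdicts
```

In practice the per-dimension verdicts would be compared against the human preference labels to measure judge agreement with annotators.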

Tags

LLM as judge evaluation, computer-science