MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Relevance: 3/10 · 25 citations · 2024 paper

MM-Eval is a multilingual meta-evaluation benchmark designed to assess LLM-based evaluators (LLM-as-a-judge and reward models) across 18 core languages, with a Language Resource subset extending coverage to 122, testing whether evaluator LLMs can reliably assess multilingual outputs through pairwise accuracy and consistency metrics (a sketch follows below). The benchmark focuses on general multilingual evaluation challenges rather than educational contexts, with subsets covering chat, reasoning, safety, language hallucination, and linguistics.
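The core meta-evaluation protocol is pairwise: the evaluator is shown a prompt with a human-preferred and a rejected response, and is scored on how often it picks the preferred one. Below is a minimal Python sketch of that idea under assumptions not taken from the paper: the `judge(prompt, response_a, response_b)` callable, the example dictionary layout, and the position-swap consistency check are all hypothetical illustrations, not MM-Eval's actual API or exact scoring rule.

```python
from collections import defaultdict

def pairwise_accuracy(dataset, judge):
    """Per-language fraction of pairs where the judge prefers the chosen response.

    Assumed (hypothetical) data layout: each example is a dict with keys
    "language", "prompt", "chosen", "rejected". To approximate a consistency
    check, each pair is judged in both orders and the judge is credited only
    when its preference survives the position swap.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in dataset:
        lang = ex["language"]
        # Chosen response shown first, then second, to control for position bias.
        first = judge(ex["prompt"], ex["chosen"], ex["rejected"])   # expects "A"
        second = judge(ex["prompt"], ex["rejected"], ex["chosen"])  # expects "B"
        consistent_win = (first == "A") and (second == "B")
        correct[lang] += int(consistent_win)
        total[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

In this sketch, aggregating per language makes it easy to compare how sharply an evaluator's accuracy drops between English and lower-resource languages, which is the kind of gap the benchmark is designed to expose.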

As Large Language Models (LLMs) are now capable of producing fluent and coherent content in languages other than English, it is now imperative to precisely evaluate these non-English outputs. However, when assessing outputs from multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without thoroughly examining whether these evaluators can effectively assess non-English text as well. Moreover, existing benchmarks for testing evaluator LLMs…

Framework Categories

Tool Types

Tags

LLM as judge evaluation, computer-science