OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

Relevance: 2/10 13 cited 2024 paper

OpenEval is a comprehensive evaluation platform for Chinese large language models (LLMs) that benchmarks them across capability (NLP tasks, knowledge, reasoning), alignment (bias, offensiveness), and safety (power-seeking, self-awareness risks). The platform evaluates general-purpose Chinese LLMs on various dimensions but does not specifically focus on K-12 education contexts or pedagogical applications.

The rapid development of Chinese large language models (LLMs) poses big challenges for efficient LLM evaluation. While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities, usually overlooking potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety. For capability assessment, we includ

Tool Types

Tags

commonsense reasoning testcomputer-science