E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

Benchmark (Published & Automated) | Relevance: 10/10 | Cited by 7 | 2024 paper

E-EVAL is a comprehensive Chinese K-12 education evaluation benchmark containing 4,351 multiple-choice questions across primary, middle, and high school levels in subjects including Chinese, English, Mathematics, Physics, Chemistry, and more, designed to assess large language models' knowledge and reasoning capabilities in the Chinese K-12 education domain. The benchmark is publicly available with dataset and code, and evaluates both English-dominant and Chinese-dominant LLMs across different grade levels and subject areas.
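Scoring a multiple-choice benchmark by grade level and subject, as described above, can be sketched roughly as follows. This is a minimal illustration, not the E-EVAL authors' evaluation code: the record fields (`level`, `subject`, `answer`, `prediction`) and the sample values are hypothetical and do not reflect the actual dataset schema.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for multiple-choice records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records: field names and values are assumptions for illustration only.
records = [
    {"level": "primary", "subject": "Mathematics", "answer": "A", "prediction": "A"},
    {"level": "primary", "subject": "Chinese", "answer": "C", "prediction": "B"},
    {"level": "middle", "subject": "Physics", "answer": "D", "prediction": "d"},
    {"level": "high", "subject": "Chemistry", "answer": "B", "prediction": "B"},
]

by_level = accuracy_by_group(records, "level")
by_subject = accuracy_by_group(records, "subject")
print(by_level)
print(by_subject)
```

Grouping by a single key keeps the per-level and per-subject breakdowns independent, so the same helper serves both views of the results.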

With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. The integration of LLMs and education is growing ever closer; however, there is currently no benchmark for evaluating LLMs that focuses on the Chinese K-12 education domain. Therefore, there is an urgent need for a comprehensive natural language processing benchmark to accurately assess the capabilities of various LLMs in the Chinese K-12 education domain.

Study Type

Benchmark (Published & Automated)

Tags

LLM evaluation, K-12 education, computer-science