CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

Relevance: 7/10 | Cited by 63 | 2023 paper

CMATH is a dataset of 1,700 Chinese elementary school math word problems (grades 1-6) for benchmarking LLMs' arithmetic and reasoning capabilities against human grade-level performance. It is used to evaluate models such as GPT-4, ChatGPT, and others on age-appropriate math problems.
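
As a rough illustration of what grade-level benchmarking could look like in practice, the sketch below computes per-grade accuracy and then the highest grade a model "passes". Everything in it is an assumption made for illustration: the record layout (grade, gold answer, predicted answer), the function names, the numeric comparison tolerance, and the 0.6 pass threshold are hypothetical and are not taken from this listing or from the CMATH release.

```python
from collections import defaultdict


def per_grade_accuracy(records):
    """Return {grade: accuracy} for (grade, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for grade, gold, predicted in records:
        total[grade] += 1
        # Compare numerically so "12" and "12.0" count as the same answer.
        if abs(float(gold) - float(predicted)) < 1e-6:
            correct[grade] += 1
    return {g: correct[g] / total[g] for g in sorted(total)}


def highest_grade_passed(accuracy, threshold=0.6, grades=range(1, 7)):
    """Highest grade g such that grades 1..g all meet the threshold (0 if none)."""
    passed = 0
    for g in grades:
        if accuracy.get(g, 0.0) < threshold:
            break
        passed = g
    return passed


# Toy, made-up records for illustration only; this is not CMATH data.
records = [
    (1, "5", "5"), (1, "12", "12.0"),
    (2, "7", "7"), (2, "30", "28"), (2, "9", "9"),
    (3, "4.5", "4"), (3, "100", "90"),
]
acc = per_grade_accuracy(records)
level = highest_grade_passed(acc)
```

Requiring every lower grade to pass as well is one possible design choice; it keeps a single strong result at a higher grade from masking failures on easier material.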

We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, sourced from actual Chinese workbooks and exams. This dataset aims to provide a benchmark tool for assessing the following question: to what grade level of elementary school math do the abilities of popular large language models (LLMs) correspond? We evaluate a variety of popular LLMs, including both commercial and open-source options,

Tags

reasoning, evaluation, LLM, computer-science