A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Relevance: 7/10 · 172 citations · 2024 paper

This paper introduces GSM1k, a new benchmark designed to mirror the style and complexity of GSM8k (grade school math problems), in order to test whether LLMs genuinely reason or have merely memorized training data. The study finds evidence of dataset contamination and overfitting in several model families, with accuracy drops of up to 8% on the novel benchmark.

Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance reflects dataset contamination, in which data closely resembling benchmark questions leaks into the training data, rather than true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark.
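The core comparison is simple enough to express in a few lines: score a model on GSM8k-style problems and on the matched GSM1k problems, then read the accuracy gap as a contamination/overfitting signal. Below is a minimal Python sketch of that idea, assuming exact-match grading of final answers; the function names and example scores are hypothetical illustrations, not the authors' actual evaluation harness.

```python
# Sketch of the benchmark-gap comparison: grade a model's answers on two
# matched problem sets and treat the accuracy difference as an
# overfitting/contamination indicator. Names here are hypothetical.

def accuracy(model_answers: list[str], gold_answers: list[str]) -> float:
    """Fraction of final answers that exactly match the gold answers."""
    correct = sum(a == g for a, g in zip(model_answers, gold_answers))
    return correct / len(gold_answers)

def overfit_gap(gsm8k_acc: float, gsm1k_acc: float) -> float:
    """Positive gap = worse performance on the held-out GSM1k set,
    consistent with memorization of GSM8k-like training data."""
    return gsm8k_acc - gsm1k_acc

# Hypothetical scores showing an 8-point drop, the worst case the
# summary above reports for some model families.
print(round(overfit_gap(0.80, 0.72), 2))  # -> 0.08
```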

Tags

reasoning, evaluation, LLM, computer-science, highly-cited