GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
GSM-Plus is a robustness benchmark for evaluating the mathematical reasoning capabilities of large language models (LLMs). It perturbs grade-school math problems from GSM8K with varied mathematical perturbations (numerical substitution, problem-understanding changes, distractor insertion, and others). The benchmark tests whether LLMs truly understand mathematical concepts or rely on shortcuts, revealing accuracy drops of up to 20% when questions are only slightly modified.
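As a minimal illustration of the perturbation idea (a sketch, not the authors' construction pipeline), the snippet below applies a hypothetical numerical substitution to a GSM8K-style seed question and recomputes the gold answer so the variant stays internally consistent; the `seed` problem and `gold` function here are invented examples, not items from the benchmark.

```python
import re

def numerical_substitution(question: str, answer_fn, old: int, new: int):
    """Replace one number in a seed question and recompute the gold answer.

    A toy stand-in for a GSM-Plus-style numerical-substitution perturbation:
    the question text changes minimally, but a robust solver should still
    produce the correct (updated) answer.
    """
    perturbed = re.sub(rf"\b{old}\b", str(new), question, count=1)
    return perturbed, answer_fn(new)

# Hypothetical GSM8K-style seed problem; the gold answer is expressed as a
# function of the perturbed quantity so substitution stays consistent.
seed = "Janet has 16 eggs and sells 7 of them. How many eggs are left?"
gold = lambda n: n - 7

for n in (16, 23, 48):  # the seed value plus two numerical variants
    question, answer = numerical_substitution(seed, gold, 16, n)
    print(f"{question}  -> gold answer: {answer}")
```

Comparing a model's accuracy on the seed questions against its accuracy on such variants is what exposes the robustness gap the benchmark measures.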
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there is growing debate over whether these models truly understand and apply mathematical knowledge or merely rely on shortcuts for mathematical reasoning. One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability.