Beyond Final Answers: Evaluating Large Language Models for Math Tutoring

Benchmark (Not Published) · Relevance: 8/10 · 15 citations · 2025 paper

This paper evaluates multiple ChatGPT models (3.5 Turbo through o1-preview) as math tutors for college algebra using two complementary approaches: assessing problem-solving accuracy via an intelligent tutoring system testbed, and evaluating interactive tutoring quality with human evaluators acting as students. The study finds that while the LLMs produce correct final answers in 85.5% of cases and high-quality instructional support in 90%, only 56.6% of tutoring dialogues are entirely error-free; the authors conclude that LLMs still require human oversight for math tutoring.
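To make the gap between the three headline rates concrete, here is a minimal sketch of how per-dialogue annotations might be aggregated into such metrics. The `DialogueAnnotation` schema and its field names are hypothetical illustrations, not the paper's actual instrumentation:

```python
from dataclasses import dataclass

@dataclass
class DialogueAnnotation:
    """One annotated tutoring dialogue (hypothetical schema)."""
    final_answer_correct: bool   # did the model reach the right final answer?
    support_high_quality: bool   # did annotators rate the instructional support highly?
    num_errors: int              # in-dialogue errors flagged by annotators

def aggregate_metrics(dialogues: list[DialogueAnnotation]) -> dict[str, float]:
    """Compute the three headline rates as fractions of all dialogues."""
    n = len(dialogues)
    return {
        "final_answer_accuracy": sum(d.final_answer_correct for d in dialogues) / n,
        "high_quality_support_rate": sum(d.support_high_quality for d in dialogues) / n,
        "error_free_dialogue_rate": sum(d.num_errors == 0 for d in dialogues) / n,
    }

# Toy example: a correct final answer does not imply an error-free dialogue.
sample = [
    DialogueAnnotation(True, True, 0),
    DialogueAnnotation(True, True, 2),   # right answer, but errors mid-dialogue
    DialogueAnnotation(False, True, 1),
]
print(aggregate_metrics(sample))
```

The point the sketch illustrates is that `error_free_dialogue_rate` is a strictly harder criterion than `final_answer_accuracy`, which is why the paper's 56.6% figure can sit far below its 85.5% one.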

Researchers have made notable progress in applying Large Language Models (LLMs) to solve math problems, as demonstrated through efforts like GSM8K, ProofNet, AlphaGeometry, and MathOdyssey. This progress has sparked interest in their potential use for tutoring students in mathematics. However, the reliability of LLMs in tutoring contexts -- where correctness and instructional quality are crucial -- remains underexplored. Moreover, LLM problem-solving capabilities may not necessarily translate into effective tutoring.

Study Type

Benchmark (Not Published)

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

tutoring dialogue evaluation, computer-science