Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic

Relevance: 7/10 · 6 citations · 2025 paper

This paper proposes using Signal Temporal Logic (STL) to evaluate confidence trajectories in Chain-of-Thought reasoning for LLMs solving high school mathematics problems, aiming to improve calibration and reduce overconfident incorrect answers. The method is tested on Chinese Gaokao mathematics questions to provide more reliable uncertainty estimates for educational AI systems.

Large Language Models (LLMs) have shown impressive performance on mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess the reasoning steps themselves. To address this, the authors propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL).

Tool Types

AI Tutors: 1-to-1 conversational tutoring systems.

Tags

large language model, evaluation, education, computer-science