Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic
This paper proposes using Signal Temporal Logic (STL) to evaluate confidence trajectories in Chain-of-Thought reasoning for LLMs solving high school mathematics problems, with the aim of improving calibration and reducing overconfident incorrect answers. The method is evaluated on Chinese Gaokao mathematics questions, targeting more reliable uncertainty estimates for educational AI systems.
Large Language Models (LLMs) have shown impressive performance on mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains such as education, where users may lack the expertise to assess individual reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we extract a confidence score at each reasoning step, treat the resulting sequence as a discrete-time signal, and check it against STL specifications that encode desirable confidence behavior over the course of the reasoning trace.
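To make the core idea concrete, the following is a minimal sketch in Python, not the paper's exact specification: it treats the per-step confidence scores of a CoT trace as a discrete-time signal and computes the quantitative robustness of two illustrative STL formulas, G(conf >= theta) (confidence stays above a threshold throughout) and F G(conf >= theta) (confidence eventually stabilizes above it). The threshold theta, the example trace, and the choice of formulas are all assumptions made here for illustration; the framework's actual STL specifications may differ.

```python
# Illustrative sketch only: the STL formulas and threshold below are
# assumptions, not the paper's actual specifications.

def robustness_globally(signal, theta):
    """Robustness of G(conf >= theta): min_t (conf_t - theta)."""
    return min(s - theta for s in signal)

def robustness_eventually_globally(signal, theta):
    """Robustness of F G(conf >= theta): max_t min_{t' >= t} (conf_{t'} - theta)."""
    return max(
        min(s - theta for s in signal[t:])
        for t in range(len(signal))
    )

if __name__ == "__main__":
    # Hypothetical stepwise confidences from a six-step CoT trace.
    confidences = [0.55, 0.62, 0.58, 0.71, 0.78, 0.83]
    theta = 0.6

    # Negative: the trace dips below theta at some step.
    print(robustness_globally(confidences, theta))
    # Positive: the trace eventually stabilizes above theta.
    print(robustness_eventually_globally(confidences, theta))
```

Because STL robustness is a signed real value rather than a Boolean, one plausible use, consistent with the calibration goal above, is to threshold it: answers whose confidence trajectories satisfy the specification with positive margin are treated as trustworthy, while violating trajectories flag potentially overconfident errors.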