UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models
This paper proposes UCO (Unidirectional Cognitive Optimization), a multi-turn interactive reinforcement learning method for training LLM-based AI tutors. UCO adapts teaching strategies to students' cognitive states by evaluating genuine understanding rather than only correct answers, and by operating within students' Zone of Proximal Development. The method is evaluated on the BigMath and MathTutorBench benchmarks against 11 baseline models.
Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods learn only surface teaching patterns and lack dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely by whether students produce correct outputs, and so cannot distinguish whether students genuinely understand or echo