Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
This paper introduces GuideEval, a benchmark that evaluates LLMs' ability to provide adaptive Socratic tutoring by assessing three pedagogical phases: perceiving learner states (confusion, comprehension, errors), orchestrating appropriate instructional strategies, and eliciting productive reflections. The benchmark is grounded in authentic K-12 educational dialogues and specifically measures whether LLMs can dynamically adjust their guidance based on student cognitive states rather than just generate generic responses.
The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their ability to generate Socratic questions, it often overlooks a critical aspect: adaptively guiding learners in accordance with their cognitive states. This study moves beyond question generation to emphasize instructional guidance capability. We ask: can LLMs emulate expert tutors who dynamically adjust their strategies in response to learners' cognitive states?
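To make the three-phase framing concrete, the sketch below shows one hypothetical way an evaluation record over the Perception, Orchestration, and Elicitation phases could be structured. The paper does not prescribe an implementation, so the class names, fields, and the numeric scoring scale are illustrative assumptions only.

```python
# Hypothetical sketch of a GuideEval-style evaluation record; all field names
# and the scoring scale are assumptions for illustration, not the paper's API.
from dataclasses import dataclass
from statistics import mean


@dataclass
class DialogueTurn:
    student_utterance: str   # learner's message from a K-12 tutoring dialogue
    tutor_response: str      # candidate LLM reply under evaluation
    learner_state: str       # annotated state: "confusion" | "comprehension" | "error"


@dataclass
class PhaseScores:
    perception: float      # did the model recognise the learner's state?
    orchestration: float   # did it choose a fitting instructional strategy?
    elicitation: float     # did it prompt productive reflection?

    def overall(self) -> float:
        # Simple unweighted mean; the actual benchmark may aggregate differently.
        return mean([self.perception, self.orchestration, self.elicitation])


def score_turn(turn: DialogueTurn) -> PhaseScores:
    """Placeholder rater: a real setup would attach human or LLM judges
    grounded in the annotated K-12 dialogues described above."""
    raise NotImplementedError("plug in a human or LLM-based rater here")


if __name__ == "__main__":
    scores = PhaseScores(perception=4.0, orchestration=3.5, elicitation=4.5)
    print(f"overall guidance score: {scores.overall():.2f}")
```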