Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
This paper introduces GuideEval, a benchmark grounded in authentic educational dialogues that evaluates LLMs' capacity for adaptive Socratic tutoring along three phases: perceiving learner states, orchestrating instructional strategies, and eliciting appropriate reflections. The benchmark specifically measures whether LLMs can dynamically adjust pedagogical guidance in response to learners' cognitive states (confusion, comprehension, errors) rather than merely generating questions.
The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their ability to generate Socratic questions, it often overlooks a critical aspect: adaptively guiding learners in accordance with their cognitive states. This study moves beyond question generation to emphasize instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' cognitive states?
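To make the three-phase rubric concrete, below is a minimal, hypothetical scoring sketch in Python. The `Turn` fields, the boolean per-phase judgments, and the uniform averaging are illustrative assumptions for exposition, not GuideEval's actual evaluation protocol.

```python
from dataclasses import dataclass
from typing import List

# The three learner cognitive states named in the abstract.
LEARNER_STATES = ("confusion", "comprehension", "error")

@dataclass
class Turn:
    """One tutoring exchange, with hypothetical per-phase judgments
    (e.g. from human raters or an LLM judge)."""
    learner_state: str   # annotated cognitive state at this turn
    perceived: bool      # phase 1: did the tutor recognize the state?
    adapted: bool        # phase 2: did it choose a strategy suited to the state?
    elicited: bool       # phase 3: did it prompt reflection rather than just answer?

    def __post_init__(self) -> None:
        assert self.learner_state in LEARNER_STATES

def guidance_score(dialogue: List[Turn]) -> float:
    """Average the three phase judgments over all turns (illustrative metric)."""
    if not dialogue:
        return 0.0
    per_turn = [(t.perceived + t.adapted + t.elicited) / 3.0 for t in dialogue]
    return sum(per_turn) / len(per_turn)

# Example: a tutor that notices learner states and sometimes adapts,
# but never elicits reflection.
dialogue = [
    Turn("confusion", perceived=True, adapted=True, elicited=False),
    Turn("error", perceived=True, adapted=False, elicited=False),
]
print(f"guidance score: {guidance_score(dialogue):.2f}")  # -> 0.50
```

Under these assumptions, a model that only generates questions without tracking learner states would score high on neither the perception nor the adaptation judgment, which is the distinction the benchmark is designed to surface.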