Education Benchmarks and Evals Mapping

Personalised Adaptive Learning

Systems that adapt content and difficulty to individual learners.


Teacher Support Tools

Tools that assist teachers — lesson planning, content generation, grading, analytics.


Concerns

Cross-cutting risk themes identified across the research: what could go wrong when AI is used in education, and what we know about it.

Concern

Cognitive Offloading & Over-reliance

When AI does the thinking for learners — reducing effort, bypassing productive struggle, and creating dependency.

Concern

Productive Struggle & Scaffolding

The balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.

Concern

Metacognition & Self-regulation

Whether AI tools help or hinder learners’ ability to monitor their own understanding and self-regulate.

Concern

Critical Thinking & Higher-order Skills

Impact of AI on higher-order cognitive skills — analysis, evaluation, synthesis, and creative problem-solving.

Concern

Equity & Access

Risks of AI widening existing education gaps — digital divide, language bias, cost barriers, and disparate impact.

10/10 25 cited 2024 paper

All Benchmarks

6,529


Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education

This paper evaluates GPT-4's ability to automatically grade short answer questions in K-12 education (ages 5-16) across Science and History using a novel dataset from the Carousel quizzing platform, finding performance (Kappa 0.70) close to human-level (0.75). The study demonstrates that LLMs can reliably perform formative assessment grading tasks across multiple subjects and grade levels with minimal prompt engineering.

Teacher Support Tools LLM evaluation K-12 education computer-science
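The Kappa figures quoted for this paper (0.70 for GPT-4, 0.75 for humans) are chance-corrected agreement scores. As a minimal sketch, assuming the statistic is Cohen's kappa between two graders, it could be computed as below. The function name and the toy marks are hypothetical and are not taken from the paper or the Carousel dataset.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two lists of categorical marks."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same mark by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters marked independently at their own base rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical marks (1 = correct, 0 = incorrect) for eight short answers.
human_marks = [1, 1, 0, 1, 0, 0, 1, 0]
model_marks = [1, 1, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(human_marks, model_marks), 2))
```

Raw agreement in this toy example is 75%, but kappa is lower because half of that agreement would be expected by chance with balanced labels.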

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

10/10 7 cited 2024 paper

E-EVAL is a comprehensive evaluation benchmark specifically designed for Chinese K-12 education, consisting of 4,351 multiple-choice questions across primary, middle, and high school levels covering nine subjects (Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, Geography) to assess LLM capabilities in the Chinese K-12 education domain.

LLM evaluation K-12 education computer-science

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

AI Tutors Teacher Support Tools LLM evaluation K-12 education computer-science

EduEval is a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education, comprising 24 task types with over 11,000 questions organized across six cognitive dimensions (Memorization, Understanding, Application, Reasoning, Creativity, and Ethics) based on Bloom's Taxonomy and Webb's Depth of Knowledge. The benchmark uses authentic educational materials including real exam questions, classroom dialogues, student essays, and expert-designed prompts spanning primary through high school levels.

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

10/10 2026 paper

FoundationalASSIST introduces a 1.7-million interaction K-12 educational dataset with full question text, student responses, and Common Core alignment, specifically designed to evaluate whether LLMs can perform knowledge tracing (predicting student performance) and pedagogical grounding (understanding assessment item properties). The paper evaluates four frontier LLMs on these tasks, revealing significant gaps in their ability to predict student performance and understand item discrimination.

AI Tutors Personalised Adaptive Learning LLM evaluation K-12 education computer-science

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

AI Tutors Personalised Adaptive Learning LLM evaluation K-12 education computer-science

EduAdapt introduces a benchmark dataset of nearly 48k grade-labeled QA pairs across grades 1-12 and nine science subjects to evaluate whether LLMs can adapt their responses to different grade levels. The paper evaluates multiple open-source LLMs and finds they struggle to generate developmentally appropriate responses, especially for early-grade students.

FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing

Teacher Support Tools LLM evaluation K-12 education computer-science

FEANEL is a benchmark for evaluating LLMs' ability to provide fine-grained error analysis and pedagogical feedback on K-12 English writing, comprising 1,000 student essays with expert-annotated errors categorized by type, severity, and explanations. The benchmark specifically assesses whether AI systems can identify writing errors and provide educationally meaningful, interpretable feedback to support student learning.

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

10/10 73 cited 2024 paper

This paper presents LearnLM-Tutor, a fine-tuned Gemini model for educational use, and introduces a comprehensive evaluation framework spanning seven diverse benchmarks (quantitative, qualitative, automatic, and human evaluations) grounded in learning science principles to assess pedagogical quality in K-12 AI tutoring systems. The work includes real-world deployment at Arizona State University and systematic evaluation of pedagogical dimensions including Socratic dialogue, adaptive scaffolding, and learning-centered interactions.

AI Tutors benchmark dataset education learning computer-science

MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams

10/10 3 cited 2025 paper

MDK12-Bench is a large-scale benchmark built from 141K real-world K-12 exam questions across six disciplines (Math, Physics, Chemistry, Biology, Geography, Information Science) with 6,225 structured knowledge points, designed to evaluate multimodal large language models on problem-solving capabilities across difficulty levels, temporal shifts, contextual shifts, and knowledge-driven reasoning. The benchmark includes a dynamic evaluation framework and knowledge-point reference-augmented generation (KP-RAG) to assess model generalization and the role of knowledge in problem-solving.

K-12 AI benchmark computer-science

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

AI Tutors teacher knowledge evaluation AI computer-science

EduGuardBench is a dual-component benchmark designed to evaluate LLMs acting as simulated teachers, measuring both pedagogical fidelity (role-playing accuracy, teaching competence) and adversarial safety (resistance to jailbreaking, handling of academic misconduct requests). The benchmark identifies harmful teaching behaviors (incompetence, indolence, offensiveness) and uses persona-based adversarial prompts to test ethical boundaries specific to educational contexts.

Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

10/10 39 cited 2024 paper

This paper proposes a unified evaluation taxonomy with eight pedagogical dimensions to assess LLM-powered AI tutors' abilities in remediating student mistakes in mathematics, and releases MRBench - a benchmark containing 192 conversations and 1,596 responses from seven tutors with human annotations across all dimensions. The taxonomy evaluates pedagogical interactions including mistake identification, guidance provision, actionability, and tutor tone, directly measuring whether AI tutors demonstrate effective pedagogical abilities rather than simply revealing answers.

AI Tutors tutoring dialogue evaluation computer-science

Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

10/10 1 cited 2025 paper

This paper introduces GuideEval, a benchmark that evaluates LLMs' ability to provide adaptive Socratic tutoring by assessing three pedagogical phases: perceiving learner states (confusion, comprehension, errors), orchestrating appropriate instructional strategies, and eliciting productive reflections. The benchmark is grounded in authentic K-12 educational dialogues and specifically measures whether LLMs can dynamically adjust their guidance based on student cognitive states rather than just generate generic responses.

AI Tutors tutoring dialogue evaluation computer-science

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

10/10 1 cited 2025 paper

TutorBench is a purpose-built benchmark with 1,490 expert-curated samples evaluating LLMs on three core tutoring skills: generating adaptive explanations, providing actionable feedback, and creating effective hints for high-school and AP-level content. The benchmark uses sample-specific rubrics and LLM-judge evaluation to assess 16 frontier models, finding none exceed 56% overall performance and all achieve less than 60% pass rate on criteria related to guiding, diagnosing, and supporting students.

AI Tutors LLM as judge evaluation computer-science

PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations

9/10 23 cited 2024 paper

This paper presents PhysicsAssistant, a multimodal robot combining YOLOv8 object detection with GPT-3.5-turbo to provide real-time interactive assistance to 8th-grade students during physics lab experiments. The system is empirically evaluated through a user study with 10 students, where expert ratings based on Bloom's taxonomy assess the quality of responses compared to GPT-4.

AI Tutors LLM evaluation K-12 education computer-science

FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets

9/10 2 cited 2025 paper

FACET is a teacher-facing LLM-based multi-agent system that generates personalized mathematics worksheets for grade 8 students by modeling learner profiles (cognitive proficiency and intrinsic motivation), adapting content through a teacher agent, and evaluating quality through an automated evaluator agent. The system was evaluated through automated agent-based assessment and exploratory feedback from K-12 in-service teachers on authentic curriculum content.

Teacher Support Tools LLM evaluation K-12 education computer-science

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

9/10 2 cited 2024 paper

This paper introduces the AMMORE dataset of 53,000 middle-school math open-response answers from students in Africa using the Rori WhatsApp AI tutor, and evaluates LLM-based grading approaches (including chain-of-thought prompting) to improve automated scoring accuracy from 98.7% to 99.9%, demonstrating consequential validity through impacts on Bayesian Knowledge Tracing estimates of student mastery.

AI Tutors LLM evaluation K-12 education computer-science
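The AMMORE summary notes that grading accuracy matters because it feeds Bayesian Knowledge Tracing (BKT) estimates of mastery. The sketch below uses the standard BKT update to illustrate how one misgraded answer can shift a mastery estimate. The slip, guess, learn, and prior values are hypothetical placeholders, not parameters from the paper.

```python
def bkt_update(p_mastery, correct, slip=0.1, guess=0.2, learn=0.15):
    """One standard BKT step: Bayesian posterior on mastery, then a learning transition."""
    if correct:
        # A correct answer comes from mastery without a slip, or from a lucky guess.
        numerator = p_mastery * (1 - slip)
        denominator = numerator + (1 - p_mastery) * guess
    else:
        # An incorrect answer comes from a slip, or from non-mastery without a lucky guess.
        numerator = p_mastery * slip
        denominator = numerator + (1 - p_mastery) * (1 - guess)
    posterior = numerator / denominator
    # The student may also learn the skill between practice opportunities.
    return posterior + (1 - posterior) * learn


def trace(graded_responses, prior=0.3):
    """Run BKT over a sequence of graded responses (True = marked correct)."""
    p = prior
    for is_correct in graded_responses:
        p = bkt_update(p, is_correct)
    return p


# Hypothetical response sequence: the third answer is really wrong,
# but an imperfect auto-grader marks it as correct.
true_grades = [True, True, False, True]
auto_grades = [True, True, True, True]
print(trace(true_grades), trace(auto_grades))
```

With these placeholder parameters the misgraded sequence ends with a higher mastery estimate than the correctly graded one, which is the kind of downstream effect the paper's consequential-validity analysis is concerned with.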

Learning to Use AI for Learning: Teaching Responsible Use of AI Chatbot to K-12 Students Through an AI Literacy Module

AI Tutors LLM evaluation K-12 education computer-science

This paper presents an LLM-based instructional module to teach prompting literacy to K-12 students through scenario-based practice with AI chatbots, deployed across 11 secondary education classrooms. The study evaluates an AI auto-grader's capability to assess student prompts, measures changes in students' prompting performance and confidence in using AI for learning, and analyzes the quality of assessment materials.

Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K-12 Education

AI Tutors Personalised Adaptive Learning LLM evaluation K-12 education computer-science

This paper empirically evaluates LLM-based tutoring systems against traditional deep knowledge tracing (DKT) models for learner modelling in K-12 education, demonstrating that LLMs fall short in accurately tracking student knowledge over time even after fine-tuning. The study directly measures prediction accuracy, temporal coherence, and multi-skill mastery estimation using a large-scale K-12 dataset to assess whether LLMs can responsibly support adaptive instruction.

Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Teacher Support Tools educational assessment natural language processing computer-science

This paper evaluates LLM-based and statistical NLP methods for automated scoring of nationwide high school graduation essay exams in Estonia, comparing machine-generated scores against human raters across multiple rubric dimensions including content, argumentation, and language quality. The study demonstrates that automated scoring achieves reliability comparable to human raters while also examining bias, prompt injection risks, and providing personalized feedback capabilities.

MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems

9/10 122 cited 2023 paper

MathDial presents a dataset of 3,000 one-to-one math tutoring dialogues where human teachers guide LLM-simulated students through multi-step reasoning problems using scaffolding questions and pedagogical moves, with extensive annotations for training and evaluating AI tutoring systems. The paper demonstrates that current LLMs fail at effective tutoring by revealing solutions too early or providing incorrect feedback, and shows how models finetuned on MathDial improve interactive tutoring performance.

AI Tutors large language model evaluation education computer-science highly-cited

Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes

9/10 63 cited 2023 paper

This paper develops Bridge, a method that translates expert tutors' decision-making processes into a framework for LLMs to remediate elementary math mistakes, using a dataset of 700 real tutoring conversations with 1st-5th grade students. The work evaluates GPT-4 and Llama-2-70b on their ability to provide pedagogically sound responses to student errors when guided by expert decision models.

AI Tutors large language model evaluation education computer-science

AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails

9/10 51 cited 2024 paper

This paper presents MWPTutor, an LLM-based intelligent tutoring system for math word problems that combines structured pedagogical strategies (finite state transducers) with LLM flexibility, and evaluates it against GPT-4 through human evaluation studies. The system implements guardrails to prevent common tutoring pitfalls like answer-leaking while maintaining pedagogical control through predefined teaching strategies.

AI Tutors large language model evaluation education computer-science

MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

9/10 40 cited 2024 paper

MathVC is an LLM-simulated multi-persona virtual classroom that creates AI-powered peer agents to facilitate collaborative mathematical problem-solving for middle school students. The system was evaluated with 14 U.S. middle-schoolers to assess engagement, motivation, and collaborative learning through simulated peer interactions with intentionally injected misconceptions.

AI Tutors large language model evaluation education computer-science

Enhancing Critical Thinking in Education by means of a Socratic Chatbot

9/10 34 cited 2024 paper

This paper presents a Socratic chatbot fine-tuned on open-source LLMs (Llama2 7B/13B) designed to foster critical thinking in students through structured questioning rather than providing direct answers. The system is evaluated through simulated student-chatbot interactions to assess its effectiveness in promoting reflection and critical thinking compared to standard chatbots.

AI Tutors large language model evaluation education computer-science

Beyond Answers: Large Language Model-Powered Tutoring System in Physics Education for Deep Learning and Precise Understanding

9/10 11 cited 2024 paper

This paper presents Physics-STAR, an LLM-powered tutoring system for high school physics education, and evaluates it through a controlled experiment with 12 high school sophomores against traditional lectures and generic LLM tutoring. The system provides step-by-step guidance, reflective learning prompts, and personalized scaffolding to improve conceptual understanding and problem-solving skills in physics.

AI Tutors large language model evaluation education computer-science physics

Mentigo: An Intelligent Agent for Mentoring Students in the Creative Problem Solving Process

9/10 11 cited 2024 paper

Mentigo is an AI mentor agent system designed to guide middle school students through creative problem solving (CPS) processes, providing scaffolding, personalized feedback, and Socratic questioning based on real classroom mentor-student interactions. The system was evaluated through comparative experiments with 12 students and reviewed by expert teachers, demonstrating improvements in student engagement and creative outcomes.

AI Tutors large language model evaluation education computer-science

Unlocking Scientific Concepts: How Effective Are LLM-Generated Analogies for Student Understanding and Classroom Practice?

9/10 10 cited 2025 paper

This paper evaluates the effectiveness of LLM-generated analogies for teaching scientific concepts (biology and physics) through controlled in-class tests with high school students and classroom field studies with teachers. The study measures student understanding, learning outcomes, potential over-reliance, and teacher satisfaction with LLM-generated analogies, culminating in the development of a practical system for teachers to generate and refine teaching analogies.

AI Tutors Teacher Support Tools large language model evaluation education computer-science

Children's Expectations, Engagement, and Evaluation of an LLM-enabled Spherical Visualization Platform in the Classroom

AI Tutors primary school AI evaluation computer-science

This paper presents a classroom study evaluating an LLM-augmented spherical visualization platform used with Swedish primary school children (ages 9-10) to explore Earth-related datasets through spoken natural language queries and coordinated visual-verbal responses. The study examines children's expectations, engagement patterns, and evaluations of the system in a formal educational context.

Partnering with AI: A Pedagogical Feedback System for LLM Integration into Programming Education

9/10 2 cited 2025 paper

This paper develops and evaluates a pedagogical framework for LLM-driven feedback generation in secondary school Python programming education, aligning automated feedback with established pedagogical principles like mastery adaptation and progress-based scaffolding. Through mixed-method evaluation with eight secondary school computer science teachers, the study assesses how well LLM-generated feedback adheres to pedagogical standards compared to human teacher feedback.

AI Tutors Teacher Support Tools secondary school AI evaluation computer-science

Employment of Generative Artificial Intelligence in Classroom Environments to Improve Financial Education in Secondary School Students

9/10 2 cited 2024 paper

This quasi-experimental study evaluates the use of ChatGPT to teach financial education to 110 secondary school students, comparing learning outcomes between an experimental group using AI tools and a control group receiving traditional instruction. The study measures student performance across five dimensions of financial literacy including planning, analysis, behavior, expense management, and investment initiative.

AI Tutors secondary school AI evaluation

AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

AI Tutors secondary school AI evaluation computer-science

This paper reports results from an exploratory RCT (N=165) evaluating LearnLM, a pedagogically fine-tuned AI tutor, in UK secondary school mathematics classrooms, where expert tutors supervised all AI-generated messages. Students receiving supervised AI tutoring performed at least as well as those with human tutors alone, with significantly better knowledge transfer to novel problems (66.2% vs 60.7% success rate).

ChatGPT-5 in Secondary Education: A Mixed-Methods Analysis of Student Attitudes, AI Anxiety, and Hallucination-Aware Use

AI Tutors secondary school AI evaluation computer-science

This mixed-methods study directly evaluates ChatGPT-5 use with 109 secondary students (age 16) in Greek high schools, measuring attitudes (SATAI), anxiety (AIAS), and student responses to deliberately triggered hallucinations across multiple task modalities. The research identifies pedagogical affordances and constraints, and documents how students develop 'epistemic safeguarding' strategies after encountering incorrect AI outputs.

An Experience Report on a Pedagogically Controlled, Curriculum-Constrained AI Tutor for SE Education

AI Tutors secondary school AI evaluation computer-science

This paper presents RockStartIT Tutor, a GPT-4-powered AI tutoring system designed for secondary school students learning programming and computational thinking, using a curriculum-constrained knowledge base and pedagogically controlled prompting. The system was pilot-evaluated with 13 students and teachers using the Technology Acceptance Model to assess its effectiveness in providing scaffolded, personalized support.

A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

9/10 73 cited 2024 paper

This paper develops a chain-of-thought prompting approach using GPT-4 to automatically score and generate explanations for middle school Earth Science formative assessment responses, employing human-in-the-loop few-shot and active learning methods. The system evaluates open-ended short-answer responses and provides meaningful feedback to support student learning.

Teacher Support Tools reasoning evaluation LLM computer-science

Educators' Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting

AI Tutors math word problems grade school computer-science

This paper compares LLM-based tutors with human tutors on grade-school math word problems by having educators annotate and compare tutoring dialog snippets on engagement, empathy, scaffolding, and conciseness in a blind text-only setting. The study finds that educators with teaching experience perceive LLM tutors as performing better than human tutors across all four pedagogical dimensions.

Exploring the Potential of ChatGPT as a Substitute Teacher: A Case Study

9/10 22 cited 2024 paper

This case study evaluates ChatGPT as a substitute teacher for 11th-grade chemistry students in the UAE, comparing student engagement and learning outcomes across cognitive domains (knowledge, application, reasoning) between ChatGPT-taught and traditionally-taught control groups using Bloom's taxonomy. The study found that while ChatGPT showed some promise in knowledge recall and reasoning, the control group significantly outperformed the experimental group, with double the percentage of students achieving good/outstanding results.

AI Tutors teacher knowledge evaluation AI

EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

9/10 9 cited 2025 paper

EducationQ is a multi-agent dialogue framework that evaluates LLMs' teaching capabilities through simulated teacher-student interactions, testing 14 models across 1,498 questions spanning 13 disciplines and 10 difficulty levels. The framework incorporates formative assessment principles to measure pedagogical effectiveness including questioning strategies, adaptive feedback, and scaffolding behaviors.

AI Tutors teacher knowledge evaluation AI computer-science

Artificial Intelligence in Enhancing Ecology Essays: A Study in a Brazilian High School

9/10 2024 paper

This qualitative action research study evaluates AI chatbots as auxiliary tools to help Brazilian high school students improve their argumentative ecology essays through iterative feedback and revision cycles. The research directly measures learning outcomes and engagement when students use chatbots for writing improvement in a classroom setting.

AI Tutors teacher knowledge evaluation AI

Enabling Multi-Agent Systems as Learning Designers: Applying Learning Sciences to AI Instructional Design

Teacher Support Tools teacher knowledge evaluation AI computer-science

This paper evaluates three multi-agent LLM systems that embed the Knowledge-Learning-Instruction (KLI) framework to generate secondary Math and Science learning activities, comparing them against a baseline single-agent system through teacher evaluations and LLM-as-a-judge assessments using Quality Matters K-12 standards. The study demonstrates that collaborative multi-agent systems can produce more pedagogically sound, creative, and classroom-ready activities by encoding learning science principles directly into the AI architecture.

ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue

AI Tutors teacher knowledge evaluation AI computer-science

ConvoLearn introduces a dataset of 1,250 constructivist tutor-student dialogues in middle school Earth Science, grounded in knowledge-building theory across six pedagogical dimensions (cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics). The paper demonstrates that fine-tuning LLMs on this dataset shifts model behavior toward constructivist teaching strategies, with the fine-tuned Mistral-7B significantly outperforming base models and Claude Sonnet in teacher evaluations.

Evaluating ChatGPT's Decimal Skills and Feedback Generation in a Digital Learning Game

9/10 48 cited 2023 paper

This paper evaluates ChatGPT's ability to solve decimal math problems, assess correctness of student self-explanations, and generate feedback within the Decimal Point learning game for middle school students, using over 5,000 real student responses. The study assesses ChatGPT's content knowledge in decimals, automated grading accuracy (75%), and pedagogical quality of generated feedback using a structured rubric.

AI Tutors explanation quality evaluation computer-science

Classroom AI: Large Language Models as Grade-Specific Teachers

AI Tutors age-appropriate explanation generation computer-science

This paper presents a framework for finetuning LLMs to generate grade-appropriate educational content across six grade levels (lower elementary through adult), evaluating the pedagogical quality and age-appropriateness of AI-generated explanations using readability metrics and human evaluation with 208 participants.

Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

9/10 24 cited 2025 paper

This paper trains an open-source LLM (Llama 3.1 8B) to generate tutor utterances that maximize student learning outcomes in math tutoring dialogues by optimizing for both student response correctness and pedagogical quality using direct preference optimization. The approach uses a student model to predict correctness and GPT-4o to evaluate adherence to pedagogical principles, directly measuring the impact on student learning through dialogue interactions.

AI Tutors tutoring dialogue evaluation computer-science

Mathematics intelligent tutoring system for learning multiplication and division of fractions based on diagnostic teaching

9/10 17 cited 2023 paper

This paper develops and evaluates a dialogue-based intelligent tutoring system (ITS) for teaching sixth-grade students multiplication and division of fractions, using diagnostic teaching methodology with real-time error identification and adaptive instructional strategies. The system was tested through a quasi-experimental study with 66 sixth graders, showing significant learning gains compared to conventional classroom instruction.

AI Tutors Personalised Adaptive Learning tutoring dialogue evaluation computer-science medicine

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors

9/10 14 cited 2025 paper

This paper presents findings from the BEA 2025 Shared Task that evaluates pedagogical abilities of AI tutors powered by LLMs, specifically focusing on assessing tutor responses for mistake identification, guidance provision, and feedback actionability in mathematics education dialogues. The task established pedagogically-motivated evaluation tracks grounded in learning science principles to measure how effectively AI tutors remediate student mistakes through dialogue.

AI Tutors tutoring dialogue evaluation computer-science

CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

9/10 3 cited 2025 paper

CoDAE is a framework that fine-tunes open-source LLMs for AI tutoring by augmenting real student-tutor dialogues with Chain-of-Thought prompting to improve pedagogical quality. The paper addresses three key limitations (over-compliance, low response adaptivity, and threat vulnerability) and evaluates models on their ability to provide step-by-step guidance without prematurely revealing answers.

AI Tutors tutoring dialogue evaluation computer-science

Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study

Teacher Support Tools tutoring dialogue evaluation computer-science

This paper evaluates the feasibility of using LLMs (GPT-4, Gemini, LearnLM) to automatically identify and assess two specific tutoring moves in real-life math tutoring dialogues: delivering effective praise and responding to student errors. The study analyzes 50 transcripts of college tutors working with middle school students, demonstrating that LLMs can reliably detect tutoring situations (94-98% accuracy for praise detection, 82-88% for error detection) and evaluate adherence to best practices (83-89% and 73-77% alignment with human judgment).

BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses

AI Tutors tutoring dialogue evaluation computer-science

This paper presents an MPNet-based ensemble system for automatically evaluating AI tutor responses in educational dialogues, specifically assessing whether tutors correctly identify and locate student mistakes across two classification tasks at the BEA 2025 Shared Task.

Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems

AI Tutors tutoring dialogue evaluation computer-science

This paper provides a comprehensive review of evaluation practices for LLM-powered Intelligent Tutoring Systems (ITSs), critically analyzing existing benchmarks and proposing three pedagogy-driven research directions for establishing unified, scalable evaluation methodologies grounded in learning science principles. It emphasizes cognitive offloading concerns, citing empirical studies showing students' over-reliance on AI tutors leading to reduced independent problem-solving skills.

Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

AI Tutors Teacher Support Tools tutoring dialogue evaluation computer-science

This paper develops an automated system using GPT-3.5 to evaluate five key tutoring strategies (praise, error reaction, knowledge assessment, managing inequity, responding to negative self-talk) in one-on-one tutoring dialogues, classifying whether each strategy is employed effectively or ineffectively. The system analyzes tutoring transcripts to provide color-coded feedback on pedagogical quality of tutor-student interactions.

Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization