AI Tutors

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

EduEval is a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education, comprising 24 task types with over 11,000 questions organized across six cognitive dimensions (Memorization, Understanding, Application, Reasoning, Creativity, and Ethics) based on Bloom's Taxonomy and Webb's Depth of Knowledge. The benchmark uses authentic educational materials including real exam questions, classroom dialogues, student essays, and expert-designed prompts spanning primary through high school levels.

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

10/10 2026

FoundationalASSIST introduces a 1.7-million interaction K-12 educational dataset with full question text, student responses, and Common Core alignment, specifically designed to evaluate whether LLMs can perform knowledge tracing (predicting student performance) and pedagogical grounding (understanding assessment item properties). The paper evaluates four frontier LLMs on these tasks, revealing significant gaps in their ability to predict student performance and understand item discrimination.

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

This paper presents LearnLM-Tutor, a fine-tuned Gemini model for educational use, and introduces a comprehensive evaluation framework spanning seven diverse benchmarks (quantitative, qualitative, automatic, and human evaluations) grounded in learning science principles to assess pedagogical quality in K-12 AI tutoring systems. The work includes real-world deployment at Arizona State University and systematic evaluation of pedagogical dimensions including Socratic dialogue, adaptive scaffolding, and learning-centered interactions.

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

EduGuardBench is a dual-component benchmark designed to evaluate LLMs acting as simulated teachers, measuring both pedagogical fidelity (role-playing accuracy, teaching competence) and adversarial safety (resistance to jailbreaking, handling of academic misconduct requests). The benchmark identifies harmful teaching behaviors (incompetence, indolence, offensiveness) and uses persona-based adversarial prompts to test ethical boundaries specific to educational contexts.

Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

10/10 39 cited 2024

This paper proposes a unified evaluation taxonomy with eight pedagogical dimensions to assess LLM-powered AI tutors' abilities in remediating student mistakes in mathematics, and releases MRBench - a benchmark containing 192 conversations and 1,596 responses from seven tutors with human annotations across all dimensions. The taxonomy evaluates pedagogical interactions including mistake identification, guidance provision, actionability, and tutor tone, directly measuring whether AI tutors demonstrate effective pedagogical abilities rather than simply revealing answers.

Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

This paper introduces GuideEval, a benchmark that evaluates LLMs' ability to provide adaptive Socratic tutoring by assessing three pedagogical phases: perceiving learner states (confusion, comprehension, errors), orchestrating appropriate instructional strategies, and eliciting productive reflections. The benchmark is grounded in authentic K-12 educational dialogues and specifically measures whether LLMs can dynamically adjust their guidance based on student cognitive states rather than just generate generic responses.

TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models

TutorBench is a purpose-built benchmark with 1,490 expert-curated samples evaluating LLMs on three core tutoring skills: generating adaptive explanations, providing actionable feedback, and creating effective hints for high-school and AP-level content. The benchmark uses sample-specific rubrics and LLM-judge evaluation to assess 16 frontier models, finding none exceed 56% overall performance and all achieve less than 60% pass rate on criteria related to guiding, diagnosing, and supporting students.

PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations

9/10 23 cited 2024

This paper presents PhysicsAssistant, a multimodal robot combining YOLOv8 object detection with GPT-3.5-turbo to provide real-time interactive assistance to 8th-grade students during physics lab experiments. The system is empirically evaluated through a user study with 10 students, where expert ratings based on Bloom's taxonomy assess the quality of responses compared to GPT-4.

Learning to Use AI for Learning: Teaching Responsible Use of AI Chatbot to K-12 Students Through an AI Literacy Module

This paper presents an LLM-based instructional module to teach prompting literacy to K-12 students through scenario-based practice with AI chatbots, deployed across 11 secondary education classrooms. The study evaluates an AI auto-grader's capability to assess student prompts, measures changes in students' prompting performance and confidence in using AI for learning, and analyzes the quality of assessment materials.

Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K-12 Education

This paper empirically evaluates LLM-based tutoring systems against traditional deep knowledge tracing (DKT) models for learner modelling in K-12 education, demonstrating that LLMs fall short in accurately tracking student knowledge over time even after fine-tuning. The study directly measures prediction accuracy, temporal coherence, and multi-skill mastery estimation using a large-scale K-12 dataset to assess whether LLMs can responsibly support adaptive instruction.

MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems

9/10 122 cited 2023

MathDial presents a dataset of 3,000 one-to-one math tutoring dialogues where human teachers guide LLM-simulated students through multi-step reasoning problems using scaffolding questions and pedagogical moves, with extensive annotations for training and evaluating AI tutoring systems. The paper demonstrates that current LLMs fail at effective tutoring by revealing solutions too early or providing incorrect feedback, and shows how models finetuned on MathDial improve interactive tutoring performance.

Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes

9/10 63 cited 2023

This paper develops Bridge, a method that translates expert tutors' decision-making processes into a framework for LLMs to remediate elementary math mistakes, using a dataset of 700 real tutoring conversations with 1st-5th grade students. The work evaluates GPT-4 and Llama-2-70b on their ability to provide pedagogically sound responses to student errors when guided by expert decision models.

AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails

9/10 51 cited 2024

This paper presents MWPTutor, an LLM-based intelligent tutoring system for math word problems that combines structured pedagogical strategies (finite state transducers) with LLM flexibility, and evaluates it against GPT-4 through human evaluation studies. The system implements guardrails to prevent common tutoring pitfalls like answer-leaking while maintaining pedagogical control through predefined teaching strategies.

MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education

9/10 40 cited 2024

MathVC is an LLM-simulated multi-persona virtual classroom that creates AI-powered peer agents to facilitate collaborative mathematical problem-solving for middle school students. The system was evaluated with 14 U.S. middle-schoolers to assess engagement, motivation, and collaborative learning through simulated peer interactions with intentionally injected misconceptions.

Enhancing Critical Thinking in Education by means of a Socratic Chatbot

9/10 34 cited 2024

This paper presents a Socratic chatbot, built by fine-tuning open-source LLMs (Llama2 7B/13B), that is designed to foster critical thinking in students through structured questioning rather than direct answers. The system is evaluated through simulated student-chatbot interactions to assess how well it promotes reflection and critical thinking compared to standard chatbots.

Beyond Answers: Large Language Model-Powered Tutoring System in Physics Education for Deep Learning and Precise Understanding

This paper presents Physics-STAR, an LLM-powered tutoring system for high school physics education, and evaluates it through a controlled experiment with 12 high school sophomores against traditional lectures and generic LLM tutoring. The system provides step-by-step guidance, reflective learning prompts, and personalized scaffolding to improve conceptual understanding and problem-solving skills in physics.

Mentigo: An Intelligent Agent for Mentoring Students in the Creative Problem Solving Process

Mentigo is an AI mentor agent system designed to guide middle school students through creative problem solving (CPS) processes, providing scaffolding, personalized feedback, and Socratic questioning based on real classroom mentor-student interactions. The system was evaluated through comparative experiments with 12 students and reviewed by expert teachers, demonstrating improvements in student engagement and creative outcomes.

Unlocking Scientific Concepts: How Effective Are LLM-Generated Analogies for Student Understanding and Classroom Practice?

9/10 10 cited 2025

This paper evaluates the effectiveness of LLM-generated analogies for teaching scientific concepts (biology and physics) through controlled in-class tests with high school students and classroom field studies with teachers. The study measures student understanding, learning outcomes, potential over-reliance, and teacher satisfaction with LLM-generated analogies, culminating in the development of a practical system for teachers to generate and refine teaching analogies.

Children's Expectations, Engagement, and Evaluation of an LLM-enabled Spherical Visualization Platform in the Classroom

This paper presents a classroom study evaluating an LLM-augmented spherical visualization platform used with Swedish primary school children (ages 9-10) to explore Earth-related datasets through spoken natural language queries and coordinated visual-verbal responses. The study examines children's expectations, engagement patterns, and evaluations of the system in a formal educational context.

Partnering with AI: A Pedagogical Feedback System for LLM Integration into Programming Education

9/10 2 cited 2025

This paper develops and evaluates a pedagogical framework for LLM-driven feedback generation in secondary school Python programming education, aligning automated feedback with established pedagogical principles like mastery adaptation and progress-based scaffolding. Through mixed-method evaluation with eight secondary school computer science teachers, the study assesses how well LLM-generated feedback adheres to pedagogical standards compared to human teacher feedback.

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

9/10 9 cited 2025

EducationQ is a multi-agent dialogue framework that evaluates LLMs' teaching capabilities through simulated teacher-student interactions, testing 14 models across 1,498 questions spanning 13 disciplines and 10 difficulty levels. The framework incorporates formative assessment principles to measure pedagogical effectiveness including questioning strategies, adaptive feedback, and scaffolding behaviors.

ConvoLearn: A Dataset of Constructivist Tutor-Student Dialogue

ConvoLearn introduces a dataset of 1,250 constructivist tutor-student dialogues in middle school Earth Science, grounded in knowledge-building theory across six pedagogical dimensions (cognitive engagement, formative assessment, accountability, cultural responsiveness, metacognition, and power dynamics). The paper demonstrates that fine-tuning LLMs on this dataset shifts model behavior toward constructivist teaching strategies, with the fine-tuned Mistral-7B significantly outperforming base models and Claude Sonnet in teacher evaluations.

Evaluating ChatGPT's Decimal Skills and Feedback Generation in a Digital Learning Game

9/10 48 cited 2023

This paper evaluates ChatGPT's ability to solve decimal math problems, assess correctness of student self-explanations, and generate feedback within the Decimal Point learning game for middle school students, using over 5,000 real student responses. The study assesses ChatGPT's content knowledge in decimals, automated grading accuracy (75%), and pedagogical quality of generated feedback using a structured rubric.

Classroom AI: Large Language Models as Grade-Specific Teachers

This paper presents a framework for finetuning LLMs to generate grade-appropriate educational content across six grade levels (lower elementary through adult), evaluating the pedagogical quality and age-appropriateness of AI-generated explanations using readability metrics and human evaluation with 208 participants.

CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

9/10 3 cited 2025

CoDAE is a framework that fine-tunes open-source LLMs for AI tutoring by augmenting real student-tutor dialogues with Chain-of-Thought prompting to improve pedagogical quality. The paper addresses three key limitations (over-compliance, low response adaptivity, and threat vulnerability) and evaluates models on their ability to provide step-by-step guidance without prematurely revealing answers.

Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems

This paper provides a comprehensive review of evaluation practices for LLM-powered Intelligent Tutoring Systems (ITSs), critically analyzing existing benchmarks and proposing three pedagogy-driven research directions for establishing unified, scalable evaluation methodologies grounded in learning science principles. It emphasizes cognitive offloading concerns, citing empirical studies showing students' over-reliance on AI tutors leading to reduced independent problem-solving skills.

How Real Is AI Tutoring? Comparing Simulated and Human Dialogues in One-on-One Instruction

This paper systematically compares AI-simulated tutoring dialogues with authentic human teacher-student dialogues using IRF coding and Epistemic Network Analysis, finding that human dialogues show superior pedagogical questioning, feedback, and cognitively-guided interaction patterns, while AI dialogues exhibit structural simplification and behavioral convergence. The work directly evaluates the pedagogical quality and instructional effectiveness of LLM-generated one-on-one tutoring interactions.

MetaCLASS: Metacognitive Coaching for Learning with Adaptive Self-regulation Support

MetaCLASS introduces a framework and benchmark for evaluating LLM-based metacognitive tutoring that explicitly targets self-regulated learning processes (planning, monitoring, debugging, evaluation) through 11 interpretable coach moves, including productive restraint. The paper generates 1,015 annotated tutoring conversations and benchmarks nine LLMs on predicting appropriate metacognitive coaching actions, revealing systematic compulsive intervention bias where models fail to recognize when silence is pedagogically optimal.

Improving the Validity of Automatically Generated Feedback via Reinforcement Learning

9/10 20 cited 2024

This paper develops and evaluates a reinforcement learning framework for automatically generating pedagogically valid feedback for incorrect student answers in math education, using GPT-4 to evaluate feedback quality according to a rubric measuring both correctness and alignment with educational goals. The work demonstrates that fine-tuning Llama 2 with direct preference optimization significantly improves feedback quality across correctness and pedagogical alignment dimensions.

Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study

9/10 13 cited 2025

This paper benchmarks whether large language models (LLMs) can replicate the adaptivity of intelligent tutoring systems by systematically removing context components from 75 real-world tutoring scenarios and evaluating how three LLMs respond to student errors, knowledge states, and pedagogical requirements. The study finds that even the best-performing LLM only marginally mimics ITS adaptivity, with concerning patterns like GPT-4o providing overly direct feedback instead of effective Socratic questioning.

Contextualizing Problems to Student Interests at Scale in Intelligent Tutoring System Using Large Language Models

9/10 9 cited 2023

This paper explores using GPT-4 to automatically contextualize math problems in CTAT (an intelligent tutoring system) to align with individual student interests at scale, aiming to increase engagement and learning outcomes. The authors use iterative prompt engineering to personalize problem contexts while preserving difficulty and pedagogical intent.

Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs

This paper introduces the task of generating reasoning trajectories for Socratic debugging conversations where AI tutors guide novice programmers to identify and correct their own programming misconceptions through cognitive dissonance rather than direct correction. The work includes a manually annotated dataset of debugging problems with reasoning trajectories and evaluates LLM-generated Socratic conversations.

An Experience Report on a Pedagogically Controlled, Curriculum-Constrained AI Tutor for SE Education

This paper presents RockStartIT Tutor, a GPT-4-powered AI tutoring system designed for secondary school students learning programming and computational thinking, using a curriculum-constrained knowledge base and pedagogically controlled prompting. The system was pilot-evaluated with 13 students and teachers using the Technology Acceptance Model to assess its effectiveness in providing scaffolded, personalized support.

Artificial Intelligence in Enhancing Ecology Essays: A Study in a Brazilian High School

9/10 2024

This qualitative action research study evaluates AI chatbots as auxiliary tools to help Brazilian high school students improve their argumentative ecology essays through iterative feedback and revision cycles. The research directly measures learning outcomes and engagement when students use chatbots for writing improvement in a classroom setting.

Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues

9/10 24 cited 2025

This paper trains an open-source LLM (Llama 3.1 8B) to generate tutor utterances that maximize student learning outcomes in math tutoring dialogues by optimizing for both student response correctness and pedagogical quality using direct preference optimization. The approach uses a student model to predict correctness and GPT-4o to evaluate adherence to pedagogical principles, directly measuring the impact on student learning through dialogue interactions.

Mathematics intelligent tutoring system for learning multiplication and division of fractions based on diagnostic teaching

9/10 17 cited 2023

This paper develops and evaluates a dialogue-based intelligent tutoring system (ITS) for teaching sixth-grade students multiplication and division of fractions, using diagnostic teaching methodology with real-time error identification and adaptive instructional strategies. The system was tested through a quasi-experimental study with 66 sixth graders, showing significant learning gains compared to conventional classroom instruction.

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors

9/10 14 cited 2025

This paper presents findings from the BEA 2025 Shared Task that evaluates pedagogical abilities of AI tutors powered by LLMs, specifically focusing on assessing tutor responses for mistake identification, guidance provision, and feedback actionability in mathematics education dialogues. The task established pedagogically-motivated evaluation tracks grounded in learning science principles to measure how effectively AI tutors remediate student mistakes through dialogue.

BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses

This paper presents an MPNet-based ensemble system for automatically evaluating AI tutor responses in educational dialogues, specifically assessing whether tutors correctly identify and locate student mistakes across two classification tasks at the BEA 2025 Shared Task.

Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

This paper develops an automated system using GPT-3.5 to evaluate five key tutoring strategies (praise, error reaction, knowledge assessment, managing inequity, responding to negative self-talk) in one-on-one tutoring dialogues, classifying whether each strategy is employed effectively or ineffectively. The system analyzes tutoring transcripts to provide color-coded feedback on pedagogical quality of tutor-student interactions.

Rewarding How Models Think Pedagogically: Integrating Pedagogical Reasoning and Thinking Rewards for LLMs in Education

This paper introduces PedagogicalRL-Thinking, a reinforcement learning framework that trains LLM tutors to generate pedagogically appropriate reasoning traces (not just responses) by rewarding thinking processes grounded in Polya's problem-solving framework, with evaluation on mathematics tutoring dialogues measuring solution correctness, answer leakage prevention, and helpfulness.

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs

EduAdapt introduces a benchmark dataset of nearly 48k grade-labeled QA pairs across grades 1-12 and nine science subjects to evaluate whether LLMs can adapt their responses to different grade levels. The paper evaluates multiple open-source LLMs and finds they struggle to generate developmentally appropriate responses, especially for early-grade students.

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

9/10 2 cited 2024

This paper introduces the AMMORE dataset of 53,000 middle-school math open-response answers from students in Africa using the Rori WhatsApp AI tutor. It evaluates LLM-based grading approaches, including chain-of-thought prompting, that improve automated scoring accuracy from 98.7% to 99.9%, and demonstrates consequential validity through impacts on Bayesian Knowledge Tracing estimates of student mastery.

Employment of Generative Artificial Intelligence in Classroom Environments to Improve Financial Education in Secondary School Students

9/10 2 cited 2024

This quasi-experimental study evaluates the use of ChatGPT to teach financial education to 110 secondary school students, comparing learning outcomes between an experimental group using AI tools and a control group receiving traditional instruction. The study measures student performance across five dimensions of financial literacy including planning, analysis, behavior, expense management, and investment initiative.

AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

This paper reports results from an exploratory RCT (N=165) evaluating LearnLM, a pedagogically fine-tuned AI tutor, in UK secondary school mathematics classrooms, where expert tutors supervised all AI-generated messages. Students receiving supervised AI tutoring performed at least as well as those with human tutors alone, with significantly better knowledge transfer to novel problems (66.2% vs 60.7% success rate).

Exploring the Potential of ChatGPT as a Substitute Teacher: A Case Study

9/10 22 cited 2024

This case study evaluates ChatGPT as a substitute teacher for 11th-grade chemistry students in the UAE, comparing student engagement and learning outcomes across cognitive domains (knowledge, application, reasoning) between ChatGPT-taught and traditionally-taught control groups using Bloom's taxonomy. The study found that while ChatGPT showed some promise in knowledge recall and reasoning, the control group significantly outperformed the experimental group, with double the percentage of students achieving good/outstanding results.

Evolutionary Reinforcement Learning based AI tutor for Socratic Interdisciplinary Instruction