Teacher Support Tools

FACET is a teacher-facing LLM-based multi-agent system that generates individualized mathematics worksheets for grade 8 students by incorporating both cognitive proficiency and intrinsic motivation profiles. The framework uses three specialized agents (learner, teacher, evaluator) and was validated through automated assessments and exploratory feedback from K-12 teachers.

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

FoundationalASSIST is a K-12 educational dataset of 1.7M student interactions with full question text, actual student responses, and Common Core alignment, designed to evaluate LLMs on knowledge tracing (predicting student performance and exact answers) and pedagogical grounding (understanding properties that make assessment items effective). The paper demonstrates that current frontier LLMs struggle significantly on both task families, performing barely above trivial baselines on knowledge tracing and below random chance on item discrimination.

Implementation Considerations for Automated AI Grading of Student Work

This study evaluates an AI-powered grading platform (Colleague AI) through a co-design pilot with 19 K-12 teachers, combining usage logs, surveys, and interviews to examine how teachers and students use AI-generated rubrics and feedback for formative assessment purposes. Findings show teachers value rapid narrative feedback but distrust automated scoring, emphasizing the need for human oversight in AI grading systems.

Enabling Multi-Agent Systems as Learning Designers: Applying Learning Sciences to AI Instructional Design

This paper evaluates three multi-agent LLM systems for generating K-12 math and science learning activities guided by the Knowledge-Learning-Instruction (KLI) framework, comparing their pedagogical quality through teacher evaluations and LLM-as-judge assessments using Quality Matters standards. The collaborative multi-agent system (MAS-CMD) produced activities that teachers found significantly more creative, contextually relevant, and classroom-ready despite only small differences in rubric scores.

Adapting to Educate: Conversational AI's Role in Mathematics Education Across Different Educational Contexts

Research / Other 8/10 11 cited 2024

This paper examines how conversational AI (LLM-based tools) supports K-12 mathematics educators during lesson preparation by analyzing educator-AI dialogues to assess AI's responsiveness to different educational contexts and instructional needs. Through qualitative content analysis, the study evaluates whether AI can accurately adapt responses to varied educational settings and provide actionable pedagogical guidance.

Artificial intelligence (AI) learning tools in K-12 education: A scoping review

"From Unseen Needs to Classroom Solutions": Exploring AI Literacy Challenges & Opportunities with Project-based Learning Toolkit in K-12 Education

This paper explores K-12 teachers' AI literacy levels and how they integrate Project-Based Learning AI toolkits (AI Art Lab, AI Music Studio, AI Chatbot) into diverse subject areas through interviews and co-design sessions, examining pedagogical adaptations, challenges, and ethical concerns.

Large Language Models for Education: A survey and outlook

Research / Other 7/10 55 cited 2025

This survey systematically reviews applications of large language models (LLMs) in K-12 education, covering student assistance, teacher support, adaptive learning, and commercial tools, and identifies datasets, benchmarks, risks, and future research opportunities.

LLM Agents for Education: Advances and Applications

This survey paper provides a comprehensive review of LLM agents in educational settings, organizing recent advances around core educational tasks including teaching assistance (classroom simulation, feedback generation, curriculum design) and student support (adaptive learning, knowledge tracing, error correction). The paper proposes a task-centric taxonomy and discusses datasets, benchmarks, challenges like hallucination/overreliance, and integration issues in deploying educational LLM agents.

Benchmarking the Pedagogical Knowledge of Large Language Models

Benchmark (Published & Automated) 7/10 2 cited 2025

This paper introduces The Pedagogy Benchmark, a dataset of 920 multiple-choice questions from Chilean teacher training exams designed to evaluate large language models' cross-domain pedagogical knowledge (CDPK) and Special Education Needs and Disability (SEND) knowledge. The benchmark tests 97 models on their understanding of teaching strategies, assessment methods, and other pedagogical concepts, with results published on an interactive online leaderboard.

Automated Educational Question Generation at Different Bloom's Skill Levels Using Large Language Models: Strategies and Evaluation

Research / Other 7/10 39 cited 2024

This paper evaluates five large language models' ability to automatically generate educational questions at different Bloom's taxonomy cognitive levels using advanced prompting techniques, with both expert human and LLM-based evaluation of question quality. The study finds that LLMs can generate pedagogically relevant questions across cognitive levels when properly prompted, though performance varies significantly across models and automated evaluation does not match human judgment.

EduPlanner: LLM-Based Multiagent Systems for Customized and Intelligent Instructional Design

Research / Other 7/10 21 cited 2025

EduPlanner is an LLM-based multi-agent system that automatically generates customized instructional designs (lesson plans) for mathematics education by modeling student knowledge using a Skill-Tree structure and iteratively optimizing content through adversarial collaboration between evaluator and optimizer agents. The system is evaluated on GSM8K and Algebra datasets using a five-dimensional assessment framework (CIDDP) measuring Clarity, Integrity, Depth, Practicality, and Pertinence of lesson plans.

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

Benchmark (Published & Automated) 7/10 6 cited 2024

Edu-Values is a Chinese education values benchmark with 1,418 questions evaluating LLMs across seven core educational dimensions including professional philosophy, teachers' ethics, education laws, cultural literacy, educational knowledge/skills, basic competencies, and subject knowledge. The benchmark evaluates 21 LLMs using human feedback-based automatic evaluation and finds Chinese LLMs outperform English ones, with particular weaknesses in professional ethics and philosophy.

EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Benchmark (Published & Automated) 7/10 6 cited 2025

EduBench is a comprehensive benchmark dataset with 18,821 data points covering 9 educational scenarios (assignment grading, study planning, psychological counseling, etc.) across 4,000+ educational contexts, with 12 multi-dimensional evaluation metrics assessing LLM performance in diverse educational roles and tasks. The benchmark includes automated evaluation capabilities using both human annotation and LLM-based assessment, with code and dataset publicly available.

Enhancing mathematics teachers’ pedagogical skills by using ChatGPT

Research / Other 7/10 5 cited 2024

This study proposes a conceptual framework for using ChatGPT to enhance creative teaching skills among secondary mathematics teachers in Saudi Arabia, evaluating teachers' current pedagogical proficiency through a questionnaire administered to 31 teachers and assessing the framework's appropriateness for integration.

Examining the views of primary school teachers on the use of artificial intelligence in education

Research / Other 7/10 2024

This qualitative study examines primary school teachers' views on using artificial intelligence in education through interviews with 16 teachers, exploring perceived advantages (personalized learning, rapid feedback, time-saving) and disadvantages (student laziness, reduced social interaction, ethical concerns) of AI integration in K-12 classrooms.

Co-Designing Interdisciplinary Design Projects with AI

Research / Other 7/10 2025

This paper presents IDPplanner, a GPT-based tool for helping teachers design interdisciplinary design thinking projects for Singapore secondary schools, and evaluates it through a within-subject study with 33 in-service teachers comparing AI-assisted versus manual project planning quality using a six-dimensional rubric.

A LLM-Powered Automatic Grading Framework with Human-Level Guidelines Optimization

Benchmark (Published & Automated) 7/10 11 cited 2024

GradeOpt is a multi-agent LLM framework for automatic short-answer grading (ASAG) that uses self-reflection to optimize grading guidelines, evaluated on datasets measuring teachers' pedagogical knowledge and students' learning progress in mathematics and physical science.

AI in Education: Rationale, Principles, and Instructional Implications

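Peningkatan Kapasitas Guru dalam Literasi Digital melalui Edukasi Keterampilan Digital dan Pemanfaatan Kecerdasan Buatan dalam Pembelajaran (Improving Teachers' Digital Literacy Capacity through Digital Skills Education and the Use of Artificial Intelligence in Learning)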

Research / Other 7/10 3 cited 2024

This paper describes a professional development workshop for Indonesian K-12 teachers on digital literacy and AI use in education, delivered via Zoom in May 2024. The intervention assessed teacher knowledge gains and perceived usefulness (94% found it beneficial) but did not formally evaluate AI systems or student learning outcomes.

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

EduEval is a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education, comprising 24 task types with over 11,000 questions organized across six cognitive dimensions (Memorization, Understanding, Application, Reasoning, Creativity, and Ethics) based on Bloom's Taxonomy and Webb's Depth of Knowledge. The benchmark incorporates authentic educational materials including real exam questions, classroom dialogues, student essays, and expert-designed prompts spanning primary through high school levels.

FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing

FEANEL is a benchmark dataset of 1,000 K-12 student essays (elementary and secondary) with fine-grained error annotations by language education experts, evaluating LLMs' ability to identify error types, assess severity, and provide pedagogical explanations for English writing errors. The benchmark uses a part-of-speech-based error taxonomy and evaluates state-of-the-art LLMs on their error analysis and feedback quality capabilities.

FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets

Benchmark (Published & Automated) 9/10 2 cited 2024

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

This paper introduces AMMORE, a dataset of 53,000 middle-school math open-response question-answer pairs from an African WhatsApp-based tutoring platform, and evaluates LLM-based approaches (including chain-of-thought prompting) for automated grading of challenging student answers. The study demonstrates that LLM grading improves overall accuracy from 98.7% to 99.9% and significantly reduces misclassification of student mastery status in a Bayesian Knowledge Tracing model.

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

Research / Other 9/10 147 cited 2023

Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications

This paper evaluates ChatGPT's ability to generate personalized reading comprehension exercises (passages and multiple-choice questions) for middle school English learners in China, comparing AI-generated materials against human-written textbook exercises through both automatic and manual evaluation by students, teachers, and native speakers.

Unlocking Scientific Concepts: How Effective Are LLM-Generated Analogies for Student Understanding and Classroom Practice?

Research / Other 9/10 10 cited 2025

This paper evaluates LLM-generated analogies for teaching scientific concepts through a two-stage study with high school students and teachers, including controlled in-class tests and classroom field studies in biology and physics. The research finds that LLM-generated analogies can enhance student understanding particularly in biology, but require teacher guidance to prevent over-reliance and overconfidence, leading to the development of a practical system for teachers to generate and refine teaching analogies.

A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Research / Other 9/10 73 cited 2024

This paper develops a human-in-the-loop chain-of-thought prompting approach using GPT-4 to automatically score and generate explanations for middle school Earth Science formative assessment responses. The method combines few-shot learning, active learning, and chain-of-thought reasoning to evaluate open-ended short-answer student responses.

Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

Benchmark (Not Published) 8/10 7 cited 2024

This paper compares three approaches (fine-tuned Mistral/GOAT, SBERT-Canberra, and zero-shot GPT-4) for automatically scoring and providing feedback on middle-school students' open-ended math responses, evaluating both scoring accuracy and feedback quality using teacher judgments against rubrics.

Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Benchmark (Published & Automated) 8/10 6 cited 2025

This paper introduces PBLBench, a benchmark for evaluating multimodal large language models (MLLMs) on assessing STEM project-based learning outcomes, using a new dataset (PBL-STEM) with over 500 projects and expert-validated evaluation criteria derived through the Analytic Hierarchy Process. The benchmark tests 15 leading MLLMs on their ability to handle long-context, cross-modal STEM project evaluation to assist teachers with grading.

Adapting to Educate: Conversational AI's Role in Mathematics Education Across Different Educational Contexts

Research / Other 7/10 16 cited 2024

A Systematic Review on Prompt Engineering in Large Language Models for K-12 STEM Education

This systematic review analyzes 30 empirical studies published between 2021 and 2024 that explore the use of LLMs with prompt engineering techniques in K-12 STEM education, examining prompting strategies, model types, evaluation methods, and limitations. The review identifies how different prompting approaches (zero-shot, few-shot, chain-of-thought) are applied to educational tasks and how effective they are in teaching and learning contexts.

Generating AI Literacy MCQs: A Multi-Agent LLM Approach

Research / Other 7/10 7 cited 2024

This paper presents a multi-agent LLM system that automatically generates multiple-choice questions (MCQs) for K-12 AI literacy assessments, using critique agents to ensure questions align with learning objectives, grade levels, and Bloom's Taxonomy. The system was evaluated by three K-12 AI literacy teaching experts who assessed 40 generated questions using a quality rubric.

EDUMATH: Generating Standards-aligned Educational Math Word Problems

Benchmark (Published & Automated) 7/10 2025

This paper develops EDUMATH, a system for generating math word problems (MWPs) aligned with K-12 math standards and customized to student interests. The work includes a teacher-annotated dataset of over 11,000 generated MWPs evaluated on four criteria (solvability, accuracy, educational appropriateness, standards alignment), trained models for generation, and a classroom study with grade school students showing similar performance but higher preference for customized problems.

This paper presents a benchmark for evaluating educational program repair systems that use LLMs to automatically fix bugs in student code, introducing a novel rouge@k evaluation metric and establishing baseline performance across five recent models on two curated datasets of introductory programming problems.

Exploring Automatic Readability Assessment for Science Documents within a Multilingual Educational Context

Benchmark (Published & Automated) 7/10 6 cited 2024

This paper develops and evaluates automatic readability assessment models for science education texts in Basque, Spanish, and English at the secondary education level (ages 12-16), creating domain-specific corpora and testing both feature-based machine learning and deep learning approaches to help teachers find appropriate materials for multilingual STEM instruction.

How Useful are Educational Questions Generated by Large Language Models?

Research / Other 7/10 43 cited 2023

This paper evaluates the quality and usefulness of educational questions generated by large language models (specifically InstructGPT/GPT-3) using controllable text generation with Bloom's taxonomy and difficulty levels, through human evaluation by teachers across two domains (computer science and biology). Teachers rated the generated questions on quality and usefulness for classroom use.

A Novel Approach to Scalable and Automatic Topic-Controlled Question Generation in Education

Research / Other 7/10 12 cited 2025

This paper introduces a Topic-Controlled Question Generation (T-CQG) method using fine-tuned T5-small models to automatically generate topic-specific educational questions from paragraph contexts, aiming to reduce teacher workload in creating assessment content. The work evaluates generated question quality through offline metrics and human evaluation, focusing on topical alignment and semantic relevance to K-12 educational needs.

Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts

Benchmark (Published & Automated) 7/10 10 cited 2024

This paper introduces prompt-based metrics to evaluate text difficulty and appropriateness for different education levels, improving upon traditional readability measures like Flesch-Kincaid. The authors develop and validate these metrics through user studies and regression experiments to better measure LLMs' ability to adapt educational content to student levels.

Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

Benchmark (Published & Automated) 7/10 7 cited 2024

This paper introduces STANDARDIZE, a retrieval-style in-context learning framework that aligns large language models with expert-defined educational standards (CEFR and Common Core Standards) for automated content generation. The framework extracts knowledge artifacts from standards to guide LLMs in producing text that meets specific grade-level and proficiency requirements for K-12 learners.

Towards AI-assisted Board Game-based Learning: Assessing LLMs in Game Personalisation

Research / Other 7/10 3 cited 2024

This paper evaluates the ability of several large language models (ChatGPT, Copilot, Claude) to personalize board games for K-12 educational settings by comparing AI-generated game modifications against human expert recommendations across different classroom scenarios. The study uses blind expert evaluation to assess whether LLMs can effectively adapt board games to align with learning objectives and student needs.

Connecting Feedback to Choice: Understanding Educator Preferences in GenAI vs. Human-Created Lesson Plans in K-12 Education - A Comparative Analysis

Research / Other 7/10 3 cited 2025

This study conducts a comparative evaluation of K-12 math lesson plans created by human curriculum designers versus those generated by fine-tuned LLaMA-2-13b and customized GPT-4 models, using educator preference ratings across multiple instructional dimensions (warm-up, main task, cool-down, overall quality). Through mixed-methods analysis including quantitative preference data and qualitative thematic coding, the research examines how AI-generated lesson plans compare to human-authored ones across different grade levels.

A Feasibility Study of AI-Generated Resources for K-12 Information Literacy

Research / Other 7/10 2 cited 2024

This study evaluates ChatGPT-generated digital citizenship and information literacy resources for K-12 educators using Common Sense Media's Digital Citizenship framework as a template. The research assesses the suitability and effectiveness of AI-generated instructional materials to support information literacy education in K-12 settings.

Lightweight Prompt Engineering for Cognitive Alignment in Educational AI: A OneClickQuiz Case Study

Research / Other 7/10 2025

This paper investigates how different prompt engineering strategies affect the cognitive alignment of AI-generated quiz questions with Bloom's Taxonomy in OneClickQuiz, a Moodle plugin for automated quiz generation. The study evaluates three prompt variants using automated classification and human review to assess whether generated questions match intended cognitive levels (Knowledge, Application, Analysis).

Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy

Benchmark (Not Published) 7/10 2025

This paper develops and evaluates automated classification systems for exam questions and learning outcomes according to Bloom's Taxonomy cognitive levels, comparing traditional ML models, RNNs, transformers (BERT/RoBERTa), and LLMs on a 600-sentence dataset. The best performance was achieved by augmented SVM (94% accuracy), while LLMs achieved 72-73% accuracy in zero-shot settings.

Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

Benchmark (Published & Automated) 7/10 28 cited 2023

This paper evaluates how well instruction-tuned language models (ChatGPT, BLOOMZ, FlanT5, Llama) align their generated text with specified readability standards (Flesch-Kincaid Grade Level and CEFR) when prompted to write stories or simplify text at particular grade levels. The study tests whether these models can generate educational content that matches the complexity levels teachers need for classroom materials.
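
For readers unfamiliar with the Flesch-Kincaid Grade Level mentioned above, the standard formula can be sketched as follows. This is a minimal illustration and not the paper's evaluation code; the function name `flesch_kincaid_grade` is hypothetical, and the word, sentence, and syllable counts are assumed to be computed elsewhere, since syllable counting itself is heuristic.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Return the Flesch-Kincaid Grade Level for pre-computed text counts.

    Standard formula:
        0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Illustrative counts only (not taken from the paper):
    # 100 words, 5 sentences, 150 syllables.
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

A generated text could then be compared against the grade level requested in the prompt, for example by checking whether the computed grade falls within a chosen tolerance of the target.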

COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content

Research / Other 7/10 2 cited 2025

COGENT is a framework for generating grade-appropriate K-12 science reading materials that align with curriculum standards (NGSS) and control readability through length, vocabulary, and sentence complexity constraints. The paper evaluates generated content using LLM-as-a-judge and human expert analysis across curriculum alignment, comprehensibility, and readability dimensions.

Fine-Tuning IndoBERT for Indonesian Exam Question Classification Based on Bloom's Taxonomy

Benchmark (Published & Automated) 6/10 16 cited 2023

This paper fine-tunes IndoBERT to automatically classify Indonesian elementary school exam questions according to Bloom's Taxonomy cognitive levels, achieving 97% accuracy. The system aims to automate the manual teacher task of categorizing questions by cognitive complexity (LOTS vs HOTS).

AUTOMATIC EVALUATION OF QUALITY OF EXAMS' QUESTIONS WRITTEN IN ARABIC LANGUAGE BASED ON BLOOM’S TAXONOMY: A SURVEY

Research / Other 6/10 1 cited 2023

This survey reviews approaches for automatically evaluating the quality of exam questions written in Arabic using Bloom's Taxonomy classification, focusing on cognitive level assessment of questions in computer science courses at Sudanese universities. The paper discusses existing English-language work on automatic question classification and proposes developing similar NLP-based automated evaluation tools for Arabic educational content.

Benchmark (Not Published) 9/10 25 cited 2024

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability To Mark Short Answer Questions in K-12 Education

This paper evaluates how well Large Language Models (specifically GPT-4) can automatically grade short answer questions across different K-12 subjects (Science and History) and grade levels (ages 5-16) using a novel dataset from Carousel Learning, finding performance close to human-level marking (Kappa 0.70 vs 0.75). The study tests various prompt engineering strategies to assess LLM capabilities for formative assessment tasks.
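
The agreement figures above are reported as Kappa values. As a rough sketch of how such a statistic is computed, the snippet below implements Cohen's kappa for two raters; that the study uses this particular variant is an assumption here, the function name `cohen_kappa` is hypothetical, and the marks in the example are illustrative rather than drawn from the Carousel Learning dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Illustrative marks only: 1 = correct, 0 = incorrect.
    human_marks = [1, 1, 0, 0]
    model_marks = [1, 0, 0, 0]
    print(cohen_kappa(human_marks, model_marks))  # 0.5
```

In a setup like the one described, one such value would be computed between the model and a human marker and another between two human markers, so the two can be compared on the same scale.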

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

Research / Other 9/10 2026

Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

This paper evaluates LLM and statistical NLP methods for automated grading of nationwide school-leaving essay exams in Estonia, comparing their performance against human raters using curriculum-based rubrics. The study examines two full national cohorts of trial exams, assessing reliability, validity, bias, and the viability of human-in-the-loop automated scoring for high-stakes K-12 assessments.

Implementation Considerations for Automated AI Grading of Student Work

Benchmark (Published & Automated) 9/10 2025

From Handwriting to Feedback: Evaluating VLMs and LLMs for AI-Powered Assessment in Indonesian Classrooms

This paper evaluates vision-language models (VLMs) and large language models (LLMs) for automated grading and feedback generation on over 14K handwritten student answers from Grade 4 classrooms in Indonesia, covering Mathematics and English. The study introduces a multimodal pipeline that processes handwritten responses, grades them against rubrics, and generates personalized Indonesian feedback.

CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

Benchmark (Published & Automated) 8/10 2 cited 2025

This paper introduces EGE-Math Solutions Assessment Benchmark, evaluating Vision-Language Models on their ability to grade handwritten mathematical solutions from Russia's high-stakes graduation exam (EGE) by assessing student work against fixed rubrics, identifying errors, and assigning grades like human expert graders. The benchmark includes 122 scanned solutions with official expert grades and tests seven state-of-the-art VLMs across three inference modes.

Are Large Language Models Good Essay Graders?

Research / Other 7/10 13 cited 2024

This paper evaluates Large Language Models (ChatGPT and Llama) for automated essay scoring by comparing their grades to human raters using the ASAP dataset, finding that LLMs generally assign lower scores and correlate poorly with human evaluations, though they reliably detect spelling and grammar errors.

Grade Guard: A Smart System for Short Answer Automated Grading

Benchmark (Not Published) 7/10 2025

Grade Guard is an LLM-based automated short answer grading system that introduces an Indecisiveness Score to reflect uncertainty in predicted grades and uses self-reflection to flag answers requiring human re-evaluation. The framework fine-tunes temperature parameters and introduces Confidence-Aware Loss to improve grading accuracy compared to traditional LLM approaches.

Enhancing Essay Scoring with Adversarial Weights Perturbation and Metric-specific AttentionPooling

Research / Other 7/10 23 cited 2023

This paper proposes using DeBERTa with Adversarial Weights Perturbation (AWP) and Metric-specific AttentionPooling to improve automated essay scoring (AES) for English Language Learners (ELLs), focusing on hyperparameter optimization to enhance scoring accuracy. The study evaluates the model's performance on essay scoring tasks through experimentation with adversarial learning rates and perturbation magnitudes.

The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models

Research / Other 7/10 15 cited 2024

This chapter explores how large language models (LLMs) can be used for automated question generation and answer assessment in education, examining prompting techniques, fine-tuning methods, and human evaluation of generated questions. The work demonstrates LLMs' capabilities in generating contextually relevant questions and providing automated feedback on student responses.

Improving Academic Skills Assessment with NLP and Ensemble Learning

Research / Other 7/10 11 cited 2024

This paper develops an ensemble learning approach combining multiple NLP models (BERT, RoBERTa, BART, DeBERTa, T5) to automatically assess English Language Learners' essays in grades 8-12 across six linguistic dimensions (cohesion, syntax, vocabulary, phraseology, grammar, conventions). The system uses stacking techniques with LightGBM and Ridge regression to provide automated scoring and feedback on student writing.

Marking: Visual Grading with Highlighting Errors and Annotating Missing Bits

Benchmark (Published & Automated) 7/10 6 cited 2024

This paper introduces 'Marking', a novel automated grading task that goes beyond binary scoring by highlighting correct/incorrect/irrelevant segments in student responses and identifying omissions from gold answers. The authors create the BioMarking dataset (curated by biology experts) and train transformer models (BERT, RoBERTa) to perform fine-grained assessment of student responses.

Thematic control and criteria-based assessment of foreign language writing skills using artificial intelligence technologies

Research / Other 7/10 5 cited 2024

This paper investigates the use of AI technologies (chatbots, NLP) for automated thematic control and criteria-based assessment of foreign language writing skills, evaluating student texts on structure, coherence, grammatical/lexical correctness, and style. The research demonstrates how AI can automate routine grading and feedback tasks in foreign language education.

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing

Benchmark (Published & Automated) 9/10 2 cited 2024

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs

Research / Other 9/10 2026

Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications

Research / Other 9/10 147 cited 2023

Partnering with AI: A Pedagogical Feedback System for LLM Integration into Programming Education

Research / Other 9/10 73 cited 2024

This paper develops a pedagogical framework for LLM-driven feedback generation in programming education and evaluates it through a mixed-methods study with eight secondary-school computer science teachers using a web-based Python programming application. The study assesses whether LLM-generated feedback aligned with pedagogical principles (mastery adaptation, progress adaptation) can match or exceed human teacher feedback quality.

A Chain-of-Thought Prompting Approach with LLMs for Evaluating Students' Formative Assessment Responses in Science

Generative AI in K-12 Education: The CyberScholar Initiative

Research / Other 9/10 2025

This paper evaluates CyberScholar, a GenAI writing assistant that provides formative feedback aligned with teacher rubrics in K-12 classrooms (grades 7-11) across multiple subject areas. The study uses observations, surveys, and interviews with 121 students and 4 teachers to assess the tool's impact on writing quality, metacognition, and student-teacher interactions.

A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents

Research / Other 9/10 3 cited 2025

This paper presents a theoretical framework integrating Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based pedagogical agents, and demonstrates it through Inquizzitor, an LLM-based formative assessment agent tested with 104 middle school students in an Earth Science STEM+C curriculum. The study evaluates the agent's scoring accuracy, interaction quality aligned with learning theories, and student perceptions of its value.

Implementation Considerations for Automated AI Grading of Student Work

Benchmark (Published & Automated) 9/10 2025

From Handwriting to Feedback: Evaluating VLMs and LLMs for AI-Powered Assessment in Indonesian Classrooms

Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

Benchmark (Not Published) 8/10 7 cited 2024

Artificial Intelligence in Enhancing Ecology Essays: A Study in a Brazilian High School

Research / Other 8/10 2024

This action research study evaluates the use of AI chatbots to help Brazilian high school students improve their argumentative essays on ecology topics, examining how AI tools can serve as complementary classroom resources for content application and knowledge construction.

Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors

Research / Other 8/10 11 cited 2024

This paper evaluates GPT-3.5 and GPT-4's ability to assess human tutors' responses to students making math errors, specifically measuring whether tutors appropriately guide students to self-correct rather than directly pointing out mistakes. The study analyzes 50 real-life tutoring dialogues using LLMs to automate tutor performance assessment based on pedagogical criteria from the 'Reacting to Errors' lesson.

Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs

Research / Other 8/10 1 cited 2024

This paper develops and evaluates a system that automatically generates personalized feedback for student responses to short-answer reading comprehension questions using Answer Diagnostic Graphs (ADG) that align student responses to the logical structure of reading texts. An empirical study with students compares learning outcomes between those receiving model answers only versus those also receiving system-generated feedback.

Using Generative AI and Multi-Agents to Provide Automatic Feedback

Research / Other 8/10 23 cited 2024

This study develops and evaluates a multi-agent AI system (AutoFeedback) to generate and validate automatic feedback for student constructed responses in science assessments, comparing its performance against single-agent LLMs in reducing over-praise and over-inference errors. The research tests the system on 240 student responses to science assessment items and demonstrates improved feedback quality through the multi-agent approach.

LLMs as Educational Analysts: Transforming Multimodal Data Traces into Actionable Reading Assessment Reports

Research / Other 8/10 3 cited 2025

This paper develops a system using LLMs to transform multimodal data (eye-tracking, learning outcomes, assessment content) into actionable reading assessment reports for K-12 teachers. The system uses unsupervised clustering to identify reading behavior patterns and LLMs to synthesize insights into teacher-friendly reports, which are then evaluated by educators and LLM experts.

Large Language Models for Education: A survey and outlook

Research / Other 7/10 55 cited 2025

LLM Agents for Education: Advances and Applications

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Research / Other 8/10 5 cited 2025

"How can we learn and use AI at the same time?": Participatory Design of GenAI with High School Students

This paper reports on a participatory design workshop with 17 high school students to understand their perspectives on GenAI in education, identifying concerns about bias, misinformation, plagiarism, over-reliance, and false accusations of academic dishonesty, and proposing design guidelines for EdTech developers. Students co-designed GenAI tools and school policies addressing these concerns through structured activities.

"From Unseen Needs to Classroom Solutions": Exploring AI Literacy Challenges & Opportunities with Project-based Learning Toolkit in K-12 Education

Research / Other 8/10 11 cited 2024

Large Language Models for Education: A survey and outlook