5 Ethics and bias

Benchmarks measuring fairness, bias, safety, and ethical behaviour in educational contexts.

Benchmarks: 1,231

EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Benchmark (Published & Automated) 10/10 2025 paper

EduEval is a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education, comprising 24 task types with over 11,000 questions organized across six cognitive dimensions (Memorization, Understanding, Application, Reasoning, Creativity, and Ethics) based on Bloom's Taxonomy and Webb's Depth of Knowledge. The benchmark incorporates authentic educational materials including real exam questions, classroom dialogues, student essays, and expert-designed prompts spanning primary through high school levels.

AI Tutors · Teacher Support Tools · LLM evaluation K-12 education · computer-science

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Benchmark (Published & Automated) 10/10 73 cited 2024 paper

This paper presents LearnLM-Tutor, a fine-tuned Gemini model for education, and introduces a comprehensive evaluation framework consisting of seven diverse benchmarks (quantitative, qualitative, automatic, and human evaluations) grounded in learning science principles to assess pedagogical capabilities of AI tutoring systems. The work includes real-world deployment at Arizona State University's Study Hall and demonstrates that LearnLM-Tutor is consistently preferred by educators and learners over prompt-tuned Gemini across multiple pedagogical dimensions.

AI Tutors · benchmark dataset education learning · computer-science

MinorBench: A hand-built benchmark for content-based risks for children

Benchmark (Published & Automated) 10/10 4 cited 2025 paper

MinorBench is a hand-built, open-source benchmark that evaluates LLMs' ability to refuse unsafe or age-inappropriate content requests from children, using a taxonomy of content-based risks specific to minors derived from real middle-school chatbot deployment. The benchmark tests six prominent LLMs under different system prompts to assess child-safety compliance.

AI Tutors · safety evaluation language model children · computer-science

Learning to Use AI for Learning: Teaching Responsible Use of AI Chatbot to K-12 Students Through an AI Literacy Module

Research / Other 9/10 1 cited 2025 paper

This paper describes the design and classroom evaluation of an LLM-based instructional module that teaches K-12 students prompting literacy through scenario-based practice with AI chatbots, including an AI auto-grader for evaluating student-written prompts. The study deployed the module across 11 secondary education classrooms in two iterations, measuring students' prompting performance improvements, confidence changes, and the effectiveness of different assessment question types.

AI Tutors · LLM evaluation K-12 education · computer-science

ChatGPT-5 in Secondary Education: A Mixed-Methods Analysis of Student Attitudes, AI Anxiety, and Hallucination-Aware Use

Research / Other 9/10 2025 paper

This mixed-methods study examines the attitudes, anxiety, and responses to hallucinated outputs of 109 Greek secondary students (age 16) using ChatGPT-5 in classroom settings across multiple tasks. The research measures cognitive and behavioral attitudes toward AI and AI-related anxiety, and documents students' 'epistemic safeguarding' strategy: after encountering hallucinations, they restrict AI use to domains where they can verify outputs.

AI Tutors · secondary school AI evaluation · computer-science

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Benchmark (Published & Automated) 9/10 2025 paper

EduGuardBench is a dual-component benchmark that evaluates LLMs as simulated teachers by measuring both professional fidelity (using Role-playing Fidelity Score to detect pedagogical harms like incompetence, indolence, and offensiveness) and adversarial safety (using persona-based jailbreak prompts to test vulnerability to harmful requests including academic misconduct). The benchmark tests 14 leading models and is publicly available with automated evaluation code.

AI Tutors · teacher knowledge evaluation AI · computer-science

Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions

Benchmark (Published & Automated) 9/10 2 cited 2025 paper

Safe-Child-LLM introduces a comprehensive benchmark with 200 adversarial prompts and standardized ethical refusal scales to systematically evaluate LLM safety across two developmental stages: children (7-12) and adolescents (13-17). The paper evaluates leading LLMs including ChatGPT, Claude, Gemini, and others, revealing critical safety deficiencies in child-facing scenarios, with both datasets and evaluation code publicly released.

AI Tutors · safety evaluation language model children · computer-science

LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education

Benchmark (Published & Automated) 8/10 28 cited 2024 paper

This paper evaluates bias in LLMs acting as personalized tutors by measuring how models generate and select educational content differently for students with varying demographic characteristics (race, gender, disability, income, etc.). The study introduces two bias metrics (MAB and MDB) and applies them to 9 LLMs using over 17,000 educational explanations across multiple difficulty levels and subjects.

AI Tutors · Personalised Adaptive Learning · large language model evaluation education · computer-science

Is ChatGPT Massively Used by Students Nowadays? A Survey on the Use of Large Language Models such as ChatGPT in Educational Settings

Research / Other 8/10 12 cited 2024 paper

This paper presents survey results from 395 students aged 13-25 in France and Italy investigating how they use LLMs like ChatGPT in educational settings, finding widespread adoption across age groups and disciplines but revealing concerning patterns including gender disparities in usage and lack of critical evaluation among younger users. The study examines usage frequency, purposes, proofreading habits, and potential risks to cognitive skill development.

AI Tutors · large language model evaluation education · computer-science

LLM Safety for Children

Benchmark (Not Published) 8/10 4 cited 2025 paper

This paper develops a comprehensive taxonomy of content harms specific to children interacting with LLMs and creates Child User Models based on child psychology literature to evaluate the safety of six state-of-the-art LLMs through red-teaming. The evaluation reveals significant safety gaps in LLMs for child-specific harm categories that are not captured by standard adult-focused safety evaluations.

AI Tutors · safety evaluation language model children · computer-science

Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs

Benchmark (Published & Automated) 8/10 3 cited 2025 paper

This paper presents an automated multilingual pipeline that generates, solves, and evaluates 628 math problems aligned with the German K-10 curriculum across English, German, and Arabic using three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, Qwen-plus), finding consistent linguistic bias with English solutions rated highest and Arabic lowest. The pipeline includes automated generation, translation, solving, and LLM-judge evaluation to measure quality disparities in educational AI outputs across languages.

AI Tutors · multilingual evaluation education · computer-science

"How can we learn and use AI at the same time?": Participatory Design of GenAI with High School Students

Research / Other 8/10 5 cited 2025 paper

This paper reports on a participatory design workshop with 17 high school students to understand their perspectives on GenAI in education, identifying concerns about bias, misinformation, plagiarism, over-reliance, and false accusations of academic dishonesty, and proposing design guidelines for EdTech developers. Students co-designed GenAI tools and school policies addressing these concerns through structured activities.

AI Tutors · Teacher Support Tools · student over-reliance AI · computer-science

"From Unseen Needs to Classroom Solutions": Exploring AI Literacy Challenges & Opportunities with Project-based Learning Toolkit in K-12 Education

Research / Other 8/10 11 cited 2024 paper

This paper explores K-12 teachers' AI literacy levels and how they integrate Project-Based Learning AI toolkits (AI Art Lab, AI Music Studio, AI Chatbot) into diverse subject areas through interviews and co-design sessions, examining pedagogical adaptations, challenges, and ethical concerns.

Teacher Support Tools · adaptive learning K-12 · computer-science

Large Language Models for Education: A survey and outlook

Research / Other 7/10 255 cited 2024 paper

This is a comprehensive survey paper that systematically reviews the technological applications of large language models (LLMs) in K-12 education across multiple dimensions including student assistance, teacher support, adaptive learning, and commercial tools, while also identifying datasets, benchmarks, risks, and future research opportunities.

AI Tutors · Personalised Adaptive Learning · Teacher Support Tools · benchmark dataset education learning · computer-science · highly-cited

Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?

Benchmark (Published & Automated) 7/10 2 cited 2025 paper

This paper introduces GEDE (Generative Essay Detection in Education), a benchmark dataset with over 900 student essays and 12,500 LLM-generated essays across various contribution levels (from human-written to fully AI-generated), and evaluates five state-of-the-art detection methods. The benchmark assesses detectors' ability to identify different levels of student vs. LLM contribution in educational writing assignments.

Teacher Support Tools · benchmark dataset education learning · computer-science

Safeguarding Privacy: Privacy-Preserving Detection of Mind Wandering and Disengagement Using Federated Learning in Online Education

Research / Other 7/10 2026 paper

This paper proposes a federated learning framework for privacy-preserving detection of mind wandering, behavioral disengagement, and boredom in online learning environments using facial expressions and gaze features from webcam video. The approach is validated across five datasets and benchmarks multiple federated learning algorithms for automated learner state detection.

Personalised Adaptive Learning · benchmark dataset education learning · computer-science
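
To make the federated learning idea in the entry above concrete, here is a minimal sketch of the aggregation step in federated averaging (FedAvg), a common baseline. This is an illustration only: the entry does not say which algorithms the paper benchmarks, and the function name `federated_average` and the flat-list weight representation are assumptions made for this example. The point is that only model parameters, never raw webcam video, leave each learner's device.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine locally trained parameters into one global parameter vector.

    Each client's parameters are weighted by its share of the total number
    of training examples. Hypothetical helper, not the paper's code.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one non-empty weight vector per client size")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of examples must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical learners: one with 1 local example, one with 3.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

In a real system this step would run once per communication round, with the server sending the averaged parameters back to clients for further local training.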

Towards Automatic Boundary Detection for Human-AI Collaborative Hybrid Essay in Education

Benchmark (Published & Automated) 7/10 22 cited 2023 paper

This paper proposes a boundary detection method to identify transition points between human-written and AI-generated content in hybrid essays, constructing a dataset by having ChatGPT fill in incomplete student essays and evaluating detection approaches. The work addresses academic integrity concerns by detecting AI-assisted writing in educational assignments rather than assuming essays are entirely human or AI-generated.

Teacher Support Tools · large language model evaluation education · computer-science

LLMs and Childhood Safety: Identifying Risks and Proposing a Protection Framework for Safe Child-LLM Interaction

Research / Other 7/10 7 cited 2025 paper

This paper conducts a systematic literature review of safety risks when children interact with LLMs, identifying concerns around harmful content, bias, developmental inappropriateness, and adversarial attacks, and proposes a protection framework with measurable evaluation targets for child-safe LLM deployment.

AI Tutors · large language model evaluation education · computer-science

Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models

Benchmark (Published & Automated) 7/10 6 cited 2024 paper

Edu-Values is a Chinese education values benchmark with 1,418 questions evaluating LLMs across seven core educational dimensions including professional philosophy, teachers' ethics, education laws, cultural literacy, educational knowledge/skills, basic competencies, and subject knowledge. The benchmark evaluates 21 LLMs using human feedback-based automatic evaluation and finds Chinese LLMs outperform English ones, with particular weaknesses in professional ethics and philosophy.

Teacher Support Tools · large language model evaluation education · computer-science

Building a Domain-specific Guardrail Model in Production

Benchmark (Not Published) 7/10 6 cited 2024 paper

This paper describes the development and deployment of a domain-specific guardrail model for a K-12 educational platform that ensures content appropriateness, safety, and policy compliance. The authors benchmark their guardrail model against proprietary education-related benchmarks and public safety benchmarks, demonstrating superior performance in filtering inappropriate content for K-12 contexts.

AI Tutors · K-12 AI benchmark · computer-science

Examining the views of primary school teachers on the use of artificial intelligence in education

Research / Other 7/10 2024 paper

This qualitative study examines primary school teachers' views on using artificial intelligence in education through interviews with 16 teachers, exploring perceived advantages (personalized learning, rapid feedback, time-saving) and disadvantages (student laziness, reduced social interaction, ethical concerns) of AI integration in K-12 classrooms.

AI Tutors · Personalised Adaptive Learning · Teacher Support Tools · primary school AI evaluation

Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students

Research / Other 7/10 5 cited 2024 paper

This paper presents AI for Wellbeing, a middle school curriculum where students learn about conversational AI and ethics by building chatbot prototypes for wellbeing support. The study evaluates curriculum effectiveness through a 5-day virtual workshop with 23 middle school students and 5 teachers, measuring knowledge gains and student perspectives on AI.

AI Tutors · teacher knowledge evaluation AI · computer-science

AI in Education: Rationale, Principles, and Instructional Implications

Learning About Algorithm Auditing in Five Steps: Scaffolding How High School Youth Can Systematically and Critically Evaluate Machine Learning Applications

Research / Other 7/10 6 cited 2024 paper

This paper presents a five-step framework for teaching high school students to systematically audit machine learning systems through hands-on activities, demonstrated via a case study where teens audited peer-designed TikTok filters to evaluate their limitations and biases. The study focuses on scaffolding critical evaluation skills and algorithmic accountability rather than evaluating AI tutoring systems or education-specific AI tools.

K-12 education evaluation AI · computer-science

Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

Research / Other 7/10 2025 paper

This paper presents a multi-agent LLM system with five role-based agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) that automatically scores learner reflections using rubrics and generates formative feedback comments while checking for bias and promoting metacognition. The system was evaluated in a 12-session AI literacy program with adult learners, achieving expert-level agreement in scoring and producing feedback rated as helpful and empathetic.

Teacher Support Tools · formative assessment AI · computer-science

A systematic literature review of empirical research on ChatGPT in education

Research / Other 7/10 75 cited 2024 paper

This systematic literature review synthesizes 14 empirical studies examining how ChatGPT has been utilized in educational settings by both students and teachers, analyzing its applications, benefits, and drawbacks in learning contexts. The review identifies uses including virtual tutoring, writing assistance, personalized learning support, and teacher productivity tools, while noting concerns about overreliance affecting innovation and collaborative learning.

AI Tutors · Teacher Support Tools · student homework feedback

KidLM: Advancing Language Models for Children – Early Insights and Future Directions

Benchmark (Published & Automated) 7/10 11 cited 2024 paper

This paper introduces KidLM, a language model specifically designed for children through a novel data collection pipeline and training objective (Stratified Masking). The model is evaluated on its ability to understand lower grade-level text, avoid stereotypes, and capture children's unique preferences using both automated metrics and human evaluation.

AI Tutors · safety evaluation language model children · computer-science

Examining Science Education in ChatGPT: An Exploratory Study of Generative Artificial Intelligence

Research / Other 7/10 880 cited 2023 paper

This exploratory self-study examines how ChatGPT answers science education questions, explores potential pedagogical applications for science educators, and reflects on its use as a research tool, identifying both capabilities (generating units, rubrics, quizzes) and risks (presenting information as authoritative without evidence, potential bias).

AI Tutors · Teacher Support Tools · critical thinking AI education evaluation · highly-cited

A Critical Examination of the Role of ChatGPT in Learning Research: A Thing Ethnographic Study

Research / Other 7/10 1 cited 2024 paper

This paper conducts a SWOT analysis of ChatGPT's role in learning research, examining its strengths (natural language processing, accessibility), weaknesses (lack of deep understanding, objectivity issues), opportunities (intelligent research assistant, writing support), and threats (risks to critical thinking, information security, educational equity). The study uses an ethnographic approach treating ChatGPT as a research participant to assess its impact on learning research practices.

AI Tutors · Teacher Support Tools · critical thinking AI education evaluation

Potentials of ChatGPT in Computer Programming: Insights from Programming Instructors

Research / Other 7/10 34 cited 2024 paper

This qualitative study examines programming instructors' perceptions of ChatGPT's potential benefits and drawbacks for computer programming education through interviews with 12 university faculty members, identifying advantages like personalized learning and code debugging alongside concerns about accuracy, over-reliance, and ethical issues.

Teacher Support Tools · student over-reliance AI · computer-science

Bidirectional Human-AI Alignment in Education for Trustworthy Learning Environments

An XAI Social Media Platform for Teaching K-12 Students AI-Driven Profiling, Clustering, and Engagement-Based Recommending

Research / Other 7/10 4 cited 2024 paper

This paper presents an explainable AI (XAI) educational tool designed for K-12 students (grades 4-9) to teach them about data-driven mechanisms in social media platforms, including data collection, user profiling, engagement metrics, and recommendation algorithms. The tool uses an Instagram-like interface with real-time visualizations and was tested with 209 children in 12 two-hour sessions, using learning analytics to track how students navigated and understood these AI-driven processes.

AI Tutors · student agency AI learning · computer-science

Integration of AI in STEM Education, Addressing Ethical Challenges in K-12 Settings

"How Can I Code A.I. Responsibly?": The Effect of Computational Action on K-12 Students Learning and Creating Socially Responsible A.I

Research / Other 7/10 7 cited 2023 paper

This paper presents a human-subject research study evaluating a computational action curriculum that teaches 101 K-12 students (ages 9-18) to evaluate and create socially responsible AI through structured reflection and an impact matrix tool. The study measured students' perceptions of AI ethics and their ability to analyze positive and negative impacts of AI technologies using pre-post questionnaires and open-ended responses.

code education evaluation programming K-12 · computer-science

Exploring User Perspectives on ChatGPT: Applications, Perceptions, and Implications for AI-Integrated Education

Research / Other 7/10 67 cited 2023 paper

This qualitative study analyzes social media discourse to explore early adopters' perceptions and experiences with ChatGPT across educational sectors (K-12, higher education, skills training), examining usage patterns, attitudes, and concerns about cognitive offloading, critical thinking erosion, and ethical implications.

AI Tutors · educational dialogue system · computer-science

LLMs to Support K-12 Teachers in Culturally Relevant Pedagogy: An AI Literacy Example

Research / Other 7/10 3 cited 2025 paper

This paper presents CulturAIEd, an LLM-powered tool designed to help K-12 teachers adapt AI literacy curricula to students' cultural contexts using Culturally Relevant Pedagogy (CRP) principles. Through a pilot study with four teachers, the research evaluates how the tool influences teachers' confidence and ability to design culturally responsive AI literacy activities.

Teacher Support Tools · adaptive learning K-12 · computer-science

Towards Building Child-Centered Machine Learning Pipelines: Use Cases from K-12 and Higher-Education

Research / Other 7/10 2023 paper

This paper proposes a framework for adapting machine learning pipelines to be child-centered and presents two case studies: predicting classroom engagement levels from video/biometric data to support teachers, and developing a handwriting recognition system for young learners with special educational needs.

Teacher Support Tools · adaptive learning K-12 · computer-science

Counterfactual Fairness Evaluation of Machine Learning Models on Educational Datasets

Research / Other 6/10 1 cited 2025 paper

This paper evaluates counterfactual fairness (a causal individual-level fairness notion) of machine learning models on educational datasets, examining how sensitive attributes like race and gender causally influence predictions of student outcomes. The study demonstrates counterfactual fairness analysis on benchmark educational datasets to assess whether models produce the same decisions regardless of demographic group membership.

Personalised Adaptive Learning · Teacher Support Tools · benchmark dataset education learning · computer-science
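
The question the entry above poses, whether a model gives the same decision regardless of demographic group membership, can be illustrated with a simple attribute-flip check. This is a rough sketch under stated assumptions, not the paper's method: proper counterfactual fairness relies on a causal model so that features downstream of the sensitive attribute change too, whereas this example only swaps the attribute itself. The names `flip_rate`, `biased_model`, and `blind_model` and the toy student records are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def flip_rate(
    model: Callable[[Record], int],
    records: List[Record],
    attr: str,
    values: List[float],
) -> float:
    """Fraction of records whose prediction changes when `attr` is swapped.

    Naive proxy for counterfactual fairness: it ignores any causal effect
    of `attr` on the other features.
    """
    if not records:
        return 0.0
    changed = 0
    for rec in records:
        baseline = model(rec)
        for value in values:
            if value == rec[attr]:
                continue
            counterfactual = {**rec, attr: value}
            if model(counterfactual) != baseline:
                changed += 1
                break  # count each record at most once
    return changed / len(records)


def biased_model(rec: Record) -> int:
    # Toy pass/fail predictor that (wrongly) gives one group a 5-point bonus.
    return int(rec["score"] + 5 * rec["group"] >= 60)


def blind_model(rec: Record) -> int:
    # Toy predictor that ignores the sensitive attribute entirely.
    return int(rec["score"] >= 60)


students = [
    {"score": 58.0, "group": 0.0},
    {"score": 58.0, "group": 1.0},
    {"score": 70.0, "group": 0.0},
    {"score": 40.0, "group": 1.0},
]

print(flip_rate(biased_model, students, "group", [0.0, 1.0]))  # 0.5
print(flip_rate(blind_model, students, "group", [0.0, 1.0]))   # 0.0
```

A flip rate of zero on this check is necessary but not sufficient: a model can still depend on the sensitive attribute indirectly through correlated features, which is the case a causal analysis is meant to catch.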

The Rise of Artificial Intelligence in Educational Measurement: Opportunities and Ethical Challenges

AGI: Artificial General Intelligence for Education

Investigating generative AI models and detection techniques: impacts of tokenization and dataset size on identification of AI-generated text

Research / Other 6/10 9 cited 2024 paper

This paper investigates methods for detecting AI-generated text in K-12 student writing assessments using classical machine learning and large language models, comparing outputs from ChatGPT, Claude, and Gemini, and examining the effectiveness of paraphrasing tools like GPT-Humanizer and QuillBot in evading detection.

Teacher Support Tools · large language model evaluation education · medicine · computer-science

Primary school students’ perceptions of artificial intelligence – for good or bad

Research / Other 6/10 17 cited 2024 paper

This qualitative case study examines Swedish primary school students' (ages 11-12) cognitive and affective perceptions of AI and their current usage patterns through pre-tests, focus group interviews, and post-lesson evaluations. The study explores students' understanding of AI concepts, emotional responses to AI, and their concerns about rapid AI development, job loss, and privacy.

primary school AI evaluation

LLM-Powered AI Tutors with Personas for d/Deaf and Hard-of-Hearing Online Learners

Research / Other 6/10 3 cited 2024 paper

This paper explores how d/Deaf and Hard-of-Hearing (DHH) learners interact with LLM-powered AI tutors that have different personas representing varying experiences in DHH education, focusing on accessibility preferences and cultural knowledge through a user study with 16 DHH participants. The study examines interaction patterns, transparency needs, and multimodal support requirements (especially sign language) for DHH learners using AI tutoring systems.

AI Tutors · tutoring dialogue evaluation · computer-science

Generative AI and Educational (In)Equity

Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis

Research / Other 6/10 2025 paper

This paper surveys 163 Finnish K-12 upper secondary students about their self-perceived AI competence and concerns regarding AI risks across systemic, institutional, and personal domains, using co-occurrence network analysis to examine relationships between competence levels and risk perceptions. The study finds that lower-competence students emphasize personal risks (reduced creativity, critical thinking) while higher-competence students focus on systemic risks (bias, inaccuracy).

K-12 education evaluation AI · computer-science

Exploring the Issues and Challenges of Online Assessment and Evaluation in the Era of Artificial Intelligence

Research / Other 6/10 3 cited 2024 paper

This qualitative study interviews 20 higher education teachers about challenges they face in assessing student work when AI text generators like ChatGPT are available, finding concerns about academic integrity, difficulty detecting AI-generated assignments, and fears about students' loss of creativity and writing skills.

Teacher Support Tools · formative assessment AI

The Impact of AI on Educational Assessment: A Framework for Constructive Alignment

Research / Other 6/10 2025 paper

This paper develops a theoretical framework based on Constructive Alignment theory and Bloom's taxonomy to guide how educational assessment should be adapted in response to students' use of AI tools like LLMs, proposing that different Bloom levels require different assessment approaches when AI is available.

Teacher Support Tools · formative assessment AI · computer-science

FairMonitor: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models

Benchmark (Published & Automated) 6/10 1 cited 2023 paper

FairMonitor is a four-stage automated framework for detecting stereotypes and biases in LLM-generated content through open-ended questions, with a case study implementation (Edu-FairMonitor) covering educational scenarios across nine sensitive factors including gender, race, age, learning ability, and socioeconomic status. The framework uses direct inquiry, story-based testing, implicit association testing, and unknown situation testing to evaluate bias in five LLMs.

bias fairness evaluation LLM education · computer-science

SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

Benchmark (Published & Automated) 6/10 2025 paper

SproutBench is a safety evaluation benchmark for LLMs targeting youth, comprising 1,283 adversarial prompts designed to assess age-appropriate responses across early childhood (0-6), middle childhood (7-12), and adolescence (13-18). It evaluates 47 LLMs on dimensions including safety, risk prevention, interactivity, and age appropriateness, focusing on ethical and developmental considerations rather than learning outcomes.

safety evaluation language model children · computer-science

Generative AI in Education: Student Skills and Lecturer Roles

Research / Other 6/10 4 cited 2025 paper

This mixed-methods study identifies 14 essential student competencies for engaging with generative AI in education and six lecturer strategies for integrating GenAI into teaching, based on literature review and survey data from 130 university students in South Asia and Europe. The research focuses on higher education contexts, examining skills like AI literacy, critical thinking, prompt engineering, and ethical AI practices.

Teacher Support Tools · critical thinking AI education evaluation · computer-science