Education Benchmarks and Evals Mapping

We searched Semantic Scholar for benchmarks and evals relevant to AI in education and mapped them across 11 quality components. We used LLMs to classify the 6,529 papers shown below.

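To make the workflow concrete, here is a minimal sketch of this kind of search-and-classify pipeline. It assumes the public Semantic Scholar Graph API and replaces the LLM classification step with a simple keyword stand-in; the query string, component labels, and helper names are illustrative, not the ones used for this mapping.

```python
# Illustrative sketch only, not the project's actual pipeline: pull candidate papers
# from the Semantic Scholar Graph API, then tag each one against quality components.
import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

# Assumed component labels for illustration; the real mapping uses 11 components.
COMPONENTS = ["cognitive offloading", "scaffolding", "metacognition",
              "critical thinking", "equity"]


def search_papers(query: str, limit: int = 100, offset: int = 0) -> list[dict]:
    """Fetch one page of Semantic Scholar search results with title and abstract."""
    resp = requests.get(
        SEARCH_URL,
        params={"query": query, "fields": "title,abstract",
                "limit": limit, "offset": offset},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def classify_paper(paper: dict, components: list[str]) -> list[str]:
    """Stand-in classifier: in a real pipeline an LLM would read the title and
    abstract and return the components the paper addresses; here we keyword-match."""
    text = f"{paper.get('title') or ''} {paper.get('abstract') or ''}".lower()
    return [c for c in components if c in text]


if __name__ == "__main__":
    papers = search_papers("AI education benchmark evaluation", limit=20)
    for p in papers:
        print(p["title"], "->", classify_paper(p, COMPONENTS))
```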

Concerns

Cross-cutting risk themes identified across the research: what could go wrong when AI is used in education, and what the evidence tells us about each risk.

Cognitive Offloading & Over-reliance

When AI does the thinking for learners — reducing effort, bypassing productive struggle, and creating dependency.

Productive Struggle & Scaffolding

The balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.

Metacognition & Self-regulation

Whether AI tools help or hinder learners’ ability to monitor their own understanding and self-regulate.

Critical Thinking & Higher-order Skills

Impact of AI on higher-order cognitive skills — analysis, evaluation, synthesis, and creative problem-solving.

Equity & Access

Risks of AI widening existing education gaps — digital divide, language bias, cost barriers, and disparate impact.

All Benchmarks
