Education Benchmarks and Evals Mapping
We searched Semantic Scholar for benchmarks and evals relevant to AI in education and mapped them across 11 quality components. We used LLMs to classify 6,529 papers; the results are shown below by component.
| Area | ID | Category | Benchmarks | Landscape Summary |
|---|---|---|---|---|
| General reasoning | 1 | General reasoning | 2,159 | Read |
| Pedagogy | 2.1 | Pedagogical knowledge | 866 | Read |
| | 2.2 | Pedagogy of generated outputs | 1,126 | Read |
| | 2.3 | Pedagogical interactions | 1,865 | Read |
| Educational content | 3.1 | Content knowledge | 1,493 | Read |
| | 3.2 | Content alignment | 919 | Read |
| Assessment | 4.1 | Scoring and grading | 809 | Read |
| | 4.2 | Feedback with reasoning | 910 | Read |
| Ethics and bias | 5 | Ethics and bias | 1,239 | Read |
| Digitisation / accessibility | 6.1 | Multimodal capabilities | 548 | Read |
| | 6.2 | Multilingual capabilities | 298 | Read |
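The per-category counts add up to more than the 6,529 classified papers. A likely reading, which is our inference and not something the page states, is that a single paper can be assigned to more than one component. The short Python sketch below simply redoes that arithmetic from the table; the variable names are illustrative.

```python
# Counts copied from the table, keyed by category ID.
counts = {
    "1": 2159,
    "2.1": 866,
    "2.2": 1126,
    "2.3": 1865,
    "3.1": 1493,
    "3.2": 919,
    "4.1": 809,
    "4.2": 910,
    "5": 1239,
    "6.1": 548,
    "6.2": 298,
}

# Number of classified papers stated in the introduction.
TOTAL_PAPERS = 6529

# Sum of all category assignments across the 11 components.
total_assignments = sum(counts.values())

# Average number of components per paper, assuming (not confirmed by the
# source) that every classified paper appears in at least one row.
avg_components_per_paper = total_assignments / TOTAL_PAPERS

print(f"components: {len(counts)}")
print(f"total assignments: {total_assignments}")
print(f"average components per paper: {avg_components_per_paper:.2f}")
```

Under that assumption, each paper maps to a little under two components on average, so the column should be read as overlapping counts and not as a partition of the 6,529 papers.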
Tool Types
Concerns
Cross-cutting risk themes identified across the research: what could go wrong when AI is used in education, and what we know about it.
Cognitive Offloading & Over-reliance
When AI does the thinking for learners — reducing effort, bypassing productive struggle, and creating dependency.
Productive Struggle & Scaffolding
The balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.
Metacognition & Self-regulation
Whether AI tools help or hinder learners’ ability to monitor their own understanding and self-regulate.
Critical Thinking & Higher-order Skills
Impact of AI on higher-order cognitive skills — analysis, evaluation, synthesis, and creative problem-solving.
Equity & Access
Risks of AI widening existing education gaps — digital divide, language bias, cost barriers, and disparate impact.