Methods
How we built this mapping, from paper discovery to classification to synthesis.
Overview
This project maps AI/LLM benchmarks and evaluations to a quality framework for K-12 education, with a focus on low- and middle-income countries (LMICs). The goal is to identify what we can measure about AI tools for education, and where the gaps are.
The pipeline has four stages: discovery, classification, enrichment, and synthesis. Each stage is automated but human-reviewed.
[Pipeline diagram: discovery (Semantic Scholar & HuggingFace) → classification (LLM, Claude) → enrichment (TLDRs, citations) → landscape summaries]
Stage 1: Paper Discovery
We search for papers and datasets from two primary sources:
Semantic Scholar
A free, AI-powered research tool for scientific literature maintained by the Allen Institute for AI. We use the Semantic Scholar API to perform bulk searches across 230M+ academic papers, fetching up to 1,000 results per query.
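As a rough illustration, a single bulk-search call looks something like the sketch below. It assumes Semantic Scholar's public Graph API and the `requests` library; the query string and field list are illustrative rather than the exact ones used in the pipeline.

```python
import requests

# Semantic Scholar Graph API bulk search: returns up to 1,000 records per call.
S2_BULK_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"

def search_semantic_scholar(query: str) -> list[dict]:
    """Fetch one page of bulk-search results for a query."""
    params = {
        "query": query,
        "fields": "title,abstract,url,venue,year,externalIds",
    }
    resp = requests.get(S2_BULK_SEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", [])

papers = search_semantic_scholar("LLM evaluation K-12 education")
```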
HuggingFace
We search HuggingFace for datasets and daily papers related to education benchmarks. This captures evaluation datasets that may not appear in traditional academic search.
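A minimal sketch of the dataset search, assuming the `huggingface_hub` client; the query string and result handling are illustrative, and the daily-papers feed is not shown.

```python
from huggingface_hub import HfApi

api = HfApi()

# Search the Hub for evaluation datasets matching an education query.
for ds in api.list_datasets(search="education benchmark", limit=50):
    # Each result exposes a repo id like "owner/dataset-name".
    print(ds.id, f"https://huggingface.co/datasets/{ds.id}")
```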
We run over 80 targeted search queries designed to cover every part of the quality framework, from broad queries like "LLM evaluation K-12 education" to specific ones like "automated essay scoring evaluation" or "zone of proximal development AI tutoring". Results are deduplicated by URL to build a comprehensive corpus.
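The deduplication step itself is straightforward; a sketch, assuming each result has been normalised to a dict with a `url` field:

```python
def deduplicate_by_url(hits: list[dict]) -> list[dict]:
    """Keep the first record seen for each URL, preserving discovery order."""
    seen: dict[str, dict] = {}
    for hit in hits:
        url = (hit.get("url") or "").rstrip("/").lower()
        if url and url not in seen:
            seen[url] = hit
    return list(seen.values())
```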
Stage 2: Two-Stage Classification
Every paper is classified against 11 quality components (organised into 6 areas) and 3 tool types. Classification uses a two-stage pipeline:
Heuristic Scoring
Keyword matching against the paper's title, description, and tags. Each framework category has a curated keyword list (e.g. "pedagogical knowledge", "scaffolding", "automated essay scoring"). Single words are matched on word boundaries; multi-word phrases are matched as substrings. This produces initial framework and tool-type assignments with confidence scores.
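A sketch of the matching rule, with an illustrative keyword list (the real lists are curated per framework category and tool type):

```python
import re

# Illustrative keywords for one framework category; the pipeline keeps a curated list per category.
KEYWORDS = ["scaffolding", "pedagogical knowledge", "automated essay scoring"]

def keyword_hits(text: str, keywords: list[str]) -> list[str]:
    """Single words match on word boundaries; multi-word phrases match as substrings."""
    text = text.lower()
    hits = []
    for kw in keywords:
        if " " in kw:
            matched = kw in text
        else:
            matched = re.search(rf"\b{re.escape(kw)}\b", text) is not None
        if matched:
            hits.append(kw)
    return hits
```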
LLM Classification (Claude)
Each paper is sent to Anthropic's Claude (claude-haiku-4-5) with the full framework definitions and tool type descriptions. The LLM assigns framework IDs and tool types, provides a relevance score (1-10) indicating how relevant the paper is to K-12 AI education, and generates a brief rationale. LLM results are merged with heuristic results to produce the final classification.
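A minimal sketch of the classification call using the `anthropic` Python SDK; the system prompt and response schema are simplified stand-ins for the full framework definitions.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You classify papers against a K-12 AI education quality framework. "
    "Reply with JSON: {framework_ids, tool_types, relevance (1-10), rationale}."
)

def classify(title: str, abstract: str) -> dict:
    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"}],
    )
    # The response is a list of content blocks; here we expect a single JSON text block.
    return json.loads(message.content[0].text)
```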
Stage 3: Enrichment
After classification, we enrich each paper with additional metadata from Semantic Scholar's batch API (a sketch of the batch call follows the list):
- TLDR summaries – AI-generated one-line paper summaries from Semantic Scholar
- Citation counts – how often the paper has been cited
- PDF URLs – direct links to open-access versions where available
- Publication venues – journal or conference names
- Publication year – for time-based filtering
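A sketch of the batch lookup, assuming the Graph API's paper batch endpoint; the field list mirrors the metadata above, and the maximum batch size is enforced by the API.

```python
import requests

S2_BATCH = "https://api.semanticscholar.org/graph/v1/paper/batch"
FIELDS = "tldr,citationCount,openAccessPdf,venue,year"

def enrich(paper_ids: list[str]) -> list[dict]:
    """Fetch enrichment metadata for a batch of Semantic Scholar paper IDs."""
    resp = requests.post(
        S2_BATCH,
        params={"fields": FIELDS},
        json={"ids": paper_ids},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```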
Stage 4: Landscape Summaries
For each quality component, tool type, and concern theme, we produce an AI-generated landscape summary synthesising findings across all high-relevance papers (those scoring ≥7/10):
- Section extraction – we extract key sections (abstract, introduction, results, discussion, conclusions) from each paper's full text
- Batch synthesis – extracted sections are sent to Claude in batches and synthesised into structured analyses covering an executive summary, key findings, what's measured, what's missing, notable studies, LMIC context, and recommendations
- Multi-batch merging – when there are too many papers for a single LLM context window, multiple batch reports are generated and then merged into a single coherent summary (see the sketch below)
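A structural sketch of the batch-and-merge logic; `synthesise` stands in for the Claude call, and the batch size is illustrative.

```python
def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]

def landscape_summary(sections: list[str], synthesise, batch_size: int = 20) -> str:
    """Synthesise each batch of extracted sections, then merge if there is more than one report."""
    # `synthesise` is a stand-in for a prompt-to-text call to Claude.
    reports = [
        synthesise("Synthesise these extracted paper sections:\n\n" + "\n\n---\n\n".join(batch))
        for batch in chunk(sections, batch_size)
    ]
    if len(reports) == 1:
        return reports[0]
    return synthesise("Merge these batch reports into one coherent summary:\n\n" + "\n\n===\n\n".join(reports))
```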
The Quality Framework
Papers are mapped to 11 quality components across 6 areas. These components define the dimensions on which a high-quality AI tool for K-12 education should be measured:
| Area | ID | Component |
|---|---|---|
| General reasoning | 1 | General reasoning |
| Ethics and bias | 5 | Ethics and bias |
| Pedagogy | 2.1 | Pedagogical knowledge |
| Pedagogy | 2.2 | Pedagogy of generated outputs |
| Pedagogy | 2.3 | Pedagogical interactions |
| Educational content | 3.1 | Content knowledge |
| Educational content | 3.2 | Content alignment |
| Assessment | 4.1 | Scoring and grading |
| Assessment | 4.2 | Feedback with reasoning |
| Digitisation / accessibility | 6.1 | Multimodal capabilities |
| Digitisation / accessibility | 6.2 | Multilingual capabilities |
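Internally, the framework can be represented as a simple lookup shared by the heuristic and LLM stages; one possible encoding (a sketch, not the pipeline's actual schema):

```python
# Component ID -> (area, component name), mirroring the table above.
FRAMEWORK = {
    "1":   ("General reasoning", "General reasoning"),
    "2.1": ("Pedagogy", "Pedagogical knowledge"),
    "2.2": ("Pedagogy", "Pedagogy of generated outputs"),
    "2.3": ("Pedagogy", "Pedagogical interactions"),
    "3.1": ("Educational content", "Content knowledge"),
    "3.2": ("Educational content", "Content alignment"),
    "4.1": ("Assessment", "Scoring and grading"),
    "4.2": ("Assessment", "Feedback with reasoning"),
    "5":   ("Ethics and bias", "Ethics and bias"),
    "6.1": ("Digitisation / accessibility", "Multimodal capabilities"),
    "6.2": ("Digitisation / accessibility", "Multilingual capabilities"),
}
```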
Concern Themes
In addition to the quality framework, we identify 5 cross-cutting concern themes: risks and challenges that span multiple framework categories. Papers are matched to concerns via keyword search over their title, summary, and full text (a sketch follows the theme descriptions below).
Cognitive Offloading & Over-reliance
When AI does the thinking for learners – reducing effort, bypassing productive struggle, and creating dependency.
Productive Struggle & Scaffolding
The balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.
Metacognition & Self-regulation
Whether AI tools help or hinder learners' ability to monitor their own understanding and self-regulate.
Critical Thinking & Higher-order Skills
Impact of AI on higher-order cognitive skills: analysis, evaluation, synthesis, and creative problem-solving.
Equity & Access
Risks of AI widening existing education gaps: digital divide, language bias, cost barriers, and disparate impact.
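As noted above, papers are matched to these themes by keyword search, following the same approach as the heuristic classification stage. A sketch, with illustrative theme keys and keyword lists (the curated lists used by the pipeline are longer):

```python
# Illustrative keyword lists per concern theme; the real lists are curated.
CONCERN_KEYWORDS = {
    "cognitive-offloading": ["cognitive offloading", "over-reliance", "dependency"],
    "equity-access": ["digital divide", "language bias", "low-resource"],
}

def match_concerns(title: str, summary: str, full_text: str) -> list[str]:
    """Return the concern themes whose keywords appear anywhere in the paper's text."""
    blob = " ".join([title, summary, full_text]).lower()
    return [theme for theme, kws in CONCERN_KEYWORDS.items() if any(kw in blob for kw in kws)]
```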
Tools & Technology
| Component | Technology |
|---|---|
| Paper discovery | Semantic Scholar API, HuggingFace API |
| Classification & synthesis | Anthropic Claude (claude-haiku-4-5) |
| Pipeline | Python 3.11+ |
| Website | SvelteKit, Tailwind CSS |
| Search | MiniSearch (client-side full-text search) |
Limitations
- Search coverage – while we query 80+ search terms across multiple sources, some relevant papers may be missed if they don't match our query set or aren't indexed by Semantic Scholar or HuggingFace.
- Classification accuracy – LLM-based classification is not perfect. Some papers may be misclassified or assigned incorrect relevance scores. The heuristic + LLM two-stage approach reduces errors but doesn't eliminate them.
- Synthesis quality – landscape summaries are AI-generated and may contain inaccuracies or miss nuances present in the original papers. They should be treated as starting points for further investigation, not as definitive reviews.
- Temporal bias – the corpus reflects what's available at the time of search. Newly published papers won't appear until the pipeline is re-run.
About This Project
This project is part of the Fab AI initiative focused on quality assurance for AI in education, particularly in low- and middle-income countries.