Methods

How we built this mapping: from paper discovery to classification to synthesis.

Overview

This project maps AI/LLM benchmarks and evaluations to a quality framework for K-12 education, with a focus on low- and middle-income countries (LMICs). The goal is to identify what we can measure about AI tools for education, and where the gaps are.

The pipeline has four stages: discovery, classification, enrichment, and synthesis. Each stage is automated but human-reviewed.

πŸ” 1. Discovery Semantic Scholar
& HuggingFace
↓
🏷️ 2. Classification Heuristic +
LLM (Claude)
↓
πŸ“Š 3. Enrichment S2 metadata,
TLDRs, citations
↓
πŸ“ 4. Synthesis LLM landscape
summaries

Stage 1: Paper Discovery

We search for papers and datasets from two primary sources:

Semantic Scholar

A free, AI-powered research tool for scientific literature maintained by the Allen Institute for AI. We use the Semantic Scholar API to perform bulk searches across 230M+ academic papers, fetching up to 1,000 results per query.
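As a rough illustration, a query against the publicly documented Graph API bulk search endpoint can be issued as in the sketch below. The field list, pagination handling, and result cap are illustrative assumptions, not the pipeline's production code.

```python
import requests

S2_BULK_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"

def s2_bulk_search(query: str, fields: str = "title,abstract,url,year") -> list[dict]:
    """Fetch papers for one search query from the Semantic Scholar bulk search endpoint."""
    papers, token = [], None
    while True:
        params = {"query": query, "fields": fields}
        if token:
            params["token"] = token  # continuation token for the next page of results
        resp = requests.get(S2_BULK_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        papers.extend(payload.get("data", []))
        token = payload.get("token")
        if not token or len(papers) >= 1000:  # cap at ~1,000 results per query
            break
    return papers
```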

🤗 HuggingFace

We search HuggingFace for datasets and daily papers related to education benchmarks. This captures evaluation datasets that may not appear in traditional academic search.

We run over 80 targeted search queries designed to cover every part of the quality framework, from broad queries like "LLM evaluation K-12 education" to specific ones like "automated essay scoring evaluation" or "zone of proximal development AI tutoring". Results are deduplicated by URL to build a comprehensive corpus.
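A minimal sketch of the HuggingFace side of discovery, using the huggingface_hub client and deduplicating hits by URL. The query list, result limit, and record fields are illustrative, not the pipeline's actual schema.

```python
from huggingface_hub import HfApi

def discover_hf_datasets(queries: list[str]) -> dict[str, dict]:
    """Search HuggingFace datasets for each query and deduplicate results by URL."""
    api = HfApi()
    corpus: dict[str, dict] = {}
    for query in queries:
        for ds in api.list_datasets(search=query, limit=100):
            url = f"https://huggingface.co/datasets/{ds.id}"
            if url not in corpus:  # dedupe: the first hit for a URL wins
                corpus[url] = {"title": ds.id, "source": "huggingface", "query": query}
    return corpus

# Example queries drawn from the framework-driven query set
corpus = discover_hf_datasets(["LLM evaluation K-12 education", "automated essay scoring evaluation"])
```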

Stage 2: Two-Stage Classification

Every paper is classified against 11 quality components (organised into 6 areas) and 3 tool types. Classification uses a two-stage pipeline:

1. Heuristic Scoring

Word-boundary keyword matching against the paper's title, description, and tags. Each framework category has a curated keyword list (e.g. "pedagogical knowledge", "scaffolding", "automated essay scoring"). Single words use word-boundary matching; multi-word phrases use substring matching. This produces initial framework and tool-type assignments with confidence scores.
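A simplified sketch of this matcher. The keyword lists shown are illustrative stand-ins for the curated lists, and the real scorer also assigns tool types and confidence scores.

```python
import re

# Illustrative subset of the curated keyword lists (framework IDs from the table below).
FRAMEWORK_KEYWORDS = {
    "2.1": ["pedagogy", "pedagogical knowledge", "scaffolding"],   # Pedagogical knowledge
    "4.1": ["grading", "rubric", "automated essay scoring"],       # Scoring and grading
}

def heuristic_score(text: str) -> dict[str, int]:
    """Count keyword hits per framework ID in a paper's title, description, and tags."""
    text_lower = text.lower()
    scores: dict[str, int] = {}
    for framework_id, keywords in FRAMEWORK_KEYWORDS.items():
        hits = 0
        for kw in keywords:
            if " " in kw:
                # Multi-word phrase: plain substring match
                hits += kw in text_lower
            else:
                # Single word: word-boundary match, so "grading" does not match "upgrading"
                hits += bool(re.search(rf"\b{re.escape(kw)}\b", text_lower))
        if hits:
            scores[framework_id] = hits
    return scores
```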

2. LLM Classification (Claude)

Each paper is sent to Anthropic's Claude (claude-haiku-4-5) with the full framework definitions and tool type descriptions. The LLM assigns framework IDs and tool types, provides a relevance score (1-10) indicating how relevant the paper is to K-12 AI education, and generates a brief rationale. LLM results are merged with heuristic results to produce the final classification.
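A minimal sketch of the classification call using the Anthropic Python SDK. The prompt wording and JSON schema are assumptions for illustration, not the pipeline's actual prompt; the returned fields mirror the description above.

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify_paper(title: str, abstract: str, framework_text: str) -> dict:
    """Ask Claude to classify one paper against the framework (illustrative prompt)."""
    prompt = (
        f"Framework definitions and tool types:\n{framework_text}\n\n"
        f"Paper title: {title}\nAbstract: {abstract}\n\n"
        "Return JSON with keys: framework_ids, tool_types, relevance (1-10), rationale."
    )
    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model replies with raw JSON; the real pipeline validates and merges
    # this result with the heuristic scores.
    return json.loads(message.content[0].text)
```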

Stage 3: Enrichment

After classification, we enrich each paper with additional metadata from Semantic Scholar's batch API (a sketch of the call follows the list):

  • TLDR summaries – AI-generated one-line paper summaries from Semantic Scholar
  • Citation counts – how often the paper has been cited
  • PDF URLs – direct links to open-access versions where available
  • Publication venues – journal or conference names
  • Publication year – for time-based filtering
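A sketch of the enrichment call against the Graph API batch endpoint. The field list mirrors the metadata above; the chunk size and error handling are illustrative assumptions.

```python
import requests

S2_BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch"
FIELDS = "tldr,citationCount,openAccessPdf,venue,year"

def enrich_papers(paper_ids: list[str]) -> list[dict]:
    """Fetch enrichment metadata for a list of Semantic Scholar paper IDs."""
    enriched: list[dict] = []
    for start in range(0, len(paper_ids), 500):  # the batch endpoint limits IDs per request
        resp = requests.post(
            S2_BATCH_URL,
            params={"fields": FIELDS},
            json={"ids": paper_ids[start:start + 500]},
            timeout=60,
        )
        resp.raise_for_status()
        enriched.extend(p for p in resp.json() if p)  # unknown IDs come back as null
    return enriched
```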

Stage 4: Landscape Summaries

For each quality component, tool type, and concern theme, we produce an AI-generated landscape summary synthesising findings across all high-relevance papers (scored ≥ 7/10); a sketch of the batch-and-merge step follows the list:

  1. Section extraction – we extract key sections (abstract, introduction, results, discussion, conclusions) from each paper's full text
  2. Batch synthesis – extracted sections are sent to Claude in batches, which synthesises findings into structured analyses covering: executive summary, key findings, what's measured, what's missing, notable studies, LMIC context, and recommendations
  3. Multi-batch merging – when there are too many papers for a single LLM context window, multiple batch reports are generated and then merged into a single coherent summary
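A condensed sketch of the batch-and-merge step, reusing the Anthropic client pattern from the classification stage. The batch size, prompt wording, and merge prompt are illustrative assumptions, not the production prompts.

```python
from anthropic import Anthropic

client = Anthropic()

def synthesise_landscape(sections: list[str], batch_size: int = 20) -> str:
    """Summarise extracted paper sections in batches, then merge the batch reports."""
    def ask(prompt: str) -> str:
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

    # One report per batch of papers (batch size here is an illustrative guess).
    reports = [
        ask("Synthesise findings from these paper excerpts:\n\n"
            + "\n\n---\n\n".join(sections[i:i + batch_size]))
        for i in range(0, len(sections), batch_size)
    ]
    if len(reports) == 1:
        return reports[0]
    # Merge multiple batch reports into a single coherent landscape summary.
    return ask("Merge these batch reports into one coherent landscape summary:\n\n"
               + "\n\n===\n\n".join(reports))
```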

The Quality Framework

Papers are mapped to 11 quality components across 6 areas. These components define what a high-quality AI tool for K-12 education should be measured against:

Area | ID | Component
General reasoning | 1 | General reasoning
Ethics and bias | 5 | Ethics and bias
Pedagogy | 2.1 | Pedagogical knowledge
Pedagogy | 2.2 | Pedagogy of generated outputs
Pedagogy | 2.3 | Pedagogical interactions
Educational content | 3.1 | Content knowledge
Educational content | 3.2 | Content alignment
Assessment | 4.1 | Scoring and grading
Assessment | 4.2 | Feedback with reasoning
Digitisation / accessibility | 6.1 | Multimodal capabilities
Digitisation / accessibility | 6.2 | Multilingual capabilities

Concern Themes

In addition to the quality framework, we identify 5 cross-cutting concern themes: risks and challenges that span multiple framework categories. Papers are matched to concerns via keyword search over their title, summary, and full text, as sketched after the list below.

  • Cognitive Offloading & Over-reliance – when AI does the thinking for learners, reducing effort, bypassing productive struggle, and creating dependency.
  • Productive Struggle & Scaffolding – the balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.
  • Metacognition & Self-regulation – whether AI tools help or hinder learners' ability to monitor their own understanding and self-regulate.
  • Critical Thinking & Higher-order Skills – impact of AI on higher-order cognitive skills: analysis, evaluation, synthesis, and creative problem-solving.
  • Equity & Access – risks of AI widening existing education gaps: digital divide, language bias, cost barriers, and disparate impact.
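A minimal sketch of the concern matching described above. The keyword lists here are abbreviated, hypothetical stand-ins for the curated lists; the field names on the paper record are also illustrative.

```python
# Illustrative keyword lists; the real lists are curated per concern theme.
CONCERN_KEYWORDS = {
    "Cognitive Offloading & Over-reliance": ["cognitive offloading", "over-reliance", "dependency"],
    "Equity & Access": ["digital divide", "language bias", "cost barriers"],
}

def match_concerns(paper: dict) -> list[str]:
    """Return concern themes whose keywords appear in a paper's title, summary, or full text."""
    text = " ".join(paper.get(field, "") for field in ("title", "summary", "full_text")).lower()
    return [
        concern
        for concern, keywords in CONCERN_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]
```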

Tools & Technology

Component | Technology
Paper discovery | Semantic Scholar API, HuggingFace API
Classification & synthesis | Anthropic Claude (claude-haiku-4-5)
Pipeline | Python 3.11+
Website | SvelteKit, Tailwind CSS
Search | MiniSearch (client-side full-text search)

Limitations

  • Search coverage – while we query 80+ search terms across multiple sources, some relevant papers may be missed if they don't match our query set or aren't indexed by Semantic Scholar or HuggingFace.
  • Classification accuracy – LLM-based classification is not perfect. Some papers may be misclassified or assigned incorrect relevance scores. The heuristic + LLM two-stage approach reduces errors but doesn't eliminate them.
  • Synthesis quality – landscape summaries are AI-generated and may contain inaccuracies or miss nuances present in the original papers. They should be treated as starting points for further investigation, not as definitive reviews.
  • Temporal bias – the corpus reflects what's available at the time of search. Newly published papers won't appear until the pipeline is re-run.

About This Project

This project is part of the Fab AI initiative focused on quality assurance for AI in education, particularly in low- and middle-income countries.