Methods

How we built this mapping: from paper discovery to classification to synthesis.

Overview

This project maps AI/LLM benchmarks and evaluations to a quality framework for K-12 education, with a focus on low- and middle-income countries (LMICs). The goal is to identify what we can measure about AI tools for education, and where the gaps are.

The pipeline has four stages: discovery, classification, enrichment, and synthesis. Each stage is automated but human-reviewed.

πŸ” 1. Discovery Semantic Scholar
& HuggingFace
↓
🏷️ 2. Classification Heuristic +
LLM (Claude)
↓
πŸ“Š 3. Enrichment S2 metadata,
TLDRs, citations
↓
πŸ“ 4. Synthesis LLM landscape
summaries

Stage 1: Paper Discovery

We search for papers and datasets from two primary sources:

Semantic Scholar

A free, AI-powered research tool for scientific literature maintained by the Allen Institute for AI. We use the Semantic Scholar API to perform bulk searches across 230M+ academic papers, fetching up to 1,000 results per query.
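As a rough illustration, a query against the publicly documented Graph API bulk search endpoint can be issued as in the sketch below. The field list, pagination handling, and result cap are illustrative assumptions, not the pipeline's production code.

```python
import requests

S2_BULK_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"

def s2_bulk_search(query: str, fields: str = "title,abstract,url,year") -> list[dict]:
    """Fetch papers for one search query from the Semantic Scholar bulk search endpoint."""
    papers, token = [], None
    while True:
        params = {"query": query, "fields": fields}
        if token:
            params["token"] = token  # continuation token for the next page of results
        resp = requests.get(S2_BULK_URL, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        papers.extend(payload.get("data", []))
        token = payload.get("token")
        if not token or len(papers) >= 1000:  # cap at ~1,000 results per query
            break
    return papers
```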

🤗 HuggingFace

We search HuggingFace for datasets and daily papers related to education benchmarks. This captures evaluation datasets that may not appear in traditional academic search.

We run over 80 targeted search queries designed to cover every part of the quality framework, from broad queries like "LLM evaluation K-12 education" to specific ones like "automated essay scoring evaluation" or "zone of proximal development AI tutoring". Results are deduplicated by URL to build a comprehensive corpus.
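A minimal sketch of the HuggingFace side of discovery, using the huggingface_hub client and deduplicating hits by URL. The query list, result limit, and record fields are illustrative, not the pipeline's actual schema.

```python
from huggingface_hub import HfApi

def discover_hf_datasets(queries: list[str]) -> dict[str, dict]:
    """Search HuggingFace datasets for each query and deduplicate results by URL."""
    api = HfApi()
    corpus: dict[str, dict] = {}
    for query in queries:
        for ds in api.list_datasets(search=query, limit=100):
            url = f"https://huggingface.co/datasets/{ds.id}"
            if url not in corpus:  # dedupe: the first hit for a URL wins
                corpus[url] = {"title": ds.id, "source": "huggingface", "query": query}
    return corpus

# Example queries drawn from the framework-driven query set
corpus = discover_hf_datasets(["LLM evaluation K-12 education", "automated essay scoring evaluation"])
```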

Stage 2: Two-Stage Classification

Every paper is classified against 11 quality components (organised into 6 areas) and 3 tool types. Classification uses a two-stage pipeline:

1. Heuristic Scoring

Word-boundary keyword matching against the paper's title, description, and tags. Each framework category has a curated keyword list (e.g. "pedagogical knowledge", "scaffolding", "automated essay scoring"). Single words use word-boundary matching; multi-word phrases use substring matching. This produces initial framework and tool-type assignments with confidence scores.
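A simplified sketch of this matcher. The keyword lists shown are illustrative stand-ins for the curated lists, and the real scorer also assigns tool types and confidence scores.

```python
import re

# Illustrative subset of the curated keyword lists (framework IDs from the table below).
FRAMEWORK_KEYWORDS = {
    "2.1": ["pedagogy", "pedagogical knowledge", "scaffolding"],   # Pedagogical knowledge
    "4.1": ["grading", "rubric", "automated essay scoring"],       # Scoring and grading
}

def heuristic_score(text: str) -> dict[str, int]:
    """Count keyword hits per framework ID in a paper's title, description, and tags."""
    text_lower = text.lower()
    scores: dict[str, int] = {}
    for framework_id, keywords in FRAMEWORK_KEYWORDS.items():
        hits = 0
        for kw in keywords:
            if " " in kw:
                # Multi-word phrase: plain substring match
                hits += kw in text_lower
            else:
                # Single word: word-boundary match, so "grading" does not match "upgrading"
                hits += bool(re.search(rf"\b{re.escape(kw)}\b", text_lower))
        if hits:
            scores[framework_id] = hits
    return scores
```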

2. LLM Classification (Claude)

Each paper is sent to Anthropic's Claude (claude-haiku-4-5) with the full framework definitions and tool type descriptions. The LLM assigns framework IDs and tool types, provides a relevance score (1-10) indicating how relevant the paper is to K-12 AI education, and generates a brief rationale. LLM results are merged with heuristic results to produce the final classification.
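A minimal sketch of the classification call using the Anthropic Python SDK. The prompt wording and JSON schema are assumptions for illustration, not the pipeline's actual prompt; the returned fields mirror the description above.

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify_paper(title: str, abstract: str, framework_text: str) -> dict:
    """Ask Claude to classify one paper against the framework (illustrative prompt)."""
    prompt = (
        f"Framework definitions and tool types:\n{framework_text}\n\n"
        f"Paper title: {title}\nAbstract: {abstract}\n\n"
        "Return JSON with keys: framework_ids, tool_types, relevance (1-10), rationale."
    )
    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model replies with raw JSON; the real pipeline validates and merges
    # this result with the heuristic scores.
    return json.loads(message.content[0].text)
```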

Stage 3: Enrichment

After classification, we enrich each paper with additional metadata from Semantic Scholar's batch API (a sketch of the call follows the list):

  • TLDR summaries – AI-generated one-line paper summaries from Semantic Scholar
  • Citation counts – how often the paper has been cited
  • PDF URLs – direct links to open-access versions where available
  • Publication venues – journal or conference names
  • Publication year – for time-based filtering
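A sketch of the enrichment call against the Graph API batch endpoint. The field list mirrors the metadata above; the chunk size and error handling are illustrative assumptions.

```python
import requests

S2_BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch"
FIELDS = "tldr,citationCount,openAccessPdf,venue,year"

def enrich_papers(paper_ids: list[str]) -> list[dict]:
    """Fetch enrichment metadata for a list of Semantic Scholar paper IDs."""
    enriched: list[dict] = []
    for start in range(0, len(paper_ids), 500):  # the batch endpoint limits IDs per request
        resp = requests.post(
            S2_BATCH_URL,
            params={"fields": FIELDS},
            json={"ids": paper_ids[start:start + 500]},
            timeout=60,
        )
        resp.raise_for_status()
        enriched.extend(p for p in resp.json() if p)  # unknown IDs come back as null
    return enriched
```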

Stage 4: Landscape Summaries

For each quality component, tool type, and concern theme, we produce an AI-generated landscape summary synthesising findings across all high-relevance papers (scored ≥ 7/10); a sketch of the batch-and-merge step follows the list:

  1. Section extraction – we extract key sections (abstract, introduction, results, discussion, conclusions) from each paper's full text
  2. Batch synthesis – extracted sections are sent to Claude in batches, which synthesises findings into structured analyses covering: executive summary, key findings, what's measured, what's missing, notable studies, LMIC context, and recommendations
  3. Multi-batch merging – when there are too many papers for a single LLM context window, multiple batch reports are generated and then merged into a single coherent summary
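A condensed sketch of the batch-and-merge step, reusing the Anthropic client pattern from the classification stage. The batch size, prompt wording, and merge prompt are illustrative assumptions, not the production prompts.

```python
from anthropic import Anthropic

client = Anthropic()

def synthesise_landscape(sections: list[str], batch_size: int = 20) -> str:
    """Summarise extracted paper sections in batches, then merge the batch reports."""
    def ask(prompt: str) -> str:
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

    # One report per batch of papers (batch size here is an illustrative guess).
    reports = [
        ask("Synthesise findings from these paper excerpts:\n\n"
            + "\n\n---\n\n".join(sections[i:i + batch_size]))
        for i in range(0, len(sections), batch_size)
    ]
    if len(reports) == 1:
        return reports[0]
    # Merge multiple batch reports into a single coherent landscape summary.
    return ask("Merge these batch reports into one coherent landscape summary:\n\n"
               + "\n\n===\n\n".join(reports))
```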

The Quality Framework

Papers are mapped to 11 quality components across 6 areas. These components define what a high-quality AI tool for K-12 education should be measured against:

Area | ID | Component
General reasoning | 1 | General reasoning
Ethics and bias | 5 | Ethics and bias
Pedagogy | 2.1 | Pedagogical knowledge
Pedagogy | 2.2 | Pedagogy of generated outputs
Pedagogy | 2.3 | Pedagogical interactions
Educational content | 3.1 | Content knowledge
Educational content | 3.2 | Content alignment
Assessment | 4.1 | Scoring and grading
Assessment | 4.2 | Feedback with reasoning
Digitisation / accessibility | 6.1 | Multimodal capabilities
Digitisation / accessibility | 6.2 | Multilingual capabilities

Concern Themes

In addition to the quality framework, we identify 5 cross-cutting concern themes: risks and challenges that span multiple framework categories. Papers are matched to concerns via keyword search over their title, summary, and full text, as sketched after the list below.

  • Cognitive Offloading & Over-reliance – when AI does the thinking for learners, reducing effort, bypassing productive struggle, and creating dependency.
  • Productive Struggle & Scaffolding – the balance between helpful AI scaffolding and over-scaffolding that removes the desirable difficulty learners need to grow.
  • Metacognition & Self-regulation – whether AI tools help or hinder learners' ability to monitor their own understanding and self-regulate.
  • Critical Thinking & Higher-order Skills – impact of AI on higher-order cognitive skills: analysis, evaluation, synthesis, and creative problem-solving.
  • Equity & Access – risks of AI widening existing education gaps: digital divide, language bias, cost barriers, and disparate impact.
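A minimal sketch of the concern matching described above. The keyword lists here are abbreviated, hypothetical stand-ins for the curated lists; the field names on the paper record are also illustrative.

```python
# Illustrative keyword lists; the real lists are curated per concern theme.
CONCERN_KEYWORDS = {
    "Cognitive Offloading & Over-reliance": ["cognitive offloading", "over-reliance", "dependency"],
    "Equity & Access": ["digital divide", "language bias", "cost barriers"],
}

def match_concerns(paper: dict) -> list[str]:
    """Return concern themes whose keywords appear in a paper's title, summary, or full text."""
    text = " ".join(paper.get(field, "") for field in ("title", "summary", "full_text")).lower()
    return [
        concern
        for concern, keywords in CONCERN_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]
```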

Tools & Technology

Component | Technology
Paper discovery | Semantic Scholar API, HuggingFace API
Classification & synthesis | Anthropic Claude (claude-haiku-4-5)
Pipeline | Python 3.11+
Website | SvelteKit, Tailwind CSS
Search | MiniSearch (client-side full-text search)

Limitations

  • Search coverage – while we query 80+ search terms across multiple sources, some relevant papers may be missed if they don't match our query set or aren't indexed by Semantic Scholar or HuggingFace.
  • Classification accuracy – LLM-based classification is not perfect. Some papers may be misclassified or assigned incorrect relevance scores. The heuristic + LLM two-stage approach reduces errors but doesn't eliminate them.
  • Synthesis quality – landscape summaries are AI-generated and may contain inaccuracies or miss nuances present in the original papers. They should be treated as starting points for further investigation, not as definitive reviews.
  • Temporal bias – the corpus reflects what's available at the time of search. Newly published papers won't appear until the pipeline is re-run.

About This Project

This project is part of the Fab AI initiative focused on quality assurance for AI in education, particularly in low- and middle-income countries.