Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation
This position paper critiques the overreliance on inter-rater reliability (IRR) metrics like Cohen's kappa for validating human annotations in educational AI systems, particularly for automated assessment tasks such as grading open responses and classifying tutor dialogue moves. It proposes alternative evaluation methods (multi-label schemes, expert-based approaches, close-the-loop validity) that better capture educational validity and learning impact.
Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data for AI-based educational applications, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data and continue to serve as a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or the labeling of open responses.
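To make the critiqued baseline concrete, the following is a minimal sketch of how Cohen's kappa is typically computed when checking agreement between two annotators in such a pipeline. The tutor-move categories and rater labels are hypothetical placeholders, not data from this paper; the sketch assumes scikit-learn's cohen_kappa_score.

```python
# Minimal sketch: computing Cohen's kappa between two annotators.
# Labels below are hypothetical tutor dialogue moves, for illustration only.
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations from two raters over the same ten dialogue moves.
rater_a = ["praise", "hint", "hint", "question", "praise",
           "question", "hint", "praise", "question", "hint"]
rater_b = ["praise", "hint", "question", "question", "praise",
           "question", "hint", "hint", "question", "hint"]

# Cohen's kappa corrects observed agreement p_o for chance agreement p_e:
#   kappa = (p_o - p_e) / (1 - p_e)
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa above a conventional threshold (often 0.6 or 0.7) is then taken as license to treat the labels as "ground truth" for model training, which is precisely the practice this paper questions.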