Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models

Relevance: 7/10 · 12 citations · 2024 paper

This paper explores using large language models (GPT-4 and Llama 2) to automatically generate and evaluate multiple-choice reading comprehension test items in German, and proposes a new evaluation metric, 'text informativity', based on answerability and guessability. The work compares human and LLM-based assessment of the generated items, with the LLMs used in a zero-shot prompting setup.
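To illustrate the idea behind the metric, here is a minimal sketch in Python. The exact formula is an assumption: it treats answerability as the share of items answered correctly when the text is shown, guessability as the share answered correctly without the text, and informativity as their difference; the paper combines these quantities but may define the combination differently.

```python
from typing import Sequence

def text_informativity(
    correct_with_text: Sequence[bool],
    correct_without_text: Sequence[bool],
) -> float:
    """Hypothetical sketch of 'text informativity':
    answerability = share of items answered correctly with the text shown,
    guessability  = share answered correctly without the text,
    informativity = answerability - guessability.
    The paper's exact definition may differ."""
    answerability = sum(correct_with_text) / len(correct_with_text)
    guessability = sum(correct_without_text) / len(correct_without_text)
    return answerability - guessability

# Example: 8/10 items answered correctly with the text, 3/10 without it
# -> informativity = 0.8 - 0.3 = 0.5
```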

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and LLM-based evaluation, including a metric, text informativity, based on answerability and guessability.
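A rough sketch of how zero-shot item generation might look in practice is given below. The prompt wording, model choice, and helper function are illustrative assumptions (using the OpenAI chat completions API), not the paper's actual setup.

```python
from openai import OpenAI

client = OpenAI()

def generate_items(text: str, n_items: int = 3) -> str:
    # German prompt, roughly: "Read the following text and create
    # N multiple-choice comprehension questions with four answer
    # options each. Mark the correct answer."
    prompt = (
        "Lies den folgenden Text und erstelle "
        f"{n_items} Multiple-Choice-Verständnisfragen mit je vier "
        "Antwortoptionen. Markiere die richtige Antwort.\n\n"
        f"Text:\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```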

Tool Types

Teacher Support Tools: tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

large language model, evaluation, education, computer-science