Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models
This paper explores how large language models (GPT-4 and Llama 2) can automatically generate and evaluate German multiple-choice reading comprehension test items, proposing a new evaluation metric called 'text informativity' based on answerability and guessability. The work compares human assessment of the generated items with LLM-based assessment using zero-shot prompting.
Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and LLM-based evaluation, including a new metric, text informativity, based on answerability and guessability.
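As a rough illustration of how such a metric could be operationalized, the sketch below computes informativity as the proportion of items that are answerable with the text but not guessable without it. This is only one plausible reading of the abstract's description; the class and function names (`ItemJudgement`, `text_informativity`) and the exact aggregation are assumptions for this sketch, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class ItemJudgement:
    """Judgements for one multiple-choice item (hypothetical field names):
    - guessable: answered correctly *without* access to the text
    - answerable: answered correctly *with* access to the text
    """
    guessable: bool
    answerable: bool


def text_informativity(items: list[ItemJudgement]) -> float:
    """Proportion of items that genuinely require the text: answerable
    with it, but not guessable without it (one possible formulation)."""
    if not items:
        return 0.0
    informative = sum(1 for it in items if it.answerable and not it.guessable)
    return informative / len(items)


# Example: 3 of 4 items can only be answered correctly with the text.
judgements = [
    ItemJudgement(guessable=False, answerable=True),
    ItemJudgement(guessable=True, answerable=True),
    ItemJudgement(guessable=False, answerable=True),
    ItemJudgement(guessable=False, answerable=True),
]
print(text_informativity(judgements))  # 0.75
```

Under this framing, items that can be guessed without the text contribute nothing to informativity, since a correct answer does not demonstrate comprehension of the text itself.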