Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models
This paper evaluates how well instruction-tuned language models (ChatGPT, BLOOMZ, FlanT5, Llama) align their generated text with specified readability standards (Flesch-Kincaid Grade Level and CEFR) when prompted to write stories or simplify text at particular grade levels. The study tests whether these models can generate educational content that matches the complexity levels teachers need for classroom materials.
Readability metrics and standards such as the Flesch-Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to help teachers and educators assess the complexity of educational materials before using them in the classroom. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate their performance in writing story completions and simplifying narratives—tasks that teachers perform in classroom settings.
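For context, the FKGL score referenced above is a simple linear formula over average sentence length and average syllables per word: FKGL = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A minimal sketch of the computation is below; the regex-based syllable counter is a naive heuristic assumed here for illustration, not the counter used in this study.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups, with a floor of 1.
    # (Assumed for illustration; real tools use dictionaries or better rules.)
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    # Split on terminal punctuation to approximate sentence boundaries.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid Grade Level coefficients.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Short, monosyllabic sentences (e.g., "The cat sat on the mat.") score near or below grade 0, while long sentences with polysyllabic words push the estimated grade level up.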