Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

Relevance: 7/10 | Cited 28 times | 2023 paper

This paper evaluates instruction-tuned language models (ChatGPT, BLOOMZ, FlanT5, Llama) on their ability to generate and simplify text at specified readability levels using standard metrics like Flesch-Kincaid Grade Level and CEFR, tasks commonly performed by teachers when creating educational materials.
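For readers unfamiliar with the metric, the sketch below shows how a Flesch-Kincaid Grade Level score can be computed. The formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59) is the standard published one. The sentence splitter and the vowel-group syllable counter are simplifying assumptions for illustration, and the function names `count_syllables` and `fkgl` are hypothetical. This is not the tooling used in the paper, and scores from a dictionary-based syllable counter may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (heuristic, not dictionary-based)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as "simple".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard coefficients."""
    # Assumption: sentences end at '.', '!' or '?'; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = "Educational institutions necessitate comprehensive evaluation."
    print(round(fkgl(simple), 2), round(fkgl(dense), 2))
```

A lower score indicates text suited to earlier school grades. Very short, simple sentences can score below zero, which is expected behaviour of the formula rather than an error.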

Readability metrics and standards such as Flesch-Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators to properly assess the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate their performances in writing story completions and simplifying narratives, tasks that teachers perf

Tool Types

Teacher Support Tools: tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

grade level text complexity, computer-science