Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT
This paper benchmarks large language models (LLMs), primarily GPT-3.5-turbo, along with GPT-4 and OpenChat-3.5, on a diverse set of Persian language tasks spanning classic NLP, reasoning, and knowledge-based tasks. The study introduces two new Persian benchmarks drawn from elementary school math questions and entrance exams for the 7th and 10th grades, but its focus is on evaluating general LLM capabilities in Persian rather than pedagogical effectiveness or impact on student learning.
This paper explores the efficacy of large language models (LLMs) for Persian. While ChatGPT and subsequent LLMs have shown remarkable performance in English, their effectiveness in lower-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks.