Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT

Relevance: 3/10 · 16 citations · 2024 paper

This paper benchmarks large language models (LLMs), primarily GPT-3.5-turbo, GPT-4, and OpenChat-3.5, on diverse Persian language tasks including classic NLP tasks, reasoning tasks, and knowledge-based tasks. The study introduces two new Persian benchmarks based on elementary school math questions and entrance exams for 7th and 10th grades, but focuses on evaluating general LLM capabilities in Persian rather than their pedagogical effectiveness or impact on student learning.

This paper explores the efficacy of large language models (LLMs) for Persian. While ChatGPT and subsequent LLMs have shown remarkable performance in English, their effectiveness for lower-resource languages remains an open question. We present the first comprehensive benchmarking study of LLMs across diverse Persian language tasks. Our primary focus is on GPT-3.5-turbo, but we also include GPT-4 and OpenChat-3.5 to provide a more holistic evaluation. Our assessment encompasses a diverse set of tasks.

Framework Categories

Tool Types

Tags

math, reasoning, evaluation, grade-school, computer-science