How well can LLMs Grade Essays in Arabic?

Relevance: 7/10 · 7 citations · 2025 paper

This paper evaluates the effectiveness of multiple large language models (ChatGPT, Llama, Aya, Jais, ACEGPT) for automated essay scoring in Arabic using the AR-AES dataset, comparing zero-shot, few-shot, and fine-tuning approaches with various prompt engineering strategies. The study finds that while ACEGPT performs best among LLMs (QWK=0.67), a smaller BERT-based model outperforms all tested LLMs (QWK=0.88).
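QWK (quadratic weighted kappa) is the agreement metric behind these numbers: it measures chance-corrected agreement between human and model scores, penalizing disagreements quadratically by their distance. A minimal sketch of computing it with scikit-learn (the score values below are illustrative, not from the paper):

```python
# Quadratic Weighted Kappa (QWK): chance-corrected agreement between
# two raters, with larger score gaps penalized quadratically.
# The scores below are hypothetical, not taken from the paper.
from sklearn.metrics import cohen_kappa_score

human_scores = [4, 3, 5, 2, 4, 1, 3, 5]   # hypothetical human rater scores
model_scores = [4, 3, 4, 2, 5, 1, 3, 5]   # hypothetical model scores

qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```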

This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, in…
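As a rough illustration of the prompting setups described above, here is a minimal zero-shot sketch that embeds marking guidelines in an English-language prompt around Arabic essay content, in the spirit of the mixed-language strategy. The prompt wording, rubric, and `build_prompt` helper are hypothetical, not taken from the paper:

```python
# Hypothetical zero-shot AES prompt: English instructions and marking
# guidelines wrapped around Arabic essay text (mixed-language prompting).
PROMPT_TEMPLATE = """You are an Arabic writing examiner.
Score the essay below on a scale of 0-{max_score} using these guidelines:
{guidelines}

Essay (Arabic):
{essay}

Respond with the numeric score only."""

def build_prompt(essay: str, guidelines: str, max_score: int = 5) -> str:
    """Fill the template; a few-shot variant would prepend scored example essays."""
    return PROMPT_TEMPLATE.format(
        max_score=max_score, guidelines=guidelines, essay=essay
    )

print(build_prompt(
    essay="نص المقال هنا",  # placeholder Arabic essay text
    guidelines="- Relevance to topic\n- Grammar and spelling\n- Organization",
))
```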

Framework Categories

Tool Types

Teacher Support Tools: Tools that assist teachers with lesson planning, content generation, grading, and analytics.

Tags

automated essay scoring, evaluation, computer-science