How Well Can LLMs Grade Essays in Arabic?
This paper evaluates the effectiveness of multiple large language models (ChatGPT, Llama, Aya, Jais, ACEGPT) for automated essay scoring in Arabic using the AR-AES dataset, comparing zero-shot, few-shot, and fine-tuning approaches with various prompt engineering strategies. The study finds that while ACEGPT performs best among LLMs (QWK=0.67), a smaller BERT-based model outperforms all tested LLMs (QWK=0.88).
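The QWK scores cited above refer to Quadratic Weighted Kappa, the standard agreement metric for automated essay scoring. As a point of reference (not taken from the paper), a minimal pure-Python sketch of QWK for integer scores might look like this:

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two lists of integer scores.

    1.0 = perfect agreement, 0.0 = chance-level agreement,
    negative values = worse than chance.
    """
    n = max_rating - min_rating + 1
    num_items = len(rater_a)

    # Observed confusion matrix O[i][j]: counts of (score_a=i, score_b=j)
    O = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating][b - min_rating] += 1

    # Marginal histograms used to build the expected-by-chance matrix
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty grows with the squared distance between scores
            w = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / num_items
            numerator += w * O[i][j]
            denominator += w * expected

    return 1.0 - numerator / denominator


# Identical scores give kappa = 1.0; maximally opposed scores give -1.0
print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], 0, 2))  # 1.0
print(quadratic_weighted_kappa([0, 2], [2, 0], 0, 2))              # -1.0
```

Equivalently, `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")` computes the same quantity for integer labels.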
This research assesses the effectiveness of state-of-the-art large language models (LLMs), namely ChatGPT, Llama, Aya, Jais, and ACEGPT, on the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores several evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities by including marking guidelines within the prompts. A mixed-language prompting strategy, in