LLM-powered Data Augmentation for Enhanced Crosslingual Performance
This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data. We compare