ERUPD - English to Roman Urdu Parallel Dataset
This paper presents ERUPD, a parallel dataset of 75,146 English-Roman Urdu sentence pairs built from synthetically generated data and real-world messaging data, intended for machine translation and other NLP tasks. The dataset aims to support multilingual education and cross-cultural exchange by bridging linguistic gaps for Roman Urdu speakers.
Bridging linguistic gaps fosters global growth and cultural exchange. This study addresses the challenges of Roman Urdu, a Latin-script adaptation of Urdu widely used in digital communication, by creating a novel parallel dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization, phonetic variability, and code-switching with English complicate language processing. We address these challenges with a hybrid approach that combines synthetic data generated via advanced prompt e