SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset

Relevance: 3/10 | Cited by 5 | 2025 paper

SwitchLingua is a large-scale multilingual code-switching dataset (420K text samples, 80 hours of audio across 12 languages) designed to benchmark Automatic Speech Recognition (ASR) systems on code-switching scenarios. It is accompanied by a novel evaluation metric, SAER, that incorporates semantic information.

Code-switching (CS) is the alternating use of two or more languages within a conversation or utterance, often influenced by social context and speaker identity. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems, which are typically designed for a single language and struggle to handle multilingual inputs. The growing global demand for multilingual applications, including Code-Switching ASR (CSASR), Text-to-Speech (CSTTS), and Cross-Lingual Information Ret

Tags

cross-lingual assessment benchmark, computer-science