Crosslingual Reasoning through Test-Time Scaling
Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages, including low-resource languages, to an extent where they outperform models twice their size.
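To make the test-time scaling setup concrete, below is a minimal sketch of how one might sweep inference-compute budgets for an off-the-shelf English-centric RLM on a non-English math question. The model choice (DeepSeek-R1-Distill-Qwen-1.5B), decoding parameters, and prompt are illustrative assumptions, not the paper's exact experimental protocol.

```python
# Minimal sketch: scaling test-time compute by sweeping the generation budget
# of an English-centric RLM on a non-English math prompt. All settings here
# are assumptions for illustration, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# A simple Spanish math question ("Juan has 3 apples and buys 5 more...").
question = "Juan tiene 3 manzanas y compra 5 más. ¿Cuántas tiene ahora?"
messages = [{"role": "user", "content": question}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Larger token budgets permit longer chains of thought; comparing outputs
# across budgets is one way to probe test-time scaling behavior.
for budget in (512, 2048, 8192):
    out = model.generate(
        inputs,
        max_new_tokens=budget,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
    )
    print(f"--- budget={budget} ---")
    print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```

In practice one would score the extracted final answers across budgets (and across languages) to measure how accuracy scales with inference compute; the loop above only prints the raw generations.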