EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Relevance: 3/10 (2025 paper)

This paper introduces EverydayMMQA, a framework for creating culturally grounded multimodal datasets, and OASIS, a large-scale dataset of 14.8M QA pairs combining speech, images, and text in English and Arabic varieties across 18 countries. The work focuses on evaluating multimodal LLMs for everyday reasoning and cultural awareness in low-resource languages.

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text, with 14.8M QA pairs covering English and Arabic varieties across 18 countries.
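For concreteness, here is a minimal sketch of what a single OASIS-style SVQA record might contain, based only on the modalities and metadata named above (speech, image, question/answer text, language variety, country). The SVQARecord class, its field names, and the example values are hypothetical illustrations, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SVQARecord:
    """Hypothetical layout for one spoken-visual QA example.

    Field names are illustrative; the paper does not publish a schema here.
    """
    question_text: str               # written form of the question
    question_audio: Optional[bytes]  # spoken form of the question, if recorded
    image_path: Optional[str]        # associated image, if the question is visual
    answer_text: str                 # reference answer
    language: str                    # e.g. English or an Arabic variety code
    country: str                     # one of the 18 covered countries

# Invented example instance, for illustration only:
example = SVQARecord(
    question_text="What dish is shown in the picture?",
    question_audio=None,
    image_path="images/dish_0001.jpg",
    answer_text="Koshari",
    language="ar-EG",
    country="Egypt",
)
print(example.language, example.country)
```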

Tags

commonsense reasoning test, computer-science