From Handwriting to Feedback: Evaluating VLMs and LLMs for AI-Powered Assessment in Indonesian Classrooms
This paper evaluates state-of-the-art VLMs and LLMs on two tasks: grading over 14,000 handwritten answers from Indonesian Grade 4 students in Mathematics and English, and generating personalized feedback on those answers. It introduces a multimodal pipeline that combines vision models for handwriting recognition with language models for rubric-based assessment in real-world, underrepresented classroom settings.
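To make the two-stage design concrete, here is a minimal Python sketch. It assumes a recognizer that turns an answer image into plain text, and a rubric expressed as scored predicates over that text. All names (`grade_answer`, `RubricItem`, `feedback`, `fake_vlm`) are hypothetical illustrations, not the authors' implementation. Simple rule-based checks stand in for the LLM grader, and a fixed string template stands in for LLM-generated feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical: any handwriting recognizer (e.g. a VLM call) mapping an
# image of a student's answer to a text transcription.
Recognizer = Callable[[bytes], str]


@dataclass
class RubricItem:
    criterion: str                # human-readable description, reused in feedback
    points: int                   # points awarded when the check passes
    check: Callable[[str], bool]  # predicate applied to the transcription


def grade_answer(image: bytes, recognize: Recognizer,
                 rubric: List[RubricItem]) -> Tuple[int, List[str]]:
    """Stage 1: transcribe the handwriting. Stage 2: score it against the rubric."""
    transcript = recognize(image).strip()
    score, missed = 0, []
    for item in rubric:
        if item.check(transcript):
            score += item.points
        else:
            missed.append(item.criterion)
    return score, missed


def feedback(score: int, max_score: int, missed: List[str]) -> str:
    """Template feedback; in the pipeline described above, an LLM would write this."""
    if not missed:
        return f"Full marks ({score}/{max_score}). Well done!"
    return f"Score {score}/{max_score}. Review: " + "; ".join(missed) + "."


def fake_vlm(image: bytes) -> str:
    """Stand-in recognizer that returns a fixed transcription for the demo."""
    return " 19 "


# Hypothetical rubric for the question "12 + 7 = ?"
rubric = [
    RubricItem("final answer equals 19", 2, lambda t: t.replace(" ", "") == "19"),
    RubricItem("answer written as a number", 1, lambda t: t.isdigit()),
]

score, missed = grade_answer(b"<image bytes>", fake_vlm, rubric)
max_score = sum(item.points for item in rubric)
print(feedback(score, max_score, missed))  # Full marks (3/3). Well done!
```

Keeping transcription and grading as separate, swappable stages mirrors the split described above: a recognition error in stage 1 surfaces as a missed rubric criterion in stage 2, which makes it easier to tell handwriting-recognition failures apart from grading failures when evaluating each model.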
Despite rapid progress in vision-language models (VLMs) and large language models (LLMs), their effectiveness for AI-driven educational assessment in real-world, underrepresented classrooms remains largely unexplored. We evaluate state-of-the-art VLMs and LLMs on over 14K handwritten answers from Grade 4 classrooms in Indonesia, covering Mathematics and English questions aligned with Indonesia's national curriculum. Unlike prior work on clean digital text, our dataset features naturally curly, diverse handwritin