Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach
This paper presents LearnLM-Tutor, a Gemini model fine-tuned for educational use, and introduces an evaluation framework for assessing the pedagogical quality of K-12 AI tutoring systems. The framework is grounded in learning science principles and spans seven benchmarks that combine quantitative, qualitative, automatic, and human evaluations. The work also includes a real-world deployment at Arizona State University and a systematic evaluation of pedagogical dimensions, including Socratic dialogue, adaptive scaffolding, and learning-centered interactions.
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of