Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach
This paper presents LearnLM-Tutor, a Gemini model fine-tuned for education, and introduces a comprehensive evaluation framework of seven diverse benchmarks (quantitative and qualitative, automatic and human) grounded in learning-science principles to assess the pedagogical capabilities of AI tutoring systems. The work includes a real-world deployment in Arizona State University's Study Hall and shows that educators and learners consistently prefer LearnLM-Tutor over a prompt-tuned Gemini across multiple pedagogical dimensions.
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulty of verbalising pedagogical intuitions into gen AI prompts and the lack of