HANNAH-BETH CLARK, MARGAUX DOWLAND, LAURA BENTON, REKA BUDAI, IBRAHIM KAAN KESKIN, EMMA SEARLE, MATTHEW GREGORY, MARK HODIERNE, WILLIAM GAYNE AND JOHN ROBERTS, OAK NATIONAL ACADEMY, UK
Following the launch of ChatGPT in 2022, the EdTech market has been flooded with artificial intelligence (AI) tools to support teachers with time-consuming tasks, such as lesson planning or creating lesson resources. This has resulted in two-thirds of teachers using AI in their work (Teacher Tapp, 2024). Ensuring the accuracy of the content that is produced is vital, as inaccuracies and biases in content created by AI tools can exacerbate misconceptions in classrooms and be detrimental to student outcomes (Levonian and Henkel, 2024). Limited checks on the quality of content produced by AI tools (Chiu et al., 2023; EEF, 2025) make it challenging for teachers to make good decisions about the use of AI tools within their schools.
As a publicly funded body, at Oak National Academy we aim to improve student outcomes and close the disadvantage gap by providing teachers with high-quality curriculum materials. We have created curriculum sequences and over 10,000 open-source lessons and resources, aligned with the National Curriculum for England. These have been designed and quality-assured by expert, subject specialist teachers, in line with Oak’s evidence-informed curriculum principles (McCrea, 2023). Having codified and exemplified high-quality curricula, we have been in a unique position to design Aila, a free-to-use, AI-powered lesson planning tool.
In this article, we provide a brief summary of our paper, published with MIT Open Learning (Clark et al., 2025). We present a case study demonstrating how the auto-evaluation tool can be used to improve the quality of multiple-choice quiz questions.
How we designed our AI lesson assistant
Aila is designed to support teacher agency by enabling teachers to adapt lesson plans step by step to better suit their students. Prompt engineering has allowed us to incorporate our existing curriculum design principles into a detailed AI prompt, with examples and non-examples to support the AI model in designing a pedagogically rigorous lesson. We use a technique called ‘retrieval augmented generation’ (RAG) to incorporate content from our large corpus of lessons and resources.
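To give a sense of how the retrieval step works, the sketch below ranks a small set of existing lesson snippets against a planning request and places the most relevant ones into the prompt. It is a minimal illustration only: it uses TF-IDF similarity in place of the embedding-based retrieval a production system would use, and the corpus, query and prompt template are assumptions rather than Aila’s actual implementation.

```python
# Minimal sketch of the retrieval step in retrieval-augmented generation (RAG).
# TF-IDF similarity stands in for neural embeddings; all content is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for a corpus of existing, quality-assured lesson content.
corpus = [
    "Photosynthesis converts light energy into chemical energy in chloroplasts.",
    "The water cycle describes evaporation, condensation and precipitation.",
    "Fractions represent parts of a whole and are added using common denominators.",
]

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    vectoriser = TfidfVectorizer().fit(documents + [query])
    scores = cosine_similarity(
        vectoriser.transform([query]), vectoriser.transform(documents)
    )[0]
    ranked = sorted(zip(scores, documents), reverse=True)
    return [doc for _, doc in ranked[:k]]

# The retrieved content is placed into the lesson-planning prompt so that the
# model grounds its output in existing material rather than generating freely.
request = "Plan a lesson on how plants make food"
context = "\n".join(retrieve(request, corpus))
prompt = f"Using the following existing lesson content:\n{context}\n\nDraft a lesson plan for: {request}"
print(prompt)
```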
Alongside user evaluation, we have also built an auto-evaluation agent (a tool that uses AI to judge the quality of AI-generated content) to enable us to check the accuracy, quality and safety of the lessons that Aila produces.
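In broad terms, an agent like this asks a language model to score a piece of generated content against a benchmark and to return a justification in a structured format that can be stored and analysed. The sketch below is a hypothetical illustration: `call_llm` is a stub standing in for a real model call, and the prompt wording and benchmark are our own examples rather than Oak’s code.

```python
# Minimal sketch of an auto-evaluation ('LLM as judge') call.
# call_llm is a stub; a real system would call a language-model API here.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; returns a canned response for the sketch."""
    return '{"score": 4, "justification": "Distractors are plausible and mirror the structure of the correct answer."}'

def evaluate(quiz_question: str, benchmark: str) -> dict:
    """Ask the judge model to score one question against one benchmark (1-5), with a justification."""
    prompt = (
        "You are an expert teacher reviewing a multiple-choice quiz question.\n"
        f"Benchmark: {benchmark}\n"
        f"Question and answers:\n{quiz_question}\n"
        'Respond as JSON: {"score": <1-5>, "justification": "<one sentence>"}'
    )
    return json.loads(call_llm(prompt))

result = evaluate(
    "What gas do plants absorb for photosynthesis?\n"
    "A) carbon dioxide  B) oxygen  C) nitrogen  D) helium",
    "Answer options should be minimally different from each other",
)
print(result["score"], "-", result["justification"])
```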
Improving the quality of multiple-choice quiz questions
We wanted to understand how closely aligned the auto-evaluation agent was to judgements made by expert teachers.
To do this, we first created a dataset of 4,985 lessons. The lessons spanned all four key stages and covered maths, English, history, geography and science. We used the auto-evaluation agent to score the lessons on a range of different evaluation benchmarks, including the use of Americanisms, age-appropriateness, cultural/gender bias and progressive increase in quiz question complexity, and then to provide a justification for each score. One area identified by the agent as being below the expected quality was multiple-choice quiz question distractors (incorrect but plausible answer choices).
Looking specifically at the quality of distractor answers in our quizzes, we wanted to understand three key things:
- What makes a multiple-choice quiz question distractor high- or low-quality?
- How closely aligned was our auto-evaluation agent with expert teacher judgements?
- Could we improve the alignment further using findings from human evaluations?
We recruited 20 qualified teachers (both primary and secondary) who currently work at Oak National Academy, with an average of 14 years of teaching experience, to participate as evaluators. We asked participants to evaluate randomly assigned multiple-choice quiz questions from their subject specialism, using the same evaluation criteria as the auto-evaluation agent. Participants rated how minimally different the answer options were from each other on a scale of 1–5 (with 1 being significantly different and 5 being minimally different) and added a justification for their score. Participants assessed 311 multiple-choice questions from the dataset, averaging 16.4 questions per participant.
How we measured quality
By analysing patterns in the assessments and justifications for low- and high-quality distractors given by the teacher participants, we concluded that the auto-evaluation agent was effective, but there were some areas for improvement. We identified some key themes in the AI-generated multiple-choice quiz questions to help to determine what makes a low- and a high-quality distractor:
- Low-quality distractors: These expressed the opposite sentiment to the correct answer (e.g. the correct answer was a positive trait and the distractors were all negative traits), had a different grammatical structure from the correct answer, or failed to repeat words from the question when the correct answer did.
- High-quality distractors: These incorporated common misconceptions, shared thematic consistency with the correct answers and featured similar grammatical structures to correct answers.
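To give a flavour of what this codification can look like in practice, the fragment below sketches how guidance, an example and a non-example might be written into an evaluation prompt; the wording is hypothetical and not taken from Aila’s actual prompt.

```python
# Hypothetical example of codifying distractor-quality criteria, with an
# example and a non-example, for inclusion in the judge prompt.
DISTRACTOR_GUIDANCE = """
High-quality distractors reflect common misconceptions and match the theme
and grammatical structure of the correct answer.

Good example:
  Q: What gas do plants absorb for photosynthesis?
  Correct: carbon dioxide   Distractors: oxygen, nitrogen
  (all options are gases that a pupil might plausibly confuse)

Poor example:
  Q: Which word describes a generous character?
  Correct: kind   Distractors: cruel, selfish
  (the distractors are simply opposites of the correct answer)
"""

judge_prompt = (
    "Score the distractors in the question below from 1 (significantly "
    "different from the correct answer) to 5 (minimally different), "
    "using this guidance:\n" + DISTRACTOR_GUIDANCE
)
```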
Armed with these insights, we codified our findings with accompanying examples and incorporated them into our auto-evaluation agent’s prompt (the instructions that we have given to the tool). The results were significant:
- Our AI-powered auto-evaluation tool now produces judgements more aligned with expert teachers. We know this because the difference between auto- and human evaluations decreased (mean squared error reduced from 3.83 to 2.95) and the agreement between these evaluations increased (from 0.17 to 0.32, measured using Quadratic Weighted Kappa); a short sketch of how these two metrics are calculated follows this list.
- We were also able to incorporate these findings into Aila’s main prompt, which improved the quality of the quiz questions produced.
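For readers who want to see how these alignment measures work, the sketch below computes mean squared error and Quadratic Weighted Kappa for a pair of score lists using scikit-learn. The scores shown are made-up examples, not our evaluation data.

```python
# Sketch of measuring alignment between auto- and human evaluations.
from sklearn.metrics import cohen_kappa_score, mean_squared_error

human_scores = [5, 3, 4, 2, 5, 1, 4, 3]  # expert teacher ratings (1-5), illustrative
auto_scores = [4, 3, 5, 2, 4, 2, 4, 3]   # auto-evaluation agent ratings (1-5), illustrative

# Mean squared error: average squared difference between the two sets of scores.
mse = mean_squared_error(human_scores, auto_scores)

# Quadratic Weighted Kappa: chance-corrected agreement that penalises larger
# disagreements more heavily (0 = chance-level agreement, 1 = perfect agreement).
qwk = cohen_kappa_score(human_scores, auto_scores, weights="quadratic")

print(f"MSE: {mse:.2f}  QWK: {qwk:.2f}")
```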
Implications for pedagogy
We hope that sharing what we have learned through this work will help other teachers to make better choices about the AI tools that they are using:
- Start with quality: Access to high-quality original materials is essential for building effective AI tools. General-purpose AI tools are trained on a wide range of materials that are not education-specific, and they do not use specific AI techniques to drive pedagogical quality, align to the National Curriculum or provide appropriate content in this way.
- Codify excellence: Defining and exemplifying high quality helps to guide AI tools in producing appropriate content and enables the setting of clear benchmarks for evaluation. When using general-purpose tools, teachers can provide additional guidance within their prompt to define excellence in their context, but we have already done this for you in Aila.
- Iterative evaluation: Cycles of auto- and human evaluation refine tools over time, driving quality and consistency. It is important that we find ways to ensure that the expert teacher is kept in the loop, both in the design of the AI tool and in the generation of the AI content.
The examples of AI use and specific tools in this article are for context only. They do not imply endorsement or recommendation of any particular tool or approach by the Department for Education (DfE) or the Chartered College of Teaching, and any views stated are those of the individual. Any use of AI also needs to be carefully planned, and what is appropriate in one setting may not be elsewhere. You should always follow the DfE’s Generative AI in Education policy position and product safety expectations, in addition to aligning any AI use with the DfE’s latest Keeping Children Safe in Education guidance. You can also find teacher and leader toolkits on gov.uk.