Early Career Hub

Avoiding common assessment pitfalls

Written by: Matthew Benyohai
4 min read

This case study was written by Matthew Benyohai, Head of Physics at a secondary school.

As you read this case study, reflect on how the teacher considers the validity and reliability of their assessment approaches. Take some time to think about what the teacher does, how they do it, what they might do differently and how this might influence your own practice.


Two important qualities of assessments are validity and reliability. Validity is a measure of how well an assessment reflects what you want to assess (Isaacs et al. 2013), or how well an assessment supports the inferences we make from its results (Messick 1989). Reliability is a measure of how consistent the results of, or inferences from, an assessment are (Isaacs et al. 2013).

These two concepts are often in competition with one another (Harlen 2005). If I tried to make an assessment more reliable by introducing lots of multiple-choice questions, I would reduce its validity (Wiliam 1993). If I were to use an open investigation to assess physics ability, it would arguably be more valid, but it would be very unreliable. The trade-off between reliability and validity is known as dependability (James 1998). How dependable your assessment needs to be, and therefore how it should be structured and designed, depends on the purpose of the assessment (Mansell, James & Gardner 2009). My day-to-day assessment practice is based on weighing dependability against a simplified list of assessment purposes: assessment as learning (AaL), assessment for learning (AfL) and assessment of learning (AoL).


Assessment as learning

Through the act of retrieval, or recalling information to mind, our memory for that information is strengthened and forgetting becomes less likely (Dann 2002). In terms of dependability, this kind of assessment needs to be highly valid, to ensure students are learning the correct constructs, but the reliability of each individual assessment or question doesn't need to be particularly high. This is because a large volume of assessments can take place, so the cumulative reliability will be high. One way I achieve this is by providing students with Quizlet sets (www.quizlet.com) covering core knowledge that they need to memorise. Quizlet is primarily a flashcard site that students can use to self-quiz, but it also offers a selection of different games and modes for students to test their knowledge (see Figures 1 and 2).

Figure 1
Figure 2

Assessment for learning

To teach effectively I need to know the limit of the students' understanding so that I can build upon it and set an appropriate level of challenge (Looney 2005). Again, this assessment needs to be valid, and reliability can be achieved by asking multiple questions. Early in my career, I would have assumed a single question was highly reliable (signposting it as my hinge question or formative assessment) and moved on, even though a number of students may have guessed or copied others. In contrast, my lessons today are littered with multiple-choice questions and worked examples, more than I may actually need to use in the lesson (see Figure 3). I prefer not to have the solutions prepared, as it's beneficial for the students to see my thinking live (Rosenshine 2012). I might ask them to have a go first on mini-whiteboards, or I might just ask them to watch the first time around. Leaving the slide blank gives me room to adapt.

Figure 3

Assessment of learning

As a classroom teacher, I want to know what the students can do and what they know. This helps me judge whether they have kept up with the work or are falling behind. I might also use summative assessment to help them close off a topic and motivate them to work a little harder before moving on (Harlen & Deakin Crick 2003). It should be valid in that it should match the summative assessment they will see at the end of the course (GCSE), so we use past GCSE questions as preparation. It should also be a little more reliable than assessment for the previous two purposes. However, it is a common pitfall to think that any test that fits in a single 60-minute lesson will have particularly high reliability. That is why it is so important not to draw many inferences beyond student A achieving more or less than student B.



References

Dann R (2002) Promoting assessment as learning: Improving the learning process. London: Routledge/Falmer.

Harlen W (2005) Trusting teachers’ judgement: Research evidence of the reliability and validity of teachers’ assessment used for summative purposes. Research Papers in Education 20 (3): 245–270.

Harlen W and Deakin Crick R (2003) Testing and motivation for learning. Assessment in Education: Principles, Policy & Practice 10 (2): 169–207.

Isaacs T, Zara C, Herbert G, Coombs SJ and Smith C (2013) Key concepts in educational assessment. Thousand Oaks: SAGE Publications.

James M (1998) Using assessment for school improvement. Oxford: Heinemann Educational.

Looney J (2005) Formative assessment: improving learning in secondary classrooms. Paris: Organisation for Economic Cooperation and Development.

Mansell W, James M and Gardner J (2009) Assessment in schools: Fit for purpose? A commentary by the Teaching and Learning Research Programme. Economic and Social Research Council, Teaching and Learning Research Programme.

Messick S (1989) Validity. In: Linn R (ed) Educational measurement. New York City: Macmillan Publishing, pp.13–103.

Rosenshine B (2012) Principles of instruction: Research-based strategies that all teachers should know. American Educator 36(1): 12.

Wiliam D (1993) Validity, dependability and reliability in national curriculum assessment. The Curriculum Journal 4(3): 335–350.
