Assessment is an important part of any education system. Without assessment, we cannot be sure that students are learning anything, because, as many countries have found, the amount of time students spend in school is a poor guide to how much they have actually learned (Pritchett, 2013). However, assessment is often unpopular with key stakeholders in education for a variety of reasons. In some systems it is used to hold teachers and schools to account for the quality of education provided, and when stakes are high, formal assessments can be stressful for students and their parents and carers. But even where the stakes are low, assessment is often seen as taking time from more valuable activities, such as teaching.
The result is that, in any education system, the assessment arrangements in place reflect a large number of trade-offs. We cannot get rid of the trade-offs, but we can make them explicit, so that we are better able to judge whether the balances we strike are ones with which we are comfortable.
The idea that any assessment system involves trade-offs is important, because it seems to me that many people imagine there is some perfect assessment system. There isn’t.
Take the issue of the reliability of assessments – that is, the degree to which the outcome of a particular assessment would be consistent if, for example, it were marked by a different marker or taken again. In the UK (and in many other countries) we tend to ignore the fact that no assessment (or indeed any other kind of measurement) is perfectly accurate. Students are told they got 72 per cent on a test, or that they got a grade C on an A-level examination, and these results are hardly ever accompanied by any indication of the margin of error associated with those assessments.
One reason for this is, I suspect, that the margins of error are generally much greater than people assume; they are, in short, a bit embarrassing. A typical school test would have a margin of error of 10 marks or more, and many, perhaps most, teachers would be uncomfortable saying to a parent that the pass mark for a course was 70, and their child scored 65, plus or minus 10. The parent might ask whether their child passed, and all we could say is, ‘Probably not. But they may have done.’ At this point the parent would probably ask, ‘Why don’t you know?’ to which the only sensible reply would be, ‘Because no test is perfectly reliable.’ Parents may well ask, ‘Can’t you make the test more reliable?’ and of course the answer is yes. We can make the test more reliable, but this involves making the test longer. Much longer – and we have better things to do with the time, such as teaching.
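For readers who want to see where a figure like 'plus or minus 10' might come from, here is a minimal worked sketch using the standard error of measurement; the spread of scores and the reliability coefficient below are assumed values chosen for illustration, not figures from any particular test.

```latex
% Illustrative only: \sigma_X and \rho are assumed values, not data from a real test.
\[
\mathrm{SEM} = \sigma_X \sqrt{1-\rho}
\]
\[
\text{With a spread of scores } \sigma_X = 13 \text{ marks and reliability } \rho = 0.85:\quad
\mathrm{SEM} = 13\sqrt{0.15} \approx 5 \text{ marks,}
\]
\[
\text{so a (roughly 95 per cent) band around a score of 65 is } 65 \pm 2\,\mathrm{SEM} \approx 65 \pm 10 \text{ marks.}
\]
```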
The important point here is that we can make our assessments as reliable as we like. The question is, what is the best trade-off between the time we spend assessing and all the other things we could do with the time? Indeed, highly reliable assessments may be a sign that we are spending too much time assessing, and not enough time teaching.
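One common way of quantifying why 'much longer' – offered here purely as an illustration, and not something cited in this article – is the Spearman-Brown prophecy formula, which relates the reliability of a lengthened test to that of the original; the reliability values below are again assumed.

```latex
% Spearman-Brown prophecy formula; k is the factor by which the test is lengthened.
\[
\rho_k = \frac{k\,\rho}{1+(k-1)\,\rho}
\qquad\Longrightarrow\qquad
k = \frac{\rho_{\text{target}}\,(1-\rho)}{\rho\,(1-\rho_{\text{target}})}
\]
\[
\text{To raise reliability from an assumed } \rho = 0.85 \text{ to } \rho_{\text{target}} = 0.95:\quad
k = \frac{0.95 \times 0.15}{0.85 \times 0.05} \approx 3.4
\]
```

In other words, under these assumptions the test would need to be well over three times as long – time that, as noted above, could otherwise be spent teaching.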
This is why anyone who hopes that this special issue of Impact will provide a series of answers about how to design the perfect assessment system for a school will be disappointed. What you will find, however, is a series of thoughtful explorations of the trade-offs that arise in the design and implementation of any system of assessments: explorations that might spur further reflection about your own challenges and also, perhaps, highlight some pitfalls to avoid.
In assessment in schools and colleges in particular, one of the most significant issues is whether our assessments capture all the important aspects of student achievement, or only some of them. As Tim Oates points out, assessments cannot assess everything that students know and so we have to sample. When that sample becomes predictable, and high stakes are attached to the results, teachers and students have an incentive to focus on those aspects of the subject that are tested, although of course whether they do so or not is a separate issue.
Problems with assessment
The issue of assessments that, to use the psychological jargon, 'underrepresent' the things they are meant to be assessing – in other words, assessments that are 'too small' – is particularly important in the assessment of English. From the earliest national curriculum assessments, speaking and listening were excluded from formal assessment, and as a result, many people feel that these aspects of English have been given less attention in schools, although they are clearly important for success in adult life. In their paper on assessing students’ spoken language skills, Neil Mercer, Ayesha Ahmed and Paul Warwick attempt to redress the balance by showing how a relatively simple oracy assessment toolkit can help schools determine how well students can use spoken English in different contexts and for different purposes.
More recently, as concerns over the accuracy of the marking of student writing have grown, national curriculum assessment has focused increasingly on reading, and yet, as papers by Daisy Christodoulou and by Philip Stock show, a technique called ‘comparative judgement’, first proposed by Louis Thurstone 90 years ago (Thurstone, 1927), provides a manageable and efficient procedure for assessing extended student writing that is at least as reliable as more traditional forms of assessment.
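For readers curious about the mechanics, the sketch below shows one minimal way that comparative-judgement scores can be estimated from a set of pairwise 'which is better?' decisions, using a simple Bradley-Terry style model (a common modern descendant of Thurstone's approach). The data, labels and fitting routine are invented for illustration and are not drawn from either paper.

```python
import math
from collections import defaultdict

# Each judgement records which of two pieces of writing a judge preferred.
# The script labels and decisions below are invented purely for illustration.
judgements = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("A", "B"), ("C", "B"), ("A", "C"),
]

scripts = sorted({s for pair in judgements for s in pair})
theta = {s: 0.0 for s in scripts}  # estimated latent quality of each script

def win_probability(winner, loser):
    """Bradley-Terry probability that `winner` is judged better than `loser`."""
    return 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))

# Fit the quality estimates by simple gradient ascent on the log-likelihood.
# Real comparative-judgement tools add regularisation, judge monitoring and
# reliability statistics; this shows only the bare idea.
learning_rate = 0.1
for _ in range(500):
    gradient = defaultdict(float)
    for winner, loser in judgements:
        p = win_probability(winner, loser)
        gradient[winner] += 1.0 - p
        gradient[loser] -= 1.0 - p
    for s in scripts:
        theta[s] += learning_rate * gradient[s]

# A higher theta means the script was judged better overall; the values can
# then be rescaled onto whatever mark scale the school uses.
for s in sorted(scripts, key=lambda s: -theta[s]):
    print(f"{s}: {theta[s]:+.2f}")
```

In practice, teachers would use dedicated comparative-judgement software rather than hand-rolled code, but the underlying idea is the same: many quick pairwise decisions are aggregated into a single scaled measure for each piece of writing.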
The other major problem with drawing valid conclusions from assessment results is not that the assessment is too 'small' (so that it fails to assess the things it should), but rather that it is too 'big', in that student results depend on things that should not really affect them. The results of a maths test with a high reading demand are difficult to interpret. We can be reasonably sure that students with high scores can do the mathematics that is tested (and the reading required), but for students with low scores, we cannot be sure whether this is because they could not do the mathematics or because they could not understand the questions. Such a test would support inferences about mathematical competence for some students (good readers) but not for others (poor readers). The problem with such a test is that the variation in scores between students is partly due to differences in their mathematics achievement (which we want), and partly due to differences in their reading ability (which we do not want).
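One way to express this – an illustrative formalisation, not taken from the original article, and assuming the components are independent – is to think of the observed variation in scores on such a maths test as the sum of wanted and unwanted parts:

```latex
% Illustrative decomposition of score variance for the maths-test example.
\[
\sigma^2_{\text{observed}} \;=\;
\underbrace{\sigma^2_{\text{mathematics}}}_{\text{wanted}}
\;+\;
\underbrace{\sigma^2_{\text{reading}}}_{\text{unwanted, systematic}}
\;+\;
\underbrace{\sigma^2_{\text{error}}}_{\text{unwanted, random}}
\]
```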
In this particular example, the unwanted variation in scores affects all poor readers – in other words, this is a systematic effect. But sometimes the unwanted variation in scores occurs randomly. This is particularly noticeable, as Catherine Kirkup points out, in the assessment of the youngest students because their performance is very variable from day to day; they have good days and bad days. Students can also be lucky or unlucky in who marks their work. Sometimes students are given the ‘benefit of the doubt’ and sometimes they are not. It is common to regard this as an issue of inconsistency between teachers, but as we know from the groundbreaking work of Starch and Elliott over 100 years ago, the difference in marks given by the same teacher in (say) the morning and the afternoon to the same piece of work is almost as great as the difference between one teacher and another (Starch and Elliott, 1912, 1913a, 1913b).
Assessment experts generally regard the extent to which a student’s score on a test or some other assessment varies according to random factors as an issue of reliability. As Sarah Earle points out, a concern for reliability can often mean changing assessments to ensure greater consistency in the assessment process. Let’s make all the students take the assessment at the same time. Let’s ensure that all teachers say exactly the same thing to every student. Let’s concentrate on assessing facts, because then we can be sure teachers are assessing things in the same way. The problem, of course, is that when we do this, our assessments underrepresent some of the things we are interested in.
This is what makes the relationship between reliability and validity – the degree to which a particular assessment measures what it is intended to measure, and the extent to which proposed interpretations and uses are justified – so complex. Reliability is a prerequisite for validity: if a student achieves a different score on a different day, or if someone else marks her work, then any conclusions we can draw on the basis of the result of the assessment are suspect. But attempts to improve reliability often narrow the scope of the assessment, so that we can say some things with more certainty, but there are now other things about which we can say nothing, because we didn’t assess them. So while reliability is a prerequisite for validity, it is also in tension with it, since attempts to increase reliability can reduce other aspects of validity.
In the spotlight
One way to envisage this is by thinking about stage lighting. We can use spotlights, and gather a great deal of detail about what is happening in one particular part of the stage (high reliability) but our knowledge of what is happening on the unlit parts of the stage is minimal (reduced validity, at least for the unlit areas). With the same lighting power, we can use floodlights, and get some light on all parts of the stage. The clarity of detail we can get on any particular area of the stage may be low (low reliability) but at least we now know something about other parts of the stage (increased validity). Trade-offs again.
Ultimately, perhaps the most important thing to realise about assessment is that, as Lee Cronbach pointed out almost 50 years ago, an assessment is simply a procedure for drawing inferences (Cronbach, 1971). Taking such a view, validity is not a property of assessments, but rather a property of the inferences that we draw on the basis of assessment outcomes. For any assessment, taking into account the circumstances of the assessment, some inferences will be warranted, and others will not. From such a perspective, it makes no sense to ask ‘Is this assessment valid?’, because the results will be valid for some purposes and not others. When someone asks, ‘Is this assessment valid?’ the only sensible response is: ‘You tell me what conclusions you propose to draw on the basis of the results of the assessment, and I will tell you whether those conclusions are justified.’ While this means that validating an assessment is essentially an interminable process, the paper by Deep Singh Ghataura draws on the work of Michael Kane and others to provide a helpful framework for schools in such work.
While the exact nature of these trade-offs between different aspects of assessment will depend on the particular school context, Stef Edwards provides some useful principles for the design of an assessment system in the context of the seven primary schools that constitute the Learn Academies Trust (Learn-AT), emphasising that all such developments need to be sustainable over the longer term.
However, perhaps the most exciting and important changes in assessment practice over the last 20 years or so have focused on the idea that assessment can improve learning, as well as measuring how much of it has occurred. While there is a debate about whether it is better to call this 'assessment for learning' (AfL for short) or 'formative assessment' – the process of gathering evidence through assessment to inform and support next steps in a student's teaching and learning – there is little doubt that when teachers work with their students to identify what has been learned, and what needs to happen next, considerable improvements in student achievement are possible. As one teacher said: ‘It’s all about making students’ voices louder, and the teacher’s hearing better.’
Ultimately, to be effective, formative assessment has to become more than just a vague set of ideas, and several of the papers in this special issue explore particular strategies of formative assessment.
Building on my work with various colleagues over the years, Nikki Booth suggests that formative assessment involves a number of strategies:
- Sharing learning intentions and success criteria with students
- Eliciting evidence of achievement
- Providing feedback
- Peer- and self-assessment.
Addressing the first of these strategies, Wynne Harlen points out the value of being clear about what students need to learn – the ‘big ideas’ of a subject. Once teachers and students are clear about the destinations, and the possible routes they might take to get there, assessment becomes a much more straightforward task of seeing how far students have progressed in their learning, by eliciting evidence of achievement. While this can be done in many ways, including observation, formal assessment and so on, the ‘beating heart’ of good teaching, as Jonathan Doherty suggests, is skilful classroom questioning, which requires deep subject knowledge on the part of the teacher, not only to make sense of students’ responses but also to know which questions to ask in the first place. While this is not a particularly strong feature of teacher education in England, it is, as Stuart Kime illustrates, an important feature of the assessment of teachers in Germany.
Once teachers are clear where students are in their learning, they must feed back to their students what might be done to improve that learning. Harry Fletcher-Wood shows how this can be done with a whole class, and there is no doubt that such practices are under-utilised at the moment – there is much that can be done to move the learning of the whole group forward. However, there are times when the only way to do this is through feedback to individual students, and the papers by Andy Moor, Rachael Falkner, Antony Barton, Clare Sealy and Michael Taylor explore the issue of marking in a range of contexts, while José Picardo shows that, although digital technology has often in the past been more of a hindrance than a help to teachers in their daily work, it can play a powerful role in managing and improving feedback to learners. A commentary from the Education Endowment Foundation provides links to other research in this area, and gives details of how schools wanting to participate in future research can sign up.
Ultimately, although all these strategies are important, the aim of all teaching, arguably, is to create self-regulating learners: if students can evaluate their own achievements effectively, then they can advance their own learning when there is no-one around to give help. This is what Beth Budden describes as enabling the ‘learning independence’ of our students.
The importance of assessment for evaluating learning and guiding teaching has long been clear, but within the last decade or so there has been a gradual realisation that assessment can also improve learning in a much more direct way.
When, in the late 1980s, the Conservative government announced that it intended to raise standards by introducing national testing for 7-, 11-, 14- and 16-year-olds, many people expressed their scepticism through the old adage that ‘Weighing the pig doesn’t fatten it.’ It’s a great soundbite, but it’s highly misleading. Drawing on the work of Bjork, Roediger, Karpicke and others, in three short articles, Jonathan Firth, Megan Smith, Blake Harvard and Adam Boxer show that the simple act of being tested improves long-term memory, even if the test is never marked, because successfully retrieving something from memory strengthens long-term retention, especially when that retrieval is difficult.
Making living examples
However, in the end, nothing matters unless teachers find ways of incorporating these ideas into their practice. As Paul Black and I pointed out twenty years ago, ‘Teachers will not take up attractive sounding ideas, albeit based on extensive research, if these are presented as general principles which leave entirely to them the task of translating them into everyday practice – their lives are too busy and too fragile for this to be possible for all but an outstanding few. What they need is a variety of living examples of implementation, by teachers with whom they can identify and from whom they can both derive conviction and confidence that they can do better, and see concrete examples of what doing better means in practice’ (Black and Wiliam, 1998: p.15). While there are many ways that teachers could be supported in developing their practice of formative assessment, as Sam Sims, Gemma Moss and Ed Marshall show, Teacher Journal Clubs may be a particularly powerful and manageable way of doing so.
Finally, as Jonathan Sharples, Sandy Oliver, Andrew Oxman, Kamal Mahtani, Iain Chalmers, Kevan Collins, Astrid Austvoll-Dahlgren and Tammy Hoffman show, similar issues arise in other professions, such as healthcare. Indeed, medical education, with its use of extended practical assessments of student knowledge through authentic tasks such as Objective Structured Clinical Examinations, has often led the way in developing and validating new forms of assessment. As indicated earlier, assessment is a complex area, but one where dialogue, networking, sharing experience and learning by doing can yield significant benefits for learners. After all, as long as teachers are exploring the relationship between what they taught and what their students learned as a result, there is no more powerful focus for professional improvement.
In closing, I would like to record my thanks to Miriam Davey and the editorial staff at the College, who have worked tirelessly behind the scenes to produce this first issue of Impact under extremely demanding deadlines. The fact that this issue exists at all, let alone is of such high quality, and has appeared on time, is testament to their hard work and professionalism.
References
Black PJ and Wiliam D (1998) Inside the Black Box: Raising Standards Through Classroom Assessment. London, UK: King’s College London School of Education.
Cronbach LJ (1971) Test validation. In: Thorndike RL (ed.) Educational Measurement (2nd ed.). Washington, DC: American Council on Education, pp. 443-507.
Pritchett L (2013) The Rebirth of Education: Schooling Ain’t Learning. Washington, DC: Brookings Institution Press.
Starch D and Elliott EC (1912) Reliability of the grading of high-school work in English. The School Review 20: 442-457.
Starch D and Elliott EC (1913a) Reliability of grading high school work in history. The School Review 21: 676-681.
Starch D and Elliott EC (1913b) Reliability of grading work in mathematics. The School Review 21: 254-259.
Thurstone LL (1927) A law of comparative judgment. Psychological Review 34(4): 273-286. DOI: 10.1037/h0070288 (accessed 22 August 2017).