Comparative judgement (an approach to marking where teachers compare two students’ responses to a task and choose which is better, then repeat this process with other pieces of work) is not a new method of assessment. It was first proposed by LL Thurstone (1927) as a means of describing ‘the processes of human judgement that are not visible to the observer’. Thurstone suggested comparative judgement could be used to quantify the quality of things that are sometimes hard to measure holistically, such as writing or drawing. More recently, Alastair Pollitt’s work on adaptive comparative judgement (2012), together with high-profile championing by Daisy Christodoulou and David Didau, has helped bring comparative judgement to a wider audience.
In comparative judgement, overall quality is assessed through direct comparison. Comparisons might be of individual items, longer written responses or even performances. Technology such as the No More Marking website set up by Dr Chris Wheadon has made widespread use of comparative judgement viable. On the site, scanned pieces of work are uploaded, judgements are analysed and a scale is produced. Unlike the model envisaged by Thurstone, No More Marking uses adaptive comparative judgement, where knowledge of existing judgements is used to work out further comparisons required to produce a reliable scale.
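To make the scaling step concrete, the sketch below shows one common way a set of pairwise judgements can be turned into a quality scale, using a simple Bradley-Terry-style model closely related to Thurstone’s. The script identifiers and judgements are hypothetical, and No More Marking’s actual adaptive algorithm is more sophisticated than this; it is offered only as an illustration of the principle.

```python
import math

def fit_scale(judgements, n_passes=200, lr=0.05):
    """Estimate a quality score for each script from pairwise judgements.

    judgements: list of (winner, loser) script identifiers.
    Returns a dict mapping script id -> score, centred on zero
    (higher = judged better overall).
    """
    scripts = {s for pair in judgements for s in pair}
    theta = {s: 0.0 for s in scripts}
    for _ in range(n_passes):
        for winner, loser in judgements:
            # Probability the current scores assign to the observed outcome
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            # Nudge both scores towards explaining the judgement
            theta[winner] += lr * (1.0 - p_win)
            theta[loser] -= lr * (1.0 - p_win)
    mean = sum(theta.values()) / len(theta)
    return {s: round(v - mean, 2) for s, v in theta.items()}

# Hypothetical judgements: each pair is (script judged better, script judged weaker)
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
              ("C", "D"), ("B", "D"), ("B", "C")]
print(fit_scale(judgements))
# An adaptive scheme would use these interim scores to choose the most
# informative next pairing, rather than pairing scripts at random.
```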
Trialling comparative judgement
Our school saw comparative judgement as a possible means of addressing some of the problems we were experiencing with assessment, particularly with issues of validity (the degree to which a particular assessment measures what it is intended to measure, and the extent to which proposed interpretations and uses are justified) and reliability (the degree to which the outcome of a particular assessment would be consistent – for example, if it were marked by a different marker or taken again) exposed by the removal of National Curriculum levels at Key Stage 3. When we started to develop our own assessment framework, we could see how deep the problems with levels ran in producing valid inferences about student attainment. We were keen to explore the potential for comparative judgement to improve our evaluation of student learning, as well as to see whether it could help reduce teacher marking workload.
Following an initial exploratory session, in which middle leaders compared essays by Year 12 students about the Aristotelian good life, we set up a trial to look at the viability of using comparative judgement at scale to assess student attainment. The trial gave us a great insight into the strengths of comparative judgement, as well as an understanding of some of the issues we needed to address before adopting it more widely across the school. These issues can be framed as three questions that anyone seeking to use comparative judgement in their own context should consider before implementing it.
1. How can script illegibility best be overcome?
The early stages focused on Year 11 English classes. We assessed one band’s mock examinations using comparative judgement, and the other using a mark scheme. The ranking produced was largely as expected, but with a number of significant anomalies. Teachers attributed these discrepancies to the illegibility of some scripts. Whether the process highlighted a handwriting bias that is usually hidden, or whether teachers grow accustomed to the idiosyncrasies of their students’ writing and compensate accordingly, is hard to tell. Either way, issues of illegibility seemed to account for some responses receiving the wrong mark. To combat this, we planned in future to give students more precise instructions, to insist on black pens that show up more clearly on the scans, and to brief the teachers judging the scripts more thoroughly about bias and how to overcome it.
2. What is the optimal type and length of written task?
Our next use of comparative judgement involved comparing creative responses. Teachers found the straight writing focus better suited to the format, saying there were fewer things to consider than when making judgements about analytical writing: that is, less cognitive load than when having to apply mental models of coherence and veracity simultaneously. Teachers’ assumptions appeared to be supported by the slightly higher reliability score achieved and by the lower mean time taken to form judgements.
We were too ambitious with our first task, an analytical essay on RL Stevenson’s Dr Jekyll and Mr Hyde, and believed this accounted for some of the judging anomalies that could not be put down to handwriting issues alone. Neither the national primary Sharing Standards project run by No More Marking nor the Fischer Family Trust proof of progress tests, which both employ comparative judgement methodology, include any type of analytical task. It may be that it is simply much easier to make quick and accurate judgements in relation to writing quality and accuracy alone.
With students increasingly required to write analytical essays, we were keen to try to improve our use of comparative judgement. With this in mind, we set up a further analytical task, a comparison of two poems. Rather than assessing half the cohort, however, this time we selected only two groups, and as well as having students judged comparatively we also had the work marked by their class teacher in the usual way. We reasoned that it would be easier to uncover the source of any discrepancies if we were dealing with fewer students and if we could directly compare scores from the different assessment methods used.
As a consequence of making many more judgements in relation to the number of scripts to be marked (8:1 for the poetry comparison, against 5:1 for the Stevenson essay), the reliability increased considerably, from 0.70 to 0.85. While there was general agreement between the marks from comparative judgement and the marks from the teacher, there were still, as before, some anomalies. Two scripts from the lower set really stood out: they had achieved far lower rankings from their teacher using a mark scheme than from the rest of the department judging comparatively. This seemed to point to some kind of anchoring by perceived ability, which anonymity and straight comparison may have helped overcome.
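Reliability in comparative judgement is commonly reported as a scale separation reliability: the proportion of the observed spread in scores that is not attributable to measurement error. Assuming the figures quoted here are of that kind, the sketch below illustrates how such a number can be computed from estimated script scores and their standard errors. The numbers are hypothetical, and the judging software’s exact calculation may differ.

```python
def scale_separation_reliability(scores, standard_errors):
    """Proportion of the observed spread in scores that is not measurement error."""
    n = len(scores)
    mean = sum(scores) / n
    observed_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n
    # 'True' score variance is what remains once measurement error is removed
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var

scores = [-1.8, -0.9, -0.2, 0.4, 1.1, 1.4]          # hypothetical script scores
standard_errors = [0.5, 0.45, 0.4, 0.4, 0.45, 0.5]  # hypothetical standard errors
print(round(scale_separation_reliability(scores, standard_errors), 2))  # ~0.86
```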
Another essay from the lower group that ranked unexpectedly high provided an important insight into the nature of task setting for comparative judgement. A close look at the script suggested that the judges had been far too influenced by its strong opening, an inference supported by its relatively low mean judgement time. After its strong start, the response declined rapidly and even started to compare more than the two set poems. We realised that, in future, analytical tasks would be better if they were shorter and more tightly focused, ideally a page in length at most. This precision would have the added benefits of shortening judgement times and encouraging a move away from focusing on GCSE endgame-style questions too soon lower down the school.
3. How do you get high reliability at scale?
We now felt ready to trial comparative judgement in other subjects and at greater scale. We chose to run some end-of-year examinations for Year 7 in English, maths, geography and history. English and maths set two comparative judgement tasks each, and geography and history just one. Learning from our past efforts, we ensured the tasks required focused responses (see Table 1). We also gave standardised instructions to all students about the importance of legibility, as well as guidance to teachers about bias and how to guard against it while judging.
The results were disappointing and ultimately too unreliable to be used to draw any solid inferences about student attainment. In previous rounds, we had achieved reliability scores of between 0.70 and 0.85, but now those scores had fallen to 0.45 (history) and 0.55 (geography). In many respects, these low scores were hardly a surprise. Even though we had dedicated two hours for teachers to judge the assessments in their departments, we had simply not factored in nearly enough time for the number of judgements required for 300+ scripts.
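A rough back-of-the-envelope calculation shows the scale of the problem. The judgement ratio and seconds-per-judgement figures below are assumptions for illustration, not measurements from our trial, but they make clear why two hours per department was never going to be enough.

```python
scripts = 300                 # approximate size of the Year 7 cohort
judgements_per_script = 10    # ratio assumed necessary for high reliability
seconds_per_judgement = 30    # assumed average time to compare two scripts

total_judgements = scripts * judgements_per_script
total_hours = total_judgements * seconds_per_judgement / 3600
print(f"{total_judgements} judgements, roughly {total_hours:.0f} hours of judging")
# -> 3000 judgements, roughly 25 hours of judging, shared across the judging team
```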
The majority of teachers felt cautiously optimistic about the scope for comparative judgement to be used more widely, suggesting that the ability to see across a cohort quickly and efficiently had provided invaluable insights into student learning, as well as into curriculum design and sequencing. There was consensus about the need for more quantitative evaluation of the impact on workload, and about whether the time needed to achieve high reliability with a large cohort and a small teaching team outweighed the time it would take teachers who typically have three or four classes to mark the work in the more traditional way.
The future
Most of all, we learned that comparative judgement can tell us about the ways our students think about and understand different subjects, and how that might differ from the knowledge we gain from other measures. Our developing understanding of comparative judgement has very much been a developing understanding of assessment – what it can reveal or mask about student learning; where it can yield reliable and useful information; and where it may fall short. There is still a great deal we need to learn about how to use comparative judgement effectively before we can confidently assert whether it has a future in our assessment framework, and what that future might look like.