‘Assessments should be accurate for 90% of students plus or minus one grade’
At a hearing of the Education Select Committee enquiring into summer 2020’s grading fiasco, Ofqual’s Executive Director of Strategy, Risk and Research stated that:
‘There is a benchmark that is used in assessment evidence that any assessment should be accurate for 90% of students plus or minus one grade. That is a standard benchmark. On average, the subjects were doing much better than that. For A-level we were looking at 98%; for GCSE we were looking at 96%, so we did take some solace from that.’ (Education Select Committee, 2020)
These appear to be reassuring words: school exams are exceeding the ‘standard benchmark’ by a considerable margin. Furthermore, for someone so senior to ‘take solace’ must imply that the exam system in England is working excellently, and is in the best possible hands.
What does ‘accurate’ mean?
Central to this statement is the word ‘accurate’ – but what does this mean in practice? Supposing a particular script is marked 64 and is awarded grade B, is this grade ‘accurate’?
To answer this question, there needs to be an associated, independently verifiable ‘truth’ to which the grade B can be compared. If the ‘truth’ is grade B, then that grade is accurate; if not, then that grade is not accurate but wrong. However, according to Ofqual:
‘There is often no single, correct mark for a question. In long, extended or essay-type questions, it is possible for two examiners to give different but appropriate marks to the same answer. There is nothing wrong or unusual about that.’ (Swan, 2016)
If there is ‘often no single, correct mark for a question’, there will also often be no single, correct mark for an entire script; a script might be marked 64 by one examiner – corresponding to grade B – and 65 by another, which – if the B/A grade boundary is 64/65 – results in grade A. Neither mark is ‘correct’ for both marks are merely ‘different but appropriate’. Accordingly, neither grade B nor grade A is ‘correct’ for both are, once again, ‘different but appropriate’.
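To make that concrete, here is a minimal sketch of a mark-to-grade mapping. The grade boundaries are hypothetical, chosen only to match the example above (a B/A boundary at 64/65), and the code is purely illustrative of how a one-mark difference between two ‘appropriate’ marks can change the grade:

```python
# Minimal sketch of a mark-to-grade mapping.
# The boundaries below are hypothetical, chosen only to match the example
# in the text (B/A boundary at 64/65); they are not real exam-board values.

def grade_for_mark(mark, boundaries):
    """Return the grade for a mark, given {grade: minimum mark} boundaries."""
    awarded = "U"
    for grade, minimum in sorted(boundaries.items(), key=lambda item: item[1]):
        if mark >= minimum:
            awarded = grade
    return awarded

boundaries = {"C": 55, "B": 60, "A": 65}   # hypothetical grade boundaries

print(grade_for_mark(64, boundaries))   # B  (one examiner's 'appropriate' mark)
print(grade_for_mark(65, boundaries))   # A  (another examiner's, one mark higher)
```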
There is therefore no single ‘right’ mark, and no single ‘right’ grade that is an independently verifiable ‘truth’ to which any actual grade might be compared. I can only conclude that no grade can ever be deemed ‘accurate’ – indeed the whole concept of ‘accuracy’ cannot apply to exam grades or exam marks.
So why did Ofqual’s Executive Director of Strategy, Risk and Research use the word ‘accurate’ in evidence to the Select Committee?
The ‘definitive’ grade
In a landmark report published in November 2016, Ofqual presented the findings of a research project in which whole subject cohorts of scripts were marked and graded twice: once by an ‘ordinary’ examiner and once by a ‘senior’ examiner, whose grade was designated ‘definitive’ or ‘true’ (Ofqual, 2016). So, despite their acknowledgement just a few months earlier that there is ‘often no single, correct mark for a question’, Ofqual then found it convenient to assert that ‘some marks are more correct than others’, with a ‘senior’ examiner’s mark being the most correct of all (Ofqual, 2016).
It might be thought that the two grades awarded to each script were always identical. But no:
- The percentage of the total cohort for which the two grades were the same varied across the 14 subjects studied. For example, for Maths, about 96 per cent of the grades were the same and about 4 per cent were different; for Geography, about 65 per cent were the same and 35 per cent different; for History, about 56 per cent were the same and 44 per cent different (Ofqual, 2018).
- Within any subject, the percentage of grades that were the same varied by mark: for high A*s (or 9s) and low Us the two grades were almost always the same, but for scripts marked at, or close to, any grade boundary in any subject, about 50 per cent of the grades were the same and about 50 per cent were different (Ofqual, 2018; Sherwood, 2019a) – the short simulation sketched below suggests why.
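A rough simulation shows why agreement collapses to about 50 per cent at a grade boundary. The model below is my own simplifying assumption, not Ofqual’s: each of the two marks is treated as the script’s underlying mark plus independent random marker-to-marker noise. Away from the boundary, both marks almost always land on the same side; at the boundary, each mark falls on either side roughly half the time, so the two grades agree only about half the time:

```python
import random

def agreement_rate(underlying_mark, boundary, sd=2.0, trials=100_000):
    """Estimate how often two independent markings of the same script fall on
    the same side of a grade boundary, assuming (crudely) that each mark is
    the underlying mark plus Gaussian marker-to-marker noise."""
    same = 0
    for _ in range(trials):
        mark_1 = underlying_mark + random.gauss(0, sd)
        mark_2 = underlying_mark + random.gauss(0, sd)
        same += (mark_1 >= boundary) == (mark_2 >= boundary)
    return same / trials

boundary = 65  # hypothetical B/A boundary
for underlying_mark in (55, 60, 63, 65, 67, 70, 75):
    # Agreement is close to 1.0 far from the boundary, close to 0.5 at it.
    print(underlying_mark, round(agreement_rate(underlying_mark, boundary), 2))
```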
To examine the implications of these results, consider the specific case of the 261,537 grades awarded for 2019 GCSE History in England (JCQ, 2019).
If all scripts were marked by a ‘senior’ examiner and assuming that:
- all ‘senior’ examiners would give the same script exactly the same mark, without exception
- and any one ‘senior’ examiner would give the same script the same mark if marked a second time, or on another occasion, no matter how tired the examiner might be or what other scripts might have been marked in the intervening time…
…then all 261,537 candidates would receive – to use Ofqual’s words – the ‘definitive’ or ‘true’ grade.
Alternatively, if all 261,537 scripts were marked by an ‘ordinary’ examiner, only about 56 per cent of these (that’s approximately 146,500 scripts) would be awarded the ‘definitive’ or ‘true’ grade, and the remaining 44 per cent (some 115,000 scripts) would not. Ofqual offer no terminology for this, but presumably words such as ‘non-definitive’ or ‘false’ might apply.
How many of the 261,537 scripts were in fact marked by an ‘ordinary’ examiner is unknown, but it is quite likely to be many more than by a ‘senior’ one – in which case, the number of ‘non-definitive’ or ‘false’ grades actually awarded for 2019 GCSE History in England was probably much closer to 115,000 than to zero.
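The arithmetic behind these figures is simple enough to set out explicitly. A minimal sketch, using the 2019 cohort size and the roughly 56 per cent agreement rate for History quoted above (the rounding is mine):

```python
cohort = 261_537   # GCSE History grades awarded in England, 2019 (JCQ, 2019)
agreement = 0.56   # approximate share matching the 'definitive' grade (Ofqual, 2018)

definitive = round(cohort * agreement)   # about 146,500 'definitive' grades
non_definitive = cohort - definitive     # about 115,000 'non-definitive' grades
print(definitive, non_definitive)        # 146461 115076
```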
Reliability: in assessment, the degree to which the outcome of a particular assessment would be consistent – for example, if it were marked by a different marker or taken again. A much more useful concept than accuracy.
In this context, ‘accuracy’ is a misleading word to use – there is no ‘right’ grade to which any actual grade might meaningfully be compared. Furthermore, although defining ‘right’ as ‘that determined by a senior examiner’ might be convenient in a research project, from a practical standpoint, it is singularly unhelpful. In general, ‘senior’ examiners do not mark scripts, and even if they did, there is still the possibility that different ‘senior’ examiners would give the same script different marks, and hence grades.
Much better than ‘accuracy’ is ‘reliability’, which measures the probability that a subsequent fair re-mark would confirm, rather than change, the originally awarded grade.
In essence, reliability is all about the reassurance associated with an expert second opinion. Whereas ‘accuracy’ requires an independently verifiable correct result, ‘reliability’ is simply the comparison between two outcomes – the grade resulting from one examiner’s mark and that resulting from another’s. There is no requirement for either examiner to be ‘senior’ and, operationally, ‘reliability’ is very easy to measure – simply have the same script marked twice (or more), independently, and see whether the resulting grades agree.
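Measured this way, reliability is just the proportion of double-marked scripts whose two markings lead to the same grade. A minimal sketch, with made-up grades purely to illustrate the calculation:

```python
# Each pair is (grade from the first marking, grade from an independent re-mark).
# The data are invented purely to illustrate the calculation.
double_graded = [("B", "A"), ("C", "C"), ("A", "A"), ("B", "C"), ("B", "B")]

same = sum(first == second for first, second in double_graded)
reliability = same / len(double_graded)
print(f"reliability = {reliability:.0%}")   # 60% for this toy sample
```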
Considered in this context, Ofqual’s research determined not so much the ‘accuracy’ of grades, but rather the reliability of grades with respect to the special case in which all scripts were re-marked by a ‘senior’ examiner. But what would the measures of reliability be if all the scripts had been re-marked by an ‘ordinary’ examiner? Ofqual’s research cannot answer this question. But in practice, there is a high likelihood that any script is marked by an ‘ordinary’ examiner (or team). Since it is a lottery as to which particular ‘ordinary’ examiner marks any particular script, two important, real questions are:
- What is the probability that the grade actually awarded would be the same had a different ‘ordinary’ examiner marked that script?
- How trustworthy is the grade shown on the certificate?
Conclusions
In connection with exams in general, and grades in particular, terms such as ‘accurate’, ‘definitive’, ‘true’, ‘right’ (and their converses) have no benchmark or relevance, and therefore should not be used. Rather, the most appropriate term to represent the trustworthiness of grades is ‘reliable’ – where reliability is the likelihood that an originally awarded grade would be confirmed, and not changed, as the result of a fair re-mark by another examiner.
Ofqual should measure, and publish in their annual statistics, the reliability of the grades associated with every exam subject, analysed by exam board. As a matter of urgency, a project should be carried out to determine how assessments might be made significantly more reliable – ideally, approaching 100 per cent reliability for every subject (Sherwood, 2019b).