**Dylan Wiliam, UCL Institute of Education, UK**

**Introduction**

There is increasing consensus, among both researchers and policymakers, that teacher quality is both variable and consequential: some teachers are more effective than others, and the differences are large and matter greatly for the students they teach.

Predictably, this has led to a variety of proposals for improving teacher quality. Some, such as Eric Hanushek (2004), have argued for policies that aim to replace departing teachers with teachers who are more effective, often combined with proposals to accelerate the process by systematically removing less effective practitioners. However, such proposals only work if we can accurately identify more and less effective teachers. The argument of this article is that such policies are unlikely to be effective because we are unable to identify the effectiveness of individual teachers with any accuracy, and therefore the only way to improve teacher quality is to invest in improving the quality of teachers already in post. Of course, what we are really interested in is teach*ing* quality, but as we shall see, our measures of the quality of teaching are weak, so the best that we can do is to focus on the quality of individual teachers as a proxy for teaching quality.

For many in the education system, the idea that we cannot identify better teachers with any accuracy seems nonsensical, but there is now a large body of evidence to suggest that we cannot identify effective teachers by observing teachers, nor can we do so by looking at changes in pupil achievement – what is sometimes referred to as ‘value-added’. These two claims are discussed in turn below.

**We can’t identify good teachers by observing them**

One of the most enduring myths in education is that we can ‘know good teaching when we see it’. It might not be surprising that less experienced teachers are unable to distinguish between more and less effective teaching, but the evidence we have suggests that school leaders are unable to do so with any accuracy either.

In one study (Strong et al., 2011), 165 school leaders (heads and deputies) from California and Texas were shown eight videos of teachers teaching a lesson on fractions to 10- and 11-year-old pupils. Four of the teachers had been chosen because their pupils had made significantly more progress (about two to three more months a year) than the average for their local authority in *each* of the previous three years. The other four had been chosen because they had been consistently *less* effective than average, by a similar margin, for each of the previous three years.

The school leaders were then asked to identify, for each of the eight videos, whether the teacher in that video was one of the more effective or one of the less effective teachers, and their responses showed a reasonable degree of consensus; the judges agreed almost 75 per cent of the time. However, while there was consensus among the judges, their responses were rather inaccurate. If each of the judges had flipped a coin, they would have selected the correct response four times out of eight on average. The average number correct was, in fact, 3.85. In other words, in this study, heads and deputies could not even reach chance levels in identifying the more effective teachers. They would, in fact, have been better off flipping a coin.
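The chance-level benchmark here is simple binomial arithmetic, and it can be checked with a quick simulation (the judge and video counts are taken from the study as reported above; everything else is illustrative):

```python
import random

# Each judge classifies 8 videos. Pure guessing gives each video a 50%
# chance of being labelled correctly, so the expected score is 4 out of 8.
random.seed(42)
n_judges = 165
trials = 2_000

averages = []
for _ in range(trials):
    scores = [sum(random.random() < 0.5 for _ in range(8))
              for _ in range(n_judges)]
    averages.append(sum(scores) / n_judges)

mean_by_guessing = sum(averages) / trials
print(f"Average score by coin-flipping: {mean_by_guessing:.2f} out of 8")
# The observed average of 3.85 sits just below this chance level of 4.0.
```

The point of the sketch is that 3.85 is not merely unimpressive; it is below what 165 coin-flippers would achieve on average.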

Of course, this is just one study, and it is important to note that the heads and deputies received no special training in teacher observation for this study, but even when judges *are* given special training, the results are not much better.

Charlotte Danielson’s (1996) Framework for Teaching (FFT) is probably the best-validated teacher observation system in widespread use, and it represents an important achievement, in that the ratings of trained observers do correlate positively with the progress made by students. However, the correlation is not particularly high. One study in Chicago (Sartain et al., 2011) examined the relationship between the four-point ratings that teachers received (below basic, basic, proficient, distinguished) and the progress made by pupils. The good news is that pupils taught by higher-rated teachers did make more progress: pupils taught by ‘distinguished’ teachers made on average approximately 30 per cent more progress than those taught by teachers who were ‘below basic’. This is an important finding, since many previous studies have failed to find any relationship between teacher observations and pupil progress. However, it is also important to bear in mind that the FFT captures only a small proportion of the variation in teacher quality.

To see why, it is useful to look at studies of teacher effectiveness in the USA. One such review (Hanushek and Rivkin, 2010) suggests that if we divided teachers into quartiles (i.e. four equal-sized batches) of quality, then pupils taught by the most effective 25 per cent of teachers would make more than twice as much progress each year as those taught by the least effective 25 per cent (Wiliam, 2016). If the most effective teachers are more than twice as effective as the least effective, but the highest-rated teachers are only 30 per cent more effective than the least effective, then the FFT is capturing only between a quarter and a third of the variation in teacher quality.
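The ‘quarter to a third’ figure follows from back-of-the-envelope arithmetic, sketched here with illustrative numbers consistent with the text (units are ‘years of progress per year’; the exact range is an assumption):

```python
# Illustrative arithmetic only, not data from the studies cited.
bottom_quartile_progress = 1.0   # baseline: least effective 25% of teachers
top_quartile_progress = 2.0      # most effective 25%: at least twice as much
full_range = top_quartile_progress - bottom_quartile_progress  # >= 1.0 year

# 'Distinguished' vs 'below basic' on the FFT: roughly 30% more progress.
fft_gap = 0.30 * bottom_quartile_progress

share_captured = fft_gap / full_range
print(f"Share of the quality range captured by the FFT: {share_captured:.0%}")
# With a range of exactly 1.0 the FFT captures about 30%; since the true
# range is 'more than twice', the share falls towards a quarter.
```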

As well as being rather unreliable, ratings of teachers from observations are also skewed. Matthew Steinberg and Rachel Garrett (2016) looked at the ratings given to teachers using the FFT in terms of the prior achievement of the pupils being taught. Middle school mathematics teachers teaching students in the top 20 per cent of achievement were over *six* times as likely to receive the highest rating as those teaching students in the bottom 20 per cent of achievement. Put bluntly, we are unable to disentangle the relative contributions of teachers and pupils; every teacher looks better when teaching higher-achievers.

For many people, these results seem hard to accept, but there are two important points to bear in mind when thinking about observations of teaching that illustrate just how hard it is to judge the quality of teaching by observation. The first is that judging the quality of teaching is inherently difficult because we are generally not interested in whether pupils can do what they have been taught at the end of the lesson – what psychologists call *performance* in the learning task. We are interested in what they can do some time later – *learning*. Trying to predict how much of what pupils are taught in a lesson they will remember in (say) four weeks’ time is challenging.

The second point is that the quality of performance in a learning task is often a poor guide to what will be learned by successfully completing that task. As Soderstrom and Bjork (2015) point out in their review of the relationship between learning and performance, ‘certain experimental manipulations have been shown to confer opposite effects on learning and performance, such that the conditions that produce the most errors during acquisition are often the very conditions that produce the most learning’ (p. 176).

Because of these problems with teacher observations, some have suggested relying instead on test scores, but again the available evidence suggests that this is – if not impossible – very difficult.

**We can’t identify good teachers from changes in test scores**

The idea of identifying effective teachers from changes in test scores is attractive. Test students at the beginning of the year and test them again at the end, and then see whose students have made the most progress. Such approaches are obviously limited, in that they tend to focus on the ability of teachers to improve progress on the things that are easy to assess, but even if those assessments cover everything that we care about, measuring the ‘value-added’ by individual teachers is much more difficult than it first appears, for a variety of reasons.

The first is that the results we get depend greatly on the assumptions that we make. One study (Goldhaber et al., 2013), involving 212 high school teachers, looked at the progress made by students in English, mathematics and science. In their analysis, the researchers needed to make decisions about which statistical models to use. (For the statistically minded, they tried both a fixed-effects and a random-effects model.) Nineteen of the teachers who had been assessed as being in the top 20 per cent in one model were judged to be in the worst 20 per cent in the other model.

Further evidence of the challenges of measuring the ‘value-added’ by a teacher comes from a study of teachers in New York City (Bitler et al., 2021). The researchers compared the impact of teachers on student achievement with the impact that teachers had on something that they could not plausibly affect: student height. They found that the range of effects of teachers on pupil height was only 20 per cent less than the effect on achievement in mathematics and reading, which suggests that much of the variation we see in individual teacher quality is caused simply by natural year-to-year variation.
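The logic of that placebo test can be sketched with a toy simulation (hypothetical numbers, not the study’s data): assign pupils to teachers at random and compute per-teacher ‘effects’ on height, an outcome teachers cannot plausibly influence.

```python
import random
import statistics

random.seed(1)
n_teachers = 50
pupils_per_class = 25

# Pupil heights (cm) are drawn independently of teacher assignment,
# so any per-teacher 'effect' is pure sampling noise.
overall_mean = 160.0
teacher_effects = []
for _ in range(n_teachers):
    heights = [random.gauss(overall_mean, 8.0) for _ in range(pupils_per_class)]
    teacher_effects.append(statistics.mean(heights) - overall_mean)

spread = statistics.stdev(teacher_effects)
print(f"Std dev of spurious teacher 'effects' on height: {spread:.2f} cm")
# Teachers do not affect height, yet class averages still vary noticeably;
# value-added estimates from similarly small classes carry the same noise.
```

The design choice matters here: because classes are small, class-to-class averages vary even when teachers contribute nothing, which is exactly why apparent ‘teacher effects’ on height emerge.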

The volatility of teacher value-added ratings is starkly illustrated by Cathy O’Neil in her book *Weapons of Math Destruction* (2016). She tells the story of Tim Clifford – a middle school English teacher in New York City with 26 years of experience. The city had introduced a value-added rating system for its teachers, and in the first year of its implementation, Tim scored six on a zero-to-100 scale. If he hadn’t had a permanent contract, he would probably have been fired on the spot, but since he did, he kept his job. There was nothing in the evaluation that gave him any indication about what he needed to do to improve his teaching, so the following year, he taught in exactly the same way, and got a score of 96. The problem with these ratings, even if we accept that the scores capture all we care about in terms of student progress, is that the errors of measurement are of the order of 40 to 50 points on a 100-point scale. A teacher who is really weak is unlikely to get a score over 90, and a really strong teacher probably won’t get a single-digit score, but that’s about it. As if this weren’t bad enough, there is a more serious underlying issue that renders the whole business of trying to measure the quality of an individual teacher even more difficult, and that is the difference between short-term and long-term effects.
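A measurement error of that size can be made concrete with a small sketch (the noise level is an assumption chosen to match the 40-to-50-point error band described above; the scoring function is hypothetical, not the city’s actual model):

```python
import random

random.seed(7)

def observed_score(true_quality, error_sd=25.0):
    """Hypothetical 0-100 value-added score: true quality plus large noise.
    An error band of 40-50 points corresponds roughly to +/-2 sd here."""
    score = random.gauss(true_quality, error_sd)
    return max(0.0, min(100.0, score))  # clamp to the reporting scale

true_quality = 50.0  # an exactly average teacher, unchanged year to year
yearly_scores = [observed_score(true_quality) for _ in range(10)]
print("Ten years of scores for the same teacher:",
      [round(s) for s in yearly_scores])
# With noise this large, near-floor and near-ceiling scores can both occur
# for the same teacher teaching in exactly the same way.
```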

The US military provides an interesting context for educational research, since it frequently allocates students to instructors at random, uses the same courses and teaching materials, and assesses students using the same examinations administered during a common testing period. One study (Carrell and West, 2010) looked at the performance in calculus courses of 10,534 students over an eight-year period, from the autumn of 2000 to the spring of 2007. They found that students taught by less experienced, less qualified instructors did better on their end-of-course examinations, and rated their instructors more positively than students taught by more qualified, more experienced instructors. However, the students taught by more experienced instructors did better on follow-on courses. The less experienced instructors were preparing students for this year’s test; the more experienced instructors were preparing students for *next* year’s test. Every teacher builds on the foundations laid by their predecessors, so the idea of apportioning the progress made by pupils at school to different teachers is not just hard; it is, in principle, impossible.

**So what does this mean?**

The limited validity of evaluations of teachers, whether from observations, test scores or even from combinations of these (Wiliam, 2016), means that any attempt to improve education with sanctions and rewards, or by retaining some teachers and removing others, is unlikely to succeed. And even if these attempts are successful, the impact on pupil achievement will be very small. One study in Florida (Winters and Cowan, 2013) found that if the 25 per cent of teachers with the lowest value-added ratings were removed and replaced with average teachers, the net impact on student achievement would be an extra three or four days’ learning per year, because many of those rated as being in the bottom 25 per cent would, in fact, be above average. What we can do, instead, is to create a culture where every teacher is expected to improve, not because they are ineffective, but because they can be even better.

At this point, a counterargument is often raised: how can we improve teachers if we don’t know how good they are? The answer is this: if we think of teacher quality as a continuum, we now know that we are unable to locate a particular teacher along that continuum with any accuracy, but we do know which way is better. If we focus our energies not on evaluating teachers but on improving them, we are far more likely to improve the quality of education that our pupils receive.