Impact Journal

Using third-party assessments: Deciding when to trust online test scores

Clare Walsh, University of Southampton, UK

Educational assessments are, at the best of times, an imprecise science. For many teachers today, regular ongoing assessment can improve their practice. Once success criteria have been shared and the lesson delivered, it makes sense to check what progress students and their teachers are making against classroom goals. Yet creating, delivering and marking tests remains time-consuming, and successful assessment design requires a set of skills that are given scant attention in initial teacher training courses (Carter, 2015). Third-party assessment tools, particularly those delivered online, have introduced a route for teachers to subcontract that process and focus on teaching and learning. As with all assessments, though, there are trade-offs, and this article briefly summarises five considerations when working with online assessment tools.

Consistency of assessment paradigm matters

Not all assessments are designed with the same paradigm, and results from one approach may not predict success in another. Since the explosion in computational processing power, the psychometric paradigm has grown in popularity. This approach uses sophisticated data analytics to give authority to testing processes. When we test, we hope that any variance in test scores reflects ability, but that is not always the case. By quantifying different types of variance, comparisons in standards between tests can be made and poorly worded questions removed. The downside is that this paradigm not only lays out the process of obtaining scores as a mathematical function of performance data, but it also dictates the kinds of questions that can be asked (Baird and Opposs, 2018). The popularity of psychometrics accounts for the many tests with 20 to 40 isolated questions, possibly in short answer or multiple-choice form.

In contrast, a curriculum-based paradigm, such as the approach in the Welsh Skills Challenge Certificate (SCC), lays out learning objectives and then a means of assessing those objectives is found. In the SCC, skills like collaboration or management of complex group projects are suited to an observational approach, scored against descriptors of success. Success in one paradigm does not automatically predict success in another.

The architecture behind online learning platforms favours psychometric approaches. In general, psychometric-style questions are suited to the binary yes/no record-keeping of computing. Kahoot tests, for example, are exclusively multiple choice and often focused on factual recall. Even if short-answer approaches align to the goals that you have for your students, the machine may be coded with certain peculiarities affecting scores. Answers may be sensitive to accurate spelling, upper and lower case or a random space bar tapped somewhere in an otherwise correct answer. More sophisticated approaches may use natural language processing (NLP), a subfield of artificial intelligence (AI), to score free text in spoken and written form. As a mature technology, it is already used in several high-stakes assessments with adults, such as the PTE-A, a language assessment used widely as an entry test in higher education and some international immigration processes. NLP is likely to become more prevalent in homework tools soon.

When trialling a new online test environment, it is worth considering whether the goals of the tool align to your own goals. It is also worth putting the machine through some tests for response tolerance. Does the machine punish hitting the space bar or mixed cases? Do AI scoring processes recognise strong regional accents or very softly spoken children? Is the scoring biased against female voices?
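The kind of response-tolerance check described above can be sketched in a few lines of Python. This is only an illustration: the two marking functions and the probe answers are invented for the example and do not describe how any particular platform scores responses.

```python
import unicodedata


def strict_match(response: str, expected: str) -> bool:
    """Character-for-character comparison, as an unforgiving marker might do."""
    return response == expected


def tolerant_match(response: str, expected: str) -> bool:
    """Ignore case and leading, trailing or repeated spaces before comparing."""

    def normalise(text: str) -> str:
        text = unicodedata.normalize("NFKC", text)
        return " ".join(text.split()).casefold()

    return normalise(response) == normalise(expected)


# Hypothetical probe answers a teacher might type into a tool under trial.
expected = "Paris"
probes = ["Paris", "paris", " Paris ", "Par is", "Pariss"]

for probe in probes:
    print(repr(probe), strict_match(probe, expected), tolerant_match(probe, expected))
```

Even the tolerant marker here rejects a space typed in the middle of the word and a misspelling, so it is worth probing a real tool with each kind of slip separately.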

Features of the test presentation can interfere with performance scores

Success can be defined in multiple ways, and some features of online test tools can distort judgments of ‘success’. Some assessments are criterion-referenced; in other words, there is a syllabus with a list of learning criteria and student performance is referenced solely against that list. If all the students have achieved all the learning targets, they all get an A*. Other tests are cohort- or norm-referenced, meaning that the children’s ability is referenced against other children taking the test at the same time (Baird, 2018). In such tests, there can only be a limited number of winners. The purpose of school leaving exams is largely to stratify learners into different levels of ability, but teachers generally want progress tests to be criterion-referenced. Their goal, after all, is for every learner to perform well.
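The difference between the two kinds of referencing can be made concrete with a short Python sketch. The student names, marks, pass mark and the share of 'winners' are all invented for the illustration, and real grading schemes are more elaborate than this.

```python
def criterion_referenced(scores, pass_mark):
    """Every student who reaches the pass mark has met the criterion."""
    return {name: score >= pass_mark for name, score in scores.items()}


def cohort_referenced(scores, top_fraction):
    """Only a fixed share of the cohort can 'win', whatever the raw marks.

    Ties are broken by the order in which students are listed, which is
    itself one of the arbitrary features of this approach.
    """
    n_winners = max(1, round(len(scores) * top_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    winners = set(ranked[:n_winners])
    return {name: name in winners for name in scores}


# Invented marks out of 10 for a class that has learned the material well.
scores = {"Ana": 9, "Ben": 9, "Cal": 8, "Dee": 8}

print(criterion_referenced(scores, pass_mark=8))
print(cohort_referenced(scores, top_fraction=0.25))
```

With these marks, every student meets the criterion, but only one of the four can be placed in the top quarter.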

A cohort-referenced approach can feature in online practice tests, particularly in online games, in the guise of competitive or collaborative tasks. Real-time competition and collaboration make games highly engaging and encourage repeated play. If the goal is to encourage children to practise something like number bonds until they can be recalled without hesitation, this feature supports those goals. If, however, you need to test that knowledge, and not practise it, there may be problems. Scores will be conditioned on the ability of the competitor or collaborator, as well as the person you intend to measure. An able competitor will be much harder to defeat and vice versa.

In one game we looked at, we found that robot competitors (pre-programmed responses disguised as a human competitor) introduced more stability. However, robots could be challenging or overly lenient opponents, and that too can distort impressions of ability. For example, children playing at irregular times were more likely to play the highly able robot rather than another child, and subsequently appeared weaker.
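The dependence of a score on the opponent can be illustrated with a simple model. The logistic formula below is an assumption borrowed from paired-comparison rating systems; it is not the scoring rule of the game discussed above, and the ability values are invented.

```python
import math


def win_probability(player_ability, opponent_ability):
    """Chance that the player beats the opponent under an assumed logistic model.

    Abilities are on an arbitrary scale; only the difference between them matters.
    """
    return 1.0 / (1.0 + math.exp(-(player_ability - opponent_ability)))


# The same child (ability 0.0) facing three hypothetical opponents.
child = 0.0
opponents = {"lenient robot": -1.0, "evenly matched child": 0.0, "able robot": 1.0}

for label, ability in opponents.items():
    print(f"{label}: {win_probability(child, ability):.2f}")
```

Under this assumption, the same child wins about three games in four against the lenient robot and about one in four against the able robot, although nothing about the child has changed.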

Not all questions are equally difficult and not all tests are fair

It is not possible to make all tests of the same curriculum topic equally hard. Careful thought goes into syllabus design to present ideas in stages of increasing difficulty. However, the difficulty of test questions within the same component of a syllabus can vary. The common classroom practice of awarding a constant value of ‘1’ for every tick gives the impression that we are measuring units of cognition (Almond et al., 2015). The assumption that all points are equally hard to achieve may be misleading when, perhaps more accurately, one correct answer deserves a value of 0.4 and another a value of 2.3.
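One rough way to see that ticks are not equal units is to convert the proportion of students answering each question correctly into a log-odds difficulty. The sketch below uses that transformation only as a first approximation; it is not a full Rasch calibration, it gives positions on a scale rather than point values, and the counts are invented.

```python
import math


def logit_difficulty(n_correct, n_students):
    """Log-odds of an incorrect answer: higher values mean a harder question."""
    if n_correct <= 0 or n_correct >= n_students:
        # No estimate is possible when everyone or no one succeeds.
        raise ValueError("difficulty is undefined for 0% or 100% correct")
    proportion = n_correct / n_students
    return math.log((1 - proportion) / proportion)


# Invented results for a class of 10: question label -> number of correct answers.
correct_counts = {"Q1": 9, "Q2": 5, "Q3": 1}

for question, n_correct in correct_counts.items():
    print(question, round(logit_difficulty(n_correct, 10), 2))
```

Each question here would earn one tick, yet the estimated difficulties are spread widely on either side of zero.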

Meaningful data about the quality and level of the test can be gained from an organised data table of performance in any one test, known as a scalogram (Guttman, cited in Bond and Fox, 2015). Students listed at the top of Table 1 are higher attaining, and questions listed to the left are easier; in other words, more students got the question correct. Each tick signifies a successful completion of the questions.

Table 1: Scalogram: An organised data matrix for student performance in different tasks


Questions 3, 6 and 8 all seemed too challenging for this group of students, with the exception of Aaliyah, whereas question 5 was very easy. In most tests, the ticks will line up in the form of a triangle, with some fluctuation around the limen or the hypotenuse line (Figure 1.1).

Figure 1: Possible patterns from the data matrix


If the ticks dominate the scalogram space (Figure 1.2), the test was easy or perhaps the teaching goals were fully met. If the triangle is very small (Figure 1.3), the test was difficult. The more worrying pattern would be missing ticks in the dark triangle, where easy questions were answered incorrectly by able students, or ticks in the blank space, where lower attaining students successfully guessed (Figure 1.4). These both indicate a problem with the testing process that needs to be addressed before the students can be measured. In Table 1, for example, performance in question 1 was slightly random, but making hasty and poor choices in questions that appear early in a test is not uncommon. Running out of time in later questions, like question 10, might also be expected. Question 9, however, has a very random pattern and the most plausible explanation is that this question was poorly worded or designed. While it would be burdensome to carry out this analysis on every test, it may be worth completing as part of a trialling process with new online test tools.
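For anyone wishing to try this as part of a trial, the ordering and a simple check for erratic questions can be sketched in Python. The data below are invented and are not the results shown in Table 1; counting 'inversions' is only one possible way of flagging a question whose ticks fall outside the expected triangle.

```python
def build_scalogram(results):
    """Order students from highest to lowest total and questions from easiest to hardest.

    results maps each student to a dict of question -> 1 (tick) or 0 (no tick).
    """
    students = sorted(results, key=lambda s: sum(results[s].values()), reverse=True)
    first_row = next(iter(results.values()))
    questions = sorted(
        first_row, key=lambda q: sum(results[s][q] for s in results), reverse=True
    )
    return students, questions


def inversions(results, students, question):
    """Count pairs where a lower-ranked student got a question right
    that a higher-ranked student got wrong."""
    count = 0
    misses_above = 0
    for student in students:
        if results[student][question] == 0:
            misses_above += 1
        else:
            count += misses_above
    return count


# Invented class results, deliberately listed in no particular order.
results = {
    "Cal": {"Q1": 1, "Q2": 1, "Q3": 0, "Q4": 0, "Q5": 0},
    "Eve": {"Q1": 0, "Q2": 0, "Q3": 0, "Q4": 0, "Q5": 0},
    "Ana": {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 1, "Q5": 0},
    "Dee": {"Q1": 0, "Q2": 0, "Q3": 0, "Q4": 0, "Q5": 1},
    "Ben": {"Q1": 1, "Q2": 1, "Q3": 1, "Q4": 0, "Q5": 0},
}

students, questions = build_scalogram(results)
print("     " + " ".join(questions))
for student in students:
    row = "  ".join("✓" if results[student][q] else "." for q in questions)
    print(f"{student}  {row}")

for question in questions:
    print(question, "inversions:", inversions(results, students, question))
```

In this invented class, four questions follow the triangle exactly, while the last question is answered correctly only by a lower-attaining student and so stands out for review.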

This simple check of test quality can become problematic with online activities, though, because random presentation of questions is a feature of good online educational design. It encourages repeated attempts, but it also means that you will not know which questions your student was asked. It would be unethical to take a cohort-referenced approach under these circumstances and use the output to make comparisons between students, as one student may have been given all the tricky questions.

Student intent is challenging to establish

Children do not naturally know how to complete school leaving exams. It is a process into which they must be socialised. In Willy Russell’s 1980 stage play Educating Rita, the lead character, a hairdresser, gets into higher education through an unconventional route. When set an essay task, ‘Suggest how you would resolve the staging difficulties inherent in a production of Ibsen’s Peer Gynt’, Rita submits a one-sentence response: ‘Do it on the radio’ (Russell, 2000). It is a classic moment, when the audience realises that there is a right way to be a student and another way to be right. In the wider field of educational practice, it can be easy to overlook the fact that taking a test requires cultural rules to be internalised (Mislevy, 2018). In competitive online tests and games, we observed strong students allowing their opponent to win, or choosing to quit and start over again if they felt that their current performance was disappointing, and anecdotally, others simply wanted to see what happened when they lost. In other words, the rules of being an educational game player/test taker may not yet have been shared. If intended results are needed, sharing the expectations of how the exercise or game might be completed is fairer.

Test questions may not represent ability across the whole syllabus

Online tests and games tend to go deeply into one area of a syllabus, rather than widely across the whole syllabus. The spread of curriculum coverage in any test is affected by reasonable limits on how long we can ask someone to sit and complete test questions. A longer test will produce additional evidence of skills but, as cognitive fatigue sets in, with diminishing returns in quality (Sievertsen et al., 2016). There may be even further restrictions on curriculum coverage in online tests and games. Coding highly engaging environments requires significant resources (Fullerton, 2014), and curriculum coverage may be sacrificed. There are, for example, a very large number of games, such as Memrise, that teach and test entry-level languages. As the curriculum becomes broader, coverage starts to become patchier. Children may also exercise their free choice in fun learning environments to avoid the tasks that are not fun, thus specialising even further. This will lead to gaps in the curriculum or syllabus for which there is no data.
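Where a tool lets you export a record of what each student attempted, a quick coverage check can show where those gaps lie. The sketch below assumes such a record exists as a simple list of topic names; the syllabus topics and the attempt log are invented for the example.

```python
from collections import Counter


def coverage_report(syllabus, attempts):
    """Count attempts per syllabus topic, including topics never attempted."""
    counts = Counter(topic for topic in attempts if topic in syllabus)
    return {topic: counts.get(topic, 0) for topic in syllabus}


# Hypothetical syllabus and one student's log of attempted topics.
syllabus = ["number bonds", "times tables", "fractions", "measures"]
attempts = ["number bonds", "number bonds", "times tables", "number bonds"]

report = coverage_report(syllabus, attempts)
gaps = [topic for topic, n in report.items() if n == 0]

print(report)
print("No data for:", gaps)
```

Any topic listed as having no data needs to be assessed by another route before a judgement is made about the student's progress across the syllabus.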


The aim of this article is not to deter teachers from using online third-party assessments. They are a useful resource when used well and can remove repetitive tasks that detract from quality teaching. All assessment is a series of difficult compromises, though, and it is important that stakeholders understand the limitations of the tools that they are using.


Almond RG, Mislevy RJ, Steinberg LS et al. (2015) Introduction. In: Bayesian Networks in Educational Assessment. New York: Springer, pp. 3–18.

Baird J (2018) The meaning of national examinations standards. In: Baird J, Isaacs T, Opposs D et al. (eds) Examination Standards: How Measures and Meanings Differ Around the World. London: IOE Press, pp. 284–306.

Baird J and Opposs D (2018) The standard setting project: Assessment paradigms. In: Baird J, Isaacs T, Opposs D et al. (eds) Examination Standards: How Measures and Meanings Differ Around the World. London: IOE Press, pp. 2–25.

Bond T and Fox C (2015) Applying the Rasch Model: Fundamental Measurement in the Human Sciences, 3rd ed. New York: Routledge.

Carter A (2015) Carter review of initial teacher training. Available at: (accessed 12 January 2021).

Fullerton T (2014) Game design workshop: A playcentric approach to creating innovative games. Florida, US: CRC Press.

Mislevy RJ (2018) Socio-Cognitive Foundations of Educational Measurement. Abingdon: Routledge.

Russell W (2000) Educating Rita. Harlow: Longman. Act 1, Scene 3.

Sievertsen HH, Gino F and Piovesan M (2016) Cognitive fatigue influences students’ performance on standardized tests. Proceedings of the National Academy of Sciences 113(10): 2621–2624.
