The four pillars of assessment: What does a focus on validity, reliability, purpose and value in assessment practice look like on the ground?

Samantha Franklin, Assistant Headteacher, Long Stratton High School, UK

As Rob Coe (2017) describes, assessment is ‘one of those things that you think you know what it is until you start to think really hard about it’ (as cited in Kime, 2017, p. 13). Used well, assessment can enhance learning and raise attainment; yet too often, assessment practices within schools are distorted by external demands for assessments to be all things to all stakeholders (Broadfoot et al., 1999).

This article summarises a critical evaluation of our own assessment practices through the lens of the ‘four pillars of assessment’ (Evidence Based Education (EBE), 2018), and the changes that we subsequently implemented towards a more evidence-informed approach. At the time, our assessment model was based on half-termly, high-stakes summative assessments, each of which generated a GCSE-equivalent grade or subgrade. These fed into a whole-school flightpath system, which assumed linear progression from Year 7 to 11 towards a GCSE target grade.

The four pillars of assessment (EBE, 2018)

  • validity
  • reliability
  • purpose
  • value

There is a common misconception that we can design assessments that are entirely valid or reliable when, in fact, validity and reliability both depend on the inferences drawn from any assessment (EBE, 2018). Effective assessment design can increase the precision and consistency of the measurements generated (reliability). However, to maximise reliability, we must also interpret those outcomes as accurately and consistently as possible. In turn, this increases the value and relevance of the information gathered for its intended function (validity).

This function (or purpose) is key when designing assessments, to ensure that they elicit high-quality information appropriate for the agreed end use (EBE, 2018). For example, in our model, using high-frequency assessments to fulfil both summative and formative purposes had diluted the validity and reliability of the inferences made for both (Christodoulou, 2016). This also reduced the value of these assessments, as the costs of deploying them (such as high teacher workload and a disproportionate share of curriculum time) were harder to justify given the weak inferences they produced.

Performance vs learning

Inadvertently, our half-termly assessments had been measuring student performance (temporary variations in knowledge and skills that are observed shortly after acquisition) rather than learning (relatively permanent changes to long-term memory) (Soderstrom and Bjork, 2015). The assumption that we had accurate information about the progress of students meant that we used valuable time, energy and resources on addressing needs that were less stable than we had believed. This pattern was reflected at whole-school level, where we regularly observed vast fluctuations in the outcomes of some subjects, making it difficult to react meaningfully to patterns or trends.

Our first steps were to reduce summative assessment points from half-termly to termly and to work with middle leaders on redesigning assessments to test cumulative content from across the curriculum (rather than only recently completed content). Both changes increased the validity and reliability of inferences made about learning, improving their value when informing responsive teaching. For example, by assessing a greater sample of the subject domain, we were able to identify gaps over a broader range of knowledge and skills, which could then be addressed promptly within the context of the whole curriculum (up to that point).

Greater spacing between assessments increased the likelihood that a correct response was representative of a change in long-term memory rather than performance, reducing the risk of erroneously judging learning to have taken place (and thus not addressing potential learning gaps). Similarly, poor responses were more likely to indicate established misconceptions or gaps in understanding rather than variations in performance, allowing us to target content that would have the greatest impact on future learning.

More widely, these changes freed up curriculum time, created space for departments to design and implement formative assessments that better support learners (rather than simply populate a spreadsheet), and reduced the extent to which assessment points drove curriculum sequencing decisions.

Assessment design

To build on this opportunity, in the second year we initiated a year-long, whole-staff CPD programme on assessment design, particularly focused on diagnostic formative assessment. Exploring various methods of high-frequency, low-stakes diagnostics generated valuable information about student learning, with multiple-choice questions favoured for their speed and objectivity.

In our third year, we have looked to increase assessment value by expanding the formative use of comparative judgement for extended writing tasks, as well as for the purpose of generating summative inferences in English and art. Benefits of this have included increased reliability and validity of formative inferences made, collaborative moderation and reduced teacher workload (Christodoulou, 2020). As part of the Assessment Lead Programme, tools such as assessment blueprints and reliability calculators have been used by middle leaders to increase the strength of summative inferences at Key Stage 4, adapting and refining assessments to increase reliability and validity (EBE, 2021).
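
By way of illustration, the sketch below shows the kind of internal-consistency check (Cronbach’s alpha) that a reliability calculator typically performs on item-level marks. It is a hypothetical Python example with invented marks, not the tool used on the Assessment Lead Programme.

```python
# Hypothetical illustration of an internal-consistency (Cronbach's alpha) check,
# the kind of calculation a reliability calculator might run on item-level marks.
# Marks and structure are invented for the example.

def cronbach_alpha(scores):
    """scores: one list per student, each containing that student's per-item marks."""
    n_items = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each item's marks across the cohort
    item_variances = [variance([student[i] for student in scores])
                      for i in range(n_items)]
    # Variance of students' total scores on the assessment
    total_variance = variance([sum(student) for student in scores])

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Example: five students' marks on a four-item assessment
marks = [
    [2, 3, 1, 4],
    [1, 2, 1, 3],
    [3, 3, 2, 4],
    [0, 1, 0, 2],
    [2, 2, 1, 3],
]
print(f"Cronbach's alpha: {cronbach_alpha(marks):.2f}")
```

A figure like this is only one input into judgements about an assessment: a lower value may simply reflect a deliberately broad sample of the subject domain rather than a poorly designed paper.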

From grades to percentages

Under the four pillars of assessment, it was clear that we couldn’t continue tracking students from Year 7 against GCSE target grades or seek to allocate GCSE grades to individual pieces of work. GCSE grades are not absolute measures but relative ones, designed to be meaningful only at the very end of a programme of study. Using them to denote progress in any single task (whether derived from GCSE content or not) is inaccurate and thus lacks reliability (Sherrington, 2020).

Inferences must also be consistent to be reliable. We found that vast subject-specific variance in assessment design and success criteria (evident in the considerable disparity between what pupils were expected to demonstrate to achieve equivalent grades in different subjects) had decreased reliability to the extent that progress comparisons across subjects, at cohort and student level, were meaningless. Finally, grade-orientation had harmed the process of learning itself: pupils fixated on assessment grades, often switching off from reflecting on areas for future growth where targets were reached, and showing demotivation or fear of failure where they were not (Kohn, 2011).

Following collaboration with a working group, we arrived at a whole-school summative assessment framework underpinned by percentages. When an assessment is returned to a student, the percentage indicates nothing more than the extent to which they have mastered the knowledge and skills assessed in that specific task. Although by no means perfect, we believe this system offers a transparency and commonly understood language that was previously missing.

We have found percentages helpful in focusing student and parental conversations on which aspects of learning went well, what the learning gaps are and what students need to do to address these in the future. To consolidate this, each formal assessment is accompanied by FAR tasks (feedback, action, response), which identify the highest-priority gaps in transferable knowledge or skills for students to then engage with and improve upon through both teacher guidance and independent work. To monitor the impact of this, we have collected feedback in a number of ways, including quality assurance processes, parental voice and staff survey feedback.

Reporting to parents and students

Key Stage 3 assessment information is reported to parents in the form of assessment percentage scores, the cohort average and the previous percentage, by subject. The latter two were included in direct response to parental consultation feedback, which prioritised understanding a child’s score in relation to both their peers’ scores and their own past assessment scores. These are accompanied by a ‘commitment to learning’ score per subject, alongside attendance and behaviour data. Annually, students and parents also receive one-line subject reports, which give specific, actionable targets to support further learning.

Internal tracking

Whilst this new framework has restored assessment for formative (rather than summative) purposes as the primary catalyst for system design, there remains a need for summative tracking and monitoring.

For this purpose, each student is assigned a cohort ranking based on their Key Stage 2 data. English and maths rankings are used to generate a target percentile for each student in those subjects, whilst an average of these two percentile rankings is used to create a target percentile for all other subjects. At each data point, each student’s residual distance from their target percentile is calculated to identify those who may require additional support. The same principle is applied at subject level, where a residual score is assigned based on the average distance from target percentile within the cohort itself. None of this information is shared or discussed with students or parents, with whom dialogue remains centred on the process of learning rather than outcomes.
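
For readers interested in the mechanics, the sketch below illustrates how such a target-percentile and residual calculation might be implemented. The data structures, field names and scores are hypothetical; it is a simplified illustration of the principle rather than our actual tracking system.

```python
# Hypothetical sketch of the residual-from-target-percentile tracking described above.
# All names, cohorts and scores are illustrative; this is not our actual system.

def percentile_rank(value, cohort_values):
    """Percentile position of `value` within the cohort (0-100, higher = stronger)."""
    below = sum(1 for v in cohort_values if v < value)
    return 100 * below / len(cohort_values)

def target_percentiles(ks2_english, ks2_maths, cohort_english, cohort_maths):
    """Subject-specific targets for English and maths; their average for other subjects."""
    english_target = percentile_rank(ks2_english, cohort_english)
    maths_target = percentile_rank(ks2_maths, cohort_maths)
    other_target = (english_target + maths_target) / 2
    return {"english": english_target, "maths": maths_target, "other": other_target}

def residual(assessment_percentage, cohort_percentages, target_percentile):
    """Positive = sitting above target position in the cohort; negative = below."""
    current = percentile_rank(assessment_percentage, cohort_percentages)
    return current - target_percentile


# Example: one student's KS2 scaled scores against an invented cohort
cohort_english = [95, 100, 102, 104, 106, 108, 110, 112, 115, 118]
cohort_maths = [96, 99, 101, 103, 105, 107, 109, 111, 114, 117]
targets = target_percentiles(ks2_english=108, ks2_maths=111,
                             cohort_english=cohort_english, cohort_maths=cohort_maths)

# A later history assessment: the student scored 70% against the cohort scores below
history_scores = [38, 45, 52, 55, 58, 61, 64, 70, 74, 81]
print(round(residual(70, history_scores, targets["other"]), 1))  # positive = above target
```

Averaging residuals across a class or subject gives the subject-level residual described above; the calculation itself is deliberately simple, which is part of what keeps the conversation with staff focused on interpretation rather than mechanics.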

Final reflection

Despite being confident that the changes made have had an overwhelmingly positive impact on our assessment practices (such as more critical approaches to assessment data, significant improvements in the diagnostic value of internal assessments and relinquishing the pursuit of linear pupil progress), we are under no illusion that we have found solutions to every issue. We recognise the conflict of using Key Stage 2 outcome data as an anchor against which to measure future performance (albeit behind the scenes), and the challenge of ensuring consistency of meaning in relative percentage scores between subjects. Likewise, whilst using comparative judgement in some subjects but not others has greatly increased intra-subject reliability for those involved, further refinement is needed to increase inter-subject reliability when generating summative percentage scores for cross-subject comparisons.

Whilst our assessment systems continue to serve multiple stakeholders, we are unlikely to square the circle on their competing demands. Instead, we continue to review, adapt and refine our practice, guided by the four pillars of assessment and driven by the desire to use assessment primarily for the purpose of maximising our students’ learning.

References

Broadfoot P, Daugherty R, Gardner J et al. (1999) Assessment for Learning: Beyond the Black Box. Cambridge: University of Cambridge, School of Education, Assessment Reform Group.

Christodoulou D (2016) Making Good Progress? The Future of Assessment for Learning. Oxford: Oxford University Press.

Christodoulou D (2020) Teachers vs Tech? The Case for an Ed Tech Revolution. Oxford: Oxford University Press.

Evidence Based Education (2018) The four pillars of assessment: A resource guide. Available at: https://evidencebased.education (accessed 19 January 2021).

Evidence Based Education (2021) The Assessment Lead Programme. Available at: https://evidencebased.education/assessment-lead-programme/ (accessed 19 January 2021).

Kime S et al. (2017) What makes great assessment? Durham: Evidence Based Education. Available at: https://evidencebased.education/assessment-training-consultancy/wp-content/themes/assessmentacademy/images/ebook.pdf (accessed 19 January 2021).

Kohn A (2011) The case against grades. Educational Leadership 69(3): 28–33.

Sherrington T (2020) Authentic assessment. In: Donarski S and Bennett T (eds) The ResearchED Guide to Assessment: An Evidence-Informed Guide For Teachers. Woodbridge: John Catt Educational, pp. 149–164.

Soderstrom NC and Bjork RA (2015) Learning versus performance: An integrative review. Perspectives on Psychological Science 10(2): 176–199.
