Wiliam (2000) on reliability and validity

25 January 2007

Wiliam’s paper, referenced by Mike Baker in his BBC summary, is not actually about the validity of National Curriculum (or any other) formal tests per se. It is about the inherent issues of validity and reliability of testing. The reduction of reliability comes from the inability of students to perform exactly the same way in tests. If they were to take the same test several times then they would expect to get different scores, argues Wiliam. This seems intuitively sensible, if impossible to prove as you can’t ever take a test again without it either being a different test or without you learning from your first attempt. The position is a theoretical one. Wiliam uses a simple statistical model to come up with the figures that are used in the BBC report. It is not that a test is 32% inaccurate, but that 32% is the number of misclassifications that might be expected given the nature of testing and quantitative scoring. The stats used by Baker are, themselves, theoretical, and should not be used as ‘headline figures’.

Wiliam then goes on to look at reliability of grades. He points out that we might intuitively know that it would be unreliable to say a student who scores 75% must be ‘better’ than one who scores 74%. But if the results are reported as grades we are more likely to confer reliability to the statement ‘the student achieving the higher level is better ‘.

On validity Wiliam says little in this paper but does point out the tension between validity and reliability. Sometimes making a test reliable means it becomes less valid. He cites the example of the divergent thinker who comes up with an alternative good answer that is not on the markscheme and who therefore receives no credit. this is a standard response by examining teams designed to eliminate differences between markers. While contingencies are always in place to consider exceptional answers, if they are not spotted until the end of the marking period then they cannot be accommodated. If several thousand scripts/tests have already been marked, they cannot be gone back over because one examiner feels that one alternative answer discovered late on should be rewarded. You either reward all those who came up with it or none. Usually it is none for pragmatic reasons, not for reasons of validity.

Wiliam (2000) Reliability, validity, and all that jazz in Education 3-13 vol 29(3) pp 9-13 available online at http://www.aaia.org.uk/pdf/2001DYLANPAPER3.PDF

and citing

Wiliam, D. (1992). Some technical issues in assessment: a user’s guide. British Journal for Curriculum and Assessment, 2(3), 11-20.

Wiliam, D. (1996). National curriculum assessments and programmes of study: validity and impact. British Educational Research Journal, 22(1), 129-141.