More on methodology – the marketing approach

26 October 2008

I wrote recently about the methodology of triple hermeneutics as described by Alvesson and Sköldberg and how it might be relevant to my work. The trail that led to this started with my director of studies’ suggestion that I look at the world of marketing in respect of how it deals with perceptions. This has now led to the writing of Chisnall (2005). Sure enough, in the chapter on “Basic Techniques” there is a discussion of the place of reliability and validity in qualitative and attitude research. I quite like this word ‘attitude’. It helps frame a question: ‘What is the attitude of 16-year-olds to ICT capability and its assessment?’ Chisnall says:

“The measurement of behavioural factors such as attitudes… has been attempted by a variety of techniques… the ones that are the most reliable and valid from a technical viewpoint generally being the most difficult… to apply” (p234).

Oh well!

Validity for Chisnall consists of content, concurrent and construct validity – so fairly conventional there. One would have expected face validity to be mentioned too, perhaps. He also cites a pamphlet (sic) by Bearden et al (1993) that describes some 124 scales for measuring such things in the field of marketing, consumer behaviour and social research.

Bearden, W, Netemeyer, R & Mobley, M (1993), Handbook of marketing scales: Multi-item measures for marketing and consumer behaviour research. Newbury Park, CA: Sage (in conjunction with the ACR).

Chisnall, P (2005), Marketing research (7th ed). NY: McGraw Hill.


Cambridge Assessment seminar

21 October 2008

I attended a seminar on the subject of validity, one of a series of events run by Cambridge Assessment (CA). It was led by Andrew Watts from CA.

This was extremely informative and useful, challenging my notions of assessment. As the basis for his theoretical standpoint Andrew used these texts:

  • Brennan, R (2004), Educational Measurement (4th edition). Westport, CT: Greenwood
  • Downing, S (2006) Twelve Steps for Effective Test Development in Downing, S and Haladyna, T (2006) Handbook of Test Development. NY: Routledge
  • Gronlund, N (2005), Assessment of Student Achievement (8th edition). NY: Allyn and Bacon [NB 9th edition (2008) now available by Gronlund and Waugh]

He also referred to articles published in CA’s Research Matters and used some of the IELTS materials as exemplars.

The main premise, after Gronlund, is that there is no such thing as a valid test/assessment per se. The validity is driven by the purposes of the test. Thus a test that may well be valid in one context may not be in another. The validity, he argued, is driven by the uses to which the assessment is put. In this respect, he gave an analogy with money. Money only has value when it is put to some use. The notes themselves are fairly worthless (except in the esoteric world of the numismatist). Assessments, analogously, have no validity until they are put to use.

Thus a test of English for entrance to a UK university (IELTS) is valid if the UK university system validates it. Here then is the concept of consequential validity. It is also only valid if it fits the context of those taking it. Here is the concept of face validity – the assessment must be ‘appealing’ to those taking it.

Despite these different facets of validity (and others were covered – predictive validity, concurrent validity, construct validity, content validity), Gronlund argues that validity is a unitary concept. This echoes Cronbach and Messick as discussed earlier. One way of looking at this, I suppose, is that there is no validity without all of these facets.

Gronlund also argues that validity cannot itself be determined – it can only be inferred. In particular, inferred from statements that are made about, and uses that are made of, the assessment.

The full list of characteristics cited from Gronlund is that validity

  • is inferred from available evidence and not measured itself
  • depends on many different types of evidence
  • is expressed by degree (high, moderate, low)
  • is specific to a particular use
  • refers to the inferences drawn, not the instrument
  • is a unitary concept
  • is concerned with the consequences of using an assessment

Some issues arising for me here are that the purposes of ICT assessment at 16 are sometimes, perhaps, far from clear. Is it to certificate someone’s capability in ICT so that they may do a particular type of job, or have a level of skills for employment generally, or have an underpinning for further study, or have general life skills, or something else, or all of these? Is ‘success’ in assessment of ICT at 16 a necessary prerequisite for A level study? For entrance to college? For employment?

In particular I think the issue that hit me hardest was – is there face validity: do the students perceive it as a valid assessment (whatever ‘it’ is)?

One final point – reliability was considered to be an aspect of validity (scoring validity in the ESOL framework of CA).

A three axis model

1 May 2008

Taking the ideas from the previous post and putting them into a diagram I get this

Some assessment uses ICT (or technology) – this is e-assessment (x axis).

Some assessment is designed to assess ICT capability (y axis).

Elliott’s Assessment 2.0 seems to be using ICT, not as e-assessment, but as a medium for allowing judgement to be made about the ICT capability (z axis).

Now of course, analysing any one particular assessment methodology, one could locate it in this three-dimensional space. For example:

A traditional written paper would be on the y-axis. The NAA online assessment activities designed for KS3 would be in the space between all three axes (with perhaps lower y- and z-values than x-value). Coursework would have an x-value of 0 but would have some components of y and z. Online assessments such as the driving test would be on the x-axis.

My questions here are “Where is the highest validity?” and “Where is the highest reliability?”. How does one use Elliott’s Assessment 2.0 to determine success in a certificated qualification?
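To make the idea of locating an assessment in this space concrete, here is a minimal Python sketch. The `Assessment` class, the `on_axis` helper and all of the coordinate values are illustrative assumptions: the numbers are guesses chosen only to match the relative positions described above, not measurements of anything.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assessment:
    """A point in the three-axis space (all values are illustrative guesses)."""
    name: str
    x: float  # how far technology is used to deliver the assessment (e-assessment)
    y: float  # how far the assessment is designed to assess ICT capability
    z: float  # how far ICT is the medium for judging capability (Assessment 2.0)

    def on_axis(self) -> Optional[str]:
        """Return 'x', 'y' or 'z' if the point lies on exactly one axis, else None."""
        nonzero = [axis for axis in "xyz" if getattr(self, axis) > 0]
        return nonzero[0] if len(nonzero) == 1 else None


# Hypothetical placements following the examples in the text.
examples = [
    Assessment("Traditional written paper", x=0.0, y=1.0, z=0.0),
    Assessment("NAA KS3 online activities", x=0.8, y=0.4, z=0.4),
    Assessment("Coursework", x=0.0, y=0.6, z=0.6),
    Assessment("Online driving test", x=1.0, y=0.0, z=0.0),
]

for a in examples:
    print(f"{a.name}: ({a.x}, {a.y}, {a.z}) on axis: {a.on_axis()}")
```

Nothing here says where validity or reliability is highest; it only gives a way of writing down where each assessment sits so that such questions can be asked of particular points.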

Wiliam (2000) on reliability and validity

25 January 2007

Wiliam’s paper, referenced by Mike Baker in his BBC summary, is not actually about the validity of National Curriculum (or any other) formal tests per se. It is about the inherent issues of validity and reliability of testing. The reduction of reliability comes from the inability of students to perform exactly the same way in tests. If they were to take the same test several times then they would expect to get different scores, argues Wiliam. This seems intuitively sensible, if impossible to prove as you can’t ever take a test again without it either being a different test or without you learning from your first attempt. The position is a theoretical one. Wiliam uses a simple statistical model to come up with the figures that are used in the BBC report. It is not that a test is 32% inaccurate, but that 32% is the number of misclassifications that might be expected given the nature of testing and quantitative scoring. The stats used by Baker are, themselves, theoretical, and should not be used as ‘headline figures’.
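The kind of reasoning behind such a figure can be sketched with a small simulation. This is not Wiliam's own model or his parameters: it assumes a classical "observed score = true score + error" setup with normally distributed scores, and the reliability values, cut scores and the function name `misclassification_rate` are all made up for illustration. It should not be expected to reproduce the 32% figure.

```python
import random


def misclassification_rate(reliability, cut_scores, n_students=20000, seed=1):
    """Estimate the share of students whose observed level differs from their 'true' level.

    Assumes classical test theory: observed = true + error, with
    reliability = var(true) / var(observed) and observed variance fixed at 1.
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    error_sd = (1.0 - reliability) ** 0.5

    def level(score):
        # The level awarded is the number of cut scores reached.
        return sum(score >= cut for cut in cut_scores)

    wrong = 0
    for _ in range(n_students):
        true_score = rng.gauss(0.0, true_sd)
        observed = true_score + rng.gauss(0.0, error_sd)
        if level(observed) != level(true_score):
            wrong += 1
    return wrong / n_students


# Hypothetical level boundaries, in standard-deviation units of the observed score.
cuts = (-1.0, 0.0, 1.0)
for r in (0.80, 0.90, 0.95):
    print(f"reliability {r:.2f}: about {misclassification_rate(r, cuts):.0%} misclassified")
```

The point of the sketch is the one made above: even a test with high reliability places a noticeable share of students in the 'wrong' level, simply because scores near a boundary can fall either side of it. The share depends heavily on the assumed reliability and on how many boundaries there are.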

Wiliam then goes on to look at reliability of grades. He points out that we might intuitively know that it would be unreliable to say a student who scores 75% must be ‘better’ than one who scores 74%. But if the results are reported as grades we are more likely to confer reliability on the statement ‘the student achieving the higher level is better’.

On validity Wiliam says little in this paper but does point out the tension between validity and reliability. Sometimes making a test reliable means it becomes less valid. He cites the example of the divergent thinker who comes up with an alternative good answer that is not on the markscheme and who therefore receives no credit. This is a standard response by examining teams, designed to eliminate differences between markers. While contingencies are always in place to consider exceptional answers, if they are not spotted until the end of the marking period then they cannot be accommodated. If several thousand scripts/tests have already been marked, they cannot be gone back over because one examiner feels that one alternative answer discovered late on should be rewarded. You either reward all those who came up with it or none. Usually it is none for pragmatic reasons, not for reasons of validity.

Wiliam (2000) Reliability, validity, and all that jazz in Education 3-13 vol 29(3) pp 9-13 available online at

and citing

Wiliam, D. (1992). Some technical issues in assessment: a user’s guide. British Journal for Curriculum and Assessment, 2(3), 11-20.

Wiliam, D. (1996). National curriculum assessments and programmes of study: validity and impact. British Educational Research Journal, 22(1), 129-141.

Moss (1992): validity and assessment of performance

11 January 2007

Pamela Moss’s 1992 paper “Shifting conceptions of validity in educational measurement: implications for performance assessment” (1) is cited by Wiliam in his modification of Messick’s four-facet model. It would seem from the figure I extracted from Wiliam’s paper that he is suggesting that Moss is providing an extra dimension to the evidential paradigm. That was what I saw on first reading. On turning to Moss’s paper and re-reading Wiliam I am now not so sure of where he is placing Moss vis-à-vis Messick.

Moss’s paper is an overview of the landscape of construct validity from the inception by Cronbach and Meehl in 1955 (2) to its publication in 1992. In doing so she looks at evidential and interpretive aspects of the models of Cronbach (1980) (3) and Messick. The latter is not seen as being purely evidential as Wiliam’s paper might suggest.

The thrust of Moss is that a review was needed of the “Standards” of what might be called the Establishment of (American) assessment and measurement (AERA, APA, NCME). This review, she argues, is needed because of the emergence of performance assessment as a science (and as a commonly used tool) to complement test/item-based assessment. She compares this to the contemporaneous diminution of the dominance of positivism.

In performance assessment there is a strong interpretive base. The learner will interpret the task and manifest skills, knowledge and understanding through their performance. The assessor will re-interpret this performance to provide evidence to which rules of validity must be applied.

(1) Moss, P (1992) Shifting Conceptions of Validity in Educational Measurement: Implications for Performance Assessment in Review of Educational Research, Vol. 62, No. 3. (Autumn, 1992), pp. 229-258.

(2) Cronbach, L.J. and Meehl, P.E. (1955) Construct validity in psychological tests in Psychological Bulletin, 52, 281-302 also available online at

(3) Cronbach, L.J. (1980). Validity on parole: How can we go straight? in New directions for Testing and Measurement, 5, 99-108.

Embretson (1983): construct representation

10 January 2007

The “top left” quadrant of Wiliam’s enhancement of Messick’s four-facet model of validity deals with within-domain evidential/interpretive validity. How is the assessment designed so as to provide constructs that evidence that which is to be assessed within the domain? He cites Embretson (1983) (1) as providing part of the conceptual model for this quadrant.

Embretson’s model distinguishes between construct representation and nomothetic span. In the former, assessment is designed so that it is situated in tasks that represent that which is to be assessed. In the latter, it is designed to correlate with other tasks deemed valid.
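Nomothetic span is, at heart, a statement about correlation, so a small worked example may help. The sketch below computes a Pearson correlation between scores on a new task and scores on an established assessment for the same students. The score lists are invented purely for illustration, and Pearson's r is only one plausible way of expressing "correlates with other tasks deemed valid"; Embretson's paper is not being quoted here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    return covariance / sqrt(var_x * var_y)


# Hypothetical scores for the same six students on two assessments.
new_task_scores = [12, 15, 9, 20, 17, 11]
established_scores = [48, 55, 40, 71, 60, 45]

r = pearson_r(new_task_scores, established_scores)
print(f"correlation between the two assessments: {r:.2f}")
```

A high correlation of this kind says nothing about whether the new task represents the construct itself, which is exactly the distinction Embretson draws between the two halves of the model.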

Mislevy et al (2002) (2) discuss the model in the context of the “psychometric principles” of validity, reliability and comparability. They relate the task model to three other models – the student’s learning, the assessment (or measurement) of this learning and the scoring. Their argument appears to be that the construct representation resonates more with the psychometric principles than does nomothetic span, but that both may be needed.

In the context of my research it would seem that I am doing some sort of comparison between the two parts of Embretson’s dichotomy. Construct representation – using what the students have learnt by way of ICT capability to provide an assessment. Nomothetic span – using some assessment that correlates to this as measured by other assessments.

Is use of the former inherently more engaging than the latter? Does it fit with students’ own constructs of what they have learnt?

(1) Embretson, S. E. (1983), Construct validity: Construct representation versus nomothetic span in Psychological Bulletin, 93, 179-197.

(2) Mislevy R, Wilson M, Ercikan K, Chudowsky N (2002), Psychometric Principles in Student Assessment CSE Technical Report 583, Centre for Studies in Evaluation, LA also available online at

Wiliam’s model of construct validity (1996)

9 January 2007

Wiliam (1996) offers a model that starts from Messick’s four-facet model (1) of validity (subsequently, (1996), enhanced to six facets) and applies it to the National Curriculum. Wiliam’s analysis has much to offer when looking at assessment at 16. He takes Messick’s distinction of the evidential and consequential in assessment and adds Moss’s (1992) interpretative basis to the former. Assessment validity needs to be looked at through the evidence, the interpretation and the impact (consequence). For each of these two bases – evidential/interpretive and consequential – Wiliam then builds on Messick’s other dimension of within- and beyond-domain.

Wiliam (1994)
Wiliam then examines each of the four zones in turn.

In regard to within-domain inferences Wiliam explains the work of Popham and others in trying to establish valid tests that test all, and only, the domain that is intended to be tested. The concluding criticism of the validity of NC tests may well apply to any external traditional examination – they are unrepresentative of the domain because of their length compared to the length/volume of learning.

For beyond-domain inferences Wiliam cites the predictive nature of the use of test results. High performance in X predicts high performance in Y. He cites Guilford in saying that it doesn’t matter how this correlation is arrived at, merely that it is reliable. The test might not be valid though as it may not be in the same domain. For ICT at 16 there may be aspects of the achievement that are given far greater importance than perhaps they should be. A learner gets Key Skills level 2 in ICT (2), therefore s/he is functionally literate in ICT. It doesn’t matter how the level 2 was achieved.

Within-domain impact is of particular importance to the design of ICT assessments, I believe. Hence the move towards onscreen testing – it’s ICT so the technology must be used to assess the capability. In Wiliam’s words, it “must look right” (p132).

Finally, Wiliam considers beyond-domain impact or consequence. In looking at National Curriculum testing, Wiliam argues, some of the validity is driven (or driven away) by beyond-domain impacts such as league tables – these are much higher stakes for schools than learners and so the validity of the assessment is corrupted.

(1) Messick, “Validity,” 20; Lorrie A. Shepard, “Evaluating Test Validity,” in Review of Educational Research, ed. Linda Darling-Hammond (Washington, DC: AERA, 1993), 405-50. cited in Orton (1994)

(2) The functional/key skill component of ICT learning is referred to as IT


10/01/07 Post on Embretson (1983)

11/01/07 Post on Moss (1992)