More on methodology – the marketing approach

26 October 2008

I wrote recently about the methodology of triple hermeneutics as described by Alvesson and Stöcklund and how it might be relevant to my work. The trail that led to this started with my director of studies’ suggestion that I look at the world of marketing in respect of how it deals with perceptions. This has now led me to the reading of Chisnall (2005). Sure enough, in the chapter on “Basic Techniques” there is a discussion of the place of reliability and validity in qualitative and attitude research. I quite like this word ‘attitude’. It helps frame a question: ‘What is the attitude of 16-year-olds to ICT capability and its assessment?’ Chisnall says

“The measurement of behavioural factors such as attitudes… has been attempted by a variety of techniques… the ones that are the most reliable and valid from a technical viewpoint generally being the most difficult… to apply” (p. 234).

Oh well!

Validity for Chisnall consists of content, concurrent and construct validity – so fairly conventional there. One would have expected face validity to be mentioned too, perhaps. He also cites a pamphlet (sic) by Bearden et al (1993) that describes some 124 scales for measuring such things in the field of marketing, consumer behaviour and social research.

Bearden, W, Netemeyer, R & Mobley, M (1993), Handbook of marketing scales: Multi-item measures for marketing and consumer behaviour research. Newbury Park, CA: Sage (in conjunction with the ACR).

Chisnall, P (2005), Marketing research (7th ed). NY: McGraw Hill.

Cambridge Assessment seminar

21 October 2008

I attended a seminar on the subject of validity, one of a series of events run by Cambridge Assessment (CA). It was led by Andrew Watts of CA.

This was extremely informative and useful, challenging my notions of assessment. As the basis for his theoretical standpoint Andrew used these texts:

  • Brennan, R (2004), Educational Measurement (4th edition). Westport, CT: Greenwood
  • Downing, S (2006), Twelve Steps for Effective Test Development. In Downing, S & Haladyna, T (eds), Handbook of Test Development. NY: Routledge
  • Gronlund, N (2005), Assessment of Student Achievement (8th edition). NY: Allyn and Bacon [NB 9th edition (2008) now available by Gronlund and Waugh]

He also referred to articles published in CA’s Research Matters and used some of the IELTS materials as exemplars.

The main premise, after Gronlund, is that there is no such thing as a valid test/assessment per se. The validity is driven by the purposes of the test. Thus a test that may well be valid in one context may not be in another. The validity, he argued, is driven by the uses to which the assessment is put. In this respect, he gave an analogy with money. Money only has value when it is put to some use. The notes themselves are fairly worthless (except in the esoteric world of the numismatist). Assessments, analogously, have no validity until they are put to use.

Thus a test of English for entrance to a UK university (IELTS) is valid if the UK university system validates it. Here, then, is the concept of consequential validity. It is also only valid if it fits the context of those taking it. Here is the concept of face validity – the assessment must be ‘appealing’ to those taking it.

Despite these different facets of validity (and others were covered – predictive, concurrent, construct and content validity), Gronlund argues that validity is a unitary concept. This echoes Cronbach and Messick as discussed earlier. One way of looking at this would be that there is no validity without all of these facets.

Gronlund also argues that validity cannot itself be determined – it can only be inferred. In particular, inferred from statements that are made about, and uses that are made of, the assessment.

The full list of characteristics cited from Gronlund is that validity

  • is inferred from available evidence and not measured itself
  • depends on many different types of evidence
  • is expressed by degree (high, moderate, low)
  • is specific to a particular use
  • refers to the inferences drawn, not the instrument
  • is a unitary concept
  • is concerned with the consequences of using an assessment

Some issues arising for me here are that the purposes of ICT assessment at 16 are sometimes, perhaps, far from clear. Is it to certificate someone’s capability in ICT so that they may do a particular type of job, or have a level of skills for employment generally, or have an underpinning for further study, or have general life skills, or something else, or all of these? Is ‘success’ in assessment of ICT at 16 a necessary prerequisite for A level study? For entrance to college? For employment?

In particular, I think the issue that hit me hardest was: is there face validity – do the students perceive it as a valid assessment (whatever ‘it’ is)?

One final point – reliability was considered to be an aspect of validity (scoring validity in the ESOL framework of CA).

A three axis model

1 May 2008

Taking the ideas from the previous post and putting them into a diagram, I get this:

Some assessment uses ICT (or technology) – this is e-assessment (x axis).

Some assessment is designed to assess ICT capability (y axis).

Elliott’s Assessment 2.0 seems to be using ICT, not as e-assessment, but as a medium for allowing judgement to be made about the ICT capability (z axis).

Now of course, analysing any one particular assessment methodology, one could locate it in this three-dimensional space. For example:

A traditional written paper would be on the y-axis. The NAA online assessment activities designed for KS3 would be in the space between all three axes (with perhaps lower y- and z-values than x-value). Coursework would have an x-value of 0 but would have some components of y and z. Online assessments such as the driving test would be on the x-axis.
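The locations above can be sketched as data. This is a minimal illustration only – the 0–1 scores below are my own hypothetical placements, not measurements:

```python
# Locate assessments in the three-axis space described above.
# Axis meanings:
#   x = degree to which ICT delivers the assessment (e-assessment)
#   y = degree to which ICT capability is the construct being assessed
#   z = degree to which ICT mediates judgement of that capability
# All values are illustrative guesses on a 0-1 scale.

assessments = {
    "traditional written paper": (0.0, 1.0, 0.0),  # on the y-axis
    "NAA KS3 online activities": (0.8, 0.5, 0.5),  # between all three axes
    "coursework":                (0.0, 0.6, 0.4),  # x = 0, some y and z
    "driving test (online)":     (1.0, 0.0, 0.0),  # on the x-axis
}

for name, (x, y, z) in assessments.items():
    print(f"{name}: x={x}, y={y}, z={z}")
```

Plotting these as points would reproduce the diagram; the questions about validity and reliability then become questions about regions of this space.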

My questions here are: “Where is the highest validity?” and “Where is the highest reliability?” How does one use Elliott’s Assessment 2.0 to determine success in a certificated qualification?

Wiliam (2000) on reliability and validity

25 January 2007

Wiliam’s paper, referenced by Mike Baker in his BBC summary, is not actually about the validity of National Curriculum (or any other) formal tests per se. It is about the inherent issues of validity and reliability in testing. The reduction in reliability comes from the inability of students to perform in exactly the same way in tests. If they were to take the same test several times they would expect to get different scores, argues Wiliam. This seems intuitively sensible, if impossible to prove, since you can’t ever take a test again without it either being a different test or without your learning from your first attempt. The position is a theoretical one. Wiliam uses a simple statistical model to arrive at the figures used in the BBC report. It is not that a test is 32% inaccurate, but that 32% is the proportion of misclassifications that might be expected given the nature of testing and quantitative scoring. The statistics used by Baker are themselves theoretical, and should not be used as ‘headline figures’.
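The flavour of this kind of model can be shown with a toy simulation. This is not Wiliam’s actual model – the grade boundaries, score distribution and error size below are all my own assumptions – but it shows how a seemingly modest amount of measurement error produces a large proportion of misclassifications near grade boundaries:

```python
import random

# Toy classical-test-theory sketch: observed = true score + random error.
# Boundaries, means and error size are hypothetical, chosen only to
# illustrate how misclassification arises; they are not Wiliam's figures.

random.seed(0)
BOUNDARIES = [40, 55, 70]  # hypothetical grade cut-scores (percent)

def grade(score):
    """Count how many boundaries the score meets or exceeds (0-3)."""
    return sum(score >= b for b in BOUNDARIES)

N = 100_000
misclassified = 0
for _ in range(N):
    true = random.gauss(55, 15)           # candidate's 'true' attainment
    observed = true + random.gauss(0, 7)  # measurement error on the day
    if grade(observed) != grade(true):
        misclassified += 1

print(f"misclassification rate: {misclassified / N:.0%}")
```

Candidates whose true score sits far from any boundary are almost never misclassified; those near a boundary are misclassified close to half the time, and averaging over everyone yields a substantial overall rate even though the test itself is unchanged.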

Wiliam then goes on to look at the reliability of grades. He points out that we might intuitively know it would be unreliable to say a student who scores 75% must be ‘better’ than one who scores 74%. But if the results are reported as grades, we are more likely to confer reliability on the statement ‘the student achieving the higher level is better’.

On validity Wiliam says little in this paper, but he does point out the tension between validity and reliability. Sometimes making a test reliable means it becomes less valid. He cites the example of the divergent thinker who comes up with an alternative good answer that is not on the mark scheme and who therefore receives no credit. This is a standard response by examining teams, designed to eliminate differences between markers. While contingencies are always in place to consider exceptional answers, if they are not spotted until the end of the marking period they cannot be accommodated. If several thousand scripts/tests have already been marked, they cannot be gone back over because one examiner feels that an alternative answer discovered late on should be rewarded. You either reward all those who came up with it or none. Usually it is none, for pragmatic reasons, not for reasons of validity.

Wiliam, D (2000), Reliability, validity, and all that jazz. Education 3-13, 29(3), 9-13 (also available online)

and citing

Wiliam, D. (1992). Some technical issues in assessment: a user’s guide. British Journal for Curriculum and Assessment, 2(3), 11-20.

Wiliam, D. (1996). National curriculum assessments and programmes of study: validity and impact. British Educational Research Journal, 22(1), 129-141.

Embretson (1983): construct representation

10 January 2007

The “top left” quadrant of Wiliam’s enhancement of Messick’s four-facet model of validity deals with within-domain evidential/interpretive validity: how is the assessment designed so as to provide constructs that evidence what is to be assessed within the domain? He cites Embretson (1983) (1) as providing part of the conceptual model for this quadrant.

Embretson’s model distinguishes between construct representation and nomothetic span. In the former, the assessment is designed so that it is situated in tasks that represent what is to be assessed. In the latter, it is designed to correlate with other tasks deemed valid.

Mislevy et al (2002) (2) discuss the model in the context of the “psychometric principles” of validity, reliability and comparability. They relate the task model to three other models – the student’s learning, the assessment (or measurement) of this learning and the scoring. Their argument appears to be that the construct representation resonates more with the psychometric principles than does nomothetic span, but that both may be needed.

In the context of my research it would seem that I am doing some sort of comparison between the two parts of Embretson’s dichotomy. Construct representation – using what the students have learnt by way of ICT capability to provide an assessment. Nomothetic span – using some assessment that correlates to this as measured by other assessments.

Is use of the former inherently more engaging than the latter? Does it fit with students’ own constructs of what they have learnt?

(1) Embretson, S. E. (1983), Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

(2) Mislevy, R, Wilson, M, Ercikan, K & Chudowsky, N (2002), Psychometric Principles in Student Assessment. CSE Technical Report 583, Centre for Studies in Evaluation, LA (also available online)