The Reliability Programme: Leading the way to better tests and assessments

This is the presentation from "The Reliability Programme: Leading the way to better tests and assessments" event.

Speaker notes
  • Because strand 1 is technically very complicated, we wanted to appoint a heavyweight Technical Advisory Group. And we're proud to have achieved that! We've got a team of five, including three professors, representing expertise from the ranks of awarding body research teams, academia and educational testing agencies, and only one of them is English. We've got both critics and defenders of the system on board too. Paul Black, in particular, has been one of the most vociferous critics of the system, specifically challenging assessment agencies for a lack of openness and transparency concerning error. Because, inevitably, we have been working very closely with awarding bodies, this Technical Group had a most important role to play in vouching for the independence of the programme and the trustworthiness of the results.
  • How reliable are results from national assessments, exams and qualifications in England?
  • Trying to find answers to questions such as: How do we conceptualise reliability in different contexts? How do we interpret our findings – what do the results from strand 1 mean (e.g. classification accuracy 84%; Cronbach's alpha 0.78) and how can we make sense of them? How do we communicate our findings?
  • Finding answers to questions like: What do the public know about reliability? What do they feel about reliability?
  • So, how unreliable is educational assessment? This is quite a controversial area, as it happens, and there wasn’t a great deal of evidence to be found. At least, not a lot of evidence that’s user-friendly enough to make good sense of. But one of England’s foremost professors of educational assessment has concluded that: [READ] He and two other professors provided evidence to the Select Committee in 2007 [READ] That’s quite high. Are they right? We’ll come back to that later.
  • Several empirical studies to investigate the reliabilities of results from NCTs, GCSEs, A levels, VQs
  • For several years NFER have been asking 11-year-olds to pre-test items that are to be used in the following year's KS2 test before they take the current year's test. The data generated allows them to compare the pre-test and live test results for five years, 2004-2008. Here is a summary of the results. Accuracy: the degree of agreement between classifications based on observed scores and true scores on a test. Consistency: the degree of agreement between classifications based on two sets of observed scores from replications of the same measurement procedure. Misclassification: the degree to which observed scores and true scores on a test classify examinees into different categories. So 88% accuracy = 12% misclassification. We will come back to that later.
  • The 2008 KS2 English reading pre-test.
  • 190 GCSE components – mainly objective tests and short answer questions
  • 97 GCE components – mainly objective tests and short answer questions
  • Assessors and internal verifiers for three workplace-based NVQs. Kappa: a measure of agreement between two ratings of the same event that takes account of the probability of agreement by chance; 0.61-0.80 indicates substantial agreement, 0.81-1.0 almost perfect agreement.
  • Following the NFER work mentioned earlier on reliability in NC tests, some further analyses have been carried out to see what figures emerge for internal consistency reliability (Cronbach's alpha) and classification accuracy (the degree of agreement between classifications based on observed scores and true scores on a test) for the 2009 and 2010 live tests. Alpha values are relatively high and are similar over the two years for each subject. The classification accuracy figures, estimated using two different methods, are mostly around 87% for science, 85% for English and 90% for maths – so a misclassification rate of about 13%, 15% and 10% respectively.
  • External research projects: (1) estimating and interpreting reliability, based on CTT – describes the measurement process and different forms of reliability; (4) reporting of results and measurement uncertainties – an international report on how results and associated errors are reported; representing and reporting of assessment results and measurement uncertainties in some USA high-stakes tests; (6) reliability of teacher assessment. Internal research projects: reliability of composite scores, based on CTT, G-theory and IRT, at qualification level.
  • Example of error reporting from North Carolina: confidence limits.
  • Issues related to reliability discussed.
  • Ofqual and NFER – discussion group at 2009 AEA Europe conference in Malta to discuss issues with reporting assessment results and reliability information. Summary of views expressed by participants.
  • Participants show varied degrees of understanding and varying degrees of tolerance towards different kinds of error.
  • Ipsos MORI 2009 survey. Teachers: 80% thought students got the right grade.
  • Ofqual quantitative online survey
  • Remember the public confidence objective: “The public confidence objective is to promote public confidence in regulated qualifications and regulated assessment arrangements”. On to the media reaction to some of our work. You might want to reflect on how that feeds into public confidence.
Transcript of "The Reliability Programme: Leading the way to better tests and assessments"

1. Welcome
   Reliability Programme: Leading the way to better testing and assessments
   22 March 2011
   Event Chair: Dame Sandra Burslem, DBE, Ofqual's Deputy Chair

2. Welcome and Setting the Scene
   Glenys Stacey, Ofqual Chief Executive

3. Ofqual's Reliability Programme
   Dennis Opposs

4. Background
   - Reliability: quantifying the luck of the draw
   - Reliability work in England has generally been isolated, partial, under-theorised, under-reported and misunderstood
   - Ofqual's Reliability Programme aimed to improve the situation.

5. Aims
   - To gather evidence for Ofqual to develop regulatory policy on reliability of results from national tests, examinations and qualifications

6. Programme structure
   - Strand 1: Generating evidence of reliability
   - Strand 2: Interpreting and communicating evidence of reliability
   - Strand 3: Developing reliability policy
     - Strand 3a: Exploring public understanding of reliability
     - Strand 3b: Developing Ofqual policy on reliability

7. Our Technical Advisory Group
   Paul Black, Anton Beguin, Alastair Pollitt, Gordon Stanley, Jo-Anne Baird

8. Strand 1 – Generating evidence
   - Synthesising pre-existing evidence
   - Literature reviews
   - Generating new evidence
   - Monitoring existing practices
   - Experimental studies

9. Strand 2 – Interpreting and communicating evidence
   - How do we conceptualise reliability?
   - How do we interpret our findings?
   - How do we communicate our findings?

10. Strand 3 – Developing policy
    - Exploring public understanding of, and attitudes towards, assessment error
    - Stimulating national debate on the significance of the reliability evidence generated by the programme
    - Developing Ofqual's policy on reliability

11. Student misclassification
    - Controversial area; earlier conclusions include:
    - “… it is likely that the proportion of students awarded a level higher or lower than they should be because of the unreliability of the tests is at least 30% at key stage 2”
      Wiliam, D. (2001). Level best? London: ATL.
    - “Professors Black, Gardner and Wiliam argued […] that up to 30% of candidates in any public examination in the UK will receive the wrong level or grade”
      House of Commons Children, Schools and Families Committee. (2008a). Testing and Assessment. Third Report of Session 2007–08. Volume I. HC 169-I. London: TSO.
    - Is this accurate?

12. Strand 1 – Generating evidence (1)
    - National Curriculum tests:
      - The reliabilities of KS2 science pre-tests and the stability of consistency over time
      - The reliabilities of the 2008 KS2 English reading pre-test
    - General qualifications:
      - The reliabilities of GCSE components/units
      - The reliability of GCE units
    - Vocational qualifications

13. Strand 1 – Generating evidence (2)
    - KS2 science pre-tests: the reliabilities of KS2 science tests over five years
    - Values of internal consistency reliability (alpha) generally over 0.85
    - Classification accuracy (pre-tests) 83%-88%
    - Classification consistency (between pre-tests and live tests) 72%-79%
    - Reliability indices relatively stable over time
    - Relatively high reliability compared with similar tests

14. Strand 1 – Generating evidence (3)
    - A KS2 English reading pre-test: data collected in 2007 during pre-testing of the 2008 KS2 English reading test
    - 34 items, 50 marks in total (mean 28.5, standard deviation 9.1, 1,387 pupils)
    - Internal consistency reliability 0.88
    - Standard error of measurement 3.1
    - Classification accuracy (IRT) 83%
    - Classification consistency (IRT) 76%
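A note on how the figures on the last two slides hang together: Cronbach's alpha and the standard error of measurement (SEM) are linked by SEM = SD × sqrt(1 − alpha). Below is a minimal sketch in Python; the item-score matrix is invented for illustration and stands in for the real pre-test data, but the function is the standard alpha formula.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_pupils, n_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of pupils' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented data for illustration: 1387 pupils x 34 dichotomous items,
# with responses driven by a common "ability" factor.
rng = np.random.default_rng(0)
ability = rng.normal(size=(1387, 1))
items = (rng.normal(ability, 1.0, size=(1387, 34)) > 0).astype(float)

alpha = cronbach_alpha(items)
sd = items.sum(axis=1).std(ddof=1)
sem = sd * np.sqrt(1 - alpha)  # SEM = SD * sqrt(1 - alpha)
print(f"alpha = {alpha:.2f}, SEM = {sem:.2f} marks")
```

Applying the same identity to the published figures, 9.1 × sqrt(1 − 0.88) ≈ 3.2, in line with the reported standard error of measurement of 3.1 (the small gap is rounding in the published alpha).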
15. Strand 1 – Generating evidence (4)
    - Cronbach's alpha for GCSE components/units

16. Strand 1 – Generating evidence (5)
    - Cronbach's alpha for GCE units

17. Strand 1 – Generating evidence (6)
    - Assessor agreement rates for a workplace-based vocational qualification

    Qualification | Number of decisions | Agreement rate (%) | Cohen's kappa
    Q1            | 2144                | 96.1               | 0.763
    Q2            | 479                 | 100                | 1
    Q3            | 3070                | 99.1               | 0.971
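Cohen's kappa, used in the table above, adjusts the raw agreement rate for the agreement two assessors would reach by chance: kappa = (p_o − p_e) / (1 − p_e). A minimal sketch with invented pass/fail decisions (not the programme's data); note how a 96% raw agreement rate drops to kappa ≈ 0.78, the same territory as Q1 in the table.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical decisions."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick the same category.
    p_e = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Invented example: 100 decisions by an assessor and an internal verifier,
# agreeing on 88 passes and 8 fails, disagreeing on 4 candidates.
assessor = ["pass"] * 90 + ["fail"] * 10
verifier = ["pass"] * 88 + ["fail"] * 10 + ["pass"] * 2

print(round(cohens_kappa(assessor, verifier), 3))  # 0.778
```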
18. Strand 1 – Generating evidence (7)
    - The 2009 and 2010 live tests (populations)

    Subject          | Cronbach's alpha | Classification accuracy (%)
                     |                  | Method 1 | Method 2
    English 2010     | 0.919            | 85       | 85
    English 2009     | 0.910            | 87       | 85
    Mathematics 2010 | 0.964            | 91       | 90
    Mathematics 2009 | 0.968            | 90       | 90
    Science 2010     | 0.926            | 87       | 86
    Science 2009     | 0.928            | 88       | 87
19. Strand 2 – Interpreting and communicating evidence (1)
    - External research projects
      - Estimating and interpreting reliability, based on CTT
      - Estimating and interpreting reliability based on CTT and G-theory
      - Quantifying and interpreting GCSE and GCE component reliability based on G-theory
      - Reporting of results and measurement uncertainties
      - Representing and reporting of assessment results and measurement uncertainties in some USA tests
      - Reliability of teacher assessment
    - Internal research projects
      - Reliability of composite scores: based on CTT, G-theory and IRT, qualification level

20. Strand 2 – Interpreting and communicating evidence (2)
    - Reporting results and associated errors (students and parents)

21. Strand 2 – Interpreting and communicating evidence (3)
    - Technical seminars:
      - Factors that affect the reliability of results from assessments
      - Definition and meaning of different forms of reliability
      - Statistical methods that are used to produce reliability estimates
      - Representing and reporting assessment results and reliability estimates / measurement errors
      - Improving reliability and implications
      - Disseminating reliability statistics
      - Tension in managing public confidence whilst exploring and improving reliability
      - Operational issues for awarding bodies in producing reliability information
      - Challenges posed by the reliability programme in vocational qualifications

22. Strand 2 – Interpreting and communicating evidence (4)
    - International perspective on reliability:
      - Reliability studies should be built into the assessment quality assurance process
      - Information on reliability (primary and derived indices) should be in the public domain
      - The introduction of information about reliability (misclassification / measurement error) should be managed carefully
      - Educating the public to understand the concept of reliability (measurement error) is seen as playing an important part in alleviating the problem of misinterpretation by the media
      - The reporting of results and measurement error can be complex, as results are normally used by multiple users
      - Primary reliability indices and classification indices should be reported at population level
      - Standard error of measurement should be reported at individual test-taker level

23. Strand 3a – Public perceptions of reliability (1)
    - External research projects
      - Ipsos MORI survey
      - Ipsos MORI workshops
      - AQA focus groups
    - Internal research project
      - Online questionnaire survey
    - Investigating
      - Understanding of the assessment process
      - Understanding of factors affecting performance on exams
      - Understanding of factors introducing uncertainty in exam results
      - Distinction between inevitable errors and preventable errors
      - Tolerance for errors in results
      - Disseminating reliability information

24. Strand 3a – Public perceptions of reliability (2)
    - Views on accuracy of GCSE grades

25. Strand 3a – Public perceptions of reliability (3)
    - Views on the national exams system

26. Strand 3b – Developing Ofqual reliability policy (1)
    - Ofqual reliability policy based on:
      - Evaluating findings from this programme
      - Evaluating findings from other reliability-related studies
      - Reviewing current practices adopted elsewhere

27. Ofqual Board recommendations
    - Continue work on reliability as a contribution to improving the quality assurance of qualifications, examinations and tests
    - Encourage awarding organisations to generate and publish reliability data
    - Continue to improve public and professional understanding of reliability and increase public confidence

28. Next steps
    - Publishing reliability compendium later this year
    - Reliability work becomes "business as usual"
    - Creation of a further policy

29. Today
    - Presentations from the Technical Advisory Group and experts in teaching, assessment research and communications
    - Question and answer session
    - Tell us your opinions or email them to [email_address]

30. Findings from the Reliability Research
    Professor Jo-Anne Baird, Technical Advisory Group Member

31. Refreshment Break

32. A view from the assessment community
    Paul E. Newton, Director, Cambridge Assessment Network Division
    Presentation to Ofqual event "The reliability programme: leading the way to better testing and assessments", 22 March 2011

33. We need to talk about error

34. Talking about error

35. The Telegraph (front page)

36. The professional justification
    - what the profession needs to accomplish through talking about error

37. The bad old days
    - “Boards seem to have strong objections to revealing their mysteries to ‘outsiders’ […] There have undoubtedly been cases of inquiries […] where publication would have been in the interests of education, and would have helped to prevent the spread of ‘horror-stories’ about such things as lack of equivalence which is an inevitable concomitant of the present cloak of secrecy.”
      Wiseman, S. (1961). The efficiency of examinations. In S. Wiseman (Ed.), Examinations in Education. Manchester: MUP.

38. Promulgating the myth
    - “However, any level of error has to be unacceptable – even just one candidate getting the wrong grade is entirely unacceptable for both the individual student and the system.”
      QCA. (2003). A level of preparation. TES insert. The TES, 4 April.

39. The technical justification
    - why users and stakeholders need to know about error

40. Using knowledge of error
    - Students and teachers
      - maybe you're better, or worse, than your grades suggest
    - Employers and selectors
      - maybe such fine distinctions shouldn't be drawn
      - maybe other information should be taken into account
    - Parents
      - maybe that difference in value added is insignificant
      - maybe inferences like that should not be drawn
    - Awarding bodies
      - maybe that examination (structure) is insufficiently robust
    - Policy makers
      - maybe that proposed use of results is illegitimate
      - maybe that policy change will compromise accuracy

41. Talking about error
    - the commitment to greater openness and transparency about error is nothing new
    - but there is still a long way to go

42. The 20-point scale (1969-72)
    - The presentation of results on
      (i) the broadsheet will be by a single number denoting a scale point for each subject taken by each candidate, accompanied by a statement on the range of uncertainty; and
      (ii) the candidate's certificate as a range of scale points (e.g. 13-17, corresponding to 15 on the broadsheet and indicating a range of uncertainty of plus or minus 2 scale points).
      Schools Council (1971). General Certificate of Education: Introduction of a New A-level Grading Scheme. London: Schools Council.

43. The 20-point scale (1969-72)
    - The following rubric is proposed, to be prominently displayed on both broadsheets and certificates:
      “Attention is drawn to the uncertainty inherent in any examination. In terms of the scale on which the above results are recorded, users should consider that a candidate's true level of attainment in each subject, while possibly represented by a scale point one or two higher or lower, is more likely to be represented by the scale point awarded than by any other scale point [...].”
      Report by the Joint Working Party on A-level comparability to the Second Examinations Committee of the Schools Council on grading at A-level in GCE examinations. (1971)

44. 20-point scale (1983-86)
    - It was proposed that the new scheme should have the following characteristics:
      [...] (d) results should be accompanied by a statement of the possible margin of error.
      JMB (1983). Problems of the GCE Advanced Level Grading Scheme. Manchester: Joint Matriculation Board.

45. Talking about error
    - there is disagreement within the profession over the concept of error
    - but, at least, we are beginning to make these differences of opinion more explicit

46. Measuring attainment

47. Judging performance
    - “I argue that there is a strong case for saying that it is more sensible to accept that exams are just about fair competition – which means your performance must be reliably turned into a score but you accept as the luck of the draw things like the question paper being tough for you or having hay fever on the day, etc. Moreover, I think if you do that you can design things like regulatory work on reliability so that they reflect the priorities of the public. This was behind my first question to you about your presentation yesterday – do you really think Joe Public is interested in Cell 6? That's an empirical question of course; I think the answer is no, but I'd love to find out for sure.”
      Mike Cresswell, 20 October 2009, personal communication

48. Uses of reliability information
    - Evaluation and improvement
      - highly technical (detailed & specific & idiosyncratic)
      - obscure (typically not published)
      - primary users = awarding bodies
    - Accountability
      - technical (but how detailed & generic & uniform?)
      - translatable (published but not necessarily disseminated)
      - primary users = regulator & analysts
    - Education
      - non-technical (uncomplicated & generic & uniform)
      - translated (widely disseminated)
      - primary users = members of the public

49. For education
    - How can we achieve greater openness and transparency?

50. The Sun

51. For education
    - use analogy, wherever possible
    - use commonsense, not technical, terms
    - convey misrepresentation, not variation
    - rely on heuristics, not statistics
    - “[…] results on a six or seven point grading scale are accurate to about one grade either side of that awarded.”
      Schools Council. (1980). Focus on Examinations. Pamphlet 5. London: Schools Council.

52. The importance of assessment results in today's education system...
    ...and communicating uncertainty in what they can tell us
    Warwick Mansell

53. The emphasis being placed on test results

54. One pupil's exam results: national implications
    [Flow diagram: a child takes exams, which are marked and graded; the results feed judgements at every level – teacher, department, head teacher and school results, Ofsted (school-level judgement), local authority / federation / academy chain (local-level judgement), civil servants and ministers (national-level judgement), education initiatives, national productivity, and the debate over whether state education is successful.]
55. Types of "error"

56. Error: "the difference between an approximate result and the true determination".

57. Communication of measurement error: it can, and is, done

58. “The information in these tables only provides part of the picture of each school's and its pupils' achievements. Schools change from year to year and their future results may differ from those achieved by current pupils. The tables should be considered alongside other important sources of information such as Ofsted reports and school prospectuses.”
    DfE, school performance tables website, 2011

59. What can go wrong if measurement certainty is not understood and communicated

61. Is the public ready to accept the concept of measurement error?

62. Sats results "wrong for thousands of pupils"
    Daily Telegraph, 13/11/09

63. "New Sats fiasco as one in three pupils 'will get wrong exam results'"
    Daily Mail, 31/1/09

64. Talking about reliability at the macro, and at the micro, level

65. Was I reliably informed...? ...a former principal ponders
    John Guy, formerly Principal, Farnborough Sixth Form College

66. 3,250 students; mostly A levels
    3,312 applications for 1,750 places in September 2010
    61 AS courses
    Biggest? AS Mathematics, AS Psychology, AS English, AS Media
    Smallest? AS Italian (6)

68. Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated. High reliability means that broadly the same outcomes would arise. A range of factors that exist in the assessment process can introduce unreliability into assessment results. (Un)reliability concerns the impact of the particular details that do happen to vary from one assessment to the next, for whatever reason.
    So reliability was important to the College... and we paid over £800,000 a year to get it.

69. Today's session: ponder aloud on reliability, the causes of unreliability and its impact upon College students
    - A level History
    - A level Business Studies
    - A level Art
    - O level Athletics

70. Hasna Benhassi, Tatyana Tomashova

71. A level History
    - 150-200 students taking A2 annually
    - Previous achievements and value-added indicators suggest an improving cohort
    - Stable cohort of experienced and inspiring teachers, led by the Chair of the History Teaching Association
    - Many experienced A level examiners
    - Could be employed by Higher Education – and would be awarding degrees...

72. History A level results, Awarding Body
    [Chart: results by year; completers 145, 140, 166, 179, 195]

73. Mapping raw scores to the UMS scale
    [Diagram: raw marks (0-60) mapped onto the UMS scale (0-100). Raw grade boundaries of roughly 27 (E), 30 (D), 34 (C), 38 (B) and 42 (A), with A* above, map to UMS boundaries at 40, 50, 60, 70, 80 and 90. A marking tolerance of +/-5% on raw marks is amplified to about +/-8% on the UMS scale.]
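The "tolerance amplified" figure in the diagram follows from simple arithmetic, assuming the layout sketched above: grades spaced about four raw marks apart, each mapping to a ten-point UMS band.

```latex
% Each grade spans 4 raw marks but 10 UMS points, so one raw mark is worth
% 10/4 = 2.5 UMS points in the graded region.
\[ \frac{10~\text{UMS points per grade}}{4~\text{raw marks per grade}} = 2.5~\text{UMS points per raw mark} \]
% A +/-5% marking tolerance on a 60-mark paper is +/-3 raw marks, which the
% mapping stretches to +/-7.5 UMS points, i.e. roughly the +/-8% on the slide.
\[ \pm 0.05 \times 60 = \pm 3~\text{raw marks} \quad\Longrightarrow\quad \pm 3 \times 2.5 = \pm 7.5~\text{UMS points} \approx \pm 8\% \]
```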
74. History A level results, Awarding Body
    [Chart repeated: completers 145, 140, 166, 179, 195]
    Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated.

75. Mapping raw scores
    [Diagram: raw marks 0-60 with grade boundaries of roughly E 27, D 30, C 34, B 38, A 42, and A* above. The A-E range should be 40% of the raw marks; here it is only 25% (raw 27-42, i.e. 45%-70% of the paper). A narrow A-E range produces unreliability.]

76. Business Studies 2011 A2 raw marks (from a web search)
    [Diagram: raw marks 0-60 with grade boundaries of roughly E 25, D 27, C 30, B 33, A 36, and A* above; an A-E range of just 18% (raw 25-36, i.e. 42%-60% of the paper).]
    - Raw marks over 42 are worth nothing
    - Raw marks between 27 and 42 are worth 3% (UMS) each
    - Raw marks between 23 and 27 are worth 5% each
    - Raw marks between 0 and 23 are worth 1.5% each
    - Candidate 1: Q4 = 4 raw marks, total 27, about 50%
    - Candidate 2: Q4 = 0 raw marks, total 23, about 30%
    - Is this a reliable or valid assessment instrument?
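The uneven "worth" of raw marks comes from interpolating between raw boundaries that are unevenly spaced against the fixed UMS scale. A minimal sketch using the per-mark values quoted on the slide (read off a web search, so treat the numbers as illustrative rather than the awarding body's published conversion):

```python
# UMS points earned per raw mark in each band, as quoted on the slide.
SEGMENTS = [  # (band start, band end, UMS points per raw mark)
    (0, 23, 1.5),
    (23, 27, 5.0),
    (27, 42, 3.0),
    (42, 60, 0.0),  # raw marks above 42 are worth nothing
]

def raw_to_ums(raw: float) -> float:
    """Accumulate UMS credit band by band up to the given raw mark."""
    ums = 0.0
    for start, end, rate in SEGMENTS:
        if raw > start:
            ums += (min(raw, end) - start) * rate
    return ums

# Candidate 2: 0 on Q4, total 23 raw. Candidate 1: 4 on Q4, total 27 raw.
print(raw_to_ums(23), raw_to_ums(27))  # 34.5 vs 54.5 UMS
```

Four raw marks on a single question separate the two candidates by about 20 UMS points (the slide rounds the two positions to roughly 30% and 50%): in the steep band each raw mark is worth more than three times what it is worth in the bottom band.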
77. The Regulated Assessment (wobbly) Ruler?
    [Cartoon: when you measure things, it's a good idea to use a reliable ruler. Two wobbly rulers measure Questions 1, 2, 3 and Questions 5, 6, 7, 8, giving scores of 4 and 0. Sometimes I think the College ruler is more reliable!]

78. AS level Art 2007 – 495 candidates
    Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated.

    Cumulative % at grade       A     B     C     D     E
    FSFC 2007                   14.1  37.5  72.7  93.1  97.1
    Joint Council figure 2007   21    42    66    83    94
    FSFC 2006                   23.2  55.4  87.3  96.3  98.3
    Joint Council figure 2006   22    44    67    84    94
    FSFC 2005                   20.7  48.3  82.2  97.8  99.3
    Joint Council figure 2005   21    42    65    82    92
    FSFC 2004                   20.4  45.2  78.3  94.4  99
    Joint Council figure 2004   22.2  42.5  63.8  81.4  92.4
    FSFC 2003                   22.8  46.7  68.7  85.1  95.9
    Joint Council figure 2003   22.2  42.2  63.5  80.6  91.5

79. 2007 – a special year
    - New specification – 4 units
    - Awarding Body invited teachers to a meeting to discuss grading
    - New boundaries for criterion judgements were proposed, with the grade A boundary set lower than in previous years
    - Attendance at the Awarding Body meetings was not compulsory

    New boundaries (used by the College): A 62, B 54, C 46, D 38, E 30
    - Criterion judgements, no disagreements at moderation; work praised (again) for consistent internal assessment
    Adjusted boundaries (summer 2007): A 69, B 60, C 51, D 42, E 33
    - Close to the historic grade boundaries which the awarding body had sought to change

80. Analysis
    Value added scores – 2005: +0.4; 2006: +0.4; 2007: -0.3; 2008: +0.4

    Chi-squared test   A      B      C     D     E     U
    2003-2006 (%)      21.8   27.1   30.1  14.3  4.75  2
    2007 expected      107.9  134.1  149   70.8  23.3  9.9
    2007 actual        70     116    174   101   20    14
    Chi-squared        13.32  2.45   4.2   12.9  0.46  1.7

    Sum: 35.02; tables give 18.47 at the 0.1% significance level.
    Assuming a cohort of similar ability, as agreed with the moderator, the chance of this change occurring randomly is infinitesimal.
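The chi-squared figure on the slide can be reproduced directly from the expected and actual grade counts; a minimal check in plain Python:

```python
# Expected 2007 grade counts (applying the 2003-2006 distribution to the
# 2007 cohort) against the grades actually awarded.
grades   = ["A", "B", "C", "D", "E", "U"]
expected = [107.9, 134.1, 149.0, 70.8, 23.3, 9.9]
actual   = [70, 116, 174, 101, 20, 14]

# Pearson's statistic: sum of (observed - expected)^2 / expected per grade.
contributions = [(o - e) ** 2 / e for o, e in zip(actual, expected)]
print({g: round(c, 2) for g, c in zip(grades, contributions)})
print(round(sum(contributions), 2))  # ~35.0, matching the slide's 35.02
# Far above the 0.1% critical value the slide quotes (18.47), so a chance
# fluctuation of this size is, as the slide says, vanishingly unlikely.
```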
81. Was this a reliable assessment?
    - The College immediately contacted the Board and was told to appeal
    - The College appealed, sending a copy of the letter to Ofqual and the Chief Executive
    - The appeal was heard by three members who were interested only in process
    - The appeal was rejected
    - No doubt the process was followed assiduously
    - However, the process was flawed

82. Conclusions
    - Large cohorts from open access colleges are representative of the whole population
    - Large cohorts of students therefore provide an opportunity for an additional check on processes
    - Statistical analysis of the entire cohort will hide flaws in the assessment process
    - An error is associated with every measurement, but some measurements are error(mistake)-ridden – and unfair
    - Is error (mistake) designed into the assessment instrument? Awarding bodies are not keen to admit it!
    Reliability refers to the consistency of outcomes that would be observed from an assessment process were it to be repeated.

83. Questions and Answers to the Panel of Speakers
    Chair: Glenys Stacey, Ofqual Chief Executive

84. Ofqual's Reliability Programme: closing remarks
    Dennis Opposs

85. Ofqual Board recommendations
    - Recommendation 1: Continue work on reliability as a contribution to improving the quality assurance of qualifications, examinations and tests.
      - Work in the areas of teacher assessment, workplace-based assessment and construct validity of assessment would be of particular interest and importance
      - The scope of the work possible will clearly be limited by the resources available

86. Ofqual Board recommendations
    - Recommendation 2: Encourage Awarding Organisations to generate and publish reliability data.
      - We need to use impact assessments to help decide what is appropriate
      - The first progress is likely to involve GCSEs and A levels, where the work has progressed furthest
      - In due course we might make some of this a regulatory requirement for Awarding Organisations

87. Ofqual Board recommendations
    - Recommendation 3: Continue to improve public and professional understanding of reliability and increase public confidence in the examination system by working with the Awarding Organisations and others.

88. Next steps
    - Publishing reliability compendium later this year
    - Reliability work becomes "business as usual"
    - Creation of a further policy

89. Today
    Tell us your opinions or email them to [email_address]

90. Thank you for attending
    Networking Lunch