MD - Assessment literacy in a teacher evaluation frame 11-12
 

Utilizing a framework for using data in teacher evaluations we answer questions about assessments and data and provide areas to consider.



  • Concept: if we fix schools, we fix education. Schools actually did improve during this period. Race to the Top, the Gates Foundation, Teach for America: the shift was signaled in a number of ways. NCLB was about fixing schools (100% proficient by 2014), with punishments for missing AYP (SES, choice, restructuring). The Obama administration switched emphasis with Race to the Top: fixing or improving teaching and the teaching profession, and recruiting teachers from alternative careers. This is a move from holding schools accountable to holding teachers accountable. Wrong? No. Different? Yes. David Brooks (Aug 2010, Atlantic Monthly): teachers are fair game, under scrutiny, somewhat unfairly. Boards of education are asking about test-based accountability. Charleston, SC: any teacher without 50% of students on the growth norm goes on report in year 1 and in year 2 can only be rehired by board approval; 50% in year 1 and 25% in year 2 were rehired. Our goal: make sure you are prepared, understand the risk, know the proper ways to implement (including legal issues), and clarify some of the implications. It is very complex; we want to prepare you with a prudent course.
  • Teacher evaluations, and the use of data in them, can take many forms. You can use them to support teachers and their improvement; you can use them to compensate teachers or groups of teachers differently; or you can use them in their highest-stakes form, to terminate teachers. The higher the stakes put on the evaluation, the more risk there is to you and your organization from a political, legal, and equity perspective. Most people naturally respond by increasing the rigor of the design process as a way to ameliorate the risk, but the risk can't be eliminated. Our goal: make sure you are prepared, understand the risk, know the proper ways to implement (including legal issues), and clarify some of the implications. It is very complex; we want to prepare you with a prudent course.
  • Contrast this with what value added communicates. Plot normal growth for Marcus against anticipated growth (value added). If you ask whether the teachers provided value added, the answer is yes. The other line is what is needed for college readiness. The blue line is what is used to evaluate the teacher. Is he on the line the parents want him to be on? Probably not. Don't focus on one at the expense of the other. Consider NCLB's AYP versus what the parent really wants for goal setting: we can become so focused on measuring teachers that we lose sight of what parents value. We are better off moving toward the kids' aspirations. As a parent, I didn't care whether the school made AYP; I cared whether my kids got the courses that helped them go where they want to go.
  • This is the value-added metric. It is not easy to make nuanced decisions with it; we can learn about the ends of the distribution.
  • The steps are quite important, and people tend to skip some of them. Kids take a test; it is important that the test is aligned to the instruction being given. The metric: look at growth versus the growth norm and calculate a growth index. Two benefits: it is very transparent and simple. People tend to use our growth norms; if you hit 60% for a grade level within a school, you are doing well. Norms compare the growth of a kid, or a group of kids, to a nationally representative sample of students. Why isn't this value added? Not all teachers can be compared to a nationally representative sample, because they don't teach kids who are just like the national sample. The third step controls for variables unique to the teacher's classroom or environment. The fourth step is the rating: how much below average before the district takes action, or how much above before someone gets performance pay. This is a particular challenge in New York state right now; the law requires it.
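The metric step described in these notes, growth measured against a growth norm and expressed as a growth index, can be sketched in a few lines. This is a minimal illustration only; the function names, the norm value, and the standard error below are hypothetical, not NWEA's published norms.

```python
def growth_index(pre_score, post_score, norm_growth, se_growth):
    """Hypothetical growth index: observed growth relative to a growth
    norm, in standard-error units. Positive means above typical growth."""
    observed = post_score - pre_score
    return (observed - norm_growth) / se_growth

def percent_meeting_norm(student_growths, norm_growth):
    """Share of students whose observed growth met or beat the norm
    (the notes suggest roughly 60% is doing well for a grade level)."""
    met = sum(1 for g in student_growths if g >= norm_growth)
    return met / len(student_growths)

# Made-up RIT-like scores: fall 200 to spring 208, against a
# hypothetical norm of 6 points with a standard error of 3.
print(growth_index(200, 208, 6, 3))              # about 0.67
print(percent_meeting_norm([8, 5, 7, 6, 3], 6))  # 0.6
```

The appeal of the metric, as the notes say, is exactly this transparency: a teacher or principal can recompute it by hand.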
  • A state assessment is designed to measure proficiency, with many items in the middle and few at the ends. You must use multiple points of data over time to measure this. We also believe that a principal should be more in control of the evaluation than the test; principals and teacher leaders are what change schools.
  • 5th grade MD reading cut scores shown
  • The Common Core sets out very ambitious things to measure, tackling tasks you would see on an AP test: write and show your work. A Common Core assessment used to evaluate teachers can be a problem. Raise your hand if you know what the capital of Chile is. Santiago. Repeat after me; we will review in a couple of minutes. Facts can be relatively easily acquired and are instructionally sensitive: if you expose kids to facts in meaningful and engaging ways, the result is sensitive to instruction.
  • The problem: complex tasks are insensitive to instruction. They require prerequisite skills, such as writing. Given events in North Africa today, the question requires a lot of prerequisite knowledge: you need to know the story, put it into writing, and use reasoning skills to connect it with events today, and you need to know what is going on today as well. One doesn't develop this entire set of skills in nine months of instruction. The Common Core is what we want, just not for teacher evaluation. These questions are not that sensitive to instruction, which is problematic when we hold teachers accountable for instruction or growth.
  • NCLB required everyone to get above proficient, so the message was to focus on kids at or near proficient, and school systems responded. Middle school standards are harder than the elementary standards, which creates a middle school problem: there was no effort to calibrate them, no effort to project the elementary standards onto the middle school standards. They start easy and ramp up. A student can be proficient in elementary school and not in middle school with normal growth. When you control for the difficulty of the standards, elementary and middle school performance are the same.
  • Not only are standards different across grades, they are different across states. It's data like this that helped inspire the Common Core and consistent standards, so we compare apples to apples.
  • There are dramatic differences between standards-based and growth views. This is KY 5th grade mathematics, a sample of students from a large school system. The x-axis is the fall score; the y-axis is the number of kids. Blue are the kids who did not change status between fall and spring on the state test. Red are the kids who declined in performance over the spring (descenders). Green are the kids who moved up in performance over the spring (ascenders, the "bubble kids"), about 10% of the total number of kids. Accountability plans are typically made based on these red and green kids.
  • Same district as before. Yellow: did not meet target growth, spread over the entire range of kids. Green: did meet growth targets. 60% versus 40% is doing well; this is a high-performing district with high growth. You must attend to all kids, the ones in the middle and at both extremes, and this is a good thing. The old approach was discriminatory, focusing on some in lieu of others, with teachers teaching really hard at the standard for years. Teachers need to be able to reach them all. This does a lot to move the accountability system toward what parents and we desire.
  • There are wonderful teachers who teach in very challenging, dysfunctional settings, and the setting can impact growth. HLM embeds the student in a classroom and the classroom in a school, and controls for the school parameters. Is it perfect? No. Is it better? Yes. The opposite is also true: learning can be magnified as well. What if the kids are a challenge, because of ESL status or attendance, for instance? That can deflate scores, especially with a low number of kids in the sample being analyzed. You also need a large enough n to make this possible, which is especially an issue in small districts. Our position is that a test can inform the decision, but the principal or administrator should collect the bulk of the data used in the performance evaluation process.
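The intuition behind that nesting can be shown with a deliberately oversimplified sketch: judge a classroom's growth against its own school's context rather than only the national norm. Real HLM fits nested random effects with proper shrinkage; this one-line adjustment, with made-up numbers, only illustrates the direction of the correction.

```python
from statistics import mean

def school_adjusted_growth(classroom_growths, school_growths, national_norm):
    """Toy version of the HLM idea: subtract the school's deviation from
    the national norm before judging the classroom, so a teacher in a
    low-growth school is not penalized for the school context."""
    school_effect = mean(school_growths) - national_norm
    return mean(classroom_growths) - national_norm - school_effect

# A classroom gaining 5 points inside a school that typically gains 3,
# against a national norm of 6 (all numbers hypothetical):
raw = mean([5, 5, 5]) - 6                                # -1: below norm
adjusted = school_adjusted_growth([5, 5, 5], [3, 3, 3], 6)  # +2: above school context
print(raw, adjusted)
```

The same classroom looks below average nationally but well above expectation for its school, which is exactly the fairness point the note is making.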
  • Experts recommend multiple years of data for the evaluation; it is invalid to use just two points, and they will testify to it. "Principals never fire anyone" (the NY rubber room) is a myth; the real complaint is that when they do, it's not fast enough, so the process needs to speed up. This won't make the process faster; principals doing intense evaluations will.
  • The question we asked: are teachers who are rated poorly or well in one year likely to stay there in the second year? This is important for high stakes, where there is a belief that someone won't improve. We did a value-added assessment in year one and again in year two, with 493 teachers. 40% of the people in the bottom quintile moved out of it. The year 1 to year 2 correlations here are higher than in most other studies; ours is a best-case scenario. One class can impact results, so you need multiple years of data to get stable results.
  • Measurement error is compounded in test 1 and test 2
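This compounding follows from how the errors of independent measurements combine: a growth score is a difference of two test scores, so its standard error is the quadrature sum of the two tests' SEMs. A small sketch, with hypothetical SEM values:

```python
import math

def growth_sem(sem_test1, sem_test2):
    """Standard error of a difference (growth) score: independent errors
    from the two test events add in quadrature, so the growth score is
    always noisier than either test alone."""
    return math.sqrt(sem_test1 ** 2 + sem_test2 ** 2)

# Two tests each with a hypothetical SEM of 3 scale-score points:
print(growth_sem(3, 3))  # about 4.24, roughly 41% more error than one test
```

This is why an evaluation built on one fall-to-spring growth score carries more measurement error than either test event suggests on its own.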
  • The green line is the teacher's value-added estimate, and the bar is the error of measure. At both the top and the bottom, people can belong in other quintiles; people in the middle can cross quintiles based on the SEM alone. Think of cross country: the winners spread out at the end of the race, while the middle runs in a pack, so moving up in the middle makes a big difference in the overall ranking. The instability and the narrowness of the middle of the range mean that, for teachers in the middle, slight changes in performance can produce a large change in performance ranking.
  • Non-random assignment: models control for various things (FRL, ethnicity, overall school effectiveness) and assume assignment is random beyond that point. First-year teachers get more discipline problems than teachers who have been there 30 years and can pick the kids they get. If the model doesn't control for disciplinary record, and none have that data, those veterans' scores are inflated, which makes the model invalid. Principals do need to make non-random assignments for sound educational reasons, matching adults to kids.
  • One or two kids can impact a classroom result, but not a grade- or school-level result. Why? A large n helps reduce the standard error.
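The arithmetic behind that point is simple enough to show directly. With hypothetical growth numbers, one idiosyncratic student moves a 25-student class average far more than a 250-student grade-level average:

```python
from statistics import mean

def mean_with_outlier(n, typical_growth, outlier_growth):
    """Average growth for a group of n students where one student is an
    outlier and the rest grow typically."""
    growths = [typical_growth] * (n - 1) + [outlier_growth]
    return mean(growths)

# One student with -20 growth (hypothetical) among peers growing +6:
print(mean_with_outlier(25, 6, -20))   # 4.96: the class mean drops a full point
print(mean_with_outlier(250, 6, -20))  # 5.896: barely moves at grade level
```

The same one-student shock costs the classroom over a point of average growth but only about a tenth of a point at the grade level, which is the standard-error argument in miniature.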
  • Use the NY point system as the example.
  • Assessment is ultimately meant to serve kids. Be thoughtful, and get help. Involve stakeholders in the creation of a comprehensive evaluation system with multiple measures of teacher effectiveness (RAND, 2010). Select the measures and value-added models carefully. Bring as much data to bear as possible to create a body of evidence. Start small and learn. We wouldn't be who we are if I didn't stress using the data for formative purposes; that's what we really value.

Presentation Transcript

  • Assessment Literacy in a Teacher Evaluation Frame. Andy Hegedus, Ed.D. November 2012
  • Trying to gauge my audience and adjust my speed . . . How many of you think your literacy with assessments in general is “Good” or better? How many of you are currently figuring out how to use assessment data thoughtfully in a teacher evaluation process?
  • Go forth thoughtfully, with care. What we’ve known to be true is now being shown to be true: using data thoughtfully improves student achievement. There are dangers present, however: unintended consequences.
  • Remember the old adage? “What gets measured (and attended to), gets done.”
  • An infamous example: NCLB. It cast light on inequities, improved the performance of “bubble kids,” and narrowed the taught curriculum.
  • A patient’s health doesn’t change because we know their blood pressure. It’s our response that makes all the difference. It’s what we do that counts.
  • Data use in teacher evaluation is our construct for today. Our nation has moved from a model of education reform that focused on fixing schools to a model that is focused on fixing the teaching profession.
  • Be considerate of the continuum of stakes involved: Support, Compensate, Terminate. Moving up the continuum brings increasing risk and increasing levels of required rigor.
  • Let’s get clear on terms. Growth: a depiction of progress over time along a cross-grade scale. Value-added: a determination of whether growth is greater for a particular student or group of students than would be expected.
  • Marcus’ growth (chart: Marcus’ normal growth versus the growth needed to reach the college readiness standard).
  • What question is being answered in support of using data in evaluating teachers? Is the progress produced by this teacher dramatically different than teaching peers who deliver instruction to comparable students in comparable situations?
  • There are four key steps required to answer this question (the top-down model): The Test, The Growth Metric, The Evaluation, The Rating.
  • How does the other popular process work? The bottom-up model: Assessment 1, Goal Setting, Assessment(s), Results and Analysis, Evaluation (Rating). Understanding all four of the top-down elements is needed here.
  • Let’s begin at the beginning: The Test.
  • The purpose and design of the instrument is significant. Many assessments are not designed to measure growth; others do not measure growth equally well for all students.
  • Both status and growth are important (chart: value added as the teacher’s contribution to growth between time 1 and time 2, on a scale running from beginning literacy to adult reading).
  • Teachers encounter a distribution of student performance (chart: students spread across the scale around the grade-level norm, the “typical” performance for a reference population).
  • Traditional assessment uses items reflecting the grade-level standards (chart: an item bank aligned to the 4th, 5th, and 6th grade standards).
  • Traditional assessment uses items reflecting the grade-level standards; the overlap between adjacent grades’ standards allows linking and scale construction.
  • Adaptive testing works differently: the item bank can span the full range of achievement.
  • Available item pool depth is crucial (chart: the estimated RIT trace across correct and incorrect responses).
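The adaptive mechanism the slides describe can be sketched with a toy item-selection loop. This is not MAP's actual algorithm (MAP uses maximum-likelihood scoring over a large calibrated bank); it is a minimal stand-in using a Rasch response model, a pick-the-closest-item rule, and a crude shrinking-step ability update, with hypothetical item difficulties.

```python
import math

def rasch_p(theta, b):
    """Rasch model: probability a student of ability theta answers an
    item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta_est, item_bank, used):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate; that item is the most informative one."""
    candidates = [b for i, b in enumerate(item_bank) if i not in used]
    return min(candidates, key=lambda b: abs(b - theta_est))

def run_adaptive(responses, item_bank, step=1.0):
    """Toy adaptive run: after each response, nudge the estimate toward
    the evidence with a shrinking step (a crude stand-in for
    maximum-likelihood updating)."""
    theta, used = 0.0, set()
    for correct in responses:
        b = next_item(theta, item_bank, used)
        used.add(item_bank.index(b))
        theta += step if correct else -step
        step /= 2
    return theta

# Hypothetical bank spanning the achievement range, two correct
# answers then one miss:
print(run_adaptive([True, True, False], [-2, -1, 0, 1, 2]))  # 1.25
```

Note how quickly the loop exhausts items near the estimate: with only five items, the third selection is already half a logit away, which is the "item pool depth" concern on the slide.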
  • Tests are not equally accurate for all students (comparison: California STAR versus NWEA MAP).
  • These differences impact measurement error (chart: test information by scale score for a traditional 5th grade test versus an adaptive test; error differs significantly across the Basic, Proficient, and Advanced ranges, roughly the 25th to 82nd percentiles).
  • Error can change your life! Think of a high-stakes test: a state summative designed to identify whether a student is proficient or not. Does it do that well? 93% correct on the proficiency determination. Does it go off design well? 75% correct on the performance-level determination with five performance levels. (Testing: Not an Exact Science, Education Policy Brief, Delaware Education Research & Development Center, May 2004, http://dspace.udel.edu:8080/dspace/handle/19716/244)
  • Error can change your life!!! A study on the impact of assessment selection on VAM results defined a misidentified teacher as one who appeared to have growth that was incorrect by more than one-half a year: less than 0.5 years or more than 1.5 years. (Woodworth, J.L., Does Assessment Selection Matter When Computing Teacher Value-Added Measures?, http://www.kingsburycenter.org/sites/default/files/James%20Woodworth%20Data%20Award%20Research%20Brief.pdf)
  • Error can change your life!!!• “. . . in the 25 student (single class) simulations. At the 25 student level, the VAM based on the TAKS misidentifies 35% of all teachers, whereas, the VAM based on the MAP misidentifies only 1% of teachers.”Unaccounted measurement error is a huge issue in AYP and Teacher Evaluation work
  • What is measured should be aligned to what is being taught. Assessments should align with the teacher’s instructional responsibility. Validity: is it assessing what you think it’s assessing? Reliability: if we gave it again, would the results be consistent?
  • The instrument must be able to detect instruction. “…when science is defined in terms of knowledge of facts that are taught in school…(then) those students who have been taught the facts will know them, and those who have not will…not. A test that assesses these skills is likely to be highly sensitive to instruction.” (Black, P. and Wiliam, D. (2007) Large-scale assessment systems: Design principles drawn from international comparisons, Measurement: Interdisciplinary Research & Perspectives, 5:1, 1-53)
  • The more complex the skill, the harder it is to detect and attribute to one teacher. “When ability in science is defined in terms of scientific reasoning…achievement will be less closely tied to age and exposure, and more closely related to general intelligence. In other words, science reasoning tasks are relatively insensitive to instruction.” (Black, P. and Wiliam, D. (2007), ibid.)
  • Uncovered subjects and teachers: high-quality tests may not be administered, or even available, for many teachers and grades. Subjects like social studies may be particularly problematic.
  • Considerations for developing your own assessments and student learning objectives: developing valid instruments is very time-consuming and resource-intensive; the assessments must discriminate between effective and ineffective teachers; they must be valid in other respects (aligned to the curriculum, with unbiased items); and they can’t be open to security violations or cheating.
  • Other issues: security and cheating. When measuring growth, one teacher who cheats disadvantages the next teacher.
  • Security considerations: teachers should not be allowed to view the contents of the item bank or record items. Districts should have policies for accommodations that are based on student IEPs, should consider having both the teacher and a proctor in the test room, and should consider whether other security measures are needed for the protection of both teachers and administrators.
  • Other issues: proctoring. Proctoring both with and without the classroom teacher raises possible problems. Documentation that test administration procedures were properly followed is important.
  • Testing is complete . . . what is useful to answer our question? The Growth Metric.
  • The metric matters; let’s go underneath “proficiency” (chart: difficulty of the Maryland proficient cut score in reading, grades 3 through 8, expressed as national percentiles and compared with college readiness).
  • Difficulty of ACT college readiness standards
  • The metric matters; let’s go underneath “proficiency.” (Dahlin, M. and Durant, S., The State of Proficiency, Kingsbury Center at NWEA, July 2011)
  • What gets measured and attended to really does matter (chart: one district’s change in 5th grade mathematics performance relative to the KY proficiency cut scores, showing the number of students who moved up, moved down, or did not change, by fall RIT).
  • Changing from proficiency to growth means all kids matter (chart: the number of 5th grade students meeting projected mathematics growth in the same district, by fall score, split into below versus met-or-above projected growth).
  • How can we make it fair? The Evaluation.
  • Consider . . . what if I skip this step? The comparison is likely against normative data, so the comparison is to “typical kids in typical settings.” How fair is it to disregard context? Good teacher, bad school; good teacher, challenging kids. How does your goal setting consider context?
  • Challenges with goal setting: a lack of historical context (what have this teacher and these students done in the past?); a lack of comparison groups (what have other teachers done in the past?); an unclear objective (is it to meet a standard of performance or to demonstrate improvement?); and the choice between safety goals and stretch goals.
  • Nothing is perfect. Value-added models control for a variety of classroom, school-level, and other conditions; there are over one hundred different value-added models, all attempting to minimize error, and variables outside the controls are assumed to be random. Results are not stable: using multiple years of data is highly recommended, and results are more likely to be stable at the extremes.
  • Multiple years of data are necessary for some stability (chart: teachers with growth scores in the lowest and highest quintile over two years using NWEA’s MAP, 493 teachers). Typical r values for measures of teaching effectiveness range between .30 and .60 (Brown Center on Education Policy, 2010).
  • A variety of errors means more stability only at the extremes. All models attempt to control for statistical error, but error is compounded when combining two test events. Nevertheless, many teachers’ value-added scores will fall within the range of statistical error.
  • Range of teacher value-added estimates (chart: mathematics growth index distribution by teacher, validity filtered; each line represents a single teacher’s average growth index score plus or minus the standard error of the estimate, grouped into quintiles Q1 through Q5; students with tests of questionable validity and teachers with fewer than 20 students were removed).
  • With one teacher, error means a lot
  • The assumption of randomness can have risk implications. Value-added models assume that variation is caused by randomness if it is not controlled for explicitly, yet young teachers are assigned disproportionate numbers of students with poor discipline records, and parent requests for the “best” teachers are honored. Sound educational reasons for placement are likely to be defensible.
  • Possible racial bias in models: “Significant evidence of bias plagued the value-added model estimated for the Los Angeles Times in 2010, including significant patterns of racial disparities in teacher ratings both by the race of the student served and by the race of the teachers (see Green, Baker and Oluwole, 2012). These model biases raise the possibility that Title VII disparate impact claims might also be filed by teachers dismissed on the basis of their value-added estimates. Additional analyses of the data, including richer models using additional variables, mitigated substantial portions of the bias in the LA Times models (Briggs & Domingue, 2010).” (Baker, B. (2012, April 28). If it’s not valid, reliability doesn’t matter so much! More on VAM-ing & SGP-ing Teacher Dismissal.)
  • Instability at the tails of the distribution: “The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.” (Ballou, D., Mokher, C. and Cavalluzzo, L. (2012) Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specification Influence Teachers’ Outcomes. Chart: LA Times Teacher #1 and Teacher #2.)
  • Lower numbers can significantly impact a teacher-level analysis: in self-contained classrooms, one or two idiosyncratic cases can have a large effect on results.
  • How tests are used to evaluate teachers: The Rating.
  • Translation into ratings can be difficult to inform with data. How would you translate a rank order into a rating? Data can be provided, but a value judgment is ultimately used to set cut scores for points or a rating.
  • Decisions are value-based, not empirical. What counts as far below a district’s expectation is subjective. And what about the obligation to help teachers improve, or the quality of replacement teachers?
  • Even multiple measures need to be used well. The system for combining elements and producing a rating is also a value-based decision; multiple measures and principal judgment must be included, and the extremes should be evaluated to make sure the results make sense.
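The combining step the slides describe, for example a New York-style point system, can be sketched as a weighted composite mapped through cut scores. The component names, weights, and cut points below are hypothetical illustrations of the idea, not the actual state regulation; the key point the code makes explicit is that the cut scores are a value judgment supplied by people, not something the data produces.

```python
def composite_rating(points, cut_scores):
    """Sum component points, then map the total to a rating label using
    value-based cut scores (ordered from highest cut to lowest)."""
    total = sum(points.values())
    for cut, label in cut_scores:
        if total >= cut:
            return label
    return cut_scores[-1][1]

# Hypothetical components: 20 possible growth points, 20 local-measure
# points, 60 observation points, with made-up HEDI-style cuts.
points = {"state_growth": 14, "local_measures": 15, "observation": 50}
cuts = [(91, "Highly Effective"), (75, "Effective"),
        (65, "Developing"), (0, "Ineffective")]
print(composite_rating(points, cuts))  # Effective
```

Evaluating the extremes, as the slide advises, amounts to running cases like all-top or all-bottom component scores through this mapping and checking that the resulting labels are defensible.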
  • Potential litigation issues: the use of value-added data for high-stakes personnel decisions does not yet have a strong, coherent body of case law. Expect litigation if value-added results are the lynchpin evidence for a teacher-dismissal case until a body of case law is established.
  • Possible legal issues: Title VII of the Civil Rights Act of 1964 (disparate impact of sanctions on a protected group); state statutes that provide tenure and other related protections to teachers; and challenges to a finding of “incompetence” stemming from the growth or value-added data.
  • Recommendations: embrace the formative advantages of growth measurement as well as the summative. Create comprehensive evaluation systems with multiple measures of teacher effectiveness (RAND, 2010), involving a variety of stakeholders and beginning with pilots to understand accuracy and unintended consequences. Select measures as carefully as value-added models. Use multiple years of student achievement data. Understand the issues and the tradeoffs. Be thoughtful.
  • More information: presentations and other recommended resources are available at www.nwea.org, www.kingsburycenter.org, and Slideshare.net. Contact us: NWEA main number 503-624-1951; e-mail: andy.hegedus@nwea.org.