Colorado assessment summit_oct12

Considerations when using tests for
teacher evaluation

Presenter - John Cronin, Ph.D.

Contacting us:
NWEA Main Number: 503-624-1951
E-mail: rebecca.moore@nwea.org

This PowerPoint presentation and recommended resources are
available at our website: www.kingsburycenter.org

Label each player as effective, partially
effective, or ineffective

Avg. HR RBI SB
.309 5 54 7
.303 13 53 20
.271 4 30 7
.270 28 71 4
.260 16 58 3
.238 7 37 1
.217 5 28 0

Label each player as effective, partially
effective, or ineffective

Player     Avg.  HR  RBI  SB
Rosario    .309   5   54   7
Gonzales   .303  13   53  20
Scuturo    .271   4   30   7
Cudger     .270  28   71   4
Helton     .260  16   58   3
Hernandez  .238   7   37   1
Rosario    .217   5   28   0

Facts about baseball players

• If effective baseball players hit .300, then 90%
of baseball players are ineffective.
• If effective baseball players are better-than-average
hitters, then 50% are ineffective.
• A baseball player retains his job if he performs
better than the available replacement.
• Most of the pool of available replacements are
lousy baseball players.
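
The first two facts can be checked against the seven batting averages from the earlier slide. This is only an illustrative sketch: the function names are hypothetical, and treating "effective" as "at or above .300" or "above the group mean" is an assumption made for the example.

```python
from statistics import mean

# Batting averages from the table on the earlier slide.
averages = [0.309, 0.303, 0.271, 0.270, 0.260, 0.238, 0.217]

def effective_by_cutoff(values, cutoff=0.300):
    """Assumed rule 1: effective means hitting at or above the cutoff."""
    return [v for v in values if v >= cutoff]

def effective_by_average(values):
    """Assumed rule 2: effective means hitting above the group mean."""
    group_mean = mean(values)
    return [v for v in values if v > group_mean]

n = len(averages)
print(len(effective_by_cutoff(averages)), "of", n, "effective at a .300 cutoff")
print(len(effective_by_average(averages)), "of", n, "effective if above average")
```

With these seven players, the share labeled effective depends entirely on which rule is chosen, not on anything the players did differently.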

Application to teaching

Don’t dismiss teachers for incompetence unless
you know you can replace them with someone
better.

Don’t identify more teachers for dismissal than
you can support through remediation.

Don’t identify more teachers for dismissal than
you can manage through the dismissal process.

Key requirements related to testing

• Assessment constitutes 50% of the evaluation.
• Statewide summative assessments are used for subjects in which they are available.
Districts will be on their own for other subjects.
• Use of the Colorado Growth Model with statewide assessment.
• A measure of individually attributed or collectively attributed student
growth.
• Local measure must be credible, valid (aligned), reliable, and inferences
from the measure must be supportable by evidence and logic.
• The law requires that the measures should support consistent inferences.
• A rating of ineffective or partially effective can lead to loss of
non-probationary status.
• If a value-added model is used the model must be transparent enough to
permit external evaluation.

Unique characteristics of the
Colorado approach
• Student progress counts for 50% of the
evaluation.
• Teachers are evaluated on both a “catch up”
and “keep up” metric (at least on TCAP)
• The Colorado Growth Model will likely be used
to evaluate progress (at least on TCAP)

Obvious possible issues

• The requirement that the assessment support
inferences of teacher effectiveness opens a
legal question.
• The credibility requirement is unique and has not
yet been interpreted.

How tests are used to evaluate teachers and
principals

Testing

Metric (Growth or Gain Score)

Analysis (Value Added Effect
Size and/or ranking)

Evaluation (Performance
Rating)

Expect consistent inconsistency!

Inconsistency occurs because of

• Differences in test design.
• Differences in testing conditions.
• Differences in the models being applied to
evaluate growth.

Inconsistency between tests

California STAR NWEA MAP

The reliability problem –
Inconsistency in testing conditions

(Diagram: test-retest design. Test 1 and Test 2 are given at Time 1, then Test 1 and Test 2 are given again at Time 2.)

The problem with spring-spring testing

Teacher 1 Summer Teacher 2

3/11 4/11 5/11 6/11 7/11 8/11 9/11 10/11 11/11 12/11 1/12 2/12 3/12

Characteristics of value-added metrics

• Value-added metrics are inherently NORMATIVE.
• If below average = partially effective, then about half of the
staff will be partially effective.
• Value-added metrics can’t measure progress of the
larger group over time.
• Extreme performance is more likely to have alternate
explanations.
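
The normative point can be illustrated with a small sketch. The scores below are hypothetical, and using the group median as the "average" cut is an assumption for the example. If every teacher improves by the same amount, the share falling below the cut does not change.

```python
from statistics import median

def share_below_median(scores):
    """Fraction of scores strictly below the group median."""
    cut = median(scores)
    return sum(s < cut for s in scores) / len(scores)

# Hypothetical growth scores for ten teachers (not from the presentation).
year1 = [-4.0, -2.5, -1.0, -0.5, 0.5, 1.0, 2.0, 3.5, 4.0, 6.0]

# Suppose every teacher's result improves by the same 5 points.
year2 = [s + 5.0 for s in year1]

print(share_below_median(year1))
print(share_below_median(year2))
```

Both years report the same share below the cut, which is why a purely normative metric cannot show progress of the whole group over time.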

Issues in the use of growth and value-
added measures

“Among those who ranked in the top
category on the TAKS reading test, more
than 17% ranked among the lowest two
categories on the Stanford. Similarly
more than 15% of the lowest value-added
teachers on the TAKS were in the highest
two categories on the Stanford.”

Corcoran, S., Jennings, J., & Beveridge, A., Teacher Effectiveness on High and Low Stakes
Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI
(2010).

Reliability of teacher value-added
estimates
Teachers with growth scores in lowest and
highest quintile over two years using NWEA’s
Measures of Academic Progress

          Bottom quintile   Top quintile
          Y1 & Y2           Y1 & Y2
Number    59/493            63/493
Percent   12%               13%

r = .64, r² = .41

Typical r values for measures of teaching effectiveness range
between .30 and .60 (Brown Center on Education Policy, 2010)
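
The figures on this slide can be reproduced with simple arithmetic. The sketch below assumes that 493 is the total number of teachers in the sample. The 4% "chance" benchmark is an added comparison, not from the slide, and assumes quintile membership in the two years is completely unrelated.

```python
r = 0.64                 # year-to-year correlation from the slide
total_teachers = 493     # assumed to be the full sample
bottom_both = 59         # bottom quintile in both years
top_both = 63            # top quintile in both years

r_squared = r ** 2                      # share of variance shared across years
pct_bottom = bottom_both / total_teachers
pct_top = top_both / total_teachers

# Added benchmark (assumption): if the two years were independent, the
# chance of landing in the same given quintile twice is 20% x 20%.
chance_same_quintile = 0.20 * 0.20

print(f"r^2 = {r_squared:.2f}")
print(f"bottom quintile both years: {pct_bottom:.0%}")
print(f"top quintile both years: {pct_top:.0%}")
print(f"expected by chance alone: {chance_same_quintile:.0%}")
```

Under these assumptions the observed 12% and 13% sit between the 4% expected by chance and the 20% that perfectly stable rankings would produce.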

Range of teacher value-added
estimates

Mathematics Growth Index Distribution by Teacher - Validity Filtered

Each line in this display represents a single teacher. The graphic
shows the average growth index score for each teacher (green
line), plus or minus the standard error of the growth index estimate
(black line). We removed students who had tests of questionable
validity and teachers with fewer than 20 students.

(Chart: vertical axis is Average Growth Index Score and Range, from -12.00 to 12.00; teachers are grouped into quintiles Q1 to Q5.)
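
A display like this one could be built from per-teacher summaries along the following lines. This is a minimal sketch, not the method used for the slide: the function name and data are hypothetical, and the standard error is assumed to be the ordinary standard error of the mean, which may differ from how the growth index estimate's error was actually computed.

```python
from math import sqrt
from statistics import mean, stdev

MIN_STUDENTS = 20  # the slide drops teachers with fewer than 20 students

def teacher_summary(growth_by_teacher, min_students=MIN_STUDENTS):
    """Return {teacher: (mean, low, high)} using mean +/- one standard error."""
    summary = {}
    for teacher, scores in growth_by_teacher.items():
        n = len(scores)
        if n < min_students:
            continue  # too few students to report
        avg = mean(scores)
        se = stdev(scores) / sqrt(n)  # assumed: standard error of the mean
        summary[teacher] = (avg, avg - se, avg + se)
    return summary

# Hypothetical growth index scores per student, keyed by teacher.
data = {
    "teacher_a": [1.0, 3.0] * 10,  # 20 students, kept
    "teacher_b": [2.0] * 5,        # 5 students, filtered out
}
result = teacher_summary(data)
for name, (avg, low, high) in result.items():
    print(f"{name}: {avg:.2f} ({low:.2f} to {high:.2f})")
```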

New York City

• Margins of error can be very large
• Increasing n doesn't always decrease the
margin of error
• The margin of error in math is typically less
than in reading

Inconsistency among the Colorado
Growth Model and other value-added
approaches.

Los Angeles Unified

• Teachers can easily rate in multiple categories
• The choice of model can have a large impact
• Models affect English more than Math
• Teachers do better in some subjects than
others
• More complex models don't necessarily favor
the teacher

Issues with the Colorado Growth
Model

• When applied to MAP it discards the
advantages of a cross-grade scale and robust
growth norms.
• It is a descriptive and not a causal model.
• As currently applied it does not control for
factors outside the teacher’s influence that
may affect student growth.

A brief commentary on the Colorado Growth
Model

Its limitations

• It does not support inference.
• It does not take advantage of the
useful characteristics of a vertical
scale.
• It uses only prior scores and past
testing history to evaluate growth.

A brief commentary on the Colorado Growth
Model

Other limitations

• The model can’t be used for
cross-state comparisons.
• The model is problematic for
assessing long-term trends.

A finding of effectiveness or ineffectiveness is
more defensible when it is arrived at by:

1. Two or more assessments of different designs.
2. Two or more models of different designs.
3. As many cases as possible.

It is not good to choose tests or models for local
assessment in hopes that they will mimic the
state assessment.

Potential Litigation Issues

The use of value-added data for high-stakes
personnel decisions does not yet have a
strong, coherent body of case law.

Expect litigation if value-added results are the
lynchpin evidence for a teacher-dismissal case
until a body of case law is established.

Instability at the tails of the
distribution

“The findings indicate that these modeling
choices can significantly influence outcomes
for individual teachers, particularly those in
the tails of the performance distribution who
are most likely to be targeted by high-stakes
policies.”

Ballou, D., Mokher, C. and Cavalluzzo, L. (2012) Using Value-Added Assessment for Personnel
Decisions: How Omitted Variables and Model Specification Influence Teachers’ Outcomes.

LA Times Teacher #1
LA Times Teacher #2

Possible racial bias in models

“Significant evidence of bias plagued the value-added model
estimated for the Los Angeles Times in 2010, including significant
patterns of racial disparities in teacher ratings both by the race of
the student served and by the race of the teachers (see
Green, Baker and Oluwole, 2012). These model biases raise the
possibility that Title VII disparate impact claims might also be filed
by teachers dismissed on the basis of their value-added estimates.

Additional analyses of the data, including richer models using
additional variables mitigated substantial portions of the bias in the
LA Times models (Briggs & Domingue, 2010).”

Baker, B. (2012, April 28). If it’s not valid, reliability doesn’t
matter so much! More on VAM-ing & SGP-ing
Teacher Dismissal.

Issues in the use of growth and value-
added measures

Lack of random assignment

The use of a value-added model
assumes that the school doesn’t
add a source of variation that isn’t
controlled for in the model.

e.g. Young teachers are assigned
disproportionate numbers of
students with poor discipline
records.

Measurement Issues

Moving from the model to
the teacher rating

Translating ranked data to ratings -
principles

• There is no “science” per se around translating a
ranking to a rating. If you call a bottom 40% teacher
ineffective that is a judgment.
• The rating process can be politicized.
• The process is easy to over-engineer.

New York Rating System

• 60 points assigned from classroom observation
• 20 points assigned from state assessment
• 20 points assigned from local assessment
• A score of 64 or less is rated ineffective.
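
The point allocation above amounts to a simple sum with a cutoff. The sketch below uses only the numbers on this slide; the function names are hypothetical, and the cut scores for the other rating categories are not given here, so only the "ineffective" check is shown.

```python
INEFFECTIVE_MAX = 64  # "A score of 64 or less is rated ineffective."

def composite_score(observation, state_assessment, local_assessment):
    """Sum the three components after checking each against its maximum."""
    limits = ((observation, 60), (state_assessment, 20), (local_assessment, 20))
    for points, maximum in limits:
        if not 0 <= points <= maximum:
            raise ValueError(f"{points} is outside 0..{maximum}")
    return observation + state_assessment + local_assessment

def is_ineffective(score):
    """Apply the single cutoff stated on the slide."""
    return score <= INEFFECTIVE_MAX

# A perfect observation score with 2 points on each assessment.
print(composite_score(60, 2, 2), is_ineffective(composite_score(60, 2, 2)))
# A weaker observation score with mid-range assessment points.
print(composite_score(45, 10, 10), is_ineffective(composite_score(45, 10, 10)))
```

Under these numbers alone, a teacher with all 60 observation points would still be rated ineffective with 4 or fewer points across the two assessments.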

Rows: observational points (0 to 60). Columns: growth-measure points (0 to 40), in bands labeled
Ineffective, Developing, Effective, and Highly Effective (Growth Measures).
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
1 2 3 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
2 2 4 5 6 6 6 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
Ineffective (Observational)

3 2 5 6 7 7 8 8 9 9 9 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
4 3 5 7 8 9 9 10 10 11 11 11 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15
5 3 6 8 9 10 11 11 12 12 13 13 14 14 14 14 15 15 15 15 16 16 16 16 16 16 16 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18
6 3 6 8 10 11 12 13 13 14 14 15 15 16 16 16 17 17 17 17 18 18 18 18 18 19 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 21
7 3 7 9 11 12 13 14 15 15 16 16 17 17 18 18 18 19 19 19 20 20 20 20 20 21 21 21 21 21 22 22 22 22 22 22 22 23 23 23 23 23
8 3 7 10 11 13 14 15 16 17 17 18 18 19 19 20 20 20 21 21 21 22 22 22 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25 25
9 3 8 10 12 14 15 16 17 18 18 19 20 20 21 21 22 22 23 23 23 24 24 24 24 25 25 25 25 26 26 26 26 26 27 27 27 27 27 27 28 28
10 3 8 11 13 14 16 17 18 19 20 20 21 22 22 23 23 24 24 25 25 25 26 26 26 27 27 27 27 28 28 28 28 29 29 29 29 29 29 30 30 30
11 3 8 11 13 15 17 18 19 20 21 22 22 23 24 24 25 25 26 26 27 27 27 28 28 28 29 29 29 30 30 30 30 31 31 31 31 31 32 32 32 32
12 4 8 12 14 16 17 19 20 21 22 23 24 24 25 26 26 27 27 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 33 33 34 34 34 34
13 4 9 12 14 16 18 20 21 22 23 24 25 26 26 27 28 28 29 29 30 30 31 31 31 32 32 33 33 33 34 34 34 34 35 35 35 35 36 36 36 36
14 4 9 12 15 17 19 20 22 23 24 25 26 27 27 28 29 30 30 31 31 32 32 33 33 33 34 34 35 35 35 36 36 36 37 37 37 37 38 38 38 38
15 4 9 13 15 18 19 21 23 24 25 26 27 28 29 29 30 31 31 32 33 33 34 34 35 35 35 36 36 37 37 37 38 38 38 39 39 39 40 40 40 40
16 4 9 13 16 18 20 22 23 25 26 27 28 29 30 31 31 32 33 33 34 35 35 36 36 37 37 37 38 38 39 39 39 40 40 40 41 41 41 42 42 42
17 4 9 13 16 19 21 23 24 25 27 28 29 30 31 32 33 33 34 35 35 36 37 37 38 38 39 39 39 40 40 41 41 42 42 42 43 43 43 44 44 44
Developing (Observational)

18 4 10 14 17 19 21 23 25 26 28 29 30 31 32 33 34 35 35 36 37 37 38 38 39 40 40 41 41 41 42 42 43 43 44 44 44 45 45 45 46 46
19 4 10 14 17 20 22 24 26 27 28 30 31 32 33 34 35 36 36 37 38 39 39 40 40 41 42 42 43 43 43 44 44 45 45 46 46 46 47 47 47 48
20 4 10 14 17 20 22 24 26 28 29 31 32 33 34 35 36 37 38 38 39 40 41 41 42 42 43 43 44 45 45 45 46 46 47 47 48 48 48 49 49 49
21 4 10 14 18 21 23 25 27 29 30 31 33 34 35 36 37 38 39 40 40 41 42 42 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 50 51 51
22 4 10 15 18 21 23 26 27 29 31 32 34 35 36 37 38 39 40 41 42 42 43 44 44 45 46 46 47 47 48 48 49 49 50 50 51 51 52 52 52 53
23 4 10 15 18 21 24 26 28 30 31 33 34 36 37 38 39 40 41 42 43 43 44 45 46 46 47 48 48 49 49 50 50 51 51 52 52 53 53 54 54 54
24 4 11 15 19 22 24 27 29 31 32 34 35 36 38 39 40 41 42 43 44 45 45 46 47 48 48 49 50 50 51 51 52 52 53 53 54 54 55 55 56 56
25 4 11 15 19 22 25 27 29 31 33 34 36 37 39 40 41 42 43 44 45 46 47 47 48 49 50 50 51 52 52 53 53 54 54 55 55 56 56 57 57 58
26 4 11 16 19 23 25 28 30 32 34 35 37 38 39 41 42 43 44 45 46 47 48 49 49 50 51 51 52 53 53 54 55 55 56 56 57 57 58 58 59 59
27 4 11 16 20 23 26 28 30 32 34 36 37 39 40 42 43 44 45 46 47 48 49 50 50 51 52 53 53 54 55 55 56 57 57 58 58 59 59 60 60 61
28 4 11 16 20 23 26 29 31 33 35 37 38 40 41 42 44 45 46 47 48 49 50 51 52 52 53 54 55 55 56 57 57 58 59 59 60 60 61 61 62 62
29 4 11 16 20 24 26 29 31 34 35 37 39 40 42 43 45 46 47 48 49 50 51 52 53 54 54 55 56 57 57 58 59 59 60 61 61 62 62 63 63 64
30 4 11 16 20 24 27 30 32 34 36 38 40 41 43 44 45 47 48 49 50 51 52 53 54 55 56 56 57 58 59 59 60 61 61 62 62 63 64 64 65 65
31 4 11 17 21 24 27 30 32 35 37 39 40 42 43 45 46 47 49 50 51 52 53 54 55 56 57 57 58 59 60 61 61 62 63 63 64 64 65 66 66 67
32 4 11 17 21 25 28 30 33 35 37 39 41 43 44 46 47 48 50 51 52 53 54 55 56 57 58 59 59 60 61 62 62 63 64 64 65 66 66 67 68 68
33 4 12 17 21 25 28 31 33 36 38 40 42 43 45 46 48 49 50 52 53 54 55 56 57 58 59 60 61 61 62 63 64 64 65 66 66 67 68 68 69 69
Effective (Observational)

34 4 12 17 21 25 28 31 34 36 38 40 42 44 46 47 49 50 51 53 54 55 56 57 58 59 60 61 62 63 63 64 65 66 66 67 68 68 69 70 70 71
35 4 12 17 22 25 29 32 34 37 39 41 43 45 46 48 49 51 52 53 55 56 57 58 59 60 61 62 63 64 64 65 66 67 68 68 69 70 70 71 72 72
36 4 12 17 22 26 29 32 35 37 39 41 43 45 47 49 50 52 53 54 55 57 58 59 60 61 62 63 64 65 66 66 67 68 69 69 70 71 72 72 73 74
37 4 12 17 22 26 29 32 35 38 40 42 44 46 48 49 51 52 54 55 56 58 59 60 61 62 63 64 65 66 67 68 68 69 70 71 71 72 73 74 74 75
38 4 12 18 22 26 30 33 36 38 40 43 45 46 48 50 52 53 55 56 57 58 60 61 62 63 64 65 66 67 68 69 69 70 71 72 73 73 74 75 75 76
39 4 12 18 22 26 30 33 36 39 41 43 45 47 49 51 52 54 55 57 58 59 61 62 63 64 65 66 67 68 69 70 71 71 72 73 74 75 75 76 77 77
40 4 12 18 23 27 30 33 36 39 41 44 46 48 50 51 53 55 56 57 59 60 61 63 64 65 66 67 68 69 70 71 72 73 73 74 75 76 77 77 78 79
41 4 12 18 23 27 31 34 37 39 42 44 46 48 50 52 54 55 57 58 60 61 62 63 65 66 67 68 69 70 71 72 73 74 75 75 76 77 78 78 79 80
42 5 12 18 23 27 31 34 37 40 42 45 47 49 51 53 54 56 58 59 60 62 63 64 66 67 68 69 70 71 72 73 74 75 76 76 77 78 79 80 80 81
43 5 12 18 23 27 31 34 37 40 43 45 47 49 51 53 55 57 58 60 61 63 64 65 66 68 69 70 71 72 73 74 75 76 77 78 78 79 80 81 82 82
44 5 12 18 23 28 31 35 38 41 43 46 48 50 52 54 56 57 59 60 62 63 65 66 67 69 70 71 72 73 74 75 76 77 78 79 80 80 81 82 83 84
45 5 13 19 24 28 32 35 38 41 44 46 48 51 53 54 56 58 60 61 63 64 66 67 68 69 71 72 73 74 75 76 77 78 79 80 81 82 82 83 84 85
46 5 13 19 24 28 32 35 39 41 44 47 49 51 53 55 57 59 60 62 63 65 66 68 69 70 71 73 74 75 76 77 78 79 80 81 82 83 83 84 85 86
Highly Effective (Observational)

47 5 13 19 24 28 32 36 39 42 45 47 49 52 54 56 58 59 61 63 64 66 67 69 70 71 72 74 75 76 77 78 79 80 81 82 83 84 85 85 86 87
48 5 13 19 24 29 32 36 39 42 45 47 50 52 54 56 58 60 62 63 65 66 68 69 71 72 73 74 76 77 78 79 80 81 82 83 84 85 86 87 87 88
49 5 13 19 24 29 33 36 40 43 45 48 50 53 55 57 59 61 62 64 66 67 69 70 71 73 74 75 77 78 79 80 81 82 83 84 85 86 87 88 89 89
50 5 13 19 24 29 33 37 40 43 46 48 51 53 55 57 59 61 63 65 66 68 69 71 72 74 75 76 77 79 80 81 82 83 84 85 86 87 88 89 90 90
51 5 13 19 25 29 33 37 40 43 46 49 51 54 56 58 60 62 64 65 67 69 70 72 73 74 76 77 78 79 81 82 83 84 85 86 87 88 89 90 91 92
52 5 13 19 25 29 33 37 41 44 47 49 52 54 56 58 61 62 64 66 68 69 71 72 74 75 77 78 79 80 82 83 84 85 86 87 88 89 90 91 92 93
53 5 13 19 25 30 34 37 41 44 47 50 52 55 57 59 61 63 65 67 68 70 72 73 75 76 77 79 80 81 82 84 85 86 87 88 89 90 91 92 93 94
54 5 13 20 25 30 34 38 41 44 47 50 53 55 57 60 62 64 66 67 69 71 72 74 75 77 78 80 81 82 83 85 86 87 88 89 90 91 92 93 94 95
55 5 13 20 25 30 34 38 41 45 48 50 53 56 58 60 62 64 66 68 70 71 73 75 76 78 79 80 82 83 84 85 87 88 89 90 91 92 93 94 95 96
56 5 13 20 25 30 34 38 42 45 48 51 54 56 58 61 63 65 67 69 70 72 74 75 77 78 80 81 82 84 85 86 87 89 90 91 92 93 94 95 96 97
57 5 13 20 25 30 35 38 42 45 48 51 54 56 59 61 63 65 67 69 71 73 74 76 78 79 81 82 83 85 86 87 88 90 91 92 93 94 95 96 97 98
58 5 13 20 26 30 35 39 42 46 49 52 54 57 59 62 64 66 68 70 72 73 75 77 78 80 81 83 84 85 87 88 89 90 92 93 94 95 96 97 98 99
59 5 13 20 26 31 35 39 43 46 49 52 55 57 60 62 64 66 68 70 72 74 76 77 79 81 82 83 85 86 88 89 90 91 92 94 95 96 97 98 99 100
60 5 13 20 26 31 35 39 43 46 49 52 55 58 60 63 65 67 69 71 73 75 76 78 80 81 83 84 86 87 88 90 91 92 93 95 96 97 98 99 100 101

Cheating

Atlanta Public Schools
Crescendo Charter Schools
Philadelphia Public Schools
Washington DC Public Schools
Houston Independent School
District
Michigan Public Schools

Unintended Consequences?

• Many principals and teachers (including good ones)
will seek schools or teaching assignments that they
think will improve their results.
• Principals and teachers may game the system,
inadvertently or intentionally.
• Many teachers will seek opportunities to avoid
grades with standardized tests.
• Ranking metrics can discourage cooperation among
principals and teachers, so finding ways to reward
teamwork and cooperation is important.

Case Study #1 - Mean value-added performance in mathematics by
school – fall to spring

(Chart: vertical axis from -8.00 to 6.00.)

Case Study #1 - Mean spring and fall test duration in minutes by
school

(Chart: vertical axis in minutes, from 0.00 to 90.00; series: Spring term, Fall term.)

Case Study #1 - Mean value-added growth by school and test
duration

(Chart: vertical axis from -10.00 to 8.00; series: Students taking 10+ minutes longer spring than fall, All other students.)
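
The comparison in this chart can be sketched as a simple split of student records by the change in test duration. The records below are hypothetical, the function name is illustrative, and the 10-minute threshold is taken from the chart legend.

```python
from statistics import mean

def growth_by_duration_group(records, threshold_minutes=10):
    """Split (fall_minutes, spring_minutes, growth) records into students who
    took at least `threshold_minutes` longer in spring and all other students,
    then return the mean growth of each group."""
    longer, others = [], []
    for fall_minutes, spring_minutes, growth in records:
        if spring_minutes - fall_minutes >= threshold_minutes:
            longer.append(growth)
        else:
            others.append(growth)
    return (
        mean(longer) if longer else None,
        mean(others) if others else None,
    )

# Hypothetical student records: (fall minutes, spring minutes, growth).
records = [
    (30, 45, 6.0),   # 15 minutes longer in spring
    (40, 52, 4.0),   # 12 minutes longer in spring
    (35, 36, 1.0),   # about the same
    (50, 45, -1.0),  # shorter in spring
]
longer_mean, others_mean = growth_by_duration_group(records)
print("10+ minutes longer in spring:", longer_mean)
print("all other students:", others_mean)
```

A large gap between the two groups would suggest that measured growth partly reflects testing conditions rather than instruction.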

Case Study # 2

Mathematics: differences in fall-spring test durations
(Chart: segments of 15%, 25%, and 60% across Spring < Fall, Spring = Fall, Spring > Fall.)

Mathematics: differences in growth index score based on fall-spring test durations
(Chart: Growth Index axis from 0.0 to 6.0, by Spring < Fall, Spring = Fall, Spring > Fall.)

Case Study # 2

How much of summer loss is really summer loss?

Differences in spring-fall test durations
(Chart: segments of 25%, 42%, and 33% across Fall < Spring, Fall = Spring, Fall > Spring.)

Differences in raw growth by spring-fall test duration
(Chart: axis from -5.0 to 0.0, by Fall < Spring, Fall = Spring, Fall > Spring.)

Case Study # 2

Differences in fall-spring test duration (yellow-black) and
Differences in growth index scores (green) by school

(Chart: Minutes axis from 0 to 200; Growth Index axis from 0.0 to 10.0; horizontal axis is School; series: Growth Index, Fall test duration, Spring test duration.)

Negotiated goals – Student Learning
Objectives

• Negotiated goals (SLOs) are likely to be
necessary in some subjects.
• It is difficult to set fair and reasonable goals
for improvement absent norms or context.
• It is likely that some goals will be absurdly high
and others way too low.

An alternate approach

• Give primacy to evaluator observation for judging teachers.
• Focus mandatory observations on low performers.
• Use assessments and value-added measurement to validate
observations.
• Require reassessment when observations and assessment
data are in significant misalignment.
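
The last bullet could be put into practice with a simple rule. The sketch below is one possible reading, not a procedure from the presentation: it assumes both the observation rating and the value-added result have been converted to percentile ranks (0 to 100), and the 40-point gap that triggers reassessment is a hypothetical threshold.

```python
def needs_reassessment(observation_pct, value_added_pct, max_gap=40):
    """Flag a teacher when the two percentile ranks disagree by more than
    `max_gap` points (a hypothetical threshold)."""
    for pct in (observation_pct, value_added_pct):
        if not 0 <= pct <= 100:
            raise ValueError(f"{pct} is not a percentile rank")
    return abs(observation_pct - value_added_pct) > max_gap

# Strong observations but weak value-added results: flagged.
print(needs_reassessment(85, 20))
# Both measures tell roughly the same story: not flagged.
print(needs_reassessment(60, 45))
```

The rule treats the test data as a check on the observation rather than as an independent verdict, which matches the primacy given to evaluator observation above.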

Possible legal issues

• Title VII of the Civil Rights Act of 1964 –
Disparate impact of sanctions on a protected
group.
• State statutes that provide tenure and other
related protections to teachers.
• Challenges to a finding of “incompetence”
stemming from the growth or value-added
data.

Recommendations

• Embrace the formative advantages of growth
measurement as well as the summative.
• Create comprehensive evaluation systems with
multiple measures of teacher effectiveness (RAND,
2010).
• Select measures as carefully as value-added models.
• Use multiple years of student achievement data.
• Understand the issues and the tradeoffs.

Thank you for attending this event

