The Well-EQUIPped Classroom: Using the Electronic Quality of Inquiry Protocol to evaluate effects of science inquiry professional development on PreK-3 classroom practice
1. Presented by
Gale A. Mentzer, PhD
T. Ryan Duckett, MA
Research and Evaluation, LLC
1811 N. Reynolds Road, Suite 204
Toledo, OH 43615
“Accuracy of observation is the equivalent of accuracy of thinking.” – Wallace Stevens
American Evaluation Association
National Conference
October 29, 2016
Atlanta, GA
2.
3.
4. • Time Usage
• Instruction
• Discourse
• Assessment
• Curriculum
Marshall, J. C., Horton, B., & White, C. (2009). EQUIPping teachers: A protocol to guide and improve inquiry-based instruction. The Science Teacher, 76(4), 46-53.
8. Rasch at a Glance
P{xpi = 1 | Bp, Di} = exp(Bp - Di) / [1 + exp(Bp - Di)]
9. [Wright map schematic: a vertical logit scale from -3.0 to 3.0. Person measures (level of ability) run from lower to higher levels of the trait; item measures (items that target the latent trait) run from easy to endorse to difficult to endorse.]
10.
11. Rasch Process in Four Steps
Step One: Are the items functioning properly?
Step Two: Are the teachers participating reasonably?
Step Three: Are you measuring a single latent trait?
Step Four: How do we use what we know to understand?
13. Item Fit (Reliability = .95)
Item (harder to affirm at top, easier to affirm at bottom) | Difficulty | Fit
Teacher provides depth of content that connects big picture | 2.46 | 0.63
Lesson allowed for student designed investigation | 1.21 | 0.93
Lesson integrated content and student investigation | -0.91 | 0.30
Student organized information in effective ways to communicate their learning | -1.26 | 1.58
Overall assessment of curriculum factor | -1.50 | 1.07
14. Rating Scale
Example of a poorly functioning rating scale
Rating scale for Spring 2015 Curriculum factor
As we all know, matching measures to intended outcomes is the foundation of sound evaluation practice, and yet adequately achieving this crucial step has been elusive at times, particularly when evaluating teacher PD designed to change teaching practice. And, of course, instrument or measurement reliability and validity are necessary precursors to establishing construct validity, interpreting analyses, and making legitimate causal inferences.
A measure must first provide consistent results over time and accurately reflect the construct of concern before inferences drawn from the data can be considered valid. In addition to developing measures that match the intended characteristics or traits, the interpretation or use of the data must also match intentions. Some researchers have claimed, for example, that measuring student achievement to make evaluation inferences about teacher quality is not necessarily an appropriate use of the data, particularly when error rates are not taken into consideration. Additionally, jumping directly to student outcomes without first verifying teacher implementation outcomes weakens cause-and-effect conclusions.
While surveying teachers about the value of PD has been used as a measure of participants’ perception of its usefulness, it is a low-level indicator of effective PD according to Guskey. Assessments may indicate the level of learning achieved as a result of PD, but neither method verifies that teachers actually incorporate newly learned strategies into instruction. If the goal of the evaluation is to determine the extent to which teaching practice changes, then artifacts of teaching (lesson plans, student assignments), interviews with teachers, and observations must be included. And while artifacts and interviews might indicate how a teacher implements new ideas and concepts, they may not indicate how well those new concepts and ideas were implemented. Therefore, an observation using a high-quality performance assessment could provide the best indicator of implementation.
However, the creation and validation of a performance assessment that includes adequate critical indicators can be a lengthy process, and an evaluator may not have the time or resources to develop a high-quality, reliable tool with evidence of validity. In such a case, using an existing, validated tool may be the best solution. But finding an instrument that matches the evaluator’s intent may not be possible, in which case validating the instrument for the new use may provide the solution. The purpose of this study is to show how an instrument designed to provide teachers with formative feedback on inquiry-based instruction can be used to evaluate the extent to which a teacher actually employs inquiry teaching strategies.
The EQUIP, developed by Marshall, Horton, & White, was designed to provide formative feedback to teachers regarding their implementation of inquiry-based instruction. It is based on NGSS and has 5 factors: [listed on the screen]. The creators recommend the tool be used by the teacher (using reflection or a video recording), by a colleague, or by an instructional coach.
Each factor has several constructs, and each construct is measured on a four-point scale: Pre-Inquiry, Developing Inquiry, Proficient Inquiry, Exemplary Inquiry. As you can see, the rubric is quite detailed regarding the hallmarks of each level, and they vary depending upon the construct. Once the constructs have been rated, a summary or overall factor score is assigned using the same four-point scale. This score is not necessarily the “average” of the construct scores because it is up to the observer to weight various constructs depending upon the intent of the lesson.
The tool also includes a time usage section in which the lesson is broken into five-minute segments and scored using the same four-point inquiry proficiency scale, while also noting the lesson structure and student attention levels.
We used the EQUIP to measure the quality of inquiry-based instruction and examined it pre/post a Summer Institute professional development to determine whether the PD actually improved inquiry-based instruction. We did not use the time usage portion of the instrument. Once our team established acceptable inter-rater reliability using previously recorded lessons, we conducted the observations in real time. Now, some prefer video because it allows one to review; however, I prefer real time because I believe video often hides elements of the experience due to the limited focus of the camera. Our team took detailed notes during the lesson and then completed the scoring of the EQUIP rubric immediately after the lesson (not during it), after reviewing their notes. Because the data are ordinal, and because the use of the instrument for evaluation had not been validated, we decided to use the Rasch measurement model (RMM) to more carefully examine how the instrument worked for our purposes.
As Dr. Mentzer said, we used the EQUIP tool in an attempt to measure the level of inquiry-based instruction the teacher demonstrated during a lesson. This “level of inquiry” is a quality of the teacher’s instruction, not a quantity. The 1, 2, 3, and 4 we assign are ordinal labels, not true quantities; they simply represent this “level of inquiry.” So while we can note whether a teacher shows less (pre-inquiry) or more (exemplary) of this trait, we cannot, from the raw ratings alone, determine the distance between categories. And, therefore, we cannot simply sum up the scores for each factor and get a meaningful average.
So how does the Rasch method determine these interval values? Here’s a peek at the mathematical model underlying the Rasch method. Without getting too far astray, it basically says that the probability of an individual getting the best response on an item (or construct, in our example) depends on the difference between that person’s ability and the difficulty of that item.
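To make that concrete, here is a minimal sketch in Python of the probability the dichotomous Rasch model assigns; the ability and difficulty values are hypothetical and only illustrate how the formula behaves.

```python
import math

def rasch_probability(person_ability, item_difficulty):
    """Dichotomous Rasch model: P(x = 1) = exp(B - D) / (1 + exp(B - D)),
    where B is the person's ability and D is the item's difficulty."""
    logit = person_ability - item_difficulty
    return math.exp(logit) / (1 + math.exp(logit))

# When ability equals difficulty, the probability is exactly 0.5;
# a 1-logit advantage raises it to about 0.73.
print(rasch_probability(0.0, 0.0))   # 0.5
print(rasch_probability(1.0, 0.0))   # ~0.731
print(rasch_probability(-1.0, 1.0))  # ~0.119 (item much harder than person)
```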
But the Rasch method goes much further than that and calculates how individuals perform on items to produce a meaningful comparison of participants and items. [Brief explanation of person and item measures; provide math question examples] [Transcend sample dependence; thereby increase reliability]
Through this log-odds calculation, Rasch assigns each person an ability measure and each item a difficulty measure. This is the Rasch person-item map for the S15 Curriculum factor. C1 = Content depth; C2 = Learner centrality; C3 = Integration of content and design; C4 = Organizing and recording information. So our analysis mapped all of the participants for each cycle (62 in S15, 119 in F15, and 99 in S16) in relation to each of the contributing constructs for the four factors mentioned earlier (6 constructs for Inquiry, 6 for Assessment, 6 for Discourse, and 5 for Curriculum, giving us 23 items overall).
Alright, great. We get a nice representation of person ability and item difficulty. But how do we know it is accurate and valid? I would like to take a few minutes now to show you how the Rasch measurement model helped us, as evaluators, obtain an authentic understanding of the level of inquiry-based instruction. Completing the Rasch analysis allowed for a meaningful and reliable assessment of the participants’ “scores” over the three collection cycles, which we could then use to evaluate the overall impact of the NURTURES professional development program. [Briefly mention the roadmap: first items, then people, then unidimensionality, then putting it together]
The first step in ensuring the EQUIP instrument was working properly was to make sure the people using the instrument (us, the evaluators scoring the teachers) were speaking a common language. It was vital that the four evaluators had a common definition of each construct for each of the four contributing factors; i.e., each of us knew what was meant by “learner centrality” for the Curriculum factor and, further, had a solid framework for assigning a level within those constructs. So, as Dr. Mentzer mentioned, we began by establishing inter-rater reliability. Rasch then provides fit statistics that show how consistently the items performed and therefore enhance reliability.
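As an illustration of the kind of agreement check we mean, here is a small sketch comparing two observers on the four-point scale; the ratings are hypothetical, and the specific agreement statistic we used is not detailed here, so exact agreement and weighted kappa are shown only as common options.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two observers scoring the same lessons on the
# four-point EQUIP scale (1 = Pre-Inquiry ... 4 = Exemplary Inquiry).
rater_a = np.array([2, 3, 3, 1, 4, 2, 3, 2])
rater_b = np.array([2, 3, 2, 1, 4, 2, 3, 3])

exact_agreement = np.mean(rater_a == rater_b)   # share of identical ratings
weighted_kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"exact agreement = {exact_agreement:.2f}")
print(f"quadratic-weighted kappa = {weighted_kappa:.2f}")
```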
Sticking with the output from the S15 Curriculum factor, we see the difficulty of each item expressed in logits [explain]. Next to that we see how well the items fit the model. [Discuss the acceptable fit range of 0.6-1.4; reasons why there might be underfit or overfit; explain the reliability.]
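For readers who want to see what those fit numbers summarize, here is an illustrative sketch of how infit and outfit mean-square statistics are typically computed from observed ratings, model-expected ratings, and model variances; all of the values below are hypothetical.

```python
import numpy as np

def item_fit(observed, expected, variance):
    """Infit and outfit mean-square statistics for one item.
    observed: ratings given; expected: model-expected ratings;
    variance: model variance of each rating. Values near 1.0 indicate
    good fit; 0.6-1.4 is a common rule-of-thumb range."""
    residual = observed - expected
    z = residual / np.sqrt(variance)                   # standardized residuals
    outfit = np.mean(z ** 2)                           # outlier-sensitive
    infit = np.sum(residual ** 2) / np.sum(variance)   # information-weighted
    return infit, outfit

# Hypothetical values for one construct scored across five teachers.
obs = np.array([3.0, 2.0, 4.0, 1.0, 3.0])
exp = np.array([2.6, 2.1, 3.4, 1.5, 2.9])
var = np.array([0.7, 0.8, 0.6, 0.5, 0.7])
print(item_fit(obs, exp, var))
```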
First is an example of a rating scale that is functioning extremely poorly. The participants cannot meaningfully differentiate between the different levels, hence the overlap. Here is how the rating scale functioned for the S15 curriculum, a thing of beauty. Each level has a range where it is the most probable response. [Briefly explain the x-y fields, etc.]
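The curves behind a plot like this come from the Rasch-Andrich rating scale model; the sketch below, with hypothetical step calibrations, shows how each category's probability is computed so that every level has a region along the person-item difference axis where it is the most probable response.

```python
import math

def category_probabilities(b_minus_d, step_calibrations):
    """Rasch-Andrich rating scale model: probability of each category
    0..K for a given person-item difference and step calibrations
    F_1..F_K (hypothetical values used below)."""
    numerators = [1.0]            # category 0: empty sum, exp(0) = 1
    running = 0.0
    for f in step_calibrations:
        running += b_minus_d - f  # cumulative sum of (B - D - F_j)
        numerators.append(math.exp(running))
    total = sum(numerators)
    return [n / total for n in numerators]

# Four categories (like the EQUIP's 1-4 scale) need three step calibrations.
steps = [-1.5, 0.0, 1.5]
for diff in (-2.0, 0.0, 2.0):
    probs = category_probabilities(diff, steps)
    print(diff, [round(p, 2) for p in probs])
# Each category should be the most probable response somewhere along
# the person-item difference axis (the x-axis of the plotted curves).
```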
The step calibrations fall just beyond the 1.4 – 5.0 logit range, but are very consistent [they have seen 30 logit step scale]. This points to a place for increased rater reliability and a need to review the rating scale to see how it is functioning [possibly collapse categories].
Once we have a grasp on how the rating scale and items are functioning, we perform similar investigations on the participants. Fit statistics here show us whether participants behaved appropriately and took the items seriously. [Describe the measure increase pre and post Summer Institute; what the negative measure score means in the first instance in relation to the items; outstanding reliability, i.e., results would hold with any group of similar levels. Further, these statistics confidently predict that people of similar abilities in future iterations will perform similarly on the observation, thus ensuring repeatability and reliability of the results.]
Strata = (4G + 1) / 3, where G is the separation index.
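For reference, here is a small sketch of how separation and strata follow from a Rasch reliability estimate; the .95 value echoes the item reliability reported earlier and is used here only as an example.

```python
import math

def separation_and_strata(reliability):
    """Separation index G = sqrt(R / (1 - R)) and Wright's strata
    (4G + 1) / 3: the number of statistically distinct levels the
    instrument can resolve."""
    g = math.sqrt(reliability / (1 - reliability))
    return g, (4 * g + 1) / 3

# With a reliability of .95 (the item reliability reported earlier),
# G is about 4.4 and strata about 6.1.
print(separation_and_strata(0.95))
```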
A ruler can only measure one dimension at a time (length or width, etc.). We need to make sure our instrument is measuring only ability in inquiry-based instruction. We want at least 60% of the variance to be accounted for by the person and item measures. This would mean that our items are tapping into a single cohesive trait.
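A common complementary check (not necessarily the one performed here) is a principal component analysis of the standardized residuals: once the Rasch measures are removed, no sizable secondary dimension should remain. A hypothetical sketch:

```python
import numpy as np

def residual_pca_eigenvalues(observed, expected, variance):
    """Eigenvalues of the correlation matrix of standardized Rasch
    residuals (persons x items). A large first eigenvalue (roughly
    more than 2 items' worth of variance) hints at a secondary
    dimension; small ones support unidimensionality."""
    z = (observed - expected) / np.sqrt(variance)   # standardized residuals
    corr = np.corrcoef(z, rowvar=False)             # item-by-item correlations
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Hypothetical matrix: 8 teachers by 4 curriculum constructs.
rng = np.random.default_rng(0)
obs = rng.integers(1, 5, size=(8, 4)).astype(float)
exp = np.full((8, 4), 2.5)
var = np.full((8, 4), 0.8)
print(residual_pca_eigenvalues(obs, exp, var))
```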
[Discuss the items that correlate.]
Given that the EQUIP tool accurately captured the target of the NURTURES program (the quality of inquiry science instruction), it remained to be seen whether the NURTURES summer institute and intervention had any statistically significant impact on that quality. Recall that the Spring 2015 observation cycle was a pre-intervention observation period: to establish a baseline comparison, teachers were observed before they received any instruction or assistance from the NURTURES program.
First, a dependent t-test was conducted to test the null hypothesis that there was no statistically significant change in teachers’ EQUIP observation scores after the NURTURES intervention. Cumulative inquiry-based instruction scores were recorded using the EQUIP and then converted to normalized logits using the Rasch model for the entire cohort for each respective semester (Spring 2015 N=62, Fall 2015 N=119). Of those individuals, 59 participated in both sessions.
Table X reports the results of the dependent t-test. The test revealed a significant difference between Spring 2015 scores (M=-.2214, SD=2.62317) and Fall 2015 scores (M=1.7078, SD=2.32189); t(58)=-4.884, p=.0001. Thus, there is a statistically significant relationship between participation in the NURTURES summer institute and cumulative inquiry-based instruction scores, with participants performing 1.93 logits better on average after having completed the summer institute.
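For illustration, a paired-samples test of this kind can be run as below; the data are simulated to roughly echo the reported means and standard deviations and are not the study data.

```python
import numpy as np
from scipy import stats

# Simulated paired logit scores for 59 teachers observed in both the
# Spring 2015 (pre-institute) and Fall 2015 (post-institute) cycles;
# the means/SDs only roughly echo those reported above.
rng = np.random.default_rng(1)
spring_2015 = rng.normal(-0.22, 2.6, size=59)
fall_2015 = spring_2015 + rng.normal(1.93, 2.0, size=59)

# Dependent (paired-samples) t-test on the Rasch person measures.
t_stat, p_value = stats.ttest_rel(fall_2015, spring_2015)
print(f"t({len(spring_2015) - 1}) = {t_stat:.3f}, p = {p_value:.4f}")
print(f"mean gain = {np.mean(fall_2015 - spring_2015):.2f} logits")
```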
Next, to better evaluate the sustained impact of the NURTURES program, a repeated measures ANOVA was conducted (DV: teacher EQUIP score; IV: session – S15, F15, S16). The same steps were taken as in the dependent t-test to obtain a normalized score for cumulative science inquiry-based instruction. This test determines whether scores significantly changed for the group of 40 teachers who participated in all three sessions (n=40, N=280). The research question was stated as: is there a statistically significant change in teacher EQUIP scores before and after participation in the summer instructional institute? H0: μ1 = μ2 = μ3; H1: at least two means are significantly different.
Since the dependent t-test already led us to reject the null hypothesis, this repeated measures ANOVA allows us to determine whether the impact of the NURTURES program was sustained, decreased, or increased at a third observation. Thus, this one-way repeated measures ANOVA was conducted to evaluate the null hypothesis that there is no change in participants’ EQUIP scores when measured before and in two subsequent observations after participation in the summer instructional institute (n=40).
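A one-way repeated measures ANOVA of this design can be sketched as below, using simulated long-format data rather than the study data; note that this routine reports the univariate F test rather than the multivariate Wilks' Lambda statistic reported next.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated long-format data for 40 teachers observed in all three cycles;
# 'score' stands in for the Rasch-scaled EQUIP measure.
rng = np.random.default_rng(2)
rows = []
for teacher in range(40):
    baseline = rng.normal(0.0, 1.5)
    for session, shift in (("S15", -0.2), ("F15", 1.7), ("S16", 1.8)):
        rows.append({"teacher": teacher, "session": session,
                     "score": baseline + rng.normal(shift, 1.0)})
data = pd.DataFrame(rows)

# One-way repeated measures ANOVA: do mean scores differ across sessions?
result = AnovaRM(data, depvar="score", subject="teacher",
                 within=["session"]).fit()
print(result)
```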
The results, shown in Table X, indicated a significant session effect, Wilks’ Lambda = .583, F(2, 38) = 13.579, p < .0001, with an effect size of 42%. Mauchly’s test was not statistically significant, indicating that the assumption of sphericity had been met, χ2(2) = 3.18, p = .204 (Table X). Therefore, we reject the null hypothesis and note that at least two means are significantly different.
Additional post hoc tests were conducted to determine which means differed. Table X shows the estimated means for each observation session (session 1 = Spring 2015, etc.). The pairwise comparisons between sessions bear out what the estimated means appear to convey; namely, the differences in scores between the first and second sessions and between the first and third sessions are statistically significant. Table X shows mean differences of -9.175 and -9.200 for these comparisons, indicating that the pre-intervention scores were lower by over 9 points on average. The table also shows that the relatively small difference in means between Fall 2015 and Spring 2016 is not statistically significant (p = 1.00). This plateauing of scores suggests that the intervention had a lasting effect that did not degrade over time.
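The post hoc comparisons described here amount to pairwise dependent t-tests with an adjustment for multiple comparisons; the exact adjustment used is not specified above, so the sketch below shows Bonferroni as one common choice, run on simulated scores rather than the study data.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def bonferroni_pairwise(scores_by_session):
    """Pairwise dependent t-tests between sessions with a Bonferroni
    adjustment. Input: dict mapping session name to an array of per-teacher
    scores, with the same teacher order in every session."""
    pairs = list(combinations(scores_by_session, 2))
    for a, b in pairs:
        t, p = stats.ttest_rel(scores_by_session[a], scores_by_session[b])
        p_adj = min(p * len(pairs), 1.0)   # Bonferroni-adjusted p value
        print(f"{a} vs {b}: t = {t:.2f}, adjusted p = {p_adj:.3f}")

# Simulated scores for the 40 teachers seen in all three sessions.
rng = np.random.default_rng(3)
base = rng.normal(0.0, 1.5, 40)
bonferroni_pairwise({
    "S15": base + rng.normal(-0.2, 1.0, 40),
    "F15": base + rng.normal(1.7, 1.0, 40),
    "S16": base + rng.normal(1.8, 1.0, 40),
})
```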