Demonstrating the consequences of not taking into account sampling designs with TIMSS 2011 data

Presentation for the fourth meeting of the EARLI SIG 18 Educational Effectiveness.
Abstract: Comparative international large-scale assessments (LSAs) have always attracted a great deal of attention from policy makers and educational researchers, as well as criticism. One criticism concerns the fact that the complex sampling design of LSAs is not always taken into account. This paper aims to demonstrate the consequences of not taking into account the sampling design of one such assessment, TIMSS 2011. Three features (weights, proficiency estimation with plausible values, and variance estimation with the jackknife) are examined in single-level (students) and multilevel (students and schools) cases. The results show that the consequences can be significant, but they are not completely in line with the previous literature.

Published in: Education
  1. Demonstrating the consequences of not taking into account sampling designs with TIMSS 2011 data
     Dr. Christian Bokhove, Lecturer in Mathematics Education, University of Southampton
     EARLI SIG, August 28th 2014
  2. OUTLINE
     • International studies
       • IEA & OECD
       • PISA, TIMSS, …
     • Some aspects of their sampling design
       • Two-stage sampling
       • Weights
       • Rotated test design
     • What if you don’t take this into account?
     • Simulation with TIMSS 2011 data
       • Single-level model
       • Multilevel models
  4. IEA & OECD
     The International Association for the Evaluation of Educational Achievement (IEA) is an independent, international cooperative of national research institutions and governmental research agencies. It conducts large-scale comparative studies of educational achievement and other aspects of education.
     The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.
  5. PISA
     http://www.oecd.org/pisa/
     “The Programme for International Student Assessment (PISA) is a triennial international survey which aims to evaluate education systems worldwide by testing the skills and knowledge of 15-year-old students. To date, students representing more than 70 economies have participated in the assessment.”
     • The most recent results appeared in December 2013, based on the 2012 assessment
  6. TIMSS
     http://timssandpirls.bc.edu/timss2011/
     “TIMSS 2011 is the fifth in IEA’s series of international assessments of student achievement dedicated to improving teaching and learning in mathematics and science. First conducted in 1995, TIMSS reports every four years on the achievement of fourth and eighth grade students.”
  8. Two-stage sampling in educational studies
     ● Simple random sampling of students is rarely used in educational surveys:
       – Too expensive (e.g., training test administrators and travel costs), because the selected students would attend many different schools
       – It is not practical to contact that many schools
       – A link with class, teacher and school variables is sought anyway
     ● Sampling is therefore usually conducted in two stages
       – First stage: schools are selected
       – Second stage: students (PISA) or classes (TIMSS/PIRLS) are selected within each sampled school
         ● 35 students selected randomly (PISA)
         ● One or two intact classes (TIMSS/PIRLS)
  9. Replicate weights
     ● Replicate weights, a resampling technique, are used to calculate correct standard errors under two-stage sampling designs
     ● The idea behind them:
       – There are many possible samples of schools, and not all of them yield the same estimates
       – Use different samples of schools to calculate estimates
       – Take into account the error of selecting one school and not another (sampling error)
     ● Each replicate weight represents one such sample
     ● The variability between the replicate estimates reflects the sampling error
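The replicate-weight idea can be sketched in a few lines. This is an illustrative Python sketch with made-up toy data, not the official TIMSS procedure; it assumes the TIMSS-style rule that the sampling variance is the sum of squared deviations of the replicate estimates from the full-sample estimate.

```python
import numpy as np

def jackknife_se(y, full_wgt, rep_wgts):
    """Jackknife SE of a weighted mean from replicate weights.

    y        : (n,) achievement scores
    full_wgt : (n,) full sampling weights
    rep_wgts : (n, R) matrix, one column of replicate weights per zone
    """
    full_est = np.average(y, weights=full_wgt)
    rep_ests = np.array([np.average(y, weights=rep_wgts[:, r])
                         for r in range(rep_wgts.shape[1])])
    # TIMSS-style rule: sampling variance is the sum of squared
    # deviations of the replicate estimates from the full estimate
    return full_est, np.sqrt(np.sum((rep_ests - full_est) ** 2))

# Toy example: four students, two zones; in each replicate one school of
# the pair is dropped (weight 0) and its partner's weight is doubled
y = np.array([1.0, 2.0, 3.0, 4.0])
rep = np.column_stack([[0.0, 2.0, 1.0, 1.0],
                       [1.0, 1.0, 0.0, 2.0]])
est, se = jackknife_se(y, np.ones(4), rep)   # est = 2.5, se ≈ 0.354
```

In real TIMSS data the replicate weight columns are already provided in the public files; only the loop over them and the variance formula are needed.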
  10. Two replication methods
      ● Jackknife
        – Used by TIMSS and PIRLS
        – Schools are paired with other similar schools within zones
        – A replicate is created for each zone (pair of schools)
        – One school is randomly removed within each zone and the weight of the other school is doubled
      ● Balanced repeated replication (BRR)
        – Select one school at random within each stratum
        – Set its weight to 0 and double the weight of the other school
        – PISA uses a variant of BRR (Fay’s method) so that no school receives a zero weight, avoiding the effective halving of the sample in each replicate
      Source: OECD (2009). PISA data analysis manual: SPSS (2nd edition). Paris: OECD Publishing.
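The weight adjustment behind both methods can be shown in one hypothetical helper. This sketch assumes one pair of schools per pseudo-stratum and a flag marking the school selected for down-weighting; with Fay's factor k = 0.5, weights are multiplied by 0.5 and 1.5 instead of 0 and 2.

```python
import numpy as np

def fay_replicate_weights(base_wgt, dropped, k=0.5):
    """One set of BRR replicate weights with Fay's adjustment.

    base_wgt : (n,) full sampling weights
    dropped  : (n,) boolean, True where the unit's school is the one
               selected for down-weighting in its pseudo-stratum
    k        : Fay factor; k=0 gives plain BRR (weights 0 and 2),
               while PISA's k=0.5 keeps every unit in each replicate
    """
    return np.where(dropped, base_wgt * k, base_wgt * (2 - k))

# Plain BRR zeroes out half the sample; Fay's variant keeps everyone in
w = fay_replicate_weights(np.ones(4), np.array([True, False, True, False]))
# → [0.5, 1.5, 0.5, 1.5]
```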
  12. Weights
      • In theory the sampling design provides student samples with equal selection probabilities.
      • But variation in the number of classes selected and differential patterns of nonresponse can result in varying selection probabilities, requiring a unique sampling weight for the students in each participating class in the study.
      • Total weight (TOTWGT)
        • Sums to the student population size in each country
        • The overall student sampling weight is the product of the final weight components for schools, classes, and students
      • Important in multilevel analyses:
        • School level: final school weight
        • Student level: final student weight multiplied by the final class weight
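The two decompositions above amount to simple products; a minimal sketch (the function and argument names are illustrative, not the official TIMSS column names):

```python
def total_student_weight(school_w, class_w, student_w):
    """Overall student weight (the role TOTWGT plays in TIMSS):
    the product of the final school, class and student components."""
    return school_w * class_w * student_w

def multilevel_weights(school_w, class_w, student_w):
    """Weights for a two-level (school/student) model as on this slide:
    the final school weight at level 2, and the final student weight
    times the final class weight at level 1."""
    return school_w, student_w * class_w

# e.g. school weight 20, class weight 2, student weight 1.5
assert total_student_weight(20.0, 2.0, 1.5) == 60.0
assert multilevel_weights(20.0, 2.0, 1.5) == (20.0, 3.0)
```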
  14. Rotated test design
      ● The item pool should include a large number of items for domain validity (e.g., mathematical literacy)
      ● At the same time:
        – Fatigue biases the results of long tests
        – Schools refuse to participate in lengthy studies
      ● Rotated test forms
        – Each student is assigned a subset of the item pool
        – This minimises testing time
  15. Plausible values
      ● Rotated booklets introduce challenges for estimating academic achievement
        – Each student has missing data on a number of items
      ● Plausible-values methods are employed to obtain population estimates under rotated booklet designs
      ● Students do not answer all items, but plausible scores are produced as if they had responded to all items, based on
        – Responses to the test items they did receive
        – Background characteristics
  16. Plausible values
      ● Plausible values are random draws from the distribution of a student's ability
        – Instead of a single point estimate, a range of values is estimated for each student
      ● A single score cannot be calculated because data are missing for a number of items
      ● Plausible values account for imputation error
        – The error of making inferences about ability from a small number of items
      ● Estimation should be conducted separately for each plausible value
        – Typically five plausible values are used
        – The variability between the five estimates reflects the imputation error
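Combining the per-plausible-value analyses follows Rubin's (1987) rules for multiple imputation; a minimal sketch with made-up numbers:

```python
import numpy as np

def combine_plausible_values(estimates, sampling_vars):
    """Combine M per-plausible-value analyses with Rubin's (1987) rules.

    estimates     : M point estimates, one per plausible value
    sampling_vars : M sampling variances (squared standard errors)
    """
    M = len(estimates)
    est = np.mean(estimates)              # combined point estimate
    within = np.mean(sampling_vars)       # average sampling variance
    between = np.var(estimates, ddof=1)   # variance between the M estimates
    total_var = within + (1 + 1 / M) * between   # add imputation variance
    return est, np.sqrt(total_var)

# Made-up numbers: five per-PV means, each with an SE of 2.0
ests = [500.0, 502.0, 504.0, 498.0, 496.0]
est, se = combine_plausible_values(ests, [2.0 ** 2] * 5)
# est = 500.0; total variance = 4 + 1.2 * 10 = 16, so se = 4.0
```

Note how the combined SE (4.0) exceeds the per-PV SEs (2.0): the between-PV spread adds the imputation error that PV1 and PVA ignore.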
  17. Challenge
      ● Ignoring the complex design leads to wrong conclusions, such as different point estimates and/or underestimated standard errors; see Rutkowski et al. (2010)
        – Variance estimation: jackknife, BRR
        – Not taking weights into account (e.g. Rutkowski et al. (2010): in the Bulgarian TIMSS 2007 sample, students from vocational and profiled schools had a higher probability of selection). In a multilevel situation: choosing the wrong composite weights.
        – Treatment of plausible values: averaging the (five) plausible values, or using only one plausible value, instead of applying Rubin’s rules.
      ● Drent et al. (2013) formulated quality criteria (low, satisfactory, high)
      ● Standard software cannot handle replicate weights and plausible values
  18. Available software
      ● IDB Analyzer (SPSS)
      ● NAEP Data Explorer (web tool)
      ● PISA SPSS macros
      ● R package ‘intsvy’ (Daniel Caro, Oxford)
        – Free and open source
        – Does not rely on commercial software like SPSS or SAS
        – Can be extended to perform other analyses
  19. Available software: multilevel software
      ● R
        – Has multilevel packages, but without survey weights
        – Can link to MLwiN
      ● MLwiN
        – Plausible values have to be combined manually
        – No resampling
        – Does handle weights
      ● HLM
        – Combines plausible values
        – Handles weights
        – No resampling
  21. Simulation with TIMSS 2011 data
      • TIMSS 2011
      • Three aspects: jackknife, weights, plausible values
      • Five countries: England is chosen as the baseline, using the grade 8 TIMSS 2011 ranking. One country significantly above England in the rankings (Singapore) and one significantly below (Norway) are chosen, as well as the countries one place higher and one place lower than England (United States and Hungary, respectively).
  22. Simulation with TIMSS 2011 data
      • Data preparation:
        • The publicly available TIMSS 2011 grade 8 data files are used.
        • Additional columns are calculated: the average of the five plausible values and the different weighting columns.
      • Two experiments: A. single-level analyses, and B. multilevel analyses with students nested in schools.
      • For experiment A the open-source R package intsvy (Caro, 2014) is used.
      • Experiment B looks at multilevel models by constructing null models in HLM 6.08 for the five countries, with student and school levels.
  23. Single level
      Different scenarios:
      • Two conditions concern variance estimation with the jackknife (JK): the jackknife is either applied or not applied.
      • Two conditions concern weights (Wgt): weights are either applied or not applied.
      • Three conditions concern the treatment of the plausible values for the maths achievement scores:
        • PVR denotes the correct approach, ‘plausible values with Rubin’s rules’.
        • PVA denotes the ‘mean of the plausible values’.
        • PV1 uses only ‘the first plausible value’.
      A total of 2 × 2 × 3 = 12 cases are calculated, as shown in the table on the next slide. Case 1 replicates the values from the international report (Mullis, Martin, Foy, & Arora, 2012).
  24. Maths achievement scores and standard errors for five countries under twelve cases combining weights, jackknife and plausible values (# = rank among the five countries).

      PV1        Case 9 (JK, Wgt)    Case 10 (no JK, Wgt)  Case 11 (JK, no Wgt)  Case 12 (no JK, no Wgt)
      Country    Score   SE    #     Score   SE    #       Score   SE    #       Score   SE    #
      Singapore  609.71  3.68  1     609.71  1.08  1       606.22  3.63  1       606.22  1.08  1
      USA        508.75  2.58  2     508.75  0.75  2       508.92  2.52  4       508.92  0.74  4
      England    506.03  5.45  3     506.03  1.36  3       509.44  5.59  3       509.44  1.37  3
      Hungary    504.75  3.44  4     504.75  1.22  4       513.38  2.96  2       513.38  1.16  2
      Norway     475.24  2.38  5     475.24  1.03  5       477.04  2.62  5       477.04  1.03  5

      PVA        Case 5 (JK, Wgt)    Case 6 (no JK, Wgt)   Case 7 (JK, no Wgt)   Case 8 (no JK, no Wgt)
      Country    Score   SE    #     Score   SE    #       Score   SE    #       Score   SE    #
      Singapore  610.99  3.73  1     610.99  1.06  1       607.54  3.68  1       607.54  1.06  1
      USA        509.48  2.59  2     509.48  0.73  2       509.68  2.53  4       509.68  0.72  4
      England    506.76  5.48  3     506.76  1.34  3       509.99  5.64  3       509.99  1.35  3
      Hungary    504.81  3.48  4     504.81  1.21  4       513.47  2.98  2       513.47  1.15  2
      Norway     474.64  2.37  5     474.64  0.99  5       476.55  2.64  5       476.55  1.00  5

      PVR        Case 1 (JK, Wgt)    Case 2 (no JK, Wgt)   Case 3 (JK, no Wgt)   Case 4 (no JK, no Wgt)
      Country    Score   SE    #     Score   SE    #       Score   SE    #       Score   SE    #
      Singapore  610.99  3.77  1     610.99  0.83  1       607.54  3.74  1       607.54  0.87  1
      USA        509.48  2.63  2     509.48  0.55  2       509.68  2.58  4       509.68  0.57  4
      England    506.76  5.53  3     506.76  0.89  3       509.99  5.63  3       509.99  0.70  3
      Hungary    504.81  3.48  4     504.81  0.47  4       513.47  2.98  2       513.47  0.40  2
      Norway     474.64  2.44  5     474.64  0.55  5       476.55  2.66  5       476.55  0.50  5
  25. Observations
      Differences in achievement results and standard errors:
      • Not taking the jackknife into account (example in yellow):
        • Average scores are the same.
        • Standard errors are underestimated.
        • So: relative rankings are unchanged, but significance testing is affected.
      • Not taking weights into account (example in orange):
        • Influences achievement scores: the USA, England, Hungary and Norway score higher, and Singapore scores lower.
        • This affects the relative rankings.
        • Standard errors differ, some higher and some lower.
      • Plausible values (example in green):
        • PVA and PVR give the same achievement scores; PV1 differs.
        • PVA and PV1 underestimate the standard error.
        • But there is no clear pattern between PVA and PV1 (which contradicts previous literature).
  26. Multilevel
      HLM was used; it does not support the jackknife.
      • Note that with MLwiN you would need to combine the plausible values manually.
      • Three conditions concern weights: no weights, weights only at the student level (see Willms & Smith, 2005), and final weights (Rutkowski et al., 2010).
      • Three conditions for the maths achievement scores concern the plausible values: PVR denotes the correct approach, ‘plausible values with Rubin’s rules’; PVA denotes the ‘mean of the plausible values’; PV1 uses only ‘the first plausible value’.
      • The 3 × 3 scenarios are reported in table 3.
  27. Maths achievement scores and standard errors of five countries for multilevel null models in three different weighting scenarios (S1, S4 and S6) and three plausible-value treatments.
  28. Observations
      Differences in achievement results and standard errors:
      • The different weighting methods greatly influence achievement scores and standard errors. This also has an impact on the relative rankings. There does not seem to be a pattern of over- or underestimation of scores and standard errors.
      • For the plausible values, the PV1 cases yield a different average than PVA and PVR, lower in three cases, the exceptions being Hungary and Norway. For PVA and PV1, the standard error is underestimated relative to PVR. However, the underestimation of the SEs differs only slightly between PVA and PV1, with PVA in most cases being closer to, or just as close to, PVR as PV1.
      • Per country:
        Singapore      PV1  PVA  PVR
        United States  PVA  PVR  PV1
        England        PV1  PVA  PVR
        Hungary        PV1  PVA  PVR
        Norway         PVA  PV1  PVR
  29. Final thoughts
      • Not taking into account these three features of the complex sample designs of LSAs can have a large influence on achievement scores, standard errors and rankings.
      • This confirms the findings of Rutkowski et al. (2010).
      • Not all ‘rules of thumb’ from the previous literature (Drent et al., 2013; Rutkowski et al., 2010) seem to hold.
      • Therefore, caution should always be exercised when analysing LSA data, hopefully improving future LSA analyses by educational researchers.
      • Need for a transparent methodology.
      THANK YOU
      QUESTIONS/DISCUSSION
      C.Bokhove@soton.ac.uk
  30. Relevant references
      Beaton, A.E., & Gonzalez, E.J. (1995). NAEP primer. Center for the Study of Testing, Evaluation and Educational Policy, Boston College. Chestnut Hill, MA.
      Caro, D. (2014). intsvy: International Assessment Data Manager. R package version 1.3. http://CRAN.R-project.org/package=intsvy
      Drent, M., Meelissen, M.R.M., & van der Kleij, F.M. (2013). The contribution of TIMSS to the link between school and classroom factors and student achievement. Journal of Curriculum Studies, 45(2), 198-224.
      Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education, 11(3), 319-330.
      Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210-231.
      Martin, M.O., & Mullis, I.V.S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS 2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
      Mullis, I.V.S., Martin, M.O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. Lynch School of Education, Boston College.
      Rubin, D. (1987). Multiple imputation for nonresponse in sample surveys. New York: John Wiley.
      Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151.
      von Davier, M., Gonzalez, E., & Mislevy, R.J. (2009). Plausible values: What are they and why do we need them? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9-36.
      Willms, J.D., & Smith, T. (2005). A manual for conducting analyses with data from TIMSS and PISA. Report prepared for the UNESCO Institute for Statistics.
