This document summarizes a presentation on effect sizes and meta-analysis in education research. It introduces the presenters and explains how the session came about. It then uses an analogy of comparing oil levels in cars to explain problems with using effect sizes to claim one intervention is more effective than another. Key issues discussed are that effect sizes depend on many factors like sample homogeneity and measures used, and don't necessarily indicate relative effectiveness. The presentation argues many meta-analyses in education fail to meet assumptions needed to use effect sizes in this way. It warns against being "razzle-dazzled" into harming students.
researchED Durham 2018
1. Effect sizes and meta-analysis
Adrian Simpson and Gary Jones
24 November, 2018
researchED Durham
2. Introductions
Adrian Simpson
• Professor, School of Education, Durham University
• Research Interests
• Assessment in Higher Education
• Mathematics Education
• Psychology of Reasoning
Gary Jones
• Former senior manager in further education
• Blogger, speaker and author
• www.garyrjones.com
• Evidence-Based School Leadership and Management: A practical guide.
6. Sarah’s Garage
• Comparing your oil today to your oil last week
• Comparing your oil to your neighbour’s oil
• Comparing the average oil in some sports cars with the average oil in hatchbacks
7. The Dipstick Test – pairwise version
Using relative size on one measure (a dipstick) to stand for relative size on another measure (oil volume) requires everything else which impacts on the measures to be equal.
E.g. engine dimensions/shape; oil temperature; dipstick angle; dipstick penetration etc.
8. The dipstick test – group version
Using relative average size on one measure (a dipstick) to stand for relative average size on another measure (oil volume) requires everything else which impacts on the measures to be equally distributed.
E.g. engine dimensions/shape; oil temperature; dipstick angle; oil agitation; dipstick penetration etc.
9. Sarah’s Garage
• Comparing your oil today to your oil last week: likely to be OK, but worth checking temperature, penetration and agitation
• Comparing your oil to your neighbour’s oil: very risky; need to check everything
• Comparing the average oil in some sports cars with the average oil in hatchbacks: absurdly unlikely; need a very convincing check
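To make the analogy concrete, here is a small sketch of my own (the shapes and numbers are illustrative assumptions, not from the slides): the same dipstick reading implies very different oil volumes once the sump geometry changes, which is exactly why depth can stand for volume only when everything else is equal.

```python
import math

def oil_volume(depth_cm, sump):
    """Volume (cm^3) implied by a dipstick depth, for two idealised sump shapes."""
    if sump == "box":   # rectangular sump with a 30 cm x 20 cm base
        return 30 * 20 * depth_cm
    if sump == "cone":  # inverted cone with a 45-degree half-angle (radius equals depth)
        return math.pi * depth_cm ** 3 / 3
    raise ValueError(f"unknown sump shape: {sump}")

# The same 5 cm dipstick reading gives wildly different volumes:
print(oil_volume(5, "box"))             # 3000 cm^3
print(round(oil_volume(5, "cone"), 1))  # 130.9 cm^3
```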
11. Evidence based education
• Larger effect size stands for more effective intervention
• For single studies (IES & EEF Projects)
• For meta-analysis (Schneider & Preckel)
• For meta-synthesis (EEF Toolkit & Hattie)
12. What factors contribute to effect size?
Effect size depends on:
• the choice of measure
• the choice of comparison treatment
• the choice of sample homogeneity
• the choice of intervention treatment
d is sometimes converted to the horribly misleading ‘months’ progress’.
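The effect size the slides discuss is the standardised mean difference, Cohen’s d: the difference in group means divided by the pooled standard deviation. A minimal sketch (the data are invented for illustration):

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference: (mean_a - mean_b) / pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled SD from the two sample variances (n-1 denominators)
    pooled_sd = (((n_a - 1) * statistics.variance(group_a) +
                  (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

treatment = [52, 53, 54, 55, 56]
control = [50, 51, 52, 53, 54]
print(round(cohens_d(treatment, control), 2))  # 1.26
```

Because the numerator depends on the comparison treatment and the measure, while the denominator depends on sample homogeneity, the same intervention can yield very different values of d under different designs.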
13. The US What Works Clearinghouse from the Institute for Education Sciences uses “the comparability of effect size estimates across studies … to establish the criterion for substantively important effects for intervention rating purposes” (IES, 2017, E-2), with “An effect size of 0.25 standard deviations or larger … considered to be substantively important” (p.22)
The Expert Mathematician: d = 0.35
Accelerated Math: d = 0
14. Does “The Expert Mathematician is more effective than Accelerated Math” pass the dipstick test?
15. The Dipstick Test – pairwise version
The Expert Mathematician
• Intervention: 196 LOGO-based ‘generative maths’ lessons of 40-120 minutes
• d = 0.35
• Measure: 78-item MCQ including 61 concept & application items
• Sample: Grade 8 suburban middle school pupils
• Comparison: ‘Transition Mathematics’ textbook, 1-3 pages of written explanation followed by 30 questions for each lesson
Accelerated Math
• Intervention: 15-20 minutes each day on maths problems (from Accelerated Math)
• d = 0
• Measure: Delaware Student Testing Program (50 MCQ, 16 short answer, 12 extended)
• Sample: Grade 6 suburban middle school pupils
• Comparison: 15-20 minutes each day on maths problems (from Delaware Math)
16. Schneider & Preckel (2017) Variables Associated With Achievement in Higher Education: A Systematic Review of Meta-Analyses
Small group learning: average d = 0.51
Intelligent tutoring systems: average d = 0.35
17. Does “small group learning is more effective than intelligent tutoring systems” pass the dipstick test?
18. The dipstick test – group version
Group A: Small group learning
• Interventions: small group learning (sometimes for all teaching; sometimes replacing an alternative; sometimes extra)
• d = 0.51
• Measures: 80% teacher-set exam; 20% standardised
• Samples: fairly homogeneous (students on the same university course)
• Comparisons: sometimes individualized; sometimes whole group
Group B: Intelligent tutoring systems
• Interventions: intelligent tutoring systems (sometimes main teaching; sometimes adjunct; wide variety of systems)
• d = 0.35
• Measures: 100% teacher-set
• Samples: fairly homogeneous (students on the same university course)
• Comparisons: sometimes human tutoring; sometimes computer; sometimes lecturing; sometimes not being taught the topic at all
19. What is effect size really?
• A measure of the clarity of the study
• It depends on the whole study design, not just the intervention
• Researchers can (and do) choose to increase it by:
• more passive control treatments
• more homogeneous samples
• more treatment-inherent measures
• But different fields allow different freedoms:
• a passive/zero control may be impossible or unethical in some situations
• some interventions only make sense on wide samples
• some fields tend to use standardized tests
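One of those freedoms, sample homogeneity, can be shown directly: a homogeneous sample shrinks the pooled standard deviation, so the same raw difference produces a much larger d. A sketch with invented numbers:

```python
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference: (mean_a - mean_b) / pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = (((n_a - 1) * statistics.variance(group_a) +
                  (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Both comparisons have the same 3-point raw difference in means.
narrow_t, narrow_c = [61, 62, 63, 64, 65], [58, 59, 60, 61, 62]  # homogeneous samples
wide_t, wide_c = [50, 56, 63, 70, 76], [47, 53, 60, 67, 73]      # heterogeneous samples
print(round(cohens_d(narrow_t, narrow_c), 2))  # 1.9
print(round(cohens_d(wide_t, wide_c), 2))      # 0.29
```

The intervention’s raw effect is identical in both cases; only the spread of the sample changed.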
20. Do EBE researchers know the Dipstick Test?
“The key assumption is that the research represented in the meta-synthesis is sufficiently evenly distributed by type, population and outcome that any differences which emerge represent differences in the educational themes, rather than differences in the research methods, measurements and target populations.”
(Higgins, 2018, p49)
21. Berk
Berk’s criticism of common meta-analytic practice: “… the mismatch between a meta-analysis model and anything real … [is] ... The requisite assumptions are listed but not defended. A list of assumptions by itself apparently inoculates the meta-analysis against modelling errors”
(Berk, 2007, p264)
“Statistical assumptions are empirical commitments”
(Berk and Freedman, 2003, p235)
22. The dipstick test for Meta-synthesis
• Is it likely that researchers in feedback, phonics, homework, uniform, behaviour, class size etc. use the same distribution of
• comparison treatments
• sample ranges
• measures?
• Why would the world of research be organized to make this so?
23. Passing the dipstick test
• When is relative effect size a measure of relative effectiveness?
• For individual studies: when the comparison treatment, measure and sample range are the same
• For groups of studies: when comparison treatments, measures and sample ranges are distributed in the same way
That is: for current education research, NEVER
24. The final word
“statistical malpractice disguised as statistical razzle-dazzle”
(Berk, 2011, p199)
Let’s not be razzle-dazzled into harming the education of our pupils
25. Further reading
• Simpson, A. (2017) ‘The misdirection of public policy: comparing and combining standardised effect sizes’, Journal of Education Policy, 32(4), pp. 450-466.
• Simpson, A. (2018) ‘Princesses are bigger than elephants: Effect size as a category error in evidence-based education’, British Educational Research Journal, 44(5), pp. 897-913.
• Jones, G. (2018) Evidence-Based School Leadership and Management: A practical guide, SAGE, London.
27. For more information
Adrian Simpson
• adrian.simpson@durham.ac.uk
Gary Jones
• www.garyrjones.com
• @DrGaryJones
• jones.gary@gmail.com
Editor's Notes
3.07 in the clip
Sarah measures the oil in her car. It is at a lower level than it was last week and she takes this as evidence that the amount of oil has reduced.
That seems clear – but there is a sophisticated idea here: a one dimensional measure (depth) standing for a three dimensional one (volume)
Comparison: active/passive/harmful
Measure: closely tied to the mechanism; more general (also noise e.g. multiple choice test vs open answer)
Range: very similar participants vs very different
It tells us how clear the difference is, not how important the intervention is
One thing I find amusing about this is that it comes about 50 pages into a 250-page book. Effectively it is saying that everything hereafter relies on this being true. Oh, and by the way, it isn’t – so you should probably stop reading here.