Presented by Sam Clifford at the 2018 UseR conference, Brisbane, Australia. The talk describes the design of SEB113 - Quantitative Methods in Science, a first year statistics/mathematics unit in the Bachelor of Science at Queensland University of Technology. The unit uses RStudio and the tidyverse packages to give students the skills to do meaningful data manipulation and analysis without relying on prior knowledge of advanced mathematics.
Classes without Dependencies - UseR 2018
1. Classes without dependencies
Teaching the tidyverse to first year science students
Sam Clifford, Iwona Czaplinski, Brett Fyfield, Sama Low-Choy, Belinda Spratt, Amy Stringer, Nicholas Tierney
2018-07-12
2. The student body's got a bad preparation
SEB113 a core unit in QUT's 2013 redesign of Bachelor of Science
Introduce key math/stats concepts needed for first year science
OP 13 cutoff (ATAR 65)
Assumed knowledge: Intermediate Mathematics
Some calculus and statistics
Not formally required
Diagnostic test and weekly prep material
Basis for further study in disciplines (explicit or embedded)
Still needs to be a self-contained unit that teaches skills
3. What they need is adult education
Engaging students with use of maths/stats in science
Build good statistical habits from the start
Have students doing analysis
that is relevant to their needs
as quickly as possible
competently
with skills that can be built on
Introduction to programming
reproducibility
separating analysis from the raw data
flexibility beyond menus
correcting mistakes becomes easier
4. You go back to school
Bad old days
Manual calculation of test statistics
Reliance on statistical tables
Don't want to replicate senior high school study
Reduce reliance on point and click software that only does everything students need right now (Excel, Minitab)
Students don't need to become R developers
Focus on functionality rather than directly controlling every element, e.g. LaTeX vs Word
5. It's a bad situation
Initial course development was not tidy
New B Sc course brought forward
Grab bag of topics at request of science academics
Difficult to find tutors who could think outside "traditional" stat. ed.
very low student satisfaction initially
Rapid and radical redesign required
tidyverse an integrated suite focused on transforming data frames
Vectorisation > loops
RStudio > JGR > Rgui.exe
6. What you want is an adult education (Oh yeah!)
Compassion and support for learners
Problem- and model-based
Technology should support learning goals
Go further, quicker by not focussing on mechanical calculations
Workflow based on functions rather than element manipulation
Statistics is an integral part of science
Statistics isn't about generating p values
see Cobb in Wasserstein and Lazar [2016]
7. Machines do the work so people have time to think - IBM (1967)
All models are wrong, but some are useful - Box (1987)
8. Now here we go dropping science, dropping it all over
Within context of scientific method:
Aims
Methods and Materials
1. Get data/model into an analysis environment
2. Data munging
Results
3. Exploration of data/model
4. Compute model
5. Model diagnostics
Conclusion
6. Interpret meaning of results
9. I said you wanna be startin' somethin'
Redesign around ggplot2
ggplot2 introduced us to tidy data requirements
Redesign based on Year 11 summer camp
This approach not covered by textbooks at the time
Tried using JGR and Plot Builder for one semester
Extension to wider tidyverse
Replace unrelated packages/functions with unified approach
Focus on what you want rather than directly coding how to do it
Good effort-reward with limited expertise
10. Summer(ise) loving, had me a blast; summer(ise) loving, happened so fast
R is a giant calculator that can operate on objects
ggplot() requires a data frame object
dplyr::summarise() to summarise a column variable
dplyr::group_by() to do summary according to specified structure
Copy-paste or looping not guaranteed to be MECE
Group-level summary stats leads to potential statistical models
Easier, less error prone, than repeated usage of =AVERAGE()
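A minimal sketch of the grouped summary this slide describes, using the built-in mtcars data. The output column names (mean_mpg, sd_mpg, n) are illustrative choices, not names taken from the unit's materials.

```r
library(dplyr)

# Summarise one column variable over the whole data frame
overall <- summarise(mtcars, mean_mpg = mean(mpg))

# Summarise the same variable within each group defined by cyl
grouped <- group_by(mtcars, cyl)
by_cyl <- summarise(grouped,
                    mean_mpg = mean(mpg),
                    sd_mpg   = sd(mpg),
                    n        = n())
by_cyl
```

Because group_by() partitions the rows, the group sizes in n add up to the 32 rows of mtcars, which is the MECE property that copy-paste or hand-written loops do not guarantee.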
11. We want the funk(tional programming paradigm)
Tidy data as observations of variables with structure [Wickham, 2014b]
R as functional programming [Wickham, 2014a]
Actions on entire objects to do things to data and return useful information
Students enter understanding functions like $y(x) = x^2$
function takes input
function returns output
e.g. $\mathrm{mean}(x) = \sum_i x_i / n$
Week 4: writing functions to solve calculus problems
magrittr::%>% too conceptually similar to ggplot2::+ for novices to grasp in first course
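The slides do not show the Week 4 exercises, so the following is only a guess at what "functions take input and return output" might look like in R. The names y, my_mean and slope, and the step-size argument h, are hypothetical.

```r
# A function takes input and returns output, like y(x) = x^2
y <- function(x) {
  x^2
}

# mean(x) = sum_i x_i / n, written by hand
my_mean <- function(x) {
  sum(x) / length(x)
}

# A calculus-style problem: approximate dy/dx with a central difference
# (h is a hypothetical step size chosen for illustration)
slope <- function(f, x, h = 1e-6) {
  (f(x + h) - f(x - h)) / (2 * h)
}

y(3)                 # 9
my_mean(c(2, 4, 9))  # 5
slope(y, 3)          # approximately 6, since dy/dx = 2x
```

Passing y as an argument to slope() is a small instance of the functional style: functions act on whole objects, including other functions.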
12. Like Frankie sang, I did it my way
What's the mean gas mileage for each engine geometry and transmission type for the 32 cars listed in the 1974 Motor Trend magazine?
Loops: for each of the pre-computed number of groups, subset, summarise and store how you want
tapply(): INDEX a list of k vectors, 1 summary FUNction, returns k-dimensional array
dplyr: specify grouping variables and which summary statistics, returns tidy data frame ready for model/plot
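The three approaches can be sketched side by side for the question on this slide. This is an illustration on the built-in mtcars data, not the code used in the unit. The object names groups, mpg_array and mpg_tidy are made up here.

```r
library(dplyr)

# 1. Loop: pre-compute the groups, then subset, summarise and store
groups <- unique(mtcars[, c("vs", "am")])
groups$mean_mpg <- NA_real_
for (i in seq_len(nrow(groups))) {
  rows <- mtcars$vs == groups$vs[i] & mtcars$am == groups$am[i]
  groups$mean_mpg[i] <- mean(mtcars$mpg[rows])
}

# 2. tapply(): INDEX is a list of two vectors, so the result is a 2-d array
mpg_array <- tapply(mtcars$mpg,
                    INDEX = list(vs = mtcars$vs, am = mtcars$am),
                    FUN = mean)

# 3. dplyr: name the grouping variables and the summary statistic
mpg_tidy <- summarise(group_by(mtcars, vs, am), mean_mpg = mean(mpg))
```

All three hold the same four means. Only mpg_tidy is already a data frame with one row per group, which is the shape ggplot() and lm() expect.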
13. Night of the living baseheads
Like all procedural languages, plot() has one giant list of arguments
Focus is on how plot is drawn rather than what you want to plot
Inefficiency of keystrokes
re-stating the things being plotted
setting up plot axis limits
loop counters for small multiples, etc.
14. Toot toot, chugga chugga, big red car
Say we want to plot cars' fuel efficiency against weight
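library(tidyverse)
data(mtcars)
# Convert to metric units and label the two binary variables
mtcars <- mutate(
  mtcars,
  l100km = 235.2146 / mpg,  # miles per US gallon to litres per 100 km
  wt_T = wt / 2.2046,       # weight in 1000 lb to tonnes
  am = factor(am, levels = c(0, 1), labels = c("Auto", "Manual")),
  vs = factor(vs, levels = c(0, 1), labels = c("V", "S")))
# Base graphics: pass the x and y vectors directly
plot(y = mtcars$l100km, x = mtcars$wt_T)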
[Figure: base R scatter plot of mtcars$l100km (y axis) against mtcars$wt_T (x axis)]
Fairly quick to say what goes on x and y axes
More arguments → better graph
xlim, ylim
xlab, ylab
main
type, pch
What if we want to see how it varies with
engine geometry
transmission type
16. One, two, princes kneel before you
Both approaches do the same thing
Idea | base | ggplot2
Plot variables | Specify vectors | Coordinate system defined by variables
Small multiples | Loops, subsets, par | facet_grid
Common axes | Pre-computed | Inherited from data
V/S A/M annotation | Strings | Inherited from data
Axis labels | Per axis set | For whole plot
Focus on putting things on the page vs representing variables
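To make the comparison concrete, here is one possible pair of implementations of the small-multiples plot. The slides do not include this code. The data frame name mtcars2, the helper draw_base() and the axis label text are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Same derived variables as the mtcars example in the talk
mtcars2 <- mutate(
  mtcars,
  l100km = 235.2146 / mpg,
  wt_T = wt / 2.2046,
  am = factor(am, levels = c(0, 1), labels = c("Auto", "Manual")),
  vs = factor(vs, levels = c(0, 1), labels = c("V", "S")))

# base: loop over subsets, pre-compute common axes, build titles from strings
draw_base <- function(d) {
  old <- par(mfrow = c(2, 2))
  on.exit(par(old))            # restore the previous layout afterwards
  xl <- range(d$wt_T)
  yl <- range(d$l100km)
  for (v in levels(d$vs)) {
    for (a in levels(d$am)) {
      sub <- d[d$vs == v & d$am == a, ]
      plot(x = sub$wt_T, y = sub$l100km, xlim = xl, ylim = yl,
           xlab = "Weight (t)", ylab = "Fuel use (L/100 km)",
           main = paste(v, a))
    }
  }
}

# ggplot2: variables define the coordinates; panels, common axes and
# strip labels are all inherited from the data
p <- ggplot(mtcars2, aes(x = wt_T, y = l100km)) +
  geom_point() +
  facet_grid(vs ~ am) +
  labs(x = "Weight (t)", y = "Fuel use (L/100 km)")

draw_base(mtcars2)
p
```

Each row of the table shows up in the code: the base version states axis limits, labels and titles inside the loop, while the ggplot2 version states them once for the whole plot.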
17. I got a grammar Hazel and a grammar Tilly
Plots are built from [Wickham, 2010]
data: which variables are mapped to aesthetic elements
geometry: how do we draw the data?
annotations: what is the context of these shapes?
Build more complex plots by adding commands and layering elements,
rather than by stacking individual points and lines e.g.
make a scatter plot, THEN
add a trend line (with inherited x, y), THEN
facet by grouping variable, THEN
change axis information
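The four steps above can be sketched as successive additions to one plot object. The choice of variables (wt, mpg, cyl from the built-in mtcars data), the linear trend and the label text are assumptions for illustration.

```r
library(ggplot2)

# Make a scatter plot: data and aesthetic mappings, then a geometry
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# THEN add a trend line, which inherits x and y from ggplot()
p2 <- p1 + geom_smooth(method = "lm")

# THEN facet by a grouping variable
p3 <- p2 + facet_wrap(~ cyl)

# THEN change axis information
p4 <- p3 + labs(x = "Weight (1000 lb)",
                y = "Fuel economy (miles per US gallon)")
p4
```

Printing p1, p2 and p3 along the way shows what each added command contributes to the plot.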
18. When I'm good, I'm very good; but when I'm bad, I'm better
Want to make good plots as soon as possible
Learning about Tufte's principles [Tufte, 1983, Pantoliano, 2012]
Discuss what makes a plot good and bad
Seeing how ggplot2 code translates into graphical elements
Week 2 workshop has students making best and worst plots for a data set
19. Sie ist ein Model und sie sieht gut aus
Make use of broom package to get model summaries
Get data frames rather than summary.lm() text vomit
tidy()
parameter estimates
CIs
t test info [Greenland et al., 2016]
glance()
everything else
ggplot2::fortify()
regression diagnostic info instead of plot.lm()
stat_qq(aes(sample=.stdresid)) for residual quantiles
geom_point(aes(x=.fitted, y=.resid)) for fitted vs residuals
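A minimal sketch of this workflow for a simple linear model. The model mpg ~ wt is chosen only for illustration. One substitution to note: the sketch uses broom::augment() in place of ggplot2::fortify(), since the ggplot2 documentation steers users to broom for model objects. augment() names the standardised residual column .std.resid rather than .stdresid.

```r
library(broom)
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Data frames instead of printed summary.lm() output
est   <- tidy(fit, conf.int = TRUE)  # estimates, std errors, t statistics, p values, CIs
stats <- glance(fit)                 # one row of model-level summaries
diag  <- augment(fit)                # observations plus .fitted, .resid, .std.resid

# Normal quantile plot of standardised residuals
qq <- ggplot(diag, aes(sample = .std.resid)) +
  stat_qq()

# Fitted values against residuals
fr <- ggplot(diag, aes(x = .fitted, y = .resid)) +
  geom_point()
```

Because est, stats and diag are ordinary data frames, students can plot and summarise them with the same ggplot2 and dplyr tools used for the raw data.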
20. When you hear some feedback keep going take it higher
Positives
More confidence and students see use of maths/stats in science
Students enjoy group discussions in workshops
Some students continue using R over Excel in future units
Labs can be done online in own time
Negatives
Request for more face to face help rather than online
Labs can be done online in own time (but are they?)
Downloading of slides rather than attending/watching lectures
21. Things can only get better
Focus on what you want from R rather than how you do it
representing variables graphically
summarising over structure in data
tidiers for models
Statistics embedded in scientific theory [Diggle and Chetwynd, 2011]
Problem-based learning
groups of novices
supervised by tutors
discussion of various approaches
22. References
Peter J. Diggle and Amanda G. Chetwynd. Statistics and Scientific Method: An Introduction for Students and Researchers. Oxford University Press, 2011.
Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4):337-350, Apr 2016. URL https://doi.org/10.1007/s10654-016-0149-3.
Mike Pantoliano. Data visualization principles: Lessons from Tufte, 2012. URL https://moz.com/blog/data-visualization-principles-lessons-from-tufte.
Edward Tufte. The Visual Display of Quantitative Information. Graphics Press, 1983.
Ronald L. Wasserstein and Nicole A. Lazar. The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2):129-133, Apr 2016. URL https://doi.org/10.1080/00031305.2016.1154108.
23. References (continued)
Hadley Wickham. Advanced R. Chapman & Hall/CRC The R Series. Taylor & Francis, 2014a. ISBN 9781466586963. URL https://books.google.com.au/books?id=PFHFNAEACAAJ.
Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1):3-28, 2010. doi: 10.1198/jcgs.2009.07098.
Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10):1-23, 2014b. ISSN 1548-7660. URL https://www.jstatsoft.org/index.php/jss/article/view/v059i10.