Classes without Dependencies - UseR 2018


Presented by Sam Clifford at the 2018 UseR conference, Brisbane, Australia. The talk describes the design of SEB113 - Quantitative Methods in Science, a first year statistics/mathematics unit in the Bachelor of Science at Queensland University of Technology. The unit uses RStudio and the tidyverse packages to give students the skills to do meaningful data manipulation and analysis without relying on prior knowledge of advanced mathematics.

Published in: Education

  1. Classes without dependencies
     Teaching the tidyverse to first year science students
     Sam Clifford, Iwona Czaplinski, Brett Fyfield, Sama Low-Choy, Belinda Spratt, Amy Stringer, Nicholas Tierney
     2018-07-12
  2. The student body’s got a bad preparation
     • SEB113 a core unit in QUT’s 2013 redesign of Bachelor of Science
     • Introduce key math/stats concepts needed for first year science
     • OP 13 cutoff (ATAR 65)
     • Assumed knowledge: Intermediate Mathematics
       • Some calculus and statistics
       • Not formally required
       • Diagnostic test and weekly prep material
     • Basis for further study in disciplines (explicit or embedded)
     • Still needs to be a self-contained unit that teaches skills
  3. What they need is adult education
     • Engaging students with use of maths/stats in science
     • Build good statistical habits from the start
     • Have students doing analysis that is relevant to their needs
       • as quickly as possible
       • competently
       • with skills that can be built on
     • Introduction to programming
       • reproducibility
       • separating analysis from the raw data
       • flexibility beyond menus
       • correcting mistakes becomes easier
  4. You go back to school
     • Bad old days
       • Manual calculation of test statistics
       • Reliance on statistical tables
     • Don’t want to replicate senior high school study
     • Reduce reliance on point and click software that only does everything students need right now (Excel, Minitab)
     • Students don’t need to become R developers
     • Focus on functionality rather than directly controlling every element, e.g. LaTeX vs Word
  5. It’s a bad situation
     • Initial course development was not tidy
       • New BSc course brought forward
       • Grab bag of topics at request of science academics
       • Difficult to find tutors who could think outside “traditional” stat. ed.
       • Very low student satisfaction initially
     • Rapid and radical redesign required
       • tidyverse: an integrated suite focused on transforming data frames
       • Vectorisation > loops
       • RStudio > JGR > Rgui.exe
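
The "Vectorisation > loops" point on slide 5 can be illustrated with a minimal sketch. The unit conversion is an assumed example (it borrows the mpg to L/100 km constant that appears in the talk's later plotting code); it is not taken from the unit's own materials.

```r
# Convert miles per gallon to litres per 100 km in two ways.
# Assumption: the built-in mtcars data stands in for a student data set.
mpg <- mtcars$mpg

# Loop: pre-allocate, then fill element by element
l100km_loop <- numeric(length(mpg))
for (i in seq_along(mpg)) {
  l100km_loop[i] <- 235.2146 / mpg[i]
}

# Vectorised: one expression acts on the whole vector
l100km_vec <- 235.2146 / mpg

# Both approaches give the same numbers
all.equal(l100km_loop, l100km_vec)
```

The vectorised line states what is wanted (a converted column) without any counter or storage bookkeeping, which is the contrast the slide draws.
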
  6. What you want is an adult education (Oh yeah!)
     • Compassion and support for learners
     • Problem- and model-based
     • Technology should support learning goals
     • Go further, quicker by not focussing on mechanical calculations
     • Workflow based on functions rather than element manipulation
     • Statistics is an integral part of science
     • Statistics isn’t about generating p values; see Cobb in Wasserstein and Lazar [2016]
  7. Machines do the work so people have time to think – IBM (1967)
     All models are wrong, but some are useful – Box (1987)
  8. Now here we go dropping science, dropping it all over
     Within context of scientific method:
     • Aims
     • Methods and Materials
       1. Get data/model into an analysis environment
       2. Data munging
     • Results
       3. Exploration of data/model
       4. Compute model
       5. Model diagnostics
     • Conclusion
       6. Interpret meaning of results
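
The six steps on slide 8 could be walked through as follows. This is a sketch in base R only; the choice of the built-in mtcars data and of a straight-line model is an assumption made for illustration, and the unit conversions reuse the constants from the talk's own plotting example.

```r
# 1. Get data into the analysis environment (built-in data, for illustration)
data(mtcars)

# 2. Data munging: convert to metric units
cars <- transform(mtcars,
                  l100km = 235.2146 / mpg,  # litres per 100 km
                  wt_T   = wt / 2.2046)     # weight in tonnes

# 3. Exploration of data
summary(cars[, c("l100km", "wt_T")])
plot(l100km ~ wt_T, data = cars)

# 4. Compute model (assumed: simple linear regression)
fit <- lm(l100km ~ wt_T, data = cars)

# 5. Model diagnostics: residuals against fitted values
plot(fitted(fit), resid(fit))

# 6. Interpret: estimated coefficients and their 95% confidence intervals
coef(fit)
confint(fit)
```
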
  9. I said you wanna be startin’ somethin’
     • Redesign around ggplot2
       • ggplot2 introduced us to tidy data requirements
       • Redesign based on Year 11 summer camp
       • This approach not covered by textbooks at the time
       • Tried using JGR and Plot Builder for one semester
     • Extension to wider tidyverse
       • Replace unrelated packages/functions with unified approach
       • Focus on what you want rather than directly coding how to do it
       • Good effort-reward with limited expertise
  10. Summer(ise) loving, had me a blast; summer(ise) loving, happened so fast
      • R is a giant calculator that can operate on objects
      • ggplot() requires a data frame object
      • dplyr::summarise() to summarise a column variable
      • dplyr::group_by() to do summary according to specified structure
      • Copy-paste or looping not guaranteed to be MECE (mutually exclusive, collectively exhaustive)
      • Group-level summary stats lead to potential statistical models
      • Easier, less error prone, than repeated usage of =AVERAGE()
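
A minimal sketch of the summarise() and group_by() pattern from slide 10. The data set (mtcars) and the grouping variable (cyl) are assumptions chosen for illustration. The calls are written without the pipe, in line with the talk's remark that %>% is hard for novices to grasp in a first course.

```r
library(dplyr)

# summarise() collapses a column to a single value
overall <- summarise(mtcars, mean_mpg = mean(mpg))

# group_by() records the structure; summarise() then works within each group
by_cyl <- group_by(mtcars, cyl)
cyl_means <- summarise(by_cyl,
                       n        = n(),
                       mean_mpg = mean(mpg),
                       sd_mpg   = sd(mpg))
cyl_means

# Every car lands in exactly one group, so the group sizes add up to the total
sum(cyl_means$n) == nrow(mtcars)
```

Because the groups come from the data rather than from hand-selected cell ranges, no observation can be missed or counted twice, which is the MECE guarantee that copy-paste use of =AVERAGE() lacks.
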
  11. We want the funk(tional programming paradigm)
      • Tidy data as observations of variables with structure [Wickham, 2014b]
      • R as functional programming [Wickham, 2014a]
        • Actions on entire objects to do things to data and return useful information
      • Students enter understanding functions like $y(x) = x^2$
        • function takes input
        • function returns output
        • e.g. $\mathrm{mean}(x) = \sum_i x_i / n$
      • Week 4: writing functions to solve calculus problems
      • magrittr::%>% too conceptually similar to ggplot2::+ for novices to grasp in first course
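
The two functions named on slide 11 can be written directly in R. The third function is a hypothetical example of a "calculus problem"; the talk does not say which problems were set in Week 4, so it is only an illustration.

```r
# A function takes input and returns output: y(x) = x^2
y <- function(x) {
  x^2
}
y(3)

# The sample mean written out from its definition: sum of x_i divided by n
my_mean <- function(x) {
  sum(x) / length(x)
}
my_mean(c(2, 4, 6))

# Hypothetical calculus example: approximate the derivative of f at x
# with a central difference
num_deriv <- function(f, x, h = 1e-5) {
  (f(x + h) - f(x - h)) / (2 * h)
}
num_deriv(y, 3)  # close to 6, since dy/dx = 2x
```
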
  12. Like Frankie sang, I did it my way
      What’s the mean gas mileage for each engine geometry and transmission type for the 32 cars listed in 1974 Motor Trend magazine?
      • Loops: for each of the pre-computed number of groups, subset, summarise and store how you want
      • tapply(): INDEX a list of k vectors, 1 summary FUNction, returns k-dimensional array
      • dplyr: specify grouping variables and which summary statistics, returns tidy data frame ready for model/plot
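
The three routes on slide 12 might look like this for the mtcars question. The code is a sketch written for this comparison, not the code shown in the talk; the object names are illustrative.

```r
library(dplyr)

# Loop: build the groups, then subset, summarise and store by hand
groups <- unique(mtcars[, c("vs", "am")])
groups$mean_mpg <- NA_real_
for (i in seq_len(nrow(groups))) {
  rows <- mtcars$vs == groups$vs[i] & mtcars$am == groups$am[i]
  groups$mean_mpg[i] <- mean(mtcars$mpg[rows])
}

# tapply(): two INDEX vectors, one FUN, returns a 2 x 2 array
mpg_array <- tapply(mtcars$mpg,
                    INDEX = list(vs = mtcars$vs, am = mtcars$am),
                    FUN = mean)

# dplyr: name the grouping variables and the summary statistic,
# get back a tidy data frame with one row per group
mpg_tidy <- summarise(group_by(mtcars, vs, am),
                      mean_mpg = mean(mpg),
                      .groups = "drop")
mpg_tidy
```

All three give the same four means. Only the dplyr result arrives as a data frame with the grouping variables as columns, so it can go straight into ggplot() or a model.
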
  13. Night of the living baseheads
      • Like all procedural languages, plot() has one giant list of arguments
      • Focus is on how plot is drawn rather than what you want to plot
      • Inefficiency of keystrokes
        • re-stating the things being plotted
        • setting up plot axis limits
        • loop counters for small multiples, etc.
  14. Toot toot, chugga chugga, big red car
      Say we want to plot cars’ fuel efficiency against weight

          library(tidyverse)
          data(mtcars)
          mtcars <- mutate(
            mtcars,
            l100km = 235.2146/mpg,
            wt_T = wt/2.2046,
            am = factor(am, levels = c(0,1), labels=c("Auto", "Manual")),
            vs = factor(vs, levels = c(0,1), labels=c("V","S")))
          plot(y=mtcars$l100km, x=mtcars$wt_T)

      [Figure: scatter plot of mtcars$l100km against mtcars$wt_T]
      • Fairly quick to say what goes on x and y axes
      • More arguments → better graph
        • xlim, ylim
        • xlab, ylab
        • main
        • type, pch
      • What if we want to see how it varies with
        • engine geometry
        • transmission type
  15. The wisdom of the fool won’t set you free

          yrange <- range(mtcars$l100km)
          xrange <- range(mtcars$wt_T)
          levs <- expand.grid(vs = c("V", "S"), am = c("Auto", "Manual"))
          par(mfrow = c(2,2))
          for (i in 1:nrow(levs)){
            dat_to_plot <- merge(levs[i, ], mtcars)
            plot(dat_to_plot$l100km ~ dat_to_plot$wt_T, pch=16,
                 xlab="Weight (t)", xlim=xrange,
                 ylab="Fuel efficiency (L/100km)", ylim=yrange,
                 main = sprintf("%s-%s", levs$am[i], levs$vs[i]))}

      [Figure: 2 x 2 grid of base graphics scatter plots titled Auto-V, Auto-S, Manual-V and Manual-S; x axis Weight (t), y axis Fuel efficiency (L/100km)]

          ggplot(data = mtcars, aes(x = wt_T, y = l100km)) +
            geom_point() +
            facet_grid(am ~ vs) +
            theme_bw() +
            xlab("Weight (t)") +
            ylab("Fuel efficiency (L/100km)")

      [Figure: the same data drawn with facet_grid; columns V and S, rows Auto and Manual; x axis Weight (t), y axis Fuel efficiency (L/100km)]
  16. One, two, princes kneel before you
      Both approaches do the same thing

      Idea                 base                  ggplot2
      Plot variables       Specify vectors       Coordinate system defined by variables
      Small multiples      Loops, subsets, par   facet_grid
      Common axes          Pre-computed          Inherited from data
      V/S A/M annotation   Strings               Inherited from data
      Axis labels          Per axis set          For whole plot

      Focus on putting things on the page vs representing variables
  17. I got a grammar Hazel and a grammar Tilly
      • Plots are built from [Wickham, 2010]
        • data – which variables are mapped to aesthetic elements
        • geometry – how do we draw the data?
        • annotations – what is the context of these shapes?
      • Build more complex plots by adding commands and layering elements, rather than by stacking individual points and lines
      • e.g. make a scatter plot, THEN add a trend line (with inherited x, y), THEN facet by grouping variable, THEN change axis information
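
The "THEN" sequence on slide 17 maps onto ggplot2 layers one step at a time. The sketch below assumes the untransformed mtcars data and a linear trend line; both choices are illustrative rather than taken from the talk.

```r
library(ggplot2)

# Make a scatter plot: data and aesthetic mappings are declared once
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# THEN add a trend line, which inherits x and y from ggplot()
p <- p + geom_smooth(method = "lm", formula = y ~ x)

# THEN facet by a grouping variable (transmission type)
p <- p + facet_wrap(~ am)

# THEN change the axis information
p <- p + xlab("Weight (1000 lb)") + ylab("Fuel economy (miles per gallon)")

p
```

Each step adds to the existing plot object, so a student can stop after any line and still have a valid plot.
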
  18. When I’m good, I’m very good; but when I’m bad, I’m better
      • Want to make good plots as soon as possible
      • Learning about Tufte’s principles [Tufte, 1983, Pantoliano, 2012]
      • Discuss what makes a plot good and bad
      • Seeing how ggplot2 code translates into graphical elements
      • Week 2 workshop has students making best and worst plots for a data set, e.g.
  19. Sie ist ein Model und sie sieht gut aus ("She's a model and she's looking good")
      • Make use of broom package to get model summaries
      • Get data frames rather than summary.lm() text vomit
      • tidy()
        • parameter estimates
        • CIs
        • t test info [Greenland et al., 2016]
      • glance() everything else
      • ggplot2::fortify() regression diagnostic info instead of plot.lm()
        • stat_qq(aes(x=.stdresid)) for residual quantiles
        • geom_point(aes(x=.fitted, y=.resid)) for fitted vs residuals
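
A sketch of the broom workflow from slide 19, under two stated assumptions. First, the model (mpg against weight in mtcars) is chosen for illustration. Second, broom::augment() is swapped in for the ggplot2::fortify() named on the slide; augment() names the standardised residual column .std.resid rather than .stdresid, and stat_qq() is given the sample aesthetic.

```r
library(broom)
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Parameter estimates, standard errors, t statistics, p values and CIs
estimates <- tidy(fit, conf.int = TRUE)

# One-row data frame of model-level summaries (R squared, AIC, ...)
model_summary <- glance(fit)

# Observation-level diagnostics: .fitted, .resid, .std.resid, ...
diagnostics <- augment(fit)

# Fitted vs residuals
p_resid <- ggplot(diagnostics, aes(x = .fitted, y = .resid)) +
  geom_point()

# Quantile plot of the standardised residuals
p_qq <- ggplot(diagnostics, aes(sample = .std.resid)) +
  stat_qq()
```

Every result is a data frame or a plot built from one, so the same ggplot2 and dplyr skills carry over from data exploration to model checking.
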
  20. When you hear some feedback keep going take it higher
      • Positives
        • More confidence and students see use of maths/stats in science
        • Students enjoy group discussions in workshops
        • Some students continue using R over Excel in future units
        • Labs can be done online in own time
      • Negatives
        • Request for more face to face help rather than online
        • Labs can be done online in own time (but are they?)
        • Downloading of slides rather than attending/watching lectures
  21. Things can only get better
      • Focus on what you want from R rather than how you do it
        • representing variables graphically
        • summarising over structure in data
        • tidiers for models
      • Statistics embedded in scientific theory [Diggle and Chetwynd, 2011]
      • Problem-based learning
        • groups of novices supervised by tutors
        • discussion of various approaches
  22. Peter J. Diggle and Amanda G. Chetwynd. Statistics and Scientific Method: An Introduction for Students and Researchers. Oxford University Press, 2011.
      Sander Greenland, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4):337–350, Apr 2016.
      Mike Pantoliano. Data visualization principles: Lessons from Tufte, 2012.
      Edward Tufte. The Visual Display of Quantitative Information. Graphics Press, 1983.
      Ronald L. Wasserstein and Nicole A. Lazar. The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2):129–133, Apr 2016.
  23. Hadley Wickham. Advanced R. Chapman & Hall/CRC The R Series. Taylor & Francis, 2014a. ISBN 9781466586963.
      Hadley Wickham. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1):3–28, 2010. doi: 10.1198/jcgs.2009.07098.
      Hadley Wickham. Tidy data. Journal of Statistical Software, 59(1):1–23, 2014b. ISSN 1548-7660.