UBC STAT545 2014 Cm001 intro to-course

STAT 545A Class meeting 001
Course intro + prompts to install lots of software
and sign up for lots of accounts
web companion: STAT 545 web home > Syllabus > cm001
Wednesday, September 3, 2014

Dr. Jennifer (Jenny) Bryan
Department of Statistics and Michael Smith Laboratories
University of British Columbia
jenny@stat.ubc.ca
https://github.com/jennybc
http://www.stat.ubc.ca/~jenny/
@JennyBryan ← personal, professional Twitter
https://github.com/STAT545-UBC
http://stat545-ubc.github.io
@STAT545 ← Twitter as lead instructor of this course

statistical
theory
real
world
data STAT 545A

in a wide array of academic fields,
the ability to effectively process data
is superseding other more classical modes of research
The Big Data Brain Drain: Why Science is in Trouble
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/

The data science Venn diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

a horizontal data science workflow
Visualise
Munge
Model
Communicate
Tidy
Question
Collect
Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3

Visualise
Munge
Model
Communicate
Tidy
Question
Collect
Wednesday, October 30, 13
We can’t focus just on this!
Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3

Data scientists spend 50 -
80% of their time mired in
this more mundane labor
of collecting and preparing
unruly digital data, before
it can be explored for
... what data scientists call “data wrangling,” useful nuggets.
“data munging” and “data janitor work” ...
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?partner=rss&emc=rss&smid=tw-nytimesscience&_r=0

http://mimno.infosci.cornell.edu/b/articles/carpentry/
As data science becomes all the more relevant and indeed, profitable, attention
has been placed on the value of cleaning a data set. David Mimno unpicks the
term and the process and suggests that data carpentry may be a more suitable
description. There is no such thing as pure or clean data buried in a thin layer
of non-clean data. In reality, the process is more like deciding how to cut and
join a piece of material. Data carpentry is not a single process but a thousand
little skills and techniques.
http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/

Complexity must be justified.
http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-scientist-editing
Before you analyze your data with
computers, be sure to plot it
http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/
... the first reasonable thing you can do to a set of data
often is 80% of the way to the optimal solution.
Everything after that is working on getting the last
20%... (maybe could even be the 90/10 rule)
http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/
Problem first not solution backward
http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/

“All models are wrong, some models are useful.”
Box, G.E.P., Robustness in the strategy of scientific model building, in Robustness in
Statistics, R.L. Launer and G.N. Wilkinson, Editors. 1979, Academic Press: New York.
Entia non sunt multiplicanda praeter necessitatem
The principle, known as Occam’s Razor, that says:
when there are two competing theories or
explanations -- both compatible with observed
data, known facts -- the simpler one is better.
Implication for statistical analysis: if two models
are equally wrong-but-compatible-with-data, the
simpler one is more useful!

small medium big
size of dataset
* figure is fictional but I stand by this claim
There are LOTS of small to
medium datasets, even
though it’s more trendy to
talk about the big ones.

“Too big for Excel is not ‘Big Data’.”
“Big data is like teenage sex:
everyone talks about it, nobody really
knows how to do it, everyone thinks
everyone else is doing it, so everyone
claims they are doing it.”
“Big data has become a synonym
for ‘data analysis,’ which is
confusing and counter-productive.”
small medium big
size of dataset
* figure is fictional but I stand by this claim
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/
https://www.facebook.com/dan.ariely/posts/904383595868

big data ≠ data science
big data ⊂ data science

Two inter-related goals
• Foster your development of a personal philosophy for
data analysis, esp. exploratory and descriptive analysis.
• Help you assemble a modern toolchain and workflows
for data analysis.
My hope:
You’ll leave this course with (at least the beginnings of)
a confident, deliberate attitude about how to approach
data analysis and the practical skills to put your
attitude into action.

you’re going to make your own data science sampler

“A picture is worth
a thousand words”

1986 Challenger space shuttle disaster
Favorite example of Edward Tufte
http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg

“A picture is worth a thousand words”

Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space
Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989),
pp. 945-957. Access via JSTOR.

Edward Tufte
http://www.edwardtufte.com
BOOK:
Visual Explanations: Images and Quantities, Evidence and
Narrative
Ch. 5 deals with the Challenger disaster
That chapter is available for $7 as a downloadable booklet:
http://www.edwardtufte.com/tufte/books_textb

Always, always, always plot the data.
Replace (or complement) ‘typical’ tables of
data or statistical results with figures that
are more compelling and accessible.
Whenever possible, generate figures that
overlay / juxtapose observed data and
analytical results, e.g. the ‘fit’.

Why?
•find bizarre data and results when it is
least embarrassing and painful
•facilitate comparisons and reveal trends
Recommended reference: Gelman A, Pasarica C, Dodhia
R. “Let's Practice What We Preach: Turning Tables into
Graphs”. The American Statistician, Volume 56, Number
2, 1 May 2002 , pp. 121-130(10). via JSTOR

weak links in the chain:
process, packaging and
presentation

we watched a video about the importance of reproducible
research, data and analytical documentation, data sharing
https://www.youtube.com/watch?v=N2zK3sAtr-4&feature=youtu.be

project organization
literate programming
reproducible research
collaboration / open science
version control / back up / archive

project organization / literate programming /
What the cool kids seem to
be doing ....
Sweave
knitr
R markdown
R packages
RStudio
github
Rforge
sourceforge
git
subversion
mercurial

RStudio is an integrated development environment (IDE) for R

R ≠ RStudio
RStudio mediates your interaction with R; it would
replace Emacs + ESS or Tinn-R, but not R itself
Rstudio is a product of -- actually, more a driver of -- the
emergence of R Markdown, knitr, R + Git(Hub)

Web PDF → HTML
Latex → Markdown
Static → Interactive
from Hadley Wickham’s talk in the Simply Statistics Unconference on the Future of Statistics
13
Open Open source ↑↑
Open science ↑↑
Open research ↑↑
Wednesday, October 30, 13

Markdown HTML
foo.md foo.html
easy to write
(and read!)
easy to publish
easy to read in
browser

Markdown HTML
Title (header 1, actually)
=====================================
This is a Markdown document.
## Medium header (header 2, actually)
It's easy to do *italics* or __make things bold__.
> All models are wrong, but some are useful. An
approximate answer to the right problem is worth a
good deal more than an exact answer to an
approximate problem. Absolute certainty is a
privilege of uneducated minds-and fanatics. It is,
for scientific folk, an unattainable ideal. What
you do every day matters more than what you do
once in a while. We cannot expect anyone to know
anything we didn't teach them ourselves.
Enthusiasm is a form of social courage.
Code block below. Just affects formatting here but
we'll get to R Markdown for the real fun soon!
```
x <- 3 * 4
```
I can haz equations. Inline equations, such as ...
the average is computed as $frac{1}{n} sum_{i=1}
^{n} x_{i}$. Or display equations like this:
$$
begin{equation*}
|x|=
begin{cases} x & text{if $x≥0$,}
-x &text{if $xle 0$.}
end{cases}
end{equation*}
$$

R Markdown Markdown
R Markdown rocks
=====================================
This is an R Markdown document.
```{r}
x <- rnorm(1000)
head(x)
```
See how the R code gets executed and a
representation thereof appears in the document?
`knitr` gives you control over how to represent all
conceivable types of output. In case you care, then
average of the `r length(x)` random normal variates
we just generated is `r round(mean(x), 3)`. Those
numbers are NOT hard-wired but are computed on-the-fly.
As is this figure. No more copy-paste ... copy-paste
... oops forgot to copy-paste.
```{r}
plot(density(x))
```
Note that all the previously demonstrated math
typesetting still works. You don't have to choose
between having math cred and being web-friendly!
Inline equations, such as ... the average is
computed as $frac{1}{n} sum_{i=1}^{n} x_{i}$. Or
display equations like this:
$$
begin{equation*}
|x|=
begin{cases} x & text{if $x≥0$,}
-x &text{if $xle 0$.}
end{cases}
end{equation*}
$$
R Markdown rocks
=====================================
```r
x <- rnorm(1000)
head(x)
```
```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973
2.4157
```
average of the 1000 random normal variates we just
generated is -0.081. Those numbers are NOT hard-wired
but are computed on-the-fly. As is this
figure. No more copy-paste ... copy-paste ... oops
forgot to copy-paste.
```r
plot(density(x))
```
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-
2.png)
...

R Markdown rocks
=====================================
```r
x <- rnorm(1000)
head(x)
```
```
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973
2.4157
```
average of the 1000 random normal variates we just
generated is -0.081. Those numbers are NOT hard-wired
but are computed on-the-fly. As is this
figure. No more copy-paste ... copy-paste ... oops
forgot to copy-paste.
```r
plot(density(x))
```
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk-
2.png)
...
Markdown HTML

Markdown HTML
foo.md foo.html
easy to write
(and read!)
easy to publish
easy to read in
browser
R Markdown
foo.rmd

how do I put my source on the web?
for the world or select collaborators?
maybe I’ll share my data and my prose too ....
how should I marshall all of that stuff?
how can I collaborate with others on an analysis
or package development?
Advice to preserve sanity:
Stop doing this via email, attachments, and tracking
changes in Word. Get that stuff into plain text, put
it under version control and get it out on the web.

http://www.phdcomics.com/comics/archive.php?comicid=1531 via Ram, 2013 doi:10.1186/1751-0473-8-7

Version control systems (VCS) were created to help
groups of people develop software
Git, in particular, is being “repurposed” for activities
other than pure software development ... like the
messy hybrid of writing, coding and data wrangling
Git repository = a bunch of files you want to
manage in a sane way
repo = repository

collaboration = the “killer app” of version control
Learning Git has been -- and continues to be --
painful. But not nearly as crazy-making as the
alternatives:
- documents as email attachments
- uncertainty about which version is “master”
- am I working with the most recent data?
- archaelogical “digs” on old email threads
- uncertainty about how/if certain changes have been
made or issues solved
- hair-raising ZIP archives containing file salad

Git repository = a bunch of files you
want to manage in a sane way
repo = repository
you can set up repo ... then start
your work
or you can make a set of existing
files and make them into a repo

GitHub = a place to host Git repositories on the web
GitHub ≠ Git

possible, in theory more typical
GitHub
me you
Image from https://www.atlassian.com/git/tutorial/git-basics#!clone
✗

Many R packages are developed in
the open on GitHub
Nice option when someone tells
you to “read the source”!

You can see exactly how files have changed, when, and by
whom. If commit message is good, you’ll see why.
Commit = a formal “checkpoint” or snapshot of the state
of the repository

GitHub provides a fantastic visual “diff” view of exactly
what changed. Incredibly useful.

GitHub issues: think “bug tracker”, “to do list”.

GitHub renders Markdown files nicely
Example: links.md in workshop repo of mine

You can see the raw Markdown too!

GitHub renders comma (.csv) and tab (.tsv) delimited files nicely
Example: Lord of the Rings data I found for STAT 545A

http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true
NYC

http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true
NYC
Detroit

http://www.coopercenter.org/demographics/Racial-Dot-Map

Cool result is accompanied by
explanation of how it was done

https://github.com/unorthodox123/RacialDotMap

http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html

https://github.com/corynissen/foodborne_classifier

project organization / literate programming /
Now you what the fuss is about!
knitr
R markdown
R packages
RStudio
github
Rforge
sourceforge
git
subversion
mercurial

Measured(
Data
Analy/c(
Data
slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/
Figures
Computa/onal(
Results Tables Ar/cle
Numerical(
Summaries Text
Processing(code Analy/c(code
Presenta/on(code
R((or(Python)(scripts

Measured(
Data
Analy/c(
Data
Figures
Computa/onal(
Numerical(
Summaries Text
Presenta/on(code
delimited(or(other(structured,(agnos/c(files

How(to(make(end(products(more(integrated(and(
more(reproducible?
Measured(
Data
Analy/c(
Data
Figures
Computa/onal(
Numerical(
Summaries Text
Presenta/on(code

How(to(keep(everything(upPtoPdate?
If(the(data(changes,(how(do(we(remember(to(reP
make(the(figures(2B(and(4?
Measured(
Data
Analy/c(
Data
Figures
Computa/onal(
Numerical(
Summaries Text
Presenta/on(code

Makefile
http://zmjones.com/make.html
Like Git,
GNU Make is another old
school tool that is being
repurposed to meet a need in
data-intensive workflows.
Originally intended to
orchestrate compiling
complicate software, it’s now
used to express what depends
on what and keep everything
“in sync”.

who am I?
BA in econ and german
management consultant
PhD biostatistics
assoc prof @ UBC
50% Statistics / 50% Michael Smith Laboratories
I teach and perform lots of data analysis

the whole team
Bernhard
Konrad
Dean
Attali
Luolan
(Gloria) Li
Jenny
Bryan
Julia
Gustavsen
Shaun
Jackman
http://stat545-ubc.github.io/people.html

Culture of the class
• Teaching you to fish (vs. giving you a fish)
- It’s amazing what a determined individual can learn from
documentation, small learning examples, and ... <gasp>
Googling. And also stackoverflow.
• Rewarding engagement, intellectual generosity and
curiosity
- Speaking up, sharing success OR failure, showing some
interest in something will earn marks.
• Zero tolerance of plagiarism
- Generating your own approach, writing some code, and
describing the process is the whole point. Process is
generally more important than product.

Where marks will come from
• Small ~weekly activities; marked coarsely (think check, check
minus, check plus), often with peer evaluation
• Flexibility to, e.g. work with a dataset you choose or to spin
the problem a certain way
- Think about datasets you’d like to prepare and analyze!
• Adjust the difficulty level relative to where you are now (and
where you need/want to be!)
• Departure from past model where individual final project
was almost entire course mark. Want to decrease the back-loaded
heroic efforts by students and instructor alike. Plus I
think you’ll learn more this way.

web companion: STAT 545 web home > Syllabus > cm001
Homework!

Twitter, GitHub ... are PUBLIC
cultivate your professional / scholarly profile with
intention
if you join, make sure @STAT545 follows you back

how to get help
Dean will soon be announcing office hours
Open an issue on a GitHub repository and
tag one or more instructors
Tweet to @STAT545
Direct twitter message to @STAT545
Email to an instructor We’ll write this up and put
on webpage today

respond to our prompt to
figure out who you are!
we want to match up various info
with what UBC provides (e.g.
Twitter handle, Github username)

what class meetings
will look like
9:30 - 9:50 “lecture”
9:50 - 10:35 hands-on work!
10:35 - 10:55 “lecture”

rhythm of each week
submit a
unit of
work
mon class wed class
work on your own
work in class
consult peers and instructors in class
office hrs
online interaction via GitHub, Twitter

UBC STAT545 2014 Cm001 intro to-course

Recommended

Recommended

More Related Content

What's hot

What's hot (13)

Similar to UBC STAT545 2014 Cm001 intro to-course

Similar to UBC STAT545 2014 Cm001 intro to-course (20)

Recently uploaded

Recently uploaded (20)

UBC STAT545 2014 Cm001 intro to-course