SlideShare a Scribd company logo
1 of 86
Download to read offline
STAT 545A Class meeting 001 
Course intro + prompts to install lots of software 
and sign up for lots of accounts 
web companion: STAT 545 web home > Syllabus > cm001 
Wednesday, September 3, 2014
Dr. Jennifer (Jenny) Bryan 
Department of Statistics and Michael Smith Laboratories 
University of British Columbia 
jenny@stat.ubc.ca 
https://github.com/jennybc 
http://www.stat.ubc.ca/~jenny/ 
@JennyBryan ← personal, professional Twitter 
https://github.com/STAT545-UBC 
http://stat545-ubc.github.io 
@STAT545 ← Twitter as lead instructor of this course
statistical 
theory 
real 
world 
data STAT 545A
in a wide array of academic fields, 
the ability to effectively process data 
is superseding other more classical modes of research 
The Big Data Brain Drain: Why Science is in Trouble 
http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
what is data science?
The data science Venn diagram 
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
a horizontal data science workflow 
Visualise 
Munge 
Model 
Communicate 
Tidy 
Question 
Collect 
Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3
Visualise 
Munge 
Model 
Communicate 
Tidy 
Question 
Collect 
Wednesday, October 30, 13 
We can’t focus just on this! 
Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3
Data scientists spend 50 - 
80% of their time mired in 
this more mundane labor 
of collecting and preparing 
unruly digital data, before 
it can be explored for 
... what data scientists call “data wrangling,” useful nuggets. 
“data munging” and “data janitor work” ... 
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?partner=rss&emc=rss&smid=tw-nytimesscience&_r=0
http://mimno.infosci.cornell.edu/b/articles/carpentry/ 
As data science becomes all the more relevant and indeed, profitable, attention 
has been placed on the value of cleaning a data set. David Mimno unpicks the 
term and the process and suggests that data carpentry may be a more suitable 
description. There is no such thing as pure or clean data buried in a thin layer 
of non-clean data. In reality, the process is more like deciding how to cut and 
join a piece of material. Data carpentry is not a single process but a thousand 
little skills and techniques. 
http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/
Complexity must be justified. 
http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-scientist-editing 
Before you analyze your data with 
computers, be sure to plot it 
http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/ 
... the first reasonable thing you can do to a set of data 
often is 80% of the way to the optimal solution. 
Everything after that is working on getting the last 
20%... (maybe could even be the 90/10 rule) 
http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/ 
Problem first not solution backward 
http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/
“All models are wrong, some models are useful.” 
Box, G.E.P., Robustness in the strategy of scientific model building, in Robustness in 
Statistics, R.L. Launer and G.N. Wilkinson, Editors. 1979, Academic Press: New York. 
Entia non sunt multiplicanda praeter necessitatem 
The principle, known as Occam’s Razor, that says: 
when there are two competing theories or 
explanations -- both compatible with observed 
data, known facts -- the simpler one is better. 
Implication for statistical analysis: if two models 
are equally wrong-but-compatible-with-data, the 
simpler one is more useful!
small medium big 
size of dataset 
* figure is fictional but I stand by this claim 
There are LOTS of small to 
medium datasets, even 
though it’s more trendy to 
talk about the big ones.
“Too big for Excel is not ‘Big Data’.” 
“Big data is like teenage sex: 
everyone talks about it, nobody really 
knows how to do it, everyone thinks 
everyone else is doing it, so everyone 
claims they are doing it.” 
“Big data has become a synonym 
for ‘data analysis,’ which is 
confusing and counter-productive.” 
small medium big 
size of dataset 
* figure is fictional but I stand by this claim 
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html 
http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/ 
https://www.facebook.com/dan.ariely/posts/904383595868
big data ≠ data science 
big data ⊂ data science
Two inter-related goals 
• Foster your development of a personal philosophy for 
data analysis, esp. exploratory and descriptive analysis. 
• Help you assemble a modern toolchain and workflows 
for data analysis. 
My hope: 
You’ll leave this course with (at least the beginnings of) 
a confident, deliberate attitude about how to approach 
data analysis and the practical skills to put your 
attitude into action.
you’re going to make your own data science sampler
“A picture is worth 
a thousand words”
1986 Challenger space shuttle disaster 
Favorite example of Edward Tufte 
http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg
“A picture is worth a thousand words”
“A picture is worth a thousand words” 
Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space 
Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989), 
pp. 945-957. Access via JSTOR.
Edward Tufte 
http://www.edwardtufte.com 
BOOK: 
Visual Explanations: Images and Quantities, Evidence and 
Narrative 
Ch. 5 deals with the Challenger disaster 
That chapter is available for $7 as a downloadable booklet: 
http://www.edwardtufte.com/tufte/books_textb
“A picture is worth a thousand words” 
Always, always, always plot the data. 
Replace (or complement) ‘typical’ tables of 
data or statistical results with figures that 
are more compelling and accessible. 
Whenever possible, generate figures that 
overlay / juxtapose observed data and 
analytical results, e.g. the ‘fit’.
“A picture is worth a thousand words” 
Why? 
•find bizarre data and results when it is 
least embarrassing and painful 
•facilitate comparisons and reveal trends 
Recommended reference: Gelman A, Pasarica C, Dodhia 
R. “Let's Practice What We Preach: Turning Tables into 
Graphs”. The American Statistician, Volume 56, Number 
2, 1 May 2002 , pp. 121-130(10). via JSTOR
weak links in the chain: 
process, packaging and 
presentation
we watched a video about the importance of reproducible 
research, data and analytical documentation, data sharing 
https://www.youtube.com/watch?v=N2zK3sAtr-4&feature=youtu.be
project organization 
literate programming 
reproducible research 
collaboration / open science 
version control / back up / archive
project organization / literate programming / 
reproducible research 
What the cool kids seem to 
be doing .... 
collaboration / open science 
Sweave 
knitr 
R markdown 
R packages 
RStudio 
version control / back up / archive 
github 
Rforge 
sourceforge 
git 
subversion 
mercurial
RStudio is an integrated development environment (IDE) for R
R ≠ RStudio 
RStudio mediates your interaction with R; it would 
replace Emacs + ESS or Tinn-R, but not R itself 
Rstudio is a product of -- actually, more a driver of -- the 
emergence of R Markdown, knitr, R + Git(Hub)
Web PDF → HTML 
Latex → Markdown 
Static → Interactive 
from Hadley Wickham’s talk in the Simply Statistics Unconference on the Future of Statistics 
13 
Open Open source ↑↑ 
Open science ↑↑ 
Open research ↑↑ 
Wednesday, October 30, 13
Markdown HTML 
foo.md foo.html 
easy to write 
(and read!) 
easy to publish 
easy to read in 
browser
Markdown HTML 
Title (header 1, actually) 
===================================== 
This is a Markdown document. 
## Medium header (header 2, actually) 
It's easy to do *italics* or __make things bold__. 
> All models are wrong, but some are useful. An 
approximate answer to the right problem is worth a 
good deal more than an exact answer to an 
approximate problem. Absolute certainty is a 
privilege of uneducated minds-and fanatics. It is, 
for scientific folk, an unattainable ideal. What 
you do every day matters more than what you do 
once in a while. We cannot expect anyone to know 
anything we didn't teach them ourselves. 
Enthusiasm is a form of social courage. 
Code block below. Just affects formatting here but 
we'll get to R Markdown for the real fun soon! 
``` 
x <- 3 * 4 
``` 
I can haz equations. Inline equations, such as ... 
the average is computed as $frac{1}{n} sum_{i=1} 
^{n} x_{i}$. Or display equations like this: 
$$ 
begin{equation*} 
|x|= 
begin{cases} x & text{if $x≥0$,}  
-x &text{if $xle 0$.} 
end{cases} 
end{equation*} 
$$
R markdown
R Markdown Markdown 
R Markdown rocks 
===================================== 
This is an R Markdown document. 
```{r} 
x <- rnorm(1000) 
head(x) 
``` 
See how the R code gets executed and a 
representation thereof appears in the document? 
`knitr` gives you control over how to represent all 
conceivable types of output. In case you care, then 
average of the `r length(x)` random normal variates 
we just generated is `r round(mean(x), 3)`. Those 
numbers are NOT hard-wired but are computed on-the-fly. 
As is this figure. No more copy-paste ... copy-paste 
... oops forgot to copy-paste. 
```{r} 
plot(density(x)) 
``` 
Note that all the previously demonstrated math 
typesetting still works. You don't have to choose 
between having math cred and being web-friendly! 
Inline equations, such as ... the average is 
computed as $frac{1}{n} sum_{i=1}^{n} x_{i}$. Or 
display equations like this: 
$$ 
begin{equation*} 
|x|= 
begin{cases} x & text{if $x≥0$,}  
-x &text{if $xle 0$.} 
end{cases} 
end{equation*} 
$$ 
R Markdown rocks 
===================================== 
This is an R Markdown document. 
```r 
x <- rnorm(1000) 
head(x) 
``` 
``` 
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 
2.4157 
``` 
See how the R code gets executed and a 
representation thereof appears in the document? 
`knitr` gives you control over how to represent all 
conceivable types of output. In case you care, then 
average of the 1000 random normal variates we just 
generated is -0.081. Those numbers are NOT hard-wired 
but are computed on-the-fly. As is this 
figure. No more copy-paste ... copy-paste ... oops 
forgot to copy-paste. 
```r 
plot(density(x)) 
``` 
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk- 
2.png) 
...
R Markdown rocks 
===================================== 
This is an R Markdown document. 
```r 
x <- rnorm(1000) 
head(x) 
``` 
``` 
## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 
2.4157 
``` 
See how the R code gets executed and a 
representation thereof appears in the document? 
`knitr` gives you control over how to represent all 
conceivable types of output. In case you care, then 
average of the 1000 random normal variates we just 
generated is -0.081. Those numbers are NOT hard-wired 
but are computed on-the-fly. As is this 
figure. No more copy-paste ... copy-paste ... oops 
forgot to copy-paste. 
```r 
plot(density(x)) 
``` 
![plot of chunk unnamed-chunk-2](figure/unnamed-chunk- 
2.png) 
... 
Markdown HTML
Markdown HTML 
foo.md foo.html 
easy to write 
(and read!) 
easy to publish 
easy to read in 
browser 
R Markdown 
foo.rmd
how do I put my source on the web? 
for the world or select collaborators? 
maybe I’ll share my data and my prose too .... 
how should I marshall all of that stuff? 
how can I collaborate with others on an analysis 
or package development? 
Advice to preserve sanity: 
Stop doing this via email, attachments, and tracking 
changes in Word. Get that stuff into plain text, put 
it under version control and get it out on the web.
http://www.phdcomics.com/comics/archive.php?comicid=1531 via Ram, 2013 doi:10.1186/1751-0473-8-7
Version control systems (VCS) were created to help 
groups of people develop software 
Git, in particular, is being “repurposed” for activities 
other than pure software development ... like the 
messy hybrid of writing, coding and data wrangling 
Git repository = a bunch of files you want to 
manage in a sane way 
repo = repository
collaboration = the “killer app” of version control 
Learning Git has been -- and continues to be -- 
painful. But not nearly as crazy-making as the 
alternatives: 
- documents as email attachments 
- uncertainty about which version is “master” 
- am I working with the most recent data? 
- archaelogical “digs” on old email threads 
- uncertainty about how/if certain changes have been 
made or issues solved 
- hair-raising ZIP archives containing file salad
Git repository = a bunch of files you 
want to manage in a sane way 
repo = repository 
you can set up repo ... then start 
your work 
or you can make a set of existing 
files and make them into a repo
GitHub = a place to host Git repositories on the web 
GitHub ≠ Git
possible, in theory more typical 
GitHub 
me you 
Image from https://www.atlassian.com/git/tutorial/git-basics#!clone 
✗
Many R packages are developed in 
the open on GitHub 
Nice option when someone tells 
you to “read the source”!
You can see exactly how files have changed, when, and by 
whom. If commit message is good, you’ll see why. 
Commit = a formal “checkpoint” or snapshot of the state 
of the repository
GitHub provides a fantastic visual “diff” view of exactly 
what changed. Incredibly useful.
GitHub issues: think “bug tracker”, “to do list”.
GitHub renders Markdown files nicely 
Example: links.md in workshop repo of mine
You can see the raw Markdown too!
GitHub renders comma (.csv) and tab (.tsv) delimited files nicely 
Example: Lord of the Rings data I found for STAT 545A
http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true 
NYC
http://www.wired.com/design/2013/08/how-segregated-is-your-city-this-eye-opening-map-shows-you/?viewall=true 
NYC 
Detroit
http://www.coopercenter.org/demographics/Racial-Dot-Map
Cool result is accompanied by 
explanation of how it was done
https://github.com/unorthodox123/RacialDotMap
http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html
https://github.com/corynissen/foodborne_classifier
project organization / literate programming / 
reproducible research 
Now you what the fuss is about! 
collaboration / open science 
knitr 
R markdown 
R packages 
RStudio 
version control / back up / archive 
github 
Rforge 
sourceforge 
git 
subversion 
mercurial
source is real 
≪
Measured( 
Data 
Analy/c( 
Data 
slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ 
Figures 
Computa/onal( 
Results Tables Ar/cle 
Numerical( 
Summaries Text 
Processing(code Analy/c(code 
Presenta/on(code 
R((or(Python)(scripts
Measured( 
Data 
Analy/c( 
Data 
slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ 
Figures 
Computa/onal( 
Results Tables Ar/cle 
Numerical( 
Summaries Text 
Processing(code Analy/c(code 
Presenta/on(code 
delimited(or(other(structured,(agnos/c(files
How(to(make(end(products(more(integrated(and( 
more(reproducible? 
Measured( 
Data 
Analy/c( 
Data 
slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ 
Figures 
Computa/onal( 
Results Tables Ar/cle 
Numerical( 
Summaries Text 
Processing(code Analy/c(code 
Presenta/on(code
How(to(keep(everything(upPtoPdate? 
If(the(data(changes,(how(do(we(remember(to(reP 
make(the(figures(2B(and(4? 
Measured( 
Data 
Analy/c( 
Data 
slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ 
Figures 
Computa/onal( 
Results Tables Ar/cle 
Numerical( 
Summaries Text 
Processing(code Analy/c(code 
Presenta/on(code
Makefile 
http://zmjones.com/make.html 
Like Git, 
GNU Make is another old 
school tool that is being 
repurposed to meet a need in 
data-intensive workflows. 
Originally intended to 
orchestrate compiling 
complicate software, it’s now 
used to express what depends 
on what and keep everything 
“in sync”.
course stuff
who am I? 
BA in econ and german 
management consultant 
PhD biostatistics 
assoc prof @ UBC 
50% Statistics / 50% Michael Smith Laboratories 
I teach and perform lots of data analysis
the whole team 
Bernhard 
Konrad 
Dean 
Attali 
Luolan 
(Gloria) Li 
Jenny 
Bryan 
Julia 
Gustavsen 
Shaun 
Jackman 
http://stat545-ubc.github.io/people.html
Culture of the class 
• Teaching you to fish (vs. giving you a fish) 
- It’s amazing what a determined individual can learn from 
documentation, small learning examples, and ... <gasp> 
Googling. And also stackoverflow. 
• Rewarding engagement, intellectual generosity and 
curiosity 
- Speaking up, sharing success OR failure, showing some 
interest in something will earn marks. 
• Zero tolerance of plagiarism 
- Generating your own approach, writing some code, and 
describing the process is the whole point. Process is 
generally more important than product.
Where marks will come from 
• Small ~weekly activities; marked coarsely (think check, check 
minus, check plus), often with peer evaluation 
• Flexibility to, e.g. work with a dataset you choose or to spin 
the problem a certain way 
- Think about datasets you’d like to prepare and analyze! 
• Adjust the difficulty level relative to where you are now (and 
where you need/want to be!) 
• Departure from past model where individual final project 
was almost entire course mark. Want to decrease the back-loaded 
heroic efforts by students and instructor alike. Plus I 
think you’ll learn more this way.
http://stat545-ubc.github.io
https://twitter.com/STAT545
web companion: STAT 545 web home > Syllabus > cm001 
Homework!
Twitter, GitHub ... are PUBLIC 
cultivate your professional / scholarly profile with 
intention 
if you join, make sure @STAT545 follows you back
how to get help 
Dean will soon be announcing office hours 
Open an issue on a GitHub repository and 
tag one or more instructors 
Tweet to @STAT545 
Direct twitter message to @STAT545 
Email to an instructor We’ll write this up and put 
on webpage today
respond to our prompt to 
figure out who you are! 
we want to match up various info 
with what UBC provides (e.g. 
Twitter handle, Github username)
what class meetings 
will look like 
9:30 - 9:50 “lecture” 
9:50 - 10:35 hands-on work! 
10:35 - 10:55 “lecture”
rhythm of each week 
submit a 
unit of 
work 
mon class wed class 
work on your own 
work in class 
consult peers and instructors in class 
office hrs 
online interaction via GitHub, Twitter
the end

More Related Content

What's hot

Sentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainSentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainEdureka!
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th juneEdureka!
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk KnowledgeKrishna Sankar
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big DataRevolution Analytics
 
Managing & sharing your data and code for openness
Managing & sharing your data and code for opennessManaging & sharing your data and code for openness
Managing & sharing your data and code for opennessPhoebe Ayers
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in REdureka!
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
Logistic Regression In Data Science
Logistic Regression In Data ScienceLogistic Regression In Data Science
Logistic Regression In Data ScienceEdureka!
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsKrishna Sankar
 
The State of High-Performance Computing in the Open-Source R Ecosystem
The State of High-Performance Computing in the Open-Source R EcosystemThe State of High-Performance Computing in the Open-Source R Ecosystem
The State of High-Performance Computing in the Open-Source R EcosystemIntel® Software
 

What's hot (13)

Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
Sentiment Analysis In Retail Domain
Sentiment Analysis In Retail DomainSentiment Analysis In Retail Domain
Sentiment Analysis In Retail Domain
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Python webinar 4th june
Python webinar 4th junePython webinar 4th june
Python webinar 4th june
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
Managing & sharing your data and code for openness
Managing & sharing your data and code for opennessManaging & sharing your data and code for openness
Managing & sharing your data and code for openness
 
Business Analytics Decision Tree in R
Business Analytics Decision Tree in RBusiness Analytics Decision Tree in R
Business Analytics Decision Tree in R
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Logistic Regression In Data Science
Logistic Regression In Data ScienceLogistic Regression In Data Science
Logistic Regression In Data Science
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
Data Science using Python
Data Science using PythonData Science using Python
Data Science using Python
 
The State of High-Performance Computing in the Open-Source R Ecosystem
The State of High-Performance Computing in the Open-Source R EcosystemThe State of High-Performance Computing in the Open-Source R Ecosystem
The State of High-Performance Computing in the Open-Source R Ecosystem
 

Similar to UBC STAT545 2014 Cm001 intro to-course

Decoding Data Science
Decoding Data ScienceDecoding Data Science
Decoding Data ScienceMatt Fornito
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data ScienceRoger Huang
 
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-share
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-shareBigData Visualization and Usecase@TDGA-Stelligence-11july2019-share
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-sharestelligence
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadKelly Technologies
 
BarCamp Sd Microformats
BarCamp Sd MicroformatsBarCamp Sd Microformats
BarCamp Sd MicroformatsJoshua Brewer
 
Data Science Demystified
Data Science DemystifiedData Science Demystified
Data Science DemystifiedEmily Robinson
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionInside Analysis
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist? HackerEarth
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupRussell Jurney
 
Is Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceIs Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceEdureka!
 
Machine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMachine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMarc Borowczak
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learningTom Dierickx
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPeter Wang
 

Similar to UBC STAT545 2014 Cm001 intro to-course (20)

Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Decoding Data Science
Decoding Data ScienceDecoding Data Science
Decoding Data Science
 
Different Career Paths in Data Science
Different Career Paths in Data ScienceDifferent Career Paths in Data Science
Different Career Paths in Data Science
 
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-share
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-shareBigData Visualization and Usecase@TDGA-Stelligence-11july2019-share
BigData Visualization and Usecase@TDGA-Stelligence-11july2019-share
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
BarCamp Sd Microformats
BarCamp Sd MicroformatsBarCamp Sd Microformats
BarCamp Sd Microformats
 
Data Science Demystified
Data Science DemystifiedData Science Demystified
Data Science Demystified
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop Adoption
 
How to become a Data Scientist?
How to become a Data Scientist? How to become a Data Scientist?
How to become a Data Scientist?
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Is Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceIs Hadoop a necessity for Data Science
Is Hadoop a necessity for Data Science
 
Machine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification ChallengesMachine Learning, Key to Your Classification Challenges
Machine Learning, Key to Your Classification Challenges
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learning
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Grc t18
Grc t18Grc t18
Grc t18
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data Analysis
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 

Recently uploaded

VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 

UBC STAT545 2014 Cm001 intro to-course

  • 1. STAT 545A Class meeting 001 Course intro + prompts to install lots of software and sign up for lots of accounts web companion: STAT 545 web home > Syllabus > cm001 Wednesday, September 3, 2014
  • 2. Dr. Jennifer (Jenny) Bryan Department of Statistics and Michael Smith Laboratories University of British Columbia jenny@stat.ubc.ca https://github.com/jennybc http://www.stat.ubc.ca/~jenny/ @JennyBryan ← personal, professional Twitter https://github.com/STAT545-UBC http://stat545-ubc.github.io @STAT545 ← Twitter as lead instructor of this course
  • 3. statistical theory real world data STAT 545A
  • 4. in a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research The Big Data Brain Drain: Why Science is in Trouble http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/
  • 5. what is data science?
  • 6.
  • 7. The data science Venn diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  • 8. a horizontal data science workflow Visualise Munge Model Communicate Tidy Question Collect Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3
  • 9. Visualise Munge Model Communicate Tidy Question Collect Wednesday, October 30, 13 We can’t focus just on this! Slides from Hadley Wickham's talk in the Simply Statistics Unconference http://t.co/D931Og8mq3
  • 10. Data scientists spend 50 - 80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for ... what data scientists call “data wrangling,” useful nuggets. “data munging” and “data janitor work” ... http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?partner=rss&emc=rss&smid=tw-nytimesscience&_r=0
  • 11. http://mimno.infosci.cornell.edu/b/articles/carpentry/ As data science becomes all the more relevant and indeed, profitable, attention has been placed on the value of cleaning a data set. David Mimno unpicks the term and the process and suggests that data carpentry may be a more suitable description. There is no such thing as pure or clean data buried in a thin layer of non-clean data. In reality, the process is more like deciding how to cut and join a piece of material. Data carpentry is not a single process but a thousand little skills and techniques. http://blogs.lse.ac.uk/impactofsocialsciences/2014/09/01/data-carpentry-skilled-craft-data-science/
  • 12. Complexity must be justified. http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-scientist-editing Before you analyze your data with computers, be sure to plot it http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/ ... the first reasonable thing you can do to a set of data often is 80% of the way to the optimal solution. Everything after that is working on getting the last 20%... (maybe could even be the 90/10 rule) http://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/ Problem first not solution backward http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/
  • 13. “All models are wrong, some models are useful.” Box, G.E.P., Robustness in the strategy of scientific model building, in Robustness in Statistics, R.L. Launer and G.N. Wilkinson, Editors. 1979, Academic Press: New York. Entia non sunt multiplicanda praeter necessitatem The principle, known as Occam’s Razor, that says: when there are two competing theories or explanations -- both compatible with observed data, known facts -- the simpler one is better. Implication for statistical analysis: if two models are equally wrong-but-compatible-with-data, the simpler one is more useful!
  • 14. small medium big size of dataset * figure is fictional but I stand by this claim There are LOTS of small to medium datasets, even though it’s more trendy to talk about the big ones.
  • 15. “Too big for Excel is not ‘Big Data’.” “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” “Big data has become a synonym for ‘data analysis,’ which is confusing and counter-productive.” small medium big size of dataset * figure is fictional but I stand by this claim http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/ https://www.facebook.com/dan.ariely/posts/904383595868
  • 16. big data ≠ data science big data ⊂ data science
  • 17. Two inter-related goals • Foster your development of a personal philosophy for data analysis, esp. exploratory and descriptive analysis. • Help you assemble a modern toolchain and workflows for data analysis. My hope: You’ll leave this course with (at least the beginnings of) a confident, deliberate attitude about how to approach data analysis and the practical skills to put your attitude into action.
  • 18. you’re going to make your own data science sampler
  • 19. “A picture is worth a thousand words”
  • 20. 1986 Challenger space shuttle disaster Favorite example of Edward Tufte http://msnbcmedia1.msn.com/j/msnbc/Components/Photos/050709/050609_columbia_hmed_6p.hmedium.jpg
  • 21.
  • 22. “A picture is worth a thousand words”
  • 23. “A picture is worth a thousand words” Siddhartha R. Dalal; Edward B. Fowlkes; Bruce Hoadley. Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure. JASA, Vol. 84, No. 408 (Dec., 1989), pp. 945-957. Access via JSTOR.
  • 24. Edward Tufte http://www.edwardtufte.com BOOK: Visual Explanations: Images and Quantities, Evidence and Narrative Ch. 5 deals with the Challenger disaster That chapter is available for $7 as a downloadable booklet: http://www.edwardtufte.com/tufte/books_textb
  • 25. “A picture is worth a thousand words” Always, always, always plot the data. Replace (or complement) ‘typical’ tables of data or statistical results with figures that are more compelling and accessible. Whenever possible, generate figures that overlay / juxtapose observed data and analytical results, e.g. the ‘fit’.
  • 26. “A picture is worth a thousand words” Why? •find bizarre data and results when it is least embarrassing and painful •facilitate comparisons and reveal trends Recommended reference: Gelman A, Pasarica C, Dodhia R. “Let's Practice What We Preach: Turning Tables into Graphs”. The American Statistician, Volume 56, Number 2, 1 May 2002 , pp. 121-130(10). via JSTOR
  • 27. weak links in the chain: process, packaging and presentation
  • 28. we watched a video about the importance of reproducible research, data and analytical documentation, data sharing https://www.youtube.com/watch?v=N2zK3sAtr-4&feature=youtu.be
  • 29. project organization literate programming reproducible research collaboration / open science version control / back up / archive
  • 30. project organization / literate programming / reproducible research What the cool kids seem to be doing .... collaboration / open science Sweave knitr R markdown R packages RStudio version control / back up / archive github Rforge sourceforge git subversion mercurial
  • 31. RStudio is an integrated development environment (IDE) for R
  • 32. R ≠ RStudio RStudio mediates your interaction with R; it would replace Emacs + ESS or Tinn-R, but not R itself Rstudio is a product of -- actually, more a driver of -- the emergence of R Markdown, knitr, R + Git(Hub)
  • 33. Web PDF → HTML Latex → Markdown Static → Interactive from Hadley Wickham’s talk in the Simply Statistics Unconference on the Future of Statistics 13 Open Open source ↑↑ Open science ↑↑ Open research ↑↑ Wednesday, October 30, 13
  • 34. Markdown HTML foo.md foo.html easy to write (and read!) easy to publish easy to read in browser
  • 35. Markdown HTML Title (header 1, actually) ===================================== This is a Markdown document. ## Medium header (header 2, actually) It's easy to do *italics* or __make things bold__. > All models are wrong, but some are useful. An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem. Absolute certainty is a privilege of uneducated minds-and fanatics. It is, for scientific folk, an unattainable ideal. What you do every day matters more than what you do once in a while. We cannot expect anyone to know anything we didn't teach them ourselves. Enthusiasm is a form of social courage. Code block below. Just affects formatting here but we'll get to R Markdown for the real fun soon! ``` x <- 3 * 4 ``` I can haz equations. Inline equations, such as ... the average is computed as $frac{1}{n} sum_{i=1} ^{n} x_{i}$. Or display equations like this: $$ begin{equation*} |x|= begin{cases} x & text{if $x≥0$,} -x &text{if $xle 0$.} end{cases} end{equation*} $$
  • 37. R Markdown Markdown R Markdown rocks ===================================== This is an R Markdown document. ```{r} x <- rnorm(1000) head(x) ``` See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, then average of the `r length(x)` random normal variates we just generated is `r round(mean(x), 3)`. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste. ```{r} plot(density(x)) ``` Note that all the previously demonstrated math typesetting still works. You don't have to choose between having math cred and being web-friendly! Inline equations, such as ... the average is computed as $frac{1}{n} sum_{i=1}^{n} x_{i}$. Or display equations like this: $$ begin{equation*} |x|= begin{cases} x & text{if $x≥0$,} -x &text{if $xle 0$.} end{cases} end{equation*} $$ R Markdown rocks ===================================== This is an R Markdown document. ```r x <- rnorm(1000) head(x) ``` ``` ## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157 ``` See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, then average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste. ```r plot(density(x)) ``` ![plot of chunk unnamed-chunk-2](figure/unnamed-chunk- 2.png) ...
  • 38. R Markdown rocks ===================================== This is an R Markdown document. ```r x <- rnorm(1000) head(x) ``` ``` ## [1] -1.3007 0.7715 0.5585 -1.2854 1.1973 2.4157 ``` See how the R code gets executed and a representation thereof appears in the document? `knitr` gives you control over how to represent all conceivable types of output. In case you care, then average of the 1000 random normal variates we just generated is -0.081. Those numbers are NOT hard-wired but are computed on-the-fly. As is this figure. No more copy-paste ... copy-paste ... oops forgot to copy-paste. ```r plot(density(x)) ``` ![plot of chunk unnamed-chunk-2](figure/unnamed-chunk- 2.png) ... Markdown HTML
  • 39. Markdown HTML foo.md foo.html easy to write (and read!) easy to publish easy to read in browser R Markdown foo.rmd
  • 40. how do I put my source on the web? for the world or select collaborators? maybe I’ll share my data and my prose too .... how should I marshall all of that stuff? how can I collaborate with others on an analysis or package development? Advice to preserve sanity: Stop doing this via email, attachments, and tracking changes in Word. Get that stuff into plain text, put it under version control and get it out on the web.
  • 42. Version control systems (VCS) were created to help groups of people develop software Git, in particular, is being “repurposed” for activities other than pure software development ... like the messy hybrid of writing, coding and data wrangling Git repository = a bunch of files you want to manage in a sane way repo = repository
  • 43. collaboration = the “killer app” of version control Learning Git has been -- and continues to be -- painful. But not nearly as crazy-making as the alternatives: - documents as email attachments - uncertainty about which version is “master” - am I working with the most recent data? - archaelogical “digs” on old email threads - uncertainty about how/if certain changes have been made or issues solved - hair-raising ZIP archives containing file salad
  • 44. Git repository = a bunch of files you want to manage in a sane way repo = repository you can set up repo ... then start your work or you can make a set of existing files and make them into a repo
  • 45. GitHub = a place to host Git repositories on the web GitHub ≠ Git
  • 46. possible, in theory more typical GitHub me you Image from https://www.atlassian.com/git/tutorial/git-basics#!clone ✗
  • 47. Many R packages are developed in the open on GitHub Nice option when someone tells you to “read the source”!
  • 48. You can see exactly how files have changed, when, and by whom. If commit message is good, you’ll see why. Commit = a formal “checkpoint” or snapshot of the state of the repository
  • 49. GitHub provides a fantastic visual “diff” view of exactly what changed. Incredibly useful.
  • 50. GitHub issues: think “bug tracker”, “to do list”.
  • 51. GitHub renders Markdown files nicely Example: links.md in workshop repo of mine
  • 52. You can see the raw Markdown too!
  • 53. GitHub renders comma (.csv) and tab (.tsv) delimited files nicely Example: Lord of the Rings data I found for STAT 545A
  • 57. Cool result is accompanied by explanation of how it was done
  • 60.
  • 62. project organization / literate programming / reproducible research Now you what the fuss is about! collaboration / open science knitr R markdown R packages RStudio version control / back up / archive github Rforge sourceforge git subversion mercurial
  • 64. Measured( Data Analy/c( Data slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ Figures Computa/onal( Results Tables Ar/cle Numerical( Summaries Text Processing(code Analy/c(code Presenta/on(code R((or(Python)(scripts
  • 65. Measured( Data Analy/c( Data slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ Figures Computa/onal( Results Tables Ar/cle Numerical( Summaries Text Processing(code Analy/c(code Presenta/on(code delimited(or(other(structured,(agnos/c(files
  • 66. How(to(make(end(products(more(integrated(and( more(reproducible? Measured( Data Analy/c( Data slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ Figures Computa/onal( Results Tables Ar/cle Numerical( Summaries Text Processing(code Analy/c(code Presenta/on(code
  • 67. How(to(keep(everything(upPtoPdate? If(the(data(changes,(how(do(we(remember(to(reP make(the(figures(2B(and(4? Measured( Data Analy/c( Data slide(modified(from(Roger(Peng(hBp://www.biostat.jhsph.edu/~rpeng/ Figures Computa/onal( Results Tables Ar/cle Numerical( Summaries Text Processing(code Analy/c(code Presenta/on(code
  • 68. Makefile http://zmjones.com/make.html Like Git, GNU Make is another old school tool that is being repurposed to meet a need in data-intensive workflows. Originally intended to orchestrate compiling complicate software, it’s now used to express what depends on what and keep everything “in sync”.
  • 70. who am I? BA in econ and german management consultant PhD biostatistics assoc prof @ UBC 50% Statistics / 50% Michael Smith Laboratories I teach and perform lots of data analysis
  • 71. the whole team Bernhard Konrad Dean Attali Luolan (Gloria) Li Jenny Bryan Julia Gustavsen Shaun Jackman http://stat545-ubc.github.io/people.html
  • 72. Culture of the class • Teaching you to fish (vs. giving you a fish) - It’s amazing what a determined individual can learn from documentation, small learning examples, and ... <gasp> Googling. And also stackoverflow. • Rewarding engagement, intellectual generosity and curiosity - Speaking up, sharing success OR failure, showing some interest in something will earn marks. • Zero tolerance of plagiarism - Generating your own approach, writing some code, and describing the process is the whole point. Process is generally more important than product.
  • 73. Where marks will come from • Small ~weekly activities; marked coarsely (think check, check minus, check plus), often with peer evaluation • Flexibility to, e.g. work with a dataset you choose or to spin the problem a certain way - Think about datasets you’d like to prepare and analyze! • Adjust the difficulty level relative to where you are now (and where you need/want to be!) • Departure from past model where individual final project was almost entire course mark. Want to decrease the back-loaded heroic efforts by students and instructor alike. Plus I think you’ll learn more this way.
  • 74.
  • 75.
  • 78.
  • 79.
  • 80. web companion: STAT 545 web home > Syllabus > cm001 Homework!
  • 81. Twitter, GitHub ... are PUBLIC cultivate your professional / scholarly profile with intention if you join, make sure @STAT545 follows you back
  • 82. how to get help Dean will soon be announcing office hours Open an issue on a GitHub repository and tag one or more instructors Tweet to @STAT545 Direct twitter message to @STAT545 Email to an instructor We’ll write this up and put on webpage today
  • 83. respond to our prompt to figure out who you are! we want to match up various info with what UBC provides (e.g. Twitter handle, Github username)
  • 84. what class meetings will look like 9:30 - 9:50 “lecture” 9:50 - 10:35 hands-on work! 10:35 - 10:55 “lecture”
  • 85. rhythm of each week submit a unit of work mon class wed class work on your own work in class consult peers and instructors in class office hrs online interaction via GitHub, Twitter