CAMBRIDGE PROSOCIALITY
AND WELL-BEING LABORATORY
CRASH COURSE
DATA SCIENTIST
The Sexiest Job
of the 21st Century
Statistics
Domain
expertise
Hacking
BIG
DATA
ish
SOCIAL NETWORK DATA
DIGITAL TRACE DATA
GLOBAL SURVEY DATA
GENETIC DATA
SPSS ain’t gonna cut it.
Windows Mac Linux
Built by scientists for scientists.
“We have named our language R – in
part to acknowledge the influence of S
and in part to celebrate our own
efforts.”
Ross ...
R is the most powerful
statistics language
in the world.
• Open source
- Free as in speech and beer
• Cross-platform
- Runs on Windows, Mac, and Linux
• Versatile and extensible
-...
http://r-project.org
RStudio.org
Why use R?
R is used by the best.
"...a way to organize the brainpower of
the world’s most talented data
scientists..."
Hal Varian
CHIEF ECONOMIST
software on
50% of winners use R
• Everything in one system
- base: linear and nonlinear modeling,
classical statistical tests, time-series
analysis, class...
4,403 available packages
• Automate away “click-click-click” tasks
- More efficient work
• Share analyses and data with ease
- Better collaboration
•...
How do I use R?
You use R by typing commands, not with a mouse.
R version 2.14.1 (2011-12-22)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x...
How do you know what to type to R?
For beginners:
For the statistically minded:
For programmers:
The very basics
Put “this” in “here”
“this” → HERE
Put “this” in here
here <- "this"
here: variable
"this": a string
<-: assignment operator
here = "this"
= also works as an assignment operator
> here
[1] "this"
>
[1] is the row number: it gives the position of the first value shown on that line of output.
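A minimal sketch of the assignment slides above, meant to be typed at the R prompt. The variable `here` and its value come from the slides; `also_here` is a hypothetical name added only for illustration.

```r
# Store the string "this" in a variable called `here`.
here <- "this"

# `=` also assigns at the top level; `also_here` is a made-up name.
also_here = "this"

# Typing a variable's name (or calling print) shows its value.
print(here)

# Both variables now hold the same one-element character vector.
identical(here, also_here)
```

Note that R code needs straight quotes; curly quotes pasted from a slide or word processor will produce a syntax error.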
functions and data
INPUT → BLACK BOX → OUTPUT
INPUT → FUNCTION → OUTPUT
FUNCTIONS ARE LIKE FACTORIES.
INPUT → ( ) → OUTPUT
In R, parentheses
mean: “DO SOMETHING”
(according to my instructions)
x.bar <- mean(x)
> mean(x
+
Waits for more: the + prompt means R expects the rest of the command (here, the closing parenthesis).
INPUT → ( ) → OUTPUT
OUTPUT is captured into VARIABLES.
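As a small illustration of the function slides, the sketch below feeds an input to `mean()` and captures the output in a variable. The vector `x` is an assumed example; the slides never define it.

```r
# `x` is an assumed example vector; the slides do not say what it holds.
x <- c(2, 4, 6, 8)

# The parentheses call mean() on the input x,
# and the output is captured into the variable x.bar.
x.bar <- mean(x)

# Typing the variable name prints the stored result.
x.bar
```

If the closing parenthesis is left off, R shows the `+` prompt and waits; typing `)` completes the command, and pressing Esc (or Ctrl-C in a terminal) cancels it.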
In R, things are often stored in vectors,
lists, matrices, or data frames.
Vector
• The workhorse of R
- Even individual numbers are special
cases of vectors (i.e., a vector of one)
• All elemen...
us <- c("Ilmo","Alex","Chris")
us[1]
us[2:3]
length(us)
class(us)
Very classy
characters,
indeed!
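The vector commands above can be tried as written. The sketch below repeats them with comments and adds two assumed extras, `mixed` and `single`, which are hypothetical names used to show that a vector holds a single mode and that a lone number is a vector of length one.

```r
us <- c("Ilmo", "Alex", "Chris")

us[1]        # the first element
us[2:3]      # the second and third elements
length(us)   # how many elements the vector holds
class(us)    # the type of the elements: "character"

# Mixing numbers and strings: everything is converted to character.
mixed <- c(1, "two", 3)
class(mixed)

# A single number is a vector of length one.
single <- 42
length(single)
```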
List
• Mix and match!
- Lists can store things of different modes
- Numeric, character, data frames...
• Many functions re...
me <- list(name = "Ilmo", legs = 2)
me$name
me$legs
me["name"]
me[["name"]]
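The four ways of indexing the list above do not all return the same kind of thing. The following sketch, using only the `me` list from the slide, checks what each form gives back.

```r
me <- list(name = "Ilmo", legs = 2)

# $ and [[ ]] pull out the element itself.
me$name
me[["name"]]
class(me[["name"]])   # a character vector

# Single brackets return a smaller list that still wraps the element.
me["name"]
class(me["name"])     # a list

# A list happily mixes modes: a string and a number side by side.
me$legs + 2
```

A rough rule of thumb: use `[[ ]]` or `$` when you want the value, and `[ ]` when you want a sub-list.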
Matrices
are two-dimensional vectors
[,1] [,2]
[1,] "Ilmo" "Alex"
[2,] "Chris" "Dacher"
[,1] [,2]
[1,] 1.09 4.20
[2,] 2.86...
ucb <- rbind(
c("Ilmo","Alex"),
c("Chris","Dacher")
)
ucb[1,1]
ucb[,1]
ucb[2,2]
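The matrix indexing above follows a `[row, column]` pattern, and leaving one side empty means "all of them". Here is the same code as a commented sketch, with `dim()` added to show the shape.

```r
# rbind() stacks the two vectors as rows of a 2 x 2 character matrix.
ucb <- rbind(
  c("Ilmo", "Alex"),
  c("Chris", "Dacher")
)

dim(ucb)    # number of rows, then number of columns

ucb[1, 1]   # row 1, column 1
ucb[, 1]    # every row of column 1
ucb[2, 2]   # row 2, column 2
```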
Data Frames
• The best of both lists and matrices
- Columns and rows
‣ Each column contains data of a single mode
‣ Each r...
DATA FRAMES ARE LIKE WAREHOUSES.
  age gender height weight
1
2
3
d[,]                      every row and every column
d[1,]                     the first row
d[,1]                     the first column
d[,"age"]                 the column named age
d$age                     the column named age
d[,1:3]                   the first three columns
d[2,2]                    the second row of the second column
d[2,c("age","weight")]    age and weight in the second row
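To try the warehouse indexing on real values, the sketch below builds a small data frame with the same four columns. The numbers and labels in it are invented for illustration and are not from the course.

```r
# A made-up data frame with the column names used on the slides.
d <- data.frame(
  age    = c(25, 31, 47),
  gender = c("f", "m", "f"),
  height = c(165, 180, 172),
  weight = c(60, 82, 70)
)

d[1, ]                     # the first row, all columns
d[, 1]                     # the first column, all rows
d[, "age"]                 # the same column, picked by name
d$age                      # the same column again
d[, 1:3]                   # the first three columns
d[2, 2]                    # row 2 of column 2 (the gender column)
d[2, c("age", "weight")]   # age and weight for row 2
```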
d <- read.csv("MyNobelPrizeData.csv")
What will this do?
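One way to answer the question is to try `read.csv()` on a file whose contents are known. This sketch writes a tiny CSV to a temporary file and reads it back; the column names and values are assumptions made up for the example.

```r
# Write a small, made-up CSV file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("age,height", "25,165", "31,180"), path)

# read.csv() reads the file and returns a data frame,
# using the first line as the column names.
d <- read.csv(path)

str(d)    # structure: column names, types, and first values
nrow(d)   # number of data rows (the header line is not counted)

# Clean up the temporary file.
unlink(path)
```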
d <- read.spss("thatExperiment.sav")
Error: could not find function "read.spss"
library("foreign")
Minitab, S, SAS, Stata, Systat, and dBase
...but no Excel
install.packages("xlsx")
read.xlsx("recipes.xlsx")
Error in read.xls("recipes.xlsx"):
Please provide a sheet name OR a
sheet index.
WTF is a “sheet ...
Two-step guide to
solving R problems
Step 1: Search
help(read.xlsx)
or
?read.xlsx
R has a lovely built-in documentation system.
Most often, all that you need is right there.
...
help.search("bar plot")
or
??"bar plot"
When you don’t exactly know what you are
looking for, use free-text search.
Step 1...
Google it.
You are probably not the first person to
encounter the error. Paste the error message
to Google and see what po...
rseek.org
stackexchange.com
reddit.com/r/rstats
Read the R expert forums.
See if they already have solved the problem.
Ste...
Step 2: Ask
Make a reproducible example.
Pin down the exact problem in as few lines of
code as possible. Simplify until only the
probl...
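A reproducible example might look something like the sketch below. The data frame `toy` and the "problem" (a mean that comes back as `NA`) are invented here purely to show the shape of a good question.

```r
# 1. A tiny, made-up data set that still shows the problem.
toy <- data.frame(x = c(1, 2, NA), y = c("a", "b", "c"))

# 2. dput() prints R code that recreates the object exactly,
#    so others can paste it straight into their own session.
dput(toy)

# 3. The one line that misbehaves: the result is NA, not a number.
mean(toy$x)

# 4. Version details help others match your setup.
R.version.string
```

In this invented case the answer turns out to be `mean(toy$x, na.rm = TRUE)`, which is exactly the kind of fix that a short example makes easy for someone else to spot.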
Ask your friends.
Solving problems together is a great way to learn.
Step 2: Ask
Ask the experts online.
There are the R mailing lists, statsexchange, the rstats reddit,
Quora, Twitter etc. You probably found these...
Step 2: Ask
They do this for a living.
Ask the stats dept experts.
Ask Alex or me.
Step 2: Ask
...and show us what you have tried already.
Let’s dive in!
Who has any
programming
experience?
Get your group on.
OPTIONAL
Source
Console
Workspace
Frank Anscombe
STATISTICIAN
ans
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4...
a <- anscombe
a
summary(a$x1)
summary(a[,1])
summary(a[,"x1"])
They all mean the same.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0     6.5     9.0     9.0    11.5    14.0
What about the rest of a?
summary(a)
plot(a)
plot(a$x1, a$y1)
cor(a$x1, a$y1)
cor.test(a$x1, a$y1)
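The same two commands can be repeated for the other pairs in Anscombe's data. This sketch uses the built-in `anscombe` data set; the names `cors` and `result` are hypothetical helpers introduced for the example.

```r
a <- anscombe

# Correlation for each of the four x/y pairs.
cors <- c(
  cor(a$x1, a$y1),
  cor(a$x2, a$y2),
  cor(a$x3, a$y3),
  cor(a$x4, a$y4)
)
round(cors, 2)   # all four round to the same value

# cor.test() returns a list; pieces can be pulled out with $.
result <- cor.test(a$x1, a$y1)
result$estimate  # the correlation coefficient
result$p.value   # the p-value of the test
```

Near-identical correlations do not mean the four data sets look alike, which is why plotting each pair, as with `plot(a$x1, a$y1)`, is worth the extra line.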
a$x4 <- NULL
a$y4 <- NULL
a[,c("x4","y4")] <- NULL
a[,c(4,8)] <- NULL
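The lines above are alternatives for dropping the `x4` and `y4` columns, so only one of them is needed; once the columns are gone, the later lines no longer have anything to remove. The sketch below works on fresh copies of `anscombe`. The second copy `b` is a hypothetical name, and it uses the list-style form without a comma, which is a variant of the slide's syntax.

```r
# Option 1: remove the columns one at a time with $.
a <- anscombe
a$x4 <- NULL
a$y4 <- NULL
names(a)

# Option 2: remove both by name in one step (list-style indexing).
b <- anscombe
b[c("x4", "y4")] <- NULL
names(b)

# The original data set is untouched; only the copies changed.
names(anscombe)
```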
NULL
TRUE
FALSE
NA
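These four special values behave differently, and a few lines at the prompt show how. The vector `v` is an assumed example introduced for this sketch.

```r
# NA marks a missing value inside a vector.
v <- c(1, NA, 3)
is.na(v)                # which elements are missing
mean(v)                 # a missing value makes the mean missing too
mean(v, na.rm = TRUE)   # drop missing values before averaging

# NULL means "nothing here at all": it has length zero.
is.null(NULL)
length(NULL)
length(NA)              # NA still occupies one slot

# TRUE and FALSE are the logical values; NA can be logical as well.
TRUE & NA               # unknown, so NA
FALSE & NA              # FALSE whatever the missing value is
```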
Slides from R crash course by Ilmo van der Löwe
Slides for an introductory R class at the University of Cambridge

Published in: Education, Technology

  1. 1. CAMBRIDGE PROSOCIALITY AND WELL-BEING LABORATORY
  2. 2. CRASH COURSE
  3. 3. DATA SCIENTIST The Sexiest Job of the 21st Century
  4. 4. Statistics Domain expertise Hacking
  5. 5. BIG DATA ish
  6. 6. SOCIAL NETWORK DATA
  7. 7. DIGITAL TRACE DATA
  8. 8. GLOBAL SURVEY DATA
  9. 9. GENETIC DATA
  10. 10. SPSS ain’t gonna cut it.
  11. 11. Windows Mac Linux
  12. 12. Built by scientists for scientists.
  13. 13. “We have named our language R – in part to acknowledge the influence of S and in part to celebrate our own efforts.” Ross Ihaka PROFESSOR OF STATISTICS University of Auckland Robert Gentleman SENIOR DIRECTOR OF BIOINFORMATICS Genentech
  14. 14. R is the most powerful statistics language in the world.
  15. 15. • Open source - Free as in speech and beer • Cross-platform - Runs on Windows, Mac, and Linux • Versatile and extensible - Over 4,000 user-contributed packages • General-purpose programming language - You can make it do things automagically
  16. 16. http://r-project.org
  17. 17. RStudio.org
  18. 18. Why use R?
  19. 19. R is used by the best.
  20. 20. "...a way to organize the brainpower of the world’s most talented data scientists..." Hal Varian CHIEF ECONOMIST
  21. 21. software on
  22. 22. 50% of winners use R
  23. 23. • Everything in one system - base: linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering etc. - packages from multilevel modeling to medical image analysis • Custom functionality - Programming ➞ Automation
  24. 24. 4,403 available packages
  25. 25. • Automate away “click-click-click” tasks - More efficient work • Share analyses and data with ease - Better collaboration • Make results reproducible - Better science
  26. 26. How do I use R?
  27. 27. You use R by typing commands, not with a mouse.
  28. 28. You use R by typing commands, not with a mouse.
  29. 29. R version 2.14.1 (2011-12-22) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > Prompt
  30. 30. How do you know what to type to R?
  31. 31. For beginners:
  32. 32. For the statistically minded:
  33. 33. For programmers:
  34. 34. The very basics
  35. 35. Put “this” in “here” “this” → HERE
  36. 36. Put “this” in “here” “this” → HERE
  37. 37. Put “this” in here here <- "this"
  38. 38. Put “this” in here here <- "this" variable
  39. 39. Put “this” in here here <- "this" a string
  40. 40. Put “this” in here here <- "this" assignment operator
  41. 41. Put “this” in here here = "this" assignment operator
  42. 42. >
  43. 43. > here
  44. 44. > here [1] "this" >
  45. 45. > here Row # [1] "this" >
  46. 46. functions and data
  47. 47. BLACK BOX
  48. 48. INPUT → BLACK BOX
  49. 49. INPUT → BLACK BOX → OUTPUT
  50. 50. FUNCTION
  51. 51. INPUT → FUNCTION
  52. 52. INPUT → FUNCTION → OUTPUT
  53. 53. FUNCTIONS ARE LIKE FACTORIES.
  54. 54. INPUT → ( ) → OUTPUT
  55. 55. In R, parentheses mean: “DO SOMETHING” (according to my instructions)
  56. 56. x.bar <- mean(x)
  57. 57. >
  58. 58. > mean(x
  59. 59. > mean(x +
  60. 60. > mean(x + Waits for more
  61. 61. INPUT → ( ) → OUTPUT is captured into VARIABLES.
  62. 62. In R, things are often stored in vectors, lists, matrices, or data frames.
  63. 63. Vector • The workhorse of R - Even individual numbers are special cases of vectors (i.e., a vector of one) • All elements have to be of the same mode - Vectors of numbers are ok ‣ c(0,1,2,3,4,5,6,7,8,9) - So are vectors of character strings ‣ c("Ilmo","Alex","Chris")
  64. 64. us <- c("Ilmo","Alex","Chris") us[1] us[2:3] length(us) class(us)
  65. 65. us <- c("Ilmo","Alex","Chris") us[1] us[2:3] length(us) class(us) Very classy characters, indeed!
  66. 66. List • Mix and match! - Lists can store things of different modes - Numeric, character, data frames... • Many functions return a list for later use
  67. 67. me <- list(name = "Ilmo", legs = 2) me$name me$legs me["name"] me[["name"]]
  68. 68. Matrices are two-dimensional vectors [,1] [,2] [1,] "Ilmo" "Alex" [2,] "Chris" "Dacher" [,1] [,2] [1,] 1.09 4.20 [2,] 2.86 2.92 A numeric matrix A character string matrix
  69. 69. ucb <- rbind( c("Ilmo","Alex"), c("Chris","Dacher") ) ucb[1,1] ucb[,1] ucb[2,2]
  70. 70. Data Frames • The best of both lists and matrices - Columns and rows ‣ Each column contains data of a single mode ‣ Each row can contain data of various modes • Usually created by reading data from a file or database
  71. 71. DATA FRAMES ARE LIKE WAREHOUSES.
  72. 72. age gender height weight 1 2 3 d[,]
  73. 73. age gender height weight 1 2 3 d[1,]
  74. 74. age gender height weight 1 2 3 d[,1]
  75. 75. age gender height weight 1 2 3 d[,"age"]
  76. 76. age gender height weight 1 2 3 d$age
  77. 77. age gender height weight 1 2 3 d[,1:3]
  78. 78. age gender height weight 1 2 3 d[2,2]
  79. 79. age gender height weight 1 2 3 d[2,c("age","weight")]
  80. 80. d <- read.csv("MyNobelPrizeData.csv") What will this do?
  81. 81. d <- read.spss("thatExperiment.sav") Error: could not find function "read.spss"
  82. 82. library("foreign")
  83. 83. library("foreign") Minitab, S, SAS, Stata, Systat, and dBase
  84. 84. library("foreign") Minitab, S, SAS, Stata, Systat, and dBase ...but no Excel
  85. 85. install.packages("xlsx")
  86. 86. read.xlsx("recipes.xlsx")
  87. 87. read.xlsx("recipes.xlsx") Error in read.xls("recipes.xlsx"):
  88. 88. read.xlsx("recipes.xlsx") Error in read.xls("recipes.xlsx"): Please provide a sheet name OR a sheet index.
  89. 89. read.xlsx("recipes.xlsx") Error in read.xls("recipes.xlsx"): Please provide a sheet name OR a sheet index. WTF is a “sheet index”?
  90. 90. Two-step guide to solving R problems
  91. 91. Step 1: Search
  92. 92. help(read.xlsx) or ?read.xlsx R has a lovely built-in documentation system. Most often, all that you need is right there. Step 1: Search
  93. 93. help.search("bar plot") or ??"bar plot" When you don’t exactly know what you are looking for, use free-text search. Step 1: Search
  94. 94. Google it. You are probably not the first person to encounter the error. Paste the error message to Google and see what pops up. Step 1: Search
  95. 95. rseek.org stackexchange.com reddit.com/r/rstats Read the R expert forums. See if they already have solved the problem. Step 1: Search
  96. 96. Step 2: Ask
  97. 97. Make a reproducible example. Pin down the exact problem in as few lines of code as possible. Simplify until only the problem remains. Step 2: Ask
  98. 98. Ask your friends. Solving problems together is a great way to learn. Step 2: Ask
  99. 99. Ask the experts online. There are the R mailing lists, statsexchange, the rstats reddit, Quora, Twitter, etc. You probably found these already with your Google searches. Step 2: Ask
  100. 100. Step 2: Ask They do this for a living. Ask the stats dept experts.
  101. 101. Ask Alex or me. Step 2: Ask ...and show us what you have tried already.
  102. 102. Let’s dive in!
  103. 103. Who has any programming experience?
  104. 104. Get your group on.
  105. 105. OPTIONAL
  106. 106. Source Console Workspace
  107. 107. Frank Anscombe STATISTICIAN
  108. 108. ans
  109. 109. ans
  110. 110. ans
  111. 111. ans
  112. 112. x1 x2 x3 x4 y1 y2 y3 y4 1 10 10 10 8 8.04 9.14 7.46 6.58 2 8 8 8 8 6.95 8.14 6.77 5.76 3 13 13 13 8 7.58 8.74 12.74 7.71 4 9 9 9 8 8.81 8.77 7.11 8.84 5 11 11 11 8 8.33 9.26 7.81 8.47 6 14 14 14 8 9.96 8.10 8.84 7.04 7 6 6 6 8 7.24 6.13 6.08 5.25 8 4 4 4 19 4.26 3.10 5.39 12.50 9 12 12 12 8 10.84 9.13 8.15 5.56 10 7 7 7 8 4.82 7.26 6.42 7.91 11 5 5 5 8 5.68 4.74 5.73 6.89
  113. 113. a <- anscombe
  114. 114. a
  115. 115. summary(a$x1) summary(a[,1]) summary(a[,"x1"]) They all mean the same.
  116. 116. Min. 1st Qu. Median Mean 3rd Qu. Max. 4.0 6.5 9.0 9.0 11.5 14.0
  117. 117. What about the rest of a?
  118. 118. summary(a)
  119. 119. plot(a)
  120. 120. plot(a$x1, a$y1)
  121. 121. cor(a$x1, a$y1) cor.test(a$x1, a$y1)
  122. 122. a$x4 <- NULL a$y4 <- NULL a[,c("x4","y4")] <- NULL a[,c(4,8)] <- NULL
  123. 123. NULL TRUE FALSE NA
