R Workshop for Beginners

Munging and Visualizing Data with R

Michael E. Driscoll & Xavier Léauté

Published in: Technology, Education

  1. 1. Munging & Visualizing Data with R
        Michael E. Driscoll, CTO, Metamarkets, @medriscoll
        Xavier Léauté, Metamarkets, @xvrl
        Barret Schloerke, Metamarkets
  2. 2. I. A Tour of R
  3. 3. January 6, 2009
  4. 4. R is a tool for…
        Data Manipulation
        • connecting to data sources
        • slicing & dicing data
        Modeling & Computation
        • statistical modeling
        • numerical simulation
        Data Visualization
        • visualizing fit of models
        • composing statistical graphics
  5. 5. R is an environment
  6. 6. Its interface is plain
  7. 7. RStudio to the rescue
  8. 8. Let’s take a tour of some data in R
            ## load in some Insurance Claim data
            library(MASS)
            data(Insurance)
            Insurance <- edit(Insurance)
            head(Insurance)
            dim(Insurance)
            ## plot it nicely using the ggplot2 package
            library(ggplot2)
            qplot(Group, Claims/Holders, data=Insurance, geom="bar",
                  stat="identity", position="dodge", facets=District ~ .,
                  fill=Age, ylab="Claim Propensity", xlab="Car Group")
            ## hypothesize a relationship between Age ~ Claim Propensity
            ## visualize this hypothesis with a boxplot
            x11()
            library(ggplot2)
            qplot(Age, Claims/Holders, data=Insurance, geom="boxplot", fill=Age)
            ## quantify the hypothesis with linear model
            m <- lm(Claims/Holders ~ Age + 0, data=Insurance)
            summary(m)
  9. 9. R is “an overgrown calculator”
            sum(rgamma(rpois(1,lambda=2),shape=49,scale=.2))
  10. 10. R is “an overgrown calculator”
        • simple math
            > 2+2
            4
        • storing results in variables
            > x <- 2+2 ## ‘<-’ is R syntax for ‘=’ or assignment
            > x^2
            16
        • vectorized math
            > weight <- c(110, 180, 240) ## three weights
            > height <- c(5.5, 6.1, 6.2) ## three heights
            > bmi <- (weight*4.88)/height^2 ## divides element-wise
            17.7 23.6 30.4
  11. 11. R is “an overgrown calculator”
        • basic statistics
            mean(weight)   sd(weight)   sqrt(var(weight))
            176.6          65.0         65.0 # same as sd
        • set functions
            union   intersect   setdiff
        • advanced statistics
            > pbinom(40, 100, 0.5) ## P that a coin tossed 100 times
            0.028                  ## comes up heads 40 or fewer times
            > pshare <- pbirthday(23, 365, coincident=2)
            0.507 ## probability that among 23 people, at least two share a birthday
  12. 12. Try It! #1 Overgrown Calculator
        • basic calculations
            > 2 + 2 [Hit ENTER]
            > log(100) [Hit ENTER]
        • calculate the value of $100 after 10 years at 5%
            > 100 * exp(0.05*10) [Hit ENTER]
        • construct a vector & do a vectorized calculation
            > year <- (1,2,5,10,25) [Hit ENTER]
              this returns an error. why?
            > year <- c(1,2,5,10,25) [Hit ENTER]
            > 100 * exp(0.05*year) [Hit ENTER]
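The exercise above can be worked through as a short script. This is a minimal sketch: the variable name `value` and the `year_` labels are introduced here for illustration and are not from the slides. The first attempt fails because bare parentheses cannot hold a comma-separated list; `c()` is what combines values into a vector.

```r
# c() combines values into a vector; (1,2,5,10,25) alone is a syntax error
year <- c(1, 2, 5, 10, 25)

# Continuous compounding at 5%: one expression covers every horizon at once
value <- 100 * exp(0.05 * year)

# Label each result with its horizon (labels are illustrative only)
names(value) <- paste0("year_", year)
print(round(value, 2))
```

Because `exp()` and `*` are vectorized, no loop over `year` is needed.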
  13. 13. R as a Programming Language
            fibonacci <- function(n) {
              if (n <= 2) return(1) ## guard: 3:n would count down for n < 3
              fib <- numeric(n)
              fib[1:2] <- 1
              for (i in 3:n) {
                fib[i] <- fib[i-1] + fib[i-2]
              }
              return(fib[n])
            }
        Image from cover of Abelson & Sussman’s text, Structure and Interpretation of Computer Programs
  14. 14. Function Calls
        • There are ~1100 built-in commands in the R “base” package, which can be executed on the command-line. The basic structure of a call is thus:
            output <- function(arg1, arg2, …)
        • Arithmetic Operations
            + - * / ^
        • R functions are typically vectorized
            x <- x/3
          works whether x is a one- or many-valued vector
  15. 15. Data Structures in R
        vectors
          numeric     x <- c(0,2:4)
          character   y <- c("alpha", "b", "c3", "4")
          logical     z <- c(TRUE, FALSE, TRUE, FALSE) ## c(1, 0, TRUE, FALSE) would be coerced to numeric
            > class(x)
            [1] "numeric"
            > x2 <- as.logical(x)
            > class(x2)
            [1] "logical"
  16. 16. Data Structures in R
        objects
          lists          lst <- list(x,y,z)
          matrices       M <- matrix(rep(x,3),ncol=3)
          data frames*   df <- data.frame(x,y,z)
            > class(df)
            [1] "data.frame"
  17. 17. Summary of Data Structures
                         Linear    Rectangular
        Homogeneous      vectors   matrices
        Heterogeneous    lists     data frames*
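The homogeneous/heterogeneous distinction in the summary above can be seen directly at the prompt. The following is a small sketch with made-up values (none of these objects appear in the slides): combining mixed types with `c()` coerces them to one common type, while a list or a data frame keeps each element or column in its own type.

```r
# Vectors are homogeneous: mixed inputs are coerced to one common type
v <- c(1, "a", TRUE)
print(class(v))             # "character"

# Lists are heterogeneous: each element keeps its own type
lst <- list(1, "a", TRUE)
print(sapply(lst, class))   # "numeric" "character" "logical"

# Matrices are homogeneous and rectangular
m <- matrix(1:6, ncol = 3)
print(dim(m))               # 2 3

# Data frames are rectangular, and each column may have its own type
df <- data.frame(id = 1:2, name = c("a", "b"), flag = c(TRUE, FALSE))
print(sapply(df, class))
```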
  18. 18. R is a numerical simulator
        • built-in functions for classical probability distributions
        • let’s simulate 100,000 trials of 100 coin flips. what’s the distribution of heads?
            > heads <- rbinom(10^5,100,0.50)
            > hist(heads)
  19. 19. Functions for Probability Distributions
        ddist( )   density function (pdf)
        pdist( )   cumulative distribution function (cdf)
        qdist( )   quantile function
        rdist( )   random deviates
        Examples
          Normal     dnorm, pnorm, qnorm, rnorm
          Binomial   dbinom, pbinom, …
          Poisson    dpois, …
            > pnorm(0)      0.5
            > qnorm(0.9)    1.28
            > rnorm(100)    vector of length 100
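As a quick check of how the four prefixes relate, the sketch below uses the normal distribution; the variable names and the seed are arbitrary choices made here. The `p` and `q` functions are inverses of each other, `d` gives the density, and `r` draws random values.

```r
# p-functions give cumulative probability; q-functions invert them
p <- pnorm(0)            # P(Z <= 0) for a standard normal
q <- qnorm(0.9)          # the 90th percentile
round_trip <- pnorm(q)   # maps the quantile back to its probability

# d-functions give the density; r-functions draw random values
peak <- dnorm(0)         # density at the mode, 1/sqrt(2*pi)
set.seed(1)              # arbitrary seed, only for reproducibility
draws <- rnorm(100)

print(c(p = p, q = q, round_trip = round_trip, peak = peak))
```

The same pattern holds for the other distributions, for example `pbinom`/`qbinom` or `ppois`/`qpois`.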
  20. 20. Functions for Probability Distributions
        How to find the functions for lognormal distribution?
        1) Use the double question mark ‘??’ to search
            > ??lognormal
        2) Then identify the package
            > ?Lognormal
        3) Discover the dist functions
            dlnorm, plnorm, qlnorm, rlnorm
        distribution        dist suffix in R
        Beta                -beta
        Binomial            -binom
        Cauchy              -cauchy
        Chisquare           -chisq
        Exponential         -exp
        F                   -f
        Gamma               -gamma
        Geometric           -geom
        Hypergeometric      -hyper
        Logistic            -logis
        Lognormal           -lnorm
        Negative Binomial   -nbinom
        Normal              -norm
        Poisson             -pois
        Student t           -t
        Uniform             -unif
        Tukey               -tukey
        Weibull             -weibull
        Wilcoxon            -wilcox
  21. 21. Try It! #2 Numerical Simulation
        • simulate 1m drivers from which we expect 4 claims
            > numclaims <- rpois(n, lambda)
              (hint: use ?rpois to understand the parameters)
        • verify the mean & variance are reasonable
            > mean(numclaims)
            > var(numclaims)
        • visualize the distribution of claim counts
            > hist(numclaims)
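One possible solution to the exercise above, as a sketch. It assumes the prompt means 4 claims expected in total across the one million drivers, so each driver gets a Poisson rate of 4/n; that reading is an assumption, and if the intent is 4 expected claims per driver, `lambda <- 4` is the only change needed. The seed is arbitrary.

```r
set.seed(42)      # arbitrary seed, only for reproducibility
n <- 1e6          # one million drivers
lambda <- 4 / n   # ASSUMPTION: 4 claims expected in total, spread over all drivers

# One Poisson draw per driver
numclaims <- rpois(n, lambda)

# For a Poisson variable the mean and variance should both be close to lambda
print(c(mean = mean(numclaims), var = var(numclaims), lambda = lambda))

# A frequency table is easier to read than a histogram when almost all counts are 0
print(table(numclaims))
```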
  22. 22. Getting Data In
        • from Files
            > Insurance <- read.csv("Insurance.csv",header=TRUE)
        • from Databases
            > con <- dbConnect(driver,user,password,host,dbname)
            > Insurance <- dbGetQuery(con, "SELECT * FROM claims") ## dbSendQuery would return only a result handle
        • from the Web
            > con <- url("http://labs.dataspora.com/test.txt")
            > Insurance <- read.csv(con, header=TRUE)
        • from R data objects
            > load("Insurance.Rda")
  23. 23. Getting Data Out
        • to Files
            write.csv(Insurance,file="Insurance.csv")
        • to Databases
            con <- dbConnect(dbdriver,user,password,host,dbname)
            dbWriteTable(con, "Insurance", Insurance)
        • to R Objects
            save(Insurance, file="Insurance.Rda")
  24. 24. Navigating within the R environment
        • listing all variables
            > ls()
        • examining a variable ‘x’
            > str(x)
            > head(x)
            > tail(x)
            > class(x)
        • removing variables
            > rm(x)
            > rm(list=ls()) # remove everything
  25. 25. Try It! #3 Data Processing
        • load data & view it
            library(MASS)
            head(Insurance) ## the first 6 rows
            dim(Insurance) ## number of rows & columns
        • write it out
            write.csv(Insurance,file="Insurance.csv", row.names=FALSE)
            getwd() ## where am I?
        • view it in Excel, make a change, save it
            remove the first district
        • load it back in to R & plot it
            Insurance <- read.csv(file="Insurance.csv")
            plot(Claims/Holders ~ Age, data=Insurance)
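The write-then-read round trip in this exercise can be sketched without the Excel step. The temporary file path and the name `reloaded` are choices made here so the example does not touch the working directory. One thing worth noticing is that a CSV file stores only text, so the ordered factor `Age` does not come back as an ordered factor.

```r
library(MASS)   # provides the Insurance data set

# Write to a temporary file rather than the working directory
path <- tempfile(fileext = ".csv")
write.csv(Insurance, file = path, row.names = FALSE)

# Read it back under a new name so the original stays available for comparison
reloaded <- read.csv(path)

# Same shape before and after
print(identical(dim(reloaded), dim(Insurance)))

# Column types can change: factor information is not stored in a CSV
print(class(Insurance$Age))
print(class(reloaded$Age))

unlink(path)    # clean up the temporary file
```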
  26. 26. A Swiss-Army Knife for Data
  27. 27. A Swiss-Army Knife for Data
        • Indexing
        • Three ways to index into a data frame
          – array of integer indices
          – array of character names
          – array of logical Booleans
        • Examples:
            df[1:3,]
            df[c("New York", "Chicago"),]
            df[c(TRUE,FALSE,TRUE,TRUE),]
            df[df$city == "New York",]
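To make the indexing examples runnable, the sketch below builds a small data frame first. The cities and sales figures are hypothetical values invented for illustration. Indexing by name works on row names, so the city names are also assigned as row names.

```r
# Hypothetical data, for illustration only
df <- data.frame(
  city  = c("New York", "Chicago", "Boston", "Austin"),
  sales = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)
rownames(df) <- df$city   # character indexing matches against row names

by_position  <- df[1:3, ]                           # integer indices
by_name      <- df[c("New York", "Chicago"), ]      # character names
by_mask      <- df[c(TRUE, FALSE, TRUE, TRUE), ]    # logical vector, one entry per row
by_condition <- df[df$city == "New York", ]         # logical vector computed from a column

print(by_mask)
```

The last form is the most common in practice: the comparison produces a logical vector, which then selects the rows.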
  28. 28. A Swiss-Army Knife for Data
        • subset – extract subsets meeting some criteria
            subset(Insurance, District==1)
            subset(Insurance, Claims < 20)
        • transform – add or alter a column of a data frame
            transform(Insurance, Propensity=Claims/Holders)
        • cut – cut a continuous value into groups
            cut(Insurance$Claims, breaks=c(-1,100,Inf), labels=c("lo","hi"))
        • Put it all together: create a new, transformed data frame
            transform(subset(Insurance, District==1),
                      ClaimLevel=cut(Claims, breaks=c(-1,100,Inf), labels=c("lo","hi")))
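The three helpers can also be applied one step at a time, which makes each intermediate result easy to inspect. This is a sketch on the `Insurance` data from MASS; the name `district1` is introduced here and is not from the slides.

```r
library(MASS)   # provides the Insurance data set

# subset: keep only the rows for District 1
district1 <- subset(Insurance, District == 1)

# transform: add a derived column
district1 <- transform(district1, Propensity = Claims / Holders)

# cut: bin a continuous column into labeled groups
district1$ClaimLevel <- cut(district1$Claims,
                            breaks = c(-1, 100, Inf),
                            labels = c("lo", "hi"))

print(table(district1$ClaimLevel))
```

The lower break is -1 rather than 0 because `cut()` intervals are open on the left by default, so a count of 0 claims would otherwise fall outside every bin.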
  29. 29. A Swiss-Army Knife for Data
        • sqldf – a library that allows you to query R data frames as if they were SQL tables. Particularly useful for aggregations.
            library(sqldf)
            sqldf("select country, sum(revenue) revenue FROM sales GROUP BY country")
              country  revenue
            1      FR 307.1157
            2      UK 280.6382
            3     USA 304.6860
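Where sqldf is not installed, the same GROUP BY aggregation can be sketched in base R with `aggregate()`. This is an alternative technique, not what the slide uses, and the `sales` data below is hypothetical, since the slide does not show the underlying table.

```r
# Hypothetical sales data, for illustration only
sales <- data.frame(
  country = c("FR", "UK", "USA", "FR", "UK", "USA"),
  revenue = c(100, 150, 120, 50, 25, 80)
)

# Base R equivalent of:
#   select country, sum(revenue) revenue FROM sales GROUP BY country
totals <- aggregate(revenue ~ country, data = sales, FUN = sum)
print(totals)
```

The formula reads "revenue grouped by country", and `FUN` plays the role of the SQL aggregate function.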
  30. 30. A Statistical Modeler
        • R has a powerful modeling syntax
        • Models are specified with formulae, like
            y ~ x
            growth ~ sun + water
          Formulae model relationships between continuous and categorical variables.
        • Models also guide the visualization of relationships in a graphical form
  31. 31. A Statistical Modeler
        • Linear model
            m <- lm(Claims/Holders ~ Age, data=Insurance)
        • Examine it
            summary(m)
        • Plot it
            plot(m)
  32. 32. A Statistical Modeler
        • Logistic model
            m <- glm(Age ~ I(Claims/Holders), data=Insurance,
                     family=binomial("logit")) ## I() makes '/' mean division on the right-hand side
        • Examine it
            summary(m)
        • Plot it
            plot(m)
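A detail of formula syntax worth checking whenever a ratio such as Claims/Holders is used as a predictor: on the right-hand side of a formula, `/` is a formula operator (nesting), not arithmetic division, and wrapping the expression in `I()` restores the arithmetic meaning. The sketch below inspects how R expands each form; `y`, `a`, and `b` are placeholder names.

```r
# Without I(): a / b expands to the terms a and a:b (b nested within a)
f_nested <- y ~ a / b
nested_terms <- attr(terms(f_nested), "term.labels")
print(nested_terms)   # "a" "a:b"

# With I(): the ratio is treated as one arithmetic predictor
f_ratio <- y ~ I(a / b)
ratio_terms <- attr(terms(f_ratio), "term.labels")
print(ratio_terms)    # "I(a/b)"
```

On the left-hand side of `~`, as in `Claims/Holders ~ Age`, the expression is evaluated arithmetically, so no `I()` is needed there.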
  33. 33. Try It! #4 Statistical Modeling
        • fit a linear model
            m <- lm(Claims/Holders ~ Age + 0, data=Insurance)
        • examine it
            summary(m)
        • plot it
            plot(m)
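One way to read the fitted coefficients from this exercise: with the intercept removed by `+ 0` and a single factor as predictor, each coefficient is the mean claim propensity within one age group. The sketch below checks that against a direct group-wise mean; `group_means` is a name introduced here.

```r
library(MASS)   # provides the Insurance data set

# No-intercept model: one coefficient per Age level
m <- lm(Claims / Holders ~ Age + 0, data = Insurance)

# Direct computation of the mean propensity within each Age level
group_means <- tapply(Insurance$Claims / Insurance$Holders, Insurance$Age, mean)

# The two columns should agree
print(cbind(coefficient = coef(m), group_mean = group_means))
```

With the intercept left in, the coefficients would instead be contrasts between age groups, which are harder to read directly.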
  34. 34. Visualization: Multivariate Barplot
            library(ggplot2)
            qplot(Group, Claims/Holders, data=Insurance, geom="bar",
                  stat="identity", position="dodge", facets=District ~ .,
                  fill=Age)
  35. 35. Visualization: Boxplots
            library(ggplot2)
            qplot(Age, Claims/Holders, data=Insurance, geom="boxplot")

            library(lattice)
            bwplot(Claims/Holders ~ Age, data=Insurance)
  36. 36. Visualization: Histograms
            library(ggplot2)
            qplot(Claims/Holders, data=Insurance, facets=Age ~ ., geom="density")

            library(lattice)
            densityplot(~ Claims/Holders | Age, data=Insurance, layout=c(4,1))
  37. 37. Try It! #5 Data Visualization
        • simple line chart
            > x <- 1:10
            > y <- x^2
            > plot(y ~ x)
        • box plot
            > library(lattice)
            > boxplot(Claims/Holders ~ Age, data=Insurance)
        • visualize a linear fit
            > abline(0,1)
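`abline(0, 1)` draws the fixed line with intercept 0 and slope 1. To overlay the least-squares line for the plotted data instead, a fitted model can be passed to `abline()`. This is a sketch using the `x` and `y` from the first bullet; the name `fit` is introduced here.

```r
x <- 1:10
y <- x^2

# Fit a straight line to the points
fit <- lm(y ~ x)

# Scatter plot with the fitted line drawn on top
plot(y ~ x)
abline(fit)

print(coef(fit))
```

The line fits poorly at both ends because the data are quadratic, which is the kind of mismatch this sort of plot is meant to reveal.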
  38. 38. Getting Help with R
        Help within R itself
        for a function
            > help(func)
            > ?func
        for a topic
            > help.search("topic")
            > ??topic
        • search.r-project.org
        • Google Code Search  www.google.com/codesearch
        • Stack Overflow  http://stackoverflow.com/tags/R
        • R-help list  http://www.r-project.org/posting-guide.html
  39. 39. Six Indispensable Books on R
        Learning R
        Data Manipulation
        Visualization: lattice & ggplot2
        Statistical Modeling
  40. 40. Extending R with Packages
        Over one thousand user-contributed packages are available on CRAN – the Comprehensive R Archive Network
            http://cran.r-project.org
        Install a package from the command-line
            > install.packages("actuar")
        Install a package from the GUI menu
            “Packages” --> “Install package(s)”
  41. 41. Visualization with lattice
  42. 42. lattice = trellis (source: http://lmdvr.r-forge.r-project.org )
  43. 43. list of lattice functions
            densityplot(~ speed | type, data=pitch)
  44. 44. Visualization with ggplot2
  45. 45. ggplot2 = grammar of graphics
  46. 46. ggplot2 = grammar of graphics
  47. 47. Visualizing 50,000 Diamonds with ggplot2
  48. 48. qplot(carat, price, data = diamonds)
  49. 49. qplot(log(carat), log(price), data = diamonds)
  50. 50. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20))
  51. 51. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20), colour = color)
  52. 52. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20)) + facet_grid(. ~ color)
  53. 53. qplot(color, price/carat, data = diamonds, geom = "boxplot")
        qplot(color, price/carat, data = diamonds, alpha = I(1/20), geom = "jitter")
  54. 54. (live demo)
  55. 55. visualizing six dimensions of MLB pitches with ggplot2
  56. 56. Demo with MLB Gameday Data
        Code, data, and instructions at: http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R
