R Workshop for Beginners

Munging and Visualizing Data with R

Michael E. Driscoll & Xavier Léauté

Published in: Technology, Education

Transcript

  • 1. Munging & Visualizing Data with R
        Michael E. Driscoll, CTO, Metamarkets (@medriscoll)
        Xavier Léauté, Metamarkets (@xvrl)
        Barret Schloerke, Metamarkets
  • 2. I. A Tour of R
  • 3. January 6, 2009
  • 4. R is a tool for…
        Data Manipulation
          • connecting to data sources
          • slicing & dicing data
        Modeling & Computation
          • statistical modeling
          • numerical simulation
        Data Visualization
          • visualizing fit of models
          • composing statistical graphics
  • 5. R is an environment
  • 6. Its interface is plain
  • 7. RStudio to the rescue
  • 8. Let's take a tour of some data in R
        ## load in some Insurance Claim data
        library(MASS)
        data(Insurance)
        Insurance <- edit(Insurance)
        head(Insurance)
        dim(Insurance)
        ## plot it nicely using the ggplot2 package
        library(ggplot2)
        qplot(Group, Claims/Holders, data=Insurance, geom="bar", stat="identity", position="dodge", facets=District ~ ., fill=Age, ylab="Claim Propensity", xlab="Car Group")
        ## hypothesize a relationship between Age ~ Claim Propensity
        ## visualize this hypothesis with a boxplot
        x11()
        library(ggplot2)
        qplot(Age, Claims/Holders, data=Insurance, geom="boxplot", fill=Age)
        ## quantify the hypothesis with linear model
        m <- lm(Claims/Holders ~ Age + 0, data=Insurance)
        summary(m)
  • 9. R is "an overgrown calculator"
        sum(rgamma(rpois(1,lambda=2),shape=49,scale=.2))
  • 10. R is "an overgrown calculator"
        • simple math
            > 2+2
            4
        • storing results in variables
            > x <- 2+2    ## '<-' is R syntax for '=' or assignment
            > x^2
            16
        • vectorized math
            > weight <- c(110, 180, 240)       ## three weights
            > height <- c(5.5, 6.1, 6.2)       ## three heights
            > bmi <- (weight*4.88)/height^2    ## divides element-wise
            17.7 23.6 30.5
  • 11. R is "an overgrown calculator"
        • basic statistics
            mean(weight)   sd(weight)   sqrt(var(weight))
            176.7          65.1         65.1    # same as sd
        • set functions
            union   intersect   setdiff
        • advanced statistics
            > pbinom(40, 100, 0.5)    ## P that a coin tossed 100 times
            0.028                     ## will come up heads at most 40 times
            > pshare <- pbirthday(23, 365, coincident=2)
            0.507    ## probability that among 23 people, two share a birthday
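
A minimal sketch of the set functions and the binomial tail named on this slide. The vectors `a` and `b` are made-up illustrations, not data from the workshop. The point of the second half is that `pbinom(q, ...)` returns P(X ≤ q), so "fewer than 40 heads" corresponds to `q = 39`.

```r
# Hypothetical example vectors (not from the slides)
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

union(a, b)       # elements in either vector
intersect(a, b)   # elements in both vectors
setdiff(a, b)     # elements of a that are not in b

# pbinom() is the cumulative distribution function: P(X <= q)
p_at_most_40    <- pbinom(40, 100, 0.5)
p_fewer_than_40 <- pbinom(39, 100, 0.5)

# The CDF is the running sum of the point probabilities from dbinom()
p_by_summing <- sum(dbinom(0:40, 100, 0.5))
```
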
  • 12. Try It! #1 Overgrown Calculator
        • basic calculations
            > 2 + 2 [Hit ENTER]
            > log(100) [Hit ENTER]
        • calculate the value of $100 after 10 years at 5%
            > 100 * exp(0.05*10) [Hit ENTER]
        • construct a vector & do a vectorized calculation
            > year <- (1,2,5,10,25) [Hit ENTER]
              this returns an error. why?
            > year <- c(1,2,5,10,25) [Hit ENTER]
            > 100 * exp(0.05*year) [Hit ENTER]
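
One way to read the exercise above: parentheses only group a single expression, so `(1,2,5,10,25)` is a syntax error, while `c()` combines values into a vector. The sketch below also compares the slide's continuous-compounding formula with annual compounding; the annual formula and the variable names are additions for illustration, not part of the slides.

```r
# c() builds the vector; bare parentheses cannot hold a comma-separated list
year <- c(1, 2, 5, 10, 25)

# Continuous compounding, as on the slide: 100 * e^(rate * years)
continuous <- 100 * exp(0.05 * year)

# Annual compounding (added for comparison): 100 * (1 + rate)^years
annual <- 100 * (1 + 0.05)^year

# Both calculations are vectorized: one result per element of year
round(data.frame(year, continuous, annual), 2)
```
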
  • 13. R as a Programming Language
        fibonacci <- function(n) {
          fib <- numeric(n)
          fib[1:2] <- 1
          if (n > 2) {    ## 3:n would count down when n < 3
            for (i in 3:n) {
              fib[i] <- fib[i-1] + fib[i-2]
            }
          }
          return(fib[n])
        }
        Image from the cover of Abelson & Sussman's text, Structure and Interpretation of Computer Programs
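
A sketch of the same recurrence written with `seq_len()`, which gives an empty loop when `n` is 1 or 2 instead of counting down as `3:n` would. The name `fibonacci_safe` is hypothetical, and the function assumes `n` is a positive whole number.

```r
fibonacci_safe <- function(n) {
  stopifnot(n >= 1)                 # assumption: n is a positive integer
  fib <- rep(1, max(n, 2))          # the first two terms are both 1
  # seq_len(n)[-(1:2)] is 3, 4, ..., n, or empty when n < 3
  for (i in seq_len(n)[-(1:2)]) {
    fib[i] <- fib[i - 1] + fib[i - 2]
  }
  fib[n]
}

# Apply the function to several inputs at once
first_ten <- sapply(1:10, fibonacci_safe)
first_ten
```
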
  • 14. Function Calls
        • There are ~1100 built-in commands in the R "base" package, which can be executed on the command line. The basic structure of a call is thus:
            output <- function(arg1, arg2, …)
        • Arithmetic Operations
            + - * / ^
        • R functions are typically vectorized
            x <- x/3
          works whether x is a one- or many-valued vector
  • 15. Data Structures in R
        vectors
          numeric     x <- c(0,2:4)
          character   y <- c("alpha", "b", "c3", "4")
          logical     z <- c(TRUE, FALSE, TRUE, FALSE)    ## c(1, 0, TRUE, FALSE) would be coerced to numeric
        > class(x)
        [1] "numeric"
        > x2 <- as.logical(x)
        > class(x2)
        [1] "logical"
  • 16. Data Structures in R
        objects
          lists          lst <- list(x,y,z)
          matrices       M <- matrix(rep(x,3),ncol=3)
          data frames*   df <- data.frame(x,y,z)
        > class(df)
        [1] "data.frame"
  • 17. Summary of Data Structures
                         Linear    Rectangular
        Homogeneous      vectors   matrices
        Heterogeneous    lists     data frames*
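
The four structures in the summary can be built side by side, as in this sketch. The variable `mixed` is an addition that shows what happens when numbers and logicals are combined in one vector: R coerces everything to a single type.

```r
x <- c(0, 2:4)                          # numeric vector
y <- c("alpha", "b", "c3", "4")         # character vector
z <- c(TRUE, FALSE, TRUE, FALSE)        # logical vector

# Vectors are homogeneous: mixing types coerces to the more general one
mixed <- c(1, 0, TRUE, FALSE)
class(mixed)                            # "numeric", not "logical"

lst <- list(x, y, z)                    # linear, heterogeneous
M   <- matrix(rep(x, 3), ncol = 3)      # rectangular, homogeneous
df  <- data.frame(x, y, z)              # rectangular, heterogeneous

dim(M)
str(df)
```
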
  • 18. R is a numerical simulator
        • built-in functions for classical probability distributions
        • let's simulate 100,000 trials of 100 coin flips. what's the distribution of heads?
            > heads <- rbinom(10^5,100,0.50)
            > hist(heads)
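
A sketch that checks the simulation against theory: the share of simulated trials with at most 40 heads should be close to `pbinom(40, 100, 0.5)`. The seed value and the variable names `simulated` and `theoretical` are assumptions added for repeatability and clarity.

```r
set.seed(123)                            # assumption: fixed seed for repeatability
heads <- rbinom(10^5, 100, 0.50)         # 100,000 trials of 100 fair coin flips

simulated   <- mean(heads <= 40)         # observed share of trials with <= 40 heads
theoretical <- pbinom(40, 100, 0.5)      # exact P(X <= 40)

c(simulated = simulated, theoretical = theoretical)
```
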
  • 19. Functions for Probability Distributions
        ddist( )   density function (pdf)
        pdist( )   cumulative distribution function (cdf)
        qdist( )   quantile function
        rdist( )   random deviates
        Examples
          Normal     dnorm, pnorm, qnorm, rnorm
          Binomial   dbinom, pbinom, …
          Poisson    dpois, …
        > pnorm(0)      0.5
        > qnorm(0.9)    1.28
        > rnorm(100)    vector of length 100
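
The four prefixes relate to one another as sketched here for the normal distribution: `p` is the integral of `d`, and `q` is the inverse of `p`. The variable names are illustrative only.

```r
# p is the area under d: integrating the density up to 0 gives pnorm(0)
area_below_zero <- integrate(dnorm, -Inf, 0)$value
cdf_at_zero     <- pnorm(0)

# q inverts p: qnorm(pnorm(x)) returns x
x          <- 1.96
round_trip <- qnorm(pnorm(x))

# r draws random values from the distribution
draws <- rnorm(100)
length(draws)
```
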
  • 20. Functions for Probability Distributions
        How to find the functions for the lognormal distribution?
        1) Use the double question mark '??' to search
            > ??lognormal
        2) Then identify the package
            > ?Lognormal
        3) Discover the dist functions
            dlnorm, plnorm, qlnorm, rlnorm
        distribution         dist suffix in R
        Beta                 -beta
        Binomial             -binom
        Cauchy               -cauchy
        Chisquare            -chisq
        Exponential          -exp
        F                    -f
        Gamma                -gamma
        Geometric            -geom
        Hypergeometric       -hyper
        Logistic             -logis
        Lognormal            -lnorm
        Negative Binomial    -nbinom
        Normal               -norm
        Poisson              -pois
        Student t            -t
        Uniform              -unif
        Tukey                -tukey
        Weibull              -weibull
        Wilcoxon             -wilcox
  • 21. Try It! #2 Numerical Simulation
        • simulate 1m drivers from which we expect 4 claims
            > numclaims <- rpois(n, lambda)
            (hint: use ?rpois to understand the parameters)
        • verify the mean & variance are reasonable
            > mean(numclaims)
            > var(numclaims)
        • visualize the distribution of claim counts
            > hist(numclaims)
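
One possible reading of this exercise, sketched under a stated assumption: the slide does not say whether "4 claims" is a total or a per-driver mean, so the code below assumes `lambda = 4` is the expected number of claims per driver. For a Poisson distribution the mean and the variance both equal `lambda`, which is what the check looks for.

```r
set.seed(42)            # assumption: fixed seed for repeatability
n      <- 1e6           # one million drivers
lambda <- 4             # ASSUMPTION: expected claims per driver

numclaims <- rpois(n, lambda)

# For Poisson data, both of these should be close to lambda
mean(numclaims)
var(numclaims)

hist(numclaims)
```
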
  • 22. Getting Data In
        • from Files
            > Insurance <- read.csv("Insurance.csv",header=TRUE)
        • from Databases
            > con <- dbConnect(driver,user,password,host,dbname)
            > Insurance <- dbGetQuery(con, "SELECT * FROM claims")    ## dbGetQuery returns a data frame; dbSendQuery returns only a result handle
        • from the Web
            > con <- url("http://labs.dataspora.com/test.txt")
            > Insurance <- read.csv(con, header=TRUE)
        • from R data objects
            > load("Insurance.Rda")
  • 23. Getting Data Out
        • to Files
            write.csv(Insurance,file="Insurance.csv")
        • to Databases
            con <- dbConnect(dbdriver,user,password,host,dbname)
            dbWriteTable(con, "Insurance", Insurance)
        • to R Objects
            save(Insurance, file="Insurance.Rda")
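
A round-trip sketch for the file and R-object routes above, using temporary files so nothing is written to the working directory. The temporary paths and the environment `env` are choices made for this example. It also shows one difference between the two formats: a CSV stores only text, so the ordered factor `Age` does not come back as an ordered factor, while an `.Rda` file restores the object exactly.

```r
library(MASS)
data(Insurance)

# CSV round trip: values survive, but column classes are re-inferred on read
csv_path <- tempfile(fileext = ".csv")
write.csv(Insurance, file = csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, header = TRUE)

# Rda round trip: the object is restored exactly, including factor ordering
rda_path <- tempfile(fileext = ".Rda")
save(Insurance, file = rda_path)
env <- new.env()
load(rda_path, envir = env)      # load into a separate environment to compare

class(Insurance$Age)
class(from_csv$Age)
identical(env$Insurance, Insurance)
```
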
  • 24. Navigating within the R environment
        • listing all variables
            > ls()
        • examining a variable 'x'
            > str(x)
            > head(x)
            > tail(x)
            > class(x)
        • removing variables
            > rm(x)
            > rm(list=ls())    # remove everything
  • 25. Try It! #3 Data Processing
        • load data & view it
            library(MASS)
            head(Insurance)    ## the first 6 rows
            dim(Insurance)     ## number of rows & columns
        • write it out
            write.csv(Insurance,file="Insurance.csv", row.names=FALSE)
            getwd()    ## where am I?
        • view it in Excel, make a change, save it
            remove the first district
        • load it back in to R & plot it
            Insurance <- read.csv(file="Insurance.csv")
            plot(Claims/Holders ~ Age, data=Insurance)
  • 26. A Swiss-Army Knife for Data
  • 27. A Swiss-Army Knife for Data
        • Indexing
        • Three ways to index into a data frame
            • vector of integer indices
            • vector of character names
            • vector of logical values
        • Examples:
            df[1:3,]
            df[c("New York", "Chicago"),]
            df[c(TRUE,FALSE,TRUE,TRUE),]
            df[df$city == "New York",]
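
The indexing examples above need a data frame to run against. The one below is hypothetical (the slides do not show `df`), with city names used as row names so that character indexing works.

```r
# Hypothetical data frame; row names are the city names
df <- data.frame(
  city  = c("New York", "Chicago", "Boston", "Denver"),
  sales = c(10, 20, 30, 40),
  row.names = c("New York", "Chicago", "Boston", "Denver"),
  stringsAsFactors = FALSE
)

by_position  <- df[1:3, ]                          # integer indices
by_name      <- df[c("New York", "Chicago"), ]     # row names
by_mask      <- df[c(TRUE, FALSE, TRUE, TRUE), ]   # one logical per row
by_condition <- df[df$city == "New York", ]        # logical built from a test

by_condition
```

The last form is the most common in practice, because the logical vector is computed from the data rather than typed by hand.
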
  • 28. A Swiss-Army Knife for Data
        • subset: extract subsets meeting some criteria
            subset(Insurance, District==1)
            subset(Insurance, Claims < 20)
        • transform: add or alter a column of a data frame
            transform(Insurance, Propensity=Claims/Holders)
        • cut: cut a continuous value into groups
            cut(Insurance$Claims, breaks=c(-1,100,Inf), labels=c("lo","hi"))
        • Put it all together: create a new, transformed data frame
            transform(subset(Insurance, District==1), ClaimLevel=cut(Claims, breaks=c(-1,100,Inf), labels=c("lo","hi")))
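
The combined call above can be unpacked into steps, as in this sketch on the `Insurance` data from the MASS package. The intermediate name `d1` is an addition for readability.

```r
library(MASS)
data(Insurance)

# Step 1: keep only the rows for District 1
d1 <- subset(Insurance, District == 1)

# Step 2: add a ratio column and a two-level grouping of Claims
d1 <- transform(
  d1,
  Propensity = Claims / Holders,
  ClaimLevel = cut(Claims, breaks = c(-1, 100, Inf), labels = c("lo", "hi"))
)

# Count how many rows fall in each claim level
table(d1$ClaimLevel)
```
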
  • 29. A Swiss-Army Knife for Data
        • sqldf: a library that allows you to query R data frames as if they were SQL tables. Particularly useful for aggregations.
            library(sqldf)
            sqldf("select country, sum(revenue) revenue FROM sales GROUP BY country")
              country  revenue
            1 FR      307.1157
            2 UK      280.6382
            3 USA     304.6860
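
For comparison, the same kind of grouped sum can be written in base R with `aggregate()`, which needs no extra package. This is an alternative to the slide's `sqldf` query, not the slide's method, and the `sales` data frame below is made up since the slides do not show it.

```r
# Hypothetical sales data (the workshop's sales table is not shown)
sales <- data.frame(
  country = c("FR", "UK", "USA", "FR", "UK", "USA"),
  revenue = c(100, 120, 150, 50, 30, 25)
)

# Equivalent of: SELECT country, SUM(revenue) FROM sales GROUP BY country
totals <- aggregate(revenue ~ country, data = sales, FUN = sum)
totals
```
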
  • 30. A Statistical Modeler
        • R has a powerful modeling syntax
        • Models are specified with formulae, like
            y ~ x
            growth ~ sun + water
          which model relationships between continuous and categorical variables.
        • Models also guide the visualization of relationships in a graphical form
  • 31. A Statistical Modeler
        • Linear model
            m <- lm(Claims/Holders ~ Age, data=Insurance)
        • Examine it
            summary(m)
        • Plot it
            plot(m)
  • 32. A Statistical Modeler
        • Logistic model
            ## I() keeps '/' as arithmetic on the right-hand side of a formula
            m <- glm(Age ~ I(Claims/Holders), data=Insurance, family=binomial("logit"))
        • Examine it
            summary(m)
        • Plot it
            plot(m)
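
A note on formula syntax that matters for models like the one above: on the right-hand side of a formula, `/` is a nesting operator rather than division, so a ratio used as a predictor needs to be wrapped in `I()`. The sketch below inspects how R expands each form; the symbols `y`, `a`, and `b` are placeholders.

```r
# On the right-hand side, a/b expands to the terms a and a:b
nested_terms <- attr(terms(y ~ a/b), "term.labels")

# I() protects the expression so it is treated as ordinary arithmetic
ratio_terms <- attr(terms(y ~ I(a/b)), "term.labels")

nested_terms
ratio_terms
```

On the left-hand side, as in `lm(Claims/Holders ~ Age, ...)`, the expression is evaluated arithmetically, so no `I()` is needed there.
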
  • 33. Try It! #4 Statistical Modeling
        • fit a linear model
            m <- lm(Claims/Holders ~ Age + 0, data=Insurance)
        • examine it
            summary(m)
        • plot it
            plot(m)
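
A sketch of what the `+ 0` in this model does: it drops the intercept, so each coefficient is the fitted claim propensity for one age group. With a single categorical predictor and no intercept, those coefficients should match the plain group means, which the code computes separately with `tapply()` for comparison.

```r
library(MASS)
data(Insurance)

# No-intercept model: one coefficient per Age level
m <- lm(Claims/Holders ~ Age + 0, data = Insurance)
model_means <- coef(m)

# The same quantities computed directly as group means
group_means <- tapply(Insurance$Claims / Insurance$Holders, Insurance$Age, mean)

model_means
group_means
```
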
  • 34. Visualization: Multivariate Barplot
        library(ggplot2)
        qplot(Group, Claims/Holders, data=Insurance, geom="bar", stat="identity", position="dodge", facets=District ~ ., fill=Age)
  • 35. Visualization: Boxplots
        library(ggplot2)
        qplot(Age, Claims/Holders, data=Insurance, geom="boxplot")
        library(lattice)
        bwplot(Claims/Holders ~ Age, data=Insurance)
  • 36. Visualization: Histograms
        library(ggplot2)
        qplot(Claims/Holders, data=Insurance, facets=Age ~ ., geom="density")
        library(lattice)
        densityplot(~ Claims/Holders | Age, data=Insurance, layout=c(4,1))
  • 37. Try It! #5 Data Visualization
        • simple line chart
            > x <- 1:10
            > y <- x^2
            > plot(y ~ x)
        • box plot
            > library(lattice)
            > boxplot(Claims/Holders ~ Age, data=Insurance)    ## boxplot() is base graphics; the lattice version is bwplot()
        • visualize a linear fit
            > abline(0,1)
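
`abline(0, 1)` draws a line with intercept 0 and slope 1. To draw a fitted line instead, one option is to pass an `lm` object to `abline()`, as sketched here with the `x` and `y` from the first exercise. The name `fit` is an addition.

```r
x <- 1:10
y <- x^2

# Fit a straight line to the curved data
fit <- lm(y ~ x)
coef(fit)            # intercept and slope of the fitted line

# Scatterplot with the fitted line drawn on top
plot(y ~ x)
abline(fit)
```
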
  • 38. Getting Help with R
        Help within R itself
          for a function
            > help(func)
            > ?func
          for a topic
            > help.search(topic)
            > ??topic
        • search.r-project.org
        • Google Code Search: www.google.com/codesearch
        • Stack Overflow: http://stackoverflow.com/tags/R
        • R-help list: http://www.r-project.org/posting-guide.html
  • 39. Six Indispensable Books on R
        Learning R
        Data Manipulation
        Visualization: lattice & ggplot2
        Statistical Modeling
  • 40. Extending R with Packages
        Over one thousand user-contributed packages are available on CRAN, the Comprehensive R Archive Network
            http://cran.r-project.org
        Install a package from the command line
            > install.packages('actuar')
        Install a package from the GUI menu
            "Packages" --> "Install package(s)"
  • 41. Visualization with lattice
  • 42. lattice = trellis (source: http://lmdvr.r-forge.r-project.org)
  • 43. list of lattice functions
        densityplot(~ speed | type, data=pitch)
  • 44. Visualization with ggplot2
  • 45. ggplot2 = grammar of graphics
  • 46. ggplot2 = grammar of graphics
  • 47. Visualizing 50,000 Diamonds with ggplot2
  • 48. qplot(carat, price, data = diamonds)
  • 49. qplot(log(carat), log(price), data = diamonds)
  • 50. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20))
  • 51. qplot(log(carat), log(price), data = diamonds, alpha = I(1/20), colour=color)
  • 52. qplot(log(carat), log(price), data = diamonds, alpha=I(1/20)) + facet_grid(. ~ color)
  • 53. qplot(color, price/carat, data = diamonds, geom="boxplot")
        qplot(color, price/carat, data = diamonds, alpha = I(1/20), geom="jitter")
  • 54. (live demo)
  • 55. visualizing six dimensions of MLB pitches with ggplot2
  • 56. Demo with MLB Gameday Data
        Code, data, and instructions at: http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R
