- 1. Visualization and Analysis of Big Data with the R Programming Language Michael E. Driscoll, Ph.D. Presented to Amyris April 2009
- 2. “The sexy job in the next ten years will be statisticians.” – Hal Varian, Chief Economist, Google
- 3. What is R? What can it do? • data manipulation • statistics • visualization Why is it different? • created by statisticians • free, open source • extensible via packages
- 4. What is R? Data Manipulation • database connectivity • slicing & dicing data cubes Data Visualization Statistical Analysis • hypothesis testing • model fitting • clustering • machine learning
- 5. I. Taming Microarray Data with Bioconductor Visualization of hybridization artifacts Statistical analysis • fit models for the distributions of expression values • test hypotheses about outliers • cluster genes with similar patterns http://www.bioconductor.org
- 6. 1 million transactions during this presentation
- 7. II. Clustering Product Purchases Statistical analysis Which products are ordered together? • every customer has a history of product purchases • hierarchically cluster products and customers • other approaches (depending on goals): singular value decomposition
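
The hierarchical clustering step on this slide can be sketched in base R. This is a minimal illustration, not the analysis from the talk: the purchase matrix, the customer and product names, the binary distance, and the average linkage are all assumptions chosen to keep the example small.

```r
# Hypothetical purchase matrix (made-up data): rows are customers,
# columns are products, 1 means the customer has bought the product.
purchases <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 1, 0, 1,
    0, 0, 1, 1,
    0, 0, 1, 1,
    0, 1, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste("customer", 1:6, sep = ""),
                  c("bread", "butter", "diapers", "wipes"))
)

# Cluster products: transpose so each product is a row, then measure
# dissimilarity between purchase profiles with the binary distance.
product_dist <- dist(t(purchases), method = "binary")
product_tree <- hclust(product_dist, method = "average")

# Cut the dendrogram into two groups of products.
product_groups <- cutree(product_tree, k = 2)
print(product_groups)

# The same recipe clusters customers by working on the rows directly.
customer_tree <- hclust(dist(purchases, method = "binary"), method = "average")
customer_groups <- cutree(customer_tree, k = 2)
```

Calling `plot(product_tree)` draws the dendrogram, which is usually the quickest way to judge how many groups are worth cutting.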
- 8. 2 billion clicks during this presentation
- 9. III. Optimizing Online Advertising Statistical analysis How confident are we that B beats A? • estimate posterior distributions for click rates from observed data • test hypothesis that the click-rate of a given ad A is greater than for ad B
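
One way to put a number on "how confident are we that B beats A" is to simulate from the two posterior distributions. The sketch below assumes a uniform Beta(1, 1) prior on each click rate and uses made-up click and view counts; neither the prior nor the counts come from the talk.

```r
set.seed(42)  # make the simulation reproducible

# Hypothetical observed data (illustrative counts only).
clicks_a <- 120; views_a <- 10000
clicks_b <- 150; views_b <- 10000

# With a Beta(1, 1) prior, the posterior for a click rate is
# Beta(1 + clicks, 1 + views - clicks).
n_draws <- 100000
rate_a <- rbeta(n_draws, 1 + clicks_a, 1 + views_a - clicks_a)
rate_b <- rbeta(n_draws, 1 + clicks_b, 1 + views_b - clicks_b)

# Monte Carlo estimate of P(rate_b > rate_a | data).
prob_b_beats_a <- mean(rate_b > rate_a)
print(prob_b_beats_a)
```

The estimate is a posterior probability, so it answers the slide's question directly; `1 - prob_b_beats_a` is the corresponding probability that ad A has the higher click rate.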
- 10. IV. A Tale of Two Pitchers Hamels Webb
- 11. R Nuts and Bolts “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” – Bo Cowgill, Google
- 12. Data Manipulation Getting Data In • Data formats: • MySQL • Delimited (CSV, Excel) • ODBC (Oracle, MS-SQL) • Matlab Getting Data Out • Graphic formats: • Vector (PDF, EPS, SVG) • Raster (PNG, TIFF) library(RMySQL); driver <- dbDriver("MySQL"); con <- dbConnect(driver, user="tgardner", password="julien05", host="data.amyris.com", dbname="biofx"); resultSet <- dbSendQuery(con, "SELECT * FROM assay"); data <- fetch(resultSet, n=-1)
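
The delimited-file and vector-graphics routes on this slide need no database at all. The following is a small sketch in base R; the `assay` table, its column names, and its values are hypothetical, and temporary files stand in for real paths.

```r
# Getting data in: write a small made-up assay table to a CSV file,
# then read it back as a data frame.
assay <- data.frame(sample = c("s1", "s2", "s3"),
                    titer  = c(1.2, 3.4, 2.8))
csv_path <- tempfile(fileext = ".csv")
write.csv(assay, csv_path, row.names = FALSE)
assay_in <- read.csv(csv_path)

# Getting data out: draw a bar chart into a vector PDF file.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
barplot(assay_in$titer,
        names.arg = as.character(assay_in$sample),
        ylab = "titer")
dev.off()  # close the device so the file is written to disk
```

Swapping `pdf()` for `png()` produces a raster image instead, provided the R build has PNG support.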
- 13. Statistical Methods
- 14. Extending R with Packages CRAN http://cran.r-project.org • ~ 2000 packages • organized by field • easy to install > install.packages("lattice")
- 15. R Packages: Beautiful Colors with Colorspace library("colorspace"); red <- LAB(50,64,64); blue <- LAB(50,-48,-48); mixcolor(10, red, blue)
- 16. R Packages: Creating Panel Plots with Lattice library("lattice"); xyplot(x ~ y | pitch_type, data = gameday)
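
The `gameday` data set is not included with the slides, so the panel plot cannot be run as shown. The sketch below substitutes a simulated data frame with the same column names (`x`, `y`, `pitch_type`); the pitch types and the random values are assumptions made purely for illustration.

```r
library(lattice)

set.seed(1)  # reproducible simulated data

# Hypothetical stand-in for the talk's `gameday` data frame.
gameday <- data.frame(
  pitch_type = rep(c("fastball", "curveball", "changeup"), each = 50),
  x = rnorm(150),
  y = rnorm(150)
)

# The formula x ~ y | pitch_type puts x on the vertical axis, y on the
# horizontal axis, and draws one panel per pitch type.
panel_plot <- xyplot(x ~ y | pitch_type, data = gameday)
print(panel_plot)  # lattice objects are drawn when printed
```

The explicit `print()` matters inside scripts and functions, where a lattice object is not displayed automatically.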
- 17. Getting Started Download at R-project.org http://www.r-project.org Choose a UI • Emacs – ESS • JGR – Java GUI for R • Rattle
- 18. Getting Help Online • use inline help > ?plot • search/post at R-help http://tolstoy.newcastle.edu.au/R Books • Modern Applied Statistics with S, W.N. Venables & B.D. Ripley • Use R series includes 20 volumes http://www.springer.com/series/6991
- 19. Data Desktop
- 20. Which is Easier? Coding or Clicking
- 21. R-Based Dashboards A Simple Script setContentType("text/html"); png("/var/www/hello.png"); plot(sample(100,100),col=1:8,pch=19); dev.off(); cat("<html>"); cat("<body>"); cat("<h1>hello world</h1>"); cat('<img src="../hello.png">'); cat("</body>"); cat("</html>") Download Jeff Horner’s rApache at http://biostat.mc.vanderbilt.edu/rapache/
- 22. R-Based Dashboards http://labs.dataspora.com/gameday
