Presenter: David Smith
Presented to SURF (Sydney R User Group), June 25 2015

R at Microsoft

  1. 1. • Introduction to R • Applications of R at Microsoft • R Products at Microsoft • What’s coming for R at Microsoft • Q&A
  2. 2. April 6, 2015 “This acquisition will help customers use advanced analytics within Microsoft data platforms.”
  4. 4. • Most widely used data analysis software • Most powerful statistical programming language • Create beautiful and unique data visualizations • Thriving open-source community • Fills the talent gap
  5. 5. • 1993: Research project in Auckland, NZ • 1995: Released as open-source software • 1997: R core group formed • 2000: R 1.0.0 released • 2003: R Foundation formed in Austria • 2004: First international user conference • 2007: Revolution Analytics founded • 2009: New York Times article on R • 2013: Revolution R Open released • 2015: Microsoft acquires Revolution Analytics • Photo credit: Robert Gentleman
  6. 6. • R Usage Growth: Rexer Data Miner Survey, 2007-2013 • R Language Popularity: #9 in IEEE Spectrum Top Programming Languages, July 2014
  7. 7. New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
  9. 9. • Traditional BI: What happened? Why did it happen? • Advanced Analytics: What will happen? How can we make it happen?
  10. 10. • System monitoring & alerting • Capacity Planning
  11. 11. • TrueSkill Matchmaking System • Player Churn • Game design • In-game purchase optimization • Fraud detection • Player communities
  13. 13. • Enhanced Open Source R distribution • Compatible with all R-related software • Multi-threaded for performance • Focus on reproducibility • Open source (GPLv2 license) • Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE • Download from
  14. 14. • Built on latest R engine • 100% compatible with • Designed to work with RStudio
  15. 15. • Multithreaded library replaces standard BLAS/LAPACK algorithms • High-performance algorithms • Sequential → Parallel • No need to change any R code • Included with RRO binary distributions • More at Revolutions blog
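The "no need to change any R code" point can be illustrated with ordinary matrix code. This is a minimal sketch in base R only: the matrix size and variable names are made up for illustration, and the timing it prints depends entirely on which BLAS/LAPACK library the R build is linked against.

```r
# Ordinary base R linear algebra; nothing here is specific to RRO.
# Matrix size and variable names are illustrative assumptions.
set.seed(42)
n <- 500
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

# %*% is handed to whatever BLAS library this R build links against,
# so the same line runs sequentially or multi-threaded without edits.
elapsed <- system.time(prod_ab <- a %*% b)[["elapsed"]]

cat(sprintf("%d x %d matrix product took %.3f seconds\n", n, n, elapsed))
```

Running the same script under a standard R build and under a build with a multi-threaded math library is one way to compare the two; no speed-up figure is implied here.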
  16. 16. Adapted from CC BY-NC 2.5
  17. 17. • Static CRAN mirror • Daily CRAN snapshots • Easily write and share scripts synced to a specific snapshot • checkpoint package: library(checkpoint); checkpoint("2014-09-17") • [Diagram labels: CRAN, CRAN mirror, checkpoint server, daily snapshots at midnight UTC]
  18. 18. • Easy to use: add 2 lines to the top of each script • For the package author: • For a script collaborator:
  19. 19. • Download Revolution R Open • Learn about R and RRO • Daily CRAN snapshots • Explore Packages • Explore Task Views
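Conceptually, syncing a script to a snapshot means pointing R's package repository at a dated copy of CRAN. The sketch below shows only that idea in base R. It is not the checkpoint package's implementation, the server address in `snapshot_base` is a hypothetical placeholder, and `snapshot_repo()` is a helper invented for this example. The date is the one used on the slide.

```r
# Hypothetical snapshot server; substitute a real dated CRAN mirror.
snapshot_base <- "https://snapshots.example.org"

# Illustrative helper: build the repository URL for one snapshot date.
snapshot_repo <- function(date, base = snapshot_base) {
  # as.Date() returns NA for malformed dates such as "2014-13-45"
  d <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(d)) stop("not a valid YYYY-MM-DD date: ", date)
  paste(base, format(d, "%Y-%m-%d"), sep = "/")
}

# Point the CRAN repository at the snapshot, keeping the old setting.
old <- options(repos = c(CRAN = snapshot_repo("2014-09-17")))
pinned <- getOption("repos")[["CRAN"]]
print(pinned)

# install.packages() would now resolve packages against the dated URL.
options(old)  # restore the previous repository setting
```

Two collaborators who use the same date string resolve packages against the same frozen set of versions, which is the reproducibility property the slide describes.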
  20. 20. Trends
  21. 21. R FOR BIG DATA
  22. 22. • Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms • Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data • Big data predictive analytics is mostly not embarrassingly parallel • Details at
  23. 23. … is the only big data, big analytics platform based on open source R, the de facto statistical computing language for modern analytics
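An "embarrassingly parallel" problem is one whose tasks need no communication with each other. The sketch below is a minimal illustration using the parallel package that ships with R; the Monte Carlo task, the helper name `estimate_pi`, the draw counts and the core count are all assumptions chosen for the example, not taken from the talk.

```r
library(parallel)

# Each task needs only a draw count, so almost no data moves to a worker.
estimate_pi <- function(n_draws) {
  x <- runif(n_draws)
  y <- runif(n_draws)
  4 * mean(x^2 + y^2 <= 1)  # share of points inside the quarter circle
}

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

RNGkind("L'Ecuyer-CMRG")  # generator with separate streams per worker
set.seed(123)

# Four independent tasks; the results are combined only at the end.
estimates <- mclapply(rep(50000, 4), estimate_pi, mc.cores = n_cores)

pi_hat <- mean(unlist(estimates))
cat("Estimated pi:", pi_hat, "\n")
```

Fitting a regression on data too large for memory does not split this cleanly, because every chunk contributes to shared intermediate results. That is the distinction the last bullet draws.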
  24. 24. Data Step: • Data import (delimited, fixed, SAS, SPSS, ODBC) • Variable creation & transformation • Recode variables • Factor variables • Missing value handling • Sort, Merge, Split • Aggregate by category (means, sums)
      Descriptive Statistics: • Min / Max, Mean, Median (approx.) • Quantiles (approx.) • Standard Deviation • Variance • Correlation • Covariance • Sum of Squares (cross product matrix for set variables) • Pairwise Cross tabs • Risk Ratio & Odds Ratio • Cross-Tabulation of Data (standard tables & long form) • Marginal Summaries of Cross Tabulations
      Statistical Tests: • Chi Square Test • Kendall Rank Correlation • Fisher’s Exact Test • Student’s t-Test
      Sampling: • Subsample (observations & variables) • Random Sampling
      Predictive Models: • Sum of Squares (cross product matrix for set variables) • Multiple Linear Regression • Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), user-defined distributions & link functions • Covariance & Correlation Matrices • Logistic Regression • Classification & Regression Trees • Predictions/scoring for models • Residuals for all models
      Cluster Analysis: • K-Means
      Classification: • Decision Trees • Decision Forests • Gradient Boosted Decision Trees • Naïve Bayes
      Simulation: • Simulation (e.g. Monte Carlo) • Parallel Random Number Generation
      Variable Selection: • Stepwise Regression
      Combination: • PEMA-R API • rxDataStep • rxExec
      Legend: New in v7.3, Coming in v7.4
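Several of the descriptive statistics listed above can be computed one chunk at a time, which is the general idea behind external-memory algorithms (reading PEMA as "parallel external memory algorithm"). The sketch below shows that idea in base R only. It does not use or reproduce the ScaleR API, and the helper name `chunked_mean_var`, the chunk size and the simulated data are assumptions made for illustration.

```r
# Accumulate count, sum and sum of squares chunk by chunk, so only one
# chunk has to be in memory at any time.
chunked_mean_var <- function(chunks) {
  n <- 0
  s <- 0
  ss <- 0
  for (chunk in chunks) {
    n <- n + length(chunk)
    s <- s + sum(chunk)
    ss <- ss + sum(chunk^2)
  }
  m <- s / n
  list(n = n, mean = m, var = (ss - n * m^2) / (n - 1))
}

# Simulated data standing in for a file read in blocks of 1000 rows.
set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))

res <- chunked_mean_var(chunks)
cat("mean:", res$mean, " variance:", res$var, "\n")
```

The per-chunk sums could equally be computed on separate workers and added together afterwards. The sum-of-squares formula is used here for brevity; it is less numerically stable than an incremental update for data with a large mean relative to its spread.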
  25. 25. • ETL • Marketing channel data • Behavioral variables • Promotional data • Overlay data • Exploratory data analysis • Time-to-event models • GAM survival models • Scoring for inference • Scoring for prediction • 5 billion scores per day per retailer CUSTOM DATA FORMAT CUSTOM VARIABLES (PMML)
  26. 26. R IN THE CLOUD
  27. 27. • Exposing the expertise of data scientists as APIs • Bringing the utility of data science to applications • Addressing the Data Science talent gap
  28. 28. Azure: Huge infrastructure scale • 19 regions online, with huge datacenter capacity around the world, and growing • 100+ datacenters • One of the top 3 networks in the world (coverage, speed, connections) • 2x the number of regions offered by AWS and 6x that of Google • G Series: largest VM available in the market (32 cores, 448 GB RAM, SSD…)
      Regions (map legend: Operational, Announced): Central US (Iowa) • West US (California) • North Europe (Ireland) • East US (Virginia) • East US 2 (Virginia) • US Gov (Virginia) • North Central US (Illinois) • US Gov (Iowa) • South Central US (Texas) • Brazil South (Sao Paulo) • West Europe (Netherlands) • China North * (Beijing) • China South * (Shanghai) • Japan East (Saitama) • Japan West (Osaka) • India West (TBD) • India East (TBD) • East Asia (Hong Kong) • SE Asia (Singapore) • Australia West (Melbourne) • Australia East (Sydney)
      * Operated by 21Vianet
  32. 32. SQL Server 2016: Built-in in-database analytics • Data Scientist: Interact directly with data, built in to SQL Server • Data Developer/DBA: Manage data and analytics together • Example Solutions: Fraud detection, Sales forecasting, Warehouse efficiency, Predictive maintenance • [Diagram labels: Relational Data, Analytic Library, T-SQL Interface, Extensibility, R Integration, New R scripts, Microsoft Azure Machine Learning Marketplace]
  33. 33. [Chart: axes labeled rows and minutes. Series: R on a server pulling data via SQL; R on a server invoking RRE ScaleR inside the EDW]
  34. 34. Thank you • Download Revolution R Open: • More at: • David Smith, R Community Lead, Revolution Analytics • @revodavid
  35. 35. More at