• Introduction to R
• Applications of R at Microsoft
• R Products at Microsoft
• What’s coming for R at Microsoft
• Q&A
April 6, 2015
“This acquisition will help customers use advanced analytics within Microsoft data platforms.”
INTRODUCTION TO R
• Most widely used data analysis software
• Most powerful statistical programming language
• Create beautiful and unique data visualizations
• Thriving open-source community
• Fills the talent gap
www.revolutionanalytics.com/what-is-r
• 1993: Research project in Auckland, NZ
• 1995: Released as open-source software
• 1997: R core group formed
• 2000: R 1.0.0 released
• 2003: R Foundation formed in Austria
• 2004: First international user conference
• 2007: Revolution Analytics founded
• 2009: New York Times article on R
• 2013: Revolution R Open released
• 2015: Microsoft acquires Revolution Analytics
Photo credit: Robert Gentleman
blog.revolutionanalytics.com/popularity
R Usage Growth: Rexer Data Miner Survey, 2007-2013
Language Popularity: IEEE Spectrum Top Programming Languages, July 2014 (#9: R)
New York Times, June 25 2009
(3 hours after Michael Jackson’s death)
R AT MICROSOFT
What happened?
Why did it happen?
What will happen?
How can we make it happen?
Traditional BI → Advanced Analytics
• System monitoring & alerting
• Capacity Planning
• TrueSkill Matchmaking System
• Player Churn
• Game design
• In-game purchase optimization
• Fraud detection
• Player communities
MICROSOFT PRODUCTS WITH R
• Enhanced Open Source R distribution
• Compatible with all R-related software
• Multi-threaded for performance
• Focus on reproducibility
• Open source (GPLv2 license)
• Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
• Download from mran.revolutionanalytics.com
• Built on latest R engine
• 100% compatible with
• Designed to work with RStudio
• Multithreaded library replaces standard BLAS/LAPACK algorithms
• High-performance algorithms
• Sequential → Parallel
• No need to change any R code
• Included with RRO binary distributions
More at Revolutions blog
Adapted from http://xkcd.com/234/
CC BY-NC 2.5
• Static CRAN mirror
• Daily CRAN snapshots
mran.revolutionanalytics.com/snapshot
• Easily write and share scripts synced to a specific snapshot
CRAN mirror (http://cran.revolutionanalytics.com/) → checkpoint server (snapshot at midnight UTC) → RRDaily snapshots (http://mran.revolutionanalytics.com/snapshot/)
checkpoint package:
library(checkpoint)
checkpoint("2014-09-17")
• Easy to use: add 2 lines to the top of each script
• For the package author:
• For a script collaborator:
• Download Revolution R Open
• Learn about R and RRO
• Daily CRAN snapshots
• Explore Packages
• Explore Task Views
Trends
R FOR BIG DATA
• Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
• Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
• Big Data Predictive Analytics mostly not embarrassingly parallel
Details at projects.revolutionanalytics.com
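As a rough illustration of an “embarrassingly parallel” problem, the sketch below bootstraps a sample mean across local worker processes. It uses the `parallel` package that ships with base R as a stand-in, not the toolkits this slide refers to. The data, the helper name `boot_mean`, and the choice of two workers are assumptions made for the example.

```r
library(parallel)  # ships with base R

# Illustrative data (assumption): 100 draws from a standard normal.
set.seed(42)
x <- rnorm(100)

# Hypothetical helper: one bootstrap replicate of the mean.
# Each call is independent of the others and needs only the small vector `data`.
boot_mean <- function(i, data) {
  mean(sample(data, replace = TRUE))
}

# Start a small local cluster; two workers is an arbitrary choice.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # independent random streams per worker

# Run 1000 replicates, split across the workers.
boot <- parSapply(cl, 1:1000, boot_mean, data = x)
stopCluster(cl)

# Bootstrap estimate of the standard error of the mean.
se <- sd(boot)
```

Each replicate touches only a small vector and never communicates with the others, which is why the work splits cleanly. Fitting a single model over a dataset too large for any one worker does not decompose this way.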
is… the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics

Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Combination
• PEMA-R API
• rxDataStep
• rxExec

New in v7.3
Coming in v7.4
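To make the GLM entry in the list above concrete, here is a minimal sketch of a Poisson model with a log link, followed by scoring and residuals. It uses base R's `glm()` on simulated in-memory data and is not the big-data implementation the slide describes. The simulated coefficients (0.5 and 1.2) and the variable names are assumptions chosen for the example.

```r
# Simulated data (assumption): counts whose log-mean is 0.5 + 1.2 * x.
set.seed(1)
n <- 500
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))

# Poisson family with the log link, one of the family/link pairs listed above.
fit <- glm(y ~ x, family = poisson(link = "log"))

# Estimated intercept and slope on the log scale.
coef(fit)

# Predictions/scoring: expected counts for two new observations.
pred <- predict(fit, newdata = data.frame(x = c(0.1, 0.9)), type = "response")

# Residuals for the fitted model (deviance residuals here).
res <- residuals(fit, type = "deviance")
```

Swapping `poisson(link = "log")` for, say, `binomial(link = "probit")` changes the distribution and link while leaving the rest of the workflow the same.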
• ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day per retailer
CUSTOM DATA FORMAT
CUSTOM VARIABLES
(PMML)
R IN THE CLOUD
• Exposing the expertise of data scientists as APIs
• Bringing the utility of data science to applications
• Addressing the Data Science talent gap
Azure: Huge infrastructure scale
19 Regions ONLINE…huge datacenter capacity around the world…and we’re growing
• 100+ datacenters
• One of the top 3 networks in the world (coverage, speed, connections)
• 2x AWS and 6x Google in number of offered regions
• G Series – Largest VM available in the market – 32 cores, 448 GB RAM, SSD…
Regions (map legend: Operational, Announced)
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney
* Operated by 21Vianet
http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
WHAT’S COMING FOR R AT MICROSOFT
Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and analytics together
SQL Server 2016
Built-in in-database analytics
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility
R Integration
Microsoft Azure
Machine Learning Marketplace
New R scripts
Chart (axes: rows, minutes): R on a server pulling data via SQL vs. R on a server invoking RRE ScaleR inside the EDW
Thank you
Download Revolution R Open:
mran.revolutionanalytics.com
More at:
blog.revolutionanalytics.com
David Smith
R Community Lead
Revolution Analytics
@revodavid
davidsmi@microsoft.com
More at deployr.revolutionanalytics.com
R at Microsoft

Editor's Notes

  • #14 Xbox: http://blog.revolutionanalytics.com/2014/05/microsoft-uses-r-for-xbox-matchmaking.html Other gaming http://blog.revolutionanalytics.com/2013/06/how-big-data-and-statistical-modeling-are-changing-video-games.html
  • #23 Infinite scale, inexpensively. Tons of data from which you actually have to get value. Customers that have a very high expectation of service and connection (Pier 1 is a great example). An influx of new talent to fill a very big gap, which McKinsey says is 300 thousand in the US alone. But the market this new talent is entering is still filled with barriers.
  • #28 Enterprise readiness; performance architecture; Big Data analytics; data source integration; development tools; deployment tools
  • #31 Demographics: consumer, product, market. Actions: web clicks, email clicks, mobile app usage, call center logs, social, search … Outcomes: impressions, touches, orders (retail, online, mobile). Strategic allocation.
  • #32 Outcome is “buying” instead of “dying”
  • #35 Over the last few years we’ve truly delivered a huge infrastructure to enable us to grow our services at scale around the globe. Whether it’s our flagship facilities in Quincy, Washington or Boydton, Virginia, or some of the newly announced facilities in Shanghai, Australia and Brazil, it really is key for us to make smart investments around the world to deliver services in a resilient and reliable fashion. A lot of people ask, what goes into site selection at Microsoft and how do we decide where to place our datacenter investments? There are over thirty-five factors in our site selection criteria. But really, the top elements are proximity to customers and energy and fiber infrastructure, ensuring that we have the capacity and the growth platforms to be able to grow our services. Another key element is a skilled workforce. We need to ensure that we have the right people to run and operate our datacenters on a day-to-day basis.
  • #43 Work done in conjunction with a major Teradata user and household name in Silicon Valley. The chart shows the results of moving R algorithm execution inside the Teradata EDW, achieving combined benefits from scaling computation and slashing data movement.