WHO
The leading provider of advanced analytics software and services based on open source R, since 2007
WHAT
REVOLUTION R: The enterprise-grade predictive analytics application platform based on the R language
WHERE
“This acquisition will help customers use advanced analytics within Microsoft data platforms”
-- Joseph Sirosh, CVP C+E
• Situation
• Complication
• Critical question?
• Answer
• A high level overview of R
• Data science in the cloud
• Connecting R to SQL
• Scalable R
• R in SQL Server
• Moving your workflow to the cloud
A high level overview of R
• Most widely used data analysis software
• Most powerful statistical programming language
• Create beautiful and unique data visualizations
• Thriving open-source community
• Fills the talent gap
www.revolutionanalytics.com/what-is-r
• 1993: Research project in Auckland, NZ
• 1995: Open source
• 1997: R-core
• 2000: R-1.0.0
• 2003: R Foundation
• 2004: First UseR!
• 2009: New York Times
• 2015: R-3.2.0 and R Consortium
Photo credit: Robert Gentleman
The New York Times
Interactive Features
• Election Forecast
• Dialect Quiz
Data Journalism
• NFL Draft Picks
• Wealth distribution in USA
Data science in the Azure cloud
Trends
Chart: Software Revenues and New License Revenues
Source: http://redmonk.com/sogrady/2013/11/21/selling-software/
The Azure Cloud
Map legend: Operational, Announced
• Central US (Iowa)
• West US (California)
• North Europe (Ireland)
• East US (Virginia)
• East US 2 (Virginia)
• US Gov (Virginia)
• North Central US (Illinois)
• US Gov (Iowa)
• South Central US (Texas)
• Brazil South (Sao Paulo)
• West Europe (Netherlands)
• China North* (Beijing)
• China South* (Shanghai)
• Japan East (Saitama)
• Japan West (Osaka)
• India West (TBD)
• India East (TBD)
• East Asia (Hong Kong)
• SE Asia (Singapore)
• Australia West (Melbourne)
• Australia East (Sydney)
* Operated by 21Vianet
http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
Connecting R to SQL
mran.revolutionanalytics.com
Demo
• Using ODBC to connect R to SQL
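The demo itself is not reproduced in these slides. As a rough sketch of the general pattern (open a database connection, send a SQL query, and pull only the result into the analysis session), the example below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for an ODBC connection to SQL Server. It is not the code shown in the demo. The `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def load_sample_data(conn):
    """Create a small hypothetical 'sales' table to query against."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [
        ("East", 120.0),
        ("East", 80.0),
        ("West", 200.0),
        ("West", 50.0),
        ("West", 50.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def mean_amount_by_region(conn):
    """Let the database aggregate, so only summary rows cross the connection."""
    cursor = conn.execute(
        "SELECT region, AVG(amount), COUNT(*) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    return {region: (avg, n) for region, avg, n in cursor.fetchall()}


# An in-memory SQLite database stands in for the real ODBC data source.
conn = sqlite3.connect(":memory:")
load_sample_data(conn)
summary = mean_amount_by_region(conn)
for region, (avg, n) in summary.items():
    print(f"{region}: mean={avg:.1f} over {n} rows")
conn.close()
```

Pushing the aggregation into the query keeps the data transferred small, which matters once the table no longer fits in the memory of the analysis session.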
Solving the scalability problem with R
Revolution R Enterprise (RRE) is the big data, big analytics platform based on open source R
Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Stochastic Gradient Boosted Decision Trees
Simulation
• Monte Carlo
• Parallel Random Number Generation
Variable Selection
• Stepwise Regression: Linear, Logistic and GLM
Combination
• Using Revolution rxDataStep and rxExec functions to combine open source R with Revolution R
• PEMA API
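The idea behind the scalable functions listed above can be illustrated with a small external-memory calculation: process the data one chunk at a time, keep only a compact summary per chunk, and merge the summaries. The sketch below is plain Python, not RevoScaleR or the PEMA API. It assumes the standard pairwise formula for merging means and sums of squared deviations, and the function names and sample chunks are hypothetical.

```python
import math


def chunk_stats(chunk):
    """Return (n, mean, m2) for one chunk; m2 is the sum of squared deviations."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge two partial summaries without revisiting the underlying rows."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def chunked_mean_sd(chunks):
    """Compute the mean and sample standard deviation over an iterable of chunks."""
    total = None
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        part = chunk_stats(chunk)
        total = part if total is None else combine(total, part)
    n, mean, m2 = total
    return mean, math.sqrt(m2 / (n - 1))


# Hypothetical data, split into chunks as if read from disk piece by piece.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, sd = chunked_mean_sd(chunks)
print(f"mean={mean:.3f}, sd={sd:.3f}")
```

Because `combine` only needs the per-chunk summaries, chunks could in principle be summarized on different cores or machines and merged afterwards, so memory use stays bounded by the chunk size rather than the size of the data set.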
Demo
• Using RRE to solve the scalability problem
R in SQL Server
Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility
R Integration
Microsoft Azure Machine Learning Marketplace
New R scripts
SQL Server 2016
Run R script
• Use your preferred R IDE
• Set compute context to SQL Server
• Use RevoScaleR rx functions
Create SQL query
• Create stored procedure
• Execute directly in SSMS query
Demo
• Using RRE directly in SQL-Server
Demo
• Running R inside a SQL stored procedure
Moving your workflow to the cloud
Model
• Model in cloud
• Model in SQL Server using Revolution R
• Model on a sample of data
Score
• Score in cloud
• Score in SQL Server
• Score using R
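One of the workflows named on this slide, modelling on a sample and then scoring the full data, can be sketched as follows. This is an illustrative Python example, not Revolution R code: the data is synthetic and noiseless so the result is easy to check, and the one-predictor least-squares fit, the sample size, and the chunk size are all assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def score(model, xs):
    """Apply a fitted (intercept, slope) model to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Hypothetical "full" data set that stays where it is stored.
full_x = [float(i) for i in range(1000)]
full_y = [3.0 + 2.0 * x for x in full_x]

# Step 1: fit the model on a small random sample of rows.
random.seed(42)
sample_idx = random.sample(range(len(full_x)), 50)
model = fit_line([full_x[i] for i in sample_idx],
                 [full_y[i] for i in sample_idx])

# Step 2: score the full data in chunks, as a database or cloud service might.
scores = []
for start in range(0, len(full_x), 250):
    scores.extend(score(model, full_x[start:start + 250]))

print(f"intercept={model[0]:.2f}, slope={model[1]:.2f}, scored {len(scores)} rows")
```

The model object is small compared with the data, which is why fitting and scoring can happen in different places: only the coefficients need to move between the modelling and scoring environments.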
Andrie de Vries
Senior Program Manager
R Community Projects
@RevoAndrie
adevries@microsoft.com

Taking R Analytics to SQL and the Cloud


Editor's Notes

  • #10 Fantasy Football: http://blog.revolutionanalytics.com/2013/10/fantasy-football-modeling-with-r.html
  • #13 Infinite scale, inexpensively. Tons of data from which you actually have to get value. Customers that have a very high expectation of service and connection (Pier 1 is a great example). An influx of new talent to fill a very big gap, which McKinsey says is 300 thousand in the US alone. But the market this new talent is entering is still filled with barriers.
  • #15 Over the last few years we’ve truly delivered a huge infrastructure to enable us to grow our services at scale around the globe. Whether it’s our flagship facilities in Quincy, Washington or Boydton, Virginia, or some of the newly announced facilities in Shanghai, Australia and Brazil, it really is key for us to make smart investments around the world to deliver services in a resilient and reliable fashion. A lot of people ask, what goes into site selection at Microsoft and how do we decide where to place our datacenter investments? There are over thirty-five factors in our site selection criteria. But really, the top elements are around proximity to customers and energy and fiber infrastructure, ensuring that we have the capacity and the growth platforms to be able to grow our services. Another key element is about skilled workforce. We need to ensure that we have the right people to run and operate our datacenters on a day to day basis.
  • #26 Enterprise readiness, performance architecture, Big Data analytics, data source integration, development tools, deployment tools