Taking R Analytics to SQL and the Cloud
WHO
The leading provider
of advanced analytics
software and services
based on open source R,
since 2007
WHAT
REVOLUTION R: The
enterprise-grade predictive
analytics application platform
based on the R language
WHERE
“This acquisition will help
customers use advanced
analytics within Microsoft data
platforms”
-- Joseph Sirosh, CVP C+E
7. • Most widely used data analysis software
• Most powerful statistical programming language
• Create beautiful and unique data visualizations
• Thriving open-source community
• Fills the talent gap
www.revolutionanalytics.com/what-is-r
8. 1993
• Research project in Auckland, NZ
1995
• Open source
1997
• R-core
2000
• R-1.0.0
2003
• R Foundation
2004
• First UseR!
2009
• New York Times
2015
• R-3.2.0
• R Consortium
Photo credit: Robert Gentleman
9. The New York Times
Interactive Features
• Election Forecast
• Dialect Quiz
Data Journalism
• NFL Draft Picks
• Wealth distribution in USA
12. Software Revenues New License Revenues
http://redmonk.com/sogrady/2013/11/21/selling-software/
13. The Azure Cloud
Operational Announced
• Central US: Iowa
• West US: California
• North Europe: Ireland
• East US: Virginia
• East US 2: Virginia
• US Gov: Virginia
• North Central US: Illinois
• US Gov: Iowa
• South Central US: Texas
• Brazil South: Sao Paulo
• West Europe: Netherlands
• China North *: Beijing
• China South *: Shanghai
• Japan East: Saitama
• Japan West: Osaka
• India West: TBD
• India East: TBD
• East Asia: Hong Kong
• SE Asia: Singapore
• Australia West: Melbourne
• Australia East: Sydney
* Operated by 21Vianet
24. Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Stochastic Gradient Boosted Decision Trees
Simulation
• Monte Carlo
• Parallel Random Number Generation
Variable Selection
• Stepwise Regression Linear, Logistic and GLM
Combination
• Using Revolution rxDataStep and rxExec functions to combine open source R with Revolution R
• PEMA API
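As a rough illustration of how a few of the function groups above fit together, the sketch below imports a delimited file, summarizes it, and fits a linear regression. The slide does not name specific functions; rxImport, rxSummary and rxLinMod, and the AirlineDemoSmall.csv sample file, are chosen here as assumptions for illustration. The block does nothing unless the RevoScaleR package is installed.

```r
# Illustrative sketch only: the function and file choices are assumptions,
# not taken from the slide. Skips itself when RevoScaleR is not installed.
scaler_available <- requireNamespace("RevoScaleR", quietly = TRUE)

if (scaler_available) {
  library(RevoScaleR)

  # Sample file shipped with RevoScaleR; "M" marks missing values in it.
  csv_file <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")
  xdf_file <- tempfile(fileext = ".xdf")

  # Data step: import delimited text into an on-disk .xdf file.
  rxImport(inData = csv_file, outFile = xdf_file,
           missingValueString = "M", stringsAsFactors = TRUE,
           overwrite = TRUE)

  # Descriptive statistics for two numeric columns.
  print(rxSummary(~ ArrDelay + CRSDepTime, data = xdf_file))

  # Predictive model: linear regression of arrival delay on day of week.
  delay_model <- rxLinMod(ArrDelay ~ DayOfWeek, data = xdf_file)
  print(summary(delay_model))
} else {
  message("RevoScaleR is not installed; skipping the sketch.")
}
```

The same pattern (data source in, model object out) should carry over to the other model families listed, though argument details differ per function.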
28. Data Scientist
Interact directly with data
Built-in to SQL Server
Data Developer/DBA
Manage data and
analytics together
Example Solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Relational Data
Analytic Library
T-SQL Interface
Extensibility
R Integration
Microsoft Azure
Machine Learning Marketplace
New R scripts
SQL Server 2016
29. • Use your preferred R IDE
• Set compute context to SQL Server
• Use RevoScaleR rx functions
Run R script
• Create stored procedure
• Execute directly in SSMS query
Create SQL
query
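The "Run R script" path above might look roughly like the following sketch, written as a function so that nothing touches a server until it is called. It assumes the RevoScaleR package and a reachable SQL Server instance; the function name fit_in_sql_server and the connection string, table and column names in the usage comment are hypothetical placeholders.

```r
# Sketch only: needs RevoScaleR and a reachable SQL Server instance to run.
# fit_in_sql_server is a hypothetical helper name, not part of any package.
fit_in_sql_server <- function(connection_string, table_name, model_formula) {
  # Remember the active compute context so it can be restored afterwards.
  previous_context <- RevoScaleR::rxGetComputeContext()
  on.exit(RevoScaleR::rxSetComputeContext(previous_context), add = TRUE)

  # Set the compute context to SQL Server.
  sql_context <- RevoScaleR::RxInSqlServer(connectionString = connection_string)
  RevoScaleR::rxSetComputeContext(sql_context)

  # Data source object that refers to a table in the database.
  sql_data <- RevoScaleR::RxSqlServerData(
    table = table_name,
    connectionString = connection_string
  )

  # Use a RevoScaleR rx function against that data source.
  RevoScaleR::rxLinMod(model_formula, data = sql_data)
}

# Hypothetical usage; every value below is a placeholder.
# fit <- fit_in_sql_server(
#   "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=True",
#   "dbo.Sales",
#   revenue ~ region
# )
```

Restoring the previous compute context on exit keeps later code in the same session from unintentionally running against the server.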
36. Model
• Model in cloud
• Model in SQL Server using Revolution R
• Model on a sample of data
Score
• Score in cloud
• Score in SQL Server
• Score using R
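The "model on a sample of data, then score" pattern can be sketched in base R as below. This is a stand-in, not the Revolution R or SQL Server workflow itself: the built-in mtcars data frame plays the role of the full table, and lm and predict stand in for whichever modeling and scoring functions are actually used.

```r
# Base R stand-in for "model on a sample, score the full data".
set.seed(42)                                  # reproducible sample
full_data <- mtcars                           # stand-in for the full table

# Model: fit on a random sample of 20 rows.
sample_rows <- sample(nrow(full_data), size = 20)
model <- lm(mpg ~ wt + hp, data = full_data[sample_rows, ])

# Score: one prediction for every row, including rows not in the sample.
scores <- predict(model, newdata = full_data)
scored <- cbind(full_data[, c("mpg", "wt", "hp")], predicted_mpg = scores)
print(head(scored))
```

Where the model is fitted and where the scoring runs can then be chosen independently, which is the point of the combinations listed on the slide.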
38. Andrie de Vries
Senior Program Manager
R Community Projects
@RevoAndrie
adevries@microsoft.com
Editor's Notes
Fantasy Football: http://blog.revolutionanalytics.com/2013/10/fantasy-football-modeling-with-r.html
Infinite scale inexpensively
Tons of data from which you actually have to get value
Customers that have a very high expectation of service and connection – Pier 1 great example
Influx of new talent to fill a very big gap, which McKinsey puts at 300 thousand in the US alone
But the market this new talent is entering is still filled with barriers
Over the last few years we’ve truly delivered a huge infrastructure to enable us to grow our services at scale around the globe. Whether it’s our flagship facilities in Quincy, Washington or Boydton, Virginia, or some of the newly announced facilities in Shanghai, Australia and Brazil, it really is key for us to make smart investments around the world to deliver services in a resilient and reliable fashion.
A lot of people ask, what goes into site selection at Microsoft and how do we decide where to place our datacenter investments? There are over thirty-five factors in our site selection criteria. But really, the top elements are around proximity to customers and energy and fiber infrastructure, ensuring that we have the capacity and the growth platforms to be able to grow our services.
Another key element is about skilled workforce. We need to ensure that we have the right people to run and operate our datacenters on a day-to-day basis.
Enterprise readiness
Performance architecture
Big Data analytics
Data source integration
Development tools
Deployment tools