Spark on Hadoop is highly scalable. Cloud computing is highly scalable. R, the extensible open source data science software, not so much. But what happens when we combine Spark on Hadoop, cloud computing, and Microsoft R Server into a scalable data science platform? Imagine what it would be like if you could explore, transform, and model data of any size from your favorite R environment. Now imagine deploying the resulting models, with just a few clicks, as a scalable, cloud-based web service API. In this session, Sascha Dittmann shows how you can use your R code, thousands of open source R packages, and distributed implementations of the most popular machine learning algorithms to do exactly that. He shows how to create an HDInsight Spark cluster including a Microsoft R Server cluster, and how to deploy the resulting model in SQL Server or as a Swagger-based API for application developers.
Microsoft R - Data Science at Scale
2. What is R?
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 10000+ free algorithms in CRAN
• Scalable to big data
• New and recent grads use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration
3. Challenges posed by open source R
• Lack of Commercial Support
• Inadequate Modeling Performance
• Complex Deployment Processes
• Limited Data Scale
4. R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility
6. Convergence with Flexibility
Scalable Algorithms
R: Write Once Deploy Anywhere
Templates & Samples
Microsoft R Server Family
R & Python to AML Interop.
Cortana Intelligence
11. DI
R+CRAN
MicrosoftR
DistributedR
DeployR
DevelopR
ScaleR
ConnectR
Delivers High Performance Parallel Distributed Analytics Across Individual and Clustered Systems
• Cloudera
• Hortonworks
• MapR
• Apache Spark
• IBM Platform LSF
• Microsoft HPC
Clusters
• SQL Server
• Teradata
Database
• Red Hat
• SuSE Servers
• Windows
DistributedR
12.
### LOAD SCALER (loaded by default in Microsoft R Server) ###
library(RevoScaleR)
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
hdfsFS
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = localFS)
Local Parallel processing – Linux or Windows
In – Hadoop
ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce
Compute context R script – sets where the model will run
Functional model R script – does not need to change to run in Hadoop
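The split between a compute context script and a functional model script could be sketched roughly as follows. This is only a sketch, assuming RevoScaleR (shipped with Microsoft R Server) is installed and that the AirlineDemoSmall.xdf sample file is present in the RevoScaleR sample data directory. The function name `run_airline_analysis` is hypothetical and does not appear on the slides.

```r
# Functional model script: contains no reference to where it runs.
# (run_airline_analysis is a hypothetical helper, not part of RevoScaleR.)
run_airline_analysis <- function(data_source) {
  list(
    summary = RevoScaleR::rxSummary(~ ArrDelay + DayOfWeek, data = data_source),
    model   = RevoScaleR::rxLinMod(ArrDelay ~ DayOfWeek + 0, data = data_source)
  )
}

# Compute context script: decides where the functional script executes.
# Guarded so the sketch does nothing on plain open source R.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Local parallel execution on Linux or Windows.
  rxSetComputeContext("localpar")
  local_data <- RxXdfData(
    file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"),
    fileSystem = RxNativeFileSystem()
  )
  local_results <- run_airline_analysis(local_data)
  print(local_results$summary)

  # To run in Hadoop instead, only this part would change, for example:
  #   rxSetComputeContext(RxHadoopMR())
  #   and an RxXdfData source created with fileSystem = RxHdfsFileSystem()
  # The call to run_airline_analysis() itself would stay the same.
}
```

Keeping the analysis inside a function that only receives a data source makes the slide's point visible: switching from local to Hadoop execution touches the compute context and data source lines, not the modeling code.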
Copyright Microsoft Corporation. All rights reserved.
17.
Data Step
▪ Data import – Delimited, Fixed, SAS, SPSS, ODBC
▪ Variable creation & transformation
▪ Recode variables
▪ Factor variables
▪ Missing value handling
▪ Sort, Merge, Split
▪ Aggregate by category (means, sums)
Descriptive Statistics
▪ Min / Max, Mean, Median (approx.)
▪ Quantiles (approx.)
▪ Standard Deviation
▪ Variance
▪ Correlation
▪ Covariance
▪ Sum of Squares (cross product matrix for set variables)
▪ Pairwise Cross tabs
▪ Risk Ratio & Odds Ratio
▪ Cross-Tabulation of Data (standard tables & long form)
▪ Marginal Summaries of Cross Tabulations
Statistical Tests
▪ Chi Square Test
▪ Kendall Rank Correlation
▪ Fisher’s Exact Test
▪ Student’s t-Test
Sampling
▪ Subsample (observations & variables)
▪ Random Sampling
Predictive Models
▪ Sum of Squares (cross product matrix for set variables)
▪ Multiple Linear Regression
▪ Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
▪ Covariance & Correlation Matrices
▪ Logistic Regression
▪ Classification & Regression Trees
▪ Predictions/scoring for models
▪ Residuals for all models
Cluster Analysis
▪ K-Means
Classification
▪ Decision Trees
▪ Decision Forests
▪ Gradient Boosted Decision Trees
▪ Naïve Bayes
Variable Selection
▪ Stepwise Regression
Simulation
▪ Simulation (e.g. Monte Carlo)
▪ Parallel Random Number Generation
Combination
▪ rxDataStep
▪ rxExec
▪ PEMA-R API Custom Algorithms
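The idea behind chunk-wise ("external memory") algorithms, such as the cross product matrix and linear regression entries listed above, can be illustrated in plain open source R. This is a minimal sketch under stated assumptions, not ScaleR's actual implementation: the data is simulated, and the names `process_chunk`, `chunk_size`, and `sim` are made up for illustration. Each chunk contributes only its small cross product matrices, which are then summed and solved once.

```r
# Simulated data standing in for a data set too large to process at once.
set.seed(42)
n <- 10000
sim <- data.frame(x1 = rnorm(n), x2 = runif(n))
sim$y <- 1.5 + 2 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.1)

# Assign every row to a chunk of at most chunk_size rows.
chunk_size <- 1000
chunk_id <- ceiling(seq_len(n) / chunk_size)

# Step 1: per chunk, compute only the cross products X'X and X'y.
process_chunk <- function(chunk) {
  X <- model.matrix(~ x1 + x2, data = chunk)
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}
partials <- lapply(split(sim, chunk_id), process_chunk)

# Step 2: combine the intermediate results by summing them.
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Step 3: solve the normal equations once on the combined matrices.
chunked_coef <- drop(solve(xtx, xty))

# Reference fit on the full data for comparison.
full_coef <- coef(lm(y ~ x1 + x2, data = sim))
print(rbind(chunked = chunked_coef, full = full_coef))
```

Because step 1 is independent per chunk, the `lapply` call could in principle be handed to a parallel or distributed backend, which is the property that lets this style of algorithm scale beyond memory.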