Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Building a Scalable Data Science Platform with R

700 views

Published on

Building a Scalable Data Science Platform with R

Published in: Technology
  • Be the first to comment

Building a Scalable Data Science Platform with R

  1. 1. Building A Scalable Data Science Platform with R Mario Inchiosa, PhD Principal Software Engineer Hadoop Summit San Jose June 30, 2016
  2. 2. What is • The most popular statistical programming language • A data visualization tool • Open source • 2.5+M users • Taught in most universities • Thriving user groups worldwide • 8000+ contributed packages • New and recent grad’s use it Language Platform Community Ecosystem • Rich application & platform integration
  3. 3. Common R use cases Vertical Sales & Marketing Finance & Risk Customer & Channel Operations & Workforce Retail Demand Forecasting Loyalty Programs Cross-sell & Upsell Customer Acquisition Fraud Detection Pricing Strategy Personalization Lifetime Customer Value Product Segmentation Store Location Demographics Supply Chain Management Inventory Management Financial Services Customer Churn Loyalty Programs Cross-sell & Upsell Customer Acquisition Fraud Detection Risk& Compliance Loan Defaults Personalization Lifetime Customer Value Call Center Optimization Pay for Performance Healthcare Marketing Mix Optimization Patient Acquisition Fraud Detection Bill Collection Population Health Patient Demographics Operational Efficiency Pay for Performance Manufacturing Demand Forecasting Marketing mix Optimization Pricing Strategy Perf Risk Management Supply Chain Optimization Personalization Remote Monitoring Predictive Maintenance Asset Management
  4. 4. IEEE Spectrum July 2015 Data Flows Overwhelm Open Source R – In-Memory Operation – Lack of Parallelism – Expensive Data Movement & Duplication Not enterprise ready – Inadequacy of Community Support – Lack of Guaranteed Support Timeliness – No SLAs or Support models R Adoption is on a Tear, but Open Source R is not Enterprise Class
  5. 5. R from Microsoft brings Peace of mind Efficiency Speed and scalability Flexibility and agility
  6. 6. Portability & investment assurance Write Once – Deploy Anywhere R Server portfolio Cloud RDBMS Desktops & Servers Hadoop & Spark EDW R Server Technology
  7. 7. R Script for Execution in MapReduce Sample R Script: rxSetComputeContext( RxHadoopMR(…) ) inData <- RxTextData(“/ds/AirOnTime.csv”, fileSystem = hdfsFS) model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData) Define Compute Context Define Data Source Train Predictive Model
  8. 8. Easy to Switch From MapReduce to Spark Keep other code unchanged Sample R Script: rxSetComputeContext( RxSpark(…) ) Change the Compute Context
  9. 9. R Server: scale-out R, Enterprise Class! • 100% compatible with open source R • Any code/package that works today with R will work in R Server • Wide range of scalable and distributed R functions • Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict() • Ability to parallelize any R function • Ideal for parameter sweeps, simulation, scoring
  10. 10. Parallelized & Distributed Algorithms  Data import – Delimited, Fixed, SAS, SPSS, OBDC  Variable creation & transformation  Recode variables  Factor variables  Missing value handling  Sort, Merge, Split  Aggregate by category (means, sums)  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test ETL Statistical Tests  Subsample (observations & variables)  Random Sampling Sampling  Min / Max, Mean, Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations Descriptive Statistics  Sum of Squares (cross product matrix for set variables)  Multiple Linear Regression  Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.  Covariance & Correlation Matrices  Logistic Regression  Predictions/scoring for models  Residuals for all models Predictive Statistics  K-Means Clustering  Decision Trees  Decision Forests  Gradient Boosted Decision Trees  Naïve Bayes Machine Learning Simulation  Simulation (e.g. Monte Carlo)  Parallel Random Number Generation Custom Parallelization  rxDataStep  rxExec  PEMA-R API Variable Selection  Stepwise Regression
  11. 11. R Server Hadoop Architecture R R R R R R R R R R R Server Master R process on Edge Node Apache YARN and Spark Worker R processes on Data Nodes Data in Distributed Storage R process on Edge Node
  12. 12. R Server for Hadoop - Connectivity Worker Task R Server Master Task Finalizer Initiator Edge Node Worker Task Worker Task Remote Execution: ssh Web Services DeployR ssh or R Tools for Visual Studio BI Tools & Applications Jupyter Notebooks Thin Client IDEs https:// https:// or MapReduce
  13. 13. HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud SparkR functions RevoScaleR functions R Spark and Hadoop Blob Storage Data Lake Storage • Easy setup, elastic, SLA • Spark • Integrated notebooks experience • Upgraded to latest Version 1.6.1 • R Server • Leverage R skills with massively scalable algorithms and statistical functions • Reuse existing R functions over multiple machines
  14. 14. R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ElapsedTime Billions of rows Logistic Regression on NYC Taxi Dataset 2.2 TB
  15. 15. OperationalizeModelPrepare Typical advanced analytics lifecycle
  16. 16. • Clean/Join – Using SparkR from R Server • Train/Score/Evaluate – Scalable R Server functions • Deploy/Consume – Using AzureML from R Server Airline Arrival Delay Prediction Demo
  17. 17. • Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection • >20 years of data • 300+ Airports • Every carrier, every commercial flight • http://www.transtats.bts.gov Airline data set
  18. 18. • Hourly land-based weather observations from NOAA • > 2,000 weather stations • http://www.ncdc.noaa.gov/orders/qclcd/ Weather data set
  19. 19. Provisioning a cluster with R Server
  20. 20. Scaling a cluster
  21. 21. Clean and Join using SparkR in R Server
  22. 22. Train, Score, and Evaluate using R Server
  23. 23. Publish Web Service from R
  24. 24. • HDInsight Premium Hadoop cluster • Spark on YARN distributed computing • R Server R interpreter • SparkR data manipulation functions • RevoScaleR Statistical & Machine Learning functions • AzureML R package and Azure ML web service Demo Technologies
  25. 25. Building a genetic disease risk application with R Data • Public genome data from 1000 Genomes • About 2TB of raw data Processing • VariantTools R package (Bioconductor) • Match against NHGRI GWAS catalog Analytics • Disease Risk • Ancestry Presentation • Expose as Web Service APIs • Phone app, Web page, Enterprise applications BAM BAM BAM BAM VariantTools GWAS BAM Platform • HDInsight Hadoop (8 clusters) • 1500 cores, 4 data centers • Microsoft R Server
  26. 26. The Four Transformational Trends cloud computing 2011  2016 5x increase data science Universities filling 300,000 US talent gap 90% of the data in the world today has been created in the last two years alone big data open source including R, Linux, Hadoop
  27. 27. microsoft.com/r-server microsoft.com/hdinsight

×