Performance and Scale Options for R with Hadoop: A comparison of potential architectures

R and Hadoop go together. In fact, they go together so well that the number of available options can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.

Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?

This webinar is intended to help new users of R with Hadoop select the best architecture for integrating the two by explaining the benefits of several popular configurations, along with their performance potential, workload handling, programming model, and administrative characteristics.

Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We’ll then compare the configurations on performance as well as programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses vs. mixed workloads.

Published in: Data & Analytics

  1. 1. R and Hadoop: Architectural Options Bill Jacobs VP Product Marketing & Field CTO, Revolution Analytics @bill_jacobs
  2. 2. Polling Question #1:  Who Are You? (choose one) – Statistician or modeler who uses R – Other R developer – Hadoop Expert – Application builder – Data guru – Business user – Systems vendor or reseller – Something else…
  3. 3. Agenda • Challenges • Options • Considerations • How to Choose
  4. 4. Boundless Opportunities  Marketing: Clickstream & Campaign Analyses  Digital Media: Recommendation Engines  Retail: Social Sentiment Analysis  Insurance: Fraud Waste and Abuse  Healthcare Delivery: Outcome Prediction  Manufacturing: Quality Optimization  P&C Insurance: Risk Analysis  Consumer Products: Warranty Optimization  Operations: Supply Chain Optimization  Econometrics: Market Prediction  Marketing: Mix and Price Optimization  Life Sciences: Pharmacogenetics  Transportation: Asset Utilization
  5. 5. Polling Question #2:  What Industry Do You Represent? – Financial Services – Insurance – Healthcare, Life Sciences or Pharma – Manufacturing – Energy – Retail – Logistics and Transportation – Education – Government – Marketing & Advertising – Technology – Other
  6. 6. In A Perfect World… Analytical Capability Compute Data Scale Users Price Ease Security
  7. 7. Hadoop Analytics - Many Alternatives  R Based Alternatives  Legacy tools updated – SAS HPA, etc.  Big Data Databases  Other Languages – Scala, Java, Julia, various GUIs Today’s Topic:  R-Based Alternatives – “Beside Architectures” – “Inside Architectures” – Open Source and Commercial
  8. 8. Reality: Tradeoffs. Memory Limits In-Memory vs. Shared Infrastructure CRAN vs. Parallelization Desktop vs. Remote Explicit vs. Automatic Distribution Locality vs. Movement Real-Time vs. MapReduce Traditional Statistics vs. Machine Learning
  9. 9. No Magic Bullet.
  10. 10. Corporate Overview & Quick Facts • Founded: 2008 (as REvolution Computing) • Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London • CEO: David Rich • Number of customers: 200+ • Investors: Northbridge Venture Partners, Intel Capital, Platform Vendor • Web site: www.revolutionanalytics.com • Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language
  11. 11. Revolution Analytics Our Vision: R becomes the de facto standard for enterprise predictive analytics Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges
  12. 12. Revolution Analytics Builds & Delivers:  Software Products:  Stable Distributions  Broad Platform Support  Big Data Analytics in R  Application Integration  Deployment Platforms  Agile Development Tooling  Future Platform Support  Support & Services  Commercial Support Programs  Training Programs  Professional Services  Community Programs  Academic Support Programs  Contributions to Open Source R  Open Source Extensions  Sponsorship of R User Groups
  13. 13. Revolution Analytics Technical Innovations  R Options from Open Source to Enterprise  Parallelized Analytical Computation  In-Database & In-Hadoop Analytics  Big Data Scalability  Remote Execution  Production Deployment Support  Multi-Platform Deployment  Legacy Data Format Support  Multiple IDE Options  PMML Model Export
  14. 14. The Revolution R Product Suite. Revolution R Open: • Free and open source R distribution • Enhanced and distributed by Revolution Analytics. Revolution R Plus: • Open-source distribution of R, packages, and other components • Enhanced, supported and indemnified by Revolution Analytics. Revolution R Enterprise: • Secure, Scalable and Supported Distribution of R • With proprietary components created by Revolution Analytics
  15. 15. Polling Question #3:  State Play: In your company you are… – Building Our “Data Lake” – Running R + Hadoop Data Today – Running R inside Hadoop using Open source – Running RRE inside Hadoop – Deploying Business Apps. Using Analytics from Hadoop Data – Looking at Next Steps e.g. Spark, etc.
  16. 16. Revolution Analytics: Eight Alternatives for Integrating R & Hadoop Open Source 1. Open Source R 2. Revolution R Open 3. Open Source Parallelization on Workstations & Servers 4. rHadoop: Open Source Parallelization with rHadoop Commercial 5. Revolution R Enterprise on Servers & Workstations 6. Revolution R Enterprise on Edge Nodes 7. Revolution R Enterprise Inside Hadoop 8. Combined Edge Node & Inside Hadoop
  17. 17. 1. Open Source R Integrated With Hadoop • Traditional Open Source • Memory-Limited • Data Moves Traditional Open Source R “Beside” Architecture: CRAN Algorithms rHDFS rHbase rHive rODBC
  18. 18. 2. Revolution R Open On Workstations & Servers Replace Open Source R “Beside” Architecture with Revolution R Open As with Open Source R: • Still Free. • Still Memory Based. • Data Still Moves. Improvements: • Accelerates Math with Intel MKL • Improves R-based packages Limitations: • No Effect for non-R Code CRAN Algorithms rHDFS rHbase rHive rODBC
  19. 19. Accelerate R Math with Intel Math Kernel Lib’s. Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html
  20. 20. 3. Write Parallel Algorithms PC, Server or Clusters Write R Code to Explicitly Parallelize – Deploy Across Several Systems Can Include CRAN Algorithms “Carefully” foreach & iterators • doParallel (PC, server) • doMPI (cluster) • RRE rxExec Example Uses: • Bootstrapping • Simulation • HPC rHDFS rHbase rHive rODBC As with Previous: • Still Free. • Still Memory Based. • Data Still Moves. • Intel MKL with RRO Improvements: • Parallelized Execution Limitations: • Parallelization Difficulty • Data Movement • Platform Specific
  21. 21. 4. rHadoop: Custom Parallel Execution for Hadoop Remote Desktop R Code Execute R Code & CRAN Algorithms Inside Hadoop Example Uses: • Scoring • Transformation • Easily Parallelized Algorithms Hadoop Streaming Can Include CRAN Algorithms As With Previous:  Still Free.  Optional Intel MKL in RRO Improvements:  Runs R in MapReduce  No Data Movement Limitations:  Manual Parallelization  Hadoop Specific rHbase rHDFS rMapReduce
  22. 22. 5. Revolution R Enterprise (RRE) PEMAs inside Hadoop Traditional “Beside” Architecture with Optimized Algorithms Available for Windows, Linux As With Previous: • Includes Intel MKL in RRO Advantages: • Speed: PEMAs Parallelize Across Threads, Cores & Sockets • Scale: PEMAs “Chunk” - no Memory Limits • All of CRAN Available • Portability • Fully Supported Limitations: • Data Movement • Single Machine Revolution R Enterprise: • ScaleR PEMA Algorithms plus • All of CRAN (subject to memory limits) rHDFS rHbase rHive rODBC
  23. 23. Revolution R Enterprise is… the only big data big analytics platform based on open source R • High Performance, Scalable Analytics • Portable Across Enterprise Platforms • Easier to Build & Deploy Analytics
  24. 24. ScaleR Refactor Algorithms for Dramatic Performance and Capacity Improvement
  25. 25. ScaleR High Performance Algorithms for the Most Common Uses. Data Step: • Data import – Delimited, Fixed, SAS, SPSS, ODBC • Variable creation & transformation • Recode variables • Factor variables • Missing value handling • Sort, Merge, Split • Aggregate by category (means, sums). Descriptive Statistics: • Min / Max, Mean, Median (approx.) • Quantiles (approx.) • Standard Deviation • Variance • Correlation • Covariance • Sum of Squares (cross product matrix for set variables) • Pairwise Cross tabs • Risk Ratio & Odds Ratio • Cross-Tabulation of Data (standard tables & long form) • Marginal Summaries of Cross Tabulations. Statistical Tests: • Chi Square Test • Kendall Rank Correlation • Fisher’s Exact Test • Student’s t-Test. Sampling: • Subsample (observations & variables) • Random Sampling. Predictive Models: • Sum of Squares (cross product matrix for set variables) • Multiple Linear Regression • Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions. • Covariance & Correlation Matrices • Logistic Regression • Classification & Regression Trees • Predictions/scoring for models • Residuals for all models. Cluster Analysis: • K-Means. Classification: • Decision Trees • Decision Forests • Gradient Boosted Decision Trees. Variable Selection: • Stepwise Regression. Simulation: • Simulation (e.g. Monte Carlo) • Parallel Random Number Generation. Combination New in 7.3 • PEMA-R API • rxDataStep • rxExec
  26. 26. ScaleR PEMA What’s a PEMA? Parallel External Memory Algorithms Master Algorithm Process Data Analyze Each Block • Not Limited to Available Memory • Unlimited Data Scale • Ingests Data One Chunk At A Time. • Adjustable Memory Footprint • Multi-Thread Execution Performance • Highly-Optimized Algorithms • Algorithm Math Fully Refactored for Parallelism • Delivered as ScaleR Library in Revolution R Enterprise Load Block At A Time Combine Individual Results Script Calls ScaleR Algorithm Scripts can call CRAN Open Source Algorithms Start & Manage Processing
  27. 27. 6. Run Revolution R Enterprise on Hadoop Edge Node(s) rHDFS rHbase rHive rODBC Local File System (opt.) ScaleR + CRAN Algorithms Fast Single-Server Alternative for Modest Data Scale Edge Node Thin Client or Remote Desktop As With Previous: • Single Machine Execution • PEMA Scale & Speed (Single Machine) • Use ScaleR + CRAN • Accelerate R with Intel MKL Improvements: • Easily Shared via • No Data Movement • Develop on Desktop, Run on Edge Node Limitations: • “Shorter Trip” for Data
  28. 28. 7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce jobtracker ScaleR Algorithms DeployR Fast Parallelized Analytics on Large Data Sets In Hadoop As With Previous: • Speed and Scale of ScaleR PEMA Algorithms • Use CRAN Where Appropriate • Accelerate R Math with MKL • Custom Parallelized Algo’s Advantages: • Parallel Computation • No Data Movement • ScaleR PEMA Parallelization • Can Parallelize CRAN “Carefully” • Portable Coding Limitations: • Hadoop Workload Profiles Web Services Remote Execution Desktop & Server Tools and Applications
  29. 29. One Client’s Experience with RRE on Hadoop. Test Cluster - 9 Nodes: 9 Task Nodes (64GB, 24 cores each), 2 Admin Nodes (128GB, 24 cores each), 1 Edge Node (128GB, 24 cores each). Task: Processing Time. Importing and Filtering Datasets from HDFS • 14 Million Observations: 82 sec. • 227 Million Observations: 310 sec. Modeling and Estimation • 1.2 M Correlations: 2771 sec. • Simple Linear Regression, 227 M Observations: 61 sec. • Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec. • Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec. • Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
  30. 30. 8. Combined Edge Node & In-Hadoop ScaleR Algorithms DeployR Maximized Flexibility, Performance & Workload Handling As With Previous: • Speed and Scale of ScaleR PEMA Algorithms • Use CRAN Where Appropriate • Accelerate R Math with MKL • Custom Parallelized Algo’s Advantages: • Flexibility for Blended Workloads • Little or No Data Movement • Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes Web Services Thin Client Development Remote Execution Desktop & Server Tools and Applications rStudio
  31. 31. Occasionally Conflicting Criteria Infrastructure Criteria:  Big Data Platform  Vendor Choice  Data Ingest  Data Security  Data Governance Data Science Criteria:  Performance  Self Service  Flexibility  Collaboration  Sharing  Capability
  32. 32. Key Questions:  Where are the bulk of your skills? SAS? R? Java? Python? SQL?  Where do you build models today?  Do you have the skills to parallelize algorithms?  Can models be built on a big shared server?  How will you run models?  Do you have the budget to purchase commercial solutions?  How will your needs change over time?  What is your future architecture plan?  How risk averse is your management team regarding new platforms and open source?
  33. 33. Key Questions (cont.) • What Workloads Do You Anticipate? — How Many Users? — What Workloads? • Workload Realities: — Many small tasks do not run well in MapReduce — Large data movements / duplications are costly • What Use Cases Will You Encounter? — Traditional statistical exploration, modeling? — Behavior Prediction? — Outlier Detection? — Simulation and HPC? — Massively wide data? — Real-Time scoring? — Internet of Things?
  34. 34. Eight Steps to Fast, Scalable R Analytics with Hadoop Open Source Options 1. Open Source R 2. Revolution R Open 3. Open Source Parallelization… 4. rHadoop… Commercial Options 5. RRE on Servers & Workstations 6. RRE on Edge Nodes 7. RRE Inside Hadoop 8. RRE on Edge Node & Inside Hadoop No Clear Winner: • Budget & use case determine optimal path • Compelling options in both open source & commercial source • RRE ScaleR uniquely provides automatic parallelization • Current Hadoop platforms are fast for large scale analytics • Combined in-server & in-Hadoop fits majority of cases
  35. 35. 2015 Challenges & Opportunities • Evolving Hadoop Architectures • In-Memory Analytics – Spark, YARN Containers, Caching • Additional Algorithm Parallelization • Cluster Management • Cloud and Hybrid Cloud Clusters • SQL on Hadoop “Battle-Royale” • Addressing the Resource Reality • Integration, Deployment Both Drain on Expensive Resources • Leverage other skills • Design efficient collaboration • “Analytics for the Rest of Us” • New Consumption Targets – Mobile • New Participants in Design – Business Users
  36. 36. Recommended Resources • Revolution Analytics Products – http://www.revolutionanalytics.com/products – http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws • Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop” – http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop • Revolution Analytics on Social Media: – http://blog.revolutionanalytics.com/ – @revolutionr on Twitter – @bill_jacobs on Twitter
  37. 37. Thank you. www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR
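
Slide 20 describes explicit parallelization, with bootstrapping as an example use: the analyst splits the work into independent tasks, runs them side by side, and combines the results. The slide refers to R tooling for this; the sketch below only illustrates the same pattern in Python. The function names, the seeds and the use of a thread pool are assumptions made for illustration (a process pool or a cluster backend would be the usual choice for CPU-bound work).

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_means(data, n_reps, seed):
    """Run n_reps bootstrap replicates of the sample mean with a private RNG."""
    rng = random.Random(seed)  # one generator per task keeps tasks independent
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_reps)]


def parallel_bootstrap(data, total_reps=1000, n_tasks=4, base_seed=42):
    """Split the replicates into n_tasks pieces, run them concurrently, combine."""
    # Explicit partitioning: the caller decides how the work is divided.
    reps_per_task = [total_reps // n_tasks] * n_tasks
    for i in range(total_reps % n_tasks):
        reps_per_task[i] += 1
    seeds = [base_seed + i for i in range(n_tasks)]

    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        parts = pool.map(bootstrap_means, [data] * n_tasks, reps_per_task, seeds)
        # Combine step: flatten the per-task results into one list.
        return [m for part in parts for m in part]


if __name__ == "__main__":
    sample = [float(x) for x in range(100)]
    means = parallel_bootstrap(sample, total_reps=200)
    print(len(means), min(means), max(means))
```

Note that the whole sample is handed to every task, which mirrors the "Still Memory Based" and "Data Still Moves" caveats on the slide.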
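
Slide 21 lists scoring and transformation as workloads that parallelize easily under MapReduce, provided the analyst writes the map and reduce steps by hand. A minimal in-process sketch of that programming model follows, in Python rather than R. The model weights, field names and the `run_mapreduce` driver are hypothetical; a real Hadoop job would distribute the map and reduce phases across nodes instead of looping locally.

```python
from collections import defaultdict

# Hypothetical linear-model coefficients, used for illustration only.
WEIGHTS = {"age": 0.02, "visits": 0.5}


def mapper(record):
    """Map step: score one record and emit a (key, value) pair."""
    score = sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    yield record["segment"], score


def reducer(key, values):
    """Reduce step: average all scores that share a key."""
    values = list(values)
    return key, sum(values) / len(values)


def run_mapreduce(records):
    """Local stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    data = [
        {"segment": "a", "age": 50, "visits": 2},
        {"segment": "a", "age": 0, "visits": 4},
        {"segment": "b", "age": 100, "visits": 0},
    ]
    print(run_mapreduce(data))
```

Each record is scored independently in the map step, which is why the slide calls these workloads easy to parallelize; algorithms that need several passes over the data fit this model less naturally.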
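
Slide 26 describes a Parallel External Memory Algorithm as loading one block at a time, analyzing each block, and combining the individual results, so that memory use does not grow with the data. The sketch below shows that load, analyze, combine pattern for a mean and variance in Python. It is not ScaleR code: the function names and the chunk size are assumptions, and the blocks are processed sequentially here, whereas a PEMA could analyze them on several threads or nodes because the combine step is order-independent.

```python
from itertools import islice


def read_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def summarize(chunk):
    """Analyze one block: count, mean, and sum of squared deviations."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def combine(a, b):
    """Merge two block summaries without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def chunked_mean_variance(values, chunk_size=1000):
    """Return (n, mean, sample variance) while holding one chunk in memory."""
    total = None
    for chunk in read_chunks(values, chunk_size):
        part = summarize(chunk)
        total = part if total is None else combine(total, part)
    if total is None:
        raise ValueError("no data")
    n, mean, m2 = total
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


if __name__ == "__main__":
    print(chunked_mean_variance(range(1, 101), chunk_size=7))
```

The chunk size plays the role of the "Adjustable Memory Footprint" on the slide: smaller chunks lower peak memory at the cost of more combine steps.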
