R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers

The business case for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without moving data or writing their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.

Published in: Technology, Education

  1. 1. Revolution Confidential. Revolution Analytics & Cloudera Confidential. R + Hadoop: Ask Bigger (and New) Questions and Get Better, Faster Answers. Michele Chambers, Chief Strategy Officer & VP Product Mgmt; Jai Ranganathan, Director Product Mgmt & Strategy
  2. 2. Period of Disruption: 1st Generation Predictive Analytics
  3. 3. Today’s Challenge: Accelerating Business Cadence
      Changing Business Environment
        • Fact Based Decisions Require More Data
        • Need to Understand Tradeoffs and Best Course of Action
        • Predictive Models Need to Continually Deliver Lift
        • Reduced Shelf Life for Predictive Models
      Faster Time to Value
        • Reduce Analytic Cycle Time
        • Build & Deploy Models Faster
        • Eliminate Time Consuming Data Movements
      Rapid Customer Facing Decisions
        • Score More Frequently
        • Need to Make Best Decision in Real Time
  4. 4. 2nd Generation Modern Analytics: Big Data, Machine Learning, Quick to Fail, Lift
  5. 5. Typical Technology Challenges Our Customers Face
      Big Data
        • New Data Sources
        • Data Variety & Velocity
        • Fine Grain Control
        • Data Movement, Memory Limits
      Complex Computation
        • Experimentation
        • Many Small Models
        • Ensemble Models
        • Simulation
      Enterprise Readiness
        • Heterogeneous Landscape
        • Write Once, Deploy Anywhere
        • Skill Shortage
        • Production Support
      Production Efficiency
        • Shorter Model Shelf Life
        • Volume of Models
        • Long End-to-End Cycle Time
        • Pace of Decision Accelerated
  6. 6. Big Data Big Analytics is different
  7. 7. Revolution Confidential 7
  8. 8. y=ax+b
  9. 9. y=ax+b (repeated eight times on the slide)
  10. 10. New model; Existing model
  11. 11. [Chart: Accuracy versus False Positives, comparing "Existing model" with "Add unstructured data"]
  12. 12. Big Data Big Analytics Use Cases
      One Big Model
        • Build predictive models with (very) large datasets
        • More rows/observations and/or more columns/features
        • Tend to use dimension reduction, machine learning and/or ensemble techniques
      Big Data Scoring
        • Score and predict with (very) large datasets with previously built model
        • Score in batch or individual transactions
        • Previously built model may be exported from model build to model deployment env.
      Many Small Models
        • Model factories build predictive models in quantity
        • Automated building of individualized models and/or parallel individualized model execution
      Scoring Many Models
        • Score and predict with many individualized models
        • Production model factories require model management
      Computationally Intensive Analytics
        • Analytic models that are mathematically intense
        • May not use large data sets but generate a lot of interim calculations
        • May include vectorization, simulation, optimization
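The "Many Small Models" use case, together with the repeated y=ax+b of the earlier slides, can be pictured with a minimal Python sketch of a model factory. It is an illustration only, not Revolution R Enterprise code: the function names (`fit_line`, `build_models`, `score`) and the sensor data are hypothetical, and plain ordinary least squares stands in for whatever model a real factory would build.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares fit of y = a*x + b for one group of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx  # assumes the group has at least two distinct x values
    b = mean_y - a * mean_x
    return a, b

def build_models(rows):
    """Model factory: group rows by key, then fit one small model per group."""
    groups = defaultdict(list)
    for key, x, y in rows:
        groups[key].append((x, y))
    return {key: fit_line(points) for key, points in groups.items()}

def score(models, key, x):
    """Scoring many models: look up the individualized model, then predict."""
    a, b = models[key]
    return a * x + b

# Hypothetical machine data: one linear trend per sensor.
rows = [
    ("sensor_a", 0.0, 1.0), ("sensor_a", 1.0, 3.0), ("sensor_a", 2.0, 5.0),
    ("sensor_b", 0.0, 10.0), ("sensor_b", 1.0, 9.0), ("sensor_b", 2.0, 8.0),
]
models = build_models(rows)
```

Because each group is fitted independently, the per-group fits could be distributed across workers, which is what makes this pattern a natural fit for a cluster.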
  13. 13. Big Data Big Analytics Specialized Use Cases
      Time Series Analytics
        • Build forecasts with time sequenced data
        • For Big Data, tend to be many small models, esp. machine data
        • Due to typical Big Data volume, requires model management
      Text and Document Analytics
        • Use of unstructured, free text
        • For Big Data, typically used to enhance structured predictive analytics
        • Minimally requires text processing tools and may also require natural language processing
      Mining Data Streams
        • Analyzing continuous, high speed data flows for patterns and acting upon the patterns in real-time
        • Requires specialized sampling and filtering techniques
        • Uses distinct discovery analytics methods such as frequent itemsets or clustering
      Zero Latency
        • No separation of model building and model scoring
        • As real-time data becomes more widely available, this emerging category reduces time-to-insight with little or no separation between model building and scoring
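The slide does not name the "specialized sampling" techniques it has in mind for data streams. One well-known candidate is reservoir sampling, sketched below in Python as an assumption rather than as the deck's method. It keeps a uniform random sample of fixed size from a stream whose length is not known in advance, using memory proportional to the sample size only. The function name and the `seed` parameter are illustrative.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# A stand-in for a high speed data flow: the stream is consumed one item at a time.
sample = reservoir_sample(range(100_000), k=5)
```

The stream is never materialized in memory, so the same function works on a generator that reads from a socket or a log file.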
  14. 14. Analytic Reference Architecture [diagram of layers: Decision (Analytic Applications); Integration (Middleware); Analytics (Analytics Development Tools & Platforms); Data (Hadoop, Data Warehouse, Other Data Sources)]
  15. 15. Architectural Approaches to Analytics [diagram comparing the Beside Architecture (Analytics Development Tools & Platforms with a Local Data Mart, separate from the Data Sources) and the Inside Architecture (Data+Analytics combined); in both, Integration (Middleware) and Decision (Analytic Applications) sit on top]
  16. 16. Pros & Cons of Architectural Approaches
      Beside Architecture
        • Analytic workflow tasks performed in a separate analytics environment outside of the source database
        • Pros: Segregates analytic workload
        • Cons: Doesn’t leverage powerful production for transformations, introduces scoring latencies
      Inside Architecture
        • Analytics workflow tasks performed inside the source database with embedded analytics
        • Pros: Eliminates data movement, reduces model latency, allows exploration of all data
        • Cons: IT governance on production, potential new skills
      Hybrid Architecture
        • Some analytic workflow tasks performed inside the source database & others performed in a separate analytics environment
        • Pros: Leverages strengths of each architecture
        • Cons: Maintain multiple environments
  17. 17. Building & Deploying Analytic Models [diagram of numbered workflow steps in the Beside, Inside and Hybrid Architectures. Legend: Model Build; Model Deploy; Model Recode / PMML; Update Data; Data Prep / Marshaling]
  18. 18. Revolution Confidential + &
  19. 19. Our platform vision: Organizations bring their applications to Hadoop data
        • Lower cost per TB
        • Avoid data copying
        • Minimize big data movement
        • Simplify the IT and user experience
      ©2013 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  20. 20. Traditional workloads in Hadoop
      Workloads in Hadoop: Search; Analytics; Self-service BI; Data Processing (ELT); OLAP reporting
      In Cloudera
        • 2-10X the performance
        • 1/10th the cost
      In Cloudera
        • Integrated R support for deep analytics
        • Takes advantage of entire cluster for high performance
        • More granular datasets with more model features
      In Cloudera
        • Data exploration on the full fidelity data
        • Faster lifecycle from source data to mini-mart
        • 1/10th the cost
  21. 21. Enterprise-Grade Solutions for Big Data: Key Characteristics
  22. 22. Cloudera Manager & R integration: Seamless cluster administration for Revolution R Enterprise
      1. Deploy: Deploy Revolution R Enterprise quickly and easily onto your CDH cluster
      2. Configure & Optimize: Ensure optimal settings are configured for performance of Revolution R Enterprise
      3. Monitor, Diagnose & Report: Identify resource controls, monitor performance, debug and diagnose issues through a single consolidated interface
  23. 23. Revolution Confidential 23
  24. 24. What is the R Language?
      A Platform…
        • A Procedural Language for Stats, Math and Data Science
        • A Complete Data Visualization Framework
        • Provided as Open Source
      A Community…
        • 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
        • Active User Groups Across the World
      An Ecosystem
        • CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
  25. 25. Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language
        • High Performance, Scalable Analytics
        • Portable Across Enterprise Platforms
        • Easier to Build & Deploy
  26. 26. R is open source and drives analytic innovation but… has some limitations for Enterprises
        • Big Data: open source R is in-memory bound; Revolution R Enterprise adds disk based scalability
        • Speed of Analysis: open source R is single threaded; Revolution R Enterprise adds parallel threading
        • Enterprise Readiness: open source R has community support; Revolution R Enterprise adds commercial support
        • Analytic Breadth & Depth: open source R has 4500+ innovative analytic packages; Revolution R Enterprise leverages open source packages plus Big Data ready packages
        • Commercial Viability: open source R carries the risk of deploying open source; Revolution R Enterprise has a commercial license
  27. 27. Introducing Revolution R Enterprise, The Big Data Big Analytics Platform [diagram of components: R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DevelopR, DeployR; grouped under Language Interpreter and Standard R Algorithm Suites, Big Data Distributed Execution Platform, and Development & Deployment Tooling]
  28. 28. Big Data Speed @ Scale with Revolution R Enterprise
      First, we enhance and accelerate the Open Source R interpreter.
        • Fast Math Libraries
        • Parallelized Algorithms
        • In-Database Execution
        • Multi-Threaded Execution
        • Multi-Core Processing
        • In-Hadoop Execution
        • Memory Management
        • Parallelized User Code
  29. 29. Open Source R performance: Multi-threaded Math

      Computation (4-core laptop)        Open Source R   Revolution R   Speedup
      Linear Algebra [1]
        Matrix Multiply                  176 sec         9.3 sec        18x
        Cholesky Factorization           25.5 sec        1.3 sec        19x
        Linear Discriminant Analysis     189 sec         74 sec         3x
      General R Benchmarks [2]
        R Benchmarks (Matrix Functions)  22 sec          3.5 sec        5x
        R Benchmarks (Program Control)   5.6 sec         5.4 sec        Not appreciable

      [1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
      [2] http://r.research.att.com/benchmarks/
      Customers report 3-50x performance improvements compared to Open Source R, without changing any code
  30. 30. Big Data Speed @ Scale with Revolution R Enterprise
      Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
        • Fast Math Libraries
        • Parallelized Algorithms
        • In-Database Execution
        • Multi-Threaded Execution
        • Multi-Core Processing
        • In-Hadoop Execution
        • Memory Management
        • Parallelized User Code
  31. 31. Revolution R Enterprise DistributedR: Innovative Memory Management, Multi-Threaded Execution, Multi-Core Processing
        • A Revolution R Enterprise ScaleR analytic is provided a data source as input
        • The analytic loops over data, reading a block at a time
        • Blocks of data are read by a separate worker thread (Thread 0)
        • Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory
        • When all of the data is processed, a master results object is created from the intermediate results objects (COMBINE INTERMEDIATE RESULTS)
  32. 32. Revolution R Enterprise ScaleR Performance and Capacity
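The block-at-a-time loop on slide 31 can be sketched in Python as follows. This illustrates the pattern only, not DistributedR's implementation: the function names, the block size, and the choice of mean and variance as the "analytic" are all assumptions. A reader thread hands over one block at a time, a pool of workers turns slices of each block into intermediate results, and a final step combines them.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def read_blocks(data, block_size, out_queue):
    """Reader thread ("Thread 0"): hand over one block at a time, then a sentinel."""
    for start in range(0, len(data), block_size):
        out_queue.put(data[start:start + block_size])
    out_queue.put(None)

def partial_stats(rows):
    """Intermediate result for one slice: count, sum and sum of squares."""
    return len(rows), sum(rows), sum(v * v for v in rows)

def combine(partials):
    """Master result built from all intermediate results."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # sample variance
    return {"n": n, "mean": mean, "variance": variance}

def chunked_summary(data, block_size=1000, workers=4):
    # maxsize=1 lets the reader fetch the next block while the current one is processed.
    blocks = queue.Queue(maxsize=1)
    reader = threading.Thread(target=read_blocks, args=(data, block_size, blocks))
    reader.start()
    partials = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            block = blocks.get()
            if block is None:
                break
            step = max(1, -(-len(block) // workers))  # ceiling division
            slices = [block[i:i + step] for i in range(0, len(block), step)]
            partials.extend(pool.map(partial_stats, slices))
    reader.join()
    return combine(partials)

# In-memory list standing in for a data source that would not fit in RAM.
data = [float(i) for i in range(10_000)]
summary = chunked_summary(data)
```

Only one block and the small intermediate tuples are live at any time, which is the property that removes the in-memory limit. In CPython the threads here show the structure rather than a real speedup for pure-Python arithmetic.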
  33. 33. SAS HPA Benchmarking comparison*: Logistic Regression

                        SAS HPA          Revolution R    Difference (as shown on slide)
      Rows of data      1 billion        1 billion
      Parameters        “just a few”     7               Double
      Time              80 seconds       44 seconds      45%
      Data location     In memory        On disk
      Nodes             32               5               1/6th
      Cores             384              20              5%
      RAM               1,536 GB         80 GB           5%

      Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
      Revolution R Enterprise Delivers Performance at 2% of the Cost
      *As published by SAS in HPC Wire, April 21, 2011
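The ratios quoted on slide 33 can be checked from the slide's own figures with a few lines of Python. Reading the "45%" annotation as the reduction in elapsed time is an assumption. The "2% of the cost" claim cannot be derived from the numbers shown, so it is not computed here.

```python
# Figures copied from the slide; the dictionary keys are illustrative names.
sas_hpa = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
revolution_r = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}

# Revolution R resources expressed as a fraction of the SAS HPA configuration.
ratios = {key: revolution_r[key] / sas_hpa[key] for key in sas_hpa}

# Reduction in elapsed time (assumed to be what "45%" on the slide refers to).
time_saved = 1 - ratios["seconds"]

for key, value in ratios.items():
    print(f"{key}: {value:.1%} of the SAS HPA figure")
print(f"time saved: {time_saved:.0%}")
```

The cores and RAM ratios both come out at about 5.2% (roughly a 20th), and the node ratio at about 15.6% (roughly a 6th), in line with the slide's rounded statements.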
  34. 34. Revolution R Enterprise ScaleR: High Performance Big Data Analytics (Data Prep, Distillation & Descriptive Analytics)
      R Data Step
        • Data import: Delimited, Fixed, SAS, SPSS, ODBC
        • Variable creation & transformation
        • Recode variables
        • Factor variables
        • Missing value handling
        • Sort
        • Merge
        • Split
        • Aggregate by category (means, sums)
      Descriptive Statistics
        • Min / Max
        • Mean
        • Median (approx.)
        • Quantiles (approx.)
        • Standard Deviation
        • Variance
        • Correlation
        • Covariance
        • Sum of Squares (cross product matrix for set variables)
        • Pairwise Cross tabs
        • Risk Ratio & Odds Ratio
        • Cross-Tabulation of Data (standard tables & long form)
        • Marginal Summaries of Cross Tabulations
      Statistical Tests
        • Chi Square Test
        • Kendall Rank Correlation
        • Fisher’s Exact Test
        • Student’s t-Test
      Sampling
        • Subsample (observations & variables)
        • Random Sampling
  35. 35. Revolution R Enterprise ScaleR: High Performance Big Data Analytics (Statistical Modeling, Machine Learning)
      Predictive Models
        • Sum of Squares (cross product matrix for set variables)
        • Multiple Linear Regression
        • Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user defined distributions & link functions
        • Covariance & Correlation Matrices
        • Logistic Regression
        • Classification & Regression Trees
        • Predictions/scoring for models
        • Residuals for all models
      Data Visualization
        • Histogram
        • Line Plot
        • Scatter Plot
        • Lorenz Curve
        • ROC Curves (actual data and predicted values)
      Cluster Analysis
        • K-Means
      Classification
        • Decision Trees
      Simulation
        • Monte Carlo
      Variable Selection
        • Stepwise Regression (for linear reg)
  36. 36. Unparalleled Big Data Big Analytics Scale, Performance & Innovation: 1 + 1 = 1000’s [diagram with axes Value and Performance, combining the R Language, Performance Enhanced R, Open Source R Analytic Packages, and the Big Data Distributed & Parallel Processing & Analytic Package into Revolution R Enterprise]
  37. 37. Leveraging CRAN with DistributedR & ScaleR
      Big Data Distillation
        • Allows an R programmer to use RRE ScaleR to reduce dimensionality first and then feed the reduced data set into open source packages, so that the computationally intensive portion is sped up with RRE ScaleR techniques while any of the many open source packages can still be used
      Big Data Threading
        • Allows an R programmer to use RRE ScaleR to execute algorithms designed for SMP environments in parallel using DistributedR (e.g. Monte Carlo simulation)
      Supercharge Open Source package with RRE
        • Allows an R programmer to re-engineer a CRAN routine by replacing an Open Source function inside an R based algorithm with the equivalent ScaleR function(s)
      High Performance Custom Algorithm
        • Allows an R programmer to use the RRE high throughput extreme data format (XDF) to apply any combination of Open Source functions and logic while chunking through an XDF file, overcoming the Open Source R memory limitations
  38. 38. WODA: Write Once – Deploy Anywhere
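The "Big Data Threading" idea, running independent simulation batches in parallel and then pooling their results, can be illustrated with a small Monte Carlo sketch in Python. It does not use DistributedR or ScaleR. The function names, the task count, and the choice of estimating pi are hypothetical stand-ins for any simulation whose runs do not depend on each other.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed, trials):
    """One independent Monte Carlo batch: count random points inside the unit circle."""
    rng = random.Random(seed)  # a separate seeded generator per batch
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(tasks=8, trials_per_task=20_000):
    """Run the batches in parallel and pool their counts into one estimate."""
    # Threads keep the sketch simple; in CPython a ProcessPoolExecutor would be
    # the usual choice for real CPU parallelism.
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        hits = sum(pool.map(simulate, range(tasks), [trials_per_task] * tasks))
    return 4.0 * hits / (tasks * trials_per_task)

pi_estimate = estimate_pi()
print(f"pi is approximately {pi_estimate:.3f}")
```

Each batch needs only its seed and a trial count, so the work distributes cleanly and the only shared step is the final sum.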
  39. 39. Big Analytics on Big Data in Hadoop
      100% R on Hadoop
        • Full Skill Transfer: No Java needed
        • Use 4500+ CRAN Packages
        • Blend / Combine R & Other Tools / Methods
      100% Portability
        • Build Once – Deploy Many
        • Track Evolution of Hadoop
        • Protect Against Platform Uncertainty
        • Avoid Platform Lock-ins
      Hadoop Performance & Scale
        • Leverage Hadoop Parallelism Easily
        • Analyze Data Without Moving It
      [diagram of layers: Applications; Analytics (100% R, Portability, Big Data Scale); Data (Hadoop: Scalable Compute + Parallel Storage; HDFS, HBase, Hive)]
  40. 40. Revolution R Enterprise + Cloudera Propels Enterprises into the Future [diagram of layers: Decision (Analytic Applications); Integration (Middleware); Analytics (Revolution R Enterprise Big Data Big Analytics Platform); Data (Cloudera Data Management Platform)]
  41. 41. Revolution R Enterprise Powers Write Once, Deploy Anywhere [diagram of numbered workflow steps in the Beside, Inside and Hybrid Architectures, with Revolution R Enterprise linked to Cloudera through a Direct Connector. Legend: Model Build; Model Deploy; Model Recode / PMML; Update Data; Data Prep / Marshaling]
      Bottom Line: Save Time, Save Money, Get Insights Faster
        • Direct connectors access data without data movement
        • Push analysis down to the data without moving it
        • Use the same R script on any platform without recoding
        • Use the right architecture for the job!
  42. 42. Revolution R Enterprise Inside Cloudera [diagram]
      Consumption: Business Analysts (Alteryx, Tableau, QlikView, Cognos, Microstrategy, Datameer etc.); Power Analysts (R Studio, DevelopR, etc.); Line of Business users (Analytic Apps, Rules Engines, etc.)
      Revolution R Enterprise (inside Cloudera): R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR. Big Data Big Analytics: Data Transformation, Model Building & Scoring
      Data Sources: Machine Data; New Data Sources; Data Suppliers; Traditional Sources; IBM Mainframe
  43. 43. QuickStart Programs Deliver Value Quickly
        • Offered by both Cloudera and Revolution Analytics
        • Combine Software, Services and Training
        • Cloudera can help you get started with Hadoop in a few ways
        • Revolution Analytics helps you realize value from R + Hadoop
  44. 44. Summary: Revolution R Enterprise and Cloudera Hadoop bring best-of-breed technologies to deliver:
        • Highly scalable and high performance machine learning on data residing in Hadoop
        • Analytics at scale that are accessible and easy for R users, through the familiar R programming environment
        • Full lifecycle analytics, from ad-hoc analysis to production analytics, in one managed environment, with the ability to integrate disparate data sources in one repository
        • A seamless operational experience for managing both products, through the deep integration of Revolution R Enterprise with Cloudera
  45. 45. Thank You. Visit us @ Strata NYC, Oct 28
  46. 46. Questions. Revolution Analytics: info@revolutionanalytics.com; Cloudera: info@cloudera.com
