R and Big Data using Revolution R Enterprise with Hadoop

Find out how Revolution Analytics is making it easier to work with Hadoop frameworks with Revolution R Enterprise.

Published in: Technology, Business

  1. Revolution Confidential
     Revolution Analytics: Bringing the Analytical Power of R to the Hadoop Platform
     Simon Field, Technical Director, Revolution Analytics
     June 14, 2013
  2. Vigorous Growth of Big Data…
     • The global Big Data Market revenue is expected to grow from $1.56 billion in 2012 to $13.95 billion in 2017, at an estimated CAGR of 54.9% from 2012 to 2017. (Marketsandmarkets.com study, 14 April 2013)
     • “…the market for Big Data technology will reach $16.9 billion by 2015, up from $3.2 billion in 2010. That is a 40 percent-a-year growth rate – about seven times the estimated growth rate for the overall information technology and communications business.” (IDC study, March 2012)
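The growth rates quoted on slide 2 can be sanity-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: the function name `cagr` is hypothetical, and the inputs are the dollar figures and year ranges exactly as quoted on the slide.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% a year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures as quoted on the slide, in billions of US dollars.
marketsandmarkets_rate = cagr(1.56, 13.95, 2017 - 2012)
idc_rate = cagr(3.2, 16.9, 2015 - 2010)

print(f"Marketsandmarkets 2012-2017: {marketsandmarkets_rate:.1%}")
print(f"IDC 2010-2015: {idc_rate:.1%}")
```

Both results land close to the quoted rates (roughly 55% and roughly 40% a year), so the slide's figures are internally consistent.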
  3. Big Data = Opportunity + Disruption
     Huge New Data Assets
     • Internet: Commerce, Communications, Collaboration
     • Social Media: Personal, Presence, New Social Networks
     • Ubiquitous Telemetry: Machines Everywhere
     Rapidly-Evolving Platforms
     • “Data Lake” vs. “Warehouse” vs. “Big Data App. Platforms”
     • Vast Choices Among Open Source Platforms
     • Eliminate Time-Consuming Data Movements
     Emerging Business Opportunities
     • Data Science Unlocks New Insight
     • Big Data Drives Better Decision-Making
     • Platforms Evolve Rationally Toward Big Data Vision
  4. Hadoop Analytics Platforms: Disruption, Challenge, Growth & Opportunity At Once
     Growth: Skill Development
     • Java Skill Requirements
     • Hadoop’s Innovation Pace
     • Analytical
     • Write Once, Deploy Anywhere
     Disruption: Evolving Ecosystems
     • EDW Saturation
     • Limited Analytical Capabilities
     • Data Science Skill Shortage
     • MapReduce Paradigm
     Challenge: Big Data Readiness
     • Designed for Massive Scale
     • Commodity Foundations
     • Built for Data Variety
     • Open Source Innovation Pace
     Opportunity: New, More Capable Analytic Foundation
     • Descriptive -> Predictive
     • Short Analytical Cycle Time
     • Ubiquitous Analytical Decisions
     • Low-Latency Analytics
  5. What We Need: Convergence
     • Data Science: business solutions that fuse statistics, mathematics and software into meaningful applications.
     • Software Engineering: tools and frameworks to create agile, scalable analytics-based applications.
     • IT Operations Management: deployment platforms that are integrated, cost-effective, secure and ubiquitous.
  6. What is the R Statistics Language?
     • The R Language:
       • Straightforward Procedural Language for Stats, Math and Data Science
       • Open Source
     • The R Community:
       • 2M Users with the skill to tackle big data mathematical / statistical and ML needs
       • Began on workstation / modest SMP servers
     • The R Ecosystem:
       • 4500+ Freely Available Algorithms in CRAN
       • Applicable to Big Data if scaled
  7. Why R and Hadoop?
     • Hadoop Dominates Big Data Storage and Computational Platforms.
     • R Dominates Data Science, Providing a Language, Users and Thousands of Pre-Built Algorithms.
     • Bringing Them Together is Our Goal Today.
  8. Mission
     Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language.
     • Enterprise-ready
     • Multi-platform
     • Scalable from desktop to big data
     • Delivers high performance analytics
     • Easier to build and deploy analytic applications
     Company Confidential – Do not distribute
  9. Global Industries Served
     • Financial Services, Digital Media, Government, Health & Life Sciences, High Tech, Manufacturing, Retail, Telco
     Who We Are
     • Leading provider of commercial analytics platform based on open source R statistical computing language
     • Customers: 200+ Global 2000
     • Global Presence: North America / EMEA / APAC
     • Our Investors: Intel Capital, North Bridge, Presidio Ventures
     Our Software Delivers
     • Power: Distributed, scalable high performance advanced analytics
     • Productivity: Easier to build and deploy analytic applications
     • Enterprise Readiness: Multi-platform
     Our Services Deliver
     • Knowledge: Our experts enable you to be experts
     • Time-to-Value: Our Quickstart projects give you a jumpstart
     • Guidance: Our customer support team is here to help you
     Our Philosophy
     • Customer-centric innovation
     • Easy to do business with
  10. Big Data Speed and Scale with Revolution R Enterprise
      • Fast Math Libraries
      • Parallelized Algorithms
      • In-Database Execution
      • Multi-Threaded Execution
      • Multi-Core Execution
      • In-Hadoop Execution
      • Memory Management
      • Parallelized User Code
  11. Revolution R Enterprise Propels Enterprises into the Future
      [Diagram: layered architecture]
      • Decision: Analytic Applications
      • Integration: Middleware
      • Analytics: Revolution R Enterprise High Performance Analytics Platform
      • Data: Hadoop, Data Warehouse, Other Data Sources
  12. 200+ Corporate Customers and Growing
      • Digital Media & Retail
      • Finance & Insurance
      • Healthcare & Life Sciences
      • Manufacturing & High Tech
      • Academic & Gov’t
  13. Revolution R Enterprise and R MapReduce: Bringing the R Language to the Hadoop Environment.
  14. R MapReduce: Fast, Agile Analytics for Hadoop Today
      • R MapReduce Enables R-Based Analytics in Hadoop:
        • Use R to Explore and Visualize Data to Develop Insights
        • Build Models Using Widely-Available Techniques
        • Score Data Directly in Hadoop Using R Models
        • Run R as Mappers and Reducers in Hadoop
      • Advantages:
        • No Data Movement Needed
        • Connects R to HDFS, HBase and Hive
        • Run Standard MapReduce Jobs
        • R Programmers Need Not Learn Java
        • Need Not Rewrite R into Java, Pig or SQL to Score Data
        • Accelerates Projects by Leveraging Libraries: Brings 4500+ Open Source R Algorithms in CRAN [1] to Hadoop
      [Diagram labels: Data, Data Warehouse, Other Data Sources, Analytics, Applications, Hadoop, MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase, Hive]
      [1] CRAN: Comprehensive R Archive Network, an open source collection of 4500+ R-based statistics, analytics, graphics and data manipulation algorithms for R users.
  15. R MapReduce (RMR): Build MapReduce Jobs Entirely in R
      Your Creativity + Your Code + 4500+ R Packages in CRAN = Rich, Powerful Data Analytics That Runs in MapReduce.
      [Diagram labels: Revolution R Enterprise, CRAN Packages, MAP, REDUCE, Hadoop, HDFS, HBase, Hive]
  16. Why Build MapReduce Jobs Using R?
      • What can you do with it?
        • Transform, Aggregate, Regress, Cluster, Filter, Simulate, Model, Score …
        • Run R Programs While Leveraging Hadoop’s Scalability
        • Big I/O: Score data files containing billions of rows
        • Big Math: Run compute-intensive algorithms in parallel (Monte Carlo, Random Trees, etc.)
        • Deliver results to BI or Visualization Tools and Production Applications
      • When to choose RMR:
        • Need to Develop Analytics in R, on Big Data in Hadoop
        • Stringent Latency Requirements
        • Scarce R and Java Developers Need to Collaborate, Not Duplicate
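The “Big Math” bullet on slide 16 mentions running Monte Carlo work in parallel. A minimal sketch of that pattern follows, written in Python purely for illustration (the deck's own tooling is R-based, and none of these function names come from Revolution R Enterprise or RMR). Each map task runs its own trials with its own seed, and a reduce step pools the counts.

```python
import random


def simulate(seed, trials):
    """Map task: run independent trials and return (hits, trials) for this task."""
    rng = random.Random(seed)  # a per-task seed keeps each task's stream distinct and repeatable
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return hits, trials


def aggregate(results):
    """Reduce task: pool the per-task counts and estimate pi from the hit ratio."""
    total_hits = sum(hits for hits, _ in results)
    total_trials = sum(trials for _, trials in results)
    return 4.0 * total_hits / total_trials


def estimate_pi(seeds, trials_per_task=20_000):
    """Driver: one simulated map task per seed, followed by a single reduce."""
    return aggregate([simulate(seed, trials_per_task) for seed in seeds])


print(estimate_pi(range(8)))
```

The map tasks never communicate, which is what lets a cluster run them side by side; only the small (hits, trials) pairs travel to the reducer.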
  17. R MapReduce: Create Mappers and Reducers Using R
      • How:
        • Build R Code Using Revolution R Enterprise
        • Use Open Source Algorithms From the CRAN Project
        • Leverage HDFS and MapReduce Directly
        • Deploy R Mappers & Reducers in Hadoop
      [Diagram labels: Data, Data Warehouse, Other Data Sources, Analytics, Applications, Hadoop, MapReduce, R MapReduce (RMR), Other MapReduce Jobs, Revolution R Enterprise (RRE), R Code, R Packages, CRAN Packages, HDFS, HBase, Hive]
  18. Mappers & Reducers: 100% R. 100% Hadoop.
      • For Hadoop Users:
        • Integrates R with Hadoop via Hadoop Streaming
        • Creates MapReduce Jobs Compatible with JobTracker
        • No Need to Recode Models
        • No Latency to Move Data
      • For R Programmers:
        • No Need for Java Programming
        • Serializes & Deserializes Data Between HDFS and R
        • Handles Standard HDFS Read & Write Transparently
        • Provides Explicit Access to HDFS, HBase and Hive via Packages
        • Access to CRAN Algorithm Library
      [Diagram labels: Mapper or Reducer, Hadoop Streaming, R Code, Revolution R Enterprise, CRAN, High-Speed Connectors, Data Deserialization, Data Serialization, HDFS, HBase, Hive]
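Slide 18 says the integration runs through Hadoop Streaming, where a mapper and a reducer are ordinary programs that exchange text records. The sketch below illustrates that general contract in Python rather than R, using the common tab-separated key/value convention; it is not RMR's API, and `mapper`, `reducer` and `run_local` are hypothetical names. The local driver stands in for the sort that the framework performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map, shuffle/sort and reduce in one process for testing."""
    return list(reducer(sorted(mapper(lines))))


print(run_local(["a b a", "b c"]))
```

In a real streaming job the two functions would each read standard input and write standard output; keeping them as generators makes the same logic testable without a cluster.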
  19. Leveraging R with Hadoop: With R “Inside” Hadoop…
      • In-Place ETL
        • Data Transformation in R
        • Enrichment and Correlation Using Other Data in Hadoop
      • Simulation/Experimentation
        • Execute Complex Simulations on Massively-Parallel Hadoop Clusters
      • Scoring
        • Run Scoring Models Directly in Hadoop
        • No Movement Penalty
      • How?
        • Write Mappers & Reducers in R and Deploy Using RMapReduce
        • Augment Hadoop with CRAN [1] Packages
      [1] Use of CRAN algorithms limited to non-graphical, parallelizable algorithms
  20. Limitations of R MapReduce
      • R Programmer Must “Think MapReduce”: Dividing Work into Cascades of Map, Reduce, Repeat.
      • Algorithms Must Be Designed for Parallelism, Including External Packages Used.
      • Fits: Hadoop-Literate Teams or Those With Good Support
      • Non-Fits: Analytics Teams Tinkering with Hadoop on Short Timeframes
      [Diagram labels: Data, Data Warehouse, Other Data Sources, Analytics, Applications, Hadoop, MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase, Hive]
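To make the “Think MapReduce” limitation on slide 20 concrete, here is a small sketch of a per-key mean, again in Python for illustration only and with hypothetical names. Averaging the averages of each data split gives the wrong answer, so the map step has to emit a mergeable partial result, here a (sum, count) pair, and the reduce step combines those partials.

```python
def map_partial(rows):
    """Map step: turn each (key, value) row into a (key, (sum, count)) partial."""
    for key, value in rows:
        yield key, (value, 1)


def combine(left, right):
    """Associative merge of two (sum, count) partials; safe in any order."""
    return left[0] + right[0], left[1] + right[1]


def reduce_partials(partials):
    """Reduce step: merge all partials that share a key."""
    merged = {}
    for key, partial in partials:
        merged[key] = combine(merged[key], partial) if key in merged else partial
    return merged


def mean_by_key(splits):
    """Map each split independently, then reduce and finalize the means."""
    partials = [item for split in splits for item in map_partial(split)]
    return {key: total / count for key, (total, count) in reduce_partials(partials).items()}


splits = [[("x", 1.0), ("x", 2.0)], [("x", 6.0), ("y", 4.0)]]
print(mean_by_key(splits))
```

For key "x" the split means are 1.5 and 6.0, whose average is 3.75, while the true mean of 1, 2 and 6 is 3.0. Carrying sums and counts through the reduce step is the kind of redesign the slide is warning about.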
  21. More Ways to Leverage R with Hadoop: “Beside” Architectures
      Inside Hadoop
      • In-Place ETL
        • Data Transformation in R
        • Enrichment and Correlation Using Other Data in Hadoop
      • Simulation/Experimentation
        • Execute Complex Simulations on Massively-Parallel Hadoop Clusters
      • Scoring
        • Run Scoring Models Directly in Hadoop
        • No Movement Penalty
      • How?
        • Write Mappers & Reducers in R and Deploy Using RMapReduce
        • Augment Hadoop with CRAN Packages
      “Beside” Architectures
      • Drivers:
        • Large or Unpredictable R Workloads
        • Modest Hadoop Cluster
        • Shared Production Hadoop Cluster
        • Hadoop Novice
        • Large Numbers of R Users
        • Modest Data Sets to Be Scored
        • Movement Penalty Isn’t Prohibitive
        • Maximized Computational Scale
        • Access to ScaleR Parallel External Memory Algorithms (PEMAs)
      • Advantages:
        • Makes Hadoop Easier to Administer
        • Stabilizes Hadoop Resource Availability
  22. Two Additional “Beside” Architectures
      • Alternatives:
        • RRE “Beside” Hadoop
        • RRE Both “Beside” and “Inside” Hadoop with RMR
      • “Beside” Usage:
        • Sample into “Beside” Server or Cluster
        • Analyze and Model on R Server or Cluster
        • Score Data on R Server or Cluster
        • Results to Hadoop for Use
      • “Both” Usage, Same As Above Except:
        • Move Model to Data on Hadoop
        • Score Data In-Place on Hadoop
      • Why multiple options?
        • Greatest Flexibility
        • Optimize Skill Sets
        • Scale Clusters Independently
        • Control Concurrency and Security
        • Optimize Utilization
        • Same R Code Can Run in Both
        • Balance Ease of Use/Development and Resulting Performance & Scale
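The “Both” usage on slide 22 (model on a sample beside the cluster, then move the model to the data and score in place) can be sketched as follows. This is an assumption-laden illustration in Python, not the product's workflow: the one-variable least-squares model, the JSON export format and the names `fit_line`, `export_model` and `score_records` are all hypothetical stand-ins.

```python
import json


def fit_line(sample):
    """Ordinary least squares for y = intercept + slope * x on an in-memory sample."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def export_model(model):
    """Serialize the fitted model so it can be shipped to where the data lives."""
    return json.dumps(model)


def score_records(model_json, xs):
    """Map-style scoring: load the model once, then score each record in place."""
    model = json.loads(model_json)
    for x in xs:
        yield x, model["intercept"] + model["slope"] * x


# "Beside": fit on a small sample. "Inside": ship the model and score the full data.
model_json = export_model(fit_line([(0.0, 2.0), (1.0, 5.0), (2.0, 8.0)]))
print(list(score_records(model_json, [10.0, 20.0])))
```

The design point is that only the small serialized model crosses between the two environments; the large data set stays where it is.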
  23. RRE “Beside” Hadoop
      • Separate Hadoop & R Clusters
      • Connectors: HDFS, HBase & Hive
      • Explore & Model Data on R Server(s)
      • Return Scored Data to HDFS/HBase/Hive
      • When to Use:
        • Small, Shared or Production Hadoop Cluster
        • Need Parallelized Algorithms
        • Heavy Random Workloads
        • Extensive “Sandboxing”
        • Modest Data Scoring
        • Data Security Constraints
        • … while awaiting YARN…
      • Advantages:
        • Concurrency By Separation
        • Security By Separation
        • Independent Scalability
        • ScaleR Parallel Algorithms
      [Diagram labels: Data Warehouse, Other Data Sources, Hadoop Cluster (MapReduce, Other MapReduce Jobs, HDFS, HBase, Hive), ConnectR: HBase, HDFS, ODBC & High-Speed Connectors, Analytics Server or Cluster: Linux, Windows, LSF or Azure (Revolution R Enterprise, CRAN Packages, Data Manipulation and Analysis), Analytics Apps., BI & Visualization]
  24. RRE “Beside” and “Inside”
      • Both “Inside” and “Beside” Platforms
      • Connect a Compute Cluster to Hadoop to Run R
      • Move Models to Score Big Data on Hadoop
      • When to Use:
        • Production Hadoop Cluster
        • Need Parallelized Algorithms
        • Heavy Random Workloads
        • Extensive “Sandboxing”
        • Large Data Scoring
        • Data Security Constraints
        • … while awaiting YARN…
      • Advantages:
        • Concurrency & Security
        • Independent Scalability
        • Big Data Scoring
        • Flexibility
        • Low Latency
      [Diagram labels: Data Warehouse, Other Data Sources, Hadoop Cluster (MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase, Hive), ConnectR: HBase, HDFS, ODBC & High-Speed Connectors, Analytics Server or Cluster: Linux, Windows, LSF or Azure (Revolution R Enterprise, CRAN Packages), Analytics Apps., BI & Visualization]
  25. Typical Predictive Analytics Workflow
      • Data Prep: Ingest, Format, Enrich, Filter, Aggregate, Profile
      • Explore: Sample, Cluster, Visualize, Correlate, Sandboxing
      • Model: Segment, Categorize, Select Features, Simulate, Predict, Validate
      • Deploy: Deploy, Score, Integrate
      • Improve: Measure Accuracy, Iterate
  26. ‘Beside’ and/or ‘Inside’: Dominant Usage Patterns Observed
      • Use Case 1: Real-Time Scoring (Example: Fraud Prevention)
      • Use Case 2: Modeling and Scoring (Example: Attribution Analysis)
      • Use Case 3: Production Analytics (Example: Telematics-Assisted Underwriting)
  27. Example 1: Card Fraud Detection
      [Diagram: numbered data flow, steps 1 to 6]
      • Data sources: In-House Systems (Transaction History), Weblog Data, Personal Data (Creditworthiness, Banking), Mortgage Data, Demographic Data, Transaction Data, Authorization Systems
      • Hadoop: MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase
      • R Workstation: Revolution R Enterprise, ConnectR: HBase, HDFS, ODBC & High-Speed Connectors
      • Step labels: Ingest, Filter & Xform, Correlate & Rate, Develop Risk Models, Execute Models, Filter & Score Transactions, Deliver & Integrate
      • Outputs: BI & Visualization
  28. Example 2: Attribution Analysis “Beside” Hadoop
      [Diagram: numbered data flow, steps 1 to 8]
      • Data sources: In-House Systems (EDW, CRM, Datamarts), Weblog Data, Marketing Service Provider Feeds (Acxiom, Experian, ExactTarget), Monitored Responses (CoreMetrics, Dotomi, DoubleClick), Call Center Data
      • Hadoop: MapReduce, Java MapReduce Jobs, HDFS, HBase
      • Linux Server / Cluster Server: Revolution R Enterprise, ConnectR: HBase, HDFS, ODBC & High-Speed Connectors, Analytics Apps.
      • Step labels: Ingest, Filter & Transform, Sessionize, Aggregate, Profile, & Enrich, Load Analysis Environment, Develop Attribution Models, Score, Deliver to Users
      • Outputs: BI & Visualization
  29. Example 3: Telematics-Enhanced Underwriting
      [Diagram: numbered data flow, steps 1 to 8]
      • Data sources: Policy Origination Data, Vehicle Sensor Data (Speed, Time, Acceleration, Location), Creditworthiness Data, Insured Data (Loss History, Payment History, Credit File, Demographics)
      • Hadoop: MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase
      • Linux Server / Cluster Server: Revolution R Enterprise, ConnectR: HBase, HDFS, ODBC & High-Speed Connectors, Underwriting Applications
      • Step labels: Ingest, Correlate Sources, Filter, Aggregate & Profile, Load Model Environment, Develop Risk Models, Export Models, Score Large Datasets, Deliver to Underwriting & Call Response Systems
  30. Conclusion
      • Big Data Is Hard.
      • Hadoop is Key to Managing It.
      • R is Key to Applying It.
      • Revolution R on Hadoop Brings Data Science to Big Data
      • Hadoop Brings Parallel Performance to R
      • R Brings a Community with Know-How to Hadoop
      • Revolution Analytics Can Deliver Convergence Today.
      • … and the Future of R on Hadoop is Even Brighter…
  31. [No text on this slide]
  32. Thank you.
      www.revolutionanalytics.com, 650.646.9545, Twitter: @RevolutionR
      The leading commercial provider of software and support for the popular open source R statistics language.
