Performance and Scale Options for R with Hadoop: A comparison of potential ar... – Revolution Analytics
R and Hadoop go together. In fact, they go together so well, that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We'll then compare each configuration's performance, as well as its programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses versus mixed workloads.
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
ABSTRACT: Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMAs). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
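The chunked compute-and-combine pattern the abstract describes can be sketched with a toy example. This is a Python illustration only, not Revolution's implementation (their PEMAs are C++ and R): each data chunk yields sufficient statistics, and an associative combine step merges them, so no two chunks ever need to be in memory together and the combine can run under MapReduce's file-based choreography.

```python
# Toy Parallel External Memory Algorithm: mean and variance of a data set,
# computed from per-chunk sufficient statistics that combine associatively.

def map_chunk(chunk):
    """Per-chunk sufficient statistics: (count, sum, sum of squares)."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return (n, s, ss)

def reduce_stats(a, b):
    """Combine two partial results; order of combination does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(stats):
    """Turn the combined statistics into the final answers."""
    n, s, ss = stats
    mean = s / n
    var = ss / n - mean * mean  # population variance
    return mean, var

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [data[0:2], data[2:4], data[4:6]]  # stands in for HDFS splits

partials = [map_chunk(c) for c in chunks]   # the "map" side
combined = partials[0]
for p in partials[1:]:                      # the "reduce" side
    combined = reduce_stats(combined, p)

mean, var = finalize(combined)
```

Because `reduce_stats` is associative and commutative, partial results can be merged in any order and at any level of a tree, which is what makes the pattern fit both MPI-style and file-based (MapReduce) communication.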
High Performance Predictive Analytics in R and Hadoop – DataWorks Summit
27 Aug 2013 Webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing.
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation – Revolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit... – Revolution Analytics
Presented by David Smith, Chief Community Officer, Revolution Analytics, at the Gartner Business Intelligence and Analytics Summit, April 2014.
In this presentation, I'll introduce the open source R language — the modern standard for Data Science — and the enhanced performance, scalability and ease-of-use capabilities of Revolution R Enterprise. Customer case studies will illustrate Revolution R Enterprise as a component of the real-time analytics deployment process, via integration with Hadoop, database warehousing systems and Cloud platforms, to implement data-driven end-user applications.
Analysts predict that the Hadoop market will reach $50.2 billion USD by 2020.[1] Applications driving these large expenditures are some of the most important workloads for businesses today, including:
• Analyzing clickstream data, including site-side clicks and web media tags.
• Measuring sentiment by scanning product feedback, blog feeds, social media comments, and Twitter streams.
• Analyzing behavior and risk by capturing vehicle telematics.
• Optimizing product performance and utilization by gathering data from built-in sensors.
• Tracking and analyzing people and material movement with location-aware systems.
• Identifying system performance issues and intrusion attempts by analyzing server and network logs.
• Enabling automatic document and speech categorization.
• Extracting learning from digitized images, voice, video, and other media types.
Predictive analytics on large data sets provides organizations with a key opportunity to improve a broad variety of business outcomes, and many have embraced Apache Hadoop as the platform of choice.
In the last few years, large businesses have adopted Apache Hadoop as a next-generation data platform, one capable of managing large data assets in a way that is flexible, scalable, and relatively low cost. However, to realize predictive benefits of big data, organizations must be able to develop or hire individuals with the requisite statistics skills, then provide them with a platform for analyzing massive data assets collected in Hadoop “data lakes.”
As users adopted Hadoop, many discovered that performance and complexity limited its suitability for broad predictive analytics work. In response, the Hadoop community has focused on the Apache Spark platform to provide Hadoop with significant performance improvements. With Spark atop Hadoop, users can leverage Hadoop's big-data management capabilities while achieving new performance levels by running analytics in Apache Spark.
One challenge remains: conquering the complexity of Hadoop when developing predictive analytics applications.
In this white paper, we’ll describe how Microsoft R Server helps data scientists, actuaries, risk analysts, quantitative analysts, product planners, and other R users to capture the benefits of Apache Spark on Hadoop by providing a straightforward platform that eliminates much of the complexity of using Spark and Hadoop to conduct analyses on large data assets.
In-Database Analytics Deep Dive with Teradata and Revolution – Revolution Analytics
Teradata and Revolution Analytics worked together to develop in-database analytical capabilities for Teradata Database. Teradata v14.10 provides a foundation for in-database analytics in Teradata. Revolution Analytics has ported its Revolution R Enterprise (RRE) Version 7.1 to use the in-database capabilities of version 14.10. With RRE inside Teradata, users can run fully parallelized algorithms in each node of the Teradata appliance to achieve performance and data scale heretofore unavailable. We'll get past the marketecture quickly and dive into a "how it really works" presentation, review implications for system configuration and administration, and then take questions from Teradata users who will be charged with deploying and administering Teradata systems as platforms for big data analytics inside the database engine.
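The "run the algorithm inside each node" idea amounts to swapping the execution backend while the analysis code stays the same. Here is a hypothetical Python sketch of that compute-context pattern (the class and function names are invented for illustration; RRE exposes this through its own compute-context mechanism): the per-partition function is identical in both contexts, and only where it runs differs.

```python
# Hypothetical sketch: the same analysis runs under two compute contexts.

class LocalContext:
    """Pull the data to the compute: everything runs in one local process."""
    def run(self, fn, partitions):
        return [fn(p) for p in partitions]

class InDatabaseContext:
    """Stand-in for pushing the compute to the data: in a real deployment,
    each database node would run fn on its local partition in parallel."""
    def run(self, fn, partitions):
        # Simulated here; no data would actually leave the nodes.
        return [fn(p) for p in partitions]

def row_count(partition):
    """A trivially parallel per-partition analysis step."""
    return len(partition)

partitions = [[1, 2, 3], [4, 5], [6]]  # stands in for per-node data slices

for ctx in (LocalContext(), InDatabaseContext()):
    parts = ctx.run(row_count, partitions)  # per-node partial results
    total = sum(parts)                      # combine into one answer
```

The payoff of this separation is that an analysis written once against the context interface can move from a workstation to the appliance without being rewritten.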
There is one consistent message we hear from customers across industries and around the world: "We would like to reduce our reliance on SAS." In this webinar, we review the top reasons customers cite for moving from SAS to R; the benefits of open source analytics; the challenges of switching; and the tools you will need to build your own roadmap. We review the key differences between SAS and R from the user's perspective, and provide you with the tools to move forward.
Big Data Analytics on Teradata with Revolution R Enterprise – Bill Jacobs
Revolution Analytics brings big data analytics to Teradata database. Presentation from Teradata Partners, October 2013 overviewing Revolution R Enterprise for Teradata by Bill Jacobs, Director, Product Marketing, Revolution Analytics.
Big Data refers to large volumes of data, both structured and unstructured. Managing and analyzing data at this scale calls for technologies like Hadoop and languages like R.
http://www.techsparks.co.in/thesis-in-big-data-with-r/
This session will demonstrate how the all-star line-up featuring R and Storm enables real-time processing on massive data sets; a real home run! The presenters will use actual baseball data and a real-world use case to compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution. Attendees will leave the session with information that could easily be applied for other use cases such as video game analytics, fraud detection, intrusion detection, and consumer propensity to buy calculations.
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
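The prototyping claim above, that a few lines in a high-level language can stand in for lengthy Storm code during early exploration, can be illustrated with a made-up "bolt". This is a Python sketch (the talk uses R); the generator below plays the role a rolling-mean bolt would play in a topology, consuming a stream of values and emitting the mean of the last k:

```python
# Hypothetical prototype of a Storm bolt's logic: a rolling mean over a stream.
from collections import deque

def rolling_mean_bolt(stream, k=3):
    """Consume values one at a time, emit the mean of the last k seen."""
    window = deque(maxlen=k)  # old values fall off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

out = list(rolling_mean_bolt([2, 4, 6, 8], k=3))
```

Once the windowing logic is validated on recorded data like this, porting it to an actual Storm bolt is a translation exercise rather than a design exercise, which is exactly the "deploy code faster" point.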
R is free software for data analysis and graphics that is similar to SAS and SPSS. Two million people are part of the R Open Source Community. Its use is growing very rapidly and Revolution Analytics distributes a commercial version of R that adds capabilities that are not available in the Open Source version. This 60-minute webinar is for people who are familiar with SAS or SPSS who want to know how R can strengthen their analytics strategy.
Presented by: Joseph Rickert, Data Scientist Community Manager, Revolution Analytics, Sep 25 2014.
Whenever data scientists are asked what software they use, R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work, including:
Acquiring data
Data manipulation
Exploratory data analysis
Model building
Machine learning
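The steps listed above can be sketched end-to-end on toy data. This is a minimal Python illustration with made-up numbers (the webinar itself works the examples in R): data comes in, is reshaped, summarized, and finally fitted with a simple least-squares model.

```python
# Minimal end-to-end sketch of the workflow: acquire, manipulate,
# explore, and model a tiny monthly series.

# 1. Acquiring data (inline here; in practice read from a file, database, or API)
raw = [("2014-01", 10.0), ("2014-02", 12.0), ("2014-03", 14.5), ("2014-04", 16.0)]

# 2. Data manipulation: extract numeric x/y pairs from the records
xs = list(range(len(raw)))
ys = [y for _, y in raw]

# 3. Exploratory summary
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# 4/5. Model building: ordinary least squares fit of y = a + b*x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x  # intercept recovered from the means
```

In R the modeling step collapses to a one-liner like `lm(y ~ x)`, which is part of why the language is attractive for this kind of exploratory work.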
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ... – Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in a database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionize the workflow?
In this presentation from Revolution Analytics, Bill Jacobs presents: Are You Ready for Big Data Analytics?
"Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. By building on open source R—the world's most powerful statistics software—with innovations in big data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses."
Learn more: http://www.revolutionanalytics.com
Watch the presentation video: http://wp.me/p3RLEV-12S
Applications in R - Success and Lessons Learned from the Marketplace – Revolution Analytics
Adoption of the R language has grown rapidly in the last few years, and is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.
In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.
Big Data in Action – Real-World Solution Showcase – Inside Analysis
The Briefing Room with Radiant Advisors and IBM
Live Webcast on February 25, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=53c9b7fa2000f98f5b236747e3602511
The power of Big Data depends heavily upon the context in which it's used, and most organizations are just beginning to figure out where, how and when to leverage it. One key to success is integration with existing information systems, many of which still rely on relational database technologies. Finding ways to blend these two worlds can help companies generate measurable business value in fairly short order.
Register for this episode of The Briefing Room to hear Analysts Lindy Ryan and John O'Brien as they explain how the combination of traditional Business Intelligence with Big Data Analytics can provide game-changing results in today's information economy. They'll be briefed by Eric Poulin and Paul Flach of Stream Integration, who will share best practices for designing and implementing Big Data solutions. They'll discuss the components of IBM BigInsights, and explain how BigSheets can empower non-technical users who need to explore semi-structured data.
Visit InsideAnalysis.com for more information.
Future of Enterprise PaaS (Cloud Foundry Summit 2014) – VMware Tanzu
Keynote delivered by Steve Winkler, Open Cloud Strategy & Dirk Basenach, VP Development at SAP.
There are many approaches to running enterprise applications in the cloud, and SAP has made a strategic choice to leverage the Open Source solution Cloud Foundry for this purpose. This presentation will provide details on SAP’s approach to Open Source in the cloud, focusing on the PaaS layer and showing how Cloud Foundry can be used to extend existing SAP products and to develop entirely new enterprise applications. Moreover, the presentation will show how the SAP HANA in-memory platform and Cloud Foundry will come together to provide an enterprise-grade, real-time open platform in the cloud.
BIG Data & Hadoop Applications in Social Media – Skillspeed
Explore the applications of BIG Data & Hadoop in Social Media via Skillspeed.
BIG Data & Hadoop in Social Media is a key differentiator, especially in terms of generating memorable customer experiences.
Herein, we discuss how leading social networks such as Facebook, Twitter, Pinterest, LinkedIn, Instagram & StumbleUpon utilize Hadoop.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
Game Changed – How Hadoop is Reinventing Enterprise Thinking – Inside Analysis
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast on April 8, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=cfa1bffdd62dc6677fa225bdffe4a0b9
The innovation curve often arcs slowly before picking up speed. Companies that harness a major transformation early in the game can make serious headway before challengers enter the picture. The world of Hadoop features several of these upstarts, each of which uses the open-source foundation as an engine to drive vastly greater performance to a wide range of services, and even create new ones.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop engine is being used to architect a new generation of enterprise applications. He’ll be briefed by George Corugedo, RedPoint Global CTO and Co-founder, who will showcase how enterprises can cost-effectively take advantage of the scalability, processing power and lower costs that Hadoop 2.0/YARN applications offer by eliminating the long-term expense of hiring MapReduce programmers.
Visit InsideAnalysis.com for more information.
3 Benefits of Multi-Temperature Data Management for Data Analytics – MapR Technologies
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
C-BAG Big Data Meetup, Chennai, Oct. 29, 2014: Hortonworks and Concurrent on Casca... – Hortonworks
Big Data is moving to the next level of maturity and it’s all about the applications. Dhruv Kumar, one of the minds behind Cascading, the most widely used and deployed development framework for building Big Data applications, will discuss how Cascading can enable developers to accelerate the time to market for their data applications, from development to production. In this session, Dhruv will introduce how to easily and reliably develop, test, and scale your data applications and then deploy them on Hadoop and Hortonworks Data Platform. He will show a demo using the Hortonworks Sandbox and Cascading. Recording is at
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=e5582bcbc0516d35fc2dcf0bce86146e
The innovation provided by the Cloud Foundry community aligns very well with innovation occurring inside SAP, and both are gaining significant market momentum. Learn about SAP’s involvement with Cloud Foundry, its PaaS strategy built on SAP HANA Cloud Platform, and its commitment to the open source approach overall, in this 2014 Cloud Foundry Summit presentation by Dirk Basenach and Steve Winkler.
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
Silicon Valley Code Camp -- October 11, 2014.
Session: Getting started with Hadoop on the Cloud.
Hadoop and Cloud is an almost perfect marriage. Hadoop is a distributed computing framework that leverages a cluster built on commodity hardware. The Cloud simplifies provisioning of machines and software. Getting started with Hadoop on the Cloud makes it simple to provision your environment quickly and actually get started using Hadoop. IBM Bluemix has democratized Hadoop for the masses! This session will provide a brief introduction to what Hadoop is, how does cloud work and will then focus on how to get started via a series of demos. We will conclude with a discussion around the tutorials and public datasets - all of the tools needed to get you started quickly.
Learn more about BigInsights for Hadoop: https://developer.ibm.com/hadoop/
Revolution Analytics - Presentation at Hortonworks Booth - Strata 2014Hortonworks
Join Revolution Analytics and Hortonworks during this interactive presentation to discuss how customers are using Hadoop and R in the real world. We’ll show an end-to-end customer churn analytics demonstration (leveraging Revolution Analytics, Hortonworks and Tableau) serving three user personas: a website visitor, a data scientist and a business analyst.
The Briefing Room with William McKnight and Actian
Live Webcast on October 14, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=135528d85baa96a07850bd35961d459d
Integrating Hadoop with existing data sources, workflows and analytics can be a real challenge. While some components, like Hive and Spark, can give SQL access to Hadoop data, there isn’t much that enables Hadoop to be treated as a genuine BI and analytics platform, capable of running multiple jobs that serve multiple users and multiple applications. But what if you could turn Hadoop into a versatile, high performance development platform, forgoing all the pain of figuring out how and where to manage big data?
Register for this episode of The Briefing Room to hear veteran Analyst William McKnight as he discusses the fairly swift evolution of Hadoop’s capabilities. He’ll be briefed by Jim Hare of Actian, who will tout his company’s latest addition to its Analytic Platform: Hadoop SQL Edition. He will show how Actian has leveraged Hadoop and its scale out file system to create a fully functioning platform, providing everything from an analytic database to machine learning.
Visit InsideAnlaysis.com for more information.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience working deep learning frameworks with its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
With rising business challenges in the aftermarket service areas, it becomes imperative for manufacturers to gain actionable intelligence across the warranty management life cycle.
Join Revolution Analytics and Tech Mahindra to hear how to reduce the information visibility gap:
• Identify statistically significant business drivers
• Forecast warranty costs and claims
• Improve Customer Satisfaction
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
2. Revolution Confidential
Vigorous Growth of Big Data…
"The global Big Data market revenue is expected to grow from $1.56 billion in 2012 to $13.95 billion in 2017, at an estimated CAGR of 54.9% from 2012 to 2017."
– MarketsandMarkets.com study, 14 April 2013

"…the market for Big Data technology will reach $16.9 billion by 2015, up from $3.2 billion in 2010. That is a 40 percent-a-year growth rate – about seven times the estimated growth rate for the overall information technology and communications business."
– IDC study, March 2012
3. Big Data = Opportunity + Disruption

Huge New Data Assets
• Internet – Commerce, Communications, Collaboration
• Social Media – Personal, Presence, New Social Networks
• Ubiquitous Telemetry – Machines Everywhere

Rapidly-Evolving Platforms
• “Data Lake” vs. “Warehouse” vs. “Big Data App. Platforms”
• Vast Choices Among Open Source Platforms
• Eliminate Time-Consuming Data Movements

Emerging Business Opportunities
• Data Science Unlocks New Insight
• Big Data Drives Better Decision-Making
• Platforms Evolve Rationally Toward Big Data Vision
4. Hadoop Analytics Platforms: Disruption, Challenge, Growth & Opportunity At Once

Growth: Skill Development
• Java Skill Requirements
• Hadoop’s Innovation Pace
• Analytical
• Write Once, Deploy Anywhere

Disruption: Evolving Ecosystems
• EDW Saturation
• Limited Analytical Capabilities
• Data Science Skill Shortage
• MapReduce Paradigm

Challenge: Big Data Readiness
• Designed for Massive Scale
• Commodity Foundations
• Built for Data Variety
• Open Source Innovation Pace

Opportunity: New, More Capable Analytic Foundation
• Descriptive -> Predictive
• Short Analytical Cycle Time
• Ubiquitous Analytical Decisions
• Low-Latency Analytics
5. What We Need: Convergence

Data Science
With business solutions that fuse statistics, mathematics and software into meaningful applications.

Software Engineering
With tools and frameworks to create agile, scalable analytics-based applications.

IT Operations Management
Deployment platforms that are integrated, cost-effective, secure and ubiquitous.
6. What is the R Statistics Language?

The R Language:
• A straightforward procedural language for stats, math and data science
• Open source

The R Community:
• 2M users with the skill to tackle big data mathematical, statistical and ML needs
• Began on workstations and modest SMP servers

The R Ecosystem:
• 4500+ freely available algorithms in CRAN
• Applicable to Big Data if scaled
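To ground the claim about R as a straightforward procedural language for statistics, here is a minimal base-R sketch using the built-in mtcars data set (the variables chosen are illustrative only): fit a logistic regression with one of CRAN/base R's standard modeling functions, then score new records.

```r
# Fit a logistic regression with base R's glm() -- one of the
# statistical workhorses the slide alludes to.
model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# Score new, unseen observations with the fitted model
new_cars <- data.frame(mpg = c(21, 30), wt = c(2.8, 1.9))
predict(model, newdata = new_cars, type = "response")
```

The same two-step pattern – fit a model, then apply predict() to new data – is what the later slides push down into Hadoop at scale.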
7. Why R and Hadoop?

Hadoop dominates Big Data storage and computational platforms.

R dominates Data Science, providing a language, users and thousands of pre-built algorithms.

Bringing them together is our goal today.
8. Mission

Company Confidential – Do not distribute

Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language:
• Enterprise-ready
• Multi-platform
• Scalable from desktop to big data
• Delivers high performance analytics
• Easier to build and deploy analytic applications
9. Who We Are

Leading provider of a commercial analytics platform based on the open source R statistical computing language.

Global Industries Served: Financial Services, Digital Media, Government, Health & Life Sciences, High Tech, Manufacturing, Retail, Telco

Customers: 200+ Global 2000
Global Presence: North America / EMEA / APAC

Our Software Delivers
• Power: Distributed, scalable, high performance advanced analytics
• Productivity: Easier to build and deploy analytic applications
• Enterprise Readiness: Multi-platform

Our Services Deliver
• Knowledge: Our experts enable you to be experts
• Time-to-Value: Our Quickstart projects give you a jumpstart
• Guidance: Our customer support team is here to help you

Our Philosophy: Customer-centric innovation; easy to do business with
Our Investors: Intel Capital, North Bridge, Presidio Ventures
10. Big Data Speed and Scale with Revolution R Enterprise

• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Execution
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
11. Revolution R Enterprise Propels Enterprises into the Future

[Architecture diagram: Analytic Applications at the Decision layer and Middleware at the Integration layer sit atop the Revolution R Enterprise High Performance Analytics Platform, which draws on Hadoop, a Data Warehouse and Other Data Sources at the Data layer]
12. 200+ Corporate Customers and Growing

[Customer logos across Digital Media & Retail, Finance & Insurance, Healthcare & Life Sciences, Manufacturing & High Tech, and Academic & Gov’t]
14. R MapReduce: Fast, Agile Analytics for Hadoop Today

R MapReduce enables R-based analytics in Hadoop:
• Use R to explore and visualize data to develop insights
• Build models using widely-available techniques
• Score data directly in Hadoop using R models
• Run R as mappers and reducers in Hadoop

Advantages:
• No data movement needed
• Connects R to HDFS, HBase and Hive
• Runs standard MapReduce jobs
• R programmers need not learn Java, nor rewrite R into Java, Pig or SQL to score data
• Accelerates projects by bringing 4500+ open source R algorithms in CRAN¹ to Hadoop

[Diagram: R MapReduce (RMR) runs inside Hadoop alongside other MapReduce jobs, with direct access to HDFS, HBase and Hive; analytics applications also draw on a Data Warehouse and other data sources]

¹ CRAN: Comprehensive R Archive Network – an open source collection of 4500+ R-based statistics, analytics, graphics and data manipulation algorithms for R users.
15. R MapReduce (RMR): Build MapReduce Jobs Entirely in R

Your creativity + your code + 4500+ R packages in CRAN = rich, powerful data analytics that runs in MapReduce.

[Diagram: Revolution R Enterprise and CRAN packages running as Map and Reduce tasks on Hadoop, over HDFS, HBase and Hive]
16. Why Build MapReduce Jobs Using R?

What can you do with it?
• Transform, aggregate, regress, cluster, filter, simulate, model, score…
• Run R programs while leveraging Hadoop’s scalability
  – Big I/O: score data files containing billions of rows
  – Big Math: run compute-intensive algorithms in parallel – Monte Carlo, random trees, etc.
• Deliver results to BI or visualization tools and production applications

When to choose RMR:
• Need to develop analytics in R, on big data in Hadoop
• Stringent latency requirements
• Scarce R and Java developers need to collaborate, not duplicate
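The open-source implementation of RMR is the rmr2 package from the RHadoop project. As a sketch of what "build MapReduce jobs entirely in R" means in practice (it assumes a working Hadoop cluster with rmr2 installed), here is a complete job whose mapper and reducer are ordinary R functions:

```r
library(rmr2)  # open-source RHadoop implementation of R MapReduce

# Push a vector into HDFS as a (key, value) data set
ints <- to.dfs(1:1000)

# Group the values by their remainder mod 10 and sum each group --
# both the map and reduce steps are plain R closures
job <- mapreduce(
  input  = ints,
  map    = function(k, v)  keyval(v %% 10, v),
  reduce = function(k, vv) keyval(k, sum(vv)))

# Pull the ten (remainder, sum) pairs back into the R session
from.dfs(job)
```

Note that no Java, Pig or SQL appears anywhere: rmr2 handles job submission and the movement of keys and values between R and Hadoop.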
17. R MapReduce: Create Mappers and Reducers Using R

How:
• Build R code using Revolution R Enterprise
• Use open source algorithms from the CRAN project
• Leverage HDFS and MapReduce directly
• Deploy R mappers & reducers in Hadoop

[Diagram: R code and R packages deployed through R MapReduce (RMR) on Revolution R Enterprise inside Hadoop, running alongside other MapReduce jobs over HDFS, HBase and Hive]
18. Revolution Confidential
Mappers & Reducers:
100% R. 100% Hadoop.
For Hadoop Users:
Integrates R with Hadoop via Hadoop Streaming
Creates MapReduce Jobs Compatible with the JobTracker
No Need to Recode Models
No Latency to Move Data
For R Programmers:
No Need for Java Programming
Serializes & Deserializes Data Between HDFS and R
Handles Standard HDFS Reads & Writes Transparently
Provides Explicit Access to HDFS, HBase, and Hive via Packages
Access to the CRAN Algorithm Library
[Diagram: an R mapper or reducer running under Revolution R Enterprise inside Hadoop Streaming, with data deserialized from HDFS into R and serialized back, plus high-speed connectors to HBase, Hive, and CRAN packages]
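Hadoop Streaming, which RMR builds on, treats a mapper as any program that reads raw lines on stdin and emits tab-separated key/value lines on stdout; RMR's value is generating exactly this serialization plumbing for you. A hand-rolled sketch of the contract follows; the comma-separated record layout and the function names are assumptions for illustration, and the real stdin loop appears only in a comment.

```r
# Sketch of a Hadoop Streaming mapper written by hand in R.
# Streaming delivers raw lines on stdin; the mapper must emit
# "key<TAB>value" lines on stdout. The input layout is assumed
# to be "customer,amount" per line.

map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  paste(fields[1], fields[2], sep = "\t")   # emit key<TAB>value
}

# In a deployed job the driver loop would look roughly like:
#   con <- file("stdin")
#   for (line in readLines(con)) cat(map_line(line), "\n", sep = "")
# Demonstrated here on an in-memory sample instead:
sample_input <- c("a,10", "b,20", "a,5")
emitted <- vapply(sample_input, map_line, character(1), USE.NAMES = FALSE)
cat(emitted, sep = "\n")
```

This is the "serializes & deserializes data between HDFS and R" step the slide lists: RMR hides the line parsing and tab-delimited emission so the R programmer only writes the map and reduce logic.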
19. Leveraging R with Hadoop
With R “Inside” Hadoop…
In-Place ETL: Data Transformation in R; Enrichment and Correlation Using Other Data in Hadoop
Simulation/Experimentation: Execute Complex Simulations on Massively Parallel Hadoop Clusters
Scoring: Run Scoring Models Directly in Hadoop; No Movement Penalty
How? Write Mappers & Reducers in R and Deploy Using RMapReduce; Augment Hadoop with CRAN1 Packages
1 Use of CRAN algorithms limited to non-graphical, parallelizable algorithms
20. Limitations of R MapReduce
R Programmer Must "Think MapReduce": Dividing Work into Cascades of Map, Reduce, Repeat
Algorithms Must Be Designed for Parallelism, Including Any External Packages Used
Fits:
Hadoop-Literate Teams, or Those with Good Support
Non-Fits:
Analytics Teams Tinkering with Hadoop on Short Timeframes
[Diagram: same architecture as slide 17: R MapReduce (RMR) jobs alongside other MapReduce jobs on Hadoop (HDFS, HBase, Hive), fed by a data warehouse and other data sources]
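"Thinking MapReduce" concretely means decomposing an algorithm into passes whose partial results combine associatively. The local sketch below (plain base R, no Hadoop; the chunking and all names are invented for illustration) shows the canonical decomposition for a mean: each data chunk reduces to a (sum, count) pair, and a final reduce merges the pairs into the global answer.

```r
# Two cascaded map/reduce passes, simulated locally in base R.
# Pass 1 (one reducer per partition): reduce each chunk to (sum, count).
# Pass 2 (single reducer): combine the partials into the global mean.

x <- c(4, 8, 15, 16, 23, 42)
chunks <- split(x, rep(1:3, each = 2))        # pretend these are 3 HDFS blocks

# Pass 1: partial aggregates per chunk
partials <- lapply(chunks, function(chunk)
  c(sum = sum(chunk), count = length(chunk)))

# Pass 2: merge partials; associativity is what makes this parallelizable
total <- Reduce(`+`, partials)
global_mean <- total[["sum"]] / total[["count"]]
print(global_mean)   # 18, identical to mean(x)
```

Algorithms that cannot be decomposed this way (or whose CRAN implementations assume all data in memory) are exactly the "non-fit" cases the slide warns about.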
21. More Ways to Leverage R with Hadoop: "Beside" Architectures
Inside Hadoop (as on slide 19): In-Place ETL, Simulation/Experimentation, and Scoring Directly in Hadoop, via R Mappers & Reducers Deployed with RMapReduce plus CRAN1 Packages
"Beside" Architectures:
Drivers:
Large or Unpredictable R Workloads
Modest Hadoop Cluster
Shared Production Hadoop Cluster
Hadoop Novice
Large Numbers of R Users
Modest Data Sets to Be Scored
Movement Penalty Isn't Prohibitive
Maximized Computational Scale
Access to ScaleR Parallel External Memory Algorithms (PEMAs)
Advantages:
Makes Hadoop Easier to Administer
Stabilizes Hadoop Resource Availability
22. Two Additional "Beside" Architectures
Alternatives:
RRE “Beside” Hadoop
RRE Both “Beside” and “Inside” Hadoop with RMR
“Beside” Usage:
Sample into “Beside” Server or Cluster
Analyze and Model on R Server or Cluster
Score Data on R Server or Cluster
Results to Hadoop for Use.
"Both" Usage: Same as Above, Except:
Move Model to Data on Hadoop
Score Data In-Place on Hadoop
Why multiple options?
Greatest Flexibility
Optimize Skill Sets
Scale Clusters Independently
Control Concurrency and Security
Optimize Utilization
Same R Code Can Run in Both
Balance Ease of Use/Development and Resulting Performance & Scale
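In code, the "both" pattern reduces to fitting where the analysts are and shipping only the fitted object to where the data is. The sketch below runs both halves locally on simulated data (the dataset and all names are invented for illustration; in RRE the scoring half would be deployed as an RMR job against HDFS blocks):

```r
# "Beside": fit a model on a sample extracted from Hadoop (simulated here).
set.seed(42)
sample_df <- data.frame(x = rnorm(200))
sample_df$y <- rbinom(200, 1, plogis(0.5 + 1.2 * sample_df$x))
fit <- glm(y ~ x, data = sample_df, family = binomial)

# Export: only the small fitted-model object moves, never the big data.
model_path <- tempfile(fileext = ".rds")
saveRDS(fit, model_path)

# "Inside": each mapper would load the model and score its own data block.
score_block <- function(block, path) {
  model <- readRDS(path)
  predict(model, newdata = block, type = "response")
}
big_block <- data.frame(x = c(-2, 0, 2))   # stand-in for one HDFS block
scores <- score_block(big_block, model_path)
print(round(scores, 3))                    # three probabilities in [0, 1]
```

Because the same R code scores a block whether it runs on the analytics server or inside a mapper, the "same R code can run in both" claim above is what makes the architecture flexible.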
23. RRE "Beside" Hadoop
Separate Hadoop & R Clusters
Connectors for HDFS, HBase & Hive
Explore & Model Data on R Server(s)
Return Scored Data to HDFS/HBase/Hive
When To Use:
Small, Shared, or Production Hadoop Cluster
Need Parallelized Algorithms
Heavy Random Workloads
Extensive "Sandboxing"
Modest Data Scoring
Data Security Constraints
… while awaiting YARN…
Advantages:
Concurrency by Separation
Security by Separation
Independent Scalability
ScaleR Parallel Algorithms
[Diagram: data warehouse and other data sources feeding a Hadoop cluster (HDFS, HBase, Hive, MapReduce jobs) and, via ConnectR (HBase, HDFS, ODBC & high-speed connectors), a separate analytics server or cluster running Revolution R Enterprise with CRAN packages on Linux, Windows, LSF, or Azure; results flow to analytics apps and BI & visualization]
24. RRE "Beside" and "Inside"
Both "Inside" and "Beside" Platforms
Connect a Compute Cluster to Hadoop to Run R
Move Models to Score Big Data on Hadoop
When To Use:
Production Hadoop Cluster
Need Parallelized Algorithms
Heavy Random Workloads
Extensive "Sandboxing"
Large Data Scoring
Data Security Constraints
… while awaiting YARN…
Advantages:
Concurrency & Security
Independent Scalability
Big Data Scoring Flexibility
Low Latency
[Diagram: as on slide 23, but with R MapReduce (RMR) and RRE with CRAN packages also running inside the Hadoop cluster, so models built on the analytics server can be moved into Hadoop for big-data scoring]
26. 'Beside' and/or 'Inside': Dominant Usage Patterns Observed
Use Case 1: Real-Time Scoring
Example – Fraud Prevention
Use Case 2: Modeling and Scoring
Example – Attribution Analysis
Use Case 3: Production Analytics
Example – Telematics-Assisted Underwriting
27. Example 1: Card Fraud Detection
[Diagram: fraud-detection workflow. Ingest weblog data into Hadoop (HDFS/HBase); land transaction history, personal data (creditworthiness, banking), mortgage data, and demographic data from in-house systems; filter & transform; correlate & rate transaction data with R MapReduce (RMR) jobs alongside other MapReduce jobs; develop risk models on an R workstation running Revolution R Enterprise, connected via ConnectR (HBase, HDFS, ODBC & high-speed connectors); execute models to filter & score transactions for authorization systems; deliver & integrate results to BI & visualization]
28. Example 2: Attribution Analysis "Beside" Hadoop
[Diagram: attribution workflow. Ingest weblog data, marketing service provider feeds (Acxiom, Experian, ExactTarget), monitored responses (CoreMetrics, Dotomi, DoubleClick), call-center data, and in-house systems (EDW, CRM, datamarts) into Hadoop (HDFS/HBase); filter & transform with Java MapReduce jobs; sessionize, aggregate, profile & enrich; load the analysis environment on a Linux server cluster; develop attribution models in Revolution R Enterprise via ConnectR (HBase, HDFS, ODBC & high-speed connectors); score; deliver to users through analytics apps and BI & visualization]
29. Example 3: Telematics-Enhanced Underwriting
[Diagram: underwriting workflow. Ingest policy origination data, vehicle sensor data (speed, time, acceleration, location), creditworthiness data, and insured data (loss history, payment history, credit file, demographics) into Hadoop (HDFS/HBase); correlate sources; filter, aggregate & profile; load the model environment on a Linux server cluster; develop risk models in Revolution R Enterprise via ConnectR (HBase, HDFS, ODBC & high-speed connectors); export models; score large datasets in Hadoop with R MapReduce (RMR) alongside other MapReduce jobs; deliver to underwriting applications & call response systems]
30. Conclusion
Big Data Is Hard.
Hadoop Is Key to Managing It.
R Is Key to Applying It.
Revolution R on Hadoop Brings Data Science to Big Data:
Hadoop Brings Parallel Performance to R
R Brings a Community with Know-How to Hadoop
Revolution Analytics Can Deliver Convergence Today.
… and the Future of R on Hadoop is Even Brighter…