Performance and Scale Options for R with Hadoop: A comparison of potential architectures

R and Hadoop: Architectural Options
Bill Jacobs
VP Product Marketing & Field CTO, Revolution Analytics
@bill_jacobs

Polling Question #1:
 Who Are You? (choose one)
– Statistician or modeler who uses R
– Other R developer
– Hadoop Expert
– Application builder
– Data guru
– Business user
– Systems vendor or reseller
– Something else…

Agenda
• Challenges
• Options
• Considerations
• How to Choose

Boundless Opportunities
• Marketing: Clickstream & Campaign Analyses
• Digital Media: Recommendation Engines
• Retail: Social Sentiment Analysis
• Insurance: Fraud, Waste and Abuse
• Healthcare Delivery: Outcome Prediction
• Manufacturing: Quality Optimization
• P&C Insurance: Risk Analysis
• Consumer Products: Warranty Optimization
• Operations: Supply Chain Optimization
• Econometrics: Market Prediction
• Marketing: Mix and Price Optimization
• Life Sciences: Pharmacogenetics
• Transportation: Asset Utilization

 What Industry Do You Represent?
– Financial Services
– Insurance
– Healthcare, Life Sciences or Pharma
– Manufacturing
– Energy
– Retail
– Logistics and Transportation
– Education
– Government
– Marketing & Advertising
– Technology
– Other

In A Perfect World…
Analytical Capability
Compute
Data Scale
Users
Price
Ease
Security

Hadoop Analytics - Many Alternatives
 R Based Alternatives
 Legacy tools updated – SAS HPA, etc.
 Big Data Databases
 Other Languages – Scala, Java, Julia, various GUIs
Today’s Topic:
 R-Based Alternatives
– “Beside Architectures”
– “Inside Architectures”
– Open Source and Commercial

Reality: Tradeoffs.
Memory Limits
In-Memory vs. Shared Infrastructure
CRAN vs. Parallelization
Desktop vs. Remote
Explicit vs. Automatic Distribution
Locality vs. Movement
Real-Time vs. MapReduce
Traditional Statistics vs. Machine Learning

Corporate Overview & Quick Facts
Founded: 2008 (as REvolution Computing)
Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
CEO: David Rich
Number of customers: 200+
Investors:
• Northbridge Venture Partners
• Intel Capital
• Platform Vendor
Web site: www.revolutionanalytics.com
Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language

Revolution Analytics
Our Vision: R becomes the de facto standard for enterprise predictive analytics
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges

Revolution Analytics Builds & Delivers:
 Software Products:
 Stable Distributions
 Broad Platform Support
 Big Data Analytics in R
 Application Integration
 Deployment Platforms
 Agile Development Tooling
 Future Platform Support
 Support & Services
 Commercial Support Programs
 Training Programs
 Professional Services
 Community Programs
 Academic Support Programs
 Contributions to Open Source R
 Open Source Extensions
 Sponsorship of R User Groups

Revolution Analytics Technical Innovations
• R Options from Open Source to Enterprise
• Parallelized Analytical Computation
• In-Database & In-Hadoop Analytics
• Big Data Scalability
• Remote Execution
• Production Deployment Support
• Multi-Platform Deployment
• Legacy Data Format Support
• Multiple IDE Options
• PMML Model Export

The Revolution R Product Suite
Revolution R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
Revolution R Plus
• Open-source distribution of R, packages, and other components
• Enhanced, supported and indemnified by Revolution Analytics
Revolution R Enterprise
• Secure, Scalable and Supported Distribution of R
• With proprietary components created by Revolution Analytics

 State Play: In your company you are…
– Building Our “Data Lake”
– Running R + Hadoop Data Today
– Running R inside Hadoop using Open source
– Running RRE inside Hadoop
– Deploying Business Apps. Using Analytics from Hadoop Data
– Looking at Next Steps e.g. Spark, etc.

Revolution Analytics:
Eight Alternatives for Integrating R & Hadoop
Open Source
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization on Workstations & Servers
4. rHadoop: Open Source Parallelization with rHadoop
Commercial
5. Revolution R Enterprise on Servers & Workstations
6. Revolution R Enterprise on Edge Nodes
7. Revolution R Enterprise Inside Hadoop
8. Combined Edge Node & Inside Hadoop

1. Open Source R Integrated With Hadoop
Traditional Open Source R “Beside” Architecture:
• Traditional Open Source
• Memory-Limited
• Data Moves
CRAN Algorithms
Data connectors: rHDFS, rHbase, rHive, rODBC

2. Revolution R Open On Workstations & Servers
Replace Open Source R “Beside” Architecture with Revolution R Open
As with Open Source R:
• Still Free.
• Still Memory Based.
• Data Still Moves.
Improvements:
• Accelerates Math with Intel MKL
• Improves R-based packages
Limitations:
• No Effect for non-R Code
CRAN Algorithms
Data connectors: rHDFS, rHbase, rHive, rODBC

Accelerate R Math with the Intel Math Kernel Library (MKL).
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html

3. Write Parallel Algorithms for PC, Server or Clusters
Write R Code to Explicitly Parallelize – Deploy Across Several Systems
Can Include CRAN Algorithms “Carefully”
foreach & iterators
• doParallel (PC, server)
• doMPI (cluster)
• RRE rxExec
Example Uses:
• Bootstrapping
• Simulation
• HPC
Data connectors: rHDFS, rHbase, rHive, rODBC
As with Previous:
• Still Free.
• Still Memory Based.
• Data Still Moves.
• Intel MKL with RRO
Improvements:
• Parallelized Execution
Limitations:
• Parallelization Difficulty
• Data Movement
• Platform Specific
4. rHadoop: Custom Parallel Execution for Hadoop
Execute R Code & CRAN Algorithms Inside Hadoop
Diagram: Remote Desktop; R Code; Hadoop Streaming; rHbase, rHDFS, rMapReduce
Can Include CRAN Algorithms
Example Uses:
• Scoring
• Transformation
• Easily Parallelized Algorithms
As With Previous:
• Still Free.
• Optional Intel MKL in RRO
Improvements:
• Runs R in MapReduce
• No Data Movement
Limitations:
• Manual Parallelization
• Hadoop Specific
5. Revolution R Enterprise (RRE) PEMAs inside Hadoop
Traditional “Beside” Architecture with Optimized Algorithms
Available for Windows, Linux
Revolution R Enterprise:
• ScaleR PEMA Algorithms
plus
• All of CRAN (subject to memory limits)
Data connectors: rHDFS, rHbase, rHive, rODBC
As With Previous:
• Includes Intel MKL in RRO
Advantages:
• Speed: PEMAs Parallelize Across Threads, Cores & Sockets
• Scale: PEMAs “Chunk” - no Memory Limits
• All of CRAN Available
• Portability
• Fully Supported
Limitations:
• Data Movement
• Single Machine
Revolution R Enterprise is… the only big data big analytics platform based on open source R
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics

ScaleR
Refactor Algorithms for Dramatic Performance and Capacity Improvement

ScaleR
High Performance Algorithms for the Most Common Uses
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Combination
New in 7.3
• PEMA-R API
• rxDataStep
• rxExec
ScaleR PEMA
What’s a PEMA? Parallel External Memory Algorithms
• Not Limited to Available Memory
• Unlimited Data Scale
• Ingests Data One Chunk At A Time.
• Adjustable Memory Footprint
• Multi-Thread Execution Performance
• Highly-Optimized Algorithms
• Algorithm Math Fully Refactored for Parallelism
• Delivered as ScaleR Library in Revolution R Enterprise
Diagram: Script Calls ScaleR Algorithm; Master Algorithm Process: Start & Manage Processing; Data: Load Block At A Time; Analyze Each Block; Combine Individual Results
Scripts can call CRAN Open Source Algorithms
6. Run Revolution R Enterprise on Hadoop Edge Node(s)
Fast Single-Server Alternative for Modest Data Scale
Diagram: Thin Client or Remote Desktop; Edge Node; ScaleR + CRAN Algorithms; Local File System (opt.); rHDFS, rHbase, rHive, rODBC
As With Previous:
• Single Machine Execution
• PEMA Scale & Speed (Single Machine)
• Use ScaleR + CRAN
• Accelerate R with Intel MKL
Improvements:
• Easily Shared via
• Develop on Desktop, Run on Edge Node
Limitations:
• “Shorter Trip” for Data
7. Fast, Transparent Parallel Computation Inside Hadoop YARN/MapReduce
Fast Parallelized Analytics on Large Data Sets In Hadoop
Diagram: Desktop & Server Tools and Applications; Web Services; Remote Execution; DeployR; jobtracker; ScaleR Algorithms
As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms
Advantages:
• Parallel Computation
• ScaleR PEMA Parallelization
• Can Parallelize CRAN “Carefully”
• Portable Coding
Limitations:
• Hadoop Workload Profiles
One Client’s Experience with RRE on Hadoop
Test Cluster - 9 Nodes
• 9 Task Nodes: 64GB, 24 cores each
• 2 Admin Nodes: 128GB, 24 cores each
• 1 Edge Node: 128GB, 24 cores each
Task: Processing Time
Importing and Filtering Datasets from HDFS
• 14 Million Observations: 82 sec.
• 227 Million Observations: 310 sec.
Modeling and Estimation
• 1.2 M Correlations: 2771 sec.
• Simple Linear Regression, 227 M Observations: 61 sec.
• Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
• Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
• Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
8. Combined Edge Node & In-Hadoop
Maximized Flexibility, Performance & Workload Handling
Diagram: Thin Client Development; rStudio; Desktop & Server Tools and Applications; Web Services; Remote Execution; DeployR; ScaleR Algorithms
As With Previous:
• Speed and Scale of ScaleR PEMA Algorithms
• Use CRAN Where Appropriate
• Accelerate R Math with MKL
• Custom Parallelized Algorithms
Advantages:
• Flexibility for Blended Workloads
• Little or No Data Movement
• Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes
Occasionally Conflicting Criteria
Infrastructure Criteria:
 Big Data Platform
 Vendor Choice
 Data Ingest
 Data Security
 Data Governance
Data Science Criteria:
 Performance
 Self Service
 Flexibility
 Collaboration
 Sharing
 Capability

Key Questions:
• Where are the bulk of your skills? SAS? R? Java? Python? SQL?
• Where do you build models today?
• Do you have the skills to parallelize algorithms?
• Can models be built on a big shared server?
• How will you run models?
• Do you have the budget to purchase commercial solutions?
• How will your needs change over time?
• What is your future architecture plan?
• How risk averse is your management team regarding new platforms and open source?

Key Questions (cont.)
• What Workloads Do You Anticipate?
– How Many Users?
– What Workloads?
• Workload Realities:
– Many small tasks do not run well in MapReduce
– Large data movements / duplications are costly
• What Use Cases Will You Encounter?
– Traditional statistical exploration, modeling?
– Behavior Prediction?
– Outlier Detection?
– Simulation and HPC?
– Massively wide data?
– Real-Time scoring?
– Internet of Things?

Eight Steps to Fast, Scalable R Analytics with Hadoop
Open Source Options
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization…
4. rHadoop…
Commercial Options
5. RRE on Servers & Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside Hadoop
No Clear Winner:
• Budget & use case determine optimal path
• Compelling options in both open source & commercial source
• RRE ScaleR uniquely provides automatic parallelization
• Current Hadoop platforms are fast for large scale analytics.
• Combined in-server & in-Hadoop fits majority of cases

2015 Challenges & Opportunities
• Evolving Hadoop Architectures
• In-Memory Analytics – Spark, YARN Containers, Caching
• Additional Algorithm Parallelization
• Cluster Management
• Cloud and Hybrid Cloud Clusters
• SQL on Hadoop “Battle-Royale”
• Addressing the Resource Reality
• Integration, Deployment Both Drain on Expensive Resources
• Leverage other skills
• Design efficient collaboration
• “Analytics for the Rest of Us”
• New Consumption Targets – Mobile
• New Participants in Design – Business Users

Recommended Resources
• Revolution Analytics Products
– http://www.revolutionanalytics.com/products
– http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
• Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”
– http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop
• Revolution Analytics on Social Media:
– http://blog.revolutionanalytics.com/
– @revolutionr on Twitter
– @bill_jacobs on Twitter

Thank you.
www.revolutionanalytics.com
1.855.GET.REVO
Twitter: @RevolutionR
