R and Hadoop go together. In fact, they go together so well that the number of available options can confuse IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select the best architecture for integrating the two by explaining the benefits of several popular configurations, along with their performance potential, workload handling, programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We’ll then compare the configurations on performance, programming model, administration, data movement, ease of scaling, and how each handles large individual analyses versus mixed workloads.
Performance and Scale Options for R with Hadoop: A comparison of potential architectures
1. R and Hadoop: Architectural Options
Bill Jacobs
VP Product Marketing & Field CTO, Revolution Analytics
@bill_jacobs
2. Polling Question #1:
Who Are You? (choose one)
– Statistician or modeler who uses R
– Other R developer
– Hadoop Expert
– Application builder
– Data guru
– Business user
– Systems vendor or reseller
– Something else…
5. Polling Question #2:
What Industry Do You Represent?
– Financial Services
– Insurance
– Healthcare, Life Sciences or Pharma
– Manufacturing
– Energy
– Retail
– Logistics and Transportation
– Education
– Government
– Marketing & Advertising
– Technology
– Other
6. In A Perfect World…
Analytical Capability
Compute
Data Scale
Users
Price
Ease
Security
7. Hadoop Analytics - Many Alternatives
R Based Alternatives
Legacy tools updated – SAS HPA, etc.
Big Data Databases
Other Languages – Scala, Java, Julia, various GUIs
Today’s Topic:
R-Based Alternatives
– “Beside Architectures”
– “Inside Architectures”
– Open Source and Commercial
8. Reality: Tradeoffs.
Memory Limits
In-Memory vs. Shared Infrastructure
CRAN vs. Parallelization
Desktop vs. Remote
Explicit vs. Automatic Distribution
Locality vs. Movement
Real-Time vs. MapReduce
Traditional Statistics vs. Machine Learning
10. Corporate Overview & Quick Facts
Founded: 2008 (as REvolution Computing)
Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
CEO: David Rich
Number of customers: 200+
Investors: Northbridge Venture Partners, Intel Capital, Platform Vendor
Web site: www.revolutionanalytics.com
Revolution R Enterprise is the leading commercial analytics platform based on the open source R statistical computing language
11. Revolution Analytics
Our Vision: R becomes the de facto standard for enterprise predictive analytics
Our Mission: Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges
12. Revolution Analytics Builds & Delivers:
Software Products:
Stable Distributions
Broad Platform Support
Big Data Analytics in R
Application Integration
Deployment Platforms
Agile Development Tooling
Future Platform Support
Support & Services
Commercial Support Programs
Training Programs
Professional Services
Community Programs
Academic Support Programs
Contributions to Open Source R
Open Source Extensions
Sponsorship of R User Groups
13. Revolution Analytics Technical Innovations
R Options from Open Source to Enterprise
Parallelized Analytical Computation
In-Database & In-Hadoop Analytics
Big Data Scalability
Remote Execution
Production Deployment
Support
Multi-Platform Deployment
Legacy Data Format Support
Multiple IDE Options
PMML Model Export
14. The Revolution R Product Suite
Revolution R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
Revolution R Plus
• Open-source distribution of R, packages, and other components
• Enhanced, supported and indemnified by Revolution Analytics
Revolution R Enterprise
• Secure, Scalable and Supported Distribution of R
• With proprietary components created by Revolution Analytics
15. Polling Question #3:
State of Play: In your company you are…
– Building Our “Data Lake”
– Running R + Hadoop Data Today
– Running R inside Hadoop using Open source
– Running RRE inside Hadoop
– Deploying Business Apps Using Analytics from Hadoop Data
– Looking at Next Steps e.g. Spark, etc.
16. Revolution Analytics:
Eight Alternatives for Integrating R & Hadoop
Open Source
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization on Workstations & Servers
4. rHadoop: Open Source Parallelization with rHadoop
Commercial
5. Revolution R Enterprise on Servers & Workstations
6. Revolution R Enterprise on Edge Nodes
7. Revolution R Enterprise Inside Hadoop
8. Combined Edge Node & Inside Hadoop
17. 1. Open Source R Integrated With Hadoop
• Traditional Open Source
• Memory-Limited
• Data Moves
Traditional Open Source R “Beside” Architecture:
CRAN Algorithms
rHDFS, rHbase, rHive, rODBC
18. 2. Revolution R Open On Workstations & Servers
Replace Open Source R “Beside” Architecture with Revolution R Open
As with Open Source R:
• Still Free.
• Still Memory Based.
• Data Still Moves.
Improvements:
• Accelerates Math with Intel MKL
• Improves R-based packages
Limitations
• No Effect for non-R Code
CRAN Algorithms
rHDFS, rHbase, rHive, rODBC
19. Accelerate R Math with the Intel Math Kernel Library (MKL)
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html
20. 3. Write Parallel Algorithms for PC, Server or Clusters
Write R Code to Explicitly Parallelize – Deploy Across Several Systems
Can Include CRAN Algorithms “Carefully”
foreach & iterators
• doParallel (PC, server)
• doMPI (cluster)
• RRE rxExec
Example Uses:
• Bootstrapping
• Simulation
• HPC
rHDFS, rHbase, rHive, rODBC
As with Previous:
• Still Free.
• Still Memory Based.
• Data Still Moves.
• Intel MKL with RRO
Improvements:
• Parallelized Execution
Limitations:
• Parallelization Difficulty
• Data Movement
• Platform Specific
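The explicit-parallelization pattern on this slide can be illustrated with a short, language-neutral sketch. The example below is written in Python rather than R and does not use the foreach, doParallel, doMPI or rxExec APIs; the function names `bootstrap_mean` and `parallel_bootstrap` are hypothetical. It shows the bootstrapping use case: each replicate is independent, so the programmer hands the replicates to a pool of workers and gathers the results. A thread pool stands in here for whatever backend (processes, cluster nodes) a real deployment would use.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(data, seed):
    """Resample `data` with replacement and return the mean of one replicate."""
    rng = random.Random(seed)  # one seeded generator per task keeps runs reproducible
    sample = [rng.choice(data) for _ in data]
    return statistics.fmean(sample)


def parallel_bootstrap(data, n_replicates=200, workers=4):
    """Distribute the replicates over a worker pool and collect them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The programmer, not the library, decides what the unit of work is:
        # here, one bootstrap replicate per seed.
        return list(pool.map(lambda seed: bootstrap_mean(data, seed),
                             range(n_replicates)))


if __name__ == "__main__":
    data = [float(x) for x in range(1, 101)]
    means = parallel_bootstrap(data)
    print(len(means), "replicates, spread:", statistics.stdev(means))
```

Note that the whole data set is still held in memory and copied to wherever the workers run, which matches the "Still Memory Based" and "Data Still Moves" points on the slide.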
21. 4. rHadoop: Custom Parallel Execution for Hadoop
Remote
Desktop
R Code
Execute R Code & CRAN Algorithms Inside Hadoop
Example Uses:
• Scoring
• Transformation
• Easily Parallelized Algorithms
Hadoop Streaming
Can Include CRAN Algorithms
As With Previous:
Still Free.
Optional Intel MKL in RRO
Improvements:
Runs R in MapReduce
No Data Movement
Limitations:
Manual Parallelization
Hadoop Specific
rHbase
rHDFS
rMapReduce
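To make "Manual Parallelization" concrete, here is a minimal in-memory sketch of the map, shuffle and reduce phases that the programmer has to design by hand in this style. It is written in Python, runs on a single machine, and is not the rHadoop or Hadoop Streaming API; the job itself (mean amount per region) and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(record):
    """Emit (key, value) pairs for one input record: here, (region, amount)."""
    region, amount = record
    yield region, amount


def reducer(key, values):
    """Collapse all values that share a key: here, their mean."""
    return sum(values) / len(values)


def map_phase(records):
    # In Hadoop, each mapper would run next to the data block it reads.
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(records):
    """Chain the three phases and return one reduced value per key."""
    groups = shuffle(map_phase(records))
    return {key: reducer(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    sales = [("east", 10.0), ("west", 4.0), ("east", 20.0)]
    print(run_job(sales))
```

Scoring and transformation fit this shape easily because each record can be processed independently; algorithms that need several passes or shared state take more design effort, which is the limitation the slide points to.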
22. 5. Revolution R Enterprise (RRE) PEMAs inside Hadoop
Traditional “Beside” Architecture with Optimized Algorithms
Available for Windows, Linux
As With Previous:
Includes Intel MKL in RRO
Advantages
Speed: PEMAs Parallelize Across Threads, Cores & Sockets
Scale: PEMAs “Chunk” - no Memory Limits
All of CRAN Available
Portability
Fully Supported
Limitations:
Data Movement
Single Machine
Revolution R Enterprise:
• ScaleR PEMA Algorithms
plus
• All of CRAN (subject to memory limits)
rHDFS, rHbase, rHive, rODBC
23. Revolution R Enterprise is….
the only big data big analytics platform based on open source R
High Performance, Scalable Analytics
Portable Across Enterprise Platforms
Easier to Build & Deploy Analytics
25. ScaleR
High Performance Algorithms for the Most Common Uses
Data import – Delimited, Fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Subsample (observations & variables)
Random Sampling
Data Step Statistical Tests
Sampling
Descriptive Statistics
Sum of Squares (cross product matrix for set variables)
Multiple Linear Regression
Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression
Classification & Regression Trees
Predictions/scoring for models
Residuals for all models
Predictive Models
K-Means
Decision Trees
Decision Forests
Gradient Boosted Decision Trees
Cluster Analysis
Classification
Simulation
Variable Selection
Stepwise Regression
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Combination
New in 7.3
PEMA-R API
rxDataStep
rxExec
26. ScaleR PEMA
What’s a PEMA?
Parallel External Memory Algorithms
Master Algorithm Process
Data
Analyze Each Block
• Not Limited to Available Memory
• Unlimited Data Scale
• Ingests Data One Chunk At A Time.
• Adjustable Memory Footprint
• Multi-Thread Execution Performance
• Highly-Optimized Algorithms
• Algorithm Math Fully Refactored for Parallelism
• Delivered as ScaleR Library in Revolution R Enterprise
Load Block At A Time
Combine Individual Results
Script Calls ScaleR Algorithm
Scripts can call CRAN Open Source Algorithms
Start & Manage Processing
27. 6. Run Revolution R Enterprise on Hadoop Edge Node(s)
rHDFS, rHbase, rHive, rODBC
Local File System (opt.)
ScaleR + CRAN Algorithms
Fast Single-Server Alternative for Modest Data Scale
Edge Node
Thin Client or Remote Desktop
As With Previous:
Single Machine Execution
PEMA Scale & Speed (Single Machine)
Use ScaleR + CRAN
Accelerate R with Intel MKL
Improvements:
Easily Shared via
No Data Movement
Develop on Desktop, Run on Edge Node
Limitations:
“Shorter Trip” for Data
28. 7. Fast, Transparent Parallel Computation
Inside Hadoop YARN/MapReduce
jobtracker
ScaleR
Algorithms
DeployR
Fast Parallelized Analytics on Large Data Sets In Hadoop
As With Previous:
Speed and Scale of ScaleR PEMA Algorithms
Use CRAN Where Appropriate
Accelerate R Math with MKL
Custom Parallelized Algo’s
Advantages
Parallel Computation
No Data Movement
ScaleR PEMA Parallelization
Can Parallelize CRAN “Carefully”
Portable Coding
Limitations:
Hadoop Workload Profiles
Web Services
Remote Execution
Desktop & Server Tools and Applications
29. One Client’s Experience with RRE on Hadoop
Test Cluster - 9 Nodes
Task: Processing Time
Importing and Filtering Datasets from HDFS
– 14 Million Observations: 82 sec.
– 227 Million Observations: 310 sec.
Modeling and Estimation
– 1.2 M Correlations: 2771 sec.
– Simple Linear Regression, 227 M Observations: 61 sec.
– Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
– Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
– Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
64GB, 24 cores each
9 Task Nodes
2 Admin Nodes
1 Edge Node
128GB, 24 cores each
128GB, 24 cores each
30. 8. Combined Edge Node & In-Hadoop
ScaleR
Algorithms
DeployR
Maximized Flexibility, Performance & Workload Handling
As With Previous:
Speed and Scale of ScaleR PEMA Algorithms
Use CRAN Where Appropriate
Accelerate R Math with MKL
Custom Parallelized Algo’s
Advantages
Flexibility for Blended Workloads
Little or No Data Movement
Maximize CRAN Capabilities by Sharing Large RAM Edge Nodes
Web Services
Thin Client Development
Remote Execution
Desktop & Server Tools and Applications
rStudio
31. Occasionally Conflicting Criteria
Infrastructure Criteria:
Big Data Platform
Vendor Choice
Data Ingest
Data Security
Data Governance
Data Science Criteria:
Performance
Self Service
Flexibility
Collaboration
Sharing
Capability
32. Key Questions:
Where are the bulk of your skills? SAS? R? Java? Python? SQL?
Where do you build models today?
Do you have the skills to parallelize algorithms?
Can models be built on a big shared server?
How will you run models?
Do you have the budget to purchase commercial solutions?
How will your needs change over time?
What is your future architecture plan?
How risk averse is your management team regarding new platforms and open source?
33. Key Questions (cont.)
What Workloads Do You Anticipate?
— How Many Users?
— What Workloads?
Workload Realities:
— Many small tasks do not run well in MapReduce
— Large data movements / duplications are costly
What Use Cases Will You Encounter?
— Traditional statistical exploration, modeling?
— Behavior Prediction?
— Outlier Detection?
— Simulation and HPC?
— Massively wide data?
— Real-Time scoring?
— Internet of Things?
34. Eight Steps to Fast, Scalable R Analytics with Hadoop
Open Source Options
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization…
4. rHadoop…
Commercial Options
5. RRE on Servers & Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside Hadoop
No Clear Winner:
Budget & use case determine optimal path
Compelling options in both open source & commercial source
RRE ScaleR uniquely provides automatic parallelization
Current Hadoop platforms are fast for large scale analytics.
Combined in-server & in-Hadoop fits majority of cases
35. 2015 Challenges & Opportunities
• Evolving Hadoop Architectures
• In-Memory Analytics – Spark, YARN Containers, Caching
• Additional Algorithm Parallelization
• Cluster Management
• Cloud and Hybrid Cloud Clusters
• SQL on Hadoop “Battle-Royale”
• Addressing the Resource Reality
• Integration, Deployment Both Drain on Expensive Resources
• Leverage other skills
• Design efficient collaboration
• “Analytics for the Rest of Us”
• New Consumption Targets – Mobile
• New Participants in Design – Business Users
37. Recommended Resources
Revolution Analytics Products
– http://www.revolutionanalytics.com/products
– http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”
– http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop
Revolution Analytics on Social Media:
– http://blog.revolutionanalytics.com/
– @revolutionr on Twitter
– @bill_jacobs on Twitter