R and Hadoop:
Architectural Options
Bill Jacobs
VP Product Marketing & Field CTO, Revolution Analytics
@bill_jacobs
Polling Question #1:
 Who Are You? (choose one)
– Statistician or modeler who uses R
– Other R developer
– Hadoop Expert
– Application builder
– Data guru
– Business user
– Systems vendor or reseller
– Something else…
• Challenges
• Options
• Considerations
• How to Choose
Agenda
Boundless Opportunities
 Marketing: Clickstream & Campaign Analyses
 Digital Media: Recommendation Engines
 Retail: Social Sentiment Analysis
 Insurance: Fraud Waste and Abuse
 Healthcare Delivery: Outcome Prediction
 Manufacturing: Quality Optimization
 P&C Insurance: Risk Analysis
 Consumer Products: Warranty Optimization
 Operations: Supply Chain Optimization
 Econometrics: Market Prediction
 Marketing: Mix and Price Optimization
 Life Sciences: Pharmacogenetics
 Transportation: Asset Utilization
Polling Question #2:
 What Industry Do You Represent?
– Financial Services
– Insurance
– Healthcare, Life Sciences or Pharma
– Manufacturing
– Energy
– Retail
– Logistics and Transportation
– Education
– Government
– Marketing & Advertising
– Technology
– Other
In A Perfect World…
Analytical Capability
Compute
Data Scale
Users
Price
Ease
Security
Hadoop Analytics - Many Alternatives
 R Based Alternatives
 Legacy tools updated – SAS HPA, etc.
 Big Data Databases
 Other Languages – Scala, Java, Julia, various GUIs
Today’s Topic:
 R-Based Alternatives
– “Beside Architectures”
– “Inside Architectures”
– Open Source and Commercial
Reality: Tradeoffs.
Memory Limits
In-Memory vs. Shared Infrastructure
CRAN vs. Parallelization
Desktop vs. Remote
Explicit vs. Automatic Distribution
Locality vs. Movement
Real-Time vs. MapReduce
Traditional Statistics vs. Machine Learning
No Magic Bullet.
Corporate Overview & Quick Facts
Founded: 2008 (as REvolution Computing)
Office Locations: Palo Alto (HQ), Seattle (Engineering), Singapore, London
CEO: David Rich
Number of customers: 200+
Investors:
• Northbridge Venture Partners
• Intel Capital
• Platform Vendor
Web site: www.revolutionanalytics.com
Revolution R Enterprise is the leading commercial analytics platform based on
the open source R statistical computing language
Revolution Analytics
Our Vision:
R becomes the de facto standard for enterprise predictive analytics
Our Mission:
Drive enterprise adoption of R by providing enhanced R products tailored to meet enterprise challenges
Revolution Analytics Builds & Delivers:
 Software Products:
 Stable Distributions
 Broad Platform Support
 Big Data Analytics in R
 Application Integration
 Deployment Platforms
 Agile Development Tooling
 Future Platform Support
 Support & Services
 Commercial Support Programs
 Training Programs
 Professional Services
 Community Programs
 Academic Support Programs
 Contributions to Open Source R
 Open Source Extensions
 Sponsorship of R User Groups
Revolution Analytics Technical Innovations
 R Options from Open Source
to Enterprise
 Parallelized Analytical
Computation
 In-Database & In-Hadoop
Analytics
 Big Data Scalability
 Remote Execution
 Production Deployment
Support
 Multi-Platform Deployment
 Legacy Data Format Support
 Multiple IDE Options
 PMML Model Export
The Revolution R Product Suite
Revolution R Open
• Free and open source R distribution
• Enhanced and distributed by Revolution Analytics
Revolution R Plus
• Open-source distribution of R, packages, and other components
• Enhanced, supported and indemnified by Revolution Analytics
Revolution R Enterprise
• Secure, Scalable and Supported Distribution of R
• With proprietary components created by Revolution Analytics
Polling Question #3:
 State of Play: In your company you are…
– Building Our “Data Lake”
– Running R + Hadoop Data Today
– Running R inside Hadoop using Open source
– Running RRE inside Hadoop
– Deploying Business Apps Using Analytics from Hadoop Data
– Looking at Next Steps e.g. Spark, etc.
Revolution Analytics:
Eight Alternatives for Integrating R & Hadoop
Open Source
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization on Workstations & Servers
4. rHadoop: Open Source Parallelization with rHadoop
Commercial
5. Revolution R Enterprise on Servers & Workstations
6. Revolution R Enterprise on Edge Nodes
7. Revolution R Enterprise Inside Hadoop
8. Combined Edge Node & Inside Hadoop
1. Open Source R Integrated With Hadoop
• Traditional
Open Source
• Memory-
Limited
• Data Moves
Traditional Open Source R “Beside” Architecture:
CRAN Algorithms
rHDFS, rHbase, rHive, rODBC
2. Revolution R Open On Workstations & Servers
Replace Open Source R “Beside” Architecture with Revolution R Open
As with Open Source R:
• Still Free.
• Still Memory Based.
• Data Still Moves.
Improvements:
• Accelerates Math
with Intel MKL
• Improves R-based
packages
Limitations
• No Effect
for non-R Code
CRAN Algorithms
rHDFS, rHbase, rHive, rODBC
Accelerate R Math with the Intel Math Kernel Library (MKL).
Source: http://blog.revolutionanalytics.com/2014/10/revolution-r-open-mkl.html
3. Write Parallel Algorithms PC, Server or Clusters
Write R Code to Explicitly Parallelize – Deploy Across Several Systems
Can Include CRAN Algorithms “Carefully”
foreach & iterators
• doParallel (PC, server)
• doMPI (cluster)
• RRE rxExec
Example Uses:
• Bootstrapping
• Simulation
• HPC
rHDFS, rHbase, rHive, rODBC
As with Previous:
• Still Free.
• Still Memory Based.
• Data Still Moves.
• Intel MKL with RRO
Improvements:
• Parallelized Execution
Limitations:
• Parallelization Difficulty
• Data Movement
• Platform Specific
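The explicit-parallelization pattern on this slide can be sketched as follows. The slide names R's foreach with the doParallel and doMPI backends; the block below is only a Python analogue of the same idea (independent bootstrap replicates farmed out to workers and gathered afterwards), not the R API. The function names, the toy data, and the worker count are all assumptions made for illustration. Threads are used to keep the sketch self-contained; CPU-bound work would more likely use a process pool or a cluster backend.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def bootstrap_mean(data, seed):
    """One bootstrap replicate: resample with replacement, return the mean."""
    rng = random.Random(seed)  # independent, reproducible stream per task
    sample = [rng.choice(data) for _ in data]
    return statistics.fmean(sample)

def parallel_bootstrap(data, replicates=200, workers=4):
    """Run the replicates as independent tasks and gather the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: bootstrap_mean(data, seed),
                             range(replicates)))

data = [float(x) for x in range(1, 101)]  # toy data set: 1..100
means = parallel_bootstrap(data)
estimate = statistics.fmean(means)        # bootstrap estimate of the mean
std_error = statistics.stdev(means)       # bootstrap standard error
```

Note that the whole data set is handed to every task, which mirrors the "Still Memory Based" and "Data Still Moves" points on the slide.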
4. rHadoop: Custom Parallel Execution for Hadoop
Execute R Code & CRAN Algorithms Inside Hadoop
Remote Desktop
R Code
Hadoop Streaming
Can Include CRAN Algorithms
Example Uses:
• Scoring
• Transformation
• Easily Parallelized Algorithms
As With Previous:
 Still Free.
 Optional Intel MKL
in RRO
Improvements:
 Runs R in
MapReduce
 No Data Movement
Limitations:
 Manual
Parallelization
 Hadoop Specific
rHbase
rHDFS
rMapReduce
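The "manual parallelization" this slide refers to means writing the map and reduce steps yourself. A minimal sketch of that division of labor follows, written in Python and run locally rather than on a cluster. The record format, the function names, and the in-process "shuffle" are assumptions for illustration; they are not the rHadoop or Hadoop Streaming interfaces.

```python
from collections import defaultdict

def mapper(line):
    """Map step: parse one input record and emit (key, value) pairs."""
    region, amount = line.split(",")
    yield region, float(amount)

def reducer(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)

def run_mapreduce(lines):
    """Local stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

records = ["east,10.0", "west,4.5", "east,2.5", "west,1.5"]
totals = run_mapreduce(records)
```

In a real cluster the framework runs the mapper next to each data block and performs the grouping, which is where the "No Data Movement" benefit comes from. The analyst still has to express the algorithm in this shape.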
5. Revolution R Enterprise (RRE) PEMAs inside
Hadoop
Traditional “Beside” Architecture with Optimized Algorithms
Available for Windows, Linux
As With Previous:
 Includes Intel MKL in RRO
Advantages
 Speed: PEMAs Parallelize
Across Threads, Cores &
Sockets
 Scale: PEMAs “Chunk” -
no Memory Limits
 All of CRAN Available
 Portability
 Fully Supported
Limitations:
 Data Movement
 Single Machine
Revolution R Enterprise:
• ScaleR PEMA
Algorithms
plus
• All of CRAN
(subject to memory limits)
rHDFS, rHbase, rHive, rODBC
Revolution R Enterprise is…
the only big data big analytics platform based on open source R
 High Performance, Scalable Analytics
 Portable Across Enterprise Platforms
 Easier to Build & Deploy Analytics
ScaleR
Refactor Algorithms for Dramatic Performance and Capacity Improvement
ScaleR
High Performance Algorithms for the Most Common Uses
Data Step
 Data import – Delimited, Fixed, SAS, SPSS, ODBC
 Variable creation & transformation
 Recode variables
 Factor variables
 Missing value handling
 Sort, Merge, Split
 Aggregate by category (means, sums)
Descriptive Statistics
 Min / Max, Mean, Median (approx.)
 Quantiles (approx.)
 Standard Deviation
 Variance
 Correlation
 Covariance
 Sum of Squares (cross product matrix for set variables)
 Pairwise Cross tabs
 Risk Ratio & Odds Ratio
 Cross-Tabulation of Data (standard tables & long form)
 Marginal Summaries of Cross Tabulations
Statistical Tests
 Chi Square Test
 Kendall Rank Correlation
 Fisher’s Exact Test
 Student’s t-Test
Sampling
 Subsample (observations & variables)
 Random Sampling
Predictive Models
 Sum of Squares (cross product matrix for set variables)
 Multiple Linear Regression
 Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
 Covariance & Correlation Matrices
 Logistic Regression
 Classification & Regression Trees
 Predictions/scoring for models
 Residuals for all models
Cluster Analysis
 K-Means
Classification
 Decision Trees
 Decision Forests
 Gradient Boosted Decision Trees
Variable Selection
 Stepwise Regression
Simulation
 Simulation (e.g. Monte Carlo)
 Parallel Random Number Generation
Combination
New in 7.3
 PEMA-R API
 rxDataStep
 rxExec
Revolution Analytics Confidential – Under NDA
ScaleR PEMA
What’s a PEMA?
Parallel External Memory Algorithms
• Not Limited to Available Memory
• Unlimited Data Scale
• Ingests Data One Chunk At A Time.
• Adjustable Memory Footprint
• Multi-Thread Execution Performance
• Highly-Optimized Algorithms
• Algorithm Math Fully Refactored for Parallelism
• Delivered as ScaleR Library in Revolution R Enterprise
Diagram labels:
• Script Calls ScaleR Algorithm
• Master Algorithm Process: Start & Manage Processing
• Data: Load Block At A Time
• Analyze Each Block
• Combine Individual Results
• Scripts can call CRAN Open Source Algorithms
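The load, analyze, combine cycle on this slide can be illustrated with a small external-memory computation. The sketch below is an assumption-laden Python illustration of the general technique, not ScaleR code. It fits a simple linear regression by keeping only five running sums per block, so memory use depends on the block size rather than on the data size. All names (Sums, read_blocks, chunked_regression) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sums:
    """Sufficient statistics for simple linear regression y = a + b*x."""
    n: int = 0
    x: float = 0.0
    y: float = 0.0
    xx: float = 0.0
    xy: float = 0.0

    def __add__(self, other):
        # Combining two blocks is plain addition of their statistics.
        return Sums(self.n + other.n, self.x + other.x, self.y + other.y,
                    self.xx + other.xx, self.xy + other.xy)

def read_blocks(rows, block_size):
    """Load one block at a time from any iterable of (x, y) rows."""
    block = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

def analyze_block(block):
    """Analyze one block: reduce it to its sufficient statistics."""
    s = Sums()
    for x, y in block:
        s = s + Sums(1, x, y, x * x, x * y)
    return s

def chunked_regression(rows, block_size=1000):
    """Master process: stream blocks, combine results, solve at the end."""
    total = Sums()
    for block in read_blocks(rows, block_size):
        total = total + analyze_block(block)
    slope = ((total.n * total.xy - total.x * total.y)
             / (total.n * total.xx - total.x ** 2))
    intercept = (total.y - slope * total.x) / total.n
    return slope, intercept

# A generator stands in for data too large to hold in memory at once.
rows = ((float(i), 3.0 * i + 2.0) for i in range(10_000))
slope, intercept = chunked_regression(rows, block_size=256)
```

Because the per-block results combine by addition, the blocks could equally be analyzed on separate threads or machines, which is the property the slide calls "refactored for parallelism".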
6. Run Revolution R Enterprise on Hadoop Edge Node(s)
Fast Single-Server Alternative for Modest Data Scale
Thin Client or Remote Desktop
Edge Node
ScaleR + CRAN Algorithms
Local File System (opt.)
rHDFS, rHbase, rHive, rODBC
As With Previous:
 Single Machine Execution
 PEMA Scale & Speed (Single
Machine)
 Use ScaleR + CRAN
 Accelerate R with Intel MKL
Improvements:
 Easily Shared via
 No Data Movement
 Develop on Desktop Run on
Edge Node
Limitations:
 “Shorter Trip” for Data
7. Fast, Transparent Parallel Computation
Inside Hadoop YARN/MapReduce
jobtracker
ScaleR
Algorithms
DeployR
Fast Parallelized Analytics on Large Data Sets In Hadoop
As With Previous:
 Speed and Scale of ScaleR PEMA
Algorithms
 Use CRAN Where Appropriate
 Accelerate R Math with MKL
 Custom Parallelized Algo’s
Advantages
 Parallel Computation
 No Data Movement
 ScaleR PEMA Parallelization
 Can Parallelize CRAN “Carefully”
 Portable Coding
Limitations:
 Hadoop Workload Profiles
Web Services
Remote Execution
Desktop & Server Tools and Applications
One Client’s Experience with RRE on Hadoop
Test Cluster - 9 Nodes
Task: Processing Time
Importing and Filtering Datasets from HDFS
• 14 Million Observations: 82 sec.
• 227 Million Observations: 310 sec.
Modeling and Estimation
• 1.2 M Correlations: 2771 sec.
• Simple Linear Regression, 227 M Observations: 61 sec.
• Multiple Linear Regression, Three Variables, 227 M Observations: 58 sec.
• Multiple Linear Regression, Four Variables, 227 M Observations: 58 sec.
• Random Forest, 10 Predictor Variables, 227 M Observations, 10 Trees with Max Depth of 10 Splits: 2 hr. 3 min.
Cluster: 9 Task Nodes, 2 Admin Nodes, 1 Edge Node
Node specifications shown: 64GB, 24 cores each; 128GB, 24 cores each; 128GB, 24 cores each
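The two import timings above can be turned into rough throughput figures. This is only arithmetic on the numbers quoted on the slide. They are single runs on one client's test cluster, so the result is an illustration and not a benchmark. The dictionary keys and variable names are assumptions.

```python
# Timings copied from the "Importing and Filtering Datasets from HDFS" rows.
import_runs = {
    "14 M observations": (14_000_000, 82),     # (observations, seconds)
    "227 M observations": (227_000_000, 310),
}

def throughput(observations, seconds):
    """Observations processed per second."""
    return observations / seconds

rates = {name: throughput(n, s) for name, (n, s) in import_runs.items()}
speedup = rates["227 M observations"] / rates["14 M observations"]

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} observations/sec")
print(f"Relative throughput of the larger run: {speedup:.1f}x")
```

One possible reading is that fixed job start-up cost is spread over more data in the larger run, which would fit the later point that many small tasks do not run well in MapReduce.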
8. Combined Edge Node & In-Hadoop
ScaleR
Algorithms
DeployR
Maximized Flexibility, Performance & Workload Handling
As With Previous:
 Speed and Scale of ScaleR PEMA
Algorithms
 Use CRAN Where Appropriate
 Accelerate R Math with MKL
 Custom Parallelized Algo’s
Advantages
 Flexibility for Blended Workloads
 Little or No Data Movement
 Maximize CRAN Capabilities by
Sharing Large RAM Edge Nodes
Web Services
Thin Client Development
Remote Execution
Desktop & Server Tools and Applications
rStudio
Occasionally
Conflicting Criteria
Infrastructure Criteria:
 Big Data Platform
 Vendor Choice
 Data Ingest
 Data Security
 Data Governance
Data Science Criteria:
 Performance
 Self Service
 Flexibility
 Collaboration
 Sharing
 Capability
Key Questions:
 Where are the bulk of your skills? SAS? R? Java? Python? SQL?
 Where do you build models today?
 Do you have the skills to parallelize algorithms?
 Can models be built on a big shared server?
 How will you run models?
 Do you have the budget to purchase commercial solutions?
 How will your needs change over time?
 What is your future architecture plan?
 How risk averse is your management team regarding new platforms and
open source?
Key Questions (cont.)
 What Workloads Do You Anticipate?
— How Many Users?
— What Workloads?
 Workload Realities:
— Many small tasks do not run well
in MapReduce
— Large data movements /
duplications are costly
 What Use Cases Will You
Encounter?
— Traditional statistical
exploration, modeling?
— Behavior Prediction?
— Outlier Detection?
— Simulation and HPC?
— Massively wide data?
— Real-Time scoring?
— Internet of Things?
Eight Steps to Fast, Scalable R Analytics with
Hadoop
Open Source Options
1. Open Source R
2. Revolution R Open
3. Open Source Parallelization…
4. rHadoop…
Commercial Options
5. RRE on Servers &
Workstations
6. RRE on Edge Nodes
7. RRE Inside Hadoop
8. RRE on Edge Node & Inside
Hadoop
No Clear Winner:
 Budget & use case determine
optimal path
 Compelling options in both open
source & commercial source
 RRE ScaleR uniquely provides
automatic parallelization
 Current Hadoop platforms are
fast for large scale analytics.
 Combined in-server & in-Hadoop fits the majority of cases
2015 Challenges & Opportunities
• Evolving Hadoop Architectures
• In-Memory Analytics – Spark, YARN Containers, Caching
• Additional Algorithm Parallelization
• Cluster Management
• Cloud and Hybrid Cloud Clusters
• SQL on Hadoop “Battle-Royale”
• Addressing the Resource Reality
• Integration, Deployment Both Drain on Expensive Resources
• Leverage other skills
• Design efficient collaboration
• “Analytics for the Rest of Us”
• New Consumption Targets – Mobile
• New Participants in Design – Business Users
Recommended Resources
 Revolution Analytics Products
– http://www.revolutionanalytics.com/products
– http://www.revolutionanalytics.com/big-analytics-hadoop-and-edws
 Whitepaper: “Delivering Value from Big Data with Revolution R Enterprise and Hadoop”
– http://www.revolutionanalytics.com/whitepaper/delivering-value-big-data-revolution-r-enterprise-and-hadoop
 Revolution Analytics on Social Media:
– http://blog.revolutionanalytics.com/
– @revolutionr on Twitter
– @bill_jacobs on Twitter
Thank you.
www.revolutionanalytics.com
1.855.GET.REVO
Twitter: @RevolutionR

Performance and Scale Options for R with Hadoop: A comparison of potential architectures
