1. Revolution Confidential
Are You Ready for Big Data Big Analytics?
September, 2013
Bill Jacobs
Director, Product Marketing
Revolution Analytics
@bill_jacobs
Revolution Analytics
@RevolutionR
5. What Language is Most Popular for Data Mining and Data Science?
Survey Question:
“What programming/statistics languages you used for an analytics /
data mining / data science work in 2013?”
Results:
R – 61%
Python – 39%
SQL – 37%
How does this compare to 2012?
“Highest growth was for Pig/Hive/Hadoop-based languages, R, and
SQL, while Perl, C/C++, and Unix tools declined…”
From the 2013 KDnuggets survey of 700 voters.
6. The R Language: What Is It?
A Language Platform…
A Procedural Language optimized for Statistics and Data Science
A Data Visualization Framework
Provided as Open Source
A Community…
2M Statistical Analysis and Machine Learning Users
Taught in Most University Statistics Programs
Active User Groups Across the World
An Ecosystem
CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
Many Applicable to Big Data If Scaled
7. Revolution Analytics - Overview
We are the only provider of a commercial analytics platform based on
the open source R statistical computing language.
Power
Productivity
Enterprise Readiness
Stable, scalable, multi-platform, world-wide support
Easier to build and deploy analytic applications
Professional services enablement
Distributed, high performance analytics algorithms
World Wide Support Teams
• Standard and Premium Programs
• Technical Account Managers
• Customer Success Managers
Professional Services
• Architecture planning
• Systems Integration
• Advanced analytic applications
• Full life cycle projects
8. 200+ Customer Stories
Digital Media & Retail
Finance & Insurance
Healthcare & Life Sciences
Manufacturing & High Tech
Academic & Gov’t
9. Revolution R Enterprise
Revolution R Enterprise is the only commercial big data analytics platform that provides Big Data Big Analytics based on R.
Portable Across Enterprise Platforms
High Performance, Scalable Analytics
Easier to Build & Deploy
10. Additional Technology Challenges Accompanying Big Data Analytics Efforts
Big Data
• New Data Sources
• Data Variety & Velocity
• Fine Grain Control
• Data Movement, Memory Limits
Complex Computation
• Experimentation
• Many Small Models
• Ensemble Models
• Simulation
Enterprise Readiness
• Heterogeneous Landscape
• Write Once, Deploy Anywhere
• Skill Shortage
• Production Support
Production Efficiency
• Shorter Model Shelf Life
• Volume of Models
• Long End-to-End Cycle Time
• Pace of Decision Accelerated
11. Open Source R Drives Analytical Innovation
… but has some limitations for Enterprise Deployment
Open Source R | Revolution R Enterprise | Results
Memory Bound | Large Data & Cluster-Based Storage Management | Operate on bigger data sizes
Single Threaded | Scalable, multi-threaded, parallel processing | Increased speed of analysis
Community Support | Commercial production support and professional services teams | Holistic production support
Innovative – 5000+ packages, exponential growth | Ability to combine with open source R packages where needed | A key combination of innovation and scale
12. Big Data Speed @ Scale with Revolution R Enterprise (RRE)
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Processing
In-Hadoop Execution
Memory Management
Parallelized User Code
First, we enhance and accelerate the Open Source R interpreter.
13. Open Source R performance: Multi-threaded Math
[Figure: Open Source R vs. Revolution R Enterprise]
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra [1]
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks [2]
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 5-50x performance improvements compared to Open Source R, without changing any code
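The Speedup column in the benchmark table is the open source R time divided by the Revolution R time. A minimal sketch of that arithmetic follows, using the rounded timings as printed on the slide. The dictionary layout and the `speedup` helper are illustrative names, not part of any Revolution tooling, and ratios computed from rounded timings need not match the slide's stated figures exactly.

```python
# Timings in seconds as printed on the slide: (open source R, Revolution R).
# The structure and names here are illustrative assumptions.
timings = {
    "Matrix Multiply": (176.0, 9.3),
    "Cholesky Factorization": (25.5, 1.3),
    "Linear Discriminant Analysis": (189.0, 74.0),
    "R Benchmarks (Matrix Functions)": (22.0, 3.5),
    "R Benchmarks (Program Control)": (5.6, 5.4),
}


def speedup(baseline_sec: float, improved_sec: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_sec / improved_sec


# Compute one ratio per benchmark row.
speedups = {name: speedup(*pair) for name, pair in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The program-control row stays close to 1x, which is consistent with the slide's "Not appreciable" entry: interpreter-bound loops gain little from faster math libraries.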
14. Big Data Speed @ Scale with Revolution R Enterprise (RRE)
Fast Math Libraries
Parallelized Algorithms
In-Database Execution
Multi-Threaded Execution
Multi-Core Processing
In-Hadoop Execution
Memory Management
Parallelized User Code
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
15. Unparalleled Big Data Big Analytics Scale, Performance & Innovation
1 + 1 = 1000’s
[Figure: axes Value and Performance. Performance Enhanced R (R Language, Open Source R Analytic Packages) + Big Data Distributed & Parallel Processing & Analytic Package = Revolution R Enterprise]
16. Analytic Personas and their Tools
Analytic Consumer
Business Analyst
Power Analyst
Data Scientist
Information Technologist
Right Tool, Right Problem
19. Predicting Predictive Analytics
What Are Your Use Cases?
How Will Your Use Cases Evolve?
What Platform Will Best Support Each?
Whose Platform Will Excel Tomorrow?
20. Portability and Investment Assurance: Write Once – Deploy Anywhere
Write it Once. Deploy it Anywhere.
Workstations
Servers
Server Clusters
EDWs and Analytical DBMSs
Hadoop (coming soon!)
21. Summary
R is Hot.
Revolution R Enterprise:
Scales R to Big Data.
Scales Performance on Big Data Platforms
Is Commercially Supported
Is Broadly Deployable
Allows you to WODA (Write Once, Deploy Anywhere)!
Revolution Analytics Maximizes Results, While
Minimizing Near-Term and Long-Term Risks
Remember that CRAN is a new term to IT professionals and to anyone who hasn’t learned much about R. Spend some time on it. CRAN = Comprehensive R Archive Network, a single repository of R algorithms, test data, and evaluations. Used by nearly all R programmers.
Who is Revolution?
To understand how a typical customer might use RRE, it’s important to understand who a typical customer might be. Users comprise statisticians, data scientists, IT, and academics across a wide variety of fields and industries. Also point out the flexibility of the R solution across industries; CRAN offers incredible capabilities. The same goes for scalability: some customers use it for desktop analysis, and the exact same program is used on production servers elsewhere with no change to the code.
Despite the growth, there are limitations with open source R, and these become more impactful as either the scale of the data or the number of users within an organization grows. Revo addresses these points to offer a more complete solution. Compare and contrast.
This slide presents a way to distinguish ourselves from the open source versions of R, particularly those “supported” by platform vendors who bundle it. Explain that with this slide we are illustrating orders of magnitude performance improvement overall. Key advances are:
• Multi-threading and multi-core execution, which allows parallel processors in a server to work together.
• Memory management that enables algorithms to use a combination of memory and disk, alleviating a long-standing problem with R: being limited by the amount of physical memory.
• Parallelization in all its forms, but most importantly the PEMA algorithms in ScaleR that work across clusters of servers, both in Hadoop and in cluster operating systems, to fully parallelize key statistics algorithms.
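The memory-management and parallelization points in the notes above rest on one general idea: process the data one block at a time, keep only a small summary per block, and merge the summaries. The sketch below illustrates that idea for mean and variance using Welford's update and the standard pairwise merge formula. It is an illustration in Python under stated assumptions, not ScaleR's implementation; `Partial`, `summarize`, `merge`, `chunks`, and `chunked_mean_var` are hypothetical names, and the in-memory list stands in for blocks read from disk or from different cluster nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Partial:
    """Summary of one chunk: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def summarize(chunk: Iterable[float]) -> Partial:
    """One-pass (Welford) summary of a single chunk held in memory."""
    p = Partial()
    for x in chunk:
        p.n += 1
        delta = x - p.mean
        p.mean += delta / p.n
        p.m2 += delta * (x - p.mean)
    return p


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two chunk summaries without revisiting the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Partial(n, mean, m2)


def chunks(values: List[float], size: int) -> Iterator[List[float]]:
    """Stand-in for reading fixed-size blocks from disk (an assumption)."""
    for i in range(0, len(values), size):
        yield values[i:i + size]


def chunked_mean_var(values: List[float], size: int) -> Tuple[float, float]:
    """Mean and sample variance computed one block at a time."""
    total = Partial()
    for block in chunks(values, size):
        total = merge(total, summarize(block))
    variance = total.m2 / (total.n - 1) if total.n > 1 else 0.0
    return total.mean, variance


data = [float(x) for x in range(1, 11)]
mean, variance = chunked_mean_var(data, size=3)
print(mean, variance)
```

Because `merge` only needs the per-chunk summaries, the chunks could in principle be summarized on separate cores or servers and combined afterwards, which is why this pattern suits both out-of-memory data and cluster execution.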