Big Data Predictive Analytics
with Revolution R Enterprise
David Smith
Chief Community Officer
@revodavid
Gartner BI Conference, April 2014
OUR COMPANY
The leading provider
of advanced analytics
software and services
based on open source R,
since 2007
OUR SOFTWARE
The only Big Data, Big
Analytics software platform
based on the data science
language R
KUDOS
Visionary
Gartner Magic Quadrant
for Advanced Analytics
Platforms, 2014
What is R?
• Most widely used data analysis software
  – Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
  – Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
  – As seen in New York Times, Twitter and Flowing Data
• Thriving open-source community
  – Leading edge of analytics research
• Fills the talent gap
  – New graduates prefer R
White paper: R is Hot (bit.ly/r-is-hot)
Exploding growth and demand for R
• R is the highest paid IT skill
• R is the most-used data science language after SQL
• R is used by 70% of data miners
• R is #15 of all programming languages
• R is growing faster than any other data science language
• R is the #1 Google Search for Advanced Analytics software
• R has more than 2 million users worldwide
R Usage Growth (chart): Rexer Data Miner Survey, 2007-2013
• 70% of data miners report using R
• R is the first choice of more data miners than any other software
Source: www.rexeranalytics.com
Technical Support for Open Source R
AdviseR™ from Revolution Analytics
Technical support for open source R, from the R experts.
• 24x7 email and phone support
• On-line case management and knowledgebase
• Access to technical resources, documentation and user forums
• Exclusive on-line webinars from community experts
• Guaranteed response times
Also available: expert hands-on and on-line training for R, from
Revolution Analytics AcademyR.
www.revolutionanalytics.com/AdviseR
www.revolutionanalytics.com/AcademyR
Revolution R Enterprise is…
the only big data big analytics platform
based on open source R
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics
Enhancing Open Source R for the Enterprise
Area | Open source R | Revolution R Enterprise | Benefit
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | GPL-compatible licensing | Eliminate risk with open source
Powering Next Generation Analytics
Parallel External Memory Algorithms
[Diagram: combine intermediate results]
Eliminates Performance and Capacity
Limits of Open Source R and Legacy SAS
• Unique PEMAs: Parallel, external-memory algorithms
• High-performance, scalable replacements for R/SAS analytic functions
• Parallel/distributed processing eliminates CPU bottleneck
• Data streaming eliminates memory size limitations
• Works with in-memory and disk-based architectures
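The PEMA idea on this slide (process the data in chunks, then combine intermediate results) can be sketched in base R. This is only an illustration under stated assumptions, not the ScaleR implementation: the simulated data, the ten in-memory chunks standing in for blocks streamed from disk, and the helper name processChunk are all hypothetical.

```r
# Illustrative sketch in base R; this is NOT how ScaleR is implemented.
set.seed(42)
n <- 10000
dat <- data.frame(x = rnorm(n))
dat$y <- 3 + 2 * dat$x + rnorm(n)

# Split the rows into chunks, standing in for blocks streamed from disk.
chunkId <- rep(1:10, length.out = n)
chunks <- split(dat, chunkId)

# Step 1: each chunk yields a small intermediate result (cross-products).
# The chunks are independent, so this step could run in parallel.
processChunk <- function(chunk) {
  X <- cbind(1, chunk$x)  # design matrix with an intercept column
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}
partials <- lapply(chunks, processChunk)

# Step 2: combine the intermediate results by summing them.
xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

# Step 3: solve the normal equations on the combined totals.
betaChunked <- drop(solve(xtx, xty))

# Reference fit with all rows in memory, for comparison.
betaFull <- unname(coef(lm(y ~ x, data = dat)))
all.equal(betaChunked, betaFull)
```

Only the small cross-product matrices are ever held together, so memory use depends on the number of model terms rather than the number of rows.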
Revolution R Enterprise is the
Big Data Big Analytics Platform
All of Open Source R plus:
• Big Data scalability
• High-performance analytics
• Development and deployment tools
• Data source connectivity
• Application integration framework
• Multi-platform architecture
• Support, Training and Services
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
Components: DistributedR, ScaleR, ConnectR, DeployR
Supported platforms:
• In the Cloud: Amazon AWS
• Workstations & Servers: Windows; Red Hat and SUSE Linux
• Clustered Systems: IBM Platform LSF; Microsoft HPC
• EDW: IBM Netezza; Teradata
• Hadoop: Hortonworks; Cloudera
Write Once.
Deploy Anywhere.
Write Once → Deploy Anywhere

Set the desired compute context for code execution:

rxSetComputeContext("local") # DEFAULT: local system
rxSetComputeContext(RxHadoopMR(<data, server environment arguments>))
rxSetComputeContext(RxTeradata(<data, server environment arguments>))
rxSetComputeContext(RxHpcServer(<data, server environment arguments>))
rxSetComputeContext(RxLsfCluster(<data, server environment arguments>))

The same code then runs anywhere:

# Summarize and calculate descriptive statistics from the airDS data set
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS)
# Fit a linear regression model
arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
summary(arrDelayLm1)
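The pattern behind the slide, analysis code that stays fixed while a separately chosen setting decides where it runs, can be mimicked in base R. The sketch below is a toy under stated assumptions: setContext, context and chunkMeans are hypothetical names, not RevoScaleR functions, and the "multicore" backend simply forks local processes through the parallel package.

```r
# Hypothetical stand-in for a compute context; none of these names are
# RevoScaleR functions.
library(parallel)

context <- new.env()
assign("map", lapply, envir = context)  # default: run sequentially

setContext <- function(name) {
  backend <- switch(
    name,
    local = lapply,
    multicore = function(X, FUN) {
      # Forking is unavailable on Windows, so fall back to one core there.
      cores <- if (.Platform$OS.type == "windows") 1L else 2L
      mclapply(X, FUN, mc.cores = cores)
    },
    stop("unknown context: ", name)
  )
  assign("map", backend, envir = context)
  invisible(name)
}

# Analysis code: written once, unaware of which backend is active.
chunkMeans <- function(values, nChunks = 4) {
  chunks <- split(values, rep(seq_len(nChunks), length.out = length(values)))
  unlist(context$map(chunks, mean))
}

x <- 1:100
setContext("local")
resLocal <- chunkMeans(x)
setContext("multicore")
resMulti <- chunkMeans(x)
identical(resLocal, resMulti)  # both backends should agree
```

Switching backends touches only the setContext call; chunkMeans is never edited, which is the property the slide advertises for rxSetComputeContext.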
In-Hadoop Big Data Big Analytics
• Eliminate data movement latency
• Speed model development
• Use commodity Hadoop nodes as analytics engine
[Diagram: Hadoop cluster with a Name Node and five Data Nodes (HDFS), and a Job Tracker and five Task Trackers (MapReduce)]
Revolution Analytics coupled with the Teradata Unified Data Architecture accelerates
big data analytics with the R language.
In-Database Analytics:
• Parallel R in-database for big data analytics on Teradata
• Build parallel R models completely in R
• Use Teradata appliance as analytics engine
• No need to move data
Teradata 14.10 + Revolution R Enterprise V7
RRE7 in the Cloud
• Revolution R Enterprise 7, on the industry-leading cloud platform
• Pay as you go, priced by cores x hours
  – No long-term commitment required
• Launch Windows and Linux servers on demand
  – Windows 2008 R2 with DevelopR
  – RHEL 6 with RStudio Server Professional
  – Server instances from 2 to 32 cores
  – Analyze data sets up to 2 TB
• Convenient, consistent and reliable
  – Available globally, accessible anywhere
  – Forum-based support with registration
• Free 14-day trial available
CLOUD SERVERS: $0.70 per core/hour, plus AWS infrastructure costs
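As a worked example of the cores x hours pricing, the short R sketch below computes the software charge for a hypothetical job. The 8-core, 10-hour workload and the function name softwareCost are assumptions for illustration; only the $0.70 rate comes from the slide, and AWS infrastructure costs are not included.

```r
# Software charge only; AWS infrastructure costs are billed separately.
ratePerCoreHour <- 0.70  # USD per core per hour, as quoted on the slide

# Hypothetical helper: cost = cores x hours x rate.
softwareCost <- function(cores, hours, rate = ratePerCoreHour) {
  cores * hours * rate
}

# Assumed workload: an 8-core server running for 10 hours.
softwareCost(cores = 8, hours = 10)  # 8 * 10 * 0.70, about 56 USD
```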
Revolution R Enterprise Ecosystem
Integration with the Big Data Analytics Stack
Deployment / Consumption
Data / Infrastructure
Advanced Analytics
ETL
SI / Service MSP / DSP
How Customers Revolutionize their Business
Power
“We’ve combined Revolution R
Enterprise and Hadoop to build and
deploy customized exploratory data
analysis and GAM survival models for
our marketing performance
management and attribution platform.
Given that our data sets are already in
the terabytes and are growing rapidly,
we depend on Revolution R Enterprise’s
scalability and power – we saw about
a 4x performance improvement on 50
million records. It works brilliantly.”
- CEO, John Wallace, DataSong
4X performance
50M records scored daily
Scalability
“We’ve been able to scale our solution to a
problem that’s so big that most companies could
not address it. If we had to go with a different
solution we wouldn’t be as efficient as we are
now.”
- SVP Analytics, Kevin Lyons, eXelate
TBs of data from 200+ data sources
Tens of thousands of attributes
Hundreds of millions of scores daily
2X data, 2X attributes: no impact on performance
Performance
“We need a high-performance analytics
infrastructure because marketing optimization is a
lot like a financial trading. By watching the market
constantly for data or market condition updates,
we can now identify opportunities for our
clients that would otherwise be lost.”
- Chief Analytics Officer, Leon Zemel, [x+1]
Why Revolution R Enterprise?
• Platform Independence
• Take Big Cost Out of Big Data
• Supercharge R for Massive Data
• Power R for the Enterprise
Thank You
David Smith
Chief Community Officer
@revodavid
blog.revolutionanalytics.com
