SlideShare a Scribd company logo
1 of 34
High Performance
Predictive Analytics
in R and Hadoop
Presented by:
Mario E. Inchiosa, Ph.D.
US Chief Scientist
Hadoop Summit 2013
June 27, 2013 1
Innovate with R
2
 Most widely used data analysis software
 Used by 2M+ data scientists, statisticians and analysts
 Most powerful statistical programming language
 Flexible, extensible and comprehensive for productivity
 Create beautiful and unique data visualizations
 As seen in New York Times, Twitter and Flowing Data
 Thriving open-source community
 Leading edge of analytics research
 Fills the talent gap
 New graduates prefer R
Download the White Paper
R is Hot
bit.ly/r-is-hot
R is open source and drives analytic innovation but has
some limitations
Bigger
data sizes
Speed of
analysis
Production
support
Memory Bound Big Data
Single Threaded Scale out, parallel
processing, high speed
Community Support Commercial
production support
Innovation
and scale
Innovative –
5000+
packages, exponential
growth
Combines with open
source R packages
where needed
4
Revolution R Enterprise
High Performance, Multi-Platform Analytics Platform
Revolution R Enterprise
DeployR
Web Services
Software Development Kit
DevelopR
Integrated Development
Environment
ConnectR
High Speed & Direct Connectors
Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, OBDC
ScaleR
High Performance Big Data Analytics
Platform LSF, MS HPC Server,
MS Azure Burst, SMP Servers
DistributedR
Distributed Computing Framework
Platform LSF, MS HPC Server, MS
Azure Burst
RevoR
Performance Enhanced Open Source R + CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure
Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
Open Source R
Plus
Revolution Analytics
performance
enhancements
Revolution
Analytics
Value-Add
Components
Providing Power
and Scale to Open
Source R
Our objectives with respect to Hadoop
 Allow our customers to do predictive
analytics as easily in Hadoop as they can
using R on their laptops today
 Scalable and High Performance
 Provide the first enterprise-
ready, commercially supported, full-
featured, out-of-the-box Predictive Analytics
suite running in Hadoop
5
Overview of Revolution and Hadoop
 Revolution currently provides capabilities for
using R and Hadoop together
 RHadoop packages: rmr2, rhdfs, rhbase
 RevoScaleR package: Rapidly read and process
data from HDFS
 We are in the process of porting
RevoScaleR to run fully “in Hadoop.” It
currently runs on
laptops, workstations, servers, and MPI-
based clusters.
6
RHadoop: Map-Reduce with R
 Unlock data in Hadoop using only the R language
 No need to learn Java, Pig, Python or any other language
7
RevoScaleR Overview
8
 An R package that adds capabilities to R:
 Data Import/Clean/Explore/Transform
 Analytics – Descriptive and Predictive
 Parallel and distributed computing
 Visualization
 Scales from small local data to huge distributed
data
 Scales from laptop to server to cluster to cloud
 Portable – the same code works on small and
big data, and on laptop, server, cluster, Hadoop
Key ways RevoScaleR enhances R
 High Performance Computing (HPC)
functions: Parallel/Distributed computing
 High Performance Analytics (HPA) functions:
Big Data + Parallel/Distributed computing
 XDF file format; rapidly store and extract
data
 Use results from HPA and HPC functions in
other R packages
Revolution R Enterprise 9
High Performance Big Data Analytics with
ScaleR
10
Statistical
Tests
Machine
Learning
Simulation
Descriptive
Statistics
Data
Visualization
R Data Step
Predictive
Models
Sampling
ScaleR: High Performance Scalable
Parallel External Memory Algorithms
Company Confidential – Do not distribute 11
 Data import –
Delimited, Fixed, SAS, SPSS,
OBDC
 Variable creation &
transformation
 Recode variables
 Factor variables
 Missing value handling
 Sort
 Merge
 Split
 Aggregate by category
(means, sums)
 Min / Max
 Mean
 Median (approx.)
 Quantiles (approx.)
 Standard Deviation
 Variance
 Correlation
 Covariance
 Sum of Squares (cross product
matrix for set variables)
 Pairwise Cross tabs
 Risk Ratio & Odds Ratio
 Cross-Tabulation of Data
(standard tables & long form)
 Marginal Summaries of Cross
Tabulations
 Chi Square Test
 Kendall Rank Correlation
 Fisher’s Exact Test
 Student’s t-Test
Data Prep, Distillation & Descriptive Analytics
 Subsample (observations &
variables)
 Random Sampling
R Data Step Statistical Tests
Sampling
Descriptive Statistics
ScaleR: High Performance Scalable
Parallel External Memory Algorithms
Company Confidential – Do not distribute 12
 Sum of Squares (cross product
matrix for set variables)
 Multiple Linear Regression
 Generalized Linear Models (GLM)
- All exponential family
distributions:
binomial, Gaussian, inverse
Gaussian, Poisson, Tweedie.
Standard link functions including:
cauchit, identity, log, logit, probit.
User defined distributions & link
functions.
 Covariance & Correlation
Matrices
 Logistic Regression
 Classification & Regression Trees
 Predictions/scoring for models
 Residuals for all models
 Histogram
 Line Plot
 Scatter Plot
 Lorenz Curve
 ROC Curves (actual data and
predicted values)
 K-Means
Statistical Modeling
 Decision Trees
Predictive Models Cluster AnalysisData Visualization
Classification
Machine Learning
Simulation
 Monte Carlo
HPC Capabilities in RevoScaleR
 Execute (essentially) any R function in
parallel on nodes and cores
 Results from all runs are returned in a list to
the user’s laptop
 Extensive control over parameters
 Extensive control over nodes, cores, and
times to run
 Ideal for simulations and for running R
functions on small amounts of data
Revolution R Enterprise 13
Key ScaleR HPA features
 Handles an arbitrarily large number of rows
in a fixed amount of memory
 Scales linearly with the number of rows
 Scales linearly with the number of nodes
 Scales well with the number of cores per
node
 Scales well with the number of parameters
 Extremely high performance
Revolution R Enterprise 14
Regression comparison using in-memory
data: lm() vs rxLinMod()
Revolution R Enterprise 17
lm() in R
lm() in Revolution R
rxLinMod() in Revolution R
GLM comparison using in-memory data:
glm() vs rxGlm()
Revolution R Enterprise 18
SAS HPA Benchmarking comparison*
Logistic Regression
Rows of data 1 billion 1 billion
Parameters “just a few” 7
Time 80 seconds 44 seconds
Data location In memory On disk
Nodes 32 5
Cores 384 20
RAM 1,536 GB 80 GB
19
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a
20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Double
45%
1/6th
5%
5%
Revolution R Enterprise Delivers Performance at 2% of the Cost
Allstate compares SAS, Hadoop and R
for Big-Data Insurance Models
Approach Platform Time to fit
SAS 16-core Sun Server 5 hours
rmr/MapReduce 10-node 80-core
Hadoop Cluster
> 10 hours
R 250 GB Server Impossible (> 3 days)
Revolution R Enterprise 5-node 20-core
LSF cluster
5.7 minutes
Revolution R Enterprise 20
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
What makes ScaleR so fast?
 Lot’s of things!
 This kind of performance requires
 Careful architecting that supports performance
 Constant, intense focus on all of the details
 A review of every line of code with an eye to
performance (in addition to giving the correct
answers, of course)
 Extensive profiling and continuous
benchmarking to detect problems and improve
code
Revolution R Enterprise 21
Specific speed-related factors -- 1
 Efficient computational algorithms
 Efficient memory management – minimize
data copying and data conversion
 Heavy use of C++ templates; optimal code
 Efficient data file format; fast access by row
and column
Revolution R Enterprise 22
Specific speed-related factors -- 2
 Models are pre-analyzed to detect and
remove duplicate computations and points of
failure (singularities)
 Handle categorical variables efficiently
Revolution R Enterprise 23
Parallel External Memory Algorithms
(PEMA’s)
 The HPA analytics algorithms are all built on
a platform that efficiently parallelizes a broad
class of statistical, data mining and machine
learning algorithms
 These Parallel External Memory Algorithms
(PEMA’s) process data a chunk at a time in
parallel across cores and nodes
Revolution R Enterprise 24
Scalability and portability of Revo’s
implementation of PEMA’s
 These PEMA algorithms can process an unlimited
number of rows of data in a fixed amount of RAM.
They process a chunk of data at a time, giving
linear scalability
 They are independent of the “compute context”
(number of cores, computers, distributed
computing platform), giving portability across these
dimensions
 They are independent of where the data is coming
from, giving portability with respect to data sources
Revolution R Enterprise 25
Simplified ScaleR Internal Architecture
Revolution R Enterprise 26
Analytics Engine
PEMA’s are implemented here
(Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication
MPI, RPC, Sockets, Files
Data Sources
HDFS, Teradata, ODBC, SAS, SPSS, CSV
, Fixed, XDF
ScaleR on Hadoop – Base Implementation
 Each iteration (pass through the data) is a separate
MapReduce job
 The mapper produces “intermediate result objects”
for each task that are combined using a combiner
and then a reducer
 A master process decides if another pass through
the data is required
 Allow data to be stored temporarily (or longer) in
XDF binary format for increased speed especially
on iterative algorithms
Revolution R Enterprise 27
Performance enhancements
 A single Map job that handles all iterations,
to reduce overhead of starting multiple jobs
 Tree structured communication and
reduction to reduce time for those
operations, especially on large clusters
 Alternative communication schemes (e.g.
sockets) instead of files
Revolution R Enterprise 28
Thank You!
Mario Inchiosa
mario.inchiosa@revolutionanalytics.com
Revolution R Enterprise 29
Backup Slides
Revolution R Enterprise 30
Functionality in common across HPA
algorithms
 Ability to do on-the-fly data transformations
in the R language
 In-formula: ArrDelay>15 ~ DayOfWeek + …
 As transform: Late = ArrDelay>15
 As transform function: Specify an R function to
process an entire chunk of data at a time
 Row selection
 Both standard weights and frequency
weights
Revolution R Enterprise 31
Portability of user code
 The same analytics code can be used on a
laptop, server, MPI cluster, or Hadoop
 The same analytics code can be used on a
small data.frame in memory and on a huge
distributed data file
Revolution R Enterprise 32
Sample code for logit on laptop
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 33
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 34

More Related Content

What's hot

R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarRevolution Analytics
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsRevolution Analytics
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...MapR Technologies
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes StrategicMapR Technologies
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 

What's hot (20)

R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User Webinar
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence Applications
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
 
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes Strategic
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 

Viewers also liked

Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Sparkcarl_pulley
 
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수Aiden Seonghak Hong
 
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정Aiden Seonghak Hong
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoopDavid Chiu
 
Opening sequence conventions
Opening sequence conventionsOpening sequence conventions
Opening sequence conventionsCsengeNemeti
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computingBAINIDA
 
Predictive Maintenance with R
Predictive Maintenance with RPredictive Maintenance with R
Predictive Maintenance with Reoda GmbH
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperDataWorks Summit
 
Introduction to text mining
Introduction to text miningIntroduction to text mining
Introduction to text miningLars Juhl Jensen
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Emotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningEmotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningSakthi Dasans
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
Series de Tiempo en R parte I (Series estacionarias)
Series de Tiempo en R parte I (Series estacionarias)Series de Tiempo en R parte I (Series estacionarias)
Series de Tiempo en R parte I (Series estacionarias)Juan Carlos Campuzano
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text miningLokesh Ramaswamy
 

Viewers also liked (20)

Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
 
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
 
Textmining
TextminingTextmining
Textmining
 
Opening sequence conventions
Opening sequence conventionsOpening sequence conventions
Opening sequence conventions
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
Predictive Maintenance with R
Predictive Maintenance with RPredictive Maintenance with R
Predictive Maintenance with R
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Introduction to text mining
Introduction to text miningIntroduction to text mining
Introduction to text mining
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Emotion detection from text using data mining and text mining
Emotion detection from text using data mining and text miningEmotion detection from text using data mining and text mining
Emotion detection from text using data mining and text mining
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Series de Tiempo en R parte I (Series estacionarias)
Series de Tiempo en R parte I (Series estacionarias)Series de Tiempo en R parte I (Series estacionarias)
Series de Tiempo en R parte I (Series estacionarias)
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
 
Mining Scipy Lectures
Mining Scipy LecturesMining Scipy Lectures
Mining Scipy Lectures
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 

Similar to High Performance Predictive Analytics in R and Hadoop

Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2Revolution Analytics
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analyticstempledf
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsenrusersla
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data avanttic Consultoría Tecnológica
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & RŁukasz Grala
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 

Similar to High Performance Predictive Analytics in R and Hadoop (20)

Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsen
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
Meetup Oracle Database MAD: 2.1 Data Management Trends: SQL, NoSQL y Big Data
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
 
Meetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management TrendsMeetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management Trends
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

High Performance Predictive Analytics in R and Hadoop

  • 1. High Performance Predictive Analytics in R and Hadoop Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist Hadoop Summit 2013 June 27, 2013 1
  • 2. Innovate with R 2  Most widely used data analysis software  Used by 2M+ data scientists, statisticians and analysts  Most powerful statistical programming language  Flexible, extensible and comprehensive for productivity  Create beautiful and unique data visualizations  As seen in New York Times, Twitter and Flowing Data  Thriving open-source community  Leading edge of analytics research  Fills the talent gap  New graduates prefer R Download the White Paper R is Hot bit.ly/r-is-hot
  • 3. R is open source and drives analytic innovation but has some limitations Bigger data sizes Speed of analysis Production support Memory Bound Big Data Single Threaded Scale out, parallel processing, high speed Community Support Commercial production support Innovation and scale Innovative – 5000+ packages, exponential growth Combines with open source R packages where needed
  • 4. 4 Revolution R Enterprise High Performance, Multi-Platform Analytics Platform Revolution R Enterprise DeployR Web Services Software Development Kit DevelopR Integrated Development Environment ConnectR High Speed & Direct Connectors Teradata, Hadoop (HDFS, HBase), SAS, SPSS, CSV, OBDC ScaleR High Performance Big Data Analytics Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers DistributedR Distributed Computing Framework Platform LSF, MS HPC Server, MS Azure Burst RevoR Performance Enhanced Open Source R + CRAN packages IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers Open Source R Plus Revolution Analytics performance enhancements Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R
  • 5. Our objectives with respect to Hadoop  Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today  Scalable and High Performance  Provide the first enterprise- ready, commercially supported, full- featured, out-of-the-box Predictive Analytics suite running in Hadoop 5
  • 6. Overview of Revolution and Hadoop  Revolution currently provides capabilities for using R and Hadoop together  RHadoop packages: rmr2, rhdfs, rhbase  RevoScaleR package: Rapidly read and process data from HDFS  We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI- based clusters. 6
  • 7. RHadoop: Map-Reduce with R  Unlock data in Hadoop using only the R language  No need to learn Java, Pig, Python or any other language 7
  • 8. RevoScaleR Overview 8  An R package that adds capabilities to R:  Data Import/Clean/Explore/Transform  Analytics – Descriptive and Predictive  Parallel and distributed computing  Visualization  Scales from small local data to huge distributed data  Scales from laptop to server to cluster to cloud  Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
  • 9. Key ways RevoScaleR enhances R  High Performance Computing (HPC) functions: Parallel/Distributed computing  High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing  XDF file format; rapidly store and extract data  Use results from HPA and HPC functions in other R packages Revolution R Enterprise 9
  • 10. High Performance Big Data Analytics with ScaleR 10 Statistical Tests Machine Learning Simulation Descriptive Statistics Data Visualization R Data Step Predictive Models Sampling
  • 11. ScaleR: High Performance Scalable Parallel External Memory Algorithms Company Confidential – Do not distribute 11  Data import – Delimited, Fixed, SAS, SPSS, OBDC  Variable creation & transformation  Recode variables  Factor variables  Missing value handling  Sort  Merge  Split  Aggregate by category (means, sums)  Min / Max  Mean  Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test Data Prep, Distillation & Descriptive Analytics  Subsample (observations & variables)  Random Sampling R Data Step Statistical Tests Sampling Descriptive Statistics
  • 12. ScaleR: High Performance Scalable Parallel External Memory Algorithms Company Confidential – Do not distribute 12  Sum of Squares (cross product matrix for set variables)  Multiple Linear Regression  Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.  Covariance & Correlation Matrices  Logistic Regression  Classification & Regression Trees  Predictions/scoring for models  Residuals for all models  Histogram  Line Plot  Scatter Plot  Lorenz Curve  ROC Curves (actual data and predicted values)  K-Means Statistical Modeling  Decision Trees Predictive Models Cluster AnalysisData Visualization Classification Machine Learning Simulation  Monte Carlo
  • 13. HPC Capabilities in RevoScaleR  Execute (essentially) any R function in parallel on nodes and cores  Results from all runs are returned in a list to the user’s laptop  Extensive control over parameters  Extensive control over nodes, cores, and times to run  Ideal for simulations and for running R functions on small amounts of data Revolution R Enterprise 13
  • 14. Key ScaleR HPA features  Handles an arbitrarily large number of rows in a fixed amount of memory  Scales linearly with the number of rows  Scales linearly with the number of nodes  Scales well with the number of cores per node  Scales well with the number of parameters  Extremely high performance Revolution R Enterprise 14
  • 15.
  • 16.
  • 17. Regression comparison using in-memory data: lm() vs rxLinMod() Revolution R Enterprise 17 lm() in R lm() in Revolution R rxLinMod() in Revolution R
  • 18. GLM comparison using in-memory data: glm() vs rxGlm() Revolution R Enterprise 18
  • 19. SAS HPA Benchmarking comparison* Logistic Regression Rows of data 1 billion 1 billion Parameters “just a few” 7 Time 80 seconds 44 seconds Data location In memory On disk Nodes 32 5 Cores 384 20 RAM 1,536 GB 80 GB 19 Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. *As published by SAS in HPC Wire, April 21, 2011 Double 45% 1/6th 5% 5% Revolution R Enterprise Delivers Performance at 2% of the Cost
  • 20. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models Approach Platform Time to fit SAS 16-core Sun Server 5 hours rmr/MapReduce 10-node 80-core Hadoop Cluster > 10 hours R 250 GB Server Impossible (> 3 days) Revolution R Enterprise 5-node 20-core LSF cluster 5.7 minutes Revolution R Enterprise 20 Generalized linear model, 150 million observations, 70 degrees of freedom http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
  • 21. What makes ScaleR so fast?  Lot’s of things!  This kind of performance requires  Careful architecting that supports performance  Constant, intense focus on all of the details  A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)  Extensive profiling and continuous benchmarking to detect problems and improve code Revolution R Enterprise 21
  • 22. Specific speed-related factors -- 1  Efficient computational algorithms  Efficient memory management – minimize data copying and data conversion  Heavy use of C++ templates; optimal code  Efficient data file format; fast access by row and column Revolution R Enterprise 22
  • 23. Specific speed-related factors -- 2  Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)  Handle categorical variables efficiently Revolution R Enterprise 23
  • 24. Parallel External Memory Algorithms (PEMA’s)  The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms  These Parallel External Memory Algorithms (PEMA’s) process data a chunk at a time in parallel across cores and nodes Revolution R Enterprise 24
  • 25. Scalability and portability of Revo’s implementation of PEMA’s  These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability  They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions  They are independent of where the data is coming from, giving portability with respect to data sources Revolution R Enterprise 25
  • 26. Simplified ScaleR Internal Architecture Revolution R Enterprise 26 Analytics Engine PEMA’s are implemented here (Scalable, Parallelized, Threaded, Distributable) Inter-process Communication MPI, RPC, Sockets, Files Data Sources HDFS, Teradata, ODBC, SAS, SPSS, CSV , Fixed, XDF
  • 27. ScaleR on Hadoop – Base Implementation  Each iteration (pass through the data) is a separate MapReduce job  The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer  A master process decides if another pass through the data is required  Allow data to be stored temporarily (or longer) in XDF binary format for increased speed especially on iterative algorithms Revolution R Enterprise 27
  • 28. Performance enhancements  A single Map job that handles all iterations, to reduce overhead of starting multiple jobs  Tree structured communication and reduction to reduce time for those operations, especially on large clusters  Alternative communication schemes (e.g. sockets) instead of files Revolution R Enterprise 28
  • 30. Backup Slides Revolution R Enterprise 30
  • 31. Functionality in common across HPA algorithms  Ability to do on-the-fly data transformations in the R language  In-formula: ArrDelay>15 ~ DayOfWeek + …  As transform: Late = ArrDelay>15  As transform function: Specify an R function to process an entire chunk of data at a time  Row selection  Both standard weights and frequency weights Revolution R Enterprise 31
  • 32. Portability of user code  The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop  The same analytics code can be used on a small data.frame in memory and on a huge distributed data file Revolution R Enterprise 32
  • 33. Sample code for logit on laptop # Specify local data source airData <- myLocalDataSource # Specify model formula and parameters rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 33
  • 34. Sample code for logit on Hadoop # Change the “compute context” rxSetComputeContext(myHadoopCluster) # Change the data source if necessary airData <- myHadoopDataSource # Otherwise, the code is the same rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 34

Editor's Notes

  1. More information:KDNuggets Software Poll: http://blog.revolutionanalytics.com/2013/06/kdnuggets-2013-poll-results.htmlCompanies using R: http://blog.revolutionanalytics.com/2013/05/companies-using-open-source-r-in-2013.htmlR is Hot white paper: bit.ly/r-is-hot
  2. SAS HPA benchmarks published in:http://www.hpcwire.com/hpcwire/2011-04-19/sas_brings_high_performance_analytics_to_database_appliances.htmlMore information: “SAS Benchmarking FAQ” in Sales Tools on GDrive.