R ON HADOOP
Kostiantyn Kudriavtsev
Lviv Hadoop User Group, June 19, 2014
Agenda
• What is R?
• Linear Regression
• R on Hadoop
• Summary
What is R?
Object-oriented and functional language for statistics, maths and
data science, created by statisticians, with comprehensive
data visualisation and statistical modelling capabilities;
5000+ (and growing) freely available specialised algorithms for
finance, economics, genomics, linguistics and so on;
2M+ users with specialised domain skills;
… but it has some drawbacks:
- limited by RAM
- single-threaded
R development environment
RStudio is the de facto standard IDE for R development and is
available in local or server mode.
It can be used not only for coding, but also for visualisation.
It is suitable for developing R solutions on top of Hadoop.
What is Hadoop?
Apache Hadoop is a software framework that supports data-intensive
distributed applications based on the MapReduce (MR) model.
Main idea: move computation to the data.
MR idea:
- Map step: Map(k1,v1) → list(k2,v2)
- Magic here (shuffle: sort by k2, data transfer between
nodes, etc.)
- Reduce step: Reduce(k2, list(v2)) → (k3,v3)
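The Map, "magic" and Reduce steps can be illustrated without a cluster. The sketch below is a toy, single-process word count in plain R; it is not how Hadoop itself runs jobs, and the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this illustration.

```r
# Toy, single-process illustration of the MapReduce data flow (no Hadoop involved).

# Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line.
map_fn <- function(k1, v1) {
  words <- strsplit(tolower(v1), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1L))
}

# Reduce(k2, list(v2)) -> (k3, v3): sum the counts collected for one word.
reduce_fn <- function(k2, values) {
  list(key = k2, value = sum(values))
}

run_mapreduce <- function(input, map_fn, reduce_fn) {
  # Map step: apply map_fn to every (k1, v1) record and flatten the results.
  pairs <- unlist(Map(map_fn, names(input), input),
                  recursive = FALSE, use.names = FALSE)
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- vapply(pairs, function(p) p$value, integer(1))
  # "Magic" step: group values by k2; split() also orders the groups by key.
  grouped <- split(values, keys)
  # Reduce step: one call per distinct key.
  Map(reduce_fn, names(grouped), grouped)
}

input <- c(doc1 = "R on Hadoop", doc2 = "Hadoop streaming and R")
result <- run_mapreduce(input, map_fn, reduce_fn)
counts <- vapply(result, function(r) r$value, integer(1))
print(counts)
```

On a real cluster the grouping step is what moves data between nodes; here `split()` stands in for it.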
Linear regression
A web store might use linear
regression to predict sales of
goods or to discover trends.
sale(Product) ~ visitors(Product)
The linear model used here:
sale = α * visitors + β
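For reference, the ordinary least-squares estimates of α and β can be written in closed form. The notation below is introduced here and is not from the slides: x_i is the number of visitors and y_i the sales for observation i, with n observations in total.

```latex
\hat{\alpha} = \frac{n \sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}
                    {n \sum_{i} x_i^2 - \left( \sum_{i} x_i \right)^2},
\qquad
\hat{\beta} = \frac{1}{n} \left( \sum_{i} y_i - \hat{\alpha} \sum_{i} x_i \right)
```

Both estimates depend only on the sums n, Σx, Σy, Σx² and Σxy, so they can be accumulated in a single pass over the data.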
Linear regression in R
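library(ggplot2)  # provides qplot() and geom_smooth()
# Load per-page visit and purchase counts
df <- read.csv("Phone.csv", header=TRUE)
# Scatter plot of purchases against visits, coloured by product page
qq <- qplot(visited, purchased, colour=product_page, data=df)
# Overlay a fitted linear-regression line for each product page
qq + geom_smooth(method='lm', formula=y~x)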
Linear regression in R
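# Keep only the rows for one product page
df.p2 <- df[df$product_page == 'phone_2', ]
# Fit purchased = α * visited + β by ordinary least squares
m <- lm(purchased ~ visited, data=df.p2)
# Coefficients, standard errors and R-squared
summary(m)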
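Since Phone.csv is not included with the slides, the following is a minimal sketch of the same fit on synthetic data. The column names follow the slides; the values, the slopes 0.05 and 0.02, and the noise level are assumptions made up for illustration.

```r
# Synthetic stand-in for Phone.csv: column names follow the slides, values are made up.
set.seed(42)
visited <- rep(seq(100, 590, by = 10), times = 2)
product_page <- rep(c("phone_1", "phone_2"), each = 50)
true_slope <- ifelse(product_page == "phone_2", 0.05, 0.02)
df <- data.frame(
  product_page = product_page,
  visited = visited,
  purchased = true_slope * visited + 2 + rnorm(100, sd = 1),
  stringsAsFactors = FALSE
)

# Same steps as on the slide: filter one page, then fit the model.
df.p2 <- df[df$product_page == "phone_2", ]
m <- lm(purchased ~ visited, data = df.p2)

alpha <- coef(m)[["visited"]]      # slope
beta  <- coef(m)[["(Intercept)"]]  # intercept

# Cross-check: the OLS slope equals cov(x, y) / var(x).
alpha_check <- cov(df.p2$visited, df.p2$purchased) / var(df.p2$visited)
stopifnot(isTRUE(all.equal(alpha, alpha_check)))

# Predicted purchases for a page with 1000 visits.
forecast <- predict(m, newdata = data.frame(visited = 1000))
print(c(alpha = alpha, beta = beta, forecast = unname(forecast)))
```

The fitted slope should land close to the 0.05 used to generate the phone_2 rows, which is a quick sanity check before running the same model on real data.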
R on Hadoop
Several options:
• Hadoop streaming
• RHadoop
• RHIPE
• SparkR
• Oracle R Advanced Analytics for Hadoop
• etc.
R Hadoop streaming
Hadoop was designed mainly for Java and
provides a comprehensive Java API.
Other languages can be used through the “Streaming
API”. The Streaming API uses the operating system's
standard input (for reading) and standard output (for
writing). It provides a lightweight API for MapReduce
compared to the Java API.
Streaming requires writing two separate scripts (one for
the mapper and one for the reducer) in any language
(Python, Ruby, R, C#, Go, OCaml, Lisp, etc.)
R Hadoop streaming
Streaming API drawbacks:
● while the inputs to the reducer are grouped by key, they are still iterated
over line by line, and the boundaries between keys must be detected by the
user
● no way to use different mappers in one MapReduce job
● no way to produce different outputs from the reducer
● counters are updated through stderr
Additional disadvantage of implementing streaming in R:
● output of R functions must be strictly controlled, because they are “noisy”,
while only meaningful data may be written to standard output
R Hadoop streaming: Mapper
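A minimal sketch of what a streaming mapper in R could look like. It assumes CSV input lines of the form `product_page,visited,purchased` (the column names used earlier in the deck); the function names and the `--run` flag are hypothetical conventions for this sketch.

```r
#!/usr/bin/env Rscript
# Hypothetical streaming mapper. Assumed input: CSV lines
# "product_page,visited,purchased"; output: "product_page<TAB>visited,purchased".

is_number <- function(x) !is.na(suppressWarnings(as.numeric(x)))

# Turn raw CSV lines into key<TAB>value lines, dropping the header
# and any malformed rows so that only meaningful data is emitted.
map_lines <- function(lines) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  keep <- vapply(fields, function(f) {
    length(f) == 3 && is_number(f[2]) && is_number(f[3])
  }, logical(1))
  vapply(fields[keep], function(f) {
    paste0(f[1], "\t", f[2], ",", f[3])
  }, character(1))
}

# Read standard input in chunks and write mapped lines to standard output.
run_mapper <- function(input = file("stdin")) {
  open(input, "r")
  on.exit(close(input))
  while (length(chunk <- readLines(input, n = 1000L)) > 0) {
    writeLines(map_lines(chunk), stdout())
  }
}

# Assumed convention: the job invokes "Rscript mapper.R --run".
if ("--run" %in% commandArgs(trailingOnly = TRUE)) run_mapper()
```

Everything written to standard output becomes mapper output, so diagnostics should go to stderr (for example with `message()`), never to `print()` or `cat()` on stdout.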
R Hadoop streaming: Reducer
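A minimal sketch of what a streaming reducer in R could look like. It assumes sorted input lines of the form `product_page<TAB>visited,purchased` and fits one regression line per key from running sums; the function names, the output layout and the `--run` flag are hypothetical. Note how the key boundary has to be detected by hand.

```r
#!/usr/bin/env Rscript
# Hypothetical streaming reducer. Assumed input (sorted by key):
# "product_page<TAB>visited,purchased"; output: "product_page<TAB>slope<TAB>intercept".

new_stats <- function() c(n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0)

# Update the running sums with one (x, y) observation.
add_point <- function(s, x, y) s + c(1, x, y, x * x, x * y)

# Ordinary least squares from the running sums.
fit_line <- function(s) {
  denom <- s[["n"]] * s[["sxx"]] - s[["sx"]]^2
  if (denom == 0) return(c(slope = NA_real_, intercept = NA_real_))
  slope <- (s[["n"]] * s[["sxy"]] - s[["sx"]] * s[["sy"]]) / denom
  c(slope = slope, intercept = (s[["sy"]] - slope * s[["sx"]]) / s[["n"]])
}

make_reducer <- function(emit) {
  current <- NULL
  stats <- new_stats()
  flush <- function() {
    if (!is.null(current)) {
      fit <- fit_line(stats)
      emit(sprintf("%s\t%.6f\t%.6f", current, fit[["slope"]], fit[["intercept"]]))
    }
  }
  feed <- function(line) {
    if (!nzchar(line)) return(invisible(NULL))
    parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
    xy <- as.numeric(strsplit(parts[2], ",", fixed = TRUE)[[1]])
    if (!identical(parts[1], current)) {
      # Key boundary: emit the finished group and start a new one.
      flush()
      current <<- parts[1]
      stats <<- new_stats()
    }
    stats <<- add_point(stats, xy[1], xy[2])
    invisible(NULL)
  }
  list(feed = feed, finish = flush)
}

run_reducer <- function(input = file("stdin")) {
  open(input, "r")
  on.exit(close(input))
  reducer <- make_reducer(function(out) writeLines(out, stdout()))
  while (length(line <- readLines(input, n = 1L)) > 0) reducer$feed(line)
  reducer$finish()  # emit the last group
}

# Assumed convention: the job invokes "Rscript reducer.R --run".
if ("--run" %in% commandArgs(trailingOnly = TRUE)) run_reducer()
```

The script can be tried without a cluster by piping sorted key/value lines into it, for example `sort mapped.txt | Rscript reducer.R --run`, where `mapped.txt` is a hypothetical file of mapper output.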
RHadoop
RHadoop is a set of libraries (written in R) that
aims to make it easier to use R with Hadoop
Streaming to develop MR jobs. It therefore inherits
the general drawbacks of Hadoop Streaming.
RHadoop
RHadoop is still R through Hadoop Streaming
Advantages compared to Streaming:
● no need to track key changes in the reducer
● no need to control function output manually
● a simple R API wraps the Streaming API
● R code can run locally or on Hadoop without
changes
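A minimal sketch of how a per-product regression might look with RHadoop, assuming the `rmr2` package from the RHadoop project is installed. The data is synthetic, the helper `fit_one` is hypothetical, and the job uses rmr2's local backend so no cluster is needed; the MapReduce part is skipped when rmr2 is not available.

```r
# Synthetic data with the column names used earlier in the deck (values are made up).
set.seed(1)
df <- data.frame(
  product_page = rep(c("phone_1", "phone_2"), each = 50),
  visited = rep(seq(100, 590, by = 10), times = 2),
  stringsAsFactors = FALSE
)
slope <- ifelse(df$product_page == "phone_2", 0.05, 0.02)
df$purchased <- slope * df$visited + 2 + rnorm(nrow(df), sd = 1)

# Hypothetical helper: fit one regression for the rows of a single product page.
fit_one <- function(product, rows) {
  m <- lm(purchased ~ visited, data = rows)
  data.frame(product_page = product,
             intercept = coef(m)[["(Intercept)"]],
             slope = coef(m)[["visited"]],
             stringsAsFactors = FALSE)
}

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")  # run in-process; switch to "hadoop" on a cluster
  job <- mapreduce(
    input = to.dfs(df),
    # Map: key each row by its product page.
    map = function(k, v) keyval(v$product_page, v),
    # Reduce: called with all rows for one key, no boundary detection needed.
    reduce = function(k, vv) keyval(k, fit_one(k, vv))
  )
  print(values(from.dfs(job)))
} else {
  message("rmr2 is not installed; skipping the MapReduce part of the sketch")
}
```

The reduce function receives the rows for one key already grouped, which is the main convenience over hand-written streaming scripts.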
Demo time
R on Hadoop in Real Life
Several steps are required to achieve the goal:
1. Data ingestion
2. Data preparation
3. R processing
4. Postprocessing
http://static.vroomgirls.com/website/wp-content/uploads/2011/09/Route66Road%C2%A9-Dmitry-Rogozhin.jpg
Lessons Learned
R is slow… for millions of calculations
it's slow even with Hadoop!
How to improve the speed?
Rewrite the flow: do as much preprocessing as possible
before the R step.
Hadoop streaming supports mappers and reducers in
different languages.
Think twice. R is great for exploratory analysis and
research, but in production it might incur a performance
penalty.
Q&A
• Thank you for your attention
