© Hortonworks Inc. 2012
Enabling R on Hadoop
July 11, 2013
Your Presenters
Ravi Mutyala
Systems Architect
Paul Codding
Solutions Engineer
Agenda
• A Brief History of R
• How R is Typically Used
• How R is Used with Hadoop
• Getting Started
A Brief History of R
History of R
• S
– 1976: S created by John Chambers, implemented in Fortran
– 1988: S V3, written in C, with statistical models included
– 1998: S V4
• R
– 1991: R created by Ross Ihaka & Robert Gentleman
– 1997: R Core Group formed
– 2000: R Version 1.0 released
How R is Typically Used
Main Uses of R
• Statistical Analysis & Modeling
– Classification
– Scoring
– Ranking
– Clustering
– Finding relationships
– Characterization
• Common Uses
– Interactive Data Analysis
– General Purpose Statistics
– Predictive Modeling
How R is Used with Hadoop
Hadoop Components
Hortonworks Data Platform (HDP)
• Hadoop Core: Distributed Storage & Processing
– HDFS
– YARN (in 2.0)
– WebHDFS
– Map Reduce
• Data Services: Store, Process and Access Data
– HCatalog
– Hive
– Pig
– HBase
– Sqoop
– Flume
• Operational Services: Manage & Operate at Scale
– Oozie
– Ambari
• Platform Services: Enterprise Readiness: HA, DR, Snapshots, Security, …
• OS, Cloud, VM, Appliance
Hadoop Components & R
[Diagram: HDP component stack, as on the previous slide, with the components used from R highlighted]
• Data Service Components
– Hive
– HBase
• Hadoop Core
– Map Reduce
– HDFS
Options for R on Hadoop
• Options
– RODBC/RJDBC
– RHive
– RHadoop
• Analysis
– Focus
– Integration Ease
– Benefits
– Limitations
RODBC/RJDBC
• Focus
– SQL Access from R
• Integration Ease
– Install Hortonworks Hive ODBC Driver
– Install Hive libraries
• Benefits
– Low impact on existing R scripts that already use other DB packages
– No need to install Hadoop configuration/binaries on client machines
• Limitations
– Parallelism limited to what Hive provides
– Result set size (results are pulled back into the R client)
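Deployment Considerations
[Diagram: cluster layout for RODBC/RJDBC. Worker nodes each run a TaskTracker (TT) and DataNode (DN); master services are the JobTracker (JT), NameNode (NN), and HiveServer (HS).]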
RHive
• Focus
– Broad access to Hive and HDFS
• Integration Ease
– Requires Hadoop binaries, libraries, and configuration files on client machines
– Uses Java DFS Client and HiveServer
• Benefits
– Wide range of features expressed through HQL
– rhive-apply: a distributed R apply function run through HQL
• Limitations
– Requires heavy client deployment
– Depends on HiveServer; cannot be used with HiveServer2
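Deployment Considerations
[Diagram: cluster layout for RHive. Worker nodes each run a TaskTracker (TT) and DataNode (DN); master services are the JobTracker (JT), NameNode (NN), and HiveServer (HS); R runs on a separate R edge node.]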
RHadoop
• Focus
– Tight integration with core Hadoop components
• Benefits
– Ability to run R on a massively distributed system
– Ability to work with full data sets instead of sample sets
• Additional Information
– https://github.com/RevolutionAnalytics/RHadoop/wiki
RHadoop Architecture
• rhdfs: R → HDFS
• rhbase: R → HBase Thrift Gateway → HBase
• rmr2: R → Map Reduce (Streaming) → R processes on the cluster nodes
rhdfs
• Access HDFS from R
• Read from HDFS into an R data frame
• Write from an R data frame to HDFS
• Version 1.0.6 adds support for Windows (using HDP)
rhdfs
• Hadoop CLI commands and their rhdfs equivalents
• hadoop fs -ls /
– hdfs.ls("/")
• hadoop fs -mkdir /user/rhdfs/ppt
– hdfs.mkdir("/user/rhdfs/ppt")
• hadoop fs -put 1.txt /user/rhdfs/ppt/
– localData <- system.file(file.path("unitTestData", "1.txt"), package="rhdfs")
– hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
• hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
– hdfs.get("/user/rhdfs/ppt/1.txt", "test")
• hadoop fs -rm /user/rhdfs/ppt/1.txt
– hdfs.delete("/user/rhdfs/ppt/1.txt")
rhbase
• Access and change data within HBase
• Uses Thrift API
• Command Examples
– hb.new.table
– hb.insert
– hb.scan.ex
– hb.scan
rmr2
• Enables writing MapReduce jobs using R
• Ability to parallelize algorithms
• Ability to use big data sets without needing to sample data
• mapreduce(input, output, map, reduce, …)
• The reduce function takes a key and a collection of values, which can be a vector, list, data frame, or matrix
• Version 2.2.1 adds support for Windows (using HDP)
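The map and reduce contract described on this slide can be sketched locally in base R, without Hadoop. This is only an illustration of the idea, not how rmr2 is implemented: local_mapreduce and the sample input are hypothetical names introduced here, and the reduce function is assumed to receive one key together with the vector of values grouped under it.

```r
# Hypothetical local stand-in for the map -> group-by-key -> reduce flow.
# It is NOT part of rmr2; it only mimics the contract in plain base R.
local_mapreduce <- function(input, map, reduce) {
  # Map: each input element yields a list with parallel `key` and `val` vectors.
  mapped <- lapply(input, map)
  keys <- unlist(lapply(mapped, function(kv) kv$key))
  vals <- unlist(lapply(mapped, function(kv) kv$val))
  # Shuffle: collect the values that share a key.
  grouped <- split(vals, keys)
  # Reduce: called once per key with all of that key's values.
  reduced <- mapply(reduce, names(grouped), grouped, SIMPLIFY = FALSE)
  unlist(reduced)
}

# Assumed sample input: two short lines of text.
sample_lines <- c("r on hadoop", "hadoop and r")

word_counts <- local_mapreduce(
  input = sample_lines,
  map = function(line) {
    words <- unlist(strsplit(line, " "))
    list(key = words, val = rep(1, length(words)))
  },
  reduce = function(word, counts) sum(counts)
)

print(word_counts)
```

With rmr2 the same two functions would be passed to mapreduce(), and the grouping step would be carried out by Hadoop across the cluster instead of by split().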
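Sample code - wordcount
library(rmr2)

# Enclosing function: the slide shows only its body and closing brace,
# so this header (input, output, pattern) is an assumption.
wordcount = function(input, output = NULL, pattern = " ") {
  # Map: split each line into words and emit a (word, 1) pair per word
  wc.map =
    function(., lines) {
      keyval(
        unlist(
          strsplit(
            x = lines,
            split = pattern)),
        1)}
  # Reduce: sum the counts collected for each word
  wc.reduce =
    function(word, counts) {
      keyval(word, sum(counts))}

  mapreduce(
    input = input,
    output = output,
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)}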
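More Sample Code
library(rmr2)

# 50 draws from a binomial distribution (size 32, probability 0.4)
groups = rbinom(32, n = 50, prob = 0.4)
# Plain R: count how often each value occurs
tapply(groups, groups, length)

# The same count as a MapReduce job: put the data into the DFS,
# emit a (value, 1) pair per element, then count the pairs per key
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))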
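Deployment Considerations
[Diagram: cluster layout for RHadoop. Worker nodes each run a TaskTracker (TT), DataNode (DN), HBase RegionServer (RS), and R; master services are the JobTracker (JT) and NameNode (NN); an R edge node and an HBase Thrift Gateway (HTG) sit alongside the cluster.]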
RHadoop
• Limitations
– Requires installation of R on all TaskTracker nodes
– Does not automatically parallelize algorithms
– A different slot/memory configuration is recommended, to leave memory and CPU resources for R
[Diagram: a node running only Map Reduce on the OS, next to a node where Map Reduce shares the OS with R]
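The memory trade-off behind the slot/memory recommendation can be made concrete with some back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions: the node size, reserved memory, per-task JVM heap, and per-task R footprint are hypothetical numbers, and slots_per_node is a helper introduced here, not a Hadoop or RHadoop function.

```r
# Hypothetical sizing helper: how many task slots fit on one worker node
# when every task needs a JVM heap and, optionally, an R process beside it.
slots_per_node <- function(node_mem_gb, reserved_gb, jvm_heap_gb, r_mem_gb = 0) {
  usable_gb <- node_mem_gb - reserved_gb   # memory left after OS and daemons
  per_task_gb <- jvm_heap_gb + r_mem_gb    # each slot pays for the JVM plus R
  floor(usable_gb / per_task_gb)
}

# Assumed example values, not recommendations.
without_r <- slots_per_node(node_mem_gb = 48, reserved_gb = 8, jvm_heap_gb = 2)
with_r <- slots_per_node(node_mem_gb = 48, reserved_gb = 8, jvm_heap_gb = 2,
                         r_mem_gb = 2)

cat("slots without R:", without_r, "\n")
cat("slots with R:", with_r, "\n")
```

With these assumed numbers, leaving room for one R process per task halves the number of slots a node can carry, which is why the slot counts on TaskTracker nodes are worth revisiting once R is installed.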
Getting Started
Your Fastest On-ramp to Enterprise Hadoop™!
http://hortonworks.com/products/hortonworks-sandbox/
The Sandbox lets you experience Apache Hadoop from the convenience of your own
laptop – no data center, no cloud and no internet connection needed!
The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
Installation
• Install R on all nodes
• Install dependent packages
– RJSONIO
– itertools
– digest
– Rcpp
– rJava
– functional
– RCurl
– httr
– plyr
• Download & Install RHadoop Packages
– rmr2
– rhdfs
– rhbase (requires Thrift)
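The dependency list above can be checked from an R session before the RHadoop packages are installed. A minimal sketch, assuming the dependencies come from a CRAN mirror (some of them may no longer be available there); the install.packages call is left commented out because it needs network access, and the RHadoop packages themselves are downloaded and installed separately.

```r
# Dependencies named on the slide.
deps <- c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
          "functional", "RCurl", "httr", "plyr")

# Which of them are not yet installed in this R library?
missing_deps <- setdiff(deps, rownames(installed.packages()))

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
  # install.packages(missing_deps)  # uncomment to install from a CRAN mirror
} else {
  message("All dependencies are already installed.")
}
```

The same check has to pass on every node that runs R tasks, not only on the machine where the job is submitted.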
Questions & Answers
TRY: Download HDP at hortonworks.com
LEARN: Applying Data Science using Apache Hadoop Training
FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks
Further questions & comments: paul@hortonworks.com, ravi@hortonworks.com

Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 

Recently uploaded (20)

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 

Enabling R on Hadoop

  • 1. © Hortonworks Inc. 2012 Enabling R on Hadoop July 11, 2013 Page 1
  • 2. Your Presenters
      • Ravi Mutyala, Systems Architect
      • Paul Codding, Solutions Engineer
  • 3. Agenda
      • A Brief History of R
      • How R is Typically Used
      • How R is Used with Hadoop
      • Getting Started
  • 4. A Brief History of R
  • 5. History of R
      • S
          – 1976: S (Fortran), John Chambers
          – 1988: S V3, written in C; statistical models included
          – 1998: S V4
      • R
          – 1991: R created by Ross Ihaka & Robert Gentleman
          – 1997: R Core Group formed
          – 2000: R version 1.0 released
  • 6. How R is Typically Used
  • 7. Main Uses of R
      • Statistical Analysis & Modeling
          – Classification
          – Scoring
          – Ranking
          – Clustering
          – Finding relationships
          – Characterization
      • Common Uses
          – Interactive Data Analysis
          – General Purpose Statistics
          – Predictive Modeling
  • 8. How R is Used with Hadoop
  • 9. Hadoop Components [diagram: Hortonworks Data Platform (HDP) stack]
      • Layers: Hadoop Core (distributed storage & processing), Data Services (store, process and access data), Operational Services (manage & operate at scale), Platform Services (enterprise readiness: HA, DR, snapshots, security, …)
      • Components shown: HDFS, YARN (in 2.0), WebHDFS, MapReduce, HCatalog, Hive, Pig, HBase, Sqoop, Flume, Oozie, Ambari
      • Deployment targets: OS, cloud, VM, appliance
  • 10. Hadoop Components & R [diagram: same HDP stack as slide 9]
      • Data Service Components
          – Hive
          – HBase
      • Hadoop Core
          – MapReduce
          – HDFS
  • 11. Options for R on Hadoop
      • Options
          – RODBC/RJDBC
          – RHive
          – RHadoop
      • Analysis
          – Focus
          – Integration Ease
          – Benefits
          – Limitations
  • 12. RODBC/RJDBC
      • Focus
          – SQL access from R
      • Integration Ease
          – Install the Hortonworks Hive ODBC Driver
          – Install the Hive libraries
      • Benefits
          – Low impact on existing R scripts that already use other DB packages
          – No need to install Hadoop configuration or binaries on client machines
      • Limitations
          – Parallelism is limited to what Hive provides
          – Result set size
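The "SQL access from R" path can be sketched as a small helper. This is only an illustration, assuming the RODBC package is installed and a Hive ODBC data source has already been configured on the client; the DSN name "Hive", the helper name hive_query, and the table sample_table are hypothetical and do not come from the slides.

```r
# Sketch: send a HiveQL query over ODBC and return the result as a data frame.
# Assumes the RODBC package and a configured DSN (the name "Hive" is made up).
hive_query <- function(sql, dsn = "Hive") {
  channel <- RODBC::odbcConnect(dsn)
  # odbcConnect() does not raise an error on failure, so check the handle.
  if (!inherits(channel, "RODBC")) {
    stop("could not open ODBC data source: ", dsn)
  }
  # Always release the connection, even if the query fails.
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlQuery(channel, sql)
}

# Usage (needs a reachable Hive ODBC data source; the table name is made up):
# top_rows <- hive_query("SELECT * FROM sample_table LIMIT 10")
```

Because the whole result set is pulled into the R session, keeping a LIMIT or an aggregation in the query is one way to stay within the result-set-size limitation noted on the slide.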
  • 13. Deployment Considerations [diagram: worker nodes each running TT and DN, plus JT, NN, and HS nodes]
  • 14. RHive
      • Focus
          – Broad access to Hive and HDFS
      • Integration Ease
          – Requires Hadoop binaries, libraries, and configuration files on client machines
          – Uses the Java DFS client and HiveServer
      • Benefits
          – Wide range of features expressed through HQL
          – rhive-apply: R distributed apply function using HQL
      • Limitations
          – Requires heavy client deployment
          – Depends on HiveServer and can't be used with HiveServer2
  • 15. Deployment Considerations [diagram: worker nodes each running TT + DN, plus JT, NN, and HS nodes; R installed on an edge node]
  • 16. RHadoop
      • Focus
          – Tight integration with core Hadoop components
      • Benefit
          – Ability to run R on a massively distributed system
          – Ability to work with full data sets instead of sample sets
      • Additional Information
          – https://github.com/RevolutionAnalytics/RHadoop/wiki
  • 17. RHadoop Architecture [diagram: R with the rhdfs, rhbase, and rmr2 packages; HDFS; HBase behind the HBase Thrift Gateway; MapReduce Streaming running R on multiple nodes]
  • 18. rhdfs
      • Access HDFS from R
      • Read from HDFS into an R data frame
      • Write from an R data frame to HDFS
      • 1.0.6 adds support for Windows (using HDP)
  • 19. rhdfs: Hadoop CLI commands and their rhdfs equivalents
      • hadoop fs -ls /
          – hdfs.ls("/")
      • hadoop fs -mkdir /user/rhdfs/ppt
          – hdfs.mkdir("/user/rhdfs/ppt")
      • hadoop fs -put 1.txt /user/rhdfs/ppt/
          – localData <- system.file(file.path("unitTestData", "1.txt"), package = "rhdfs")
          – hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
      • hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
          – hdfs.get("/user/rhdfs/ppt/1.txt", "test")
      • hadoop fs -rm /user/rhdfs/ppt/1.txt
          – hdfs.delete("/user/rhdfs/ppt/1.txt")
  • 20. rhbase
      • Access and change data within HBase
      • Uses the Thrift API
      • Command Examples
          – hb.new.table
          – hb.insert
          – hb.scan.ex
          – hb.scan
  • 21. rmr2
      • Enables writing MapReduce jobs in R
      • Ability to parallelize algorithms
      • Ability to use big data sets without needing to sample data
      • mapreduce(input, output, map, reduce, …)
      • The reduce function takes a key and a collection of values, which can be a vector, list, data frame, or matrix
      • 2.2.1 adds support for Windows (using HDP)
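  • 22. Sample code: wordcount

      # The slide shows only the function body; the enclosing function header
      # below is restored so that input, output, and pattern are defined.
      wordcount = function(input, output = NULL, pattern = " ") {
        # Map: split each line into words and emit (word, 1) pairs.
        wc.map =
          function(., lines) {
            keyval(
              unlist(
                strsplit(
                  x = lines,
                  split = pattern)),
              1)}
        # Reduce: sum the counts collected for each word.
        wc.reduce =
          function(word, counts) {
            keyval(word, sum(counts))}

        mapreduce(
          input = input,
          output = output,
          input.format = "text",
          map = wc.map,
          reduce = wc.reduce,
          combine = TRUE)}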
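To see what the map and reduce steps of the word count compute without a cluster, the same logic can be imitated in base R. This is a local stand-in only: it does not use rmr2 or Hadoop, and the function name local_wordcount and the sample lines are made up for illustration.

```r
# Local imitation of the word count; no rmr2 or Hadoop involved.
local_wordcount <- function(lines, pattern = " ") {
  # "Map": split every line into words; each word carries an implicit count of 1.
  words <- unlist(strsplit(x = lines, split = pattern))
  words <- words[nzchar(words)]  # drop empty strings from repeated separators
  ones <- rep(1, length(words))
  # "Shuffle": group the 1s by word. "Reduce": sum each group.
  vapply(split(ones, words), sum, numeric(1))
}

result <- local_wordcount(c("hadoop and r", "r on hadoop", "r"))
print(result)
```

The grouping step here plays the role of the shuffle that Hadoop performs between map and reduce; in the rmr2 version each reduce call receives one word together with the vector of its counts.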
  • 23. More Sample Code

      # 50 draws from a binomial distribution (size = 32, prob = 0.4)
      groups = rbinom(32, n = 50, prob = 0.4)
      # Count occurrences of each value locally
      tapply(groups, groups, length)
      # The same count as a MapReduce job
      groups = to.dfs(groups)
      from.dfs(
        mapreduce(
          input = groups,
          map = function(., v) keyval(v, 1),
          reduce =
            function(k, vv)
              keyval(k, length(vv))))
  • 24. Deployment Considerations [diagram: worker nodes each running TT, DN, and RS with R installed; JT and NN nodes; HTG; R on an edge node]
  • 25. RHadoop
      • Limitations
          – Requires installation of R on all TaskTracker nodes
          – Does not automatically parallelize algorithms
          – A different slot/memory configuration is recommended, to leave memory and CPU resources for R
      • [diagram: node resources shared by OS and MapReduce, compared with OS, MapReduce, and R]
  • 26. Getting Started
  • 27. Your Fastest On-ramp to Enterprise Hadoop™!
      • http://hortonworks.com/products/hortonworks-sandbox/
      • The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud, and no internet connection needed.
      • The Hortonworks Sandbox is:
          – A free download
          – A complete, self-contained virtual machine with Apache Hadoop pre-configured
          – A personal, portable, and standalone Hadoop environment
          – A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
  • 28. Installation
      • Install R on all nodes
      • Install dependent packages
          – RJSONIO
          – itertools
          – digest
          – Rcpp
          – rJava
          – functional
          – RCurl
          – httr
          – plyr
      • Download & install the RHadoop packages
          – rmr2
          – rhdfs
          – rhbase (requires Thrift)
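Before installing the RHadoop packages, it can help to check which of the dependent packages listed above are already present on a node. The sketch below uses only base R; the helper name missing_packages is made up, and the commented-out install step assumes network access and that every package is still available from the configured repository.

```r
# Dependent packages named on the Installation slide.
deps <- c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
          "functional", "RCurl", "httr", "plyr")

# Return the subset of pkgs that is not installed in any library on this node.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(deps)
print(to_install)

# Uncomment to install what is missing (assumes a reachable package repository):
# if (length(to_install) > 0) install.packages(to_install)
```

Since R has to be present on every node, the same check would need to run on each node rather than only on the client.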
  • 29. Questions & Answers
      • TRY: Download HDP at hortonworks.com
      • LEARN: Applying Data Science using Apache Hadoop training
      • FOLLOW: Twitter: @hortonworks; Facebook: facebook.com/hortonworks
      • Further questions & comments: paul@hortonworks.com, ravi@hortonworks.com