This document discusses integrating R with Hadoop. It begins with an introduction to R and its uses for statistical analysis and data visualization, then shows how R can be used with Hadoop to analyze large datasets stored in Hadoop and to execute R code on the cluster. Examples of R packages that interface with Hadoop components (HDFS, HBase, and MapReduce) are provided, along with guidelines for when it makes sense to integrate R and Hadoop versus using them separately.
In this document
Overview of R, its integration with Hadoop, objectives, and contents of presentation.
R as software for statistical data analysis; it is open source, highly extensible, but has a steep learning curve.
Explains how to utilize R with Hadoop for execution and discusses when to integrate based on strengths and scenarios.
Various scenarios to use R with Hadoop, including data analysis, modeling, machine learning algorithms, and preprocessing.
Introduction of rhdfs for HDFS manipulation, rhbase for HBase tables, and rmr for writing MapReduce programs.
Provides references for further reading on Big Data, Hadoop, and R integration.
- Objectives
- Contents:
  • Introduction to R
  • Implementing R integration with Hadoop
  • When to use R in combination with Hadoop
  • Examples using Hadoop
- Q&A
- References
Security Classification: Internal. R integration with Hadoop, slide 5
• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software
• Free and open source
• Strong user community
• Highly extensible and flexible
• Implements high-end statistical methods
• Flexible graphics and intelligent defaults

But:
• Steep learning curve
• Slow for large datasets
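A minimal sketch of what interpreted, interactive statistical analysis in R looks like, using only the built-in `mtcars` dataset and base functions:

```r
# Fit a simple linear model: fuel economy as a function of car weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients   # estimates, std. errors, t- and p-values

# Base graphics with sensible defaults
plot(mpg ~ wt, data = mtcars)
abline(fit)
```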
No | Factor | Mantra | Guideline
1 | R's natural strength | Use R for statistical computing | Consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages.
2 | Hadoop's natural strength | Use Hadoop for distributed storage & batch computing | Consider integrating when your problem requires lots of storage or when it could benefit from parallelization.
3 | Coding effort | Work smart, not hard | R and Hadoop are tools, not "cure-all" panaceas. Consider not integrating if it is easier to solve your problem with other tools.
4 | Processing time | Work smart, not hard | Although some problems can benefit from parallelization, consider not integrating if the gains are negligible, since this helps reduce the complexity of your project.
No | Scenario | Use R/Hadoop? | Why? | Example
1 | Analyzing small data stored in Hadoop | Yes | R can quickly download the data and analyze it locally | Analyzing summary datasets derived from MapReduce jobs run in Hadoop
2 | Extracting complex features from large data stored in Hadoop | Yes | R has more built-in and contributed functions for analyzing data than many standard programming languages | R is a natural language for writing an algorithm or classifier that extracts information about objects contained in images
3 | Applying prediction and classification models to datasets | Yes | R is better at modeling than many standard programming languages | Using a logistic regression model to generate predictions on a large dataset
4 | Implementing an "iteration-based" machine learning algorithm | Maybe | (1) Other languages may be faster than R for your analysis; (2) Hadoop reads and writes a lot of data to disk, while other "big data" tools such as Spark (and SparkR) are designed for speed in these scenarios by working in memory | Training a k-means clustering or logistic regression model on a large dataset
5 | Simple preprocessing of large data stored in Hadoop | No | Standard programming languages are much faster than R at many basic text and image processing tasks | Pre-processing Twitter tweets for use in a natural language processing project
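Scenario 3 can be illustrated with base R alone: `glm` fits a logistic regression, and `predict` is the scoring step that would be applied across the large dataset when run on a cluster.

```r
# Fit a logistic regression on a built-in dataset
# (am = transmission type, modeled from weight and horsepower)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Score new observations; on a cluster, this scoring step is the
# part that would be parallelized over the large dataset
new_obs <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predict(fit, newdata = new_obs, type = "response")
```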
rhdfs:
• Manipulate HDFS directly from R
• Mimic as much of the HDFS Java API as possible
• Examples:
– Read an HDFS text file into a data frame
– Serialize/deserialize a model to HDFS
– Write an HDFS file to local storage
• rhdfs/pkg/inst/unitTests
• rhdfs/pkg/inst/examples
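A hedged sketch of the first example above (reading an HDFS text file into a data frame). It assumes a running Hadoop cluster, the rhdfs package installed, and the `HADOOP_CMD` environment variable pointing at the hadoop binary; the paths are made up for illustration.

```r
# Sketch only: requires a live Hadoop cluster and the rhdfs package
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")  # adjust to your install
library(rhdfs)
hdfs.init()

# Copy a local CSV into HDFS, then read it back into a data frame
write.csv(mtcars, "mtcars.csv", row.names = FALSE)
hdfs.put("mtcars.csv", "/tmp/mtcars.csv")
lines <- hdfs.read.text.file("/tmp/mtcars.csv")
df <- read.csv(textConnection(lines))
```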
rhbase:
• Manipulate HBase tables and their content
• Uses the Thrift C++ API as the mechanism to communicate with HBase
• Examples:
  – Create a data frame from a collection of rows and columns in an HBase table
  – Update an HBase table with values from a data frame
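A hedged sketch of rhbase usage. It assumes an HBase Thrift server listening on localhost:9090 and the rhbase package installed; the table name, column family, and values are made up for illustration.

```r
# Sketch only: requires HBase with its Thrift server running
library(rhbase)
hb.init(host = "127.0.0.1", port = 9090)

# Create a table with one column family "d", then insert a row
hb.new.table("scores", "d")
hb.insert("scores", list(list("row1", c("d:math", "d:stats"),
                              list(95, 88))))

# Fetch the row back; the result can be reshaped into a data frame
hb.get("scores", "row1")
```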
rmr:
• Designed to be the simplest and most elegant way to
write MapReduce programs
• Gives the R programmer the tools necessary to perform
data analysis in a way that is “R” like
• Provides an abstraction layer to hide the implementation
details
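A hedged sketch of the rmr programming model, using the rmr2 package (the current incarnation of rmr). The "local" backend runs the job without a Hadoop cluster, which is convenient for development; on a real cluster the same code runs as a MapReduce job.

```r
# Sketch only: requires the rmr2 package
library(rmr2)
rmr.options(backend = "local")  # develop without a cluster

# A map-only MapReduce job that squares each input value
input  <- to.dfs(1:10)
result <- mapreduce(input = input,
                    map = function(k, v) keyval(v, v^2))
from.dfs(result)   # keys 1..10, values 1, 4, 9, ..., 100
```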
References
- Big Data and Hadoop: an introduction
- http://cran.r-project.org
- http://revolutionanalytics.com
- Hadoop For Dummies
- "R: a brief introduction", Gilberto Câmara
Editor's Notes
#6 R is software that provides a programming environment for statistical data analysis. It was written by Robert Gentleman and Ross Ihaka, and the software's name comes from the first initial of each creator's name. It is a free implementation of S, another popular statistical language. R can be used effectively for data storage, data analysis, and a variety of graphing functions, and it is distributed free as open source software.
#7 R is great software. It is freely distributed (free both in price and in freedom of usage, with no restrictions). It has a very strong user community ready to help newcomers and share information, and it has extensive documentation. Best of all, it is highly extensible: statistical methods from the very simple to the very advanced can be implemented in R. R's graphics are flexible and come with many intelligent defaults, meaning R can often guess what you are trying to do and act accordingly.
On the downside, it can be time-consuming to learn to use R effectively. The learning process is slow and sometimes frustrating, but in the end it is rewarding. R can also be slow for very large datasets, although there are several ways to speed it up; newer versions are invariably faster than older ones, so keeping the software up to date helps.