- Objectives
- Contents:
• Introduction to R
• Implementation of R integration with Hadoop
• When to use R in combination with Hadoop
• Examples using Hadoop
- Q&A
- References
Security Classification: Internal
Objectives
• Understand R
• Understand when to use R in combination with Hadoop
• Understand how the integration is implemented
R integration with Hadoop
• Software for Statistical Data Analysis
• Based on S
• Programming Environment
• Interpreted Language
• Data Storage, Analysis, Graphing
• Free and Open Source Software
• Free and Open Source
• Strong User Community
• Highly extensible, flexible
• Implements high-end statistical methods
• Flexible graphics and intelligent defaults
But:
• Steep learning curve
• Slow for large datasets
• Use Hadoop to execute R code
• Use R to access data stored in Hadoop
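One common way to have Hadoop execute R code is Hadoop Streaming, which runs any executable that reads input lines on stdin and writes tab-separated key/value pairs to stdout. A minimal sketch of an R word-count mapper (the script itself is illustrative and not taken from the deck; it would be passed to Hadoop via the streaming jar's -mapper option):

```r
#!/usr/bin/env Rscript
# Minimal word-count mapper for Hadoop Streaming (illustrative sketch).
# Streaming feeds input lines on stdin and expects "key\tvalue" on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in strsplit(tolower(line), "[^a-z]+")[[1]]) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")  # emit (word, 1)
  }
}
close(con)
```

Hadoop then shuffles the emitted pairs by key, so a companion reducer (also writable in R) only has to sum the counts for each word.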
When to integrate R with Hadoop: guidelines

1. Factor: R's natural strength. Mantra: use R for statistical computing. Guideline: consider integrating when your project can be solved using code available in R, or when it is not easily solved in other languages.
2. Factor: Hadoop's natural strength. Mantra: use Hadoop for distributed storage and batch computing. Guideline: consider integrating when your problem requires lots of storage or when it could benefit from parallelization.
3. Factor: coding effort. Mantra: work smart, not hard. Guideline: R and Hadoop are tools, not "cure-all" panaceas; consider not integrating if it is easier to solve your problem with other tools.
4. Factor: processing time. Mantra: work smart, not hard. Guideline: although some problems can benefit from parallelization, consider not integrating if the gains are negligible, since this helps reduce the complexity of your project.
Scenarios: when to use R with Hadoop

1. Scenario: analyzing small data stored in Hadoop. Use R/Hadoop? Yes — R can quickly download the data and analyze it locally. Example: analyzing summary datasets derived from MapReduce jobs run in Hadoop.
2. Scenario: extracting complex features from large data stored in Hadoop. Use R/Hadoop? Yes — R has more built-in and contributed functions for analyzing data than many standard programming languages. Example: R is a natural language for writing an algorithm or classifier that extracts information about objects contained in images.
3. Scenario: applying prediction and classification models to datasets. Use R/Hadoop? Yes — R is better at modeling than many standard programming languages. Example: using a logistic regression model to generate predictions on a large dataset.
4. Scenario: implementing an "iteration-based" machine learning algorithm. Use R/Hadoop? Maybe — (1) other languages may be faster than R for your analysis; (2) Hadoop reads and writes a lot of data to disk, and other "big data" tools such as Spark (and SparkR) are designed for speed in these scenarios by working in memory. Example: training a k-means clustering algorithm or a logistic regression on a large dataset.
5. Scenario: simple preprocessing of large data stored in Hadoop. Use R/Hadoop? No — standard programming languages are much faster than R at executing many basic text and image processing tasks. Example: preprocessing Twitter tweets for use in a natural language processing project.
rhdfs:
• Manipulate HDFS directly from R
• Mimic as much of the HDFS Java API as possible
• Examples:
– Read an HDFS text file into a data frame
– Serialize/Deserialize a model to HDFS
– Write an HDFS file to local storage
• rhdfs/pkg/inst/unitTests
• rhdfs/pkg/inst/examples
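A sketch of how the examples above might look in practice. The paths and file names below are hypothetical, and the code assumes the HADOOP_CMD environment variable points at the hadoop binary and that a cluster is reachable:

```r
library(rhdfs)
hdfs.init()                               # initialize the HDFS connection

# List a directory, then read a text file into a data frame
hdfs.ls("/user/demo")
f   <- hdfs.file("/user/demo/sales.csv", "r")
raw <- hdfs.read(f)                       # returns a raw byte vector
df  <- read.csv(textConnection(rawToChar(raw)))
hdfs.close(f)

# Copy an HDFS file to local storage
hdfs.get("/user/demo/sales.csv", "/tmp/sales.csv")
```

Serializing a model follows the same pattern in reverse: serialize the fitted object in R (e.g. with serialize()) and write the resulting bytes with hdfs.write().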
rhbase:
• Manipulate HBase tables and their content
• Uses the Thrift C++ API to communicate with HBase
• Examples:
– Create a data frame from a collection of rows
and columns in an HBase table
– Update an HBase table with values from a data
frame
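A rough sketch of the rhbase examples above. The table name, row keys, and column names are hypothetical, and the code assumes an HBase Thrift server is running on its default port:

```r
library(rhbase)
hb.init()                        # connect to the HBase Thrift server

hb.list.tables()                 # inspect the available tables

# Read a few rows; each result carries the row key, columns, and values,
# from which a data frame can be assembled
rows <- hb.get("customers", list("row1", "row2"))

# Write a value from R back into the table
hb.insert("customers", list(list("row3", "info:name", list("Ada"))))
```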
rmr:
• Designed to be the simplest and most elegant way to
write MapReduce programs
• Gives the R programmer the tools to perform data
analysis in a natural, "R-like" way
• Provides an abstraction layer to hide the implementation
details
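The canonical introductory rmr example squares the integers 1 to 10 with a map-only job. With the local backend it can even be tried without a cluster; the sketch below assumes the rmr2 package is installed:

```r
library(rmr2)
rmr.options(backend = "local")   # run without a cluster while experimenting

small.ints <- to.dfs(1:10)       # push an R vector into the (local) DFS
squares <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)   # emit each value with its square
)
from.dfs(squares)                # pull the key/value pairs back into R
```

Note how the map function is ordinary R code operating on R objects; rmr handles the serialization and job submission behind the scenes, which is exactly the abstraction layer described above.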
References
- Big Data and Hadoop Introduction
- https://cran.r-project.org
- http://revolutionanalytics.com
- Hadoop For Dummies
- R – A Brief Introduction, Gilberto Câmara
- R Hadoop Integration

Editor's Notes

  • #6 R is software that provides a programming environment for statistical data analysis. It was written by Robert Gentleman and Ross Ihaka, and its name comes from the first initials of its creators. It is a free implementation of S, another popular statistical language. R can be used effectively for data storage, data analysis, and a wide variety of graphing functions. It is distributed free of charge as open-source software.
  • #7 R is a great piece of software. It is freely distributed (free both in price and in freedom of use, with no restrictions). It has a very strong user community ready to help newcomers and share information, and it has extensive documentation. Best of all, it is highly extensible: statistical methods from the very simple to the very advanced can all be implemented in R. R's graphics are very flexible and come with many intelligent defaults, meaning R can often guess what you are trying to do and act accordingly. On the downside, it can be time-consuming to learn to use effectively; the learning process is slow and sometimes frustrating, but ultimately rewarding. R can also be slow for very large datasets, although there are several ways to speed it up; newer versions are invariably faster than older ones, so keeping the software up to date is a good start.