Big Data refers to large amounts of data, both structured and unstructured. Managing and analyzing data at this scale requires technologies like Hadoop and languages like R.
http://www.techsparks.co.in/thesis-in-big-data-with-r/
Big Data - Analytics with R
Big Data with R
Big Data refers to large volumes of data, which may be organized or unorganized. This data is essential for large organizations and businesses, which analyze it for valuable insights and to determine future trends. Big Data is commonly defined in terms of the 3 Vs:
Volume – Volume refers to the quantity of data, which is increasing day by day. Facebook, for example, has more users than the entire population of China, and its data, in the form of images, music, videos, and similar content, is correspondingly huge.
Velocity – Velocity refers to the rate at which data is generated. Staying with the example of Facebook, a huge amount of data is uploaded and shared every second. People on social media expect fresh information and content each time they log in; old, obsolete news does not matter to them. New information is therefore posted every second.
Variety – The third V of Big Data is variety, meaning diverse types of data. Data can be stored in multiple formats: image, video, text, PDF, or Excel. Managing these different types of data is one of Big Data's major challenges; an organization needs to group data of similar formats together in order to extract useful information from it.
Why is Big Data Analytics important?
Big Data and its analytics are important for the following reasons:
Reduction in cost – Big Data analytics offers cost advantages through technologies like Hadoop and cloud computing, which help store and manage large amounts of data.
Better decision making – Using Hadoop and analytics, organizations and businesses can make better and faster decisions by analyzing data from multiple sources.
New services and product development – With the help of Big Data analytics, companies can measure customer behavior and needs and use those insights to launch new products and services that satisfy them.
R Programming Language
R is an open-source programming language and software environment for statistical analysis, graphical representation, and reporting. The R language is extensively used by statisticians and data miners for data analysis and statistical software development. Ross Ihaka and Robert Gentleman are the two authors of the language, which is named 'R' after the first letter of their first names.
The source code of the R software environment is written mainly in C, Fortran, and R itself. R is a GNU package and is freely available under the GNU General Public License.
What is GNU?
GNU is a recursive acronym for "GNU's Not Unix!" It is an operating system and a collection of computer software. Its design is Unix-like, but it differs from Unix in that it is free software and contains no Unix code.
Features of R
R programming language has the following main features:
It is a simple and well-developed programming language that includes conditionals, loops, and user-defined recursive functions.
It has effective data handling and data storage facilities.
It provides operators for calculations on arrays, matrices, and vectors.
It provides an integrated set of tools for data analysis.
It provides graphical facilities for data analysis; static graphics are built in, and dynamic and interactive graphs are available through add-on packages. A short sketch illustrating some of these features follows this list.
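As a quick illustration of the language features above, here is a minimal sketch (all names and values are arbitrary examples) combining a recursive function, a conditional, and vectorized operators:
# a recursive function with a conditional
fact <- function(n) {
  if (n <= 1) 1 else n * fact(n - 1)
}
fact(5)            # [1] 120
# vector calculation: operators apply element-wise
v <- c(1, 2, 3, 4)
v * 2              # [1] 2 4 6 8
# matrix calculation: %*% performs matrix multiplication
m <- matrix(1:4, nrow = 2)
m %*% m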
Basic Syntax of R
To work with R, you first need to set up the R environment. Once the environment is ready, you can work at the R command prompt. To start it, type the following command:
$ R
The R interpreter will be launched, and you can type your program at the > prompt as follows (note that R is case-sensitive, so print must be lowercase):
> myString <- "Hello World!"
> print(myString)
[1] "Hello World!"
R Script File
Programs can also be written in script files and then executed from the command line using the R interpreter called Rscript, as in the sketch below.
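For example (a minimal sketch; test.R is a hypothetical file name), save the following in a script file:
# test.R
myString <- "Hello World!"
print(myString)
Then run it from the command line:
$ Rscript test.R
[1] "Hello World!"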
In the R language, variables are assigned R objects, which come in the following types:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
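A minimal sketch showing how each of these R objects can be created (all values are arbitrary examples):
# vector: a sequence of elements of one type
v <- c(1, 2, 3)
# list: elements may have different types
l <- list(1, "a", TRUE)
# matrix: two-dimensional data of one type
m <- matrix(1:6, nrow = 2)
# array: like a matrix, but with any number of dimensions
a <- array(1:8, dim = c(2, 2, 2))
# factor: categorical data with a fixed set of levels
f <- factor(c("low", "high", "low"))
# data frame: a table whose columns may differ in type
df <- data.frame(x = 1:3, y = c("a", "b", "c"))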
Working with Big Data in R
R has been around for more than 20 years, but it has gained attention recently due to its capacity to handle Big Data. R provides a series of packages and an environment for statistical computation on Big Data. The Programming with Big Data in R (pbdR) project was started a few years ago and is mainly used for data profiling and distributed computing. R packages and functions are available to load data from virtually any source.
Hadoop is a Big Data technology for handling large amounts of data, and R and Hadoop can be integrated for Big Data analytics.
Why integrate R with Hadoop?
R is a very good programming language for statistical data analysis and for turning that analysis into interactive graphs. Although R is the preferred language for statistics and analytics, it has drawbacks as well. R holds all objects in the main memory of a single machine, so data larger than the available RAM cannot be loaded. R is also not inherently scalable, which limits the amount of data that can be processed at a time. For such cases, Hadoop is the perfect complement.
Hadoop is a distributed processing framework for performing operations on large datasets. It is already a popular framework for Big Data processing, and integrating it with R works wonders: data analytics becomes highly scalable, since the analytics platform can be scaled up and down depending on the dataset, and the cost-to-value ratio improves as well.
How to integrate R with Hadoop?
Data scientists use R packages and R scripts for data processing. To use these scripts and packages with Hadoop, they would ordinarily have to be rewritten in Java or another language that implements the Hadoop MapReduce model. What is needed instead is software written in R that works with data kept in Hadoop's distributed storage. The following are some of the methods for integrating R with Hadoop:
1. RHadoop – This is the most commonly used solution for integrating R with Hadoop. It allows users to take data directly from HBase database systems and HDFS file systems, and it offers the advantages of simplicity and low cost. RHadoop is a collection of five packages for managing and analyzing data with R (a short rmr2 sketch appears after this list):
rhbase – provides database management functions for HBase within R.
rhdfs – provides connectivity to the Hadoop Distributed File System.
plyrmr – provides data manipulation operations on large datasets.
ravro – allows users to read and write Avro files from HDFS.
rmr2 – used to perform statistical analysis, via MapReduce, on data stored in Hadoop.
2. RHIPE – An acronym for R and Hadoop Integrated Programming Environment, RHIPE is an R library that gives users the ability to run MapReduce jobs from within R. It provides its own data distribution scheme and integrates well with Hadoop.
3. R and Hadoop Streaming – Hadoop Streaming makes it possible to run MapReduce jobs with any executable script as the mapper or reducer: the script reads data from standard input and writes its results to standard output. R scripts can be plugged into Hadoop Streaming in exactly this way (a minimal mapper sketch appears after this list).
4. RHIVE – This approach is based on installing R on workstations and connecting to data in Hadoop. RHIVE is the package used to launch Hive queries from R. It has functions to retrieve metadata from Apache Hive, such as database names, column names, and table names, and it makes R's libraries and algorithms available to the data stored in Hadoop. Its main advantage is the parallelization of operations.
5. ORCH – Short for Oracle R Connector for Hadoop, ORCH allows users to test the capabilities of MapReduce programs without having to learn a new programming language.
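To make the RHadoop option concrete, here is a minimal sketch using the rmr2 package. It assumes a working Hadoop installation with the RHadoop packages installed; the input data and the squaring step are arbitrary illustrations:
library(rmr2)
# write a small vector of integers into HDFS
ints <- to.dfs(1:1000)
# map-only job: emit each value as the key and its square as the value
result <- mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)
# read the result back from HDFS into the R session
out <- from.dfs(result)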
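As a sketch of the Hadoop Streaming approach, the following hypothetical mapper.R implements the map step of a word count: it reads lines from standard input and emits tab-separated word/count pairs on standard output. A matching reducer script would then sum the counts per word; both scripts would be passed to the Hadoop Streaming jar via its -mapper and -reducer options.
#!/usr/bin/env Rscript
# mapper.R -- word-count mapper for Hadoop Streaming
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words <- words[words != ""]
  # one tab-separated key-value pair per word
  for (w in words) cat(w, "\t1\n", sep = "")
}
close(con)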
Considering all this, the combination of R and Hadoop is a must for working with Big Data, delivering faster, better, and predictive analytics along with performance, scalability, and flexibility.
Strategies for Big Data in R
Big Data can be tackled in R with the following strategies:
Sampling – If the data is too big to be analyzed in full, its size can be reduced by sampling. Note, however, that sampling may reduce the accuracy of the results in some cases.
Bigger hardware – R keeps all objects in the memory of a single machine, which becomes a problem when the data is very large. Increasing the machine's memory is a simple way to let R handle bigger datasets.
Storing objects on the hard drive – Instead of keeping data objects in memory, they can be stored on the hard disk using packages available for this purpose. The data can then be analyzed block by block, which also allows parallelization, although only algorithms specifically designed for block-wise processing can be used. 'ff' and 'ffbase' are the main packages for this purpose (see the sketch after this list).
Integration of high-performance programming languages – For better performance, high-performance languages can be integrated with R. Only small components of the program are transferred from R to the other language, which keeps the risk low. To implement this strategy, developers need to be proficient in another programming language such as Java or C++ (see the Rcpp sketch after this list).
Alternative interpreters – Big Data can also be handled by running R code on an alternative interpreter. One such interpreter is pqR (pretty quick R); another is Renjin, which runs on the JVM (Java Virtual Machine).
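As a sketch of the disk-backed strategy, assuming the ff and ffbase packages are installed, and with bigdata.csv and its value column as hypothetical inputs:
library(ff)
library(ffbase)
# read a large CSV into an on-disk ffdf object instead of RAM
big <- read.csv.ffdf(file = "bigdata.csv", header = TRUE)
# ffbase adds chunk-wise methods, so simple summaries work
# without loading the whole column into memory
mean(big$value)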
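And as a sketch of the language-integration strategy, the Rcpp package is one common way to move a small computation into C++; the function below is an arbitrary example:
library(Rcpp)
# compile a small C++ function and expose it to R
cppFunction('
double sumSquares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
  return total;
}')
sumSquares(c(1, 2, 3))   # [1] 14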