Big Data Analysis using SparkR
Abstract:
R is a popular statistical programming language for analysis, graphical representation, and reporting, with a large number of packages for machine learning, text mining, and graph analysis. However, interactive R runs in a single thread, which limits its processing speed and makes it difficult to handle and process huge amounts of data such as big data. SparkR arose to overcome this limitation with a distributed computational engine that enables large-scale data analysis from the R shell. Our goal is to describe R, SparkR, and various SparkR commands for data manipulation and data exploration using data frames, various libraries, and the DataFrame API.
Introduction:
Data is growing rapidly day by day. As of 2012, about 2.5 exabytes of data were created each day, and that number is doubling every 40 months [10]. Many companies generate petabytes of data in a single data set, and not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions, and Facebook alone holds 300 petabytes of Hive data, growing by 600 TB every day [9]. There are a number of tools for analyzing data, and R is one of the most popular among data scientists. It provides support for structured data processing using data frames, along with a number of statistical and visualization packages.
However, analysis in R is limited to the amount of memory available on a single computer running a single thread. So, in this paper we explore SparkR for processing huge amounts of data, taking advantage of parallel processing over a cluster. SparkR provides a large number of packages for SQL querying, distributed machine learning, and graph analytics.
Background:
R programming
R is a programming language for statistical analysis, graphical representation, and reporting. It is an open-source implementation of the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues; R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland. Because it is free, open source, powerful, and highly extensible, R has become popular in the data analysis field. It provides a wide variety of statistical packages for linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and much more. Currently more than nine thousand packages are available, and many more developers are supporting it. The use of data frames and matrices helps to handle and operate on data effectively. Graphical facilities, with display either on-screen or on hardcopy, also help users analyze data more precisely. Moreover, R provides programming constructs such as conditionals, loops, user-defined recursive functions, and input/output facilities.
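As a small illustration of these constructs, a factorial computed with a user-defined recursive function and printed from a loop might look like this in base R:
# user-defined recursive function: factorial
fact <- function(n) {
  if (n <= 1) {      # conditional
    return(1)
  }
  n * fact(n - 1)    # recursive call
}
# loop printing the first few factorials
for (i in 1:5) {
  cat(i, "! = ", fact(i), "\n", sep = "")
}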
R provides not only numerical computation but also support for structured data processing through data frames. A data frame is a tabular data structure containing multiple columns in vector form that can hold numerical as well as categorical values. Data frames make it easy to filter, summarize, and sort data. Packages like dplyr, data.table, reshape2, readr, tidyr, and lubridate help greatly in data exploration.
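For instance, the filtering, summarizing, and sorting described above can be done on a plain R data frame (here the built-in iris data set) with base R alone:
# filter rows of the built-in iris data frame
setosa <- iris[iris$Species == "setosa", ]
# summarize a column
mean(setosa$Sepal.Length)
# sort by a column and inspect the first rows
head(iris[order(iris$Sepal.Length), ])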
Apache Spark
Apache Spark is a powerful open-source tool for processing huge amounts of data, combining ease of use with sophisticated analysis tools. It started as a research project at UC Berkeley in the AMPLab focused on big data analytics. MapReduce is inefficient for multi-pass applications that require low-latency data sharing across parallel operations; Spark overcomes those shortcomings and has become famous for its parallel operation. The Spark project first introduced Resilient Distributed Datasets (RDDs), an API for fault-tolerant computation in a cluster computing environment. It leads the market thanks to major features such as its built-in machine learning library MLlib and graph library GraphX (with algorithms like PageRank), and its ability to process data in memory, which lets it process and query data over a cluster far more rapidly. Since these libraries are closely integrated with the core API, Spark enables complex workflows: for example, SQL queries can be used to pre-process data, and the results can then be analyzed using advanced machine learning and graph algorithms.
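As a sketch of such a workflow in SparkR (the table name events and its columns x and y are hypothetical, and a SparkR session is assumed to be running), a SQL query pre-processes the data and its result is passed straight to a machine learning routine:
# pre-process with a SQL query (events, x and y are hypothetical names)
cleaned <- sql("SELECT x, y FROM events WHERE x IS NOT NULL")
# analyze the pre-processed result with an MLlib algorithm
model <- spark.kmeans(cleaned, ~ x + y, k = 3)
summary(model)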
Apache SparkR
SparkR is a light-weight frontend on top of Apache Spark. It was initially started in the AMPLab to explore combining the usability of R with the scalability of Spark, and was first open sourced in January 2014. As of Spark 2.0.2, SparkR provides a distributed data frame implementation that supports data exploration operations like selection, filtering, aggregation, and summarization, along with advanced packages for text mining, machine learning, and graph analysis.
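The code examples in the following sections assume a running SparkR session; a minimal setup (the /opt/spark path is only an example of where a local Spark installation might live) looks like:
# point R at a Spark installation and load the bundled SparkR package
Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# start a session on the local machine using all cores
sparkR.session(master = "local[*]", appName = "SparkRDemo")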
Benefits of SparkR Integration:
By building on the Spark API, SparkR inherits many benefits from being tightly integrated with Spark. These are:
Data Source API:
SparkR’s data source API enables users to easily load data from a variety of big data sources such as HBase, Cassandra, Hive tables, JSON files, and Parquet files.
Data Frame Optimization:
SparkR DataFrames are optimized in terms of memory management and code execution. The chart shows the runtime performance of a group-by aggregation on 10 million integer pairs on a single machine in R, Python, and Scala [1]; SparkR’s performance is comparable to that of Scala and Python.
Scalability over cluster machine:
Queries and operations executed on a SparkR DataFrame are automatically distributed across all available cores and machines in the Spark cluster, so terabytes of data on a cluster of thousands of machines can be computed and analyzed quickly, where the same work would otherwise take a very long time.
DataFrame in SparkR:
SparkR supports creating data frames from various sources: local R data frames, Hive tables, JSON files, CSV files, Parquet files, etc.
From local data:
# create a SparkDataFrame from the built-in iris data set
df <- as.DataFrame(iris)
# show the first rows (6 by default)
head(df)
From data source:
Spark supports CSV, JSON, and Parquet files natively; other formats, such as Avro, can be loaded through third-party data source connectors.
# start SparkR with the Avro package
sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
# read a JSON file
dataCar <- read.df("car.json", "json")
head(dataCar)
# see the schema of the JSON data
printSchema(dataCar)
# we can also load multiple files at once
people <- read.json(c("Car1.json", "Car2.json"))
# read a CSV file (carcsv holds the path to the file)
data.car <- read.df(carcsv, "csv", header = "true", inferSchema = "true",
na.strings = "NA")
Load from Hive table:
SparkR has built-in Hive support, so a Spark session created with Hive support can access Hive tables.
# start SparkR session
sparkR.session()
# create table
sql("CREATE TABLE IF NOT EXISTS tb_src (key INT, value STRING)")
# load data in hive table
sql("LOAD DATA LOCAL INPATH 'data_kv.txt' INTO TABLE tb_src")
# query using HiveQL
results <- sql("FROM tb_src SELECT key, value")
# show the first rows of the results dataframe
head(results)
Data Exploration:
SparkR supports data exploration operations like selecting, filtering, grouping, aggregation, and summarization.
Selecting:
# select the age and gender columns only
head(select(df, df$age, df$gender))
# filter rows having age > 20, then select age and gender only
head(select(filter(df, df$age > 20), df$age, df$gender))
Grouping and Aggregating:
# group by age and count the rows in each group
head(summarize(groupBy(df, df$age), count = n(df$age)))
# sort based on the count
group_by_age <- summarize(groupBy(df, df$age), count = n(df$age))
head(arrange(group_by_age, desc(group_by_age$count)))
Data visualization using ggplot2:
SparkR results can be visualized with the ggplot2 library after collecting them into a local R data frame.
# install and load ggplot2
install.packages("ggplot2")
library(ggplot2)
# read the file from CSV
data <- read.df("data.csv", source = "com.databricks.spark.csv",
header = "true", inferSchema = "true")
# group by gender, compute the average age, and collect to a local data frame
summary_by_age <- collect(
agg(
groupBy(data, "gender"),
avg_age = avg(data$age)
)
)
head(summary_by_age)
# plot the average age per gender
ggplot(summary_by_age, aes(x = gender, y = avg_age)) + geom_bar(stat = "identity")
K-Means Model
K-means is a widely used clustering algorithm for dividing data into different clusters. In k-means clustering we must choose the number of clusters and see how the data fit into them.
# Fit a k-means model with spark.kmeans
irisDF <- createDataFrame(iris)
kmeansDF <- irisDF
kmeansTestDF <- irisDF
kmeansModel <- spark.kmeans(kmeansDF,
~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width,
k = 5)
# Model summary
summary(kmeansModel)
# Get fitted result from the k-means model
showDF(fitted(kmeansModel))
# Prediction
kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
showDF(kmeansPredictions)
Conclusion:
In summary, SparkR provides the capabilities of R analysis on top of Spark, taking advantage of Spark’s ability to process and analyze huge amounts of data. SparkR has many packages supporting machine learning and graph analysis, such as MLlib and GraphX. SparkR can also be used for summarization, aggregation, filtering, and visualization to gain quick insight into data and find patterns in it.
References:
[1]. SparkR: Scaling R Programs with Spark. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[2]. SparkR: Scaling R Programs with Spark. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[3]. Spark Research. http://spark.apache.org/research.html
[4]. Announcing SparkR: R on Apache Spark. https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
[5]. Do Faster Data Manipulation using These 7 R Packages. https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/
[6]. SparkR and Sparkling Water. https://rpubs.com/wendyu/sparkr
[7]. Exploring geographical data using SparkR and ggplot2. https://www.codementor.io/spark/tutorial/exploratory-geographical-data-using-sparkr-and-ggplot2
[8]. Plot Types. http://skku-skt.github.io/ggplot2.SparkR/plot-types
[9]. Scaling the Facebook data warehouse to 300 PB. https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[10]. Big Data: The Management Revolution. https://hbr.org/2012/10/big-data-the-management-revolution
