Big Data Analysis using SparkR
Abstract:
R is a popular statistical programming language for analysis, graphical representation, and reporting, with a large number of packages for machine learning, text mining, and graph analysis. However, interactive R runs in a single thread, which limits its processing speed and makes it difficult to handle and process huge amounts of data such as big data. SparkR arose to overcome this limitation with a distributed computational engine that enables large-scale data analysis from the R shell. Our goal is to describe R, SparkR, and various SparkR commands for data manipulation and data exploration using data frames, various libraries, and the DataFrame API.
Introduction:
Data is growing rapidly day by day. As of 2012, about 2.5 exabytes of data were created each day, and that number is doubling every 40 months [10]. Many companies generate petabytes of data in a single data set, and not just from the internet. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions, and Facebook alone holds 300 petabytes of Hive data, growing by 600 TB every day [9]. There are a number of tools for analyzing data, and R is one of the most popular among data scientists. It provides support for structured data processing using data frames, along with a number of statistical and visualization packages.
However, analysis in R is limited to the amount of memory available on a single computer running a single thread. So, in this paper we explore SparkR for processing huge amounts of data, taking advantage of parallel processing over a cluster. SparkR provides a large number of packages for SQL querying, distributed machine learning, and graph analytics.
Background:
R programming
R is a programming language for statistical analysis, graphical representation, and reporting. It is an open-source implementation of the S language, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues; R itself was created by Ross Ihaka and Robert Gentleman at the University of Auckland. Because it is free, open source, powerful, and highly extensible, R has become popular in the data analysis field. It provides a wide variety of statistical packages for linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and much more. Currently more than nine thousand packages are available, and many more developers are supporting it. The use of data frames and matrices helps to handle and operate on data effectively. Graphical facilities, with display either on-screen or on hardcopy, also help users analyze data more precisely. Moreover, R provides programming constructs such as conditionals, loops, user-defined recursive functions, and input/output facilities.
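As a small illustration of these constructs, a factorial computed with a user-defined recursive function and printed from a loop might look like this in base R:
# user-defined recursive function: factorial
fact <- function(n) {
  if (n <= 1) {      # conditional
    return(1)
  }
  n * fact(n - 1)    # recursive call
}
# loop printing the first few factorials
for (i in 1:5) {
  cat(i, "! = ", fact(i), "\n", sep = "")
}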
R provides not only numerical computation but also support for structured data processing through data frames. A data frame is a tabular data structure containing multiple columns in vector form that can hold numerical as well as categorical values. Data frames make it easy to filter, summarize, and sort data. Packages like dplyr, data.table, reshape2, readr, tidyr, and lubridate help greatly in data exploration.
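For instance, the filtering, summarizing, and sorting described above can be done on a plain R data frame (here the built-in iris data set) with base R alone:
# filter rows of the built-in iris data frame
setosa <- iris[iris$Species == "setosa", ]
# summarize a column
mean(setosa$Sepal.Length)
# sort by a column and inspect the first rows
head(iris[order(iris$Sepal.Length), ])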
Apache Spark
Apache Spark is a powerful open-source tool for processing huge amounts of data, combining ease of use with sophisticated analysis tools. It started as a research project at UC Berkeley in the AMPLab focused on big data analytics. MapReduce is inefficient for multi-pass applications that require low-latency data sharing across parallel operations; Spark overcomes those shortcomings and has become famous for its parallel operation. The Spark project first introduced Resilient Distributed Datasets (RDDs), an API for fault-tolerant computation in a cluster computing environment. It leads the market thanks to major features such as its built-in machine learning library MLlib and graph library GraphX (with algorithms like PageRank), and its ability to process data in memory, which lets it process and query data over a cluster far more rapidly. Since these libraries are closely integrated with the core API, Spark enables complex workflows: for example, SQL queries can be used to pre-process data, and the results can then be analyzed using advanced machine learning and graph algorithms.
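As a sketch of such a workflow in SparkR (the table name events and its columns x and y are hypothetical, and a SparkR session is assumed to be running), a SQL query pre-processes the data and its result is passed straight to a machine learning routine:
# pre-process with a SQL query (events, x and y are hypothetical names)
cleaned <- sql("SELECT x, y FROM events WHERE x IS NOT NULL")
# analyze the pre-processed result with an MLlib algorithm
model <- spark.kmeans(cleaned, ~ x + y, k = 3)
summary(model)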
Apache SparkR
SparkR is a light-weight frontend on top of Apache Spark. It was initially started in the AMPLab to explore combining the usability of R with the scalability of Spark, and was first open sourced in January 2014. As of Spark 2.0.2, SparkR provides a distributed data frame implementation that supports data exploration operations like selection, filtering, aggregation, and summarization, along with advanced packages for text mining, machine learning, and graph analysis.
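The code examples in the following sections assume a running SparkR session; a minimal setup (the /opt/spark path is only an example of where a local Spark installation might live) looks like:
# point R at a Spark installation and load the bundled SparkR package
Sys.setenv(SPARK_HOME = "/opt/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# start a session on the local machine using all cores
sparkR.session(master = "local[*]", appName = "SparkRDemo")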
Benefits of SparkR Integration:
By building on the Spark API, SparkR inherits many benefits from being tightly integrated with Spark. These are:
Data Source API:
SparkR’s data source API enables users to easily load data from a variety of big data sources such as HBase, Cassandra, Hive tables, JSON files, and Parquet files.
Data Frame Optimization:
SparkR DataFrames are optimized in terms of memory management and code execution. The chart shows the runtime performance of a group-by aggregation on 10 million integer pairs on a single machine in R, Python, and Scala [1]; SparkR’s performance is comparable to that of Scala and Python.
Scalability over cluster machine:
Queries and operations executed on a SparkR DataFrame are automatically distributed across all available cores and machines in the Spark cluster, so terabytes of data on a cluster of thousands of machines can be computed and analyzed quickly, where the same work would otherwise take a very long time.
DataFrame in SparkR:
SparkR supports creating data frames from various sources: local R data frames, Hive tables, JSON files, CSV files, Parquet files, etc.
From local data:
# create a SparkDataFrame from the built-in iris data set
df <- as.DataFrame(iris)
# show the first rows (6 by default)
head(df)
From data source:
Spark supports CSV, JSON, and Parquet files natively; other formats, such as Avro, can be loaded through third-party data source connectors.
# start SparkR with the Avro package
sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
# read a JSON file
dataCar <- read.df("car.json", "json")
head(dataCar)
# see the schema of the JSON data
printSchema(dataCar)
# we can also load multiple files at once
people <- read.json(c("Car1.json", "Car2.json"))
# read a CSV file (carcsv holds the path to the file)
data.car <- read.df(carcsv, "csv", header = "true", inferSchema = "true",
na.strings = "NA")
Load from Hive table:
SparkR has built-in Hive support, so a Spark session created with Hive support can access Hive tables.
# start SparkR session
sparkR.session()
# create table
sql("CREATE TABLE IF NOT EXISTS tb_src (key INT, value STRING)")
# load data in hive table
sql("LOAD DATA LOCAL INPATH 'data_kv.txt' INTO TABLE tb_src")
# query using HiveQL
results <- sql("FROM tb_src SELECT key, value")
# show the first rows of the results dataframe
head(results)
Data Exploration:
SparkR supports data exploration operations like selecting, filtering, grouping, aggregation, and summarization.
Selecting:
# select the age and gender columns only
head(select(df, df$age, df$gender))
# filter rows having age > 20, then select age and gender only
head(select(filter(df, df$age > 20), df$age, df$gender))
Grouping and Aggregating:
# group by age and count the rows in each group
head(summarize(groupBy(df, df$age), count = n(df$age)))
# sort based on the count
group_by_age <- summarize(groupBy(df, df$age), count = n(df$age))
head(arrange(group_by_age, desc(group_by_age$count)))
Data visualization using ggplot2:
SparkR results can be visualized with the ggplot2 library after collecting them into a local R data frame.
# install and load ggplot2
install.packages("ggplot2")
library(ggplot2)
# read the file from CSV
data <- read.df("data.csv", source = "com.databricks.spark.csv",
header = "true", inferSchema = "true")
# group by gender, compute the average age, and collect to a local data frame
summary_by_age <- collect(
agg(
groupBy(data, "gender"),
avg_age = avg(data$age)
)
)
head(summary_by_age)
# plot the average age per gender
ggplot(summary_by_age, aes(x = gender, y = avg_age)) + geom_bar(stat = "identity")
K-Means Model
K-means is a widely used clustering algorithm for dividing data into different clusters. In k-means clustering we must choose the number of clusters and see how the data fit into them.
# Fit a k-means model with spark.kmeans
irisDF <- createDataFrame(iris)
kmeansDF <- irisDF
kmeansTestDF <- irisDF
kmeansModel <- spark.kmeans(kmeansDF,
~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width,
k = 5)
# Model summary
summary(kmeansModel)
# Get fitted result from the k-means model
showDF(fitted(kmeansModel))
# Prediction
kmeansPredictions <- predict(kmeansModel, kmeansTestDF)
showDF(kmeansPredictions)
Conclusion:
In summary, SparkR provides the capabilities of R analysis on top of Spark, taking advantage of Spark’s ability to process and analyze huge amounts of data. SparkR has many packages supporting machine learning and graph analysis, such as MLlib and GraphX. SparkR can also be used for summarization, aggregation, filtering, and visualization to gain quick insight into data and find patterns in it.
References:
[1]. SparkR: Scaling R Programs with Spark. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[2]. SparkR: Scaling R Programs with Spark. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
[3]. Spark Research. http://spark.apache.org/research.html
[4]. Announcing SparkR: R on Apache Spark. https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
[5]. Do Faster Data Manipulation using These 7 R Packages. https://www.analyticsvidhya.com/blog/2015/12/faster-data-manipulation-7-packages/
[6]. SparkR and Sparkling Water. https://rpubs.com/wendyu/sparkr
[7]. Exploring geographical data using SparkR and ggplot2. https://www.codementor.io/spark/tutorial/exploratory-geographical-data-using-sparkr-and-ggplot2
[8]. Plot Types. http://skku-skt.github.io/ggplot2.SparkR/plot-types
[9]. Scaling the Facebook data warehouse to 300 PB. https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
[10]. Big Data: The Management Revolution. https://hbr.org/2012/10/big-data-the-management-revolution
