SlideShare a Scribd company logo
1 of 25
Download to read offline
Introduction to
SparkR
OLGUN AYDIN – LEAD ANALYST
olgun.aydin@zingat.com
info@olgunaydin.com
Outline
u Installation and Creating a SparkContext
u Getting Data
u SQL queries in SparkR
u DataFrames
u Application Examples
u Correlation Analysis
u Time Series Analysis
u K-Means
Power of
u Fast
u Powerful
u Scalable
Power of
u Effective
u Number of Packages
u One of the Most prefered language
for statistical analysis
Power of
u Effective
u Powerful
u Statiscal Power
u Fast
u Scalable
+
u SparkR, an R package that provides a frontend to Apache Spark
and uses Spark’s distributed computation engine to enable large
scale data analysis from the R Shell.
u Data analysis using R is limited by the amount of memory available
on a single machine and further as R is single threaded it is often
impractical to use R on large datasets.
u R programs can be scaled while making it easy to use and deploy
across a number of workloads. SparkR: an R frontend for Apache
Spark, a widely deployed cluster computing engine. There are a
number of benefits to designing an R frontend that is tightly
integrated with Spark.
u SparkR is built as an R package and requires no changes to R. The
central component of SparkR is a distributed data frame that
enables structured data processing with a syntax familiar to R users.
u To improve performance over large datasets, SparkR performs lazy
evaluation on data frame operations and uses Spark’s relational
query optimizer to optimize execution.
u SparkR was initially developed at the AMPLab, UC Berkeley and has
been a part of the Apache Spark project for the past eight months.
u The central component of SparkR is a distributed data frame
implemented on top of Spark.
u SparkR DataFrames have an API similar to dplyr or local R data
frames, but scale to large datasets using Spark’s execution engine
and relational query optimizer.
u SparkR’s read.df method integrates with Spark’s data source API
and this enables users to load data from systems like HBase,
Cassandra etc. Having loaded the data, users are then able to use
a familiar syntax for performing relational operations like selections,
projections, aggregations and joins.
u Further, SparkR supports more than 100 pre-defined functions on
DataFrames including string manipulation methods, statistical
functions and date-time operations. Users can also execute SQL
queries directly on SparkR DataFrames using the sql command.
SparkR also makes it easy for users to chain commands using existing
R libraries.
u Finally, SparkR DataFrames can be converted to a local R data
frame using the collect operator and this is useful for the big data,
small learning scenarios described earlier
u SparkR’s architecture consists of two main components an R to JVM
binding on the driver that allows R programs to submit jobs to a
Spark cluster and support for running R on the Spark executors.
Installation and Creating a SparkContext
u Step 1: Download Spark
u http://spark.apache.org/
Installation and Creating a SparkContext
u Step 1: Download Spark
http://spark.apache.org/
u Step 2: Run in Command Prompt
Now start your favorite command shell and change directory to your Spark
folder
u Step 3: Run in RStudio
Set System Environment. Once you have opened RStudio, you need to set
the system environment first. You have to point your R session to the
installed version of SparkR. Use the code shown in Figure 11 below but
replace the SPARK_HOME variable using the path to your Spark folder.
“C:/Apache/Spark-1.4.1″.
Installation and Creating a SparkContext
u Step 4: Set the Library Paths
You have to set the library path for Spark
u Step 5: Load SparkR library
Load SparkR library by using library(SparkR)
u Step 6: Initialize Spark Context and SQL Context
sc<-sparkR.init(master=‘local’)
sqlcontext<-sparkRSQL.init(sc)
Getting Data
u From local data frames
u The simplest way to create a data frame is to convert a local R data
frame into a SparkR DataFrame. Specifically we can use
createDataFrame and pass in the local R data frame to create a
SparkR DataFrame. As an example, the following creates a
DataFrame based using the faithful dataset from R.
Getting Data
u From Data Sources
u SparkR supports operating on a variety of data sources through the
DataFrame interface. This section describes the general methods for
loading and saving data using Data Sources. You can check the Spark
SQL programming guide for more specific options that are available for
the built-in data sources.
u The general method for creating DataFrames from data sources is
read.df.
u This method takes in the SQLContext, the path for the file to load and
the type of data source.
u SparkR supports reading JSON and Parquet files natively and through
Spark Packages you can find data source connectors for popular file
formats like CSV and Avro.
Getting Data
u We can see how to use data sources using an example JSON input
file. Note that the file that is used here is not a typical JSON file. Each
line in the file must contain a separate, self-contained valid JSON
object.
Getting Data
u From Hive tables
u You can also create SparkR DataFrames from Hive tables. To do this we will need
to create a HiveContext which can access tables in the Hive MetaStore. Note
that Spark should have been built with Hive support and more details on the
difference between SQLContext and HiveContext can be found in the SQL
programming guide.
SQL queries in SparkR
u A SparkR DataFrame can also be registered as a temporary table in Spark SQL
and registering a DataFrame as a table allows you to run SQL queries over its
data. The sql function enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
DataFrames
u SparkR DataFrames support a number of functions to do structured data
processing. Here we include some basic examples and a complete list can be
found in the API docs.
DataFrames
u SparkR data frames support a number of commonly used functions to
aggregate data after grouping. For example we can compute a histogram of
the waiting time in the faithful dataset as shown below
DataFrames
u SparkR also provides a number of functions that can directly applied to columns
for data processing and during aggregation. The example below shows the use
of basic arithmetic functions.
Application
u http://www.transtats.bts.gov/Tables.asp?DB_ID=120

More Related Content

What's hot

[1023 박민수] 깊이_버퍼_그림자
[1023 박민수] 깊이_버퍼_그림자[1023 박민수] 깊이_버퍼_그림자
[1023 박민수] 깊이_버퍼_그림자MoonLightMS
 
Ativ mat2 descritores anos finais
Ativ mat2  descritores anos finaisAtiv mat2  descritores anos finais
Ativ mat2 descritores anos finaisEdileusa Camargo
 
A história dos números
A história dos númerosA história dos números
A história dos númerosleilamaluf
 
Screen space reflection
Screen space reflectionScreen space reflection
Screen space reflectionBongseok Cho
 
언차티드4 테크아트 파트1 톤맵핑&색보정
언차티드4 테크아트 파트1 톤맵핑&색보정언차티드4 테크아트 파트1 톤맵핑&색보정
언차티드4 테크아트 파트1 톤맵핑&색보정Dae Hyek KIM
 
Recent Advances in Computer Vision
Recent Advances in Computer VisionRecent Advances in Computer Vision
Recent Advances in Computer Visionantiw
 
Questões com razão e proporção
Questões com razão e proporção Questões com razão e proporção
Questões com razão e proporção Alexandro C. Marins
 
Prova de matemática Colégio Militar 2012 6ºano
Prova de matemática Colégio Militar 2012   6ºanoProva de matemática Colégio Militar 2012   6ºano
Prova de matemática Colégio Militar 2012 6ºanoElias de Lima Neto
 
노말 맵핑(Normal mapping)
노말 맵핑(Normal mapping)노말 맵핑(Normal mapping)
노말 맵핑(Normal mapping)QooJuice
 
Es i agregati
Es i agregatiEs i agregati
Es i agregatijuteam
 
âNgulos opostos pelo vértice
âNgulos opostos pelo vérticeâNgulos opostos pelo vértice
âNgulos opostos pelo vérticewarepic
 
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들Dae Hyek KIM
 
이재훈 개발 포트폴리오.pdf
이재훈 개발 포트폴리오.pdf이재훈 개발 포트폴리오.pdf
이재훈 개발 포트폴리오.pdfjaehoon lee
 
Matemática – principio fundamental da contagem 01 – 2013
Matemática – principio fundamental da contagem 01 – 2013Matemática – principio fundamental da contagem 01 – 2013
Matemática – principio fundamental da contagem 01 – 2013Jakson Raphael Pereira Barbosa
 
Brdf기반 사전정의 스킨 셰이더
Brdf기반 사전정의 스킨 셰이더Brdf기반 사전정의 스킨 셰이더
Brdf기반 사전정의 스킨 셰이더동석 김
 

What's hot (20)

[1023 박민수] 깊이_버퍼_그림자
[1023 박민수] 깊이_버퍼_그림자[1023 박민수] 깊이_버퍼_그림자
[1023 박민수] 깊이_버퍼_그림자
 
Arte e Matemática
Arte e MatemáticaArte e Matemática
Arte e Matemática
 
Ativ mat2 descritores anos finais
Ativ mat2  descritores anos finaisAtiv mat2  descritores anos finais
Ativ mat2 descritores anos finais
 
A história dos números
A história dos númerosA história dos números
A história dos números
 
Screen space reflection
Screen space reflectionScreen space reflection
Screen space reflection
 
언차티드4 테크아트 파트1 톤맵핑&색보정
언차티드4 테크아트 파트1 톤맵핑&색보정언차티드4 테크아트 파트1 톤맵핑&색보정
언차티드4 테크아트 파트1 톤맵핑&색보정
 
Recent Advances in Computer Vision
Recent Advances in Computer VisionRecent Advances in Computer Vision
Recent Advances in Computer Vision
 
Questões com razão e proporção
Questões com razão e proporção Questões com razão e proporção
Questões com razão e proporção
 
Prova de matemática Colégio Militar 2012 6ºano
Prova de matemática Colégio Militar 2012   6ºanoProva de matemática Colégio Militar 2012   6ºano
Prova de matemática Colégio Militar 2012 6ºano
 
Geometria
GeometriaGeometria
Geometria
 
노말 맵핑(Normal mapping)
노말 맵핑(Normal mapping)노말 맵핑(Normal mapping)
노말 맵핑(Normal mapping)
 
Motion blur
Motion blurMotion blur
Motion blur
 
Es i agregati
Es i agregatiEs i agregati
Es i agregati
 
âNgulos opostos pelo vértice
âNgulos opostos pelo vérticeâNgulos opostos pelo vértice
âNgulos opostos pelo vértice
 
Ndc11 이창희_hdr
Ndc11 이창희_hdrNdc11 이창희_hdr
Ndc11 이창희_hdr
 
Matemática e arte
Matemática e arteMatemática e arte
Matemática e arte
 
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들
Ndc17 - 차세대 게임이펙트를 위해 알야아할 기법들
 
이재훈 개발 포트폴리오.pdf
이재훈 개발 포트폴리오.pdf이재훈 개발 포트폴리오.pdf
이재훈 개발 포트폴리오.pdf
 
Matemática – principio fundamental da contagem 01 – 2013
Matemática – principio fundamental da contagem 01 – 2013Matemática – principio fundamental da contagem 01 – 2013
Matemática – principio fundamental da contagem 01 – 2013
 
Brdf기반 사전정의 스킨 셰이더
Brdf기반 사전정의 스킨 셰이더Brdf기반 사전정의 스킨 셰이더
Brdf기반 사전정의 스킨 셰이더
 

Similar to Introduction to SparkR

Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkROlgun Aydın
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Databricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLphanleson
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 

Similar to Introduction to SparkR (20)

Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Spark core
Spark coreSpark core
Spark core
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Learning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQLLearning spark ch09 - Spark SQL
Learning spark ch09 - Spark SQL
 
Spark sql
Spark sqlSpark sql
Spark sql
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 

Recently uploaded

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

Introduction to SparkR

  • 1. Introduction to SparkR OLGUN AYDIN – LEAD ANALYST olgun.aydin@zingat.com info@olgunaydin.com
  • 2. Outline u Installation and Creating a SparkContext u Getting Data u SQL queries in SparkR u DataFrames u Application Examples u Correlation Analysis u Time Series Analysis u K-Means
  • 3. Power of u Fast u Powerful u Scalable
  • 4. Power of u Effective u Number of Packages u One of the Most prefered language for statistical analysis
  • 6. u Effective u Powerful u Statiscal Power u Fast u Scalable +
  • 7. u SparkR, an R package that provides a frontend to Apache Spark and uses Spark’s distributed computation engine to enable large scale data analysis from the R Shell.
  • 8. u Data analysis using R is limited by the amount of memory available on a single machine and further as R is single threaded it is often impractical to use R on large datasets. u R programs can be scaled while making it easy to use and deploy across a number of workloads. SparkR: an R frontend for Apache Spark, a widely deployed cluster computing engine. There are a number of benefits to designing an R frontend that is tightly integrated with Spark.
  • 9. u SparkR is built as an R package and requires no changes to R. The central component of SparkR is a distributed data frame that enables structured data processing with a syntax familiar to R users. u To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution. u SparkR was initially developed at the AMPLab, UC Berkeley and has been a part of the Apache Spark project for the past eight months.
  • 10.
  • 11. u The central component of SparkR is a distributed data frame implemented on top of Spark. u SparkR DataFrames have an API similar to dplyr or local R data frames, but scale to large datasets using Spark’s execution engine and relational query optimizer. u SparkR’s read.df method integrates with Spark’s data source API and this enables users to load data from systems like HBase, Cassandra etc. Having loaded the data, users are then able to use a familiar syntax for performing relational operations like selections, projections, aggregations and joins.
  • 12. u Further, SparkR supports more than 100 pre-defined functions on DataFrames including string manipulation methods, statistical functions and date-time operations. Users can also execute SQL queries directly on SparkR DataFrames using the sql command. SparkR also makes it easy for users to chain commands using existing R libraries. u Finally, SparkR DataFrames can be converted to a local R data frame using the collect operator and this is useful for the big data, small learning scenarios described earlier
  • 13. u SparkR’s architecture consists of two main components an R to JVM binding on the driver that allows R programs to submit jobs to a Spark cluster and support for running R on the Spark executors.
  • 14. Installation and Creating a SparkContext u Step 1: Download Spark u http://spark.apache.org/
  • 15. Installation and Creating a SparkContext u Step 1: Download Spark http://spark.apache.org/ u Step 2: Run in Command Prompt Now start your favorite command shell and change directory to your Spark folder u Step 3: Run in RStudio Set System Environment. Once you have opened RStudio, you need to set the system environment first. You have to point your R session to the installed version of SparkR. Use the code shown in Figure 11 below but replace the SPARK_HOME variable using the path to your Spark folder. “C:/Apache/Spark-1.4.1″.
  • 16. Installation and Creating a SparkContext u Step 4: Set the Library Paths You have to set the library path for Spark u Step 5: Load SparkR library Load SparkR library by using library(SparkR) u Step 6: Initialize Spark Context and SQL Context sc<-sparkR.init(master=‘local’) sqlcontext<-sparkRSQL.init(sc)
  • 17. Getting Data u From local data frames u The simplest way to create a data frame is to convert a local R data frame into a SparkR DataFrame. Specifically we can use createDataFrame and pass in the local R data frame to create a SparkR DataFrame. As an example, the following creates a DataFrame based using the faithful dataset from R.
  • 18. Getting Data u From Data Sources u SparkR supports operating on a variety of data sources through the DataFrame interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more specific options that are available for the built-in data sources. u The general method for creating DataFrames from data sources is read.df. u This method takes in the SQLContext, the path for the file to load and the type of data source. u SparkR supports reading JSON and Parquet files natively and through Spark Packages you can find data source connectors for popular file formats like CSV and Avro.
  • 19. Getting Data u We can see how to use data sources using an example JSON input file. Note that the file that is used here is not a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object.
  • 20. Getting Data u From Hive tables u You can also create SparkR DataFrames from Hive tables. To do this we will need to create a HiveContext which can access tables in the Hive MetaStore. Note that Spark should have been built with Hive support and more details on the difference between SQLContext and HiveContext can be found in the SQL programming guide.
  • 21. SQL queries in SparkR u A SparkR DataFrame can also be registered as a temporary table in Spark SQL and registering a DataFrame as a table allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
  • 22. DataFrames u SparkR DataFrames support a number of functions to do structured data processing. Here we include some basic examples and a complete list can be found in the API docs.
  • 23. DataFrames u SparkR data frames support a number of commonly used functions to aggregate data after grouping. For example we can compute a histogram of the waiting time in the faithful dataset as shown below
  • 24. DataFrames u SparkR also provides a number of functions that can directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.