SlideShare a Scribd company logo
1 of 32
Machine Learning with
SparkR
OLGUN AYDIN
SENIOR DATA SCIENTIST
olgun_aydin@epam.com
info@olgunaydin.com
About me
 BsC. And MsC. degree from Statistics
 6 years experienced Data Scientist
 6 years experience of R
 Love to use R, SparkR and Shiny
 Organizer of PyData Istanbul
 Co-organizer of Istanbul Spark Meetup
 Co-organizer of Trojimasto Spark Meetup
github.com/olgnaydn/R
www.linkedin.com/in/olgun-aydin/
twitter.com/olgunaydinn
https://www.packtpub.com/books/info/authors/olgun-aydin
Outline
 Introduction to Machine Learning
 SparkR
 Getting Data
 DataFrames
 Applications
Introduction to Machine Learning
 Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data, without
being explicitly programmed. (Wikipedia)
 Machine learning is closely related to (and often overlaps with)
computational statistics, which also focuses on prediction-making through
the use of computers. It has strong ties to mathematical optimization,
which delivers methods, theory and application domains to the field.
Introduction to Machine Learning
 DeepMind developed an agent that surpassed human-level
performance at 49 Atari games, receiving only the pixels and game
score as inputs.
 Soon after, in 2016, DeepMind obsoleted their own this achievement
by releasing a new state-of-the-art gameplay method called A3C.
 Meanwhile, AlphaGo defeated one of the best human players at
Go—an extraordinary achievement in a game dominated by humans
for two decades after machines first conquered chess.
Introduction to Machine Learning
Introduction to Machine Learning
2
1
3
Examples for Real Life Applications
Internet Search
• Google, Bing, Yahoo, Ask
• Better results with data science algorithms
Recomendation
Systems
• Netflix, Amazon, Alibaba
Predictions
Systems
• Image recognition, speech recognition
• Fraud and Risk detection,Self driving cars, robots
Examples for Real Life Applications
Power of
 Fast
 Powerful
 Scalable
Power of
 Effective
 Number of Packages
 One of the Most prefered language
for statistical analysis
Power of
Power of
 Effective
 Powerful
 Statiscal Power
 Fast
 Scalable
+
 SparkR provides a frontend to Apache Spark and uses Spark’s distributed
computation engine to enable large scale data analysis from the R Shell.
 Data analysis using R is limited by the amount of memory available on a
single machine and further as R is single threaded it is often impractical to
use R on large datasets.
 R programs can be scaled while making it easy to use and deploy across a
number of workloads. SparkR: an R frontend for Apache Spark, a widely
deployed cluster computing engine. There are a number of benefits to
designing an R frontend that is tightly integrated with Spark.
 SparkR requires no changes to R. The central component of SparkR is a
distributed data frame that enables structured data processing with a
syntax familiar to R users.
 To improve performance over large datasets, SparkR performs lazy
evaluation on data frame operations and uses Spark’s relational query
optimizer to optimize execution.
 SparkR was initially developed at the AMPLab, UC Berkeley and has been a
part of the Apache Spark.
 The central component of SparkR is a distributed data frame implemented
on top of Spark.
 SparkR DataFrames have an API similar to dplyr or local R data frames, but
scale to large datasets using Spark’s execution engine and relational query
optimizer.
 SparkR’s read.df method integrates with Spark’s data source API and this
enables users to load data from systems like HBase, Cassandra etc. Having
loaded the data, users are then able to use a familiar syntax for performing
relational operations like selections, projections, aggregations and joins.
 Further, SparkR supports more than 100 pre-defined functions on
DataFrames including string manipulation methods, statistical functions
and date-time operations. Users can also execute SQL queries directly on
SparkR DataFrames using the sql command. SparkR also makes it easy for
users to chain commands using existing R libraries.
 Finally, SparkR DataFrames can be converted to a local R data frame using
the collect operator and this is useful for the big data, small learning
scenarios described earlier
 SparkR’s architecture consists of two main components an R to JVM
binding on the driver that allows R programs to submit jobs to a Spark
cluster and support for running R on the Spark executors.
Installation and Creating a SparkContext
 Step 1: Download Spark
 http://spark.apache.org/
Installation and Creating a SparkContext
 Step 1: Download Spark
http://spark.apache.org/
 Step 2: Run in Command Prompt
Now start your favorite command shell and change directory to your Spark folder
 Step 3: Run in RStudio
Set System Environment. Once you have opened RStudio, you need to set the
system environment first. You have to point your R session to the installed version
of SparkR. Use the code shown in Figure 11 below but replace
the SPARK_HOME variable using the path to your Spark folder.
“C:/Apache/Spark-1.4.1″.
Getting Data
 From local data frames
 The simplest way to create a data frame is to convert a local R data frame
into a SparkR DataFrame. Specifically we can use createDataFrame and
pass in the local R data frame to create a SparkR DataFrame. As an
example, the following creates a DataFrame based using the faithful
dataset from R.
Getting Data
 From Data Sources
 SparkR supports operating on a variety of data sources through the DataFrame
interface. This section describes the general methods for loading and saving
data using Data Sources. You can check the Spark SQL programming guide for
more specific options that are available for the built-in data sources.
 The general method for creating DataFrames from data sources is read.df.
 This method takes in the SQLContext, the path for the file to load and the type
of data source.
 SparkR supports reading JSON and Parquet files natively and through Spark
Packages you can find data source connectors for popular file formats like CSV
and Avro.
Getting Data
 We can see how to use data sources using an example JSON input file.
Note that the file that is used here is not a typical JSON file. Each line in
the file must contain a separate, self-contained valid JSON object.
Getting Data
 From Hive tables
 You can also create SparkR DataFrames from Hive tables. To do this we will need to create
a HiveContext which can access tables in the Hive MetaStore. Note that Spark should have
been built with Hive support and more details on the difference between SQLContext and
HiveContext can be found in the SQL programming guide.
SQL queries in SparkR
 A SparkR DataFrame can also be registered as a temporary table in Spark SQL and
registering a DataFrame as a table allows you to run SQL queries over its data. The sql
function enables applications to run SQL queries programmatically and returns the result
as a DataFrame.
DataFrames
 SparkR DataFrames support a number of functions to do structured data processing.
Here we include some basic examples and a complete list can be found in the API docs.
DataFrames
 SparkR data frames support a number of commonly used functions to aggregate data
after grouping. For example we can compute a histogram of the waiting time in the
faithful dataset as shown below
DataFrames
 SparkR also provides a number of functions that can directly applied to columns for data
processing and during aggregation. The example below shows the use of basic
arithmetic functions.
Applications
Correlation Analysis
K-Means
Decision Trees

More Related Content

What's hot

SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Datasamuel shamiri
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...CitiusTech
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalystTakuya UESHIN
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduceGreg Hanchin
 
12 SQL On-Hadoop Tools
12 SQL On-Hadoop Tools12 SQL On-Hadoop Tools
12 SQL On-Hadoop ToolsXplenty
 
Modeling employees relationships with Apache Spark
Modeling employees relationships with Apache SparkModeling employees relationships with Apache Spark
Modeling employees relationships with Apache SparkWassim TRIFI
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
Big Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data WorldBig Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data WorldJongwook Woo
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the artStavros Kontopoulos
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introductiondewang_mistry
 
Azure Data Factory usage at Aucfanlab
Azure Data Factory usage at AucfanlabAzure Data Factory usage at Aucfanlab
Azure Data Factory usage at AucfanlabAucfan
 
Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfan
 

What's hot (20)

SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Data
 
Heart Proposal
Heart ProposalHeart Proposal
Heart Proposal
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
 
12 SQL On-Hadoop Tools
12 SQL On-Hadoop Tools12 SQL On-Hadoop Tools
12 SQL On-Hadoop Tools
 
Modeling employees relationships with Apache Spark
Modeling employees relationships with Apache SparkModeling employees relationships with Apache Spark
Modeling employees relationships with Apache Spark
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
Big Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data WorldBig Data Fundamentals in the Emerging New Data World
Big Data Fundamentals in the Emerging New Data World
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
projects_with_descriptions
projects_with_descriptionsprojects_with_descriptions
projects_with_descriptions
 
Splunk Architecture
Splunk ArchitectureSplunk Architecture
Splunk Architecture
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Hive
HiveHive
Hive
 
Azure Data Factory usage at Aucfanlab
Azure Data Factory usage at AucfanlabAzure Data Factory usage at Aucfanlab
Azure Data Factory usage at Aucfanlab
 
Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -
 
Big data analytics use case and software
Big data analytics use case and softwareBig data analytics use case and software
Big data analytics use case and software
 

Similar to Machine Learning with SparkR

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmodwaqasm86
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureJen Stirrup
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxKnoldus Inc.
 

Similar to Machine Learning with SparkR (20)

Introduction to SparkR
Introduction to SparkRIntroduction to SparkR
Introduction to SparkR
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmod
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Lighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in AzureLighting up Big Data Analytics with Apache Spark in Azure
Lighting up Big Data Analytics with Apache Spark in Azure
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Apache spark
Apache sparkApache spark
Apache spark
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Spark1
Spark1Spark1
Spark1
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 

Recently uploaded

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Recently uploaded (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Machine Learning with SparkR

  • 1. Machine Learning with SparkR OLGUN AYDIN SENIOR DATA SCIENTIST olgun_aydin@epam.com info@olgunaydin.com
  • 2. About me  BsC. And MsC. degree from Statistics  6 years experienced Data Scientist  6 years experience of R  Love to use R, SparkR and Shiny  Organizer of PyData Istanbul  Co-organizer of Istanbul Spark Meetup  Co-organizer of Trojimasto Spark Meetup github.com/olgnaydn/R www.linkedin.com/in/olgun-aydin/ twitter.com/olgunaydinn https://www.packtpub.com/books/info/authors/olgun-aydin
  • 3. Outline  Introduction to Machine Learning  SparkR  Getting Data  DataFrames  Applications
  • 4. Introduction to Machine Learning  Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed. (Wikipedia)  Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field.
  • 5. Introduction to Machine Learning  DeepMind developed an agent that surpassed human-level performance at 49 Atari games, receiving only the pixels and game score as inputs.  Soon after, in 2016, DeepMind obsoleted their own this achievement by releasing a new state-of-the-art gameplay method called A3C.  Meanwhile, AlphaGo defeated one of the best human players at Go—an extraordinary achievement in a game dominated by humans for two decades after machines first conquered chess.
  • 8. 2 1 3 Examples for Real Life Applications Internet Search • Google, Bing, Yahoo, Ask • Better results with data science algorithms Recomendation Systems • Netflix, Amazon, Alibaba Predictions Systems • Image recognition, speech recognition • Fraud and Risk detection,Self driving cars, robots
  • 9. Examples for Real Life Applications
  • 10. Power of  Fast  Powerful  Scalable
  • 11. Power of  Effective  Number of Packages  One of the Most prefered language for statistical analysis
  • 14.  Effective  Powerful  Statiscal Power  Fast  Scalable +
  • 15.  SparkR provides a frontend to Apache Spark and uses Spark’s distributed computation engine to enable large scale data analysis from the R Shell.
  • 16.  Data analysis using R is limited by the amount of memory available on a single machine and further as R is single threaded it is often impractical to use R on large datasets.  R programs can be scaled while making it easy to use and deploy across a number of workloads. SparkR: an R frontend for Apache Spark, a widely deployed cluster computing engine. There are a number of benefits to designing an R frontend that is tightly integrated with Spark.
  • 17.  SparkR requires no changes to R. The central component of SparkR is a distributed data frame that enables structured data processing with a syntax familiar to R users.  To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution.  SparkR was initially developed at the AMPLab, UC Berkeley and has been a part of the Apache Spark.
  • 18.  The central component of SparkR is a distributed data frame implemented on top of Spark.  SparkR DataFrames have an API similar to dplyr or local R data frames, but scale to large datasets using Spark’s execution engine and relational query optimizer.  SparkR’s read.df method integrates with Spark’s data source API and this enables users to load data from systems like HBase, Cassandra etc. Having loaded the data, users are then able to use a familiar syntax for performing relational operations like selections, projections, aggregations and joins.
  • 19.  Further, SparkR supports more than 100 pre-defined functions on DataFrames including string manipulation methods, statistical functions and date-time operations. Users can also execute SQL queries directly on SparkR DataFrames using the sql command. SparkR also makes it easy for users to chain commands using existing R libraries.  Finally, SparkR DataFrames can be converted to a local R data frame using the collect operator and this is useful for the big data, small learning scenarios described earlier
  • 20.
  • 21.  SparkR’s architecture consists of two main components an R to JVM binding on the driver that allows R programs to submit jobs to a Spark cluster and support for running R on the Spark executors.
  • 22. Installation and Creating a SparkContext  Step 1: Download Spark  http://spark.apache.org/
  • 23. Installation and Creating a SparkContext  Step 1: Download Spark http://spark.apache.org/  Step 2: Run in Command Prompt Now start your favorite command shell and change directory to your Spark folder  Step 3: Run in RStudio Set System Environment. Once you have opened RStudio, you need to set the system environment first. You have to point your R session to the installed version of SparkR. Use the code shown in Figure 11 below but replace the SPARK_HOME variable using the path to your Spark folder. “C:/Apache/Spark-1.4.1″.
  • 24. Getting Data  From local data frames  The simplest way to create a data frame is to convert a local R data frame into a SparkR DataFrame. Specifically we can use createDataFrame and pass in the local R data frame to create a SparkR DataFrame. As an example, the following creates a DataFrame based using the faithful dataset from R.
  • 25. Getting Data  From Data Sources  SparkR supports operating on a variety of data sources through the DataFrame interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more specific options that are available for the built-in data sources.  The general method for creating DataFrames from data sources is read.df.  This method takes in the SQLContext, the path for the file to load and the type of data source.  SparkR supports reading JSON and Parquet files natively and through Spark Packages you can find data source connectors for popular file formats like CSV and Avro.
  • 26. Getting Data  We can see how to use data sources using an example JSON input file. Note that the file that is used here is not a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object.
  • 27. Getting Data  From Hive tables  You can also create SparkR DataFrames from Hive tables. To do this we will need to create a HiveContext which can access tables in the Hive MetaStore. Note that Spark should have been built with Hive support and more details on the difference between SQLContext and HiveContext can be found in the SQL programming guide.
  • 28. SQL queries in SparkR  A SparkR DataFrame can also be registered as a temporary table in Spark SQL and registering a DataFrame as a table allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
  • 29. DataFrames  SparkR DataFrames support a number of functions to do structured data processing. Here we include some basic examples and a complete list can be found in the API docs.
  • 30. DataFrames  SparkR data frames support a number of commonly used functions to aggregate data after grouping. For example we can compute a histogram of the waiting time in the faithful dataset as shown below
  • 31. DataFrames  SparkR also provides a number of functions that can directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.