Dr. Ahmet Bulut

Computer Science Department

Istanbul Sehir University
email: ahmetbulut@sehir.edu.tr
Nose Dive into Apache Spark ML
DataFrame
The reason for putting the data on more than one computer should be intuitive: either the
data is too large to fit on one machine, or it would simply take too long to perform
the computation on one machine.
•A DataFrame is a distributed collection of data organized into
named columns.
•It is conceptually equivalent to a table in a relational database or
a data frame in R/Python, but with richer optimizations under the
hood.
•You can load data from a variety of structured data sources, e.g.,
JSON and Parquet. 

Advanced Analytics
•Supervised learning, including classification and regression, to
predict a label for each data point based on various features.
•Recommendation engines to suggest products to users based on
behavior.
•Unsupervised learning, including clustering, anomaly detection,
and topic modeling to discover structure in the data.
•Graph analytics such as searching for patterns in a social network.
Supervised Learning
•Supervised learning is probably the most common type of
machine learning.
•The goal is simple: using historical data that already has labels
(often called the dependent variables), train a model to predict
the values of those labels based on various features of the data
points.
•Classification: predict a categorical variable. Examples:
- Predicting disease,
- Classifying images,
- Predicting customer churn,
- Predicting conversion (buy or won’t buy).
•Regression: predict a continuous variable (a real number). Examples:
- Predicting sales,
- Predicting height,
- Predicting the number of viewers of a show.
Machine Learning Workflow
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage the data in a
suitable form (e.g., converting the data to numerical vectors).
4. Using a portion of this data as a training set to train one or more algorithms to
generate some candidate models.
5. Evaluating and comparing models against your success criteria by objectively
measuring results on a subset of the same data that was not used for training.
6. Leveraging the insights from the above process and/or using the model to make
predictions, detect anomalies, or solve more general business challenges.
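Steps 4 and 5 depend on holding some data back from training. A minimal plain-Python sketch of that idea follows, assuming a simple shuffle-and-cut strategy. The function name `train_test_split`, the fixed seed, and the toy data are illustrative assumptions, not part of Spark.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))                      # stand-in for real data points
train, test = train_test_split(data)
print(len(train), len(test))                # prints: 8 2
```

Because the two lists never overlap, a model fitted on `train` can be scored on `test` without having seen those rows.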
ML Workflow in Spark
Transformers
•Transformers are functions that convert raw data in some way. This
might be to create a new interaction variable (from two other
variables), normalize a column, or simply change an Integer into a
Double type to be input into a model.
•Transformers take a DataFrame as input and produce a new
DataFrame as output.
Estimators
•Algorithms that allow users to train a model from data are
referred to as estimators.
Evaluators
•An evaluator allows us to see how a given model performs
according to criteria we specify, such as a receiver operating
characteristic (ROC) curve.
•We use an evaluator to select the best model among the
alternatives. The best model is then used to make predictions.
Pipeline
•From a high level we can specify each of the transformations,
estimations, and evaluations one by one, but it is often easier to
specify our steps as stages in a pipeline.
•This pipeline is similar to scikit-learn’s pipeline concept.
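The roles of transformer, estimator, and pipeline can be illustrated with a small plain-Python analogy. This is only a sketch of the pattern, not Spark's API: rows are plain dictionaries rather than a DataFrame, and the class names (`AddRatio`, `MeanCenterer`, and so on) and the `views`/`clicks` columns are hypothetical.

```python
class AddRatio:
    """Transformer: adds a 'ratio' column computed from two existing columns."""
    def transform(self, rows):
        return [dict(r, ratio=r["clicks"] / r["views"]) for r in rows]

class MeanCenterer:
    """Estimator: learns the mean of a column; fit() returns a transformer."""
    def __init__(self, column):
        self.column = column
    def fit(self, rows):
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCentererModel(self.column, mean)

class MeanCentererModel:
    """Transformer produced by MeanCenterer.fit()."""
    def __init__(self, column, mean):
        self.column, self.mean = column, mean
    def transform(self, rows):
        out = self.column + "_centered"
        return [dict(r, **{out: r[self.column] - self.mean}) for r in rows]

class Pipeline:
    """Runs stages in order; each estimator is fitted on the data seen so far."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)

class PipelineModel:
    """Transformer that applies every fitted stage in sequence."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

rows = [{"views": 10, "clicks": 2}, {"views": 20, "clicks": 8}]
model = Pipeline([AddRatio(), MeanCenterer("ratio")]).fit(rows)
result = model.transform(rows)  # each row gains 'ratio' and 'ratio_centered'
```

Each stage returns new rows instead of modifying its input, mirroring the way a transformer produces a new DataFrame. The pipeline itself behaves like an estimator, and the object it returns behaves like a transformer.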
Collaborative Filtering
•Collaborative filtering is commonly used for recommender
systems.
•The aim is to fill in the missing entries of a user-item association
(preference, score, …) matrix.
•Users and products are described by a small set of latent factors
that can be used to predict missing entries.
•The alternating least squares (ALS) algorithm is used to learn the
latent factors.
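To make the alternating idea concrete, here is a minimal plain-Python sketch restricted to a single latent factor per user and per item. It is an illustration under simplifying assumptions, not Spark's implementation: the function name `als_rank1`, the toy ratings, and the parameter defaults are all hypothetical.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=30):
    """Fit one latent factor per user and per item by alternating least squares.

    ratings: dict mapping (user, item) -> observed rating.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor then has a closed form.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in obs)
            den = reg + sum(item_f[i] ** 2 for i, _ in obs)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            item_f[i] = num / den
    return user_f, item_f

# Toy matrix [[1, 2, 3], [2, 4, ?]]: the rating of user 1 for item 2 is missing.
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
user_f, item_f = als_rank1(observed, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]  # estimate for the missing entry
```

Since user 1's known ratings are twice user 0's, the estimate for the missing entry lands close to 6, slightly shrunk by the regularization term `reg`.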
Ratings Dataset
•Ratings data could consist of explicit ratings given by users, or
they could be derived.
•In general, we could work with two types of user feedback:
(1) Explicit Feedback
(2) Implicit Feedback
Rating Data
•Explicit Feedback:
- The score entries in the user-item matrix are explicit preferences
given by users to items.
•Implicit Feedback:
- It is common in many real-world use cases to only have access to
implicit feedback (e.g., total views, total clicks, total purchases,
total likes, total shares).
- Using such aggregate statistics, we could compute scores.
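One possible way to turn aggregate statistics into a score is a weighted sum of event counts, damped with a logarithm. The sketch below is purely illustrative: the event names, the weights, and the `implicit_score` function are assumptions chosen for the example, not a formula from the slides or from Spark.

```python
import math

# Hypothetical weights: a purchase is assumed to signal more interest than a view.
WEIGHTS = {"views": 1.0, "clicks": 2.0, "purchases": 5.0}

def implicit_score(counts, weights=WEIGHTS):
    """Collapse per-event totals into one score; log1p damps very large counts."""
    weighted = sum(weights[event] * counts.get(event, 0) for event in weights)
    return math.log1p(weighted)

# 12 views, 3 clicks, 1 purchase -> weighted total 12 + 6 + 5 = 23
score = implicit_score({"views": 12, "clicks": 3, "purchases": 1})
```

A user-item pair with no recorded events scores 0, so it can be treated as a missing entry. Spark's ALS estimator also exposes an `implicitPrefs` parameter for training directly on implicit feedback.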
Training Dataset
•The ratings in our dataset are in the following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040.
- MovieIDs range between 1 and 3952.
- Ratings are made on a 5-star scale (whole-star ratings only).
- Timestamps are represented in seconds since the epoch, as returned by time(2).
- Each user has at least 20 ratings.
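The record format above can be parsed with a few lines of plain Python. This is a minimal sketch: the `Rating` type, the `parse_rating` helper, and the sample line are illustrative and made up for this example.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int

def parse_rating(line: str) -> Rating:
    """Split one 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

# Made-up sample line in the documented format (not taken from the dataset).
sample = parse_rating("42::1200::4::978300000")
print(sample.user_id, sample.rating)  # prints: 42 4.0
```

The rating is stored as a float even though the dataset uses whole stars, because the predictions compared against it later are real numbers.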
Data Wrangling
>>> from pyspark.sql import Row, SQLContext
>>> from pyspark.ml.recommendation import ALS
>>> sqlContext = SQLContext(sc)  # sc: an existing SparkContext
>>> # rdd: an RDD of raw "UserID::MovieID::Rating::Timestamp" lines
>>> parts = rdd.map(lambda row: row.split("::"))
>>> ratingsRDD = parts.map(lambda p: Row(
...     userId=int(p[0]), movieId=int(p[1]),
...     rating=float(p[2]), timestamp=int(p[3])))  # int replaces Python 2's long
>>> ratings = sqlContext.createDataFrame(ratingsRDD)
>>> (training, test) = ratings.randomSplit([0.8, 0.2])
We will use 80% of our dataset for training and 20% for testing.
DataFrame for Training
>>> training.limit(2).show()
+-------+------+---------+------+
|movieId|rating|timestamp|userId|
+-------+------+---------+------+
|      1|   1.0|974643004|  2534|
|      1|   1.0|974785296|  1314|
+-------+------+---------+------+
Model Building
•Use the ALS "Estimator" to "fit" a model on the training dataset.
>>> als = ALS(maxIter=5, regParam=0.01, userCol="userId",
...           itemCol="movieId", ratingCol="rating")
>>> model = als.fit(training)
Testing
•Use the learnt model, which is a "Transformer", to "predict" a
named column value for test instances.
>>> model.transform(test.limit(2)).show()
+-------+------+---------+------+----------+
|movieId|rating|timestamp|userId|prediction|
+-------+------+---------+------+----------+
|      1|   1.0|974675906|  2015| 3.5993457|
|      1|   1.0|973215902|  2744| 1.4472415|
+-------+------+---------+------+----------+
The "prediction" column is the new column added by the transformer.
Estimation Error
•Let's compute the error we made in our predictions.
Root Mean Squared Error (RMSE):
- The square root of the average of the squared errors.
- RMSE is commonly used and makes an excellent general-purpose
error metric for numerical predictions.
- Compared to the similar Mean Absolute Error, RMSE amplifies
and severely punishes large errors.
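The difference between the two metrics shows up on a small example. The plain-Python sketch below uses made-up ratings and two hypothetical prediction lists with the same total absolute error: one concentrates the error in a single large miss, the other spreads it evenly.

```python
from math import sqrt

def rmse(predicted, actual):
    """Square root of the mean of the squared errors."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))

def mae(predicted, actual):
    """Mean of the absolute errors."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

actual = [3.0, 4.0, 1.0, 5.0]
one_big_miss = [3.0, 4.0, 5.0, 5.0]   # errors: 0, 0, 4, 0
evenly_off = [4.0, 5.0, 2.0, 6.0]     # errors: 1, 1, 1, 1

print(mae(one_big_miss, actual), rmse(one_big_miss, actual))  # prints: 1.0 2.0
print(mae(evenly_off, actual), rmse(evenly_off, actual))      # prints: 1.0 1.0
```

Both prediction lists have the same MAE, but the single large miss doubles the RMSE.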
Estimation Error
>>> from math import sqrt
>>> from pyspark.sql.functions import sum, isnan  # note: shadows the built-in sum
>>> predictions = model.transform(test)
>>> # squared error for each test rating
>>> df1 = predictions.select(
...     ((predictions.prediction - predictions.rating) ** 2).alias("error"))
>>> # drop NaN predictions (users or movies absent from the training split)
>>> df1 = df1.filter(~isnan(df1.error))
>>> print("RMSE:", sqrt(
...     df1.select(sum("error").alias("error")).collect()[0].error / df1.count()))


  • 1. Dr. Ahmet Bulut
 Computer Science Department
 Istanbul Sehir University
 email: ahmetbulut@sehir.edu.tr
 Nose Dive into Apache Spark ML
  • 2. DataFrame The reason for putting the data on more than one computer should be intuitive: either the data is too large to fit on one machine or it would simply take too long to perform that computation on one machine.
  • 3. DataFrame •A DataFrame is a distributed collection of data organized into named columns. •It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. •You can load data from a variety of structured data sources, e.g., JSON and Parquet. 

  • 4. Advanced Analytics •Supervised learning, including classification and regression, to predict a label for each data point based on various features. •Recommendation engines to suggest products to users based on behavior. •Unsupervised learning, including clustering, anomaly detection, and topic modeling to discover structure in the data. •Graph analytics such as searching for patterns in a social network.
  • 5. Supervised Learning •Supervised learning is probably the most common type of machine learning. •The goal is simple: using historical data that already has labels (often called the dependent variables), train a model to predict the values of those labels based on various features of the data points.
  • 6. Supervised Learning •Classification: predict a categorical variable. •Regression: predict a continuous variable (a real number)
  • 7. Supervised Learning •Classification
 - Predicting disease,
 - Classifying images,
 - Predicting customer churn,
 - Buy or won’t buy (predicting conversion).
 •Regression
 - Predicting sales,
 - Predicting height,
 - Predicting the number of viewers of a show.
  • 9. Machine Learning Workflow
 1. Gathering and collecting the relevant data for your task.
 2. Cleaning and inspecting the data to better understand it.
 3. Performing feature engineering to allow the algorithm to leverage the data in a suitable form (e.g., converting the data to numerical vectors).
 4. Using a portion of this data as a training set to train one or more algorithms to generate some candidate models.
 5. Evaluating and comparing models against your success criteria by objectively measuring results on a subset of the same data that was not used for training.
 6. Leveraging the insights from the above process and/or using the model to make predictions, detect anomalies, or solve more general business challenges.
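Steps 4 and 5 of the workflow can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data, the `train_test_split` and `fit_mean_model` helpers, and the mean-predictor "model" are all hypothetical and are not part of the slides or of Spark.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean_model(train_labels):
    """'Train' a trivial baseline that always predicts the mean training label."""
    mean = sum(train_labels) / float(len(train_labels))
    return lambda features: mean

# Hypothetical toy data: (feature, label) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(10)]
train, test = train_test_split(data)
model = fit_mean_model([label for _, label in train])

# Evaluate only on rows that were not used for training.
test_mae = sum(abs(model(x) - y) for x, y in test) / float(len(test))
```

The point of the sketch is the separation: the model sees only `train`, and the error is measured only on `test`.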
  • 12. Transformers •Transformers are functions that convert raw data in some way. This might be to create a new interaction variable (from two other variables), normalize a column, or simply change an Integer into a Double type to be input into a model. •Transformers take a DataFrame as input and produce a new DataFrame as output.
  • 13. Estimators •Algorithms that allow users to train a model from data are referred to as estimators.
  • 14. Evaluator •An evaluator allows us to see how a given model performs according to criteria we specify, such as a receiver operating characteristic (ROC) curve. •We use an evaluator in order to select the best model among the alternatives. The best model is then used to make predictions.
  • 15. Pipeline •From a high level we can specify each of the transformations, estimations, and evaluations one by one, but it is often easier to specify our steps as stages in a pipeline. •This pipeline is similar to scikit-learn’s pipeline concept.
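The relationship between transformers, estimators, and pipeline stages can be mimicked with a small plain-Python toy. This is a sketch of the concept only, not Spark's API: the classes `Scaler`, `MeanEstimator`, `MeanModel`, and `ToyPipeline` are hypothetical names, and rows are plain dictionaries rather than a DataFrame.

```python
class Scaler(object):
    """Toy transformer: returns new rows with one column multiplied by a constant."""
    def __init__(self, column, factor):
        self.column, self.factor = column, factor

    def transform(self, rows):
        return [dict(r, **{self.column: r[self.column] * self.factor}) for r in rows]


class MeanModel(object):
    """Toy fitted model; it is itself a transformer that adds a 'prediction' column."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, prediction=self.mean) for r in rows]


class MeanEstimator(object):
    """Toy estimator: fit() learns the mean label and returns a MeanModel."""
    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        labels = [r[self.label_col] for r in rows]
        return MeanModel(sum(labels) / float(len(labels)))


class ToyPipeline(object):
    """Applies stages in order; estimator stages are fitted and replaced by their model."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> fitted model (a transformer)
            rows = stage.transform(rows)
            fitted.append(stage)
        return fitted


rows = [{"x": 1.0, "label": 2.0}, {"x": 3.0, "label": 4.0}]
fitted_stages = ToyPipeline([Scaler("x", 10.0), MeanEstimator("label")]).fit(rows)

# Applying the fitted stages in order transforms the input and appends predictions.
out = rows
for stage in fitted_stages:
    out = stage.transform(out)
```

As in the slides, every stage takes rows in and hands new rows out; the input list is never modified in place.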
  • 16. Collaborative Filtering •Collaborative filtering is commonly used for recommender systems. •The aim is to fill in the missing entries of a user-item association (preference, score, …) matrix. •Users and products are described by a small set of latent factors that can be used to predict missing entries. •The alternating least squares (ALS) algorithm is used to learn the latent factors.
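To make the alternating idea concrete, here is a minimal plain-Python sketch of ALS with a single latent factor per user and per item. It is a simplification for illustration, not Spark's implementation: the function name `als_rank1`, the regularization value, and the toy ratings are assumptions. With one factor, each least-squares subproblem has a closed-form scalar solution.

```python
def als_rank1(ratings, n_iters=20, reg=0.1):
    """ratings: dict mapping (user, item) -> observed score. Returns (u, v) factor dicts."""
    users = sorted({user for user, _ in ratings})
    items = sorted({item for _, item in ratings})
    u = {user: 1.0 for user in users}
    v = {item: 1.0 for item in items}
    for _ in range(n_iters):
        # Fix the item factors and solve the regularized least-squares problem per user.
        for user in users:
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == user]
            u[user] = sum(r * v[i] for i, r in obs) / (reg + sum(v[i] ** 2 for i, _ in obs))
        # Fix the user factors and solve per item.
        for item in items:
            obs = [(uu, r) for (uu, i), r in ratings.items() if i == item]
            v[item] = sum(r * u[uu] for uu, r in obs) / (reg + sum(u[uu] ** 2 for uu, _ in obs))
    return u, v


# Made-up ratings; the entry for ("b", "y") is missing and gets predicted.
ratings = {("a", "x"): 5.0, ("a", "y"): 1.0, ("b", "x"): 4.0}
u, v = als_rank1(ratings)
predicted_b_y = u["b"] * v["y"]
predicted_b_x = u["b"] * v["x"]
```

The missing entry is filled in as the product of the learned user and item factors, which is the same principle the full algorithm applies with vectors of several factors.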
  • 17. Ratings Dataset •Ratings data could consist of explicit ratings given by users, or they could be derived. •In general, we could work with two types of user feedback:
 
 (1) Explicit Feedback
 
 (2) Implicit Feedback
  • 18. Rating Data •Explicit Feedback: 
 
 - The score entries in the user-item matrix are explicit preferences given by users to items.
  • 19. Rating Data •Implicit Feedback: 
 
 - It is common in many real-world use cases to only have access to implicit feedback (e.g. total views, total clicks, total purchases, total likes, total shares etc.). 
 
 - Using such aggregate statistics, we could compute scores.
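One way such scores could be computed is sketched below. The weights, the `implicit_score` helper, and the interaction counts are all hypothetical choices made for illustration; the slides do not prescribe a particular formula.

```python
import math

# Hypothetical weights: how strongly each kind of interaction signals preference.
WEIGHTS = {"views": 1.0, "clicks": 2.0, "purchases": 5.0}

def implicit_score(counts, weights=WEIGHTS):
    """Combine interaction counts into one score; log1p dampens very large counts."""
    raw = sum(weights[kind] * counts.get(kind, 0) for kind in weights)
    return math.log1p(raw)

# Made-up aggregate statistics per (user, item) pair.
interactions = {
    ("u1", "m1"): {"views": 10, "clicks": 2},
    ("u1", "m2"): {"views": 1},
    ("u2", "m1"): {"views": 3, "purchases": 1},
}
scores = {pair: implicit_score(counts) for pair, counts in interactions.items()}
```

Spark's ALS also exposes an `implicitPrefs` option for this kind of data, in which case the raw counts are treated as confidence in a preference rather than as explicit ratings.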
  • 21. Model Building [Figure; labels: userId, movieId]
  • 22. Training Dataset •The ratings in our dataset are in the following format:
 
 UserID::MovieID::Rating::Timestamp
 
 — UserIDs range between 1 and 6040.
 — MovieIDs range between 1 and 3952.
 — Ratings are made on a 5-star scale (whole-star ratings only).
 — Timestamps are represented in seconds since the epoch, as returned by time(2).
 — Each user has at least 20 ratings.
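A single line in this format can be parsed with plain Python as sketched below. The `Rating` tuple, the `parse_rating` helper, and the sample line are illustrative assumptions; the sample values are made up and are not taken from the dataset.

```python
from collections import namedtuple

Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a Rating."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

# Made-up sample line in the documented format, for illustration only.
sample = parse_rating("12::345::4::974643004\n")
```

The rating is stored as a float even though the dataset only contains whole stars, because the model predicts real-valued scores.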
  • 23. Data Wrangling
 >>> from pyspark.sql import SQLContext
 >>> sqlContext = SQLContext(sc)
 >>> from pyspark.ml.recommendation import ALS
 >>> from pyspark.sql import Row
 >>> # rdd is assumed to be an RDD of raw "UserID::MovieID::Rating::Timestamp" lines
 >>> parts = rdd.map(lambda row: row.split("::"))
 >>> ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
 ...     rating=float(p[2]), timestamp=int(p[3])))
 >>> ratings = sqlContext.createDataFrame(ratingsRDD)
 >>> (training, test) = ratings.randomSplit([0.8, 0.2])
  • 24. Data Wrangling We will use 80% of our dataset for training, and 20% for testing: the randomSplit([0.8, 0.2]) call in the code above returns the (training, test) pair.
  • 25. DataFrame for Training
 >>> training.limit(2).show()
 +-------+------+---------+------+
 |movieId|rating|timestamp|userId|
 +-------+------+---------+------+
 |      1|   1.0|974643004|  2534|
 |      1|   1.0|974785296|  1314|
 +-------+------+---------+------+
  • 26. Model Building •Use the ALS "Estimator" to "fit" a model on the training dataset. 
 
 >>> als = ALS(maxIter=5,regParam=0.01,userCol="userId", itemCol="movieId",ratingCol="rating")
 
 >>> model = als.fit(training)
  • 27. Testing •Use the learnt model, which is a "Transformer", to "predict" a named column value for test instances.
 
 >>> model.transform(test.limit(2)).show()
 +-------+------+---------+------+----------+
 |movieId|rating|timestamp|userId|prediction|
 +-------+------+---------+------+----------+
 |      1|   1.0|974675906|  2015| 3.5993457|
 |      1|   1.0|973215902|  2744| 1.4472415|
 +-------+------+---------+------+----------+
 The prediction column is the new column added by the transformer.
  • 28. Estimation Error •Let's compute the error we made in our predictions.
 
 Root Mean Squared Error (RMSE):
 - The square root of the average of the squared errors.
 - The use of RMSE is common and it makes an excellent general purpose error metric for numerical predictions.
 - Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.
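The difference between the two metrics shows up in a small plain-Python comparison. The `rmse` and `mae` helpers and the two error lists are made up for illustration; both lists have the same total absolute error, but one concentrates it in a single large miss.

```python
from math import sqrt

def rmse(errors):
    """Root mean squared error: square root of the mean of the squared errors."""
    return sqrt(sum(e ** 2 for e in errors) / float(len(errors)))

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / float(len(errors))

# Two made-up error lists with the same total absolute error (4.0).
even_errors = [1.0, 1.0, 1.0, 1.0]
one_big_error = [0.0, 0.0, 0.0, 4.0]
```

Both lists have an MAE of 1.0, while the RMSE rises from 1.0 for `even_errors` to 2.0 for `one_big_error`, which is the amplification of large errors described above.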
  • 29. Estimation Error
 >>> from math import sqrt
 >>> from pyspark.sql.functions import sum, isnan
 >>> predictions = model.transform(test)
 >>> df1 = predictions.select(((predictions.prediction - predictions.rating)**2).alias("error"))
 >>> # Drop NaN predictions (users or movies not seen during training)
 >>> df1 = df1.filter(~isnan(df1.error))
 >>> print("RMSE: %s" % sqrt(df1.select(sum("error").alias("error")).collect()[0].error / df1.count()))