Dr. Ahmet Bulut

Computer Science Department

Istanbul Sehir University
email: ahmetbulut@sehir.edu.tr
Nose Dive into Apache Spark ML
DataFrame
The reason for putting the data on more than one computer should be intuitive: either the
data is too large to fit on one machine or it would simply take too long to perform
that computation on one machine.
DataFrame
•A DataFrame is a distributed collection of data organized into
named columns.
•It is conceptually equivalent to a table in a relational database or
a data frame in R/Python, but with richer optimizations under the
hood.
•You can load data from a variety of structured data sources, e.g.,
JSON and Parquet. 
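For example, a minimal sketch (assuming a SQLContext named sqlContext, as used later in these slides; the file paths are hypothetical):

people = sqlContext.read.json("people.json")        # load a JSON source into a DataFrame
events = sqlContext.read.parquet("events.parquet")  # load a Parquet source into a DataFrame
people.printSchema()                                # column names and types are inferred from the data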

Advanced Analytics
•Supervised learning, including classification and regression, to
predict a label for each data point based on various features.
•Recommendation engines to suggest products to users based on
behavior.
•Unsupervised learning, including clustering, anomaly detection,
and topic modeling to discover structure in the data.
•Graph analytics such as searching for patterns in a social network.
Supervised Learning
•Supervised learning is probably the most common type of
machine learning.
•The goal is simple: using historical data that already has labels
(often called the dependent variables), train a model to predict
the values of those labels based on various features of the data
points.
Supervised Learning
•Classification: predict a categorical variable.
•Regression: predict a continuous variable (a real number).
Supervised Learning
•Classification
 - Predicting disease,
 - Classifying images,
 - Predicting customer churn,
 - Buy or won’t buy (predicting conversion).
•Regression
 - Predicting sales,
 - Predicting height,
 - Predicting the number of viewers of a show.
Machine Learning Workflow
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage the data in a
suitable form (e.g., converting the data to numerical vectors).
4. Using a portion of this data as a training set to train one or more algorithms to
generate some candidate models.
5. Evaluating and comparing models against your success criteria by objectively
measuring results on a subset of the same data that was not used for training.
6. Leveraging the insights from the above process and/or using the model to make
predictions, detect anomalies, or solve more general business challenges.
ML Workflow in Spark
Transformers
•Transformers are functions that convert raw data in some way. This might be to create a new interaction variable (from two other variables), normalize a column, or simply change an Integer into a Double type to be input into a model.
•Transformers take a DataFrame as input and produce a new
DataFrame as output.
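As a minimal sketch, using VectorAssembler (one of the built-in feature transformers in pyspark.ml.feature); the DataFrame df and its columns are hypothetical:

from pyspark.ml.feature import VectorAssembler

# Combine two (hypothetical) numeric columns into a single feature vector column.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
assembled = assembler.transform(df)  # DataFrame in, new DataFrame with a "features" column out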
Estimators
•Algorithms that allow users to train a model from data are
referred to as estimators.
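A minimal sketch (assuming a hypothetical training DataFrame with "features" and "label" columns); LogisticRegression is an estimator whose fit() returns a fitted model:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")  # an Estimator
lrModel = lr.fit(training)  # fit() consumes a DataFrame and returns a model (a Transformer)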
Evaluator
•An evaluator allows us to see how a given model performs
according to criteria we specify like a receiver operating
characteristic (ROC) curve.
•We use an evaluator in order to select the best model among the alternatives. The best model is then used to make predictions.
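A minimal sketch, assuming a hypothetical predictions DataFrame produced by a binary classifier with the default "rawPrediction" and "label" columns:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC",
                                          labelCol="label",
                                          rawPredictionCol="rawPrediction")
auc = evaluator.evaluate(predictions)  # area under the ROC curve, between 0 and 1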
Pipeline
•From a high level we can specify each of the transformations,
estimations, and evaluations one by one, but it is often easier to
specify our steps as stages in a pipeline.
•This pipeline is similar to scikit-learn’s pipeline concept.
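A minimal sketch that chains the hypothetical transformer and estimator from above into a single pipeline (the training and test DataFrames and their columns are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])   # a transformer stage followed by an estimator stage
pipelineModel = pipeline.fit(training)        # fits all stages in order
predictions = pipelineModel.transform(test)   # applies the fitted stages to new data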
Collaborative Filtering
•Collaborative filtering is commonly used for recommender
systems.
•The aim is to fill in the missing entries of a user-item association
(preference, score, …) matrix.
•Users and products are described by a small set of latent factors
that can be used to predict missing entries.
•The alternating least squares (ALS) algorithm is used to learn the latent factors.
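In the usual formulation (a sketch of the standard model, not specific to Spark's implementation): each user u is described by a factor vector x_u and each item i by a factor vector y_i, the predicted score is their dot product, and ALS minimizes the regularized squared error over the set Omega of observed entries, alternating between solving for the user factors with the item factors fixed and vice versa:

\hat{r}_{ui} = x_u^{\top} y_i,
\qquad
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)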
Ratings Dataset
•Ratings data could consist of explicit ratings given by users, or they could be derived.
•In general, we could work with two types of user feedback:
 (1) Explicit Feedback
 (2) Implicit Feedback
Rating Data
•Explicit Feedback:
 - The score entries in the user-item matrix are explicit preferences given by users to items.
Rating Data
•Implicit Feedback:
 - It is common in many real-world use cases to only have access to implicit feedback (e.g. total views, total clicks, total purchases, total likes, total shares etc.).
 - Using such aggregate statistics, we could compute scores.
Training Dataset
•The ratings in our dataset are in the following format:

 UserID::MovieID::Rating::Timestamp

 - UserIDs range between 1 and 6040.
 - MovieIDs range between 1 and 3952.
 - Ratings are made on a 5-star scale (whole-star ratings only).
 - Timestamps are represented in seconds since the epoch, as returned by time(2).
 - Each user has at least 20 ratings.
Data Wrangling
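The snippet below assumes a live SparkContext named sc and an RDD named rdd holding the raw rating lines. As a minimal sketch (the file path is hypothetical), such an RDD could be created with:

>>> rdd = sc.textFile("ratings.dat")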
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> from pyspark.ml.recommendation import ALS
>>> from pyspark.sql import Row
>>> parts = rdd.map(lambda row: row.split("::"))
>>> ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
...                                      rating=float(p[2]), timestamp=long(p[3])))
>>> ratings = sqlContext.createDataFrame(ratingsRDD)
>>> (training, test) = ratings.randomSplit([0.8, 0.2])
We will use 80% of our dataset for training, and 20% for testing.
DataFrame for Training
>>> training.limit(2).show()
+-------+------+---------+------+
|movieId|rating|timestamp|userId|
+-------+------+---------+------+
|      1|   1.0|974643004|  2534|
|      1|   1.0|974785296|  1314|
+-------+------+---------+------+
Model Building
•Use the ALS "Estimator" to "fit" a model on the training dataset.

>>> als = ALS(maxIter=5, regParam=0.01, userCol="userId",
...           itemCol="movieId", ratingCol="rating")
>>> model = als.fit(training)
Testing
•Use the learnt model, which is a "Transformer", to "predict" a named column value for test instances.

>>> model.transform(test.limit(2)).show()
+-------+------+---------+------+----------+
|movieId|rating|timestamp|userId|prediction|
+-------+------+---------+------+----------+
|      1|   1.0|974675906|  2015| 3.5993457|
|      1|   1.0|973215902|  2744| 1.4472415|
+-------+------+---------+------+----------+

The "prediction" column is the new column added by the transformer.
Estimation Error
•Let's compute the error we made in our predictions.

Root Mean Squared Error (RMSE):
 - The square root of the average of the squares of all of the errors.
 - RMSE is widely used and makes an excellent general-purpose error metric for numerical predictions.
 - Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.
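Written out (the standard definition), with y_i the true rating, \hat{y}_i the predicted rating, and n the number of test predictions:

\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 }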
Estimation Error
>>> from math import sqrt
>>> from pyspark.sql.functions import sum, isnan
>>> predictions = model.transform(test)
>>> df1 = predictions.select(
...     ((predictions.prediction - predictions.rating)**2).alias("error"))
>>> df1 = df1.filter(~isnan(df1.error))
>>> print "RMSE:", sqrt(df1.select(sum("error").alias("error"))
...                     .collect()[0].error / df1.count())
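The isnan filter is needed because ALS can produce NaN predictions for users or items that appear only in the test split (the cold-start problem). Spark ML also provides a built-in evaluator for this metric; a minimal sketch using RegressionEvaluator on the same predictions DataFrame:

>>> from pyspark.ml.evaluation import RegressionEvaluator
>>> evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
...                                 predictionCol="prediction")
>>> print "RMSE:", evaluator.evaluate(predictions.filter(~isnan(predictions.prediction)))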
