Dr. Ahmet Bulut

Computer Science Department

Istanbul Sehir University
email: ahmetbulut@sehir.edu.tr
Nose Dive into Apache Spark ML
DataFrame
The reason for putting the data on more than one computer should be intuitive: either the
data is too large to fit on one machine or it would simply take too long to perform
that computation on one machine.
DataFrame
•A DataFrame is a distributed collection of data organized into
named columns.
•It is conceptually equivalent to a table in a relational database or
a data frame in R/Python, but with richer optimizations under the
hood.
•You can load data from a variety of structured data sources, e.g.,
JSON and Parquet. 
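For example, a minimal sketch (assuming a SQLContext named sqlContext, as used later in these slides; the file paths are hypothetical):

people = sqlContext.read.json("people.json")        # load a JSON source into a DataFrame
events = sqlContext.read.parquet("events.parquet")  # load a Parquet source into a DataFrame
people.printSchema()                                # column names and types are inferred from the data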

Advanced Analytics
•Supervised learning, including classification and regression, to
predict a label for each data point based on various features.
•Recommendation engines to suggest products to users based on
behavior.
•Unsupervised learning, including clustering, anomaly detection,
and topic modeling to discover structure in the data.
•Graph analytics such as searching for patterns in a social network.
Supervised Learning
•Supervised learning is probably the most common type of
machine learning.
•The goal is simple: using historical data that already has labels
(often called the dependent variables), train a model to predict
the values of those labels based on various features of the data
points.
Supervised Learning
•Classification: predict a categorical variable.
•Regression: predict a continuous variable (a real number).
Supervised Learning
•Classification
 - Predicting disease,
 - Classifying images,
 - Predicting customer churn,
 - Buy or won’t buy (predicting conversion).
•Regression
 - Predicting sales,
 - Predicting height,
 - Predicting the number of viewers of a show.
Machine Learning Workflow
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage the data in a
suitable form (e.g., converting the data to numerical vectors).
4. Using a portion of this data as a training set to train one or more algorithms to
generate some candidate models.
5. Evaluating and comparing models against your success criteria by objectively
measuring results on a subset of the same data that was not used for training.
6. Leveraging the insights from the above process and/or using the model to make
predictions, detect anomalies, or solve more general business challenges.
ML Workflow in Spark
Transformers
•Transformers are functions that convert raw data in some way. This might be to create a new interaction variable (from two other variables), normalize a column, or simply change an Integer into a Double type to be input into a model.
•Transformers take a DataFrame as input and produce a new
DataFrame as output.
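As a minimal sketch, using VectorAssembler (one of the built-in feature transformers in pyspark.ml.feature); the DataFrame df and its columns are hypothetical:

from pyspark.ml.feature import VectorAssembler

# Combine two (hypothetical) numeric columns into a single feature vector column.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
assembled = assembler.transform(df)  # DataFrame in, new DataFrame with a "features" column out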
Estimators
•Algorithms that allow users to train a model from data are
referred to as estimators.
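A minimal sketch (assuming a hypothetical training DataFrame with "features" and "label" columns); LogisticRegression is an estimator whose fit() returns a fitted model:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")  # an Estimator
lrModel = lr.fit(training)  # fit() consumes a DataFrame and returns a model (a Transformer)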
Evaluator
•An evaluator allows us to see how a given model performs
according to criteria we specify like a receiver operating
characteristic (ROC) curve.
•We use an evaluator in order to select the best model among the alternatives. The best model is then used to make predictions.
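A minimal sketch, assuming a hypothetical predictions DataFrame produced by a binary classifier with the default "rawPrediction" and "label" columns:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC",
                                          labelCol="label",
                                          rawPredictionCol="rawPrediction")
auc = evaluator.evaluate(predictions)  # area under the ROC curve, between 0 and 1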
Pipeline
•From a high level we can specify each of the transformations,
estimations, and evaluations one by one, but it is often easier to
specify our steps as stages in a pipeline.
•This pipeline is similar to scikit-learn’s pipeline concept.
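A minimal sketch that chains the hypothetical transformer and estimator from above into a single pipeline (the training and test DataFrames and their columns are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])   # a transformer stage followed by an estimator stage
pipelineModel = pipeline.fit(training)        # fits all stages in order
predictions = pipelineModel.transform(test)   # applies the fitted stages to new data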
Collaborative Filtering
•Collaborative filtering is commonly used for recommender
systems.
•The aim is to fill in the missing entries of a user-item association
(preference, score, …) matrix.
•Users and products are described by a small set of latent factors
that can be used to predict missing entries.
•The alternating least squares (ALS) algorithm is used to learn the latent factors.
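In the usual formulation (a sketch of the standard model, not specific to Spark's implementation): each user u is described by a factor vector x_u and each item i by a factor vector y_i, the predicted score is their dot product, and ALS minimizes the regularized squared error over the set Omega of observed entries, alternating between solving for the user factors with the item factors fixed and vice versa:

\hat{r}_{ui} = x_u^{\top} y_i,
\qquad
\min_{X,\,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)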
Ratings Dataset
•Ratings data could consist of explicit ratings given by users, or they could be derived.
•In general, we could work with two types of user feedback:
 (1) Explicit Feedback
 (2) Implicit Feedback
Rating Data
•Explicit Feedback:
 - The score entries in the user-item matrix are explicit preferences given by users to items.
Rating Data
•Implicit Feedback:
 - It is common in many real-world use cases to only have access to implicit feedback (e.g. total views, total clicks, total purchases, total likes, total shares etc.).
 - Using such aggregate statistics, we could compute scores.
Training Dataset
•The ratings in our dataset are in the following format:

 UserID::MovieID::Rating::Timestamp

 - UserIDs range between 1 and 6040.
 - MovieIDs range between 1 and 3952.
 - Ratings are made on a 5-star scale (whole-star ratings only).
 - Timestamps are represented in seconds since the epoch, as returned by time(2).
 - Each user has at least 20 ratings.
Data Wrangling
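The snippet below assumes a live SparkContext named sc and an RDD named rdd holding the raw rating lines. As a minimal sketch (the file path is hypothetical), such an RDD could be created with:

>>> rdd = sc.textFile("ratings.dat")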
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> from pyspark.ml.recommendation import ALS
>>> from pyspark.sql import Row
>>> parts = rdd.map(lambda row: row.split("::"))
>>> ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
...                                      rating=float(p[2]), timestamp=long(p[3])))
>>> ratings = sqlContext.createDataFrame(ratingsRDD)
>>> (training, test) = ratings.randomSplit([0.8, 0.2])
We will use 80% of our dataset for training, and 20% for testing.
DataFrame for Training
>>> training.limit(2).show()
+-------+------+---------+------+
|movieId|rating|timestamp|userId|
+-------+------+---------+------+
|      1|   1.0|974643004|  2534|
|      1|   1.0|974785296|  1314|
+-------+------+---------+------+
Model Building
•Use the ALS "Estimator" to "fit" a model on the training dataset.

>>> als = ALS(maxIter=5, regParam=0.01, userCol="userId",
...           itemCol="movieId", ratingCol="rating")
>>> model = als.fit(training)
Testing
•Use the learnt model, which is a "Transformer", to "predict" a named column value for test instances.

>>> model.transform(test.limit(2)).show()
+-------+------+---------+------+----------+
|movieId|rating|timestamp|userId|prediction|
+-------+------+---------+------+----------+
|      1|   1.0|974675906|  2015| 3.5993457|
|      1|   1.0|973215902|  2744| 1.4472415|
+-------+------+---------+------+----------+

The "prediction" column is the new column added by the transformer.
Estimation Error
•Let's compute the error we made in our predictions.

Root Mean Squared Error (RMSE):
 - The square root of the average of the squares of all of the errors.
 - RMSE is widely used and makes an excellent general-purpose error metric for numerical predictions.
 - Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.
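Written out (the standard definition), with y_i the true rating, \hat{y}_i the predicted rating, and n the number of test predictions:

\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 }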
Estimation Error
>>> from math import sqrt
>>> from pyspark.sql.functions import sum, isnan
>>> predictions = model.transform(test)
>>> df1 = predictions.select(
...     ((predictions.prediction - predictions.rating)**2).alias("error"))
>>> df1 = df1.filter(~isnan(df1.error))
>>> print "RMSE:", sqrt(df1.select(sum("error").alias("error"))
...                     .collect()[0].error / df1.count())
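The isnan filter is needed because ALS can produce NaN predictions for users or items that appear only in the test split (the cold-start problem). Spark ML also provides a built-in evaluator for this metric; a minimal sketch using RegressionEvaluator on the same predictions DataFrame:

>>> from pyspark.ml.evaluation import RegressionEvaluator
>>> evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
...                                 predictionCol="prediction")
>>> print "RMSE:", evaluator.evaluate(predictions.filter(~isnan(predictions.prediction)))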
