When you use YouTube, Netflix or other online media services, you may have
noticed a "recommendation for you" section for videos, movies or music. As
consumers, we like a personalized list because it gives easy access to products
and services and saves time. As we watch more videos, those recommendations
become more accurate. A more satisfied user is a winning factor for a company.
Big data makes this possible through its scalability and its power to process
huge volumes of structured or unstructured data. With big data tools, developers
can analyze billions of a company's products and process them with the help of
machine learning to provide ever more narrowly targeted recommendations for the
user.
Products Recommendations - Business Case
PRODUCT
RECOMMENDATIONS
Customers of these brands would be delighted with the huge variety of products
to choose from, but often find it difficult to sift through the variety and
identify things they would like.
RECOMMENDATIONS HELP USERS
- Navigate the maze of the product catalogues
- Find what they are looking for
- Find PRODUCTS they might like, but didn't know of
HOW? USING DATA
- What users bought
- What users browsed
- What users rated
This data feeds the RECOMMENDATION ENGINE, which produces suggestions such as:
- "Top picks for you!!!"
- "If you like this, you'll love that"
RECOMMENDATION ENGINE OBJECTIVE: Filter Relevant Products
Tasks Performed By RECOMMENDATION ENGINES
- Predict what rating the user would give a product
- Predict whether a user would buy a product
- Rank products based on their relevance to the user
Products Recommendations - How?
Most RECOMMENDATION ENGINES use a technique called COLLABORATIVE FILTERING - Latent Factor
How does that work?
The basic premise is that if 2 users have the same opinion about a bunch of
products, they are likely to have the same opinion about other products too.
It represents users by their ratings for different products.
COLLABORATIVE FILTERING algorithms normally predict users' ratings for products
they haven't yet rated.
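The premise above can be illustrated with a small plain-Scala sketch that compares users by the ratings they share. This is a neighborhood-style similarity check, not the latent factor method the slides go on to use; the user names, ratings and the cosine helper are all made up for illustration.

```scala
// Each user is represented by ratings keyed by product ID (made-up values).
val alice = Map(1 -> 5.0, 2 -> 4.0, 3 -> 1.0)
val bob   = Map(1 -> 5.0, 2 -> 5.0, 4 -> 4.0)
val carol = Map(1 -> 1.0, 2 -> 2.0, 3 -> 5.0)

// Cosine similarity computed over the products both users have rated.
def cosine(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  // toSeq keeps duplicate values that mapping over a Set would collapse
  val shared = a.keySet.intersect(b.keySet).toSeq
  if (shared.isEmpty) 0.0
  else {
    val dot = shared.map(p => a(p) * b(p)).sum
    val normA = math.sqrt(shared.map(p => a(p) * a(p)).sum)
    val normB = math.sqrt(shared.map(p => b(p) * b(p)).sum)
    dot / (normA * normB)
  }
}

val aliceBob = cosine(alice, bob)
val aliceCarol = cosine(alice, carol)
println(f"alice~bob: $aliceBob%.3f, alice~carol: $aliceCarol%.3f")
```

Alice and Bob agree on the products they both rated, so under this premise Bob's rating for product 4 would be a reasonable hint for what Alice might think of it.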
Products are represented using these descriptors:
- Sweaters
- Jeans
- Shirts
- Outerwear
Users are represented using the same descriptors.
Joe likes lightweight skinny-fit jeans and a linen-cotton short-sleeve
standard-fit shirt.
9, 7
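As a rough sketch of why a shared set of descriptors is useful: once a user and a product are both expressed as vectors over the same descriptors, their dot product gives an affinity score. The two descriptor dimensions, the vectors and the affinity helper below are hypothetical and are not taken from the slides.

```scala
// Hypothetical latent-factor vectors over two shared descriptors;
// all values are made up for illustration.
val joe: Array[Double] = Array(0.9, 0.7)
val skinnyJeans: Array[Double] = Array(1.0, 0.2)
val heavyCoat: Array[Double] = Array(0.1, 0.3)

// Affinity is the dot product of the user vector and the product vector.
def affinity(user: Array[Double], product: Array[Double]): Double = {
  require(user.length == product.length, "vectors must share the same descriptors")
  user.zip(product).map { case (u, p) => u * p }.sum
}

val jeansScore = affinity(joe, skinnyJeans) // 0.9 * 1.0 + 0.7 * 0.2
val coatScore = affinity(joe, heavyCoat)    // 0.9 * 0.1 + 0.7 * 0.3
println(f"jeans: $jeansScore%.2f, coat: $coatScore%.2f")
```

A higher score means the product lines up better with what the user's vector says they like, which is how a factorization model ranks candidates.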
We will use the ALS algorithm in Spark's MLlib to learn the latent factors that can be used to predict missing entries in the
user-product association matrix.
First we separate the ratings data into training data (80%) and test data (20%). We will train the model on the training
data, then evaluate its predictions against the test data. Building the model on one subset of the data and verifying it
with the remaining data is known as holdout validation; the goal is to estimate how accurately a predictive model will
perform in practice.
Repeating this process multiple times with different subsets, known as cross validation, gives a more reliable estimate;
we will only do it once.
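For readers curious what repeating the split with different subsets could look like, here is a minimal plain-Scala sketch of k-fold splitting. The kFolds helper and its parameters are assumptions made for illustration; it works on in-memory collections and is not part of Spark or of the walkthrough that follows.

```scala
import scala.util.Random

// Shuffle with a fixed seed, deal the records into k folds, then use each
// fold once as the test set and the remaining folds as the training set.
def kFolds[A](data: Seq[A], k: Int, seed: Long): Seq[(Seq[A], Seq[A])] = {
  require(k >= 2 && data.size >= k, "need at least 2 folds and k records")
  val shuffled = new Random(seed).shuffle(data)
  val folds = shuffled.zipWithIndex.groupBy { case (_, i) => i % k }
  (0 until k).map { f =>
    val test = folds(f).map(_._1)
    val train = (0 until k).filter(_ != f).flatMap(g => folds(g).map(_._1))
    (train, test)
  }
}

val folds5 = kFolds(1 to 100, k = 5, seed = 0L)
folds5.zipWithIndex.foreach { case ((train, test), i) =>
  println(s"fold $i: train=${train.size}, test=${test.size}")
}
```

With k = 5 each round trains on 80% and tests on 20%, the same proportions as the single split used in this deck.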
Products Recommendations - Implementation
Using ALTERNATING LEAST SQUARES (ALS) to Build a Matrix Factorization Model
All ratings are contained in the file "ratings.dat" and are in the following format:
UserID::ProductID::Rating::Timestamp
1::1193::5::978300760
- UserIDs range between 1 and 6040
- ProductIDs range between 1 and 3952
- Ratings are made on a 5-star scale
- Timestamp is represented in seconds since the epoch
User information is in the file "users.dat" and is in the following format:
UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
20::M::25::14::55113
- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
- Occupation is chosen from the following choices:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
Product information is in the file "products.dat" and is in the following format:
ProductID::Name::Category
1::Product1::Pants|Baby|Stripe
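The Category field packs several labels into one pipe-separated string. A small sketch of how such a line could be split into its parts is shown below; the ProductWithCategories case class and the parser name are hypothetical and are not the schema used later in this deck, which keeps only the ID and name.

```scala
// Hypothetical richer schema that keeps the category labels.
case class ProductWithCategories(productId: Int, name: String, categories: Seq[String])

// Returns None for lines that do not have exactly three "::"-separated
// fields or whose first field is not a number.
def parseProductWithCategories(line: String): Option[ProductWithCategories] =
  line.split("::") match {
    case Array(id, name, cats) if id.nonEmpty && id.forall(_.isDigit) =>
      // split on the '|' character (a Char, so it is not treated as a regex)
      Some(ProductWithCategories(id.toInt, name, cats.split('|').toSeq))
    case _ => None
  }

println(parseProductWithCategories("1::Product1::Pants|Baby|Stripe"))
```

Returning an Option instead of asserting lets a caller skip malformed lines rather than fail the whole job.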
The Sample Data Sets
Load Data into Spark DataFrames
First we will import some packages and instantiate a sqlContext, which is the entry point for working with structured data
(rows and columns) in Spark and allows the creation of DataFrame objects.
// SQLContext is the entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
// import Spark SQL data types
import org.apache.spark.sql._
// import MLlib data types
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
// define the schemas using case classes
// input format ProductID::Name::Category (the Category field is not kept)
case class Product(productId: Int, name: String)
// input format UserID::Gender::Age::Occupation::Zip-code
case class User(userId: Int, gender: String, age: Int, occupation: Int, zip: String)

// function to parse input into Product class
def parseProduct(str: String): Product = {
  val fields = str.split("::")
  assert(fields.size == 3)
  Product(fields(0).toInt, fields(1))
}

// function to parse input into User class
def parseUser(str: String): User = {
  val fields = str.split("::")
  assert(fields.size == 5)
  User(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt, fields(4))
}

// function to parse input UserID::ProductID::Rating
// and pass into constructor for org.apache.spark.mllib.recommendation.Rating class
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// load the data into an RDD
val ratingText = sc.textFile("/user/hadoop/data/ratings.dat")
val ratingsRDD = ratingText.map(parseRating).cache()
// count number of total ratings
val numRatings = ratingsRDD.count()
// count number of users who rated a product
val numUsers = ratingsRDD.map(_.user).distinct().count()
// count number of products rated
val numProducts = ratingsRDD.map(_.product).distinct().count()
println(s"Got $numRatings ratings from $numUsers users on $numProducts products.")
// load the data into DataFrames
val productsDF = sc.textFile("/user/hadoop/data/products.dat").map(parseProduct).toDF()
val usersDF = sc.textFile("/user/hadoop/data/users.dat").map(parseUser).toDF()
// create a DataFrame from ratingsRDD
val ratingsDF = ratingsRDD.toDF()
ratingsDF.registerTempTable("ratings")
productsDF.registerTempTable("products")
usersDF.registerTempTable("users")
// min, avg and max used below come from the Spark SQL functions object
import org.apache.spark.sql.functions._
ratingsDF.select("product").distinct.count // res7: Long = 3706
ratingsDF.groupBy("product", "rating").count.show
ratingsDF.groupBy("product").count.agg(min("count"), avg("count"), max("count")).show
ratingsDF.select("product", "rating").groupBy("product", "rating").count.agg(min("count"), avg("count"), max("count")).show
// Count the max, min ratings along with the number of users who have rated a product.
// Display the name, max rating, min rating, number of users.
// A triple-quoted string lets the query span several lines.
val results = sqlContext.sql(
  """SELECT products.name, productrates.maxr, productrates.minr, productrates.cntu
    |FROM (SELECT ratings.product, max(ratings.rating) AS maxr, min(ratings.rating) AS minr,
    |             count(distinct user) AS cntu
    |      FROM ratings GROUP BY ratings.product) productrates
    |JOIN products ON productrates.product = products.productId
    |ORDER BY productrates.cntu DESC""".stripMargin)
// DataFrame show() displays the top 20 rows in tabular form
results.show()
// Show the top 10 most-active users and how many times they rated a product
val mostActiveUsersSchemaRDD = sqlContext.sql(
  """SELECT ratings.user, count(*) AS ct FROM ratings
    |GROUP BY ratings.user ORDER BY ct DESC LIMIT 10""".stripMargin)
mostActiveUsersSchemaRDD.take(20).foreach(println)
// Find the products that user 4169 rated higher than 4
val results = sqlContext.sql(
  """SELECT ratings.user, ratings.product, ratings.rating, products.name
    |FROM ratings JOIN products ON products.productId = ratings.product
    |WHERE ratings.user = 4169 AND ratings.rating > 4
    |ORDER BY ratings.rating DESC""".stripMargin)
results.show
We run ALS on the training RDD (trainingRatingsRDD) of Rating(user, product, rating) objects with the rank and iterations
parameters:
- Rank is the number of latent factors in the model.
- Iterations is the number of iterations to run.
The ALS run(trainingRatingsRDD) method builds and returns a MatrixFactorizationModel, which can be used to make product
predictions for users.
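To give a feel for what "alternating" means, here is a toy plain-Scala sketch of ALS with rank 1 (a single latent factor per user and per product). It is a simplification for illustration only and is not how MLlib implements ALS; the ratings, the lambda value and the helper names are all assumptions.

```scala
// Toy ratings as (user, product, rating); values are made up for illustration.
val toyRatings = Seq((0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0))
val lambda = 0.01 // regularization strength (assumed value)

// With the other side held fixed, the best single factor has a closed form:
// sum(fixed * rating) / (sum(fixed^2) + lambda).
def solve(fixedAndRating: Seq[(Double, Double)]): Double = {
  val num = fixedAndRating.map { case (f, r) => f * r }.sum
  val den = fixedAndRating.map { case (f, _) => f * f }.sum + lambda
  num / den
}

// Alternate: update every user factor, then every product factor.
def als(iterations: Int): (Map[Int, Double], Map[Int, Double]) = {
  var userF = toyRatings.map(_._1).distinct.map(u => u -> 1.0).toMap
  var prodF = toyRatings.map(_._2).distinct.map(p => p -> 1.0).toMap
  for (_ <- 1 to iterations) {
    userF = toyRatings.groupBy(_._1).map { case (u, rs) =>
      u -> solve(rs.map { case (_, p, r) => (prodF(p), r) })
    }
    prodF = toyRatings.groupBy(_._2).map { case (p, rs) =>
      p -> solve(rs.map { case (u, _, r) => (userF(u), r) })
    }
  }
  (userF, prodF)
}

// Sum of squared differences between observed and reconstructed ratings.
def squaredError(userF: Map[Int, Double], prodF: Map[Int, Double]): Double =
  toyRatings.map { case (u, p, r) => math.pow(r - userF(u) * prodF(p), 2) }.sum

val startError = squaredError(Map(0 -> 1.0, 1 -> 1.0, 2 -> 1.0), Map(0 -> 1.0, 1 -> 1.0))
val (toyUserF, toyProdF) = als(20)
println(s"squared error: start=$startError, after ALS=${squaredError(toyUserF, toyProdF)}")
```

With a higher rank, each factor becomes a vector and each update becomes a small least-squares solve, which is the part a library such as MLlib distributes across the cluster.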
// Randomly split ratings RDD into training data RDD (80%) and test data RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
val numTraining = trainingRatingsRDD.count()
val numTest = testRatingsRDD.count()
println(s"Training: $numTraining, test: $numTest.")
// Build the recommendation model using ALS with rank=20, iterations=10
val model = ALS.train(trainingRatingsRDD, 20, 10)
// Equivalent builder-style form of the ALS.train call
val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
Making Predictions with the MatrixFactorizationModel
Now we can use the MatrixFactorizationModel to make predictions. First we will get product predictions for the most active
user, 4169, with the recommendProducts() method, which takes as input the user ID and the number of products to
recommend. Then we print out the recommended product names.
// Make product predictions for user 4169
val topRecsForUser = model.recommendProducts(4169, 10)
// get product names to show with recommendations (productId -> name)
val productNames = productsDF.rdd.map(row => (row.getInt(0), row.getString(1))).collectAsMap()
// print out top recommendations for user 4169 with product names
topRecsForUser.map(rating => (productNames(rating.product), rating.rating)).foreach(println)
Evaluating the Model
Next we will compare predictions from the model with actual ratings in the testRatingsRDD. First we get the user product pairs
from the testRatingsRDD to pass to the MatrixFactorizationModel predict(user: Int, product: Int) method, which will return
predictions as Rating(user, product, rating) objects.
// get predicted ratings to compare to test ratings
val predictionsForTestRDD = model.predict(testRatingsRDD.map{case Rating(user, product, rating) => (user, product)})
println(predictionsForTestRDD.take(10).mkString("\n"))
Now we will compare the test predictions to the actual test ratings. First we put the predictions and the test RDDs in this key,
value pair format for joining: ((user, product), rating). Then we print out the (user, product), (test rating, predicted rating) for
comparison.
// prepare the predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// prepare the test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map {
  case Rating(user, product, rating) => ((user, product), rating)
}
// join the test with the predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD.join(predictionsKeyedByUserProductRDD)
println(testAndPredictionsJoinedRDD.take(10).mkString("\n"))
The example below finds false positives by finding predicted ratings which were >= 4 when the actual test rating was <= 1.
val falsePositives = testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4
}
falsePositives.take(2)
falsePositives.count
Next we evaluate the model using Mean Absolute Error (MAE). MAE is the mean of the absolute differences between the
predicted and actual ratings.
// Evaluate the model using Mean Absolute Error (MAE) between test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = testRating - predRating
    Math.abs(err)
}.mean()
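As a small worked example of the metric, the plain-Scala sketch below computes MAE on three made-up (actual, predicted) pairs, alongside root mean squared error (RMSE), a related metric that is not used in the slides but penalizes large misses more heavily.

```scala
// (actual, predicted) rating pairs; values are made up for illustration.
val ratingPairs = Seq((5.0, 4.5), (3.0, 3.5), (1.0, 3.0))
val errors = ratingPairs.map { case (actual, predicted) => actual - predicted }

// MAE: mean of the absolute errors -> (0.5 + 0.5 + 2.0) / 3
val mae = errors.map(e => math.abs(e)).sum / errors.size
// RMSE: square root of the mean squared error -> sqrt((0.25 + 0.25 + 4.0) / 3)
val rmse = math.sqrt(errors.map(e => e * e).sum / errors.size)

println(f"MAE = $mae%.3f, RMSE = $rmse%.3f")
```

The single large miss (predicting 3.0 for a 1.0 rating) pulls RMSE above MAE, which is why the two are often reported together.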
CLOSING THOUGHTS
- The goal of a collaborative filtering algorithm is to take preference data from users and create a model that can be used
for recommendations or predictions.
- Collaborative filtering algorithms recommend items based on preference information from many users. The approach is
based on similarity: people who liked similar items in the past will like similar items in the future.
- Machine learning algorithms are pretty complicated.
- Apache Spark's MLlib has built-in modules for classification, regression, clustering, recommendation and other algorithms.
Under the hood, the library takes care of running these algorithms across a cluster. This shields the programmer from both
implementing the ML algorithm and the intricacies of running it across a cluster.
- Latent factor analysis and ALS are pretty magical. We just need a good dataset of user-product ratings.