Recommender Systems with Apache Spark's ALS Function

19,363 views

A quick visual guide to recommender systems (user-based, item-based, and matrix factorization) and the code behind building an Apache Spark MatrixFactorizationModel with the ALS function.

Published in: Data & Analytics

  1. 1. Building a Recommender System in PySpark
  2. 2. Will Johnson - Uline - DePaul - LearnBy Marketing.com
  3. 3. AGENDA - RecSys * Basics * MF * Evaluation * Advanced - PySpark * Basics * ALS
  4. 4. User Based Collaborative Filtering [Figure: user-item rating grid with values 4.5, 4.0, 5.0, 4.5, 3.0, 4.0, 2.0, 1.0, 2.0, 1.5, 4.5]
  5. 5. User Based Collaborative Filtering [Figure: the same rating grid with one additional value, 3.8]
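One way to read the two slides above is as a similarity-weighted average of other users' ratings. The sketch below assumes cosine similarity over co-rated items; the toy ratings, user names, and function names are hypothetical and are not taken from the slides.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}. Not the values on the slides.
ratings = {
    "ann": {"A": 4.5, "B": 4.0, "C": 5.0},
    "bob": {"A": 4.5, "B": 3.0, "C": 4.0},
    "cat": {"A": 2.0, "B": 1.0, "C": 2.0},
    "dan": {"A": 4.5, "B": 4.0},  # dan has not rated item C
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict_user_based(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None

prediction = predict_user_based(ratings, "dan", "C")
```

Cosine similarity on raw positive ratings tends to sit close to 1 for every pair of users, so mean-centering each user's ratings first is a common refinement. Item-based filtering follows the same pattern with the roles of users and items swapped.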
  6. 6. Item Based Collaborative Filtering
  7. 7. Item Based Collaborative Filtering
  8. 8. Matrix Factorization
  9. 9. Matrix Factorization
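To make the factorization idea concrete, here is a minimal rank-1 alternating least squares sketch in plain Python. It is not Spark's implementation: the toy ratings, the regularization value, and all names are assumptions chosen for illustration. With rank 1 each factor is a single number, so every update has a closed form.

```python
# Hypothetical observed ratings: (user, item) -> rating.
observed = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 1): 3.0, (1, 2): 1.0,
    (2, 1): 2.0, (2, 2): 5.0,
}
n_users, n_items = 3, 3
reg = 0.1  # L2 regularization strength (assumed value)

def squared_error(x, y):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in observed.items())

def solve_side(fixed, free_pos, fixed_pos, n_free):
    """Hold one side's factors fixed and solve the other side exactly.

    free_pos / fixed_pos say which position of the (user, item) key
    belongs to the side being solved and the side being held fixed.
    """
    num = [0.0] * n_free
    den = [reg] * n_free
    for key, r in observed.items():
        f, g = key[free_pos], key[fixed_pos]
        num[f] += r * fixed[g]
        den[f] += fixed[g] ** 2
    return [n / d for n, d in zip(num, den)]

x = [1.0] * n_users  # user factors
y = [1.0] * n_items  # item factors
initial_error = squared_error(x, y)
for _ in range(10):
    x = solve_side(y, 0, 1, n_users)  # fix item factors, solve user factors
    y = solve_side(x, 1, 0, n_items)  # fix user factors, solve item factors
final_error = squared_error(x, y)

# A predicted rating is the product of the two factors
# (a dot product of factor vectors when rank > 1).
predicted_0_2 = x[0] * y[2]
```

With more latent factors, each closed-form division becomes a small regularized linear solve per user or per item.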
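  10. 10. Evaluation: $\mathrm{RMSE} = \sqrt{\frac{\sum (\mathrm{Predicted} - \mathrm{Actual})^2}{n}}$; $\mathrm{Precision} = \frac{|\mathrm{hits}_u|}{|\mathrm{RecoSet}_u|}$; $\mathrm{Recall} = \frac{|\mathrm{hits}_u|}{|\mathrm{TestSet}_u|}$; Expert Review: Novelty, Context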
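The three metrics on this slide can be sketched in a few lines of plain Python. The function names and the sample inputs are hypothetical; precision and recall are computed for one user's recommendation set and test set.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def precision_recall(recommended, relevant):
    """Precision and recall for a single user.

    recommended: items the system suggested (the user's RecoSet)
    relevant:    items the user actually liked in the held-out data (TestSet)
    """
    hits = set(recommended) & set(relevant)
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical inputs for illustration.
error = rmse([(3.0, 1.0), (1.0, 3.0)])
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6])
```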
  11. 11. CRISP-DM
  12. 12. Data Understanding movielens = sc.textFile("../in/ml-100k/u.data")
  13. 13. Data Understanding movielens.first() returns u'196\t242\t3\t881250949'; movielens.count() returns 100,000
  14. 14. Data Understanding clean_data = movielens.map(lambda x: x.split('\t')) rate = clean_data.map(lambda y: int(y[2])) rate.mean() 3.52986 3 users = clean_data.map(lambda y: int(y[0])) users.distinct().count() 943 clean_data.map(lambda y: int(y[1])).distinct().count() 1,682
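The same exploration steps can be tried without a Spark cluster. The sketch below mirrors them with plain Python lists, assuming the fields in u.data are tab-separated (user, item, rating, timestamp). Only the first sample line comes from the slides; the other two are made up.

```python
# The first line is the sample shown on the slides; the other two are made up.
lines = [
    "196\t242\t3\t881250949",
    "1\t10\t5\t0",
    "1\t242\t4\t0",
]

# The separator must be the tab escape "\t". Splitting on the letter "t"
# finds nothing to split on here and returns the whole line as one field.
fields_wrong = lines[0].split("t")
fields_right = lines[0].split("\t")

rows = [line.split("\t") for line in lines]   # like movielens.map(... split)
rate = [int(r[2]) for r in rows]              # third field is the rating
mean_rating = sum(rate) / len(rate)           # like rate.mean()
n_users = len({int(r[0]) for r in rows})      # like users.distinct().count()
n_items = len({int(r[1]) for r in rows})      # distinct item ids
```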
  15. 15. Data Preparation from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating mls = movielens.map(lambda l: l.split('\t')) ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2]))) Rating(user=196, product=242, rating=3.0)
  16. 16. Data Preparation train, test = ratings.randomSplit([0.7,0.3],7856) train.count() 70,005 test.count() 29,995 train.cache() test.cache()
  17. 17. Modeling rank = 5 # Latent Factors to be made numIterations = 10 # Times to repeat process #Create the model on the training data model = ALS.train(train, rank, numIterations)
  18. 18. Modeling / Evaluation model.userFeatures() model.productFeatures()
  19. 19. Modeling / Evaluation # For Product X, Find N Users to Sell To model.recommendUsers(242,100) # For User Y Find N Products to Promote model.recommendProducts(196,10) #Predict Single Product for Single User model.predict(196, 242)
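Conceptually, a top-N recommendation from a factorization model scores every product for the user with a dot product of latent factors and keeps the highest scores. The sketch below illustrates that idea in plain Python; the factor values and the `recommend_products` helper are hypothetical and do not reproduce Spark's internals.

```python
import heapq

# Hypothetical latent factors (rank = 2), keyed by id.
user_features = {196: [0.9, 0.1]}
product_features = {242: [1.0, 0.2], 302: [0.1, 0.9], 51: [0.8, 0.8]}

def dot(a, b):
    """Dot product of two equal-length factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend_products(user, n):
    """Return the n highest-scoring (product, score) pairs for `user`."""
    scores = (
        (dot(user_features[user], factors), product)
        for product, factors in product_features.items()
    )
    return [(product, score) for score, product in heapq.nlargest(n, scores)]

top_two = recommend_products(196, 2)
```

Swapping the roles of the two factor tables gives the product-to-users direction in the same way.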
  20. 20. Modeling / Evaluation # Predict Multi Users and Multi Products # Pre-Processing pred_input = train.map(lambda x:(x[0],x[1])) # Lots of Predictions pred = model.predictAll(pred_input) #Returns Ratings(user, item, prediction) (196, 242) Rating(user=894, product=1560, rating=3.845)
  21. 21. Evaluation User Item Actual Pred 196 242 3.0 3.91 186 302 3.0 3.29 22 377 1.0 1.09 244 51 2.0 3.66 298 474 4.0 4.11 TRAINING RMSE: 0.763
  22. 22. Evaluation #Organize the data to make (user, product) the key true_reorg = train.map(lambda x:((x[0],x[1]), x[2])) pred_reorg = pred.map(lambda x:((x[0],x[1]), x[2])) #Do the actual join true_pred = true_reorg.join(pred_reorg) from math import sqrt MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean() RMSE = sqrt(MSE) #Results in 0.7629908117414474 Keyed record: ((196, 242), 3.0) Joined record: ((582, 1014), (4.0, 3.397))
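The key-then-join pattern on this slide can be mirrored with dictionaries to see what each step produces. This is a plain-Python sketch, not Spark code. The first three rows reuse values from the evaluation table on slide 21, and the extra predicted row shows that an inner join drops keys missing from either side.

```python
from math import sqrt

# (user, product, rating) triples; values from the table on slide 21.
actual = [(196, 242, 3.0), (186, 302, 3.0), (22, 377, 1.0)]
predicted = [(196, 242, 3.91), (186, 302, 3.29), (22, 377, 1.09), (244, 51, 3.66)]

# Make (user, product) the key, as the two map() calls do.
true_reorg = {(u, p): r for u, p, r in actual}
pred_reorg = {(u, p): r for u, p, r in predicted}

# Inner join: keep only keys present on both sides, pairing the two values.
shared_keys = true_reorg.keys() & pred_reorg.keys()
true_pred = {k: (true_reorg[k], pred_reorg[k]) for k in shared_keys}

# Mean of squared differences, then the square root.
mse = sum((a - p) ** 2 for a, p in true_pred.values()) / len(true_pred)
rmse = sqrt(mse)
```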
  23. 23. Evaluation test_input = test.map(lambda x:(x[0],x[1])) pred_test = model.predictAll(test_input) test_reorg = test.map(lambda x:((x[0],x[1]), x[2])) pred_reorg = pred_test.map(lambda x: ((x[0],x[1]), x[2])) test_pred = test_reorg.join(pred_reorg) test_MSE = test_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean() test_RMSE = sqrt(test_MSE) TEST RMSE: 1.0145
  24. 24. CRISP-DM
  25. 25. RECAP * RecSys are Nearest Neighbors or MF Based * ALS is Implemented in Spark
  26. 26. RECAP rank = 5; numIterations = 10; #Create the model on the training data model = ALS.train(train, rank, numIterations) # Lots of Predictions pred = model.predictAll(pred_input) #Examine Model Features model.productFeatures() # Save your model! model.save(sc,"../out/ml-model")
  27. 27. Questions? LearnBy Marketing.com
