This slide deck shows how to use Hivemall for recommendation problems.


- 1. Recommendation 101 using Hivemall. Makoto YUI, Research Engineer (@myui, <myui@treasure-data.com>) 1
- 2. Agenda 1. Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Bayesian Personalized Ranking 2
- 3. What is Hivemall Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2 3 https://github.com/myui/hivemall
- 4. Hivemall’s Vision: ML on SQL Classification with Mahout CREATE TABLE lr_model AS SELECT feature, -- reducers perform model averaging in parallel avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t -- map-only task GROUP BY feature; -- shuffled to reducers ✓Machine Learning made easy for SQL developers (ML for the rest of us) ✓Interactive and Stable APIs w/ SQL abstraction This SQL query automatically runs in parallel on Hadoop 4
- 5. How to use Hivemall Machine Learning Training Prediction Prediction Model Label Feature Vector Feature Vector Label Data preparation 5
- 6. CREATE EXTERNAL TABLE e2006tfidf_train ( rowid int, label float, features ARRAY<STRING> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train'; How to use Hivemall - Data preparation Define a Hive table for training/testing data 6
- 7. How to use Hivemall Machine Learning Training Prediction Prediction Model Label Feature Vector Feature Vector Label Feature Engineering 7
- 8. create view e2006tfidf_train_scaled as select rowid, rescale(target,${min_label},${max_label}) as label, features from e2006tfidf_train; Applying a Min-Max Feature Normalization How to use Hivemall - Feature Engineering Transforming a label value to a value between 0.0 and 1.0 8
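The min-max rescaling shown above maps a label into [0.0, 1.0]. As a sketch, the same transformation in Python (a standalone, illustrative stand-in for Hivemall's `rescale` UDF, not its actual implementation):

```python
def rescale(value, min_label, max_label):
    """Min-max normalization: map value into [0.0, 1.0]."""
    return (value - min_label) / (max_label - min_label)

# Example: a rating of 3 observed on a 1..5 scale
print(rescale(3.0, 1.0, 5.0))  # 0.5
```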
- 9. How to use Hivemall Machine Learning Training Prediction Prediction Model Label Feature Vector Feature Vector Label Training 9
- 10. How to use Hivemall - Training CREATE TABLE lr_model AS SELECT feature, avg(weight) as weight FROM ( SELECT logress(features,label,..) as (feature,weight) FROM train ) t GROUP BY feature Training by logistic regression map-only task to learn a prediction model Shuffle map-outputs to reducers by feature Reducers perform model averaging in parallel 10
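The model-averaging step of the query above (`GROUP BY feature` + `avg(weight)`) can be sketched in Python. Here `partial_weights` is a hypothetical stand-in for the (feature, weight) rows emitted by the parallel map-side learners:

```python
from collections import defaultdict

def average_models(partial_weights):
    """Model averaging: group (feature, weight) rows by feature and average,
    mirroring GROUP BY feature + avg(weight) in the training query."""
    sums = defaultdict(lambda: [0.0, 0])
    for feature, weight in partial_weights:
        s = sums[feature]
        s[0] += weight
        s[1] += 1
    return {f: total / n for f, (total, n) in sums.items()}

# Two learners emitted a weight for f1; only one saw f2
rows = [("f1", 0.8), ("f1", 0.6), ("f2", -0.2)]
print(average_models(rows))
```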
- 11. How to use Hivemall - Training CREATE TABLE news20b_cw_model1 AS SELECT feature, voted_avg(weight) as weight FROM (SELECT train_cw(features,label) as (feature,weight) FROM news20b_train ) t GROUP BY feature Training of Confidence Weighted Classifier Vote to use negative or positive weights for avg +0.7, +0.3, +0.2, -0.1, +0.7 Training for the CW classifier 11
- 12. How to use Hivemall Machine Learning Training Prediction Prediction Model Label Feature Vector Feature Vector Label Prediction 12
- 13. How to use Hivemall - Prediction CREATE TABLE lr_predict as SELECT t.rowid, sigmoid(sum(m.weight)) as prob FROM testing_exploded t LEFT OUTER JOIN lr_model m ON (t.feature = m.feature) GROUP BY t.rowid Prediction is done by LEFT OUTER JOIN between test data and prediction model No need to load the entire model into memory 13
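The LEFT OUTER JOIN above is a sparse dot product: each test feature looks up its learned weight (0 if absent), the weights are summed per row, and a sigmoid turns the sum into a probability. A minimal Python sketch under that reading (the model dict and feature names are toy values):

```python
import math

def predict(model, features):
    """Sparse dot product via lookup (the join on feature), then sigmoid,
    mirroring sigmoid(sum(m.weight)) GROUP BY t.rowid."""
    z = sum(model.get(f, 0.0) for f in features)  # unseen features contribute 0
    return 1.0 / (1.0 + math.exp(-z))

model = {"f1": 0.8, "f2": -0.3}            # hypothetical learned weights
print(predict(model, ["f1", "f2", "f9"]))  # f9 was never trained on
```

Because the join streams over the model table, the whole model never has to fit in one process's memory, which is the point the slide makes.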
- 14. 14 Classification ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification Regression ✓Logistic Regression (SGD) ✓PA Regression ✓AROW Regression ✓AdaGrad (logistic loss) ✓AdaDELTA (logistic loss) ✓Factorization Machines ✓RandomForest Regression List of supported Algorithms
- 15. List of supported Algorithms 15 Classification ✓ Perceptron ✓ Passive Aggressive (PA, PA1, PA2) ✓ Confidence Weighted (CW) ✓ Adaptive Regularization of Weight Vectors (AROW) ✓ Soft Confidence Weighted (SCW) ✓ AdaGrad+RDA ✓ Factorization Machines ✓ RandomForest Classification Regression ✓Logistic Regression (SGD) ✓AdaGrad (logistic loss) ✓AdaDELTA (logistic loss) ✓PA Regression ✓AROW Regression ✓Factorization Machines ✓RandomForest Regression SCW is a good first choice Try RandomForest if SCW does not work Logistic regression is good for getting a probability of a positive class Factorization Machines is good where features are sparse and categorical ones
- 16. List of Algorithms for Recommendation 16 K-Nearest Neighbor ✓ Minhash and b-Bit Minhash (LSH variant) ✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular) Matrix Completion ✓ Matrix Factorization ✓ Factorization Machines (regression) each_top_k function of Hivemall is useful for recommending top-k items
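The `each_top_k` function mentioned above keeps the k highest-scoring items per group. A Python sketch of the same idea, assuming rows of (user, item, score) as input (the helper name and data are illustrative, not Hivemall's API):

```python
import heapq

def top_k_per_user(scores, k):
    """Group (user, item, score) rows by user and keep the k items with the
    highest scores per user, like each_top_k inside a query."""
    per_user = {}
    for user, item, score in scores:
        per_user.setdefault(user, []).append((score, item))
    return {u: [i for _, i in heapq.nlargest(k, rows)]
            for u, rows in per_user.items()}

rows = [("u1", "i1", 0.9), ("u1", "i2", 0.4), ("u1", "i3", 0.7), ("u2", "i1", 0.2)]
print(top_k_per_user(rows, 2))  # {'u1': ['i1', 'i3'], 'u2': ['i1']}
```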
- 17. Other Supported Algorithms 17 Anomaly Detection ✓ Local Outlier Factor (LOF) Feature Engineering ✓Feature Hashing ✓Feature Scaling (normalization, z-score) ✓ TF-IDF vectorizer ✓ Polynomial Expansion (Feature Pairing) ✓ Amplifier NLP ✓Basic English text tokenizer ✓Japanese tokenizer (Kuromoji)
- 18. Agenda 1. Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Bayesian Personalized Ranking 18
- 19. •Explicit Feedback • Item Rating • Item Ranking •Implicit Feedback • Positive-only Implicit Feedback • Bought (or not) • Clicked (or not) • Converted (or not) 19 Recommendation 101
- 20. •Explicit Feedback • Item Rating • Item Ranking •Implicit Feedback • Positive-only Implicit Feedback • Bought (or not) • Clicked (or not) • Converted (or not) 20 Recommendation 101 Case for Coursehero?
- 21. U/I Item 1 Item 2 Item 3 … Item I User 1 5 3 User 2 2 1 … 3 4 User U 1 4 5 21 Explicit Feedback
- 22. U/I Item 1 Item 2 Item 3 … Item I User 1 ? 5 ? ? 3 User 2 2 ? 1 ? ? … ? 3 ? 4 ? User U 1 ? 4 ? 5 22 Explicit Feedback
- 23. 23 Explicit Feedback U/I Item 1 Item 2 Item 3 … Item I User 1 ? 5 ? ? 3 User 2 2 ? 1 ? ? … ? 3 ? 4 ? User U 1 ? 4 ? 5 • Very sparse dataset • # of feedbacks is small • Unknown data >> training data • User preference for rated items is clear • Includes negative feedback • Evaluation is easy (MAE/RMSE)
- 24. U/I Item 1 Item 2 Item 3 … Item I User 1 ⭕ ⭕ User 2 ⭕ ⭕ … ⭕ ⭕ User U ⭕ ⭕ ⭕ 24 Implicit Feedback
- 25. U/I Item 1 Item 2 Item 3 … Item I User 1 ⭕ ⭕ User 2 ⭕ ⭕ … ⭕ ⭕ User U ⭕ ⭕ ⭕ 25 Implicit Feedback • Sparse dataset • # of feedbacks is large • User preference is unclear • No explicit negative feedback • Known feedback may be negative • Unknown feedback may be positive • Evaluation is not so easy (NDCG, Prec@K, Recall@K)
- 26. 26 Pros and Cons. Data size: Explicit weak, Implicit strong. User preference: Explicit strong, Implicit weak. Dislike/Unknown signal: Explicit strong, Implicit weak. Impact of bias: Explicit weak, Implicit strong
- 27. Agenda 1. Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Bayesian Personalized Ranking 27
- 28. 28 Matrix Factorization/Completion Factorize a matrix into a product of low-rank matrices with k latent factors
- 29. 29 Matrix Completion How-to • Mean Rating μ • Rating Bias for each Item Bi • Rating Bias for each User Bu
- 30. 30 Criteria of Biased Matrix Factorization: minimize the diff in prediction, i.e., the squared difference between each rating and its prediction (mean rating μ + user/item biases + the factorization term p_u·q_i), plus a regularization term
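The biased MF model on this slide predicts a rating as the mean rating plus the user and item biases plus a latent-factor dot product. A minimal Python sketch of that prediction rule (all values below are toy numbers with k=2 latent factors, purely for illustration):

```python
def predict_rating(mu, bu, bi, pu, qi):
    """Biased MF prediction: mean rating + user bias + item bias
    + dot product of the k-dimensional latent factors."""
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))

# mean rating 3.5, slightly generous user, slightly disliked item
print(predict_rating(3.5, 0.2, -0.1, [0.5, 1.0], [1.0, 0.5]))  # 4.6
```

Training then minimizes the squared difference between observed ratings and this prediction, plus regularization on the biases and factors.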
- 31. 31 Training of Matrix Factorization Supports iterative training using a local disk cache
- 32. 32 Prediction of Matrix Factorization
- 33. Agenda 1. Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Bayesian Personalized Ranking 33 Still in beta, but will be officially supported soon
- 34. 34 Implicit Feedback A naïve (and poor) approach: fill every unknown cell as negative
- 35. 35 Sampling scheme for Implicit Feedback Sample triples <u, i, j> of positive item i and negative item j for each user u • Uniform user sampling Ø Sample a user. Then, sample a pair. • Uniform pair sampling Ø Sample pairs directly (distribution follows the original dataset) • With-replacement or without-replacement sampling U/I Item 1 Item 2 Item 3 … Item I User 1 ⭕ ⭕ User 2 ⭕ ⭕ … ⭕ ⭕ User U ⭕ ⭕ ⭕ Default Hivemall sampling scheme: - Uniform user sampling - With replacement
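Hivemall's default scheme (uniform user sampling, with replacement) can be sketched in Python as follows; `feedback` maps each user to their set of positively observed items, and all names here are illustrative, not Hivemall's internals:

```python
import random

def sample_triple(feedback, items, rng=random):
    """Uniform user sampling with replacement: pick a user uniformly,
    then a positive item i and a non-observed (assumed negative) item j."""
    user = rng.choice(list(feedback))            # uniform over users
    positives = feedback[user]
    i = rng.choice(sorted(positives))            # an observed (positive) item
    j = rng.choice([it for it in items if it not in positives])  # unobserved
    return user, i, j

feedback = {"u1": {"i1", "i3"}, "u2": {"i2"}}
print(sample_triple(feedback, ["i1", "i2", "i3"]))
```

Uniform pair sampling would instead draw (u, i) pairs in proportion to how many positives each user has, matching the original dataset's distribution.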
- 36. •Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback”, Proc. UAI, 2009. •Arguably the most proven algorithm for recommendation from implicit feedback 36 Bayesian Personalized Ranking Key assumption: user u prefers observed item i over non-observed item j
- 37. Bayesian Personalized Ranking 37 Image taken from Rendle et al., “BPR: Bayesian Personalized Ranking from Implicit Feedback”, Proc. UAI, 2009. http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf BPR-MF’s task can be seen as filling each user’s item-item preference matrix with 0/1 and estimating the probability that i >u j
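BPR-MF optimizes ln σ(x_uij) over sampled triples, where x_uij = x_ui − x_uj and x_u· is the user-item score from the factor model. A minimal SGD step under that formulation (biases omitted; the learning rate, regularization, and toy vectors are illustrative, not Hivemall's defaults):

```python
import math

def bpr_sgd_step(pu, qi, qj, lr=0.05, reg=0.01):
    """One BPR-MF SGD step on a triple (u, i, j): ascend the gradient of
    ln sigma(x_uij) with L2 regularization, where x_uij = p_u.(q_i - q_j).
    Updates pu, qi, qj in place; returns x_uij before the update."""
    x_uij = sum(p * (a - b) for p, a, b in zip(pu, qi, qj))
    g = 1.0 / (1.0 + math.exp(x_uij))  # sigma(-x_uij), the gradient scale
    for f in range(len(pu)):
        pu_f = pu[f]
        pu[f] += lr * (g * (qi[f] - qj[f]) - reg * pu[f])
        qi[f] += lr * (g * pu_f - reg * qi[f])
        qj[f] += lr * (-g * pu_f - reg * qj[f])
    return x_uij
```

Repeated steps push x_uij up, i.e., make the model rank the observed item i above the unobserved item j for user u.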
- 38. Train by BPR-Matrix Factorization 38
- 39. 39 Predict by BPR-Matrix Factorization
- 40. 40 Predict by BPR-Matrix Factorization
- 41. 41 Predict by BPR-Matrix Factorization
- 42. 42 Recommendation for Implicit Feedback Dataset 1. Efficient top-k computation is important for prediction, which is O(U * I) 2. Memory consumption is heavy where the number of items |I| is large • MyMediaLite requires lots of memory • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings 3. Better to avoid recomputing predictions every time
- 43. 43 We support machine learning in the cloud Any feature requests? Or questions?
