Recommendation 101 using Hivemall

This slide deck shows how to use Hivemall for recommendation problems.

Recommendation 101 using Hivemall
Makoto YUI (@myui), Research Engineer
<myui@treasure-data.com>
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking
What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2.
https://github.com/myui/hivemall
Hivemall's Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.
How to use Hivemall

[Workflow diagram: Feature Vector + Label feed Training, which produces a Prediction Model; the model is applied to new Feature Vectors for Prediction. Highlighted step: Data preparation.]
How to use Hivemall - Data preparation

Define a Hive table for training/testing data:

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall

[Workflow diagram as before. Highlighted step: Feature Engineering.]
How to use Hivemall - Feature Engineering

Applying min-max feature normalization to transform a label value into a value between 0.0 and 1.0:

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
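The rescale UDF applies min-max normalization. A minimal Python sketch of the same arithmetic (the function name and the degenerate-range behavior are illustrative, not Hivemall's exact implementation):

```python
def rescale(value, min_label, max_label):
    """Min-max normalization: map value from [min_label, max_label] to [0.0, 1.0]."""
    if max_label == min_label:
        return 0.5  # degenerate range; Hivemall's exact behavior may differ
    return (value - min_label) / (max_label - min_label)

# Map a rating on a 1..5 scale into [0.0, 1.0]
print(rescale(4.0, 1.0, 5.0))  # 0.75
```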
How to use Hivemall

[Workflow diagram as before. Highlighted step: Training.]
How to use Hivemall - Training

Training by logistic regression:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

The inner query is a map-only task that learns a prediction model; map outputs are shuffled to reducers by feature, and the reducers perform model averaging in parallel.
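The query above embodies the map/reduce split: each mapper trains a model on its data split, and the reducers average weights per feature. A toy Python sketch of that pattern, with a simple logistic-loss SGD standing in for the logress UDF (all names, data, and hyperparameters are illustrative):

```python
from collections import defaultdict
import math

def logress(examples, lr=0.1, epochs=10):
    """Toy logistic-regression SGD over (features, label) rows;
    features is a dict of feature -> value, label in {0, 1}."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += lr * (label - p) * v
    return w

# Two "mappers" each learn a model on their split ...
split_a = [({"x": 1.0}, 1), ({"y": 1.0}, 0)]
split_b = [({"x": 1.0}, 1), ({"x": 1.0, "y": 1.0}, 1)]
models = [logress(split_a), logress(split_b)]

# ... and the "reducer" averages weights per feature (GROUP BY feature + avg).
avg_model = defaultdict(float)
for m in models:
    for f, weight in m.items():
        avg_model[f] += weight / len(models)
print(dict(avg_model))
```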
How to use Hivemall - Training

Training a Confidence Weighted classifier:

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

voted_avg votes on whether to use the negative or the positive weights for averaging (e.g. given +0.7, +0.3, +0.2, -0.1, +0.7, the positive weights win the vote).
How to use Hivemall

[Workflow diagram as before. Highlighted step: Prediction.]
How to use Hivemall - Prediction

CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model, so there is no need to load the entire model into memory.
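The join-based prediction can be pictured as streaming (rowid, feature) pairs against a feature-to-weight table and aggregating per rowid. A Python sketch under assumed toy data (table and column names mirror the query above but the values are made up):

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# lr_model: feature -> weight (the joined model table plays this role in Hive)
lr_model = {"a": 1.2, "b": -0.4, "c": 0.7}

# testing_exploded: one row per (rowid, feature), as produced by exploding
# each test row's feature vector
testing_exploded = [(1, "a"), (1, "b"), (2, "c"), (2, "z")]

# JOIN on feature, then GROUP BY rowid with sigmoid(sum(weight));
# unknown features ("z") join to NULL and contribute nothing to the sum
scores = defaultdict(float)
for rowid, feature in testing_exploded:
    scores[rowid] += lr_model.get(feature, 0.0)

lr_predict = {rowid: sigmoid(z) for rowid, z in scores.items()}
print(lr_predict)
```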
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice; try RandomForest if SCW does not work. Logistic regression is good for getting the probability of a positive class. Factorization Machines are good where features are sparse and categorical.
List of Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items.
Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (feature pairing)
✓ Amplifier

NLP
✓ Basic English text tokenizer
✓ Japanese tokenizer (Kuromoji)
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking
Recommendation 101

• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)

Case for Coursehero?
Explicit Feedback

U/I     | Item 1 | Item 2 | Item 3 | … | Item I
User 1  |        | 5      |        |   | 3
User 2  | 2      |        | 1      |   |
…       |        | 3      |        | 4 |
User U  | 1      |        | 4      |   | 5
Explicit Feedback

U/I     | Item 1 | Item 2 | Item 3 | … | Item I
User 1  | ?      | 5      | ?      | ? | 3
User 2  | 2      | ?      | 1      | ? | ?
…       | ?      | 3      | ?      | 4 | ?
User U  | 1      | ?      | 4      | ? | 5
Explicit Feedback

U/I     | Item 1 | Item 2 | Item 3 | … | Item I
User 1  | ?      | 5      | ?      | ? | 3
User 2  | 2      | ?      | 1      | ? | ?
…       | ?      | 3      | ?      | 4 | ?
User U  | 1      | ?      | 4      | ? | 5

• Very sparse dataset
  • The number of feedbacks is small
  • Unknown data >> training data
• User preference for rated items is clear
• Has negative feedbacks
• Evaluation is easy (MAE/RMSE)
Implicit Feedback

U/I     | Item 1 | Item 2 | Item 3 | … | Item I
User 1  |        | ⭕     |        |   | ⭕
User 2  | ⭕     |        | ⭕     |   |
…       |        | ⭕     |        | ⭕ |
User U  | ⭕     |        | ⭕     |   | ⭕
Implicit Feedback

U/I     | Item 1 | Item 2 | Item 3 | … | Item I
User 1  |        | ⭕     |        |   | ⭕
User 2  | ⭕     |        | ⭕     |   |
…       |        | ⭕     |        | ⭕ |
User U  | ⭕     |        | ⭕     |   | ⭕

• Sparse dataset
  • The number of feedbacks is large
• User preference is unclear
• No negative feedback
  • Known feedback may be negative
  • Unknown feedback may be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
Pros and Cons

                 | Explicit Feedback | Implicit Feedback
Data size        | ☹                 | ☺
User preference  | ☺                 | ☹
Dislike/Unknown  | ☺                 | ☹
Impact of bias   | ☹                 | ☺
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking
Matrix Factorization/Completion

Factorize a matrix into a product of matrices having k latent factors.
Matrix Completion How-to
• Mean rating μ
• Rating bias for each item b_i
• Rating bias for each user b_u
Criteria of Biased MF

minimize Σ_{(u,i)∈R} ( r_ui − μ − b_u − b_i − p_u·q_i )² + λ ( ‖p_u‖² + ‖q_i‖² + b_u² + b_i² )

The first term is the diff in prediction: the observed rating minus the biased factorization μ + b_u + b_i + p_u·q_i, where μ is the mean rating, b_u and b_i are the biases for each user/item, and p_u·q_i is the factorization term. The second term is the regularization.
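The criteria above can be made concrete with a small Python sketch of biased-MF prediction and one SGD step on the squared error. This is the standard formulation, not Hivemall's actual code; all hyperparameters are illustrative:

```python
import random

def predict(mu, bu, bi, pu, qi):
    """Biased MF prediction: mean + user bias + item bias + k-latent-factor dot product."""
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))

def sgd_step(r_ui, mu, bu, bi, pu, qi, lr=0.01, lam=0.02):
    """One SGD update on (r_ui - prediction)^2 plus L2 regularization."""
    err = r_ui - predict(mu, bu, bi, pu, qi)
    bu_new = bu + lr * (err - lam * bu)
    bi_new = bi + lr * (err - lam * bi)
    pu_new = [p + lr * (err * q - lam * p) for p, q in zip(pu, qi)]
    qi_new = [q + lr * (err * p - lam * q) for p, q in zip(pu, qi)]
    return bu_new, bi_new, pu_new, qi_new

random.seed(42)
k = 2  # number of latent factors
mu, bu, bi = 3.5, 0.0, 0.0
pu = [random.uniform(-0.1, 0.1) for _ in range(k)]
qi = [random.uniform(-0.1, 0.1) for _ in range(k)]

# Repeated updates on a single observed rating drive the error toward zero
for _ in range(200):
    bu, bi, pu, qi = sgd_step(5.0, mu, bu, bi, pu, qi)
print(round(predict(mu, bu, bi, pu, qi), 2))
```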
Training of Matrix Factorization

Hivemall supports iterative training using a local disk cache.
Prediction of Matrix Factorization
Agenda
1. Introduction to Hivemall
2. Recommendation 101
3. Matrix Factorization
4. Bayesian Personalized Ranking

Still in beta, but will officially be supported soon.
Implicit Feedback

A naïve ☹ approach is filling every unknown cell in as a negative.
Sampling scheme for Implicit Feedback

Sample triples <u, i, j> of a positive item i and a negative item j for each user u.
• Uniform user sampling
  - Sample a user, then sample a pair.
• Uniform pair sampling
  - Sample pairs directly (distributed along with the original dataset).
• With-replacement or without-replacement sampling

Default Hivemall sampling scheme: uniform user sampling, with replacement.
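The default scheme (uniform user sampling, with replacement) can be sketched as follows; positive_items is a hypothetical user-to-items mapping built from the feedback matrix, and the function name is illustrative:

```python
import random

def sample_triple(positive_items, all_items, rng=random):
    """Sample one <u, i, j> triple: a uniformly chosen user u, a positive
    item i, and a non-observed (assumed negative) item j, with replacement."""
    u = rng.choice(list(positive_items))
    i = rng.choice(list(positive_items[u]))
    # rejection-sample a negative item not observed for u
    while True:
        j = rng.choice(all_items)
        if j not in positive_items[u]:
            return u, i, j

positive_items = {"user1": {"item2", "itemI"}, "user2": {"item1", "item3"}}
all_items = ["item1", "item2", "item3", "itemI"]

random.seed(7)
u, i, j = sample_triple(positive_items, all_items)
print(u, i, j)
```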
Bayesian Personalized Ranking
• Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
• Arguably the most proven algorithm for recommendation from implicit feedback.

Key assumption: user u prefers item i over a non-observed item j.
Bayesian Personalized Ranking

Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf

BPR-MF's task can be seen as filling the item-item matrix with 0/1 and obtaining the probability that i >_u j.
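BPR-MF maximizes ln σ(x_uij) over sampled triples, where x_uij = p_u·q_i − p_u·q_j is the difference of the factorization scores for i and j. A minimal Python sketch of one SGD update (standard BPR-MF without bias terms; names, values, and learning rates are illustrative):

```python
import math

def bpr_update(pu, qi, qj, lr=0.05, lam=0.01):
    """One BPR-MF SGD step on a <u, i, j> triple: gradient ascent on
    ln sigmoid(x_uij) - L2 regularization, where x_uij = pu.qi - pu.qj."""
    x_uij = sum(p * (a - b) for p, a, b in zip(pu, qi, qj))
    g = 1.0 / (1.0 + math.exp(x_uij))  # sigmoid(-x_uij), the gradient scale
    pu_new = [p + lr * (g * (a - b) - lam * p) for p, a, b in zip(pu, qi, qj)]
    qi_new = [a + lr * (g * p - lam * a) for p, a in zip(pu, qi)]
    qj_new = [b + lr * (-g * p - lam * b) for p, b in zip(pu, qj)]
    return pu_new, qi_new, qj_new

def score_diff(pu, qi, qj):
    return sum(p * (a - b) for p, a, b in zip(pu, qi, qj))

# Toy factors for one user and a positive/negative item pair
pu, qi, qj = [0.1, -0.2], [0.05, 0.1], [0.2, 0.0]

before = score_diff(pu, qi, qj)
for _ in range(100):
    pu, qi, qj = bpr_update(pu, qi, qj)
# repeated updates push the score of i above the score of j
print(score_diff(pu, qi, qj) > before)
```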
Train by BPR-Matrix Factorization
Predict by BPR-Matrix Factorization
Recommendation for an Implicit Feedback Dataset

1. Efficient top-k computation is important for prediction: scoring every pair is O(U × I).
2. Memory consumption is heavy where the item count |I| is large.
   • MyMediaLite requires lots of memory.
   • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings.
3. Better to avoid recomputing predictions every time.
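Point 1 above can be sketched in Python: scoring all items for one user is unavoidably O(I), but a bounded min-heap keeps only k candidates in memory instead of materializing and sorting all |I| scores, which is the same idea behind Hivemall's each_top_k (the function below is illustrative, not Hivemall's implementation):

```python
import heapq

def top_k_items(user_vec, item_vecs, k):
    """Return the k highest-scoring items for one user, keeping only a
    size-k min-heap instead of materializing all |I| scores."""
    heap = []  # min-heap of (score, item); smallest kept score is at heap[0]
    for item, vec in item_vecs.items():
        score = sum(u * v for u, v in zip(user_vec, vec))
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return [item for _, item in sorted(heap, reverse=True)]

item_vecs = {"item1": [1.0, 0.0], "item2": [0.0, 1.0],
             "item3": [0.5, 0.5], "itemI": [0.9, 0.1]}
print(top_k_items([1.0, 0.2], item_vecs, k=2))  # ['item1', 'itemI']
```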
We support machine learning in the Cloud.
Any feature requests? Or questions?
