Tdtechtalk20160425myui

Published on

Talk at TD techtalk. 2016/04/26

http://eventdots.jp/event/584571

Published in: Data & Analytics
  1. 1. Recommendation 101 using Hivemall Research Engineer Makoto YUI @myui <myui@treasure-data.com> 1 2016/04/25 Treasure Data Techtalk
  2. 2. Who am I? 2
      Ø 2015.04 Joined Treasure Data, Inc. (1st Research Engineer in Treasure Data)
      Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
      Ø 2009.03 Ph.D. in Computer Science from NAIST
      Ø Leader of the TD mountain-climbing club
      Ø 3 members (including 1 ghost member)
  3. 3. Who am I? 3
      Ø 2015.04 Joined Treasure Data, Inc. (1st Research Engineer in Treasure Data)
      Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
      Ø 2009.03 Ph.D. in Computer Science from NAIST
      Ø Leader of the TD mountain-climbing club
      Ø 3 members (including 1 ghost member)
  4. 4. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 4
  5. 5. What is Hivemall Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2 5 https://github.com/myui/hivemall
  6. 6. Hivemall's Vision: ML on SQL 6
      Classification with Mahout [figure]
      CREATE TABLE lr_model AS
      SELECT
        feature,
        -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features, label, ..) as (feature, weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      ✓ Machine Learning made easy for SQL developers (ML for the rest of us)
      ✓ Interactive and Stable APIs w/ SQL abstraction
      This SQL query automatically runs in parallel on Hadoop
  7. 7. Industry use cases of Hivemall 7
      Ø CTR prediction of Ad click logs
        • Freakout Inc., Fan Communications, and more
        • Replaced Spark MLlib w/ Hivemall at company X
      http://www.slideshare.net/masakazusano75/sano-hmm-20150512
  8. 8. Industry use cases of Hivemall 8
      Ø Gender prediction of Ad click logs
        • Scaleout Inc. and Fan Communications
      http://eventdots.jp/eventreport/458208
  9. 9. Industry use cases of Hivemall 9
      Ø Value prediction of real estate
        • Livesense
      http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall
  10. 10. Industry use cases of Hivemall 10
      Source: http://itnp.net/article/2016/02/18/2286.html
  11. 11. Industry use cases of Hivemall 11
      Ø Churn Detection
        • OISIX
      http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix
  12. 12. Churn prediction for a membership service 12
      • Recurring purchases by 100,000 members drive the company's overall revenue and profit,
        but there was no way to identify members at risk of cancelling in advance and prevent churn
      • Machine learning without statistical expertise
      • Granting points to members on the churn-prediction list halved the churn rate
      • Campaigns and events that carry churn risk can be surfaced, and the characteristic
        behavior of non-churning members can also be understood
      • Indirect service improvements are also realized, such as changing the UI according to the degree of risk
      • Machine learning builds a list of customers likely to cancel in the coming month from the past month's data
      • Concretely, each step (building the training table -> normalization -> building the model ->
        logistic regression) is implemented simply with queries using TD + Hivemall
      Data used for prediction: Web, Mobile, member attributes, behavior logs, complaint records,
        traffic source, service-usage information
      Direct measures: granting points, care calls, guiding users to success experiences
      Indirect measures: UI changes
  13. 13. Industry use cases of Hivemall 13
      Ø Recommendation
        • Portal site
  14. 14. List of Features in Hivemall v0.3.x 14
      Classification (both binary- and multi-class)
        ✓ Perceptron
        ✓ Passive Aggressive (PA)
        ✓ Confidence Weighted (CW)
        ✓ Adaptive Regularization of Weight Vectors (AROW)
        ✓ Soft Confidence Weighted (SCW)
        ✓ AdaGrad+RDA
      Regression
        ✓ Logistic Regression (SGD)
        ✓ PA Regression
        ✓ AROW Regression
        ✓ AdaGrad
        ✓ AdaDELTA
      kNN and Recommendation
        ✓ Minhash and b-Bit Minhash (LSH variant)
        ✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
        ✓ Matrix Factorization
      Feature engineering
        ✓ Feature Hashing
        ✓ Feature Scaling (normalization, z-score)
        ✓ TF-IDF vectorizer
        ✓ Polynomial Expansion
      Anomaly Detection
        ✓ Local Outlier Factor
      Top-k query processing
  15. 15. Features supported in Hivemall v0.4.0 15
      1. RandomForest
         • classification, regression
      2. Factorization Machine
         • classification, regression (factorization)
  16. 16. Features supported in Hivemall v0.4.1-alpha 16
      1. NLP Tokenizer (Japanese morphological analysis)
         • Kuromoji
      2. Mini-batch Gradient Descent
      3. RandomForest scalability improvements
      Treasure Data is operating Hivemall v0.4.1-alpha.6; the above features are already supported
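      To make the tokenizer item concrete, here is a minimal sketch of invoking the Kuromoji-backed
      tokenizer, assuming Hivemall's tokenize_ja UDF from the NLP module; optional arguments (mode,
      stopwords, stoptags) exist but vary by version, so only the simplest form is shown.

      -- split a Japanese sentence into an array of tokens (Kuromoji under the hood)
      SELECT tokenize_ja('Hivemallで機械学習');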
  17. 17. Features supported in Hivemall v0.4.1 17
      1. Matrix Factorization for Implicit Feedback
      2. Field-aware Factorization Machines
      v0.4.1-RC1 will be released soon!
  18. 18. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 18
  19. 19. 19 Recommendation
  20. 20. Quote about Recommendation 20
      "If I have 3 million customers on the Web, I should have 3 million stores on the Web."
      Jeff Bezos, CEO of Amazon.com
  21. 21. Recommendation 101 21
      • Explicit Feedback
        • Item Rating
        • Item Ranking
      • Implicit Feedback
        • Positive-only Implicit Feedback
          • Bought (or not)
          • Clicked (or not)
          • Converted (or not)
  22. 22. Recommendation 101 22
      • Explicit Feedback
        • Item Rating
        • Item Ranking
      • Implicit Feedback
        • Positive-only Implicit Feedback
          • Bought (or not)
          • Clicked (or not)
          • Converted (or not)
      Public datasets for evaluation exist, and there are many samples and tutorial articles
  23. 23. Recommendation 101 23
      • Explicit Feedback
        • Item Rating
        • Item Ranking
      • Implicit Feedback
        • Positive-only Implicit Feedback
          • Bought (or not)
          • Clicked (or not)
          • Converted (or not)
      Few libraries support this, and there are few samples
  24. 24. Explicit Feedback 24
      U/I      | Item 1 | Item 2 | Item 3 |  … | Item I
      User 1   |        |   5    |        |    |   3
      User 2   |   2    |        |   1    |    |
      …        |        |   3    |        |  4 |
      User U   |   1    |        |   4    |    |   5
  25. 25. Explicit Feedback 25
      U/I      | Item 1 | Item 2 | Item 3 |  … | Item I
      User 1   |   ?    |   5    |   ?    |  ? |   3
      User 2   |   2    |   ?    |   1    |  ? |   ?
      …        |   ?    |   3    |   ?    |  4 |   ?
      User U   |   1    |   ?    |   4    |  ? |   5
  26. 26. Explicit Feedback 26
      U/I      | Item 1 | Item 2 | Item 3 |  … | Item I
      User 1   |   ?    |   5    |   ?    |  ? |   3
      User 2   |   2    |   ?    |   1    |  ? |   ?
      …        |   ?    |   3    |   ?    |  4 |   ?
      User U   |   1    |   ?    |   4    |  ? |   5
      • Very sparse dataset
      • Number of feedbacks is small
      • Unknown data >> training data
      • User preference for rated items is clear
      • Has negative feedback
      • Evaluation is easy (MAE/RMSE)
  27. 27. Implicit Feedback 27
      U/I     Item 1  Item 2  Item 3  …  Item I
      User 1   ⭕ ⭕
      User 2   ⭕ ⭕
      …        ⭕ ⭕
      User U   ⭕ ⭕ ⭕
  28. 28. Implicit Feedback 28
      U/I     Item 1  Item 2  Item 3  …  Item I
      User 1   ⭕ ⭕
      User 2   ⭕ ⭕
      …        ⭕ ⭕
      User U   ⭕ ⭕ ⭕
      • Sparse dataset
      • Number of feedbacks is large
      • User preference is unclear
      • No negative feedback
        • Known feedback may be negative
        • Unknown feedback may be positive
      • Evaluation is not so easy (NDCG, Prec@K, Recall@K)
  29. 29. Pros and Cons 29
                           Explicit Feedback   Implicit Feedback
      Data size                   :(                  :)
      User preference             :)                  :(
      Dislike/Unknown             :)                  :(
      Impact of Bias              :(                  :)
  30. 30. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 30
  31. 31. Matrix Factorization/Completion 31
      Factorize a matrix into a product of matrices with k latent factors
  32. 32. 32 Matrix Completion How-to • Mean Rating μ • Rating Bias for each Item Bi • Rating Bias for each User Bu
  33. 33. Criteria of Biased MF 33
      [equation figure] Terms labeled on the slide: diff in prediction, mean rating,
      bias for each user/item, factorization (matrix factorization), and regularization.
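      For reference, the standard biased matrix-factorization criterion that these labels describe
      (the well-known biased MF objective, supplied here for readability rather than read off the
      slide image):

      \min_{P, Q, b} \sum_{(u,i) \in \mathcal{K}}
        \bigl( r_{ui} - \mu - b_u - b_i - \mathbf{p}_u^{\top}\mathbf{q}_i \bigr)^2
        + \lambda \bigl( \|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2 + b_u^2 + b_i^2 \bigr)

      Here μ is the mean rating, b_u and b_i are the per-user and per-item biases, p_u·q_i is the
      factorization term, the squared part is the diff in prediction, and the λ term is the regularization.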
  34. 34. Training of Matrix Factorization 34
      Supports iterative training using a local disk cache
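      A minimal sketch of what such a training query looks like, assuming Hivemall's train_mf_sgd
      UDTF roughly as in the Hivemall MovieLens guide; the table and column names (training, userid,
      movieid, rating) and the option string are placeholders, and exact signatures can differ by
      Hivemall version:

      -- mappers train local models; reducers average them per index (model averaging)
      CREATE TABLE mf_model AS
      SELECT
        idx,
        array_avg(u_rank) as Pu,  -- user latent vector
        array_avg(m_rank) as Qi,  -- item latent vector
        avg(u_bias)       as Bu,  -- user bias
        avg(m_bias)       as Bi   -- item bias
      FROM (
        SELECT
          train_mf_sgd(userid, movieid, rating, '-factor 10 -iterations 50')
            as (idx, u_rank, m_rank, u_bias, m_bias)
        FROM training
      ) t
      GROUP BY idx;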
  35. 35. 35 Prediction of Matrix Factorization
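      And a corresponding prediction sketch, assuming an mf_predict UDF that combines the learned
      vectors and biases with the mean rating mu; the join layout and names are illustrative rather
      than the literal query on the slide:

      -- predicted rating = mu + Bu + Bi + Pu . Qi
      SELECT
        t.userid,
        t.movieid,
        mf_predict(p1.Pu, p2.Qi, p1.Bu, p2.Bi, 3.6) as predicted  -- 3.6 = example mean rating mu
      FROM testing t
      LEFT OUTER JOIN mf_model p1 ON (t.userid  = p1.idx)
      LEFT OUTER JOIN mf_model p2 ON (t.movieid = p2.idx);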
  36. 36. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 36
  37. 37. Factorization Machine 37
      [figure: Matrix Factorization]
  38. 38. 38 Factorization Machine Context information (e.g., time) can be considered Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
  39. 39. Factorization Machine 39
      Factorization model with degree=2 (2-way interaction).
      [equation figure] Terms labeled on the slide: global bias, regression coefficient of the
      j-th variable, and the pairwise interaction factorization.
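      For reference, the degree-2 factorization machine model from Rendle (2010), which these labels
      describe (reproduced from the cited paper, not read off the slide image):

      \hat{y}(\mathbf{x}) = w_0
        + \sum_{j=1}^{n} w_j x_j
        + \sum_{j=1}^{n} \sum_{j'=j+1}^{n} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}

      w_0 is the global bias, w_j the regression coefficient of the j-th variable, and the last sum
      is the pairwise interaction term, factorized via k-dimensional vectors v_j.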
  40. 40. Training data for Factorization Machines 40
      Each feature takes a LibSVM-like format: <feature[:weight]>
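      A tiny illustrative row in that format (hypothetical rowid, features, and label; a feature
      written without an explicit weight is treated as weight 1.0):

      -- one training row: a features array plus a label (e.g., a rating)
      SELECT
        1                              as rowid,
        array('2:1.0', '7:0.5', '25')  as features,  -- '25' means feature 25 with weight 1.0
        5.0                            as label;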
  41. 41. 41 Training of Factorization Machines
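      A minimal sketch of the FM training step, assuming Hivemall's train_fm UDTF (available since
      v0.4.0 per the earlier slide); the option string and table names are placeholders and may
      differ from the exact query shown on the slide:

      -- learn an averaged FM model: a linear weight Wi and a factor vector Vif per feature
      CREATE TABLE fm_model AS
      SELECT
        feature,
        avg(Wi)        as Wi,
        array_avg(Vif) as Vif
      FROM (
        SELECT
          train_fm(features, label, '-factor 10 -iters 50') as (feature, Wi, Vif)
        FROM training
      ) t
      GROUP BY feature;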
  42. 42. 42 Prediction of Factorization Machines
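      And a corresponding prediction sketch, assuming fm_predict aggregates the FM terms over a
      row's exploded features (extract_feature and extract_weight are Hivemall helper UDFs); treat
      the exact layout as an assumption rather than the literal query on the slide:

      -- score each test row by summing the FM terms over its features
      SELECT
        e.rowid,
        fm_predict(m.Wi, m.Vif, e.Xi) as predicted
      FROM (
        SELECT
          t.rowid,
          extract_feature(fv) as feature,
          extract_weight(fv)  as Xi
        FROM testing t
        LATERAL VIEW explode(t.features) tt as fv
      ) e
      LEFT OUTER JOIN fm_model m ON (e.feature = m.feature)
      GROUP BY e.rowid;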
  43. 43. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 43
  44. 44. Implicit Feedback 44
      A naïve :( approach: fill the unknown cells with negative feedback
  45. 45. Sampling scheme for Implicit Feedback 45
      Sample pairs <u, i, j> of a positive item i and a negative item j for each user u
      • Uniform user sampling
        Ø Sample a user, then sample a pair
      • Uniform pair sampling
        Ø Sample pairs directly (distribution follows the original dataset)
      • With-replacement or without-replacement sampling
      U/I     Item 1  Item 2  Item 3  …  Item I
      User 1   ⭕ ⭕
      User 2   ⭕ ⭕
      …        ⭕ ⭕
      User U   ⭕ ⭕ ⭕
      Default Hivemall sampling scheme: uniform user sampling, with replacement
  46. 46. Bayesian Personalized Ranking 46
      • Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
      • One of the most proven(?) algorithms for recommendation from implicit feedback
      Key assumption: user u prefers item i over a non-observed item j
  47. 47. Bayesian Personalized Ranking 47
      Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
  48. 48. Bayes' Theorem 48
      Bayes' theorem: the posterior is proportional to the product of the likelihood and the prior.
      Here Θ denotes the model parameters we want to estimate, and >u denotes user u's preference structure.
      Given user u's preference structure >u, we want the posterior distribution P(Θ | >u) of the model parameters Θ.
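      In symbols, the relation the slide states:

      p(\Theta \mid >_u) \;\propto\; p(>_u \mid \Theta)\, p(\Theta)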
  49. 49. Bayesian Personalized Ranking 49
      Likelihood: for simplicity, assume the users' preferences >u are mutually independent, so the
      likelihood can be expressed as the product of p(i >u j | Θ) over all (u, i, j).
      The likelihood is defined by applying the sigmoid function to the prediction computed from u, i, j.
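      In symbols, with D_S the set of sampled (u, i, j) triples and x̂_uij(Θ) the model's prediction
      for the triple (for MF, x̂_uij = x̂_ui − x̂_uj), as in Rendle et al. (2009):

      p(>_u \mid \Theta) \;=\; \prod_{(u,i,j) \in D_S} \sigma\!\bigl( \hat{x}_{uij}(\Theta) \bigr),
      \qquad \sigma(x) = \frac{1}{1 + e^{-x}}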
  50. 50. Bayesian Personalized Ranking 50
      \mathrm{BPR\text{-}OPT} = \max_{\Theta} \; \ln p(>_u \mid \Theta)\, p(\Theta)
        = \max_{\Theta} \sum_{(u,i,j) \in D_S} \ln \sigma\!\bigl( \hat{x}_{uij}(\Theta) \bigr) \;-\; \lambda \|\Theta\|^2
      ln-sigmoid serves as the loss function; the λ term is the regularization.
  51. 51. Train by BPR-Matrix Factorization 51
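      A minimal sketch of BPR-MF training in Hivemall, assuming the train_bprmf UDTF added for
      implicit feedback around v0.4.1; the exact function name, output columns (idx, Pu, Qi, Bi),
      table names, and options are assumptions patterned on the matrix-factorization API above,
      not the literal query on the slide:

      -- learn BPR-MF factors from positive-only (user, item) feedback;
      -- negative items j are sampled per the scheme on slide 45
      CREATE TABLE bprmf_model AS
      SELECT
        idx,
        array_avg(Pu) as Pu,  -- user latent vector
        array_avg(Qi) as Qi,  -- item latent vector
        avg(Bi)       as Bi   -- item bias
      FROM (
        SELECT
          train_bprmf(userid, itemid, '-factor 10 -iter 30') as (idx, Pu, Qi, Bi)
        FROM implicit_feedback
      ) t
      GROUP BY idx;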
  52. 52. 52 Predict by BPR-Matrix Factorization
  53. 53. 53 Predict by BPR-Matrix Factorization
  54. 54. 54 Predict by BPR-Matrix Factorization
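      Prediction then scores user-item pairs with the learned factors; a sketch assuming a
      bprmf_predict UDF (again an assumed name) that computes Pu . Qi + Bi, and assuming the model
      table from the training sketch keeps user-side and item-side rows under a shared idx:

      -- score all user x item pairs (O(U * I) candidates; see the next slide on top-k)
      SELECT
        u.idx as userid,
        i.idx as itemid,
        bprmf_predict(u.Pu, i.Qi, i.Bi) as score
      FROM bprmf_model u
      CROSS JOIN bprmf_model i
      WHERE u.Pu IS NOT NULL   -- user-side rows
        AND i.Qi IS NOT NULL;  -- item-side rows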
  55. 55. Recommendation for Implicit Feedback Dataset 55
      1. Efficient top-k computation is important because prediction is O(U * I)
      2. Memory consumption is heavy when the number of items |I| is large
         • MyMediaLite requires lots of memory
         • Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings
      3. Better to avoid recomputing predictions every time
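      For point 1, Hivemall's each_top_k UDTF can retain only the k best items per user while
      scanning scored pairs, instead of materializing all U * I predictions. A sketch, assuming the
      scores from the previous example were stored in a scored_pairs table (a placeholder name);
      the argument order of each_top_k may differ by version:

      -- keep the top 10 items per user; the input must be clustered by the grouping key
      SELECT
        each_top_k(10, userid, score, userid, itemid)
          as (rank, score, userid, itemid)
      FROM (
        SELECT userid, itemid, score
        FROM scored_pairs
        CLUSTER BY userid
      ) t;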
  56. 56. Agenda 1. Short Introduction to Hivemall 2. Recommendation 101 3. Matrix Factorization 4. Factorization Machines 5. Bayesian Personalized Ranking 6. Conclusion 56
  57. 57. Conclusion and Takeaway 57
      Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.
      Treasure Data provides ML-as-a-Service using the latest version of Hivemall.
      Major features in the coming v0.4.1:
      • BPR-MF and BPR-FM (?) for Implicit Feedback
      • Evaluation metrics (NDCG, Prec/Recall@k)
      • Field-aware Factorization Machines
      • One-class Passive Aggressive
      • (Kernelized Passive Aggressive)
  58. 58. Blog article about Hivemall 58
      http://blog-jp.treasuredata.com/
      A series on solving Kaggle problems using TD, Hivemall, Jupyter, and Pandas-TD
  59. 59. We support machine learning in the cloud 59
      Any feature requests? Or questions?
