Recommendation	101
using	Hivemall
Research	Engineer
Makoto	YUI	@myui
<myui@treasure-data.com>
1
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
2
What	is	Hivemall
Scalable	machine	learning	library	built	
as	a	collection	of	Hive	UDFs,	licensed	
under	the	Apache	License	v2
3
https://github.com/myui/hivemall
Hivemall’s Vision:	ML	on	SQL
[Figure: the same classification task written in Java with Mahout, shown for comparison]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs with SQL abstraction
This SQL query automatically runs in parallel on Hadoop.
4
How	to	use	Hivemall
[Workflow diagram: a Label and a Feature Vector feed Training, which produces a Prediction Model; the Prediction Model and a Feature Vector feed Prediction, which outputs a Label]
Data preparation
5
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How	to	use	Hivemall	- Data	preparation
Define	a	Hive	table	for	training/testing	data
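Each element of the features array is an "index:value" string (or just "index" when the weight is 1). As a quick sanity check after defining the table, one might run the following (the values in the comment are illustrative, not taken from the actual dataset):

SELECT rowid, label, features
FROM e2006tfidf_train
LIMIT 1;
-- e.g. 1 | -3.89 | ["10:0.0453","23:0.1172", ...]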
6
How	to	use	Hivemall
Feature	Engineering
7
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
Applying Min-Max Feature Normalization
How	to	use	Hivemall	- Feature	Engineering
Transforming	a	label	value	
to	a	value	between	0.0	and	1.0
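The ${min_label} and ${max_label} substitution variables are assumed to be computed beforehand; a sketch of how they could be obtained:

-- compute the min/max of the label to substitute into ${min_label} / ${max_label}
select min(label) as min_label, max(label) as max_label
from e2006tfidf_train;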
8
How	to	use	Hivemall
Training
9
How	to	use	Hivemall	- Training
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;
Training by logistic regression:
A map-only task learns the prediction model.
Map outputs are shuffled to reducers by feature.
Reducers perform model averaging in parallel.
10
How	to	use	Hivemall	- Training
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature;
Training a Confidence Weighted (CW) classifier.
voted_avg votes on whether to average the positive or the negative weights; e.g., given +0.7, +0.3, +0.2, -0.1, +0.7, the positive weights win the vote.
11
How	to	use	Hivemall
Prediction
12
How	to	use	Hivemall	- Prediction
CREATE	TABLE	lr_predict
as
SELECT
t.rowid,	
sigmoid(sum(m.weight))	 as	prob
FROM
testing_exploded t	LEFT	OUTER	JOIN
lr_model m	ON	(t.feature =	m.feature)
GROUP	BY	
t.rowid
Prediction	is	done	by	LEFT	OUTER	JOIN
between	test	data	and	prediction	model
No	need	to	load	the	entire	model	into	memory
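The testing_exploded table is assumed to hold one row per (rowid, feature) of the test data. A sketch of how it could be built with Hive's explode and Hivemall's extract_feature/extract_weight (for quantitative features, the extracted value would also be multiplied into the sum above):

CREATE TABLE testing_exploded AS
SELECT
  t.rowid,
  extract_feature(fv) as feature, -- "10:0.04" -> "10"
  extract_weight(fv) as value     -- "10:0.04" -> 0.04
FROM
  testing t LATERAL VIEW explode(t.features) t2 AS fv;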
13
14
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression
List	of	supported	Algorithms
List	of	supported	Algorithms
15
Classification	
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice.
Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are a good fit where features are sparse and categorical.
List	of	Algorithms	for	Recommendation
16
K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclidean/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
Hivemall's each_top_k function is useful for recommending the top-k items, as sketched below.
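A sketch of per-user top-2 recommendation with each_top_k; the item_scores table and its columns are hypothetical, and each_top_k expects its input to be clustered by the grouping key:

SELECT
  each_top_k(2, userid, score, userid, itemid)
    as (rank, score, userid, itemid)
FROM (
  SELECT userid, itemid, score
  FROM item_scores
  CLUSTER BY userid
) t;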
Other	Supported	Algorithms
17
Anomaly Detection
✓ Local Outlier Factor (LOF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF Vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English Text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
18
• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
19
Recommendation	101
• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
20
Recommendation	101
Case	for	Coursehero?
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   5    |        |   |   3
User 2 |   2    |        |   1    |   |
…      |        |   3    |        | 4 |
User U |   1    |        |   4    |   |   5
21
Explicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5
22
Explicit	Feedback
23
Explicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5
• Very sparse dataset
• The number of feedback entries is small
• Unknown data >> training data
• User preference for rated items is clear
• Has negative feedback
• Evaluation is easy (MAE/RMSE)
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
24
Implicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
25
Implicit	Feedback
• Sparse dataset
• The number of feedback entries is large
• User preference is unclear
• No negative feedback
• Known feedback may actually be negative
• Unknown feedback may actually be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
26
Pros and Cons
                | Explicit Feedback | Implicit Feedback
Data size       | ☹ (small)         | ☺ (large)
User preference | ☺ (clear)         | ☹ (unclear)
Dislike/Unknown | ☺                 | ☹
Impact of bias  | ☹                 | ☺
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
27
28
Matrix	Factorization/Completion
Factorize a matrix into the product of two low-rank matrices with k latent factors.
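In symbols (a standard formulation, not verbatim from the slide): the U-by-I rating matrix R is approximated by two rank-k factor matrices,

R \approx P\,Q^{\top}, \quad P \in \mathbb{R}^{U \times k}, \; Q \in \mathbb{R}^{I \times k},
\qquad \hat{r}_{ui} = p_u^{\top} q_i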
29
Matrix	Completion How-to
• Mean	Rating	μ
• Rating	Bias	for	each	Item Bi
• Rating	Bias	for	each	User	Bu
30
Criteria of Biased MF
[Equation figure with labeled terms: diff in prediction, matrix factorization, mean rating, bias for each user/item, and regularization]
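A common form of the biased MF objective that these labels refer to (a reconstruction; the slide's exact notation may differ):

\min_{P,\,Q,\,b} \sum_{(u,i) \in K} \bigl( r_{ui} - \mu - b_u - b_i - p_u^{\top} q_i \bigr)^2
  + \lambda \bigl( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \bigr)

where μ is the mean rating, b_u and b_i are the user/item biases, p_u^{\top} q_i is the factorization term, the squared difference is the diff in prediction, and λ controls regularization. The predicted rating is \hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i.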
31
Training	of	Matrix	Factorization
Iterative training is supported, using a local disk cache of the training data.
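A training query along the lines of the Hivemall MovieLens example (a sketch: the table/column names and the '-factor 10 -mu ${mu}' options are assumptions to check against the train_mf_sgd documentation):

CREATE TABLE mf_model AS
SELECT
  idx,
  array_avg(u_rank) as Pu, -- user latent factors
  array_avg(m_rank) as Qi, -- item latent factors
  avg(u_bias) as Bu,       -- user bias
  avg(m_bias) as Bi        -- item bias
FROM (
  SELECT
    train_mf_sgd(userid, itemid, rating, '-factor 10 -mu ${mu}')
      as (idx, u_rank, m_rank, u_bias, m_bias)
  FROM training
) t
GROUP BY idx;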
32
Prediction	of	Matrix	Factorization
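Prediction then joins the model twice, once on the user id and once on the item id (a sketch; mf_predict's argument order should be confirmed against the Hivemall documentation):

SELECT
  t.userid,
  t.itemid,
  mf_predict(p1.Pu, p2.Qi, p1.Bu, p2.Bi, ${mu}) as predicted -- mu + Bu + Bi + Pu.Qi
FROM testing t
LEFT OUTER JOIN mf_model p1 ON (t.userid = p1.idx)
LEFT OUTER JOIN mf_model p2 ON (t.itemid = p2.idx);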
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
33
Still	in	Beta	but	will	officially	be	supported	soon
34
Implicit	Feedback
A naïve approach: fill every unknown cell with negative feedback (works poorly).
35
Sampling scheme for Implicit Feedback
Sample triples <u, i, j> of a positive item i and a negative item j for each user u
• Uniform user sampling
  → sample a user, then sample a pair for that user
• Uniform pair sampling
  → sample pairs directly (the distribution follows the original dataset)
• With-replacement or without-replacement sampling
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
Default	Hivemall	sampling	
scheme:
- Uniform	user	sampling
- With	replacement
• Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
• Arguably the most proven algorithm for recommendation from implicit feedback
36
Bayesian Personalized Ranking
Key	assumption:	user	u prefers	item	i over	non-
observed	item j
Bayesian Personalized Ranking
37
Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf
BPR-MF's task can be viewed as filling in the 0/1 entries of each user's item-item preference matrix, i.e., estimating the probability that user u prefers item i over item j (i >u j).
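For reference, the maximization criterion from the cited paper is

\text{BPR-OPT} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_{\Theta} \lVert \Theta \rVert^2,
\qquad \hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}

where σ is the sigmoid and, for BPR-MF, \hat{x}_{ui} = p_u^{\top} q_i.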
Train by BPR-Matrix Factorization
38
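A training sketch, assuming a Hivemall BPR-MF UDTF named train_bprmf that consumes (user, positive item, negative item) triples; the function name, return columns, and option string are assumptions, so check the Hivemall documentation for the exact signature:

CREATE TABLE bprmf_model AS
SELECT
  idx,
  array_avg(Pu) as Pu, -- user latent factors
  array_avg(Qi) as Qi, -- item latent factors
  avg(Bi) as Bi        -- item bias
FROM (
  SELECT
    train_bprmf(userid, pos_itemid, neg_itemid, '-factor 10')
      as (idx, Pu, Qi, Bi)
  FROM training_pairs -- <u, i, j> triples produced by the sampling scheme above
) t
GROUP BY idx;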
39
Predict	by	BPR-Matrix	Factorization
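A prediction sketch under the same assumptions (bprmf_predict is an assumed UDF computing p_u . q_i + b_i; top-k selection per user can then be done with each_top_k as shown earlier):

SELECT
  t.userid,
  t.itemid,
  bprmf_predict(p1.Pu, p2.Qi, p2.Bi) as score
FROM user_item_candidates t -- hypothetical (userid, itemid) pairs to score
LEFT OUTER JOIN bprmf_model p1 ON (t.userid = p1.idx)
LEFT OUTER JOIN bprmf_model p2 ON (t.itemid = p2.idx);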
42
Recommendation	for	Implicit	Feedback	Dataset
1. Efficient top-k computation is important: naive prediction costs O(U × I)
2. Memory consumption is heavy where the number of items |I| is large
• MyMediaLite requires lots of memory
• Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings
3. Better to avoid recomputing predictions every time
43
We support machine learning in the cloud.
Any feature requests? Or questions?
Recommendation 101 using Hivemall