Recommendation	101
using	Hivemall
Research	Engineer
Makoto	YUI	@myui
<myui@treasure-data.com>
1
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
2
What	is	Hivemall
Scalable	machine	learning	library	built	
as	a	collection	of	Hive	UDFs,	licensed	
under	the	Apache	License	v2
3
https://github.com/myui/hivemall
Hivemall’s Vision:	ML	on	SQL
[Figure: the same classification task written in Java with Mahout, shown for comparison]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs with SQL abstraction
This SQL query automatically runs in parallel on Hadoop.
4
How	to	use	Hivemall
[Workflow diagram: a Label and a Feature Vector feed Training, which produces a Prediction Model; the Prediction Model and a Feature Vector feed Prediction, which outputs a Label]
Data preparation
5
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How	to	use	Hivemall	- Data	preparation
Define	a	Hive	table	for	training/testing	data
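Each element of the features array is an "index:value" string (or just "index" when the weight is 1). As a quick sanity check after defining the table, one might run the following (the values in the comment are illustrative, not taken from the actual dataset):

SELECT rowid, label, features
FROM e2006tfidf_train
LIMIT 1;
-- e.g. 1 | -3.89 | ["10:0.0453","23:0.1172", ...]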
6
How	to	use	Hivemall
Feature	Engineering
7
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
Applying Min-Max Feature Normalization
How	to	use	Hivemall	- Feature	Engineering
Transforming	a	label	value	
to	a	value	between	0.0	and	1.0
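The ${min_label} and ${max_label} substitution variables are assumed to be computed beforehand; a sketch of how they could be obtained:

-- compute the min/max of the label to substitute into ${min_label} / ${max_label}
select min(label) as min_label, max(label) as max_label
from e2006tfidf_train;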
8
How	to	use	Hivemall
Training
9
How	to	use	Hivemall	- Training
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature;
Training by logistic regression:
A map-only task learns the prediction model.
Map outputs are shuffled to reducers by feature.
Reducers perform model averaging in parallel.
10
How	to	use	Hivemall	- Training
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature;
Training a Confidence Weighted (CW) classifier.
voted_avg votes on whether to average the positive or the negative weights; e.g., given +0.7, +0.3, +0.2, -0.1, +0.7, the positive weights win the vote.
11
How	to	use	Hivemall
Prediction
12
How	to	use	Hivemall	- Prediction
CREATE	TABLE	lr_predict
as
SELECT
t.rowid,	
sigmoid(sum(m.weight))	 as	prob
FROM
testing_exploded t	LEFT	OUTER	JOIN
lr_model m	ON	(t.feature =	m.feature)
GROUP	BY	
t.rowid
Prediction	is	done	by	LEFT	OUTER	JOIN
between	test	data	and	prediction	model
No	need	to	load	the	entire	model	into	memory
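The testing_exploded table is assumed to hold one row per (rowid, feature) of the test data. A sketch of how it could be built with Hive's explode and Hivemall's extract_feature/extract_weight (for quantitative features, the extracted value would also be multiplied into the sum above):

CREATE TABLE testing_exploded AS
SELECT
  t.rowid,
  extract_feature(fv) as feature, -- "10:0.04" -> "10"
  extract_weight(fv) as value     -- "10:0.04" -> 0.04
FROM
  testing t LATERAL VIEW explode(t.features) t2 AS fv;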
13
14
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression
List	of	supported	Algorithms
List	of	supported	Algorithms
15
Classification	
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice.
Try RandomForest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are a good fit where features are sparse and categorical.
List	of	Algorithms	for	Recommendation
16
K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclidean/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
Hivemall's each_top_k function is useful for recommending the top-k items, as sketched below.
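A sketch of per-user top-2 recommendation with each_top_k; the item_scores table and its columns are hypothetical, and each_top_k expects its input to be clustered by the grouping key:

SELECT
  each_top_k(2, userid, score, userid, itemid)
    as (rank, score, userid, itemid)
FROM (
  SELECT userid, itemid, score
  FROM item_scores
  CLUSTER BY userid
) t;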
Other	Supported	Algorithms
17
Anomaly Detection
✓ Local Outlier Factor (LOF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF Vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English Text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
18
• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
19
Recommendation	101
• Explicit Feedback
  • Item Rating
  • Item Ranking
• Implicit Feedback
  • Positive-only Implicit Feedback
    • Bought (or not)
    • Clicked (or not)
    • Converted (or not)
20
Recommendation	101
Case	for	Coursehero?
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   5    |        |   |   3
User 2 |   2    |        |   1    |   |
…      |        |   3    |        | 4 |
User U |   1    |        |   4    |   |   5
21
Explicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5
22
Explicit	Feedback
23
Explicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5
• Very sparse dataset
• The number of feedback entries is small
• Unknown data >> training data
• User preference for rated items is clear
• Has negative feedback
• Evaluation is easy (MAE/RMSE)
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
24
Implicit	Feedback
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
25
Implicit	Feedback
• Sparse dataset
• The number of feedback entries is large
• User preference is unclear
• No negative feedback
• Known feedback may actually be negative
• Unknown feedback may actually be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
26
Pros and Cons
                | Explicit Feedback | Implicit Feedback
Data size       | ☹ (small)         | ☺ (large)
User preference | ☺ (clear)         | ☹ (unclear)
Dislike/Unknown | ☺                 | ☹
Impact of bias  | ☹                 | ☺
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
27
28
Matrix	Factorization/Completion
Factorize a matrix into the product of two low-rank matrices with k latent factors.
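In symbols (a standard formulation, not verbatim from the slide): the U-by-I rating matrix R is approximated by two rank-k factor matrices,

R \approx P\,Q^{\top}, \quad P \in \mathbb{R}^{U \times k}, \; Q \in \mathbb{R}^{I \times k},
\qquad \hat{r}_{ui} = p_u^{\top} q_i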
29
Matrix	Completion How-to
• Mean	Rating	μ
• Rating	Bias	for	each	Item Bi
• Rating	Bias	for	each	User	Bu
30
Criteria of Biased MF
[Equation figure with labeled terms: diff in prediction, matrix factorization, mean rating, bias for each user/item, and regularization]
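A common form of the biased MF objective that these labels refer to (a reconstruction; the slide's exact notation may differ):

\min_{P,\,Q,\,b} \sum_{(u,i) \in K} \bigl( r_{ui} - \mu - b_u - b_i - p_u^{\top} q_i \bigr)^2
  + \lambda \bigl( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \bigr)

where μ is the mean rating, b_u and b_i are the user/item biases, p_u^{\top} q_i is the factorization term, the squared difference is the diff in prediction, and λ controls regularization. The predicted rating is \hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i.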
31
Training	of	Matrix	Factorization
Iterative training is supported, using a local disk cache of the training data.
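A training query along the lines of the Hivemall MovieLens example (a sketch: the table/column names and the '-factor 10 -mu ${mu}' options are assumptions to check against the train_mf_sgd documentation):

CREATE TABLE mf_model AS
SELECT
  idx,
  array_avg(u_rank) as Pu, -- user latent factors
  array_avg(m_rank) as Qi, -- item latent factors
  avg(u_bias) as Bu,       -- user bias
  avg(m_bias) as Bi        -- item bias
FROM (
  SELECT
    train_mf_sgd(userid, itemid, rating, '-factor 10 -mu ${mu}')
      as (idx, u_rank, m_rank, u_bias, m_bias)
  FROM training
) t
GROUP BY idx;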
32
Prediction	of	Matrix	Factorization
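Prediction then joins the model twice, once on the user id and once on the item id (a sketch; mf_predict's argument order should be confirmed against the Hivemall documentation):

SELECT
  t.userid,
  t.itemid,
  mf_predict(p1.Pu, p2.Qi, p1.Bu, p2.Bi, ${mu}) as predicted -- mu + Bu + Bi + Pu.Qi
FROM testing t
LEFT OUTER JOIN mf_model p1 ON (t.userid = p1.idx)
LEFT OUTER JOIN mf_model p2 ON (t.itemid = p2.idx);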
Agenda
1. Introduction	to	Hivemall
2. Recommendation	101
3. Matrix	Factorization
4. Bayesian Personalized Ranking
33
Still	in	Beta	but	will	officially	be	supported	soon
34
Implicit	Feedback
A naïve approach: fill every unknown cell with negative feedback (works poorly).
35
Sampling scheme for Implicit Feedback
Sample triples <u, i, j> of a positive item i and a negative item j for each user u
• Uniform user sampling
  → sample a user, then sample a pair for that user
• Uniform pair sampling
  → sample pairs directly (the distribution follows the original dataset)
• With-replacement or without-replacement sampling
U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |   |   ⭕
User 2 |   ⭕   |        |   ⭕   |   |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |   |   ⭕
Default	Hivemall	sampling	
scheme:
- Uniform	user	sampling
- With	replacement
• Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
• Arguably the most proven algorithm for recommendation from implicit feedback
36
Bayesian Personalized Ranking
Key	assumption:	user	u prefers	item	i over	non-
observed	item j
Bayesian Personalized Ranking
37
Image taken from Rendle et al., "BPR: Bayesian Personalized Ranking from Implicit Feedback", Proc. UAI, 2009.
http://www.algo.uni-konstanz.de/members/rendle/pdf/Rendle_et_al2009-Bayesian_Personalized_Ranking.pdf
BPR-MF's task can be viewed as filling in the 0/1 entries of each user's item-item preference matrix, i.e., estimating the probability that user u prefers item i over item j (i >u j).
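For reference, the maximization criterion from the cited paper is

\text{BPR-OPT} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_{\Theta} \lVert \Theta \rVert^2,
\qquad \hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}

where σ is the sigmoid and, for BPR-MF, \hat{x}_{ui} = p_u^{\top} q_i.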
Train by BPR-Matrix Factorization
38
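A training sketch, assuming a Hivemall BPR-MF UDTF named train_bprmf that consumes (user, positive item, negative item) triples; the function name, return columns, and option string are assumptions, so check the Hivemall documentation for the exact signature:

CREATE TABLE bprmf_model AS
SELECT
  idx,
  array_avg(Pu) as Pu, -- user latent factors
  array_avg(Qi) as Qi, -- item latent factors
  avg(Bi) as Bi        -- item bias
FROM (
  SELECT
    train_bprmf(userid, pos_itemid, neg_itemid, '-factor 10')
      as (idx, Pu, Qi, Bi)
  FROM training_pairs -- <u, i, j> triples produced by the sampling scheme above
) t
GROUP BY idx;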
39
Predict	by	BPR-Matrix	Factorization
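A prediction sketch under the same assumptions (bprmf_predict is an assumed UDF computing p_u . q_i + b_i; top-k selection per user can then be done with each_top_k as shown earlier):

SELECT
  t.userid,
  t.itemid,
  bprmf_predict(p1.Pu, p2.Qi, p2.Bi) as score
FROM user_item_candidates t -- hypothetical (userid, itemid) pairs to score
LEFT OUTER JOIN bprmf_model p1 ON (t.userid = p1.idx)
LEFT OUTER JOIN bprmf_model p2 ON (t.itemid = p2.idx);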
42
Recommendation	for	Implicit	Feedback	Dataset
1. Efficient top-k computation is important: naive prediction costs O(U × I)
2. Memory consumption is heavy where the number of items |I| is large
• MyMediaLite requires lots of memory
• Maximum data size of MovieLens: 33,000 movies by 240,000 users, 20 million ratings
3. Better to avoid recomputing predictions every time
43
We support machine learning in the cloud.
Any feature requests? Or questions?
Recommendation 101 using Hivemall