GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Recommendations, by Balázs Hidasi
Slides of my presentation at CIKM2018 about version 2 of the GRU4Rec algorithm, a recurrent neural network based algorithm for the session-based recommendation task.
We discuss sampling strategies and introduce additional sampling to the algorithm. We also redesign the loss function to cope with the additional sampling. The resulting BPR-max loss function can efficiently handle many negative samples without running into the vanishing-gradient problem. We also introduce constrained embeddings, which speed up the convergence of item representations and reduce memory usage by a factor of 4. These improvements increase offline measures by up to 52%.
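As a rough illustration, here is a minimal PyTorch sketch of a BPR-max-style loss as described above (the tensor shapes, the `reg_lambda` name, and the numerical-stability epsilon are my assumptions, not code from the talk):

```python
import torch

def bpr_max_loss(r_pos, r_neg, reg_lambda=1.0):
    """BPR-max sketch: softmax-weighted BPR over the negative samples,
    plus a softmax-weighted regularization of the negative scores.
    r_pos: (batch,) target-item scores; r_neg: (batch, n_neg) negative scores."""
    # softmax weights over the negative-sample scores
    w = torch.softmax(r_neg, dim=-1)
    # weighted average of the pairwise sigmoid(r_pos - r_neg) terms
    pairwise = torch.sigmoid(r_pos.unsqueeze(-1) - r_neg)
    loss = -torch.log(torch.sum(w * pairwise, dim=-1) + 1e-24)
    # regularization on the negative scores, also softmax-weighted
    reg = reg_lambda * torch.sum(w * r_neg ** 2, dim=-1)
    return (loss + reg).mean()
```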
In the talk we also discuss online A/B tests and the implications of long-term observations. Most of these observations are exclusive to this talk and do not appear in the paper.
You can access the preprint version of the paper on arXiv: https://arxiv.org/abs/1706.03847
The code is available on GitHub: https://github.com/hidasib/GRU4Rec
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
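To make the split criterion concrete, here is a small Python sketch of the entropy and information-gain computation ID3 uses to choose attributes (the function names and the row-dictionary format are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting (rows, labels) on attribute `attr`;
    ID3 greedily picks the attribute with the highest gain."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder
```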
A tutorial on LDA that first builds intuition for the algorithm, followed by a numerical example solved in MATLAB. The presentation is an audio-slide deck that becomes self-explanatory when downloaded and viewed in slideshow mode.
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018, by Massimo Quadrana
Slides of the Tutorial on Sequence Aware Recommenders held at ACM RecSys 2018 in Vancouver.
Link to the website: https://sites.google.com/view/seq-recsys-tutorial
Link to the hands-on: https://github.com/mquad/sars_tutorial
Amazon DynamoDB is a fully managed NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. This talk explores DynamoDB capabilities and benefits in detail and discusses how to get the most out of your DynamoDB database. We go over schema design best practices with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We also explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, Streams, and more.
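As a flavor of the API the talk covers, here is a hedged boto3 sketch of a composite-key schema and a query that avoids a full table scan (the table and attribute names are hypothetical, not from the talk):

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical schema-design example: partition key = user_id, sort key = ts,
# a common pattern for time-ordered per-user data such as game events.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("GameEvents")  # table name is an assumption

table.put_item(Item={"user_id": "u42", "ts": 1700000000, "score": 991})

# Query one user's events, newest first, instead of scanning the whole table
resp = table.query(
    KeyConditionExpression=Key("user_id").eq("u42"),
    ScanIndexForward=False,  # descending sort-key order
)
print(resp["Items"])
```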
Data Mining Seminar - Graph Mining and Social Network Analysis, by vwchu
Delivered a formal presentation on course material for the Data Mining (EECS 4412) course at York University, Canada, about graph mining. Graphs have become increasingly important in modeling sophisticated structures and their interactions, with broad applications including chemical informatics, bioinformatics, computer vision, video indexing, text retrieval, and Web analysis. The formal seminar was 50 to 60 minutes followed by 10 to 20 minutes for questions.
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412
https://wiki.eecs.yorku.ca/course_archive/2014-15/F/4412/lectures
These slides provide an overview of the most popular approaches to date for the task of object detection with deep neural networks. They review both two-stage approaches such as R-CNN, Fast R-CNN, and Faster R-CNN, and one-stage approaches such as YOLO and SSD. They also contain pointers to relevant datasets (Pascal, COCO, ILSVRC, OpenImages) and the definition of the Average Precision (AP) metric.
Full program:
https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgraduate-course-artificial-intelligence-deep-learning/
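For reference on the AP metric mentioned above, a small NumPy sketch of one common way to compute AP from a precision-recall curve (benchmarks differ in interpolation details: 11-point VOC, 101-point COCO, etc., so treat this all-points variant as illustrative):

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under an interpolated precision-recall curve."""
    # make precision monotonically non-increasing from right to left
    p = np.maximum.accumulate(np.asarray(precisions)[::-1])[::-1]
    r = np.concatenate(([0.0], np.asarray(recalls)))
    # sum precision over each recall increment
    return float(np.sum((r[1:] - r[:-1]) * p))

print(average_precision([0.2, 0.4, 0.4, 0.8], [1.0, 1.0, 0.67, 0.75]))  # 0.7
```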
Slides for talk delivered at the Python Pune meetup on 31st Jan 2014.
Categorical data is a challenge many data scientists face. This talk is about how to tame it.
Fast detection of Android malware: machine learning approach, by Yury Leonychev
This is my presentation from YaC 2013 about a machine-learning-based system for the fast classification of Android applications. Topics covered: how to find malware among thousands of applications in a store.
Tong is a data scientist at Supstat Inc and a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the R package for XGBoost, one of the most popular and contest-winning tools on kaggle.com today. (A short usage sketch follows the agenda below.)
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
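As mentioned above, a minimal usage sketch of XGBoost's Python API on synthetic data (the parameter values are arbitrary examples, not recommendations from the talk):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# toy binary-classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dtest, "test")], early_stopping_rounds=10)
print(booster.best_iteration)
```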
"Optimisation of closed loop supply chain decisions using integrated game theoretic particle swarm algorithm"
Kalpit Patne, Visiting Fellow, SMART Infrastructure Facility presented a summary of his research as part of the SMART Seminar Series on 8 July 2016.
For more information, visit the event page at: http://smart.uow.edu.au/events/UOW217694.
• Explored and cleaned a huge volume of user activity logs (JSON) from a movies website using MapReduce jobs in Python.
• Classified user accounts into adults and children for targeted advertising by implementing a similarity-ranking algorithm.
• Grouped user sessions by user behavior using K-means clustering to observe outliers and find distinctive groups.
• Predicted movie ratings with user-user and item-item based recommendation algorithms using Mahout.
This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice ( http://www.meetup.com/DataPhilly/events/149515412/ )
Map Reduce: Beyond Word Count by Jeff Patti
Have you ever wondered what map reduce can be used for beyond the word-count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk covers a few simple map reduce algorithms that in part power Monetate's information pipeline.
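In that spirit, a toy mrjob job that goes one step beyond word count by summing purchase amounts per session (the tab-separated input format and class name are my assumptions, not Monetate's actual pipeline):

```python
from mrjob.job import MRJob

class SessionRevenue(MRJob):
    """Sum revenue per session from lines of 'session_id<TAB>amount'."""

    def mapper(self, _, line):
        session_id, amount = line.split("\t")
        yield session_id, float(amount)

    def reducer(self, session_id, amounts):
        yield session_id, sum(amounts)

if __name__ == "__main__":
    SessionRevenue.run()
```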
Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.
It appears that you've provided a set of instructions or an input format for a machine learning task, particularly clustering using K-Means. Let's break down what each component means:
(number of clusters):
This is a placeholder for an actual numerical value that represents the desired number of clusters into which you want to divide your training data. In K-Means clustering, you need to specify in advance how many clusters (K) you want the algorithm to find in your data.
Training set:
The "training set" is your dataset, which contains the data points that you want to cluster. Each data point represents an observation or sample in your dataset.
(drop convention):
It's not clear from this input what "(drop convention)" refers to. It could be related to a specific data preprocessing or handling instruction, but without additional context or information, it's challenging to provide a precise explanation for this part.
In summary, you are expected to provide the number of clusters (K) that you want to discover in your training data, and the training data itself contains the observations or samples that will be used for clustering. The "(drop convention)" part may require further clarification or context to provide a meaningful explanation.

Clustering is a fundamental concept in the field of machine learning and data analysis that involves grouping similar data points together based on certain criteria or patterns. It is a technique used to discover inherent structures, relationships, or similarities within a dataset when there are no predefined labels or categories. Clustering is widely employed in various domains, including marketing, biology, image analysis, recommendation systems, and more. In this comprehensive explanation of clustering, we will explore its principles, methods, applications, and key considerations.
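For concreteness, a plain NumPy sketch of the K-Means loop described above (it ignores the empty-cluster edge case for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged
        centroids = new
    return labels, centroids
```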
Table of Contents
Introduction to Clustering
Key Concepts and Terminology
Types of Clustering
3.1. Partitioning Clustering
3.2. Hierarchical Clustering
3.3. Density-Based Clustering
3.4. Model-Based Clustering
Distance Metrics and Similarity Measures
Common Clustering Algorithms
5.1. K-Means Clustering
5.2. Hierarchical Agglomerative Clustering
5.3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
5.4. Gaussian Mixture Models (GMM)
Evaluation of Clusters
Applications of Clustering
7.1. Customer Segmentation
7.2. Image Segmentation
7.3. Anomaly Detection
7.4. Document Clustering
7.5. Recommender Systems
7.6. Genomic Clustering
Challenges and Considerations
8.1. Determining the Number of Clusters (K)
8.2. Handling High-Dimensional Data
8.3. Initial Centroid Selection
8.4. Scaling and Normalization
8.5. Interpretation of Results
Best Practices in Clustering
Future Trends and Advances
Conclusion
1. Introduction to Clustering
Clustering, in the context of data analysis and machine learning, refers to the process of grouping a set of data points into subsets, called clusters, such that points in the same cluster are more similar to one another than to points in other clusters.
Machine Learning: why we should know it and how it works, by Kevin Lee
The most popular buzz word nowadays in the technology world is “Machine Learning (ML).” Most economists and business experts foresee Machine Learning changing every aspect of our lives in the next 10 years through automating and optimizing processes such as: self-driving vehicles; online recommendation on Netflix and Amazon; fraud detection in banks; image and video recognition; natural language processing; question answering machines (e.g., IBM Watson); and many more. This is leading many organizations to seek experts who can implement Machine Learning into their businesses.
Statistical programmers and statisticians in the pharmaceutical industry are in a very interesting position. We have backgrounds very similar to those of Machine Learning experts, such as programming, statistics, and data expertise, and thus embody the essential technical skill sets. This similarity leads many individuals to ask us about Machine Learning; if you lead a biometrics group, you get asked even more often.
The paper is intended for statistical programmers and statisticians who are interested in learning and applying Machine Learning to lead innovation in the pharmaceutical industry. The paper starts with an introduction to the basic concepts of Machine Learning: the hypothesis, the cost function, and gradient descent. It then introduces supervised ML (e.g., Support Vector Machines, Decision Trees, Logistic Regression), unsupervised ML (e.g., clustering), and the most powerful ML algorithm, the Artificial Neural Network (ANN). The paper also introduces some popular SAS® ML procedures and SAS Visual Data Mining and Machine Learning. Finally, it discusses current ML implementations, future implementations, and how programmers and statisticians could lead this exciting and disruptive technology in the pharmaceutical industry.
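To ground the "hypothesis, cost function, gradient descent" trio the paper opens with, here is a tiny illustrative sketch for linear regression (not code from the paper):

```python
import numpy as np

# Hypothesis h(x) = w*x + b, cost J(w, b) = mean squared error,
# parameters updated along the negative gradient of J.
def gradient_descent(x, y, lr=0.01, steps=1000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        err = (w * x + b) - y
        w -= lr * (2 / n) * np.dot(err, x)  # dJ/dw
        b -= lr * (2 / n) * err.sum()       # dJ/db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3 * x + 1
print(gradient_descent(x, y))  # approaches (3, 1)
```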
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP..., by PAPIs.io
Artificial Intelligence and Machine Learning are becoming increasingly accessible. Starting from example use cases, I’ll aim at demystifying how they work and how they improve businesses in 3 areas: increasing the number of customers, serving them better, and serving them more efficiently. I’ll show how machines can use data to automatically learn business rules and make predictions, that can then be used to make better decisions. I’ll introduce the main concepts of ML, its possibilities, its limitations, and I’ll give tips on framing the right problems for your company to tackle.
Louis Dorard is the author of Bootstrapping Machine Learning, a co-founder of PAPIs, and an independent consultant. His goal is to help people use new machine learning technologies to make their apps and businesses smarter. He does this by writing, speaking and teaching.
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx, by melbruce90096
Week 2 iLab
TCO 2 — Given a simple problem, design a solution algorithm that uses arithmetic expressions and built-in functions.
Scenario
Your goal is to solve the following simple programming exercise. You have been contracted by a local antique store to design an algorithm determining the total purchases and sales tax. According to the store owner, the user will need to see the subtotal, the sales tax amount, and the total purchase amount. A customer is purchasing four items from the antique store. Design an algorithm where the user will enter the price of each of the four items. The algorithm will determine the subtotal, the sales tax, and the total purchase amount. Assume the sales tax is 7%.
Be sure to think about the logic and design first (input-process-output (IPO) chart, flowchart, and pseudocode). Display all output using currency formatting.
Advanced (optional): Use a constant for the 7% sales tax.
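A quick Python sketch of the required logic is below; the course asks for pseudocode or C#, so this is only to illustrate the input-process-output flow, including the optional tax constant:

```python
TAX_RATE = 0.07  # constant for the 7% sales tax (the "advanced" option)

# Input: prices of the four items
prices = [float(input(f"Enter price of item {i + 1}: ")) for i in range(4)]

# Process: subtotal, sales tax, total purchase amount
subtotal = sum(prices)
tax = subtotal * TAX_RATE
total = subtotal + tax

# Output: currency formatting
print(f"Subtotal:  ${subtotal:,.2f}")
print(f"Sales tax: ${tax:,.2f}")
print(f"Total:     ${total:,.2f}")
```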
Rubric
Point distribution for this activity (iLab Activity):

Document              Points possible    Points received
Variable list         10
IPO chart             10
Flowchart             10
Pseudocode/C# code    10
Desk-check            10
Total Points          50
Name:_________________
(1) Variable List With Data Type
List all the variables you will use (use valid variable names). Indicate whether the data type is string, integer, or double, and so on.
(2) IPO Model
List the inputs, any processes, calculations, and outputs. Use the same valid variable names you used in Step 1.
Inputs
Process (calculations)
Outputs
(3) Flowchart
Use MS Visio to create a flowchart. Paste the flowchart here, or attach as separate document. Use the same valid variable names you used in Step 1.
(4) Pseudocode or C# Code
Describe your solution using pseudocode or actual C# code. Use the same valid variable names you selected in Step 1.
(5) Desk-Check
Desk-check your solution by selecting appropriate test data.
Test data: List the values for your test data.
Expected output: What is the expected output of your program?
Step (enter step numbers)    Variables (write variable names in the first line below)    Output
1
2
3
Week 2 Activity—Game Seating Charges
TCO 2—Given a simple problem, design a solution algorithm that uses arithmetic expressions and built-in functions.
Assignment
Your goal is to solve the following simple programming exercise. You have been contracted by a local stadium to design an algorithm determining the total seating charges for any game held at the stadium. Lower-level seats cost $25 per seat, mid-level seats cost $15 per seat, and upper-level seats cost $10 per seat. The algorithm should ask the user for the number of seats being purchased in each seating level. Then, the algorithm will determine the total for each level and a grand total for the entire purchase.
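Again purely as an illustration of the logic (the assignment asks for pseudocode or C#), a short Python sketch:

```python
# Seat prices per level
PRICES = {"lower": 25.0, "mid": 15.0, "upper": 10.0}

grand_total = 0.0
for level, price in PRICES.items():
    seats = int(input(f"Number of {level}-level seats: "))
    level_total = seats * price
    grand_total += level_total
    print(f"{level.capitalize()}-level total: ${level_total:,.2f}")
print(f"Grand total: ${grand_total:,.2f}")
```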
The ABC of Implementing Supervised Machine Learning with Python.pptx, by Ruby Shrestha
Machine learning has reached significant heights, but knowing and understanding how small problems can be solved from a machine learning perspective is necessary to form a good base, appreciate the implementation process, and get started in this domain. Therefore, in this post, I would like to talk about the ABC of implementing supervised machine learning with Python by walking through a simple example: adding two numbers. To put it simply, I would like to make a machine learn to add; in other words, I would like to develop a predictive model that can add. Sounds simple, right? View the presentation for more details.
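A hedged sketch of how such a model might look with scikit-learn (the post's exact approach may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# "Teach a machine to add": pairs (a, b) as features, a + b as the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 2))
y = X.sum(axis=1)

model = LinearRegression().fit(X, y)
print(model.predict([[3.0, 4.0]]))  # ≈ [7.0]
```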
Factorization Models in Recommender Systems, by romovpa
Factorization models, i.e. matrix decomposition models for collaborative filtering in recommender systems. The presentation covers the theoretical aspects and algorithms.
From a talk at the "Machine Learning & Information Retrieval" special seminar at the Yandex School of Data Analysis.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN, by Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Toxic effects of heavy metals: Lead and Arsenic, by sanjana502982
Heavy metals are naturally occurring metallic chemical elements that have a relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, and chromium.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk at the Journées Nationales du GDR GPL 2024.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V..., by Wasswaderrick3
In this book, we use conservation-of-energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and from it derive the Poiseuille flow equation, the transition flow equation, and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our energy-conservation techniques to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
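For orientation, one common textbook form of Bernoulli's equation extended with a viscous head-loss term, with the laminar (Poiseuille) loss spelled out; the book's exact derivation and notation may differ:

```latex
% Bernoulli with a head-loss term (a sketch, not the book's derivation):
\[
  \frac{p_1}{\rho g} + \frac{v_1^2}{2g} + z_1
  = \frac{p_2}{\rho g} + \frac{v_2^2}{2g} + z_2 + h_L ,
\]
% where for laminar (Poiseuille) pipe flow the viscous head loss is
\[
  h_L = \frac{32\,\mu\,L\,v}{\rho\,g\,D^2} .
\]
```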
Richard's adventures in two entangled wonderlands, by Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ..., by Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at > 2.3 µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R_1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10^7–10^8 M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
RecSys Challenge 2015: ensemble learning with categorical features

1. RecSys Challenge 2015: ensemble learning with categorical features
   Peter Romov, Evgeny Sokolov
2. Problem statement
• Logs from an e-commerce website: a collection of sessions
• Session
  • a sequence of clicks on item pages
  • could end with or without a purchase
• Click
  • Timestamp
  • ItemID (≈54k unique IDs)
  • CategoryID (≈350 unique IDs)
• Purchase
  • the set of bought items, with price and quantity
• Purchases are known for the train set; we need to predict them on the test set
3. Problem statement
Clicks from session s: c(s) = (c_1(s), \ldots, c_{n(s)}(s))
Purchase (actual): y(s) = \emptyset if there is no purchase, and y(s) = \{i_1, \ldots, i_{m(s)}\} (the bought items) otherwise
Purchase (predicted): h(s) \approx y(s)

Evaluation measure:

Q(h, S_{\text{test}}) = \sum_{s \in S_{\text{test}} : |h(s)| > 0} \begin{cases} \dfrac{|S^b_{\text{test}}|}{|S_{\text{test}}|} + J(y(s), h(s)), & \text{if } y(s) \neq \emptyset \\ -\dfrac{|S^b_{\text{test}}|}{|S_{\text{test}}|}, & \text{otherwise} \end{cases}

where J(A, B) = |A \cap B| / |A \cup B| is the Jaccard similarity between two sets, S_{\text{test}} is the set of all sessions from the test set, and S^b_{\text{test}} is the set of test sessions with a purchase.
4. Problem statement
First observations (from the task):
• the task is uncommon (set prediction with a specific loss function)
• the evaluation measure can be rewritten as

Q(h, S_{\text{test}}) = \underbrace{\frac{|S^b_{\text{test}}|}{|S_{\text{test}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{test}}} J(y(s), h(s))}_{\text{Jaccard score}}

• the original problem can be divided into two well-known binary classification problems:
  1. predict purchase given the session, i.e. model P(y(s) \neq \emptyset \mid s), to optimize the purchase score
  2. predict the bought items given a session with a purchase, i.e. model P(i \in y(s) \mid s, y(s) \neq \emptyset), to optimize the Jaccard score
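A small Python sketch of this evaluation measure as reconstructed above (the dictionary-based data layout is my choice, not the organizers' code):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def challenge_score(y_true, y_pred):
    """For every session with a predicted purchase, add +|S_b|/|S| plus the
    Jaccard similarity if the session really bought, and -|S_b|/|S| otherwise.
    Both arguments map session id -> set of items (empty set = no purchase)."""
    ratio = sum(1 for items in y_true.values() if items) / len(y_true)
    score = 0.0
    for s, pred in y_pred.items():
        if not pred:
            continue  # sessions with empty predictions are skipped
        score += ratio + jaccard(y_true[s], pred) if y_true[s] else -ratio
    return score

# toy example: two sessions, one real buyer
print(challenge_score({"a": {1, 2}, "b": set()}, {"a": {2}, "b": set()}))  # 1.0
```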
5. Solution schema
• Two-stage prediction
• Two binary classification models learned on the train set S_train, which is split into S_learn (90%) and S_valid (10%):
  • purchase classifier: s \mapsto P(\text{purchase} \mid s)
  • bought-item classifier: (s, i) \mapsto P(i \in y(s) \mid s, y(s) \neq \emptyset)
• Both classifiers require thresholds
• The classifier thresholds are set to optimize the purchase score and the Jaccard score on the 10% hold-out S_valid
6. Some relationships from the data
Next observations (from the data):
• Buying rate strongly depends on time features
• Buying rate varies highly across categorical features
(Buying rate: the fraction of buyer sessions in some subset of sessions.)

[Figure 1: Dynamics of the buying rate in time]
[Figure 2: Buying rate versus the number of clicked items (left) and the ID of the item with the maximum number of clicks in the session (right)]

The slide also embeds an excerpt from the accompanying paper:

"The total number of item IDs and category IDs is 54,287 and 347 correspondingly. Both training and test sets belong to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the session s. In other words, the target function gives some subset of the universal item set I for each user session s. We are given the sets of bought items y(s) for all sessions s in the training set S_train, and are required to predict these sets for test sessions s \in S_{\text{test}}.

2.2 Evaluation Measure. Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula:

Q(h, S_{\text{test}}) = \sum_{s \in S_{\text{test}} : |h(s)| > 0} \left( (-1)^{\mathrm{isEmpty}(y(s))} \frac{|S^b_{\text{test}}|}{|S_{\text{test}}|} + J(y(s), h(s)) \right),

where S^b_{\text{test}} is the set of all test sessions with at least one purchase event, and J(A, B) = |A \cap B| / |A \cup B| is the Jaccard similarity measure. It is easy to rewrite this expression as

Q(h, S_{\text{test}}) = \underbrace{\frac{|S^b_{\text{test}}|}{|S_{\text{test}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{test}}} J(y(s), h(s))}_{\text{Jaccard score}}, \quad (1)

where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of penalty and reward are both equal to |S^b_{\text{test}}| / |S_{\text{test}}|. The second part calculates the total similarity of the predicted sets of bought items to the real sets.

2.3 Purchase statistics. […] a higher number of items clicked during the session leads to a higher chance of a purchase. Lots of information could be extracted from the data by considering item identifiers and categories clicked during the…"
7. Feature extraction
• Purchase classifier: features from the session (a sequence of clicks)
• Bought-item classifier: features from the pair session + ItemID
  • Observation: a bought item is always a clicked item
• We use two types of features:
  • Numerical: a real number, e.g. the seconds between two clicks
  • Categorical: an element of an unordered set of values (levels), e.g. ItemID
8. Feature extraction: session
1. Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels]
2. Number of clicks, unique items, categories, and item-category pairs [numerical]
3. Top 10 items and categories by the number of clicks [categorical with ≈50k levels]
4. ID of the first/last item clicked at least k times [categorical with ≈50k levels]
5. Click counts for the 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
9. Feature extraction: session + ItemID
1. All session features
2. ItemID [categorical with ≈50k levels]
3. Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels]
4. Number of clicks on the given item within the session [numerical]
5. Total duration (by analogy with dwell time) of the clicks on the item [numerical]
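To make the two feature groups concrete, a hedged sketch of a few of the listed session features (the click-tuple layout and feature names are illustrative, not the authors' code):

```python
from collections import Counter

def session_features(clicks):
    """A few session features; each click is assumed to be a
    (unix_timestamp, item_id, category_id) tuple."""
    times = [t for t, _, _ in clicks]
    items = [i for _, i, _ in clicks]
    cats = [c for _, _, c in clicks]
    return {
        "n_clicks": len(clicks),                              # numerical
        "n_unique_items": len(set(items)),                    # numerical
        "n_unique_categories": len(set(cats)),                # numerical
        "n_item_category_pairs": len(set(zip(items, cats))),  # numerical
        "start_hour": times[0] // 3600 % 24,                  # categorical, few levels
        "top_items": [i for i, _ in Counter(items).most_common(10)],  # ≈50k levels
    }
```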
10. Classification method
• GBM and similar ensemble learning techniques
  • useful with numerical features
  • one-hot encoding of categorical features doesn't perform well
• Matrix decompositions, FM (factorization machines)
  • useful with categorical features
  • hard to incorporate numerical features because of the rough (bi-linear) model
• We used our internal learning algorithm: MatrixNet
  • GBM with oblivious decision trees
  • the trees properly handle categorical features (multi-split decision trees)
  • SVD-like decompositions for new feature-value combinations
11. Classification method
Oblivious decision tree with categorical features:
[Diagram: a tree whose first level splits on the numerical test "duration > 20" (yes/no) and whose deeper levels split on the categorical features "user" and "item"; every node at the same depth applies the same test.]
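A tiny sketch of why oblivious trees are fast: because every node at a given depth applies the same test, prediction reduces to indexing a table of 2^D leaf values (illustrative only; MatrixNet's categorical multi-splits are more involved):

```python
import numpy as np

def oblivious_predict(x, level_tests, leaf_values):
    """Depth-D oblivious tree = D binary tests indexing 2^D leaves."""
    index = 0
    for test in level_tests:
        index = (index << 1) | int(test(x))
    return leaf_values[index]

# toy depth-2 tree: one numerical test, one categorical test
tests = [lambda x: x["duration"] > 20, lambda x: x["item"] == "i1"]
leaves = np.array([0.1, 0.4, 0.2, 0.9])
print(oblivious_predict({"duration": 25, "item": "i1"}, tests, leaves))  # 0.9
```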
12. Classification method: speed
• Training classifiers
  • GB with 10k trees for each classifier
  • ≈12 hours to train both models on 150 machines
• Making predictions
  • We made 4000 predictions per second per thread
13. Threshold optimization
We optimized the thresholds using the validation set (the 10% hold-out from the train set):
1) Maximize the Jaccard score
2) Maximize the purchase + Jaccard scores using the fixed bought-item threshold

Q(h, S_{\text{valid}}) = \underbrace{\frac{|S^b_{\text{valid}}|}{|S_{\text{valid}}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{\text{valid}}} J(y(s), h(s))}_{\text{Jaccard score}}

[Figure 3: Item detection threshold (above) and purchase detection threshold (below) quality on the validation set.]

The slide also embeds an excerpt from the accompanying paper:

"…we train purchase detection and purchased item detection classifiers. The purchase detection classifier h_p(s) predicts the outcome of the function y_p(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier h_i(s, j) approximates the indicator function y_i(s, j) = I(j \in y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize evaluation metric (1) instead of the classifier's internal quality measure. So, our final expression for the hypothesis can be written as

h(s) = \begin{cases} \emptyset, & \text{if } h_p(s) < \alpha_p \\ \{\, j \in I \mid h_i(s, j) > \alpha_i \,\}, & \text{if } h_p(s) \geq \alpha_p \end{cases} \quad (2)

3.2 Feature Extraction. We have outlined two groups of features: one describes a session and the other describes a session-item pair. The purchase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code. We give some comments on our feature extraction decisions below. One could use sophisticated aggregations to extract numerical features that describe items and categories. How…"
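A hedged sketch of the two-step grid search over thresholds on the validation hold-out (the data structures and helper names are mine, not the authors' code):

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def pick_thresholds(p_buy, p_item, y_true, ratio, grid=np.linspace(0, 1, 101)):
    """(1) pick the item threshold alpha_i maximizing the Jaccard score on
    buyer sessions, then (2) with alpha_i fixed, pick the purchase threshold
    alpha_p maximizing the full score. ratio = |S_b_valid| / |S_valid|."""
    buyers = {s for s, items in y_true.items() if items}

    def items_at(s, a_i):
        # predicted item set for session s at item threshold a_i
        return {i for i, p in p_item[s].items() if p > a_i}

    # step 1: Jaccard score only, over sessions that actually bought
    a_i = max(grid, key=lambda a: sum(jaccard(y_true[s], items_at(s, a))
                                      for s in buyers))

    # step 2: full score (purchase reward/penalty + Jaccard), alpha_i fixed
    def full_score(a_p):
        total = 0.0
        for s in y_true:
            pred = items_at(s, a_i) if p_buy[s] >= a_p else set()
            if not pred:
                continue  # empty prediction: no reward, no penalty
            total += ratio + jaccard(y_true[s], pred) if s in buyers else -ratio
        return total

    a_p = max(grid, key=full_score)
    return a_p, a_i
```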
14. Final results
• Leaderboard: 63102 (1st place)
• Purchase detection on validation (10% hold-out):
  • 16% precision
  • 77% recall
  • AUC 0.85
• Purchased item detection on validation:
  • Jaccard measure 0.765
• The features, the datasets used to learn the classifiers, and the evaluation process can be reproduced; see our code: https://github.com/romovpa/ydf-recsys2015-challenge
15. Summary / Questions?
1. Observations from the problem statement
› The task is complex but decomposable into two well-known ones: binary classification of sessions and of (session, ItemID) pairs
2. Observations from the data (user click sessions)
› Features from sessions and (session, ItemID) pairs
› Easy to develop many meaningful categorical features
3. The algorithm
› Gradient boosting on trees with categorical features
› No sophisticated mixtures of machine-learning techniques: one algorithm works with many numerical and categorical features