RecSys Challenge 2015: ensemble learning with
categorical features
Peter Romov, Evgeny Sokolov
•  Logs from e-commerce website: collection of sessions
•  Session
•  sequence of clicks on the item pages
•  could end with or without purchase
•  Click
•  Timestamp
•  ItemID (≈54k unique IDs)
•  CategoryID (≈350 unique IDs)
•  Purchase
•  set of bought items with price and quantity
•  Known purchases for the train-set, need to predict on the test-set
Problem statement
2
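The raw logs can be loaded into per-session structures along these lines; a minimal sketch assuming the challenge's yoochoose-clicks.dat / yoochoose-buys.dat files and their usual column order (file and column names are assumptions, not taken from the slides).

```python
import pandas as pd

clicks = pd.read_csv(
    "yoochoose-clicks.dat",
    names=["session_id", "timestamp", "item_id", "category_id"],
    parse_dates=["timestamp"],
)
buys = pd.read_csv(
    "yoochoose-buys.dat",
    names=["session_id", "timestamp", "item_id", "price", "quantity"],
    parse_dates=["timestamp"],
)

# c(s): ordered sequence of clicks for each session
click_seq = clicks.sort_values(["session_id", "timestamp"]).groupby("session_id")
# y(s): set of bought items per session (missing sessions have no purchase)
bought = buys.groupby("session_id")["item_id"].apply(set)
```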
Problem statement
3
Clicks from session s:
c(s) = (c_1(s), \dots, c_{n(s)}(s))
Purchase (actual):
y(s) = \begin{cases} \emptyset & \text{— no purchase} \\ \{i_1, \dots, i_{m(s)}\} \text{ (bought items)} & \text{— otherwise} \end{cases}
Purchase (predicted):
h(s) \approx y(s)
Evaluation measure:
Q(h, S_{test}) = \sum_{s \in S_{test}:\ |h(s)| > 0} \begin{cases} \dfrac{|S^b_{test}|}{|S_{test}|} + J(y(s), h(s)), & \text{if } y(s) \neq \emptyset \\[6pt] -\dfrac{|S^b_{test}|}{|S_{test}|}, & \text{otherwise} \end{cases}
where
J(y(s), h(s)) = \dfrac{|y(s) \cap h(s)|}{|y(s) \cup h(s)|} — Jaccard similarity between the two sets
S_{test} — all sessions from the test-set
S^b_{test} — sessions from the test-set with a purchase
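A minimal sketch of this evaluation measure in Python, assuming the ground truth and the predictions are stored as dicts mapping a session id to a (possibly empty) set of item ids; the names are ours, not from the slides.

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| (callers guarantee a non-empty union)."""
    return len(a & b) / len(a | b)

def challenge_score(y, h):
    """Q(h, S_test): y and h map session id -> set of bought item ids
    (an empty set means no purchase / no prediction for that session)."""
    n_test = len(y)                                # |S_test|
    n_buyers = sum(1 for s in y if y[s])           # |S^b_test|
    ratio = n_buyers / n_test
    score = 0.0
    for s, predicted in h.items():
        if not predicted:                          # |h(s)| = 0: session not scored
            continue
        if y[s]:                                   # correctly detected purchase
            score += ratio + jaccard(y[s], predicted)
        else:                                      # false alarm
            score -= ratio
    return score
```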
Problem statement
4
First observations (from the task):
•  the task is uncommon (set prediction with a specific loss function)
•  the evaluation measure can be rewritten as
Q(h, S_{test}) = \underbrace{\frac{|S^b_{test}|}{|S_{test}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{test}} J(y(s), h(s))}_{\text{Jaccard score}}
•  so the original problem can be divided into two well-known binary classification problems:
1.  predict purchase given session: estimate P(y(s) \neq \emptyset \mid s) and optimize the Purchase score
2.  predict bought items given a session with a purchase: estimate P(i \in y(s) \mid s,\, y(s) \neq \emptyset) and optimize the Jaccard score
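The two binary targets follow directly from y(s). A sketch under the assumption that y maps each session to its set of bought items and clicked_items maps each session to its set of clicked items (both names are ours):

```python
# Problem 1 target: did the session end with a purchase?
purchase_target = {s: int(bool(items)) for s, items in y.items()}

# Problem 2 target: for sessions with a purchase, was item j bought?
# Candidate items are restricted to the items clicked in the session.
item_target = {
    (s, j): int(j in items)
    for s, items in y.items() if items
    for j in clicked_items[s]
}
```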
•  Two-stage prediction
•  Two binary classification models learned on the train-set
•  Both classifiers require thresholds
•  Set up thresholds to optimize Purchase score and Jaccard score
using hold-out subsample of the train-set
Solution schema
5
The train-set S_train is split into S_learn (90%) and S_valid (10%).
On S_learn we fit two classifiers:
•  purchase classifier: s ↦ P(purchase | s)
•  bought item classifier: (s, i) ↦ P(i ∈ y(s) | s, y(s) ≠ ∅)
The classifier thresholds are then tuned on S_valid.
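Put together, the two-stage prediction rule looks like the following sketch; h_p and h_i are the two fitted classifiers returning probabilities, alpha_p and alpha_i are the thresholds tuned on the hold-out, and all names are ours.

```python
def predict_session(session, clicked_items, h_p, h_i, alpha_p, alpha_i):
    """Two-stage prediction: first decide whether the session buys at all,
    then keep the clicked items whose purchase probability clears the threshold."""
    if h_p(session) < alpha_p:
        return set()                                      # predict "no purchase"
    return {j for j in clicked_items if h_i(session, j) > alpha_i}
```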
Some relationships from data
6
Next observations (from the data):
•  Buying rate strongly depends on time features
•  Buying rate varies strongly across the values of categorical features
Figure 1: Dynamics of the buying rate in time
Figure 2: Buying rate versus number of clicked
items (left) and ID of the item with the maximum
number of clicks in session (right)
The total number of item IDs and category IDs is 54,287 and 347, respectively. Both the training and test sets belong to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in session s; in other words, the target function gives some subset of the universal item set I for each user session s. We are given the sets of bought items y(s) for all sessions s in the training set S_train and are required to predict these sets for the test sessions s ∈ S_test.
2.2 Evaluation Measure
Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula:
Q(h, S_{test}) = \sum_{s \in S_{test}:\ |h(s)| > 0} \left( (-1)^{\mathrm{isEmpty}(y(s))} \frac{|S^b_{test}|}{|S_{test}|} + J(y(s), h(s)) \right),
where S^b_{test} is the set of all test sessions with at least one purchase event and J(A, B) = |A ∩ B| / |A ∪ B| is the Jaccard similarity measure. It is easy to rewrite this expression as
Q(h, S_{test}) = \underbrace{\frac{|S^b_{test}|}{|S_{test}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{test}} J(y(s), h(s))}_{\text{Jaccard score}}, \qquad (1)
where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). The score thus consists of two parts. The first gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of penalty and reward are both equal to |S^b_{test}| / |S_{test}|. The second part calculates the total similarity of the predicted sets of bought items to the real sets.
2.3 Purchase Statistics
One observation is that a higher number of items clicked during the session leads to a higher chance of a purchase. A lot of information can also be extracted from the data by considering the item identifiers and categories clicked during the session.
Buying rate — fraction of buyer sessions in some subset of sessions
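A buying rate is straightforward to compute once sessions are summarized; a sketch assuming a pandas DataFrame sessions with one row per session, a boolean bought column and the session features as columns (all names are ours).

```python
def buying_rate(sessions, by):
    """Fraction of buyer sessions within each group defined by feature `by`."""
    return sessions.groupby(by)["bought"].mean()

buying_rate(sessions, "start_hour")   # dependence on time features (cf. Figure 1)
buying_rate(sessions, "n_clicks")     # dependence on the number of clicked items (cf. Figure 2)
```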
Feature extraction
7
•  Purchase classifier: features from session (sequence of clicks)
•  Bought item classifier: features from pair session+itemID
•  Observation: bought item is a clicked item
•  We use two types of features
•  Numerical: real number, e.g. seconds between two clicks
•  Categorical: element of the unordered set of values (levels),
e.g. ItemID
Feature extraction: session
8
1.  Start/end of the session (month, day, hour, etc.)
[numerical + categorical with few levels]
2.  Number of clicks, unique items, categories, item-category pairs
[numerical]
3.  Top 10 items and categories by the number of clicks
[categorical with ≈50k levels]
4.  ID of the first/last item clicked at least k times
[categorical with ≈50k levels]
5.  Click counts for 100 items and 50 categories that were most
popular in the whole training set
[sparse numerical]
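A sketch of how a few of the listed session features could be computed from one session's click sequence; the column names follow the loading sketch above and are assumptions.

```python
def session_features(clicks_s):
    """clicks_s: DataFrame with the clicks of one session, sorted by timestamp."""
    first, last = clicks_s.iloc[0], clicks_s.iloc[-1]
    top_items = clicks_s["item_id"].value_counts()
    return {
        # 1. start/end of the session
        "start_hour": first["timestamp"].hour,          # categorical, few levels
        "start_weekday": first["timestamp"].weekday(),  # categorical, few levels
        # 2. simple counts
        "n_clicks": len(clicks_s),
        "n_unique_items": clicks_s["item_id"].nunique(),
        "n_unique_categories": clicks_s["category_id"].nunique(),
        # 3. most clicked item (first of the "top 10 items" group)
        "top_item_1": top_items.index[0],               # categorical, ~50k levels
        # 4. last clicked item
        "last_item": last["item_id"],                   # categorical, ~50k levels
    }
```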
Feature extraction:
session+ItemID
9
1.  All session features
2.  ItemID
[categorical with ≈50k levels]
3.  Timestamp of the first/last click on the item (month, day, hour, etc.)
[numerical + categorical with few levels]
4.  Number of clicks in the session for the given item
[numerical]
5.  Total duration (by analogy with dwell time) of the clicks on the item
[numerical]
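And a sketch of a few of the pair features for a given (session, ItemID); the dwell-time analogue treats the duration of a click as the time until the next click in the session (names are ours).

```python
def session_item_features(clicks_s, item_id):
    """clicks_s: one session's clicks sorted by timestamp."""
    is_item = clicks_s["item_id"] == item_id
    item_clicks = clicks_s[is_item]
    # duration of a click ~ time until the next click in the session
    durations = (-clicks_s["timestamp"].diff(-1)).dt.total_seconds().fillna(0)
    return {
        "item_id": item_id,                                   # categorical, ~50k levels
        "first_click_hour": item_clicks["timestamp"].iloc[0].hour,
        "last_click_hour": item_clicks["timestamp"].iloc[-1].hour,
        "n_item_clicks": int(is_item.sum()),
        "item_total_duration": durations[is_item].sum(),
    }
```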
•  GBM and similar ensemble learning techniques
•  useful with numerical features
•  one-hot encoding of categorical features doesn’t perform well
•  Matrix decompositions, FM (factorization machines)
•  useful with categorical features
•  hard to incorporate numerical features because of the rough (bi-linear) model
•  We used our internal learning algorithm: MatrixNet
•  GBM with oblivious decision trees
•  trees properly handle categorical features (multi-split decision trees)
•  SVD-like decompositions for new feature value combinations
Classification method
10
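MatrixNet is Yandex-internal and not publicly available. Purely as an illustration of the same family of methods (gradient boosting over oblivious trees with native categorical-feature handling), an open-source library such as CatBoost could be trained on the same feature tables; X_learn/X_valid below are assumed to be pandas DataFrames mixing numerical and high-cardinality categorical columns, and the column names are hypothetical.

```python
from catboost import CatBoostClassifier  # open-source analogue, not the authors' MatrixNet

cat_cols = ["start_hour", "top_item_1", "last_item"]       # hypothetical categorical columns
model = CatBoostClassifier(iterations=1000, depth=6, learning_rate=0.05, verbose=100)
model.fit(X_learn, y_learn, cat_features=cat_cols, eval_set=(X_valid, y_valid))
purchase_proba = model.predict_proba(X_valid)[:, 1]        # estimate of P(purchase | s)
```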
Classification method
11
Oblivious decision tree with categorical features: every node on the same level applies the same test, e.g.
•  level 1: duration > 20 (yes / no) [numerical]
•  level 2: split on user [categorical, multi-split]
•  level 3: split on item [categorical, multi-split]
…
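A sketch of how such a tree can be evaluated: because each level applies one shared test, the per-level answers form a mixed-radix index into the leaf array. The split features mirror the slide; the concrete groupings and leaf values are hypothetical.

```python
def oblivious_tree_predict(x, levels, leaf_values):
    """x: dict feature -> value. levels: list of (feature, branch_fn, n_branches),
    where branch_fn maps a feature value to a branch index in [0, n_branches)."""
    index = 0
    for feature, branch_fn, n_branches in levels:
        index = index * n_branches + branch_fn(x[feature])
    return leaf_values[index]

levels = [
    ("duration", lambda v: int(v > 20), 2),                 # numerical split: no / yes
    ("user", lambda v: {"u1": 0, "u2": 1}.get(v, 2), 3),    # categorical multi-split (hypothetical groups)
    ("item", lambda v: {"i7": 0, "i42": 1}.get(v, 2), 3),   # categorical multi-split (hypothetical groups)
]
leaf_values = [0.0] * (2 * 3 * 3)                           # one prediction per leaf
oblivious_tree_predict({"duration": 35, "user": "u2", "item": "i7"}, levels, leaf_values)
```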
•  Training classifiers
•  GB with 10k trees for each classifier
•  ≈12 hours to train both models on 150 machines
•  Making predictions
•  We made 4000 predictions per second per thread
Classification method: speed
12
Threshold optimization
13
We optimized the thresholds on the validation set (the 10% hold-out from the train-set):
1)  maximize the Jaccard score to fix the bought item threshold
2)  maximize the Purchase + Jaccard score (the full measure below) with the bought item threshold fixed
Q(h, S_{valid}) = \underbrace{\frac{|S^b_{valid}|}{|S_{valid}|}\,(TP - FP)}_{\text{purchase score}} + \underbrace{\sum_{s \in S_{valid}} J(y(s), h(s))}_{\text{Jaccard score}}
Figure 3: Item detection threshold (above) and pur-
chase detection threshold (below) quality on the val-
idation set.
We train purchase detection and purchased item detection classifiers. The purchase detection classifier h_p(s) predicts the outcome of the function y_p(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier h_i(s, j) approximates the indicator function y_i(s, j) = I(j ∈ y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize evaluation metric (1) instead of the classifier's internal quality measure. So, our final expression for the hypothesis can be written as
h(s) = \begin{cases} \emptyset & \text{if } h_p(s) < \alpha_p, \\ \{j \in I \mid h_i(s, j) > \alpha_i\} & \text{if } h_p(s) \geq \alpha_p. \end{cases} \qquad (2)
3.2 Feature Extraction
We have outlined two groups of features: one describes a session and the other describes a session-item pair. The purchase detection classifier uses only session features, and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code. We give some comments on our feature extraction decisions below.
One could use sophisticated aggregations to extract numerical features that describe items and categories.
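A sketch of that two-step threshold search, reusing jaccard and challenge_score from the earlier sketch; here h_p maps a validation session to its purchase probability, h_i maps it to a dict of per-item probabilities, and the names and the grid are ours.

```python
import numpy as np

def tune_thresholds(valid_sessions, y, h_p, h_i, grid=np.linspace(0.01, 0.99, 99)):
    def predict(alpha_p, alpha_i):
        return {
            s: set() if h_p[s] < alpha_p
               else {j for j, p in h_i[s].items() if p > alpha_i}
            for s in valid_sessions
        }

    buyers = [s for s in valid_sessions if y[s]]
    # 1) bought item threshold: maximize the Jaccard score on sessions with purchases
    alpha_i = max(grid, key=lambda a: sum(
        jaccard(y[s], {j for j, p in h_i[s].items() if p > a}) for s in buyers))
    # 2) purchase threshold: maximize the full measure with alpha_i fixed
    alpha_p = max(grid, key=lambda a: challenge_score(
        {s: y[s] for s in valid_sessions}, predict(a, alpha_i)))
    return alpha_p, alpha_i
```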
•  Leaderboard: 63102 (1st place)
•  Purchase detection on validation (10% hold-out):
•  16% precision
•  77% recall
•  AUC 0.85
•  Purchased item detection on validation:
•  Jaccard measure 0.765
•  The features, the datasets used to learn the classifiers, and the evaluation process can be reproduced; see our code [1]
Final results
14
[1] https://github.com/romovpa/ydf-recsys2015-challenge
1.  Observations from the problem statement
›  The task is complex but decomposable into two well-known problems:
binary classification of sessions and of (session, ItemID)-pairs
2.  Observations from the data (user click sessions)
›  Features from sessions and (session, ItemID)-pairs
›  Easy to develop many meaningful categorical features
3.  The algorithm
›  Gradient boosting on trees with categorical features
›  No sophisticated mixtures of Machine Learning techniques: one
algorithm to work with many numerical and categorical features
Summary / Questions?
15