SlideShare a Scribd company logo
1 of 15
Download to read offline
RecSys Challenge 2015: ensemble learning with
categorical features
Peter Romov, Evgeny Sokolov
•  Logs from e-commerce website: collection of sessions
•  Session
•  sequence of clicks on the item pages
•  could end with or without purchase
•  Click
•  Timestamp
•  ItemID (≈54k unique IDs)
•  CategoryID (≈350 unique IDs)
•  Purchase
•  set of bought items with price and quantity
•  Known purchases for the train-set, need to predict on the test-set
Problem statement
2
Problem statement
3
Clicks from session s
Purchase (actual)
Purchase (predicted)
c(s) = (c1(s), . . . , cn(s)(s))
h(s) ⇡ y(s)
y(s) =
(
; — no purchase
{i1, . . . , im(s)} (bought items) — otherwise
Evaluation measure:
— Jaccard distance between two sets
Sb
test
Stest
where
— all sessions from test-set
— sessions from test-set with purchase
J(y(s), h(s)) =
|y(s)  h(s)|
|y(s) [ h(s)|
Q(h, Stest) =
X
s2Stest:
|h(s)|>0
8
<
:
|Sb
test|
|Stest| + J(y(s), h(s)), if y(s) 6= ;
|Sb
test|
|Stest| , otherwise
Problem statement
4
First observations (from the task):
•  the task is uncommon (set prediction with specific loss function)
•  evaluation measure could be rewritten
•  the original problem can be divided into two well-known
binary classification problems;
1.  predict purchase given session
optimize Purchase score
2.  predict bought items given session with purchase
optimize Jaccard score
P(y(s) 6= ;|s)
Q(h, Stest) =
|Sb
test|
|Stest|
(TP FP)
| {z }
purchase score
+
X
s2Stest
J(y(s), h(s))
| {z }
Jaccard score
,
P(i 2 y(s)|s, y(s) 6= ;)
•  Two-stage prediction
•  Two binary classification models learned on the train-set
•  Both classifiers require thresholds
•  Set up thresholds to optimize Purchase score and Jaccard score
using hold-out subsample of the train-set
Solution schema
5
Strain
purchase classifier
bought item classifier
Slearn
90%
Svalid
10%
s 7! P(purchase|s)
(s, i) 7! P(i 2 y(s)|s, y(s) 6= ;)
classifier
thresholds
Some relationships from data
6
Next observations (from the data):
•  Buying rate strongly depends on time features
•  Buying rate varies highly between categorical features
he
er-
87
e-
s)
es-
me
s.
he
or
ht
is
)) ,
ne
ty
Figure 1: Dynamics of the buying rate in time
Figure 2: Buying rate versus number of clicked
items (left) and ID of the item with the maximum
number of clicks in session (right)
The total number of item IDs and category IDs is 54, 287
and 347 correspondingly. Both training and test sets be-
long to an interval of 6 months. The target function y(s)
corresponds to the set of items that were bought in the ses-
sion s 1
. In other words, the target function gives some
subset of the universal item set I for each user session s.
We are given sets of bought items y(s) for all sessions s the
training set Strain, and are required to predict these sets for
test sessions s ∈ Stest.
2.2 Evaluation Measure
Denote by h(s) a hypothesis that predicts a set of bought
items for any user session s. The score of this hypothesis is
measured by the following formula:
Q(h, Stest) =
s∈Stest:
|h(s)|>0
|Sb
test|
|Stest|
(−1)isEmpty(y(s))
+J(y(s), h(s)) ,
where Sb
test is the set of all test sessions with at least one
purchase event, J(A, B) = |A∩B|
|A∪B|
is the Jaccard similarity
measure. It is easy to rewrite this expression as
Q(h, Stest) =
|Sb
test|
|Stest|
(TP − FP)
purchase score
+
s∈Stest
J(y(s), h(s))
Jaccard score
, (1)
where TP is the number of sessions with |y(s)| > 0 and |h(s)| >
0 (i.e. true positives), and FP is the number of sessions
with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it
is easy to see that the score consists of two parts. The first
one gives a reward for each correctly guessed session with
buy events and a penalty for each false alarm; the absolute
values of penalty and reward are both equal to
|Sb
test|
|Stest|
. The
second part calculates the total similarity of predicted sets
of bought items to the real sets.
2.3 Purchase statistics
Figure 2: Buying rate versus number of clicked
items (left) and ID of the item with the maximum
number of clicks in session (right)
is that a higher number of items clicked during the session
leads to a higher chance of a purchase.
Lots of information could be extracted from the data by
considering item identifiers and categories clicked during the
Buying rate — fraction of buyer sessions in some subset of sessions
Feature extraction
7
•  Purchase classifier: features from session (sequence of clicks)
•  Bought item classifier: features from pair session+itemID
•  Observation: bought item is a clicked item
•  We use two types of features
•  Numerical: real number, e.g. seconds between two clicks
•  Categorical: element of the unordered set of values (levels),
e.g. ItemID
Feature extraction: session
8
1.  Start/end of the session (month, day, hour, etc.)
[numerical + categorical with few levels]
2.  Number of clicks, unique items, categories, item-category pairs
[numerical]
3.  Top 10 items and categories by the number of clicks
[categorical with ≈50k levels]
4.  ID of the first/last item clicked at least k times
[categorical with ≈50k levels]
5.  Click counts for 100 items and 50 categories that were most
popular in the whole training set
[sparse numerical]
Feature extraction:
session+ItemID
9
1.  All session features
2.  ItemID
[categorical with ≈50k levels]
3.  Timestamp of the first/last click on the item (month, day, hour, etc.)
[numerical + categorical with few levels]
4.  Number of clicks in the session for the given item
[numerical]
5.  Total duration (by analogy with dwell time) of the clicks on the item
[numerical]
•  GBM and similar ensemble learning techniques
•  useful with numerical features
•  one-hot encoding of categorical features doesn’t perform well
•  Matrix decompositions, FM
•  useful with categorical features
•  hard to incorporate numerical features because of rough (bi-linear)
model
•  We used our internal learning algorithm: MatrixNet
•  GBM with oblivious decision trees
•  trees properly handle categorical features (multi-split decision trees)
•  SVD-like decompositions for new feature value combinations
Classification method
10
Classification method
11
Oblivious decision tree with categorical features
duration > 20
user
item item
[numerical]
[categorical]
[categorical]
yes no
…
user
item item…
… … ……
•  Training classifiers
•  GB with 10k trees for each classifier
•  ≈12 hours to train both models on 150 machines
•  Making predictions
•  We made 4000 predictions per second per thread
Classification method: speed
12
Threshold optimization
13
We optimized thresholds using validation set (10% hold-out
from train-set)
1)  Maximize Jaccard score
2)  Maximize Purchase+Jaccard scores using fixed bought
item threshold
Q(h, Svalid) =
|Sb
valid|
|Svalid|
(TP FP)
| {z }
purchase score
+
X
s2Svalid
J(y(s), h(s))
| {z }
Jaccard score
,
Figure 3: Item detection threshold (above) and pur-
chase detection threshold (below) quality on the val-
idation set.
we train purchase detection and purchased item detection
classifiers. The purchase detection classifier hp(s) predicts
the outcome of the function yp(s) = isNotEmpty(y(s)) and
uses the entire training set in the learning phase. The item
detection classifier hi(s, j) approximates the indicator func-
tion yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought
items in the learning phase. Of course, it would be wise to
use classifiers that output probabilities rather than binary
predictions, because in this case we will be able to select
thresholds that directly optimize evaluation metric (1) in-
stead of the classifier’s internal quality measure. So, our
final expression for the hypothesis can be written as
h(s) =
∅ if hp(s) < αp,
{j ∈ I | hi(s, j) > αi} if hp(s) ≥ αp.
(2)
3.2 Feature Extraction
We have outlined two groups of features: one describes a
session and the other describes a session-item pair. The pur-
chase detection classifier uses only session features and the
item detection classifier uses both groups. The full feature
listing can be found in Table 1; for further details, please
refer to our code2
. We give some comments on our feature
extraction decisions below.
One could use sophisticated aggregations to extract nu-
merical features that describe items and categories. How-
Figure 3: Item detection threshold (above) and pur-
chase detection threshold (below) quality on the val-
idation set.
•  Leaderboard: 63102 (1st place)
•  Purchase detection on validation (10% hold-out):
•  16% precision
•  77% recall
•  AUC 0.85
•  Purchased item detection on validation:
•  Jaccard measure 0.765
•  Features / datasets used to learn classifiers / evaluation process
can be reproduced, see our code1
Final results
14 1https://github.com/romovpa/ydf-recsys2015-challenge
1.  Observations from the problem statement
›  The task is complex but decomposable into two well-known:
binary classification of sessions and (session, ItemID)-pairs
2.  Observations from the data (user click sessions)
›  Features from sessions and (session, ItemID)-pairs
›  Easy to develop many meaningful categorical features
3.  The algorithm
›  Gradient boosting on trees with categorical features
›  No sophisticated mixtures of Machine Learning techniques: one
algorithm to work with many numerical and categorical features
Summary / Questions?
15

More Related Content

What's hot

Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisitedXavier Amatriain
 
kaggle hm fashion recsys pjct 발표 자료.pptx
kaggle hm fashion recsys pjct 발표 자료.pptxkaggle hm fashion recsys pjct 발표 자료.pptx
kaggle hm fashion recsys pjct 발표 자료.pptxJohnKim663844
 
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링BOAZ Bigdata
 
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화BOAZ Bigdata
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerceTu Pham
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
빅데이터 기술 현황과 시장 전망(2014)
빅데이터 기술 현황과 시장 전망(2014)빅데이터 기술 현황과 시장 전망(2014)
빅데이터 기술 현황과 시장 전망(2014)Channy Yun
 
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇BOAZ Bigdata
 
Recommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking SystemRecommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking Systemivaderivader
 
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...socialunit
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기NAVER D2
 
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스Minwoo Kim
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Massimo Quadrana
 
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015Mark Raduenzel
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
 
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스BOAZ Bigdata
 
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석BOAZ Bigdata
 
How to Build Recommender System with Content based Filtering
How to Build Recommender System with Content based FilteringHow to Build Recommender System with Content based Filtering
How to Build Recommender System with Content based FilteringVõ Duy Tuấn
 
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델BOAZ Bigdata
 

What's hot (20)

Kdd 2014 Tutorial - the recommender problem revisited
Kdd 2014 Tutorial -  the recommender problem revisitedKdd 2014 Tutorial -  the recommender problem revisited
Kdd 2014 Tutorial - the recommender problem revisited
 
kaggle hm fashion recsys pjct 발표 자료.pptx
kaggle hm fashion recsys pjct 발표 자료.pptxkaggle hm fashion recsys pjct 발표 자료.pptx
kaggle hm fashion recsys pjct 발표 자료.pptx
 
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [리뷰의 재발견 팀] : 이커머스 리뷰 유용성 파악 및 필터링
 
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
제 17회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [중고책나라] : 실시간 데이터를 이용한 Elasticsearch 클러스터 최적화
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerce
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
빅데이터 기술 현황과 시장 전망(2014)
빅데이터 기술 현황과 시장 전망(2014)빅데이터 기술 현황과 시장 전망(2014)
빅데이터 기술 현황과 시장 전망(2014)
 
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇
제 16회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [#인스타툰 팀] : 해시태그 기반 인스타툰 추천 챗봇
 
Recommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking SystemRecommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking System
 
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...
Billetterie en ligne : Une solution miracle pour le spectacle vivant ? - SOCI...
 
[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기[261] 실시간 추천엔진 머신한대에 구겨넣기
[261] 실시간 추천엔진 머신한대에 구겨넣기
 
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스
데이터 기반 성장을 위한 선결 조건: Product-Market Fit, Instrumentation, 그리고 프로세스
 
Building Decision Tree model with numerical attributes
Building Decision Tree model with numerical attributesBuilding Decision Tree model with numerical attributes
Building Decision Tree model with numerical attributes
 
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
Tutorial on Sequence Aware Recommender Systems - ACM RecSys 2018
 
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015
Raduenzel_Mark_FinalAssignment_NSEC506_Fall2015
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스
제 18회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [보아酒] : 리뷰 감정분석을 통한 전통주 추천 서비스
 
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석
제 15회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [Cm:)e팀] : 이커머스 고객경험 관리 분석
 
How to Build Recommender System with Content based Filtering
How to Build Recommender System with Content based FilteringHow to Build Recommender System with Content based Filtering
How to Build Recommender System with Content based Filtering
 
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델
제 13회 보아즈(BOAZ) 빅데이터 컨퍼런스 - [경기는 계속되어야 한다] : 실시간 축구 중계 영상 화질개선 모델
 

Viewers also liked

Fields of Experts (доклад)
Fields of Experts (доклад)Fields of Experts (доклад)
Fields of Experts (доклад)romovpa
 
A Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling PredictionA Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling Predictionromovpa
 
Categorical Data Analysis in Python
Categorical Data Analysis in PythonCategorical Data Analysis in Python
Categorical Data Analysis in PythonJaidev Deshpande
 
Fast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachFast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachYury Leonychev
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 

Viewers also liked (7)

Fields of Experts (доклад)
Fields of Experts (доклад)Fields of Experts (доклад)
Fields of Experts (доклад)
 
A Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling PredictionA Simple yet Efficient Method for a Credit Card Upselling Prediction
A Simple yet Efficient Method for a Credit Card Upselling Prediction
 
Categorical Data Analysis in Python
Categorical Data Analysis in PythonCategorical Data Analysis in Python
Categorical Data Analysis in Python
 
Fast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approachFast detection of Android malware: machine learning approach
Fast detection of Android malware: machine learning approach
 
Xgboost
XgboostXgboost
Xgboost
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 

Similar to RecSys Challenge 2015: ensemble learning with categorical features

EE660 Project_sl_final
EE660 Project_sl_finalEE660 Project_sl_final
EE660 Project_sl_finalShanglin Yang
 
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Infrastructure Facility
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies台灣資料科學年會
 
Observations
ObservationsObservations
Observationsbutest
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksKevin Lee
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptxrajesshs31r
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxrajesshs31r
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...PAPIs.io
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxmelbruce90096
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxRuby Shrestha
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2jcvd12
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR modelsNisha Arankandath
 
The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)Maurício Aniche
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET Journal
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computingjayavignesh86
 

Similar to RecSys Challenge 2015: ensemble learning with categorical features (20)

EE660 Project_sl_final
EE660 Project_sl_finalEE660 Project_sl_final
EE660 Project_sl_final
 
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
SMART Seminar Series: "Optimisation of closed loop supply chain decisions usi...
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies[系列活動] 資料探勘速遊 - Session4 case-studies
[系列活動] 資料探勘速遊 - Session4 case-studies
 
Lec13 Clustering.pptx
Lec13 Clustering.pptxLec13 Clustering.pptx
Lec13 Clustering.pptx
 
Observations
ObservationsObservations
Observations
 
Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
 
Analysis of Algorithm full version 2024.pptx
Analysis of Algorithm  full version  2024.pptxAnalysis of Algorithm  full version  2024.pptx
Analysis of Algorithm full version 2024.pptx
 
Design Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptxDesign Analysis of Alogorithm 1 ppt 2024.pptx
Design Analysis of Alogorithm 1 ppt 2024.pptx
 
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
A business level introduction to Artificial Intelligence - Louis Dorard @ PAP...
 
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docxWeek 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
Week 2 iLab TCO 2 — Given a simple problem, design a solutio.docx
 
The ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptxThe ABC of Implementing Supervised Machine Learning with Python.pptx
The ABC of Implementing Supervised Machine Learning with Python.pptx
 
Oracle ebs r12eam part2
Oracle ebs r12eam part2Oracle ebs r12eam part2
Oracle ebs r12eam part2
 
Ali upload
Ali uploadAli upload
Ali upload
 
Performance evaluation of IR models
Performance evaluation of IR modelsPerformance evaluation of IR models
Performance evaluation of IR models
 
The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)The relationship between test and production code quality (@ SIG)
The relationship between test and production code quality (@ SIG)
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
Lecture 2 role of algorithms in computing
Lecture 2   role of algorithms in computingLecture 2   role of algorithms in computing
Lecture 2 role of algorithms in computing
 

More from romovpa

Машинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаМашинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаromovpa
 
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...romovpa
 
Проекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭПроекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭromovpa
 
Dota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхDota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхromovpa
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...romovpa
 
Машинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаМашинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаromovpa
 
Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахromovpa
 
Структурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойСтруктурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойromovpa
 
Структурное обучение и S-SVM
Структурное обучение и S-SVMСтруктурное обучение и S-SVM
Структурное обучение и S-SVMromovpa
 
Глобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовГлобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовromovpa
 

More from romovpa (10)

Машинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнесаМашинное обучение для ваших игр и бизнеса
Машинное обучение для ваших игр и бизнеса
 
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
Applications of Machine Learning in DOTA2: Literature Review and Practical Kn...
 
Проекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭПроекты для студентов ФКН ВШЭ
Проекты для студентов ФКН ВШЭ
 
Dota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данныхDota Science: Роль киберспорта в обучении анализу данных
Dota Science: Роль киберспорта в обучении анализу данных
 
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
 
Машинное обучение с элементами киберспорта
Машинное обучение с элементами киберспортаМашинное обучение с элементами киберспорта
Машинное обучение с элементами киберспорта
 
Факторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системахФакторизационные модели в рекомендательных системах
Факторизационные модели в рекомендательных системах
 
Структурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовойСтруктурный SVM и отчет по курсовой
Структурный SVM и отчет по курсовой
 
Структурное обучение и S-SVM
Структурное обучение и S-SVMСтруктурное обучение и S-SVM
Структурное обучение и S-SVM
 
Глобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графовГлобальная дискретная оптимизация при помощи разрезов графов
Глобальная дискретная оптимизация при помощи разрезов графов
 

Recently uploaded

GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Silpa
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Silpa
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLkantirani197
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsOrtegaSyrineMay
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxSilpa
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Serviceshivanisharma5244
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxANSARKHAN96
 

Recently uploaded (20)

GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 

RecSys Challenge 2015: ensemble learning with categorical features

  • 1. RecSys Challenge 2015: ensemble learning with categorical features Peter Romov, Evgeny Sokolov
  • 2. •  Logs from e-commerce website: collection of sessions •  Session •  sequence of clicks on the item pages •  could end with or without purchase •  Click •  Timestamp •  ItemID (≈54k unique IDs) •  CategoryID (≈350 unique IDs) •  Purchase •  set of bought items with price and quantity •  Known purchases for the train-set, need to predict on the test-set Problem statement 2
  • 3. Problem statement 3 Clicks from session s Purchase (actual) Purchase (predicted) c(s) = (c1(s), . . . , cn(s)(s)) h(s) ⇡ y(s) y(s) = ( ; — no purchase {i1, . . . , im(s)} (bought items) — otherwise Evaluation measure: — Jaccard distance between two sets Sb test Stest where — all sessions from test-set — sessions from test-set with purchase J(y(s), h(s)) = |y(s) h(s)| |y(s) [ h(s)| Q(h, Stest) = X s2Stest: |h(s)|>0 8 < : |Sb test| |Stest| + J(y(s), h(s)), if y(s) 6= ; |Sb test| |Stest| , otherwise
  • 4. Problem statement 4 First observations (from the task): •  the task is uncommon (set prediction with specific loss function) •  evaluation measure could be rewritten •  the original problem can be divided into two well-known binary classification problems; 1.  predict purchase given session optimize Purchase score 2.  predict bought items given session with purchase optimize Jaccard score P(y(s) 6= ;|s) Q(h, Stest) = |Sb test| |Stest| (TP FP) | {z } purchase score + X s2Stest J(y(s), h(s)) | {z } Jaccard score , P(i 2 y(s)|s, y(s) 6= ;)
  • 5. •  Two-stage prediction •  Two binary classification models learned on the train-set •  Both classifiers require thresholds •  Set up thresholds to optimize Purchase score and Jaccard score using hold-out subsample of the train-set Solution schema 5 Strain purchase classifier bought item classifier Slearn 90% Svalid 10% s 7! P(purchase|s) (s, i) 7! P(i 2 y(s)|s, y(s) 6= ;) classifier thresholds
  • 6. Some relationships from data 6 Next observations (from the data): •  Buying rate strongly depends on time features •  Buying rate varies highly between categorical features he er- 87 e- s) es- me s. he or ht is )) , ne ty Figure 1: Dynamics of the buying rate in time Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right) The total number of item IDs and category IDs is 54, 287 and 347 correspondingly. Both training and test sets be- long to an interval of 6 months. The target function y(s) corresponds to the set of items that were bought in the ses- sion s 1 . In other words, the target function gives some subset of the universal item set I for each user session s. We are given sets of bought items y(s) for all sessions s the training set Strain, and are required to predict these sets for test sessions s ∈ Stest. 2.2 Evaluation Measure Denote by h(s) a hypothesis that predicts a set of bought items for any user session s. The score of this hypothesis is measured by the following formula: Q(h, Stest) = s∈Stest: |h(s)|>0 |Sb test| |Stest| (−1)isEmpty(y(s)) +J(y(s), h(s)) , where Sb test is the set of all test sessions with at least one purchase event, J(A, B) = |A∩B| |A∪B| is the Jaccard similarity measure. It is easy to rewrite this expression as Q(h, Stest) = |Sb test| |Stest| (TP − FP) purchase score + s∈Stest J(y(s), h(s)) Jaccard score , (1) where TP is the number of sessions with |y(s)| > 0 and |h(s)| > 0 (i.e. true positives), and FP is the number of sessions with |y(s)| = 0 and |h(s)| > 0 (i.e. false positives). Now it is easy to see that the score consists of two parts. The first one gives a reward for each correctly guessed session with buy events and a penalty for each false alarm; the absolute values of penalty and reward are both equal to |Sb test| |Stest| . The second part calculates the total similarity of predicted sets of bought items to the real sets. 2.3 Purchase statistics Figure 2: Buying rate versus number of clicked items (left) and ID of the item with the maximum number of clicks in session (right) is that a higher number of items clicked during the session leads to a higher chance of a purchase. Lots of information could be extracted from the data by considering item identifiers and categories clicked during the Buying rate — fraction of buyer sessions in some subset of sessions
  • 7. Feature extraction 7 •  Purchase classifier: features from session (sequence of clicks) •  Bought item classifier: features from pair session+itemID •  Observation: bought item is a clicked item •  We use two types of features •  Numerical: real number, e.g. seconds between two clicks •  Categorical: element of the unordered set of values (levels), e.g. ItemID
  • 8. Feature extraction: session 8 1.  Start/end of the session (month, day, hour, etc.) [numerical + categorical with few levels] 2.  Number of clicks, unique items, categories, item-category pairs [numerical] 3.  Top 10 items and categories by the number of clicks [categorical with ≈50k levels] 4.  ID of the first/last item clicked at least k times [categorical with ≈50k levels] 5.  Click counts for 100 items and 50 categories that were most popular in the whole training set [sparse numerical]
  • 9. Feature extraction: session+ItemID 9 1.  All session features 2.  ItemID [categorical with ≈50k levels] 3.  Timestamp of the first/last click on the item (month, day, hour, etc.) [numerical + categorical with few levels] 4.  Number of clicks in the session for the given item [numerical] 5.  Total duration (by analogy with dwell time) of the clicks on the item [numerical]
  • 10. •  GBM and similar ensemble learning techniques •  useful with numerical features •  one-hot encoding of categorical features doesn’t perform well •  Matrix decompositions, FM •  useful with categorical features •  hard to incorporate numerical features because of rough (bi-linear) model •  We used our internal learning algorithm: MatrixNet •  GBM with oblivious decision trees •  trees properly handle categorical features (multi-split decision trees) •  SVD-like decompositions for new feature value combinations Classification method 10
  • 11. Classification method 11 Oblivious decision tree with categorical features duration > 20 user item item [numerical] [categorical] [categorical] yes no … user item item… … … ……
  • 12. •  Training classifiers •  GB with 10k trees for each classifier •  ≈12 hours to train both models on 150 machines •  Making predictions •  We made 4000 predictions per second per thread Classification method: speed 12
  • 13. Threshold optimization 13 We optimized thresholds using validation set (10% hold-out from train-set) 1)  Maximize Jaccard score 2)  Maximize Purchase+Jaccard scores using fixed bought item threshold Q(h, Svalid) = |Sb valid| |Svalid| (TP FP) | {z } purchase score + X s2Svalid J(y(s), h(s)) | {z } Jaccard score , Figure 3: Item detection threshold (above) and pur- chase detection threshold (below) quality on the val- idation set. we train purchase detection and purchased item detection classifiers. The purchase detection classifier hp(s) predicts the outcome of the function yp(s) = isNotEmpty(y(s)) and uses the entire training set in the learning phase. The item detection classifier hi(s, j) approximates the indicator func- tion yi(s, j) = I(j ∈ y(s)) and uses only sessions with bought items in the learning phase. Of course, it would be wise to use classifiers that output probabilities rather than binary predictions, because in this case we will be able to select thresholds that directly optimize evaluation metric (1) in- stead of the classifier’s internal quality measure. So, our final expression for the hypothesis can be written as h(s) = ∅ if hp(s) < αp, {j ∈ I | hi(s, j) > αi} if hp(s) ≥ αp. (2) 3.2 Feature Extraction We have outlined two groups of features: one describes a session and the other describes a session-item pair. The pur- chase detection classifier uses only session features and the item detection classifier uses both groups. The full feature listing can be found in Table 1; for further details, please refer to our code2 . We give some comments on our feature extraction decisions below. One could use sophisticated aggregations to extract nu- merical features that describe items and categories. How- Figure 3: Item detection threshold (above) and pur- chase detection threshold (below) quality on the val- idation set.
  • 14. •  Leaderboard: 63102 (1st place) •  Purchase detection on validation (10% hold-out): •  16% precision •  77% recall •  AUC 0.85 •  Purchased item detection on validation: •  Jaccard measure 0.765 •  Features / datasets used to learn classifiers / evaluation process can be reproduced, see our code1 Final results 14 1https://github.com/romovpa/ydf-recsys2015-challenge
  • 15. 1.  Observations from the problem statement ›  The task is complex but decomposable into two well-known: binary classification of sessions and (session, ItemID)-pairs 2.  Observations from the data (user click sessions) ›  Features from sessions and (session, ItemID)-pairs ›  Easy to develop many meaningful categorical features 3.  The algorithm ›  Gradient boosting on trees with categorical features ›  No sophisticated mixtures of Machine Learning techniques: one algorithm to work with many numerical and categorical features Summary / Questions? 15