My Approach to Kaggle – Airbnb New User Bookings (Kaggle Tokyo Meetup #1 20160305)

My Approach to Kaggle – Airbnb New User Bookings
Kaggle Tokyo Meetup #1
2016/03/05
id:@Keiku

Today's agenda
• Overview of the Airbnb New User Bookings competition
– Dataset
– Metric
• Why I entered this competition
• Approach
– Preprocessing
– Stacked generalization
– Modeling
– Results
• Shakeup
• Closing remarks

Dataset (1)
• train_users.csv - the training set of users
• test_users.csv - the test set of users
– id: user id
– date_account_created: the date of account creation
– timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
– date_first_booking: date of first booking
– gender
– age
– signup_method
– signup_flow: the page a user came to sign up from
– language: international language preference
– affiliate_channel: what kind of paid marketing
– affiliate_provider: where the marketing is, e.g. google, craigslist, other
– first_affiliate_tracked: what's the first marketing the user interacted with before signing up
– signup_app
– first_device_type
– first_browser
– country_destination: this is the target variable you are to predict

Dataset (2)
• sessions.csv - web sessions log for users
– user_id: to be joined with the column 'id' in users table
– action
– action_type
– action_detail
– device_type
– secs_elapsed
• countries.csv - summary statistics of destination countries in this dataset and their locations
• age_gender_bkts.csv - summary statistics of users' age group, gender, country of destination
• sample_submission.csv - correct format for submitting your predictions

Metric (1)
• The evaluation metric for this competition is NDCG (Normalized discounted cumulative gain) @k where k=5. NDCG is calculated as:
– DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
– nDCG_k = \frac{DCG_k}{IDCG_k}
• where rel_i is the relevance of the result at position i.
• IDCG_k is the maximum possible (ideal) DCG for a given set of queries. All NDCG calculations are relative values on the interval 0.0 to 1.0.
• For each new user, you are to make a maximum of 5 predictions on the country of the first booking. The ground truth country is marked with relevance = 1, while the rest have relevance = 0.
• For example, if for a particular user the destination is FR, then the predictions become:
– [ FR ] gives NDCG = \frac{2^{1} - 1}{\log_2(1 + 1)} = 1.0
– [ US, FR ] gives DCG = \frac{2^{0} - 1}{\log_2(1 + 1)} + \frac{2^{1} - 1}{\log_2(2 + 1)} = \frac{1}{1.58496} = 0.6309
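
The metric can be sketched in a few lines of Python. This is a minimal illustration, assuming one ground-truth country per user (relevance 1, everything else 0) and no duplicate countries in a prediction list. The function names `ndcg_at_k` and `mean_ndcg_at_k` are hypothetical and not part of the competition tooling.

```python
import math


def ndcg_at_k(predictions, truth, k=5):
    """NDCG@k for one user with a single relevant country (relevance 1)."""
    # DCG: sum of (2^rel - 1) / log2(i + 1) over the top-k positions (i is 1-based).
    dcg = 0.0
    for i, country in enumerate(predictions[:k], start=1):
        rel = 1 if country == truth else 0
        dcg += (2 ** rel - 1) / math.log2(i + 1)
    # IDCG: the ideal ranking puts the true country first, so IDCG = 1 / log2(2) = 1.
    idcg = 1.0
    return dcg / idcg


def mean_ndcg_at_k(all_predictions, all_truths, k=5):
    """Average NDCG@k over users."""
    scores = [ndcg_at_k(p, t, k) for p, t in zip(all_predictions, all_truths)]
    return sum(scores) / len(scores)


print(ndcg_at_k(["FR"], "FR"))                  # 1.0
print(round(ndcg_at_k(["US", "FR"], "FR"), 4))  # 0.6309
```

Because only one country is relevant, the score for a user depends only on the rank at which the true country appears: 1.0 at rank 1, about 0.63 at rank 2, and 0 if it is missing from the top 5.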

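Why I entered this competition
• Main reasons
– I wanted to work on a learning-to-rank problem (the metric is NDCG)
• A past example is Personalize Expedia Hotel Searches - ICDM 2013
– The train dataset covers 2010/01 to 2014/06 and the test dataset covers 2014/07 to 2014/09
• I am not comfortable with cross-validation on this type of data
• Past competitions:
– Rossmann Store Sales
– Recruit - Coupon Purchase Prediction
– Avazu - Click-Through Rate Prediction
– There are many demographic variables, so features are easy to build
• Situation at the time
– The competition ran from 2015/11/25 to 2016/02/11 (78 days); my first submission was on 2016/01/25, late in the competition
– I joined for the remaining 3 weeks as a learning exercise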

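Preprocessing (1)
• Feature extraction
– Correct the birth dates that appear in the age column
– Compute the lag between date_first_booking and date_account_created and collapse it into 4 categories
– Compute the lag between date_first_booking and timestamp_first_active and collapse it into 3 categories
– One-hot encode the categorical variables
– Join age_gender_bkts.csv to train_users.csv and test_users.csv
– Join countries.csv to train_users.csv and test_users.csv
– Summarize sessions.csv by (user_id, action) into secs_elapsed and row count, then join to train_users.csv and test_users.csv (same for the variables other than action)
• Notes on feature extraction
– Use everything that can be used
– I also looked at the row order of sessions.csv, but it had no effect. In the Telstra Network Disruptions data, by contrast, the original row order turned out to be a magic feature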
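
The sessions summary step can be sketched as follows. The slides do not show the implementation, so this is only an illustration in plain Python: it assumes that "summarize" means summing secs_elapsed and counting rows per (user_id, action), and the helper names `summarize_sessions` and `join_features` and the feature naming scheme are hypothetical.

```python
from collections import defaultdict


def summarize_sessions(session_rows):
    """Aggregate session rows into per-user features keyed by action.

    session_rows: iterable of dicts with keys user_id, action, secs_elapsed.
    Returns {user_id: {"action_<a>_count": n, "action_<a>_secs": total}}.
    """
    features = defaultdict(lambda: defaultdict(float))
    for row in session_rows:
        user, action = row["user_id"], row["action"]
        secs = row["secs_elapsed"] or 0.0  # assumption: treat missing secs_elapsed as 0
        features[user][f"action_{action}_count"] += 1
        features[user][f"action_{action}_secs"] += secs
    return features


def join_features(users, features):
    """Left-join the session features onto user rows by id."""
    joined = []
    for user in users:
        row = dict(user)
        row.update(features.get(user["id"], {}))  # users without sessions get no extra keys
        joined.append(row)
    return joined


# Made-up sample rows, for illustration only.
sessions = [
    {"user_id": "u1", "action": "search", "secs_elapsed": 30.0},
    {"user_id": "u1", "action": "search", "secs_elapsed": 12.0},
    {"user_id": "u1", "action": "show", "secs_elapsed": None},
    {"user_id": "u2", "action": "show", "secs_elapsed": 5.0},
]
users = [{"id": "u1"}, {"id": "u2"}, {"id": "u3"}]
joined = join_features(users, summarize_sessions(sessions))
print(joined[0])
```

The same pattern applies to action_type, action_detail and device_type by swapping the grouping key.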

Preprocessing (2)
• The R package {DescTools} is handy
• Desc() shows all the basic summary statistics at once

Stacked generalization
• Stacking over the following 18 models
1. Model: XGBoost / Target: age / Train dataset: rows with non-missing age
2. Model: XGBoost / Target: age_cln / Train dataset: rows with non-missing age
3. Model: XGBoost / Target: age_cln2 / Train dataset: rows with non-missing age
4. Model: glmnet / Target: age_cln / Train dataset: rows with non-missing age
5. Model: glmnet / Target: age_cln2 / Train dataset: rows with non-missing age
6. Model: XGBoost / Target: country_destination / Train dataset: full train period
7. Model: XGBoost / Target: country_destination / Train dataset: most recent 12 months
8. Model: XGBoost / Target: country_destination / Train dataset: most recent 6 months
9. Model: XGBoost / Target: country_destination / Train dataset: July, August, September of the previous year
10. Model: XGBoost / Target: distance_km / Train dataset: rows with non-missing distance_km
11. Model: XGBoost / Target: destination_km2 / Train dataset: rows with non-missing destination_km2
12. Model: XGBoost / Target: gender / Train dataset: rows where gender is not -unknown-
13. Model: XGBoost / Target: dfb_dac_lag_flg / Train dataset: full train period
14. Model: XGBoost / Target: dfb_tfa_lag_flg / Train dataset: full train period
15. Model: XGBoost / Target: dfb_dac_lag / Train dataset: full train period
16. Model: XGBoost / Target: dfb_tfa_lag / Train dataset: full train period
17. Model: glmnet / Target: dfb_dac_lag / Train dataset: full train period
18. Model: glmnet / Target: dfb_tfa_lag / Train dataset: full train period
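
The slides list the first-level models but not how their predictions were fed into the final model. A common way to do stacking is to generate out-of-fold predictions, so that each row's stacked feature comes from a model that never saw that row. The sketch below assumes that scheme. The function `out_of_fold_predictions`, the toy mean-predicting "model", and the sample data are all hypothetical stand-ins for XGBoost or glmnet on the real data.

```python
def out_of_fold_predictions(rows, targets, fit, predict, n_folds=5):
    """Return one prediction per row, each made by a model that never saw that row."""
    n = len(rows)
    oof = [None] * n
    for fold in range(n_folds):
        # Rows whose index falls in this fold are held out; the rest train the model.
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit([rows[i] for i in train], [targets[i] for i in train])
        for i in held_out:
            oof[i] = predict(model, rows[i])
    return oof


# Toy first-level "model": predicts the training mean of the target.
def fit_mean(rows, targets):
    return sum(targets) / len(targets)


def predict_mean(model, row):
    return model


# Made-up data, for illustration only.
rows = [{"signup_flow": i} for i in range(6)]
ages = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
oof_age = out_of_fold_predictions(rows, ages, fit_mean, predict_mean, n_folds=3)

# The out-of-fold prediction becomes a new feature for the second-level model.
stacked_rows = [dict(r, pred_age=p) for r, p in zip(rows, oof_age)]
print(stacked_rows[0])
```

For targets that are only observed on part of the data (for example age on rows with non-missing age), the same first-level model would additionally be applied to the rows where the target is missing.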

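Modeling (1)
• Modeling with XGBoost
– eval_metric is NDCG@5
• merror and mlogloss were not used in the end
• On c4.8xlarge, one round of CV takes about 1 minute. Tedious, but I put up with it
– About 10 cups of drip coffee disappeared :-)
– Techniques (Tricks) for Data Mining Competitions (@smly)
• Tuning with BO (Bayesian optimization), RSCV (randomized search CV), etc. was low priority
• Feature selection
– Raw age in particular hurt accuracy
• Feature selection improved accuracy sharply
– Built models on a random selection of 90% of the features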
Modeling (2)
• XGBoost variable importance
– country_destination (most recent 12 months)
– dfb_dac_lag_flg (XGBoost)
– country_destination (most recent 6 months)
– country_destination (July, August, September of the previous year)
– age_cln2 (XGBoost)

Results (1)
• Score summary
– In the end, I selected submission12 (5-fold CV) and submission16 (Last 6 weeks)
Submission           Memo                                         5-fold CV  Public   Private  Public Rank  Private Rank
submission01.csv.7z  Trial and error with merror, mlogloss, etc.  -          0.87958  0.88419  -            -
submission02.csv.7z  -                                            -          0.87848  0.88201  -            -
submission07.csv.7z  No stacking                                  0.83265    0.88013  0.88590  152          55
submission08.csv.7z  With stacking                                0.83318    0.88123  0.88645  36           12
submission09.csv.7z  Feature Selection(1)                         0.83355    0.88162  0.88705  12           1
submission14.csv.7z  Last 6 weeks(1)                              0.83319    0.88167  0.88696  12           2
submission15.csv.7z  Bagging of 12                                0.83371    0.88207  0.88688  2            2
submission16.csv.7z  Last 6 weeks(2)                              0.83346    0.88195  0.88678  2            2

Results (2)
• Score check
[Figure: NDCG@5 Score. x-axis: Local 5 fold-CV Score (0.83240 to 0.83380); y-axis: LB Score (0.87900 to 0.88800); series: Public, Private]
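
The relationship between local CV and leaderboard scores can also be checked numerically. The sketch below computes the Pearson correlation for the six submissions in the Results (1) table that have a 5-fold CV score (07, 08, 09, 14, 15, 16). The helper `pearson` is written out here for illustration, and six points are too few for anything more than a rough sanity check.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Scores for submissions 07, 08, 09, 14, 15, 16 from the Results (1) table.
cv = [0.83265, 0.83318, 0.83355, 0.83319, 0.83371, 0.83346]
public = [0.88013, 0.88123, 0.88162, 0.88167, 0.88207, 0.88195]
private = [0.88590, 0.88645, 0.88705, 0.88696, 0.88688, 0.88678]

r_public = pearson(cv, public)
r_private = pearson(cv, private)
print(f"CV vs Public LB:  r = {r_public:.2f}")
print(f"CV vs Private LB: r = {r_private:.2f}")
```

On these six rows the correlation comes out at roughly 0.93 against the Public LB and roughly 0.86 against the Private LB.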

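Shakeup
• Shakeup was enough of a concern that a topic titled "Expected Leaderboard Shakeup" was started on the forum
• My observations
– 5-fold CV and the Public LB score were strongly related, so simply choosing a model that scored well on both would have been sufficient
– Two final submissions can be selected, so I chose the model with the highest Public LB score and the model with the highest Last 6 weeks score
• Best Public LB: Public: 0.88209 (2nd) / Private: 0.88682 (2nd)
• Best Last 6 weeks Validation: Public: 0.88195 (2nd) / Private: 0.88678 (2nd)
– A comment from Gilberto Titericz Junior, who seems robust to shakeup
• He had two submissions with the same CV score and picked the one with the better Public LB score, although the other one scored better; the result was Public: 0.88107 (57th) / Private: 0.88675 (3rd)
– I aimed for simple models. Ensembling did not help much
– The features I built were strong, so despite the shakeup I stayed near the top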
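
A "Last 6 weeks" validation can be sketched as a time-based holdout. The slides do not define it precisely, so this assumes the final 6 weeks of the training period (by date_account_created) are held out for validation and everything earlier is used for training. The function name `last_weeks_split` and the sample rows are hypothetical.

```python
from datetime import date, timedelta


def last_weeks_split(rows, date_key="date_account_created", weeks=6):
    """Split rows into (train, validation), holding out the final `weeks` weeks."""
    latest = max(row[date_key] for row in rows)
    cutoff = latest - timedelta(weeks=weeks)
    train = [row for row in rows if row[date_key] <= cutoff]
    valid = [row for row in rows if row[date_key] > cutoff]
    return train, valid


# Made-up users, for illustration only.
users = [
    {"id": "a", "date_account_created": date(2014, 3, 1)},
    {"id": "b", "date_account_created": date(2014, 5, 10)},
    {"id": "c", "date_account_created": date(2014, 6, 1)},
    {"id": "d", "date_account_created": date(2014, 6, 30)},
]
train, valid = last_weeks_split(users)
print([u["id"] for u in train], [u["id"] for u in valid])
```

Because the test set follows the training period in time, a holdout like this mimics the train/test relationship more closely than a random 5-fold split, which is why it is useful as a second validation pattern.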

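Closing remarks
• Looking back on the competition
– Approach every competition as a chance to learn, and do not worry too much about when you start
– Look at the data closely and think about it
• Consider whether date_first_booking, which exists only in train, can also be used
– Match the evaluation metric, even if it is a hassle
– As a rule, select the model with the best cross-validation result
– When shakeup is a concern, aim for simple models and prepare different validation patterns
– Take notes so that you can always draw a graph like the one in Results (2)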
