
Kaggle Avito Demand Prediction Challenge 9th Place Solution

Presented at Kaggle Meetup Tokyo #5


  1. KAGGLE AVITO DEMAND PREDICTION CHALLENGE 9TH PLACE SOLUTION – Kaggle Meetup Tokyo 5th – 2018.12.01 – senkin13
  2. About Me
  • 詹金 (Senkin)
  • Kaggle ID: senkin13
  • Background: Infrastructure & DB Engineer [Perfect World] [Square Enix]; Big Data Engineer [Square Enix] [OPT] [Line] [FastRetailing]; Machine Learning Engineer [FastRetailing]
  3. Agenda
  • Avito Demand Prediction Overview
  • Competition Pipeline
  • Best Single Model (LightGBM)
  • Diverse Models
  • Ynktk's Best NN
  • Kohei's Ensemble
  • China Competitions/Kagglers
  • Q & A
  4. Our Team
  5. Public LB: 8th / Private LB: 9th
  6. Description
Avito is Russia's largest classified advertisements website.
[Prediction] Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted), and historical demand for similar ads in similar contexts.
  7. Evaluation
[Target] deal_probability: the likelihood that an ad actually sold something. Range: 0 ~ 1.
    item_id        deal_probability
    b912c3c6a6ad   0.12789
    2dac0150717d   0.00000
    ba83aefab5dc   0.43177
    02996f1dd2eas  0.80323
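For reference, submissions were scored with RMSE between predicted and actual deal_probability (consistent with the 'metric': 'rmse' setting on the Parameter Tuning slide):

    $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$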
  8. Data Description
  • ID: item_id, user_id
  • Numeric: price
  • Category: region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
  • Text: title, description
  • Image: image
  • Sequence: item_seq_number
  • Date: activation_date, date_from, date_to
Tables: Train/Test (user_id, ……, item_id, target), Active (user_id, ……, item_id), Periods (item_id, date_from, date_to). Active is supplemental data shaped like train minus deal_probability, image, and image_top_1.
  9. Train/Test Period
Train: 2017-03-15 ~ 2017-04-05
Test: 2017-04-12 ~ 2017-04-20
  10. Pipeline
Timeline: one week of baseline design, one month of feature engineering, one week of tuning and ensembling.
[Baseline Design] Read the description, kernels, and discussion.
[Baseline] 1. Table data model 2. Text data model (to reduce wait time)
[Validation] KFold: 5; feature validation: one by one; validation score: 5-fold
[Feature Engineering] LightGBM (Table + Text + Image); feature save: 1 feature, 1 pickle file
[Validation] KFold: 5; feature validation: one by one or by group; validation score: 1-fold
[Parameter Tuning] Manual; teammates' feature reuse; diverse models' OOF
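The "1 feature, 1 pickle file" rule is what makes feature reuse across teammates cheap. A minimal sketch of such a cache, assuming a features/ directory (function and file names are illustrative, not from the talk):

    import os
    import pandas as pd

    FEATURE_DIR = 'features'  # hypothetical cache directory

    def load_or_build(name, build_fn):
        """Load a feature from its pickle if cached, else build and save it."""
        path = os.path.join(FEATURE_DIR, name + '.pkl')
        if os.path.exists(path):
            return pd.read_pickle(path)
        feature = build_fn()
        feature.to_pickle(path)
        return feature

    # Usage: each feature lives in its own file, so teammates can mix and match.
    # price_log = load_or_build('price_log', lambda: np.log1p(df_all['price']))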
  11. Preprocessing
  • Tabular data
    df_all['price'] = np.log1p(df_all['price'])
    df_all['city'] = df_all['city'] + '_' + df_all['region']
  • Text data
    def clean_text(s):
        s = re.sub(r'м²|\d+/\d|\d+-к|\d+к', ' ', s.lower())
        s = re.sub(r'\s+', ' ', s)
        return s.strip()
  • Image data: delete the 4 empty images
  12. Feature Engineering
  • Date feature
    df_all['wday'] = df_all['activation_date'].dt.weekday
    ※ Use date columns that exist in both Train and Test.
  • Extended text feature
    df_all['param_123'] = (df_all['param_1'].fillna('') + ' ' + df_all['param_2'].fillna('') + ' ' + df_all['param_3'].fillna('')).astype(str)
    df_all['text'] = df_all['description'].fillna('').astype(str) + ' ' + df_all['title'].fillna('').astype(str) + ' ' + df_all['param_123'].fillna('').astype(str)
    ※ Concatenating increases the number of words available at training time.
  13. Aggregation Feature
  • Unique: {'groupby': ['category_name'], 'target': 'image_top_1', 'agg': 'nunique'}
  • Count: {'groupby': ['user_id'], 'target': 'item_id', 'agg': 'count'}
  • Sum: {'groupby': ['parent_category_name'], 'target': 'price', 'agg': 'sum'}
  • Mean: {'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'}
  • Median: {'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'}
  • Max: {'groupby': ['image_top_1', 'user_id'], 'target': 'price', 'agg': 'max'}
  • Min: {'groupby': ['user_id'], 'target': 'price', 'agg': 'min'}
  ※ Building these from a business point of view is the most efficient approach; see the sketch below for how the spec dicts are applied.
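A sketch of applying spec dicts like these generically; the helper name is hypothetical, but the output naming matches features such as user_id_mean_price used on the next slide:

    def apply_agg_specs(df, specs):
        """Merge one aggregated feature per spec dict onto df."""
        for spec in specs:
            group_cols = spec['groupby']
            name = '{}_{}_{}'.format('_'.join(group_cols), spec['agg'], spec['target'])
            agg = (df.groupby(group_cols)[spec['target']]
                     .agg(spec['agg'])
                     .rename(name)
                     .reset_index())
            df = df.merge(agg, on=group_cols, how='left')
        return df

    specs = [
        {'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'},
        {'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'},
    ]
    # df_all = apply_agg_specs(df_all, specs)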
  14. Interaction Feature
  • Difference between two features
    df_all['image_top_1_diff_price'] = df_all['price'] - df_all['image_top_1_mean_price']
    df_all['category_name_diff_price'] = df_all['price'] - df_all['category_name_mean_price']
    df_all['param_1_diff_price'] = df_all['price'] - df_all['param_1_mean_price']
    df_all['param_2_diff_price'] = df_all['price'] - df_all['param_2_mean_price']
    df_all['user_id_diff_price'] = df_all['price'] - df_all['user_id_mean_price']
    df_all['region_diff_price'] = df_all['price'] - df_all['region_mean_price']
    df_all['city_diff_price'] = df_all['price'] - df_all['city_mean_price']
  ※ Arithmetic interaction features (add/subtract/multiply/divide) that reflect business sense are strong.
  15. Supplemental Data Feature
  • Calculate each item's up days
    all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
  • Count and sum of an item's up days
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'}
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'}
  • Merge onto the main table
    df_all = df_all.merge(all_periods, on='item_id', how='left')
  ※ Digging into the business-relevant parts of the supplemental data matters.
  16. Impute Null Values
  • Fill with 0
    df_all['price'].fillna(0)
  • Fill with the group median
    enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
    enc.columns = ['category_name', 'count_days_up_impute']
    df_all = pd.merge(df_all, enc, how='left', on='category_name')
    df_all['item_id_count_days_up_impute'].fillna(df_all['count_days_up_impute'], inplace=True)
  • Fill with a model's prediction value
    RNN(text) -> image_top_1 (renamed image_top_2)
  ※ The magic feature we never found: df['price'] - df[RNN(text) -> price]
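The team used an RNN over the text to predict image_top_1. As a lighter-weight illustration of the same impute-by-model idea, here is a sketch with a linear classifier over TF-IDF; this substitution is mine, not the team's RNN:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    # Train where image_top_1 is known, predict everywhere.
    known = df_all['image_top_1'].notnull()
    tfidf = TfidfVectorizer(max_features=100000)
    X = tfidf.fit_transform(df_all['text'])

    clf = SGDClassifier(loss='log_loss')  # 'log' in sklearn < 1.1; any text classifier works
    clf.fit(X[known.values], df_all.loc[known, 'image_top_1'].astype(int))

    # Keep predictions as a separate feature, per the slide's rename.
    df_all['image_top_2'] = clf.predict(X)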
  17. Text Feature
  • TF-IDF for text, title, param_123
    vectorizer = FeatureUnion([
        ('text', TfidfVectorizer(ngram_range=(1, 2), max_features=200000, **tfidf_para)),
        ('title', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop)),
        ('param_123', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop)),
    ])
    tfidf_para = {
        'stop_words': russian_stop,
        'analyzer': 'word',
        'token_pattern': r'\w{1,}',
        'lowercase': True,
        'sublinear_tf': True,
        'dtype': np.float32,
        'norm': 'l2',
        'smooth_idf': False,
    }
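As written, the FeatureUnion would feed the same raw input to all three vectorizers. One common way to point each vectorizer at its own column, seen in public Avito kernels though not shown on the slide, is a column-selecting preprocessor:

    from sklearn.pipeline import FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer

    def get_col(col):
        """Return a preprocessor that picks one field out of a record dict."""
        return lambda x: x[col]

    vectorizer = FeatureUnion([
        ('text', TfidfVectorizer(ngram_range=(1, 2), max_features=200000,
                                 preprocessor=get_col('text'), **tfidf_para)),
        ('title', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop,
                                  preprocessor=get_col('title'))),
        ('param_123', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop,
                                      preprocessor=get_col('param_123'))),
    ])
    # Feed records so each preprocessor can select its column:
    # X_text = vectorizer.fit_transform(df_all[['text', 'title', 'param_123']].to_dict('records'))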
  18. Text Feature
  • SVD for title
    tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
    svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
    svd_title_obj.fit(full_title_tfidf)
    train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
    test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
  19. Text Feature
  • Character/word count features (count_regexp_occ is defined in the sketch below)
    for cols in ['text', 'title', 'param_123']:
        df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
        df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
        df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
        df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
        df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
        df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
        df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
        df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
        df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
        df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
        df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))
        df_all[cols + '_num_chars'] = df_all[cols].apply(len)
        df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
        df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(w for w in comment.split())))
        df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
        df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
        df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
        df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
        df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
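The count_regexp_occ helper is used above but never defined on the slide; a minimal definition consistent with how it is called:

    import re

    def count_regexp_occ(pattern, s):
        """Count occurrences of a regex pattern in a string (0 for non-strings/NaN)."""
        return len(re.findall(pattern, s)) if isinstance(s, str) else 0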
  20. Text Feature
  • Ynktk's word embeddings
    • Self-trained FastText
      model = FastText(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
    • Self-trained Word2Vec
      model = Word2Vec(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, seed=seed, workers=32)
  ※ Embeddings trained on the competition text were more effective than embeddings pre-trained on Wikipedia and the like, probably because proper nouns such as product names were predictive of the target.
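These self-trained vectors are what a downstream NN's text embedding layer can be initialized from. A sketch of turning the trained gensim model into an embedding matrix; word_index (token-to-id mapping from a fitted tokenizer) is assumed:

    import numpy as np

    embedding_dim = 300
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        if word in model.wv:               # model: the trained FastText/Word2Vec above
            embedding_matrix[i] = model.wv[word]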
  21. Image Feature
  • Meta features
    • Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
    • Dullness - Whiteness (interaction feature)
  • Pre-trained prediction features
    • VGG16 prediction value
    • ResNet50 prediction value
  • Ynktk's features
    • Top finishers extracted image features with VGG and the like, but hand-crafted features were also effective.
    • NIMA [1]
    • Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment.
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising.
  22. Parameter Tuning
  • Chosen manually, using multiple servers
    params = {
        'boosting_type': 'gbdt',
        'objective': 'xentropy',  # target behaves like a binary classification probability
        'metric': 'rmse',
        'learning_rate': 0.02,
        'num_leaves': 600,
        'max_depth': -1,
        'max_bin': 256,
        'bagging_fraction': 1,
        'feature_fraction': 0.1,  # sparse text vectors
        'verbose': 1,
    }
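A minimal training loop around these params; dataset names are placeholders, and the early-stopping keyword is from LightGBM 3.x (newer versions use callbacks):

    import lightgbm as lgb

    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

    model = lgb.train(
        params,
        dtrain,
        num_boost_round=20000,
        valid_sets=[dvalid],
        early_stopping_rounds=200,  # stop once validation RMSE stops improving
        verbose_eval=500,
    )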
  23. Submission Analysis
Compare the single LightGBM sub file against the stacking sub file for:
  1. Bug checks
  2. Diverse model comparison
  3. Prediction value trend
  24. Best LightGBM Summary
  • Table features: ~250
  • Text features: 1,500,000+
  • Image features: 50+
  • Total features: 1,503,424
  • Public LB: better than 0.2174
  • Private LB: better than 0.2210
  25. Diversity
  • LightGBM: loss xentropy / regression / huber / fair / auc; with/without Active data; features Table, Table+Text, Table+Text+Image, Table+Text+Image+Ridge_meta; params learning_rate, num_leaves
  • XGBoost: loss reg:linear / binary:logistic; with/without Active data; features Table+Text, Table+Text+Image
  • CatBoost: loss binary_crossentropy; with Active data; features Table+Image
  • Random Forest: loss regression; with Active data; features Table+Text+Image
  • Ridge Regression: loss regression; without Active data; features Text, Table+Text+Image; param TF-IDF max_features
  • Neural network: loss regression / binary_crossentropy; with Active data; features Table+Text+Image+word embeddings; structure layer size, Dropout, BatchNorm, pooling; variants rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
  (These models' OOF predictions feed the ensemble; see the stacking sketch below.)
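The agenda's "Kohei's Ensemble" combined these diverse models through their out-of-fold predictions; a generic stacking sketch, not the team's exact recipe:

    import numpy as np
    from sklearn.linear_model import Ridge

    # oof_preds:  (n_train, n_models) out-of-fold predictions from the models above
    # test_preds: (n_test, n_models) corresponding test predictions
    stacker = Ridge(alpha=1.0)
    stacker.fit(oof_preds, y_train)
    final = np.clip(stacker.predict(test_preds), 0, 1)  # deal_probability lies in [0, 1]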
  26. Ynktk's Best NN
Inputs: Numerical, Categorical (Embedding -> Dense), Image, Text (Embedding -> SpatialDropout -> LSTM/GRU -> Conv1D -> LeakyReLU -> GAP & GMP) -> Concat -> BatchNorm -> LeakyReLU -> Dropout -> Dense -> LeakyReLU -> BatchNorm -> Dense
  • Callbacks: EarlyStopping, ReduceLROnPlateau
  • Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
  • Loss: binary cross entropy
Private LB: 0.2225, Public LB: 0.2181
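A compact Keras sketch of the rnn-dnn branch of this idea; layer sizes are illustrative, the image branch is omitted, and this is not the exact prize-winning network:

    from tensorflow.keras import layers, Model, optimizers

    # Assumed shapes/vocab sizes: n_num, n_cat, maxlen, vocab_size are placeholders.
    num_in = layers.Input(shape=(n_num,), name='numerical')
    cat_in = layers.Input(shape=(1,), name='categorical')
    txt_in = layers.Input(shape=(maxlen,), name='text')

    cat_emb = layers.Flatten()(layers.Embedding(n_cat, 16)(cat_in))

    txt_emb = layers.Embedding(vocab_size, 300)(txt_in)  # init with self-trained vectors
    txt_emb = layers.SpatialDropout1D(0.2)(txt_emb)
    rnn = layers.Bidirectional(layers.GRU(64, return_sequences=True))(txt_emb)
    txt_feat = layers.concatenate([layers.GlobalAveragePooling1D()(rnn),   # GAP
                                   layers.GlobalMaxPooling1D()(rnn)])      # GMP

    x = layers.concatenate([num_in, cat_emb, txt_feat])
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(layers.Dense(256)(x))
    x = layers.LeakyReLU()(x)
    out = layers.Dense(1, activation='sigmoid')(x)  # binary cross entropy on a [0, 1] target

    model = Model([num_in, cat_in, txt_in], out)
    model.compile(optimizer=optimizers.Adam(learning_rate=2e-3, clipvalue=0.5),
                  loss='binary_crossentropy')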
  27. China Competitions & Platform
  • Platform: Kaggle vs. Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain ... [1]
  • Rounds: Kaggle has Round 1: 2~3 months Public/Private LB. China comps have Round 1: 1.5 months Public, 3 days Private; Round 2: 2 weeks Public, 3 days Private; Round 3: presentation.
  • Subs/day: Kaggle 5; China comps Public: 3, Private: 1
  • Prize: Kaggle Top 3; China comps Top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp
  28. Knowledge Sharing
https://github.com/Smilexuhc/Data-Competition-TopSolution/blob/master/README.md
Learn from Grandmasters:
  1. EDA with Excel
  2. Join every competition
  3. Reuse pipelines & features
  4. Strict time management
  5. Use knowledge from different domains
  6. Family support
  29. Thank You! Q & A
