CIKM AnalytiCup 2017: Bagging Model for Product Title Quality with Noise

CIKM AnalytiCup
Lazada Product Title Quality Challenge
1st: $6,000
2nd: $2,000
3rd: $1,000
$2,000

Team Members
Tam T. Nguyen
nthanhtam@gmail.com
Postdoctoral Research Fellow
Ryerson University
Kaggle Grandmaster
Hossein Fani
hosseinfani@gmail.com
PhD Student
University of New Brunswick
Gilberto Titericz
giba1978@gmail.com
Machine Learning Expert
AirBnb Inc.
Kaggle Grandmaster
Ebrahim Bagheri
ebrahim.bagheri@gmail.com
Associate Professor
Ryerson University

“hot sexy red clutch rug sack travel backpack unisex cheap with free gift”
𝑦1
clarity
𝑦2
conciseness
“Hot Sexy Tom Clovers Womens Mens Classy Look Cool Simple Style Casual Canvas Crossbody Messenger Bag Handbag Fashion Bag Tote Handbag Gray”
Problem Setting

Clarity: a title is clear if, within five seconds, one can understand what the product is and quickly identify its key attributes (color, size, model, ...).
Conciseness: a title is concise if it is short yet contains all the necessary information. It is not concise if it is too long, with many unnecessary words, or so short that it is unclear what the product is.
Data Set

ML-DM
1. Cleansing
• Noise
• Missing Values
• Outliers
2. Flirting
• Attributes
• Labels (if any)
• Augmentation
3. Feature Eng.
• Extraction
• Reduction
• Selection
4. Model Eng.
• Selection
• Tuning
• Evaluation

1. Cleansing
• Noise
  • HTML tags in ‘short_description’ (94%)
• Missing Values
  • ‘product_type’ (less than 1%)
  • ‘category_lvl_3’ (about 6%) → assign ‘category_lvl_2’
  • ‘description’ (less than 1%)
• Outliers
  • ‘price’ in {-1, 999999, 9999999}
  • ‘price’ normalization by country
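
The cleansing steps above can be sketched in plain Python. This is a minimal illustration, not the team's code: the `country` field name, the regex-based tag stripping, and the choice of a per-country z-score as the normalization are all assumptions, since the slide only names the steps.

```python
import re
from html import unescape
from statistics import mean, pstdev

SENTINEL_PRICES = {-1, 999999, 9999999}  # outlier values listed on the slide
TAG_RE = re.compile(r"<[^>]+>")          # assumption: a simple regex is enough here

def strip_html(text):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(TAG_RE.sub(" ", text)).split())

def clean_row(row):
    """Return a cleaned copy of one product record (a dict)."""
    out = dict(row)
    out["short_description"] = strip_html(row.get("short_description") or "")
    # A missing third-level category falls back to the second level.
    if not row.get("category_lvl_3"):
        out["category_lvl_3"] = row.get("category_lvl_2")
    # Sentinel prices are treated as missing rather than as real values.
    if row.get("price") in SENTINEL_PRICES:
        out["price"] = None
    return out

def normalize_price_by_country(rows):
    """Add 'price_z': z-score of price within each country (assumed scheme)."""
    by_country = {}
    for r in rows:
        if r["price"] is not None:
            by_country.setdefault(r["country"], []).append(r["price"])
    stats = {c: (mean(v), pstdev(v)) for c, v in by_country.items()}
    for r in rows:
        if r["price"] is None:
            r["price_z"] = None
        else:
            mu, sd = stats[r["country"]]
            r["price_z"] = (r["price"] - mu) / sd if sd else 0.0
    return rows

# Toy records; the values are made up for illustration.
raw = [
    {"country": "my", "price": 10.0, "category_lvl_2": "Bags",
     "category_lvl_3": None,
     "short_description": "<ul><li>Red &amp; light</li></ul>"},
    {"country": "my", "price": 30.0, "category_lvl_2": "Bags",
     "category_lvl_3": "Backpacks", "short_description": "Plain text"},
    {"country": "sg", "price": 999999, "category_lvl_2": "Bags",
     "category_lvl_3": "Totes", "short_description": ""},
]
cleaned = normalize_price_by_country([clean_row(r) for r in raw])
```

Treating the sentinel prices as missing before normalizing keeps them from distorting the per-country mean and spread.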

2. Flirting
• Attributes
  • Color
  • Brand
  • Non-English
  • <img> Image
  • <li> enumeration
• 𝒚: Labels
  • Disagreement in labels! (label noise)
• Augmentation
  • Cloning: color, brand
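
A few of the attributes listed above could be derived as simple flags. The sketch below is hypothetical: the `COLORS` lexicon, the use of non-ASCII characters as a proxy for non-English text, and the function name `title_flags` are assumptions made for illustration; a brand flag would need a brand lexicon in the same way.

```python
import re

COLORS = {"red", "blue", "black", "white", "gray", "green"}  # hypothetical lexicon

def title_flags(title, short_description):
    """Derive simple attribute flags from a title and its HTML description."""
    tokens = re.findall(r"[a-z]+", title.lower())
    non_ascii = sum(1 for ch in title if ord(ch) > 127)
    return {
        # True if any title token is a known color word.
        "has_color": any(t in COLORS for t in tokens),
        # Share of non-ASCII characters, an assumed proxy for non-English text.
        "non_english_ratio": non_ascii / len(title) if title else 0.0,
        # Presence of an <img> tag in the description.
        "has_image": bool(re.search(r"<img\b", short_description, re.I)),
        # Number of <li> items in the description.
        "n_list_items": len(re.findall(r"<li\b", short_description, re.I)),
    }

flags = title_flags(
    "hot sexy red clutch rug sack travel backpack unisex cheap with free gift",
    "<ul><li>Canvas</li><li>Unisex</li></ul><img src='a.jpg'>",
)
```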

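multi-class
$f: X_1 \times X_2 \times \dots \times X_d \to y \in \{c_1, c_2, \dots, c_k\}$
binary (boolean) classifier: $y \in \{0, 1\}$
multi-output (multi-label)
$f: X_1 \times X_2 \times \dots \times X_d \to y_1 \in \{c_1, c_2, \dots, c_{k_1}\} \times y_2 \in \{c_1, c_2, \dots, c_{k_2}\} \times \dots \times y_r \in \{c_1, c_2, \dots, c_{k_r}\}$
multi-output binary (boolean) classifier: $y_1 \in \{0, 1\} \times y_2 \in \{0, 1\}$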
Target correlation (allows a single, fast model for all targets):
Only 3 combinations of (Clear, Concise) occur:
(1,0), (1,1), (0,0); |~Clear & Concise| = 0
If ~Clear then ~Concise
If Concise then Clear
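
One way to exploit this constraint is to encode the two binary targets as a single three-class label and recover the per-target probabilities afterwards. The sketch below assumes that encoding; the slide states the correlation but not how the single model was built, so the names `PAIR_TO_CLASS`, `encode`, and `marginals` are illustrative only.

```python
# Joint encoding of (clear, concise); the pair (0, 1) never occurs per the slide.
PAIR_TO_CLASS = {(0, 0): 0, (1, 0): 1, (1, 1): 2}
CLASS_TO_PAIR = {c: p for p, c in PAIR_TO_CLASS.items()}

def encode(clear, concise):
    """Map a (clear, concise) pair to one of the three joint classes."""
    if (clear, concise) not in PAIR_TO_CLASS:
        raise ValueError("concise but not clear violates the observed constraint")
    return PAIR_TO_CLASS[(clear, concise)]

def marginals(class_probs):
    """Turn probabilities over the 3 joint classes into P(clear), P(concise)."""
    p00, p10, p11 = class_probs
    return p10 + p11, p11

# Hypothetical output of a 3-class model for one title.
p_clear, p_concise = marginals([0.2, 0.5, 0.3])
```

With this encoding, P(concise) can never exceed P(clear), which matches "if Concise then Clear".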

3. Feature Eng.
• Extraction
• Reduction
  • LSA, t-SNE, PCA, SVD
• Selection
  • STD (standard deviation)
  • Correlation X~y
    • Linear (t-test, chi2)
    • Non-linear (mutual information)
  • Model-driven
    • LinearSVM
Feature Engineering
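
Two of the selection criteria named above, the standard-deviation filter and the chi-square test, can be illustrated on toy binary features. This is a hand-rolled sketch under stated assumptions: the feature names and values are invented, and the statistic is the plain Pearson chi-square of a 2x2 contingency table, not any particular library's implementation.

```python
from statistics import pstdev

def drop_constant(columns, min_std=1e-9):
    """Keep feature columns whose standard deviation exceeds min_std."""
    return {name: col for name, col in columns.items() if pstdev(col) > min_std}

def chi2_binary(feature, label):
    """Pearson chi-square statistic for a binary feature and a binary label."""
    n = len(feature)
    obs = [[0, 0], [0, 0]]  # obs[feature value][label value]
    for f, y in zip(feature, label):
        obs[f][y] += 1
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row_total = obs[i][0] + obs[i][1]
            col_total = obs[0][j] + obs[1][j]
            expected = row_total * col_total / n
            if expected:
                stat += (obs[i][j] - expected) ** 2 / expected
    return stat

# Toy data; names and values are made up for illustration.
columns = {
    "has_color": [1, 1, 0, 0, 1, 0],
    "has_image": [1, 0, 1, 0, 1, 0],
    "constant":  [1, 1, 1, 1, 1, 1],
}
clear = [1, 1, 0, 0, 1, 0]

kept = drop_constant(columns)                      # removes "constant"
scores = {name: chi2_binary(col, clear) for name, col in kept.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```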

[Figure: bagging architecture. Diagram labels: 10-Fold Set 1 to 10-Fold Set 4; Base Model, Ensemble Model, Final Prediction; Fold Bagging, Set Fold Bagging; STACK and BLEND steps for each set, followed by a final BLEND.]
Bagging Models
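
The idea of fold bagging over several fold sets can be sketched generically: fit one model per fold split, average their test predictions, then average again across differently shuffled fold sets. The code below is an assumption-laden illustration, not the team's pipeline; the least-squares line stands in for any base model, simple averaging stands in for the blend step, and the stacking stage is omitted.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b; a stand-in for any base model."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def k_fold_indices(n, k, seed):
    """Shuffle indices with the given seed and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fold_bagging(xs, ys, test_xs, k, seed):
    """Train one model per fold split and average their test predictions."""
    preds = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds.append([model(x) for x in test_xs])
    return [mean(col) for col in zip(*preds)]

def set_fold_bagging(xs, ys, test_xs, k=10, seeds=(1, 2, 3, 4)):
    """Repeat fold bagging over several fold sets and average the results."""
    per_set = [fold_bagging(xs, ys, test_xs, k, s) for s in seeds]
    return [mean(col) for col in zip(*per_set)]

# Toy data on an exact line y = 2x + 1, so every model should recover it.
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 for x in xs]
bagged = set_fold_bagging(xs, ys, [0.5, 3.0])
```

Averaging across fold sets reduces the dependence of the final prediction on any single random split.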

Performance Evaluation
SGD: stochastic gradient descent
LOR: logistic regression
RDG: ridge regression
NBC: naive bayes classifier
XGB: extreme gradient boosting
LGB: light gradient boosting
W2V: word2vec

Model Importance
[Figure: model importance for clarity and conciseness.]
