MLDM MONDAY
Chih-Ming
About ME
CM 志明
Ph.D. Student in TIGP-SNHCC
Research Assistant at AS CITI
Research Intern at KKBOX
Advisor: Prof. Ming-Feng Tsai (蔡銘峰)
Advisor: Dr. Eric Yang (楊弈軒)
• CLIP Lab
• MAC Lab
Research, Machine Learning team
https://about.me/chewme
http://kaggletw.azurewebsites.net/
Taiwan Kaggle Discussion Group (台灣 Kaggle 交流區)
https://www.facebook.com/groups/kaggletw/
First Things First
• The type of prediction task
- classification? regression? top-N recommendations?
• Evaluation Metric
- AUC, MAE, RMSE, Log-loss, MAP@N, …
• Why Compete?
- For fun
- For learning
- For networking
The Prediction Task
• Binary Classification
• Multi-label Classification
• Regression
• Recommendations
Evaluation Metric
https://www.kaggle.com/wiki/Metrics
Why Compete?
• For Fun: competing with others is like running a race
• For Learning: improving your abilities
• What's Your Motivation?
Other Considerations …
• Data Size

- 10MB? 10GB? >100GB?

- no $$ to pay AWS
• Need GPU Power?

- no $$ to pay AWS
• Good Prize?

- $$$$$$$$$$$$
Check the Provided Data
• The Distribution of Train/Test Data

- random splitting

- split by time

- split by IDs
• Available Features

- categorical, numerical

- text

- image, audio

- time

- sparse, dense
Cross Validation (1)
[Diagram: 3-fold cross validation; in each of rounds 1-3, one fold is held out for validation and the model is trained on the remaining folds.]
Cross Validation (2)
[Diagram: 3-fold cross validation with a held-out test set; in each round one fold is the validation set, the rest are training folds, and a separate test set is never used for training.]
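The rounds in the diagrams can be sketched in a few lines of plain Python (in practice scikit-learn's KFold does this; the function name `kfold_indices` here is just illustrative):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k-fold cross validation."""
    # Distribute any remainder across the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for round_no, (train_idx, val_idx) in enumerate(kfold_indices(9, 3), 1):
    print(f"Round {round_no}: train={train_idx} val={val_idx}")
```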
Hold A Proper Validation
• Random Splitting
• Split by Time
• Split by Id
[Diagram: time-based split; e.g. train on data before 5/13, validate on 5/13-5/20, test on 5/20-5/27 (two 7-day windows), or shift the windows.]
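A time-based split can be written as below (the 5/13 and 5/20 cutoffs mirror the slide's example dates; `time_split` is a hypothetical helper, not a library function):

```python
from datetime import date

def time_split(rows, val_start, test_start):
    """Split rows (each with a 'date' field) into train/validation/test by date cutoffs."""
    train = [r for r in rows if r["date"] < val_start]
    val = [r for r in rows if val_start <= r["date"] < test_start]
    test = [r for r in rows if r["date"] >= test_start]
    return train, val, test

# One row per day in May 2017, days 1..27
rows = [{"date": date(2017, 5, d)} for d in range(1, 28)]
train, val, test = time_split(rows, date(2017, 5, 13), date(2017, 5, 20))
print(len(train), len(val), len(test))  # 12 7 8
```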
Data Cleaning / Preprocessing
• Missing Values
- drop the missing data
- replace them with statistical values (mean / median / mode) or with estimates from clustering / modelling methods
- label them as a distinct "missing" value
• Outlier Detection
- https://en.wikipedia.org/wiki/Outlier
• Redundant Features
- usually remove them
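A minimal sketch of the statistical-replacement option (the `impute` helper is illustrative; pandas' `fillna` or scikit-learn's `SimpleImputer` are the usual tools):

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0], "mean"))       # [1.0, 2.0, 3.0]
print(impute(["a", None, "b", "a"], "mode"))  # ['a', 'a', 'b', 'a']
```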
Categorical Features
• One-hot Encoding
• Clustering Group

One-hot encoding by Id:
Mayday          1 0 0 0
Sodagreen       0 1 0 0
SEKAI_NO_OWARI  0 0 1 0
The_Beatles     0 0 0 1

Grouped by Language (clustering):
Mayday          1 0 0
Sodagreen       1 0 0
SEKAI_NO_OWARI  0 1 0
The_Beatles     0 0 1
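The one-hot table above corresponds to a simple encoder like the sketch below (`one_hot` is a hypothetical helper; pandas' `get_dummies` or scikit-learn's `OneHotEncoder` do the same in practice):

```python
def one_hot(values):
    """Encode each distinct category as its own 0/1 column, as in the Id table above."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

artists = ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"]
for name, row in zip(artists, one_hot(artists)):
    print(f"{name:15s}", row)
```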
Categorical Features
• Col-hot Encoding
• Count-hot Encoding
• Likelihood Encoding
• …

            T1     T2    T3
count       23     1     6
binary      1      0     1
probability 23/30  1/30  6/30
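A sketch of the three rows in the table (assumption: "binary" is read here as "count > 1", which matches the table's 1/0/1 pattern but is only one possible reading of the slide):

```python
from collections import Counter

def categorical_encodings(values):
    """Count, binary, and likelihood (frequency) encodings for one categorical column."""
    counts = Counter(values)
    total = len(values)
    return {c: {"count": n,
                "binary": int(n > 1),      # assumed: 1 iff the value is seen more than once
                "probability": n / total}
            for c, n in counts.items()}

column = ["T1"] * 23 + ["T2"] * 1 + ["T3"] * 6  # the counts from the table above
enc = categorical_encodings(column)
print(enc["T1"]["count"], enc["T2"]["binary"], enc["T3"]["probability"])  # 23 0 0.2
```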
Categorical Features (2)
• Latent Representations
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Laplacian Eigenmaps (LE)
- Locally Linear Embedding (LLE)
- Low-Rank Approximation / Latent Factorization
- Latent Topic Model
Why: reduce the computation cost, alleviate overfitting, find the meaningful components, remove the noise.
https://en.wikipedia.org/wiki/Dimensionality_reduction
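A minimal PCA sketch using the standard eigendecomposition of the covariance matrix (`pca` here is illustrative; scikit-learn's `PCA` is the usual tool):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (minimal PCA sketch)."""
    Xc = X - X.mean(axis=0)                         # center the data
    cov = np.cov(Xc, rowvar=False)                  # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]  # pick the largest ones
    return Xc @ eigvecs[:, top]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```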
Categorical vs. Numerical
• Ordinal Categories
HATE → 0, DON'T MIND → 1, LIKE → 2, LOVE → 3
[Chart: mapping each level to exp(value) instead spaces the categories nonlinearly, so LOVE sits much further from LIKE than LIKE does from DON'T MIND.]
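A tiny sketch of the mapping above:

```python
import math

# Ordinal levels as on the slide; exp(value) widens the gaps toward stronger sentiment
levels = {"HATE": 0, "DON'T MIND": 1, "LIKE": 2, "LOVE": 3}
scores = {name: math.exp(v) for name, v in levels.items()}
print(scores["LOVE"] > scores["LIKE"] > scores["DON'T MIND"] > scores["HATE"])  # True
```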
Numerical Features
• Standardization / Normalization (required by many ML algorithms)
• Rescaling
• Transform the Distribution
- logarithmic transformation
- tf-idf-like transformation
• Binning / Sampling
https://en.wikipedia.org/wiki/Feature_scaling
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
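Sketches of standardization and a logarithmic transform (illustrative helpers, not a specific library API; scikit-learn's `StandardScaler` is the usual choice for the former):

```python
import math
import statistics

def standardize(values):
    """Zero-mean, unit-variance scaling (z-score)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def log_transform(values):
    """log(1 + x): squashes heavy-tailed, non-negative features."""
    return [math.log1p(v) for v in values]

x = [1, 10, 100, 1000]
print(standardize(x))
print(log_transform(x))
```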
Other Features
• Text-based

- Natural Language Processing
• Image-based, Audio-based

- Image/Signal Processing
• Time-based

- Time Series
Domain Knowledge is Important
Example (1)
• Text-based
- Vector Space Model
- Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
[Diagram: the classic word-embedding analogy, KING - MAN + WOMAN ≈ QUEEN]
need stemming? lemmatization?
Example (2)
• Text (Chinese restaurant reviews)
服務好、環境整潔 … (good service, clean environment …)
服務人員笑容溫暖... (the staff's smiles are warm...)
今天點了商業午餐... (ordered the business lunch today...)
• Segmentation
[服務 service] [好 good] [環境 environment] [整潔 clean]
[服務 service] [人員 staff] [笑容 smile] [溫暖 warm]
[今天 today] [點了 ordered] [商業午餐 business lunch]
• Dummy Variables (term counts, after filtering)
服務:1 好:1 環境:1 整潔:2
服務:1 笑容:1 溫暖:2
商業午餐:1
• Advanced Weighting?
服務:2 好:1 環境:1 整潔:4
服務:2 笑容:1 溫暖:1
商業午餐:0.8
• Word Embeddings?
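The dummy-variable counts above can be re-weighted; below is a minimal tf-idf-style sketch in plain Python (the English tokens stand in for the segmented Chinese terms, and tf-idf is one common weighting choice, not necessarily the one intended on the slide):

```python
import math
from collections import Counter

# English stand-ins for the segmented review tokens above
docs = [["service", "good", "environment", "clean"],
        ["service", "staff", "smile", "warm"],
        ["today", "ordered", "business_lunch"]]

# Dummy variables: raw term counts per document
counts = [Counter(doc) for doc in docs]

# Advanced weighting: tf-idf down-weights terms that appear in many documents
n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))
tfidf = [{t: c * math.log(n_docs / df[t]) for t, c in doc.items()} for doc in counts]
print(tfidf[0]["good"] > tfidf[0]["service"])  # True: "service" occurs in two reviews
```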
Example (3)
• Image-based
- SIFT
- Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
Realize the Meaning Behind the Observed Features
• 2017/05/20 08:00 → Holiday? Weekday? Day? Night?
• Taipei → Asia, Mandarin
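Derived features like those above can be extracted from the raw timestamp (`time_features` and the 06:00-18:00 daytime window are illustrative assumptions, and a real holiday calendar would replace the empty default):

```python
from datetime import datetime

def time_features(ts, holidays=()):
    """Derive interpretable features from a raw timestamp."""
    return {
        "weekday": ts.strftime("%A"),
        "is_weekend": ts.weekday() >= 5,
        "is_holiday": ts.date() in holidays,
        "is_daytime": 6 <= ts.hour < 18,  # assumed day/night boundary
    }

print(time_features(datetime(2017, 5, 20, 8, 0)))  # 2017/05/20 08:00 is a Saturday morning
```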
ML Libraries
• scikit-learn
• xgboost, lightgbm, …
• vowpal wabbit
• libsvm, liblinear, libfm, libffm, …
• tensorflow, keras, h2o, caffe, mxnet, …
• …
Understand the Pros and Cons
• Linear Model
- simple, fast, and easy to tune
- low memory footprint
- limited capacity for complex patterns
• Nearest Neighbours
- performance depends on the prediction task and the data distribution
Understand the Pros and Cons (2)
• Random Forest
- works very well in many competitions
- fast and easy to tune
- memory hungry
• SVM
- strong theoretical guarantees
- good at preventing overfitting
- slow and memory heavy
- usually needs a grid search over hyperparameters
Understand the Pros and Cons (3)
• There are too many details …
• Find some online courses or ML books

• The Elements of Statistical Learning
• Machine Learning, A Probabilistic Perspective
• Programming Collective Intelligence
• Information Science and Statistics
• Pattern Recognition and Machine Learning
• …
Exploratory Data Analysis (EDA)
• Statistics Helps

- min, max, variance, mode, …
• Data Visualization Helps
Model Ensembling
• Voting
• Averaging
• Bagging
• Boosting
• Blending
• Stacking
[Diagram: averaging; the predictions of models 1-4 are combined by a simple average.]
Model Ensembling
[Diagram: stacking; the predictions of models 1-4 are fed as inputs to an ensemble (meta) model.]
Model Ensembling
[Diagram: blending; the averaged predictions of models 1-4 are appended as a new feature.]
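A minimal sketch of averaging, the simplest of the techniques listed (`average_ensemble` is a hypothetical helper):

```python
def average_ensemble(predictions):
    """Combine several models' predictions by row-wise averaging."""
    return [sum(ps) / len(ps) for ps in zip(*predictions)]

model_preds = [
    [0.9, 0.2, 0.6],  # model 1
    [0.8, 0.1, 0.7],  # model 2
    [0.7, 0.3, 0.5],  # model 3
]
print(average_ensemble(model_preds))
```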
Other Tricks
• Data Leakage
• Magic/Lucky Parameters
Overall - To Get into the Top
• Correct Validation
• Good Feature Extraction
• Diverse Models
• Proper Ensembling
https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
Learning from Others/Winners
• Discussion Forum
• Kernels
• Winner Solutions
Learning from Others/Winners
http://blog.kaggle.com/
ANY QUESTIONS?
changecandy at gmail

MLDM CM Kaggle Tips