MLDM MONDAY
Chih-Ming
About ME
CM 志明
Ph.D. Student in TIGP-SNHCC
Research Assistant at AS CITI
Research Intern at KKBOX
Advisor: Prof. Ming-Feng Tsai (蔡銘峰)
Advisor: Dr. Eric Yang (楊弈軒)
• CLIP Lab
• MAC Lab
Research, Machine Learning team
https://about.me/chewme
http://kaggletw.azurewebsites.net/
Taiwan Kaggle Discussion Group (台灣 Kaggle 交流區)
https://www.facebook.com/groups/kaggletw/
First Things First
• The type of prediction task
- classification? regression? top-N recommendations?
• Evaluation Metric
- AUC, MAE, RMSE, Log-loss, MAP@N, …
• Why Compete?
- For fun
- For learning
- For networking
The Prediction Task
• Binary Classification
• Multi-label Classification
• Regression
• Recommendations
Evaluation Metric
https://www.kaggle.com/wiki/Metrics
Why Compete?
• For Fun: competing with others is like running a race
• For Learning: improving your abilities
• What's Your Motivation?
Other Considerations …
• Data Size

- 10MB? 10GB? >100GB?

- no $$ to pay AWS
• Need GPU Power?

- no $$ to pay AWS
• Good Prize?

- $$$$$$$$$$$$
Check the Provided Data
• The Distribution of Train/Test Data

- random splitting

- split by time

- split by IDs
• Available Features

- categorical, numerical

- text

- image, audio

- time

- sparse, dense
Cross Validation (1)
[Diagram: 3-fold cross validation; in each of rounds 1-3, one fold is held out for validation and the model is trained on the remaining folds.]
Cross Validation (2)
[Diagram: 3-fold cross validation with a held-out test set; in each round one fold is the validation set, the rest are training folds, and a separate test set is never used for training.]
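The rounds in the diagrams can be sketched in a few lines of plain Python (in practice scikit-learn's KFold does this; the function name `kfold_indices` here is just illustrative):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k-fold cross validation."""
    # Distribute any remainder across the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

for round_no, (train_idx, val_idx) in enumerate(kfold_indices(9, 3), 1):
    print(f"Round {round_no}: train={train_idx} val={val_idx}")
```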
Hold A Proper Validation
• Random Splitting
• Split by Time
• Split by Id
[Diagram: time-based split; e.g. train on data before 5/13, validate on 5/13-5/20, test on 5/20-5/27 (two 7-day windows), or shift the windows.]
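A time-based split can be written as below (the 5/13 and 5/20 cutoffs mirror the slide's example dates; `time_split` is a hypothetical helper, not a library function):

```python
from datetime import date

def time_split(rows, val_start, test_start):
    """Split rows (each with a 'date' field) into train/validation/test by date cutoffs."""
    train = [r for r in rows if r["date"] < val_start]
    val = [r for r in rows if val_start <= r["date"] < test_start]
    test = [r for r in rows if r["date"] >= test_start]
    return train, val, test

# One row per day in May 2017, days 1..27
rows = [{"date": date(2017, 5, d)} for d in range(1, 28)]
train, val, test = time_split(rows, date(2017, 5, 13), date(2017, 5, 20))
print(len(train), len(val), len(test))  # 12 7 8
```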
Data Cleaning / Preprocessing
• Missing Values
- drop the missing data
- replace them with statistical values (mean / median / mode) or with estimates from clustering / modelling methods
- label them as a distinct "missing" value
• Outlier Detection
- https://en.wikipedia.org/wiki/Outlier
• Redundant Features
- usually remove them
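A minimal sketch of the statistical-replacement option (the `impute` helper is illustrative; pandas' `fillna` or scikit-learn's `SimpleImputer` are the usual tools):

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0], "mean"))       # [1.0, 2.0, 3.0]
print(impute(["a", None, "b", "a"], "mode"))  # ['a', 'a', 'b', 'a']
```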
Categorical Features
• One-hot Encoding
• Clustering Group

One-hot encoding by Id:
Mayday          1 0 0 0
Sodagreen       0 1 0 0
SEKAI_NO_OWARI  0 0 1 0
The_Beatles     0 0 0 1

Grouped by Language (clustering):
Mayday          1 0 0
Sodagreen       1 0 0
SEKAI_NO_OWARI  0 1 0
The_Beatles     0 0 1
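The one-hot table above corresponds to a simple encoder like the sketch below (`one_hot` is a hypothetical helper; pandas' `get_dummies` or scikit-learn's `OneHotEncoder` do the same in practice):

```python
def one_hot(values):
    """Encode each distinct category as its own 0/1 column, as in the Id table above."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

artists = ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"]
for name, row in zip(artists, one_hot(artists)):
    print(f"{name:15s}", row)
```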
Categorical Features
• Col-hot Encoding
• Count-hot Encoding
• Likelihood Encoding
• …

            T1     T2    T3
count       23     1     6
binary      1      0     1
probability 23/30  1/30  6/30
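A sketch of the three rows in the table (assumption: "binary" is read here as "count > 1", which matches the table's 1/0/1 pattern but is only one possible reading of the slide):

```python
from collections import Counter

def categorical_encodings(values):
    """Count, binary, and likelihood (frequency) encodings for one categorical column."""
    counts = Counter(values)
    total = len(values)
    return {c: {"count": n,
                "binary": int(n > 1),      # assumed: 1 iff the value is seen more than once
                "probability": n / total}
            for c, n in counts.items()}

column = ["T1"] * 23 + ["T2"] * 1 + ["T3"] * 6  # the counts from the table above
enc = categorical_encodings(column)
print(enc["T1"]["count"], enc["T2"]["binary"], enc["T3"]["probability"])  # 23 0 0.2
```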
Categorical Features (2)
• Latent Representations
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Laplacian Eigenmaps (LE)
- Locally Linear Embedding (LLE)
- Low-Rank Approximation / Latent Factorization
- Latent Topic Model
Why: reduce the computation cost, alleviate overfitting, find the meaningful components, remove the noise.
https://en.wikipedia.org/wiki/Dimensionality_reduction
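A minimal PCA sketch using the standard eigendecomposition of the covariance matrix (`pca` here is illustrative; scikit-learn's `PCA` is the usual tool):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (minimal PCA sketch)."""
    Xc = X - X.mean(axis=0)                         # center the data
    cov = np.cov(Xc, rowvar=False)                  # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]  # pick the largest ones
    return Xc @ eigvecs[:, top]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```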
Categorical vs. Numerical
• Ordinal Categories
HATE → 0, DON'T MIND → 1, LIKE → 2, LOVE → 3
[Chart: mapping each level to exp(value) instead spaces the categories nonlinearly, so LOVE sits much further from LIKE than LIKE does from DON'T MIND.]
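A tiny sketch of the mapping above:

```python
import math

# Ordinal levels as on the slide; exp(value) widens the gaps toward stronger sentiment
levels = {"HATE": 0, "DON'T MIND": 1, "LIKE": 2, "LOVE": 3}
scores = {name: math.exp(v) for name, v in levels.items()}
print(scores["LOVE"] > scores["LIKE"] > scores["DON'T MIND"] > scores["HATE"])  # True
```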
Numerical Features
• Standardization / Normalization (required by many ML algorithms)
• Rescaling
• Transform the Distribution
- logarithmic transformation
- tf-idf-like transformation
• Binning / Sampling
https://en.wikipedia.org/wiki/Feature_scaling
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
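Sketches of standardization and a logarithmic transform (illustrative helpers, not a specific library API; scikit-learn's `StandardScaler` is the usual choice for the former):

```python
import math
import statistics

def standardize(values):
    """Zero-mean, unit-variance scaling (z-score)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def log_transform(values):
    """log(1 + x): squashes heavy-tailed, non-negative features."""
    return [math.log1p(v) for v in values]

x = [1, 10, 100, 1000]
print(standardize(x))
print(log_transform(x))
```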
Other Features
• Text-based

- Natural Language Processing
• Image-based, Audio-based

- Image/Signal Processing
• Time-based

- Time Series
Domain Knowledge is Important
Example (1)
• Text-based
- Vector Space Model
- Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
[Diagram: the classic word-embedding analogy, KING - MAN + WOMAN ≈ QUEEN]
need stemming? lemmatization?
Example (2)
• Text (Chinese restaurant reviews)
服務好、環境整潔 … (good service, clean environment …)
服務人員笑容溫暖... (the staff's smiles are warm...)
今天點了商業午餐... (ordered the business lunch today...)
• Segmentation
[服務 service] [好 good] [環境 environment] [整潔 clean]
[服務 service] [人員 staff] [笑容 smile] [溫暖 warm]
[今天 today] [點了 ordered] [商業午餐 business lunch]
• Dummy Variables (term counts, after filtering)
服務:1 好:1 環境:1 整潔:2
服務:1 笑容:1 溫暖:2
商業午餐:1
• Advanced Weighting?
服務:2 好:1 環境:1 整潔:4
服務:2 笑容:1 溫暖:1
商業午餐:0.8
• Word Embeddings?
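The dummy-variable counts above can be re-weighted; below is a minimal tf-idf-style sketch in plain Python (the English tokens stand in for the segmented Chinese terms, and tf-idf is one common weighting choice, not necessarily the one intended on the slide):

```python
import math
from collections import Counter

# English stand-ins for the segmented review tokens above
docs = [["service", "good", "environment", "clean"],
        ["service", "staff", "smile", "warm"],
        ["today", "ordered", "business_lunch"]]

# Dummy variables: raw term counts per document
counts = [Counter(doc) for doc in docs]

# Advanced weighting: tf-idf down-weights terms that appear in many documents
n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))
tfidf = [{t: c * math.log(n_docs / df[t]) for t, c in doc.items()} for doc in counts]
print(tfidf[0]["good"] > tfidf[0]["service"])  # True: "service" occurs in two reviews
```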
Example (3)
• Image-based
- SIFT
- Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
Realize the Meaning Behind the Observed Features
• 2017/05/20 08:00 → Holiday? Weekday? Day? Night?
• Taipei → Asia, Mandarin
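Derived features like those above can be extracted from the raw timestamp (`time_features` and the 06:00-18:00 daytime window are illustrative assumptions, and a real holiday calendar would replace the empty default):

```python
from datetime import datetime

def time_features(ts, holidays=()):
    """Derive interpretable features from a raw timestamp."""
    return {
        "weekday": ts.strftime("%A"),
        "is_weekend": ts.weekday() >= 5,
        "is_holiday": ts.date() in holidays,
        "is_daytime": 6 <= ts.hour < 18,  # assumed day/night boundary
    }

print(time_features(datetime(2017, 5, 20, 8, 0)))  # 2017/05/20 08:00 is a Saturday morning
```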
ML Libraries
• scikit-learn
• xgboost, lightgbm, …
• vowpal wabbit
• libsvm, liblinear, libfm, libffm, …
• tensorflow, keras, h2o, caffe, mxnet, …
• …
Understand the Pros and Cons
• Linear Model
- simple, fast, and easy to tune
- low memory footprint
- limited capacity for complex patterns
• Nearest Neighbours
- performance depends on the prediction task and the data distribution
Understand the Pros and Cons (2)
• Random Forest
- works very well in many competitions
- fast and easy to tune
- memory hungry
• SVM
- strong theoretical guarantees
- good at preventing overfitting
- slow and memory heavy
- usually needs a grid search over hyperparameters
Understand the Pros and Cons (3)
• There are too many details …
• Find some online courses or ML books

• The Elements of Statistical Learning
• Machine Learning, A Probabilistic Perspective
• Programming Collective Intelligence
• Information Science and Statistics
• Pattern Recognition and Machine Learning
• …
Exploratory Data Analysis (EDA)
• Statistics Helps

- min, max, variance, mode, …
• Data Visualization Helps
Model Ensembling
• Voting
• Averaging
• Bagging
• Boosting
• Blending
• Stacking
[Diagram: averaging; the predictions of models 1-4 are combined by a simple average.]
Model Ensembling
[Diagram: stacking; the predictions of models 1-4 are fed as inputs to an ensemble (meta) model.]
Model Ensembling
[Diagram: blending; the averaged predictions of models 1-4 are appended as a new feature.]
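A minimal sketch of averaging, the simplest of the techniques listed (`average_ensemble` is a hypothetical helper):

```python
def average_ensemble(predictions):
    """Combine several models' predictions by row-wise averaging."""
    return [sum(ps) / len(ps) for ps in zip(*predictions)]

model_preds = [
    [0.9, 0.2, 0.6],  # model 1
    [0.8, 0.1, 0.7],  # model 2
    [0.7, 0.3, 0.5],  # model 3
]
print(average_ensemble(model_preds))
```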
Other Tricks
• Data Leakage
• Magic/Lucky Parameters
Overall - To Get into the Top
• Correct Validation
• Good Feature Extraction
• Diverse Models
• Proper Ensembling
https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
Learning from Others/Winners
• Discussion Forum
• Kernels
• Winner Solutions
Learning from Others/Winners
http://blog.kaggle.com/
ANY QUESTIONS?
changecandy at gmail

MLDM CM Kaggle Tips