NLP Approaches on Kaggle
Agenda
• Current trends
• Jigsaw toxic comment classification
• Problem Description
• Different approaches
• Winning Approaches
• Mercari Price Suggestion Challenge
• Problem Description
• Different approaches
• Winning Approaches
Text Properties
• High dimensional data
• Sparse data
• Sequential data
• Linearly separable
• https://www.svm-tutorial.com/2014/10/svm-linear-kernel-good-text-classification/
Feature Creation
• Remove stop words
• Remove special characters
• Create TF-IDF / hashing feature vectors (see the sketch below)
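A minimal sketch of the feature-creation steps above, assuming a list of raw comment strings (the `texts` variable here is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

texts = ["This is a great product!!!", "worst thing i ever bought :("]  # hypothetical data

# TF-IDF features: stop words removed, a simple word token pattern drops most special characters
tfidf = TfidfVectorizer(stop_words="english", token_pattern=r"\b\w+\b", max_features=50000)
X_tfidf = tfidf.fit_transform(texts)        # sparse matrix: n_samples x vocabulary

# Hashing features: fixed width, no vocabulary kept in memory
hasher = HashingVectorizer(stop_words="english", n_features=2**18, alternate_sign=False)
X_hash = hasher.transform(texts)
```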
Machine Learning Model
• Linear SVM
• Multinomial Naïve Bayes
• Boosting approaches such as XGBoost and GBM
• Decision tree approaches are not recommended
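A baseline sketch of the Linear SVM and Multinomial Naïve Bayes models listed above, assuming the `X_tfidf` matrix from the previous sketch and a binary label vector `y` (both assumptions, not provided by the slides):

```python
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# X_tfidf: sparse TF-IDF matrix from the previous sketch, y: assumed 0/1 label vector
svm = LinearSVC(C=1.0)
nb = MultinomialNB(alpha=1.0)

print("Linear SVM AUC:", cross_val_score(svm, X_tfidf, y, cv=3, scoring="roc_auc").mean())
print("MultinomialNB AUC:", cross_val_score(nb, X_tfidf, y, cv=3, scoring="roc_auc").mean())
```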
Deep Learning Starter
• Neural Network
• Dense Layer
• Activation Layer
• Optimizer: Stochastic Gradient Descent (SGD)
• Deep Learning
• The dense layers can be replaced:
• LSTM/GRU for text
• Convolution for images
• The softmax layer can be replaced with a ReLU or Leaky ReLU layer
• Layers such as dropout and max pooling serve different purposes (a starter sketch follows below)
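A minimal Keras starter along these lines: dense layers, an activation, and SGD as the optimizer. The input width of 10,000 is an assumed TF-IDF dimensionality, not something from the slides:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(10000,)),  # 10000 = assumed TF-IDF width
    Dropout(0.2),                                        # optional regularisation layer
    Dense(1, activation="sigmoid"),                      # binary output; use softmax for multi-class
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```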
Deep Learning approach
• Architecture
• Input features
• Embedding Layer
• Dense/LSTM/GRU/Conv
• Dropout (optional)
• Max pooling
• Activation function
• Softmax Layer
• Optimizer: Adam
• Activation: ReLU
Dropout
(figure: dropout illustration)
Maxpool Layer
(figure: max-pooling illustration)
Embedding Layer
• TF-IDF and count features do not capture semantic relations
• Word2vec captures these relationships (see the embedding-matrix sketch below)
Word2vec
https://github.com/3Top/word2vec-api
GloVe
https://nlp.stanford.edu/projects/glove/
fastText
https://github.com/facebookresearch/fastText
Dependency-Based Word Embeddings
http://cistern.cis.lmu.de/meta-emb/
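A sketch of how pre-trained vectors typically feed a Keras embedding layer: build an embedding matrix indexed by the tokenizer's vocabulary. The GloVe file name, vocabulary size and `texts` variable are assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

max_words, embedding_dim = 30000, 100            # assumed vocabulary size and GloVe dimension
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)                    # `texts`: raw comment strings (assumed)

# Load GloVe vectors from a local text file (file name is an assumption)
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Rows of the matrix line up with the tokenizer's word indices; unknown words stay zero
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if idx < max_words and word in embeddings:
        embedding_matrix[idx] = embeddings[word]
```

In tf.keras 2.x this matrix would then be handed to the embedding layer, e.g. `Embedding(max_words, embedding_dim, weights=[embedding_matrix], trainable=False)`.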
Problem Description
• Classify the Wikipedia comments
• Toxic
• Severe toxic
• Obscene
• Threat
• Insult
• Identity hate
• Imbalanced dataset
• Generally short texts
• Spelling mistakes
• Shorthand
• Improper use of grammar
Data Description
NB-SVM Baseline Approach (AUC – 0.972)
• Suggestion: Linear SVM and Naïve Bayes work well on text
• Approach:
• Create TF-IDF features
• Create Naïve Bayes features
• Build an SVM model
• Why? Combines the characteristics of Naïve Bayes and SVM
• Code: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
• Paper: https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
• Takeaway: try this approach directly instead of a single SVM or Naïve Bayes model (see the sketch below)
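A sketch of the NB-SVM idea from the paper above: scale TF-IDF features by Naïve Bayes log-count ratios, then fit a linear model on the scaled matrix (the referenced kernel uses logistic regression as the SVM stand-in). `X` and `y` are an assumed sparse TF-IDF matrix and a binary label vector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naïve Bayes log-count ratio over the TF-IDF features."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed feature totals in the positive class
    q = alpha + X[y == 0].sum(axis=0)   # smoothed feature totals in the negative class
    return np.log((p / p.sum()) / (q / q.sum()))

r = nb_log_count_ratio(X, np.asarray(y))
X_nb = X.multiply(r)                    # element-wise scaling keeps the matrix sparse
clf = LogisticRegression(C=4.0, solver="liblinear").fit(X_nb, y)
```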
Bidirectional LSTM Baseline – (0.971)
• Suggestion: Bidirectional LSTM is suitable for text
• Approach:
• Create Embedding layer of size 128
• Add LSTM layer of size 50
• Pass it to the GlobalMaxPool layer
• Add dropout with probability 0.1
• Add a dense layer of size 50
• Add dropout with probability 0.1
• Add a dense layer of size 6
• Code: https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-069?scriptVersionId=2188439
• Takeaway: GlobalMaxPool is suggested for NLP and MaxPool for vision (a Keras sketch of this baseline follows)
• Reference: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
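A Keras sketch of the bidirectional LSTM baseline above; layer sizes follow the slide, while `max_features` and `maxlen` are assumed preprocessing choices:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     GlobalMaxPooling1D, Dropout, Dense)

max_features, maxlen = 20000, 100        # assumed vocabulary size and padded sequence length
model = Sequential([
    Embedding(max_features, 128, input_length=maxlen),
    Bidirectional(LSTM(50, return_sequences=True)),
    GlobalMaxPooling1D(),
    Dropout(0.1),
    Dense(50, activation="relu"),
    Dropout(0.1),
    Dense(6, activation="sigmoid"),      # one output per toxicity label
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```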
Bidirectional LSTM Improvement – (0.972)
• Approach:
• Add GloVe vectors to the embedding layer
• Why? GloVe has better representations of words
• Code: https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout
Bidirectional LSTM Improvement – (0.973)
• Approach:
• Add an attention layer
• Why? The attention layer captures more information
• Code: https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
Logistic Regression – Feature Engineering (AUC – 0.9792)
• Approach (sketched below):
• Create 2-6 character n-grams
• Create TF-IDF features for the 2-6 character n-grams and for words
• Build a logistic regression model
• Why?
• Words like "Stuppid", "Idiooot", "nooo" exist
• Character n-grams can capture this information well
• Not applicable: Healthcare and Summit
• Code: https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams/output
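A sketch of the word + character n-gram logistic regression above; `train_text` and `y` are assumed to be the raw comments and one of the six labels, and the vectorizer limits are illustrative:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), max_features=10000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 6), max_features=50000)

# Stack word-level and character-level TF-IDF features side by side
X = hstack([word_vec.fit_transform(train_text),
            char_vec.fit_transform(train_text)]).tocsr()
clf = LogisticRegression(C=0.1, solver="sag").fit(X, y)   # one model per label in practice
```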
FM FTRL – Feature Engineering (AUC – 0.981)
• Approach:
• Replace logistic regression with FM_FTRL
• Why?
• FTRL is often a good replacement for logistic regression
• FTRL is widely used in click-through-rate prediction
• Code: https://www.kaggle.com/anttip/wordbatch-1-3-3-fm-ftrl-lb-0-9812/code
• Not applicable: Healthcare and Summit
• Takeaway:
• Wordbatch library
• Feature extraction from TF-IDF to word2vec is included
• Algorithms such as FM_FTRL and NN_ReLU_H1
Bidirectional GRU – (0.980)
• Suggestion: GRU is a simplified version of LSTM
• Approach (sketched below):
• Create an embedding layer of size 30,000
• Add spatial dropout with probability 0.1
• Add a bidirectional GRU of size 80
• Pass it to a GlobalAveragePooling layer
• Pass it to a GlobalMaxPool layer
• Concatenate both pooling layers
• Pass the result to a sigmoid layer of size 6
• Code: https://www.kaggle.com/prashantkikani/pooled-gru-with-preprocessing/code
• Takeaway:
• Spatial dropout before the GRU is an unusual approach
• Combining average pooling and max pooling may capture additional information
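A Keras functional-API sketch of the pooled bidirectional GRU above, written functionally so the average- and max-pooled outputs can be concatenated; `maxlen` and `embed_size` are assumptions, and the embedding would normally be initialized from pre-trained vectors as in the later improvements:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional,
                                     GRU, GlobalAveragePooling1D, GlobalMaxPooling1D,
                                     concatenate, Dense)

maxlen, max_features, embed_size = 100, 30000, 300    # assumed sequence length / vocab / dimension
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)          # add weights=[embedding_matrix] for GloVe/fastText
x = SpatialDropout1D(0.1)(x)
x = Bidirectional(GRU(80, return_sequences=True))(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
out = Dense(6, activation="sigmoid")(concatenate([avg_pool, max_pool]))

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```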
Bidirectional GRU Improvement – (0.982)
• Approach:
• Add GloVe vectors to the previous model
• Why? GloVe has better representations of words
• Code: https://www.kaggle.com/prashantkikani/pooled-gru-glove-with-preprocessing
Bidirectional GRU Improvement – (0.983)
• Approach:
• Add fastText vectors to the previous model
• Why? fastText can also capture character n-grams
• Code: https://www.kaggle.com/yekenot/pooled-gru-fasttext
Convolution with fastText – (0.982)
• Approach:
• Replace the GRU with a 2D convolution
• Why? Convolutions are widely tried on text as well
• Code: https://www.kaggle.com/shujian/textcnn-2d-convolution
Bidirectional GRU with Convolution – (0.9839)
• Approach:
• Combine GRU and convolution
• Why? Convolution acts as a feature-reduction technique, similar to PCA
• Code: https://www.kaggle.com/eashish/bidirectional-gru-with-convolution/code
• Author claim: GRU and LSTM give the same results
Winning Approach
Winning Approach – 0.9872
• Fix spelling mistakes
• Find regular words within a small Levenshtein distance
• Use TextBlob to fix errors
• Fix misspelled words by finding word-vector neighbourhoods
• Embedding layer
• Combine GloVe and Twitter fastText embeddings
• Words without word vectors are replaced with "something"
• Add a flag of 1 if the word contains a capital letter
• Add spatial dropout with probability 0.1
• Add a bidirectional LSTM of size 40
• Author note: LSTM showed better results than GRU
• Add a bidirectional GRU of size 40
• Combine the last state, max pooling, average pooling and two extra features: "unique words rate" and "rate of all-caps words"
• Discussion URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52644
Winning Approach – 0.9890 (1st place solution)
• Approach
• Diverse pre-trained embeddings (AUC – 0.9877)
• Combine high-dimensional fastText and GloVe embeddings
• Author note: the embedding layer is more important than the other layers
• Add two bidirectional GRU layers followed by two dense layers
• Add translations from German, French and other languages (AUC – 0.980)
• Pseudo-labeling (AUC – 0.985)
• Label the test samples using the best approach
• Use the labeled test samples when building the model
• Stacking (AUC – 0.989)
• Use LGBM for stacking
• Use Bayesian optimization for heavy parameter tuning
• Created six different seeds and built a different LGBM model for each
• Bagged the different models to reduce variance
• Takeaway: http://neupy.com/2016/12/17/hyperparameter_optimization_for_neural_networks.html
• Discussion link: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557
Mercari Price Suggestion Challenge
• Problem Description
• Predict the price of a product based on its description
• Constraints
• Hardware: 4 cores / 16 GB RAM
• Code should run within 1 hour
• Focus: both speed and accuracy
Mercari Price Suggestion Challenge
• Sample data (figure)
• Scoring mechanism (figure)
Linear Regression (RMSLE – 0.6)
• Approach
• Create TF-IDF features on name and description
• Build a linear regression model
• Code: https://www.kaggle.com/jkkphys/category-tf-idf-linear-regression
Random Forest (RMSLE – 0.523)
• Approach
• Convert name, item_condition, category_name, brand_name and shipping into numeric features
• Build a random forest regressor
• Code: https://www.kaggle.com/shikhar1/base-random-forest-lb-532/notebook
• Skeptical???
Ridge Regression (RMSLE – 0.470)
• Approach (sketched below)
• Create count feature vectors of name and category
• Create TF-IDF vectors on item description
• Create LabelBinarizer features on brand
• Combine all the features
• Build a Ridge regression model
• Code: https://www.kaggle.com/apapiu/ridge-script/code
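A sketch of the ridge baseline above, stacking count, TF-IDF and binarized brand features into one sparse matrix; `df` is an assumed pandas DataFrame with the Mercari columns, and `log1p(price)` is used as the target since the metric is RMSLE:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import Ridge

# df: assumed pandas DataFrame with the Mercari columns
X = hstack([
    CountVectorizer(min_df=10).fit_transform(df["name"]),
    CountVectorizer().fit_transform(df["category_name"].fillna("missing")),
    TfidfVectorizer(max_features=50000, ngram_range=(1, 3), stop_words="english")
        .fit_transform(df["item_description"].fillna("")),
    LabelBinarizer(sparse_output=True).fit_transform(df["brand_name"].fillna("missing")),
]).tocsr()

y = np.log1p(df["price"])                # RMSE on log1p(price) corresponds to RMSLE
model = Ridge(alpha=5.0, solver="sag").fit(X, y)
```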
Ridge Regression with n-grams (RMSLE – 0.452)
• Approach: add n-gram count feature vectors of names
• Code: https://www.kaggle.com/nishimoto/ridge-script-with-n-gram
Ridge Regression with TSVD (RMSLE – 0.470)
• Approach: apply truncated SVD (TSVD) on the TF-IDF features
• Code: https://www.kaggle.com/eurbtc/public-ridge-tsvd
Keras with Parallel Batch Training (RMSLE – 0.4504)
• Approach
• Create features (standard features)
• Create count feature vectors on category name
• Create TF-IDF feature vectors on item description
• Create LabelBinarizer feature vectors on brand
• Create one-hot feature vectors on item condition and shipping
• Model
• Create a dense layer of size 64
• Add a dropout layer with probability 0.3
• Create a dense layer of size 32
• Add a dropout layer with probability 0.25
• Create a dense layer of size 1
• Code: https://www.kaggle.com/luisgarcia/keras-nn-with-parallelized-batch-training/code
• Why: parallel processing
LGBM (RMSLE – 0.422)
• Approach (sketched below)
• Split the category into three subcategories
• Create standard features
• Build an LGBM regressor
• Code: https://www.kaggle.com/kamidox/single-lgbm
• Why: faster, lower memory usage
• Reference: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
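A sketch of a single LGBM regressor on the same sparse features, reusing `X` and `y` from the ridge sketch; the parameter values are illustrative, not the kernel's:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X, y: sparse features and log1p(price) target from the ridge sketch (assumptions)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=42)

params = {"objective": "regression", "metric": "rmse",
          "learning_rate": 0.3, "num_leaves": 99, "verbosity": -1}
model = lgb.train(params, lgb.Dataset(X_tr, label=y_tr), num_boost_round=3000,
                  valid_sets=[lgb.Dataset(X_va, label=y_va)])

# Because the target is log1p(price), the validation RMSE here is the leaderboard RMSLE
rmsle = np.sqrt(np.mean((model.predict(X_va) - y_va) ** 2))
```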
Ridge Regression – Feature Engineering (RMSLE – 0.412)
• Approach
• Create features
• has_category
• General category cond – general category + shipping
• Subcategory 1 cond – subcategory 1 + shipping
• Subcategory 2 cond – subcategory 2 + shipping
• Pre-process brand
• Clean brand names using Levenshtein distance
• Hashing feature vectors are created for category name, brand name and item description
• Count feature vectors on General category cond, Subcategory 1 cond and Subcategory 2 cond
• One-hot feature vectors are extracted from has_brand, shipping and item_condition_id
• Code: https://www.kaggle.com/rumbok/ridge-lb-0-41944/code
Winning Approach
Top Approach (RMSLE – 0.406)
• Approach
• Create features
• Create TF-IDF feature vectors on name and item description
• Create one-hot feature vectors of shipping and item_condition
• Combine the features
• Create a binary version of the features
• Combine the binary features with the original features
• Model (sketched below)
• Create a sparse input
• Create a dense layer of size 192
• Create a dense layer of size 64
• Create a dense layer of size 64
• Create a dense layer of size 1
• Code: https://www.kaggle.com/peterhurford/lgb-and-fm-18th-place-0-40604
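A sketch of the sparse-input MLP described above; layer sizes follow the slide, while `n_features` is an assumed width for the combined sparse feature matrix. Training would feed the sparse matrix in mini-batches (the exact mechanism depends on the Keras version), so only the model definition is shown:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

n_features = 200000                       # assumed width of the combined sparse feature matrix
inp = Input(shape=(n_features,), sparse=True)
h = Dense(192, activation="relu")(inp)
h = Dense(64, activation="relu")(h)
h = Dense(64, activation="relu")(h)
out = Dense(1)(h)                         # regression output: log1p(price)

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```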