NLP Approaches on Kaggle
Agenda
• Current trends
• Jigsaw toxic comment classification
• Problem Description
• Different approaches
• Winning Approaches
• Mercari Price Suggestion Challenge
• Problem Description
• Different approaches
• Winning Approaches
Text Properties
• High dimensional data
• Sparse data
• Sequential data
• Linearly separable
• https://www.svm-tutorial.com/2014/10/svm-linear-kernel-good-text-classification/
Feature Creation
• Remove stop words
• Remove special characters
• Create TF-IDF / hashing feature vectors (see the sketch below)
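A minimal sketch of the feature-creation steps above, assuming a list of raw comment strings (the `texts` variable here is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

texts = ["This is a great product!!!", "worst thing i ever bought :("]  # hypothetical data

# TF-IDF features: stop words removed, a simple word token pattern drops most special characters
tfidf = TfidfVectorizer(stop_words="english", token_pattern=r"\b\w+\b", max_features=50000)
X_tfidf = tfidf.fit_transform(texts)        # sparse matrix: n_samples x vocabulary

# Hashing features: fixed width, no vocabulary kept in memory
hasher = HashingVectorizer(stop_words="english", n_features=2**18, alternate_sign=False)
X_hash = hasher.transform(texts)
```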
Machine Learning Model
• Linear SVM
• Multinomial Naïve Bayes
• Boosting approaches such as XGBoost and GBM
• Decision tree approaches are not recommended
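A baseline sketch of the Linear SVM and Multinomial Naïve Bayes models listed above, assuming the `X_tfidf` matrix from the previous sketch and a binary label vector `y` (both assumptions, not provided by the slides):

```python
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# X_tfidf: sparse TF-IDF matrix from the previous sketch, y: assumed 0/1 label vector
svm = LinearSVC(C=1.0)
nb = MultinomialNB(alpha=1.0)

print("Linear SVM AUC:", cross_val_score(svm, X_tfidf, y, cv=3, scoring="roc_auc").mean())
print("MultinomialNB AUC:", cross_val_score(nb, X_tfidf, y, cv=3, scoring="roc_auc").mean())
```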
Deep Learning Starter
• Neural Network
• Dense Layer
• Activation Layer
• Optimizer: Stochastic Gradient Descent (SGD)
• Deep Learning
• The dense layers can be replaced:
• LSTM/GRU for text
• Convolution for images
• The softmax layer can be replaced with a ReLU or Leaky ReLU layer
• Layers such as dropout and max pooling serve different purposes (a starter sketch follows below)
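A minimal Keras starter along these lines: dense layers, an activation, and SGD as the optimizer. The input width of 10,000 is an assumed TF-IDF dimensionality, not something from the slides:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(10000,)),  # 10000 = assumed TF-IDF width
    Dropout(0.2),                                        # optional regularisation layer
    Dense(1, activation="sigmoid"),                      # binary output; use softmax for multi-class
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```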
Deep Learning approach
• Architecture
• Input features
• Embedding Layer
• Dense/LSTM/GRU/Conv
• Dropout (optional)
• Max pooling
• Activation function
• Softmax Layer
• Optimizer: Adam
• Activation: ReLU
Dropout
(figure: dropout illustration)
Maxpool Layer
(figure: max-pooling illustration)
Embedding Layer
• TF-IDF and count features do not capture semantic relations
• Word2vec captures these relationships (see the embedding-matrix sketch below)
Word2vec
https://github.com/3Top/word2vec-api
GloVe
https://nlp.stanford.edu/projects/glove/
fastText
https://github.com/facebookresearch/fastText
Dependency-Based Word Embeddings
http://cistern.cis.lmu.de/meta-emb/
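A sketch of how pre-trained vectors typically feed a Keras embedding layer: build an embedding matrix indexed by the tokenizer's vocabulary. The GloVe file name, vocabulary size and `texts` variable are assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

max_words, embedding_dim = 30000, 100            # assumed vocabulary size and GloVe dimension
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)                    # `texts`: raw comment strings (assumed)

# Load GloVe vectors from a local text file (file name is an assumption)
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Rows of the matrix line up with the tokenizer's word indices; unknown words stay zero
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if idx < max_words and word in embeddings:
        embedding_matrix[idx] = embeddings[word]
```

In tf.keras 2.x this matrix would then be handed to the embedding layer, e.g. `Embedding(max_words, embedding_dim, weights=[embedding_matrix], trainable=False)`.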
Problem Description
• Classify the Wikipedia comments
• Toxic
• Severe toxic
• Obscene
• Threat
• Insult
• Identity hate
• Imbalanced dataset
• Generally short texts
• Spelling mistakes
• Shorthand
• Improper use of grammar
Data Description
NB-SVM Baseline Approach (AUC – 0.972)
• Suggestion: Linear SVM and Naïve Bayes work well on text
• Approach:
• Create TF-IDF features
• Create Naïve Bayes features
• Build an SVM model
• Why? Combines the characteristics of Naïve Bayes and SVM
• Code: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
• Paper: https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
• Takeaway: try this approach directly instead of a single SVM or Naïve Bayes model (see the sketch below)
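A sketch of the NB-SVM idea from the paper above: scale TF-IDF features by Naïve Bayes log-count ratios, then fit a linear model on the scaled matrix (the referenced kernel uses logistic regression as the SVM stand-in). `X` and `y` are an assumed sparse TF-IDF matrix and a binary label vector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naïve Bayes log-count ratio over the TF-IDF features."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed feature totals in the positive class
    q = alpha + X[y == 0].sum(axis=0)   # smoothed feature totals in the negative class
    return np.log((p / p.sum()) / (q / q.sum()))

r = nb_log_count_ratio(X, np.asarray(y))
X_nb = X.multiply(r)                    # element-wise scaling keeps the matrix sparse
clf = LogisticRegression(C=4.0, solver="liblinear").fit(X_nb, y)
```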
Bidirectional LSTM Baseline – (0.971)
• Suggestion: Bidirectional LSTM is suitable for text
• Approach:
• Create Embedding layer of size 128
• Add LSTM layer of size 50
• Pass it to the GlobalMaxPool layer
• Add dropout with probability 0.1
• Add a dense layer of size 50
• Add dropout with probability 0.1
• Add a dense layer of size 6
• Code: https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-069?scriptVersionId=2188439
• Takeaway: GlobalMaxPool is suggested for NLP and MaxPool for vision (a Keras sketch of this baseline follows)
• Reference: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
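A Keras sketch of the bidirectional LSTM baseline above; layer sizes follow the slide, while `max_features` and `maxlen` are assumed preprocessing choices:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     GlobalMaxPooling1D, Dropout, Dense)

max_features, maxlen = 20000, 100        # assumed vocabulary size and padded sequence length
model = Sequential([
    Embedding(max_features, 128, input_length=maxlen),
    Bidirectional(LSTM(50, return_sequences=True)),
    GlobalMaxPooling1D(),
    Dropout(0.1),
    Dense(50, activation="relu"),
    Dropout(0.1),
    Dense(6, activation="sigmoid"),      # one output per toxicity label
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```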
Bidirectional LSTM Improvement – (0.972)
• Approach:
• Add GloVe vectors to the embedding layer
• Why? GloVe has better representations of words
• Code: https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout
Bidirectional LSTM Improvement – (0.973)
• Approach:
• Add an attention layer
• Why? The attention layer captures more information
• Code: https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
Logistic Regression – Feature Engineering (AUC – 0.9792)
• Approach (sketched below):
• Create 2-6 character n-grams
• Create TF-IDF features for the 2-6 character n-grams and for words
• Build a logistic regression model
• Why?
• Words like "Stuppid", "Idiooot", "nooo" exist
• Character n-grams can capture this information well
• Not applicable: Healthcare and Summit
• Code: https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams/output
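A sketch of the word + character n-gram logistic regression above; `train_text` and `y` are assumed to be the raw comments and one of the six labels, and the vectorizer limits are illustrative:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), max_features=10000)
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 6), max_features=50000)

# Stack word-level and character-level TF-IDF features side by side
X = hstack([word_vec.fit_transform(train_text),
            char_vec.fit_transform(train_text)]).tocsr()
clf = LogisticRegression(C=0.1, solver="sag").fit(X, y)   # one model per label in practice
```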
FM FTRL – Feature Engineering (AUC – 0.981)
• Approach:
• Replace logistic regression with FM_FTRL
• Why?
• FTRL is often a good replacement for logistic regression
• FTRL is widely used in click-through-rate prediction
• Code: https://www.kaggle.com/anttip/wordbatch-1-3-3-fm-ftrl-lb-0-9812/code
• Not applicable: Healthcare and Summit
• Takeaway:
• Wordbatch library
• Feature extraction from TF-IDF to word2vec is included
• Algorithms such as FM_FTRL and NN_ReLU_H1
Bidirectional GRU – (0.980)
• Suggestion: GRU is a simplified version of LSTM
• Approach (sketched below):
• Create an embedding layer of size 30,000
• Add spatial dropout with probability 0.1
• Add a bidirectional GRU of size 80
• Pass it to a GlobalAveragePooling layer
• Pass it to a GlobalMaxPool layer
• Concatenate both pooling layers
• Pass the result to a sigmoid layer of size 6
• Code: https://www.kaggle.com/prashantkikani/pooled-gru-with-preprocessing/code
• Takeaway:
• Spatial dropout before the GRU is an unusual approach
• Combining average pooling and max pooling may capture additional information
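A Keras functional-API sketch of the pooled bidirectional GRU above, written functionally so the average- and max-pooled outputs can be concatenated; `maxlen` and `embed_size` are assumptions, and the embedding would normally be initialized from pre-trained vectors as in the later improvements:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional,
                                     GRU, GlobalAveragePooling1D, GlobalMaxPooling1D,
                                     concatenate, Dense)

maxlen, max_features, embed_size = 100, 30000, 300    # assumed sequence length / vocab / dimension
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)          # add weights=[embedding_matrix] for GloVe/fastText
x = SpatialDropout1D(0.1)(x)
x = Bidirectional(GRU(80, return_sequences=True))(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
out = Dense(6, activation="sigmoid")(concatenate([avg_pool, max_pool]))

model = Model(inputs=inp, outputs=out)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```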
Bidirectional GRU Improvement – (0.982)
• Approach:
• Add GloVe vectors to the previous model
• Why? GloVe has better representations of words
• Code: https://www.kaggle.com/prashantkikani/pooled-gru-glove-with-preprocessing
Bidirectional GRU Improvement – (0.983)
• Approach:
• Add fastText vectors to the previous model
• Why? fastText can also capture character n-grams
• Code: https://www.kaggle.com/yekenot/pooled-gru-fasttext
Convolution with fastText – (0.982)
• Approach:
• Replace the GRU with a 2D convolution
• Why? Convolutions are widely tried on text as well
• Code: https://www.kaggle.com/shujian/textcnn-2d-convolution
Bidirectional GRU with Convolution – (0.9839)
• Approach:
• Combine GRU and convolution
• Why? Convolution acts as a feature-reduction technique, similar to PCA
• Code: https://www.kaggle.com/eashish/bidirectional-gru-with-convolution/code
• Author claim: GRU and LSTM give the same results
Winning Approach
Winning Approach – 0.9872
• Fix spelling mistakes
• Find regular words within a small Levenshtein distance
• Use TextBlob to fix errors
• Fix misspelled words by finding word-vector neighbourhoods
• Embedding layer
• Combine GloVe and Twitter fastText embeddings
• Words without word vectors are replaced with "something"
• Add a flag of 1 if the word contains a capital letter
• Add spatial dropout with probability 0.1
• Add a bidirectional LSTM of size 40
• Author note: LSTM showed better results than GRU
• Add a bidirectional GRU of size 40
• Combine the last state, max pooling, average pooling and two extra features: "unique words rate" and "rate of all-caps words"
• Discussion URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52644
Winning Approach – 0.9890 (1st place solution)
• Approach
• Diverse pre-trained embeddings (AUC – 0.9877)
• Combine high-dimensional fastText and GloVe embeddings
• Author note: the embedding layer is more important than the other layers
• Add two bidirectional GRU layers followed by two dense layers
• Add translations from German, French and other languages (AUC – 0.980)
• Pseudo-labeling (AUC – 0.985)
• Label the test samples using the best approach
• Use the labeled test samples when building the model
• Stacking (AUC – 0.989)
• Use LGBM for stacking
• Use Bayesian optimization for heavy parameter tuning
• Created six different seeds and built a different LGBM model for each
• Bagged the different models to reduce variance
• Takeaway: http://neupy.com/2016/12/17/hyperparameter_optimization_for_neural_networks.html
• Discussion link: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52557
Mercari Price Suggestion Challenge
• Problem Description
• Predict the price of a product based on its description
• Constraints
• Hardware: 4 cores / 16 GB RAM
• Code should run within 1 hour
• Focus: both speed and accuracy
Mercari Price Suggestion Challenge
• Sample data (figure)
• Scoring mechanism (figure)
Linear Regression (RMSLE – 0.6)
• Approach
• Create TF-IDF features on name and description
• Build a linear regression model
• Code: https://www.kaggle.com/jkkphys/category-tf-idf-linear-regression
Random Forest (RMSLE – 0.523)
• Approach
• Convert name, item_condition, category_name, brand_name and shipping into numeric features
• Build a random forest regressor
• Code: https://www.kaggle.com/shikhar1/base-random-forest-lb-532/notebook
• Skeptical???
Ridge Regression (RMSLE – 0.470)
• Approach (sketched below)
• Create count feature vectors of name and category
• Create TF-IDF vectors on item description
• Create LabelBinarizer features on brand
• Combine all the features
• Build a Ridge regression model
• Code: https://www.kaggle.com/apapiu/ridge-script/code
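A sketch of the ridge baseline above, stacking count, TF-IDF and binarized brand features into one sparse matrix; `df` is an assumed pandas DataFrame with the Mercari columns, and `log1p(price)` is used as the target since the metric is RMSLE:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import Ridge

# df: assumed pandas DataFrame with the Mercari columns
X = hstack([
    CountVectorizer(min_df=10).fit_transform(df["name"]),
    CountVectorizer().fit_transform(df["category_name"].fillna("missing")),
    TfidfVectorizer(max_features=50000, ngram_range=(1, 3), stop_words="english")
        .fit_transform(df["item_description"].fillna("")),
    LabelBinarizer(sparse_output=True).fit_transform(df["brand_name"].fillna("missing")),
]).tocsr()

y = np.log1p(df["price"])                # RMSE on log1p(price) corresponds to RMSLE
model = Ridge(alpha=5.0, solver="sag").fit(X, y)
```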
Ridge Regression with n-grams (RMSLE – 0.452)
• Approach: add n-gram count feature vectors of names
• Code: https://www.kaggle.com/nishimoto/ridge-script-with-n-gram
Ridge Regression with TSVD (RMSLE – 0.470)
• Approach: apply truncated SVD (TSVD) on the TF-IDF features
• Code: https://www.kaggle.com/eurbtc/public-ridge-tsvd
Keras with Parallel Batch Training (RMSLE – 0.4504)
• Approach
• Create features (standard features)
• Create count feature vectors on category name
• Create TF-IDF feature vectors on item description
• Create LabelBinarizer feature vectors on brand
• Create one-hot feature vectors on item condition and shipping
• Model
• Create a dense layer of size 64
• Add a dropout layer with probability 0.3
• Create a dense layer of size 32
• Add a dropout layer with probability 0.25
• Create a dense layer of size 1
• Code: https://www.kaggle.com/luisgarcia/keras-nn-with-parallelized-batch-training/code
• Why: parallel processing
LGBM (RMSLE – 0.422)
• Approach (sketched below)
• Split the category into three subcategories
• Create standard features
• Build an LGBM regressor
• Code: https://www.kaggle.com/kamidox/single-lgbm
• Why: faster, lower memory usage
• Reference: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
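A sketch of a single LGBM regressor on the same sparse features, reusing `X` and `y` from the ridge sketch; the parameter values are illustrative, not the kernel's:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# X, y: sparse features and log1p(price) target from the ridge sketch (assumptions)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=42)

params = {"objective": "regression", "metric": "rmse",
          "learning_rate": 0.3, "num_leaves": 99, "verbosity": -1}
model = lgb.train(params, lgb.Dataset(X_tr, label=y_tr), num_boost_round=3000,
                  valid_sets=[lgb.Dataset(X_va, label=y_va)])

# Because the target is log1p(price), the validation RMSE here is the leaderboard RMSLE
rmsle = np.sqrt(np.mean((model.predict(X_va) - y_va) ** 2))
```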
Ridge Regression – Feature Engineering (RMSLE – 0.412)
• Approach
• Create features
• has_category
• General category cond – general category + shipping
• Subcategory 1 cond – subcategory 1 + shipping
• Subcategory 2 cond – subcategory 2 + shipping
• Pre-process brand
• Clean brand names using Levenshtein distance
• Hashing feature vectors are created for category name, brand name and item description
• Count feature vectors on General category cond, Subcategory 1 cond and Subcategory 2 cond
• One-hot feature vectors are extracted from has_brand, shipping and item_condition_id
• Code: https://www.kaggle.com/rumbok/ridge-lb-0-41944/code
Winning Approach
Top Approach (RMSLE – 0.406)
• Approach
• Create features
• Create TF-IDF feature vectors on name and item description
• Create one-hot feature vectors of shipping and item_condition
• Combine the features
• Create a binary version of the features
• Combine the binary features with the original features
• Model (sketched below)
• Create a sparse input
• Create a dense layer of size 192
• Create a dense layer of size 64
• Create a dense layer of size 64
• Create a dense layer of size 1
• Code: https://www.kaggle.com/peterhurford/lgb-and-fm-18th-place-0-40604
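A sketch of the sparse-input MLP described above; layer sizes follow the slide, while `n_features` is an assumed width for the combined sparse feature matrix. Training would feed the sparse matrix in mini-batches (the exact mechanism depends on the Keras version), so only the model definition is shown:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

n_features = 200000                       # assumed width of the combined sparse feature matrix
inp = Input(shape=(n_features,), sparse=True)
h = Dense(192, activation="relu")(inp)
h = Dense(64, activation="relu")(h)
h = Dense(64, activation="relu")(h)
out = Dense(1)(h)                         # regression output: log1p(price)

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```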