CTR Logistic Regression
Criteo Kaggle dataset
One Hot Encoding
ROC curves
Joe Duimstra
Sept 1, 2015
Workflow
In an IPython notebook, I predict click-through rates (CTR) using logistic
regression with ridge (L2) regularization:
1. Get the data
2. Load and split data into training, cross-validation, and test sets
3. Use One Hot Encoding (OHE) to generate a sparse matrix from
the categorical features in the dataset
4. Train a logistic regression classifier using log loss as the
evaluation metric and compare to a baseline model that predicts
the same value regardless of observation
5. Optimize classifier by finding best regularization parameter
6. Generate ROC curves
7. Use the model to predict the test set outcomes
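Step 2 of the workflow can be sketched in plain Python. The 80/10/10 weights, the seed, and the function name split_three_ways are assumptions for illustration; the slides do not state the proportions actually used.

```python
import random

def split_three_ways(rows, weights=(0.8, 0.1, 0.1), seed=17):
    """Shuffle rows and cut them into train, validation, and test lists.

    The weights and seed are illustrative assumptions, not values from the slides.
    """
    rng = random.Random(seed)      # local generator so the split is reproducible
    shuffled = list(rows)          # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_train = int(weights[0] * len(shuffled))
    n_val = int(weights[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever remains goes to the test set
    return train, val, test

train, val, test = split_three_ways(list(range(100)))
```

Every row lands in exactly one of the three sets, so the validation and test sets stay unseen during training.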
Get the Data

Data is from Criteo's Kaggle competition
Generate OHE dictionary

Encode each unique feature value as a unique integer
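A minimal sketch of the idea, assuming each observation is a tuple of categorical values and that a value is identified by its (column index, value) pair. The names build_ohe_dict and encode, and the toy rows, are hypothetical; values never seen when the dictionary was built are simply skipped here.

```python
def build_ohe_dict(rows):
    """Map every distinct (column index, value) pair to a unique integer."""
    distinct = sorted({(i, v) for row in rows for i, v in enumerate(row)})
    return {pair: idx for idx, pair in enumerate(distinct)}

def encode(row, ohe_dict):
    """Return the sorted positions that are 1 in the one-hot vector for row.

    Storing only the active positions is what makes the matrix sparse.
    """
    return sorted(ohe_dict[(i, v)] for i, v in enumerate(row) if (i, v) in ohe_dict)

# Toy data standing in for the categorical columns of the dataset.
rows = [("mouse", "black"), ("cat", "tabby"), ("bear", "black")]
ohe_dict = build_ohe_dict(rows)
known = encode(("cat", "black"), ohe_dict)
unseen = encode(("dog", "tabby"), ohe_dict)   # "dog" is not in the dictionary
```

The width of the encoded matrix equals len(ohe_dict), and each row has at most one active position per original column.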
Train LR classifier

Use the training data to train a logistic regression
classifier

Generate a baseline model for comparison
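The comparison can be illustrated with a small sketch of log loss and a constant baseline. Here the baseline predicts the fraction of clicks in the training labels for every observation; that choice of constant, the clipping value eps, and the toy labels are assumptions.

```python
import math

def mean_log_loss(preds, labels, eps=1e-11):
    """Average of -log(p) for clicks and -log(1 - p) for non-clicks."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() never sees 0
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 0, 0]                      # toy labels, one click in four
base_rate = sum(labels) / len(labels)      # the single value the baseline predicts
baseline_loss = mean_log_loss([base_rate] * len(labels), labels)

# Hypothetical model output that ranks the click highest.
model_loss = mean_log_loss([0.7, 0.2, 0.1, 0.2], labels)
```

A trained classifier is only useful if its log loss falls below baseline_loss on held-out data.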
Optimize regularization parameter
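One way to sketch this step is a grid search: train once per candidate value and keep the one with the lowest log loss on the validation set. The batch gradient descent trainer below is a stand-in written for this example, not the notebook's implementation; the candidate values, learning rate, epoch count, and toy data are all assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(active, w, b):
    """Predicted click probability for a row given as a list of active indices."""
    return sigmoid(b + sum(w[i] for i in active))

def train_lr(rows, labels, n_features, reg, lr=0.5, epochs=200):
    """Gradient descent on mean log loss plus (reg / 2) * ||w||^2 (ridge penalty)."""
    w, b, n = [0.0] * n_features, 0.0, len(rows)
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]          # gradient of the ridge term
        grad_b = 0.0
        for active, y in zip(rows, labels):
            err = predict(active, w, b) - y
            for i in active:
                grad_w[i] += err / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def mean_log_loss(preds, labels, eps=1e-11):
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

# Toy one-hot-encoded rows (lists of active indices) and click labels.
train_rows = [[0, 2], [0, 3], [1, 2], [1, 3], [0, 2], [1, 3]]
train_labels = [1, 1, 0, 0, 1, 0]
val_rows, val_labels = [[0, 3], [1, 2]], [1, 0]

candidates = [1e-3, 1e-2, 1e-1, 1.0]             # assumed search grid
losses = {}
for reg in candidates:
    w, b = train_lr(train_rows, train_labels, n_features=4, reg=reg)
    losses[reg] = mean_log_loss([predict(r, w, b) for r in val_rows], val_labels)

best_reg = min(losses, key=losses.get)           # lowest validation log loss wins
best_loss = losses[best_reg]
```

Selecting on the validation set rather than the training set keeps the choice from simply favoring the weakest penalty that fits the training data best.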
ROC curves
● Blue is train set
● Green is unoptimized logistic regression
● Red is optimized logistic regression
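How the points on such a curve are obtained can be sketched as follows: sort observations by predicted probability, lower the threshold one distinct score at a time, and record the false positive and true positive rates. The function names and toy scores are assumptions, and the sketch requires at least one click and one non-click in the labels.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, from (0, 0) to (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev = None
    for score, y in pairs:
        # Emit a point only when the score changes, so ties share one threshold.
        if prev is not None and score != prev:
            points.append((fp / neg, tp / pos))
        if y == 1:
            tp += 1
        else:
            fp += 1
        prev = score
    points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 0]
perfect = roc_points([0.9, 0.8, 0.35, 0.1], labels)   # clicks ranked above non-clicks
constant = roc_points([0.5, 0.5, 0.5, 0.5], labels)   # same prediction for every row
auc_perfect = auc(perfect)
auc_constant = auc(constant)
```

A model that predicts the same value for every observation collapses to the diagonal with area 0.5, which is why a better classifier shows up as a curve bowed toward the top-left corner.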
Comparing with PySpark
