Used a 40 GB dataset made available by Avito via Kaggle to demonstrate how to handle big data for machine learning with limited memory. Instead of taking the incremental learning route to train a classifier, we used an intelligent technique to create a representative sample of the dataset.
Since ad clicks are very rare events, naively sampling the data would have led to significantly biased predictions. This sampling bias was addressed by assigning an importance weight to each selected data example.
The resulting dataset could easily fit into memory, so a logistic regression model was then trained on it.
8. Solution: Data Merging
● Created database indexes on the columns used to join the tables
● Python script to process, join, and write the data to file in chunks (see the sketch below)
● Result: a merged file (50 GB)
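A minimal sketch of the chunked merge, assuming the lookup tables were loaded into a SQLite database beforehand; the file, table, and column names (avito.db, SearchInfo, SearchID, trainSearchStream.tsv) are illustrative, not the exact pipeline used:

```python
import sqlite3
import pandas as pd

# Assumes the smaller lookup tables were loaded into avito.db beforehand.
conn = sqlite3.connect("avito.db")

# Index the join column once so every chunked lookup is a fast seek.
conn.execute("CREATE INDEX IF NOT EXISTS idx_search ON SearchInfo(SearchID)")

first = True
for chunk in pd.read_csv("trainSearchStream.tsv", sep="\t", chunksize=500_000):
    # Stage this chunk's keys in a temp table and join inside SQLite,
    # so only the matching SearchInfo rows are ever pulled into memory.
    chunk[["SearchID"]].drop_duplicates().to_sql(
        "chunk_ids", conn, if_exists="replace", index=False)
    lookup = pd.read_sql_query(
        "SELECT s.* FROM SearchInfo AS s "
        "JOIN chunk_ids AS c ON s.SearchID = c.SearchID", conn)
    merged = chunk.merge(lookup, on="SearchID", how="left")
    # Append each processed chunk to the merged output file.
    merged.to_csv("merged.tsv", sep="\t", mode="w" if first else "a",
                  header=first, index=False)
    first = False
```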
9. How to process the merged file?
● Out-of-core learning
● Representative sampling
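For contrast, the out-of-core route that was not taken would stream the merged file through an incrementally trained model. A minimal sketch using scikit-learn's SGDClassifier; the feature and target column names are assumptions:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression trained incrementally via SGD: partial_fit sees one
# chunk at a time, so the 50 GB file never has to fit in memory.
# (The loss is named "log" in scikit-learn versions before 1.1.)
model = SGDClassifier(loss="log_loss")

for chunk in pd.read_csv("merged.tsv", sep="\t", chunksize=100_000):
    X = chunk[["Hour", "Context_ads_no"]]  # stand-in numeric features
    y = chunk["IsClick"]                   # assumed target column
    model.partial_fit(X, y, classes=[0, 1])
```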
10. Smart Downsampling of Training Data
Keep in the sample:
● Any query for which at least one of the ads was clicked.
● A fraction r ∈ (0, 1] of the queries where none of the ads were clicked.
Fixing the sampling bias: each example kept from an unclicked query is assigned an importance weight of 1/r, so the weighted loss remains an unbiased estimate of the loss on the full data.* Fixing the bias this way reduced loss by 78%.
*Ad Click Prediction: a View from the Trenches (Google, 2013)
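A sketch of this rule, assuming rows are grouped into queries by a SearchID column and clicks are flagged in an IsClick column (both assumptions about the schema); the 1/r weight follows the cited paper:

```python
import numpy as np
import pandas as pd

def subsample(df, r, seed=0):
    """Keep every clicked query; keep unclicked queries with probability r
    and up-weight the survivors by 1/r to undo the sampling bias."""
    rng = np.random.default_rng(seed)
    # A query is "clicked" if any of its ads was clicked.
    clicked_q = df.groupby("SearchID")["IsClick"].transform("max").astype(bool)
    # Flip one coin per query, then broadcast the decision to its rows,
    # so whole queries are kept or dropped together.
    qids = df["SearchID"].unique()
    keep_q = pd.Series(rng.random(len(qids)) < r, index=qids)
    keep = clicked_q | df["SearchID"].map(keep_q)
    out = df[keep].copy()
    # Importance weight: 1 for clicked queries, 1/r otherwise, so the
    # weighted loss is an unbiased estimate of the full-data loss.
    out["weight"] = np.where(clicked_q[keep], 1.0, 1.0 / r)
    return out

# e.g. sampled = subsample(merged, r=0.03)
```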
11. How do you choose the sampling probability?
Experiments have verified that even fairly aggressive sub-sampling of unclicked queries has a very mild impact on accuracy, and that predictive performance is not especially impacted by the specific value of r.*
*Ad Click Prediction: a View from the Trenches (Google, 2013)
12. Sampling results
● Number of context ads: 190,157,736 (50 GB)
● Number of sub-sampled ads: 5,766,142 (1.5 GB), roughly 3% of the original rows
13. Feature Engineering
Feature | Description
Day_of_week | Day of the week extracted from the ad's posted date
Hour | Hour of day extracted from the ad's posted date
Search_Ad_Ratio | Similarity between the search query and the ad title
User_click_prob | Historic per-user probability of clicking an ad
Regular_ads_no | Number of regular ads per query
Context_ads_no | Number of context ads per query
Highlighted_ads_no | Number of highlighted ads per query
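A sketch of how a few of these features might be computed; all column names (SearchDate, SearchQuery, Title, UserID, AdID, IsClick) are assumptions about the merged schema, and token Jaccard similarity is just one plausible choice for Search_Ad_Ratio:

```python
import pandas as pd

df = pd.read_csv("sampled.tsv", sep="\t", parse_dates=["SearchDate"])

# Temporal features from the posted date.
df["Day_of_week"] = df["SearchDate"].dt.dayofweek
df["Hour"] = df["SearchDate"].dt.hour

# Search_Ad_Ratio: token Jaccard similarity between query and ad title.
def jaccard(q, t):
    a, b = set(str(q).lower().split()), set(str(t).lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

df["Search_Ad_Ratio"] = [jaccard(q, t)
                         for q, t in zip(df["SearchQuery"], df["Title"])]

# User_click_prob: each user's click rate (in a real pipeline this should
# be computed on past data only, to avoid leaking the current label).
df["User_click_prob"] = df.groupby("UserID")["IsClick"].transform("mean")

# Context_ads_no: number of context ads shown for each query.
df["Context_ads_no"] = df.groupby("SearchID")["AdID"].transform("count")
```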
15. Predictive Model
● Fit a logistic regression model on the training data
● Make predictions on the validation set and evaluate locally using log loss: 0.0512
● Make predictions on the test set and evaluate on Kaggle using log loss: 0.0588
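A sketch of this final step, assuming the feature columns and the weight column from the subsampling step are present in df; the importance weights pass into both the fit and the local log loss evaluation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

features = ["Day_of_week", "Hour", "Search_Ad_Ratio", "User_click_prob",
            "Regular_ads_no", "Context_ads_no", "Highlighted_ads_no"]

X_tr, X_val, y_tr, y_val, w_tr, w_val = train_test_split(
    df[features], df["IsClick"], df["weight"],
    test_size=0.2, random_state=0)

# The 1/r importance weights from the subsampling step feed straight
# into the fit; this is what corrects the downsampling bias.
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr, sample_weight=w_tr)

# Weighted log loss on the held-out validation split.
val_pred = model.predict_proba(X_val)[:, 1]
print(log_loss(y_val, val_pred, sample_weight=w_val))
```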