A classifier can predict the class labels of new data after training. In real-world data sets the proportion of class labels is often imbalanced, and imbalanced data makes training a classifier difficult. This is the case for the Real-Time Bidding (RTB) framework in online advertising, and there are several ways to deal with the problem and improve the classifier's performance.
2. Table of contents
1. Introduction
2. Methods
2.1 Re-sampling
2.2 Cost-sensitive learning
3. Tools in practice
4. References
3. Introduction
A classifier can predict the class labels of new data after training. In real-world data sets the proportion of class labels is often imbalanced, and imbalanced data makes training a classifier difficult. This is the case for the Real-Time Bidding (RTB) framework in online advertising, and there are several ways to deal with the problem and improve the classifier's performance.
4. Methods: Re-sampling
Re-sampling can deal with imbalanced data by balancing the proportion of class labels:
• Under-sampling the majority class
• Over-sampling the minority class
• Combining over- and under-sampling
• Creating ensembles of balanced sets
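The first two strategies can be sketched in a few lines of plain Python. This is a minimal illustration (libraries such as imbalanced-learn provide production-ready versions of all four strategies); the function names and the 90/10 toy data are made up for the example.

```python
import random

def under_sample(majority, minority, seed=0):
    """Randomly drop majority examples until both classes are the same size."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)) + list(minority)

def over_sample(majority, minority, seed=0):
    """Randomly duplicate minority examples until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(list(minority), k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra

majority = [("x%d" % i, 0) for i in range(90)]   # 90 negative examples
minority = [("x%d" % i, 1) for i in range(10)]   # 10 positive examples

balanced_under = under_sample(majority, minority)  # 20 examples, 10 of each class
balanced_over = over_sample(majority, minority)    # 180 examples, 90 of each class
```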
5. Methods: Calibration after re-sampling
There are several ways to calibrate the output probability of a classifier after re-sampling:
• Isotonic regression
  minimize Σ_i w_i (y_i − ŷ_i)²
  subject to ŷ_min = ŷ_1 ≤ ŷ_2 ≤ … ≤ ŷ_n = ŷ_max
• Calibration factor for negative under-sampling
  q = p / (p + (1 − p) / w)
  • q: calibrated probability
  • p: prediction in the under-sampled space
  • w: under-sampling rate
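Both calibration methods are small enough to sketch directly. The isotonic fit below is a minimal pool-adjacent-violators implementation of the weighted least-squares problem above (scikit-learn's IsotonicRegression is the practical choice), and `calibrate` is the under-sampling correction formula; all function names here are illustrative.

```python
def isotonic_fit(y, w=None):
    """Pool Adjacent Violators: weighted least-squares fit of y that is
    non-decreasing in the input order (minimizes sum_i w_i (y_i - yhat_i)^2)."""
    w = w or [1.0] * len(y)
    blocks = []  # each block: [sum of weights, sum of w*y, number of points]
    for yi, wi in zip(y, w):
        blocks.append([wi, wi * yi, 1])
        # merge while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][1] / blocks[-2][0] > blocks[-1][1] / blocks[-1][0]:
            w2, s2, c2 = blocks.pop()
            blocks[-1][0] += w2
            blocks[-1][1] += s2
            blocks[-1][2] += c2
    fitted = []
    for wb, sb, cb in blocks:
        fitted.extend([sb / wb] * cb)  # every point in a block gets the block mean
    return fitted

def calibrate(p, w):
    """Map a prediction p made in the negative-under-sampled space back to a
    calibrated probability q = p / (p + (1 - p)/w), where w is the
    under-sampling rate (e.g. w = 0.1 means 10% of negatives were kept)."""
    return p / (p + (1.0 - p) / w)

fit = isotonic_fit([1.0, 3.0, 2.0])  # the 3, 2 violation is pooled to 2.5, 2.5
q = calibrate(0.5, 0.1)              # deflates 0.5 back to ~0.091
```

With w = 1 (no under-sampling) `calibrate` leaves the prediction unchanged; the smaller w is, the more the raw prediction is deflated, undoing the inflation caused by dropping negatives.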
6. Methods: Calibration after re-sampling
• Probability calibration should be done on new data not used for model fitting
• Logistic regression returns well-calibrated predictions by default, as it directly optimizes log-loss
7. Cost-sensitive learning
                  Actual negative   Actual positive
Predict negative  C(0, 0)           C(0, 1)
Predict positive  C(1, 0)           C(1, 1)
• Cost-sensitive learning takes the misclassification costs into consideration
• R(i|x) = Σ_j P(j|x) C(i, j)
  • the expected cost R(i|x) of classifying an instance x into class i, where C(i, j) is the cost of classifying an instance of actual class j into class i (1 = positive, 0 = negative)
• The classifier will classify an instance x into the positive class if and only if:
  P(0|x) C(1, 0) ≤ P(1|x) C(0, 1), assuming C(0, 0) = C(1, 1) = 0
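The expected-cost rule above can be made concrete with a short sketch. The cost values and function names are made up for illustration; the logic is exactly R(i|x) = Σ_j P(j|x) C(i, j) with the minimum-cost class chosen.

```python
def expected_cost(i, p_pos, C):
    """R(i|x) = sum_j P(j|x) * C(i, j), with class 1 = positive, 0 = negative."""
    p = {1: p_pos, 0: 1.0 - p_pos}
    return sum(p[j] * C[(i, j)] for j in (0, 1))

def predict(p_pos, C):
    """Choose the class with the lower expected cost (positive on ties)."""
    return 1 if expected_cost(1, p_pos, C) <= expected_cost(0, p_pos, C) else 0

# Illustrative costs: a false positive costs 1, a false negative costs 5,
# and correct predictions are free (C(0,0) = C(1,1) = 0).
C = {(0, 0): 0, (0, 1): 5, (1, 0): 1, (1, 1): 0}

label = predict(0.2, C)  # R(1|x) = 0.8*1 = 0.8 < R(0|x) = 0.2*5 = 1.0, so positive
```

Note that with these costs an instance is labeled positive even though P(1|x) = 0.2 < 0.5: the high false-negative cost shifts the decision toward the rare class.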
8. Cost-sensitive learning types: Thresholding
• The thresholding method modifies the threshold (0.5 by default) used to label the class, taking the costs into account:
  p* = C(1, 0) / (C(1, 0) + C(0, 1))
• with threshold p*, the classifier classifies an instance x as positive if P(1|x) ≥ p*
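A one-line sketch of the threshold, with illustrative costs (false positive cost 1, false negative cost 5); the function name is made up for the example. The thresholded decision agrees with the minimum-expected-cost decision from the previous slide.

```python
def cost_threshold(fp_cost, fn_cost):
    """p* = C(1,0) / (C(1,0) + C(0,1)); classify positive when P(1|x) >= p*."""
    return fp_cost / (fp_cost + fn_cost)

p_star = cost_threshold(1, 5)  # 1/6: expensive false negatives lower the bar
is_positive = 0.2 >= p_star    # P(1|x) = 0.2 clears the lowered threshold
```

With equal costs the threshold falls back to the default 0.5, so thresholding strictly generalizes the usual decision rule.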
9. Cost-sensitive learning types: Sampling
• The re-sampling methods described above can be considered a form of cost-sensitive learning
• Positive and negative examples are sampled in the ratio:
  p(1) FN : p(0) FP
• p(1) and p(0) are the prior probabilities of positive and negative examples in the original training set, and FN = C(0, 1) and FP = C(1, 0) are the misclassification costs
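The ratio is just two products; a tiny sketch with illustrative numbers (10% positives, false negatives five times as costly as false positives; the function name is made up):

```python
def cost_sampling_ratio(p_pos, fn_cost, fp_cost):
    """Return the positive : negative sampling ratio p(1)*FN : p(0)*FP."""
    return p_pos * fn_cost, (1.0 - p_pos) * fp_cost

pos_share, neg_share = cost_sampling_ratio(0.1, 5, 1)
# 0.5 : 0.9 -- the costly rare class is sampled far more heavily than
# its 1 : 9 prior ratio would suggest.
```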
10. Cost-sensitive learning types: Weighting
• The weighting method assigns a normalized weight to each instance according to the misclassification costs
• This can be considered a variant of the sampling method, since an example with a high weight (a rare class with a high cost) can be viewed as a duplicated example, hence sampling
• Unlike the sampling method, the weighting method can utilize all of the data
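A minimal sketch of cost-based instance weighting, assuming the same illustrative costs as before (false negatives three times as costly here; the function name is made up). Note that every instance keeps a nonzero weight, which is the "utilizes all data" point above.

```python
def instance_weights(labels, fn_cost, fp_cost):
    """Assign each instance a weight proportional to the cost of
    misclassifying it, normalized so the weights sum to 1."""
    raw = [fn_cost if y == 1 else fp_cost for y in labels]
    total = sum(raw)
    return [r / total for r in raw]

# One positive among three negatives, FN cost 3, FP cost 1:
# the single positive carries half of the total weight.
weights = instance_weights([1, 0, 0, 0], 3, 1)
```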
11. Tools in practice: XGBoost
• Balance the positive and negative weights via scale_pos_weight if you only care about the ranking order of your predictions
  • typically set to sum(negative/majority samples) / sum(positive/minority samples)
• Use AUC for evaluation; Utility [Chapelle O 2015] can also be considered as a metric in RTB
• If you care about predicting the right probability, you cannot re-balance the data
  • setting the parameter max_delta_step to a finite number (like 1) will help convergence
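The advice above translates into a parameter dictionary like the one below. scale_pos_weight, max_delta_step, objective, and eval_metric are real XGBoost parameter names; the class counts are made up for illustration, and the actual training call (omitted) requires the xgboost package.

```python
# Counts from a hypothetical imbalanced training set (1% positives).
n_negative = 990_000   # majority class
n_positive = 10_000    # minority class

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",  # ranking-quality metric, robust to class imbalance
    # Only if ranking order matters -- this re-balances the classes and
    # therefore distorts the predicted probabilities:
    "scale_pos_weight": n_negative / n_positive,
    # Helps convergence on heavily imbalanced data:
    "max_delta_step": 1,
}
# If calibrated probabilities are needed instead, drop scale_pos_weight
# (leave it at its default of 1) and do not re-balance the data.
```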
12. References
• Chapelle O. (2015). Offline Evaluation of Response Prediction in Online Advertising Auctions.
• Chen T. et al. (2016). XGBoost.
• He X. et al. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook.
• Ling C. et al. (2008). Cost-Sensitive Learning and the Class Imbalance Problem.
• Vasile F. et al. (2016). Cost-Sensitive Learning for Utility Optimization in Online Advertising Auctions.