A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising

Base paper Title: A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising
Abstract
Native advertising is a popular form of online advertising whose style and function resemble the native content displayed on online platforms, such as news, sports, and social websites. Because it captures users' attention more effectively, it has gained increasing popularity across many online platforms and among advertisers. In advertising, Click-Through Rate (CTR) prediction is essential but challenging due to data sparsity: non-clicks constitute most of the data, whereas clicks form a significantly smaller portion. This study compares the performance of 19 class imbalance approaches in combination with four traditional classifiers to determine the most effective imbalance methods for our native ads dataset. The data used is real traffic data from Finland, collected over seven days and provided by the native advertising platform ReadPeak. The resampling methods comprise seven undersampling techniques, four oversampling techniques, four hybrid sampling techniques, and four ensemble systems. The findings demonstrate that class imbalance learning can enhance the model's classification capacity by as much as 20%. In general, oversampling is comparatively more stable, but undersampling performed best with Random Forest. Our study also demonstrates that the imbalance ratio plays an important role in both model performance and feature importance.
Existing System
In a class imbalanced dataset, one class contains significantly fewer samples than the other [3]. The difficulties of learning from such imbalanced data are inevitable. Standard learning classifiers are biased toward the majority class due to the skewed distribution of the training samples, making them unable to recognize unusual occurrences. It is possible to mistake noise for rare minority samples and vice versa [24]. This kind of problem with uneven data is particularly prevalent in the advertising industry: the dataset contains far more non-clicked advertisements than clicked advertisements, and the difference between the two is typically significant. To address these issues, researchers have created numerous class imbalance techniques and performance evaluation criteria, which play an important role in this paper. In this context, we implemented several class imbalance methods on a real-world class imbalanced dataset provided by the native advertising platform ReadPeak [1]. The dataset contains information about the in-screen advertisements and whether they have been clicked or not. The dataset is extremely imbalanced, with 250 non-clicks for every click. Advertising is a key and crucial part of corporate operations.
In 2021, advertisers are estimated to have spent 118.72 billion dollars on display advertising [52]. Under the demand-side platform (DSP) cost-per-click pricing model, advertisers' earnings are directly correlated with the number of clicks, so predicting performance indicators such as the click-through rate (CTR) is crucial for DSP research [35]. The available literature presents valid solutions to the problem at hand, but addressing the imbalance of the data specifically, and the benefits this can yield in terms of training speed and prediction accuracy, has been neglected. In that spirit, the purpose of the current work is to explore the effect of sampling techniques on prediction accuracy, in combination with a number of classical machine learning models, applied to the prediction of CTR in native advertising. The contributions of this paper are summarized as follows:
• The performance of 19 class imbalance methods with 4 classical classifiers is evaluated. The best performing methods are identified based on the AUC and LogLoss metrics.
• An analysis of the imbalanced data problem was conducted using real-world data from ReadPeak, including a complete outline of the steps for feature engineering, feature selection, and data cleaning.
• Through experimentation, it was found that feature importance varies with different imbalance ratios. The Boruta algorithm was used to compute feature importance at three levels of sampling, highlighting the importance of balancing the data for accurate results.
• Using Random Forest and random undersampling, it is shown that as the balance between positive and negative samples improves, the performance also improves. However, there is an imbalance-ratio threshold beyond which the performance improvement becomes marginal.
• We give the community access to a real-world dataset that can help researchers study the phenomenon and find solutions to CTR prediction in the native advertising field.
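The random undersampling mentioned in the last contribution can be sketched in a few lines. The helper below is illustrative only (not the paper's actual code); `ratio` is an assumed parameter controlling how many non-clicks are kept per click, with `ratio=1.0` yielding a fully balanced set:

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Randomly drop non-clicks (y == 0) until there are `ratio`
    negatives per positive; keeps all positives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(round(ratio * len(pos))))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

Sweeping `ratio` from 250 (the raw skew) down toward 1 is one way to study the threshold effect described above.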
Drawback in Existing System
Data Representation and Generalization:
Drawback: The choice of a particular sampling technique may lead to biases in the
representation of the dataset. Oversampling or undersampling may impact the generalization
of the models to real-world scenarios, especially when the dataset is not fully representative of
the target population.
Impact on Model Complexity:
Drawback: Some sampling techniques may inadvertently impact the complexity of the
predictive models. For instance, oversampling may lead to overfitting, especially when dealing
with small datasets, while undersampling might result in loss of valuable information.
Limited Transparency:
Drawback: Some advanced sampling techniques, especially those involving synthetic
data generation, may lack transparency in terms of how they modify the underlying data
distribution. This can make it challenging to interpret the model's behavior and decision-
making processes.
Impact on Evaluation Metrics:
Drawback: The choice of sampling technique may influence the performance metrics
used for evaluation. For instance, oversampling may artificially boost metrics like accuracy
but may not necessarily improve the model's ability to generalize to new, unseen data.
Proposed System
Sampling Techniques:
Implement and analyze various sampling techniques, including but not limited to
random sampling, stratified sampling, undersampling, oversampling, and hybrid methods.
Investigate advanced techniques such as SMOTE (Synthetic Minority Over-sampling
Technique) and ADASYN (Adaptive Synthetic Sampling).
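The core idea behind SMOTE, generating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched as follows. This is a minimal illustration built on scikit-learn's NearestNeighbors, not the full SMOTE implementation from the imbalanced-learn library:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # a random minority point
        j = idx[i, rng.integers(1, k + 1)]   # one of its k neighbors
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Because every synthetic sample is a convex combination of two real minority samples, the new points always lie inside the existing minority region.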
Predictive Models:
Employ state-of-the-art machine learning models for click-through rate prediction,
including logistic regression, decision trees, random forests, and gradient boosting
machines.
Utilize deep learning models, such as neural networks, to assess the impact of sampling
techniques on more complex predictive architectures.
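The four classical models above can be set up with scikit-learn as follows. This is a sketch with synthetic stand-in data (the real ReadPeak features are not reproduced here), and the hyperparameters shown are illustrative defaults, not the paper's tuned values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in features, roughly 1 click per 20 impressions.
X = rng.normal(size=(2000, 8))
y = (rng.random(2000) < 0.05).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
p_click = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p_click[name] = model.predict_proba(X_te)[:, 1]  # estimated CTR per row
```

In practice a resampling step from the previous section would be applied to `X_tr, y_tr` before fitting, so each sampler/classifier pair can be compared on the untouched test split.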
Evaluation Metrics:
Measure the performance of each sampling technique using standard evaluation metrics
such as precision, recall, F1 score, and area under the receiver operating characteristic curve
(AUC-ROC).
Assess the impact on model calibration and reliability.
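The metrics listed above are all available in scikit-learn; the sketch below computes them on made-up toy predictions (`y_true` marks clicks, `p_click` is a hypothetical model's estimated CTR):

```python
import numpy as np
from sklearn.metrics import (log_loss, precision_recall_fscore_support,
                             roc_auc_score)

# Toy values for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])
p_click = np.array([0.10, 0.20, 0.05, 0.30, 0.80,
                    0.15, 0.60, 0.10, 0.25, 0.05])

auc = roc_auc_score(y_true, p_click)   # ranking quality, threshold-free
ll = log_loss(y_true, p_click)         # sensitive to probability calibration
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, (p_click >= 0.5).astype(int), average="binary", zero_division=0)
```

Note that AUC and LogLoss score the predicted probabilities directly, while precision, recall, and F1 depend on the chosen decision threshold, which matters under heavy class imbalance.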
Comparison and Recommendations:
Present a comparative analysis of the various sampling techniques, highlighting their
strengths, weaknesses, and suitability for different scenarios.
Offer recommendations for selecting an appropriate sampling strategy based on specific
campaign characteristics and data distributions.
Algorithm
Stratified Sampling:
Algorithm: Stratified sampling based on target classes.
Description: Divides the dataset into strata (subgroups) based on target classes and
samples proportionally from each stratum, ensuring representation across classes.
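A minimal sketch of a class-stratified split with scikit-learn, using synthetic labels that mimic the click/non-click skew:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)      # dummy feature column
y = np.array([0] * 950 + [1] * 50)      # 5% positives, mimicking CTR skew

# stratify=y samples proportionally from each class (stratum).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Both splits keep (approximately) the 5% positive rate of the full data.
```

Without `stratify`, a random 20% split of such skewed data can easily end up with too few, or even zero, positive samples in the test set.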
ADASYN (Adaptive Synthetic Sampling):
Algorithm: ADASYN algorithm.
Description: An adaptive extension of SMOTE that focuses on generating synthetic
instances for regions of the minority class that are harder to learn.
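ADASYN's distinguishing step is a density estimate that directs more synthetic samples toward hard, boundary regions. The sketch below implements only that weighting step (not the full algorithm); the helper name is illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, k=5):
    """ADASYN's density step: weight each minority (y == 1) sample by the
    share of majority samples among its k nearest neighbors, so harder,
    boundary-region points receive more synthetic samples."""
    X_min = X[y == 1]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the point itself
    ratios = (y[idx[:, 1:]] == 0).mean(axis=1)
    total = ratios.sum()
    if total == 0:                         # no boundary points at all
        return np.full(len(X_min), 1.0 / len(X_min))
    return ratios / total                  # normalized sampling weights
```

A minority point deep inside its own class gets weight near zero, while one surrounded by majority samples dominates the weight vector, which is exactly how ADASYN differs from plain SMOTE.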
Cluster-Based Sampling:
Algorithm: K-means clustering for cluster-based sampling.
Description: Groups instances into clusters and samples proportionally from each
cluster, facilitating localized sampling.
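A minimal sketch of cluster-based undersampling of the majority class with K-means; the helper name and parameters are illustrative assumptions, not part of the base paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Cluster the majority class with K-means and draw proportionally
    from each cluster, so the reduced set preserves local structure."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Proportional allocation, keeping at least one sample per cluster.
        take = max(1, round(n_keep * len(members) / len(X_maj)))
        kept.extend(rng.choice(members, size=min(take, len(members)),
                               replace=False))
    return X_maj[np.array(kept)]
```

Compared with purely random undersampling, this keeps representatives from every region of the majority class rather than risking the loss of an entire sub-population.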
Advantages
Addressing Class Imbalance:
Advantage: Class imbalance is a common issue in CTR prediction, where the number
of non-click instances often significantly outweighs click instances. Sampling techniques,
such as oversampling or undersampling, help balance class distribution, ensuring that the
model is not biased towards the majority class.
Enhanced Generalization to Unseen Data:
Advantage: Effective sampling techniques contribute to the generalization of
predictive models to new, unseen data. This is crucial in native advertising where the goal
is to predict CTR for new ad campaigns.
Mitigation of Bias and Unfairness:
Advantage: Carefully chosen sampling techniques can help mitigate biases in predictive
models, ensuring fair representation and treatment of different classes. This is essential for
ethical considerations in advertising.
Increased Interpretability:
Advantage: Certain sampling techniques, like undersampling or clustering-based
sampling, may lead to a reduction in the number of instances, making it easier to interpret
model decisions. Enhanced interpretability is valuable for stakeholders who need to
understand the factors influencing CTR predictions.
Hardware Specification
Processor : Intel Core i3
Ram : 4 GB
Hard disk : 500 GB
Software Specification
Operating System : Windows 10 /11
Front End : Python
Back End : MySQL Server
IDE Tools : PyCharm