Prognosis – An Approach to Predictive Analytics
Abstract
Prediction is a statement made about the future, an anticipatory
vision or perception. This White Paper discusses the emergence
of technology that enables precise predictions in varied fields,
and the application of exploratory and normative methods to
augment decision making.
Forecasting is primarily based on mining historical data sets,
extracting hidden patterns and transforming them into valuable
information through a process of classification, clustering,
regression and association rule learning.
The white paper describes Impetus’ implementation of Behavioral Targeting for the ad world. This is a widely accepted statistical machine learning technique that helps select the most relevant ads to display to a web user based on their historical data.
Impetus Technologies Inc.
www.impetus.com
WHITE PAPER
Table of Contents
Introduction
    Large scale data analytics
    Algorithms for forecasting and prediction
Behavioral Targeting
    Advantages and threats
    Industry impact
    Generic approach to BT problem solving
Large scale implementation of BT
    Linear Poisson Regression
    Implementing BT using Linear Poisson Regression
        1. Data Preparation
        2. Model Training
        3. Model Evaluation
Summary
Introduction
A prediction is a statement about the way things will happen in the future, often
but not always based on experience or knowledge. Prediction is necessary to
allow plans to be made about possible developments. Large corporations invest
heavily in this kind of activity to help focus attention on possible events, risks
and business opportunities. Such work brings together all available past and
current data, as a basis to develop reasonable expectations about the future.
The basic idea behind any such algorithm is to gather massive volumes of behavioral data describing the historical series of events, actions and behavior of the entity in question. This data is fed into machines and run through complex machine learning algorithms to derive models. The models serve as the basis for predictions: given input criteria, the models infer the expected behavior of the entity.
The application of prediction algorithms has gained prominence in a wide range
of fields such as finance (stock market predictions), insurance (predicting life
expectancy), science (weather forecasting, predicting natural disasters), medical
science (treating developmental disabilities), marketing (behavioral targeting)
and many more.
Prediction problems typically involve huge amounts of historical data, time is of the essence, and there is always current activity that shapes the future. In many cases, freshness of data is a key factor and plays a major role in forecasting the future course of action. In other instances, the entire data set has equal relevance and contributes to determining the future.
Large scale data analytics
Projects related to future predictions and forecasting point to a huge increase in the amount of data that must not only be stored but also processed quickly and efficiently. These challenges are at once daunting and an exciting opportunity to use data to create a positive impact.
Often, there is an immediate need to analyze the data at hand, to discover
patterns, reveal threats, monitor critical systems, and make decisions about the
direction the organization should take. Several constraints are always present:
the need to implement new analytics quickly enough to capitalize on new data
sources, limits on the scope of development efforts, and the pressure to expand
mission capability without an increase in budgets. For many of these
applications, the large data processing stack (which includes the simplified
programming model Map-Reduce, distributed file systems, semi-structured
stores, and integration components, all running on commodity class hardware)
has opened up a new avenue for scaling out efforts and enabling analytics that
were impossible in previous architectures. This new ecosystem has been found
to be remarkably versatile at handling various types of data and classes of
analytics.
Perhaps the most exciting benefit of moving to these highly scalable architectures, however, is that after the immediate issues have been solved, often with a system that can handle today’s requirements and scale up to 10x or more, new analytics and capabilities can be developed, evaluated and
integrated easily. This is owing to the speed and ease of Map-Reduce, Pig, Hive,
and other technologies. More than ever, the large-scale data analysis software
stack is proving to be a platform for innovation.
Algorithms for forecasting and prediction
There are several classes of statistical algorithms that are well suited for these
kinds of problems, which are associated with trend analysis, pattern generation
and artificial intelligence based predictions. Some of the most common ones
are:
- Conjoint analysis: expert opinion and Delphi surveys
- Quantitative: statistical techniques suited to predicting trends, e.g. linear Poisson regression and exponential smoothing
- Qualitative: subjective techniques providing a range of possible outcomes, e.g. the Bayesian approach
- Statistical combination: a mix of quantitative and qualitative techniques, e.g. quasi-Bayes
Behavioral Targeting
Behavioral targeting (BT) leverages historical user behavior to select the most
relevant ads to display. The state-of-the-art of BT derives a Linear Poisson
Regression model from fine-grained user behavioral data and predicts click-
through rate (CTR) from user history.
Behavioral targeting is an application of modern statistical machine learning
methods to online advertising. But unlike other computational advertising
techniques, BT does not primarily rely on contextual information such as query
(‘sponsored search’) and web page (‘content match’). Instead, BT learns from
past user behavior, especially the implicit feedback (i.e., ad clicks) to match the
best ads to users.
5. Prognosis – An Approach to Predictive Analytics
5
This gives BT broader applicability, for example to graphical display ads, or at least makes it a valuable user dimension complementary to other contextual advertising techniques. In today's practice, behaviorally targeted advertising inventory comes in the form of a demand-driven taxonomy; hierarchical examples are Finance > Investment and Technology > Consumer Electronics > Cellular Telephones. Within a category of interest, a BT model derives
relevance score for each user from past activity. Should the user appear online
during a targeting time window, the ad serving system will qualify this user (to
be shown an ad in this category) if the score is above a certain threshold. One
de facto measure of relevance is CTR, and the threshold is predetermined in
such a way that both a desired level of relevance (measured by the cumulative
CTR of a collection of targeted users) and the volume of targeted ad impressions
(also called reach) can be achieved.
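To make the qualification rule concrete (the notation below is assumed for illustration, not taken from the original): with

$$\mathrm{CTR} = \frac{\mathrm{clicks}}{\mathrm{impressions}},$$

a user $u$ is qualified for category $c$ exactly when $\mathrm{score}_c(u) \ge \tau_c$, where the threshold $\tau_c$ is tuned offline so that both the cumulative CTR of the targeted users and the reach (volume of targeted impressions) meet their goals.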
The impact of behavioral targeting can be negative if consumers feel annoyed or
threatened by the use of their ‘personal’ data. However, as demonstrated by
Amazon, when personal information and technology enhance the online
experience, there is less risk of a negative response.
Advantages and threats
There are a lot of advantages attributed to ad targeting and behavioral analysis,
but at the same time it is also important to look at the downsides and surface
the threats posed by them. Some of the advantages that can be seen right away
are:
- Reaching the right audience at the right time (of the day, week or life stage), with clear behavioral assumptions
- Standing out in a cluttered category
- Reaching target audiences when ‘context’ inventory is sold out (reaching the same target in alternative content)
- Overcoming the high cost of entry in desired content (reaching the same target in alternative content at lower cost)
- Tailoring the message to behavioral patterns to make it more relevant
As mentioned earlier, there are some downsides to BT:
- Achieving high reach is difficult. Within extremely targeted segments, the potential universe available may be very limited, and there may be a limit to the sites currently allowing behavioral targeting.
- Inconsistencies within segment classifications. The definition of a ‘common’ behavioral segment may differ by publisher (e.g., a job seeker searching Monster.com is not the same job seeker as one reading a job-related article on iVillage). Also, as the technology is cookie enabled, it suffers the usual issues of cookie stability and data accuracy.
- The ultimate issue of behavioral targeting clutter. Other advertisers within the same vertical will compete in the same space/segments. This is not yet a pressing issue, but in time the cost, clutter and inventory availability that are positives today will become challenges (as seen in paid search). In the future, as targeting matures and advertisers have measurable results, historical data will be a key indicator of which assumptions work, providing optimization insights. Collecting and analyzing response data generated from different segments are important prerequisites for success.
Industry impact
Behavioral targeting, as a concept, has wide acceptance in the industry.
Indicated below are some use-cases where it is being successfully implemented
as a tool for predicting user behavior:
- Ad targeting and predicting the buying behavior of users
- Relationship building
- Audience targeting
- Presidential candidates using BT for persuasion targeting
- Treatment of mental disorders and developmental disabilities
There is a vast horizon where BT, or BT-based solutions, are being used to successfully predict and forecast behavior in order to increase reach, accessibility, and revenue.
Generic approach to BT problem solving
- Data mining involves extracting hidden patterns from data to transform it into valuable information, using computing power to apply knowledge discovery methodologies.
- It applies knowledge discovery and prediction through a process of classification, clustering, regression and association rule learning.
- The value of the information depends on the collection of indicative and representative data.
- Cookies for behavioral advertising usually contain text that uniquely identifies the browser, so that advertisers or ad networks can recognize the same Internet user across different Web sites or multiple areas of the same site.
Large Scale Implementation of BT
Linear Poisson Regression
This is a statistical method for modeling count data: it relates the expected rate of occurrence of an event in disjoint timeframes to a set of input variables, and is suited to analyzing outcomes that take non-negative values.
Linear Poisson regression works well where the input data is sparse, i.e. its results remain valid for rare events. It can model rare events both when every subject is followed for the same length of time and when subjects have follow-ups of different lengths.
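A minimal sketch of the model in standard notation (this formulation follows textbook linear Poisson regression; the symbols are assumptions, as the original does not write out the formula): each target count $y_k$ (e.g. clicks in category $k$) is treated as a Poisson random variable whose mean is linear in the feature counts $x_j$:

$$y_k \sim \mathrm{Poisson}(\lambda_k), \qquad \lambda_k = \sum_j w_{kj}\, x_j,$$

where the non-negative weights $w_{kj}$ are learned from the training data.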
Implementing BT using Linear Poisson Regression
Behavioral targeting can be effectively implemented using the Linear Poisson Regression algorithm, as it maps well to the nature of the input data and the kind of predictions that organizations are looking for.
[Flow chart: the algorithm decomposed into the sequential steps described below]
Impetus Technologies implemented behavioral targeting using the Linear Poisson Regression algorithm, deployed on the Hadoop ecosystem. The entire algorithm was decomposed into individual steps; each step was implemented as a Hadoop M/R job, and the jobs were run sequentially using the Oozie workflow engine. The results of the implementation were models for different categories. These models were stored in the HBase data store and later consumed for analytics and behavioral predictions. The steps involved in this implementation are explained below:
1. Data Preparation
In this preprocessing step, the data fields of interest were extracted from raw
data feeds, thus reducing the size of the data.
9. Prognosis – An Approach to Predictive Analytics
9
Raw data was related to user behavior with respect to one or more ads. It included events such as ad clicks, ad views, page views, searches, organic clicks and Overture clicks.
1. The raw data came from the user base
2. The system stored the raw data in HDFS
3. The raw data was sent to the data preparation module which
undertook the following:
a. Aggregated event counts over a configurable period of time, to
further shrink the data size
b. Merged counts into a single entry with <cookie, time-period> as the unique key
c. Comprised two M/R jobs: Feature-Extractor and Feature-Generator
1.1 Feature-Extractor
Input - raw data feeds
Output - <cookie:time-period:feature-Type:feature-Name, feature-Count>
1.2 Feature-Generator
Input - <cookie:time-period:feature-Type:feature-Name, feature-Count>
Output - <cookie:time-period, feature-Type:feature-Name:feature-Count ...>
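A minimal sketch of what the Feature-Extractor mapper could look like on Hadoop (the raw feed layout, field order and class names here are assumptions, not from the original; a summing reducer, not shown, would aggregate the emitted ones into feature-Count):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: parses one raw event line and emits
    // <cookie:time-period:feature-Type:feature-Name, 1>.
    public class FeatureExtractorMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Assumed feed layout: cookie \t epochSeconds \t eventType \t eventName
            String[] f = value.toString().split("\t");
            if (f.length < 4) return;                     // skip malformed records
            long period = Long.parseLong(f[1]) / 86400L;  // daily resolution (configurable)
            ctx.write(new Text(f[0] + ":" + period + ":" + f[2] + ":" + f[3]), ONE);
        }
    }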
2. Model Training
This fitted the Linear Poisson Regression Model from the preprocessed data and
involved the following:
1. Feature selection
2. Generation of training examples
3. Model weights initialization
4. Multiplicative recurrence to converge model weights
2.1 Poisson-Entity-Dictionary
It mainly performed feature selection and inverted indexing. It did this by counting each entity’s frequency in terms of the number of cookies touching it, and selecting the most frequent entities in the given feature space.
Output - HashMap of <entityType:featureName, featureIndex> (inverted index) for all entity types
An entity referred to the name (unique identifier) of an event (e.g. an ad id, a space id for a page, or a query). An entity differed from a feature in that the latter was uniquely identified by the <featureType, featureName> pair.
In the context of BT, there were three types of entities: ad, page and search.
The Poisson entity dictionary comprised three M/R jobs: PoissonEntityUnit, PoissonEntitySum, and PoissonEntityHash.
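Condensed to a single process, the selection logic at the heart of these jobs might look like this (a sketch under assumptions; in the real implementation the counting, summing and hashing are distributed across the three M/R jobs, and the names below are hypothetical):

    import java.util.*;

    // Hypothetical sketch: rank entities by how many distinct cookies touched
    // them, keep the top n, and assign each a dense feature index
    // (the inverted index).
    public class EntityDictionarySketch {
        public static Map<String, Integer> build(
                Map<String, Set<String>> cookiesByEntity, int n) {
            List<Map.Entry<String, Set<String>>> entries =
                    new ArrayList<>(cookiesByEntity.entrySet());
            // Most frequently touched entities first
            entries.sort((a, b) ->
                    Integer.compare(b.getValue().size(), a.getValue().size()));
            Map<String, Integer> invertedIndex = new HashMap<>();
            for (int i = 0; i < Math.min(n, entries.size()); i++) {
                // entityType:featureName -> featureIndex
                invertedIndex.put(entries.get(i).getKey(), i);
            }
            return invertedIndex;
        }
    }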
2.2 Poisson-Feature-Vector
This generated training examples (feature vectors) that were directly
used later by model initialization and multiplicative recurrence.
It used a sparse data structure for feature vectors, since behavioral count data is very sparse by nature: for a given user in a given time period, activity involves only a limited number of events. Impetus used a pair of arrays of the same length to represent a feature vector or a target vector, an integer array for features and a float array for values (float to allow for decaying), with each array index giving a <feature, value> pair.
Feature selection and inverted indexing: with the feature space selected from PoissonEntityDictionary, Impetus discarded the unselected events from the training data on the feature (input variable) side. On the target (response variable) side, Impetus took the option of using either all features or only selected features to categorize them into target event counts.
With the inverted index built from PoissonEntityDictionary, from the PoissonFeatureVector step onwards, Impetus referenced an original feature name by its index. The same idea was also applied to cookies, since the cookie field itself was irrelevant.
Several pre-computations were performed at this stage:
1. Impetus further aggregated feature counts into a time window whose size was larger than or equal to the resolution from data preparation.
2. It decayed counts over time using a configurable factor.
3. It realized a causal approach to generating examples. (The causal approach collects features before targets temporally, while the non-causal approach generates targets and features from the same period of history.)
4. Impetus used binary representation (serialized objects in Java) and data compression (SequenceFiles with BLOCK compression in the Hadoop framework) for feature vectors.
Data structure for the feature vector:
int[targetLength]   targetIndexArray
float[targetLength] targetValueArray
int[inputLength]    inputIndexArray
float[inputLength]  inputValueArray
Input - <cookie:time-period, featureType:featureName:featureCount ...>
Output - <cookieIndex, featureVector>
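In Java, the parallel-array layout above might be declared as follows (a sketch; the class and field names are assumptions):

    // Hypothetical sparse vector with parallel arrays: index arrays hold
    // feature indices from the inverted index, value arrays hold (possibly
    // decayed) counts, hence the float type.
    public class FeatureVector {
        int[]   targetIndex;   // indices of target events
        float[] targetValue;   // corresponding target counts
        int[]   inputIndex;    // indices of input features
        float[] inputValue;    // corresponding feature counts
    }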
Target counts were collected from a sliding time window, and feature counts were aggregated (possibly with decay) from a time period preceding the target window. The size of the sliding window was kept relatively small, because a large window effectively discards many <feature, target> co-occurrences within that window. For example, the following setup yielded superior long-term models:
a. A target window of size one day
b. Sliding over a one-week period
c. Preceded by a four-week feature window (also sliding along with the target window)
The algorithm included the following steps:
1. For each cookie Impetus cached all the event count data.
2. It sorted events by time, forming an event stream of this
particular cookie covering the entire time period of interest.
3. Impetus pre-computed the boundaries of the sliding window. Four boundaries were specified: featureBegin, featureEnd, targetBegin, and targetEnd. Separating featureEnd and targetBegin allowed a gap window in between, which was necessary to emulate possible latency in online prediction.
12. Prognosis – An Approach to Predictive Analytics
12
4. The company maintained three iterators on the event stream, referencing the previous featureBegin, the current featureBegin, and targetBegin. It used a pair of TreeMap objects (inputMap and targetMap) to hold the features and targets of a feature vector as the data was being processed.
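A compressed sketch of this windowing logic (the event encoding, parameter names and method shape are assumptions; the actual implementation runs over a cookie's event stream inside an M/R task):

    import java.util.*;

    // Hypothetical sketch: slide a feature window and a target window
    // (separated by a gap) over one cookie's time-sorted events, accumulating
    // counts into TreeMaps keyed by feature index.
    class SlidingWindowSketch {
        // events: {time, featureIndex, count} triples, sorted by time
        static void generateExamples(List<long[]> events, long featureSpan,
                long gap, long targetSpan, long step, long begin, long end) {
            for (long targetBegin = begin; targetBegin + targetSpan <= end;
                    targetBegin += step) {
                long featureEnd = targetBegin - gap; // gap emulates online-prediction latency
                long featureBegin = featureEnd - featureSpan;
                TreeMap<Integer, Float> inputMap = new TreeMap<>();
                TreeMap<Integer, Float> targetMap = new TreeMap<>();
                for (long[] e : events) {
                    if (e[0] >= featureBegin && e[0] < featureEnd)
                        inputMap.merge((int) e[1], (float) e[2], Float::sum);
                    else if (e[0] >= targetBegin && e[0] < targetBegin + targetSpan)
                        targetMap.merge((int) e[1], (float) e[2], Float::sum);
                }
                // ... convert inputMap/targetMap into a parallel-array
                // feature vector and emit it
            }
        }
    }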
2.3 Poisson-Initializer
It initialized the model weights (the coefficients of the regressors) by scanning the training data once. In the notation used here:
k: index of target variables
j: index of features or input variables
i: index of examples
A unigram (j) is one occurrence of feature j; a bigram (k, j) is one co-occurrence of target k and feature j.
The basic idea of this bigram-based initialization was to allocate each weight w(k, j) as a normalized count of the co-occurrences of (k, j). The output of PoissonInitializer was an initialized weight matrix of dimensionality (number of targets) x (number of features).
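One plausible form of such a normalization (an assumption; the original does not spell the formula out) is

$$w_{kj}^{(0)} = \frac{\mathrm{bigram}(k, j)}{\mathrm{unigram}(j)},$$

i.e. each initial weight is the number of co-occurrences of target $k$ with feature $j$, normalized by the total number of occurrences of feature $j$.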
1. Impetus distributed the computation of counting the bigrams by a composite key <k, j> and effectively pre-computed the total bigram counts of all examples before the final stage.
2. The M/R framework provides a single-key data structure. In order to distribute <k, j>, Impetus needed an efficient function to transform a composite key (two integers) into a single key and recover the composite key when needed (see the sketch after this list):
bigramKey(k, j) = a long integer obtained by shifting k left by 32 bits and then taking the bitwise OR with j
3. The Impetus team cached the output of the first mapper, which emitted <bigramKey, bigramCount>.
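A small helper matching the composite-key scheme just described (the class and method names are assumptions):

    // Pack two non-negative ints (k, j) into one long key, and recover them.
    final class BigramKey {
        static long pack(int k, int j) {
            return ((long) k << 32) | (j & 0xFFFFFFFFL); // mask keeps j's 32 bits unsigned
        }
        static int targetOf(long key)  { return (int) (key >>> 32); } // recover k
        static int featureOf(long key) { return (int) key; }          // recover j
    }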
2.4 Poisson-Multiplicative
It updated the model weights by scanning the training data iteratively, using a highly effective multiplicative recurrence. Computing the normalizer (the Poisson mean) involved taking the dot product of the previous weight vector with the input portion of a feature vector.
Input - <cookieIndex, featureVector>
Output - updated weight vector w(k) for all targets k
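For reference, the standard multiplicative recurrence for a linear Poisson model with non-negative weights takes the following form (quoted from the general literature on multiplicative updates, rather than from the original):

$$w_{kj} \leftarrow w_{kj} \cdot \frac{\sum_i \left( y_{ik} / \lambda_{ik} \right) x_{ij}}{\sum_i x_{ij}}, \qquad \lambda_{ik} = \sum_j w_{kj}\, x_{ij},$$

where $\lambda_{ik}$ is the Poisson mean for example $i$ and target $k$ (the normalizer computed via the dot product mentioned above). The update keeps the weights non-negative and converges them over repeated scans of the training data.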
1. Impetus represented the model weight matrix as K dense
weight vectors (arrays) of length J, where K was the number of
targets and J the number of features.
2. Using weight vectors was more scalable in terms of memory footprint than a full matrix representation, but it raised challenges in disk I/O. Impetus addressed this problem via in-memory caching. Caching the weight vectors was not the solution; the trick was to cache the input examples. After caching, Impetus maintained a hashmap that recorded all relevant targets for the cached feature vectors and provided constant-time lookup from target index to array index: Map<targetIndex, arrayIndex>.
3. Impetus also used Hadoop's distributed cache, which copied the requested files from HDFS to the slave nodes before the task was executed. The files were copied only once per job for each task tracker and were shared by its M/R tasks.
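As an illustration, wiring a file into the distributed cache with the newer MapReduce API looks roughly like this (the HDFS path and job name are assumptions):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "poisson-multiplicative");
            // Hypothetical HDFS path; '#weights' creates a local symlink
            // in each task's working directory.
            job.addCacheFile(new URI("/models/bt/weights.seq#weights"));
            // ... configure mapper, reducer, input and output paths, then
            // job.waitForCompletion(true);
        }
    }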
3. Model Evaluation
It tested the trained model on a test data set. The main tasks were:
1. Predicting expected target counts (clicks and views)
2. Scoring (CTR)
3. Ranking scores of a test set
4. Calculating and reporting performance metrics such as CTR lift and area
under ROC curve.
This component contained three sequential steps:
3.1 Poisson-Feature-Vector-Eval
It was identical to Poisson-Feature-Vector, with the following differences:
- There was no need to bookkeep the summary statistics used for training, such as the total counts of examples, feature unigrams and target unigrams.
- Decay was typically necessary in generating test data, since it enables efficient incremental prediction as new events flow in, while exponentially diminishing the obsolete long history.
- Sampling and heuristic-based robot filtering were not applied when generating test data.
- Impetus could remove examples without a target from the test dataset, since these records did not impact the performance no matter how the model predicted them. However, examples with targets were kept, even those without any inputs. This was because these records
(‘new users’) had to be scored by the model in production
and hence had a non-trivial impact on the performance.
- Impetus categorized target counts either from the entire feature space or from the selected space, depending on the learning goal.
- The size of the sliding window was configured to be approximately the same as the ad serving cycle in production, and the size of the gap window imitated the latency between the last seen events and the next ad serving in production.
3.2 Poisson-Predictor
Input - <cookieIndex, featureVector>
Output - <cookieIndex, predictedActualTarget[2 x numTargets]>
It took the dot product of a weight vector and a feature vector as the predicted target count (a continuous variable). To predict the expected number of ad clicks and views in all categories for an example i, the algorithm needed to read the weight vectors of all targets, converged from Poisson-Multiplicative.
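The core of the predictor is a sparse dot product; over the parallel-array representation, it might look like this in Java (a sketch with assumed names):

    final class PoissonPredictorSketch {
        // Dot product of the dense weight vector of one target with the sparse
        // input portion of a feature vector: the predicted count (Poisson mean).
        static float predict(float[] weights, int[] inputIndex, float[] inputValue) {
            float lambda = 0f;
            for (int p = 0; p < inputIndex.length; p++) {
                lambda += weights[inputIndex[p]] * inputValue[p];
            }
            return lambda; // expected clicks or views for this target
        }
    }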
3.3 Poisson-Evaluator
Input - <cookieIndex, predictedActualTarget[2 x numTargets]>
Output - performance metrics, with per-category and overall reports
It scored each test example by dividing its predicted clicks by its predicted views, applying Laplacian smoothing. It then sorted all examples by score, and finally computed and reported the performance metrics, which included:
- The number of winning categories over certain benchmarks
- Cumulative CTR
- CTR lift
- Area under the ROC curve
- Summary statistics
It generated both per-category and overall reports.
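A plausible form of the smoothed score (the original names Laplacian smoothing but not the constants, so $\alpha$ and $\beta$ are assumptions):

$$\mathrm{score}(i) = \frac{\widehat{\mathrm{clicks}}_i + \alpha}{\widehat{\mathrm{views}}_i + \beta},$$

where the small constants $\alpha$ and $\beta$ keep the score well behaved when the predicted views are close to zero.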