Getting Cozy With Raw Data
(A Cautionary Tale)
Data Scientist, Tapad
The Ad Tech Space
The goal of Ad Tech is to show advertisements to consumers
on the internet and to ensure that the right ad gets shown to
the right person.
Publishers who have ad space on their pages
"Sell" side platforms which aggregate publishers and
facilitate selling of ad space
"Buy" side platforms (like Tapad) which bid on that space
to show current ad campaigns
Advertisers who entrust demand side platforms to place
their content appropriately
5.7 internet connected devices per household
Digital Natives switch screens 27 times every non-working hour
Purchasing Across Devices:
40% of shoppers consult 3 or more channels before purchase
Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012
Tapad Connects Consumers' Devices
To address these issues, Tapad built The Device Graph.
The Device Graph seeks to connect devices within a household for
targeting across multiple screens.
Our edges are inferred based on a variety of techniques including
co-location, partnerships with other companies,
and obfuscated login data (where no personally identifiable data is
Over 2 billion nodes (devices) in The Device Graph.
Representing about 100 million households and
approximately 250 million individuals.
75% of connected devices are connected to 3 or more
38% of devices are computers; 36% are smartphones
and tablets.
The (Original, Household) Device Graph
No scores between edges.
No way to separate individuals.
What We Wanted
Edge thickness indicates confidence of link between devices
Colors indicate community detection based device clustering
The (household) Device Graph naturally restricts our search space
We never seek to identify individuals - only to group devices used
by the same individual
Graph can be traversed at varying thresholds (scale vs accuracy)
We needed a way to put weights on edges.
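A minimal sketch of what traversing at varying thresholds could look like once edges carry weights. It keeps only edges at or above a threshold and groups the devices that remain linked. Plain connected components (union-find) stand in here for the community detection mentioned above; the device names, scores, and the device_groups helper are all made up for illustration.

```python
from collections import defaultdict

def device_groups(weighted_edges, threshold):
    """Group devices linked by edges whose weight is at least `threshold`."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, weight in weighted_edges:
        root_a, root_b = find(a), find(b)  # also registers both devices
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical household: made-up devices and confidence scores.
edges = [
    ("phone_1", "laptop_1", 0.9),
    ("laptop_1", "tablet_1", 0.6),
    ("tablet_1", "phone_2", 0.2),
]
loose = device_groups(edges, threshold=0.5)   # more scale, less accuracy
strict = device_groups(edges, threshold=0.8)  # less scale, more accuracy
print(loose)
print(strict)
```

Raising the threshold splits the household into smaller, higher-confidence groups; lowering it merges more devices at the cost of accuracy.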
First Attempt: Use Segment data
Provided by first or third-parties
Tries to put devices into inferred buckets, e.g.:
Comic book enthusiast
Pros/Cons of Segment Data
Relatively extensive coverage
Simple to read/human intelligible
Don't know how the segments are determined (black box)
Different providers may not have the same methods
The longer a device has been in our graph the more audiences
it will accumulate (snowballs)
Plan of Attack
1. Used the segments as features to create feature vectors.
2. Compared several methods:
Simple dot product (baseline)
Probabilistic approaches that use segment
Machine learning approaches that use truth data and
existing graph structure as proxy data.
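For illustration, the dot product baseline might look like the sketch below. It assumes binary feature vectors (1 if a device carries a segment, 0 otherwise); the deck does not say how the vectors were actually encoded, and the segment names other than "comic book enthusiast" are invented.

```python
def segment_vector(device_segments, vocabulary):
    """Binary feature vector: 1 if the device carries the segment, else 0."""
    return [1 if segment in device_segments else 0 for segment in vocabulary]

def dot_score(u, v):
    """Unnormalized dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical segment vocabulary and devices.
vocabulary = ["comic_book_enthusiast", "auto_intender", "sports_fan"]
phone = segment_vector({"comic_book_enthusiast", "sports_fan"}, vocabulary)
laptop = segment_vector(
    {"comic_book_enthusiast", "sports_fan", "auto_intender"}, vocabulary
)
tv = segment_vector({"auto_intender"}, vocabulary)

print(dot_score(phone, laptop))  # shared segments between phone and laptop
print(dot_score(phone, tv))      # no shared segments
```

Note that an unnormalized dot product rewards devices that simply carry many segments, which interacts badly with the snowballing effect described earlier.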
What Do We Mean By Proxy Data?
Two nodes connected in The (household) Device
Graph are more likely to be similar to
each other than two unconnected nodes.
To compare methods we compute the Win Rate.
1. Select pair of devices connected in graph; compute score
between them (true_score).
2. Select random device unconnected to original devices;
compute a score with one of original devices (false_score).
if true_score > false_score:
    win_value = 1.0
elif true_score < false_score:
    win_value = 0.0
else:
    win_value = 0.5
A random algorithm should achieve an
average win_value of about 0.5.
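A runnable sketch of the Win Rate procedure, under stated assumptions: the average_win_rate helper, its arguments, and the toy devices are hypothetical, and "unconnected" is read as "not a neighbor of either device in the pair".

```python
import random

def win_value(true_score, false_score):
    if true_score > false_score:
        return 1.0
    if true_score < false_score:
        return 0.0
    return 0.5  # tie

def average_win_rate(score, connected_pairs, all_devices, neighbors, rng):
    """Average win_value, pitting each connected pair against one random
    device that is connected to neither member of the pair."""
    total = 0.0
    for a, b in connected_pairs:
        excluded = neighbors[a] | neighbors[b] | {a, b}
        candidates = [d for d in all_devices if d not in excluded]
        c = rng.choice(candidates)
        total += win_value(score(a, b), score(a, c))
    return total / len(connected_pairs)

# Toy data: two made-up households with disjoint features.
features = {"d1": {"x"}, "d2": {"x"}, "d3": {"y"}, "d4": {"y"}}
neighbors = {"d1": {"d2"}, "d2": {"d1"}, "d3": {"d4"}, "d4": {"d3"}}
pairs = [("d1", "d2"), ("d3", "d4")]
devices = sorted(features)

def overlap(a, b):
    return len(features[a] & features[b])

def constant(a, b):
    return 0.0  # uninformative scorer: every comparison is a tie

rng = random.Random(0)
informative_rate = average_win_rate(overlap, pairs, devices, neighbors, rng)
uninformative_rate = average_win_rate(constant, pairs, devices, neighbors, rng)
print(informative_rate, uninformative_rate)
```

On this toy graph the overlap scorer wins every comparison, while the constant scorer ties every time and lands exactly on the 0.5 line.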
We expect an optimal algorithm to achieve an
average win_value of about 0.75 -- 50% better than random.
Why? Because census data suggests around 2 adults per
household. Therefore, we expect about half of our
household edges to be highly correlated (similar) while the
remainder should be statistically uncorrelated (dissimilar).
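The 0.75 figure can be worked out as follows, assuming (as an idealization) that a same-person edge always beats the random pair and that a different-person edge is indistinguishable from random:

```latex
\[
\mathbb{E}[\text{win\_value}]
  = \underbrace{\tfrac{1}{2} \cdot 1}_{\text{same-person edges}}
  + \underbrace{\tfrac{1}{2} \cdot \tfrac{1}{2}}_{\text{different-person edges}}
  = \tfrac{3}{4},
\qquad
\frac{0.75}{0.5} = 1.5 .
\]
```

A win rate of 0.75 is therefore 1.5 times the random baseline of 0.5.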
Well, how do segment data perform?
In a word: poorly.
Our attempts landed just above the random line, at an
average win_value of about 0.55.
At most, 10% better than random!
So what happened?
Segment data are riddled with:
randomness & noise
An example of randomness & noise: 1 out of 4 devices which
"self identified as mom" are also tagged as "male".
(Either we're really really progressive, or something has gone horribly wrong.)
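A quick sanity check of this kind can be sketched as follows. The contradiction_rate helper and the device data are invented; the toy numbers are chosen only to mirror the 1-in-4 figure above.

```python
def contradiction_rate(device_segments, segment, conflicting_segment):
    """Fraction of devices tagged `segment` that also carry `conflicting_segment`."""
    tagged = [segs for segs in device_segments.values() if segment in segs]
    if not tagged:
        return 0.0
    return sum(conflicting_segment in segs for segs in tagged) / len(tagged)

# Made-up devices and segment labels.
device_segments = {
    "d1": {"mom", "female"},
    "d2": {"mom"},
    "d3": {"mom", "male"},
    "d4": {"mom", "female"},
    "d5": {"male"},
}
rate = contradiction_rate(device_segments, "mom", "male")
print(rate)
```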
So Much Bias!
Platform Bias: Certain segments are platform specific. (For
example: "used a specific mail client on Android")
Source Bias: We don't always have overlap between different
first and third parties we work with and the overlap is not
Temporal Bias: Long-lived devices tend to accumulate segments
Audience Value Bias: Certain segments are worth more to
advertisers so they appear more often than expected. (Example:
people intending to purchase automobiles.)
Account for these biases explicitly and try to correct them.
Test different algorithms.
Abandon the effort and look elsewhere for different data.
We opted for the last one.
In the end, we opted to use our in-house browsing data.
Browsing data are data we obtain when examining available ad
space. Each piece of data gives us an obfuscated ID and the url on
which the device is browsing.
Initially avoided due to sparsity:
While we saw about 20 pieces of audience data on average on a
device, we were in some cases limited to a single unique url per
device because this data is harder to come by than black box
Plan of Attack
(Preprocessing: remove the fraudulent urls associated with botnets.)
Just as before, create a feature vector but now the features are the
legitimate unique domains (tapad.com, mlconf.com, etc...).
Compare several methods:
The feature vector dot product (baseline)
Matrix-based approaches which use probabilistic correlations
based on url co-occurrence on nodes
Clustering-based approaches which reduce dimensionality by
first clustering highly correlated urls
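To make the idea concrete, here is one possible co-occurrence score, sketched under loud assumptions: it is not Tapad's method, the Jaccard-style weighting is a stand-in for the unspecified "probabilistic correlations", hostnames stand in for domains, and the browsing data are invented.

```python
from collections import Counter
from itertools import combinations
from urllib.parse import urlsplit

def domains(urls):
    """Unique hostnames seen on a device (URLs must include a scheme)."""
    hosts = (urlsplit(url).hostname for url in urls)
    return {host for host in hosts if host}

def cooccurrence(device_domains):
    """Count the devices each domain, and each pair of domains, appears on."""
    singles, pairs = Counter(), Counter()
    for doms in device_domains.values():
        singles.update(doms)
        pairs.update(frozenset(p) for p in combinations(sorted(doms), 2))
    return singles, pairs

def related_score(a, b, singles, pairs):
    """Sum of Jaccard-style domain similarities across two devices' domains."""
    total = 0.0
    for x in a:
        for y in b:
            if x == y:
                total += 1.0
                continue
            both = pairs[frozenset((x, y))]
            either = singles[x] + singles[y] - both
            total += both / either if either else 0.0
    return total

# Invented browsing data keyed by obfuscated device ID.
browsing = {
    "d1": ["https://tapad.com/about", "https://mlconf.com/"],
    "d2": ["https://tapad.com/", "https://mlconf.com/events"],
    "d3": ["https://tapad.com/careers"],
    "d4": ["https://mlconf.com/speakers"],
    "d5": ["https://example.org/"],
}
device_domains = {device: domains(urls) for device, urls in browsing.items()}
singles, pairs = cooccurrence(device_domains)
score_d3_d4 = related_score(device_domains["d3"], device_domains["d4"], singles, pairs)
score_d3_d5 = related_score(device_domains["d3"], device_domains["d5"], singles, pairs)
print(score_d3_d4, score_d3_d5)
```

Devices d3 and d4 share no domain, so a plain dot product scores them 0. The co-occurrence score still links them, because their domains often appear together on other devices. That is one way such approaches can cope with sparsity.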
Simple dot product (baseline) already performs about 18%
better than random.
Both the matrix-based and clustering-based approaches
perform up to 40% better than random.
This is in the range of how we expect an optimal algorithm
to perform - despite data sparsity!
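To put these percentages on the same scale as the earlier win_value numbers, the sketch below assumes "X% better than random" means X% above the 0.5 baseline. This matches the earlier pairings of 10% with 0.55 and 50% with 0.75. The helper name is hypothetical.

```python
RANDOM_WIN_RATE = 0.5

def win_rate_from_lift(percent_better_than_random):
    """Convert 'X% better than random' into an average win_value."""
    return RANDOM_WIN_RATE * (1 + percent_better_than_random / 100)

for label, lift in [("segments", 10), ("dot product", 18),
                    ("matrix/clustering", 40), ("expected optimum", 50)]:
    print(f"{label}: {win_rate_from_lift(lift):.2f}")
```

Under that reading, the browsing-data baseline sits near 0.59 and the best approaches reach up to about 0.70, against an expected optimum of 0.75.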
Don't assume that data are the right data to use just because
they are plentiful and neatly tied up with a bow.
Question your data, not only your algorithms.
The best pieces of data may be scarce and raw because they
are often less fraught with hidden biases and unnecessary
Learn more about Tapad
Read our blog:
Follow us on Twitter:
Follow us on Instagram:
@tapadinc (includes a picture of yours truly in a headstand.)