Getting Cozy With Raw Data
(A Cautionary Tale)
Yael Elmatad
Data Scientist, Tapad
@y_s_e
Transcript of "MLconf Yael Elmatad"

  1. Getting Cozy With Raw Data (A Cautionary Tale)
     Yael Elmatad, Data Scientist, Tapad (@y_s_e)
  2. The Ad Tech Space
     The goal of Ad Tech is to show advertisements to consumers on the internet and to ensure that the right ad gets shown to the right person. There are many components:
     - Publishers, who have ad space on their pages
     - "Sell"-side platforms, which aggregate publishers and facilitate selling of ad space
     - "Buy"-side platforms (like Tapad), which bid on that space to show current ad campaigns
     - Advertisers, who entrust demand-side platforms to place their content appropriately
  3. Why Cross-Device?
     - Device proliferation: 5.7 internet-connected devices per household
     - Screen switching: digital natives switch screens 27 times every non-working hour
     - Purchasing across devices: 40% of shoppers consult 3 or more channels before purchase
     Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012
  4. Tapad Connects Consumers' Devices
     To address these issues, Tapad built The Device Graph, which seeks to connect devices within a household for targeting across multiple screens. Our edges are inferred using a variety of techniques, including co-location, partnerships with other companies, and obfuscated login data (where no personally identifiable data is ever observed).
  5. Tapad Statistics
     - Over 2 billion nodes (devices) in The Device Graph, representing about 100 million households and approximately 250 million individuals.
     - 75% of connected devices are connected to 3 or more devices.
     - 38% of devices are computers; 36% are smartphones and tablets.
  6. The (Original, Household) Device Graph
     - No scores on edges.
     - No way to separate individuals.
     (Diagram: a household graph connecting an iPad, a computer, and a Kindle.)
  7. What We Wanted
     - Edge thickness indicates confidence of the link between devices.
     - Colors indicate community-detection-based device clustering.
     - The (household) Device Graph naturally restricts our search space.
     - We never seek to identify individuals, only to group devices used by the same individual.
     - The graph can be traversed at varying thresholds (scale vs. accuracy).
  8. Scoring Edges
     We needed a way to put weights on edges.
     First attempt: use segment data.
     - Provided by first or third parties.
     - Tries to put devices into inferred buckets, e.g. "dog lover", "comic book enthusiast", "male".
  9. Pros/Cons of Segment Data
     Pros:
     - Relatively extensive coverage
     - Simple to read / human-intelligible
     - Finite
     Cons:
     - We don't know how the segments are determined (black box).
     - Different providers may not use the same methods.
     - The longer a device has been in our graph, the more audiences it accumulates (snowballs).
  10. Plan of Attack
      1. Use the segments as features to create feature vectors.
      2. Compare several methods:
         - Simple dot product (baseline)
         - Probabilistic approaches that use segment co-occurrence
         - Machine learning approaches that use truth data and the existing graph structure as proxy data
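The dot-product baseline above can be sketched in a few lines. This is an illustrative toy, not Tapad's implementation: the segment names and device vocabularies here are made up, and the score is simply the size of the segment overlap between two devices.

```python
# Toy sketch of the dot-product baseline over binary segment features.
# Segment names and devices below are illustrative, not from the deck.

def segment_vector(segments, vocabulary):
    """Binary feature vector over a fixed segment vocabulary."""
    return [1.0 if s in segments else 0.0 for s in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["dog_lover", "comic_book_enthusiast", "male", "auto_intender"]
ipad = segment_vector({"dog_lover", "male"}, vocab)
laptop = segment_vector({"dog_lover", "male", "auto_intender"}, vocab)
kindle = segment_vector({"comic_book_enthusiast"}, vocab)

# With binary features, the dot product is just the segment overlap.
score_same_household = dot(ipad, laptop)  # shares two segments
score_random_pair = dot(ipad, kindle)     # shares none
```

The probabilistic and machine-learning variants replace this raw overlap with scores weighted by segment co-occurrence statistics or learned from proxy truth data.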
  11. What Do We Mean by Proxy Data?
      Assumption: two nodes connected in The (household) Device Graph are more likely to be similar to each other than two unconnected nodes.
  12. Measuring Performance
      To compare methods we compute the Win Rate:
      1. Select a pair of devices connected in the graph; compute the score between them (true_score).
      2. Select a random device unconnected to the original devices; compute a score with one of the original devices (false_score).
      if true_score > false_score:
          win_value = 1.0
      elif true_score < false_score:
          win_value = 0.0
      else:  # ties
          win_value = 0.5
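The win-rate computation above can be made runnable as follows. The function names are my own; the check at the bottom just confirms that a scorer producing random numbers hovers around the 0.5 baseline the next slide describes.

```python
import random

def win_value(true_score, false_score):
    """Score one trial: 1.0 if the connected pair wins, 0.5 for ties."""
    if true_score > false_score:
        return 1.0
    elif true_score < false_score:
        return 0.0
    else:  # ties
        return 0.5

def win_rate(true_scores, false_scores):
    """Average win_value over many (true_score, false_score) trials."""
    trials = list(zip(true_scores, false_scores))
    return sum(win_value(t, f) for t, f in trials) / len(trials)

# Sanity check: a random scorer should land near 0.5 on average.
random.seed(0)
pairs = [(random.random(), random.random()) for _ in range(10_000)]
rate = win_rate(*zip(*pairs))
```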
  13. Performance Expectations
      A random algorithm should achieve an average win_value of about 0.5. We expect an optimal algorithm to achieve an average win_value of about 0.75 -- 50% better than random. Why? Census data suggest around 2 adults per household, so we expect about half of our household edges to be highly correlated (similar) while the remainder should be statistically uncorrelated (dissimilar).
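The 0.75 ceiling follows directly from mixing the two edge populations, as this small arithmetic sketch (my framing, under the 2-adults-per-household assumption stated above) shows:

```python
# Expected win rate for an optimal scorer, assuming ~2 adults/household:
# half the household edges link same-person devices (an optimal scorer
# essentially always wins those trials), and the other half link
# different people, which against a random device is a coin flip.
p_same = 0.5          # fraction of household edges that are truly similar
win_if_same = 1.0     # optimal scorer wins these comparisons
win_if_diff = 0.5     # uncorrelated pair: no better than random
expected_optimal = p_same * win_if_same + (1 - p_same) * win_if_diff
```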
  14. Well, How Do Segment Data Perform?
      In a word: poorly. Our attempts eked in just above the random line, around an average win_value of 0.55 -- at most 10% better than random!
  15. So What Happened?
      Segment data are riddled with randomness & noise and hidden bias.
      An example of randomness & noise: 1 out of 4 devices that "self-identified as mom" are also tagged as "male". (Either we're really, really progressive, or something has gone horribly wrong.)
  16. So Much Bias!
      - Platform bias: certain segments are platform-specific. (For example: "used a specific mail client on Android".)
      - Source bias: we don't always have overlap between the different first and third parties we work with, and the overlap is not uncorrelated.
      - Temporal bias: long-lived devices tend to accumulate segments (snowballs!).
      - Audience value bias: certain segments are worth more to advertisers, so they appear more often than expected. (Example: people intending to purchase automobiles.)
  17. 17. Platform Bias
  18. 18. Platform Bias
  19. 19. Platform Bias
  20. 20. Source Bias
  21. Next Steps
      Either: account for these biases explicitly and try to correct them (see engineering.tapad.com).
      Or: test different algorithms.
      Or: abandon the effort and look elsewhere for different data.
      We opted for the last one.
  22. Browsing Data
      In the end, we opted to use our in-house browsing data. Browsing data are the data we obtain when examining available ad space: each piece gives us an obfuscated ID and the URL on which the device is browsing.
      We initially avoided browsing data due to sparsity: while we saw about 20 pieces of audience data per device on average, we were in some cases limited to a single unique URL per device, because these data are harder to come by than black-box segment data.
  23. Plan of Attack
      (Preprocessing: remove the fraudulent URLs associated with botnets.)
      Just as before, create a feature vector, but now the features are the legitimate unique domains (tapad.com, mlconf.com, etc.).
      Compare several methods:
      - The feature-vector dot product (baseline)
      - Matrix-based approaches, which use probabilistic correlations based on URL co-occurrence on nodes
      - Clustering-based approaches, which reduce dimensionality by first clustering highly correlated URLs
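The co-occurrence idea above can be sketched as follows. This is a simplified illustration, not Tapad's method: the device IDs, domains, and the 0.1 soft-credit weight are all made up. The point is that domains appearing together on many devices are treated as correlated, so two devices can score well even with sparse, non-overlapping URL sets.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical devices and their observed (already de-botted) domains.
devices = {
    "dev_a": {"mlconf.com", "tapad.com"},
    "dev_b": {"mlconf.com", "kaggle.com"},
    "dev_c": {"tapad.com", "kaggle.com"},
    "dev_d": {"recipes.example"},
}

# Count how often each pair of domains co-occurs on a single device.
cooc = defaultdict(int)
for domains in devices.values():
    for d1, d2 in combinations(sorted(domains), 2):
        cooc[(d1, d2)] += 1
        cooc[(d2, d1)] += 1

def score(dev1, dev2):
    """Exact-match dot product plus soft credit for correlated domains."""
    d1, d2 = devices[dev1], devices[dev2]
    exact = len(d1 & d2)
    soft = sum(cooc.get((a, b), 0) for a in d1 for b in d2 if a != b)
    return exact + 0.1 * soft  # 0.1 is an arbitrary illustrative weight

s_linked = score("dev_a", "dev_b")  # shares mlconf.com + correlated domains
s_random = score("dev_a", "dev_d")  # shares nothing, no co-occurrence
```

The clustering-based variant goes one step further: instead of crediting individual correlated domain pairs, it first groups highly correlated URLs into clusters and builds the feature vector over those clusters.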
  24. Performance
      Much better! The simple dot product (baseline) already performs about 18% better than random, and both the matrix-based and clustering-based approaches perform up to 40% better than random. This is in the range where we expect an optimal algorithm to perform -- despite data sparsity!
  25. Moral
      Don't assume that because pieces of data are nicely tied in a bow and plentiful, they are the right data to use. Question your data, not only your algorithms. The best pieces of data may be scarce and raw, because they are often less fraught with hidden biases and unnecessary processing.
  26. Learn More About Tapad
      Read our blog: http://engineering.tapad.com
      Follow us on Twitter: @tapad, @tapadeng
      Follow us on Instagram: @tapadinc (includes a picture of yours truly in a headstand)
      Contact me: yael@tapad.com, @y_s_e