MLconf Yael Elmatad

  1. Getting Cozy With Raw Data (A Cautionary Tale)
     Yael Elmatad, Data Scientist, Tapad (@y_s_e)
  2. The Ad Tech Space
     The goal of ad tech is to show advertisements to consumers on the internet and to ensure that the right ad is shown to the right person. It has many components:
     - Publishers, who have ad space on their pages
     - "Sell"-side platforms, which aggregate publishers and facilitate the selling of ad space
     - "Buy"-side platforms (like Tapad), which bid on that space to show current ad campaigns
     - Advertisers, who entrust demand-side platforms to place their content appropriately
  3. Why Cross-Device?
     - Device proliferation: 5.7 internet-connected devices per household
     - Screen switching: digital natives switch screens 27 times every non-working hour
     - Purchasing across devices: 40% of shoppers consult 3 or more channels before purchase
     Sources: NPD, March 2013; eMarketer, April 2012; Conlumino & Webloyalty, 2012
  4. Tapad Connects Consumers' Devices
     To address these issues, Tapad built The Device Graph, which seeks to connect devices within a household for targeting across multiple screens. Its edges are inferred using a variety of techniques, including co-location, partnerships with other companies, and obfuscated login data (no personally identifiable data is ever observed).
  5. Tapad Statistics
     - Over 2 billion nodes (devices) in The Device Graph, representing about 100 million households and approximately 250 million individuals.
     - 75% of connected devices are connected to 3 or more devices.
     - 38% of devices are computers; 36% are smartphones and tablets.
  6. The (Original, Household) Device Graph
     - No scores on edges.
     - No way to separate individuals.
     (Diagram: iPad, computer, Kindle.)
  7. What We Wanted
     - Edge thickness indicates the confidence of the link between devices.
     - Colors indicate device clusters found by community detection.
     - The (household) Device Graph naturally restricts our search space.
     - We never seek to identify individuals, only to group devices used by the same individual.
     - The graph can be traversed at varying thresholds (scale vs. accuracy).
  8. Scoring Edges
     We needed a way to put weights on edges.
     First attempt: use segment data.
     - Provided by first or third parties.
     - Tries to put devices into inferred buckets, e.g. "dog lover", "comic book enthusiast", "male".
  9. Pros/Cons of Segment Data
     Pros:
     - Relatively extensive coverage
     - Simple to read / human-intelligible
     - Finite
     Cons:
     - We don't know how the segments are determined (black box)
     - Different providers may not use the same methods
     - The longer a device has been in our graph, the more audiences it accumulates (snowballs)
  10. Plan of Attack
      1. We used the segments as features to create feature vectors.
      2. We compared several methods (a sketch of the baseline follows this slide):
         - Simple dot product (baseline)
         - Probabilistic approaches that use segment co-occurrence
         - Machine learning approaches that use truth data and the existing graph structure as proxy data
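      As a concrete illustration of the baseline, here is a minimal sketch of dot-product scoring over binary segment feature vectors. The segment vocabulary and helper names are invented for illustration; this is not Tapad's implementation.

          # Minimal dot-product baseline: represent each device as a binary
          # vector over a shared segment vocabulary, then score a pair of
          # devices by the dot product (the count of shared segments).
          SEGMENTS = ["dog_lover", "comic_book_enthusiast", "male", "auto_intender"]

          def to_feature_vector(device_segments):
              """Map a device's set of segments onto the shared vocabulary."""
              return [1.0 if s in device_segments else 0.0 for s in SEGMENTS]

          def dot_product_score(segments_a, segments_b):
              """Baseline edge score: how many segments the two devices share."""
              va = to_feature_vector(segments_a)
              vb = to_feature_vector(segments_b)
              return sum(x * y for x, y in zip(va, vb))

          print(dot_product_score({"dog_lover", "male"},
                                  {"dog_lover", "comic_book_enthusiast"}))  # 1.0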
  11. What Do We Mean by Proxy Data?
      Assumption: two nodes connected in The (household) Device Graph are more likely to be similar to each other than two unconnected nodes.
  12. Measuring Performance
      To compare methods, we compute the win rate:
      1. Select a pair of devices connected in the graph; compute the score between them (true_score).
      2. Select a random device unconnected to the original devices; compute a score with one of the original devices (false_score).

          if true_score > false_score:
              win_value = 1.0
          elif true_score < false_score:
              win_value = 0.0
          else:  # ties
              win_value = 0.5

      (A runnable sketch of this loop follows this slide.)
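      Here is a minimal, runnable version of that evaluation loop. It assumes graph is a dict mapping each device ID to the set of its neighbors and score is any pairwise scoring function (for example, the dot-product baseline above); all names here are invented, not Tapad's.

          import random

          def win_rate(graph, score, trials=10000, rng=None):
              """Average win_value over random (connected, unconnected) comparisons."""
              rng = rng or random.Random(0)
              devices = list(graph)
              total = 0.0
              for _ in range(trials):
                  # Pick a connected pair and score it.
                  a = rng.choice([d for d in devices if graph[d]])
                  b = rng.choice(sorted(graph[a]))
                  true_score = score(a, b)
                  # Pick a device unconnected to either, and score it against a.
                  # (Assumes the graph is sparse enough that one exists.)
                  c = rng.choice([d for d in devices
                                  if d not in (a, b)
                                  and d not in graph[a] and d not in graph[b]])
                  false_score = score(a, c)
                  if true_score > false_score:
                      total += 1.0
                  elif true_score == false_score:  # ties count as half a win
                      total += 0.5
              return total / trials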
  13. Performance Expectations
      A random algorithm should achieve an average win_value of about 0.5. We expect an optimal algorithm to achieve an average win_value of about 0.75, i.e. 50% better than random. Why? Census data suggest around 2 adults per household, so we expect about half of our household edges to be highly correlated (similar) while the remainder should be statistically uncorrelated (dissimilar). (The arithmetic is spelled out below.)
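      The back-of-the-envelope behind the 0.75 figure, under the slide's assumption that an ideal scorer always wins on the correlated half of edges and does no better than a coin flip on the uncorrelated half:

          # Expected win_value for an optimal scorer, per the slide's assumptions.
          p_similar = 0.5             # ~half of household edges link similar devices
          win_if_similar = 1.0        # an ideal scorer always wins these
          win_if_uncorrelated = 0.5   # coin flip on the rest
          expected = p_similar * win_if_similar + (1 - p_similar) * win_if_uncorrelated
          print(expected)  # 0.75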
  14. Well, How Do Segment Data Perform?
      In a word: poorly. Our attempts eked just above the random line, at around an average win_value of 0.55. At most, 10% better than random!
  15. So What Happened?
      Segment data are riddled with:
      - randomness & noise
      - hidden bias
      An example of randomness & noise: 1 out of 4 devices that "self-identified as mom" were also tagged as "male". (Either we're really, really progressive, or something has gone horribly wrong.)
  16. So Much Bias!
      - Platform bias: certain segments are platform-specific (for example, "used a specific mail client on Android").
      - Source bias: we don't always have overlap between the different first and third parties we work with, and the overlap is not uncorrelated.
      - Temporal bias: long-lived devices tend to accumulate segments (snowballs!).
      - Audience value bias: certain segments are worth more to advertisers, so they appear more often than expected (example: people intending to purchase automobiles).
  17. Platform Bias (chart)
  18. Platform Bias (chart)
  19. Platform Bias (chart)
  20. Source Bias (chart)
  21. Next Steps
      Either:
      - account for these biases explicitly and try to correct them (see engineering.tapad.com), or
      - test different algorithms, or
      - abandon the effort and look elsewhere for different data.
      We opted for the last one.
  22. Browsing Data
      In the end, we opted to use our in-house browsing data: the data we obtain when examining available ad space. Each piece of data gives us an obfuscated ID and the URL on which the device is browsing.
      We initially avoided these data due to sparsity: while we saw about 20 pieces of audience data per device on average, we were in some cases limited to a single unique URL per device, because browsing data are harder to come by than black-box segment data.
  23. Plan of Attack
      (Preprocessing: remove the fraudulent URLs associated with botnets.)
      Just as before, create a feature vector, but now the features are the legitimate unique domains (tapad.com, mlconf.com, etc.). Compare several methods (a co-occurrence sketch follows this slide):
      - The feature-vector dot product (baseline)
      - Matrix-based approaches, which use probabilistic correlations based on URL co-occurrence on nodes
      - Clustering-based approaches, which reduce dimensionality by first clustering highly correlated URLs
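      To make the matrix-based idea concrete, here is a minimal sketch that estimates domain co-occurrence across devices and scores a device pair by the co-occurrence strength of their cross-domain pairs. The normalization and function names are chosen for illustration; this is not Tapad's actual model.

          from collections import Counter
          from itertools import combinations

          def cooccurrence_counts(device_domains):
              """device_domains: iterable of per-device sets of browsed domains."""
              pair_counts = Counter()
              domain_counts = Counter()
              for domains in device_domains:
                  domain_counts.update(domains)
                  for d1, d2 in combinations(sorted(domains), 2):
                      pair_counts[(d1, d2)] += 1
              return pair_counts, domain_counts

          def cooccurrence_score(domains_a, domains_b, pair_counts, domain_counts):
              """Score two devices by the (crudely normalized) co-occurrence of their domains."""
              score = 0.0
              for da in domains_a:
                  for db in domains_b:
                      if da == db:
                          score += 1.0  # exact domain match on both devices
                          continue
                      key = tuple(sorted((da, db)))
                      denom = domain_counts[da] * domain_counts[db]
                      if denom:
                          score += pair_counts[key] / denom
              return score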
  24. Performance
      Much better! The simple dot product (baseline) already performs about 18% better than random, and both the matrix-based and clustering-based approaches perform up to 40% better than random. That is in the range we expect from an optimal algorithm, despite the data sparsity!
  25. Moral
      Don't assume that data are the right data to use just because they are plentiful and tied up with a nice bow. Question your data, not only your algorithms. The best data may be scarce and raw, because raw data are often less fraught with hidden biases and unnecessary processing.
  26. Learn More About Tapad
      - Read our blog: http://engineering.tapad.com
      - Follow us on Twitter: @tapad, @tapadeng
      - Follow us on Instagram: @tapadinc (includes a picture of yours truly in a headstand)
      - Contact me: yael@tapad.com, @y_s_e