Yahoo Display Advertising Attribution


Published in: Technology, Design

  1. Yahoo! Display Ads Attribution Framework: A Problem Of Efficient Sparse Joins On Massive Data
     Supreeth, Sundeep, Chenjie, Chinmay
     Data Team, Yahoo!
  2. Agenda
     §  Problem description
        ›  Serves, impressions, clicks
        ›  Attribution
     §  Class of problems and application in other use cases
     §  Attribution framework
     §  Performance comparison
     §  Conclusion
  3. Serves, Impressions, Clicks
     (Diagram: web servers and ad servers.)
     §  Serves: server-logged event for an ad served. A serve has the complete context.
     §  Impressions: client-side event for an ad shown
     §  Clicks: client-side event for a click on an ad
     §  Interactions: client-side events for interactions within an ad
     §  Serve events are heavy, a few tens of KBs each
     §  Impressions, clicks and conversions are a few bytes each
     §  Join:
        ›  Impressions/clicks/interactions: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
        ›  Serves: Serve Guid + Serve timestamp + {other fields of serve}
     * Guid is global unique identifier
  4. Need For Attribution
     §  Serves: 5-minute instances; older instances go back several hours to days
     §  Impressions/clicks: arrive every 5 mins
     §  Attribute an impression/click with the serve
  5. Distribution Of % Impressions Arrived From The Client Side wrt Serves
     (Chart: % of impressions for a serve, scale 0 to 90, against the time period from when the serves happened, t1 to t13, where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, and so on.)
  6. Distribution Of % Clicks Arrived From The Client Side wrt Serves
     (Chart: % of clicks for a serve, scale 0 to 45, against the time period from when the serves happened, t1 to t13, where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, and so on.)
  7. Class Of Problems
     §  Sparse joins spanning TBs of data on grid
     §  Few MBs to a few TBs
     §  Left outer join or any other outer join

     Data Set          Data Size (compressed size)
     Impressions       400MB
     Serves (5m*288)   20GB * 288 ~= 5.6 TB
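The sizes on this slide can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 TB = 1024 GB) and reads the 288 instances as the 5-minute serve instances in a 24-hour lookback; the variable names are illustrative and do not come from the deck.

```python
# Assumption: 288 is the number of 5-minute serve instances in 24 hours.
INSTANCE_MINUTES = 5
LOOKBACK_HOURS = 24
instances = LOOKBACK_HOURS * 60 // INSTANCE_MINUTES

SERVE_GB_PER_INSTANCE = 20   # compressed size per serve instance, from the slide
IMPRESSION_MB = 400          # compressed size of the impression data, from the slide

# Assumption: binary units, 1 TB = 1024 GB and 1 GB = 1024 MB.
serve_total_gb = SERVE_GB_PER_INSTANCE * instances
serve_total_tb = serve_total_gb / 1024
size_ratio = serve_total_gb * 1024 / IMPRESSION_MB

print(f"{instances} instances, {serve_total_gb} GB = {serve_total_tb:.3f} TB of serves")
print(f"serve data is about {size_ratio:,.0f} times the impression data")
```

The ratio of roughly four orders of magnitude between the two sides is what makes the join sparse: almost all serve records have no matching impression in a given run.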
  8. Similar Use Cases
     §  Associating video, click, social interactions back to the activity data
     §  Attribute back a small size client beacon to a large dataset
     §  Within Yahoo
        ›  Audience view/click attribution
        ›  Weblog based investigation
        ›  Joining dimensional data with web traffic data
  9. Pig Joins And Problem Fit

     Join Strategy     Comments                                                                       Cost
     Merge join        The datasets are not sorted                                                    High
     Hash join         Shuffle and reduce time                                                        High
     Replicated Join   Does not meet performance needs; left outer join on the replicated dataset    High
     Skewed Join       Data set is not skewed                                                         N/A
  10. Problem Statement
      To do a sparse outer join on a very large dataset with high performance requirements for display ad attribution on grid
  11. Attribution Framework - Overview
      §  Smart Instrumentation Strategies
      §  Aggressive partitioning and selection
      §  Partition Aware Efficient Join Query Plan
  12. Instrument For Attribution
      Ø  Smart Instrumentation Strategies
      §  Serve guid
      §  Clues which can help you partition better
         ›  Timestamp of the serve
      §  Partition keys used in event instrumentation
      §  In the impression attribution example:
         ›  Impression: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
         ›  Serves: Serve Guid + Serve timestamp + {other fields of serve}
  13. Partitioning approach
      Ø  Aggressive partitioning and selection
      §  Join key based partitioning
      §  Keys for leveraging physical partitioning
         ›  timestamp
      §  Use of hashes in partitioning
         ›  HashFNV, Murmur

      Key         Partition Type
      Join keys   Hash
      Timestamp   Range
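A minimal sketch of the two partition types in the table above: range partitioning on the serve timestamp and hash partitioning on the join key. It assumes the 32-bit FNV-1a variant (the slide names HashFNV and Murmur without giving a variant), 5-minute range partitions labelled in the YYYYMMDDHHMM form seen in the earlier charts, and 16 hash partitions. The function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple

FNV_OFFSET_32 = 0x811C9DC5  # standard FNV 32-bit offset basis
FNV_PRIME_32 = 0x01000193   # standard FNV 32-bit prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def time_bucket(serve_ts: datetime, minutes: int = 5) -> str:
    """Range partition: floor the serve timestamp to its 5-minute instance."""
    floored = serve_ts.replace(
        minute=serve_ts.minute - serve_ts.minute % minutes,
        second=0,
        microsecond=0,
    )
    return floored.strftime("%Y%m%d%H%M")


def partition_for(serve_guid: str, serve_ts: datetime,
                  num_hash_partitions: int = 16) -> Tuple[str, int]:
    """Return (range partition, hash partition) for one serve record."""
    hash_part = fnv1a_32(serve_guid.encode("utf-8")) % num_hash_partitions
    return time_bucket(serve_ts), hash_part


# Hypothetical guid; both the serve and the impression carry guid + serve timestamp.
print(partition_for("guid-0001", datetime(2012, 5, 30, 9, 57, 12)))
```

Because an impression carries the same serve guid and serve timestamp as its serve, the same function locates the one serve partition that can contain the match.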
  14. Pruning/Selection
      Ø  Aggressive partitioning and selection
      §  Hashing of keys in the data sets
      §  Pruning of partitions
         ›  Timestamp
         ›  Hash of the join key
      §  IO costs and partitions
      §  Configurable partitions

      Key         Partition Type   Pruning
      Join keys   Hash             Yes
      Timestamp   Range            Yes
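Pruning on both keys can be sketched as computing, from the small impression side, the set of serve partitions that could contain a match, and reading only those. The example below is an illustration under assumptions: `zlib.crc32` stands in for the FNV or Murmur hash named in the deck, partitions are 5-minute instances times 16 hash partitions, and all names and sample records are hypothetical.

```python
import zlib
from datetime import datetime, timedelta


def time_bucket(ts, minutes=5):
    """Floor a serve timestamp to its 5-minute instance (YYYYMMDDHHMM)."""
    floored = ts - timedelta(minutes=ts.minute % minutes,
                             seconds=ts.second,
                             microseconds=ts.microsecond)
    return floored.strftime("%Y%m%d%H%M")


def hash_partition(guid, num_partitions):
    # Stand-in hash (assumption): the deck names FNV and Murmur, not CRC32.
    return zlib.crc32(guid.encode("utf-8")) % num_partitions


def partitions_to_read(impressions, num_partitions):
    """Serve partitions that can hold a match for any (guid, serve timestamp)."""
    return {(time_bucket(ts), hash_partition(guid, num_partitions))
            for guid, ts in impressions}


def total_partitions(lookback_hours, num_partitions, minutes=5):
    """All serve partitions inside the lookback window."""
    return (lookback_hours * 60 // minutes) * num_partitions


# Made-up impression keys: (serve guid, serve timestamp).
impressions = [
    ("guid-a", datetime(2012, 5, 30, 9, 57, 12)),
    ("guid-b", datetime(2012, 5, 30, 9, 58, 3)),
    ("guid-c", datetime(2012, 5, 30, 9, 51, 40)),
]
selected = partitions_to_read(impressions, num_partitions=16)
total = total_partitions(lookback_hours=24, num_partitions=16)
print(f"read {len(selected)} of {total} serve partitions")
```

With more hash partitions each selected partition is smaller, so fewer bytes are read, at the cost of more files; this is the trade-off the tuning slides chart.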
  15. Partition Aware Efficient Join Query Plans
      Ø  Partition Aware Efficient Join Query Plan
      §  Inner join inputs
         ›  Impression event keys (Size: MBs)
         ›  Stream the selected Serve event partitions (Size: TBs)
      §  Inner join output: Annotated impression (Size: Hundreds of MBs)
      §  Left outer join inputs
         ›  Stream full Impression event (Size: MBs)
         ›  Annotated impression
      §  Left outer join output: Complete annotated impression data with Serve data
      (Diagram legend: in memory, stream)
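The two-stage plan on this slide can be sketched in plain Python. This is an illustration of the idea only, not the framework's Pig implementation: the field names, the choice of which side is held in memory, and the sample records are assumptions.

```python
def attribute(impressions, serve_partitions):
    """Two-stage join: inner join on keys, then left outer join on impressions.

    impressions: list of dicts with "serve_guid" and "serve_ts" (small side).
    serve_partitions: iterable of iterables of serve dicts (already pruned).
    """
    # Stage 1: hold only the impression keys in memory and stream the serves.
    keys = {(imp["serve_guid"], imp["serve_ts"]) for imp in impressions}
    annotated = {}
    for partition in serve_partitions:
        for serve in partition:
            key = (serve["serve_guid"], serve["serve_ts"])
            if key in keys:  # inner join: keep matching serves only
                annotated[key] = serve

    # Stage 2: left outer join, one output record per impression.
    for imp in impressions:
        serve = annotated.get((imp["serve_guid"], imp["serve_ts"]))
        yield {**imp, "serve": serve}  # serve is None when nothing matched


# Made-up sample data.
impressions = [
    {"serve_guid": "g1", "serve_ts": "201205300955", "event": "impression"},
    {"serve_guid": "g9", "serve_ts": "201205300950", "event": "impression"},
]
serve_partitions = [
    [{"serve_guid": "g1", "serve_ts": "201205300955", "ad_id": 42}],
    [{"serve_guid": "g2", "serve_ts": "201205300955", "ad_id": 7}],
]
results = list(attribute(impressions, serve_partitions))
for record in results:
    print(record)
```

Neither stage needs a shuffle: the large serve side is only scanned once against an in-memory key set, which corresponds to the map-side join the deck lists as a capability.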
  16. Attribution Framework: Capabilities
      §  Left outer on impression/click/interaction
         ›  As long as the impression/click/interaction exists, we will get a record in output
      §  Complete annotation with the serve
      §  Distinct join with serves
      §  Sparse joins achieved by pruning the partitions
      §  Map side joins
  17. Attribution Framework: Implementation
      §  Python embedded PIG
      §  Dynamic partitioning/pruning (UDFs)
      §  Configurable parameters
         ›  Lookbacks
         ›  Partitions
         ›  CombinedSplitSize
  18. Attribution Framework: Tuning Parameters
      §  Serve Partitions: trade off between IO & namespace used (lookback = 24 hours)
      (Chart: bytes read in GB, scale 0 to 4000, and number of files as namespace used, scale 0 to 180000, against partitions from 2 to 1024 in powers of two.)
  19. Attribution Framework: Tuning Parameters
      §  Split Size: trade off between number of mappers and map task run time (partitions = 16, lookback = 24 hours)
      (Chart: number of mappers, scale 0 to 35000, and time taken in seconds, scale 0 to 1200, against split sizes of 128MB, 1 GB, 2 GB, 3 GB and 4 GB.)
  20. Comparison With Other PIG Joins

      Join                    Mappers   Reducers   Lookback   Input Size                               Time to complete
      Left Outer Hash Join    2800      45         40mins     180GB                                    42.5m*
      Replicated Join         5680      0          5hours     1TB                                      7m**
      Attribution Framework   5760      0          24hours    Effective 5.6 TB; With Pruning 1.1 TB    6m***

      *   Best case for hash join: 1.5m + 15.5m + 25.5m (Mapper + Shuffle + Reducer)
      **  Map time taken
      *** 1 min + 2 mins + 3 mins (Selection/Pruning + Impression partitioning + Join)
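The footnoted breakdowns and the pruning figures in this comparison can be cross-checked with simple arithmetic. The numbers below are taken from the slide; the variable names are illustrative.

```python
# Footnote *: hash join best case, mapper + shuffle + reducer, in minutes.
hash_join_minutes = 1.5 + 15.5 + 25.5

# Footnote ***: selection/pruning + impression partitioning + join, in minutes.
framework_minutes = 1 + 2 + 3

# Share of the effective 5.6 TB that is actually read after pruning.
pruned_tb, effective_tb = 1.1, 5.6
read_fraction = pruned_tb / effective_tb

print(f"hash join: {hash_join_minutes} min, framework: {framework_minutes} min")
print(f"pruning reads about {read_fraction:.0%} of the serve data in the lookback")
```

Note that the three rows use different lookbacks (40 minutes, 5 hours, 24 hours), so the times are not a like-for-like comparison; the framework covers the longest lookback of the three.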
  21. Conclusion
      §  For the sparse lookup problem, the attribution framework works well and meets the performance needs
      §  Effective partitioning enables longer lookbacks and reduces IO
      §  The levers in the framework allow for tuning based on the computation/IO requirements
  22. Future Steps
      §  Use HBase/Cassandra to store the event grain serve data and do lookups
      §  Use of bloom filter along with an index format
      §  Compare the strategy with what Hive does and come up with a framework using Hive
  23. Questions?