BDT102 Algorithms, Machines, and Crowdsourcing - AWS re:Invent 2012


In this session, join the Vice President of Mechanical Turk to explore how businesses are marrying human judgment with distributed data processing, improving the accuracy of Big Data analytics without sacrificing efficiency or scalability. We'll highlight real-world examples and introduce Mike Brown, CTO of comScore, to discuss how the combination of technologies such as Hadoop and Mechanical Turk is driving large-scale systems that cleanse and categorize business-critical data from unstructured and inconsistent data sources.



  1. 1. NASDAQ SCOR
       • Clients: 2,000+ worldwide
       • Employees: 1,000+
       • Headquarters: Reston, VA
       • Global Coverage: 220+ countries under measurement
       • Local Presence: 32 locations in 23 countries
  2. 2. The Challenge
       • Available in 7 countries: USA, Brazil, Britain, Canada, France, Germany, Spain
       • 2013: Mexico and India
       • Over 4B ads monthly
       • 5M-10M unique new ads monthly
  3. 3. Display Ads
       • Observes advertising creatives
       • As they are encountered by the panelist
     Collects Facebook pages
       • Regular and premium ads
     Extracting all this information (and more)
  4. 4. Production Hadoop Cluster
       • 100 nodes
       • 2276 total CPUs, 6TB total memory, 1.7PB total disk space, 1GB Ethernet
     [Pipeline diagram. Labels, in no recoverable order: Facebook Ads; Facebook News & Profiles; Hadoop DFS; Entity Extraction; Entity-Stream Partitions; Entity Dictionary-Apply. Timings shown: Daily: 2 Hr / 70G; 15min / 15Gx; 30 min / 15Gx]
  5. 5. Data size:
       • Compressed ~ 2 TB
       • Uncompressed ~ 6 TB
       • Total Pages - 320M
     Need to process 3,700 pages/sec
       • Avg size per page: 18 KB
       • Factor in time to collect, load to HDFS, buffer time for errors, etc.
     Hadoop is used to extract entities
       • Each node processes 85 pages/sec
       • Daily Facebook entity extraction completes in ~2 hours
       • Multi-Language Support
     [Diagram labels: Client, NameNode, Hadoop-1, Hadoop-2, Hadoop-3, … Hadoop-N, HDFS, Load FB Pages, NTFS]
  6. 6. AdMetrix:
       • Total Ads: 85M
       • Ads per Ad-page: 3.7
     Social Essentials:
       • Total news items: 351M
  7. 7. Ad-Volume
       • 6M unique new ads monthly
     Advertiser-Space (Product Dictionary)
       • Over 56K companies
       • Over 100K company/brand pairs
     Problem
       • correctly
       • quickly
       • inexpensively
  8. 8. OCR based / Image-Recognition based
     Pros
       • Potentially applicable to all non-Facebook online ads
     Cons
       • Low Accuracy
       • Low Coverage
       • Difficult to scale and maintain for huge daily data-volume
  9. 9. • Classify ads to cover ~80% of impressions
       • Automated Classification: Destination URL, Title
       • Currently classifying 7-20% of new ads
       • no associated-text for ad
       • new advertiser
       • multi-advertiser ads
       • new brand, movie
  10. 10. [Flowchart: Ads → Classify ads for Turk → New Product? No: Turk-Classification to Product-Names → Prod Name. Yes: Turk-Identification of Company-Name, URL, Category]
  11. 11. Visit or follow @datagems for the latest gems.
  12. 12. Michael Brown, CTO, comScore
  13. 13. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
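
The sizing figures on slide 5 can be cross-checked with a little arithmetic. The sketch below is not from the presentation: it assumes the 320M pages must be processed within one 24-hour window (the slide does not say so), uses decimal units, and ignores the collection, HDFS load, and error-buffer time the slide mentions. The function names are made up for illustration.

```python
import math

# Figures quoted on slide 5.
TOTAL_PAGES = 320_000_000        # "Total Pages - 320M"
AVG_PAGE_KB = 18                 # "Avg size per page: 18 KB"
REQUIRED_PAGES_PER_SEC = 3_700   # "Need to process 3,700 pages/sec"
NODE_PAGES_PER_SEC = 85          # "Each node processes 85 pages/sec"

# Assumption (not stated on the slide): the processing window is one day.
SECONDS_PER_DAY = 24 * 60 * 60


def uncompressed_tb(pages: int, avg_kb: float) -> float:
    """Total uncompressed size in decimal terabytes (1 TB = 10**9 KB)."""
    return pages * avg_kb / 1e9


def pages_per_sec(pages: int, window_seconds: int) -> float:
    """Sustained rate needed to get through all pages within the window."""
    return pages / window_seconds


def nodes_needed(rate: float, per_node_rate: float) -> int:
    """Smallest whole number of nodes that sustains the given rate."""
    return math.ceil(rate / per_node_rate)


if __name__ == "__main__":
    size = uncompressed_tb(TOTAL_PAGES, AVG_PAGE_KB)
    rate = pages_per_sec(TOTAL_PAGES, SECONDS_PER_DAY)
    nodes = nodes_needed(REQUIRED_PAGES_PER_SEC, NODE_PAGES_PER_SEC)
    print(f"Uncompressed size: {size:.2f} TB")
    print(f"Required rate over one day: {rate:.0f} pages/sec")
    print(f"Nodes needed at 85 pages/sec each: {nodes}")
```

Under these assumptions the numbers line up with the slide: 320M pages at 18 KB each is about 5.76 TB (the slide rounds to ~6 TB), and spreading 320M pages over a day gives roughly 3,700 pages/sec. At 85 pages/sec per node that rate needs at least 44 nodes, which is within the 100-node cluster described on slide 4.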