BDT102 Algorithms, Machines, and Crowdsourcing - AWS re:Invent 2012

In this session, join the Vice President of Mechanical Turk to explore how businesses are marrying human judgment with distributed data processing, improving the accuracy of Big Data analytics without sacrificing efficiency or scalability. We'll highlight real-world examples and introduce Mike Brown, CTO of comScore, to discuss how the combination of technologies such as Hadoop and Mechanical Turk is driving large-scale systems that cleanse and categorize business-critical data from unstructured and inconsistent data sources.


Transcript

  • 1. NASDAQ: SCOR • Clients: 2,000+ worldwide • Employees: 1,000+ • Headquarters: Reston, VA • Global Coverage: 220+ countries under measurement • Local Presence: 32 locations in 23 countries
  • 2. The Challenge • Available in 7 countries: USA, Brazil, Britain, Canada, France, Germany, Spain • 2013: Mexico and India • Over 4B ads monthly • 5M-10M unique new ads monthly
  • 3. Display Ads • Observes advertising creatives • As they are encountered by the panelist • Collects Facebook pages • Regular and premium ads • Extracting all this information (and more)
  • 4. Production Hadoop Cluster • 100 nodes • 2276 total CPUs, 6 TB total memory, 1.7 PB total disk space, 1 GB Ethernet • [Pipeline diagram; labels: Facebook Ads, Facebook News & Profiles, Facebook Entity-Extraction, Facebook Entity-Stream Partitions, Hadoop DFS, Dictionary-Apply. Daily: 2 Hr / 70G, 15 min / 15Gx, 30 min / 15Gx]
  • 5. Data size: • Compressed ~2 TB • Uncompressed ~6 TB • Total pages: 320M • Avg size per page: 18 KB • Need to process 3,700 pages/sec • Factor in time to collect, load to HDFS, buffer time for errors, etc. • Hadoop is used to extract entities • Each node processes 85 pages/sec • Daily Facebook entity extraction completes in ~2 hours • Multi-language support • [Cluster diagram; labels: Client, NameNode, Hadoop-1, Hadoop-2, Hadoop-3, …, Hadoop-N, HDFS, Load FB Pages, NTFS]
  • 6. AdMetrix: • Total Ads: 85M • Ads per Ad-page: 3.7. Social Essentials: • Total news items: 351M
  • 7. Ad-Volume • 6M unique new ads monthly. Advertiser-Space (Product Dictionary) • Over 56K companies • Over 100K company/brand pairs. Problem: • correctly • quickly • inexpensively
  • 8. OCR-based, image-recognition-based. Pros • Potentially applicable to all non-Facebook online ads. Cons • Low accuracy • Low coverage • Difficult to scale and maintain for huge daily data volume
  • 9. Classify ads to cover ~80% of impressions • Automated classification: destination URL, title • Currently classifying 7-20% of new ads • No associated text for ad • New advertiser • Multi-advertiser ads • New brand, movie
  • 10. Classify ads for Turk • [Flow diagram; labels: Ads; Turk: Classification to Product-Names; New Product?; No: Prod Name; Yes: Turk: Identification of Company-Name, URL, Category]
  • 11. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  • 12. Michael Brown, CTO, comScore, Inc., mbrown@comscore.com
  • 13. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
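
The sizing figures on slides 4 and 5 can be sanity-checked with a few lines of arithmetic. The sketch below is not from the presentation. It assumes decimal units (1 KB = 1,000 bytes), a 24-hour processing window, and that all 100 nodes run in parallel at the quoted per-node rate. None of these assumptions is stated on the slides, and the function names are illustrative only.

```python
# Back-of-envelope check of the figures quoted on slides 4 and 5.
# Assumptions (NOT stated on the slides): decimal units (1 KB = 1,000 bytes),
# a 24-hour processing window, and all 100 nodes working in parallel.

SECONDS_PER_HOUR = 3600


def uncompressed_tb(total_pages: int, avg_page_kb: float) -> float:
    """Total uncompressed size in terabytes (decimal units)."""
    return total_pages * avg_page_kb * 1_000 / 1e12


def required_rate(total_pages: int, window_hours: float) -> float:
    """Pages per second needed to get through all pages within the window."""
    return total_pages / (window_hours * SECONDS_PER_HOUR)


def cluster_rate(nodes: int, pages_per_sec_per_node: float) -> float:
    """Aggregate pages per second if every node runs at the quoted rate."""
    return nodes * pages_per_sec_per_node


if __name__ == "__main__":
    print(f"Uncompressed size: {uncompressed_tb(320_000_000, 18):.2f} TB")
    print(f"Required rate:     {required_rate(320_000_000, 24):.0f} pages/sec")
    print(f"Cluster rate:      {cluster_rate(100, 85):.0f} pages/sec")
```

Under these assumptions, 320M pages at 18 KB each comes to about 5.76 TB, in line with the "~6 TB uncompressed" figure. Spreading 320M pages over 24 hours gives roughly 3,700 pages/sec, which matches the stated requirement. 100 nodes at 85 pages/sec would give an aggregate of 8,500 pages/sec.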

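Slide 9 names the destination URL and the title as the inputs to automated classification, and slide 7 mentions a product dictionary of company/brand pairs. The slides do not describe the matching logic, so the following is only a hypothetical sketch of how such a lookup could work. The dictionary contents, the function names, and the "host first, then title" order are all assumptions made for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical product dictionary: advertiser domain -> (company, brand).
# The real dictionary (over 100K company/brand pairs) is not shown on the slides.
PRODUCT_DICTIONARY = {
    "example-cars.com": ("Example Motors", "Example Sedan"),
    "example-soda.com": ("Example Beverages", "Example Cola"),
}


def normalize_host(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify_ad(destination_url: str, title: str) -> Optional[Tuple[str, str]]:
    """Return (company, brand) for an ad, or None if it cannot be classified.

    Assumed strategy: match the destination host first, then fall back to
    looking for a known brand name in the ad title.
    """
    host = normalize_host(destination_url)
    if host in PRODUCT_DICTIONARY:
        return PRODUCT_DICTIONARY[host]

    lowered_title = title.lower()
    for company, brand in PRODUCT_DICTIONARY.values():
        if brand.lower() in lowered_title:
            return (company, brand)

    # No match: e.g. a new advertiser or an ad with no associated text.
    return None
```

In a sketch like this, an ad from an unknown domain with no usable title returns `None`. That corresponds to cases the slide lists, such as a new advertiser or an ad with no associated text, and such ads would need another route to classification.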