AWS Summit Berlin 2012 Talk on Web Data Commons
Transcript

  • 1. Large-Scale Analysis of Web Pages - on a Startup Budget? Hannes Mühleisen, Web-Based Systems Group. AWS Summit 2012 | Berlin
  • 2. Our Starting Point
    • Websites now embed structured data in HTML
    • Various vocabularies are possible: schema.org, the Open Graph protocol, ...
    • Various encoding formats are possible: μFormats, RDFa, Microdata
    • Question: How are vocabularies and formats used?
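A minimal sketch of what such embedded data looks like to an extractor, assuming Python with BeautifulSoup (neither is named in the talk, and the sample HTML is invented for illustration): Microdata is detected via itemscope/itemprop attributes, RDFa via typeof/property attributes.

```python
# Sketch: detecting embedded structured data in an HTML page.
# BeautifulSoup and the sample markup are illustrative assumptions.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <span itemprop="price">9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Microdata: elements carrying itemscope/itemprop (HTML5 Microdata spec).
microdata_items = soup.find_all(attrs={"itemscope": True})
# RDFa: elements carrying typeof/property attributes.
rdfa_items = soup.find_all(attrs={"typeof": True})

print(f"Microdata items: {len(microdata_items)}, RDFa items: {len(rdfa_items)}")
for item in microdata_items:
    props = {p["itemprop"]: p.get_text(strip=True)
             for p in item.find_all(attrs={"itemprop": True})}
    print(item.get("itemtype"), props)
```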
  • 3. Web Indices
    • To answer our question, we need access to raw Web data.
    • However, maintaining a Web index is insanely expensive: re-crawling and storage for currently ~50 B pages (Google).
    • Google and Bing have indices, but do not let outsiders in.
  • 4. Common Crawl
    • Non-profit organization
    • Runs a crawler and provides HTML dumps
    • Available data: index 02/2012: 1.7 B URLs (21 TB); index 09/2012: 2.8 B URLs (29 TB)
    • Available on AWS Public Data Sets
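The corpus can be read anonymously from S3. A sketch, assuming boto3 and the current public bucket layout (bucket commoncrawl, prefix crawl-data/); the 2012 snapshots lived under a different public-data-set path:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access to the public Common Crawl bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a few crawl files. "crawl-data/" is the current layout; the 2012
# snapshots used a different prefix under the AWS public data sets.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```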
  • 5. Why AWS?
    • Now that we have a web crawl, how do we run our analysis? Unpacking and DOM parsing on ~50 TB is CPU-heavy!
    • Preliminary analysis: 1 GB / hour / CPU possible
      • 8-CPU desktop: 8 months
      • 64-CPU server: 1 month
      • 100 8-CPU EC2 instances: ~3 days
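These runtime estimates follow directly from ~50 TB of data at ~1 GB per hour per CPU; a quick check of the slide's arithmetic:

```python
# Back-of-envelope check of the slide's estimates:
# ~50 TB of crawl data at ~1 GB / hour / CPU throughput.
total_gb = 50_000          # ~50 TB
gb_per_cpu_hour = 1.0

for label, cpus in [("8-CPU desktop", 8),
                    ("64-CPU server", 64),
                    ("100 x 8-CPU EC2 instances", 800)]:
    hours = total_gb / (gb_per_cpu_hour * cpus)
    print(f"{label}: {hours:,.0f} h = {hours / 24:.1f} days")

# 8-CPU desktop:             6,250 h = 260.4 days (~8 months)
# 64-CPU server:               781 h =  32.6 days (~1 month)
# 100 x 8-CPU EC2 instances:    62 h =   2.6 days (~3 days)
```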
  • 6. Common Crawl Dataset Size (chart: how much of the dataset one hour of compute covers for 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances)
  • 7. AWS Setup
    • Data input: read index splits from S3
    • Job coordination: SQS message queue
    • Workers: 100 EC2 spot instances (c1.xlarge, ~0.17 € / h)
    • Result output: write to S3
    • Logging: SDB
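A sketch of the coordination side of this setup: one SQS message per index split, so workers can pull tasks independently. Written with boto3 for illustration (the 2012 project predates boto3); the queue URL and bucket name are hypothetical.

```python
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/wdc-tasks"  # hypothetical
INPUT_BUCKET = "commoncrawl-input"                                        # hypothetical

# List the crawl splits in the input bucket and enqueue one message per key.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=INPUT_BUCKET, Prefix="splits/"):
    for obj in page.get("Contents", []):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj["Key"])
```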
  • 8. SQS
    • Each input file is queued in SQS
    • EC2 workers take tasks from SQS
    • Workers read and write S3 buckets
    (architecture diagram: tasks 42, 43, ... flow from the SQS queue to EC2 workers, which read splits from the CC bucket on S3 and write results R42, R43, ... to the WDC bucket)
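And the matching worker loop: receive a task, fetch the split from S3, extract, write the result back, and only then delete the message, so a crashed worker's task reappears after the visibility timeout. Same caveats as above: boto3 for illustration, all names hypothetical, and the extraction function is a placeholder.

```python
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/wdc-tasks"  # hypothetical
INPUT_BUCKET = "commoncrawl-input"    # hypothetical
OUTPUT_BUCKET = "wdc-results"         # hypothetical

def extract_structured_data(raw_bytes):
    """Placeholder for the actual unpacking + DOM parsing + extraction."""
    return b"..."

while True:
    # Take one task (the key of an input split) from the queue, long-polling.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    if not messages:
        break  # queue drained; the spot instance can shut down
    msg = messages[0]
    key = msg["Body"]

    # Read the input split from S3, process it, write the result back to S3.
    obj = s3.get_object(Bucket=INPUT_BUCKET, Key=key)
    result = extract_structured_data(obj["Body"].read())
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=f"result/{key}", Body=result)

    # Delete the message only after the result is safely stored.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```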
  • 9. Results - Types of Data (chart: entity count per type, log scale, for Microdata and RDFa in the 2009/2010 and 02/2012 crawls; 2012 Microdata breakdown: website structure 23 %, products/reviews 19 %, movies/music/... 15 %, geodata 8 %, people/organizations 7 %)
    • Available data is largely determined by major-player support: “If Google consumes it, we will publish it.”
  • 10. Results - Formats (chart: percentage of URLs with embedded data, per format: RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN; 2009/2010 vs. 02/2012)
    • URLs with embedded data: +6 %
    • Microdata: +14 % (schema.org?)
    • RDFa: +26 % (Facebook?)
  • 11. Results - Extracted Data
    • Extracted data is available for download at www.webdatacommons.org
    • Formats: RDF (~90 GB) and CSV tables for Microformats (!)
    • Have a look!
  • 12. AWS Costs
    • Ca. 5,500 machine-hours were required; AWS billed 1,100 € for that (~0.20 € per machine-hour).
    • Cost for other services was negligible* (* at first, we underestimated the SDB cost).
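The billed total is consistent with the spot price quoted on the AWS Setup slide; a quick check (figures from the slides, the split is approximate):

```python
# Quick check of the cost figures from the slides.
machine_hours = 5500
total_billed_eur = 1100
spot_price_eur_per_h = 0.17     # c1.xlarge spot price, from the AWS Setup slide

instance_cost = machine_hours * spot_price_eur_per_h    # ~935 EUR
other = total_billed_eur - instance_cost                # ~165 EUR (S3, SQS, SDB, ...)
print(f"instances: ~{instance_cost:.0f} EUR, other services: ~{other:.0f} EUR")
```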
  • 13. Takeaways
    • Web Data Commons now publishes the largest available set of structured data extracted from Web pages.
    • Large-scale Web analysis is now possible with Common Crawl datasets.
    • AWS is great for massive ad-hoc computing power and for reducing complexity.
    • Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
  • 14. Thank You! Questions? Want to hire me? Web resources: http://webdatacommons.org, http://hannes.muehleisen.org