AWS Customer Presentation: Freie Universität - Berlin Summit 2012

  1. 1. Large-Scale Analysis of Web Pages − on a Startup Budget? Hannes Mühleisen, Web-Based Systems Group AWS Summit 2012 | Berlin
  6. 6. Our Starting Point • Websites now embed structured data in HTML • Various Vocabularies possible: schema.org, Open Graph protocol, ... • Various Encoding Formats possible: Microformats (μFormats), RDFa, Microdata • Question: How are Vocabularies and Formats used?
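As a rough illustration of what "embedded structured data" looks like in practice, the sketch below flags which encoding formats appear in an HTML page by looking for their characteristic attributes and class names. The class `FormatDetector`, the function `detect_formats`, the heuristics, and the sample markup are all assumptions made for this example; it is not the extraction pipeline used in the talk.

```python
from html.parser import HTMLParser

# Root class names of a few common Microformats (heuristic, not exhaustive).
MICROFORMAT_CLASSES = {
    "vcard": "hcard",
    "vevent": "hcalendar",
    "hreview": "hreview",
    "geo": "geo",
}


class FormatDetector(HTMLParser):
    """Collects the names of embedded-data formats seen in a document."""

    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # valueless attributes such as itemscope map to None
        # Microdata marks items with itemscope / itemprop attributes.
        if "itemscope" in attr or "itemprop" in attr:
            self.formats.add("microdata")
        # RDFa uses attributes such as typeof and property
        # (Open Graph <meta property="og:..."> tags fall into this group).
        if "typeof" in attr or "property" in attr:
            self.formats.add("rdfa")
        # Microformats are signalled by well-known class names.
        for cls in (attr.get("class") or "").split():
            if cls in MICROFORMAT_CLASSES:
                self.formats.add(MICROFORMAT_CLASSES[cls])


def detect_formats(html):
    """Return a sorted list of format names detected in the HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    detector.close()
    return sorted(detector.formats)


SAMPLE = """
<html><head><meta property="og:title" content="Example"></head>
<body>
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Widget</span>
</div>
<div class="vcard"><span class="fn">Jane Doe</span></div>
</body></html>
"""

print(detect_formats(SAMPLE))  # ['hcard', 'microdata', 'rdfa']
```

Counting such hits per format over a whole crawl is one simple way to approach the question on this slide, although a real extractor would also parse the vocabulary terms themselves.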
  9. 9. Web Indices • To answer our question, we need access to raw Web data. • However, maintaining Web indices is insanely expensive: re-crawling and storage, currently ~50 B pages (Google) • Google and Bing have indices, but do not let outsiders in
  13. 13. Common Crawl • Non-Profit Organization • Runs crawler and provides HTML dumps • Available data: Index 02-12: 1.7 B URLs (21 TB); Index 09/12: 2.8 B URLs (29 TB) • Available on AWS Public Data Sets
  15. 15. Why AWS? • Now that we have a web crawl, how do we run our analysis? • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!) • Preliminary analysis: 1 GB / hour / CPU possible • 8-CPU Desktop: 8 months • 64-CPU Server: 1 month • 100 8-CPU EC2-Instances: ~3 days
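The runtime estimates on this slide follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in the slides: 1 TB is taken as 1000 GB, and throughput scales linearly with the number of CPUs. The function name `processing_days` is hypothetical.

```python
DATASET_GB = 50 * 1000            # ~50 TB, assuming 1 TB = 1000 GB
THROUGHPUT_GB_PER_CPU_HOUR = 1.0  # preliminary measurement quoted on the slide


def processing_days(total_cpus):
    """Days needed to process the dataset, assuming perfect linear scaling."""
    hours = DATASET_GB / (THROUGHPUT_GB_PER_CPU_HOUR * total_cpus)
    return hours / 24


scenarios = [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 EC2 instances with 8 CPUs each", 100 * 8),
]

for label, cpus in scenarios:
    print(f"{label}: {processing_days(cpus):.1f} days")
```

This gives roughly 260 days, 33 days, and 2.6 days, which matches the "8 months", "1 month", and "~3 days" figures above.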
  20. 20. [Figure: Common Crawl Dataset Size compared with: 1 CPU, 1 h; 1000 € PC, 1 h; 5000 € Server, 1 h; 17 € EC2 Instances, 1 h]
  25. 25. AWS Setup • Data Input: Read Index Splits from S3 • Job Coordination: SQS Message Queue • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h) • Result Output: Write to S3 • Logging: SimpleDB (SDB)
  28. 28. SQS • Each input file queued in SQS • EC2 Workers take tasks from SQS • Workers read and write S3 buckets • [Diagram: SQS queue holding tasks (42, ...); EC2 workers read inputs (42, 43, ...) from the CC bucket and write results (R42, R43, ...) to the WDC bucket in S3]
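The queue-and-worker pattern on this slide can be sketched without any AWS dependency. In the simulation below, a `queue.Queue` stands in for SQS, plain dictionaries stand in for the input (CC) and output (WDC) S3 buckets, and threads stand in for EC2 workers. All names, including the placeholder `extract` function, are assumptions made for illustration and not the code used in the project.

```python
import queue
import threading


def extract(raw_html):
    """Placeholder for the CPU-heavy step (unpacking and DOM parsing)."""
    return raw_html.upper()


def worker(tasks, input_bucket, output_bucket, lock):
    """Take keys from the queue until it is empty, process, store results."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return  # no work left, so the worker shuts down
        result = extract(input_bucket[key])
        with lock:
            output_bucket["R" + key] = result  # e.g. input 42 -> result R42
        tasks.task_done()


def run(input_bucket, n_workers=4):
    """Queue one task per input file and process them with n_workers threads."""
    tasks = queue.Queue()
    for key in input_bucket:
        tasks.put(key)

    output_bucket = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=worker, args=(tasks, input_bucket, output_bucket, lock)
        )
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output_bucket


cc_bucket = {"42": "<html>a</html>", "43": "<html>b</html>"}
wdc_bucket = run(cc_bucket)
print(sorted(wdc_bucket))  # ['R42', 'R43']
```

One difference worth keeping in mind: a real SQS message is hidden while a worker holds it and only disappears once the worker explicitly deletes it, so a task reappears if its worker dies. The in-memory queue here does not model that behaviour.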
  31. 31. Results - Types of Data • [Figure: Entity Count (log) by Type for Microdata and RDFa, 2009/2010 vs. 02/2012] • 2012 Microdata Breakdown: Website Structure 23 %; Products, Reviews 19 %; Movies, Music, ... 15 %; Geodata 8 %; People, Organizations 7 % • Available data largely determined by major player support • “If Google consumes it, we will publish it”
  34. 34. Results - Formats • [Figure: Percentage of URLs by Format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012] • URLs with embedded Data: +6% • Microdata +14% (schema.org?) • RDFa +26% (Facebook?)
  37. 37. Results - Extracted Data • Extracted data available for download at www.webdatacommons.org • Formats: RDF (~90 GB) and CSV Tables for Microformats (!) • Have a look!
  40. 40. AWS Costs • Ca. 5500 Machine-Hours were required • 1100 € billed by AWS for that • Cost for other services negligible (*) • (*) At first, we underestimated SDB cost
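As a quick sanity check on these figures, the sketch below derives the effective price per machine-hour and compares it with the ~0.17 € / h spot price quoted on the AWS Setup slide. The function names are hypothetical, and the slides give no breakdown of the bill, so the gap between the two numbers is left unexplained here.

```python
MACHINE_HOURS = 5500              # "ca. 5500" from the slide
BILLED_EUR = 1100                 # total billed by AWS
QUOTED_SPOT_EUR_PER_HOUR = 0.17   # approximate c1.xlarge spot price quoted earlier


def effective_rate(billed_eur, machine_hours):
    """Average cost per machine-hour actually paid."""
    return billed_eur / machine_hours


def estimate_at_rate(machine_hours, eur_per_hour):
    """Expected bill if every hour had cost exactly the given rate."""
    return machine_hours * eur_per_hour


print(f"Effective rate: {effective_rate(BILLED_EUR, MACHINE_HOURS):.2f} EUR/h")
print(
    "Estimate at quoted spot price: "
    f"{estimate_at_rate(MACHINE_HOURS, QUOTED_SPOT_EUR_PER_HOUR):.0f} EUR"
)
```

The effective rate comes out at 0.20 € per machine-hour, and the estimate at the quoted spot price is about 935 €, so the bill is in the same range as the quoted price would suggest.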
  44. 44. Takeaways • Web Data Commons now publishes the largest set of structured data from Web pages available • Large-Scale Web Analysis now possible with Common Crawl datasets • AWS great for massive ad-hoc computing power and complexity reduction • Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
  45. 45. Thank You! Questions? Want to hire me? Web Resources: http://webdatacommons.org http://hannes.muehleisen.org
