Transcript of "AWS Customer Presentation: Freie Universität Berlin - Summit 2012"

Large-Scale Analysis of Web Pages − on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group
AWS Summit 2012 | Berlin

Our Starting Point
• Websites now embed structured data in HTML
• Various Vocabularies possible
  • schema.org, Open Graph protocol, ...
• Various Encoding Formats possible
  • μFormats, RDFa, Microdata

Question: How are Vocabularies and Formats used?
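The kind of markup the study targets can be illustrated with a short sketch. The HTML snippet below is a made-up example page (not data from the study), and the counter uses only Python's standard-library `html.parser` to spot schema.org Microdata attributes:

```python
# Minimal sketch: detecting schema.org Microdata in HTML with the
# Python standard library. PAGE is an illustrative example, not
# data from the crawl.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Gazebo</span>
  <span itemprop="price">999.00</span>
</div>
"""

class MicrodataCounter(HTMLParser):
    """Counts itemscope entities and itemprop properties."""
    def __init__(self):
        super().__init__()
        self.entities = 0
        self.properties = 0
        self.types = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:          # a new Microdata entity
            self.entities += 1
            self.types.append(attrs.get("itemtype", ""))
        if "itemprop" in attrs:           # a property of some entity
            self.properties += 1

counter = MicrodataCounter()
counter.feed(PAGE)
print(counter.entities, counter.properties, counter.types)
# → 1 2 ['http://schema.org/Product']
```

RDFa and μFormats express the same idea with different attributes (`property`/`typeof`, or class names), which is exactly why the talk asks how the vocabularies and formats are distributed in practice.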
Web Indices
• To answer our question, we need access to raw Web data.
• However, maintaining Web indices is insanely expensive
  • Re-Crawling, Storage, currently ~50 B pages (Google)
• Google and Bing have indices, but do not let outsiders in
Common Crawl
• Non-Profit Organization
• Runs crawler and provides HTML dumps
• Available data:
  • Index 02/12: 1.7 B URLs (21 TB)
  • Index 09/12: 2.8 B URLs (29 TB)
• Available on AWS Public Data Sets
Why AWS?
• Now that we have a web crawl, how do we run our analysis?
  • Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)
• Preliminary analysis: 1 GB / hour / CPU possible
  • 8-CPU Desktop: 8 months
  • 64-CPU Server: 1 month
  • 100 8-CPU EC2 Instances: ~3 days
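The runtime estimates on this slide follow from simple arithmetic. A quick check, assuming the figures stated above (~50 TB of input, 1 GB per hour per CPU):

```python
# Back-of-envelope check of the slide's runtime estimates,
# assuming ~50 TB of input and 1 GB / hour / CPU throughput.
DATASET_GB = 50_000          # ~50 TB
GB_PER_CPU_HOUR = 1

cpu_hours = DATASET_GB / GB_PER_CPU_HOUR     # 50,000 CPU-hours total

def runtime_days(cpus):
    """Wall-clock days if the work parallelizes across `cpus` cores."""
    return cpu_hours / cpus / 24

print(f"8-CPU desktop:   {runtime_days(8) / 30:.1f} months")   # ~8.7 months
print(f"64-CPU server:   {runtime_days(64) / 30:.1f} months")  # ~1.1 months
print(f"100 x 8-CPU EC2: {runtime_days(800):.1f} days")        # ~2.6 days
```

The numbers line up with the slide: the job is embarrassingly parallel (each crawl split is independent), so throwing 800 cores at it shrinks months into days.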
Common Crawl Dataset Size
[Chart: the Common Crawl dataset size compared with the data processed in one hour by 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances]
AWS Setup
• Data Input: Read Index Splits from S3
• Job Coordination: SQS Message Queue
• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)
• Result Output: Write to S3
• Logging: SDB
SQS
• Each input file queued in SQS
• EC2 Workers take tasks from SQS
• Workers read and write S3 buckets
[Diagram: EC2 workers pull task IDs (42, 43, ...) from SQS, read the corresponding splits from the Common Crawl (CC) S3 bucket, and write results (R42, R43, ...) to the Web Data Commons (WDC) S3 bucket]
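The coordination pattern on this slide can be sketched in a few lines. The real setup used SQS for the queue and S3 for storage; the stand-ins below (an in-process `queue.Queue` and a dict, with hypothetical split names) only illustrate the pattern so the sketch is runnable without AWS credentials:

```python
# Sketch of the slide's coordination pattern: a shared queue of input
# splits, and workers that each pull a task, process it, and store a
# result. SQS and S3 are simulated in-process; split names are made up.
import queue
import threading

tasks = queue.Queue()
results = {}                         # stands in for the output S3 bucket
results_lock = threading.Lock()

# Queue each input file, as the slide describes ("42", "43", ...).
for split in ["split-42", "split-43", "split-44", "split-45"]:
    tasks.put(split)

def worker():
    while True:
        try:
            split = tasks.get_nowait()   # SQS: ReceiveMessage
        except queue.Empty:
            return                       # queue drained: worker exits
        processed = f"R-{split}"         # stands in for parsing the split
        with results_lock:
            results[split] = processed   # S3: PutObject
        tasks.task_done()                # SQS: DeleteMessage

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

A queue-of-independent-tasks design like this needs no master node and tolerates worker loss: with real SQS, a message whose visibility timeout expires simply reappears for another worker.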
Results - Types of Data
[Chart: entity counts (log scale) for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010, by type. 2012 Microdata breakdown: Website Structure 23 %, Products/Reviews 19 %, Movies/Music/... 15 %, Geodata 8 %, People/Organizations 7 %]
• Available data largely determined by major player support
• "If Google consumes it, we will publish it"
Results - Formats
[Chart: percentage of URLs per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012]
• URLs with embedded Data: +6%
• Microdata +14% (schema.org?)
• RDFa +26% (Facebook?)
Results - Extracted Data
• Extracted data available for download at
  • www.webdatacommons.org
• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)
• Have a look!
AWS Costs
• Ca. 5500 Machine-Hours were required
  • 1100 € billed by AWS for that
• Cost for other services negligible *
• * At first, we underestimated SDB cost
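The billing figures above can be cross-checked against the quoted ~0.17 €/h spot price for c1.xlarge:

```python
# Cross-check of the slide's billing figures: ~5500 machine-hours
# at the quoted ~0.17 EUR/h spot price for c1.xlarge.
machine_hours = 5500
spot_price_eur = 0.17

compute_cost = machine_hours * spot_price_eur
print(f"Compute at spot price: ~{compute_cost:.0f} EUR")   # ~935 EUR

# The slide reports 1100 EUR billed in total; the gap is plausibly
# spot-price fluctuation plus the (underestimated) SDB logging cost.
total_billed = 1100
print(f"Gap to the bill: ~{total_billed - compute_cost:.0f} EUR")
```

So compute dominates the bill, with roughly 165 € left over for everything else, which matches the remark that SDB cost more than expected.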
Takeaways
• Web Data Commons now publishes the largest set of structured data from Web pages available
• Large-Scale Web Analysis now possible with Common Crawl datasets
• AWS great for massive ad-hoc computing power and complexity reduction
• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
Thank You!
Questions? Want to hire me?
Web Resources: http://webdatacommons.org | http://hannes.muehleisen.org
