AWS Customer Presentation: Freie Universität - Berlin Summit 2012
Presentation Transcript

  • Large-Scale Analysis of Web Pages - on a Startup Budget?
    Hannes Mühleisen, Web-Based Systems Group
    AWS Summit 2012 | Berlin
  • Our Starting Point
    • Websites now embed structured data in HTML
    • Various vocabularies possible: schema.org, Open Graph protocol, ...
    • Various encoding formats possible: μFormats, RDFa, Microdata (see the sketch below)
    • Question: How are vocabularies and formats used?
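To make the encoding formats above concrete: Microdata, for instance, annotates ordinary HTML elements with itemscope and itemprop attributes. A minimal detection sketch in Python, using only the standard library (the sample markup is invented for illustration; this is not the extraction code from the talk):

```python
from html.parser import HTMLParser

class MicrodataDetector(HTMLParser):
    """Counts itemscope/itemprop attributes, i.e. embedded Microdata."""
    def __init__(self):
        super().__init__()
        self.items = 0
        self.properties = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)          # boolean attributes map to None
        if "itemscope" in attrs:
            self.items += 1
        if "itemprop" in attrs:
            self.properties += 1

# Invented sample page snippet using the schema.org vocabulary.
html = """<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Product</span>
  <span itemprop="price">9.99</span>
</div>"""

detector = MicrodataDetector()
detector.feed(html)
print(detector.items, "items,", detector.properties, "properties")
# -> 1 items, 2 properties
```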
  • Web Indices
    • To answer our question, we need access to raw Web data
    • However, maintaining Web indices is insanely expensive: re-crawling, storage, currently ~50 B pages (Google)
    • Google and Bing have indices, but do not let outsiders in
  • Common Crawl
    • Non-profit organization
    • Runs a crawler and provides HTML dumps
    • Available data:
      • Index 02-12: 1.7 B URLs (21 TB)
      • Index 09/12: 2.8 B URLs (29 TB)
    • Available on AWS Public Data Sets
  • Why AWS?
    • Now that we have a web crawl, how do we run our analysis? Unpacking and DOM parsing on 50 TB? (CPU-heavy!)
    • Preliminary analysis: 1 GB / hour / CPU possible
      • 8-CPU desktop: 8 months
      • 64-CPU server: 1 month
      • 100 8-CPU EC2 instances: ~3 days (arithmetic checked below)
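The runtimes on this slide follow directly from the stated throughput; a quick sanity check of the arithmetic, assuming the ~50 TB mentioned above:

```python
# Back-of-envelope check of the slide's numbers:
# ~50 TB of crawl data at ~1 GB per CPU-hour.
data_gb = 50_000          # 50 TB, as stated on this slide
gb_per_cpu_hour = 1.0     # measured in the preliminary analysis

for label, cpus in [("8-CPU desktop", 8),
                    ("64-CPU server", 64),
                    ("100 x 8-CPU EC2 instances", 800)]:
    hours = data_gb / (gb_per_cpu_hour * cpus)
    print(f"{label}: {hours:.0f} h = {hours / 24:.1f} days")

# 8-CPU desktop: 6250 h = 260.4 days   (~8 months)
# 64-CPU server: 781 h = 32.6 days     (~1 month)
# 100 x 8-CPU EC2 instances: 62 h = 2.6 days  (~3 days)
```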
  • [Chart: the Common Crawl dataset size, compared against what 1 CPU, a 1000 € PC, a 5000 € server, and 17 € worth of EC2 instances can each process in one hour]
  • AWS Setup
    • Data input: read index splits from S3
    • Job coordination: SQS message queue
    • Workers: 100 EC2 spot instances (c1.xlarge, ~0.17 € / h; launch sketched below)
    • Result output: write to S3
    • Logging: SDB
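The talk does not show how the 100 workers were launched. As an illustrative sketch only, a spot request of that shape with today's boto3 (the AMI ID is a placeholder, spot bids are in USD rather than €, and c1.xlarge is a legacy instance type):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Hypothetical worker image; the talk's actual AMI is not public.
response = ec2.request_spot_instances(
    SpotPrice="0.20",            # bid slightly above the ~0.17/h spot price
    InstanceCount=100,           # the 100 workers from the slide
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-12345678",     # placeholder worker image
        "InstanceType": "c1.xlarge",   # 8 virtual cores, as in the talk
    },
)
print(len(response["SpotInstanceRequests"]), "spot requests submitted")
```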
  • [Diagram: numbered input splits (42, 43, ...) flow from the Common Crawl (CC) S3 bucket via SQS to EC2 workers, which write results (R42, R43, ...) to the Web Data Commons (WDC) S3 bucket]
    • Each input file queued in SQS
    • EC2 workers take tasks from SQS
    • Workers read and write S3 buckets (worker loop sketched below)
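A minimal sketch of the worker loop the diagram describes, in boto3 terms (queue and bucket names are invented, the 2012 implementation would have used a contemporary AWS library, and the extraction step is stubbed out):

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
s3 = boto3.client("s3", region_name="eu-west-1")

# Hypothetical queue holding one message per Common Crawl split.
queue_url = sqs.get_queue_url(QueueName="cc-splits")["QueueUrl"]

def extract_structured_data(raw: bytes) -> bytes:
    """Stand-in for the actual unpacking / DOM-parsing / extraction step."""
    return raw[:0]  # placeholder

while True:
    # Take the next input-file task from the queue.
    msgs = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)
    if "Messages" not in msgs:
        break  # queue drained, worker exits
    msg = msgs["Messages"][0]
    key = msg["Body"]  # e.g. the S3 key of one crawl split

    # Read the split from the crawl bucket, extract, write the result.
    obj = s3.get_object(Bucket="commoncrawl-input", Key=key)   # placeholder buckets
    result = extract_structured_data(obj["Body"].read())
    s3.put_object(Bucket="wdc-results", Key="R-" + key, Body=result)

    # Only delete the task once the result is safely stored.
    sqs.delete_message(QueueUrl=queue_url,
                       ReceiptHandle=msg["ReceiptHandle"])
```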
  • Results - Types of Data
    • [Chart: entity count (log scale) per type, comparing Microdata and RDFa in 02/2012 against 2009/2010]
    • 2012 Microdata breakdown: website structure 23 %, products/reviews 19 %, movies/music/... 15 %, geodata 8 %, people/organizations 7 %
    • Available data largely determined by major player support
    • “If Google consumes it, we will publish it”
  • Results - Formats
    • [Chart: percentage of URLs per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012]
    • URLs with embedded data: +6 %
    • Microdata: +14 % (schema.org?)
    • RDFa: +26 % (Facebook?)
  • Results - Extracted Data
    • Extracted data available for download at www.webdatacommons.org
    • Formats: RDF (~90 GB) and CSV tables for Microformats (!)
    • Have a look!
  • AWS Costs
    • Ca. 5,500 machine-hours were required; AWS billed 1,100 € for that
    • Cost for other services negligible*
    • * At first, we underestimated the SDB cost
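The two figures on this slide imply the effective hourly rate, which lines up with the ~0.17 €/h spot price plus service overhead:

```python
machine_hours = 5_500
billed_eur = 1_100
print(f"{billed_eur / machine_hours:.2f} EUR per machine-hour")  # -> 0.20
```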
  • Takeaways
    • Web Data Commons now publishes the largest available set of structured data extracted from Web pages
    • Large-scale Web analysis is now possible with the Common Crawl datasets
    • AWS is great for massive ad-hoc computing power and for reducing complexity
    • Choose your architecture wisely and test by experiment; for us, EMR was too expensive
  • Thank You! Questions? Want to hire me?
    Web resources: http://webdatacommons.org | http://hannes.muehleisen.org