AWS Customer Presentation: Freie Universität - Berlin Summit 2012

Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group

AWS Summit 2012 | Berlin

Our Starting Point

• Websites now embed structured data in HTML

• Various Vocabularies possible

• schema.org, Open Graph protocol, ...

• Various Encoding Formats possible

• μFormats, RDFa, Microdata

Question: How are Vocabularies and Formats used?
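
To make the three encoding formats concrete, here is a rough sketch of how a page could be checked for each of them. It is not the extraction code used in the project; the attribute and class-name lists are simplified heuristics chosen for illustration, and `FormatSniffer` and `sniff_formats` are hypothetical names.

```python
from html.parser import HTMLParser

# Simplified markers (assumptions for illustration, not a complete detector).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "about", "vocab"}
MICROFORMAT_CLASSES = {
    "vcard": "hcard",
    "vevent": "hcalendar",
    "hreview": "hreview",
    "geo": "geo",
}


class FormatSniffer(HTMLParser):
    """Records which embedding formats appear to be present in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("Microdata")
        if names & RDFA_ATTRS:
            self.found.add("RDFa")
        for name, value in attrs:
            # Microformats are signalled by well-known class names.
            if name == "class" and value:
                for cls in value.split():
                    if cls in MICROFORMAT_CLASSES:
                        self.found.add(MICROFORMAT_CLASSES[cls])


def sniff_formats(html):
    """Return the set of format names detected in an HTML string."""
    sniffer = FormatSniffer()
    sniffer.feed(html)
    return sniffer.found


product = '<div itemscope itemtype="http://schema.org/Product"><span itemprop="name">X</span></div>'
open_graph = '<meta property="og:title" content="T">'
print(sniff_formats(product), sniff_formats(open_graph))
```

Note that Open Graph markup uses the RDFa `property` attribute, so this sketch counts it as RDFa.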

Web Indices

• To answer our question, we need access to raw Web data.

• However, maintaining Web indices is insanely expensive

• Re-Crawling, Storage, currently ~50 B pages (Google)

• Google and Bing have indices, but do not let outsiders in

Common Crawl

• Non-Profit Organization

• Runs crawler and provides HTML dumps

• Available data:

• Index 02-12: 1.7 B URLs (21 TB)

• Index 09/12: 2.8 B URLs (29 TB)

• Available on AWS Public Data Sets

Why AWS?

• Now that we have a web crawl, how do we run our analysis?

• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

• Preliminary analysis: 1 GB / hour / CPU possible

• 8-CPU Desktop: 8 months

• 64-CPU Server: 1 month

• 100 8-CPU EC2-Instances: ~ 3 days
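
The runtime estimates for the three hardware options follow from the measured throughput of 1 GB per hour per CPU. A small sketch of the arithmetic, assuming 1 TB = 1000 GB and perfect scaling across CPUs (both are simplifications, and `wall_clock_days` is a name chosen here):

```python
DATA_GB = 50_000        # ~50 TB of crawl data, taking 1 TB = 1000 GB
GB_PER_CPU_HOUR = 1.0   # preliminary measurement quoted on the slide


def wall_clock_days(total_cpus):
    """Days needed if the work splits evenly over total_cpus CPUs."""
    cpu_hours = DATA_GB / GB_PER_CPU_HOUR
    return cpu_hours / total_cpus / 24


for label, cpus in [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 x 8-CPU EC2 instances", 800),
]:
    print(f"{label}: {wall_clock_days(cpus):.1f} days")
```

This gives roughly 260 days, 33 days, and 2.6 days, in line with the rounded figures of 8 months, 1 month, and about 3 days.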

[Figure: Common Crawl dataset size compared with: 1 CPU, 1 h; 1000 € PC, 1 h; 5000 € Server, 1 h; 17 € EC2 Instances, 1 h]

AWS Setup

• Data Input: Read Index Splits from S3

• Job Coordination: SQS Message Queue

• Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

• Result Output: Write to S3

• Logging: SimpleDB (SDB)

• Each input file queued in SQS

• EC2 Workers take tasks from SQS

• Workers read and write S3 buckets

[Diagram: SQS queue holding tasks (42, ...); EC2 workers; S3 buckets CC (inputs 42, 43, ...) and WDC (results R42, R43, ...)]
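
The queue-and-worker pattern described here can be sketched without any AWS dependency. In the simulation below, a `queue.Queue` stands in for SQS and plain dictionaries stand in for the input (CC) and output (WDC) S3 buckets; `extract` and `run_worker` are hypothetical names, and the real system would use the AWS APIs instead of these stand-ins.

```python
import queue


def extract(raw):
    """Hypothetical placeholder for the real unpack-and-parse step."""
    return raw.upper()


def run_worker(tasks, input_bucket, output_bucket):
    """Drain the queue: fetch each input, process it, store the result."""
    processed = 0
    while True:
        try:
            key = tasks.get_nowait()  # stand-in for receiving an SQS message
        except queue.Empty:
            return processed
        # Stand-in for reading the input from S3 and writing the result back.
        output_bucket["R" + key] = extract(input_bucket[key])
        tasks.task_done()  # stand-in for deleting the SQS message
        processed += 1


# One task per input file, as on the slide.
input_bucket = {"42": "page a", "43": "page b"}
output_bucket = {}
tasks = queue.Queue()
for key in input_bucket:
    tasks.put(key)

done = run_worker(tasks, input_bucket, output_bucket)
print(done, output_bucket)
```

Because workers only share the queue and the buckets, more workers can be added without further coordination between them.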

Results - Types of Data

[Figure: Entity Count (log) by Type, for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]

2012 Microdata Breakdown

• Website Structure 23 %

• Products, Reviews 19 %

• Movies, Music, ... 15 %

• Geodata 8 %

• People, Organizations 7 %

• Available data largely determined by major player support

• “If Google consumes it, we will publish it”

Results - Formats

[Figure: Percentage of URLs by Format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02-2012]

• URLs with embedded Data: +6%

• Microdata +14% (schema.org?)

• RDFa +26% (Facebook?)

Results - Extracted Data

• Extracted data available for download at

• www.webdatacommons.org

• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

• Have a look!

AWS Costs

• Approx. 5500 machine-hours were required

• 1100 € billed by AWS for that

• Cost for other services negligible *

• * At first, we underestimated SimpleDB (SDB) cost
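
As a quick sanity check on these numbers, the sketch below relates the billed amount to the machine-hours and to the approximate spot price of 0.17 € per hour from the setup slide. The 8 CPUs per machine come from the "100 8-CPU EC2-Instances" estimate; the slides do not break down the difference between the two cost figures.

```python
MACHINE_HOURS = 5500
BILLED_EUR = 1100
SPOT_PRICE_EUR = 0.17   # approximate c1.xlarge spot price per hour
CPUS_PER_MACHINE = 8

effective_rate = BILLED_EUR / MACHINE_HOURS      # EUR per machine-hour actually paid
spot_only_cost = MACHINE_HOURS * SPOT_PRICE_EUR  # EUR if every hour cost 0.17
cpu_hours = MACHINE_HOURS * CPUS_PER_MACHINE     # total CPU-hours consumed

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at spot price only: {spot_only_cost:.0f} EUR")
print(f"CPU-hours used: {cpu_hours}")
```

The 44,000 CPU-hours are in the same range as the 50,000 CPU-hours implied by the preliminary estimate of 1 GB per hour per CPU on 50 TB.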

Takeaways

• Web Data Commons now publishes the largest set of
structured data from Web pages available

• Large-Scale Web Analysis now possible with Common Crawl
datasets

• AWS great for massive ad-hoc computing power and
complexity reduction

• Choose your architecture wisely and test by experiment; for us,
EMR was too expensive.

Thank You!
Questions?
Want to hire me?

Web Resources: http://webdatacommons.org
http://hannes.muehleisen.org
