Our Starting Point
• Websites now embed structured data in HTML
• Various Vocabularies possible
• schema.org, Open Graph protocol, ...
• Various Encoding Formats possible
• μFormats, RDFa, Microdata
Question: How are Vocabularies and Formats used?
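To make the bullets above concrete, here is a sketch of a schema.org item encoded as Microdata, plus a minimal vocabulary/property counter built on Python's standard-library HTML parser. The page snippet and class name are illustrations only; real extraction pipelines use full parser frameworks.

```python
# Sketch: schema.org Microdata embedded in HTML, and a minimal pass
# that records which vocabularies (itemtype) and properties (itemprop)
# a page uses. Hypothetical example page, not from the actual corpus.
from html.parser import HTMLParser

PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Hannes Muehleisen</span>
  <a itemprop="url" href="http://hannes.muehleisen.org">home</a>
</div>
"""

class MicrodataSniffer(HTMLParser):
    """Collects itemtype and itemprop occurrences from one page."""
    def __init__(self):
        super().__init__()
        self.types = []
        self.props = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            self.props.append(attrs["itemprop"])

sniffer = MicrodataSniffer()
sniffer.feed(PAGE)
print(sniffer.types)  # which vocabularies the page uses
print(sniffer.props)  # which properties appear
```

Running the same pass over billions of pages is exactly the usage question the study set out to answer.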
Web Indices
• To answer our question, we need access to raw Web data.
• However, maintaining Web indices is insanely expensive
• Re-crawling, storage, currently ~50 B pages (Google)
• Google and Bing have indices, but do not let outsiders in
Common Crawl
• Non-profit organization
• Runs a crawler and provides HTML dumps
• Available data:
• Index 02/12: 1.7 B URLs (21 TB)
• Index 09/12: 2.8 B URLs (29 TB)
• Available on AWS Public Data Sets
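Since the dumps ship as multi-terabyte sets of gzipped files, any analysis has to stream rather than load files whole. A minimal sketch (the file path is hypothetical; real Common Crawl file names and record formats differ):

```python
# Minimal streaming read of a gzipped dump file: memory stays flat no
# matter how large the archive is. Path below is hypothetical.
import gzip

def iter_lines(path):
    """Yield decoded lines from a gzipped text file, one at a time."""
    with gzip.open(path, "rt", errors="replace") as f:
        yield from f

# usage (hypothetical file name):
# for line in iter_lines("crawl-segment-0001.gz"):
#     process(line)
```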
Why AWS?
• Now that we have a web crawl, how do we run our analysis?
• Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)
• Preliminary analysis: 1 GB / hour / CPU possible
• 8-CPU Desktop: 8 months
• 64-CPU Server: 1 month
• 100 8-CPU EC2-Instances: ~ 3 days
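The runtime estimates above follow from simple arithmetic, assuming the measured throughput of 1 GB per hour per CPU holds at scale (real throughput varies with page size and parser overhead):

```python
# Back-of-envelope check of the slide's numbers: ~50 TB of crawl data
# at 1 GB / hour / CPU. Assumption: throughput scales linearly with
# CPU count, which parallel per-page processing roughly allows.
CORPUS_GB = 50_000
GB_PER_CPU_HOUR = 1.0

def wall_clock_days(cpus: int) -> float:
    """Days needed to process the corpus with `cpus` parallel CPUs."""
    hours = CORPUS_GB / (GB_PER_CPU_HOUR * cpus)
    return hours / 24

print(f"8-CPU desktop: {wall_clock_days(8):.0f} days")   # ~260 days (~8 months)
print(f"64-CPU server: {wall_clock_days(64):.0f} days")  # ~33 days (~1 month)
print(f"800 EC2 CPUs:  {wall_clock_days(800):.1f} days") # ~2.6 days
```

100 8-CPU EC2 instances give 800 CPUs, which is where the "~3 days" figure comes from.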
Results - Extracted Data
• Extracted data available for download at
• www.webdatacommons.org
• Formats: RDF (~90 GB) and CSV Tables for Microformats (!)
• Have a look!
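The RDF dumps use a line-based quad notation, where a fourth element records the page a triple was extracted from. A naive reader, assuming terms without embedded spaces (an assumption; real dumps need a proper RDF parser for literals and escape sequences):

```python
# Naive reader for one line of quad-style RDF: subject, predicate,
# object, graph (= source page). Example quad is made up.
def parse_quad(line: str):
    """Split one quad line into its four terms (no escape handling)."""
    s, p, o, g = line.rstrip(" .\n").split(" ", 3)
    return s, p, o, g

quad = parse_quad(
    '<http://example.org/#item> <http://schema.org/name> '
    '"Example" <http://example.org/> .'
)
print(quad[3])  # the source page the triple came from
```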
AWS Costs
• Ca. 5,500 machine-hours were required
• €1,100 billed by AWS for that (≈ €0.20 per machine-hour)
• Cost for other services negligible *
• * At first, we underestimated SDB cost
Takeaways
• Web Data Commons now publishes the largest set of
structured data from Web pages available
• Large-Scale Web Analysis now possible with Common Crawl
datasets
• AWS great for massive ad-hoc computing power and
complexity reduction
• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
Thank You!
Questions?
Want to hire me?
Web Resources: http://webdatacommons.org
http://hannes.muehleisen.org