6.
Our Starting Point
• Websites now embed structured data in HTML
• Various vocabularies possible
• schema.org, Open Graph protocol, ...
• Various encoding formats possible
• μFormats, RDFa, Microdata
Question: How are vocabularies and formats used?
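To make the vocabulary/format distinction concrete, here is a minimal sketch of one item marked up with the schema.org vocabulary in the Microdata format, and a toy extractor built on Python's stdlib `html.parser`. The page snippet and all names in it are hypothetical; real extraction (as done for Web Data Commons) handles far more of the Microdata model.

```python
from html.parser import HTMLParser

# Hypothetical page snippet: one item, schema.org vocabulary, Microdata format.
PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Alice Example</span>
  <a itemprop="url" href="http://example.org/alice">homepage</a>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs from Microdata markup."""

    def __init__(self):
        super().__init__()
        self._pending_prop = None
        self.items = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            if tag == "a" and "href" in attrs:
                # Link-like tags carry their value in href, not in text content.
                self.items[attrs["itemprop"]] = attrs["href"]
                self._pending_prop = None
            else:
                self._pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._pending_prop and data.strip():
            self.items[self._pending_prop] = data.strip()
            self._pending_prop = None

parser = ItempropCollector()
parser.feed(PAGE)
print(parser.items)  # {'name': 'Alice Example', 'url': 'http://example.org/alice'}
```

The same item could instead be expressed in RDFa (`property=`/`typeof=` attributes) or a microformat (`class=` conventions), which is exactly why the analysis has to detect several formats per page.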
9.
Web Indices
• To answer our question, we need access to raw Web data.
• However, maintaining Web indices is insanely expensive
• Re-crawling, storage, currently ~50 B pages (Google)
• Google and Bing have indices, but do not let outsiders in
13.
Common Crawl
• Non-profit organization
• Runs a crawler and provides HTML dumps
• Available data:
• Index 02/12: 1.7 B URLs (21 TB)
• Index 09/12: 2.8 B URLs (29 TB)
• Available on AWS Public Data Sets
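For a sense of scale, the two index figures above imply an average compressed size of roughly 10–12 KB per page. A quick back-of-the-envelope check (my arithmetic, using decimal units, not a figure from the slides):

```python
# Index sizes as given on the slide: name -> (URLs, terabytes).
indices = {"02/12": (1.7e9, 21), "09/12": (2.8e9, 29)}

for name, (urls, tb) in indices.items():
    kb_per_page = tb * 1e9 / urls  # 1 TB = 1e9 KB (decimal units)
    print(f"Index {name}: ~{kb_per_page:.1f} KB per page (compressed)")
```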
15.
Why AWS?
• Now that we have a Web crawl, how do we run our analysis?
• Unpacking and DOM parsing on 50 TB? (CPU-heavy!)
• Preliminary analysis: 1 GB / hour / CPU possible
• 8-CPU desktop: 8 months
• 64-CPU server: 1 month
• 100 8-CPU EC2 instances: ~3 days
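The wall-clock estimates above follow directly from the measured throughput. A sketch of the calculation, assuming ~50 TB of input, the slide's 1 GB/hour/CPU rate, and perfect parallelization (an idealization; real runs lose time to I/O and coordination):

```python
TOTAL_GB = 50 * 1024     # ~50 TB of crawl data
GB_PER_CPU_HOUR = 1.0    # throughput measured in the preliminary analysis

def wall_clock_days(cpus: int) -> float:
    """Days needed if the workload parallelizes perfectly across CPUs."""
    cpu_hours = TOTAL_GB / GB_PER_CPU_HOUR
    return cpu_hours / cpus / 24

print(f"8-CPU desktop:   {wall_clock_days(8) / 30:.1f} months")   # ~8.9 months
print(f"64-CPU server:   {wall_clock_days(64) / 30:.1f} months")  # ~1.1 months
print(f"100 x 8-CPU EC2: {wall_clock_days(800):.1f} days")        # ~2.7 days
```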
37.
Results - Extracted Data
• Extracted data available for download at www.webdatacommons.org
• Formats: RDF (~90 GB) and CSV tables for Microformats (!)
• Have a look!
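The RDF download is, to my knowledge, published as N-Quads: one statement per line, with a fourth term recording the page the triple came from. A minimal line parser as a sketch, assuming well-formed input with IRI and plain-literal terms only (real N-Quads also allow blank nodes, escapes, and typed literals, so a proper RDF library is the safer choice):

```python
import re

# Hypothetical WDC-style N-Quads line: subject, predicate, object, graph (= source page).
LINE = ('<http://example.org/alice> '
        '<http://schema.org/name> '
        '"Alice Example" '
        '<http://example.org/page.html> .')

# Crude tokenizer: IRIs in <...>, literals in "..." (ignores escapes and blank nodes).
TOKEN = re.compile(r'<[^>]*>|"[^"]*"')

def parse_quad(line: str):
    """Split one simple N-Quads line into (subject, predicate, object, graph)."""
    s, p, o, g = TOKEN.findall(line)
    return s, p, o, g

print(parse_quad(LINE))
```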
40.
AWS Costs
• Approx. 5,500 machine-hours were required
• 1,100 € billed by AWS for that
• Cost for other services negligible *
• * At first, we underestimated SDB (SimpleDB) cost
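Those two numbers pin down the effective hourly rate. A one-line check (my arithmetic, not a figure from the slides):

```python
machine_hours = 5500  # approximate total from the slide
total_eur = 1100      # AWS bill for the compute

print(f"~{total_eur / machine_hours:.2f} EUR per machine-hour")  # ~0.20 EUR
```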
44.
Takeaways
• Web Data Commons now publishes the largest available set of structured data extracted from Web pages
• Large-scale Web analysis is now possible with Common Crawl datasets
• AWS is great for massive ad-hoc computing power and complexity reduction
• Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
45.
Thank You!
Questions?
Want to hire me?
Web Resources: http://webdatacommons.org
http://hannes.muehleisen.org