Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group




AWS Summit 2012 | Berlin
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata


Question: How are Vocabularies and Formats used?
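
As a rough illustration of what "embedded structured data" means, the sketch below scans a page for Microdata attributes using only the Python standard library. The sample markup, the class name `MicrodataScanner`, and the helper `scan` are hypothetical; this is not the extraction code used for the talk, and it ignores nesting, RDFa, and Microformats entirely.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the markup is illustrative, not taken from the talk.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <span itemprop="brand">Example Corp</span>
</div>
"""


class MicrodataScanner(HTMLParser):
    """Collects itemtype URLs (vocabulary terms) and itemprop names in a page."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.item_props = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes map to None.
        attr_map = dict(attrs)
        if attr_map.get("itemtype"):
            self.item_types.append(attr_map["itemtype"])
        if attr_map.get("itemprop"):
            self.item_props.append(attr_map["itemprop"])


def scan(html):
    """Return (item types, item properties) found in the given HTML string."""
    scanner = MicrodataScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.item_types, scanner.item_props


types, props = scan(SAMPLE_HTML)
print(types)
print(props)
```

Counting such type and property occurrences over many pages is one simple way to approach the question of how vocabularies and formats are used.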
Web Indices

•   To answer our question, we need access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)

•   Google and Bing have indices, but do not let outsiders in



Common Crawl
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02/12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)

•   Available on AWS Public Data Sets

Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

•   Preliminary analysis: 1 GB / hour / CPU possible

    •   8-CPU Desktop: 8 months

    •   64-CPU Server: 1 month

    •   100 8-CPU EC2-Instances: ~ 3 days
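
The estimates above follow from simple arithmetic, which can be sketched as below. The 50 TB and 1 GB / hour / CPU figures come from the slide; the conversion 1 TB = 1000 GB and the function name `wall_clock_hours` are assumptions made for this example.

```python
DATASET_TB = 50          # total crawl size, from the slide
GB_PER_CPU_HOUR = 1.0    # measured throughput, from the slide


def wall_clock_hours(total_cpus):
    """Hours needed if the work is spread evenly over total_cpus CPUs."""
    cpu_hours = DATASET_TB * 1000 / GB_PER_CPU_HOUR  # assumes 1 TB = 1000 GB
    return cpu_hours / total_cpus


setups = [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 x 8-CPU EC2 instances", 100 * 8),
]

for label, cpus in setups:
    hours = wall_clock_hours(cpus)
    print(f"{label}: {hours:.0f} h = {hours / 24:.1f} days")
```

This gives roughly 260 days, 33 days, and 2.6 days, in line with the 8 months, 1 month, and ~3 days quoted on the slide. It assumes perfect parallelism and no I/O bottleneck.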

Common Crawl Dataset Size
[Figure: Common Crawl dataset size in relation to what 1 CPU, a 1000 € PC, a 5000 € server, and 17 € of EC2 instances handle in 1 h]
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3

•   Logging: SimpleDB (SDB)


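•   Each input file queued in SQS

•   EC2 Workers take tasks from SQS

•   Workers read and write S3 buckets

[Diagram: the SQS queue holds task 42; EC2 workers read input files (42, 43, ...) from the CC bucket on S3 and write result files (R42, R43, ...) to the WDC bucket on S3]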
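
The queue pattern described above can be sketched locally without any AWS account. In this minimal example, `queue.Queue` stands in for SQS, plain dictionaries stand in for the CC and WDC buckets on S3, and threads stand in for EC2 workers. The names `extract`, `worker`, and `run` are hypothetical, and `extract` is only a placeholder for the real unpacking and parsing step. A real SQS-based worker would also have to cope with visibility timeouts and messages being delivered more than once, which this sketch does not model.

```python
import queue
import threading


def extract(raw_bytes):
    """Placeholder for the CPU-heavy unpacking and parsing of one input file."""
    return raw_bytes.upper()


def worker(tasks, input_bucket, output_bucket):
    """Take file keys from the queue until it is empty; write one result per key."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to do, the worker shuts down
        output_bucket["R" + key] = extract(input_bucket[key])
        tasks.task_done()


def run(input_bucket, num_workers=4):
    """Queue every input file, start the workers, and return the result bucket."""
    tasks = queue.Queue()
    for key in input_bucket:
        tasks.put(key)

    output_bucket = {}
    threads = [
        threading.Thread(target=worker, args=(tasks, input_bucket, output_bucket))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output_bucket


cc = {"42": b"page a", "43": b"page b"}  # stand-in for the input bucket
wdc = run(cc)                            # stand-in for the result bucket
print(sorted(wdc))
```

Because workers only share the queue and the buckets, adding more workers needs no further coordination, which is what makes the setup easy to scale to 100 instances.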
Results - Types of Data
[Figure: entity count (log scale, 5e+03 to 5e+06) per type (0 to 200) for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]

2012 Microdata Breakdown
    Website Structure        23 %
    Products, Reviews        19 %
    Movies, Music, ...       15 %
    Geodata                   8 %
    People, Organizations     7 %

•   Available data largely determined by major player support

•   “If Google consumes it, we will publish it”
Results - Formats
•   URLs with embedded Data: +6%

•   Microdata +14% (schema.org?)

•   RDFa +26% (Facebook?)

[Figure: percentage of URLs (0 to 4) per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012]
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

•   Have a look!



AWS Costs

•   Approx. 5500 machine-hours were required

    •   1100 € billed by AWS for that

•   Cost for other services negligible *

•   * At first, we underestimated SDB cost
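
A quick back-of-the-envelope check of these figures, as a sketch: the machine-hours and the bill come from this slide, the ~0.17 € / h spot price and the 100 instances from the setup slide. Treating all 100 instances as running concurrently for the whole job is an assumption, and the variable names are made up for this example.

```python
MACHINE_HOURS = 5500     # approximate, from the slide
BILLED_EUR = 1100        # from the slide
SPOT_PRICE_EUR = 0.17    # approximate c1.xlarge spot price per hour
NUM_INSTANCES = 100      # assumed to run concurrently for the whole job

# Average price actually paid per machine-hour.
effective_rate = BILLED_EUR / MACHINE_HOURS

# What the same hours would cost at exactly the quoted spot price.
spot_only_cost = MACHINE_HOURS * SPOT_PRICE_EUR

# Elapsed time if the hours are spread evenly over all instances.
wall_clock_hours = MACHINE_HOURS / NUM_INSTANCES

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at quoted spot price: {spot_only_cost:.0f} EUR")
print(f"wall-clock time: {wall_clock_hours:.0f} h")
```

The effective rate works out to 0.20 € per machine-hour, slightly above the quoted spot price; the slides do not say what accounts for the difference.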



Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction

•   Choose your architecture wisely and test by experiment;
    for us, EMR was too expensive.

Thank You!
              Questions?
            Want to hire me?


Web Resources: http://webdatacommons.org
     http://hannes.muehleisen.org
