Large-Scale Analysis of Web Pages
− on a Startup Budget?
Hannes Mühleisen, Web-Based Systems Group




AWS Summit 2012 | Berlin
Our Starting Point
•   Websites now embed structured data in HTML

•   Various Vocabularies possible

    •   schema.org, Open Graph protocol, ...

•   Various Encoding Formats possible

    •   μFormats, RDFa, Microdata


Question: How are Vocabularies and Formats used?
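Telling the encoding formats apart comes down to looking for their characteristic attributes and class names in the markup. The following is a minimal sketch of that idea, not the extraction code used in the project: the attribute sets, the class-name table, and the names `FormatDetector` and `detect_formats` are assumptions made for illustration, and the heuristic is deliberately crude (any `property` attribute counts as RDFa, for example).

```python
from html.parser import HTMLParser

# Characteristic markers per encoding format (illustrative, not exhaustive).
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"property", "typeof", "about", "vocab"}
# Microformat root class name -> format name as used on the slides.
MICROFORMAT_CLASSES = {"vcard": "hcard", "vevent": "hcalendar", "hreview": "hreview"}


class FormatDetector(HTMLParser):
    """Collects the names of the embedded-data formats seen in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if names & MICRODATA_ATTRS:
            self.found.add("Microdata")
        if names & RDFA_ATTRS:
            self.found.add("RDFa")
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls in MICROFORMAT_CLASSES:
                        self.found.add(MICROFORMAT_CLASSES[cls])


def detect_formats(html):
    """Return the set of format names detected in an HTML string."""
    detector = FormatDetector()
    detector.feed(html)
    return detector.found


sample = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Widget</span>
</div>
<meta property="og:title" content="Widget" />
<div class="vcard"><span class="fn">Jane Doe</span></div>
"""
print(sorted(detect_formats(sample)))
```

The sample page mixes a schema.org Microdata item, an Open Graph property expressed as RDFa, and an hCard, so all three formats are reported for it.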
Web Indices

•   To answer our question, we need access to raw Web data.

•   However, maintaining Web indices is insanely expensive

    •   Re-Crawling, Storage, currently ~50 B pages (Google)

•   Google and Bing have indices, but do not let outsiders in



Common Crawl
•   Non-Profit Organization

•   Runs crawler and provides HTML dumps

•   Available data:

    •   Index 02/12: 1.7 B URLs (21 TB)

    •   Index 09/12: 2.8 B URLs (29 TB)

•   Available on AWS Public Data Sets

Why AWS?
•   Now that we have a web crawl, how do we run our analysis?

    •   Unpacking and DOM-Parsing on 50 TB? (CPU-heavy!)

•   Preliminary analysis: 1 GB / hour / CPU possible

    •   8-CPU Desktop: 8 months

    •   64-CPU Server: 1 month

    •   100 8-CPU EC2-Instances: ~ 3 days
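
The estimates above follow from simple division. The sketch below reproduces them under two assumptions that are not stated on the slide: 50 TB is taken as 50,000 GB, and throughput is assumed to scale linearly with the number of CPUs. The function name `wall_clock_days` is made up for this example.

```python
def wall_clock_days(dataset_gb, cpus, gb_per_cpu_hour=1.0):
    """Days needed to process dataset_gb with the given number of CPUs,
    assuming perfectly linear scaling (an idealization)."""
    hours = dataset_gb / (cpus * gb_per_cpu_hour)
    return hours / 24


DATASET_GB = 50_000  # 50 TB, decimal units (assumption)

setups = [
    ("8-CPU desktop", 8),
    ("64-CPU server", 64),
    ("100 x 8-CPU EC2 instances", 100 * 8),
]
for label, cpus in setups:
    print(f"{label}: {wall_clock_days(DATASET_GB, cpus):.1f} days")
```

With these inputs the desktop needs roughly 260 days (about 8 months), the server about a month, and the EC2 fleet under 3 days, in line with the figures on the slide.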

Common Crawl Dataset Size
[Figure: Common Crawl dataset size, with labels "1 CPU, 1 h", "1000 € PC, 1 h", "5000 € Server, 1 h", and "17 € EC2 Instances, 1 h"]
AWS Setup
•   Data Input: Read Index Splits from S3

•   Job Coordination: SQS Message Queue

•   Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)

•   Result Output: Write to S3

•   Logging: SimpleDB (SDB)


•   Each input file queued in SQS

•   EC2 Workers take tasks from SQS

•   Workers read and write S3 buckets

[Diagram: SQS holds task 42; EC2 workers read input files 42, 43, ... from the CC bucket on S3 and write results R42, R43, ... to the WDC bucket on S3]
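
The queue-and-workers pattern can be sketched locally without any AWS calls. In the sketch below everything is a stand-in chosen for illustration: a `queue.Queue` plays the role of SQS, plain dictionaries play the roles of the CC and WDC buckets, threads play the role of EC2 instances, and `extract` is a hypothetical placeholder for the real CPU-heavy extraction step. It is not the project's code.

```python
import queue
import threading


def extract(page_bytes):
    """Hypothetical placeholder for the real extraction step."""
    return page_bytes.upper()


def worker(tasks, input_bucket, output_bucket, log):
    """Take file names from the queue until it is empty."""
    while True:
        try:
            key = tasks.get_nowait()  # take one task from the queue
        except queue.Empty:
            return                    # queue drained: worker shuts down
        # Read the input file, process it, write the result file.
        output_bucket["R" + key] = extract(input_bucket[key])
        log.append(key)               # stand-in for the logging service
        tasks.task_done()


def run(input_bucket, n_workers=4):
    """Queue every input file, then let n_workers drain the queue."""
    tasks = queue.Queue()
    for key in input_bucket:
        tasks.put(key)
    output_bucket, log = {}, []
    threads = [
        threading.Thread(target=worker,
                         args=(tasks, input_bucket, output_bucket, log))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output_bucket, log


cc = {"42": b"<html>a</html>", "43": b"<html>b</html>"}
wdc, log = run(cc)
print(sorted(wdc))
```

Because each worker only ever pulls the next file name from the shared queue, workers need no knowledge of each other, and adding or losing a worker changes only how fast the queue drains. A real queue service would additionally have to return a task to the queue when a worker disappears mid-task; that part is not modeled here.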
Results - Types of Data
[Figure: entity count (log scale, 5e+03 to 5e+06) per type (0 to 200) for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]

•   2012 Microdata Breakdown:

    •   Website Structure: 23 %

    •   Products, Reviews: 19 %

    •   Movies, Music, ...: 15 %

    •   Geodata: 8 %

    •   People, Organizations: 7 %

•   Available data largely determined by major player support

•   “If Google consumes it, we will publish it”
Results - Formats
[Figure: percentage of URLs (0 to 4) per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), 2009/2010 vs. 02/2012]

•   URLs with embedded Data: +6%

•   Microdata +14% (schema.org?)

•   RDFa +26% (Facebook?)
Results - Extracted Data

•   Extracted data available for download at

    •   www.webdatacommons.org

•   Formats: RDF (~90 GB) and CSV Tables for Microformats (!)

•   Have a look!



AWS Costs

•   About 5500 machine-hours were required

    •   AWS billed 1100 € for them

•   Cost for other services negligible *

•   * At first, we underestimated SDB cost
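
As a back-of-the-envelope check, the billed total can be related to the quoted spot price. The sketch below uses only the figures from the slides (5500 machine-hours, 1100 €, 100 workers, ~0.17 € per hour); the assumption that all 100 workers ran for the whole job is mine, and the variable names are made up for this example.

```python
MACHINE_HOURS = 5500    # from the slide
BILLED_EUR = 1100       # from the slide
SPOT_PRICE_EUR = 0.17   # approximate c1.xlarge spot price quoted earlier
WORKERS = 100           # assumes all workers ran for the whole job

# Average price actually paid per machine-hour.
effective_rate = BILLED_EUR / MACHINE_HOURS

# What the job would have cost at exactly the quoted spot price.
spot_only_cost = MACHINE_HOURS * SPOT_PRICE_EUR

# Elapsed time if the machine-hours were spread evenly over all workers.
wall_clock_days = MACHINE_HOURS / WORKERS / 24

print(f"effective rate: {effective_rate:.2f} EUR per machine-hour")
print(f"cost at quoted spot price: {spot_only_cost:.0f} EUR")
print(f"wall-clock time: {wall_clock_days:.1f} days")
```

The effective rate works out to 0.20 € per machine-hour, somewhat above the quoted spot price; the slides do not break down where the difference comes from.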



Takeaways
•   Web Data Commons now publishes the largest set of
    structured data from Web pages available

•   Large-Scale Web Analysis now possible with Common Crawl
    datasets

•   AWS great for massive ad-hoc computing power and
    complexity reduction

•   Choose your architecture wisely and test by experiment;
    for us, EMR was too expensive.

Thank You!
              Questions?
            Want to hire me?


Web Resources: http://webdatacommons.org
     http://hannes.muehleisen.org

More Related Content

What's hot

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...Amazon Web Services
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorRomain Rigaux
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBaseHBaseCon
 
DNS Security Presentation ISSA
DNS Security Presentation ISSADNS Security Presentation ISSA
DNS Security Presentation ISSASrikrupa Srivatsan
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Presentation oracle net services
Presentation    oracle net servicesPresentation    oracle net services
Presentation oracle net servicesxKinAnx
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesBest Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesMarkus Michalewicz
 
Presentation upgrade, migrate & consolidate to oracle database 12c &amp...
Presentation   upgrade, migrate & consolidate to oracle database 12c &amp...Presentation   upgrade, migrate & consolidate to oracle database 12c &amp...
Presentation upgrade, migrate & consolidate to oracle database 12c &amp...solarisyougood
 
Oracle 12c Multitenant architecture
Oracle 12c Multitenant architectureOracle 12c Multitenant architecture
Oracle 12c Multitenant architecturenaderattia
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB
 

What's hot (20)

AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Hue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL EditorHue architecture in the Hadoop ecosystem and SQL Editor
Hue architecture in the Hadoop ecosystem and SQL Editor
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
DNS Security Presentation ISSA
DNS Security Presentation ISSADNS Security Presentation ISSA
DNS Security Presentation ISSA
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Presentation oracle net services
Presentation    oracle net servicesPresentation    oracle net services
Presentation oracle net services
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesBest Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
 
Presentation upgrade, migrate & consolidate to oracle database 12c &amp...
Presentation   upgrade, migrate & consolidate to oracle database 12c &amp...Presentation   upgrade, migrate & consolidate to oracle database 12c &amp...
Presentation upgrade, migrate & consolidate to oracle database 12c &amp...
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Oracle 12c Multitenant architecture
Oracle 12c Multitenant architectureOracle 12c Multitenant architecture
Oracle 12c Multitenant architecture
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
 

Similar to AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMark Kromer
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMinsk MongoDB User Group
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017SingleStore
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Kristi Lewandowski
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Laure Vergeron
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FSMongoDB
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsMongoDB
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformMaris Elsins
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveTorsten Steinbach
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceRod Boothby
 

Similar to AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012 (20)

Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday PhillyMicrosoft Cloud BI Update 2012 for SQL Saturday Philly
Microsoft Cloud BI Update 2012 for SQL Saturday Philly
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Scality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup PresentationScality S3 Server: Node js Meetup Presentation
Scality S3 Server: Node js Meetup Presentation
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebService
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017Curriculum Associates Strata NYC 2017
Curriculum Associates Strata NYC 2017
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local London 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
 
MongoDB in FS
MongoDB in FSMongoDB in FS
MongoDB in FS
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
AWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and ResultsAWS to Bare Metal: Motivation, Pitfalls, and Results
AWS to Bare Metal: Motivation, Pitfalls, and Results
 
Database as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance PlatformDatabase as a Service on the Oracle Database Appliance Platform
Database as a Service on the Oracle Database Appliance Platform
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Intro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage ServiceIntro to Joyent's Manta Object Storage Service
Intro to Joyent's Manta Object Storage Service
 

AWS Customer Presentation: Freie Universität - Berlin Summit 2012

  • 1. Large-Scale Analysis of Web Pages - on a Startup Budget?
      Hannes Mühleisen, Web-Based Systems Group
      AWS Summit 2012 | Berlin
  • 6. Our Starting Point
      • Websites now embed structured data in HTML
      • Various vocabularies possible
          • schema.org, Open Graph protocol, ...
      • Various encoding formats possible
          • μFormats, RDFa, Microdata
      • Question: How are vocabularies and formats used?
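To make "structured data embedded in HTML" concrete, here is a minimal sketch of how Microdata types could be counted in a single page. It is not the extraction code used in the talk. The sample markup, the class name `MicrodataTypeCounter`, and the function `count_microdata_types` are illustrative assumptions, and only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser


class MicrodataTypeCounter(HTMLParser):
    """Counts itemtype URLs on elements that carry the itemscope attribute."""

    def __init__(self):
        super().__init__()
        self.types = Counter()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # itemscope is a boolean attribute, so the parser reports its value as None.
        if "itemscope" in attr_map and attr_map.get("itemtype"):
            # itemtype may hold several space-separated URLs.
            for type_url in attr_map["itemtype"].split():
                self.types[type_url] += 1


# Hypothetical page fragment using the schema.org vocabulary in Microdata.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    <span itemprop="reviewBody">Works as described.</span>
  </div>
</div>
"""


def count_microdata_types(html):
    """Return a Counter mapping each itemtype URL to its number of occurrences."""
    parser = MicrodataTypeCounter()
    parser.feed(html)
    parser.close()
    return parser.types


if __name__ == "__main__":
    for type_url, count in count_microdata_types(SAMPLE_HTML).items():
        print(type_url, count)
```

Aggregating such per-page counts over a whole crawl is one way to approach the question of which vocabularies and formats are in use; RDFa and Microformats would need their own attribute checks.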
  • 9. Web Indices
      • To answer our question, we need access to raw Web data.
      • However, maintaining Web indices is insanely expensive
          • Re-crawling, storage, currently ~50 B pages (Google)
      • Google and Bing have indices, but do not let outsiders in
  • 13.
      • Non-Profit Organization
      • Runs crawler and provides HTML dumps
      • Available data:
          • Index 02-12: 1.7 B URLs (21 TB)
          • Index 09/12: 2.8 B URLs (29 TB)
      • Available on AWS Public Data Sets
  • 15. Why AWS?
      • Now that we have a web crawl, how do we run our analysis?
      • Unpacking and DOM-parsing on 50 TB? (CPU-heavy!)
      • Preliminary analysis: 1 GB / hour / CPU possible
          • 8-CPU desktop: 8 months
          • 64-CPU server: 1 month
          • 100 8-CPU EC2 instances: ~3 days
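The runtime estimates on this slide can be reproduced with simple division. The sketch below assumes 1 TB = 1000 GB and perfectly linear scaling across CPUs, which is an idealization; the function names are illustrative.

```python
DATASET_GB = 50 * 1000      # ~50 TB of crawl data, assuming 1 TB = 1000 GB
GB_PER_CPU_HOUR = 1.0       # throughput from the preliminary analysis


def estimated_hours(cpus):
    """Wall-clock hours to process the dataset, assuming linear scaling."""
    return DATASET_GB / (GB_PER_CPU_HOUR * cpus)


def estimated_days(cpus):
    """Same estimate expressed in days."""
    return estimated_hours(cpus) / 24


if __name__ == "__main__":
    setups = {
        "8-CPU desktop": 8,
        "64-CPU server": 64,
        "100 x 8-CPU EC2 instances": 100 * 8,
    }
    for name, cpus in setups.items():
        print(f"{name}: {estimated_days(cpus):.1f} days")
```

Under these assumptions the three setups come out at roughly 260 days, 33 days, and 2.6 days, which matches the rounded figures of 8 months, 1 month, and about 3 days.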
  • 20. [Figure: Common Crawl dataset size, with labels "1 CPU, 1 h", "1000 € PC, 1 h", "5000 € Server, 1 h", and "17 € EC2 Instances, 1 h"]
  • 25. AWS Setup
      • Data input: read index splits from S3
      • Job coordination: SQS message queue
      • Workers: 100 EC2 Spot Instances (c1.xlarge, ~0.17 € / h)
      • Result output: write to S3
      • Logging: SDB
  • 26. SQS
      • Each input file queued in SQS
      • EC2 workers take tasks from SQS
      • Workers read and write S3 buckets
      • [Diagram: SQS queue (42, ...), EC2 workers, and S3 with CC (42, 43, ...) and WDC (R42, R43, ...)]
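The queue-based coordination described on this slide can be sketched locally as follows. This is a simulation, not the deck's implementation: the standard-library `queue.Queue` stands in for SQS, plain dictionaries stand in for the S3 input and output buckets, threads stand in for EC2 workers, and `extract` is a hypothetical placeholder for the unpacking and parsing step.

```python
import queue
import threading


def extract(page_bytes):
    """Hypothetical placeholder for the CPU-heavy unpack-and-parse step."""
    return len(page_bytes)


def worker(tasks, input_bucket, output_bucket, lock):
    """Take task keys from the queue until it is empty, writing one result per key."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return  # no tasks left, so the worker shuts down
        result = extract(input_bucket[key])   # read input (stand-in for S3)
        with lock:
            output_bucket["R" + key] = result  # write result (stand-in for S3)
        tasks.task_done()


def run(input_bucket, num_workers=4):
    """Queue one task per input file and process them with parallel workers."""
    tasks = queue.Queue()
    for key in input_bucket:
        tasks.put(key)
    output_bucket = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(tasks, input_bucket, output_bucket, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return output_bucket


if __name__ == "__main__":
    print(run({"42": b"<html>a</html>", "43": b"<html>bb</html>"}))
```

Because workers only pull tasks and share no other state, adding or losing a worker does not require reconfiguring the others. A real SQS consumer would additionally delete each message only after processing it successfully, so that a task taken by a failed worker becomes visible again for another one.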
  • 31. Results - Types of Data
      • [Plot: entity count (log scale) by type, for Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010, and Microdata 2009/2010]
      • 2012 Microdata breakdown:
          • Website Structure: 23 %
          • Products, Reviews: 19 %
          • Movies, Music, ...: 15 %
          • Geodata: 8 %
          • People, Organizations: 7 %
      • Available data largely determined by major player support
      • “If Google consumes it, we will publish it”
  • 34. Results - Formats
      • [Bar chart: percentage of URLs (0 to 4) per format (RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN), comparing 2009/2010 with 02-2012]
      • URLs with embedded data: +6%
      • Microdata +14% (schema.org?)
      • RDFa +26% (Facebook?)
  • 37. Results - Extracted Data
      • Extracted data available for download at
          • www.webdatacommons.org
      • Formats: RDF (~90 GB) and CSV tables for Microformats (!)
      • Have a look!
  • 40. AWS Costs
      • Ca. 5500 machine-hours were required
      • 1100 € billed by AWS for that
      • Cost for other services negligible *
      • * At first, we underestimated SDB cost
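As a quick check on these figures, the effective price per machine-hour can be derived from the bill and set against the spot price of roughly 0.17 € per hour quoted for the c1.xlarge workers. This is only arithmetic on the numbers in the slides; the variable and function names are illustrative.

```python
MACHINE_HOURS = 5500          # approximate machine-hours used
BILLED_EUR = 1100             # amount billed by AWS
SPOT_PRICE_EUR_PER_HOUR = 0.17  # approximate c1.xlarge spot price from the setup slide
WORKERS = 100                 # number of spot instances


def effective_rate(billed_eur, machine_hours):
    """Average cost per machine-hour actually paid."""
    return billed_eur / machine_hours


def fleet_cost_per_hour(workers, price_per_hour):
    """Cost of running the whole worker fleet for one hour."""
    return workers * price_per_hour


if __name__ == "__main__":
    rate = effective_rate(BILLED_EUR, MACHINE_HOURS)
    print(f"effective rate: {rate:.2f} EUR per machine-hour")
    fleet = fleet_cost_per_hour(WORKERS, SPOT_PRICE_EUR_PER_HOUR)
    print(f"fleet cost: {fleet:.0f} EUR per hour")
    print(f"EC2 at quoted spot price: {MACHINE_HOURS * SPOT_PRICE_EUR_PER_HOUR:.0f} EUR")
```

The bill works out to about 0.20 € per machine-hour, and 100 workers at the quoted spot price cost about 17 € per hour, the figure used in the earlier comparison of processing power per hour.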
  • 44. Takeaways
      • Web Data Commons now publishes the largest set of structured data from Web pages available
      • Large-scale Web analysis is now possible with Common Crawl datasets
      • AWS is great for massive ad-hoc computing power and complexity reduction
      • Choose your architecture wisely and test by experiment; for us, EMR was too expensive.
  • 45. Thank You! Questions? Want to hire me?
      • Web resources:
          • http://webdatacommons.org
          • http://hannes.muehleisen.org