Slide 1
International Internet Preservation Consortium
General Assembly 2014, Paris
Mining a Large Web Corpus
Robert Meusel
Christian Bizer
Slide 2
The Common Crawl
Slide 3
Hyperlink Graphs
Knowledge about the structure of the Web can be used to
improve crawling strategies, to help SEO experts or to
understand social phenomena.
Slide 4
HTML-embedded Data on the Web
Several million websites semantically mark up the content of
their HTML pages.
Markup Syntaxes
• Microformats
• RDFa
• Microdata
Data snippets
within info boxes
Slide 5
Relational HTML Tables
HTML tables offer semi-structured data which can be used to
build up or extend knowledge bases such as DBpedia.
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• In a corpus of 14B raw tables, 154M are "good" relations (1.1%)
Slide 6
The Web Data Commons Project
• Has developed an Amazon-based framework for extracting data
from large web crawls
• Capable of running on any cloud infrastructure
• Has applied this framework to the Common Crawl data
• Adaptable to other crawls
• Results and framework are publicly available
  - http://webdatacommons.org
Goal: Offer an easy-to-use, cost-efficient, distributed
extraction framework for large web crawls, as well as
datasets extracted from the crawls.
Slide 7
Extraction Framework
Components: Master, AWS SQS (queue), AWS EC2 instances, AWS S3
1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & Upload
6: Collect results
(Diagram legend: steps are marked as either automated or manual.)
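The six steps above can be sketched locally as follows. This is only an illustration under stated assumptions, not the WDC implementation (which is written in Java and runs on AWS): an in-memory `queue.Queue` stands in for AWS SQS, a dictionary stands in for the S3 output bucket, threads stand in for EC2 instances, and the names `run_extraction` and `extract` are hypothetical.

```python
import queue
import threading


def run_extraction(file_refs, extract, num_workers=3):
    """Simulate the master/worker flow with in-memory stand-ins."""
    task_queue = queue.Queue()   # stand-in for AWS SQS (assumption)
    results = {}                 # stand-in for the S3 output bucket (assumption)
    lock = threading.Lock()

    # Step 1: the master fills the queue with file references.
    for ref in file_refs:
        task_queue.put(ref)

    def worker():
        while True:
            try:
                # Step 3: request the next file reference.
                ref = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker shuts down
            # Step 4: "download" the file (simulated content).
            data = f"contents of {ref}"
            # Step 5: extract and "upload" the output.
            output = extract(data)
            with lock:
                results[ref] = output

    # Step 2: launch the workers (threads instead of EC2 instances).
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Step 6: collect the results.
    return results


if __name__ == "__main__":
    out = run_extraction(["a.warc", "b.warc", "c.warc"], extract=len)
    print(sorted(out))
```

Because each worker pulls its next file reference from the shared queue, adding workers does not require any coordination between them, which matches the independence of files and workers that the slides describe.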
Slide 8
Extraction Worker
Data flow: AWS S3 → download .(w)arc file → Worker (Filter → WDC Extractor) → output → upload output file → AWS S3
Worker:
• Written in Java
• Processes one page at a time
• Independent of other files and workers
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
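A minimal sketch of the filter idea, assuming a page is given as a (MIME type, HTML string) pair: a cheap MIME-type check and a regular expression decide whether the more expensive extractor runs at all. The allowed MIME types, the regex, and the function names are illustrative assumptions, not taken from the WDC code.

```python
import re

# Assumed whitelist of MIME types worth parsing.
ALLOWED_MIME_TYPES = {"text/html", "application/xhtml+xml"}

# Assumed hint pattern: attribute or class names typical of
# Microdata, RDFa and Microformats markup.
MARKUP_HINT = re.compile(r"\b(itemscope|itemprop|typeof|vcard)\b", re.IGNORECASE)


def passes_filter(mime_type, html):
    """Return True if the page is worth handing to the extractor."""
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type not in ALLOWED_MIME_TYPES:
        return False
    return MARKUP_HINT.search(html) is not None


def process_pages(pages, extractor):
    """Run the extractor only on pages that pass the filter."""
    return [extractor(html) for mime_type, html in pages
            if passes_filter(mime_type, html)]
```

The regex test is far cheaper than a full parse, so pages without any markup hint are skipped early, which is how such a filter reduces runtime.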
Slide 9
Web Data Commons – Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault tolerant and cheap
  - 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
  - Only the worker has to be adapted
  - The worker is a single process method that handles one file at a time
  - Scaling is automated by the framework
• Access to the open source code:
  - https://www.assembla.com/code/commondata/
Alternative: Hadoop version, which can run on any Hadoop
cluster without Amazon Web Services.
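As a rough worked example, the quoted figures (300 USD, 44 TB, 17 billion RDF statements) can be turned into unit costs. The helper name `unit_costs` is hypothetical, and the results are only as precise as the rounded numbers on the slide.

```python
def unit_costs(total_usd, terabytes, statements):
    """Derive simple unit costs from the totals quoted on the slide."""
    return {
        "usd_per_tb": total_usd / terabytes,
        "statements_per_usd": statements / total_usd,
    }


costs = unit_costs(total_usd=300, terabytes=44, statements=17e9)
print(f"{costs['usd_per_tb']:.2f} USD per TB")
print(f"{costs['statements_per_usd'] / 1e6:.1f} million statements per USD")
```

With these inputs the extraction works out to a little under 7 USD per terabyte processed.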
Slide 10
Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables
Slide 11
Hyperlink Graph
• Extracted from the Common Crawl 2012 Dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/
Slide 12
Hyperlink Graph
• Degrees do not follow a power-law
• Detection of Spam pages
• Further insights:
  - WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
  - WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
Discovery of evolutions in the global structure of the World
Wide Web.
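Any statement about degree distributions starts from counting how many nodes have each degree. The sketch below computes an in-degree distribution from a toy edge list; it is an illustration only, not the method used in the cited papers, and the function name and sample edges are made up.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree value to the number of nodes that have it."""
    in_degree = Counter(target for _, target in edges)
    nodes = {node for edge in edges for node in edge}
    # Nodes that are never linked to get in-degree 0.
    per_node = {node: in_degree.get(node, 0) for node in nodes}
    return Counter(per_node.values())


# Toy graph: (source, target) pairs.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
dist = in_degree_distribution(edges)
print(sorted(dist.items()))
```

Plotting such a distribution on log-log axes is a common first check: a power law would appear as a roughly straight line.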
Slide 13
Hyperlink Graph
Discovery of important and interesting sites using different
popularity rankings or website categorization libraries
Websites connected by at least ½ Million Links
Slide 14
HTML-embedded Data
More and more websites semantically
mark up the content of their HTML pages.
Markup Syntaxes
RDFa
Microformats
Microdata
Slide 15
Websites containing Structured Data (2013)
1.8 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13.9%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26.3%).
Web Data Commons - Microformat, Microdata, RDFa Corpus
• 17 billion RDF triples from Common Crawl 2013
• Next release will be in winter 2014
http://webdatacommons.org/structureddata/
Slide 16
Top Classes Microdata (2013)
• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary
Slide 17
HTML Tables
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
In a corpus of 14B raw tables, 154M are "good" relations (1.1%).
Cafarella (2008)
Classification Precision: 70-80%
Slide 18
WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables
  - selected out of 11.2 B raw tables (1.3%)
  - download includes the HTML pages of the tables (1 TB zipped)
• Table Statistics
  - Heterogeneity: Very high.

              Min     Max  Average  Median
Attributes      2   2,368     3.49       3
Data Rows       1  70,068    12.41       6

http://webdatacommons.org/webtables/
Slide 19
WDC - Web Tables Corpus
• Attribute Statistics: 28,000,000 different attribute labels

Attribute        #Tables
name           4,600,000
price          3,700,000
date           2,700,000
artist         2,100,000
location       1,200,000
year           1,000,000
manufacturer     375,000
country          340,000
isbn              99,000
area              95,000
population        86,000

• Subject Attribute Values: 1.74 billion rows, 253,000,000 different subject labels

Value               #Rows
usa               135,000
germany            91,000
greece             42,000
new york           59,000
london             37,000
athens             11,000
david beckham       3,000
ronaldinho          1,200
oliver kahn           710
twist shout         2,000
yellow submarine    1,400
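Counts like these can be produced by tallying, for each header label, the number of tables in which it occurs. The sketch below assumes each table is represented only by its list of header strings and normalizes labels by trimming and lowercasing; that normalization is an assumption, as are the function name and the sample data.

```python
from collections import Counter


def count_attribute_labels(table_headers):
    """Count in how many tables each normalized header label appears."""
    counts = Counter()
    for header in table_headers:
        # A set ensures a label is counted once per table,
        # even if it occurs in several columns.
        labels = {label.strip().lower() for label in header if label.strip()}
        counts.update(labels)
    return counts


# Made-up header rows of three tables.
table_headers = [
    ["Name", "Price"],
    ["name", "Date "],
    ["Artist", "name", "Name"],
]
label_counts = count_attribute_labels(table_headers)
print(label_counts.most_common(1))
```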
Slide 20
Conclusion
Three factors are necessary to work with web-scale data:
• Availability of crawls
  - Thanks to Common Crawl, this data is available
• Availability of cheap, easy-to-use infrastructure
  - Such as Amazon or other on-demand cloud services
• Easy-to-adopt, scalable extraction frameworks
  - The Web Data Commons framework, or standard tools like Pig
  - Costs have to be evaluated per task, but the WDC framework has turned out to be cheaper
Slide 21
Questions
• Please visit our website: www.webdatacommons.org
• Data and framework are available as a free download
• Web Data Commons is supported by:

Mining a Large Web Corpus

  • 1. Slide 1 International Internet Preservation Consortium General Assembly 2014, Paris Mining a Large Web Corpus Robert Meusel Christian Bizer
  • 3. Slide 3 Hyperlink Graphs Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts or to understand social phenomena.
  • 4. HTML-embedded Data on the Web
      Several million websites semantically mark up the content of their HTML pages.
      Markup syntaxes:
      • Microformats
      • RDFa
      • Microdata
      Figure: data snippets within info boxes.
  • 5. Relational HTML Tables
      HTML tables contain semi-structured data that can be used to build up or extend knowledge bases such as DBpedia.
      • In a corpus of 14B raw tables, 154M are "good" relations (1.1%).
      • Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
  • 6. The Web Data Commons Project
      • Has developed an Amazon-based framework for extracting data from large web crawls
      • Capable of running on any cloud infrastructure
      • Has applied this framework to the Common Crawl data
      • Adaptable to other crawls
      • Results and framework are publicly available
      • http://webdatacommons.org
      Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.
  • 7. Extraction Framework (architecture diagram)
      Components: Master, AWS SQS, AWS EC2 instances, AWS S3.
      Steps:
      • 1: Fill queue
      • 2: Launch instances
      • 3: Request file reference
      • 4: Download file
      • 5: Extract & upload
      • 6: Collect results
      Legend: automated, manual.
  • 8. Extraction Worker (diagram)
      Diagram labels: AWS S3, download file, .(w)arc, Worker, Filter, WDC Extractor, output, upload output file, AWS S3.
      Worker:
      • Written in Java
      • Processes one page at a time
      • Independent of other files and workers
      Filter:
      • Reduces runtime
      • Mime-type filter
      • Regex detection of content or meta-information
  • 9. Web Data Commons – Extraction Framework
      • Written in Java
      • Mainly tailored for Amazon Web Services
      • Fault tolerant and cheap
      • 300 USD to extract 17 billion RDF statements from 44 TB
      • Easily customizable
      • Only the worker has to be adapted
      • The worker is a single process method that handles one file at a time
      • Scaling is automated by the framework
      • Open source code: https://www.assembla.com/code/commondata/
      Alternative: a Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
  • 10. Extracted Datasets
      • Hyperlink Graph
      • HTML-embedded Data
      • Relational HTML Tables
  • 11. Hyperlink Graph
      • Extracted from the Common Crawl 2012 dataset
      • Over 3.5 billion pages connected by over 128 billion links
      • Graph files: 386 GB
      http://webdatacommons.org/hyperlinkgraph/
      http://wwwranking.webdatacommons.org/
  • 12. Hyperlink Graph
      Discovery of evolutions in the global structure of the World Wide Web.
      • Degrees do not follow a power law
      • Detection of spam pages
      • Further insights:
      • WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
      • WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
  • 13. Hyperlink Graph
      Discovery of important and interesting sites using different popularity rankings or website categorization libraries.
      Figure: websites connected by at least half a million links.
  • 14. HTML-embedded Data
      More and more websites semantically mark up the content of their HTML pages.
      Markup syntaxes: RDFa, Microformats, Microdata.
  • 15. Websites containing Structured Data (2013)
      • 1.8 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%).
      • 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%).
      Web Data Commons - Microformat, Microdata, RDFa Corpus
      • 17 billion RDF triples from Common Crawl 2013
      • Next release will be in winter 2014
      http://webdatacommons.org/structureddata/
  • 16. Top Classes Microdata (2013) (chart)
      • schema = Schema.org
      • dv = Google's Rich Snippet Vocabulary
  • 17. HTML Tables
      • In a corpus of 14B raw tables, 154M are "good" relations (1.1%). Cafarella (2008)
      • Classification precision: 70-80%
      References:
      • Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
      • Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
  • 18. WDC - Web Tables Corpus
      • Large corpus of relational Web tables for public download
      • Extracted from Common Crawl 2012 (3.3 billion pages)
      • 147 million relational tables
      • Selected out of 11.2B raw tables (1.3%)
      • Download includes the HTML pages of the tables (1 TB zipped)
      • Heterogeneity: very high
      Table statistics:
                      Min    Max       Average    Median
      Attributes      2      2,368     3.49       3
      Data Rows       1      70,068    12.41      6
      http://webdatacommons.org/webtables/
  • 19. WDC - Web Tables Corpus
      Attribute statistics: 28,000,000 different attribute labels
      Attribute       #Tables
      name            4,600,000
      price           3,700,000
      date            2,700,000
      artist          2,100,000
      location        1,200,000
      year            1,000,000
      manufacturer    375,000
      country         340,000
      isbn            99,000
      area            95,000
      population      86,000
      Subject attribute values: 1.74 billion rows, 253,000,000 different subject labels
      Value               #Rows
      usa                 135,000
      germany             91,000
      greece              42,000
      new york            59,000
      london              37,000
      athens              11,000
      david beckham       3,000
      ronaldinho          1,200
      oliver kahn         710
      twist shout         2,000
      yellow submarine    1,400
  • 20. Conclusion
      Three factors are necessary to work with web-scale data:
      • Availability of crawls: thanks to Common Crawl, this data is available.
      • Availability of cheap, easy-to-use infrastructures: Amazon or other on-demand cloud services.
      • Easy-to-adopt, scalable extraction frameworks: the Web Data Commons framework, or standard tools like Pig. Costs have to be evaluated per task, but the WDC framework has turned out to be cheaper.
  • 21. Questions
      • Please visit our website: www.webdatacommons.org
      • Data and framework are available as free download
      • Web Data Commons is supported by:
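
The queue-driven workflow on slides 7 to 9 (fill a queue with file references, let independent workers request a reference, download the file, filter and extract, upload the output, then collect the results) can be illustrated with a small Python sketch. This is not the WDC framework, which is written in Java and runs on AWS. As stated assumptions: an in-memory queue.Queue stands in for SQS, plain dicts stand in for the S3 buckets, threads stand in for EC2 instances, and the file names, page contents, and function names (process_file, worker, run_extraction) are hypothetical. The "extraction" is reduced to pulling the itemtype value out of Microdata markup with a regex.

```python
import queue
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the S3 input and output buckets.
# Each input "file" is a list of (mime_type, html) pages.
INPUT_BUCKET = {
    "crawl/file-1.warc": [
        ("text/html", '<div itemscope itemtype="http://schema.org/Product">A</div>'),
        ("image/png", "binary"),
    ],
    "crawl/file-2.warc": [
        ("text/html", "<p>no markup here</p>"),
        ("text/html", '<span itemscope itemtype="http://schema.org/Person">B</span>'),
    ],
}
OUTPUT_BUCKET = {}

# Cheap regex pre-filter: only pages that appear to contain Microdata
# reach the extraction step.
MICRODATA_HINT = re.compile(r'itemtype="([^"]+)"')


def process_file(pages):
    """Worker logic for one file: handle one page at a time."""
    results = []
    for mime_type, html in pages:
        if mime_type != "text/html":          # mime-type filter
            continue
        match = MICRODATA_HINT.search(html)   # regex detection of content
        if match is None:
            continue
        results.append(match.group(1))        # stand-in for real extraction
    return results


def worker(file_queue):
    """Steps 3 to 5: request a file reference, download, extract, upload."""
    while True:
        try:
            key = file_queue.get_nowait()     # step 3: request file reference
        except queue.Empty:
            return                            # queue drained, worker stops
        pages = INPUT_BUCKET[key]             # step 4: download file
        OUTPUT_BUCKET[key + ".out"] = process_file(pages)  # step 5


def run_extraction(num_workers=2):
    file_queue = queue.Queue()
    for key in INPUT_BUCKET:                  # step 1: fill queue
        file_queue.put(key)
    # Step 2: launch workers; leaving the with-block waits for all of them.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_workers):
            pool.submit(worker, file_queue)
    # Step 6: collect results.
    return {key: OUTPUT_BUCKET[key] for key in sorted(OUTPUT_BUCKET)}
```

Calling run_extraction() returns one output entry per input file, keyed by the hypothetical output name. Each worker only touches the file it pulled from the queue, so changing num_workers affects throughput but not the results, which mirrors the independence that slide 8 attributes to the workers.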