Focused Crawling for Structured Data
Robert Meusel, Peter Mika, and Roi Blanco
Markup Languages in HTML Pages
HTML pages directly embed markup languages to annotate items using different vocabularies

<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>

1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
4. _:node1 <http://schema.org/Offer/price> "219,95"@de .
5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
6. …
Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
Deployment of Markup Languages 
14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]
• Broad topical variety, from articles and products to recipes [Bizer2013]
• Multiple strong drivers pushing the deployment
  • Search engine companies' initiative on Schema.org
  • Open Graph Protocol used by Facebook
Motivation 
• Existing datasets/crawls do not focus on structured data
  • Common Crawl Foundation uses PageRank and Breadth-First Search
  • Datasets extracted from these corpora, such as the WebDataCommons corpus, are likely to miss large amounts of data [Meusel2014]
• Structured information
• Hundreds of millions of pages
• Up-to-date information
• Publicly available
Main Idea 
• Adapting the idea of focused crawling
• Similarities:
  • Evaluation of content based on an objective function
• Differences:
  • Focused crawling is typically guided by topic, not by the quality/amount of data collected
  • Because of that, direct feedback about crawled pages is typically not available
Possibility to incorporate feedback directly into our system to improve the classification of newly discovered URLs.
Online Learning for Focused Crawling 
• Capability to incorporate real-time feedback
  • Improves performance
  • Adapts to concept drift
• Possible features
  • URL-based features, mainly tokens from the URL string itself
  • Features describing information from the parent(s) of the URL
  • Features describing information from the siblings of the URL
• Free open-source software available (e.g., Massive Online Analysis library by Bifet et al.)
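
A minimal sketch of the online-learning idea, using only URL tokens as features. This is not the classifier used in the talk: the class name `OnlineUrlClassifier`, the logistic-regression update, and the learning rate are assumptions chosen for illustration, and parent/sibling features are left out.

```python
import math
import re


def url_tokens(url):
    """Split a URL string into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlineUrlClassifier:
    """Hypothetical logistic regression updated one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = {}  # token -> weight
        self.bias = 0.0

    def predict(self, url):
        """Estimated probability that the page behind `url` is relevant."""
        tokens = set(url_tokens(url))
        score = self.bias + sum(self.weights.get(t, 0.0) for t in tokens)
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, url, relevant):
        """Feedback step, called once the page has been fetched and parsed."""
        error = (1.0 if relevant else 0.0) - self.predict(url)
        step = self.learning_rate * error
        self.bias += step
        for token in set(url_tokens(url)):
            self.weights[token] = self.weights.get(token, 0.0) + step
```

Each crawled page yields one `update` call, so the model can follow changes in the data without retraining on a stored batch.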
Exploration vs. Exploitation 
Selecting the page with the highest confidence for supporting our objective might not always be the best choice
• Decision/classification is based on gathered knowledge
• Knowledge can be incomplete
  • Too few pages crawled
• Knowledge can become invalid
  • Reaching a part of the Web with different behavior
Bandit-Based Selection 
• Assign each URL to the bin of the host it belongs to
• Each host represents one bandit
• Calculate the expected score for each bandit based on a scoring function
• Select the degree of randomness λ
  • λ between 0 and 1
• For each turn, draw a random number z
  • z > λ: select the bandit with the highest score
  • else: select a random bandit
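
The selection rule above can be sketched as follows. The function names `bin_by_host` and `select_host` are hypothetical, and drawing z uniformly from [0, 1) is an assumption, since the slide only says "a random number".

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def bin_by_host(urls):
    """Group candidate URLs by host; each host acts as one bandit."""
    bins = defaultdict(list)
    for url in urls:
        bins[urlsplit(url).hostname].append(url)
    return dict(bins)


def select_host(scores, lam, rng=random):
    """Pick a host given `scores` (host -> expected score) and randomness lam."""
    z = rng.random()  # assumption: z is uniform on [0, 1)
    if z > lam:
        # Exploit: take the bandit with the highest expected score.
        return max(scores, key=scores.get)
    # Explore: take any bandit uniformly at random.
    return rng.choice(sorted(scores))
```

With λ = 0 the rule is (almost surely) purely greedy, and with λ = 1 every turn picks a random host.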
Scoring Functions 
Incorporate knowledge in score calculation for bandit/host: 
• Best Score (Pure classification-based selection) 
• Negative Absolute Bad 
• Success Rate 
• Absolute Good · Best Score 
• Success Rate · Best Score 
• Thompson Sampling 
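
The slide only names the scoring functions, so the definitions below are assumptions made for illustration: `HostStats` is a hypothetical record of per-host feedback, the success rate uses add-one smoothing so that unseen hosts score 0.5, and Thompson sampling draws from a Beta posterior over the host's success probability.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    """Hypothetical per-host feedback record."""
    good: int = 0            # crawled pages of this host that were relevant
    bad: int = 0             # crawled pages of this host that were not relevant
    best_score: float = 0.0  # highest classifier confidence among queued URLs


def best_score(h):
    return h.best_score


def negative_absolute_bad(h):
    return -h.bad


def success_rate(h):
    # Assumption: add-one smoothing, not taken from the slides.
    return (h.good + 1) / (h.good + h.bad + 2)


def absolute_good_times_best_score(h):
    return h.good * h.best_score


def success_rate_times_best_score(h):
    return success_rate(h) * h.best_score


def thompson_sampling(h, rng=random):
    # Sample a plausible success probability from Beta(good + 1, bad + 1).
    return rng.betavariate(h.good + 1, h.bad + 1)
```

Any of these functions can supply the per-host score that the bandit-based selection compares across hosts.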
System Workflow 
[Figure: system workflow with the components Online Classifier, Bandits, Crawler, URL Parser, and Semantic Parser; labeled flows: Seeds, URLs, Classified URL, URL, HTML Page, Feedback]
Setup for Experiments 
• Data originates from the Common Crawl Corpus 2012
  • Including over 3.5 billion HTML pages
• Extracted a subset of 5.5 million linked pages
  • Including 450K different hosts
• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
  • 27.5% of all pages
Experiment Description 
Measure: Number of relevant pages retrieved within the first 1 million pages crawled.
1. Online vs. batch-based classification with 100K, 250K, and 1M pages
2. Pure online classification vs. enhanced with bandit-based selection (λ = 0)
3. Improvements with different λ
4. Improvements with decaying λ
Results: Online vs. Offline 
• Both methods outperform Breadth-First Search (BFS) 
• Static approach: 340K 
• Adaptive approach: 539K 
[Plot; axes: percentage of relevant pages, fetched web pages]
Results: Pure Online Classification vs. +Bandit-based 
• Success-rate-based scoring functions show the most promising results
• Negative absolute bad scoring performs like BFS
• Success rate function: 628K
• Pure online classification: 539K
[Plot; axes: percentage of relevant pages, fetched web pages]
Results: λ > 0 
• Including randomness does not seem to have an effect
• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages
[Plot; axes: percentage of relevant pages, fetched web pages]
Results: Decaying λ 
Decaying λ over time means reducing the randomness as more pages are crawled.
• Success rate function with decaying λ = 0.5: 673K
• Static λ: 628K
[Plot; axes: percentage of relevant pages, fetched web pages]
Adaptation to a More Specific Objective
• General objective is narrowed down to pages that:
  • make use of the markup language Microdata and
  • include at least five marked-up statements
• Example: 
1. A page including information about a movie 
2. The movie has the name Se7en 
3. with a rating of 8.7 out of 10 
4. and it was released in 1995 
5. This information is maintained by imdb.com 
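
A rough sketch of how such an objective could be checked for a fetched page, using Python's standard `html.parser`. The names `MicrodataCounter` and `is_relevant` are hypothetical, and the counting rule (one statement per typed `itemscope` plus one per `itemprop`) is an approximation assumed here, not the extraction used for the experiments.

```python
from html.parser import HTMLParser


class MicrodataCounter(HTMLParser):
    """Approximately counts Microdata statements in an HTML page."""

    def __init__(self):
        super().__init__()
        self.statements = 0

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if "itemscope" in names and "itemtype" in names:
            self.statements += 1  # one type statement per typed item
        if "itemprop" in names:
            self.statements += 1  # one statement per annotated property


def is_relevant(html, min_statements=5):
    """True if the page has at least `min_statements` Microdata statements."""
    counter = MicrodataCounter()
    counter.feed(html)
    counter.close()
    return counter.statements >= min_statements
```

The boolean result is the kind of feedback signal that can be passed back to the classifier and the bandits after each fetch.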
Results: Adaptation to a More Specific Objective
• 3.5% of pages include such information
• In general: beneficial effects are observed when using our approach
• Static λ = 0.2: 120K
• Decaying λ = 0.5: 108K
[Plot; axes: percentage of relevant pages, fetched web pages]
Conclusion 
• Improvement of 26% over the pure online classification-based selection strategy for the general objective
• Improvement of 66% for the more specific objective
• Success-rate-based scoring functions show the most promising results for both objectives
Open Challenges 
• Extend the approach so that results from one bandit can be exploited for the other bandits (contextual bandits)
• Introduce a more fine-grained grading of the crawled pages (multi-class problem)
• Take into account the quality of the gathered information (besides richness)
• Adapt the process to traditional topical focused crawling
• Publish code and data to the community
More Information 
• Paper accepted at the ACM International Conference on Information and Knowledge Management in Shanghai, China
  • ACM Digital Library: Focused Crawling for Structured Data
• Detailed descriptions and source code:
  • Anthelion Webpage
• Datasets:
  • Common Crawl Foundation Corpora
  • WebDataCommons Corpora