Focused Crawling for Structured Data
Robert Meusel, Peter Mika, and Roi Blanco
Markup Languages in HTML Pages
HTML pages directly embed markup languages to annotate items using different vocabularies

<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>

Statements extracted from the page:
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
…
Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
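To make the Microdata example concrete, here is a minimal sketch of how the annotations could be spotted in a page using only Python's standard library. The class and function names (MicrodataScanner, scan_microdata) and the sample markup are illustrative assumptions, not the extractor used by the authors or by WebDataCommons, and the sketch only lists types and property names rather than building RDF statements.

```python
from html.parser import HTMLParser


class MicrodataScanner(HTMLParser):
    """Collects itemtype URLs and itemprop names found in start tags."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.item_props = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # valueless attributes such as itemscope map to None
        if "itemscope" in attr_map and attr_map.get("itemtype"):
            self.item_types.append(attr_map["itemtype"])
        if attr_map.get("itemprop"):
            self.item_props.append(attr_map["itemprop"])


def scan_microdata(html):
    """Return (item types, property names) in document order."""
    scanner = MicrodataScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.item_types, scanner.item_props


# Hypothetical, simplified version of the product page shown above.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Predator Instinct FG Fußballschuh</h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price">219,95</span>
  </div>
</div>
"""

types, props = scan_microdata(SAMPLE)
print(types)
print(props)
```

A page for which item_types is non-empty would count as containing Microdata in this simplified view.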
Deployment of Markup Languages 
14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]
• Broad topical variety, from articles and products to recipes [Bizer2013]
• Multiple strong drivers pushing the deployment
  • Search engine companies' initiative on Schema.org
  • Open Graph Protocol used by Facebook
Motivation 
• Existing datasets/crawls do not focus on structured data
  • Common Crawl Foundation uses PageRank and Breadth-First Search
  • Datasets such as the WebDataCommons corpus, extracted from these corpora, are likely to miss large amounts of data [Meusel2014]
• Structured information
• Hundreds of millions of pages
• Up-to-date information
• Publicly available
Main Idea 
• Adapting the idea of focused crawling
• Similarities:
  • Evaluation of content based on an objective function
• Differences:
  • Typically focused by topic, not by quality/amount of data collected
  • Because of that, typically no direct feedback about crawled pages is available
Our setting: possibility to incorporate the feedback directly into our system to improve the classification of newly discovered URLs.
Online Learning for Focused Crawling 
• Capability to incorporate real-time feedback
  • Improves performance
  • Adapts to concept drift
• Possible features
  • URL-based features, mainly tokens from the URL string itself
  • Features describing information from the parent(s) of the URL
  • Features describing information from the siblings of the URL
• Free open-source software available (e.g., the Massive Online Analysis library by Bifet et al.)
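As a rough illustration of URL-based features combined with online learning, the sketch below tokenizes a URL and updates a tiny mistake-driven perceptron after each crawled page. The perceptron is a stand-in chosen for brevity, not the classifier used in the talk (which points to the Massive Online Analysis library); the names url_tokens and OnlinePerceptron and the example URLs are assumptions.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Mistake-driven online learner over binary token features."""

    def __init__(self):
        self.weights = {}

    def score(self, tokens):
        return sum(self.weights.get(t, 0.0) for t in set(tokens))

    def predict(self, tokens):
        return self.score(tokens) > 0

    def update(self, tokens, relevant):
        # Adjust weights only when the current prediction was wrong.
        if self.predict(tokens) != relevant:
            delta = 1.0 if relevant else -1.0
            for t in set(tokens):
                self.weights[t] = self.weights.get(t, 0.0) + delta


model = OnlinePerceptron()
# Feedback arrives once a page has been crawled and parsed (hypothetical URLs).
model.update(url_tokens("http://shop.example.com/product/123"), True)
model.update(url_tokens("http://blog.example.com/about"), False)
print(model.predict(url_tokens("http://shop.example.com/product/456")))
```

Because the model is updated one example at a time, feedback from freshly crawled pages can influence the very next classification.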
Exploration vs. Exploitation 
Selecting the page with the highest confidence for supporting our objective might not always be the best choice
• Decision/classification is based on gathered knowledge
• Knowledge can be incomplete
  • Crawled too few pages
• Knowledge can become invalid
  • Reaching a part of the Web with different behavior
Bandit-Based Selection 
• Assign each URL to the host it belongs to
  • Each host represents one bandit
• Calculate the expected score for each bandit based on a scoring function
• Select the degree of randomness λ
  • λ between 0 and 1
• For each turn, draw a random number z between 0 and 1
  • z > λ: select the bandit with the highest score
  • else: select a random bandit
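The selection rule above can be sketched in a few lines. This is a minimal illustration that assumes z is drawn uniformly from [0, 1) and that host scores are already available as a dictionary; the function names group_by_host and select_host and the example hosts are hypothetical.

```python
import random
from urllib.parse import urlsplit


def group_by_host(urls):
    """Assign each URL to its host; each host acts as one bandit."""
    bandits = {}
    for url in urls:
        bandits.setdefault(urlsplit(url).netloc, []).append(url)
    return bandits


def select_host(scores, lam, rng=random):
    """Pick the best-scoring host if z > lam, otherwise a random host."""
    hosts = list(scores)
    z = rng.random()  # assumed: uniform in [0, 1)
    if z > lam:
        return max(hosts, key=scores.get)
    return rng.choice(hosts)


bandits = group_by_host([
    "http://a.example/p1",
    "http://a.example/p2",
    "http://b.example/x",
])
scores = {"a.example": 0.9, "b.example": 0.4}  # hypothetical expected scores
print(select_host(scores, lam=0.2, rng=random.Random(0)))
```

With λ = 0 the rule is (almost surely) purely greedy; with λ = 1 every turn picks a random host.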
Scoring Functions 
Incorporate knowledge in score calculation for bandit/host: 
• Best Score (Pure classification-based selection) 
• Negative Absolute Bad 
• Success Rate 
• Absolute Good · Best Score 
• Success Rate · Best Score 
• Thompson Sampling 
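The slide only names the scoring functions, so the formulas below are one plausible reading of those names rather than the authors' definitions. The sketch assumes that each host tracks how many of its crawled pages were relevant ("good") or not ("bad"), plus the best classifier confidence among its uncrawled URLs; HostStats and all function names are hypothetical, and Thompson Sampling is shown with a uniform Beta prior as an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    good: int          # crawled pages of this host that met the objective
    bad: int           # crawled pages of this host that did not
    best_score: float  # highest classifier confidence among uncrawled URLs


def best_score(s):
    return s.best_score


def negative_absolute_bad(s):
    return -s.bad


def success_rate(s):
    crawled = s.good + s.bad
    return s.good / crawled if crawled else 0.0  # assumed: 0 for unseen hosts


def absolute_good_times_best_score(s):
    return s.good * s.best_score


def success_rate_times_best_score(s):
    return success_rate(s) * s.best_score


def thompson_sample(s, rng=random):
    # Draw from Beta(good + 1, bad + 1): a uniform prior updated with feedback.
    return rng.betavariate(s.good + 1, s.bad + 1)


stats = HostStats(good=3, bad=1, best_score=0.8)
print(success_rate(stats), negative_absolute_bad(stats))
```

Any of these functions could supply the per-host scores that the bandit-based selection compares.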
System Workflow 
[Figure: system workflow connecting Seeds, Online Classifier, Bandits, Crawler, URL Parser, and Semantic Parser; edge labels: URLs, Classified URL, URL, HTML Page, Feedback]
Setup for Experiments 
• Data originates from the Common Crawl Corpus 2012
  • Including over 3.5 billion HTML pages
• Extracted a subset of 5.5 million linked pages
  • Including 450K different hosts
• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
  • 27.5% of all pages
Experiment Description 
Measure: Number of relevant pages retrieved within the first 1 million pages crawled.
1. Online vs. batch-based classification with 100K, 250K, and 1M pages
2. Pure online classification vs. online classification enhanced with bandit-based selection (λ = 0)
3. Improvements with different λ
4. Improvements with decaying λ
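The evaluation measure can be written down directly. The sketch below assumes that the crawl order is a list of fetched URLs with each URL fetched once, and that the set of relevant URLs is known in advance (as it is when replaying a pre-labeled corpus); the function name and the toy data are hypothetical.

```python
def relevant_within_budget(crawl_order, relevant_urls, budget=1_000_000):
    """Count relevant pages among the first `budget` fetched pages."""
    return sum(1 for url in crawl_order[:budget] if url in relevant_urls)


# Toy example with a budget of three pages.
order = ["a", "b", "c", "d"]
relevant = {"a", "c", "d"}
print(relevant_within_budget(order, relevant, budget=3))
```

Two crawl strategies can then be compared by running both over the same link graph and comparing this count.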
Results: Online vs. Offline 
• Both methods outperform Breadth-First Search (BFS) 
• Static approach: 340K 
• Adaptive approach: 539K 
[Plot: percentage of relevant pages vs. fetched web pages]
Results: Pure Online Classification vs. +Bandit-based 
• Success-rate-based scoring functions show the most promising results
• Negative absolute bad scoring performs like BFS
• Success rate function: 628K
• Pure online classification: 539K
[Plot: percentage of relevant pages vs. fetched web pages]
Results: λ > 0 
• Including randomness does not seem to have an effect
• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages
[Plot: percentage of relevant pages vs. fetched web pages]
Results: Decaying λ 
Decaying λ over time means reducing the randomness as more pages are crawled.
• Success rate function with decaying λ = 0.5: 673K
• Static λ: 628K
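The slides do not state which decay schedule was used, so the following is only an assumed example: an exponential decay in which λ halves after every half_life crawled pages. Both the schedule and the parameter names are illustrative.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Assumed schedule: lambda halves every `half_life` crawled pages."""
    return initial_lambda * 0.5 ** (pages_crawled / half_life)


for crawled in (0, 100_000, 500_000, 1_000_000):
    print(crawled, decayed_lambda(0.5, crawled))
```

Early in the crawl the selection is fairly random (exploration); as knowledge accumulates, λ shrinks and the selection becomes increasingly greedy (exploitation).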
[Plot: percentage of relevant pages vs. fetched web pages]
Adaptation to a More Specific Objective
• General objective is narrowed down to pages that:
  • Make use of the markup language Microdata and
  • Include at least five marked-up statements
• Example: 
1. A page including information about a movie 
2. The movie has the name Se7en 
3. with a rating of 8.7 out of 10 
4. and it was released in 1995 
5. This information is maintained by imdb.com 
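The narrowed objective amounts to a simple check on the statements extracted from a page. The sketch below assumes the Microdata statements are available as (subject, predicate, object) tuples; the function name and the predicate names in the example are hypothetical and only mirror the five statements listed above.

```python
def meets_specific_objective(microdata_statements, min_statements=5):
    """True if a page carries at least `min_statements` Microdata statements."""
    return len(microdata_statements) >= min_statements


# Hypothetical encoding of the movie example (predicate names are illustrative).
movie_page = [
    ("_:movie", "rdf:type", "schema:Movie"),
    ("_:movie", "schema:name", "Se7en"),
    ("_:movie", "schema:ratingValue", "8.7"),
    ("_:movie", "schema:datePublished", "1995"),
    ("_:movie", "example:maintainedBy", "imdb.com"),
]
print(meets_specific_objective(movie_page))
```

A page with only four of these statements would not count as relevant under the narrowed objective.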
Results: Adaptation to a More Specific Objective
• 3.5% of pages include such information
• In general: beneficial effects observed when using our approach
• Static λ = 0.2: 120K
• Decaying λ = 0.5: 108K
[Plot: percentage of relevant pages vs. fetched web pages]
Conclusion 
• Improvement of 26% over the pure online classification-based selection strategy for the general objective
• Improvement of 66% for the more specific objective
• Success-rate-based scoring functions show the most promising results for both objectives
Open Challenges 
• Expand the approach to transfer results from one bandit to the other bandits (contextual bandits)
• Introduce a more fine-grained grading of the crawled pages (multi-class problem)
• Take into account the quality of the gathered information (besides richness)
• Adapt the process to traditional topical focused crawling
• Publish code and data to the community
More Information 
• Paper accepted at the ACM International Conference on Information and Knowledge Management in Shanghai, China
  • ACM Digital Library: Focused Crawling for Structured Data
• Detailed descriptions and source code:
  • Anthelion webpage
• Datasets:
  • Common Crawl Foundation corpora
  • WebDataCommons corpora

Focused Crawling for Structured Data

  • 1. Focused Crawling for Structured Data Robert Meusel, Peter Mika, and Roi Blanco
  • 2. HTML pages embed directly markup languages to annotate items using different vocabularies 1._:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns# 2._:node1 <http://schema.org/Product/name> "Predator 2 Markup Languages in HTML Pages <html> … <body> … <div id="main-section" class="performance left" data-sku=" M17242_580“> 580" itemscope itemtype="http://schema.org/Product"> h1 itemprop="name"> Predator Instinct FG Fußballschuh <h1> Predator Instinct FG Fußballschuh </h1> <div> div itemscope itemtype="http://schema.org/Offer" itemprop="offers"> type> <http://schema.org/Product> . itemprop="priceCurrency" content="EUR"> itemprop="price" data-sale-price=" 219.95">219,95</span> <meta content="EUR"> <span data-sale-price="219.95">219,95</span> … </body> </html> Instinct FG Fußballschuh"@de . 3._:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns# type> <http://schema.org/Offer> . 4._:node1 <http://schema.org/Offer/price> "219,95"@de . 5._:node1 <http://schema.org/Offer/priceCurrency> "EUR" . 6.… Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  • 3. 3 Deployment of Markup Languages 14% of all sites use markup languages to annotate their data (status 2013) [Meusel2014] • Broad topical variations from Articles over Products to Recipe [Bizer2013] • Multiple strong drivers pushing the deployment • Search engine companies initiative on Schema.org • Open Graph Protocol used by Facebook Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  • 4. 4 Motivation • Existing datasets/crawls do not focus on structured data • Common Crawl Foundation uses PageRank and Breadth-First Search • Datasets, as the WebDataCommons corpus extracted from these corpora, are likely to miss large amounts of data [Meusel2014] • Structured information • Hundreds of million pages • Up-to-date information • Publicly available Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  • 5. 5 Main Idea • Adapting the idea of focused crawling • Similarities: • Evaluation of content based on a objective function • Differences: • Typically focused by topic, not quality/amount of data collected • Because of that, typically no direct feedback about crawled pages available Possibility to incorporate the feedback directly into our system to improve classification of newly discovered URLs. Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  • 6. 6 Online Learning for Focused Crawling • Capability to incorporates real-time feedback • Improves performance • Adapts to concept drifts • Possible features • URL-based features; mainly tokens from the URL-String itself • Features describing information from the parent(s) of the URL • Features describing information from the siblings of the URL • Free open-source software available (e.g. Massive Online Analysis Library by Bifet et al.) Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  • 7. Exploration vs. Exploitation
      Selecting the page with the highest confidence for supporting our objective might not always be the best choice.
      • Decision/classification is based on gathered knowledge
      • Knowledge can be incomplete
          • Crawled too few pages
      • Knowledge can become invalid
          • Reaching a part of the Web with different behavior
  • 8. Bandit-Based Selection
      • Assign each URL to the host it belongs to
      • Each host represents one bandit
      • Calculate the expected score for each bandit based on a scoring function
      • Select the degree of randomness λ
          • λ between 0 and 1
      • For each turn, draw a random number z
          • z > λ: select the bandit with the highest score
          • else: select a random bandit
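The selection rule on this slide can be sketched in a few lines. The function name `select_host` and the dictionary `host_scores` (host name to expected score) are assumptions for illustration; the scores themselves would come from one of the scoring functions.

```python
import random


def select_host(host_scores, lam, rng=random):
    """Pick one host (bandit): exploit if z > lam, otherwise explore at random."""
    if not host_scores:
        raise ValueError("no hosts with queued URLs")
    z = rng.random()                       # uniform draw in [0, 1)
    if z > lam:
        # Exploitation: the bandit with the highest expected score
        return max(host_scores, key=host_scores.get)
    # Exploration: any bandit, chosen uniformly at random
    return rng.choice(list(host_scores))


# Hypothetical expected scores per host.
scores = {"shop.example": 0.9, "blog.example": 0.2, "news.example": 0.5}

greedy_pick = select_host(scores, lam=0.0, rng=random.Random(0))
random_pick = select_host(scores, lam=1.0, rng=random.Random(0))
```

With λ = 0 the rule is almost always purely greedy, and with λ = 1 every turn picks a random host; values in between trade exploitation against exploration.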
  • 9. Scoring Functions
      Incorporate knowledge in the score calculation for each bandit/host:
      • Best Score (pure classification-based selection)
      • Negative Absolute Bad
      • Success Rate
      • Absolute Good · Best Score
      • Success Rate · Best Score
      • Thompson Sampling
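The slide only names the scoring functions, so the definitions below are one plausible reading and not the authors' formulas. The sketch assumes that each host keeps a count of crawled pages with markup (`good`), a count of crawled pages without markup (`bad`), and the highest classifier confidence among its queued URLs (`best_score`). The `HostStats` class, the Laplace smoothing in `success_rate`, and the Beta prior in `thompson_sample` are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    good: int = 0            # crawled pages of this host that contained markup
    bad: int = 0             # crawled pages of this host without markup
    best_score: float = 0.0  # highest classifier confidence among queued URLs


def best_score(s):
    return s.best_score


def negative_absolute_bad(s):
    return -s.bad


def success_rate(s):
    # Laplace smoothing (an assumption) keeps unvisited hosts at 0.5
    return (s.good + 1) / (s.good + s.bad + 2)


def absolute_good_times_best(s):
    return s.good * s.best_score


def success_rate_times_best(s):
    return success_rate(s) * s.best_score


def thompson_sample(s, rng=random):
    # Draw from a Beta posterior over the host's success probability,
    # assuming a uniform Beta(1, 1) prior
    return rng.betavariate(s.good + 1, s.bad + 1)


stats = HostStats(good=3, bad=1, best_score=0.8)
```

Under this reading, Best Score relies on the classifier alone, while the other functions mix in what has actually been observed on the host.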
  • 10. System Workflow
      [Diagram. Components: Online Classifier, Bandits, Crawler, URL Parser, Semantic Parser. Edge labels: Seeds, URLs, Classified URL, URL, HTML Page, Feedback.]
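The feedback loop in this workflow can be sketched as a small simulation. This is a simplified illustration under several assumptions: the `crawl` function and its parameters are hypothetical, the bandit step is replaced by a plain "highest score first" choice, and the crawler, URL parser, and semantic parser are folded into a single `fetch` callback that returns whether markup was found together with the outgoing links. `TOY_WEB` is made-up data.

```python
from collections import defaultdict
from urllib.parse import urlparse


def crawl(seeds, fetch, score, feedback, budget):
    """Crawl up to `budget` pages; return the URLs that contained markup."""
    frontier = list(seeds)     # discovered but not yet crawled URLs
    seen = set(seeds)
    relevant = []
    for _ in range(budget):
        if not frontier:
            break
        url = max(frontier, key=score)        # pick the best classified URL
        frontier.remove(url)
        has_markup, outlinks = fetch(url)     # crawler + parsers
        feedback(url, has_markup)             # feedback to the learner
        if has_markup:
            relevant.append(url)
        for link in outlinks:                 # newly discovered URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return relevant


# Made-up link graph: URL -> (page has markup, outgoing links).
TOY_WEB = {
    "http://a.example/": (True, ["http://a.example/p1", "http://b.example/"]),
    "http://a.example/p1": (True, []),
    "http://b.example/": (False, ["http://b.example/x"]),
    "http://b.example/x": (False, []),
}

# Hypothetical stand-in for the classifier: per-host balance of good vs. bad pages.
host_balance = defaultdict(int)


def toy_score(url):
    return host_balance[urlparse(url).netloc]


def toy_feedback(url, has_markup):
    host_balance[urlparse(url).netloc] += 1 if has_markup else -1


found = crawl(["http://a.example/"], TOY_WEB.__getitem__, toy_score,
              toy_feedback, budget=2)
```

In the toy run, the positive feedback from the seed page raises the score of its host, so the second fetch goes to another page of that host instead of the unexplored one.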
  • 11. Setup for Experiments
      • Data originates from the Common Crawl Corpus 2012
          • Including over 3.5 billion HTML pages
      • Extracted a subset of 5.5 million linked pages
          • Including 450k different hosts
      • Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
          • 27.5% of all pages
  • 12. Experiment Description
      Measure: number of relevant pages retrieved within the first 1 million pages crawled.
      1. Online vs. batch-based classification with 100K, 250K, and 1M pages
      2. Pure online classification vs. enhanced with bandit-based selection (λ = 0)
      3. Improvements with different λ
      4. Improvements with decaying λ
  • 13. Results: Online vs. Offline
      • Both methods outperform Breadth-First Search (BFS)
      • Static approach: 340K
      • Adaptive approach: 539K
      [Plot: percentage of relevant pages vs. fetched web pages]
  • 14. Results: Pure Online Classification vs. +Bandit-Based
      • Success-rate-based scoring functions show the most promising results
      • Negative Absolute Bad scoring performs like BFS
      • Success rate function: 628K
      • Pure online classification: 539K
      [Plot: percentage of relevant pages vs. fetched web pages]
  • 15. Results: λ > 0
      • Including randomness does not seem to have an effect
      • A beneficial effect of λ > 0 is shown, e.g., for the success rate function within the first 400K crawled pages
      [Plot: percentage of relevant pages vs. fetched web pages]
  • 16. Results: Decaying λ
      Decaying λ over time means reducing the randomness as more pages are crawled.
      • Success rate function with decaying λ = 0.5: 673K
      • Static λ: 628K
      [Plot: percentage of relevant pages vs. fetched web pages]
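The slides do not say which decay schedule was used, so the following is only one possible sketch. It assumes an exponential decay in which λ halves every `half_life` crawled pages. The function name `decayed_lambda` and the default half-life of 100,000 pages are hypothetical choices, not values from the talk.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Exploration rate after `pages_crawled` pages, halving every `half_life` pages."""
    return initial_lambda * 0.5 ** (pages_crawled / half_life)


# Starting value 0.5 as on the slide; the schedule itself is an assumption.
lam_start = decayed_lambda(0.5, 0)
lam_after_one_half_life = decayed_lambda(0.5, 100_000)
lam_late = decayed_lambda(0.5, 1_000_000)
```

The resulting value can be passed as the degree of randomness in each selection turn, so that the crawler explores more at the beginning and relies more on its gathered knowledge later.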
  • 17. Adaptation to a More Specific Objective
      • The general objective is narrowed down to pages that:
          • Make use of the markup language Microdata, and
          • Include at least five marked-up statements
      • Example:
          1. A page including information about a movie
          2. The movie has the name Se7en
          3. with a rating of 8.7 out of 10
          4. and it was released in 1995
          5. This information is maintained by imdb.com
  • 18. Results: Adaptation to a More Specific Objective
      • 3.5% of pages include such information
      • In general: beneficial effects observed using our approach
      • Static λ = 0.2: 120K
      • Decaying λ = 0.5: 108K
      [Plot: percentage of relevant pages vs. fetched web pages]
  • 19. Conclusion
      • Improvement of 26% over the pure online classification-based selection strategy for the general objective
      • Improvement of 66% for the more specific objective
      • Success-rate-based scoring functions show the most promising results for these objectives
  • 20. Open Challenges
      • Expand the approach to exploit results from one bandit for the other bandits (contextual bandits)
      • Introduce a more fine-grained grading of the crawled pages (multi-class problem)
      • Take into account the quality of the gathered information (besides richness)
      • Adapt the process to traditional topical focused crawling
      • Publish code and data to the community
  • 21. More Information
      • Paper accepted at the ACM International Conference on Information and Knowledge Management in Shanghai, China
          • ACM Digital Library: Focused Crawling for Structured Data
      • Detailed descriptions and source code:
          • Anthelion Webpage
      • Datasets:
          • Common Crawl Foundation Corpora
          • WebDataCommons Corpora