BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14

Vangelis Banos
Vangelis BanosSoftware engineer at Freelancer
BlogForever Crawler:
Techniques and Algorithms
to Harvest Modern Weblogs
Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3
1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland,
2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland,
3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
1
Contents
• Introduction:
– The disappearing web,
– Blog archiving.
• Our Contributions
• Algorithms
– Motivation,
– Blog content extraction,
– Extraction rules,
– Variations for authors, dates and comments.
• System Architecture
• Evaluation
– Comparison with 3 web article extraction systems.
• Issues and Future Work
2
The disappearing web
Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/
3
Blog archiving
1. Why archive the web?
– Web archiving is the process of collecting portions of
the World Wide Web to ensure the information
is preserved in an archive for future researchers, historians,
and the public.
2. Blog archiving is a special case of web archiving.
3. The blogosphere is a live record of contemporary Society,
Culture, Science and Economy.
4. Some blogs contain unique data and valuable information.
– Users take action and make important decisions based on
this information.
5. We have a Responsibility to preserve the web.
4
5
Blog crawlers
 Real-time monitoring
 Html data extraction engine
 Spam filtering
Unstructured
information
Original data and
XML metadata
Blog digital repository
 Digital preservation
 Quality assurance
 Collections curation
 Public access APIs
 Personalised services
 Information retrieval
 Public web interface /
Browse, search, export
Harvesting
PreservingManaging and reusing
Web services
Web interface
FP7 EC Funded Project
http://blogforever.eu/
Our Contributions
• A web crawler capable of extracting blog articles,
authors, publication dates and comments.
• A new algorithm to build extraction rules from
blog web feeds with linear time complexity,
• Applications of the algorithm to extract authors,
publication dates and comments,
• A new web crawler architecture, including how we
use a complete web browser to render JavaScript
web pages before processing them.
• Extensive evaluation of the content extraction and
execution time of our algorithm against three
state-of-the-art web article extraction algorithms.
6
Motivation
• Extracting metadata and content from HTML
documents is a challenging task.
– Web standards usage is low (<0.5% of websites).
– More than 95% of websites do not pass HTML validation.
• Having blogs as our target websites, we made the
following observations which play a central role in the
extraction process:
a) Blogs provide web feeds: structured and standardized
XML views of the latest posts of a blog,
b) Posts of the same blog share a similar HTML structure.
c) Web feeds usually have 10-20 posts whereas blogs
contain a lot more. We have to access more posts than
the ones referenced in web feeds.
7
Content Extraction Overview
1. Use blog web feeds and referenced HTML pages
as training data to build extraction rules.
2. Extraction rules capable of locating in HTML
page all RSS referenced elements such as:
1. Title,
2. Author,
3. Description,
4. Publication date,
3. Use the defined extraction rules to process all
blog pages.
8
Locate in HTML page all RSS referenced elements
9
Generic procedure to build extraction rules
10
Extraction rules and string similarity
• Rules are XPath queries.
• For each rule, we compute the score based on string similarity.
• The choice of ScoreFunction greatly influences the running time
and precision of the extraction process.
• Why we chose Sorensen–Dice coefficient similarity:
1. Low sensitivity to word ordering and length variations
2. Runs in linear time
11
Example: blog post title best extraction rule
• RSS feed: http://vbanos.gr/en/feed/
• Find RSS blog post title: “volumelaser.eim.gr” in html page:
http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/
12
XPath HTML Element Value Similarity
Score
/body/div[@id=“page”]/header
/h1
volumelaser.eim.gr 100%
/body/div[@id=“page”]/div[@cla
ss=“entry-code”]/p/a
http://volumelaser.eim.gr/ 80%
/head/title volumelaser.eim.gr | Βαγγέλης Μπάνος 66%
... ... ...
• The Best Extraction Rule for the blog post title is:
/body/div[@id=“page”]/header/h1
Time complexity and linear reformulation
13
Post-order traversal of
the HTML tree.
Compute node bigrams
from their children bigrams.
Variations for authors, dates,
comments
• Authors, dates and comments are special cases as
they appear many times throughout a post.
• To resolve this issue, we implement an extra
component in the Score function:
– For authors: an HTML tree distance between the
evaluated node and the post content node.
– For dates: we check the alternative formats of each
date in addition to the HTML tree distance between
the evaluated node and the post content node.
• Example: “1970-01-01” == “January 1 1970”
– For comments: we use the special comment RSS feed.
14
System Architecture
• Our crawler is built on top of Scrapy (http://www.scrapy.org/)
15
System Architecture
• Pipeline of operations:
1. Render HTML and JavaScript,
2. Extract content,
3. Extract comments,
4. Download multimedia files,
5. Propagate resulting records to the back-end.
• Interesting areas:
– Blog post page identification,
– Handle blogs with a large number of pages,
– JavaScript rendering,
– Scalability.
16
Blog post identification
• The crawler visits all blog pages.
• For each URL, it needs to identify whether it is
a blog post or not.
• We construct a regular expression based on
blog post RSS to identify blog posts.
• We assume that all posts from the same blog
use the same URL pattern.
• This assumption is valid for all blog platforms
we have encountered.
17
Handle blogs with a large number of pages
• Avoid random walk of pages, depth first search or
breadth first search.
• Use a priority queue with machine learning defined
priorities.
• Pages with a lot of blog post URLs have a higher
priority.
• Use Distance-Weighted kNN classifier to predict.
– Whenever a new page is downloaded, it is given to the
machine learning system as training data.
– When the crawler encounters a new URL, it will ask the
machine learning system for the potential number of blog
posts and use the value as the download priority of the
URL.
18
JavaScript rendering
• JavaScript is widely used client-side language.
• Traditional HTML based crawlers do not see web pages
using JavaScript.
• We embed PhantomJS, a headless web browser with
great performance and scripting capabilities.
• We instruct the PhantomJS browser to click dynamic
JavaScript pagination buttons on pages to retrieve
more content (e.g. Disqus Show More button to show
comments).
• This crawler functionality is non-generic and requires
human intervention to maintain and extend to other
cases.
19
Scalability
• When aiming to work with a large amount of
input, it is crucial to build every system layer
with scalability in mind.
• The two core crawler procedures NewCrawl
and UpdateCrawl are Stateless and Purely
Functional.
• All shared mutable state is delegated to the
back-end.
20
Evaluation
• Task:
– Extract articles and titles from web pages
• Comparison against three open-source projects:
– Readability (Javascript),
– Boilerpipe (Java),
– Goose (Scala).
• Criteria:
– Extraction success rate,
– Running time.
• Dataset:
– 2300 blog posts from 230 blogs obtained by the Spinn3r dataset.
• System:
– Debian linux 7.2, Intel Core i7-3770 3.4 GHz.
• All data, scripts and instructions to reproduce available at:
– https://github.com/OlivierBlanvillain/blogforever-crawler-publication
21
Evaluation: Extraction success rates
• BlogForever Crawler competitors are generic:
– They do not use RSS feeds.
– They do not use structural similarities between
web pages.
– They can be used with any HTML page.
22
Evaluation: Running time
• our approach spends the majority of its total running time between
the initialisation and the processing of the first blog post.
23
Issues & Future Work
• Our main causes of failure was:
– The insufficient quality of web feeds,
– The high structural variation of blog pages in the same
blog.
• Future Work
– Investigate hybrid extraction algorithms. Combine
with other techniques such as word density or spacial
reasoning.
– Large scale deployment of the software in a
distributed architecture.
24
Thank you!
BlogForever Crawler: Techniques and Algorithms
to Harvest Modern Weblogs
Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3
1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland,
2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland,
3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece
• Contact email: vbanos@gmail.com
• Project code available at:
– https://github.com/BlogForever/crawler
25
1 of 25

Recommended

CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013 by
CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013
CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013Vangelis Banos
4K views32 slides
SemaGrow demonstrator: “Web Crawler + AgroTagger” by
SemaGrow demonstrator: “Web Crawler + AgroTagger”SemaGrow demonstrator: “Web Crawler + AgroTagger”
SemaGrow demonstrator: “Web Crawler + AgroTagger”AIMS (Agricultural Information Management Standards)
1.6K views30 slides
Web crawler by
Web crawlerWeb crawler
Web crawlerpoonamkenkre
38.3K views43 slides
Web Crawler by
Web CrawlerWeb Crawler
Web Crawleriamthevictory
25.1K views19 slides
Working of a Web Crawler by
Working of a Web CrawlerWorking of a Web Crawler
Working of a Web CrawlerSanchit Saini
9.1K views8 slides
Colloquim Report on Crawler - 1 Dec 2014 by
Colloquim Report on Crawler - 1 Dec 2014Colloquim Report on Crawler - 1 Dec 2014
Colloquim Report on Crawler - 1 Dec 2014Sunny Gupta
924 views24 slides

More Related Content

What's hot

Web crawler synopsis by
Web crawler synopsisWeb crawler synopsis
Web crawler synopsisMayur Garg
2.1K views2 slides
Colloquim Report - Rotto Link Web Crawler by
Colloquim Report - Rotto Link Web CrawlerColloquim Report - Rotto Link Web Crawler
Colloquim Report - Rotto Link Web CrawlerAkshay Pratap Singh
1.8K views24 slides
Coding for a wget based Web Crawler by
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web CrawlerSanchit Saini
3.9K views10 slides
Smart crawler a two stage crawler by
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawlerRishikesh Pathak
600 views14 slides
WebCrawler by
WebCrawlerWebCrawler
WebCrawlermynameismrslide
2.8K views10 slides
Crawler-Friendly Web Servers by
Crawler-Friendly Web ServersCrawler-Friendly Web Servers
Crawler-Friendly Web Serverswebhostingguy
1.4K views16 slides

What's hot(16)

Web crawler synopsis by Mayur Garg
Web crawler synopsisWeb crawler synopsis
Web crawler synopsis
Mayur Garg2.1K views
Coding for a wget based Web Crawler by Sanchit Saini
Coding for a wget based Web CrawlerCoding for a wget based Web Crawler
Coding for a wget based Web Crawler
Sanchit Saini3.9K views
Crawler-Friendly Web Servers by webhostingguy
Crawler-Friendly Web ServersCrawler-Friendly Web Servers
Crawler-Friendly Web Servers
webhostingguy1.4K views
REST Methodologies by jrodbx
REST MethodologiesREST Methodologies
REST Methodologies
jrodbx1.2K views
Working with WebSPHINX Web Crawler by Sanchit Saini
Working with WebSPHINX Web Crawler Working with WebSPHINX Web Crawler
Working with WebSPHINX Web Crawler
Sanchit Saini3.8K views
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlus by Myka Kennedy Stephens
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlusLeveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
Leveraging Open Source Library Guides: Integrating Koha and SubjectsPlus
What is a web crawler and how does it work by Swati Sharma
What is a web crawler and how does it workWhat is a web crawler and how does it work
What is a web crawler and how does it work
Swati Sharma319 views
Website optimization with request reduce by Matt Wrock
Website optimization with request reduceWebsite optimization with request reduce
Website optimization with request reduce
Matt Wrock2.8K views
Upgrade webinar by ShanesCows
Upgrade webinarUpgrade webinar
Upgrade webinar
ShanesCows771 views

Viewers also liked

Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα by
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναΥπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναVangelis Banos
3.7K views29 slides
BlogForever Project presentation at MTSR2013 by
BlogForever Project presentation at MTSR2013BlogForever Project presentation at MTSR2013
BlogForever Project presentation at MTSR2013eimgreece
3.8K views18 slides
Can you save the web? Web Archiving! by
Can you save the web? Web Archiving!Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Vangelis Banos
3.4K views39 slides
Presentationjava by
PresentationjavaPresentationjava
PresentationjavaAmandeep Kaur
2.3K views37 slides
Price Comparison in turkey by
Price Comparison in turkeyPrice Comparison in turkey
Price Comparison in turkeySimplesurance
2.8K views17 slides
Opinioz_intern by
Opinioz_internOpinioz_intern
Opinioz_internSai Ganesh
127 views31 slides

Viewers also liked(14)

Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα by Vangelis Banos
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομέναΥπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Υπερδιαύγεια - Αναζήτηση στα δημόσια δεδομένα
Vangelis Banos3.7K views
BlogForever Project presentation at MTSR2013 by eimgreece
BlogForever Project presentation at MTSR2013BlogForever Project presentation at MTSR2013
BlogForever Project presentation at MTSR2013
eimgreece3.8K views
Can you save the web? Web Archiving! by Vangelis Banos
Can you save the web? Web Archiving!Can you save the web? Web Archiving!
Can you save the web? Web Archiving!
Vangelis Banos3.4K views
Price Comparison in turkey by Simplesurance
Price Comparison in turkeyPrice Comparison in turkey
Price Comparison in turkey
Simplesurance2.8K views
Opinioz_intern by Sai Ganesh
Opinioz_internOpinioz_intern
Opinioz_intern
Sai Ganesh127 views
Prolog Visualizer by Zhixuan Lai
Prolog VisualizerProlog Visualizer
Prolog Visualizer
Zhixuan Lai15.6K views
How Mobile drives Indonesian to do shopping discovery and price comparison by Thanawat Malabuppha
How Mobile drives Indonesian to do shopping discovery and price comparisonHow Mobile drives Indonesian to do shopping discovery and price comparison
How Mobile drives Indonesian to do shopping discovery and price comparison
Comparison of Plumb5 with other digital marketing tools by Veer Endra
Comparison of Plumb5 with other digital marketing toolsComparison of Plumb5 with other digital marketing tools
Comparison of Plumb5 with other digital marketing tools
Veer Endra16.3K views
An introduction to Storm Crawler by Julien Nioche
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
Julien Nioche3.4K views
Comparision between online & offline marketing by Sunil Kumar
Comparision between online & offline marketingComparision between online & offline marketing
Comparision between online & offline marketing
Sunil Kumar9K views

Similar to BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14

JS digest. Decemebr 2017 by
JS digest. Decemebr 2017JS digest. Decemebr 2017
JS digest. Decemebr 2017ElifTech
345 views29 slides
Aem offline content by
Aem offline contentAem offline content
Aem offline contentAshokkumar T A
251 views10 slides
Code for Startup MVP (Ruby on Rails) Session 1 by
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Henry S
5.9K views57 slides
Analysis of Google Page Speed Insight by
Analysis of Google Page Speed InsightAnalysis of Google Page Speed Insight
Analysis of Google Page Speed InsightSarvesh Sonawane
297 views34 slides
Spring Framework 3.2 - What's New by
Spring Framework 3.2 - What's NewSpring Framework 3.2 - What's New
Spring Framework 3.2 - What's NewSam Brannen
7.6K views79 slides
Web Crawlers by
Web CrawlersWeb Crawlers
Web CrawlersSuhasini S Kulkarni
1.4K views10 slides

Similar to BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14(20)

JS digest. Decemebr 2017 by ElifTech
JS digest. Decemebr 2017JS digest. Decemebr 2017
JS digest. Decemebr 2017
ElifTech345 views
Code for Startup MVP (Ruby on Rails) Session 1 by Henry S
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
Henry S5.9K views
Analysis of Google Page Speed Insight by Sarvesh Sonawane
Analysis of Google Page Speed InsightAnalysis of Google Page Speed Insight
Analysis of Google Page Speed Insight
Sarvesh Sonawane297 views
Spring Framework 3.2 - What's New by Sam Brannen
Spring Framework 3.2 - What's NewSpring Framework 3.2 - What's New
Spring Framework 3.2 - What's New
Sam Brannen7.6K views
Webcrawler by Govind Raj
Webcrawler Webcrawler
Webcrawler
Govind Raj3.9K views
Avtar's ppt by mak57
Avtar's pptAvtar's ppt
Avtar's ppt
mak57703 views
Arcomem training system-overview_advanced by arcomem
Arcomem training system-overview_advancedArcomem training system-overview_advanced
Arcomem training system-overview_advanced
arcomem551 views
Asp.Net 3 5 Part 1 by asim78
Asp.Net 3 5 Part 1Asp.Net 3 5 Part 1
Asp.Net 3 5 Part 1
asim782.4K views
Best Practices for Building WordPress Applications by Taylor Lovett
Best Practices for Building WordPress ApplicationsBest Practices for Building WordPress Applications
Best Practices for Building WordPress Applications
Taylor Lovett6.1K views
Scalability andefficiencypres by NekoGato
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
NekoGato523 views
Architecture Patterns - Open Discussion by Nguyen Tung
Architecture Patterns - Open DiscussionArchitecture Patterns - Open Discussion
Architecture Patterns - Open Discussion
Nguyen Tung4.2K views
Migrating Very Large Site Collections (SPSDC) by kiwiboris
Migrating Very Large Site Collections (SPSDC)Migrating Very Large Site Collections (SPSDC)
Migrating Very Large Site Collections (SPSDC)
kiwiboris1.2K views
Migrating very large site collections by kiwiboris
Migrating very large site collectionsMigrating very large site collections
Migrating very large site collections
kiwiboris2.1K views

More from Vangelis Banos

Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03 by
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Vangelis Banos
987 views46 slides
Αποθηκεύεται το διαδίκτυο; Web Archiving! by
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!Vangelis Banos
814 views39 slides
The theory and practice of Website Archivability by
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website ArchivabilityVangelis Banos
2.3K views22 slides
ΥπερΔιαύγεια by
ΥπερΔιαύγειαΥπερΔιαύγεια
ΥπερΔιαύγειαVangelis Banos
1.5K views15 slides
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana by
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaVangelis Banos
864 views33 slides
Η Ιστορία της Μετρολογίας by
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της ΜετρολογίαςVangelis Banos
1.4K views32 slides

More from Vangelis Banos(10)

Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03 by Vangelis Banos
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03
Vangelis Banos987 views
Αποθηκεύεται το διαδίκτυο; Web Archiving! by Vangelis Banos
Αποθηκεύεται το διαδίκτυο; Web Archiving!Αποθηκεύεται το διαδίκτυο; Web Archiving!
Αποθηκεύεται το διαδίκτυο; Web Archiving!
Vangelis Banos814 views
The theory and practice of Website Archivability by Vangelis Banos
The theory and practice of Website ArchivabilityThe theory and practice of Website Archivability
The theory and practice of Website Archivability
Vangelis Banos2.3K views
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana by Vangelis Banos
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
Vangelis Banos864 views
Η Ιστορία της Μετρολογίας by Vangelis Banos
Η Ιστορία της ΜετρολογίαςΗ Ιστορία της Μετρολογίας
Η Ιστορία της Μετρολογίας
Vangelis Banos1.4K views
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας by Vangelis Banos
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας ΜετρολογίαςΟ κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Ο κόσμος των μικρών & των μεγάλων μέσα από το βλέμμα της κας Μετρολογίας
Vangelis Banos409 views
Heterogeneity in european digital libraries, the europeana challenge by Vangelis Banos
Heterogeneity in european digital libraries, the europeana challengeHeterogeneity in european digital libraries, the europeana challenge
Heterogeneity in european digital libraries, the europeana challenge
Vangelis Banos480 views
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε... by Vangelis Banos
Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...Επιτυχημένα παραδείγματα διαλειτουργικότητας  σε ελληνικά αποθετήρια  και σχε...
Επιτυχημένα παραδείγματα διαλειτουργικότητας σε ελληνικά αποθετήρια και σχε...
Vangelis Banos640 views
Η τεχνική υποδομή του εθνικού συσσωρευτή by Vangelis Banos
Η τεχνική υποδομή του εθνικού συσσωρευτήΗ τεχνική υποδομή του εθνικού συσσωρευτή
Η τεχνική υποδομή του εθνικού συσσωρευτή
Vangelis Banos1.6K views

Recently uploaded

Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
80 views38 slides
"Surviving highload with Node.js", Andrii Shumada by
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada Fwdays
53 views29 slides
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPoolShapeBlue
84 views10 slides
The Power of Heat Decarbonisation Plans in the Built Environment by
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built EnvironmentIES VE
69 views20 slides
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...ShapeBlue
98 views29 slides
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...ShapeBlue
132 views15 slides

Recently uploaded(20)

"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays53 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue84 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE69 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue98 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue132 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue79 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson156 views
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue176 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue197 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10126 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu365 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue181 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash153 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue138 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue179 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software385 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty62 views

BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14

  • 1. BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3 1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland, 2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland, 3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece 1
  • 2. Contents • Introduction: – The disappearing web, – Blog archiving. • Our Contributions • Algorithms – Motivation, – Blog content extraction, – Extraction rules, – Variations for authors, dates and comments. • System Architecture • Evaluation – Comparison with 3 web article extraction systems. • Issues and Future Work 2
  • 3. The disappearing web Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/ 3
  • 4. Blog archiving 1. Why archive the web? – Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. 2. Blog archiving is a special case of web archiving. 3. The blogosphere is a live record of contemporary Society, Culture, Science and Economy. 4. Some blogs contain unique data and valuable information. – Users take action and make important decisions based on this information. 5. We have a Responsibility to preserve the web. 4
  • 5. 5 Blog crawlers  Real-time monitoring  Html data extraction engine  Spam filtering Unstructured information Original data and XML metadata Blog digital repository  Digital preservation  Quality assurance  Collections curation  Public access APIs  Personalised services  Information retrieval  Public web interface / Browse, search, export Harvesting PreservingManaging and reusing Web services Web interface FP7 EC Funded Project http://blogforever.eu/
  • 6. Our Contributions • A web crawler capable of extracting blog articles, authors, publication dates and comments. • A new algorithm to build extraction rules from blog web feeds with linear time complexity, • Applications of the algorithm to extract authors, publication dates and comments, • A new web crawler architecture, including how we use a complete web browser to render JavaScript web pages before processing them. • Extensive evaluation of the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms. 6
  • 7. Motivation • Extracting metadata and content from HTML documents is a challenging task. – Web standards usage is low (<0.5% of websites). – More than 95% of websites do not pass HTML validation. • Having blogs as our target websites, we made the following observations which play a central role in the extraction process: a) Blogs provide web feeds: structured and standardized XML views of the latest posts of a blog, b) Posts of the same blog share a similar HTML structure. c) Web feeds usually have 10-20 posts whereas blogs contain a lot more. We have to access more posts than the ones referenced in web feeds. 7
  • 8. Content Extraction Overview 1. Use blog web feeds and referenced HTML pages as training data to build extraction rules. 2. Extraction rules capable of locating in HTML page all RSS referenced elements such as: 1. Title, 2. Author, 3. Description, 4. Publication date, 3. Use the defined extraction rules to process all blog pages. 8
  • 9. Locate in HTML page all RSS referenced elements 9
  • 10. Generic procedure to build extraction rules 10
  • 11. Extraction rules and string similarity • Rules are XPath queries. • For each rule, we compute the score based on string similarity. • The choice of ScoreFunction greatly influences the running time and precision of the extraction process. • Why we chose Sorensen–Dice coefficient similarity: 1. Low sensitivity to word ordering and length variations 2. Runs in linear time 11
  • 12. Example: blog post title best extraction rule • RSS feed: http://vbanos.gr/en/feed/ • Find RSS blog post title: “volumelaser.eim.gr” in html page: http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/ 12 XPath HTML Element Value Similarity Score /body/div[@id=“page”]/header /h1 volumelaser.eim.gr 100% /body/div[@id=“page”]/div[@cla ss=“entry-code”]/p/a http://volumelaser.eim.gr/ 80% /head/title volumelaser.eim.gr | Βαγγέλης Μπάνος 66% ... ... ... • The Best Extraction Rule for the blog post title is: /body/div[@id=“page”]/header/h1
  • 13. Time complexity and linear reformulation 13 Post-order traversal of the HTML tree. Compute node bigrams from their children bigrams.
  • 14. Variations for authors, dates, comments • Authors, dates and comments are special cases as they appear many times throughout a post. • To resolve this issue, we implement an extra component in the Score function: – For authors: an HTML tree distance between the evaluated node and the post content node. – For dates: we check the alternative formats of each date in addition to the HTML tree distance between the evaluated node and the post content node. • Example: “1970-01-01” == “January 1 1970” – For comments: we use the special comment RSS feed. 14
  • 15. System Architecture • Our crawler is built on top of Scrapy (http://www.scrapy.org/) 15
  • 16. System Architecture • Pipeline of operations: 1. Render HTML and JavaScript, 2. Extract content, 3. Extract comments, 4. Download multimedia files, 5. Propagate resulting records to the back-end. • Interesting areas: – Blog post page identification, – Handle blogs with a large number of pages, – JavaScript rendering, – Scalability. 16
  • 17. Blog post identification • The crawler visits all blog pages. • For each URL, it needs to identify whether it is a blog post or not. • We construct a regular expression based on blog post RSS to identify blog posts. • We assume that all posts from the same blog use the same URL pattern. • This assumption is valid for all blog platforms we have encountered. 17
  • 18. Handle blogs with a large number of pages • Avoid random walk of pages, depth first search or breadth first search. • Use a priority queue with machine learning defined priorities. • Pages with a lot of blog post URLs have a higher priority. • Use Distance-Weighted kNN classifier to predict. – Whenever a new page is downloaded, it is given to the machine learning system as training data. – When the crawler encounters a new URL, it will ask the machine learning system for the potential number of blog posts and use the value as the download priority of the URL. 18
  • 19. JavaScript rendering • JavaScript is widely used client-side language. • Traditional HTML based crawlers do not see web pages using JavaScript. • We embed PhantomJS, a headless web browser with great performance and scripting capabilities. • We instruct the PhantomJS browser to click dynamic JavaScript pagination buttons on pages to retrieve more content (e.g. Disqus Show More button to show comments). • This crawler functionality is non-generic and requires human intervention to maintain and extend to other cases. 19
  • 20. Scalability • When aiming to work with a large amount of input, it is crucial to build every system layer with scalability in mind. • The two core crawler procedures NewCrawl and UpdateCrawl are Stateless and Purely Functional. • All shared mutable state is delegated to the back-end. 20
  • 21. Evaluation • Task: – Extract articles and titles from web pages • Comparison against three open-source projects: – Readability (Javascript), – Boilerpipe (Java), – Goose (Scala). • Criteria: – Extraction success rate, – Running time. • Dataset: – 2300 blog posts from 230 blogs obtained by the Spinn3r dataset. • System: – Debian linux 7.2, Intel Core i7-3770 3.4 GHz. • All data, scripts and instructions to reproduce available at: – https://github.com/OlivierBlanvillain/blogforever-crawler-publication 21
  • 22. Evaluation: Extraction success rates • BlogForever Crawler competitors are generic: – They do not use RSS feeds. – They do not use structural similarities between web pages. – They can be used with any HTML page. 22
  • 23. Evaluation: Running time • our approach spends the majority of its total running time between the initialisation and the processing of the first blog post. 23
  • 24. Issues & Future Work • Our main causes of failure was: – The insufficient quality of web feeds, – The high structural variation of blog pages in the same blog. • Future Work – Investigate hybrid extraction algorithms. Combine with other techniques such as word density or spacial reasoning. – Large scale deployment of the software in a distributed architecture. 24
  • 25. Thank you! BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs Olivier Blanvillain1, Nikos Kasioumis2, Vangelis Banos3 1Ecole Polytechnique Federale de Lausanne (EPFL) 1015 Lausanne, Switzerland, 2European Organization for Nuclear Research (CERN) 1211 Geneva 23, Switzerland, 3Department of Informatics, Aristotle University, Thessaloniki 54124, Greece • Contact email: vbanos@gmail.com • Project code available at: – https://github.com/BlogForever/crawler 25

Editor's Notes

  1. Our approach uses blog specific characteristics to build extraction rules which are applicable throughout a blog. Our approach uses blog specific characteristics to build extraction rules which are applicable throughout a blog.
  2. One might notice that each best rule computation is independent and operates on a different input pair. This implies that Algorithm 1 is embarrassingly parallel : iterations of the outer loop can trivially be executed on multiple threads. Functions in Algorithm 1 are voluntarily abstract at this point and will be explained in detail in the remaining of this section. Subsection 2.3 defines AllRules, Apply and the ScoreFunction we use for article extraction. In subsection 2.4 we analyse the time complexity of Algorithm 1 and give a linear time reformulation using dynamic programming. Finally, subsection 2.5 shows how the ScoreFunction can be adapted to extract authors, dates and comments.