Web archiving meeting 2013 blog archiving (Trochidis Ilias - Tero LTD)

Harvesting and archiving blog
content for current and future
generations
Ilias Trochidis
Tero LTD

State of the blogosphere
 Blogs have become fairly established as an online communication and web
publishing tool.
 Hundreds of millions of blogs are published about every conceivable subject.
Examples (12/9/2013):
WordPress.com
 70+ million sites in the world
 369 million people viewing more than 11.8 billion pages each month
 38 million new posts and 62.3 million new comments each month
Tumblr
 136.5 million blogs
 61 billion posts
 83.7 million daily posts
Trend
http://www.tumblr.com/press
http://en.wordpress.com/stats/

Lost resources shared on social
media
http://arxiv.org/abs/1209.3026

The disappearing web
The “Blogs of War: Weblogs as News” paper documented 29 blogs on
the Iraq war. Of those 29 blogs:
• 13 (45%) no longer existed on the Internet as of June 2012,
• Only 9 blogs (31%) still contained information on the Iraq war,
• 12 out of the 20 blogs (60%) that no longer exist were preserved by
the Internet Archive (however, there are problems such as missing
photos and comments that were not archived).
Blogs on major events have already been lost.
The average lifetime of a webpage is below 100 days.
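
The percentages quoted on this slide follow from simple ratios. A minimal check, assuming the denominators are 29 for the first two figures and 20 for the third, as the slide states (the helper name pct is only illustrative):

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Figures as quoted on the slide.
no_longer_online = pct(13, 29)        # blogs that no longer exist
still_on_topic = pct(9, 29)           # blogs still covering the Iraq war
preserved_by_archive = pct(12, 20)    # missing blogs kept by the Internet Archive

print(no_longer_online, still_on_topic, preserved_by_archive)
```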

Blog archiving:
objectives and concerns
Aim: harvest, preserve, manage and reuse blogs and
their resources
Issues: frequency of change, structure and
semantics of blogs, quantity and range of resources,
database-driven websites, ownership and DRM, and more
BlogForever: a blog archiving project co-funded by
the European Commission
(March 2011 – August 2013).

BlogForever Architecture
Blog crawler
 Real-time monitoring
 HTML data extraction engine
 Spam and noise filtering
 Web services extraction engine
Blog digital repository
 Digital preservation
 Quality assurance
 Collections curation
 Public access APIs
 Personalised services
 Information retrieval
 Public web interface / Browse, search, export
Stages: Harvesting; Preserving; Managing and reusing
Other diagram labels: Unstructured information; Web services; Blog APIs; XML metadata; Web services; Web interface
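
As an illustration of harvesting through a blog's web services rather than by scraping HTML, the sketch below parses an RSS 2.0 feed into post records and flags posts whose content has changed since the last crawl. It is not BlogForever code: SAMPLE_FEED, extract_posts, changed_posts and the blog.example.org URLs are hypothetical, and a real crawler would fetch the feed over HTTP and handle many more formats.

```python
import xml.etree.ElementTree as ET
from hashlib import sha256

# Hypothetical feed, standing in for what a blog's RSS endpoint might return.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first-post</link>
      <pubDate>Thu, 12 Sep 2013 10:00:00 GMT</pubDate>
      <description>Hello, world.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second-post</link>
      <pubDate>Fri, 13 Sep 2013 09:30:00 GMT</pubDate>
      <description>More text.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "pubDate", "description")


def extract_posts(feed_xml):
    """Turn each <item> of an RSS 2.0 feed into a dict of post fields."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        post = {name: (item.findtext(name) or "").strip() for name in FIELDS}
        # A content hash lets a later crawl detect edited posts cheaply.
        post["fingerprint"] = sha256(post["description"].encode("utf-8")).hexdigest()
        posts.append(post)
    return posts


def changed_posts(posts, seen):
    """Return posts that are new or whose fingerprint differs from `seen`.

    `seen` maps a post link to the fingerprint stored at the previous crawl.
    """
    return [p for p in posts if seen.get(p["link"]) != p["fingerprint"]]


posts = extract_posts(SAMPLE_FEED)
print([p["title"] for p in changed_posts(posts, seen={})])
```

On a first crawl the seen mapping is empty, so every post is reported; on later crawls only new or edited posts would need to be re-archived.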

BlogForever added value
 BlogForever structures the archived blog content. BlogForever
is not only about archiving HTML pages. It is about archiving
information entities (posts, comments, authors, metadata,
dates, pingbacks, etc.) based on the blog archiving data model.
 BlogForever is based on an open source, state-of-the-art
digital library management system developed by CERN
(Invenio).
 Better management of stored information, increasing the
utility of the archive (granularity of the collected information,
better and faster search, etc.).
 Added value services, e.g. sentiment analysis and analytics.
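
To make the idea of archiving information entities rather than whole pages concrete, here is a minimal sketch of what such a model could look like. It is an assumption for illustration only, not the BlogForever data model or any Invenio API: the classes Blog, Post and Comment, their fields, and the helper comments_by are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Comment:
    author: str
    posted: date
    text: str


@dataclass
class Post:
    title: str
    author: str
    published: date
    body: str
    comments: list = field(default_factory=list)


@dataclass
class Blog:
    url: str
    title: str
    posts: list = field(default_factory=list)


def comments_by(blog, author):
    """Return (post title, comment) pairs for every comment by `author`."""
    return [
        (post.title, comment)
        for post in blog.posts
        for comment in post.comments
        if comment.author == author
    ]


# Hypothetical archive content.
blog = Blog(url="http://blog.example.org", title="Example blog")
post = Post("First post", "alice", date(2013, 9, 12), "Hello, world.")
post.comments.append(Comment("bob", date(2013, 9, 13), "Nice post."))
post.comments.append(Comment("carol", date(2013, 9, 14), "Agreed."))
blog.posts.append(post)

print(comments_by(blog, "bob"))
```

Because comments and authors are stored as separate entities, a query such as "all comments by one author" is a simple filter, which would be hard to answer from archived HTML snapshots alone.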

BlogForever Impact
 Output: a simple blog archiving solution that any user, user group or
institution could use to preserve their collections of blogs, ensuring
authenticity, integrity, completeness, usability and long-term accessibility.
 Parties that will benefit: bloggers, universities, libraries and information
centres, museums, education, research, business.
 Examples:
 CERN is currently implementing a physics blogs repository.
 Aristotle University, Greece, has decided to create an academic blog
repository.
 The Linguistics department of the University of Hannover wants to know
how certain linguistic and textual phenomena and features have evolved
over time in internet communication.
 The L3S research centre, Hannover, will collaborate with Tero and AUTH to
combine the BlogForever and ARCOMEM projects in order to deliver a new
innovative web archiving platform.

Blog archiving support and
consultancy
Expected release in January 2014 by Tero

Cloud based blog archiving
Expected release in March/April 2014 by Tero

On demand analytics
Expected release in June 2014 by Tero

Future Work
 Collaborate with archives and institutions around
the world and raise awareness of the need to archive
the web. Share technologies and best practices.
 Support the sustainability of web archives
(e.g. convincing public funders to support
web archives – show what data can do).
 We are already archiving the web.

Thank you!
Any Questions?
it@tero.gr
http://twitter.com/itroch
Visit: http://www.tero.gr/en
http://blogforever.eu
