7. Web Archiving
The process of collecting portions of the
World Wide Web to ensure the information is
preserved in an archive for future researchers,
historians, and the public.
MTSR 2013, 22 Nov 2013, Thessaloniki
8. The challenge of web archiving
[Figure: Generic file archiving operation (labels: File(s), Software, Hardware, RECORD)]
9. The challenge of web archiving
[Figure: Web archiving operation (labels: Website, File(s), Software, Hardware, Record(s), ???)]
10. We are focusing on blogs
Blogs have become fairly established as an online
communication and web publishing tool.
Hundreds of millions of blogs are published about every
conceivable subject.
Examples 12/9/2013
70+ million sites in the world
369 million people viewing more than
11.8 billion pages each month
38 million new posts and 62.3 million
new comments each month
136.5 million blogs
61 billion posts
83.7 million daily posts
11. Blog Archiving: Objectives & Concerns
Blog characteristics:
Database driven, dynamic websites,
High frequency of updates,
Special structure, metadata, semantics & communication
protocols,
Highly interconnected,
Quantity and range of resources,
Ownership and DRM.
Our aims:
harvest, preserve, manage and reuse blogs and their
resources.
12. The BlogForever Project
Collaborative EC-funded project,
Duration: 1 Mar 2011 – 31 Aug 2013,
Aims: Theoretical and applied research on blog
archiving,
Coordinated by AUTH.
Partners:
13. BlogForever project achievements
BlogForever has created a novel blog archiving approach.
It is not only about archiving pages. It is about archiving information
entities (posts, comments, authors, metadata, dates, pingbacks, etc.).
Blog modelling and
semantics
Preservation strategies
Case studies and
validation
Implementation of the
BlogForever platform
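The idea of archiving information entities rather than whole pages can be illustrated with a minimal sketch. The class and field names below are hypothetical and chosen only for illustration; they are not the BlogForever data model itself.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Author:
    name: str
    profile_url: str = ""


@dataclass
class Comment:
    author: Author
    published: datetime
    text: str


@dataclass
class Post:
    url: str
    title: str
    author: Author
    published: datetime
    body: str
    comments: List[Comment] = field(default_factory=list)
    pingbacks: List[str] = field(default_factory=list)  # URLs of posts linking here


@dataclass
class Blog:
    url: str
    title: str
    posts: List[Post] = field(default_factory=list)

    def entity_count(self) -> int:
        """Count posts plus comments, i.e. the units stored individually."""
        return len(self.posts) + sum(len(p.comments) for p in self.posts)


# Hypothetical example data.
alice = Author("Alice")
post = Post("http://example.org/blog/1", "Hello", alice,
            datetime(2013, 11, 22), "First post")
post.comments.append(Comment(Author("Bob"), datetime(2013, 11, 23), "Nice!"))
blog = Blog("http://example.org/blog", "Example blog", [post])
print(blog.entity_count())
```

Because each post and comment is a separate object with its own author and date, an archive built this way can be queried per entity instead of per HTML page.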
14. BlogForever project achievements
Harvesting
Unstructured
information
Web services
Blog APIs
Blog crawlers
Real-time monitoring
HTML data extraction engine
Spam filtering
Web services extraction engine
Original data and XML metadata
Web services
Web interface
Managing and reusing
Blog digital repository
Preserving
Digital preservation
Quality assurance
Collections curation
Public access APIs
Personalised services
Information retrieval
Public web interface: browse, search, export
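As a rough illustration of the harvesting step, the sketch below extracts post entities from an RSS 2.0 feed and keeps only those not seen before. It assumes a blog exposes such a feed; the sample feed, the function names `extract_posts` and `new_posts`, and the dictionary keys are hypothetical and do not reproduce the BlogForever crawler.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed standing in for a live blog feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/blog/1</link>
      <pubDate>Fri, 22 Nov 2013 10:00:00 GMT</pubDate>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/blog/2</link>
      <pubDate>Sat, 23 Nov 2013 10:00:00 GMT</pubDate>
      <description>More text</description>
    </item>
  </channel>
</rss>"""


def extract_posts(feed_xml):
    """Turn each <item> of an RSS 2.0 feed into a dict describing one post."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # RSS dates use the RFC 822 format; missing dates stay None.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
            "summary": item.findtext("description", default=""),
        })
    return posts


def new_posts(posts, seen_urls):
    """Keep only posts whose URL has not been harvested yet."""
    return [p for p in posts if p["url"] not in seen_urls]


posts = extract_posts(SAMPLE_FEED)
fresh = new_posts(posts, {"http://example.org/blog/1"})
print([p["title"] for p in fresh])
```

Filtering against already harvested URLs is one simple way to cope with frequently updated blogs without re-archiving unchanged posts.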
15. BlogForever Added Value
BlogForever structures the archived blog content. It is
not only about archiving HTML pages. It is about archiving
information entities (posts, comments, authors, metadata,
dates, pingbacks, etc.) based on a dedicated data model.
BlogForever is based on Invenio, an open-source, state-of-the-art
digital library management system developed by CERN.
Better metadata and higher information granularity.
Open standards and interoperability (MARCXML, web services).
Better management of archived information, increasing the
utility of the web archive.
Makes it easy to build added-value services, e.g. analytics.
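To make the MARCXML point concrete, here is a minimal sketch that serialises one post as a MARCXML record. The field choices (100 $a for the author, 245 $a for the title, 856 $u for the URL) follow common MARC 21 usage and are an assumption; the actual BlogForever mapping may differ, and `post_to_marcxml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace


def post_to_marcxml(title, author, url):
    """Serialise basic post metadata as a MARCXML <record> string."""
    # Emit the MARC namespace as the default namespace (no prefix).
    ET.register_namespace("", MARC_NS)
    record = ET.Element(f"{{{MARC_NS}}}record")

    def add(tag, code, value):
        # One datafield with a single subfield and blank indicators.
        datafield = ET.SubElement(
            record, f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": " ", "ind2": " "})
        subfield = ET.SubElement(
            datafield, f"{{{MARC_NS}}}subfield", {"code": code})
        subfield.text = value

    add("100", "a", author)  # assumed mapping: main author
    add("245", "a", title)   # assumed mapping: title
    add("856", "u", url)     # assumed mapping: original location
    return ET.tostring(record, encoding="unicode")


xml_record = post_to_marcxml("Hello", "Alice", "http://example.org/blog/1")
print(xml_record)
```

Storing metadata in an open format like this is what lets other library systems and web services consume the archived records.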
16. BlogForever Impact
Blog archiving methods and policies that
are reusable and generic.
A blog archiving solution that any institution
could use to preserve its collections of
blogs, ensuring authenticity, integrity,
completeness, usability, and long-term accessibility.
A blog archiving solution that any researcher
could use to gather, analyse and reuse blog
data.
17. BlogForever Applications
CERN is currently implementing a high energy
physics blogs repository.
AUTH is designing an academic blogs repository.
The Linguistics Department of the University of
Hannover is doing a diachronic analysis on certain
linguistic and textual phenomena / features using
German blogs.
The University of Warwick Computer Science
Department is doing social web analytics using blog
data.
18. Thank you!
Visit http://blogforever.eu
Access all BlogForever Deliverables (Open Access).
Download the Open Source BlogForever Platform.
Contact us:
Project Manager: Vangelis Banos vbanos@gmail.com
Exploitation Manager: Efstratios Arampatzis sa@tero.gr
Editor's Notes
The key BlogForever project goals were fully achieved within the project's time span through a series of theoretical and applied research tasks. Initially, BlogForever focused on studying weblog structure and semantics and began developing preservation strategies for weblogs. Later, the focus gradually moved to implementing the BlogForever platform, along with interoperability prospects and digital rights management strategies. An important aspect of the project was the design and implementation of extensive case studies of varying complexity and size to validate and test the BlogForever platform. BlogForever created a new system to harvest, preserve and manage blog content, developing new insights through its restructuring and reuse. To this end, it stepped into previously uncharted theoretical and practical aspects of blog preservation: it first researched blog structure and semantics; it then defined solid blog preservation policies and developed a robust blog preservation software platform; finally, it validated the platform through specific case studies using real-world data.
After working on what to preserve and how to preserve it, we present how we implemented blog preservation.