Harvesting and archiving blog
content for current and future
generations
Ilias Trochidis
Tero LTD
Research projects
State of the blogosphere
 Blogs have become fairly established as an online communication and web
publishing tool.
 Hundreds of millions of blogs are published about every conceivable subject.
Examples (12/9/2013)
WordPress (http://en.wordpress.com/stats/):
70+ million sites in the world
369 million people viewing more than 11.8 billion pages each month
38 million new posts and 62.3 million new comments each month
Tumblr (http://www.tumblr.com/press):
136.5 million blogs
61 billion posts
83.7 million daily posts
Trend
Lost resources shared on social
media
http://arxiv.org/abs/1209.3026
The disappearing web
The paper “Blogs of War: Weblogs as News” documented 29 blogs on
the Iraq war. Of those 29 blogs, as of June 2012:
• 13 (45%) no longer existed on the Internet,
• Only 9 (31%) still contained information on the Iraq war,
• 12 out of the 20 (60%) blogs that no longer exist were preserved
by the Internet Archive (although with problems such as missing
photos and comments that were not archived).
Blogs on major events have already been lost.
The average lifetime of a web page is below 100 days.
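A check like the one described above (does a blog still have a preserved copy?) can be sketched in a few lines of Python. This is only an illustration, not the method used in the cited paper. It assumes the Internet Archive's public Wayback Machine availability endpoint and the JSON shape it returns, as far as we understand them; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability service.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp (YYYYMMDD) is optional."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the URL of the closest archived snapshot, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the service over the network (not exercised in tests)."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20120601")` would ask for the snapshot closest to June 2012. A `None` result only means no snapshot was reported, and a reported snapshot may still lack photos or comments, as noted above.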
Blog archiving:
objectives and concerns
Aim: harvest, preserve, manage and reuse blogs and
their resources
Issues: Frequency of change, structure and
semantics of blogs, quantity and range of resources,
database-driven websites, ownership and DRM, and more
BlogForever: a blog archiving project co-funded by
the European Commission
(March 2011 – August 2013).
BlogForever Architecture
Blog crawler
 Real-time monitoring
 HTML data extraction engine
 Spam and noise filtering
 Web services extraction engine
Crawler data (diagram labels): unstructured information, web services, blog APIs, XML metadata
Blog digital repository
 Digital preservation
 Quality assurance
 Collections curation
 Public access APIs
 Personalised services
 Information retrieval
 Public web interface: browse, search, export
Stages (diagram labels): harvesting, preserving, managing and reusing
Access (diagram labels): web services, web interface
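To make the crawler side of the architecture more concrete, here is a minimal sketch of extracting structured records from a blog's RSS 2.0 feed. It is not the BlogForever crawler: the function name, the record fields and the sample feed are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used by blog feeds for the post author.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first</link>
      <pubDate>Thu, 12 Sep 2013 10:00:00 GMT</pubDate>
      <dc:creator>alice</dc:creator>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second</link>
    </item>
  </channel>
</rss>"""


def extract_posts(rss_xml):
    """Turn each <item> of an RSS 2.0 feed into a plain dict."""
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            # findtext returns None for missing elements, hence "or ''".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            "author": (item.findtext(DC_CREATOR) or "").strip(),
        })
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_FEED):
        print(post["title"], post["link"])
```

A feed gives clean fields where it provides them. Missing ones, such as the author of the second post, come back as empty strings and would have to be recovered from the HTML page instead.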
BlogForever added value
 BlogForever structures the archived blog content. BlogForever
is not only about archiving HTML pages. It is about archiving
information entities (posts, comments, authors, metadata,
dates, pingbacks, etc) based on the blog archiving data model.
 BlogForever is based on an open source state-of-the-art
digital library management system developed by CERN
(Invenio).
 Better management of stored information increasing the
utility of the archive (granularity of the collected information,
better and faster search, etc.).
 Added value services e.g. sentiment analysis and analytics.
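The difference between archiving pages and archiving information entities can be illustrated with a small sketch. The classes below are a deliberately simplified, hypothetical model, not the BlogForever blog archiving data model. They only show why storing posts, comments and authors as separate entities makes fine-grained queries straightforward.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    author: str
    text: str


@dataclass
class Post:
    title: str
    author: str
    published: str  # ISO 8601 date string, e.g. "2013-09-12"
    text: str
    comments: List[Comment] = field(default_factory=list)


@dataclass
class Blog:
    url: str
    posts: List[Post] = field(default_factory=list)


def posts_by_author(blog, author):
    """Return all posts written by the given author."""
    return [p for p in blog.posts if p.author == author]


def comments_mentioning(blog, term):
    """Return (post title, comment) pairs whose comment text contains term."""
    term = term.lower()
    return [
        (p.title, c)
        for p in blog.posts
        for c in p.comments
        if term in c.text.lower()
    ]
```

With flat HTML snapshots, both queries would require re-parsing every stored page. With entities, they are simple filters over the stored records.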
BlogForever Impact
 Output: a simple blog archiving solution that any user, user group or
institution could use to preserve their collections of blogs ensuring:
 authenticity, integrity, completeness, usability, long-term accessibility
 Parties that will benefit: Bloggers, Universities, Libraries & Information
Centres, Museums, Education, Research, Business
 Examples:
 CERN is currently implementing a physics blogs repository,
 Aristotle University, Greece, has decided to create an academic blog
repository,
 The Linguistics department of the University of Hannover wants to know
how certain linguistic and textual features have evolved over time
in internet communication,
 L3S research centre, Hannover will collaborate with Tero and AUTH to
combine BlogForever and ARCOMEM projects in order to deliver a new
innovative web archiving platform.
Blog archiving support and
consultancy
Expected release in January 2014 by Tero
Cloud-based blog archiving
Expected release in March/April 2014 by Tero
On-demand analytics
Expected release in June 2014 by Tero
Future Work
 Collaborate with archives and institutions around
the world and raise awareness of the need to archive the
web. Share technologies and best practices.
 Support the sustainability of web archives
(e.g. convincing public funders to support
web archives – show what data can do).
 We are already archiving the web.
Thank you!
Any Questions?
it@tero.gr
http://twitter.com/itroch
Visit: http://www.tero.gr/en
http://blogforever.eu

Web archiving meeting 2013 blog archiving (Trochidis Ilias - Tero LTD)


Editor's Notes

  • #4 This slide shows that blogging is still growing. Blogger is the largest of these sites, with more than 46 million unique U.S. visitors during October 2011, making it second only to Facebook in the social networking category. Google does not disclose how many Blogspot blogs exist. Facebook and microblogging sites such as Twitter have supported the growth of blogs by delivering traffic to content that originated in blogs.
  • #5 After the first year: 11% lost and 20% archived After two and a half years: 27% lost and 41% archived
  • #9 1. Current web preservation initiatives are geared towards aggregating and preserving files, not information entities. For instance, the Internet Archive aggregates web pages and stores them in WARC files (ISO 28500:2009), compressed files similar to zip which are assigned a unique identification number and stored in a distributed file system. WARC also supports some metadata, such as provenance and HTTP protocol metadata. However, implicit page elements are completely ignored, such as:
    · Page title, headers, content, author information,
    · Metadata such as Dublin Core elements,
    · RSS feeds and other Semantic Web technologies such as Microformats (Khare R.) and Microdata (Ronallo J.).
    This greatly affects the way stored information is managed, reducing the utility of the archive and hindering the creation of added-value services.
    2. Current web archiving efforts disregard the preservation of social networks and of the interrelations between archived content. However, weblog interdependencies, demonstrated by the identification of central actors and peripheral weblogs as well as by the meme effect that applies to them, need to be preserved to provide meaningful features to the weblog repository.
    3. Current web archives are limited in scope to monolithic regions, subjects or events. There is no generic web archiving solution capable of implementing arbitrary subjects and topic hierarchies. For instance, the National Library of Catalonia has initiated a web crawling and access project aiming to collect, process and provide permanent access to the entire cultural, scientific and general output of Catalonia in digital format (PADICAT). Alternatively, the Library of Congress has developed online collections for isolated historical events such as September 11, 2001 (Library of Congress).
    There is an ongoing debate about the benefits and disadvantages of the different long-term preservation methodologies. Many papers have been written and many conferences have been dedicated to this issue. It is surprising, however, how little has been done at a practical level.
  • #10 Mention the advantages of the archives