Web archiving meeting 2013 blog archiving (Trochidis Ilias - Tero LTD)

A software platform for blog archiving (BlogForever project).

Published in: Technology
Notes
  • This slide shows that blogging is still growing. Blogger is the largest of these sites, with more than 46 million unique U.S. visitors during October 2011, making it second only to Facebook in the social networking category. Google does not disclose how many Blogspot blogs exist. Facebook and microblogging sites such as Twitter have supported the growth of blogs by delivering traffic to content that originated in blogs.
  • After the first year: 11% lost and 20% archived. After two and a half years: 27% lost and 41% archived.
  • 1. Current web preservation initiatives are geared towards aggregating and preserving files, not information entities. For instance, the Internet Archive aggregates web pages and stores them in WARC files (ISO 28500:2009), compressed files similar to zip, which are assigned a unique identification number and stored in a distributed file system. WARC also supports some metadata, such as provenance and HTTP protocol metadata. Implicit page elements are completely ignored, such as: page title, headers, content and author information; metadata such as Dublin Core elements; RSS feeds and other Semantic Web technologies such as Microformats (Khare R.) and Microdata (Ronallo J.). This greatly affects how stored information is managed, reducing the utility of the archive and hindering the creation of added-value services.
    2. Current web archiving efforts disregard the preservation of social networks and of the interrelations between archived content. However, weblog interdependencies, demonstrated by the identification of central actors and peripheral weblogs as well as by the meme effect that applies to them, need to be preserved to provide meaningful features to the weblog repository.
    3. Current web archive scope is limited to monolithic regions, subjects or events. There is no generic web archiving solution capable of implementing arbitrary subjects and topic hierarchies. For instance, the National Library of Catalonia has initiated a web crawling and access project aiming to collect, process and provide permanent access to the entire cultural, scientific and general output of Catalonia in digital format (PADICAT). Alternatively, the Library of Congress has developed online collections for isolated historical events such as September 11, 2001 (Library of Congress).
    There is an ongoing debate about the benefits and disadvantages of each long-term preservation methodology. Many papers have been written and many conferences have been dedicated to this issue. It is surprising, however, how little has been done at a practical level.
  • Mention the advantages of the archives
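The notes above list implicit page elements (page title, Dublin Core metadata) that file-level archiving leaves unparsed. As a rough illustration of what extracting such elements could look like, here is a minimal Python sketch using only the standard library. The class name `PageElementParser`, the function `extract_page_elements`, and the sample HTML are hypothetical and are not taken from BlogForever; the sketch also assumes Dublin Core fields appear as `<meta name="DC.…">` tags, which is one common convention.

```python
from html.parser import HTMLParser


class PageElementParser(HTMLParser):
    """Collects the page title and Dublin Core <meta> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.dublin_core = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumption: Dublin Core fields are encoded as name="DC.xxx".
            name = attr.get("name") or ""
            if name.lower().startswith("dc."):
                self.dublin_core[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_page_elements(html):
    """Return the title and Dublin Core metadata found in an HTML string."""
    parser = PageElementParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "dublin_core": parser.dublin_core}


# Hypothetical sample page, made up for this example.
SAMPLE_HTML = """<html><head>
<title>Example blog post</title>
<meta name="DC.creator" content="A. Blogger">
<meta name="DC.date" content="2013-01-01">
</head><body><p>Post body</p></body></html>"""

elements = extract_page_elements(SAMPLE_HTML)
print(elements)
```

A real harvester would also need to handle RSS feeds, Microformats and Microdata, which this sketch does not attempt.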
  • Transcript

    • 1. Harvesting and archiving blog content for current and future generations Ilias Trochidis Tero LTD
    • 2. Tero LTD Research projects
    • 3. State of the blogosphere. Blogs have become fairly established as an online communication and web publishing tool. Hundreds of millions of blogs are published about every conceivable subject. Examples (12/9/2013): 70+ million sites in the world; 369 million people viewing more than 11.8 billion pages each month; 38 million new posts and 62.3 million new comments each month; 136.5 million blogs; 61 billion posts; 83.7 million daily posts. Trend. Sources: http://www.tumblr.com/press http://en.wordpress.com/stats/
    • 4. Lost resources shared on social media http://arxiv.org/abs/1209.3026
    • 5. The disappearing web. The paper “Blogs of War: Weblogs as News” documented 29 blogs on the Iraq war. Of those 29 blogs:
      • 13 (45%) no longer existed on the Internet as of June 2012,
      • only 9 blogs (31%) still contained information on the Iraq war,
      • 12 out of the 20 (60%) blogs that don’t exist were preserved by the Internet Archive (however, there are problems with missing photos, comments not archived, etc.).
      Blogs on major events have already been lost. The average lifetime of a web page is below 100 days.
    • 6. Blog archiving: objectives and concerns. Aim: harvest, preserve, manage and reuse blogs and their resources. Issues: frequency of change, structure and semantics of blogs, quantity and range of resources, database-driven websites, ownership and DRM, and more. BlogForever: a blog archiving project co-funded by the European Commission (March 2011 – August 2013).
    • 7. BlogForever Architecture
      • Harvesting (blog crawler): real-time monitoring; HTML data extraction engine; spam and noise filtering; web services extraction engine.
      • Diagram labels: unstructured information, web services, blog APIs, XML metadata.
      • Preserving, managing and reusing (blog digital repository): digital preservation; quality assurance; collections curation; public access APIs; personalised services; information retrieval; public web interface (browse, search, export).
      • Access: web services, web interface.
    • 8. BlogForever added value
      • BlogForever structures the archived blog content. BlogForever is not only about archiving HTML pages. It is about archiving information entities (posts, comments, authors, metadata, dates, pingbacks, etc.) based on the blog archiving data model.
      • BlogForever is based on an open source, state-of-the-art digital library management system developed by CERN (Invenio).
      • Better management of stored information, increasing the utility of the archive (granularity of the collected information, better and faster search, etc.).
      • Added-value services, e.g. sentiment analysis and analytics.
    • 9. BlogForever Impact
      • Output: a simple blog archiving solution that any user, user group or institution could use to preserve their collections of blogs, ensuring authenticity, integrity, completeness, usability and long-term accessibility.
      • Parties that will benefit: bloggers, universities, libraries and information centres, museums, education, research, business.
      • Examples:
        • CERN is currently implementing a physics blogs repository.
        • Aristotle University, Greece, has decided to create an academic blog repository.
        • The Linguistics department of the University of Hannover wants to know how certain linguistic and textual phenomena and features have evolved over time within internet communication.
        • L3S research centre, Hannover, will collaborate with Tero and AUTH to combine the BlogForever and ARCOMEM projects in order to deliver a new, innovative web archiving platform.
    • 10. Blog archiving support and consultancy Expected release in January 2014 by Tero
    • 11. Cloud based blog archiving Expected release in March/April 2014 by Tero
    • 12. On demand analytics Expected release in June 2014 by Tero
    • 13. Future Work
      • Collaborate with archives and institutions around the world and spread awareness of the need for archiving the web. Share technologies and best practices.
      • Support the sustainability of web archives (e.g. convincing public funders to support web archives by showing what data can do).
      • We are already archiving the web.
    • 14. Thank you! Any Questions? it@tero.gr http://twitter.com/itroch Visit: http://www.tero.gr/en http://blogforever.eu
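Slide 8 contrasts archiving HTML pages with archiving information entities (posts, comments, authors, dates, pingbacks). The following Python sketch shows one way such entities might be represented so that they can be queried individually. All class names, field names and sample values here are hypothetical illustrations; this is not the BlogForever data model or the Invenio schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Comment:
    author: str
    date: str
    text: str


@dataclass
class Post:
    title: str
    author: str
    date: str
    content: str
    comments: List[Comment] = field(default_factory=list)
    pingbacks: List[str] = field(default_factory=list)  # URLs of linking posts


@dataclass
class Blog:
    url: str
    title: str
    posts: List[Post] = field(default_factory=list)


def posts_by_author(blog, author):
    """Return the posts of a blog written by the given author."""
    return [post for post in blog.posts if post.author == author]


# Made-up sample data, for illustration only.
blog = Blog(
    url="http://example.org/blog",
    title="Example blog",
    posts=[
        Post(
            title="First post",
            author="A. Blogger",
            date="2013-01-01",
            content="Hello, world.",
            comments=[Comment("A. Reader", "2013-01-02", "Nice post.")],
        ),
        Post(
            title="Guest post",
            author="B. Guest",
            date="2013-01-05",
            content="Thanks for having me.",
        ),
    ],
)

print([post.title for post in posts_by_author(blog, "A. Blogger")])
print(asdict(blog)["posts"][0]["comments"])
```

Because each entity is a separate record, a repository built along these lines could search by author or date directly instead of re-parsing stored pages, which is the kind of granularity the slide refers to.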
