Blogger is the largest of these sites, with more than 46 million unique U.S. visitors during October 2011, making it second only to Facebook in the social networking category. tumblr.com counts 38,884,272 total blogs with 53,399,798 posts on the 29th of December, whereas in July 2009 the number of posts per day was 650,000. Facebook and microblogging sites such as Twitter have supported the growth of blogs by delivering traffic to content that originated in blogs.
1. Current web preservation initiatives are geared towards aggregating and preserving files and not information entities. For instance, the Internet Archive aggregates web pages and stores them in WARC files (ISO 28500:2009), compressed files similar to zip which are assigned a unique identification number and stored in a distributed file system. Additionally, WARC supports some metadata such as provenance and HTTP protocol metadata. Implicit page elements, such as:
· Page title, headers, content, author information,
· Metadata such as Dublin Core elements,
· RSS feeds and other Semantic Web technologies such as Microformats (Khare R.) and Microdata (Ronallo J.)
are completely ignored. This greatly affects the way stored information is managed, reducing the utility of the archive and hindering the creation of added-value services.
2. Current web archiving efforts disregard the preservation of Social Networks and of interrelations between the archived content. However, weblog interdependencies, demonstrated by the identification of central actors and peripheral weblogs as well as by the meme-effect that applies to them, need to be preserved in order to provide meaningful features to the weblog repository.
3. Current web archive scope is limited to monolithic regions, subjects or events. There is no generic web archiving solution capable of implementing arbitrary subjects and topic hierarchies. For instance, the National Library of Catalonia has initiated a web crawling and access project aiming to collect, process and provide permanent access to the entire cultural, scientific and general output of Catalonia in digital format (PADICAT). Alternatively, the Library of Congress has developed online collections for isolated historical events such as September 11, 2001 (Library of Congress).
There is an ongoing debate about the benefits and disadvantages of one long-term preservation methodology or another. Many papers have been written on this issue and many conferences have been dedicated to it. It is surprising, however, how little has been done at a practical level.
Mention the advantages of the archives
Transcript of "BlogForever eChallenges 2012"
The Need for Long Term Preservation of Weblogs: the BlogForever Project
Ilias Trochidis, Aristotle University of Thessaloniki, Greece
Workshop 7c, 18 October 2012, eChallenges e-2012. Copyright 2012 BlogForever
State of the Blogosphere
• Blogs have become fairly established as an online communication and web publishing tool.
• Hundreds of millions of blogs are published about every conceivable subject.
[Chart: number of new blogs created on a weekly basis on WordPress.com]
The problem of Blog Preservation
• Despite the fast growth of the blogosphere, there is still no effective solution for ubiquitous semantic weblog archiving, digital preservation, management and dissemination:
– Current web preservation initiatives are geared towards aggregating and preserving HTML pages and not information entities (posts, comments, authors, metadata, dates, pingbacks, etc.)
– Current web archiving efforts disregard the preservation of Social Networks and interrelations between the archived content (meme-effect)
– Current web archives cannot identify topics, subjects or events (monolithic). There is no generic web archiving solution capable of implementing arbitrary subjects and topic hierarchies.
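To make the contrast between preserving pages and preserving information entities concrete, here is a minimal sketch using only the Python standard library. It is not the BlogForever implementation: the class name `BlogEntityParser`, the function `extract_entities` and the sample page are hypothetical, and the sketch assumes the blog exposes Dublin Core style `<meta>` elements and an RSS `<link>`, which many blogs do not.

```python
from html.parser import HTMLParser


class BlogEntityParser(HTMLParser):
    """Hypothetical helper: collects a few information entities from one page."""

    def __init__(self):
        super().__init__()
        self.entities = {"title": "", "meta": {}, "feeds": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            # Covers Dublin Core style names such as "DC.creator" or "DC.date".
            self.entities["meta"][attrs["name"]] = attrs["content"]
        elif tag == "link" and attrs.get("type") == "application/rss+xml":
            # RSS feed advertised by the page.
            self.entities["feeds"].append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.entities["title"] += data


def extract_entities(html):
    """Return the title, named meta elements and feed links found in html."""
    parser = BlogEntityParser()
    parser.feed(html)
    parser.close()
    return parser.entities


# Made-up sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Notes from the field</title>
<meta name="DC.creator" content="Jane Doe">
<meta name="DC.date" content="2012-06-01">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
</head><body><p>Post body</p></body></html>"""

entities = extract_entities(SAMPLE_PAGE)
print(entities)
```

A file-oriented archive would store `SAMPLE_PAGE` as an opaque record; an entity-oriented archive would also keep the extracted title, author, date and feed link so they can be searched and linked.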
The disappearing web
http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/
Blog archiving evaluation
• Example: the paper “Blogs of War: Weblogs as News” documented 29 blogs on the Iraq war.
• Of those 29 blogs, as of June 2012:
– 13 (45%) no longer exist on the Internet
– Only 9 (31%) still contain information on the Iraq war
– 12 of the 20 blogs (60%) that no longer exist or no longer carry Iraq war content were preserved by the Internet Archive (with problems such as missing photos and comments not archived)
• Blogs on major events have already been lost
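The percentages on this slide can be checked with a few lines of Python. The sketch assumes that the 20 blogs in the last figure are the 29 documented blogs minus the 9 that still carry Iraq war content; the slide does not state this explicitly, and the variable names are illustrative.

```python
total_blogs = 29        # blogs documented in the paper
gone = 13               # no longer exist on the Internet
still_on_topic = 9      # still contain information on the Iraq war
archived = 12           # preserved by the Internet Archive

# Assumption: the "20" on the slide is the remainder after the 9 on-topic blogs.
no_longer_on_topic = total_blogs - still_on_topic


def pct(part, whole):
    """Percentage of part in whole, rounded to the nearest integer."""
    return round(100 * part / whole)


print(pct(gone, total_blogs))                # share of blogs that are gone
print(pct(still_on_topic, total_blogs))      # share still covering the war
print(pct(archived, no_longer_on_topic))     # share of lost content archived
```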
The BlogForever architecture
Impact
• Output: a simple weblog archiving solution that any user, user group or institution could use to preserve their collections of weblogs, ensuring:
– authenticity, integrity, completeness, usability, long term accessibility
• Parties that will benefit: Bloggers, Universities, Libraries & Information Centres, Museums, Education, Research, Business
• Examples:
– CERN will create a repository with all physics blogs
– National Documentation Centre of Greece will create a repository with academic blogs
– a National Library of Medicine would like to preserve a collection of health and medicine blogs
Business Model
• BlogForever as a service (a single installation that can be used as a service by users and institutions)
• BlogForever as software (open source distribution)
• Universities, Research Institutes, Archives, Governments and Blog Communities will be able to easily preserve their collections of weblogs
• BlogForever will assure the preservation, aggregation, management and dissemination of these collections
• Do you need to preserve some blogs? We can set up a BlogForever archive for you.
Future Work
• Analyse blog archives in order to gain a better understanding of the content and provide new services:
– Use Linked Open Data to link archived blog content with other web content
– Apply Semantic Extension of Tags to understand them better and reuse them for multiple purposes
• In any case, use Ontologies to interpret and reason with information
• Data mining in order to extract information from the archives and transform it into an understandable structure for further use
• Brand reputation management and market sector repute analysis
Thank you! Any Questions?
Visit http://blogforever.eu to learn more.
http://twitter.com/blogforever
http://facebook.com/BlogForever
The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.