Good afternoon everyone, it’s great to be here at LIBER and thank you for coming to listen. My name is Catherine Ryan and I’m from the National Library of Ireland and this is Chloe Martin, from Internet Memory. We are delighted to be here (and I am doubly delighted to be able to afford to be here) to talk to you about our collaborative web archiving project that took place last year.
First of all we’d like to give you a little background information about us: who we are, where we are coming from and why we are doing this. The National Library was established in 1877 and our mission is ‘to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland’. This involves all manner of publications, newspapers, books and journals. Of course, the documentary record is now partly, and increasingly, digital. To tackle this and remain relevant, the Library established the Born Digital Programme in 2010, and two web archiving pilot projects were carried out in 2011 in conjunction with Internet Memory.
Internet Memory Foundation is a non-profit institution, established in 2004 under the name European Archive Foundation. In 2010, it changed its name to Internet Memory Foundation to express its interest in preserving web content as a new medium for current and future generations. Currently Internet Memory Foundation hosts hundreds of terabytes of archived websites on open access, including its own collection and collections from partner institutions such as The UK National Archives, the National Library of Ireland and CERN. To operate web archiving workflows, Internet Memory Foundation collaborates closely with its spin-off, Internet Memory Research (IMR). IMR was established in 2011 in Paris and works with many institutions in Europe: the Institute for Sound and Vision in the Netherlands and several German audio-visual archives, the National Library of Ireland, research projects (Inside Installations, enpolitique.com,...), CERN, etc. Both of these organisations actively support the preservation of the Web for heritage and cultural purposes. To fulfill their goals, Internet Memory Foundation and Internet Memory Research are engaged in several research projects to support the growth and use of web archives.
The Born Digital Programme is one of a number of integrated work programmes to build a 21st century library. Other programmes include our Digitisation Programme, and we are also working on building a single integrated catalogue and a digital repository. This is being done under the umbrella of our Digital Library Programme, which looks at the policies and processes around people, data and technology to determine what is most efficient in terms of access and services and to deliver a 21st century library.
In light of these initiatives in the Library, and given our strengths when it comes to political, cultural and historical collections, the decision to collect born digital materials was a natural one for us: there was an increasing digital gap in our collections as more and more political materials went online or were e-published. The question for us was not ‘will we’ but ‘how will we’, especially given the financial situation in Ireland. We have little money, few staff and are likely to have less of both again soon, and it is in this environment that we need to address the digital challenge. The theory was that the Born Digital Programme would engage in requirements analysis and policy description at a high level and devote some time to producing a policy and outlining the next steps possible for the Library. Very neat, everything planned, timetables set and so on…
It should come as no surprise to many people here that that didn’t happen. The hand of history took us by the throat in the form of a snap General Election and five weeks in which to capture it. As I mentioned earlier, one of the Library’s collection strengths is in the area of politics, so capturing digital material complemented our existing collection. So we decided to just do it.
We quickly realised that a project in such a new area for us, and in such a short time frame, meant that we needed to work on a collaborative basis. We needed to find a partner that best suited our requirements and that had experience working in the cultural sector. Our requirements included the need for technical skills. The Library is quite small compared to other National Libraries around the world, and while we have some excellent technical staff, they were involved in other projects. We also have some very strong curatorial skills, especially in the area of politics, and we wanted to be able to leverage these skills to the best of our ability. We also wanted it done fast! We had already established informal contacts with Internet Memory through the usual networking process, and so we opened a conversation with them in relation to timeframes, costs and requirements, and the project got started!
The project broke down into a number of phases. The project scoping and contract naturally came first, and we were also processing our internal business case at the same time. Again, with such a short space in which to work, the timeline was somewhat chaotic and out of sequence. This is also true for the site selection and permissions aspects of the project, where there was some overlap. QA, publication and promotion then completed the project.
When IM received the URL lists from NLI, we commented on them and highlighted specific points in order to improve crawl results. We checked for:
• URLs that were not written correctly
• Duplicated websites, e.g. when one domain AND one of its subdomains were both specified
• Redirections, which the crawler would not follow because it would consider them out-of-scope
• Website parts that would not be captured because external links would put them out-of-scope
• Dynamic websites, which are difficult to crawl
• Links to the defined social Web (Facebook, Twitter and YouTube), so that they could be included in the crawl
• Servers hosting multimedia content (on the same domain or not)
Then, after discussion and clarifications with NLI, we were ready to launch the crawls. Seeds were gathered in batches using the following scope parameters: domain, host and path. Specific attention was paid to social web content in order to crawl the home page of each Twitter, Facebook and YouTube page. Robots.txt files were not followed and a high level of politeness was set up. Moreover, a Webmaster page ( http://www.nli.ie/crawl-ge2011.html ) was designed and published on the NLI website to give some information on the crawl’s context, technical details and contact information.
Specific incidents occurred during the crawl period and required some technical changes on the fly:
• Modification of scope due to an offline domain that redirected to another site
• Pending crawls due to maintenance operations on a website
• Adaptation of the politeness due to webmaster complaints
• Inclusion of specific rules for robots.txt to get deeper content
Finally, feedback from the crawl and QA time was used to improve the quality of the post-election crawls by adding:
• Associated URLs to help catch the target content
• Politeness adjustments related to webmaster feedback
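As an illustration only, the first two seed checks described in this section (badly written URLs, and a domain listed together with one of its subdomains) could be sketched in Python as below. This is not the tooling Internet Memory used; the function names, the treatment of `www.` as the same site, and the example URLs are all assumptions made for the sketch.

```python
from urllib.parse import urlsplit


def normalise(seed):
    """Return (host, path) for an http(s) seed URL, or None if it is unusable."""
    try:
        parts = urlsplit(seed.strip())
        host = (parts.hostname or "").lower()
    except ValueError:  # e.g. a malformed bracketed host
        return None
    if parts.scheme not in ("http", "https") or "." not in host:
        return None
    if host.startswith("www."):
        host = host[4:]  # assumption: www.example.org and example.org are one site
    return host, parts.path.rstrip("/")


def review_seeds(seeds):
    """Split seeds into (kept, flagged); flagged holds (seed, reason) pairs."""
    first_seen = {}  # (host, path) -> first seed seen with that key
    flagged = []
    for seed in seeds:
        key = normalise(seed)
        if key is None:
            flagged.append((seed, "malformed URL"))
        elif key in first_seen:
            flagged.append((seed, "duplicate of " + first_seen[key]))
        else:
            first_seen[key] = seed

    # Hosts listed as whole-site seeds (no path) already cover their
    # subdomains when the crawl scope is "domain".
    roots = {host for host, path in first_seen if path == ""}
    kept = []
    for (host, _path), seed in first_seen.items():
        parents = [r for r in roots if host.endswith("." + r)]
        if parents:
            flagged.append((seed, "subdomain of " + min(parents, key=len)))
        else:
            kept.append(seed)
    return kept, flagged


if __name__ == "__main__":
    # Hypothetical seed list, for illustration only.
    kept, flagged = review_seeds([
        "http://www.example.org/",
        "http://blog.example.org/",
        "htp:/broken",
    ])
    for seed, reason in flagged:
        print(seed, "->", reason)
```

In this sketch a flagged seed is a prompt for discussion with the curators rather than an automatic removal, which matches the back-and-forth with NLI described above.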
Once the crawl had completed, the QA began using software supplied by Internet Memory. We felt that the nature of the project suited manual QA, and it was very important to us that Internet Memory carry out as complete a technical QA as possible, as this left us free to leverage our own curatorial expertise in a basic ‘look and feel’ QA using multiple browsers. At times we also had to contact website owners, which built a relationship with them that helped with the later promotion of the archive. This was an unintended but very welcome consequence of our QA activities.
Capturing Web content is full of challenges, and the harvesting tools in use have clear technical limitations. Because of these limitations and the now obvious incompleteness of Web archives, most European institutions are using Quality Assurance methodologies and tools that can be applied to Web archives. The most used Quality Assurance methodology is the visual method, followed by a capture of the identified missing resources. It applies mainly to selective harvesting and consists of visually checking pairs of Web pages: the live version versus the archived version. Because the live version serves as the reference when checking the quality of captured content, the check should always be done as soon as possible after the crawl, given the ephemerality of Web content. IM offers the option of a methodical, human-driven QA, which includes repair operations. The QA team takes prompt action to resolve as many of the identified issues as possible. The level of quality assurance chosen by NLI was deep QA, corresponding to homepage + 2 levels, which meant that the content of snapshots was checked from the primary URL down to the second level. In this particular web archiving campaign, temporal coherence was an important criterion because, during the election period, the web content of many of the selected sites changed fast.
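To make the "homepage + 2 levels" scope concrete, here is a minimal Python sketch that works out which pages fall within two links of a homepage and which of those are absent from the archived snapshot. It is an illustration under stated assumptions, not Internet Memory's QA software: the link graph, the page paths and the function names are hypothetical, and the visual comparison of each live/archived pair remains a human task.

```python
from collections import deque


def pages_within_depth(links, homepage, max_depth=2):
    """Breadth-first walk returning {page: depth} for pages up to max_depth."""
    depth = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links beyond the QA depth
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def missing_from_archive(links, homepage, archived, max_depth=2):
    """List in-scope pages that the archived snapshot does not contain."""
    in_scope = pages_within_depth(links, homepage, max_depth)
    return sorted(page for page in in_scope if page not in archived)


if __name__ == "__main__":
    # Hypothetical link graph of a live site: page -> pages it links to.
    live_links = {
        "/": ["/news", "/candidates"],
        "/news": ["/news/launch"],
        "/news/launch": ["/news/launch/photos"],  # level 3, out of QA scope
    }
    archived_pages = {"/", "/news", "/candidates"}
    print(missing_from_archive(live_links, "/", archived_pages))
```

The missing pages found this way would be candidates for the follow-up capture of missing resources that the visual method relies on.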
In terms of access, the archive is fully accessible to the public and full text search has been enabled by Internet Memory. The archive itself is available through Internet Memory’s website and through our own catalogue via a widget based on the OpenSearch protocol, which displays results from the web-archive collection directly in our catalogue alongside our other resources. This widget has been developed by our Information Systems team and Internet Memory. In the future we want to build on this and make the collection even more accessible through our own interfaces. As well as that, the sheer size of the result set has made us as an organisation look at how best to integrate catalogue results from all our collections.
Because this was our first foray into a new collecting area, we obviously wanted to promote it, and we needed to think about how best to do so in both print and online media. In the end we used our own blog and Twitter feed, and some of the project participants promoted the archive on their own sites. We used traditional print media as well, and Internet Memory promoted the archive on their side. Usage figures have increased, but we feel that the real value of this archive will be more apparent in 5-10 years’ time.
For web archiving as a whole, what is of real value to us is the ability to open up new opportunities in terms of delivery of new media and formats online, which allows us to meet user expectations that material can be researched online. It also allows us to reach out to new users, from the ‘online generations’ to people who may not be in Dublin or in a position to visit the Library.
Speaking in more particular terms, we felt that the General Election archive itself has the following uses:
• To enable researchers to compare online content before and after the election, using both pre- and post-election sites for comparison
• To facilitate research into how online this election was
• To assess the impact of technological developments such as social media and Web 2.0 in the communication of campaign information
• To act as a record of campaign information
And finally, the collaborative approach. For us in the Library, there were numerous advantages to approaching web archiving on a collaborative basis. We knew from the start that web archiving was something we wanted to engage in for the long term. The collaborative approach allowed us to engage in this new area despite our lack of resources and technical expertise among the web archiving team. It allowed us to collect material that no one else was collecting or even in a position to collect, and it allowed us to do so quickly, in a way that enabled us to exploit our own curatorial skills and to pick up new technical skills on the way.
For IM, the benefits of collaborating with heritage institutions such as the National Library of Ireland include:
• Developing Web archiving initiatives
• Operating rapid deployment of Web archives
Indeed, cultural or heritage institutions are not always in a position to build a full internal project for their web archives, for reasons of resources and expertise, so many of them need recourse to an outsourced solution. This enables them to keep complete control over their main focus (selection, crawling parameters, preservation and usage) without dealing with operational workflow issues (qualified human resources, crawl indexing technologies, access issues...).
Moreover, collaborating with institutions shows us which new challenges to address in this area. For example, our partners (including NLI) really appreciate specific effort on social media content. On the one hand, this material is both ephemeral and highly contextualized, making it increasingly difficult for archivists to decide what to preserve. A recent Library of Congress blog post refers to the fact that every day a quarter billion photographs are uploaded to Facebook, and 340 million tweets are posted to Twitter. On the other hand, the social web is based on dynamic websites and uses specific technologies to publish content. Thus, collecting the social web requires new tools as well as new strategies.
Collaboration also helps us to improve technologies based on the needs of our partner institutions and to assess which processes we should automate to reduce processing time and costs, such as automated Quality Assurance to meet temporal coherence challenges. The “visual” methodology described above provides good results on selective harvesting, but it obviously requires trained human resources and is time and money consuming. To overcome these difficulties and apply quality review to larger sets of Web content, institutions and companies are now looking into implementing automated or semi-automated QA.
(A first option is to use metrics related to crawls as references and comparison tools. This method, which can be automated to detect problematic crawls, is useful when crawling the same list of domains at different frequencies. For instance, one could decide to store metrics from one reference crawl, which would have gone through visual checking, and compare them to all future crawls during a set period of time. Other options consist of developing specific QA tools based on proxy, execution (Selenium - http://seleniumhq.org/ ) or image comparison tools (see for example the SCAPE project - http://www.scape-project.eu/ ), to mimic “manual” QA and either detect missing elements in order to fetch and index them automatically, or evaluate the quality of crawls at a high level to minimize human intervention and reduce costs.)
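The metric-based option can be sketched in a few lines of Python: store the metrics of one visually checked reference crawl, then flag any later crawl whose metrics drift too far from it. This is only a sketch of the idea. The metric names, the 20% tolerance and the function name are assumptions chosen for illustration, not values taken from the project.

```python
def deviating_metrics(reference, crawl, tolerance=0.2):
    """Return the metric names where `crawl` differs from `reference`
    by more than `tolerance` (relative change), or is missing a value."""
    flagged = []
    for name, ref_value in reference.items():
        value = crawl.get(name)
        if value is None:
            flagged.append(name)  # metric not reported for this crawl
        elif ref_value == 0:
            if value != 0:
                flagged.append(name)  # any change from zero is notable
        elif abs(value - ref_value) / ref_value > tolerance:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical metrics from a reference crawl that passed visual QA.
    reference = {"urls": 1000, "bytes": 5000, "http_404": 10}
    later_crawl = {"urls": 400, "bytes": 5100, "http_404": 10}
    problems = deviating_metrics(reference, later_crawl)
    if problems:
        print("send to manual QA, deviating metrics:", problems)
```

A crawl flagged this way would still go to a person for visual checking; the point of the comparison is to reserve human QA time for the crawls that look problematic.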
Certainly, we are still no experts on t’interwebs but we now have two fantastic web archives up and running. If you want to find out more about them you can follow the links mentioned. And if you are thinking about web archiving, just do it.
How to Face the Challenges of Web Archiving? The Experiences of a Small Library on the Edge
Chloe Martin, Internet Memory
Catherine Ryan, National Library of Ireland
LIBER 2012
Context: National Library of Ireland
• Beginnings: Established by the Dublin Science and Art Museum Act, 1877
• Mission: “to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland”
• The Digital Record: Born Digital Programme established in 2010, covering web archiving
• Web Archive Projects: 2 pilot projects in 2011
Context: Internet Memory
European Archive / Internet Memory Foundation
• Established in 2004 in Amsterdam (offices also in Paris)
• Mission: to preserve Web content as a new media for current and future generations
• Actions: Sensibilization, partnerships, R&D
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: to operate large scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
Web Archiving Project: Project Origins (National Library of Ireland)
Building a 21st Century Library:
– Born Digital
– Digitisation
– Single Integrated Catalogue
– Digital Repository
– OSCAIL, the Digital Library Programme
Web Archiving Project: Project Origins (National Library of Ireland)
Born Digital Materials:
• Natural progression for NLI’s strong political, cultural and historical collections
• How best to approach this in time of unprecedented financial difficulty?
• Born Digital Programme established to examine requirements and produce a policy document for the next steps
Web Archiving Project: Project Origins (National Library of Ireland)
The Hand of History:
– Snap General Election
– Five Weeks
Web Archiving Project: Project Origins (National Library of Ireland)
Just do it
Web Archiving Project: Project Origins (National Library of Ireland)
Just do it. How?
Web Archiving Project: Project Origins (National Library of Ireland)
Collaborative Partnership:
– Partner that suited our requirements and that had experience with others in the cultural sector
Requirements:
– Technical skills in the NLI but working on other projects; needed these skills
– Leverage NLI’s strong curatorial experience, esp. in politics
– Fast!
Web Archiving Project: Project Origins (National Library of Ireland)
Project phases:
– Project scoping and contract
– Site selection
– Permissions gathering
– QA (look and feel)
– Publication and promotion
Site Selection and Permissions (National Library of Ireland)
Selection Criteria:
– Website presence
– Technical reasons
– Cut-off date
– Women candidates
Permissions:
– All sites contacted and provided with a brief
– Pressurised but necessary phase
Scope of projects (National Library of Ireland)
General Election:
– Crawl: 200 snapshots
– Scope: 100 seeds
– Frequency: 2 times
– Date: Feb. 2011
Presidential Election:
– Crawl: 80 snapshots
– Scope: 70 seeds
– Frequency: 3 times
– Date: Oct-Nov. 2011
Crawl (Internet Memory)
• Seeds Validation: URLs, Duplication, Redirection, External links, Dynamic websites
• Scope Parameters: Domain, host and path; Social Web content; Frequency; Robots.txt files exclusion; Politeness
• Specific incidents, technical changes on the fly: Modification of scope; Pending crawls; Adaptation of the politeness
• Improvement of second crawl
Quality Assurance (QA) (National Library of Ireland)
• Manual QA
• Jira software
• IM - Technical QA
• NLI - ‘Look and Feel’ QA
• Multiple browsers
• Communication with site owners (building relationships and promotion)
Quality Assurance (QA) (Internet Memory)
• Why?
• How?
  • Manual and visual method: homepage + 2
  • Resolution of issues
• Temporal Coherence
Access (National Library of Ireland)
• Available to the public
• Full text search
• IM website: search by keyword, URL
• NLI catalogue: keyword via widget developed by NLI IS team and IM
• Future: access through NLI’s own interfaces, issue of integrating results
Publication and Promotion (National Library of Ireland)
• NLI social media initiative (Twitter and blog)
• Project participants
• Print media (esp. in area of technology)
• And IM!
• Usage figures have increased but real value more apparent in 5-10 years
Usage Statistics of Web Archive (National Library of Ireland)
• 21/09/2011: Official launch of NLI Web archives (Tweets)
• 26/10/2011: Blog post on nli.ie/blog and Paper in thejournal.ie
• 25/11/2011: Paper on irishtimes.com
• 20/01/2012: Paper on irishtimes.com
• 17/03/2012: Post on soundofthearchives.wordpress.com
• 04/05/2012: Paper on irisheconomy.ie
Advantages of Web Archiving (National Library of Ireland)
Web archiving:
– New opportunities for delivery of materials to users
– Work with existing users’ expectations that content be online
– Reach new audiences
Advantages of Web Archiving (National Library of Ireland)
Political web archives; Irish General Election:
– Researchers can compare online content pre- and post-election
– Facilitates research into how ‘online’ this election was
– Assess impact of technological developments in campaign communications
– Record of campaign information
Benefits of Working Together (National Library of Ireland)
Pilot project for a long-term activity:
– Allowed us to enter a new collecting area despite lack of tech expertise
– Facilitated collection of important material that no one else was collecting
– Collect material quickly
– Leverage curatorial skills
– Gained new technical skills
Benefits of Working Together (Internet Memory)
• To support the development of Web archiving initiatives
• To operate rapid deployment of Web archives
• To address new challenges in this area:
  • Social media content
  • QA
  • Automation
Conclusion
View the NLI collections at: http://www.nli.ie/en/udlist/digital-collections.a
View the Web archive blog entry at: http://www.nli.ie/blog/index.php/2011/10/26/
View Internet Memory Collections at: http://collections.europarchive.org/
General Election:
• 18,495,771 URLs
• 1.14 TB
• 10,405 ARCs
Presidential Election:
• 7,333,399 URLs
• 278.10 GB
• 2,513 ARCs
To be continued…
Questions? Thanks for your attention!
Catherine Ryan, National Library of Ireland, http://www.nli.ie, @NLIreland
Chloe Martin, Internet Memory, @InternetMemory