Integrating web archiving in preservation workflows. Louise Fauduet, Clément Oury and Sébastien Peyrard
 

Presented at a working session on Web Archives at the Biblioteca Nacional de España (BNE), 8 July 2013


    Presentation Transcript

    • A W/ARC file is a sequence of W/ARC records. Each record has a header (e.g. record ID, capture date, record type, …) and a block (e.g. an HTTP response, a JPEG file, …).
    • WARC/1.0
      WARC-Type: warcinfo
      WARC-Date: 2006-09-19T17:20:14Z
      WARC-Record-ID: <urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39>
      Content-Type: application/warc-fields
      Content-Length: 381

      software: Heritrix 1.12.0 http://crawler.archive.org
      hostname: crawling017.archive.org
      ip: 207.241.227.234
      isPartOf: testcrawl-20050708
      description: testcrawl with WARC output
      operator: IA_Admin
      http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)
      format: WARC file version 0.17
      conformsTo: http://www.archive.org/documents/WarcFileFormat-0.17.html
    • WARC/1.0
      WARC-Type: request
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      Content-Length: 236
      WARC-Record-ID: <urn:uuid:4885803b-eebd-4b27-a090-144450c11594>
      Content-Type: application/http;msgtype=request
      WARC-Concurrent-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>

      GET /images/logoc.jpg HTTP/1.0
      User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)
      From: stack@example.org
      Connection: close
      Referer: http://www.archive.org/
      Host: www.archive.org
      Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824
    • WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-IP-Address: 207.241.233.58
      WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: application/http;msgtype=response
      WARC-Identified-Payload-Type: image/jpeg
      Content-Length: 1902

      HTTP/1.1 200 OK
      Date: Tue, 19 Sep 2006 17:18:40 GMT
      Server: Apache/2.0.54 (Ubuntu)
      Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
      ETag: "3e45-67e-2ed02ec0"
      Accept-Ranges: bytes
      Content-Length: 1662
      Connection: close
      Content-Type: image/jpeg

      [image/jpeg binary data here]
    • WARC/1.0
      WARC-Type: resource
      WARC-Target-URI: file://var/www/htdoc/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: image/jpeg
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      Content-Length: 1662

      [image/jpeg binary data here]
    • WARC/1.0
      WARC-Type: metadata
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593b943>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: application/warc-fields
      WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2
      Content-Length: 59

      via: http://www.archive.org/
      hopsFromSeed: E
      fetchTimeMs: 565
    • WARC/1.0
      WARC-Type: revisit
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2007-03-06T00:43:35Z
      WARC-Profile: http://netpreserve.org/warc/0.17/server-not-modified
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593bbbb>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      Content-Type: message/http
      Content-Length: 226

      HTTP/1.x 304 Not Modified
      Date: Tue, 06 Mar 2007 00:43:35 GMT
      Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4
      Connection: Keep-Alive
      Keep-Alive: timeout=15, max=100
      Etag: "3e45-67e-2ed02ec0"
    • WARC/1.0
      WARC-Type: conversion
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2016-09-19T19:00:40Z
      WARC-Record-ID: <urn:uuid:16da6da0-bcdc-49c3-927e-57494593dddd>
      WARC-Refers-To: <urn:uuid:92283950-ef2f-4d72-b224-f54c6ec90bb0>
      WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
      Content-Type: image/neoimg
      Content-Length: 934

      [image/neoimg binary data here]
    • WARC/1.0
      WARC-Type: response
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2
      WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
      WARC-IP-Address: 207.241.233.58
      WARC-Record-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
      WARC-Segment-Number: 1
      Content-Type: application/http;msgtype=response
      Content-Length: 1600

      HTTP/1.1 200 OK
      Date: Tue, 19 Sep 2006 17:18:40 GMT
      Server: Apache/2.0.54 (Ubuntu)
      Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT
      ETag: "3e45-67e-2ed02ec0"
      Accept-Ranges: bytes
      Content-Length: 1662
      Connection: close
      Content-Type: image/jpeg

      [first 1360 bytes of image/jpeg binary data here]
    • WARC/1.0
      WARC-Type: continuation
      WARC-Target-URI: http://www.archive.org/images/logoc.jpg
      WARC-Date: 2006-09-19T17:20:24Z
      WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
      WARC-Record-ID: <urn:uuid:70653950-a77f-b212-e434-7a7c6ec909ef>
      WARC-Segment-Origin-ID: <urn:uuid:39509228-ae2f-11b2-763a-aa4c6ec90bb0>
      WARC-Segment-Number: 2
      WARC-Segment-Total-Length: 1902
      WARC-Identified-Payload-Type: image/jpeg
      Content-Length: 302
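    The WARC record examples above all share one shape: a version line, named "Name: value" header fields, then a blank line before the block of Content-Length bytes. The following is a minimal sketch of parsing just that header structure, illustrative only and not a full WARC parser (real tools such as warcio handle the byte-level and chunking details):

```python
# Minimal sketch: split one WARC record's text into its version line
# and header fields. Field names come from the examples above; the
# parsing logic itself is illustrative, not a complete implementation.

def parse_warc_header(record_text):
    """Parse 'Name: value' header lines up to the first blank line."""
    version, _, rest = record_text.partition("\n")
    headers = {}
    for line in rest.split("\n"):
        if not line.strip():
            break  # a blank line separates the header from the block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version.strip(), headers

sample = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.archive.org/images/logoc.jpg
WARC-Date: 2006-09-19T17:20:24Z
Content-Type: application/http;msgtype=response
Content-Length: 1902

[payload bytes would follow here]"""

version, headers = parse_warc_header(sample)
print(version)               # WARC/1.0
print(headers["WARC-Type"])  # response
```

    A real reader would then consume exactly Content-Length bytes as the block, which is how records of any payload type (HTTP messages, images, warc-fields) coexist in one file.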
    • Digitization as a means of preservation and dissemination
    • Then: black & white – 300 dpi – TIFF G4 – 1 page ≈ 200 KB. Now: color (24 bits) – 400 dpi – uncompressed TIFF – 1 page ≈ 80 MB. More than ×500!
    • SPAR - Infrastructure SPAR - Realization Ingest SPAR Storage Abstraction Service (SAS) Administration Data management Storage Access Preservation planning Production applications Dissemination applications Preservation digitization … Wayback Web archiving …. …. … Audiovisual
    • http://public.ccsds.org/publications/archive/650x0m2.pdf
    • Pre-Ingest Storage abstraction service Ingest Storage Preservation planning Administration Data management Access SIP DIP mets rdf rdf Infrastructure Preservation digitization Web archives And so on
    • 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 – in operation since May 2010
    • Backup storage Backup Servers Backup site Backup secondary storage Primary storage Secondary storage Lookup storageServers Main site Backup Lookup storage Online storage
    • Oracle StorageTek SL8500 • up to 64 tape drives • up to 8500 tapes • up to 8 hand pickers • up to 32 linked libraries Primary storage 2 libraries 16 PB maximum Backup storage 2 libraries 16 PB maximum
    • Primary storage: LTO5 – capacity 1.5 TB, transfer rate 140 MB/s (previously: 9840C – 40 GB). Backup storage: T10000B – capacity 1 TB, transfer rate 120 MB/s (previously: T10000A – 500 GB)
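    As a sanity check on the capacity and transfer-rate figures above, here is a back-of-the-envelope sketch of how long a full tape takes to stream at its nominal sustained rate (decimal units assumed; compression and seek overhead ignored, so real times vary):

```python
# Back-of-the-envelope check of the tape figures quoted above:
# time to stream one full tape end to end at its nominal rate.
# Assumes decimal TB/MB; ignores compression and repositioning.

def hours_to_fill(capacity_tb, rate_mb_s):
    seconds = (capacity_tb * 1e12) / (rate_mb_s * 1e6)
    return seconds / 3600

lto5 = hours_to_fill(1.5, 140)   # primary storage: LTO5
t10kb = hours_to_fill(1.0, 120)  # backup storage: T10000B
print(round(lto5, 1), round(t10kb, 1))  # 3.0 2.3 (hours)
```

    So writing or verifying a single tape is a matter of hours, which is why repository-scale migrations between tape generations are planned operations rather than routine ones.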
    • Pre-Ingest Storage abstraction service Ingest Storage Preservation Administration Data management Access SIP AIP DIP mets rdf rdf AIP – Which formats are allowed? How many copies are needed, on what kind of media? What is the maximum size of a package? Do we need to log each access?
    • SLA-I.xml, SLA-P.xml, SLA-A.xml Mets.xml Contract.pdf
    • 03/07/1882 28/02/1883 01/03/1883 set group object file 02/07/1882 Year 1883 Le Matin Year 1882 01/07/1882
    • For this purpose, PDF/X was chosen as a good compromise between fidelity to the original, wide usage and standardization
    • Mets.xml: manifest; T000001.tiff: sample; format.xml: machine-readable description; format.txt: human-readable description
    • http://www.fao.org/oek/jhove2/digital-preservation-and-jhove2-home/jhove2-tutorial/en/
    • http://bibnum.bnf.fr/containerMD-v1
    • Ingest request reception Manifest Validation Package search within SPAR SIP characteristics audit SIP files audit and characterization ARK identifier generation SET processing Ingest completion SIP reception Audit ACT_01 ACT_02 ACT_03 ACT_04 ACT_05 ACT_06 ACT_07 ACT_08 ACT_09
    • Structural metadata: METS Descriptive and source metadata: qualified Dublin Core Provenance metadata: PREMIS Technical metadata: depends on the data-objects
    • 1996-2005 2002 & 2004 2004-2008 2006-2010 2010-now 70 TB 0.5 TB 45 TB 22 TB operator robot + 150 TB
    • Pre-Ingest: digitized books; digitized audiovisual documents; web archiving
    • [Diagram slides: harvested HTML files grouped into ARC data files]
    • 1996-2005 2002 & 2004 2004-2008 2006-2010 2010-now 70 TB 0.5 TB 45 TB 22 TB unknown + Alexa bot 150 TB
    • ARC data ARC metadata HTML HTML HTML HTML harvested files ARC data ARC data ARC data ARC data + harvest 1 harvest 2 + harvest 3 + … … This is a collection containing French election websites Here are the files we harvested They are included in web archives specific files This was done with these tools
    • A three-layered model in SPAR Harvest Definition (curator collection) Harvest Instance (“technical” harvest = job) ARC file (data or metadata)
    • filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77
      1 0 InternetArchive
      URL IP-address Archive-date Content-type Archive-length

      metadata://netarkivet.dk/crawl/setup/harvestInfo.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095814 text/xml 366
      <?xml version="1.0" encoding="UTF-8"?>
      <harvestInfo>
        <version>0.2</version>
        <jobId>32</jobId>
        <priority>LOWPRIORITY</priority>
        <harvestNum>0</harvestNum>
        <origHarvestDefinitionID>1</origHarvestDefinitionID>
        <maxBytesPerDomain>-1</maxBytesPerDomain>
        <maxObjectsPerDomain>1000</maxObjectsPerDomain>
        <orderXMLName>default</orderXMLName>
      </harvestInfo>

      metadata://netarkivet.dk/crawl/setup/order.xml?heritrixVersion=1.14.3&harvestid=1&jobid=32 172.20.16.214 20100414095815 text/xml 44775
      <?xml version="1.0" encoding="UTF-8"?>
      <crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd">
      …
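    The first line of each ARC record shown above is a single space-separated header naming five fields (URL, IP-address, Archive-date, Content-type, Archive-length). A minimal, illustrative sketch of splitting such a line follows; the dictionary keys are my own naming, and this is not a complete ARC parser:

```python
# Illustrative only: split an ARC version-1 record header line into
# the five fields named in the example above. Key names are chosen
# for readability, not taken from any library.

ARC_FIELDS = ["url", "ip_address", "archive_date", "content_type", "archive_length"]

def parse_arc_header_line(line):
    parts = line.split(" ")
    if len(parts) != len(ARC_FIELDS):
        raise ValueError("not a v1 ARC header line: %r" % line)
    rec = dict(zip(ARC_FIELDS, parts))
    rec["archive_length"] = int(rec["archive_length"])  # bytes in the block
    return rec

rec = parse_arc_header_line(
    "filedesc://32-metadata-1.arc 0.0.0.0 20100416092026 text/plain 77"
)
print(rec["content_type"], rec["archive_length"])  # text/plain 77
```

    As with WARC, a reader would then consume archive_length bytes as the record body, which is how the harvestInfo and crawl-order XML documents above are embedded in the metadata ARC file.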
    • 1996-2005 2002 & 2004 2004-2008 2006-2010 2010-now 70 TB 0.5 TB 45 TB 22 TB unknown + Alexa bot 150 TB
    • Two layers: - Collection - ARC files 1996-2005 2010-now Three layers: - Harvest Definition - Harvest instance - ARC files Two layers: - Collection - ARC files
    • 2006-2010 2010-now Four layers: - Collection - Harvest division - Harvest instance - ARC files Three layers: - Harvest Definition - Harvest instance - ARC files
    • 03/07/1882 28/02/1883 01/03/1883 set group object file 02/07/1882 Year 1883 Le Matin Year 1882 01/07/1882 03/07/1882 28/02/1883 01/03/188302/07/1882 Le Matin 01/07/1882
    • set group 03/07/1882 28/02/1883 01/03/1883 02/07/1882 Le Matin 01/07/1882 – A set contains nothing but metadata: curator information that allows grouping AIPs sharing the same intellectual content. An AIP must contain the files to be preserved; each AIP is an autonomous unit.
    • <mets> – <dmdSec>: intellectual metadata; <amdSec>: administrative metadata (<sourceMD>: metadata about the source used to produce this content; <techMD>: technical metadata; <digiprovMD>: provenance metadata); <fileSec>: list of the files; <structMap>: structure of the package
    • harvestInstance has harvest instance is documented in Outcome extensions persons: admins software organizations Harvest event
    • ARC data ARC metadata HTML HTML HTML HTML ARC data ARC data ARC data ARC data + harvest 1 harvest 2 + harvest 3 + … … This is a collection containing French election websites HTML HTML HTML HTML …
    • ARC data ARC data ARC metadataARC data ARC data ARC data ARC data … … This is a collection containing French election websites AIPAIP AIPAIP AIPAIP AIPAIP AIPAIP AIPAIP AIPAIPset ARC data ARC data … AIPAIP AIPAIP ARC data AIPAIP AIPAIPAIPAIP groups
    • ARC ARC.GZ ? ? HTML ? HTML ?
    • Version-block (header + metadata object): the first ARC record. Then data objects.
    • containerMD http://bibnum.bnf.fr/containerMD-v1
    • <mets> – <dmdSec>: intellectual metadata; <amdSec>: administrative metadata (<sourceMD>: metadata about the source used to produce this content; <techMD>: technical metadata; <digiprovMD>: provenance metadata); <fileSec>: list of the files; <structMap>: structure of the package
    • containerMD root element: container; entries; entriesInformation (aggregated information about the entries); entry, entry, entry, … ARC-specific extensions: ARCContainer, ARCEntries, ARCRecord, ARCRecord, ARCRecord, …
    • Factorizing and summing
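    The idea behind containerMD's aggregated entry information is to factorize and sum facts about the entries rather than keep one metadata record per entry. The sketch below illustrates that aggregation step in plain Python; the field names and input shape are hypothetical, not containerMD's actual schema:

```python
# Sketch of "factorize and sum": collapse per-entry facts (here,
# MIME type and size) into one aggregate summary, instead of one
# metadata record per entry. Field names are hypothetical.

from collections import Counter

def summarize_entries(entries):
    """entries: list of (mime_type, size_in_bytes) pairs."""
    return {
        "entry_count": len(entries),
        "total_size": sum(size for _, size in entries),
        "by_mime_type": Counter(mime for mime, _ in entries),
    }

summary = summarize_entries([
    ("text/html", 4096), ("image/jpeg", 1662), ("text/html", 2048),
])
print(summary["entry_count"], summary["total_size"])  # 3 7806
print(summary["by_mime_type"]["text/html"])           # 2
```

    For a container holding millions of harvested files, this kind of summary stays small and queryable where per-entry records would not.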
    • Web archiving at the British Library Helen Hockx-Yu Head of Web Archiving
    • Overview > Part 1: Background, history and organisation > Part 2: Web Archiving Tools (including demos) > Part 3: Access > Part 4: Non-print Legal Deposit and future strategy. 29th November 2012, Session 7 - Web archiving at the British Library
    • BL Structure > BL Board and Executive Team > e-Strategy and Information Systems (eIS) > IT-based products and services > Finance and Corporate Services (F&CS) > Money > Human Resources > People > Operations & Services (O&S) > Front line services > Scholarship and Collections (S&C) > Content (Arts and humanities, Social Sciences, Science, Technology & Medicine) > Strategic Marketing and Communications (SMC) > Brand and reputation
    • Web archiving timeline
    • Current web archiving strategy > Selective archiving of websites (4 areas) that > reflect the diversity of lives, interests and activities throughout the UK > contain research value or are of research interest > feature political, cultural, social and economic events of national interest > demonstrate innovative use of the web > Also prioritise websites at risk and web-only content > Permission based > Permission to archive, to provide online access and to preserve. Also ask for 3rd-party rights clearance > 30% success rate, 5% explicit refusal (mostly due to 3rd party rights) > Online access through UK Web Archive > Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
    • The current Web Archiving team. Skills Profile > IT > Collection management, digital curation > Management > Communications > Web Archiving
    • (Internal Collaboration) > The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library. Department / Team – Activity / Support: S&C > Subject specialist group > Curator’s Choice project – Selection, curation; eIS – Network, hardware and IT support; O&S Resource Discovery & Research – Corporate level resource discovery http://explore.bl.uk/; CA&D Digital Processing – Cataloguing (special collection level); SMC – Publicity, press release, events; The Legal Deposit Programme – Domain crawl capability / process and policy
    • Curator’s Choice > Pilot project with a small group of dedicated curators / subject specialists > Special Collections of curator’s choice. Curators take responsibility for owning, maintaining and growing the collections over time > Evolving Role of Libraries in the UK > Political Action and Communication > Slavery and Abolition in the Caribbean > UK relations with the Low Countries > 19th Century English Literature > Oral History in the UK > Film in the UK > Energy
    • Web Archiving Advisory Group > Provide advice and support to the Web Archiving Team > Act as a ‘critical friend’ to assist in the development of policy and practice. > Specific advice and support on: > Purpose, vision and benefits. > Strategic direction and planning. > Synergy with internal teams and collaboration with external stakeholders/partners. > Policy changes and risk management
    • (External) Collaboration > UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections > UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs) > External Collaborators > Wellcome Library > Live Art Development Agency > The Cambridge Innovation Network > The Women’s Library > Institute of Historical Research, University of London > Individual researchers, specialists > General public – ca. 20 nominations / week > National organisations: DPC, JISC > International: IIPC
    • JISC UK Web Domain Dataset (1996-2010) > Collaboration with JISC and the Internet Archive > UK Web Domain Dataset (1996-2010) – UK websites extracted from the Internet Archive's collection and supported by funding from the JISC > 35 TB research dataset > No local access to individual websites but access to secondary dataset allowed > BL has developed visualisations of the dataset > JISC funded 2 further projects using this dataset > Analytical Access to the Domain Dark Archive > Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
    • Web Archiving Tools > Support key processes: selection, harvesting, storage, access, preservation > Mostly open source tools, some developed in-house > New tools / changes to current tools expected when business processes change due to non-print Legal Deposit
    • Selection Tools > Selection: decide what websites to archive and to include as part of a web archive collection > Selection and Permission Tool: https://wct.bl.uk/selection/ > Submit selection – real time checking of duplicates, fetching meta tags from live sites > Collect metadata > Add contact details > Suggest crawl frequency > Permissions management – send emails, direct users to online licence form, store the completed forms, pass details to WCT (create authorisation record and a pending target) > Reports > Twittervane
    • Harvesting Tools > Harvesting: automated downloading of selected websites using crawler software; quality assurance regarded as an element > The Web Curator Tool (WCT): https://wct.bl.uk/wct/ > Job scheduling > Metadata > Access control > Harvesting (uses Heritrix) > QA
    • Quality Assurance > Placing more emphasis on intellectual content than appearance or behaviour of a website > Use four aspects to define quality: > Completeness of capture: whether the intended content has been captured as part of the harvest. > Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool. > Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively. > Appearance: look and feel of a website. > Rely on visual comparison, previous harvests & crawl logs > Recent development of QA module to allow bulk operation, reduce number of clicks and make QA recommendations
    • Supporting Long-term Preservation > Storing data in WARCs and metadata in METS > Migrate all legacy data into WARCs > WCT outputs WARC files > Submission Information Package (SIP) profiles for selective and domain crawls > Storing descriptive metadata (e.g. permission information) & technical metadata (e.g. crawl log, crawl configurations, virus scan events) > Ingest archived websites in the Digital Library System (DLS) > Command line tool generates SIPs > Providing access from the DLS (in future)
    • Demo (45 minutes) > Selection and Permission Tool (https://wct.bl.uk/selection/) > Web Curator Tool (https://wct.bl.uk/wct/)
    • Access > Currently 3 ways to access the web archive > Online through the UK Web Archive > Catalogue records (of special collections) > Keyword search through Primo (corporate resource discovery system) > Conduct researcher survey to understand requirements > Analytical access
    • Catalogue Records
    • Keyword search through Primo
    • UK Web Archive > Websites archived by BL and partners since 2004 (65% by BL) > 122,99 websites, 50,866 instances, 13.6 TB of WARCs > Over 100,000 unique visits since 1st April 2012 > Key websites include videos > Full-text, N-gram, title and URL search > Browse by subject / special collection, visual browsing http://www.webarchive.org.uk
    • Analytical Access > Shift of focus from the level of single webpages or websites to the entire web archive collection. > Use web archives as datasets > Support survey, annotation, contextualisation and visualisation > Allows discovery of patterns, trends and relationships in inter-linked web pages > Extracting value from the “haystacks” > Helps address a number of challenging issues > Scalability > Accessibility of individual websites > Components missed by crawlers
    • Visualising the UK Web > http://www.webarchive.org.uk/ukwa/visualisation > N-gram search > Links analysis > Format Analysis > Geo-index > http://www.webarchive.org.uk/bluebox/ > uses the Memento aggregate TimeGate hosted by lanl.gov > “resource not in archive” – who else has it? > Open data > Dataset and APIs for general use > Enable broader community to re-use, explore and visualise content of web archive
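    The Memento service mentioned above relies on a TimeGate: under the Memento protocol (RFC 7089), a client asks for the capture closest to a moment in time by sending an Accept-Datetime request header formatted as an HTTP-date. A minimal sketch of building that header (no network call is made; the example date is arbitrary):

```python
# Hedged sketch: construct the Accept-Datetime header a Memento
# client would send to a TimeGate (RFC 7089). Only header
# formatting is shown; issuing the HTTP request is out of scope.

from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime_header(dt):
    """Return the Accept-Datetime header for an aware UTC datetime."""
    return {"Accept-Datetime": format_datetime(dt, usegmt=True)}

hdr = accept_datetime_header(
    datetime(2010, 4, 16, 9, 20, 26, tzinfo=timezone.utc)
)
print(hdr["Accept-Datetime"])  # Fri, 16 Apr 2010 09:20:26 GMT
```

    The TimeGate answers with a redirect to the Memento (archived copy) whose capture date is nearest the requested one, which is what makes cross-archive aggregation like the bluebox service possible.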
    • Web Archiving Infrastructure
    • Non-print Legal Deposit: Time of change > Expected to be in place in April 2013 > Access restricted to premises of Legal Deposit Libraries > Library-wide Legal Deposit Programme to develop capability and end-to-end process > Web Archiving Team acts as “technical supplier” for a number of projects > Still need to work out how current (permission-based) selective archiving relates to domain crawl under Legal Deposit > Will we request permissions for online access? > Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl? > Who does what?
    • Web Archiving Strategy > Domain harvesting: broad sweep of .uk domain, once or twice a year > Events & key sites: events of national interest; sites need to be captured frequently > Special Collections: focused, thematic collections; support priority subjects
    • Web Archiving Workshop. Leïla Medjkoune, Internet Memory. IIPC workshop, BNF, Paris, November 2012
    • Internet Memory. Internet Memory Foundation (European Archive): Established in 2004 in Amsterdam and then Paris. Mission: preserve Web content by building a shared WA platform. Actions: dissemination, R&D and partnerships with research groups and cultural institutions. Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland. Internet Memory Research: spin-off of IM established in June 2011 in Paris. Missions: operate large-scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
    • Internet Memory. Infrastructure: green datacenters. Repository and data access for large-scale data management: HDFS (Hadoop File System), a distributed, fault-tolerant file system; HBase, a distributed key-value index and a convenient model for temporal archives; MapReduce, a distributed execution framework and a reliable mechanism to run an analysis job on very large datasets
    • Internet Memory. Focused crawling: automated crawls; quality-focused crawls (video capture, Twitter crawls); execution tools to overcome crawling issues on specific content. Large-scale crawling: in-house developed distributed software; scalable crawler (10-50 Bn pages); also designed for focused crawls and complex scoping
    • Research projects and focus. Web Archiving and Preservation: ✓ Living Web Archives (2007-2010) ✓ Archives to Community MEMories (2010-2013) ✓ SCAlable Preservation Environment (2010-2013). Web-scale data Archiving and Extraction: ✓ Living Knowledge (2009-2012) ✓ Longitudinal Analytics of Web Archive data (2010-2013) ✓ TrendMiner (2011-2014) ✓ DOPA (2012-2014) ✓ AnnoMarket (2012-2014)
    • A Web Archiving project? Organisational challenges: selection/QA (librarian / archivist, quality assurance team, project manager); content capture / services development (engineers, developers, technicians); infrastructure deployment and maintenance (engineers, system administrators). ➥ Web archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
    • IM shared platform. Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects: to develop methods and tools improving web archiving quality; to grow its expertise and technological taskforce
    • Archivethe.Net (1): to mutualize knowledge and skills between institutions; to share internal developments with partner institutions; to cut services and R&D costs
    • Archivethe.Net (2): Archivethe.net is a shared web archiving platform associated with a service. The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency. For whom? Our current partners, our new partners and … ourselves
    • Benefits? Integrated web archiving process, from selection to access; ongoing technological developments through specific or common R&D projects; a dedicated and highly skilled team to follow partners’ projects; a dedicated infrastructure
    • How does it work? (1) ATN is designed as SaaS (Software as a Service). The platform offers a friendly user interface to record partners’ web archiving orders; a pipeline organizes and manages the production; a QA team ensures the quality of the archive to meet partners’ requirements
    • How does it work? (2) Demo
    • ARCOMEM Archivist tool? Set and follow web archive campaigns. V1: a crawler cockpit and a search and retrieval application. Intelligent content acquisition: seed URLs; keywords; social web sites’ APIs; Social Media Categories (SMC)
    • SARA. Search and retrieval interface: advanced search functionalities; filtering via faceting; sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
    • Crawler Cockpit Interface: create/select a campaign; describe the campaign (title, description, comments, etc.); define scope: select criteria such as language, keyword, URL, organisation, etc.; select social media categories and APIs to explore; set precedence rules for some content types or sources (images, videos, tweets, news, etc.)
    • Crawler cockpit interface: Demo
    • ARCOMEM Archivist Tool V2. Refinement mode: refine crawl parameters to improve crawls. Improve the access application (SARA): a preview function so that users can review the results of the campaign set-up
    • QA for Web Archives? IM QA is based on: tools developed internally; tools developed in the context of European projects; automated processes; the knowledge and skills of our crawl engineer and QA teams
    • QA methodology and tools? Methodology: based upon crawler behaviour; based on institutions’ needs and policy; can be manual (visual) or “automated”; can be made at pre- or post-crawl time. Tools: open-source tools such as plugins, proxies, etc.; internally developed tools (fetchers, automated checks, etc.); bug trackers to record information and communicate with partner institutions
    • QA methodology and tools? SCAPE (Scalable Preservation Environments): automate visual QA to detect rendering issues; improve archive quality and cut QA costs; feed “preservation watch and planning” tools. First test made on over 400 pairs of URLs; in-house “execution platform” under deployment; results and processes to be disseminated to IIPC members for feedback!
    • Technical challenges. Capture: dynamically generated content, deep web, etc.; non-HTTP protocols (e.g. RTMP); social media platforms, ... Access: replicate live functionalities and look & feel; provide access to very large files. ➥ Fast-evolving technologies ➥ Ephemeral content ➥ Multiplication of production means ➥ Increase of user-generated content
    • Technical solutions: execution-based crawling (vs parsing); API crawling; application-aware crawling; bespoke fetchers. ➥ Orchestration of tools: ARCOMEM content acquisition
    • Technical solutions. Access tool: player replacement (reproduce players’ functionalities); adapt the access solution to the type of content/platform (generic solutions). Storage infrastructure / format: enable access to large files; fast access to large amounts of content to facilitate search & retrieval
    • Use cases. Social media capture and access: YouTube, Twitter, Flickr, etc. Web archiving related services: redirection service, Memento. Legal issues with captured content. Full-text search. Etc.