Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Presented at the International Conference on Web Archives and Electronic Legal Deposit, held at the Biblioteca Nacional de España (BNE) on 9 July 2013.

  1. Archiving the French Web: the BnF web archiving workflow
     Sara Aubry, Web Archiving Project Manager, IT department, Bibliothèque nationale de France
     International Conference on Web archives and e-LD (electronic legal deposit), Biblioteca Nacional de España, Madrid, July 9th 2013
  2. Let’s start with some figures
     • Programme started in 2000, industrialised in 2008-2012
     • Collections:
       – 1996 to now
       – 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
       – 18.8 billion URLs, 370 TB, growing by about 100 TB per year
     • Resources:
       – 9 full-time employees (5 librarians, 4 engineers)
       – many partners inside and outside the Library, at both the national and international level
       – 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
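
The figures above can be turned into a few rough ratios. The sketch below is only back-of-the-envelope arithmetic on the numbers from the slide; it assumes decimal terabytes (1 TB = 10^12 bytes), and the variable names are illustrative rather than taken from any BnF tool.

```python
# Figures quoted on the slide (decimal units are an assumption).
TOTAL_URLS = 18.8e9             # 18.8 billion URLs
TOTAL_BYTES = 370e12            # 370 TB
GROWTH_BYTES_PER_YEAR = 100e12  # about 100 TB per year
ROBOTS = 70
TOTAL_RAM_GB = 648

# Average stored volume per archived URL, in kilobytes.
avg_kb_per_url = TOTAL_BYTES / TOTAL_URLS / 1e3

# Average memory available per crawler robot, in gigabytes.
ram_gb_per_robot = TOTAL_RAM_GB / ROBOTS

# Years needed to add another 370 TB at the quoted growth rate.
years_to_double = TOTAL_BYTES / GROWTH_BYTES_PER_YEAR

print(f"{avg_kb_per_url:.1f} KB per URL")
print(f"{ram_gb_per_robot:.1f} GB RAM per robot")
print(f"{years_to_double:.1f} years to double")
```

On these assumptions an archived URL weighs roughly 20 KB on average, and the collection would double in a little under four years.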
  3. Digital curation is not different!
     • « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »
       (definition of Digital Archiving in Wikipedia)
  4. BnF workflow overview
     Stages shown on the diagram: Selecting, Collecting, Indexing, Accessing, Preserving
     Diagram label: nas_preload
  5. Selecting with BCWeb
  6. Selecting with BCWeb
     • A form-based application, commonly called a « curator tool »
       – for content curators and researchers to nominate websites to harvest
       – giving basic information about them (content policies, trends watch)
     • Most important information for each website:
       – Internet address/URL
       – frequency (daily, monthly, yearly, once…)
       – size/budget (small, medium, big)
       – depth (entire domain, part of it)
     Content curators
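
The four key fields of a nomination can be pictured as a small record. The following is a minimal sketch, not BCWeb's actual data model: the class and field names (`Nomination`, `Frequency`, `Budget`, `Depth`) are hypothetical, and only the value lists come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One website proposed for harvesting (hypothetical structure)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def __post_init__(self):
        # Reject anything that is not an absolute web address.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute web address: {self.url!r}")


nomination = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
print(nomination)
```

Using enumerations rather than free text mirrors the form-based nature of a curator tool: a curator picks from fixed choices, so the harvester never has to interpret an unexpected value.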
  7. The Web is made of HTML pages
     1 HTML page, 48 URLs
     • 1 HTML
     • 1 text/css
     • 4 javascript
     • 17 image/png
     • 5 image/jpeg
     • 21 image/gif
     All links and inclusions are URL references
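
To make the idea of "links and inclusions" concrete, here is a small sketch that lists the URL references found in one HTML page, using only the Python standard library. The sample page, the `example.org` addresses and the tag/attribute table are illustrative assumptions; a real harvester recognises many more kinds of reference (CSS imports, scripts that build URLs, and so on).

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which (tag, attribute) pairs carry a URL, and how we label them.
# This short table is an assumption made for the example.
REFERENCE_ATTRS = {
    ("a", "href"): "link",
    ("link", "href"): "inclusion",
    ("script", "src"): "inclusion",
    ("img", "src"): "inclusion",
}


class ReferenceExtractor(HTMLParser):
    """Collects (kind, absolute URL) pairs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            kind = REFERENCE_ATTRS.get((tag, name))
            if kind and value:
                # Resolve relative references against the page address.
                self.references.append((kind, urljoin(self.base_url, value)))


def extract_references(html, base_url):
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.references


SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/app.js"></script>
</head><body>
<img src="logo.png">
<a href="http://example.org/about">About</a>
</body></html>
"""

refs = extract_references(SAMPLE_PAGE, "http://example.org/index.html")
counts = Counter(kind for kind, _ in refs)
for kind, url in refs:
    print(kind, url)
```

Even this four-line page yields three inclusions and one link, which is why a single page on the live web easily expands into dozens of URLs to collect.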
  8. Harvesting with Heritrix
     • A harvester is a piece of software (crawler, spider, robot)
     • Simulates what a person would do with a browser, but repeatedly and very fast
     • Follows a looping process
     • Repeated as long as new, in-scope URLs are found and the limits (budget and time) are not reached
     Loop steps: pick a location, make a request, receive a response, examine for references, save the content (WARC)
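
The looping process can be sketched in a few lines. This is a toy illustration of the general crawl loop, not Heritrix code: the in-memory `FAKE_WEB`, the `fetch` and `in_scope` helpers and the host-based scope rule are all assumptions made so the example runs without network access. Only a URL budget is enforced here; a time limit would be a second condition on the same loop.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (content, outgoing references).
FAKE_WEB = {
    "http://example.fr/": ("home", ["http://example.fr/a", "http://other.com/"]),
    "http://example.fr/a": ("page a", ["http://example.fr/", "http://example.fr/b"]),
    "http://example.fr/b": ("page b", []),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (content, references) or None."""
    return FAKE_WEB.get(url)


def in_scope(url, allowed_host):
    """Toy scope rule: stay on a single host."""
    return urlparse(url).hostname == allowed_host


def crawl(seed, allowed_host, budget):
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid loops
    archive = {}               # saved content, keyed by URL
    while frontier and len(archive) < budget:
        url = frontier.popleft()            # pick a location
        response = fetch(url)               # make a request, receive a response
        if response is None:
            continue
        content, references = response
        for ref in references:              # examine for references
            if ref not in seen and in_scope(ref, allowed_host):
                seen.add(ref)
                frontier.append(ref)
        archive[url] = content              # save the content
    return archive


archive = crawl("http://example.fr/", "example.fr", budget=10)
print(sorted(archive))
```

The loop stops for one of two reasons: the frontier is empty (no new in-scope URL was found) or the budget is exhausted. Lowering `budget` to 2 cuts the crawl short after the first two pages.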
  9. Assets:
     - open source
     - small and large scale
     - textual or all-media formats
     - data structures
  10. Digital curators: legal deposit department
  11. Engineers: IT department
      Challenges:
      • rich media and ever-changing environment
      • social networks
      • content beyond paywalls (news sites, ebooks)
  12. Piloting the crawls with NetarchiveSuite
      • Prepare, schedule, run and monitor harvests of websites, perform QA
      Digital curators: legal deposit department
      Engineers: IT department
  13. Offering access with Wayback
      • Give readers the ability to browse the web “as it was” with:
        – a regular web browser
        – search and redisplay software
      • An application called “Web archives”
        – Wayback: for URL search, display and browsing
        – Nutch prototype for keyword search
        – Guided paths for collection highlights
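
Browsing the web "as it was" comes down to answering one question: for a given URL and a requested date, which stored capture is closest? The sketch below illustrates that lookup only; it is not Wayback's implementation. The `CAPTURES` index and its contents are invented for the example, and timestamps use the 14-digit `YYYYMMDDhhmmss` form commonly seen in web archive URLs.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical capture index: URL -> chronologically sorted timestamps.
CAPTURES = {
    "http://www.bnf.fr/": ["20080115120000", "20100620093000", "20130709080000"],
}


def parse(timestamp):
    """Convert a 14-digit timestamp into a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def closest_capture(url, wanted):
    """Return the capture timestamp nearest to `wanted`, or None if no capture exists."""
    timestamps = CAPTURES.get(url)
    if not timestamps:
        return None
    # Fixed-width timestamps sort as strings in chronological order,
    # so a binary search finds the insertion point directly.
    i = bisect_left(timestamps, wanted)
    # Only the neighbours on either side of that point can be the nearest.
    candidates = timestamps[max(i - 1, 0): i + 1]
    target = parse(wanted)
    return min(candidates, key=lambda ts: abs(parse(ts) - target))


print(closest_capture("http://www.bnf.fr/", "20100101000000"))
```

A request for 1 January 2010 is served by the June 2010 capture, because it is nearer in time than the January 2008 one. Requests before the first capture or after the last one fall back to that first or last capture.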
  14. Challenges:
      • links with our main Catalogue and open data repository
      • “smart” URL search
      • full text search and indexing
      • small-scale data mining projects with researchers
  15. Questions?
      E-mail: sara.aubry@bnf.fr
      Web site: http://www.bnf.fr
      Twitter: http://twitter.com/DLWebBnF
