Archiving the French Web: the BnF web archiving workflow. Sara Aubry

Presented at the International Conference on Web Archives and Electronic Legal Deposit, at the Biblioteca Nacional de España (BNE), on 9 July 2013.


Published in: Technology, Design

Transcript

  • 1. Archiving the French Web: the BnF web archiving workflow • Sara Aubry, Web Archiving Project Manager, IT department, Bibliothèque nationale de France • International Conference on Web archives and e-LD, Biblioteca Nacional de España, Madrid, July 9th 2013
  • 2. Let’s start with some figures • Programme started in 2000, industrialised in 2008-2012 • Collections: – 1996 to now – 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls – 18.8 billion URLs, 370 TB, growing by about 100 TB per year • Resources: – 9 full-time employees (5 librarians, 4 engineers) – many partners within and outside the Library, at both the national and international level – 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
  • 3. Digital curation is not different! • « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) » (definition of Digital Archiving in Wikipedia)
  • 4. BnF workflow overview • Selecting • Collecting • Indexing • Accessing • Preserving (diagram label: nas_preload)
  • 5. Selecting with BCWeb
  • 6. Selecting with BCWeb • A form-based application, commonly called a « curator tool »: – for content curators and researchers to nominate websites to harvest – giving basic information about them (content policies, trend watch) • Most important information for each website: – Internet address/URL – frequency (daily, monthly, yearly, once…) – size/budget (small, medium, big) – depth (entire domain, part of it) (figure label: content curators)
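
The four fields listed on this slide can be pictured as a small record. The following is a minimal sketch only, not BCWeb's actual data model: the class and field names are assumptions made for illustration, and the enumerations contain only the values named on the slide (whose frequency list is open-ended).

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    # Only the values named on the slide; the slide's list ends with "…".
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART_OF_DOMAIN = "part of domain"


@dataclass(frozen=True)
class Nomination:
    """Hypothetical record for one nominated website."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth


# Illustrative entry; the URL comes from the closing slide, the settings are invented.
example = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
print(example.url, example.frequency.value, example.budget.value, example.depth.value)
```

Using enumerations rather than free text keeps the choices offered by a form constrained to known values.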
  • 7. The Web is made of HTML pages • Example: 1 HTML page, 48 URLs • 1 HTML • 1 text/css • 4 javascript • 17 image/png • 5 image/jpeg • 21 image/gif • All links and inclusions are URL references
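
To illustrate how a single page expands into many URL references, here is a minimal sketch that collects links (`href`) and inclusions (`src`) from an HTML string using Python's standard library. It is not the extraction logic of any particular harvester; the `ReferenceExtractor` class, the `extract_references` helper and the sample page are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    """Collects URLs found in href (links) and src (inclusions) attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.references.append(urljoin(self.base_url, value))


def extract_references(base_url, html):
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    return parser.references


# Invented sample page with one stylesheet, one script, one image and one link.
SAMPLE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="/app.js"></script>
</head><body>
<img src="logo.png">
<a href="http://example.org/about">About</a>
</body></html>
"""

refs = extract_references("http://example.org/index.html", SAMPLE)
for ref in refs:
    print(ref)
```

A real page also references URLs from places this sketch ignores, such as CSS files and scripts.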
  • 8. Harvesting with Heritrix • A harvester is a piece of software (also called a crawler, spider or robot) • It simulates what a person would do with a browser, but repeatedly and very fast • It follows a looping process: pick a location, make a request, receive a response, examine for references, save the content (WARC) • The loop repeats as long as new, in-scope URLs are found and the limits (budget and time) are not reached
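
The looping process on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not Heritrix's implementation: the "web" is a hypothetical in-memory dictionary instead of real HTTP, scope is reduced to "same host as the seed", the budget is a simple URL count, the time limit is not modelled, and content is kept in a dictionary rather than written to WARC files.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> (content, outgoing references).
FAKE_WEB = {
    "http://example.fr/": ("home", ["http://example.fr/a", "http://other.org/"]),
    "http://example.fr/a": ("page a", ["http://example.fr/b", "http://example.fr/"]),
    "http://example.fr/b": ("page b", []),
    "http://other.org/": ("off-site", []),
}


def fetch(url):
    """Stand-in for an HTTP request/response; returns (content, references)."""
    return FAKE_WEB.get(url, ("", []))


def crawl(seed, max_urls):
    scope_host = urlparse(seed).netloc
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    archive = {}               # saved content, keyed by URL
    # Repeat while there are new in-scope URLs and the budget is not reached.
    while frontier and len(archive) < max_urls:
        url = frontier.popleft()              # pick a location
        content, references = fetch(url)      # make a request, receive a response
        for ref in references:                # examine for references
            in_scope = urlparse(ref).netloc == scope_host
            if in_scope and ref not in seen:
                seen.add(ref)
                frontier.append(ref)
        archive[url] = content                # save the content
    return archive


saved = crawl("http://example.fr/", max_urls=10)
print(sorted(saved))
```

The off-site URL is discovered but never fetched, because it falls outside the scope derived from the seed.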
  • 9. Assets: • open source • small and large scale • textual or all-media formats • data structures
  • 10. Digital curators: legal deposit department
  • 11. Engineers: IT department • Challenges: • rich media and an ever-changing environment • social networks • content behind paywalls (news sites, ebooks)
  • 12. Piloting the crawls with NetarchiveSuite • Prepare, schedule, run and monitor harvests of websites, perform QA • Digital curators: legal deposit department • Engineers: IT department
  • 13. Offering access with Wayback • Give readers the ability to browse the web “as it was” with: – a regular web browser – search and redisplay software • An application called “Web archives”: – Wayback for URL search, display and browsing – a Nutch prototype for keyword search – guided paths for collection highlights
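
Browsing the web “as it was” comes down to finding, for a given URL, the archived capture closest to the date the reader asks for. The sketch below shows that lookup over a hypothetical in-memory index. It is not Wayback's code or index format: the `INDEX` dictionary, the capture dates and the `closest_capture` helper are invented for illustration.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical capture index: URL -> capture times, sorted in ascending order.
INDEX = {
    "http://www.bnf.fr/": [
        datetime(2010, 3, 1),
        datetime(2012, 6, 15),
        datetime(2013, 7, 9),
    ],
}


def closest_capture(url, wanted):
    """Return the capture time nearest to `wanted`, or None if the URL is unknown."""
    captures = INDEX.get(url, [])
    if not captures:
        return None
    pos = bisect_left(captures, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = captures[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda t: abs(t - wanted))


print(closest_capture("http://www.bnf.fr/", datetime(2012, 1, 1)))
```

Keeping the capture times sorted allows a binary search, which matters when an index holds billions of URLs.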
  • 14. Challenges: • links with our main Catalogue and open data repository • “smart” URL search • full text search and indexing • small-scale data mining projects with researchers
  • 15. Questions? • E-mail: sara.aubry@bnf.fr • Web site: http://www.bnf.fr • Twitter: http://twitter.com/DLWebBnF
