Presentation at Digital Humanities 2014, Lausanne. Looks at some of the issues involved in digitising historic newspapers in Europe, particularly how to build a website that can search across all of them.
Representation and Absence in Digital Resources: The Case of Europeana Newspapers
1. Representation and Absence in Digital Resources: The Case of Europeana Newspapers
Alastair Dunning, The European Library, @alastairdunning
Clemens Neudecker, National Library of the Netherlands, @cneudecker
DH2014, Lausanne
5. The estimated total cost of digitising the collections of Europe’s museums, archives and libraries, including the audiovisual material they hold, is approximately €100bn, or €10bn per annum for the next 10 years, factoring in a cumulative efficiency gain of 0.5% per annum.
The Research & Development Budget for the Joint Strike Fighter programme is estimated at €40.34bn.
It would cost between 10% and 40% of the Joint Strike Fighter R&D budget to digitise every eligible title in Europe’s libraries.
Source: Nick Poole, Collections Trust, http://nickpoole.org.uk/wp-content/uploads/2011/12/digiti_report.pdf
7. Currently: 2 million pages of full text
By 2015: 10 million pages of full text
Search by keyword, and organise by language, date, source library, title
Link: http://www.theeuropeanlibrary.org/tel4/newspapers
9. Full Text from the following libraries
• Bibliothèque nationale de France / National Library of France
• Koninklijke Bibliotheek / National Library of the Netherlands
• Landesbibliothek Dr. Friedrich Teßmann / Teßmann Library
• Eesti Rahvusraamatukogu / Estonian National Library
• Kansalliskirjasto / National Library of Finland
• Latvijas Nacionala Biblioteka / National Library of Latvia
• Biblioteka Narodowa / National Library of Poland
• Milli Kutuphane Baskanligi / National Library of Turkey
• Österreichische Nationalbibliothek / Austrian National Library
• Staatsbibliothek zu Berlin / Berlin State Library
• Staats- und Universitätsbibliothek Hamburg / State and University Library
• Univerzitet u Beogradu / University Library of Belgrade
Searching by title
10. Issue Level Records from the following libraries
• National Library of Wales
• St. Cyril and Methodius National Library / The National Library of Bulgaria
• National Library of the Czech Republic
• National and University Library in Zagreb
• Koninklijke Bibliotheek van België / Bibliothèque royale de Belgique
• Narodna in univerzitetna knjižnica / National and University Library of Slovenia
• National Library of Portugal
• National Library of Romania
• Landsbókasafn Íslands - Háskólabókasafn / National and University Library of Iceland
• National Library of Spain
• Bibliothèque nationale de Luxembourg / National Library of Luxembourg
Finding matching results in single or multiple issues
12. So far, okay. Similar functionality to other national and
regional digital libraries of newspapers
See other archives via:
https://www.google.com/maps/ms?msid=217164746645697066594.0004c3d764fcb71ed2314&msa=0
13. But what was the user response to an aggregation of European newspaper libraries?
Results of Usability Testing: http://www.europeana-newspapers.eu/wp-content/uploads/2014/05/The-European-Library-Newspaper-Archive-Usability-testing-Report-April-2014.pdf
17. Plenty of quibbles about design
- positions of advanced options
- re-order list of results
- manipulating facets
18. Much greater expectations of functionality once logged in
For example,
Saved searches
New content notification
19. “Much of the value of the site to participants was provided by the
images of the documents.
Participants expected to be able to save a 'local' copy once they
had located content of relevance.
As no download facility is provided, this led to some frustration
and undermined the overall potential value of the site for some
participants.”
20. Timetable for rest of project
Now – Prototype version of interface shared with project
Throughout 2014 - Ongoing creation of OCR, and other
related technical work (OLR, Named Entities)
Throughout 2014 – Live version of website improved /
usability testing / added content
Autumn 2014 - Final project conference
Late 2014 - Newspaper browser completed with content and
tools from project
More information at
http://www.europeana-newspapers.eu/
Interface at
http://www.theeuropeanlibrary.org/tel4/newspapers/
22. Why can’t I edit the text?
(Our sample was researchers; maybe it is other communities that are interested in crowdsourcing?)
Note: If time permits, The European Library will develop some crowdsourcing features
23. Source: Europeana Strategic Plan, 2015-2020, currently unpublished. See also Enumerate Project, enumerate.eu
24. Number of digitised pages in interface: c.2m
Number of digitised pages in European libraries: c.130m
Number of physical pages in European libraries: 1.5bn+
Source: European Newspaper Survey Report
http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf
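The three figures above can be turned into proportions with a few lines of arithmetic. This is a minimal sketch that takes the slide's rounded estimates at face value; the variable and function names are illustrative only, and because the physical total is given as "1.5bn+", the shares measured against it are upper bounds.

```python
# Approximate page counts taken from the slide (all are rounded estimates).
pages_in_interface = 2_000_000      # c.2m digitised pages in the interface
pages_digitised = 130_000_000       # c.130m digitised pages in European libraries
pages_physical = 1_500_000_000      # 1.5bn+ physical pages (a lower bound)


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


interface_of_digitised = share(pages_in_interface, pages_digitised)
digitised_of_physical = share(pages_digitised, pages_physical)
interface_of_physical = share(pages_in_interface, pages_physical)

print(f"Interface as share of digitised pages: {interface_of_digitised:.1f}%")
print(f"Digitised as share of physical pages:  {digitised_of_physical:.1f}%")
print(f"Interface as share of physical pages:  {interface_of_physical:.2f}%")
```

On these estimates the interface holds roughly 1.5% of what has been digitised, and at most about 0.13% of the physical pages held in European libraries.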
25. Source: European Newspaper Survey Report
http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf
Quantities of newspapers – a) in project b) digitised in total c) in
physical libraries
26. The project digital library is only a fraction of the newspaper
archive of the continent, indeed the world
30. ….. Difficult to represent
‘archival gaps’ when seen in
the context of how little has
been digitised - creates a
needle in the haystack ….
31. The estimated total cost of digitising the collections of Europe’s museums, archives and libraries, including the audiovisual material they hold, is approximately €100bn, or €10bn per annum for the next 10 years, factoring in a cumulative efficiency gain of 0.5% per annum.
The Research & Development Budget for the Joint Strike Fighter programme is estimated at €40.34bn.
It would cost between 10% and 40% of the Joint Strike Fighter R&D budget to digitise every eligible title in Europe’s libraries.
Source: Nick Poole, Collections Trust, http://nickpoole.org.uk/wp-content/uploads/2011/12/digiti_report.pdf
35. Graphs are the most obvious way of adding context
but still very reliant on the library producing such
charts
36. How to derive a representative
(random) sample from a digital
collection?
Source: http://dilbert.com/strips/comic/2001-10-25/
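One simple approach to the question above is a uniform random draw over item identifiers, optionally stratified by decade so that heavily digitised periods do not dominate. The sketch below assumes a hypothetical list of records with `id` and `year` fields; the function names are invented for illustration and do not describe any existing tool. Either way, the result is representative only of what has been digitised, not of the physical holdings.

```python
import random
from collections import defaultdict


def simple_sample(records, k, seed=None):
    """Draw k records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, k)


def stratified_sample(records, per_decade, seed=None):
    """Draw up to per_decade records from each decade of publication."""
    rng = random.Random(seed)
    by_decade = defaultdict(list)
    for record in records:
        by_decade[record["year"] // 10 * 10].append(record)
    sample = []
    for decade in sorted(by_decade):
        group = by_decade[decade]
        sample.extend(rng.sample(group, min(per_decade, len(group))))
    return sample


# Hypothetical catalogue: far more digitised issues from the 1910s than the 1850s.
records = [{"id": f"issue-{i}", "year": 1850 + (i % 10)} for i in range(20)]
records += [{"id": f"issue-{i}", "year": 1910 + (i % 10)} for i in range(20, 200)]

uniform = simple_sample(records, 10, seed=1)       # mirrors the skew of the collection
balanced = stratified_sample(records, 5, seed=1)   # five issues from each decade
```

The uniform draw reproduces whatever imbalance the collection already has, while the stratified draw trades that fidelity for even coverage across time.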
37. Pieter Francois, winner of BL Labs competition 2013:
“How representative are the historical texts humanities scholars study of the overall body of ‘surviving’ texts that are held in the various library collections?”
labs.bl.uk/Sample+Generator
38. There are other issues in the project content too
Major issues
OCR quality varies
Different licensing statements from
different countries
Date of copyright boundaries different in
each country
39. There are other issues in the interface too
Minor Issues
Some pages (2m by 2015) have article segmentation
Some library content has named entity extraction affecting search results
42. How should we allow users better ways to understand the digital library?
43. What role can the API play in this?
Can opening up the data in the digital library allow it to be explored in different ways?
44. Traditional Model
Interface (Created by Library)
Data (Published by Library)
With an API
Interface (Created by Third Party)
Data (Published by Library)
API – Application Programming Interface
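The "With an API" model can be illustrated with a small sketch in which a third party builds its own view (here, a crude pages-per-year chart) from data the library publishes. The JSON below stands in for an API response: its structure and field names (`results`, `date`, `pages`) are invented for this example and do not describe The European Library's or any other real API.

```python
import json
from collections import Counter

# Hypothetical API response; every field name here is an assumption.
SAMPLE_RESPONSE = """
{"results": [
  {"title": "Example Gazette", "date": "1914-08-01", "language": "de", "pages": 4},
  {"title": "Example Gazette", "date": "1914-08-02", "language": "de", "pages": 4},
  {"title": "Example Courier", "date": "1915-01-10", "language": "fr", "pages": 8}
]}
"""


def pages_per_year(response_text):
    """Count digitised pages per publication year from a JSON response."""
    counts = Counter()
    for issue in json.loads(response_text)["results"]:
        year = int(issue["date"][:4])
        counts[year] += issue["pages"]
    return dict(sorted(counts.items()))


def text_chart(counts):
    """Render the counts as a simple text bar chart, one row per year."""
    return "\n".join(f"{year} {'#' * n} ({n})" for year, n in counts.items())


counts = pages_per_year(SAMPLE_RESPONSE)
print(text_chart(counts))
```

The point of the sketch is the division of labour: the library only has to publish the records, and the charting logic lives entirely with the third party.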
46. Currently: 2 million pages of full text
By 2015: 10 million pages of full text
Search by keyword, and organise by language, date, source library, title
Link: http://www.theeuropeanlibrary.org/tel4/newspapers
47. Trove Newspapers statistics, developed by third party, based on data provided by library
http://wraggelabs.com/shed/trove/graphs/
Interface
(Created by Third Party)
Data
(Published by Library)
48. Headline Roulette, developed by
third party, based on data
provided by library
http://wraggelabs.com/shed/headline-roulette/
Interface
(Created by Third Party)
Data
(Published by Library)
49. Word Count of Articles, developed
by third party, based on data
provided by library
http://dhistory.org/frontpages/53/words/
Interface
(Created by Third Party)
Data
(Published by Library)