Digitizing all Dutch books, newspapers & magazines - 730 million pages in 20 years -  storing it, and getting it out there

Olaf D. Janssen
Koninklijke Bibliotheek (KB), National Library of the Netherlands,
Prins Willem-Alexanderhof 5, The Hague, The Netherlands
olaf.janssen@kb.nl

Abstract. In the next 20 years, the Dutch national library will digitize all printed publications since 1470, some 730M pages. To realize the first milestone of this ambition, KB made deals with Google and Proquest to digitize 42M pages. Since 2003 KB has operated its e-Depot, a system for permanent digital object storage. KB is now replacing it with a new solution to better deal with future demands, allowing improved storage of its mass digitization output. To meet user demand for centralized access, KB is also replacing its scattered full-text online portfolio with a National Platform for Digital Publications, both a content delivery platform for its mass digitization output and a national domain aggregator for publications. From 2011 onwards, this collaborative, open and scalable platform will be expanded with more partners, content and functionalities. The KB is also involved in setting up a Dutch cross-domain aggregator, enabling content exposure in Europeana.

Keywords: National libraries, Digital library workflows, Mass digitization, Google, Proquest, Permanent storage, Integrated access, Cross-domain cultural heritage, Aggregation, Interoperability, Europeana

1 Digitizing the KB

The KB [1] started digitizing its holdings in 1995, for reasons of accessibility and long-term preservation. In the first years, small-scale efforts focused on scanning visually attractive materials, highlights of the collection for the widest possible audiences. One of the first projects was 100 highlights of the Koninklijke Bibliotheek [2], followed by Memory of the Netherlands [3], the national programme for digitizing Dutch cultural heritage, which focused on image-based materials.
It was not until 1999 that the KB started digitizing historical textual publications (books, newspapers & magazines).

For the last 8 years, the focus has been on large-scale digitization of text corpora for study and research in the humanities using public funding. In 2003 a project took off to scan the complete run of Dutch Parliamentary Papers [4]. Consisting of 2.3 million pages, this was at that time an unprecedented quantity for the Netherlands. At the end of 2006 the KB was awarded the Historical Newspapers project [5]. By the end of 2011, it will have scanned 8 million pages from popular Dutch regional, national and colonial newspapers from the period 1618-1995.

In addition, in February 2011 the Early Dutch Books Online digitization effort [6] delivered 2.1 million full-text pages from the special book collections of the KB and the university libraries of Amsterdam and Leiden. Furthermore, by the end of this year, some 1.5 million pages from the most frequently consulted old magazines (1840-1950) will have been converted into full text.

In 2010 the KB announced its ambitious plans to digitize all Dutch books, newspapers, magazines and other printed publications from 1470 onwards, a total of 730 million pages. A first milestone is set for 2013, by which time the library should have scanned 10% of this amount. To realize its ambition, the KB cannot rely on public funding alone, especially at a time when government support for cultural heritage is on a downward trend. It has therefore entered into strategic public-private partnerships with both Google [7] and Proquest [8] to digitize 210,000 books (some 42M pages) from its public domain collections.

2 Permanent storage, now & in the future

As the national library, the KB has a duty to permanently store not only printed publications, but also digital ones (both born-digital and digitized). As early as 1994 the KB recognized the importance of such an electronic depot and took action accordingly. It started making pilot agreements with major international publishers for depositing e-journals (“safehaven”) and undertook market research to acquire a technical solution for permanent storage.
Such a system turned out not to be available off the shelf, so in 2000 the KB joined forces with IBM to build the world’s first OAIS-based processing and preservation system for permanent storage of digital objects. This has resulted in the operational e-Depot [9], which the KB has been running since 2003.

Nowadays, this deposit is a safehaven for over 15 million scientific articles from some of the world’s biggest publishers [10], focusing on international scientific, technical and medical journals (STM publications). In addition it houses digital monographs, periodicals and reports from Dutch publishers and materials from the scientific repositories of Dutch universities, as part of the NARCIS [11] initiative.

2.1 Towards a new e-Depot

In 2012 the KB’s maintenance contract with IBM will run out and components of the system will no longer be supported. The current implementation of the e-Depot is based on requirements set in the late ‘90s. Some of these have become outdated with respect to current and expected future requirements for speed and collection management facilities. Additionally, with the system's seven-year itch [12] having passed, it is already living longer than most other IT systems. Other reasons for upgrading the e-Depot are:

  • Volume & scalability: digital publishing has led to enormous growth of the KB’s digital collections. Furthermore, the KB wants to permanently store the hundreds of millions of files resulting from its mass digitization programme.
  • Heterogeneity & flexibility: the current system is only optimized for processing and storing relatively small numbers of homogeneous single objects, i.e. mostly PDFs. In other words, it is not able to give fast access to large numbers of diverse and compound content, which will become increasingly common in the near future (e.g. enriched publications, e-books, websites).

In defining new requirements, the KB consulted its international colleagues, most notably the National Library of Germany (DNB) and SUB Göttingen. This collaboration was based on the joint use of the IBM-based system. In early 2009, the KB and DNB sought cooperation with other European national libraries to share experience, knowledge and resources. Another reason for doing so was the lack of suitable commercial off-the-shelf products; the solutions that are available bring the risk of vendor lock-in. If national libraries joined forces in defining requirements and tendering, this could trigger commercial suppliers to invest more in developing solutions that answer their requirements. Together with the national libraries of the UK, Germany, Norway, Spain, Portugal, Switzerland and the Czech Republic, the KB defined an architectural outline, based on a two-layered OAIS model and a modular setup of the preservation system.
Unfortunately, later that year the libraries decided not to have a joint tender due to different timelines.

To guarantee continued technical innovation and development of the e-Depot, the KB is a partner in the SCAPE project [13]. This EU-funded initiative will provide ongoing technical input by developing scalable preservation planning and execution services that can be deployed in the new e-Depot system within the next three to five years.

3 Providing access & adding value

The back-end data standards [14] are identical across all KB-run mass digitization projects, making the outputs in theory fully interoperable. However, this potential has not yet been exploited in the front-end presentation of the KB’s full-text collections. So far this has been done via separate websites [4, 5, 15], each with its own specific branding, URLs, design and search & object display functionalities.

For end users the KB collections thus appear to be unrelated and scattered, making them relatively difficult to use given the expectations of modern users. They demand all content to be available via a single point of entry, with the ability to apply multiple views & filters (by theme, by time, by geographical location, by object type etc.) to the interoperable, contextualized, enriched and re-usable content, with minimum copyright limitations. In addition, users are primarily interested in the digital content itself, much less in which physical object or institution it was derived from.

3.1 Providing access – the Dutch National Platform for Digital Publications

The KB has taken these user demands seriously and has just finished designing and implementing the first basic iteration of the Dutch National Platform for Digital Publications (working name). This full-text content distribution platform will give access to digitized books, newspapers and magazines. Not only will it include the output of the KB’s mass digitization projects, but it will also be open for text collections from other libraries. Access will be central via a modern Web 2.0 site, as well as distributed via search and display APIs. These can deliver content to users in their normal workflows (via regular social networks, on mobile devices, in professional virtual research environments & communities, in products like Zotero, RefWorks, EndNote etc.), as well as allow others (both businesses and consumers) to build their own applications based on the content.

Further key design choices of the platform include:

  1. Open: everybody can bring and get content, as long as it fits the scope (Dutch textual publications) and certain standards (e.g. metadata & object quality). This will enable small institutions without much in-house expertise or infrastructure to expose their content on a national level. Depending on the rights on the objects, the content can be used, re-used, shared or enriched by third parties.
  2. Scalable: given the ambitions of the KB to make all (to be) digitized collections available online, the platform must be able to cope with huge amounts of metadata and objects in the future.
     This means the service should allow for step-by-step upscaling towards more content and functionalities, with as little manual programming or data conversion work as possible.
  3. Collaborative: as said above, the platform will be an open network of the KB and other institutions, starting with a coalition of the willing. To guarantee buy-in from the start, partners will need to work collaboratively on both operational and strategic levels. This includes not only technical, but also organizational issues, such as funding, sustainability, governance and policy development.

This collaborative approach means that

  • responsibilities (e.g. financial, technical, business, product development) are shared among the partners,
  • national expertise about e.g. semantic & metadata interoperability is brought together,
  • barriers for new partners to join the network are lowered,
  • positions for joint support funding requests (both on national and European levels) become stronger, and thus
  • future sustainability of the platform is more likely.

Furthermore, the National Platform for Digital Publications will improve the visibility of the KB as an attractive business-to-business service & data provider for partners in the Netherlands. The KB could for instance offer a package of (paid) permanent object storage in its e-Depot, with an option to present the object on the platform to end users free of charge.

The platform marks a turning point towards centralized access to KB text collections. Starting with the output of the Early Dutch Books Online project in May 2011, the content of the platform will be expanded step by step in the years to come. The current planning is as follows:

  • 2011: Early Dutch Books Online (2.1M pages), first set of old magazines (1840-1950, up to 1.5M pages), first set of early 20th century books (1913 onwards)
  • 2012: Historical Newspaper collection (8M pages, by transferring the content of http://kranten.kb.nl into the platform), collection of historical children’s books from the Rotterdam public library
  • 2012-2014: output from the Google & Proquest efforts to be included, up to 42M pages

Finally, the National Platform for Digital Publications will be positioned as a full-text and metadata aggregator, with the aim of making the content interoperable and exporting it to cross-domain initiatives on national, European and global levels. See Section 4 for more details.

3.2 Improved access leads to added value creation

In the past decade, cultural heritage institutions have invested increasingly in their digital services, making their collections accessible and at the same time bringing new economic and social benefits within reach.
A report [16] by the Dutch Foundation for Economic Research has shown that the total benefits of digitization and accessibility outweigh the costs. The heritage sector, creative industries, the education sector and consumers will all experience immediate benefits from widespread availability of cultural heritage objects. In other words, digital collections represent significant potential economic and social value, provided they are made easily accessible.

To understand how institutions should make their collections accessible to generate maximum added value, the BMICE [17] distribution ring model [18] of Figure 1 gives guidance.
Figure 1. The BMICE ring model - Distribution rings showing four forms of access to cultural heritage. The outward arrow represents the direction of added value.

The four rings represent the following levels of access:

  1. Analogue in house: The work is displayed physically or made physically accessible in an archive, exhibition or reading room.
  2. Digital in house: The work is described digitally and may be digitized. It is made available within the walls of the institution by means of a closed network (or through digital data carriers), such as a computer or terminal at the institution that visitors can use to search through the collection database.
  3. Online: All or part of the digital collection of the institution is offered online through the institution’s website, but without explicit rights of use or reuse.
  4. Online in the network: Digital collections of the institution are made available in online networks. Rights of use are granted to third parties (the public, other institutions) for use or reuse.

Heritage institutions have traditionally focused on - and felt safe in - the first ring, with ring 2 opening up since the start of the digital age in the late ‘80s. The 3rd ring has come into view since the mid ‘90s, when the web entered everyday life. The rise of the social web in the ‘00s has put momentum in giving access to objects in the 4th ring. Even nowadays, many content holders are only just beginning to enter this circle and understand the huge benefits of opening up their collections within rights-controlled networks & communities; for many this means a big step outside their trusted safe zones. The yellow outward arrow in Figure 1 represents the direction of added value. It can thus be concluded that “the more heritage institutions move outside their comfort zones, the greater the value that is created.”
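The four rings form an ordered scale, which can be made concrete with a small sketch. The enum below mirrors the ring names from Figure 1; the `classify` function and its three yes/no inputs are illustrative assumptions, not part of the BMICE model itself.

```python
from enum import IntEnum


class AccessRing(IntEnum):
    """The four BMICE distribution rings, innermost (1) to outermost (4)."""
    ANALOGUE_IN_HOUSE = 1
    DIGITAL_IN_HOUSE = 2
    ONLINE = 3
    ONLINE_IN_NETWORK = 4


def classify(digitally_described: bool, online: bool,
             reuse_rights_granted: bool) -> AccessRing:
    """Map three properties of a collection to a ring (hypothetical rule)."""
    if online and reuse_rights_granted:
        # Offered in online networks with explicit rights of use or reuse.
        return AccessRing.ONLINE_IN_NETWORK
    if online:
        # Offered on the institution's website without explicit reuse rights.
        return AccessRing.ONLINE
    if digitally_described:
        # Only reachable inside the institution, e.g. via a terminal.
        return AccessRing.DIGITAL_IN_HOUSE
    return AccessRing.ANALOGUE_IN_HOUSE


reading_room = classify(False, False, False)
open_platform = classify(True, True, True)
# IntEnum members compare by value, so "further out" is simply "greater".
print(open_platform > reading_room)
```

Because the rings are ordered, comparing two members expresses the direction of the outward arrow: a service in a higher ring sits further along the axis of added value.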
Some examples of activities in the outermost ring are:

  • On-demand digital archive: Users can search & order (free or paid, depending on the rights) cultural heritage sources using various search functions.
  • Online museum experience: Alternative to or expansion of the museum using web 2.0 tools and platforms. Target users are approached actively by offering widgets, setting up discussion groups on social networks, and so on.
  • Collaborative storytelling: Users tell their own personal stories on platforms. Heritage institutions often provide specific rights-cleared archive material that users can then integrate into their narrative.
  • Distributed online research: Technical platforms, tools and social networks where users can jointly conduct and present research. This guarantees a certain degree of reliability with regard to the information, the relationship between the sources and the members of the community. An example of this is wikipedia.org.
  • Social tagging: Users are given the facility of tagging digitized cultural heritage sources. The tags can contain a description or can express some appreciation, and they enrich the collection, making it easier and more worthwhile to discover.
  • Online marketplace: This offers users the chance to bid online for cultural heritage objects and works of art.

Another example of a 4th ring service is the National Platform for Digital Publications. As said above, it will be an open & collaborative service, providing search and display APIs for delivering content to the places and networks where users are. Similar to YouTube, it will offer widget-based embeddable content, possibilities for user annotation, user profile pages, and cross-collection searching & display.

4 The cross-domain & international dimensions

As the national library, the KB has a very important facilitating and networking role in the Dutch scientific and cultural infrastructure.
Using this position, it has the potential to set up and stimulate different levels of collaboration to make online heritage more accessible. This is illustrated by the 3-tier collaborative model in Fig. 2.
Figure 2. Dutch national collaborative aggregation model. The KB is responsible for aggregating publications in the National Platform for Digital Publications.

Lower level: domain specific collaboration & aggregation

As said in Section 3, the KB’s National Platform for Digital Publications will be positioned as an aggregator for Dutch full texts, aiming to make the content - and the network of content-delivering partners - interoperable and ready for participation in cross-domain initiatives on national and international levels.

Besides the KB with its platform, organizations from other domains are working on interoperability and aggregation for their specific sectors. Led by the Institute for Sound & Vision [19], institutions from the audio-visual domain collaborate to enable aggregation of AV materials. Similar initiatives are taking place for the archival domain, with the National Archives [20] as the facilitator, and for the museum sector. For the latter, the Rijksdienst voor het Cultureel Erfgoed [21] is the main player.

The ways content aggregation and the supporting technical and organizational structures are set up are not uniform, but differ across the domains. Based on sector-specific best practices, knowledge and culture, each aggregator is setting up domain interoperability in the best possible way. This is however not done in isolation; the domains are in regular contact to reach consensus on issues such as “which content goes where”, to learn from each other and to avoid overlapping work. This way responsibilities & roles are kept clear, while at the same time synergies are exploited where possible.
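The “which content goes where” question at the lower level can be pictured as a simple lookup from domain to lead organization. The sketch below uses the four organizations named above; the domain labels, the record layout and both function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Lead organization per domain, as named in the text.
# The dictionary keys are hypothetical labels, not an official vocabulary.
DOMAIN_LEADS = {
    "publications": "Koninklijke Bibliotheek (National Platform for Digital Publications)",
    "audio-visual": "Institute for Sound & Vision",
    "archives": "National Archives",
    "museums": "Rijksdienst voor het Cultureel Erfgoed",
}


def lead_for(domain: str) -> str:
    """Return the organization leading aggregation for a domain."""
    try:
        return DOMAIN_LEADS[domain]
    except KeyError:
        raise ValueError(f"no lead organization known for domain {domain!r}") from None


def group_by_lead(records):
    """Group (identifier, domain) pairs by the organization that aggregates them."""
    grouped = defaultdict(list)
    for identifier, domain in records:
        grouped[lead_for(domain)].append(identifier)
    return dict(grouped)


# Hypothetical records: identifiers are invented for the example.
sample = [("newspaper-1", "publications"), ("film-7", "audio-visual"),
          ("book-3", "publications")]
print(group_by_lead(sample))
```

Keeping the mapping in one place reflects the point made above: each item has exactly one responsible aggregator, so roles stay clear and overlapping work is avoided.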
Middle level: national cross-domain collaboration & aggregation

To enable these sector-specific aggregation initiatives to come together, the results of the NED! project [22] are used. It delivered a basic infrastructure for the interoperability of Dutch digital heritage, using open standards including XML, Dublin Core, OAI-PMH and SRU. It is now being expanded to build a cross-domain heritage aggregator that can become the national hub for content delivery to international initiatives.

Building a national aggregator is however a step-by-step process, not finished overnight. Until that time domain-specific aggregators - in the case of the library domain the Dutch National Platform for Digital Publications or The European Library [23] - will continue to have an important role in routing Dutch library content directly to top-level services. Finally, it should be noted that the cross-domain hub is envisioned as a “dark aggregator”, i.e. a B2B service without an interface (website) for end users (however, see item 5 below).

Top level: International cross-country collaboration & aggregation

Having established national cross-domain aggregation and interoperability on as many levels as possible [24], Dutch content can be shown and used on international stages, most notably Europeana [25].

This fast-growing, largely EU-funded metadata aggregator and display space for European digitized works enables people to explore the resources of Europe's museums, libraries, archives and audio-visual collections. It promotes discovery and networking opportunities in a multilingual space where users can engage, share in and be inspired by the rich diversity of Europe's cultural and scientific heritage. Europeana always connects users to the original source of the material so authenticity is ensured.
The digital objects they can find are not stored centrally with Europeana, but remain hosted at the providing cultural institutions.

Europeana offers the following added values for (Dutch) content-holding institutions:

  1. It enriches the experience of their users by making relations between their objects and information from other countries and in other formats. This enables cross-border and interdisciplinary research, as well as enriching the content by presenting it in a wider context.
  2. Users expect integrated content – they want to see videos, listen to sound recordings, look at images and read texts, all in one place. Using Europeana they can find related content in multiple formats, from different countries and from diverse domains and disciplines.
  3. Europeana makes their content findable in search engines.
  4. Europeana generates extra visits to their holdings by redirecting users to the original source of the content (i.e. the content holders’ websites).
  5. Europeana offers a set of APIs [26]. These not only enable reuse of Europeana content by third parties, but also allow the contextualized & enriched content of the providing institutions to be used in their own environments. The APIs, in other words, make it possible to create user interface elements for (dark) aggregation services on the lower and middle levels, as indicated in Figure 2 by the dotted API arrows.
  6. Knowledge transfer can be major added value for participants in the Europeana network. Europeana collaborates with professionals from digital libraries across Europe and the US. Knowledge generated by these experts is fed back into the network via presentations, workshops and seminars. This way valuable knowledge about the theory and practice on metadata standards, multilinguality, semantic web, information architectures, usability, geolocation, object modeling and many other subjects becomes available for content suppliers.

All advantages mentioned in Section 3 about openness, scalability and collaboration apply equally to Europeana, as these key design choices were also the foundations on which Europeana was built. Similar to the National Platform for Digital Publications, Europeana is also a service in the 4th ring of the BMICE model. Becoming partners in the Europeana network and making their content (re-)usable there will thus allow Dutch institutions to add another layer of added value to Dutch cultural & scientific heritage.
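Section 4 names OAI-PMH and Dublin Core among the open standards on which the national interoperability infrastructure is built. As a rough illustration of what harvesting through such an interface involves, the sketch below builds a standard OAI-PMH `ListRecords` request URL and extracts Dublin Core titles from a response. The endpoint `http://example.org/oai`, the set name and the sample record are invented for the example; nothing here describes the actual interfaces of the KB, NED! or Europeana.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract identifier and Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [el.text for el in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example newspaper issue</dc:title>
          <dc:type>text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("http://example.org/oai", set_spec="newspapers"))
print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would also have to fetch the URL, follow the `resumptionToken` that OAI-PMH uses for paging through large result sets, and handle protocol errors; those steps are left out to keep the sketch short.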

[1] Koninklijke Bibliotheek (KB), national library of the Netherlands, http://www.kb.nl
[2] 100 highlights of the KB, http://www.kb.nl/galerie/100hoogtepunten/index-en.html
[3] Memory of the Netherlands, the national programme for digitizing Dutch cultural heritage, http://www.geheugenvannederland.nl
[4] Filming and digitization of the Dutch parliamentary papers 1814-1995, http://www.kb.nl/hrd/digitalisering/archief/staten-generaal-en.html (project information) & http://www.statengeneraaldigitaal.nl/ (website)
[5] Dutch Historical Newspapers 1618-1945, http://www.kb.nl/hrd/digi/ddd/index-en.html (project information) & http://kranten.kb.nl (website)
[6] EDBO – Early Dutch Books Online - 10,000 full-text digitized books from 1781-1800, 2.1 million pages, http://www.earlydutchbooksonline.nl (from 26-5-2011 onwards)
[7] KB and Google sign book digitization agreement, http://www.kb.nl/nieuws/2010/google-en.html
[8] Digitization by Proquest of early printed books in KB collection, http://www.kb.nl/nieuws/2011/proquest-en.html
[9] E-Depot, the KB’s digital archiving environment for permanent access to digital objects, http://www.kb.nl/hrd/dd/index-en.html
[10] Including, but not limited to, Elsevier, BioMed Central, Blackwell Publishing, Oxford University Press, Springer and Brill. For a complete list, see http://www.kb.nl/dnp/e-depot/operational/background/policy_archiving_agreements-en.html
[11] NARCIS, National Academic Research and Collaborations Information System, http://www.narcis.nl/about/Language/en
[12] Wijngaarden, H. van: The seven year itch. Developing a next generation e-Depot at the KB. Paper for the 76th IFLA General Conference and Assembly, 10-15 August 2010, Gothenburg, Sweden, http://www.ifla.org/files/hq/papers/ifla76/157-wijngaarden-en.pdf (accessed on 28-03-2011)
[13] SCAPE - SCAlable Preservation Environments, http://www.scape-project.eu/
[14] KB’s open digitization & accessibility standards, http://www.kb.nl/hrd/digitalisering/standaarden-en.html
[15] Digitization of ANP news items, http://www.kb.nl/hrd/digitalisering/archief/anp-en.html (project information) & http://anp.kb.nl (website)
[16] Hof, B.J.F. et al.: Baten in beeld; Kengetallen kosten-batenanalyse: beelden voor de toekomst, SEO Amsterdam (2006), ISBN13 9789067333405, http://www.kennisland.nl/uploads/.../8ba66f40-51c9-4f7f-9e60-8404c8aa84e8 (accessed on 27-03-2011)
[17] BMICE, Business Model Innovatie Cultureel Erfgoed / Business Model Innovation Cultural Heritage, http://www.bmice.nl/
[18] BMICE ring model, taken from http://www.den.nl/getasset.aspx?id=Businessmodellen/KL_BusModIn_web_eng_04.pdf&assettype=attachments
[19] The Netherlands Institute for Sound & Vision, http://instituut.beeldengeluid.nl
[20] National Archives of the Netherlands, http://www.en.nationaalarchief.nl/default.asp
[21] Rijksdienst voor het Cultureel Erfgoed, http://www.cultureelerfgoed.nl
[22] NED! - Nederlands Erfgoed Digitaal!, http://www.nederlandserfgoeddigitaal.nl/
[23] The European Library; on the one hand a free service that offers access to the resources of the 48 national libraries of Europe in 35 languages, on the other hand an international library domain aggregator for Europeana, http://www.theeuropeanlibrary.org
[24] Establishing interoperability on as many levels as possible: technical, metadata, semantic, human, inter-domain, organizational, political, etc.
[25] Europeana; paintings, music, films and books from over 1500 of Europe's galleries, libraries, archives and museums, http://www.europeana.eu
[26] Europeana Application Programming Interfaces, http://version1.europeana.eu/web/api
