large scale document processing
→extract image metadata using Exiftool
→large scale batch processing using Apac...
find
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2...
large scale document processing
→ store once in HDFS and read many times
→ small files (TXT, HTML) stored in HD...
find
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707....
Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...
~ 6 h
Z119585409/0...
Sqoop
combine MySQL and Apache Hive DB
book level
metadata
page level
metadata
storage and access…
data
average size data package (~book): 101 MB
colour data package: 187 MB
grayscale data package: 82 MB
101 MB...
storage & access
→ data storage: in-house
→ JPEG-2000 master files stored redundantly
→ access copies generated...
book viewer
catalogue /
“Quick Search”
full-text search
[mobile apps]
Image Server
Catalogue
URN Resolver
Master Images
Digital Repository
Quick Search
Fulltext
Index Serv...
outlook
→ full-text: new possibilities for research
→ data enrichment
→ named entity recognition
→ linked data
...
DM2E
→http://dm2e.eu/
→European Commission co-funded project
→stimulate creation of new tools and
services for ...
next steps
→ 80,000 books already accessible via
Google Books
→Spring 2013: launch of Austrian Books
Online View...
http://books.google.at/books?vid=ONB%2BZ15367990X
http://books.google.at/books?vid=ONB%2BZ155606704
http://books.google.at/books?vid=ONB%2BZ174115105
http://books.google.at/books?vid=ONB%2BZ158211101
http://books.google.at/books?vid=ONB%2BZ169472305
http://books.google.at/books?vid=ONB%2BZ164893308
more information
www.onb.ac.at/ev/austrianbooksonline
www.onb.ac.at/ev/austrianbooksonline/faq.htm
twitter.com/...
thank you!
max.kaiser@onb.ac.at
www.onb.ac.at
www.slideshare.net/maxkaiser
www.linkedin.com/in/maxkaiser
gplus....
Austrian Books Online. The Austrian National Library's Large-Scale Digitisation Public-Private Partnership with Google

Library Science Talk in Geneva and Bern, Switzerland, 15 & 16 October, 2012

Transcript of "Austrian Books Online. The Austrian National Library's Large-Scale Digitisation Public-Private Partnership with Google"

  1. 1. @maxkaiser Austrian Books Online: The Austrian National Library’s large-scale digitisation public-private partnership with Google. Max Kaiser, Head R&D, Austrian National Library. Library Science Talk, Geneva, 15 October 2012; Bern, 16 October 2012
  2. 2. Austrian Books Online www.onb.ac.at/ev/austrianbooksonline/
  3. 3. www.slideshare.net/maxkaiser
  4. 4. digitisation of the entire historical book holdings of the Austrian National Library
  5. 5. largest Austrian public private partnership in the cultural sector
  6. 6. Austrian National Library
  7. 7. history back to the 14th century
  8. 8. one of the world’s most significant collections
  9. 9. "legal deposit" (Source: http://commons.wikimedia.org/wiki/File:Austria_Hungary_ethnic_de.svg)
  11. 11. legal deposit today → print publications → online publications → web archiving
  12. 12. seven special collections
  13. 13. → Picture Archives and Graphics Department → Map Department → Music Department → Literary Archives → Papyri Department → Department of Planned Languages → Department of Rare Books and Manuscripts
  14. 14. music department
  15. 15. music department
  16. 16. four museums
  17. 17. → State Hall → Papyrus Museum → Globe Museum → Esperanto Museum
  18. 18. papyrus department & museum
  19. 19. Department of Planned Languages & Esperanto Museum
  20. 20. Globe Museum
  24. 24. 16 reading rooms
  25. 25. 9am–9pm, 7 days/week
  26. 26. library as social space
  29. 29. services for researchers
  31. 31. access for everyone from anywhere
  32. 32. 10+ million pages of historical newspapers & legal texts
  36. 36. several 100k images
  37. 37. 140k portraits
  38. 38. 100k* posters (*by end 2012)
  39. 39. papyri…
  42. 42. → September 2012 http://www.onb.ac.at/about/21043.htm
  43. 43. Vision 2025: Knowledge for the world of tomorrow. Our holdings are digitized. We collect and sustain knowledge. Access to our knowledge is simple. With us, research is more faceted and effective. We enrich cultural and social life.
  44. 44. Our holdings are digitized → substantial parts of holdings digitized → cooperation with private partners → full text search → added-value services like semantic search → unified access system
  45. 45. We collect and sustain knowledge → focal point of collection policy is digital → preference for digital versions of publications → user generated content and social networks → digital photography → preservation of analogue and digital collections → scalable digital archive
  46. 46. Access to our knowledge is simple → unified access system for all collections → focus of cataloguing: metadata enrichment → linking of metadata with external resources → open data → APIs and support for third party apps
  47. 47. With us, research is more faceted and effective → integration of digital content in virtual research environments → support for digital humanities → strong research collections and libraries → cooperation with universities and research centres
  48. 48. We enrich cultural and social life → digital services, reading rooms and museums → innovative interfaces → mobile services → cooperation with private partners: reuse of data for innovative services → reinforce library as social space
  49. 49. Austrian Books Online
  50. 50. 600,000 volumes, 200 million pages
  51. 51. 16th century to 2nd half of 19th century
  52. 52. Google Books Digital Library Austrian National Library
  53. 53. Google Books: Partner Program, Library Program
  54. 54. 13 libraries in Europe; 5 national libraries: Italy, Austria, The Netherlands, Czech Republic, Great Britain
  55. 55. > 20 million books; > 50% non-English; ~ 75% from libraries; ~ 2 million books from European libraries; > 3 million books public domain
  56. 56. some strategy and policy considerations…
  57. 57. policy slides ahead!
  59. 59. what is a public private partnership?
  60. 60. PPP ≠ service contract or service outsourcing → long duration of the relationship → substantial investment by private partner → distribution of risks
  61. 61. rationales for PPPs → private funding for public sector → benefit from know-how and working methods of the private sector → but not a "miracle solution" for the public sector (EC Green Paper on Public Private Partnerships, 2004)
  62. 62. public private partnerships in the cultural sector
  63. 63. objectives for public partners → funding for digitisation → enhanced access → engaging new audiences → access to technology → access to private sector competencies → commercial income through user fees, royalties or revenue share → lobbying effort to increase public funding
  64. 64. objectives for private partners → commercial objectives → access to new markets or customer groups → association with strong public brands → access to (rare, unique) content → corporate social responsibility
  65. 65. benefits for citizens → increased online access → democratisation of access to knowledge → added-value services → benefit for learning and tourism → new creative endeavours
  66. 66. http://ec.europa.eu/information_society/activities/digital_libraries/doc/reflection_group/final_report_%20cds.pdf 10 January 2011
  67. 67. "Stimulating the flow of private funds for the digitisation of cultural assets through equitable public private partnerships appears as a viable and sustainable way of tackling the pressing question of making Europe’s cultural wealth accessible online and preserving it for future generations."
  68. 68. "The key question is not whether public-private partnerships for digitisation should be encouraged, but how, and under which conditions."
  69. 69. 27 October 2011
  70. 70. "(...) recommends that Member States (...) encourage partnerships between cultural institutions and the private sector in order to create new ways of funding digitisation of cultural material and to stimulate innovative uses of the material, while ensuring that public private partnerships for digitisation are fair and balanced (…)."
  71. 71. key principles: 1. respect for intellectual property rights → ONB-Google: only public-domain works digitised 2. non-exclusivity → ONB-Google: ONB free to digitise material with other partners 3. transparency of the process → ONB-Google: public tender
  72. 72. key principles: 4. transparency of agreements → ONB-Google: very detailed FAQs online 5. accessibility through Europeana → ONB-Google: → all files available for non-commercial use → access via platforms like Europeana → provision to research partners 6. key criteria → [next slide]
  73. 73. key criteria for assessing PPPs → total investment by private partner / effort of public partner → (free) access to material for general public, including through Europeana → cross-border access → length of any period of preferential commercial use by private partner → quality of digital copies for public partner → usage conditions for public partner in non-commercial context → time-scale of project
  74. 74. additional key elements in ONB-Google cooperation: → selection of books by library → Institute for Conservation involved → termination
  76. 76. "Genuine PPPs currently not a widespread method for financing digitisation by cultural institutions in Europe." (Commission Staff Working Paper accompanying the Commission Recommendation on the digitisation and online accessibility of cultural material and digital preservation, p. 18, http://ec.europa.eu/information_society/activities/digital_libraries/doc/recommendation/recom28nov_all_versions/staff_working_paper.pdf)
  77. 77. aim to maximize access and re-use via digitisation vs. access restrictions / re-use limitations in PPPs
  78. 78. public private partnerships as commodification of the cultural commons?
  79. 79. Cultural Commons → body of work freely available to the public for legal use, sharing, repurposing, and remixing → source for cultural creativity → http://creativcommons.org/culture
  81. 81. Public Domain → material to derive knowledge and create new cultural works → essential for society and economy
  82. 82. http://www.europeana-libraries.eu/web/europeana-project/publications
  83. 83. Public Domain Mark: "This work has been identified as being free of known restrictions under copyright law, including all related and neighbouring rights. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission." http://creativecommons.org/publicdomain/mark/1.0/
  84. 84. Public Domain Charter: "Public-Private Partnerships have become one option for funding large scale digitisation efforts. Commercial content aggregators pay for the digitisation in exchange for privileged access to the digitised collections. These activities are seen as a reason for attempting to exercise as much control as possible over digital reproductions of Public Domain works. Organisations are claiming exclusive rights in digitised versions of Public Domain works and are entering into exclusive relationships with commercial partners that hinder free access."
  86. 86. Public Sector Information → information produced, collected and held by public institutions → single largest source of information in Europe → should be widely re-used to foster economy and creativity
  87. 87. PSI Directive → EC "Directive on the Re-Use of Public Sector Information" (31 Dec. 2003) → aim: foster re-use of PSI → legally binding document → implemented by all Member States in 2008 → currently: cultural & research institutions excluded from directive
  88. 88. key provisions of PSI Directive → clear procedures for re-use requests → upper limit for charging → transparency of conditions and standard charges for re-use → avoid discrimination between players → prohibition of exclusive agreements
  89. 89. → 12 Dec 2011: Commission proposal for PSI Directive amendment
  90. 90. proposed changes → withdraw current exemption for cultural institutions → restrict public sector bodies to only apply charges for re-use based on marginal costs → exemption for libraries, archives, museums → prohibit agreement of terms for re-use which grant exclusive rights to any one party
  91. 91. → discussion in Council Working Groups under Danish and Cyprus Presidencies → latest published draft: 1 Oct. 2012
  92. 92. http://register.consilium.europa.eu/pdf/en/12/st13/st13162.en12.pdf
  93. 93. → Working Draft, 1 Oct. 2012: Article 11
  94. 94. → Working Draft, 1 Oct. 2012: Article 11
  95. 95. the project …
  96. 96. who is paying for what? http://www.bildarchivaustria.at/downl/1148453/layout/CE%2043_3.jpg
  97. 97. costs → full text-digitisation: very expensive → report by Collections Trust for Comité des Sages http://ec.europa.eu/information_society/activities/digital_libraries/doc/refgroup/annexes/digiti_report.pdf
  98. 98. Google: → transport → insurance → scanning → OCR → image processing → quality control → Google Books
  99. 99. Austrian National Library: → provision of metadata → selection → internal logistics → conservational assessment → barcoding → metadata adjustments → data download and control → data storage & digital preservation → Digital Library
  100. 100. → conservation → preservation http://www.mediathek.at/akustische-chronik/popup/popup.php?document_id=1000115&zone_id=1000043&template_id=1000016&zone_name=IMAGE_ZONE1
  101. 101. which books?
  102. 102. entire historical book holdings, 16th–19th century
  103. 103. 200,000 volumes: State Hall
  104. 104. Department of Manuscripts and Rare Books; Map Department (Source: http://deu.archinform.net/projekte/107)
  105. 105. Department of Music
  106. 106. Theatre Museum (Source: http://commons.wikimedia.org/wiki/File:Palais_Lobkowitz_Vienna_Oct._2006_006.jpg)
  107. 107. Fidei Commiss Library
  109. 109. 7 work packages: book logistics; metadata / catalogues; conservation / restoration; data download / quality control; access; IT infrastructure; project management
  110. 110. preparatory project, mid to end 2010 → integration with organisational processes → personnel resources → logistics workflows
  111. 111. internal communication → change processes → re-evaluation of workflows → availability of internal resources
  112. 112. consultation with other Google partners (Source: http://commons.wikimedia.org/wiki/File:M%C3%BCnchen_Bayerische_Staatsbibliothek_001.JPG)
  113. 113. 70+ staff members, 20+ exclusively for project → book logistics → metadata adaptation → cataloguing → conservation / restoration → quality control → software implementation → project management
  114. 114. end of 2010: test shipment & start of operational project; spring 2011: start of digitisation
  115. 115. no individual selection …
  116. 116. size
  117. 117. size
  118. 118. condition
  119. 119. preparation
  120. 120. conservational evaluation
  121. 121. value
  122. 122. book flow
  123. 123. logistics in the State Hall
  124. 124. logistics in the State Hall
  125. 125. logistics in the State Hall
  126. 126. challenges…
  127. 127. challenges…
  128. 128. challenges…
  129. 129. challenges…
  130. 130. challenges…
  131. 131. logistics in the „Aurum“ Depot
  132. 132. logistics in the „Aurum“ Depot
  133. 133. preparation for digitisation
  134. 134. manipulation area …
  135. 135. barcoding
  136. 136. adaptation of metadata
  138. 138. 8 minutes / volume
  139. 139. books
  140. 140. hours
  141. 141. working days
  142. 142. person years
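The figures shown on the four slides above (books, hours, working days, person years) are not preserved in the transcript. As a rough illustration of how "8 minutes / volume" scales, the sketch below applies it to the 600,000 volumes named earlier in the talk. The 8-hour working day and 220 working days per year are assumptions made for this example only, not values from the presentation.

```python
def preparation_effort(volumes, minutes_per_volume=8,
                       hours_per_day=8.0, days_per_year=220):
    """Convert a per-volume handling time into hours, working days, person years.

    hours_per_day and days_per_year are illustrative assumptions.
    """
    hours = volumes * minutes_per_volume / 60
    working_days = hours / hours_per_day
    person_years = working_days / days_per_year
    return hours, working_days, person_years


# 600,000 volumes is the project total stated on the "Austrian Books Online" slide.
hours, days, years = preparation_effort(600_000)
print(f"{hours:,.0f} hours, {days:,.0f} working days, {years:.1f} person years")
```

Under these assumptions the preparation step alone comes to roughly 80,000 hours, which helps explain why a sizeable dedicated team was needed.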
  143. 143. complex cases …
  144. 144. bound-togethers …
  145. 145. bound-togethers …
  146. 146. bound-togethers …
  147. 147. „slim“ volumes …
  148. 148. special collections …
  149. 149. conservational protection
  150. 150. conservational protection
  151. 151. conservational protection
  152. 152. conservational protection
  153. 153. cataloguing the Fidei Commiss Library
  154. 154. cataloguing the Fidei Commiss Library
  155. 155. ready for digitisation …
  156. 156. digitisation → scanning centre in Germany → procedures agreed → Austrian Federal Office for Monuments involved → each volume checked after return → books unavailable to users for ~ 3 months
  159. 159. where are we today?
  160. 160. 100,000 volumes digitized today
  161. 161. 185,000 volumes digitized by end 2013
  162. 162. of 100,000 volumes: 9.19% 16th century; 14.24% 17th century; 31.48% 18th century; 43.01% 19th century; 2.07% [no year of publication]
  163. 163. of 100,000 volumes: 33.41% German; 31.31% Latin; 15.55% French; 13.78% Italian; 2.73% English
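To make the two breakdowns above easier to read as volume counts, the following sketch converts the percentages into approximate numbers of volumes. It also computes the share not covered by the five listed languages. Reading that remainder as "other languages" is an inference from the arithmetic, not something the slide states.

```python
TOTAL_VOLUMES = 100_000  # volumes digitized at the time of the talk

century_share = {"16th": 9.19, "17th": 14.24, "18th": 31.48,
                 "19th": 43.01, "no year": 2.07}
language_share = {"German": 33.41, "Latin": 31.31, "French": 15.55,
                  "Italian": 13.78, "English": 2.73}


def to_counts(shares, total=TOTAL_VOLUMES):
    """Turn percentage shares into rounded volume counts."""
    return {key: round(total * pct / 100) for key, pct in shares.items()}


century_counts = to_counts(century_share)
language_counts = to_counts(language_share)

# Share of volumes not attributed to any of the five listed languages.
other_languages = round(100 - sum(language_share.values()), 2)

print(century_counts)
print(language_counts)
print(f"not listed: {other_languages}%")
```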
  164. 164. book flow / digital flow
  165. 165. digitisation, data download, book logistics, quality control, storage, access, ADOCO (Austrian Books Online Download & Control)
  166. 166. up to digitised items / day
  167. 167. quality control
  168. 168. quality control → goal: automated jobs → representative samples → IT assisted discovery of error clusters → error candidates checked manually → detect systematic and critical errors
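The sampling and error-cluster steps listed above could look roughly like the sketch below. It is not the ADOCO implementation: the function names, the 5% sampling fraction, the batch identifiers and the cluster threshold are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def draw_sample(volume_ids, fraction=0.05, seed=42):
    """Draw a reproducible random sample of volumes for manual checking."""
    rng = random.Random(seed)
    k = max(1, round(len(volume_ids) * fraction))
    return sorted(rng.sample(list(volume_ids), k))


def error_clusters(error_candidates, min_count=3):
    """Group error candidates by (batch, error type) and keep frequent groups.

    error_candidates: iterable of (volume_id, batch_id, error_type) tuples.
    A group that recurs within one batch hints at a systematic error.
    """
    counts = Counter((batch, err) for _, batch, err in error_candidates)
    return {key: n for key, n in counts.items() if n >= min_count}


volumes = [f"vol-{i:04d}" for i in range(1000)]  # hypothetical identifiers
sample = draw_sample(volumes)

candidates = [
    ("vol-0001", "batch-A", "cropping"),
    ("vol-0002", "batch-A", "cropping"),
    ("vol-0003", "batch-A", "cropping"),
    ("vol-0500", "batch-B", "blur"),
]
clusters = error_clusters(candidates)
print(len(sample), clusters)
```

Candidates that fall into a cluster would then be checked manually, as the slide describes, before any re-processing is requested.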
  169. 169. error model → level 1: data / information: image (thick, broken); illustration (scanner effects, tone, color etc.); full-text (OCR errors per page-image) → level 2: entire page: blur / warp / skew; cropping; obscure / cleaned; colorization; full-text (OCR error patterns at page level). Informed by the "Validating Quality in Large-Scale Digitization" project of Univ. of Michigan & Univ. of Minnesota, http://hathitrust-quality.projects.si.umich.edu/
  170. 170. error model → level 3: whole volume: order of pages; missing pages; duplicate pages; false pages; full text (OCR error patterns at volume level). Informed by the "Validating Quality in Large-Scale Digitization" project of Univ. of Michigan & Univ. of Minnesota, http://hathitrust-quality.projects.si.umich.edu/
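Some of the volume-level checks (order of pages, missing pages, duplicate page numbers) can be approximated from file names alone. The sketch below assumes page images are named with zero-padded sequence numbers such as 00000001.jp2. That naming convention and the function name are assumptions for this example. It only detects repeated or skipped sequence numbers, not pages whose image content is duplicated.

```python
from collections import Counter


def check_page_sequence(filenames):
    """Report gaps, repeated numbers and ordering problems in a page listing."""
    numbers = [int(name.split(".")[0]) for name in filenames]
    counts = Counter(numbers)
    expected = range(1, max(numbers) + 1) if numbers else range(0)
    return {
        "missing": [n for n in expected if n not in counts],
        "duplicates": sorted(n for n, c in counts.items() if c > 1),
        "out_of_order": numbers != sorted(numbers),
    }


report = check_page_sequence([
    "00000001.jp2", "00000002.jp2", "00000004.jp2",
    "00000004.jp2", "00000006.jp2", "00000005.jp2",
])
print(report)
```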
  171. 171. @maxkaiser use cases →reading online images →printing on demand →processing full text data →managing collections Informed by „Validating Quality in Large-Scale Digitization“ project of Univ. of Michigan & Univ. of Minesota, http://hathitrust-quality.projects.si.umich.edu/
  172. 172. @maxkaiser
  173. 173. @maxkaiser
  174. 174. @maxkaiser non-critical errors
  175. 175. @maxkaiser bleedthrough
  176. 176. @maxkaiser
  177. 177. @maxkaiser
  178. 178. @maxkaiser errors
  179. 179. @maxkaiser cropping error
  180. 180. @maxkaiser quality control via sampling re-processing re-download
  181. 181. @maxkaiser cropping error FIXED
  182. 182. @maxkaiser big data processing… http://blogs.loc.gov/digitalpreservation/files/2012/05/3875300483_a8875fea1c-500.jpg
  183. 183. technical slides ahead!
  184. 184. @maxkaiser technologies and workflows from EC co-funded FP7 projects: →SCAPE (Scalable Preservation Environments) →http://www.scape-project.eu/ →IMPACT (Improving Access to Text) →http://www.impact-project.eu/
  185. 185. experimental cluster
  186. 186. hadoop / map reduce: MASTER (Job Tracker, Name Node); SLAVE 1, SLAVE 2, … SLAVE n (each: Task Tracker, Data Node); Hadoop Distributed File System (HDFS) → experimental 5 server cluster at ONB → 40 cores in total → 30 cores assigned to task trackers
  187. 187. @maxkaiser use case 1: duplicate pages in one book →books with duplicated pages →due to scanning process & post processing →use key points of images to determine structural image similarity
  188. 188. use case 1: duplicate pages in one book
  189. 189. use case 1: duplicate pages in one book
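The key point idea from use case 1 can be illustrated with a minimal sketch. It assumes that every page image has already been reduced to a list of key point descriptors (plain float vectors) by some feature extractor, which is not shown. The ratio threshold of 0.75, the match-share cut-off of 0.5 and all function names are illustrative assumptions, not the workflow actually used at ONB.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_share(desc_a, desc_b, ratio=0.75):
    """Share of descriptors in desc_a with a clear nearest neighbour in desc_b.

    A match counts only if the best distance is clearly smaller than the
    second best one (ratio test); the threshold is an assumption.
    """
    if not desc_a or len(desc_b) < 2:
        return 0.0
    good = 0
    for d in desc_a:
        best, second = sorted(distance(d, e) for e in desc_b)[:2]
        if best < ratio * second:
            good += 1
    return good / len(desc_a)


def duplicate_candidates(pages, min_share=0.5):
    """Return pairs of page ids whose descriptors largely match each other.

    pages: dict mapping page id -> list of descriptor vectors.
    """
    pairs = []
    for (id_a, a), (id_b, b) in combinations(sorted(pages.items()), 2):
        share = min(match_share(a, b), match_share(b, a))
        if share >= min_share:
            pairs.append((id_a, id_b))
    return pairs


# Toy data: pages 1 and 2 have nearly identical key points, page 3 does not.
pages = {
    "00000001": [[0, 0], [10, 0], [0, 10]],
    "00000002": [[0.1, 0], [10, 0.1], [0, 10.1]],
    "00000003": [[100, 100], [100, 101], [101, 100]],
}
candidates = duplicate_candidates(pages)
```

The pairs returned are only candidates; in line with the quality control approach described earlier, they would still be checked manually.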
  190. 190. @maxkaiser use case 2: book comparison based on image similarity → different instances of one book, coming e.g. from different downloads of one book at different points in time → book similarity measure → based on comparison of book page images from two different book instances
  191. 191. use case 2: book comparison based on image similarity: measure for book similarity based on book page image similarity → helps to find prominent changes in book re-downloads
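A book-level measure built from page-level similarities could look like the following sketch. It assumes a page similarity function that returns a value between 0 and 1 and that the pages of both instances are aligned by position; taking the mean, the 0.8 threshold and all names are illustrative assumptions rather than the measure used in the project.

```python
def book_similarity(pages_a, pages_b, page_similarity):
    """Mean page similarity of two instances of one book, aligned by position.

    Pages that exist in only one instance count as similarity 0.0.
    Returns the mean and the list of per-page scores.
    """
    longest = max(len(pages_a), len(pages_b))
    if longest == 0:
        return 1.0, []
    scores = [page_similarity(a, b) for a, b in zip(pages_a, pages_b)]
    scores += [0.0] * (longest - len(scores))
    return sum(scores) / longest, scores


def changed_pages(scores, threshold=0.8):
    """1-based positions of pages whose similarity falls below the threshold."""
    return [i for i, s in enumerate(scores, start=1) if s < threshold]


def toy_similarity(a, b):
    """Stand-in for a real image comparison: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


first_download = ["p1", "p2", "p3", "p4"]
second_download = ["p1", "pX", "p3", "p4"]
similarity, scores = book_similarity(first_download, second_download, toy_similarity)
changed = changed_pages(scores)
```

Reporting the per-page scores alongside the mean makes it possible to point at the pages that changed between two downloads, not only at the fact that something changed.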
  192. 192. @maxkaiser large scale document processing →extract image metadata using Exiftool →large scale batch processing using Apache Hadoop Streaming API →bash script using Exiftool is executed on the cluster →book page image data is accessible from each node of the cluster →parallelisation of batch processing
  193. 193. [diagram] Jp2PathCreator: find on NAS → list of image paths (/NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, /NAS/Z119585409/00000003.jp2 …) → HadoopStreamingExiftoolRead: reading files from NAS → one value per page (Z119585409/00000001 2345, Z119585409/00000002 2340, Z119585409/00000003 2543 …); data sizes: 1,4 GB, 1,2 GB; ~ 5 h + ~ 38 h = ~ 43 h; 60.000 books, 24 mio pages
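The slides describe a bash script run through the Hadoop Streaming API; the same mapper logic is sketched here in Python for illustration. Hadoop Streaming hands input lines to the mapper on standard input and expects tab-separated key/value lines on standard output. Which Exiftool tag was read is not stated on the slides, so `ImageWidth` is an assumption, as are all function names.

```python
import subprocess


def exiftool_value(path, tag="ImageWidth"):
    """Ask the exiftool CLI for one tag; -s3 prints the bare value only.

    The tag name is an assumption; the slides do not say which one was read.
    """
    result = subprocess.run(
        ["exiftool", "-s3", "-" + tag, path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def page_key(path):
    """'/NAS/Z119585409/00000001.jp2' -> 'Z119585409/00000001'."""
    parts = path.strip().split("/")
    book_id, file_name = parts[-2], parts[-1]
    return book_id + "/" + file_name.rsplit(".", 1)[0]


def run_mapper(lines, out, read_value=exiftool_value):
    """Streaming mapper: one image path per input line, 'key<TAB>value' out."""
    for line in lines:
        path = line.strip()
        if path:
            out.write(page_key(path) + "\t" + read_value(path) + "\n")
```

Used as an actual streaming mapper, the script would end with a call such as `run_mapper(sys.stdin, sys.stdout)`. The `read_value` parameter only exists so that the logic can be tried without Exiftool installed.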
  194. 194. @maxkaiser large scale document processing → principle: store once in HDFS and read many times → small files (TXT, HTML) stored in HDFS → files of each file type stored as one big file (SequenceFile) → example: storing OCR results of 24 mio pages (ca. 60.000 books) → reading data from file server and storing on cluster takes more than 1 day → subsequent processing of a Map/Reduce job (calculate average block width) takes 6 hours
  195. 195. [diagram] HtmlPathCreator: find on NAS → list of HTML paths (/NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, /NAS/Z119585409/00000709.html …) → SequenceFileCreator: reading files from NAS → SequenceFile with one entry per page (Z119585409/00000707, Z119585409/00000708, Z119585409/00000709 …); data sizes: 1,4 GB, 997 GB (uncompressed); ~ 5 h + ~ 24 h = ~ 29 h; 60.000 books, 24 mio pages
  196. 196. [diagram] example map/reduce job: calculate average block width (HadoopAvBlockWidthMapReduce): SequenceFile (Z119585409/00000001, Z119585409/00000002, Z119585409/00000003 …) → Map: several width values per page (Z119585409/00000001 2100, Z119585409/00000001 2200, Z119585409/00000001 2300, Z119585409/00000001 2400 …) → Reduce: one average per page (Z119585409/00000001 2250, Z119585409/00000002 2250 …) → Textfile; ~ 6 h
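The average block width job follows the usual map/reduce pattern: the map step emits one (page, width) pair per value, the reduce step averages all widths that share a page key. The sketch below runs in plain Python rather than on Hadoop and assumes the block widths have already been parsed out of the OCR HTML; the function names are hypothetical.

```python
from collections import defaultdict


def map_block_widths(page_id, block_widths):
    """Map step: emit one (page id, width) pair per text block of a page."""
    for width in block_widths:
        yield page_id, width


def reduce_average(pairs):
    """Reduce step: average all widths emitted for the same page id."""
    grouped = defaultdict(list)
    for page_id, width in pairs:
        grouped[page_id].append(width)
    return {page_id: sum(w) / len(w) for page_id, w in grouped.items()}


# Values taken from the slide: four widths per page, average 2250.
ocr_pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}
emitted = [
    pair
    for page_id, widths in ocr_pages.items()
    for pair in map_block_widths(page_id, widths)
]
averages = reduce_average(emitted)
```

On a cluster the grouping by key is done by the framework between the two steps; here `reduce_average` does it itself to keep the example self-contained.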
  197. 197. Sqoop combine MySQL and Apache Hive DB book level metadata page level metadata
  198. 198. @maxkaiser storage and access…
  199. 199. @maxkaiser data → average size data package (~book): 101 MB → colour data package: 187 MB → grayscale data package: 82 MB → 101 MB * 600.000 ≈ 60 TB
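The storage estimate can be reproduced with a few lines. This is only a back-of-the-envelope sketch that assumes decimal units (1 TB = 1.000.000 MB), which the slide does not state; the function name is hypothetical.

```python
MB_PER_TB = 1_000_000  # decimal units; an assumption, the slide does not say


def total_storage_tb(avg_package_mb, volumes):
    """Total storage in TB for a number of data packages of average size."""
    return avg_package_mb * volumes / MB_PER_TB


# Figures from the slide: 101 MB per data package, 600.000 volumes.
estimate = total_storage_tb(101, 600_000)
```

Under that assumption the estimate comes to about 60,6 TB, in line with the roughly 60 TB quoted on the slide.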
  200. 200. @maxkaiser storage & access → data storage: in-house → JPEG-2000 master files stored redundantly → access copies generated on-the-fly → URN resolver for permanent identification
  201. 201. @maxkaiser book viewer; catalogue / “Quick Search”; full-text search; [mobile apps]
  202. 202. @maxkaiser@maxkaiser Image Server Catalogue URN Resolver Master Images Digital Repository Quick Search Fulltext Index Server Book Viewer ADOCO Google USER
  203. 203. @maxkaiser outlook → full-text: new possibilities for research → data enrichment → named entity recognition → linked data → new data centric research in the Humanities & Social Sciences → http://www.diggingintodata.org/
  204. 204. @maxkaiser@maxkaiser
  205. 205. @maxkaiser DM2E →http://dm2e.eu/ →European Commission co-funded project →stimulate creation of new tools and services for re-use of Europeana data in the Digital Humanities →implementation of semantic annotation tool →Austrian Books Online data part of the project
  206. 206. @maxkaiser next steps →80.000 books already accessible via Google Books →Spring 2013: launch of Austrian Books Online Viewer →full text search
  207. 207. @maxkaiser http://books.google.at/books?vid=ONB%2BZ15367990X
  208. 208. @maxkaiser http://books.google.at/books?vid=ONB%2BZ155606704
  209. 209. @maxkaiser http://books.google.at/books?vid=ONB%2BZ174115105
  210. 210. @maxkaiser http://books.google.at/books?vid=ONB%2BZ158211101
  211. 211. @maxkaiser http://books.google.at/books?vid=ONB%2BZ169472305
  212. 212. @maxkaiser http://books.google.at/books?vid=ONB%2BZ164893308
  213. 213. @maxkaiser more information www.onb.ac.at/ev/austrianbooksonline www.onb.ac.at/ev/austrianbooksonline/faq.htm twitter.com/abooksonline
  214. 214. @maxkaiser thank you! max.kaiser@onb.ac.at www.onb.ac.at www.slideshare.net/maxkaiser www.linkedin.com/in/maxkaiser gplus.to/maxkaiser twitter.com/maxkaiser