@maxkaiser
Text Mining in Cultural
Heritage: Challenges
Max Kaiser
Head of Research & Development
Austrian National Library
max.kaiser@onb.ac.at
Text & Data Mining in Europe
The Hague, Nov. 11, 2015
https://www.flickr.com/photos/shanegorski/2694765397/
No, this is
not the
Austrian National Library …
www.slideshare.net/maxkaiser
Context at Austrian
National Library
Austrian Books Online
digitisation
of the entire historical
book holdings of the
Austrian National Library
Austrian Books
Online
www.onb.ac.at/austrianbooksonline/
cooperation with Google
largest
public private partnership
in Austria’s cultural sector
600,000 volumes
200 million pages
~337,000 volumes digitised
October 2015
Austrian Newspapers Online
→Started 2003
→More than 14 million pages digitised
→More than 10 million pages OCRed within
several projects
→Structured by newspaper & publication
date
Europeana Newspapers
→18 partners
→12 million pages
→1.6 million newspaper pages OCRed at ONB
critical mass
of digitally available texts
and (meta) data
new research questions
to textual material?
A short digression …
Full text data is already useful today
(without text mining)
Full text data will be even more useful in the
future (with text mining)
[Chart: Austrian Books Online, visits (Besuche) and page views (Seitenansichten)]
search
find
read
[Diagram labels: metadata, digitised collections, data, tools, servers, data processing, tools]
[Diagram labels: metadata, digitised collections, data, servers, data processing, tools]
close reading: interpretation / analysis / edition of individual texts
distant reading: analysis of Big Data, text mining
Hadoop-Cluster
→Comprehensive analysis with tools on the
entire ONB corpus
→Enquiries over the whole range of
materials in Austrian Books Online
→Scalable processing, e.g. for text mining
→Processing of analyses on site
→User-driven analyses
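A minimal sketch of the map/reduce pattern behind this kind of scalable processing, assuming each page is available as a plain-text string. It runs in a single process with the Python standard library and is not the ONB's actual Hadoop job; the function names and the two sample pages are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_page(text):
    """Map step: count lower-cased, whitespace-separated tokens on one page."""
    return Counter(text.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(pages):
    """Apply the map step to every page, then fold the partial counts together."""
    return reduce(reduce_counts, map(map_page, pages), Counter())


# Hypothetical sample pages, for illustration only.
pages = ["Wien ist eine Stadt", "eine Stadt an der Donau"]
totals = word_count(pages)
print(totals.most_common(2))
```

On a cluster, the map step would run in parallel next to the stored pages and only the partial counts would be moved, which is what makes on-site processing of a large corpus practical.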
Austrian Books Online: status Oct. 2015
325,000+ books
38 TB data:
105 million pages
33 billion running words
5 billion lines of text
186 identified languages
0.5 TB text index size in 3 Solr shards
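For a sense of scale, the status figures above can be turned into rough per-unit averages. This is plain arithmetic on the rounded numbers from the slide, so the results are approximate; the book count is a lower bound, and decimal terabytes are an assumption.

```python
# Rounded figures from the status slide (Oct. 2015).
books = 325_000            # "325,000+ books", a lower bound
pages = 105_000_000
words = 33_000_000_000
lines = 5_000_000_000
data_bytes = 38 * 10**12   # assumption: 1 TB = 10**12 bytes

pages_per_book = pages / books
words_per_page = words / pages
words_per_line = words / lines
megabytes_per_page = data_bytes / pages / 10**6

print(f"{pages_per_book:.0f} pages per book")
print(f"{words_per_page:.0f} words per page")
print(f"{words_per_line:.1f} words per line")
print(f"{megabytes_per_page:.2f} MB per page")
```

That works out to roughly 323 pages per book, 314 words per page and 6.6 words per line.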
heterogeneity
Centuries (Austrian Books Online)
16th century: 10%
17th century: 13%
18th century: 31%
19th century: 44%
no year: 2%
Languages (Austrian Books Online)
eng: 3%
fre: 16%
ger: 33%
ita: 12%
lat: 26%
slav: 2%
others: 8%
[Chart: languages (eng, fre, ger, ita, lat) by century (16th to 19th), in percent, Austrian Books Online]
→technology/maths are there and proven
→ algorithms & tools for NER
→ algorithms & tools for topic modelling
→ Difficult to integrate in stable (legacy) systems
→ Base technologies still evolving
→ Spread of languages / diachronic evolution of
languages makes analysis pipelines complex
→ (Noisy) OCR results as source for analyses
→ Data not stable but constantly being updated:
→ Images improve
→ OCR improves
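One common way to put a number on "noisy" OCR is the character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below is a plain-Python illustration under that assumption, not a description of how the ONB measures OCR quality; the sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr_text, reference) / len(reference)


# Hypothetical example: the OCR engine misreads "h" as "b".
reference = "Nationalbibliothek"
ocr_output = "Nationalbibliotbek"
cer = character_error_rate(ocr_output, reference)
print(f"CER: {cer:.3f}")
```

Because images and OCR keep improving, a figure like this would have to be recomputed whenever the underlying data is updated.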
OCR as basis for text mining
→The simple case
→The challenging case
→ Images are OCRed
→ Processes subject to
continuous
improvement
Text data
→ Chunks of image
and text-blocks are
recognized
→ Content is split into
contiguous blocks of text
→ The yellow boxes are
recognized as graphic
content
Text data
→ If possible,
paragraph
structures are used
for chunking.
Text data
→ Lines and words are
identified
Text data
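The hierarchy described on these slides (page, blocks, lines, words) can be sketched as a small data structure. The class and field names below are hypothetical and do not reflect the ONB's storage format or any particular OCR output schema; the sample page is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    words: List[str]


@dataclass
class Block:
    kind: str  # "text" or "graphic" (assumed labels)
    lines: List[Line] = field(default_factory=list)


@dataclass
class Page:
    blocks: List[Block] = field(default_factory=list)

    def text_blocks(self):
        """Return only the blocks recognised as text, skipping graphic content."""
        return [block for block in self.blocks if block.kind == "text"]

    def plain_text(self):
        """Join words into lines and text blocks into paragraphs."""
        paragraphs = [
            "\n".join(" ".join(line.words) for line in block.lines)
            for block in self.text_blocks()
        ]
        return "\n\n".join(paragraphs)


# Hypothetical page with two text blocks and one graphic block.
page = Page(blocks=[
    Block("text", [Line(["Austrian", "Books"]), Line(["Online"])]),
    Block("graphic"),
    Block("text", [Line(["Second", "block"])]),
])
print(page.plain_text())
```

Keeping the block structure, rather than one flat string per page, lets a text mining pipeline chunk by paragraph where the layout analysis found one.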
→IP or otherwise restricted content
has to be processed on site
→(e.g. local Hadoop installation)
→Interfaces (APIs etc.) not yet available,
but will be in the near future
→ Text mining of OCRed historical texts is still
a research topic
→ R&D in this area is well understood, but it
still requires a commitment to using the results
in production
→ Requires interfacing between:
→R&D
→IT services
→Library systems
→ And: what do the users want?
not necessarily
those at the
Austrian National Library …
→ “Integration with current infrastructure is
complex”
→ “Legal implications are risky”
→ “There is still too much R&D involved to be
integrated in production environment”
→ “There are no clear use cases”
→ “A shift from (human produced) meta-data to
machine-understanding primary data is too
fundamental”
→ “Text mining results are opaque (and statistics
are hard to grasp)”
→Topic modelling and NER should
cross institutional boundaries
→Text mining techniques cover a wide
spectrum of possible applications
→User-driven approach applied to material
from multiple institutions
Q: Who are the main drivers (Europeana?,
institutional consortia?, the users?,
researchers?)
→ Lack of integration in library discovery systems
→ Limitations in available vendor systems
→ Difficulty of explaining possible instabilities in results
→ results are expected to improve as corpora evolve
→ methodology, corpus and implementation are moving
targets
→ New and different resources required
→ IT infrastructures, technical and research staff, text
mining competency
→ Progress is fast!
→ The expectation of even more potential in the future
delays investment in this future, now.
→Content discovery
→Recommender systems (for librarians, for
patrons)
→Focused user-driven aggregation of
content
→Cross-referencing of named entities (NER results) in materials
→Crowd-sourcing and user-driven
development
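As an illustration of the cross-referencing item above, the sketch below builds an inverted index from entity names to the documents that mention them. It assumes the entities have already been extracted by some NER step; the document identifiers, entity lists and function names are hypothetical.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each entity name to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def cross_references(index, min_docs=2):
    """Keep only entities that occur in at least min_docs documents."""
    return {
        entity: sorted(doc_ids)
        for entity, doc_ids in index.items()
        if len(doc_ids) >= min_docs
    }


# Hypothetical NER output per document.
doc_entities = {
    "book-1": ["Wien", "Maria Theresia"],
    "book-2": ["Wien"],
    "newspaper-7": ["Maria Theresia", "Prag"],
}
links = cross_references(build_entity_index(doc_entities))
print(links)
```

The same index could feed content discovery or a recommender: documents that share an entity become candidates for "related material".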
Let’s talk!
Thank you!
max.kaiser@onb.ac.at
www.onb.ac.at
www.linkedin.com/in/maxkaiser
twitter.com/maxkaiser
