1. NEDERLAB & friends
Today’s Nederlab spokesman: Martin Reynaert
Tilburg center for Cognition and Communication - Tilburg University
Centre for Language and Speech Technology - Radboud Universiteit Nijmegen
Symposium: Digitale historische kranten als ‘big data’.
Koninklijke Bibliotheek, The Hague. March 24, 2015
3. Nederlab: aims
The Nederlab project aims to bring together all digitized
texts relevant to the Dutch national heritage (c. A.D. 800 –
present) consisting of terabytes of data in one user-friendly
and tool-enriched web interface, allowing scholars to
simultaneously search and analyze textual data in a virtual
research environment.
The focus in Nederlab is currently on incorporating the vast
digital text collections of the Koninklijke Bibliotheek
(http://www.kb.nl/en) (KB or Dutch National Library)
as well as the contents of the Digitale Bibliotheek voor de
Nederlandse Letteren (http://www.dbnl.org/) (DBNL -
The Digital Library of Dutch Literature).
KB text collections comprise newspapers from 1618 to 1995
and the Early Dutch Books Online or EDBO
(http://www.delpher.nl/).
6. Nederlab: added value
Nederlab adds extra value to the corpora it incorporates:
Texts are uniformly reformatted in FoLiA XML: Format for
Linguistic Annotation
Provides OCR post-correction by means of Text-Induced
Corpus Clean-up or TICCL
Provides extra linguistic annotations: lemmatisation,
POS-tagging and Named Entity labeling
Enhances search and retrieval, provides better research
opportunities
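A minimal sketch of why the extra annotation layers help retrieval. The `Token` record, the tag names and the sample sentence are illustrative assumptions; this is not the FoLiA schema or the Nederlab tagset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One word with the three annotation layers named on the slide."""
    text: str                     # surface form as it appears in the text
    lemma: str                    # dictionary form (lemmatisation)
    pos: str                      # part-of-speech tag (placeholder tag names)
    entity: Optional[str] = None  # named-entity label, if any


def find_by_lemma(tokens: List[Token], lemma: str) -> List[Token]:
    """Return every token whose lemma matches, whatever its surface form."""
    return [t for t in tokens if t.lemma == lemma]


def find_entities(tokens: List[Token], label: str) -> List[Token]:
    """Return every token carrying the given named-entity label."""
    return [t for t in tokens if t.entity == label]


# Hypothetical sample: "De krant en de kranten van Tilburg"
SAMPLE = [
    Token("De", "de", "DET"),
    Token("krant", "krant", "NOUN"),
    Token("en", "en", "CONJ"),
    Token("de", "de", "DET"),
    Token("kranten", "krant", "NOUN"),
    Token("van", "van", "ADP"),
    Token("Tilburg", "Tilburg", "PROPN", entity="LOC"),
]

if __name__ == "__main__":
    for token in find_by_lemma(SAMPLE, "krant"):
        print(token.text)
```

Searching on the lemma retrieves both the singular and the plural form in one query, which a plain string search on the surface text would not.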
7. TICCL correction for ‘Bataafsche’
Most EDBO books are printed in Fraktur with the long ‘s’ (ſ), which OCR tends to misread as ‘f’
Most EDBO books are from the period of the
‘Bataafsche Republiek’ (late 18th century)
EDBO has 10,333 Dutch books,
in all about 235 million words of text
Top 7 TICCL corrections with corpus frequencies:
Bataaffche    53,538
Bataaffchen   15,735
Bataaffehe     1,749
Bataafseh        796
Bataafiche       683
Bataaflche       443
Bataavfche       433
TICCL identified 1,445 variants for ‘Bataafsche’, of which 872
were hapaxes (single corpus occurrences)
In all, TICCL corrected 81,302 EDBO tokens into ‘Bataafsche’
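The figures above can be explored with a short sketch. Levenshtein (edit) distance is used here only as a simple, illustrative measure of how far each OCR variant lies from the target; the slides do not describe how TICCL itself matches variants, so this is not a reimplementation of TICCL. The frequencies are copied from the slide; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

TARGET = "Bataafsche"
TOTAL_CORRECTED = 81_302  # all EDBO tokens corrected into the target

# Top-7 variants and their corpus frequencies, as listed on the slide.
TOP_VARIANTS: Dict[str, int] = {
    "Bataaffche": 53_538,
    "Bataaffchen": 15_735,
    "Bataaffehe": 1_749,
    "Bataafseh": 796,
    "Bataafiche": 683,
    "Bataaflche": 443,
    "Bataavfche": 433,
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def variant_report() -> List[Tuple[str, int, int]]:
    """Return (variant, frequency, edit distance to TARGET) per variant."""
    return [(v, f, levenshtein(v, TARGET)) for v, f in TOP_VARIANTS.items()]


def top_share() -> float:
    """Fraction of all corrected tokens covered by the top-7 variants."""
    return sum(TOP_VARIANTS.values()) / TOTAL_CORRECTED


if __name__ == "__main__":
    for variant, freq, dist in variant_report():
        print(f"{variant:<12} {freq:>7,} distance={dist}")
    print(f"top-7 share: {top_share():.1%}")
```

The seven listed variants account for 73,377 of the 81,302 corrected tokens, roughly 90%, and each lies within two character edits of the target. The remaining tokens are spread over the long tail of the 1,445 variants.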
10. Crowd Sourcing the text quality problem
Nederlab coordinator Nicoline van der Sijs runs a crowd-sourcing
endeavour on the side
Volunteers are welcome, please check:
http://www.meertens.knaw.nl/kranten_editor/
People retype the 17th-century KB newspapers
Some statistics:
11. @PhilosTEI in the CLARIN-NL Infrastructure
Project leader: philosopher Arianna Betti (UvA) -
http://www.axiom.humanities.uva.nl
There is a system online that allows you to start building your very
own corpus
It has an OCR engine (Tesseract)
It is multilingual
And it has Text-Induced Corpus Clean-up or TICCL
Throw in images and get post-corrected FoLiA or TEI XML!
It is free and Open-Source
It is to be further developed in CLARIAH into ‘PICCL’:
Philosophical (or: Practical) Integrator of Computational and
Corpus Libraries, i.e. a complete corpus building work flow
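A rough sketch of the work flow described above: images go in, OCR text comes out and is then post-corrected. It assumes the `tesseract` command-line tool is installed and uses its Dutch language code `nld`. The `post_correct` step is a hypothetical stand-in for TICCL, not TICCL's actual interface, and the FoLiA/TEI serialisation step is not modelled.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def tesseract_command(image: Path, output_base: Path, lang: str = "nld") -> List[str]:
    """Build the Tesseract CLI call: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_with_tesseract(image: Path) -> str:
    """Run Tesseract on one image and return the recognised text.

    Tesseract writes its plain-text result to <outputbase>.txt.
    """
    output_base = image.with_suffix("")
    subprocess.run(tesseract_command(image, output_base), check=True)
    return output_base.with_suffix(".txt").read_text(encoding="utf-8")


def build_corpus(images: Iterable[Path],
                 ocr: Callable[[Path], str],
                 post_correct: Callable[[str], str]) -> List[str]:
    """OCR every image, then post-correct each resulting text."""
    return [post_correct(ocr(image)) for image in images]


def toy_post_correct(text: str) -> str:
    """Hypothetical placeholder for TICCL: fixes one known OCR variant only."""
    return text.replace("Bataaffche", "Bataafsche")
```

Passing the OCR and post-correction steps in as functions keeps the sketch testable without Tesseract installed, for example `build_corpus(pages, ocr_with_tesseract, toy_post_correct)` on a real set of page images.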
12. @PhilosTEI in the CLARIN-NL Infrastructure
System:
http://philostei.clarin.inl.nl
Poster CLARIN-NL:
http://ticclops.uvt.nl/CLARINFinalEvent.PhilosTEI.pdf
Poster CLARIAH:
http://ticclops.uvt.nl/CLIN2015-poster.pdf
13. Next phase in Nederlab
Incorporate corpus exploration and exploitation tools built in
companion projects
OpenSoNaR has nice features
We will adopt them!
14. OpenSoNaR in the CLARIN-NL Infrastructure
System:
http://opensonar.clarin.inl.nl
Poster:
http://ticclops.uvt.nl/CLARINFinalEvent.OpenSoNaR.pdf
15. ENJOY!!
Thank you for your attention!
http://www.nederlab.nl/onderzoeksportaal/