NEDERLAB & friends
Today’s Nederlab spokesman: Martin Reynaert
Tilburg center for Cognition and Communication - Tilburg University
Centre for Language and Speech Technology - Radboud Universiteit Nijmegen
Symposium: Digitale historische kranten als ‘big data’.
Koninklijke Bibliotheek, The Hague. March 24, 2015
1
Nederlab: Consortium
Partners collaborating in Nederlab:
2
Nederlab: aims
The Nederlab project aims to bring together all digitized
texts relevant to the Dutch national heritage (c. A.D. 800 to
the present), amounting to terabytes of data, in one
user-friendly, tool-enriched web interface, so that scholars
can search and analyze textual data simultaneously in a
virtual research environment.
The focus in Nederlab is currently on incorporating the vast
digital text collections of the Koninklijke Bibliotheek
(http://www.kb.nl/en) (KB or Dutch National Library)
as well as the contents of the Digitale Bibliotheek voor de
Nederlandse Letteren (http://www.dbnl.org/) (DBNL -
The Digital Library of Dutch Literature).
KB text collections comprise newspapers from 1618 to 1995
and the Early Dutch Books Online or EDBO
(http://www.delpher.nl/).
3
Nederlab Portal: Home
http://www.nederlab.nl/onderzoeksportaal/
4
Nederlab Portal: Simple Query
5
Nederlab: added value
Nederlab adds extra value to the corpora it incorporates
Texts are uniformly reformatted in FoLiA XML (Format for
Linguistic Annotation)
Provides OCR post-correction by means of Text-Induced
Corpus Clean-up or TICCL
Provides extra linguistic annotations: lemmatisation,
POS-tagging and Named Entity labeling
Enhances search and retrieval, provides better research
opportunities
6
TICCL correction for ‘Bataafsche’
Most EDBO books are printed in Fraktur, with the long ‘s’ (ſ), which OCR tends to read as ‘f’
Most EDBO books are from the period of the
‘Bataafsche Republiek’ (late 18th century)
EDBO has 10,333 Dutch books,
in all about 235 million words of text
Top 7 TICCL corrections with corpus frequencies:
Bataaffche   53,538
Bataaffchen  15,735
Bataaffehe    1,749
Bataafseh       796
Bataafiche      683
Bataaflche      443
Bataavfche      433
TICCL identified 1,445 variants for ‘Bataafsche’, of which 872
were hapaxes (single corpus occurrences)
In all, TICCL corrected 81,302 EDBO tokens into ‘Bataafsche’
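The figures on this slide can be checked with a short script. The sketch below is only an illustration: it uses a plain Levenshtein edit distance, which is an assumption for the sake of the example and not a description of TICCL's own matching procedure. The names `TOP_VARIANTS`, `TOTAL_CORRECTED`, `levenshtein` and `summarize` are hypothetical; the frequencies are copied from the slide.

```python
# Frequencies copied from the slide: top 7 OCR variants corrected to 'Bataafsche'.
TOP_VARIANTS = {
    "Bataaffche": 53538,
    "Bataaffchen": 15735,
    "Bataaffehe": 1749,
    "Bataafseh": 796,
    "Bataafiche": 683,
    "Bataaflche": 443,
    "Bataavfche": 433,
}
TOTAL_CORRECTED = 81302  # all EDBO tokens corrected into 'Bataafsche' (from the slide)
TARGET = "Bataafsche"


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def summarize():
    """Return per-variant rows, the top-7 token total, and its share of all corrections."""
    rows = [(variant, freq, levenshtein(variant, TARGET))
            for variant, freq in TOP_VARIANTS.items()]
    top_total = sum(freq for _, freq, _ in rows)
    return rows, top_total, top_total / TOTAL_CORRECTED


if __name__ == "__main__":
    rows, top_total, share = summarize()
    for variant, freq, distance in rows:
        print(f"{variant:<12} {freq:>7,}  edit distance {distance}")
    print(f"Top 7 total: {top_total:,} of {TOTAL_CORRECTED:,} ({share:.1%})")
```

Under these assumptions, every one of the top 7 variants lies within two character edits of ‘Bataafsche’, and together they account for roughly nine tenths of the corrected tokens; the remaining corrections are spread over the other variants TICCL identified.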
7
Nederlab Portal: document hits before TICCL correction
8
Nederlab Portal: document hits after TICCL correction
9
Crowdsourcing the text quality problem
Nederlab coordinator Nicoline van der Sijs runs a crowd-sourcing
endeavour on the side
Volunteers are welcome, please check:
http://www.meertens.knaw.nl/kranten_editor/
Volunteers retype the 17th-century KB newspapers
Some statistics:
10
@PhilosTEI in the CLARIN-NL Infrastructure
Project leader: philosopher Arianna Betti (UvA) -
http://www.axiom.humanities.uva.nl
There is a system online that allows you to start building your very
own corpus
It has an OCR engine (Tesseract)
It is multilingual
And it has Text-Induced Corpus Clean-up or TICCL
Throw in images and get post-corrected FoLiA or TEI XML!
It is free and open source
It is to be further developed in CLARIAH into ‘PICCL’:
Philosophical (or: Practical) Integrator of Computational and
Corpus Libraries, i.e. a complete corpus-building workflow
11
@PhilosTEI in the CLARIN-NL Infrastructure
System:
http://philostei.clarin.inl.nl
Poster CLARIN-NL:
http://ticclops.uvt.nl/CLARINFinalEvent.PhilosTEI.pdf
Poster CLARIAH:
http://ticclops.uvt.nl/CLIN2015-poster.pdf
12
Next phase in Nederlab
Incorporate corpus exploration and exploitation tools built in
companion projects
OpenSoNaR has nice features
We will adopt them!
13
OpenSoNaR in the CLARIN-NL Infrastructure
System:
http://opensonar.clarin.inl.nl
Poster:
http://ticclops.uvt.nl/CLARINFinalEvent.OpenSoNaR.pdf
14
ENJOY!!
Thank you for your attention!
http://www.nederlab.nl/onderzoeksportaal/
15