IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




          Computer Lexica in OCR and Retrieval

                 Katrien Depuydt, Jesse de Does
                 (Instituut voor Nederlandse Lexicologie, Leiden)




Can we handle ‘de wereld’ (‘the world’)?




                                                            werreid


4 March 2009 presentation The Hague




OCR:
ABBYY FineReader SDK with built-in standard Dutch dictionary

OCR:
ABBYY FineReader SDK combining the built-in modern Dutch dictionary with
the IMPACT external historical lexicon of Dutch:



                                            werreld




IMPACT <Demo Day BL, 12 July 2011>




                            werelt weerelt wereld weerelds wereldt
                            werelden weereld werrelts waerelds
                            weerlyt wereldts vveerelts waereld
                            weerelden waerelden weerlt werlt
                            werelds sweerels zwerlys swarels
                            swerelts werelts swerrels weirelts
                            tsweerelds werret vverelt werlts
                            werrelt worreld werlden wareld
                            weirelt weireld waerelt werreld werld
                            vvereld weerelts werlde tswerels
                            werreldts weereldt wereldje waereldje
                            weurlt wald weëled

RETRIEVAL: key in modern WERELD and find all of these variants
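As a minimal sketch of this retrieval idea, the snippet below expands a modern lemma into its known variants before searching. The dictionary layout, the function names and the sample documents are assumptions made for illustration; only the variant spellings come from the slide.

```python
# Hypothetical miniature IR lexicon: modern lemma -> attested historical forms.
# The forms are a few of those listed on the slide; the structure is an assumption.
LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werreld", "vvereld", "werlt"},
}


def expand_query(modern_lemma, lexicon=LEXICON):
    """Return all word forms to search for when the user keys in a modern lemma."""
    key = modern_lemma.lower()
    forms = set(lexicon.get(key, ()))
    forms.add(key)
    return forms


def search(documents, modern_lemma):
    """Return the indices of documents containing any known variant of the lemma."""
    forms = expand_query(modern_lemma)
    return [i for i, text in enumerate(documents)
            if forms & set(text.lower().split())]


docs = ["de gantsche waereld", "een boek", "in dese werelt"]
hits = search(docs, "WERELD")
print(hits)
```

The user only needs to know the modern spelling; the lexicon supplies the historical forms.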




IMPACT workshop, Bratislava, May 7, 2010

            The long s problem: An example ….




OCR at start of project                                                              Results April 2010
A. De eerde was de gevaarlykflti om de verlei¬                                       A. De eerste was de gevaarlykste om de verlei-
ding aan 't Hof; de tweede de ftillie en veiligde;                                   ding aan 't Hof; de tweede de stilste en veiligste;
de derde de zwaarde, daar hy byna drie millioenen                                    de derde de zwaarste, daar hy byna drie millioenen
harde en onbefchaafde Menfchen beftieren moest.                                      harde en onbeschaafde Menschen bestieren moest.

   Workaround: “integrated postcorrection”: tell the engine that “eerfte” is OK and
   postcorrect it afterwards with the lexicon.
       In this way we keep it from turning into “eerde” (earth) instead of “eerste” (first).
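The workaround can be sketched roughly as follows, assuming the long s is recognised as a lowercase f. The toy lexicon and the function names are hypothetical and are not the interface of the OCR engine.

```python
from itertools import product

# Toy lexicon of valid word forms (lowercase); an assumption for illustration.
LEXICON = {"eerste", "eerde", "gevaarlykste", "stilste", "menschen"}


def long_s_candidates(token):
    """All strings obtained by reading any subset of the lowercase f's as s."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    return ("".join(chars) for chars in product(*options))


def accepted_by_engine(token, lexicon=LEXICON):
    """Step 1: a form like 'eerfte' counts as OK if some f->s reading is in the lexicon."""
    return any(c.lower() in lexicon for c in long_s_candidates(token))


def postcorrect(token, lexicon=LEXICON):
    """Step 2: replace the f-form by the lexicon form; leave unknown tokens unchanged."""
    for candidate in long_s_candidates(token):
        if candidate.lower() in lexicon:
            return candidate
    return token


print(postcorrect("eerfte"), postcorrect("Menfchen"))
```

Because “eerfte” is accepted in the first step, the engine has no reason to prefer the unrelated lexicon word “eerde”.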





Overview

      What is a computer lexicon
      Lexica in IMPACT
      Tools for lexicon building and applying lexica
      Some results
      Searching Demonstration








                     What is a computer lexicon?








Computer lexicon vs electronic dictionary (1)

An electronic dictionary is:
  Digitised full text (no pictures)
  For human use
  Ideally: searchable, with explicitly coded material (XML) such as
lemma, part of speech (PoS), meaning, quotes etc.
  Examples: OED online, WNT online








        Dictionary XML (example)








Computer Lexicon vs Electronic Dictionary (2)
 A computer lexicon is:
   Always in a structured digital format (XML, relational database)
   Main purpose: computer application
   Explicitly coded information (e.g. lemma wereld, part of speech
 noun, morphology werelden, werelds … , syntax)

 Examples of use:
 Linguistic enrichment of text material
   ‘Advanced’ searching (words with all spelling variants and inflections)
   Automatic summarization, keyword extraction…




                                  Lexica in IMPACT








The OCR lexicon
 An OCR lexicon is
    A checked list of words in a language
    Based on a corpus (collection) of dated texts (selection!)
    Preferably with frequency information
    Preferably from the same time period or of the same text type as
  the texts you wish to digitize
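A minimal sketch of how such a frequency list could be derived from a dated corpus. The corpus format (pairs of year and text), the function name and the sample sentences are assumptions; the manual checking step is not modelled.

```python
import re
from collections import Counter


def build_ocr_lexicon(dated_texts, start_year, end_year, min_count=1):
    """Count word forms in texts dated within [start_year, end_year]."""
    counts = Counter()
    for year, text in dated_texts:
        if start_year <= year <= end_year:
            # Runs of letters only; digits and punctuation are skipped.
            counts.update(re.findall(r"[^\W\d_]+", text))
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Invented sample corpus: (year, text).
corpus = [
    (1610, "the fyrst song and the fyrst law"),
    (1650, "manye thingis shalbe"),
    (1950, "television and radar"),
]
early = build_ocr_lexicon(corpus, 1550, 1750)
print(early[:3])
```

Restricting the year range is what keeps modern words such as “television” out of a lexicon meant for early modern texts.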








      OCR lexicon: example
        1550-1750                                                  > 1900
        song              820                                      television        418
        rihte             818                                      electronic        375
        theire            818                                      video             194
        manye             818                                      hormone           176
        sume              815                                      jazz              162
        Do                814                                      eco               142
        Whiche            811                                      software          136
        fyrst             811                                      vitamin           128
        while             811                                      movie             121
        Water             810                                      taxi              113
        wt                809                                      isotopic          108
        shalbe            808                                      electronics       95
        thingis           807                                      radar             86
        again             806                                      basically         71
        sona              806                                      sabotage          71
        wa                805                                      homozygote        70
        mode              804                                      psychedelic       67
        work              802                                      phonemic          66
        between           801                                      insulin           64
        law               799                                      zap               64
        moder             798                                      antibody          61
        mis               798                                      fungicidal        61
        softe             798





The IR lexicon
IR lexicon: most important information categories
   word forms (lists of words) +
        - frequency information
        - quotes (dated sources) from corpora or electronic
        dictionaries
        - MODERN LEMMA (comparable to a dictionary headword) linked to spelling
        variants and inflected forms of the same word
   The modern lemma is used for searching in texts
   Standard use in corpus linguistics and modern historical lexicography




<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
<lexical_entry><lemma_id>219490</lemma_id>
<modern_lemma>aantuilen</modern_lemma>
<gloss></gloss>
<POS>VRB</POS>
<ne_label></ne_label>
<language_id></language_id>
<portmanteau_lemma_id></portmanteau_lemma_id>

<wordform><form_representation>
<wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id>
<token_id></token_id>
<quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen:
Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
<derivation_id>0</derivation_id>
<document_id>204</document_id>
<start_pos>119</start_pos>
<end_pos>124</end_pos>
</attestation>
</form_representation>
</wordform>
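A lexicon in this XML layout can be turned into a lookup from modern lemma to written forms. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample is a trimmed copy of the excerpt in which the closing tags for `lexical_entry` and `lexicon` have been added so that it parses, which is an assumption about the full file. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed entry using the element names of the excerpt; the closing
# </lexical_entry> and </lexicon> tags are assumed, not taken from the slide.
SAMPLE = """<lexicon>
<lexical_entry><lemma_id>219490</lemma_id>
<modern_lemma>aantuilen</modern_lemma>
<POS>VRB</POS>
<wordform><form_representation>
<wordform_id>850026</wordform_id>
<written_form>tuyld</written_form>
<attestation><id>92141</id>
<document_id>204</document_id>
<start_pos>119</start_pos>
<end_pos>124</end_pos>
</attestation>
</form_representation>
</wordform>
</lexical_entry>
</lexicon>"""


def lemma_to_forms(xml_text):
    """Map each modern lemma to the written forms recorded for it."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        forms = [wf.text for wf in entry.iter("written_form")]
        index.setdefault(lemma, []).extend(forms)
    return index


index = lemma_to_forms(SAMPLE)
print(index)
```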




Tools for lexicon building and application of lexica








     Types of variation (spelling, inflection…)
I           uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken
            uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken
            d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke
            uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke
            uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken
            uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke
            uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick
            uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic
            uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

               (patterns to predict variation)

II          werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
            wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys
            swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt
            worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde
            tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

               (some are predictable with patterns, others need to be taken from a lexicon)





Neil Fitzgerald, 7th July 2011




Computer lexica
     For OCR and OCR postcorrection
     Improving searchability of historical text material by building a lexicon
     that links variants to a modern lemma used as the search entry

     Tools for lexicon building
     Tools for application of lexicon in search engines
     Lexicon cookbook







Tools (more specific)
- Lexicon building from corpus material and dictionaries
- Use of lexica in search engines

- Tool to extract spelling variation patterns from historical
  material
- Tool to relate previously unrecognised spelling variations to
  their standard form
- Tool to reduce previously unrecognised inflected forms to their
  basic form







Spelling variation tools (pattern-based)
      Language-independent approach:
         Supervised rule (pattern) induction from pairs (“modern” word,
         historical word), yielding patterns like aa/ae, s/z, ….
         Pattern weights are computed from example material

Additional approaches possible, e.g.:
  Use of aligned data (parallel historical text and modern version)
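One possible way to induce weighted patterns from (modern, historical) pairs is sketched below. It uses the character alignment of Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment the project tools use, and takes relative frequency as the pattern weight; both choices, the function names and the example pairs are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_patterns(modern, historical):
    """Yield (modern substring, historical substring) for each differing span."""
    matcher = SequenceMatcher(None, modern, historical)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield modern[i1:i2], historical[j1:j2]


def induce_patterns(pairs):
    """Count patterns over all example pairs and turn counts into weights."""
    counts = Counter(p for modern, hist in pairs
                     for p in extract_patterns(modern, hist))
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


# Invented training pairs: (modern word, historical word).
pairs = [("jaar", "jaer"), ("maar", "maer"), ("zijn", "sijn"),
         ("uiterlijk", "uyterlick")]
weights = induce_patterns(pairs)
for (modern_part, hist_part), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{modern_part}/{hist_part}: {w:.2f}")
```

This alignment yields single-character patterns such as a/e; adding one character of context on each side would give patterns in the aa/ae style mentioned on the slide.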








Lemmatization
      Reduction of historical word forms to modern lemma
      Historical word → standard (“modern”) spelling → lemma form
              (1) pattern matching          (2) lemmatizer

      Dystels → (1) → distels → (2) → distel

   When we have a perfect or near-perfect modern full form
   lexicon, the second step is simply lexicon lookup.
         But:
1) We will not have full form information for many lemmata
   (especially the historical ones)
2) Even lemmata present in modern language may have historical
   inflected forms different from the present-day paradigm
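The two-step reduction can be sketched as below for the case where the full form lexicon does contain the word. The pattern list and the tiny full form lexicon are invented for illustration, and pattern weights are ignored.

```python
# Hypothetical patterns: historical substring -> modern substring.
PATTERNS = [("y", "i"), ("ae", "aa"), ("ck", "k")]
# Hypothetical modern full form lexicon: word form -> lemma.
FULL_FORMS = {"distel": "distel", "distels": "distel",
              "jaar": "jaar", "jaren": "jaar"}


def modern_candidates(historical):
    """Step 1: apply patterns at every position, in any combination."""
    start_word = historical.lower()
    seen = {start_word}
    frontier = [start_word]
    while frontier:
        word = frontier.pop()
        for old, new in PATTERNS:
            pos = word.find(old)
            while pos != -1:
                candidate = word[:pos] + new + word[pos + len(old):]
                if candidate not in seen:
                    seen.add(candidate)
                    frontier.append(candidate)
                pos = word.find(old, pos + 1)
    return seen


def lemmatize(historical):
    """Step 2: look the candidates up in the modern full form lexicon."""
    return sorted({FULL_FORMS[c] for c in modern_candidates(historical)
                   if c in FULL_FORMS})


print(lemmatize("Dystels"))
```

When no candidate is in the lexicon the result is empty, which is exactly the situation described in points 1) and 2).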




Lemmatization and reverse lemmatization
We also need a lemmatization process for these situations
   A typical lemmatizer assigns some standard form (infinitive,
   nominative, stem) to inflected forms. Usually based on patterns
   relating the inflected form to the standard form.
But:
   Matching these patterns can be hard to combine with matching
   both spelling variation patterns and OCR errors
   (bok/bokken/bokkeu)
   We adopt the solution of actually constructing a “hypothetical
   modern full form lexicon” containing the most plausible
   paradigmatic expansions of lemmata
   This construction is carried out by means of a statistical reverse
   lemmatizer
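The model behind the statistical reverse lemmatizer is not described here. As a simple stand-in, the sketch below learns suffix rewrite rules from known (lemma, inflected form) pairs, keeps one character of context, and ranks the hypothetical forms of a new lemma by relative rule frequency. All names and training pairs are assumptions.

```python
import os
from collections import Counter


def suffix_rule(lemma, form):
    """Rule (lemma ending, form ending), keeping one shared character as context."""
    shared = len(os.path.commonprefix([lemma, form]))
    cut = max(shared - 1, 0)
    return lemma[cut:], form[cut:]


def train(pairs):
    """Count how often each suffix rule occurs in the example pairs."""
    return Counter(suffix_rule(lemma, form) for lemma, form in pairs)


def expand(lemma, rules):
    """Hypothetical inflected forms of a lemma, most plausible first."""
    applicable = {rule: n for rule, n in rules.items() if lemma.endswith(rule[0])}
    total = sum(applicable.values())
    forms = [(lemma[:len(lemma) - len(old)] + new, n / total)
             for (old, new), n in applicable.items()]
    return sorted(forms, key=lambda item: -item[1])


# Invented training pairs: (lemma, plural form).
rules = train([("rok", "rokken"), ("stok", "stokken"), ("boek", "boeken"),
               ("appel", "appels"), ("distel", "distels")])
print(expand("bok", rules))
```

The expanded forms can then be matched against historical text in one pass, together with spelling variation patterns and OCR errors, instead of lemmatizing each token separately.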




Attestation
      From hypothetical (non-witnessed) lexicon content to attested word
      forms in “real” text
      Automatic selection of candidate attestations
      Manual work: verification and correction

      Two approaches
        Dictionary based (INL): Woordenboek der Nederlandsche Taal
        Corpus based (LMU, INL): Dutch DBNL corpus







  IMPACT Dictionary Attestation Tool
  Lexicon building at work: Verifying attestations in historical dictionaries
Task
    Find the variants of a headword as they occur in the quotations

Headword: work
Quotations (containing the variants):
    • We are working on what works.
    • Depart from me, ye that worke iniquity.
    • She worcketh knittinge of stockings.




             IMPACT Dictionary Attestation Tool
  Task
      Find the variants of a headword as they occur in the quotations

  Input: electronic historical dictionary → database with lemmata and quotations

  Automatically (preprocessing)
      • match literally
            e.g. work → work, Work
      • match using existing lexica and lists
            e.g. work → works, worked, wrought
      • approximate matching
            e.g. work → worke
  By hand (using the tool)
      • correct automatic mismatches
            e.g. works → words, worms
      • find missed matches
            e.g. work → worketh, wrowght
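The automatic preprocessing stage could look roughly like the sketch below. The known-forms table, the similarity measure (`difflib.SequenceMatcher.ratio`) and the 0.8 threshold are assumptions chosen for illustration, not the settings of the IMPACT tool.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lexicon of known inflected forms per headword.
KNOWN_FORMS = {"work": {"works", "worked", "wrought"}}


def match_tokens(headword, quotation, threshold=0.8):
    """Return (token, match type) for tokens that look like forms of the headword."""
    results = []
    for token in re.findall(r"[^\W\d_]+", quotation):
        low = token.lower()
        if low == headword:
            kind = "literal"
        elif low in KNOWN_FORMS.get(headword, ()):
            kind = "lexicon"
        elif SequenceMatcher(None, headword, low).ratio() >= threshold:
            kind = "approximate"
        else:
            continue
        results.append((token, kind))
    return results


for quote in ["We are working on what works.",
              "Depart from me, ye that worke iniquity.",
              "She worcketh knittinge of stockings."]:
    print(match_tokens("work", quote))
```

With these settings “worke” is found by approximate matching, while “working” and “worcketh” fall below the threshold and would be left for the manual stage.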





      IMPACT Attestation Tool
      (screenshot of the tool, with these labelled parts)
         Up-to-date overview of what is done and what needs to be done
         Done by this user so far
         Lemma headword
         Quotations, sorted by uncertainty








  IMPACT Lexicon Tool
Task
    Find and verify attestations in a historical corpus
    Automatically (preprocessing = apply lemmatizer)
        • match literally
              e.g. work → work, Work
        • match using existing lexica and lists
              e.g. work → works, worked, wrought
        • match using spelling variation module
              e.g. uiterlijk → uyterlick
    By hand (using the tool)
        • assign correct lemma
              e.g. was (N) → zijn (V)
        • group tokens belonging together
              e.g. konings zoon → koningszoon
        • select attestations



Corpus-based lexicon building: IMPACT Lexicon Tool








General vocabulary vs. Named entities
      Tools for lexicon building described so far: applicable to general
      lexicon
      Tools for NE recognition, classification and variant matching
      - library requirement
      - distinguish general vocabulary from NEs
      - avoid unpleasant mixups like Abimelech → apemelk!
         (b/p; i/e; e/0; k/ch)








Improvement of state of the art / innovation

      We use existing computational linguistic approaches, but figure out
      how to apply them to historical language
      We develop a workflow to deal with the problems posed by historical
      language, figuring out how all pieces fit together
         Data selection and acquisition
         Manual work
         Computational linguistics tools






Languages in IMPACT
      Dutch, German, English, Spanish, French
      Polish, Czech, Slovene and Bulgarian

-     Cross language perspective paper
-     Parallel OCR and IR experiments
-     GT datasets

-     Language tools: language independent
-     Apart from the 3 core languages: proof-of-concept lexica








OCR evaluation results
(preliminary!)




1. Czech
           Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších
           zásad konstitucí ewropejských, 1848
           Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye
           zlopověstných kousků starého Reinecke, 1848
           Homerowa Iliada, 1802
           Na den narození neimocněišího, a neijasněišího cysare rímského,
           téz dědičného rakauského a krále ceského, Frantiska II., w Praze
           12. den mesyce Unora, léta 1805, 1805
           Plody sborů učenců řeči českoslowanské prešporského, 1836
           Rozprawy o gmenách, počátkách i starožitnostech národu
           Slawského a geho kmeni /, 1830
           Sokol, 1872
           Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla
           lidského a gednotliwých geho částek, 1840




2. Dutch
      18th and 19th century books, newspapers, parliamentary papers

      Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en
      advertentieblad, 1852-1852
      Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs
      schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den
      staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
      Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...]
      bevonden hebben, te Utrecht, 1784-1784
      Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke
      armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het
      Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
Precision: 0.8433, Recall: 0.8433




English
      16th-19th century material
      Sources for lexicon building: OED, ECCO




French
17th century books

      Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe
      pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de
      S. Ange,..., 1653
      Dissertation de la philosophie en général, 1668
      La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte
      de matières..., 1673
      Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle
      que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le
      reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.],
      1677
      Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur
      de La Coudraye.], 1693




German
           Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
           Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden
           Literaturgeschichte, 1884
           Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
           Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an
           Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887,
           1887
           Quedlinburgisches Kreis-Tags-Memorial, 1673
           Von der Regierung der Kirche und den unterschiedlichen Würden der
           Geistlichkeit *(full title in comments), 1779
           Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu
           Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag
           Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden,
           1609




Polish
           Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z
           tureckim cesarzem, 1621
           Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót
           Polaków z Wołoch w roku 1621, 1621
           Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
           Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
           Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes
           podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki,
           melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
           Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
           Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie,
           1601
           Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
           Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej,
           1634
           Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony
           Polskiej_BW, 1634
           Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589




Slovene
           Genovefa, 1841
           Gosp. Krištofa Šmida korarja avgustanskiga, zgodbe S. Pisma za
           mlade ljud..., 1850
           Kmetijske in rokodelske novice, 1844
           Kratkozhasne uganke, 1788
           Kuharske Bukve, 1799
           Marianske Kempensar, ali Dvoje bukuvze, 1769
           Novice kmetijskih, rokodelnih in narodskih reči, 1851
           Sgodbe svetiga pisma za mlade ljudi, 1830
           Ta male katechismus, 1768
           Vezhna pratika od gospodarstva, 1789
           Zerkviza na skali, 1855




Retrieval demonstrator

      Indexing and retrieval library (Java) built on the Lucene search engine
      Lexicon in MySQL database

      OCR of about 2000 images of the Dutch Ground Truth selection, using the
      FineReader SDK with its external dictionary interface
      Page XML output [in framework]
      NE tagging
      Indexing and retrieval using the lexicon and NE tagging
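
The role of the lexicon in retrieval can be illustrated with a small sketch. It is not the demonstrator's code: the slides do not show the Java library's API, so the in-memory dictionary (standing in for the MySQL lexicon), the function names and the field name `text` are all assumptions. The query string uses the `field:term OR field:term` form of Lucene's classic query syntax.

```python
# Hypothetical mini-lexicon: modern lemma -> attested historical word forms.
# In the demonstrator this information is stored in a MySQL database.
LEXICON = {
    "wereld": ["wereld", "werelt", "weerelt", "waereld", "werreld", "werelden"],
}


def expand_query(term, lexicon=LEXICON, field="text"):
    """Build an OR query over all known forms of a modern lemma."""
    forms = lexicon.get(term.lower(), [term.lower()])
    # Deduplicate while keeping order, so the query string is stable.
    unique = list(dict.fromkeys(forms))
    return " OR ".join(f"{field}:{form}" for form in unique)


def search(term, documents, lexicon=LEXICON):
    """Return the ids of documents containing any form of the lemma."""
    forms = set(lexicon.get(term.lower(), [term.lower()]))
    return [doc_id for doc_id, text in documents.items()
            if forms & set(text.lower().split())]
```

Keying in the modern form WERELD then retrieves documents that only contain a historical spelling such as "werelt"; a term missing from the lexicon is simply searched as typed.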




Computer Lexica in OCR and Retrieval

  • 1.
    Computer Lexica in OCR and Retrieval
    Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
  • 2.
    Can we handle ‘de wereld’ (‘the world’)?
    werreid
    4 March 2009 presentation, The Hague
  • 3.
    OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary
    OCR: ABBYY FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch: werreld
  • 4.
    werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    RETRIEVAL: key in modern WERELD and find all variants
  • 5.
    The long s problem: an example
    OCR at start of project:
      A. De eerde was de gevaarlykflti om de verlei¬ . ding aan 't Hof; de tweede de ftillie en veiligde; de derde de zwaarde, daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
    IMPACT workshop, Bratislava, May 7, 2010
  • 6.
    The long s problem: an example
    OCR at start of project:
      A. De eerde was de gevaarlykflti om de verlei¬
      ding aan 't Hof; de tweede de ftillie en veiligde;
      de derde de zwaarde, daar hy byna drie millioenen
      harde en onbefchaafde Menfchen beftieren moest.
    Results April 2010:
      A. De eerste was de gevaarlykste om de verlei-
      ding aan 't Hof; de tweede de stilste en veiligste;
      de derde de zwaarste, daar hy byna drie millioenen
      harde en onbeschaafde Menschen bestieren moest.
  • 7.
    The long s problem: an example (same OCR comparison as on the previous slide)
    Workaround: “integrated postcorrection”. Tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. In this way we keep it from turning into “eerde” (earth) instead of “eerste” (first).
  • 8.
    Overview
    - What is a computer lexicon
    - Lexica in IMPACT
    - Tools for lexicon building and applying lexica
    - Some results
    - Searching Demonstration
  • 9.
    What is a computer lexicon?
  • 10.
    Computer lexicon vs electronic dictionary (1)
    An electronic dictionary is:
    - Digitised full text (no pictures)
    - For human use
    - Ideally: searchable, with explicitly coded material (XML), such as lemma, part of speech (PoS), meaning, quotes, etc.
    Examples: OED online, WNT online
  • 11.
    Dictionary XML (example)
  • 13.
    Computer Lexicon vs Electronic Dictionary (2)
    A computer lexicon is:
    - Always in a structured digital format (XML, relational database)
    - Main purpose: computer application
    - Explicitly coded information (e.g. lemma wereld, part of speech noun, morphology werelden, werelds …, syntax)
    Examples of use:
    - Linguistic enrichment of text material
    - ‘Advanced’ searching (words with all spelling variants and inflections)
    - Automatic summarization, keyword extraction…
  • 15.
    Lexica in IMPACT
  • 16.
    The OCR lexicon
    An OCR lexicon is:
    - A checked list of words in a language
    - Based on a corpus (collection) of dated texts (selection!)
    - Preferably with frequency information
    - Preferably from the same time period or of the same text type as the texts you wish to digitize
  • 17.
    OCR lexicon: example
    1550-1750          > 1900
    song      820      television   418
    rihte     818      electronic   375
    theire    818      video        194
    manye     818      hormone      176
    sume      815      jazz         162
    Do        814      eco          142
    Whiche    811      software     136
    fyrst     811      vitamin      128
    while     811      movie        121
    Water     810      taxi         113
    wt        809      isotopic     108
    shalbe    808      electronics   95
    thingis   807      radar         86
    again     806      basically     71
    sona      806      sabotage      71
    wa        805      homozygote    70
    mode      804      psychedelic   67
    work      802      phonemic      66
    between   801      insulin       64
    law       799      zap           64
    moder     798      antibody      61
    mis       798      fungicidal    61
    softe     798
  • 18.
    The IR lexicon
    IR lexicon: most important information categories
    Word forms (lists of words), plus:
    - frequency information
    - quotes (dated sources) from corpora or electronic dictionaries
    - MODERN LEMMA (comparable to a dictionary headword), linked to spelling variants and inflected forms of the same word
    The modern lemma is used for searching in texts
    Standard use in corpus linguistics and modern historical lexicography
  • 19.
    <?xml version='1.0'?>
    <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
    <lexicon>
      <lexical_entry>
        <lemma_id>219490</lemma_id>
        <modern_lemma>aantuilen</modern_lemma>
        <gloss></gloss>
        <POS>VRB</POS>
        <ne_label></ne_label>
        <language_id></language_id>
        <portmanteau_lemma_id></portmanteau_lemma_id>
        <wordform>
          <form_representation>
            <wordform_id>850026</wordform_id>
            <written_form>tuyld</written_form>
            <attestation>
              <id>92141</id>
              <token_id></token_id>
              <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
              <derivation_id>0</derivation_id>
              <document_id>204</document_id>
              <start_pos>119</start_pos>
              <end_pos>124</end_pos>
            </attestation>
          </form_representation>
        </wordform>
  • 20.
    Tools for lexicon building and application of lexica
  • 21.
    Types of variation (spelling, inflection…)
    I. uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
       (patterns to predict variation)
    II. werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
       (a number are predictable with patterns, others need to be taken from a lexicon)
  • 22.
    Neil Fitzgerald, 7th July 2011
  • 23.
    Computer lexica
    - For OCR and OCR post-correction
    - Improving searchability of historical text material by building a lexicon with variants and by using a modern lemma as a search entry
    - Tools for lexicon building
    - Tools for application of lexicon in search engines
    - Lexicon cookbook
  • 24.
    Tools (more specific)
    - Lexicon building from corpus material and dictionaries
    - Use of lexica in search engines
    - Tool to extract spelling variation patterns from historical material
    - Tool to relate previously unrecognised spelling variations to their standard form
    - Tool to reduce previously unrecognised inflected forms to their basic form
  • 25.
    Spelling variation tools (pattern-based)
    Language-independent approach:
    - Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …
    - Pattern weights are computed from example material
    Additional approaches possible, e.g.:
    - Use of aligned data (parallel historical text and modern version)
  • 26.
    Lemmatization
    Reduction of historical word forms to modern lemma:
      historical word --(1: pattern matching)--> standard (“modern”) spelling --(2: lemmatizer)--> lemma form
      Dystels --(1)--> distels --(2)--> distel
    When we have a perfect or near-perfect modern full-form lexicon, the second step is simply lexicon lookup. But:
    1) We will not have full-form information for many lemmata (especially the historical ones)
    2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
  • 27.
    Lemmatization and reverse lemmatization
    We also need a lemmatization process for these situations.
    A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form.
    But: matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu).
    We adopt the solution of actually constructing a “hypothetical modern full-form lexicon” containing the most plausible paradigmatic expansions of lemmata.
    This construction is carried out by means of a statistical reverse lemmatizer.
  • 28.
    Attestation
    From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
    - Automatic selection of candidate attestations
    - Manual work: verification and correction
    Two approaches:
    - Dictionary-based (INL): Woordenboek der Nederlandsche Taal
    - Corpus-based (LMU, INL): Dutch DBNL corpus
  • 29.
    IMPACT Dictionary Attestation Tool
    Lexicon building at work: verifying attestations in historical dictionaries
    Task: find the variants of a headword as they occur in the quotations
    Headword: work
    Quotations:
    - We are working on what works.
    - Depart from me, ye that worke iniquity.
    - She worcketh knittinge of stockings.
    Variants: the forms of the headword occurring in the quotations (working, works, worke, worcketh)
  • 30.
    IMPACT Dictionary Attestation Tool
    Task: find the variants of a headword as they occur in the quotations
    Electronic historical dictionary → database with lemmata and quotations
    Automatically (preprocessing):
    - match literally, e.g. work: work, Work
    - match using existing lexica and lists, e.g. work: works, worked, wrought
    - approximate matching, e.g. work: worke
    By hand (using the tool):
    - correct automatic mismatches, e.g. works: words, worms
    - find missed matches, e.g. work: worketh, wrowght
  • 31.
    IMPACT Attestation Tool
    - Up-to-date overview of what is done and needs to be done
    - Done by this user so far
    - Lemma headword
    - Quotations
    - Sorted by uncertainty
  • 32.
    IMPACT Lexicon Tool
    Task: find and verify attestations in a historical corpus
    Automatically (preprocessing = apply lemmatizer):
    - match literally, e.g. work: work, Work
    - match using existing lexica and lists, e.g. work: works, worked, wrought
    - match using the spelling variation module, e.g. uiterlijk: uyterlick
    By hand (using the tool):
    - assign correct lemma, e.g. was (N) / zijn (V)
    - group tokens belonging together, e.g. konings zoon: koningszoon
    - select attestations
  • 33.
    Corpus-based lexicon building: IMPACT Lexicon Tool
  • 34.
    General vocabulary vs. named entities
    Tools for lexicon building described so far: applicable to the general lexicon
    Tools for NE recognition, classification and variant matching:
    - library requirement
    - distinguish general vocabulary from NEs
    - avoid unpleasant mix-ups like Abimelech / apemelk! (b/p; i/e; e/0; k/ch)
  • 35.
    Improvement of state of the art / innovation
    We use existing computational linguistic approaches, but figure out how to apply them to historical language.
    We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
    - Data selection and acquisition
    - Manual work
    - Computational linguistics tools
  • 36.
    Languages in IMPACT
    Dutch, German, English, Spanish, French, Polish, Czech, Slovene and Bulgarian
    - Cross-language perspective paper
    - Parallel OCR and IR experiments
    - GT datasets
    - Language tools: language-independent
    - Apart from the 3 core languages: proof-of-concept lexica
  • 37.
    OCR evaluation results (preliminary!)
  • 38.
    1. Czech
      Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
      Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
      Homerowa Iliada, 1802
      Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
      Plody sborů učenců řeči českoslowanské prešporského, 1836
      Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
      Sokol, 1872
      Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840
  • 40.
    2. Dutch
    18th and 19th century books, newspapers, parliamentary papers
      Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
      Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
      Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
      Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
  • 41.
    Precision: 0.8432889410216431, Recall: 0.843331934927516
  • 43.
    English
    16th-19th century material
    Sources for lexicon building: OED, ECCO
  • 45.
    French
    17th century books
      Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
      Dissertation de la philosophie en général, 1668
      La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
      Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
      Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693
  • 47.
    German
      Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
      Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
      Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
      Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
      Quedlinburgisches Kreis-Tags-Memorial, 1673
      Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
      Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609
  • 49.
    Polish
      Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
      Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
      Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
      Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
      Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
      Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
      Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
      Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
      Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
      Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
      Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589
  • 51.
    Slovene
      Genovefa, 1841
      Gosp. Krištofa Šmida korarja avgustanskiga, zgodbe S. Pisma za mlade ljud..., 1850
      Kmetijske in rokodelske novice, 1844
      Kratkozhasne uganke, 1788
      Kuharske Bukve, 1799
      Marianske Kempensar, ali Dvoje bukuvze, 1769
      Novice kmetijskih, rokodelnih in narodskih reči, 1851
      Sgodbe svetiga pisma za mlade ljudi, 1830
      Ta male katechismus, 1768
      Vezhna pratika od gospodarstva, 1789
      Zerkviza na skali, 1855
  • 53.
    Retrieval demonstrator
      Indexing and retrieval library (Java) implemented on the Lucene search engine
      Lexicon in MySQL database
      OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
      Page XML output [in framework]
      NE tagging
      Indexing and retrieval while using lexicon and NE tagging
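
The idea behind "indexing and retrieval while using lexicon" can be pictured with a minimal Java sketch. This is not the IMPACT library and it does not call Lucene or MySQL: the class `LexiconRetrievalSketch`, its method names, the in-memory maps that stand in for the lexicon database and the search index, and the sample sentences are all assumptions made for illustration. It only shows the general technique of expanding a modern query term into its attested historical word forms before looking them up in an index of the OCR text.

```java
import java.util.*;

/** Illustrative sketch only: lexicon-based query expansion over a toy inverted index. */
final class LexiconRetrievalSketch {

    // Hypothetical stand-in for the lexicon (kept in MySQL in the demonstrator):
    // modern lemma -> attested historical word forms, including the lemma itself.
    private final Map<String, Set<String>> variantsByLemma = new HashMap<>();

    // Toy inverted index (the demonstrator uses Lucene):
    // word form as found in the OCR text -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Registers a modern lemma together with its historical variants. */
    void addLexiconEntry(String lemma, String... variants) {
        Set<String> forms = variantsByLemma.computeIfAbsent(lemma, k -> new TreeSet<>());
        forms.add(lemma);
        forms.addAll(Arrays.asList(variants));
    }

    /** Lower-cases the text, splits it on non-letters and indexes every token. */
    void indexDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns all forms to search for; terms unknown to the lexicon are searched as typed. */
    Set<String> expand(String queryTerm) {
        String key = queryTerm.toLowerCase(Locale.ROOT);
        Set<String> forms = variantsByLemma.get(key);
        return forms != null ? Collections.unmodifiableSet(forms) : Collections.singleton(key);
    }

    /** Collects the documents that contain any expanded form of the query term. */
    Set<Integer> search(String queryTerm) {
        Set<Integer> hits = new TreeSet<>();
        for (String form : expand(queryTerm)) {
            hits.addAll(postings.getOrDefault(form, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        LexiconRetrievalSketch sketch = new LexiconRetrievalSketch();
        // Variants taken from the 'wereld' examples earlier in the talk.
        sketch.addLexiconEntry("wereld", "werelt", "waereld", "weereld");
        // Invented sample sentences, not taken from the Ground Truth set.
        sketch.indexDocument(1, "De gantsche werelt");
        sketch.indexDocument(2, "in de waereld");
        sketch.indexDocument(3, "de wereld van nu");
        System.out.println(sketch.search("wereld")); // prints [1, 2, 3]
    }
}
```

Without the lexicon entry, the same query would match only document 3, the one containing the modern spelling. In this sketch only lemma queries are expanded; a query for a historical form such as `werelt` is looked up as typed.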