Language tools bne-5-10-2011

Presentation on language tools, given by Jesse de Does and Katrien Depuydt during the demo session held at the BNE on 5 October 2011.

  • A snippet from a Dutch magazine (De Denker. No. 4. Den 24. January 1763). OCR: improving access to text by improving the quality of the text. Retrieval: improving access to text by dealing with historical spelling variants. Used: the historical lexicon of Dutch. Can we handle ‘the world’? ‘Yes we can’ ought to be our answer, especially when investing hugely in mass digitisation. Mass digitisation is the very reason for investing in lexicon building: digitising huge quantities of historical text demands effort on the quality of both OCR and retrieval, and historical lexicon building can contribute to both, as this small example shows. In a ground truth corpus of Dutch texts from 1550 until 1950, containing approximately 150 million words, a search for the very common word ‘wereld’ yielded 23396 hits. Using a historical lexicon containing spelling and morphological variants of this word resulted in 34339 hits.
  • This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  • This is what an XML-based electronic dictionary looks like.
  • This is the XML of the Oxford English Dictionary. The horizontal lines mark places where part of the structure has been folded in.
  • ‘Lemma’, ‘part of speech’ and ‘morphology’ need further explanation. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivatives, i.e. which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
  • This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  • Again, in case it is unclear what LEMMA means: be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  • Two types of variation, examples for Dutch from the lexicon
  • To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
  • These are some of the ways in which we are using Computer lexica as building blocks.
  • Explanation: semi-supervised approach: match the word list from the corpus with the lexicon and find both the pairings of corpus words with lexicon words and the patterns needed for the transformation. This only works if corpus and lexicon are a good match.
  • Note: applicable to other historical dictionaries with attestations. Tested on OED material!

Presentation Transcript

  • Computer Lexica in OCR and Retrieval Katrien Depuydt, Jesse de Does (Instituut voor Nederlandse Lexicologie, Leiden)
  • Can we handle ‘de wereld’ (‘the world’)? 4 March 2009 presentation The Hague werreid
  • IMPACT <Demo Day BL, 12 July 2011> OCR: Abbyy Finereader SDK with built-in standard Dutch dictionary. OCR: Abbyy Finereader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch: werreld
  • werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled RETRIEVAL: key in modern WERELD and find all
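The retrieval idea on this slide (key in the modern lemma, find every historical form) can be sketched as a simple query expansion. This is only an illustration under assumptions: the lexicon is taken to be a plain mapping from modern lemma to word forms, the names `LEXICON`, `expand_query` and `search` are hypothetical, the variant set is a small excerpt of the list above, and the example sentence is made up.

```python
# Hypothetical excerpt of an IR lexicon: modern lemma -> historical word forms.
LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werreld", "werlt", "vvereld"},
}

def expand_query(term, lexicon=LEXICON):
    """Return the set of forms to search for; unknown terms match only themselves."""
    key = term.lower()
    return lexicon.get(key, set()) | {key}

def search(term, tokens, lexicon=LEXICON):
    """Return the positions of all tokens matching the term or one of its variants."""
    forms = expand_query(term, lexicon)
    return [i for i, tok in enumerate(tokens) if tok.lower() in forms]

# Made-up token sequence containing two historical spellings of 'wereld'.
tokens = "De gansche Waereld en de werelt van gisteren".split()
hits = search("WERELD", tokens)
```

Here `hits` holds the positions of `Waereld` and `werelt`, which a literal search for `wereld` would have missed.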
  • The long s problem: An example. IMPACT workshop, Bratislava, May 7, 2010
    • OCR at start of project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
    • Results April 2010: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
    • Workaround: “integrated postcorrection”: tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. In this way we keep it from turning into “eerde” (earth) instead of “eerste” (first).
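The postcorrection step of this workaround can be sketched as follows. This is a minimal illustration, not the IMPACT implementation: it assumes the OCR engine has been allowed to output `f` for the long s, and it then tries every way of reading an `f` as `s` until a lexicon form is found. `LEXICON`, `long_s_candidates` and `postcorrect` are hypothetical names, and the lexicon is a tiny excerpt built from the corrected text on the slide.

```python
from itertools import product

# Hypothetical excerpt of a historical lexicon (forms from the slide's corrected text).
LEXICON = {"eerste", "onbeschaafde", "menschen", "bestieren", "hof"}

def long_s_candidates(word):
    """Yield every spelling obtained by reading some of the 'f's as (long) 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)

def postcorrect(word, lexicon=LEXICON):
    """Return a lexicon form reachable via f -> s substitutions, else the word unchanged."""
    if word.lower() in lexicon:
        return word  # already a known form, e.g. 'Hof' must not become 'Hos'
    for cand in long_s_candidates(word):
        if cand.lower() in lexicon:
            return cand
    return word
```

The number of candidates doubles with every `f` in the word, which is acceptable for single tokens but is one reason a real system would weight candidates instead of enumerating them.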
  • Overview
    • What is a computer lexicon
    • Lexica in IMPACT
    • Tools for lexicon building and applying lexica
    • Some results
    • Searching Demonstration
  • What is a computer lexicon?
  • Computer lexicon vs electronic dictionary (1): An electronic dictionary is:
    • Digitised full text (no pictures)
    • For human use
    • Ideally: searchable, with explicitly coded material (XML) such as lemma, part of speech (PoS), meaning, quotes etc.
    • Examples: OED online, WNT online
  • Dictionary XML (example)
  • Computer lexicon vs electronic dictionary (2)
    • A computer lexicon is:
    • Always in a structured digital format (XML, relational database)
    • Main purpose: computer application
    • Explicitly coded information (e.g. lemma wereld, part of speech noun, morphology werelden, werelds …, syntax)
    • Examples of use:
    • Linguistic enrichment of text material
    • ‘Advanced’ searching (words with all spelling variants and inflections)
    • Automatic summarization, keyword extraction…
  • Lexica in IMPACT
  • The OCR lexicon: An OCR lexicon is
    • A checked list of words in a language
    • Based on a corpus (collection) of dated texts (selection!)
    • Preferably with frequency information
    • Preferably from the same time period or of the same text type as the texts you wish to digitize
  • OCR lexicon: example (word with frequency, per period)
    • 1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
    • > 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
  • The IR lexicon
    • IR lexicon: most important information categories are word forms (lists of words), plus:
      • frequency information
      • quotes (dated sources) from corpora or electronic dictionaries
      • MODERN LEMMA (cf. dictionary headword) linked to spelling variants and inflected forms of the same word
    • The modern lemma is used for searching in texts
    • Standard use in corpus linguistics and modern historical lexicography
  • <?xml version='1.0'?>
    <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
    <lexicon>
      <lexical_entry>
        <lemma_id>219490</lemma_id>
        <modern_lemma>aantuilen</modern_lemma>
        <gloss></gloss>
        <POS>VRB</POS>
        <ne_label></ne_label>
        <language_id></language_id>
        <portmanteau_lemma_id></portmanteau_lemma_id>
        <wordform>
          <form_representation>
            <wordform_id>850026</wordform_id>
            <written_form>tuyld</written_form>
            <attestation>
              <id>92141</id>
              <token_id></token_id>
              <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
              <derivation_id>0</derivation_id>
              <document_id>204</document_id>
              <start_pos>119</start_pos>
              <end_pos>124</end_pos>
            </attestation>
          </form_representation>
        </wordform>
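To show how such an entry could feed a search index, here is a small sketch that reads a lexicon fragment with Python's standard `xml.etree.ElementTree` and builds a mapping from modern lemma to written forms. The element names come from the slide; everything else is an assumption: the entry is shortened, the DOCTYPE is left out, the closing `</lexical_entry>` and `</lexicon>` tags are added so that the fragment parses, and `lemma_to_forms` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened entry from the slide; closing tags added so it parses (an assumption).
XML = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <quote>Sy acht het boertery, en tuyld daer weer op an</quote>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""

def lemma_to_forms(xml_text):
    """Map each modern lemma to the set of written forms listed under it."""
    index = {}
    for entry in ET.fromstring(xml_text).iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma").strip()
        forms = {el.text.strip() for el in entry.iter("written_form")}
        index.setdefault(lemma, set()).update(forms)
    return index

index = lemma_to_forms(XML)
```

The `strip()` calls tolerate the stray spaces around element content that appear in the slide's rendering of the XML.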
  • Tools for lexicon building and application of lexica
  • Types of variation (spelling, inflection…)
    • I: uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
    • II: werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    • (patterns to predict variation) (a number are predictable with patterns, others need to be taken from a lexicon)
  • Neil Fitzgerald, 7th July 2011
  • Computer lexica
    • For OCR and OCR post correction
    • Improving searchability of historic text material by building a lexicon with variants by using a modern lemma as a search entry
    • Tools for lexicon building
    • Tools for application of lexicon in search engines
    • Lexicon cookbook
  • Tools (more specific)
    • Lexicon building from corpus material and dictionaries
    • Use of lexica in search engines
    • Tool to extract spelling variation patterns from historical material
    • Tool to relate previously unrecognised spelling variations to their standard form
    • Tool to reduce previously unrecognised inflected forms to their base form
  • Spelling variation tools (pattern-based)
    • Language-independent approach:
      • Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z , ….
      • Pattern weights are computed from example material
    • Additional approaches are possible, e.g.:
      • Use of aligned data (parallel historical text and modern version)
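Pattern induction from word pairs, as described on this slide, can be sketched in a few lines. The sketch below is an assumption-laden stand-in rather than the IMPACT tool: it aligns each pair with `difflib.SequenceMatcher` from the standard library and takes a pattern's weight to be its relative frequency in the example material. `PAIRS`, `extract_patterns` and `pattern_weights` are hypothetical names, and the training pairs are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher

# Illustrative (modern, historical) training pairs; not taken from the IMPACT data.
PAIRS = [("jaar", "jaer"), ("maar", "maer"), ("zee", "see")]

def extract_patterns(modern, historical):
    """Return the (modern, historical) substring pairs where the two spellings differ."""
    matcher = SequenceMatcher(None, modern, historical)
    return [
        (modern[i1:i2], historical[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def pattern_weights(pairs):
    """Count every pattern and weight it by its relative frequency."""
    counts = Counter(p for m, h in pairs for p in extract_patterns(m, h))
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

weights = pattern_weights(PAIRS)
```

Because the alignment is purely character-based, this sketch yields the context-free pattern `a/e` where the slide writes `aa/ae`; keeping some left and right context around each difference would be the natural refinement.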
  • Lemmatization
    • Reduction of historical word forms to modern lemma
    • Historical word → (pattern matching) → standard (“modern”) spelling → (lemmatizer) → lemma form
    • Dystels → (1) distels → (2) distel
    • When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup .
    • But:
    • We will not have full form information for many lemmata (especially the historical ones)
    • Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
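The two-step reduction (Dystels, distels, distel) can be sketched as pattern rewriting followed by lexicon lookup. This is a minimal illustration under assumptions: `PATTERNS` is a hypothetical list of historical-to-modern rewrites, `FULL_FORMS` is a tiny stand-in for the modern full form lexicon, and the patterns are unweighted, whereas the slides describe weighted patterns.

```python
# Hypothetical rewrite patterns (historical -> modern) and a tiny full-form lexicon.
# None of the replacements contains its own left-hand side, so the search terminates.
PATTERNS = [("y", "i"), ("ae", "aa"), ("ck", "k")]
FULL_FORMS = {"distel": "distel", "distels": "distel"}

def modern_candidates(word, patterns=PATTERNS):
    """Step 1: all spellings reachable by applying the patterns zero or more times."""
    seen, frontier = {word}, [word]
    while frontier:
        current = frontier.pop()
        for old, new in patterns:
            start = current.find(old)
            while start != -1:
                cand = current[:start] + new + current[start + len(old):]
                if cand not in seen:
                    seen.add(cand)
                    frontier.append(cand)
                start = current.find(old, start + 1)
    return seen

def lemmatize(historical, full_forms=FULL_FORMS):
    """Step 2: look each candidate up in the modern full-form lexicon."""
    for cand in sorted(modern_candidates(historical.lower())):
        if cand in full_forms:
            return cand, full_forms[cand]
    return None  # no candidate is a known modern form
```

With these assumptions `lemmatize("Dystels")` returns the modern form and its lemma, and a word with no candidate in the lexicon returns `None`, which is exactly the gap the following slides address.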
  • Lemmatization and reverse lemmatization
    • We also need a lemmatization process for these situations
    • A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms. Usually based on patterns relating the inflected form to the standard form.
    • But:
    • Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
    • We adopt the solution of actually constructing the “hypothetical modern full form lexicon”, containing the most plausible paradigmatic expansions of lemmata
    • This construction is carried out by means of a statistical reverse lemmatizer
  • Attestation
    • From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
    • Automatic selection of candidate attestations
    • Manual work: verification and correction
    • Two approaches
      • Dictionary based (INL): Woordenboek der Nederlandsche Taal
      • Corpus based (LMU, INL): Dutch DBNL corpus
  • IMPACT Dictionary Attestation Tool. Lexicon building at work: verifying attestations in historical dictionaries
    • Task: find the variants of a headword as they occur in the quotations
    • Headword: work
    • Quotations:
      • We are working on what works.
      • Depart from me, ye that worke iniquity.
      • She worcketh knittinge of stockings.
  • IMPACT Dictionary Attestation Tool
    • Task: find the variants of a headword as they occur in the quotations
    • Electronic historical dictionary: database with lemmata and quotations
    • Automatically (preprocessing)
        • match literally, e.g. work → work, Work
        • match using existing lexica and lists, e.g. work → works, worked, wrought
        • approximate matching, e.g. work → worke
    • By hand (using the tool)
        • correct automatic mismatches, e.g. works → words, worms
        • find missed matches, e.g. work → worketh, wrowght
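The automatic preprocessing steps above can be sketched as a three-stage classifier over the tokens of a quotation. This is an illustration only: `KNOWN_FORMS`, `classify` and `candidates` are hypothetical names, the approximate step uses the similarity ratio of `difflib.SequenceMatcher` as a stand-in for whatever matcher the tool actually uses, and the threshold of 0.8 is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of known inflected forms per headword ("existing lexica and lists").
KNOWN_FORMS = {"work": {"works", "worked", "wrought"}}

def classify(headword, token, threshold=0.8):
    """Return how a quotation token matches the headword, or None if it does not."""
    low = token.lower()
    if low == headword:
        return "literal"
    if low in KNOWN_FORMS.get(headword, set()):
        return "lexicon"
    if SequenceMatcher(None, headword, low).ratio() >= threshold:
        return "approximate"
    return None

def candidates(headword, quotation):
    """Return (token, match type) for every token proposed as an attestation."""
    result = []
    for token in re.findall(r"[^\W\d_]+", quotation):
        kind = classify(headword, token)
        if kind is not None:
            result.append((token, kind))
    return result
```

With this threshold `worke` is proposed automatically, while `working` and `worcketh` are not, so they would have to be added by hand in the tool as missed matches.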
  • IMPACT Attestation Tool (screenshot labels): lemma (headword); quotations, sorted by uncertainty; up-to-date overview of what is done and needs to be done; done by this user so far
  • IMPACT Lexicon Tool
    • Task: find and verify attestations in a historical corpus
    • Automatically (preprocessing = apply lemmatizer)
        • match literally, e.g. work → work, Work
        • match using existing lexica and lists, e.g. work → works, worked, wrought
        • match using the spelling variation module, e.g. uiterlijk → uyterlick
    • By hand (using the tool)
        • assign correct lemma, e.g. was (N) → zijn (V)
        • group tokens belonging together, e.g. konings zoon → koningszoon
        • select attestations
  • Corpus-based lexicon building: IMPACT Lexicon Tool
  • General vocabulary vs. Named entities
    • Tools for lexicon building described so far: applicable to general lexicon
    • Tools for NE recognition, classification and variant matching
      • library requirement
      • distinguish general vocabulary from NEs
      • avoid unpleasant mix-ups like Abimelech → apemelk! (b/p; i/e; e/0; k/ch)
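The Abimelech example can be checked mechanically: with only the four patterns listed on the slide, the name aligns completely with the common noun. The sketch below is a hypothetical illustration, not the IMPACT matcher. It reads each pattern as a (name side, noun side) pair with `""` standing for the slide's `0` (deletion), and `explained_by_patterns` is an invented helper name.

```python
from functools import lru_cache

# Patterns from the slide, written as (source side, target side); "" marks deletion.
PATTERNS = frozenset({("b", "p"), ("i", "e"), ("e", ""), ("ch", "k")})

def explained_by_patterns(source, target, patterns=PATTERNS):
    """True if source aligns with target using only identical characters
    and the given rewrite patterns."""
    @lru_cache(maxsize=None)
    def match(i, j):
        if i == len(source) and j == len(target):
            return True
        # Identical character on both sides.
        if i < len(source) and j < len(target) and source[i] == target[j]:
            if match(i + 1, j + 1):
                return True
        # One of the allowed patterns applies at this position.
        for s, t in patterns:
            if source.startswith(s, i) and target.startswith(t, j):
                if match(i + len(s), j + len(t)):
                    return True
        return False
    return match(0, 0)

mixup = explained_by_patterns("abimelech", "apemelk")
```

Since four ordinary spelling patterns are enough to turn the name into the noun, a variant matcher run over general vocabulary and names together would happily conflate the two, which is why the slide asks for named entities to be handled separately.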
  • Improvement of state of the art / innovation
    • We use existing computational linguistic approaches, but figure out how to apply them to historical language
    • We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together
      • Data selection and acquisition
      • Manual work
      • Computational linguistics tools
  • Languages in IMPACT
    • Dutch, German, English , Spanish, French
    • Polish, Czech, Slovene and Bulgarian
    • Cross language perspective paper
    • Parallel OCR and IR experiments
    • GT datasets
    • Language tools: language independent
    • Except for the 3 core languages: proof-of-concept lexica
  • OCR evaluation results (preliminary!)
  • 1. Czech
    • Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
    • Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
    • Homerowa Iliada, 1802
    • Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
    • Plody sborů učenců řeči českoslowanské prešporského, 1836
    • Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
    • Sokol, 1872
    • Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840
  • 2. Dutch
    • 18th and 19th century books, newspapers, parliamentary papers
    • Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
    • Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
    • Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
    • Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
  • Precision: 0.8432889410216431, Recall: 0.843331934927516
  • English
    • 16th-19th century material
    • Sources for lexicon building: OED, ECCO
  • French
    • 17th century books
    • Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
    • Dissertation de la philosophie en général, 1668
    • La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
    • Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
    • Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693
  • German
    • Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
    • Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
    • Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
    • Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
    • Quedlinburgisches Kreis-Tags-Memorial, 1673
    • Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
    • Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609
  • Polish
    • Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
    • Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
    • Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
    • Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
    • Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
    • Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
    • Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
    • Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
    • Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
    • Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
    • Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589
  • Slovene
    • Genovefa, 1841
    • Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850
    • Kmetijske in rokodelske novice, 1844
    • Kratkozhasne uganke, 1788
    • Kuharske Bukve, 1799
    • Marianske Kempensar, ali Dvoje bukuvze, 1769
    • Novice kmetijskih, rokodelnih in narodskih reči, 1851
    • Sgodbe svetiga pisma za mlade ljudi, 1830
    • Ta male katechismus, 1768
    • Vezhna pratika od gospodarstva, 1789
    • Zerkviza na skali, 1855
  • Retrieval demonstrator
    • Indexing and retrieval library (Java) implemented on top of the Lucene search engine
    • Lexicon in MySQL database
    • OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
    • Page XML output [in framework]
    • NE tagging
    • Indexing and retrieval while using lexicon and NE tagging