IMPACT Final Conference - Jesse de Does
 


Evaluation of lexicon-supported OCR and information retrieval with Jesse de Does from the INL


Usage Rights

CC Attribution-NoDerivs License

    Presentation Transcript

    • IMPACT Lexica in OCR and IR Evaluation for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovenian, Spanish Jesse de Does
    • Contents
      • OCR evaluation
        • Use of lexica in OCR
        • Evaluation Method
        • (non-final) Results
      • IR evaluation
        • Use of lexica in IR
        • Evaluation Method
        • (Very preliminary) results
    • Use of lexica in OCR
      • ! This is not about postcorrection, but about what happens during OCR
      • Using “Finereader Engine External Dictionary Interface” Functionality:
      • Any procedure that prunes a set of candidates and assigns weights can be implemented in this way
      • Such a procedure need not be limited to the use of static word lists
      • Permits dynamic implementations (spelling variation rules, morphology, …)
    • Finereader SDK external dictionaries
      • SDK users have to implement a COM interface to prune a set of “Fuzzy Words”
      • Fuzzy Word: set of character recognition candidates for each position in a word
      • External dictionary prunes this to the linguistically possible ones (in this case: {eerste, eerde})
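The pruning step described on this slide can be sketched in Python. This is only an illustration of the idea, not the Finereader COM interface itself: the function name `prune_fuzzy_word`, the list-of-candidates representation and the toy lexicon are all assumptions made for the example.

```python
from itertools import product


def prune_fuzzy_word(fuzzy_word, lexicon):
    """Return the readings of a fuzzy word that occur in the lexicon.

    fuzzy_word: one list of candidate strings per position
                (hypothetical representation, chosen for this sketch).
    lexicon:    set of linguistically possible word forms.
    """
    # Enumerate every combination of per-position candidates.
    readings = {"".join(parts) for parts in product(*fuzzy_word)}
    # Keep only the readings the lexicon accepts.
    return sorted(readings & lexicon)


# Hypothetical fuzzy word: the recogniser is unsure between e/c and d/st.
fuzzy = [["e", "c"], ["e", "c"], ["r"], ["d", "st"], ["e", "c"]]
lexicon = {"eerste", "eerde", "eerst"}

print(prune_fuzzy_word(fuzzy, lexicon))  # ['eerde', 'eerste']
```

A dynamic implementation would replace the set lookup with any predicate, for example one that also accepts spelling variants or compounds.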
    • Finereader SDK external dictionaries
      • Of course a lot of things may go wrong in this simple scenario
      • Lexicon may be too small (you will never have all spelling variations, compounds, …)
      • Lexicon may include typical OCR errors (eu, cn, ….)
      • ! The Fuzzy word may be too restricted (or of course too comprehensive)
    • OCR Evaluation
      • Comparison of:
        • Finereader SDK 10 with the default included dictionary
        • Finereader SDK 10 with both the default dictionary AND a historical lexicon
      • Main performance indicator: word recall: after alignment, how many of the words in the ground truth have a (case-insensitive) match in the OCR. Errors on punctuation not penalized.
      • Specific evaluation tool (only word accuracy)
        • Workaround for region segmentation problems
        • Display specific information about dictionary coverage, information about performance on dictionary words, false friends ….
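A minimal sketch of the word recall measure, assuming a simple token-level alignment with Python's `difflib`. The actual IMPACT evaluation tool is not shown in the slides, so the tokenisation and the alignment method here are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    # Lowercase for case-insensitive matching; \w+ drops punctuation.
    return re.findall(r"\w+", text.lower())


def word_recall(ground_truth, ocr):
    """Fraction of ground-truth words that have a match in the aligned OCR."""
    gt, hyp = tokenize(ground_truth), tokenize(ocr)
    if not gt:
        return 0.0
    # Align the two token sequences and count the tokens in matching blocks.
    matcher = SequenceMatcher(None, gt, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt)


# One of five ground-truth words ("eerste") is misrecognised.
print(word_recall("De eerste dag, in Zwolle.", "de eerfte dag in Zwolle"))  # 0.8
```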
    • Dictionary “cleaning”
      • Dictionary hallucinations:
      • Many in-dictionary errors (“false friends”)
      • Many errors on short words
      • Dictionary cleaning procedures:
      • Remove false friends (words related by frequent OCR substitution to much more frequent words)
      • Remove infrequent short words (even if correct)
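The two cleaning procedures could look roughly like this. The thresholds (`ratio`, `min_len`, `min_freq`), the function names and the toy frequency list are hypothetical; the slides do not state the values actually used.

```python
def is_false_friend(word, count, freq, confusions, ratio):
    """True if one frequent OCR substitution turns `word` into a much more frequent word."""
    for ocr, gt in confusions:
        start = word.find(ocr)
        while start != -1:
            neighbour = word[:start] + gt + word[start + len(ocr):]
            if freq.get(neighbour, 0) >= ratio * count:
                return True
            start = word.find(ocr, start + 1)
    return False


def clean_dictionary(freq, confusions, ratio=100, min_len=3, min_freq=5):
    """freq: word -> frequency; confusions: (OCR, GT) substitution pairs."""
    kept = set()
    for word, count in freq.items():
        if len(word) < min_len and count < min_freq:
            continue  # infrequent short word: dropped even if correct
        if is_false_friend(word, count, freq, confusions, ratio):
            continue  # probably an OCR error for a much more frequent word
        kept.add(word)
    return kept


# Toy data: "ecn" is one c->e substitution away from the frequent "een".
freq = {"een": 5000, "ecn": 3, "de": 9000, "cn": 2, "en": 8000, "huis": 40}
print(sorted(clean_dictionary(freq, [("c", "e")])))  # ['de', 'een', 'en', 'huis']
```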
    • Evaluation procedure (data)
      • IMPACT demonstrator sets (size between ~1200-8000 pages)
      • Split:
        • Development
        • Evaluation
        • Demonstration
      • OCR evaluation sets: random choice of about 200 pages from evaluation portion
        • Manageable size (one experiment takes between 30 min and 1 h 30 min)
    • Bulgarian
      • Bylgarska iliustracia, 1880 -
      • Jenski glas, 1889 -
      • Sborniche za spomen na 25-godishninata ot smyrtta na Levski, 1898 -
      • Spisanie Dennica, 1890 -
      • Ugozapadna Bulgaria, 1893 -
      • Zelokupna Bulgaria, 1880 -
    • Character confusion frequencies (OCR->GT), first table
      Freq.  OCR->GT
      114    д->л
      218    ш->н і
      242    и->п
      247    г->т
      256    ъ->ь
      270    ь->ъ
      356    и->н
      378    п->н
      441    е->с
      579    н->п
      732    н->и
      825    ж->ѫ
      968    п->и
    • Character confusion frequencies (OCR->GT), second table
      Freq.  OCR->GT
      165    е->ѣ
      185    г->т
      200    е->с
      205    и->п
      217    ъ->ь
      220    ш->н і
      249    и->н
      283    ж->ѫ
      330    ь->ъ
      354    п->н
      463    н->п
      599    н->и
      733    п->и
    • Czech
      • Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
      • Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
      • Homerowa Iliada, 1802
      • Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
      • Plody sborů učenců řeči českoslowanské prešporského, 1836
      • Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
      • Sokol, 1872
      • Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840
    • Dutch
      • 18th and 19th century books, newspapers, parliamentary papers
      • …… ..
      • Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
      • Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
      • Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
      • Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795
    • English
      • Standard Finereader language: OldEnglish
      • 15th-19th century material
      • 2 sets:
        • One general set, 15th-19th century
        • One 17th century-specific set
    • General set with various choices of dictionary – no improvement!
    • More distinct improvement on 17th century set with special dictionary compiled from OED quotations dated 1580-1720
    • French
      • Standard Finereader language: OldFrench
      • 17th century books
      • Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
      • Dissertation de la philosophie en général, 1668
      • La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
      • Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
      • Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693
    • German
      • Standard Finereader language: OldGerman
      • Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
      • Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
      • Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
      • Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
      • Quedlinburgisches Kreis-Tags-Memorial, 1673
      • Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
      • Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609
    • Polish
      • Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
      • Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
      • Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
      • Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
      • Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
      • Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
      • Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
      • Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
      • Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
      • Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
      • Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589
    • Slovene
      • Genovefa, 1841
      • Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850
      • Kmetijske in rokodelske novice, 1844
      • Kratkozhasne uganke, 1788
      • Kuharske Bukve, 1799
      • Marianske Kempensar, ali Dvoje bukuvze, 1769
      • Novice kmetijskih, rokodelnih in narodskih reči, 1851
      • Sgodbe svetiga pisma za mlade ljudi, 1830
      • Ta male katechismus, 1768
      • Vezhna pratika od gospodarstva, 1789
      • Zerkviza na skali, 1855
    • Spanish
      • Carta athenagorica, 1690
      • Commentarios reales, 1609
      • El Parnasso español, 1648
      • Obras de Garcilasso de la Vega con las anotaciones por el Mtro. Francisco Sánchez Brocense, 1612
      • Obras de Lope de Vega, 1604
      • Vida de Lazarillo de Tormes, 1652
    • Results
    • Summary
    • Evaluation of “IR”
      • Main question:
      • Are we able to retrieve historical variants of words?
      • Practical evaluation criterion:
      • Measure accuracy of modern lemma assignment
      • ( If we can do this, good retrieval is possible)
      • More complete evaluation to follow soon – all partners are finishing the work
    • Evaluation method
      • Each language partner annotates ~10,000 tokens of Ground Truth with modern lemma and/or equivalent word form
      • We measure performance of:
        • Lemmatization with a modern lexicon
        • Lemmatization with a modern lexicon and spelling variation patterns
        • Lemmatization with a historical lexicon, a modern lexicon and spelling variation patterns
        • No context information is used
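The three configurations can be illustrated with a small context-free lemmatizer. This is a sketch under assumptions: the lexicon format (word form mapped to a list of lemmata), the single-substitution pattern expansion and the ranking order (historical exact, then modern exact, then pattern matches) are all chosen for the example and are not taken from the IMPACT tools.

```python
def expand(word, patterns):
    """Yield candidates obtained by applying one historical->modern rewrite at one position."""
    for hist, modern in patterns:
        start = word.find(hist)
        while start != -1:
            yield word[:start] + modern + word[start + len(hist):]
            start = word.find(hist, start + 1)


def lemmatize(token, modern_lexicon, patterns=(), historical_lexicon=None):
    """Return lemma suggestions for one token, best first, using no context."""
    word = token.lower()
    suggestions = []

    def add(lemmas):
        for lemma in lemmas:
            if lemma not in suggestions:
                suggestions.append(lemma)

    if historical_lexicon:
        add(historical_lexicon.get(word, []))   # historical exact match
    add(modern_lexicon.get(word, []))           # modern exact match
    for candidate in expand(word, patterns):    # modern match via patterns
        add(modern_lexicon.get(candidate, []))
    return suggestions


# Hypothetical Dutch toy data.
modern = {"mens": ["mens"], "zeggen": ["zeggen"]}
patterns = [("sch", "s"), ("s", "z")]
historical = {"seyde": ["zeggen"]}

print(lemmatize("mensch", modern, patterns))             # ['mens']
print(lemmatize("seyde", modern, patterns))              # []
print(lemmatize("seyde", modern, patterns, historical))  # ['zeggen']
```

Recall is then the share of annotated tokens whose gold lemma appears anywhere in the suggestion list, and the average rank is the mean position of that lemma.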
    • English
      • Using OED IR lexicon and very restricted set of spelling variation patterns
      • Considered tokens: 9409.
      • 8994 had a correct lemma (recall 0.956)
      • Total correct suggestions: 8994
      • Average rank of correct lemma: 1.086280
      • Total possible lemmata: 23859
      • None match at all : 265
      • Matched With Patterns : 1330
      • Exact Match : 7814
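The recall figure follows directly from the counts on this slide. The short check below recomputes it. Reading the three match categories as a partition of the considered tokens is an interpretation, supported only by the fact that they sum to the token count.

```python
considered = 9409   # tokens considered
correct = 8994      # tokens with a correct lemma among the suggestions
exact, with_patterns, no_match = 7814, 1330, 265

recall = correct / considered
print(round(recall, 3))  # 0.956

# The three match categories add up to the number of considered tokens.
print(exact + with_patterns + no_match == considered)  # True
```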
    • Spanish
      • Using Apertium modern Spanish lexicon, IMPACT historical Spanish IR lexicon and spelling variation patterns
      • 9298 tokens considered
      • With only modern lexicon and patterns:
        • 7473 with at least one correct lemma and 1825 without (recall 0.80)
        • Average rank of correct lemma: 1.1
        • Total suggestions: 9699
        • No match at all: 991
        • Modern Exact: 7471; Modern With Patterns: 836
      • With historical lexicon, modern lexicon and patterns:
        • 8864 with at least one correct lemma and 434 without (recall 0.926)
        • Average rank of correct lemma: 1.16
        • Total suggestions: 12417
        • Historical Lexicon Exact match: 8265
        • Modern Exact: 305
        • Modern With Patterns: 186
        • No match at all: 542