Bratislava WS - Depuydt - INL - lexicon building_pdf
Published in: Education, Technology

Transcript

  • 1. A gentle introduction to lexicon building and lexicon application
      Katrien Depuydt (Institute for Dutch Lexicology, Leiden)
      IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  • 2. Outline
      - What is a lexicon
      - Lexica in IMPACT
      - Lexicon building and lexicon application tools
      - Results so far with focus on Dutch
      IMPACT workshop, Bratislava, May 7, 2010
  • 3. What is a lexicon?
  • 4. Lexicon vs. electronic dictionary (1)
      An electronic dictionary:
      - is, of course, digitized full text (not images)
      - is primarily for human use
      - is ideally searchable, with explicitly (XML) tagged information: lemma, part of speech, meaning, quotations, etc.
      Example: the online Oxford English Dictionary
  • 5. Dictionary XML (example)
  • 6. Lexicon vs. electronic dictionary (2)
      A computational lexicon:
      - is, of course, in a structured digital format (XML, relational database)
      - is primarily for use in computer applications
      - has explicitly coded information (e.g. lemma, part of speech, morphology, semantics, syntax…)
      It is used, for instance, for:
      - linguistic annotation
      - enhanced retrieval (basic: inflected forms; advanced: synonyms etc.)
      - syntactic parsing, machine translation
  • 8. Lexica in IMPACT
  • 9. The OCR lexicon
      An OCR lexicon is:
      - a verified list of words in a language
      - based on a corpus, dated to enable relevant selection
      - preferably with frequency information
      - preferably from the same period and text type as the documents you want OCR’d (selection!)
  • 10. OCR lexicon example
      From WNT attestation lexicon    From DBNL historical corpus
      absoluut       8                wechgerukt      5
      absoluyt       2                wechgeschickt   6
      absoluyter     1                wechgeven       6
      absolveren     3                wech-gevoerde   11
      absolverende   1                wechgevoerde    14
      absorbeeren    1                wech-gevoert    59
      absorbeert     1                wechgevoert     98
      absorberen     1                wechgeworpen    21
      absorptie      3                wechghenomen    12
      absoute        2                wechghevoert    7
      abstineeren    1                wechginck       5
      abstinencie    1                wechloopen      6
      abstinentie    2                wechneemt       11
      abstineren     1                wechneme        6
      abstrackheyt   1                wech-nemen      20
      abstract       7                wechnemen       74
      abstracta      1                wechneminge     12
      abstracte      7                wech-neminge    6
      abstracten     4                wechrapen       6
      abstractheid   1                wechrucken      6
      abstractie     1                wechruiming     7
      abstractiën    1                wecht           7
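A word/frequency list of the kind shown on slide 10 can be derived from any plain-text corpus. The following is only a minimal sketch in Python, not the IMPACT tooling: the function name `build_frequency_list`, the tokenizing regular expression and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def build_frequency_list(text, min_freq=1):
    """Return alphabetically sorted (word form, frequency) pairs."""
    # Assumed tokenization: runs of letters, with internal hyphens or
    # apostrophes kept, so that "wech-gevoert" stays a single word form.
    tokens = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())
    counts = Counter(tokens)
    return sorted((word, n) for word, n in counts.items() if n >= min_freq)

# Invented sample text, for illustration only.
sample = "Hy is wechgevoert, wech-gevoert ende wechgevoert."
frequent_forms = build_frequency_list(sample, min_freq=2)
```

The `min_freq` threshold plays the role of a frequency cut-off when selecting which word forms to hand to an OCR engine.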
  • 11. The IR lexicon
      Main information categories of the IR lexicon: word forms (a list of words), plus
      - frequency information
      - quotations (dated sources) from corpora or electronic dictionaries
      - a MODERN LEMMA (cf. a dictionary entry) assigned to spelling variants and morphological variants of the same word
      The modern lemma forms are the main search keys for retrieval.
      This is standard practice in corpus linguistics and modern historical lexicography.
  • 12. <?xml version='1.0'?>
      <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
      <lexicon>
        <lexical_entry>
          <lemma_id>219490</lemma_id>
          <modern_lemma>aantuilen</modern_lemma>
          <gloss></gloss>
          <POS>VRB</POS>
          <ne_label></ne_label>
          <language_id></language_id>
          <portmanteau_lemma_id></portmanteau_lemma_id>
          <wordform>
            <form_representation>
              <wordform_id>850026</wordform_id>
              <written_form>tuyld</written_form>
              <attestation>
                <id>92141</id>
                <token_id></token_id>
                <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
                <derivation_id>0</derivation_id>
                <document_id>204</document_id>
                <start_pos>119</start_pos>
                <end_pos>124</end_pos>
              </attestation>
            </form_representation>
          </wordform>
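To illustrate how an application might read a lexicon entry like the one on slide 12, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The element names come from the slide; the `SAMPLE` string is an abbreviated copy in which the closing `</lexical_entry>` and `</lexicon>` tags are assumed (the slide fragment ends before them), and the function name `variants_by_lemma` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated entry from the slide; the two final closing tags are an
# assumption, added so that the fragment is well-formed.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""

def variants_by_lemma(xml_text):
    """Map (modern lemma, part of speech) to the written forms listed for it."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("lexical_entry"):
        key = (entry.findtext("modern_lemma"), entry.findtext("POS"))
        forms = [form.text for form in entry.iter("written_form")]
        index.setdefault(key, []).extend(forms)
    return index

lemma_index = variants_by_lemma(SAMPLE)
```

Keying on the modern lemma reflects the slide's point that modern lemma forms are the main search keys for retrieval.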
  • 13. How to build and apply these lexica?
  • 14. Lexicon building
      - Build a lexicon with the aims of:
        - benefiting OCR and OCR postcorrection
        - improving retrieval, by building a lexicon of variants with the modern lemma as the main entry key
      - Tools for lexicon building
      - Tools for using the lexicon (lexicon deployment) for enrichment
      - Lexicon cookbook
      - Best practice and tools for using lexica in OCR
      Note: no lexicon will ever contain all variants found in historical text.
  • 15. Types of variation (orthographical and other)
      I. uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
         (most of these can be dealt with by means of patterns)
      II. werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
         (some of these can be dealt with by patterns and/or fuzzy matching, others can only be handled by explicit listing)
  • 16. The “hypothetical” vs. the witnessed lexicon (1)
      Mechanisms:
      - to extend the lexicon
      - to assess the plausibility of “hypothetical” words without previous attestations, i.e. words we have not seen before
  • 17. The “hypothetical” vs. the witnessed lexicon (2)
      - Unknown inflected forms of registered lemmata: automatic expansion from the lemma to the full paradigm of word forms (paradigmatic expansion, or reverse lemmatization)
      - New spellings of known words can be dealt with by developing a good model of historical spelling. (The database structure provides for the storage of orthographic variant patterns.)
      - Previously unseen compounds can be dealt with by means of a good model of word formation (work scheduled for 2010)
  • 18. Virtual lexicon of generated word forms
      [Diagram labels: Witnessed Modern Word; Hypothetical Modern Word; Transformation Patterns; Historical Variant 1; Historical Variant 2]
  • 19. What is needed for lexicon building
      - Build models of linguistic variation (inflection, orthography)
      - Collect variants
      Approach:
      - Cycle: the model helps to construct the lexicon, and vice versa (induction of rules/patterns)
      - Combination of manual work and computational linguistics
      - Lexicon building toolkit to support development, containing both computational linguistic tools and tools supporting manual work
  • 20. Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt
  • 21. Spelling variation tools (pattern-based)
      Language-independent approach:
      - Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …
      - Pattern weights are computed from example material
      Additional approaches possible:
      - Use of aligned data (parallel historical text and modern version)
      - Unsupervised pattern weighting (=~ text profiling from TR5)
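The idea of weighted rewrite patterns can be sketched in a few lines of Python. This is an illustrative toy, not the IMPACT spelling variation module: the pattern weights in `PATTERNS` are invented (the slides say real weights are computed from example material), and scoring a variant by the product of the applied pattern weights is an assumption.

```python
# Invented weights for illustration; aa/ae and s/z are named on the slide,
# the other patterns are suggested by the uiterlijk/uyterlick example.
PATTERNS = {
    ("ui", "uy"): 0.6,
    ("ij", "i"): 0.4,
    ("k", "ck"): 0.5,
    ("aa", "ae"): 0.7,
    ("s", "z"): 0.3,
}

def expand(word, patterns=PATTERNS, max_edits=3):
    """Generate hypothetical historical variants of a modern word.

    Returns {variant: score}; the score is the product of the weights of
    the patterns applied (an assumed scoring scheme).
    """
    results = {word: 1.0}
    frontier = {word: 1.0}
    for _ in range(max_edits):
        next_frontier = {}
        for form, score in frontier.items():
            for (src, dst), weight in patterns.items():
                start = form.find(src)
                while start != -1:
                    variant = form[:start] + dst + form[start + len(src):]
                    new_score = score * weight
                    # Keep only the best-scoring derivation of each variant.
                    if new_score > results.get(variant, 0.0):
                        results[variant] = new_score
                        next_frontier[variant] = new_score
                    start = form.find(src, start + 1)
        frontier = next_frontier
    return results

variants = expand("uiterlijk")
```

A generator like this overproduces (for instance, it will happily apply k/ck twice), which is one reason the slides pair it with attestation in real text.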
  • 22. Lemmatization
      - Reduction of historical word forms to a modern lemma
      - Historical word → (1: pattern matching) standard (“modern”) spelling → (2: lemmatizer) lemma form
        Example: Dystels → (1) distels → (2) distel
      - When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.
      But:
      1) We will not have full form information for many lemmata (especially the historical ones)
      2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
  • 23. Lemmatization and reverse lemmatization
      We also need a lemmatization process for these situations.
      - A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form.
      But:
      - Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
      - We adopt the solution of actually constructing the “hypothetical modern full form lexicon”, containing the most plausible paradigmatic expansions of lemmata
      - This construction is carried out by means of a statistical reverse lemmatizer
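A very small illustration of reverse lemmatization in Python: suffix rewrite rules are counted from known (lemma, inflected form) pairs and then applied to a new lemma, ranked by how often each rule was seen. This is a toy sketch under stated assumptions, not the statistical reverse lemmatizer used in IMPACT; the training pairs, the one-character context and the function names `train` and `expand_lemma` are all hypothetical.

```python
from collections import Counter

# Invented training pairs (lemma, inflected form).
PAIRS = [
    ("bok", "bokken"),
    ("rok", "rokken"),
    ("distel", "distels"),
    ("lepel", "lepels"),
    ("boek", "boeken"),
]

def train(pairs):
    """Count suffix rewrite rules, keeping one character of stem as context."""
    rules = Counter()
    for lemma, form in pairs:
        i = 0
        while i < min(len(lemma), len(form)) and lemma[i] == form[i]:
            i += 1
        i = max(i - 1, 0)  # back off one character to keep some context
        rules[(lemma[i:], form[i:])] += 1
    return rules

def expand_lemma(lemma, rules, top_n=3):
    """Return hypothetical inflected forms, most frequently seen rule first."""
    candidates = Counter()
    for (old, new), count in rules.items():
        if lemma.endswith(old):
            candidates[lemma[: len(lemma) - len(old)] + new] += count
    return [form for form, _ in candidates.most_common(top_n)]

rules = train(PAIRS)
hypothetical_forms = expand_lemma("stok", rules)
```

The generated forms are hypothetical in the sense of the slides: they extend the full form lexicon so that the second lemmatization step can remain a lookup, and they still need attestation.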
  • 24. Attestation
      - From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
      - Automatic selection of candidate attestations
      - Manual work: verification and correction
      Two approaches:
      - Dictionary based (INL): Woordenboek der Nederlandsche Taal
      - Corpus based (LMU, INL): Dutch DBNL corpus
  • 25. IMPACT Dictionary Attestation Tool
      Lexicon building at work: verifying attestations in historical dictionaries
      Task: find the variants of a headword as they occur in the quotations
      Headword: work
      Quotations:
      - We are working on what works.
      - Depart from me, ye that worke iniquity.
      - She worcketh knittinge of stockings.
      Variants: the forms of the headword occurring in these quotations (working, works, worke, worcketh)
  • 26. IMPACT Dictionary Attestation Tool
      Task: find the variants of a headword as they occur in the quotations
      [Diagram labels: Electronic historical dictionary; Database with lemmata and quotations]
      Automatically (preprocessing):
      - match literally, e.g. work → work, Work
      - match using existing lexica and lists, e.g. work → works, worked, wrought
      - approximate matching, e.g. work → worke
      By hand (using the tool):
      - correct automatic mismatches, e.g. works → words, worms
      - find missed matches, e.g. work → worketh, wrowght
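The three automatic matching steps can be sketched as follows. This Python example only illustrates the idea and is not the Attestation Tool's actual preprocessing: the `KNOWN_FORMS` table, the similarity threshold of 0.8 and the use of `difflib.SequenceMatcher` for approximate matching are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-in for "existing lexica and lists".
KNOWN_FORMS = {"work": {"works", "worked", "wrought", "working"}}

def find_attestations(headword, quotation, threshold=0.8):
    """Return (token, match type) pairs for candidate variants of headword."""
    hits = []
    for token in re.findall(r"[^\W\d_]+", quotation):
        low = token.lower()
        if low == headword:
            hits.append((token, "literal"))
        elif low in KNOWN_FORMS.get(headword, ()):
            hits.append((token, "lexicon"))
        elif SequenceMatcher(None, headword, low).ratio() >= threshold:
            hits.append((token, "approximate"))
    return hits

quotations = [
    "We are working on what works.",
    "Depart from me, ye that worke iniquity.",
    "She worcketh knittinge of stockings.",
]
candidates = [find_attestations("work", q) for q in quotations]
```

With this threshold, worcketh in the third quotation is not found automatically, which is the kind of missed match the slides leave to manual work in the tool.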
  • 27. IMPACT Attestation Tool
      [Screenshot annotations: up-to-date overview of what is done and needs to be done; done by this user so far; lemma (headword); quotations, sorted by uncertainty]
  • 28. IMPACT Lexicon Tool
      Task: find and verify attestations in a historical corpus
      Automatically (preprocessing = apply lemmatizer):
      - match literally, e.g. work → work, Work
      - match using existing lexica and lists, e.g. work → works, worked, wrought
      - match using the spelling variation module, e.g. uiterlijk → uyterlick
      By hand (using the tool):
      - assign the correct lemma, e.g. was (N) → zijn (V)
      - group tokens belonging together, e.g. konings zoon → koningszoon
      - select attestations
  • 29. Corpus-based lexicon building: IMPACT Lexicon Tool
  • 30. General vocabulary vs. named entities
      - Tools for lexicon building described so far: applicable to the general lexicon
      - Tools for NE recognition, classification and variant matching:
        - library requirement
        - distinguish general vocabulary from NEs
        - avoid unpleasant mix-ups like Abimelech → apemelk! (b/p; i/e; e/0; k/ch)
  • 31. Improvement of state of the art / innovation
      - We use existing computational linguistic approaches, but figure out how to apply them to historical language
      - We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together:
        - Data selection and acquisition
        - Manual work
        - Computational linguistics tools
  • 32. Some results so far with focus on Dutch
  • 33. Measuring results for Dutch
      We use the ground truth data developed in the project.
      - Evaluation of EE tools
      - Evaluation of lexicon coverage
      - Evaluation of lexicon usage in IR (2010)
      - Evaluation of OCR and lexicon usage in OCR (2010)
      - Evaluation of the benefit of lexicon building for OCR (for which type of material / quality of OCR does this make sense) (2010-11)
  • 34. Dutch ground truth data
      Type and genre                   # words
      Gold Standard Book               300k
      Random Set Book                  340k
      Random Set Staten Generaal       2.5M
      Gold Standard Staten Generaal    500k
      Gold Standard Newspapers 1       3.4M
      Gold Standard Newspapers 2       170k
      Random Set Newspapers            3.2M
      total                            13.1M
  • 35. Efficiency of lexicon building
      Dictionary-based lexicon building using a historical dictionary: Woordenboek der Nederlandsche Taal
      - Lemmata: 220211, quotations: 1524366
      - Tempo: 1725 quotations/hour; 231 lemmata/hour
  • 36. Reverse lemmatization
      - Reminder: build hypothetical (non-attested) word forms in a “quick and dirty” way to use in lemmatization and corpus-based lexicon building
      - Using simple statistical algorithms and a simple approach to inflection
      Results:
      Lexicon                               Accuracy
      Small Dutch lexicon (JVKlex)          96.6%
      French lexicon (Morphalou)            99.4%
      Polish lexicon, verbs (Morfologik)    98.7%
  • 37. Lexicon coverage (1: ground truth books)
      Lexicon                                                          Type coverage   Token coverage
      Modern lexicon (e-Lex)                                           46%             76%
      EE3.3                                                            56%             84%
      1+2 (e-Lex and EE3.3 combined)                                   63%             89%
      Type frequency list, historical corpus, top 200K (freq >= 19)    70%             93%
      Type frequency list, historical corpus, top 500K (freq >= 5)     78%             95%
  • 38. Lexicon coverage (2: ground truth newspapers, 18th-19th c.)
      Lexicon                                             Type coverage   Token coverage
      Modern lexicon (e-Lex)                              40%             83%
      EE3.3                                               41%             84%
      1+2 (e-Lex and EE3.3 combined)                      51%             89%
      Type frequency list, historical corpus, top 200K    52%             93%
      Type frequency list, historical corpus, top 500K    62%             95%
  • 39. Lexicon coverage (3: ground truth Parl. Papers, 19th c.)
      Lexicon                                        Type coverage   Token coverage
      Modern lexicon (e-Lex)                         51%             89%
      EE3.3                                          47%             88%
      1+2 (e-Lex and EE3.3 combined)                 58%             93%
      Type frequency, historical corpus, top 200K    59%             96%
      Type frequency, historical corpus, top 500K    68%             97%
  • 40. Lexicon coverage (4: ground truth Parl. Papers, 20th c.)
      Lexicon                                        Type coverage   Token coverage
      Modern lexicon (e-Lex)                         70%             93%
      EE3.3                                          66%             93%
      1+2 (e-Lex and EE3.3 combined)                 76%             96%
      Type frequency, historical corpus, top 200K    74%             97%
      Type frequency, historical corpus, top 500K    81%             98%
  • 41. Lexicon coverage (5: Genesis, 1637 bible)
      Lexicon                                        Type coverage   Token coverage
      Modern lexicon (e-Lex)                         31%             61%
      EE3.3                                          62%             83%
      1+2 (e-Lex and EE3.3 combined)                 65%             89%
      Type frequency, historical corpus, top 200K    76%             97%
      Type frequency, historical corpus, top 500K    87%             98.6%
  • 42. Lexicon coverage (6: Hooft, historiën)
      Lexicon                                        Type coverage   Token coverage
      Modern lexicon (e-Lex)                         26%             67%
      EE3.3                                          47%             88%
      1+2 (e-Lex and EE3.3 combined)                 50%             90%
      Type frequency, historical corpus, top 200K    44%             93%
      Type frequency, historical corpus, top 500K    58%             96%
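The two measures reported in the coverage slides can be stated precisely in code. Type coverage is the share of distinct word forms found in the lexicon; token coverage weights each form by how often it occurs. The Python sketch below is a minimal illustration; the function name `coverage`, the lowercasing and the toy data are assumptions, not the project's evaluation scripts.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon on a token list."""
    counts = Counter(token.lower() for token in tokens)
    if not counts:
        return 0.0, 0.0
    covered = [word for word in counts if word in lexicon]
    type_coverage = len(covered) / len(counts)
    token_coverage = sum(counts[word] for word in covered) / sum(counts.values())
    return type_coverage, token_coverage

# Toy data, invented for illustration.
tokens = "de werelt is de wereld ende de weerelt".split()
lexicon = {"de", "is", "wereld"}
type_cov, token_cov = coverage(tokens, lexicon)
```

Because frequent words are more likely to be in a lexicon, token coverage is typically higher than type coverage, as in every table above.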
  • 43. Conclusion from this evaluation
      - The evident next step for Dutch lexicon building is corpus-based work
      - First target: cover the top 200000 from the historical corpus
        - Contains 97885 types not in the witnessed historical EE3.3 lexicon
        - Roughly 24% of these are covered by the modern lexicon
        - Roughly 25% are names
        - This leaves about 45000 common words to look into
  • 44. Measuring effect of lexicon use in IR
      - Example of improved recall: in a historical corpus of about 150 million tokens, searching for wereld using only the modern lexicon yields 23396 hits; using the current EE3.3 lexicon we get 34339 hits
      - Simple IR will be part of the demonstrators
      - Hard to evaluate IR results proper without special datasets
      - Up to now we have measured either lemmatization accuracy or modern-to-historical word form matching accuracy
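The effect described on slide 44 comes from expanding a modern query term to the historical variants listed under its lemma. A minimal Python sketch of such query expansion follows; the `LEXICON` excerpt, the sample documents and the function name `count_hits` are hypothetical, and whitespace tokenization is a simplification.

```python
# Hypothetical lexicon excerpt: modern lemma -> attested variants.
LEXICON = {"wereld": {"werelt", "weerelt", "waereld"}}

def count_hits(lemma, documents, lexicon=None):
    """Count tokens matching the lemma, optionally expanded with its variants."""
    forms = {lemma}
    if lexicon is not None:
        forms |= lexicon.get(lemma, set())
    return sum(tok in forms for doc in documents for tok in doc.lower().split())

# Invented sample documents.
DOCS = ["de werelt is groot", "de wereld draait", "waereld ende weerelt"]
plain_hits = count_hits("wereld", DOCS)
expanded_hits = count_hits("wereld", DOCS, LEXICON)
```

For the figures on the slide, 34339 hits against 23396 is roughly 47% more hits for the same query.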
  • 45. Lemmatization
      - Combination of lookup, matching of spelling variation, and reverse lemmatization
      - As yet no good evaluation set for IMPACT (current work)
      - Evaluation on “type” level
      We will use other material here (1637 Genesis, 97144 tokens).
      Approach:
      - Restrict to “ordinary words” (no names, numbers, clitic combinations)
      - Ambiguous lemmatization (context is not used) (avg. 5 suggestions per word)
      - Ranking based on frequency and pattern weights
  • 46. Result
      - 6265 distinct types; 5991 (95.7%) had at least one correct suggestion
      - Average rank of correct suggestions: 1.23
        - 5222 types found in current EE3.3 (83%)
        - 65 additional types in modern lexicon
        - 49 types without any match
        - 969 types (15%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions
  • 47. Real and hypothetical lexicon coverage (Hooft, historiën)
      - Result (again restricting to ‘ordinary’ words)
      - 36332 distinct types; average rank of correct suggestions: 1.23
        - 20087 types found in current EE3.3 (55%)
        - 1061 additional types in modern lexicon
        - 2411 types without any match (7%)
        - 12773 types (35%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions
      (Probably about 75% of the highest-ranking approximate matches are correct)
  • 48. Evaluation of TR results
      Using Finereader SDK (version 9):
      - External dictionary interface for experimentation
      - Not completely straightforward how to apply this:
        - Translation of corpus frequencies to weights on a scale 0-100
        - Other details: hyphenated words, case-sensitivity, …
        - Workaround to circumvent the long s problem
      Lexicon data used:
      - Corpus-based type-frequency list
      - EE3.2 deliverable lexicon
      - Finereader internal lexicon
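The slides do not say how corpus frequencies were translated to weights on the 0-100 scale. Purely as an illustration of one plausible choice, the Python sketch below uses logarithmic scaling relative to the most frequent word; the formula and the function name `frequency_to_weight` are assumptions, and nothing here reflects the FineReader SDK interface itself.

```python
import math

def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight between 0 and 100.

    Assumed scheme: logarithmic scaling, so that the long tail of rare
    words is not squeezed into the lowest few weight values.
    """
    if freq <= 0 or max_freq <= 0:
        return 0
    weight = round(100 * math.log(1 + freq) / math.log(1 + max_freq))
    return min(100, weight)

# Frequencies taken from the DBNL column of the OCR lexicon example.
freqs = {"wechgevoert": 98, "wech-gevoert": 59, "wechginck": 5}
max_freq = max(freqs.values())
weights = {word: frequency_to_weight(n, max_freq) for word, n in freqs.items()}
```

A linear mapping would be simpler, but with corpus frequencies it would assign nearly all word forms a weight close to zero.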
  • 49. OCR evaluation
      1. Character accuracy
      2. Word accuracy
      3. In case of block alignment problems, a simple alternative is bag-of-words accuracy
      Measures 1 and 2 presuppose a good alignment of OCR with ground truth.
      - We will use word accuracy, or the simpler alternative 3 when there are alignment problems
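The difference between word accuracy and bag-of-words accuracy can be made concrete with a short Python sketch. It is illustrative only: aligning tokens with `difflib.SequenceMatcher` is an assumed stand-in for a proper OCR/ground-truth alignment, and the function names are hypothetical. The sample tokens are taken from the long s example in this presentation.

```python
from collections import Counter
from difflib import SequenceMatcher

def word_accuracy(ocr_tokens, gt_tokens):
    """Share of ground-truth words matched in order (depends on alignment)."""
    matcher = SequenceMatcher(None, gt_tokens, ocr_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt_tokens)

def bag_of_words_accuracy(ocr_tokens, gt_tokens):
    """Share of ground-truth words found in the OCR output, ignoring order."""
    overlap = Counter(gt_tokens) & Counter(ocr_tokens)
    return sum(overlap.values()) / len(gt_tokens)

ground_truth = "De eerste was de gevaarlykste".split()
ocr_output = "De eerde was de gevaarlykflti".split()
# The same OCR words in a different block order.
reordered = ["was", "de", "De", "eerde", "gevaarlykflti"]
```

On `ocr_output` both measures agree; on `reordered` the order-sensitive measure drops while the bag-of-words measure does not, which is why the latter is the fallback when block alignment fails.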
  • 50. OCR results
      Lexicon configurations:
      (a) ABBYY internal Dutch lexicon
      (b) combination of corpus-based historical lexicon and EE3.2 deliverable
      (c) combination of corpus-based historical lexicon and EE3.2 deliverable, improved deployment (case insensitive, taking hyphenation into account)
      Dataset                                                          (a)     (b)     (c)
      DPO35 (word accuracy)                                            88.8%   90.9%   94.4%
      Parliamentary papers, 1826-27 selection (bag of words recall)    90.9%   94.9%   94.9%
  • 51. ‘The Book’
      “Kort begrip der waereld-historie voor de jeugd”, J.F. Martinet, minister at Zutphen, from 1789.
      Why this book?
      - Representative font and amount of spelling variation etc. for late 18th century Dutch
      - It has the “long s problem”: [word image] = stilste, not ftilfte
  • 52. The long s problem: an example
      OCR at start of project:
        A. De eerde was de gevaarlykflti om de verlei¬
        ding aan 't Hof; de tweede de ftillie en veiligde;
        de derde de zwaarde, daar hy byna drie millioenen
        harde en onbefchaafde Menfchen beftieren moest.
      Results April 2010:
        A. De eerste was de gevaarlykste om de verlei-
        ding aan 't Hof; de tweede de stilste en veiligste;
        de derde de zwaarste, daar hy byna drie millioenen
        harde en onbeschaafde Menschen bestieren moest.
      Workaround, “integrated postcorrection”: tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon. In this way we keep it from turning into “eerde”.
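The postcorrection half of this workaround can be sketched in Python: every f in an OCR token is treated as a possible misread long s, and a candidate is accepted only when the lexicon confirms it unambiguously. This is a simplified illustration, not the IMPACT implementation; the function names, the tiny lexicon and the rule of leaving ambiguous tokens untouched are assumptions.

```python
from itertools import product

def long_s_candidates(token):
    """All readings of token in which any subset of 'f' is replaced by 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    return {"".join(chars) for chars in product(*options)}

def postcorrect(token, lexicon):
    """Replace a token by its lexicon-confirmed long s reading, if unique."""
    if token in lexicon:
        return token
    matches = sorted(c for c in long_s_candidates(token) if c in lexicon)
    # Assumed policy: correct only when exactly one reading is in the lexicon.
    return matches[0] if len(matches) == 1 else token

# Tiny illustrative lexicon (lowercase forms only).
lexicon = {"eerste", "stilste", "onbeschaafde", "bestieren", "hof"}
corrected = [postcorrect(t, lexicon) for t in ["eerfte", "ftilfte", "onbefchaafde", "hof"]]
```

The number of candidates doubles with every f in the token, which is acceptable for single words but would need pruning for longer strings.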
  • 53. Future work
      - Compound analysis
      - Irregular historical white space use (“impacttok++”) (cf. attestations)
      - Corpus-based lexicon extension
      - Testing and optimization with ground truth data
      - Improve the TR lexicon by extending the IR lexicon and removing false friends from the DBNL-corpus-based TR lexicon
      - Continue work on the best way to deploy lexica in OCR, with help from ABBYY
