BL Demo Day - July2011 - (6) Language Tools for IMPACT

Katrien Depuydt presents on the IMPACT language tools for the BL IMPACT Demo Day on the 12th of July 2011

Usage Rights

CC Attribution-NonCommercial-NoDerivs License

  • This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  • This is what an XML-based electronic dictionary looks like.
  • This is the XML of the Oxford English dictionary. The horizontal lines mark a place where part of the structure has been folded in.
  • ‘Lemma’, ‘part of speech’ and ‘morphology’ need further explanation. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivations, to show which parts can be distinguished in a word, e.g. apple pie: apple + pie.
  • This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  • Again, ‘lemma’ needs explaining: be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  • Two types of variation, examples for Dutch from the lexicon
  • To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
  • These are some of the ways in which we are using computer lexica as building blocks.
  • These are results with a rather limited historical lexicon of German.
  • Computational Natural Language Learning
  • 322445 (fourth column, in the middle) 424979

Presentation Transcript

  • Computer Lexica in OCR and Retrieval Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
  • Overview
    • What is a computer lexicon
    • Lexica in IMPACT
    • Tools for lexicon building and applying lexica
    • Some results
    • Searching Demonstration
    IMPACT <Demo Day BL, 12 July 2011>
  • What is a computer lexicon?
  • Computer lexicon vs electronic dictionary (1)
    An electronic dictionary is:
    • Digitised full text (no pictures)
    • For human use
    • Ideally: searchable, with explicitly coded information (XML) such as lemma, part of speech (PoS), meaning, quotes, etc.
    • Examples: OED online, WNT online
  • Dictionary XML (example)
  • Computer Lexicon vs Electronic Dictionary (2)
    • A computer lexicon is:
      • Always in a structured digital format (XML, relational database)
      • Main purpose: computer application
      • Explicitly coded information (e.g. lemma, part of speech, morphology, syntax)
    • Examples of use:
      • Linguistic enrichment of text material
      • ‘Advanced’ searching (words with all spelling variants and inflections)
      • Automatic summarization, keyword extraction…
  • Lexica in IMPACT
  • The OCR lexicon
    An OCR lexicon is
    • A checked list of words in a language
    • Based on a corpus (collection) of dated texts (selection!)
    • Preferably with frequency information
    • Preferably from the same time period or of the same text type as the texts you wish to digitize
  • OCR lexicon: example (word and frequency, per period)
    • 1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
    • > 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
  • The IR lexicon
    • IR lexicon: most important information categories
      • Word forms (lists of words)
      • Frequency information
      • Quotes (dated sources) from corpora or electronic dictionaries
      • Modern lemma (compare the entry in a dictionary), linked to spelling variants and inflected forms of the same word
    • The modern lemma is used for searching in texts
    • Standard use in corpus linguistics and modern historical lexicography
  • <?xml version='1.0'?>
    <!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
    <lexicon>
      <lexical_entry>
        <lemma_id>219490</lemma_id>
        <modern_lemma>aantuilen</modern_lemma>
        <gloss></gloss>
        <POS>VRB</POS>
        <ne_label></ne_label>
        <language_id></language_id>
        <portmanteau_lemma_id></portmanteau_lemma_id>
        <wordform>
          <form_representation>
            <wordform_id>850026</wordform_id>
            <written_form>tuyld</written_form>
            <attestation>
              <id>92141</id>
              <token_id></token_id>
              <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
              <derivation_id>0</derivation_id>
              <document_id>204</document_id>
              <start_pos>119</start_pos>
              <end_pos>124</end_pos>
            </attestation>
          </form_representation>
        </wordform>
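A minimal Python sketch of how an entry like the one above could drive lemma-based search: it reads the `modern_lemma` and `written_form` elements and expands a modern search term into its attested historical forms. The sample XML is a shortened, well-formed stand-in for the slide's entry, and `build_index` and `expand_query` are hypothetical names, not part of the IMPACT tools.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, well-formed stand-in for the lexicon entry shown on the slide.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""


def build_index(xml_text):
    """Map each modern lemma to the set of written forms attested for it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for entry in root.iter("lexical_entry"):
        lemma = (entry.findtext("modern_lemma") or "").strip()
        for form in entry.iter("written_form"):
            if form.text and form.text.strip():
                index[lemma].add(form.text.strip())
    return index


def expand_query(lemma, index):
    """Return the lemma plus every historical form linked to it, sorted."""
    return sorted({lemma} | index.get(lemma, set()))


if __name__ == "__main__":
    print(expand_query("aantuilen", build_index(SAMPLE)))
```

A search engine could then query for any of the returned forms, so that a user who types the modern lemma also finds the historical spelling.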
  • Tools for lexicon building and application of lexica
  • Types of variation (spelling, inflection…)
    • I: uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
    • II: werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    • (patterns to predict variation) (a number are predictable with patterns, others need to be taken from a lexicon)
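The remark that some variants are predictable with patterns can be illustrated with a small Python sketch. The rewrite rules below (`uy` to `ui`, `wt` to `uit`, `ck` to `k`, `y` to `ij`) are hand-picked assumptions for this example, not the patterns extracted by the IMPACT tools, and `normalise` is a hypothetical helper. It applies the rules breadth-first until a form from a modern lexicon is reached.

```python
from collections import deque

# Illustrative rewrite rules (historical substring -> modern substring).
# These are assumptions chosen to fit a few of the variants listed above.
RULES = [("uy", "ui"), ("wt", "uit"), ("ck", "k"), ("y", "ij")]


def normalise(historical, modern_lexicon, rules=RULES, max_steps=4):
    """Return the modern forms reachable from a historical spelling."""
    seen = {historical}
    queue = deque([(historical, 0)])
    matches = set()
    while queue:
        form, steps = queue.popleft()
        if form in modern_lexicon:
            matches.add(form)
            continue
        if steps == max_steps:
            continue
        for old, new in rules:
            candidate = form.replace(old, new)
            if candidate not in seen:
                seen.add(candidate)
                queue.append((candidate, steps + 1))
    return sorted(matches)


if __name__ == "__main__":
    lexicon = {"uiterlijk", "wereld"}
    for word in ("uyterlijck", "wterlijck", "uyterlyck", "wald"):
        print(word, "->", normalise(word, lexicon))
```

Forms such as `wald` find no match with these rules, which is the case where the variant has to be taken from a lexicon instead.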
  • Neil Fitzgerald, 7th July 2011
  • Computer lexica
    • For OCR and OCR post correction
    • Improving the searchability of historical text material by building a lexicon that links variants to a modern lemma, which is used as the search entry
    • Tools for lexicon building
    • Tools for application of lexicon in search engines
    • Lexicon cookbook
    • Guidelines and tools to use the lexica in OCR
  • Tools (more specific)
    • Lexicon building from corpus material and dictionaries
    • Use of lexica in search engines
    • Tool to extract spelling variation patterns from historical material
    • Tool to relate previously unrecognised spelling variations to their standard form
    • Tool to reduce previously unrecognised inflected forms to their base form
  • Ordinary words vs Names (NEs)
    • Tools for the automatic recognition, classification and finding of variant names
      • Wish of the libraries
      • Separate regular vocabulary from names
      • Reduce unpleasant results: Abimelech → apemelk! (b/p; i/e; e/0; k/ch) (‘apemelk’ means ‘monkey milk’)
    • NE lexica
  • A number of results for Dutch and German
  • Ground truth data: Dutch (type and genre: # words)
    • Gold Standard Book: 300k
    • Random Set Books: 340k
    • Random Set Staten Generaal (Legal Papers): 2.5M
    • Gold Standard Staten Generaal: 500k
    • Gold Standard Newspapers 1: 3.4M
    • Gold Standard Newspapers 2: 170k
    • Random Set Newspapers: 3.2M
    • Total: 13.1M
  • Lexicon coverage (1: ground truth books)
    • (1) Modern lexicon (e-Lex): type coverage 46%, token coverage 76%
    • (2) Core general lexicon: type coverage 56%, token coverage 84%
    • (1) + (2): type coverage 63%, token coverage 89%
    • Expansion with corpus material: type coverage 78%, token coverage 95%
  • Lexicon coverage (2: GT newspapers, 18th-19th c.)
    • (1) Modern lexicon (e-Lex): type coverage 40%, token coverage 83%
    • (2) Core general lexicon: type coverage 41%, token coverage 84%
    • (1) + (2): type coverage 51%, token coverage 89%
    • Expansion with corpus material: type coverage 62%, token coverage 95%
  • Lexicon coverage (3: GT Staten Generaal, 19th c.)
    • (1) Modern lexicon (e-Lex): type coverage 51%, token coverage 89%
    • (2) Core general lexicon: type coverage 47%, token coverage 88%
    • (1) + (2): type coverage 58%, token coverage 93%
    • Expansion with corpus material: type coverage 68%, token coverage 97%
  • Lexicon coverage (4: GT Staten Generaal, 20th c.)
    • (1) Modern lexicon (e-Lex): type coverage 70%, token coverage 93%
    • (2) Core general lexicon: type coverage 66%, token coverage 93%
    • (1) + (2): type coverage 76%, token coverage 96%
    • Expansion with corpus material: type coverage 81%, token coverage 98%
  • Lexicon coverage (5: Genesis, 1637 bible)
    • (1) Modern lexicon (e-Lex): type coverage 31%, token coverage 61%
    • (2) Core lexicon: type coverage 62%, token coverage 83%
    • (1) + (2): type coverage 65%, token coverage 89%
    • Expansion with corpus material: type coverage 87%, token coverage 98.6%
  • Lexicon coverage (6: P.C. Hooft, histories)
    • (1) Modern lexicon (e-Lex): type coverage 26%, token coverage 67%
    • (2) Core lexicon: type coverage 47%, token coverage 88%
    • (1) + (2): type coverage 50%, token coverage 90%
    • Expansion with corpus material: type coverage 58%, token coverage 96%
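The difference between the two coverage figures above can be made concrete with a short Python sketch. Type coverage counts each distinct word form once, while token coverage counts every occurrence in the text. The function name `coverage`, the lowercasing step and the toy data are assumptions for illustration, not the evaluation code used in IMPACT.

```python
def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    tokens = [t.lower() for t in tokens]  # assumption: case-insensitive lookup
    types = set(tokens)
    type_cov = len(types & lexicon) / len(types)
    token_cov = sum(1 for t in tokens if t in lexicon) / len(tokens)
    return type_cov, token_cov


if __name__ == "__main__":
    # Toy text: the frequent word "de" is in the lexicon, two rare variants are not.
    text = ["de", "werelt", "de", "wereld", "de", "weerelt"]
    lexicon = {"de", "wereld"}
    type_cov, token_cov = coverage(text, lexicon)
    print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Because frequent words tend to be in the lexicon and the missing forms tend to be rare, token coverage is normally higher than type coverage, as in every table above.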
  • Evaluation of OCR
    • Finereader SDK (version 9, 10)
    • External dictionary interface (implementation module)
    • Challenge
      • Translation of corpus frequencies to weights 0-100
      • Broken words, case-sensitivity, …
      • Problem with long ‘s’ (workaround)
    • Lexicon Data
    • IMPACT OCR-lexicon for Dutch
    • Finereader internal lexicon
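The slide does not say how corpus frequencies were translated to weights 0-100. As one possible approach, the Python sketch below uses logarithmic scaling relative to the most frequent word; both the scaling and the name `frequency_to_weight` are assumptions made for illustration.

```python
import math


def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight between 0 and 100.

    Assumption: logarithmic scaling, so that a few very frequent words
    do not push all other words down to weights near zero.
    """
    if max_freq < 1:
        raise ValueError("max_freq must be at least 1")
    if freq <= 0:
        return 0
    scaled = math.log1p(freq) / math.log1p(max_freq)
    return min(100, round(100 * scaled))


if __name__ == "__main__":
    for freq in (0, 10, 100, 820):
        print(freq, frequency_to_weight(freq, max_freq=820))
```

A linear mapping would be simpler, but with word frequencies it would assign most of the lexicon a weight close to zero.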
  • OCR results: word recognition rate (dataset DPO35)
    • With ABBYY internal Dutch lexicon: 88.8%
    • With IMPACT lexicon for Dutch (case, hyphenation): 90.9%
    • With IMPACT lexicon for Dutch (case, hyphenation + long s problem): 93.5%
  • An example:
    • OCR at the beginning of the project: A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
    • Results: A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
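In the first sample above the long s was read as f (onbefchaafde, beftieren). The slides do not describe the workaround that was used, so the Python sketch below only shows one conceivable lexicon-based repair: try every combination of f and s at the positions where the token has an f, and accept the result when exactly one candidate is in the lexicon. The function names are hypothetical.

```python
from itertools import product


def long_s_candidates(token, max_positions=8):
    """Yield every spelling obtained by keeping or replacing each 'f' with 's'."""
    positions = [i for i, ch in enumerate(token) if ch == "f"][:max_positions]
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def repair_long_s(token, lexicon):
    """Return the unique lexicon form reachable by f -> s swaps, else the token."""
    if token in lexicon:
        return token
    matches = [c for c in long_s_candidates(token) if c in lexicon]
    return matches[0] if len(matches) == 1 else token


if __name__ == "__main__":
    lexicon = {"onbeschaafde", "bestieren", "Menschen", "Hof"}
    for word in ("onbefchaafde", "beftieren", "Menfchen", "Hof"):
        print(word, "->", repair_long_s(word, lexicon))
```

Tokens with other recognition errors are left unchanged by this sketch; it addresses only the f/s confusion.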
  • Number of word errors (reduction of error rate) per dictionary, for 16th, 18th and 19th century material
    • No Lexicon: 16th c. 1306 (-); 18th c. 827 (-); 19th c. 2074 (-)
    • Optimal Lexicon: 16th c. 756 (42%); 18th c. 395 (52%); 19th c. 612 (70%)
    • Modern Lexicon: 16th c. 1096 (16%); 18th c. 501 (39%); 19th c. 888 (57%)
    • W.Historical Lexicon: 16th c. 938 (28%); 18th c. 481 (42%); 19th c. 856 (59%)
    • Modern + Virtual H.L.: 16th c. 1011 (25%); 18th c. 480 (42%); 19th c. 849 (59%)
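The reduction percentages above appear to be relative to the No Lexicon row. Under that assumption, the arithmetic can be sketched in Python as follows; `error_rate_reduction` is a name chosen for this example.

```python
def error_rate_reduction(baseline_errors, errors):
    """Percentage by which the error count drops relative to the baseline."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 100.0 * (baseline_errors - errors) / baseline_errors


if __name__ == "__main__":
    # Optimal Lexicon vs. No Lexicon, 16th century figures from the table above.
    print(round(error_rate_reduction(1306, 756)))
```

For the 16th century Optimal Lexicon row this gives 42, matching the figure in the table.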
  • Languages in IMPACT
    • Dutch, German, English, Spanish, French
    • Polish, Czech, Slovene and Bulgarian
    • Cross language perspective paper
    • Parallel OCR and IR experiments
    • GT datasets
    • Language tools: language independent
    • Apart from the 3 core languages: proof-of-concept lexica
  • English in IMPACT
    • Lexicon building using OED
      • OCR lexicon from quotations full text, possibly supplemented with corpus material
      • IR lexicon from headword variants in quotations (small demo)
    • Named Entity Recognition on newspaper material
      • NE lexicon
      • Gold standard corpus for NE recognition (CoNLL), following the Named Entity Recognition Task Definition by N. Chinchor, E. Brown, L. Ferro, and P. Robinson, Version 1.4 (1999): PER, LOC, ORG
    • Research into the possible benefits of excluding modern words from the OCR lexicon
  • An indemnity shall be granted to the surfer… … bikini …
  • Retrieval demonstrator
    • Indexing and retrieval library (Java) built on the Lucene search engine
    • Lexicon in MySQL database
    • OCR with Finereader SDK and external dictionary interface of about 2000 images of the Dutch Ground Truth selection
    • Page XML output [in framework]
    • NE tagging
    • Indexing and retrieval while using lexicon and NE tagging