IMPACT Final Conference - Katrien Depuydt
 


Overview of the IMPACT language work with Katrien Depuydt from the INL


Usage Rights

CC Attribution-NoDerivs License

  • A snippet from a Dutch magazine (De Denker, No. 4, 24 January 1763). OCR improves access to text by improving the quality of the text; retrieval improves access to text by dealing with historical spelling variants. Used: the historical lexicon of Dutch. Can we handle ‘the world’? ‘Yes we can’ ought to be our answer, especially when investing hugely in mass digitisation. Mass digitisation is the very reason for investing in lexicon building: digitising huge quantities of historical text demands effort in the quality of both OCR and retrieval, and historical lexicon building can contribute to both, as this little example shows. In a ground truth corpus of Dutch texts from 1550 to 1950, containing approximately 150 million words, a search for the very common word ‘wereld’ yielded 23,396 hits. Using a historical lexicon containing spelling and morphological variants of this word resulted in 34,339 hits, that is 10,943 more (an increase of about 47%). I
  • Again, in case it is unclear what LEMMA means: be, was, am, is, etc. are all forms of the same word BE, and BE is an example of a lemma.
  • Two types of variation, examples for Dutch from the lexicon
  • Here you search for wereld and automatically for variations of wereld. WERELD is the modern Dutch spelling. How: by using the lexicon built in IMPACT, which stores each modern word together with its variations in a format that can be integrated into a Lucene search engine.

IMPACT Final Conference - Katrien Depuydt Presentation Transcript

  • Overview of the Language Work in IMPACT Katrien Depuydt Institute for Dutch Lexicology Leiden
  • IMProving ACcess to Text
  • Can we handle ‘de wereld’ (‘the world’)?
    • OCR:
    werreid
  • OCR: ABBYY FineReader SDK with built-in standard Dutch dictionary
  • OCR: ABBYY FineReader SDK combining the built-in modern Dutch dictionary with the IMPACT external historical lexicon of Dutch: werreld
  • RETRIEVAL: key in modern WERELD and find all of: werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
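The retrieval idea on this slide (type the modern form, match every historical variant) can be sketched as simple query expansion. This is only a minimal illustration, not the IMPACT implementation: the names `LEXICON`, `expand_query` and `search` are hypothetical, the lexicon holds just a handful of the variants listed above, and the sample sentence is invented for the example.

```python
import re

# Hypothetical excerpt of a historical lexicon: modern lemma -> attested variants.
# The variants are a small subset of those listed on the slide.
LEXICON = {
    "wereld": {"wereld", "werelt", "weerelt", "waereld", "werreld", "vvereld"},
}


def expand_query(term, lexicon=LEXICON):
    """Return the search term plus every variant the lexicon lists for it."""
    return {term} | lexicon.get(term.lower(), set())


def search(term, text, lexicon=LEXICON):
    """Return the tokens in `text` that match the term or one of its variants."""
    variants = {v.lower() for v in expand_query(term, lexicon)}
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t in variants]


# Invented sample sentence, used only to exercise the functions.
text = "De gantsche weerelt en de waereld van heden"
print(search("wereld", text))  # ['weerelt', 'waereld']
```

A term that is not in the lexicon is searched literally, so the expansion never reduces what a plain search would find.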
  • Lexica in IMPACT
  • The OCR lexicon
    • A checked list of words in a language
    • Based on a corpus (collection) of dated texts (selection!)
    • Preferably with frequency information
    • Preferably from the same time period or of the same text type as the texts you wish to digitize
    • For OCR and OCR postcorrection
  • OCR lexicon: example (word form and frequency)
    • 1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
    • > 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
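To show why frequency information is useful in an OCR lexicon, here is a minimal post-correction sketch. It is an assumption about how such a list could be used, not a description of how ABBYY FineReader or the IMPACT tools work: an unknown token is replaced by the lexicon word within a small edit distance, and ties are broken by the higher frequency. The names `OCR_LEXICON`, `edit_distance` and `correct` are hypothetical; the words and counts come from the 1550-1750 column of the example above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


# Word forms with frequencies, taken from the 1550-1750 column of the example.
OCR_LEXICON = {"fyrst": 811, "while": 811, "mode": 804, "work": 802,
               "law": 799, "moder": 798, "softe": 798}


def correct(token, lexicon=OCR_LEXICON, max_distance=1):
    """Return the token itself if known, else the closest, most frequent word."""
    if token in lexicon:
        return token
    candidates = [(edit_distance(token, word), -freq, word)
                  for word, freq in lexicon.items()]
    candidates = [c for c in candidates if c[0] <= max_distance]
    # Smallest distance first; among equals, the highest frequency wins.
    return min(candidates)[2] if candidates else token


print(correct("wbile"))  # while
print(correct("modei"))  # mode (ties with "moder" on distance, wins on frequency)
```

Tokens with no lexicon word within `max_distance` are returned unchanged, which keeps the sketch from "correcting" words it knows nothing about.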
  • The IR lexicon
    • IR lexicon: most important information categories
      • Word forms (lists of words)
      • Frequency information
      • Quotes (dated sources) from corpora or electronic dictionaries
      • MODERN LEMMA (comparable to a dictionary entry), linked to spelling variants and inflected forms of the same word
    • The modern lemma is used for searching in texts
    • Standard use in corpus linguistics and modern historical lexicography
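The information categories of the IR lexicon map naturally onto a small data structure. The sketch below is one possible layout, not the actual IMPACT schema: `WordForm`, `LexiconEntry` and `build_form_index` are hypothetical names, and frequencies and quotes are left at their empty defaults rather than invented. The word forms are taken from the variant lists in this presentation.

```python
from dataclasses import dataclass, field


@dataclass
class WordForm:
    form: str                                    # attested spelling or inflected form
    frequency: int = 0                           # corpus frequency, if known
    quotes: list = field(default_factory=list)   # (year, quotation) pairs from dated sources


@dataclass
class LexiconEntry:
    modern_lemma: str                            # the key users search with
    word_forms: list = field(default_factory=list)


def build_form_index(entries):
    """Map every attested word form to the modern lemma(s) it belongs to."""
    index = {}
    for entry in entries:
        for wf in entry.word_forms:
            index.setdefault(wf.form.lower(), set()).add(entry.modern_lemma)
    return index


entries = [
    LexiconEntry("wereld", [WordForm("werelt"), WordForm("weerelds"), WordForm("werelden")]),
    LexiconEntry("uiterlijk", [WordForm("uyterlijck"), WordForm("wterlijcke")]),
]
index = build_form_index(entries)
print(index["weerelds"])    # {'wereld'}
print(index["wterlijcke"])  # {'uiterlijk'}
```

The index stores a set of lemmas per form because one historical spelling can belong to more than one modern lemma.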
  • NE lexica
    • Lexica for OCR and NE Recognition and variant matching in historical documents!
    • English, German and Dutch
    • Stanford NE tagger with additional IMPACT module
    • NE repository with gazetteers and authority files
    • Parallel session: Frank Landsbergen on the NE work in IMPACT
  • Strategies, material and Toolbox for Lexicon building
  • Types of variation (spelling, inflection…)
    • I: uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk
    • II: werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    • Patterns to predict variation: a number of variants are predictable with patterns, others need to be taken from a lexicon
  • Material for lexicon building:
    • Historical dictionaries with quotations (OED, WNT)
    • Corpus material, ground truth quality
    • List of dictionary entries
    • Modern or historical language computational lexica
    • Toolbox (a selection):
    • Tool to automatically derive spelling variation rules from a dataset of historical word forms with their modern equivalent
    • E.g. to be used to predict historical forms starting from a modern lexicon
    • Tool to automatically expand a list of dictionary entries with inflectional variants (“reverse lemmatisation”)
    • Tool to lemmatise word forms: historical word > (1) pattern matching > standard (“modern”) spelling > (2) lemmatizer > lemma form
    • Example: Dystels > (1) > distels > (2) > distel
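The two-step lemmatisation described in the toolbox (pattern matching to a modern spelling, then lemmatisation) can be sketched as follows. This is a toy illustration under stated assumptions, not the IMPACT tool: the rewrite rules in `RULES` are hand-picked for the example rather than derived from data, `MODERN_FORMS` is a tiny hypothetical modern lexicon, and each rule is applied either to the whole word or not at all.

```python
from itertools import product

# Hypothetical rewrite rules (historical substring -> modern substring).
# In the toolbox such rules would be derived automatically from word pairs.
RULES = [("y", "i"), ("ck", "k"), ("ae", "aa")]

# Hypothetical modern lexicon: modern word form -> lemma.
MODERN_FORMS = {"distels": "distel", "distel": "distel", "uiterlijk": "uiterlijk"}


def candidates(word, rules=RULES):
    """Generate spellings by switching each rule on or off for the whole word."""
    results = set()
    for switches in product([False, True], repeat=len(rules)):
        spelling = word
        for on, (old, new) in zip(switches, rules):
            if on:
                spelling = spelling.replace(old, new)
        results.add(spelling)
    return results


def lemmatise(historical_word, rules=RULES, modern_forms=MODERN_FORMS):
    """Step 1: pattern matching to a modern spelling. Step 2: lemma lookup."""
    word = historical_word.lower()
    for spelling in sorted(candidates(word, rules)):
        if spelling in modern_forms:
            return spelling, modern_forms[spelling]
    return None  # no candidate is a known modern form


print(lemmatise("Dystels"))     # ('distels', 'distel')
print(lemmatise("uyterlijck"))  # ('uiterlijk', 'uiterlijk')
```

Filtering the generated spellings against a modern lexicon is what keeps over-eager rules in check: a rewrite only counts if it lands on a known modern word form.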
  • Corpus-based lexicon building (COBALT)
  • Improvement of state of the art / innovation
    • We use existing computational linguistic approaches, but figure out how to apply them to historical language
    • We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together
      • Data selection and acquisition
      • Manual work
      • Computational linguistics tools
  • Cross-language perspective on lexicon building
  • Multi-Language Lexicon Building: Challenges
    • Different points of departure
      • For which periods does historical lexicon building make sense?
      • What language resources (lexica, corpora, dictionaries) are available?
      • What tools are available?
      • Special character sets (Polish, Bulgarian)
    • Set up fruitful cooperation with many institutes (“training”)
      • General meetings
      • Individual training sessions by LMU and INL
      • Extensive testing of tools, additional feature requirements
  • Different languages, different periods
  • Resources for lexicon building
    • Bulgarian
      • Dictionaries: -
      • Corpora: Ground Truth, Early OCR
      • Lexica: -
    • Czech
      • Dictionaries: Jungmann, Kott
      • Corpora: Ground Truth, Czech National Corpus
      • Lexica: Based on modern dictionary
    • English
      • Dictionaries: OED
      • Corpora: Ground Truth
      • Lexica: -
    • French
      • Dictionaries: -
      • Corpora: Ground Truth, Frantext
      • Lexica: morphalou
    • Polish
      • Dictionaries: The dictionary of 17th and early 18th century Polish
      • Corpora: Ground Truth
      • Lexica: Grammatical dictionary of Polish
    • Slovene
      • Dictionaries: -
      • Corpora: AHLib, wikisource, Ground Truth
      • Lexica: Multext-east lexicon
    • Spanish
      • Dictionaries: Diccionario de Autoridades, Real Academia Española
      • Corpora: Cervantes Virtual Library, Ground Truth
      • Lexica: Apertium lexicon
  • Issues and challenges
    • Bulgarian
      • Issues: some characters in late 19th century Bulgarian not recognized by FineReader; Old Church Slavonic printing not implemented at all; lack of sufficient corpus material
      • Countermeasures: special font training; lexicon development; ground truth
    • Czech
      • Issues: lack of sufficient corpus material
      • Countermeasures: lexicon development; ground truth
    • Polish
      • Issues: special glyphs; lack of sufficient corpus material
      • Countermeasures: lexicon development; ground truth
  • COBALT
    • New features:
    • Page XML and TEI import
    • Major adaptation: make tool suitable for lexicon building with OCR material
      • Highlighting of (suspicious) words in page image
      • Editing of word forms
    • Many small enhancements to improve usability at the request of users (new language partners)
  • Parallel language session
    • Annette Gotscharek: Work on 16th Century German
    • Janusz Bien: Work on Polish language
    • Tomaž Erjavec: Work on Slovene
  • Use of Lexica in OCR
  • ABBYY External dictionary interface
    • Use of historical lexica within FineReader SDK (FR 9 and 10)
    • Implemented as web service in OC5 framework
    • Possible enhancements
      • Morphological structure: integrated in the external dictionary implementation
      • Historical spelling variation patterns
      • Cf. talk by Jesse de Does on, among other things, OCR results
  • Lexica in Retrieval
  • Retrieval demonstrator
    • Indexing and retrieval library (Java) implemented on the Lucene search engine
    • Lexicon in MySQL database
    • Page XML [in framework], also suitable for other XML-formats
    • NE tagging
    • Indexing and retrieval while using lexicon and NE tagging
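As a rough illustration of how a lexicon stored in a relational database can feed a Lucene-style search, the sketch below looks up the variants of a modern lemma and builds a query string in Lucene's classic query syntax. Everything here is assumed for the example: the demonstrator itself is a Java library with a MySQL lexicon, whereas this sketch uses Python with an in-memory SQLite database as a stand-in, and the table names, column names, the `text` field and the functions `build_lexicon_db` and `lucene_query` are hypothetical.

```python
import sqlite3


def build_lexicon_db():
    """Create a tiny in-memory lexicon (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lemmata (id INTEGER PRIMARY KEY, modern_lemma TEXT UNIQUE)")
    conn.execute("CREATE TABLE wordforms (lemma_id INTEGER REFERENCES lemmata(id), form TEXT)")
    conn.execute("INSERT INTO lemmata (id, modern_lemma) VALUES (1, 'wereld')")
    conn.executemany("INSERT INTO wordforms (lemma_id, form) VALUES (1, ?)",
                     [("werelt",), ("weereld",), ("waereld",)])
    conn.commit()
    return conn


def lucene_query(conn, lemma, field="text"):
    """Build a query string such as text:(wereld OR waereld OR ...)."""
    rows = conn.execute(
        "SELECT w.form FROM wordforms w "
        "JOIN lemmata l ON l.id = w.lemma_id "
        "WHERE l.modern_lemma = ? ORDER BY w.form",
        (lemma,),
    ).fetchall()
    terms = [lemma] + [form for (form,) in rows if form != lemma]
    return f"{field}:({' OR '.join(terms)})"


conn = build_lexicon_db()
print(lucene_query(conn, "wereld"))  # text:(wereld OR waereld OR weereld OR werelt)
```

Word forms containing characters that are special in the query syntax would need escaping before being joined; the sketch skips that step. Expanding at query time is only one option, since variants could also be mapped to their modern lemma at indexing time.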
  • Neil Fitzgerald, 7th July 2011
  • Final remarks
    • Cross-language perspective paper
    • Lexicon cookbook + toolbox
    • Lexica