IMPACT Final Conference - Language Parallel Sessions - Gotscharek

Transcript of "IMPACT Final Conference - Language Parallel Sessions - Gotscharek"

  1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
     Special resources to access 16th century German
     Ludwig-Maximilians-Universität München
     Annette Gotscharek
     15. 10. 2011, IMPACT Conference
  2. Special resources to access 16th century German
     “access”?
     OCR: Role of the lexicon: defines the set of valid words.
       ... Geist, Geister, Teile, gemütlich, …
     Information Retrieval (IR): Role of the lexicon: meaningful expansion of the user query to increase recall.
       Geist → Geister, Geiste, Geistern
       Teil → Teile, Teils, Teilen
       gemütlich → gemütlicher, gemütlichste
       ...
  3. Special resources to access 16th century German
     In IMPACT, we worked on documents from 1500-1950, but 16th century is special:
     – Language period: Early New High German (1350-1650)
     – Oldest and therefore most challenging period of printed books
     – Large library holdings from 16th century at our partner library BSB
     Linguistic features of historical language on word-level (historic → modern, English gloss):
     – Historical spelling variation: geyſte → Geiste (spirit)
     – Historical morphology: er frug → er fragte (he asked)
     – Obsolete vocabulary: mirackel → Wunder (?) (miracle)
     – Obsolete character set: aͤ → ä …
     → Need adapted linguistic resources
  4. Adapted linguistic resources: structure
     OCR:
       ... Geist, Geister, Teile, gemütlich, …
     Information Retrieval (IR):
       Geist → Geister, Geiste, Geistern
       Teil → Teile, Teils, Teilen
       gemütlich → gemütlicher, gemütlichste
       ...
  5. Adapted linguistic resources: structure
     OCR:
       ... Geist, Geyst, Geister, Geyster, Teile, Theile, gemütlich, gemüthlich, …
     Information Retrieval (IR):
       Geist → Geister, Geiste, Geistern, Geyster, Geyste, Geystern
       Teil → Teile, Teils, Teilen, Theile, Theils, Theilen
       gemütlich → gemütlicher, gemütlichste, gemüthlicher, gemüthlichste
       ...
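The IR role of the lexicon on this slide can be sketched as a plain lookup. This is a minimal Python illustration, not the IMPACT implementation: the names `IR_LEXICON` and `expand_query` are hypothetical, and the entries are only the examples given on the slide.

```python
# Toy IR lexicon: modern headword -> modern and historical word forms.
# The entries are the examples from the slide; a real lexicon is far larger.
IR_LEXICON = {
    "Geist": ["Geister", "Geiste", "Geistern", "Geyster", "Geyste", "Geystern"],
    "Teil": ["Teile", "Teils", "Teilen", "Theile", "Theils", "Theilen"],
    "gemütlich": ["gemütlicher", "gemütlichste", "gemüthlicher", "gemüthlichste"],
}


def expand_query(term, lexicon=IR_LEXICON):
    """Return the query term followed by all forms the lexicon lists for it."""
    return [term] + lexicon.get(term, [])


expanded = expand_query("Teil")
unknown = expand_query("Mirackel")  # not in the toy lexicon: no expansion
```

A query for "Teil" then also retrieves documents containing "Theile", which is how the expansion increases recall on historical text.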
  6. Linguistic Resources for Historical Texts
     – Diachronic Groundtruth Corpus (1500-1950)
     – Hypothetical lexicon for rule based variants
     – Manually verified lexicon
  7. Linguistic Resources for Historical Texts
     – Diachronic Groundtruth Corpus (1500-1950)
     – Hypothetical lexicon for rule based variants
     – Manually verified lexicon
  8. Diachronic Groundtruth Corpus (1500-1950)
     Collection of groundtruth material from different sources in the web and non-public electronic corpora (Institut für Deutsche Sprache Mannheim)
     Large gap especially in 16th / 17th century:
     → with BSB: preparation of additional corpus from BSB documents:
     – Random selection of 100 works from digitized images of 16th and 17th century
     – Mostly related to theology
     – Latin texts excluded, no poems etc.
     – Keyed by a service provider
     – 1766 pages with ~ 858,000 tokens groundtruth material
  9. Diachronic Groundtruth Corpus (1500-1950)
     Gains of tokens by the extension of the corpus (chart not reproduced in the transcript)
     Complete corpus contains ~ 3,380,000 tokens in 500 texts from 4 centuries
     → basis for different analyses and lexicon building
  10. Coverage on Diachronic Corpus: modern
      Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
      Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
      Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
      → Less than 45% of the vocabulary is covered by modern resources before 1750.
      → 16th century: only 15% - 29% modern simple words, modern closed compounds are hardly relevant.
  11. Linguistic Resources for Historical Texts
      – Diachronic Groundtruth Corpus (1500-1950)
      – Hypothetical lexicon for rule based variants
      – Manually verified lexicon
  12. Hypothetical lexicon for rule based variants
      – Systematic substitution rules (patterns) describe the difference between modern and historical spelling:
        t → th, ei → ey: (modern) teil → theyl (historic)
      – Based on the modern lexicon and the 140 manually collected patterns, the set of all potential rule based historical variants can be computed automatically (“hypothetical lexicon”).
  13. Hypothetical lexicon for rule based variants
      Modern lexicon: … Esel, Teil, …
      Pattern set: e → eh, ei → ey, s → ß, l → ll, t → th, …
      Hypothetical lexicon:
        Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, …
        Teil, Teill, Teyl, Teyll, Tehill, Theil, …
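One way to compute such a hypothetical lexicon is to walk through each modern word and, at every position, either keep the character or rewrite a matching substring with its historical counterpart. The sketch below is an assumption about how this could be done, not the project's actual generator; `PATTERNS` holds only the five patterns shown on the slide (the full set has 140), the function names are hypothetical, and words are handled in lowercase to sidestep capitalisation.

```python
# Pattern set from the slide, as (modern spelling, historical spelling) pairs.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def historical_variants(word, patterns=PATTERNS):
    """Return all spellings produced by applying any subset of patterns to word."""
    results = set()

    def walk(pos, prefix):
        if pos == len(word):
            results.add(prefix)
            return
        # Option 1: keep the modern character unchanged.
        walk(pos + 1, prefix + word[pos])
        # Option 2: rewrite a matching modern substring into its historical form.
        for modern, historical in patterns:
            if word.startswith(modern, pos):
                walk(pos + len(modern), prefix + historical)

    walk(0, "")
    results.discard(word)  # the unchanged modern form is not a variant
    return results


def hypothetical_lexicon(modern_lexicon, patterns=PATTERNS):
    """Union of the rule based variants of every word in the modern lexicon."""
    return {v for w in modern_lexicon for v in historical_variants(w, patterns)}


lexicon = hypothetical_lexicon({"esel", "teil"})
```

Because rewrites are applied only to the original modern characters, a pattern such as l → ll cannot feed on its own output, so the variant set stays finite.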
  14. Hypothetical lexicon for rule based variants
      – Automatic mapping from rule based historical variants to their equivalent in the modern vocabulary is possible:
        historic = modern + pattern
        Geyst = Geist + (ei → ey)
        Theile = Teile + (t → th)
      – By far not all historical variants can be described by simple replacement rules:
        historic = modern + pattern
        frug = fragte + ?
        Mirackel = ? + ?
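The mapping from a historical form back to its modern equivalent can be illustrated by applying the patterns in reverse and keeping only results found in the modern lexicon. This is a hedged sketch under the same lowercase simplification; `modern_candidates`, `PATTERNS` and `MODERN_LEXICON` are hypothetical names, and the two patterns and three words are taken from the slide's examples.

```python
# (modern spelling, historical spelling) pairs used in the slide's examples.
PATTERNS = [("ei", "ey"), ("t", "th")]

# Toy modern lexicon; a real one would hold the full modern vocabulary.
MODERN_LEXICON = {"geist", "teile", "fragte"}


def modern_candidates(historic, lexicon=MODERN_LEXICON, patterns=PATTERNS):
    """Return (modern word, applied patterns) pairs that explain historic."""
    found = []

    def walk(pos, prefix, used):
        if pos == len(historic):
            # Only rewritten forms that are real modern words count as matches.
            if used and prefix in lexicon:
                found.append((prefix, list(used)))
            return
        # Keep the historical character as it is.
        walk(pos + 1, prefix + historic[pos], used)
        # Or undo a pattern: replace the historical substring by the modern one.
        for modern, hist in patterns:
            if historic.startswith(hist, pos):
                walk(pos + len(hist), prefix + modern, used + ((modern, hist),))

    walk(0, "", ())
    return found


geyst = modern_candidates("geyst")
theile = modern_candidates("theile")
frug = modern_candidates("frug")  # no replacement rule leads to "fragte"
```

The empty result for "frug" corresponds to the second bullet of the slide: such forms need a manually assigned modern equivalent.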
  15. Coverage on Diachronic Corpus: hypothetic
      Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
      Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
      Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
      Hypothetic                   29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
      → 16th century: 30% of the vocabulary is covered by the lexicon of rule based variants
      → Applied as OCR lexicon via the IMPACT Abbyy External Dictionary Interface: improvement of recognition rate (published 2009)
  16. Coverage on Diachronic Corpus: missing
      Types (%)               1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
      Modern simple words          15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
      Modern compounds              5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
      Hypothetic                   29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
      Missing                      45.9       28.7       29.7       26.0       23.5       15.1       13.9       13.5        8.1
      → Especially in the 16th century: Up to 46% “difficult” vocabulary.
      → manually verified lexicon necessary!
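Coverage figures of this kind can be obtained by assigning each distinct word form (type) of a period to the first resource that contains it. The following sketch assumes that procedure; it merges modern simple words and compounds into one class for brevity, the function name `coverage` is hypothetical, and the sample data are toy values, not corpus figures.

```python
def coverage(types, modern_lexicon, hypothetical_lexicon):
    """Return the percentage of types per class: modern, hypothetic, missing."""
    if not types:
        raise ValueError("at least one type is required")
    counts = {"modern": 0, "hypothetic": 0, "missing": 0}
    for word in types:
        if word in modern_lexicon:
            counts["modern"] += 1
        elif word in hypothetical_lexicon:
            counts["hypothetic"] += 1
        else:
            counts["missing"] += 1  # candidate for the manually verified lexicon
    return {name: round(100.0 * n / len(types), 1) for name, n in counts.items()}


# Toy data for illustration only.
sample_types = {"geist", "geyst", "theile", "frug"}
shares = coverage(sample_types, {"geist"}, {"geyst", "theile"})
```

Checking the modern lexicon first means a form is counted as hypothetic only when no modern resource already covers it.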
  17. Linguistic Resources for Historical Texts
      – Diachronic Groundtruth Corpus (1500-1950)
      – Hypothetical lexicon for rule based variants
      – Manually verified lexicon
  18. Manually verified IR-lexicon: Structure
      One entry contains:
      – Historical word form from the corpus
      – Corresponding modern word form
      – Patterns if applicable
      – Corresponding modern lemma
      – At least one occurrence in the corpus as an attestation for the reading
      Manual assignment of modern word form and lemma
      Explicit handling of variants that are not rule based
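The entry structure listed on this slide maps naturally onto a small record type. The sketch below is one possible representation, not the project's data model: the class `LexiconEntry`, its field names, and the attestation strings are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LexiconEntry:
    historical_form: str      # word form as found in the corpus
    modern_form: str          # corresponding modern word form
    lemma: str                # corresponding modern lemma
    attestations: List[str]   # corpus occurrences supporting this reading
    # (modern, historical) patterns; empty for variants that are not rule based
    patterns: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self):
        # The slide requires at least one attestation per entry.
        if not self.attestations:
            raise ValueError("an entry needs at least one attestation")


# Rule based variant: explained by the pattern t -> th.
theile = LexiconEntry("Theile", "Teile", "Teil",
                      attestations=["<placeholder corpus reference>"],
                      patterns=[("t", "th")])

# Variant that is not rule based: assigned manually, no pattern applies.
frug = LexiconEntry("frug", "fragte", "fragen",
                    attestations=["<placeholder corpus reference>"])
```

Keeping `patterns` optional lets the same record hold both rule based variants and manually assigned ones.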
  19. Manually verified IR-lexicon: Compilation
      Web-based, collaborative user interface
      User support:
      – For rule based variants: Suggestion of the corresponding modern word form by the hypothetic lexicon
      – Suggestion of all possible lemmas for the modern word form by a large modern lexicon (CISLEX)
      – Concordance list of the historical variant
  20. Manually verified IR-lexicon: Status
      41,600 entries have been created for 24,800 historical word forms from the diachronic corpus; 72,100 attestations were annotated.
      IMPACT partners in Slovenia and Bulgaria are creating corresponding lexica with an adapted version of the tool.
  21. Thank you.