IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




Special resources to access 16th century German
Ludwig-Maximilians-Universität München

Annette Gotscharek




15. 10. 2011, IMPACT Conference



Special resources to access 16th century German
 “access”?
 OCR:
   Role of the lexicon: defines the set of valid words.

   ...   Geist
         Geister
         Teile
         gemütlich …

 Information Retrieval (IR):
   Role of the lexicon: meaningful expansion of the user query to increase recall.

   ...   Geist → Geister, Geiste, Geistern
         Teil → Teile, Teils, Teilen
         gemütlich → gemütlicher, gemütlichste ...




Special resources to access 16th century German
 In IMPACT, we worked on documents from 1500-1950, but the 16th century is special:
   – Language period: Early New High German (1350-1650)
   – Oldest and therefore most challenging period of printed books
   – Large holdings from the 16th century at our partner library BSB

 Linguistic features of historical language at the word level:

                                      Historic    Modern       English
   – Historical spelling variation:   geyſte      Geiste       spirit
   – Historical morphology:           er frug     er fragte    he asked
   – Obsolete vocabulary:             mirackel    Wunder (?)   miracle
   – Obsolete character set:          aͤ           ä …

 → Need adapted linguistic resources








                Adapted linguistic resources: structure
 OCR:
                  Modern        Historical
   ...            Geist         Geyst
                  Geister       Geyster
                  Teile         Theile
                  gemütlich     gemüthlich …

 Information Retrieval (IR):

   ...   Geist → Geister, Geiste, Geistern; Geyster, Geyste, Geystern
         Teil → Teile, Teils, Teilen; Theile, Theils, Theilen
         gemütlich → gemütlicher, gemütlichste; gemüthlicher, gemüthlichste ...
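
The IR role of the adapted lexicon can be illustrated with a minimal sketch. The dictionary layout and the function name `expand_query` are assumptions made for illustration only; the word forms are the ones shown on the slide.

```python
# Hypothetical toy lexicon: modern lemma -> modern forms and historical variants.
# The entries come from the slide; the data structure itself is an assumption.
LEXICON = {
    "Geist": {
        "modern": ["Geister", "Geiste", "Geistern"],
        "historical": ["Geyster", "Geyste", "Geystern"],
    },
    "Teil": {
        "modern": ["Teile", "Teils", "Teilen"],
        "historical": ["Theile", "Theils", "Theilen"],
    },
}


def expand_query(term: str, include_historical: bool = True) -> list[str]:
    """Return the query term plus all word forms the lexicon lists for it."""
    entry = LEXICON.get(term)
    if entry is None:
        return [term]  # unknown term: search for it unchanged
    forms = [term] + entry["modern"]
    if include_historical:
        forms += entry["historical"]
    return forms
```

With a purely modern lexicon (`include_historical=False`) a query for "Teil" would miss spellings such as "Theile"; adding the historical variants is what raises recall on old texts.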








             Linguistic Resources for Historical Texts

 Diachronic Groundtruth Corpus (1500-1950)
 Hypothetical lexicon for rule-based variants
 Manually verified lexicon








       Diachronic Groundtruth Corpus (1500-1950)

 Collection of groundtruth material from various sources on the web and from
  non-public electronic corpora (Institut für Deutsche Sprache, Mannheim)

 Large gap, especially in the 16th and 17th centuries:
   → together with BSB, an additional corpus was prepared from BSB documents:
   – Random selection of 100 works from digitized images of the 16th and 17th centuries
   – Mostly related to theology
   – Latin texts excluded, no poems etc.
   – Keyed by a service provider
   – 1,766 pages with about 858,000 tokens of groundtruth material





       Diachronic Groundtruth Corpus (1500-1950)
 Gains of tokens by the extension of the corpus:

   [Figure not reproduced]

 The complete corpus contains about 3,380,000 tokens in 500 texts from 4 centuries
   → basis for different analyses and lexicon building





             Coverage on Diachronic Corpus: modern
Types (%)              1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                       1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words    15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds       5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8

 Before 1750, modern resources cover no more than about 45% of the vocabulary.
 16th century: only 15% - 29% modern simple words; modern closed compounds
  are hardly relevant.








        Hypothetical lexicon for rule-based variants

  Systematic substitution rules (patterns) describe the difference
   between modern and historical spelling:

       t → th,  ei → ey
       teil (modern)  →  theyl (historic)

  Based on the modern lexicon and the 140 manually collected
   patterns, the set of all potential rule-based historical variants can be
   computed automatically (“hypothetical lexicon”).






           Hypothetical lexicon for rule-based variants

   modern lexicon + pattern set → hypothetical lexicon

   modern lexicon:        …, Esel, …, Teil, …
   pattern set:           e → eh, ei → ey, s → ß, l → ll, t → th, …
   hypothetical lexicon:  Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, …
                          Teil, Teill, Teyl, Teyll, Tehill, Theil, …
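
The generation step can be sketched as follows. This is a minimal illustration, not the IMPACT implementation: it uses only the five patterns shown on the slide (the real set has about 140), and it assumes case-insensitive matching and that every combination of non-overlapping pattern matches yields a candidate variant. The names `PATTERNS`, `variants` and `hypothetical_lexicon` are hypothetical.

```python
# Illustrative subset of patterns (modern -> historical), taken from the slide.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def variants(word: str, patterns=PATTERNS) -> set[str]:
    """All spellings obtained by rewriting any set of non-overlapping matches."""
    lower = word.lower()

    def expand(i: int) -> list[str]:
        if i == len(lower):
            return [""]
        # Option 1: keep the character at position i unchanged.
        results = [lower[i] + rest for rest in expand(i + 1)]
        # Option 2: apply any pattern whose left-hand side starts at position i.
        for lhs, rhs in patterns:
            if lower.startswith(lhs, i):
                results += [rhs + rest for rest in expand(i + len(lhs))]
        return results

    out = set(expand(0))
    out.discard(lower)  # the modern spelling itself is not a variant
    if word[:1].isupper():
        # Restore the capital initial (simplification: first character only).
        out = {v[:1].upper() + v[1:] for v in out}
    return out


def hypothetical_lexicon(modern_words, patterns=PATTERNS) -> dict[str, set[str]]:
    """Map every modern word to its set of potential historical spellings."""
    return {w: variants(w, patterns) for w in modern_words}
```

Because patterns are applied only to the modern word and never to generated variants, a rule such as l → ll cannot feed itself. The number of candidates still grows exponentially with the number of matching positions, so a full-scale lexicon would probably need a more compact representation.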





        Hypothetical lexicon for rule-based variants

  Rule-based historical variants can be mapped automatically to their
   equivalents in the modern vocabulary:
                       historic                modern
                       Geyst       =           Geist + (ei → ey)
                       Theile      =           Teile + (t → th)

  Far from all historical variants can be described by simple replacement rules:
                       historic                modern
                       frug        =           fragte + ?
                       Mirackel    =           ? + ?
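
The mapping back to the modern vocabulary can be sketched by undoing patterns on the historical form and keeping only results found in a modern lexicon. This is an assumption about one possible approach, not the method used in IMPACT; `PATTERNS`, `MODERN_LEXICON` and `analyses` are hypothetical names, and the toy lexicon holds just a few words from the slides.

```python
# Illustrative patterns (modern -> historical) and a toy modern lexicon.
PATTERNS = [("ei", "ey"), ("t", "th"), ("l", "ll")]
MODERN_LEXICON = {"geist": "Geist", "teile": "Teile", "teil": "Teil", "fragte": "fragte"}


def analyses(historical: str):
    """Return (modern form, patterns used) pairs that explain the spelling."""
    hist = historical.lower()

    def walk(i, modern, used):
        if i == len(hist):
            if modern in MODERN_LEXICON:
                yield MODERN_LEXICON[modern], tuple(used)
            return
        # Read one character as-is.
        yield from walk(i + 1, modern + hist[i], used)
        # Or undo a pattern whose historical side starts at position i.
        for lhs, rhs in PATTERNS:
            if hist.startswith(rhs, i):
                yield from walk(i + len(rhs), modern + lhs, used + [f"{lhs}→{rhs}"])

    return sorted(set(walk(0, "", [])))
```

Forms such as "frug" or "Mirackel" produce an empty result here, which matches the point of the slide: they cannot be reached by replacement rules and need manual treatment.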






        Coverage on Diachronic Corpus: hypothetic
Types (%)              1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                       1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words    15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds       5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8
Hypothetic             29.5   29.8   27.9   26.0   21.9   14.3   8.1    7.7    2.0

 16th century: about 30% of the vocabulary is covered by the lexicon of
  rule-based variants
 Applied as OCR lexicon via the IMPACT Abbyy External Dictionary Interface:
  improvement of recognition rate (published 2009)




             Coverage on Diachronic Corpus: missing
Types (%)              1500-  1550-  1600-  1650-  1700-  1750-  1800-  1850-  1900-
                       1549   1599   1649   1699   1749   1799   1849   1899   1949
Modern simple words    15.3   28.8   29.2   31.5   38.1   52.0   54.7   48.0   60.1
Modern compounds       5.1    6.1    6.9    8.6    7.13   15.5   20.6   28.1   27.8
Hypothetic             29.5   29.8   27.9   26.0   21.9   14.3   8.1    7.7    2.0
Missing                45.9   28.7   29.7   26.0   23.5   15.1   13.9   13.5   8.1

 Especially in the 16th century: up to 46% “difficult” vocabulary.
   → manually verified lexicon necessary!
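
How such coverage figures might be computed for one period can be sketched as below. The slides do not describe the exact procedure, so this is an assumption: each word type is counted once, the first matching category wins, and modern simple words and compounds are merged into one "modern" category for brevity. The function name `coverage` is hypothetical.

```python
from collections import Counter


def coverage(types, modern_lexicon, hypothetical_lexicon):
    """Share (in %) of word types per category; the first matching category wins."""
    categories = ("modern", "hypothetical", "missing")
    types = set(types)  # count each type once
    if not types:
        return {k: 0.0 for k in categories}
    counts = Counter()
    for t in types:
        if t in modern_lexicon:
            counts["modern"] += 1
        elif t in hypothetical_lexicon:
            counts["hypothetical"] += 1
        else:
            counts["missing"] += 1
    return {k: round(100 * counts[k] / len(types), 1) for k in categories}
```

The "missing" share is the residue that neither the modern lexicon nor the rule-based variants explain, which is the part the manually verified lexicon has to cover.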




                  Manually verified IR-lexicon: Structure

One entry contains:
          –      Historical word form from the corpus
          –      Corresponding modern word form
          –      Patterns if applicable
          –      Corresponding modern lemma
          –      At least one occurrence in the corpus as an attestation for the reading


 Manual assignment of modern word form and lemma
 Explicit handling of variants that are not rule-based
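
The entry structure listed above could be modelled roughly as follows. All class and field names are hypothetical, and the attestation is reduced to a document identifier plus a token position; the actual IMPACT data model is not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Attestation:
    document_id: str  # hypothetical identifier of the corpus text
    token_index: int  # hypothetical position of the token in that text


@dataclass
class LexiconEntry:
    historical_form: str
    modern_form: str
    lemma: str
    patterns: list[str] = field(default_factory=list)  # empty if not rule-based
    attestations: list[Attestation] = field(default_factory=list)

    def __post_init__(self):
        # The slide requires at least one corpus occurrence per entry.
        if not self.attestations:
            raise ValueError("an entry needs at least one attestation")

    def is_rule_based(self) -> bool:
        return bool(self.patterns)
```

A rule-based entry would look like `LexiconEntry("Theile", "Teile", "Teil", ["t→th"], [Attestation("doc-001", 42)])`, while "frug" would carry the modern form "fragte", the lemma "fragen" and an empty pattern list.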





            Manually verified IR-lexicon: Compilation

 Web-based, collaborative user interface
 User support:
          – For rule-based variants: suggestion of the corresponding modern word
            form by the hypothetical lexicon
          – Suggestion of all possible lemmas for the modern word form by a large
            modern lexicon (CISLEX)
          – Concordance list of the historical variant








                       Manually verified IR-lexicon: Status

 41,600 entries have been created for 24,800 historical word forms
  from the diachronic corpus; 72,100 attestations have been annotated.

 IMPACT partners in Slovenia and Bulgaria are creating corresponding
  lexica with an adapted version of the tool.








                                                                 Thank you.





IMPACT Final Conference - Language Parallel Sessions - Gotscharek
