BSB Demo Day - Gotscharek - Spezial-Lexika

Transcript of "BSB Demo Day - Gotscharek - Spezial-Lexika"

  1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
     Special lexica for making historical texts accessible (Spezial-Lexika zur Erschließung historischer Texte)
     Annette Gotscharek, Centrum für Informations- und Sprachverarbeitung, Ludwig-Maximilians-Universität München
     11 October 2011, BSB München, IMPACT Demo Day
  2. Special lexica for making historical texts accessible: "Erschließung"?
     OCR: obtain a textual representation of the document from the scan.
     Role of the lexicon: define the set of valid words (with probabilities).
     Example entries: ..., Teil (355,133), des (1,243,455), Lexikons (4,625), Lexika (512), ...
  3. Special lexica for making historical texts accessible: "Erschließung"?
     Information retrieval (IR): find the documents in a collection that are relevant to a user query.
     Role of the lexicon: expand the user query sensibly in order to increase recall.
     ...
     Lexikon → Lexika, Lexikons
     Teil → Teile, Teils, Teilen
     Geist → Geister, Geists, Geistern
     ...
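The query expansion described on slide 3 can be sketched roughly as follows. This is only an illustration: the dictionary `INFLECTIONS` and the function `expand_query` are hypothetical names, and the entries are the three examples shown on the slide, not the project's actual lexicon.

```python
# Hypothetical toy lexicon: base form -> further forms (examples from slide 3).
INFLECTIONS = {
    "Lexikon": ["Lexika", "Lexikons"],
    "Teil": ["Teile", "Teils", "Teilen"],
    "Geist": ["Geister", "Geists", "Geistern"],
}


def expand_query(terms, lexicon=INFLECTIONS):
    """Return the query terms plus all lexicon forms, order preserved, no duplicates."""
    expanded = []
    for term in terms:
        # Unknown terms are kept as they are; known terms pull in their forms.
        for form in [term, *lexicon.get(term, [])]:
            if form not in expanded:
                expanded.append(form)
    return expanded


print(expand_query(["Geist", "Teil"]))
```

Searching for every form in the expanded list instead of only the original term is what raises recall; precision may drop if the lexicon maps unrelated forms to the same entry.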
  4. The problem of historical language variation
     Historical spelling variants: geyſte → Geiste
     Obsolete vocabulary: mirackel → Wunder (?)
     Historical morphology: er frug → er fragte
     Obsolete character set: ſ → s, aͤ → ä, ...
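The character-set mapping on slide 4 (ſ → s, aͤ → ä) could be handled with a small normalisation step like the one below. It is a minimal sketch under assumptions: the function name `normalize_historical` is hypothetical, only the long s and the combining small e (U+0364) over a, o and u are covered, and a real mapping table would be larger.

```python
LONG_S = "\u017f"        # ſ
COMBINING_E = "\u0364"   # combining small letter e, as in aͤ

# Base vowel -> precomposed umlaut (assumed to be the intended modern reading).
UMLAUT = {
    "a": "\u00e4", "o": "\u00f6", "u": "\u00fc",
    "A": "\u00c4", "O": "\u00d6", "U": "\u00dc",
}


def normalize_historical(text):
    """Map long s to s and vowel + combining e to the modern umlaut."""
    out = []
    for ch in text:
        if ch == COMBINING_E and out and out[-1] in UMLAUT:
            # Merge the combining mark into the preceding vowel.
            out[-1] = UMLAUT[out[-1]]
        elif ch == LONG_S:
            out.append("s")
        else:
            out.append(ch)
    return "".join(out)


print(normalize_historical("gey\u017fte"))
```

This only removes the character-set difference; spelling variants such as ey for ei remain and need the lexical resources described on the following slides.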
  5. Adapted lexica for historical texts: structure
     OCR: ..., Teil (355,133), Theile (223,405), des (1,243,455), teyls (41,944), Lexikons (4,625), Lexicons (1,520), Lexika (512), frug (2,311), ...
     IR:
     ...
     Geist → Geister, Geists, Geistern, geyſte, geyſt, geyster
     Lexikon → Lexika, Lexikons, Lexicon, Lexica, Lexicons
     Teil → Teile, Teils, Teilen, Theyl, Theil, Theyls, Theilen
     ...
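Slide 2 speaks of words "with probabilities", while the OCR lexicon entries on slide 5 carry counts. One plausible reading, which is an assumption and not stated in the talk, is that the counts are frequencies that get normalised into relative frequencies. A sketch with the slide's numbers (`COUNTS` and `to_probabilities` are hypothetical names):

```python
# Counts as shown on slide 5; modern and historical forms side by side.
COUNTS = {
    "Teil": 355_133, "Theile": 223_405, "des": 1_243_455, "teyls": 41_944,
    "Lexikons": 4_625, "Lexicons": 1_520, "Lexika": 512, "frug": 2_311,
}


def to_probabilities(counts):
    """Turn absolute counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = to_probabilities(COUNTS)
for word in sorted(probs, key=probs.get, reverse=True):
    print(f"{word:10s} {probs[word]:.4f}")
```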
  6. Resources and special lexica for historical texts: diachronic ground-truth corpus (1500-1950); hypothetical lexicon for rule-based variants; manually verified lexicon; lexica for named entities.
  7. Resources and special lexica for historical texts: diachronic ground-truth corpus (1500-1950); hypothetical lexicon for rule-based variants; manually verified lexicon; lexica for named entities.
  8. Diachronic ground-truth corpus (1500-1950)
     Corpus compiled from various sources on the web and from non-public electronic corpora (IDS Mannheim).
     Large gap, particularly in the 16th and 17th centuries.
     With BSB: creation of an additional corpus from BSB documents.
     In total about 3,380,000 tokens from 4 centuries.
     Basis for various analyses and for lexicon construction.
  9. Resources and special lexica for historical texts: diachronic ground-truth corpus (1500-1950); hypothetical lexicon for rule-based variants; manually verified lexicon; lexica for named entities.
  10. Hypothetical lexicon: rule-based variants
      Regularly occurring substitution patterns explain, at the symbol level, the differences between modern and historical spelling: t → th, ei → ey, so teil → theyl.
      From the modern lexicon and the 140 patterns, the set of potential rule-based historical variants can be generated automatically (the "hypothetical lexicon").
  11. Hypothetical lexicon (diagram): a modern lexicon (..., Esel, Teil, ...) and a pattern set (e → eh, ei → ey, s → ß, l → ll, t → th, ...) yield the hypothetical lexicon (Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, ...; Teil, Teill, Teyl, Teyll, Tehill, Theil, ...).
  12. Hypothetical lexicon: rule-based variants
      Rule-based variants can be mapped automatically to their counterparts in the modern vocabulary:
      Geyst = Geist + (ei → ey)
      Theile = Teile + (t → th)
      By far not all historical variants can be derived with simple substitution rules:
      frug = fragte + ?
      Mirackel = ? + ?
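Slides 10 to 12 describe generating potential historical variants from a modern lexicon and a pattern set, and mapping variants back to modern forms. The sketch below illustrates that idea and is not the project's implementation: only three of the 140 patterns are included, words are lowercased, patterns are applied in a single left-to-right pass, and the function names are hypothetical.

```python
# Three example patterns taken from the slides (modern -> historical).
PATTERNS = [("t", "th"), ("ei", "ey"), ("l", "ll")]


def hypothetical_variants(word, patterns=PATTERNS):
    """Return {variant: patterns applied} for every way of rewriting `word`."""
    results = {}

    def walk(pos, built, applied):
        if pos == len(word):
            results.setdefault(built, applied)
            return
        # Option 1: copy the current character unchanged.
        walk(pos + 1, built + word[pos], applied)
        # Option 2: apply any pattern whose left side matches at this position.
        for left, right in patterns:
            if word.startswith(left, pos):
                walk(pos + len(left), built + right, applied + ((left, right),))

    walk(0, "", ())
    return results


def build_hypothetical_lexicon(modern_words, patterns=PATTERNS):
    """Map each generated variant to its modern source form(s) and patterns."""
    lexicon = {}
    for modern in modern_words:
        for variant, applied in hypothetical_variants(modern, patterns).items():
            if applied:  # skip the unchanged modern form itself
                lexicon.setdefault(variant, []).append((modern, applied))
    return lexicon


lexicon = build_hypothetical_lexicon(["teil", "teile", "geist"])
print(lexicon["theyl"])
print(lexicon.get("frug"))
```

Because every variant stores the modern form and the patterns that produced it, a lookup of `geyst` recovers `geist` plus (ei → ey), while forms such as `frug` are simply absent, which is the gap the manually verified lexicon addresses.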
  13. Coverage on the diachronic corpus
      Types (%)            1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
      Modern simple words       15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
      Modern compounds           5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
      Hypothetic                29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
      Use as a lexicon in OCR: improved recognition quality via the IMPACT ABBYY External Dictionary Interface (published 2009).
      Central resource for text and error profiling and in the post-correction system (cf. the talk by Ulrich Reffle).
  14. Coverage on the diachronic corpus
      Types (%)            1500-1549  1550-1599  1600-1649  1650-1699  1700-1749  1750-1799  1800-1849  1850-1899  1900-1949
      Modern simple words       15.3       28.8       29.2       31.5       38.1       52.0       54.7       48.0       60.1
      Modern compounds           5.1        6.1        6.9        8.6       7.13       15.5       20.6       28.1       27.8
      Hypothetic                29.5       29.8       27.9       26.0       21.9       14.3        8.1        7.7        2.0
      Missing                   45.9       28.7       29.7       26.0       23.5       15.1       13.9       13.5        8.1
      High proportion of "difficult" vocabulary before 1750, especially in the 16th century. A manually verified lexicon is necessary!
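A coverage table of this kind can be computed by assigning every distinct word type of a period to exactly one category. The following is a sketch under assumptions: categories are checked in the order of the table rows, the three word sets are given as ready-made Python sets, and how compounds are actually recognised is not stated in the talk.

```python
from collections import Counter

CATEGORIES = ("modern simple", "modern compound", "hypothetic", "missing")


def coverage(types, modern_simple, modern_compounds, hypothetical):
    """Return the percentage of distinct word types per category."""
    distinct = set(types)
    if not distinct:
        return {c: 0.0 for c in CATEGORIES}
    counts = Counter()
    for t in distinct:
        if t in modern_simple:
            counts["modern simple"] += 1
        elif t in modern_compounds:
            counts["modern compound"] += 1
        elif t in hypothetical:
            counts["hypothetic"] += 1
        else:
            counts["missing"] += 1
    return {c: 100.0 * counts[c] / len(distinct) for c in CATEGORIES}


# Toy data, invented for illustration only.
result = coverage(
    ["teil", "theil", "hausboot", "mirackel", "teil"],
    modern_simple={"teil"},
    modern_compounds={"hausboot"},
    hypothetical={"theil"},
)
print(result)
```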
  15. Resources and special lexica for historical texts: diachronic ground-truth corpus (1500-1950); hypothetical lexicon for rule-based variants; manually verified lexicon; lexica for named entities.
  16. Manually verified IR lexicon: structure
      An entry contains:
      - Historical word form from the corpus
      - Corresponding modern word form
      - Patterns, where applicable
      - Corresponding modern lemma
      - At least one passage from the corpus as evidence for the reading
      Manual assignment of modern word form and lemma.
      Explicit encoding of historical variants that are not rule-based.
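The entry structure listed on slide 16 could be modelled as a small record type. This is a hypothetical sketch: the class and field names, the document identifier and the attested passage are invented for illustration, and the project's actual data model is not described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Attestation:
    source: str   # hypothetical document identifier
    passage: str  # text passage supporting the reading


@dataclass
class LexiconEntry:
    historical_form: str
    modern_form: str
    lemma: str
    # Patterns stay empty for variants that are not rule-based (e.g. frug).
    patterns: list = field(default_factory=list)
    attestations: list = field(default_factory=list)

    def __post_init__(self):
        # Slide 16 requires at least one corpus passage per entry.
        if not self.attestations:
            raise ValueError("an entry needs at least one attestation")


entry = LexiconEntry(
    historical_form="frug",
    modern_form="fragte",
    lemma="fragen",
    attestations=[Attestation(source="example-doc", passage="er frug")],
)
print(entry)
```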
  17. Manually verified IR lexicon: creation
      Web-based, collaborative interface.
      The editor is supported by:
      - Suggestions of corresponding modern word forms for rule-based variants, from the hypothetical lexicon (theile → teile)
      - Suggestions of all possible lemmas for the corresponding modern word form, from the large modern lexicon CISLEX (teile → der Teil, das Teil, teilen)
      - A concordance of the variant being edited
  18. Current state of the IR lexicon
      On the diachronic corpus, 41,300 entries for 24,700 historical word forms have been created and 71,400 attestations annotated.
      IMPACT partners in Slovenia and Bulgaria are building corresponding historical lexica with an adapted version of the tool.
      Search engine with query expansion.
  19. Search engine with query expansion
  20. Resources and special lexica for historical texts: diachronic ground-truth corpus (1500-1950); hypothetical lexicon for rule-based variants; manually verified lexicon; lexica for named entities.
  21. Named entities (NEs)
      Words or multi-word lexemes that refer to an individual element of the real world (persons, geographic names, organisations).
      NEs are not contained in the general lexicon and are particularly problematic for OCR.
  22. Named entities
      Evaluation corpus: NE annotation of materials from, among others, the Austrian National Library (ONB).
      Keyed NE data from the ONB: 85 documents (address registers, gazetteers of place names), about 300,000 geographic entities; lexica of first names and surnames.
      Tests on NE recognition:
      - with local grammars (rule-based)
      - with a statistical classifier (machine learning)
  23. NE recognition: Reichsrat protocols (Reichsrat-Protokolle)
      Classifier           Recall  Precision          F
      Stat +train +lex      89.62      96.91      92.98
      Stat +train -lex      88.38      96.01      92.04
      Stat -train +lex      21.01      90.03      34.07
      Stat -train -lex      20.15      87.71      32.77
      RB +lex               70.49      85.02      77.07
      RB -lex               20.91      86.76      24.07
      Statistical (Stat) and rule-based (RB) classifiers.
      With special NE lexica (+lex) or without NE lexica (-lex).
      Trained on the Reichsrat corpus (+train) or on a general corpus (-train).
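The F column on slide 23 is not defined in the slides. Assuming it is the balanced F-measure, that is the harmonic mean of precision and recall, it can be computed as sketched below; the function name `f_measure` is hypothetical. The example uses the "Stat +train -lex" row.

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recall and precision of the "Stat +train -lex" row, in percent.
print(round(f_measure(88.38, 96.01), 2))
```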
  24. Search engine with NE highlighting
  25. Thank you.