Datech2014 Session 2 - Automated Assignment of Topics to OCRed Texts

  1. Automated Assignment of Topics to OCRed Historical Texts
     Florian Fink, Christoph Ringlstetter, Klaus U. Schulz
     CIS - Center for Information and Language Processing, University of Munich
  2. Motivation
     Standard (modern) repositories in libraries
       • documents come with metadata describing the subjects and topics covered in the texts (deep subject classification, e.g. UDC)
       • subjects are often the primary key for bringing order to large repositories
       • supporting users interested in particular fields
     OCRed historical texts from the digitization tsunami
       • mostly poor metadata, no subject classification
       • no order on the collection as a whole, only keyword search
       • no overview: what IS the collection about, what can I hope to find?
     Can we automatically find the subjects/topics covered?
  3. Automated topic assignment
     Task
       Automatically compute all topics/fields that adequately describe the contents of a given document, and add a hierarchical order to the topics.
     Challenges
       • huge number of topics and fields, encyclopedic coverage
       • hierarchical order, from general fields to very specific topics:
         science -> mathematics -> algebra -> group theory -> permutation groups
     Comparison: document classification
       • small number of given disjoint fields (e.g., politics, science, sports, ...)
       • task: find the best label(s) for a document
     Not only a „replacement“ for manual topic assignment but
  4. New Visions!
       • assigning topics to document parts on all levels of granularity (chapters, pages, paragraphs, ...)
       • horizontal access: automated linking of documents and document parts using the topics found
       • detecting „topic reuse“, parallelisms and differences across repositories and subrepositories
       • time lines & trend analysis
       • ...
  5. Method used
     TopicZoom
       • university spin-off founded by our group in 2008
       • topic assignment to texts (head hunting, trend analysis, ...)
     Background technology
       • huge semantic net: 120,000 nodes (topics, persons, organizations, events, geographic locations, time periods)
       • ordered as a directed acyclic graph
       • topic names come with linguistic variants; many multi-word expressions
       • German (main focus) and English
     Free web service
       • users send texts (manually or via an XML interface)
       • receive the topics found in the texts
       • topics are ranked using two relevance scores
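The directed acyclic graph mentioned on the slide above can be pictured with a small sketch. Everything here is an assumption made for illustration: the `PARENTS` mapping, its edges, and the helper `more_general_topics` are hypothetical and are not TopicZoom data or its API. The sketch only shows how a specific topic leads upward to more general ones in a graph where a node may have several parents.

```python
from collections import deque

# Hypothetical toy fragment of a topic graph: each topic maps to its more
# general parent topics. The edges are illustrative, not TopicZoom data.
PARENTS = {
    "South Africa": ["African countries"],
    "African countries": ["Africa"],
    "Elections": ["Political events"],
    "Political events": ["Politics"],
    "Africa": [],
    "Politics": [],
}


def more_general_topics(topic, parents=PARENTS):
    """Return all ancestors of `topic`, nearest first (breadth-first walk)."""
    seen = []
    queue = deque(parents.get(topic, []))
    while queue:
        node = queue.popleft()
        if node not in seen:  # a node can be reached via several parents
            seen.append(node)
            queue.extend(parents.get(node, []))
    return seen


print(more_general_topics("South Africa"))
```

A breadth-first walk is used so that closer generalizations come first; a real semantic net of 120,000 nodes would more likely precompute or cache these ancestor sets.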
  6. Example
     Weight  Degree of Generality  Significance  Topic
     1       8                     7.31492196    South Africa
     1       4                     5.26957792    Elections
     1       7                     4.60475280    African countries
     1       6                     4.45792943    Africa
     1       3                     3.91069886    Political events
     1       2                     1.84870472    Politics
     “The 2014 South African general election will be held on 7 May 2014 to elect a new National Assembly and new provincial legislatures in each province.” (Wikipedia)
  7. Questions asked
     Can this technology be used to bring order to collections of OCRed historical texts?
       • How is topic assignment affected by OCR errors?
       • How is topic assignment affected by historical orthography?
       • Is the TopicZoom hierarchy („modern topics“) suitable for topics found in historical texts?
  8. Historical corpus - Zedler lexicon
     Johann Heinrich Zedler, „Grosses vollständiges Universallexicon aller Wissenschafften und Künste“ (Great Complete Encyclopedia of All Sciences and Arts)
       • largest and most famous 18th century German encyclopedia
       • 64 volumes plus four supplements
       • ca. 284,000 articles
       • 63,000 two-column pages
       • article sizes extremely unbalanced
       • accessible on the web
     Images (tif) received from the Bavarian State Library
  9. Experiment
       • started with scans from 14 pages of Zedler
       • prepared three versions:
           1. OCRed page (Finereader)
           2. ground truth
           3. ground truth with modernized orthography
       • manually assigned topics to the 14 pages
       • automated topic assignment for the three versions of each page
       • looked at recall and precision obtained for the three page versions
       • analysis of results and problems
     OCR quality
       • percentage of correctly recognized words (tokens): average 75.03%; for words of length > 3: 71.12%
       • for OCR versus ground truth with modernized orthography: average 68.37%; for words of length > 3: 62.31%
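The OCR quality figures above are percentages of correctly recognized tokens. The slides do not say how OCR tokens were aligned with the ground truth, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, alignment with Python's standard `difflib.SequenceMatcher`, and a hypothetical helper name `token_accuracy`. The sample sentences are invented.

```python
from difflib import SequenceMatcher


def token_accuracy(ocr_text, truth_text, min_length=0):
    """Share of ground-truth tokens longer than `min_length` that the OCR
    output reproduces exactly (assumed alignment: difflib matching blocks)."""
    ocr = ocr_text.split()
    truth = truth_text.split()
    matcher = SequenceMatcher(a=truth, b=ocr, autojunk=False)

    # Collect the positions of ground-truth tokens that were matched.
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))

    relevant = [i for i, tok in enumerate(truth) if len(tok) > min_length]
    if not relevant:
        return 0.0
    return sum(i in matched for i in relevant) / len(relevant)


# Invented sample: two of seven tokens are misrecognized.
truth = "the witness must appear before the court"
ocr = "the wjtness must appear bcfore the court"
print(token_accuracy(ocr, truth))                # all tokens
print(token_accuracy(ocr, truth, min_length=3))  # only tokens of length > 3
```

Restricting the measure to longer tokens, as on the slide, removes short function words that OCR tends to get right, which is one plausible reason the reported values for length > 3 are lower.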
  10. Zedler manually assigned topics
      Average: 25 topics assigned per page
        • Main topic (lemma) „Zeugen“ (witnesses):
          law and justice, contracts, last will, marriage, rights, courts, judges, handicapped persons, laws, children, teenagers, corruption, civil law, childhood, adolescence.
        • Several lemmata:
          peoples, plague, language, gypsies, eviction, paper production, hunting helpers, hunting, mines, mining, grammar, rhetoric, Zeugma (city), bridges, Roman Empire, Romans, Euphrates, Alexander the Great, nations, France, Spain, Netherlands.
        • Main topic (lemma) historiography („giving witness“):
          history, historiography, historians, Heinrich Cornelius Agrippa, Jews, diluvian, genesis, Adam and Eve, biblical figures, Persia, Romulus and Remus, Jesus Christ, Arabs, Koran, Bible, fables, Mecca, mosques, the Franks, Christianity, paganism, Plutarch.
        • ...
  11. Recall – average values
      Recall: percentage of manually assigned topics found among the computed topics
      [Chart: average recall for the OCRed, ground truth, and modernized ground truth versions at TopicZoom significance thresholds 1.0, 0.6, 0.3, and 0.0]
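The recall definition above, evaluated at several significance thresholds, can be sketched as follows. The function name `recall_at_threshold`, the choice of `>=` for the cut-off, and all topic scores are assumptions for illustration; the slides give neither the exact comparison nor per-topic values.

```python
def recall_at_threshold(manual_topics, computed, threshold):
    """Share of manually assigned topics found among the computed topics
    whose significance is at least `threshold` (assumed: >= comparison).

    `computed` maps each computed topic to its significance value."""
    kept = {topic for topic, score in computed.items() if score >= threshold}
    manual = set(manual_topics)
    if not manual:
        return 0.0
    return len(manual & kept) / len(manual)


# Invented example values, not taken from the experiment.
manual = ["law and justice", "courts", "judges", "children"]
computed = {"courts": 1.2, "judges": 0.5, "childhood": 0.8, "rivers": 0.1}

for threshold in (1.0, 0.6, 0.3, 0.0):
    print(threshold, recall_at_threshold(manual, computed, threshold))
```

Lowering the threshold can only keep or raise recall, since more computed topics enter the answer set; the cost shows up in precision.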
  12. Notion of recall not fully adequate
      Often, for a missed topic, a closely related one is found in the answer set.
      E.g. page 1: the topics “children” and “teenagers” are missed, but “childhood” and “adolescence” are found.
      Intuitively, the “felt recall” is larger than the computed recall.
      [Figure: manually assigned spatial areas vs. computed spatial areas; recall: 20%]
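One way to make the “felt recall” idea concrete is to count a manually assigned topic as found when a closely related topic was computed. This is a sketch of that idea only, not a measure used in the talk; the `related` mapping and the function `relaxed_recall` are hypothetical, and which topics count as “closely related” is a judgment call.

```python
def relaxed_recall(manual_topics, computed_topics, related):
    """Recall where a manual topic counts as found if it, or any topic
    listed as related to it, is among the computed topics."""
    manual = set(manual_topics)
    computed = set(computed_topics)
    if not manual:
        return 0.0
    hits = sum(
        1
        for topic in manual
        if topic in computed or related.get(topic, set()) & computed
    )
    return hits / len(manual)


# Topic names follow the page 1 example; the relatedness map is assumed.
manual = ["children", "teenagers", "courts", "plague"]
computed = ["childhood", "adolescence", "courts"]
related = {"children": {"childhood"}, "teenagers": {"adolescence"}}

print(relaxed_recall(manual, computed, {}))       # strict recall
print(relaxed_recall(manual, computed, related))  # relaxed recall
```

With an empty relatedness map the function reduces to the strict recall, which makes the gap between the two numbers easy to inspect.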
  13. Real problems for recall
        • Very rare topics
          Zedler treats rare topics such as “civet” and “camphor” that are not represented in the TopicZoom semantic net.
        • Changing world
          Topics from parts of the world that have changed dramatically: old professions, habits, techniques, etc.
          E.g. “paper production”, “hunting helpers”, “perfume manufacture”, “potency means”, “brick oil”.
          Many old professions (“Drechsler”, turner) are now very popular family names.
  14. Average precision values
      [Chart: shares of correct, questionable, and wrong topics for the OCR, ground truth, and ground truth with modernized orthography versions]
      Threshold: significance 0.6
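The three-way judgment behind these precision values (correct, questionable, wrong) can be tallied as in the sketch below. The function `precision_summary` and the sample judgments are hypothetical; the slides do not state whether “questionable” topics count toward precision, so the shares are reported separately.

```python
from collections import Counter

LABELS = ("correct", "questionable", "wrong")


def precision_summary(judgments):
    """Return the share of computed topics per judgment label."""
    counts = Counter(judgments)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Invented judgments for the topics computed on one page.
judgments = ["correct"] * 6 + ["questionable"] * 2 + ["wrong"] * 2
print(precision_summary(judgments))
```

A strict precision is then the “correct” share alone, and a lenient one adds the “questionable” share.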
  15. Problems for precision
        • Wrong time periods
          OCR had problems recognizing years -> wrong time periods assigned
        • Wrong resolution of ambiguous words
          Words of the texts confused with the names of small villages -> several wrong topics
        • Language changes beyond the level of orthography
            • e.g. “Flüsse” (rivers) used twice for liquids of the nose and the eyes -> several wrong topics (rivers and more general geographic objects)
            • e.g. “Verstopfung” (main modern meaning: constipation) referring to problems of the brain, nose, and ears (an interpretation hardly found in modern texts) -> several wrong topics, all related to diseases of the digestive tract
            • e.g. “Blattern” used for a problem of the eyes. In modern language and in the TopicZoom net, “Blattern” is a synonym for “Pocken” (smallpox) -> several wrong topics
  16. Résumé
      Unavoidable subjectivity of evaluation
        • manually assigned topics
        • classifying computed topics into correct, questionable, wrong
      Do not rely primarily on the numbers! Form your own impression!
      Automated topic assignment
        • valuable and useful if some errors are considered acceptable
        • insufficient if errors cannot be tolerated
        • combination with social tagging (e.g., error elimination)?
      Significant improvements, in particular for precision, would be possible with minor modifications of the underlying semantic net
  17. Future work
        • extend empirical basis
        • realize easy improvements
        • combine with social tagging
        • look at new visions
            • assigning topics to document parts
            • interlink documents based on topical similarity
            • detection of topic parallelism
            • time line analysis and topic trends
  18. Thanks for your attention!
      ... special thanks to the Bavarian State Library ...
