
READ Presentation for the #DHMASTERCLASS18 at the DHI Paris

Applying Handwritten Text Recognition (HTR) for text recognition and keyword spotting using Transkribus.

Published in: Education

  1. Transkribus Workshop, DHI Paris: Tobias Hodel (Zürich), Kanton Zürich, Direktion der Justiz und des Innern, Staatsarchiv
  2. What is READ: Recognition and Enrichment of Archival Documents
     • Making archival (esp. handwritten) documents more accessible
     • Research infrastructure: Transkribus
     • Funded until mid-2019 by the European Union (H2020)
     • 15 European partners
  3. Research perspectives of READ
     • Recognition of layout and text structures
     • Recognition of handwriting (Handwritten Text Recognition)
     • Text recognition with dictionaries
     • Writer identification
     • Best practices for recognition of large amounts of documents
     • Digital Humanities in archives and scholarly practices
  4. Program of the workshop
     • 11:30-11:45 Introduction to READ and Transkribus
     • 11:45-12:45 Using Transkribus
     • 12:45-13:00 Adding your documents to the mix
     • 13:00-14:00 Lunch
     • 14:00-15:30 Keyword spotting, training of HTR models and layout analysis, crowdsourcing, best practices
  5. READ partners
     • University of Innsbruck (co-ordinator / Austria) → Transkribus
     • Universitat Politecnica de Valencia (Spain) → HTR
     • University College London (United Kingdom) → Dissemination, e-learning
     • National Center for Scientific Research “Demokritos” (Greece) → Layout analysis
     • Democritus University of Thrace (Greece) → Layout analysis
     • University of London Computer Centre (United Kingdom) → Web interface
     • Vienna University of Technology (Austria) → Layout analysis, writer identification, ScanTent
     • University of Rostock (Germany) → HTR, layout analysis
     • Leipzig University (Germany) → Dictionaries
     • Naver Labs (France) → Document understanding
     • Ecole Polytechnique Federale de Lausanne (Switzerland) → Large-scale demonstrator
     • National Archives Finland (Finland) → Large-scale demonstrator
     • Passau Diocesan Archives (Germany) → Large-scale demonstrator
  6. Projects with a Memorandum of Understanding
  8. Automated Text Recognition?
  9. Automated Text Recognition
     • Machine learning using neural networks
     • Processes writing line by line rather than character by character
     • Needs to be trained by being shown document images together with their transcripts
     • More training data leads to more accurate recognition
     • Create a model to transcribe and search a collection of documents
  10. Bentham model
      • Based on Jeremy Bentham’s papers (18th-19th century English)
      • Written by Bentham and his secretaries
      • Trained on 896 pages, using transcripts submitted by volunteers
      • A Character Error Rate (CER) of 5-10% is possible
  11. Text Recognition: What to expect (Character Error Rate)
      • 1 writer, 150 pages of training material: 10%
      • 1 writer, 450 pages of training material: 4.4%
      • Same writer, 10 years later, without training material: 9.2%
      • 1 writer, 1132 pages of training material: 3%
  12. Recognising printed text
      • Neural networks can also process printed text, and with less training data
      • Transcribe documents or use the OCR engine in Transkribus
      • Use these transcripts to train a model
      • Results with 1-2% CER are possible
  14. (Web-)Interfaces
      • Transcription (beta)
      • Crowdsourcing (beta)
      • Correction (beta)
      • Search/extract (under development)
      • E-Learning
      • ScanApp
      • Preview of Transcription WebUI: https://transkribus.eu/longan/sandbox/transcriber/?test=0&colId=20688&docId=74458&pageId=1
  15. Contact
      • Transkribus: transkribus.eu / email@transkribus.eu / transkribus.eu/wiki/
      • READ: read.transkribus.eu (also for news)
      • Staatsarchiv Zürich: tobias.hodel@ji.zh.ch
  16. Up Next: Transkribus
      • Register: create username/login
      • 10 Steps Guide
      • 10 Steps Video
      • Transkribus Wiki
      • Please fill out our feedback form: http://bit.ly/dhd2018
  17. Keyword Spotting
      • By UPVLC (Bentham writings): http://prhlt-carabela.prhlt.upv.es/bentham/
      • Live demo in Transkribus: Bundesratsprotokolle
  19. Transkribus and EVT (Edition Visualization Technology)
      • Transkribus export to EVT (by HumaReC): http://humarec-viewer.vital-it.ch
  20. Transkribus WebUI
      • Beta test: https://transkribus.eu/read/library/
      • Alpha test: https://transkribus.eu/readTest/
  21. Contact
      • Transkribus: transkribus.eu / email@transkribus.eu / transkribus.eu/wiki/
      • READ: read.transkribus.eu (also for news)
      • Staatsarchiv Zürich: tobias.hodel@ji.zh.ch
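
The slides on the Bentham model, on what to expect from text recognition, and on printed text all quote results as a Character Error Rate (CER). The slides do not say how the figure is computed. A common definition, assumed here, is the character-level edit distance between the recognised text and a reference transcript, divided by the length of the reference. The sketch below illustrates that definition only; the function names are hypothetical and this is not the evaluation code used by Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character in a 12-character word.
    cer = character_error_rate("Staatsarchiv", "Staatsarchlv")
    print(f"{cer:.1%}")  # prints "8.3%"
```

Read this way, a CER of 3% means roughly three character edits per hundred characters of reference text.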
