IMPACT Final Event 26-06-2012 - Overview of IMPACT tools by: ABBYY, NCSR Demokritos, University of Salford, IBM, University of Innsbruck, LMU University of Munich, INL Institute for Dutch Lexicology and KB
 


    Presentation Transcript

    • Overview of IMPACT tools
    • Baseline: FineReader XIX. First omnifont OCR for Fraktur and Old European scripts; special dictionaries for Old European languages; available as a result of the METAe project; baseline for the IMPACT project.
    • IMPACT project: improvements at every step. Image pre-processing: better binarisation. Layout analysis: general quality improvements; better segmentation of historical newspapers. Recognition: new classifiers; better gothic recognition, including new graphemes; Old Slavonic script supported. Export: ALTO XML; PDF MRC (smaller file size with the same image quality). Extensibility: extended language support; training API. Licensing and pricing: special XIX pricing for the IMPACT project.
    • Much better quality
    • Technology availability: FineReader Engine 9, 10 and 11; Recognition Server 3.0; FineReader Online (new); Cloud OCR SDK (new).
    • Thank you!
    • Text recognition. Improved recognition of pre-trained gothic fonts; new graphemes. DoW: "Significantly improved performance and substitution rate especially for historical texts." Test results: 20% to 30% recognition quality improvement. New language: Old Church Slavonic. Test results: about 5% recognition errors.
    • More: native ALTO support. Available in FineReader Engine > 10 R2 and Cloud OCR SDK.
    • More: PDF MRC. The same image quality with much lower file size. Original: JP2, 251 KB. PDF MRC: 53 KB.
    • IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Tools developed by the National Center for Scientific Research (NCSR) "Demokritos". Image enhancement tools: Border Removal Tool. It detects and removes noisy black borders as well as noisy text regions. Moreover, it detects the optimal page frames of double-page document images. Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/ (IMPACT event: Project outcomes | 26 June 2012: NCSR "Demokritos" Tools, B. Gatos)
    • Image enhancement tools: Border Removal Tool. Image sets: (SET-A) 38718 randomly selected historical images with and without noisy black borders; (SET-B) 22383 images with noisy black borders, a subset of SET-A; (SET-C) 3467 double-page historical images. Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
    • Image enhancement tools: Page Curl Correction Tool. It rectifies document images which suffer from warping and perspective distortions that deteriorate the performance of OCR. Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
    • Image enhancement tools: Page Curl Correction Tool. Image set: 420 randomly selected historical document images. Download demo: http://www.iit.demokritos.gr/~bgat/H-DocPro/
    • OCR evaluation toolkit. Download: http://www.iit.demokritos.gr/~bgat/OCREval/ Measures: Character Accuracy; Word Accuracy; Rejection Rate; Characters Marked; Accuracy after Correction; Accuracy by Character Class; Figure of Merit. Based on the isri-ocr-evaluation-tools (Information Science Research Institute, University of Nevada; source code from http://code.google.com/p/isri-ocr-evaluation-tools/; code license: Apache License 2.0). Usage: OCREval.exe [a,u8,u16] stop_words.txt (or "-") gt.txt (or batchgt.txt) ocr.txt (or batch.txt) character_report.txt word_report.txt overall_report.xml
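The character and word accuracy measures listed on the OCR evaluation slide can be illustrated with a small sketch. This is not the OCREval or ISRI implementation: it assumes the common simplified definition accuracy = (n - errors) / n, where n is the length of the ground truth and errors is the edit distance to the OCR output. The function names are made up for this example, and the real tools treat word accuracy and stop words in more detail.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """(n - errors) / n, where n is the length of the ground truth."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


def character_accuracy(gt_text, ocr_text):
    # Compare the two texts character by character.
    return accuracy(gt_text, ocr_text)


def word_accuracy(gt_text, ocr_text):
    # Compare whitespace-separated tokens instead of characters.
    return accuracy(gt_text.split(), ocr_text.split())
```

With this definition, a single wrong character in a ten-character ground truth gives a character accuracy of 0.9, while one wrong word out of three gives a word accuracy of 2/3.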
    • Word Spotting Application. It provides an integrated GUI for indexing historical documents without an OCR engine. It allows searching the database for instances of a query keyword using several different methods (list, query by example, free text). Download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/
    • Word Spotting Application. Benchmarking set: French book, BnF, year of publication 1838 (153 pages); German book, BSB, year of publication 1788 (126 pages). Download demo: http://www.iit.demokritos.gr/~bgat/WordSpot/
    • Scientific impact. Part of the PhD thesis of N. Stamatopoulos, who was awarded the "Best PhD Thesis of 2011 Award in the area of Informatics and Telematics" in Greece. Publications in a great number of widely recognized journals and international conferences (impact factors shown: 2.918, 1.091, 1.474). IMPACT-related publications have 127 citations from non-IMPACT researchers (information calculated from Scopus).
    • IMPACT Project Outcomes, 26 June 2012, The Hague. IMPACT Image and Ground Truth Repository: Content Management System; datasets (667,437 images; 52,880 PAGE ground truth files); invaluable resource for planning, experimentation and evaluation relating to digitisation projects; access to existing material via CoC; further development and usage in other projects and domains (Europeana Newspapers).
    • Aletheia: semi-automated tool for ground truth and showcase production; full PAGE support; page border, print space; layout regions (incl. metadata); text lines, words and glyphs; Unicode text at all levels; reading order, layers etc.; quality control.
    • Layout Evaluation Tool: performance evaluation of page segmentation and region classification; configuration of scenarios based on different types of errors (miss / partial miss, split, misclassification, merge, false detection).
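The error types named on the Layout Evaluation Tool slide (miss, split, merge, false detection) can be made concrete with a toy example. The sketch below is an assumption-laden simplification, not the PRImA tool's method: regions are axis-aligned rectangles rather than polygons, any overlap counts as a match, and partial misses and misclassifications are ignored. `Box`, `overlaps` and `classify_errors` are hypothetical names.

```python
from collections import namedtuple

# Axis-aligned rectangle with x0 < x1 and y0 < y1 (an assumption of this sketch).
Box = namedtuple("Box", "x0 y0 x1 y1")


def overlaps(a, b):
    """True if the two rectangles share a non-empty area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def classify_errors(ground_truth, detected):
    """Count miss, split, merge and false-detection errors."""
    errors = {"miss": 0, "split": 0, "merge": 0, "false_detection": 0}
    for gt in ground_truth:
        hits = sum(overlaps(gt, d) for d in detected)
        if hits == 0:
            errors["miss"] += 1    # no detected region covers this one
        elif hits > 1:
            errors["split"] += 1   # one true region cut into several
    for d in detected:
        hits = sum(overlaps(d, gt) for gt in ground_truth)
        if hits == 0:
            errors["false_detection"] += 1  # detected region with no counterpart
        elif hits > 1:
            errors["merge"] += 1            # several true regions fused into one
    return errors
```

For instance, three stacked ground-truth blocks where the detector fuses the first two, misses the third and reports one region in empty space yield one merge, one miss and one false detection.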
    • Text Line and Word Segmentation: modules for integration in workflows and/or other tools; independent of OCR; used in Aletheia, Dewarping, T-OCR and Word spotting.
    • Dewarping Tool: correction of arbitrary warping artefacts; two-stage process (model detection, image transformation); improved legibility; increased OCR accuracy.
    • More information: PRImA http://www.primaresearch.org; IMPACT CoC http://www.digitisation.eu
    • IBM Labs in Haifa. IBM's highlights. CONCERT (COoperative eNgine for Correction of ExtRacted Text): IBM's collaborative correction platform announced in August 2010 (http://www-03.ibm.com/press/us/en/pressrelease/32380.wss); IBM's Adaptive OCR integrated; EE2/EE3 dictionaries integrated (Dutch, German, Spanish). Sophisticated monitoring mechanism integrated 2011: designed for both malicious users and game users. CONCERT Games introduction: IMPACT Conference 2011; beyond the productivity tools. Adaptive OCR System: font and language independent.
    • CONCERT. Adaptive collaborative correction platform: uses the feedback from the users to improve productivity; fully connected to the Adaptive OCR Engine. Strong emphasis on productivity tools: reduce the time for verification/correction; patented smart-key approach; motivate volunteers. Separating the data entry process into several complementary tasks: optimized application dedicated to each task; break down the tasks into subtasks; make it suitable for parallel processing; online compilation. User monitoring: robust against malicious users and designed for gamification.
    • Adaptive OCR Engine. Consistent and reliable confidence level: important for quality assurance. No use of prior knowledge on the font: crazy fonts can be handled. Good use of the feedback from the users: character and word level. Robust to distortion: page-level distortion and printing variations. Easy to migrate between books from the same publisher: continuous update. Not too slow: around 2-3 times slower than OMNI engines.
    • What is next? Search over OCR: beyond transcription. Improve user feedback: online advisor; best performers list. Community building around content: integrate community tools within the platform. Complete CONCERT Games: iPhone/iPad/Android/Desktop. E-book creation: fully digital transcription; using the original font as an option. Page distortion correction: fully integrate the word-based page distortion correction.
    • The Functional Extension Parser (FEP): A Document Understanding Platform. Günter Mühlberger, University of Innsbruck, Department for German Language and Literature Studies.
    • Functional Extension Parser. A book is more than just pure text: it contains a lot of structural metadata. These metadata are (often) encoded in the layout of a document: size of characters, position on page, distance to other lines, etc. are used to express structural meaning such as headlines, footnotes, caption lines, etc. FEP is a platform to process digitised or born-digital documents and to "understand" the meaning of the layout by using a rules engine: modular approach; images and OCR as input; Web GUI for visualisation, editing and correction; evaluation module; export module (METS/ALTO, PDF, ePUB, ...). Several pilots: EOD Network (enhancement of PDF eBooks of historical books); DNB (extraction of metadata from title pages); internal project (card index). FEP is available for free for research applications; commercial deployment is foreseen via the technology transfer platform of the University of Innsbruck.
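The idea of reading structural meaning from layout features with a rules engine can be sketched in a few lines. Everything below is hypothetical and is not FEP's actual rule language or data model: the `Line` fields, the thresholds (1.3 times the body size, bottom 15% of the page) and the labels are assumptions chosen only to show how character size and page position could be mapped to labels such as headline or footnote.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # in points
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom


def body_font_size(lines):
    """Most frequent font size on the page, taken as the body text size."""
    sizes = [line.font_size for line in lines]
    return max(set(sizes), key=sizes.count)


# Hypothetical rules: (label, predicate over a line and the body font size).
# The first rule that matches wins.
RULES = [
    ("headline", lambda line, body: line.font_size >= 1.3 * body),
    ("footnote", lambda line, body: line.font_size < body and line.y > 0.85),
]


def label_lines(lines):
    """Attach a structural label to every line; default is 'paragraph'."""
    body = body_font_size(lines)
    labelled = []
    for line in lines:
        label = next((name for name, test in RULES if test(line, body)), "paragraph")
        labelled.append((label, line.text))
    return labelled
```

Keeping the rules as data rather than hard-coded branches is what makes such an approach modular: a new document type needs a new rule list, not new code.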
    • Postcorrection Tool. Ludwig-Maximilians-Universität München, Centrum für Informations- und Sprachverarbeitung. The Hague, June 26th 2012. Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz, Thorsten Vobl.
    • Postcorrection Tool as Carrier of IMPACT Technology. Advanced language technology used for improvement of interactive postcorrection: lexica, matching tool and profiler integrated as background technology; document-centric knowledge from unsupervised analysis of the OCRed document used for detection of error classes and suggested corrections; batch mode for correction of many errors in "one shot". Rich graphical user interface to let users fully benefit from "knowledge" on document-derived error classes.
    • Improved Lexicon Technology. Valid historical words not marked as errors; historical variants proposed as correction candidates.
    • Improved Workflow - Batch Processing. Strings with identical error patterns corrected as a batch.
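Correcting strings with identical error patterns as a batch can be illustrated as follows. This is a minimal sketch, not the LMU tool's profiler: it assumes the suggested corrections are already given, and it only recognises patterns that consist of exactly one substituted character. All function names are made up for the example.

```python
from collections import defaultdict


def error_pattern(ocr_word, correction):
    """Return 'x->y' if the words differ by exactly one substituted character, else None."""
    if len(ocr_word) != len(correction):
        return None
    diffs = [(o, c) for o, c in zip(ocr_word, correction) if o != c]
    if len(diffs) != 1:
        return None
    return f"{diffs[0][0]}->{diffs[0][1]}"


def group_by_pattern(suggestions):
    """Group (ocr_word, suggested_correction) pairs by their error pattern."""
    groups = defaultdict(list)
    for ocr_word, correction in suggestions:
        pattern = error_pattern(ocr_word, correction)
        if pattern is not None:
            groups[pattern].append((ocr_word, correction))
    return dict(groups)


def apply_batch(tokens, groups, pattern):
    """Replace every token covered by the chosen pattern in one step."""
    fixes = dict(groups.get(pattern, []))
    return [fixes.get(token, token) for token in tokens]
```

A user who confirms the pattern `b->h` once then fixes "tbe" and "tbat" together, while tokens belonging to other patterns stay untouched until their own pattern is confirmed.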
    • Improved Selection of Correction Candidates. Ranking of candidates through document-specific language and error profile.
    • Lexica and Strategies and Tools for Lexicon Building & Lexicon Deployment in OCR and Retrieval. Historical Language. Available through the IMPACT Centre of Competence: www.digitisation.eu
    • Lexica for 9 languages. Example: Dutch IR lexicon (will be available as web service): WNT (Dictionary of the Dutch language), 475,498 distinct word forms, 215,180 lemmata, and 558,438 distinct lemma/word form combinations, with 1,636,709 attestations. Dutch OCR lexicon (1): based on a large historical corpus from the DBNL (Digitale Bibliotheek voor de Nederlandse Letteren, Digital Library for Dutch Literature). Dutch OCR lexicon (2): based on the IR lexicon (WNT) [best results in CCS experiment]. Dutch Named Entity lexicon (persons, locations, organizations). Set of historical spelling variation rules: using a reference set of 10,000 pairs of historical words taken from the WNT-based lexicon to which parallel modern words have been added.
    • Lexica in OCR: Results for 9 languages
    • Lexica in OCR: External Dictionary interface. Function of the tool: enable addition of (IMPACT) special lexica to the FineReader engine. Availability: free for non-commercial use.
    • Lexica in OCR: OCR Evaluation Tool. Function of the tool: evaluate the influence of lexica on OCR, measure OCR (word) accuracy. Availability: free for non-commercial use, IMPACT Interoperability Framework.
    • Lexica in Retrieval: BlackLab. Function of the tool: enable use of lexica and linguistic annotation in a Lucene-based search engine: deployment of IMPACT IR lexica in query expansion; keyword-in-context display of search results; grouping and sorting of search results; support for linguistic annotation (lemma, part of speech, etc.); exploitation of named entity tagging; incremental indexing; supports CLARIN standards (CQL). Availability: free for non-commercial use.
    • Toolbox for lexicon building and deployment. Function of the tools: support lexicon building for historical language: corpus-based lexicon building; lexicon building from dictionary quotations; spelling variation (pattern extraction and matching); lemmatization. Availability: free for non-commercial use.
    • CoBaLT corpus-based lexicon building tool. Function of the tool: productivity tool for IR lexicon building from corpora; both for pre-lemmatized and unannotated corpora; web-based. Availability: free for non-commercial use.
    • CoBaLT corpus-based lexicon building tool
    • Dictionary attestation tool. Function of the tool: productivity tool for IR lexicon building from dictionary quotations; manual evaluation and correction of large quantities of automatically matched occurrences of a headword (scripts available) in the quotations of the particular article in a comprehensive dictionary; high productivity: revision speed (OED) 400-600 entries/hour; web-based. Availability: free for non-commercial use.
    • Dictionary attestation tool
    • Spelling variation and lemmatization tools. Function of the tools: match (inflected) historical spellings of words to the modern dictionary headword form; derive spelling variation patterns from example material. Availability: free for non-commercial use.
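Matching historical spellings to modern headwords with spelling variation rules can be sketched as a rewrite-and-look-up step. The rules and the tiny lexicon below are illustrative assumptions, not the IMPACT rule set or the WNT-based lexicon, and the sketch covers spelling variation only, not inflection.

```python
# Hypothetical rewrite rules: (historical substring, modern substring).
RULES = [("ae", "aa"), ("gh", "g"), ("ck", "k"), ("sch", "s")]


def candidates(word, rules=RULES):
    """All spellings reachable by applying the rules, each to all occurrences at once."""
    found = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for old, new in rules:
            rewritten = current.replace(old, new)
            if rewritten not in found:
                found.add(rewritten)
                frontier.append(rewritten)
    return found


def match_modern(word, modern_lexicon, rules=RULES):
    """Modern headwords that the historical spelling can be rewritten to."""
    return sorted(candidates(word.lower(), rules) & set(modern_lexicon))
```

Because the unmodified word is always among the candidates, a word that is already in modern spelling still matches itself, and a word with no match simply returns an empty list.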
    • NERT named entity recognition tool. Function of the tool: mark up named entities in text. Extension of the Stanford NE recognizer with extensions for spelling variation and OCR errors. Availability: free for non-commercial use.
    • Named entity matching tool. Function of the tool: match named entities with historical spelling to standard form. Availability: free for non-commercial use.
    • Named entity attestation tool. Function of the tool: manual evaluation and correction of automatically matched occurrences of named entities in text material. Web-based. Availability: free for non-commercial use.
    • Named entity attestation tool
    • Background / Use case. A diverse set of tools relevant to OCR & digitisation is being developed by IMPACT and also by others. But: mix of prototypes, robust commercial solutions and 3rd-party tools; variety of programming languages, environments and backgrounds. Solution: wrapping of tools as web services allows quick & easy integration (uniform access), flexible, loose coupling of separate modules, and platform-independent execution via the web. Result: IMPACT Interoperability Framework (IIF).
    • Overview: Service Oriented Architecture. IMPACT Interoperability Framework technologies: Java, REST/SOAP, Apache Maven, Apache Tomcat, Apache Axis2, Apache Synapse, Taverna Engine. Interoperability through open technology standards (OASIS/W3C).
    • Workflow example: OCR task = processing pipeline; building blocks = individual tools; workflow = interaction between tools (mashups); collaboration with
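The view of an OCR task as a pipeline of individual tools can be sketched with plain function composition. This is only an illustration of the loose coupling idea, not the IIF or Taverna API: the three step functions are hypothetical stand-ins, and in the framework each step would be a call to a wrapped web service rather than a local function.

```python
from functools import reduce


def run_workflow(steps, data):
    """Pass the data through each tool in order; each step is a callable."""
    return reduce(lambda current, step: step(current), steps, data)


# Hypothetical stand-ins for wrapped tools.
def binarise(page):
    return {**page, "binarised": True}


def recognise(page):
    if not page.get("binarised"):
        raise ValueError("recognise expects a binarised page")
    return {**page, "text": "recognised text for " + page["image"]}


def export_alto(page):
    return {"format": "ALTO", "source": page["image"], "text": page["text"]}
```

Because every step has the same shape (data in, data out), a tool can be swapped or reordered by editing the list of steps, for example `run_workflow([binarise, recognise, export_alto], {"image": "page1.tif"})`.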
    • Availability & Takeup. Availability: IIF components released as open source (Apache 2.0 License): https://github.com/impactcentre/interoperability-framework; community workflow resources (CC License): http://www.myexperiment.org/groups/235. Takeup: SCAPE (Scalable Preservation Environments); DDMAL (Digital Music Archives & Libraries Laboratory, Montreal).