Succeed Evaluation Infrastructure - Apostolos Antonacopoulos


Succeed WP5 Evaluation Infrastructure at the "Succeed in Digitisation. Spreading Excellence" Conference.

  1. Evaluation Infrastructure or: How do I know my digitised content is any good? Apostolos Antonacopoulos, PRImA Research Lab
  2. Why evaluate? What to evaluate? How?
  3. Content Holders - Why?
       • Objectively assess what can be expected from current best OCR
       • Prioritise different material
       • Re-scan / re-OCR existing content?
       • Specify precise service contracts
       • QA of results from service providers
  4. Developers / Contractors - Why?
       • Select best workflow components (for different batches of material)
       • Identify performance bottlenecks
       • Tune performance of system components
       • Quality Assurance
  5. OK… What to evaluate? Isn’t Word Accuracy enough?
       • No – on its own it is of relatively little help…
     For anything other than just querying a word it is important to first have accurate
       • Layout
       • Reading order
  6. What else?
     Documents also contain graphical elements
       • An often ignored fact!
     And even for difficult-to-OCR documents
       • Layout still provides useful information (location of headers, page numbers etc.)
  7. In a Nutshell: As obvious as it may sound… Need to evaluate according to different Use Scenarios
  8. Use Scenario Examples
       • Keyword search
       • Phrase search
       • Newspaper article search
       • ToC / book structure extraction
       • Layout re-flowing for mobile browsing
  9. How can I evaluate all that? PRImA Evaluation Infrastructure
       • In partnership with the IMPACT CoC
     ① Comprehensive datasets
     ② Ground truthing tools – Aletheia
     ③ Scenario-based evaluation tools
       • Layout, reading order, text accuracy
       • Results in several levels of detail
  10. Proven Use
       • Several International Competitions (SUCCEED and at ICDAR conferences)
           • Historical book recognition
           • Historical newspaper layout analysis
       • Continuous evaluation challenge
           • Workflows and individual components
       • Wellcome Trust Library Case Study
           • Assessment of material for prioritisation of digitisation
  11. [Diagram: map of the PRImA tools and data formats. Recoverable labels include Layout Eval (layout correspondence, reading order), Text Eval (Bag of Words, character and word accuracy), Dewarping Eval, Aletheia, Tesseract Exporter, FineReader Exporter, Validator, Converter, binarisation (Threshold, Otsu, Sauvola), and the formats PAGE XML, ALTO XML, FineReader XML and Gamera XML. Legend entries: Tool, Prototype, Data, Java, Web, Command Line.] For more:
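
Slides 5 and 11 name character accuracy, word accuracy and Bag of Words as text measures, and slide 5 argues that word-level scores alone say little without correct reading order. The Python sketch below illustrates that point with common textbook definitions (edit-distance-based accuracy and multiset overlap). These definitions and all function names are assumptions made for illustration; they are not taken from the PRImA evaluation tools, whose exact metrics may differ.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def sequence_accuracy(ref, hyp):
    """Assumed definition: 1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def character_accuracy(gt_text, ocr_text):
    """Order-sensitive accuracy over characters."""
    return sequence_accuracy(gt_text, ocr_text)


def word_accuracy(gt_text, ocr_text):
    """Order-sensitive accuracy over whitespace-separated words."""
    return sequence_accuracy(gt_text.split(), ocr_text.split())


def bag_of_words_accuracy(gt_text, ocr_text):
    """Order-insensitive: share of ground-truth words found in the OCR output."""
    gt = Counter(gt_text.split())
    ocr = Counter(ocr_text.split())
    total = sum(gt.values())
    if total == 0:
        return 1.0 if not ocr else 0.0
    return sum((gt & ocr).values()) / total  # multiset intersection


if __name__ == "__main__":
    ground_truth = "the quick brown fox"
    wrong_order = "brown fox the quick"  # every word correct, order wrong
    print(word_accuracy(ground_truth, wrong_order))
    print(bag_of_words_accuracy(ground_truth, wrong_order))
```

With the hypothetical strings above, the scrambled output scores 1.0 on Bag of Words but 0.0 on order-sensitive word accuracy. That is adequate for keyword search, yet it shows why phrase search or article extraction also needs layout and reading order evaluated.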