Workflow Development for OCR (and beyond)

Workflow Development for OCR (and more)
Creating and Communicating Digital Content Conference, 26-27 May 2011, Umea, Sweden.

  1. Workflow Development for OCR (and beyond)
     Clemens Neudecker, KB National Library of the Netherlands
     Creating and Communicating Digital Content Conference, Umea, 26 May 2011
     IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. IMPACT – Improving access to text
     • Funded by the EC as part of the 7th Framework Programme
     • Coordinated by KB – National Library of the Netherlands
     • EU funding: € 12 100 000
     • 26 partners: Libraries, Research Institutes, Industry Partners
     • Start date: 1 January 2008
     • Duration: 48 Months
     • 2012: Centre of Competence
     • Project website: www.impact-project.eu
     • IMPACT blog: http://impactocr.wordpress.com/
     • Twitter: @impactocr, #impactproject
     • Join us on LinkedIn!
  3. A familiar scene?
     VVt Venetien den 1.Junij, Anno 1618.
     DJgn i f paffato te S' aö'Jifeert mo? üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
  4. OCR: A multitude of challenges…
     I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
  5. OCR: A multitude of challenges…
     II. Language challenges (spelling variants, inflection, and many more!)
     Example: historical variants of the Dutch word ‘wereld’ (world):
     werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
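One common way to cope with this kind of spelling variation is a lexicon that maps each attested historical form to a modern form, so that a search for ‘wereld’ also finds the variants. The following is only a minimal sketch of that idea. The `VARIANT_LEXICON` table and the `normalise` and `normalise_text` functions are hypothetical names introduced for illustration, the entries are a handful of forms taken from the list above, and nothing here reflects the actual format or coverage of the IMPACT lexica.

```python
# Toy lexicon (assumption): a few historical variants from the slide,
# each mapped to the modern Dutch form "wereld".
VARIANT_LEXICON = {
    "werelt": "wereld",
    "weerelt": "wereld",
    "waereld": "wereld",
    "vvereld": "wereld",
    "werlt": "wereld",
}


def normalise(token):
    """Return the modern form for a known variant, else the token unchanged."""
    return VARIANT_LEXICON.get(token.lower(), token)


def normalise_text(text):
    """Normalise every whitespace-separated token in a string."""
    return " ".join(normalise(token) for token in text.split())
```

A plain dictionary lookup only covers forms that were collected in advance. Unknown words pass through unchanged, which keeps the sketch safe but limits recall.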
  6. And a multitude of solutions!
     • 22 different ‘tools’ from diverse WPs and developers: OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc. + 3rd party software!
     “One ring to rule them all...”
     • IMPACT Interoperability Framework (IIF)
  7. Requirement: Interoperability Framework
     • Interoperability vs. integration
     • Web based vs. local installation/platform
     • Most important: flexible, scalable, user friendly
     • Java 6
     • Apache Axis2
     • Apache Tomcat
     • Apache Synapse (optional)
     • Taverna Workflow Engine
  8. Generic Web Service Wrapper
     Only requirement: Command Line Application
     • HTML form
     Available on OPFlabs: https://github.com/openplanets/scape/tree/master/xa-toolwrapper
     • Minimise integration effort: developers can focus on their application and have to worry less about integration = higher quality software
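The core idea of such a wrapper, describing a command-line tool once and then invoking it through a uniform interface, can be sketched as follows. This is an illustration in Python and not the toolwrapper itself, which targets the Java/Axis2 stack named in the slides. The `TOOL` description, its `${text}` placeholder syntax, and the `run_tool` function are assumptions made for this example, and a one-line Python command stands in for a real OCR or image-processing tool.

```python
import subprocess
import sys
from string import Template

# Hypothetical tool description: the command line as a list of arguments,
# with ${name} placeholders for the parameters a caller must supply.
TOOL = {
    "name": "uppercase",
    "command": [
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1].upper())",
        "${text}",
    ],
}


def run_tool(tool, **params):
    """Fill in the placeholders, run the tool, and return its stdout."""
    argv = [Template(part).substitute(params) for part in tool["command"]]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

A service layer would expose `run_tool` over HTTP or SOAP. Passing the arguments as a list rather than a shell string avoids shell quoting problems with user-supplied values.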
  9. Service Oriented Architecture
     • Java as programming language = platform independence
     • Standard Apache components = easy to maintain, well supported
     • Synapse as enterprise service bus = load balancing & fail over
     • HTTPS encryption & authentication = secure
     • Minimise deployment effort: scalability, hot deployment/update
  10. Workflow development
     • OCR workflow = data pipeline
     • Building blocks = processing steps (nodes)
     • Integration = interaction between nodes (mashup)
     • Maximise usability
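The "workflow = data pipeline" view can be sketched in a few lines: each node is a function that takes a document and returns an enriched copy, and the workflow is the ordered list of nodes. The step names (`binarise`, `recognise`, `correct`) and the dictionary-based document are hypothetical stand-ins chosen for this example. In the framework described here the nodes are web services orchestrated by Taverna, not in-process functions.

```python
from functools import reduce


# Hypothetical processing steps: each takes a document dict and returns a new one.
def binarise(doc):
    return {**doc, "image": doc["image"] + "+binarised"}


def recognise(doc):
    # Stand-in for an OCR engine: always "reads" the same historical spelling.
    return {**doc, "text": "werelt"}


def correct(doc):
    # Stand-in for lexicon-based post-correction.
    return {**doc, "text": doc["text"].replace("werelt", "wereld")}


def run_workflow(steps, doc):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, doc)


result = run_workflow([binarise, recognise, correct], {"image": "page-001.tif"})
```

Because every node has the same shape, swapping one OCR engine for another or inserting an extra step only changes the list passed to `run_workflow`.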
  11. Workflow management
     • Web 2.0 style registry: myExperiment
     • Local client: Taverna Workbench
     • Web client: project website
  12. Workflow registry
     • Share resources and experience
     • Rate/tag/comment workflows
     • Organised in groups
  13. Workflow modules
     • “Basic” workflows = each wraps exactly one software tool/web service
     • Documented inputs/outputs
  14. Complex workflows
     • Tool/data pipeline
     • Easily derived from workflow modules
     • Task/goal oriented
     • Reusable
  15. Local client: Taverna Workbench
     http://www.taverna.org.uk/
     • Background: BioSciences
     • Developed and maintained by myGrid, UK
     • Available for Windows/Linux/OSX and as open source
     • Funding secured until 2014
  16. Web client: Taverna Server/Workflow Parser
     • SOAP/REST API
     • Remote execution of workflows (webapp)
  17. Use case: Workflows for Evaluation
     • Tool A vs Tool B (Tool A(v1) vs Tool A(v2))
     • Workflow X (Tool A + Tool B) vs Workflow Y (Tool A + Tool C)
     • Workflow X vs previously digitised material
     • Users identify optimal workflow for source material/project
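A comparison of this kind needs a score for each candidate. The slides do not name a metric, so the sketch below assumes one common choice for OCR: character accuracy, computed from the edit distance between a tool's output and a ground-truth transcription. The function names, the `tool_a` and `tool_b` labels, and the sample strings are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(output, truth):
    """Share of ground-truth characters recovered; assumes truth is non-empty."""
    return max(0.0, 1.0 - edit_distance(output, truth) / len(truth))


def best_candidate(outputs, truth):
    """Return the name of the tool or workflow with the highest accuracy."""
    return max(outputs, key=lambda name: char_accuracy(outputs[name], truth))


# Hypothetical outputs of two tools for the same ground-truth word.
outputs = {"tool_a": "werelt", "tool_b": "vverelt"}
winner = best_candidate(outputs, "wereld")
```

The same `best_candidate` call works for whole workflows as well as single tools, since it only looks at the final text each candidate produced.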
  18. Other examples
     • Workflows for Digitisation: IMPACT
     • Workflows for Linguistic Analysis: CLARIN
     • Workflows for Preservation: SCAPE
     • Interface for automatic storage of results, based on DAV, realised as a workflow module (native beanshell support)
     • And there are many more…
  19. Benefits & Outlook
     • Modular
     • Transparent
     • Expandable
     • Scalable
     • Platform independent
     • User friendly
     • Growing interest in workflow management in CH sector
     • Easy to set up, deploy, free (open source)
     • Domain independent
  20. Thank you! Questions?
