The IMPACT Interoperability Framework: Workflows for OCR and beyond
Clemens Neudecker, KB National Library of the Netherlands
2nd IMPACT Conference, British Library, London, 24/25 October 2011
Background
• More than 20 individual software components for specific challenges
• Prototyping new algorithms, improving commercial solutions
• Different frameworks (C, C++, Java, etc.) and platforms (Windows/Linux)
• Extensible with 3rd party applications
• IMPACT Interoperability Framework (IIF)
Architecture
• Java
• Web Services
• Apache
• Taverna
Open source, available at https://github.com/impactcentre
Free Hackathon, 14/15 November, University of Manchester
http://impact-mygrid-taverna-hackathon.wikispaces.com/
Integration
• Only requirement: command line executable
• Generic command line wrapper produces web service
• Web service exposed as workflow module with documentation
• Quick & easy integration: developers can focus on their application and worry less about integration, which leads to higher quality software
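The wrapping step can be sketched in a few lines. This is only an illustration of the idea, not the IIF's own wrapper (which, per the Architecture slide, is Java-based): the helper name `wrap_command`, the `{text}` placeholder convention and the use of the Python interpreter as a stand-in tool are all assumptions.

```python
import subprocess
import sys


def wrap_command(argv_template):
    """Return a callable that runs a command line tool with filled-in arguments.

    argv_template is a list of argument strings; "{name}" placeholders are
    filled from keyword arguments at call time. (Hypothetical helper.)
    """
    def service(**params):
        argv = [arg.format(**params) for arg in argv_template]
        completed = subprocess.run(argv, capture_output=True, text=True, check=False)
        return {
            "exit_code": completed.returncode,
            "stdout": completed.stdout,
            "stderr": completed.stderr,
        }
    return service


# Stand-in "tool": the Python interpreter upper-casing its first argument.
upper = wrap_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", "{text}"]
)
result = upper(text="ocr")
```

A real deployment would expose `service` behind an HTTP endpoint and describe its parameters, so that it can be picked up as a workflow module.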
Workflows
• OCR workflow = data pipeline
• Building blocks = processing modules (nodes)
• Integration = interaction between nodes (mashups)
• Collaboration with
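The "workflow = data pipeline" view can be illustrated as plain function composition. The node names below (`binarize`, `recognize`, `postcorrect`) and the dictionary used as the page record are hypothetical stand-ins; in the framework each node would be a web service call arranged in Taverna.

```python
from functools import reduce


def make_pipeline(*nodes):
    """Chain processing nodes so that each node's output feeds the next node."""
    def run(data):
        return reduce(lambda value, node: node(value), nodes, data)
    return run


# Hypothetical stand-in nodes; real nodes would call image/OCR web services.
def binarize(page):
    return {**page, "binarized": True}


def recognize(page):
    return {**page, "text": "recognized text for " + page["id"]}


def postcorrect(page):
    return {**page, "text": page["text"].strip()}


ocr_workflow = make_pipeline(binarize, recognize, postcorrect)
output = ocr_workflow({"id": "page-001"})
```

Because every node takes and returns the same kind of record, nodes can be swapped or reordered without touching the others.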
Evaluation features
• Text comparison of result with ground truth, using the Levenshtein distance method
• Word evaluation (with reading order)
• Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
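For readers unfamiliar with the text comparison, here is a minimal sketch of the Levenshtein distance using the standard dynamic programming formulation. The `character_accuracy` score derived from it is an assumption for illustration; the slides do not say how the IMPACT evaluation tools normalize the distance.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Hypothetical score: 1 - distance / ground truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For example, `levenshtein("tbe cat", "the cat")` is 1, since a single substitution repairs the OCR error.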
Community
• Web 2.0 style workflow registry
• Ready-to-use and documented resources
• Community of experts
• Sharing of experiments and know-how
Local client: Taverna Workbench
• Background: BioSciences
• Developed and maintained by myGrid, UK
• Open source
• GUI for design and execution of web services & workflows
Remote client: Portal
• SOAP/REST API
• Remote execution of web services & workflows
Results Repository
Custom service for IMPACT:
• Automatic storage of workflow outputs and provenance via WebDAV
• Fully interoperable, since HTTP-based
• Configurable storage of result sets
• Create reports using POI
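Since WebDAV is HTTP-based, storing a workflow output comes down to an HTTP PUT. The sketch below only builds the request and does not send it; the repository URL, the `run_id/filename` path layout and the helper name `build_webdav_put` are assumptions, not the actual layout of the IMPACT results repository.

```python
import urllib.request


def build_webdav_put(base_url, run_id, filename, payload):
    """Build (but do not send) an HTTP PUT request storing one workflow output."""
    url = f"{base_url.rstrip('/')}/{run_id}/{filename}"
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


# Hypothetical repository location and run identifier.
request = build_webdav_put(
    "http://repository.example.org/results", "run-42", "page-001.txt", b"recognized text"
)
# Sending it would be: urllib.request.urlopen(request)
```

On a real WebDAV server the per-run collection would have to exist first (created with an MKCOL request) before the PUT succeeds.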
Scalability
• Central ESB proxy manages multiple service copies
• Process parallelization, load distribution, failover, security
• Served >2M requests
• Throughput improvements of 94% with every additional instance
• Tested on Dutch Supercomputing Cloud (“Enlighten Your Research”)
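Load distribution with failover over several service copies can be sketched as a round-robin dispatcher. This is a toy model of what such a proxy does, not the ESB used in IMPACT; the class name `RoundRobinProxy` and the use of `ConnectionError` to signal a dead instance are assumptions.

```python
import itertools


class RoundRobinProxy:
    """Distribute calls over service copies in turn, skipping copies that fail."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cycle = itertools.cycle(range(len(self.instances)))

    def call(self, request):
        last_error = None
        # Try each copy at most once per request, starting with the next in turn.
        for _ in range(len(self.instances)):
            index = next(self._cycle)
            try:
                return index, self.instances[index](request)
            except ConnectionError as error:
                last_error = error  # fail over to the next copy
        raise RuntimeError("all service copies failed") from last_error


# Hypothetical service copies: two healthy ones and one that is down.
def healthy(request):
    return request.upper()


def broken(request):
    raise ConnectionError("instance down")


proxy = RoundRobinProxy([healthy, broken, healthy])
served_by = [proxy.call("page")[0] for _ in range(4)]
```

Here `served_by` records which copy answered each request; the broken copy at index 1 is tried in its turn and then skipped.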
Outlook
• Online service for testing/evaluation
• Specification & Guidelines
• Extending the scope:
  Workflows for linguistic analysis: CLARIN
  Workflows for preservation: SCAPE
• Even better scalability: Map/Reduce
• Supported by a community of developers & practitioners
xkcd.com/688
“Anyway, the thing about progress is that it always seems greater than it really is.”
Ludwig Wittgenstein, Philosophical Investigations (quoting Johann Nestroy)