
Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities
2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.


  1. A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation. eHumanities Seminar 2012, University of Leipzig, 10 October 2012. Clemens Neudecker (KB, @cneudecker), Zeki Mustafa Dogan (SUB-DL), Sven Schlarb (ÖNB, @SvenSchlarb), Juan Garcés (GCDH, @juan_garces)
  2. Idea • Provide web-based versions of tools (web services) • Package web services, data and documentation into ready-to-run “components” (encapsulation) • Chain the components into workflows via drag-and-drop • Share and re-use workflows to re-run experiments and to demonstrate results
  3. Background • High degree of diversity in research topics, but also in the tools and frameworks being used • Technical resources should be easy to use, well documented and accessible from anywhere • Prevent re-inventing the wheel
  4. Requirements • Interoperability = connect different resources • Flexibility = easy to deploy and adapt • Modularity = allow different combinations of tools • Usability = simple to use for non-technical users • Re-usability = easy to share with others • Scalability = suited to large-scale processing • Sustainability = resources simple to preserve • Transparency = tools can be evaluated separately • Distributed development and deployment
  5. Interoperability Framework (IIF) • Modules: – Java Wrapper for command line tools – Web Services (incl. format converters) – Taverna Workflow Engine – Client interfaces – Repository connectors
  6. Sources: https://github.com/impactcentre/interoperability-framework
  7. IIF Command Line Wrapper • Java project, built with Maven 2 • Creates a web service project from a given tool description (XML) • The web service exposes SOAP & REST endpoints and a Java API • Requirements: the tool must be callable from the command line and require no direct user interaction
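As an illustration of the wrapping idea only (not the actual IIF code), a minimal Java sketch of how a generated service might invoke a wrapped command-line tool; the tool name "ocr-tool" and its flags are hypothetical placeholders:

    import java.io.IOException;
    import java.nio.file.Path;

    /** Minimal sketch: invoke a wrapped command-line tool and return its exit code. */
    public class ToolWrapperSketch {

        // "ocr-tool" and its flags are placeholders, not part of the real IIF.
        public static int runTool(Path input, Path output) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "ocr-tool", "--in", input.toString(), "--out", output.toString());
            pb.redirectErrorStream(true);                     // merge stderr into stdout for logging
            Process process = pb.start();
            process.getInputStream().transferTo(System.out);  // stream the tool's output to the console
            return process.waitFor();                         // block until the tool finishes
        }
    }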
  8. IIF Web Services • Web services are described by a WSDL • Input/output data structures • Data is referenced by URL • Annotations • Default values
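Because data is passed by reference, a client only exchanges URLs with the service. A minimal Java sketch of such a REST call; the endpoint and the inputUrl parameter are assumptions for illustration, not the actual IIF interface:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Minimal sketch: call a REST endpoint, passing the input data by URL. */
    public class RestClientSketch {
        public static void main(String[] args) throws Exception {
            // Endpoint and query parameter are hypothetical, for illustration only.
            URI uri = URI.create("http://example.org/services/ocr/process"
                    + "?inputUrl=http://example.org/data/page001.tif");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());  // typically a reference (URL) to the result
        }
    }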
  9. REST
  10. SOAP
  11. IIF Workflows • What is a workflow? (Yahoo Pipes, etc.) • Different kinds of workflows: for a single command, an application, a chain of processes • Main benefits: encapsulation, reuse • Workflows as “components”: include a link to the WS endpoint, sample input data and documentation = ready-to-use resource • Web 2.0 workflow registry: myExperiment
  12. Why workflows? • “In-silico experimentation” • Good structuring of the experiment setup: – Challenge/Research question – Dataset definition – Processing with algorithms – Evaluation/Provenance – Presentation of results • All of this can be modelled as a workflow
  13. Integration into Taverna • Web Services (SOAP and REST) • Command line tools (SH and SSH) • Beanshells (can import Java libraries) • R (statistics) • Excel, CSV • Additional service types can be added through dedicated plug-ins
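As an illustration of the Beanshell service type, a minimal script of the kind that can be dropped into a Taverna workflow; the input port "text" and the output port "wordCount" are hypothetical port names:

    // Minimal Beanshell sketch for a Taverna Beanshell service.
    // Input ports appear as variables; assigning a variable sets the corresponding output port.
    import java.util.StringTokenizer;

    String input = (String) text;                        // value of the (hypothetical) input port "text"
    StringTokenizer tokens = new StringTokenizer(input);
    wordCount = String.valueOf(tokens.countTokens());    // value of the (hypothetical) output port "wordCount"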
  14. Taverna flavours • Workbench – local GUI client for Linux, Windows, OS X • Command line tool – run workflows from the command line • Server – webapp with REST API and Java/Ruby client libraries • Web-Wf-Designer – JavaScript version for designing workflows in a browser
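For the Server flavour, a sketch of submitting a workflow over its REST API; the base URL, the /rest/runs path and the t2flow media type are assumptions modelled on typical Taverna Server 2.x deployments and should be checked against the server documentation:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    /** Sketch: create a workflow run on a (hypothetical) Taverna Server instance. */
    public class TavernaServerSketch {
        public static void main(String[] args) throws Exception {
            // URL and media type are assumptions for illustration.
            HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:8080/taverna-server/rest/runs"))
                    .header("Content-Type", "application/vnd.taverna.t2flow+xml")
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("workflow.t2flow")))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // On success the server returns the location of the newly created run resource.
            System.out.println(response.headers().firstValue("Location").orElse("(no Location header)"));
        }
    }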
  15. Workbench
  16. Webapp
  17. Workflow registry
  18. Client interfaces • Web service client: create a simple HTML form from a given web service description • Taverna client: create a simple HTML form from a given Taverna workflow description → integration into production and presentation environments via iframes
  19. WS-client
  20. T2-client
  21. Repositories • Accessible via web service API – Fedora Commons – WebDAV – PRImA
  22. Architecture
  23. Examples • Use case 1: OCR (IMPACT) • Start: images (scanned documents) • Processing: OCR, NLP, evaluation • Result: full text, entities, sentiments
  24. Examples • Use case 2: Preservation (SCAPE) • Start: document collection preparation • Processing: Hadoop, Hive • Result: statistics
  25. Reading image metadata • Jp2PathCreator + HadoopStreamingExiftoolRead • find lists the JP2 files on the NAS (e.g. /NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, ...) and Exiftool, run via Hadoop Streaming, extracts the image width of each page (e.g. Z119585409/00000001 2345) • Data volumes: 1.4 GB and 1.2 GB • Runtime: ~5 h (reading files from NAS) + ~38 h = ~43 h for 60,000 books (24 million pages)
  26. Sequence file creation • HtmlPathCreator + SequenceFileCreator • find lists the HTML files on the NAS (e.g. /NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, ...) and their contents are packed into Hadoop sequence files keyed by page ID (e.g. Z119585409/00000707) • Data volumes: 1.4 GB and 997 GB (uncompressed) • Runtime: ~5 h (reading files from NAS) + ~24 h = ~29 h for 60,000 books (24 million pages)
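A minimal Java sketch of the sequence-file packing step, using the standard Hadoop SequenceFile writer; the class name, the single hard-coded page and the output file name are illustrative, not the actual SCAPE code:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    /** Sketch: pack small HTML files into one SequenceFile keyed by page ID. */
    public class SequenceFileCreatorSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("pages.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                // In the real workflow the path list comes from HtmlPathCreator (find on the NAS);
                // here a single page is written for illustration.
                String pageId = "Z119585409/00000707";
                String html = new String(Files.readAllBytes(Paths.get("/NAS/Z119585409/00000707.html")));
                writer.append(new Text(pageId), new Text(html));
            }
        }
    }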
  27. HTML parsing • HadoopAvBlockWidthMapReduce reads the sequence file • Map: parse each page’s HTML and emit one width value per text block (e.g. Z119585409/00000001 2100, 2200, 2300, 2400) • Reduce: average the block widths per page (e.g. Z119585409/00000001 2250) • Output: text file of page ID / average width pairs • Runtime: ~6 h for 60,000 books (24 million pages)
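A minimal Hadoop MapReduce sketch of the averaging step, assuming the per-block widths have already been extracted into "pageId width" text lines; class and field names are illustrative, not the actual HadoopAvBlockWidthMapReduce code:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Sketch of averaging block widths per page. */
    public class AvBlockWidthSketch {

        /** Map: split "pageId width" lines into (pageId, width) pairs. */
        public static class WidthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\\s+");
                if (fields.length == 2) {
                    context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
                }
            }
        }

        /** Reduce: average all block widths of one page. */
        public static class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                int count = 0;
                for (IntWritable w : widths) {
                    sum += w.get();
                    count++;
                }
                if (count > 0) {
                    context.write(pageId, new IntWritable((int) (sum / count)));
                }
            }
        }
    }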
  28. Analytic Queries • HiveLoadExifData & HiveLoadHocrData load the extracted values into two Hive tables: CREATE TABLE jp2width (jid STRING, jwidth INT) and CREATE TABLE htmlwidth (hid STRING, hwidth INT) • Example rows: jp2width (Z119585409/00000001, 2250), htmlwidth (Z119585409/00000001, 1870) • ~6 h for 60,000 books (24 million pages)
  29. Analytic Queries • HiveSelect joins the two tables: select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid • Result: one row per page with both widths, e.g. (Z119585409/00000001, 2250, 1870) • ~6 h for 60,000 books (24 million pages)
  30. Examples • Use case 3: Curation (GDZ) • Start: get documents from the repository • Processing: enrichment (OCR, entities, GeoNames) • Result: online presentation
  31. ROPEN (Resource Oriented Presentation ENvironment)
  32. Scalability • Multiple options: – Service parallelization – Cloud – Grid – Hadoop
  33. Compatibility • Taverna → UIMA • Taverna → Galaxy • Taverna → Kepler • Taverna → Weblicht • Taverna → Seasr
  34. But… • Multi-layered approach increases complexity (debugging, maintenance) • Diverse set of endpoints (OS, CPU, etc.) • Multiple dependencies • Shared responsibilities • Authentication & authorization • Error handling / fail-over / monitoring
  35. Demo(s)
  36. Discussion • Potential use cases in the digital humanities? • Which tools/features should be made available? • Questions, comments or remarks?
  37. Thank you!
