
Collaborative Workflow Development and Experimentation in the Digital Humanities

A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation in the Digital Humanities
2012 Leipzig eHumanities Seminar, 10 October 2012, Leipzig, Germany.


  1. A Service-Oriented Architecture for Collaborative Workflow Development and Experimentation. eHumanities Seminar 2012, University of Leipzig, 10 October 2012. Clemens Neudecker (KB, @cneudecker), Zeki Mustafa Dogan (SUB-DL), Sven Schlarb (ÖNB, @SvenSchlarb), Juan Garcés (GCDH, @juan_garces)
  2. Idea • Provide web-based versions of tools (web services) • Package web services, data and documentation into ready-to-run “components” (encapsulation) • Chain the components into workflows via drag-and-drop • Share and re-use workflows to re-run experiments and to demonstrate results
  3. Background • High degree of diversity in research topics, but also in the tools and frameworks being used • Technical resources should be easy to use, well documented and accessible from anywhere • Prevent re-inventing the wheel
  4. Requirements • Interoperability = connect different resources • Flexibility = easy to deploy and adapt • Modularity = allow different combinations of tools • Usability = simple to use for non-technical users • Re-usability = easy to share with others • Scalability = suited to large-scale processing • Sustainability = resources simple to preserve • Transparency = tools can be evaluated separately • Distributed development and deployment
  5. Interoperability Framework (IIF) • Modules: – Java Wrapper for command line tools – Web Services (incl. format converters) – Taverna Workflow Engine – Client interfaces – Repository connectors
  6. Sources: https://github.com/impactcentre/interoperability-framework
  7. IIF Command Line Wrapper • Java project, built with Maven 2 • Creates a web service project from a given tool description (XML) • The web service exposes SOAP & REST endpoints and a Java API • Requirements: the tool must be callable from the command line and require no direct user interaction
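As an illustration of the wrapping idea only (not the actual IIF code), a minimal Java sketch of how a generated service might invoke a wrapped command-line tool; the tool name "ocr-tool" and its flags are hypothetical placeholders:

    import java.io.IOException;
    import java.nio.file.Path;

    /** Minimal sketch: invoke a wrapped command-line tool and return its exit code. */
    public class ToolWrapperSketch {

        // "ocr-tool" and its flags are placeholders, not part of the real IIF.
        public static int runTool(Path input, Path output) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "ocr-tool", "--in", input.toString(), "--out", output.toString());
            pb.redirectErrorStream(true);                     // merge stderr into stdout for logging
            Process process = pb.start();
            process.getInputStream().transferTo(System.out);  // stream the tool's output to the console
            return process.waitFor();                         // block until the tool finishes
        }
    }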
  8. IIF Web Services • Web services are described by a WSDL • Input/output data structures • Data is referenced by URL • Annotations • Default values
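Because data is passed by reference, a client only exchanges URLs with the service. A minimal Java sketch of such a REST call; the endpoint and the inputUrl parameter are assumptions for illustration, not the actual IIF interface:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Minimal sketch: call a REST endpoint, passing the input data by URL. */
    public class RestClientSketch {
        public static void main(String[] args) throws Exception {
            // Endpoint and query parameter are hypothetical, for illustration only.
            URI uri = URI.create("http://example.org/services/ocr/process"
                    + "?inputUrl=http://example.org/data/page001.tif");
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());  // typically a reference (URL) to the result
        }
    }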
  9. REST
  10. SOAP
  11. IIF Workflows • What is a workflow? (Yahoo Pipes, etc.) • Different kinds of workflows: for a single command, an application, a chain of processes • Main benefits: encapsulation, reuse • Workflows as “components”: include a link to the WS endpoint, sample input data and documentation = ready-to-use resource • Web 2.0 workflow registry: myExperiment
  12. Why workflows? • “In-silico experimentation” • Good structuring of the experiment setup: – Challenge/Research question – Dataset definition – Processing with algorithms – Evaluation/Provenance – Presentation of results • All of this can be modelled as a workflow
  13. Integration into Taverna • Web Services (SOAP and REST) • Command line tools (SH and SSH) • Beanshells (can import Java libraries) • R (statistics) • Excel, CSV • Additional service types can be added through dedicated plug-ins
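As an illustration of the Beanshell service type, a minimal script of the kind that can be dropped into a Taverna workflow; the input port "text" and the output port "wordCount" are hypothetical port names:

    // Minimal Beanshell sketch for a Taverna Beanshell service.
    // Input ports appear as variables; assigning a variable sets the corresponding output port.
    import java.util.StringTokenizer;

    String input = (String) text;                        // value of the (hypothetical) input port "text"
    StringTokenizer tokens = new StringTokenizer(input);
    wordCount = String.valueOf(tokens.countTokens());    // value of the (hypothetical) output port "wordCount"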
  14. Taverna flavours • Workbench – local GUI client for Linux, Windows, OS X • Command line tool – run workflows from the command line • Server – webapp with REST API and Java/Ruby client libraries • Web-Wf-Designer – JavaScript version for designing workflows in a browser
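For the Server flavour, a sketch of submitting a workflow over its REST API; the base URL, the /rest/runs path and the t2flow media type are assumptions modelled on typical Taverna Server 2.x deployments and should be checked against the server documentation:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    /** Sketch: create a workflow run on a (hypothetical) Taverna Server instance. */
    public class TavernaServerSketch {
        public static void main(String[] args) throws Exception {
            // URL and media type are assumptions for illustration.
            HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:8080/taverna-server/rest/runs"))
                    .header("Content-Type", "application/vnd.taverna.t2flow+xml")
                    .POST(HttpRequest.BodyPublishers.ofFile(Path.of("workflow.t2flow")))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // On success the server returns the location of the newly created run resource.
            System.out.println(response.headers().firstValue("Location").orElse("(no Location header)"));
        }
    }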
  15. Workbench
  16. Webapp
  17. Workflow registry
  18. Client interfaces • Web service client: create a simple HTML form from a given web service description • Taverna client: create a simple HTML form from a given Taverna workflow description → integration into production and presentation environments via iframes
  19. WS-client
  20. T2-client
  21. Repositories • Accessible via web service API – Fedora Commons – WebDAV – PRImA
  22. Architecture
  23. Examples • Use case 1: OCR (IMPACT) • Start: images (scanned documents) • Processing: OCR, NLP, evaluation • Result: full text, entities, sentiments
  24. Examples • Use case 2: Preservation (SCAPE) • Start: document collection preparation • Processing: Hadoop, Hive • Result: statistics
  25. Reading image metadata • Jp2PathCreator + HadoopStreamingExiftoolRead • find lists the JP2 files on the NAS (e.g. /NAS/Z119585409/00000001.jp2, /NAS/Z119585409/00000002.jp2, ...) and Exiftool, run via Hadoop Streaming, extracts the image width of each page (e.g. Z119585409/00000001 2345) • Data volumes: 1.4 GB and 1.2 GB • Runtime: ~5 h (reading files from NAS) + ~38 h = ~43 h for 60,000 books (24 million pages)
  26. Sequence file creation • HtmlPathCreator + SequenceFileCreator • find lists the HTML files on the NAS (e.g. /NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, ...) and their contents are packed into Hadoop sequence files keyed by page ID (e.g. Z119585409/00000707) • Data volumes: 1.4 GB and 997 GB (uncompressed) • Runtime: ~5 h (reading files from NAS) + ~24 h = ~29 h for 60,000 books (24 million pages)
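A minimal Java sketch of the sequence-file packing step, using the standard Hadoop SequenceFile writer; the class name, the single hard-coded page and the output file name are illustrative, not the actual SCAPE code:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    /** Sketch: pack small HTML files into one SequenceFile keyed by page ID. */
    public class SequenceFileCreatorSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("pages.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                // In the real workflow the path list comes from HtmlPathCreator (find on the NAS);
                // here a single page is written for illustration.
                String pageId = "Z119585409/00000707";
                String html = new String(Files.readAllBytes(Paths.get("/NAS/Z119585409/00000707.html")));
                writer.append(new Text(pageId), new Text(html));
            }
        }
    }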
  27. HTML parsing • HadoopAvBlockWidthMapReduce reads the sequence file • Map: parse each page’s HTML and emit one width value per text block (e.g. Z119585409/00000001 2100, 2200, 2300, 2400) • Reduce: average the block widths per page (e.g. Z119585409/00000001 2250) • Output: text file of page ID / average width pairs • Runtime: ~6 h for 60,000 books (24 million pages)
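A minimal Hadoop MapReduce sketch of the averaging step, assuming the per-block widths have already been extracted into "pageId width" text lines; class and field names are illustrative, not the actual HadoopAvBlockWidthMapReduce code:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Sketch of averaging block widths per page. */
    public class AvBlockWidthSketch {

        /** Map: split "pageId width" lines into (pageId, width) pairs. */
        public static class WidthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\\s+");
                if (fields.length == 2) {
                    context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
                }
            }
        }

        /** Reduce: average all block widths of one page. */
        public static class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                int count = 0;
                for (IntWritable w : widths) {
                    sum += w.get();
                    count++;
                }
                if (count > 0) {
                    context.write(pageId, new IntWritable((int) (sum / count)));
                }
            }
        }
    }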
  28. Analytic Queries • HiveLoadExifData & HiveLoadHocrData load the extracted values into two Hive tables: CREATE TABLE jp2width (jid STRING, jwidth INT) and CREATE TABLE htmlwidth (hid STRING, hwidth INT) • Example rows: jp2width (Z119585409/00000001, 2250), htmlwidth (Z119585409/00000001, 1870) • ~6 h for 60,000 books (24 million pages)
  29. Analytic Queries • HiveSelect joins the two tables: select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid • Result: one row per page with both widths, e.g. (Z119585409/00000001, 2250, 1870) • ~6 h for 60,000 books (24 million pages)
  30. Examples • Use case 3: Curation (GDZ) • Start: get documents from the repository • Processing: enrichment (OCR, entities, GeoNames) • Result: online presentation
  31. ROPEN (Resource Oriented Presentation ENvironment)
  32. Scalability • Multiple options: – Service parallelization – Cloud – Grid – Hadoop
  33. Compatibility • Taverna → UIMA • Taverna → Galaxy • Taverna → Kepler • Taverna → Weblicht • Taverna → Seasr
  34. But… • Multi-layered approach increases complexity (debugging, maintenance) • Diverse set of endpoints (OS, CPU, etc.) • Multiple dependencies • Shared responsibilities • Authentication & authorization • Error handling / fail-over / monitoring
  35. Demo(s)
  36. Discussion • Potential use cases in the digital humanities? • Which tools/features should be made available? • Questions, comments or remarks?
  37. Thank you!
