Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012



Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.






Presentation Transcript

  • SCAPE: Large scale preservation workflows with Taverna
      • Sven Schlarb, Austrian National Library
      • Keeping Control: Scalable Preservation Environments for Identification and Characterisation
      • Guimarães, Portugal, 07/12/2012
  • What do you mean by “Workflow”?
      • Data flow rather than control flow
      • (Semi-)automated data processing pipeline
      • Defined inputs and outputs
      • Modular and reusable processing units
      • Easy to deploy, execute, and share
  • Modularise complex preservation tasks
      • Assuming that complex preservation tasks can be separated into processing steps
      • Together the steps represent the automated processing pipeline
      • [Diagram: processing steps labelled Migrate, Characterise, Quality Assurance, and Ingest]
  • Experimental workflow development
      • Easy to execute a workflow on standard platforms from anywhere
      • Experimental data available online or downloadable
      • Reproducible experiment results
      • Workflow development as a community activity
  • Taverna
      • Workflow language and computational model for creating composite data-intensive processing chains
      • Developed since 2004 as a tool for life scientists and bio-informaticians by myGrid, University of Manchester, UK
      • Available for Windows/Linux/OSX and as open source (LGPL)
  • SCUFL/T2FLOW/SCUFL2
      • Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)
      • SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API
      • SCUFL2 will replace the t2flow format (which replaced the SCUFL format)
      • Adopts Linked Data technology
  • Creating workflows using Taverna
      • Users interactively build data processing pipelines
      • A set of nodes represents data processing elements
      • Nodes are connected by directed edges, and the workflow itself is a directed graph
      • Nodes can have multiple inputs and outputs
      • Workflows can contain other (embedded) workflows
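
The directed-graph idea on this slide can be sketched in a few lines of Python. This is only an illustration and not how Taverna executes workflows: the node names, the `run_workflow` helper, and the lambda processors are all hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a workflow engine by running each node once all of its upstream nodes have produced a value.

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, edges, workflow_input):
    """Run a toy data-flow graph.

    nodes: mapping of node name -> function taking a list of upstream values
    edges: list of (source, target) pairs, i.e. the directed edges
    Nodes without incoming edges receive the workflow input.
    """
    predecessors = {name: [] for name in nodes}
    for source, target in edges:
        predecessors[target].append(source)

    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        upstream = [results[p] for p in predecessors[name]] or [workflow_input]
        results[name] = nodes[name](upstream)
    return results


# Hypothetical processing steps, chosen only to show fan-out and fan-in.
nodes = {
    "read": lambda values: values[0].strip(),
    "upper": lambda values: values[0].upper(),
    "length": lambda values: len(values[0]),
    "report": lambda values: f"{values[0]} ({values[1]} chars)",
}
edges = [
    ("read", "upper"),
    ("read", "length"),
    ("upper", "report"),
    ("length", "report"),
]

results = run_workflow(nodes, edges, "  taverna ")
print(results["report"])  # TAVERNA (7 chars)
```

The "report" node has two inputs and "read" feeds two downstream nodes, which mirrors the slide's point that nodes can have multiple inputs and outputs.
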
  • Processors
      • Web service clients (SOAP/REST)
      • Local scripts (R and Beanshell languages)
      • Remote shell script invocations via ssh (Tool)
      • XML splitters, XSLT (interoperability!)
  • List handling: Implicit iteration over multiple inputs
      • A “single value” input port (list depth 0) processes values iteratively (foreach)
      • A flat value list has list depth 1
      • List depth > 1 for tree structures
      • Multiple input ports with lists are combined as cross product or dot product
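
The difference between the two combination strategies can be shown with plain Python lists. This is a simplified sketch, not Taverna code: the `iterate` helper and the `convert` processor are hypothetical, and the cross product is returned as a flat list here for readability.

```python
from itertools import product


def iterate(processor, *input_lists, strategy="cross"):
    """Apply a single-value processor to lists, as implicit iteration would.

    strategy="cross": every combination of values from all input lists.
    strategy="dot":   values paired by position (first with first, ...).
    Note: zip() silently truncates to the shortest list; equal-length
    lists are assumed for the dot product in this sketch.
    """
    if strategy == "cross":
        combinations = product(*input_lists)
    elif strategy == "dot":
        combinations = zip(*input_lists)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [processor(*values) for values in combinations]


def convert(path, target_format):
    # Hypothetical processor with two single-value input ports.
    return f"{path} -> {target_format}"


files = ["a.tif", "b.tif"]
formats = ["jp2", "png"]

cross = iterate(convert, files, formats, strategy="cross")
dot = iterate(convert, files, formats, strategy="dot")

print(cross)  # 4 results: every file combined with every format
print(dot)    # 2 results: a.tif with jp2, b.tif with png
```

With two lists of length n, the cross product invokes the processor n × n times and the dot product n times, which matters when the lists hold thousands of files.
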
  • Example: Tika Preservation Component
      • Input: “file”
      • Processor: Tika web service (SOAP)
      • Output: Mime-Type
  • Workflow development and execution
      • Local development: Taverna Workbench
  • Workflow registry
      • Web 2.0 style registry: myExperiment
  • Remote Workflow Execution
      • Web client using REST API of Taverna Server
  • Hadoop
      • Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)
      • Hadoop = MapReduce + HDFS
      • HDFS: Distributed file system, data stored in 64 MB (default) blocks
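
As a rough illustration of what the 64 MB block size means in practice, the number of blocks a file occupies is its size divided by the block size, rounded up. The helper name below is hypothetical, and the sketch assumes 64 MB means 64 × 1024 × 1024 bytes and ignores replication.

```python
BLOCK_SIZE_BYTES = 64 * 1024 ** 2  # 64 MB default block size from the slide


def hdfs_block_count(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Number of blocks needed to store a file (ceiling division)."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    return -(-file_size_bytes // block_size_bytes)


one_gib = 1024 ** 3
print(hdfs_block_count(one_gib))               # 16
print(hdfs_block_count(BLOCK_SIZE_BYTES + 1))  # 2: the last block is nearly empty

# The 997 GB sequence file mentioned later in this talk, assuming decimal GB.
print(hdfs_block_count(997 * 10 ** 9))
```
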
  • Hadoop
      • Job tracker (master) manages job execution on task trackers (workers)
      • Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)
      • Name node manages HDFS, i.e. distribution of data blocks on data nodes
  • Hadoop job building blocks
      • Map/reduce application (JAR)
      • Job configuration: set or overwrite configuration parameters
      • Map method: create intermediate key/value pair output
      • Reduce method: aggregate intermediate key/value pair output from map
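
The three building blocks can be mimicked in memory to show how they fit together. This is a teaching sketch in Python, not the Hadoop Java API: `run_job`, `map_mime`, `reduce_count`, and the `lowercase_keys` setting are hypothetical names, and the grouping step stands in for Hadoop's shuffle and sort.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn, config=None):
    """Toy single-process job: map, group by key, reduce."""
    config = dict(config or {})  # job configuration visible to both phases
    intermediate = defaultdict(list)
    for record in records:
        # Map method: emit intermediate key/value pairs.
        for key, value in map_fn(record, config):
            intermediate[key].append(value)
    # Reduce method: aggregate all values that share a key.
    return {
        key: reduce_fn(key, values, config)
        for key, values in sorted(intermediate.items())
    }


def map_mime(record, config):
    path, mime_type = record
    if config.get("lowercase_keys"):
        mime_type = mime_type.lower()
    yield mime_type, 1


def reduce_count(key, values, config):
    return sum(values)


records = [
    ("a.jp2", "image/jp2"),
    ("b.html", "text/HTML"),
    ("c.jp2", "image/jp2"),
]
counts = run_job(records, map_mime, reduce_count, {"lowercase_keys": True})
print(counts)  # {'image/jp2': 2, 'text/html': 1}
```
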
  • Cluster
  • Large scale execution environment
      • [Diagram: cluster, file server, Taverna Server (REST API), Hadoop Jobtracker, Apache Tomcat, web application]
  • Example: Characterisation on a large document collection
      • Using “Tool” service, remote ssh execution
      • Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
      • Available on myExperiment
      • See blog post: 08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
  • Create text file containing JPEG2000 input file paths and read image metadata using Exiftool via the Hadoop Streaming API.
  • Reading image metadata
      • Jp2PathCreator: find on the NAS writes a text file of JPEG2000 file paths (1.4 GB)
      • HadoopStreamingExiftoolRead: reads the files from the NAS and writes one record per page (1.2 GB)
      • Sample records (input path, output key, output value):
          • /NAS/Z119585409/00000001.jp2, Z119585409/00000001, 2345
          • /NAS/Z119585409/00000002.jp2, Z119585409/00000002, 2340
          • /NAS/Z119585409/00000003.jp2, Z119585409/00000003, 2543
          • …
          • /NAS/Z117655409/00000001.jp2, Z117655409/00000001, 2300
          • /NAS/Z117655409/00000002.jp2, Z117655409/00000002, 2300
          • /NAS/Z117655409/00000003.jp2, Z117655409/00000003, 2345
          • …
          • /NAS/Z119585987/00000001.jp2, Z119585987/00000001, 2300
          • /NAS/Z119585987/00000002.jp2, Z119585987/00000002, 2340
          • /NAS/Z119585987/00000003.jp2, Z119585987/00000003, 2432
          • …
          • /NAS/Z119584539/00000001.jp2, Z119584539/00000001, 5205
          • /NAS/Z119584539/00000002.jp2, Z119584539/00000002, 2310
          • /NAS/Z119584539/00000003.jp2, Z119584539/00000003, 2134
          • …
          • /NAS/Z119599879/00000001.jp2, Z119599879/00000001, 2312
          • /NAS/Z119589879/00000002.jp2, Z119589879/00000002, 2300
          • /NAS/Z119589879/00000003.jp2, Z119589879/00000003, 2300
          • ...
      • 60,000 books, 24 million pages: ~5 h + ~38 h = ~43 h
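
The sample records suggest that each output key is the input path without the /NAS prefix and the file extension. A possible Python sketch of that mapping step follows. The function names are hypothetical, the tab-separated output format is an assumption, and the Exiftool call is replaced by a lookup table so that the example runs without any external tool.

```python
from pathlib import PurePosixPath


def page_key(path, root="/NAS"):
    """Derive 'bookId/pageId' from a path such as /NAS/bookId/pageId.jp2."""
    return str(PurePosixPath(path).relative_to(root).with_suffix(""))


def map_paths(lines, read_value):
    """Streaming-style mapper: one input path per line, one record out.

    read_value is a stand-in for the metadata extraction step
    (Exiftool in the slides); any callable taking a path will do.
    """
    for line in lines:
        path = line.strip()
        if path:
            yield f"{page_key(path)}\t{read_value(path)}"


# Values copied from the sample records above; the lookup replaces Exiftool.
sample_values = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
}
path_file = "/NAS/Z119585409/00000001.jp2\n/NAS/Z119585409/00000002.jp2\n"

records = list(map_paths(path_file.splitlines(), sample_values.__getitem__))
for record in records:
    print(record)
```
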
  • Create text file containing HTML input file paths and create one sequence file with the complete file content in HDFS.
  • SequenceFile creation
      • HtmlPathCreator: find on the NAS writes a text file of HTML file paths (1.4 GB)
      • SequenceFileCreator: reads the files from the NAS and writes one sequence file (997 GB uncompressed)
      • Sample input paths:
          • /NAS/Z119585409/00000707.html, /NAS/Z119585409/00000708.html, /NAS/Z119585409/00000709.html, …
          • /NAS/Z138682341/00000707.html, /NAS/Z138682341/00000708.html, /NAS/Z138682341/00000709.html, …
          • /NAS/Z178791257/00000707.html, /NAS/Z178791257/00000708.html, /NAS/Z178791257/00000709.html, …
          • /NAS/Z967985409/00000707.html, /NAS/Z967985409/00000708.html, /NAS/Z967985409/00000709.html, …
          • /NAS/Z196545409/00000707.html, /NAS/Z196545409/00000708.html, /NAS/Z196545409/00000709.html, ...
      • Sample sequence file keys: Z119585409/00000707, Z119585409/00000708, Z119585409/00000709, Z119585409/00000710, Z119585409/00000711, Z119585409/00000712
      • 60,000 books, 24 million pages: ~5 h + ~24 h = ~29 h
  • Execute Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width.
  • HTML Parsing (HadoopAvBlockWidthMapReduce)
      • Input: SequenceFile; output: text file
      • Map: emits (page identifier, block width) pairs; the sample shows the values 2100, 2200, 2300, 2250, 2400 for each of the pages Z119585409/00000001 to Z119585409/00000005
      • Reduce: aggregates the values of each page identifier into one record per page (Z119585409/00000001, Z119585409/00000002, Z119585409/00000003, Z119585409/00000004, Z119585409/00000005, ...)
      • 60,000 books, 24 million pages: ~6 h
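
The reduce step for an average can be sketched as a function over key-sorted, tab-separated lines, which is the shape of input a reducer sees after the shuffle phase. This is a Python stand-in and not the HadoopAvBlockWidthMapReduce job from the slides; the function names and the line format are assumptions.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>width' line into (key, int width)."""
    key, value = line.rstrip("\n").split("\t")
    return key, int(value)


def reduce_average(sorted_lines):
    """Yield (key, average width) for input lines already sorted by key."""
    pairs = (parse(line) for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        widths = [width for _, width in group]
        yield key, sum(widths) / len(widths)


# Map output for two pages, using the sample widths from the slide.
widths = [2100, 2200, 2300, 2250, 2400]
map_output = [f"Z119585409/00000001\t{w}" for w in widths]
map_output += [f"Z119585409/00000002\t{w}" for w in widths]

averages = dict(reduce_average(map_output))
print(averages["Z119585409/00000001"])  # 2250.0
```

Because `groupby` only merges adjacent items, the sketch relies on the input being sorted by key, just as a streaming reducer relies on the framework's sort.
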
  • Create Hive table and load generated data into the Hive database.
  • Analytic Queries (HiveLoadExifData & HiveLoadHocrData)
      • Table htmlwidth: CREATE TABLE htmlwidth (hid STRING, hwidth INT)
          • Z119585409/00000001, 1870
          • Z119585409/00000002, 2100
          • Z119585409/00000003, 2015
          • Z119585409/00000004, 1350
          • Z119585409/00000005, 1700
      • Table jp2width: CREATE TABLE jp2width (jid STRING, jwidth INT)
          • Z119585409/00000001, 2250
          • Z119585409/00000002, 2150
          • Z119585409/00000003, 2125
          • Z119585409/00000004, 2125
          • Z119585409/00000005, 2250
      • 60,000 books, 24 million pages: ~6 h
  • Analytic Queries (HiveSelect)
      • Tables: jp2width, htmlwidth
      • select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
      • 60,000 books, 24 million pages: ~6 h
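
The join can be tried out locally without a Hive installation. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so the column types differ (TEXT and INTEGER instead of STRING and INT). It assumes the key column of jp2width is named jid, as the query implies, and adds an ORDER BY that is not in the slide to make the output order predictable. The rows are the sample values from the previous slide.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
connection.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

jp2_rows = [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
html_rows = [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]
connection.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
connection.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)

# Same join as on the slide, plus ORDER BY for a stable result order.
rows = connection.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
connection.close()
```

Each result row places the image width from the JPEG2000 metadata next to the average block width from the HTML for the same page, so pages where the two differ strongly can be singled out for inspection.
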
  • Do a simple Hive query in order to test if the database has been created successfully.
  • Example: Web Archiving
  • Hands on – Virtual machine
      • 0.20.2+923.421 Pseudo-distributed Hadoop configuration
      • Chromium web browser with Hadoop Admin Links
      • Taverna Workbench 2.3.0
      • NetBeans IDE 7.1.2
      • SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
      • Latest patches
  • Hands on – VM setup
      • Unpack scape4youTraining.tar.gz
      • VirtualBox: Machine => Add => Browse to folder => select VBOX file
      • VM instance login:
          • user: scape
          • pw: scape123
  • Hands on – Demo1
      • Using Hadoop for analysing ARC files
      • Located at: /example/sampleIN/ (HDFS)
      • Execution via command in: SampleHadoopCommand.txt (on Desktop)
      • Result can then be found at: /example/sample_OUT/
  • Hands on – Demo2
      • Using Taverna for analysing ARC files
      • Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
          • ADD FILE LOCATION (not add value!!)
          • Input: /home/scape/scanARC/input/ONBSample.txt
      • Result: ~/scanARC/outputCSV/fullTIKAReport.csv
      • See ~/scanARC/outputGraphics/