Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012

Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.

1. Large scale preservation workflows with Taverna
   Sven Schlarb, Austrian National Library
   Keeping Control: Scalable Preservation Environments for Identification and Characterisation
   Guimarães, Portugal, 07/12/2012
2. What do you mean by "workflow"?
   • Data flow rather than control flow
   • (Semi-)automated data processing pipeline
   • Defined inputs and outputs
   • Modular and reusable processing units
   • Easy to deploy, execute, and share
3. Modularise complex preservation tasks
   • Assuming that complex preservation tasks can be separated into processing steps
   • Together the steps represent the automated processing pipeline
   [Pipeline diagram: Characterise, Migrate, Quality Assurance, Ingest]
4. Experimental workflow development
   • Easy to execute a workflow on standard platforms from anywhere
   • Experimental data available online or downloadable
   • Reproducible experiment results
   • Workflow development as a community activity
5. Taverna
   • Workflow language and computational model for creating composite, data-intensive processing chains
   • Developed since 2004 as a tool for life scientists and bioinformaticians by myGrid, University of Manchester, UK
   • Available for Windows/Linux/OSX and as open source (LGPL)
6. SCUFL/T2FLOW/SCUFL2
   • Alternative to other workflow description languages, such as the Business Process Execution Language (BPEL)
   • SCUFL2 is Taverna's new workflow specification language (Taverna 3), workflow bundle format, and Java API
   • SCUFL2 will replace the t2flow format (which itself replaced the SCUFL format)
   • Adopts Linked Data technology
7. Creating workflows using Taverna
   • Users interactively build data processing pipelines
   • A set of nodes represents the data processing elements
   • Nodes are connected by directed edges; the workflow itself is a directed graph
   • Nodes can have multiple inputs and outputs
   • Workflows can contain other (embedded) workflows
8. Processors
   • Web service clients (SOAP/REST)
   • Local scripts (R and Beanshell languages)
   • Remote shell script invocations via ssh ("Tool" service)
   • XML splitters / XSLT (interoperability!)
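In a Taverna 2 Beanshell processor, each input port is bound to a script variable of the same name, and the values left in output-port variables after the script runs feed the outgoing links. A minimal sketch in Beanshell's Java syntax, with hypothetical port names (fileName in, extension out) that are not from the slides:

```java
// Taverna Beanshell script sketch: "fileName" is an input port bound to a
// String variable; whatever ends up in "extension" is read as the output port.
String extension = "";
int dot = fileName.lastIndexOf('.');
if (dot >= 0 && dot < fileName.length() - 1) {
    extension = fileName.substring(dot + 1).toLowerCase();
}
```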
9. List handling: implicit iteration over multiple inputs
   • A "single value" input port (list depth 0) processes values iteratively (foreach)
   • A flat value list has list depth 1
   • List depth > 1 is used for tree structures
   • Multiple input ports with lists are combined as a cross product or a dot product
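The two combination strategies can be illustrated in plain Java: a dot product pairs the i-th items of the two lists, while a cross product invokes the processor for every combination. The lists and values below are made up for illustration:

```java
import java.util.List;

public class ListCombination {
    public static void main(String[] args) {
        List<String> files = List.of("a.jp2", "b.jp2");
        List<String> tools = List.of("exiftool", "jpylyzer");

        // Dot product: pair items by index -> 2 invocations
        for (int i = 0; i < Math.min(files.size(), tools.size()); i++) {
            System.out.println("dot:   " + files.get(i) + " " + tools.get(i));
        }

        // Cross product: every combination -> 2 x 2 = 4 invocations
        for (String f : files) {
            for (String t : tools) {
                System.out.println("cross: " + f + " " + t);
            }
        }
    }
}
```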
10. Example: Tika preservation component
    • Input: "file"
    • Processor: Tika web service (SOAP)
    • Output: MIME type
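The slide wraps Tika in a SOAP service; the underlying library call is a one-liner. A sketch using the Tika Java facade directly (this is the plain library API, not the SCAPE service interface):

```java
import java.io.File;
import org.apache.tika.Tika;

public class DetectMimeType {
    public static void main(String[] args) throws Exception {
        // Tika's detect() returns the MIME type as a string,
        // e.g. "image/jp2" for a JPEG2000 page image
        Tika tika = new Tika();
        System.out.println(tika.detect(new File(args[0])));
    }
}
```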
11. Workflow development and execution
    • Local development: Taverna Workbench

12. Workflow registry
    • Web 2.0 style registry: myExperiment

13. Remote workflow execution
    • Web client using the REST API of Taverna Server
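A minimal client sketch against the Taverna Server 2.x REST API: POST the t2flow definition to create a run, then set the run's status to Operating to start it. The server URL, media type, and resource paths are assumptions based on the Taverna Server documentation of that era; verify them against your installation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SubmitRun {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Create a run by POSTing the workflow definition (args[0] = .t2flow file)
        HttpRequest create = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/taverna-server/rest/runs"))
                .header("Content-Type", "application/vnd.taverna.t2flow+xml")
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
                .build();
        HttpResponse<Void> created =
                client.send(create, HttpResponse.BodyHandlers.discarding());

        // The Location header points at the newly created run resource
        String runUrl = created.headers().firstValue("Location").orElseThrow();

        // Start the run by setting its status to "Operating"
        HttpRequest start = HttpRequest.newBuilder()
                .uri(URI.create(runUrl + "/status"))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString("Operating"))
                .build();
        client.send(start, HttpResponse.BodyHandlers.discarding());
    }
}
```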
14. Hadoop
    • Open source implementation of MapReduce (Dean & Ghemawat, Google, 2004)
    • Hadoop = MapReduce + HDFS
    • HDFS: distributed file system; data stored in 64 MB blocks (default)
15. Hadoop
    • The job tracker (master) manages job execution on the task trackers (workers)
    • Each machine is configured to dedicate processing cores to MapReduce tasks (each core is a worker)
    • The name node manages HDFS, i.e. the distribution of data blocks on the data nodes
16. Hadoop job building blocks
    A MapReduce application (JAR) consists of:
    • Job configuration: set or overwrite configuration parameters
    • Map method: create intermediate key/value pair output
    • Reduce method: aggregate the intermediate key/value pair output from the map phase
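These three building blocks map directly onto a driver class in the Hadoop 0.20-era Java API. A sketch, where WidthMapper and AverageReducer are hypothetical class names (the reduce side is sketched after slide 25):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        // Job configuration: set or overwrite configuration parameters
        Configuration conf = new Configuration();
        Job job = new Job(conf, "average block width");
        job.setJarByClass(Driver.class);

        // Wire in the map and reduce building blocks
        job.setMapperClass(WidthMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read the input pages from a SequenceFile (as in the example below)
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```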
17. Cluster (image slide)
18. Large scale execution environment
    [Architecture diagram: Taverna Server (REST API) running in an Apache Tomcat web application, a file server, and the Hadoop job tracker in front of the cluster]
19. Example: characterisation on a large document collection
    • Using the "Tool" service (remote ssh execution)
    • Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive)
    • Available on myExperiment: http://www.myexperiment.org/workflows/3105
    • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
20. Create a text file containing the JPEG2000 input file paths and read the image metadata using Exiftool via the Hadoop Streaming API.
21. Reading image metadata
    Jp2PathCreator → HadoopStreamingExiftoolRead
    [Diagram: find lists the JPEG2000 files on the NAS, e.g. /NAS/Z119585409/00000001.jp2, into a 1.4 GB path list; the Hadoop streaming job reads each file with Exiftool and emits one record per page, e.g. Z119585409/00000001 → 2345, 1.2 GB in total]
    60,000 books, 24 million pages: ~5 h + ~38 h = ~43 h
22. Create a text file containing the HTML input file paths and create one sequence file with the complete file content in HDFS.
23. SequenceFile creation
    HtmlPathCreator → SequenceFileCreator
    [Diagram: find lists the HTML files on the NAS, e.g. /NAS/Z119585409/00000707.html, into a 1.4 GB path list; the file contents are packed into one SequenceFile keyed by page id, e.g. Z119585409/00000707, 997 GB uncompressed]
    60,000 books, 24 million pages: ~5 h + ~24 h = ~29 h
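A sketch of what such a SequenceFileCreator could look like with the Hadoop Java API: one SequenceFile keyed by page id, with the raw HTML bytes as values. The HDFS target path and the key derivation from the NAS layout are assumptions modelled on the slide's example paths:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileCreator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One SequenceFile in HDFS: key = "<book>/<page>", value = raw HTML bytes
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/example/html.seq"),
                Text.class, BytesWritable.class);
        try {
            // args: local paths such as /NAS/Z119585409/00000707.html
            for (String p : args) {
                byte[] content = Files.readAllBytes(Paths.get(p));
                String id = p.replaceFirst("^/NAS/", "")
                             .replaceFirst("\\.html$", "");
                writer.append(new Text(id), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}
```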
24. Execute a Hadoop MapReduce job using the sequence file created before in order to calculate the average paragraph block width.
25. HTML parsing
    HadoopAvBlockWidthMapReduce
    [Diagram: the map phase emits one (page id, block width) pair per paragraph block, e.g. Z119585409/00000001 → 2100, 2200, 2300, 2250, 2400; the reduce phase aggregates the widths per page. Input: SequenceFile; output: text file]
    60,000 books, 24 million pages: ~6 h
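The reduce step of such a job is per-key averaging. A sketch of the reducer, assuming the map phase (the HTML parsing itself, omitted here) emits one (page id, block width) pair per paragraph block:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pageId, Iterable<IntWritable> widths,
                          Context context) throws IOException, InterruptedException {
        // Average all paragraph block widths observed for this page
        long sum = 0;
        int count = 0;
        for (IntWritable w : widths) {
            sum += w.get();
            count++;
        }
        if (count > 0) {
            context.write(pageId, new IntWritable((int) (sum / count)));
        }
    }
}
```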
26. Create a Hive table and load the generated data into the Hive database.
27. Analytic queries: HiveLoadExifData & HiveLoadHocrData
    [Diagram: two Hive tables are created and loaded, one row per page]
    CREATE TABLE htmlwidth (hid STRING, hwidth INT)
    e.g. Z119585409/00000001 → 1870
    CREATE TABLE jp2width (jid STRING, jwidth INT)
    e.g. Z119585409/00000001 → 2250
    60,000 books, 24 million pages: ~6 h
28. Analytic queries: HiveSelect
    [Diagram: joining the jp2width and htmlwidth tables]
    select jid, jwidth, hwidth
    from jp2width inner join htmlwidth on jid = hid
    60,000 books, 24 million pages: ~6 h
29. Do a simple Hive query in order to test whether the database has been created successfully.
30. Example: web archiving
31. Hands on: virtual machine
    • Pseudo-distributed Hadoop configuration (0.20.2+923.421)
    • Chromium web browser with Hadoop admin links
    • Taverna Workbench 2.3.0
    • NetBeans IDE 7.1.2
    • SampleHadoopCommand.txt (executable Hadoop command for Demo 1)
    • Latest patches
32. Hands on: VM setup
    • Unpack scape4youTraining.tar.gz
    • VirtualBox: Machine => Add => browse to the folder => select the VBOX file
    • VM instance login:
      • user: scape
      • pw: scape123
33. Hands on: Demo 1
    • Using Hadoop for analysing ARC files
    • Input located at: /example/sampleIN/ (HDFS)
    • Execution via the command in SampleHadoopCommand.txt (on the desktop)
    • The result can then be found at: /example/sample_OUT/
34. Hands on: Demo 2
    • Using Taverna for analysing ARC files
    • Workflow: /home/scape/scanARC/scanARC_TIKA.t2flow
      • ADD FILE LOCATION (not add value!)
      • Input: /home/scape/scanARC/input/ONBSample.txt
    • Result: ~/scanARC/outputCSV/fullTIKAReport.csv
    • See ~/scanARC/outputGraphics/
