An Overview of VIEW


Published in: Education, Technology

  1. 1. Scientific Workflows for Big Data Prof. Shiyong Lu Big Data Research Laboratory Department of Computer Science Wayne State University
  2. 2. Today’s data-intensive science Looking for needle in haystack Looking into haystack Jim Gray: Turing Award laureate
  3. 3. Big Data Challenges For Big Data, data management and movement is a frequent challenge: between facilities, archives, researchers… Many files, large data volumes. With security, reliability, performance… Ian Foster: Father of Grid Computing
  4. 4. Big Data Challenges: Capture, Curation, Storage, Search, Sharing, Analysis, Visualization
  5. 5. Big Data Science Large Hadron Collider (LHC): 15 PB/year, 173 TB/day, 500 MB/sec. Higgs discovery is “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)
  6. 6. Data flows at Argonne National Lab: data management challenges. [Figure: estimated data flows in TB/day among external data sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, short-term storage, data analysis, and long-term storage. Credit: Ian Foster]
  7. 7. Big Data demands new CS research For example, existing clustering algorithms are typically cubic in N, and when N is too big, they do not work! - Jim Gray
  8. 8. What is Big Data? •Definition of Big Data: “…refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.” from website
  9. 9. Big Data Challenges •Challenges of Big Data: “national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.” from website
  10. 10. Big Data demands big workflows Reminiscent of
  11. 11. And thousands of parallel executions Managing big workflows and large-scale parallel execution is a big CS challenge !
  12. 12. Outline: (1) Introduction; (2) VIEW: A Prototypical SWFMS; (3) A Scientific Workflow Composition Model; (4) A Collectional Data Model; (5) Conclusions and Future Work
  13. 13. Introduction
      • Data Intensive Science
        • From computation intensive to data intensive.
        • A new research cycle: from data capture and data curation to data analysis and data visualization.
        • “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.” (“Beyond the Data Deluge”, Science, Vol. 323, no. 5919, pp. 1297–1298, 2009.)
  14. 14. Introduction
      • Scientific Workflow
        • A formal specification of a scientific process.
        • Represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.
        • Applications: Bioinformatics, Oceanography, Neuroinformatics, Astronomy, etc.
  15. 15. Introduction
      • Scientific Workflow Management System (SWFMS)
        • Supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow.
        • Existing SWFMSs: Taverna, Kepler, Pegasus, VisTrails, VIEW, …
  16. 16. Our VIEW System
  22. 22. Our VIEW System
      • Enables scientists to design workflows
      • Provides a runtime system to execute workflows
        • on a dedicated VIEW server
        • in a Cloud computing environment
      • Supports efficient collection, storage, querying, and visualization of workflow provenance
      • Is currently used in several bioinformatics applications, including genomic recombination and gene conversion data analysis
  23. 23. An Example Workflow in VIEW
      • Example workflows in
  24. 24. An Example Workflow in VIEW
  25. 25. VIEW 1-2-3 Step 1: Drag and drop inputs and outputs, and computational
  26. 26. VIEW 1-2-3 Step 2: Link them into a scientific workflow
  27. 27. VIEW 1-2-3 Step 3: Click the run button, you get the result!
  28. 28. Kids Play VIEW
  29. 29. An Example Workflow in VIEW
      • FiberFlow
        • Transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical intelligence in neural diseases.
  30. 30. VIEW: A Prototypical SWFMS
      • Minimal complexity for users, with the heavy machinery kept backstage.
        • Provides a clear and simple abstraction for manipulating and coordinating resources.
      • Service-oriented architecture.
        • Intuitive, user-friendly GUI.
  31. 31. A Reference Architecture for SWFMSs Service-oriented architecture of VIEW
  33. 33. A Reference Architecture for SWFMSs Other advantages of :
      • VIEW workflows can be executed in other systems (specifications are not tied to a particular SWFMS)
      • Use of open standards (Web Services, XML) promotes collaboration, interoperability and extensibility of the system
      • Workflow and data models implemented in VIEW are specifically geared towards heavy scientific data
  34. 34. A Reference Architecture for SWFMSs
  35. 35. VIEW: A Prototypical SWFMS A typical scientific workflow execution diagram.
  36. 36. Workflow Engine
      The Workflow Engine is the heart of the system.
      • Workflow orchestration.
      • Workflow execution.
      • Coordination of other subsystems.
      The Workflow Engine in VIEW:
      • Dataflow based.
      • Pure workflow composition.
      • Workflow constructs.
  37. 37. SWL
      • Example of our proposed scientific workflow specification language (SWL).
  38. 38. Primitive Workflow Specification
      • Example SWL specification of a primitive workflow.
  39. 39. Workflow Execution
      • Primitive workflow
      • Unary construct based workflow
      • Graph based workflow
        • A workflow graph is a composition of workflows by binary constructs.
        • Optimistic scheduling.
  40. 40. Workflow Database Schema
  41. 41. Data Product Manager
      • Solid data model.
      • Scalable data storage.
      • Convenient data access.
      • Data independence.
      The Data Product Manager is based on the collectional data model.
  42. 42. DPM Architecture
      • Architecture of the Data Product Manager.
      [Figure labels: Data Product Manager Main Server (Master); Data Access Layer; Node Databases; Data Mapping Layer; Data Set 1 and Data Set 2, each with Relational Databases and File Repositories; Data Storage Layer]
  43. 43. DPL
      • Example of the XML description of a collectional data product.
  44. 44. Data Storage
      • VIEW supports two ways of storage:
        • A collection can be stored in a table containing a set of its key/value pairs, whose values are references to existing collections.
        • A collection can be expanded and stored in two tables.
          • The Group By operator.
          • The Compress operator.
  45. 45. Data Typing
      A data product is
      • a Collection,
      • or a List,
      • or Empty.
      The List type:
      • Introduced in the workflow engine.
      • Each element is a data product.
      • Heterogeneous.
  46. 46. Collectional Data Querying
      Operators are implemented in primitive workflows.
      • Arithmetic operators.
      • Boolean operators.
      • Collectional operators.
      • List operators.
      Queries are implemented in workflow compositions.
  47. 47. Example
      • Given a table Reference < Student, Company, GradTime >, find the number of distinct students with an offer from each company in each graduation year; sort the result by GradTime descending, then by Company ascending.
      • SQL query:
        SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
        FROM Reference
        GROUP BY Company, GradTime
        ORDER BY GradTime DESC, Company ASC;
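For readers who prefer to see the grouping and ordering spelled out step by step, here is a minimal Python sketch of the same query. The sample rows are invented for illustration and are not from the slides; only the column names (Student, Company, GradTime) come from the example.

```python
from collections import defaultdict

# Hypothetical sample rows of Reference<Student, Company, GradTime>.
reference = [
    ("alice", "Acme", 2011),
    ("bob", "Acme", 2011),
    ("alice", "Acme", 2011),  # duplicate row, counted once (DISTINCT)
    ("carol", "Beta", 2011),
    ("dave", "Acme", 2010),
]

# GROUP BY Company, GradTime, collecting distinct students per group.
groups = defaultdict(set)
for student, company, grad_time in reference:
    groups[(company, grad_time)].add(student)

# COUNT(DISTINCT Student), then ORDER BY GradTime DESC, Company ASC.
query_result = sorted(
    ((company, grad_time, len(students))
     for (company, grad_time), students in groups.items()),
    key=lambda row: (-row[1], row[0]),
)

for row in query_result:
    print(row)
```

Each step (group, count, sort) corresponds to one operator that a query workflow would have to provide.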
  48. 48. Example of Query Workflow Query Workflow.
  49. 49. Key Requirements for Workflow Modeling
      R1: Programming-in-the-large.
      R2: Dataflow programming model.
      R3: Composable dataflow constructs.
      R4: Workflow encapsulation and hierarchical composition.
      R5: Single-assignment property.
      R6: Physical and logical data models.
      R7: Exception handling.
  50. 50. A Scientific Workflow Model Workflows are the basic and the only operands for workflow composition. [Figure: workflows W1, W2, and W3 with their input and output ports] Task components (e.g. Web services) are wrapped as primitive workflows (a.k.a. tasks), which are the basic building blocks of scientific workflows.
  51. 51. A Scientific Workflow Model
      A workflow construct is a mapping from a set of workflows to a workflow.
      • Unary workflow constructs
      • Binary workflow constructs
      • …
      A construct C takes a set of workflows W1, ..., Wn as input and composes them into the output workflow Wc.
  52. 52. A Scientific Workflow Model
      • Our proposed scientific workflow model consists of the following two layers:
        • The logical layer contains the workflow interface, which models the input ports and output ports of a workflow.
        • The physical layer contains the workflow body, which models the physical implementation of the workflow.
          • Primitive workflows.
          • Graph-based workflows.
          • Unary-construct-based workflows.
  53. 53. Unary Workflow Constructs Dataflow-based Unary Workflow Constructs
  54. 54. The Map Construct
      • The Map construct enables the parallel processing of a collection of data products based on a workflow that can only process a single data product.
      • Example: Map (M) wraps W1 to form W2. Given [[1,2],[3,6],[4,7]], W1 runs on [1,2], [3,6], and [4,7], producing 2, 18, and 28.
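The Map construct can be sketched as a higher-order function that lifts a single-product workflow to a list-processing one. This is a minimal illustration, not VIEW's implementation: `map_construct` is a hypothetical name, and `w1` is assumed to multiply its two inputs, which is consistent with the outputs 2, 18, and 28 on the slide but is not stated there.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def map_construct(workflow: Callable[[In], Out]) -> Callable[[Sequence[In]], List[Out]]:
    """Lift a workflow over one data product to a workflow over a list of them."""

    def mapped(products: Sequence[In]) -> List[Out]:
        # Each element is independent, so the runs may proceed in parallel.
        # pool.map returns the results in input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, products))

    return mapped


def w1(pair: Sequence[int]) -> int:
    """Assumed stand-in for W1: multiply the two numbers of a pair."""
    first, second = pair
    return first * second


w2 = map_construct(w1)
map_result = w2([[1, 2], [3, 6], [4, 7]])
print(map_result)  # [2, 18, 28]
```

Because the construct returns an ordinary workflow, its result can itself be passed to another construct.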
  55. 55. The Reduce Construct
      • The Reduce construct enables the aggregation of a list of data products to a single data product based on a workflow that aggregates a limited (two or more) number of input data products.
      • Example: Reduce (R) wraps Add to form W3. With initial value 0 and input list [3,5,9], Add computes 0 + 3 = 3, then 3 + 5 = 8, then 8 + 9 = 17.
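A minimal sketch of the Reduce construct as a left fold, assuming the wrapped workflow takes exactly two inputs (the accumulated value and the next list element). The name `reduce_construct` is hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def reduce_construct(workflow: Callable[[T, T], T], initial: T) -> Callable[[Sequence[T]], T]:
    """Lift a two-input workflow to one that aggregates a whole list."""

    def reduced(products: Sequence[T]) -> T:
        accumulated = initial
        # Feed the running result back in together with the next element.
        for product in products:
            accumulated = workflow(accumulated, product)
        return accumulated

    return reduced


def add(left: int, right: int) -> int:
    return left + right


w3 = reduce_construct(add, 0)
reduce_result = w3([3, 5, 9])
print(reduce_result)  # 17, via 0+3=3, 3+5=8, 8+9=17
```

The fold is inherently sequential: each step needs the previous result.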
  56. 56. The Tree Construct
      • The Tree construct
        • Enables parallel aggregation of a collection of data products.
        • Aggregates a collection pairwise as a binary tree until a single aggregated product is generated.
      • The Tree construct can be applied to associative workflows.
      • Example: Tree (T) wraps Add to form W4. Given [0,3,5,9], Add computes 0 + 3 = 3 and 5 + 9 = 14, then 3 + 14 = 17.
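The pairwise aggregation can be sketched level by level. This version runs sequentially for clarity; the pairs within one level are independent of each other, which is where a real engine could run them in parallel. `tree_construct` is a hypothetical name, and carrying an unpaired last element up to the next level is one possible choice that the slide does not specify.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_construct(workflow: Callable[[T, T], T]) -> Callable[[Sequence[T]], T]:
    """Aggregate a list pairwise, level by level, like a binary tree."""

    def aggregated(products: Sequence[T]) -> T:
        if not products:
            raise ValueError("cannot aggregate an empty collection")
        level: List[T] = list(products)
        while len(level) > 1:
            # Combine neighbours (0,1), (2,3), ...; these calls are independent.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                # Odd count: carry the unpaired last element up unchanged.
                next_level.append(level[-1])
            level = next_level
        return level[0]

    return aggregated


def add(left: int, right: int) -> int:
    return left + right


w4 = tree_construct(add)
tree_result = w4([0, 3, 5, 9])
print(tree_result)  # 17, via (0+3) and (5+9), then 3+14
```

Regrouping the operands this way only preserves the result when the workflow is associative, which is why the slide restricts Tree to associative workflows.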
  57. 57. The Conditional Construct
      • The Conditional construct enables the conditional execution of a workflow based on a condition on one of the inputs.
      • Example: Conditional (C) wraps Projection to form W4. With inputs [2,3] and 2 and predicate p = (PI 1 < PI 2), p evaluates to true and Projection runs; with inputs [2,3] and 1 and predicate p = (PI 1 >= PI 2), p evaluates to false and the execution fails.
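A possible reading of the Conditional construct, sketched as a guard around a workflow. Several things here are assumptions rather than facts from the slide: the predicate is taken to compare the first and second elements of the list input, `projection` is taken to return the element at a 1-based position, and failure is modelled as a Python exception.

```python
from typing import Any, Callable, Sequence


class WorkflowFailed(Exception):
    """Raised when the guarding predicate is false (the 'Fail' outcome)."""


def conditional_construct(
    workflow: Callable[..., Any], predicate: Callable[..., bool]
) -> Callable[..., Any]:
    """Run the wrapped workflow only if the predicate holds on its inputs."""

    def guarded(*inputs: Any) -> Any:
        if not predicate(*inputs):
            raise WorkflowFailed("predicate evaluated to false")
        return workflow(*inputs)

    return guarded


def projection(items: Sequence[Any], position: int) -> Any:
    """Assumed Projection: return the element at a 1-based position."""
    return items[position - 1]


# Assumed reading of the two predicates on the slide.
if_ascending = conditional_construct(projection, lambda items, _pos: items[0] < items[1])
if_descending = conditional_construct(projection, lambda items, _pos: items[0] >= items[1])

print(if_ascending([2, 3], 2))  # predicate true, Projection runs
try:
    if_descending([2, 3], 1)
except WorkflowFailed as error:
    print("failed:", error)  # predicate false, execution fails
```

Keeping the guard inside the returned workflow keeps the model purely dataflow-based: there is no separate branch node, only a workflow that either produces an output or fails.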
  58. 58. The Loop Construct
      • The Loop construct enables repeated execution of a workflow.
      • The output of the workflow is repeatedly fed back to a specified input port until the predicate evaluates to true.
      • Example: Loop (L) wraps Add with predicate p = (PI 1 > 100) and inputs 0 and 1. Add yields 1 (p = false), then 2 (p = false), and so on until 101 (p = true).
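The feedback behaviour can be sketched as follows. This is an illustration under assumptions: the predicate is evaluated on each output, the output is fed back to the first input port by default, and a `max_iterations` guard (not part of the slide) is added so a predicate that never holds cannot spin forever.

```python
from typing import Any, Callable


def loop_construct(
    workflow: Callable[..., Any],
    predicate: Callable[[Any], bool],
    feedback_port: int = 0,
    max_iterations: int = 10_000,
) -> Callable[..., Any]:
    """Re-run a workflow, feeding its output back to one input port."""

    def looped(*inputs: Any) -> Any:
        arguments = list(inputs)
        for _ in range(max_iterations):
            output = workflow(*arguments)
            if predicate(output):
                return output
            # Predicate still false: feed the output back and run again.
            arguments[feedback_port] = output
        raise RuntimeError("predicate never became true")

    return looped


def add(left: int, right: int) -> int:
    return left + right


count_past_100 = loop_construct(add, lambda value: value > 100)
loop_result = count_past_100(0, 1)
print(loop_result)  # 101
```

The second input (1) stays fixed across iterations, so each pass adds 1 to the value that was fed back.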
  59. 59. The Curry Construct
      • The Curry construct allows users to fix one of the input ports with a specified argument and thus reduce the number of input ports.
      • By applying multiple Curry constructs, a workflow that takes multiple arguments can be translated into a chain of workflows each with a single argument.
      • Example: Curry (U) fixes one input port of Add to 4, forming W8. Supplying 1 to the remaining port yields 1 + 4 = 5.
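A minimal sketch of the Curry construct as partial application on a chosen port. `curry_construct` is a hypothetical name, and ports are numbered from 0 here purely for convenience; the slide does not say which port of Add is fixed, and for addition the choice does not affect the result.

```python
from typing import Any, Callable


def curry_construct(
    workflow: Callable[..., Any], port: int, argument: Any
) -> Callable[..., Any]:
    """Fix one input port of a workflow, returning one with fewer ports."""

    def curried(*remaining: Any) -> Any:
        arguments = list(remaining)
        # Re-insert the fixed argument at its original port position.
        arguments.insert(port, argument)
        return workflow(*arguments)

    return curried


def add(left: int, right: int) -> int:
    return left + right


w8 = curry_construct(add, 0, 4)  # Add with its first port fixed to 4
curry_result = w8(1)
print(curry_result)  # 5

# Applying Curry again fixes the last remaining port.
no_inputs = curry_construct(w8, 0, 1)
print(no_inputs())  # 5
```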
  60. 60. Workflow Composition
      • Example of the composition of Map and Map constructs.
        • A workflow that increases every number in a nested list by 1.
      [Figure: two nested Map constructs (M) wrapping Add with one input fixed to 1, forming W9; the input [[1,2,3],[4,5,6]] yields the outputs 2, 3, 4, 5, 6, 7]
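Nesting Map inside Map can be sketched in a few lines. To keep the example short, this `map_construct` is a sequential stand-in (a parallel engine would distribute the element runs), and `add_one` is a hypothetical helper standing for Add with one input fixed to 1.

```python
from typing import Any, Callable, List, Sequence


def map_construct(workflow: Callable[[Any], Any]) -> Callable[[Sequence[Any]], List[Any]]:
    """Sequential stand-in for Map: apply a workflow to every element."""
    return lambda products: [workflow(product) for product in products]


def add(left: int, right: int) -> int:
    return left + right


def add_one(value: int) -> int:
    """Add with its second input fixed to 1."""
    return add(value, 1)


# Map applied to (Map applied to add_one) handles a list of lists.
w9 = map_construct(map_construct(add_one))
nested_result = w9([[1, 2, 3], [4, 5, 6]])
print(nested_result)  # [[2, 3, 4], [5, 6, 7]]
```

Each extra Map layer handles one more level of nesting in the input.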
  61. 61. Workflow Composition
      • Example of the composition of Map and Reduce constructs.
        • A workflow for parallel summation of each row in a matrix.
      [Figure: Map (M) applied to Reduce (R) over Add with initial value 0, forming W11; the input [[1,2,3],[4,5,6]] yields the row sums 6 and 15]
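The row-sum composition can be sketched by passing the result of Reduce to Map. Both helpers are simplified, sequential stand-ins with hypothetical names; in a parallel engine the rows could be reduced concurrently because they do not depend on each other.

```python
from typing import Any, Callable, List, Sequence


def map_construct(workflow: Callable[[Any], Any]) -> Callable[[Sequence[Any]], List[Any]]:
    """Sequential stand-in for Map: apply a workflow to every element."""
    return lambda products: [workflow(product) for product in products]


def reduce_construct(workflow: Callable[[Any, Any], Any], initial: Any) -> Callable[[Sequence[Any]], Any]:
    """Fold a list into one value with a two-input workflow."""

    def reduced(products: Sequence[Any]) -> Any:
        accumulated = initial
        for product in products:
            accumulated = workflow(accumulated, product)
        return accumulated

    return reduced


def add(left: int, right: int) -> int:
    return left + right


# Map over rows, Reduce within each row.
w11 = map_construct(reduce_construct(add, 0))
row_sums = w11([[1, 2, 3], [4, 5, 6]])
print(row_sums)  # [6, 15]
```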
  62. 62. Workflow Composition
      • Example of complicated workflow composition.
        • A workflow to calculate the greatest common divisor.
      [Figure: a Loop construct (L) with predicate p = (PI(2)==0) composed with Split, Modulus, Merge, and Projection workflows and further constructs (G2W, M, U), forming W13 to W17]
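The flattened figure cannot be recovered in detail, but its named parts (Split, Modulus, Merge, a loop that stops when the second element is 0, and a final Projection) match Euclid's algorithm. The sketch below is one plausible reading, not the slide's actual wiring: the helper names are hypothetical, and the predicate is checked before each step so that a zero second input is handled without a division by zero.

```python
from typing import List, Tuple


def split(pair: List[int]) -> Tuple[int, int]:
    """Split a two-element list into its components."""
    first, second = pair
    return first, second


def modulus(dividend: int, divisor: int) -> int:
    return dividend % divisor


def merge(first: int, second: int) -> List[int]:
    """Merge two values back into a two-element list."""
    return [first, second]


def gcd_step(pair: List[int]) -> List[int]:
    """One loop body: [a, b] becomes [b, a mod b]."""
    a, b = split(pair)
    return merge(b, modulus(a, b))


def gcd_workflow(a: int, b: int) -> int:
    state = merge(a, b)
    # Loop until the second element is 0 (the predicate PI(2)==0).
    while state[1] != 0:
        state = gcd_step(state)
    # Final projection of the first element.
    return state[0]


print(gcd_workflow(48, 18))  # 6
```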
  63. 63. A Collectional Data Model
      • A collectional data model
        • Supports collection-oriented datasets.
          • Scientists often work with collection-oriented datasets, such as arrays, lists, tables or file collections.
          • A collection-oriented data model enables data parallelism in scientific workflows.
        • Supports nested data structures.
          • Scientific data is often hierarchically organized.
          • Scientific workflow tasks often produce collections of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections.
        • Provides well-defined operators and their arbitrary compositions to manipulate and query scientific data collections.
  64. 64. A Collectional Data Model
      • A relation is a pair < R, r > where R is a schema of the relation and r is an instance of that schema.
        • A relation schema can be defined as an unordered tuple < c1 : d1, c2 : d2, …, cn : dn > where c1, c2, …, cn are column names and d1, d2, …, dn are domain names.
        • A relation instance is a table with rows (called tuples) and named columns (called attributes).
  65. 65. A Collectional Data Model
      • A collection schema is a pair < K, V >.
        • K, the key, is a pair k : d where k is the key name and d is the domain name.
        • V, the value, is either a relation schema or a collection schema.
      • A collection instance is a set of key-value pairs (pi, qi) (i ∈ {1,…,m}).
        • Each pi is a scalar value.
        • Each qi is either a relation instance or a collection instance.
  66. 66. A Collectional Data Model An example:
      • Parameters< Model : String, Experiments : Integer, <Concentration : Double, Degree : Integer >>.
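To make the nesting in the Parameters schema concrete, here is a small sketch that encodes the schema and checks an instance against it. The encoding is an assumption made for illustration (a relation schema as a dict of column types, a collection schema as a `(key_name, key_type, value_schema)` tuple, instances as nested dicts ending in lists of rows), and the sample values are invented.

```python
from typing import Any

# Relation schema <Concentration : Double, Degree : Integer>.
RELATION_SCHEMA = {"Concentration": float, "Degree": int}

# Collection schema: Model -> (Experiments -> relation).
PARAMETERS_SCHEMA = ("Model", str, ("Experiments", int, RELATION_SCHEMA))


def conforms(instance: Any, schema: Any) -> bool:
    """Check a nested instance against a relation or collection schema."""
    if isinstance(schema, dict):
        # Relation instance: a list of rows with exactly the schema's columns.
        return isinstance(instance, list) and all(
            isinstance(row, dict)
            and row.keys() == schema.keys()
            and all(isinstance(row[col], typ) for col, typ in schema.items())
            for row in instance
        )
    _key_name, key_type, value_schema = schema
    # Collection instance: scalar keys mapped to nested instances.
    return isinstance(instance, dict) and all(
        isinstance(key, key_type) and conforms(value, value_schema)
        for key, value in instance.items()
    )


# Hypothetical instance: two models, each with numbered experiments.
parameters = {
    "m1": {
        1: [{"Concentration": 7.0, "Degree": 15}],
        2: [{"Concentration": 7.1, "Degree": 30}],
    },
    "m2": {1: [{"Concentration": 7.1, "Degree": 15}]},
}

print(conforms(parameters, PARAMETERS_SCHEMA))  # True
```

The recursion in `conforms` mirrors the recursive definition of a collection schema: the value is either a relation schema (base case) or another collection schema.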
  67. 67. The Collectional Operators
      • We extend the relational operators to collectional operators, whose only operands are collections.
        • Six primitive operators: union, set difference, selection, projection, Cartesian product and renaming.
        • The set of collections is closed under those operators.
      • A relation can be defined as a collection whose height and cardinality are equal to 1. The collectional operators then reduce to the relational operators.
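  68. 68. The Collectional Operators
      • The union and the set difference operators can only be applied on union-compatible collections.
      [Figure: two union-compatible collections keyed by Model, one holding m1 with Result 26 and m2 with Result 32, the other holding m2 with Result 32 and m3 with Result 31]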
  69. 69. The Collectional Operators
      • Example of the union operator and the set difference operator.
      [Figure: the union and the set difference of two collections keyed by Model (m1, m2, m3) with Result values 26, 32, and 31]
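One way union and set difference could extend from relations to collections is key by key, sketched below. The semantics are an assumption, since the slides do not define them: relations under the same key are merged or subtracted, and keys left with an empty relation are dropped. The sample data follows a reading of the flattened figure (Model keys with single-column Result relations).

```python
from typing import Dict, FrozenSet, Tuple

Relation = FrozenSet[Tuple[int, ...]]
Collection = Dict[str, Relation]


def union(left: Collection, right: Collection) -> Collection:
    """Key-wise union: merge the relations stored under each key."""
    empty: Relation = frozenset()
    return {
        key: left.get(key, empty) | right.get(key, empty)
        for key in left.keys() | right.keys()
    }


def difference(left: Collection, right: Collection) -> Collection:
    """Key-wise difference: drop rows also present on the right."""
    empty: Relation = frozenset()
    result: Collection = {}
    for key, rows in left.items():
        remaining = rows - right.get(key, empty)
        if remaining:  # omit keys whose relation became empty
            result[key] = remaining
    return result


first = {"m1": frozenset({(26,)}), "m2": frozenset({(32,)})}
second = {"m2": frozenset({(32,)}), "m3": frozenset({(31,)})}

union_result = union(first, second)
difference_result = difference(first, second)
print(sorted(union_result))       # ['m1', 'm2', 'm3']
print(sorted(difference_result))  # ['m1']
```

Both operators return a collection of the same shape as their operands, which is what closure under the operators requires.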
  70. 70. The Collectional Operators
      • Example of the Cartesian product operator and the renaming operator.
      [Figure: the Cartesian product of two collections renamed M1 and M2, with columns M1.model, M2.model, M1.Result, and M2.Result]
  71. 71. The Collectional Operators
      • Example of the selection operator.
      [Figure: a selection result for Model m2 and Experiment 1, with columns Concentration and Degree]
  72. 72. The Collectional Operators
      • Example of the projection operator.
      [Figure: a projection over a collection keyed by Experiment, with columns Concentration and Degree]
  73. 73. Key Features of VIEW F1: VIEW features the first uniform workflow model, in which workflows are the only building blocks. In VIEW, tasks are primitive workflows, and no workflow construct distinguishes workflows from tasks. Such a model greatly simplifies workflow design: a workflow designer only needs to compose complex workflows from simpler ones, without first encapsulating workflows as tasks or vice versa during composition.
  74. 74. F2: VIEW offers powerful workflow composition: workflow constructs compose with one another to arbitrary depth. This often yields VIEW workflows that are more concise and more efficient to execute, and that can be hard to model in other workflow systems.
  75. 75. F3: VIEW features a pure dataflow-based workflow language SWL, including the dataflow counterparts of controlflow-style constructs, such as conditional and loop. Existing workflow languages often require both controlflow and dataflow constructs, resulting in complex or even obscure semantics and non-trivial workflow design.
  76. 76. F4: VIEW supports the cloud MapReduce programming model not only at the job level but also at the workflow level. One can therefore apply the Map and Reduce constructs to an arbitrary workflow any number of times. As a result, VIEW can process nested lists of data products in parallel using multiple runs of a workflow.
  77. 77. F5: VIEW features a collectional data model that supports not only traditional primitive data types, such as integer, float, double, boolean, char, and string, but also files, relations, and hierarchical collections (hierarchical key-value pairs) to support parallel processing of data collections.
  78. 78. F6: VIEW supports a high-level graph-based provenance query language, OPQL. In most cases, users can formulate lineage queries easily without writing recursive queries or knowing the underlying database schema.
  79. 79. F7: VIEW features the first service-oriented architecture that conforms to the reference architecture for scientific workflow management systems (SWFMSs). This architecture greatly facilitates interoperability and subsystem reusability in the community. It also provides a generic infrastructure upon which a domain-specific scientific workflow application system (SWFAS) can be easily developed with custom interfaces for various platforms and devices.
  80. 80. Conclusions and Future Work
      • A scientific workflow composition model.
      • A collectional data model.
      • A prototypical SWFMS.
      • Future work:
        • Formalization of the scientific workflow algebra and collectional algebra.
          • Completeness.
          • Integration.
        • Collaborative scientific workflow composition.
          • Concurrent design and composition.
          • Concurrent execution.
  81. 81. VIEW application Fiber tract analysis for Epilepsy.
  82. 82. VIEW application Computational detection of MARS in genome.
  83. 83. VIEW application DNA analysis for bacteria E. Coli
  84. 84. VIEW application Simulation of Nereis succinea mate search behavior.
  85. 85. Big Data is a Pyramid Can you contribute a piece too?
  86. 86. Big Data Research Laboratory Wayne State University