Workflow Provenance:
From Modelling to Reporting
Rayhan Ferdous
Banani Roy
Chanchal K. Roy
Kevin A. Schneider
Provenance
Relates to any question about data lineage
Does it matter?
Big Data Analytics is NOT for FREE !!!
Taxonomy of Provenance
da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
Scopes of R&D that were focused independently
[Diagram: research scopes around Provenance]
Data collection; Workflow design; Changes to system; Version control; Data usage feedback; Reporting and learning; Learning system; Recommendation; Data usage; Monitoring; Resource; Time series; Control; Smart re-run; Fault detection; Data analysis; Data provenance; Process provenance; Visualization; Version comparison; User tracking
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
Provenance has entered Big Data
Standardization is necessary
No system is ever complete
Users differ in expertise and goals
Fundamental research questions need to be identified
Data sources, formats, and management vary
Users need a meaningful and flexible way to interact
It is not feasible to impose a steep learning curve
When multiple domains join together…
"I have my own style"
01 Both data provenance and workflow provenance are necessary
02 Logging is necessary
03 Workflows differ by modelling, architecture and implementation from domain to domain
04 Logging mechanisms and log structures also differ
We want to bring everything into one place… for Big Data Provenance
Programming Model + Automated Logging
External configurability of logs
Use with a Domain Specific Language (DSL)
Extensible with further technologies
Parse logs in Graph Database (GDB)
Proposed fundamental workflow provenance queries
Data visualization to answer queries
Primary complexity analysis
User study of visualizations
Scale the system
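The first two items above (a programming model with automated logging, and externally configurable logs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the decorator `logged_module`, the `LOG_CONFIG` dictionary, the in-memory `LOG_RECORDS` list and the `WordCount` module are all hypothetical names chosen for illustration.

```python
import functools
import json
import time

# Hypothetical external configuration: which fields each log record keeps.
LOG_CONFIG = {"fields": ["name", "start", "duration", "status"]}

# In-memory stand-in for a log file (assumption for this sketch).
LOG_RECORDS = []


def logged_module(name, config=LOG_CONFIG, sink=LOG_RECORDS):
    """Wrap a workflow module so that every invocation is logged automatically."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                record = {
                    "name": name,
                    "start": start,
                    "duration": time.time() - start,
                    "status": status,
                }
                # Keep only the fields requested by the configuration.
                kept = {k: record[k] for k in config["fields"] if k in record}
                sink.append(json.dumps(kept))
        return wrapper
    return decorator


@logged_module("WordCount")
def word_count(text):
    """Toy workflow module: count whitespace-separated words."""
    return len(text.split())
```

Calling `word_count("a b c")` returns 3 and appends one JSON line to `LOG_RECORDS`. Changing `LOG_CONFIG["fields"]` changes what is logged without touching the module code, which is the point of keeping the log configuration external.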
Avoiding mathematical modelling for this session
Proposed System Architecture
[Diagram: layered architecture]
Layers: OOP Layer; Modelling Layer; DSL Layer; Tool Layer; User Layer
Components: Object Oriented Programming Model; Proposed Programming Model; DSL; Tools; Extension (Hadoop, Spark etc.); Logging Configuration
Roles (each attached by a "uses" arrow): Workflow User; Domain Expert; Model Developer
Proposed System Components Relation
[Diagram: components in the order shown on the slide]
Workflow System (Tools, DSL, Proposed Model, OOP, Extension); Logs; Online Parser; Visualization Service (Reporting); User
Proposed fundamental queries vs Cypher
Unit Map
MATCH (n:Type)
WHERE Condition
RETURN n
Time Sequence Map
MATCH (n:Type)
WHERE Condition
RETURN n.p
ORDER BY n.ptime
Data Sequence Map
MATCH (n1:Type1),(n2:Type2)
WHERE Condition AND n1.p = n2.p
RETURN n1.p1, n2.p2
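As a rough sketch, the three templates above can be filled in programmatically. The function and parameter names below are hypothetical and not part of the presented system. Note that Cypher tests equality with `=`. Plain string formatting is used only to keep the sketch short; real code should pass values as query parameters rather than interpolating them.

```python
def unit_map(node_type, condition):
    """Unit Map: return the matching nodes themselves."""
    return f"MATCH (n:{node_type}) WHERE {condition} RETURN n"


def time_sequence_map(node_type, condition, prop, time_prop):
    """Time Sequence Map: return one property ordered by a time property."""
    return (
        f"MATCH (n:{node_type}) WHERE {condition} "
        f"RETURN n.{prop} ORDER BY n.{time_prop}"
    )


def data_sequence_map(type1, type2, condition, join_prop, prop1, prop2):
    """Data Sequence Map: pair two node types that share a property value."""
    return (
        f"MATCH (n1:{type1}),(n2:{type2}) "
        f"WHERE {condition} AND n1.{join_prop} = n2.{join_prop} "
        f"RETURN n1.{prop1}, n2.{prop2}"
    )
```

For example, `time_sequence_map("Module", 'n.NAME = "FastQC"', "cpu_run", "time")` yields a query of the same shape as the FastQC CPU load example in this deck.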
Examples of some queries
What are the frequencies of different workflow components?
match(n) return n.label as label, count(n) as freq
What are the frequencies of different modules?
match (n:Module) return n.NAME as tool, count(n) as count
What is the time series mapping of CPU load for FastQC module?
match(n:Module) where n.NAME="FastQC" and n.cpu_run >= "0"
return n.time as time, n.cpu_run as cpuload order by n.time
What is the CPU load to execution time mapping for all modules?
match(n:Module) where n.cpu_run >= "0" and n.duration_run >= "0"
return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
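To make clear what the module frequency and CPU time series queries above compute, here is a small in-memory Python equivalent. The records are invented sample data standing in for `:Module` nodes (the module name `Trimmer` and all values are hypothetical), and numeric properties are assumed rather than the string comparisons used in the slide.

```python
from collections import Counter

# Hypothetical parsed log records standing in for :Module nodes.
modules = [
    {"NAME": "FastQC", "time": 2, "cpu_run": 40.0, "duration_run": 1.5},
    {"NAME": "Trimmer", "time": 1, "cpu_run": 10.0, "duration_run": 0.7},
    {"NAME": "FastQC", "time": 1, "cpu_run": 35.0, "duration_run": 1.2},
]


def module_frequencies(nodes):
    """Count how often each module name occurs (frequency query)."""
    return Counter(node["NAME"] for node in nodes)


def cpu_time_series(nodes, name):
    """Return (time, cpu load) pairs for one module, ordered by time."""
    rows = [
        (node["time"], node["cpu_run"])
        for node in nodes
        if node["NAME"] == name and node["cpu_run"] >= 0
    ]
    return sorted(rows, key=lambda row: row[0])
```

With this sample data, `module_frequencies(modules)` counts `FastQC` twice and `Trimmer` once, and `cpu_time_series(modules, "FastQC")` returns the two FastQC readings in time order.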
Classification of Workflow Provenance Queries
Is this really necessary ???!!!
Classification of WF Provenance Questions
Columns:
  Time Point (Unit Mapping): Evaluate (Past, Now)
  Time Series Sequence Mapping: Evaluate (Past, Now)
  Statistical Sequence Mapping: Evaluate (Past, Now); Compare (Past, Now); Predict (Future)

Object invocation
  Time Point, Evaluate: object invocation
  Time Series, Evaluate: object invocation sequence
  Statistical, Evaluate: frequency of object invocation; histogram of object invocation; statistical measurements
  Statistical, Compare (inter WF and inner WF): object-object invocation correlation; histogram comparison; measurements comparison; measurements correlation
  Statistical, Predict (inter WF and inner WF): object invocation; distribution; threshold

Object lineage
  Time Point, Evaluate: object source (module); object destination (module)
  Time Series, Evaluate: object lineage (module) sequence
  Statistical, Evaluate: measurements of DAG
  Statistical, Compare (inter WF and inner WF): lineage-lineage comparison; lineage-lineage correlation
  Statistical, Predict (inter WF and inner WF): graph similarity

Object property
  Time Point, Evaluate: object property
  Time Series, Evaluate: object property sequence
  Statistical, Evaluate: frequency of object property; histogram of object property; statistical measurements
  Statistical, Compare (inter WF and inner WF): property-property comparison; property-object correlation
  Statistical, Predict (inter WF and inner WF): object property
400+ possible queries
The number increases with the GDB node properties and their combinations
Coverage of Existing Works
[Table: the classification of WF provenance questions from the previous slide, annotated with the cells covered by Ghoshal, Akidau, Anand, Buneman, and Cheney]
A comprehensive classification paves the way for storytelling with data
Data visualization research can be merged with the queries in a systematic way
Primary Visualization Suggestion
Columns: Chart (X, Y, Size, Color) | Frequency | Time series - ordinal | Time series - nominal | Mapping - ordinal vs ordinal | Mapping - nominal vs ordinal | Mapping - nominal vs nominal | Lineage
Bar chart X X X X
Grouped bar chart X X X X
Stacked bar chart X X X X
Line chart X X X
Step line chart X
Basis line chart X X X
Pie chart X X X X X
Ring chart X X X X X
Area chart X X
Stacked area chart X X
Scatter plot X X X X
Bubble chart X X X X X
Floating bar chart X X X X
Floating pie chart X X X X X
Floating ring chart X X X X X
Block matrix X X X X X X
Heatmap X
Histogram X X
Box plot X X
Strip chart X X X
Bee Swarm chart X X X
DAG X
Tree map X
Metric X
Tabular X X X X X X
Complexity of our approach
Selected modules & implemented workflows
System Configuration:
Intel Core i7-7700
16 GB DDR4 RAM
256GB SSD
Ubuntu 16.04 LTS
Next Step 1: scale the system with state-of-the-art technologies.
Next Step 2: find the best visualization for provenance queries through a user study.
So many angles to investigate
How could even a simple line chart be drawn in a better way?
Do we need interactivity?
What level of interactivity is not excessive?
Scopes of R&D that were focused independently
[Diagram: the provenance research scopes shown earlier, each marked as Contributed, In Progress, or Future work]
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
References
1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT
2013 Workshops.
2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost
in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment,
2015.
3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT
2010.
4. Buneman et al., "Why and where: A characterization of data provenance." International conference
on database theory. 2001.
5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends® in
Databases, 2009.
6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of
provenance in scientific workflow management systems." Services-I, 2009 World Conference on.
IEEE, 2009.
7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific
workflows." Provenance and Annotation of Data and Processes (2008): 152-159.
8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow
provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357.
9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008).
10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the
twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM,
2007.
11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010.
12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph
analysis benchmark." International Conference on Web-Age Information Management. Springer,
Berlin, Heidelberg, 2010.
Thanks !!! (Demo)
