Workflow Provenance: From Modelling to Reporting
Rayhan Ferdous
Banani Roy
Chanchal K. Roy
Kevin A. Schneider
Provenance
Relates to any question about data lineage
Does it matter?
Big Data Analytics is NOT for FREE !!!
Taxonomy of Provenance
da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
Scopes of R&D that were focused independently
Provenance
Data Collection Workflow design
Changes to system
Version control
Data usage
feedback
Reporting and
learning
Learning system
Recommendation
Data usage
Monitoring
Resource
Time series
Control
Smart re-run
Fault detection
Data analysis
Data Provenance
Process
provenance
Visualization
Version comparison
User tracking
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
Provenance enters Big Data
Standardization is necessary
No system is ever complete
Users differ in expertise and goals
Fundamental research questions need to be identified
Data sources, formats, and management vary
Users need a meaningful and flexible way to interact
It's not feasible to impose a steep learning curve
When multiple domains join together…
I have my own style
01 Both data provenance and workflow provenance are necessary
02 Logging is necessary
03 Workflows differ by modelling, architecture and implementation from domain to domain
04 Logging mechanisms and log structures also differ
We want to bring everything into one place… for Big Data Provenance
Programming Model + Automated Logging
External configurability of logs
Use with a Domain Specific Language (DSL)
Extensible with further technologies
Parse logs into a Graph Database (GDB)
Proposed fundamental workflow provenance queries
Data visualization to answer queries
Primary complexity analysis
User study of visualizations
Scale the system
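The "Programming Model + Automated Logging" and "External configurability of logs" items could look roughly like the sketch below. It is only an illustration, not the authors' implementation: the `logged` decorator, the `LOG` sink, the `LOG_FIELDS` setting and the `WordCount` module are all hypothetical names. The property names `NAME`, `time` and `duration_run` follow the example queries in this deck.

```python
import functools
import time

# Hypothetical in-memory log sink; a real system might write to a file or a queue.
LOG = []

# Hypothetical external configuration: which fields each log record keeps.
LOG_FIELDS = ("NAME", "time", "duration_run", "status")


def logged(module_name):
    """Wrap a workflow module so every invocation is logged automatically."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                record = {
                    "NAME": module_name,
                    "time": start,
                    "duration_run": time.time() - start,
                    "status": status,
                }
                # Keep only the fields selected in the configuration.
                LOG.append({key: record[key] for key in LOG_FIELDS})
        return wrapper
    return decorator


@logged("WordCount")  # hypothetical module name
def word_count(text):
    return len(text.split())


result = word_count("a b c")
```

The module author writes only `word_count`; the logging happens in the wrapper, so changing `LOG_FIELDS` alters the log structure without touching module code.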
Avoiding Mathematical Modelling for this Session
Proposed System Architecture
[Figure: layered architecture diagram]
Layers: OOP Layer, Modelling Layer, DSL Layer, Tool Layer, User Layer
Components: Object Oriented Programming Model, Proposed Programming Model, DSL, Tools, Extension (Hadoop, Spark etc.), Logging Configuration
Roles (each linked by a "uses" relation): Workflow User, Domain Expert, Model Developer
Proposed System Components Relation
Workflow System (Tools, DSL, Proposed Model, OOP, Extension)
Logs
Online Parser
Visualization Service (Reporting)
User
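As a rough illustration of the step between Logs and the Online Parser, the sketch below turns log records into Cypher `CREATE` statements that a graph database could ingest. The JSON-lines log format, the function names and the `Trimmer` module are assumptions made for this example; the deck does not specify the real log structure.

```python
import json


def record_to_cypher(record, label="Module"):
    """Turn one log record (a dict) into a Cypher CREATE statement."""
    # json.dumps yields double-quoted strings and plain numbers, which are
    # valid Cypher literals for simple values like the ones used here.
    props = ", ".join(
        f"{key}: {json.dumps(value)}" for key, value in sorted(record.items())
    )
    return f"CREATE (:{label} {{{props}}})"


def parse_log(lines, label="Module"):
    """Parse JSON-lines log text into one CREATE statement per record."""
    return [record_to_cypher(json.loads(line), label) for line in lines if line.strip()]


# Hypothetical log lines; property names follow the example queries in this deck.
sample_log = [
    '{"NAME": "FastQC", "time": 1, "cpu_run": 12.5}',
    '{"NAME": "Trimmer", "time": 2, "cpu_run": 30.0}',
]
statements = parse_log(sample_log)
```

String interpolation keeps the sketch short; with a real database driver, passing the values as query parameters would be the safer choice.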
Proposed fundamental queries vs Cypher
Unit Map
MATCH (n:Type)
WHERE Condition
RETURN n
Time Sequence Map
MATCH (n:Type)
WHERE Condition
RETURN n.p
ORDER BY n.ptime
Data Sequence Map
MATCH (n1:Type1),(n2:Type2)
WHERE Condition AND n1.p = n2.p
RETURN n1.p1, n2.p2
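The three templates above can be filled in programmatically. The helper functions below are hypothetical and only build query strings; they do not talk to a database. Note that Cypher tests equality with a single `=`.

```python
def unit_map(node_type, condition):
    """Unit Map: return the nodes of one type that satisfy a condition."""
    return f"MATCH (n:{node_type}) WHERE {condition} RETURN n"


def time_sequence_map(node_type, condition, prop, time_prop="time"):
    """Time Sequence Map: return one property ordered by a time property."""
    return (
        f"MATCH (n:{node_type}) WHERE {condition} "
        f"RETURN n.{prop} ORDER BY n.{time_prop}"
    )


def data_sequence_map(type1, type2, condition, join_prop, prop1, prop2):
    """Data Sequence Map: pair nodes of two types that share a property value."""
    return (
        f"MATCH (n1:{type1}), (n2:{type2}) "
        f"WHERE {condition} AND n1.{join_prop} = n2.{join_prop} "
        f"RETURN n1.{prop1}, n2.{prop2}"
    )


fastqc_cpu = time_sequence_map("Module", 'n.NAME = "FastQC"', "cpu_run")
```

Here `fastqc_cpu` is a Time Sequence Map for the CPU load of the FastQC module, using the `NAME`, `cpu_run` and `time` properties that appear in the example queries.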
Examples of some queries
What are the frequencies of different workflow components?
match(n) return n.label as label, count(n) as freq
What are the frequencies of different modules?
match (n:Module) return n.NAME as tool, count(n) as count
What is the time series mapping of CPU load for FastQC module?
match(n:Module) where n.NAME="FastQC" and n.cpu_run >= "0"
return n.time as time, n.cpu_run as cpuload order by n.time
What is the cpu load to execution time mapping for all modules?
match(n:Module) where n.cpu_run >= "0" and n.duration_run >= "0"
return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
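To make the intent of these four queries concrete, here is a sketch that computes the same results over a small in-memory list of nodes instead of a graph database. The node data is invented for illustration, numeric properties are stored as numbers rather than strings, and `Trimmer` and `reads.fastq` are hypothetical names. In Cypher, the non-aggregated return columns act as implicit grouping keys for `count(n)`, which `Counter` mimics here.

```python
from collections import Counter

# Hypothetical provenance nodes; property names follow the queries above.
nodes = [
    {"label": "Module", "NAME": "FastQC", "time": 2, "cpu_run": 40.0, "duration_run": 1.5},
    {"label": "Module", "NAME": "FastQC", "time": 1, "cpu_run": 10.0, "duration_run": 0.5},
    {"label": "Module", "NAME": "Trimmer", "time": 3, "cpu_run": 25.0, "duration_run": 2.0},
    {"label": "Dataset", "NAME": "reads.fastq", "time": 0},
]
modules = [n for n in nodes if n["label"] == "Module"]

# Frequencies of different workflow components.
component_freq = Counter(n["label"] for n in nodes)

# Frequencies of different modules.
module_freq = Counter(n["NAME"] for n in modules)

# Time series of CPU load for the FastQC module, ordered by time.
fastqc_series = sorted(
    (n["time"], n["cpu_run"])
    for n in modules
    if n["NAME"] == "FastQC" and n["cpu_run"] >= 0
)

# CPU load to execution time mapping for all modules.
cpu_vs_duration = [
    (n["NAME"], n["cpu_run"], n["duration_run"])
    for n in modules
    if n["cpu_run"] >= 0 and n["duration_run"] >= 0
]
```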
Classification of Workflow Provenance Queries
Is this really necessary ???!!!
Classification WF Provenance Questions
Time Point (Unit Mapping)
Time Series Sequence Mapping
Statistical Sequence Mapping
Evaluate Evaluate Evaluate Compare Predict
Past Now Past Now Past Now Past Now Future
object invocation object invocation sequence
frequency of object invocation
(inter WF) object-object invocation correlation (inter WF) object invocation
(inner WF) object-object invocation correlation (inner WF) object invocation
histogram of object invocation
(inter WF) histogram comparison (inter WF) distribution
(inner WF) histogram comparison (inner WF) distribution
statistical measurements
(inter WF) measurements comparison
(inter WF) threshold
(inter WF) measurements correlation
(inner WF) measurements comparison
(inner WF) threshold
(inner WF) measurements correlation
object source (module)
object lineage (module)
sequence
measurements of DAG
(inter WF) lineage-lineage comparison
(inter WF) graph similarity
(inter WF) lineage-lineage correlation
object destination (module)
(inner WF) lineage-lineage comparison
(inner WF) graph similarity
(inner WF) lineage-lineage correlation
object property object property sequence
frequency of object property
(inter WF) property-property comparison
(inter WF) object property
(inter WF) property-object correlation
histogram of object property (inner WF) property-property comparison
(inner WF) object property
statistical measurements (inner WF) property-object correlation
400+ possible queries
Increases according to the GDB Node properties and different combinations
Coverage of Existing Works
Classification WF Provenance Questions
Time Point (Unit Mapping)
Time Series Sequence Mapping
Statistical Sequence Mapping
Evaluate Evaluate Evaluate Compare Predict
Past Now Past Now Past Now Past Now Future
object invocation object invocation sequence
frequency of object invocation
(inter WF) object-object invocation correlation (inter WF) object invocation
(inner WF) object-object invocation correlation (inner WF) object invocation
histogram of object invocation
(inter WF) histogram comparison (inter WF) distribution
(inner WF) histogram comparison (inner WF) distribution
statistical measurements
(inter WF) measurements comparison
(inter WF) threshold
(inter WF) measurements correlation
(inner WF) measurements comparison
(inner WF) threshold
(inner WF) measurements correlation
object source (module)
object lineage (module)
sequence
measurements of DAG
(inter WF) lineage-lineage comparison
(inter WF) graph similarity
(inter WF) lineage-lineage correlation
object destination (module)
(inner WF) lineage-lineage comparison
(inner WF) graph similarity
(inner WF) lineage-lineage correlation
object property object property sequence
frequency of object property
(inter WF) property-property comparison
(inter WF) object property
(inter WF) property-object correlation
histogram of object property (inner WF) property-property comparison
(inner WF) object property
statistical measurements (inner WF) property-object correlation
Ghoshal Akidau Anand Buneman Cheney
A comprehensive classification paves the way for storytelling with data
Data visualization research can be merged with the queries in a systematic way
Primary Visualization Suggestion
Columns: Chart (X, Y, Size, Color) | Frequency | Time series - ordinal | Time series - nominal | Mapping - ordinal vs ordinal | Mapping - nominal vs ordinal | Mapping - nominal vs nominal | Lineage
Bar chart X X X X
Grouped bar chart X X X X
Stacked bar chart X X X X
Line chart X X X
Step line chart X
Basis line chart X X X
Pie chart X X X X X
Ring chart X X X X X
Area chart X X
Stacked area chart X X
Scatter plot X X X X
Bubble chart X X X X X
Floating bar chart X X X X
Floating pie chart X X X X X
Floating ring chart X X X X X
Block matrix X X X X X X
Heatmap X
Histogram X X
Box plot X X
Strip chart X X X
Bee Swarm chart X X X
DAG X
Tree map X
Metric X
Tabular X X X X X X
Complexity of our approach
Selected modules & implemented workflows
System Configuration:
Intel Core i7-7700
16 GB DDR4 RAM
256GB SSD
Ubuntu LTS 16.04
Next Step 1: to scale the system with state-of-the-art technologies.
Next Step 2: to find the best visualization for provenance queries through a user study.
So many angles to investigate
How could even a simple line chart be drawn in a better way?
Do we need interactivity?
What type of interactivity is not excessive?
Scopes of R&D that were focused independently
Provenance
Data Collection Workflow design
Changes to system
Version control
Data usage
feedback
Reporting and
learning
Learning system
Recommendation
Data usage
Monitoring
Resource
Time series
Control
Smart re-run
Fault detection
Data analysis
Data Provenance
Process
provenance
Visualization
Version comparison
User tracking
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
Contributed
In Progress
Future work
References
1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT
2013 Workshops.
2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost
in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment,
2015.
3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT
2010.
4. Buneman et al., "Why and where: A characterization of data provenance." International conference
on database theory. 2001.
5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends® in
Databases, 2009.
6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of
provenance in scientific workflow management systems." Services-I, 2009 World Conference on.
IEEE, 2009.
7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific
workflows." Provenance and Annotation of Data and Processes (2008): 152-159.
8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow
provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357.
9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008).
10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the
twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM,
2007.
11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010.
12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph
analysis benchmark." International Conference on Web-Age Information Management. Springer,
Berlin, Heidelberg, 2010.
Thanks !!! (Demo)

Workflow Provenance: From Modelling to Reporting

  • 1. Workflow Provenance: From Modelling to Reporting Rayhan Ferdous Banani Roy Chanchal K. Roy Kevin A. Schneider
  • 2. Provenance Relates to any question about data lineage Does it matter? Big Data Analytics is NOT for FREE !!!
  • 3. Taxonomy of Provenance da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
  • 4. Scopes of R&D that were focused independently Provenance Data Collection Workflow design Changes to system Version control Data usage feedback Reporting and learning Learning system Recommendation Data usage Monitoring Resource Time series Control Smart re-run Fault detection Data analysis Data Provenance Process provenance Visualization Version comparison User tracking Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
  • 5. Provenance entered into Big Data Standardization is necessary Any system is never complete Users are from different levels of expertise and goals Fundamental research questions need to be identified Data source, format, management varies Users need a meaningful and flexible way to interact Its not feasible to offer a high learning curve
  • 6. When multiple domains join together… I have my own style Data provenance vs workflow provenance are necessary 01 Logging is necessary 02 Workflows differ by modelling, architecture and implementation from domain to domain 03 Logging mechanisms and log structures also differ 04
  • 7. We want to bring everything into one place… for Big Data Provenance Programming Model + Automated Logging External configurability of logs Use with a Domain Specific Language (DSL) Extensible with further technologies Parse logs in Graph Database (GDB) Proposed fundamental workflow provenance queries Data visualization to answer queries Primary complexity analysis User study of visualizations Scale the system
  • 9. Object Oriented Programming Model Proposed Programming Model Tools DSL Extension (Hadoop, Spark etc.) Logging Configuration Workflow User Domain Expert Model Developer uses uses uses OOP Layer Modelling Layer DSL Layer Tool Layer User Layer Proposed System Architecture
  • 10. Workflow System (Tools, DSL, Proposed Model, OOP, Extension) Logs Online Parser Visualization Service (Reporting) User Proposed System Components Relation
  • 11. Proposed fundamental queries vs Cypher Unit Map MATCH (n:Type) WHERE Condition RETURN n Time Sequence Map MATCH (n:Type) WHERE Condition RETURN n.p ORDER BY n.ptime Data Sequence Map MATCH (n1:Type1),(n2:Type2) WHERE Condition AND n1.p ==n2.p RETURN n1.p1, n2.p2
  • 12. Examples of some queries
  • 13. What are the frequencies of different workflow components? match (n) return n.label as label, count(n) as freq; What are the frequencies of different modules? match (n:Module) return n.NAME as tool, count(n) as count; What is the time series mapping of CPU load for the FastQC module? match (n:Module) where n.NAME = "FastQC" and n.cpu_run >= "0" return n.time as time, n.cpu_run as cpuload order by n.time; What is the CPU load to execution time mapping for all modules? match (n:Module) where n.cpu_run >= "0" and n.duration_run >= "0" return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
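To make the intent of two of these queries concrete, here is an in-memory Python equivalent of the module-frequency query and the FastQC CPU-load time series. The node records are made-up illustrative values, not measurements from the talk, and `cpu_run` is treated as a number here, whereas the slide compares it against the string "0".

```python
from collections import Counter

# Hypothetical node records standing in for graph database nodes.
nodes = [
    {"label": "Module", "NAME": "FastQC", "time": 2, "cpu_run": 40.0},
    {"label": "Module", "NAME": "FastQC", "time": 1, "cpu_run": 25.0},
    {"label": "Module", "NAME": "OtherTool", "time": 3, "cpu_run": 60.0},
]


def module_frequencies(records):
    """Count how often each module name occurs among Module nodes."""
    return Counter(r["NAME"] for r in records if r["label"] == "Module")


def cpu_time_series(records, name):
    """Return (time, cpu_run) pairs for one module, ordered by time."""
    rows = [
        (r["time"], r["cpu_run"])
        for r in records
        if r["label"] == "Module" and r["NAME"] == name and r["cpu_run"] >= 0
    ]
    return sorted(rows)


freq = module_frequencies(nodes)
series = cpu_time_series(nodes, "FastQC")
```

The first function corresponds to `count(n)` grouped by `n.NAME`; the second corresponds to the `where ... order by n.time` query.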
  • 14. Classification of Workflow Provenance Queries Is this really necessary ???!!!
  • 15. Classification WF Provenance Questions Time Point (Unit Mapping) Time Series Sequence Mapping Statistical Sequence Mapping Evaluate Evaluate Evaluate Compare Predict Past Now Past Now Past Now Past Now Future object invocation object invocation sequence frequency of object invocation (inter WF) object-object invocation correlation (inter WF) object invocation (inner WF) object-object invocation correlation (inner WF) object invocation histogram of object invocation (inter WF) histogram comparison (inter WF) distribution (inner WF) histogram comparison (inner WF) distribution statistical measurements (inter WF) measurements comparison (inter WF) threshold (inter WF) measurements correlation (inner WF) measurements comparison (inner WF) threshold (inner WF) measurements correlation object source (module) object lineage (module) sequence measurements of DAG (inter WF) lineage-lineage comparison (inter WF) graph similarity (inter WF) lineage-lineage correlation object destination (module) (inner WF) lineage-lineage comparison (inner WF) graph similarity (inner WF) lineage-lineage correlation object property object property sequence frequency of object property (inter WF) property-property comparison (inter WF) object property (inter WF) property-object correlation histogram of object property (inner WF) property-property comparison (inner WF) object property statistical measurements (inner WF) property-object correlation 400+ possible queries Increases according to the GDB Node properties and different combinations
  • 17. Classification of WF Provenance Questions (same classification as slide 15), annotated with related work: Ghoshal, Akidau, Anand, Buneman, Cheney.
  • 18. A comprehensive classification paves the way for storytelling with data. Data visualization research can be merged with the queries in a systematic way.
  • 20. Chart (X, Y, Size, Color) vs query type. Columns: Frequency; Time series - ordinal; Time series - nominal; Mapping - ordinal vs ordinal; Mapping - nominal vs ordinal; Mapping - nominal vs nominal; Lineage. Rows: Bar chart X X X X; Grouped bar chart X X X X; Stacked bar chart X X X X; Line chart X X X; Step line chart X; Basis line chart X X X; Pie chart X X X X X; Ring chart X X X X X; Area chart X X; Stacked area chart X X; Scatter plot X X X X; Bubble chart X X X X X; Floating bar chart X X X X; Floating pie chart X X X X X; Floating ring chart X X X X X; Block matrix X X X X X X; Heatmap X; Histogram X X; Box plot X X; Strip chart X X X; Bee Swarm chart X X X; DAG X; Tree map X; Metric X; Tabular X X X X X X
  • 21. Complexity of our approach
  • 23. System Configuration: Intel Core i7-7700, 16 GB DDR4 RAM, 256 GB SSD, Ubuntu 16.04 LTS
  • 24. Next Step 1: scale the system with state-of-the-art technologies.
  • 26. Next Step 2: find the best visualization for provenance queries through a user study.
  • 27. So many angles to investigate: How could even a single line chart be drawn in a better way? Do we need interactivity? What type of interactivity is not excessive?
  • 28. Scopes of R&D that were focused independently Provenance Data Collection Workflow design Changes to system Version control Data usage feedback Reporting and learning Learning system Recommendation Data usage Monitoring Resource Time series Control Smart re-run Fault detection Data analysis Data Provenance Process provenance Visualization Version comparison User tracking Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008. Contributed In Progress Future work
  • 29. References 1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT 2013 Workshops. 2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment, 2015. 3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT 2010. 4. Buneman et al., "Why and where: A characterization of data provenance." International conference on database theory. 2001. 5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends® in Databases, 2009. 6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of provenance in scientific workflow management systems." Services-I, 2009 World Conference on. IEEE, 2009. 7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific workflows." Provenance and Annotation of Data and Processes (2008): 152-159. 8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357. 9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008). 10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2007. 11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010. 12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph analysis benchmark." International Conference on Web-Age Information Management. Springer, Berlin, Heidelberg, 2010.