The document discusses scientific workflow management systems and collaboration in workflow-based science. It notes that collaboration requires that a scientist be able to make sense of third-party data, and that this requires the data to be accompanied by provenance metadata that describes how the data was generated and processed. The concept of a "Research Object" is introduced as a way to package scientific data and workflows together with provenance and other related information to enable collaboration and reuse.
Structured Occurrence Network for provenance: talk for IPAW'12 paper - Paolo Missier
The document discusses using structured occurrence networks (SONs) to model provenance. SONs extend occurrence networks (ONs) to represent the activity of complex systems through relationships between multiple ONs. The goal is to explore using SONs as a formal model of provenance, viewing data as an evolving system and agents as also evolving systems. Communication SONs are introduced to capture communication between concurrently proceeding ONs. This establishes patterns for representing workflow and multi-layered provenance using SONs.
SWPM'12 report on the Dagstuhl seminar on Semantic Data Management - Paolo Missier
The document summarizes discussions that took place at a Dagstuhl seminar on provenance in semantic data management in April 2012. Key points discussed include:
1) The need for provenance-specific benchmarks and reference data sets to better understand provenance usage and properties.
2) Proposals to collect provenance traces from various domains in a community repository using the PROV standard for interoperability.
3) Challenges of representing and reasoning with uncertain provenance information from sources like sensors, NLP, and human errors.
PDT: Personal Data from Things, and its provenance - Paolo Missier
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
The document discusses scientific workflow management systems and provenance. It notes that momentum is growing around data sharing, as evidenced by a special issue of Nature on the topic. Effective data sharing requires standards for packaging data with metadata into self-descriptive research objects, as well as representation of process provenance using workflow descriptions. Provenance captures causal relationships in scientific data and is important for understanding, reusing, and validating others' work. The Open Provenance Model aims to standardize provenance representation.
Paper presentations: UK e-Science AHM meeting, 2005 - Paolo Missier
The document describes an ontology-based approach to handling information quality in e-science. It presents an initial quality framework that captures scientists' quality requirements and allows defining domain-specific quality characteristics. It introduces a web service that annotates datasets with quality metrics based on how well their elements conform to relevant ontologies, using transcriptomics as an example domain. The approach aims to make quality definitions reusable and the computation of quality measurements over large datasets cost-effective.
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti... - Paolo Missier
The document discusses fine-grained provenance tracking of workflow data products. It presents a functional model for collection-oriented workflow processing that models workflows operating on nested collections. This model generalizes simple iteration to arbitrary collection depths and handles multiple input collections through a generalized cross product operation. The model aims to enable efficient provenance querying by traversing the workflow graph instead of the potentially larger provenance graph.
This document discusses encoding provenance graphs and PROV constraints using Datalog rules. It maps PROV notation graphs to a database of facts and encodes most PROV constraints as Datalog rules. This allows for declarative specification of provenance graphs with deductive inference, enabling validation of graphs and rapid prototyping of analysis algorithms. Some limitations include inability to encode certain constraints and attributes in graph relations. The approach provides a proof of concept for representing and reasoning over provenance graphs with Datalog.
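To make the encoding idea concrete, here is a minimal Python sketch of Datalog-style evaluation over PROV-like facts. The predicate names follow PROV vocabulary, but the transitive-derivation rule and the naive fixpoint loop are our own illustration, not the paper's actual rule set:

```python
# Datalog-style evaluation over PROV-like facts, as a naive fixpoint loop.
# Illustrative only: the transitive-derivation rule below is invented for
# the example and is not one of the normative PROV constraints.
facts = {
    ("wasDerivedFrom", "e3", "e2"),
    ("wasDerivedFrom", "e2", "e1"),
    ("wasGeneratedBy", "e3", "a2"),
    ("used", "a2", "e2"),
}

def derived_star(facts):
    """derived*(X, Z) :- wasDerivedFrom(X, Y), derived*(Y, Z)."""
    wdf = {(x, y) for (p, x, y) in facts if p == "wasDerivedFrom"}
    inferred = set(wdf)              # base case: every direct derivation
    while True:                      # iterate the rule to a fixpoint
        new = {(x, z) for (x, y) in wdf for (y2, z) in inferred if y == y2}
        if new <= inferred:
            return {("derived*", x, z) for (x, z) in inferred}
        inferred |= new

print(derived_star(facts))
# includes ('derived*', 'e3', 'e1'), inferred from the two direct derivations
```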
The document discusses porting genome sequencing data processing pipelines from scripted HPC implementations to workflow models on the cloud. This allows the pipelines to be more scalable, flexible, and evolvable. Tracking provenance is also important for using results as clinical evidence and analyzing differences when the pipelines change. Preliminary tests on the Microsoft Azure cloud show potential cost savings from improved resource utilization.
The document discusses integrating data from multiple sources on-the-fly without prior knowledge of the schemas. It proposes using approximate entity reconciliation, which leverages techniques like record linkage, approximate joins, and adaptive query processing. The key challenges are trading off completeness of integration for query response time and implementing a hybrid join algorithm that switches between exact and approximate joins to optimize this tradeoff.
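The summary above does not spell out the algorithm; a minimal sketch of the hybrid-join idea, under our own assumptions (illustrative record sets, difflib string similarity, an arbitrary 0.85 threshold), might look like this:

```python
from difflib import SequenceMatcher

def hybrid_join(left, right, key, threshold=0.85):
    """Illustrative hybrid join: exact hash join first, then an approximate
    pass (string similarity) for left rows with no exact partner. The
    threshold trades completeness of integration against response time."""
    index = {}
    for row in right:  # build phase of the exact hash join
        index.setdefault(row[key], []).append(row)

    matches, unmatched = [], []
    for row in left:   # probe phase
        if row[key] in index:
            matches += [(row, r) for r in index[row[key]]]
        else:
            unmatched.append(row)

    for row in unmatched:  # approximate pass: entity reconciliation by similarity
        for k, rows in index.items():
            if SequenceMatcher(None, row[key], k).ratio() >= threshold:
                matches += [(row, r) for r in rows]
    return matches

left = [{"name": "J. Smith"}, {"name": "A. Jones"}]
right = [{"name": "J. Smith"}, {"name": "A. Jonees"}]
print(hybrid_join(left, right, "name"))  # both pairs matched, one approximately
```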
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 - Paolo Missier
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proceedings of the 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
ProvAbs: model, policy, and tooling for abstracting PROV graphs - Paolo Missier
This document presents ProvAbs, a model, policy language, and tool for abstracting PROV graphs to enable partial disclosure of provenance data. The model groups nodes in a PROV graph and replaces them with a new abstract node while preserving the graph's validity. A policy assigns sensitivity levels to nodes and drives the node selection for abstraction. The ProvAbs tool implements the abstraction model and allows interactively exploring policy settings and clearances to generate abstract views of a PROV graph.
Your data won’t stay smart forever: exploring the temporal dimension of (big ... - Paolo Missier
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts, and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the costs and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by presenting a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We describe two such scenarios in which we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Big Data Quality Panel: Diachron Workshop @EDBT - Paolo Missier
1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
The lifecycle of reproducible science data and what provenance has got to do ... - Paolo Missier
The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central - Paolo Missier
This document discusses moving whole exome sequencing pipelines to the cloud using e-Science Central workflow management. The goal is to process 3000 exomes from neurological patients in a scalable and cost-effective way. Current scripts are being ported to e-Science Central for improved abstraction, execution, and provenance tracking. Provenance will help compare results from different pipeline versions and support clinical diagnosis. Initial testing with 300 exomes will begin, with full scalability testing planned for September 2014.
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
Design and Development of a Provenance Capture Platform for Data Science - Paolo Missier
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records - Paolo Missier
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... - Paolo Missier
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023.
Please see the paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... - Paolo Missier
A keynote talk given at the IDEAL 2023 conference (Evora, Portugal, Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has begun to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well known to the data engineering community, including incremental data cleaning, multi-source integration, and data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this position talk I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... - Paolo Missier
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) - Paolo Missier
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview - Paolo Missier
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
Probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
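As a rough illustration of the input/output comparison idea (not DPDS's actual mechanism), the following sketch diffs two pandas dataframes that share an index and emits one provenance record per cell the operation changed; all function and field names here are ours:

```python
import pandas as pd

def cell_level_provenance(df_in, df_out, op_name):
    """Compare an input and an output dataframe (same index/columns) and emit
    fine-grained provenance: one record per cell the operation touched.
    Illustrative only: real capture also handles joins, appends, drops, etc."""
    records = []
    common_cols = [c for c in df_in.columns if c in df_out.columns]
    for col in common_cols:
        for idx in df_in.index.intersection(df_out.index):
            before, after = df_in.at[idx, col], df_out.at[idx, col]
            if pd.isna(before) and pd.isna(after):
                continue  # both missing: nothing changed
            if before != after:
                records.append({"op": op_name, "row": idx, "col": col,
                                "used": before, "generated": after})
    return pd.DataFrame(records)

df_in = pd.DataFrame({"age": [23, None, 41]})
df_out = df_in.fillna(df_in["age"].mean())  # an imputation step in a pipeline
print(cell_level_provenance(df_in, df_out, "impute-mean"))
```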
Tracking trajectories of multiple long-term conditions using dynamic patient... - Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare - Paolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in ... - Paolo Missier
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... - Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines: from optimising re-execution to general Dat... - Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha... - Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University - Paolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
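ReComp's difference and impact functions are application-specific; a minimal sketch of the selection logic they drive might look like the following, where the function names and the threshold are illustrative assumptions, not ReComp's actual API:

```python
def select_for_reexecution(executions, old_input, new_input,
                           diff_fn, impact_fn, threshold=0.1):
    """Sketch of ReComp-style selective re-computation: quantify the input
    change once, estimate its impact on each past execution from provenance,
    and re-run only executions whose estimated impact crosses a threshold.
    diff_fn/impact_fn stand in for the customizable, application-specific
    difference and impact functions."""
    delta = diff_fn(old_input, new_input)
    return [e for e in executions if impact_fn(e, delta) >= threshold]

# Toy instantiation: inputs are sets of variant annotations; impact is the
# fraction of an execution's used variants that changed.
diff_fn = lambda old, new: old ^ new                      # symmetric difference
impact_fn = lambda e, delta: len(e["used"] & delta) / len(e["used"])

executions = [{"id": 1, "used": {"v1", "v2"}}, {"id": 2, "used": {"v3"}}]
print(select_for_reexecution(executions, {"v1", "v2", "v3"},
                             {"v1b", "v2", "v3"}, diff_fn, impact_fn))
# only execution 1 is selected for re-run
```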
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Must Know Postgres Extension for DBA and Developer during MigrationMydbops
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow up the below links.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
1. Scientific Workflow Management System
Janus
Provenance
Towards systematic information exchange and reuse in e-laboratories
AGU Fall meeting, Dec. 2009
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
2. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
http://www.nature.com/news/specials/datasharing/index.html
3. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
http://www.nature.com/news/specials/datasharing/index.html
• Debate is much further along in Earth Sciences
– ESIP - data preservation / stewardship, 2009
– Long established in some communities - Atmospheric sciences, 1998 [1]
• Science Commons recommendations for Open Science (July 2008) [link]
[1] Strebel DE, Landis DR, Huemmrich KF, Newcomer JA, Meeson BW: The FIFE Data Publication Experiment. Journal of the Atmospheric Sciences 1998, 55:1277-1283
5. Collaboration in workflow-based science
[Diagram: a workflow specification plus an input dataset feed a workflow execution.]
6. Collaboration in workflow-based science
[Diagram: the workflow execution now produces two outcomes: data and provenance.]
7. Collaboration in workflow-based science
[Diagram: the outcomes (data and provenance) are packaged into a Research Object.]
9. Collaboration in workflow-based science
[Diagram: a third-party scientist, Paul, can browse, query, unbundle, and reuse the Research Object.]
10. Collaboration in workflow-based science
[Diagram: as above; sharing the Research Object establishes a data-mediated, implicit collaboration with Paul.]
11. Collaboration in workflow-based science
What is needed for Paul to make sense of third party data?
[Diagram: the data-mediated implicit collaboration scenario from the previous slides.]
15. Paul's Pack
[Diagram: Paul's Pack, a QTL Research Object, built around a "Common pathways" study.]
16. Paul's Pack
[Diagram: the pack now shows its contents: Workflow 16 and Workflow 13 with their Results, plus Logs, Slides, and a Paper.]
17. Paul's Pack
[Diagram: as above, with a Representation layer added.]
18. Paul's Pack
[Diagram: as above, with Domain Relations added alongside the Representation layer.]
19. Paul's Pack
[Diagram: the domain relations are now labelled: each workflow produces its Results; the items are included in the pack; Workflow 16's results feed into Workflow 13; and results are published in the Slides and the Paper.]
20. Paul's Pack
[Diagram: as above, with the Metadata and Aggregation layers completing the Research Object.]
21. ORE: representing generic aggregations
[Diagram: an ORE Resource Map (the descriptor) describes an aggregation data structure.]
http://www.openarchives.org/ore/1.0/primer.html section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H. Van de Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
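As an illustration of the ORE vocabulary in use, here is a minimal rdflib sketch of a Resource Map describing an aggregation; the example URIs and the aggregated resources are invented:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
rem = URIRef("http://example.org/pack/rem")          # the Resource Map (descriptor)
agg = URIRef("http://example.org/pack/aggregation")  # the aggregation it describes

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
# the aggregated resources: a workflow, its results, and a paper (made up)
for part in ("workflow16", "results16", "paper"):
    g.add((agg, ORE.aggregates, URIRef(f"http://example.org/pack/{part}")))

print(g.serialize(format="turtle"))
```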
23. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
25. Content: Workflow provenance
A detailed trace of workflow execution
- tasks performed, data transformations
- inputs used, outputs produced
[Diagram: an example workflow whose processors (lister, get pathways by genes1, merge pathways, concat gene pathway ids) take a gene_id input and produce the output pathway_genes.]
26. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
27. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
[Diagram: the example pathway workflow from slide 25.]
28. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.1 out soon
Goal: standardize causal dependencies to enable provenance metadata exchange
[Diagram: OPM edge types: an artifact A wasGeneratedBy (R) a process P, and a process P used (R) an artifact A, where R is a role; an example graph links artifacts A1...A4 and processes P1...P3 through used and wasGeneratedBy (wgb) edges with roles R1...R6.]
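To illustrate how causal dependencies in such a graph can be traversed, here is a small Python sketch (node and role names invented) that walks used and wasGeneratedBy edges to find everything that caused an artifact:

```python
# Illustrative OPM-style graph: edges point from effect to cause, so the
# transitive causes of a node are found by walking used and
# wasGeneratedBy (wgb) edges. All identifiers are invented.
used = {"P1": [("A1", "R3")]}                     # process used artifact (role)
wgb = {"A3": [("P1", "R5")], "A1": [("P2", "R1")]}  # artifact generated by process

def causes(node, used, wgb):
    """All artifacts and processes that transitively caused `node`."""
    edges = {**{p: [a for a, _ in deps] for p, deps in used.items()},
             **{a: [p for p, _ in deps] for a, deps in wgb.items()}}
    seen, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for cause in edges.get(n, []):
            if cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

print(causes("A3", used, wgb))  # {'P1', 'A1', 'P2'}
```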
29. Additional requirements on OPM
• Artifact values require a uniform common identifier scheme
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
• Taverna approach
– multiple levels of abstraction
• through OPM accounts (“points of view”)
31. Query results as OPM graphs
[Diagram: execute workflow W, producing the provenance trace prov(W); run a query Q over the trace; export the result Q(prov(W)) as an OPM graph OPM(Q(prov(W))).]
- Approach implemented in the Taverna 2.1 workflow system (just released!)
- Internal provenance DB with ad hoc query language
32. Full-fledged data-mediated collaborations
[Diagram: experiment A: workflow A plus input A produce result A, packaged with its provenance and datasets A into Research Object A.]
34. Full-fledged data-mediated collaborations
[Diagram: as above, with result A now flowing on as input B (result A → input B).]
35. Full-fledged data-mediated collaborations
[Diagram: experiment B consumes result A as input B; workflow B plus input B produce result B, packaged with its provenance and datasets B into Research Object B.]
36. Full-fledged data-mediated collaborations
[Diagram: the two experiments combine into a single Research Object aggregating results A+B, their provenance, and datasets A+B.]
37. Full-fledged data-mediated collaborations
[Diagram: as above.]
Provenance composition accounts for implicit collaboration
38. Full-fledged data-mediated collaborations
[Diagram: as above.]
Provenance composition accounts for implicit collaboration
Aligned with the focus of the upcoming Provenance Challenge 4: “connect my provenance to yours” into a whole OPM provenance graph.
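A minimal sketch of the composition idea: two OPM-style graphs stitched together on the shared artifact (result A = input B), after which upstream queries span both experiments. Identifiers are illustrative and presuppose the common artifact-naming scheme discussed on slide 29:

```python
# Sketch of composing two OPM-style provenance graphs across a collaboration:
# experiment B used experiment A's result as its input, so the shared
# artifact identifier stitches the two graphs into one. Names are invented.
prov_a = {("resultA", "wasGeneratedBy", "workflowA"),
          ("workflowA", "used", "inputA")}
prov_b = {("resultB", "wasGeneratedBy", "workflowB"),
          ("workflowB", "used", "resultA")}  # result A -> input B

composed = prov_a | prov_b  # one graph; 'resultA' is the shared node

def upstream(node, graph):
    """Everything that contributed to `node` in the composed graph."""
    direct = {cause for (effect, _, cause) in graph if effect == node}
    return direct | {n for c in direct for n in upstream(c, graph)}

print(upstream("resultB", composed))
# {'workflowB', 'resultA', 'workflowA', 'inputA'} -- spans both experiments
```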
39. Contacts
The myGrid Consortium (Manchester, Southampton)
http://mygrid.org.uk
http://www.myexperiment.org
Me: pmissier@acm.org