The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
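As a small illustration of the PROV-style provenance that underpins the tools summarised above, here is a minimal sketch using the Python `prov` package; the entity and activity names (raw_data, analysis_run, result) are hypothetical placeholders, not identifiers from the systems described in the deck.

```python
from prov.model import ProvDocument

# A tiny W3C PROV document describing a single analysis step.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

doc.entity('ex:raw_data')                           # input dataset
doc.activity('ex:analysis_run')                     # the processing step
doc.entity('ex:result')                             # derived output

doc.used('ex:analysis_run', 'ex:raw_data')          # the run consumed the input
doc.wasGeneratedBy('ex:result', 'ex:analysis_run')  # the output came from the run
doc.wasDerivedFrom('ex:result', 'ex:raw_data')      # direct data lineage

print(doc.get_provn())                              # human-readable PROV-N serialization
```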
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud (Paolo Missier)
Another Cloud-e-Genome dissemination opportunity: porting an existing WES/WGS pipeline from HPC to a (public) cloud, while achieving more flexibility and better abstraction, and with better performance than the equivalent HPC deployment.
Data Trajectories: tracking the reuse of published data for transitive credit... (Paolo Missier)
This document discusses tracking the reuse of published research data through transformations in order to attribute credit. It presents a hypothetical scenario of data being reused by multiple researchers. The reuse events can be modeled as a provenance graph compliant with the W3C PROV standard. Rules for inductively assigning and propagating credit through the graph are defined. Challenges in building the provenance graph in practice are discussed, as autonomous systems may incompletely or inconsistently report reuse events. Addressing these challenges is framed as an important research agenda.
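To make the idea of rule-based credit propagation concrete, here is a minimal, hypothetical sketch (not the paper's actual rules): each derivation edge in a PROV-style reuse graph passes a fixed fraction of credit back towards the original dataset.

```python
import networkx as nx

# Hypothetical reuse graph: an edge (a, b) means "a was derived from b".
G = nx.DiGraph()
G.add_edges_from([
    ("paper_X_results", "cleaned_data"),
    ("cleaned_data", "original_dataset"),
    ("paper_Y_results", "original_dataset"),
])

def propagate_credit(graph, seeds, decay=0.5):
    """Assign credit to the seed artefacts, then push a decayed share upstream
    along derivation edges (an illustrative rule, not the paper's)."""
    credit = {n: 0.0 for n in graph.nodes}
    for node, amount in seeds.items():
        credit[node] += amount
    # Topological order ensures every derived artefact is handled before its sources.
    for node in nx.topological_sort(graph):
        for source in graph.successors(node):
            credit[source] += decay * credit[node]
    return credit

print(propagate_credit(G, {"paper_X_results": 1.0, "paper_Y_results": 1.0}))
```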
Your data won’t stay smart forever: exploring the temporal dimension of (big) data... (Paolo Missier)
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership" of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but it is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and to assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
This document summarizes a project kickoff meeting for the ReComp project. The objectives of the ReComp project are to 1) investigate analytics techniques for supporting re-computation decisions, 2) research methods for assessing when re-computing an analytical process is feasible, and 3) create a decision support system to selectively recompute complex analytics processes. The expected outcomes are algorithms and a software framework to help determine when and how to recompute analyses when data or models change over time. The document outlines several challenges for the project, including estimating the impact of changes, managing different types of metadata, assessing reproducibility, and making the solutions reusable across different application cases.
Introduction to Data streaming - 05/12/2014 (Raja Chiky)
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
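One of the basic approximate techniques typically covered in such an introduction is reservoir sampling, which keeps a fixed-size uniform sample of an unbounded stream. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Each item seen so far survives in the sample with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```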
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink (Flink Forward)
1) Apache SAMOA is a platform for mining big data streams in real-time that provides algorithms, libraries and an execution framework.
2) It allows researchers to develop and compare stream mining algorithms and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection and recommendations.
3) The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations in accuracy on some datasets and in speed on others (the Hoeffding bound that drives its split decisions is sketched below).
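Split decisions in Hoeffding-tree learners, including the Vertical Hoeffding Tree, rest on the Hoeffding bound. A small sketch of that test, independent of any SAMOA API:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, value_range=1.0, delta=1e-7, n=1000):
    """Split when the best attribute's advantage exceeds the bound."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

print(should_split(gain_best=0.25, gain_second=0.10))   # True: 0.15 > ~0.09 for n=1000
```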
Artificial intelligence and data stream mining (Albert Bifet)
Big Data and Artificial Intelligence have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from data streams has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of artificial intelligence research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, industrial applications, open source tools, and current challenges of data stream mining.
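As a concrete illustration of handling concept drift, here is a minimal windowed drift check, an illustrative heuristic rather than a specific published detector such as ADWIN: it flags drift when the recent error rate departs from a reference error rate by more than a Hoeffding-style margin.

```python
import math
from collections import deque

class SimpleDriftDetector:
    """Compare the error rate of the most recent window against a reference window."""
    def __init__(self, window=200, delta=0.002):
        self.ref = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.delta = delta

    def add(self, error):            # error is 0 (correct) or 1 (mistake)
        if len(self.ref) < self.ref.maxlen:
            self.ref.append(error)
        else:
            self.recent.append(error)

    def drift(self):
        if len(self.recent) < self.recent.maxlen:
            return False
        n = len(self.recent)
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * n))
        return abs(sum(self.recent) / n - sum(self.ref) / n) > eps

detector = SimpleDriftDetector()
for e in [0] * 400 + [1] * 200:      # the concept changes and the model starts to err
    detector.add(e)
print(detector.drift())              # True once the recent window fills with errors
```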
Mining big data streams with APACHE SAMOA by Albert Bifet (J On The Beach)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with several other distributed stream processing engines such as Storm and Samza.
This document discusses multi-dimensional database modeling and big data research challenges. It begins with an overview of business intelligence and data warehousing systems. It then discusses OLAP cube design, query languages, and decision support system benchmarks. Recent experiences with adapting benchmarks like TPC-H and TPC-DS to the multi-dimensional model are summarized. Finally, several challenging research problems are outlined, including big data integration, flexible schema modeling, and scaling systems for real-time OLAP and advanced visualization.
Fast Perceptron Decision Tree Learning from Evolving Data Streams (Albert Bifet)
The document proposes using perceptron learners at the leaves of Hoeffding decision trees to improve performance on data streams. It introduces a new evaluation metric called RAM-Hours that considers both time and memory usage. The authors empirically evaluate different classifier models, including Hoeffding trees with perceptron and naive Bayes learners at leaves, on several datasets. Results show that hybrid models like Hoeffding naive Bayes perceptron trees often provide the best balance of accuracy, time and memory usage.
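The RAM-Hours metric mentioned above is simple to compute: one GB of RAM deployed for one hour equals one RAM-Hour. A quick sketch (the figures are made up for illustration):

```python
def ram_hours(memory_gb, runtime_hours):
    """One GB of RAM used for one hour = 1 RAM-Hour."""
    return memory_gb * runtime_hours

# Hypothetical comparison of two stream classifiers over the same benchmark run.
plain_hoeffding_tree = ram_hours(memory_gb=0.5, runtime_hours=2.0)   # 1.0 RAM-Hours
hybrid_nb_perceptron = ram_hours(memory_gb=0.8, runtime_hours=1.5)   # 1.2 RAM-Hours
print(plain_hoeffding_tree, hybrid_nb_perceptron)
```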
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
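Among the synopsis structures listed above, the count-min sketch is one of the simplest to show in code: it answers approximate frequency queries over a stream using a small, fixed amount of memory. A minimal sketch:

```python
import hashlib

class CountMinSketch:
    """Approximate item counts for a stream; may overestimate, never underestimates."""
    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for reading in ["sensor_a"] * 42 + ["sensor_b"] * 7:
    cms.add(reading)
print(cms.estimate("sensor_a"), cms.estimate("sensor_b"))   # ~42, ~7
```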
Mining Big Data Streams with APACHE SAMOA (Albert Bifet)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also with several other distributed stream processing engines such as Storm and Samza.
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. In this lecture we overview the mining of data streams.
MOA is a framework for online machine learning from data streams. It includes algorithms for classification, regression, clustering and frequent pattern mining that can incorporate data and update models on the fly. MOA is closely related to WEKA and includes tools for evaluating streaming algorithms on data from sensors and IoT devices. It provides an environment for designing and running experiments on streaming machine learning algorithms at massive scales.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial as well as academic prototypes, new emerging use cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large-scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Photon Source (Ian Foster)
The Advanced Photon Source (APS) at Argonne National Laboratory produces intense beams of x-rays for scientific research. Experimental data from the APS is growing dramatically due to improved detectors and a planned upgrade. This is creating data and computation challenges across the entire experimental process. Efforts are underway to accelerate the experimental feedback loop through automated data analysis, optimized data streaming, and computer-steered experiments to minimize data collection. The goal is to enable real-time insights and knowledge-driven experiments.
1. Real-time analytics of social networks can help companies detect new business opportunities by understanding customer needs and reactions in real-time.
2. MOA and SAMOA are frameworks for analyzing massive online and distributed data streams. MOA deals with evolving data streams using online learning algorithms. SAMOA provides a programming model for distributed, real-time machine learning on data streams.
3. Both tools allow companies to gain insights from social network and other real-time data to understand customers and react to opportunities.
This document discusses indexing techniques for scalable record linkage and deduplication. It introduces the problems of record linkage on large datasets that do not fit in memory and addresses corrupted data. Blocking is presented as a common approach, where similar records are grouped into blocks to reduce the number of record pairs that must be compared. The document also discusses research on developing machine learning techniques to automatically learn optimal blocking keys and blocking functions. Evaluation frameworks for record linkage are introduced. The sorted neighborhood method is described in detail, including how it creates keys, sorts data, and merges records to link them.
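A minimal sketch of the sorted neighbourhood method described here, with a made-up blocking key (first letters of surname plus postcode prefix) and a trivial stand-in comparison function:

```python
def blocking_key(record):
    # Hypothetical key: first 3 letters of surname + first 2 characters of postcode.
    return (record["surname"][:3] + record["postcode"][:2]).lower()

def similar(a, b):
    # Trivial stand-in for a real pairwise comparison function.
    return a["surname"] == b["surname"] and a["postcode"] == b["postcode"]

def sorted_neighbourhood(records, window=3):
    """Sort by blocking key, then compare only records within a sliding window."""
    ordered = sorted(records, key=blocking_key)
    matches = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                matches.append((rec["id"], ordered[j]["id"]))
    return matches

people = [
    {"id": 1, "surname": "Smith", "postcode": "NE1 7RU"},
    {"id": 2, "surname": "Smyth", "postcode": "NE1 7RU"},
    {"id": 3, "surname": "Smith", "postcode": "NE1 7RU"},
]
print(sorted_neighbourhood(people))   # finds the (1, 3) duplicate pair
```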
This document discusses mining data streams. It describes stream data as continuous, ordered, and fast changing. Traditional databases store finite data sets while stream data may be infinite. The document outlines challenges in mining stream data including processing queries and patterns continuously and with limited memory. It proposes using synopses to approximate answers within a small error range.
This document provides an overview of the Apache Hadoop ecosystem. It discusses key components like HDFS, MapReduce, YARN, Pig Latin, and performance tuning for MapReduce jobs. HDFS is introduced as the distributed file system that provides high throughput and scalability. MapReduce is described as the framework for distributed processing of large datasets across clusters. YARN is presented as an improvement over the static resource allocation in Hadoop 1.x. Pig Latin is demonstrated as a high-level language for expressing data analysis jobs. The document concludes by discussing extensions beyond MapReduce, like iterative processing and indexing approaches.
Efficient Online Evaluation of Big Data Stream Classifiers (Albert Bifet)
The evaluation of classifiers in data streams is fundamental so that poorly-performing models can be identified, and either improved or replaced by better-performing models. This is an increasingly relevant and important task as stream data is generated from more sources, in real-time, in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to effectively evaluate the performance of the methods they employ. However, there are major challenges for evaluation in a stream. Instances arriving in a data stream are usually time-dependent, and the underlying concept that they represent may evolve over time. Furthermore, the massive quantity of data also tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms are able to give predictions in real-time, but as they use a prequential setting, they build only one model, and are thus not able to compute the statistical significance of results in real-time. In this paper we propose a new evaluation methodology for big data streams. This methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing, over multiple models.
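For context, the prequential ("test-then-train") setting referred to here can be expressed in a few lines. This is a generic illustration with a trivial majority-class learner, not the evaluation methodology proposed in the paper.

```python
from collections import Counter

class MajorityClass:
    """Trivial incremental learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None
    def learn(self, x, y):
        self.counts[y] += 1

def prequential(stream, model):
    """Test-then-train: predict each instance first, then use it for training."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total

stream = [({"f": i}, "spam" if i % 3 == 0 else "ham") for i in range(3000)]
print(prequential(stream, MajorityClass()))   # prequential accuracy of the baseline
```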
Sharing massive data analysis: from provenance to linked experiment reports (Alban Gaignard)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod... (Shalin Hai-Jew)
This document summarizes a presentation on using NVivo 10 software to code and analyze qualitative and mixed methods research data. It introduces NVivo 10 as a data management and analysis tool, demonstrates how to import and code data from various sources, and shows how to visualize and analyze coded data through matrices, models, and queries. The goals are to introduce NVivo 10's capabilities and to demonstrate the process of setting up a project for qualitative or mixed methods research.
Workflow Provenance: From Modelling to Reporting (Rayhan Ferdous)
This document provides an overview of workflow provenance and proposes a programming model and system architecture for collecting and querying workflow provenance data at scale. It begins by defining provenance and its importance for big data analytics. It then classifies different types of provenance queries and proposes a taxonomy. The document outlines a programming model using object-oriented programming and domain-specific languages to automate provenance logging. It proposes parsing logs into a graph database to support fundamental provenance queries and data visualization. Finally, it discusses scaling the system and conducting further research through user studies and query optimization.
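A fundamental provenance query of the kind classified here is lineage retrieval: given a result, find every upstream artefact and task it transitively depends on. A minimal sketch over an in-memory graph; a graph database such as the one proposed would answer the same query at scale, and the node names below are hypothetical.

```python
import networkx as nx

# Toy workflow provenance: edge (a, b) means "b depends on / was derived from a".
P = nx.DiGraph()
P.add_edges_from([
    ("raw_reads", "align_task"), ("reference_genome", "align_task"),
    ("align_task", "aligned_bam"), ("aligned_bam", "call_variants_task"),
    ("call_variants_task", "vcf_output"),
])

def lineage(graph, artefact):
    """All upstream nodes (data and tasks) that the artefact transitively depends on."""
    return nx.ancestors(graph, artefact)

print(lineage(P, "vcf_output"))
```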
The document discusses the increasing scale and complexity of knowledge generation in science domains like astronomy and medicine over recent centuries. It argues that knowledge generation can be viewed as a systems problem involving many actors and processes. The document proposes a service-oriented approach using web services as an integrating framework to address challenges of scale, complexity, and distributed collaboration in e-Science. Key challenges discussed include semantics, documentation, scaling issues, and sociological factors like incentives.
Using Neo4j for exploring the research graph connections made by RD-Switchboard (amiraryani)
In this talk, Jingbo Wang (NCI) and Amir Aryani (ANDS) present the Neo4j queries that can help data managers explore the connections between datasets, researchers, grants, and publications using the graph model and the Research Data Switchboard. In addition, they discuss a paper, "Graph connections made by RD-Switchboard using NCI's metadata", presented at the Reproducible Open Science workshop in Hannover, September 2016.
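For flavour, here is a query in that spirit run through the official Neo4j Python driver; the node labels and relationship types (Dataset, Grant, Publication, RELATED_TO, FUNDED_BY) and the connection details are hypothetical placeholders, not the actual RD-Switchboard schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CONNECTIONS = """
MATCH (d:Dataset)-[:RELATED_TO]->(g:Grant)<-[:FUNDED_BY]-(p:Publication)
RETURN d.title AS dataset, g.title AS grant, p.title AS publication
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(CONNECTIONS):
        print(record["dataset"], "--", record["grant"], "--", record["publication"])

driver.close()
```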
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
2016 07 12_purdue_bigdatainomics_seandavis (Sean Davis)
Newer, faster, cheaper molecular assays are driving biomedical research. I discuss the history of biomedical data, including concepts of data sharing, hypothesis-driven vs. hypothesis-generating research, and the potential to expand our thinking on biomedical research to be much more integrated through smart, creative, and open use of technologies and more flexible, longitudinal studies.
The Role of Metadata in Reproducible Computational Research (Jeremy Leipzig)
Reproducible computational research (RCR) provides the keystone to the scientific method, packaging the transformation of raw data to published results in a manner that can be communicated to others. Developing RCR standards has been a growing concern of statisticians, data scientists, and informatics professionals. Metadata provides context and provenance to raw data, and is essential to both discovery and validation in RCR. This presentation will give an overview of emerging metadata standards in data, analysis, pipelines, tools, and publications.
This document describes Jean-Paul Calbimonte's doctoral research on enabling semantic integration of streaming data sources. The research aims to provide semantic query interfaces for streaming data, expose streaming data for the semantic web, and integrate streaming sources through ontology mappings. The approach involves ontology-based data access to streams, a semantic streaming query language, and semantic integration of distributed streams. Work done so far includes defining a language (SPARQLSTR) for querying RDF streams and enabling an engine to support streaming data sources through ontology mappings. Future work involves query optimization and quantitative evaluation.
Keynote speech - Carole Goble - Jisc Digital Festival 2015 (Jisc)
Carole Goble is a professor in the school of computer science at the University of Manchester.
In this keynote, Carole offered her insights into research data management and data centres.
RARE and FAIR Science: Reproducibility and Research Objects (Carole Goble)
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
RDA Fourth Plenary Keynote - Prof. Christine L. Borgman, Professor Presidential Chair in Information Studies at UCLA: "Data, Data, Everywhere, Nor Any Drop to Drink." Tuesday 23rd Sept 2014, Amsterdam, the Netherlands
https://rd-alliance.org/plenary-meetings/fourth-plenary/plenary4-programme.html
We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, archive, etc., that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery.
Tools for the Management of Research Data (Heinz Pampel)
Workshop "Wege in die Köpfe" of the DFG project "EWIG - Development of Workflow Components for the Long-Term Archiving of Research Data in the Geosciences" | Berlin, 03.07.2014
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Streams (Niki Pavlopoulou)
The document proposes a dynamic diverse summarization system for heterogeneous graph streams using embeddings. It aims to provide expressive, non-redundant summaries with high usability while using limited resources in dynamic smart environments. The approach uses word embeddings to create vector representations of triples, DBSCAN clustering to group similar triples, and ranking and selection to choose the top-k diverse triples for the summary in response to a diversity-aware query. The system is evaluated on a real-world dataset against baselines, measuring correctness of the summaries.
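A small sketch of the clustering-then-select step described here, using random vectors as stand-ins for the triple embeddings (the real system derives them from word embeddings) and a simple one-representative-per-cluster rule for the diverse top-k:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 16))          # stand-ins for embedded triples

# Group similar triples so redundant ones fall into the same cluster.
labels = DBSCAN(eps=2.5, min_samples=2).fit(embeddings).labels_

def diverse_top_k(vectors, labels, k=5):
    """Pick at most one representative per cluster, preferring distinct clusters."""
    chosen, seen = [], set()
    for idx in range(len(vectors)):
        if labels[idx] not in seen or labels[idx] == -1:   # -1 = noise, always distinct
            chosen.append(idx)
            seen.add(labels[idx])
        if len(chosen) == k:
            break
    return chosen

print(diverse_top_k(embeddings, labels))
```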
This document discusses research objects (ROs) and their role in reproducible science. It makes three key points:
1. Publications should convince readers of validity through reproducible results, but current systems do not fully facilitate reproducibility. ROs can address this by explicitly representing methods used.
2. Reproducibility reinforces results and is a key factor in scientific discovery. ROs provide a reproducible representation of methods.
3. ROs bundle together essential resources from a computational study, such as data, results, methods, people involved, and annotations for understanding, interpretation, and reuse. They support the full experimental lifecycle from problem definition to publication.
Similar to The lifecycle of reproducible science data and what provenance has got to do with it
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Health Records (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
please see paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunities... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance are in fact in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started exploring the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
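An illustrative, much simplified version of the input/output comparison idea, for a single dataframe transformation; this is a sketch of the principle, not the DPDS implementation.

```python
import pandas as pd

def capture_provenance(df_in, transform, name):
    """Run a dataframe transformation and record which rows and columns were
    dropped or added, as coarse-grained provenance for that step."""
    df_out = transform(df_in)
    return df_out, {
        "activity": name,
        "rows_in": len(df_in), "rows_out": len(df_out),
        "rows_removed": sorted(set(df_in.index) - set(df_out.index)),
        "columns_added": sorted(set(df_out.columns) - set(df_in.columns)),
        "columns_removed": sorted(set(df_in.columns) - set(df_out.columns)),
    }

df = pd.DataFrame({"age": [34, None, 51], "city": ["Leeds", "York", None]})
cleaned, prov = capture_provenance(df, lambda d: d.dropna(), "drop_missing_values")
print(prov)
```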
Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations (Paolo Missier)
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare (Paolo Missier)
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... (Paolo Missier)
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
ReComp: optimising the re-execution of analytics pipelines in response to changes... (Paolo Missier)
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University (Paolo Missier)
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
Efficient Re-computation of Big Data Analytics Processes in the Presence of Changes (Paolo Missier)
This document discusses efficient re-computation of big data analytics processes when changes occur. It presents the ReComp framework which uses process execution history and provenance to selectively re-execute only the relevant parts of a process that are impacted by changes, rather than fully re-executing the entire process from scratch. This approach estimates the impact of changes using type-specific difference functions and impact estimation functions. It then identifies the minimal subset of process fragments that need to be re-executed based on change impact analysis and provenance traces. The framework is able to efficiently re-compute complex processes like genomics analytics workflows in response to changes in reference databases or other dependencies.
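The overall ReComp loop can be pictured with a small sketch; the fragments, input names, versions, and the difference and impact functions below are deliberately trivial placeholders for the type-specific functions described in the work.

```python
# Hypothetical process: fragments consume named inputs and can be re-run independently.
fragments = {
    "align":    {"inputs": {"reference_genome"}},
    "annotate": {"inputs": {"variant_db"}},
    "report":   {"inputs": {"phenotype_ontology"}},
}

def diff(old, new):
    """Type-specific difference function (placeholder): which inputs changed?"""
    return {name for name in new if old.get(name) != new[name]}

def impacted(changed_inputs):
    """Impact estimation (placeholder): fragments touching any changed input."""
    return {f for f, spec in fragments.items() if spec["inputs"] & changed_inputs}

old_versions = {"reference_genome": "GRCh37", "variant_db": "v104", "phenotype_ontology": "2019-01"}
new_versions = {"reference_genome": "GRCh37", "variant_db": "v110", "phenotype_ontology": "2019-01"}

changed = diff(old_versions, new_versions)
print("re-execute only:", impacted(changed))   # {'annotate'} rather than the whole pipeline
```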
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf (Techgropse Pvt. Ltd.)
In this blog post, we'll delve into the intersection of AI and app development in Saudi Arabia, focusing on the food delivery sector. We'll explore how AI is revolutionizing the way Saudi consumers order food, how restaurants manage their operations, and how delivery partners navigate the bustling streets of cities like Riyadh, Jeddah, and Dammam. Through real-world case studies, we'll showcase how leading Saudi food delivery apps are leveraging AI to redefine convenience, personalization, and efficiency.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx (SitimaJohn)
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
CAKE: Sharing Slices of Confidential Data on Blockchain (Claudio Di Ciccio)
Presented at the CAiSE 2024 Forum, Intelligent Information Systems, June 6th, Limassol, Cyprus.
Synopsis: Cooperative information systems typically involve various entities in a collaborative process within a distributed environment. Blockchain technology offers a mechanism for automating such processes, even when only partial trust exists among participants. The data stored on the blockchain is replicated across all nodes in the network, ensuring accessibility to all participants. While this aspect facilitates traceability, integrity, and persistence, it poses challenges for adopting public blockchains in enterprise settings due to confidentiality issues. In this paper, we present a software tool named Control Access via Key Encryption (CAKE), designed to ensure data confidentiality in scenarios involving public blockchains. After outlining its core components and functionalities, we showcase the application of CAKE in the context of a real-world cyber-security project within the logistics domain.
Paper: https://doi.org/10.1007/978-3-031-61000-4_16
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
The lifecycle of reproducible science data and what provenance has got to do with it
1. The lifecycle of reproducible science data and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and
the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester
5. Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependency evolution on workflow execution results
Approach: compare the provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. “Provenance and data differencing for workflow reproducibility analysis.” Concurrency and Computation: Practice and Experience, 2013.
8. (Yet another) Data Lifecycle picture
[Diagram: a process specification spec(P) is packaged and published with data D and provenance prov(D); it is then searched, discovered, and deployed into an environment Env, possibly as an evolved version P' with dependencies dep'; re-computation produces D' with prov(D'); finally (P, P', D, D') are compared.]
Tools annotated on the diagram: Research Objects (packaging); DataONE federated research data repositories; Matlab provenance recorder (DataONE); TOSCA-based virtualisation; YesWorkflow; workflow provenance; NoWorkflow; ReproZip; Pdiff (provenance differencing).
9. You are here
Data packaging: Research Objects
DataONE: data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process virtualisation using TOSCA
Provenance recorders
• Workflow provenance: Taverna, eScience Central, Kepler, Pegasus, VisTrails, …
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow differences
12. Manifest metadata
Manifest construction:
• Identification – id, title, creator, status, …
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest description:
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
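To make the manifest structure concrete, here is a minimal sketch that assembles the construction blocks above (identification, aggregates, annotations) into a Python dictionary and serialises it to JSON. The field names only loosely echo the Research Object Bundle manifest conventions and are illustrative, not normative; the title, creator, and file paths are hypothetical.

```python
import json
from datetime import datetime, timezone

# Illustrative manifest: identification fields, aggregated resources, and
# annotations about those resources. Field names are indicative only.
manifest = {
    "id": "ro-example-001",                              # hypothetical identifier
    "title": "Soil map processing study",                # hypothetical title
    "creator": "Alice Researcher",                       # hypothetical creator
    "status": "draft",
    "createdOn": datetime.now(timezone.utc).isoformat(),
    "aggregates": [
        {"uri": "data/input_grid.nc",  "mediatype": "application/x-netcdf"},
        {"uri": "scripts/process.m",   "mediatype": "text/plain"},
        {"uri": "provenance/trace.provn",
         "mediatype": "text/provenance-notation"},
    ],
    "annotations": [
        {"about": "scripts/process.m",
         "content": "annotations/process-description.ttl"},
    ],
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```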
13. You are here (agenda slide, same outline as slide 9)
14. Components for a flexible, scalable, sustainable network
Cyberinfrastructure component 2: Member Nodes (www.dataone.org/member-nodes)
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
15. Cyberinfrastructure
[Diagram of coordinating-node services: data services (extraction, sub-setting, etc.), provenance, and semantics-enabled discovery via ontology annotation; science data, science metadata, and system metadata are replicated, indexed, and exposed through a search API.]
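As an illustration of the search API sketched above, the snippet below queries a DataONE coordinating node's Solr index for metadata records mentioning "grass" (the example search that reappears in the Editor's Notes). The endpoint path and the index field names are assumptions based on the DataONE v2 query API; check the current API documentation before relying on them.

```python
import requests

# Assumed endpoint of a DataONE coordinating node's Solr query service (v2 API).
SOLR_URL = "https://cn.dataone.org/cn/v2/query/solr/"

params = {
    "q": "abstract:grass AND formatType:METADATA",  # field names are assumptions
    "fl": "identifier,title,author",
    "rows": 10,
    "wt": "json",
}

resp = requests.get(SOLR_URL, params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("identifier"), "|", doc.get("title"))
```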
17. Use provenance for transparency and reproducibility
What input data went into this study? What methods were used? … with what parameter settings, calibrations, …? Can we trust the data and methods?
Provenance (lineage): tracking the origin and processing history of data supports trust and data quality assessment, and acts as an audit trail for attribution and credit.
Discovery of data, methodologies, experiments
18. W3C has published the ‘PROV’ standard
[PROV core diagram: the classes Entity, Activity, and Agent, connected by the relations used, wasGeneratedBy, wasAttributedTo, and wasAssociatedWith.]
See w3.org/TR/prov-o/
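The following sketch records exactly these kinds of statements using the third-party Python prov package, which implements the W3C PROV data model. The library choice and the example identifiers are assumptions of this write-up; the slides do not prescribe a particular implementation.

```python
from prov.model import ProvDocument

# Build a tiny PROV document: one activity reads a raw image and generates a
# corrected image; both the activity and its output are linked to an agent.
doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")   # hypothetical namespace

doc.entity("ex:raw_image")
doc.entity("ex:corrected_image")
doc.activity("ex:image_correction")
doc.agent("ex:alice")

doc.used("ex:image_correction", "ex:raw_image")                 # activity used input
doc.wasGeneratedBy("ex:corrected_image", "ex:image_correction")
doc.wasAssociatedWith("ex:image_correction", "ex:alice")        # who ran it
doc.wasAttributedTo("ex:corrected_image", "ex:alice")           # who gets credit

print(doc.get_provn())          # PROV-N text serialisation
doc.serialize("trace.json")     # JSON serialisation, e.g. for packaging
```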
24. DataONE data packages: provenance inside!
[Diagram of a data package: an OAI-ORE resource map carrying a ProvONE trace aggregates science metadata, science data, figures, and software, each with its own system metadata.]
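A minimal sketch of what a resource map like the one pictured might look like, built with rdflib: an OAI-ORE aggregation that lists the package members, plus one PROV statement linking a derived product to its source. All identifiers are hypothetical placeholders, and a real DataONE package would use ProvONE workflow terms and full system metadata rather than this toy graph.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE  = Namespace("http://www.openarchives.org/ore/terms/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ore", ORE)
g.bind("prov", PROV)
g.bind("dcterms", DCTERMS)

# All identifiers below are hypothetical placeholders.
resource_map = URIRef("urn:uuid:resource-map-001")
aggregation  = URIRef("urn:uuid:aggregation-001")
science_meta = URIRef("urn:uuid:science-metadata-001")
science_data = URIRef("urn:uuid:science-data-001")
figure       = URIRef("urn:uuid:figure-001")
software     = URIRef("urn:uuid:script-001")

# The resource map describes an aggregation that lists the package members.
g.add((resource_map, RDF.type, ORE.ResourceMap))
g.add((resource_map, ORE.describes, aggregation))
g.add((aggregation, RDF.type, ORE.Aggregation))
for member in (science_meta, science_data, figure, software):
    g.add((aggregation, ORE.aggregates, member))

g.add((science_meta, DCTERMS.title, Literal("Science metadata record")))

# One provenance statement carried inside the package: the figure was derived
# from the science data.
g.add((figure, PROV.wasDerivedFrom, science_data))

print(g.serialize(format="turtle"))
```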
28. MATLAB, R, Python … scripts
YesWorkflow (YW): scripts as prospective provenance
Script + @YW annotations: bridging workflow-land and trace-land
Combine provenance:
• Prospective (workflow)
• Retrospective (runtime trace)
• Reconstructed (logs, files, …)
Users can query their own data and provenance prior to sharing
Incentive: accelerate work! “Provenance for Self”
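For illustration, here is a hypothetical Python script carrying YesWorkflow-style comment annotations: a @BEGIN/@END block with @IN, @OUT, and @PARAM declarations and URI templates. Only the annotation keywords follow the YW convention; the script body, file names, and block name are invented for the example (the image name reappears in the query discussed in the Editor's Notes).

```python
# @BEGIN correct_image  @DESC hypothetical image-correction step
# @PARAM energy_level
# @IN raw_image        @URI file:raw/{sample}_{energy_level}ev_{run}.img
# @OUT corrected_image @URI file:corrected/{sample}_{energy_level}ev_{run}.img

import shutil

def correct(raw_path, corrected_path, energy_level):
    # Placeholder for the real correction: simply copy the file through.
    shutil.copy(raw_path, corrected_path)

if __name__ == "__main__":
    correct("raw/DRT322_11000ev_028.img",
            "corrected/DRT322_11000ev_028.img",
            energy_level=11000)

# @END correct_image
```

YesWorkflow parses these comments to build the prospective dataflow model, which can then be combined with the retrospective trace recorded at run time.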
29. Transitive credit
When a user cites a publication, we know:
• which data produced it
• what software produced it
• what was derived from it
• who to credit down the attribution stack
Katz & Smith. “Implementing Transitive Credit with JSON-LD.” arXiv:1407.5117, 2014.
Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Digital Curation Conference (IDCC), Amsterdam, 2016. (Best Paper Award)
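The idea of credit flowing down the attribution stack can be sketched as follows: each product declares weighted contributors and weighted upstream products, and the credit assigned to a cited product is propagated backwards by multiplying the weights along derivation edges. This is an illustration in the spirit of Katz & Smith's transitive credit, not their published algorithm; the graph, names, and weights are invented.

```python
from collections import defaultdict

# Hypothetical credit map: each product names weighted contributors and the
# weighted upstream products it was derived from.
credit_map = {
    "paper":      {"contributors": {"bob": 0.6},
                   "derived_from": {"dataset_v2": 0.5}},
    "dataset_v2": {"contributors": {"alice": 0.8},
                   "derived_from": {"raw_data": 0.25}},
    "raw_data":   {"contributors": {"survey_team": 1.0},
                   "derived_from": {}},
}

def propagate_credit(product, weight=1.0, totals=None):
    """Recursively push `weight` units of credit down the attribution stack."""
    if totals is None:
        totals = defaultdict(float)
    entry = credit_map[product]
    for person, share in entry["contributors"].items():
        totals[person] += weight * share
    for upstream, share in entry["derived_from"].items():
        propagate_credit(upstream, weight * share, totals)
    return totals

print(dict(propagate_credit("paper")))
# -> {'bob': 0.6, 'alice': 0.4, 'survey_team': 0.125}
```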
30. Provenance today: important but hard
Climate Change Impacts in the United States: the U.S. National Climate Assessment, U.S. Global Change Research Program
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
32. Provenance in action
Yaxing’s script with its inputs and output products (YesWorkflow model)
Christopher uses Yaxing’s outputs as inputs for his script
Christopher’s results can be traced back all the way to Yaxing’s input
33. You are here (agenda slide, same outline as slide 9)
38. You are here (agenda slide, same outline as slide 9)
39. Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS.
Assumption: workflow WFj (the new version) runs to completion, so it produces a new provenance trace; however, its results may be dysfunctional (divergent) relative to WFi (the original).
Example: only the input data changes: d != d’, WFj == WFi.
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
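A minimal sketch of the divergence-analysis idea: given two traces that record, for each data item, the activity that generated it and the inputs that activity used, walk backwards from a pair of divergent outputs to the earliest point where the traces disagree (diverging inputs, diverging activities, or a non-deterministic step). This illustrates the general approach rather than the PDIFF algorithm itself; the trace encoding and function names are assumptions.

```python
# Toy provenance traces: each maps a data item to the activity that generated
# it and the inputs that activity used (a simplified stand-in for PROV traces).

def earliest_divergence(trace_a, trace_b, item_a, item_b):
    """Walk both traces backwards in lock-step and report the first mismatch."""
    if item_a == item_b:
        return None                      # identical data items on this branch
    gen_a, gen_b = trace_a.get(item_a), trace_b.get(item_b)
    if gen_a is None or gen_b is None:
        return ("diverging inputs", item_a, item_b)   # reached source data
    if gen_a["activity"] != gen_b["activity"]:
        return ("diverging activities", gen_a["activity"], gen_b["activity"])
    # Same activity but different outputs: look further back at its inputs.
    for in_a, in_b in zip(gen_a["inputs"], gen_b["inputs"]):
        cause = earliest_divergence(trace_a, trace_b, in_a, in_b)
        if cause is not None:
            return cause
    return ("non-deterministic or stateful activity", gen_a["activity"])

# Two runs of the same workflow where only the input data changed (d vs d').
run_i = {"out":  {"activity": "analyse", "inputs": ["d"]}}
run_j = {"out'": {"activity": "analyse", "inputs": ["d'"]}}
print(earliest_divergence(run_i, run_j, "out", "out'"))
# -> ('diverging inputs', 'd', "d'")
```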
45. References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332.
Process virtualisation using TOSCA:
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146.
NoWorkflow, provenance recording for Python:
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014.
Pdiff, provenance differencing for understanding workflow differences:
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.
Editor's Notes
Packaging – physical and logical containers
Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources
http://www.openarchives.org/ore/
Uses a Resource Map to describe the aggregated resources
Proxies allow for statements about the resources within the aggregation
Capturing context and viewpoints
Several concrete serialisations
RDF/XML, Atom, RDFa
Open Annotation specification is a community developed data model for annotation of web resources
http://www.openannotation.org/spec/core/
Developed by the W3C Open Annotation Community Group
Allows for “stand-off” annotations
Annotation as a first class citizen
Developed to fit with Web Architecture
How do you make a research object? Well, gather your resources, describe them in the manifest.
Different types of Containers can be used to transfer and package the Research Object;
The Research Object Bundle is a structured ZIP file format… but more specific and more general formats are also used, such as
Docker images (a bit low-level, capturing the whole execution environment)
BagIt (a digital archiving format that is commonly used by libraries), or
Simply existing Web resources (which may be subject to change).
You can register and archive research objects in domain-specific repositories like FAIRDOM’s SEEK (systems biology models), FARR Commons CKAN (public health medical data), technology-specific repositories (myExperiment for workflow-centric research objects), or generic data repositories you have probably already heard of, like Zenodo and Figshare.
Linked Resource Model very relevant
Dublin Core Application Profile
Pericles Linked Resource Model
Identification includes properties for identifying the “mime type” annotation profile of the RO
DataONE provenance products and tools (removed from the deck as redundant with the next slide):
• New ProvONE model, extending the W3C PROV standard for workflows
• New Matlab provenance recorder (ITK also includes R, Python recorders)
• DataONE Web UI integration: the UI is “provenance-aware”
These statements are the low-level pieces of information that we keep track of.
We want to enhance analysis software that scientists are already familiar with, so for our first round we are working on a Matlab toolbox and an R library. In conjunction with Bertram, Paolo, and other colleagues, we are incorporating the YesWorkflow Java library into our Matlab toolbox to capture ‘prospective’ provenance.
Use tools, concepts scientists are already familiar with
Query 3: Where is the raw image corresponding to corrected image DRT322_11000ev_028.img
Scientist: Look at the image files nested within the raw directory. Find the image file that contains the values DRT322, 11000, and 028 in the file access path.
YW: Extract the URI template variable names and values from the path to DRT322_11000ev_028.img output by the port named corrected_image, look at the paths for all files output by the raw_image port, and return the file whose path includes template variables with names and values matching those for DRT322_11000ev_028.img
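The port-and-template matching described in this note can be sketched in a few lines of Python: turn the @URI templates of the corrected_image and raw_image ports into regular expressions with named groups, extract the variable values from the corrected image's path, and return the raw file whose values match. The templates and helper names are hypothetical.

```python
import re

# Assumed URI templates for the two ports, in the {variable} style used by
# YesWorkflow @URI annotations.
CORRECTED_TEMPLATE = "corrected/{sample}_{energy}ev_{run}.img"
RAW_TEMPLATE       = "raw/{sample}_{energy}ev_{run}.img"

def template_to_regex(template):
    """Turn a {variable}-style URI template into a regex with named groups."""
    regex = ""
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>[^_/]+)"
        else:
            regex += re.escape(part)
    return re.compile(regex + "$")

def find_raw_for_corrected(corrected_path, raw_paths):
    """Return the raw image whose template variables match the corrected image's."""
    wanted = template_to_regex(CORRECTED_TEMPLATE).match(corrected_path).groupdict()
    raw_re = template_to_regex(RAW_TEMPLATE)
    for path in raw_paths:
        m = raw_re.match(path)
        if m and m.groupdict() == wanted:
            return path
    return None

raw_files = ["raw/DRT322_11000ev_027.img", "raw/DRT322_11000ev_028.img"]
print(find_raw_for_corrected("corrected/DRT322_11000ev_028.img", raw_files))
# -> raw/DRT322_11000ev_028.img
```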
In the DataONE Search, we can search for ‘grass’, and two data packages show up. The Yaxing Wei (Alice) soil map processing workflow and the Christopher Schwalm (Bob) analysis workflow both show that they have provenance information associated with their data packages (via the icon in the search record). We next choose Wei’s data package to see the details. This can be seen at https://search-sandbox-2.test.dataone.org.
Viewing the Wei soil processing workflow we see on the left that the Matlab script (C3_C4_map_present_with_comments.m) has 25 inputs. It also has 6 outputs on the right. The top three outputs are the YesWorkflow diagrams (dataflow, processflow, combined). The bottom three are the NetCDF data files that represent three different world map grids of percentage of grass types (C3 grass fraction, C4 grass fraction, and total grass fraction). The script can be downloaded with the Download button in the center. This can be accessed at https://search-sandbox-2.test.dataone.org/#view/metadata_e859d2dd-c5e6-4ec6-892f-1b00bb6f8f65.xml. Bertram, if you want to show the YesWorkflow diagram (combined) for this run showing how monthly air and precipitation values are used as the inputs, the combined diagram can be accessed from this page, or directly from https://cn-sandbox-2.test.dataone.org/cn/v2/resolve/d87e1a6a-1a78-4f96-bba8-cb74ac2b1efb