This is a keynote I gave at the polyweb workshop on the state of the art in data science reproducibility. In the first part, I review tools that have been developed over the last few years. In the second part, I focus on proposals I have been involved in to facilitate workflow reproducibility and preservation.
Results may vary: Collaborations Workshop, Oxford 2014 - Carole Goble
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
Software Metadata: Describing "dark software" in GeoSciences - dgarijo
This document discusses describing "dark software" or unshared scientific software in geosciences. It proposes using the OntoSoft ontology to capture standardized metadata about scientific software. This would allow software to be more discoverable, reusable and reproducible. The document outlines the types of metadata captured by OntoSoft and demonstrates how it can be used to describe software and facilitate search and comparison of different tools.
PhD Thesis: Mining abstractions in scientific workflows - dgarijo
Slides of the presentation for my PhD dissertation. I strongly recommend downloading the slides, as they have animations that are easier to see in PowerPoint. The abstract of the thesis is as follows: "Scientific workflows have been adopted in the last decade to represent the computational methods used in in silico scientific experiments and their associated research products. Scientific workflows have demonstrated to be useful for sharing and reproducing scientific experiments, allowing scientists to visualize, debug and save time when re-executing previous work. However, scientific workflows may be difficult to understand and reuse. The large amount of available workflows in repositories, together with their heterogeneity and lack of documentation and usage examples may become an obstacle for a scientist aiming to reuse the work from other scientists. Furthermore, given that it is often possible to implement a method using different algorithms or techniques, seemingly disparate workflows may be related at a higher level of abstraction, based on their common functionality. In this thesis we address the issue of reusability and abstraction by exploring how workflows relate to one another in a workflow repository, mining abstractions that may be helpful for workflow reuse. In order to do so, we propose a simple model for representing and relating workflows and their executions, we analyze the typical common abstractions that can be found in workflow repositories, we explore the current practices of users regarding workflow reuse and we describe a method for discovering useful abstractions for workflows based on existing graph mining techniques. Our results expose the common abstractions and practices of users in terms of workflow reuse, and show how our proposed abstractions have potential to become useful for users designing new workflows".
Capturing Context in Scientific Experiments: Towards Computer-Driven Science - dgarijo
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. To improve this situation, mechanisms are needed to capture the exact details and context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments using scientific workflows and machine-readable representations. Thanks to this approach, experiment results are described unambiguously, have a clear trace of their creation process and include pointers to the sources used to generate them. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of experimental context, and for involving scientists in the process of curating and creating abstractions on top of available research metadata.
Sharing massive data analysis: from provenance to linked experiment reports - Gaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
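The reasoning in point (2) — walking PROV-style wasGeneratedBy/used relations to explain how a result was produced — can be sketched with plain triples. All artifact names below are invented for illustration; a real system would use an RDF store and the PROV vocabulary rather than Python tuples.

```python
# Minimal sketch of PROV-style provenance reasoning over
# (subject, relation, object) triples. Names are illustrative.
PROV = [
    ("figure.png",    "wasGeneratedBy",     "plot_step"),
    ("plot_step",     "used",               "stats.csv"),
    ("stats.csv",     "wasGeneratedBy",     "analysis_step"),
    ("analysis_step", "used",               "raw_reads.fastq"),
    ("analysis_step", "wasAssociatedWith",  "alice"),
]

def lineage(artifact, triples=PROV):
    """Walk wasGeneratedBy/used edges to collect every upstream input."""
    inputs = set()
    for s, rel, o in triples:
        if s == artifact and rel == "wasGeneratedBy":
            # o is the activity that generated the artifact; gather what it used
            for s2, rel2, o2 in triples:
                if s2 == o and rel2 == "used":
                    inputs.add(o2)
                    inputs |= lineage(o2, triples)
    return inputs

print(sorted(lineage("figure.png")))  # every dataset figure.png depends on
```

The same query expressed over published linked data would let a reader of the experiment report trace any figure back to its raw inputs.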
Reproducibility and Scientific Research: why, what, where, when, who, how - Carole Goble
This document discusses the importance of reproducibility in scientific research. It makes three key points:
1. For results to be considered valid, scientific publications should provide clear descriptions of methods and protocols so that other researchers can successfully repeat and extend the work.
2. Many factors can undermine reproducibility, such as publication pressures, poor training, disorganization, and outright fraud. Ensuring reproducible research requires transparency across experimental designs, data, software, and computational workflows.
3. Achieving reproducible science is challenging and poorly incentivized due to the resources and time required to prepare materials for independent verification. Overcoming these issues will require collective effort across the research community.
RARE and FAIR Science: Reproducibility and Research Objects - Carole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material except maybe the IT Crowd screengrab reusable
Aspects of Reproducibility in Earth Science - Raul Palma
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
Towards Incidental Collaboratories; Research Data Services - Anita de Waard
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o... - Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from a field where results are post hoc "made reproducible" to one where they are "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
This document provides a summary of Jennifer Shelton's background and experience in bioinformatics. It outlines her education in biology and post-baccalaureate studies. Her research focuses on de novo genome and transcriptome assembly using next-generation sequencing and BioNano Genomics data. She has extensive experience developing bioinformatics workflows and teaching coding skills through workshops. Currently she is the Bioinformatics Core Outreach Coordinator at Kansas State University where she continues her research and outreach efforts.
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
This document discusses provenance and research objects. It introduces key concepts from the PROV model including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and context. Finally, it provides an example of capturing provenance from workflow runs using the Common Workflow Language and storing it in a research object bundle.
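The bundling step described above — packaging a workflow run's files together with a manifest — can be sketched as a ZIP whose `.ro/manifest.json` lists the aggregated resources, following the Research Object Bundle convention. The file names and creator below are invented for illustration, not taken from the slides:

```python
import json
import zipfile

# Hypothetical contents of one workflow run: definition, inputs, PROV trace.
files = {
    "workflow.cwl":   "# the workflow definition",
    "inputs.yml":     "reads: raw_reads.fastq",
    "provenance.ttl": "# PROV trace of the run",
}

# RO Bundle-style manifest: a JSON-LD document aggregating the resources.
manifest = {
    "@context": "https://w3id.org/bundle/context",
    "createdBy": {"name": "alice"},
    "aggregates": [{"uri": name} for name in files],
}

with zipfile.ZipFile("experiment.bundle.zip", "w") as z:
    z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    for name, body in files.items():
        z.writestr(name, body)
```

A consumer can then open the bundle, read the manifest, and know exactly which files constitute the experiment without guessing from directory layout.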
Jean-Claude Bradley presents on "Peer Review and Science2.0: blogs, wikis and social networking sites" as a guest lecturer for the “Peer Review Culture in Scholarly Publication and Grantmaking” course at Drexel University. The main thrust of the presentation is that peer review alone cannot cope with the increasing flood of scientific information being generated and shared. Arguments are made to show that providing sufficient proof for scientific findings does scale, and that it weakens the tragedy of the trusted-source cascade.
From peer-reviewed to peer-reproduced: a role for research objects in scholar... - Alejandra Gonzalez-Beltran
The document discusses how research objects and computational workflows can help capture experimental processes and reproduce findings in life sciences research. It describes a computational experiment evaluating three genome assembly algorithms on bacterial, insect, and human genomes. Key steps included identifying resources, designing the experimental workflow, running the experiment in Galaxy, and publishing results as nanopublications aggregated in a research object to enable verification and reuse. The goal is to improve reproducibility by making experimental descriptions and reviews more structured and transparent.
What is Reproducibility? The R* brouhaha (and how Research Objects can help) - Carole Goble
presented at 1st First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
Reproducibility of model-based results: standards, infrastructure, and recogn... - FAIRDOM
Written and presented by Dagmar Waltemath (University of Rostock) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
This document discusses reproducible research and provides guidance on how to conduct research in a reproducible manner. It covers:
1. The importance of reproducible research due to large datasets, computational analyses, and the potential for human error. Ensuring reproducibility requires new expertise and infrastructure.
2. Key aspects of reproducible research include data management plans, version control, use of file formats and software/tools that allow reproducibility, and publishing data and code to allow others to replicate results.
3. Reproducible research benefits the scientific community by increasing transparency and allows researchers to re-analyze their own data in the future. Journals and funders are increasingly requiring reproducibility.
The document discusses various open education tools for use in chemistry courses, including screencasting lectures, wikis, open notebook science, and games. It provides examples of these tools being used, such as recording lectures, organizing course content on wikis, making research data and lab notebooks publicly available, and games for learning chemistry concepts. Student response to these tools is also discussed, with most students finding value in access to recorded lectures and using wikis for assignments, and appreciation for rapid feedback and learning documentation skills through open notebooks.
Jean-Claude Bradley presents on "Technology and Students - Mix, Match or Miss?" at the Villanova Teaching and Learning Strategies Symposium on May 13, 2010. Topics covered include screencasting, wikis, games and Second Life, with a particular focus on student response to these technologies.
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to cancer biomarkers, genomics, and proteomics. His skills include programming, mathematics, data science, and laboratory techniques. He is currently a bioinformatics research assistant at Stanford University School of Medicine.
This document discusses provenance and research objects. It provides an overview of the key concepts in provenance including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and metadata. Finally, it describes how provenance information can be captured from Common Workflow Language (CWL) workflow runs to create research object bundles.
Recommendations for infrastructure and incentives for open science, presented at the Research Data Alliance 6th Plenary. Presenter: William Gunn, Director of Scholarly Communications for Mendeley.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
Data Integration vs Transparency: Tackling the tension - Paul Groth
Paul Groth discussed the tension between data integration and transparency. He explained that while integrating data from multiple sources is important for analysis, it can reduce transparency about where the data came from. Provenance, or recording the origin and process of data, was presented as a solution. Groth outlined challenges in provenance collection and proposed techniques like taint tracking and record and replay from software security to help automate provenance capture while data is integrated and analyzed.
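The taint-tracking idea mentioned above — borrowed from software security to automate provenance capture during integration — amounts to values carrying the set of sources they were derived from, with operations merging those sets automatically. This is only an illustrative sketch of the idea, not Groth's implementation; the class and source names are invented:

```python
# Taint-style provenance propagation: each value carries the set of
# sources it was derived from; arithmetic merges the sets automatically.
class Tainted:
    def __init__(self, value, sources):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        # Integrating two values unions their provenance labels.
        return Tainted(self.value + other.value,
                       self.sources | other.sources)

a = Tainted(10, {"census.csv"})
b = Tainted(32, {"survey.csv"})
c = a + b  # the integration step
print(c.value, sorted(c.sources))  # 42 ['census.csv', 'survey.csv']
```

The integrated result stays usable for analysis, yet the question "where did this number come from?" remains answerable — easing the tension the talk describes.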
This document discusses research objects and scientific workflows. It introduces research objects as a way to aggregate all elements needed to understand a research investigation, including datasets, results, experiments, and provenance. Scientific workflows are presented as tools for automating data-intensive scientific activities, with prospective and retrospective provenance capturing the intended and actual methods. The document outlines an approach to summarizing complex workflows using semantic annotations of workflow motifs and reduction primitives like collapse and eliminate. This distills provenance traces for improved understanding and querying.
A Sightseeing Tour of Prov and Some of its Extensions - Khalid Belhajjame
This document provides an overview of the PROV provenance model and some of its extensions. It discusses the motivation for provenance, the history and development of the PROV model, its key concepts of entities, activities, and agents. It also describes extensions like ProvONE and PAV that build upon PROV to model workflow and scientific provenance.
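As a concrete taste of the model the tour covers, here is a minimal PROV-N fragment relating one entity, one activity and one agent (the identifiers are invented for illustration):

```
document
  prefix ex <http://example.org/>
  entity(ex:results)
  activity(ex:analysis)
  agent(ex:alice)
  wasGeneratedBy(ex:results, ex:analysis, -)
  wasAssociatedWith(ex:analysis, ex:alice, -)
endDocument
```

Extensions such as ProvONE layer workflow-specific terms (ports, channels, executions) on top of exactly this core.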
Towards Incidental Collaboratories; Research Data ServicesAnita de Waard
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle if not practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics view point, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from one where results are post-hoc "made reproducible", to pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
This document provides a summary of Jennifer Shelton's background and experience in bioinformatics. It outlines her education in biology and post-baccalaureate studies. Her research focuses on de novo genome and transcriptome assembly using next-generation sequencing and BioNano Genomics data. She has extensive experience developing bioinformatics workflows and teaching coding skills through workshops. Currently she is the Bioinformatics Core Outreach Coordinator at Kansas State University where she continues her research and outreach efforts.
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
This document discusses provenance and research objects. It introduces key concepts from the PROV model including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and context. Finally, it provides an example of capturing provenance from workflow runs using the Common Workflow Language and storing it in a research object bundle.
Jean-Claude Bradley presents on "Peer Review and Science2.0: blogs, wikis and social networking sites" as a guest lecturer for the “Peer Review Culture in Scholarly Publication and Grantmaking” course at Drexel University. The main thrust of the presentation is that peer review alone is not capable of coping with the increasing flood of scientific information being generated and shared. Arguments are made to show that providing sufficient proof for scientific findings does scale and weakens the tragedy of the trusted source cascade.
From peer-reviewed to peer-reproduced: a role for research objects in scholar...Alejandra Gonzalez-Beltran
The document discusses how research objects and computational workflows can help capture experimental processes and reproduce findings in life sciences research. It describes a computational experiment evaluating three genome assembly algorithms on bacterial, insect, and human genomes. Key steps included identifying resources, designing the experimental workflow, running the experiment in Galaxy, and publishing results as nanopublications aggregated in a research object to enable verification and reuse. The goal is to improve reproducibility by making experimental descriptions and reviews more structured and transparent.
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
presented at 1st First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
Reproducibility of model-based results: standards, infrastructure, and recogn...FAIRDOM
Written and presented by Dagmar Waltemath (University of Rostock) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
This document discusses reproducible research and provides guidance on how to conduct research in a reproducible manner. It covers:
1. The importance of reproducible research due to large datasets, computational analyses, and the potential for human error. Ensuring reproducibility requires new expertise and infrastructure.
2. Key aspects of reproducible research include data management plans, version control, use of file formats and software/tools that allow reproducibility, and publishing data and code to allow others to replicate results.
3. Reproducible research benefits the scientific community by increasing transparency and allows researchers to re-analyze their own data in the future. Journals and funders are increasingly requiring reproducibility.
The document discusses various open education tools for use in chemistry courses, including screencasting lectures, wikis, open notebook science, and games. It provides examples of these tools being used, such as recording lectures, organizing course content on wikis, making research data and lab notebooks publicly available, and games for learning chemistry concepts. Student response to these tools is also discussed, with most students finding value in access to recorded lectures and using wikis for assignments, and appreciation for rapid feedback and learning documentation skills through open notebooks.
Jean-Claude Bradley presents on "Technology and Students - Mix, Match or Miss?" at the Villanova Teaching and Learning Strategies Symposium on May 13, 2010. Topics covered include screencasting, wikis, games and Second Life, with a particular focus on student response to these technologies.
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to cancer biomarkers, genomics, and proteomics. His skills include programming, mathematics, data science, and laboratory techniques. He is currently a bioinformatics research assistant at Stanford University School of Medicine.
This document discusses provenance and research objects. It provides an overview of the key concepts in provenance including entities, activities, and agents. It explains how research objects can bundle digital resources from a scientific experiment along with provenance and metadata. Finally, it describes how provenance information can be captured from Common Workflow Language (CWL) workflow runs to create research object bundles.
Recommendations for infrastructure and incentives for open science, presented to the Research Data Alliance 6th Plenary. Presenter: William Gunn, Director of Scholarly Communications for Mendeley.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
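The digital normalization idea in point 1 can be sketched in a few lines: discard a read once the median abundance of its k-mers suggests the coverage it would add is redundant. This is a toy illustration (tiny k, toy reads, a made-up cutoff), not the production algorithm from the khmer project.

```python
from collections import Counter
from statistics import median

def kmers(read, k=4):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=4, cutoff=3):
    """Keep a read only while the median abundance of its k-mers
    stays below the cutoff; counts are updated as reads are kept."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept

# Highly redundant input: identical reads stop being kept once
# their k-mers reach the coverage cutoff, but novel reads survive.
reads = ["ACGTACGT"] * 10 + ["TTTTCCCC"]
kept = digital_normalization(reads, k=4, cutoff=3)
print(kept)  # ['ACGTACGT', 'ACGTACGT', 'ACGTACGT', 'TTTTCCCC']
```

The stream is compressed from eleven reads to four while the rare read is retained, which is the "compresses data while retaining information" property mentioned above.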
Data Integration vs Transparency: Tackling the Tension (Paul Groth)
Paul Groth discussed the tension between data integration and transparency. He explained that while integrating data from multiple sources is important for analysis, it can reduce transparency about where the data came from. Provenance, or recording the origin and process of data, was presented as a solution. Groth outlined challenges in provenance collection and proposed techniques like taint tracking and record and replay from software security to help automate provenance capture while data is integrated and analyzed.
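A minimal sketch of the taint-tracking idea borrowed from software security: values carry the set of sources they were derived from, and every operation propagates the union of its operands' sources, so provenance survives integration automatically. The class and file names are hypothetical; real systems track taint at a much lower level.

```python
class Tainted:
    """A value paired with the set of sources it was derived from.
    Arithmetic unions the provenance of the operands (taint propagation)."""
    def __init__(self, value, sources):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        return Tainted(self.value + other.value, self.sources | other.sources)

    def __mul__(self, other):
        return Tainted(self.value * other.value, self.sources | other.sources)

# Integrate two records that came from different datasets
a = Tainted(10, {"census.csv"})
b = Tainted(32, {"survey.db"})
total = a + b

print(total.value)            # 42
print(sorted(total.sources))  # ['census.csv', 'survey.db']
```

The integrated result still knows where its inputs came from, which is exactly the transparency that plain integration loses.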
This document discusses research objects and scientific workflows. It introduces research objects as a way to aggregate all elements needed to understand a research investigation, including datasets, results, experiments, and provenance. Scientific workflows are presented as tools for automating data-intensive scientific activities, with prospective and retrospective provenance capturing the intended and actual methods. The document outlines an approach to summarizing complex workflows using semantic annotations of workflow motifs and reduction primitives like collapse and eliminate. This distills provenance traces for improved understanding and querying.
A Sightseeing Tour of Prov and Some of its Extensions (Khalid Belhajjame)
This document provides an overview of the PROV provenance model and some of its extensions. It discusses the motivation for provenance, the history and development of the PROV model, its key concepts of entities, activities, and agents. It also describes extensions like ProvONE and PAV that build upon PROV to model workflow and scientific provenance.
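The three core PROV concepts can be illustrated with a small hand-rolled model (plain dataclasses, not a PROV serialization): an activity uses entities, generates entities, and is associated with agents, which already supports simple lineage queries. All identifiers here are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str

@dataclass
class Agent:
    id: str

@dataclass
class Activity:
    id: str
    used: list = field(default_factory=list)             # prov:used
    generated: list = field(default_factory=list)        # inverse of prov:wasGeneratedBy
    associated_with: list = field(default_factory=list)  # prov:wasAssociatedWith

raw = Entity("ex:raw-data")
clean = Entity("ex:clean-data")
alice = Agent("ex:alice")
run = Activity("ex:cleaning-run", used=[raw], generated=[clean],
               associated_with=[alice])

def lineage(entity, activities):
    """A derivation query: which entities does `entity` depend on?"""
    return [u for act in activities if entity in act.generated for u in act.used]

print([e.id for e in lineage(clean, [run])])  # ['ex:raw-data']
```

Extensions such as ProvONE essentially enrich this core with workflow-specific structure (ports, channels, plan/execution links).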
This document proposes representing scientific workflows as first-class citizens called research objects. It presents a model for workflow research objects that aggregates all necessary elements to understand an investigation. These include experiments, annotations, results, datasets and provenance. Research objects are encoded using semantic technologies like RDF and follow standards such as the Object Exchange model. The lifecycle of research objects is also described.
These slides introduce the second edition of ProvBench, which I am leading to collect a corpus of provenance data for benchmarking, for the provenance (and wider scientific) community.
I gave this talk at the EDBT 2014 conference, which took place in Athens, Greece.
I show how data examples can be used to characterize the behavior of scientific modules. I present a new method that automatically generates the data examples, and show that such data examples help human users understand the task of the modules, and that they can be used to assist curators in repairing broken workflows (i.e., workflows for which one or more modules are no longer supplied by their providers).
This document proposes a method to improve the reuse of workflow fragments by mining workflow repositories. It evaluates different graph representations of workflows and uses the SUBDUE algorithm to identify recurrent fragments. An experiment compares representations on precision, recall, memory usage, and time. Representation D1, which labels edges and nodes, performed best. A second experiment assesses how filtering workflows by keywords impacts finding relevant fragments for a user query. The method aims to incorporate workflow fragment search capabilities into the design lifecycle to promote reuse.
A use case designed in the context of the DataONE provenance working group, illustrating how the provenance traces generated by different workflow engines can be queried via the D-PROV model.
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is a R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments are dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how.
In practice the exchange, reuse and reproduction of scientific experiments are hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed just as second-class artifacts tethered to publications, but the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects. The term has become widespread. However. What is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship, sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
Apollo and i5K: Collaborative Curation and Interactive Analysis of Genomes (Monica Munoz-Torres)
Precise elucidation of the many different biological features encoded in a genome requires a careful curation process that involves reviewing all available evidence to allow researchers to resolve discrepancies and validate automated gene models, protein alignments, and other biological elements. Genome annotation is an inherently collaborative task; researchers only rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families.
The i5k initiative seeks to sequence the genomes of 5,000 insect and related arthropod species. The selected species are known to be important to worldwide agriculture, food safety, medicine, and energy production as well as many used as models in biology, those most abundant in world ecosystems, and representatives in every branch of the insect phylogeny in an effort to better understand arthropod evolution and phylogeny. Because computational genome analysis remains an imperfect art, each of these new genomes sequenced will require visualization and curation.
Apollo is an instantaneous, collaborative genome annotation editor, and the new JavaScript-based version allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. The i5K is a broad and inclusive effort that seeks to involve scientists from around the world in its genome curation process, and Apollo is serving as the platform to empower this community. Here we offer details about this collaboration.
Written and presented by Carole Goble (University of Manchester) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
This document introduces FAIRDOM, a consortium that provides a platform and services to help researchers organize, manage, share, and preserve research outputs according to FAIR principles. FAIRDOM has been in operation for 10 years and has over 50 installations supporting over 118 projects. It provides tools and services to help researchers collaborate better and integrate their data, models, publications and other research objects. FAIRDOM also works with other organizations and infrastructure providers to support broader research initiatives.
This document discusses using the T-BioInfo platform to provide practical education in bioinformatics. It describes how the platform can integrate different types of omics data and analysis into intuitive, visual pipelines. This allows non-experts to analyze and interpret complex datasets. Example projects are provided, such as using RNA-seq data to identify genes involved in a disease. The goal is to teach bioinformatics through collaborative, project-based learning without requiring programming skills. Learners would reconstruct simulated biological processes and contribute to ongoing analysis of real scientific datasets.
Virtual research environments for implementing long-tail open science (BlueBRIDGE)
This document discusses virtual research environments (VREs) for supporting "long-tail open science". It defines VREs as operational environments that dynamically aggregate resources like data, services, and computing/storage for users. VREs aim to support collaborative research, reproducibility, and open sharing of data/findings while providing simplified access. The document outlines how VREs can be created on demand, integrated with applications/services, and used for collaborative experiments and workflows to enable repeatability and reuse of research. Real-world examples of VREs like D4Science are presented.
The BlueBRIDGE approach to collaborative research (BlueBRIDGE)
Gianpaolo Coro, ISTI-CNR, at the BlueBRIDGE workshop on "Data management services to support stock assessment", held during the Annual ICES Science Conference 2016.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
This document discusses provenance and collaboration in science. It presents use cases in astronomy, biology, and other disciplines to illustrate challenges around data packaging, preservation, retrieval and reuse of scientific workflows. These include dealing with large datasets, versioning data from external sources, and understanding and reusing other researchers' workflows. The role of research objects and linked data for supporting provenance, identity, context and the lifecycle of scientific work is also examined.
Reproducible, Open Data Science in the Life Sciences (Eamonn Maguire)
The document outlines the workflow of a data scientist, from planning experiments and collecting data, to analyzing, visualizing, and publishing results. It emphasizes that data science involves formalizing hypotheses based on observations and testing them using collected data. A suite of open-source tools is presented to help data scientists in managing data and supporting open, reproducible life science research. The goal is to enable integration and sharing of experimental data and results.
The Symbiotic Nature of Provenance and Workflow (Eric Stephan)
This document discusses the symbiotic relationship between provenance and workflows in scientific research. It notes that workflows provide automation and integration capabilities, while provenance provides documentation of what transpired. The document provides examples of workflow and provenance technologies and outlines challenges around interoperability. It concludes that recognizing the interdependent relationship between provenance and workflows can help advance systems science research.
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genomic curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore, researchers now face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows (Khalid Belhajjame)
I gave this talk at the EDBT'2020 conference. It shows how the provenance of workflows can be anonymized without compromising lineage relationships between the data records that are used and generated by the modules that compose the workflow.
Privacy-Preserving Data Analysis Workflows for eScience (Khalid Belhajjame)
This document discusses an approach for preserving privacy in scientific workflows that use large datasets. It proposes using k-anonymity to anonymize sensitive workflow data. Parameter dependencies are leveraged to identify sensitive parameters and infer appropriate anonymity degrees. The approach was tested on 20 workflows, with overhead less than 1 millisecond. This preliminary work aims to assist scientists in anonymizing workflow data while enabling exploration of provenance and data products.
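A toy sketch of the k-anonymity step, assuming a single quasi-identifier (a zip code) generalized with a suppression hierarchy: digits are starred out until every equivalence class contains at least k rows. The real approach infers appropriate anonymity degrees from parameter dependencies, which this sketch does not model.

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Replace the last `level` digits with '*' (hierarchy-based generalization)."""
    return zipcode[: len(zipcode) - level] + "*" * level if level else zipcode

def k_anonymize(rows, k):
    """Raise the generalization level of the quasi-identifier until every
    equivalence class (group of identical generalized values) has >= k rows."""
    for level in range(len(rows[0]) + 1):
        generalized = [generalize_zip(z, level) for z in rows]
        if min(Counter(generalized).values()) >= k:
            return generalized, level
    raise ValueError("cannot reach k-anonymity")

zips = ["75001", "75002", "69001", "69002"]
anon, level = k_anonymize(zips, k=2)
print(anon, level)  # ['7500*', '7500*', '6900*', '6900*'] 1
```

Here one level of generalization suffices: each starred value is shared by two rows, so no individual zip code can be singled out from the provenance.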
- The document discusses evaluating "why-not" queries against scientific workflow provenance. Why-not queries help understand why a data item was not returned by a workflow execution.
- It proposes a solution for evaluating why-not queries in workflows with black-box modules that do not preserve attribute information from inputs. The solution explores workflow modules from sink to source to identify "picky" modules responsible for a data item not appearing in results.
- To identify picky modules, it harvests information from the web by searching for traces of scientific module invocations to find valid candidate inputs and determine if a module accepts them or is likely picky. It conducts an experiment using real workflows to test the effectiveness of the approach.
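The sink-to-source search for picky modules might look like this in miniature, with black-box modules reduced to accept/reject predicates and the web-harvested candidate input replaced by the missing item itself; all module names here are invented.

```python
def find_picky_modules(pipeline, item):
    """Walk the workflow from sink to source, feeding the missing item to each
    filter module; any module that rejects it is 'picky', i.e. a candidate
    explanation for the item's absence from the results."""
    picky = []
    for name, accepts in reversed(pipeline):  # sink-to-source order
        if not accepts(item):
            picky.append(name)
    return picky

# Hypothetical pipeline of black-box filter modules, source to sink
pipeline = [
    ("parse", lambda r: "id" in r),
    ("quality-filter", lambda r: r.get("score", 0) >= 0.8),
    ("dedupe", lambda r: not r.get("duplicate", False)),
]

missing = {"id": "gene42", "score": 0.3}
print(find_picky_modules(pipeline, missing))  # ['quality-filter']
```

The why-not answer is the module (or modules) returned: here the record was filtered out by the quality threshold, not by parsing or deduplication.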
Converting scripts into reproducible workflow research objects (Khalid Belhajjame)
1) The document presents a methodology to convert script-based experiments into reproducible workflow research objects (WROs). This addresses issues of understanding, reusing, and reproducing experiments conducted through scripts.
2) The methodology involves 5 steps: generate an abstract workflow, create an executable workflow, refine the workflow, record provenance data, and annotate and check the quality of the conversion.
3) Applying the methodology to a molecular dynamics simulation case study, the authors demonstrate how scripts can be transformed into WROs containing workflows, annotations, provenance data, and other resources needed for reproducibility.
The document discusses assisting designers in composing workflows through the reuse of frequent workflow fragments mined from repositories. It proposes an approach that involves mining fragments, representing workflows as graphs, homogenizing activity labels, and allowing users to search for fragments using keywords and activities from their initial workflow. Fragments are retrieved based on relevance to keywords and compatibility to specified activities, then ranked and presented to users for composition. Experiments assess different graph representations for mining fragments in terms of effectiveness, size and runtime. The approach aims to help designers reuse best practices from repositories when specifying new workflows.
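As a much-simplified stand-in for SUBDUE-style mining, one can count the support of the smallest possible fragments (single labeled edges) across a repository and keep those above a support threshold; real fragment mining grows larger subgraphs, which this sketch does not attempt. The workflow contents are invented.

```python
from collections import Counter

# Workflows as labeled edge lists: (source activity, target activity)
workflows = [
    [("fetch", "clean"), ("clean", "align"), ("align", "plot")],
    [("fetch", "clean"), ("clean", "align"), ("align", "report")],
    [("download", "clean"), ("clean", "align")],
]

def frequent_fragments(workflows, min_support=2):
    """Count every labeled edge (the smallest fragment) across the repository,
    once per workflow, and keep those meeting the support threshold."""
    support = Counter()
    for wf in workflows:
        for edge in set(wf):
            support[edge] += 1
    return {frag: n for frag, n in support.items() if n >= min_support}

frags = frequent_fragments(workflows)
print(frags)  # {('fetch', 'clean'): 2, ('clean', 'align'): 3}
```

The clean-then-align fragment recurs in every workflow, so it would be a strong candidate to suggest to a designer composing a new pipeline.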
Linking the prospective and retrospective provenance of scripts (Khalid Belhajjame)
Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have been recently proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on specific parts of provenance relevant for analyses. Toward this goal, we advocate that fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
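The abstraction step can be sketched by folding statement-level data dependencies into dependencies between user-annotated blocks, which is the spirit (not the implementation) of the noWorkflow-to-YesWorkflow mapping; the trace and block labels below are invented.

```python
def workflow_view(fine_grained_deps, statement_to_block):
    """Abstract statement-level dependencies (as a script-provenance tool
    might record them) into dependencies between annotated blocks, giving a
    workflow-like view; dependencies inside a single block are dropped."""
    view = set()
    for src_stmt, dst_stmt in fine_grained_deps:
        src, dst = statement_to_block[src_stmt], statement_to_block[dst_stmt]
        if src != dst:
            view.add((src, dst))
    return view

# Hypothetical trace: (statement that wrote a value, statement that read it)
deps = [(1, 2), (1, 3), (3, 4), (4, 5)]
blocks = {1: "load", 2: "load", 3: "analyze", 4: "analyze", 5: "plot"}
print(sorted(workflow_view(deps, blocks)))
# [('analyze', 'plot'), ('load', 'analyze')]
```

Five statement-level edges collapse into two block-level ones, which is the kind of reduction that makes the recorded provenance navigable.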
I gave this talk in TAPP 2014 during the provenance week in Cologne, on inferring fine-grained dependencies between data (ports) in scientific workflows. -- khalid
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations (Khalid Belhajjame)
Scientific Workflows have become the workhorse of Big Data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and consequently the complexity of data-trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
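The collapse primitive can be sketched over a linear chain of steps: consecutive steps annotated with the same motif are folded into one summary step. This is a simplification of the paper's graph-based primitives, with invented step and motif names.

```python
def collapse(chain, annotations, motif):
    """Collapse maximal runs of consecutive steps annotated with `motif`
    into a single summary step (a linear-chain 'collapse' primitive)."""
    summary, run = [], []
    for step in chain:
        if annotations.get(step) == motif:
            run.append(step)
        else:
            if run:
                summary.append(f"{motif}[{len(run)} steps]")
                run = []
            summary.append(step)
    if run:
        summary.append(f"{motif}[{len(run)} steps]")
    return summary

# Three consecutive data-formatting steps are summarized into one node,
# leaving the scientifically meaningful steps visible.
chain = ["fetch", "rename", "trim", "pad", "blast", "plot"]
annotations = {"rename": "format", "trim": "format", "pad": "format"}
print(collapse(chain, annotations, "format"))
# ['fetch', 'format[3 steps]', 'blast', 'plot']
```

Combined with an eliminate primitive (dropping annotated housekeeping steps entirely), such building blocks encode different summarization strategies over the same workflow.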
A talk given at the EDBT/ICDT 2010 conference. For more details, visit the project website at http://img.cs.manchester.ac.uk/dataspaces/dataspaces.html
2. “Science is built upon the foundations of theory and experiment validated and improved through open, transparent communication. With the increasingly central role of computation in scientific discovery, this means communicating all details of the computations needed for others to replicate the experiment.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.
Khalid Belhajjame @ PoliWeb Workshop, 2014
3. Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
4. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians, which was widely publicized in Dutch media, is suspected to be based on faked data.
5. ¡ Replication means conducting studies with independent:
§ Investigators,
§ Data,
§ Methods,
§ Laboratories,
§ Instruments.
¡ Replication is the ultimate standard for strengthening evidence and trust in scientific findings.
¡ However, replication is most of the time not possible: expensive (time and money), opportunistic.
6. Replication is not enough, yet full replication is way too expensive, while the scholarly article alone is (re)useless. Reproducible research sits in between: make data and code available so that others may reproduce findings.
8. ¡ The huge increases in performance, both at the level of hardware and software, mean that highly complex analyses are possible.
¡ However, these same advances mean a higher risk of generating results that cannot be reproduced.
9. ¡ Researchers in experimental biology use careful lab notebooks to document different aspects of their experiments.
¡ This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or the intermediary datasets (results) they used and generated.
¡ It is therefore possible that numerous published results may be unreliable or even completely invalid.
10. ¡ Often, there is no record of the process (workflow) that produced the published computational results in scholarly communications.
¡ Even the code is missing, or underwent changes.
§ It cannot be used to process the data referred to (if we are lucky enough to have it).
11. “The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.”
We have recently witnessed the emergence of a number of methods and tools for enabling reproducibility.
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.
12. System-Level Reproducibility: ReproZip, Burrito, ES3
Scripting-oriented Reproducibility: IPython, Knitr, IJulia
Workflow-oriented Reproducibility: Galaxy, Taverna, VisTrails
Article-Centered Reproducibility: SOLE, DEEP, SHARE
Investigation-oriented Reproducibility: ISA, Research Object, FuGE
13. Packing experiments (authors): the experiment, a process p, runs in a computational environment E; ReproZip captures the provenance of its execution p' as a provenance tree.
14. For each process p', the captured provenance includes: command-line arguments, working directory, files read, files written, …
15. From the execution provenance tree, ReproZip identifies the necessary components (input and output files; executable programs and steps; environment variables, dependencies, …) and derives descriptions of the data, the experiment, and the environment.
16. The identified components also yield a specification of the workflow as a VisTrails workflow.
17. The descriptions and the workflow specification are packed into a reproducible package. Figure taken from Chirigati et al., 2012.
18. System-Level Reproducibility: ReproZip, Burrito, ES3
Scripting-oriented Reproducibility: IPython, Knitr, IJulia
Workflow-oriented Reproducibility: Galaxy, Taverna, VisTrails
Article-Centered Reproducibility: SOLE, DEEP, SHARE
Investigation-oriented Reproducibility: ISA, Research Object, FuGE
19. ¡ IPython provides a rich architecture for interactive computing with:
§ A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
§ Support for interactive data visualization and use of GUI toolkits.
21. System-Level Reproducibility: ReproZip, Burrito, ES3
Scripting-oriented Reproducibility: IPython, Knitr, IJulia
Workflow-oriented Reproducibility: Galaxy, Taverna, VisTrails
Article-Centered Reproducibility: SOLE, DEEP, SHARE
Investigation-oriented Reproducibility: ISA, Research Object, FuGE
22. ¡ Inputs to computational science are not linked with its outputs.
§ Inputs: large quantities of data, complex data manipulation and/or numerical simulation, use of large and often distributed software stacks.
§ Outputs: research papers (text-based, non-interactive).
¡ Authors and readers approach computational science from opposite directions.
¡ The objective of SOLE is to link research papers with auxiliary resources that have been utilized, e.g., datasets, software programs, files, etc.
23. System-Level Reproducibility: ReproZip, Burrito, ES3
Scripting-oriented Reproducibility: IPython, Knitr, IJulia
Workflow-oriented Reproducibility: Galaxy, Taverna, VisTrails
Article-Centered Reproducibility: SOLE, DEEP, SHARE
Investigation-oriented Reproducibility: ISA, Research Object, FuGE
24. ¡ Assists users to submit the structured content via simple templates and an internal authoring tool.
¡ Performs value-added semantic annotation of the experimental metadata.
28. ¡ Data driven analysis pipelines
¡ Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
¡ Tools for automating frequently performed data intensive activities
¡ Provenance for the resulting datasets
§ The method followed
§ The resources used
§ The datasets used
29. Example applications: GWAS and pharmacogenomics (association study of Nevirapine-induced skin rash in the Thai population); trypanosomiasis (sleeping sickness parasite) in African cattle; astronomy; heliophysics; library document preservation; systems biology of micro-organisms; observing systems simulation experiments (JPL, NASA); biodiversity and invasive species modelling. [Credit Carole A. Goble]
30. ¡ Scientific workflows are primarily used to specify and enact in silico experiments.
¡ However, they can also be used as a means to document the experiment that the scientist ran, and even repurpose it!
Example workflow: two "Kegg pathway query" steps take chromosome17 and chromosome37 as inputs, and a "Detect common pathways" step produces the common pathways.
Scientific workflows are increasingly adopted in modern sciences: transparent documentation of experimental methods, repeatable and configurable.
31. ¡ Workflow decay: a decayed or reduced ability to be executed or produce the same results.
¡ To better understand workflow decay, we conducted an empirical analysis to identify its causes.
¡ To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and the reasons that caused their decay.
32. ¡ Taverna workflows from myExperiment.org
§ Taverna 1
§ Taverna 2
¡ Selection process
§ By the creation year
§ By the creator
§ By the domain
¡ Software environment
§ Taverna 2.3
¡ Experiment metadata
§ June-July 2012
§ 4 researchers
33. Number of Taverna 1 workflows from 2007 to 2011:
        2007  2008  2009  2010  2011
Tested    12    10    10    10    4*
Total     74   341   101    26    13

Number of Taverna 2 workflows from 2009 to 2012:
        2009  2010  2011  2012
Tested    12    10    15     9
Total     97   308   289   184
35. ¡ 75% of the 92 tested workflows failed to be either executed or produce the same result (if testable).
¡ Those from early years (2007-2009) had a 91% failure rate.
[Charts: failure breakdown for Taverna 1 and Taverna 2 workflows]
36. ¡ Manual analysis
§ By the validation report from the Taverna workbench
§ By interpreting experiment results reported by Taverna
¡ Identified 4 categories of causes
§ Missing example data
§ Missing execution environment
§ Insufficient descriptions about workflows
§ Volatile third-party resources
¡ Other unconsidered possible factors
§ Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)
37. Causes, refined causes, and examples:
¡ Third-party resources are not available
§ The underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution, and the server is no longer available.
§ Services are deprecated. Example: DDBJ web services are no longer provided despite the fact that they are used in many myExperiment workflows.
¡ Third-party resources are available but not accessible
§ Data is available but identified using different IDs than the ones known to the user. Example: due to scalability reasons the input data is superseded by new data, making the workflow not executable or providing wrong results.
§ Data is available but permission, a certificate, or network access is needed. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
§ Services are available but need permission, a certificate, or network access to invoke them. Example: the security policies of the execution framework are updated due to new hosting institution rules.
¡ Third-party resources have changed
§ Services are still available using the same identifiers but their functionality has changed. Example: the web services are updated.
38. ¡ 50% of the decay was caused by volatility of 3rd-party resources
§ Unavailable
§ Inaccessible
§ Updated
¡ Missing example data
§ Unable to re-run
¡ Missing execution environment
§ Such as local plugins
¡ Insufficient metadata
§ Such as any required dependency libraries or permission information
40. ¡ Some services that compose workflows are annotated using concepts from domain ontologies.
¡ Such annotations can be used to repair workflows:
§ Identify available services that can play the same role as an unavailable service within a workflow.
41. Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification.
Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence.
42. Task replaceability: for an operation op2 to be able to substitute an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task op1 performs, i.e., task(op1) ⊑ task(op2).
43. Parameter replaceability: to be compatible, the domain of the output must be the same as, or a subconcept of, the domain of the subsequent input, i.e., dom(output) ⊑ dom(input).
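Both checks reduce to subsumption tests over ontologies. The following sketch illustrates them; the toy ontologies (child-to-parent subclass links) and the concept names beyond those mentioned on the slides are illustrative assumptions, not real domain ontologies.

```python
# Toy ontologies as child -> parent subclass links (illustrative assumptions).
TASK_PARENT = {"Sequence_alignment": "Sequence_analysis",
               "Protein_identification": "Sequence_analysis"}
DOMAIN_PARENT = {"Protein_record": "Biological_record",
                 "DNA_sequence": "Sequence"}

def subsumers(concept, parent):
    """The concept itself plus every concept that subsumes it."""
    out = [concept]
    while concept in parent:
        concept = parent[concept]
        out.append(concept)
    return out

def task_replaceable(task_op1, task_op2):
    # op2's task must be equivalent to, or subsume, op1's task
    return task_op2 in subsumers(task_op1, TASK_PARENT)

def parameter_compatible(out_domain, in_domain):
    # the output's domain must equal, or be a subconcept of, the input's domain
    return in_domain in subsumers(out_domain, DOMAIN_PARENT)

print(task_replaceable("Sequence_alignment", "Sequence_analysis"))       # True
print(task_replaceable("Sequence_alignment", "Protein_identification"))  # False
print(parameter_compatible("Protein_record", "Biological_record"))       # True
```

A real implementation would query a reasoner over OWL ontologies rather than walk a hand-written dictionary, but the direction of the two subsumption tests is the same.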
44. While the method just presented is sound, its practical applicability is hindered by the following facts:
§ Semantic annotations of web services are scarce.
§ Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies.
§ As a result, a substitute that is discovered for replacing an unavailable operation using such annotations may turn out to be unsuitable, and, inversely, a suitable substitute may be discarded.
46. Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if each input of op2 is compatible with (subsumes the domain of) the corresponding input of op1, and each output of op2 is compatible with (a subconcept of the domain of) the corresponding output of op1.
47. ¡ In addition to the compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation.
¡ To perform this test, we exploit the following observation: an operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1.
¡ To perform the above test, however, we would have to call the missing operation op1!
¡ The solution that we adopt for overcoming this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when enacted.
48. § An operation op2 may be compatible in terms of task with op1 if: op2 delivers the same results that op1 delivered in past executions, as logged within provenance logs, when fed the same input values.
§ Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.
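The provenance-log test can be sketched in a few lines. The log entries and both candidate functions below are hypothetical stand-ins for real service operations; the point is only the shape of the check.

```python
# Hypothetical provenance log for the unavailable op1: past input bindings
# and the output bindings op1 delivered for them.
provenance_log = [
    ({"seq": "ACGT"}, {"gc_count": 2}),
    ({"seq": "GGGA"}, {"gc_count": 3}),
]

def may_replace(op2, log):
    """True means op2 *may* be task-compatible with the logged operation;
    the log cannot cover every possible input, so this is not a proof."""
    return all(op2(**inputs) == outputs for inputs, outputs in log)

# Two hypothetical candidate substitutes.
good = lambda seq: {"gc_count": seq.count("G") + seq.count("C")}
bad = lambda seq: {"gc_count": seq.count("G")}

print(may_replace(good, provenance_log), may_replace(bad, provenance_log))  # True False
```

This is exactly why the slide says "may be compatible": agreement on logged executions is necessary evidence, not a guarantee over all inputs.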
49. ¡ The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice.
¡ There are various parameter representations adopted in bioinformatics.
¡ Because of representation mismatches, a service operation that performs a task similar to the missing operation may be found to be unsuitable.
50. Example of values delivered by two operations using the same input value: CosSym(value1, value2) = 0.007.
51. To overcome this problem, we use a two-step process when comparing the values of parameters:
1. Given a parameter value, we derive its representation.
2. If the representation is associated with a key attribute (identifier), we extract the value of such an attribute.
If two parameter values are associated with identifiers, then they are compared by comparing their identifiers.
52. Example of values delivered by two operations using the same input value: value1 in Fasta format, value2 in Uniprot format.
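The two-step comparison can be sketched as follows. The record formats here are deliberately simplified, hypothetical stand-ins for real FASTA and UniProt flat-file records, and the accession `P12345` is an illustrative value.

```python
import re

# Sketch: two parameter values in different (simplified, hypothetical) record
# formats are judged equal by extracting and comparing their identifiers
# rather than their raw text.
def extract_id(value):
    if value.startswith(">"):            # FASTA-like header line
        return value[1:].split("|")[1]   # assume the accession sits between pipes
    m = re.match(r"ID\s+(\S+)", value)   # flat-file-like record (hypothetical)
    return m.group(1) if m else None

def same_record(v1, v2):
    id1, id2 = extract_id(v1), extract_id(v2)
    return id1 is not None and id1 == id2

fasta_value = ">sp|P12345|EXAMPLE_HUMAN Example protein\nMKTAYIAK"
flat_value = "ID   P12345\nSQ   MKTAYIAK"

print(same_record(fasta_value, flat_value))  # True
```

Comparing the raw strings would give a near-zero similarity (as in the CosSym example above), while identifier comparison recognizes the two values as the same record.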
54. ¡ Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.
¡ They tend to be data intensive.
¡ The datasets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows.
55. ¡ The datasets obtained as a result of workflow execution often contain duplicates.
¡ As a result:
§ The analysis and interpretation of workflow results may become tedious.
§ The presence of duplicates also unnecessarily increases the size of workflow results.
56. ¡ Research in duplicate record detection has been active for more than three decades.
§ Elmagarmid et al., 2007 conducted a comprehensive survey of the topic.
¡ We do not aim to design yet another algorithm for comparing and matching records.
¡ Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16, 2007.
57. ¡ A data driven workflow can be defined as a directed graph: wf = ⟨N, E⟩.
¡ A node represents an analysis operation, which has a set of input and output parameters: ⟨op, I_op, O_op⟩ ∈ N.
¡ The edges are dataflow dependencies: ⟨⟨op, o⟩, ⟨op′, i⟩⟩ ∈ E.
58. The execution of workflows gives rise to provenance traces, which we capture using two relations.
¡ Transformation: specifies that the execution of an operation took as input a given ordered set of records and generated another ordered set of records: InB_op = ⟨⟨op, i1, r_i1⟩, …, ⟨op, in, r_in⟩⟩ and OutB_op = ⟨⟨op, o1, r_o1⟩, …, ⟨op, om, r_om⟩⟩.
¡ Transfer: specifies the transfer of records along the edges of the workflow: a record r delivered on output o of op is transferred to input i of the downstream operation op′, i.e., ⟨⟨op, o, r⟩, ⟨op′, i, r⟩⟩.
59. To guide the detection of duplicates in workflow results we exploit the following fact:
¡ An operation that is known to be deterministic produces identical output bindings given the same input binding:
deterministic(op) ∧ (OutB_op, InB_op) ∈ T ∧ (OutB′_op, InB_op) ∈ T ⟹ identical(OutB_op, OutB′_op)
60. Provenance-guided detection of duplicates, example: a workflow in which IdentifyProtein (input i, output o) feeds GetGOTerm (input i′, output o′), with record sets Ri, Ro, R′i and R′o on the respective parameters.
1. The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of Ri into disjoint sets of identical records R1i, …, Rni.
61. Provenance-guided detection of duplicates, example (continued):
2. The sets of records Ro, R′i and R′o are partitioned into sets of identical records based on the partitioning of Ri. For example:
Rio = { ro ∈ Ro s.t. ∃ ri ∈ Rii : (⟨IdentifyProtein, o, ro⟩, ⟨IdentifyProtein, i, ri⟩) ∈ T }
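The two phases can be sketched concretely. The record values and provenance pairs below are illustrative assumptions; the key point is that, for a deterministic operation, the outputs are partitioned purely from provenance, without comparing output values at all.

```python
from collections import defaultdict

# Illustrative input records and provenance (which output came from which input).
Ri = {"r1": "P68871", "r2": "P68871", "r3": "Q9Y261"}   # input record -> value
T = {"o1": "r1", "o2": "r2", "o3": "r3"}                # output -> source input

# Phase 1: compare input values to partition Ri into sets of identical records.
by_value = defaultdict(list)
for rid, value in Ri.items():
    by_value[value].append(rid)
group_of = {rid: value for value, rids in by_value.items() for rid in rids}

# Phase 2: outputs of a deterministic operation are duplicates iff their
# source inputs fall in the same group; output values are never compared.
Ro_partition = defaultdict(list)
for oid, rid in T.items():
    Ro_partition[group_of[rid]].append(oid)

print(sorted(sorted(g) for g in Ro_partition.values()))  # [['o1', 'o2'], ['o3']]
```

The same propagation step repeats along the workflow's transfer edges, so one round of value comparison at the start covers the downstream record sets as well.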
62. ¡ In the example just described, the operations that compose the workflow have exactly one input and one output parameter.
§ However, the algorithm we developed supports operations with multiple input and output parameters.
¡ Notice that we assumed that the analysis operations that compose the workflow are deterministic. This is not always the case.
§ This raises the question as to how to determine that a given operation is deterministic.
63. To verify the determinism of operations, we use an approach whereby operations are probed:
1. Given an operation op, we select example values that can be used as the inputs of op, and invoke op using those values multiple times.
2. If op produces identical output values given identical input values, then it is likely to be deterministic; otherwise, it is not deterministic.
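The probing procedure is easy to sketch; the two probed operations below are hypothetical examples. Note the asymmetry the slide points out: agreement across trials only suggests determinism, while a single disagreement disproves it.

```python
import random

# Sketch of probing: invoke an operation several times on the same example
# inputs and compare the outputs. All-equal only means "likely deterministic".
def probably_deterministic(op, example_inputs, trials=5):
    return all(
        len({repr(op(x)) for _ in range(trials)}) == 1
        for x in example_inputs
    )

print(probably_deterministic(lambda x: x * 2, [1, 2, 3]))           # True
print(probably_deterministic(lambda x: x + random.random(), [0]))   # False
```

In practice the example values would be drawn from the provenance logs or service documentation, and probing a remote service has a cost, so the number of trials stays small.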
64. To support duplicate detection in collection-based workflows we need to be able to:
¡ Identify when two collections are identical: two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping map : Ri → Rj that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical.
¡ Identify duplicate records between two collections that are known to be identical: identify a bijective mapping that maps every ri in Ri to an identical rj in Rj.
66. ¡ Overwhelming for users who are not the developers
¡ Abstractions required for reporting
¡ Lineage queries result in very long trails
67. ¡ a.k.a. shims (D. Hull et al.)
¡ Dealing with data and protocol heterogeneities
¡ Local organization of data
¡ ~60% (Garijo D., Alper P., Belhajjame K. et al.)
68. Process-wise and data-wise abstractions:
¡ Sub-workflows
§ Not always a significant unit of function (e.g. aesthetic purposes)
¡ Bookmarked data links
§ Cluster the output signature
§ Further complicates the workflow
¡ Components
§ Library dependent
69. ¡ A graph model for representing workflows
¡ Graph re-write rules for summarization: IF a step performs a certain function (a motif) THEN re-write the workflow graph (using reduction primitives).
71. Pure dataflows: W = ⟨N, E⟩, with operation and port nodes N = (N_op ∪ N_p) and dataflow edges E = (E_op→p ∪ E_p→p ∪ E_p→op).
77. ¡ Strategies as a set of rules for summarization
¡ Two sample strategies based on an empirical analysis of workflows
¡ Reporting:
§ Process: significant activities (retrieval, analysis, visualization)
§ Data: reduced cardinality; stripped of protocol-specific payload/formatting
78. ¡ By-Eliminate
§ Minimal annotation effort
§ Single rule
¡ By-Collapse
§ More specific annotation
§ Multiple rules
86. ¡ Establishing trust, but also understanding and reusability, in computational science is needed more than ever.
¡ Reproducibility seems to be a cost-effective solution.
¡ A number of tools and methods have been developed for doing so.
¡ However, that is not enough.
¡ Changing our ways (culture) of doing science is more challenging.
87. ¡ Pinar Alper
¡ Óscar Corcho
¡ Fernando Chirigati
¡ Juliana Freire
¡ David De Roure
¡ Yolanda Gil
¡ Daniel Garijo
¡ Carole Goble
¡ David Koop
¡ Stian Soiland-Reyes
¡ Paolo Missier
¡ Jun Zhao
¡ and many others …