This document summarizes a presentation about scientific workflow systems and related technologies including Taverna, Biocatalogue, and myExperiment. Taverna is a workflow management system that allows researchers to design and run workflows linking various bioinformatics services. Biocatalogue is a public registry of life science web services. MyExperiment is a repository for sharing workflows. The document discusses how these tools help scientists conduct experiments and analyze and preserve results.
What is Reproducibility? The R* brouhaha (and how Research Objects can help), by Carole Goble
Presented at the First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
This document discusses using semantic web technologies for translational research in life sciences. It provides an overview of semantic web standards and outlines several projects demonstrating applications in healthcare and biomedical research. These include developing an active semantic electronic medical record, semantically annotating experimental glycomics data, and integrating diverse biomedical data sources using ontologies to enable complex querying and knowledge discovery.
This document discusses community standards for reproducible and reusable bioscience research. It outlines the importance of consistent reporting to maximize the value of collective scientific outputs. However, there are challenges due to the large number of bioscience reporting standards and lack of knowledge about how they relate. The document calls for a coherent catalogue of data sharing resources to evaluate standards, show relationships among them, and promote interoperability. This would help researchers make informed choices about standards and facilitate structured descriptions of experiments across domains.
The document discusses the MESUR (MEtrics from Scholarly Usage of Resources) project, which aims to develop new metrics for scholarly impact and prestige based on usage data from digital scholarly resources rather than just citations. The key points are:
1) MESUR analyzes over 1 billion usage events of scholarly articles and develops network-based metrics from usage patterns to map the structure of science.
2) Preliminary results show relevant structure in usage-based network maps that correlate with traditional citation-based metrics.
3) MESUR has produced a variety of usage and citation-based metrics and developed online tools for exploring these metrics.
Knowledge Infrastructure for Global Systems Science, by David De Roure
Presentation at the First Open Global Systems Science Conference, Brussels, 8-10 November 2012
http://www.gsdp.eu/nc/news/news/date/2012/10/31/first-open-global-systems-science-conference/
Being Reproducible: SSBSS Summer School 2017, by Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice, the exchange, reuse and reproduction of scientific experiments depend on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not "finished": codes fork, data are updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Ontomaton: NCBO BioPortal ontology lookups in Google Spreadsheets, produced by the ISA team at the University of Oxford e-Research Centre (Eamonn Maguire, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra and Susanna Sansone) and NCBO (Trish Whetzel).
The work was presented during ICBO 2013 in Montreal by Trish Whetzel (Thanks Trish!)
Marco Brandizi and Keywan Hassani-Pak, Rothamsted Research, Invited Presentation at SWAT4HCLS 2022.
The FAIR data principles have become a driving force in the life sciences and other scientific domains, helping researchers share their data and unlock its full potential for integrating information and making novel discoveries. Knowledge graphs are an increasingly popular paradigm for modelling data according to these principles, and technologies such as graph databases are emerging as complementary to approaches like linked data. All of this extends to the agronomy, farming and food domains. How advanced is the adoption of sound data management practices in these domains? How does it compare to other life sciences? In this presentation we will talk about our practical experience, focusing on KnetMiner, a gene and molecular biology discovery platform based on building and publishing knowledge graphs according to the FAIR principles, using a mix of linked data standards for the life sciences together with recent graph database and API technologies. We welcome questions and discussion from the audience about similar experiences.
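As a purely illustrative sketch of the linked-data side of such a platform, the Python snippet below queries a FAIR knowledge graph through a SPARQL endpoint. The endpoint URL, the vocabulary and the gene name are hypothetical placeholders, not KnetMiner's actual API.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint and vocabulary; placeholders, not KnetMiner's real service.
sparql = SPARQLWrapper("https://example.org/knowledge-graph/sparql")
sparql.setReturnFormat(JSON)

# Find traits associated with a gene and the publications providing evidence:
# the kind of cross-resource traversal a FAIR knowledge graph is meant to support.
sparql.setQuery("""
PREFIX bk: <http://example.org/biokb/>
SELECT ?trait ?paper WHERE {
  ?gene  bk:prefName       "EXAMPLE_GENE_1" .
  ?gene  bk:associatedWith ?trait .
  ?trait bk:evidencedBy    ?paper .
}
LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["trait"]["value"], row["paper"]["value"])
```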
High-performance web services for gene and variant annotations, by Chunlei Wu
This document describes high-performance web services called MyGene.info and MyVariant.info that provide gene and variant annotations. It discusses how the services aggregate data from multiple sources and keep it up-to-date. It also explains the use of document databases to store data in JSON format and support rich data structures. APIs for the services support large-scale usage and are designed to be easy to use, developer-friendly, and aggregate comprehensive annotation information for genes and variants.
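To give a flavour of how these annotation APIs are typically consumed, here is a minimal Python sketch that queries the public MyGene.info (v3) and MyVariant.info (v1) query endpoints with the requests library. The URL paths, parameters and field names are based on the services' public documentation as I recall it, so treat them as assumptions to verify against the current API docs.

```python
import requests

# Query MyGene.info (v3 API) for a gene by symbol and fetch selected annotation fields.
resp = requests.get(
    "https://mygene.info/v3/query",
    params={"q": "symbol:CDK2", "species": "human", "fields": "symbol,name,entrezgene"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))

# Query MyVariant.info (v1 API) for a variant by dbSNP rsid.
resp = requests.get(
    "https://myvariant.info/v1/query",
    params={"q": "dbsnp.rsid:rs58991260", "fields": "dbsnp.rsid,dbsnp.chrom"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit.get("_id"), hit.get("dbsnp", {}).get("rsid"))
```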
Towards Responsible Content Mining: A Cambridge perspective, by Peter Murray-Rust
ContentMining (Text and Data Mining) is now legal in the UK for non-commercial research. Cambridge UK is a natural centre, with several components:
* a world-class University and Library
* many publishers, both Open Access and conventional
* a digital culture
* ContentMine - a leading proponent and practitioner of mining
Cambridge University Press welcomes content mining and invited PMR to give a talk there. He showed the technology and protocols and proposed a practical way forward in 2017
BioCatalogue talk by Carole Goble. In these slides she outlines the reasons behind the BioCatalogue project and presents BioCatalogue and its goals.
High throughput mining of the scholarly literature; talk at NIH, by Peter Murray-Rust
Elsevier stopped Chris Hartgerink, a statistician, from downloading research papers in bulk from ScienceDirect for content mining aimed at detecting potentially problematic research findings, despite his having legal access through his university's subscription and intending only to extract facts without redistributing full papers. He had downloaded around 30GB of data over 10 days to mine the psychology literature for test results, figures, tables and other information reported in papers. Hartgerink's research aims to investigate unreliable findings, which can harm policy and research progress, through an innovative content mining method.
Crowd-sourcing is being used to build ChemSpider, a structure-centric community for chemists. ChemSpider allows users to search over 20 million chemical structures and associated data. It enables collaborative curation of data through tools like commenting and editing. ChemSpider aims to enable open discovery through features like virtual screening of compounds using LASSO descriptors.
Annotation of SBML Models Through Rule-Based Semantic Integration, by Allyson Lister
This talk was given on June 28, 2009 at the Bio-Ontologies SIG as part of ISMB/ECCB 2009. You can download the paper this presentation is about from http://hdl.handle.net/10101/npre.2009.3286.1. More information on the ISMB conference is available at http://www.iscb.org/ismbeccb2009/ and http://friendfeed.com/ismbeccb2009
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
A Global Commons for Scientific Data: Molecules and Wikidata, by Peter Murray-Rust
This document summarizes Peter Murray-Rust's work on developing software to extract structured data and information from scientific documents. It discusses tools to extract data from text, tables, images, computational logs, and more. It provides examples of extracting chemical information, disease and species data, and phylogenetic trees from figures. The goal is to liberate scientific data locked up in unstructured documents to enable new discoveries.
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
The document discusses the increasing scale and complexity of knowledge generation in science domains like astronomy and medicine over recent centuries. It argues that knowledge generation can be viewed as a systems problem involving many actors and processes. The document proposes a service-oriented approach using web services as an integrating framework to address challenges of scale, complexity, and distributed collaboration in e-Science. Key challenges discussed include semantics, documentation, scaling issues, and sociological factors like incentives.
Workflows, provenance and reporting: a lifecycle perspective, at BIH 2013, Rome, by Carole Goble
Workflow systems support the design, configuration and execution of repetitive, multi-step pipelines and analytics, well established in many disciplines, notably biology and chemistry, but less so in biodiversity and ecology. From an experimental perspective workflows are a means to handle the work of accessing an ecosystem of software and platforms, manage data and security, and handle errors. From a reporting perspective they are a means to accurately document methodology for reproducibility, comparison, exchange and reuse, and to trace the provenance of results for review, credit, workflow interoperability and impact analysis. Workflows operate in an evolving ecosystem and are assemblages of components in that ecosystem; their provenance trails are snapshots of intermediate and final results. Taking a lifecycle perspective, what are the challenges in workflow design and use with different stakeholders? What needs to be tackled in evolution, resilience, and preservation? And what are the "mitigate or adapt" strategies adopted by workflow systems in the face of changes in the ecosystem/environment, for example when tools are deprecated or datasets become inaccessible because of funding shortfalls?
Taverna is a free and open-source workflow management system that allows researchers to design and execute scientific workflows. It was developed by the University of Manchester to support in silico experiments in biology. Taverna provides a graphical user interface for designing workflows using a variety of distributed data sources and web services without having to learn complex programming. It has been widely adopted by researchers in fields such as biology, healthcare, astronomy, and cheminformatics to automate analysis pipelines and share workflows.
This curriculum vitae summarizes the educational and professional background of Kamran Sartipi. He holds a PhD in Computer Science from the University of Waterloo and has published extensively. His research focuses on software engineering, information security, electronic health, and medical informatics. He has led several projects involving decision support systems, knowledge engineering, distributed systems, and standards-based health interoperability.
To address challenges of poor interoperability among biological natural language processing (BioNLP) services, the authors propose a framework called BioNLP-SADI that uses Semantic Automated Discovery and Integration (SADI) to integrate BioNLP tools. BioNLP-SADI represents output in RDF, uses ontologies for modeling, and SPARQL for querying to consolidate results from multiple services without programming. This allows ad-hoc analysis of text mining results and comparative evaluation of BioNLP tools. The authors implemented several example BioNLP services within this framework.
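As a rough illustration of that consolidation idea (using an invented vocabulary rather than the actual BioNLP-SADI ontologies), the sketch below merges RDF output from two hypothetical text-mining services and answers one SPARQL query across both with rdflib.

```python
from rdflib import Graph

# Toy RDF fragments standing in for the output of two hypothetical BioNLP services
# that annotated the same document; the ex: vocabulary is invented for illustration.
service_a = """
@prefix ex: <http://example.org/nlp#> .
ex:doc1 ex:mentionsGene "BRCA1" .
"""
service_b = """
@prefix ex: <http://example.org/nlp#> .
ex:doc1 ex:mentionsDisease "breast cancer" .
"""

# Merging both graphs lets a single SPARQL query consolidate annotations across services.
g = Graph()
g.parse(data=service_a, format="turtle")
g.parse(data=service_b, format="turtle")

query = """
PREFIX ex: <http://example.org/nlp#>
SELECT ?doc ?gene ?disease WHERE {
  ?doc ex:mentionsGene    ?gene .
  ?doc ex:mentionsDisease ?disease .
}
"""
for doc, gene, disease in g.query(query):
    print(doc, gene, disease)
```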
FAIR Computational Workflows
Computational workflows capture precise descriptions of the steps and data dependencies needed to carry out computational data pipelines, analysis and simulations in many areas of Science, including the Life Sciences. The use of computational workflows to manage these multi-step computational processes has accelerated in the past few years driven by the need for scalable data processing, the exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality assured processing methods. The SARS-CoV-2 pandemic has significantly highlighted the value of workflows.
This increased interest in workflows has been matched by the number of workflow management systems available to scientists (Galaxy, Snakemake, Nextflow and 270+ more) and the number of workflow services like registries and monitors. There is also recognition that workflows are first-class, publishable Research Objects just as data are. They deserve their own FAIR (Findable, Accessible, Interoperable, Reusable) principles and services that cater for their dual roles as explicit method description and software method execution [1]. To promote long-term usability and uptake by the scientific community, workflows (as well as the tools that integrate them) should become FAIR+R(eproducible) and citable, so that authors' credit is attributed fairly and accurately.
The work on improving the FAIRness of workflows has already started and a whole ecosystem of tools, guidelines and best practices has been under development to reduce the time needed to adapt, reuse and extend existing scientific workflows. An example is the EOSC-Life Cluster of 13 European Biomedical Research Infrastructures which is developing a FAIR Workflow Collaboratory based on the ELIXIR Research Infrastructure for Life Science Data Tools ecosystem. While there are many tools for addressing different aspects of FAIR workflows, many challenges remain for describing, annotating, and exposing scientific workflows so that they can be found, understood and reused by other scientists.
This keynote will explore the FAIR principles for computational workflows in the Life Sciences, using the EOSC-Life Workflow Collaboratory as an example.
[1] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, and Daniel Schober, "FAIR Computational Workflows", Data Intelligence 2020 2:1-2, 108-121. https://doi.org/10.1162/dint_a_00033
Arabidopsis Information Portal: A Community-Extensible Platform for Open Data, by Matthew Vaughn
Araport is an innovative model organism database resource that offers users the ability to bring their own visualizations, data sets, algorithms, and genome browser tracks and share them with their colleagues.
OVium Bio-Information Solutions uses state-of-the-art algorithms to analyze key data resources such as NCBI, EMBL and PDB to develop cell signalling pathways.
OVium employs cloud and MPP computing solutions with homology and signal network mapping to develop chemical and protein pathways for discovery research.
The document discusses Microsoft Research's ORECHEM project, which aims to integrate chemistry scholarship with web architectures, grid computing, and the semantic web. It involves developing infrastructure to enable new models for research and dissemination of scholarly materials in chemistry. Key aspects include using OAI-ORE standards to describe aggregations of web resources related to crystallography experiments. The objective is to build a pipeline that extracts 3D coordinate data from feeds, performs computations on resources like TeraGrid, and stores resulting RDF triples in a triplestore. RESTful web services are implemented to access different steps in the workflow.
German Conference on Bioinformatics 2021
https://gcb2021.de/
FAIR Computational Workflows
This document is a resume for Gautam Machiraju summarizing his education and research experience. He holds a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to biology and healthcare, including modeling cancer biomarker shedding kinetics, mining the literature for biomarker data, and applying deep learning to patient time-series data. His skills include programming, mathematics, data science, bioinformatics, and laboratory techniques, and he is currently a bioinformatics research assistant at Stanford University School of Medicine.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
Acting as Advocate? Seven steps for libraries in the data decade, by Liz Lyon
UKOLN advocates that libraries take seven steps to support data management and open science in the data decade:
1) Provide briefings on cloud data services in partnership with IT services.
2) Build usable data management tools in partnership with researchers.
3) Develop data sustainability strategies and articulate the costs and benefits.
4) Publish case studies on open science to show benefits of universal data sharing.
5) Present at university ethics committees to highlight open data issues.
6) Raise awareness of citizen science opportunities and guidelines for good practice.
7) Promote data citation and attribution to embed in publication practice.
Adithya Rajan is seeking a career in machine learning and big data. He has a Ph.D. in Electrical Engineering from Arizona State University with extensive coursework in machine learning, optimization, and statistics. He has over 3 years of industry experience as a data scientist and research engineer developing machine learning algorithms. His research focuses on applying statistical techniques like stochastic ordering and information theory to wireless communications and signal processing.
Similar to: Invited talk @ ESIP summer meeting, 2009
Design and Development of a Provenance Capture Platform for Data Science, by Paolo Missier
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records, by Paolo Missier
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea..., by Paolo Missier
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
please see paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit..., by Paolo Missier
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance are in fact in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges..., by Paolo Missier
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science), by Paolo Missier
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview, by Paolo Missier
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
Probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in..., by Paolo Missier
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
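As a rough sketch of dataframe-level provenance capture (my own illustration, not the authors' DPDS tool), the snippet below runs each preprocessing step through a wrapper that compares the input and output pandas dataframes and records which rows and columns were affected.

```python
import pandas as pd

def capture_provenance(step_name, df_in, transform):
    """Run one preprocessing step and derive a coarse provenance record by
    comparing the input and output dataframes (illustrative only)."""
    df_out = transform(df_in)
    record = {
        "step": step_name,
        "rows_in": len(df_in),
        "rows_out": len(df_out),
        "dropped_rows": sorted(set(df_in.index) - set(df_out.index)),
        "added_columns": sorted(set(df_out.columns) - set(df_in.columns)),
        "removed_columns": sorted(set(df_in.columns) - set(df_out.columns)),
    }
    return df_out, record

# A tiny two-step pipeline: drop rows with missing values, then add a derived column.
df = pd.DataFrame({"age": [34, None, 51], "income": [30000, 45000, None]})
df1, p1 = capture_provenance("dropna", df, lambda d: d.dropna())
df2, p2 = capture_provenance("add_ratio", df1, lambda d: d.assign(ratio=d.income / d.age))

for record in (p1, p2):
    print(record)
```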
Tracking trajectories of multiple long-term conditions using dynamic patient..., by Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare, by Paolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in..., by Paolo Missier
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Chapman, A., Missier, P., Simonelli, G., & Torlone, R., "Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science", PVLDB, 14(4):507–520, January 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data..., by Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines: from optimising re-execution to general Dat..., by Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha..., by Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University, by Paolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
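To make the selective re-execution idea concrete, here is a minimal hypothetical sketch (my own illustration, not the ReComp implementation): a provenance map records which inputs each past output depends on, and simple difference and impact functions decide which outputs need re-running when inputs change.

```python
# Illustrative provenance-driven selective re-execution; not the actual ReComp code.

# Provenance from past executions: which inputs each output was derived from.
provenance = {
    "report_A": {"reference_db", "variants_A"},
    "report_B": {"reference_db", "variants_B"},
    "report_C": {"variants_C"},
}

old_inputs = {"reference_db": "v1", "variants_A": "v1", "variants_B": "v1", "variants_C": "v1"}
new_inputs = {"reference_db": "v2", "variants_A": "v1", "variants_B": "v1", "variants_C": "v1"}

def diff(old, new):
    """Difference function: the set of inputs whose version changed."""
    return {name for name in new if new[name] != old.get(name)}

def impact(output, changed):
    """Impact function: here, simply how many changed inputs the output depends on."""
    return len(provenance[output] & changed)

changed = diff(old_inputs, new_inputs)
to_rerun = [o for o in provenance if impact(o, changed) > 0]
print("changed inputs:", sorted(changed))
print("outputs to re-execute:", sorted(to_rerun))  # report_A and report_B, not report_C
```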
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs with LLMs to increase the accuracy and quality of generated answers.
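A minimal GraphRAG-style sketch for this setting, assuming a Neo4j biomedical graph with (:Gene)-[:ASSOCIATED_WITH]->(:Disease) relationships; the schema, credentials and the ask_llm call are illustrative assumptions, not the presenter's actual setup:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_context(gene_symbol: str) -> str:
    """Pull facts about a gene from the knowledge graph to ground the LLM."""
    query = (
        "MATCH (g:Gene {symbol: $symbol})-[:ASSOCIATED_WITH]->(d:Disease) "
        "RETURN d.name AS disease LIMIT 25"
    )
    with driver.session() as session:
        records = session.run(query, symbol=gene_symbol)
        return "\n".join(f"- {record['disease']}" for record in records)

def grounded_prompt(question: str, gene_symbol: str) -> str:
    """Combine retrieved graph facts with the user's question."""
    return (
        "Answer using only the facts below.\n"
        f"Facts about {gene_symbol}:\n{graph_context(gene_symbol)}\n\n"
        f"Question: {question}"
    )

# Hypothetical usage, where ask_llm() is your LLM client of choice:
# answer = ask_llm(grounded_prompt("Which diseases is TP53 linked to?", "TP53"))
```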
What do a Lego brick and the XZ backdoor have in common? - Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that they are both building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more than that in common.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of open, standard formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
12. What do Scientists use Taverna for? ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Application areas: systems biology model building; proteomics; sequence analysis; protein structure prediction; gene/protein annotation; microarray data analysis; QTL studies; QSAR studies; medical image analysis; public health care epidemiology; heart model simulations; high throughput screening; phenotypical studies; phylogeny; statistical analysis; text mining; astronomy, music, meteorology. Users include: Netherlands Bioinformatics Centre; Genome Canada Bioinformatics Platform; BioMOBY; US FLOSS social science program; RENCI; SysMO Consortium; French SIGENAE farm animals project; ThaiGrid; CARMEN Neuroscience project; SPINE consortium; EU Enfin, EMBRACE, BioSapiens, Casimir; EU SysMO Consortium; NERC Centre for Ecology and Hydrology; Bergen Centre for Computational Biology; Max Planck Institute for Plant Breeding Research; Genoa Cancer Research Centre; AstroGrid; 30 USA academic and research institutions.
13. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. [Figure (Paul Fisher): 200, Genotype, Phenotype, Metabolic pathways, Literature]
16. WaaS: Workflows as a Service ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier [Pettifer, Kell, University of Manchester] inside
17. Workflows operating over Grid Infrastructure. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier.
http://www.knowarc.eu : KnowARC integrated with Taverna; an application prototype to use Taverna as a direct interface to Grid resources running ARC.
http://cagrid.org/ : open source grid software infrastructure aimed at enabling multi-institutional data sharing and analysis; underpins caBIG. Taverna links together caGrid resources.
http://www.eu-egee.org/ : Europe's leading grid computing project; piloted Taverna over EGEE gLite services.
19. caBIG cancer cyberinfrastructure uses Taverna to link services. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. A sample caGrid workflow for microarray analysis, using caArray, GenePattern and geWorkbench [Ravi Madduri]. Reference: "Orchestrating caGrid Services in Taverna", Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster, Proc. IEEE Intl Conf on Web Services (ICWS 2008).
20. Who else is in this space? ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Kepler, Triana, BPEL, Ptolemy II, Taverna, Trident, BioExtract.
22. Workflow-based experimentation lifecycle. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Develop, Run, Analyse Results, Publish; collect and query provenance metadata throughout.
23. Taverna. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Graphical workbench for professionals; plug-in architecture; nested workflows; drag-and-drop wiring together; rapidly incorporate new services without coding; not restricted to predetermined services; access to local and remote resources and analysis tools; 3500+ service operations available at start-up.
30. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. [Slide: service curation concepts: Curation Model (Versioning, Quantitative Content, Tags); Service Model (Semantic Content, Ontologies, Functional Capabilities, Provenance, Operational Capabilities, Operational Metrics, Usage Policy); Community Standing (Ratings, Usage Statistics, Attribution); Free-text Searching; Statistics; Usable and Useful; Understandable; Controlled vocabs; Interfaces]
34. Scaling up along the social dimension. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Crossing the boundaries of individual investigation: What? Where? Why? Who? How? [Diagram: two Develop-Run-Analyze-Publish lifecycles linked]
35. Scientific collaboration ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier Source: Andrea Wiggins , talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
36. ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier Source: Andrea Wiggins , talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
37. ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier Source: Andrea Wiggins , talk given at the School of Computer Science, University of Manchester, UK, June 18th, 2009
41. Publishing for collaboration. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. Lifecycle: Develop, Run, Analyse Results, Publish; collect and query provenance metadata. Design-time reuse: composition from existing workflows. Runtime reuse: workflows as services. Compare results across versions; foster virtual scientific communities; provenance exchange and interoperability (the OPM experiment).
42. Collaboration in the workflow space ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier What ? Where ? Why ? Who ? How ? Develop Run Analyze Publish Develop Run Analyze Publish
55. Workflows and Services: Curation by Experts; Social Curation by the Crowd; Self-Curation by Contributors; Automated Curation. [Diagram: each follows a seed, refine, validate cycle]
60. Collaboration in the workflow space ESIP meeting,Santa Barbara, CA, July 2009 - P. Missier What ? Where ? Why ? Who ? How ? Develop Run Analyze Publish Develop Run Analyze Publish
72. Provenance interoperability for open science. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier. OPM: the Open Provenance Model. [Diagram: two Develop-Run-Analyze-Publish lifecycles linked through OPM]
73. The Science Lifecycle (adapted from David De Roure's slides). [Diagram: scientists, graduate and undergraduate students, and next-generation researchers; experimentation producing Data, Metadata, Provenance, Scripts, Workflows, Services, Ontologies, Blogs, ...; flowing into Local Web Repositories, Virtual Learning Environments, Digital Libraries, Technical Reports, Reprints, Preprints & Metadata, Peer-Reviewed Journal & Conference Papers, and Certified Experimental Results & Analyses]
74. Finding the provenance of research outputs across all the systems the data transited through. [Diagram: the same Science Lifecycle elements as slide 73]
75. Provenance Across Applications (adapted from Luc Moreau's slides: "The Open Provenance Model", University of Southampton, UK, 2009). [Diagram: multiple applications with local provenance stores connected through a Provenance Inter-Operability Layer that imports from and exports to OPM]
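As a toy illustration of the kind of structure such an inter-operability layer exchanges (OPM models artifacts and processes linked by edges such as "used" and "wasGeneratedBy"; the Python representation here is invented for illustration and is not an OPM library):

```python
from dataclasses import dataclass, field

@dataclass
class OPMStyleGraph:
    """A minimal container for OPM-style provenance assertions."""
    artifacts: set = field(default_factory=set)
    processes: set = field(default_factory=set)
    used: list = field(default_factory=list)               # (process, artifact)
    was_generated_by: list = field(default_factory=list)   # (artifact, process)

    def record_step(self, process, inputs, outputs):
        """Assert that `process` used `inputs` and generated `outputs`."""
        self.processes.add(process)
        for artifact in inputs:
            self.artifacts.add(artifact)
            self.used.append((process, artifact))
        for artifact in outputs:
            self.artifacts.add(artifact)
            self.was_generated_by.append((artifact, process))

# One application exports its local provenance into this shared vocabulary;
# another can import it and stitch the graphs together on shared artifacts.
graph = OPMStyleGraph()
graph.record_step("kegg_pathway_search", ["kegg_gene_ids"], ["pathway_list"])
```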
81. Upcoming events. ESIP meeting, Santa Barbara, CA, July 2009 - P. Missier.
SWPM 2009: The First International Workshop on the Role of Semantic Web in Provenance Management, http://wiki.knoesis.org/index.php/SWPM-2009 , co-located with ISWC'09, October 25/26 2009, Washington D.C., USA. Submission deadline: Friday, July 31, 2009.
Special issue of the Future Generation Computer Systems Journal (FGCS) on the third provenance challenge (to be announced); expected deadline: Dec. 2009.
Editor's Notes
Repetitive, mundane work made easier, more reliable and adaptable. Big science and collaborative science.
Interoperability, integration and collaboration; automated processing; interactive; repetitive and accurate compound processes (protocols); transparent processes; data flow; trackable results; agile software development.
This workflow searches for genes which reside in a QTL (Quantitative Trait Loci) region in the mouse, Mus musculus. The workflow requires as input a chromosome name or number, a QTL start base-pair position, and a QTL end base-pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers, which are in turn used to search for pathways in the KEGG pathway database. This is pathways_and_gene_annotations_for_qtl_phenotype_28303, executed with chromosome = 17, start_position = 28500000, end_position = 32500000. (A rough Python outline of this dataflow follows below.)
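A rough Python outline of that dataflow, with placeholder functions standing in for the BioMart and KEGG service calls the workflow actually makes (function names and record fields are assumptions for illustration):

```python
def genes_in_region(chromosome: str, start_bp: int, end_bp: int) -> list[dict]:
    """Placeholder for the BioMart query: genes in the QTL region, each
    annotated with its symbol and its Entrez and UniProt identifiers."""
    raise NotImplementedError("call BioMart here")

def kegg_gene_ids(entrez_id: str, uniprot_id: str) -> list[str]:
    """Placeholder for the KEGG cross-reference lookup."""
    raise NotImplementedError("call KEGG here")

def kegg_pathways(kegg_gene_id: str) -> list[str]:
    """Placeholder for the KEGG pathway database search."""
    raise NotImplementedError("call KEGG here")

def qtl_to_pathways(chromosome: str, start_bp: int, end_bp: int) -> dict:
    """Mirror the workflow's overall dataflow: region -> genes -> pathways."""
    pathways: dict = {}
    for gene in genes_in_region(chromosome, start_bp, end_bp):
        for kegg_id in kegg_gene_ids(gene["entrez"], gene["uniprot"]):
            pathways.setdefault(gene["symbol"], []).extend(kegg_pathways(kegg_id))
    return pathways

# The example run mentioned above:
# qtl_to_pathways("17", 28_500_000, 32_500_000)
```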
Mention scalability: we support large datasets as well, in addition to these small lists, and lists can grow very large.
The case study of Paul's work: different data types, held in geographically different places and belonging to different subdisciplines. The phenotypic response is investigated using microarrays, in the form of expressed genes, or through evidence provided by QTL mapping. Genes captured in the microarray experiment and present in the QTL (Quantitative Trait Loci) region: microarray + QTL.
The workflow produces OMIM-tagged diseases, which can be used to automatically enrich the proto-ontology in RDF.
- caGrid (or the Grid) is the underlying network architecture and platform that provides the basis for connectivity of caBIG® tools.
- The ARC (Advanced Resource Connector) plugin for Taverna, used for medical imaging. Taverna is a workflow management system well known in bioinformatics. To show the ease of transitioning from local to grid, we execute MUSTANG first on the command line, then on the grid using Taverna. We then present two examples from the field of medical imaging, the first of which has to deal with huge temporary datasets; it thus greatly benefits from ARC's storage management and grid URL handling capabilities. The last example shows how one can achieve rapid testing iterations by separating the program binary from its use case description and the workflow. Finally, myExperiment is presented, a free web community for sharing Taverna workflows. By preparing a use case and an example workflow, one can make a program easily usable by everyone, since the ARC middleware equalizes different grid configurations. (video)
Aimed at different layers of the software stack ("The Many Faces of IT as Service", Foster & Tuecke, 2005). "Provisioning": from reservation to configuration to ... make sure the resource will do what I want it to do, with the right qualities of service. Virtualization = separation of concerns between provider and consumer of "content" (client and service; service provider and resource provider). Provisioning = assemble and configure resources to meet user needs. Management = sustain desired qualities of service despite a dynamic environment.
The GT4 plug-in supports semantic-based service query. Users can input multiple service query criteria (up to three; in the current scenario three is enough, and we can add more upon request) together with the corresponding values, and multiple criteria can be combined. The initial GUI shows only one query criterion, but more can be added by clicking the "Add Service Query" button. For example, we can query the caGrid services whose "Research Center" name is "Ohio State University", whose Service Name is "DICOMDataService", and which have the operation "PullOp". (A plain-Python illustration of combining such criteria follows below.)
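In plain Python terms this amounts to filtering services on a conjunction of name-value pairs; the snippet below is purely illustrative and is not the plug-in's actual API:

```python
criteria = {
    "Research Center": "Ohio State University",
    "Service Name": "DICOMDataService",
    "Operation": "PullOp",
}

def matches(service: dict, wanted: dict) -> bool:
    """A service matches only if every selected criterion holds (AND semantics).
    Real service metadata is richer; simple equality keeps the sketch short."""
    return all(service.get(name) == value for name, value in wanted.items())

# hits = [s for s in all_services if matches(s, criteria)]
```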
we are now taking a broader view... what we have shown so far is just the “run” part of a more comprehensive lifecycle
Beanshell scripting and XML processing support inside the workflows. Taverna 2: long-running workflows, data reference handling, data streaming and staging, multiple extensibility points. Further Taverna 2 properties: new data reference handling, security management, provenance management, asynchronous processors and data streaming, explicit monitoring and steering support, a better new dispatch layer that supports dynamic service binding and service invocation through a resource broker, and improved concurrency handling at the workflow level.
The clustalw program from EMBOSS is called 'emma'. Services are not deposited and preserved in software libraries. Rapid metadata heart-beat, especially on operational metadata. BioNanny: using Grid tools. Use myExperiment to notify scientists of potential problems, and to be smart about which services should be monitored. Workflows are deposited but... they are not self-contained: they link to external services in flux, depend on software, or incorporate services unavailable to others. Hence workflow fragility and decay: workflows become plans and provenance rather than working scientific objects unless tended and updated.
For DEMO: search "adaptivedisclosure". Key point: large-scale curation of rich service descriptions through community engagement, sustained over time, to ensure the quality of annotations. What makes BioCatalogue stand apart from other approaches to service repositories? BioCatalogue is a "super-registry" that is able to accommodate service descriptions from multiple different source registries thanks to a flexible annotation model. Context of application: a distributed, P2P BioCatalogue.
SeekDA is DERI's search engine for Web Services. http://www.ebi.ac.uk/uniprot-das/
Curation results in trust and therefore usage, which encourages and justifies sustained further curation effort. Curation really happens in all the associated tools that reference and use BioCatalogue, e.g. curation of Taverna services, of myExperiment workflows, etc. Accreditation of curators: an idea borrowed from the Wikipedia style of community contribution; not all contributors are equal, but all are entitled to provide contributions.
How are the types of annotations chosen? Contributors are free to add pretty much any type of annotation at the moment, using a simple tagging mechanism, i.e., annotations are not necessarily locked into controlled vocabularies or ontologies. Think of them as name-value pairs for the time being, but metadata is already exposed as RDF through the API. Users can rate services; the main rating criteria are expected to be ease of use, availability and quality of documentation. Separate the subjective from the objective. Additionally, there is room for automated curation: using Quasar, for example, for service monitoring and the associated automated annotations, or for functional testing of services. Let users and providers upload complex test scripts, including full-fledged workflows; BioCatalogue can run these periodically and annotate the services with the outcome (think "JUnit test reports" as automated annotations). Service Profile Wheel: Availability, Freshness. (A toy sketch of the name-value-to-RDF mapping follows below.)
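A toy sketch of that name-value-to-RDF mapping using rdflib; the namespace, property naming and example values are made up for illustration and are not BioCatalogue's actual vocabulary:

```python
from rdflib import Graph, Literal, Namespace, URIRef

ANNOT = Namespace("http://example.org/annotation/")   # illustrative namespace

def annotations_to_rdf(service_uri: str, pairs: dict) -> Graph:
    """Expose simple name-value annotations as RDF triples about a service."""
    g = Graph()
    subject = URIRef(service_uri)
    for name, value in pairs.items():
        g.add((subject, ANNOT[name.replace(" ", "_")], Literal(value)))
    return g

g = annotations_to_rdf(
    "http://example.org/services/42",
    {"documentation quality": "good", "availability": "99.2%"},
)
print(g.serialize(format="turtle"))
```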
Automated monitoring & testing: test scripts, endpoint availability, mean time to failure. Partner feeds: myExperiment.org workflow profile. Update feeds to users. Develop incentives. Experts for oversight. How do we rank? How do we compare the non-alike?
Cite FLOSS
myExperiment is as much an engineering project as it is a social experiment
e-Science is me-Science: aligning the community with the individual. But we have to be aware of the drivers for collaboration. Competitive advantage: be the first with the Nature paper. Academic vanity: credit, credibility, fame, acclaim, recognition, peer respect, reputation. Adoption: get my stuff adopted and recognised; more funding. And the fears: being found out (open to rigorous inspection); being scooped (beaten by lab X); protecting my turf; releasing results too early; getting left behind; being out of fashion; being misinterpreted or misrepresented; looking stupid; losing control; taking a risk.
attribution -> provenance
Transferring data, methods and know-how from one discipline to another (e.g. astronomy image analysis applied to cancer tissue microarrays). How do you find relevant material that uses different jargon, in a different discipline, organised to suit only its experts? Validation: how do I know it does what it says it does? Reproducibility: when the services are volatile. Reusability: when it contains in-house code or applications. Longevity: will this workflow still run in 6 months' time? Palpability: why does this workflow fail? Why does it work? How does it work?
Validation: how do I know it does what it says it does? Reproducibility: when the services are volatile. Reusability: when it contains in-house code or applications. Longevity: will this workflow still run in 6 months' time? Palpability: why does this workflow fail? Why does it work? How does it work?
myExperiment is as much an engineering project as it is a social experiment
Step back: what do we mean by provenance, in general? Data and process provenance. What do we do with it? We collect a great deal of metadata, at a very fine level of granularity: what can we expect to get out of it?
see dedicated in-depth talk for details on our approach
A lineage query lets users identify variables that carry interesting values for which provenance is sought, i.e., nodes in the graph where provenance information should be reported. (A toy backwards traversal over a provenance graph is sketched below.)
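As a toy illustration of what such a lineage query does over recorded provenance (the data structure and variable names are invented for this sketch, not Taverna's provenance model), one can walk backwards from a variable of interest over "derived from" edges:

```python
from collections import deque

# Provenance recorded as: each value -> the values it was directly derived from.
derived_from = {
    "pathway_list": ["kegg_gene_ids"],
    "kegg_gene_ids": ["entrez_ids", "uniprot_ids"],
    "entrez_ids": ["qtl_region"],
    "uniprot_ids": ["qtl_region"],
    "qtl_region": [],
}

def lineage(variable: str) -> set:
    """Return every upstream value that contributed to `variable`."""
    seen, queue = set(), deque([variable])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(lineage("pathway_list"))
# -> {'kegg_gene_ids', 'entrez_ids', 'uniprot_ids', 'qtl_region'}
```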