The document discusses roadblocks that hinder realizing the vision of an integrated scholarly knowledge environment, focusing on issues with information technology, networked libraries, and scholarly endeavors. It identifies that relationships between inherently related scholarly assets are not technically linked in the current system. The author argues that the scholarly communication system needs to embrace the networked environment by moving beyond merely digitizing the paper-based system and enabling relationships and connections between information. Standards, interoperability challenges, the emergence of repositories, e-science, and rights frameworks are discussed as drivers of needed changes to the scholarly system.
The OAI-ORE Interoperability Framework in the Context of the Current Scholarly Communication (Herbert Van de Sompel)
The document discusses the OAI-ORE Interoperability Framework in the context of current scholarly communication. It describes how OAI-ORE was funded and lists the editors. It then discusses how the current scholarly system is like a scanned paper system and outlines some technical trends emerging, including augmenting scholarship with machine-readable content, integrating datasets into the scholarly record, and exposing scholarly processes.
This document provides a summary of Matthieu Bourgery's educational background and professional experience. He holds multiple degrees including a Master's in Marine Biodiversity and Biotechnology from Heriot-Watt University and a Bachelor's in Biology from Orleans University. His professional experience includes positions conducting molecular diagnostics, working as a molecular diagnostic scientist, and currently pursuing a PhD in Finland studying microRNA regulation of bone homeostasis. He has strong skills in presentations, writing, laboratory techniques, and statistical and molecular biology software.
This document discusses augmenting interoperability across scholarly repositories. It proposes a shared data model and services using core data surrogates that can be obtained, harvested, and put across repositories. This would allow richer cross-repository services and enable scholarly communication as a global workflow. A Pathways core data model is presented for representing digital objects uniformly across repositories to support interoperable functions.
This document discusses community standards for reproducible and reusable bioscience research. It outlines the importance of consistent reporting to maximize the value of collective scientific outputs. However, there are challenges due to the large number of bioscience reporting standards and lack of knowledge about how they relate. The document calls for a coherent catalogue of data sharing resources to evaluate standards, show relationships among them, and promote interoperability. This would help researchers make informed choices about standards and facilitate structured descriptions of experiments across domains.
1) Scientist-edited wiki websites have become popular ways for biologists to manage and interpret the large amounts of genomic and other biological data being produced.
2) These wiki sites aim to help researchers make sense of the data flooding into public databases by allowing many annotators to contribute, in contrast to traditional smaller teams of annotators.
3) However, getting researchers to actually contribute to the wiki sites, rather than just take information from them, has been a challenge, as scientists are often too busy or secretive to cooperate openly. Whether wiki approaches can succeed where previous community-driven data sharing efforts have failed remains to be seen.
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
This document summarizes a presentation about scientific workflow systems and related technologies including Taverna, Biocatalogue, and myExperiment. Taverna is a workflow management system that allows researchers to design and run workflows linking various bioinformatics services. Biocatalogue is a public registry of life science web services. MyExperiment is a repository for sharing workflows. The document discusses how these tools help scientists conduct experiments and analyze and preserve results.
The document describes Allie, a database and search service for abbreviations and long forms in the life sciences. Allie searches MEDLINE titles and abstracts to generate pairs of abbreviations and their corresponding long forms. It displays potential matches along with bibliographic data and contextual information to help users understand abbreviations. Allie is updated weekly and its data is available via its website and as linked open data. It receives over 7,000 unique visits per month to its search service.
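Allie's own pipeline is not described in detail here, but the pairing task it performs can be illustrated with a minimal sketch in the spirit of the classic Schwartz-Hearst heuristic; the function names and sample sentence below are invented for illustration:

```python
import re

def extract_pairs(text):
    """Pair parenthesised abbreviations with a preceding long form,
    in the spirit of the Schwartz-Hearst heuristic (illustrative only)."""
    pairs = []
    for m in re.finditer(r'\(([A-Za-z][\w-]{1,9})\)', text):
        sf = m.group(1)
        # Candidate long form: up to len(sf) + 5 words before the parenthesis.
        words = text[:m.start()].split()[-(len(sf) + 5):]
        for i in range(len(words)):
            cand = ' '.join(words[i:])
            if matches(sf, cand):
                pairs.append((sf, cand))
                break
    return pairs

def matches(sf, lf):
    """Check that the short form's characters occur, in order, in the
    long form, and that its first character starts the long form."""
    lf_l, sf_l = lf.lower(), sf.lower()
    if not lf_l.startswith(sf_l[0]):
        return False
    pos = 0
    for ch in sf_l:
        pos = lf_l.find(ch, pos)
        if pos == -1:
            return False
        pos += 1
    return True

print(extract_pairs("We measured brain-derived neurotrophic factor (BDNF) levels."))
# -> [('BDNF', 'brain-derived neurotrophic factor')]
```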
1. The document discusses how a biologist, Marco Roos, became interested in e-science through his work in molecular and cellular biology, bioinformatics, and data integration projects.
2. Roos describes how e-science allows for collaboration between different experts and disciplines through technologies like workflows, semantic web, and virtual laboratories.
3. Roos emphasizes that e-science should empower scientists by making tools and resources easy to use, share, and build upon so that scientists can focus on scientific problems rather than technical challenges.
The document discusses the Open Archives Initiative's Object Re-Use and Exchange (OAI-ORE) effort. OAI-ORE aims to develop standards and protocols to facilitate discovery, referencing, access, aggregation, and processing of complex digital objects across repositories. It takes a web-centric approach, seeing digital objects as compound information objects that may have multiple representations. The standards aim to address challenges like consistently linking related objects and enabling discovery of all parts of an object. The talk outlines motivations, examples, and design considerations for OAI-ORE's work on these challenges.
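To make the compound-object idea concrete, here is a minimal sketch of an ORE-style aggregation expressed with rdflib; the URIs are invented placeholders, and a real ORE Resource Map would carry more metadata (authorship, timestamps, and so on):

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)

# Hypothetical URIs for a resource map, its aggregation, and the parts
# of a compound scholarly object (article + dataset + slides).
rem = URIRef("http://example.org/article-123/rem")
agg = URIRef("http://example.org/article-123/aggregation")
parts = [URIRef("http://example.org/article-123/fulltext.pdf"),
         URIRef("http://example.org/article-123/dataset.csv"),
         URIRef("http://example.org/article-123/slides.pdf")]

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
for p in parts:
    g.add((agg, ORE.aggregates, p))  # one triple per constituent resource

print(g.serialize(format="turtle"))
```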
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Introduction to Ontologies for Environmental Biology (Barry Smith)
1. The document introduces ontologies for environmental biology and discusses several disciplines that could benefit from their use, including GIS, ecology, environmental biology, and various "-omics" fields.
2. It describes what an ontology is and compares ontologies to legends for maps or diagrams, which allow integration and help humans and computers make sense of complex data. Ontologies provide standardized terminology and annotations.
3. The document outlines the Open Biomedical Ontologies (OBO) Foundry, a collection of interoperable reference ontologies for annotating biomedical data. Foundry ontologies include the Gene Ontology and other ontologies for molecules, cells, anatomical structures, and more. They are developed through consensus and share common design principles.
Towards Incidental Collaboratories; Research Data Services (Anita de Waard)
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
This document provides an introduction and overview of a manual annotation workshop using the Web Apollo genome annotation tool. It discusses manual annotation and community-based curation efforts. The workshop aims to teach participants how to identify genes of interest, become familiar with Web Apollo, learn how to corroborate and modify gene models using evidence, and understand the genome annotation process from assembly to manual curation. The document outlines the workshop activities and provides guidance on using Web Apollo, including navigating the interface, editing annotations, and annotating simple cases by adding or modifying exons.
The document discusses scientific workflow management systems and collaboration in workflow-based science. It notes that collaboration requires that a scientist be able to make sense of third-party data, and that this requires the data to be accompanied by provenance metadata that describes how the data was generated and processed. The concept of a "Research Object" is introduced as a way to package scientific data and workflows together with provenance and other related information to enable collaboration and reuse.
PDT: Personal Data from Things, and its provenance (Paolo Missier)
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
Paper presentations: UK e-science AHM meeting, 2005 (Paolo Missier)
The document describes an ontology-based approach to handling information quality in e-science. It presents an initial quality framework that captures scientists' quality requirements and allows defining domain-specific quality characteristics. It introduces a web service that annotates datasets with quality metrics based on how well their elements conform to relevant ontologies, using transcriptomics as an example domain. The approach aims to make quality definitions reusable and the computation of quality measurements over large datasets cost-effective.
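A toy version of the conformance idea might look like the following (illustrative only; the paper's framework is ontology-driven and configurable per domain, not a flat vocabulary lookup):

```python
def conformance(values, vocabulary):
    """Fraction of values that resolve to a term in the vocabulary;
    a crude stand-in for ontology-based quality annotation."""
    known = sum(1 for v in values if v.strip().lower() in vocabulary)
    return known / len(values) if values else 0.0

# Hypothetical transcriptomics annotations checked against a tiny vocabulary.
vocab = {"apoptosis", "cell cycle", "dna repair"}
annotations = ["Apoptosis", "cell cycle", "unknown process", "DNA repair"]
print(conformance(annotations, vocab))   # 0.75
```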
This document discusses encoding provenance graphs and PROV constraints using Datalog rules. It maps PROV notation graphs to a database of facts and encodes most PROV constraints as Datalog rules. This allows for declarative specification of provenance graphs with deductive inference, enabling validation of graphs and rapid prototyping of analysis algorithms. Some limitations include inability to encode certain constraints and attributes in graph relations. The approach provides a proof of concept for representing and reasoning over provenance graphs with Datalog.
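The paper encodes the constraints in Datalog itself; the sketch below mimics the encoding idea in plain Python, storing PROV statements as a database of fact tuples and running rule-like inference (transitive closure of derivation) plus a check in the spirit of PROV's ordering constraints, by naive bottom-up iteration:

```python
# PROV statements stored as a database of fact tuples,
# e.g. ('wasDerivedFrom', 'e2', 'e1') means e2 was derived from e1.
facts = {
    ('wasDerivedFrom', 'e2', 'e1'),
    ('wasDerivedFrom', 'e3', 'e2'),
    ('wasGeneratedBy', 'e2', 'a1'),
}

def infer(facts):
    """Naive bottom-up evaluation of two Datalog-style rules:
       derived(X, Y) :- wasDerivedFrom(X, Y).
       derived(X, Z) :- derived(X, Y), derived(Y, Z)."""
    db = set(facts)
    while True:
        derived = {(x, y) for (p, x, y) in db
                   if p in ('wasDerivedFrom', 'derived')}
        new = {('derived', x, y) for (x, y) in derived} | \
              {('derived', x, z) for (x, y) in derived
               for (y2, z) in derived if y == y2}
        new -= db
        if not new:
            return db
        db |= new

def violations(db):
    """A check in the spirit of PROV's ordering constraints:
       no entity may (transitively) derive from itself."""
    return [x for (p, x, y) in db if p == 'derived' and x == y]

db = infer(facts)
print(sorted(f for f in db if f[0] == 'derived'))
# [('derived', 'e2', 'e1'), ('derived', 'e3', 'e1'), ('derived', 'e3', 'e2')]
print(violations(db))   # [] -- this example graph is acyclic
```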
SWPM12 report on the Dagstuhl seminar on Semantic Data Management (Paolo Missier)
The document summarizes discussions that took place at a Dagstuhl seminar on provenance in semantic data management in April 2012. Key points discussed include:
1) The need for provenance-specific benchmarks and reference data sets to better understand provenance usage and properties.
2) Proposals to collect provenance traces from various domains in a community repository using the PROV standard for interoperability.
3) Challenges of representing and reasoning with uncertain provenance information from sources like sensors, NLP, and human errors.
Structured Occurrence Network for provenance: talk for IPAW'12 paper (Paolo Missier)
The document discusses using structured occurrence networks (SONs) to model provenance. SONs extend occurrence networks (ONs) to represent the activity of complex systems through relationships between multiple ONs. The goal is to explore using SONs as a formal model of provenance, viewing data as an evolving system and agents as also evolving systems. Communication SONs are introduced to capture communication between concurrently proceeding ONs. This establishes patterns for representing workflow and multi-layered provenance using SONs.
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 (Paolo Missier)
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proceedings of the 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti... (Paolo Missier)
The document discusses fine-grained provenance tracking of workflow data products. It presents a functional model for collection-oriented workflow processing that models workflows operating on nested collections. This model generalizes simple iteration to arbitrary collection depths and handles multiple input collections through a generalized cross product operation. The model aims to enable efficient provenance querying by traversing the workflow graph instead of the potentially larger provenance graph.
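A minimal sketch of the idea, with invented helper names (the paper's actual model is algebraic and handles arbitrary nesting far more generally): record, for each output item, the input positions it was derived from, so lineage queries can be answered from these compact mappings rather than a fully materialised provenance graph:

```python
def map_nested(f, coll):
    """Structure-preserving map over an arbitrarily nested list,
    returning (output, lineage) where lineage maps each output leaf's
    index path to the input leaf's index path it was derived from."""
    lineage = {}
    def go(node, path):
        if isinstance(node, list):
            return [go(child, path + (i,)) for i, child in enumerate(node)]
        lineage[path] = path          # a map keeps positions aligned
        return f(node)
    return go(coll, ()), lineage

def cross_apply(f, xs, ys):
    """Generalised cross product over two input collections: each output
    position records the pair of input positions it came from."""
    out, lineage = [], {}
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            lineage[len(out)] = (i, j)
            out.append(f(x, y))
    return out, lineage

out, lin = map_nested(lambda v: v * v, [[1, 2], [3]])
print(out, lin[(0, 1)])        # [[1, 4], [9]] (0, 1)

out, lin = cross_apply(lambda a, b: a + b, [10, 20], [1, 2, 3])
print(out[4], lin[4])          # 22 (1, 1): out[4] from xs[1], ys[1]
```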
The document discusses porting genome sequencing data processing pipelines from scripted HPC implementations to workflow models on the cloud. This allows the pipelines to be more scalable, flexible, and evolvable. Tracking provenance is also important for using results as clinical evidence and analyzing differences when the pipelines change. Preliminary tests on the Microsoft Azure cloud show potential cost savings from improved resource utilization.
The document discusses scientific workflow management systems and provenance. It notes that momentum is growing around data sharing, as evidenced by a special issue of Nature on the topic. Effective data sharing requires standards for packaging data with metadata into self-descriptive research objects, as well as representation of process provenance using workflow descriptions. Provenance captures causal relationships in scientific data and is important for understanding, reusing, and validating others' work. The Open Provenance Model aims to standardize provenance representation.
The document discusses integrating data from multiple sources on-the-fly without prior knowledge of the schemas. It proposes using approximate entity reconciliation, which leverages techniques like record linkage, approximate joins, and adaptive query processing. The key challenges are trading off completeness of integration for query response time and implementing a hybrid join algorithm that switches between exact and approximate joins to optimize this tradeoff.
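As a rough sketch of the hybrid idea (invented record schema and threshold; the paper's adaptive algorithm switches phases based on query-time constraints rather than a fixed fallback):

```python
from difflib import SequenceMatcher

def hybrid_join(left, right, threshold=0.85):
    """Toy hybrid join: try exact matches first (hash join), then fall
    back to approximate string matching for unmatched left keys."""
    index = {}
    for rec in right:
        index.setdefault(rec["name"], []).append(rec)
    out = []
    for rec in left:
        if rec["name"] in index:                      # exact phase
            out += [(rec, r) for r in index[rec["name"]]]
        else:                                         # approximate phase
            for r in right:
                sim = SequenceMatcher(None, rec["name"], r["name"]).ratio()
                if sim >= threshold:
                    out.append((rec, r))
    return out

left = [{"name": "J. Smith"}, {"name": "A. Jones"}]
right = [{"name": "J. Smith"}, {"name": "A. Jonas"}]
print(len(hybrid_join(left, right)))   # 2: one exact pair, one fuzzy pair
```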
ProvAbs: model, policy, and tooling for abstracting PROV graphs (Paolo Missier)
This document presents ProvAbs, a model, policy language, and tool for abstracting PROV graphs to enable partial disclosure of provenance data. The model groups nodes in a PROV graph and replaces them with a new abstract node while preserving the graph's validity. A policy assigns sensitivity levels to nodes and drives the node selection for abstraction. The ProvAbs tool implements the abstraction model and allows interactively exploring policy settings and clearances to generate abstract views of a PROV graph.
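As a toy sketch of the grouping operation (not ProvAbs's actual algorithm, which must also keep the result a valid PROV graph and honour the sensitivity policy): replace a set of nodes in a directed graph with a single abstract node, dropping internal edges and rewiring the boundary ones:

```python
def abstract_nodes(edges, group, new_node):
    """Collapse every node in `group` into `new_node`: edges internal
    to the group disappear, boundary edges are rewired to the new node.
    `edges` is a set of (src, dst) pairs."""
    rename = lambda n: new_node if n in group else n
    return {(rename(s), rename(d)) for (s, d) in edges
            if not (s in group and d in group)}

# Hypothetical provenance edges (e.g. wasDerivedFrom / used links).
edges = {('e2', 'e1'), ('e3', 'e2'), ('e4', 'e3'), ('e3', 'a1')}
print(sorted(abstract_nodes(edges, {'e2', 'e3'}, 'abs1')))
# [('abs1', 'a1'), ('abs1', 'e1'), ('e4', 'abs1')]
```

The interesting part in the real model is choosing the abstract node's type and surviving relations so that the abstracted graph is still valid PROV.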
Your data won’t stay smart forever: exploring the temporal dimension of (big ... (Paolo Missier)
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership” of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
Big Data Quality Panel: Diachron Workshop @EDBT (Paolo Missier)
1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
The lifecycle of reproducible science data and what provenance has got to do ... (Paolo Missier)
The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central (Paolo Missier)
This document discusses moving whole exome sequencing pipelines to the cloud using e-Science Central workflow management. The goal is to process 3000 exomes from neurological patients in a scalable and cost-effective way. Current scripts are being ported to e-Science Central for improved abstraction, execution, and provenance tracking. Provenance will help compare results from different pipeline versions and support clinical diagnosis. Initial testing with 300 exomes will begin, with full scalability testing planned for September 2014.
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
The Evolution of e-Research: Machines, Methods and Music (David De Roure)
The document summarizes the evolution of e-research over three generations from 1981 to the present. The first generation saw early adopters using tools within their disciplines with some reuse. The second generation was characterized by increased reuse of tools, data and methods across areas. The third generation is defined by radical sharing of resources globally across any discipline through social networks and reusable research objects. The document also discusses several specific projects and tools that exemplify each generation of e-research including myExperiment, Galaxy, and SALAMI.
The document discusses reproducible bioscience data. It describes Susanna-Assunta Sansone as a principal investigator and team leader at the University of Oxford e-Research Centre who gives a presentation on policies, communities, and standards around reproducible bioscience data. The presentation covers topics like preserving institutional memory, utilizing public data, and addressing reproducibility and reuse of public data through community standards and structured data annotation.
Taverna is a free and open-source workflow management system that allows researchers to design and execute scientific workflows. It was developed by the University of Manchester to support in silico experiments in biology. Taverna provides a graphical user interface for designing workflows using a variety of distributed data sources and web services without having to learn complex programming. It has been widely adopted by researchers in fields such as biology, healthcare, astronomy, and cheminformatics to automate analysis pipelines and share workflows.
The Symbiotic Nature of Provenance and Workflow (Eric Stephan)
This document discusses the symbiotic relationship between provenance and workflows in scientific research. It notes that workflows provide automation and integration capabilities, while provenance provides documentation of what transpired. The document provides examples of workflow and provenance technologies and outlines challenges around interoperability. It concludes that recognizing the interdependent relationship between provenance and workflows can help advance systems science research.
Processing Amplicon Sequence Data for the Analysis of Microbial Communities (Martin Hartmann)
This document provides an overview of next-generation sequencing (NGS) technologies and their usefulness for analyzing microorganisms associated with plants. It discusses how NGS methods allow addressing previously impossible questions about the composition, function, and interactions of microbial communities in environments like the rhizosphere and phyllosphere. While powerful, NGS platforms have limitations that can introduce errors or biases, but methods exist to overcome these issues. The review highlights applications of NGS in metagenomic studies of plant-associated microbiomes and how these new techniques are transforming the field.
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genomic curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore, researchers now face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
This document discusses data management and curation in bioinformatics. It describes Susanna-Assunta Sansone as the principal investigator and team leader at the University of Oxford e-Research Centre, where her team works on data management, biocuration, software development, databases, and community standards and ontologies for various domains including toxicology, health, and agriculture. The document promotes the importance of data standards to enable data sharing and reproducibility in bioscience research.
Acting as Advocate? Seven steps for libraries in the data decade (Liz Lyon)
UKOLN advocates that libraries take seven steps to support data management and open science in the data decade:
1) Provide briefings on cloud data services in partnership with IT services.
2) Build usable data management tools in partnership with researchers.
3) Develop data sustainability strategies and articulate the costs and benefits.
4) Publish case studies on open science to show benefits of universal data sharing.
5) Present at university ethics committees to highlight open data issues.
6) Raise awareness of citizen science opportunities and guidelines for good practice.
7) Promote data citation and attribution to embed in publication practice.
The case for cloud computing in Life Sciences (Ola Spjuth)
This document summarizes Ola Spjuth's background and research interests related to cloud computing in life sciences. Spjuth is an associate professor who manages bioinformatics resources at SciLifeLab and UPPMAX. His research focuses on developing e-infrastructure, automation methods, and applied e-science using tools like Docker and Kubernetes. He is working on projects applying these technologies to problems in drug discovery and predictive modeling of image data.
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
Web Apollo: Lessons learned from community-based biocuration efforts (Monica Munoz-Torres)
This presentation tries to highlight the importance and relevance of community-based curation of biological data. It describes the results of harvesting expertise from dispersed researchers assigning functions to predicted and curated peptides, as well as collaborative efforts for standardization of genes and gene product attributes across species and databases.
Jeremy Hadidjojo is a PhD candidate in physics at the University of Michigan with expertise in computational physics, mathematical modeling, simulation, and data analysis. His research focuses on developing physical models of biological pattern formation and applying machine learning techniques to analyze complex systems. He has extensive programming skills in MATLAB, Python, C++ and experience with parallel and GPU computing. His published works include modeling mechanisms of planar cell chirality and retinal cone patterning in zebrafish.
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
The document discusses the evolution of science and research from the 1940s to present day. It notes Vannevar Bush's 1945 concerns about the growing mountain of research that scientists did not have time to fully understand or remember. It then discusses the current "data explosion" and challenges of accessing, sharing, and building on increasingly large amounts of data and research. The document advocates for reusable, reproducible, and transparent science through connected resources and environments that facilitate collaboration and knowledge sharing.
Similar to Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
please see paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance are in fact in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
a brief intro on the data challenges associated with working with Health Care data, with a few examples, both from literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on Language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
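As a hedged sketch of the compare-inputs-and-outputs idea (illustrative only; the actual tool emits full provenance documents from the observed dataframe operations): diff the row indices and columns of a dataframe before and after a pipeline step to classify what the step did:

```python
import pandas as pd

def diff_step(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Classify a preprocessing step by comparing input/output frames:
    which rows were dropped, which columns appeared or disappeared."""
    return {
        "dropped_rows": sorted(set(before.index) - set(after.index)),
        "new_columns": sorted(set(after.columns) - set(before.columns)),
        "dropped_columns": sorted(set(before.columns) - set(after.columns)),
    }

df = pd.DataFrame({"age": [34, None, 51], "city": ["York", "Leeds", None]})
step = df.dropna()                     # the pipeline step being observed
print(diff_step(df, step))
# {'dropped_rows': [1, 2], 'new_columns': [], 'dropped_columns': []}
```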
Tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations (Paolo Missier)
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
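A small sketch of the kind of analysis described, on invented toy data (the study's actual pipeline over UK Biobank timelines is considerably richer): fit one topic model over patient-by-condition counts from two time windows, then measure how much each patient's cluster mixture drifts between them:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy patient x condition count matrices for two time windows.
window1 = rng.poisson(1.0, size=(100, 12))
window2 = rng.poisson(1.0, size=(100, 12))

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(np.vstack([window1, window2]))   # shared clusters across windows

theta1 = lda.transform(window1)          # patient-cluster associations, t1
theta2 = lda.transform(window2)          # patient-cluster associations, t2

# Instability proxy: how much each patient's cluster mixture drifts.
drift = np.abs(theta1 - theta2).sum(axis=1)
print("mean drift:", drift.mean())
```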
Digital biomarkers for preventive personalised healthcare (Paolo Missier)
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in data science (Paolo Missier)
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... (Paolo Missier)
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha... (Paolo Missier)
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff University (Paolo Missier)
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
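A minimal sketch of the control loop described above (all names are hypothetical; ReComp's real difference and impact functions are type-specific and provenance-driven): re-run a step only when the estimated impact of its input's change exceeds a threshold:

```python
def recomp(steps, old_inputs, new_inputs, impact, threshold=0.1):
    """Selective re-execution: for each step, estimate the impact of the
    change in its input and re-run only if it exceeds `threshold`.
    `steps` maps a step name to (function, input key)."""
    outputs, rerun = {}, []
    for name, (fn, key) in steps.items():
        change = impact(old_inputs[key], new_inputs[key])  # user-supplied diff
        if change > threshold:
            outputs[name] = fn(new_inputs[key])
            rerun.append(name)
        # else: keep the cached output from the previous execution
    return outputs, rerun

# Toy example: numeric inputs, impact = relative change.
impact = lambda old, new: abs(new - old) / max(abs(old), 1e-9)
steps = {"annotate": (lambda x: x * 2, "db_version"),
         "score":    (lambda x: x + 1, "threshold")}
old = {"db_version": 100, "threshold": 0.50}
new = {"db_version": 130, "threshold": 0.51}
print(recomp(steps, old, new, impact))
# ({'annotate': 260}, ['annotate']) -- only the impacted step re-runs
```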
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system; a minimal instrumentation sketch follows this list.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
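As a generic companion to topic 8, instrumenting a detector with the Prometheus Python client might look like the sketch below (this is not the tutorial's code; the metric names, scoring function, and threshold are invented):

```python
from prometheus_client import Counter, Gauge, start_http_server
import random, time

# Invented metric names for the anomaly detection example.
ANOMALIES = Counter("anomalies_total", "Anomalies flagged by the detector")
LAST_SCORE = Gauge("last_anomaly_score", "Most recent anomaly score")

def score_reading(value):
    return abs(value - 0.5)              # placeholder scoring function

if __name__ == "__main__":
    start_http_server(8000)              # exposes /metrics for Prometheus
    while True:
        score = score_reading(random.random())
        LAST_SCORE.set(score)
        if score > 0.45:                 # arbitrary alert threshold
            ANOMALIES.inc()
        time.sleep(1)
```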
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol, built on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
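As a point of reference for the technique named above, here is a toy, single-party simulation of the sumcheck protocol over a prime field, for a multilinear polynomial given by its evaluations on the boolean hypercube. It illustrates only the generic primitive that LatticeFold builds on, not LatticeFold's norm-control protocol itself; the prime and the example values are arbitrary.

```python
# Toy sumcheck: verify a claimed sum of a multilinear polynomial over {0,1}^n.
# For a multilinear f, each round's univariate g is linear, so sending
# g(0) and g(1) suffices. Illustrative only; not the LatticeFold protocol.
import random

P = 2**61 - 1  # an arbitrary prime modulus for the toy

def fold(table, r):
    """Fix the first variable of a multilinear evaluation table to r mod P."""
    half = len(table) // 2
    return [(table[i] * (1 - r) + table[half + i] * r) % P for i in range(half)]

def sumcheck(evals):
    claim = sum(evals) % P                   # prover's claimed total sum
    table = evals
    while len(table) > 1:
        half = len(table) // 2
        g0 = sum(table[:half]) % P           # g(0): first variable set to 0
        g1 = sum(table[half:]) % P           # g(1): first variable set to 1
        assert (g0 + g1) % P == claim        # verifier's round consistency check
        r = random.randrange(P)              # verifier's random challenge
        claim = (g0 * (1 - r) + g1 * r) % P  # g(r), since g is linear
        table = fold(table, r)
    assert table[0] % P == claim             # final check: f(r1, ..., rn)
    return True

print(sumcheck([random.randrange(P) for _ in range(8)]))  # 3-variable example
```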
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
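For intuition about the closed-addressing design described above, here is a schematic, single-threaded Python sketch of a hashtable with bounded per-bucket chains (the cache-line analogy), in which deletes free slots instantly. It is only a structural illustration: DLHT's lock-free operations, software prefetching, and non-blocking parallel resize are deliberately omitted, and the bucket size of 7 slots is an assumption.

```python
# Schematic closed-addressing table with bounded cache-line-style chains.
# Single-threaded toy; omits DLHT's concurrency and resizing machinery.
SLOTS_PER_BUCKET = 7  # roughly the pairs that fit in one cache line (assumed)

class Bucket:
    def __init__(self):
        self.slots = [None] * SLOTS_PER_BUCKET  # (key, value) or None
        self.next = None                        # overflow bucket in the chain

class ChainedTable:
    def __init__(self, n_buckets=1024):
        self.buckets = [Bucket() for _ in range(n_buckets)]

    def _chain(self, key):
        b = self.buckets[hash(key) % len(self.buckets)]
        while b is not None:
            yield b
            b = b.next

    def get(self, key):
        for b in self._chain(key):
            for s in b.slots:
                if s is not None and s[0] == key:
                    return s[1]
        return None

    def put(self, key, value):
        free = None
        for b in self._chain(key):
            for i, s in enumerate(b.slots):
                if s is not None and s[0] == key:
                    b.slots[i] = (key, value)   # update in place
                    return
                if s is None and free is None:
                    free = (b, i)               # remember first free slot
            last = b
        if free is None:                        # chain full: extend it
            last.next = Bucket()
            free = (last.next, 0)
        free[0].slots[free[1]] = (key, value)

    def delete(self, key):
        for b in self._chain(key):
            for i, s in enumerate(b.slots):
                if s is not None and s[0] == key:
                    b.slots[i] = None           # slot is freed instantly
                    return True
        return False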
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
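To make the signal-synthesis step concrete, the sketch below generates a GNSS-style spread-spectrum baseband in Python with NumPy: a data symbol is spread by a pseudorandom code and mixed onto a carrier, which is the kind of waveform an SDR front-end would replay. The spreading code here is random and the plain BPSK modulation is a simplification; real Galileo E1 uses published memory codes and CBOC modulation, which this toy omits.

```python
# Schematic GNSS-style signal synthesis (NOT Galileo's real codes/modulation).
import numpy as np

FS = 4_092_000          # sample rate: 4 samples per chip
CHIP_RATE = 1_023_000   # Galileo E1 primary code chipping rate (chips/s)
CODE_LEN = 4092         # chips per 4 ms primary code period
IF_FREQ = 250_000       # toy intermediate frequency (Hz)

rng = np.random.default_rng(1)
code = rng.choice([-1.0, 1.0], size=CODE_LEN)        # stand-in spreading code
chips = np.repeat(code, FS // CHIP_RATE)             # upsample to sample rate
t = np.arange(chips.size) / FS

data_bit = -1.0                                      # one navigation symbol
baseband = data_bit * chips                          # BPSK spreading
signal = baseband * np.cos(2 * np.pi * IF_FREQ * t)  # mix onto the carrier
# `signal` is the sample stream an SDR transmitter would replay over the air.
```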
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
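A minimal sketch of the retrieval step, assuming a Neo4j knowledge graph: the schema (Gene and Disease nodes linked by ASSOCIATED_WITH), the connection details, and the call_llm() helper are all invented for illustration.

```python
# GraphRAG sketch: retrieve graph facts, then ground the LLM answer in them.
# Schema, credentials, and call_llm() are hypothetical placeholders.
from neo4j import GraphDatabase  # pip install neo4j

QUERY = """
MATCH (g:Gene {symbol: $symbol})-[:ASSOCIATED_WITH]->(d:Disease)
RETURN d.name AS disease LIMIT 10
"""

def call_llm(prompt):
    # Stub: swap in your preferred LLM client here.
    return "(LLM answer grounded in the facts above)"

def retrieve_facts(driver, symbol):
    with driver.session() as session:
        return [record["disease"] for record in session.run(QUERY, symbol=symbol)]

def answer(driver, symbol, question):
    facts = retrieve_facts(driver, symbol)
    prompt = (
        "Answer using ONLY these facts from the knowledge graph:\n"
        + "\n".join(f"- {symbol} is associated with {d}" for d in facts)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "pass"))
print(answer(driver, "TP53", "Which diseases is this gene linked to?"))
```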
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. Best of all, everything is managed through our intuitive no-code Action Server interface, making advanced AI accessible to users without extensive coding knowledge.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
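As a toy illustration of that "no intermediary" idea (written in Python for consistency with the other sketches on this page, rather than an on-chain language such as Solidity): a smart contract is code whose rules release funds automatically once its conditions are met, with no bank or broker holding the money.

```python
# Toy escrow mimicking smart-contract control flow; real DeFi contracts
# run on-chain and are written in Solidity/Vyper. All names are invented.
class EscrowContract:
    def __init__(self, buyer, seller, amount):
        self.buyer, self.seller, self.amount = buyer, seller, amount
        self.funded = False

    def deposit(self, sender, value):
        # Funds are locked by code, not held by an intermediary.
        assert sender == self.buyer and value == self.amount
        self.funded = True

    def confirm_delivery(self, sender):
        assert sender == self.buyer and self.funded
        return (self.seller, self.amount)  # payout released automatically

deal = EscrowContract("alice", "bob", 100)
deal.deposit("alice", 100)
print(deal.confirm_delivery("alice"))      # ('bob', 100)
```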
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
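As one concrete example of the vulnerability-detection step discussed above: the OSV.dev service aggregates published advisories (including CVEs) for RubyGems packages and can be queried over HTTP. The sketch below is in Python only for consistency with the other examples on this page; the same query works from any HTTP client, and the gem name and version are placeholders.

```python
# Query OSV.dev for known vulnerabilities affecting a RubyGems package.
import json
import urllib.request

def osv_vulns(gem_name, version):
    body = json.dumps({
        "version": version,
        "package": {"name": gem_name, "ecosystem": "RubyGems"},
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("vulns", [])

# Placeholder gem/version; prints advisory IDs and summaries, if any.
for v in osv_vulns("rails", "6.0.0"):
    print(v["id"], v.get("summary", ""))
```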
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, SAP's free software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
1. Scientific Workflow Management System
Taverna, BioCatalogue, and myExperiment: a three-legged foundation for effective collaboration in e-science
A collaborative talk by Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material kindly shared by:
Prof. Dave DeRoure and David Newman, University of Southampton
Prof. Carole Goble and the e-Labs design group, University of Manchester
2. What is the myGrid Project?
- UK e-Science pilot project since 2001, centred at Manchester, Southampton and the EMBL-EBI
- Part of the Open Middleware Infrastructure Institute UK, http://www.omii.ac.uk
- Mixture of developers, bioinformaticians and researchers
- An alliance of contributing projects and partners
- Open source development and content (LGPL or BSD)
- Infrastructure: we don't own any resources (apart from catalogues), or a Grid
3. Taverna
- Graphical workbench for professionals
- Plug-in architecture
- Nested workflows
- Drag-and-drop wiring together
- Rapidly incorporate new services without coding; not restricted to predetermined services
- Access to local and remote resources and analysis tools
- 3500+ service operations available at start-up
4. What do Scientists use Taverna for?
- Application areas: systems biology model building, proteomics, sequence analysis, protein structure prediction, gene/protein annotation, microarray data analysis, QTL studies, QSAR studies, medical image analysis, public health care epidemiology, heart model simulations, high-throughput screening, phenotypical studies, phylogeny, statistical analysis, text mining, astronomy, music and meteorology
- Users and projects: Netherlands Bioinformatics Centre, Genome Canada Bioinformatics Platform, BioMOBY, the US FLOSS social science program, RENCI, the SysMO Consortium, the French SIGENAE farm animals project, ThaiGrid, the CARMEN neuroscience project, the SPINE consortium, the EU ENFIN, EMBRACE, BioSapiens and CASIMIR projects, the NERC Centre for Ecology and Hydrology, the Bergen Centre for Computational Biology, the Max Planck Institute for Plant Breeding Research, the Genoa Cancer Research Centre, AstroGrid, and some 30 US academic and research institutions
5. Who else is in this space?
- Trident, Triana, Kepler, Ptolemy II, Taverna, BioExtract, BPEL
6. www.myexperiment.org
- Socially share, discover and reuse workflows and other methods. A cooperative bazaar.
- As of Sunday 10th May: 1748 registered users, 143 groups, 669 workflows, 197 files, 52 packs
- 56 different countries; top 4: UK, US, The Netherlands, Germany
9. Why data provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
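The incubator effort cited on this slide eventually fed into the W3C PROV family of standards. As an anachronistic but concrete illustration of the use cases listed above, the sketch below uses the prov Python package (an assumption; it postdates this 2009 deck) to record how a result was derived, by which process, and to whom it is attributed.

```python
# Minimal provenance record with the `prov` package (pip install prov):
# lineage, generating process, and attribution for a derived result.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

doc.entity("ex:raw-dataset")
doc.entity("ex:result-chart")
doc.agent("ex:alice")
doc.activity("ex:workflow-run")

doc.used("ex:workflow-run", "ex:raw-dataset")             # input
doc.wasGeneratedBy("ex:result-chart", "ex:workflow-run")  # output
doc.wasDerivedFrom("ex:result-chart", "ex:raw-dataset")   # lineage
doc.wasAttributedTo("ex:result-chart", "ex:alice")        # credit / trust

print(doc.serialize(indent=2))  # PROV-JSON for exchange or audit
```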
10. Goals, expected contributions
• Established technology provider - open-source
– traditionally active in the bioinf space
– but also involved in the e-Lico EU project (data mining portal)
– large community base, established production environment
• Main goal:
– to offer our workflow and workflow repository technology, and put it to the test on the challenges of data preservation pipelines
• Challenges:
– expect new requirements on our current technology
• robust, high-volume data pipelines
• workflow provenance -- process evolution
• data provenance