The document discusses scientific workflow management systems and provenance. It notes that momentum is growing around data sharing, as evidenced by a special issue of Nature on the topic. Effective data sharing requires standards for packaging data with metadata into self-descriptive research objects, as well as representation of process provenance using workflow descriptions. Provenance captures causal relationships in scientific data and is important for understanding, reusing, and validating others' work. The Open Provenance Model aims to standardize provenance representation.
This document discusses mapping ontologies from multiple datasets in the Linked Open Data cloud to the PROTON upper-level ontology. It presents an approach to semantically mapping classes and properties from datasets like DBpedia, Freebase and GeoNames to PROTON in order to provide a unified vocabulary for querying across datasets. The mappings were developed using both automated and manual methods. Statistics on the ontology extensions and mappings are provided, as well as examples of SPARQL queries over the mapped data. Future work includes publishing the mapped ontologies and extending the mappings to additional datasets.
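A unified vocabulary is what lets a single query span several sources. As a hedged illustration only, the rdflib sketch below runs a SPARQL query over a few toy triples; the namespace, class, and property names are placeholders, not the actual PROTON vocabulary or the paper's mappings.

```python
# Sketch: one SPARQL query over data mapped to a shared upper ontology.
# The proton: names below are illustrative stand-ins.
from rdflib import Graph, Literal, Namespace, RDF

PROTON = Namespace("http://example.org/proton#")  # placeholder namespace
g = Graph()

# Toy triples standing in for DBpedia/Freebase/GeoNames data after mapping.
g.add((PROTON.Vienna, RDF.type, PROTON.City))
g.add((PROTON.Vienna, PROTON.populationCount, Literal(1897000)))

query = """
PREFIX proton: <http://example.org/proton#>
SELECT ?city ?population WHERE {
    ?city a proton:City ;
          proton:populationCount ?population .
}
"""
for row in g.query(query):
    print(row.city, row.population)
```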
Inquiry Optimization Technique for a Topic Map Database (tmra)
This document proposes an inquiry optimization technique for topic map databases. It discusses using an object-oriented data model for topic map databases to improve query performance compared to a relational model. The document defines cost estimation formulas to help the database system select the optimal retrieval route, either following associations or searching by topic, when answering queries. An experiment is needed to evaluate the effectiveness of using these cost estimations to optimize queries of a topic map database.
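To make the route-selection idea concrete, here is a hypothetical Python sketch; the cost formulas are illustrative placeholders, not the ones defined in the paper.

```python
# Hypothetical cost-based route selection for a topic map query.

def cost_follow_associations(fanout: float, depth: int) -> float:
    """Estimated nodes visited when traversing associations from a start topic."""
    return sum(fanout ** d for d in range(1, depth + 1))

def cost_search_by_topic(num_topics: int, selectivity: float) -> float:
    """Estimated cost of scanning the topic index and filtering."""
    return num_topics * selectivity

def choose_route(fanout: float, depth: int, num_topics: int, selectivity: float) -> str:
    assoc = cost_follow_associations(fanout, depth)
    topic = cost_search_by_topic(num_topics, selectivity)
    return "follow-associations" if assoc <= topic else "search-by-topic"

print(choose_route(fanout=3.0, depth=2, num_topics=100_000, selectivity=0.01))
```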
The document provides an overview of materials informatics and the Materials Genome Initiative. It discusses how materials informatics uses data-driven approaches and techniques from fields like signal processing, machine learning and statistics to generate structure-property-processing linkages from materials science data and improve understanding of materials behavior. This includes extracting features from materials microstructure, using statistical analysis and data mining to discover relationships and create predictive models, and evaluating how knowledge has improved.
Progress Towards Leveraging Natural Language Processing for Collecting Experi... (Anubhav Jain)
1. The document discusses using natural language processing (NLP) algorithms to extract useful information from unstructured text sources in materials science literature to help organize the world's materials science information and enable new search and analysis capabilities.
2. It describes a project called Matscholar that applies NLP techniques like named entity recognition and relation extraction to millions of article abstracts to build a searchable database with summarized materials property and application data.
3. The approach involves collecting text sources, developing machine learning models trained on annotated examples to extract entities and relations, and integrating the extracted structured data with materials property databases to enable new search and analysis functions (a minimal extraction sketch follows below).
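A minimal sketch of the entity-extraction step using the Hugging Face transformers pipeline; "dslim/bert-base-NER" is a general-purpose NER model used here as a stand-in, since Matscholar's materials-specific models are not reproduced.

```python
# Named entity recognition over an abstract with a pretrained model.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

abstract = ("We report the thermoelectric properties of Bi2Te3 thin films "
            "grown by molecular beam epitaxy.")

for entity in ner(abstract):
    # Each extracted entity carries a label, a confidence score, and offsets.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```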
Applications of Natural Language Processing to Materials Design (Anubhav Jain)
This document discusses using natural language processing (NLP) techniques to extract useful information from unstructured text sources in materials science literature. It describes how NLP models can be trained on large datasets of materials science publications to perform tasks like chemistry-aware search, summarizing material properties, and suggesting synthesis methods. The models are developed using techniques like word embeddings, LSTM networks, and named entity recognition. The goal is to organize materials science knowledge from text into a database called Matscholar to enable new applications of the information.
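As a toy illustration of the word-embedding idea behind chemistry-aware search, the gensim sketch below trains embeddings on three stand-in "abstracts" (real systems train on millions); terms used in similar contexts end up close in the embedding space.

```python
from gensim.models import Word2Vec

corpus = [
    "Bi2Te3 is a promising thermoelectric material with low thermal conductivity".split(),
    "PbTe thermoelectric devices show a high figure of merit at elevated temperature".split(),
    "GaN is a wide bandgap semiconductor used in light emitting diodes".split(),
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, epochs=50)

# Nearest neighbours in embedding space drive "related material" suggestions.
print(model.wv.most_similar("thermoelectric", topn=3))
```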
Data Integration at the Ontology Engineering Group (Oscar Corcho)
Presentation on the data integration work at OEG-UPM (http://www.oeg-upm.net/), given at the CredIBLE workshop in Sophia-Antipolis (October 15th, 2012).
Evaluating Machine Learning Algorithms for Materials Science using the Matben... (Anubhav Jain)
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions (a usage sketch follows below).
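A minimal usage sketch following the documented Matbench API; the "predict the training mean" model is a deliberately trivial placeholder for a real algorithm.

```python
import numpy as np
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_expt_gap"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        test_inputs = task.get_test_data(fold, include_target=False)
        # Trivial baseline: predict the mean of the training targets.
        predictions = np.full(len(test_inputs), np.mean(train_outputs))
        task.record(fold, predictions)

print(mb.matbench_expt_gap.scores)  # per-fold metrics
```

Because every submission runs the same folds and records predictions the same way, results from different groups are directly comparable on the leaderboard.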
Duplicate Detection of Records in Queries using Clustering (IJORCS)
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. Many times, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records, i.e., records that refer to the same real-world entity without being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research work is to detect exact and inexact duplicates by using duplicate detection and elimination rules, thereby improving the quality of the data.
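A toy sketch of the exact/inexact distinction; normalized edit-distance similarity stands in for the paper's detection and elimination rules, which are not reproduced here.

```python
from difflib import SequenceMatcher

def is_exact_duplicate(a: dict, b: dict) -> bool:
    return a == b

def is_inexact_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Match records that refer to the same entity despite small variations."""
    text_a = " ".join(str(v).lower() for v in a.values())
    text_b = " ".join(str(v).lower() for v in b.values())
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

r1 = {"name": "John Smith", "city": "New York"}
r2 = {"name": "Jon Smith", "city": "New York"}   # typographical variation
print(is_exact_duplicate(r1, r2))    # False
print(is_inexact_duplicate(r1, r2))  # True
```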
This document discusses DT's core analytical competencies in data engineering, analytics, and quantitative skills. It describes capabilities in areas such as data architecture, ETL, spatial data services, data transformation, reporting, data mining, spatial data mining, and quantitative skills in statistics, machine learning, spatial statistics and other applied mathematics. It also provides examples of analytics applied to problems involving time series anomaly detection, correlation, aggregation, graphs, movement patterns, and classification. Teams have degrees from top universities and expertise in fields like computer science, engineering, mathematics and social sciences.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu... (IJORCS)
The document proposes a privacy-preserving approach for hierarchical document clustering using maximal frequent item sets (MFI). First, MFI are identified from document collections using the Apriori algorithm to define clusters precisely. Then, the same MFI-based similarity measure is used to construct a hierarchy of clusters. This approach decreases dimensionality and avoids duplicate documents, thereby protecting individual copyrights. The methodology and algorithm are described in detail.
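A small illustration of the maximal-frequent-itemset idea using mlxtend's Apriori implementation; the documents and support threshold are toy choices, not the paper's.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

docs = [
    ["data", "mining", "cluster"],
    ["data", "mining", "privacy"],
    ["data", "cluster", "privacy"],
    ["data", "mining", "cluster"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(docs).transform(docs), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Keep only maximal itemsets: those not contained in a larger frequent itemset.
itemsets = list(frequent["itemsets"])
maximal = [s for s in itemsets if not any(s < t for t in itemsets)]
print(maximal)  # shared maximal itemsets then drive the similarity measure
```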
TMS workshop on machine learning in materials science: Intro to deep learning... (BrianDeCost)
This presentation is intended as a high-level introduction to deep learning and its applications in materials science. The intended audience is materials scientists and engineers.
Disclaimers: the second half of this presentation is intended as a broad overview of deep learning applications in materials science; due to time limitations it is not intended to be comprehensive. As a review of the field, this necessarily includes work that is not my own. If my own name is not included explicitly in the reference at the bottom of a slide, I was not involved in that work.
Any mention of commercial products in this presentation is for information only; it does not imply recommendation or endorsement by NIST.
This document contains four exam papers for a Data Warehousing and Data Mining course. Each paper contains 8 questions with sub-questions worth varying points. The questions cover topics such as data mining processes, differences between operational databases and data warehouses, data transformation techniques, data mining query languages, classification algorithms like naive Bayes and decision trees, clustering methods, and mining time-series, text and web data.
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... (PyData)
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Open Source Tools for Materials Informatics (Anubhav Jain)
This document discusses open source tools for materials informatics, including Matminer and Matscholar. Matminer is a library of descriptors for materials science data that can generate features for machine learning models. It includes over 60 featurizer classes and supports scikit-learn. Matscholar applies natural language processing to over 2 million materials science abstracts to extract keywords and enable improved literature searching. The document argues that open datasets like Matbench and automated tools like Automatminer could help lower barriers for developing machine learning models in materials science by making it easier to obtain training data and evaluate model performance.
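A minimal matminer sketch using the ElementProperty featurizer with its "magpie" preset, which is part of the real library; the composition is an arbitrary example input.

```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

featurizer = ElementProperty.from_preset("magpie")
comp = Composition("Fe2O3")

features = featurizer.featurize(comp)   # numeric feature vector for ML models
labels = featurizer.feature_labels()    # matching human-readable names

for name, value in list(zip(labels, features))[:5]:
    print(f"{name}: {value:.3f}")
```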
Assessing Factors Underpinning PV Degradation through Data Analysis (Anubhav Jain)
The document discusses using PVPRO methods and large-scale data analysis to distinguish system and module degradation in PV systems. It involves 3 main tasks: 1) Developing an algorithm to detect off-maximum power point operation and compare it to existing tools. 2) Applying PVPRO to additional datasets to refine methods and perform degradation analysis on 25 large PV systems. 3) Connecting bill-of-materials data to degradation results from accelerated stress tests through data-driven analysis and publishing findings while anonymizing data.
The Status of ML Algorithms for Structure-property Relationships Using Matb... (Anubhav Jain)
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
Extracting and Making Use of Materials Data from Millions of Journal Articles... (Anubhav Jain)
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
1. Materials Informatics uses Python tools like RDKit for analyzing molecular structures and properties.
2. ORGAN and MolGAN are two generative models that use GANs to generate novel molecular structures based on SMILES strings, with ORGAN incorporating reinforcement learning to optimize for desired properties.
3. Tools like RDKit enable analyzing molecular fingerprints and descriptors that can be used for machine learning applications in materials informatics (see the sketch below).
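A short RDKit sketch of the fingerprint and descriptor computations mentioned above; ethanol is an arbitrary example molecule.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

# Morgan (circular) fingerprint: a fixed-length bit vector usable as ML input.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())

# Scalar descriptors: physicochemical properties computed from the structure.
print("MolWt:", Descriptors.MolWt(mol))
print("LogP:", Descriptors.MolLogP(mol))
```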
Presentation held at the "Workshop on Knowledge Evolution and Ontology Dynamics" co-located with ISWC 2011. Related to the paper http://ceur-ws.org/Vol-784/evodyn1.pdf
Automating materials science workflows with pymatgen, FireWorks, and atomate (Anubhav Jain)
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
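A minimal FireWorks sketch that defines and stores a two-step workflow; the shell commands are placeholders for the real simulation steps that atomate's ready-made workflows wire in.

```python
from fireworks import Firework, LaunchPad, ScriptTask, Workflow

fw1 = Firework(ScriptTask.from_str('echo "relax structure"'), name="relax")
fw2 = Firework(ScriptTask.from_str('echo "compute band structure"'), name="bands")

# fw2 runs after fw1; FireWorks tracks the state and provenance of each step.
wf = Workflow([fw1, fw2], {fw1: [fw2]}, name="toy materials workflow")

lp = LaunchPad()  # assumes a local MongoDB with default settings
lp.add_wf(wf)
# Jobs are then executed with e.g. `rlaunch rapidfire` from the command line.
```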
SRbench is a benchmark for streaming RDF storage engines that was developed by Ying Zhang and Peter Boncz of CWI Amsterdam. It uses real-world linked open data sets and defines queries and implementations in natural language and languages like SPARQLStream and C-SPARQL to evaluate streaming RDF databases. The benchmark addresses the challenges of streaming RDF data by using appropriate datasets from the linked open data cloud and supporting semantics in stream queries. Future work will focus on performance evaluation and verifying benchmark results.
The document discusses integrating data from multiple sources on-the-fly without prior knowledge of the schemas. It proposes using approximate entity reconciliation, which leverages techniques like record linkage, approximate joins, and adaptive query processing. The key challenges are trading off completeness of integration for query response time and implementing a hybrid join algorithm that switches between exact and approximate joins to optimize this tradeoff.
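An illustrative sketch of that tradeoff (not the paper's hybrid algorithm): run the cheap exact join first, then fall back to a costlier similarity join only for the entities the exact join missed.

```python
from difflib import SequenceMatcher

left = [{"name": "Acme Corp."}, {"name": "Globex"}]
right = [{"name": "ACME Corporation"}, {"name": "Globex"}]

def exact_join(l, r):
    return [(a, b) for a in l for b in r if a["name"] == b["name"]]

def approximate_join(l, r, threshold=0.6):
    def sim(x, y):
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return [(a, b) for a in l for b in r if sim(a["name"], b["name"]) >= threshold]

pairs = exact_join(left, right)
matched = {id(a) for a, _ in pairs}
leftovers = [a for a in left if id(a) not in matched]
pairs += approximate_join(leftovers, right)  # approximate pass, only where needed
print([(a["name"], b["name"]) for a, b in pairs])
```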
Paper presentations: UK e-science AHM meeting, 2005 (Paolo Missier)
The document describes an ontology-based approach to handling information quality in e-science. It presents an initial quality framework that captures scientists' quality requirements and allows defining domain-specific quality characteristics. It introduces a web service that annotates datasets with quality metrics based on how well their elements conform to relevant ontologies, using transcriptomics as an example domain. The approach aims to make quality definitions reusable and the computation of quality measurements over large datasets cost-effective.
The document discusses porting genome sequencing data processing pipelines from scripted HPC implementations to workflow models on the cloud. This allows the pipelines to be more scalable, flexible, and evolvable. Tracking provenance is also important for using results as clinical evidence and analyzing differences when the pipelines change. Preliminary tests on the Microsoft Azure cloud show potential cost savings from improved resource utilization.
The document discusses scientific workflow management systems and collaboration in workflow-based science. It notes that collaboration requires that a scientist be able to make sense of third-party data, and that this requires the data to be accompanied by provenance metadata that describes how the data was generated and processed. The concept of a "Research Object" is introduced as a way to package scientific data and workflows together with provenance and other related information to enable collaboration and reuse.
PDT: Personal Data from Things, and its provenance (Paolo Missier)
This document discusses various aspects of the Internet of Things (IoT), including potential architectures and stacks, connectivity and evolution. It examines use cases at different scales, from individual sensors to smart cities. The role of metadata and data provenance is explored for IoT applications involving science, personal data from sensors, and devices that make autonomous decisions. Issues of data ownership, privacy and user control are important considerations for personal data generated by IoT devices. The relationship between IoT and machine-to-machine communication is also briefly discussed.
Structured Occurrence Network for provenance: talk for IPAW'12 paper (Paolo Missier)
The document discusses using structured occurrence networks (SONs) to model provenance. SONs extend occurrence networks (ONs) to represent the activity of complex systems through relationships between multiple ONs. The goal is to explore using SONs as a formal model of provenance, viewing data as an evolving system and agents as also evolving systems. Communication SONs are introduced to capture communication between concurrently proceeding ONs. This establishes patterns for representing workflow and multi-layered provenance using SONs.
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti... (Paolo Missier)
The document discusses fine-grained provenance tracking of workflow data products. It presents a functional model for collection-oriented workflow processing that models workflows operating on nested collections. This model generalizes simple iteration to arbitrary collection depths and handles multiple input collections through a generalized cross product operation. The model aims to enable efficient provenance querying by traversing the workflow graph instead of the potentially larger provenance graph.
SWPM12 report on the Dagstuhl seminar on Semantic Data Management (Paolo Missier)
The document summarizes discussions that took place at a Dagstuhl seminar on provenance in semantic data management in April 2012. Key points discussed include:
1) The need for provenance-specific benchmarks and reference data sets to better understand provenance usage and properties.
2) Proposals to collect provenance traces from various domains in a community repository using the PROV standard for interoperability.
3) Challenges of representing and reasoning with uncertain provenance information from sources like sensors, NLP, and human errors.
This document discusses encoding provenance graphs and PROV constraints using Datalog rules. It maps PROV notation graphs to a database of facts and encodes most PROV constraints as Datalog rules. This allows for declarative specification of provenance graphs with deductive inference, enabling validation of graphs and rapid prototyping of analysis algorithms. Some limitations include the inability to encode certain constraints and attributes in graph relations. The approach provides a proof of concept for representing and reasoning over provenance graphs with Datalog.
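A small Python sketch of the encoding idea: the provenance graph becomes a fact base and a Datalog-style rule is applied by forward chaining. The transitive-closure rule over derivation edges is shown purely to illustrate the mechanism; it is not itself one of the normative PROV constraints.

```python
# Fact base: (predicate, subject, object) triples from a PROV graph.
facts = {
    ("wasDerivedFrom", "report", "cleaned_data"),
    ("wasDerivedFrom", "cleaned_data", "raw_data"),
}

def saturate(facts):
    """derived(X, Z) :- derived(X, Y), derived(Y, Z), applied to a fixpoint."""
    while True:
        new = {("wasDerivedFrom", x, z)
               for (p1, x, y1) in facts if p1 == "wasDerivedFrom"
               for (p2, y2, z) in facts if p2 == "wasDerivedFrom" and y1 == y2}
        if new <= facts:
            return facts
        facts |= new

for fact in sorted(saturate(set(facts))):
    print(fact)
```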
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010 (Paolo Missier)
Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
ProvAbs: model, policy, and tooling for abstracting PROV graphs (Paolo Missier)
This document presents ProvAbs, a model, policy language, and tool for abstracting PROV graphs to enable partial disclosure of provenance data. The model groups nodes in a PROV graph and replaces them with a new abstract node while preserving the graph's validity. A policy assigns sensitivity levels to nodes and drives the node selection for abstraction. The ProvAbs tool implements the abstraction model and allows interactively exploring policy settings and clearances to generate abstract views of a PROV graph.
Big Data Quality Panel: Diachron Workshop @EDBT (Paolo Missier)
1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
Your data won’t stay smart forever: exploring the temporal dimension of (big ... (Paolo Missier)
Much of the knowledge produced through data-intensive computations is liable to decay over time, as the underlying data drifts, and the algorithms, tools, and external data sources used for processing change and evolve. Your genome, for example, does not change over time, but our understanding of it does. How often should we look back at it, in the hope of gaining new insight, e.g. into genetic diseases, and how much does that cost when you scale re-analysis to an entire population?
The "total cost of ownership” of knowledge derived from data (TCO-DK) includes the cost of refreshing the knowledge over time in addition to the initial analysis, but is often not a primary consideration.
The ReComp project aims to provide models, algorithms, and tools to help humans understand TCO-DK, i.e., the nature and impact of changes in data, and assess the cost and benefits of knowledge refresh.
In this talk we try to map the scope of ReComp by giving a number of patterns that cover typical analytics scenarios where re-computation is appropriate. We specifically describe two such scenarios, where we are conducting small-scale, proof-of-concept ReComp experiments to help us sketch the general ReComp architecture. This initial exercise reveals a multiplicity of problems and research challenges, which will inform the rest of the project.
The lifecycle of reproducible science data and what provenance has got to do ... (Paolo Missier)
The document discusses various aspects of ensuring reproducibility in scientific research through provenance. It begins by providing an overview of the data lifecycle and challenges to reproducibility as experiments and components evolve. It then discusses different levels of reproducibility (rerun, repeat, replicate, reproduce) and approaches to analyzing differences in workflow provenance traces to understand how changes impact results. The remainder of the document describes specific systems and tools developed by the author and collaborators that use provenance to improve reproducibility, including data packaging with Research Objects, provenance recording and analysis workflows with YesWorkflow, process virtualization using TOSCA, and provenance differencing with Pdiff.
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central (Paolo Missier)
This document discusses moving whole exome sequencing pipelines to the cloud using e-Science Central workflow management. The goal is to process 3000 exomes from neurological patients in a scalable and cost-effective way. Current scripts are being ported to e-Science Central for improved abstraction, execution, and provenance tracking. Provenance will help compare results from different pipeline versions and support clinical diagnosis. Initial testing with 300 exomes will begin, with full scalability testing planned for September 2014.
The document discusses using SNPs (single nucleotide polymorphisms) to help identify candidate genes associated with quantitative traits. It presents SNPit, a database that integrates data from Ensembl, dbSNP and Perlegen to rank SNPs based on differences between resistant and susceptible mouse strains. SNPit supports exploratory analysis of large genomic regions to help focus candidate gene searches for traits like disease susceptibility. The goal is to complement existing methods and automate parts of the process to accelerate disease gene identification.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Software tools for high-throughput materials data generation and data mining (Anubhav Jain)
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
The document discusses challenges in the modern research workflow and information landscape. It notes that the definition of "information" is evolving as research cycles and processes change. Additionally, it highlights issues around access to information, wasted resources, and imperfections in the existing system. The document advocates that systems need to adapt and suggests we can do better.
The document discusses research workflows and information needs that are changing as research becomes more data-driven and digital. It notes the complexity of information that researchers now deal with, including data, code, and non-digital materials. Additionally, it highlights issues around access, rewards, and incentives in the current system and the need to better support evolving research practices.
The Future of Digital Science - World Science Forum 2011 (Kaitlin Thaney)
(1) The document discusses how digital science is changing the research workflow by making more information available digitally. However, there are still blocking points like accessing non-digital materials and sharing results.
(2) The approach presented aims to address these issues by developing tools that integrate both digital and non-digital mediums to help researchers, machines, and decision makers. This includes tracking parameters, expiration dates, and calibration of non-digital materials.
(3) The goal is to use technology to help coordinate research where feasible, reduce duplication, and help measure the impact and reputation of research in an imperfect digital system.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
"Towards a Science of Reproducible Science?" DPRMA Workshop talk at JCDL 2013, Indianapolis, 25th July 2013. Workshop website is http://dprma.oerc.ox.ac.uk/
The paper is:
David De Roure. 2013. Towards computational research objects. In Proceedings of the 1st International Workshop on Digital Preservation of Research Methods and Artefacts (DPRMA '13). ACM, New York, NY, USA, 16-19. DOI=10.1145/2499583.2499590 http://doi.acm.org/10.1145/2499583.2499590
Sharing massive data analysis: from provenance to linked experiment reports (Gaignard Alban)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Preserving the Inputs and Outputs of Scholarship (tsbbbu)
Tim Babbitt discusses the changing context of research and scholarship due to digitization and the internet. The inputs and outputs of research are increasingly digital and complex, including data, code, presentations, and more. ProQuest has a history of preserving scholarship through microfilming and is exploring how to preserve the full range of digital scholarly outputs and their linkages in a sustainable way. Key questions include balancing new and old preservation methods and moving beyond preserving individual objects to also preserving networks and linkages between scholarly works.
How to best manage your data to make the most of it for your research, with the ODAM framework (Open Data for Access and Mining): give open access to your data and make it ready to be mined.
Where is the opportunity for libraries in the collaborative data infrastructure? (LIBER Europe)
Presentation by Susan Reilly at Bibsys2013 on the opportunities for libraries and their role in the collaborative data infrastructure. Looks at data sharing, authentication, preservation and advocacy.
This document provides an overview of where and how artificial intelligence (AI) is used in materials science. It discusses several key areas:
1) Hypothesis generation using archival data and machine learning to predict new materials.
2) Data acquisition, cleaning, and feature identification using AI techniques like denoising and artifact removal from experimental data.
3) Knowledge extraction from large datasets using unsupervised learning methods like non-negative matrix factorization to identify materials phases (a minimal sketch follows after this list).
4) Closing the materials discovery loop with demonstrations of autonomous materials research systems that integrate computation, autonomous synthesis and characterization using AI.
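A minimal scikit-learn sketch of the non-negative matrix factorization step from point 3), on synthetic data: two hypothetical "phase" signatures are mixed in varying proportions and then recovered.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
pure = np.array([[1.0, 0.0, 2.0, 0.5],
                 [0.0, 3.0, 0.5, 1.0]])   # hypothetical phase signatures
weights = rng.random((10, 2))             # mixing fractions for 10 samples
X = weights @ pure                        # observed (mixed) measurements

model = NMF(n_components=2, init="nndsvda", max_iter=1000)
W = model.fit_transform(X)  # per-sample component weights
H = model.components_       # recovered component signatures
print(H.round(2))
```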
This document summarizes Rob Grim's presentation on e-Science, research data, and the role of libraries. It discusses the Open Data Foundation's work in promoting metadata standards like DDI and SDMX. It also outlines the research data lifecycle and how metadata management can help libraries support research through services like data registration, archiving, discovery and access. Finally, it provides examples of how Tilburg University library supports research data through services aligned with data availability, discovery, access and delivery.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball’s survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need, and explores systems requirements that are essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper’s goals are stated, followed by the presentation of valuable systems requirements. The results include a base-model with three chief principles: principle of least effort, infrastructure service, and portability. The principles are intended to support “data user” tasks. Results also include a set of defined user tasks and functions, and applications scenarios.
Publishing of Scientific Data - Science Foundation Ireland Summit 2010 (jodischneider)
This document discusses trends in publishing scientific data, including requirements to deposit data, citing data through identifiers like DOIs, considering data itself as a publication in data journals or databases, and including interactive data within publications. It also outlines new roles for working with scientific data, such as data scientists and curators who extract facts from literature to populate databases and ensure data quality.
This presentation discusses managing research data through the data life cycle. It begins with an overview of the research life cycle and embedding the data life cycle within it. Key aspects of data management are then covered, including why manage data, ethical and legal issues, requirements for data sharing and retention, and creating a data management plan. The rest of the presentation delves into each stage of the data life cycle, providing best practices for data collection, organization, security, storage, documentation, processing, analysis, and long-term preservation or sharing. File formats, metadata, repositories, and bibliographic resources are also addressed.
This document discusses using cloud services to facilitate materials data sharing and analysis. It proposes a "Discovery Cloud" that would allow researchers to easily store, curate, discover, and analyze materials data without needing local software or hardware. This cloud platform could accelerate discovery by automating workflows and reducing costs through on-demand scalability. It would also make long-term data preservation simpler. The document highlights Globus research data management services as an example of cloud tools that could help address the dual challenges of treating data as both a rare treasure to preserve and a "deluge" to efficiently manage.
This document discusses the challenges of collecting, storing, and analyzing large volumes of internet measurement data. It examines issues such as distributed and resilient data collection, handling multi-timescale and heterogeneous data from various sources, and developing standardized tools and formats. The paper proposes the "datapository" - an internet data repository designed to address these challenges through a collaborative framework for data sharing, storage, and analysis tools. The goal is to help both network operators and researchers more effectively harness the wealth of data available.
Design and Development of a Provenance Capture Platform for Data Science (Paolo Missier)
A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Interpretable and robust hospital readmission predictions from Electronic Hea... (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
Please see the paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
Data-centric AI and the convergence of data and model engineering: opportunit... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models, as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has started to explore the connection between data and models in depth, along with startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, or data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own, of traditional approaches (Latent Class Analysis, Topic Modelling), and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcarePaolo Missier
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
a talk given at the VLDB 2021 conference, August, 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Paolo Missier
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines:from optimising re-execution to general Dat...Paolo Missier
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
1. Scientific Workflow Management System | Janus | Provenance
Research Objects, myExperiment, and Open Provenance for collaborative E-science
REPRISE workshop - IDCC'09
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
with additional material by Sean Bechhofer and Matthew Gamble,
e-Labs design group, University of Manchester
4. Momentum on sharing and collaboration
Special issue of Nature on Data Sharing (Sept. 2009)
• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case
• Ongoing debate in several communities
– Clinical trials [1]
– Earth Sciences: ESIP data preservation / stewardship, 2009
– Long established in some communities: Atmospheric sciences, 1998 [2]
• Open Science recommendations from Science Commons (July 2008) [link]
Toronto International Data Release Workshop Authors ("the Toronto group"), "Prepublication data sharing," Nature 461, 168-170 (10 September 2009), doi:10.1038/461168a. http://www.nature.com/news/specials/datasharing/index.html
12. Collaboration through data
What is needed for B to make sense of A's data?
1. Packaging: standards for self-descriptive data + metadata bundles (Research Objects)
2. Content: data format standardization efforts; metadata representation, including process provenance (workflow provenance)
3. Container: a repository for Research Objects
22. Paul's Pack
(Diagram: "Paul's Pack", a QTL Research Object aggregating Workflow 16 and Workflow 13 together with their Results, Logs, Slides, a Paper, and common pathways. Edges carry relations such as "produces", "included in", "feeds into", and "published in"; the slide builds up three layers over the pack: Representation, Domain Relations, and Aggregation, plus the Metadata that records them.)
23. ORE: representing generic aggregations
Resource Map: a data structure (descriptor) for an aggregation
http://www.openarchives.org/ore/1.0/primer.html, section 4
A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
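As an illustration (not from the talk), here is a minimal sketch of an ORE Resource Map for a pack like Paul's, using the Python rdflib library and the ORE terms vocabulary; all URIs and pack contents are invented for the example.

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)
rem = URIRef("http://example.org/pack/resource-map")   # hypothetical URIs
agg = URIRef("http://example.org/pack/aggregation")

# The resource map describes the aggregation, which aggregates the pack's parts.
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
for part in ("workflow16", "results", "logs", "slides", "paper"):
    g.add((agg, ORE.aggregates, URIRef("http://example.org/pack/" + part)))

print(g.serialize(format="turtle"))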
27. Content: Workflow provenance
A detailed trace of workflow execution:
- tasks performed, data transformations
- inputs used, outputs produced
(Diagram: an example workflow with processors "lister", "get pathways by genes1", "merge pathways", and "concat gene pathway ids", taking a gene_id input and producing a pathway_genes output.)
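To make "detailed trace" concrete, here is a minimal sketch of what such a trace might record for one run of the pathway workflow shown; the processor names follow the slide, while the data values are invented.

# One illustrative run, recorded step by step: task, inputs used, outputs produced.
trace = [
    {"task": "lister", "inputs": {}, "outputs": {"genes": ["g1", "g2"]}},
    {"task": "get pathways by genes1", "inputs": {"gene_id": "g1"},
     "outputs": {"pathways": ["p1", "p2"]}},
    {"task": "get pathways by genes1", "inputs": {"gene_id": "g2"},
     "outputs": {"pathways": ["p2"]}},
    {"task": "merge pathways", "inputs": {"pathways": [["p1", "p2"], ["p2"]]},
     "outputs": {"merged": ["p1", "p2"]}},
    {"task": "concat gene pathway ids", "inputs": {"merged": ["p1", "p2"]},
     "outputs": {"pathway_genes": "p1;p2"}},
]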
28. Why provenance matters, if done right
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To describe one’s experiment to others, for understanding / reuse
• To provide evidence in support of scientific claims
• To enable post hoc process analysis for improvement, re-design
The W3C Incubator on Provenance has been collecting numerous use cases:
http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#
29. What users expect to learn
• Causal relations:
- which pathways come from which genes?
- which processes contributed to producing an image?
- which process(es) caused data to be incorrect?
- which data caused a process to fail?
• Process and data analytics:
– analyze variations in output vs. an input parameter sweep (multiple process runs)
– how often has my favourite service been executed? on what inputs?
– who produced this data?
– how often does this pathway turn up when the input genes range over a certain set S?
(The same example workflow, with processors "lister", "get pathways by genes1", "merge pathways", and "concat gene pathway ids", is shown alongside.)
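Several of these questions reduce to simple lookups or traversals over such a trace. A sketch, over the illustrative trace above, of "which pathways come from which genes?":

def pathways_by_gene(trace):
    """Answer 'which pathways come from which genes?' from the trace."""
    result = {}
    for step in trace:
        if step["task"] == "get pathways by genes1":
            result[step["inputs"]["gene_id"]] = step["outputs"]["pathways"]
    return result

print(pathways_by_gene(trace))   # {'g1': ['p1', 'p2'], 'g2': ['p2']}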
30. Open Provenance Model
• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.0.1 currently open for comments
Edge types: an artifact A wasGeneratedBy (role R) a process P; a process P used (role R) an artifact A.
Goal: standardize causal dependencies to enable provenance metadata exchange.
(Diagram: a small example OPM graph over artifacts A1-A4 and processes P1-P3, with wasGeneratedBy and used edges labelled by roles R1-R6.)
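A minimal sketch of an OPM-style graph and a causal-ancestry query over it; the node names mirror the slide's example, but the wiring and the dictionary encoding are ours, not part of OPM.

# wasGeneratedBy maps each artifact to the process that generated it;
# used maps each process to the artifacts it consumed.
edges = {
    "wasGeneratedBy": {"A3": "P1", "A4": "P2", "A5": "P3"},
    "used": {"P1": ["A1"], "P2": ["A2"], "P3": ["A3", "A4"]},
}

def ancestry(artifact, graph):
    """Every process and artifact that causally contributed to `artifact`."""
    seen, frontier = set(), [artifact]
    while frontier:
        a = frontier.pop()
        p = graph["wasGeneratedBy"].get(a)
        if p is not None and p not in seen:
            seen.add(p)
            for consumed in graph["used"].get(p, []):
                seen.add(consumed)
                frontier.append(consumed)
    return seen

print(sorted(ancestry("A5", edges)))  # ['A1', 'A2', 'A3', 'A4', 'P1', 'P2', 'P3']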
31. The 3rd Provenance Challenge
• Chosen workflow from the Pan-STARRS project
– Panoramic Survey Telescope & Rapid Response System
• http://twiki.ipaw.info/bin/view/Challenge/ThirdProvenanceChallenge
• Goal: demonstrate "provenance interoperability" at query level
36. OPM and query-interoperability
(Diagram: Team A encodes workflow W as WA, runs WA to obtain prov(WA), runs query Q over it, and exports OPM(prov(WA)). Team B imports the graph as PWA = import(OPM(prov(WA))) and runs the same query Q. Do Q(PWA) and Q(prov(WA)) agree?)
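A toy version of that round trip, reusing the `edges` graph and `ancestry` query sketched above; the JSON encoding merely stands in for a real OPM serialization.

import json

def export_opm(graph):
    """Team A: serialize the graph into a JSON stand-in for an OPM document."""
    return json.dumps(graph)

def import_opm(payload):
    """Team B: rebuild the graph from the exchanged document."""
    return json.loads(payload)

# Query-level interoperability: the same query Q should give the same
# answer on the imported graph as on the original.
Q = lambda g: sorted(ancestry("A5", g))
PWA = import_opm(export_opm(edges))
assert Q(PWA) == Q(edges)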
43. Additional requirements
• Artifact values require a uniform, common identifier scheme
– each group used artifacts to refer to its own data results
– but those results were expressed using proprietary naming conventions
– Linked Data in OPM?
• OPM accounts for structural causal relationships
– additional domain-specific knowledge required
– attaching semantic annotations to OPM graph nodes
• OPM graphs can grow very large
– reduce size by exporting only query results
– Taverna approach: multiple levels of abstraction through OPM accounts ("points of view")
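One plausible reading of accounts as "points of view", sketched in Python: each assertion is tagged with the accounts that contain it, and extracting an account yields a smaller graph at the chosen level of abstraction. The tagging scheme is illustrative, not OPM syntax.

# Each assertion (edge) is tagged with the accounts that contain it.
assertions = [
    {"edge": ("A5", "wasGeneratedBy", "P3"), "accounts": {"summary", "detailed"}},
    {"edge": ("P3", "used", "A3"), "accounts": {"detailed"}},
    {"edge": ("P3", "used", "A4"), "accounts": {"detailed"}},
]

def view(assertions, account):
    """Keep only the assertions made within the given account."""
    return [a["edge"] for a in assertions if account in a["accounts"]]

print(view(assertions, "summary"))  # coarse view: [('A5', 'wasGeneratedBy', 'P3')]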
48. Query results as OPM graphs
(Diagram: as before, but now Team A exports only the query result, OPM(Q(prov(WA))), rather than the full graph prov(WA).)
- Approach implemented in Taverna 2.1
- Internal provenance DB with ad hoc query language
- To be released soon
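A sketch of the size-reduction idea, again over the toy `edges` graph and `ancestry` query from above: export only the fragment of the graph that a query result depends on, rather than the whole graph.

def export_query_result(graph, artifact):
    """Export only the fragment of the graph that `artifact` depends on."""
    keep = ancestry(artifact, graph) | {artifact}
    return {
        "wasGeneratedBy": {a: p for a, p in graph["wasGeneratedBy"].items()
                           if a in keep and p in keep},
        "used": {p: [a for a in arts if a in keep]
                 for p, arts in graph["used"].items() if p in keep},
    }

fragment = export_query_result(edges, "A3")
# keeps only A1, P1, and A3: strictly smaller than the full graph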
55. Full-fledged data-mediated collaborations
(Diagram: experiment A runs workflow A + input A and publishes a Research Object containing result A, its provenance, and datasets A. Result A becomes input B for workflow B; experiment B's Research Object then contains results, provenance, and datasets for A+B combined.)
Provenance composition accounts for implicit collaboration.
Aligned with the focus of the upcoming Provenance Challenge 4: "connect my provenance to yours" into a whole OPM provenance graph.
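A sketch of provenance composition over the same toy encoding: a `same_as` map records that A's result is B's input, so a single ancestry query then spans both runs. Process and artifact names are assumed disjoint across the two graphs, and all names here are invented.

def compose(prov_a, prov_b, same_as):
    """Merge two OPM-style graphs; `same_as` maps B's input artifacts to the
    A results they correspond to (result A -> input B)."""
    rename = lambda a: same_as.get(a, a)
    merged = {"wasGeneratedBy": dict(prov_a["wasGeneratedBy"]),
              "used": {p: list(arts) for p, arts in prov_a["used"].items()}}
    for art, proc in prov_b["wasGeneratedBy"].items():
        merged["wasGeneratedBy"][rename(art)] = proc
    for proc, arts in prov_b["used"].items():
        merged["used"].setdefault(proc, []).extend(rename(a) for a in arts)
    return merged

prov_b = {"wasGeneratedBy": {"B_out": "PB"}, "used": {"PB": ["B_in"]}}
whole = compose(edges, prov_b, {"B_in": "A5"})
print(sorted(ancestry("B_out", whole)))  # reaches back through A's run too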