ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o... - Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from a practice where results are post-hoc "made reproducible" to one where they are pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
Scientific Workflow Systems for accessible, reproducible research - Peter van Heusden
Presentation for eResearch Africa 2013 on using scientific workflow management systems to compose and enact analysis workflows in bioinformatics (or science in general).
Reproducibility and Scientific Research: why, what, where, when, who, how - Carole Goble
This document discusses the importance of reproducibility in scientific research. It makes three key points:
1. For results to be considered valid, scientific publications should provide clear descriptions of methods and protocols so that other researchers can successfully repeat and extend the work.
2. Many factors can undermine reproducibility, such as publication pressures, poor training, disorganization, and outright fraud. Ensuring reproducible research requires transparency across experimental designs, data, software, and computational workflows.
3. Achieving reproducible science is challenging and poorly incentivized due to the resources and time required to prepare materials for independent verification. Overcoming these issues will require collective effort across the research community.
The document summarizes Anita de Waard's presentation on Elsevier's experiments with big and small data. It discusses Elsevier's work with text mining and knowledge graphs to extract information from over 14 million articles. It also describes Elsevier's Medical Graph which predicts the probability of over 2,000 medical conditions occurring based on analysis of clinical data from 6 million patients. Finally, it reviews Elsevier's various tools and services to help researchers preserve, process, share, comprehend, access, and discover research data and publications.
The document discusses using WEKA and BioWeka to analyze DNA sequences and perform pattern matching. It summarizes how Eclat filtering and EM clustering are applied to a dataset containing DNA sequences from human and chimpanzee chromosomes. Eclat is used to extract codon frequencies as features, while EM clustering assigns sequences to clusters based on the mixture model with the highest posterior probability. The analysis aims to identify biologically relevant groups of genes and determine chromosomal similarities between humans and chimpanzees.
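As a rough illustration of the pipeline just described, the sketch below computes codon-frequency features and assigns sequences to clusters by highest posterior probability. It uses scikit-learn's GaussianMixture as a stand-in for WEKA/BioWeka's EM clusterer; the toy sequences and component count are illustrative assumptions.

```python
# Codon-frequency features + EM cluster assignment (minimal sketch).
from itertools import product
import numpy as np
from sklearn.mixture import GaussianMixture

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons

def codon_frequencies(seq: str) -> np.ndarray:
    """Relative frequency of each codon, read in frame from position 0."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    total = max(sum(counts.values()), 1)
    return np.array([counts[c] / total for c in CODONS])

# Toy sequences standing in for human/chimp gene sequences.
sequences = ["ATGGCCATTGTAATG", "ATGGCGATTGTCATG", "TTTAAACCCGGGTTT"]
X = np.vstack([codon_frequencies(s) for s in sequences])

# EM clustering: each sequence is assigned to the mixture component
# with the highest posterior probability.
gm = GaussianMixture(n_components=2, covariance_type="diag",
                     random_state=0).fit(X)
print(gm.predict(X))        # hard cluster assignments
print(gm.predict_proba(X))  # posterior probabilities per cluster
```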
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks - Carole Goble
Keynote presentation at the iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
RARE and FAIR Science: Reproducibility and Research Objects - Carole Goble
Keynote at JISC Digifest 2015 on Reproducibility and Research Objects in Scholarly Communication
Includes hidden slides
All material is reusable, except perhaps the IT Crowd screengrab.
Fault detection of imbalanced data using incremental clustering - IRJET Journal
This document proposes a method for fault detection in imbalanced data using incremental clustering with feature selection. Standard classification algorithms are not suitable for fault detection in imbalanced data as they prioritize the majority class. The proposed method uses incremental clustering to detect faults, maintaining statistical summaries for each cluster. It selects features using a minimum spanning tree-based algorithm to reduce dimensionality and improve efficiency. This feature selection aims to choose a subset of strongly related features while removing irrelevant and redundant features. The selected features are then used as input for the incremental clustering fault detection method to achieve better classification accuracy and result quality for imbalanced fault detection problems.
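A minimal sketch of the core idea (incremental clustering that keeps a statistical summary per cluster, with small clusters flagged as candidate faults) might look like the following. The radius threshold and the minority-size fault rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Incremental clustering with per-cluster statistical summaries (sketch).
import numpy as np

class IncrementalClusters:
    def __init__(self, radius: float):
        self.radius = radius
        self.n = []    # per-cluster point counts
        self.ls = []   # per-cluster linear sums (for centroids)

    def add(self, x: np.ndarray) -> int:
        """Assign x to the nearest cluster within `radius`, else open a new one."""
        if self.n:
            centroids = [s / c for s, c in zip(self.ls, self.n)]
            d = [np.linalg.norm(x - c) for c in centroids]
            k = int(np.argmin(d))
            if d[k] <= self.radius:
                self.n[k] += 1
                self.ls[k] = self.ls[k] + x
                return k
        self.n.append(1)
        self.ls.append(x.astype(float))
        return len(self.n) - 1

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.3, size=(200, 2))   # majority class
faults = rng.normal(4.0, 0.3, size=(5, 2))     # rare fault class
clusters = IncrementalClusters(radius=1.5)
labels = [clusters.add(x) for x in np.vstack([normal, faults])]

# Minority clusters are flagged as candidate faults (assumed rule).
suspect = [k for k, c in enumerate(clusters.n) if c < 10]
print("cluster sizes:", clusters.n, "suspect clusters:", suspect)
```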
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da... - IRJET Journal
This document proposes a new one-to-many data linkage technique using a One-Class Clustering Tree (OCCT) to link records from different datasets. The technique constructs a decision tree where internal nodes represent attributes from the first dataset and leaves represent attributes from the second dataset that match. It uses maximum likelihood estimation for splitting criteria and pre-pruning to reduce complexity. The method is applied to the database misuse domain to identify common and malicious users by analyzing access request contexts and accessible data. Evaluation shows the technique achieves better precision and recall than existing methods.
API-Centric Data Integration for Human Genomics Reference Databases: Achieve... - Genomika Diagnósticos
API-Centric Data Integration for Human Genomics Reference Databases: Achievements, Lessons Learned and Challenges
X-Meeting 2015
Authors: Jamisson Freitas, Marcel Caraciolo, Victor Diniz, Rodrigo Alexandre and João Bosco Oliveira
These slides were presented at AGU 2018 by Tanu Malik from DePaul University, in a session convened by Dr. Ian Foster, director of the Data Science and Learning division at Argonne National Laboratory.
Aspects of Reproducibility in Earth Science - Raul Palma
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
Drug Repurposing using Deep Learning on Knowledge Graphs - Databricks
Discovering new drugs is a lengthy and expensive process. This means that finding new uses for existing drugs can help create new treatments in less time and at lower cost. The difficulty is in finding these potential new uses.
How do we find these undiscovered uses for existing drugs?
We can unify the available structured and unstructured data sets into a knowledge graph. This is done by fusing the structured data sets, and performing named entity extraction on the unstructured data sets. Once this is done, we can use deep learning techniques to predict latent relationships (a sketch of this link-prediction step appears after the list below).
In this talk we will cover:
Building the knowledge graph
Predicting latent relationships
Using the latent relationships to repurpose existing drugs
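A rough sketch of that link-prediction step, under illustrative assumptions (a TransE-style scoring function over a toy graph of drugs and diseases; the entities, relation, and crude update rule are invented for illustration):

```python
# Score candidate (drug, treats, disease) triples with a TransE-style model.
import numpy as np

rng = np.random.default_rng(0)
entities = ["aspirin", "ibuprofen", "headache", "inflammation"]
E = {e: rng.normal(size=16) for e in entities}   # entity embeddings
R = {"treats": rng.normal(size=16)}              # relation embedding
known = [("aspirin", "treats", "headache")]

def score(h, r, t):
    """TransE: smaller ||h + r - t|| means a more plausible triple."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

# One crude, gradient-free training loop: nudge known triples together.
for h, r, t in known * 50:
    E[t] += 0.05 * (E[h] + R[r] - E[t])

# Rank unseen drug-disease pairs; high scores suggest repurposing candidates.
candidates = [(d, "treats", c) for d in ("aspirin", "ibuprofen")
              for c in ("headache", "inflammation")
              if (d, "treats", c) not in known]
for triple in sorted(candidates, key=lambda tr: -score(*tr)):
    print(triple, round(score(*triple), 3))
```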
Using publicly available resources to build a comprehensive knowledgebase of ... - Valery Tkachenko
There are a variety of public resources on the Internet that contain information about various aspects of the chemical, biological and pharmaceutical domains. The quality, maturity, hosting organizations and team sizes behind these data resources vary wildly, and as a consequence their content cannot always be trusted, while the effort of extracting information and preparing it for reuse is repeated again and again at various levels. This problem is especially serious in applications for QSAR, QSPR and QNAR modeling. On the other hand, the authors of this poster believe, based on their own extensive experience building various types of chemical, analytical and biological databases over decades, that the process of building such a knowledgebase can be systematically described and automated. This poster will outline the work performed on text- and data-mining various public resources on the Web, the data curation process, and making this information publicly available through a portal and a RESTful API. We will also demonstrate how such a knowledgebase can be used for real-time QSAR and QSPR predictions.
Deep Learning on nVidia GPUs for QSAR, QSPR and QNAR predictions - Valery Tkachenko
While we have seen tremendous growth in machine learning methods over the last two decades, there is still no one-size-fits-all solution. The next era of cheminformatics and pharmaceutical research in general is focused on mining heterogeneous big data, which is accumulating at an ever-growing pace, and this will likely use more sophisticated algorithms such as Deep Learning (DL). There has been increasing use of DL recently, which has shown powerful advantages in learning from images and languages as well as many other areas. However, the accessibility of this technique for cheminformatics is hindered, as it is not readily available to non-experts. It was therefore our goal to develop a DL framework embedded into a general research data management platform (Open Science Data Repository) which can be used as an API, as a standalone tool, or integrated into new software as an autonomous module. In this poster we will present results comparing the performance of classic machine learning methods (Naïve Bayes, logistic regression, Support Vector Machines etc.) with Deep Learning, and will discuss challenges associated with Deep Learning Neural Networks (DNNs). DNN models of different complexity (up to 6 hidden layers) were built and tuned (different numbers of hidden units per layer, multiple activation functions, optimizers, dropout fraction, regularization parameters, and learning rate) using Keras (https://keras.io/) and Tensorflow (www.tensorflow.org) and applied to various use cases connected to the prediction of physicochemical properties, ADME, toxicity and calculating properties of materials. It was also shown that using nVidia GPUs significantly accelerates calculations, although memory consumption puts some limits on the performance and applicability of standard toolkits 'as is'.
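For concreteness, a minimal Keras sketch of the kind of tuned DNN described (dense layers, a dropout fraction, Adam with an explicit learning rate) might look as follows; the fingerprint width, layer sizes, and toy regression target are illustrative assumptions rather than the poster's configuration.

```python
# A small tunable DNN for property regression (illustrative sketch).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(256, 1024)   # stand-in molecular fingerprints
y = np.random.rand(256)         # stand-in property values (e.g., logP)

model = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(1024,)),
    layers.Dropout(0.25),               # tunable dropout fraction
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(1),                    # regression output
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))  # training-set MSE
```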
This document discusses a hybrid technique for associative classification. It begins with an introduction to data mining processes like classification and association rule mining. The author then discusses the motivation and objectives of developing a framework to generate classification association rules more efficiently. The proposed methodology involves reviewing existing models, implementing a classification system using association rules in Weka, and comparing the performance to other methods. The facilities required are data mining tools like Weka. Finally, the document provides references that were consulted in the literature survey on associative classification and related techniques.
Reproducibility Using Semantics: An Overview - dgarijo
Overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Furthermore, Research Objects are introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
This document summarizes a presentation on MapReduce and YARN. It discusses key concepts like MapReduce execution, building MapReduce programs in Eclipse, and the YARN architecture. The presentation covers why MapReduce is used, real-life uses, an example MapReduce job, and interactions with Hadoop. It also explains motivations for YARN, how it works, and compares small and big data processing with MapReduce and YARN.
My poster on using pairwise learning for annotating, engineering and designing biological molecules. Mostly an overview of the types of things we are working on at the lab.
Indexing based Genetic Programming Approach to Record Deduplication - idescitation
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records arising from misspellings, field swaps, or other mistakes and data inconsistencies. This process requires identifying objects that are included in more than one list. Detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate as many duplications as possible. GP with indexing is an optimization technique that helps find the maximum number of duplicates in a database. We use a deduplication function that is able to identify whether two or more entries in a repository are replicas or not. Many industries and systems depend on the accuracy and reliability of databases to carry out their operations, so the quality of the information stored in those databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process it.
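A minimal sketch of the indexing idea, under illustrative assumptions: blocking keys cut the number of pairwise comparisons, and a weighted field-similarity function (whose weights a GP would evolve) decides whether two records are replicas. The records, blocking key, and 0.8 threshold are invented for illustration.

```python
# Blocking/indexing + a simple deduplication function (sketch).
from collections import defaultdict
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "John Smith", "city": "London"},
    {"id": 2, "name": "Jon Smith",  "city": "London"},   # misspelled duplicate
    {"id": 3, "name": "Mary Jones", "city": "Leeds"},
]

# Indexing: only compare records that share a cheap blocking key.
blocks = defaultdict(list)
for r in records:
    blocks[(r["name"][0], r["city"])].append(r)

def similarity(a, b, w_name=0.7, w_city=0.3):
    """Weighted field similarity; a GP would evolve these weights."""
    s = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return w_name * s + w_city * (a["city"] == b["city"])

for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if similarity(block[i], block[j]) > 0.8:
                print("replica pair:", block[i]["id"], block[j]["id"])
```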
This document presents research on classifying data using a new enhanced decision tree algorithm called NEDTA. It first provides background on data mining and decision tree classification techniques. It then discusses existing decision tree algorithms ID3, J48 and NBTree and applies them to a banking dataset to evaluate performance. The objectives are stated as applying the algorithms, evaluating results, comparing performance based on accuracy, time and error rate, and developing an enhanced method. The document outlines the implementation and provides results of applying the existing algorithms in Weka. It compares the accuracy and performance of ID3, J48 and NBTree and finds the new NEDTA algorithm produces better results.
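As a hedged illustration of this kind of comparison, the sketch below cross-validates two classifiers on synthetic banking-style data, using scikit-learn's entropy-criterion decision tree and naive Bayes as stand-ins for Weka's ID3/J48 and NBTree; the data and settings are assumptions, not the study's setup.

```python
# Cross-validated accuracy comparison of decision-tree-style classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
models = {
    "entropy tree (ID3/J48-like)": DecisionTreeClassifier(
        criterion="entropy", random_state=0),
    "naive Bayes (NBTree-like leaf model)": GaussianNB(),
}
for name, m in models.items():
    acc = cross_val_score(m, X, y, cv=10).mean()  # accuracy, as in the study
    print(f"{name}: {acc:.3f}")
```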
Interlinking educational data to Web of Data (Thesis presentation) - Enayat Rajabi
This is a thesis presentation about interlinking educational data to the Web of Data. I explain how I used the Linked Data approach to expose and interlink educational data to the Linked Open Data cloud.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Enhancement techniques for data warehouse staging area - IJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
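A toy sketch of comparing the three loading schedules named above; the job names, costs, and the reading of "minimum cost" as shortest-job-first are all illustrative assumptions rather than the paper's definitions.

```python
# Compare FIFO, minimum-cost (shortest-job-first) and round-robin loading.
from collections import deque

jobs = [("fact_sales", 9), ("dim_customer", 2), ("dim_product", 4)]

def total_wait(order):
    """Sum of completion times when jobs run sequentially in this order."""
    t, total = 0, 0
    for _, cost in order:
        t += cost
        total += t
    return total

fifo = list(jobs)
min_cost = sorted(jobs, key=lambda j: j[1])   # shortest job first

# Round robin: one unit of work per job per pass; record completion order
# (total_wait over this order is only an approximation for comparison).
rr_done, q, rem = [], deque(jobs), dict(jobs)
while q:
    name, _ = q.popleft()
    rem[name] -= 1
    if rem[name] == 0:
        rr_done.append((name, dict(jobs)[name]))
    else:
        q.append((name, rem[name]))

for label, order in [("FIFO", fifo), ("min-cost", min_cost),
                     ("round-robin", rr_done)]:
    print(label, [n for n, _ in order], "total wait:", total_wait(order))
```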
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden within these large amounts of data. If that information can be extracted from the database, it will create a lot of value for the organization. The question is how to extract this value; the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
What is Reproducibility? The R* brouhaha (and how Research Objects can help) - Carole Goble
Presented at the 1st International Workshop on Reproducible Open Science @ TPDL, 9 September 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
Being Reproducible: SSBSS Summer School 2017 - Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ... - Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, we need to move to the Web of Data, including a demo connecting various sources, all from your(!) machine.
Reproducibility by Other Means: Transparent Research Objects - Timothy McPhillips
This document discusses issues around reproducibility in research and proposes modeling reproducibility as multidimensional to help address terminology conflicts. It argues that reproducibility includes dimensions like experiment replicability, code re-executability, and findings reproducibility. Mapping definitions to shared dimensions and allowing claims using different terminologies could help resolve issues. Research Objects that attach reproducibility claims to artifacts and support queries in different terminologies may improve transparency without requiring exact repetition.
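A minimal sketch of that mapping idea, with invented dimension and term names: claims expressed in different community terminologies are normalized onto shared dimensions so they can be queried uniformly.

```python
# Map community-specific reproducibility terms onto shared dimensions.
SHARED = {
    # community term -> shared dimension it asserts (names assumed)
    "replicable":   "experiment_replicability",
    "re-runnable":  "code_reexecutability",
    "reproducible": "findings_reproducibility",
}

claims = [
    {"artifact": "workflow-42", "term": "re-runnable"},
    {"artifact": "paper-7",     "term": "replicable"},
]

def query(dimension):
    """Artifacts claiming a given shared dimension, whatever term was used."""
    return [c["artifact"] for c in claims if SHARED[c["term"]] == dimension]

print(query("code_reexecutability"))   # -> ['workflow-42']
```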
Results may vary: Collaborations Workshop, Oxford 2014 - Carole Goble
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
Is that a scientific report or just some cool pictures from the lab? Reproduc... - Greg Landrum
Requirements for reproducibility in computational chemistry publications include making available the data, code or algorithms, and results from the study. Authors should provide all data necessary to understand and assess their conclusions. Source code or detailed algorithm descriptions should also be included to allow independent reproduction of the work. Finally, publications must contain the actual results from applying the method rather than just describing results. Adopting these standards of transparency helps ensure others can evaluate and build upon published research claims.
Reproducibility, Research Objects and Reality, Leiden 2016 - Carole Goble
Presented at the Leiden Bioscience Lecture, 24 November 2016, Reproducibility, Research Objects and Reality
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. It all sounds very laudable and straightforward. BUT…..
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not "finished": the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in data-driven computational life sciences through examples and stories from initiatives that I am involved in, and that Leiden is involved in too, including:
· FAIRDOM, which has built a Commons for Systems and Synthetic Biology projects, with an emphasis on standards smuggled in by stealth and efforts to affect sharing practices using behavioural interventions
· ELIXIR, the EU Research Data Infrastructure, and its efforts to exchange workflows
· Bioschemas.org, an ELIXIR-NIH-Google effort to support the finding of assets.
This document summarizes the results of an empirical analysis of 177 scientific workflows from Taverna and Wings systems. The analysis identified common motifs in data-oriented activities and workflow implementation styles. For data activities, motifs included data preparation, data transformation, data movement and data visualization. For workflows, motifs involved different ways activities were combined and implemented. The identified motifs could help inform workflow design practices and tools to generate workflow abstractions, improving understanding and reusability of workflows.
The document summarizes research on enabling data reuse from published datasets. It reviews 40 papers that cataloged 39 different features of datasets that can enable reuse. These features are grouped into categories related to enabling access, documenting methodological choices and quality, and helping users understand and situate the data. The paper presents a case study analyzing over 1.4 million data files from more than 65,000 repositories on GitHub, relating dataset engagement metrics to various reuse features. Using these metrics as proxies for reuse, an initial deep learning model is developed to predict a dataset's reusability based on its documented features. This work demonstrates the gap between existing principles for enabling reuse and actionable insights that can help data publishers and tools implement functionalities proven
A Big Picture in Research Data Management - Carole Goble
A personal view of the big picture in Research Data Management, given at GFBio - de.NBI Summer School 2018 Riding the Data Life Cycle! Braunschweig Integrated Centre of Systems Biology (BRICS), 03 - 07 September 2018
Reproducible, Open Data Science in the Life Sciences - Eamonn Maguire
The document outlines the workflow of a data scientist, from planning experiments and collecting data, to analyzing, visualizing, and publishing results. It emphasizes that data science involves formalizing hypotheses based on observations and testing them using collected data. A suite of open-source tools is presented to help data scientists in managing data and supporting open, reproducible life science research. The goal is to enable integration and sharing of experimental data and results.
This is a keynote that I gave at the polyweb workshop on the state of the art of data science reproducibility. In the first part I review tools that have been developed over the last few years. In the second part, I focus on proposals that I have been involved in to facilitate workflow reproducibility and preservation.
Lecture for a course at NTNU, 27th January 2021
CC-BY 4.0 Dag Endresen https://orcid.org/0000-0002-2352-5497
See also http://bit.ly/biodiversityinformatics
https://www.gbif.no/events/2021/lecture-ntnu-gbif.html
Capturing Context in Scientific Experiments: Towards Computer-Driven Science - dgarijo
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. In order to improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work for capturing and sharing the context of scientific experiments by using scientific workflows and machine readable representations. Thanks to this approach, experiment results are described in an unambiguous manner, have a clear trace of their creation process and include a pointer to the sources used for their generation. In the second part of the talk, I will describe examples on how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of context of scientific experiments and how to involve scientists in the process of curating and creating abstractions on available research metadata.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not "finished": codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed just as second-class artifacts tethered to publications, but as the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects. The term has become widespread. However. What is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship, sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? - Bertram Ludäscher
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
Journal Club - Best Practices for Scientific Computing - Bram Zandbelt
This document discusses the importance of best practices in scientific computing. It notes that scientists rely heavily on software for research, with many writing their own code. However, most scientists are self-taught in software skills and may be unaware of best practices that could help them write more reliable and maintainable code. The document advocates treating software like a scientific instrument and following practices such as version control, testing, and automation. Adopting these practices could help reduce errors and make software easier to reuse.
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows - Khalid Belhajjame
I gave this talk at the EDBT'2020 conference. It shows how the provenance of workflows can be anonymized without compromising lineage relationships between the data records that are used and generated by the modules that compose the workflow.
Privacy-Preserving Data Analysis Workflows for eScience - Khalid Belhajjame
This document discusses an approach for preserving privacy in scientific workflows that use large datasets. It proposes using k-anonymity to anonymize sensitive workflow data. Parameter dependencies are leveraged to identify sensitive parameters and infer appropriate anonymity degrees. The approach was tested on 20 workflows, with overhead less than 1 millisecond. This preliminary work aims to assist scientists in anonymizing workflow data while enabling exploration of provenance and data products.
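For reference, k-anonymity itself is simple to state: a table is k-anonymous if every combination of quasi-identifier values occurs at least k times. A minimal check, with illustrative column names and data (not the paper's workflow datasets):

```python
# Check k-anonymity over a set of quasi-identifier columns.
from collections import Counter

rows = [
    {"age_band": "30-39", "zip3": "750", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "750", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "751", "diagnosis": "A"},
]

def is_k_anonymous(rows, quasi_ids, k):
    """Every quasi-identifier combination must occur at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in groups.values())

print(is_k_anonymous(rows, ["age_band", "zip3"], k=2))  # False: one group of size 1
```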
- The document discusses evaluating "why-not" queries against scientific workflow provenance. Why-not queries help understand why a data item was not returned by a workflow execution.
- It proposes a solution for evaluating why-not queries in workflows with black-box modules that do not preserve attribute information from their inputs. The solution explores workflow modules from sink to source to identify the "picky" modules responsible for a data item not appearing in the results (see the sketch after this list).
- To identify picky modules, it harvests information from the web by searching for traces of scientific module invocations to find valid candidate inputs and determine whether a module accepts them or is likely picky. It conducts an experiment using real workflows to test the effectiveness of
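The sketch referenced above walks a toy workflow graph backwards from the sink, probing each black-box module with the missing item to locate the picky one; the modules and probe rule are illustrative assumptions, not the paper's web-harvesting method.

```python
# Sink-to-source search for "picky" modules (illustrative sketch).
modules = {  # name -> (upstream module or None, predicate the module applies)
    "filter_quality": (None,             lambda x: x["quality"] > 0.5),
    "filter_species": ("filter_quality", lambda x: x["species"] == "human"),
    "sink":           ("filter_species", lambda x: True),
}

def picky_modules(missing_item, start="sink"):
    """Return modules that would reject the missing item, sink to source."""
    picky, m = [], start
    while m is not None:
        upstream, accepts = modules[m]
        if not accepts(missing_item):
            picky.append(m)
        m = upstream
    return picky

item = {"species": "mouse", "quality": 0.9}  # why was this not in the results?
print(picky_modules(item))                   # -> ['filter_species']
```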
Converting scripts into reproducible workflow research objects - Khalid Belhajjame
1) The document presents a methodology to convert script-based experiments into reproducible workflow research objects (WROs). This addresses issues of understanding, reusing, and reproducing experiments conducted through scripts.
2) The methodology involves 5 steps: generate an abstract workflow, create an executable workflow, refine the workflow, record provenance data, and annotate and check the quality of the conversion.
3) Applying the methodology to a molecular dynamics simulation case study, the authors demonstrate how scripts can be transformed into WROs containing workflows, annotations, provenance data, and other resources needed for reproducibility.
A Sightseeing Tour of Prov and Some of its Extensions - Khalid Belhajjame
This document provides an overview of the PROV provenance model and some of its extensions. It discusses the motivation for provenance, the history and development of the PROV model, its key concepts of entities, activities, and agents. It also describes extensions like ProvONE and PAV that build upon PROV to model workflow and scientific provenance.
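A minimal sketch of PROV's three core concepts using the Python prov library: an entity generated by an activity that is associated with (and attributed to) an agent. The namespace and identifiers are illustrative.

```python
# Build and serialize a tiny PROV document.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

doc.entity("ex:results.csv")      # a PROV Entity
doc.activity("ex:run-workflow")   # a PROV Activity
doc.agent("ex:alice")             # a PROV Agent

doc.wasGeneratedBy("ex:results.csv", "ex:run-workflow")
doc.wasAssociatedWith("ex:run-workflow", "ex:alice")
doc.wasAttributedTo("ex:results.csv", "ex:alice")

print(doc.get_provn())   # serialize in PROV-N notation
```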
The document discusses assisting designers in composing workflows through the reuse of frequent workflow fragments mined from repositories. It proposes an approach that involves mining fragments, representing workflows as graphs, homogenizing activity labels, and allowing users to search for fragments using keywords and activities from their initial workflow. Fragments are retrieved based on relevance to keywords and compatibility to specified activities, then ranked and presented to users for composition. Experiments assess different graph representations for mining fragments in terms of effectiveness, size and runtime. The approach aims to help designers reuse best practices from repositories when specifying new workflows.
This document proposes a method to improve the reuse of workflow fragments by mining workflow repositories. It evaluates different graph representations of workflows and uses the SUBDUE algorithm to identify recurrent fragments. An experiment compares representations on precision, recall, memory usage, and time. Representation D1, which labels edges and nodes, performed best. A second experiment assesses how filtering workflows by keywords impacts finding relevant fragments for a user query. The method aims to incorporate workflow fragment search capabilities into the design lifecycle to promote reuse.
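As a hedged illustration of the graph representation, the toy sketch below encodes each workflow as a list of labeled edges (representation D1 labels both nodes and edges) and counts the simplest recurrent fragments, frequent labeled edges, across a two-workflow repository; SUBDUE itself would discover larger substructures, and the workflows and support threshold here are assumptions.

```python
# Count frequent labeled edges as minimal workflow fragments.
from collections import Counter

# Each workflow: list of (source activity, edge label, target activity),
# with activity labels already homogenized across workflows.
repository = [
    [("fetch", "data", "clean"), ("clean", "data", "align")],
    [("fetch", "data", "clean"), ("clean", "data", "plot")],
]

fragments = Counter(edge for wf in repository for edge in wf)
for frag, count in fragments.most_common():
    if count >= 2:   # minimum support threshold (assumed)
        print("frequent fragment:", frag, "support:", count)
```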
Linking the prospective and retrospective provenance of scripts - Khalid Belhajjame
Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have recently been proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on the specific parts of provenance relevant for analyses. Toward this goal, we advocate that the fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
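For concreteness, this is roughly what YesWorkflow's comment-based annotations look like in an ordinary Python script: @BEGIN/@END delimit workflow-like blocks, and @IN/@OUT declare their data dependencies, giving the prospective view onto which a fine-grained trace can be mapped. The script body itself is an illustrative assumption.

```python
# @BEGIN analyze_samples
# @IN raw_data @URI file:raw.csv
# @OUT summary @URI file:summary.csv

# @BEGIN clean_data
# @IN raw_data
# @OUT clean
def clean_data(raw_data):
    # Drop missing readings before analysis.
    return [r for r in raw_data if r is not None]
# @END clean_data

# @BEGIN summarize
# @IN clean
# @OUT summary
def summarize(clean):
    return {"n": len(clean), "mean": sum(clean) / len(clean)}
# @END summarize

print(summarize(clean_data([1.0, None, 3.0])))
# @END analyze_samples
```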
These slides introduce the second edition of ProvBench, which I am leading, to collect a corpus of provenance data for benchmarking for the provenance (and wider scientific) community.
I gave this talk at TAPP 2014 during the provenance week in Cologne, on inferring fine-grained dependencies between data (ports) in scientific workflows. -- khalid
I gave this talk at the EDBT 2014 conference, which took place in Athens, Greece.
I show how data examples can be used to characterize the behavior of scientific modules. I present a new method that automatically generates data examples, and show that such data examples help human users understand the task of the modules, and that they can be used to assist curators in repairing broken workflows (i.e., workflows for which one or more modules are no longer supplied by their providers).
This document discusses research objects and scientific workflows. It introduces research objects as a way to aggregate all elements needed to understand a research investigation, including datasets, results, experiments, and provenance. Scientific workflows are presented as tools for automating data-intensive scientific activities, with prospective and retrospective provenance capturing the intended and actual methods. The document outlines an approach to summarizing complex workflows using semantic annotations of workflow motifs and reduction primitives like collapse and eliminate. This distills provenance traces for improved understanding and querying.
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...Khalid Belhajjame
Scientific workflows have become the workhorse of big-data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and, consequently, the complexity of the data trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how these primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
A use case designed in the context of the DataONE provenance working group, illustrating how the provenance traces generated by different workflow engines can be queried via the D-PROV model.
This document proposes representing scientific workflows as first-class citizens called research objects. It presents a model for workflow research objects that aggregates all the elements necessary to understand an investigation, including experiments, annotations, results, datasets and provenance. Research objects are encoded using semantic technologies like RDF and follow standards such as the Object Reuse and Exchange (ORE) model. The lifecycle of research objects is also described.
2. Data-Oriented Science
Computing is transforming the practice of science. The so-called "Fourth Paradigm of scientific research" [1] refers to the current era, in which scientists utilize computational tools and technologies to manage, share, federate, analyze, and visualize data to underpin scientific findings.
The objective of data-oriented science is to create a richer research ecosystem in which emphasis is given not only to the build-up of scientific knowledge, but also to the build-up and dissemination of other work-products of research, such as data, protocols, models and tools.
Why is that?
[1] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
3. Scholarly Articles Are Not Enough
Scholarly articles remain the main trusted means for scientists to communicate their findings. However, they are noticeably insufficient for communicating all the actual scientific knowledge behind the reported findings. There is a need for communicating and preserving other artifacts to enable understanding, verification, and reuse. In other words… reproducibility.
4. 47 of 53 "landmark" publications could not be replicated, owing to inadequate cell lines and animal models (Nature, 483, 2012): basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
5. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians, which was widely publicized in the Dutch media, is suspected to be based on faked data.
6. Reproducibility is not just about finding cheaters… it is above all a noble cause.
7. Culture of Reproducibility
Researchers in experimental biology carefully use lab notebooks to document the different aspects of their experiments. This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or of the intermediary datasets (results) they used and generated. It is therefore possible that numerous published results are unreliable or even completely invalid.
8. Culture of Reproducibility
Often, there is no record in scholarly communications of the process (workflow) that produced the published computational results. Even the code may be missing, or may have undergone changes such that it can no longer be used to process the data referred to (if we are lucky enough to have that data at all).
9. Open and Transparent Communication
"The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research."
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.
We have recently witnessed the emergence of a number of methods and tools for enabling reproducibility.
10. Scope of this seminar
We will focus on the reproducibility of scientific workflows. These have been adopted in modern sciences, notably the life sciences and biodiversity research, for encoding and enacting scientific experiments. We will look at what it means to reproduce a scientific workflow, and draw a map of some of the solutions that have been proposed in this direction.
11. Agenda
Scientific workflows
Scientific Workflow Reproducibility
Workflow Preservation Against Decay
From Workflows to Scripts
12. Scientific workflow
• Workflow technology is increasingly used for specifying and enacting scientific experiments.
• A scientific workflow is a series of analysis operations connected using data links.
• Analysis operations can be supplied locally or can be independently developed web services.
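To make this concrete, here is a minimal sketch (not from the original slides) of a two-step workflow in Python, where each analysis operation is a function and the data link is the value passed from one operation to the next; the operation names, accession, and sequence are illustrative assumptions only.

    # Minimal illustrative workflow: two analysis operations joined by a data link.
    def fetch_sequence(accession: str) -> str:
        """Analysis operation 1: retrieve a DNA sequence. Stubbed locally here;
        in a real workflow this could be an independently developed web service."""
        return {"X01714": "ATGGCATTGCA"}[accession]

    def gc_content(sequence: str) -> float:
        """Analysis operation 2: compute the GC fraction of the sequence."""
        return sum(base in "GC" for base in sequence) / len(sequence)

    # The data link: the output of one operation feeds the input of the next.
    print(f"GC content: {gc_content(fetch_sequence('X01714')):.2f}")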
13. Science with workflows
Example applications (figure montage):
- GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in a Thai population
- Trypanosomiasis (sleeping sickness parasite) in African cattle
- Astronomy & heliophysics
- Library document preservation
- Systems biology of micro-organisms
- Observing Systems Simulation Experiments (JPL, NASA)
- Biodiversity: invasive species modelling
[Credit Carole A. Goble]
14. Workflows for systematic resource use
• Access heterogeneous resources.
• Explicit, runnable, repeatable analytical process.
• Explore parameter spaces.
• Sweep an analysis over datasets.
• Transparent and efficient analyses, with provenance collected from workflow executions.
17. Agenda
Scientific workflows
Scientific Workflow Reproducibility
Workflow Preservation Against Decay
From Workflows to Scripts
18. Reproducibility Terminology
Reproducibility has been studied in science in larger contexts than computational reproducibility, in particular where wet experiments are involved. A plethora of terms is used, including repeat, replicate, reproduce, redo, rerun, recompute, reuse and repurpose, to name a few.
We will focus on 4 Rs: Repeat, Replicate, Reproduce and Reuse. For each of them, we will give the definition in wet-lab contexts and propose a definition in a computational setting.
20. Repeat
A wet experiment is said to be repeated when the experiment is performed in the same lab as the original experiment, that is, in the same scientific environment. By analogy, an in silico experiment is said to be repeated when it is performed in the same computational setting as the original experiment.
The major goal of the repeat task is to check whether the initial experiment was correct and can be performed again. The difficulty lies in recording as much information as possible so that the experiment can be repeated and the same conclusion drawn.
21. Replicate
A wet experiment is said to be replicated when the experiment is performed in a different (wet) "lab" than the original experiment. By analogy, a replicated in silico experiment is performed in a new setting and computational environment, although one similar to the original.
When replicated, a result has a high level of robustness: the result remains valid when a similar (even though different) setting is considered. A continuum of situations can be considered between repeated and replicated experiments.
22. Reproduce
Reproduce is defined in the broadest possible sense of the term and denotes the situation where an experiment is performed within a different set-up but with the aim of validating the same scientific hypothesis. In other words, what matters is the conclusion obtained, not the methodology used to reach it. Completely different approaches can be designed and completely different data sets can be used, as long as both experiments converge to the same scientific conclusion. A reproducible result is thus a high-quality result, confirmed by being obtained in various ways.
23. Reuse
A very important concept related to reproducibility is reuse, which denotes the case where a different experiment is performed that shares similarities with an original experiment. A specific kind of reuse occurs when a single experiment is reused in a new context (and thus adapted to new needs); the experiment is then said to be repurposed.
24. Repeat, Replicate, Reproduce and Reuse
Reproduce and reuse are the most important scientific targets. However, before investigating alternative ways of obtaining a result (to reach reproducibility) or reusing a given methodology in a new context (to reach reuse), the original experiment has to be carefully tested (possibly by reviewers and/or other peers), demonstrating its ability to be at least repeated and hopefully replicated.
The database community lags well behind other computer science communities, e.g., the Semantic Web community: ISWC and ESWC encourage authors to submit, along with the paper, auxiliary resources about the experiments they ran, as well as the software/prototype they built, if any.
25. Reproducibility and Scientific Workflows
We now introduce definitions of reproducibility concepts in the particular context of scientific workflow systems. In our definition, we distinguish the following components of an analysis designed using a scientific workflow:
1. S, the workflow specification, providing the analysis steps associated with tools, chained in a given order;
2. I, the input of the workflow used for its execution, that is, the concrete data sets and parameter settings specified for the tools;
3. E, the workflow context and runtime environment, that is, the computational context of the execution (OS, libraries, etc.).
Additionally, we consider R and C: the result of the analysis (typically the final data sets) and the high-level conclusion that can be reached from this analysis, respectively.
26. Repeatability of a Scientific Workflow
Given two analyses A and A' performed using scientific workflows, we say that A' repeats A if and only if A and A' are identical on all their components.
Replicability of a Scientific Workflow
Given two analyses A and A' performed using scientific workflows, we say that A' replicates A if and only if they reach the same conclusion while their specification and input components are similar; other components may differ (in particular, no condition is set on the runtime environment). Terms such as rerun and re-compute typically cover situations where the workflow specification is unchanged.
27. Reproducibility of a Scientific Workflow
Given two analyses A and A' performed using scientific workflows, we say that A' reproduces A if and only if they reach the same conclusion. No condition is set on any other component of the analysis.
Reuse of a Scientific Workflow
Given two analyses A and A' performed using scientific workflows, we say that A' reuses A if and only if the specification or input of A is part of the specification or input of A'. No other condition is set; in particular, the conclusion reached may be different.
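A hedged sketch of the definitions on slides 25-27 as executable checks in Python. The Analysis container and the similar/part_of predicates are illustrative assumptions standing in for the formalization on the slides; real similarity and containment tests would be domain- and workflow-specific.

    from dataclasses import dataclass, astuple
    from typing import Any

    @dataclass
    class Analysis:
        S: Any  # workflow specification
        I: Any  # inputs: concrete data sets and parameter settings
        E: Any  # runtime environment (OS, libraries, ...)
        R: Any  # result: the final data sets
        C: Any  # high-level scientific conclusion

    def similar(x: Any, y: Any) -> bool:
        # Placeholder: a real notion of similarity is domain-specific.
        return x == y

    def part_of(x: Any, y: Any) -> bool:
        # Placeholder containment test; real workflows need sub-graph matching.
        return x == y or (isinstance(y, (list, set, tuple)) and x in y)

    def repeats(a2: Analysis, a1: Analysis) -> bool:
        # Identical on all components.
        return astuple(a2) == astuple(a1)

    def replicates(a2: Analysis, a1: Analysis) -> bool:
        # Same conclusion; similar specification and inputs; E and R unconstrained.
        return a2.C == a1.C and similar(a2.S, a1.S) and similar(a2.I, a1.I)

    def reproduces(a2: Analysis, a1: Analysis) -> bool:
        # Same conclusion; no condition on any other component.
        return a2.C == a1.C

    def reuses(a2: Analysis, a1: Analysis) -> bool:
        # Part of A's specification or input appears in A'.
        return part_of(a1.S, a2.S) or part_of(a1.I, a2.I)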
28. Real-Life Example of Reuse
• Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle.
• Paul meets Jo, who is investigating whipworm in mouse.
• Jo reuses one of Paul's workflows without change.
• Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite.
• Previously, a manual two-year study by Jo had failed to do this.
Reuse can be impressive when it works… but is generally hard to achieve.
[Credit: Computational Workflows, Carole Goble]
29. Which level of reproducibility are we at?
Repeatability and replicability ☹ Even these two are hard to achieve most of the time. Needless to speak of reuse at this point: there are a few use cases that show the potential of workflow reuse, but we are still at the stage of use cases.
Solutions for enabling scientific workflow repeatability and replication have mainly focused on preserving workflows against decay.
30. Agenda
Scientific workflows
Scientific Workflow Reproducibility
Workflow Preservation Against Decay
From Workflows to Scripts
31. Workflow Preservation
Public repositories such as myExperiment and CrowdLabs have been used by scientists to publish workflow specifications and share them over the web. The availability of workflow specifications is, however, not sufficient for enabling their repeatability and replicability. Indeed, an empirical study that we conducted showed that the majority of workflows suffer from decay.
32. Understanding the Causes of Workflow Decay
We adopted an empirical approach to identify the causes of workflow decay and to quantify their severity. To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and, if so, what caused it.
33. Experimental Setup
Workflows: Taverna workflows from myExperiment.org (Taverna 1 and Taverna 2)
Selection process: by creation year, by creator, by domain
Software environment: Taverna 2.3
Experiment metadata: 4 researchers
34. Analyzed Workflows
Number of Taverna 1 workflows from 2007 to 2011:
            2007  2008  2009  2010  2011
  Tested      11    10    10    10    4*
  Total       74   341   101    26    13
Number of Taverna 2 workflows from 2009 to 2012:
            2009  2010  2011  2012
  Tested      12    10    15     9
  Total       97   308   289   184
36. The Proportion of Decay
75% of the 92 tested workflows failed either to execute or to produce the same result (when testable). Those from earlier years (2007-2009) had a 91% failure rate.
(Charts: failure rates broken down for Taverna 1 and Taverna 2.)
37. The Cause of Decay
Manual analysis: via the validation report from the Taverna workbench, and by interpreting experiment results reported by Taverna.
We identified 4 categories of causes:
- Missing example data
- Missing execution environment
- Insufficient descriptions about workflows
- Volatile third-party resources
38. Decay Caused by Third-Party Resources
Third-party resources are not available:
- The underlying dataset, particularly locally hosted in-house datasets, is no longer available. Example: the researcher hosting the data changed institution, and the server is no longer available.
- Services are deprecated. Example: DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows.
Third-party resources are available but not accessible:
- Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons the input data is superseded by new data, making the workflow not executable or producing wrong results.
- Data is available but permission, a certificate, or network access is needed. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
- Services are available but need permission, a certificate, or network access to invoke them. Example: the security policies of the execution framework are updated due to new hosting-institution rules.
Third-party resources have changed:
- Services are still available using the same identifiers, but their functionality has changed. Example: the web services were updated.
40. Summary of Decay Causes
- 50% of the decay was caused by the volatility of third-party resources: unavailable, inaccessible, or updated.
- Missing example data: unable to re-run.
- Missing execution environment, such as local plugins.
- Insufficient metadata, such as required dependency libraries or permission information.
42. Combating Workflow Decay
Objective: provide enough information to prevent decay, detect decay, and repair decay.
Approach: Research Objects + checklists.
Research Object: aggregates workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay.
Checklists: used to check that sufficient information is preserved along with workflows.
44. Checklisting the Reproducibility of a Workflow
(Figure: a checklist service selects and evaluates a checklist, taking as inputs a Minim description and a Research Object together with the purpose at hand, and producing an evaluation report.)
The Minim model used in our approach is an adaptation of the MIM model [1].
[1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012
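As a rough illustration of the idea (not the actual Minim machinery), the Python sketch below evaluates a research object, modelled as a plain dictionary, against a checklist of required elements and produces a small report; all names and the report format are invented for this example.

    # Hypothetical checklist evaluation: a stand-in for, not a rendition of, Minim.
    REQUIRED = ["workflow", "example_inputs", "annotations", "provenance"]

    def evaluate_checklist(research_object: dict) -> dict:
        """Return a minimal evaluation report: which required elements are
        missing, and whether the checklist is satisfied overall."""
        missing = [item for item in REQUIRED if item not in research_object]
        return {"missing": missing, "satisfied": not missing}

    # Usage: an RO lacking example inputs fails the check.
    ro = {"workflow": "wf.t2flow", "annotations": [], "provenance": []}
    print(evaluate_checklist(ro))  # {'missing': ['example_inputs'], 'satisfied': False}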
45. Use Case
4 myExperiment packs: 2 from genomics, 1 from geography, and 1 domain-neutral.
Experiment process: transform them into research objects and create checklist descriptions.
Observations: 2 research objects did not contain example inputs; the other 2 failed because of updates to third-party resources and to the execution environment.
46. Lessons Learnt
1. Dependency is the root enemy of reproducible
workflows
2. Documentation, i.e., annotation, is vital
3. Documentation should be easy to create
48. Benefits of Research Objects
A research object aggregates all the elements that are necessary to understand research investigations.
- Methods (experiments) are viewed as first-class citizens.
- Promotes reuse.
- Enables the verification of the reproducibility of results.
50. Research Object Model: Overview
The model specification can be found at http://wf4ever.github.com/ro/ and the primer at http://wf4ever.github.com/ro-primer/
55. Grounding Workflow-centric Research Objects Using Semantic Technologies
Workflow-centric research objects are encoded using RDF, according to a set of ontologies that are publicly available. Research objects use the Object Reuse and Exchange (ORE) model to represent aggregation.
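For illustration, a minimal Python sketch (using the rdflib library) of how an RO's aggregation could be expressed with the ORE vocabulary; the example URIs and resource names are invented, and real workflow-centric ROs use the full wf4ever ontology suite on top of this.

    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    ORE = Namespace("http://www.openarchives.org/ore/terms/")

    g = Graph()
    ro = URIRef("http://example.org/ro/hd-study/")  # hypothetical RO URI
    g.add((ro, RDF.type, ORE.Aggregation))
    # The RO aggregates its constituent resources (illustrative file names).
    for resource in ["workflow.t2flow", "inputs.csv", "provenance.ttl"]:
        g.add((ro, ORE.aggregates, URIRef(ro + resource)))

    print(g.serialize(format="turtle"))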
56. Grounding Workflow-centric Research Objects Using Semantic Technologies
We use the Annotation Ontology (AO) to annotate research object resources and their relationships.
57. (Diagram: the research object lifecycle. A scientist works on a Live RO. "My supervisor calls me to report my work": a first RO snapshot is taken (<<copy>>), identified by a URI, with some metadata and some curation, mostly private to the group. "My supervisor calls me again and we decide to publish our RO + paper": a second snapshot (<<versionOf>> the first), identified by a URI, with some metadata and curation, mostly private to the group and to paper reviewers. A librarian/curator then produces an Archived RO (<<copy, filter and curate>>), identified by a URI, with good metadata and curation, mostly public; reviews are received and the final version is published (<<versionOf>>). A new PhD student continues the work from a <<copy>>.)
58. Using Research Objects for the Preservation of Workflows/Experiments
Case study: investigating the epigenetic mechanisms involved in Huntington's disease (HD), the most commonly inherited neurodegenerative disorder in Europe, affecting 1 in 10,000 people. The scientists in this use case were convinced to use the Research Object as a model for packaging their investigation.
59. Preserving Scientific Workflows When They Have Not Been Packaged into Research Objects
…which is the case for most workflows. And even when they are packaged into research objects, scientific workflows can still suffer from decay.
60. Scientific Workflow Preservation
Issue: as we have seen from the results of the empirical study presented earlier, workflow preservation is frequently hampered by the volatility of the web services implementing the analysis operations that constitute workflows.
Objective: to provide a means for scientists to repair workflows by identifying service operations that can play the same role as the unavailable ones.
61. Outline
✔ Context: Preservation of Scientific Workflows
■ Discovering Substitute Services Using Semantic
Annotation of Web Services
■ Discovering Substitute Services Using Existing
Workflow Specifications and Provenance traces
■ Conclusions
62. Ontologies Used for Annotating Web Services
Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification.
Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence.
63. Task Replaceability
Task replaceability: for an operation op2 to be able to substitute an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task op1 performs.
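The formal condition appeared as a figure on the original slide and did not survive extraction. A plausible reconstruction in subsumption notation, assuming task(op) denotes the task-ontology concept annotating an operation, is:

    task(op1) ⊑ task(op2)

i.e., the task of op1 is subsumed by (or equivalent to) the task of op2. This is an assumption for illustration, not the slide's exact formula.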
65. Limitations
While the method just presented is sound, its practical applicability is hindered by the following facts:
§ Semantic annotations of web services are scarce.
§ Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies.
§ As a result, a substitute discovered to replace an unavailable operation using such annotations may turn out to be unsuitable and, conversely, a suitable substitute may be discarded.
66. Discovering Substitute Services Using Existing Workflow Specifications and Provenance Traces
(Figure: candidate substitutes are identified from existing workflow specifications and from the provenance traces of the missing operations.)
67. Parameter Compatibility
Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if:
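The condition itself was an image on the original slide and did not survive extraction. A plausible reconstruction, following the usual contravariant rule for service substitution (op2 must accept at least the inputs op1 accepted, and must produce outputs no more general than op1's), assuming dom(p) denotes the domain-ontology concept annotating a parameter p:

    for every input i1 of op1, there is an input i2 of op2 with dom(i1) ⊑ dom(i2), and
    for every output o1 of op1, there is an output o2 of op2 with dom(o2) ⊑ dom(o1)

This is an assumption for illustration, not the authors' exact formula.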
68. Task Compatibility
In addition to compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation. To perform this test, we exploit the following observation: an operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1.
To perform the above test, however, we would have to call the missing operation op1!
A solution that we adopt to overcome this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when it was enacted.
69. Task Compatibility (cont.)
§ An operation op2 may be compatible in terms of task with op1 if op2 delivers the same results that op1 delivered in past executions, as logged within provenance logs, when fed the same input values.
§ Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.
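A minimal sketch of this provenance-based test in Python; the log format (recorded input/output pairs for the missing op1) and the candidate's call signature are assumptions for illustration.

    def may_replace(candidate, provenance_log) -> bool:
        """candidate: a callable standing in for op2.
        provenance_log: [(input, output), ...] pairs recorded for the missing op1.
        True means op2 *may* be task-compatible: it matches op1 on every logged
        input, but inputs that were never logged remain untested."""
        return all(candidate(inp) == out for inp, out in provenance_log)

    # Usage with a hypothetical log of op1's past executions:
    log = [("ATG", 3), ("ATGGCA", 6)]
    print(may_replace(len, log))  # True: len matches op1 on all logged inputs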
70. Relaxing Substitutability Conditions
The condition that we have described for checking the suitability of an operation as a substitute for another may be stronger than is required in practice. There are various parameter representations adopted in bioinformatics. Because of representation mismatches, a service operation that performs a task similar to the missing operation may be found to be unsuitable.
71. Example of values delivered by two operations using the same input value
(Figure: two values, value1 and value2, delivered by two operations for the same input; CosSym(value1, value2) = 0.007, i.e., at the raw-text level the two values look almost entirely different.)
72. Relaxing Substitutability Conditions
To overcome this problem, we use a two-step process when comparing parameter values:
1. Given a parameter value, derive its representation.
2. If the representation is associated with a key attribute (identifier), extract the value of that attribute.
If two parameter values are associated with identifiers, then they are compared by comparing their identifiers.
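A sketch of this two-step comparison in Python, assuming simplified regular expressions to detect FASTA and UniProt-style records and to extract their identifiers; the patterns are hypothetical simplifications, not the method's actual representation detectors.

    import re

    def extract_identifier(value: str):
        """Step 1: derive the representation; step 2: if it carries a key
        attribute (identifier), extract it. Patterns are illustrative only."""
        fasta = re.match(r">(\S+)", value)                 # FASTA header: >ID ...
        if fasta:
            return fasta.group(1)
        uniprot = re.search(r"^ID\s+(\S+)", value, re.M)   # UniProt-style ID line
        if uniprot:
            return uniprot.group(1)
        return None

    def same_value(v1: str, v2: str) -> bool:
        """Compare by identifier when both carry one; else fall back to equality."""
        id1, id2 = extract_identifier(v1), extract_identifier(v2)
        if id1 is not None and id2 is not None:
            return id1 == id2
        return v1 == v2

    # Usage: the same record in two representations compares equal by identifier.
    fasta_value = ">P12345 example protein\nMKTAYIAKQR"
    uniprot_value = "ID   P12345\nSQ   MKTAYIAKQR"
    print(same_value(fasta_value, uniprot_value))  # True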
73. Example of values delivered by two operations using the same input value
(Figure: value1 and value2, delivered by two operations for the same input, are in Fasta format and Uniprot format respectively.)
74. Data Examples for Characterizing Scientific Operations
We conducted an empirical evaluation to assess the effectiveness of the method just described. The issue we faced was obtaining examples that characterize the missing operation and that can be used for comparison with the available modules. This motivated a proposal that we have worked on for characterizing analysis operations using data examples.
76. Generating Data Examples
Data examples can be used as a means to describe the behavior of analysis operations. However, enumerating all possible data examples for a given operation may be expensive, and the result may contain redundant data examples that describe the same behavior.
Issue: which data examples should be used to characterize the functionality of a given operation?
Solution: we have shown how software-testing techniques can be adapted to the problem of generating data examples without relying on the availability of the operation specification, which is often not accessible.
Trick: use domain ontologies for partitioning the space of possible values.
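A minimal sketch of the partitioning idea in Python: an assumed toy domain ontology splits an operation's input space into classes, and one representative data example is drawn per class by running the operation. The partition names, representative values, and the operation are illustrative assumptions, not the paper's actual generation procedure.

    # Toy "domain ontology": each concept partitions the input space and
    # carries one representative value (all names here are illustrative).
    PARTITIONS = {
        "DNA_sequence":     "ATGGCATTGCA",
        "Protein_sequence": "MKTAYIAKQR",
        "Empty_input":      "",
    }

    def generate_data_examples(operation):
        """One data example (input, output) per ontology partition, without
        needing the operation's specification: we simply invoke it per class."""
        examples = {}
        for concept, representative in PARTITIONS.items():
            try:
                examples[concept] = (representative, operation(representative))
            except Exception as err:      # the operation may reject a class
                examples[concept] = (representative, f"error: {err}")
        return examples

    # Usage with a hypothetical operation:
    gc = lambda s: sum(b in "GC" for b in s) / len(s)
    print(generate_data_examples(gc))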
77. Agenda
Scientific workflows
Scientific Workflow Reproducibility
Workflow Preservation Against Decay
From Workflows to Scripts
78. From Workflows to Scripts… and then Back
Scientific workflows have proved their utility and are used in practice by scientists. However, the majority of scientists use scripting languages to specify and enact their data analyses. In order to promote the reproducibility of scripts, a number of proposals have emerged in recent years that seek to bring some of the advantages that characterize workflows to scripts. We will see some of them in what follows.
79. Meanwhile, on a nearby planet…
Interactive visualization: R and Python are the winners.
80. Why Bother?
Workflows provide key features for enabling reproducibility that scripts generally lack:
- Modularity: a workflow can be repurposed in a straightforward manner by customizing its resources and dependencies.
- Scalability: some workflow systems can handle large amounts of data.
- Provenance: most workflow systems are instrumented to capture provenance information about workflow executions.
YesWorkflow to the rescue.
87. YesWorkflow Architecture
• YW-Extract: … structured comments
• YW-Model: program blocks, workflows, ports (data, parameters), channels (dataflow)
• YW-Graph: rendering using GraphViz/DOT files
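For illustration, a small Python script marked up with YesWorkflow-style structured comments; the tag set shown (@begin, @in, @out, @end) follows the YesWorkflow papers, while the script itself, its block names, and its file names are invented for this sketch.

    # @begin clean_data  @in raw.csv  @out clean.csv
    #   (YW-Extract reads these structured comments; YW-Model builds blocks,
    #    ports, and channels from them; YW-Graph renders the dataflow via DOT.)
    def clean_data():
        with open("raw.csv") as src, open("clean.csv", "w") as dst:
            for line in src:
                if line.strip():          # drop blank lines
                    dst.write(line)
    # @end clean_data

    # @begin summarize  @in clean.csv  @out summary.txt
    def summarize():
        with open("clean.csv") as src, open("summary.txt", "w") as dst:
            dst.write(f"rows: {sum(1 for _ in src)}\n")
    # @end summarize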
88. What About Provenance?
There are some solutions that allow capturing the provenance of a script: use (R, Python, …) libraries and/or code instrumentation to capture runtime observables such as file reads/writes, function calls, program variables and state, …
The noWorkflow system [Murta, Braganholo, Chirigati, Koop, Freire; IPAW 2014] exploits the Python profiling library to capture runtime provenance. This can be messy, as such tools capture every operating-system event/call!
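As a toy illustration of the underlying mechanism (not noWorkflow itself, which records far richer observables such as files, variables, and state), Python's built-in profiling hook can log every function call at runtime:

    import sys

    def log_calls(frame, event, arg):
        # Log one runtime observable per Python-level call (illustrative only).
        if event == "call":
            print(f"call: {frame.f_code.co_name} (line {frame.f_lineno})")
        return log_calls

    def analysis(x):
        return x * 2

    sys.setprofile(log_calls)   # start capturing runtime observables
    analysis(21)
    sys.setprofile(None)        # stop capturing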
89. Actually, We Can Construct the Provenance Without Recording It in the First Place!
YW + annotations: model your workflow! (YesWorkflow Provenance @ TaPP'15, slide 17)
97. Step 5: Bundle Resources into a Research Object
(Figure: the research object bundles the script, the abstract workflow, the concrete workflow(s), annotations, the paper, provenance, data, and attributions.)
98. Conclusions
Research on enabling reproducibility has seen a real push in recent years, with some great initiatives, software products and data repositories: Figshare, Dataverse, OpenAIRE, DataONE, RDA.
Workflows and scripts are no exception, and there have been some good proposals from a handful of researchers as well as practitioners, e.g., the MADICS Working Group on Reproducibility.
We are just scratching the surface, and there are numerous issues that still need to be addressed: workflow/script similarities, comparison of scientific results, and incremental re-computation, to cite a few, are still open topics.
99. Acknowledgement
Pinar Alper, Lucas Augusto Carvalho, Shawn Bowers, Sarah Cohen-Boulakia, Alban Gaignard, Daniel Garijo, Carole Goble, Bertram Ludäscher, Timothy McPhillips, Claudia Medeiros, Paolo Missier, Stian Soiland-Reyes
100. References
Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance patterns. Future Generation Comp. Syst. 75: 310-329 (2017)
Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina M. Hettne, Raúl Palma, Eleni Mina, Óscar Corcho, José Manuél Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole A. Goble: Using a suite of ontologies for preserving workflow-centric research objects. J. Web Sem. 32: 16-42 (2015)
Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, David De Roure: Fostering Scientific Workflow Preservation through Discovery of Substitute Services. eScience 2011: 97-104
Sarah Cohen-Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal, Christophe Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Comp. Syst. 75: 284-298 (2017)
Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance. SeWeBMeDA@ESWC 2017: 50-64
Lucas Augusto Montalvão Costa Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros: Converting scripts into reproducible workflow research objects. eScience 2016: 71-80
Timothy M. McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher: Retrospective Provenance Without a Runtime Provenance Recorder. TaPP 2015
Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludäscher: YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR abs/1502.02403 (2015)