Jasmine is a hybrid reasoning tool that discovers causal links in biological data through deterministic machine learning. It generates hypotheses to explain observations, verifies hypotheses by finding data that satisfies them, and falsifies some hypotheses by revealing inconsistencies. Jasmine represents domain knowledge as an ontology and uses abductive, inductive, and deductive reasoning. It predicts targets for objects by finding causes common to similar objects, based on the features those objects satisfy.
Gene Selection for Sample Classification in Microarray: Clustering Based Method – IOSR Journals
This document describes a clustering-based method for gene selection to classify samples in microarray data. It involves calculating the relevance of each gene to class labels and the redundancy between genes using mutual information. Genes are clustered based on their relevance, with the most relevant gene selected as the cluster representative. Min-hash clustering is then applied to reduce redundant genes and cluster size. The goal is to select a minimal set of non-redundant genes that can accurately classify samples by reducing noise from irrelevant genes.
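As a rough illustration of the relevance step, the sketch below scores discretized genes against class labels by mutual information and ranks them, the way a cluster representative would be chosen. The function names, the binary discretization, and the toy data are illustrative assumptions, not the paper's code:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data: three discretized genes (0=low, 1=high) over six samples.
genes = {
    "gene_a": [0, 0, 0, 1, 1, 1],   # perfectly tracks the class labels
    "gene_b": [0, 1, 0, 1, 0, 1],   # nearly uninformative about the labels
    "gene_c": [1, 1, 1, 0, 0, 0],   # also perfectly informative (inverted)
}
labels = [0, 0, 0, 1, 1, 1]

# Rank genes by relevance; the most relevant gene represents its cluster.
relevance = {g: mutual_information(v, labels) for g, v in genes.items()}
ranked = sorted(relevance, key=relevance.get, reverse=True)
print(ranked[0], round(relevance[ranked[0]], 3))  # gene_a 1.0
```

Redundancy between two genes would be computed the same way, by feeding two gene vectors (rather than a gene vector and the labels) into the same function.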
- The document proposes a multi-view stacking ensemble method for drug-target interaction (DTI) prediction that combines predictions from multiple machine learning models trained on different drug and target feature view combinations.
- It generates 126 view combination datasets from 14 drug views and 9 target views, then trains extra trees, random forest, and XGBoost classifiers on each view combination. Predictions from these base models are then combined using a stacking ensemble with an extra trees meta-learner.
- The method is shown to outperform single models and voting ensembles, and calibration of the meta-learner and use of local imbalance measures provide further improvements to predictive performance on DTI prediction tasks.
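The view-combination step above can be sketched in a few lines of standard Python; the view names below are placeholders, since the paper's actual 14 drug views and 9 target views are not listed here:

```python
from itertools import product

# Hypothetical view names standing in for the paper's actual feature views.
drug_views = [f"drug_view_{i}" for i in range(1, 15)]      # 14 drug views
target_views = [f"target_view_{j}" for j in range(1, 10)]  # 9 target views

# Every (drug view, target view) pairing yields one training dataset,
# on which the base classifiers (extra trees, random forest, XGBoost)
# would each be trained before stacking.
view_combinations = list(product(drug_views, target_views))
print(len(view_combinations))  # 14 * 9 = 126
```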
Relationships Among Classical Test Theory and Item Response Theory Frameworks... – AnusornKoedsri3
This document discusses relationships between classical test theory (CTT) and item response theory (IRT) frameworks. It reviews prior literature that has found the frameworks to be quite comparable in estimating person and item parameters. However, previous studies only used IRT to generate data and did not consider CTT models with an underlying normal variable assumption. The current study aims to compare item and person statistics from IRT and CTT models with an underlying normal variable assumption using simulated data generated from both frameworks. A simulation was conducted varying test length (20, 40, 60 items) and number of examinees (500, 1000) to compare one-parameter logistic, two-parameter logistic, parallel, and congeneric models.
This document describes two machine learning techniques, particle swarm optimization with support vector machines (PSO-SVM) and recursive feature elimination with support vector machines (RFE-SVM), that were used to classify autism neuroimaging data from the Autism Brain Imaging Data Exchange database. PSO-SVM was used to select discriminative features for classification, while RFE-SVM ranked features by importance. Both techniques aimed to improve classification accuracy and reduce overfitting by selecting optimal feature subsets from the high-dimensional neuroimaging data. The results could help develop brain-based diagnostic criteria for autism.
Inference Networks for Molecular Database Similarity Searching – CSCJournals
Molecular similarity searching is the process of finding chemical compounds that are similar to a target compound. The concept of molecular similarity plays an important role in modern computer-aided drug design methods and has been successfully applied in the optimization of lead series. It is used for chemical database searching and the design of combinatorial libraries. In this paper, we explore the feasibility and effectiveness of using a Bayesian inference network for similarity searching. The topology of the network represents the dependence relationships between molecular descriptors and molecules, together with quantitative knowledge of the probabilities encoding the strength of these relationships, mined from our compound collection. The retrieval of active compounds for a given target structure is obtained by means of an inference process through a network of dependences. The new approach is tested by its ability to retrieve seven sets of active molecules seeded in the MDDR. Our empirical results suggest that a similarity method based on Bayesian networks provides a promising and encouraging alternative to existing similarity searching methods.
The screening of chemical libraries is an important step in the drug discovery process. Existing chemical libraries contain up to millions of compounds. As screening at such a scale is expensive, virtual screening is often used instead. Several variants of virtual screening exist, and ligand-based virtual screening is one of them; it exploits the similarity of screened chemical compounds to known compounds. Besides the similarity measure employed, another aspect that greatly influences the performance of ligand-based virtual screening is the chosen chemical compound representation. In this paper, we introduce a fragment-based representation of chemical compounds: fragments are used to represent a compound, and each fragment is in turn represented by its physico-chemical descriptors. The representation is highly parametrizable, especially in the selection and application of the physico-chemical descriptors. To test the performance of our method, we used an existing framework for virtual screening benchmarking. The results show that our method is comparable to the best existing approaches and outperforms them on some data sets.
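Ligand-based screening of this kind ultimately reduces to comparing compound representations. As a minimal illustrative sketch (the fragment names are invented, and the paper's descriptor-based comparison is richer than a set overlap), the Tanimoto coefficient over fragment sets looks like this:

```python
def tanimoto(frags_a, frags_b):
    """Tanimoto coefficient between two compounds' fragment sets."""
    inter = len(frags_a & frags_b)
    union = len(frags_a | frags_b)
    return inter / union if union else 0.0

# Hypothetical fragment identifiers standing in for descriptor vectors.
query = {"benzene", "carboxyl", "hydroxyl"}
hit   = {"benzene", "carboxyl", "amine"}
decoy = {"sulfonyl", "nitro"}

print(round(tanimoto(query, hit), 2))    # 2 shared / 4 total = 0.5
print(round(tanimoto(query, decoy), 2))  # 0.0
```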
Polyrepresentation in a Quantum-inspired Information Retrieval Framework – Ingo Frommholz
The document discusses the principle of polyrepresentation in information retrieval. It describes polyrepresentation as using multiple evidence or interpretations from different cognitive perspectives to determine relevance. This helps mitigate the "cognitive freefall" problem. There are different types of polyrepresentation, including document, information need, and algorithm polyrepresentation. Document polyrepresentation uses different features of a document, information need polyrepresentation uses different interpretations of a user's need, and algorithm polyrepresentation uses different retrieval algorithms or weighting schemes. The document argues that combining rankings from different polyrepresentative approaches through restricted fusion, where only the overlapping results are considered, can improve retrieval performance over single models.
The document proposes a new approach to compare stock market patterns to DNA sequences using compression techniques. Stock market data is converted to binary sequences representing increases and decreases, which are then encoded into DNA nucleotides. These nucleotide sequences are divided and matched against human genome sequences using BLAST. The analysis found certain sub-sequences of the stock market patterns matched 100% to the human genome, suggesting this approach could potentially predict stock market behavior.
International Journal of Engineering Research and Development (IJERD) – IJERD Editor
Classification of Microarray Gene Expression Data by Gene Combinations using ... – IJCSEA Journal
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a given diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. This paper proposes a framework for finding informative gene combinations and classifying each combination into its relevant subtype using fuzzy logic. The genes are ranked by their statistical scores and the most informative genes are filtered. These genes are fuzzified to identify 2-gene and 3-gene combinations, and an intermediate value for each gene is calculated to select the top gene combinations, which are then used to classify lymphoma subtypes via fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and analyzed to estimate the accuracy of the results. The work is implemented in Java.
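Fuzzification of this kind is typically done with simple membership functions. The sketch below is a generic illustration; the triangular sets and their breakpoints are assumptions, not the paper's actual fuzzy rules:

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: rises from a to peak b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Assumed fuzzy sets over a gene expression value normalized to [0, 1].
low    = lambda x: triangular(x, -0.01, 0.0, 0.5)
medium = lambda x: triangular(x, 0.0, 0.5, 1.0)
high   = lambda x: triangular(x, 0.5, 1.0, 1.01)

x = 0.8  # one normalized expression value
print(round(medium(x), 2), round(high(x), 2))  # 0.4 0.6
```

A fuzzy rule over a 2-gene combination would then combine such memberships, e.g. taking the minimum of the two genes' "high" degrees as the rule's firing strength.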
This document provides an overview of quantitative structure-property relationship (QSPR) modeling for drug disposition prediction. It discusses why QSPRs are useful, the general methodology used in QSPR modeling including descriptor generation, statistical analysis and model validation. Specific approaches covered include multiple linear regression, partial least squares, artificial neural networks and internal/external validation techniques. The overall goal of a QSPR approach is to mathematically relate molecular descriptors to physicochemical properties and pharmacokinetic parameters to allow for drug property prediction without additional experiments.
A Review of Various Methods Used in the Analysis of Functional Gene Expressio... – ijitcs
Sequencing projects arising from high-throughput technologies, including DNA microarray sequencing, have made it possible to measure the expression levels of millions of genes in a biological sample simultaneously, as well as to annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more relevant integration of the data in order to test and validate researchers' hypotheses. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
A review robot fault diagnosis part ii qualitative models and search strategi... – Siva Samy
This document discusses qualitative models and search strategies used in fault diagnosis for industrial robots. It reviews two main types of diagnostic search strategies - topographic search, which uses a template of normal operation to identify faults, and symptomatic search, which looks for symptoms to direct the search to the fault location. It also outlines different forms of qualitative models, including causal models, abstraction hierarchies, fault trees, and qualitative physics, discussing their representation and advantages. The role of fault diagnosis for robot operations is highlighted, along with technical challenges to developing practical supervisory control systems.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA – cscpconf
In many data mining applications, class imbalance arises when examples of one class are overrepresented. Traditional classifiers yield poor accuracy on the minority class under class imbalance. Moreover, within-class imbalance, where classes are composed of multiple sub-concepts with different numbers of examples, also degrades classifier performance. In this paper, we propose an oversampling technique that handles between-class and within-class imbalance simultaneously while also taking generalization ability in the data space into account. The proposed method is based on two steps: performing model-based clustering with respect to the classes to identify the sub-concepts, and then computing the separating hyperplane based on equal posterior probability between the classes. The proposed method is tested on 10 publicly available data sets, and the results show that it is statistically superior to other existing oversampling methods.
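As a much-simplified stand-in for the proposed technique (this is plain interpolation-based oversampling in the spirit of SMOTE, not the paper's model-based-clustering-plus-hyperplane method), minority-class oversampling can be sketched as:

```python
import random

def oversample_minority(majority, minority, rng):
    """Grow the minority class to the majority size by interpolating
    between random pairs of minority examples. A simplified, SMOTE-like
    stand-in for the paper's cluster-based oversampler."""
    synthetic = list(minority)
    while len(synthetic) < len(majority):
        a, b = rng.sample(minority, 2)          # pick two minority points
        t = rng.random()                        # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

rng = random.Random(0)
majority = [(float(i), 0.0) for i in range(20)]          # 20 majority points
minority = [(0.0, 1.0), (1.0, 2.0), (2.0, 1.5)]          # 3 minority points
balanced = oversample_minority(majority, minority, rng)
print(len(balanced))  # 20
```

Every synthetic point lies on a segment between two real minority examples, so the new samples stay inside the minority class's convex hull.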
This document presents a method for calculating asymptotic confidence intervals for indirect effects in structural equation models. The author first shows that under maximum likelihood estimation, the coefficient vector in a recursive structural equation model is asymptotically normally distributed. Then, using the multivariate delta method, it is shown that nonlinear functions of the coefficients, such as indirect effects, are also asymptotically normally distributed. This allows confidence intervals to be constructed for indirect effects based on their asymptotic distribution. An example is worked through to demonstrate the method.
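The best-known special case of this result is the Sobel standard error for a two-step indirect effect a*b, where the delta method gives Var(ab) ≈ b²Var(a) + a²Var(b) when Cov(a, b) = 0. A small sketch with illustrative numbers (not the paper's worked example):

```python
from math import sqrt

def sobel_ci(a, se_a, b, se_b, z=1.96):
    """Asymptotic 95% CI for the indirect effect a*b via the multivariate
    delta method: Var(ab) ~ b^2 Var(a) + a^2 Var(b), assuming Cov(a,b)=0."""
    effect = a * b
    se = sqrt(b * b * se_a ** 2 + a * a * se_b ** 2)
    return effect - z * se, effect + z * se

# Illustrative path estimates and standard errors.
lo, hi = sobel_ci(a=0.5, se_a=0.1, b=0.4, se_b=0.08)
print(round(lo, 3), round(hi, 3))  # 0.089 0.311
```

The interval is symmetric about the point estimate a*b = 0.2, which is exactly what asymptotic normality of the nonlinear function implies.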
International Journal of Biometrics and Bioinformatics (IJBB) Volume (2) Issue... – CSCJournals
This document is the front matter of the International Journal of Biometrics and Bioinformatics (IJBB) Volume 2, Issue 1 published on February 28, 2008. It includes information about the editor in chief, copyright details, a table of contents listing one paper, and brief descriptions of the paper titled "Inference Networks for Molecular Database Similarity Searching" which explores using Bayesian networks for molecular similarity searching in chemical databases.
This document describes a study that uses machine learning algorithms to efficiently predict DNA-binding proteins. Support vector machines and cascade correlation neural networks are optimized and compared to determine the best performing model. The SVM model achieves 86.7% accuracy at predicting DNA-binding proteins using features like overall charge, patch size, and amino acid composition of proteins. The CCNN model achieves lower accuracy of 75.4%. The study aims to improve on previous work by using the standard jack-knife validation technique to evaluate model performance on unseen data.
Gene Selection for Patient Clustering by Gaussian Mixture Model – CSCJournals
Clustering is a basic component of data analysis and plays a significant role in microarray analysis. Gaussian mixture model (GMM) based clustering is a very popular clustering approach. However, the GMM approach is rarely used for clustering patients/samples based on gene expression data, because gene expression datasets usually contain a large number (m) of genes (variables) but only a few (n) sample observations; consequently, the GMM parameters cannot be estimated for patient clustering, since the covariance matrix has no inverse when m > n. To overcome this problem, we propose to use a few 'q' top DE (differentially expressed) genes (i.e., q < n < m) between two or more patient classes, selected proportionally from all DE gene groups. The rationale behind our proposal is that EE (equally expressed) genes between two or more classes make no significant contribution to minimizing the misclassification error rate (MER). To select the few top DE genes, we first cluster the genes (instead of patients/samples) with the GMM approach. We then detect DE and EE gene clusters (groups) by our proposed rule, and select the q top DE genes from the different DE gene clusters in proportion to cluster size. Using such a small number 'q' of top DE genes avoids the covariance matrix inversion problem in the estimation of the GMM parameters, and ultimately enables gene expression data (patient/sample) clustering. The performance of the proposed method is investigated using both simulated and real gene expression data, and in both settings it improves on the traditional GMM approaches.
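The DE-gene filtering idea can be illustrated with a toy two-sample score; the scoring function and data below are assumptions for illustration only, since the paper detects DE genes through its own GMM-based rule rather than a t-like statistic:

```python
from statistics import mean, stdev

def de_score(values, labels):
    """Two-sample t-like score of differential expression between classes 0/1."""
    g0 = [v for v, c in zip(values, labels) if c == 0]
    g1 = [v for v, c in zip(values, labels) if c == 1]
    pooled = (stdev(g0) ** 2 / len(g0) + stdev(g1) ** 2 / len(g1)) ** 0.5
    return abs(mean(g0) - mean(g1)) / pooled if pooled else 0.0

labels = [0, 0, 0, 1, 1, 1]
genes = {
    "de_gene":  [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],  # strongly differential
    "ee_gene1": [2.0, 2.1, 1.9, 2.0, 2.2, 1.9],  # equally expressed
    "ee_gene2": [3.1, 2.9, 3.0, 3.0, 3.1, 2.9],  # equally expressed
}

q = 1  # keep only the top-q DE genes; here q < n < m as the abstract requires
top = sorted(genes, key=lambda g: de_score(genes[g], labels), reverse=True)[:q]
print(top)  # ['de_gene']
```

Keeping only q < n genes makes the q-by-q covariance matrices of the downstream GMM full rank, which is precisely the inversion problem the abstract describes.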
Ontologies are used to organize information in many domains, such as artificial intelligence, information science, the semantic web, and library science. Ontologies of an entity holding different information can be merged to create more knowledge about that particular entity. Ontologies today power more accurate search and retrieval on websites such as Wikipedia. As we move towards Web 3.0, also termed the semantic web, ontologies will play an even more important role.
Ontologies are represented in various forms such as RDF, RDFS, XML, and OWL. Querying ontologies can yield basic information about an entity. This paper proposes an automated method for ontology creation using concepts from NLP (Natural Language Processing), information retrieval, and machine learning. Concepts drawn from these domains help in designing more accurate ontologies, represented here in XML format. The paper uses classification algorithms to assign labels to documents, document similarity to cluster documents similar to the input document together, and summarization to shorten the text while keeping the important terms essential for building the ontology. The module is implemented in the Python programming language with NLTK (the Natural Language Toolkit). The ontologies created in XML convey to a lay person the definitions of the important terms and their lexical relationships.
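The XML output stage can be sketched with the standard library's ElementTree; the element and attribute names below are illustrative, not the paper's actual schema:

```python
import xml.etree.ElementTree as ET

# A minimal sketch of the kind of XML ontology such a pipeline might emit.
ontology = ET.Element("ontology", subject="machine learning")

term = ET.SubElement(ontology, "term", name="classification")
ET.SubElement(term, "definition").text = (
    "Assigning predefined labels to documents."
)
relations = ET.SubElement(term, "lexicalRelations")
ET.SubElement(relations, "synonym").text = "categorization"
ET.SubElement(relations, "hypernym").text = "supervised learning"

xml_text = ET.tostring(ontology, encoding="unicode")
print(xml_text)
```

In the proposed pipeline, the term definitions would come from the summarization step and the lexical relations from the classification and similarity steps, rather than being hard-coded as here.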
Performance analysis of linkage learning techniques in genetic algorithms – eSAT Journals
Abstract: One variant of the Genetic Algorithm, the Linkage Learning Genetic Algorithm (LLGA), enhances the efficiency of the Simple Genetic Algorithm (SGA) in solving NP-hard problems. Discovering a linkage learning technique is an important task in GA research. Almost all existing linkage learning techniques follow either random or probabilistic approaches, which make repeated passes over the population to determine the relationships between individuals. An SGA with a random linkage technique is simple but may take a long time to converge to optimal solutions. This paper uses a linkage learning operator called Gene Silencing, a mechanism inspired by biological systems. Gene Silencing improves linkages by preserving the building blocks in an individual from the disruption caused by recombination processes such as crossover and mutation. It converges quickly to the optimal solution without compromising diversification of the search space. To demonstrate this, the Travelling Salesperson Problem (TSP) was chosen, where the order of cities in a tour must be retained. Experiments were carried out on different TSP benchmark instances taken from TSPLIB, a standard library of TSP problems. These benchmark instances were also run with various linkage learning techniques, and the performance of those techniques is analysed against the Gene Silencing (GS) mechanism with respect to solution quality and convergence speed. Index Terms: Linkage Learning, Gene Silencing, Building Blocks, Genetic Algorithm, TSPLIB, Performance Analysis
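Gene Silencing itself is not specified in this abstract, but the building-block-preservation problem it targets can be illustrated with a standard order-preserving recombination for tours. The sketch below is a simplified order-crossover (OX) variant, shown as an assumption for illustration, not the GS operator:

```python
import random

def order_crossover(p1, p2, rng):
    """Simplified order crossover (OX): copy a contiguous slice from parent 1,
    then fill the remaining positions with parent 2's cities in their original
    order. The copied slice is a building block (sub-tour) that survives
    recombination intact."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = p1[i:j + 1]
    fill = [c for c in p2 if c not in child]
    for k in range(n):
        if child[k] is None:
            child[k] = fill.pop(0)
    return child

rng = random.Random(42)
parent1 = list(range(8))                 # cities 0..7 in order
parent2 = [7, 3, 0, 4, 1, 6, 2, 5]
child = order_crossover(parent1, parent2, rng)
print(sorted(child) == parent1)  # True: the child is still a valid tour
```

Unlike naive one-point crossover on permutations, this never duplicates or drops a city, which is why order-preserving operators matter for TSP encodings.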
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
This document discusses chemoinformatics and its role in distance education. It defines chemoinformatics as the application of informatics methods to solve chemical problems, involving the design, creation, organization, management, analysis and use of chemical information. The document then outlines proposed courses for a distance learning program in chemoinformatics, including introductory courses covering fundamental concepts as well as more advanced topics involving programming, databases, and molecular modeling. It concludes by discussing how chemoinformatics can help improve distance education programs in chemistry fields.
Introduction
What is cheminformatics?
Why do we have to use informatics methods in chemistry?
Is it cheminformatics or chemoinformatics?
Emergence of cheminformatics
Three major aspects of cheminformatics
Basics of cheminformatics
Topological representations
Tools used for cheminformatics
Application of cheminformatics
Role of cheminformatics in modern drug discovery
Conclusion
Bibliography
The document proposes a novel hybrid method called PCA-BEL for classifying gene expression microarray data. PCA-BEL uses principal component analysis (PCA) for feature extraction, followed by classification with a Brain Emotional Learning (BEL) network. PCA reduces the dimensionality of the microarray data to overcome the high-dimensionality problem, and BEL is then used for classification due to its low computational complexity, which makes it suitable for high-dimensional data. The method is tested on several cancer gene expression datasets and achieves average accuracies of 100%, 96%, 98.32%, 87.40% and 88% on the five datasets respectively, demonstrating its effectiveness for microarray classification tasks.
IJERA (International Journal of Engineering Research and Applications) is an international, online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
Mahesh Joshi has studied computer science and engineering, obtaining degrees from universities in India and the US. He is currently pursuing a Masters in Language Technologies at Carnegie Mellon University. His research focuses on natural language processing techniques such as word sense disambiguation and abbreviation expansion, especially in medical texts. He has published papers in conferences and released several related software tools.
Application specific Programming of the Texas Instruments ... – butest
The document provides an overview of programming the Texas Instruments TMS320C6711 digital signal processor using Simulink. It describes the hardware features of the TMS320C6711 DSK board and how to configure it for use with Simulink. It then gives an example of building an audio reverberation model in Simulink and deploying it to run on the DSK board.
This proposal seeks funding to develop technology for detecting insider threats through analyzing email data and application event traces. The proposal involves extending existing email tracking and mining technologies called MET and EMT to incorporate additional data sources like host-based sensors. The work will result in an integrated email security appliance called the EmailWall that can detect anomalous insider behavior, model groups of users, and quarantine potentially malicious emails. The schedule outlines milestones over 18 months for research, development, testing, and deployment of the system for insider threat detection.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Classification of Microarray Gene Expression Data by Gene Combinations using ... (IJCSEA Journal)
Feature selection has attracted a huge amount of interest in both the research and application communities of data mining. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a certain diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of co-regulated genes whose collective expression is strongly associated with the sample categories or response variables. This paper proposes a framework to find informative gene combinations and to classify each gene combination into its relevant subtype using fuzzy logic. The genes are ranked based on their statistical scores, and highly informative genes are filtered. These genes are fuzzified to identify 2-gene and 3-gene combinations, and the intermediate value of each gene is calculated to select the top gene combinations, which are then used to classify lymphoma subtypes with fuzzy rules. Finally, the accuracy of the top gene combinations is compared with clustering results. The classification is done using the gene combinations and analyzed to assess the accuracy of the results. The work is implemented in the Java language.
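The fuzzification step can be illustrated with a minimal sketch: triangular low/medium/high membership functions applied to normalized expression values. The breakpoints (0.0 / 0.5 / 1.0) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function: peak 1 at b, feet at a and c."""
    x = np.asarray(x, dtype=float)
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

def fuzzify_expression(values):
    """Map expression values normalized to [0, 1] onto low/medium/high
    fuzzy sets (the breakpoints used here are illustrative)."""
    return {
        "low":    tri(values, -0.5, 0.0, 0.5),
        "medium": tri(values,  0.0, 0.5, 1.0),
        "high":   tri(values,  0.5, 1.0, 1.5),
    }

m = fuzzify_expression([0.1, 0.5, 0.9])
print(m["low"], m["medium"], m["high"])
```

A value of 0.1 gets high "low" membership and zero "high" membership; 0.5 is fully "medium". Rule-based classification would then combine these memberships across gene pairs and triples.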
This document provides an overview of quantitative structure-property relationship (QSPR) modeling for drug disposition prediction. It discusses why QSPRs are useful, the general methodology used in QSPR modeling including descriptor generation, statistical analysis and model validation. Specific approaches covered include multiple linear regression, partial least squares, artificial neural networks and internal/external validation techniques. The overall goal of a QSPR approach is to mathematically relate molecular descriptors to physicochemical properties and pharmacokinetic parameters to allow for drug property prediction without additional experiments.
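The core of a QSPR model is a mapping from molecular descriptors to a property. A minimal multiple-linear-regression sketch with an internal fit and external validation split follows; the descriptor names and all data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical descriptors (e.g. logP, molecular weight, H-bond donors);
# both the descriptor values and the target property are synthetic
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Fit on a training set, then externally validate on held-out "compounds"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print("R^2 on held-out compounds:", round(r2, 3))
```

Partial least squares or a neural network would slot into the same fit/validate pattern; the external validation step is what guards against an over-fitted model.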
A Review of Various Methods Used in the Analysis of Functional Gene Expressio... (ijitcs)
Sequencing projects arising from high-throughput technologies, including DNA microarrays, have made it possible to measure the expression levels of millions of genes of a biological sample simultaneously, as well as to annotate and identify the role (function) of those genes. Consequently, bioinformatics approaches have been developed to better manage and organize this significant amount of information. These approaches provide a representation and a more 'relevant' integration of the data in order to test and validate researchers' hypotheses. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
This document summarizes a study on multilabel text classification and the effect of label hierarchy. The study implements various algorithms for multilabel classification, including naive Bayes, k-nearest neighbors, random forests, SVMs, RBMs, and hierarchical classification algorithms. It evaluates the algorithms on four datasets that vary in features, labels, training/test sizes, and label cardinality. The goal is to analyze how different algorithmic approaches and dataset properties affect classification performance, particularly for hierarchical learning algorithms. Evaluation measures include micro/macro-averaged precision, recall and F1-score. The document provides details on the problem formulation, algorithms, implementation, datasets and evaluation.
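The micro- and macro-averaged F1 measures mentioned above differ in how they pool label decisions, which is easy to see on toy multilabel predictions (the data below is invented):

```python
from sklearn.metrics import f1_score

# Toy multilabel predictions for 4 instances and 3 labels, in binary indicator form
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

# Micro-averaging pools all (instance, label) decisions into one confusion matrix;
# macro-averaging computes F1 per label, then takes the unweighted mean
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
print("micro F1:", micro, "macro F1:", macro)
```

Macro-averaging weights rare labels as heavily as common ones, which is why it is the harsher measure on datasets with high label cardinality and skewed label frequencies.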
A review robot fault diagnosis part ii qualitative models and search strategi... (Siva Samy)
This document discusses qualitative models and search strategies used in fault diagnosis for industrial robots. It reviews two main types of diagnostic search strategies - topographic search, which uses a template of normal operation to identify faults, and symptomatic search, which looks for symptoms to direct the search to the fault location. It also outlines different forms of qualitative models, including causal models, abstraction hierarchies, fault trees, and qualitative physics, discussing their representation and advantages. The role of fault diagnosis for robot operations is highlighted, along with technical challenges to developing practical supervisory control systems.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA (cscpconf)
In many data mining applications, class imbalance arises when examples of one class greatly outnumber those of another. Traditional classifiers achieve poor accuracy on the minority class because of this imbalance. Within-class imbalance, where a class is composed of multiple sub-concepts with different numbers of examples, further degrades classifier performance. In this paper, we propose an oversampling technique that handles between-class and within-class imbalance simultaneously while also accounting for generalization ability in the data space. The proposed method has two steps: performing model-based clustering with respect to the classes to identify the sub-concepts, and then computing the separating hyperplane based on equal posterior probability between the classes. The method is tested on 10 publicly available data sets, and the results show that it is statistically superior to other existing oversampling methods.
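A hedged sketch of the clustering half of this idea: fit a Gaussian mixture to the minority class to expose its sub-concepts, then draw synthetic points from the fitted mixture. This deliberately simplifies away the paper's equal-posterior separating-hyperplane step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_oversample(X_min, n_target, n_components=2, seed=0):
    """Oversample a minority class by fitting a GMM to find sub-concepts,
    then sampling new points from the mixture (a simplification of the
    posterior-based step described in the abstract)."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_min)
    X_new, _ = gmm.sample(n_target - len(X_min))
    return np.vstack([X_min, X_new])

rng = np.random.default_rng(0)
# Minority class made of two sub-concepts of unequal size (within-class imbalance)
X_min = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (10, 2))])
X_bal = cluster_oversample(X_min, n_target=200)
print(X_bal.shape)
```

Because sampling comes from the mixture rather than from the pooled data, the small sub-concept is oversampled in proportion to its mixture weight instead of being drowned out.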
This document presents a method for calculating asymptotic confidence intervals for indirect effects in structural equation models. The author first shows that under maximum likelihood estimation, the coefficient vector in a recursive structural equation model is asymptotically normally distributed. Then, using the multivariate delta method, it is shown that nonlinear functions of the coefficients, such as indirect effects, are also asymptotically normally distributed. This allows confidence intervals to be constructed for indirect effects based on their asymptotic distribution. An example is worked through to demonstrate the method.
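The delta-method interval for an indirect effect a*b can be sketched numerically. The first-order (Sobel) approximation below assumes independent path estimates, and the numbers are hypothetical.

```python
import math

def indirect_effect_ci(a, se_a, b, se_b, z=1.96):
    """95% asymptotic CI for the indirect effect a*b via the first-order
    multivariate delta method (Sobel), assuming cov(a, b) = 0."""
    ab = a * b
    se_ab = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    return ab, (ab - z * se_ab, ab + z * se_ab)

# Hypothetical path estimates from a recursive structural equation model
ab, (lo, hi) = indirect_effect_ci(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect effect = {ab:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The asymptotic normality of the coefficient vector is what licenses treating a*b as normal with this standard error; with small samples the interval is only approximate.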
International Journal of Biometrics and Bioinformatics(IJBB) Volume (2) Issue... (CSCJournals)
This document is the front matter of the International Journal of Biometrics and Bioinformatics (IJBB) Volume 2, Issue 1 published on February 28, 2008. It includes information about the editor in chief, copyright details, a table of contents listing one paper, and brief descriptions of the paper titled "Inference Networks for Molecular Database Similarity Searching" which explores using Bayesian networks for molecular similarity searching in chemical databases.
This document describes a study that uses machine learning algorithms to efficiently predict DNA-binding proteins. Support vector machines and cascade correlation neural networks are optimized and compared to determine the best performing model. The SVM model achieves 86.7% accuracy at predicting DNA-binding proteins using features like overall charge, patch size, and amino acid composition of proteins. The CCNN model achieves lower accuracy of 75.4%. The study aims to improve on previous work by using the standard jack-knife validation technique to evaluate model performance on unseen data.
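The jack-knife evaluation mentioned in the study is equivalent to leave-one-out cross-validation, which can be sketched with scikit-learn. The real protein features (overall charge, patch size, amino acid composition) are replaced here by synthetic ones.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the protein feature matrix
X, y = make_classification(n_samples=60, n_features=8, random_state=0)

# Jack-knife validation: each sample is predicted by a model trained on all the others
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneOut())
acc = scores.mean()
print("leave-one-out accuracy:", acc)
```

Leave-one-out gives an almost unbiased estimate of performance on unseen data, at the cost of training one model per sample.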
Gene Selection for Patient Clustering by Gaussian Mixture Model (CSCJournals)
Clustering is a basic component of data analysis and plays a significant role in microarray analysis. Gaussian mixture model (GMM) based clustering is a popular approach. However, GMMs are rarely used for patient/sample clustering based on gene expression data, because such datasets usually contain a large number (m) of genes (variables) but only a few (n) sample observations; the GMM parameters then cannot be estimated for patient clustering, because the covariance matrix has no inverse when m > n. To overcome this problem, we propose using only a few 'q' top DE (differentially expressed) genes (i.e., q < n < m) between two or more patient classes, selected proportionally from all DE gene groups. The rationale behind our proposal is that the EE (equally expressed) genes between two or more classes make no significant contribution to minimizing the misclassification error rate (MER). To select the few top DE genes, we first cluster the genes (instead of the patients/samples) with the GMM approach. We then detect DE and EE gene clusters (groups) by our proposed rule, and select the q top DE genes from the different DE gene clusters in proportion to cluster size. Using such a small number q of top DE genes avoids the covariance-matrix inversion problem when estimating the GMM parameters, and ultimately enables clustering of gene expression (patient/sample) data. The performance of the proposed method is investigated using both simulated and real gene expression data; in both situations the proposed method improves on the traditional GMM approaches.
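A rough sketch of the gene-clustering idea, assuming a simple absolute-mean-difference DE statistic in place of the paper's rule (the data and the choice of statistic are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_genes, n_samples = 500, 20
labels = np.array([0] * 10 + [1] * 10)      # two patient classes
expr = rng.normal(size=(n_genes, n_samples))
expr[:50, labels == 1] += 3.0               # first 50 genes are DE, the rest EE

# Cluster genes (not patients) on a per-gene DE statistic, then keep the
# top-q genes from the cluster with the largest mean shift
diff = np.abs(expr[:, labels == 1].mean(1) - expr[:, labels == 0].mean(1))
gmm = GaussianMixture(n_components=2, random_state=0).fit(diff.reshape(-1, 1))
de_cluster = np.argmax(gmm.means_.ravel())
de_genes = np.where(gmm.predict(diff.reshape(-1, 1)) == de_cluster)[0]
q = 10
top_q = de_genes[np.argsort(diff[de_genes])[::-1][:q]]
print(len(top_q), "genes selected; indices:", sorted(int(g) for g in top_q))
```

With q well below the number of samples, a second GMM fitted on the q selected genes has an invertible covariance matrix and can cluster the patients directly.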
Ontologies are used to organize information in many domains such as artificial intelligence, information science, the semantic web, and library science. Ontologies of an entity containing different information can be merged to build more knowledge of that particular entity. Ontologies already power more accurate search and retrieval on websites like Wikipedia, and as we move toward Web 3.0, also termed the semantic web, they will play an even more important role.

Ontologies are represented in various forms such as RDF, RDFS, XML, and OWL, and querying them can yield basic information about an entity. This paper proposes an automated method for ontology creation, using concepts from NLP (Natural Language Processing), information retrieval and machine learning; concepts drawn from these domains help in designing more accurate ontologies represented in the XML format. The approach uses document classification algorithms to assign labels to documents, document similarity to cluster documents similar to the input document together, and summarization to shorten the text while keeping the terms essential to building the ontology. The module is implemented in the Python programming language using NLTK (the Natural Language Toolkit). The ontologies created in XML convey to a lay person the definitions of the important terms and their lexical relationships.
Performance analysis of linkage learning techniques in genetic algorithms (eSAT Journals)
Abstract: One variant of the Genetic Algorithm, the Linkage Learning Genetic Algorithm (LLGA), improves on the efficiency of the Simple Genetic Algorithm (SGA) when solving NP-hard problems. Discovering a good linkage learning technique is an important task in GA research. Almost all existing linkage learning techniques follow either random or probabilistic approaches, which make repeated passes over the population to determine the relationships between individuals. An SGA with a random linkage technique is simple but may take a long time to converge to optimal solutions. This paper uses a linkage learning operator called Gene Silencing, a mechanism inspired by biological systems. Gene Silencing improves linkages by preserving the building blocks in an individual from disruption by recombination processes such as crossover and mutation, and it converges quickly to the optimal solution without compromising diversification of the search space. To demonstrate this, the Travelling Salesman Problem (TSP) was chosen, with the goal of retaining the order of cities in a tour. Experiments were carried out on different TSP benchmark instances taken from TSPLIB, a standard library of TSP problems. The same benchmark instances were also run with various linkage learning techniques, and their performance was compared against the Gene Silencing (GS) mechanism with respect to solution quality and convergence speed. Index Terms: Linkage Learning, Gene Silencing, Building Blocks, Genetic Algorithm, TSPLIB, Performance Analysis
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
This document discusses chemoinformatics and its role in distance education. It defines chemoinformatics as the application of informatics methods to solve chemical problems, involving the design, creation, organization, management, analysis and use of chemical information. The document then outlines proposed courses for a distance learning program in chemoinformatics, including introductory courses covering fundamental concepts as well as more advanced topics involving programming, databases, and molecular modeling. It concludes by discussing how chemoinformatics can help improve distance education programs in chemistry fields.
Introduction
What is cheminformatics?
Why do we have to use informatics methods in chemistry?
Is it cheminformatics or chemoinformatics?
Emergence of cheminformatics
Three major aspects of cheminformatics
Basics of cheminformatics
Topological representations
Tools used for cheminformatics
Application of cheminformatics
Role of cheminformatics in modern drug discovery
Conclusion
Bibliography
The document proposes a novel hybrid method called PCA-BEL for classifying gene expression microarray data. PCA-BEL uses principal component analysis (PCA) for feature extraction, followed by classification with a Brain Emotional Learning (BEL) network. PCA reduces the dimensionality of the microarray data to overcome the high-dimensionality problem; BEL is then used for classification due to its low computational complexity, which makes it suitable for high-dimensional data. The method is tested on several cancer gene expression datasets and achieves average accuracies of 100%, 96%, 98.32%, 87.40% and 88% on the five datasets respectively, demonstrating its effectiveness for microarray classification tasks.
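A hedged sketch of the PCA-then-classify pipeline. BEL networks have no standard library implementation, so logistic regression stands in for the classifier stage; the dataset is a generic scikit-learn one, not a microarray set from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# PCA compresses the feature space before classification, mirroring the
# feature-extraction stage of PCA-BEL
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    LogisticRegression(max_iter=1000))
acc = cross_val_score(clf, X, y, cv=5).mean()
print("5-fold accuracy after PCA:", round(acc, 3))
```

The same pipeline shape applies to real microarray data, where PCA would shrink tens of thousands of gene features down to a handful of components before the classifier sees them.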
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
This document details research applying genetic programming to optimize word aligners for machine translation. Genetic programming mimics biological evolution by applying selection, mutation, and crossover to programs to find better solutions. The researcher aims to evolve word alignment programs but has so far been unsuccessful because mutations generate invalid code. Implementation decisions around representation, selection, crossover, and halting criteria are discussed. Results show minor improvements, but the evolutionary runs prove ineffective: crossover alone does not provide enough variety, and mutation is not working properly.
This document discusses a Bayesian approach to active learning for collaborative filtering. It summarizes that collaborative filtering uses preference patterns to predict user ratings, but requires many user ratings for accuracy. Active learning aims to acquire the most informative ratings from users. Previous active learning methods only consider estimated models, which can be misleading with few ratings. The proposed method takes a full Bayesian approach, averaging expected loss over the posterior model distribution to account for model uncertainty and avoid problems from estimated models. It aims to select items that maximize reduction in expected loss when considering the full model distribution, rather than just an estimated model.
This newsletter from Animal Care Services provides updates on several topics:
1) The Mouse Models Core has created new transgenic and knockout mouse lines and is offering additional services like embryo collection and storage.
2) Construction is underway for a new Biomedical Sciences Building, with an expected move-in date of October 30, 2009.
3) "Special Services" fees will now be implemented for any work done by ACS outside of standard husbandry care, such as specialized diets or housing.
4) Several staff changes are announced, including new veterinarians, technicians of the year, and retirements.
The document describes several graduate level computer science courses offered at the university, including courses on professional communication, programming, algorithms, data structures, operating systems, and software engineering. Key courses cover computer science fundamentals like programming in C/C++ and Java, discrete structures, algorithm analysis, and automata theory. The courses provide students with knowledge and skills in core computer science topics as well as practical skills in programming, software engineering, and systems.
This document analyzes YouTube's business model. It explains that YouTube and other online video sites represent a new business model for audiovisual content, driven by the change in consumption habits caused by new technologies. It describes how YouTube leverages user participation to continuously improve and to attract an audience different from that of traditional media.
A SYSTEM OF SERIAL COMPUTATION FOR CLASSIFIED RULES PREDICTION IN NONREGULAR ... (ijaia)
Objects or structures that are regular take uniform dimensions. Based on the concept of regular models, our previous research developed a regular ontology that models learning structures in a multiagent system for uniform pre-assessments in a learning environment. This regular ontology led to a classified-rules learning algorithm that predicts the actual number of rules needed for inductive learning processes and decision making in a multiagent system. But not all processes or models are regular. This paper therefore presents a polynomial-equation system that can estimate and predict the required number of rules of a non-regular ontology model, given some defined parameters.
Systems biology aims to understand how biological systems function through the interactions of their components. It uses both top-down and bottom-up approaches, with top-down identifying molecular interaction networks from large-scale omics data and bottom-up examining mechanisms arising from known component interactions. While high-throughput data has driven systems biology, the field also requires hypothesis-driven testing using computational models informed by both large and small-scale experimental data. Integrating different types of genome-wide and molecular-level data can provide insights into biological processes and diseases.
Novel modelling of clustering for enhanced classification performance on gene... (IJECEIAES)
Gene expression data is popular for its capability to disclose various disease conditions. However, the conventional procedure for extracting gene expression data itself introduces various artifacts that make diagnosing and classifying a complex disease indication such as cancer challenging. A review of existing research indicates that few classification approaches have proven standard with respect to high accuracy on gene expression data, quite apart from unaddressed problems of computational complexity. The proposed manuscript therefore introduces a novel, simplified model that uses the Graph Fourier Transform and eigenvalues/eigenvectors to offer better classification performance, considering a case study of a microarray database, a typical example of gene expression data. The study outcome shows that the proposed system offers comparatively better accuracy and reduced computational complexity than existing clustering approaches.
National Resource for Networks Biology's TR&D Theme 3: Although networks have been very useful for representing molecular interactions and mechanisms, network diagrams do not visually resemble the contents of cells. Rather, the cell involves a multi-scale hierarchy of components – proteins are subunits of protein complexes which, in turn, are parts of pathways, biological processes, organelles, cells, tissues, and so on. In this technology research project, we will pursue methods that move Network Biology towards such hierarchical, multi-scale views of cell structure and function.
This document discusses a thesis that uses machine learning algorithms to diagnose mental illness using MRI brain scans. Specifically, it analyzes schizophrenia, bipolar disorder, and healthy control subject data from multiple imaging modalities. It trains and tests eight machine learning classifiers - support vector machines, k-nearest neighbors, logistic regression, naive Bayes, and random forests - on the raw imaging data as well as data transformed through dimensionality reduction techniques. The results aim to demonstrate the efficacy of these algorithms at classifying subjects based on their brain scans and diagnosing their mental condition.
REPRESENTATION OF UNCERTAIN DATA USING POSSIBILISTIC NETWORK MODELS (cscpconf)
Uncertainty is pervasive in real-world environments. It arises from vagueness, which is associated with the difficulty of making sharp distinctions, and from ambiguity, which is associated with situations in which the choice among several precise alternatives cannot be perfectly resolved. Analysis of large collections of uncertain data is a primary task in real-world applications, because data is often incomplete, inaccurate, or inconsistent. Uncertain data can be represented in various forms, such as data stream models, linkage models, and graphical models, which offer simple, natural ways to process the data and produce optimized results through query processing. In this paper, we propose that the uncertain data model can be represented as a possibilistic data model, and vice versa, for processing uncertain data using various data models such as the possibilistic linkage model, data streams, and possibilistic graphs. The paper presents the representation and processing of the possibilistic linkage model through possible worlds, using a product-based operator.
Statistical Analysis based Hypothesis Testing Method in Biological Knowledge ... (ijcsa)
This document summarizes a research paper that introduces a text mining-based method for answering biological queries and testing hypotheses. The proposed approach analyzes hypotheses stated as natural language questions and measures their statistical significance based on existing literature. It computes a p-value to determine whether to accept or reject each hypothesis. The method also generates a network of related biological entities to provide context and suggest new hypotheses for further investigation. The goal is to help researchers quantitatively evaluate assumptions and guide relevant discovery of new biological knowledge.
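One simple way to attach a p-value to a literature-derived association is a hypergeometric co-occurrence test, sketched below. This null model and all the counts are illustrative assumptions; the paper's actual significance test may differ.

```python
from scipy.stats import hypergeom

def cooccurrence_pvalue(n_both, n_a, n_b, n_total):
    """P(co-occurrence >= n_both) under a hypergeometric null: how surprising
    is it that entities A and B appear together this often by chance?"""
    return hypergeom.sf(n_both - 1, n_total, n_a, n_b)

# Hypothetical counts: a gene appears in 50 abstracts, a disease in 80,
# both together in 20, out of 10,000 abstracts total
p = cooccurrence_pvalue(20, 50, 80, 10_000)
print("p-value:", p, "-> reject H0 at 0.05:", p < 0.05)
```

Under independence the expected overlap here is only 0.4 abstracts, so observing 20 yields a vanishingly small p-value, and the hypothesis of association would be accepted.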
This document introduces machine learning techniques and their applications in systems biology. It discusses the challenges in systems biology due to the complexity and size of biological systems. Machine learning is well-suited for systems biology as it can analyze large datasets, adapt to new information, and discover relationships hidden in data. Specifically, the document describes inductive logic programming, clustering, Bayesian networks, and decision trees and how they are used in classification, forecasting, clustering, description, deviation detection, link analysis, and visualization of biological data.
This document discusses the use of Bayesian statistics in genetic data analysis. It explains that genetic data often results from complex stochastic processes that require probabilistic models. Bayesian inference provides a convenient framework for dealing with genetic problems that involve many interdependent parameters. The popularity of Bayesian methods has increased due to advances in computational power that allow for Markov chain Monte Carlo techniques to tackle complex likelihood problems. The document reviews applications of Bayesian analysis in population genetics, genomics, and human genetics.
An approach for self creating software code in bionets with artificial embryo... (eSAT Publishing House)
Bacterial virulence proteins, which are classified by the structure of their virulence, cause several diseases. For instance, adhesins play an important role in adhering to host cells; their DNA sequences encode a variety of virulence properties. Several important methods have been developed to predict bacterial virulence proteins with the aim of finding new drugs or vaccines. In this study, we propose a feature selection method for the classification of bacterial virulence proteins. The features are constructed directly from the amino acid sequence of a given protein. Amino acids form proteins, are critical to life, and have many important functions in living cells; their differing physicochemical properties are described by vectors of 20 numerical values, collected in the AAindex database of 544 known indices. The approach has two steps. First, the amino acid sequence of a given protein is analysed with Lyapunov exponents to test whether it has a chaotic structure in accordance with chaos theory. Then, if the results characterize the complete distribution in phase space as that of a deterministic system, the related protein is taken to have a chaotic structure. Empirical results on the adhesin and non-adhesin data sets reveal that the generated feature vectors give the best performance when built from the chaotic structure of the physicochemical features of the amino acids.
The Chaotic Structure of Bacterial Virulence Protein Sequences (csandit)
This document discusses analyzing the chaotic structure of bacterial virulence protein sequences using their amino acid physicochemical properties. It proposes a method involving two main steps: 1) Analyzing the amino acid sequence of a given protein using Lyapunov exponents to determine if it exhibits chaotic behavior according to chaos theory. 2) If the results characterize the complete distribution in phase space like a deterministic system, the related protein is considered to have a chaotic structure. The method is tested on adhesin and non-adhesin protein datasets. Results show that physicochemical feature vectors generated from the chaotic structure analysis perform best for classification, supporting the hypothesis that bacterial virulence protein sequences have chaotic structures derived from the physicochemical properties of their constituent amino acids.
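The role of the Lyapunov exponent (positive means chaos) can be illustrated on the logistic map, a canonical chaotic system, rather than on protein sequences; this is an illustration of the quantity itself, not the paper's protein pipeline.

```python
import math

def logistic_lyapunov(r=4.0, x0=0.2, n=100_000, burn=1000):
    """Largest Lyapunov exponent of the logistic map x -> r*x*(1-x),
    estimated as the average of log|f'(x)| along the orbit."""
    x = x0
    for _ in range(burn):           # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r * (1 - 2 * x)))  # log of |derivative|
        x = r * x * (1 - x)
    return total / n

lam = logistic_lyapunov()
print("estimated Lyapunov exponent:", lam)  # ~ln 2 for r = 4
```

A positive exponent means nearby trajectories diverge exponentially, the signature of deterministic chaos; the paper applies the same criterion to numeric series derived from amino acid physicochemical indices.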
A comparative study of clustering and biclustering of microarray data (ijcsit)
There are subsets of genes that show similar behavior under subsets of conditions, so we say that they coexpress, while behaving independently under other subsets of conditions. Discovering such coexpressions can help uncover genomic knowledge such as gene networks or gene interactions. That is why it is of utmost importance to cluster genes and conditions simultaneously, to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem; consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on clustering and biclustering of gene expression data, also called microarray data.
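Biclustering can be tried quickly with scikit-learn's spectral biclustering on a synthetic checkerboard expression matrix; this is one generic heuristic for the NP-hard problem, not necessarily one of the methods surveyed.

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard
from sklearn.metrics import consensus_score

# Synthetic "expression matrix" with a planted checkerboard of coexpressed blocks
data, rows, cols = make_checkerboard(shape=(100, 60), n_clusters=(3, 2),
                                     noise=1.0, random_state=0)

# Simultaneously cluster rows (genes) and columns (conditions)
model = SpectralBiclustering(n_clusters=(3, 2), random_state=0).fit(data)
score = consensus_score(model.biclusters_, (rows, cols))
print("consensus with planted biclusters:", round(score, 2))
```

A consensus score near 1 means the recovered biclusters match the planted gene-by-condition blocks, the same goal as finding genes coexpressed under subsets of conditions.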
Biological Significance of Gene Expression Data Using Similarity Based Biclus... (CSCJournals)
Unlocking the complexity of a living organism's biological processes, functions and genetic network is vital in learning how to improve the health of humankind. Genetic analysis, especially biclustering, is a significant step in this process. Though many biclustering methods exist, only a few provide a query-based approach for biologists to search for the biclusters that contain a certain gene of interest. The proposed query-based biclustering algorithm SIMBIC+ first identifies a functionally rich query gene. After identifying the query gene, sets of genes, including the query gene, that show coherent expression patterns across subsets of experimental conditions are identified. It performs simultaneous clustering on both the row and column dimensions to extract biclusters using a top-down approach. Since it uses a novel 'ratio'-based similarity measure, biclusters with more coherence and more biological meaning are identified. SIMBIC+ uses a score-based approach with the aim of maximizing the similarity of the bicluster. Contribution-entropy-based condition selection and multiple row/column deletion methods are used to reduce the complexity of the algorithm and to identify biclusters with maximum similarity value. Experiments are conducted on the Yeast Saccharomyces dataset and the biclusters obtained are compared with those of the popular MSB (Maximum Similarity Bicluster) algorithm. The biological significance of the biclusters obtained by the proposed algorithm and by MSB is compared, and the comparison shows that SIMBIC+ identifies biclusters with more significant GO (Gene Ontology) terms.
Stable Drug Designing by Minimizing Drug Protein Interaction Energy Using PSO csandit
1. The document proposes using a particle swarm optimization (PSO) algorithm to design stable drug molecules that minimize interaction energy with target proteins.
2. In the algorithm, drugs are represented as variable-length trees containing functional groups, and PSO is used to optimize van der Waals and electrostatic interaction energies.
3. Results show that PSO performs better than previous fixed-length tree methods at designing drugs that stably bind to active sites of human rhinovirus, malaria, and HIV proteins.
A clonal based algorithm for the reconstruction of genetic network using s sy...eSAT Journals
Abstract Motivation: Gene regulatory network is the network based approach to represent the interactions between genes. DNA microarray is the most widely used technology for extracting the relationships between thousands of genes simultaneously. Gene microarray experiment provides the gene expression data for a particular condition and varying time periods. The expression of a particular gene depends upon the biological conditions and other genes. In this paper, we propose a new method for the analysis of microarray data. The proposed method makes use of S-system, which is a well-accepted model for the gene regulatory network reconstruction. Since the problem has multiple solutions, we have to identify an optimized solution. Evolutionary algorithms have been used to solve such problems. Though there are a number of attempts already been carried out by various researchers, the solutions are still not that satisfactory with respect to the time taken and the degree of accuracy achieved. Therefore, there is a need of huge amount further work in this topic for achieving solutions with improved performances. Results: In this work, we have proposed Clonal selection algorithm for identifying optimal gene regulatory network. The approach is tested on the real life data: SOS Ecoli DNA repairing gene expression data. It is observed that the proposed algorithm converges much faster and provides better results than the existing algorithms. Index Terms: Microarray analysis, Evolutionary Algorithm, Artificial Immune System, S-system, Gene Regulatory Network, SOS Ecoli DNA repairing, Clonal Selection Algorithm.
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISijcseit
HMM has found its application in almost every field. Applying Hmm to biological sequences has its own
advantages. HMM’s being more systematic and specific, yield a result better than consensus techniques.
Profile HMMs use position specific scoring for the matching & substitution of a residue and for the
opening or extension of a gap. HMMs apply a statistical method to estimate the true frequency of a residue
at a given position in the alignment from its observed frequency while standard profiles use the observed
frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to
20 aligned sequences can be of equivalent quality to a standard profile created from 40 to 50 aligned
sequences.
This document summarizes an artificial immune system (AIS) model called IMMSIM that simulates the immune system using an agent-based modeling approach. IMMSIM represents immune cells like B cells and T cells as agents on a grid, with receptors on the cells encoded as bit strings. Interactions like antigen binding are determined by comparing receptor strings. The model captures phenomena like clonal expansion and the immune response in a simplified way. It demonstrates how agent-based models can integrate biological knowledge into computational simulations to study immune system dynamics.
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Sequencing projects arising from high throughput technologies including those of sequencing DNA microarrays allowed to simultaneously measure the expression levels of millions of genes of a biological sample as well as annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information,
bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the hypothesis of researchers throughout the experimental cycle. In this context, this article describes and discusses some of techniques used for the functional analysis of gene expression data.
The defense was successful in portraying Michael Jackson favorably to the jury in several ways:
1) They dressed Jackson in ornate costumes that conveyed images of purity, innocence, and humility.
2) Jackson was shown entering the courtroom as if on a red carpet, emphasizing his celebrity status.
3) Jackson appeared vulnerable, childlike, and in declining health during the trial, eliciting sympathy from jurors.
4) Defense attorney Tom Mesereau effectively presented a coherent narrative of Jackson as a victim and portrayed Neverland as a place of refuge, undermining the prosecution's arguments.
Michael Jackson was born in 1958 in Gary, Indiana and rose to fame in the 1960s as the lead singer of The Jackson 5, topping music charts in the 1970s. As a solo artist in the 1980s, his album Thriller broke music records. In the 1990s and 2000s, Jackson faced several legal issues related to child abuse allegations while continuing to release music. He married Lisa Marie Presley and Debbie Rowe and had two children before his death in 2009.
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
This document appears to be a list of popular books from various authors. It includes over 150 book titles across many genres such as fiction, non-fiction, memoirs, and novels. The books cover a wide range of topics from politics to cooking to autobiographies.
The prosecution lost the Michael Jackson trial due to several key mistakes and weaknesses in their case:
1) The lead prosecutor, Thomas Sneddon, was too personally invested in the case against Jackson, having pursued him for over a decade without success.
2) Sneddon's opening statement was disorganized and weak, failing to effectively outline the prosecution's case.
3) The accuser's mother was not credible and damaged the prosecution's case through her erratic testimony, history of lies and con artist behavior.
4) Many prosecution witnesses were not credible due to prior lawsuits against Jackson, debts owed to him, or having been fired by him. Several witnesses even took the Fifth Amendment.
Here are three examples of public relations from around the world:
1. The UK government's "Be Clear on Cancer" campaign which aims to raise awareness of cancer symptoms and encourage early diagnosis.
2. Samsung's global brand marketing and sponsorship activities which aim to increase brand awareness and favorability of Samsung products worldwide.
3. The Brazilian government's efforts to improve its international image and relations with other countries through strategic communication and diplomacy.
The three most important functions of public relations are:
1. Media relations because the media is how most organizations reach their key audiences. Strong media relationships are crucial.
2. Writing, because written communication is at the core of public relations and how most information is
Michael Jackson Please Wait... provides biographical information about Michael Jackson including his birthdate, birthplace, parents, height, interests, idols, favorite foods, films, and more. It discusses his background, career highlights including influential albums like Thriller, and films he appeared in such as The Wiz and Moonwalker. The document contains photos and details about Jackson's life and illustrious music career.
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
The document discusses the process of manufacturing celebrity and its negative byproducts. It argues that celebrities are rarely the best in their individual pursuits like singing, dancing, etc. but become famous due to being products of a system controlled by wealthy elites. This system stifles opportunities for worthy artists and creates feudalism. The document also asserts that manufactured celebrities should not be viewed as role models due to behaviors like drug abuse and narcissism that result from the celebrity-making process.
Michael Jackson was a child star who rose to fame with the Jackson 5 in the late 1960s and early 1970s. As a solo artist in the 1970s and 1980s, he had immense commercial success with albums like Off the Wall, Thriller, and Bad, which featured hit singles and groundbreaking music videos. However, his career and public image were plagued by controversies related to allegations of child sexual abuse in the 1990s and 2000s. He continued recording and performing but faced ongoing media scrutiny into his private life until his death in 2009.
Social Networks: Twitter Facebook SL - Slide 1butest
The document discusses using social networking tools like Twitter and Facebook in K-12 education. Twitter allows students and teachers to share short updates and can be used to give parents a window into classroom activities. Facebook allows targeted advertising that could be used to promote educational activities. Both tools could help facilitate communication between schools and communities if used properly while managing privacy and security concerns.
Facebook has over 300 million active users who log on daily, and allows brands to create public profile pages to interact with users. Pages are for brands and organizations only, while groups can be made by any user about any topic. Pages do not show admin names and have no limits on fans, while groups display admin names and are limited to 5,000 members. Content on pages should aim to provoke action from subscribers and establish a regular posting schedule using a conversational tone.
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Hare Chevrolet is a car dealership located in Noblesville, Indiana that has successfully used social media platforms like Twitter, Facebook, and YouTube to create a positive brand image. They invest significant time interacting directly with customers online to foster a sense of community rather than overtly advertising. As a result, Hare Chevrolet has built a large, engaged audience on social media and serves as a model for how brands can use online presences strategically.
Welcome to the Dougherty County Public Library's Facebook and ...butest
This document provides instructions for signing up for Facebook and Twitter accounts. It outlines the sign up process for both platforms, including filling out forms with name, email, password and other details. It describes how the platforms will then search for friends and suggest people to connect with. It also explains how to search for and follow the Dougherty County Public Library page on both Facebook and Twitter once signed up. The document concludes by thanking participants and providing a contact for any additional questions.
Paragon Software announces the release of Paragon NTFS for Mac OS X 8.0, which provides full read and write access to NTFS partitions on Macs. It is the fastest NTFS driver on the market, achieving speeds comparable to native Mac file systems. Paragon NTFS for Mac 8.0 fully supports the latest Mac OS X Snow Leopard operating system in 64-bit mode and allows easy transfer of files between Windows and Mac partitions without additional hardware or software.
This document provides compatibility information for Olympus digital products used with Macintosh OS X. It lists various digital cameras, photo printers, voice recorders, and accessories along with their connection type and any notes on compatibility. Some products require booting into OS 9.1 for software compatibility or do not support devices that need a serial port. Drivers and software are available for download from Olympus and other websites for many products to enable use with OS X.
To use printers managed by the university's Information Technology Services (ITS), students and faculty must install the ITS Remote Printing software on their Mac OS X computer. This allows them to add network printers, log in with their ITS account credentials, and print documents while being charged per page to funds in their pre-paid ITS account. The document provides step-by-step instructions for installing the software, adding a network printer, and printing to that printer from any internet connection on or off campus. It also explains the pay-in-advance printing payment system and how to check printing charges.
The document provides an overview of the Mac OS X user interface for beginners, including descriptions of the desktop, login screen, desktop elements like the dock and hard disk, and how to perform common tasks like opening files and folders. It also addresses frequently asked questions for Windows users switching to Mac OS X, such as where documents are stored, how to save or find documents, and what the equivalent of the C: drive is in Mac OS X. The document concludes with sections on file management tasks like creating and deleting folders, organizing files within applications, using Spotlight search, and an overview of the Dashboard feature.
This document provides a checklist for securing Mac OS X version 10.5, focusing on hardening the operating system, securing user accounts and administrator accounts, enabling file encryption and permissions, implementing intrusion detection, and maintaining password security. It describes the Unix infrastructure and security framework that Mac OS X is built on, leveraging open source software and following the Common Data Security Architecture model. The checklist can be used to audit a system or harden it against security threats.
This document summarizes a course on web design that was piloted in the summer of 2003. The course was a 3 credit course that met 4 times a week for lectures and labs. It covered topics such as XHTML, CSS, JavaScript, Photoshop, and building a basic website. 18 students from various majors enrolled. Student and instructor evaluations found the course to be very successful overall, though some improvements were suggested like ensuring proper software and pairing programming/non-programming students. The document also discusses implications of incorporating web design material into existing computer science curriculums.
Vicki Haugen McMaster is seeking a position in web design, front-end development, or digital photography. She has over 12 years of experience in front-end development using HTML and CSS, as well as expertise in Adobe Creative Suite programs like Photoshop. Her previous roles include web developer positions at Aquent and The Creative Group where she updated websites and assisted development teams.
Kyril Mossin is a web designer and Flash programmer with over 15 years of experience in graphic design, web development, and multimedia production. He has created websites for advertising agencies, photo agencies, and other clients. His skills include Flash, ActionScript, Photoshop, Dreamweaver, and HTML. He has a background in print production and technical writing.
Jasmine: a hybrid reasoning tool for discovering causal links in biological data
Boris A Galitsky1 , Sergey O Kuznetsov2 and Dmitry V Vinogradov2
1 School of Computer Science and Information Systems
Birkbeck College, University of London
Malet Street, London WC1E 7HX, UK
galitsky@dcs.bbk.ac.uk
2 All-Russian Institute for Scientific and Technical Information (VINITI),
Usievicha 20, Moscow 125190, Russia
{serge,vin}@viniti.ru
http://www.dcs.bbk.ac.uk/~galitsky/Jasmine/
Abstract. We develop the means to mine for causal links in biological data. The reasoning schema and implementation of the logic programming-based deterministic machine learning system Jasmine is presented. The methodology of mining for causal links is illustrated by the prediction tasks for protein secondary structure and phylogenetic profiles. The suggested methodology leads to a clearer approach to the hierarchical classification of proteins and a novel way to represent evolutionary relationships. A comparative analysis of Jasmine and other statistical and deterministic systems (including Inductive Logic Programming) is outlined.
Keywords: deterministic machine learning, plausible reasoning, causal link, protein secondary structure prediction,
phylogenetic profile analysis
1 A hybrid deterministic reasoning technique
In many scientific areas, including bioinformatics, data is being generated at a faster pace than it can be effectively analyzed. Therefore, the possibility to automate the scientific discovery process is of both theoretical and practical value. In this paper we present the system Jasmine, which automatically generates hypotheses to explain observations, verifies these hypotheses by finding the subset of data satisfying them, falsifies some of the hypotheses by revealing inconsistencies, and finally derives the explanations for the observations by means of cause-effect links [12].
In analyzing biological data, it is of crucial importance not just to reveal correlations between biological parameters, but also to find interpretable causal links between them. Most of the time, the machinery of statistical inference allows the former but runs into difficulties with the latter. Non-uniform, statistically distorted and incomplete biological data in bioinformatics, extracted from various sources, is frequently not tractable to immediate statistical analysis that would deliver an adequate explanation of phenomena. In response to this problem, special non-statistical techniques have been developed to tackle a wide variety of problems in bioinformatics [9,10,20], and Jasmine, the system to be presented here, extends this approach.
A traditional characterisation of scientific methodology is that it follows the "hypothetico-deductive" process. Firstly, scientific knowledge, intuition and imagination yield plausible hypotheses, and then the deductive consequences of these hypotheses are verified experimentally [6,22]. Jasmine extends this methodology by implementing (i) a combination of abductive, inductive and analogical reasoning for hypothesis generation, and (ii) multi-valued logic-based deductive reasoning for verification of the hypotheses' consistency and coverage of the domain (compare with [11]). Formally, all the above components can be represented as a deductive inference via logic programming [1, 5, 26].
In contrast to statistical approaches, Jasmine represents domain knowledge in the form of an ontology. This provides the means both for direct inspection of biological knowledge in the model and for inference of predictions.
In keeping with most hypothesis generation strategies used in modern biology [11], Jasmine uses abductive reasoning in addition to its deductive and inductive capabilities. These abductive processes generate specific facts about biological entities, while general hypotheses are delivered by deductive and inductive procedures.
In this paper, we call the results of data mining causal links because of the structure of inter-dependencies between the features which are known and those which are subject to prediction. When making a prediction, Jasmine explicitly indicates which objects and which features delivered this prediction. Therefore, in our prediction settings, the objects and features that deliver a prediction are considered causes (because they are explicitly localized); the object's predicted feature is referred to as an effect. Indeed, causes may not imply effects in a physical sense; however, the metaphor is useful because it highlights the explanatory property of a prediction delivered by Jasmine. It is important to differentiate Jasmine's prediction settings, which comprise a hypotheses management environment, from pure accuracy-oriented prediction settings. The objective in the design of Jasmine is not simply to raise recognition accuracy, but to identify sets of features and their inter-relations in logical form, which in turn raise the recognition quality.
Note also that the prediction direction (the roles of cause and effect) is easily reversed in Jasmine's settings.
In terms of its deterministic methodology of prediction and its implementation via logic programming, Jasmine's approach is close to that of Inductive Logic Programming [20], which has been applied to hypothesis management in bioinformatics quite successfully; a comparison will be presented below in the Conclusion.
The rest of the present paper is organized as follows. We introduce the reasoning schema using the reduced dataset of immunoglobulin proteins in Section 2, followed by Section 3, which describes the computation of similarity for obtaining causal links. Section 4 provides a biological background for the dataset from Section 2 and shows how Jasmine predicts the secondary structure of proteins and helps to find the physically plausible background for these predictions. In Section 5, Jasmine is applied to the problem of mining evolutionary profiles, where an original structure of causal links between occurrences of protein groups in biological species is obtained. The logical properties of Jasmine are outlined in the Appendix.
2 A reasoning schema
Jasmine is based on a learning model called the JSM-method (in honor of John Stuart Mill, the English philosopher who proposed schemes of inductive reasoning in the 19th century). The JSM-method, to be presented in this Section, implements Mill's idea [16] that similar effects are likely to follow common causes. The logical properties of JSM are presented in Appendix 1.
The JSM environment consists of features (causes), objects and effects. Within a first-order language, objects are atoms, and features (causes) and effects (targets) are terms which include these atoms. For a target (a feature to be predicted), there are four groups of objects with respect to the evidence they provide for this effect:
Positive – Negative – Inconsistent – Unknown.
An inference to obtain a target feature (satisfied or not) can be represented as one in a respective four-valued logic. The predictive machinery is based on building hypotheses,
target(X) :- feature1(X, …), …, featuren(X, …), that separate examples,
where target is to be predicted and feature1, …, featuren ∈ features are the features which are considered causes of the target (effect); X ranges over objects.
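Read operationally, a hypothesis of this clause shape is just a conjunction test over an object's features. The following is a minimal Python sketch under that reading (the paper's implementation is in logic programming; the feature names below are illustrative only):

```python
# A hypothesis target(X) :- feature1(X), ..., featuren(X) over unary
# features succeeds when the object carries every feature in the body.
def make_hypothesis(body_features):
    """Build a predicate that tests whether an object's feature set
    satisfies every feature required by the hypothesis body."""
    body = frozenset(body_features)
    return lambda object_features: body <= set(object_features)

# Illustrative hypothesis: cb5(X) :- ap1(X), f1(X), cc5(X).
h = make_hypothesis({"ap1", "f1", "cc5"})
print(h({"e1", "oa1", "ap1", "ap3", "f1", "cc5"}))  # True: the body is included
print(h({"e3", "ap2", "f1"}))                       # False: ap1 and cc5 are missing
```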
The desired separation is based on the similarity of objects in terms of the features they satisfy. Usually, such similarity is domain-dependent. However, in building the general framework of induction-based prediction, we use the anti-unification of formulas that express the totality of features of the given and other objects (our features (causes) do not have to be unary predicates; they are expressed by arbitrary first-order terms).
JSM prediction is based on the notion of similarity between objects. The similarity between a pair of objects is a hypothetical object which obeys the common features of this pair of objects. In its handling of similarity, JSM is close to Formal Concept Analysis [3], where similarity is the meet operation of a lattice (called a concept lattice). In this work we choose anti-unification of the formulas expressing the features of a pair of objects to derive a formula for the similarity sub-object [19]. Anti-unification, in the finite-term case, was studied as the least upper bound operation in a lattice of terms. Below we will be using the predicate similar(Object1, Object2, CommonSubObject), which yields the third argument given the first and the second arguments.
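When all features are unary predicates, the anti-unification underlying similar/3 degenerates to the intersection of feature sets. A small Python sketch under that assumption (object names follow Fig. 2.1, but this is not the paper's code):

```python
# For unary features, the common sub-object of two objects is simply
# the intersection of their feature sets.
def similar(features1, features2):
    """Return the CommonSubObject: the features shared by both objects."""
    return frozenset(features1) & frozenset(features2)

o1 = {"e1", "oa1", "ap1", "ap3", "f1", "cc5"}   # features of object o1
o8 = {"e2", "oa2", "ap2", "ap1", "f1", "cc5"}   # features of object o8
print(sorted(similar(o1, o8)))  # ['ap1', 'cc5', 'f1']
```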
We start with an abstract example of a Jasmine setting, based on the secondary structure prediction of immunoglobulin proteins. In the section below we will comment on the meanings of the involved causes and subjects. Our introductory example of JSM settings for unary predicates is as follows (from now on we use the conventional PROLOG notation for variables and constants), Fig. 2.1.
features([e1, e2, e3, e4, e5, e6, oa1, oa2, ap1, ap2, ap3, ap4, f1, f2,
cc4, cc5, cc6, cc7, cb5, cb7]). %% Features and targets ( causes and effects)
objects([o1, o2, o3, o4, o5, o6, o7, o8, o9, o10, o11, o12, o13, o14,
o15, o16, o17, o18]).
targets([cb5]).
%% Beginning of knowledge base
e1(o1). oa1(o1). ap1(o1). ap3(o1). f1(o1). cc5(o1). cb5(o1).
e1(o2). oa1(o2). ap1(o2). ap3(o2). f1(o2). cc5(o2). cb5(o2).
e2(o8). oa2(o8). ap2(o8). ap1(o8). f1(o8). cc5(o8). cb5(o8).
e3(o10). oa1(o10). a3(o10). ap2(o10). f1(o10). cc4(o10).
e3(o11). oa1(o11). a3(o11). ap2(o11). f1(o11). cc4(o11). cb5(o11).
cb7(o11).
e4(o16). oa1(o16). a1(o16). ap1(o16). f1(o16). cc5(o16). cb5(o16).
e5(o17). oa1(o17). a4(o17). ap2(o17). f1(o17). cc6(o17). cb7(o17).
e6(o18). oa1(o18). a1(o18). ap2(o18). f1(o18). cc4(o18). cb7(o18).
%% End of knowledge base
unknown(cb5(o10)).
Fig. 2.1: The knowledge base for mining causal links
[Fig. 2.2 is a two-column flow chart; its units, reconstructed as a list (positive branch / negative branch):
1. Find the totality of intersections between the features of all positive objects (left) and of all negative objects (right).
2. Among the above intersections, select only those which describe only positive (respectively, only negative) objects, and form the positive (negative) hypotheses from them.
3. Instantiate the positive (negative) hypotheses by objects; objects with unknown target which satisfy both positive and negative hypotheses remain unknown.
4. Obtain the objects with unknown target which satisfy the positive (negative) hypotheses; the remaining objects with unknown target receive an inconsistent prediction.]
Fig. 2.2: The chart for the reasoning procedure of Jasmine. Note that the prediction schema is oriented to discover which features cause the target and how (the causal link), rather than just searching for common features for the target, which would be much simpler (the six units at the top). The respective clauses (1-4) and sample results for each numbered unit (1-4) are presented in Fig. 2.3.
We start our presentation of the reasoning procedure with the chart (Fig. 2.2), followed by the logic program representation. Let us build a framework for predicting the target feature V of objects described by the formulas X expressing their features: unknown(X, V). We are going to predict whether V(x1, …, xn) holds or not, where x1, …, xn are variables of the formulas X (in our example, X = cb5(o10), x1 = o10).
We start with the raw data, positive and negative examples, rawPos(X, V) and rawNeg(X, V), for the target V, where X ranges over formulas expressing the features of objects. We form the totality of intersections for these examples (positive ones, U, that satisfy iPos(U, V), and negative ones, W, that satisfy iNeg(W, V), not shown):
iPos(U, V):- rawPos(X1, V), rawPos(X2, V), X1 \= X2, similar(X1, X2, U), U \= []. (1)
iPos(U, V):- iPos(U1, V), rawPos(X1, V), similar(X1, U1, U), U \= [].
Above are the recursive definitions of the intersections. As the logic program clauses which actually construct the lattice for the totality of intersections for positive and negative examples, we introduce a third argument to accumulate the currently obtained intersections (the negative case is analogous):
iPos(U, V):- iPos(U, V, _).
iPos(U, V, Accums):- rawPos(X1, V), rawPos(X2, V), X1 \= X2, similar(X1, X2, U),
    Accums = [X1, X2], U \= [].
iPos(U, V, AccumsX1):- iPos(U1, V, Accums), !, rawPos(X1, V),
    not member(X1, Accums), similar(X1, U1, U), U \= [],
    append(Accums, [X1], AccumsX1).
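Under the same unary-feature assumption, the closure these clauses compute can be sketched in Python: start from intersections of two distinct raw examples, then keep folding further raw examples into the accumulated results (a sketch, not the paper's implementation; the toy examples are invented):

```python
from itertools import combinations

def intersection_closure(examples):
    """Totality of non-empty intersections of the example feature sets,
    mirroring the recursive iPos clauses above."""
    raws = [frozenset(f) for f in examples]
    closure = set()
    # Base case: intersections of two distinct raw examples.
    for x1, x2 in combinations(raws, 2):
        if x1 & x2:
            closure.add(x1 & x2)
    # Recursive case: intersect accumulated results with further raws.
    changed = True
    while changed:
        changed = False
        for u in list(closure):
            for x in raws:
                v = u & x
                if v and v not in closure:
                    closure.add(v)
                    changed = True
    return closure

toy_pos = [{"f1", "f2", "f3"}, {"f2", "f3", "f4"}, {"f3", "f5"}]
print(sorted(sorted(u) for u in intersection_closure(toy_pos)))  # [['f2', 'f3'], ['f3']]
```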
To obtain the actual positive (posHyp) and negative (negHyp) hypotheses from the intersections derived above, we filter out the inconsistent hypotheses, which belong to both the positive and the negative intersections, inconsHyp(U, V):
inconsHyp(U, V):- iPos(U, V), iNeg(U, V). (2)
posHyp(U, V):- iPos(U, V), not inconsHyp(U, V).
negHyp(U, V):- iNeg(U, V), not inconsHyp(U, V).
Here U is the formula expressing the features of objects. It serves as a body of clauses for the hypotheses V :- U.
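With intersections represented as families of feature sets, the filtering in clause (2) and the hypothesis formation become plain set operations. A minimal Python sketch (the intersection families below are invented for illustration):

```python
# Hypothetical positive and negative intersection families.
ipos = {frozenset({"f1", "f2"}), frozenset({"f1"})}
ineg = {frozenset({"f1"}), frozenset({"f3"})}

incons_hyp = ipos & ineg      # inconsHyp: intersections on both sides
pos_hyp = ipos - incons_hyp   # posHyp: consistent positive hypotheses
neg_hyp = ineg - incons_hyp   # negHyp: consistent negative hypotheses

print(sorted(sorted(u) for u in pos_hyp))  # [['f1', 'f2']]
print(sorted(sorted(u) for u in neg_hyp))  # [['f3']]
```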
The following clauses deliver the totality of objects such that the features expressed by the hypotheses are included in the features of these objects. We derive reprObjectsPos(X, V) and reprObjectsNeg(X, V), where X is instantiated with objects for which V is positive and negative respectively. The last clause (with the head reprObjectsIncons(X, V)) implements the search for the objects to be predicted such that the features expressed by both the positive and the negative hypotheses are included in the features of these objects.
reprObjectsPos(X, V):- rawPos(X, V), posHyp(U, V), similar(X, U, U). (3)
reprObjectsNeg(X, V):- rawNeg(X, V), negHyp(U, V), similar(X, U, U).
reprObjectsIncons(X, V):- unknown(X, V), posHyp(U1, V), negHyp(U2, V),
    similar(X, U1, U1), similar(X, U2, U2).
The two clauses above (top and middle) do not participate in prediction directly; their role is to indicate which objects deliver what kind of prediction. This form of analysis will be illustrated in Section 5, Figs 4-5.
Finally, we approach the clauses for prediction. For the objects with unknown targets, the system predicts that they either satisfy these targets, do not satisfy these targets, or that the fact of satisfaction is inconsistent with the raw facts. To deliver V, a positive hypothesis has to be found such that the set of features X of an object includes the features expressed by this hypothesis and X is not from reprObjectsIncons(X, V). To deliver ¬V, a negative hypothesis has to be found such that the set of features X of an object includes the features expressed by this hypothesis and X is not from reprObjectsIncons(X, V). No prediction can be made for the objects with features expressed by X from the third clause, predictIncons(X, V).
predictPos(X, V):- unknown(X, V), posHyp(U, V), similar(X, U, U), (4)
    not reprObjectsIncons(X, V).
predictNeg(X, V):- unknown(X, V), negHyp(U, V), similar(X, U, U),
    not reprObjectsIncons(X, V).
predictIncons(X, V):- unknown(X, V), not predictPos(X, V), not predictNeg(X, V),
    not reprObjectsIncons(X, V).
The first clause above (shown in bold) will serve as an entry point to predict (choose) an effect of the given features from the generated list of possible effects that can be obtained for the current state. The clause below is an entry point to Jasmine if it is integrated with other applications and/or reasoning components:
predict_effect_by_learning(EffectToBePredicted, S):-
    findAllPossibleEffects(S, As), loadRequiredSamples(As),
    member(EffectToBePredicted, As), predictPos(X, EffectToBePredicted), !, X \= [].
The predicate loadRequiredSamples(As) above forms the training dataset. If for a given dataset a prediction is inconsistent, it is worth eliminating the cases from the dataset which deliver this inconsistency. Conversely, if there is an insufficient number of positive or negative cases, additional ones are included in the dataset. A number of iterations may be required to obtain a prediction; however, the iteration procedure is deterministic: the sources of inconsistency / insufficient-data cases are explicitly indicated at the step where the predicates reprObjectsPos and reprObjectsNeg introduced above are satisfied.
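The four steps can be put together on the reduced knowledge base of Fig. 2.1, treating all features as unary so that similarity is set intersection. The sketch below is an illustrative Python rendering, not the paper's logic program, and because the printed knowledge base is reduced, intermediate results may differ in detail from the protocol that follows.

```python
# Feature sets per object from Fig. 2.1, with the target cb5 removed.
POS = {  # objects for which cb5 holds
    "o1": {"e1", "oa1", "ap1", "ap3", "f1", "cc5"},
    "o2": {"e1", "oa1", "ap1", "ap3", "f1", "cc5"},
    "o8": {"e2", "oa2", "ap2", "ap1", "f1", "cc5"},
    "o11": {"e3", "oa1", "a3", "ap2", "f1", "cc4", "cb7"},
    "o16": {"e4", "oa1", "a1", "ap1", "f1", "cc5"},
}
NEG = {  # objects for which cb5 is known not to hold
    "o17": {"e5", "oa1", "a4", "ap2", "f1", "cc6", "cb7"},
    "o18": {"e6", "oa1", "a1", "ap2", "f1", "cc4", "cb7"},
}
o10 = {"e3", "oa1", "a3", "ap2", "f1", "cc4"}  # target cb5 unknown

def closure(examples):
    """Step 1: all non-empty intersections of the example feature sets."""
    raws = [frozenset(f) for f in examples.values()]
    out = set()
    for i, x1 in enumerate(raws):
        for x2 in raws[i + 1:]:       # distinct raw examples
            if x1 & x2:
                out.add(x1 & x2)
    grew = True
    while grew:                       # fold raws into accumulated results
        grew = False
        for u in list(out):
            for x in raws:
                if u & x and (u & x) not in out:
                    out.add(u & x)
                    grew = True
    return out

ipos, ineg = closure(POS), closure(NEG)
pos_hyp, neg_hyp = ipos - ineg, ineg - ipos          # step 2: drop inconsistent
p = any(h <= o10 for h in pos_hyp)                   # steps 3-4: match hypotheses
n = any(h <= o10 for h in neg_hyp)
verdict = ("inconsistent" if p and n else
           "positive" if p else "negative" if n else "unknown")
print(verdict)  # cb5(o10) is predicted to hold
```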
For example, for the knowledge base above, we have the following protocol and results (Fig. 2.3).
1. Intersections
Positive: [[e1(_),oa1(_),ap1(_),ap3(_),f1(_),cc5(_)],
[ap1(_),f1(_),cc5(_)],[ap1(_),f1(_)],[oa1(_),f1(_)],
[oa1(_),ap1(_),f1(_),cc5(_)],
[e2(_),e3(_),oa2(_),ap1(_),ap2(_),f1(_)],[e3(_),ap2(_),f1(_)],
[e4(_),oa1(_),ap1(_),
f1(_),cc5(_)]]
Negative: [[oa1(_),ap2(_),f1(_),cb7(_)]]
Unassigned examples:
2. Hypotheses
Positive: [[e1(_),oa1(_),ap1(_),ap3(_),f1(_),cc5(_)],[ap1(_),f1(_),cc5(_)],
[ap1(_),f1(_)],[oa1(_),f1(_)],[oa1(_),ap1(_),f1(_),cc5(_)],
[e2(_),e3(_),oa2(_),ap1(_),ap2(_),f1(_)], [e3(_),ap2(_),f1(_)],
[e4(_),oa1(_),ap1(_),f1(_),cc5(_)]]
Negative: [[oa1(_),ap2(_),f1(_),cb7(_)]]
Contradicting hypotheses: []
The clauses for the hypotheses here are:
cb5(X):- e1(X),oa1(X),ap1(X),ap3(X),f1(X),cc5(X);
         ap1(X),f1(X),cc5(X); ap1(X),f1(X).
cb5(X):- not (oa1(X),ap2(X),f1(X),cb7(X)).
Note that all intersections are turned into hypotheses because there is no overlap between the
positive and negative ones.
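The intersection and hypothesis-formation steps of the protocol can be sketched in simplified, attribute-set form. The helper names pairwise_intersections and form_hypotheses are ours, and the sketch ignores the relational (term-level) generality of Jasmine:

```python
# Simplified sketch of protocol steps 1-2: pairwise feature intersections of
# the examples become hypotheses, and an intersection that appears on both
# the positive and the negative side would be a contradicting hypothesis.
from itertools import combinations

def pairwise_intersections(examples):
    """Step 1: distinct non-empty pairwise intersections; with fewer than two
    examples, the examples themselves play the role of intersections."""
    if len(examples) < 2:
        return [set(e) for e in examples]
    out = []
    for a, b in combinations(examples, 2):
        inter = a & b
        if inter and inter not in out:
            out.append(inter)
    return out

def form_hypotheses(pos_examples, neg_examples):
    """Step 2: split the intersections into positive, negative and
    contradicting hypotheses."""
    pos_i = pairwise_intersections(pos_examples)
    neg_i = pairwise_intersections(neg_examples)
    contradicting = [h for h in pos_i if h in neg_i]
    pos_h = [h for h in pos_i if h not in contradicting]
    neg_h = [h for h in neg_i if h not in contradicting]
    return pos_h, neg_h, contradicting
```

On the knowledge base above, the single negative example yields a single negative hypothesis and no intersection occurs on both sides, so the contradicting list is empty, just as in the protocol.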
3. Background (positive and negative objects with respect to the target cb5)
Positive:
[[e1(o1),oa1(o1),ap1(o1),ap3(o1),f1(o1),cc5(o1)],
[e1(o2),oa1(o2),ap1(o2),ap3(o2),f1(o2),cc5(o2)],
[e2(o7),e3(o7),oa2(o7), ap1(o7),ap2(o7),f1(o7),cc5(o7)],
[e2(o8),e3(o8),oa2(o8),ap1(o8),ap2(o8),f1(o8)],
[e3(o11),oa1(o11),ap2(o11),f1(o11),cc4(o11),cb7(o11)],[e4(o15),
oa1(o15),ap1(o15),f1(o15),cc5(o15)],
[e4(o16),oa1(o16),ap1(o16),f1(o16),cc5(o16)]]
Negative:
[[e5(o17),oa1(o17),ap2(o17),f1(o17),cc6(o17),cb7(o17)],
[e6(o18),oa1(o18),ap2(o18),f1(o18),cc4(o18),cb7(o18)]]
Inconsistent: []
4. Prediction for cb5 (object o10)
Positive: [[e3(o10),oa1(o10),ap2(o10),f1(o10),cc4(o10)]]
Negative:[]
Inconsistent: []
Instantiated derived rules:
cb5(o10):- e3(o10),oa1(o10),ap2(o10),f1(o10),cc4(o10).
Fig. 2.3: The Jasmine prediction protocol. Steps are numbered in accordance with the units in Fig. 2.2 and
clauses for the
Hence cb5(o10) holds, which means that the sequence o10 has a loop of length 5
amino acids.
Towards the end of this Section we enumerate the predicates we have used in our presenta-
tion of Jasmine reasoning (Fig. 2.4). As above, the variables X, U, U1, U2 range over for-
mulas containing objects, and V ranges over targets.
Positive         Negative         Other                            Meaning
rawPos           rawNeg           unknown                          Raw data
iPos             iNeg                                              Intersections
posHyp           negHyp           inconsHyp (inconsistent)         Hypotheses
reprObjectsPos   reprObjectsNeg   reprObjectsIncons (both          Instantiations by objects
                                  positive and negative;
                                  prediction: unknown)
predictPos       predictNeg       predictIncons (prediction:       Predictions
                                  inconsistent)
Fig. 2.4: The predicates from the Jasmine prediction settings.
3. Computing similarity between objects
The quality of Jasmine-based prediction depends dramatically on how the similarity of
objects is defined. Usually, high prediction accuracy can be achieved if the measure of simi-
larity is sensitive to the object features which determine the target (explicitly or implicitly).
Since it is usually unclear in advance which features affect the target, the similarity
measure should take into account all available features. If the totality of selected features de-
scribing each object is expressed by formulas, a reasonable expression of similarity between
a pair of objects is the formula which is the least general generalization of the formulas for
both objects; this operation is anti-unification. Anti-unification is the inverse of the unifi-
cation of formulas in logic programming. Unification is the basic operation which finds the
least general (instantiated) common formula (if it exists), given a pair of formulas.
For example, for the two formulas p(a, X, f(X)) and p(Y, f(b), f(f(b))), their anti-unification
(least general generalization) is p(Z1, Z2, f(Z2)). Anti-unification was used in [21] as a
method of generalization, and this work was later extended into a theory of inductive gen-
eralization and hypothesis formation. Conversely, the unification of these formulas, p(a, X, f(X))
= p(Y, f(b), f(f(b))), is p(a, f(b), f(f(b))). Our logic programming implementation of anti-
unification for a pair of conjunctions, which can be customized to a particular knowledge do-
main, is presented in Fig. 3.1.
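To make the operation concrete outside of Prolog, here is a minimal anti-unification sketch in an illustrative encoding of our own (terms as ('functor', arg, ...) tuples, variables and constants as strings, fresh variables as Var objects); it is not the implementation of Fig. 3.1:

```python
# Minimal anti-unification (least general generalization) of first-order
# terms. Equal subterms are kept; differing subterms are replaced by a
# variable, and the SAME pair of differing subterms always maps to the SAME
# variable, so shared structure over a repeated argument is preserved:
# p(a,X,f(X)) vs p(Y,f(b),f(f(b))) gives p(Z1,Z2,f(Z2)).
from itertools import count

class Var:
    _ids = count(1)
    def __init__(self):
        self.name = f"Z{next(self._ids)}"
    def __repr__(self):
        return self.name

def anti_unify(t1, t2, pairs=None):
    if pairs is None:
        pairs = {}                      # (subterm1, subterm2) -> Var
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        # same functor and arity: descend into the arguments
        return (t1[0],) + tuple(anti_unify(a, b, pairs)
                                for a, b in zip(t1[1:], t2[1:]))
    if (t1, t2) not in pairs:
        pairs[(t1, t2)] = Var()         # fresh variable for this clash
    return pairs[(t1, t2)]
```

The shared table of clashing subterm pairs is what distinguishes anti-unification from blindly replacing every mismatch by a new variable.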
Although the implementation of anti-unification has been addressed in the lit-
erature (e.g., [35]), we present the full code to keep this paper self-contained. In a given do-
main, additional constraints on terms can be enforced to express a domain-specific similarity.
In particular, certain arguments can be treated differently (they should not be allowed to
change if they are very important, or they should form a special kind of constant). Domain-
specific code should occur in the line shown in bold.
The simplest way to express similarity between objects is to use attribute features. For ex-
ample, to express the similarity between amino acids, we use their attributes such as polarity
and hydrophobic properties. These attributes may be specific to a narrow domain of analysis
(immunoglobulins in our example, [23]). If one prefers stronger control over the com-
parison of similar and distinct attributes of amino acids, terms should be introduced to express
the features of objects. In our example of amino acids, if one considers hydrophobic and hy-
drophilic amino acids to have no similarity at all, the predicates
hydrophobic(other_attribute1, …, other_attribute2) and hydrophilic(other_attribute1, …,
other_attribute2) should be used. Instead of residue(sequence) terms, one would use
residue(sequence, hydrophobic(other_attribute1, …, other_attribute2)) to take hydro-
phobicity into account as a potential cause of a certain property of a protein. Such predicates
will then be treated as desired by the anti-unification procedure. Moreover, predicates for re-
lations between objects are naturally accepted by the Jasmine framework. As soon as we explicitly
incorporate hydrophobicity into our object representation, the setting cannot be reduced to at-
tribute-value logic.
similar(F1, F2, F):- antiUnifyFormulas(F1, F2, F).

antiUnifyFormulas(F1, F2, F):-
    clause_list(F1, F1s), clause_list(F2, F2s),
    findall(Fm, (member(T1, F1s), member(T2, F2s),
                 antiUnifyTerms(T1, T2, Fm)), Fms),   % finding pairs
    % Now sort out the formulas which are not most general within the list
    findall(Fmost, (member(Fmost, Fms),
                    not (member(Fcover, Fms), Fcover \= Fmost,
                         antiUnifyTerms(Fmost, Fcover, Fcover))), Fss),
    clause_list(F, Fss).                              % converting back to a clause

antiUnifyTerms(Term1, Term2, Term):-
    Term1 =.. [Pred0|Args1], len(Args1, LA),          % make sure the predicates
    Term2 =.. [Pred0|Args2], len(Args2, LA),          % have the same arity
    findall(Var,
        ( member(N, [0,1,2,3,4,5,6,7,8,9,10]),        % not more than 10 arguments
          [! sublist(N, 1, Args1, [VarN1]),           % loop through the arguments
             sublist(N, 1, Args2, [VarN2]),
             string_term(Nstr, N), VarN1 =.. [Name|_],
             string_term(Tstr, Name),
             concat(['z', Nstr, Tstr], ZNstr), atom_string(ZN, ZNstr) !],
          % build a canonical argument to create a variable as a result of anti-unification
          ifthenelse( not (VarN1 = VarN2),
              ifthenelse( (VarN1 =.. [Pred,_|_], VarN2 =.. [Pred,_|_]),
                  ifthenelse( antiUnifyConst(VarN1, VarN2, VarN12),
                      % going deeper into a subterm when an argument is a term
                      Var = VarN12,
                      Var = ZNstr),
                  % domain-specific code here for special treatment of certain arguments
                  % various cases: variable vs variable, variable vs constant, constant vs constant
                  Var = ZNstr),
              Var = VarN1) ),
        Args),
    Term =.. [Pred0|Args].
Fig. 3.1: The clauses of the logic program for anti-unification (least general generalization) of two formulas (conjunctions of
terms). The predicate antiUnifyFormulas(F1, F2, F) inputs two formulas (scenarios in our case) and outputs the resultant
anti-unification.
4. Introductory example: predicting the secondary structure of immunoglobulins
We illustrate how Jasmine tackles the biological problem where a micro-level features cause
the macro-level effects. The problem in our general terms is posed as a prediction of macro-
level effects given the micro-level data. The micro-level features comprise the protein
residues (amino acids in certain positions), and the macro-level is the secondary structure of
a protein. The secondary structure of a protein contains information on how a sequence of
amino acids is divided into fragments (strands, loops and helices in 3D space). Evidently, in-
dividual amino acids determine the secondary structure by means of “collective behavior”
whose direct simulation is associated with substantial difficulties.
To illustrate the prediction of the secondary structure of the immunoglobulin family with Jas-
mine, we select a training dataset of sequences where the variable length of certain loop
fragments depends on the residues of other fragments. We form our dataset from the set
of immunoglobulin sequences of the variable domain, heavy chains [7]. In accordance with the tra-
ditional approaches of sequence-structure analysis, the set of aligned sequences is split into
fragments (Table 4.1).
Before we apply deterministic machine learning, a hierarchical clustering approach
is used to decrease the data dimension. To find inter-connections between residues in a se-
quence 100 amino acids long, we divide the multiple sequence alignment into 21 fragments
and search for correlations among these fragments rather than at the level of residues. After
we divide the fragments of the sequences into clusters, we use Jasmine to find correlations
between these clusters. Jasmine is capable of operating with raw data; however, it would be
much harder to take advantage of Jasmine's results if it were fed the original data containing
1000 aligned sequences of about 100 amino acids each.
Table 4.1: We divide the aligned sequences into 21 fragments 0A, A, AA', A', ... in accordance with the conven-
tional names for strands and loops [34]. For each position, we show its number within the alignment and within the
corresponding fragment (fragments C'C'', C'', CB, FG with high variability are omitted). For example, the 19th
position of the alignment is the 3rd position of the B-word. For each fragment, we show all keywords [7]. Gray color
denotes that all keywords contain the same residues in the highlighted position. If two or three residues are
shown for a position, each of them can be found in this position. Note that in all cases the residues sharing a
position belong to the same amino acid group.
# 1 2 3 # 4 5 6 # 7 8 9 # 10 11 12 13 # 14 15 16 # 17 18 19 20 21 22 23 24
0A1 0A2 0A3 A1 A2 A3 AA'1 AA'2 AA'3 A'1 A'2 A'3 A'4 A'B1 A'B2 A'B3 B1 B2 B3 B4 B5 B6 B7 B8
1 Q V Q 1 L VL E 1 S G GA 1 E VL K K 1 P GS GAS 1 S LV RK LVI ST C X GA
2 E 2 V Q 2 S P 2 G V K 2 E 2 T ST X IV
3 Q E 3 W A 3 G V Q 3 Q
4 Q Q 4 R
# 25 26 27 28 29-33 # 34 35 36 37 38 39 # 40 41 42 43 44 45 # 46 47 48 49 50 51 52-55 56-60
BC1 BC2 BC3 BC4 ..CB.. C1 C2 C3 C4 C5 C6 CC'1 CC'2 CC'3 CC'4 CC'5 CC'6 C'1 C'2 C'3 C'4 C'5 C'6 ..C'C" ..C"..
1 S G FY X 1 MI X W VI R Q 1 X P GS RK G L 1 E W VMIL GAS X I
2 G ST 2 W X 2 A Q 2 X T
3 D S
# 61 62 63 64 65 # 66 67 68 69 70 71 72 # 73 74 75 76 # 77 78 79 80 81 82 82A
C"D1 C"D2 C"D3 C"D4 C"D5 D1 D2 D3 D4 D5 D6 D7 DE1 DE2 DE3 DE4 E1 E2 E3 E4 E5 E6 E7
1 X S VL K G 1 R FVI T IM ST R D 1 X SA K N 1 TS L Y LM Q M N
2 Q X FL Q 2 R VA 2 X I S 2 T A Y Q W S
3 P S F Q 3 Q A 3 X T S 3 Q F S K L S
4 X P V K 4 R P 4 T A Y E L X
5 Q F S Q L N
6 Q V V T M T
# 82B 82C 83 84 85 86 87 # 88 89 90 91 92 93 94 95-101 # 102 103 104 105 106 107 108 109 110 111
EF1 EF2 EF3 EF4 EF5 EF6 EF7 F1 F2 F3 F4 F5 F6 F7 ..FG.. G1 G2 G3 G4 G5 G6 G7 G8 G9 G10
1 S LVM R X E D T 1 A VML Y YF C A RK 1 X W G Q G T LM V T V
2 S T A A 2 T X 2 X T
3 S K A S
4 S T P E
5 N D P V
Table 4.1 indicates that the sequences within each fragment fall into one to six clusters
[7,8]. A convenient way to represent the totality of sequences in the dataset is shown in
Table 4.1, where the sequences are encoded by cluster numbers. A two-level classification
is suggested in [8] to show the diversity of sequences of cluster numbers as a tree. Each node
of the tree is a unique combination of cluster numbers; fragment E is chosen as the first-level
classification criterion and fragments {0A, A, A', F} as the second-level criterion. In accor-
dance with the selected classification criteria, for each class, the selected classifying fragments
determine not only the rest of the fragments, but also the number of residues in the loop fragments (in
the way presented in Figure 4.2). In such a form, the data is well prepared for applying Jas-
mine.
Now we are ready to outline the Jasmine representation language for this domain; it was
introduced in Section 2 as an example of Jasmine reasoning. The fact (cause) that for a
sequence o2 the fragment E is in cluster 1 (e.g. TLYLQMN, Table 4.1) is expressed as e1(o2).
Analogously, the fact that the fragment CB of the same sequence o2 is 5 amino acids long
(the effect) is expressed as cb5(o2). The reader may suggest the terms
e(o2, 1) and cb(o2, 5) instead, but such a generalization needs to be done in a trickier way
to suit the anti-unification operation over these terms. Note that since the predicates are not al-
ways unary, our settings are irreducible to attribute-value logic. Since we have quite a limited
number of values for cluster numbers and fragment lengths, we use unary predicates for clari-
ty.
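Schematically, this encoding can be mirrored as plain data. The following is a hypothetical illustration of the fact base, not Jasmine's internal representation:

```python
# Each sequence (object) carries unary cause-features (cluster memberships of
# fragments, e.g. "e1" for fragment E in cluster 1) and effect-features
# (loop lengths, e.g. "cb5" for a 5-residue CB fragment).
facts = {
    "o2": {"e1", "cb5"},
}

def holds(predicate, obj):
    """True iff predicate(obj) is a fact; holds('e1', 'o2') mirrors e1(o2)."""
    return predicate in facts.get(obj, set())
```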
A fragment of our database was presented in Section 2 above as an introductory
example. We expect the abstract knowledge base introduced there to become transparent at
this stage.
Fig. 4.2: The classification graph for immunoglobulin sequences, encoded via cluster numbers. Fragment E is the criteri-
on for division into classes; fragments 0A, A, A', F are the criteria for division into subclasses. Each subclass has a
specific length of the complementarity determining region (CDR, [8,23]) fragments, which are the subjects of prediction.
Exploration of the causal links between fragments for various prediction settings leads to the discovery of a novel
interconnection pattern between the fragments with physical plausibility: fragments E, 0A, A, A' and F have a special
role in the protein folding process [8].
[Graph body: each node lists the cluster-number encodings of fragments E, 0A-A-A'-F, D-AA'-A'B-BC-CC'-C"D-DE-EF and
B-C-C', together with the resulting CDR lengths, e.g. |C'C"|=5, |CB|=5 for class 1 and |C'C"|=4, |CB|=7 for class 6.]
Being a built-in solver of inverse problems, Jasmine is capable of finding the optimal classi-
fication structure for a refined dataset. A classification is optimal if the classifying criteria allow
achieving the highest possible precision of the data being classified (when an object is assigned
to a class, its features (effects) are predicted with the highest accuracy). We believe that in a
general situation the optimality of a classification graph should be measured in terms of mini-
mum inconsistency between the data components rather than by prediction accuracy (com-
pare with [24]).
In the domain of proteins, the main causal link is how a sequence of amino acid residues
causes the protein to fold in a certain way. This problem can be solved by physical simula-
tion only for short sequences; therefore, machine learning needs to be involved. Nowadays,
the most efficient approaches to fold recognition are statistical (including SVM); however, they
do not help to understand the mechanisms of how individual residues affect the fold.
Searching through a variety of cause-effect models in terms of the consistency of generated
hypotheses, Jasmine has delivered the set of classification criteria which had initially been ob-
tained manually, taking into account the spatial locations of immunoglobulin strands. Hence
Jasmine independently confirmed the classification structure obtained using additional (phys-
ical) knowledge about our dataset. Therefore, in this example the search for clearer causal
links may generate such additional knowledge!
To quantitatively evaluate how recognition accuracy can be increased by the exploration
of causal links by Jasmine, we predict the whole immunoglobulin sequence, given a few of its
residues. In our evaluation we follow the prediction settings of the study [7], where the re-
quired correlations between residues were discovered. We also draw a comparison with
the baseline approach of sequence homology computation.
Classification-based and homology-based approaches have different methodologies to
predict the unknown sequence residues. The former first determines membership in a class (and
subclass), and then enumerates the possible residues as class attributes as a prediction. The
latter calculates the distances from all available sequences and suggests the closest sequence as
a prediction.
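The two methodologies can be contrasted schematically. This is an illustrative toy: Hamming distance stands in for real alignment-based homology scoring, and the class patterns and consensus sequences are invented for the example:

```python
# Toy contrast of the two prediction methodologies discussed above.

def homology_predict(query, training):
    """Homology-based: return the closest training sequence (toy Hamming
    distance over aligned positions) as the prediction."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(training, key=lambda s: dist(query, s))

def class_predict(query, classes):
    """Classification-based: first assign the query to a class by its few
    known classifying residues, then read the remaining residues off the
    class consensus. classes: {label: (position->residue pattern, consensus)}."""
    for pattern, consensus in classes.values():
        if all(query[i] == r for i, r in pattern.items()):
            return consensus
    return None
```

The essential difference is that the classifier needs only the few classifying residues of the query, whereas the homology method needs the full sequence to compute distances.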
A set of 500 sequences was used for training (building classification patterns) and
200 sequences for the evaluation of prediction accuracy in each recognition setting (Table 4.3).
The values in the table show how many times better the predictions are than random prediction,
over the totality of residues which are the subject of prediction. Using the hierarchical classification
presented in [7] improves the homology-based prediction by a factor of 1.752, and using Jasmine to
optimize the classification-based prediction delivers a further 16.5% improvement.
                              Data sets (protein families)                                   Accuracy
Relating a sequence           Heavy Chain       Heavy Chain       Light Chain   Light Chain  improve-
to a class using:             Variable domain   Variable domain   Variable      Variable     ment
                              Mouse             Human             domain Mouse  domain Human
Homology method               5.4               6.2               7.3           8.1          1
Hierarchical classification   11.1              11.3              12.7          12.2         1.75
method
Optimized hierarchical        13.5              13.7              13.8          14.1         2.04
method using Jasmine
Table 4.3: Evaluation of the recognition accuracy improvement by Jasmine.
5. Mining the evolutionary data
Prediction of protein function has been one of the most important tasks over the last two
decades. Functional prediction is traditionally based on the observation that two homolo-
gous proteins evolved from a common ancestral protein are expected to have similar func-
tions. Sequence similarity is the basis for a vast number of computational tools to derive pro-
tein homology. Recent approaches have attempted to combine sequence information with
evolutionary data [4,13,27,28]. Some approaches of comparative genomics have extended the
homology method by identifying inter-connected proteins which participate in a common
structural complex or metabolic pathway, or fuse into a single gene in some genomes. Pre-
diction of gene functions was found to be a good application area for Inductive Logic Pro-
gramming [10,11,20].
The phylogenetic profile of a protein is represented as a vector in which each component
corresponds to a specific genome [14,15]. A component is one if there is a significant ho-
molog of the protein in the corresponding genome, and zero otherwise. On the path
of discoveries targeting the prediction of protein functions, we focus on the following ques-
tion: how are the phylogenetic patterns for different species correlated with each other? This
question is tightly connected with finding the scenarios of gene evolution, given the phyloge-
netic tree for the species [17]. In this section we apply Jasmine to the database of Clusters of
Orthologous Groups of proteins (COGs), which represents a phylogenetic classification of
the proteins encoded in complete genomes [25]. Finding the structure of how phylogenetic
profiles are interconnected, we will also be addressing the dual problem of how the binary vectors
of the distribution of COGs through species are correlated. The dataset consists of a matrix
of 66 (species) by 4874 (clusters of proteins).
Hence we explore the following causal links (the third is the combination of the first and
the second):
(1) Whether, for a given COG, its existence in certain species may provide evidence
for its existence in other selected species. Here species are features and COGs are
objects (#features << #objects).
(2) Whether, for a given species, the occurrence of certain COGs in this species implies the
occurrence of another selected COG in this species. In this case COGs are features and
species are objects.
(3) Whether a set of occurrences of given COGs in a given collection of species implies
the occurrence of a particular COG in a particular species. Either feature-object
assignment above is possible, but the variables need to range over lists of COGs or
species; also, the facts need to be augmented by clauses.
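The feature/object duality underlying settings (1) and (2) amounts to reading the binary COG-by-species matrix row-wise or column-wise; a small sketch with invented toy data and helper names of our own:

```python
# Setting (1): COGs are objects whose features are the species they occur in.
# Setting (2) is obtained by transposing the matrix: species become objects
# whose features are the COGs occurring in them.

def profiles(matrix, objects, features):
    """Map each object (row) to the set of features whose bit is 1."""
    return {obj: {f for f, bit in zip(features, row) if bit}
            for obj, row in zip(objects, matrix)}

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]
```

For the real dataset the matrix is 4874 COGs by 66 species; the transposition is what makes the same Jasmine machinery applicable to both causal links (1) and (2).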
The issue of duality between features and objects is thoroughly addressed by the Formal
Concept Analysis approach [3]; the Jasmine (relational) settings can be reduced to this (proposi-
tional) approach if all predicates are unary and there are no explicitly specified links between
the involved features (clauses). For the sake of simplicity, in our examples we use unary pred-
icates; however, arbitrary terms can be used. For instance, if one intends to take into account
protein sequence information for COGs, the expression for a feature of an object will look like
a3(o7, sequence(o7, cogRepresentative)) instead of just a3(o7).
The most efficient data mining occurs in an interactive mode, where the current results of
data mining are visualized for a user who then plans the further directions of exploration
online. We select a straightforward approach to data visualization and display our matrix as a
grid with columns for selected COGs and rows for selected species (Fig. 5.1). Having chosen
a data fragment for a group of COGs and species, one can perform an analysis of the intercon-
nections between phylogenetic profiles within the selected dataset.
For any cell, we can verify whether it is consistent with the selected dataset or not: we can
run a prediction of its (binary) value, declaring it unknown. Here COGs are objects, and species
are features of these objects which cause an effect (a selected feature for a particular object
holds or does not hold). The prediction protocol for a given cell is presented in Fig. 5.1, in the
same sequence as our presentation of Jasmine reasoning in Section 2.
The Jasmine setting shown targets the causal link (1) above; if the matrix is transposed,
causal link (2) becomes the subject of Jasmine analysis. To approach the causal link (3) above,
one needs either to analyze (1) and use the generated hypotheses of (2) in addition to the facts of
(1), or the other way around. Overall, these kinds of analyses evaluate how strongly the distri-
butions of COGs across various species are correlated.
     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
a1   1  0  0  0  1  1  0  0  1  0  0  0  0  0  0
a2   1  0  0  0  0  1  0  ?  0  0  1  0  1  0  0
a3   1  0  0  0  1  1  0  0  1  0  1  0  1  0  0
a4   1  0  0  0  0  0  0  0  0  0  1  0  1  0  0
a5   0  0  0  0  1  1  0  0  1  0  1  0  1  0  0
a6   1  0  0  0  1  1  0  0  1  0  0  0  1  0  0
a7   0  0  0  0  0  0  0  0  0  0  0  0  1  1  0
a8   0  0  0  0  1  0  0  1  1  0  0  0  1  0  1
a9   0  1  0  0  1  0  0  1  0  0  1  0  1  0  0
a10  1  0  0  0  0  1  1  1  0  0  1  0  0  1  0
a11  0  1  1  0  0  0  0  0  0  1  0  0  1  0  0
a12  0  1  0  0  0  0  1  1  0  0  1  0  0  0  0
a13  0  1  1  0  0  1  1  1  0  0  1  0  1  0  0
a14  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1
a15  1  1  1  1  0  1  1  1  0  1  1  1  1  1  1
a16  1  1  0  1  0  0  0  1  0  1  0  0  0  0  0
a17  1  1  0  0  0  0  1  1  0  1  1  1  1  0  0
a18  1  1  0  1  0  0  1  1  0  0  1  1  0  1  1
a19  1  1  0  1  0  0  0  0  0  1  1  0  0  0  0
a20  1  1  0  1  0  0  0  1  0  0  1  0  0  1  0
a21  0  1  0  0  0  1  0  0  0  0  1  0  0  1  0
a22  1  1  0  1  0  1  1  1  0  0  1  0  0  1  0
Prediction protocol for the cell marked '?':
Positive intersections: [[a14(_),a23(_)], [a13(_),a23(_),a26(_)], [a13(_),a14(_),a23(_)], …
Negative intersections: [[a1(_),a6(_),a24(_)], [a1(_),a5(_),a6(_),a9(_),a20(_),a24(_),a25(_)], …
Unknown hypothesis: [[a1(o22),a2(o22),a4(o22),a6(o22),a7(o22),a8(o22),a11(o22),a14(o22), …
Positive hypotheses: [[a14(_),a23(_)], [a13(_),a23(_),a26(_)], [a13(_),a14(_),a23(_)], [a14(_), …
Negative hypotheses: [[a1(_),a6(_),a24(_)], [a1(_),a5(_),a6(_),a9(_),a20(_),a24(_),a25(_)], …
Inconsistent hypotheses: [[a23(_)]]
Background for positive hypotheses: [[a13(o7),a14(o7),a23(o7),a26(o7)], [a1(o10),a6(o10),a7(o10),a8(o10), …
Background for negative hypotheses: [[a1(o1),a5(o1),a6(o1),a9(o1),a20(o1),a24(o1),a25(o1),a26(o1)], [a1(o2), …
Background for inconsistent hypotheses: []
Positive prediction: []
Negative prediction: [[a1(o22),a2(o22),a4(o22),a6(o22),a7(o22),a8(o22),a11(o22),a14(o22), …
Inconsistent to predict: []
Row annotations: a19: a lot of positive intersections, but 1 is never predicted; a20 (cells 5, 8, 9, 13, 25): a low
number of positives, but well predicted; a21: many positive values, but no predictions; a22: all negative.
Fig. 5.1: A fragment of the data set for analysis; rows are species a1-a26 and columns are COGs o1-o26. The
cell values indicate the existence of a given COG for a given species. The user interface for Jasmine is designed
so that any cell value can be set as unknown and its prediction can be attempted.
Figure 5.1 gives an example of the interactive exploration of causal links. For each species, a
user attempts to verify whether the occurrences of COGs can be predicted, given the selected subset
of the matrix. The prediction is successful for the species a1, a3, a5, a8, … shown in bold:
the cells with 1 (also shown in bold) are predicted properly. Hence there are causes for the ex-
istence of COGs for these species (for a number of reasons which may have a domain-specif-
ic biological interpretation). For further analysis we collect, for each species, all hypotheses
which provided the prediction (Table 5.2).
The question mark indicates the current cell to be predicted; the prediction protocol on the
right corresponds to this cell. For each species, the user may specify the peculiarities of the caus-
es for the COGs occurring in this species.
Species a1-a13. Left column: lists of conditions for other species to infer that the given species occurs in a
current COG; the list is specified for the case that the actual occurrence is explained by the condition (properly
predicted). Right column: COGs that are present in the conditions (on the left); the occurrence of these COGs in
the first and other species conjectures the occurrence of the selected (other) COG for the second species.
a1: [[a3(_),a6(_)], [a3(_),a5(_),a6(_)], [a3(_),a5(_),a6(_),a8(_)], [a3(_),a5(_),a8(_)]]
    COGs: [a3(o9),a5(o9),a6(o9),a8(o9)], [a3(o20),a5(o20),a6(o20),a8(o20),a9(o20)],
          [a3(o25),a5(o25),a8(o25),a14(o25),a15(o25),a21(o25)]
a2: (none)
a3: [[a1(_),a5(_),a6(_)], [a1(_),a5(_),a15(_),a21(_)]]
    COGs: [a1(o9),a5(o9),a6(o9),a8(o9)], [a1(o20),a5(o20),a6(o20),a8(o20),a9(o20)]
a4: (none)
a5: [[a1(_),a3(_),a6(_)], [a3(_)], [a3(_),a6(_),a8(_)], [a1(_),a3(_),a6(_),a8(_)], [a1(_),a3(_),a8(_)],
     [a3(_),a9(_)], [a3(_),a6(_),a8(_),a9(_)]], [[a1(_),a3(_),a6(_)], [a1(_),a3(_),a15(_),a21(_)]]
    COGs: [a1(o20),a3(o20),a6(o20),a8(o20),a9(o20)], [a1(o25),a3(o25),a14(o25),a15(o25),a21(o25)]
a6: (none)
a7: (none)
a8: [[a9(_)], [a3(_),a5(_),a6(_)], [a1(_),a3(_),a5(_),a6(_)], [a1(_),a3(_),a5(_)],
     [a3(_),a5(_),a6(_),a9(_)]], [[a1(_),a3(_),a5(_)]]
    COGs: [a1(o20),a3(o20),a5(o20),a6(o20),a9(o20)]
Table 5.2: Enumeration of the causal links to the given effect (prediction of the existence of a COG for a given species).
Table 5.2 presents the causes which lead to a particular effect. For each species (the left
column), we perform the prediction of the existence of each COG and collect the hypotheses
that deliver the correct positive prediction. If the prediction is wrong, or there is a negative
case, the hypotheses are not taken into account. The right column enumerates the COGs
which instantiate the hypotheses delivering a positive correct prediction.
Finally, the causal links between the species are visualized as a graph (Fig. 5.3).
Figure 5.3: The graph of causal links between species, which may serve as an alternative approach to build-
ing an evolutionary tree for species a1-a22 (corresponding to the selection of Figure 5.1).
Each inter-relationship between a first and a second species is obtained as a causal link be-
tween the occurrence of a particular COG for the first species (and some other species) and the
occurrence of this COG in the second species. Each edge can be one-directional or bi-direc-
tional. Nodes are labeled with the species numbers; they are in one-to-one correspon-
dence with the evolutionary tree below (Figure 5.4).
As an example of our analysis, we suggest the reader observe that the species a3 and a5
cause the occurrence of COGs in species a1. In Fig. 5.1 there is a bold 1 in the fifth
column of the data (COG 5, or o5). The occurrence of COG 5 for species a1 is conjectured by
species a16 (refer to Table 5.2, first row, right column). The fact that the COG o9 occurs in a3
(in particular) is written as [a3(o9),a5(o9),a6(o9),a8(o9)]; therefore, COG o9 contributes to the
causal link a3(o9) → a1(o5). Finally, notice the bi-directional edge a3↔a1 (we have just
verified this edge in one direction).
[Evolutionary tree figure: leaves numbered 1-26.]
Fig. 5.4: The fragment of the evolutionary tree for the selected 26 species.
6. Related work
In this paper we have presented the reasoning model and implementation of the deterministic machine
learning system Jasmine and illustrated its capability on two problems in bioinformatics.
Originally, the JSM-method was formulated in the early 1980s in terms of a predicate logic
which is an extension of first-order predicate logic with quantifiers over tuples of vari-
able length [5]. Jasmine complies with the common paradigm of learning from positive and
negative examples [18]: given descriptions of positive and negative examples of causes rela-
tive to an effect, positive hypotheses are "generalized descriptions" of a subset of positive
examples that do not "cover" any negative example. Negative hypotheses are defined similar-
ly. Jasmine is designed as a logic program for convenient integration with logical as well as
knowledge representation formalisms realized via logic programming [26]. Multiple procedu-
ral reasoning systems of the JSM type have been used in bioinformatics [2].
As for the comparison with competing approaches to the logic-based implementation of ma-
chine learning, it is worth describing Inductive Logic Programming, mentioned in the Intro-
duction. In Inductive Logic Programming (see, e.g., [20]) the idea of a version space [18,31]
is implemented via the generality relation ("to be more general than, or to be equal to") which
operates on descriptions. Classifiers are considered to be the inference relations in logic, and the set
of classifiers is thought of as a subset of the set of logic programming formulas.
The major drawback of version spaces, where classifiers are defined syntactically, is
that in the case of a too-restrictive choice of purely conjunctive classifiers, there may be no classifier
that matches all positive examples. This situation, referred to as the "collapse of the ver-
sion space", is quite frequent, in particular when classifiers are conjunctions of attribute-value
assignments and "wildcards" are used for arbitrary values. If the expressive power is in-
creased syntactically, e.g. by introducing disjunction, then the version space tends to become
trivial, while the most specific generalization of the positive examples becomes "closer" to, or
just coincides with, the set of positive examples.
On the contrary, JSM-hypotheses, as has been proven in [29], offer a methodology of
"context-restricted" disjunction: not all disjunctions are possible, but only those comprising
minimal hypotheses.
As the underlying generality order can be arbitrary, in the JSM-method one can use descrip-
tions of positive and negative examples given by formulas of a description logic, by concep-
tual graphs, etc.; just the initial generality relation has to be specified.
At the same time, the methodology of Inductive Logic Programming first makes over-gen-
eralized conjectures which, after being refuted, are then made more specific to cover only
positive examples. In contrast to this methodology, the JSM-method makes inductive "leaps"
from the set of descriptions of examples to the set of descriptions of their generalizations, the
latter being "far apart" from the former.
Similarity in the JSM-method is not a relation but an operation, which is idempotent, commu-
tative and associative (i.e., it induces a semi-lattice on descriptions and their generalizations);
it is a particular form of least general generalization of examples. The JSM-method first constructs
inductive generalizations of positive examples, which may then be disproved by negative ex-
amples. Symmetrically, negative examples and their generalizations may be disproved by
positive examples.
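As an illustration (a minimal sketch with invented feature sets, not Jasmine's actual implementation), the similarity operation can be modeled as set intersection, and generalizations of positive examples can be refuted by negative examples:

```python
# Sketch: JSM similarity as set intersection (idempotent, commutative,
# associative, i.e. a meet in the semilattice of feature sets), with
# generalizations of positives disproved by negative examples.
from itertools import combinations

def similarity(*descriptions):
    """Common features of the given descriptions (meet in the semilattice)."""
    result = set(descriptions[0])
    for d in descriptions[1:]:
        result &= set(d)
    return frozenset(result)

def positive_hypotheses(positives, negatives):
    """Pairwise generalizations of positives not occurring in any negative."""
    hyps = set()
    for a, b in combinations(positives, 2):
        h = similarity(a, b)
        # a hypothesis is disproved if it occurs in some negative example
        if h and not any(h <= set(n) for n in negatives):
            hyps.add(h)
    return hyps

positives = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f2", "f5"}]
negatives = [{"f2", "f5", "f6"}]

hyps = positive_hypotheses(positives, negatives)
# {f1, f2} survives; {f2} (from pairs involving the third example) is
# disproved because it occurs in the negative example.
```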
Described in the algebraic terms of FCA, the JSM-method allows for a simple, efficient
implementation in procedural programming languages (using standard algorithms for com-
puting closure systems and concept lattices [32]); Jasmine, however, is based on a logic pro-
gramming implementation.
Jasmine is not well suited to handling large amounts of data, because the number of hypothe-
ses grows exponentially with the number of examples. Focusing on flexibility, interactivity
and expressiveness of the knowledge representation language, Jasmine relies on conventional
means of reducing data dimensionality, such as clustering, filtering or other pre-processing ap-
proaches. The methodology of mining large datasets for causal links with Jasmine is therefore
to apply pre-processing first and then operate on the reduced dataset, so that the results
allow an intuitive semantic interpretation and can be supported by an explanation
comprehensible to a human expert.
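A hedged sketch of such a pre-processing step (the frequency filter and its threshold are illustrative choices, not part of Jasmine): rare features are removed before hypothesis generation so that the number of distinct generalizations stays manageable.

```python
# Sketch: dimensionality reduction by filtering rare features before
# JSM-style hypothesis generation. Threshold and data are illustrative.
from collections import Counter

def filter_rare_features(examples, min_support=2):
    """Keep only features occurring in at least min_support examples."""
    counts = Counter(f for ex in examples for f in ex)
    frequent = {f for f, c in counts.items() if c >= min_support}
    return [set(ex) & frequent for ex in examples]

examples = [{"a", "b", "noise1"}, {"a", "b", "noise2"}, {"a", "c"}]
reduced = filter_rare_features(examples)
# noise1/noise2 and the singleton 'c' drop out; generalizations are then
# computed over the features {'a', 'b'} only.
```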
We conclude this section with a comparison of Jasmine-based analysis with a generic sta-
tistical one. In mining evolutionary and genomic data, the value of statistical techniques is
that they may shed light on the most common principles of data organization in biology [9].
The statistical approach is usually less computationally intensive and less sensitive to unreliable
data. However, averaging may conceal a correlation of higher consistency that has a lower num-
ber of representatives. To overcome this problem, a deterministic approach like Jasmine can
deliver simple rules with systematic exceptions instead.
A further disadvantage of statistical approaches is that important information may be lost
at the initial step and cannot be restored at successive steps. Rather than
considering evolutionary data as noisy, we hypothesize that the dataset is reliable and
build complex rules initially. Then, in contrast to the statistical approach, we
simplify the rules by explicit enumeration of exceptions, which may form the basis of a
more specific level of rules at successive steps of data mining.
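The strategy of a simple rule with explicitly enumerated exceptions can be sketched as follows (the helper and the toy cases are invented for illustration):

```python
# Sketch: keep a general rule and enumerate the cases it misclassifies
# explicitly, rather than averaging them away as noise.

def make_rule_with_exceptions(rule, cases):
    """Return a refined rule plus the explicit list of its exceptions."""
    exceptions = [c for c in cases if rule(c["features"]) != c["label"]]

    def refined(features):
        # exceptions are consulted first; the general rule is the default
        for e in exceptions:
            if e["features"] == features:
                return e["label"]
        return rule(features)

    return refined, exceptions

# toy rule "hydrophobic implies buried", with one systematic exception
cases = [
    {"features": frozenset({"hydrophobic"}), "label": "buried"},
    {"features": frozenset({"hydrophobic", "terminal"}), "label": "exposed"},
]

def base_rule(fs):
    return "buried" if "hydrophobic" in fs else "exposed"

refined, exceptions = make_rule_with_exceptions(base_rule, cases)
# the refined rule now agrees with every case; exceptions holds one entry
```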
7. Conclusions: main features and advantages of Jasmine
The main features of Jasmine are as follows (not all of them have been demonstrated in the
current paper for brevity):
• The knowledge base for machine learning combines data (facts) with rules (causal links
between the data, clauses, productions, default rules, etc.);
• Convenient component-based integration with other units implementing a specific ap-
proach to reasoning;
• Implements inductive, deductive, abductive and analogical reasoning [5,26,1];
• Implemented as a logic program with full-scale meta-programming extensions; meta-
programming is intended to operate with domain-dependent predicates at a higher lev-
el of abstraction;
• Builds rules and meta-rules, and explains how they were built. Gives the cases (ob-
jects) which are similar to the target case (the one to be predicted) and the cases which differ
from the target [2]. Also, Jasmine enumerates the features of cases which separate
similar and different cases;
• Distinguishes typical and atypical rules; rules can be defeated;
• The front end of Jasmine and its data exchange are compatible with Microsoft Office.
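The enumeration of separating features mentioned in the list above can be sketched as follows (the helper `separating_features` and its feature sets are hypothetical illustrations, not Jasmine's API):

```python
# Sketch: given a target case, similar cases, and different cases,
# enumerate the features that separate the similar from the different.

def separating_features(target, similar_cases, different_cases):
    """Features shared with every similar case but absent from all different ones."""
    common = set(target)
    for c in similar_cases:
        common &= set(c)
    return {f for f in common if all(f not in d for d in different_cases)}

target = {"f1", "f2", "f3"}
similar = [{"f1", "f2", "f4"}]
different = [{"f2", "f5"}]
# f1 is shared with the similar case and missing from the different one,
# so it is the feature that separates them; f2 occurs on both sides.
```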
In comparison with the implementations of the JSM approach that inspired Jasmine, the lat-
ter has the following advantages:
1. Implementing JSM as a logic program, Jasmine can derive facts from clauses in addi-
tion to operating solely with facts. These clauses can bring in the representation
means of a particular approach to logical AI, including situation calculus, default and
defeasible reasoning, temporal and spatial logics, reasoning about mental attitudes,
and others. Jasmine uses facts dynamically obtained by external reasoning machinery,
and at the same time Jasmine yields hypotheses as clauses which are processed by ac-
companying deductive systems. Therefore the deductive capabilities of Jasmine are be-
yond those of JSM; in particular, Jasmine is better suited for active learning (com-
pare with [19]).
2. Current implementations of JSM are procedural and therefore strongly linked to do-
main-specific features; Jasmine is fully domain-independent. The ideology of JSM-based
analysis is strongly affected by its procedural implementation. Taking advantage of
the capability of the extended logic program to update clauses on the fly, Jasmine can
operate with hypotheses and the initial dataset with much higher flexibility than JSM.
Using Jasmine's GUI, it is easy to tackle an inverse prediction problem.
3. The measures of likelihood of hypotheses generated by JSM are rather limited: a sin-
gle fact may prevent the generation of a hypothesis that covers a large number of
samples. In contrast to JSM, Jasmine is capable of optimizing the initial dataset, elimi-
nating examples which deliver atypical solutions. Following the methodology of case-
based reasoning, Jasmine refines the initial dataset of cases to adjust it to the given do-
main where prediction is to be performed.
Jasmine’s functionality is not limited to bioinformatics [9]. To describe the integration ca-
pabilities of Jasmine for other domains, it is worth mentioning the following components
which have been integrated with Jasmine:
• Natural language information retrieval and question answering for extracting facts
and clauses from text, deriving new facts by means of Jasmine, and using them to an-
swer questions.
• Reasoning about mental attitudes, implemented as the Natural Language Multiagent
Mental Simulator NL_MAMS, has been deployed in commercial customer re-
sponse management software. It leverages Jasmine's generic environment for learn-
ing from previous cases and extracted causal links of agents' behavior.
One difficulty with Jasmine is that at present it is nontrivial for a third party to deploy.
Unlike regression, neural networks or decision trees, one cannot simply run a standard pack-
age. In the present state of the software, there is a substantial learning curve to using Jas-
mine effectively. In addition, there have been very limited efforts in applying Jasmine to a
wide variety of domains. So far the effort has been focused on the development of the methodology
and the logic programming implementation rather than on a user-friendly program for
general adoption. We consider that the approach is now ready for the de-
velopment of a system suitable for widespread use by researchers in bioinformatics and
other domains. Use of Jasmine by many groups in diverse learning environments would help
to identify the areas for its improvement.
In this paper we have demonstrated how a new method of data analysis allows posing
new kinds of problems for well-known data. Rather than evaluating Jasmine in a standard set-
ting, we show how an established domain in bioinformatics can leverage its capability to
obtain a clearer approach to hierarchical classification of proteins and a novel way to repre-
sent evolutionary relationships.
Jasmine’s applicability to a wide spectrum of problems in bioinformatics provides further
evidence that some aspects of scientific reasoning can be formalized and efficiently automat-
ed. This suggests that a system like Jasmine can be effective in optimizing the cost of con-
ducting experimental observations. Our presentation of Jasmine's reasoning in this paper is
such that the code from Section 2, Fig. 3.1 and a domain-specific expression for similarity, to
be written by the reader, are sufficient for reasoning-intensive data mining.
References
1. Anshakov, O.M., Finn, V.K., Skvortsov, D.P. (1989) On axiomatization of many-valued logics associ-
ated with formalization of plausible reasoning, Studia Logica v42 N4 pp. 423-447.
2. Blinova,V.G., Dobrynin,D.A., Finn, V. K., Kuznetsov, S.O. and Pankratova, E.S. (2003) Toxicology
Analysis by Means of the JSM-Method. Bioinformatics v.19, pp. 1201-1207.
3. Davey, B.A. and Priestley, H.A. (2002) Introduction to Lattices and Order. Cambridge University
Press.
4. Dubchak, I., Muchnik, I., Holbrook, S. R. and Kim, S.-H. (1995). Prediction of protein folding class
using global description of amino acid sequences. Proc. Natl. Acad. Sci. USA v.92, 8700-8704.
5. Finn, V.K. (1999) On the synthesis of cognitive procedures and the problem of induction. NTI Series
2, N1-2 pp. 8-45.
6. Furukawa, K. (1998) From Deduction to Induction: Logical Perspective. The Logic Programming
Paradigm. In Apt, K.R., Marek V.W., Truszczynski, M., Warren, D.S., Eds. Springer.
7. Galitsky, B. A., Gelfand, I. M. and Kister, A. E. (1998). Predicting amino acid sequences of the anti-
body human VH chains from its first several residues. Proc. Natl. Acad. Sci. USA 95, 5193-5198.
8. Galitsky, B.A., Gelfand I. M. and Kister, A. E. (1999). Class-defining characteristics in the mouse
heavy chains of variable domains. Protein Eng. 12, 919-925.
9. Galitsky, B.A. (2005) Bioinformatics Data Management and Data Mining. Encyclopedia of Database
Technologies and Applications. Idea Group (In the Press).
10. King, R.D. (2004) Applying inductive logic programming to predicting gene function. AI Magazine
Volume 25, Issue 1, AAAI Press, pp. 57-68.
11. King, R.D., Whelan, K.E., Jones, F.M., Reiser, P.K.G., Bryant, C.H., Muggleton, S.H., Kell, D.B.
and Oliver, S.G. (2004). Functional genomic hypothesis generation and experimentation by a robot
scientist. Nature, 427:247-252.
12. Langley, P. The computational support of scientific discovery. Int. J. Hum.–Comput. Stud. 53, 393–
410 (2000).
13. Liao, L. and Noble, W.S. (2003) Combining Pairwise Sequence Similarity and Support Vector Ma-
chines for Detecting Remote Protein Evolutionary and Structural Relationships, Journal of Computa-
tional Biology, v10 pp. 857-868.
14. Marcotte, E.M., Xenarios, I., van der Bliek A.M., and Eisenberg, D. (2000) Localizing proteins in the
cell from their phylogenetic profiles. PNAS, V97, N22, pp. 12115-12120.
15. McGuffin, L.J. and Jones, T.D. (2003) Benchmarking protein secondary structure prediction for pro-
tein fold recognition. Proteins, V. 52, pp166-175.
16. Mill, J.S. (1843) A System of Logic, Ratiocinative and Inductive. London.
17. Mirkin, B., Fenner, T., Galperin, M. and Koonin, E. (2003) Algorithms for computing parsimonious
evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of
horizontal gene transfer in the evolution of prokaryotes, BMC Evolutionary Biology 2003, 3:2.
18. Mitchell, T. (1997) Machine Learning. McGraw Hill.
19. Muggleton, S. (1995) Inverse entailment and Progol. New Generation Comput. J. 13, pp. 245–286.
20. Muggleton, S. and Firth. J. (1999) CProgol4.4: Theory and use. In Dzeroski, S. and Lavrac,N. (Eds),
Inductive Logic Programming and Knowledge Discovery in Databases.
21. Plotkin, G.D. (1970) A Note on inductive generalization, Machine Intelligence, vol. 5, pp. 153-163,
Edinburgh University Press.
22. Popper, K. (1972) The Logic of Scientific Discovery (Hutchinson, London).
23. Poupon, A. and Mornon J.-P. (1998). Populations of hydrophobic amino acids within protein globular
domains: identification of conserved "topohydrophobic" positions. Proteins V.33, pp. 329-342.
24. Scott, C. and Nowak, R. (2002). Dyadic Classification Trees via Structural Risk Minimization, Neural
Information Processing Systems.
25. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B,
Galperin MY, Fedorova ND, Koonin EV (2001) The COG database: new developments in phyloge-
netic classification of proteins from complete genomes. Nucleic Acids Res. V29(1) pp. 22-8.
26. Vinogradov, D.V. Logic programs for quasi-axiomatic theories NTI Series 2 N1-2 (1999) pp. 61-64.
27. Ward, J.J., Sodhi, J.S., Buxton, B. F. and Jones, D. T. (2004) Predicting Gene Ontology Annotations
from Sequence Data using Kernel-Based Machine Learning Algorithms. Proceedings of the 2004
IEEE Computational Systems Bioinformatics Conference (CSB 2004).
28. Tramontano,A. and Morea,V. (2003) Exploiting evolutionary relationships for predicting protein
structures. Biotechnology and Bioengineering, V 84, Issue 7, pp. 756-762.
29. Ganter, B. and Kuznetsov, S.O. (2003) Hypotheses and Version Spaces, Proc. 10th Int. Conf. on Con-
ceptual Structures ICCS'03, In A. de Moor, W. Lex, and B. Ganter, Eds., Lecture Notes in Artificial
Intelligence, vol. 2746 pp. 83-95.
30. Ganter, B. and Wille, R (1999) Formal Concept Analysis: Mathematical Foundations, Springer.
31. Mitchell, T. (1982) Generalization as Search, Artificial Intelligence V18, no. 2.
32. Kuznetsov, S.O., Obiedkov, S.A. (2002) Comparing performance of algorithms for generating concept
lattices, J. Exp. Theor. Artif. Intell., V. 14, N 2-3, pp. 189-216.
33. Chandra, A. and Harel, D. (1985) Horn clause queries and generalizations. J. of Logic Programming,
V2, N1, pp. 1-15.
34. Smith, D.K. & Xue, H. (1997) Sequence profiles of immunoglobulin and immunoglobulin-like do-
mains. J. Mol. Biol. 274, 530-545.
35. Delcher, A.L., Kasif, S. (1990) Efficient parallel term matching and anti-unification. Logic program-
ming. MIT Press, Cambridge, MA pp. 355 - 369
Appendix 1: Logical properties of JSM
When a JSM knowledge base contains incomplete information on the properties of the objects of interest,
additional truth values are necessary besides truth and falsity. As a result, we have the following internal
(splitting for iteration) truth values: (+1) for factual truth, (-1) for factual falsity, (0) for factual contradiction,
and (τ) for indeterminacy. Additionally, there are the standard external truth values t (logical truth, the
designated value) and f (logical falsity). For every internal truth value ν, the corresponding characteristic
Rosser-Turquette Jν-operator is used, defined by the following rules: J ν(λ) = t if λ = ν, and J ν(λ) = f if λ ≠ ν,
where λ and ν are internal truth values.
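The J-operator defined above admits a direct transcription (a sketch; the encoding of τ as a string and of the external values as "t"/"f" is an arbitrary representational choice):

```python
# Sketch of the characteristic J-operator on internal truth values:
# J_nu(lambda) = t iff lambda = nu, else f.
TAU = "tau"                 # indeterminacy
INTERNAL = {+1, -1, 0, TAU}  # factual truth, falsity, contradiction, tau

def J(nu, lam):
    """Characteristic operator: maps internal values to external t / f."""
    assert nu in INTERNAL and lam in INTERNAL
    return "t" if lam == nu else "f"

# e.g. J_{+1}(+1) = t, while J_{+1}(0) = f
```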
A formula of the form J ν(Φ), where Φ is an arbitrary formula and ν is an internal truth value, is called a
J-formula. A J-formula with an atomic Φ is called J-atomic. A formula constructed from J-formulas by the
usual logical connectives of first-order predicate logic is called an external formula. An external formula which
contains only J-atomic formulas is called normal.
Connectives and quantifiers can be defined on internal truth values; what is needed is a translation of each
external formula into an equivalent normal one. Many-valued logics with this property are called J-definable. A
criterion of J-definability of many-valued (including infinite-valued) logics has been found [1], and the complete-
ness theorem has been proved for such logics [1].
The values of the binary predicates X⇒1W and V⇒2W, defined for each pair consisting of an object X (a sub-
object V) and a subset of properties W, are internal truth values. The formulas J<+1,0>(X⇒1W), J<-1,0>(X⇒1W), and
J<τ,0>(X⇒1W) denote, respectively, that an object X possesses all properties from the set W, does not possess any
property from W, or that the properties from W are unknown. Similarly, the formulas J<+1,1>(V⇒2W),
J<-1,1>(V⇒2W), and J<0,1>(V⇒2W) denote the statements that an occurrence of a sub-object V in an arbitrary ob-
ject forces the appearance of all properties from W, forces the vanishing of all properties from W, or leads to
contradictory information about possession of the properties from W, respectively.
We proceed to the foundations of JSM’s implementation as a logic program.
If we restrict ourselves to a single property p, then J<0,1>(v⇒2{p}) is equivalent to the formula
¬∃z [v ≤ z & z ≠ v & ∀x(J<+1,0>(x⇒1{p}) & v ≤ x ⊃ z ≤ x)] &
¬∃z [v ≤ z & z ≠ v & ∀x(J<-1,0>(x⇒1{p}) & v ≤ x ⊃ z ≤ x)] &
∃z1 ∃z2 [v ≤ z1 & v ≤ z2 & z1 ≠ z2 & J<+1,0>(z1⇒1{p}) & J<+1,0>(z2⇒1{p})] &
∃z1 ∃z2 [v ≤ z1 & v ≤ z2 & z1 ≠ z2 & J<-1,0>(z1⇒1{p}) & J<-1,0>(z2⇒1{p})] .
Similarly, J<+1,1>(v⇒2{p}) is equivalent to the formula
¬∃z [v ≤ z & z ≠ v & ∀x(J<+1,0>(x⇒1{p}) & v ≤ x ⊃ z ≤ x)] &
∃z1 ∃z2 [v ≤ z1 & v ≤ z2 & z1 ≠ z2 & J<+1,0>(z1⇒1{p}) & J<+1,0>(z2⇒1{p})] & ¬J<0,1>(v⇒2{p}).
Replacing +1 by -1 in the last formula gives us the expression for J<-1,1>(v⇒2{p}).
Hence, it is possible to express the plausible inference rules and the sufficient reason criterion of the JSM-method
by means of first-order logic. This approach leads to a new deductive imitation theorem, similar to the
one obtained in [1]. However, in this way we ignore computational effectiveness, which is quite important for
practical usage of JSM. Therefore another representation of JSM, as a stratified logic program [33], has
been suggested. The class of stratified logic programs admits a very simple semantics (iterated fix-point models).
Using this semantics, a correctness result has been proved stating the coincidence of the deductive imitation
model [1] with the iterated minimal fix-point model. The stratified logic program has been presented in Sec-
tion 2.
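The iterated fix-point semantics can be illustrated with a toy stratified program (an invented example, not the program from Section 2): each stratum is evaluated to a fixpoint before negation over it is used in the next stratum.

```python
# Sketch of iterated minimal fix-points for a toy stratified program.

def fixpoint(facts, rules):
    """Apply rules (facts -> new facts) until no new fact is derived."""
    facts = set(facts)
    changed = True
    while changed:
        new = set()
        for rule in rules:
            new |= rule(facts)
        changed = not new <= facts
        facts |= new
    return facts

# Stratum 1, positive rules only:  reach(Y) :- reach(X), edge(X, Y).
edges = {("a", "b"), ("b", "c")}
s1 = fixpoint({("reach", "a")},
              [lambda fs: {("reach", y) for (x, y) in edges
                           if ("reach", x) in fs}])

# Stratum 2 uses negation over the completed stratum 1:
# unreachable(X) :- node(X), not reach(X).
nodes = {"a", "b", "c", "d"}
unreachable = {x for x in nodes if ("reach", x) not in s1}
# only "d" is unreachable in this toy graph
```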