1. The document discusses a talk on using ontologies like Gene Ontology (GO) to improve the data mining step of knowledge discovery from databases.
2. It introduces IntelliGO, a semantic similarity measure for GO annotations that can be used for gene clustering and abstraction of biomedical data to aid data mining.
3. The talk will cover the development of IntelliGO, its application to gene clustering, and its use in clustering secondary effects from another ontology to reduce data dimensions for mining.
Manifesto for a Standard on Knowledge Exchange in Social Knowledge Management...Communardo Software GmbH
This document proposes a standard for capturing contextual information in knowledge management environments. It identifies six key concepts needed: knowledge activity, knowledge activity stream, knowledge trace, knowledge object, knowledge bundle, and knowledge container. The concepts are defined and a semantic model is proposed to represent relationships. Next steps include developing an ontology-based standard, use cases, and starting an interest group for further discussion and improvement.
This document discusses semantic coding of images for earth observation. It describes the Competence Centre on Information Extraction and Image Understanding for Earth Observation (COC), which aims to perform joint research on image understanding. It then discusses an approach to semantic image coding using visual words and bag-of-words representations classified with tools like LDA and SVM. This approach is implemented in a new OTB tool called cocSemanticCoding, which allows feature extraction, visual word quantification, and learning from manual annotations. Future work includes integrating other COC tools and developing new approaches for full image interpretation and knowledge modeling.
This document discusses personalization in e-learning and the challenges it poses for instructional designers. It covers 5 types of personalization and challenges including understanding learner-content interactions and designing personalized learning paths. The key needs for personalization are identified as a learner model, learning object design model, ontologies, and learning analytics. Several studies are summarized that explore modeling learner characteristics, visual search performance, memory spans, navigation design, and levels of processing to better understand learners and design personalized instruction. Challenges remain around differentiating learning paths for individuals and groups.
This document summarizes a lecture on learning classifier systems. It introduces learning classifier systems and provides examples of their applications in reinforcement learning, supervised learning, and function approximation. Specifically, it discusses XCS and its use in solving maze problems and the mountain car problem. It also provides an example of a learning classifier system solving boolean functions. Finally, it outlines current real-life applications of learning classifier systems in data mining, modeling market traders, autonomous robotics, and modeling artificial ecosystems. It concludes by introducing anticipatory learning classifier systems which can learn condition-action-effect relations.
Elettronica: Multimedia Information Processing in Smart Environments by Aless...Codemotion
The document discusses research into multimedia information processing for smart environments. It covers topics such as feature extraction, object recognition, distributed video coding for multiple sources, and new imaging techniques. The overall goal is to develop technologies that integrate sensors, distributed computing systems, and communications to create environments that can adapt to conditions, respond to users, and improve quality of life.
New Challenges in Learning Classifier Systems: Mining Rarities and Evolving F...Albert Orriols-Puig
The document discusses new challenges in learning classifier systems (LCS) when dealing with domains containing rare classes. It proposes using a design decomposition approach to analyze how LCS address rare classes. Specifically, it examines how the extended classifier system (XCS) handles rare classes. It identifies five critical elements of LCS that are important for detecting small niches associated with rare classes: 1) estimating classifier parameters correctly, 2) providing representatives of rare niches during initialization, 3) generating and growing representatives of rare niches, 4) adjusting the genetic algorithm application rate, and 5) ensuring representatives of rare niches dominate their niches. The document focuses on analyzing the first of these elements, the estimation of classifier parameters, for XCS in domains containing rare classes.
IWLCS'2008: First Approach toward Online Evolution of Association Rules wit...Albert Orriols-Puig
This document proposes using a learning classifier system (LCS) to mine association rules from data streams online by adapting to changes in associations. It describes CSar, an LCS that represents rules with intervals for continuous attributes and uses antecedent or consequent grouping to organize rules into association sets. Experimental results show CSar can extract rules with similar support and confidence as Apriori on discrete data and quantitative association rules on continuous data, though further analysis is needed to compare CSar to other rule miners and test its ability to adapt to changing data streams.
1. The document discusses using principles from biological vision to improve computer vision systems.
2. It describes how computer vision has incorporated ideas from visual neuroscience, such as using oriented filters inspired by V1 simple cells and implementing normalization models.
3. The document argues that biology uses cascades of canonical operations like linear filtering, nonlinearities, and pooling in an optimized way for general-purpose vision, and following these principles can improve computer vision tasks like object recognition.
CCIA'2008: Can Evolution Strategies Improve Learning Guidance in XCS? Design ...Albert Orriols-Puig
This document proposes replacing the genetic algorithm (GA) in the rule-discovery component of the XCS learning classifier system with evolution strategies (ES). It designs an ES-based XCS with a modified classifier representation and ES-based genetic operators. Experiments on real-world datasets show the ES-based XCS outperforms the GA-based XCS when only selection and mutation are used, though there is no significant difference once crossover is added. Further research is suggested to determine when different search operators should be used.
This document summarizes the second lecture in an introduction to machine learning course. The lecture discusses the definition of machine learning as improving performance on a task through experience. It also covers why machine learning is increasingly appealing due to growing data and computational power. Examples of applying machine learning to data mining tasks like medical prediction are provided. The lecture concludes by discussing emerging areas for machine learning like semantic networks and individual recommender systems.
Lecture16 - Advances topics on association rules PART IIIAlbert Orriols-Puig
This document provides an introduction to sequential pattern mining, which aims to find patterns or associations between items where the order or sequence is important. It defines key concepts like sequential patterns, frequent sequences, and support. An example database of customer transaction sequences is given to illustrate sequential patterns with a minimum support threshold of 25%. Key 1-sequence, 2-sequence, and 3-sequence patterns meeting the support threshold are identified in the example data. The document discusses how sequential pattern mining is useful for applications like analyzing the sequence of items customers buy or their navigational patterns on a website.
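The support computation summarized above can be sketched in a few lines; the toy database and item names below are illustrative, not taken from the lecture.

```python
# Sketch: counting support of a candidate sequential pattern over a
# toy customer-transaction database (illustrative data).
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` with order preserved."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of customer sequences that contain the pattern."""
    hits = sum(is_subsequence(pattern, seq) for seq in database)
    return hits / len(database)

# Four customers, each a sequence of purchased items.
db = [["bread", "milk", "beer"],
      ["bread", "beer"],
      ["milk", "bread", "beer"],
      ["milk", "diapers"]]

print(support(["bread", "beer"], db))  # 0.75: three of four sequences
```

Note that order matters: the reversed pattern `["beer", "bread"]` has support 0 in this database, which is exactly what distinguishes sequential patterns from plain itemsets.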
This document provides a summary of Lecture 4 of an introduction to machine learning course. It recaps topics from Lecture 3, including different machine learning paradigms and problems that will be studied like classification, regression, clustering, and association analysis. It then discusses classification and prediction problems in more detail. Specific algorithms for classification, regression, clustering and association rule mining are introduced. Finally, it previews that the next class will focus on the C4.5 classification algorithm.
Towards a 2-dimensional Self-organized Framework for Structured Population-ba...Carlos M. Fernandes
Presentation of the paper "Towards a 2-dimensional Self-organized Framework for Structured Population-based Metaheuristics". IEEE International Congress on Complex Systems, Agadir, Morocco, 2012
Presentation of the paper "Pherogenic Drawings: Generating Colored 2-Dimensional Abstract Representations of Sleep EEG with the KANTS Algorithm". ECTA 2012, Barcelona, Spain.
A crucial aspect of search applications is the ability to identify named entities in free-form text and to support entity-based, complex queries against the indexed data. By enriching each entity with semantically relevant information acquired from outside the text, one can create the foundation for an advanced search application. Thus, given a document about Denmark in which none of the words Copenhagen, country, or capital appears, it should still be possible to retrieve the document by querying for Copenhagen or European country.
In this paper, we report how we have tackled this problem. We concentrate, however, only on the two tasks central to the solution, namely named entity recognition (NER) and enrichment of the discovered entities using linked data from knowledge bases such as YAGO2 and DBpedia. We remain agnostic to all other details of the search application, which can be implemented in a relatively straightforward way using, e.g., Apache Solr.
The work deals only with Swedish and is restricted to two domains: news articles and medical texts. As a byproduct, our method achieves state-of-the-art results for Swedish NER; to our knowledge, no previously published work employs linked data for Swedish in the two domains at hand.
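The Denmark/Copenhagen example can be made concrete with a minimal indexing sketch. The `KB_FACTS` table below is a hypothetical stand-in for real YAGO2/DBpedia lookups, and the index structure is illustrative, not the paper's implementation.

```python
# Sketch of the enrichment idea: facts pulled from a knowledge base
# (hard-coded here in place of real YAGO2/DBpedia lookups) are indexed
# alongside a document's own words, so entity-related queries match.
KB_FACTS = {  # hypothetical linked-data facts per entity
    "Denmark": ["Copenhagen", "European", "country", "capital"],
}

def build_index(docs):
    """docs: doc_id -> (text, recognized entities). Returns term -> doc ids."""
    index = {}
    for doc_id, (text, entities) in docs.items():
        terms = set(text.lower().split())
        for entity in entities:                      # enrichment step
            terms.update(t.lower() for t in KB_FACTS.get(entity, []))
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {"d1": ("Denmark held elections this spring", ["Denmark"])}
index = build_index(docs)
print(index["copenhagen"])  # {'d1'}, although the word never appears in d1
```

A real system would of course query the knowledge base at indexing time and store the enrichment in dedicated fields (e.g., in Solr), but the retrieval effect is the same.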
This document provides an overview of genetic fuzzy systems. It begins with a recap of supervised and unsupervised machine learning techniques. It then discusses fuzzy logics and how they can be used to represent imprecise concepts using membership functions. Fuzzy systems that use fuzzy logic to model relationships between variables are introduced. Genetic fuzzy systems combine fuzzy systems with genetic algorithms to design the fuzzy rules, membership functions, and inference engines. The genetic algorithm evolves populations of candidate fuzzy systems through selection, crossover and mutation to optimize system performance.
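The membership functions mentioned above are easy to make concrete. Below is a minimal sketch of a triangular membership function, the most common building block of such fuzzy systems; the "warm temperature" parameters are illustrative.

```python
# Sketch: a triangular membership function, a basic building block of
# the fuzzy systems described above (parameter values are illustrative).
def triangular(x, a, b, c):
    """Degree of membership of x in a fuzzy set peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge

# "Warm" temperature, peaking at 22 degrees.
print(triangular(22.0, 15, 22, 30))   # 1.0 (fully warm)
print(triangular(18.5, 15, 22, 30))   # 0.5 (partially warm)
```

A genetic fuzzy system would encode the parameters `(a, b, c)` of each such function in the chromosome and let selection, crossover, and mutation tune them.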
This document summarizes a lecture on advanced topics in association rule mining, including mining frequent patterns without candidate generation, multiple level association rules (AR), and quantitative AR. It discusses how concept hierarchies can be used to find associations between items at different levels of abstraction. Algorithms for mining generalized AR with taxonomies are presented, along with optimizations like pre-computing ancestors and pruning redundant rules. The importance of hierarchies for modeling real-world applications is highlighted.
GECCO'2007: Modeling XCS in Class Imbalances: Population Size and Parameter S...Albert Orriols-Puig
1. The document describes a facetwise analysis of the XCS learning classifier system for class imbalances.
2. It analyzes the population initialization process, generation of rules for minority classes, time to extinction of such rules, and derives a population size bound.
3. The analysis considers problems with multiple classes, one sampled at a lower frequency (minority class), and derives probabilities of sampling instances from each class.
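The sampling probabilities in point 3 follow from simple arithmetic. The sketch below is illustrative, not the paper's exact model: it assumes one minority class and an imbalance ratio `ir` giving the number of majority instances per minority instance.

```python
# Illustrative arithmetic (an assumption, not the paper's exact model):
# with one minority class and imbalance ratio ir = majority/minority
# instances, a uniform random draw hits each class as follows.
def sampling_probabilities(ir):
    p_minority = 1.0 / (1.0 + ir)
    return p_minority, 1.0 - p_minority

p_min, p_maj = sampling_probabilities(ir=9)  # 9 majority per minority instance
print(p_min, p_maj)  # 0.1 0.9
```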
This document provides an introduction and agenda for Lecture 1 of an Introduction to Machine Learning course. The lecture will cover goals for the course from both the instructor and student perspectives. It will also discuss the major project which will account for 70% of the grade, requiring students to conduct and present original machine learning research. The next class will cover what machine learning is, why it is important, and different machine learning paradigms.
Information Retrieval using Semantic SimilaritySaswat Padhi
This document summarizes a seminar on artificial intelligence that covered three main topics: information retrieval using semantics and ontology, semantic similarity, and information retrieval. It discusses how semantics and ontologies can help address what information retrieval is currently lacking by providing meaning. It then covers different approaches to measuring semantic similarity based on path lengths and information content in ontologies. Finally, it discusses how information retrieval can be improved by reweighting query terms and expanding queries based on semantic similarity to related terms.
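The path-length approach to semantic similarity mentioned above can be sketched over a toy is-a taxonomy; the hierarchy and the `1 / (1 + distance)` scoring below are illustrative choices, not the seminar's specific formulas.

```python
# Sketch of path-length semantic similarity over a toy is-a taxonomy
# (the hierarchy is illustrative, not a real ontology; assumes one root).
PARENT = {  # child -> parent ("is-a") edges
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}

def path_to_root(term):
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + shortest is-a path length between a and b)."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    dist = min(pa.index(c) + pb.index(c) for c in common)
    return 1.0 / (1.0 + dist)

print(path_similarity("dog", "wolf"))  # 1/3: two edges via "canine"
print(path_similarity("dog", "cat"))   # 1/5: four edges via "mammal"
```

Information-content measures refine this by weighting ancestors by how rare they are in a corpus, so that meeting at a very general concept (like "animal") counts for less.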
JIST2015-Computing the Semantic Similarity of Resources in DBpedia for Recomm...GUANGYUAN PIAO
This document presents a method for calculating semantic similarity between resources in DBpedia to enable recommendations. It proposes the Resim measure, which satisfies fundamental axioms of similarity; an evaluation shows Resim outperforms other measures both on general resource pairs and at recommending similar music artists. Additionally, the document finds that recommendation performance decreases as resources have fewer links, indicating a linked data sparsity problem.
The document discusses a clinical decision support system (CDSS) being developed as part of the Synergy-COPD project, which will use a Java-based framework and Drools rules engine to represent clinical knowledge and make inferences from patient data represented in a HL7 virtual medical record format, with the goal of aiding diagnosis and management of COPD patients.
This document discusses gene prediction and some of the computational challenges involved. It covers topics like genes and proteins, gene prediction problems, computational approaches like using open reading frames and codon usage. It also discusses central dogma, exons, introns, splicing signals, genetic code and different gene prediction algorithms.
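The open-reading-frame approach mentioned above is simple to sketch: scan each reading frame for an ATG start codon followed in-frame by a stop codon. The minimum-length threshold below is an illustrative parameter.

```python
# Sketch: locating open reading frames (ORFs) on the forward strand of a
# DNA string, one of the computational approaches mentioned above.
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) spans of ATG...stop stretches in all three frames."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i+3] == START:
                j = i + 3
                while j + 3 <= len(dna):          # scan in-frame for a stop
                    if dna[j:j+3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        break
                    j += 3
            i += 3
    return orfs

print(find_orfs("ATGAAATGA"))  # [(0, 9)]: ATG AAA TGA in frame 0
```

Real gene finders combine this signal with codon-usage statistics and splicing-signal models, since in eukaryotes an ORF alone says little about exon-intron structure.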
This document provides an overview of scalable pattern mining algorithms for large-scale interval data. It discusses the need for scalable pattern mining due to the huge increase in data size. It covers serial frequent itemset mining methods like Apriori, Eclat, and FP-growth. It also discusses parallel itemset mining methods, including the FP-growth-based PFP algorithm and the ultrametric-tree-based FiDoop algorithm. Additionally, it covers pattern mining approaches for interval data, including interval sequences, temporal relations, and hierarchical representations. The document concludes by stating that while efforts have been made to modify classic algorithms for distributed processing, scalable mining of temporal relationships on large interval data remains an open issue.
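The level-wise search underlying Apriori can be sketched compactly; the transaction database below is illustrative.

```python
# Sketch of Apriori's level-wise search: frequent k-itemsets are joined
# into (k+1)-candidates, which are then pruned by minimum support.
def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    frequent = {}
    candidates = [frozenset([i]) for i in sorted(items)]
    while candidates:
        # One pass over the data to count each candidate's support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Join step: union pairs of surviving k-itemsets into (k+1)-itemsets.
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == len(a) + 1})
    return frequent

db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
freq = apriori(db, min_support=0.5)
print(freq[frozenset("ab")])  # 0.75: {a, b} appears in 3 of 4 transactions
```

The anti-monotonicity of support (a superset can never be more frequent than its subsets) is what makes discarding infrequent k-itemsets before generating (k+1)-candidates safe, and it is this repeated full pass over the data that parallel variants like PFP distribute.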
Similarity-based gene prediction approaches compare unknown genes in a genome to known genes in closely-related species to determine gene structure and function. A key challenge is that genes are split into exons and introns. Spliced alignment algorithms aim to find the best chain of exons in the genome that align to a target gene sequence, taking into account local alignments between potential exons. Popular gene prediction tools that use spliced alignment include GENSCAN, GenomeScan, TwinScan, Glimmer, and GenMark. They apply probabilistic models and similarity information to predict exon-intron structure.
An introduction to frequent pattern mining algorithms and their usage in mining log data. Presented by Krishna Sridhar (Dato) at Seattle DAML meetup, Feb 2016.
This document discusses gene and DNA sequence patenting as well as patenting transgenic organisms. It defines what a patent is and explains that genes, DNA sequences, and transgenic organisms can be patented if they are novel, non-obvious, and industrially applicable. However, gene patenting is controversial from an ethics perspective as some believe genes should not be treated as property. Examples of patented genes, transgenic mice and fish are provided.
Gene expression in eukaryotes is controlled at multiple levels, including chromatin structure, transcription, RNA processing, and translation. Chromatin structure determines if genes are transcriptionally active or inactive. Transcription is regulated by the interaction of promoters, transcription factors, and enhancers. RNA processing controls splicing and transport of mRNA. Finally, translation and post-translational modifications further regulate gene expression. Overall, eukaryotic gene expression is tightly controlled through complex mechanisms at the chromatin, transcription, RNA, translation, and protein levels.
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce - Victor Giannakouris
This document proposes CSMR, a scalable algorithm for text clustering that uses cosine similarity and MapReduce. CSMR performs pairwise text similarity by representing text documents as vectors in a vector space model and measuring similarity in parallel using MapReduce. It is a 4-phase algorithm that includes word counting, text vectorization using term frequencies, applying TF-IDF to document vectors, and measuring cosine similarity. The algorithm is designed to cluster large text corpora in a scalable manner on distributed systems like Hadoop. Future work includes implementing and testing CSMR on real data and publishing results.
Bioinformatics Project Training for 2, 4, 6 months - biinoida
The document discusses projects and research opportunities available at the Bioinformatics Institute of India (BII). Some key points:
- BII focuses on major bioinformatics areas like genome analysis, protein structure prediction, microarray analysis, and drug discovery.
- Students can complete short or long-term projects at BII on topics like sequence analysis, phylogenetics, and software development.
- Projects help students gain skills in bioinformatics tools like MATLAB, Bioperl, and develop their careers in fields like academia, industry, and research.
- BII has hosted projects for students from various universities on diverse topics ranging from protein structure prediction to disease research.
Secondary Structure Prediction of proteins Vijay Hemmadi
Secondary structure prediction has been around for almost a quarter of a century. The early methods suffered from a lack of data. Predictions were performed on single sequences rather than families of homologous sequences, and there were relatively few known 3D structures from which to derive parameters. Probably the most famous early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim. Although the authors originally claimed quite high accuracies (70-80%), under careful examination the methods were shown to be only between 56 and 60% accurate (see Kabsch & Sander, 1984). An early problem in secondary structure prediction had been the inclusion of structures used to derive parameters in the set of structures used to assess the accuracy of the method.
The document discusses different types of spatial data used in geographic information systems (GIS). It describes vector data, which represents geographic features as points, lines, and polygons, and raster data, which divides the landscape into a grid of cells. It outlines some common vector data formats like shapefiles, coverages, and geodatabases, and raster formats like grids and digital elevation models. It provides details on how vector data structures represent points, lines, polygons, and their topological relationships in ArcGIS.
This document provides a tutorial on text mining and text stream mining techniques. It covers common text mining processes like transforming text into vector space models using bag-of-words representations, computing term weights, and applying machine learning algorithms. Specifically, it discusses vector space models, term weighting using TF-IDF, cosine similarity as a distance measure, and machine learning algorithms for classification like k-Nearest Neighbors, nearest centroid classification, and support vector machines. The tutorial is intended to provide an overview of fundamental text mining and text stream mining concepts.
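The pipeline the tutorial describes (bag-of-words vectors, TF-IDF term weights, cosine similarity) can be sketched in a few lines. This is a hand-rolled toy version on made-up documents, not the tutorial's own code:

```python
# Toy text-mining pipeline: tokenize -> TF-IDF vectors -> cosine similarity.
import math
from collections import Counter

docs = ["data mining of text data", "mining text streams", "cooking recipes"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for d in tokenized for w in d})

N = len(docs)
df = {w: sum(1 for d in tokenized if w in d) for w in vocab}  # document freq.

def tfidf(doc):
    """TF-IDF vector over the shared vocabulary."""
    tf = Counter(doc)
    return [tf[w] / len(doc) * math.log(N / df[w]) for w in vocab]

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vecs = [tfidf(d) for d in tokenized]
# the two mining-related documents are closer than mining vs cooking
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The same vectors feed directly into the classifiers the tutorial mentions (k-NN, nearest centroid, SVMs), which all operate on this vector-space representation.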
This document discusses identifying mutations in the filaggrin gene through sequence analysis. The filaggrin gene codes for filaggrin proteins that are essential for skin barrier function. Mutations in this gene are linked to conditions like eczema and asthma. The study aims to detect faulty filaggrin genes, identify other human and non-human proteins with similar function to filaggrin, and find identical protein sequences to help develop therapeutic options. Sequence alignment methods like pairwise alignment and BLAST will be used to analyze filaggrin genes and identify similar protein sequences.
Spatial data defines a location using points, lines, polygons or pixels and includes location, shape, size and orientation. Non-spatial data relates to a specific location and includes statistical, text, image or multimedia data linked to spatial data defining the location. The document outlines key differences between spatial and non-spatial data, noting that spatial data is multi-dimensional and correlated while non-spatial data is one-dimensional and independent, with implications for conceptual, processing and storage issues.
Health Insurance Predictive Analysis with Hadoop and Machine Learning - Julien Cabot - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot
Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin
This presentation reports on the data science curriculum development and implementation at the Syracuse iSchool, which has been shaped by the fast-changing data-intensive environment, not only for science but also for business and research at large.
This document discusses using machine learning and MapReduce with Hadoop to perform predictive analysis on health insurance claims data. It proposes extracting data from online forums, searching for correlations between medical keywords and claims, and using support vector regression to build a predictive model. The analysis would be run on Amazon Elastic MapReduce for scalability and cost efficiency. Future work may include additional data sources and model enhancements.
Combining Data Mining and Ontology Engineering to enrich Ontologies and Linked Data - Mathieu d'Aquin
This document discusses combining data mining and ontology engineering to enrich ontologies and linked data. It describes how the knowledge discovery process and ontology engineering process can evolve together, with data mining interpreting ontologies and mining data to discover new concepts and relationships to enrich ontologies. It also outlines major new issues that arise from mining linked and ontology-based data, such as ontology-guided data mining and versioning to track changes between models.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Linked Open Data to support content-based Recommender Systems - Vito Ostuni
This document proposes using Linked Open Data to support content-based recommender systems by providing rich item descriptions. It describes the main drawback of traditional content-based recommender systems as their limited ability to analyze content. A vector space model is then adapted to represent RDF graphs from Linked Open Data as vectors, allowing item similarity to be computed based on semantic relationships. An evaluation of this approach using MovieLens data demonstrates improvements in recommendation precision and recall.
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTICS 2012 - Roku
Linked Open Data to Support Content-based Recommender Systems
Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, Markus Zanker
I-SEMANTICS 2012, 8th Int. Conf. on Semantic Systems, Sept. 5-7, 2012, Graz, Austria.
The World Wide Web is moving from a Web of hyper-linked Documents to a Web of linked Data. Thanks to the Semantic Web spread and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data have been published in freely accessible datasets. These datasets are connected with each other to form the so called Linked Open Data cloud. As of today, there are tons of RDF data available in the Web of Data, but only few applications really exploit their potential power. In this paper we show how these data can successfully be used to develop a recommender system (RS) that relies exclusively on the information encoded in the Web of Data. We implemented a content-based RS that leverages the data available within Linked Open Data datasets (in particular DBpedia, Freebase and LinkedMDB) in order to recommend movies to the end users. We extensively evaluated the approach and validated the effectiveness of the algorithms by experimentally measuring their accuracy with precision and recall metrics.
1) Deep learning has achieved great success in many computer vision tasks such as image classification, object detection, and segmentation. Convolutional neural networks (CNNs) are often used.
2) The size and quality of training datasets is crucial, as deep learning models require large amounts of labeled data to learn meaningful patterns. Data augmentation and synthesis can help increase data quantity and quality.
3) Semi-supervised and transfer learning techniques can help address the challenge of limited labeled data by making use of unlabeled data as well. Generative adversarial networks (GANs) have also been used for data augmentation.
Jena-based implementation of an ISO 11179 metadata registry - A. Anil Sinaci
The ISO/IEC 11179 family of specifications introduces a standard model for metadata registries to increase the interoperability of applications through the use of common data elements. A Jena-based implementation of a standard metadata registry brings semantic processing and reasoning capabilities on top of the common data elements and their consumer applications.
The digital universe is booming, especially metadata and user-generated data. This raises strong challenges in identifying the portions of data relevant to a particular problem and in dealing with the lifecycle of data. Finer-grain problems include data evolution and the potential impact of change on the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Objects concept as the means to identify and structure relevant data in scientific domains, addressing data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.
This document provides an overview of multidimensional scaling (MDS) and overall similarity perceptual maps. It discusses the challenges with attribute rating perceptual maps and how MDS can help address these challenges by mapping similarities and dissimilarities among items based on consumer perceptions. The document outlines the history, methodology, implementation steps, and example application of MDS for creating a perceptual map of search engines. Key considerations for using MDS include having a large sample size of subjects and brands/products.
Going local with a world-class data infrastructure: Enabling SDMX for research - Rob Grim
1. The document discusses how Tilburg University is using SDMX standards to support research through their metadata management and data infrastructure.
2. Key aspects of their approach include developing an SDMX metadata registry to describe time series data from different disciplines, and an SDMX data repository to prevent data replication and deal with confidentiality issues.
3. One project example is the CARDS World Taxation Indicators project, which aims to improve access to tax data by standardizing indicators using SDMX.
This document discusses the key ingredients for representing sensor data as linked data: core ontological models, additional domain ontologies, guidelines for generating identifiers, sensor web programming interfaces, and query processing engines. It provides examples of existing linked sensor data projects, examines the SSN ontology for describing sensors and observations, and outlines challenges in generating and consuming linked sensor data at scale from sensor networks.
Presentation at the Big Data Forum, 22 January 2013 - Big Data and Big Society - SURFnet
During the second edition of the Big Data Forum, Erik Huizer, CTO of SURFnet, gave a presentation on the impact of Big Data on society. According to Huizer, big data in the Netherlands is at the root of a societal change. In today's data-driven society, people are increasingly able to choose for themselves what is good for them and with whom they do business. On the other hand, Big Data also poses risks to matters such as privacy. The role of government is therefore changing, shifting from providing services to providing safeguards.
SP1: Exploratory Network Analysis with Gephi - John Breslin
ICWSM 2011 Tutorial
Sebastien Heymann and Julian Bilcke
Gephi is an interactive visualization and exploration software for all kinds of networks and relational data: online social networks, emails, communication and financial networks, but also semantic networks, inter-organizational networks and more. Designed to make data navigation and manipulation easy, it aims to fulfill the complete chain from data importing to aesthetics refinements and interaction. Users interact with the visualization and manipulate structures, shapes and colors to reveal hidden properties. The goal is to help data analysts to make hypotheses, intuitively discover patterns or errors in large data collections.
In this tutorial we will provide a hands-on demonstration of the essential functionalities of Gephi, based on a real case scenario: the exploration of student networks from the "Facebook100" dataset (Social Structure of Facebook Networks, Amanda L. Traud et al, 2011). The participants will be guided step by step through the complete chain of representation, manipulation, layout, analysis and aesthetics refinements. Particular focus will be put on filters and metrics for the creation of their first visualizations. They will be encouraged to compare the hypotheses suggested by their own exploration with the results actually published in the academic paper afterwards. They will finally walk away with the practical knowledge enabling them to use Gephi for their own projects. The tutorial is intended for professionals, researchers and graduates who wish to learn how playful network exploration can speed up their studies.
Sébastien Heymann is a Ph.D. candidate in Computer Science at Université Pierre et Marie Curie, France. His research at the ComplexNetworks team focuses on the dynamics of real-world networks. He has led the Gephi project since 2008 and is the administrator of the Gephi Consortium.
Julian Bilcke is a Software Engineer at ISC-PIF (Complex Systems Institute of Paris, France). He has been a founder and developer of the Gephi project since 2008.
This document outlines the key topics to be covered in a course on foundations of data science. The course will cover basic concepts of big data and data science including data collection, analysis, modeling and inference using statistical concepts. Students will learn to identify appropriate data mining algorithms to solve real-world problems and analyze relevant data, models and tools for applications. The syllabus will include modules on big data overview, data warehousing, data mining, data analysis, data science life cycle and more.
Challenges and Opportunities arising from the Data Explosion (Big Data) - Francisco Pires
Presentation: "Challenges and Opportunities arising from the Data Explosion (Big Data)" at the Jornadas ANPRI (Associação Nacional dos Profissionais de Informática) in Coimbra on 16 June 2012, by Francisco Lavrador Pires. FB - https://www.facebook.com/francisco.l.pires ; Twitter @flpires
Presentation on Machine Learning and Data Mining - butest
The document discusses the differences between automatic learning/machine learning and data mining. It provides definitions for supervised vs unsupervised learning, what automated induction is, and the base components of data mining. Additionally, it outlines differences in the scientific approach between automatic learning and data mining, as well as differences from an industry perspective, including common data mining techniques used and tips for successful data mining projects.
Big Data [sorry] & Data Science: What Does a Data Scientist Do? - Data Science London
What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? - talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure London 25/01/13
Similar to IntelliGO semantic similarity measure for Gene Ontology annotations (20)
This document discusses the COLT cohort study of lung transplantation patients. The study aims to predict outcomes of lung transplantation, specifically chronic lung allograft dysfunction, within the COLT cohort of over 500 lung transplant recipients prospectively followed since 2009 across 13 centers in France and other European countries. The study utilizes a systems approach, collecting extensive clinical, environmental, omics, and other data from donors, recipients, and outcomes to develop predictive models of outcomes up to 3 years post-transplant.
The document discusses a workshop on Systems Biology Graph Notation (SBGN) comprehensive disease maps held at the Luxembourg Centre for Systems Biomedicine (LCSB) in Luxembourg on June 14th, 2012. The workshop was supported by international speakers and aimed to discuss SBGN disease maps. The LCSB is an interdisciplinary research center within the University of Luxembourg focused on experimental and computational biology as well as technical platforms and clinical research related to systems biomedicine. Specific research areas of focus at the LCSB include network models of diseases, computational biology, and Parkinson's disease.
The document discusses efforts to harmonize data from two asthma cohorts - the BTS severe asthma registry and the UBioPred cohort. Over 80 variables were identified that could potentially be harmonized between the two datasets. Factor and cluster analysis of the BTS data identified four clusters of patients with different characteristics and outcomes over time. A cluster classifier was developed that can predict which cluster a patient belongs to based on a subset of variables, correctly classifying 97% of patients. Harmonizing the data from the two cohorts would increase statistical power to further explore heterogeneity in severe asthma.
This document summarizes a study that used a systems biology approach to analyze skeletal muscle abnormalities in COPD patients. The study integrated multi-level measurements including transcriptomics, physiological measurements, and serum protein levels from COPD patients and healthy individuals before and after exercise training. Networks were constructed from the data to identify relationships between molecular and physiological variables. The networks showed an uncoupling between tissue remodeling and bioenergetics pathways specific to COPD muscles. Additionally, IL-1 was found to promote tissue remodeling and mimic the training response in healthy individuals. Overall, the study provided new insights into the failure of COPD muscles to respond to exercise training.
The document discusses complex biological systems modeling and simulation. It describes The CoSMo Company's software platform for building multiscale models of complex systems across different levels from molecules to organisms. The platform features a modeling language (CSMML) that allows defining entity states and interaction rules. Models can be simulated over time to study emergent behaviors and applied to problems in fields like developmental biology, epidemiology, and cancer.
This document announces a workshop on emerging issues in sustainable business modeling to be held in Brussels, Belgium. Prominent international speakers from academia and business will discuss challenges and opportunities for sustainable business models. More information on the workshop can be found at the provided URL.
The document discusses WikiBioPath, a company that provides integrative data analysis of bio-molecular data through statistical analysis and interpretation of results in the context of biological knowledge. It offers expertise in transcriptomics, proteomics and genomics data from samples to help identify biomarkers, optimize drug compounds, and support personalized medicine. Key services discussed include pre-processing and quality control of experimental data, statistical analysis at the gene and gene set level, identifying population structures, and integrating results with knowledge from literature and databases.
Does Over-Masturbation Contribute to Chronic Prostatitis.pptx - walterHu5
In some cases, chronic prostatitis may be related to over-masturbation. Generally, the natural medicine Diuretic and Anti-inflammatory Pill can help men get a cure.
Integrating Ayurveda into Parkinson’s Management: A Holistic Approach - Ayurveda ForAll
Explore the benefits of combining Ayurveda with conventional Parkinson's treatments. Learn how a holistic approach can manage symptoms, enhance well-being, and balance body energies. Discover the steps to safely integrate Ayurvedic practices into your Parkinson’s care plan, including expert guidance on diet, herbal remedies, and lifestyle modifications.
TEST BANK For Community Health Nursing: A Canadian Perspective, 5th Edition by Stamler - Donc Test
TEST BANK for Community Health Nursing: A Canadian Perspective, 5th Edition by Stamler; verified chapters 1-33, complete newest version.
These lecture slides, by Dr Sidra Arshad, offer a quick overview of the physiological basis of a normal electrocardiogram.
Learning objectives:
1. Define an electrocardiogram (ECG) and electrocardiography
2. Describe how dipoles generated by the heart produce the waveforms of the ECG
3. Describe the components of a normal electrocardiogram of a typical bipolar lead (limb II)
4. Differentiate between intervals and segments
5. Enlist some common indications for obtaining an ECG
6. Describe the flow of current around the heart during the cardiac cycle
7. Discuss the placement and polarity of the leads of the electrocardiograph
8. Describe the normal electrocardiograms recorded from the limb leads and explain the physiological basis of the different records that are obtained
9. Define mean electrical vector (axis) of the heart and give the normal range
10. Define the mean QRS vector
11. Describe the axes of the leads (hexaxial reference system)
12. Comprehend the vectorial analysis of the normal ECG
13. Determine the mean electrical axis of the ventricular QRS and appreciate the mean axis deviation
14. Explain the concepts of current of injury, J point, and their significance
Study Resources:
1. Chapter 11, Guyton and Hall Textbook of Medical Physiology, 14th edition
2. Chapter 9, Human Physiology - From Cells to Systems, Lauralee Sherwood, 9th edition
3. Chapter 29, Ganong’s Review of Medical Physiology, 26th edition
4. Electrocardiogram, StatPearls - https://www.ncbi.nlm.nih.gov/books/NBK549803/
5. ECG in Medical Practice by ABM Abdullah, 4th edition
6. Chapter 3, Cardiology Explained, https://www.ncbi.nlm.nih.gov/books/NBK2214/
7. ECG Basics, http://www.nataliescasebook.com/tag/e-c-g-basics
ADHD Medication Shortage UK - trinexpharmacy.com - reignlana06
The UK is currently facing an ADHD medication shortage, which has left many patients and their families grappling with uncertainty and frustration. ADHD, or Attention Deficit Hyperactivity Disorder, is a chronic condition that requires consistent medication to manage effectively. This shortage has highlighted the critical role these medications play in the daily lives of those affected by ADHD. Contact: +1 (747) 209-3649 E-mail: sales@trinexpharmacy.com
2. IntelliGO semantic similarity measure for Gene Ontology annotations
Marie-Dominique Devignes
Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA)
Equipe Orpailleur – INRIA Nancy Grand-Est
3. Outline of the talk
1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. IntelliGO : a semantic similarity measure for GO annotations
3. IntelliGO-based clustering of genes
4. IntelliGO-based abstraction for data mining
5. Conclusion
Lyon, 14 juin 2012 2/42
4. LORIA, Team Orpailleur
Making data talk : from data to knowledge
[Diagram: Data → Information → Knowledge. Static picture : pyramid. Dynamic picture : loop, driven by KDD* (* KDD : Knowledge Discovery from Databases).]
5. The data deluge…
[Chart: growth of the EMBL database, 1992-2009, from 0 to 90 000 000 entries, split into EST, non-EST and WGS data; NGS data drive the increase. Annotations : « Big DATA complexity ! », « Format heterogeneity ! ». Quote : « No knowledge without knowledge… » (Amedeo Napoli).]
6. Knowledge Discovery from Databases (KDD)
A three-step iterative process…
[Diagram: 1. Preparation : database → selection, integration → integrated data → dataset; formatting → formatted data. 2. Data mining → rules, patterns. 3. Interpretation by the expert → knowledge.]
…interactively controlled by an expert.
7. Knowledge Discovery guided by Domain Knowledge : KDDK
Ontologies to assist the expert…
[Same three-step KDD diagram : preparation, data mining, interpretation.]
… at each step of the process.
8. Our scope today :
How ontologies can be used to improve the data mining step :
Design of a semantic similarity measure on Gene Ontology
Use it for functional clustering of genes
Improvement of gene classification
Apply it to another ontology (MedDRA) for secondary effects clustering
Data abstraction and dimension reduction
Essential for execution of data mining programs
9. Outline of the talk
1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. IntelliGO : a semantic similarity measure for GO annotations
3. IntelliGO-based clustering of genes
4. IntelliGO-based abstraction for data mining
5. Conclusion
10. Semantic similarity in biomedical ontologies
Biological objects are « annotated » by terms / keywords
facilitates data retrieval from databases
Annotation terms / keywords are organized as controlled vocabularies
Tree : e.g. Swissprot keywords
Thesaurus : e.g. MeSH, UMLS
Ontologies : e.g. Gene Ontology
A similarity measure comparing annotations can take advantage of semantic relationships between terms / keywords
« semantic similarity measure »
11. Semantic similarity of annotations
General intuition…
Semantic relationships between terms; objects as « feature vectors » :

                        GO-t1   GO-t2   GO-t3   …   Total
Gene1                     X       X       O     …
Gene2                     X       O       X     …
« Euclidean » score       1       0       0     →     1
« Semantic » score        1      0.5     0.5    →     2
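The intuition in the table can be sketched in code, assuming a toy term-term similarity table. The 0.5 values and term names are illustrative only, not the IntelliGO measure itself:

```python
# Toy illustration: a strict exact-match ("Euclidean") score counts only
# identical annotation terms, while a semantic score also credits related
# terms via a (hypothetical) pairwise term-similarity table.

TERM_SIM = {  # made-up similarities between related GO terms
    ("GO-t2", "GO-t3"): 0.5,
    ("GO-t3", "GO-t2"): 0.5,
}

def exact_score(terms_a, terms_b):
    """Count terms shared verbatim by both annotation sets."""
    return float(len(set(terms_a) & set(terms_b)))

def semantic_score(terms_a, terms_b):
    """Per GO term: 1 if shared, else the best similarity to any term
    annotating the other gene (0 if none)."""
    a, b = set(terms_a), set(terms_b)
    total = 0.0
    for t in a | b:
        if t in a and t in b:
            total += 1.0
        else:
            other = b if t in a else a
            total += max((TERM_SIM.get((t, u), 0.0) for u in other), default=0.0)
    return total

gene1 = ["GO-t1", "GO-t2"]
gene2 = ["GO-t1", "GO-t3"]
print(exact_score(gene1, gene2))     # 1.0: only GO-t1 matches exactly
print(semantic_score(gene1, gene2))  # 2.0: 1 (GO-t1) + 0.5 + 0.5 (related terms)
```

This reproduces the slide's totals: the exact-match score is 1, while the semantic score is 2.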
12. Brief outlook on GO (1)
Annotation vocabulary (GO consortium, since 2000)
More than 30,000 terms
Millions of annotations (genes and proteins in various species) recorded in public databases
Three aspects :
Biological Process
Cellular Component
Molecular Function
Structured as a Rooted Directed Acyclic Graph :
Nodes = terms
Edges = semantic relationships (is_a, part_of, related_to)
IMPORTANT : a term can have more than one parent
[Figure from Pesquita et al., 2009]
13. Example of similar GO annotations
Annotations retrieved from the Gene database at NCBI: some gene pairs share identical GO terms, others similar GO terms.
14. GO evidence codes
Traces the origin of each annotation
IEA (« Inferred from Electronic Annotation ») = not assigned by a curator
All others (16) are manually assigned
Experimental: EXP, IDA, IPI, IMP, IGI, IEP
Computational analysis: ISS, ISO, ISA, ISM, IGC, RCA
Author statement: TAS, NAS
Curatorial statement: IC, ND
« Evidence codes cannot be used as a measure of the quality of the annotation »
http://www.geneontology.org/GO.evidence.shtml
To be used as a filter for certain types of annotations
15. Semantic similarity measures : overview (1)
Pesquita et al., 2009: Semantic similarity in biomedical ontologies, PLoS Comp. Biol., July 2009
1. Gene-gene or protein-protein comparison = comparing two lists of terms (feature vectors)
Two families of approaches:
« Groupwise » (Set, Graph or Vector based):
- Martin et al. (2004): Graph / Jaccard measure
- Huang et al. (2007): Vector / Kappa statistics (DAVID)
« Pairwise » (All pairs or Best pairs):
- Lord et al. (2003): All pairs / Average / Resnik, Lin, Jiang measures
- Wang et al. (2007): Best pairs / Average / Wang measure
16. Semantic similarity measures : overview (2)
2. Term-term comparison = exploiting the semantic relationships
Node-based approaches: information content of the lowest common ancestor (LCA)
- IC(LCA): relative to a corpus
- Depth(LCA): independent of any corpus
Edge-based approaches: Shortest Path Length, SPL(t1, t2)
17. IntelliGO semantic similarity measure (1)
Benabderrahmane et al., 2010 : BMC Bioinformatics
1. Term-term comparison = combining node-based and edge-based approaches in a DAG

In a tree (MeSH, Ganesan 2003):

    SimGanesan(t1, t2) = 2 x Depth(LCA) / (Depth(t1) + Depth(t2))

In a rooted DAG (GO, IntelliGO 2010):

    SimIntelliGO(t1, t2) = 2 x MaxDepth(LCA) / (MinSPL(t1, t2) + 2 x MaxDepth(LCA))

where MaxDepth(LCA) is the maximal depth of the lowest common ancestor of t1 and t2, and MinSPL(t1, t2) is the minimal shortest-path length between t1 and t2.
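The term-term formula is easy to sketch once MaxDepth(LCA) and MinSPL have been computed from the GO graph (the graph traversal itself is omitted here; the numeric inputs are invented):

```python
# Minimal sketch of the IntelliGO term-term similarity, assuming the
# maximal LCA depth and the minimal shortest-path length are precomputed.
def sim_intelligo(max_depth_lca, min_spl):
    """2*MaxDepth(LCA) / (MinSPL(t1,t2) + 2*MaxDepth(LCA))."""
    return (2 * max_depth_lca) / (min_spl + 2 * max_depth_lca)

# A deep common ancestor and a short path give a high similarity:
print(sim_intelligo(max_depth_lca=5, min_spl=2))   # 10/12 ≈ 0.833
# A shallow ancestor and a long path give a low similarity:
print(sim_intelligo(max_depth_lca=1, min_spl=8))   # 2/10 = 0.2
```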
18. IntelliGO semantic similarity measure (2)
2. Gene (or protein) representation
Representation of genes in a vector space model:

    g = Σi αi ei     where ei is the basis vector for term ti and αi its coefficient

Definition of the coefficients:

    αi = w(g, ti) x IAF(ti)

where w(g, ti) is the weight of the evidence code qualifying the assignment of term ti to gene g (when more than one code applies, the maximal weight is taken), and IAF(ti) is the « Inverse Annotation Frequency », analogous to the information content of term ti in the annotation corpus:

    IAF(ti) = log(NTOT / Nti)

with NTOT the total number of genes in the corpus and Nti the number of genes annotated with term ti.
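The IAF weighting can be sketched on a toy annotation corpus (the gene and term names below are invented); like inverse document frequency in text retrieval, rare terms get higher weights:

```python
import math

# Sketch of the Inverse Annotation Frequency on a toy corpus mapping
# genes to their annotation term sets (invented data).
corpus = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:1"},
    "geneC": {"GO:1", "GO:3"},
    "geneD": {"GO:2"},
}

def iaf(term, corpus):
    """IAF(t) = log(N_total / N_t): rare terms get high weights."""
    n_total = len(corpus)
    n_t = sum(1 for terms in corpus.values() if term in terms)
    return math.log(n_total / n_t)

print(round(iaf("GO:1", corpus), 3))  # common term: log(4/3) ≈ 0.288
print(round(iaf("GO:3", corpus), 3))  # rare term:   log(4/1) ≈ 1.386
```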
19. IntelliGO semantic similarity measure (3)
3. Gene-gene comparison
Generalized dot product between two gene vectors:

    g . h = Σi,j αi x βj x (ei . ej)     with ei . ej = SimIntelliGO(ti, tj)

Note that ei . ej = 1 for i = j and may be non-zero for i ≠ j.

Generalized cosine similarity:

    SimIntelliGO(g, h) = (g . h) / (sqrt(g . g) x sqrt(h . h))
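Putting the pieces together, the gene-gene similarity is a cosine over the generalized dot product. The coefficient vectors and the term-term similarity matrix below are invented for illustration:

```python
import math

# Sketch of the generalized cosine similarity between two gene vectors,
# assuming a precomputed term-term similarity matrix (invented values);
# sim[i][i] = 1, off-diagonal entries are SimIntelliGO(ti, tj).
sim = [[1.0, 0.4, 0.0],
       [0.4, 1.0, 0.2],
       [0.0, 0.2, 1.0]]

g = [0.8, 0.5, 0.0]   # coefficients alpha_i = w(g, ti) * IAF(ti)
h = [0.0, 0.7, 0.9]

def gdot(u, v):
    """Generalized dot product: sum_ij u_i * sim_ij * v_j."""
    return sum(u[i] * sim[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def sim_genes(u, v):
    """Generalized cosine: gdot(u, v) normalized by the two self-products."""
    return gdot(u, v) / (math.sqrt(gdot(u, u)) * math.sqrt(gdot(v, v)))

print(round(sim_genes(g, h), 3))  # ≈ 0.485
print(round(sim_genes(g, g), 3))  # 1.0 (a gene is fully similar to itself)
```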
20. Implementation
Input: NCBI annotation file (Tax_ID, Gene_ID, GO_ID, Evid_Code, GO_Def, GO_aspect)
Parameters: (1) species, (2) GO aspect, (3) list of weights for evidence codes
Pipeline:
- Filtering produces a curated annotation file (Gene_ID, array of [GO_ID, Evid_Code])
- IAF calculation yields, per term: (GO_ID, IAF, [Gene_IDs])
- MaxDepth(LCA) calculation yields, per GO_ID pair: (LCA_Depth, LCA_ID_List)
- MinSPL calculation yields, per GO_ID pair: SPL
- For a list of genes of interest (Gene_IDs): vector representation, term-term Sim(ti, tj) calculation, then the gene-gene IntelliGO similarity matrix
21. IntelliGO on-line
http://plateforme-mbi.loria.fr/intelligo
22. Outline of the talk
1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. IntelliGO : a semantic similarity measure for GO annotations
3. IntelliGO-based clustering of genes
4. IntelliGO-based abstraction for data mining
5. Conclusion
23. Gene clustering with IntelliGO: reference datasets
Clustering means grouping together the most similar objects and putting the most dissimilar objects in different clusters; it relies on a similarity / distance measure.
Evaluation purpose: reference sets of genes
- participating in similar biological processes (KEGG pathways)
- sharing similar molecular functions (Pfam Clans)
in two different species.

Dataset  Species  Source         Number of sets  Total genes
1        Human    KEGG pathways  13              275
2        Yeast    KEGG pathways  13              169
3        Human    Pfam Clans     10              94
4        Yeast    Pfam Clans     10              118
24. 1. Hierarchical clustering for comparing IntelliGO with other semantic similarity measures
For each collection of gene sets:
- List all genes from all sets
- Pairwise similarity calculation -> distance matrix
  - Lord's measure (normalized): based on IC(LCA)
  - Al-Mubaid's measure: based on SPL
  - SimGIC measure: based on the count of common ancestors
  - IntelliGO measure: based on both MaxDepth(LCA) and MinSPL
- Hierarchical clustering and heatmap visualisation (R / Bioconductor)
25. Heatmap visualisation of hierarchical clustering
Compared heatmaps: Lord et al. (normalized), Al-Mubaid, SimGIC, IntelliGO.
IntelliGO: well-balanced, suggests fuzziness.
26. 2. Functional classification: the DAVID software (1)
Functional classification = clustering of functional annotations of genes
On-line tool: DAVID (Database for Annotation, Visualisation and Integrated Discovery)
27. DAVID functional classification
Rich feature vectors: GO terms, domains, Enzyme Classification, pathways, etc.
Kappa statistics: similarity measure based on co-occurrence and co-absence of features
Fuzzy clustering algorithm: one gene can belong to more than one cluster
Optimization of the number K of clusters by varying the kappa threshold -> % excluded genes

        GO-t1  GO-t2  GO-t3  ...  PfamD1  ...
Gene1     X      X      O    ...    X     ...
Gene2     X      O      X    ...    X     ...

Counting present and absent features + kappa statistics:
- not a semantic similarity measure
- no feature selection possible
28. 3. IntelliGO-based fuzzy C-means
IntelliGO distance matrix -> fanny implementation of fuzzy C-means in the R environment (minimization of an objective function).
Remark: this is not a feature-vector matrix with a Euclidean distance, so there is no centroid calculation and no appropriate validity index.
Membership probability matrix: one gene can belong to more than one cluster -> % excluded genes.

Membership threshold: 20 %

        C1     C2     C3    ...  Assignment
Gene1   90 %    0 %    8 %  ...  C1
Gene2   25 %   40 %    0 %  ...  C1, C2
Gene3    0 %   10 %   13 %  ...  Excluded
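The thresholding step can be sketched directly from the membership matrix above (values taken from the slide's toy example); the clustering itself is left to fanny:

```python
# Sketch of cluster assignment from a fuzzy membership matrix, using the
# 20 % membership threshold described above.
membership = {
    "Gene1": [0.90, 0.00, 0.08],
    "Gene2": [0.25, 0.40, 0.00],
    "Gene3": [0.00, 0.10, 0.13],
}

def assign(memberships, threshold=0.20):
    """A gene joins every cluster whose membership reaches the threshold;
    a gene below the threshold everywhere is excluded."""
    clusters = [f"C{i + 1}" for i, m in enumerate(memberships) if m >= threshold]
    return clusters or ["Excluded"]

for gene, m in membership.items():
    print(gene, assign(m))
# Gene1 ['C1']
# Gene2 ['C1', 'C2']
# Gene3 ['Excluded']
```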
29. 4. Comparison using overlap analysis
An input list of genes (Gene_IDs) is partitioned in two ways:
- Available reference classification: sets R1, R2, ..., Rn
- Functional classification at various K values: clusters C1, C2, ..., Ck
Overlap analysis computes pairwise F-scores:

    Precision(Ci, Rj) = % of genes present in Ci that belong to Rj
    Recall(Ci, Rj)    = % of genes from Rj that are found in Ci

    F_score(Ci, Rj) = 2 x Precision(Ci, Rj) x Recall(Ci, Rj) / (Precision(Ci, Rj) + Recall(Ci, Rj))

    Global F-score = average over Ri of max over Ci of F_score(Ci, Ri)
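The overlap analysis reduces to set arithmetic; a sketch with invented gene sets:

```python
# Sketch of the overlap analysis with invented gene sets.  Each cluster
# and each reference set is a set of gene IDs.
clusters = {"C1": {1, 2, 3, 4}, "C2": {5, 6, 7}}
references = {"R1": {1, 2, 3}, "R2": {5, 6, 8}}

def f_score(c, r):
    """Harmonic mean of precision (|C∩R|/|C|) and recall (|C∩R|/|R|)."""
    inter = len(c & r)
    if inter == 0:
        return 0.0
    precision = inter / len(c)
    recall = inter / len(r)
    return 2 * precision * recall / (precision + recall)

def global_f_score(clusters, references):
    """Average, over reference sets, of the best-matching cluster's F-score."""
    return sum(max(f_score(c, r) for c in clusters.values())
               for r in references.values()) / len(references)

print(round(global_f_score(clusters, references), 3))  # ≈ 0.762
```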
30. Comparison: optimal F-score and K number

          IntelliGO + Hierarchical     IntelliGO + fanny            DAVID
Ref sets  Fscore  K   Excl. genes     Fscore  K   Excl. genes     Fscore  K   Excl. genes
1 (13)    0.58    17  0 %             0.70    7   0 %             0.67    10  21 %
2 (13)    0.57    17  0 %             0.74    9   1 %             0.68    9   18 %
3 (10)    0.82    18  0 %             0.77    15  5 %             0.64    11  27 %
4 (10)    0.68    13  0 %             0.71    12  2 %             0.70    10  41 %

Functional classification is reliable and robust with the IntelliGO measure.
Benabderrahmane et al., BIBM workshop IDASB 2011
31. 5. CR analysis: enrichment of reference sets
CR: genes with similar annotations but not in the reference set.
Suggestion: include such a gene in the reference set.
Calculate a weight:

    Suggest(g, R) = |Enrichment(R) ∩ Annotation(g)| / |Enrichment(R)|

where Enrichment(R) contains the annotation terms specific of R (P-value < Pthreshold).
Illustration: the STON1 gene was suggested to join the human pathway hsa04130 (SNARE interactions in vesicular transport) on the basis of its BP annotation 'intracellular transport'.
To be confirmed by biologist experts of this pathway!
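The Suggest(g, R) weight is a simple set ratio; a sketch with invented term sets:

```python
# Sketch of the Suggest(g, R) weight with invented term sets: the
# fraction of the reference set's enriched terms that also annotate g.
enrichment_R = {"GO:a", "GO:b", "GO:c", "GO:d"}   # terms specific of R
annotation_g = {"GO:b", "GO:c", "GO:x"}           # terms annotating gene g

def suggest(gene_terms, enriched_terms):
    """|Enrichment(R) ∩ Annotation(g)| / |Enrichment(R)|"""
    return len(enriched_terms & gene_terms) / len(enriched_terms)

print(suggest(annotation_g, enrichment_R))  # 2/4 = 0.5
```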
32. 6. RC analysis: enrichment of gene annotation
RC: members of the reference set not functionally classified with the rest of the set.
Suggestion: enrich the gene annotation with terms t specific of C (present in Enrichment(C), given a P-value threshold).
Calculate a weight:

    Suggest(t) = |{g ∈ R ∩ C : t ∈ Enrichment(C) ∩ Annotation(g)}| / |R ∩ C|

Illustration: adding the BP term "Methionine biosynthetic process" to the ARO8 gene in yeast, because this term is significantly enriched in cluster C9, which matches pathway R6 (Lysine metabolism) to which ARO8 belongs.
Indeed, the expert confirmed that ARO8 is involved in a methionine biosynthesis pathway.
33. Outline of the talk
1. Introduction: Knowledge Discovery guided by Domain Knowledge
2. IntelliGO : a semantic similarity measure for GO annotations
3. IntelliGO-based clustering of genes
4. IntelliGO-based abstraction for data mining
5. Conclusion
34. Application of the IntelliGO measure to another ontology: MedDRA
Study with MedDRA: Medical Dictionary for Regulatory Activities
- Part of UMLS, about 20,000 terms.
- Used for side-effect description: a subset of MedDRA previously called COSTART, about 1,288 terms.
- MedDRA is organized as a rDAG with five depth levels: System Organ Class, High Level Group Term, High Level Term, Preferred Term, Lowest Level Term.
- About 1/3 of MedDRA terms have more than one parent.
SIDER: Side Effect Resource at the EMBL (http://sideeffects.embl.de)
- About 800 drugs associated with 1,288 side-effect features (MedDRA terms)
Challenge: apply symbolic data mining methods on this large dataset to discover patterns and regularities (Emmanuel Bresso's PhD thesis, Harmonic Pharma)
35. Replacing MedDRA terms with term clusters (TC)
Large matrices (1,288 attributes) are intractable with symbolic methods: the search for frequent itemsets fails because not enough features are shared!
Abstraction: reduce the number of attributes without losing information
- a classical problem in data mining
- generalisation works in a tree, but is not possible in a rDAG
- semantic clustering is a solution
IntelliGO similarity between MedDRA terms:

    SimIntelliGO(ti, tj) = 2 x DepthMax[LCA(ti, tj)] / (SPL(ti, tj) + 2 x DepthMax[LCA(ti, tj)])

    DistIntelliGO(ti, tj) = SPL(ti, tj) / (SPL(ti, tj) + 2 x DepthMax[LCA(ti, tj)]) = 1 - SimIntelliGO(ti, tj)
36. Hierarchical clustering of MedDRA terms
- Pairwise distances calculated for a subset of the 1,288 terms
- Hierarchical clustering + Kelley's optimisation of the cluster number -> 112 term clusters
- Calculation of the most representative element of each cluster
- Validation by the expert

Example: term cluster T54 « Erythema », 15 terms related to skin pathologies

Cluster term                          AvgDist to other cluster terms
Erythema                              0.31
Lichen planus                         0.32
Parapsoriasis                         0.32
Pityriasis alba                       0.32
Rash papular                          0.32
Decubitus ulcer                       0.35
Lupus miliaris disseminatus faciei    0.35
Pruritus                              0.35
Rash                                  0.35
Sunburn                               0.35
Vulvovaginal pruritus                 0.35
Dandruff                              0.37
Rash                                  0.37
Photosensitivity reaction             0.37
Psoriasis                             0.37
37. Mining Cardiovascular Agents (CA) and Anti-Infective Agents (AIA)
Two representations of the same drug x side-effect matrix:
- CAAll: 94 drugs x 1,288 attributes (terms T1, T2, T3, ...)
- CATC: 94 drugs x 112 attributes (term clusters TC1, TC2, TC3, ...)
Idem for Anti-Infective Agents: 76 drugs, 2 datasets AIAAll and AIATC
38. Frequent Closed Itemset (FCI) extraction (1)

Number of FCIs per minimal support:

Minimal support  50 %    60 %    70 %  80 %  90 %  100 %
CAAll            386     94      41    11    1     0
CATC             5,564   1,379   256   62    6     0
AIAAll           178     41      9     2     0     0
AIATC            654     154     30    3     0     0

Use of the Zart algorithm (Coron platform for symbolic data mining)
FCI: maximal subsets of drugs sharing similar side effects (All) or similar TCs
More FCIs are found with the TC representation
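Frequent closed itemset extraction can be sketched by brute force on a toy drug x term-cluster dataset (invented data; the Zart algorithm in Coron does this efficiently on the real matrices):

```python
from itertools import combinations

# Brute-force sketch of frequent closed itemset extraction on a toy
# drug x side-effect-cluster dataset (invented).  Exhaustive search is
# enough here to illustrate the idea.
drugs = {
    "drug1": {"TC1", "TC2", "TC3"},
    "drug2": {"TC1", "TC2"},
    "drug3": {"TC1", "TC2", "TC4"},
    "drug4": {"TC2", "TC4"},
}

def closed_frequent_itemsets(db, min_support):
    items = sorted(set().union(*db.values()))
    frequent = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            s = set(itemset)
            support = sum(1 for t in db.values() if s <= t)
            if support >= min_support:
                frequent[frozenset(s)] = support
    # An itemset is closed if no strict superset has the same support.
    return {s: sup for s, sup in frequent.items()
            if not any(s < t and sup == tsup for t, tsup in frequent.items())}

for itemset, support in sorted(closed_frequent_itemsets(drugs, 2).items(),
                               key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
# ['TC2'] 4
# ['TC1', 'TC2'] 3
# ['TC2', 'TC4'] 2
```

Note how {TC1} (support 3) is not closed: its superset {TC1, TC2} has the same support, so only the larger, more informative itemset is kept.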
39. Frequent Closed Itemset (FCI) extraction (2)
FCIs obtained with the TC representation are more informative.
Example: comparison of the top-5 FCIs obtained with the AIA datasets

All representation                                TC representation
Nausea_and_vomitting_symptoms, Vomiting (82 %)    54_Erythema (88 %)
Nausea (80 %)                                     64_Nausea_and_vomitting_symptoms (88 %)
Pruritus (79 %)                                   99_Neuromyopathy (82 %)
Nausea_and_vomitting_symptomes, Vomiting,         54_Erythema, 64_Nausea_and_vomitting_symptoms (79 %)
  Nausea (78 %)
Headache (76 %)                                   65_Blepharitis (78 %)

Bresso et al., KDIR 2011
40. Conclusion
Bio-ontologies and data mining:
- Semantic distances can lead to better clustering than Euclidean distances
- Exploit 'omics' datasets
- Overlap analysis for annotation curation
- Semantic abstraction makes symbolic methods practicable
  -> extract explicit knowledge from large real-world datasets
Systems biology:
- Data-driven modelling of complex systems implies KDD approaches
- Semantic similarity may help in reducing data complexity
41. Acknowledgements
LORIA, Equipe Orpailleur, Nancy: MD Devignes, Malika Smaïl-Tabbone, Sidahmed Benabderrahmane, Jean-François Kneib
Harmonic Pharma (start-up), Nancy: Michel Souchet, Emmanuel Bresso
http://plateforme-mbi.loria.fr/intelliGO
Funding:
- INCa (interdisciplinary PhD fellowship)
- Communauté Urbaine du Grand Nancy
- Contrat Plan Etat Région: MISN