Slides of my doctoral thesis dissertation talk, given on 20 March 2014 at Politecnico di Milano. Title: "Computational prediction of gene functions through machine learning methods and multiple validation procedures"
The document discusses algorithm analysis and complexity analysis. It introduces the concept of analyzing an algorithm's runtime by examining the number of key operations like comparisons and assignments, rather than just measuring execution time. This is known as complexity analysis. The document uses an example of summing the rows and values of a matrix to illustrate how complexity analysis can identify the most efficient of multiple algorithms for the same problem. It determines that two example algorithms for summing a matrix have the same asymptotic runtime of O(n^2). The document then introduces Big-O notation for describing an algorithm's asymptotic worst-case runtime.
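The matrix-summing comparison described above can be sketched in Python (this is an illustrative example, not code from the document): two algorithms that organize the work differently but both perform Θ(n²) additions, so complexity analysis rates them as asymptotically equivalent.

```python
def sum_matrix_rowwise(m):
    # Algorithm 1: accumulate a subtotal per row, then add the subtotals.
    row_totals = []
    for row in m:
        subtotal = 0
        for value in row:
            subtotal += value          # n*n additions in total
        row_totals.append(subtotal)
    return sum(row_totals)             # plus n more additions

def sum_matrix_direct(m):
    # Algorithm 2: one running total over every element.
    total = 0
    for row in m:
        for value in row:
            total += value             # n*n additions in total
    return total

m = [[1, 2], [3, 4]]
print(sum_matrix_rowwise(m), sum_matrix_direct(m))  # 10 10
```

Both functions are O(n²) for an n×n matrix: the extra n additions in the first version disappear in the asymptotic bound.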
The document discusses minimum spanning trees and provides examples of Prim's and Kruskal's algorithms. It includes:
- A definition of minimum spanning tree as a subgraph that spans all nodes with minimum total edge weight.
- Characteristics of Prim's and Kruskal's algorithms such as working with undirected, weighted/unweighted graphs and producing optimal solutions greedily.
- A walk-through example of Prim's algorithm on a graph and calculating the minimum spanning tree cost.
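The walk-through above can be illustrated with a minimal Python sketch of Prim's algorithm computing a minimum spanning tree cost. The graph below is a hypothetical toy example, not the one from the document.

```python
import heapq

def prim_mst_cost(adj, start=0):
    """Total weight of a minimum spanning tree of a connected,
    undirected, weighted graph. adj[u] = [(weight, v), ...]."""
    visited = {start}
    heap = list(adj[start])            # candidate edges leaving the tree
    heapq.heapify(heap)
    cost = 0
    while heap and len(visited) < len(adj):
        w, v = heapq.heappop(heap)     # cheapest edge to a new node
        if v in visited:
            continue
        visited.add(v)
        cost += w
        for edge in adj[v]:
            if edge[1] not in visited:
                heapq.heappush(heap, edge)
    return cost

# Hypothetical 4-node graph; each undirected edge is listed from both ends.
graph = {
    0: [(2, 1), (3, 2)],
    1: [(2, 0), (1, 2), (4, 3)],
    2: [(3, 0), (1, 1), (5, 3)],
    3: [(4, 1), (5, 2)],
}
print(prim_mst_cost(graph))  # picks edges of weight 2, 1, 4 -> 7
```

The greedy choice at each step (cheapest edge crossing the tree boundary) is exactly the property the slides attribute to Prim's algorithm.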
The document discusses user-defined methods in Java programming. It covers key concepts like value-returning and void methods, parameters, scope of identifiers, and method overloading. Examples are provided to demonstrate how to define and call methods, pass parameters by value and by reference, and overload methods by having different parameter lists.
This document provides an introduction to the statistical programming language R. It describes a workshop on R given by Kui Shen from the Bioinformatics and Computational Biosciences Branch (BCBB) of the National Institute of Allergy and Infectious Diseases (NIAID). The workshop covers basic R topics like arithmetic, graphics, statistical tests, and importing/exporting data. It also demonstrates linear regression analysis using the cats dataset to model heart weight based on body weight.
Feature Selection for Document Ranking, by Andrea Gigli
Feature selection for Machine Learning applied to Document Ranking (aka L2R, LtR, LETOR). Contains empirical results on publicly available Yahoo! and Bing Web search engine data.
Improving Spam Mail Filtering Using Classification Algorithms With Partition ..., by IRJET Journal
This document discusses improving spam mail filtering using various classification algorithms and a partition membership filter for preprocessing. It analyzes the performance of classification algorithms like JRip, Filtered Classifier, K-star, SGD, Multinomial, and Random Tree on a spam email dataset using metrics like accuracy, error rate, recall, and precision. The Random Tree algorithm achieved the best performance with 97.08% accuracy, 0.938 kappa statistics, and 0.56 seconds build time, outperforming the other algorithms. Preprocessing the data with a partition membership filter before classification further improved performance.
As optimization (or prescriptive analytics) has grown as a tool for business decision-making, a key factor in its success has been the adoption of model-based optimization. Using this approach, an analyst’s major work is to describe a problem of interest by means of an algebraic model, while the computation of a solution is left to general-purpose, off-the-shelf software. Powerful modeling systems manage the difficulties of translating between the human modeler’s ideas and the computer software’s needs. This tutorial introduces model-based optimization and offers a guide to its effective use.
The document discusses linear data structures like arrays, stacks, and queues. It defines them as structures where data is stored and accessed sequentially. Arrays allow direct access by index but have a fixed size. Stacks follow LIFO (last in, first out) using push and pop operations. Queues follow FIFO (first in, first out) using enqueue and dequeue operations. Examples of each are given like reversing a string using a stack. Pseudocode and C++ implementations of arrays, stacks, and queues are provided to demonstrate how to create and use the different data structures.
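The string-reversal example mentioned above can be sketched in Python, using a list as the stack (push = `append`, pop = `pop`), so the LIFO order does the reversing:

```python
def reverse_string(s):
    stack = []
    for ch in s:
        stack.append(ch)         # push each character
    out = []
    while stack:
        out.append(stack.pop())  # pop in last-in, first-out order
    return "".join(out)

print(reverse_string("stacks"))  # skcats
```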
CCC-Bicluster Analysis for Time Series Gene Expression Data, by IRJET Journal
The document presents a CCC-Biclustering (Contiguous Column Coherence) algorithm for identifying biclusters in time series gene expression data. The algorithm finds maximal biclusters with adjacent/contiguous columns in linear time using Ukkonen's suffix tree construction algorithm and discretized gene expression matrices. The algorithm was applied to a Saccharomyces cerevisiae gene expression time series in response to heat stress. It identifies coherent expression patterns shared among genes over contiguous time points, potentially revealing relevant regulatory modules.
Patterns that occur only in objects belonging to a single class are called Jumping Emerging Patterns (JEPs). JEP-based classifiers are considered among the more successful classification systems: thanks to their comprehensibility, simplicity, and strong differentiating ability, JEPs have gained significant recognition. However, discovering JEPs in a large pattern space is normally a time-consuming and challenging task because of their exponential behaviour. In this work a novel method based on a genetic algorithm (GA) is proposed to discover JEPs in large pattern spaces. Since the complexity of a GA is lower than that of other algorithms, the power of JEPs and GAs is combined to find high-quality JEPs from datasets and improve the performance of the classification system. Unlike other methods in the literature that compute the complete set of JEPs, the proposed method explores a set of high-quality JEPs from the pattern search space; large numbers of duplicate and redundant JEPs are filtered out during the discovery process. Experimental results show that the proposed Genetic-JEPs are effective and accurate for classifying a variety of data sets and in general achieve higher accuracy than other standard classifiers.
This document discusses analyzing image data that has been translated into numerical features for machine learning. It describes categorizing the data, translating the images into numeric formulas, and using a convolutional neural network (CNN) for classification. The CNN achieved a validation score of 0.31, lower than other algorithms that scored 0.38. Feature analysis is also discussed as calculating ratios for each dependent class value and encoding them as pixel intensities from 0-255.
Automatic Feature Subset Selection using Genetic Algorithm for Clustering, by idescitation
Feature subset selection is the process of selecting a subset of minimal, relevant features and is a preprocessing technique for a wide variety of applications. High-dimensional data clustering is a challenging task in data mining. A reduced set of features helps to make the patterns easier to understand, and reduced feature sets are more significant if they are application-specific. Almost all existing feature subset selection algorithms are neither automatic nor application-specific. This paper attempts to find the feature subset that yields optimal clusters while clustering. The proposed Automatic Feature Subset Selection using Genetic Algorithm (AFSGA) identifies the required features automatically and reduces the computational cost of determining good clusters. The performance of AFSGA is tested on public and synthetic datasets of varying dimensionality. Experimental results show the improved efficacy of the algorithm in terms of cluster quality and computational cost.
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial, by Alexandros Karatzoglou
The slides from the Learning to Rank for Recommender Systems tutorial given at ACM RecSys 2013 in Hong Kong by Alexandros Karatzoglou, Linas Baltrunas and Yue Shi.
A database application differs from regular applications in that some of its inputs may be database queries. The program will execute the queries on a database and may use any result values in its subsequent program logic. This means that a user-supplied query may determine the values that the application will use in subsequent branching conditions. At the same time, a new database application is often required to work well on a body of existing data stored in some large database. For systematic testing of database applications, recent techniques replace the existing database with carefully crafted mock databases. Mock databases return values that will trigger as many execution paths in the application as possible and thereby maximize overall code coverage of the database application.
In this paper we offer an alternative approach to database application testing. Our goal is to support software engineers in focusing testing on the existing body of data the application is required to work well on. For that, we propose to side-step mock database generation and instead generate queries for the existing database. Our key insight is that we can use the information collected during previous program executions to systematically generate new queries that will maximize the coverage of the application under test, while guaranteeing that the generated test cases focus on the existing data.
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP..., by cscpconf
In search-based test data generation, the problem of test data generation is reduced to one of function minimization or maximization. Traditionally, for branch testing, the problem of test data generation has been formulated as a minimization problem. In this paper we define an alternate maximization formulation and experimentally compare it with the minimization formulation. We use a genetic algorithm as the search technique and, in addition to the usual genetic algorithm operators, we also employ the path prefix strategy as a branch ordering strategy, together with memory and elitism. Results indicate that there is no significant difference in the performance or the coverage obtained through the two approaches, and either could be used in test data generation when coupled with the path prefix strategy, memory, and elitism.
The document discusses SQL design patterns and relational division pattern in particular. It describes relational division as finding elements that belong to all sets in a collection of sets. It provides various implementations of relational division in SQL, including using minus, not exists, and grouping with a having clause to check for equality of counts.
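The relational-division idea described above (find the elements related to every member of a set) can also be sketched outside SQL. Here is a minimal Python version using set operations, on hypothetical data:

```python
def divide(pairs, divisor):
    """Relational division: return the keys that are related to
    every element of `divisor`.
    pairs: set of (key, value) tuples; divisor: set of values."""
    keys = {k for k, _ in pairs}
    return {k for k in keys
            if divisor <= {v for kk, v in pairs if kk == k}}

# Hypothetical data: which students are enrolled in which courses.
enrolled = {("ann", "db"), ("ann", "os"), ("bob", "db")}
required = {"db", "os"}
print(divide(enrolled, required))  # {'ann'}: only ann took every required course
```

The `divisor <= ...` subset test plays the same role as the SQL `HAVING COUNT(...) = ...` equality-of-counts check mentioned in the summary.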
Incorporating Diversity in a Learning to Rank Recommender System, by Jacek Wasilewski
Diversity is a desirable property of recommendations. Diversity can be increased with the use of re-rankers. This work presents an alternative approach where diversity is optimised together with accuracy during a matrix factorisation learning.
A Novel Approach for Developing Paraphrase Detection System using Machine Lea..., by Rudradityo Saha
Plagiarism detection is difficult since changes can be made to a sentence at several levels, namely the lexical, semantic, and syntactic levels, to construct a paraphrased or plagiarized sentence posing as original. This project presents a novel supervised machine learning classification paraphrase detection system, developed through experiments on the Microsoft Research Paraphrase (MSRP) Corpus and assessed on the same corpus. The proposed paraphrase detection system achieves performance comparable with existing paraphrase detection systems. The major contributions of this project are the use of a unique combination of lexical, semantic, and syntactic features, the use of Shapley Additive Explanations (SHAP) feature-importance plots in XGBoost, and the application of a soft voting classifier comprising the top 3 performing standalone machine learning classifiers on the training set of the MSRP Corpus. Another major contribution is the finding that applying data augmentation techniques degrades the performance of the machine learning classifiers.
Machine Learning with Python discusses machine learning concepts and the Python tools used for machine learning. It introduces machine learning terminology and different types of learning. It describes the Pandas, Matplotlib and scikit-learn frameworks for data analysis and machine learning in Python. Examples show simple programs for supervised learning using linear regression and unsupervised learning using K-means clustering.
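The supervised-learning example mentioned above can be given a minimal, dependency-free flavour: a closed-form ordinary-least-squares fit of a single-feature line (a sketch standing in for the scikit-learn `LinearRegression` call; the data points are illustrative).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with one feature."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance(x, y) / variance(x); intercept from the means.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Points lying exactly on y = 2x + 1, so the fit recovers a=2, b=1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(round(a, 6), round(b, 6))  # 2.0 1.0
```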
The document summarizes Kenneth Emeka Odoh's presentation on recommender systems and his solution to the WSDM Challenge competition. It includes discussions of the top solutions which used techniques like light gradient boosted machines, neural networks, and ensemble modeling. It also describes Kenneth's solution using bidirectional LSTMs with techniques like batch normalization and dropout to avoid overfitting on the time series song listening data. Overall, the presentation covered many state-of-the-art recommender system techniques for sequential and time series prediction tasks.
Graph Based Methods For The Representation And Analysis Of ..., by legal2
The document proposes using task precedence metagraphs (TPMGs) to represent and analyze business workflows, where tasks convert information elements and precedence relationships between tasks are represented. An algorithm called InfAnalysis is presented for analyzing information flow through a workflow represented as a TPMG. Structural verification of TPMGs is also discussed to identify errors like deadlocks that would make a workflow invalid.
Tracking the tracker: Time Series Analysis in Python from First Principles, by kenluck2001
The talk will focus on
1. Forecasting
2. Anomaly Detection
The talk takes a dive into common methods of time series analysis, introduces a new algorithm for online ARIMA, and covers a number of variations of Kalman filters with bare-bones implementations in Python.
It includes a Python implementation of an anomaly detection system on a data stream, with a deep dive into the mathematics explained in clear layman's terms. We will work through an easy group exercise to internalize the concepts.
The talk also discusses how to deploy a machine learning module in production, along with lessons learnt in practice and conclusions.
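In their simplest one-dimensional form, the Kalman-filter variations mentioned in the talk could look like the bare-bones sketch below (the noise parameters `q` and `r` and the measurement sequence are illustrative assumptions, not values from the talk):

```python
def kalman_1d(measurements, q=1e-4, r=0.1):
    """Minimal 1-D Kalman filter with a constant-state model.
    q: process noise variance, r: measurement noise variance."""
    x, p = 0.0, 1.0                # state estimate and its variance
    estimates = []
    for z in measurements:
        p += q                     # predict: variance grows by process noise
        k = p / (p + r)            # Kalman gain: trust in the new measurement
        x += k * (z - x)           # update estimate toward measurement z
        p *= (1 - k)               # update (shrink) the variance
        estimates.append(x)
    return estimates

# Noisy readings of a true value of 1.0; the estimate settles near it.
est = kalman_1d([1.1, 0.9, 1.05, 0.95, 1.0])
print(abs(est[-1] - 1.0) < 0.1)  # True
```

For anomaly detection, a measurement whose residual `z - x` is large relative to the predicted variance would be flagged; the talk's variations build on this same predict/update loop.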
Integration of Bioinformatics Web Services through the Search Computing Techn..., by Davide Chicco
Here are the key steps in Latent Semantic Indexing using SVD to measure semantic similarity between genes:
1. Build an annotation matrix with genes as rows and annotation terms as columns, with 1's indicating which genes are annotated to which terms.
2. Perform SVD on the annotation matrix and truncate it to rank k, decomposing it into three matrices: U_k, Σ_k, V_k^T.
3. Uk contains the vectors representing each gene in the reduced k-dimensional semantic space.
4. The similarity between two genes can be measured as the cosine similarity between their corresponding vectors in Uk. Genes with more similar vectors are considered more semantically similar based on their shared annotations.
So in summary, LSI uses SVD to project the genes into a reduced k-dimensional semantic space, where semantic similarity between genes is measured as the cosine similarity of their vectors.
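The four steps above can be sketched with NumPy. The annotation matrix, the choice of k, and the gene labels here are illustrative toy assumptions, not data from the document:

```python
import numpy as np

# Hypothetical annotation matrix: 4 genes x 5 annotation terms,
# with 1 marking that a gene is annotated with a term (step 1).
A = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

# Step 2: SVD, truncated to k singular values.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: gene vectors in the k-dimensional semantic space
# (rows of U_k scaled by the singular values, as in standard LSI).
Uk = U[:, :k] * s[:k]

# Step 4: cosine similarity between gene vectors.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Genes 0 and 1 share annotations; genes 0 and 2 share none.
print(cosine(Uk[0], Uk[1]) > cosine(Uk[0], Uk[2]))  # True
```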
Brian Durkin is an experienced project manager, procurement specialist, and business analyst with expertise in project management, procurement consulting, business analysis, business process design, and information systems development and implementation. He has led and managed both large and small projects across multiple industries. Brian is committed to his profession, as demonstrated by his various professional designations, including Project Management Professional (PMP), Information Systems Professional (I.S.P.), Canadian Institute of Management (C.I.M.), and Information Technology Certified Professional (ITCP).
Kathy Georgiadis has over 20 years of experience in various administrative and accounting roles. She has strong skills in accounts payable, accounts receivable, data entry, bookkeeping, and office administration. She is proficient in using various accounting software programs including MYOB, MRI, EDSAS, and Excel. Kathy aims to provide efficient customer service and administrative support to employers.
Danny Poirier is a journeyman electrician with over 30 years of experience seeking a position in the oil refinement industry. He has specialized training in areas such as arc flash safety, lock out/tag out procedures, and fall protection. His skills include programming PLCs, commissioning motors and electrical equipment, and troubleshooting electrical issues. He is bilingual in English and French and has strong communication and problem-solving abilities. His most recent role was as a foreman for CNRL where he oversaw electrical installations, connections, lighting work, and equipment localization.
The document lists several pairs of opposite words (antonyms) in Indonesian, such as dewasa-kanak-kanak (adult-child), gali-timbus (dig-bury), ketawa-menangis (laugh-cry), kenyang-lapar (full-hungry), larangan-suruhan (prohibition-command), pinjam-pulangkan (borrow-return), bandar-kampung (town-village), basah-kering (wet-dry), dahulu-sekarang (past-present), gelap-terang (dark-light).
This document provides a summary of Manish Agrahari's career and qualifications. It outlines his 6 years of experience in IBM BPM development and Java/.Net, including designing and developing IBM BPM applications using features like BPDs, coaches, subprocesses, and integrating databases. It also lists his skills in technologies like IBM BPM, Eclipse, Java, SQL, and scripting languages. Recent projects are described, including developing insurance underwriting and claims management processes using IBM BPM.
An approach is developed to detect and correct errors in 16S RNA fragments from metagenomic sequencing data. Two algorithms are proposed - the first finds and corrects errors by studying correspondence between similar sequences, while the second fine-tunes the first algorithm's accuracy for estimating sequence errors, SNPs and detecting species. The approaches are tested on two 16S RNA fragment datasets, and classification results after error correction are compared to evaluate performance. Future work includes improving error detection and correction and validating the approach on other datasets.
CCC-Bicluster Analysis for Time Series Gene Expression DataIRJET Journal
The document presents a CCC-Biclustering (Contiguous Column Coherence) algorithm for identifying biclusters in time series gene expression data. The algorithm finds maximal biclusters with adjacent/contiguous columns in linear time using Ukkonen's suffix tree construction algorithm and discretized gene expression matrices. The algorithm was applied to a Saccharomyces cerevisiae gene expression time series in response to heat stress. It identifies coherent expression patterns shared among genes over contiguous time points, potentially revealing relevant regulatory modules.
Patterns that only occur in objects belonging to a
single class are called Jumping Emerging Patterns (JEP). JEP
based Classifiers are considered one of the successful classification
systems. Due to its comprehensibility, simplicity and strong
differentiating abilities JEPs have captured significant recognition.
However, discovery of JEPs in a large pattern space is normally a
time consuming and challenging task because of their exponential
behaviour. In this work a novel method based on genetic
algorithm (GA) is proposed to discover JEPs in large pattern
space. Since the complexity of GA is lower than other algorithms,
so we have combined the power of JEPs and GA to find high
quality JEPs from datasets to improve performance of
classification system. Our proposed method explores a set of high
quality JEPs from pattern search space unlike other methods in
literature that compute complete set of JEPs, Large numbers of
duplicate and redundant JEPs are filtered out during their
discovery process. Experimental results show that our proposed
Genetic-JEPs are effective and accurate for classification of a
variety of data sets and in general achieve higher accuracy than
other standard classifiers.
This document discusses analyzing image data that has been translated into numerical features for machine learning. It describes categorizing the data, translating the images into numeric formulas, and using a convolutional neural network (CNN) for classification. The CNN achieved a validation score of 0.31, lower than other algorithms that scored 0.38. Feature analysis is also discussed as calculating ratios for each dependent class value and encoding them as pixel intensities from 0-255.
Automatic Feature Subset Selection using Genetic Algorithm for Clusteringidescitation
Feature subset selection is a process of selecting a
subset of minimal, relevant features and is a pre processing
technique for a wide variety of applications. High dimensional
data clustering is a challenging task in data mining. Reduced
set of features helps to make the patterns easier to understand.
Reduced set of features are more significant if they are
application specific. Almost all existing feature subset
selection algorithms are not automatic and are not application
specific. This paper made an attempt to find the feature subset
for optimal clusters while clustering. The proposed Automatic
Feature Subset Selection using Genetic Algorithm (AFSGA)
identifies the required features automatically and reduces
the computational cost in determining good clusters. The
performance of AFSGA is tested using public and synthetic
datasets with varying dimensionality. Experimental results
have shown the improved efficacy of the algorithm with optimal
clusters and computational cost.
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorialAlexandros Karatzoglou
The slides from the Learning to Rank for Recommender Systems tutorial given at ACM RecSys 2013 in Hong Kong by Alexandros Karatzoglou, Linas Baltrunas and Yue Shi.
A database application differs form regular applications in that some of its inputs may be database queries. The program will execute the queries on a database and may use any result values in its subsequent program logic. This means that a user-supplied query may determine the values that the application will use in subsequent branching conditions. At the same time, a new database application is often required to work well on a body of existing data stored in some large database. For systematic testing of database applications, recent techniques replace the existing database with carefully crafted mock databases. Mock databases return values that will trigger as many execution paths in the application as possible and thereby maximize overall code coverage of the database application.
In this paper we offer an alternative approach to database application testing. Our goal is to support software engineers in focusing testing on the existing body of data the application is required to work well on. For that, we propose to side-step mock database generation and instead generate queries for the existing database. Our key insight is that we can use the information collected during previous program executions to systematically generate new queries that will maximize the coverage of the application under test, while guaranteeing that the generated test cases focus on the existing data.
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
In search based test data generation, the problem of test data generation is reduced to that of
function minimization or maximization.Traditionally, for branch testing, the problem of test data
generation has been formulated as a minimization problem. In this paper we define an alternate
maximization formulation and experimentally compare it with the minimization formulation. We
use a genetic algorithm as the search technique and in addition to the usual genetic algorithm
operators we also employ the path prefix strategy as a branch ordering strategy and memory and elitism. Results indicate that there is no significant difference in the performance or the coverage obtained through the two approaches and either could be used in test data generation when coupled with the path prefix strategy, memory and elitism.
The document discusses SQL design patterns and relational division pattern in particular. It describes relational division as finding elements that belong to all sets in a collection of sets. It provides various implementations of relational division in SQL, including using minus, not exists, and grouping with a having clause to check for equality of counts.
Incorporating Diversity in a Learning to Rank Recommender SystemJacek Wasilewski
Diversity is a desirable property of recommendations. Diversity can be increased with the use of re-rankers. This work presents an alternative approach where diversity is optimised together with accuracy during a matrix factorisation learning.
A Novel Approach for Developing Paraphrase Detection System using Machine Lea...Rudradityo Saha
Plagiarism detection is difficult since there can be changes made to a sentence at several levels, namely, lexical, semantic, and syntactic level, to construct a paraphrased or plagiarized sentence posing as original. This project presents a novel Supervised Machine Learning Classification Paraphrase Detection System developed by conducting experiments using Microsoft Research Paraphrase (MSRP) Corpus and assessed on the same. The proposed paraphrase detection system has achieved comparable performance with existing paraphrase detection systems. The major contributions of this project are the utilization of a unique combination of lexical, semantic, and syntactic features, utilization of Shapley Additive Explanations (SHAP) Feature Importance Plots in XGBoost, and application of a soft voting classifier comprising of the top 3 performing standalone machine learning classifiers on the training dataset of MSRP Corpus. Another major contribution of the project is the finding that applying data augmentation techniques degrades the performance of machine learning classifiers.
Machine Learning with Python discusses machine learning concepts and the Python tools used for machine learning. It introduces machine learning terminology and different types of learning. It describes the Pandas, Matplotlib and scikit-learn frameworks for data analysis and machine learning in Python. Examples show simple programs for supervised learning using linear regression and unsupervised learning using K-means clustering.
The document summarizes Kenneth Emeka Odoh's presentation on recommender systems and his solution to the WSDM Challenge competition. It includes discussions of the top solutions which used techniques like light gradient boosted machines, neural networks, and ensemble modeling. It also describes Kenneth's solution using bidirectional LSTMs with techniques like batch normalization and dropout to avoid overfitting on the time series song listening data. Overall, the presentation covered many state-of-the-art recommender system techniques for sequential and time series prediction tasks.
Graph Based Methods For The Representation And Analysis Of.legal2
The document proposes using task precedence metagraphs (TPMGs) to represent and analyze business workflows, where tasks convert information elements and precedence relationships between tasks are represented. An algorithm called InfAnalysis is presented for analyzing information flow through a workflow represented as a TPMG. Structural verification of TPMGs is also discussed to identify errors like deadlocks that would make a workflow invalid.
Tracking the tracker: Time Series Analysis in Python from First Principleskenluck2001
The talk will focus on
1. Forecasting
2. Anomaly Detection
This will take a dive into common methods of doing time series analysis, introduce a new algorithm for online ARIMA, and a number of variations of Kalman filters with barebone implementations in Python.
A Python implementation of an anomaly detection system on a data stream, with a deep dive into the mathematics explained in clear layman's terms. We will work through an easy group exercise to internalize the concepts.
The talk will also discuss how to deploy a machine learning module in production, covering lessons learnt in practice and conclusions.
Integration of Bioinformatics Web Services through the Search Computing Techn... (Davide Chicco)
Here are the key steps in Latent Semantic Indexing using SVD to measure semantic similarity between genes:
1. Build an annotation matrix with genes as rows and annotation terms as columns, with 1's indicating which genes are annotated to which terms.
2. Perform SVD on the annotation matrix to decompose it into three matrices: Uk, Σk, VTk.
3. Uk contains the vectors representing each gene in the reduced k-dimensional semantic space.
4. The similarity between two genes can be measured as the cosine similarity between their corresponding vectors in Uk. Genes with more similar vectors are considered more semantically similar based on their shared annotations.
So in summary, LSI uses SVD to project genes into a reduced k-dimensional semantic space, where the cosine similarity between gene vectors measures their semantic similarity based on shared annotations.
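The four steps above can be sketched with a toy annotation matrix (gene and term names are invented for illustration; numpy's SVD stands in for whatever LSI implementation is actually used):

```python
import numpy as np

# Toy gene-by-term annotation matrix (rows: genes, columns: terms);
# genes g1 and g2 share two annotation terms, g3 shares none with g1.
A = np.array([
    [1, 1, 0, 0],   # gene g1
    [1, 1, 1, 0],   # gene g2
    [0, 0, 1, 1],   # gene g3
], dtype=float)

# Step 2: SVD of the annotation matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep the first k singular dimensions; the rows of Uk are the
# gene vectors in the reduced k-dimensional semantic space.
k = 2
Uk = U[:, :k]

def cosine(u, v):
    """Cosine similarity between two gene vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Step 4: genes sharing more annotations get more similar vectors.
sim_g1_g2 = cosine(Uk[0], Uk[1])
sim_g1_g3 = cosine(Uk[0], Uk[2])
print(sim_g1_g2 > sim_g1_g3)  # prints True for this toy data
```

Note that the row-wise cosine values are invariant to the sign ambiguity of the singular vectors, since flipping the sign of a column of U flips both factors of each product.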
Brian Durkin is an experienced project manager, procurement specialist, and business analyst with expertise in project management, procurement consulting, business analysis, business process design, and information systems development and implementation. He has led and managed both large and small projects across multiple industries. Brian is committed to his profession, as demonstrated by his various professional designations, including Project Management Professional (PMP), Information Systems Professional (I.S.P.), Canadian Institute of Management (C.I.M.), and Information Technology Certified Professional (ITCP).
Kathy Georgiadis has over 20 years of experience in various administrative and accounting roles. She has strong skills in accounts payable, accounts receivable, data entry, bookkeeping, and office administration. She is proficient in using various accounting software programs including MYOB, MRI, EDSAS, and Excel. Kathy aims to provide efficient customer service and administrative support to employers.
Danny Poirier is a journeyman electrician with over 30 years of experience seeking a position in the oil refinement industry. He has specialized training in areas such as arc flash safety, lock out/tag out procedures, and fall protection. His skills include programming PLCs, commissioning motors and electrical equipment, and troubleshooting electrical issues. He is bilingual in English and French and has strong communication and problem-solving abilities. His most recent role was as a foreman for CNRL where he oversaw electrical installations, connections, lighting work, and equipment localization.
The document lists several pairs of opposite words (antonyms) in Indonesian, such as dewasa-kanak-kanak (adult-child), gali-timbus (dig-fill in), ketawa-menangis (laugh-cry), kenyang-lapar (full-hungry), larangan-suruhan (prohibition-command), pinjam-pulangkan (borrow-return), bandar-kampung (city-village), basah-kering (wet-dry), dahulu-sekarang (past-present), gelap-terang (dark-bright).
This document provides a summary of Manish Agrahari's career and qualifications. It outlines his 6 years of experience in IBM BPM development and Java/.Net, including designing and developing IBM BPM applications using features like BPDs, coaches, subprocesses, and integrating databases. It also lists his skills in technologies like IBM BPM, Eclipse, Java, SQL, and scripting languages. Recent projects are described, including developing insurance underwriting and claims management processes using IBM BPM.
An approach is developed to detect and correct errors in 16S RNA fragments from metagenomic sequencing data. Two algorithms are proposed - the first finds and corrects errors by studying correspondence between similar sequences, while the second fine-tunes the first algorithm's accuracy for estimating sequence errors, SNPs and detecting species. The approaches are tested on two 16S RNA fragment datasets, and classification results after error correction are compared to evaluate performance. Future work includes improving error detection and correction and validating the approach on other datasets.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G... (Johann Petrak)
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
BPSO&1-NN algorithm-based variable selection for power system stability ident... (IJAEMSJORNAL)
Due to the very high nonlinearity of the power system, traditional analytical methods take a lot of time to solve, causing delays in decision-making. Therefore, quickly detecting power system instability, which helps the control system make timely decisions, becomes the key factor in ensuring stable operation of the power system. Power system stability identification encounters a large data set size problem, so representative variables must be selected as input variables for the identifier. This paper proposes to apply a wrapper method to select variables, in which the Binary Particle Swarm Optimization (BPSO) algorithm is combined with a K-NN (K=1) identifier to search for a good set of variables. It is named BPSO&1-NN. Test results on the IEEE 39-bus diagram show that the proposed method achieves the goal of reducing variables with high accuracy.
Experiments on Design Pattern Discovery (Tim Menzies)
The document describes experiments conducted to discover design patterns from source code. It outlines the approach taken by DP-Miner tool, presents experiment data on four Java systems, and evaluates results by calculating precision and recall values. Benchmarks are lacking for accurately evaluating design pattern discovery techniques.
This document describes research on efficient data structures and algorithms for solving range aggregate problems. It begins with introductions to computational geometry and classic problems in the field like finding the closest pair of points. It then discusses concepts like output sensitivity and different computation models. Range searching data structures like range trees are described for solving problems like orthogonal range queries. The document outlines solving problems related to planar range maxima and planar range convex hull queries. It proposes preprocessing point data to speed up queries for problems like reporting the skyline points within a 2-sided range.
The document compares constructive meta-learning and stacking methods for composing inductive applications. It presents CAMLET, a tool for constructive meta-learning that analyzes learning algorithms, organizes them in a repository, and searches for compositions. A case study shows CAMLET achieving accuracies on par with stacking on common datasets and good parallel efficiency for composition.
Optimal rule set generation using PSO algorithm (csandit)
Classification and prediction is an important research area of data mining. Construction of a classifier model for any decision system is an important job for many data mining applications. The objective of developing such a classifier is to classify unlabeled datasets into classes. Here we have applied a discrete Particle Swarm Optimization (PSO) algorithm for selecting optimal classification rule sets from the huge number of rules that possibly exist in a dataset. In the proposed DPSO algorithm, a decision matrix approach was used for the generation of the initial possible classification rules from a dataset. Then the proposed algorithm discovers important or significant rules from all possible classification rules without sacrificing predictive accuracy. The proposed algorithm deals with discrete-valued data, and its initial population of candidate solutions contains particles of different sizes. The experiment has been done on the task of optimal rule selection in data sets collected from the UCI repository. Experimental results show that the proposed algorithm can automatically evolve, on average, a small number of conditions per rule and a few rules per rule set, and achieved better classification performance of predictive accuracy for a few classes.
Metabolomic Data Analysis Workshop and Tutorials (2014) (Dmitry Grapov)
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
Information Integration and Knowledge Acquisition from Semantically Heterogen... (Jie Bao)
The document discusses information integration and knowledge acquisition from semantically heterogeneous biological data sources. It describes how data sources need to be made self-describing through meta data schemas and ontologies. It also discusses how mappings are specified between data source schemas/ontologies and a user view schema/ontology to enable integration. Learning classifiers from distributed and semantically heterogeneous data sources is discussed, including gathering sufficient statistics from the distributed sources and generating hypotheses.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Comparative analysis of dynamic programming algorithms to find similarity in ... (eSAT Journals)
Abstract: There exist many computational methods for finding similarity in gene sequences; finding a suitable method that gives optimal similarity is a difficult task. The objective of this project is to find an appropriate method to compute similarity in gene/protein sequences, both within families and across families. Many algorithms, such as Levenshtein edit distance, Longest Common Subsequence, and Smith-Waterman, have used a dynamic programming approach to find similarities between two sequences. But none of the methods mentioned above have used real benchmark data sets; they have only applied dynamic programming algorithms to synthetic data. We propose a new method to compute similarity. The performance of the proposed algorithm is evaluated using a number of data sets from various families, and the similarity value is calculated both within a family and across families. A comparative analysis and the time complexity of the proposed method reveal that the Smith-Waterman approach is appropriate when gene/protein sequences belong to the same family, and Longest Common Subsequence is best suited when sequences belong to two different families. Keywords: Bioinformatics, Gene, Gene Sequencing, Edit distance, String Similarity.
This document discusses using SVD (singular value decomposition) as a filtering technique prior to clustering temporal usage data. It describes applying SVD to filter out noise and high dimensionality before performing k-means clustering. SVD is used to decompose the data matrix and filter out components associated with the smallest singular values. Then k-means clustering is applied to the correlation between observations and the remaining right eigenvectors. This approach provides a robust way to cluster high-dimensional temporal data and identify distinct customer usage patterns over time.
This document discusses methods for handling missing data in big data technologies. It describes common types of missing data and existing imputation methods like mean substitution and model-based approaches. Probabilistic production dependencies are proposed to infer missing data values based on attribute relationships. An algorithm is presented that mines probabilistic production rules from data and then applies those rules to recover missing values. Sequential dependencies are also discussed for imputing missing values in ordered data.
T-BioInfo is a platform for processing, analyzing, and integrating multi-omics data. It is used by multiple research groups to extract meaningful insights from large multi-omics datasets. The platform is expanding its educational capabilities to enable more people to extract meaningful, data-driven insights from omics datasets with biomedical applications. The document provides links to learn more about the platform's research and educational features.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana... (IJRES Journal)
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
Unstructured data processing webinar 06272016 (George Roth)
This document provides an overview of how to prepare unstructured data for business intelligence and data analytics. It discusses structured, semi-structured, and unstructured data types. It then introduces Recognos' platform called ETI, which uses human-assisted machine learning to extract and integrate data from unstructured documents. ETI can extract data from documents that contain classifiable content through predefined field definitions and templates. It also discusses the challenges of extracting tables and derived fields that require semantic analysis. The document concludes with examples of using extracted data for compliance applications and creating data teams to manage the extraction process over time.
A Discrete Optimization Approach for SVD Best Truncation Choice based on ROC ... (Davide Chicco)
Truncated Singular Value Decomposition (SVD) has always been a key algorithm in modern machine learning.
Scientists and researchers use this applied mathematics method in many fields. Despite its long history and prevalence, the issue of how to choose the best truncation level still remains an open challenge. In this paper, we describe a new algorithm, akin to a discrete optimization method, that relies on the computation of Receiver Operating Characteristic (ROC) Areas Under the Curve (AUCs). We explore a concrete application of the algorithm to a bioinformatics problem, i.e. the prediction of biomolecular annotations. We applied the algorithm to nine different datasets, and the obtained results demonstrate the effectiveness of our technique.
This document discusses analyzing and visualizing gene expression data. It defines key terms like genes and gene expression data. It also describes clustering gene expression data using k-means clustering to group genes based on similarity in a dataset of yeast cell cycle genes. Finally, it discusses visualizing gene expression data using techniques like vector fusion, nMDS, and PCA to project high-dimensional gene expression datasets into 2D or 3D spaces.
Similar to Doctoral Thesis Dissertation 2014-03-20 @PoliMi (20)
1. Computational Prediction of Gene Functions
through Machine Learning methods
and Multiple Validation Procedures
candidate: Davide Chicco davide.chicco@polimi.it
supervisor: Marco Masseroli
PhD Thesis Defense Dissertation
20th March 2014
2. “Computational Prediction of Gene Functions
through Machine Learning methods
and Multiple Validation Procedures”
1) Analyzed scientific problem
2) Machine learning methods used
3) Validation procedures
4) Main results
5) Annotation list correlation measures
6) Novelty indicator
7) Final list of likely predicted annotations
8) Conclusions
3. Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• The association of a gene and an information feature term
corresponds to a biomolecular annotation
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies (e.g. Gene
Ontology), where every controlled term of the vocabulary is
associated with a unique alphanumeric code
[Diagram: a Gene is linked by an Annotation to a Biological function feature (gene2bff)]
4. Biomolecular annotations
• The association of an information/feature with a gene ID
constitutes an annotation
• Annotation example:
• Scientific fact: “the gene GD4 is present in the
mitochondrial membrane”
• Corresponds to the coupling:
<GD4, mitochondrial membrane>
[Diagram: GD4 is present in the mitochondrial membrane]
5. The problem
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human–curated
information
• In vitro experiments are expensive (e.g. 1,000 € and 3 weeks)
• To support and quicken the time–consuming curation process,
prioritized lists of computationally predicted annotations are
extremely useful
• These lists could be generated by software based on
Machine Learning algorithms
6. The problem
• Other scientists and researchers dealt with the problem in the
past by using:
• Support Vector Machines (SVM) [Barutcuoglu et al., 2006]
• k-nearest neighbor algorithm (kNN) [Tao et al., 2007]
• Decision trees [King et al., 2003]
• Hidden Markov models (HMM) [Mi et al. 2013]
• …
• These methods were all good at stating whether a predicted
annotation was correct or not, but were not able to make
extrapolations, that is, to suggest new annotations absent
from the input dataset
8. [Workflow diagram: Data reading → A input matrix → Statistical method → A~ output matrix → Predicted annotation lists]
• The software reads the data from the GPDW database
• The software creates the input matrix:
Input annotation matrix A ∈ {0,1}^(m×n)
m rows: genes
n columns: annotation features
A(i,j) = 1 if gene i is annotated to feature j or to
any descendant of j in the considered ontology
structure (true path rule)
A(i,j) = 0 otherwise (the annotation is unknown)
feat 1 feat 2 feat 3 feat 4 … feat N
gene 1 0 0 0 0 … 0
gene 2 0 1 1 0 … 1
… … … … … … …
gene M 0 0 0 0 … 0
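As a minimal sketch of how such a matrix could be built, assuming a tiny invented ontology and invented gene names (the actual software reads genes and ontology terms from GPDW):

```python
import numpy as np

# Toy ontology: child term -> set of parent terms (invented for
# illustration; the thesis uses Gene Ontology terms).
parents = {
    "membrane": set(),
    "membrane_part": {"membrane"},
    "mitochondrial_membrane": {"membrane_part"},
}

def ancestors(term):
    """All ancestors of a term in the ontology DAG."""
    out, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in out:
                out.add(p)
                stack.append(p)
    return out

# Direct annotations: gene -> directly annotated terms (invented).
direct = {"GD4": {"mitochondrial_membrane"}, "GD7": {"membrane"}}

genes = sorted(direct)
terms = sorted(parents)
A = np.zeros((len(genes), len(terms)), dtype=int)

# True path rule: A(i,j) = 1 if gene i is annotated to term j or to any
# descendant of j, i.e. direct annotations propagate to all ancestors.
for i, g in enumerate(genes):
    annotated = set()
    for t in direct[g]:
        annotated.add(t)
        annotated |= ancestors(t)
    for t in annotated:
        A[i, terms.index(t)] = 1
```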
9.
• The software applies a statistical method
(Truncated Singular Value Decomposition,
Semantically Improved SVD with gene
clustering, Semantically Improved SVD with
clustering and term-term similarity weights) to
a binary A input matrix
• It returns a real-valued output matrix A~
• Every element of the A matrix is compared to
its corresponding element of the A~ matrix
10. • After the computation, we compare each element Aij to
its corresponding Aij~

Input Aij (binary):
0 0 0 0 … 0
0 1 1 0 … 1
… … … … … …
0 0 0 0 … 0

Output Aij~ (real-valued):
0.1 0.3 0.6 0.5 … 0.2
0.6 0.8 0.1 0.9 … 0.8
… … … … … …
0.3 0.2 0.4 0.6 … 0.8

if Aij = 1 & Aij~ > τ: AC (TP)
if Aij = 1 & Aij~ ≤ τ: AR (FN)
if Aij = 0 & Aij~ ≤ τ: NAC (TN)
if Aij = 0 & Aij~ > τ: AP (FP)

AC: Annotation Confirmed (in input: Yes; in output: Yes)
AR: Annotation to be Reviewed (in input: Yes; in output: No)
NAC: No Annotation Confirmed (in input: No; in output: No)
AP: Annotation Predicted (in input: No; in output: Yes)

τ: the threshold that minimizes the sum APs + ARs
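The four-way comparison can be sketched as follows, on an invented 2×2 example with an arbitrary τ (in the actual pipeline, τ is chosen to minimize the number of APs plus ARs):

```python
import numpy as np

# Toy binary input matrix A and real-valued output matrix A~
# (values invented for illustration).
A = np.array([[1, 0], [0, 1]])
A_out = np.array([[0.9, 0.7], [0.2, 0.3]])
tau = 0.5  # arbitrary threshold for this sketch

# Classify every (i, j) pair by comparing A(i,j) with A~(i,j):
AC  = (A == 1) & (A_out >  tau)   # Annotation Confirmed      (TP)
AR  = (A == 1) & (A_out <= tau)   # Annotation to be Reviewed (FN)
NAC = (A == 0) & (A_out <= tau)   # No Annotation Confirmed   (TN)
AP  = (A == 0) & (A_out >  tau)   # Annotation Predicted      (FP)
```

With these toy values, cell (0,0) is an AC, (0,1) is an AP (an annotation absent in input but suggested by the method), (1,0) is a NAC, and (1,1) is an AR.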
11.
AC: Annotation Confirmed
AR: Annotation to be Reviewed
NAC: No Annotation Confirmed
AP: Annotation Predicted
• The Annotations Predicted - AP (FP) are the
annotations absent in input and predicted by our
software: we suggest them as present
• We record them in ranked lists:

Rank  Annotation ID  Likelihood value
1     218405         0.9742584
2     222571         0.8545574
…     …              …
n     203145         0.1673128
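Building the ranked list from (annotation ID, likelihood) pairs is then a simple descending sort; the pairs below reuse the values of the example table:

```python
# Predicted annotations as (annotation ID, likelihood A~(i,j)) pairs,
# taken from the example table above.
predictions = [
    (218405, 0.9742584),
    (203145, 0.1673128),
    (222571, 0.8545574),
]

# Rank by likelihood, highest first.
ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
for rank, (ann_id, likelihood) in enumerate(ranked, start=1):
    print(rank, ann_id, likelihood)
```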
13.
• Only the k most «important» singular components of A are
used for the reconstruction
(where 0 < k < r, with r the number of non-zero
singular values of A, i.e. the rank of A)
• In [P. Khatri et al. "A semantic analysis of the annotations of the
human genome“, Bioinformatics, 2005], the authors argued
that the study of the matrix A shows the semantic
relationships of the gene-function associations.
• A large value of a~ij suggests that gene i should be
annotated to term j, whereas a value close to zero
suggests the opposite.
Truncated Singular Value Decomposition (tSVD)
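A minimal numpy sketch of the truncated reconstruction, on an invented toy matrix:

```python
import numpy as np

# Invented toy gene-to-term annotation matrix.
A = np.array([
    [1., 1., 0., 0.],
    [1., 1., 1., 0.],
    [0., 0., 1., 1.],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))   # number of non-zero singular values (rank of A)

# Keep only the k most important singular triplets (0 < k < r).
k = 2
A_tilde = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A large a~ij suggests that gene i should be annotated to term j,
# whereas a value close to zero suggests the opposite.
```

By the Eckart-Young theorem, the Frobenius error of this rank-k truncation equals the root-sum-square of the discarded singular values.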
14.
• We started from this method, developed by Khatri
et al. (2005) at Wayne State University, Detroit, and
implemented it
• Improvement:
• Khatri et al. used a fixed SVD truncation level
k = 500
• We developed a method for automated, data-
driven selection of k based on the Receiver
Operating Characteristic (ROC) curve
• We obtained better results, shown in several
publications
Truncated Singular Value Decomposition (tSVD)
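The data-driven selection of k could look roughly like this. It is a simplified sketch: here the ROC AUC is scored on the same matrix used for the SVD, just to show the mechanics, whereas the actual procedure in the thesis is more elaborate.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney rank statistic (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Invented toy annotation matrix.
A = np.array([
    [1., 1., 0., 0.],
    [1., 1., 1., 0.],
    [0., 0., 1., 1.],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))

# Try every truncation level 0 < k < r and keep the one whose
# reconstruction best separates the 1s of A from its 0s.
best_k, best_auc = None, -1.0
for k in range(1, r):
    A_tilde = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    score = roc_auc(A.ravel().astype(int), A_tilde.ravel())
    if score > best_auc:
        best_k, best_auc = k, score
```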
15.
• Semantically improved (SIM1) version of the
Truncated SVD, based on gene clustering [P. Drineas et al.,
"Clustering large graphs via the singular value decomposition",
Machine Learning, 2004]
• Inspiring idea: similar genes can be grouped into
clusters that have different weights
Truncated SVD with gene clustering (SIM1)
16.
Truncated SVD with gene clustering (SIM1)
1. We choose a number C of clusters, and completely
discard the columns of matrix U where j = C+1, ..., n.
(we have an algorithm for the choice of C)
2. Each column uc of SVD matrix U represents a cluster,
and the value U(i,c) indicates the membership of
gene i to the c-th cluster.
3. For each cluster, first we generate Wc = diag(uc), and
then the modified gene-to-term matrix Ac = Wc A, in
which the i-th row of A is weighted by the
membership score of the corresponding gene in the
c-th cluster.
17.
Truncated SVD with gene clustering (SIM1)
4. Then, we compute Tc = Ac^T Ac, and its SVD(Tc)
5. Then, every element of the A~ matrix is computed
considering the c-th cluster that minimizes its
Euclidean-norm distance from the original vector:
ai~ = ai Vk,c Vk,c^T
6. The output matrix is produced
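Steps 1-6 of SIM1 can be sketched as follows. This is a simplified reading of the slides on an invented toy matrix; variable names are illustrative, and the eigendecomposition of Tc = Ac^T Ac is used in place of SVD(Tc), which yields the same right singular vectors.

```python
import numpy as np

# Invented toy gene-to-term matrix.
A = np.array([
    [1., 1., 0., 0.],
    [1., 1., 1., 0.],
    [0., 0., 1., 1.],
])
m, n = A.shape
C = 2   # number of clusters (the thesis has its own algorithm to choose C)
k = 2   # SVD truncation level

# Steps 1-2: the first C left singular vectors of A act as soft
# cluster memberships for the genes.
U, _, _ = np.linalg.svd(A, full_matrices=False)

A_tilde = np.zeros_like(A)
for i in range(m):
    best_err, best_row = np.inf, None
    for c in range(C):
        # Step 3: weight the rows of A by the genes' membership in cluster c.
        Ac = np.diag(U[:, c]) @ A
        # Step 4: term-to-term matrix Tc = Ac^T Ac and its top eigenvectors.
        Tc = Ac.T @ Ac
        eigvals, eigvecs = np.linalg.eigh(Tc)   # ascending eigenvalues
        Vkc = eigvecs[:, ::-1][:, :k]           # top-k eigenvectors, n x k
        # Step 5: reconstruct gene i in the cluster's k-dim term subspace,
        # keeping the cluster whose reconstruction is closest to A[i].
        row = A[i] @ Vkc @ Vkc.T
        err = np.linalg.norm(A[i] - row)
        if err < best_err:
            best_err, best_row = err, row
    A_tilde[i] = best_row
```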
18.
• Semantically improved (SIM2) version of the
Truncated SVD, based on gene clustering and term-term
similarity weights [P. Resnik, "Using information content to
evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
• Inspiring idea: functionally similar terms should be
annotated to the same genes
Truncated SVD with gene clustering and term-similarity weights (SIM2)
19.
Truncated SVD with gene clustering and term-similarity weights (SIM2)
In the algorithm shown before, we add the following step:
6. a) Furthermore, to achieve more accurate clustering, we
compute the eigenvectors of the matrix G~ = A S A^T,
where the real n×n matrix S is the term similarity matrix.
Starting from a pair of ontology terms, j1 and j2, the
term functional similarity S(j1, j2) can be calculated
using different methods.
The similarity is based on the Resnik measure [P. Resnik, "Using
information content to evaluate semantic similarity in a
taxonomy", arXiv.org, 1995]
20.
Other methods
With some colleagues at Politecnico di Milano, we also implemented other methods (not included in this thesis):
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation with Gibbs sampling (LDA)
And with some colleagues at the University of California, Irvine, we have been trying to design and implement other models:
• Auto-Encoder Deep Neural Network
21. • After the computation, we compare each Aij element to the corresponding Aij~ element.

Input Aij (binary matrix):
0 0 0 0 … 0
0 1 1 0 … 1
… … … … … …
0 0 0 0 … 0

Output Aij~ (real-valued matrix):
0.1 0.3 0.6 0.5 … 0.2
0.6 0.8 0.1 0.9 … 0.8
… … … … … …
0.3 0.2 0.4 0.6 … 0.8

if Aij = 1 & Aij~ > τ: AC (TP)
if Aij = 1 & Aij~ ≤ τ: AR (FN)
if Aij = 0 & Aij~ ≤ τ: NAC (TN)
if Aij = 0 & Aij~ > τ: AP (FP)

AC: Annotation Confirmed; AR: Annotation to be Reviewed
NAC: No Annotation Confirmed; AP: Annotation Predicted
τ: the threshold that minimizes the sum APs + ARs

Input | Output | Case
Yes | Yes | AC
Yes | No | AR
No | No | NAC
No | Yes | AP
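The four comparison rules above can be written directly as boolean masks; this is a sketch with made-up matrices and threshold, not the thesis code:

```python
import numpy as np

# Label each (i, j) cell by comparing the known annotation Aij with the
# predicted value Aij~ against a threshold tau (names are illustrative).
def classify(A, A_tilde, tau):
    labels = np.empty(A.shape, dtype=object)
    labels[(A == 1) & (A_tilde > tau)]  = "AC"   # Annotation Confirmed (TP)
    labels[(A == 1) & (A_tilde <= tau)] = "AR"   # Annotation to be Reviewed (FN)
    labels[(A == 0) & (A_tilde <= tau)] = "NAC"  # No Annotation Confirmed (TN)
    labels[(A == 0) & (A_tilde > tau)]  = "AP"   # Annotation Predicted (FP)
    return labels

A = np.array([[1, 0], [0, 1]])
A_tilde = np.array([[0.9, 0.7], [0.2, 0.3]])
labels = classify(A, A_tilde, tau=0.5)
# labels -> [["AC", "AP"], ["NAC", "AR"]]
```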
26. Text Mining and Web Tool Validation
Literature text mining and web tools validation procedure
Databanks may not be up to date, so we manually searched for the predicted annotations through:
• literature resources, such as PubMed
• Web tools, such as AmiGO and GeneCards
30. List Comparison Measures
Comparing methods and parameters
• When we have different lists of predicted annotations, we want to know how similar or different they are: how similar are they?
• Answering this question helps us understand how the method parameters behave.

Example lists of Annotation IDs:
List 1: 10,000; 20,000; …; 90,000
List 2: 40,000; 10,000; …; 90,000
31. List Comparison Measures
How similar are these lists?
• Spearman's rank correlation coefficient:
the total sum of the positional differences of each element between the two lists (e.g. 3rd position – 1st position = 2)

Example lists of Annotation IDs:
List 1: 10,000; 20,000; 30,000; …
List 2: 30,000; 10,000; 40,000; …
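The positional-difference measure described above can be sketched in a few lines of Python (the IDs are illustrative, and this simplified sum of absolute rank differences is a footrule-style variant of the full Spearman coefficient):

```python
# Sum of positional differences of each annotation ID between two
# ranked lists; both lists are assumed to hold the same elements.
def position_difference_sum(list_a, list_b):
    pos_b = {item: i for i, item in enumerate(list_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(list_a))

a = [10000, 20000, 30000, 40000]
b = [30000, 10000, 40000, 20000]
# 10000: |0-1|=1; 20000: |1-3|=2; 30000: |2-0|=2; 40000: |3-2|=1
print(position_difference_sum(a, b))  # -> 6
```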
32. List Comparison Measures
How similar are these lists?
• Kendall tau distance:
the total number of bubble-sort swaps needed to transform one list into the other

Example lists of Annotation IDs:
List 1: 10,000; 20,000; …; 90,000
List 2: 20,000; 10,000; …; 90,000
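Counting bubble-sort swaps is the same as counting discordant pairs, which gives a compact sketch of the Kendall tau distance above (IDs are illustrative):

```python
# Kendall tau distance: number of adjacent swaps (bubble-sort exchanges)
# needed to turn one ranking into the other, i.e. the number of pairs
# ordered differently in the two lists.
def kendall_tau_distance(list_a, list_b):
    pos_b = {item: i for i, item in enumerate(list_b)}
    ranks = [pos_b[item] for item in list_a]
    return sum(1
               for i in range(len(ranks))
               for j in range(i + 1, len(ranks))
               if ranks[i] > ranks[j])      # count inversions

a = [10000, 20000, 30000]
b = [20000, 10000, 30000]
print(kendall_tau_distance(a, b))  # -> 1 (one swap: 10000 <-> 20000)
```

The O(n^2) pair scan is fine for short lists; a merge-sort inversion count would bring it to O(n log n) for long annotation lists.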
33. List Comparison Measures
Extended Kendall distance and Extended Spearman coefficient

Example lists of Annotation IDs:
Prediction 1
  AP list: 10,000; 20,000; 30,000; …
  NAC list: 70,000; 80,000; 90,000; …
Prediction 2
  AP list: 30,000; 10,000; 40,000; …
  NAC list: 70,000; 20,000; 90,000; …

• We assign a high penalty if an element is absent from one of the lists, and a low penalty if an element is absent from one of the AP lists but present in the corresponding NAC list.
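The penalty idea can be sketched as follows. This is a hedged illustration only: the exact penalty values and how they combine with the rank-based distance in the thesis are not given here, so `absence_penalty`, `HIGH_PENALTY`, and `LOW_PENALTY` are arbitrary placeholders.

```python
# Placeholder weights (assumptions, not the thesis values).
HIGH_PENALTY, LOW_PENALTY = 10, 1

def absence_penalty(ap_a, nac_a, ap_b, nac_b):
    """Penalize elements missing from the other prediction's AP list;
    the penalty is reduced if the element at least appears in that
    prediction's NAC list."""
    penalty = 0
    for item in ap_a:
        if item not in ap_b:
            penalty += LOW_PENALTY if item in nac_b else HIGH_PENALTY
    for item in ap_b:
        if item not in ap_a:
            penalty += LOW_PENALTY if item in nac_a else HIGH_PENALTY
    return penalty

ap_a, nac_a = {10000, 20000, 30000}, {70000, 80000, 90000}
ap_b, nac_b = {30000, 10000, 40000}, {70000, 20000, 90000}
# 20000 is absent from ap_b but present in nac_b -> low penalty (1)
# 40000 is absent from ap_a and not in nac_a    -> high penalty (10)
print(absence_penalty(ap_a, nac_a, ap_b, nac_b))  # -> 11
```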
34. List Comparison Measures
Significant patterns:
• The Extended Kendall distances show that the more similar the SVD truncations are, the lower the Extended Kendall distance is, and so the more similar the lists are.
• Lists generated by predictions that produced similar AUC values have similarly low Extended Spearman coefficients: lists from predictions with similar AUC percentages differ in very few elements.
35. Novelty Indicator
Schlicker rate based on DAG:
an indicator to express the “novelty” rate of a prediction in a gene tree
• Statistical rate
• Visual DAG viewer
Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene P2RY14.
Black balls: terms already present in the database.
Blue hexagons: predicted terms.
36. Novelty Indicator
Schlicker rate based on DAG:
an indicator to express the “novelty” rate of a prediction in a gene tree
• Statistical rate
• Visual DAG viewer
Example: DAG tree of the Molecular Function terms predicted for the Homo sapiens gene CCR2.
Black balls: terms already present in the database.
Blue hexagons: predicted terms.
37. Final predictions
We finally get a list of the most likely predicted annotations, which have the following characteristics:
- predicted by all three methods (tSVD, SIM1, SIM2)
- prediction ranking in the first 50% of the list
- having at least one validated parent.
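The three filters above can be sketched as a small selection routine. Everything here is an assumption for illustration: the per-method top-50% check, the `has_validated_parent` lookup, and the toy ranked lists are not from the thesis.

```python
# ranked_lists maps each method to its ranked list of predicted
# annotations; has_validated_parent is a hypothetical lookup telling
# whether at least one parent term of the annotation is validated.
def final_predictions(ranked_lists, has_validated_parent):
    # 1) predicted by all three methods (tSVD, SIM1, SIM2)
    common = set.intersection(*(set(lst) for lst in ranked_lists.values()))
    kept = []
    for ann in common:
        # 2) ranked in the top 50% of every method's list (assumed reading
        #    of "first 50% of the list")
        top_half = all(lst.index(ann) < len(lst) / 2
                       for lst in ranked_lists.values())
        # 3) at least one parent term already validated
        if top_half and has_validated_parent(ann):
            kept.append(ann)
    return kept

ranked = {
    "tSVD": [1, 2, 3, 4],
    "SIM1": [2, 1, 4, 3],
    "SIM2": [1, 2, 4, 3],
}
print(final_predictions(ranked, has_validated_parent=lambda a: a != 2))
# -> [1]  (2 fails the parent check; 3 and 4 fall outside the top half)
```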
Gene symbol | Feature term
PPME1 | Organelle organization [BP]
CHST14 | Chondroitin sulfate proteoglycan biosynthetic process [BP]
CHST14 | Biopolymer biosynthetic process [BP]
ROPN1B | Microtubule-based flagellum [CC]
CHST14 | Dermatan sulfate proteoglycan biosynthetic process [BP]
CPA2 | Proteolysis involved in cellular protein catabolic process [BP]
PPME1 | Chromosome organization [BP]
CNOT2 | Positive regulation of cellular metabolic process [BP]
38. Recap
Truncated SVD with the automatically chosen truncation showed better results (percentage of predicted annotations found in the updated database version) than the previous version of the method with fixed parameters.
The new methods (SIM1 and SIM2) outperformed Truncated SVD.
The ROC analysis, the database-version validation, and the text mining and web tool validation procedures proved very effective.
The Extended Kendall and Spearman coefficients revealed interesting patterns that would otherwise be invisible.
The novelty indicator rate proved very useful in highlighting the most interesting prediction trees, pointing out relevant research paths.