Elysium Technologies Private Limited
            Approved by ISO 9001:2008 and AICTE for SKP Training
            Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai
            http://www.elysiumtechnologies.com, info@elysiumtechnologies.com



        IEEE FINAL YEAR PROJECTS 2012 – 2013
                        BIO-INFORMATICS
Corporate Office: Madurai
    227-230, Church road, Anna nagar, Madurai – 625 020.
    0452 – 4390702, 4392702, +9199447933980
    Email: info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
    Website: www.elysiumtechnologies.com

Branch Office: Trichy
    15, III Floor, SI Towers, Melapudur main road, Trichy – 620 001.
    0431 – 4002234, +919790464324.
    Email: trichy@elysiumtechnologies.com, elysium.trichy@gmail.com.
    Website: www.elysiumtechnologies.com

Branch Office: Coimbatore
    577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641 002.
    +919677751577
    Website: Elysiumtechnologies.com, Email: info@elysiumtechnologies.com

Branch Office: Kollam
    Surya Complex, Vendor junction, Kollam – 691 010, Kerala.
    0474 – 2723622, +919446505482.
    Email: kerala@elysiumtechnologies.com.
    Website: www.elysiumtechnologies.com

Branch Office: Cochin
    4th Floor, Anjali Complex, near south over bridge, Valanjampalam,
    Cochin – 682 016, Kerala.
    0484 – 6006002, +917736004002.
    Email: kerala@elysiumtechnologies.com, Website: www.elysiumtechnologies.com

     IEEE Final Year Projects 2012 | Student Projects | Bio-informatics Projects
                                 BIO-INFORMATICS                                                 2012 - 2013
EGC       A Biologically Inspired Validity Measure for Comparison of Clustering Methods over
8201      Metabolic Data Sets

         In the biological domain, clustering is based on the assumption that genes or metabolites involved in a common
         biological process are coexpressed/coaccumulated under the control of the same regulatory network. Thus, a detailed
         inspection of the grouped patterns to verify their memberships to well-known metabolic pathways could be very useful
         for the evaluation of clusters from a biological perspective. The aim of this work is to propose a novel approach for the
         comparison of clustering methods over metabolic data sets, including prior biological knowledge about the relation
         among elements that constitute the clusters. A way of measuring the biological significance of clustering solutions is
         proposed. This is addressed from the perspective of the usefulness of the clusters to identify those patterns that change
         in coordination and belong to common pathways of metabolic regulation. The measure summarizes in a compact way
         the objective analysis of clustering methods, which respects coherence and clusters distribution. It also evaluates the
         biological internal connections of such clusters considering common pathways. The proposed measure was tested in
         two biological databases using three clustering methods.
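As an illustration of the underlying idea only (not the paper's actual measure), a toy biological-coherence score for a single cluster could count the fraction of metabolite pairs that co-occur in a known metabolic pathway; the pathway sets and metabolite names below are hypothetical:

```python
from itertools import combinations

def pathway_coherence(cluster, pathways):
    """Toy score: fraction of element pairs in the cluster that
    co-occur in at least one known metabolic pathway."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs
               if any(a in p and b in p for p in pathways))
    return hits / len(pairs)

# Hypothetical pathways and a three-metabolite cluster:
pathways = [{"m1", "m2", "m3"}, {"m4", "m5"}]
score = pathway_coherence(["m1", "m2", "m4"], pathways)
# pairs: (m1,m2) shares a pathway; (m1,m4) and (m2,m4) do not -> 1/3
```

A clustering whose clusters score high by such a criterion groups patterns that belong to common regulatory pathways, which is the perspective the measure above takes.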

  EGC     A Co-Clustering Approach for Mining Large Protein-Protein Interaction Networks
  8202



         Several approaches have been presented in the literature to cluster Protein-Protein Interaction (PPI) networks. They can
         be grouped in two main categories: those allowing a protein to participate in different clusters and those generating only
         nonoverlapping clusters. In both cases, a challenging task is to find a suitable compromise between the biological
         relevance of the results and a comprehensive coverage of the analyzed networks. Indeed, methods returning high
         accurate results are often able to cover only small parts of the input PPI network, especially when low-characterized
         networks are considered. We present a coclustering-based technique able to generate both overlapping and
         nonoverlapping clusters. The density of the clusters to search for can also be set by the user. We tested our method on
         the two networks of yeast and human, and compared it to five other well-known techniques on the same interaction data
         sets. The results showed that, for all the examples considered, our approach always reaches a good compromise
         between accuracy and network coverage. Furthermore, the behavior of our algorithm is not influenced by the structure
         of the input network, unlike all the other techniques considered in the comparison, which returned very good results
         on the yeast network but rather poor outcomes on the human network.




EGC
         A Comparative Study on Filtering Protein Secondary Structure Prediction
8203


       Filtering of Protein Secondary Structure Prediction (PSSP) aims to provide physicochemically realistic results, while it
       usually improves the predictive performance. We performed a comparative study on this challenging problem, utilizing
        both machine learning techniques and empirical rules, and we found that combinations of the two lead to the highest
       improvement.
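A minimal sketch of one empirical filtering rule of the kind such studies combine with machine learning: helices shorter than about three residues are physicochemically implausible, so very short predicted helix runs can be relabeled as coil. The threshold and the H/C state alphabet here are illustrative, not taken from the paper:

```python
def filter_short_helices(ss, min_len=3):
    """Empirical-rule filter: relabel helix (H) runs shorter than
    min_len as coil (C), since very short helices are
    physicochemically implausible."""
    out = list(ss)
    i = 0
    while i < len(out):
        if out[i] == "H":
            j = i
            while j < len(out) and out[j] == "H":
                j += 1
            if j - i < min_len:           # run too short: flatten to coil
                out[i:j] = ["C"] * (j - i)
            i = j
        else:
            i += 1
    return "".join(out)

print(filter_short_helices("CCHHCCHHHHCC"))  # -> "CCCCCCHHHHCC"
```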



EGC    A Computational Model for Predicting Protein Interactions Based on Multidomain
8204
       Collaboration
       Recently, several domain-based computational models for predicting protein-protein interactions (PPIs) have been
       proposed. The conventional methods usually infer domain or domain combination (DC) interactions from already known
       interacting sets of proteins, and then predict PPIs using the information. However, the majority of these models often
       have limitations in providing detailed information on which domain pair (single domain interaction) or DC pair
       (multidomain interaction) will actually interact for the predicted protein interaction. Therefore, a more comprehensive
       and concrete computational model for the prediction of PPIs is needed. We developed a computational model to predict
       PPIs using the information of intraprotein domain cohesion and interprotein DC coupling interaction. A method of
       identifying the primary interacting DC pair was also incorporated into the model in order to infer actual participants in a
       predicted interaction. Our method made an apparent improvement in the PPI prediction accuracy, and the primary
       interacting DC pair identification was valid specifically in predicting multidomain protein interactions. In this paper, we
       demonstrate that 1) the intraprotein domain cohesion is meaningful in improving the accuracy of domain-based PPI
       prediction, 2) a prediction model incorporating the intradomain cohesion enables us to identify the primary interacting
       DC pair, and 3) a hybrid approach using the intra/interdomain interaction information can lead to a more accurate
       prediction.

EGC
8205    A Framework for Incorporating Functional Interrelationships into Protein Function
        Prediction Algorithms

        The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many
        computational approaches have been developed in recent years to predict protein function, most of these traditional
        algorithms do not take interrelationships among functional terms into account, such as different GO terms usually
        coannotate with some common proteins. In this study, we propose a new functional similarity measure in the form of
        Jaccard coefficient to quantify these interrelationships and also develop a framework for incorporating GO term
        similarity into protein function prediction process. The experimental results of cross-validation on S. cerevisiae and
        Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function
         prediction. In addition, we find that small terms associated with only a few proteins benefit more than large
         ones when functional interrelationships are considered. We also compare our similarity measure with two other
        widely used measures, and results indicate that when incorporated into function prediction algorithms, our proposed
        measure is more effective. Experiment results also illustrate that our algorithms outperform two previous competing

        algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our
        method is robust to annotations in the database which are not complete at present. These results give new insights
        about the importance of functional interrelationships in protein function prediction.
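The Jaccard-coefficient similarity between two GO terms, measured over the sets of proteins they annotate, can be sketched as follows (the function name and the protein identifiers are illustrative):

```python
def jaccard_similarity(proteins_a, proteins_b):
    """Jaccard coefficient of two GO terms' protein annotation sets:
    |A intersect B| / |A union B|."""
    a, b = set(proteins_a), set(proteins_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical GO terms that coannotate two of four proteins:
sim = jaccard_similarity({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
# -> 2/4 = 0.5
```

Terms that coannotate many common proteins score close to 1, which is exactly the interrelationship the framework feeds into the function prediction process.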

EGC       A Hybrid Approach to Survival Model Building Using Integration of Clinical and Molecular
8206      Information in Censored Data

       In medical society, the prognostic models, which use clinicopathologic features and predict prognosis after a certain
       treatment, have been externally validated and used in practice. In recent years, most research has focused on high
       dimensional genomic data and small sample sizes. Since clinically similar but molecularly heterogeneous tumors may
       produce different clinical outcomes, the combination of clinical and genomic information, which may be complementary,
        is crucial to improve the quality of prognostic predictions. However, there is a lack of an integrating scheme for
        clinicogenomic models due to the P ≫ N problem, in particular, for a parsimonious model. We propose a methodology
        to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which
        uses several dimension reduction techniques, L2-penalized maximum likelihood estimation (PMLE), and
       resampling methods to tackle the problem. The predictive accuracy of the modeling approach is assessed by several
       metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies on a
       metastasis and death event, we show that the proposed methodology can improve prediction accuracy and build a final
       model with a hybrid signature that is parsimonious when integrating both types of variables.



EGC      A Hybrid EKF and Switching PSO Algorithm for Joint State and Parameter Estimation of
8207     Lateral Flow Immunoassay Models
       In this paper, a hybrid extended Kalman filter (EKF) and switching particle swarm optimization (SPSO) algorithm is
       proposed for jointly estimating both the parameters and states of the lateral flow immunoassay model through available
       short time-series measurement. Our proposed method generalizes the well-known EKF algorithm by imposing physical
        constraints on the system states. Note that state constraints are encountered very often in practice and give rise to
       considerable difficulties in system analysis and design. The main purpose of this paper is to handle the dynamic
       modeling problem with state constraints by combining the extended Kalman filtering and constrained optimization
       algorithms via the maximization probability method. More specifically, a recently developed SPSO algorithm is used to
       cope with the constrained optimization problem by converting it into an unconstrained optimization one through adding
       a penalty term to the objective function. The proposed algorithm is then employed to simultaneously identify the
       parameters and states of a lateral flow immunoassay model. It is shown that the proposed algorithm gives much
       improved performance over the traditional EKF method.
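The constrained-to-unconstrained conversion described above can be sketched with a quadratic penalty term added to the objective function; the objective, the constraint, and the weight below are toy stand-ins, not the paper's immunoassay model:

```python
def penalized_objective(f, constraints, penalty_weight=1e3):
    """Wrap an objective f(x) so that violations of constraints
    g(x) <= 0 are penalized, turning a constrained problem into an
    unconstrained one that a particle swarm can optimize directly."""
    def wrapped(x):
        violation = sum(max(0.0, g(x)) ** 2 for g in constraints)
        return f(x) + penalty_weight * violation
    return wrapped

# Toy example: minimize x^2 subject to x >= 1 (written as 1 - x <= 0)
obj = penalized_objective(lambda x: x * x, [lambda x: 1.0 - x])
# obj(0.0) = 0 + 1000*1 = 1000.0 (infeasible, heavily penalized)
# obj(1.0) = 1.0 (feasible, no penalty)
```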


EGC      A Memory Efficient Method for Structure-Based RNA Multiple Alignment
8208




       Structure-based RNA multiple alignment is particularly challenging because covarying mutations make sequence
       information alone insufficient. Existing tools for RNA multiple alignment first generate pairwise RNA structure
       alignments and then build the multiple alignment using only sequence information. Here we present PMFastR, an
       algorithm which iteratively uses a sequence-structure alignment procedure to build a structure-based RNA multiple
       alignment from one sequence with known structure and a database of sequences from the same family. PMFastR also
       has low memory consumption allowing for the alignment of large sequences such as 16S and 23S rRNA. The algorithm
       also provides a method to utilize a multicore environment. We present results on benchmark data sets from BRAliBase,
       which shows PMFastR performs comparably to other state-of-the-art programs. Finally, we regenerate 607 Rfam seed
       alignments and show that our automated process creates multiple alignments similar to the manually curated Rfam seed
       alignments. Thus, the techniques presented in this paper allow for the generation of multiple alignments using
       sequence-structure guidance, while limiting memory consumption. As a result, multiple alignments of long RNA
       sequences, such as 16S and 23S rRNAs, can easily be generated locally on a personal computer. The software and
       supplementary data are available at http://genome.ucf.edu/PMFastR.

EGC
8209    A Metric for Phylogenetic Trees Based on Matching


       Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of
       such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have
       been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be
       shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance
       is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small
        changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we
       introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure
       induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing
       that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems
       with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the
       quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds
       distance.
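For contrast with the matching-based metric the paper proposes, the Robinson-Foulds distance criticized above is simply the size of the symmetric difference between the two trees' sets of nontrivial bipartitions. A minimal sketch over precomputed bipartitions (leaf labels and splits are hypothetical):

```python
def robinson_foulds(biparts_a, biparts_b):
    """RF distance: number of bipartitions present in exactly one tree.
    Each bipartition is represented as a frozenset of the leaf labels
    on one side of the split."""
    a, b = set(biparts_a), set(biparts_b)
    return len(a ^ b)  # symmetric difference

# Hypothetical 5-leaf trees sharing one internal split:
t1 = {frozenset({"A", "B"}), frozenset({"D", "E"})}
t2 = {frozenset({"A", "B"}), frozenset({"C", "D"})}
# -> RF = 2 (one split unique to each tree)
```

Because a single leaf move can change every bipartition at once, this count saturates easily, which is the lack of robustness the paper's matching-based metric is designed to avoid.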



EGC
8210
         A New Efficient Algorithm for the Gene-Team Problem on General Sequences


       Identifying conserved gene clusters is an important step toward understanding the evolution of genomes and predicting
       the functions of genes. A famous model to capture the essential biological features of a conserved gene cluster is called
       the gene-team model. The problem of finding the gene teams of two general sequences is the focus of this paper. For
       this problem, He and Goldwasser had an efficient algorithm that requires O(mn) time using O(m + n) working space,
       where m and n are, respectively, the numbers of genes in the two given sequences. In this paper, a new efficient
        algorithm is presented. Assume m ≤ n. Let C = Σα∈Σ o1(α)o2(α), where Σ is the set of distinct genes, and o1(α) and o2(α)
        are, respectively, the numbers of copies of α in the two given sequences. Our new algorithm requires O(min{C lg n, mn})
       time using O(m + n) working space. As compared with He and Goldwasser's algorithm, our new algorithm is more
       practical, as C is likely to be much smaller than mn in practice. In addition, our new algorithm is output sensitive. Its
       running time is O(lg n) times the size of the output. Moreover, our new algorithm can be efficiently extended to find the
        gene teams of k general sequences in O(kC lg(n1n2...nk)) time, where ni is the number of genes in the ith input
       sequence.
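The quantity C that drives the running time can be computed directly from the two gene sequences; a minimal sketch (gene names are illustrative):

```python
from collections import Counter

def shared_copy_product(seq1, seq2):
    """C = sum over distinct genes alpha of o1(alpha)*o2(alpha), where
    oi(alpha) is the number of copies of alpha in sequence i."""
    c1, c2 = Counter(seq1), Counter(seq2)
    return sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())

# Gene 'a' has 2 and 1 copies, gene 'b' has 1 and 2 copies:
C = shared_copy_product(["a", "b", "a", "c"], ["b", "a", "b", "d"])
# C = 2*1 + 1*2 = 4, versus mn = 16
```

When few genes are duplicated, C stays close to the number of shared genes and is far below mn, which is why the O(C lg n) bound is the practical one.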


EGC
         A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences
8211


       Today's genome analysis applications require sequence representations allowing for fast access to their contents while
       also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence
       representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly
        reusable or programming language-specific implementations. We present a novel, space-efficient data structure
       (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character
       transformations, wildcard support, and an assortment of internal representations optimized for different distributions of
       wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our
        representation requires only 2 + 8·10⁻⁶ bits per character. Implemented in C, our portable software implementation
       provides a variety of methods for random and sequential access to characters and substrings (including different
       reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence
       descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq
       is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show
        that it is competitive with respect to space and time requirements.
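At the quoted rate, the space needed for the human genome can be checked with a quick back-of-the-envelope calculation:

```python
# Space estimate for a GtEncseq-style encoding at the quoted
# 2 + 8e-6 bits per character, for 3.1 gigabases:
chars = 3.1e9
bits_per_char = 2 + 8e-6
total_bytes = chars * bits_per_char / 8
print(f"{total_bytes / 2**20:.0f} MiB")  # roughly 739 MiB
```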


EGC     A New Unsupervised Feature Ranking Method for Gene Expression Data Based on
8212
        Consensus Affinity

       Feature selection is widely established as one of the fundamental computational techniques in mining microarray data.
       Due to the lack of categorized information in practice, unsupervised feature selection is more practically important but
       correspondingly more difficult. Motivated by the cluster ensemble techniques, which combine multiple clustering
       solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection
       proposed to use these consensus solutions as oracles. However, these methods are dependent on both the particular
       cluster ensemble algorithm used and the knowledge of the true cluster number. These methods will be unsuitable when
       the true cluster number is not available, which is common in practice. In view of the above problems, a new
       unsupervised feature ranking method is proposed to evaluate the importance of the features based on consensus
       affinity. Different from previous works, our method compares the corresponding affinity of each feature between a pair
       of instances based on the consensus matrix of clustering solutions. As a result, our method alleviates the need to know
       the true number of clusters and the dependence on particular cluster ensemble approaches as in previous works.


        Experiments on real gene expression data sets demonstrate significant improvement of the feature ranking results when
        compared to several state-of-the-art techniques.
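The consensus matrix on which the affinity comparison rests can be sketched as follows: for each pair of instances, record the fraction of clustering solutions that put them in the same cluster. The clusterings below are hypothetical, and the feature-ranking step built on top of this matrix is not shown:

```python
def consensus_matrix(labelings):
    """Fraction of clustering solutions in which each pair of
    instances is assigned to the same cluster."""
    n = len(labelings[0])
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0
    return [[v / len(labelings) for v in row] for row in m]

# Three hypothetical clusterings of four samples:
runs = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
M = consensus_matrix(runs)
# M[0][1] = 2/3: samples 0 and 1 co-cluster in two of three runs
```

Note that building M never requires fixing the true number of clusters, which is the independence property the method exploits.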




EGC     A Sparse Regulatory Network of Copy-Number Driven Gene Expression Reveals
8213
        Putative Breast Cancer Oncogenes
        The influence of DNA cis-regulatory elements on a gene's expression has been intensively studied. However, little is
        known about expressions driven by trans-acting DNA hotspots. DNA hotspots harboring copy number aberrations are
        recognized to be important in cancer as they influence multiple genes on a global scale. The challenge in detecting
        trans-effects is mainly due to the computational difficulty in detecting weak and sparse trans-acting signals amidst
        co-occurring passenger events. We propose an integrative approach to learn a sparse interaction network of DNA
        copy-number regions with their downstream targets in a breast cancer dataset. Information from this network helps
        distinguish copy-number driven from copy-number independent expression changes on a global scale. Our result
        further delineates cis- and trans-effects in a breast cancer dataset, for which important oncogenes such as ESR1 and
         ERBB2 appear to be highly copy-number dependent. Further, our model is shown to be efficient and, in terms of
         goodness of fit, no worse than other state-of-the-art predictors and network reconstruction models using both simulated
        and real data.



EGC      A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray
8214
         Analysis
         A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data
        of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various
        application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a general
        accepted rule, these methods are grouped in filters, wrappers, and embedded methods. More recently, a new group of
        methods has been added in the general framework of FS: ensemble techniques. The focus in this survey is on filter
        feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also
        known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them
        in a unified framework, using standardized notations in order to reveal their technical details and to highlight their
        common characteristics as well as their particularities.
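A minimal example of the filter paradigm the survey covers: rank genes by a univariate score computed independently of any classifier. The signal-to-noise ratio below is one classic such score; the expression values are toy data, and this is an illustration of the paradigm rather than a method from the survey:

```python
def signal_to_noise(expr, labels):
    """Filter score per gene: |mean1 - mean0| / (std1 + std0),
    from a genes-x-samples expression matrix and binary labels."""
    scores = []
    for gene in expr:
        g1 = [v for v, l in zip(gene, labels) if l == 1]
        g0 = [v for v, l in zip(gene, labels) if l == 0]
        m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
        s1 = (sum((v - m1) ** 2 for v in g1) / len(g1)) ** 0.5
        s0 = (sum((v - m0) ** 2 for v in g0) / len(g0)) ** 0.5
        scores.append(abs(m1 - m0) / (s1 + s0 + 1e-12))
    return scores

# Two genes, four samples: gene 0 separates the classes, gene 1 does not
expr = [[5.0, 5.1, 1.0, 1.1], [3.0, 3.1, 3.0, 3.1]]
s = signal_to_noise(expr, [1, 1, 0, 0])
# -> gene 0 scores far higher than gene 1
```

Because each gene is scored in isolation, filters like this are fast and classifier-agnostic, which is their defining characteristic relative to wrappers and embedded methods.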

 EGC      A Swarm Intelligence Framework for Reconstructing Gene Networks: Searching for
 8215     Biologically Plausible Architectures

        In this paper, we investigate the problem of reverse engineering the topology of gene regulatory networks from temporal
        gene expression data. We adopt a computational intelligence approach comprising swarm intelligence techniques,
        namely particle swarm optimization (PSO) and ant colony optimization (ACO). In addition, the recurrent neural network

       (RNN) formalism is employed for modeling the dynamical behavior of gene regulatory systems. More specifically, ACO is
       used for searching the discrete space of network architectures and PSO for searching the corresponding continuous
       space of RNN model parameters. We propose a novel solution construction process in the context of ACO for
       generating biologically plausible candidate architectures. The objective is to concentrate the search effort into areas of
       the structure space that contain architectures which are feasible in terms of their topological resemblance to real-world
       networks. The proposed framework is initially applied to the reconstruction of a small artificial network that has
       previously been studied in the context of gene network reverse engineering. Subsequently, we consider an artificial data
       set with added noise for reconstructing a subnetwork of the genetic interaction network of S. cerevisiae (yeast). Finally,
       the framework is applied to a real-world data set for reverse engineering the SOS response system of the bacterium
       Escherichia coli. Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding
       biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast
       space of network architectures.
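The RNN formalism used for the gene-network dynamics can be sketched in discrete time as x_i(t+1) = sigmoid(Σ_j w_ij·x_j(t) + b_i), where w_ij is the regulatory weight of gene j on gene i. The two-gene weights below are a toy illustration, not parameters from the paper:

```python
import math

def rnn_step(x, w, b):
    """One discrete-time step of the RNN gene-network model:
    x_i(t+1) = sigmoid(sum_j w[i][j]*x[j] + b[i])."""
    n = len(x)
    return [1.0 / (1.0 + math.exp(-(sum(w[i][j] * x[j] for j in range(n)) + b[i])))
            for i in range(n)]

# Toy two-gene network: gene 0 is repressed by gene 1, gene 1 is
# activated by gene 0
w = [[0.0, -2.0],
     [2.0,  0.0]]
b = [0.0, 0.0]
x = rnn_step([1.0, 0.0], w, b)
```

In the framework above, ACO proposes which entries of w are nonzero (the architecture) and PSO fits the values of the nonzero weights to the temporal expression data.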


EGC
       A Top-r Feature Selection Algorithm for Microarray Gene Expression Data
8216


       Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could
       perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection.
       Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample
       classifications. The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly
        of size h), then selects informative smaller subsets of genes (of size r < h) from a subset and merges the chosen genes
       with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into
       one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene
       expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the
       relevance of the selected genes in terms of their biological functions.
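The divide-select-merge loop described above can be sketched as follows; the scoring function here is a per-gene stand-in, whereas the paper's selection criterion evaluates subsets by classification accuracy:

```python
def top_r_selection(features, score, h=4, r=2):
    """Split features into chunks of size ~h, keep the top r of the
    current pool, merge them with the next chunk, and repeat until
    one informative subset of size r remains."""
    chunks = [features[i:i + h] for i in range(0, len(features), h)]
    pool = chunks[0]
    for chunk in chunks[1:]:
        top = sorted(pool, key=score, reverse=True)[:r]
        pool = top + chunk          # merge chosen genes with next subset
    return sorted(pool, key=score, reverse=True)[:r]

# Toy per-gene scores: higher is more informative
scores = {"g1": 0.9, "g2": 0.1, "g3": 0.8, "g4": 0.2,
          "g5": 0.95, "g6": 0.3, "g7": 0.4, "g8": 0.5}
best = top_r_selection(list(scores), scores.get, h=4, r=2)
# -> the two highest-scoring genes survive the merges: ['g5', 'g1']
```

Because every gene competes only within small pools, a gene that is weakly ranked globally can still survive a merge if it outperforms its local chunk, which is the shortcoming of conventional ranking the algorithm addresses.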


EGC
       Algorithms for Reticulate Networks of Multiple Phylogenetic Trees
8217


       A reticulate network N of multiple phylogenetic trees may have nodes with two or more parents (called reticulation
       nodes). There are two ways to define the reticulation number of N. One way is to define it as the number of reticulation
        nodes in N; in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate
        network of the trees. The better way is to define it as the total number of parents of reticulation nodes in N minus the
        number of reticulation nodes in N; in this case, a reticulate network with the smallest reticulation number is called an
       optimal type-II reticulate network of the trees. In this paper, we first present a fast fixed-parameter algorithm for
       constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm
       together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal
       type-II reticulate network of the input trees. To our knowledge, these are the first fixed-parameter algorithms for the
       problems. We have implemented the algorithms in ANSI C, obtaining programs CMPT and MaafB. Our experimental data
       show that CMPT can construct optimal type-I reticulate networks rapidly and MaafB can compute better lower bounds
       for optimal type-II reticulate networks within shorter time than the previously best program PIRN designed by Wu.



EGC    Algorithms to Detect Multi-protein Modularity Conserved during Evolution
8218


       Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that
       is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a
       measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein
       modularity conserved during evolution that improves on the running time of previous algorithms for related problems
       and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation
       measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all
       measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by
       previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads
       to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through
       randomization experiments, we demonstrate that conserved modularity is a defining characteristic of interactomes.
       Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila
       melanogaster, combining results across algorithms, show that nearly 10 percent of current interactome proteins
       participate in multiprotein modules with good evidence in the protein interaction data of being conserved between
       human and Drosophila.


EGC    An Efficient Algorithm for Haplotype Inferenceon Pedigrees with Recombinations and
8219   Mutations

       Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies. Pedigrees
       allow haplotypes to be inferred from genotypes more accurately than population data, since Mendelian inheritance restricts the
       set of possible solutions. In this work, we define a new HI problem on pedigrees, called Minimum-Change Haplotype
       Configuration (MCHC) problem, that allows two types of genetic variation events: recombinations and mutations. Our
       new formulation extends the Minimum-Recombinant Haplotype Configuration (MRHC) problem, which has been proposed
       in the literature to overcome the limitations of classic statistical haplotyping methods. Our contribution is twofold. First,
       we prove that the MCHC problem is APX-hard under several restrictions. Second, we propose an efficient and accurate
       heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used
       to solve the original MRHC problem and can take advantage of additional knowledge about the input genotypes.
       Moreover, the L-reduction proves for the first time that MCHC and MRHC are O(nm/log nm)-approximable on general
       pedigrees, where n is the pedigree size and m is the genotype length. Finally, we present an extensive experimental
       evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees.







EGC    An Efficient Method for Exploring the Space of Gene Tree/Species Tree Reconciliations in a
8220   Probabilistic Framework

       Background. Inferring an evolutionary scenario for a gene family is a fundamental problem with applications both in
       functional and evolutionary genomics. The gene tree/species tree reconciliation approach has been widely used to
       address this problem, but mostly in a discrete parsimony framework that aims at minimizing the number of gene
       duplications and/or gene losses. Recently, a probabilistic approach has been developed, based on the classical birth-
       and-death process, including efficient algorithms for computing posterior probabilities of reconciliations and orthology
       prediction. Results. In previous work, we described an algorithm for exploring the whole space of gene tree/species tree
       reconciliations, which we adapt here to compute the posterior probability of such reconciliations efficiently. These
       posterior probabilities can be either computed exactly or approximated, depending on the reconciliation space size. We
       use this algorithm to analyze the probabilistic landscape of the space of reconciliations for a real data set of fungal gene
       families and several data sets of synthetic gene trees. Conclusion. The results of our simulations suggest that, with
       exact gene trees obtained by a simple birth-and-death process and realistic gene duplication/loss rates, a very small
       subset of all reconciliations needs to be explored in order to approximate very closely the posterior probability of the
       most likely reconciliations. For cases where the posterior probability mass is more evenly dispersed, our method
       efficiently explores the required subspace of reconciliations.


EGC
8221
       An Efficient Method for Modeling Kinetic Behavior of Channel Proteins in
       Cardiomyocytes

       Characterization of the kinetic and conformational properties of channel proteins is a crucial element in the integrative
       study of congenital cardiac diseases. The proteins of the ion channels of cardiomyocytes represent an important family
       of biological components determining the physiology of the heart. Some computational studies aiming to understand
       the mechanisms of the ion channels of cardiomyocytes have concentrated on Markovian stochastic approaches.
       Mathematically, these approaches employ Chapman-Kolmogorov equations coupled with partial differential equations.
       As the scale and complexity of such subcellular and cellular models increase, the balance between efficiency and
       accuracy of algorithms becomes critical. We have developed a novel two-stage splitting algorithm to address efficiency
       and accuracy issues arising in such modeling and simulation scenarios. Numerical experiments were performed based
       on the incorporation of our newly developed conformational kinetic model for the rapid delayed rectifier potassium
       channel into the dynamic models of human ventricular myocytes. Our results show that the new algorithm significantly
       outperforms commonly adopted adaptive Runge-Kutta methods. Furthermore, our parallel simulations with coupled
       algorithms for multicellular cardiac tissue demonstrate a high linearity in the speedup of large-scale cardiac simulations.
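The abstract does not spell out the two-stage splitting scheme itself; the underlying idea of operator splitting, advancing two subsystems alternately within each time step, can be sketched with classic Strang splitting. The decay rates and exact-exponential sub-solvers below are illustrative choices, not the paper's cardiomyocyte model.

```python
import math

def strang_step(y, step_a, step_b, dt):
    """One Strang splitting step: half-step of subsystem A, full step of
    subsystem B, then another half-step of A. Second-order accurate in dt."""
    y = step_a(y, dt / 2)
    y = step_b(y, dt)
    return step_a(y, dt / 2)

# Illustration: dy/dt = -(a + b) * y, split into decay by a and decay by b,
# each solved exactly over its sub-step.
a, b, dt = 1.0, 2.0, 0.1
decay_a = lambda y, h: y * math.exp(-a * h)
decay_b = lambda y, h: y * math.exp(-b * h)
```

Because these two sub-flows commute, one step reproduces the exact solution exp(-(a+b)·dt); for coupled Chapman-Kolmogorov/PDE systems the sub-flows generally do not commute and the splitting error is O(dt³) per step.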



EGC     Cluster-Oriented Ensemble Classifier: Impact of Multicluster Characterization on
8222
        Ensemble Classifier Learning




        All clustering methods must assume some cluster relationship among the data objects to which they are applied.
        Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel
        multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional
        dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the
        latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects
        being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical
        analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are
        proposed based on this new measure. We compare them with several well-known clustering algorithms that use other
        popular similarity measures on various document collections to verify the advantages of our proposal.
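The multiviewpoint idea can be sketched in a few lines. This is an illustrative reading, not the paper's exact criterion functions: each viewpoint (an object assumed to lie outside the cluster of the two objects being compared) measures the pair relative to itself, and the per-viewpoint scores are averaged. The function name and plain-list vectors are our own.

```python
def multiviewpoint_similarity(di, dj, viewpoints):
    """Similarity of di and dj averaged over many viewpoints.

    A traditional measure views di and dj from a single viewpoint (the
    origin); here each viewpoint dh contributes the inner product of
    (di - dh) and (dj - dh), i.e., di and dj as seen from dh."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def shift(v, origin):
        return [x - o for x, o in zip(v, origin)]

    total = sum(dot(shift(di, dh), shift(dj, dh)) for dh in viewpoints)
    return total / len(viewpoints)
```

The measure is symmetric in di and dj, as a similarity should be.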

EGC     An Information Theoretic Approach to Constructing Robust Boolean Gene Regulatory
8223
        Networks
        We introduce a class of finite systems models of gene regulatory networks exhibiting behavior of the cell cycle. The
        network is an extension of a Boolean network model. The system spontaneously cycles through a finite set of internal
        states, tracking the increase of an external factor such as cell mass, and also exhibits checkpoints in which errors in
        gene expression levels due to cellular noise are automatically corrected. We present a 7-gene network based on
        Projective Geometry codes, which can correct, at any given time, one gene expression error. The topology of the
        network is highly symmetric and requires using only simple Boolean functions that can be synthesized using genes of
        various organisms. The attractor structure of the Boolean network contains a single cycle attractor. It is the smallest
        nontrivial network with such high robustness. The methodology allows construction of artificial gene regulatory
        networks with a larger number of phases than the natural cell cycle.
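The 7-gene Projective Geometry construction is not reproduced here, but the notion of a cycle attractor in a synchronous Boolean network is easy to illustrate: iterate the update function until a state recurs and return the cycle. The toy 3-gene rotation network below is an assumption for demonstration only.

```python
def cycle_attractor(update, state):
    """Iterate a synchronous Boolean network from `state` until some state
    recurs; return the list of states on the attractor cycle."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = update(state)
    return trajectory[seen[state]:]

# Toy 3-gene network: each gene copies its left neighbour, so any
# non-uniform start state falls onto a 3-state rotation cycle.
def rotate(state):
    a, b, c = state
    return (c, a, b)
```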


 EGC
 8224
         Antilope—A Lagrangian Relaxation Approach to the de novo Peptide Sequencing
         Problem

        Peptide sequencing from mass spectrometry data is a key step in proteome research. Especially de novo sequencing,
        the identification of a peptide from its spectrum alone, is still a challenge even for state-of-the-art algorithmic
        approaches. In this paper, we present antilope, a new fast and flexible approach based on mathematical programming. It
        builds on the spectrum graph model and works with a variety of scoring schemes. ANTILOPE combines Lagrangian
        relaxation for solving an integer linear programming formulation with an adaptation of Yen's k shortest paths algorithm.
        It shows a significant improvement in running time compared to mixed integer optimization and performs at the same
        speed as other state-of-the-art tools. We also implemented a generic probabilistic scoring scheme that can be trained
        automatically for a data set of annotated spectra and is independent of the mass spectrometer type. Evaluations on
        benchmark data show that antilope is competitive to the popular state-of-the-art programs PepNovo and NovoHMM both
        in terms of runtime and accuracy. Furthermore, it offers increased flexibility in the number of considered ion types.
        ANTILOPE will be freely available as part of the open source proteomics library OpenMS.
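ANTILOPE adapts Yen's k shortest paths algorithm on a spectrum graph. Full Yen's algorithm is lengthy; since a spectrum graph is a DAG, the k-best idea can be sketched with a simpler dynamic program that keeps the k cheapest partial paths at each node. The graph encoding and function name are illustrative, not the tool's implementation.

```python
import heapq

def k_best_paths(adj, topo_order, source, target, k):
    """k lowest-cost source->target paths in a DAG.

    adj maps node -> list of (successor, edge_cost); topo_order is a
    topological ordering of the nodes. Each node retains only its k best
    partial paths, so the work per node stays bounded."""
    best = {u: [] for u in topo_order}
    best[source] = [(0.0, (source,))]
    for u in topo_order:
        best[u] = heapq.nsmallest(k, best[u])   # prune before expanding
        for cost, path in best[u]:
            for v, w in adj.get(u, ()):
                best[v].append((cost + w, path + (v,)))
    return best[target]
```

On a diamond graph with two source-to-target routes, asking for k = 2 returns both, cheapest first.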


 EGC
 8225
        Assortative Mixing in Directed Biological Networks





       We analyze assortative mixing patterns of biological networks, which are typically directed. We develop a theoretical
       background for analyzing mixing patterns in directed networks before applying them to specific biological networks.
       Two new quantities are introduced, namely the in-assortativity and the out-assortativity, which are shown to be useful in
       quantifying assortative mixing in directed networks. We also introduce the local (node level) assortativity quantities for
       in- and out-assortativity. Local assortativity profiles are the distributions of these local quantities over node degrees and
       can be used to analyze both canonical and real-world directed biological networks. Many biological networks, which
       have been previously classified as disassortative, are shown to be assortative with respect to these new measures.
       Finally, we demonstrate the use of local assortativity profiles in analyzing the functionalities of particular nodes and
       groups of nodes in real-world biological networks.
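One way to read the out-assortativity quantity is as an edge-wise degree correlation. The sketch below computes it as the Pearson correlation, over directed edges, between the out-degrees of the two endpoints; the paper's exact normalization may differ, so treat this as an illustration of the concept.

```python
from collections import Counter

def out_assortativity(edges):
    """Pearson correlation, over directed edges (u, v), between
    out-degree(u) and out-degree(v)."""
    out_deg = Counter(u for u, _ in edges)
    xs = [out_deg[u] for u, _ in edges]
    ys = [out_deg[v] for _, v in edges]   # out_deg[v] is 0 for pure sinks
    n = len(edges)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)
```

A negative value means high-out-degree nodes tend to point at low-out-degree nodes (disassortative); a positive value means like links to like.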



EGC    BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences
8226



       Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, is able to
       compute, in a fast and efficient way, the coverage of a source sequence S on a target sequence T, by taking into account
       direct and reverse segments, possibly overlapping. Using BpMatch, the operator should define a priori the minimum
       length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and having
       a number of occurrences greater than minRep are considered to be significant. BpMatch outputs the significant
       segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is a
       constant, the time required by BpMatch to calculate the coverage is O(l^2 n). On average, by setting l >= 2 log_d(n),
       the time required to calculate the coverage is only O(n). BpMatch, thanks to the minRep parameter, can
       also be used to perform a self-covering: to cover a sequence using segments coming from itself, by avoiding the trivial
       solution of having a single segment coincident with the whole sequence.
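A naive version of the segment coverage BpMatch computes, without the modified suffix tree that makes it fast, can be sketched as follows. The quadratic scan and toy sequences are illustrative only.

```python
def segment_coverage(S, T, l, minRep):
    """Fraction of T covered by length-l segments of S, counting direct and
    reverse-complement matches, that occur at least minRep times in T.
    Naive scan; BpMatch achieves the same with a modified suffix tree."""
    comp = str.maketrans("ACGT", "TGCA")
    segments = {S[i:i + l] for i in range(len(S) - l + 1)}
    segments |= {s.translate(comp)[::-1] for s in segments}  # reverse strand
    covered = [False] * len(T)
    for seg in segments:
        hits = [i for i in range(len(T) - l + 1) if T[i:i + l] == seg]
        if len(hits) >= minRep:          # only significant segments count
            for i in hits:
                covered[i:i + l] = [True] * l
    return sum(covered) / len(T)
```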

EGC
        Clustering 100,000 Protein Structure Decoys in Minutes
8227



       Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called
       decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods
       are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large.
       In addition, the existing clustering approaches suffer from the arbitrariness in determining a distance threshold for
       proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold
       results in the merging of several independent clusters into one cluster. In this paper, we propose an efficient clustering
       method through fast estimation of cluster centroids and efficient pruning of rotation spaces. The number of clusters is
       automatically detected by information distance criteria. A package named ONION, which can be downloaded freely, is
       implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than




        existing tools, and that ONION obtains better selections for 31 targets and worse selections for 19 targets compared
        with SPICKER. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.

 EGC
          Composition Vector Method Based on Maximum Entropy Principle for Sequence
 8228
          Comparison

        The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity
        when compared with multiple sequence alignment methods, the method has been widely discussed lately; and some
        formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve
        these formulas by using the entropy principle, which can quantify the nonrandom occurrence of patterns in the
        sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the
        one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any
        given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's
        formula itself maximizes the entropy, and we derive a new entropy-maximizing formula from Yu's formula. We
        illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data
        sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For
        the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly,
        where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more
        accurate than Hao's and Yu's formulas even for small data sets.
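Hao's composition vector, which the paper shows to be entropy-maximizing, measures how far each k-mer's observed frequency deviates from a Markov prediction built from its (k-1)- and (k-2)-mer frequencies. A minimal sketch for k >= 3 (variable names and the toy sequence are ours):

```python
from collections import Counter

def composition_vector(seq, k):
    """Hao-style CV entry for each observed k-mer w = a1..ak:
    (p(w) - p0(w)) / p0(w), with the Markov prediction
    p0(w) = p(a1..a_{k-1}) * p(a2..ak) / p(a2..a_{k-1})."""
    def freqs(n):
        counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    f_k, f_k1, f_k2 = freqs(k), freqs(k - 1), freqs(k - 2)
    cv = {}
    for w, p in f_k.items():
        p0 = f_k1[w[:-1]] * f_k1[w[1:]] / f_k2[w[1:-1]]
        cv[w] = (p - p0) / p0
    return cv
```

Two sequences are then compared by a standard vector distance (e.g., cosine) between their composition vectors.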



 EGC
          Constructing and Drawing Regular Planar Split Networks
 8229


        Split networks are commonly used to visualize collections of bipartitions, also called splits, of a finite set. Such
        collections arise, for example, in evolutionary studies. Split networks can be viewed as a generalization of phylogenetic
        trees and may be generated using the SplitsTree package. Recently, the NeighborNet method for generating split
        networks has become rather popular, in part because it is guaranteed to always generate a circular split system, which
        can always be displayed by a planar split network. Even so, labels must be placed on the "outside" of the network,
        which might be problematic in some applications. To help circumvent this problem, it can be helpful to consider so-
        called flat split systems, which can be displayed by planar split networks where labels are allowed on the inside of the
        network too. Here, we present a new algorithm that is guaranteed to compute a minimal planar split network displaying a
        flat split system in polynomial time, provided the split system is given in a certain format. We will also briefly discuss
        two heuristics that could be useful for analyzing phylogeographic data and that allow the computation of flat split
        systems in this format in polynomial time.



EGC     Constructing Complex 3D Biological Environments from Medical Imaging Using High
8230
        Performance Computing




       Extracting information about the structure of biological tissue from static image data is a complex task requiring
       computationally intensive operations. Here, we present how multicore CPUs and GPUs have been utilized to extract
       information about the shape, size, and path followed by the mammalian oviduct, called the fallopian tube in humans,
       from histology images, to create a unique but realistic 3D virtual organ. Histology images were processed to identify the
       individual cross sections and determine the 3D path that the tube follows through the tissue. This information was then
       related back to the histology images, linking the 2D cross sections with their corresponding 3D position along the
       oviduct. A series of linear 2D spline cross sections, which were computationally generated for the length of the oviduct,
       were bound to the 3D path of the tube using a novel particle system technique that provides smooth resolution of self-
       intersections. This results in a unique 3D model of the oviduct, which is grounded in reality. The GPU is used for the
       processor intensive operations of image processing and particle physics based simulations, significantly reducing the
       time required to generate a complete model.


EGC
        CSD Homomorphisms between Phylogenetic Networks
8231


       Since Darwin, species trees have been used as a simplified description of the relationships which summarize the
       complicated network N of reality. Recent evidence of hybridization and lateral gene transfer, however, suggests that
       there are situations where trees are inadequate. Consequently, it is important to determine properties that characterize
       networks closely related to N and possibly more complicated than trees but lacking the full complexity of N. A
       connected surjective digraph map (CSD) is a map f from one network N to another network M such that every arc is
       either collapsed to a single vertex or is taken to an arc, such that f is surjective, and such that the inverse image of a
       vertex is always connected. CSD maps are shown to behave well under composition. It is proved that if there is a CSD
       map from N to M, then there is a way to lift an undirected version of M into N, often with added resolution. A CSD map
       from N to M puts strong constraints on N. In general, it may be useful to study classes of networks such that, for any N,
       there exists a CSD map from N to some standard member of that class.

EGC
        Designing Filters for Fast Known ncRNA Identification
8232


        Detecting members of known noncoding RNA (ncRNA) families in genomic DNA is an important part of sequence
       annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high
       computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequences
       that are unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite
       recent advances, designing an efficient filter that can detect ncRNA instances lacking strong conservation while
       excluding most irrelevant sequences remains challenging. In this work, we design three types of filters based on
       multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e., a position weight matrix) with
       secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters
       combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are
        extensively tested on the BRAliBase III data set, Rfam 9.0, and a published soil metagenomic data set. In addition, we



       compare the SSP-based filters with several other ncRNA search tools including Infernal (with profile HMMs as filters),
       ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant
       speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families.
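The SSP filters build on position weight matrices that can be scanned quickly against long sequences. A minimal PWM scan, without the secondary-structure augmentation, looks like this; the matrix contents and threshold are illustrative assumptions.

```python
def pwm_scan(seq, pwm, threshold):
    """Slide a position weight matrix (one log-odds dict per column) along
    seq; return start positions scoring at or above the threshold. Only
    such windows would be handed to the expensive covariance model."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(col.get(seq[i + j], float("-inf"))
                    for j, col in enumerate(pwm))
        if score >= threshold:
            hits.append(i)
    return hits
```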


EGC    Detection of Outlier Residues for Improving Interface Prediction in Protein
8233   Heterocomplexes

       Sequence-based understanding and identification of protein binding interfaces is a challenging research topic
       due to the complexity in protein systems and the imbalanced distribution between interface and noninterface residues.
       This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned
       training data are then used for improving the prediction performance. We use three novel measures to describe the
       extent to which a residue is an outlier in comparison to the other residues: the distance of a residue instance from
       the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue
       instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are
       computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed.
       The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM
       ensemble trained on input data without outliers performs better than that with outliers. Our method is also more
       accurate than many literature methods on benchmark data sets. From our empirical studies, we found that some outlier
       interface residues are truly near to noninterface regions, and some outlier noninterface residues are close to interface
       regions.
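Of the three measures, Dist is the simplest to state: the distance of a residue instance from the center of its own class. A hedged sketch follows; the feature encoding and the combination with PCL and IWB scores are not reproduced here.

```python
from collections import defaultdict

def dist_scores(X, labels):
    """Dist measure: Euclidean distance of each instance from the centroid
    of the instances sharing its class label; large values flag candidate
    outliers for removal before training the SVM ensemble."""
    groups = defaultdict(list)
    for x, y in zip(X, labels):
        groups[y].append(x)
    centroids = {y: [sum(col) / len(pts) for col in zip(*pts)]
                 for y, pts in groups.items()}
    return [sum((a - b) ** 2 for a, b in zip(x, centroids[y])) ** 0.5
            for x, y in zip(X, labels)]
```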


EGC
8234
        DICLENS: Divisive Clustering Ensemble with Automatic Cluster Number


       Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard
       task as attested by hundreds of clustering algorithms in the literature. Each clustering technique makes some
       assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in
       some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods on
       the same data set, or the same method with varying input parameters or both. We propose a novel method, DICLENS,
       which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final
       clustering automatically and does not take any input parameters, a feature missing in many existing algorithms.
       Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces
       very good quality clusterings in a short amount of time. The DICLENS implementation is scalable, runs on standard
       personal computers, and consumes very little memory and CPU.
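The abstract does not describe DICLENS's divisive procedure in detail; clustering ensembles of this kind typically start from a co-association (evidence accumulation) matrix, which is easy to sketch. The function below is that common starting point, not DICLENS itself.

```python
def coassociation(clusterings):
    """Entry (i, j) is the fraction of base clusterings assigning items i
    and j to the same cluster; an ensemble method then clusters this
    matrix to produce the final consensus clustering."""
    n = len(clusterings[0])   # number of items (same in every clustering)
    m = len(clusterings)      # number of base clusterings
    return [[sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
            for i in range(n)]
```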



EGC     Disease Liability Prediction from Large Scale Genotyping Data Using Classifiers with a
8235    Reject Option




        Many genome-wide association studies (GWA) try to identify the genetic polymorphisms associated with variation in
        phenotypes. However, the most significant genetic variants may have a small predictive power to forecast the future
        development of common diseases. We study the prediction of the risk of developing a disease given genome-wide
        genotypic data using classifiers with a reject option, which only make a prediction when they are sufficiently certain, but
        in doubtful situations may reject making a classification. To test the reliability of our proposal, we used the Wellcome
        Trust Case Control Consortium (WTCCC) data set, comprising 14,000 cases of seven common human diseases and
        3,000 shared controls.
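A classifier with a reject option can be sketched as a thin wrapper around any probabilistic classifier: predict only when the top class probability clears a confidence threshold, and otherwise abstain. The threshold value is illustrative; the paper's base classifiers and tuning are not reproduced.

```python
def predict_or_reject(class_probs, threshold=0.8):
    """Return the index of the most probable class if its probability is at
    least `threshold`; otherwise return None, i.e., refuse to classify."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    return best if class_probs[best] >= threshold else None
```

Raising the threshold trades coverage (fewer predictions) for reliability (higher accuracy on the predictions that are made).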


 EGC    Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label
 8236
        Learning

        In the studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression
        patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time.
        Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology
        terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions,
        interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become
        increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the
        automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to
        local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which
        term corresponds to which region of which image in the group. In this paper, we address this problem using a new
        machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the
        annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the
        MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized
        Drosophila gene expression pattern images) reveal that the exploitation of MIML framework leads to significant
        performance improvement over state-of-the-art approaches.


EGC     Efficient Approaches for Retrieving Protein Tertiary Structures
8237


        The 3D conformation of a protein in space is the main factor that determines its function in living organisms. Due
        to the huge amount of newly discovered proteins, there is a need for fast and accurate computational methods for
        retrieving protein structures. Their purpose is to speed up the process of understanding the structure-to-function
        relationship which is crucial in the development of new drugs. There are many algorithms addressing the problem of
        protein structure retrieval. In this paper, we present several novel approaches for retrieving protein tertiary structures.
        We present our voxel-based descriptor. Then we present our protein ray-based descriptors which are applied on the
        interpolated protein backbone. We introduce five novel wavelet descriptors which perform wavelet transforms on the
        protein distance matrix. We also propose an efficient algorithm for distance matrix alignment named Matrix Alignment
        by Sequence Alignment within Sliding Window (MASASW), which has been shown to be much faster than DALI, CE, and




       MatAlign. We compared our approaches with one another and with several existing algorithms, and they generally
       prove to be fast and accurate. MASASW achieves the highest accuracy. The ray and wavelet-based descriptors as well
       as MASASW are more accurate than CE.



EGC     Efficient Genotype Elimination via Adaptive Allele Consolidation
8238



       We propose the technique of Adaptive Allele Consolidation, which greatly improves the performance of the Lange-
       Goradia algorithm for genotype elimination in pedigrees, while still producing equivalent output. Genotype elimination
       consists in removing from a pedigree those genotypes that are impossible according to the Mendelian law of
       inheritance. This is used to find errors in genetic data and is useful as a preprocessing step in other analyses (such as
       linkage analysis or haplotype imputation). The problem of genotype elimination is intrinsically combinatorial, and Allele
       Consolidation is an existing technique where several alleles are replaced by a single "lumped" allele in order to reduce
       the number of combinations of genotypes that have to be considered, possibly at the expense of precision. In existing
       Allele Consolidation techniques, alleles are lumped once and for all before performing genotype elimination. The idea of
       Adaptive Allele Consolidation is to dynamically change the set of alleles that are lumped together during the execution
       of the Lange-Goradia algorithm, so that both high performance and precision are achieved. We have implemented the
       technique in a tool called Celer and evaluated it on a large set of scenarios, with good results.
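The core of Lange-Goradia-style genotype elimination on a single parent-child trio can be sketched as follows; the pedigree-wide fixpoint iteration and Celer's adaptive allele lumping are omitted, and the genotype encoding is our own.

```python
def eliminate_child_genotypes(child, mother, father):
    """Keep only the child genotypes (unordered allele pairs) for which some
    combination of parental genotypes passes one allele from each parent,
    as the Mendelian law of inheritance requires."""
    kept = set()
    for c1, c2 in child:
        for m in mother:
            for f in father:
                if (c1 in m and c2 in f) or (c2 in m and c1 in f):
                    kept.add((c1, c2))
    return kept
```

For a mother homozygous for allele 1 and a father homozygous for allele 2, only the heterozygous child genotype survives.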

EGC
8239    Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree


        Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data
        compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local
        repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats
        captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix
        tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the
        best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on
        finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the
        Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the
        space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per
        base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the
        timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing
        massive texts such as the whole human genome, since the prior methods must use external memory. For the first time,
        our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to
        find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as
        general-purpose open-source software for public use.
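
        The notion of a maximal repeat can be made concrete with a deliberately naive sketch, nothing like the
        BWT/wavelet-tree method of this project (which is what makes it scale): a substring is a maximal repeat if it
        occurs at least twice and its occurrences cannot all be extended by the same character on the left or right.

```python
def maximal_repeats(text):
    """Naive maximal-repeat finder for tiny strings (illustration only).

    A substring is a maximal repeat if it occurs at least twice and is
    both left-maximal and right-maximal: its occurrences are preceded
    (resp. followed) by at least two distinct characters or a boundary."""
    n = len(text)
    occ = {}
    # Collect occurrence start positions for every distinct substring.
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ.setdefault(text[i:j], []).append(i)
    result = set()
    for s, starts in occ.items():
        if len(starts) < 2:
            continue
        lefts = {text[i - 1] if i > 0 else None for i in starts}
        rights = {text[i + len(s)] if i + len(s) < n else None for i in starts}
        if len(lefts) > 1 and len(rights) > 1:   # cannot extend either way
            result.add(s)
    return result
```

        On "banana" this yields the textbook answer {"a", "ana"}; the project's contribution is doing the same on
        gigabyte-scale texts in roughly twice the text size of memory.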



 EGC        Eigen-Genomic System Dynamic-Pattern Analysis (ESDA): Modeling mRNA Degradation
 8240       and Self-Regulation
        High-throughput methods systematically measure the internal state of the entire cell, but powerful computational
        tools are needed to infer dynamics from their raw data. Therefore, we have developed a new computational method,
        Eigen-genomic System Dynamic-pattern Analysis (ESDA), which uses systems theory to infer dynamic parameters from
        a time series of gene expression measurements. As many genes are measured at a modest number of time points,
        estimation of the system matrix is underdetermined and traditional approaches for estimating dynamic parameters are
        ineffective; thus, ESDA uses the principle of dimensionality reduction to overcome the data imbalance. Since
        degradation rates are naturally confounded by self-regulation, our model estimates an effective degradation rate that is
        the difference between self-regulation and degradation. We demonstrate that ESDA is able to recover effective
        degradation rates with reasonable accuracy in simulation. We also apply ESDA to a budding yeast data set, and find that
        effective degradation rates are normally slower than experimentally measured degradation rates. Our results suggest
        either that self-regulation is widespread in budding yeast and self-promotion dominates self-inhibition, or that self-
        regulation is rare and experimental methods for measuring degradation rates based on transcription arrest
        severely overestimate true degradation rates in healthy cells.
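
        The dimensionality-reduction idea can be sketched on toy data (this is an illustrative sketch, not the authors'
        ESDA code): estimate the system matrix of a linear model dx/dt ≈ Ax from a short time series by a rank-truncated
        pseudoinverse, then read the effective degradation rates (self-regulation minus degradation) off the diagonal.

```python
import numpy as np

# Toy setting: many genes, few time points, so the least-squares problem
# A ≈ D X0⁺ is underdetermined; truncating the SVD to a small rank is the
# dimensionality-reduction step that makes the estimate well defined.
rng = np.random.default_rng(0)
genes, timepoints, dt, rank = 20, 8, 0.1, 5

X = rng.standard_normal((genes, timepoints))   # toy expression time series
D = (X[:, 1:] - X[:, :-1]) / dt                # finite-difference dx/dt
X0 = X[:, :-1]

U, s, Vt = np.linalg.svd(X0, full_matrices=False)
s_inv = np.where(np.arange(len(s)) < rank, 1.0 / s, 0.0)  # keep top modes
A_est = D @ Vt.T @ np.diag(s_inv) @ U.T        # rank-truncated least squares

effective_rates = np.diag(A_est)               # self-regulation − degradation
```

        The names and the rank choice here are assumptions for illustration; the paper works with real expression data
        and a principled choice of the retained dimensions.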

EGC
8241
        Empirical Evidence of the Applicability of Functional Clustering through Gene Expression
        Classification

        The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene
        interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and
        interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of
        functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive
        accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes
        are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning.
        Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering
        without biological relevance. We also show that functional clustering performs comparably to gene expression
        clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of
        functional clustering as a feature extraction technique is evaluated and discussed.


 EGC
            Evaluating Path Queries over Frequently Updated Route Collections
 8242



        The recent advances in the infrastructure of Geographic Information Systems (GIS), and the proliferation of GPS
        technology, have resulted in the abundance of geodata in the form of sequences of points of interest (POIs), waypoints,
        etc. We refer to sets of such sequences as route collections. In this work, we consider path queries on frequently
        updated route collections: given a route collection and two points n_s and n_t, a path query returns a path, i.e., a
        sequence of points, that connects n_s to n_t. We introduce two path query evaluation paradigms that enjoy the benefits
        of search algorithms (i.e., fast index maintenance) while utilizing transitivity information to terminate the search sooner.


         Efficient indexing schemes and appropriate updating procedures are introduced. An extensive experimental evaluation
         verifies the advantages of our methods compared to conventional graph-based search.
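
         A minimal baseline for the path-query setting is plain breadth-first search over an adjacency index built from
         the routes (cheap to maintain under frequent updates); the project's contribution is the transitivity-aware
         indexing layered on top of such a search. Names below are illustrative.

```python
from collections import deque

def build_graph(routes):
    """Index a route collection as an adjacency map.
    Each route is a sequence of point IDs; consecutive points form directed
    edges, so the index is trivial to refresh when routes change."""
    adj = {}
    for route in routes:
        for a, b in zip(route, route[1:]):
            adj.setdefault(a, set()).add(b)
    return adj

def path_query(adj, ns, nt):
    """Return one path from ns to nt by breadth-first search, or None."""
    parent = {ns: None}
    queue = deque([ns])
    while queue:
        node = queue.popleft()
        if node == nt:
            path = []
            while node is not None:     # walk parents back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None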

EGC      Exploiting Intrastructure Information for Secondary Structure Prediction with Multifaceted
8243
         Pipelines

         Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular, for
         tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of
         sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel
         perspective, in which understanding how available information sources are dealt with plays a central role. After
         revisiting a well-known secondary structure predictor viewed from this perspective (with the goal of identifying which
         sources of information have been considered and which have not), we propose a generic software architecture designed
         to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with
         the proposed generic architecture has been implemented and compared with several state-of-the-art secondary
         structure predictors. Experiments have been carried out on standard data sets, and the corresponding results confirm
         the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/ through the corresponding web
         application or as a downloadable stand-alone, portable, unpack-and-run bundle.



 EGC      Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic
 8244     Topic Modeling


  EGC        Fast Local Search for Unrooted Robinson-Foulds Supertrees
  8245



         A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that
         is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of
         splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective
         local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted.
         We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction,

                    IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
Elysium Technologies Private Limited
                            Approved by ISO 9001:2008 and AICTE for SKP Training
                           Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai
                           http://www.elysiumtechnologies.com, info@elysiumtechnologies.com


       thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that
       our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in
       terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.
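
        The RF objective itself is easy to state in code. Below is a sketch for tiny unrooted trees given as adjacency
        maps (illustrative only; the hard part the project addresses is searching supertree space, not computing the
        distance): the RF distance is the size of the symmetric difference of the trees' nontrivial splits.

```python
def rf_distance(adj1, adj2, leaves):
    """Robinson-Foulds distance between two unrooted trees on `leaves`,
    each given as an adjacency map {node: set(neighbours)}."""
    def splits(adj):
        out = set()
        for u in adj:
            for v in adj[u]:
                if u < v:                      # visit each edge once
                    # Leaves reachable from v without crossing edge (u, v).
                    seen, stack, side = {u, v}, [v], set()
                    while stack:
                        x = stack.pop()
                        if x in leaves:
                            side.add(x)
                        for y in adj[x]:
                            if y not in seen:
                                seen.add(y)
                                stack.append(y)
                    if 2 <= len(side) <= len(leaves) - 2:   # nontrivial split
                        canon = frozenset(side if min(leaves) in side
                                          else leaves - side)
                        out.add(canon)
        return out
    return len(splits(adj1) ^ splits(adj2))
```

        For the quartets ((a,b),(c,d)) and ((a,c),(b,d)) this gives the maximal distance of 2, since their single
        internal splits disagree.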


EGC    Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on
8246
       GPU with CUDA and ELLPACK-R Sparse Format

        Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks.
        However, with the increasing amount of data on biological networks, performance and scalability issues are becoming
        a critical limiting factor in applications. Meanwhile, GPU computing, which uses CUDA to implement a massively
        parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option for
        achieving substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently
        lowers latency, thus circumventing a major issue in other parallel computing environments such as MPI. We introduce a
        very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform parallel sparse matrix-matrix computations
        and parallel sparse Markov matrix normalizations, which are at the heart of MCL. We utilized the ELLPACK-R sparse
        format to allow effective, fine-grained, massively parallel processing that copes with the sparse nature of interaction
        network data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original
        MCL running on a CPU. Thus, large-scale parallel computation on off-the-shelf desktop machines, previously only
        possible on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with
        their data.
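
        For reference, the expansion/inflation loop at the heart of MCL can be sketched with dense NumPy. This is the
        plain CPU baseline, not CUDA-MCL, which performs the same matrix operations on ELLPACK-R sparse matrices
        on the GPU.

```python
import numpy as np

def mcl(adj, inflation=2.0, iters=50):
    """Plain dense-NumPy Markov clustering (illustration; real
    implementations use sparse matrices)."""
    M = adj + np.eye(len(adj))            # self-loops aid convergence
    M = M / M.sum(axis=0)                 # column-stochastic Markov matrix
    for _ in range(iters):
        M = M @ M                         # expansion: let flow spread
        M = M ** inflation                # inflation: strengthen strong flow
        M = M / M.sum(axis=0)             # renormalise columns
    # Rows retaining mass are attractors; each such row's nonzero columns
    # form one cluster.
    return {frozenset(int(j) for j in np.nonzero(M[i] > 1e-6)[0])
            for i in np.nonzero(M.sum(axis=1) > 1e-6)[0]}

# Classic demo graph: two triangles joined by a single bridge edge.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0
clusters = mcl(adj)
```

        On this graph MCL recovers the two triangles as clusters, with the bridge assigned to neither cluster's far side.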


EGC
        Faster Mass Spectrometry-Based Protein Inference: Junction Trees Are More Efficient
8247
        than Sampling and Marginalization by Enumeration

       The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an
       inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make
       use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte
       Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the
       cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different
       statistical inference methods using a common graphical model, and we demonstrate that junction tree inference
       substantially improves rates of convergence compared to existing methods.



EGC    Gene Classification Using Parameter-Free Semi-Supervised Manifold Learning
8248



EGC
8249    GSGS: A Computational Approach to Reconstruct Signaling Pathway Structures from
        Gene Sets

        Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells.
        The existing computational approaches often rely on unrealistic biological assumptions and do not explicitly consider
        signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from the cell surface
        to the nucleus and characterize a signaling pathway. In this paper, we propose a novel approach, Gene Set Gibbs
        Sampling (GSGS), to reverse engineer signaling pathway structures from gene sets related to the pathways. We
        hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events which
        we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets, referred to as Information
        Flow Gene Sets (IFGSs), corresponding to these events. Thus, an IFGS only reflects which genes appear in the
        underlying IF but not their ordering. GSGS offers a Gibbs sampling-like procedure to reconstruct the underlying
        signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway. In the proof-
        of-concept studies, our approach is shown to outperform the existing state-of-the-art network inference approaches
        using both continuous and discrete data generated from benchmark networks in the DREAM initiative. We perform a
        comprehensive sensitivity analysis to assess the robustness of our approach. Finally, we implement GSGS to
        reconstruct signaling mechanisms in breast cancer cells.


 EGC
         Hash Subgraph Pairwise Kernel for Protein-Protein Interaction Extraction
 8250



        Extracting protein-protein interaction (PPI) from biomedical literature is an important task in biomedical text mining
        (BioTM). In this paper, we propose a hash subgraph pairwise (HSP) kernel-based approach for this task. The key to the
        novel kernel is to use hierarchical hash labels to express the structural information of subgraphs in linear time. We
        apply the graph kernel to dependency graphs representing sentence structure for the protein-protein
        interaction extraction task, which can efficiently make use of the full graph structural information and, in particular,
        capture contiguous topological and label information ignored before. We evaluate the proposed approach on five publicly
        available PPI corpora. The experimental results show that our approach significantly outperforms the all-path kernel
        approach on all five corpora and achieves state-of-the-art performance.



EGC
8251    Identification of Essential Proteins Based on Edge Clustering Coefficient


        Identification of essential proteins is key to understanding the minimal requirements for cellular life and important for
        drug design. The rapid increase in available protein-protein interaction (PPI) data has made it possible to detect protein
        essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based
        on network topology. However, most of them tend to focus only on the location of a single protein, ignoring the
        relevance between interactions and protein essentiality. In this paper, a new centrality measure for identifying essential
        proteins based on the edge clustering coefficient, named NC, is proposed. Unlike previous centrality measures,
        NC considers both the centrality of a node and the relationship between it and its neighbors. For each interaction in the
        network, we calculate its edge clustering coefficient. A node's essentiality is determined by the sum of the edge
        clustering coefficients of the interactions connecting it and its neighbors. The new centrality measure NC takes into
        account the modular nature of protein essentiality. NC is applied to three different types of yeast protein-protein
        interaction networks, obtained from the DIP, MIPS, and BioGRID databases, respectively. The
        experimental results on the three different networks show that the number of essential proteins discovered by NC
        universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC, and IC. Moreover, the
        essential proteins discovered by NC show a significant clustering effect.
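
        The NC score described above can be sketched in a few lines, assuming the common definition of the edge
        clustering coefficient, ECC(u,v) = |N(u) ∩ N(v)| / min(deg(u)−1, deg(v)−1); the paper's exact normalization
        may differ.

```python
def nc_scores(adj):
    """NC centrality: for each node, sum the edge clustering coefficients
    of its incident edges. adj maps each node to its set of neighbours."""
    def ecc(u, v):
        common = len(adj[u] & adj[v])                 # shared neighbours
        denom = min(len(adj[u]) - 1, len(adj[v]) - 1) # max possible triangles
        return common / denom if denom > 0 else 0.0
    return {u: sum(ecc(u, v) for v in adj[u]) for u in adj}
```

        On a triangle with a pendant node, the triangle members score highly while the pendant scores zero, matching
        the intuition that essentiality is modular.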


EGC    Identifying Gene Pathways Associated with Cancer Characteristics via Sparse Statistical
8252
       Methods

        We propose a statistical method for uncovering gene pathways that characterize cancer heterogeneity. To
        incorporate knowledge of the pathways into the model, we define a set of pathway activities from microarray gene
        expression data based on Sparse Probabilistic Principal Component Analysis (SPPCA). A pathway-activity logistic
        regression model is then formulated for the cancer phenotype. To select pathway activities related to binary cancer
        phenotypes, we use the elastic net for parameter estimation and derive a model selection criterion for choosing the
        tuning parameters included in the model estimation. Our proposed method can also reverse-engineer gene networks
        based on the identified multiple pathways, which enables us to discover novel gene-gene associations related to the
        cancer phenotypes. We illustrate the whole process of the proposed method through the analysis of breast cancer gene
        expression data.


EGC    Inferring Gene Regulatory Networks via Nonlinear State-Space Models and Exploiting
8253
       Sparsity

       This paper considers the problem of learning the structure of gene regulatory networks from gene expression time
        series data. A more realistic scenario, in which the state-space model representing a gene network evolves nonlinearly,
        is considered, while a linear model is assumed for the microarray data. To capture the nonlinearity, a particle filter-based
       state estimation algorithm is considered instead of the contemporary linear approximation-based approaches. The
       parameters characterizing the regulatory relations among various genes are estimated online using a Kalman filter.
       Since a particular gene interacts with a few other genes only, the parameter vector is expected to be sparse. The state
       estimates delivered by the particle filter and the observed microarray data are then subjected to a LASSO-based least

       squares regression operation which yields a parsimonious and efficient description of the regulatory network by setting
       the irrelevant coefficients to zero. The performance of the aforementioned algorithm is compared with the extended
       Kalman filter (EKF) and Unscented Kalman Filter (UKF) employing the Mean Square Error (MSE) as the fidelity criterion
       in recovering the parameters of gene regulatory networks from synthetic data and real biological data. Extensive
       computer simulations illustrate that the proposed particle filter-based network inference algorithm outperforms EKF and
       UKF, and therefore, it can serve as a natural framework for modeling gene regulatory networks with nonlinear and
       sparse structure.
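
        The sparsity-inducing LASSO step can be illustrated with plain ISTA (iterative soft-thresholding) on toy data.
        This is a generic sketch, not the paper's pipeline, which couples the regression with particle-filter state
        estimates; the data and parameter names below are made up.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=2000):
    """ISTA for the LASSO objective 0.5*||Xw - y||^2 + lam*||w||_1.
    Soft-thresholding sets negligible coefficients exactly to zero,
    which is what yields a sparse regulatory network."""
    w = np.zeros(X.shape[1])
    L = np.linalg.norm(X, 2) ** 2                 # Lipschitz constant
    for _ in range(steps):
        w = w - X.T @ (X @ w - y) / L             # gradient step (quadratic part)
        w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-threshold
    return w

# Toy recovery: 10 candidate regulators, only two truly active.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 2.0, -1.5
y = X @ w_true
w_hat = lasso_ista(X, y, lam=0.1)
```

        With noiseless data and a small penalty, the two active coefficients are recovered and the rest are driven
        to (near) zero.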



EGC
8254   Inferring the Number of Contributors to Mixed DNA Profiles


       Forensic samples containing DNA from two or more individuals can be difficult to interpret. Even ascertaining the
       number of contributors to the sample can be challenging. These uncertainties can dramatically reduce the statistical
       weight attached to evidentiary samples. A probabilistic mixture algorithm that takes into account not just the number
       and magnitude of the alleles at a locus, but also their frequency of occurrence allows the determination of likelihood
       ratios of different hypotheses concerning the number of contributors to a specific mixture. This probabilistic mixture
       algorithm can compute the probability of the alleles in a sample being present in a 2-person mixture, 3-person mixture,
       etc. The ratio of any two of these probabilities then constitutes a likelihood ratio pertaining to the number of contributors
       to such a mixture.
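
        The core computation can be sketched with the standard inclusion-exclusion formula for the probability that 2k
        independent allele draws show exactly the observed allele set (assuming Hardy-Weinberg equilibrium and
        unrelated contributors; the allele labels and frequencies below are made up for illustration).

```python
from itertools import combinations

def prob_exact_alleles(freqs, k):
    """P(a k-person mixture shows exactly this allele set at one locus):
    inclusion-exclusion over subsets of the observed alleles, with 2k
    independent draws from the population allele frequencies."""
    alleles = list(freqs)
    total = 0.0
    for r in range(len(alleles) + 1):
        for sub in combinations(alleles, r):
            s = sum(freqs[a] for a in sub)
            total += (-1) ** (len(alleles) - r) * s ** (2 * k)
    return total

# Likelihood ratio for 2 vs. 3 contributors given three observed alleles.
freqs = {"A": 0.1, "B": 0.2, "C": 0.3}
lr = prob_exact_alleles(freqs, 2) / prob_exact_alleles(freqs, 3)
```

        For a single allele at frequency p and one contributor this reduces to p², and for two equifrequent alleles
        to 2pq, as expected.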




EGC
       Intervention in Gene Regulatory Networks via Phenotypically Constrained Control
8255
       Policies Based on Long-Run Behavior

       A salient purpose for studying gene regulatory networks is to derive intervention strategies to identify potential drug
       targets and design gene-based therapeutic intervention. Optimal and approximate intervention strategies based on the
       transition probability matrix of the underlying Markov chain have been studied extensively for probabilistic Boolean
       networks. While the key goal of control is to reduce the steady-state probability mass of undesirable network states, in
       practice it is important to limit collateral damage and this constraint should be taken into account when designing
       intervention strategies with network models. In this paper, we propose two new phenotypically constrained stationary
       control policies by directly investigating the effects on the network long-run behavior. They are derived to reduce the
       risk of visiting undesirable states in conjunction with constraints on the shift of undesirable steady-state mass so that
        only limited collateral damage can be introduced. We have studied the performance of the new constrained control
        policies, together with the previous greedy control policies, on randomly generated probabilistic Boolean networks. A
        preliminary example of intervening in a metastatic melanoma network is also given to show their potential application in
        designing genetic therapeutics to reduce the risk of entering both aberrant phenotypes and other ambiguous states


       corresponding to complications or collateral damage. Experiments on both random network ensembles and the
       melanoma network demonstrate that, in general, the new proposed control policies exhibit the desired performance. As
       shown by intervening in the melanoma network, these control policies can potentially serve as future practical gene
       therapeutic intervention strategies.


EGC
8256
       Iterative Dictionary Construction for Compression of Large DNA Data Sets


       Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical
       and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms,
       and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-
       based dictionary construction method can detect this repeated content and use it to compress collections of sequences.
        We explore a dictionary construction method that improves repeat identification in large DNA data sets. Comrad, our
        adaptation of an existing disk-based method, identifies exact repeated content in collections of sequences with
        similarities within and across the set of input sequences. Comrad compresses the data over multiple passes, which is an
        expensive
       process, but allows Comrad to compress large data sets within reasonable time and space. Comrad allows for random
       access to individual sequences and subsequences without decompressing the whole data set. Comrad has no
       competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even
       for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes
       compressed to 0.25 bits per base.
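
        The flavor of multi-pass dictionary construction can be conveyed with a schematic sketch (this is far simpler
        than Comrad's disk-based algorithm and is illustrative only): each pass finds the most frequent fixed-length
        substring occurring at least twice and replaces it with a fresh dictionary symbol.

```python
from collections import Counter

def iterative_dictionary_compress(seq, k=4, passes=4):
    """Toy multi-pass dictionary compression: per pass, substitute the
    most frequent length-k token run (if it repeats) with a new symbol."""
    dictionary = {}
    tokens = list(seq)
    for p in range(passes):
        grams = Counter(tuple(tokens[i:i + k])
                        for i in range(len(tokens) - k + 1))
        if not grams:
            break
        best, count = grams.most_common(1)[0]
        if count < 2:                      # nothing repeats: stop
            break
        sym = f"<{p}>"
        dictionary[sym] = best
        out, i = [], 0
        while i < len(tokens):             # left-to-right replacement
            if tuple(tokens[i:i + k]) == best:
                out.append(sym)
                i += k
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, dictionary
```

        Keeping the dictionary separate from the token stream is what makes random access to individual sequences
        possible without decompressing the whole collection.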



EGC
       k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-
8257
       Protein Interaction-Related Documents

       Although publicly accessible databases containing protein-protein interaction (PPI)-related information are important
       resources to bench and in silico research scientists alike, the amount of time and effort required to keep them up to date
        is often burdensome. In an effort to help identify relevant PPI publications, text-mining tools from the machine learning
        discipline can be applied to help in this process. Here, we describe and evaluate two document classification algorithms
       that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design
       classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated them
       against one another. One of our systems was the overall best-performing system submitted to the challenge task. It
       utilizes a novel approach to k-nearest neighbor classification, which we describe here, and compare its performance to
       those of two support vector machine-based classification systems, one of which was also evaluated in the challenge
       task.
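
        As a point of reference, plain cosine k-nearest-neighbor document classification looks as follows; the
        challenge-winning system additionally scales terms by information gain, which is omitted here. The toy
        documents in the usage are invented.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, query, k=3):
    """Majority vote over the k training documents most similar to the
    query. train is a list of (Counter, label) pairs."""
    q = Counter(query.split())
    ranked = sorted(train, key=lambda d: cosine(q, d[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

        In the real task, bags of words come from full abstracts or articles, and the term weights carry the
        information-gain scaling that gives the method its name.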


EGC
8258
       Manifold Adaptive Experimental Design for Text Categorization



       In many information processing tasks, labels are usually expensive and the unlabeled data points are abundant. To
       reduce the cost on collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e.,
       improve the classifier the most if they were labeled. Many active learning techniques have been proposed for text
        categorization, such as SVMActive and Transductive Experimental Design. However, most previous approaches try to
        discover the discriminant structure of the data space, whereas the geometrical structure is not well respected. In this
        paper, we propose a novel active learning algorithm which is performed in the data manifold adaptive kernel space. The
        manifold structure is incorporated into the kernel space by using the graph Laplacian. This way, the manifold adaptive
       kernel space reflects the underlying geometry of the data. By minimizing the expected error with respect to the optimal
       classifier, we can select the most representative and discriminative data points for labeling. Experimental results on text
       categorization have demonstrated the effectiveness of our proposed approach.


EGC
8259
       Markov Invariants for Phylogenetic Rate Matrices Derived from Embedded Submodels



       We consider novel phylogenetic models with rate matrices that arise via the embedding of a progenitor model on a small
       number of character states, into a target model on a larger number of character states. Adapting representation-
       theoretic results from recent investigations of Markov invariants for the general rate matrix model, we give a prescription
        for identifying and counting Markov invariants for such "symmetric embedded" models, and we provide enumerations of
        these for the first few cases with a small number of character states. The simplest example is a target model on three
        states, constructed from a general 2-state model: the "2 ↪ 3" embedding. We show that for 2 taxa, there
        exist two invariants of quadratic degree that can be used to directly infer pairwise distances from observed sequences
        under this model. A simple simulation study verifies their theoretical expected values and suggests that, given the
        appropriateness of the model class, they have superior statistical properties to the standard (log)Det invariant (which
        is of cubic degree for this case).



EGC
8260   Matching Split Distance for Unrooted Binary Phylogenetic Trees


       The reconstruction of evolutionary trees is one of the primary objectives in phylogenetics. Such a tree represents the
       historical evolutionary relationship between different species or organisms. Tree comparisons are used for multiple
       purposes, from unveiling the history of species to deciphering evolutionary associations among organisms and
       geographical areas. In this paper, we propose a new method of defining distances between unrooted binary
       phylogenetic trees that is especially useful for relatively large phylogenetic trees. Next, we investigate in detail the
       properties of one example of these metrics, called the Matching Split distance, and describe how the general method
       can be extended to nonbinary trees.
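The Matching Split distance pairs each split (internal-edge bipartition) of one tree with a split of the other so that the total disagreement is minimal. A brute-force sketch, assuming the natural cost of a pair is the smaller symmetric difference over the two orientations of a split (the paper's exact cost function may differ):

```python
from itertools import permutations

def split_cost(s1, s2, n_taxa):
    # A split is an unordered bipartition, so compare s1 against both
    # orientations of s2 and keep the smaller symmetric difference.
    d = len(s1 ^ s2)
    return min(d, n_taxa - d)

def matching_split_distance(splits1, splits2, n_taxa):
    # Minimum-weight perfect matching between the two split sets, by
    # exhaustive search over pairings (fine for toy trees; a real
    # implementation would run the Hungarian algorithm on the cost matrix).
    assert len(splits1) == len(splits2)
    return min(
        sum(split_cost(splits1[i], splits2[j], n_taxa)
            for i, j in enumerate(perm))
        for perm in permutations(range(len(splits2)))
    )
```

For two five-taxon binary trees whose splits are {a,b}, {a,b,c} versus {a,c}, {a,c,d}, the optimal pairing under this toy cost yields a distance of 3.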



EGC
8261
       Memory Efficient Algorithms for Structural Alignment of RNAs with Pseudoknots
       In this paper, we consider the problem of structural alignment of a target RNA sequence of length n and a query
       RNA sequence of length m with known secondary structure that may contain simple pseudoknots or embedded simple
       pseudoknots. The best known algorithm for solving this problem runs in O(mn^3) time for simple pseudoknots or O(mn^4)
       time for embedded simple pseudoknots, with space complexity of O(mn^3) for both structures; this memory requirement
       makes it infeasible for comparing noncoding RNAs (ncRNAs) of length several hundred or more. We propose
       memory-efficient algorithms to solve the same problem. We reduce the space complexity to O(n^3) for simple
       pseudoknots and O(mn^2 + n^3) for embedded simple pseudoknots while maintaining the same time complexity. We also
       show how to modify our algorithm to handle a restricted class of recursive simple pseudoknots, which is found to be
       abundant in real data, with space complexity of O(mn^2 + n^3) and time complexity of O(mn^4). Experimental results
       show that our algorithms are feasible for comparing ncRNAs of length more than 500.


EGC    MinePhos: A Literature Mining System for Protein Phosphorylation Information
8262
       Extraction

       The rapid growth of scientific literature calls for automatic and efficient ways to facilitate extracting experimental data on
       protein phosphorylation. Such information is of great value for biologists in studying cellular processes and diseases
       such as cancer and diabetes. Existing approaches like RLIMS-P are mainly rule based, and their performance relies
       heavily on the completeness of the rules. We propose an SVM-based system, MinePhos, which outperforms
       RLIMS-P in both precision and recall of information extraction when tested on a set of articles randomly chosen from
       PubMed.



EGC
8263   Molecular Dynamics Trajectory Compression with a Coarse-Grained Model


       Molecular dynamics trajectories are very data intensive thereby limiting sharing and archival of such data. One possible
       solution is compression of trajectory data. Here, trajectory compression based on conversion to the coarse-grained
       model PRIMO is proposed. The compressed data are about one third of the original data and fast decompression is
       possible with an analytical reconstruction procedure from PRIMO to all-atom representations. This protocol largely
       preserves structural features and to a more limited extent also energetic features of the original trajectory.



EGC
       Multiobjective Optimization-Based Approach for Discovering Novel Cancer Therapies
8264


       Solid tumors must recruit new blood vessels for growth and maintenance. Discovering drugs that block tumor-induced
       development of new blood vessels (angiogenesis) is an important approach in cancer treatment. The complexity of
       angiogenesis presents both challenges and opportunities for cancer therapies. Intuitive approaches, such as blocking

       VEGF activity, have yielded important therapies, but there may be opportunities to alter nonintuitive targets, either alone
       or in combination. This paper describes the development of a high-fidelity simulation of angiogenesis and uses this as
       the basis for a parallel search-based approach for the discovery of novel potential cancer treatments that inhibit blood
       vessel growth. Discovering new therapies is viewed as a multiobjective combinatorial optimization over two competing
       objectives: minimizing the estimated cost of practically developing the intervention while minimizing the simulated
       oxygen provided to the tumor by angiogenesis. Results show the effectiveness of the search process by finding
       interventions that are currently in use, and more interestingly, discovering potential new approaches that are
       nonintuitive yet effective.


EGC
8265
       Multiscale Binarization of Gene Expression Data for Reconstructing Boolean Networks


       Network inference algorithms can assist life scientists in unraveling gene-regulatory systems on a molecular level. In
       recent years, great attention has been drawn to the reconstruction of Boolean networks from time series. These need to
       be binarized, as such networks model genes as binary variables (either "expressed” or "not expressed”). Common
       binarization methods often cluster measurements or separate them according to statistical or information theoretic
       characteristics and may require many data points to determine a robust threshold. Yet, time series measurements
       frequently comprise only a small number of samples. To overcome this limitation, we propose a binarization that
       incorporates measurements at multiple resolutions. We introduce two such binarization approaches which determine
       thresholds based on limited numbers of samples and additionally provide a measure of threshold validity. Thus, network
       reconstruction and further analysis can be restricted to genes with meaningful thresholds. This reduces the complexity
       of network inference. The performance of our binarization algorithms was evaluated in network reconstruction
       experiments using artificial data as well as real-world yeast expression time series. The new approaches yield
       considerably improved correct network identification rates compared to other binarization techniques by effectively
       reducing the amount of candidate networks.
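As a point of contrast with the multiscale approach, a single-resolution binarization of the kind the abstract criticizes can be as simple as placing the threshold in the largest gap between sorted measurements (a toy sketch; the paper's method instead derives thresholds at multiple resolutions and scores their validity):

```python
def largest_gap_binarize(values):
    # Toy one-resolution binarization: put the threshold in the middle of
    # the largest gap between consecutive sorted measurements, then map
    # each value to 1 ("expressed") or 0 ("not expressed").
    s = sorted(values)
    gap, i = max((s[j + 1] - s[j], j) for j in range(len(s) - 1))
    threshold = (s[i] + s[i + 1]) / 2.0
    return [1 if v > threshold else 0 for v in values], threshold
```

With few samples, a single gap-based threshold like this is fragile, which is exactly the limitation the multiscale approach targets.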


EGC
       Mutation Region Detection for Closely Related Individuals without a Known Pedigree
8266


       Linkage analysis serves as a way of finding locations of genes that cause genetic diseases. Linkage studies have
       facilitated the identification of several hundreds of human genes that can harbor mutations which by themselves lead to
       a disease phenotype. The fundamental problem in linkage analysis is to identify regions whose allele is shared by all or
       almost all affected members but by none or few unaffected members. Almost all the existing methods for linkage
       analysis are for families with clearly given pedigrees. Little work has been done for the case where the sampled
       individuals are closely related, but their pedigree is not known. This situation occurs very often when the individuals
       share a common ancestor at least six generations ago. Solving this case will tremendously extend the use of linkage
       analysis for finding genes that cause genetic diseases. In this paper, we propose a mathematical model (the shared
       center problem) for inferring the allele-sharing status of a given set of individuals using a database of confirmed


       haplotypes as reference. We show the NP-completeness of the shared center problem and present a ratio-2 polynomial-
       time approximation algorithm for its minimization version (called the closest shared center problem). We then convert
       the approximation algorithm into a heuristic algorithm for the shared center problem. Based on this heuristic, we finally
       design a heuristic algorithm for mutation region detection. We further implement the algorithms to obtain a software
       package. Our experimental data show that the software is both fast and accurate.



EGC    Mutual Information Optimization for Mass Spectra Data Alignment
8267


       "Signal" alignments play critical roles in many clinical settings. This is the case for mass spectrometry (MS) data, an
       important component of many types of proteomic analysis. A central problem occurs when one needs to integrate (MS)
       data produced by different sources, e.g., different equipment and/or laboratories. In these cases, some form of "data
       integration” or "data fusion” may be necessary in order to discard some source-specific aspects and improve the ability
       to perform a classification task such as inferring the "disease classes” of patients. The need for new high-performance
       data alignment methods is therefore particularly important in these contexts. In this paper, we propose an approach
       based both on an information theory perspective, generally used in a feature construction problem, and the application
       of a mathematical programming task (i.e., the weighted bipartite matching problem). We present the results of a
       competitive analysis of our method against other approaches. The analysis was conducted on data from
       plasma/ethylenediaminetetraacetic acid of "control” and Alzheimer patients collected from three different hospitals. The
       results point to a significant performance advantage of our method with respect to the competing ones tested.
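The weighted bipartite matching step mentioned above can be illustrated on toy peak lists (a sketch assuming plain |Δm/z| costs and a hard tolerance; the paper couples the matching with information-theoretic feature construction):

```python
from itertools import permutations

def align_peaks(peaks_a, peaks_b, tol=0.5):
    # Toy weighted bipartite matching between two small m/z peak lists:
    # minimize total |delta m/z|, forbidding pairs farther apart than tol.
    # Real spectra call for the Hungarian algorithm instead of brute force.
    BIG = 10 ** 6
    def cost(a, b):
        d = abs(a - b)
        return d if d <= tol else BIG
    n = min(len(peaks_a), len(peaks_b))
    best, best_pairs = None, None
    for perm in permutations(range(len(peaks_b)), n):
        pairs = list(zip(range(n), perm))
        c = sum(cost(peaks_a[i], peaks_b[j]) for i, j in pairs)
        if best is None or c < best:
            best, best_pairs = c, pairs
    # Keep only pairs within tolerance; the rest stay unmatched.
    return [(i, j) for i, j in best_pairs
            if cost(peaks_a[i], peaks_b[j]) < BIG]
```

Peaks with no counterpart within the tolerance are left unmatched, mirroring how a matching-based aligner discards source-specific peaks.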



EGC
       On Complexity of Protein Structure Alignment Problem under Distance Constraint
8268


       We study the well-known Largest Common Point-set (LCP) under Bottleneck Distance problem. Given two proteins a
       and b (as sequences of points in three-dimensional space) and a distance cutoff ζ, the goal is to find a spatial
       superposition and an alignment that maximize the number of pairs of points from a and b that can be fit within
       distance ζ of each other. The best algorithms to date for the approximate and exact solution of this problem run in time
       O(n^8) and O(n^32), respectively, where n represents protein length. This work improves the runtime of the approximation
       algorithm and the expected runtime of the algorithm for the absolute optimum, for both order-dependent and order-
       independent alignments. More specifically, our algorithms for near-optimal and optimal sequential alignments run in
       time O(n^7 log n) and O(n^14 log n), respectively. For nonsequential alignments, the corresponding running times are
       O(n^7.5) and O(n^14.5).


EGC    On Parameter Synthesis by Parallel Model Checking
8269




       An important problem in current computational systems biology is to analyze models of biological systems dynamics
       under parameter uncertainty. This paper presents a novel algorithm for parameter synthesis based on parallel model
       checking. The algorithm is conceptually universal with respect to the modeling approach employed. We introduce the
       algorithm, show its scalability, and examine its applicability on several biological models.



EGC
8270
       On the Application of Active Learning and Gaussian Processes in Post-Cryopreservation
       Cell Membrane Integrity Experiments




EGC
       On the Elusiveness of Clusters
8271


       Rooted phylogenetic networks are often used to represent conflicting phylogenetic signals. Given a set of clusters, a
       network is said to represent these clusters in the softwired sense if, for each cluster in the input set, at least one tree
       embedded in the network contains that cluster. Motivated by parsimony, we might wish to construct such a network
       using as few reticulations as possible, or minimizing the level of the network, i.e., the maximum number of reticulations
       used in any "tangled" region of the network. Although these are NP-hard problems, here we prove that, for every fixed k
       ≥ 0, it is polynomial-time solvable to construct a phylogenetic network with level equal to k representing a cluster set, or
       to determine that no such network exists. However, this algorithm does not lend itself to a practical implementation. We
       also prove that the comparatively efficient CASS algorithm correctly solves this problem (and also minimizes the
       reticulation number) when input clusters are obtained from two not necessarily binary gene trees on the same set of
       taxa but does not always minimize level for general cluster sets. Finally, we describe a new algorithm which generates in
       polynomial-time all binary phylogenetic networks with exactly r reticulations representing a set of input clusters (for
       every fixed r ≥ 0).


EGC    Optimizing Phylogenetic Networks for Circular Split Systems
8272


       We address the problem of realizing a given distance matrix by a planar phylogenetic network with a minimum number
       of faces. With the help of the popular software SplitsTree4, we start by approximating the distance matrix with a distance
       metric that is a linear combination of circular splits. The main results of this paper are the necessary and sufficient
       conditions for the existence of a network with a single face. We show how such a network can be constructed, and we
       present a heuristic for constructing a network with few faces using the first algorithm as the base case. Experimental



       results on biological data show that this heuristic algorithm can produce phylogenetic networks with far fewer faces
       than the ones computed by SplitsTree4, without affecting the approximation of the distance matrix.



EGC
8273
       Output-Sensitive Algorithms for Finding the Nested Common Intervals of Two General
       Sequences

       The focus of this paper is the problem of finding all nested common intervals of two general sequences. Depending
       on the treatment one wants to apply to duplicate genes, Blin et al. introduced three models to define nested common
       intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all three
       models. For the uniqueness and the bijection models, we give O(n + Nout)-time algorithms, where Nout denotes the size
       of the output. For the free-inclusion model, we give an O(n^(1+ε) + Nout)-time algorithm, where ε > 0 is an arbitrarily small
       constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-
       inclusion models, we show that Nout = O(n^2). Let C = Σ_{g∈Γ} o1(g)o2(g), where Γ is the set of distinct genes, and o1(g) and
       o2(g) are, respectively, the numbers of copies of g in the two given sequences. For the bijection model, we show that
       Nout = O(Cn). In this paper, we also study the problem of finding all approximate nested common intervals of two
       sequences under the bijection model. An O(δn + Nout)-time algorithm is presented, where δ denotes the maximum number
       of allowed gaps. In addition, we show that for this problem Nout is O(δn^3).



EGC    Parameter Estimation Using Metaheuristics in Systems Biology: A Comprehensive
8274
       Review
       This paper gives a comprehensive review of the application of metaheuristics to optimization problems in systems
       biology, mainly focusing on the parameter estimation problem (also called the inverse problem or model calibration). It
       is intended for either the system biologist who wishes to learn more about the various optimization techniques available
       and/or the metaheuristic optimizer who is interested in applying such techniques to problems in systems biology. First,
       the parameter estimation problems emerging from different areas of systems biology are described from the point of
       view of machine learning. Brief descriptions of various metaheuristics developed for these problems follow, along with
       outlines of their advantages and disadvantages. Several important issues in applying metaheuristics to the systems
       biology modeling problem are addressed, including the reliability and identifiability of model parameters, optimal design
       of experiments, and so on. Finally, we highlight some possible future research directions in this field.



EGC
       Peptide Reranking with Protein-Peptide Correspondence and Precursor Peak Intensity
8275
       Information
       Searching tandem mass spectra against a protein database has been a mainstream method for peptide identification.
       Improving peptide identification results by ranking true Peptide-Spectrum Matches (PSMs) over their false counterparts
       leads to the development of various reranking algorithms. In peptide reranking, discriminative information is essential to


       distinguish true PSMs from false PSMs. Generally, most peptide reranking methods obtain discriminative information
       directly from database search scores or by training machine learning models. Information in the protein database and
       MS1 spectra (i.e., single stage MS spectra) is ignored. In this paper, we propose to use information in the protein
       database and MS1 spectra to rerank peptide identification results. To quantitatively analyze their effects to peptide
       reranking results, three peptide reranking methods are proposed: PPMRanker, PPIRanker, and MIRanker. PPMRanker
       only uses Protein-Peptide Map (PPM) information from the protein database, PPIRanker only uses Precursor Peak
       Intensity (PPI) information, and MIRanker employs both PPM information and PPI information. According to our
       experiments on a standard protein mixture data set, a human data set and a mouse data set, PPMRanker and MIRanker
       achieve better peptide reranking results than PeptideProphet, PeptideProphet+NSP (number of sibling peptides), and a
       score regularization method SRPI.



EGC
8276
       Predicting Ligand Binding Residues and Functional Sites Using Multipositional
       Correlations with Graph Theoretic Clustering and Kernel CCA

       We present a new computational method for predicting ligand binding residues and functional sites in protein
       sequences. These residues and sites tend not only to be conserved but also to exhibit strong correlations, due to
       selection pressure during evolution to maintain the required structure and/or function. To explore the effect of
       correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based
       canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that
       exhibit strong correlation between the residues' evolutionary characterization at the sites and the structure-based
       functional classification of the proteins in the context of a functional family. The results of testing the method on two
       well-curated data sets show that the prediction accuracy as measured by Receiver Operating Characteristic (ROC)
       scores improves significantly when multipositional correlations are accounted for.



EGC
       Predicting Metal-Binding Sites from Protein Sequence
8277


       Prediction of binding sites from sequence can significantly help toward determining the function of uncharacterized
       proteins on a genomic scale. The task is highly challenging due to the enormous amount of alternative candidate
       configurations. Previous research has only considered this prediction problem starting from 3D information. When
       starting from sequence alone, only methods that predict the bonding state of selected residues are available. The sole
       exception consists of pattern-based approaches, which rely on very specific motifs and cannot be applied to discover
       truly novel sites. We develop new algorithmic ideas based on structured-output learning for determining transition-
       metal-binding sites coordinated by cysteines and histidines. The inference step (retrieving the best scoring output) is
       intractable for general output types (i.e., general graphs). However, under the assumption that no residue can coordinate
       more than one metal ion, we prove that metal binding has the algebraic structure of a matroid, allowing us to employ a
       very efficient greedy algorithm. We test our predictor in a highly stringent setting where the training set consists of



       protein chains belonging to SCOP folds different from the ones used for accuracy estimation. In this setting, our
       predictor achieves 56 percent precision and 60 percent recall in the identification of ligand-ion bonds.



EGC
       Predicting Protein Function by Multi-Label Correlated Semi-Supervised Learning
8278


       Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The
       increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a
       considerable number of computational methods for determining protein function in the context of a network. These
       algorithms, however, treat each functional class in isolation and thereby often suffer from the difficulty of the scarcity of
       labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm,
       Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional
       classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional
       class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where
       the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with
       intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local
       and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently
       outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with
       scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL
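MCSL extends graph-based learning with local and global consistency (LGC); the single-network LGC core it builds on can be sketched as follows (a minimal sketch of LGC only; the intraclass/interclass coupling that distinguishes MCSL is omitted):

```python
import numpy as np

def lgc_propagate(W, Y, alpha=0.9):
    # Local-and-global-consistency propagation: F* = (I - alpha*S)^{-1} Y,
    # where S = D^{-1/2} W D^{-1/2} is the symmetrically normalized
    # adjacency and Y holds the known labels (one column per class).
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.linalg.solve(np.eye(len(W)) - alpha * S, Y)
```

On a toy four-node network with two connected components, labeling one node per component propagates each label to its unlabeled neighbor, which is the smoothness intuition the abstract describes.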


EGC    Protein Complexes Discovery Based on Protein-Protein Interaction Data via a Regularized
8279   Sparse Generative Network Model
       Detecting protein complexes from protein interaction networks is one major task in the postgenome era. Previously
       developed computational algorithms for identifying complexes mainly focus on graph partitioning or dense-region
       finding. Most of these traditional algorithms cannot discover overlapping complexes, which genuinely exist in protein-
       protein interaction (PPI) networks. Even though some density-based methods have been developed to identify overlapping
       complexes, they are not able to discover complexes that include peripheral proteins. In this study, motivated by the recent
       successful application of generative network models to describing the generation process of PPI networks and to
       detecting communities in social networks, we develop a regularized sparse generative network model (RSGNM) for
       protein complex identification, by adding another process that generates propensities using an exponential distribution
       and by incorporating a Laplacian regularizer into an existing generative network model. Because the propensities are
       generated using an exponential distribution, their estimators will be sparse, which not only has a good biological
       interpretation but also helps to control the overlapping rate among detected complexes. The Laplacian regularizer
       makes the estimators of the propensities smoother on the interaction networks. Experimental results on three yeast
       PPI networks show that RSGNM outperforms six previous competing algorithms in terms of the quality of detected
       complexes. In addition, RSGNM is able to detect overlapping complexes and complexes including peripheral proteins
       simultaneously. These results give new insights into the importance of generative network models in protein complex
       identification.


EGC
8280   Quantifying Dynamic Stability of Genetic Memory Circuits


       Bistability/Multistability has been found in many biological systems including genetic memory circuits. Proper
       characterization of system stability helps to understand biological functions and has potential applications in fields
       such as synthetic biology. Existing methods of analyzing bistability are either qualitative or static: assuming the
       circuit is in a steady state, the latter can only reveal the susceptibility of the stability to injected DC noise. However,
       this can be inappropriate and inadequate as dynamics are crucial for many biological networks. In this paper, we
       quantitatively characterize the dynamic stability of a genetic conditional memory circuit by developing new dynamic
       noise margin (DNM) concepts and associated algorithms based on system theory. Taking into account the duration of
       the noisy perturbation, the DNMs are more general cases of their static counterparts. Using our techniques, we analyze
       the noise immunity of the memory circuit and derive insights on dynamic hold and write operations. Considering cell-to-
       cell variations, our parametric analysis reveals that the dynamic stability of the memory circuit has significantly varying
       sensitivities to underlying biochemical reactions attributable to differences in structure, time scales, and nonlinear
       interactions between reactions. With proper extensions, our techniques are broadly applicable to other multistable
       biological systems.



EGC    Quantitative Analysis of the Self-Assembly Strategies of Intermediate Filaments from
8281
       Tetrameric Vimentin
       In vitro assembly of intermediate filaments from tetrameric vimentin consists of a very rapid phase of tetramers laterally
       associating into unit-length filaments and a slow phase of filament elongation. We focus in this paper on a systematic
       quantitative investigation of two molecular models for filament assembly, recently proposed in (Kirmse et al. J. Biol.
       Chem. 282, 52 (2007), 18563-18572), through mathematical modeling, model fitting, and model validation. We analyze the
       quantitative contribution of each filament elongation strategy: with tetramers, with unit-length filaments, with longer
       filaments, or combinations thereof. In each case, we discuss the numerical fitting of the model with respect to one set of
       data, and its separate validation with respect to a second, different set of data. We introduce a high-resolution model for
       vimentin filament self-assembly, able to capture the detailed dynamics of filaments of arbitrary length. This provides
       much more predictive power for the model, in comparison to previous models where only the mean length of all
       filaments in the solution could be analyzed. We show how kinetic observations on low-resolution models can be
       extrapolated to the high-resolution model and used for lowering its complexity.



EGC
       Quantum Gate Circuit Model of Signal Integration in Bacterial Quorum Sensing
8282


       Bacteria evolved cell to cell communication processes to gain information about their environment and regulate gene
       expression. Quorum sensing is such a process in which signaling molecules, called autoinducers, are produced,

secreted, and detected. In several cases, bacteria use more than one autoinducer and integrate the information
conveyed by them. It has not yet been adequately explained why bacteria evolved such signal integration circuits and
what they can learn about their environments by using more than one autoinducer, given that all signaling pathways merge into one.
       Here quantum information theory, which includes classical information theory as a special case, is used to construct a
       quantum gate circuit that reproduces recent experimental results. Although the conditions in which biosystems exist do
       not allow for the appearance of quantum mechanical phenomena, the powerful computation tools of quantum
       information processing can be carefully used to cope with signal and information processing by these complex
       systems. A simulation algorithm based on this model has been developed and numerical experiments that analyze the
       dynamical operation of the quorum sensing circuit were performed for various cases of autoinducer variations, which
       revealed that these variations contain significant information about the environment in which bacteria exist.


EGC    Refining Regulatory Networks through Phylogenetic Transfer of Information
8283


       The experimental determination of transcriptional regulatory networks in the laboratory remains difficult and time-
       consuming, while computational methods to infer these networks provide only modest accuracy. The latter can be
       attributed partly to the limitations of a single-organism approach. Computational biology has long used comparative and
       evolutionary approaches to extend the reach and accuracy of its analyses. In this paper, we describe ProPhyC, a
       probabilistic phylogenetic model and associated inference algorithms, designed to improve the inference of regulatory
       networks for a family of organisms by using known evolutionary relationships among these organisms. ProPhyC can be
       used with various network evolutionary models and any existing inference method. Extensive experimental results on
       both biological and synthetic data confirm that our model (through its associated refinement algorithms) yields
       substantial improvement in the quality of inferred networks over all current methods. We also compare ProPhyC with a
       transfer learning approach we design. This approach also uses phylogenetic relationships while inferring regulatory
       networks for a family of organisms. Using similar input information but designed in a very different framework, this
       transfer learning approach does not perform better than ProPhyC, which indicates that ProPhyC makes good use of the
       evolutionary information.

EGC
       RENNSH: A Novel α-Helix Identification Approach for Intermediate Resolution Electron
8284
       Density Maps

Accurate identification of protein secondary structures is beneficial for understanding the three-dimensional structures of
biological macromolecules. In this paper, a novel refined classification framework is proposed, which treats α-helix
identification as a machine learning problem by representing each voxel in the density map with its Spherical Harmonic
Descriptors (SHD). An energy function is defined to provide statistical analysis of identification performance, which
can be applied to all α-helix identification approaches. Compared with other existing α-helix identification methods
for intermediate resolution electron density maps, the experimental results demonstrate that our approach gives the
best identification accuracy and is more robust to noise.







 EGC
         Residues with Similar Hexagon Neighborhoods Share Similar Side-Chain Conformations
 8285


        We present in this study a new approach to code protein side-chain conformations into hexagon substructures.
        Classical side-chain packing methods consist of two steps: first, side-chain conformations, known as rotamers, are
        extracted from known protein structures as candidates for each residue; second, a searching method along with an
        energy function is used to resolve conflicts among residues and to optimize the combinations of side chain
        conformations for all residues. These methods benefit from the fact that the number of possible side-chain
        conformations is limited, and the rotamer candidates are readily extracted; however, these methods also suffer from the
inaccuracy of energy functions. Inspired by threading and ab initio approaches to protein structure prediction, we
propose to use hexagon substructures to implicitly capture subtle issues of energy functions. Our initial results indicate
that even without guidance from an energy function, hexagon structures alone can capture side-chain conformations at
an accuracy of 83.8 percent, higher than the 82.6 percent achieved by state-of-the-art side-chain packing methods.



 EGC
         Reverse Engineering and Analysis of Genome-Wide Gene Regulatory Networks from
 8286
         Gene Expression Profiles Using High-Performance Computing

Gene expression is a carefully regulated phenomenon in the cell. "Reverse-engineering" algorithms try to
        reconstruct the regulatory interactions among genes from genome-scale measurements of gene expression profiles
        (microarrays). Mammalian cells express tens of thousands of genes; hence, hundreds of gene expression profiles are
        necessary in order to have acceptable statistical evidence of interactions between genes. As the number of profiles to
        be analyzed increases, so do computational costs and memory requirements. In this work, we designed and developed a
        parallel computing algorithm to reverse-engineer genome-scale gene regulatory networks from thousands of gene
        expression profiles. The algorithm is based on computing pairwise Mutual Information between each gene-pair. We
        successfully tested it to reverse engineer the Mus Musculus (mouse) gene regulatory network in liver from gene
        expression profiles collected from a public repository. A parallel hierarchical clustering algorithm was implemented to
discover "communities" within the gene network. Network communities are enriched for genes involved in the same
        biological functions. The inferred network was used to identify two mitochondrial proteins.
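The pairwise step at the heart of the algorithm can be sketched as a plug-in mutual-information estimate over discretized expression profiles; the equal-width binning and the bin count are simplifying assumptions (the paper's exact estimator may differ):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys, bins=4):
    """Plug-in MI estimate (in bits) between two expression profiles,
    after discretizing each profile into equal-width bins."""
    def discretize(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0          # guard against constant profiles
        return [min(int((x - lo) / w), bins - 1) for x in v]
    dx, dy = discretize(xs), discretize(ys)
    n = len(xs)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    return sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

MI of a profile with itself equals its entropy, while a constant (informationless) partner yields zero; thresholding pairwise MI over all gene pairs is what produces a candidate interaction network.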


EGC
8287
        Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm



        Tumor classification based on Gene Expression Profiles (GEPs), which is of great benefit to the accurate diagnosis and
        personalized treatment for different types of tumor, has drawn a great attention in recent years. This paper proposes a
        novel tumor classification method based on correlation filters to identify the overall pattern of tumor subtype hidden in
        differentially expressed genes. Concretely, two correlation filters, i.e., Minimum Average Correlation Energy (MACE) and
        Optimal Tradeoff Synthetic Discriminant Function (OTSDF), are introduced to determine whether a test sample matches



the templates synthesized for each subclass. The experiments on six publicly available data sets indicate that the
proposed method is robust to noise and can more effectively avoid the effects of the curse of dimensionality. Compared
with many model-based methods, the correlation filter-based method can achieve better performance when balanced
training sets are exploited to synthesize the templates. In particular, the proposed method can detect the similarity of
the overall pattern while ignoring small mismatches between the test sample and the synthesized template, and it
performs well even if only a few training samples are available. More importantly, the experimental results can be
visually represented, which is helpful for further analysis of the results.



EGC
Scaffold Filling under the Breakpoint and Related Distances
8288



        Motivated by the trend of genome sequencing without completing the sequence of the whole genomes, a problem on
        filling an incomplete multichromosomal genome (or scaffold) I with respect to a complete target genome G was studied.
The objective is to minimize the resulting genomic distance between I′ and G, where I′ is the
corresponding filled scaffold. We call this problem the one-sided scaffold filling problem. In this paper, we conduct a
        systematic study for the scaffold filling problem under the breakpoint distance and its variants, for both
        unichromosomal and multichromosomal genomes (with and without gene repetitions). When the input genome contains
        no gene repetition (i.e., is a fragment of a permutation), we show that the two-sided scaffold filling problem (i.e., G is also
        incomplete) is polynomially solvable for unichromosomal genomes under the breakpoint distance and for
        multichromosomal genomes under the genomic (or DCJ—Double-Cut-and-Join) distance. However, when the input
        genome contains some repeated genes, even the one-sided scaffold filling problem becomes NP-complete when the
        similarity measure is the maximum number of adjacencies between two sequences. For this problem, we also present
        efficient constant-factor approximation algorithms: factor-2 for the general case and factor 1.33 for the one-sided case.
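For the unsigned, repetition-free unichromosomal case, the breakpoint distance that the scaffold-filling objective builds on can be computed directly from adjacency sets; the sentinel capping below assumes genes are labeled 1..n:

```python
def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not adjacencies of g2
    (unsigned, unichromosomal, no repeated genes). Chromosome ends
    are capped with sentinels 0 and n+1 so telomeric adjacencies count."""
    def adjacencies(g):
        capped = [0] + list(g) + [len(g) + 1]
        return {frozenset(p) for p in zip(capped, capped[1:])}
    return len(adjacencies(g1) - adjacencies(g2))
```

For example, comparing [1,2,3,4] with [1,3,2,4] gives distance 2: the adjacencies (1,2) and (3,4) are broken.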
 EGC
 8289
          SimBioNeT: A Simulator of Biological Network Topology


Studying biological networks at the topological level is a major issue in computational biology, and simulation is
often used in this context, either to assess reverse engineering algorithms or to investigate how topological properties
depend on network parameters. In both contexts, it is desirable for a topology simulator to reproduce the current
knowledge on biological networks, to be able to generate a number of networks with the same properties, and to be
flexible enough to mimic networks of different organisms. We propose a biological network topology simulator,
SimBioNeT, in which module structures of different types and sizes are replicated at different levels of network
organization and interconnected, so as to obtain the desired degree distribution, e.g., scale free, and a clustering
coefficient that remains constant with the number of nodes in the network, a typical characteristic of biological
networks. Empirical assessment of the ability of the simulator to reproduce characteristic properties of biological
networks, and comparison with E. coli and S. cerevisiae transcriptional networks, demonstrates the effectiveness of
our proposal.







 EGC
 8290
        Smoldyn on Graphics Processing Units: Massively Parallel Brownian Dynamics
        Simulations

        Space is a very important aspect in the simulation of biochemical systems; recently, the need for simulation algorithms
        able to cope with space is becoming more and more compelling. Complex and detailed models of biochemical systems
        need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations,
        transportation phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can
become very large, and their simulation can be time consuming, especially if we want to capture the system's behavior
in a reliable way using stochastic methods in conjunction with a high spatial resolution. In order to deliver on the
promise made by systems biology to understand a system as a whole, we need to scale up the size of the models we are
able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely
used algorithm for stochastic simulation of chemical reactions with spatial resolution and single molecule detail, and
        we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs).
The implementation executes the most computationally demanding steps (diffusion, unimolecular and
bimolecular reactions, as well as the most common cases of molecule-surface interaction) on the GPU, computing them
in parallel for each molecule of the system. The implementation offers good speed-ups and real-time, high-quality
        graphics output.
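The per-molecule update that such a GPU implementation parallelizes is simple: every molecule receives independent Gaussian displacements each time step. A CPU sketch (with hypothetical diffusion coefficient and step size) of the update a CUDA kernel would apply to each molecule in parallel:

```python
import math
import random

def brownian_step(positions, D, dt, rng=random):
    """One Brownian-dynamics step: each coordinate of each molecule gets
    independent Gaussian noise with standard deviation sqrt(2*D*dt)."""
    s = math.sqrt(2.0 * D * dt)
    return [(x + rng.gauss(0.0, s), y + rng.gauss(0.0, s), z + rng.gauss(0.0, s))
            for (x, y, z) in positions]
```

After n steps the mean squared displacement approaches 6\*D\*n\*dt, the standard sanity check for a correct diffusion kernel.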


EGC
        Smolign: A Spatial Motifs-Based Protein Multiple Structural Alignment Method
8291


        Availability of an effective tool for protein multiple structural alignment (MSTA) is essential for discovery and analysis of
        biologically significant structural motifs that can help solve functional annotation and drug design problems. Existing
        MSTA methods collect residue correspondences mostly through pairwise comparison of consecutive fragments, which
        can lead to suboptimal alignments, especially when the similarity among the proteins is low. We introduce a novel
        strategy based on: building a contact-window based motif library from the protein structural data, discovery and
        extension of common alignment seeds from this library, and optimal superimposition of multiple structures according to
these alignment seeds by an enhanced partial order curve comparison method. The ability of our strategy to detect
multiple correspondences simultaneously, to catch alignments globally, and to support flexible alignments makes for a
sensitive and robust automated algorithm that can expose similarities among protein structures even under low
        similarity conditions. Our method yields better alignment results compared to other popular MSTA methods, on several
        protein structure data sets that span various structural folds and represent different protein similarity levels.



EGC
        Stable Gene Selection from Microarray Data via Sample Weighting
8292






       Feature selection from gene expression microarray data is a widely used technique for selecting candidate genes in
       various cancer studies. Besides predictive ability of the selected genes, an important aspect in evaluating a selection
       method is the stability of the selected genes. Experts instinctively have high confidence in the result of a selection
       method that selects similar sets of genes under some variations to the samples. However, a common problem of
       existing feature selection methods for gene expression data is that the selected genes by the same method often vary
       significantly with sample variations. In this work, we propose a general framework of sample weighting to improve the
stability of feature selection methods under sample variations. The framework first weights each sample in a given
training set according to its influence on the estimation of feature relevance, and then provides the weighted training set
       to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this
       framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the
       stability of representative feature selection algorithms such as SVM-RFE and ReliefF, without sacrificing their
       classification performance. Moreover, the proposed algorithm also leads to more stable gene signatures than the state-
       of-the-art ensemble method, particularly for small signature sizes.
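A minimal sketch of the margin idea behind such weighting, in the spirit of Relief-style hypothesis margins: a sample's weight is its distance to the nearest sample of another class (miss) minus its distance to the nearest sample of its own class (hit), clipped at zero. The toy data and the clipping rule are illustrative assumptions, not the paper's exact algorithm:

```python
import math

def margin_weights(X, y):
    """Weight samples by hypothesis margin: nearest-miss distance minus
    nearest-hit distance. Outlier-like samples closer to the opposite
    class get margin <= 0 and hence weight 0; weights are normalized."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    weights = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        hit = min(dist(xi, X[j]) for j in range(len(X)) if j != i and y[j] == yi)
        miss = min(dist(xi, X[j]) for j in range(len(X)) if y[j] != yi)
        weights.append(max(miss - hit, 0.0))
    total = sum(weights) or 1.0
    return [w / total for w in weights]
```

Feeding such weights to a selector like ReliefF or SVM-RFE damps the influence of unstable samples, which is the mechanism behind the improved selection stability.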

EGC
       Stochastic Gene Expression Modeling with Hill Function for Switch-Like Gene Responses
8293


Gene expression models play a key role in understanding the mechanisms of gene regulation, whose characteristic
modes are graded and switch-like responses. Though many stochastic approaches attempt to explain gene expression
mechanisms, the Gillespie algorithm, which is commonly used to simulate stochastic models, hardly explains the
switch-like behaviors of gene responses. In this study, we propose a stochastic gene expression model that can
describe the switch-like behaviors of gene responses by incorporating Hill functions into the conventional Gillespie
algorithm. We assume eight processes of gene expression, and their biologically appropriate reaction rates are
estimated based on the published literature. Our negative regulatory model shows that the modified Gillespie
algorithm successfully describes the
       switch-like behaviors of gene responses, which is consistent with a published experimental study. We observe that the
state of the system in the toggle switch model is rarely changed, since the Hill function prevents the activation of
involved proteins when their concentrations stay at a low level. In the ScbA/ScbR system, which can control the antibiotic
       metabolite production of microorganisms, our proposed stochastic approach successfully models its switch-like gene
       response and oscillatory expressions.
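A minimal sketch of the modification described above: a standard Gillespie SSA for a single self-repressing gene whose production propensity is gated by a repressive Hill function. The two-reaction system and all rate parameters are illustrative assumptions, far simpler than the paper's eight-process model:

```python
import random

def hill_repression(p, K, n):
    """Repressive Hill function: ~1 for p << K, ~0 for p >> K; larger n
    makes the transition sharper (more switch-like)."""
    return K ** n / (K ** n + p ** n)

def gillespie_hill(t_end=200.0, k_tx=1.0, k_deg=0.02, K=25.0, n=4, seed=1):
    """Gillespie SSA with a Hill-gated production propensity.
    Reactions: 0 -> P at k_tx * hill(P)  (self-repressed production),
               P -> 0 at k_deg * P      (first-order degradation)."""
    rng = random.Random(seed)
    t, p = 0.0, 0
    while t < t_end:
        a1 = k_tx * hill_repression(p, K, n)   # production propensity
        a2 = k_deg * p                         # degradation propensity
        a0 = a1 + a2
        t += rng.expovariate(a0)               # time to next reaction
        if rng.random() * a0 < a1:
            p += 1
        else:
            p -= 1
    return p
```

With these numbers the copy number fluctuates near the balance point k_tx * hill(p) = k_deg * p (around p ≈ 25 here); the steep Hill gate is what keeps low-expression states from spuriously activating, i.e., the switch-like behavior discussed above.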



EGC    Structural SCOP Superfamily Level Classification Using Unsupervised Machine Learning
8294



       One of the major research directions in bioinformatics is that of assigning superfamily classification to a given set of
       proteins. The classification reflects the structural, evolutionary, and functional relatedness. These relationships are
embodied in a hierarchical classification, such as the Structural Classification of Proteins (SCOP), which is mostly
       manually curated. Such a classification is essential for the structural and functional analyses of proteins. Yet a large
       number of proteins remain unclassified. In this study, we have proposed an unsupervised machine learning approach to




       classify and assign a given set of proteins to SCOP superfamilies. In the method, we have constructed a database and
       similarity matrix using P-values obtained from an all-against-all BLAST run and trained the network with the ART2
unsupervised learning algorithm, using the rows of the similarity matrix as input vectors, enabling the trained network to
classify the proteins with an F-measure accuracy ranging from 0.82 to 0.97. The performance of ART2 has been compared
with that of spectral clustering, Random Forest, SVM, and HHpred. ART2 performs better than all of these except HHpred,
which outperforms ART2 with a smaller sum of errors than the other methods evaluated.


EGC
8295
       Subcellular Localization Prediction through Boosting Association Rules


       Computational methods for predicting protein subcellular localization have used various types of features, including N-
terminal sorting signals, amino acid compositions, and text annotations from protein databases. Our approach does not
use biological knowledge such as sorting signals or homologues, but relies only on protein sequence information. The
       method divides a protein sequence into short k-mer sequence fragments which can be mapped to word features in
       document classification. A large number of class association rules are mined from the protein sequence examples that
       range from the N-terminus to the C-terminus. Then, a boosting algorithm is applied to those rules to build up a final
       classifier. Experimental results using benchmark data sets show that our method is excellent in terms of both the
       classification performance and the test coverage. The result also implies that the k-mer sequence features which
       determine subcellular locations do not necessarily exist in specific positions of a protein sequence.
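The k-mer "word" representation described above takes only a few lines; mining class association rules and boosting over these features are separate steps not shown here:

```python
from collections import Counter

def kmer_features(seq, k=3):
    """Map a protein sequence to overlapping k-mer counts, the analogue
    of word features in document classification."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, kmer_features("MKVLA") yields the 3-mers MKV, KVL, and VLA, each with count 1.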


EGC    The Complexity of Finding Multiple Solutions to Betweenness and Quartet Compatibility
8296

       We show that two important problems that have applications in computational biology are ASP-complete, which implies
       that, given a solution to a problem, it is NP-complete to decide if another solution exists. We show first that a variation of
       BETWEENNESS, which is the underlying problem of questions related to radiation hybrid mapping, is ASP-complete.
       Subsequently, we use that result to show that QUARTET COMPATIBILITY, a fundamental problem in phylogenetics that
       asks whether a set of quartets can be represented by a parent tree, is also ASP-complete. The latter result shows that
       Steel's QUARTET CHALLENGE, which asks whether a solution to QUARTET COMPATIBILITY is unique, is coNP-
       complete.




EGC
8297
       The GA and the GWAS: Using Genetic Algorithms to Search for Multilocus Associations


       Enormous data collection efforts and improvements in technology have made large genome-wide association studies a
       promising approach for better understanding the genetics of common diseases. Still, the knowledge gained from these
       studies may be extended even further by testing the hypothesis that genetic susceptibility is due to the combined effect
       of multiple variants or interactions between variants. Here, we explore and evaluate the use of a genetic algorithm to




        discover groups of SNPs (of size 2, 3, or 4) that are jointly associated with bipolar disorder. The algorithm is guided by
        the structure of a gene interaction network, and is able to find groups of SNPs that are strongly associated with the
        disease, while performing far fewer statistical tests than other methods.
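The search loop can be sketched as a toy genetic algorithm over fixed-size SNP groups. Here `toy_score` is a hypothetical association statistic with a planted interacting pair (SNPs 7 and 13); it stands in for the network-guided disease-association test, which is not reproduced:

```python
import random

def ga_search(score, n_snps, group_size=3, pop=40, gens=60, seed=0):
    """Toy GA over SNP groups: tournament selection between two random
    parents, one-point mutation, and tracking of the best group seen."""
    rng = random.Random(seed)
    P = [tuple(rng.sample(range(n_snps), group_size)) for _ in range(pop)]
    best = max(P, key=score)
    for _ in range(gens):
        nxt = []
        for _ in range(pop):
            a, b = rng.choice(P), rng.choice(P)
            child = list(max(a, b, key=score))                        # tournament winner
            child[rng.randrange(group_size)] = rng.randrange(n_snps)  # mutate one SNP
            nxt.append(tuple(child))
        P = nxt
        best = max(P + [best], key=score)
    return best

def toy_score(group):
    """Hypothetical fitness: counts how many planted SNPs (7, 13) appear."""
    return float(7 in group) + float(13 in group)
```

Each generation evaluates only `pop` groups, so far fewer statistical tests are performed than exhaustively enumerating all groups of size 3.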

EGC     The Impact of Normalization and Phylogenetic Information on Estimating the Distance for
8298
         Metagenomes

Metagenomics enables the direct study of unculturable microorganisms in different environments. Discriminating
        between the compositional differences of metagenomes is an important and challenging problem. Several distance
        functions have been proposed to estimate the differences based on functional profiles or taxonomic distributions;
        however, the strengths and limitations of such functions are still unclear. Initially, we analyzed three well-known
        distance functions and found very little difference between them in the clustering of samples. This motivated us to
        incorporate suitable normalizations and phylogenetic information into the functions so that we could cluster samples
from both real and synthetic data sets. The results indicate significant improvement in the sample clustering
derived by incorporating rank-based normalization with phylogenetic information, regardless of whether the samples are from real or
        synthetic microbiomes. Furthermore, our findings suggest that considering suitable normalizations and phylogenetic
        information is essential when designing distance functions for estimating the differences between metagenomes. We
        conclude that incorporating rank-based normalization with phylogenetic information into the distance functions helps
        achieve reliable clustering results.
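The rank-based normalization discussed above can be sketched as follows: raw abundances are replaced by scaled ranks (average rank for ties) before a distance is taken. A plain L1 distance is used here for illustration; the phylogeny-aware weighting is not shown:

```python
def rank_normalize(profile):
    """Replace abundances by their ranks (average rank for ties),
    scaled into (0, 1]."""
    n = len(profile)
    order = sorted(range(n), key=lambda i: profile[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and profile[order[j + 1]] == profile[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg / n
        i = j + 1
    return ranks

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))
```

Two profiles with the same abundance ordering but very different magnitudes become identical after rank normalization, which is what makes the resulting distances robust to differences in sequencing depth.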




EGC
8299
        The Kernel of Maximum Agreement Subtrees


        A Maximum Agreement SubTree (MAST) is a largest subtree common to a set of trees and serves as a summary of
        common substructure in the trees. A single MAST can be misleading, however, since there can be an exponential
        number of MASTs, and two MASTs for the same tree set do not even necessarily share any leaves. In this paper, we
        introduce the notion of the Kernel Agreement SubTree (KAST), which is the summary of the common substructure in all
        MASTs, and show that it can be calculated in polynomial time (for trees with bounded degree). Suppose the input trees
        represent competing hypotheses for a particular phylogeny. We explore the utility of the KAST as a method to discern
        the common structure of confidence, and as a measure of how confident we are in a given tree set. We also show the
        trend of the KAST, as compared to other consensus methods, on the set of all trees visited during a Bayesian analysis
        of flatworm genomes.

EGC
8300
         The Relevance of Topology in Parallel Simulation of Biological Networks


        Important achievements in traditional biology have deepened the knowledge about living systems leading to an
        extensive identification of parts-list of the cell as well as of the interactions among biochemical species responsible for




cell's regulation. Such expanding knowledge also introduces new issues. For example, the increasing
comprehension of the interdependencies between pathways (pathway cross-talk) has resulted, on the one hand, in a
growth of informational complexity and, on the other, in a strong lack of information coherence. The overall grand
challenge remains unchanged: to be able to assemble the knowledge of every "piece" of a system in order to figure out the
       behavior of the whole (integrative approach). In light of these considerations, high performance computing plays a
       fundamental role in the context of in-silico biology. Stochastic simulation is a renowned analysis tool, which, although
       widely used, is subject to stringent computational requirements, in particular when dealing with heterogeneous and high
       dimensional systems. Here, we introduce and discuss a methodology aimed at alleviating the burden of simulating
       complex biological networks. Such a method, which springs from graph theory, is based on the principle of fragmenting
       the computational space of a simulation trace and delegating the computation of fragments to a number of parallel
       processes.



EGC
8301
       Touring Protein Space with Matt



Using the Matt structure alignment program, we take a tour of protein space, producing a hierarchical
       clustering scheme that divides protein structural domains into clusters based on geometric dissimilarity. While it was
       known that purely structural, geometric, distance-based measures of structural similarity, such as Dali/FSSP, could
       largely replicate hand-curated schemes such as SCOP at the family level, it was an open question as to whether any
       such scheme could approximate SCOP at the more distant superfamily and fold levels. We partially answer this question
       in the affirmative, by designing a clustering scheme based on Matt that approximately matches SCOP at the superfamily
       level, and demonstrates qualitative differences in performance between Matt and DaliLite. Implications for the debate
       over the organization of protein fold space are discussed. Based on our clustering of protein space, we introduce the
       Mattbench benchmark set, a new collection of structural alignments useful for testing sequence aligners on more
       distantly homologous proteins.

EGC
        Transactional Database Transformation and Its Application in Prioritizing Human Disease
8302
        Genes

Binary (0,1) matrices, commonly known as transactional databases, can represent many kinds of application data,
including gene-phenotype data, where "1" represents a confirmed gene-phenotype relation and "0" represents an
unknown relation. It is natural to ask what information is hidden behind these "0"s and "1"s. Unfortunately, recent
matrix completion methods, though very effective in many cases, are less likely to infer something interesting from
these (0,1)-matrices. To answer this challenge, we propose IndEvi, a very succinct and effective algorithm to perform
independent-evidence-based transactional database transformation. Each entry of a (0,1)-matrix is evaluated by
"independent evidence" (maximal supporting patterns) extracted from the whole matrix for this entry. The value of an entry, regardless






of whether it is 0 or 1, has no effect on its independent evidence. The experiment on a gene-phenotype
database shows that our method is highly promising in ranking candidate genes and predicting unknown disease genes.



EGC    Transient Dynamics of Reduced-Order Models of Genetic Regulatory Networks
8303

In systems biology, a number of detailed genetic regulatory network models have been proposed that are capable of
modeling the fine-scale dynamics of gene expression. However, limitations on the type and sampling frequency of
       experimental data often prevent the parameter estimation of the detailed models. Furthermore, the high computational
       complexity involved in the simulation of a detailed model restricts its use. In such a scenario, reduced-order models
       capturing the coarse-scale behavior of the network are frequently applied. In this paper, we analyze the dynamics of a
       reduced-order Markov Chain model approximating a detailed Stochastic Master Equation model. Utilizing a reduction
       mapping that maintains the aggregated steady-state probability distribution of stochastic master equation models, we
       provide bounds on the deviation of the Markov Chain transient distribution from the transient aggregated distributions
       of the stochastic master equation model.
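The kind of reduction mapping described, one that preserves the aggregated steady-state distribution while transient distributions may still deviate, can be illustrated on a small chain. The 4-state transition matrix and two-group partition below are invented for illustration and are not from the paper:

```python
import numpy as np

# Detailed 4-state Markov chain (row-stochastic transition matrix),
# standing in for a fine-scale stochastic master equation model.
P = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.3, 0.5, 0.1, 0.1],
    [0.0, 0.1, 0.6, 0.3],
    [0.1, 0.0, 0.4, 0.5],
])
groups = [[0, 1], [2, 3]]  # aggregation into two coarse "super-states"

# Steady state of the detailed chain (left eigenvector for eigenvalue 1).
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Reduced chain: transitions between groups, weighted by the steady state.
# This construction preserves the aggregated steady-state distribution.
Q = np.zeros((2, 2))
for a, ga in enumerate(groups):
    wa = pi[ga] / pi[ga].sum()
    for b, gb in enumerate(groups):
        Q[a, b] = sum(wa[i] * P[gi, gj] for i, gi in enumerate(ga) for gj in gb)

# Aggregated steady state is also stationary for the reduced chain ...
pi_agg = np.array([pi[g].sum() for g in groups])
assert np.allclose(pi_agg @ Q, pi_agg)

# ... but transient distributions can deviate: propagate both from one start.
p = np.array([1.0, 0.0, 0.0, 0.0])   # detailed initial distribution
q = np.array([1.0, 0.0])             # its aggregation
for t in range(5):
    p, q = p @ P, q @ Q
    agg_p = np.array([p[g].sum() for g in groups])
    print(t, round(float(np.abs(agg_p - q).sum()), 4))  # transient gap per step
```

The printed gap is exactly the kind of transient deviation the paper bounds: zero at steady state by construction, but generally nonzero along the way.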

EGC    Using GPUs for the Exact Alignment of Short-Read Genetic Sequences by Means of the Burrows-Wheeler Transform
8296

       General-Purpose Graphics Processing Units (GPGPUs) constitute an inexpensive resource for compute-intensive
       applications that can exploit intrinsic fine-grained parallelism. This paper presents the design and implementation on
       GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare
       this algorithm with state-of-the-art implementations of the same algorithm on standard CPUs under the same I/O
       conditions. Excluding disk transfers, the GPU implementation of the algorithm shows a speedup larger than 12x
       compared to CPU execution. The implementation exploits parallelism by concurrently searching different sequences
       on the same reference search tree, maximizing memory locality and ensuring symmetric access to the data. The
       paper describes the behavior of the algorithm on the GPU, showing good performance scalability, limited only by the
       size of the GPU's internal memory.
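The backward-search kernel at the heart of BWT-based exact matching can be sketched on a CPU in a few lines. The reference string and helper names below are illustrative, and this sketch omits all the GPU-specific memory-layout and parallelization concerns the paper addresses:

```python
def bwt_index(text):
    """Build the BWT plus the C[] and Occ tables used by backward search."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # suffix array
    bwt = "".join(text[i - 1] for i in sa)                  # last column
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ

def count_matches(pattern, sa, C, occ):
    """Backward search: narrow the suffix-array interval one character
    at a time, from the last pattern character to the first."""
    lo, hi = 0, len(sa)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo  # number of exact occurrences in the reference

reference = "ACGTACGTAC"          # toy reference; real references are genomes
sa, C, occ = bwt_index(reference)
print(count_matches("ACGT", sa, C, occ))  # -> 2
print(count_matches("GTA", sa, C, occ))   # -> 2
```

Because each read's search touches only the shared index tables, many reads can be searched concurrently against the same reference, which is precisely the parallelism the GPU implementation exploits.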




                 IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects

Ieee projects 2012 2013 - Bio Informatics

  • 1.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com IEEE FINAL YEAR PROJECTS 2012 – 2013 BIO- INFORMATICS Corporate Office: Madurai 227-230, Church road, Anna nagar, Madurai – 625 020. 0452 – 4390702, 4392702, +9199447933980 Email: info@elysiumtechnologies.com, elysiumtechnologies@gmail.com Website: www.elysiumtechnologies.com Branch Office: Trichy 15, III Floor, SI Towers, Melapudur main road, Trichy – 620 001. 0431 – 4002234, +919790464324. Email: trichy@elysiumtechnologies.com, elysium.trichy@gmail.com. Website: www.elysiumtechnologies.com Branch Office: Coimbatore 577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641 002. +919677751577 Website: Elysiumtechnologies.com, Email: info@elysiumtechnologies.com Branch Office: Kollam Surya Complex, Vendor junction, Kollam – 691 010, Kerala. 0474 – 2723622, +919446505482. Email: kerala@elysiumtechnologies.com. Website: www.elysiumtechnologies.com Branch Office: Cochin 4th Floor, Anjali Complex, near south over bridge, Valanjampalam, Cochin – 682 016, Kerala. 0484 – 6006002, +917736004002. Email: kerala@elysiumtechnologies.com, Website: www.elysiumtechnologies.com IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 2.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com BIO - INFORMATICS 2012 - 2013 EGC A Biologically Inspired Validity Measure for Comparison of Clustering Methods over 8201 Metabolic Data Sets In the biological domain, clustering is based on the assumption that genes or metabolites involved in a common biological process are coexpressed/coaccumulated under the control of the same regulatory network. Thus, a detailed inspection of the grouped patterns to verify their memberships to well-known metabolic pathways could be very useful for the evaluation of clusters from a biological perspective. The aim of this work is to propose a novel approach for the comparison of clustering methods over metabolic data sets, including prior biological knowledge about the relation among elements that constitute the clusters. A way of measuring the biological significance of clustering solutions is proposed. This is addressed from the perspective of the usefulness of the clusters to identify those patterns that change in coordination and belong to common pathways of metabolic regulation. The measure summarizes in a compact way the objective analysis of clustering methods, which respects coherence and clusters distribution. It also evaluates the biological internal connections of such clusters considering common pathways. The proposed measure was tested in two biological databases using three clustering methods. EGC A Co-Clustering Approach for Mining Large Protein-Protein Interaction Networks 8202 Several approaches have been presented in the literature to cluster Protein-Protein Interaction (PPI) networks. They can be grouped in two main categories: those allowing a protein to participate in different clusters and those generating only nonoverlapping clusters. 
In both cases, a challenging task is to find a suitable compromise between the biological relevance of the results and a comprehensive coverage of the analyzed networks. Indeed, methods returning high accurate results are often able to cover only small parts of the input PPI network, especially when low-characterized networks are considered. We present a coclustering-based technique able to generate both overlapping and nonoverlapping clusters. The density of the clusters to search for can also be set by the user. We tested our method on the two networks of yeast and human, and compared it to other five well-known techniques on the same interaction data sets. The results showed that, for all the examples considered, our approach always reaches a good compromise between accuracy and network coverage. Furthermore, the behavior of our algorithm is not influenced by the structure of the input network, different from all the techniques considered in the comparison, which returned very good results on the yeast network, while on the human network their outcomes are rather poor. EGC A Comparative Study on Filtering Protein Secondary Structure Prediction 8203 IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 3.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Filtering of Protein Secondary Structure Prediction (PSSP) aims to provide physicochemically realistic results, while it usually improves the predictive performance. We performed a comparative study on this challenging problem, utilizing both machine learning techniques and empirical rules and we found that combinations of the two lead to the highest improvement. EGC A Computational Model for Predicting Protein Interactions Based on Multidomain 8204 Collaboration Recently, several domain-based computational models for predicting protein-protein interactions (PPIs) have been proposed. The conventional methods usually infer domain or domain combination (DC) interactions from already known interacting sets of proteins, and then predict PPIs using the information. However, the majority of these models often have limitations in providing detailed information on which domain pair (single domain interaction) or DC pair (multidomain interaction) will actually interact for the predicted protein interaction. Therefore, a more comprehensive and concrete computational model for the prediction of PPIs is needed. We developed a computational model to predict PPIs using the information of intraprotein domain cohesion and interprotein DC coupling interaction. A method of identifying the primary interacting DC pair was also incorporated into the model in order to infer actual participants in a predicted interaction. Our method made an apparent improvement in the PPI prediction accuracy, and the primary interacting DC pair identification was valid specifically in predicting multidomain protein interactions. 
In this paper, we demonstrate that 1) the intraprotein domain cohesion is meaningful in improving the accuracy of domain-based PPI prediction, 2) a prediction model incorporating the intradomain cohesion enables us to identify the primary interacting DC pair, and 3) a hybrid approach using the intra/interdomain interaction information can lead to a more accurate prediction. EGC 8205 A Framework for Incorporating Functional Interrelationships into Protein Function Prediction Algorithms The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many computational approaches have been developed in recent years to predict protein function, most of these traditional algorithms do not take interrelationships among functional terms into account, such as different GO terms usually coannotate with some common proteins. In this study, we propose a new functional similarity measure in the form of Jaccard coefficient to quantify these interrelationships and also develop a framework for incorporating GO term similarity into protein function prediction process. The experimental results of cross-validation on S. cerevisiae and Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function prediction. In addition, we find that small size terms associated with a few of proteins obtain more benefit than the large size ones when considering functional interrelationships. We also compare our similarity measure with other two widely used measures, and results indicate that when incorporated into function prediction algorithms, our proposed measure is more effective. Experiment results also illustrate that our algorithms outperform two previous competing IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 4.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our method is robust to annotations in the database which are not complete at present. These results give new insights about the importance of functional interrelationships in protein function prediction. EGC A Hybrid Approach to Survival Model Building Using Integration of Clinical and Molecular 8206 Information in Censored Data In medical society, the prognostic models, which use clinicopathologic features and predict prognosis after a certain treatment, have been externally validated and used in practice. In recent years, most research has focused on high dimensional genomic data and small sample sizes. Since clinically similar but molecularly heterogeneous tumors may produce different clinical outcomes, the combination of clinical and genomic information, which may be complementary, is crucial to improve the quality of prognostic predictions. However, there is a lack of an integrating scheme for clinic- genomic models due to the {rm P}gg{rm N} problem, in particular, for a parsimonious model. We propose a methodology to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which uses several dimension reduction techniques, {rm L}_{2} penalized maximum likelihood estimation (PMLE), and resampling methods to tackle the problem. The predictive accuracy of the modeling approach is assessed by several metrics via an independent and thorough scheme to compare competing methods. 
In breast cancer data studies on a metastasis and death event, we show that the proposed methodology can improve prediction accuracy and build a final model with a hybrid signature that is parsimonious when integrating both types of variables. EGC A Hybrid EKF and Switching PSO Algorithm for Joint State and Parameter Estimation of 8207 Lateral Flow Immunoassay Models In this paper, a hybrid extended Kalman filter (EKF) and switching particle swarm optimization (SPSO) algorithm is proposed for jointly estimating both the parameters and states of the lateral flow immunoassay model through available short time-series measurement. Our proposed method generalizes the well-known EKF algorithm by imposing physical constraints on the system states. Note that the state constraints are encountered very often in practice that give rise to considerable difficulties in system analysis and design. The main purpose of this paper is to handle the dynamic modeling problem with state constraints by combining the extended Kalman filtering and constrained optimization algorithms via the maximization probability method. More specifically, a recently developed SPSO algorithm is used to cope with the constrained optimization problem by converting it into an unconstrained optimization one through adding a penalty term to the objective function. The proposed algorithm is then employed to simultaneously identify the parameters and states of a lateral flow immunoassay model. It is shown that the proposed algorithm gives much improved performance over the traditional EKF method. EGC A Memory Efficient Method for Structure-Based RNA Multiple Alignment 8208 IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 5.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Structure-based RNA multiple alignment is particularly challenging because covarying mutations make sequence information alone insufficient. Existing tools for RNA multiple alignment first generate pairwise RNA structure alignments and then build the multiple alignment using only sequence information. Here we present PMFastR, an algorithm which iteratively uses a sequence-structure alignment procedure to build a structure-based RNA multiple alignment from one sequence with known structure and a database of sequences from the same family. PMFastR also has low memory consumption allowing for the alignment of large sequences such as 16S and 23S rRNA. The algorithm also provides a method to utilize a multicore environment. We present results on benchmark data sets from BRAliBase, which shows PMFastR performs comparably to other state-of-the-art programs. Finally, we regenerate 607 Rfam seed alignments and show that our automated process creates multiple alignments similar to the manually curated Rfam seed alignments. Thus, the techniques presented in this paper allow for the generation of multiple alignments using sequence-structure guidance, while limiting memory consumption. As a result, multiple alignments of long RNA sequences, such as 16S and 23S rRNAs, can easily be generated locally on a personal computer. The software and supplementary data are available at http://genome.ucf.edu/PMFastR. EGC 8209 A Metric for Phylogenetic Trees Based on Matching Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. 
A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes—reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance. EGC 8210 A New Efficient Algorithm for the Gene-Team Problem on General Sequences Identifying conserved gene clusters is an important step toward understanding the evolution of genomes and predicting the functions of genes. A famous model to capture the essential biological features of a conserved gene cluster is called the gene-team model. The problem of finding the gene teams of two general sequences is the focus of this paper. For this problem, He and Goldwasser had an efficient algorithm that requires O(mn) time using O(m + n) working space, where m and n are, respectively, the numbers of genes in the two given sequences. In this paper, a new efficient IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 6.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com algorithm is presented. Assume m ≤ n. Let C = ΣαϵΣ o1(α)o2(α), where Σ is the set of distinct genes, and o1(α) and o2(a) are, respectively, the numbers of copies of a in the two given sequences. Our new algorithm requires O(min{C lg n,mn}) time using O(m + n) working space. As compared with He and Goldwasser's algorithm, our new algorithm is more practical, as C is likely to be much smaller than mn in practice. In addition, our new algorithm is output sensitive. Its running time is O(lg n) times the size of the output. Moreover, our new algorithm can be efficiently extended to find the gene teams of k general sequences in O(k C Ig (n1n2...nk)) time, where ni is the number of genes in the ith input sequence. EGC A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences 8211 Today's genome analysis applications require sequence representations allowing for fast access to their contents while also being memory-efficient enough to facilitate analyses of large-scale data. While a wide variety of sequence representations exist, lack of a generic implementation of efficient sequence storage has led to a plethora of poorly reusable or programming language- specific implementations. We present a novel, space-efficient data structure (GtEncseq) for storing multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. For the human genome (3.1 gigabases, including 237 million wildcard characters) our representation requires only 2 + 8 · 10-6 bits per character. 
Implemented in C, our portable software implementation provides a variety of methods for random and sequential access to characters and substrings (including different reading directions) using an object-oriented interface. In addition, it includes access to metadata like sequence descriptions or character distributions. The library is extensible to be used from various scripting languages. GtEncseq is much more versatile than previous solutions, adding features that were previously unavailable. Benchmarks show that it is competitive with respect to space and time requirements EGC A New Unsupervised Feature Ranking Method for Gene Expression Data Based on 8212 Consensus Affinity Feature selection is widely established as one of the fundamental computational techniques in mining microarray data. Due to the lack of categorized information in practice, unsupervised feature selection is more practically important but correspondingly more difficult. Motivated by the cluster ensemble techniques, which combine multiple clustering solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection proposed to use these consensus solutions as oracles. However, these methods are dependent on both the particular cluster ensemble algorithm used and the knowledge of the true cluster number. These methods will be unsuitable when the true cluster number is not available, which is common in practice. In view of the above problems, a new unsupervised feature ranking method is proposed to evaluate the importance of the features based on consensus affinity. Different from previous works, our method compares the corresponding affinity of each feature between a pair of instances based on the consensus matrix of clustering solutions. As a result, our method alleviates the need to know the true number of clusters and the dependence on particular cluster ensemble approaches as in previous works. 
IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 7.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com Experiments on real gene expression data sets demonstrate significant improvement of the feature ranking results when compared to several state-of-the-art techniques. EGC A Sparse Regulatory Network of Copy-Number Driven Gene Expression Reveals 8213 Putative Breast Cancer Oncogenes The influence of DNA cis-regulatory elements on a gene's expression has been intensively studied. However, little is known about expressions driven by trans-acting DNA hotspots. DNA hotspots harboring copy number aberrations are recognized to be important in cancer as they influence multiple genes on a global scale. The challenge in detecting trans-effects is mainly due to the computational difficulty in detecting weak and sparse trans-acting signals amidst co- occuring passenger events. We propose an integrative approach to learn a sparse interaction network of DNA copy- number regions with their downstream targets in a breast cancer dataset. Information from this network helps distinguish copy-number driven from copy-number independent expression changes on a global scale. Our result further delineates cis- and trans-effects in a breast cancer dataset, for which important oncogenes such as ESR1 and ERBB2 appear to be highly copy-number dependent. Further, our model is shown to be efficient and in terms of goodness of fit no worse than other state-of the art predictors and network reconstruction models using both simulated and real data. EGC A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray 8214 Analysis Despite years of research, the name ambiguity problem remains largely unresolved. 
Outstanding issues include how to A plenitude of feature selection (FS) methods is available in the literature, most of them rising as a need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a general accepted rule, these methods are grouped in filters, wrappers, and embedded methods. More recently, a new group of methods has been added in the general framework of FS: ensemble techniques. The focus in this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities. EGC A Swarm Intelligence Framework for Reconstructing Gene Networks: Searching for 8215 Biologically Plausible Architectures In this paper, we investigate the problem of reverse engineering the topology of gene regulatory networks from temporal gene expression data. We adopt a computational intelligence approach comprising swarm intelligence techniques, namely particle swarm optimization (PSO) and ant colony optimization (ACO). In addition, the recurrent neural network IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 8.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com (RNN) formalism is employed for modeling the dynamical behavior of gene regulatory systems. More specifically, ACO is used for searching the discrete space of network architectures and PSO for searching the corresponding continuous space of RNN model parameters. We propose a novel solution construction process in the context of ACO for generating biologically plausible candidate architectures. The objective is to concentrate the search effort into areas of the structure space that contain architectures which are feasible in terms of their topological resemblance to real-world networks. The proposed framework is initially applied to the reconstruction of a small artificial network that has previously been studied in the context of gene network reverse engineering. Subsequently, we consider an artificial data set with added noise for reconstructing a subnetwork of the genetic interaction network of S. cerevisiae (yeast). Finally, the framework is applied to a real-world data set for reverse engineering the SOS response system of the bacterium Escherichia coli. Results demonstrate the relative advantage of utilizing problem-specific knowledge regarding biologically plausible structural properties of gene networks over conducting a problem-agnostic search in the vast space of network architectures. EGC A Top-r Feature Selection Algorithm for Microarray Gene Expression Data 8216 Most of the conventional feature selection algorithms have a drawback whereby a weakly ranked gene that could perform well in terms of classification accuracy with an appropriate subset of genes will be left out of the selection. Considering this shortcoming, we propose a feature selection algorithm in gene expression data analysis of sample classifications. 
The proposed algorithm first divides genes into subsets, the sizes of which are relatively small (roughly of size h), then selects informative smaller subsets of genes (of size r <; h) from a subset and merges the chosen genes with another gene subset (of size r) to update the gene subset. We repeat this process until all subsets are merged into one informative subset. We illustrate the effectiveness of the proposed algorithm by analyzing three distinct gene expression data sets. Our method shows promising classification accuracy for all the test data sets. We also show the relevance of the selected genes in terms of their biological functions. EGC Algorithms for Reticulate Networks of Multiple Phylogenetic Trees 8217 A reticulate network N of multiple phylogenetic trees may have nodes with two or more parents (called reticulation nodes). There are two ways to define the reticulation number of N. One way is to define it as the number of reticulation nodes in N in this case, a reticulate network with the smallest reticulation number is called an optimal type-I reticulate network of the trees. The better way is to define it as the total number of parents of reticulation nodes in N minus the number of reticulation nodes in N ; in this case, a reticulate network with the smallest reticulation number is called an optimal type-II reticulate network of the trees. In this paper, we first present a fast fixed-parameter algorithm for constructing one or all optimal type-I reticulate networks of multiple phylogenetic trees. We then use the algorithm together with other ideas to obtain an algorithm for estimating a lower bound on the reticulation number of an optimal type-II reticulate network of the input trees. To our knowledge, these are the first fixed-parameter algorithms for the IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
  • 9.
    Elysium Technologies PrivateLimited Approved by ISO 9001:2008 and AICTE for SKP Training Singapore | Madurai | Trichy | Coimbatore | Cochin | Kollam | Chennai http://www.elysiumtechnologies.com, info@elysiumtechnologies.com problems. We have implemented the algorithms in ANSI C, obtaining programs CMPT and MaafB. Our experimental data show that CMPT can construct optimal type-I reticulate networks rapidly and MaafB can compute better lower bounds for optimal type-II reticulate networks within shorter time than the previously best program PIRN designed by Wu. EGC Algorithms to Detect Multi-protein Modularity Conserved during Evolution 8218 Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein modularity conserved during evolution that improves on the running time of previous algorithms for related problems and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through randomization experiments, we demonstrate that conserved modularity is a defining characteristic of interactomes. 
Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila melanogaster, combining results across algorithms, show that nearly 10 percent of current interactome proteins participate in multiprotein modules with good evidence in the protein interaction data of being conserved between human and Drosophila. EGC An Efficient Algorithm for Haplotype Inferenceon Pedigrees with Recombinations and 8219 Mutations Haplotype Inference (HI) is a computational challenge of crucial importance in a range of genetic studies. Pedigrees allow to infer haplotypes from genotypes more accurately than population data, since Mendelian inheritance restricts the set of possible solutions. In this work, we define a new HI problem on pedigrees, called Minimum-Change Haplotype Configuration (MCHC) problem, that allows two types of genetic variation events: recombinations and mutations. Our new formulation extends the Minimum-Recombinant Haplotype Configuration (MRHC) problem, that has been proposed in the literature to overcome the limitations of classic statistical haplotyping methods. Our contribution is twofold. First, we prove that the MCHC problem is APX-hard under several restrictions. Second, we propose an efficient and accurate heuristic algorithm for MCHC based on an L-reduction to a well-known coding problem. Our heuristic can also be used to solve the original MRHC problem and can take advantage of additional knowledge about the input genotypes. Moreover, the L-reduction proves for the first time that MCHC and MRHC are O(nm/log nm)-approximable on general pedigrees, where n is the pedigree size and m is the genotype length. Finally, we present an extensive experimental evaluation and comparison of our heuristic algorithm with several other state-of-the-art methods for HI on pedigrees. IEEE Final Year Projects 2012 |Student Projects | Bio-informatics Projects
EGC 8220    An Efficient Method for Exploring the Space of Gene Tree/Species Tree Reconciliations in a Probabilistic Framework
Background. Inferring an evolutionary scenario for a gene family is a fundamental problem with applications in both functional and evolutionary genomics. The gene tree/species tree reconciliation approach has been widely used to address this problem, but mostly in a discrete parsimony framework that aims at minimizing the number of gene duplications and/or gene losses. Recently, a probabilistic approach based on the classical birth-and-death process has been developed, including efficient algorithms for computing posterior probabilities of reconciliations and for orthology prediction. Results. In previous work, we described an algorithm for exploring the whole space of gene tree/species tree reconciliations, which we adapt here to efficiently compute the posterior probability of such reconciliations. These posterior probabilities can be either computed exactly or approximated, depending on the size of the reconciliation space. We use this algorithm to analyze the probabilistic landscape of the space of reconciliations for a real data set of fungal gene families and several data sets of synthetic gene trees. Conclusion. The results of our simulations suggest that, with exact gene trees obtained by a simple birth-and-death process and realistic gene duplication/loss rates, only a very small subset of all reconciliations needs to be explored in order to approximate very closely the posterior probability of the most likely reconciliations. For cases where the posterior probability mass is more evenly dispersed, our method allows efficient exploration of the required subspace of reconciliations.
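The discrete parsimony framework that the probabilistic approach builds on can be sketched with the classical LCA reconciliation: each internal gene-tree node is mapped to the lowest common ancestor of its children's images in the species tree, and a node is charged as a duplication when its image coincides with that of one of its children. A minimal sketch under assumed data layouts (species tree as a child-to-parent map, gene tree as nested pairs of species labels); this is the textbook parsimony mapping, not the posterior-probability algorithm of the paper.

```python
def lca(species_parent, a, b):
    """Lowest common ancestor in a species tree given as a child->parent map."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = species_parent.get(a)
    while b not in ancestors:
        b = species_parent[b]
    return b

def reconcile(gene_tree, species_parent):
    """LCA reconciliation: returns (species image of the root, #duplications)."""
    if not isinstance(gene_tree, tuple):        # leaf, labeled by its species
        return gene_tree, 0
    left, right = gene_tree
    ml, dl = reconcile(left, species_parent)
    mr, dr = reconcile(right, species_parent)
    m = lca(species_parent, ml, mr)
    dup = 1 if m in (ml, mr) else 0             # image equals a child's image
    return m, dl + dr + dup
```

For the species tree ((A,B)X,C)R, the gene tree ((A,B),(A,C)) maps its root to R and requires one duplication, while ((A,B),C) requires none.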
EGC 8221    An Efficient Method for Modeling Kinetic Behavior of Channel Proteins in Cardiomyocytes
Characterization of the kinetic and conformational properties of channel proteins is a crucial element in the integrative study of congenital cardiac diseases. The ion channel proteins of cardiomyocytes form an important family of biological components that determine the physiology of the heart. Some computational studies aiming to understand the mechanisms of these channels have concentrated on Markovian stochastic approaches. Mathematically, these approaches employ Chapman-Kolmogorov equations coupled with partial differential equations. As the scale and complexity of such subcellular and cellular models increase, the balance between the efficiency and accuracy of algorithms becomes critical. We have developed a novel two-stage splitting algorithm to address the efficiency and accuracy issues arising in such modeling and simulation scenarios. Numerical experiments were performed by incorporating our newly developed conformational kinetic model of the rapid delayed rectifier potassium channel into dynamic models of human ventricular myocytes. Our results show that the new algorithm significantly outperforms commonly adopted adaptive Runge-Kutta methods. Furthermore, our parallel simulations with coupled algorithms for multicellular cardiac tissue demonstrate high linearity in the speedup of large-scale cardiac simulations.

EGC 8222    Cluster-Oriented Ensemble Classifier: Impact of Multicluster Characterization on Ensemble Classifier Learning
All clustering methods have to assume some cluster relationship among the data objects to which they are applied. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.

EGC 8223    An Information Theoretic Approach to Constructing Robust Boolean Gene Regulatory Networks
We introduce a class of finite-systems models of gene regulatory networks exhibiting behavior of the cell cycle. The network is an extension of a Boolean network model. The system spontaneously cycles through a finite set of internal states, tracking the increase of an external factor such as cell mass, and also exhibits checkpoints at which errors in gene expression levels due to cellular noise are automatically corrected. We present a 7-gene network based on Projective Geometry codes, which can correct, at any given time, one gene expression error.
The topology of the network is highly symmetric and requires only simple Boolean functions that can be synthesized using genes of various organisms. The attractor structure of the Boolean network contains a single cycle attractor. It is the smallest nontrivial network with such high robustness. The methodology allows the construction of artificial gene regulatory networks with more phases than in the natural cell cycle.

EGC 8224    Antilope—A Lagrangian Relaxation Approach to the de novo Peptide Sequencing Problem
Peptide sequencing from mass spectrometry data is a key step in proteome research. De novo sequencing in particular, the identification of a peptide from its spectrum alone, is still a challenge even for state-of-the-art algorithmic approaches. In this paper, we present Antilope, a new fast and flexible approach based on mathematical programming. It builds on the spectrum graph model and works with a variety of scoring schemes. Antilope combines Lagrangian relaxation for solving an integer linear programming formulation with an adaptation of Yen's k-shortest-paths algorithm. It shows a significant improvement in running time compared to mixed integer optimization and performs at the same speed as other state-of-the-art tools. We also implemented a generic probabilistic scoring scheme that can be trained automatically for a data set of annotated spectra and is independent of the mass spectrometer type. Evaluations on benchmark data show that Antilope is competitive with the popular state-of-the-art programs PepNovo and NovoHMM in terms of both runtime and accuracy. Furthermore, it offers increased flexibility in the number of considered ion types. Antilope will be freely available as part of the open-source proteomics library OpenMS.

EGC 8225    Assortative Mixing in Directed Biological Networks
We analyze assortative mixing patterns of biological networks, which are typically directed. We develop a theoretical background for analyzing mixing patterns in directed networks before applying it to specific biological networks. Two new quantities are introduced, the in-assortativity and the out-assortativity, which are shown to be useful in quantifying assortative mixing in directed networks. We also introduce local (node-level) quantities for in- and out-assortativity. Local assortativity profiles are the distributions of these local quantities over node degrees and can be used to analyze both canonical and real-world directed biological networks. Many biological networks that had previously been classified as disassortative are shown to be assortative with respect to these new measures. Finally, we demonstrate the use of local assortativity profiles in analyzing the functionalities of particular nodes and groups of nodes in real-world biological networks.

EGC 8226    BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences
Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, is able to compute, in a fast and efficient way, the coverage of a source sequence S on a target sequence T, taking into account direct and reverse segments, possibly overlapping. Using BpMatch, the operator should define a priori the minimum length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and occurring more than minRep times are considered significant. BpMatch outputs the significant segments found and the computed segment-based distance.
In the worst case, assuming the alphabet size d is a constant, the time required by BpMatch to calculate the coverage is O(l^2 n). On average, by setting l >= 2 log_d(n), the time required to calculate the coverage is only O(n). BpMatch, thanks to the minRep parameter, can also be used to perform a self-covering: covering a sequence using segments coming from the sequence itself, while avoiding the trivial solution of a single segment coinciding with the whole sequence.

EGC 8227    Clustering 100,000 Protein Structure Decoys in Minutes
Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large. In addition, existing clustering approaches suffer from arbitrariness in determining a distance threshold for proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold results in the merging of several independent clusters into one. In this paper, we propose an efficient clustering method based on fast estimation of cluster centroids and efficient pruning of rotation spaces. The number of clusters is detected automatically by information distance criteria. A package named ONION, which can be downloaded freely, is implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than
existing tools, and that ONION obtains better selections than SPICKER for 31 targets and worse selections for 19. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.

EGC 8228    Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison
The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity compared with multiple sequence alignment methods, the method has been widely discussed lately, and several formulas based on probabilistic models, such as Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle, which can quantify the nonrandom occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula itself maximizes the entropy, and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significance values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula correctly groups the clades of bird and reptile together, where Hao's and Yu's formulas failed. Using real data sets of different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.
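To make the CV idea concrete, here is a minimal sketch of a Hao-style composition vector: observed k-mer frequencies are compared against a Markov prediction built from (k-1)-mer and (k-2)-mer frequencies, and two sequences are compared by a cosine-based distance. This is the plain Hao-type construction with hypothetical function names; the paper's maximum-entropy formula is not reproduced here.

```python
from collections import Counter

def kmer_freq(seq, k):
    """Relative frequencies of all k-mers in seq."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return {w: counts[w] / n for w in counts}

def composition_vector(seq, k):
    """Hao-style CV (assumes k >= 3): deviation of observed k-mer frequency
    from the Markov prediction f0(w) = f(prefix) * f(suffix) / f(core)."""
    f, f1, f2 = kmer_freq(seq, k), kmer_freq(seq, k - 1), kmer_freq(seq, k - 2)
    vec = {}
    for w, fw in f.items():
        core = f2.get(w[1:-1], 0.0)
        f0 = f1.get(w[:-1], 0.0) * f1.get(w[1:], 0.0) / core if core else 0.0
        vec[w] = (fw - f0) / f0 if f0 else 0.0
    return vec

def cv_distance(seq_a, seq_b, k=3):
    """(1 - cosine similarity) / 2, in [0, 1]; 0 for identical composition."""
    va = composition_vector(seq_a, k)
    vb = composition_vector(seq_b, k)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = sum(x * x for x in va.values()) ** 0.5
    nb = sum(x * x for x in vb.values()) ** 0.5
    if na == 0 or nb == 0:
        return 0.5  # no compositional signal in one of the sequences
    return (1 - dot / (na * nb)) / 2
```

Identical sequences score 0, while sequences sharing no k-mers score 0.5 (cosine similarity 0 under this normalization).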
EGC 8229    Constructing and Drawing Regular Planar Split Networks
Split networks are commonly used to visualize collections of bipartitions, also called splits, of a finite set. Such collections arise, for example, in evolutionary studies. Split networks can be viewed as a generalization of phylogenetic trees and may be generated using the SplitsTree package. Recently, the NeighborNet method for generating split networks has become rather popular, in part because it is guaranteed to always generate a circular split system, which can always be displayed by a planar split network. Even so, labels must be placed on the "outside" of the network, which might be problematic in some applications. To help circumvent this problem, it can be helpful to consider so-called flat split systems, which can be displayed by planar split networks where labels are allowed on the inside of the network too. Here, we present a new algorithm that is guaranteed to compute a minimal planar split network displaying a flat split system in polynomial time, provided the split system is given in a certain format. We also briefly discuss two heuristics that could be useful for analyzing phylogeographic data and that allow the computation of flat split systems in this format in polynomial time.

EGC 8230    Constructing Complex 3D Biological Environments from Medical Imaging Using High Performance Computing
Extracting information about the structure of biological tissue from static image data is a complex task requiring computationally intensive operations. Here, we present how multicore CPUs and GPUs have been utilized to extract information about the shape, size, and path followed by the mammalian oviduct (called the fallopian tube in humans) from histology images, in order to create a unique but realistic 3D virtual organ. Histology images were processed to identify the individual cross sections and determine the 3D path that the tube follows through the tissue. This information was then related back to the histology images, linking the 2D cross sections with their corresponding 3D positions along the oviduct. A series of linear 2D spline cross sections, computationally generated along the length of the oviduct, were bound to the 3D path of the tube using a novel particle-system technique that provides smooth resolution of self-intersections. This results in a unique 3D model of the oviduct that is grounded in reality. The GPU is used for the processor-intensive operations of image processing and particle-physics-based simulation, significantly reducing the time required to generate a complete model.

EGC 8231    CSD Homomorphisms between Phylogenetic Networks
Since Darwin, species trees have been used as a simplified description of the relationships that summarize the complicated network N of reality. Recent evidence of hybridization and lateral gene transfer, however, suggests that there are situations where trees are inadequate. Consequently, it is important to determine properties that characterize networks closely related to N: possibly more complicated than trees, but lacking the full complexity of N.
A connected surjective digraph (CSD) map is a map f from one network N to another network M such that every arc is either collapsed to a single vertex or taken to an arc, such that f is surjective, and such that the inverse image of a vertex is always connected. CSD maps are shown to behave well under composition. It is proved that if there is a CSD map from N to M, then there is a way to lift an undirected version of M into N, often with added resolution. A CSD map from N to M puts strong constraints on N. In general, it may be useful to study classes of networks such that, for any N, there exists a CSD map from N to some standard member of that class.

EGC 8232    Designing Filters for Fast Known ncRNA Identification
Detecting members of known noncoding RNA (ncRNA) families in genomic DNA is an important part of sequence annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequences that are unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect ncRNA instances lacking strong conservation while excluding most irrelevant sequences remains challenging. In this work, we design three types of filters based on multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e., a position weight matrix) with secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are extensively tested on the BRAliBase III data set, Rfam 9.0, and a published soil metagenomic data set. In addition, we
compare the SSP-based filters with several other ncRNA search tools, including Infernal (with profile HMMs as filters), ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families.

EGC 8233    Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity of protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison to the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble trained on input data without outliers performs better than the one trained with outliers. Our method is also more accurate than many methods from the literature on benchmark data sets.
From our empirical studies, we found that some outlier interface residues are truly near noninterface regions, and some outlier noninterface residues are close to interface regions.

EGC 8234    DICLENS: Divisive Clustering Ensemble with Automatic Cluster Number
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, and in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods to the same data set, or the same method with varying input parameters, or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering of better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. The DICLENS implementation is scalable and runs on standard personal computers, consuming very little memory and CPU.

EGC 8235    Disease Liability Prediction from Large Scale Genotyping Data Using Classifiers with a Reject Option
Genome-wide association studies (GWA) try to identify the genetic polymorphisms associated with variation in phenotypes. However, even the most significant genetic variants may have little predictive power for forecasting the future development of common diseases. We study the prediction of the risk of developing a disease given genome-wide genotypic data using classifiers with a reject option, which make a prediction only when they are sufficiently certain, but in doubtful situations may decline to classify. To test the reliability of our proposal, we used the Wellcome Trust Case Control Consortium (WTCCC) data set, comprising 14,000 cases of seven common human diseases and 3,000 shared controls.

EGC 8236    Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning
In studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time. Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV); these annotations are useful in research aimed at understanding gene functions, interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the automated annotation of embryo images is challenging.
This is because the annotation terms spatially correspond to local expression patterns in the images, yet they are assigned collectively to groups of images, and it is unknown which term corresponds to which region of which image in the group. In this paper, we address this problem using a new machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal that exploiting the MIML framework leads to significant performance improvement over state-of-the-art approaches.

EGC 8237    Efficient Approaches for Retrieving Protein Tertiary Structures
The 3D conformation of a protein in space is the main factor determining its function in living organisms. Due to the huge number of newly discovered proteins, there is a need for fast and accurate computational methods for retrieving protein structures. Their purpose is to speed up the process of understanding the structure-to-function relationship, which is crucial for the development of new drugs. Many algorithms address the problem of protein structure retrieval. In this paper, we present several novel approaches for retrieving protein tertiary structures. We present our voxel-based descriptor. We then present our protein ray-based descriptors, which are applied to the interpolated protein backbone. We introduce five novel wavelet descriptors that perform wavelet transforms on the protein distance matrix. We also propose an efficient algorithm for distance matrix alignment named Matrix Alignment by Sequence Alignment within Sliding Window (MASASW), which has been shown to be much faster than DALI, CE, and
MatAlign. We compared our approaches with one another and with several existing algorithms, and they generally prove to be fast and accurate. MASASW achieves the highest accuracy. The ray- and wavelet-based descriptors, as well as MASASW, are more accurate than CE.

EGC 8238    Efficient Genotype Elimination via Adaptive Allele Consolidation
We propose the technique of Adaptive Allele Consolidation, which greatly improves the performance of the Lange-Goradia algorithm for genotype elimination in pedigrees while still producing equivalent output. Genotype elimination consists of removing from a pedigree those genotypes that are impossible according to the Mendelian law of inheritance. It is used to find errors in genetic data and is useful as a preprocessing step in other analyses (such as linkage analysis or haplotype imputation). The problem of genotype elimination is intrinsically combinatorial, and Allele Consolidation is an existing technique in which several alleles are replaced by a single "lumped" allele in order to reduce the number of combinations of genotypes that have to be considered, possibly at the expense of precision. In existing Allele Consolidation techniques, alleles are lumped once and for all before performing genotype elimination. The idea of Adaptive Allele Consolidation is to dynamically change the set of alleles that are lumped together during the execution of the Lange-Goradia algorithm, so that both high performance and precision are achieved. We have implemented the technique in a tool called Celer and evaluated it on a large set of scenarios, with good results.
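The Mendelian-consistency fixpoint at the heart of genotype elimination can be sketched at a single locus: a genotype survives only if it participates in at least one fully consistent assignment for every trio containing its individual. This toy version (hypothetical names, one locus, no allele consolidation) conveys the Lange-Goradia-style iteration, not the adaptive technique of the paper.

```python
from itertools import product

def consistent(father_g, mother_g, child_g):
    """Mendelian check at one locus: the child gets one allele from each parent."""
    return any(sorted((pa, ma)) == sorted(child_g)
               for pa, ma in product(father_g, mother_g))

def eliminate(genosets, trios):
    """Iterate to a fixpoint, dropping genotypes that complete no
    Mendelian-consistent combination in some trio (father, mother, child).

    genosets maps individual -> set of candidate genotypes (allele pairs).
    """
    changed = True
    while changed:
        changed = False
        for f, m, c in trios:
            for member in (f, m, c):
                keep = {
                    g for g in genosets[member]
                    if any(consistent(fg, mg, cg)
                           for fg in (genosets[f] if member != f else {g})
                           for mg in (genosets[m] if member != m else {g})
                           for cg in (genosets[c] if member != c else {g}))
                }
                if keep != genosets[member]:
                    genosets[member] = keep
                    changed = True
    return genosets
```

With a father fixed to genotype (1,1) and a mother fixed to (1,2), a candidate child genotype (2,2) is eliminated, since the father can only transmit allele 1.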
EGC 8239    Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree
Finding repetitive structures in genomes and proteins is important for understanding their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size even with the best engineering efforts, prohibiting their use on massive data such as the whole human genome. We focus on finding all the maximal repeats in massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural-language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage is less than double the sequence size. Our space-efficient method also remains fast. In fact, it is orders of magnitude faster than prior methods when processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB of internal memory (with actual usage under 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
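The transform itself is simple to state, even though the paper's contribution is performing it at genome scale: sort all rotations of the text (terminated by a sentinel) and read off the last column, which groups each character by the context that follows it, so repeated contexts produce runs. The naive O(n^2 log n) sketch below is for intuition only; the method described above builds the BWT space-efficiently and answers rank queries with a wavelet tree.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform by naive rotation sorting (illustration only)."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)
```

For example, bwt("banana") returns "annb$aa": the two occurrences of "ana" sort their rotations adjacently, clustering equal characters in the last column.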
EGC 8240    Eigen-Genomic System Dynamic-Pattern Analysis (ESDA): Modeling mRNA Degradation and Self-Regulation
High-throughput methods systematically measure the internal state of the entire cell, but powerful computational tools are needed to infer dynamics from their raw data. We have therefore developed a new computational method, Eigen-genomic System Dynamic-pattern Analysis (ESDA), which uses systems theory to infer dynamic parameters from a time series of gene expression measurements. As many genes are measured at a modest number of time points, estimation of the system matrix is underdetermined and traditional approaches for estimating dynamic parameters are ineffective; ESDA therefore uses the principle of dimensionality reduction to overcome the data imbalance. Since degradation rates are naturally confounded with self-regulation, our model estimates an effective degradation rate that is the difference between self-regulation and degradation. We demonstrate that ESDA is able to recover effective degradation rates with reasonable accuracy in simulation. We also apply ESDA to a budding yeast data set and find that effective degradation rates are normally slower than experimentally measured degradation rates. Our results suggest either that self-regulation is widespread in budding yeast and self-promotion dominates self-inhibition, or that self-regulation may be rare and experimental methods for measuring degradation rates based on transcription arrest may severely overestimate true degradation rates in healthy cells.
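The notion of an effective degradation rate can be illustrated with a deliberately simplified one-gene version: fit an exponential decay to an expression time series; any self-regulation present folds into the fitted constant, which is why the rate is "effective" rather than chemical. This log-linear least-squares sketch is a hypothetical helper, far simpler than ESDA's dimensionality-reduced estimation of the full system matrix.

```python
import math

def effective_decay_rate(times, levels):
    """Fit x(t) = x0 * exp(-k t) by least squares on log(x); returns k.

    With self-regulation present, the fitted k is an *effective* rate
    (degradation minus self-promotion), not the chemical decay rate.
    """
    logs = [math.log(x) for x in levels]
    n = len(times)
    tbar = sum(times) / n
    ybar = sum(logs) / n
    slope = (sum((t - tbar) * (y - ybar) for t, y in zip(times, logs))
             / sum((t - tbar) ** 2 for t in times))
    return -slope
```

On synthetic data generated with rate 0.3, the fit recovers 0.3 up to floating-point error; on noisy real data it returns the best log-linear estimate.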
EGC 8241    Empirical Evidence of the Applicability of Functional Clustering through Gene Expression Classification
The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data, making it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed.

EGC 8242    Evaluating Path Queries over Frequently Updated Route Collections
Recent advances in the infrastructure of Geographic Information Systems (GIS) and the proliferation of GPS technology have resulted in an abundance of geodata in the form of sequences of points of interest (POIs), waypoints, etc. We refer to sets of such sequences as route collections. In this work, we consider path queries on frequently updated route collections: given a route collection and two points n_s and n_t, a path query returns a path, i.e., a sequence of points, that connects n_s to n_t.
We introduce two path query evaluation paradigms that enjoy the benefits of search algorithms (i.e., fast index maintenance) while utilizing transitivity information to terminate the search sooner.
Efficient indexing schemes and appropriate updating procedures are introduced. An extensive experimental evaluation verifies the advantages of our methods compared to conventional graph-based search.

EGC 8243 Exploiting Intrastructure Information for Secondary Structure Prediction with Multifaceted Pipelines

Predicting the secondary structure of proteins is still a typical step in several bioinformatic tasks, in particular for tertiary structure prediction. Notwithstanding the impressive results obtained so far, mostly due to the advent of sequence encoding schemes based on multiple alignment, in our view the problem should be studied from a novel perspective, in which understanding how available information sources are dealt with plays a central role. After revisiting a well-known secondary structure predictor from this perspective (with the goal of identifying which sources of information have been considered and which have not), we propose a generic software architecture designed to account for all relevant information sources. To demonstrate the validity of the approach, a predictor compliant with the proposed generic architecture has been implemented and compared with several state-of-the-art secondary structure predictors. Experiments have been carried out on standard data sets, and the corresponding results confirm the validity of the approach. The predictor is available at http://iasc.diee.unica.it/ssp2/ through the corresponding web application or as a downloadable stand-alone portable unpack-and-run bundle.
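The path queries of EGC 8242 above reduce, in their simplest form, to reachability search over the link graph induced by a route collection. Below is a minimal BFS sketch (our own illustration with toy routes; the paper's contribution lies in the indexing and transitivity pruning layered on top of such search):

```python
from collections import deque

def find_path(routes, n_s, n_t):
    """BFS over the link graph induced by a route collection.

    Each route is a sequence of points; a directed edge connects
    consecutive points of any route. Returns one path from n_s to
    n_t, or None when n_t is unreachable.
    """
    succ = {}
    for route in routes:
        for a, b in zip(route, route[1:]):
            succ.setdefault(a, set()).add(b)
    queue = deque([n_s])
    parent = {n_s: None}          # also serves as the visited set
    while queue:
        node = queue.popleft()
        if node == n_t:           # reconstruct path by walking parents
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in succ.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

routes = [["A", "B", "C"], ["B", "D", "E"], ["C", "E", "F"]]
print(find_path(routes, "A", "F"))
```

Note that the answer may traverse links contributed by different routes, which is precisely the transitivity information the paper's paradigms exploit to stop the search early.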
EGC 8244 Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling

EGC 8245 Fast Local Search for Unrooted Robinson-Foulds Supertrees

A Robinson-Foulds (RF) supertree for a collection of input trees is a tree, containing all the species in the input trees, that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction,
thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics, in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.

EGC 8246 Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format

Markov clustering (MCL) is becoming a key algorithm within bioinformatics for determining clusters in networks. However, with the increasing amount of data on biological networks, performance and scalability are becoming critical limiting factors in applications. Meanwhile, GPU computing, which uses CUDA to implement a massively parallel computing environment on the GPU card, is becoming a very powerful, efficient, and low-cost option for achieving substantial performance gains over CPU approaches. The use of on-chip memory on the GPU efficiently lowers latency, thus circumventing a major issue in other parallel computing environments such as MPI. We introduce a very fast Markov clustering algorithm using CUDA (CUDA-MCL) to perform the parallel sparse matrix-matrix computations and parallel sparse Markov matrix normalizations that are at the heart of MCL. We use the ELLPACK-R sparse format to allow effective, fine-grained, massively parallel processing that copes with the sparse nature of interaction-network data sets in bioinformatics applications. As the results show, CUDA-MCL is significantly faster than the original MCL running on the CPU.
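The two core MCL steps that CUDA-MCL parallelizes, expansion and inflation, can be sketched in a few lines on a small dense matrix (a toy illustration of the algorithm only; the real implementation works on large sparse matrices in ELLPACK-R format on the GPU):

```python
def mcl(adj, inflation=2.0, iterations=20):
    """Toy dense Markov clustering: alternate expansion (matrix
    squaring) and inflation (elementwise power + column rescaling)
    until the flow matrix settles, then read clusters off the
    surviving "attractor" rows."""
    n = len(adj)
    # Add self-loops and column-normalize into a Markov matrix.
    m = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    def normalize(mat):
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat
    m = normalize(m)
    for _ in range(iterations):
        # Expansion: m <- m @ m (flow spreads along longer paths).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones.
        m = normalize([[x ** inflation for x in row] for row in m])
    # Group each column under the row holding most of its mass.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, set()).add(j)
    return sorted(map(sorted, clusters.values()))

# Two triangles joined by a single bridge edge: nodes 0-2 and 3-5.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(mcl(adj))
```

The inflation exponent controls granularity: higher values cut the bridge sooner and yield finer clusters, which is the behavior the CUDA kernels reproduce at scale.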
Thus, large-scale parallel computation on off-the-shelf desktop machines, previously possible only on supercomputing architectures, can significantly change the way bioinformaticians and biologists deal with their data.

EGC 8247 Faster Mass Spectrometry-Based Protein Inference: Junction Trees Are More Efficient than Sampling and Marginalization by Enumeration

The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods.

EGC 8248 Gene Classification Using Parameter-Free Semi-Supervised Manifold Learning

EGC 8249 GSGS: A Computational Approach to Reconstruct Signaling Pathway Structures from Gene Sets

Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells. Existing computational approaches often rely on unrealistic biological assumptions and do not explicitly consider signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from the cell surface to the nucleus and characterize a signaling pathway. In this paper, we propose a novel approach, Gene Set Gibbs Sampling (GSGS), to reverse engineer signaling pathway structures from gene sets related to the pathways. We hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events, which we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets corresponding to these events, referred to as Information Flow Gene Sets (IFGSs). Thus, an IFGS only reflects which genes appear in the underlying IF, not their ordering. GSGS offers a Gibbs-sampling-like procedure to reconstruct the underlying signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway.
In proof-of-concept studies, our approach is shown to outperform existing state-of-the-art network inference approaches using both continuous and discrete data generated from benchmark networks in the DREAM initiative. We perform a comprehensive sensitivity analysis to assess the robustness of our approach. Finally, we apply GSGS to reconstruct signaling mechanisms in breast cancer cells.

EGC 8250 Hash Subgraph Pairwise Kernel for Protein-Protein Interaction Extraction

Extracting protein-protein interactions (PPIs) from the biomedical literature is an important task in biomedical text mining (BioTM). In this paper, we propose a hash subgraph pairwise (HSP) kernel-based approach for this task. The key to the novel kernel is the use of hierarchical hash labels to express the structural information of subgraphs in linear time. We apply the graph kernel to dependency graphs representing sentence structure for the PPI extraction task; it can efficiently make use of full graph structural information and, in particular, captures contiguous topological and label information that was previously ignored. We evaluate the proposed approach on five publicly available PPI corpora. The experimental results show that our approach significantly outperforms the all-path kernel approach on all five corpora and achieves state-of-the-art performance.

EGC 8251 Identification of Essential Proteins Based on Edge Clustering Coefficient
Identification of essential proteins is key to understanding the minimal requirements for cellular life and is important for drug design. The rapid increase in available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on network topology. However, most of them focus only on the location of a single protein and ignore the relevance between interactions and protein essentiality. In this paper, a new centrality measure for identifying essential proteins based on the edge clustering coefficient, named NC, is proposed. Unlike previous centrality measures, NC considers both the centrality of a node and the relationship between it and its neighbors. For each interaction in the network, we calculate its edge clustering coefficient. A node's essentiality is determined by the sum of the edge clustering coefficients of the interactions connecting it and its neighbors. The new centrality measure NC takes into account the modular nature of protein essentiality. NC is applied to three different types of yeast protein-protein interaction networks, obtained from the DIP, MIPS, and BioGRID databases, respectively. The experimental results on the three networks show that the number of essential proteins discovered by NC universally exceeds that discovered by six other centrality measures: DC, BC, CC, SC, EC, and IC. Moreover, the essential proteins discovered by NC show a significant cluster effect.
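A score of this kind is straightforward to compute from an edge list. The sketch below uses one common definition of the edge clustering coefficient, triangle count over min(degree) - 1 (the paper's exact normalization may differ), and sums it over each node's incident edges:

```python
def nc_centrality(edges):
    """NC-style score per node: sum of edge clustering coefficients
    (ECC) over the node's incident edges.

    ECC(u, v) here is the number of triangles the edge (u, v) lies in,
    divided by min(deg(u), deg(v)) - 1, the maximum number of
    triangles it could possibly form.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    def ecc(u, v):
        triangles = len(neighbors[u] & neighbors[v])
        denom = min(len(neighbors[u]), len(neighbors[v])) - 1
        return triangles / denom if denom > 0 else 0.0
    return {u: sum(ecc(u, v) for v in neighbors[u]) for u in neighbors}

# A 4-clique (A, B, C, D) with a pendant node E attached to D.
edges = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = nc_centrality(edges)
print(sorted(scores.items()))
```

In this toy graph every clique member scores 3.0 while the pendant node scores 0.0, illustrating how NC rewards nodes embedded in dense, triangle-rich neighborhoods rather than mere degree.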
EGC 8252 Identifying Gene Pathways Associated with Cancer Characteristics via Sparse Statistical Methods

We propose a statistical method for uncovering gene pathways that characterize cancer heterogeneity. To incorporate knowledge of the pathways into the model, we define a set of pathway activities from microarray gene expression data based on Sparse Probabilistic Principal Component Analysis (SPPCA). A pathway-activity logistic regression model is then formulated for the cancer phenotype. To select pathway activities related to binary cancer phenotypes, we use the elastic net for parameter estimation and derive a model selection criterion for choosing the tuning parameters included in the model estimation. Our proposed method can also reverse-engineer gene networks based on the identified multiple pathways, which enables us to discover novel gene-gene associations related to the cancer phenotypes. We illustrate the whole process of the proposed method through the analysis of breast cancer gene expression data.

EGC 8253 Inferring Gene Regulatory Networks via Nonlinear State-Space Models and Exploiting Sparsity

This paper considers the problem of learning the structure of gene regulatory networks from gene expression time series data. A more realistic scenario, in which the state-space model representing a gene network evolves nonlinearly, is considered, while a linear model is assumed for the microarray data. To capture the nonlinearity, a particle filter-based state estimation algorithm is used instead of the contemporary linear approximation-based approaches. The parameters characterizing the regulatory relations among the genes are estimated online using a Kalman filter. Since a particular gene interacts with only a few other genes, the parameter vector is expected to be sparse.
The state estimates delivered by the particle filter and the observed microarray data are then subjected to a LASSO-based least-squares regression operation, which yields a parsimonious and efficient description of the regulatory network by setting the irrelevant coefficients to zero. The performance of the algorithm is compared with the extended Kalman filter (EKF) and unscented Kalman filter (UKF), employing the mean square error (MSE) as the fidelity criterion in recovering the parameters of gene regulatory networks from synthetic and real biological data. Extensive computer simulations illustrate that the proposed particle filter-based network inference algorithm outperforms the EKF and UKF, and therefore it can serve as a natural framework for modeling gene regulatory networks with nonlinear and sparse structure.

EGC 8254 Inferring the Number of Contributors to Mixed DNA Profiles

Forensic samples containing DNA from two or more individuals can be difficult to interpret. Even ascertaining the number of contributors to the sample can be challenging. These uncertainties can dramatically reduce the statistical weight attached to evidentiary samples. A probabilistic mixture algorithm that takes into account not just the number and magnitude of the alleles at a locus, but also their frequency of occurrence, allows the determination of likelihood ratios for different hypotheses concerning the number of contributors to a specific mixture. This probabilistic mixture algorithm can compute the probability of the alleles in a sample being present in a 2-person mixture, a 3-person mixture, etc. The ratio of any two of these probabilities then constitutes a likelihood ratio pertaining to the number of contributors to such a mixture.
EGC 8255 Intervention in Gene Regulatory Networks via Phenotypically Constrained Control Policies Based on Long-Run Behavior

A salient purpose of studying gene regulatory networks is to derive intervention strategies that identify potential drug targets and design gene-based therapeutic interventions. Optimal and approximate intervention strategies based on the transition probability matrix of the underlying Markov chain have been studied extensively for probabilistic Boolean networks. While the key goal of control is to reduce the steady-state probability mass of undesirable network states, in practice it is important to limit collateral damage, and this constraint should be taken into account when designing intervention strategies with network models. In this paper, we propose two new phenotypically constrained stationary control policies obtained by directly investigating the effects on the network's long-run behavior. They are derived to reduce the risk of visiting undesirable states, in conjunction with constraints on the shift of undesirable steady-state mass, so that only limited collateral damage is introduced. We have studied the performance of the new constrained control policies, together with previous greedy control policies, on randomly generated probabilistic Boolean networks. A preliminary example of intervening in a metastatic melanoma network is also given to show their potential application in designing genetic therapeutics that reduce the risk of entering both aberrant phenotypes and other ambiguous states
corresponding to complications or collateral damage. Experiments on both random network ensembles and the melanoma network demonstrate that, in general, the new control policies exhibit the desired performance. As shown by intervening in the melanoma network, these control policies can potentially serve as practical gene therapeutic intervention strategies.

EGC 8256 Iterative Dictionary Construction for Compression of Large DNA Data Sets

Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation, Comrad, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. Comrad compresses the data over multiple passes, which is an expensive process but allows Comrad to compress large data sets within reasonable time and space. Comrad allows random access to individual sequences and subsequences without decompressing the whole data set. Comrad has no competitor in terms of the size of the data sets it can compress (extending to many hundreds of gigabytes), and even for smaller data sets the results are competitive with alternatives; as an example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.

EGC 8257 k-Information Gain Scaled Nearest Neighbors: A Novel Approach to Classifying Protein-Protein Interaction-Related Documents

Although publicly accessible databases containing protein-protein interaction (PPI)-related information are important resources to bench and in silico research scientists alike, the amount of time and effort required to keep them up to date is often burdensome. To help identify relevant PPI publications, text-mining tools from the machine learning discipline can be applied to the process. Here, we describe and evaluate two document classification algorithms that we submitted to the BioCreative II.5 PPI Classification Challenge Task. This task asked participants to design classifiers for identifying documents containing PPI-related information in the primary literature, and evaluated them against one another. One of our systems was the overall best-performing system submitted to the challenge task. It uses a novel approach to k-nearest neighbor classification, which we describe here and compare with two support vector machine-based classification systems, one of which was also evaluated in the challenge task.

EGC 8258 Manifold Adaptive Experimental Design for Text Categorization
In many information processing tasks, labels are expensive and unlabeled data points are abundant. To reduce the cost of collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e., would improve the classifier the most if they were labeled. Many active learning techniques have been proposed for text categorization, such as SVMActive and Transductive Experimental Design. However, most previous approaches try to discover the discriminant structure of the data space, whereas the geometrical structure is not well respected. In this paper, we propose a novel active learning algorithm that operates in a manifold adaptive kernel space. The manifold structure is incorporated into the kernel space by using the graph Laplacian, so that the manifold adaptive kernel space reflects the underlying geometry of the data. By minimizing the expected error with respect to the optimal classifier, we can select the most representative and discriminative data points for labeling. Experimental results on text categorization demonstrate the effectiveness of the proposed approach.

EGC 8259 Markov Invariants for Phylogenetic Rate Matrices Derived from Embedded Submodels

We consider novel phylogenetic models with rate matrices that arise via the embedding of a progenitor model on a small number of character states into a target model on a larger number of character states. Adapting representation-theoretic results from recent investigations of Markov invariants for the general rate matrix model, we give a prescription for identifying and counting Markov invariants for such "symmetric embedded" models, and we provide enumerations of these for the first few cases with a small number of character states.
The simplest example is a target model on three states, constructed from a general two-state model: the "2 ↪ 3" embedding. We show that for two taxa there exist two invariants of quadratic degree that can be used to infer pairwise distances directly from observed sequences under this model. A simple simulation study verifies their theoretical expected values and suggests that, given the appropriateness of the model class, they have superior statistical properties to the standard log-Det invariant (which is of cubic degree for this case).

EGC 8260 Matching Split Distance for Unrooted Binary Phylogenetic Trees

The reconstruction of evolutionary trees is one of the primary objectives in phylogenetics. Such a tree represents the historical evolutionary relationships between different species or organisms. Tree comparisons are used for multiple purposes, from unveiling the history of species to deciphering evolutionary associations among organisms and geographical areas. In this paper, we propose a new method of defining distances between unrooted binary phylogenetic trees that is especially useful for relatively large phylogenetic trees. Next, we investigate in detail the properties of one example of these metrics, called the Matching Split distance, and describe how the general method can be extended to nonbinary trees.

EGC 8261 Memory Efficient Algorithms for Structural Alignment of RNAs with Pseudoknots
In this paper, we consider the problem of structurally aligning a target RNA sequence of length n and a query RNA sequence of length m with known secondary structure that may contain simple pseudoknots or embedded simple pseudoknots. The best known algorithm for this problem runs in O(mn^3) time for simple pseudoknots or O(mn^4) time for embedded simple pseudoknots, with O(mn^3) space for both structures; this requires too much memory, making it infeasible for comparing noncoding RNAs (ncRNAs) of length several hundred or more. We propose memory-efficient algorithms to solve the same problem. We reduce the space complexity to O(n^3) for simple pseudoknots and O(mn^2 + n^3) for embedded simple pseudoknots, while maintaining the same time complexities. We also show how to modify our algorithm to handle a restricted class of recursive simple pseudoknots, which is found to be abundant in real data, with space complexity of O(mn^2 + n^3) and time complexity of O(mn^4). Experimental results show that our algorithms are feasible for comparing ncRNAs of length more than 500.

EGC 8262 MinePhos: A Literature Mining System for Protein Phosphorylation Information Extraction

The rapid growth of the scientific literature calls for automatic and efficient ways to facilitate extracting experimental data on protein phosphorylation. Such information is of great value for biologists studying cellular processes and diseases such as cancer and diabetes. Existing approaches like RLIMS-P are mainly rule based, and their performance relies heavily on the completeness of the rules.
We propose an SVM-based system, MinePhos, which outperforms RLIMS-P in both precision and recall of information extraction when tested on a set of articles randomly chosen from PubMed.

EGC 8263 Molecular Dynamics Trajectory Compression with a Coarse-Grained Model

Molecular dynamics trajectories are very data intensive, thereby limiting the sharing and archival of such data. One possible solution is compression of the trajectory data. Here, trajectory compression based on conversion to the coarse-grained model PRIMO is proposed. The compressed data are about one third of the original size, and fast decompression is possible with an analytical reconstruction procedure from PRIMO to all-atom representations. This protocol largely preserves structural features and, to a more limited extent, energetic features of the original trajectory.

EGC 8264 Multiobjective Optimization-Based Approach for Discovering Novel Cancer Therapies

Solid tumors must recruit new blood vessels for growth and maintenance. Discovering drugs that block tumor-induced development of new blood vessels (angiogenesis) is an important approach in cancer treatment. The complexity of angiogenesis presents both challenges and opportunities for cancer therapies. Intuitive approaches, such as blocking
VEGF activity, have yielded important therapies, but there may be opportunities to alter nonintuitive targets, either alone or in combination. This paper describes the development of a high-fidelity simulation of angiogenesis and uses it as the basis for a parallel search-based approach to discovering novel potential cancer treatments that inhibit blood vessel growth. Discovering new therapies is viewed as a multiobjective combinatorial optimization over two competing objectives: minimizing the estimated cost of practically developing the intervention while minimizing the simulated oxygen provided to the tumor by angiogenesis. Results show the effectiveness of the search process by finding interventions that are currently in use and, more interestingly, by discovering potential new approaches that are nonintuitive yet effective.

EGC 8265 Multiscale Binarization of Gene Expression Data for Reconstructing Boolean Networks

Network inference algorithms can assist life scientists in unraveling gene-regulatory systems at the molecular level. In recent years, great attention has been drawn to the reconstruction of Boolean networks from time series. The measurements need to be binarized, as such networks model genes as binary variables (either "expressed" or "not expressed"). Common binarization methods often cluster measurements or separate them according to statistical or information-theoretic characteristics, and may require many data points to determine a robust threshold. Yet time series measurements frequently comprise only a small number of samples. To overcome this limitation, we propose a binarization that incorporates measurements at multiple resolutions.
We introduce two such binarization approaches, which determine thresholds based on limited numbers of samples and additionally provide a measure of threshold validity. Network reconstruction and further analysis can thus be restricted to genes with meaningful thresholds, which reduces the complexity of network inference. The performance of our binarization algorithms was evaluated in network reconstruction experiments using artificial data as well as real-world yeast expression time series. The new approaches yield considerably improved correct network identification rates compared to other binarization techniques by effectively reducing the number of candidate networks.

EGC 8266 Mutation Region Detection for Closely Related Individuals without a Known Pedigree

Linkage analysis serves as a way of finding the locations of genes that cause genetic diseases. Linkage studies have facilitated the identification of several hundred human genes that can harbor mutations which by themselves lead to a disease phenotype. The fundamental problem in linkage analysis is to identify regions whose allele is shared by all or almost all affected members but by none or few unaffected members. Almost all existing methods for linkage analysis are for families with clearly given pedigrees. Little work has been done for the case where the sampled individuals are closely related but their pedigree is not known. This situation occurs very often when the individuals share a common ancestor at least six generations ago. Solving this case would tremendously extend the use of linkage analysis for finding genes that cause genetic diseases. In this paper, we propose a mathematical model (the shared center problem) for inferring the allele-sharing status of a given set of individuals using a database of confirmed
haplotypes as reference. We show the NP-completeness of the shared center problem and present a ratio-2 polynomial-time approximation algorithm for its minimization version (called the closest shared center problem). We then convert the approximation algorithm into a heuristic algorithm for the shared center problem. Based on this heuristic, we finally design a heuristic algorithm for mutation region detection. We further implement the algorithms to obtain a software package. Our experimental data show that the software is both fast and accurate.

EGC 8267 Mutual Information Optimization for Mass Spectra Data Alignment

"Signal" alignments play critical roles in many clinical settings. This is the case for mass spectrometry (MS) data, an important component of many types of proteomic analysis. A central problem occurs when one needs to integrate MS data produced by different sources, e.g., different equipment and/or laboratories. In these cases, some form of "data integration" or "data fusion" may be necessary in order to discard source-specific aspects and improve the ability to perform a classification task such as inferring the "disease classes" of patients. The need for new high-performance data alignment methods is therefore particularly important in these contexts. In this paper, we propose an approach based both on an information theory perspective, generally used in feature construction problems, and on the application of a mathematical programming task (i.e., the weighted bipartite matching problem). We present the results of a competitive analysis of our method against other approaches.
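The information-theoretic ingredient in such an approach, the empirical mutual information between two discretized signals, can be sketched generically (a textbook estimator, not the paper's full alignment method):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equally long
    discrete sequences, estimated from their joint and marginal counts.
    A generic textbook estimator, shown only to illustrate the
    information-theoretic scoring idea."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * log2(p_joint / (p(x) * p(y))), with counts folded in
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

print(mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (fully dependent)
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0 (independent)
```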
The analysis was conducted on data from plasma/ethylenediaminetetraacetic acid of "control" and Alzheimer patients collected from three different hospitals. The results point to a significant performance advantage of our method with respect to the competing ones tested.

EGC 8268 On Complexity of Protein Structure Alignment Problem under Distance Constraint

We study the well-known Largest Common Point-set (LCP) under Bottleneck Distance Problem. Given two proteins a and b (as sequences of points in three-dimensional space) and a distance cutoff ζ, the goal is to find a spatial superposition and an alignment that maximizes the number of pairs of points from a and b that can be fit under the distance ζ from each other. The best algorithms to date for approximate and exact solutions to this problem run in time O(n^8) and O(n^32), respectively, where n represents protein length. This work improves the runtime of the approximation algorithm and the expected runtime of the algorithm for the absolute optimum, for both order-dependent and order-independent alignments. More specifically, our algorithms for near-optimal and optimal sequential alignments run in time O(n^7 log n) and O(n^14 log n), respectively. For nonsequential alignments, the corresponding running times are O(n^7.5) and O(n^14.5).

EGC 8269 On Parameter Synthesis by Parallel Model Checking
An important problem in current computational systems biology is to analyze models of biological systems dynamics under parameter uncertainty. This paper presents a novel algorithm for parameter synthesis based on parallel model checking. The algorithm is conceptually universal with respect to the modeling approach employed. We introduce the algorithm, show its scalability, and examine its applicability on several biological models.

EGC 8270 On the Application of Active Learning and Gaussian Processes in Postcryopreservation Cell Membrane Integrity Experiments

EGC 8271 On the Elusiveness of Clusters

Rooted phylogenetic networks are often used to represent conflicting phylogenetic signals. Given a set of clusters, a network is said to represent these clusters in the softwired sense if, for each cluster in the input set, at least one tree embedded in the network contains that cluster. Motivated by parsimony, we might wish to construct such a network using as few reticulations as possible, or minimizing the level of the network, i.e., the maximum number of reticulations used in any "tangled" region of the network.
Although these are NP-hard problems, here we prove that, for every fixed k ≥ 0, it is polynomial-time solvable to construct a phylogenetic network with level equal to k representing a cluster set, or to determine that no such network exists. However, this algorithm does not lend itself to a practical implementation. We also prove that the comparatively efficient CASS algorithm correctly solves this problem (and also minimizes the reticulation number) when the input clusters are obtained from two not necessarily binary gene trees on the same set of taxa, but does not always minimize level for general cluster sets. Finally, we describe a new algorithm which generates, in polynomial time, all binary phylogenetic networks with exactly r reticulations representing a set of input clusters (for every fixed r ≥ 0).

EGC 8272 Optimizing Phylogenetic Networks for Circular Split Systems

We address the problem of realizing a given distance matrix by a planar phylogenetic network with a minimum number of faces. With the help of the popular software SplitsTree4, we start by approximating the distance matrix with a distance metric that is a linear combination of circular splits. The main results of this paper are the necessary and sufficient conditions for the existence of a network with a single face. We show how such a network can be constructed, and we present a heuristic for constructing a network with few faces using the first algorithm as the base case. Experimental
results on biological data show that this heuristic algorithm can produce phylogenetic networks with far fewer faces than the ones computed by SplitsTree4, without affecting the approximation of the distance matrix.

EGC 8273 Output-Sensitive Algorithms for Finding the Nested Common Intervals of Two General Sequences

The focus of this paper is the problem of finding all nested common intervals of two general sequences. Depending on the treatment one wants to apply to duplicate genes, Blin et al. introduced three models to define nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all three models. For the uniqueness and the bijection models, we give O(n + Nout)-time algorithms, where Nout denotes the size of the output. For the free-inclusion model, we give an O(n^(1+ε) + Nout)-time algorithm, where ε > 0 is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that Nout = O(n^2). Let C = Σ_{g∈Γ} o1(g)o2(g), where Γ is the set of distinct genes, and o1(g) and o2(g) are, respectively, the numbers of copies of g in the two given sequences. For the bijection model, we show that Nout = O(Cn). In this paper, we also study the problem of finding all approximate nested common intervals of two sequences under the bijection model. An O(δn + Nout)-time algorithm is presented, where δ denotes the maximum number of allowed gaps. In addition, we show that for this problem Nout is O(δn^3).
EGC 8274 Parameter Estimation Using Metaheuristics in Systems Biology: A Comprehensive Review

This paper gives a comprehensive review of the application of metaheuristics to optimization problems in systems biology, mainly focusing on the parameter estimation problem (also called the inverse problem or model calibration). It is intended for either the systems biologist who wishes to learn more about the various optimization techniques available and/or the metaheuristic optimizer who is interested in applying such techniques to problems in systems biology. First, the parameter estimation problems emerging from different areas of systems biology are described from the point of view of machine learning. Brief descriptions of various metaheuristics developed for these problems follow, along with outlines of their advantages and disadvantages. Several important issues in applying metaheuristics to the systems biology modeling problem are addressed, including the reliability and identifiability of model parameters, optimal design of experiments, and so on. Finally, we highlight some possible future research directions in this field.

EGC 8275 Peptide Reranking with Protein-Peptide Correspondence and Precursor Peak Intensity Information

Searching tandem mass spectra against a protein database has been a mainstream method for peptide identification. Improving peptide identification results by ranking true Peptide-Spectrum Matches (PSMs) over their false counterparts has led to the development of various reranking algorithms. In peptide reranking, discriminative information is essential to
distinguish true PSMs from false PSMs. Generally, most peptide reranking methods obtain discriminative information directly from database search scores or by training machine learning models. Information in the protein database and MS1 spectra (i.e., single-stage MS spectra) is ignored. In this paper, we propose to use information in the protein database and MS1 spectra to rerank peptide identification results. To quantitatively analyze their effects on peptide reranking results, three peptide reranking methods are proposed: PPMRanker, PPIRanker, and MIRanker. PPMRanker only uses Protein-Peptide Map (PPM) information from the protein database, PPIRanker only uses Precursor Peak Intensity (PPI) information, and MIRanker employs both PPM information and PPI information. According to our experiments on a standard protein mixture data set, a human data set, and a mouse data set, PPMRanker and MIRanker achieve better peptide reranking results than PeptideProphet, PeptideProphet+NSP (number of sibling peptides) and a score regularization method, SRPI.

EGC 8276 Predicting Ligand Binding Residues and Functional Sites Using Multipositional Correlations with Graph Theoretic Clustering and Kernel CCA Search

We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend not only to be conserved, but also to exhibit strong correlation due to the selection pressure during evolution to maintain the required structure and/or function.
To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites in protein sequences as the residues that exhibit strong correlation between the residues' evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. The results of testing the method on two well-curated data sets show that the prediction accuracy as measured by Receiver Operating Characteristic (ROC) scores improves significantly when multipositional correlations are accounted for.

EGC 8277 Predicting Metal-Binding Sites from Protein Sequence

Prediction of binding sites from sequence can significantly help toward determining the function of uncharacterized proteins on a genomic scale. The task is highly challenging due to the enormous number of alternative candidate configurations. Previous research has only considered this prediction problem starting from 3D information. When starting from sequence alone, only methods that predict the bonding state of selected residues are available. The sole exception consists of pattern-based approaches, which rely on very specific motifs and cannot be applied to discover truly novel sites. We develop new algorithmic ideas based on structured-output learning for determining transition-metal-binding sites coordinated by cysteines and histidines. The inference step (retrieving the best scoring output) is intractable for general output types (i.e., general graphs). However, under the assumption that no residue can coordinate more than one metal ion, we prove that metal binding has the algebraic structure of a matroid, allowing us to employ a very efficient greedy algorithm.
We test our predictor in a highly stringent setting where the training set consists of
protein chains belonging to SCOP folds different from the ones used for accuracy estimation. In this setting, our predictor achieves 56 percent precision and 60 percent recall in the identification of ligand-ion bonds.

EGC 8278 Predicting Protein Function by Multi-Label Correlated Semi-Supervised Learning

Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data.
The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.

EGC 8279 Protein Complexes Discovery Based on Protein-Protein Interaction Data via a Regularized Sparse Generative Network Model

Detecting protein complexes from protein interaction networks is one major task in the postgenome era. Previously developed computational algorithms for identifying complexes mainly focus on graph partition or dense region finding. Most of these traditional algorithms cannot discover the overlapping complexes that really exist in protein-protein interaction (PPI) networks. Even where density-based methods have been developed to identify overlapping complexes, they are not able to discover complexes that include peripheral proteins. In this study, motivated by the recent successful application of generative network models to describing the generation process of PPI networks and to detecting communities in social networks, we develop a regularized sparse generative network model (RSGNM) for protein complexes identification, by adding another process that generates propensities using an exponential distribution and incorporating a Laplacian regularizer into an existing generative network model. By assuming that the propensities are generated using an exponential distribution, the estimators of the propensities will be sparse, which not only has a good biological interpretation but also helps to control the overlapping rate among detected complexes. And the Laplacian regularizer makes the estimators of the propensities smoother on interaction networks. Experimental results on three yeast PPI networks show that RSGNM outperforms six previous competing algorithms in terms of the quality of detected complexes. In addition, RSGNM is able to detect overlapping complexes and complexes including peripheral proteins simultaneously. These results give new insights into the importance of generative network models in protein complexes identification.
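The graph-based propagation with local and global consistency that MCSL (EGC 8278) extends can be sketched in a generic single-label form. This is a toy illustration, not the MCSL algorithm itself; the node names, weights, and blending parameter are made up.

```python
def propagate_labels(adj, seed_labels, iters=100, alpha=0.9):
    """Graph-based semi-supervised label propagation in the spirit of the
    local and global consistency (LGC) scheme: each node's score blends the
    degree-normalized average of its neighbors with its own seed label.

    adj: symmetric adjacency dict {node: {neighbor: weight}}
    seed_labels: {node: +1.0 or -1.0} for the labeled subset
    Returns a score per node; the sign gives the predicted class.
    """
    f = {n: seed_labels.get(n, 0.0) for n in adj}
    for _ in range(iters):
        new_f = {}
        for n in adj:
            deg = sum(adj[n].values()) or 1.0
            spread = sum(w * f[m] for m, w in adj[n].items()) / deg
            # blend the neighborhood average with the original seed label
            new_f[n] = alpha * spread + (1 - alpha) * seed_labels.get(n, 0.0)
        f = new_f
    return f

# A 4-node path a-b-c-d with opposite labels at the endpoints.
adj = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
       "c": {"b": 1, "d": 1}, "d": {"c": 1}}
scores = propagate_labels(adj, {"a": 1.0, "d": -1.0})
# "b" ends up positive (nearer a), "c" negative (nearer d)
```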
EGC 8280 Quantifying Dynamic Stability of Genetic Memory Circuits

Bistability/multistability has been found in many biological systems, including genetic memory circuits. Proper characterization of system stability helps in understanding biological functions and has potential applications in fields such as synthetic biology. Existing methods of analyzing bistability are either qualitative or static. Assuming the circuit is in a steady state, the latter can only reveal the susceptibility of the stability to injected DC noises. However, this can be inappropriate and inadequate, as dynamics are crucial for many biological networks. In this paper, we quantitatively characterize the dynamic stability of a genetic conditional memory circuit by developing new dynamic noise margin (DNM) concepts and associated algorithms based on system theory. Taking into account the duration of the noisy perturbation, the DNMs are more general cases of their static counterparts. Using our techniques, we analyze the noise immunity of the memory circuit and derive insights on dynamic hold and write operations. Considering cell-to-cell variations, our parametric analysis reveals that the dynamic stability of the memory circuit has significantly varying sensitivities to the underlying biochemical reactions, attributable to differences in structure, time scales, and nonlinear interactions between reactions. With proper extensions, our techniques are broadly applicable to other multistable biological systems.
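The duration-dependent notion behind dynamic noise margins can be illustrated on a toy one-dimensional bistable system (a made-up model, not the genetic circuit analyzed in the paper): the same perturbation amplitude is absorbed when brief but flips the stored state when held long enough.

```python
def simulate(x0, pulse_amp, pulse_dur, t_end=20.0, dt=0.001):
    """Euler-integrate the toy bistable system dx/dt = x - x^3 + u(t),
    which has stable states near x = -1 and x = +1. u(t) is a rectangular
    perturbation pulse of the given amplitude and duration. Purely
    illustrative of duration-dependent noise margins."""
    x, t = x0, 0.0
    while t < t_end:
        u = pulse_amp if t < pulse_dur else 0.0
        x += dt * (x - x**3 + u)
        t += dt
    return x

# Starting in the low state (-1): a brief pulse is absorbed,
# but the same amplitude held longer flips the memory to the high state.
short = simulate(-1.0, 1.5, 0.1)   # relaxes back near -1
long_ = simulate(-1.0, 1.5, 5.0)   # settles near +1
```

In DNM terms, the tolerable noise amplitude shrinks as the perturbation duration grows, which a purely static margin cannot express.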
EGC 8281 Quantitative Analysis of the Self-Assembly Strategies of Intermediate Filaments from Tetrameric Vimentin

In vitro assembly of intermediate filaments from tetrameric vimentin consists of a very rapid phase of tetramers laterally associating into unit-length filaments and a slow phase of filament elongation. We focus in this paper on a systematic quantitative investigation of two molecular models for filament assembly, recently proposed in (Kirmse et al., J. Biol. Chem. 282, 52 (2007), 18563-18572), through mathematical modeling, model fitting, and model validation. We analyze the quantitative contribution of each filament elongation strategy: with tetramers, with unit-length filaments, with longer filaments, or combinations thereof. In each case, we discuss the numerical fitting of the model with respect to one set of data, and its separate validation with respect to a second, different set of data. We introduce a high-resolution model for vimentin filament self-assembly, able to capture the detailed dynamics of filaments of arbitrary length. This provides much more predictive power for the model, in comparison to previous models where only the mean length of all filaments in the solution could be analyzed. We show how kinetic observations on low-resolution models can be extrapolated to the high-resolution model and used for lowering its complexity.

EGC 8282 Quantum Gate Circuit Model of Signal Integration in Bacterial Quorum Sensing

Bacteria evolved cell-to-cell communication processes to gain information about their environment and regulate gene expression. Quorum sensing is such a process, in which signaling molecules, called autoinducers, are produced,
secreted, and detected. In several cases bacteria use more than one autoinducer and integrate the information conveyed by them. It has not yet been explained adequately why bacteria evolved such signal integration circuits and what they can learn about their environments using more than one autoinducer, since all signaling pathways merge into one. Here quantum information theory, which includes classical information theory as a special case, is used to construct a quantum gate circuit that reproduces recent experimental results. Although the conditions in which biosystems exist do not allow for the appearance of quantum mechanical phenomena, the powerful computation tools of quantum information processing can be carefully used to cope with signal and information processing by these complex systems. A simulation algorithm based on this model has been developed, and numerical experiments analyzing the dynamical operation of the quorum sensing circuit were performed for various cases of autoinducer variations, which revealed that these variations contain significant information about the environment in which bacteria exist.

EGC 8283 Refining Regulatory Networks through Phylogenetic Transfer of Information

The experimental determination of transcriptional regulatory networks in the laboratory remains difficult and time-consuming, while computational methods to infer these networks provide only modest accuracy. The latter can be attributed partly to the limitations of a single-organism approach. Computational biology has long used comparative and evolutionary approaches to extend the reach and accuracy of its analyses.
In this paper, we describe ProPhyC, a probabilistic phylogenetic model and associated inference algorithms, designed to improve the inference of regulatory networks for a family of organisms by using known evolutionary relationships among these organisms. ProPhyC can be used with various network evolutionary models and any existing inference method. Extensive experimental results on both biological and synthetic data confirm that our model (through its associated refinement algorithms) yields substantial improvement in the quality of inferred networks over all current methods. We also compare ProPhyC with a transfer learning approach we design, which also uses phylogenetic relationships while inferring regulatory networks for a family of organisms. Using similar input information but designed in a very different framework, this transfer learning approach does not perform better than ProPhyC, which indicates that ProPhyC makes good use of the evolutionary information.

EGC 8284 RENNSH: A Novel α-Helix Identification Approach for Intermediate Resolution Electron Density Maps

Accurate identification of protein secondary structures is beneficial for understanding the three-dimensional structures of biological macromolecules. In this paper, a novel refined classification framework is proposed, which treats α-helix identification as a machine learning problem by representing each voxel in the density map with its Spherical Harmonic Descriptors (SHD). An energy function is defined to provide statistical analysis of its identification performance, which can be applied to all α-helix identification approaches. Compared with other existing α-helix identification methods for intermediate resolution electron density maps, the experimental results demonstrate that our approach gives the best identification accuracy and is more robust to noise.
EGC 8285 Residues with Similar Hexagon Neighborhoods Share Similar Side-Chain Conformations

We present in this study a new approach to encoding protein side-chain conformations into hexagon substructures. Classical side-chain packing methods consist of two steps: first, side-chain conformations, known as rotamers, are extracted from known protein structures as candidates for each residue; second, a search method along with an energy function is used to resolve conflicts among residues and to optimize the combinations of side-chain conformations for all residues. These methods benefit from the fact that the number of possible side-chain conformations is limited, and the rotamer candidates are readily extracted; however, they also suffer from the inaccuracy of energy functions. Inspired by threading and ab initio approaches to protein structure prediction, we propose to use hexagon substructures to implicitly capture subtle issues of energy functions. Our initial results indicate that even without guidance from an energy function, hexagon structures alone can capture side-chain conformations at an accuracy of 83.8 percent, higher than the 82.6 percent achieved by state-of-the-art side-chain packing methods.

EGC 8286 Reverse Engineering and Analysis of Genome-Wide Gene Regulatory Networks from Gene Expression Profiles Using High-Performance Computing

Regulation of gene expression is a carefully controlled phenomenon in the cell. "Reverse-engineering" algorithms try to reconstruct the regulatory interactions among genes from genome-scale measurements of gene expression profiles (microarrays).
Mammalian cells express tens of thousands of genes; hence, hundreds of gene expression profiles are necessary in order to have acceptable statistical evidence of interactions between genes. As the number of profiles to be analyzed increases, so do computational costs and memory requirements. In this work, we designed and developed a parallel computing algorithm to reverse-engineer genome-scale gene regulatory networks from thousands of gene expression profiles. The algorithm is based on computing the pairwise Mutual Information between each gene pair. We successfully tested it by reverse engineering the Mus musculus (mouse) gene regulatory network in liver from gene expression profiles collected from a public repository. A parallel hierarchical clustering algorithm was implemented to discover "communities" within the gene network. Network communities are enriched for genes involved in the same biological functions. The inferred network was used to identify two mitochondrial proteins.

EGC 8287 Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm

Tumor classification based on Gene Expression Profiles (GEPs), which is of great benefit to the accurate diagnosis and personalized treatment of different types of tumor, has drawn great attention in recent years. This paper proposes a novel tumor classification method based on correlation filters to identify the overall pattern of a tumor subtype hidden in differentially expressed genes. Concretely, two correlation filters, i.e., Minimum Average Correlation Energy (MACE) and Optimal Tradeoff Synthetic Discriminant Function (OTSDF), are introduced to determine whether a test sample matches
the templates synthesized for each subclass. The experiments on six publicly available data sets indicate that the proposed method is robust to noise and can more effectively avoid the effects of the curse of dimensionality. Compared with many model-based methods, the correlation filter-based method can achieve better performance when balanced training sets are exploited to synthesize the templates. In particular, the proposed method can detect the similarity of the overall pattern while ignoring small mismatches between the test sample and the synthesized template. And it performs well even if only a few training samples are available. More importantly, the experimental results can be visually represented, which is helpful for further analysis of the results.

EGC 8288 Scaffold Filling under the Breakpoint and Related Distances

Motivated by the trend of genome sequencing without completing the sequence of the whole genome, the problem of filling an incomplete multichromosomal genome (or scaffold) I with respect to a complete target genome G was studied. The objective is to minimize the resulting genomic distance between I′ and G, where I′ is the corresponding filled scaffold. We call this problem the one-sided scaffold filling problem. In this paper, we conduct a systematic study of the scaffold filling problem under the breakpoint distance and its variants, for both unichromosomal and multichromosomal genomes (with and without gene repetitions).
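For genomes without gene repetitions, the breakpoint distance amounts to counting neighbor pairs that the two gene orders do not share. A simplified unsigned sketch (ignoring gene signs, multiple chromosomes, and repeats, so not the full distance studied in the paper):

```python
def adjacencies(a, b):
    """Number of shared adjacencies between two gene orders over the same
    gene set, counting unordered neighbor pairs. Simplified: unsigned
    genes, a single chromosome, no repeated genes."""
    pairs_a = {frozenset(p) for p in zip(a, a[1:])}
    pairs_b = {frozenset(p) for p in zip(b, b[1:])}
    return len(pairs_a & pairs_b)

def breakpoint_distance(a, b):
    """Breakpoint distance on permutations of the same gene set:
    (number of neighbor pairs) minus (shared adjacencies)."""
    return (len(a) - 1) - adjacencies(a, b)

print(breakpoint_distance([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 0
print(breakpoint_distance([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 2
```

With repeated genes, maximizing the number of adjacencies is exactly the similarity measure under which the abstract reports NP-completeness.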
When the input genome contains no gene repetition (i.e., is a fragment of a permutation), we show that the two-sided scaffold filling problem (i.e., G is also incomplete) is polynomially solvable for unichromosomal genomes under the breakpoint distance and for multichromosomal genomes under the genomic (or DCJ, Double-Cut-and-Join) distance. However, when the input genome contains some repeated genes, even the one-sided scaffold filling problem becomes NP-complete when the similarity measure is the maximum number of adjacencies between two sequences. For this problem, we also present efficient constant-factor approximation algorithms: factor 2 for the general case and factor 1.33 for the one-sided case.

EGC 8289 SimBioNeT: A Simulator of Biological Network Topology

Studying biological networks at the topological level is a major issue in computational biology, and simulation is often used in this context, either to assess reverse engineering algorithms or to investigate how topological properties depend on network parameters. In both contexts, it is desirable for a topology simulator to reproduce the current knowledge on biological networks, to be able to generate a number of networks with the same properties, and to be flexible with respect to the possibility of mimicking networks of different organisms. We propose a biological network topology simulator, SimBioNeT, in which module structures of different types and sizes are replicated at different levels of network organization and interconnected, so as to obtain the desired degree distribution, e.g., scale free, and a clustering coefficient constant in the number of nodes in the network, a typical characteristic of biological networks. Empirical assessment of the ability of the simulator to reproduce characteristic properties of biological networks and comparison with the E. coli and S. cerevisiae transcriptional networks demonstrate the effectiveness of our proposal.
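One standard way a topology simulator can target a scale-free degree distribution is preferential attachment. The sketch below is a generic Barabási-Albert-style generator, not SimBioNeT's module-based construction, which additionally wires and replicates module structures.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow an undirected scale-free graph: each new node attaches to m
    existing nodes chosen with probability proportional to their degree
    (implemented by sampling from a degree-weighted node list).
    A generic illustration of targeting a scale-free topology."""
    rng = random.Random(seed)
    edges = set()
    initial = list(range(m))  # seed nodes
    weighted = []             # node list repeated in proportion to degree
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            pool = weighted if weighted else initial
            chosen.add(rng.choice(pool))
        for t in chosen:
            edges.add((min(new, t), max(new, t)))
            weighted.extend([new, t])
    return edges

g = preferential_attachment(200, 2)
# high-degree hubs emerge: the maximum degree far exceeds the mean (~4)
```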
EGC 8290
Smoldyn on Graphics Processing Units: Massively Parallel Brownian Dynamics Simulations

Space is a very important aspect of the simulation of biochemical systems, and the need for simulation algorithms able to cope with space is becoming more and more compelling. Complex and detailed models of biochemical systems need to deal with the movement of single molecules and particles, taking into consideration localized fluctuations, transport phenomena, and diffusion. A common drawback of spatial models lies in their complexity: models can become very large, and their simulation can be time consuming, especially if we want to capture the system's behavior in a reliable way using stochastic methods in conjunction with a high spatial resolution. To deliver on the promise made by systems biology of understanding a system as a whole, we need to scale up the size of the models we are able to simulate, moving from sequential to parallel simulation algorithms. In this paper, we analyze Smoldyn, a widely used algorithm for stochastic simulation of chemical reactions with spatial resolution and single-molecule detail, and we propose an alternative, innovative implementation that exploits the parallelism of Graphics Processing Units (GPUs). The implementation executes the most computationally demanding steps (computation of diffusion, unimolecular and bimolecular reactions, and the most common cases of molecule-surface interaction) on the GPU, computing them in parallel for each molecule of the system. The implementation offers good speedups and real-time, high-quality graphics output.
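The per-molecule diffusion step that dominates this kind of simulation, and that maps naturally onto one GPU thread per molecule, can be sketched as follows. This is a minimal CPU-side illustration in NumPy, not the paper's GPU code:

```python
import numpy as np

def diffuse(positions, D, dt, rng):
    """One Brownian-dynamics step: displace each molecule by a Gaussian
    with variance 2*D*dt per axis. Every molecule is updated
    independently, which is why the step parallelizes so well."""
    sigma = np.sqrt(2.0 * D * dt)
    return positions + rng.normal(0.0, sigma, size=positions.shape)

rng = np.random.default_rng(0)
pos = np.zeros((10000, 3))   # 10,000 point molecules at the origin
D, dt = 1.0, 1e-3            # diffusion coefficient and time step
for _ in range(100):
    pos = diffuse(pos, D, dt, rng)
# In 3-D the mean squared displacement should approach 6*D*t,
# here 6 * 1.0 * 0.1 = 0.6.
```

Reaction handling (unimolecular, bimolecular, and molecule-surface events) adds per-step bookkeeping on top of this diffusion kernel but keeps the same per-molecule structure.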
EGC 8291
Smolign: A Spatial Motifs-Based Protein Multiple Structural Alignment Method

Availability of an effective tool for protein multiple structural alignment (MSTA) is essential for the discovery and analysis of biologically significant structural motifs that can help solve functional annotation and drug design problems. Existing MSTA methods collect residue correspondences mostly through pairwise comparison of consecutive fragments, which can lead to suboptimal alignments, especially when the similarity among the proteins is low. We introduce a novel strategy based on building a contact-window-based motif library from the protein structural data, discovering and extending common alignment seeds from this library, and optimally superimposing multiple structures according to these alignment seeds by an enhanced partial-order curve comparison method. The ability of our strategy to detect multiple correspondences simultaneously, to catch alignments globally, and to support flexible alignments makes for a sensitive and robust automated algorithm that can expose similarities among protein structures even under low-similarity conditions. Our method yields better alignment results than other popular MSTA methods on several protein structure data sets that span various structural folds and represent different protein similarity levels.

EGC 8292
Stable Gene Selection from Microarray Data via Sample Weighting
Feature selection from gene expression microarray data is a widely used technique for selecting candidate genes in various cancer studies. Besides the predictive ability of the selected genes, an important aspect in evaluating a selection method is the stability of the selected genes. Experts instinctively have high confidence in the result of a selection method that selects similar sets of genes under variations of the samples. However, a common problem of existing feature selection methods for gene expression data is that the genes selected by the same method often vary significantly with sample variations. In this work, we propose a general framework of sample weighting to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then provides the weighted training set to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the stability of representative feature selection algorithms such as SVM-RFE and ReliefF, without sacrificing their classification performance. Moreover, the proposed algorithm leads to more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.

EGC 8293
Stochastic Gene Expression Modeling with Hill Function for Switch-Like Gene Responses

Gene expression models play a key role in understanding the mechanisms of gene regulation, whose aspects are graded and switch-like responses.
Although many stochastic approaches attempt to explain gene expression mechanisms, the Gillespie algorithm, which is commonly used to simulate stochastic models, hardly explains the switch-like behaviors of gene responses. In this study, we propose a stochastic gene expression model that can describe the switch-like behaviors of gene responses by incorporating Hill functions into the conventional Gillespie algorithm. We assume eight processes of gene expression, and their biologically appropriate reaction rates are estimated from the published literature. Our negative regulatory model shows that the modified Gillespie algorithm successfully describes the switch-like behaviors of gene responses, consistent with a published experimental study. We observe that the state of the toggle switch model rarely changes, since the Hill function prevents the activation of the proteins involved when their concentrations stay at a low level. In the ScbA/ScbR system, which can control the antibiotic metabolite production of microorganisms, our proposed stochastic approach successfully models the switch-like gene response and oscillatory expression.

EGC 8294
Structural SCOP Superfamily Level Classification Using Unsupervised Machine Learning

One of the major research directions in bioinformatics is that of assigning superfamily classifications to a given set of proteins. The classification reflects structural, evolutionary, and functional relatedness. These relationships are embodied in a hierarchical classification, such as the Structural Classification of Proteins (SCOP), which is mostly manually curated. Such a classification is essential for the structural and functional analysis of proteins, yet a large number of proteins remain unclassified. In this study, we propose an unsupervised machine learning approach to
classify and assign a given set of proteins to SCOP superfamilies. In the method, we construct a database and a similarity matrix using P-values obtained from an all-against-all BLAST run, and train the network with the ART2 unsupervised learning algorithm using the rows of the similarity matrix as input vectors, enabling the trained network to classify the proteins with f-measure accuracies ranging from 0.82 to 0.97. The performance of ART2 has been compared with that of spectral clustering, random forest, SVM, and HHpred. ART2 performs better than all of these except HHpred; HHpred performs better than ART2, and its sum of errors is smaller than that of the other methods evaluated.

EGC 8295
Subcellular Localization Prediction through Boosting Association Rules

Computational methods for predicting protein subcellular localization have used various types of features, including N-terminal sorting signals, amino acid compositions, and text annotations from protein databases. Our approach does not use biological knowledge such as sorting signals or homologues, but only protein sequence information. The method divides a protein sequence into short k-mer sequence fragments, which can be mapped to word features in document classification. A large number of class association rules are mined from the protein sequence examples, ranging from the N-terminus to the C-terminus. Then, a boosting algorithm is applied to those rules to build the final classifier. Experimental results on benchmark data sets show that our method is excellent in terms of both classification performance and test coverage.
The results also imply that the k-mer sequence features that determine subcellular locations do not necessarily occur at specific positions of a protein sequence.

EGC 8296
The Complexity of Finding Multiple Solutions to Betweenness and Quartet Compatibility

We show that two important problems with applications in computational biology are ASP-complete, which implies that, given a solution to a problem, it is NP-complete to decide whether another solution exists. We first show that a variation of BETWEENNESS, which is the underlying problem of questions related to radiation hybrid mapping, is ASP-complete. We then use that result to show that QUARTET COMPATIBILITY, a fundamental problem in phylogenetics that asks whether a set of quartets can be represented by a parent tree, is also ASP-complete. The latter result shows that Steel's QUARTET CHALLENGE, which asks whether a solution to QUARTET COMPATIBILITY is unique, is coNP-complete.

EGC 8295
The GA and the GWAS: Using Genetic Algorithms to Search for Multilocus Associations

Enormous data collection efforts and improvements in technology have made large genome-wide association studies a promising approach for better understanding the genetics of common diseases. Still, the knowledge gained from these studies may be extended even further by testing the hypothesis that genetic susceptibility is due to the combined effect of multiple variants, or to interactions between variants. Here, we explore and evaluate the use of a genetic algorithm to
discover groups of SNPs (of size 2, 3, or 4) that are jointly associated with bipolar disorder. The algorithm is guided by the structure of a gene interaction network and is able to find groups of SNPs that are strongly associated with the disease, while performing far fewer statistical tests than other methods.

EGC 8296
The Impact of Normalization and Phylogenetic Information on Estimating the Distance for Metagenomes

Metagenomics enables the direct study of unculturable microorganisms in different environments. Discriminating between the compositional differences of metagenomes is an important and challenging problem. Several distance functions have been proposed to estimate these differences based on functional profiles or taxonomic distributions; however, the strengths and limitations of such functions are still unclear. Initially, we analyzed three well-known distance functions and found very little difference between them in the clustering of samples. This motivated us to incorporate suitable normalizations and phylogenetic information into the functions so that we could cluster samples from both real and synthetic data sets. The results indicate a significant improvement in sample clustering when rank-based normalization is combined with phylogenetic information, regardless of whether the samples come from real or synthetic microbiomes. Furthermore, our findings suggest that considering suitable normalizations and phylogenetic information is essential when designing distance functions for estimating the differences between metagenomes. We conclude that incorporating rank-based normalization with phylogenetic information into the distance functions helps achieve reliable clustering results.
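As a rough illustration of rank-based normalization for comparing taxonomic abundance profiles: ranking damps the influence of a few dominant taxa before a distance is computed. This sketch is illustrative only; the exact normalization, tie handling, and distance function used in the paper are not reproduced here, and it ignores the phylogenetic component entirely:

```python
import numpy as np

def rank_normalize(profile):
    """Replace raw abundances by their ranks, scaled to [0, 1].
    (No tie handling; assumes distinct abundance values.)"""
    ranks = np.argsort(np.argsort(profile)).astype(float)
    return ranks / (len(profile) - 1)

def rank_distance(p, q):
    """Euclidean distance between rank-normalized profiles."""
    return float(np.linalg.norm(rank_normalize(p) - rank_normalize(q)))

a = np.array([100.0, 10.0, 5.0, 1.0])
b = np.array([90.0, 12.0, 6.0, 2.0])   # same ordering of taxa as a
c = np.array([1.0, 5.0, 10.0, 100.0])  # reversed ordering
# Profiles with the same taxon ordering become identical after ranking,
# so rank_distance(a, b) is 0, while rank_distance(a, c) is maximal.
```

This shows why rank-based normalization makes the comparison robust to large differences in raw abundance scale between samples.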
EGC 8295
The Kernel of Maximum Agreement Subtrees

A Maximum Agreement SubTree (MAST) is a largest subtree common to a set of trees, and serves as a summary of the common substructure in the trees. A single MAST can be misleading, however, since there can be an exponential number of MASTs, and two MASTs for the same tree set do not necessarily share any leaves. In this paper, we introduce the notion of the Kernel Agreement SubTree (KAST), the summary of the common substructure in all MASTs, and show that it can be calculated in polynomial time (for trees with bounded degree). Suppose the input trees represent competing hypotheses for a particular phylogeny. We explore the utility of the KAST as a method to discern the common structure of confidence, and as a measure of how confident we are in a given tree set. We also show the trend of the KAST, compared to other consensus methods, on the set of all trees visited during a Bayesian analysis of flatworm genomes.

EGC 8296
The Relevance of Topology in Parallel Simulation of Biological Networks

Important achievements in traditional biology have deepened the knowledge about living systems, leading to an extensive identification of the parts list of the cell, as well as of the interactions among biochemical species responsible for
the cell's regulation. Such expanding knowledge also introduces new issues. For example, the increasing comprehension of the interdependencies between pathways (pathway cross-talk) has resulted, on the one hand, in a growth of informational complexity and, on the other, in a strong lack of information coherence. The overall grand challenge remains unchanged: to be able to assemble the knowledge of every "piece" of a system in order to figure out the behavior of the whole (the integrative approach). In light of these considerations, high-performance computing plays a fundamental role in the context of in-silico biology. Stochastic simulation is a renowned analysis tool which, although widely used, is subject to stringent computational requirements, in particular when dealing with heterogeneous and high-dimensional systems. Here, we introduce and discuss a methodology aimed at alleviating the burden of simulating complex biological networks. The method, which springs from graph theory, is based on the principle of fragmenting the computational space of a simulation trace and delegating the computation of the fragments to a number of parallel processes.

EGC 8295
Touring Protein Space with Matt

Using the Matt structure alignment program, we take a tour of protein space, producing a hierarchical clustering scheme that divides protein structural domains into clusters based on geometric dissimilarity. While it was known that purely structural, geometric, distance-based measures of structural similarity, such as Dali/FSSP, could largely replicate hand-curated schemes such as SCOP at the family level, it was an open question whether any such scheme could approximate SCOP at the more distant superfamily and fold levels.
We partially answer this question in the affirmative by designing a clustering scheme based on Matt that approximately matches SCOP at the superfamily level and demonstrates qualitative differences in performance between Matt and DaliLite. Implications for the debate over the organization of protein fold space are discussed. Based on our clustering of protein space, we introduce the Mattbench benchmark set, a new collection of structural alignments useful for testing sequence aligners on more distantly homologous proteins.

EGC 8296
Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes

Binary (0,1) matrices, commonly known as transactional databases, can represent many kinds of application data, including gene-phenotype data, where "1" represents a confirmed gene-phenotype relation and "0" represents an unknown relation. It is natural to ask what information is hidden behind these "0"s and "1"s. Unfortunately, recent matrix completion methods, though very effective in many cases, are less likely to infer anything interesting from these (0,1)-matrices. To answer this challenge, we propose IndEvi, a very succinct and effective algorithm to perform independent-evidence-based transactional database transformation. Each entry of a (0,1)-matrix is evaluated using "independent evidence" (maximal supporting patterns) extracted from the whole matrix for this entry. The value of an entry, regardless
of whether it is 0 or 1, has no effect on its own independent evidence. An experiment on a gene-phenotype database shows that our method is highly promising for ranking candidate genes and predicting unknown disease genes.

EGC 8295
Transient Dynamics of Reduced-Order Models of Genetic Regulatory Networks

In systems biology, a number of detailed genetic regulatory network models have been proposed that are capable of modeling the fine-scale dynamics of gene expression. However, limitations on the type and sampling frequency of experimental data often prevent the parameter estimation of the detailed models. Furthermore, the high computational complexity involved in simulating a detailed model restricts its use. In such a scenario, reduced-order models capturing the coarse-scale behavior of the network are frequently applied. In this paper, we analyze the dynamics of a reduced-order Markov chain model approximating a detailed stochastic master equation model. Utilizing a reduction mapping that maintains the aggregated steady-state probability distribution of stochastic master equation models, we provide bounds on the deviation of the Markov chain transient distribution from the transient aggregated distributions of the stochastic master equation model.

EGC 8296
Using GPUs for the Exact Alignment of Short-Read Genetic Sequences by Means of the Burrows-Wheeler Transform

General-Purpose Graphics Processing Units (GPGPUs) constitute an inexpensive resource for computation-intensive applications that can exploit intrinsic fine-grain parallelism. This paper presents the design and implementation on GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform.
We compare this algorithm with state-of-the-art implementations of the same algorithm on standard CPUs, under the same conditions in terms of I/O. Excluding disk transfers, the GPU implementation of the algorithm shows a speedup larger than 12× compared to CPU execution. The implementation exploits parallelism by concurrently searching for different sequences in the same reference structure, maximizing memory locality and ensuring symmetric access to the data. The paper describes the behavior of the algorithm on the GPU, showing good performance scalability, limited only by the size of the GPU's internal memory.
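Backward search over the Burrows-Wheeler Transform is the core operation such aligners run concurrently, one read per thread. The following compact FM-index sketch shows the idea on a toy reference; the GPU implementation described in the paper is, of course, far more elaborate (bit-packed tables, sampled occurrence counts, coalesced memory access):

```python
def bwt_index(text):
    """Build the BWT of text (with '$' terminator) plus the C table and
    per-character occurrence counts used by FM-index backward search."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # suffix array
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[ch] = number of characters in text strictly smaller than ch.
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += text.count(ch)
    # occ[ch][i] = occurrences of ch in bwt[:i].
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (1 if ch == b else 0))
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Exact matching: narrow the suffix-array interval one pattern
    character at a time, right to left."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])  # 0-based match positions in the text

sa, C, occ = bwt_index("ACGTACGT")
# backward_search("ACG", sa, C, occ) → [0, 4]
```

Because each read's search touches only the shared read-only index, many reads can be aligned in parallel with no synchronization, which is precisely what makes the approach a good fit for GPUs.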