The document summarizes a study that uses an information-based similarity index to classify the SARS coronavirus. Key points:
1) The study develops a novel alignment-free method to measure genetic sequence similarity based on word frequencies and information content.
2) The method is first validated on human influenza and mitochondrial DNA, correctly reconstructing known phylogenies.
3) The method is then applied to classify SARS coronavirus, finding it is most closely related to group 1 coronaviruses, with some matches to groups 2 and 3.
4) The information-based similarity index provides a new tool for large-scale genomic analysis without sequence alignment.
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...Jonathan Eisen
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
A distance-based method for phylogenetic tree reconstruction using algebraic ...Emily Castner
This poster is for the Joint Mathematics Meetings undergraduate poster session in January 2016 and gives a sample of my research from summer 2015 in implementing a new tree reconstruction method at the Winthrop REU: Bridging Theoretical & Applied Mathematics.
Thomas S. Price, Ph.D. Career resume, Jan 2017Tom Price
The document is a CV for Thomas S. Price, who has expertise in statistical genetics, pharmacogenetics, genetic epidemiology, and bioinformatics. He has analyzed high-dimensional biological datasets using statistical methods like structural equation models and machine learning. He has worked as a researcher at universities in the UK and US, developing novel analysis methods and publishing numerous papers on genetics and genomics topics.
large data set is not available for some disease such as Brain Tumor. This and part2 presentation shows how to find "Actionable solution from a difficult cancer dataset
Multiple Sequence Alignment-just glims of viewes on bioinformatics.Arghadip Samanta
Multiple sequence alignment is used to infer evolutionary relationships by comparing homologous sequences. It involves aligning three or more biological sequences, such as protein, DNA, or RNA that are assumed to share a common ancestor. The document discusses methods for multiple sequence alignment including progressive alignment, which builds alignments sequentially according to a guide tree, and divide-and-conquer algorithms, which divide the problem into subproblems. It also describes using the resulting multiple sequence alignment for phylogenetic analysis to construct evolutionary trees and assess shared ancestry among sequences.
The document presents a secondary structure model for the internal transcribed spacer (ITS) regions of nuclear ribosomal DNA in the plant family Asteraceae, based on comparative analysis of 340 ITS sequences. The model identifies helices and other structural motifs in ITS1 and ITS2 that are supported by patterns of covariation between nucleotide positions. The structure-based alignment allows for phylogenetic analysis, resolving relationships among major lineages of Asteraceae. The well-supported tree shows congruence with chloroplast-based phylogenies. The structural model and family-level phylogeny provide new insights into molecular evolution of ITS regions in angiosperms.
The proposed research aims to develop a computational approach to analyze associations between transcription factor genes and diseases like cancer. The approach will extract gene-disease relationships from literature based on supporting evidence between genes, diseases, and evidence. Relationships will be quantitatively evaluated to extract strongly supported gene-disease linkages and rank them. Existing methods are reviewed that use properties of representative disease genes to find similar candidate genes, but the proposed method will emphasize verifiable evidence for predicted associations and their strength. The goal is to predict gene-disease relationships based on relationships between other entities to help discover disease genes.
Co-clustering algorithm for the identification of cancer subtypes from gene e...TELKOMNIKA JOURNAL
Cancer has been classified as a heterogeneous genetic disease comprising various different subtypes based on gene expression data. Early stages of diagnosis and prognosis for cancer type have become an essential requirement in cancer informatics research because it is helpful for the clinical treatment of patients. Besides this, gene network interaction which is the significant in order to understand the cellular and progressive mechanisms of cancer has been barely considered in current research. Hence, applications of machine learning methods become an important area for researchers to explore in order to categorize cancer genes into high and low risk groups or subtypes. Presently co-clustering is an extensively used data mining technique for analyzing gene expression data. This paper presents an improved network assisted co-clustering for the identification of cancer subtypes (iNCIS) where it combines gene network information with gene expression data to obtain co-clusters. The effectiveness of iNCIS was evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM). This weighted co-clustering approach in iNCIS delivers a distinctive result to integrate gene network into the clustering procedure.
Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees...Jonathan Eisen
Lang JM, Darling AE, Eisen JA (2013) Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices. PLoS ONE 8(4): e62510. doi:10.1371/journal.pone.0062510
A distance-based method for phylogenetic tree reconstruction using algebraic ...Emily Castner
This poster is for the Joint Mathematics Meetings undergraduate poster session in January 2016 and gives a sample of my research from summer 2015 in implementing a new tree reconstruction method at the Winthrop REU: Bridging Theoretical & Applied Mathematics.
Thomas S. Price, Ph.D. Career resume, Jan 2017Tom Price
The document is a CV for Thomas S. Price, who has expertise in statistical genetics, pharmacogenetics, genetic epidemiology, and bioinformatics. He has analyzed high-dimensional biological datasets using statistical methods like structural equation models and machine learning. He has worked as a researcher at universities in the UK and US, developing novel analysis methods and publishing numerous papers on genetics and genomics topics.
large data set is not available for some disease such as Brain Tumor. This and part2 presentation shows how to find "Actionable solution from a difficult cancer dataset
Multiple Sequence Alignment-just glims of viewes on bioinformatics.Arghadip Samanta
Multiple sequence alignment is used to infer evolutionary relationships by comparing homologous sequences. It involves aligning three or more biological sequences, such as protein, DNA, or RNA that are assumed to share a common ancestor. The document discusses methods for multiple sequence alignment including progressive alignment, which builds alignments sequentially according to a guide tree, and divide-and-conquer algorithms, which divide the problem into subproblems. It also describes using the resulting multiple sequence alignment for phylogenetic analysis to construct evolutionary trees and assess shared ancestry among sequences.
The document presents a secondary structure model for the internal transcribed spacer (ITS) regions of nuclear ribosomal DNA in the plant family Asteraceae, based on comparative analysis of 340 ITS sequences. The model identifies helices and other structural motifs in ITS1 and ITS2 that are supported by patterns of covariation between nucleotide positions. The structure-based alignment allows for phylogenetic analysis, resolving relationships among major lineages of Asteraceae. The well-supported tree shows congruence with chloroplast-based phylogenies. The structural model and family-level phylogeny provide new insights into molecular evolution of ITS regions in angiosperms.
The proposed research aims to develop a computational approach to analyze associations between transcription factor genes and diseases like cancer. The approach will extract gene-disease relationships from literature based on supporting evidence between genes, diseases, and evidence. Relationships will be quantitatively evaluated to extract strongly supported gene-disease linkages and rank them. Existing methods are reviewed that use properties of representative disease genes to find similar candidate genes, but the proposed method will emphasize verifiable evidence for predicted associations and their strength. The goal is to predict gene-disease relationships based on relationships between other entities to help discover disease genes.
Co-clustering algorithm for the identification of cancer subtypes from gene e...TELKOMNIKA JOURNAL
Cancer has been classified as a heterogeneous genetic disease comprising various different subtypes based on gene expression data. Early stages of diagnosis and prognosis for cancer type have become an essential requirement in cancer informatics research because it is helpful for the clinical treatment of patients. Besides this, gene network interaction which is the significant in order to understand the cellular and progressive mechanisms of cancer has been barely considered in current research. Hence, applications of machine learning methods become an important area for researchers to explore in order to categorize cancer genes into high and low risk groups or subtypes. Presently co-clustering is an extensively used data mining technique for analyzing gene expression data. This paper presents an improved network assisted co-clustering for the identification of cancer subtypes (iNCIS) where it combines gene network information with gene expression data to obtain co-clusters. The effectiveness of iNCIS was evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM). This weighted co-clustering approach in iNCIS delivers a distinctive result to integrate gene network into the clustering procedure.
Life is more than accomplishments, life is about problems, challenges and how to overcome them. The human behind the story.
My life changed upside down after being inspired by profesor yunus and the social business concept. After down, these 4 years can be reduced to one word "roller-coaster"
DUMARC is a student organization established in 2012 by Mr. Nazmul Hossain at FBS to develop students' skills, explore their talents, and connect them with companies. It has several brands and teams like Skill Factory, Playmakers, This Is Me, and FBS Beats. Playmakers hosts an annual all-round business competition that is going national. DUMARC engages FBS students through Facebook and values their feedback through an anonymous form.
Thermally modulated heating is applied to a horizontal channel to analyze its effect on drag reduction using numerical techniques. Periodic heating patterns are applied to the upper and lower walls using a sinusoidal function. The governing equations are solved using MATLAB. Results show that drag reduction is highest when the phase shift between the upper and lower wall heating is zero. Increasing uniform background heating along the walls also increases drag reduction, up to 110% higher for some cases. However, the improvement in drag reduction due to uniform heating is only significant at lower Reynolds numbers and for higher values of uniform heating. Vertical heat transfer, measured by the average Nusselt number, also increases with background heating.
The document is a letter of recommendation from Mr. Abayomi's former manager at Eko Hotel & Suites in Lagos, Nigeria praising his work performance and competency in food and beverage operations during his time as Assistant Food and Beverage Manager. The letter highlights Mr. Abayomi's organizational, communication, and leadership skills and notes his involvement in various areas including purchasing, inventory management, cost control, and training. The manager expresses that it was a pleasure working with Mr. Abayomi and recommends him for future endeavors.
Light sec for utilities and critical infrastructure white paperGeorge Wainblat
The document discusses LightSEC, a cyber security solution from ECI that provides comprehensive protection for utilities and critical infrastructure. It consists of a suite of security services that incorporate threat detection, prevention, and mitigation technologies. These services are delivered through a cloud-based platform called Mercury that uses network function virtualization for flexible deployment. LightSEC also includes a threat management platform called LightSEC-V that aggregates security data from across the solution to provide a consolidated view of risks.
2016.07.28 제65회 sw공학 technical_세미나(7월28일)_발표자료1(소셜컴퓨ᄐ...지훈 서
The document discusses several topics related to artificial intelligence and robotics. It describes how the global market for robots and AI is expected to reach $152.7 billion by 2020. It also discusses advances in machine learning, cloud computing, big data, and cognitive computing that are driving growth in AI. Finally, it notes both the opportunities and challenges of integrating AI and robotics technologies across various industries like healthcare, transportation, and manufacturing.
The internship report summarizes Kripa Unnikrishnan's 2-week internship at the Kerala State Electronics Development Corporation (KSEDC). KSEDC has four business units, including the Pneumatic Products Group (PNG) which manufactures pneumatic actuators and cylinders using compressed air or gas. The report describes pneumatic systems, the advantages of pneumatics over other methods, and the basic components and functioning of pneumatic actuators and cylinders. It concludes that the internship provided valuable exposure to KSEDC's departments and manufacturing processes, particularly their towed array product for defense applications.
La arquitectura de bases de datos distribuidas define la estructura de un sistema y sus componentes. El estándar ANSI/SPARC define tres niveles de abstracción: el nivel interno se ocupa de la forma física de almacenamiento de datos, el nivel conceptual define las estructuras de almacenamiento lógicas, y el nivel externo se ocupa de cómo los usuarios ven los datos. Las arquitecturas pueden ser basadas en componentes, funciones o datos, separando las funciones de procesamiento de usuario y datos.
This document discusses work-life balance and provides solutions to address any imbalance. It notes that work-life balance means properly prioritizing between work and lifestyle, then lists potential causes of imbalance like stress and lack of time with friends and family. To restore balance, it recommends tracking how time is spent, making self-care a priority, and practicing stress management.
Here, we describe a learning strategy that results an excellent choice for a first approach of students to produce scientific knowledge that can be confronted in the scientific field as well as recognize in this knowledge the transferability to the natural resources management. Nowadays, the availability of several Population Genetics software together with public molecular database represents a valuable tool of great assistance for teachers of this discipline. In this way, we implemented a
lecture where the students worked with empirical data set from a recent published article. The students joined theoretical concepts learned, computational software free available and empirical data set. The development of the activity comprised four steps: i) estimate population genetics parameters using software recommended by teachers, ii) understand results in a biological sense, iii) read the original manuscript from dataset authors and iv) compare both results in a comprehensive way. The students assumed the challenge under a reflective look and they kept a very fruitful discussion playing a role of population geneticists. Their exchange of ideas allowed them arrive to the conclusion that Manilkara zapota populations keep high levels of genetic diversity, although Ancient Maya left traces in the genetic makeup of these non-native populations with different management histories.
1. The document presents methods to reduce false positive variants in whole genome sequencing data, including logistic regression-based variant filtering and an ensemble approach using multiple variant calling algorithms.
2. The methods were tested on whole genome sequencing data from an extended family sequenced with two platforms. Variant calls agreed by both platforms were considered true positives to evaluate the methods.
3. The results show the proposed methods significantly reduced false negative rates compared to filtering based on genotype quality scores, while maintaining similar false discovery rates. The ensemble approach was especially effective at prioritizing true positive de novo mutations.
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceIJSTA
This document summarizes a research paper that proposes an effective sequence pattern discovery method to find polymorphic motifs in human DNA sequences to detect genetic medical conditions. The method uses three algorithms - LDK, HVS, and MDFS. It aims to identify disease-causing sequences, like those related to Ebola, by finding matching patterns in polymorphic DNA sequences. The position scrolling matrix is used to report test results of sequence pattern matching.
ASHG 2015 - Redundant Annotations in Tertiary AnalysisJames Warren
After obtaining genetic variants from next generation sequencing data, a precursory step in tertiary analysis is to annotate each variant with available relevant information. There is no standardized compendium for this purpose; researchers instead are required to compile data from a motley of annotation tools and public datasets. These sources for annotation are independently maintained, and accordingly there is limited concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of the analysis.
A Note On Exact Tests Of Hardy-Weinberg EquilibriumStephen Faucher
This document describes exact tests for assessing Hardy-Weinberg equilibrium that control type 1 error rates better than chi-squared tests, especially for large sample sizes. It presents efficient computational methods for implementing these exact tests, which sum probabilities of all possible genotype configurations conditional on observed allele counts. These exact tests have been programmed into freely available software for quality control and association testing in large genetic studies.
1) The document describes a study that uses an artificial neural network (ANN) to predict promoter regions in DNA sequences.
2) The ANN was trained on a dataset of 106 DNA sequences (57 nucleotides each) containing both promoter and non-promoter regions that were labeled.
3) The ANN achieved a maximum accuracy of 84% at predicting promoters, outperforming some existing techniques according to error rates reported in the literature.
Comparative analysis of genome sequences from six strains of Streptococcus agalactiae (Group B Streptococcus; GBS), representing the five major disease-causing serotypes, and two previously sequenced genomes suggests that a bacterial species can be described by its "pan-genome". The pan-genome includes a core genome of genes present in all strains and a dispensable genome of strain-specific and partially shared genes. While 80% of any single genome is shared among all isolates (core genome), sequencing additional strains revealed unique genes, and extrapolation predicts more unique genes will be found with further sequencing. Multiple independent genome sequences are thus required to fully understand the genomic complexity of a bacterial species.
1. Statistical analysis of big data sets from microarrays and RNA-seq is used to identify differentially expressed genes. Heat maps and volcano plots are commonly used to visualize the data.
2. Gene ontology, gene set enrichment analysis, and transcription factor analysis are used to analyze lists of genes and identify biological processes, pathways, and regulatory relationships.
3. Networks can be constructed by integrating gene lists with protein-protein and gene regulatory interaction databases to build signaling, regulatory, and interaction networks for further analysis.
Analisis de la expresion de genes en la depresionCinthya Yessenia
DNA microarray technologies have advanced in the past 15 years and allow measurement of thousands of gene expressions simultaneously. This study analyzed gene expression profiles in the amygdala and anterior cingulate of depressed patients compared to controls. The authors found correlated changes in these brain regions in both human depressed patients and an animal model of depression. Specifically, they identified decreased expression of oligodendrocyte genes in the amygdala and changes to glial-neuronal communication pathways. By analyzing gene expression across brain regions and species, and confirming results with additional techniques, this study establishes new targets for investigating molecular changes underlying depression.
Life is more than accomplishments, life is about problems, challenges and how to overcome them. The human behind the story.
My life changed upside down after being inspired by profesor yunus and the social business concept. After down, these 4 years can be reduced to one word "roller-coaster"
DUMARC is a student organization established in 2012 by Mr. Nazmul Hossain at FBS to develop students' skills, explore their talents, and connect them with companies. It has several brands and teams like Skill Factory, Playmakers, This Is Me, and FBS Beats. Playmakers hosts an annual all-round business competition that is going national. DUMARC engages FBS students through Facebook and values their feedback through an anonymous form.
Thermally modulated heating is applied to a horizontal channel to analyze its effect on drag reduction using numerical techniques. Periodic heating patterns are applied to the upper and lower walls using a sinusoidal function. The governing equations are solved using MATLAB. Results show that drag reduction is highest when the phase shift between the upper and lower wall heating is zero. Increasing uniform background heating along the walls also increases drag reduction, up to 110% higher for some cases. However, the improvement in drag reduction due to uniform heating is only significant at lower Reynolds numbers and for higher values of uniform heating. Vertical heat transfer, measured by the average Nusselt number, also increases with background heating.
The document is a letter of recommendation from Mr. Abayomi's former manager at Eko Hotel & Suites in Lagos, Nigeria praising his work performance and competency in food and beverage operations during his time as Assistant Food and Beverage Manager. The letter highlights Mr. Abayomi's organizational, communication, and leadership skills and notes his involvement in various areas including purchasing, inventory management, cost control, and training. The manager expresses that it was a pleasure working with Mr. Abayomi and recommends him for future endeavors.
Light sec for utilities and critical infrastructure white paperGeorge Wainblat
The document discusses LightSEC, a cyber security solution from ECI that provides comprehensive protection for utilities and critical infrastructure. It consists of a suite of security services that incorporate threat detection, prevention, and mitigation technologies. These services are delivered through a cloud-based platform called Mercury that uses network function virtualization for flexible deployment. LightSEC also includes a threat management platform called LightSEC-V that aggregates security data from across the solution to provide a consolidated view of risks.
2016.07.28 제65회 sw공학 technical_세미나(7월28일)_발표자료1(소셜컴퓨ᄐ...지훈 서
The document discusses several topics related to artificial intelligence and robotics. It describes how the global market for robots and AI is expected to reach $152.7 billion by 2020. It also discusses advances in machine learning, cloud computing, big data, and cognitive computing that are driving growth in AI. Finally, it notes both the opportunities and challenges of integrating AI and robotics technologies across various industries like healthcare, transportation, and manufacturing.
The internship report summarizes Kripa Unnikrishnan's 2-week internship at the Kerala State Electronics Development Corporation (KSEDC). KSEDC has four business units, including the Pneumatic Products Group (PNG) which manufactures pneumatic actuators and cylinders using compressed air or gas. The report describes pneumatic systems, the advantages of pneumatics over other methods, and the basic components and functioning of pneumatic actuators and cylinders. It concludes that the internship provided valuable exposure to KSEDC's departments and manufacturing processes, particularly their towed array product for defense applications.
La arquitectura de bases de datos distribuidas define la estructura de un sistema y sus componentes. El estándar ANSI/SPARC define tres niveles de abstracción: el nivel interno se ocupa de la forma física de almacenamiento de datos, el nivel conceptual define las estructuras de almacenamiento lógicas, y el nivel externo se ocupa de cómo los usuarios ven los datos. Las arquitecturas pueden ser basadas en componentes, funciones o datos, separando las funciones de procesamiento de usuario y datos.
This document discusses work-life balance and provides solutions to address any imbalance. It notes that work-life balance means properly prioritizing between work and lifestyle, then lists potential causes of imbalance like stress and lack of time with friends and family. To restore balance, it recommends tracking how time is spent, making self-care a priority, and practicing stress management.
Here, we describe a learning strategy that results an excellent choice for a first approach of students to produce scientific knowledge that can be confronted in the scientific field as well as recognize in this knowledge the transferability to the natural resources management. Nowadays, the availability of several Population Genetics software together with public molecular database represents a valuable tool of great assistance for teachers of this discipline. In this way, we implemented a
lecture where the students worked with empirical data set from a recent published article. The students joined theoretical concepts learned, computational software free available and empirical data set. The development of the activity comprised four steps: i) estimate population genetics parameters using software recommended by teachers, ii) understand results in a biological sense, iii) read the original manuscript from dataset authors and iv) compare both results in a comprehensive way. The students assumed the challenge under a reflective look and they kept a very fruitful discussion playing a role of population geneticists. Their exchange of ideas allowed them arrive to the conclusion that Manilkara zapota populations keep high levels of genetic diversity, although Ancient Maya left traces in the genetic makeup of these non-native populations with different management histories.
1. The document presents methods to reduce false positive variants in whole genome sequencing data, including logistic regression-based variant filtering and an ensemble approach using multiple variant calling algorithms.
2. The methods were tested on whole genome sequencing data from an extended family sequenced with two platforms. Variant calls agreed by both platforms were considered true positives to evaluate the methods.
3. The results show the proposed methods significantly reduced false negative rates compared to filtering based on genotype quality scores, while maintaining similar false discovery rates. The ensemble approach was especially effective at prioritizing true positive de novo mutations.
Efficiency of Using Sequence Discovery for Polymorphism in DNA SequenceIJSTA
This document summarizes a research paper that proposes an effective sequence pattern discovery method to find polymorphic motifs in human DNA sequences to detect genetic medical conditions. The method uses three algorithms - LDK, HVS, and MDFS. It aims to identify disease-causing sequences, like those related to Ebola, by finding matching patterns in polymorphic DNA sequences. The position scrolling matrix is used to report test results of sequence pattern matching.
ASHG 2015 - Redundant Annotations in Tertiary AnalysisJames Warren
After obtaining genetic variants from next generation sequencing data, a precursory step in tertiary analysis is to annotate each variant with available relevant information. There is no standardized compendium for this purpose; researchers instead are required to compile data from a motley of annotation tools and public datasets. These sources for annotation are independently maintained, and accordingly there is limited concordance between their reported contents. The choice of annotation datasets thus has a direct and significant impact on the results of the analysis.
A Note On Exact Tests Of Hardy-Weinberg EquilibriumStephen Faucher
This document describes exact tests for assessing Hardy-Weinberg equilibrium that control type 1 error rates better than chi-squared tests, especially for large sample sizes. It presents efficient computational methods for implementing these exact tests, which sum probabilities of all possible genotype configurations conditional on observed allele counts. These exact tests have been programmed into freely available software for quality control and association testing in large genetic studies.
1) The document describes a study that uses an artificial neural network (ANN) to predict promoter regions in DNA sequences.
2) The ANN was trained on a dataset of 106 DNA sequences (57 nucleotides each) containing both promoter and non-promoter regions that were labeled.
3) The ANN achieved a maximum accuracy of 84% at predicting promoters, outperforming some existing techniques according to error rates reported in the literature.
Comparative analysis of genome sequences from six strains of Streptococcus agalactiae (Group B Streptococcus; GBS), representing the five major disease-causing serotypes, and two previously sequenced genomes suggests that a bacterial species can be described by its "pan-genome". The pan-genome includes a core genome of genes present in all strains and a dispensable genome of strain-specific and partially shared genes. While 80% of any single genome is shared among all isolates (core genome), sequencing additional strains revealed unique genes, and extrapolation predicts more unique genes will be found with further sequencing. Multiple independent genome sequences are thus required to fully understand the genomic complexity of a bacterial species.
1. Statistical analysis of big data sets from microarrays and RNA-seq is used to identify differentially expressed genes. Heat maps and volcano plots are commonly used to visualize the data.
2. Gene ontology, gene set enrichment analysis, and transcription factor analysis are used to analyze lists of genes and identify biological processes, pathways, and regulatory relationships.
3. Networks can be constructed by integrating gene lists with protein-protein and gene regulatory interaction databases to build signaling, regulatory, and interaction networks for further analysis.
Analisis de la expresion de genes en la depresionCinthya Yessenia
DNA microarray technologies have advanced in the past 15 years and allow measurement of thousands of gene expressions simultaneously. This study analyzed gene expression profiles in the amygdala and anterior cingulate of depressed patients compared to controls. The authors found correlated changes in these brain regions in both human depressed patients and an animal model of depression. Specifically, they identified decreased expression of oligodendrocyte genes in the amygdala and changes to glial-neuronal communication pathways. By analyzing gene expression across brain regions and species, and confirming results with additional techniques, this study establishes new targets for investigating molecular changes underlying depression.
Construction of phylogenetic tree from multiple gene trees using principal co...IAEME Publication
This document describes a method for constructing a phylogenetic tree from multiple gene trees using principal component analysis. Multiple gene trees are generated from different protein sequences from various organisms. Distance matrices are calculated for each gene tree and combined into a single data matrix. Principal component analysis is performed on the data matrix to extract the first principal component, which represents the consensus distance vector combining information from all gene trees. A phylogenetic tree is then generated from the consensus distance vector using UPGMA, providing a species tree that integrates information from multiple genes. The method is demonstrated on protein sequence data from primates and placental mammals.
Pairwise Sequence Alignment between HBV and HCC Using Modified Needleman-Wuns...TELKOMNIKA JOURNAL
Ths paper aims to find similarity of Hepatitis B virus (HBV) and Hepatocelluler Carcinoma (HCC) DNA sequences. It is very important in bioformatics task. The similarity of sequence allignments indicates that they have similarity of chemical and physical properties. Mutation of the virus DNA in X region has potential role in HCC. It is observed using pairwise sequence alignment of genotype-A in HBV. The complexity of DNA sequence using dynamic programming, Needleman-Wunsch algorithm, is very high. Therefore, it is purpose to modifiy the method of Needleman Wunsch algorithm for optimum global DNA sequence alignment. The main idea is to optimize filling matrix and backtracking proccess of DNA components.This method can also solve various length of the both sequence alignment.
This research is applied to DNA sequence of 858 hepatitis B virus and 12 carcinoma patient, so that there are 10,296 pairwis of sequences. They are aligned globally using the purposed method and as a result, it is achieved high similarity of 96.547% and validity of 99.854%. Furhthermore, this method has reduced the complexity of original Needleman-Wunsch algorithm The reduction of computational time is as 34.6% and space complexity is as 42.52%.
SIMILARITY ANALYSIS OF DNA SEQUENCES BASED ON THE CHEMICAL PROPERTIES OF NUCL...cscpconf
The DNA sequences similarity analysis approaches have been based on the representation and the frequency of sequences components; however, the position inside sequence is important information for the sequence data. Whereas, insufficient information in sequences representations is important reason that causes poor similarity results. Based on three classifications of the DNA bases according to their chemical properties, the frequencies and average positions of group mutations have been grouped into two twelve-components vectors, the Euclidean distances among introduced vectors applied to compare the coding sequences of the first exon of beta globin gene of 11 species.
SIMILARITY ANALYSIS OF DNA SEQUENCES BASED ON THE CHEMICAL PROPERTIES OF NUCL...csandit
The DNA sequences similarity analysis approaches have been based on the representation and the frequency of sequences components; however, the position inside sequence is important information for the sequence data. Whereas, insufficient information in sequences
representations is important reason that causes poor similarity results. Based on three classifications of the DNA bases according to their chemical properties, the frequencies and
average positions of group mutations have been grouped into two twelve-components vectors,the Euclidean distances among introduced vectors applied to compare the coding sequences of the first exon of beta globin gene of 11 species.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging
from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence
alignment, an important way of which is the progressive alignment approach studied in this work.
Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept
of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms
of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our
experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications rangingfrom phylogenetic analyses to domain identification. There are several ways to perform multiple sequencealignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence alignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Sequencing projects arising from high throughput technologies including those of sequencing DNA microarrays allowed to simultaneously measure the expression levels of millions of genes of a biological sample as well as annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information,
bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate the hypothesis of researchers throughout the experimental cycle. In this context, this article describes and discusses some of techniques used for the functional analysis of gene expression data.
Introduction to 16S rRNA gene multivariate analysisJosh Neufeld
Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumMJDuyan
(𝐓𝐋𝐄 𝟏𝟎𝟎) (𝐋𝐞𝐬𝐬𝐨𝐧 𝟏)-𝐏𝐫𝐞𝐥𝐢𝐦𝐬
𝐃𝐢𝐬𝐜𝐮𝐬𝐬 𝐭𝐡𝐞 𝐄𝐏𝐏 𝐂𝐮𝐫𝐫𝐢𝐜𝐮𝐥𝐮𝐦 𝐢𝐧 𝐭𝐡𝐞 𝐏𝐡𝐢𝐥𝐢𝐩𝐩𝐢𝐧𝐞𝐬:
- Understand the goals and objectives of the Edukasyong Pantahanan at Pangkabuhayan (EPP) curriculum, recognizing its importance in fostering practical life skills and values among students. Students will also be able to identify the key components and subjects covered, such as agriculture, home economics, industrial arts, and information and communication technology.
𝐄𝐱𝐩𝐥𝐚𝐢𝐧 𝐭𝐡𝐞 𝐍𝐚𝐭𝐮𝐫𝐞 𝐚𝐧𝐝 𝐒𝐜𝐨𝐩𝐞 𝐨𝐟 𝐚𝐧 𝐄𝐧𝐭𝐫𝐞𝐩𝐫𝐞𝐧𝐞𝐮𝐫:
-Define entrepreneurship, distinguishing it from general business activities by emphasizing its focus on innovation, risk-taking, and value creation. Students will describe the characteristics and traits of successful entrepreneurs, including their roles and responsibilities, and discuss the broader economic and social impacts of entrepreneurial activities on both local and global scales.
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
2. 1104 YANG ET AL.
An alternative approach is to develop alignment-free sequence comparison methods. Current alignment-
free sequence comparison methods can be classified into two categories (Vinga and Almeida, 2003):
information theory-based (Li et al., 2001) and word statistics-based measures (Campbell et al., 1999; Qi
et al., 2004; Chaudhuri and Das, 2002; Hao et al., 2003; Karlin and Burge, 1995; Qi et al., 2004; Stuart
et al., 2002). We have developed a new index adapted from linguistic analysis and information theory to
measure the similarity between symbolic sequences (Yang et al., 2003a, 2003b). Our approach is based
on the concept that the information content in any symbolic sequence is primarily determined by the
repetitive usage of its basic elements. The novelty of this information-based similarity index is that it
incorporates elements of both information-based and word statistics-based categories since the rank order
difference of each n-tuple (word statistics) is weighted by its information content using Shannon entropy
(information theory) (Shannon, 1948). Furthermore, the composition of these basic elements captures both
global information related to usage of repetitive elements in genetic sequences, as well as local sequence
order determined by the n-tuple nucleotides. Hence, our method provides a complementary approach to
overcoming limitations of alignment methods and is capable of exploring genetic sequences with hetero-
geneic origins. The resulting measurement has been validated with respect to generic information-carrying
symbolic sequences (Yang et al., 2003a, 2003b). Here we show the specific application of this method to
genomic sequences.
METHODS
We have recently developed and validated a generic information-based similarity index to quantify
the similarity between symbolic sequences. This method, which has been used for analysis of complex
physiologic signals (Yang et al., 2003a) and literary texts (Yang et al., 2003b), can be readily adapted to
genetic sequences by examining usages of n-tuple nucleotides (“words”). We first determine the frequencies
for each n-tuple by applying a sliding window (moving one nucleotide/step) across the entire genome, and
then rank each n-tuple according to its frequency in descending order. To compare the similarity between
genetic sequences, we plot the rank number of each n-tuple in the first sequence against that of the
second sequence. Figure 1 shows the comparison of 4-tuple nucleotide frequencies between the complete
mitochondrial genome of two human lineages and those of the human and gorilla. If two sequences
are similar in their rank order of n-tuples, the scattered points will be located near the diagonal line
(e.g., human versus human). Therefore, the average deviation of these scattered points away from the
diagonal line is a measure of the similarity index between these two sequences (Yang et al., 2003a, 2003b).
FIG. 1. Rank order comparison of 4-tuple nucleotides (DNA words) of complete mitochondrial DNA (mtDNA)
sequences for (a) two human lineages and (b) human and Gorilla gorilla. Words from the two human mtDNA
sequences fall close to the diagonal, indicating similar ranking in nucleotide usage. In contrast, the comparison map
of human versus gorilla mtDNA yields greater scatter of words around the diagonal. The pairwise distance matrix
of virus sequences is then determined (Equation (1)) and used to build a phylogenetic tree using standard distance
methods (Saitou and Nei, 1987; Fitch and Margoliash, 1967).
3. GENOMIC CLASSIFICATION 1105
We can define the similarity index (Dn) using n-tuple nucleotides between two sequences, S1 and S2, as
Dn(S1, S2) =
1
N − 1
N
k=1
|R1(wk) − R2(wk)|
H1(wk) + H2(wk)
N
k=1
[H1(wk) + H2(wk)]
. (1)
Here R1(wk) and R2(wk) represent the rank of a specific n-tuple, wk, in sequences S1 and S2, re-
spectively, and N = 4n is the number of different n-tuple nucleotides. The absolute difference of ranks,
|R1(wk) − R2(wk)|, is proportional to the euclidean distance from a given point to the diagonal line. This
term is then weighted by the sum of Shannon’s entropy H (Shannon, 1948) for wk in sequences S1 and S2.
Shannon’s entropy measures the information richness of each n-tuple in both sequences. Thus, the more
frequently used n-tuples contribute more to measuring similarity among genetic sequences.
We note that this similarity measurement is an empirical index which does not fulfill the criteria of a
rigorous distance measure (Yang et al., 2004, 2003b). Therefore, the triangular inequality test is required
before generating a phylogenetic tree. When applied to the actual nucleotide sequences here, no violation
of the triangular inequality was observed. This similarity metric was then used to determine pairwise
distances among genetic sequences and to construct a phylogenetic tree (Felsenstein, 1993; Saitou and
Nei, 1987; Fitch and Margoliash, 1967).
To address the statistical reliability of the phylogenetic tree topology, we adapted the methodology of
bootstrap analysis (Felsenstein, 1985) and applied it to the information similarity index. Bootstrap analysis
is based on the creation of a series of surrogate datasets obtained by resampling the original dataset with
replacement. In the case of alignment methods, the surrogate datasets are obtained by resampling aligned
columns of nucleotides. To adapt the central concept of bootstrap analysis, we created surrogates by
resampling n-tuples from their original distribution in a given sequence. We then calculated the pairwise
similarity index between bootstrapped rank-order frequency lists and constructed the phylogenetic tree.
The bootstrapped values shown on branches represent the number of successful tests (i.e., those having
the same topology as the non-bootstrapped tree) for 1,000 repetitions of the bootstrap procedure.
To further investigate the effects of length of n-tuples on the tree topology, we constructed a mini-database
consisting of five known coronaviruses representing three groups. We then estimated the phylogenetic
trees based on different lengths of n-tuples (n = 3–6). Figure 2a shows the schematic illustration of three
established coronavirus groups. Figure 2b–e shows results of neighbor-joining phylogenetic trees based
on different lengths of n-tuples (n = 3–6). Bootstrapped values (number of successful tests in 1,000
experiments) are shown on the tree branches.
Qualitatively similar results are obtained for n in the range from 3 to 6. The tree with the highest
bootstrapped value is obtained by using 4-tuple words. Higher values of n require substantially longer
sequences (such that each n–tuple word will be sampled in a statistically meaningful way). As the possible
configurations of n-tuples increase exponentially with n, we found that for the coronavirus database (mean
sequence length: 29,069 ± 1,569 bp), n values greater than 6 reduce the reliability of the phylogenetic trees.
RESULTS AND DISCUSSION
We first validate the method on the human influenza A viral genomes (Fig. 3) and human mitochondrial
DNA database (Fig. 4). The results are comparable to previous reports (Buonagurio et al., 1986; Ingman
et al., 2000) addressing the phylogeny of these two databases. We then apply the method to address
the origin and classification of the newly identified coronavirus associated with severe acute respiratory
syndrome (SARS).
Phylogenetic analysis of the human influenza A virus nonstructural gene database
For initial validation of the information-based similarity index on genetic sequences, we apply this method
to the evolution of human influenza A virus from isolates of major outbreaks since 1933 (Buonagurio et al.,
1986; Levin et al., 2004). The gene coding for nonstructural (NS) proteins of human influenza A virus
4. 1106 YANG ET AL.
FIG. 2. Comparison of effect of different lengths of n-tuples on the topology and statistical significance of phylo-
genetic trees. Bootstrap values generated by testing 1,000 replicates of the dataset are shown on branches. (a) The
schematic illustration of three established coronavirus groups. Group 1 (G1): HCoV-229E and PEDV. Group 2 (G2):
BCoV and MHV. Group 3 (G3): IBV. (b–e) Results of neighbor-joining phylogenetic trees based on different lengths
of n-tuples (n = 3–6). Qualitatively similar results are obtained for n in the range of 3 to 6. The tree with the highest
bootstrapped value is obtained by using 4-tuple words.
5. GENOMIC CLASSIFICATION 1107
FIG. 3. Evolution of human influenza A virus nonstructural (NS) genes. (a) Phylogenetic tree of 16 human in-
fluenza A virus NS genes. The number in front of the geographical location indicates the year of the influenza
outbreak. Each virus isolate is coded by its hemagglutinin (H) and neuraminidase (N) serotype: H1N1 (solid line),
H2N2 (dashed line), and H3N2 (long-dashed line). Each NS gene was analyzed by the method described in the text
using 4-tuples as “words.” The tree structure is comparable to a prior alignment based study (Buonagurio et al., 1986)
showing a rapid and distinct evolutionary path of influenza virus spread during the past 60 years. Bootstrap values
generated by testing 100 replicates of the dataset are shown on branches. (b) Evolution of the NS gene of 16 human
influenza isolates. The graph shows a regression analysis of the year of isolation plotted against the branch length from
the common ancestor node at the main trunk of the phylogenetic tree. The figure shows two separated evolutionary
pathways (Buonagurio et al., 1986) with a linear correlation between the similarity index with respect to the year
of isolation. Furthermore, when analyzing isolates (filled circles) from the major branch of the phylogenetic tree, the
year of zero similarity index extrapolated by regression analysis is consistent with the proposed year in which genetic
material was introduced from the animal to the human influenza virus (Gammelin et al., 1990). An apparently distinct
evolutionary pathway consisting of four H1N1 influenza isolates (77/USSR, 80/Maryland, and 84/85/Houston) is also
shown (filled squares).
6. 1108 YANG ET AL.
has demonstrated a rapid and steady mutation rate, making it suitable for studying evolutionary patterns.
We collected 16 NS gene sequences which represent isolates from different regions over a span of 60
years (Buonagurio et al., 1986). Each sequence was analyzed by the information-based similarity index
using 4-tuples as “words.” The neighbor-joining phylogenetic tree is shown in Fig. 3a. By assigning the
WSN strain (1933) as the root, the tree shows a progressive evolutionary trend consistent with a previous
analysis (Buonagurio et al., 1986). Furthermore, the tree shows distinct evolutionary pathways after the
1947 outbreak. A group of H1N1 subtypes, including 1977 USSR, 1980 Maryland, and 1984/85 Houston
isolates, is closely related to the 1950 Fort Warren strain. The others evolved in a separate pathway and had
another major genetic shift in 1960 and 1972 which resulted in the H2N2 and H3N2 strains, respectively.
These findings are compatible with the consensus view of human influenza A virus evolution (Buonagurio
et al., 1986; Levin et al., 2004) and also confirm the unique epidemiology of the H1N1 virus isolated from
the USSR in 1977 (Buonagurio et al., 1986).
Since the similarity index proposed here is not based on the assumption that genetic sequences evolve
at a constant rate (“molecular clock model”) (Graur and Li, 1999), we further investigate the relationship
of our similarity measure with respect to the evolutionary time. We calculated the branch length of each
isolate from the common ancestor node of the phylogenetic tree shown in Fig. 3a, and plotted against
the year of isolation. Figure 3b shows regression analysis of two distinct evolutionary pathways which is
consistent with a prior study (Buonagurio et al., 1986). The similarity index of the branch length of each
isolate from the common ancestor node indeed linearly correlates with evolutionary time based on the year
of isolation. Furthermore, the year of zero similarity index extrapolated by regression analysis is consistent
with the proposed year in which genetic material was introduced from the animal to the human influenza
virus (Gammelin et al., 1990).
Phylogenetic analysis of the human mitochondrial DNA database
The second part of the validation involves the analysis of complete human mitochondrial DNA (mtDNA)
sequences (Cann et al., 1987; Horai et al., 1995; Ruvolo et al., 1993; Vigilant et al., 1991; Ingman et al.,
2000; Mishmar et al., 2003). Here we provide an independent analysis without sequence prealignment
based on 86 mtDNA sequences (see Appendix for accession numbers) (Ingman et al., 2000; Mishmar
et al., 2003). Each sequence was analyzed by the information-based similarity index using n-tuples as
“words” (n = 3–5). The neighbor-joining phylogenetic tree based on the information-based similarity
index is shown in Fig. 4 (n = 5). The branching order of each mtDNA sequence is comparable with prior
studies based on sequence alignment methods (Horai et al., 1995; Ruvolo et al., 1993; Vigilant et al.,
1991; Ingman et al., 2000). All of sub-Saharan African lineages are classified on the bottom of the tree
near the root, supporting the African origin of human evolution (Horai et al., 1995; Ruvolo et al., 1993;
Vigilant et al., 1991; Ingman et al., 2000). Furthermore, our classification scheme correctly classifies other
lineages according to their geographic distribution. For example, Mediterranean people, including Spanish,
Italian, and Moroccan lineages, are classified under the same branch and close to the branch of European
lineages.
Phylogenetic analysis of the SARS genome
The outbreak of SARS in 2003 has had a tremendous impact on worldwide health care systems (Lee
et al., 2003; Poutanen et al., 2003). A central question relevant to the prevention of the recurrence of
future SARS outbreak is to determine the virus’s origin. Several groups have contributed to identifying
and sequencing the complete genome of the newly recognized pathogen, SARS-associated coronavirus
(SARS-CoV) (Rota et al., 2003; Marra et al., 2003; Drosten et al., 2003; Peiris et al., 2003). A SARS-like
FIG. 4. The neighbor-joining phylogenetic tree (Saitou and Nei, 1987) based on complete mitochondrial DNA
sequences (mtDNA). Eighty-six human mtDNA sequences were available (see Appendix, accession number AF346963-
AF347015) (Ingman et al., 2000). We use 5-tuple (or five consecutive nucleotides) as “words.” The pairwise distance
matrix is calculated using the information-based similarity index. All of the sub-Saharan African lineages are classified
at the bottom of tree near the root, which is comparable to prior studies based on sequence alignment algorithms (Horai
et al., 1995; Ruvolo et al., 1993; Vigilant et al., 1991) supporting the African origin of human evolution. Bootstrap
values generated by testing 100 replicates of the dataset are shown on branches.
8. 1110 YANG ET AL.
FIG. 5. Neighbor-joining phylogenetic analysis (Saitou and Nei, 1987) of coronaviridae using an information-based
similarity index based on 4-tuple nucleotides as “RNA words.” (A comparable result is obtained by the Fitch-Margoliash
method (Fitch and Margoliash, 1967).) Bootstrap values generated by testing 1,000 replicates of the dataset are shown
on branches. (a) complete genomes; (b) spike glycoprotein; (c) membrane protein; and (d) nucleocapsid protein. SARS-
CoV (NC_00471) and SARS-like virus (AY304486) were compared to the following three major coronavirus groups.
Group 1 (G1): human coronavirus 229E (HCoV-229E); transmissible gastroenteritis virus (TGEV); porcine epidemic
diarrhea virus (PEDV); canine coronavirus (CCoV); feline infectious peritonitis virus (FCoV); porcine respiratory
coronavirus (PrCoV). Group 2 (G2): human coronavirus OC43 (HCoV-OC43); bovine coronavirus (BCoV); porcine
hemagglutinating encephalomyelitis virus (HEV); rat sialodacryoadenitis virus (RtCoV); murine hepatitis virus (MHV).
Group 3 (G3): avian infectious bronchitis virus (IBV). Comparable results are obtained using different lengths of n-tuple
words (n: 3–5 for the complete genome; n: 3–4 for structural protein genes).
9. GENOMIC CLASSIFICATION 1111
FIG. 6. Comparison of SARS-CoV with the entire genome of other known coronaviruses. The SARS genome is
decomposed into nonoverlapping 300 bp segments. Each segment is compared to entire genomes of known coron-
aviruses to find the best-fit sequence. Each segment is shown by the index of similarity where low values indicate
greater similarity to the most closely related coronavirus. To estimate the significance level of the similarity index,
we computed the similarity index of 105 pairs of randomly selected 300 bp segments from known coronaviruses.
We then determined the significance level by computing the 95 percentile rank value of the similarity index (0.165).
Only segments within a statistically significant level are color coded to their corresponding group. The white column
represents those sequences not significantly similar to any known group. We find that 30% of the entire genome is
dis-similar to any known groups. Of note, 41% of the entire SARS-CoV genome is related to group 1 coronaviruses,
while only 8% and 21% of the SARS genome are related to group 2 and 3 coronaviruses, respectively. Comparable
results are obtained by using a sliding window to decompose the SARS genome.
virus has also been isolated from wild animals such as the palm civet in southern China, indicating that
SARS-CoV may have originated from a previously unidentified animal coronavirus (Guan et al., 2003).
The relationship of the poorly conserved SARS genome to other coronaviruses, however, is still in question
(Vogel, 2003; Enserink and Normile, 2003; Enserink, 2003) since current studies are based on the small
portion of aligned sequences (Rota et al., 2003; Eickmann et al., 2003; Marra et al., 2003; Snijder, et al.,
2003; Stadler et al., 2003).
Our results based on analysis of available complete coronavirus genomes (Fig. 5a) have some notable
distinctions from these previous phylogenetic studies using sequence alignment (Rota et al., 2003; Eick-
mann et al., 2003; Marra et al., 2003; Stadler et al., 2003). In particular, our method indicates that the
SARS-CoV is not classified as a new group but is close to the group 1 coronaviruses. We further an-
alyze the genes coding for structural proteins of SARS-CoV, including the spike glycoprotein (S), the
membrane glycoprotein (M), as well as the nucleocapsid protein (N). When examining the phylogeny
10. 1112 YANG ET AL.
Table 1. Homologies between SARS-CoV Genes Other Coronaviruses Using an
Information-Based Similarity Indexa
aNumbers indicate the similarity index between individual SARS-CoV genes and other coronaviruses (low value indicates greater
similarity). The values of the most closely related coronaviruses are in bold. Asterisks indicate a nonsignificant similarity index
(greater than 0.165). Nsp, nonstructral protein. RdRp, RNA-dependent RNA polymerase.
(Figs. 5b, 5c, and 5d) of these three genes, we find that all structural proteins consistently cluster with
two viruses: human coronavirus 229E (HCoV-229E) and porcine epidemic diarrhea virus (PEDV). This
finding also suggests that SARS-CoV and group 1 coronaviruses are closely related and are likely to share
a common ancestor.
Since the SARS genome likely has heterogeneous origins (Rota et al., 2003; Marra et al., 2003), analysis
based solely on the whole genome that mixes word distributions from these different sources may not
reveal details of segmentary origins. Therefore, we next systemically compare the similarity of individual
segments of SARS-CoV to representative coronaviruses from established groups, including HCoV-229E,
murine hepatitis virus (MHV), and avian infectious bronchitis virus (IBV). We first decompose the SARS
genome into nonoverlapping 300 bp segments. We then compare each segment to the entire genome of
other coronaviruses to find the best-fit sequences using the similarity index. Each segment is assigned to
the most closely related group (Fig. 6). Only those segments with similarity indexes within a statistically
significant range (see Fig. 6 caption) are color coded to the corresponding group. We find that 30% of
the entire genome is dissimilar to any known groups. Of note, 41% of the entire SARS-CoV genome is
related to group 1 coronaviruses, while only 8% and 21% of the SARS genome are related to group 2
and 3 coronaviruses, respectively. We test the consistency of the results using different segment lengths
11. GENOMIC CLASSIFICATION 1113
Table 2. Top Ranked 4-Tuples of SARS-CoV Genome and Other Coronavirusesa
aTwenty top ranked 4-tuples of the entire genome of the SARS-CoV and other coronaviruses. We first determine the frequencies
for each 4-tuple by applying a sliding window (moving one nucleotide/step) across the entire genome and then rank each 4-tuple
according to its frequency in descending order. Four 4-tuple sequences (TGTT, TTGT, TTTG, and TTTT) shaded gray are common
to all viruses in the list. The number of 4-tuple sequences shared with the SARS-CoV (in bold) is higher in group 1 coronaviruses
than in other groups. Of interest, in contrast to other known coronaviruses, two 4-tuple sequences (TTTT and TTTG) are less frequent
than their complementary sequences (AAAA and CAAA) in the SARS-CoV genome, suggesting an evolutionary pattern that may
be related to replication mechanism.
(100, 500, and 1,000 bp). We find that the proportions of attributed origins from the three coronavirus
groups are consistent with the results using 300 bp segments.
We also observe that those segments with nonsignificant matches are associated with genes coding for
nonstructural proteins (Nsp) 1, 9, and 10, and for structural envelope (E) and N proteins (Table 1). This
finding is consistent with prior reports based on sequence alignment analysis (Rota et al., 2003; Marra
et al., 2003). However, other sequences including those coding for Nsp 3 and 12 and the S protein show
combined origins from other groups. For example, half of S1 domain of the S protein is partly related to
the MHV, whereas the remaining sequence is related to HCoV-229E.
Finally, we compare the rank order of specific 4-tuple sequences of the SARS genome with other known
coronaviruses (Table 2). Of interest, we find that two sequences—TTTT, and TTTG—are consistently in
the top six ranking “words” in non-SARS coronaviruses but in the 15th and 20th rankings in the SARS
genome, respectively. Further, in contrast to all other known coronaviruses, the frequencies of these two
words in the SARS genome are lower than their complementary sequences (AAAA and CAAA). Whether
this n-tuple “word” usage pattern is an incidental finding or is related to the viral evolutionary mechanism
remains to be determined.
12. 1114 YANG ET AL.
CONCLUSIONS
In summary, our information-based method provides a complementary approach to study the entire
genome as well as individual genes of the SARS coronavirus. The method was originally developed
for studying a wide range of symbolic sequences (Yang et al., 2003a, 2003b), and, therefore, assumes
minimal knowledge about sequence origin. The method is based on a simple assumption, namely, that
genetic sequences from different origins have a preference for certain n-tuples, which they use with higher
frequency. Using this method, we have revisited questions concerning the origin of SARS-CoV. Our findings
indicate the following key results: 1) the clustering of SARS-CoV, HCoV-229E, and PEDV suggests that
SARS-CoV shares a common ancestor with these two viruses which may have exchanged genetic material
at an earlier stage; 2) analysis of the entire SARS genome reveals that up to 30% of sequence cannot
be reliably attributed to any known coronavirus, in agreement with published studies; and 3) contrary to
previous reports (Rota et al., 2003; Marra et al., 2003; Eickmann et al., 2003; Snijder et al., 2003; Stadler
et al., 2003), the remainder of the SARS genome is closely related to group 1 coronaviruses, with smaller
intermixed segments from groups 2 and 3.
ACKNOWLEDGMENTS
We thank I. Henry, J. Healey, J. Mietus, M. Costa, and S. Izumo for valuable discussions. We gratefully
acknowledge support from the NIH/NCRR (P41-RR13622), the NIH/NIA OAIC (P60-AG08812), the
James S. McDonnell Foundation, and the G. Harold and Leila Y. Mathers Charitable Foundation.
REFERENCES
Buonagurio, D.A., Nakada, S., Parvin, J.D., Krystal, M., Palese, P., and Fitch, W.M. 1986. Evolution of human
influenza A viruses over 50 years: Rapid, uniform rate of change in NS gene. Science 232, 980–982.
Campbell, A., Mrazek, J., and Karlin, S. 1999. Genome signature comparisons among prokaryote, plasmid, and
mitochondiral DNA. Proc. Natl. Acad. Sci. USA 96, 9184–9189.
Cann, R.L., Stoneking, M., and Wilson, A.C. 1987. Mitochondrial DNA and human evolution. Nature 325, 31–36.
Chaudhuri, P., and Das, S. 2002. SWORDS: A statistical tool for analysing large DNA sequences. J. Biosci. 27, 1–6.
Drosten, C., Gunther, S., Preiser, W., Van Der, W.S., Brodt, H.R., Becker, S., Rabenau, H., Panning, M., Kolesnikova,
L., Fouchier, R.A., Berger, A., Burguiere, A.M., Cinatl, J., Eickmann, M., Escriou, N., Grywna, K., Kramme, S.,
Manuguerra, J.C., Muller, S., Rickerts, V., Sturmer, M., Vieth, S., Klenk, H.D., Osterhaus, A.D., Schmitz, H., and
Doerr, H.W. 2003. Identification of a novel coronavirus in patients with severe acute respiratory syndrome. N. Engl.
J. Med. 348, 1967–1976.
Eickmann, M., Becker, S., Klenk, H.D., Doerr, H.W., Stadler, K., Censini, S., Guidotti, S., Masignani, V., Scarselli,
M., Mora, M., Donati, C., Han, J.H., Song, H.C., Abrignani, S., Covacci, A., and Rappuoli, R. 2003. Phylogeny of
the SARS coronavirus. Science 302, 1504–1505.
Enserink, M. 2003. Infectious diseases. Clues to the animal origins of SARS. Science 300, 1351.
Enserink, M., and Normile, D. 2003. Infectious diseases. Search for SARS origins stalls. Science 302, 766–767.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39, 783–791.
Felsenstein, J. 1993. PHYLIP Phylogeny Inference Package. (3.5c). Department of Genetics, University of Washington,
Seattle.
Fitch, W.M., and Margoliash, E. 1967. Construction of phylogenetic trees. Science 155, 279–284.
Gammelin, M., Altmuller, A., Reinhardt, U., Mandler, J., Harley, V.R., Hudson, P.J., Fitch, W.M., and Scholtissek, C.
1990. Phylogenetic analysis of nucleoproteins suggests that human influenza A viruses emerged from a 19th-century
avian ancestor. Mol. Biol. Evol. 7, 194–200.
Graur, D., and Li, W.H. 1999. Fundamentals of Molecular Evolution, Sinauer Associates, Boston.
Guan, Y., Zheng, B.J., He, Y.Q., Liu, X.L., Zhuang, Z.X., Cheung, C.L., Luo, S.W., Li, P.H., Zhang, L.J., Guan, Y.J.,
Butt, K.M., Wong, K.L., Chan, K.W., Lim, W., Shortridge, K.F., Yuen, K.Y., Peiris, J.S., and Poon, L.L. 2003.
Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China. Science
302, 276–278.
Hao, B., Qi, J., and Wang, B. 2003. Prokaryotic phylogeny based on complete genomes without sequence alignment.
Mod. Phys. Lett. B 17, 91–94.
13. GENOMIC CLASSIFICATION 1115
Horai, S., Hayasaka, K., Kondo, R., Tsugane, K., and Takahata, N. 1995. Recent African origin of modern humans
revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92, 532–536.
Ingman, M., Kaessmann, H., Paabo, S., and Gyllensten, U. 2000. Mitochondrial genome variation and the origin of
modern humans. Nature 408, 708–713.
Karlin, S., and Burge, C. 1995. Dinucleotide relative abundance extremes: A genomic signature. Trends Genet. 11,
283–290.
Lee, N., Hui, D., Wu, A., Chan, P., Cameron, P., Joynt, G.M., Ahuja, A., Yung, M.Y., Leung, C.B., To, K.F., Lui,
S.F., Szeto, C.C., Chung, S., and Sung, J.J.Y. 2003. A major outbreak of severe acute respiratory syndrome in Hong
Kong. N. Engl. J. Med. 348, 1986.
Levin, S.A., Dushoff, J., and Plotkin, J.B. 2004. Evolution and persistence of influenza A and other diseases. Math.
Biosci. 188, 17–28.
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., and Zhang, H. 2001. An information-based sequence distance
and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154.
Marra, M.A., Jones, S.J.M., Astell, C.R., Holt, R.A., Brooks-Wilson, A., Butterfield, Y.S.N., Khattra, J., Asano, J.K.,
Barber, S.A., Chan, S.Y., Cloutier, A., Coughlin, S.M., Freeman, D., Girn, N., Griffith, O.L., Leach, S.R., Mayo,
M., McDonald, H., Montgomery, S.B., Pandoh, P.K., Petrescu, A.S., Robertson, A.G., Schein, J.E., Siddiqui, A.,
Smailus, D.E., Stott, J.M., Yang, G.S., Plummer, F., Andonov, A., Artsob, H., Bastien, N., Bernard, K., Booth,
T.F., Bowness, D., Drebot, M., Fernando, L., Flick, R., Garbutt, M., Gray, M., Grolla, A., Jones, S., Feldmann, H.,
Meyers, A., Kabani, A., Li, Y., Normand, S., Stroher, U., Tipples, G.A., Tyler, S., Vogrig, R., Ward, D., Watson,
B., Brunham, R.C., Krajden, M., Petric, M., Skowronski, D.M., Upton, C., and Roper, R.L. 2003. The genome
sequence of the SARS-associated coronavirus. Science 300, 1399–1404.
Mishmar, D., Ruiz-Pesini, E., Golik, P., Macaulay, V., Clark, A.G., Hosseini, S., Brandon, M., Easley, K., Chen,
E., Brown, M.D., Sukernik, R.I., Olckers, A., and Wallace, D.C. 2003. Natural selection shaped regional mtDNA
variation in humans. Proc. Natl. Acad. Sci. USA 100, 171–176.
Peiris, J.S., Lai, S.T., Poon, L.L., Guan, Y., Yam, L.Y., Lim, W., Nicholls, J., Yee, W.K., Yan, W.W., Cheung, M.T.,
Cheng, V.C., Chan, K.H., Tsang, D.N., Yung, R.W., Ng, T.K., and Yuen, K.Y. 2003. Coronavirus as a possible cause
of severe acute respiratory syndrome. Lancet 361, 1319–1325.
Poutanen, S.M., Low, D.E., Henry, B., Finkelstein, S., Rose, D., Green, K., Tellier, R., Draker, R., Adachi, D.,
Ayers, M., Chan, A.K., Skowronski, D.M., Salit, I., Simor, A.E., Slutsky, A.S., Doyle, P.W., Krajden, M., Petric,
M., Brunham, R.C., McGeer, A.J., and the National Microbiology Laboratory. 2003. Identification of severe acute
respiratory syndrome in Canada. N. Engl. J. Med. 348, 1995.
Qi, J., Luo, H., and Hao, B. 2004. CVTree: A phylogenetic tree reconstruction tool based on whole genomes. Nucl.
Acids Res. 32, W45–W47.
Qi, J., Wang, B., and Hao, B. 2004. Whole genome prokaryote phylogeny without sequence alignment: A K-string
composition approach. J. Mol. Evol 58, 1–11.
Rota, P.A., Oberste, M.S., Monroe, S.S., Nix, W.A., Campagnoli, R., Icenogle, J.P., Penaranda, S., Bankamp, B., Maher,
K., Chen, M.H., Tong, S., Tamin, A., Lowe, L., Frace, M., DeRisi, J.L., Chen, Q., Wang, D., Erdman, D.D., Peret,
T.C.T., Burns, C., Ksiazek, T.G., Rollin, P.E., Sanchez, A., Liffick, S., Holloway, B., Limor, J., McCaustland, K.,
Olsen-Rassmussen, M., Fouchier, R., Gunther, S., Osterhaus, A.D.M.E., Drosten, C., Pallansch, M.A., Anderson, L.J.,
and Bellini, W.J. 2003. Characterization of a novel coronavirus associated with severe acute respiratory syndrome.
Science 300, 1394–1399.
Ruvolo, M., Zehr, S., Vondornum, M., Pan, D., Chang, B., and Lin, J. 1993. Mitochondrial COII sequences and modern
human origins. Mol. Biol. Evol. 10, 1115–1135.
Saitou, N., and Nei, M. 1987. The neighbor-joining method—a new method for reconstructing phylogenetic trees.
Mol. Biol. Evol. 4, 406–425.
Shannon, C.E. 1948. A mathematical theory of communication. Bell Labs. Tech. 27, 379–423.
Snijder, E.J., Bredenbeek, P.J., Dobbe, J.C., Thiel, V., Ziebuhr, J., Poon, L.L., Guan, Y., Rozanov, M., Spaan, W.J.,
and Gorbalenya, A.E. 2003. Unique and conserved features of genome and proteome of SARS-coronavirus, an early
split-off from the coronavirus group 2 lineage. J. Mol. Biol. 331, 991–1004.
Stadler, K., Masignani, V., Eickmann, M., Becker, S., Abrignani, S., Klenk, H.D., and Rappuoli, R. 2003. SARS—
beginning to understand a new virus. Nat. Rev. Microbiol. 1, 209–218.
Stuart, G.W., Moffett, K., and Baker, S. 2002. Integrated gene and species phylogenies from unaligned whole genome
protein sequences. Bioinformatics 18, 100–108.
Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and Wilson, A.C. 1991. African populations and the evolution
of human mitochondrial DNA. Science 253, 1503–1507.
Vinga, S., and Almeida, J. 2003. Alignment-free sequence comparison—a review. Bioinformatics 19, 513–523.
Vogel, G. 2003. SARS outbreak. Flood of sequence data yields clues but few answers. Science 300, 1062–1063.
Yang, A.C., Hseu, S.S., Yien, H.W., Goldberger, A.L., and Peng, C.K. 2003a. Linguistic analysis of the human heartbeat
using frequency and rank order statistics. Phys. Rev. Lett. 90, 108103.
14. 1116 YANG ET AL.
Yang, A.C., Hseu, S.S., Yien, H.W., Goldberger, A.L., and Peng, C.K. 2004. Reply: Comment on linguistic analysis
of the human heartbeat using frequency and rank order statistics. Phys. Rev. Lett. 92, 109802.
Yang, A.C., Peng, C.-K., Yien, H.-W., and Goldberger, A.L. 2003b. Information categorization approach to literary
authorship disputes. Physica A 329, 473–483.
Address correspondence to:
C.-K. Peng
Suite KB-28
330 Brookline Ave.
Boston, MA 02215
E-mail: peng@physionet.org
APPENDIX
LIST OF ACCESSION NUMBERS
Coronavirus database
Avian Infectious Bronchitis Virus: NC_001451
Bovine Coronavirus: NC_003045
Canine Coronavirus: AF502583
Feline Infectious Peritonitis Virus: AB086904
Human Coronavirus 229E: NC_002645
Human Coronavirus OC43: M93390
Murine Hepatitis Virus: NC_001846
Porcine Epidemic Diarrhea Virus: NC_003436
Porcine Hemagglutinating Encephalomyelitis Virus: AY078417
Porcine Respiratory Coronavirus: Z24675
Rat Sialodacryoadenitis Virus: AF207551
SARS-CoV: NC_00471
SARS-like virus: AY304486
Transmissible Gastroenteritis Virus: NC_002306
Influenza database
33/WSN (H1N1): M12597
34/Puerto Rico (H1N1): J02150
42/Bellamy (H1N1): M12596
47/Fort Monmouth (H1N1): K00577
50/Fort Warren (H1N1): K0057
57/Denver (H1N1): M12592
60/Ann Arbor (H2N2): M12591
68/Berkeley (H2N2): M12590
72/Victoria (H3N2): AY210316
77/Alaska (H3N2): K01332
77/USSR (H1N1): K00578
80/Maryland (H1N1): M12595
84/Houston (H1N1): M12594
85/Houston (H1N1): M12593
85/Houston (H3N2): M17699
97/Hong Kong (H3N2): AF256183
Mitochondrial DNA database
Gorilla gorilla (Gorilla): X93347
Homo sapiens (Human): AF346963-AF347015
Pan troglodytes (Chimpanzee): X93335