The document discusses highly cited papers in bioinformatics according to Nature's 2014 ranking. It finds that the top papers can be grouped into three major areas: BLAST, Clustal, and phylogenetics. Papers related to Clustal, such as ClustalW and ClustalX, describe programs for multiple sequence alignment that are widely used. Papers related to phylogenetics, such as the neighbor-joining method paper, describe fundamental methods for reconstructing phylogenetic trees to study evolutionary relationships between species.
Open Science and Ecological Meta-analysis - Antica Culina
This document discusses using open data and meta-analysis to help with ecological and evolutionary synthesis. It describes how data from various sources like published studies, unpublished datasets, and metadata can be gathered and synthesized. Challenges include incomplete or unavailable data as well as differences in data collection and reporting. Case studies on topics like genetic change rates, divorce in birds, microbe communities, and soil carbon stocks demonstrate searching for relevant open data, screening datasets for usability, and analyzing data to answer research questions. The document advocates for open science to improve data sharing and the robustness of synthesis results.
The document discusses the need for quantitative reasoning in ecology to address important questions that affect human well-being and raise ethical issues. It notes that questions in ecology involve complex interactions over different spatial and temporal scales. Developing quantitative models with measurable parameters can help reduce confusion and advance understanding in evolutionary theory and ecology.
This document provides an overview of phylogenetic analysis concepts and methods. It begins with an introduction to phylogenetic trees and their components. It then covers two main approaches to building trees - distance methods such as neighbor-joining, and optimality criteria such as maximum parsimony. Key steps in both approaches, such as multiple sequence alignment and tree-building algorithms, are described. The document concludes by discussing bootstrapping as a tool for evaluating tree reliability and by surveying available phylogenetics programs.
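The neighbor-joining step at the heart of the distance approach can be sketched in a few lines. The following is an illustrative single joining step, not any particular program's implementation: compute the Q-matrix from a distance matrix and select the pair of taxa that minimises it. The function name and the five-taxon example matrix are chosen for illustration only.

```python
def nj_join_pair(dist, labels):
    """One neighbor-joining step: compute the Q-matrix from a distance
    matrix and return the pair of taxa with the lowest Q value, i.e. the
    next pair to be joined into a new internal node."""
    n = len(labels)
    row_sum = [sum(row) for row in dist]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
            if q < best_q:
                best_q, best = q, (labels[i], labels[j])
    return best, best_q

# Illustrative 5-taxon additive distance matrix:
labels = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_join_pair(dist, labels)
print(pair, q)  # ('a', 'b') -50
```

A full neighbor-joining run repeats this step, replacing the joined pair by a new node and shrinking the matrix until a tree remains.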
This document summarizes a presentation on scientometric approaches to classification. It discusses:
- Bibliographic databases like Web of Science and Scopus and their coverage.
- Types of classification systems for scientific literature including mono-disciplinary vs multidisciplinary and journal-level vs publication-level classifications.
- The CWTS publication-level classification system which uses a fully algorithmic approach to cluster over 21 million publications into a hierarchical structure of disciplines, fields, and subfields.
- Applications of the CWTS classification system including field normalization, field delineation, research strength analysis, and identification of interdisciplinary areas.
- Studies that have evaluated aspects of the quality and accuracy of classification systems.
Machine Learning for Understanding Biomedical Publications - Grigorios Tsoumakas
This document discusses machine learning techniques for understanding biomedical publications. It describes multi-label classification approaches for semantic indexing of biomedical literature and modality classification of figures. It also discusses ensemble methods, multi-label learning, and applications to tasks like article screening in systematic reviews and PICO sentence identification.
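Multi-label classification of the kind described here is often reduced to independent binary problems, a transformation known as binary relevance. A minimal sketch of that decomposition follows; the documents and tags are hypothetical, not taken from the presentation.

```python
def binary_relevance(X, Y, labels):
    """Decompose a multi-label dataset into one binary dataset per label:
    each label keeps the same features X and gets a 0/1 target vector.
    Any binary classifier can then be trained per label independently."""
    return {
        lab: (X, [1 if lab in y else 0 for y in Y])
        for lab in labels
    }

# Hypothetical documents with their (multi-)label sets:
docs = ["gene expression study", "clinical trial report", "gene therapy trial"]
tags = [{"genetics"}, {"clinical"}, {"genetics", "clinical"}]
per_label = binary_relevance(docs, tags, ["genetics", "clinical"])
print(per_label["genetics"][1])  # [1, 0, 1]
```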
Presented at Evolution 2013, June 24; describes an approach to teaching population genetics at the upper-undergraduate/beginning-graduate level, using simulations based in R and incorporating available large genomic data sets.
1. Phylogenetic trees show the evolutionary relationships among species or other taxonomic groups based on similarities and differences in physical or genetic characteristics.
2. Early representations of phylogenetic trees date back to 1840, but Charles Darwin popularized the concept of an evolutionary "tree" in his 1859 book On the Origin of Species.
3. There are two main types of phylogenetic trees - rooted trees which make assumptions about a common ancestor, and unrooted trees which do not.
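One reason the rooted/unrooted distinction matters computationally: the number of distinct binary tree topologies grows super-exponentially with the number of taxa, with (2n-5)!! unrooted and (2n-3)!! rooted trees for n labelled leaves. A small sketch (function names are illustrative):

```python
def double_factorial(m):
    """m!! = m * (m-2) * (m-4) * ... down to 1 or 2."""
    out = 1
    while m > 1:
        out *= m
        m -= 2
    return out

def unrooted_topologies(n):
    """Number of unrooted binary trees on n labelled leaves (n >= 3)."""
    return double_factorial(2 * n - 5)

def rooted_topologies(n):
    """Number of rooted binary trees on n labelled leaves (n >= 2)."""
    return double_factorial(2 * n - 3)

print(unrooted_topologies(4), rooted_topologies(4))  # 3 15
print(unrooted_topologies(10))  # 2027025
```

Even 10 taxa admit over two million unrooted topologies, which is why exhaustive search is abandoned early in favour of heuristics.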
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland) - Marcel Swart
This document summarizes the ECOSTBio CM1305 Action, which aims to establish a European network to study spin states of transition metal complexes. It will set up a SPINSTATE database, develop new computational methods, and facilitate collaboration between experimental and theoretical groups. The Action has 4 working groups focused on the database, enzymatic spin states, spin crossover materials, and biomimetic spin states. It involves 75 parties from 19 countries and over 75 participants in the first year, with equal representation of experimentalists and theoreticians. Future plans include populating the database, surveying spin states in enzymes and spin crossover materials, and synthesizing complexes to study through spectroscopy and reactivity experiments.
Determining cognitive distance between publication portfolios of evaluators a... - Jakaria Rahman
When an expert panel evaluates research groups in a discipline-specific research evaluation, it is an open question how one can determine the extent to which the panel members are able to evaluate the research groups. The expertise of the panel members should be well matched with the research groups to ensure the quality and trustworthiness of the evaluation. Panel members who are credible experts in the field are most likely to provide valuable, relevant recommendations and suggestions that should lead to improved research quality. Given the absence of methods to determine the cognitive distance between evaluators and evaluees, this doctoral research develops informetric methods for expert panel composition. It contributes to the literature by proposing six informetric approaches to measure the match between evaluators and evaluees in a discipline-specific research evaluation, using their publications as a representation of their expertise.
The thesis is available at http://hdl.handle.net/10067/1481100151162165141
CRI - Teaching Through Research - John Jungck - BioQuest - LeadershipProgram
This document provides an overview of quantitative reasoning approaches in biology education. It discusses several examples of quantitative modeling concepts taught in biology, including population size modeling, buffer preparation calculations, and serial dilution experiments. The document advocates for more interdisciplinary teaching that combines biology, mathematics, and quantitative skills. It describes several digital tools and modeling case studies that can be used to illustrate quantitative concepts for students. Overall, the document promotes integrating quantitative and computational approaches into biology education to better prepare students.
GB20 Nodes Training Course 2013, module 5B: Latest trends in data analysis - Dag Endresen
This document discusses the latest trends in data analysis and species distribution modeling. It introduces concepts like presence/absence data and presence-only data, and methods for analyzing each, such as generalized linear models and maximum entropy (Maxent). Common climate-envelope models like BIOCLIM, which define climatic ranges from occurrence records, are presented. Challenges with presence-only data, such as sample bias and variable detectability, are noted. The document recommends choosing analysis methods based on the specific data quality issues.
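The BIOCLIM-style climate envelope mentioned above can be illustrated compactly: fit per-variable min/max ranges from the climate values at known occurrence sites, then classify a new site as suitable only if it falls inside every range. The variables and values below are hypothetical; real workflows use many bioclimatic layers.

```python
def bioclim_envelope(occurrences):
    """Fit a rectilinear climate envelope: per-variable (min, max) over
    the climate values observed at occurrence sites."""
    n_vars = len(occurrences[0])
    return [(min(o[v] for o in occurrences), max(o[v] for o in occurrences))
            for v in range(n_vars)]

def inside_envelope(site, envelope):
    """A site is predicted suitable if every variable is within range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(site, envelope))

# Hypothetical (mean temperature degC, annual rainfall mm) at occurrences:
occ = [(18.0, 900.0), (21.5, 1100.0), (19.2, 1000.0)]
env = bioclim_envelope(occ)
print(inside_envelope((20.0, 950.0), env))  # True
print(inside_envelope((25.0, 950.0), env))  # False
```

The envelope's hard cut-offs are exactly where sample bias in presence-only data bites: a missing occurrence record can shrink the fitted range.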
This document summarizes Catriona MacCallum's presentation on data publishing at PLOS. The key points are:
1) PLOS requires authors to make all underlying data openly available without restriction, with rare exceptions. Authors must provide a Data Availability Statement describing compliance.
2) Over 47,000 PLOS papers have included a data statement. Most data is found within submission files or repositories like Dryad and Figshare. PLOS checks data accessibility and ensures anonymity of clinical datasets.
3) PLOS supports initiatives like CRediT for attributing research contributions and data citation principles for giving credit to data producers. PLOS is also involved in projects beyond traditional publishing, like preprints and experimental ...
Delroy Cameron's Dissertation Defense: A Context-Driven Subgraph Model for L... - Amit Sheth
Literature-Based Discovery (LBD) refers to the process of uncovering hidden connections that are implicit in scientific literature. Numerous hypotheses have been generated from scientific literature, which influenced innovations in diagnosis, treatment, prevention and overall public health. However, much of the existing research on discovering hidden connections among concepts has used distributional statistics and graph-theoretic measures to capture implicit associations. Such metrics do not explicitly capture the semantics of hidden connections. ...
While effective in some situations, the practice of relying on domain expertise, structured background knowledge and heuristics to complement distributional and graph-theoretic approaches has serious limitations. ...
This dissertation proposes an innovative context-driven, automatic subgraph creation method for finding hidden and complex associations among concepts, along multiple thematic dimensions. It outlines definitions for context and shared context, based on implicit and explicit (or formal) semantics, which compensate for deficiencies in statistical and graph-based metrics. It also eliminates the need for heuristics a priori. An evidence-based evaluation of the proposed framework showed that 8 out of 9 existing scientific discoveries could be recovered using this approach. Additionally, insights into the meaning of associations could be obtained using provenance provided by the system. In a statistical evaluation to determine the interestingness of the generated subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE, on average. These results suggest that leveraging implicit and explicit context, as defined in this dissertation, is an advancement of the state-of-the-art in LBD research.
Ph.D. Committee: Drs. Amit Sheth (Advisor), TK Prasad, Michael Raymer,
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature. International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
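The hidden-connection idea underlying LBD is commonly explained via Swanson's ABC model: terms A and C never co-occur in the literature, but both co-occur with intermediate B-terms that suggest a hypothesis linking them. The sketch below is a toy illustration of that linking step, not the dissertation's actual data or subgraph method; the co-occurrence graph is hypothetical.

```python
def abc_candidates(cooccurs, a_term, c_term=None):
    """Swanson-style ABC linking over a co-occurrence graph (dict of sets).
    Open discovery (c_term=None): all B-terms linked to A.
    Closed discovery: B-terms linked to both A and C."""
    b_terms = set(cooccurs.get(a_term, ()))
    if c_term is not None:
        b_terms &= set(cooccurs.get(c_term, ()))
    return sorted(b_terms)

# Toy co-occurrence graph (illustrative literature links):
graph = {
    "fish oil": {"blood viscosity", "platelet aggregation"},
    "Raynaud's disease": {"blood viscosity", "vasoconstriction"},
}
print(abc_candidates(graph, "fish oil", "Raynaud's disease"))
# ['blood viscosity']
```

The dissertation's contribution is precisely to go beyond such flat co-occurrence counting by attaching semantic context to the connecting paths.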
This study aimed to delineate the research area of nanocellulose by developing a procedure to retrieve relevant publications. The researchers:
1) Used keyword searches to identify an initial set of nanocellulose publications and located them within a publication classification system, which grouped publications into 428 research areas.
2) Analyzed the relevance of peripheral research areas and refined the initial publication set using text mining.
3) Selected the most relevant research areas based on concentration of nanocellulose publications.
This delineation procedure identified 12 main nanocellulose research topics and 2 nuclei areas, mapping the local and global structure of nanocellulose research.
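The concentration criterion in step 3 can be sketched as a simple ratio: the share of each research area's publications that match the topic's retrieval query, with the highest-concentration areas selected as most relevant. The counts below are hypothetical.

```python
def area_concentration(topic_counts, area_sizes):
    """Concentration of a topic inside each research area: the fraction of
    the area's publications that were retrieved by the topic query."""
    return {area: topic_counts.get(area, 0) / size
            for area, size in area_sizes.items()}

# Hypothetical counts: topic hits vs. total publications per area
sizes = {"cellulose chemistry": 2000, "paper physics": 5000, "catalysis": 8000}
hits = {"cellulose chemistry": 400, "paper physics": 250}
conc = area_concentration(hits, sizes)
print(max(conc, key=conc.get))  # cellulose chemistry
```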
Using drone data in modelling: A case study applying the BCCVL - ARDC
1. The document discusses using drone data for species distribution modelling, with a case study presented using the Biodiversity & Climate Change Virtual Laboratory (BCCVL).
2. It describes how drones can provide high resolution spatial data through images, but species data and environmental variables still need to be extracted from the images through digital image processing and analysis.
3. The presentation then demonstrates how to run a species distribution model within the BCCVL platform using species occurrence data and environmental layers to model suitable habitat for a species.
This document summarizes a presentation on genomics and big data in precision medicine. It discusses how next generation sequencing is generating massive amounts of multi-omics data from the genome, epigenome, transcriptome, proteome and metagenome. It describes some of the algorithms and databases used to analyze this big genomic and biological data, including de Bruijn graph algorithms and databases like NCBI, OMIM, and PANTHER. It also discusses some of the challenges in analyzing such large and complex biological data using computational methods.
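The de Bruijn graph algorithms mentioned above rest on a simple construction: every k-mer in a read becomes a directed edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and assembly then amounts to walking paths through that graph. A minimal sketch of the construction (function name is illustrative):

```python
def de_bruijn_edges(reads, k):
    """Build de Bruijn graph edges from sequencing reads: each k-mer
    contributes an edge (prefix, suffix) between its two (k-1)-mers."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges

print(de_bruijn_edges(["ACGTC"], 3))
# [('AC', 'CG'), ('CG', 'GT'), ('GT', 'TC')]
```

Production assemblers add error correction, coverage-based edge weights, and compacted graph representations on top of this core idea.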
The document describes a study that used an integrative modeling approach and data from over 1,000 grassland plots worldwide to examine relationships between plant productivity, species richness, and various environmental factors. Key findings include:
1) Species richness was negatively associated with accumulated biomass, supporting theories of competitive dominance at high productivity. However, the effect was linear across all biomass levels rather than increasing nonlinearly.
2) Species richness had a strong positive effect on productivity, in contrast to expectations from classical models. The effect was consistent and did not level off at high richness.
3) Macroclimate and soil variables were important independent drivers of both richness and productivity, with their effects differing, supporting their semi-independent nature.
Levine, Yanai et al: Optimizing environmental monitoring designs - questRCN
This document summarizes research analyzing environmental monitoring designs using uncertainty quantification. It presents several case studies analyzing different monitoring questions and datasets. The studies evaluate how reducing sampling intensity impacts the ability to detect trends over time. The key finding is that uncertainty analysis provides an objective way to evaluate monitoring plans and optimize sampling efforts. Reducing sampling too much can limit the ability to detect important changes in the environment. The document recommends providing enough information to allow others to represent the uncertainty in study results.
Introduction to 16S rRNA gene multivariate analysis - Josh Neufeld
Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
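Multivariate 16S analyses typically start from a sample-by-sample dissimilarity matrix computed over taxon (OTU) abundances; Bray-Curtis is a common choice. A minimal sketch with hypothetical OTU counts:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors:
    0 = identical composition, 1 = no shared taxa."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

# Hypothetical OTU counts for two samples:
s1 = [10, 0, 5, 5]
s2 = [6, 4, 5, 5]
print(bray_curtis(s1, s2))  # 0.2
```

Ordination methods such as NMDS or PCoA then operate on the full pairwise matrix of these dissimilarities.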
Franz & Sterner TDWG 2016: new power balance needed for trustworthy biodiversity... - taxonbytes
View a video recording here: https://vimeo.com/195024485
Franz & Sterner @ #TDWG16 - "A new power balance is needed for trustworthy biodiversity data". Talk # 1134, Friday, December 09, 2016, 11:30 am. Session Contributed Papers 05: Data Gaps, Trust, Knowledge Acquisition. See https://mbgserv18.mobot.org/ocs/index.php/tdwg/tdwg2016/schedConf/program
The presentation provides an overview and discusses the significance of the TERN long-term ecological research network. It was part of the Workshop on Approaches to Terrestrial Ecosystem Data Management: from collection to synthesis and beyond, held on 9 March 2016 at the University of Queensland.
DataCite is a global consortium that provides persistent identifiers (DOIs) for scientific data to make it easily discoverable and citable. It aims to put datasets on the same level as research articles. DataCite has over 1.7 million DOIs registered and many member organizations worldwide. It develops standards and infrastructure like its metadata schema and search portal to help data archives and researchers globally.
- Drone data and big spatial data can be used as input for species distribution modelling, but requires additional processing to extract useful species and environmental data from images.
- Digital image processing techniques can be used to obtain information on vegetation types and indices from drone images.
- Running a species distribution model in the Biodiversity and Climate Change Virtual Laboratory involves selecting species occurrence data, environmental layers, a modelling algorithm, and evaluating the results. Climate change projections can then be run to predict impacts on suitable habitat into the future.
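One common vegetation index extracted in such image-processing steps is NDVI, computed per pixel from the near-infrared and red bands. A one-pixel sketch with hypothetical reflectance values:

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel:
    (NIR - Red) / (NIR + Red), in [-1, 1]; higher values indicate
    denser green vegetation."""
    s = nir + red
    return (nir - red) / s if s else 0.0

# Hypothetical reflectance values from a drone image pixel:
print(round(ndvi(0.6, 0.2), 3))  # 0.5
```

In practice this is vectorised over whole raster bands, and the resulting index layer can serve as an environmental input to the distribution model.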
Ontologies for biodiversity informatics, UiO DSC June 2023 - Dag Endresen
GBIF Norway was invited to the UiO Digital Scholar Centre Data (DSC) Managers Network meeting on 2023-06-08 to present how we use biodiversity ontologies. https://www.gbif.no/news/2023/biodiversity-ontologies.html
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th... - GigaScience, BGI Hong Kong
Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.
Polymerase chain reaction of the system managers - saqlainsial
This document presents a classification of the phylum cyanobacteria. It discusses the major orders of cyanobacteria, including Chroococcales, Pleurocapsales, Oscillatoriales, Nostocales, Stigonematales, and Gloeobacterales. Each order is characterized based on traits like cell shape, reproduction method, presence of heterocysts and akinetes, and habitat. The classification aims to group cyanobacteria based on these distinguishing morphological and physiological features.
Ramakanth Kavuluru (UKY), Thomas C. Rindflesch (NLM) and Varun Bhagwan (Yahoo! Labs)
Relevant Publications (more at: http://knoesis.wright.edu/students/delroy/)
D. Cameron, R. Kavuluru, T. C. Rindflesch, O. Bodenreider, A. P. Sheth, K. Thirunarayan. Leveraging Distributional Semantics for Domain Agnostic Literature-Based Discovery (under preparation)
D. Cameron, O. Bodenreider, H. Yalamanchili, T. Danh, S. Vallabhaneni, K. Thirunarayan, A. P. Sheth, T. C. Rindflesch. A Graph-based Recovery and Decomposition of Swanson’s Hypothesis using Semantic Predications. Journal of Biomedical Informatics (JBI13), 46(2): 238–251, 2013
D. Cameron, R. Kavuluru, O. Bodenreider, P. N. Mendes, A. P. Sheth, K. Thirunarayan. Semantic Predications for Complex Information Needs in Biomedical Literature International Bioinformatics and Biomedical Conference (BIBM11), pp. 512–519, 2011 (acceptance rate=19.4%)
D. Cameron, P. N. Mendes, A. P. Sheth, V. Chan. Semantics-empowered Text Exploration for Knowledge Discovery. ACM Southeast Conference (ACMSE10), 14, 2010
This study aimed to delineate the research area of nanocellulose by developing a procedure to retrieve relevant publications. The researchers:
1) Used keyword searches to identify an initial set of nanocellulose publications and located them within a publication classification system, which grouped publications into 428 research areas.
2) Analyzed the relevance of peripheral research areas and refined the initial publication set using text mining.
3) Selected the most relevant research areas based on concentration of nanocellulose publications.
This delineation procedure identified 12 main nanocellulose research topics and 2 nuclei areas, mapping the local and global structure of nanocellulose research.
Using drone data in modelling:A case study applying the BCCVLARDC
1. The document discusses using drone data for species distribution modelling, with a case study presented using the Biodiversity & Climate Change Virtual Laboratory (BCCVL).
2. It describes how drones can provide high resolution spatial data through images, but species data and environmental variables still need to be extracted from the images through digital image processing and analysis.
3. The presentation then demonstrates how to run a species distribution model within the BCCVL platform using species occurrence data and environmental layers to model suitable habitat for a species.
This document summarizes a presentation on genomics and big data in precision medicine. It discusses how next generation sequencing is generating massive amounts of multi-omics data from the genome, epigenome, transcriptome, proteome and metagenome. It describes some of the algorithms and databases used to analyze this big genomic and biological data, including de Bruijn graph algorithms and databases like NCBI, OMIM, and PANTHER. It also discusses some of the challenges in analyzing such large and complex biological data using computational methods.
The document describes a study that used an integrative modeling approach and data from over 1,000 grassland plots worldwide to examine relationships between plant productivity, species richness, and various environmental factors. Key findings include:
1) Species richness was negatively associated with accumulated biomass, supporting theories of competitive dominance at high productivity. However, the effect was linear across all biomass levels rather than increasing nonlinearly.
2) Species richness had a strong positive effect on productivity, in contrast to expectations from classical models. The effect was consistent and did not level off at high richness.
3) Macroclimate and soil variables were important independent drivers of both richness and productivity, with their effects differing, supporting their semi-independent nature
Levine, Yanai et al: Optimizing environmental monitoring designsquestRCN
This document summarizes research analyzing environmental monitoring designs using uncertainty quantification. It presents several case studies analyzing different monitoring questions and datasets. The studies evaluate how reducing sampling intensity impacts the ability to detect trends over time. The key finding is that uncertainty analysis provides an objective way to evaluate monitoring plans and optimize sampling efforts. Reducing sampling too much can limit the ability to detect important changes in the environment. The document recommends providing enough information to allow others to represent the uncertainty in study results.
Introduction to 16S rRNA gene multivariate analysisJosh Neufeld
Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
Franz sterner tdwg 2016 new power balance needed for trustworthy biodiversity...taxonbytes
View a video recording here: https://vimeo.com/195024485
Franz & Sterner @ #TDWG16 - "A new power balance is needed for trustworthy biodiversity data". Talk # 1134, Friday, December 09, 2016, 11:30 am. Session Contributed Papers 05: Data Gaps, Trust, Knowledge Acquisition. See https://mbgserv18.mobot.org/ocs/index.php/tdwg/tdwg2016/schedConf/program
The presentation provides overview and significance of the TERN long term ecological research network. The presentation was part of the Workshop on Approaches to Terrestrial Ecosystem Data Management : from collection to synthesis and beyond which was held on 9th of March 2016 in University of Queensland.
DataCite is a global consortium that provides persistent identifiers (DOIs) for scientific data to make it easily discoverable and citable. It aims to put datasets on the same level as research articles. DataCite has over 1.7 million DOIs registered and many member organizations worldwide. It develops standards and infrastructure like its metadata schema and search portal to help data archives and researchers globally.
- Drone data and big spatial data can be used as input for species distribution modelling, but requires additional processing to extract useful species and environmental data from images.
- Digital image processing techniques can be used to obtain information on vegetation types and indices from drone images.
- Running a species distribution model in the Biodiversity and Climate Change Virtual Laboratory involves selecting species occurrence data, environmental layers, a modelling algorithm, and evaluating the results. Climate change projections can then be run to predict impacts on suitable habitat into the future.
Ontologies for biodiversity informatics, UiO DSC June 2023Dag Endresen
GBIF Norway was invited to the UiO Digital Scholar Centre Data (DSC) Managers Network meeting on 2023-06-08 to present how we use biodiversity ontologies. https://www.gbif.no/news/2023/biodiversity-ontologies.html
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...GigaScience, BGI Hong Kong
Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.
Similar to Natures Top 100 Papers - Phylogenetic Tree - ClustalW.pptx (20)
Polymerase chain reaction of the system managersaqlainsial
Science you know that I have seen you somewhere in between 🙂 a few days in advance for your company and I will pay for the biology teacher in Pakistan just for
This document presents a classification of the phylum cyanobacteria. It discusses the major orders of cyanobacteria, including Chroococcales, Pleurocapsales, Oscillatoriales, Nostocales, Stigonematales, and Gloeobacterales. Each order is characterized based on traits like cell shape, reproduction method, presence of heterocysts and akinetes, and habitat. The classification aims to group cyanobacteria based on these distinguishing morphological and physiological features.
- Cyanobacteria is a phylum of bacteria that obtains its energy through photosynthesis. It is classified into several orders based on morphological and physiological characteristics.
- The main orders discussed are Chroococcales, Pleurocapsales, Oscillatoriales, Nostocales, Stigonematales, and Gloeobacterales. Each order contains different genera of cyanobacteria that share distinguishing traits like cell shape, reproduction method, and habitat.
- Examples of important cyanobacteria genera mentioned for different orders include Aphanocapsa, Chroococcidiopsis, Pleurocapsa, Phormidium, Anabaena, Nostoc, and Gloeobacter.
This document discusses metal to ligand charge transfer in coordination complexes. It provides examples of complexes that exhibit this type of charge transfer, such as [Cr(NH3)6]3+ and [Fe(CO)3(bipy)]. When charge transfer occurs, the metal is oxidized and the ligand is reduced. The document also discusses the nephelauxetic effect, which refers to a decrease in the Racah interelectronic repulsion parameter B that occurs when a transition metal ion forms a complex. This effect results from an expansion of the d electron charge cloud during complexation.
The document discusses ligand to metal charge transfer (LMCT) in octahedral and tetrahedral complexes. It explains that LMCT occurs when electrons are transferred from ligand orbitals to empty metal orbitals. For octahedral complexes, there are four types of LMCT transitions involving the t2g and eg orbitals. Examples of complexes exhibiting LMCT include [CrCl(NH3)5]2+ and [CoX(NH3)5]2+. For tetrahedral complexes like MnO4-, the four LMCT transitions involve ligand t1 and t2 orbitals transferring to empty metal e and t2* orbitals. The MnO4- complex shows all four transitions in its UV-Vis spectrum
This document discusses different types of biofuels including vegetable oils, bioethanol, biodiesel, biogas, and biobutanol. It provides examples of feedstocks used to produce each type of biofuel and how they are made. The advantages of biofuels are reducing greenhouse gas emissions, being less toxic and biodegradable than fossil fuels. However, disadvantages include negative environmental impacts such as loss of natural areas, water pollution, and higher food prices.
The document describes the polymerase chain reaction (PCR) technique. It explains that PCR amplifies DNA sequences by using DNA polymerase to copy the template DNA. The key steps of PCR (denaturation, annealing, and elongation) are described. PCR has various applications in medicine, forensics, and other fields due to its ability to amplify specific DNA regions.
Algae resource potemtial and commercial utility.pptxsaqlainsial
The document discusses the potential of algae as a resource and its commercial uses. It describes how algae can be used as a food source for humans and livestock due to their protein, carbohydrate, and nutrient content. It also explains that algae fix nitrogen in soil, can be used as green fertilizer, and help treat sewage water. Additionally, it outlines how algae pigments like chlorophyll, carotenoids, and phycobilins have various industrial applications in food coloring, supplements, and research tools. Overall, the document highlights the commercial potential of algae across multiple industries such as agriculture, aquaculture, pharmaceuticals, and cosmetics.
The document discusses environmental problems associated with fossil fuel use. It notes that burning coal and oil produces air pollution like smog and acid rain through emissions of sulfur dioxide, nitrogen oxides, and carbon dioxide. Coal mining can also cause environmental damage through production of waste and disruption of land. While natural gas has advantages like being cleaner and easier to transport, increasing fossil fuel efficiency and developing non-fossil fuel sources are needed to reduce their use and alleviate pollution problems. However, tackling these issues faces obstacles such as high costs, inertia to abandon existing infrastructure, and costs being unevenly distributed.
The document discusses gene regulation and operons. It covers:
1. Operons are groups of genes transcribed together in prokaryotes to control important processes. The lac operon in E. coli contains genes for lactose metabolism.
2. The lac operon is regulated by a repressor protein that binds to the operator site and blocks transcription unless the inducer allolactose is present.
3. Eukaryotic gene regulation has multiple levels including chromatin remodeling, histone modification through methylation and acetylation, and transcription factor regulation.
IMPACT OF MUSIC ON PLANT BIOCHEMISTRY.pptxsaqlainsial
Plants have been shown to respond to different types of music and sound waves. Experiments found that plants exposed to music grew more quickly and had increased biomass and crop yields, with the greatest effects seen from classical violin music. However, plants exposed to loud rock music exhibited abnormal growth and damage. While plants do not consciously perceive music, the vibrations from sound waves may stimulate cellular movement in plants and influence their growth and development through physical effects on their tissues and cells. Some commercial growers play classical music for crops, believing it enhances growth, though more research is still needed.
Proteomics is the study of the proteome, which is the complete set of proteins expressed by a genome or cell. It uses technologies like mass spectrometry and genetic analysis to study protein activities, modifications, localization, and interactions. Proteomic techniques can identify disease-related proteins and biomarkers for diagnosis before clinical symptoms appear. Two key proteomic techniques are gel electrophoresis, which separates proteins by charge and size, and mass spectrometry, which identifies proteins with high accuracy. Proteomics has applications in disease diagnosis, structural analysis, and functional studies of protein networks.
Volatile organic compounds (VOCs) are organic chemicals that evaporate at room temperature and participate in atmospheric reactions. VOCs are both naturally occurring and human-made. Plants synthesize a diversity of VOCs through several biochemical pathways to facilitate interactions with their environment. VOCs are derived from terpenes, phenylpropanoids, fatty acids, and amino acids. Their biosynthesis depends on carbon, nitrogen, and sulfur availability and primary metabolic energy. VOCs are emitted through various processes, often involving heat.
The CPU processes instructions and data to run programs, while the GPU renders graphics by performing calculations rapidly. CPUs interpret commands, and GPUs focus on graphics rendering. RAM is a type of volatile memory that allows information to be stored and retrieved quickly but loses data when powered off.
Chapter 14 - The Genetic Code and Transcription Klug.pptsaqlainsial
The document summarizes key aspects of the genetic code and transcription. It describes how the genetic code is written in mRNA using triplets of nucleotides that specify amino acids. It also explains that transcription in eukaryotes involves RNA polymerase II, promoters, and results in a pre-mRNA that undergoes splicing to remove introns and produce the mature mRNA. Visualization by electron microscopy has provided insights into the transcription process.
Mehanism of post Transcription -Cap PolyA kHZ.pptsaqlainsial
The document summarizes several key steps in gene expression after transcription in eukaryotic cells. These include 5' capping, 3' cleavage and polyadenylation of pre-mRNA, splicing, transport of mRNA from the nucleus to cytoplasm, and translation. It focuses on the mechanisms and protein factors involved in RNA capping and 3' end processing, including the AAUAAA polyadenylation signal, GU/U-rich elements, and the roles of CPSF, CstF, PAP, and PAB proteins. Transcription is shown to extend beyond the polyadenylation site, and the polyA tail is added co-transcriptionally in two phases requiring different protein complexes and the AAUAAA
Pakistan has several different soil types due to its varied climatic and geographic regions. The main soil groups include alluvial soils, coastal sands, saline/alkaline soils, arid/desert soils, tropical red soils, lateritic soils, piedmont soils, and montane soils of the Himalayas. Pakistan also has grasslands with a climate characterized by high evaporation and periodic droughts. Physiographically, northern Pakistan is dominated by the Western Himalayan mountains which feed the Indus River as it flows through the Indus Basin plains to the Arabian Sea delta, with the Thar and Cholistan deserts located east of the plains.
The document discusses model specification error, noting that the initial model may be overspecified by including too many variables, underspecified by omitting important variables, or specify the wrong mathematical relationships. Correct specification means the model includes all core variables, excludes irrelevant ones, uses the right functional form, and has no errors in variables or incorrectly specified error terms. Reasons for errors include omitting relevant variables, including unnecessary ones, adopting the wrong functional form, or having errors of measurement.
The document discusses different types of specification errors that can occur when building models:
1. Omission of important variables (underspecification), which leaves out relevant information.
2. Inclusion of irrelevant variables (overspecification), which introduces unnecessary complexity.
3. Using the wrong functional form, such as modeling a variable linearly instead of logarithmically.
4. Measurement errors in the variables, which introduce noise into the model.
These specification errors can lead models to misrepresent relationships and produce unreliable results. Care must be taken to identify all key variables and model them with the appropriate form.
The Calvin cycle is a cyclic process that occurs in the dark phase of photosynthesis and fixes carbon dioxide into sugars. It was discovered by Melvin Calvin in the 1940s using radioactive carbon-14 isotopes to track the path of carbon in photosynthesis. The cycle has three main stages: carbon fixation, reduction, and regeneration. In carbon fixation, the enzyme rubisco incorporates CO2 into ribulose bisphosphate (RuBP). The resulting six-carbon compound then splits into two three-carbon molecules of 3-phosphoglycerate (3PGA). In reduction, ATP and NADPH are used to convert the 3PGA into glyceraldehyde-3-phosphate (G3P). Some G3P molecules
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
2. OUTLINE
1. Overview:
   Ranking of scientific papers, and
   how high do bioinformatics papers rank?
2. Bioinformatics tools:
   ClustalW
   Phylogenetic trees
3. NATURE’S MOST-CITED RESEARCH OF ALL TIME
• Nature ranked papers published from 1900 to the present day by citation count (SCI: Science Citation Index)
• Database: Thomson Reuters’ Web of Science
Many of the world’s most famous papers do not make the cut,
e.g. the Theory of Relativity and many Nobel Prize-winning discoveries.
4. TOP 100 PAPERS
[Figure: the top 100 papers drawn to a scale of 1 cm against the roughly 58 million items indexed in Thomson Reuters’ Web of Science, which includes the social sciences, arts and humanities, conference proceedings, books, etc.]
5. ClustalW (progressive MSA)
Of the top 100 papers, 10% are bioinformatics- or phylogenetics-related.
The first of these, ClustalW, appears in the top 10:
6. MOST-CITED BIOINFORMATICS PAPERS
Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
10 | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 40289 | 53364 | Bioinformatics
12 | BLAST | J. Mol. Biol. | 1990 | 38380 | 62877 | Bioinformatics
14 | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 36410 | 59926 | Bioinformatics
28 | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 23826 | 35571 | Bioinformatics
75 | A comprehensive set of sequence-analysis programs for the VAX | Nucleic Acids Res. | 1984 | 14226 | 14252 | Bioinformatics
76 | MODELTEST: testing the model of DNA substitution | Bioinformatics | 1998 | 14099 | 18787 | Bioinformatics
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
7. MOST-CITED PHYLOGENETIC PAPERS
Rank | Title | Journal | Year | Times cited (2014.10.29*) | Times cited (2016.12.11) | Subject
20 | The neighbor-joining method: a new method for reconstructing phylogenetic trees | Mol. Biol. Evol. | 1987 | 30176 | 45184 | Phylogenetics
41 | Confidence limits on phylogenies: an approach using the bootstrap | Evolution | 1985 | 21373 | 31437 | Phylogenetics
45 | MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0 | Mol. Biol. Evol. | 2007 | 18286 | 28613 | Phylogenetics
100 | MrBayes 3: Bayesian phylogenetic inference under mixed models | Bioinformatics | 2003 | 12209 | 19181 | Phylogenetics
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
8. GOOGLE SCHOLAR’S MOST-CITED RESEARCH OF ALL TIME
• Also ranked by citation count
• But Google Scholar’s search engine pulls references from a much greater literature base
Many of the world’s most famous papers also do not make the cut,
e.g. a large volume of books, economics papers, etc.
9. GOOGLE SCHOLAR’S MOST-CITED BIOINFORMATICS OR PHYLOGENETIC PAPERS
Rank (WoS rank) | Title | Journal | Year | Times cited (2014.10.17*) | Times cited (2016.12.11) | Subject
24 (14) | Gapped BLAST and PSI-BLAST | Nucleic Acids Res. | 1997 | 52605 | 59926 | Bioinformatics
26 (12) | BLAST | J. Mol. Biol. | 1990 | 52314 | 62877 | Bioinformatics
35 (10) | Clustal W: improving the sensitivity of progressive MSA | Nucleic Acids Res. | 1994 | 47523 | 53364 | Bioinformatics
62 (20) | The neighbor-joining method: a new method for reconstructing phylogenetic trees | Mol. Biol. Evol. | 1987 | 37613 | 45184 | Phylogenetics
98 (28) | Clustal X: flexible strategies for MSA | Nucleic Acids Res. | 1997 | 30937 | 35571 | Bioinformatics
* Numbers from Google Scholar, extracted 17 October 2014.
Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
11. WHY BIOINFORMATICS?
• Big data, personalized medicine, precision medicine, etc.
• The Human Genome Project (1990-2003)
• Craig Venter and whole-genome shotgun sequencing
Bioinformatics helps us to:
• Better understand the link between biology and function
• Trace human genetic history and study disease
13. BLAST
• BLAST (Basic Local Alignment Search Tool)
• The BLAST papers are currently ranked nos. 12 and 14 on the top 100 list
• An introduction to BLAST will be covered by another group
14. CLUSTAL
• A series of programs for multiple sequence alignment
• Can align sequences from different organisms, even seemingly unrelated sequences, and help predict how a change at a specific position in a gene or protein might affect its function
15. CLUSTAL: SEVERAL VERSIONS
• ClustalW, currently ranked no. 10 on the list
• ClustalX, a later version, currently ranked no. 28 on the list
• There are several versions of Clustal; all align sequences in three main steps:
1. Start with pairwise alignments
2. Create a guide tree (or use a user-defined tree)
3. Use the guide tree to carry out the multiple sequence alignment
18. Web of Science Top 100
(Citation counts as of 2014.10.29* and 2016.12.11.)
Rank 20: "The neighbor-joining method: a new method for reconstructing phylogenetic trees." Mol. Biol. Evol., 1987. Cited 30,176 / 45,184. Phylogenetics; phylogenetic reconstruction.
Rank 41: "Confidence limits on phylogenies: an approach using the bootstrap." Evolution, 1985. Cited 21,373 / 31,437. Phylogenetics; statistics.
Rank 45: "MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0." Mol. Biol. Evol., 2007. Cited 18,286 / 28,613. Phylogenetics; tool.
Rank 100: "MrBayes 3: Bayesian phylogenetic inference under mixed models." Bioinformatics, 2003. Cited 12,209 / 19,181. Phylogenetics; phylogenetic reconstruction + tool.
* Van Noorden, Richard, Brendan Maher, and Regina Nuzzo. "The top 100 papers." Nature 514.7524 (2014): 550-553.
19. Phylogenetic reconstruction
• Distance-based methods
  • UPGMA (Unweighted Pair Group Method with Arithmetic mean)
  • Neighbor Joining
  • Fitch-Margoliash
• Character-based methods
  • Maximum Parsimony
  • Maximum Likelihood (probability-based)
  • Bayesian Inference (probability-based)
21. Distance-based methods
• UPGMA / Neighbor Joining / Fitch-Margoliash
• All start from a matrix of pairwise distances:

     A  B  C  D  E  F
  A  0  2  4  6  6  8
  B  2  0  4  6  6  8
  C  4  4  0  6  6  8
  D  6  6  6  0  4  8
  E  6  6  6  4  0  8
  F  8  8  8  8  8  0

• The matrix is symmetric with a zero diagonal, so the lower triangle suffices:

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
25. UPGMA
• A bottom-up (agglomerative) hierarchical clustering method
(Diagram: leaves a-f are merged stepwise into clusters bc, ef, def, bcdef, and finally abcdef; agglomerative clustering builds the tree bottom-up, while divisive clustering splits top-down.)
26. UPGMA
• Step 1: the closest pair is (A, B) at distance 2, so A and B are joined at height 1 (branch lengths 1 and 1).

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
27. UPGMA
• Step 2: replace A and B with the cluster (A,B), averaging its distances to the other taxa. The closest pair is now (D, E) at distance 4, joined at height 2 (branch lengths 2 and 2).

       (A,B)   C  D  E
  C  (4+4)/2
  D  (6+6)/2   6
  E  (6+6)/2   6  4
  F  (8+8)/2   8  8  8
28. UPGMA
• Step 3: with (D,E) merged, the closest pair is ((A,B), C) at distance 4; C joins at height 2 (branch length 2 for C, 1 for the (A,B) node).

       (A,B)    C    (D,E)
  C      4
  DE  (6+6)/2 (6+6)/2
  F      8      8   (8+8)/2
29. UPGMA
• Step 4: three clusters remain: ((A,B),C), (D,E), and F.

       ((A,B),C)   (D,E)
  DE  (6+6)/2 = 6
  F   (8+8)/2 = 8    8
30. UPGMA
• Step 5: merge ((A,B),C) with (D,E) at distance 6 (height 3); only F remains, at distance 8, so the root sits at height 4 (branch length 4 for F).

      (((A,B),C),(D,E))
  F      (8+8)/2 = 8
31. UPGMA
• Result: the rooted ultrametric tree ((((A,B),C),(D,E)),F). Branch lengths: A = B = 1, C = D = E = 2, F = 4, with internal edges of length 1; every leaf is at distance 4 from the root.

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
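The merge-and-average steps above can be sketched as a small program. This is an illustrative toy implementation (not any of the packages named in this deck): cluster labels are concatenated leaf names, and the function returns the nested-tuple tree together with its size and root height.

```python
def upgma(dist):
    """dist: {label: {label: distance}}, symmetric, zero diagonal."""
    clusters = {name: (name, 1, 0.0) for name in dist}   # tree, size, height
    d = {a: dict(dist[a]) for a in dist}
    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(((x, y) for x in d for y in d if x < y),
                   key=lambda p: d[p[0]][p[1]])
        (ta, na, _), (tb, nb, _) = clusters[a], clusters[b]
        h = d[a][b] / 2                       # height of the new node
        new = a + b
        d[new] = {}
        # Size-weighted average of member distances (the UPGMA update).
        for x in list(d):
            if x not in (a, b, new):
                d[new][x] = d[x][new] = (na * d[a][x] + nb * d[b][x]) / (na + nb)
        for x in (a, b):
            del clusters[x], d[x]
            for y in d:
                d[y].pop(x, None)
        clusters[new] = ((ta, tb), na + nb, h)
    return next(iter(clusters.values()))

labels = "ABCDEF"
M = [[0, 2, 4, 6, 6, 8],
     [2, 0, 4, 6, 6, 8],
     [4, 4, 0, 6, 6, 8],
     [6, 6, 6, 0, 4, 8],
     [6, 6, 6, 4, 0, 8],
     [8, 8, 8, 8, 8, 0]]
D = {a: {b: M[r][c] for c, b in enumerate(labels)}
     for r, a in enumerate(labels)}
tree, size, height = upgma(D)   # root height 4, with F joining last
```

On the slides' matrix this reproduces the walkthrough: A-B merge at height 1, D-E and (A,B)-C at height 2, and F joins last at height 4.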
32. UPGMA
• Now apply UPGMA to a second distance matrix:

     A  B  C  D  E
  B  5
  C  4  7
  D  7  10 7
  E  6  9  6  5
  F  8  11 8  9  8

(Diagram: the rooted, ultrametric tree UPGMA builds from this matrix.)
33. UPGMA
• This matrix, however, was generated by a tree whose leaves are not equidistant from the root.
(Diagram: the true tree for this matrix, with unequal branch lengths including 0.5, 4.5, 1.5, 1, 3, 2, and 2.5.)
UPGMA
34. • A bottom-up (agglomerative) hierarchical
clustering method
UPGMA
34
A B C D E
B 5
C 4 7
D 7 10 7
E 6 9 6 5
F 8 11 8 9 8
???
UPGMA 1
Root
4
2
1
4
3
2
1
1
1
F
D
E
C
A
B
True tree
Root
F
0.5
4.5
1.5
1
B
1
3
A
C
2
2
D
E
2.5
2.5
ultrametric tree Not ultrametric tree
35. UPGMA
• Ultrametric criterion (three-point condition): for every three taxa A, B, C,
  DAB ≤ max(DAC, DBC)
  DAC ≤ max(DAB, DBC)
  DBC ≤ max(DAB, DAC)
• Tree 1's matrix (DAB = 2, DAC = 4, DBC = 4) satisfies the criterion:
  DAB = 2 ≤ max(4, 4); DAC = 4 ≤ max(2, 4); DBC = 4 ≤ max(2, 4)
• Tree 2's matrix (DAB = 5, DAC = 4, DBC = 7) violates it:
  DBC = 7 > max(5, 4)
• UPGMA is only guaranteed to recover the correct tree when the distances are ultrametric.
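The three-point criterion is easy to check mechanically. Below is a minimal helper (hypothetical, written for this example only) applied to the two matrices above:

```python
from itertools import combinations

def is_ultrametric(D):
    """D: symmetric dict-of-dicts of pairwise distances.
    For every triple, no distance may exceed the max of the other two."""
    for x, y, z in combinations(list(D), 3):
        dxy, dxz, dyz = D[x][y], D[x][z], D[y][z]
        if (dxy > max(dxz, dyz) or dxz > max(dxy, dyz)
                or dyz > max(dxy, dxz)):
            return False
    return True

m1 = {"A": {"B": 2, "C": 4}, "B": {"A": 2, "C": 4}, "C": {"A": 4, "B": 4}}
m2 = {"A": {"B": 5, "C": 4}, "B": {"A": 5, "C": 7}, "C": {"A": 4, "B": 7}}
print(is_ultrametric(m1))   # True:  Tree 1's matrix is ultrametric
print(is_ultrametric(m2))   # False: DBC = 7 > max(5, 4)
```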
37. Neighbor Joining
• A bottom-up (agglomerative) clustering method
• Applied to the same non-ultrametric matrix:

     A  B  C  D  E
  B  5
  C  4  7
  D  7  10 7
  E  6  9  6  5
  F  8  11 8  9  8

• Neighbor joining starts from a star-like tree (all six taxa attached to a single internal node) and resolves it by joining one pair of neighbors at a time. Can it recover the true tree?
38. Neighbor Joining: Step 1 (N = 6; OTU = Operational Taxonomic Unit)

     A  B  C  D  E
  B  5
  C  4  7
  D  7  10 7
  E  6  9  6  5
  F  8  11 8  9  8

Step 1-1. Net divergence Sx = (sum of all Dx)/(N-2):
  SA = (5+4+7+6+8)/(6-2) = 7.5
  SB = (5+7+10+9+11)/(6-2) = 10.5
  SC = (4+7+7+6+8)/(6-2) = 8
  SD = (7+10+7+5+9)/(6-2) = 9.5
  SE = (6+9+6+5+8)/(6-2) = 8.5
  SF = (8+11+8+9+8)/(6-2) = 11
Step 1-2. Mij = Dij - Si - Sj; pick the smallest:
  MAB = DAB-SA-SB = 5-7.5-10.5 = -13
  MDE = DDE-SD-SE = 5-9.5-8.5 = -13
  (tie: join A and B into the new node U1)
Step 1-3. Branch lengths SiU = Dij/2 + (Si - Sj)/2:
  SAU1 = DAB/2+(SA-SB)/2 = 5/2+(7.5-10.5)/2 = 1
  SBU1 = DAB/2+(SB-SA)/2 = 5/2+(10.5-7.5)/2 = 4
Step 1-4. (Tree: the star tree is resolved by joining A and B to U1, with branch lengths 1 and 4; C, D, E, F remain unresolved.)
Step 1-5. Distances to the new node: DxU = (Dix + Djx - Dij)/2
39. Neighbor Joining: Step 2 (N = 5)

Updated matrix (from Step 1-5: DxU1 = (DxA + DxB - DAB)/2, e.g. DCU1 = (4+7-5)/2 = 3):

     U1  C  D  E
  C  3
  D  6   7
  E  5   6  5
  F  7   8  9  8

Step 2-1. SU1 = (3+6+5+7)/(5-2) = 7; SC = (3+7+6+8)/(5-2) = 8; SD = (6+7+5+9)/(5-2) = 9; SE = (5+6+5+8)/(5-2) = 8; SF = (7+8+9+8)/(5-2) = 10.67
Step 2-2. MCU1 = DCU1-SC-SU1 = 3-8-7 = -12; MDE = DDE-SD-SE = 5-9-8 = -12 (tie: join D and E into U2)
Step 2-3. SDU2 = DDE/2+(SD-SE)/2 = 5/2+(9-8)/2 = 3; SEU2 = DDE/2+(SE-SD)/2 = 5/2+(8-9)/2 = 2
Step 2-4. (Tree: D and E joined to U2 with branch lengths 3 and 2.)
Step 2-5. DxU2 = (DDx + DEx - DDE)/2
40. Neighbor Joining: Step 3 (N = 4)

Updated matrix (from Step 2-5: e.g. DU1U2 = (6+5-5)/2 = 3, DCU2 = (7+6-5)/2 = 4, DFU2 = (9+8-5)/2 = 6):

     U1  C  U2
  C  3
  U2 3   4
  F  7   8  6

Step 3-1. SU1 = (3+3+7)/(4-2) = 6.5; SC = (3+4+8)/(4-2) = 7.5; SU2 = (3+4+6)/(4-2) = 6.5; SF = (7+8+6)/(4-2) = 10.5
Step 3-2. MCU1 = DCU1-SC-SU1 = 3-7.5-6.5 = -11 (smallest: join C and U1 into U3)
Step 3-3. SCU3 = DCU1/2+(SC-SU1)/2 = 3/2+(7.5-6.5)/2 = 2; SU1U3 = DCU1/2+(SU1-SC)/2 = 3/2+(6.5-7.5)/2 = 1
Step 3-4. (Tree: C and U1 joined to U3 with branch lengths 2 and 1.)
Step 3-5. DxU3 = (DCx + DU1x - DCU1)/2
41. Neighbor Joining: Step 4 (N = 3)

Updated matrix (from Step 3-5: DU2U3 = (4+3-3)/2 = 2, DU3F = (8+7-3)/2 = 6):

     U2  U3
  U3 2
  F  6   6

Step 4-1. SU2 = (2+6)/(3-2) = 8; SU3 = (2+6)/(3-2) = 8; SF = (6+6)/(3-2) = 12
Step 4-2. MU2F = 6-8-12 = -14; MU3F = 6-8-12 = -14; MU2U3 = 2-8-8 = -14 (three-way tie: join U2 and U3 into U4)
Step 4-3. SU2U4 = DU2U3/2+(SU2-SU3)/2 = 2/2+(8-8)/2 = 1; SU3U4 = DU2U3/2+(SU3-SU2)/2 = 2/2+(8-8)/2 = 1
Step 4-4. (Tree: U2 and U3 joined to U4 with branch lengths 1 and 1.)
Step 4-5. DxU4 = (DU2x + DU3x - DU2U3)/2
42. Neighbor Joining: Step 5 (N = 2)

Step 5-1. Sx = (sum of all Dx)/(N-2) is undefined here, since N-2 = 2-2 = 0.
Step 5-2. Only U4 and F remain; from Step 4-5, DU4F = (6+6-2)/2 = 5, so they are joined directly by a branch of length 5, completing the unrooted tree.
(Final tree: A and B join U1 with branch lengths 1 and 4; D and E join U2 with 3 and 2; C and U1 join U3 with 2 and 1; U2 and U3 join U4 with 1 and 1; U4-F = 5.)
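Steps 1-5 can be condensed into a loop. The sketch below is a toy illustration using exactly the slides' formulas (Sx, Mij, the branch lengths SiU, and the update DxU); its tie-breaking may pick a different pair than the slides when M values are equal, so internal node numbering can differ, but the resulting unrooted tree is the same.

```python
def neighbor_joining(D):
    """D: {taxon: {taxon: distance}} without self-entries."""
    D = {a: dict(D[a]) for a in D}
    edges, k = [], 0
    while len(D) > 2:
        n = len(D)
        # Step i-1: net divergence S_x = (sum of distances from x)/(N-2)
        S = {x: sum(D[x].values()) / (n - 2) for x in D}
        # Step i-2: join the pair minimizing M_ij = D_ij - S_i - S_j
        i, j = min(((x, y) for x in D for y in D if x < y),
                   key=lambda p: D[p[0]][p[1]] - S[p[0]] - S[p[1]])
        k += 1
        u = "U%d" % k
        # Step i-3: branch lengths S_iU = D_ij/2 + (S_i - S_j)/2
        edges.append((i, u, D[i][j] / 2 + (S[i] - S[j]) / 2))
        edges.append((j, u, D[i][j] / 2 + (S[j] - S[i]) / 2))
        # Step i-5: distances to the new node D_xU = (D_ix + D_jx - D_ij)/2
        D[u] = {}
        for x in list(D):
            if x not in (i, j, u):
                D[u][x] = D[x][u] = (D[i][x] + D[j][x] - D[i][j]) / 2
        for x in (i, j):
            del D[x]
            for y in D:
                D[y].pop(x, None)
    # Two nodes left: connect them with the remaining distance.
    a, b = D
    edges.append((a, b, D[a][b]))
    return edges

labels = "ABCDEF"
M = [[0, 5, 4, 7, 6, 8],
     [5, 0, 7, 10, 9, 11],
     [4, 7, 0, 7, 6, 8],
     [7, 10, 7, 0, 5, 9],
     [6, 9, 6, 5, 0, 8],
     [8, 11, 8, 9, 8, 0]]
D = {a: {b: M[r][c] for c, b in enumerate(labels) if c != r}
     for r, a in enumerate(labels)}
edges = neighbor_joining(D)   # first join: A and B, branch lengths 1 and 4
```

Because the input distances are additive, any valid tie-breaking yields the same tree, with total branch length 1+4+3+2+2+1+1+1+5 = 20.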
44. Tools
• MEGA (Molecular Evolutionary Genetics Analysis)
• MrBayes (Bayesian Inference of Phylogeny)
• PHYLIP (the PHYLogeny Inference Package)
• PAUP (Phylogenetic Analysis Using Parsimony)
• iTOL (interactive Tree of Life)
• …
45. References
• Van Noorden, Richard, Brendan Maher, and Regina
Nuzzo. "The top 100 papers." Nature 514.7524
(2014): 550-553.
• Barton, N. H., D. E. G. Briggs, J. A. Eisen, D. B.
Goldstein and N. H. Patel (2007). Evolution, Cold
Spring Harbor Laboratory Press.
• Saitou, Naruya, and Masatoshi Nei. "The neighbor-
joining method: a new method for reconstructing
phylogenetic trees." Molecular biology and
evolution 4.4 (1987): 406-425.
46. Ranked 10th, with 53,364 citations:
"CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice" (1994)
47. ClustalW
• ClustalW is a general-purpose multiple alignment program for DNA or proteins that uses progressive alignment.
• It can create multiple alignments, manipulate existing alignments, do profile analysis, and create phylogenetic trees.
• It was produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.
48. Progressive Alignment
• Proposed by Feng & Doolittle (1987).
• Basic idea:
  - Align the two closest sequences first
  - Progressively align the next most closely related sequences until all sequences are aligned
• Examples of the progressive alignment method: ClustalW, T-Coffee, ProbCons
  - ProbCons is currently the most accurate MSA algorithm.
  - ClustalW is the most popular software.
49. Basic algorithm
1. Compute pairwise distance scores for all pairs of sequences.
2. Generate the guide tree, which places similar sequences nearer each other in the tree.
3. Align the sequences one by one according to the guide tree.
50. Step 1: Pairwise distance scores
• Example: for S1 and S2, the global alignment has 9 non-gap positions, of which 8 are matches.
• The distance is 1 - 8/9 = 0.111.
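In code, the distance is one minus the fraction of matching non-gap positions. The aligned strings below are made up to reproduce the slide's numbers (9 non-gap positions, 8 matches); the actual S1 and S2 are in the slide figure.

```python
def pairwise_distance(a, b):
    """a, b: the two rows of a pairwise alignment ('-' marks a gap)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in pairs)
    return 1 - matches / len(pairs)

S1 = "ACGTACGT-A"   # made-up aligned sequences: 9 non-gap positions,
S2 = "ACGTACGTCC"   # 8 of them matching
print(round(pairwise_distance(S1, S2), 3))   # 0.111
```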
51. Step 2: Generate guide tree
• Generate the guide tree by neighbor-joining.
52. Step 3: Align the sequences according to the guide tree (I)
• Aligning S1 and S2, we get
• Aligning S4 and S5, we get
53. Step 3: Align the sequences according to the guide tree (II)
• Aligning (S1, S2) with S3, we get
• Aligning (S1, S2, S3) with (S4, S5), we get
55. Detail of Profile-Profile alignment (I)
• Given two aligned sets of sequences A1 and A2:
  - A1 is a length-11 alignment of S1, S2, S3
  - A2 is a length-9 alignment of S4, S5
56. Detail of Profile-Profile alignment (II)
• A1[1…11] is the alignment of S1, S2, S3
• A2[1…9] is the alignment of S4, S5
• Score(A1[9], A2[8]) = δ(C,C)+δ(C,A)+δ(C,C)+δ(C,A)+δ(-,C)+δ(-,A)
• By dynamic programming, you can find the best score of the multiple alignment. Takes O(k1·n1 + k2·n2 + n1·n2) time.
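The column score is just a sum of δ over all cross-pairs of symbols, including gaps. A toy version (δ here is a made-up match/mismatch score, not ClustalW's actual weight matrices and gap penalties):

```python
def delta(x, y):
    # Toy substitution score: +1 for a match, -1 for a mismatch or gap
    # (ClustalW actually uses weight matrices and gap penalties).
    return 1 if x == y and x != "-" else -1

def column_score(col1, col2):
    """Sum delta over every (symbol-in-col1, symbol-in-col2) pair."""
    return sum(delta(x, y) for x in col1 for y in col2)

# The slide's example: column A1[9] holds C, C, '-'; column A2[8] holds C, A.
print(column_score("CC-", "CA"))   # 1 - 1 + 1 - 1 - 1 - 1 = -2
```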
57. Time complexity (k sequences, each of length n)
• Step 1: pairwise distance scores take O(k²n²) time.
• Step 2: neighbor-joining takes O(k³) time.
• Step 3: performs at most k profile-profile alignments, each taking O(kn + n²) time, so Step 3 takes O(k²n + kn²) time.
• Hence, ClustalW takes O(k²n² + k³) time.
Note: neighbor-joining on a set of k taxa requires at most k-2 iterations; each iteration builds and searches a matrix, initially k × k, then (k-1) × (k-1), and so on.
UPGMA (Unweighted Pair Group Method with Arithmetic Mean): https://en.wikipedia.org/wiki/UPGMA
WPGMA (Weighted Pair Group Method with Arithmetic Mean): https://en.wikipedia.org/wiki/WPGMA
http://mirlab.org/jang/books/dcpr/dcHierClustering.asp?title=3-2%20Hierarchical%20Clustering%20(%B6%A5%BCh%A6%A1%A4%C0%B8s%AAk)&language=Chinese
http://www.sthda.com/english/wiki/hierarchical-clustering-essentials-unsupervised-machine-learning
https://en.wikipedia.org/wiki/Ultrametric_space
Neighbor joining: https://en.wikipedia.org/wiki/Neighbor_joining