The objectives are an understanding of:
▶ How “function” in a genomic context can be defined. This is a
surprisingly philosophical problem.
▶ Some strategies for determining if a genomic region is likely to
be functional or not.
The document provides an overview of the Human Genome Project (HGP). It describes the HGP's goal of mapping and sequencing the entire human genome. The HGP was an international research effort that worked alongside a private company, Celera Genomics, to complete a rough draft of the human genome by 2000. The completion of the HGP marked a major scientific achievement and has transformed fields like medicine, biotechnology, and genetics by providing a comprehensive map of the human genetic code.
Dan Graur - Can the human genome be 100% functional?Andrei Afanasiev
(1) The document discusses the concept of genomic function and argues that not all parts of the genome are functional. It defines functional DNA as DNA whose sequence is under selection, and whose absence would impair fitness.
(2) The document discusses different concepts of biological function and argues the appropriate concept is "selected effect function" - a part's historical role in adaptation and survival, not just causal roles.
(3) It argues activity like transcription does not prove function, and most genome is likely nonfunctional or neutral "rubbish". Function should be defined by its evolutionary role, not biochemical properties.
Guest lecture on comparative genomics for University of Dundee BS32010, delivered 21/3/2016
Workshop/other materials available at DOI:10.5281/zenodo.49447
Stephen Friend Food & Drug Administration 2011-07-18Sage Base
The document discusses potential opportunities for participating in clinical trial projects that study network perturbations in clinical specimens to better understand how to select effective drug targets for different diseases and patients. It describes four potential projects: 1) a clinical trial comparator arm project, 2) a project to decode biology using drug compounds, 3) an oncology non-responders project, and 4) a project to free up failed drug compounds. It asks what actions the FDA or other executive/legislative bodies might take regarding these projects.
Is microbial ecology driven by roaming genes?beiko
Microbial ecology often makes assumptions about the relationship between phylogeny and function, but these assumptions can be invalidated by lateral gene transfer. We need to take a broader view of relationships between genes and genomes in order to make better sense out of microbes.
This study performed a comparative metagenomic analysis of fecal samples from 13 healthy individuals of various ages to identify genomic features common and variable among human gut microbiomes. It found that gut microbiomes from unweaned infants were simple and highly variable, while adults and weaned children were more complex but functionally uniform. 237 gene families were commonly enriched in adult microbiomes and 136 in infant microbiomes, with a small overlap. 647 new gene families were exclusively present in human intestinal microbiomes.
The document provides an overview of the Human Genome Project (HGP). It describes the HGP's goal of mapping and sequencing the entire human genome. The HGP was an international research effort that worked alongside a private company, Celera Genomics, to complete a rough draft of the human genome by 2000. The completion of the HGP marked a major scientific achievement and has transformed fields like medicine, biotechnology, and genetics by providing a comprehensive map of the human genetic code.
Dan Graur - Can the human genome be 100% functional?Andrei Afanasiev
(1) The document discusses the concept of genomic function and argues that not all parts of the genome are functional. It defines functional DNA as DNA whose sequence is under selection, and whose absence would impair fitness.
(2) The document discusses different concepts of biological function and argues the appropriate concept is "selected effect function" - a part's historical role in adaptation and survival, not just causal roles.
(3) It argues activity like transcription does not prove function, and most genome is likely nonfunctional or neutral "rubbish". Function should be defined by its evolutionary role, not biochemical properties.
Guest lecture on comparative genomics for University of Dundee BS32010, delivered 21/3/2016
Workshop/other materials available at DOI:10.5281/zenodo.49447
Stephen Friend Food & Drug Administration 2011-07-18Sage Base
The document discusses potential opportunities for participating in clinical trial projects that study network perturbations in clinical specimens to better understand how to select effective drug targets for different diseases and patients. It describes four potential projects: 1) a clinical trial comparator arm project, 2) a project to decode biology using drug compounds, 3) an oncology non-responders project, and 4) a project to free up failed drug compounds. It asks what actions the FDA or other executive/legislative bodies might take regarding these projects.
Is microbial ecology driven by roaming genes?beiko
Microbial ecology often makes assumptions about the relationship between phylogeny and function, but these assumptions can be invalidated by lateral gene transfer. We need to take a broader view of relationships between genes and genomes in order to make better sense out of microbes.
This study performed a comparative metagenomic analysis of fecal samples from 13 healthy individuals of various ages to identify genomic features common and variable among human gut microbiomes. It found that gut microbiomes from unweaned infants were simple and highly variable, while adults and weaned children were more complex but functionally uniform. 237 gene families were commonly enriched in adult microbiomes and 136 in infant microbiomes, with a small overlap. 647 new gene families were exclusively present in human intestinal microbiomes.
The document discusses a lab for bioinformatics and computational genomics at Ghent University. The lab has 10 "genome hackers" who are mostly engineers and 42 scientists, technicians, geneticists and clinicians. The lab focuses on bioinformatics, epigenetics, personal genomics and 3D printing. Bioinformatics is defined as the application of information technology to biological information, facilitated by computers. The document then discusses various topics related to genetics, genomics and personalized medicine.
In-class introduction to basic Punnett square set-up and problem s.docxbradburgess22840
In-class introduction to basic Punnett square set-up and problem solving, Part 1
Problem-solving tips:
· A Punnett square allows you to predict the possible genetic outcome of children based on the genetic make-up of the parents.
· First, read the problem and figure out whether the trait of interest or genetic disorder is found on the dominant allele or the recessive allele because that will have an impact on how you interpret the results of the Punnett square.
· Select a letter to represent the trait or disorder and define the dominant and recessive alleles. For example: For eye color, B (dominant) = brown eyes and b (recessive) = blue eyes. For achondroplasia (dwarfism), A (dominant) = achondroplasia and a (recessive) = normal allele.
· If it is a sex-linked question, remember to include the sexual genotypes of the parents (XX for mom and XY for dad).
· Write down all possible genotypes & phenotypes and use this information to help you set up the Punnett square.
1. Practice question on a human trait. In reality, eye color is controlled by multiple genes and is a complex trait. For simplicity, we’ll assume that brown eyes are dominant to blue eyes. Answer the questions below.
a) Select a letter for this trait and define the dominant and recessive alleles.
B (dominant) =
b (recessive) =
b) Write down all possible genotypes and phenotypes for individuals in the population
Possible genotypes
(the 2 alleles an individual has)
Possible phenotypes (the physical appearance of a trait)
Homozygous dominant individuals
Homozygous recessive individuals
Heterozygous individuals
c) Set up the Punnett square and solve this problem. Kristy is heterozygous and Mark has blue eyes. What percentage of their offspring will have blue eyes?
Kristy's genotype
Mark's genotype
a) Select a letter for this genetic condition and define the dominant and recessive alleles.
F (dominant) =
f (recessive) =
b) Write down all possible genotypes and phenotypes for individuals in the population
Possible genotypes
(the 2 alleles an individual has)
Possible phenotypes (the physical appearance of a trait)
Homozygous dominant individuals
Homozygous recessive individuals
Heterozygous individuals
c) Set up the Punnett square and solve this problem. Kristy and Mark are carriers for cystic fibrosis. The term carrier is only used when a condition is on the recessive allele. Carriers are heterozygous individuals who are normal and show no symptoms of the disorder, but they have the ability to pass on the mutated recessive allele to their offspring. What percentage of their children will be normal? What percentage of their children will be carriers?
Kristy's genotype
Mark's genotype
2. Practice question on a genetic condition. Cystic fibrosis (CF) is an autosomal, recessive condition that results in mucus buildup in the lungs and digestive system organs. As a result, CF .
This document summarizes a thesis that developed a database comparing gene expression data from axolotl and zebrafish limb regeneration studies. The thesis used a "Rosetta stone" approach to identify 78 axolotl genes that matched human genes and had homologous zebrafish genes. These genes were organized based on gene ontologies, functions during regeneration, and expression changes with TCDD exposure. Key findings included roles for ribosomal proteins, mitochondria, neurite regeneration, extracellular matrix proteins, and genes related to cell differentiation, apoptosis and maintenance.
Biotechnology- Principles and processes investigatory project.Nishant Upadhyay
It is an investigatory project on biology for class XII (12) CBSE on- Biotechnology Principles and processes(chapter 11). It would be very helpful for those whio are searching for a ready made investigatory project.
This document provides an overview of bioinformatics and highlights several key points:
- Bioinformatics has emerged as a field to help analyze the vast amounts of biological data being generated through high-throughput technologies. It integrates biology, computer science, and information technology.
- The size of the human genome and rate of data generation has grown exponentially, necessitating computational approaches. International efforts like the Human Genome Project helped sequence the entire human genome.
- Bioinformatics tools and databases are used to study genomics, transcriptomics, proteomics and more to better understand living systems at the molecular level and enable applications in medicine, agriculture, forensics and more. This work also raises ethical, legal and social considerations.
Presentation made by Abeliovich and Rhinn on the 20th of April, 2017, at the live webinar hosted by Alzforum: http://www.alzforum.org/webinars/webinar-cortex-aging-too-fast-blame-tmem106b-and-progranulin
Systems biology in polypharmacology: explaining and predicting drug secondary...Andrei KUCHARAVY
This document discusses using systems biology approaches to predict and explain off-target drug effects. It notes that unexpected secondary drug effects and lack of therapeutic effects are major reasons drug development fails. The document proposes using computational methods to predict all protein targets a drug may affect and using systems biology to model the consequences, including secondary effects, unexpected therapeutic effects via drug repositioning, and unexpected lack of effects. It outlines a master's project to develop models of polypharmacological drug effects by analyzing networks of drug-affected protein targets and retrieving relevant biological annotation to interpret effects.
This document is a thesis presented by Alexandra Mariela Popa to the University Claude Bernard - Lyon 1 in 2011 for the degree of Doctor of Philosophy. It examines the evolution of recombination and genomic structures through a modeling approach. The thesis analyzes the relationships between the causes, characteristics, and effects of recombination from an evolutionary perspective. Models are developed to compare recombination strategies across species and sexes, study the impact of sex-specific recombination hotspots on GC content evolution, and analyze the impact of recombination on deleterious mutation frequencies in human populations.
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
The so-called “next-generation” sequencing (NGS) technologies allows us, in a short time and in parallel, to sequence massive amounts of DNA, overcoming the limitations of the original Sanger sequencing methods used to sequence the first human genome. NGS technologies have had an enormous impact on biomedical research within a short time frame. This talk will give an overview of these applications with specific examples from Mendelian genomics and cancer research. #h2ony
Why Life is Difficult, and What We MIght Do About ItAnita de Waard
This document discusses connecting biological knowledge through claim-evidence networks. It outlines some of the challenges in biology like variability between specimens and gene expression changes. It then proposes that claim-evidence networks can be used to connect biological knowledge by linking experimental evidence to claims. Steps to build these networks include identifying claims in documents, structuring the evidence in databases, and automatically connecting the claims and evidence. Examples of efforts that link drug interactions to evidence and predict protein interactions across species are provided. However, it notes that more still needs to be done to fully realize this approach.
Epigenetics of the Developing BrainFrances A. Cham pagne .docxSALU18
This document summarizes research on epigenetics and the developing brain. It discusses how:
1) Early life experiences like prenatal stress can induce epigenetic changes through DNA methylation that persist into adulthood and affect brain development and behavior.
2) The quality of parent-offspring interactions, especially the mother-infant relationship, can shape the developing brain through epigenetic pathways. Deprivation of maternal contact in early life leads to lasting effects in animals and humans.
3) Prenatal stress may program the stress response in offspring through epigenetic changes to genes like the glucocorticoid receptor gene in the hippocampus. This highlights how both prenatal and early postnatal environments integrate nature
The document discusses genetic algorithms, which are a class of optimization algorithms inspired by biological evolution. It describes the key components of genetic algorithms, including encoding solutions as chromosomes, initializing a population randomly, evaluating fitness, and applying genetic operators like crossover and mutation to produce new generations. The goal is to evolve increasingly fit solutions over many iterations until an optimal or near-optimal solution is found for the given problem.
The document summarizes research characterizing promoter-associated short tandem repeats (STRs) in the human genome and investigating their potential functional effects by analyzing associations with gene expression (eQTLs) and DNA methylation (mQTLs). The researchers targeted over 6,000 promoter-associated STRs in 120 individuals and identified STR variants correlated with local transcription and methylation, providing evidence that STRs can influence gene regulation and represent an overlooked source of genetic and phenotypic variation.
Network biology: Large-scale integration of data and textLars Juhl Jensen
The document discusses network biology and large-scale data integration. It describes protein-protein interaction networks like STRING that integrate data from curated knowledge, experiments, and predictions. It provides exercises to explore the human insulin receptor (INSR) in STRING, examining the types of evidence that support its interaction with IRS1. It also introduces other integrated networks like STITCH for chemicals and COMPARTMENTS for subcellular localization. Natural language processing techniques like named entity recognition, information extraction, and semantic tagging are used to integrate text data from the literature into these interaction networks.
This document discusses challenges and opportunities in applying mRNA sequencing (mRNAseq) to non-model organisms. It describes using digital normalization as a way to cope with having a massive amount of lamprey mRNAseq data but an incomplete genomic reference. Digital normalization enables assembly and analysis that would otherwise not be possible. The document also discusses applying digital normalization to ascidian mRNAseq data, where it results in substantial time savings and comparable transcriptomes to those assembled without normalization. Finally, it discusses next steps for lamprey transcriptome analysis including enabling various evolutionary and biological studies.
Network biology - Large-scale integration of data and textLars Juhl Jensen
The document discusses network biology and integration of large-scale data and text to build interaction networks. It introduces the STRING database, which contains over 9.6 million proteins and integrates interaction data from curated databases, experiments, textmining, and predictive methods. The document uses human insulin receptor (INSR) as an example to demonstrate searching and analyzing the STRING network, showing evidence from different data sources for its interaction with IRS1. It also introduces other integrated networks in the STRING group including STITCH, COMPARTMENTS, TISSUES and DISEASES.
Genes contribute to behaviors but do not determine them. Studies of twins and genes like 5-HTT and COMT show that:
- Genetics influence traits like intelligence (70%) and likelihood of depression, but environment also matters
- The 5-HTT gene predicts if one will become depressed after stressful life events
- COMT impacts stress response and test/task performance depending on allele (fast or slow acting enzyme)
However, one gene does not cause a behavior and ethics around genetic testing and research are important issues.
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
Sequence alignment & comparative genomics
▶ What’s the difference between homology and analogy?
▶ How homology is estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
More Related Content
Similar to ppgardner-lecture07-genome-function.pdf
The document discusses a lab for bioinformatics and computational genomics at Ghent University. The lab has 10 "genome hackers" who are mostly engineers and 42 scientists, technicians, geneticists and clinicians. The lab focuses on bioinformatics, epigenetics, personal genomics and 3D printing. Bioinformatics is defined as the application of information technology to biological information, facilitated by computers. The document then discusses various topics related to genetics, genomics and personalized medicine.
In-class introduction to basic Punnett square set-up and problem s.docxbradburgess22840
In-class introduction to basic Punnett square set-up and problem solving, Part 1
Problem-solving tips:
· A Punnett square allows you to predict the possible genetic outcome of children based on the genetic make-up of the parents.
· First, read the problem and figure out whether the trait of interest or genetic disorder is found on the dominant allele or the recessive allele because that will have an impact on how you interpret the results of the Punnett square.
· Select a letter to represent the trait or disorder and define the dominant and recessive alleles. For example: For eye color, B (dominant) = brown eyes and b (recessive) = blue eyes. For achondroplasia (dwarfism), A (dominant) = achondroplasia and a (recessive) = normal allele.
· If it is a sex-linked question, remember to include the sexual genotypes of the parents (XX for mom and XY for dad).
· Write down all possible genotypes & phenotypes and use this information to help you set up the Punnett square.
1. Practice question on a human trait. In reality, eye color is controlled by multiple genes and is a complex trait. For simplicity, we’ll assume that brown eyes are dominant to blue eyes. Answer the questions below.
a) Select a letter for this trait and define the dominant and recessive alleles.
B (dominant) =
b (recessive) =
b) Write down all possible genotypes and phenotypes for individuals in the population
Possible genotypes
(the 2 alleles an individual has)
Possible phenotypes (the physical appearance of a trait)
Homozygous dominant individuals
Homozygous recessive individuals
Heterozygous individuals
c) Set up the Punnett square and solve this problem. Kristy is heterozygous and Mark has blue eyes. What percentage of their offspring will have blue eyes?
Kristy's genotype
Mark's genotype
a) Select a letter for this genetic condition and define the dominant and recessive alleles.
F (dominant) =
f (recessive) =
b) Write down all possible genotypes and phenotypes for individuals in the population
Possible genotypes
(the 2 alleles an individual has)
Possible phenotypes (the physical appearance of a trait)
Homozygous dominant individuals
Homozygous recessive individuals
Heterozygous individuals
c) Set up the Punnett square and solve this problem. Kristy and Mark are carriers for cystic fibrosis. The term carrier is only used when a condition is on the recessive allele. Carriers are heterozygous individuals who are normal and show no symptoms of the disorder, but they have the ability to pass on the mutated recessive allele to their offspring. What percentage of their children will be normal? What percentage of their children will be carriers?
Kristy's genotype
Mark's genotype
2. Practice question on a genetic condition. Cystic fibrosis (CF) is an autosomal, recessive condition that results in mucus buildup in the lungs and digestive system organs. As a result, CF .
This document summarizes a thesis that developed a database comparing gene expression data from axolotl and zebrafish limb regeneration studies. The thesis used a "Rosetta stone" approach to identify 78 axolotl genes that matched human genes and had homologous zebrafish genes. These genes were organized based on gene ontologies, functions during regeneration, and expression changes with TCDD exposure. Key findings included roles for ribosomal proteins, mitochondria, neurite regeneration, extracellular matrix proteins, and genes related to cell differentiation, apoptosis and maintenance.
Biotechnology- Principles and processes investigatory project.Nishant Upadhyay
It is an investigatory project on biology for class XII (12) CBSE on- Biotechnology Principles and processes(chapter 11). It would be very helpful for those whio are searching for a ready made investigatory project.
This document provides an overview of bioinformatics and highlights several key points:
- Bioinformatics has emerged as a field to help analyze the vast amounts of biological data being generated through high-throughput technologies. It integrates biology, computer science, and information technology.
- The size of the human genome and rate of data generation has grown exponentially, necessitating computational approaches. International efforts like the Human Genome Project helped sequence the entire human genome.
- Bioinformatics tools and databases are used to study genomics, transcriptomics, proteomics and more to better understand living systems at the molecular level and enable applications in medicine, agriculture, forensics and more. This work also raises ethical, legal and social considerations.
Presentation made by Abeliovich and Rhinn on the 20th of April, 2017, at the live webinar hosted by Alzforum: http://www.alzforum.org/webinars/webinar-cortex-aging-too-fast-blame-tmem106b-and-progranulin
Systems biology in polypharmacology: explaining and predicting drug secondary...Andrei KUCHARAVY
This document discusses using systems biology approaches to predict and explain off-target drug effects. It notes that unexpected secondary drug effects and lack of therapeutic effects are major reasons drug development fails. The document proposes using computational methods to predict all protein targets a drug may affect and using systems biology to model the consequences, including secondary effects, unexpected therapeutic effects via drug repositioning, and unexpected lack of effects. It outlines a master's project to develop models of polypharmacological drug effects by analyzing networks of drug-affected protein targets and retrieving relevant biological annotation to interpret effects.
This document is a thesis presented by Alexandra Mariela Popa to the University Claude Bernard - Lyon 1 in 2011 for the degree of Doctor of Philosophy. It examines the evolution of recombination and genomic structures through a modeling approach. The thesis analyzes the relationships between the causes, characteristics, and effects of recombination from an evolutionary perspective. Models are developed to compare recombination strategies across species and sexes, study the impact of sex-specific recombination hotspots on GC content evolution, and analyze the impact of recombination on deleterious mutation frequencies in human populations.
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
The so-called “next-generation” sequencing (NGS) technologies allows us, in a short time and in parallel, to sequence massive amounts of DNA, overcoming the limitations of the original Sanger sequencing methods used to sequence the first human genome. NGS technologies have had an enormous impact on biomedical research within a short time frame. This talk will give an overview of these applications with specific examples from Mendelian genomics and cancer research. #h2ony
Why Life is Difficult, and What We MIght Do About ItAnita de Waard
This document discusses connecting biological knowledge through claim-evidence networks. It outlines some of the challenges in biology like variability between specimens and gene expression changes. It then proposes that claim-evidence networks can be used to connect biological knowledge by linking experimental evidence to claims. Steps to build these networks include identifying claims in documents, structuring the evidence in databases, and automatically connecting the claims and evidence. Examples of efforts that link drug interactions to evidence and predict protein interactions across species are provided. However, it notes that more still needs to be done to fully realize this approach.
Epigenetics of the Developing BrainFrances A. Cham pagne .docxSALU18
This document summarizes research on epigenetics and the developing brain. It discusses how:
1) Early life experiences like prenatal stress can induce epigenetic changes through DNA methylation that persist into adulthood and affect brain development and behavior.
2) The quality of parent-offspring interactions, especially the mother-infant relationship, can shape the developing brain through epigenetic pathways. Deprivation of maternal contact in early life leads to lasting effects in animals and humans.
3) Prenatal stress may program the stress response in offspring through epigenetic changes to genes like the glucocorticoid receptor gene in the hippocampus. This highlights how both prenatal and early postnatal environments integrate nature
The document discusses genetic algorithms, which are a class of optimization algorithms inspired by biological evolution. It describes the key components of genetic algorithms, including encoding solutions as chromosomes, initializing a population randomly, evaluating fitness, and applying genetic operators like crossover and mutation to produce new generations. The goal is to evolve increasingly fit solutions over many iterations until an optimal or near-optimal solution is found for the given problem.
The document summarizes research characterizing promoter-associated short tandem repeats (STRs) in the human genome and investigating their potential functional effects by analyzing associations with gene expression (eQTLs) and DNA methylation (mQTLs). The researchers targeted over 6,000 promoter-associated STRs in 120 individuals and identified STR variants correlated with local transcription and methylation, providing evidence that STRs can influence gene regulation and represent an overlooked source of genetic and phenotypic variation.
Network biology: Large-scale integration of data and textLars Juhl Jensen
The document discusses network biology and large-scale data integration. It describes protein-protein interaction networks like STRING that integrate data from curated knowledge, experiments, and predictions. It provides exercises to explore the human insulin receptor (INSR) in STRING, examining the types of evidence that support its interaction with IRS1. It also introduces other integrated networks like STITCH for chemicals and COMPARTMENTS for subcellular localization. Natural language processing techniques like named entity recognition, information extraction, and semantic tagging are used to integrate text data from the literature into these interaction networks.
This document discusses challenges and opportunities in applying mRNA sequencing (mRNAseq) to non-model organisms. It describes using digital normalization as a way to cope with having a massive amount of lamprey mRNAseq data but an incomplete genomic reference. Digital normalization enables assembly and analysis that would otherwise not be possible. The document also discusses applying digital normalization to ascidian mRNAseq data, where it results in substantial time savings and comparable transcriptomes to those assembled without normalization. Finally, it discusses next steps for lamprey transcriptome analysis including enabling various evolutionary and biological studies.
Network biology - Large-scale integration of data and textLars Juhl Jensen
The document discusses network biology and integration of large-scale data and text to build interaction networks. It introduces the STRING database, which contains over 9.6 million proteins and integrates interaction data from curated databases, experiments, textmining, and predictive methods. The document uses human insulin receptor (INSR) as an example to demonstrate searching and analyzing the STRING network, showing evidence from different data sources for its interaction with IRS1. It also introduces other integrated networks in the STRING group including STITCH, COMPARTMENTS, TISSUES and DISEASES.
Genes contribute to behaviors but do not determine them. Studies of twins and genes like 5-HTT and COMT show that:
- Genetics influence traits like intelligence (70%) and likelihood of depression, but environment also matters
- The 5-HTT gene predicts if one will become depressed after stressful life events
- COMT impacts stress response and test/task performance depending on allele (fast or slow acting enzyme)
However, one gene does not cause a behavior and ethics around genetic testing and research are important issues.
Similar to ppgardner-lecture07-genome-function.pdf (20)
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
Sequence alignment & comparative genomics
▶ What’s the difference between homology and analogy?
▶ How homology is estimated?
▶ Where do sequence similarity scores come from?
▶ What are the BLOSUM protein scoring matrices?
▶ How are insertions and deletions scored?
Genome annotation & comparative genomics
An appreciation for:
▶ An overview of some techniques and methods are used for
comparative genomics
▶ An understanding of genome annotation methods, particularly
the advantages and disadvantages of the different methods:
▶ Sequence analysis (ORF finding)
▶ Comparative sequence analysis
▶ Experimental methods (RNAseq & mass-spectroscopy)
Does RNA avoidance dictate protein expression level?Paul Gardner
Selection against mRNA:ncRNA interactions is observed across bacterial and archaeal genomes. Stochastic interactions between abundant ncRNAs and mRNAs may influence protein expression levels by inhibiting translation. Analysis of highly conserved genes in over 1,500 bacterial and 100 archaeal genomes provides evidence that mRNA and ncRNA sequences have evolved to avoid stable interactions, suggesting such interactions are selectively avoided to prevent inaccurate regulation of protein levels.
1) The document discusses machine learning and random forests, a popular machine learning method.
2) Random forests use decision trees built from random subsets of variables and data, and aggregate results to improve accuracy.
3) Examples using random forests to classify iris data are shown, including evaluating variable importance and classification accuracy.
This document discusses clustering and classification techniques for analyzing multivariate data. It begins by explaining that multivariate data involves collecting measurements of multiple features for each item. The document then outlines two main styles of analysis: 1) classification (supervised learning), which involves using known class labels to develop a procedure for classifying new items, and 2) clustering (unsupervised learning), which aims to find groups within the data without known class labels. Common techniques are discussed for both classification and clustering. Examples of applications are also provided.
- Monte Carlo methods use randomness to solve problems by running simulations of random samples and processes. This allows evaluating the significance of observed statistics.
- The modern version was developed in the 1940s for use in nuclear weapons projects. It involves generating random samples from an assumed model and comparing statistics to observed values.
- As an example, points in spatial data can be tested for random distribution by comparing properties of the real data to randomly generated point data, like minimum distances between points.
The document discusses resampling techniques called the jackknife and bootstrap. The jackknife involves deleting each observation from the dataset and recalculating statistics to estimate bias, standard error, and confidence intervals. The bootstrap resamples the dataset with replacement many times to estimate properties of statistics like the mean. Both techniques are used to assess reliability of estimates and account for uncertainty without assumptions about the population distribution. The document provides examples applying these methods to estimate standard deviation, confidence intervals for the median, and properties of regression.
This document contains the notes from a lecture on contingency tables and related statistical methods. It introduces contingency tables and how they can be used to analyze relationships between variables. It discusses Fisher's exact test and the chi-squared test for assessing independence in contingency tables. Examples are provided to demonstrate contingency table analysis and visualization of results. Additional resampling methods like bootstrapping and Monte Carlo simulation are also mentioned.
1. The document discusses regression analysis and linear regression models. It defines key terms used in regression including the famous five values, sums of squares, variance, covariance, correlation, slope, and intercept.
2. Methods for assessing the explanatory power and fit of regression models are presented, including the coefficient of determination (r2), standard errors for slope and intercept, and partitioning total sum of squares.
3. The importance of model checking is emphasized through assessing residuals, influence points, and considering alternative transformations to improve linearity.
This document provides an overview of linear regression analysis. It defines key terms like residuals, error sum of squares (SSE), and the "famous five" values needed to perform regression (sums of x, y, x-squared, y-squared, and x times y). Linear regression finds the line of best fit by minimizing SSE. The slope of the regression line is calculated as the covariance (SSxy) divided by variance (SSx). Regression guarantees to pass through the point of mean x and mean y.
Analysis of covariation and correlationPaul Gardner
The document discusses correlation and covariance in biology. It defines correlation as a relationship between two or more variables, while covariance is a measure of how two random variables vary together. The document provides examples of calculating correlation coefficients between variables using formulas for covariance, variance, and the correlation coefficient r. It warns that correlation does not necessarily imply causation and provides tips for interpreting correlated data.
This document discusses statistical tests for comparing two samples, specifically Fisher's F test, Student's t-test, and Wilcoxon rank-sum test. It provides an example comparing ozone concentrations between two market gardens (Gardens A and B) using these tests. The F test showed the variances between gardens were not significantly different. The t-test and Wilcoxon test both found the mean ozone concentration was significantly higher in Garden B than Garden A.
The document discusses analyzing single samples and making inferences about population parameters. It addresses three key questions for single samples: what is the mean, is the mean significantly different from expectations, and how certain we are of the mean estimate. Parametric or non-parametric methods must be used depending on whether the data meets assumptions like normality. The normal distribution and central limit theorem allow drawing inferences from sample means. Examples demonstrate calculating probabilities for different parts of the normal distribution using z-scores and the standard normal distribution.
The document discusses measures of centrality in biology. It lists some common measures of centrality used to describe datasets, including the mean. It then shows a graph with many data points distributed around a central mean.
The document provides an overview of key concepts for the BIOL209: Fundamentals course, including:
- Instructor-provided slides had no impact on attendance but adversely affected exam performance.
- Note-taking and self-testing improves learning. Some students may experience math anxiety or stereotype threat.
- The scientific method involves forming hypotheses and testing them through experimentation and analysis of data.
- Understanding statistical and experimental design principles is important for reproducing and interpreting results. Randomization, replication, and controlling for confounding variables strengthen experimental conclusions.
Random RNA interactions control protein expression in prokaryotesPaul Gardner
Presented at the NZSBMB/NZMS Conference in Christchurch 2016
CustomScience Award
A core assumption of gene expression analysis is that mRNA abundances broadly correlate with protein abundance, but these two can be imperfectly correlated. Some of the discrepancy can be accounted for by two important mRNA features: codon usage and mRNA secondary structure. We present a new global factor, called mRNA:ncRNA avoidance, and provide evidence that avoidance increases translational efficiency. We demonstrate a strong selection for the avoidance of stochastic mRNA:ncRNA interactions across prokaryotes, and that these have a greater impact on protein abundance than mRNA structure or codon usage. By generating synonymously variant green fluorescent protein (GFP) mRNAs with different potential for mRNA:ncRNA interactions, we demonstrate that GFP levels correlate well with interaction avoidance. Therefore, taking stochastic mRNA:ncRNA interactions into account enables precise modulation of protein abundance.
Avoidance of stochastic RNA interactions can be harnessed to control protein ...Paul Gardner
Presented at the Computational RNA Biology conference in Hinxton, 17-19th October, 2016.
https://coursesandconferences.wellcomegenomecampus.org/events/item.aspx?e=584
A meta-analysis of computational biology benchmarks reveals predictors of pro...Paul Gardner
This meta-analysis of computational biology benchmarks found no significant predictors of algorithm accuracy. The author analyzed 84 benchmarks comparing the accuracy and speed of 203 bioinformatics methods. Metrics like journal impact factor, author citation counts, and method age showed weak or no correlation with accuracy rank. While fast methods underwent more development iterations, speed was also not predictive of accuracy. The author concludes that factors like author/journal prestige do not guarantee software quality, and heuristic approaches can perform as well as mathematically rigorous ones. Overall, the study found no clear predictors of a method's programming accuracy based on existing benchmarks.
This document provides an introduction to non-coding RNAs (ncRNAs). It discusses how RNA was proposed to have come before DNA and proteins in the RNA world hypothesis. It notes that RNA can store genetic information like DNA and have catalytic abilities like proteins. The document outlines different types and functions of ncRNAs, including how some act as guides for chemical modifications, regulate gene expression as sponges for small RNAs or proteins, or sense environmental signals as riboswitches or thermosensors. Examples of conserved ncRNA families and mechanisms are briefly described.
Anti-Universe And Emergent Gravity and the Dark UniverseSérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Sérgio Sacani
Context. The observation of several L-band emission sources in the S cluster has led to a rich discussion of their nature. However, a definitive answer to the classification of the dusty objects requires an explanation for the detection of compact Doppler-shifted Brγ emission. The ionized hydrogen in combination with the observation of mid-infrared L-band continuum emission suggests that most of these sources are embedded in a dusty envelope. These embedded sources are part of the S-cluster, and their relationship to the S-stars is still under debate. To date, the question of the origin of these two populations has been vague, although all explanations favor migration processes for the individual cluster members. Aims. This work revisits the S-cluster and its dusty members orbiting the supermassive black hole SgrA* on bound Keplerian orbits from a kinematic perspective. The aim is to explore the Keplerian parameters for patterns that might imply a nonrandom distribution of the sample. Additionally, various analytical aspects are considered to address the nature of the dusty sources. Methods. Based on the photometric analysis, we estimated the individual H−K and K−L colors for the source sample and compared the results to known cluster members. The classification revealed a noticeable contrast between the S-stars and the dusty sources. To fit the flux-density distribution, we utilized the radiative transfer code HYPERION and implemented a young stellar object Class I model. We obtained the position angle from the Keplerian fit results; additionally, we analyzed the distribution of the inclinations and the longitudes of the ascending node. Results. The colors of the dusty sources suggest a stellar nature consistent with the spectral energy distribution in the near and midinfrared domains. Furthermore, the evaporation timescales of dusty and gaseous clumps in the vicinity of SgrA* are much shorter ( 2yr) than the epochs covered by the observations (≈15yr). In addition to the strong evidence for the stellar classification of the D-sources, we also find a clear disk-like pattern following the arrangements of S-stars proposed in the literature. Furthermore, we find a global intrinsic inclination for all dusty sources of 60 ± 20◦, implying a common formation process. Conclusions. The pattern of the dusty sources manifested in the distribution of the position angles, inclinations, and longitudes of the ascending node strongly suggests two different scenarios: the main-sequence stars and the dusty stellar S-cluster sources share a common formation history or migrated with a similar formation channel in the vicinity of SgrA*. Alternatively, the gravitational influence of SgrA* in combination with a massive perturber, such as a putative intermediate mass black hole in the IRS 13 cluster, forces the dusty objects and S-stars to follow a particular orbital arrangement. Key words. stars: black holes– stars: formation– Galaxy: center– galaxies: star formation
This presentation offers a general idea of the structure of seed, seed production, management of seeds and its allied technologies. It also offers the concept of gene erosion and the practices used to control it. Nursery and gardening have been widely explored along with their importance in the related domain.
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆Sérgio Sacani
Context. The early-type galaxy SDSS J133519.91+072807.4 (hereafter SDSS1335+0728), which had exhibited no prior optical variations during the preceding two decades, began showing significant nuclear variability in the Zwicky Transient Facility (ZTF) alert stream from December 2019 (as ZTF19acnskyy). This variability behaviour, coupled with the host-galaxy properties, suggests that SDSS1335+0728 hosts a ∼ 106M⊙ black hole (BH) that is currently in the process of ‘turning on’. Aims. We present a multi-wavelength photometric analysis and spectroscopic follow-up performed with the aim of better understanding the origin of the nuclear variations detected in SDSS1335+0728. Methods. We used archival photometry (from WISE, 2MASS, SDSS, GALEX, eROSITA) and spectroscopic data (from SDSS and LAMOST) to study the state of SDSS1335+0728 prior to December 2019, and new observations from Swift, SOAR/Goodman, VLT/X-shooter, and Keck/LRIS taken after its turn-on to characterise its current state. We analysed the variability of SDSS1335+0728 in the X-ray/UV/optical/mid-infrared range, modelled its spectral energy distribution prior to and after December 2019, and studied the evolution of its UV/optical spectra. Results. From our multi-wavelength photometric analysis, we find that: (a) since 2021, the UV flux (from Swift/UVOT observations) is four times brighter than the flux reported by GALEX in 2004; (b) since June 2022, the mid-infrared flux has risen more than two times, and the W1−W2 WISE colour has become redder; and (c) since February 2024, the source has begun showing X-ray emission. From our spectroscopic follow-up, we see that (i) the narrow emission line ratios are now consistent with a more energetic ionising continuum; (ii) broad emission lines are not detected; and (iii) the [OIII] line increased its flux ∼ 3.6 years after the first ZTF alert, which implies a relatively compact narrow-line-emitting region. Conclusions. We conclude that the variations observed in SDSS1335+0728 could be either explained by a ∼ 106M⊙ AGN that is just turning on or by an exotic tidal disruption event (TDE). If the former is true, SDSS1335+0728 is one of the strongest cases of an AGNobserved in the process of activating. If the latter were found to be the case, it would correspond to the longest and faintest TDE ever observed (or another class of still unknown nuclear transient). Future observations of SDSS1335+0728 are crucial to further understand its behaviour. Key words. galaxies: active– accretion, accretion discs– galaxies: individual: SDSS J133519.91+072807.4
Microbial interaction
Microorganisms interacts with each other and can be physically associated with another organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont or located within another organism as endobiont.
Microbial interaction may be positive such as mutualism, proto-cooperation, commensalism or may be negative such as parasitism, predation or competition
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as the relationship in which each organism in interaction gets benefits from association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
Mutualistic relationship is very specific where one member of association cannot be replaced by another species.
Mutualism require close physical contact between interacting organisms.
Relationship of mutualism allows organisms to exist in habitat that could not occupied by either species alone.
Mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are the association of specific fungi and certain genus of algae. In lichen, fungal partner is called mycobiont and algal partner is called
II. Syntrophism:
It is an association in which the growth of one organism either depends on or improved by the substrate provided by another organism.
In syntrophism both organism in association gets benefits.
Compound A
Utilized by population 1
Compound B
Utilized by population 2
Compound C
utilized by both Population 1+2
Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B but cannot metabolize beyond compound B without co-operation of population 2. Population 2is unable to utilize compound A but it can metabolize compound B forming compound C. Then both population 1 and 2 are able to carry out metabolic reaction which leads to formation of end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane produced by methanogenic bacteria depends upon interspecies hydrogen transfer by other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 utilizing carbohydrates which is then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arobinosus and Enterococcus faecalis:
In the minimal media, Lactobacillus arobinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arobinosus occurs in which E. faecalis require folic acid
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Sérgio Sacani
Wereport the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ±0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
Embracing Deep Variability For Reproducibility and Replicability
Abstract: Reproducibility (aka determinism in some cases) constitutes a fundamental aspect in various fields of computer science, such as floating-point computations in numerical analysis and simulation, concurrency models in parallelism, reproducible builds for third parties integration and packaging, and containerization for execution environments. These concepts, while pervasive across diverse concerns, often exhibit intricate inter-dependencies, making it challenging to achieve a comprehensive understanding. In this short and vision paper we delve into the application of software engineering techniques, specifically variability management, to systematically identify and explicit points of variability that may give rise to reproducibility issues (eg language, libraries, compiler, virtual machine, OS, environment variables, etc). The primary objectives are: i) gaining insights into the variability layers and their possible interactions, ii) capturing and documenting configurations for the sake of reproducibility, and iii) exploring diverse configurations to replicate, and hence validate and ensure the robustness of results. By adopting these methodologies, we aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical aspects.
https://hal.science/hal-04582287
2. Objectives for lecture 07
▶ An understanding of:
▶ How “function” in a genomic context can be defined. This is a
surprisingly philosophical problem.
▶ Some strategies for determining if a genomic region is likely to
be functional or not.
OK Go (2010) This Too Shall Pass.
Rube Goldberg machine
3. What are the “functional” bits in a genome?
HG38 chr11::62,851,834-62,855,980.
A famous region...
4. What is “functional”?
▶ ENCODE: “biochemically functional” i.e. anything with a
reproducible biochemical activity is functional.
▶ Biological philosophy:
▶ “That A implies B does not entail that B must imply A.”
▶ I.e. Protein & ncRNA genes are transcribed, does not imply
that all transcribed elements are genes (or functional).
▶ “Is the mere fact that X causes Y enough to say that Y is X’s
proper function?”
▶ Doolittle & Brunet describe three posibilities based upon
Darwinian evolution:
▶ Selected effect: a trait that was under positive selection in
previous generations (now maintained by negative selection).
▶ Constructive neutral evolution: negative selection of traits
that were never under positive selection.
▶ Exaptation: traits shaped by natural selection for some use
other than their current one.
▶ I.e. evidence of evolutionary selection is required before
claiming a function.
Doolittle & Brunet (2017) On causal roles and selected effects: our genome is mostly junk. BMC Biology.
5. What is your theoretical model of the cell?
Image sources: Wikimedia Commons & Prof. David S. Goodsell
6. What is “functional”?
Doolittle & Brunet (2017) On causal roles and selected effects: our genome is mostly junk. BMC Biology.
7. What is “functional”? – the Gardner simplification
EVIDENCE FOR MAINTENANCE BY
NEGATIVE SELECTION, OR
CURRENTLY UNDER POSITIVE
SELECTION
Based on Doolittle & Brunet (2017) On causal roles and selected effects: our genome is mostly junk. BMC
Biology.
8. Some useful definitions
▶ Negative or purifying selection: the selective removal of alleles
that are deleterious to fitness.
▶ Positive selection: Positive selection is the process by which
alleles that increase organismal fitness also increase in
frequency in a population.
Booker et al. (2017) Detecting positive selection in the genome. BMC Biology.
9. ▶ Legend has it that the charismatic inventor of the
“Kahungunu Wave” left many descendants (Ngā Tukemata o
Kahungunu)
10. Some useful definitions
▶ Constructive neutral evolution: how complex biological
systems might arrive via a series of neutral events.
▶ Almost synonymous with “Negative Selection”, as used by
Doolittle.
▶ Exaptation: a trait (or gene) evolved to serve one particular
function, but subsequently used to serve another.
Lukes̆ et al. (2011) How a Neutral Evolutionary Ratchet Can Build Cellular Complexity. IUBMB Life.
11. A great example of Exaptation
▶ Xist (X-inactive specific transcript) is a long non-coding RNA
required for X-inactivation in placental mammals
▶ Xist derived/exapted from a protein-coding pseudogene
(Lnx3)
▶ Mutually exclusive phylogenetic pattern with Lnx3, sequence
& synteny preserved and out-of-frame mutations in the
mammals...
Duret et al (2006) The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene.
Science.
12. Doolittle & Brunet quotes
▶ On accumulating transposable elements: “There is no
selective advantage to individuals within a species for doing
this, and evolution has no foresight [15]!”
▶ “A function concept embedded in one definitional framework
(SE) was (supposedly) refuted by empirical data based on
quite another (CR).”
▶ “it is the irrelevance of the majority of TEs (at least half of
our own DNA) to fitness at the organismal level that means
that “junk” is likely always to be a reasonable way to refer to
it”
▶ Final sentence: “If lungfishes have junk, or at the least
extraordinarily weakly or diffusely functional DNA in their
genomes, why is it that we think we do not?”
Doolittle & Brunet (2017) On causal roles and selected effects: our genome is mostly junk. BMC Biology.
13. What is “functional”? – a more holistic point of view.
▶ Argue for using combinations of biochemical, evolutionary,
and genetic evidence to elucidate genome function in human
biology and disease.1
Kellis et al. (2014) Defining functional DNA elements in the human genome. PNAS.
1. there are many different levels of evolutionary evidence!
14. Our attempt...
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1kGP SNP Count
1kGP Average MAF
gnomAD Average MAF
gnomAD SNP Count
Covariance Min E−value
Secondary Structure MFE
Interaction Energy Avg
Interaction Energy Min
gnomAD SNP Density
Accessibility
Repeat Distance Sum
1kGP SNP Density
Neutral Predictor
Repeat Distance Min
Genomic copies (E−val<0.01)
RNAalifold Score
Fickett score
GERP Score Avg
GC%
Primary Cell RPKM
Covariance Max
RNAcode Coding Potential
Tissue RPKM
Primary Cell MRD
PhastCons Avg
GERP Score Max
Tissue MRD
PhyloP Avg
PhyloP Max
PhastCons Max
0 50 100 150
Mean Decrease in Protein−coding
Gini Coefficient over 100 runs
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1kGP SNP Count
1kGP Average MAF
Fickett score
gnomAD Average MAF
gnomAD SNP Count
RNAcode Coding Potential
gnomAD SNP Density
Neutral Predictor
RNAalifold Score
Genomic copies (E−val<0.01)
Covariance Max
1kGP SNP Density
Covariance Min E−value
GC%
Repeat Distance Min
Repeat Distance Sum
Interaction Energy Min
PhyloP Max
GERP Score Max
GERP Score Avg
Secondary Structure MFE
Accessibility
Interaction Energy Avg
PhastCons Max
PhyloP Avg
PhastCons Avg
Tissue MRD
Tissue RPKM
Primary Cell RPKM
Primary Cell MRD
0 50 100
Mean Decrease in Short ncRNA
Gini Coefficient over 100 runs
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
1kGP SNP Count
Genomic copies (E−val<0.01)
gnomAD SNP Count
PhastCons Max
RNAalifold Score
Secondary Structure MFE
Accessibility
Interaction Energy Avg
1kGP SNP Density
gnomAD Average MAF
PhyloP Avg
1kGP Average MAF
Interaction Energy Min
PhastCons Avg
Covariance Min E−value
Repeat Distance Min
Neutral Predictor
RNAcode Coding Potential
Repeat Distance Sum
gnomAD SNP Density
Fickett score
PhyloP Max
GERP Score Avg
GERP Score Max
Covariance Max
Primary Cell RPKM
GC%
Tissue RPKM
Tissue MRD
Primary Cell MRD
20 40 60
Mean Decrease in lncRNA
Gini Coefficient over 100 runs
Feature type
●
●
●
●
●
●
Intrinsic sequence
Sequence conservation
Transcriptome expression
Genomic repeat association
Protein/RNA specific features
Population variation
Proteins Short ncRNAs Long ncRNAs
Cooper & Gardner (2021) Features of Functional Human Genes. bioRxiv.
15. Random sequence experiments identify lots of transcription
(& translation)
Recall Sean Eddy’s “Random Chromosome Project”?
▶ Neme et al (2017) “Random sequences are an abundant
source of bioactive RNAs or peptides”. Nature Ecology &
Evolution.
▶ ... “by expressing clones with random sequences in E. coli” ...
“Contrary to expectations, we find that random sequences
with bioactivity are not rare.”
▶ de Boer et al (2019) “Deciphering eukaryotic gene-regulatory
logic with 100 million random promoters”. Nature
Biotechnology.
▶ ... “we measure the expression output of > 100 million
synthetic yeast promoter sequences that are fully random” ...
“ it is often tacitly assumed that functional TFBSs are rare” ...
“TFBSs are common in random DNA” ... “Abundant weak
regulatory interactions explain most of expression level”
NB. expression from random sequence is also reproducible!
16. Based upon the Doolittle & Brunet classification...
▶ Are the following elements likely to be a “Selected
Effect”, “Constructive Neutral Evolution”, “Exapted” or
no beneficial effect?
Doolittle & Brunet (2017) On causal roles and selected effects: our genome is mostly junk. BMC Biology.
18. Ultra-conserved elements (UCRs)
▶ Highly conserved sequence (in mammals)
Bejerano et al. (2004) Ultraconserved Elements in the Human Genome. Science.
Snetkova et al. (2022) Perfect and imperfect views of ultraconserved sequences. Nature Reviews Genetics.
UCNEbase: ultra-conserved non-coding elements database
19. Human accelerated regions (HARs)
Positive selection since humans diverged from the human and
chimp ancestor.
▶ Whalen & Pollard (2022) Enhancer Function and Evolutionary
Roles of Human Accelerated Regions. Annual Review of
Genetics.
20. Example: HAR1
▶ Compare genome sequences, find otherwise conserved regions
that are evolving rapidly in humans HAR1 in particular,
appears to be involved in cortical development
▶ Pollard et al. (2006) Forces Shaping the Fastest Evolving
Regions in the Human Genome. PLOS Genetics.
▶ Pollard et al. (2006) An RNA gene expressed during cortical
development evolved rapidly in humans. Nature.
Scale
chr20:
GWAS Catalog
Chimp
Gorilla
Orangutan
Gibbon
Rhesus
Crab-eating_macaque
Baboon
Green_monkey
Marmoset
Bushbaby
Chinese_tree_shrew
Squirrel
Lesser_Egyptian_jerboa
Prairie_vole
Chinese_hamster
Mouse
Rat
Naked_mole-rat
Guinea_pig
Chinchilla
Brush-tailed_rat
Rabbit
Pika
Pig
Alpaca
Bactrian_camel
Killer_whale
Tibetan_antelope
Cow
Sheep
Domestic_goat
White_rhinoceros
Cat
Dog
Ferret_
Panda
Pacific_walrus
Weddell_seal
Black_flying-fox
Megabat
David’s_myotis_(bat)
Little_brown_bat
Big_brown_bat
Hedgehog
Shrew
Elephant
Cape_elephant_shrew
Manatee
Cape_golden_mole
Tenrec
Armadillo
Opossum
Wallaby
Chicken
Common dbSNP(153)
SINE
LINE
LTR
DNA
Simple
Low Complexity
Satellite
RNA
Other
Unknown
100 bases hg38
63,102,050 63,102,100 63,102,150 63,102,200 63,102,250 63,102,300 63,102,350 63,102,400 63,102,450
Your Sequence from Blat Search
GENCODE V39 (4 items filtered out)
Basic Gene Annotation Set from GENCODE Version 39 (Ensembl 105)
Pseudogene Annotation Set from GENCODE Version 39 (Ensembl 105)
NHGRI-EBI Catalog of Published Genome-Wide Association Studies
100 vertebrates Basewise Conservation by PhyloP
Multiz Alignments of 100 Vertebrates
Short Genetic Variants from dbSNP release 153
Repeating Elements by RepeatMasker
HAR1B
HAR1B
HAR1B
ENSG00000274915
HAR1A
HAR1B
HAR1B
HAR1B
ENSG00000274915
HAR1A
Wikipedia article: Human accelerated regions
HAR1 in the UCSC genome browser.
21. 7SL RNA
▶ Also called SRP RNA, part of the signal recognition particle,
involved in protein transport
▶ Universally conserved, orthologues are found in archaea,
bacteria and eukaryotes.
Scale
chr14:
GWAS Catalog
Chimp
Gorilla
Orangutan
Gibbon
Rhesus
Crab-eating_macaque
Baboon
Green_monkey
Marmoset
Bushbaby
Chinese_tree_shrew
Squirrel
Lesser_Egyptian_jerboa
Prairie_vole
Chinese_hamster
Mouse
Rat
Naked_mole-rat
Guinea_pig
Chinchilla
Brush-tailed_rat
Rabbit
Pika
Pig
Alpaca
Bactrian_camel
Killer_whale
Tibetan_antelope
Cow
Sheep
Domestic_goat
White_rhinoceros
Cat
Dog
Ferret_
Panda
Pacific_walrus
Weddell_seal
Black_flying-fox
Megabat
David’s_myotis_(bat)
Little_brown_bat
Big_brown_bat
Hedgehog
Shrew
Elephant
Cape_elephant_shrew
Manatee
Cape_golden_mole
Tenrec
Armadillo
Opossum
Wallaby
Chicken
Common dbSNP(153)
SINE
LINE
LTR
DNA
Simple
Low Complexity
Satellite
RNA
Other
Unknown
200 bases hg38
49,586,450 49,586,500 49,586,550 49,586,600 49,586,650 49,586,700 49,586,750 49,586,800 49,586,850 49,586,900 49,586,950 49,587,000 49,587,050
Basic Gene Annotation Set from GENCODE Version 39 (Ensembl 105)
Pseudogene Annotation Set from GENCODE Version 39 (Ensembl 105)
NHGRI-EBI Catalog of Published Genome-Wide Association Studies
Vertebrate Multiz Alignment & Conservation (100 Species)
Multiz Alignments of 100 Vertebrates
Short Genetic Variants from dbSNP release 153
Repeating Elements by RepeatMasker
RPS29
RPS29
RN7SL1
Cons 100 Verts
Wikipedia article: Signal recognition particle
7SL in the UCSC genome browser.
22. Alu elements (I)
▶ Transposable element, derived from 7SL RNA. Over one
million copies in the human genome
▶ Breaks a LOT of bioinformatics tools, frequently masked from
genome sequences
Wikipedia article: Alu element
Dfam entry for the AluY family
23. Alu elements (II)
▶ Is the Alu element on human chromosome 14, between
31,474,114 and 31,473,798 likely to be functional?
Scale
chr14:
GWAS Catalog
Chimp
Gorilla
Orangutan
Gibbon
Rhesus
Crab-eating_macaque
Baboon
Green_monkey
Marmoset
Bushbaby
Chinese_tree_shrew
Squirrel
Lesser_Egyptian_jerboa
Prairie_vole
Chinese_hamster
Mouse
Rat
Naked_mole-rat
Guinea_pig
Chinchilla
Brush-tailed_rat
Rabbit
Pika
Pig
Alpaca
Bactrian_camel
Killer_whale
Tibetan_antelope
Cow
Sheep
Domestic_goat
White_rhinoceros
Cat
Dog
Ferret_
Panda
Pacific_walrus
Weddell_seal
Black_flying-fox
Megabat
David’s_myotis_(bat)
Little_brown_bat
Big_brown_bat
Hedgehog
Shrew
Elephant
Cape_elephant_shrew
Manatee
Cape_golden_mole
Tenrec
Armadillo
Opossum
Wallaby
Chicken
Common dbSNP(153)
SINE
LINE
LTR
DNA
Simple
Low Complexity
Satellite
RNA
Other
Unknown
100 bases hg38
31,473,750 31,473,800 31,473,850 31,473,900 31,473,950 31,474,000 31,474,050 31,474,100 31,474,150
Your Sequence from Blat Search
GENCODE V39
Basic Gene Annotation Set from GENCODE Version 39 (Ensembl 105)
Pseudogene Annotation Set from GENCODE Version 39 (Ensembl 105)
NHGRI-EBI Catalog of Published Genome-Wide Association Studies
100 vertebrates Basewise Conservation by PhyloP
Multiz Alignments of 100 Vertebrates
Short Genetic Variants from dbSNP release 153
Repeating Elements by RepeatMasker
An Alu element in the UCSC genome browser.
Mills et al. (2007) Which transposable elements are active in the human genome?
24. The main points
▶ “Function” measured in a variety of ways:
▶ Phenotypic: “biochemical” activity
▶ Evolutionary: evidence of positive/negative selection
▶ Genetic: knockout or mutation and compensation/complement
▶ A Darwinian definition of “function”: phenotype/activity
AND evolutionary evidence (SE, CNE or Exaptation)
▶ Evolutionary evidence can be varied: usually based on
sequence conservation or reduced SNPs
▶ Some suggest overlapping evidence of biochemical activity &
evolutionary conservation (i.e. regions under negative
selection).
25. Self-evaluation exercises
▶ Compare and contrast the ENCODE definition of function to
the Darwinian definition proposed by Doolittle & Brunet
(2017).
▶ Outline a method for determining if a predicted mouse gene is
functional. What are potential pitfalls for the approach?
▶ Where do human-accelarated regions, ultraconserved regions,
Xist, BC1 RNA and Junk DNA lie in Doolittle & Brunet
(2017) classification of functional DNA?
▶ Justify your answers!
26. Further reading
▶ Doolittle & Brunet (2017) On causal roles and selected
effects: our genome is mostly junk. BMC Biology.
▶ Kellis et al. (2014) Defining functional DNA elements in the
human genome. PNAS.
▶ Neme et al (2017) Random sequences are an abundant source
of bioactive RNAs or peptides. Nature Ecology & Evolution.
Post questions on the FAQ GoogleDoc:
https://docs.google.com/document/d/1PQd dp7C 0cXA8SwUv-
qrkTOj8c8fUAt-U Z5dg2yc8/edit?usp=sharing