This document provides an overview of comparative genomics. It defines comparative genomics as combining genomic data and evolutionary biology to study genome structure, evolution and function. It discusses three levels of genome comparison: bulk properties like chromosome size and number, whole genome sequence similarity and organization, and functional genome features. The history of experimental comparative genomics is reviewed, noting that practical comparisons predated widespread genome sequencing.
Guest lecture on comparative genomics for University of Dundee BS32010, delivered 21/3/2016
Workshop/other materials available at DOI:10.5281/zenodo.49447
2. Part 1
• What is comparative genomics?
• Levels of genome comparison
  • bulk, whole sequence, features
• A Brief History of Comparative Genomics
  • experimental comparative genomics
• Computational Comparative Genomics
  • Bulk properties
  • Whole genome comparisons
Part 2
• Genome feature comparisons
3. What is Comparative Genomics?
The combination of genomic data and comparative and evolutionary biology to address questions of genome structure, evolution and function.
4. What is Comparative Genomics?
“Nothing in biology makes sense, except in the light of evolution”
Theodosius Dobzhansky
5. Why Comparative Genomics?
• Genomes describe heritable characteristics
• Related organisms share ancestral genomes
• Functional elements encoded in genomes are common to related organisms
• Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
• Genome comparisons can be informative, even for distantly-related organisms
6. Why Comparative Genomics?
• BUT:
  • Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  • Phenotypic plasticity: responses to temperature, stress, environment, etc.
7. Why Comparative Genomics?
• Genomic differences can underpin phenotypic (morphological or physiological) differences.
• Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
• Genome comparisons aid identification of functional elements on the genome.
• Studying genomic changes reveals evolutionary processes and constraints.
8. Why Comparative Genomics?
[Figure adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058; labels: species, time, contemporary organisms]
• Comparison within species (e.g. isolate-level – or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.
9. Why Comparative Genomics?
[Figure labels: genus, time, contemporary organisms]
• Comparison within genus (e.g. species-level): what genome features show evidence of selective pressure, and in which species?
10. Why Comparative Genomics?
[Figure labels: subgroup, time, contemporary organisms]
• Comparison within subgroup (e.g. genus-level): what are the core set of genome features that define a subgroup or genus?
11. The E. coli long-term evolution experiment
• Run by the Lenski lab, Michigan State University since 1988
• http://myxo.css.msu.edu/ecoli/
• 12 flasks, citrate usage selection
• 50,000 generations of Escherichia coli!
• Cultures propagated every day
• Every 500 generations (75 days), mixed-population samples stored
• Mean fitness estimated at 500 generation intervals
Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science. doi:10.1126/science.1243357
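The figures quoted on this slide can be checked with a little arithmetic. The sketch below is illustrative only: it uses the slide's numbers (500 generations per 75 days, 50,000 generations in total) and assumes uninterrupted daily propagation at a constant rate, which is a simplification. The function and variable names are hypothetical.

```python
# Figures quoted on the slide.
GENERATIONS_PER_SAMPLE = 500   # generations between stored samples
DAYS_PER_SAMPLE = 75           # days between stored samples
TOTAL_GENERATIONS = 50_000     # generations reached by the experiment


def generations_per_day(generations=GENERATIONS_PER_SAMPLE, days=DAYS_PER_SAMPLE):
    """Average number of generations elapsed per day of propagation."""
    return generations / days


def days_required(total_generations, rate):
    """Days needed to reach total_generations at a constant daily rate (assumption)."""
    return total_generations / rate


rate = generations_per_day()
total_days = days_required(TOTAL_GENERATIONS, rate)
stored_samples = TOTAL_GENERATIONS // GENERATIONS_PER_SAMPLE

print(f"{rate:.2f} generations per day")
print(f"{total_days:.0f} days (about {total_days / 365.25:.1f} years) of propagation")
print(f"{stored_samples} sampling points per flask")
```

Under these assumptions the experiment runs at roughly 6.7 generations per day, so 50,000 generations is about 7,500 days of daily transfers and 100 stored sampling points per flask.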
12. Comparative Genomics in the News
Sankararaman et al. (2014) Nature. doi:10.1038/nature12961
• Neanderthal alleles:
  • Aid adaptation outwith Africa
  • Associated with disease risk
  • Reduce male fertility
13. Levels of Genome Comparison
Genomes are complex, and can be compared on a range of conceptual levels - both practically and in silico.
14. Three broad levels of comparison
• Bulk Properties
  • chromosome/plasmid counts and sizes,
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (synteny), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.
15. A Brief History of Experimental Comparative Genomics
You don’t have to sequence genomes to compare them (but it helps).
16. Genome Comparisons Predate NGS
• Sequence data was not always cheap and abundant
• Practical, experimental genome comparisons were needed
17. Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
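The "calculate per genome, then compare" pattern can be sketched in silico as follows. This is a minimal illustration, not the lecture's own code: the two sequences are short made-up strings standing in for genomes, and `bulk_properties` is a hypothetical helper name.

```python
from collections import Counter


def bulk_properties(sequence):
    """Return simple bulk properties (length and GC fraction) for one sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    # Count only unambiguous bases so that N or gap symbols do not skew the GC fraction.
    acgt_total = sum(counts[base] for base in "ACGT")
    gc_total = counts["G"] + counts["C"]
    return {
        "length": len(seq),
        "gc_fraction": gc_total / acgt_total if acgt_total else 0.0,
    }


# Toy sequences (hypothetical data, far shorter than any real genome).
genome_a = "ATGCGCGCTTAA"
genome_b = "ATATATGCATAT"

# Step 1: calculate the values for each genome independently.
props_a = bulk_properties(genome_a)
props_b = bulk_properties(genome_b)

# Step 2: compare the per-genome values.
gc_difference = props_a["gc_fraction"] - props_b["gc_fraction"]
print(f"A: {props_a}")
print(f"B: {props_b}")
print(f"GC fraction difference (A - B): {gc_difference:.3f}")
```

Each genome is summarised on its own, so adding a third genome needs no realignment or pairwise step. This is what distinguishes bulk comparisons from the whole-genome comparisons later in the deck.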
19. Chromosome Counts/Size
• The chromosome counts/ploidy of organisms can vary widely
  • Escherichia coli: 1 (but plasmids…)
  • Rice (Oryza sativa): 24 (but mitochondria, plastids etc…)
  • Human (Homo sapiens): 46, diploid
  • Adders-tongue (Ophioglossum reticulatum): up to 1260
  • Domestic (but not wild) wheat somatic cells hexaploid, gametes haploid
• Physical genome size (related to sequence length) can also vary greatly
• Genome size and chromosome count do not indicate organism ‘complexity’
• Still surprises to be found in physical study of chromosomes! (e.g. Hi-C)
Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev Genet. doi:10.1038/nrg3375
20. Nucleotide Content
• Experimental approaches for accurate measurement
  • e.g. use radiolabelled monophosphates, calculate proportions using chromatography
Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181
22. Whole Genome Comparisons
• Requires two genomes: “reference” and “comparator”
• Experiment produces a comparative result, dependent on the choice of genomes
• Methods mostly based around direct or indirect DNA hybridisation
  • DNA-DNA hybridisation
  • Comparative Genomic Hybridisation (CGH)
  • Array Comparative Genomic Hybridisation (aCGH)
23. DNA-DNA Hybridisation (DDH)
• Several methods based around the same principle
  1. Denature a mixture of genomic DNA from organisms A and B
  2. Allow to anneal: hybrids result (reassociation ≈ similarity)
Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
25. DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s: Homo shares a more recent common ancestor with Pan than with Gorilla (this was previously in dispute)
Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980
26. Comparative Genomic Hybridisation
• Two genomes, “reference” and “test”, are labelled (red and green: a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
• Differences in red/green intensity, mapped by microscopy, correspond to the relative relationship of the reference and test genomes to the “normal” genome
• Comparisons within species (or individual, for tumours); copy number variations (CNV)
• Labour-intensive, low-resolution
27. Comparative Genomic Hybridisation
• Image analysis required: intensity along medial axis.
Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102
Epigenetics: hybridising methylated DNA
28. Array Comparative Genomic Hybridisation
• Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
• gDNA, cDNA, etc. fluorescently-labelled and hybridised to the array
• Smaller sample sizes cf. CGH, automatable, high-throughput, high-res
• Identifies copy number variation (CNV) and segmental duplication
Pollack et al. (1999) Nat. Genet. doi:10.1038/12640
30. Chromosomal Rearrangements
• Genomes are dynamic, and undergo large-scale changes
• Hybridisation used to map genome rearrangement/duplication
  • Separate chromosomes electrophoretically
  • Apply single gene hybridising probes
  • Reciprocal hybridisations indicate translocations
Fischer et al. (2000) Nature. doi:10.1038/35013058
31. Diagnostic PCR/MLST
• Define a set of regions (usually genes):
  • conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
• and:
  • divergent enough that hybridising probes can distinguish between groups
• or:
  • sequence the amplification products
    • Sequence variants given numbers
    • Number profiles define groups
    • Track evolution by minimum spanning trees (MST)
• http://pubmlst.org/
Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
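The numbering scheme on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PubMLST procedure: the function name `assign_profiles`, the isolate and locus names, and the toy sequences are all hypothetical, and allele numbers are simply given in order of first appearance.

```python
def assign_profiles(isolates):
    """Assign allele numbers per locus and return each isolate's number profile.

    `isolates` maps isolate name -> {locus name: amplicon sequence}.
    Allele numbers are given in order of first appearance (an assumption;
    real MLST schemes use curated, centrally assigned numbers).
    """
    allele_numbers = {}  # locus -> {sequence: allele number}
    profiles = {}
    for name, loci in isolates.items():
        profile = []
        for locus in sorted(loci):
            known = allele_numbers.setdefault(locus, {})
            # A previously unseen sequence variant gets the next free number
            number = known.setdefault(loci[locus], len(known) + 1)
            profile.append(number)
        profiles[name] = tuple(profile)
    return profiles


# Toy data: three hypothetical isolates typed at two hypothetical loci
isolates = {
    "iso1": {"locusA": "ACGT", "locusB": "GGCA"},
    "iso2": {"locusA": "ACGT", "locusB": "GGTA"},
    "iso3": {"locusA": "ACGT", "locusB": "GGCA"},
}
profiles = assign_profiles(isolates)
```

Isolates with identical profiles (here `iso1` and `iso3`) fall into the same group; profiles that differ at few loci are the ones a minimum spanning tree would join first.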
32. Array Comparative Genomic Hybridisation
• aCGH can also be applied across species for classification/diagnostics:
  • Microarray probes represent genes from one or more organisms
  • “Off-species” gDNA fragmented, labelled, and hybridised
  • Hybridisation ≈ sequence similarity ≈ gene presence
• Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  • columns=isolates
  • yellow/red=gene present
  • blue/white/grey=gene absent
  • Lower bars coloured by lineage and host (green=cattle, blue=horse, purple=human)
Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0
34. …And Then It Rained Sequence Data
• Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
• Complete, (mainly) accurate sequence data much cheaper, enabling:
  • more precise sequence comparison
  • novel analyses, insights and visualisations
  • Genomic & exomic comparisons
• 19/2/2014 at GOLD:
  • 3,011 “finished” genomes
  • 9,891 “permanent draft” genomes
• 19/2/2014 at NCBI WGS:
  • 17,023 whole genome projects
35. …And Then It Rained Sequence Data
• In 2012, GOLD added 3736 genomes, NCBI added 4585
• Mostly prokaryotes (archaea and bacteria)
• We’re a little ahead of Su’s (Scripps, La Jolla) projections
Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/
37. Three broad levels of comparison
• Bulk Properties
  • chromosome/plasmid counts and sizes,
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (rearrangements), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.
38. Bulk Genome Property Comparisons
Values calculated for individual genomes, and subsequently compared.
39. Nucleotide Frequencies/Genome Size
• Very easy to calculate from complete or draft genome sequence
  • (or in a region of genome sequence)
• GC content/chromosome size can be characteristic of an organism
• [ACTIVITY]
  • bacteria_size_gc iPython notebook
  • ipython notebook --pylab inline in bacteria_size directory
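To show how little is needed to calculate these bulk properties from sequence, here is a minimal sketch in plain Python. It is not the code from the bacteria_size_gc notebook; the function names `gc_content` and `genome_summary` are hypothetical, and the input is assumed to be a dictionary of replicon name to sequence string.

```python
from collections import Counter


def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    counts = Counter(sequence.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    return gc / total if total else 0.0


def genome_summary(records):
    """Return (total length, GC fraction) for a dict of name -> sequence."""
    total_length = sum(len(seq) for seq in records.values())
    return total_length, gc_content("".join(records.values()))
```

Ambiguity symbols such as N count towards length but are left out of the GC fraction, which is one reasonable choice among several.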
40. Blobology
• Metazoan sequence data can be contaminated by microbial symbionts.
• Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
• Preliminary genome assembly, followed by read mapping
• Plot contig coverage against %GC = Blobology
• http://nematodes.org/bioinformatics/blobology/
Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6
41. Nucleotide k-mers
• Sequence data is required to determine k-mers
• Nucleotide frequencies:
  • A, C, G, T
• Dinucleotide frequencies:
  • AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
• Trinucleotide frequencies:
  • 64 trinucleotides
• k-nucleotide frequencies:
  • 4^k k-mers
• [ACTIVITY]
  • runApp("shiny/nucleotide_frequencies") in RStudio
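As a rough companion to the slide, the sketch below counts overlapping k-mers and reports the relative frequency of each of the 4^k possibilities. It is an illustration only and unrelated to the Shiny app in the activity; the function name `kmer_frequencies` is hypothetical, and only one strand is counted.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k k-mers over A, C, G, T.

    Overlapping windows are counted; windows containing other symbols
    (e.g. N) are skipped.
    """
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}
```

With k=1 this gives the nucleotide frequencies, with k=2 the 16 dinucleotide frequencies, and so on.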
42. k-mer Spectra
• k-mer spectrum:
  • Frequency distribution of observed k-mer counts
• Most species have a unimodal k-mer spectrum
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
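A k-mer spectrum, as defined on this slide, is a histogram of k-mer counts. The following minimal sketch makes that concrete; the function name `kmer_spectrum` is hypothetical, only k-mers that occur at least once are included, and it does not attempt to reproduce the analysis of Chor et al.

```python
from collections import Counter


def kmer_spectrum(sequence, k):
    """Return {count: number of distinct k-mers observed that many times}."""
    sequence = sequence.upper()
    kmer_counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Count how many distinct k-mers share each observed abundance
    return dict(Counter(kmer_counts.values()))
```

Plotting the keys (abundance) against the values (number of k-mers) gives the curve whose number of modes the slide refers to.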
43. k-mer Spectra
• k-mer spectrum:
  • All mammals tested (and some other species) have a multimodal k-mer spectrum
  • Genomic regions differ in this property
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
44. Average Nucleotide Identity (ANI)
• ANI introduced as a substitute for DDH in 2007:
  • 70% identity (DDH) = “gold standard” prokaryotic species boundary
  • 70% identity (DDH) ≈ 95% identity (ANI)
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
45. Average Nucleotide Identity (ANI)
• Original method emulates physical experiment:
  1. break genome into 1020nt fragments
  2. align fragments using BLASTN
  3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length
Goris et al. (2007) Int. J. Syst. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
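The three steps above can be sketched without running BLASTN itself. In this hedged example the hits are assumed to have been parsed already into (identical bases, alignment length) pairs, one best hit per fragment; the function names `fragment` and `ani_from_hits` are hypothetical. How the two cut-offs are applied (identity recalculated over the whole fragment, alignment covering at least 70% of the fragment) is an assumption about the method and should be checked against Goris et al.

```python
def fragment(sequence, size=1020):
    """Split a sequence into consecutive fragments; drop the short remainder."""
    return [
        sequence[i:i + size]
        for i in range(0, len(sequence) - size + 1, size)
    ]


def ani_from_hits(hits, fragment_size=1020, min_identity=0.30, min_coverage=0.70):
    """Mean percent identity of the hits that pass both filters.

    `hits` is a list of (identical_bases, alignment_length) pairs, one best
    hit per fragment, as might be parsed from tabular BLASTN output.
    Returns None when no hit passes the filters.
    """
    kept = []
    for identical, aln_length in hits:
        overall_identity = identical / fragment_size  # identity over the whole fragment
        coverage = aln_length / fragment_size         # alignable fraction of the fragment
        if overall_identity > min_identity and coverage >= min_coverage:
            kept.append(100.0 * identical / aln_length)
    return sum(kept) / len(kept) if kept else None
```

Fragments with no acceptable hit simply drop out of the mean, so the reported ANI describes only the shared, alignable part of the two genomes.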
46. Average Nucleotide Identity (ANI)
• ANIm and TETRA introduced (2009)
• ANIm:
  1. Align sequences using NUCmer
  2. ANI = mean %identity of matches
• TETRA:
  1. Calculate tetranucleotide frequencies
  2. Determine each tetramer deviation from expectation (Z-score)
  3. TETRA = Pearson correlation coefficient of tetramer Z-scores
Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
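The TETRA steps can be sketched as follows. This is an illustration under assumptions rather than the published implementation: the expected tetranucleotide count is taken from a Markov model built on the observed di- and trinucleotide counts (a common formulation, assumed here), only one strand is counted, and the function names `tetra_zscores`, `pearson` and `tetra` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def _count_kmers(sequence, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def tetra_zscores(sequence):
    """Z-score of each of the 256 tetranucleotides in one sequence."""
    sequence = sequence.upper()
    c2, c3, c4 = (_count_kmers(sequence, k) for k in (2, 3, 4))
    zscores = []
    for tetra_word in ("".join(p) for p in product("ACGT", repeat=4)):
        left = c3[tetra_word[:3]]    # count of n1n2n3
        right = c3[tetra_word[1:]]   # count of n2n3n4
        mid = c2[tetra_word[1:3]]    # count of n2n3
        if mid == 0:
            zscores.append(0.0)  # tetramer cannot occur, so no deviation
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        observed = c4[tetra_word]
        zscores.append((observed - expected) / sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def tetra(seq_a, seq_b):
    """TETRA score: correlation of the two genomes' tetranucleotide Z-scores."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))
```

Because no alignment is involved, the score depends only on composition, which is why it is quick to compute even for large genomes.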
47. Average Nucleotide Identity (ANI)
• ANIb (the original BLASTN-based method) discards useful information that ANIm retains
• TETRA reflects bulk genome properties rather than selection on sequence
• Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
• TETRA scores are prone to false positives; ANIb scores are prone to false negatives
49. Diagnostic PCR/MLST
• PCR/MLST still cheap
  • (but for how much longer?)
• Use whole genomes to identify unique/diagnostic regions for PCR/MLST
Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498
50. Whole Genome Sequence Comparisons
Comparisons of one whole or draft genome sequence with another (or many others)
52. Whole Genome Alignment
• Which genomes should you align? (or not bother aligning)
• For reasonable analysis, genomes should:
  • derive from a sufficiently recent common ancestor: so that homologous regions can be identified.
  • derive from a sufficiently distant common ancestor: so that sufficiently “interesting” changes are likely to have occurred
  • help answer your biological question:
    • is your question organism or phenotype specific?
    • are you investigating a process?
• This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…
53. Whole Genome Alignment
• Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  • Do not handle rearrangements
  • Computationally expensive on large sequences
• Many whole-genome alignment algorithms proposed, including:
  • LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  • BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  • Mugsy (http://mugsy.sourceforge.net/)
  • megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  • MUMmer (http://mummer.sourceforge.net/)
  • LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  • WABA, etc…
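To see why the naïve algorithms are expensive, consider a minimal Needleman-Wunsch scorer. The sketch below is illustrative only: the function name `needleman_wunsch_score` and the scoring values (match 1, mismatch -1, linear gap -1) are assumptions, and it returns the optimal score without the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score for sequences a and b.

    Fills a (len(a) + 1) x (len(b) + 1) dynamic programming table one row
    at a time, keeping only the previous row in memory.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]
```

Every cell of the table is visited, so two 5 Mbp bacterial genomes would need roughly 2.5 × 10^13 cell updates, and the strictly collinear model still could not represent an inversion or translocation.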
54. Whole Genome Alignment
• BLAT
  • BLAT is broadly similar to BLAST
  • Main differences:
    • optimised to find only exact or near-exact matches, for speed
    • indexes the subject genome, retains the index and scans the query
    • connects homologous match regions into a single alignment (BLAST reports them separately)
    • reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  • Advantages: fast; exact exon boundaries; UCSC integration
  • Disadvantages: does not find more remote/very divergent matches
Kent (2002) Genome Res. doi:10.1101/gr.229202
55. Whole Genome Alignment
• megaBLAST
  • Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    • genome-level searches
    • queries on large sequence sets
    • long alignments of very similar sequence (sequencing errors/SNPs)
  • Uses Zhang et al. (2000) greedy algorithm
  • Concatenates queries to improve performance (“query packing”)
    • NOTE: this is good practice for large query sets!
  • Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    • dc-megablast intended for more divergent sequences
Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA
56. Whole Genome Alignment
• MUMmer
  • Uses suffix trees for pattern matching: very fast even for large sequences
    • Finds maximal exact matches
    • Memory use depends only on reference sequence size
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
57. Whole Genome Alignment
• MUMmer
  • Suffix Tree:
    • Can be constructed and searched in O(n) time
    • Useful algorithms are nontrivial
  • BANANA$
    • B followed by ANANA$ only
    • A followed by $, NA$, NANA$
    • N followed by A$, ANA$
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
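The BANANA$ example can be reproduced with a few lines of Python that list, for each symbol, the suffixes that follow it. This naive enumeration only illustrates the branching a suffix tree records; it takes quadratic time and space and is not a suffix tree construction (the linear-time algorithms are, as the slide says, nontrivial). The function name `continuations` is hypothetical.

```python
from collections import defaultdict


def continuations(text):
    """Map each symbol to the sorted list of suffixes that follow it in `text`.

    The final symbol (the `$` terminator) has nothing after it and is skipped.
    """
    following = defaultdict(set)
    for i, symbol in enumerate(text[:-1]):
        following[symbol].add(text[i + 1:])
    return {symbol: sorted(rest) for symbol, rest in following.items()}


branches = continuations("BANANA$")
```

Each key corresponds to an edge leaving the root, and its list to the paths below that edge.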
58. Whole Genome Alignment
• MUMmer
  • Process:
    1) Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs, though not always unique)
    2) Cluster into alignment anchors
    3) Extend between anchors to produce a final gapped alignment
  • Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    • nucleotide and “conceptual protein” (more sensitive) alignments
    • used for genome comparisons, assembly scaffolding, repeat detection, etc.
    • forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS
Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
61. Multiple Genome Alignment
• LAGAN: rapid alignment of two homologous genome sequences
  • Generate local alignments (anchors, B)
  • Construct rough global map (maximal-scoring ordered subset, C)
    • Join anchors that lie within a threshold distance, the same way
  • Compute global alignment by dynamic programming (D)
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
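The “maximal-scoring ordered subset” step is a chaining problem, and a simple quadratic-time version can be sketched as below. This is not LAGAN's own code: anchors are reduced to hypothetical (position in A, position in B, score) points, the function name `best_chain` is made up for illustration, and faster chaining algorithms exist.

```python
def best_chain(anchors):
    """Return the highest-scoring subset of anchors ordered in both sequences.

    Each anchor is (start_in_a, start_in_b, score); anchors are treated as
    points, and a chain must increase strictly in both coordinates.
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    best = [score for _, _, score in anchors]  # best chain score ending at anchor i
    back = [None] * len(anchors)               # predecessor of anchor i in that chain
    for i, (xi, yi, si) in enumerate(anchors):
        for j in range(i):
            xj, yj, _ = anchors[j]
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i] = best[j] + si
                back[i] = j
    # Trace back from the anchor that ends the best-scoring chain
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(anchors[i])
        i = back[i]
    return chain[::-1]
```

Anchors left out of the chain (for example one that maps far off the diagonal) are exactly the local matches that conflict with a single collinear global map.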
62. Multiple Genome Alignment
• MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  • Make rough global maps between each pair of sequences (step C in LAGAN)
  • Progressive multiple alignment with anchors (iterated)
    1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
    2. Find rough global maps of this multi-sequence to all other multi-sequences.
Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
63. Human-Mouse-Rat Alignment
• Three-way progressive alignment, identifying:
  • Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
• Three-way synteny mapped to rat genome
Brudno et al. (2004) Genome Res. doi:10.1101/gr.2067704
Initial alignments by BLAT
Syntenous regions aligned with LAGAN
65. Draft Genome Alignment
• Whole genome alignments useful for scaffolding assemblies
  • High-throughput sequence assemblies come in fragments (contigs)
  • Contigs can sometimes be ordered if paired reads or long read technologies are used
  • Can also align to a known reference genome
• MUMmer
  • Can use NUCmer or, for more distant relations, PROmer
• Mauve/Progressive Mauve
  • http://gel.ahabs.wisc.edu/mauve/
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704
66. Mauve
• Mauve’s alignment algorithm
  1. Find local alignments (multi-MUMs: seed & extend)
  2. Construct phylogenetic guide tree from multi-MUMs
  3. Select subset of multi-MUMs as anchors.
     • Partition anchors into Locally Collinear Blocks (LCBs): consistently-ordered subsets
  4. Perform recursive anchoring to identify further anchors
  5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704
67. Mauve
• Mauve alignment of LCBs in nine enterobacterial genomes
• Rearrangement of homologous backbone sequence
Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704
68. Draft Genome Alignment
• [OPTIONAL ACTIVITY] (useful for exercise)
  • Alignment and reordering of draft genome contigs
  • whole_genome_alignments_B.md Markdown
  • https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
  • Visualisation of whole genome alignment with Biopython
  • biopython_visualisation iPython notebook
69. Collinearity and Synteny
• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
• Two elements are collinear if they lie in the same linear sequence
• Two elements are syntenous (syntenic) if:
  • (orig.) they lie on the same chromosome
  • (mod.) they lie in blocks whose order is conserved within the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features
75. Conclusion
• Physical and computational genome comparisons:
  • Similar biological questions → similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements
76. Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.
77. Nucleotide Content
• A, C, G, T composition
  • Varies between, and within genomes
  • staining varies across genomes, due to variation in GC content
• “isochores”: regions with little internal GC variation (homogeneous)
  • long a point of discussion: difficult to define
• In humans:
  • L1, L2 isochores: low GC (≲41%)
  • H1, H2, H3 isochores: high GC (≳41%)
• Imprecise bulk measurement
Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211
hybridisation of H3 isochore to human genome
78. DNA-DNA Hybridisation (DDH)
• Used for taxonomic classification in prokaryotes from the 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in the 1980s
• Not without controversy:
  • Suggestions of data manipulation
  • Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.
Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285
79. Finding isochores
• Isochores: homogeneous regions of %GC content
• Easy to find with windowed (100kbp) %GC calculation, from sequenced genomes.
• 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.
Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
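A windowed %GC calculation of the kind mentioned above might look like the sketch below. It only produces the per-window values; deciding where one isochore ends and the next begins, as Costantini et al. do, needs further analysis that is not attempted here. The function name `windowed_gc` and the use of non-overlapping windows are assumptions.

```python
def windowed_gc(sequence, window=100_000):
    """Return %GC for consecutive, non-overlapping windows of the sequence.

    The final window may be shorter than `window`; its %GC is computed over
    its own length. Ambiguous bases such as N count towards window length.
    """
    sequence = sequence.upper()
    values = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / len(chunk))
    return values
```

Runs of adjacent windows with similar values are the candidate homogeneous regions, and the roughly 41% boundary from the earlier slide separates the low-GC (L) from the high-GC (H) families.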
80. Comparative Genomic Hybridisation
• Two genomes: “reference” and “test” labelled (red and green), then hybridised against a “normal” genome
• semiquantitative:
  • Red: loss (<2 copies) in tumour
  • Green: gain (3-4 copies) in tumour
  • Amplifications (>4 copies) in BOLD
  • Cases with the same Copy Number Aberration (CNA) are numbered
De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223
81. Array Comparative Genomic Hybridisation
• Early approaches took a threshold score (present/absent)
• Later approaches used known reference genome sequence context (HMMs, synteny) to improve presence/absence calls
• No hybridisation = “absent” or “divergent”?
• Not nearly as good as sequencing directly!
Pritchard et al. (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000473
82. k-mer Spectra
• k-mer spectrum:
  • CpG suppression (CGs are uncommon in vertebrate genomes) explains multimodality, but (by simulation) only in combination with a particular %GC
Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108