Comparative Genomics and Visualisation - Part 1


Leighton Pritchard

Part 1

• What is comparative genomics?
• Levels of genome comparison
  • bulk, whole sequence, features
• A Brief History of Comparative Genomics
  • experimental comparative genomics
• Computational Comparative Genomics
  • Bulk properties
  • Whole genome comparisons
• Part 2
  • Genome feature comparisons

What is Comparative Genomics?

The combination of genomic data and comparative and evolutionary biology to address questions of genome structure, evolution and function.

“Nothing in biology makes sense, except in the light of evolution”
Theodosius Dobzhansky

Why Comparative Genomics?

• Genomes describe heritable characteristics
• Related organisms share ancestral genomes
• Functional elements encoded in genomes are common to related organisms
• Functional understanding of model systems (E. coli, A. thaliana, D. melanogaster) can be transferred to non-model systems on the basis of genome comparisons
• Genome comparisons can be informative, even for distantly-related organisms

• BUT:
  • Context: epigenetics, tissue differentiation, mesoscale systems, etc.
  • Phenotypic plasticity: responses to temperature, stress, environment, etc.

• Genomic differences can underpin phenotypic (morphological or physiological) differences.
• Where phenotypes or other organism-level properties are known, comparison of genomes may give mechanistic or functional insight into differences (e.g. GWAS).
• Genome comparisons aid identification of functional elements on the genome.
• Studying genomic changes reveals evolutionary processes and constraints.

[Figure labels: species; time; contemporary organisms. Adapted from Hardison (2003) PLoS Biol. doi:10.1371/journal.pbio.0000058]

• Comparison within species (e.g. isolate-level, or even within individuals): which genome features may account for unique characteristics of organisms/tumours? Epigenetics in an individual.

[Figure labels: genus; time; contemporary organisms]

• Comparison within genus (e.g. species-level): what genome features show evidence of selective pressure, and in which species?

[Figure labels: subgroup; time; contemporary organisms]

• Comparison within subgroup (e.g. genus-level): what is the core set of genome features that defines a subgroup or genus?

The E. coli long-term evolution experiment

• Run by the Lenski lab, Michigan State University, since 1988
  • http://myxo.css.msu.edu/ecoli/
• 12 flasks, citrate usage selection
• 50,000 generations of Escherichia coli!
  • Cultures propagated every day
  • Every 500 generations (75 days), mixed-population samples stored
  • Mean fitness estimated at 500 generation intervals

Jeong et al. (2009) J. Mol. Biol. doi:10.1016/j.jmb.2009.09.052
Barrick et al. (2009) Nature doi:10.1038/nature08480
Wiser et al. (2013) Science doi:10.1126/science.1243357

Comparative Genomics in the News

• Neanderthal alleles:
  • Aid adaptation outwith Africa
  • Associated with disease risk
  • Reduce male fertility

Sankararaman et al. (2014) Nature doi:10.1038/nature12961

Levels of Genome Comparison

Genomes are complex, and can be compared on a range of conceptual levels, both practically and in silico.

Three broad levels of comparison

• Bulk Properties
  • chromosome/plasmid counts and sizes,
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (synteny), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.

A Brief History of Experimental Comparative Genomics

You don’t have to sequence genomes to compare them (but it helps).

Genome Comparisons Predate NGS

• Sequence data was not always cheap and abundant
• Practical, experimental genome comparisons were needed

Bulk Genome Property Comparisons

Values calculated for individual genomes, and subsequently compared.

Bulk Genome Properties

• Large-scale summary measurements
• Measure genomes independently, compare values later
  • Number of chromosomes
  • Ploidy
  • Chromosome size
  • Nucleotide (A, C, G, T) frequency/percentage

Chromosome Counts/Size

• The chromosome counts/ploidy of organisms can vary widely
  • Escherichia coli: 1 (but plasmids…)
  • Rice (Oryza sativa): 24 (but mitochondria, plastids etc…)
  • Human (Homo sapiens): 46, diploid
  • Adders-tongue (Ophioglossum reticulatum): up to 1260
  • Domestic (but not wild) wheat: somatic cells hexaploid, gametes haploid
• Physical genome size (related to sequence length) can also vary greatly
• Genome size and chromosome count do not indicate organism ‘complexity’
• Still surprises to be found in physical study of chromosomes! (e.g. Hi-C)

Kamisugi et al. (1993) Chromosome Res. 1(3): 189-96
Wang et al. (2013) Nature Rev Genet. doi:10.1038/nrg3375

Nucleotide Content

• Experimental approaches for accurate measurement
  • e.g. use radiolabelled monophosphates, calculate proportions using chromatography

Karl (1980) Microbiol. Rev. 44(4) 739-796
Krane et al. (1991) Nucl. Acids Res. doi:10.1093/nar/19.19.5181

Whole Genome Comparisons

Comparisons of one whole or draft genome with another (or many others)

• Requires two genomes: “reference” and “comparator”
  • Experiment produces a comparative result, dependent on the choice of genomes
• Methods mostly based around direct or indirect DNA hybridisation
  • DNA-DNA hybridisation
  • Comparative Genomic Hybridisation (CGH)
  • Array Comparative Genomic Hybridisation (aCGH)

DNA-DNA Hybridisation (DDH)

• Several methods based around the same principle:
  1. Find homoduplex Tm1
  2. Denature reference and comparator (organism A, B) gDNA, and mix
  3. Allow to anneal: hybrids result (reassociation ≈ similarity); find heteroduplex Tm2
  4. ∆Tm = Tm1 - Tm2
  5. High ∆Tm implies greater genomic difference (fewer H-bonds)
• Proxy for sequence similarity

Rosselló-Mora & Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1

• Used for taxonomic classification in prokaryotes from 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in 1980s: Homo shares more recent common ancestor with Pan than with Gorilla (this was previously in dispute)

Sibley & Ahlquist (1984) J. Mol. Evol. doi:10.1007/BF02101980

Comparative Genomic Hybridisation

• Two genomes, “reference” and “test”, are labelled (red and green: a bad convention to choose, for visualisation), then hybridised against a third “normal” genome
• Differences in red/green intensity mapped by microscopy correspond to relative relationship of reference and test to “normal” genome
• Comparisons within species (or individual, for tumours); copy number variations (CNV)
• Labour-intensive, low-resolution

• Image analysis required: intensity along medial axis.
• Epigenetics: hybridising methylated DNA

Kallioniemi et al. (1992) Science doi:10.1126/science.1359641
Fraga et al. (2005) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0500398102

Array Comparative Genomic Hybridisation

• Uses DNA microarrays: thousands of short DNA probes (genome fragments) immobilised on a surface
• gDNA, cDNA, etc. fluorescently-labelled and hybridised to the array
• Smaller sample sizes cf. CGH, automatable, high-throughput, high-res
• Identifies copy number variation (CNV) and segmental duplication

Pollack et al. (1999) Nat. Genet. doi:10.1038/12640

Genome Feature Comparisons

Comparisons on the basis of a restricted set of genome features

Chromosomal Rearrangements

• Genomes are dynamic, and undergo large-scale changes
• Hybridisation used to map genome rearrangement/duplication
  • Separate chromosomes electrophoretically
  • Apply single gene hybridising probes
  • Reciprocal hybridisations indicate translocations

Fischer et al. (2000) Nature doi:10.1038/35013058

Diagnostic PCR/MLST

• Define a set of regions (usually genes):
  • conserved enough that PCR primers can be designed to amplify the same region in multiple organisms
• and:
  • divergent enough that hybridising probes can distinguish between groups
• or:
  • sequence the amplification products
  • Sequence variants given numbers
  • Number profiles define groups
  • Track evolution by minimum spanning trees (MST)
  • http://pubmlst.org/

Maiden et al. (2006) Ann. Rev. Microbiol. doi:10.1146/annurev.micro.59.030804.121325
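The allele-numbering idea above can be sketched roughly as follows. This is not the PubMLST implementation: the locus names and sequences are invented for illustration, and real schemes use curated allele databases. Each distinct sequence at a locus gets the next integer, and an isolate's tuple of allele numbers is its profile.

```python
from collections import defaultdict


def assign_profiles(isolates, loci):
    """Number sequence variants per locus; return one allele profile per isolate.

    `isolates` maps isolate name -> {locus: amplicon sequence}. Allele numbers
    are assigned in order of first appearance, starting from 1.
    """
    allele_ids = defaultdict(dict)  # locus -> {sequence: allele number}
    profiles = {}
    for name, amplicons in isolates.items():
        profile = []
        for locus in loci:
            known = allele_ids[locus]
            seq = amplicons[locus]
            if seq not in known:
                known[seq] = len(known) + 1
            profile.append(known[seq])
        profiles[name] = tuple(profile)
    return profiles


# Hypothetical loci and toy sequences, for illustration only
LOCI = ["locusA", "locusB"]
ISOLATES = {
    "iso1": {"locusA": "ACGT", "locusB": "GGCA"},
    "iso2": {"locusA": "ACGT", "locusB": "GGTA"},
    "iso3": {"locusA": "ACGT", "locusB": "GGCA"},
}
profiles = assign_profiles(ISOLATES, LOCI)
```

Isolates with identical profiles (here iso1 and iso3) fall into the same group; the number of loci at which two profiles differ is the kind of distance a minimum spanning tree would be built on.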

Array Comparative Genomic Hybridisation

• aCGH can also be applied across species for classification/diagnostics:
  • Microarray probes represent genes from one or more organisms
  • “Off-species” gDNA fragmented, labelled, and hybridised
  • Hybridisation ≈ sequence similarity ≈ gene presence
• Heatmap of 217 Staphylococcus aureus isolates on 7-strain array.
  • columns=isolates
  • yellow/red=gene present
  • blue/white/grey=gene absent
  • Lower bars coloured by lineage and host (green=cattle, blue=horse, purple=human)

Sung et al. (2008) Microbiol. doi:10.1099/mic.0.2007/015289-0

But This Happened…

• High-throughput sequencing

…And Then It Rained Sequence Data

• Modern high-throughput sequencing (454, Illumina) completely changed the landscape.
• Complete, (mainly) accurate sequence data much cheaper, enabling:
  • more precise sequence comparison
  • novel analyses, insights and visualisations
• Genomic & exomic comparisons
• 19/2/2014 at GOLD:
  • 3,011 “finished” genomes
  • 9,891 “permanent draft” genomes
• 19/2/2014 at NCBI WGS:
  • 17,023 whole genome projects

• In 2012, GOLD added 3736 genomes, NCBI added 4585
• Mostly prokaryotes (archaea and bacteria)
• We’re a little ahead of Su’s (Scripps, La Jolla) projections

Figures and code from: http://sulab.org/2013/06/sequenced-genomes-per-year/

Computational Comparative Genomics

Massively enabled by high-throughput sequencing, much more powerful and precise.

Three broad levels of comparison

• Bulk Properties
  • chromosome/plasmid counts and sizes,
  • nucleotide content, etc.
• Whole Genome Sequence
  • sequence similarity
  • organisation of genomic regions (rearrangements), etc.
• Genome Features/Functional Components
  • numbers and types of features (genes, ncRNA, regulatory elements, etc.)
  • organisation of features (synteny, operons, regulons, etc.)
  • complements of features
  • selection pressure, etc.

Nucleotide Frequencies/Genome Size

• Very easy to calculate from complete or draft genome sequence
  • (or in a region of genome sequence)
• GC content/chromosome size can be characteristic of an organism
• [ACTIVITY]
  • bacteria_size_gc iPython notebook
  • ipython notebook --pylab inline in bacteria_size directory
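As a minimal sketch of how little is needed to get these bulk values from sequence (this is not the code in the activity notebook, and the toy contigs are invented), total length and GC fraction can be computed directly from the sequence strings:

```python
def size_and_gc(sequences):
    """Return (total length, GC fraction) for an iterable of DNA strings.

    Ambiguity codes such as N count towards length but not towards GC or AT,
    so the GC fraction is computed over unambiguous bases only.
    """
    total = gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    fraction = gc / (gc + at) if (gc + at) else 0.0
    return total, fraction


# Toy "genome" of two contigs; real input would be parsed from FASTA
length, gc_fraction = size_and_gc(["ATGCGC", "ggccNN"])
```

Excluding ambiguous bases from the denominator is a design choice; for draft genomes with long runs of N it avoids underestimating %GC.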

Blobology

• Metazoan sequence data can be contaminated by microbial symbionts.
  • Host and symbiont DNA have different %GC (and are present in different amounts/coverage)
  • Preliminary genome assembly, followed by read mapping
  • Plot contig coverage against %GC = Blobology
  • http://nematodes.org/bioinformatics/blobology/

Kumar & Blaxter (2011) Symbiosis doi:10.1007/s13199-012-0154-6
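The two axes of such a plot can be sketched as below. This is an illustration only, not the Blobology scripts: it assumes contig sequences and per-base read depths are already in memory (the dictionaries and their contents are invented), and leaves out the plotting itself.

```python
def blob_table(contigs, depths):
    """Return [(contig, %GC, mean coverage)], the coordinates of a blob plot.

    `contigs` maps contig name -> sequence; `depths` maps contig name ->
    list of per-base read depths from mapping reads back to the assembly.
    """
    rows = []
    for name, seq in contigs.items():
        seq = seq.upper()
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
        coverage = sum(depths[name]) / len(depths[name])
        rows.append((name, gc, coverage))
    return rows


# Invented toy data: a GC-poor, high-coverage contig and a GC-rich, low-coverage one
contigs = {"contig1": "ATATATGC", "contig2": "GCGCGCAT"}
depths = {"contig1": [40] * 8, "contig2": [5] * 8}
rows = blob_table(contigs, depths)
```

Contigs from host and symbiont would be expected to form separate clusters ("blobs") when these rows are drawn as a scatter plot.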

Nucleotide k-mers

• Sequence data is required to determine k-mers
• Nucleotide frequencies:
  • A, C, G, T
• Dinucleotide frequencies:
  • AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
• Trinucleotide frequencies:
  • 64 trinucleotides
• k-nucleotide frequencies:
  • 4^k k-mers
• [ACTIVITY]
  • runApp("shiny/nucleotide_frequencies") in RStudio
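A short sketch of counting k-mer frequencies with a sliding window (separate from the Shiny activity; the function name is illustrative). Every one of the 4^k possible k-mers is reported, including those never observed.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer over A, C, G, T.

    Overlapping windows are counted; windows containing other symbols
    (e.g. N) are skipped. All 4**k k-mers appear in the result.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


dinucleotides = kmer_frequencies("ACGTACGT", 2)
```

This counts one strand only; whether to also count the reverse complement depends on the question being asked.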

k-mer Spectra

• k-mer spectrum:
  • Frequency distribution of observed k-mer counts
  • Most species have a unimodal k-mer spectrum
  • All mammals tested (and some other species) have a multimodal k-mer spectrum
  • Genomic regions differ in this property

Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
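To make "frequency distribution of observed k-mer counts" concrete, here is a minimal sketch (an illustration, not the analysis of Chor et al.): count each k-mer, then count how many distinct k-mers share each count. k-mers that never occur are not included in this version.

```python
from collections import Counter


def kmer_spectrum(seq, k):
    """Map occurrence count -> number of distinct k-mers seen that many times."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return dict(Counter(counts.values()))


spectrum = kmer_spectrum("ACGTACGT", 2)
```

In the toy sequence AC, CG and GT each occur twice and TA once; on a real genome, plotting this mapping gives the unimodal or multimodal curve referred to above.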

Average Nucleotide Identity (ANI)

• ANI introduced as a substitute for DDH in 2007:
  • 70% identity (DDH) = “gold standard” prokaryotic species boundary
  • 70% identity (DDH) ≈ 95% identity (ANI)

Goris et al. (2007) Int. J. System. Evol. Microbiol. doi:10.1099/ijs.0.64483-0

• Original method emulates physical experiment:
  1. break genome into 1020nt fragments
  2. align fragments using BLASTN
  3. ANI = mean identity of all BLASTN matches with >30% identity over 70% alignable length

Goris et al. (2007) Int. J. System. Evol. Microbiol. doi:10.1099/ijs.0.64483-0
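The fragment-and-average steps above can be sketched as follows. This is a simplified illustration, not the published implementation or calculate_ani.py: BLASTN itself is not run, the hits are invented, and the assumption that each fragment contributes one best hit described by (percent identity, aligned length, fragment length) is mine.

```python
FRAGMENT_SIZE = 1020  # nt, as in the steps above


def fragment(genome, size=FRAGMENT_SIZE):
    """Cut a genome string into consecutive fragments of `size` nt.

    The final fragment may be shorter than `size`.
    """
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def ani_from_hits(hits, min_identity=30.0, min_coverage=0.7):
    """Mean % identity of hits passing the identity and coverage filters.

    `hits` is a list of (percent_identity, aligned_length, fragment_length)
    tuples, assumed to be the best BLASTN match for each fragment.
    Returns None if no hit passes.
    """
    kept = [
        identity
        for identity, aligned, frag_len in hits
        if identity > min_identity and aligned >= min_coverage * frag_len
    ]
    return sum(kept) / len(kept) if kept else None


fragments = fragment("A" * 2500)
# Invented hits: the third fails the 70% alignable-length filter
ani = ani_from_hits([(98.0, 1020, 1020), (96.0, 900, 1020), (80.0, 300, 1020)])
```

Fragments that fail the filters are simply dropped from the mean.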

• ANIm and TETRA introduced (2009)
• ANIm:
  1. Align sequences using NUCmer
  2. ANI = mean %identity of matches
• TETRA:
  1. Calculate tetranucleotide frequencies
  2. Determine each tetramer deviation from expectation (Z-score)
  3. TETRA = Pearson correlation coefficient of tetramer Z-scores

Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
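The three TETRA steps can be sketched as below. Treat this as an assumption-laden illustration rather than the JSpecies implementation: the expected counts and variance use a maximal-order Markov approximation that is commonly described for tetranucleotide Z-scores, only one strand is counted (no reverse-complement extension), and undefined scores are set to zero.

```python
import random
from collections import Counter
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def count_kmers(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-score for each of the 256 tetranucleotides in `seq`.

    Expected counts come from a maximal-order Markov approximation:
    E(abcd) = N(abc) * N(bcd) / N(bc). Undefined scores are set to 0.
    """
    seq = seq.upper()
    n2, n3, n4 = (count_kmers(seq, k) for k in (2, 3, 4))
    scores = []
    for t in TETRAMERS:
        mid, left, right = n2[t[1:3]], n3[t[:3]], n3[t[1:]]
        if mid == 0:
            scores.append(0.0)
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        scores.append((n4[t] - expected) / sqrt(variance) if variance > 0 else 0.0)
    return scores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def tetra(seq_a, seq_b):
    """TETRA-style score: correlation of the two genomes' tetramer Z-scores."""
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))


rng = random.Random(0)  # fixed seed so the toy example is reproducible
toy = "".join(rng.choice("ACGT") for _ in range(5000))
self_score = tetra(toy, toy)  # a genome compared with itself
```

No alignment is involved at any point, which is why the score is fast to compute.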

• ANIb (the BLASTN-based method) discards useful information that ANIm retains
• TETRA reflects bulk genome properties rather than selection on sequence
  • Data for Anaplasma marginale (3), A. phagocytophilum (4), A. centrale (1)
• TETRA scores are prone to false positives; ANIb scores are prone to false negatives

• JSpecies (http://www.imedea.uib.es/jspecies/)
  • WebStart
  • java -jar -Xms1024m -Xmx1024m jspecies1.2.1.jar
• Python script
  • scripts/calculate_ani.py
• [ACTIVITY]
  • average_nucleotide_identity/README.md Markdown

Richter & Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106

Diagnostic PCR/MLST

• PCR/MLST still cheap
  • (but for how much longer?)
• Use whole genomes to identify unique/diagnostic regions for PCR/MLST

Slezak et al. (2003) Brief. Bioinf. doi:10.1093/bib/4.2.133
Pritchard et al. (2012) PLoS One doi:10.1371/journal.pone.0034498

Whole Genome Sequence Comparisons

Comparisons of one whole or draft genome sequence with another (or many others)

Whole Genome Alignment

• Which genomes should you align? (or not bother aligning)
• For reasonable analysis, genomes should:
  • derive from a sufficiently recent common ancestor, so that homologous regions can be identified.
  • derive from a sufficiently distant common ancestor, so that sufficiently “interesting” changes are likely to have occurred
  • help answer your biological question:
    • is your question organism or phenotype specific?
    • are you investigating a process?
• This may be more involved for metazoans (vertebrates, arthropods, nematodes, etc.) than prokaryotes…

• Naïve alignment algorithms (e.g. Needleman-Wunsch/Smith-Waterman) are not appropriate:
  • Do not handle rearrangements
  • Computationally expensive on large sequences
• Many whole-genome alignment algorithms proposed, including:
  • LASTZ (http://www.bx.psu.edu/~rsharris/lastz/)
  • BLAT (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  • Mugsy (http://mugsy.sourceforge.net/)
  • megaBLAST (http://www.ncbi.nlm.nih.gov/blast/html/megablast.html)
  • MUMmer (http://mummer.sourceforge.net/)
  • LAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  • WABA, etc…

• BLAT
  • BLAT is broadly similar to BLAST
  • Main differences:
    • optimised to find only exact or near-exact matches, for speed
    • indexes the subject genome, retains the index and scans the query
    • connects homologous match regions into a single alignment (BLAST reports them separately)
    • reports mRNA match intron-exon boundaries exactly (BLAST tends to extend)
  • Advantages: fast; exact exon boundaries; UCSC integration
  • Disadvantages: does not find more remote/very divergent matches

Kent (2002) Genome Res. doi:10.1101/gr.229202

• megaBLAST
  • Optimised for speed over BLASTN (see http://www.ncbi.nlm.nih.gov/blast/Why.shtml):
    • genome-level searches
    • queries on large sequence sets
    • long alignments of very similar sequence (sequencing errors/SNPs)
  • Uses Zhang et al. (2000) greedy algorithm
  • Concatenates queries to improve performance (“query packing”)
    • NOTE: this is good practice for large query sets!
  • Two modes: megaBLAST, and discontinuous megaBLAST (dc-megablast)
    • dc-megablast intended for more divergent sequences

Zhang et al. (2000) J. Comp. Biol. 7(1-2) 203-14
Korf et al. (2003) “BLAST”, O’Reilly & Associates, Sebastopol, CA

• MUMmer
  • Uses suffix trees for pattern matching: very fast even for large sequences
    • Finds maximal exact matches
    • Memory use depends only on reference sequence size

Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12

• Suffix Tree:
  • Can be constructed and searched in O(n) time
  • Useful algorithms are nontrivial
  • BANANA$
    • B followed by ANANA$ only
    • A followed by $, NA$, NANA$
    • N followed by A$, ANA$

Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
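The BANANA$ example can be reproduced with a naive enumeration of suffixes. This sketch is deliberately not a suffix tree: it takes quadratic time and space, and is only meant to show what the first level of branching contains and why a pattern search reduces to a prefix-of-some-suffix test.

```python
from collections import defaultdict


def suffixes(text):
    """All suffixes of `text`, longest first."""
    return [text[i:] for i in range(len(text))]


def root_branches(text):
    """Group suffixes by first symbol: symbol -> sorted list of what follows.

    This mirrors the first level of branching in a suffix tree, but is a
    naive O(n^2) enumeration, not a linear-time construction.
    """
    branches = defaultdict(list)
    for suffix in suffixes(text):
        branches[suffix[0]].append(suffix[1:])
    return {symbol: sorted(rest) for symbol, rest in branches.items()}


def contains(text, pattern):
    """A pattern occurs in `text` if and only if it is a prefix of some suffix."""
    return any(s.startswith(pattern) for s in suffixes(text))


branches = root_branches("BANANA$")
```

A real suffix tree shares the common prefixes of these suffixes along its edges, which is what makes linear-time search possible.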

• MUMmer
  • Process:
    1) Identify a non-overlapping subset of maximal exact matches: often Maximal Unique Matches (MUMs, though not always unique)
    2) Cluster into alignment anchors
    3) Extend between anchors to produce a final gapped alignment
  • Very flexible approach: a suite of programs (mummer, nucmer, promer, …)
    • nucleotide and “conceptual protein” (more sensitive) alignments
    • used for genome comparisons, assembly scaffolding, repeat detection, etc.
    • forms the basis for other aligners/assemblers, e.g. Mugsy, AMOS

Kurtz et al. (2004) Genome Biol. doi:10.1186/gb-2004-5-2-r12
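Step 1 can be illustrated with a brute-force search for maximal unique matches between two short strings. This is a sketch only: MUMmer uses suffix-tree-based matching, not this quadratic scan, and the function names, `min_len` default and toy sequences are invented.

```python
def occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def maximal_unique_matches(ref, query, min_len=3):
    """Return sorted [(ref_start, query_start, length)] for each MUM.

    Brute force: every left-maximal start pair is extended to the right as
    far as the sequences agree, then kept if the matched string occurs
    exactly once in each sequence.
    """
    mums = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue  # extendable to the left, so not maximal
            length = 0
            while (i + length < len(ref) and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            match = ref[i:i + length]
            if (length >= min_len and occurrences(ref, match) == 1
                    and occurrences(query, match) == 1):
                mums.append((i, j, length))
    return sorted(mums)


# Toy sequences sharing the unique substring GATTACA
mums = maximal_unique_matches("CCGATTACAGG", "TTGATTACATT")
```

The shorter shared string ATT is also a maximal exact match here, but it occurs twice in the second sequence and so is not reported as unique.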

• [ACTIVITY]
  • whole_genome_alignments_A.md Markdown
  • https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_A.md

Multiple Genome Alignment

• Several tools:
  • Mugsy (http://mugsy.sourceforge.net/)
  • MLAGAN (http://lagan.stanford.edu/lagan_web/index.shtml)
  • TBA/MultiZ (http://www.bx.psu.edu/miller_lab/)
  • Mauve (http://gel.ahabs.wisc.edu/mauve/)
• Positional homology vs. glocal

• LAGAN: rapid alignment of two homologous genome sequences
  • Generate local alignments (anchors, B)
  • Construct rough global map (maximal-scoring ordered subset, C)
    • Join anchors that lie within a threshold distance, the same way
  • Compute global alignment by dynamic programming (D)

Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603
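The "maximal-scoring ordered subset" of anchors can be found by a simple chaining recurrence. The sketch below illustrates that idea only and is not LAGAN's code: anchors are reduced to (position in A, position in B, score) triples with invented values, and anchor lengths, overlaps and gap penalties are ignored.

```python
def best_chain(anchors):
    """Highest-scoring subset of anchors that is ordered in both sequences.

    `anchors` is a list of (start_a, start_b, score). An anchor may follow
    another only if it starts later in both sequences. Simple O(n^2)
    dynamic programming; anchor lengths and overlaps are ignored.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [score for _, _, score in anchors]  # best chain score ending at each anchor
    prev = [None] * len(anchors)
    for i, (a_i, b_i, score_i) in enumerate(anchors):
        for j in range(i):
            a_j, b_j, _ = anchors[j]
            if a_j < a_i and b_j < b_i and best[j] + score_i > best[i]:
                best[i] = best[j] + score_i
                prev[i] = j
    end = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while end is not None:  # walk back along the stored predecessors
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


# Invented anchors: (position in A, position in B, score); (30, 5, 8) is out of order
chain = best_chain([(10, 12, 5), (20, 25, 4), (30, 5, 8), (40, 45, 6)])
```

The out-of-order anchor is dropped even though it has the highest individual score, because keeping it would exclude a higher-scoring ordered chain.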

• MLAGAN: multiple genome alignment of k genomes in k-1 alignment steps, using a phylogenetic tree (CLUSTAL-like):
  • Make rough global maps between each pair of sequences (step C in LAGAN)
  • Progressive multiple alignment with anchors (iterated)
    1. Perform global alignment between closest pair of sequences with LAGAN: alignments are “multi-sequences”
    2. Find rough global maps of this multi-sequence to all other multi-sequences.

Brudno et al. (2003) Genome Res. doi:10.1101/gr.926603

Human-Mouse-Rat Alignment

• Three-way progressive alignment, identifying:
  • Homologous (H/M/R), rodent-only (M/R) and human-mouse or human-rat (H/M, H/R) homologous regions
• Three-way synteny
  • [Figure: synteny mapped to rat genome]
• Initial alignments by BLAT
• Syntenous regions aligned with LAGAN

Brudno et al. (2004) Genome Res. doi:10.1101/gr.2067704

Draft Genome Alignment

• Whole genome alignments useful for scaffolding assemblies
  • High-throughput sequence assemblies come in fragments (contigs)
  • Contigs can sometimes be ordered if paired reads or long read technologies are used
  • Can also align to a known reference genome
• MUMmer
  • Can use NUCmer or, for more distant relations, PROmer
• Mauve/Progressive Mauve
  • http://gel.ahabs.wisc.edu/mauve/

Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Mauve

• Mauve’s alignment algorithm
  1. Find local alignments (multi-MUMs: seed & extend)
  2. Construct phylogenetic guide tree from multi-MUMs
  3. Select subset of multi-MUMs as anchors.
    • Partition anchors into Local Collinear Blocks (LCBs): consistently-ordered subsets
  4. Perform recursive anchoring to identify further anchors
  5. Perform progressive alignment (similar to CLUSTAL), against guide tree
• Mauve Contig Mover (MCM) for ordering contigs

Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

• Mauve alignment of LCBs in nine enterobacterial genomes
• Rearrangement of homologous backbone sequence

Darling et al. (2004) Genome Res. doi:10.1101/gr.2289704

Draft Genome Alignment

• [OPTIONAL ACTIVITY] (useful for exercise)
  • Alignment and reordering of draft genome contigs
  • whole_genome_alignments_B.md Markdown
  • https://github.com/widdowquinn/Teaching/blob/master/Comparative_Genomics_and_Visualisation/Part_1/whole_genome_alignment/whole_genome_alignments_B.md
• [ACTIVITY]
  • Visualisation of whole genome alignment with Biopython
  • biopython_visualisation iPython notebook

Collinearity and Synteny

• Rearrangements may occur post-speciation
• Different species still exhibit conservation of sequence similarity and order
  • Two elements are collinear if they lie in the same linear sequence
  • Two elements are syntenous (syntenic) if:
    • (original sense) they lie on the same chromosome
    • (modern sense) they lie in conserved blocks of order within the same chromosome
• Signs of evolutionary constraints, including synteny, may indicate functional genome regions
• More about this in Part 2, related to genome features

Syntenous

• example1.png from biopython_visualisation activity

Nonsyntenous

• example2.png from biopython_visualisation activity

Whole Genome Duplication

• Puffer fish Tetraodon nigroviridis (smallest known vertebrate genome)
  • Whole-genome duplication, subsequent to divergence from mammals.
  • Ancestral vertebrate genome inferred to have 12 chromosomes.
• Duplicated genes (ExoFish) on 21 chromosomes

Jaillon et al. (2004) Nature doi:10.1038/nature03025

VISTA, mVISTA, VISTA-Point

• Alignment/visualisation tools:
  • http://genome.lbl.gov/vista/index.shtml
• mVISTA: align and compare submitted sequences (up to 2Mbp)
• VISTA-Point: visualise precomputed alignments

Frazer et al. (2004) Nucl. Acids Res. doi:10.1093/nar/gkh458

UCSC

• http://genome.ucsc.edu/
• Many vertebrate/invertebrate model genomes

Kent et al. (2002) Genome Res. doi:10.1101/gr.229102

Conclusion

• Physical and computational genome comparisons:
  • Similar biological questions -> similar concepts
• Lots of sequence data in modern biology
• Conservation ≈ evolutionary constraint
• Many choices of algorithms/analysis software
• Many choices of visualisation software/tools
• Coming in Part 2: genomic functional elements

Credits

• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.

Nucleotide Content

• A, C, G, T composition
  • Varies between, and within genomes
  • staining varies across genomes, due to variation in GC content
• “isochores”: regions with little internal GC variation (homogeneous)
  • long a point of discussion: difficult to define
• In humans:
  • L1, L2 isochores: low GC (≲41%)
  • H1, H2, H3 isochores: high GC (≳41%)
  • Imprecise bulk measurement

[Figure: hybridisation of H3 isochore to human genome]

Sadoni et al. (1999) J. Cell Biol. doi:10.1083/jcb.146.6.1211

DNA-DNA Hybridisation (DDH)

• Used for taxonomic classification in prokaryotes from 1960s
• Sibley & Ahlquist redefined bird and primate phylogeny with DDH in 1980s
• Not without controversy:
  • Suggestions of data manipulation
  • Close evolutionary relationships difficult to resolve due to paralogy (more on paralogy later…)
• Still hanging on as a de facto “gold standard” in microbiological taxonomic classification.

Sibley & Ahlquist (1987) J. Mol. Evol. doi:10.1007/BF02111285

Finding isochores

• Isochores: homogeneous regions of %GC content
  • Easy to find with windowed (100kbp) %GC calculation, from sequenced genomes.
  • 3200 isochores characterised in the human genome, consistent with 5 levels (L1, L2, H1, H2, H3) found by staining/hybridisation.

Costantini et al. (2006) Genome Res. doi:10.1101/gr.4910606
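A windowed %GC calculation of the kind mentioned above might look like the following sketch. It assumes fixed, non-overlapping windows and a sequence held in memory as a string; the segmentation used by Costantini et al. to call isochores from such a track is not reproduced here.

```python
def windowed_gc(seq, window=100_000):
    """%GC of consecutive, non-overlapping windows along a sequence.

    The final window may be shorter than `window`.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        values.append(100.0 * (chunk.count("G") + chunk.count("C")) / len(chunk))
    return values


# Toy sequence with a tiny window; the 100 kbp default suits real genomes
gc_track = windowed_gc("ATATATAT" + "GCGCGCGC" + "ATGC", window=8)
```

Runs of adjacent windows with similar values are candidate homogeneous regions.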

Comparative Genomic Hybridisation

• Two genomes, “reference” and “test”, labelled (red and green), then hybridised against a “normal” genome
• semiquantitative:
  • Red: loss (<2 copies) in tumour
  • Green: gain (3-4 copies) in tumour
  • Amplifications (>4 copies) in BOLD
  • Cases with the same Copy Number Aberration (CNA) are numbered

De Bortoli et al. (2006) BMC Cancer doi:10.1186/1471-2407-6-223

l Early
approaches
took
a
threshold
score
(present/absent)

l Later
approaches
used
known
reference
genome
sequence

context
(HMMs,
synteny)
to
improve
presence/absence
calls

l  No
hybridisa;on
=
“absent”
or“divergent”?

l  Not
nearly
as
good
as
sequencing
directly!

Array
Compara've
Genomic
Hybridisa'on

Pritchard
et
al.
(2009)
PLoS
Comp.
Biol.
doi:10.1371/journal.pcbi.1000473

k-mer Spectra

• k-mer spectrum:
  • CpG suppression (CGs are uncommon in vertebrate genomes), but (by simulation) only when in combination with a particular %GC, explains multimodality

Chor et al. (2009) Genome Biol. doi:10.1186/gb-2009-10-10-r108
