Slides from a Comparative Genomics and Visualisation course (part 2) presented at the University of Dundee, 11th March 2014. Other materials are available at GitHub (https://github.com/widdowquinn/Teaching)
2. Part 2
• Part 1
  • Experimental Comparative Genomics
  • Bulk and Whole Genome Comparisons
• Genome Features
• Who let the –logues out?
• Finishing The Hat
3. Genome Features
• Genes:
  • translation start
  • introns
  • exons
  • translation stop
  • translation terminator
• ncRNA:
  • tRNA – transfer RNA
  • rRNA – ribosomal RNA
  • CRISPRs – bacterial and archaeal defence (genome editing)
  • many other classes (including enhancers)
4. Genome Features
• Regulatory sites
  • Transcription start site (TSS)
  • RNA polymerase binding sites
  • Transcription Factor Binding Sites (TFBS)
  • Core, proximal and distal promoter regions
• Repetitive Regions and Mobile Elements
  • Tandem repeats
  • (retro-)transposable elements
    • Alu has ≈50,000 active copies in the human genome
  • Phage inclusion (bacteria/archaea)
Pennacchio & Rubin (2001) Nat. Rev. Genet. doi:10.1038/35052548
[Figure: human v mouse comparison]
5. Genome Feature Identification
• Gene Finding:
  1. Empirical (evidence-based) methods:
     • Inference from known protein/cDNA/mRNA/EST sequence
     • Inference from mapped RNA reads
  2. Ab initio methods:
     • Identification of sequences associated with gene features:
       • TSS, CpG islands, Shine-Dalgarno sequence, stop codons, etc.
  3. Inference from genome comparisons/conservation
Liang et al. (2009) Genome Res. doi:10.1101/gr.088997.108
Brent (2007) Nat. Biotech. doi:10.1038/nbt0807-883
Korf (2004) BMC Bioinf. doi:10.1186/1471-2105-5-59
6. Genome Feature Identification
• Finding Regulatory Elements (short, degenerate):
  1. Empirical (evidence-based) methods:
     • Inference from protein-DNA binding experiments
     • Inference from coexpression
  2. Ab initio methods:
     • Identification of regulatory motifs (profile/other methods):
       • TATA, sigma-factor binding sites, etc.
     • Statistical overrepresentation
     • Identification from sequence properties
  3. Inference from sequence conservation/genome comparisons
Zhang et al. (2011) BMC Bioinf. doi:10.1186/1471-2105-12-238
Kilic et al. (2013) Nucl. Acids Res. doi:10.1093/nar/gkt1123
Vavouri & Elgar (2005) Curr. Op. Genet. Devel. doi:10.1016/j.gde.2005.05.002
7. Genome Feature Identification
• All prediction methods result in errors
• All experiments have error
• Genome comparisons can help correct errors
• [OPTIONAL ACTIVITY] – useful for exercise
  • predict_CDS.md Markdown
• Other options for prokaryotic genecalling:
  • Glimmer (http://ccb.jhu.edu/software/glimmer/index.shtml)
  • GeneMarkS (http://opal.biology.gatech.edu/)
  • RAST (http://rast.nmpdr.org/)
  • BASys (https://www.basys.ca/), etc.
• Options for eukaryotic genecalling:
  • GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/)
  • GeneMarkES (http://opal.biology.gatech.edu/gmseuk.html)
  • Augustus (http://augustus.gobics.de/), etc.
8. Who Let The -logues Out?
Evolutionary relationships of genome features can be complex. We require precise terms to describe relationships between genome features.
9. Comparing Gene Features
• Given gene annotations for more than one genome, how can we organise and understand relationships?
  • Functional similarity (analogy)
  • Evolutionary common origin (homology, orthology, etc.)
  • Evolutionary/functional/family relationships (paralogy)
Terms first suggested by Fitch (1970) Syst. Zool. doi:10.2307/2412448
10. Attack of the –logues
• Technical terms describing evolutionary relationships
  • Homologues: elements that are similar because they share a common ancestor (NOTE: There are NOT degrees of homology!)
  • Analogues: elements that are (functionally?) similar, possibly through convergent evolution and not by sharing common ancestry
  • Orthologues: homologues that diverged through speciation
  • Paralogues: homologues that diverged through duplication within the same genome
  • (also co-orthologues, xenologues, etc.)
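The orthologue/paralogue definitions above can be sketched as a rule on a gene tree: a pair of genes is labelled by the event at their last common ancestor. This is a minimal illustration, assuming the tree and its event labels are already known (in practice they are inferred); the nested-tuple tree format and the names `leaves` and `classify_pairs` are hypothetical and not taken from any tool in these slides.

```python
from itertools import product

# Hypothetical gene tree format: (event, left_subtree, right_subtree).
# Leaves are plain strings; events are "speciation" or "duplication".


def leaves(node):
    """Return the list of leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologues" if event == "speciation" else "paralogues"
    # Every leaf on the left meets every leaf on the right at this node.
    for a, b in product(leaves(left), leaves(right)):
        out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


# A duplication in the ancestor, followed by speciation of both copies.
tree = ("duplication",
        ("speciation", "species1:iA", "species2:iA"),
        ("speciation", "species1:iA'", "species2:iA'"))
relationships = classify_pairs(tree)
print(relationships[frozenset(("species1:iA", "species2:iA"))])   # orthologues
print(relationships[frozenset(("species1:iA", "species1:iA'"))])  # paralogues
```

The sketch does not separate in-paralogues from out-paralogues, which would also need a reference speciation event.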
11. Attack of the –logues
[Diagram: an ancestral genome feature within a genome, followed through time]
12. Attack of the –logues
[Diagram: speciation over time. ancestor:iA gives species1:iA and species2:iA, labelled as orthologues]
• Orthologues: homologues that diverged through speciation
13. Attack of the –logues
[Diagram: duplication over time. ancestral copy:A gives copy 1:A and copy 2:A’ in the same genome, labelled as paralogues]
• Paralogues: homologues that diverged through duplication within the same genome
14. Attack of the –logues
[Diagram: gene tree over time combining speciation and duplication events. Nodes: ancestor:iA, species1:iA, species1:iA’, species2:iA. Labelled relationships: orthologues, out-paralogues, in-paralogues]
15. Attack of the –logues
[Diagram: gene tree over time combining speciation and duplication events. Nodes: ancestor:iA, species1:iA, species1:iA’, species2:iA, species2:iA’. Labelled relationships: in-paralogues (two groups), out-paralogues, orthologues]
16. Attack of the –logues
• BUT: biology is not well-behaved: relationships can be difficult to infer
  • Gene loss occurs
  • Homologues can diverge – sometimes very widely: hard to recognise
  • Reconstructed evolutionary trees for speciation events may not be robust
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030
17. Attack of the –logues
[Diagram: historical events (speciation, duplication) over time lead from ancestor:iA to species1:iA, species1:iA’, species2:iA and species2:iA’. After extensive divergence, the contemporary sequences (species1:iA?, species2:iA?) might be read as in-paralogues, (co-)orthologues, or out-paralogues/co-orthologues]
Current classifications of orthology/paralogy are inferences
18. Attack of the –logues
• BUT: biology is not well-behaved: relationships can be difficult to infer
  • Gene loss occurs
  • Homologues can diverge – sometimes very widely: hard to recognise
  • Reconstructed evolutionary trees for speciation events may not be robust
• Some resources and tools ‘bend’ definitions, e.g. Ensembl Compara and OrthoMCL.
  • http://www.ensembl.org/info/genome/compara/homology_method.html
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030
19. Note on “Orthology”
• Frequently abused/misused as a term
• “Orthology” is an evolutionary relationship, often bent into service as a functional descriptor
• Strictly defined only for two species or clades!
  • (cf. OrthoMCL, etc.)
• Orthology is not transitive (A is an orthologue of C and B is an orthologue of C does not imply A is an orthologue of B)
  • (cf. EnsemblCompara definitions)
Storm & Sonnhammer (2002) Bioinformatics. doi:10.1093/bioinformatics/18.1.92
20. Ensembl Compara definitions
• within_species_paralog: same-species paralogue (in-paralogue)
• ortholog_one2one: orthologue
• ortholog_one2many: orthologue/paralogue relationship
• ortholog_many2many: orthologue/paralogue relationship
Vilella et al. (2009) Genome Res. doi:10.1101/gr.073585.107
NOTE: the taxonomy may not always be correct…
21. “The Ortholog Conjecture”
Without duplication, a gene is unlikely to change its basic function, because this would lead to loss of the original function, and this would be harmful.
22. Problems with the Ortholog Conjecture
• Nehrt et al. (2011) say:
  • Paralogues are a better predictor of function than orthologues
    • ∴ conjecture is false!
  • Cellular context is better for protein function inference
• Function defined from Gene Ontology (GO)
Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
Chen et al. (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
23. Problems with the Ortholog Conjecture
• But do we understand function well enough to test the conjecture?
• Chen et al. (2012) say: “No”
  • “examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems […] make the current GO inappropriate for testing the ortholog conjecture”
  • Expression levels are more similar for orthologues than for paralogues (but is this “function”…?)
Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
Chen et al. (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
24. Finding “Orthologues”
The process of finding evolutionary (and/or functional) equivalents of genes across two or more organisms’ genomes.
25. Why are “orthologues” so important?
• Orthology formalises the concept of corresponding genes across multiple organisms.
  • Evolutionary
  • Functional? (“The Ortholog Conjecture”)
• Applications in:
  • Comparative genomics
  • Functional genomics
  • Phylogenetics, …
• Many (>35) databases attempt to describe orthologous relationships
  • http://questfororthologs.org/orthology_databases
Dessimoz (2011) Brief. Bioinf. doi:10.1093/bib/bbr057
26. How to find orthologues?
• Many published methods and databases:
  • Pairwise between two genomes:
    • RBBH (aka BBH, RBH, etc.), RSD, InParanoid, RoundUp
  • Multi-genome
    • Graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid
    • Tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT
• Methods may apply different (or refined) definitions of orthology, paralogy, etc.
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755
Trachana et al. (2011) Bioessays doi:10.1002/bies.201100062
Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030
27. Pairwise approaches
• S1, S2 are the gene sequence sets from two organisms
• Compare S1 to S2, and identify the most similar pairs of sequences: these are “orthologues” (or “putative orthologues”).
• Many similarity measures possible (which threshold: E-value, bit score, coverage…?):
  • Reciprocal best BLAST hit (RBBH) – used by e.g. InParanoid
  • Reciprocal smallest difference (RSD) – used by e.g. RoundUp
  • and so on…
• Can be extended to multi-organism clusters by graph-based approaches
Östlund et al. (2009) Nuc. Acids Res. doi:10.1093/nar/gkp931
DeLuca et al. (2012) Bioinf. doi:10.1093/bioinformatics/bts006
28. Reciprocal Best BLAST Hits
• S1, S2 are the gene sequence sets from two organisms
• BLASTP:
  • Query=S1, Subject=S2
  • Query=S2, Subject=S1
• Optionally filter BLAST hits (e.g. on %identity and %coverage)
• Find all pairs of sequences {GS1n, GS2n} in S1, S2 where GS1n is the best BLAST match to GS2n and GS2n is the best BLAST match to GS1n.
[Diagram: two sequences that are each other's best hit form an RBBH (✔); a pair where one sequence is only the 2nd best hit of the other does not (✘)]
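The pairing step above can be sketched in a few lines. This is a minimal illustration, assuming the two BLASTP runs have already been parsed into (query, subject, bit score) tuples; the parsing and any %identity/%coverage filtering are not shown, and the function names and gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    On ties, the first hit seen for a query is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return sorted (gene_in_S1, gene_in_S2) pairs that are mutual best hits."""
    best_12 = best_hits(hits_1_vs_2)
    best_21 = best_hits(hits_2_vs_1)
    return sorted((g1, g2) for g1, g2 in best_12.items()
                  if best_21.get(g2) == g1)


# Made-up scores for illustration only.
s1_vs_s2 = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
            ("a2", "b1", 120.0), ("a2", "b2", 80.0)]
s2_vs_s1 = [("b1", "a1", 248.0), ("b1", "a2", 119.0),
            ("b2", "a1", 91.0), ("b2", "a2", 79.0)]
rbbh = reciprocal_best_hits(s1_vs_s2, s2_vs_s1)
print(rbbh)  # [('a1', 'b1')]
```

Here a2 has b1 as its best hit, but b1 prefers a1, so a2 is left unpaired. This is the ✘ case in the diagram.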
29. Reciprocal Best BLAST Hits
• Advantages:
  • quick
  • easy
  • performs surprisingly well (see later…)
• Disadvantages:
  • misses paralogues
  • not good at identifying gene families or *-to-many relationships without more detailed analysis.
  • no strong theoretical/phylogenetic basis.
30. COG
• COG (Clusters of Orthologous Groups; now POG, KOG, eggNOG etc.)
  • Graph extension of RBBH to clusters of mutual RBBH
  • “Any group of at least three proteins from different genomes, more similar to each other than any other proteins from those genomes, are an orthologous family.”
• Conduct RBBH
• Collapse paralogues
• Detect “triangles”
• Merge triangles having common side
• Manual curation
• Databases have many outparalogues
Tatusov et al. (2000) Nucl. Acids Res. doi:10.1093/nar/28.1.33
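The “detect triangles” and “merge triangles having common side” steps can be sketched as follows. This is a simplified illustration, not the COG pipeline itself: paralogue collapsing and manual curation are omitted, and the `genome:gene` identifier convention and function names are assumptions made for the example.

```python
from itertools import combinations


def genome_of(gene):
    """Genome label, assuming hypothetical 'genome:gene' identifiers."""
    return gene.split(":")[0]


def find_triangles(rbbh_pairs):
    """Return triangles: three genes from three genomes, all pairwise RBBH."""
    edges = {frozenset(pair) for pair in rbbh_pairs}
    genes = sorted({gene for edge in edges for gene in edge})
    triangles = []
    for trio in combinations(genes, 3):
        if len({genome_of(gene) for gene in trio}) < 3:
            continue
        if all(frozenset(side) in edges for side in combinations(trio, 2)):
            triangles.append(set(trio))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) into clusters."""
    clusters = []  # each cluster is a list of triangles
    for triangle in triangles:
        merged, kept = [triangle], []
        for cluster in clusters:
            if any(len(triangle & other) >= 2 for other in cluster):
                merged.extend(cluster)
            else:
                kept.append(cluster)
        clusters = kept + [merged]
    return [set().union(*cluster) for cluster in clusters]


rbbh_pairs = [
    ("g1:a", "g2:a"), ("g1:a", "g3:a"), ("g2:a", "g3:a"),
    ("g2:a", "g4:a"), ("g3:a", "g4:a"),
    ("g1:b", "g2:b"),  # an RBBH pair that is not part of any triangle
]
triangles = find_triangles(rbbh_pairs)
clusters = merge_triangles(triangles)
print(clusters)  # one cluster holding g1:a, g2:a, g3:a and g4:a
```

The two triangles share the side g2:a/g3:a, so they merge into one four-gene cluster even though g1:a and g4:a are not RBBH of each other.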
31. MCL
• MCL constructs a network from all-vs-all BLAST results
• Then applies matrix operations: expansion and inflation
• Iterative expansion and inflation until network convergence
Enright et al. (2002) Nucl. Acids Res. doi:10.1093/nar/30.7.1575
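The expansion/inflation loop can be sketched on a small matrix in pure Python. This is a toy illustration and not the MCL program: the added self-loops, the inflation value of 2.0, the convergence tolerance and the way clusters are read from the final matrix are assumptions made for the example, and real inputs would be weighted by BLAST similarity.

```python
def normalise_columns(matrix):
    """Scale each column to sum to 1 (a column-stochastic matrix)."""
    size = len(matrix)
    sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [[matrix[r][c] / sums[c] for c in range(size)] for r in range(size)]


def expand(matrix):
    """Expansion: square the matrix, spreading flow along two-step walks."""
    size = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(size))
             for c in range(size)] for r in range(size)]


def inflate(matrix, power):
    """Inflation: raise each entry to a power, then re-normalise columns."""
    return normalise_columns([[value ** power for value in row]
                              for row in matrix])


def mcl(adjacency, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Return clusters (frozensets of node indices) for a weighted graph."""
    size = len(adjacency)
    # Self-loops (an assumption here) keep every column sum non-zero.
    matrix = normalise_columns(
        [[adjacency[r][c] + (1.0 if r == c else 0.0) for c in range(size)]
         for r in range(size)])
    for _ in range(max_iterations):
        updated = inflate(expand(matrix), inflation)
        change = max(abs(updated[r][c] - matrix[r][c])
                     for r in range(size) for c in range(size))
        matrix = updated
        if change < tolerance:
            break
    # Rows that still carry flow define clusters of the columns they attract.
    clusters = set()
    for row in matrix:
        members = frozenset(c for c, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Two triangles with no edges between them (hypothetical similarity graph).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(two_triangles)
print(sorted(sorted(cluster) for cluster in clusters))  # [[0, 1, 2], [3, 4, 5]]
```

Raising the inflation value generally gives finer-grained clusters, which is one reason different MCL settings can report different groupings for the same BLAST results.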
33. OrthoMCL
• http://orthomcl.org/orthomcl/
  1. Defines potential inparalogue, orthologue and co-orthologue pairs (using RBBH! – see algorithm description in papers directory)
  2. Applies MCL to cluster inparalogue, orthologue, co-orthologue pairs
• Output clusters include both orthologues and paralogues
Li et al. (2003) Genome Res. doi:10.1101/gr.1224503
34. Notes of Caution
• BLAST-based orthology methods (e.g. RBBH, InParanoid, COG) are fast!
• But they have some drawbacks:
  • No guarantee that sequence matches are transitive (A may match B at a domain differently than B matches C)
  • No evolutionary distance model
  • Multiple domain matches are not accounted for
• These methods find similar sequences, then make assumptions based on similarity and number of matches. They do not detect orthologues directly!
• Tree-based methods incorporate:
  • Evolutionary distance
  • Direct orthologue detection
38. Evaluating Predictions
• Works the same way for all prediction tools
  1. Define a “validation set” (gold standard), unseen by the prediction tool
  2. Make predictions with the tool
  3. Evaluate confusion matrix and performance statistics
     • Sensitivity
     • Specificity
     • Accuracy

                Standard: +ve   Standard: -ve
Predict +ve     TP              FP
Predict -ve     FN              TN

False positive rate: FP/(FP+TN)
False negative rate: FN/(TP+FN)
Sensitivity: TP/(TP+FN)
Specificity: TN/(FP+TN)
False discovery rate (FDR): FP/(FP+TP)
Accuracy: (TP+TN)/(TP+TN+FP+FN)
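The statistics above follow directly from the four confusion-matrix counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def performance(tp, fp, fn, tn):
    """Performance statistics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "false_discovery_rate": fp / (fp + tp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Made-up counts for illustration only.
stats = performance(tp=80, fp=10, fn=20, tn=90)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and false negative rate sum to 1, as do specificity and false positive rate, so reporting one of each pair is enough.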
39. Evaluating Orthologue Predictions
• Take advantage of prokaryotic operon structure: conserved syntenic triplets likely to be orthologous
• Idea: If the outer pair in a syntenic triplet are orthologous, the middle gene is likely to be, too.
• Middle genes are orthologue “gold standard”
• Do RBBH reliably identify middle genes from syntenic triplets?
Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100
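The syntenic-triplet idea can be sketched as a scan along one genome's gene order. This is a simplified reading of the approach, not the procedure of Wolf et al.: it assumes linear gene orders, a ready-made mapping of outer-gene orthologues, and hypothetical gene names.

```python
def syntenic_middle_pairs(order_1, order_2, outer_orthologues):
    """Pair up middle genes of triplets whose outer genes are orthologous.

    order_1, order_2: gene names in chromosomal order for each genome.
    outer_orthologues: dict mapping genome 1 genes to genome 2 genes.
    """
    position_2 = {gene: index for index, gene in enumerate(order_2)}
    pairs = []
    for left, middle, right in zip(order_1, order_1[1:], order_1[2:]):
        left_2 = outer_orthologues.get(left)
        right_2 = outer_orthologues.get(right)
        if left_2 not in position_2 or right_2 not in position_2:
            continue
        i, j = position_2[left_2], position_2[right_2]
        # The outer orthologues must flank exactly one gene, in either
        # orientation, for the triplet to count as syntenic.
        if abs(i - j) == 2:
            pairs.append((middle, order_2[(i + j) // 2]))
    return pairs


order_1 = ["a1", "a2", "a3", "a4", "a5"]
order_2 = ["b1", "b2", "b3", "b9", "b5"]
outer = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}
gold_standard = syntenic_middle_pairs(order_1, order_2, outer)
print(gold_standard)  # [('a2', 'b2'), ('a4', 'b9')]
```

The resulting middle-gene pairs could then be compared with RBBH output to see how often the two agree.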
40. Evaluating Orthologue Predictions
• Two well-characterised genomes compared against 573 prokaryotes
• Identified RBBH (with permissive BLAST settings)
• “Overwhelming majority” of middle genes (counterparts) are BBH
• 88-99% of BBH are in syntenic triplets
• Therefore, RBBH reliably finds orthologues
Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100
41. Evaluating Orthologue Predictions
• Four orthologue prediction algorithms:
  • RBBH (and cRBH)
  • RSD (and cRSD)
  • MultiParanoid
  • OrthoMCL
• Tested against 2,723 curated orthologues from six Saccharomycetes
• Rated by:
  • Sensitivity: TP/(TP+FN) – what proportion of orthologues are found
  • Specificity: TN/(TN+FP) – how well are non-orthologues excluded
  • Accuracy: (TP+TN)/(TP+TN+FP+FN) – general measure of performance
  • FDR: FP/(FP+TP) – what proportion of predictions are incorrect
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755
42. Evaluating Orthologue Predictions
• Four orthologue prediction algorithms:
  • RBBH (cRBH)
  • RSD (cRSD)
  • MultiParanoid
  • OrthoMCL
• cRBH most accurate and specific, with lowest FDR
Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755
43. Evaluating Orthologue Predictions
• Tests of several methods on a number of literature-based benchmarks for:
  • Correct branching of phylogeny
  • Grouping by function
    • GO similarity
    • EC number
    • Expression level
    • Gene Neighbourhood
Altenhoff & Dessimoz (2009) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1000262
45. Evaluating Orthologue Predictions
• 70 gene family test, multiple evolutionary scenarios
• Tested databases with associated algorithms:
Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062
46. Evaluating Orthologue Predictions
• 70 gene family test set, multiple evolutionary scenarios
• All methods/dbs have strong scope for improvement.
• OrthoMCL poor performer, TreeFam & eggNOG do best
Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062
47. Orthologue Prediction Performance
• Performance varies by choice of method and interpretation of “orthology”
• Biggest influence is genome annotation quality
• Relative performance varies with benchmark choice
• (clustering) RBBH outperforms more complex algorithms under many circumstances
49. Selection Pressures
• Defining core groups of genes by “orthology” allows analysis of those groups:
  • Synteny/collocation
  • Gene neighbourhood changes (e.g. genome expansion)
  • The pangenome: core and accessory genomes
• and sequences in those groups:
  • Multiple alignment
  • Domain detection
  • Identification of functional sites
  • Inference of evolutionary pressures
50. Synteny
• Selective pressures depend on gene (product) function
• Genes involving physically or functionally-interacting proteins tend to evolve under similar selective constraints
• Particularly in bacteria, this leads to co-expression as regulons and collocation in operons
• Collocation (and coregulation) may be identified by comparative genomics
• (This is also true when considering regulatory or metabolic networks, similarly to genome organisation)
Alvarez-Ponce et al. (2011) Genome Biol. Evol. doi:10.1093/gbe/evq084
51. Synteny
• Many tools/packages/services for synteny detection, e.g.
  • SyMAP
    • http://www.agcol.arizona.edu/software/symap/
  • i-ADHoRe
    • http://bioinformatics.psb.ugent.be/software/details/i--ADHoRe
  • MCScan, Cyntenator, etc.
Soderlund et al. (2011) Nucl. Acids Res. doi:10.1093/nar/gkr123
Proost et al. (2011) Nucl. Acids Res. doi:10.1093/nar/gkr955
52. i-ADHoRe
• Algorithm:
  1. Combine tandem repeats of genes/gene sets
  2. Make gene homology matrix (GHM): identify collinear regions (diagonals) for first genome pair
  3. Convert these to profiles
  4. Use GG2 algorithm to align profiles
  5. Search next genome with profiles, splitting them where necessary
  6. Iterate until complete
• Gives genome-scale multiple alignments of blocks of genes
Proost et al. (2011) Nucl. Acids Res. doi:10.1093/nar/gkr955
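Step 2 above (the gene homology matrix and its diagonals) can be illustrated with a toy example. This is not i-ADHoRe's implementation: it assumes each gene already carries a gene family label, looks only for gap-free diagonals in the forward orientation, and skips profiles and alignment entirely. The function name and family labels are hypothetical.

```python
def collinear_runs(families_1, families_2, min_length=3):
    """Find forward diagonal runs in a gene homology matrix.

    families_1, families_2: gene family label per gene position.
    Cell (i, j) is set when positions i and j share a family label.
    Returns a list of (start_i, start_j, length) tuples.
    """
    cells = {(i, j)
             for i, family_i in enumerate(families_1)
             for j, family_j in enumerate(families_2)
             if family_i == family_j}
    runs = []
    for i, j in sorted(cells):
        if (i - 1, j - 1) in cells:
            continue  # part of a run that started earlier on this diagonal
        length = 1
        while (i + length, j + length) in cells:
            length += 1
        if length >= min_length:
            runs.append((i, j, length))
    return runs


genome_1 = ["f1", "f2", "f3", "f4", "f9"]
genome_2 = ["f7", "f2", "f3", "f4", "f8"]
runs = collinear_runs(genome_1, genome_2)
print(runs)  # [(1, 1, 3)]
```

A practical tool would also need to tolerate gaps and inverted segments, which this sketch ignores.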
54. Genome Expansion
• Mobile/repeat elements reproduce and expand during evolution
• Generates sequence “laboratory” for variation and experiment
• e.g. Phytophthora infestans effector protein expansion and arms race
Haas et al. (2009) Nature. doi:10.1038/nature08358
55. Genome Expansion
• Mobile elements (MEs) are large, and carry genes with them.
• Regions rich in MEs have larger gaps between consecutive genes
• Effector proteins are found preferentially in regions with large gaps, and also show increased rates of evolutionary divergence.
• “Two-speed genome” associated with adaptability to new hosts/escape from evolutionary “bottleneck”
Haas et al. (2009) Nature. doi:10.1038/nature08358
56. The Pangenome
• The gene complement of a set of organisms (e.g. species group) is the pangenome, defined by the union of two gene sets:
  • Core genes: genes present in all examples (define common species characteristics)
  • Accessory genes: genes only present in a subset of examples (relevant to adaptation of individuals)
• Definition depends on composition of organism set
• Core genome hypothesis:
  • “The core genome is the primary cohesive unit defining a bacterial species.”
• Online tools available, e.g.
  • Panseq (http://lfz.corefacility.ca/panseq/)
Laing et al. (2010) BMC Bioinf. doi:10.1186/1471-2105-11-461
Lefébure et al. (2010) Genome Biol. Evol. doi:10.1093/gbe/evq048
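With genes already grouped into orthologue families, the core/accessory split is plain set arithmetic. A minimal sketch, assuming each organism is represented by a set of family identifiers; the strain names, gene names and function name are hypothetical.

```python
def pangenome(gene_sets):
    """Split the pangenome into core and accessory gene families.

    gene_sets: dict mapping organism name to a set of gene family ids.
    Expects at least one organism.
    """
    sets = list(gene_sets.values())
    pan = set().union(*sets)          # present in any organism
    core = set.intersection(*sets)    # present in every organism
    return {"pan": pan, "core": core, "accessory": pan - core}


strains = {
    "strain_A": {"recA", "gyrB", "toxA"},
    "strain_B": {"recA", "gyrB"},
    "strain_C": {"recA", "gyrB", "pilX"},
}
result = pangenome(strains)
print(sorted(result["core"]))       # ['gyrB', 'recA']
print(sorted(result["accessory"]))  # ['pilX', 'toxA']
```

Adding or removing a strain changes both sets, which is the dependence on the composition of the organism set noted above.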
57. Defining a species’ core genome
• “Orthologue groups” with a representative in (nearly) every member of the set
• But we only have a sample of the species, not every member…
• …so use rarefaction curves to estimate core genome size.
  1. Randomly order organisms, and count number of ‘core’ and ‘new’ genes seen with each new genome addition.
  2. Repeat until you have a reasonable estimate of error/no new genes found
Lefébure et al. (2010) Genome Biol. Evol. doi:10.1093/gbe/evq048
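The rarefaction procedure can be sketched by repeatedly shuffling the genome order and tracking the core and pangenome sizes as genomes are added. This is a minimal illustration under assumed inputs (sets of gene family ids per organism); the function names, replicate count and fixed seed are choices made for the example, and the number of ‘new’ genes at each step is the increase in pangenome size.

```python
import random


def rarefaction(gene_sets, replicates=100, seed=0):
    """Return one curve per replicate: [(core_size, pan_size), ...]."""
    rng = random.Random(seed)
    names = list(gene_sets)
    curves = []
    for _ in range(replicates):
        order = names[:]
        rng.shuffle(order)
        core = set(gene_sets[order[0]])
        pan = set(gene_sets[order[0]])
        curve = [(len(core), len(pan))]
        for name in order[1:]:
            core &= gene_sets[name]   # families still seen in every genome
            pan |= gene_sets[name]    # families seen in any genome so far
            curve.append((len(core), len(pan)))
        curves.append(curve)
    return curves


def mean_curve(curves):
    """Average core and pan sizes at each step across replicates."""
    steps = len(curves[0])
    count = len(curves)
    return [(sum(curve[k][0] for curve in curves) / count,
             sum(curve[k][1] for curve in curves) / count)
            for k in range(steps)]


genomes = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "g2", "g4"},
    "C": {"g1", "g5"},
}
curves = rarefaction(genomes, replicates=50)
for step, (core_size, pan_size) in enumerate(mean_curve(curves), start=1):
    print(step, round(core_size, 2), round(pan_size, 2))
```

The spread of core sizes across replicates at each step gives a rough error estimate for the curve.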
58. Directional Selection
• Several statistical tests for directional selection, e.g.
  • QTL sign
  • Ka/Ks (dN/dS) ratio test – most commonly applied
  • Relative rate test
• Ka/Ks ratio:
  • Ka (or dN): number of non-synonymous substitutions per non-synonymous site
  • Ks (or dS): number of synonymous substitutions per synonymous site
  • Ka/Ks > 1 ⇒ positive selection; Ka/Ks < 1 ⇒ stabilising selection
• Several methods/tools for calculation
  • PAML (http://abacus.gene.ucl.ac.uk/software/paml.html)
  • SeqinR (http://cran.r-project.org/web/packages/seqinr/index.html)
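The ratio and its reading can be sketched from the definitions above. This is only arithmetic on counts that are assumed to be given: it applies no correction for multiple substitutions at a site and does not count sites from sequence data, both of which tools such as PAML handle. Function names and the example counts are hypothetical.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from uncorrected substitution counts.

    The ratio is None when Ks is zero, since it is then undefined.
    """
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    return ka, ks, (ka / ks if ks > 0 else None)


def interpret(ratio):
    """Read a Ka/Ks ratio using the thresholds on the slide."""
    if ratio is None:
        return "undefined (Ks = 0)"
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "stabilising selection"
    return "no evidence of selection (ratio = 1)"


# Made-up counts for illustration only.
ka, ks, ratio = ka_ks(nonsyn_subs=12, nonsyn_sites=600,
                      syn_subs=15, syn_sites=200)
print(round(ka, 3), round(ks, 3), round(ratio, 3), interpret(ratio))
```

In practice a ratio close to 1 needs a statistical test before any conclusion is drawn, which is why it is described above as a ratio test.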
60. An Analysis Output
• Class comparison: animal-pathogenic (APE) vs plant-associated bacteria (PAB)
• Presence of horizontally-acquired islands (HAI)
• Genes with greater similarity to PAB than APE
Toth et al. (2006) Annu. Rev. Phytopath. doi:10.1146/annurev.phyto.44.070505.143444
61. Things I Didn’t Get To
• Genome-Wide Association Studies (GWAS):
  • Try http://genenetwork.org/ to play with some data
• Prediction of regulatory elements, e.g.
  • Kellis et al. (2003) Nature doi:10.1038/nature01644
  • King et al. (2007) Genome Res. doi:10.1101/gr.5592107
  • Chaivorapol et al. (2008) BMC Bioinf. doi:10.1186/1471-2105-9-455
  • CompMOBY: http://genome.ucsf.edu/compmoby/
• Detection of Horizontal/Lateral Gene Transfer (HGT/LGT), e.g.
  • Tsirigos & Rigoutsos (2005) Nucl. Acids Res. doi:10.1093/nar/gki187
• Phylogenomics, e.g.
  • Delsuc et al. (2005) Nat. Rev. Genet. doi:10.1038/nrg1603
62. Finishing The Hat
Some of the things I hope you have taken away from the lectures/activities
63. Take-Home Messages
• Comparative genomics is a powerful set of techniques for:
  • Understanding and identifying evolutionary processes and mechanisms
  • Reconstructing detailed evolutionary history of a set of organisms
  • Identifying and understanding common genomic features of organisms
  • Providing hypotheses about gene function for experimental investigation
• A huge amount of data is available to work with
  • And it’s only going to get much, much larger
• Results feed into many areas of study:
  • Medicine and health
  • Agriculture and food security
  • Basic biology in all fields
  • Systems and synthetic biology
64. Take-Home Messages
• Comparative genomics is essentially based around comparisons
  • What is similar between two genomes? What is different?
• Comparative genomics is evolutionary genomics
• Large datasets benefit from visualisation for effective interpretation
  • Much scope for improvement in visualisation
• Tools with the same purpose give different output
  • BLAST vs MUMmer
  • RBBH vs MCL
  • Choice of application matters for correctness and interpretation! – understand what the application does, and its limits.
66. Credits
• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute (http://www.hutton.ac.uk)
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.