Comparative Genomics and Visualisation - Part 2

Leighton Pritchard

Part 2

• Part 1
  • Experimental Comparative Genomics
  • Bulk and Whole Genome Comparisons
• Genome Features
• Who let the –logues out?
• Finishing The Hat

Genome Features

• Genes:
  • translation start
  • introns
  • exons
  • translation stop
  • translation terminator
• ncRNA:
  • tRNA – transfer RNA
  • rRNA – ribosomal RNA
  • CRISPRs – bacterial and archaeal defence (genome editing)
  • many other classes (including enhancers)

Genome Features

• Regulatory sites
  • Transcription start site (TSS)
  • RNA polymerase binding sites
  • Transcription Factor Binding Sites (TFBS)
  • Core, proximal and distal promoter regions
• Repetitive Regions and Mobile Elements
  • Tandem repeats
  • (retro-)transposable elements
    • Alu has ≈50,000 active copies in human genome
  • Phage inclusion (bacteria/archaea)

[Figure: human v mouse comparison]

Pennacchio & Rubin (2001) Nat. Rev. Genet. doi:10.1038/35052548

Genome Feature Identification

• Gene Finding:
  1. Empirical (evidence-based) methods:
     • Inference from known protein/cDNA/mRNA/EST sequence
     • Inference from mapped RNA reads
  2. Ab initio methods:
     • Identification of sequences associated with gene features:
       • TSS, CpG islands, Shine-Dalgarno sequence, stop codons, etc.
  3. Inference from genome comparisons/conservation

Liang et al. (2009) Genome Res. doi:10.1101/gr.088997.108
Brent (2007) Nat. Biotech. doi:10.1038/nbt0807-883
Korf (2004) BMC Bioinf. doi:10.1186/1471-2105-5-59

Genome Feature Identification

• Finding Regulatory Elements (short, degenerate):
  1. Empirical (evidence-based) methods:
     • Inference from protein-DNA binding experiments
     • Inference from coexpression
  2. Ab initio methods:
     • Identification of regulatory motifs (profile/other methods):
       • TATA, sigma-factor binding sites, etc.
     • statistical overrepresentation
     • from sequence properties
  3. Inference from sequence conservation/genome comparisons

Zhang et al. (2011) BMC Bioinf. doi:10.1186/1471-2105-12-238
Kilic et al. (2013) Nucl. Acids Res. doi:10.1093/nar/gkt1123
Vavouri & Elgar (2005) Curr. Op. Genet. Devel. doi:10.1016/j.gde.2005.05.002

Genome Feature Identification

• All prediction methods result in errors
• All experiments have error
• Genome comparisons can help correct errors
• [OPTIONAL ACTIVITY] – useful for exercise
  • predict_CDS.md Markdown
• Other options for prokaryotic genecalling:
  • Glimmer (http://ccb.jhu.edu/software/glimmer/index.shtml)
  • GeneMarkS (http://opal.biology.gatech.edu/)
  • RAST (http://rast.nmpdr.org/)
  • BASys (https://www.basys.ca/), etc.
• Options for eukaryotic genecalling:
  • GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/)
  • GeneMarkES (http://opal.biology.gatech.edu/gmseuk.html)
  • Augustus (http://augustus.gobics.de/), etc.

Who Let The -logues Out?

Evolutionary relationships of genome features can be complex.
We require precise terms to describe relationships between genome features.

Comparing Gene Features

• Given gene annotations for more than one genome, how can we organise and understand relationships?
  • Functional similarity (analogy)
  • Evolutionary common origin (homology, orthology, etc.)
  • Evolutionary/functional/family relationships (paralogy)

Terms first suggested by Fitch (1970) Syst. Zool. doi:10.2307/2412448

Attack of the –logues

• Technical terms describing evolutionary relationships
• Homologues: elements that are similar because they share a common ancestor (NOTE: There are NOT degrees of homology!)
• Analogues: elements that are (functionally?) similar, possibly through convergent evolution and not by sharing common ancestry
• Orthologues: homologues that diverged through speciation
• Paralogues: homologues that diverged through duplication within the same genome
• (also co-orthologues, xenologues, etc.)

Attack of the –logues

[Figure: a genome containing an ancestral genome feature, drawn against a time axis]

Attack of the –logues

[Figure: time axis; a speciation event splits ancestor:iA in a genome into species1:iA and species2:iA, which are labelled orthologues]

• Orthologues: homologues that diverged through speciation

Attack of the –logues

[Figure: time axis; a duplication within a genome turns ancestral copy:A into copy 1:A and copy 2:A’, which are labelled paralogues]

• Paralogues: homologues that diverged through duplication within the same genome

Attack of the –logues

[Figure: time axis; a speciation event (ancestor:iA to species1:iA and species2:iA) and a duplication event (species1:iA’, species1:iA, species2:iA), with relationships labelled orthologues, out-paralogues and in-paralogues]

Attack of the –logues

[Figure: time axis; a speciation event (ancestor:iA to species1:iA and species2:iA) and duplication events (species1:iA’, species1:iA, species2:iA, species2:iA’), with relationships labelled in-paralogues (twice), out-paralogues and orthologues]

Attack of the –logues

• BUT: biology is not well-behaved: relationships can be difficult to infer
  • Gene loss occurs
  • Homologues can diverge – sometimes very widely: hard to recognise
  • Reconstructed evolutionary trees for speciation events may not be robust

Kristensen et al. (2011) Brief. Bioinf. doi:10.1093/bib/bbr030

Attack of the –logues

[Figure: time axis relating historical events (speciation of ancestor:iA, duplication, extensive divergence within a genome) to contemporary sequence; after divergence some copies are marked species1:iA? and species2:iA?, and the relationships are labelled in-paralogues, (co-)orthologues? and out-paralogues/co-orthologues?]

Current classifications of orthology/paralogy are inferences

Attack of the –logues

• BUT: biology is not well-behaved: relationships can be difficult to infer
  • Gene loss occurs
  • Homologues can diverge – sometimes very widely: hard to recognise
  • Reconstructed evolutionary trees for speciation events may not be robust
• Some resources and tools ‘bend’ definitions, e.g. Ensembl Compara and OrthoMCL.
  http://www.ensembl.org/info/genome/compara/homology_method.html

Kristensen et al. (2011) Brief. Bioinf.
Note on “Orthology”

• Frequently abused/misused as a term
• “Orthology” is an evolutionary relationship, often bent into service as a functional descriptor
• Strictly defined only for two species or clades!
  • (cf. OrthoMCL, etc.)
• Orthology is not transitive (A being an orthologue of C, and B being an orthologue of C, does not imply that A is an orthologue of B)
  • (cf. EnsemblCompara definitions)

Storm & Sonnhammer (2002) Bioinformatics. doi:10.1093/bioinformatics/18.1.92

Ensembl Compara definitions

• within_species_paralog: same-species paralogue (in-paralogue)
• ortholog_one2one: orthologue
• ortholog_one2many: orthologue/paralogue relationship
• ortholog_many2many: orthologue/paralogue relationship

Vilella et al. (2009) Genome Res. doi:10.1101/gr.073585.107

NOTE: the taxonomy may not always be correct…

“The Ortholog Conjecture”

Without duplication, a gene is unlikely to change its basic function, because this would lead to loss of the original function, and this would be harmful.

Problems with the Ortholog Conjecture

• Nehrt et al. (2011) say:
  • Paralogues better predictor of function than orthologues
    • ∴ conjecture is false!
  • Cellular context better for protein function inference
  • Function defined from Gene Ontology (GO)

Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
Chen et al. (2012) PLoS Comp. Biol.

Problems with the Ortholog Conjecture

• But do we understand function well enough to test the conjecture?
• Chen et al. (2012) say: “No”
  • “examination of functional studies of homologs with identical protein sequences reveals experimental biases, annotation errors, and homology-based functional inferences that are labeled in GO as experimental. These problems […] make the current GO inappropriate for testing the ortholog conjecture”
  • Expression levels are more similar between orthologues than between paralogues (but is this “function”…?)

Nehrt et al. (2011) PLoS Comp. Biol.
Chen et al. (2012) PLoS Comp. Biol.

Finding “Orthologues”

The process of finding evolutionary (and/or functional) equivalents of genes across two or more organisms’ genomes.

Why are “orthologues” so important?

• Orthology formalises the concept of corresponding genes across multiple organisms.
  • Evolutionary
  • Functional? (“The Ortholog Conjecture”)
• Applications in:
  • Comparative genomics
  • Functional genomics
  • Phylogenetics, …
• Many (>35) databases attempt to describe orthologous relationships
  • http://questfororthologs.org/orthology_databases

Dessimoz (2011) Brief. Bioinf.

How to find orthologues?

• Many published methods and databases:
  • Pairwise between two genomes:
    • RBBH (aka BBH, RBH, etc.), RSD, InParanoid, RoundUp
  • Multi-genome
    • Graph-based: COG, eggNOG, OrthoDB, OrthoMCL, OMA, MultiParanoid
    • Tree-based: TreeFam, Ensembl Compara, PhylomeDB, LOFT
• Methods may apply different - or refined - definitions of orthology, paralogy, etc.

Salichos et al. (2011) PLoS One. doi:10.1371/journal.pone.0018755
Trachana et al. (2011) Bioessays doi:10.1002/bies.201100062
Kristensen et al. (2011) Brief. Bioinf.

Pairwise approaches

• S1, S2 are the gene sequence sets from two organisms
• Compare S1 to S2, and identify the most similar pairs of sequences: these are “orthologues” (or “putative orthologues”).
• Many similarity measures possible (which threshold: E-value, bit score, coverage…?):
  • Reciprocal best BLAST hit (RBBH) – used by e.g. InParanoid
  • Reciprocal smallest distance (RSD) – used by e.g. RoundUp
  • and so on…
• Can be extended to multi-organism clusters by graph-based approaches

Östlund et al. (2009) Nuc. Acids Res. doi:10.1093/nar/gkp931
DeLuca et al. (2012) Bioinf. doi:10.1093/bioinformatics/bts006

Reciprocal Best BLAST Hits

• S1, S2 are the gene sequence sets from two organisms
• BLASTP:
  • Query=S1, Subject=S2
  • Query=S2, Subject=S1
• Optionally filter BLAST hits (e.g. on %identity and %coverage)
• Find all pairs of sequences {GS1n, GS2n} in S1, S2 where GS1n is the best BLAST match to GS2n and GS2n is the best BLAST match to GS1n.

[Figure: a pair in which each sequence is the other's best hit is accepted (✔); a pair in which one direction is only the 2nd best hit is rejected (✘)]
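
The pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's find_rbbh.ipynb notebook: it assumes the two BLASTP searches have already been run, filtered, and reduced to (query, subject, bit score) tuples, and the function names and toy identifiers are hypothetical.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query, subject, bitscore) tuples.
    On a tied score the first hit seen is kept (an arbitrary choice).
    """
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(s1_vs_s2, s2_vs_s1):
    """Return sorted (gene_in_S1, gene_in_S2) pairs that are each other's best hit."""
    forward = best_hits(s1_vs_s2)
    reverse = best_hits(s2_vs_s1)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hit tables with made-up identifiers and bit scores.
s1_vs_s2 = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
            ("a2", "b2", 90.0), ("a3", "b1", 120.0)]
s2_vs_s1 = [("b1", "a1", 198.0), ("b1", "a3", 119.0),
            ("b2", "a1", 149.0), ("b2", "a2", 91.0)]

rbbh = reciprocal_best_hits(s1_vs_s2, s2_vs_s1)
print(rbbh)
```

Here a2's best hit is b2, but b2's best hit is a1, so the pair (a2, b2) is rejected, which is the ✘ case in the figure.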

Reciprocal Best BLAST Hits

• Advantages:
  • quick
  • easy
  • performs surprisingly well (see later…)
• Disadvantages:
  • misses paralogues
  • not good at identifying gene families or *-to-many relationships without more detailed analysis.
  • no strong theoretical/phylogenetic basis.

COG

• COG (Clusters of Orthologous Groups; now POG, KOG, eggNOG etc.)
• Graph extension of RBBH to clusters of mutual RBBH
  • “Any group of at least three proteins from different genomes, more similar to each other than any other proteins from those genomes, are an orthologous family.”
  • Conduct RBBH
  • Collapse paralogues
  • Detect “triangles”
  • Merge triangles having common side
  • Manual curation
• Databases have many outparalogues

Tatusov et al. (2000) Nucl. Acids Res. doi:10.1093/nar/28.1.33
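
The "detect triangles" and "merge triangles having common side" steps can be illustrated with a small graph routine. This is a sketch under stated assumptions, not the COG pipeline itself: it takes a list of RBBH edges as input, skips paralogue collapsing and manual curation, and uses hypothetical "genome:gene" labels. Because every RBBH edge joins genes from two different genomes, any triangle automatically spans three genomes.

```python
from itertools import combinations


def find_triangles(edges):
    """Return every set of three nodes that are pairwise connected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in edges:
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two nodes) into clusters."""
    triangles = sorted(triangles, key=sorted)
    parent = list(range(len(triangles)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    side_owner = {}
    for idx, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in side_owner:
                parent[find(idx)] = find(side_owner[side])
            else:
                side_owner[side] = idx

    clusters = {}
    for idx, tri in enumerate(triangles):
        clusters.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


# Toy RBBH edges between genes of genomes A-D (made-up labels).
edges = [("A:x", "B:x"), ("B:x", "C:x"), ("A:x", "C:x"),
         ("B:x", "D:x"), ("C:x", "D:x"),
         ("A:y", "B:y"),
         ("A:z", "B:z"), ("B:z", "C:z"), ("A:z", "C:z")]

cog_clusters = merge_triangles(find_triangles(edges))
print(cog_clusters)
```

The lone edge A:y/B:y never forms a triangle, so it does not appear in any cluster; this mirrors the "at least three proteins from different genomes" requirement quoted above.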

MCL

• MCL constructs a network from all-vs-all BLAST results
• Then applies matrix operations: expansion and inflation
• Iterative expansion and inflation until network convergence

Enright et al. (2002) Nucl. Acids Res. doi:10.1093/nar/30.7.1575

MCL

[Figure: the input network passes through repeated rounds of expansion and inflation until it resolves into a clustering]
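
A toy version of the expansion/inflation loop, in plain Python, may make the idea concrete. It is a simplified sketch and not the MCL program: the added self-loops, the inflation value of 2, the convergence tolerance, and the way clusters are read from the final matrix are all assumptions chosen for illustration, and the weights stand in for BLAST-derived similarities.

```python
def normalise(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [[matrix[r][c] / sums[c] for c in range(n)] for r in range(n)]


def expand(matrix):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]


def inflate(matrix, power):
    """Inflation: raise each entry to a power, then re-normalise columns."""
    return normalise([[value ** power for value in row] for row in matrix])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a weighted, undirected graph given as a square matrix."""
    n = len(adjacency)
    # Self-loops are added so that every column has a non-zero sum.
    m = normalise([[adjacency[r][c] + (1.0 if r == c else 0.0)
                    for c in range(n)] for r in range(n)])
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[r][c] - m[r][c])
                    for r in range(n) for c in range(n))
        m = new
        if delta < tol:
            break
    # Rows that keep non-negligible mass act as attractors; the columns
    # they attract form one cluster each.
    clusters = set()
    for row in m:
        members = frozenset(c for c, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(members) for members in clusters)


# Two tightly connected triplets joined by one weak edge (made-up weights).
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
mcl_clusters = mcl(adjacency)
print(mcl_clusters)
```

Expansion spreads flow along paths, while inflation strengthens strong flows and starves weak ones, so the weak bridge between the two triplets loses its flow over successive iterations.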

OrthoMCL

• http://orthomcl.org/orthomcl/
  1. Defines potential inparalogue, orthologue and co-orthologue pairs (using RBBH! – see algorithm description in papers directory)
  2. Applies MCL to cluster inparalogue, orthologue, co-orthologue pairs
• Output clusters include both orthologues and paralogues

Li et al. (2003) Genome Res. doi:10.1101/gr.1224503

Notes of Caution

• BLAST-based orthology methods (e.g. RBBH, InParanoid, COG) are fast!
• But they have some drawbacks:
  • No guarantee that sequence matches are transitive (A may match B at a domain differently than B matches C)
  • No evolutionary distance model
  • Multiple domain matches are not accounted for
• These methods find similar sequences, then make assumptions based on similarity and number of matches. They do not detect orthologues directly!
• Tree-based methods incorporate:
  • Evolutionary distance
  • Direct orthologue detection

Finding “Orthologues”

• Pairwise analysis: RBBH
• [ACTIVITY]
  • find_rbbh.ipynb iPython notebook
• Multi-organism analysis: MCL
• [ACTIVITY]
  • mcl_orthologues/README.md Markdown
  • mcl_orthologues.ipynb iPython notebook

Other Methods

• Synteny-based:
  • Homologene (NCBI):
    • http://www.ncbi.nlm.nih.gov/homologene
• Manual curation:
  • Mouse Genome Database (MGD):
    • http://www.informatics.jax.org/homology.shtml
• Tree-based:
  • EnsemblCompara (EMBL-EBI):
    • http://www.ensembl.org/info/genome/compara/index.html
  • TreeFam (EMBL-EBI):
    • http://www.treefam.org/
  • OrthologID:
    • http://nypg.bio.nyu.edu/orthologid/

Evaluating Orthologue Predictions

Which method works best?
(and what do we mean by “best” anyway?)

Evaluating Predictions

• Works the same way for all prediction tools
  1. Define a “validation set” (gold standard), unseen by the prediction tool
  2. Make predictions with the tool
  3. Evaluate confusion matrix and performance statistics
     • Sensitivity
     • Specificity
     • Accuracy

Confusion matrix:

              Standard +ve   Standard -ve
Predict +ve   TP             FP
Predict -ve   FN             TN

Performance statistics:

False positive rate          FP/(FP+TN)
False negative rate          FN/(TP+FN)
Sensitivity                  TP/(TP+FN)
Specificity                  TN/(FP+TN)
False discovery rate (FDR)   FP/(FP+TP)
Accuracy                     (TP+TN)/(TP+TN+FP+FN)
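
The statistics above follow directly from the four confusion-matrix counts. The sketch below assumes that predictions and the gold standard are available as sets of candidate items (for example orthologue pairs) drawn from a known universe of candidates; the function names and the toy data are hypothetical.

```python
def confusion_counts(predicted, gold, universe):
    """Return (TP, FP, FN, TN) for sets of predicted and gold-standard items."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, fn, tn


def performance(tp, fp, fn, tn):
    """Compute the performance statistics from the confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "fdr": fp / (fp + tp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Toy example: ten candidate items identified by made-up integer IDs.
universe = set(range(1, 11))
gold = {2, 3, 4}
predicted = {1, 2, 3}

counts = confusion_counts(predicted, gold, universe)
stats = performance(*counts)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy can look high simply because true negatives dominate, which is one reason sensitivity, specificity and FDR are reported alongside it.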

Evaluating Orthologue Predictions

• Take advantage of prokaryotic operon structure: conserved syntenic triplets likely to be orthologous
• Idea: If the outer pair in a syntenic triplet are orthologous, the middle gene is likely to be, too.
  • Middle genes are orthologue “gold standard”
• Do RBBH reliably identify middle genes from syntenic triplets?

Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100

Evaluating Orthologue Predictions

• Two well-characterised genomes compared against 573 prokaryotes
• Identified RBBH (with permissive BLAST settings)
• “Overwhelming majority” of middle genes (counterparts) are BBH
• 88-99% of BBH are in syntenic triplets
• Therefore, RBBH reliably finds orthologues

Wolf et al. (2012) Genome Biol. Evol. doi:10.1093/gbe/evs100

Evaluating Orthologue Predictions

• Four orthologue prediction algorithms:
  • RBBH (and cRBH)
  • RSD (and cRSD)
  • MultiParanoid
  • OrthoMCL
• Tested against 2,723 curated orthologues from six Saccharomycetes
• Rated by:
  • Sensitivity: TP/(TP+FN) – what proportion of orthologues are found
  • Specificity: TN/(TN+FP) – how well are non-orthologues excluded
  • Accuracy: (TP+TN)/(TP+TN+FP+FN) – general measure of performance
  • FDR: FP/(FP+TP) – what proportion of predictions are incorrect

Salichos et al. (2011) PLoS One.

Evaluating Orthologue Predictions

• Four orthologue prediction algorithms:
  • RBBH (cRBH)
  • RSD (cRSD)
  • MultiParanoid
  • OrthoMCL
• cRBH most accurate, and specific, with lowest FDR

Salichos et al. (2011) PLoS One.

Evaluating Orthologue Predictions

• Tests of several methods on a number of literature-based benchmarks for:
  • Correct branching of phylogeny
  • Grouping by function
    • GO similarity
    • EC number
    • Expression level
    • Gene Neighbourhood

Altenhoff & Dessimoz (2009) PLoS Comp. Biol.

Evaluating Orthologue Predictions

Altenhoff & Dessimoz (2009) PLoS Comp. Biol.

Evaluating Orthologue Predictions

• 70 gene family test, multiple evolutionary scenarios
• Tested databases with associated algorithms:

Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062

Evaluating Orthologue Predictions

• 70 gene family test set, multiple evolutionary scenarios
• All methods/dbs have strong scope for improvement.
• OrthoMCL poor performer, TreeFam & eggNOG do best

Trachana et al. (2011) Bioessays. doi:10.1002/bies.201100062

Orthologue Prediction Performance

• Performance varies by choice of method and interpretation of “orthology”
• Biggest influence is genome annotation quality
• Relative performance varies with benchmark choice
• (clustering) RBBH outperforms more complex algorithms under many circumstances

Selection Pressures

Signs of selection pressure identifiable by comparative genomics

Selection Pressures

• Defining core groups of genes by “orthology” allows analysis of those groups:
  • Synteny/collocation
  • Gene neighbourhood changes (e.g. genome expansion)
  • The pangenome: core and accessory genomes
• and sequences in those groups:
  • Multiple alignment
  • Domain detection
  • Identification of functional sites
  • Inference of evolutionary pressures

Synteny

• Selective pressures depend on gene (product) function
• Genes involving physically or functionally-interacting proteins tend to evolve under similar selective constraints
• Particularly in bacteria, this leads to co-expression as regulons and collocation in operons
• Collocation (and coregulation) may be identified by comparative genomics
• (This is also true when considering regulatory or metabolic networks, similarly to genome organisation)

Alvarez-Ponce et al. (2011) Genome Biol. Evol. doi:10.1093/gbe/evq084

Synteny

• Many tools/packages/services for synteny detection, e.g.
  • SyMAP
    • http://www.agcol.arizona.edu/software/symap/
  • i-ADHoRe
    • http://bioinformatics.psb.ugent.be/software/details/i--ADHoRe
  • MCScan, Cyntenator, etc.

Soderlund et al. (2011) Nucl. Acids. Res. doi:10.1093/nar/gkr123
Proost et al. (2011) Nucl. Acids Res.

i-ADHoRe

• Algorithm:
  1. Combine tandem repeats of genes/gene sets
  2. Make gene homology matrix (GHM): identify collinear regions (diagonals) for first genome pair
  3. Convert these to profiles
  4. Use GG2 algorithm to align profiles
  5. Search next genome with profiles, splitting them where necessary
  6. Iterate until complete
• Gives genome-scale multiple alignments of blocks of genes

Proost et al. (2011) Nucl. Acids Res.
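
Step 2 of the algorithm above, finding diagonals in a gene homology matrix, can be illustrated with a deliberately simplified sketch. It is not i-ADHoRe's implementation: it assumes homologous gene pairs are given as (position in genome A, position in genome B) cells, allows no gaps within a run, and uses a hypothetical minimum run length of three genes.

```python
def collinear_runs(homologous_cells, min_len=3):
    """Find gap-free diagonal runs in a gene homology matrix.

    homologous_cells: iterable of (i, j) positions where gene i of genome A
    is homologous to gene j of genome B.
    Returns runs on forward diagonals first, then on inverted diagonals.
    """
    cells = set(homologous_cells)
    runs = []
    for step in (1, -1):  # 1: same orientation, -1: inverted segment
        for i, j in sorted(cells):
            if (i - 1, j - step) in cells:
                continue  # this cell continues a run that started earlier
            run = [(i, j)]
            while (run[-1][0] + 1, run[-1][1] + step) in cells:
                run.append((run[-1][0] + 1, run[-1][1] + step))
            if len(run) >= min_len:
                runs.append(run)
    return runs


# Toy matrix: one forward block, one inverted block, one isolated hit.
cells = [(0, 2), (1, 3), (2, 4), (3, 5),   # collinear, same orientation
         (4, 9), (5, 8), (6, 7),           # collinear, inverted
         (2, 0)]                           # isolated homologue (noise)

runs = collinear_runs(cells)
for run in runs:
    print(run)
```

Real synteny detection also has to tolerate gaps and gene loss within a block, which this sketch ignores.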

i-ADHoRe

• [ACTIVITY]
  • i-ADHoRe/README.md Markdown
  • i-ADHoRe.ipynb iPython notebook

Genome Expansion

• Mobile/repeat elements reproduce and expand during evolution
• Generates sequence “laboratory” for variation and experiment
• e.g. Phytophthora infestans effector protein expansion and arms race

Haas et al. (2009) Nature. doi:10.1038/nature08358

Genome Expansion

• Mobile elements (MEs) are large, carry genes with them.
• Regions rich in MEs have larger gaps between consecutive genes
• Effector proteins are found preferentially in regions with large gaps, also show increased rates of evolutionary divergence.
• “Two-speed genome” associated with adaptability to new hosts/escape from evolutionary “bottleneck”

Haas et al. (2009) Nature. doi:10.1038/nature08358

The Pangenome

• The gene complement of a set of organisms (e.g. species group) is the pangenome, defined by the union of two gene sets:
  • Core genes: genes present in all examples (define common species characteristics)
  • Accessory genes: genes only present in a subset of examples (relevant to adaptation of individuals)
• Definition depends on composition of organism set
• Core genome hypothesis:
  • “The core genome is the primary cohesive unit defining a bacterial species.”
• Online tools available, e.g.
  • Panseq (http://lfz.corefacility.ca/panseq/)

Laing et al. (2010) BMC Bioinf. doi:10.1186/1471-2105-11-461
Lefébure et al. (2010) Genome Biol. Evol.
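
In set terms, the core genome is the intersection of the per-organism gene sets and the pangenome is their union. The sketch below assumes each organism has already been reduced to a set of orthologue-group identifiers; the function name and the toy identifiers are hypothetical.

```python
def partition_pangenome(gene_sets):
    """Split a collection of per-organism gene sets into pan, core and accessory.

    gene_sets: dict mapping organism name -> set of orthologue-group IDs.
    """
    genomes = list(gene_sets.values())
    pan = set().union(*genomes)          # every gene seen in any organism
    core = set.intersection(*genomes)    # genes present in all organisms
    accessory = pan - core               # genes present in only a subset
    return pan, core, accessory


# Toy data with made-up orthologue-group identifiers.
gene_sets = {
    "strain1": {"og1", "og2", "og3", "og4"},
    "strain2": {"og1", "og2", "og3", "og5"},
    "strain3": {"og1", "og2", "og6"},
}

pan, core, accessory = partition_pangenome(gene_sets)
print(sorted(core), sorted(accessory))
```

Adding or removing an organism changes all three sets, which is the sense in which the definition depends on the composition of the organism set.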

Defining a species’ core genome

• “Orthologue groups” with a representative in (nearly) every member of the set
• But we only have a sample of the species, not every member…
• …so use rarefaction curves to estimate core genome size.
  1. Randomly order organisms, and count number of ‘core’ and ‘new’ genes seen with each new genome addition.
  2. Repeat until you have a reasonable estimate of error/no new genes found

Lefébure et al. (2010) Genome Biol. Evol.
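
The rarefaction procedure can be sketched as repeated random orderings of the genomes, tracking how the core shrinks and the pangenome grows as each genome is added. This is an illustrative sketch only: the number of permutations, the fixed random seed, and the toy gene sets are assumptions, and a strict intersection is used for the core rather than the "(nearly) every member" relaxation mentioned above.

```python
import random


def rarefaction(gene_sets, n_permutations=100, seed=0):
    """Return mean core and pangenome sizes after adding 1..N genomes.

    gene_sets: dict mapping organism name -> set of orthologue-group IDs.
    """
    rng = random.Random(seed)
    names = sorted(gene_sets)
    core_sizes = [[] for _ in names]
    pan_sizes = [[] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        core = set(gene_sets[order[0]])
        pan = set()
        for k, name in enumerate(order):
            core &= gene_sets[name]   # genes still shared by all genomes so far
            pan |= gene_sets[name]    # genes seen in at least one genome so far
            core_sizes[k].append(len(core))
            pan_sizes[k].append(len(pan))

    def mean(values):
        return sum(values) / len(values)

    return [mean(v) for v in core_sizes], [mean(v) for v in pan_sizes]


# Toy data with made-up orthologue-group identifiers.
gene_sets = {
    "strain1": {"og1", "og2", "og3", "og4"},
    "strain2": {"og1", "og2", "og3", "og5"},
    "strain3": {"og1", "og2", "og6"},
}

core_curve, pan_curve = rarefaction(gene_sets)
print(core_curve, pan_curve)
```

The spread of values at each step across permutations (not just the mean) gives the error estimate mentioned in step 2, and a pangenome curve that is still rising at the last genome suggests that more genes remain to be found.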

Directional Selection

• Several statistical tests for directional selection, e.g.
  • QTL sign
  • Ka/Ks (dN/dS) ratio test – most commonly applied
  • Relative rate test
• Ka/Ks ratio:
  • Ka (or dN): number of non-synonymous substitutions per non-synonymous site
  • Ks (or dS): number of synonymous substitutions per synonymous site
  • Ka/Ks > 1 ⇒ positive selection; Ka/Ks < 1 ⇒ stabilising selection
  • Several methods/tools for calculation
    • PAML (http://abacus.gene.ucl.ac.uk/software/paml.html)
    • SeqinR (http://cran.r-project.org/web/packages/seqinr/index.html)
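
The arithmetic behind the ratio is simple once substitutions and sites have been counted. The sketch below assumes those four counts are already available (counting them from a codon alignment, and correcting for multiple substitutions, is what tools such as PAML do and is not shown); the function names and the example numbers are hypothetical.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from raw substitution and site counts.

    No correction for multiple substitutions at a site is applied.
    """
    ka = nonsyn_subs / nonsyn_sites   # non-synonymous substitutions per site
    ks = syn_subs / syn_sites         # synonymous substitutions per site
    return ka, ks, ka / ks


def interpret(ratio):
    """Label the Ka/Ks ratio using the thresholds from the slide."""
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "stabilising selection"
    return "neutral"  # Ka/Ks == 1: conventional reading, not on the slide


# Made-up counts for one pair of aligned coding sequences.
ka, ks, ratio = ka_ks(nonsyn_subs=6, nonsyn_sites=300,
                      syn_subs=10, syn_sites=100)
print(f"Ka={ka:.3f} Ks={ks:.3f} Ka/Ks={ratio:.2f}: {interpret(ratio)}")
```

A gene-wide ratio below 1 can still hide individual sites under positive selection, which is why site-level models are often used in practice.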

Genome-Wide Positive Selection

Lefébure & Stanhope (2009) Genome Res. doi:10.1101/gr.089250.108

An Analysis Output

• Class comparison: animal-pathogenic (APE) vs plant-associated bacteria (PAB)
• Presence of horizontally-acquired islands (HAI)
• Genes with greater similarity to PAB than APE

Toth et al. (2006) Annu. Rev. Phytopath. doi:10.1146/annurev.phyto.44.070505.143444

Things I Didn’t Get To

• Genome-Wide Association Studies (GWAS):
  • Try http://genenetwork.org/ to play with some data
• Prediction of regulatory elements, e.g.
  • Kellis et al. (2003) Nature doi:10.1038/nature01644
  • King et al. (2007) Genome Res. doi:10.1101/gr.5592107
  • Chaivorapol et al. (2008) BMC Bioinf. doi:10.1186/1471-2105-9-455
  • CompMOBY: http://genome.ucsf.edu/compmoby/
• Detection of Horizontal/Lateral Gene Transfer (HGT/LGT), e.g.
  • Tsirigos & Rigoutsos (2005) Nucl. Acids. Res. doi:10.1093/nar/gki187
• Phylogenomics, e.g.
  • Delsuc et al. (2005) Nat. Rev. Genet. doi:10.1038/nrg1603

Finishing The Hat

Some of the things I hope you have taken away from the lectures/activities

Take-Home Messages

• Comparative genomics is a powerful set of techniques for:
  • Understanding and identifying evolutionary processes and mechanisms
  • Reconstructing detailed evolutionary history of a set of organisms
  • Identifying and understanding common genomic features of organisms
  • Providing hypotheses about gene function for experimental investigation
• A huge amount of data is available to work with
  • And it’s only going to get much, much larger
• Results feed into many areas of study:
  • Medicine and health
  • Agriculture and food security
  • Basic biology in all fields
  • Systems and synthetic biology

Take-Home Messages

• Comparative genomics is essentially based around comparisons
  • What is similar between two genomes? What is different?
• Comparative genomics is evolutionary genomics
• Large datasets benefit from visualisation for effective interpretation
  • Much scope for improvement in visualisation
• Tools with the same purpose give different output
  • BLAST vs MUMmer
  • RBBH vs MCL
  • Choice of application matters for correctness and interpretation! – understand what the application does, and its limits.

Take-Home Messages

• Comparative genomics is
  • Fun
  • Indoor work, in the warm and dry
  • Not a job that involves heavy lifting

Credits

• This slideshow is shared under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/)
• Copyright is held by The James Hutton Institute http://www.hutton.ac.uk
• You may freely use this material in research, papers, and talks so long as acknowledgement is made.
