Introduction: Thousands of microbial genomes are available, yet even for the model organisms, a sizable portion of the genes have unknown function. Phyletic profiling is a technique that can predict their function by comparing the presence/absence profiles of their homologs across genomes. In addition, prokaryotic genomes contain an evolutionary signature of gene expression levels in the codon usage biases, where highly expressed genes prefer the codons better adapted to the cellular tRNA pools.
Objectives: We aimed to augment the existing phyletic profiling approaches by incorporating more detailed knowledge of gene evolutionary history, and to create a very large database of predicted gene functions directly usable by microbiologists.
Materials & methods: We used the OMA groups of orthologs and the paralogy relationships inferred through OMA's "witness of non-orthology" rule. Genes were assigned to Gene Ontology categories, and the phyletic profiles were compared using the CLUS classifier, which performs hierarchical multilabel classification using decision trees. We quantified significant codon biases using a Random Forest randomization test that compares against the composition of intergenic DNA. Codon biases in COG gene families were contrasted between microbes inhabiting different environments, while controlling for phylogenetic inertia.
Results: The genomic co-occurrence patterns of both the orthologs and the paralogs (the homologs separated by a speciation and by a duplication event, respectively) were informative and synergistic in a phylogenetic profiling setup, even though paralogy relationships are thought to conserve function less well. The resulting ~400,000 gene function predictions for 998 prokaryotes (at FDR < 10%) [...] method to systematically link codon adaptation within COG gene families to microbial phenotypes and environments (thus functionally characterizing the COGs) and experimentally validated the predictions for novel E. coli genes relevant for surviving oxidative, thermal or osmotic stress.
Conclusion: Our work towards enhancing phylogenetic profiling, as well as developing complementary genomic context approaches, will contribute to prioritizing experimental investigation of microbial gene function, cutting the time and cost needed for discovery.
Inferring microbial gene function from evolution of synonymous codon usage biases
1. Synonymous mutations - from bacterial evolution to somatic changes in human cancer
Fran Supek
1) Lehner group, CRG/EMBL Systems Biology Unit, Barcelona
2) Division of Electronics, RBI, Zagreb, Croatia
XXI Jornades de Biologia Molecular
Barcelona, 11.6.2014
Part 1: Inferring microbial gene function from evolution of codon biases.
3. Synonymous mutations
• (some) synonymous mutations are subject to evolutionary pressures
• clearly shown for many bacteria and yeasts
• likely also higher Eukarya (but weaker signal)
• how does selection for/against synonymous changes relate to gene
function in (a) evolution of bacteria and (b) in carcinogenesis?
(a) evolutionary trace across ~1000 bacterial genomes: adaptation to diverse environments
(b) somatic mutations in ~4000 human cancers: malignant transformation
( plush microbes in photos are from http://www.giantmicrobes.com/ )
4. • In what way can evolution of synonymous codon preference be used to
systematically infer gene function in bacteria?
• There are other simpler (known) ways to determine gene function
from the genome sequences:
• commonly/systematically applied: transfer of annotation via sequence
similarity (BLAST, COG, Pfam...)
• >30% of genes end up with no known function annotated. They may not have known
homologs, or their homologs may have no experimentally determined function.
• known but less common: genomic context methods, such as phyletic profiling
evolutionary trace across ~1000 bacterial genomes
adaptation to diverse environments
5. Phyletic (or phylogenetic) profiling
Pellegrini, Marcotte et al., PNAS (1999)
one genomic context method:
examines presence/absence patterns of homologous genes across species.
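The idea of comparing presence/absence patterns can be sketched in a few lines. This is only an illustration under stated assumptions: the Jaccard index is one common similarity measure, not necessarily the one used in the cited work, and the family names and the six-genome toy matrix are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def most_similar(query, profiles, top=3):
    """Rank the other gene families by profile similarity to the query family."""
    scores = [(jaccard(profiles[query], p), name)
              for name, p in profiles.items() if name != query]
    return sorted(scores, reverse=True)[:top]


# Hypothetical toy data: one row per gene family, one column per genome
# (1 = a homolog is present in that genome, 0 = absent).
profiles = {
    "famA": [1, 1, 0, 1, 0, 0],
    "famB": [1, 1, 0, 1, 0, 0],
    "famC": [0, 0, 1, 0, 1, 1],
    "famD": [1, 1, 0, 0, 0, 1],
}
ranked = most_similar("famA", profiles)
```

Families whose profiles track the query across genomes (here famB) become candidates for a shared function; an unannotated family would inherit hypotheses from its annotated neighbours.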
6. Kensche et al. (2008) J Royal Soc Interface.
~30 examples of success of phyletic profiling
• by 2008 -> n~=30
• by 2014 -> n~=300 (estimate)
• aim for: N > 3000
7. Enriching phyletic profiles with information on orthology and paralogy
[Table: one row per OMA group (OMA 1 ... OMA 64052), one column per species (Species 1 ... Species 998), plus a Function column of Gene Ontology terms. Function entries shown: OMA 1: GO:001, GO:007; OMA 2: ?; OMA 64051: GO:042; OMA 64052: GO:003, GO:160. Cell legend: 0; orthologs in cliques; orth. outside cliques; paralogs.]
groups of orthologs from OMA database:
Schneider, Dessimoz and Gonnet (2007) Bioinformatics
Skunca et al. PLoS Comp Biology 2013
doi:10.1371/journal.pcbi.1002852
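A profile matrix that distinguishes orthologs in cliques, orthologs outside cliques, and paralogs could be assembled roughly as follows. The integer encoding, the rule of keeping the "strongest" relation per cell, and all names in the example are assumptions made for illustration; the source does not specify how the cells were encoded.

```python
# Assumed (hypothetical) encoding of the cell types in an enriched profile.
LEVELS = {
    "absent": 0,
    "paralog": 1,
    "ortholog_outside_clique": 2,
    "ortholog_in_clique": 3,
}


def build_profiles(records, groups, species):
    """Build {group: [level per species]} from (group, species, relation) triples."""
    column = {name: j for j, name in enumerate(species)}
    matrix = {g: [LEVELS["absent"]] * len(species) for g in groups}
    for group, sp, relation in records:
        j = column[sp]
        # Assumption: if several relations are reported, keep the highest level.
        matrix[group][j] = max(matrix[group][j], LEVELS[relation])
    return matrix


# Hypothetical toy input.
species = ["sp1", "sp2", "sp3"]
groups = ["OMA_1", "OMA_2"]
records = [
    ("OMA_1", "sp1", "ortholog_in_clique"),
    ("OMA_1", "sp3", "paralog"),
    ("OMA_2", "sp2", "paralog"),
    ("OMA_2", "sp2", "ortholog_outside_clique"),
]
matrix = build_profiles(records, groups, species)
```

A multi-valued matrix like this can be fed to a decision-tree learner directly, which is one reason tree-based classifiers are convenient for enriched profiles.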
8. Accuracy of predicting GO categories strongly increases when adding paralogs
[Figure: bubbles are Gene Ontology categories; labels: "clique only", "+ orthologs (outside clique)", "+ paralogs", "+ para + ortho".]
9. Supervised machine learning is superior to common approaches based on pairwise distances
[Figure: AUC (area under ROC curve) for decision trees vs. an approach based on correlation of profiles.]
Schietgat et al. 2010. BMC Bioinfo
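For readers unfamiliar with the AUC metric used in this comparison, a minimal sketch follows. It computes AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count one half). The scores and labels are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for one GO category; label 1 = gene has the function.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so two methods can be compared per GO category regardless of how their raw scores are scaled.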
10. Experimental validation of predictions made
with phyletic profiling
• knockout mutants of E. coli in predicted genes
• three selected GO categories targeted by particular antibiotics:
• ‘response to DNA damage’
• ‘translation’
• ‘peptidoglycan-based cell wall biogenesis’
• predictions: 38 genes with expected precision > 60%
15. “We predict Gene Ontology annotations ...
for about 1.3 million poorly annotated
genes in 998 prokaryotes at a stringent
threshold of 90% Precision...”
“...about 19000 of those are highly
specific functions.”
published in:
Skunca et al. PLoS Comp Biology 2013
doi:10.1371/journal.pcbi.1002852
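Reporting predictions "at a threshold of 90% Precision" implies choosing a score cutoff on labelled data. The sketch below shows one simple way to do that; it is not taken from the cited paper, the function name and toy data are hypothetical, and it assumes all scores are distinct.

```python
def cutoff_for_precision(scored, target=0.9):
    """Lowest score cutoff such that predictions scoring at or above it
    have precision >= target on the labelled set (None if never reached).

    scored: list of (score, is_correct) pairs; scores are assumed distinct.
    """
    best = None
    true_pos = false_pos = 0
    for score, correct in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if correct:
            true_pos += 1
        else:
            false_pos += 1
        if true_pos / (true_pos + false_pos) >= target:
            best = score
    return best


# Hypothetical held-out predictions: (classifier score, prediction was correct).
validation = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False), (0.60, True),
    (0.50, True), (0.40, True), (0.30, True), (0.20, True), (0.10, False),
]
cutoff = cutoff_for_precision(validation, target=0.85)
```

Lowering the target admits more predictions at the cost of more false positives, which is the trade-off behind reporting larger prediction sets at less stringent thresholds.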
16. • Codon usage biases are another useful
source of evolutionary information
•... complementary to gene presence/absence
•... available from just the genome sequence
•... with an established biological rationale
17. tRNA levels and codon usage biases
[Figure: E. coli K-12 tRNA gene counts (a proxy for tRNA levels), per codon and anticodon.]
Commonly used codons typically correspond to abundant tRNAs, particularly in highly expressed genes.
18. Codon biases correlate to gene expression
[Figure B, E. coli genome: MILC (non-RP genes) vs. MILC (ribosomal protein genes); series: ribosomal protein genes, other highly expressed genes, rest of genome.]
Figure from Supek and Vlahoviček (2005) BMC Bioinformatics, doi:10.1186/1471-2105-6-182
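To make "codon distance to ribosomal protein genes" concrete, here is a simplified sketch. It computes a per-codon likelihood-ratio (G-statistic) distance between a gene's synonymous codon usage and that of a reference set. This follows the general goodness-of-fit idea but is not the published MILC formula (no length correction term is applied); the pseudocount and the toy sequences are assumptions.

```python
from collections import Counter
from itertools import product
from math import log

# Standard genetic code, codons ordered with bases T, C, A, G; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(seq):
    """Count sense codons in an in-frame coding sequence."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    return Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")


def synonymous_freqs(counts, pseudo=1.0):
    """Frequency of each codon among its synonyms, with a pseudocount (assumption)."""
    freqs = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] + pseudo for c in synonyms)
        for c in synonyms:
            freqs[c] = (counts[c] + pseudo) / total
    return freqs


def codon_distance(gene_seq, reference_seqs):
    """Simplified G-statistic per codon between a gene and a reference gene set."""
    gene = codon_counts(gene_seq)
    reference = Counter()
    for seq in reference_seqs:
        reference.update(codon_counts(seq))
    expected_freq = synonymous_freqs(reference)
    per_amino_acid = Counter()
    for codon, n in gene.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    g_stat = 0.0
    for codon, observed in gene.items():
        expected = expected_freq[codon] * per_amino_acid[CODON_TABLE[codon]]
        g_stat += 2.0 * observed * log(observed / expected)
    length = sum(gene.values())
    return g_stat / length if length else 0.0


# Hypothetical toy sequences: the reference prefers CTG (Leu), AAA (Lys), GAA (Glu).
reference = ["CTGCTGCTGCTGAAAAAAGAAGAA"]
d_similar = codon_distance("CTGCTGAAAGAA", reference)
d_different = codon_distance("CTACTAAAGGAG", reference)
```

A gene using the same preferred codons as the reference set gets a small distance, while one using rarer synonyms gets a larger one; with real data the reference would be the ribosomal protein genes of the same genome.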
19. • Do organisms adapt to the environment through changes in translation efficiency?
• Carbone A (2005) J Mol Evol – codon adaptation in metabolic pathways:
  • photosynthesis genes in Synechocystis (Bacteria)
  • methanogenesis genes in Methanosarcina (Archaea)
20. An example phenotype: oxygen requirement
• Man & Pilpel (2007) Nat Genet: 9 yeasts
[Figure: codon adaptation (low to high) of TCA cycle and glycolysis genes in aerobic vs. anaerobic yeasts.]
• Based on these examples, we aimed to systematically link:
• Many environments/phenotypes, with
• evolutionary change in translation efficiency across many gene families
21. Measuring translation efficiency
Method from Supek et al. (2010) PLoS Genetics, doi:10.1371/journal.pgen.1001004
[Figure, panels A-C: the codon usage of each gene is compared against intergenic DNA; genes are classified as non-HE or HE (highly expressed), with HE genes making up 4-20% of the genome; expression levels from microarrays on 19 diverse bacteria, shown as log2 expression ratio for OCU/non-OCU (from ref. [7]), HE/non-HE, and ribosomal proteins/all genes, with annotated increases in expression of 3.9x and 6.0x.]
22. Correlation vs. causality?
a randomization test to control for
confounding phenotypes and phylogeny
This passes the
randomization test:
This fails (association
not unique):
associations between phenotypes, and
also with phylogeny:
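One generic way to test whether an association survives a confounder is to shuffle the phenotype labels only within strata of that confounder. The sketch below does this with clade membership as the stratum. It is a textbook-style stratified permutation test written for illustration, not the procedure used in the study; the function name, the statistic, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def stratified_permutation_pvalue(he, phenotype, clade, n_perm=2000, seed=0):
    """One-sided p-value for 'HE status is more frequent in phenotype-positive
    genomes', shuffling phenotype labels only within each clade.

    he, phenotype: lists of 0/1 per genome; clade: list of clade labels.
    """
    rng = random.Random(seed)

    def statistic(labels):
        pos = [h for h, y in zip(he, labels) if y]
        neg = [h for h, y in zip(he, labels) if not y]
        return sum(pos) / len(pos) - sum(neg) / len(neg)

    observed = statistic(phenotype)
    members = defaultdict(list)
    for i, c in enumerate(clade):
        members[c].append(i)

    hits = 0
    for _ in range(n_perm):
        shuffled = list(phenotype)
        for idx in members.values():
            values = [phenotype[i] for i in idx]
            rng.shuffle(values)  # labels move only within the clade
            for i, v in zip(idx, values):
                shuffled[i] = v
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical case 1: HE status tracks the phenotype inside both clades.
he_1 = [1, 1, 1, 0, 0, 0] * 2
pheno_1 = [1, 1, 1, 0, 0, 0] * 2
clade_1 = ["A"] * 6 + ["B"] * 6
p_within = stratified_permutation_pvalue(he_1, pheno_1, clade_1)

# Hypothetical case 2: phenotype and HE status are both fixed by clade.
he_2 = [1] * 6 + [0] * 6
pheno_2 = [1] * 6 + [0] * 6
clade_2 = ["A"] * 6 + ["B"] * 6
p_confounded = stratified_permutation_pvalue(he_2, pheno_2, clade_2)
```

In the second case the within-clade shuffles cannot change anything, so the p-value is 1: the association is indistinguishable from phylogeny, which mirrors the "fails (association not unique)" situation on the slide.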
23. • 514 aerotolerant vs. 214 aerointolerant: 295 COGs are significantly enriched with HE genes; after control for confounders: 23 COGs
• obligate vs. facultative aerobes: 11 COGs
• thermophiles: 16 COGs
• halophiles: 6 COGs
+ 20 other phenotypes tested
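As an illustration of what "significantly enriched with HE genes" can mean for a single COG, the sketch below computes a one-sided hypergeometric (Fisher-type) enrichment p-value. The source does not state which test produced the counts above, so this is an assumed stand-in, and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) when n genomes are drawn without replacement from N genomes,
    K of which have the phenotype.

    Here: N = genomes carrying the COG, K = those that are aerotolerant,
    n = genomes where the COG member is HE, k = aerotolerant and HE.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts for one COG present in 10 genomes.
p_enriched = hypergeom_enrichment_p(k=5, n=5, K=5, N=10)
p_trivial = hypergeom_enrichment_p(k=0, n=5, K=5, N=10)
```

A test like this treats genomes as independent, which related species are not; that is why a further control for phylogeny and correlated phenotypes reduces the number of linked COGs so sharply.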
24. Gene families linked to aerotolerance
all experiments: Anita Kriško lab (Mediterranean Institute for Life Science, Split, Croatia)
published as Kriško et al, Genome Biology 2014. doi:10.1186/gb-2014-15-3-r44
[Figure, panels A-C: survival of mutants, normalized to w.t., after 2.5 mM H2O2, heat shock, and osmotic shock; NAC/no NAC survival ratio after 5 mM NAC pretreatment. Includes a positive control and 2 nonspecific hits.]
* known antioxidant proteins in E. coli (or homologs in other organisms)
* known to be regulated in response to air or oxidative stress
25. ROS levels in the mutants
[Figure: per-strain measurements of carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level increase, and NADPH rescue. Strains: fre, sufD, rseC, sodA (positive control), w.t. (wild-type), clpA, recA, napF, lon, ybeQ, yaaU, cysD, ybhJ, gpmM, icd, lpd, yidH.]
ROS are typically not increased (except cysD, yaaU, rseC, and the positive control sodA)
27. Putative mechanisms of oxidative stress resistance
[Figure: genes grouped by putative mechanism (NAD(P)H-related, iron-related, unknown), with columns for carbonylation increase, DHR-123 increase, CellROX increase, total Fe increase, dipyridyl rescue, NADPH level decrease, and exogenous NADPH rescue.]
29. Validation using synthetic genes with introduced suboptimal codons
[Figure, panels A-D: histograms of relative frequency vs. codon distance (MILC) to ribosomal protein genes, for ribosomal protein genes and all other E. coli genes, marking clpS (w.t., 15, 20, 25) and yjjB (w.t., 21, 28, 35); % survival after heat shock and osmotic shock for w.t., ΔclpS, ΔclpS + clpS_w.t., ΔclpS + clpS_15, ΔclpS + clpS_20, ΔclpS + clpS_25, and for w.t., ΔyjjB, ΔyjjB + yjjB_w.t., ΔyjjB + yjjB_21, ΔyjjB + yjjB_28, ΔyjjB + yjjB_35.]
30. Overall
• 200 links between 187 different COG gene families and 24 diverse phenotypic traits, including
  • spore-forming ability
  • motility
  • pathogenicity to plants or mammals
  • affecting certain tissues/organs
• (1000s more predictions at less stringent thresholds)
• Anita Kriško lab – Mediterranean Institute for Life Sciences (MedILS), Split, Croatia: all experimental work shown
• Nives Škunca, ETH Zurich: phyletic profiling, GORBI
32. Thank you!
Fran Supek
1) Lehner group, CRG/EMBL Systems Biology Unit, Barcelona
2) Division of Electronics, RBI, Zagreb, Croatia
XXI Jornades de Biologia Molecular
Barcelona, 11.6.2014
End of Part 1. Part 2 deals with causal synonymous mutations in
human cancer genomes, and is available separately.
Editor's Notes
For envC, we predict ‘peptidoglycan-based cell wall biogenesis’ with Pr of 0.71