Valerie de Anda at #ICG12: A new multi-genomic approach for the study of biogeochemical cycles at global scale: the molecular reconstruction of the sulfur cycle

Valerie De Anda
Ecology Institute UNAM México
Laboratory of Computational Biology Zaragoza
CSIC Spain
valdeanda@ciencias.unam.mx
https://github.com/valdeanda
@val_deanda
The 12th International Conference on Genomics
October 26 to 29, 2017
Shenzhen, China

The iceberg illusion of metagenomics
» Revolution in the microbial ecology field
» Genomic reconstruction: microbial dark matter
» Large amount of data
» Diversity, ecology, evolution and functional makeup of the microbial world
» The ability to evaluate complex metabolic functions in large data sets remains biologically and computationally challenging
MOTIVATION GENERAL IDEA RESULTS CONCLUSIONS PERSPECTIVES THANKS
» It is really complex to infer and test biological hypotheses in such data

The iceberg illusion of metagenomics
Microbial ecology-derived ‘omic’ studies
What do we need to improve the efficiency of data processing?
Biological data interpretation (evaluate, compare, and analyze complex data at large scale)
Computational efficiency (high performance, accuracy, high speed, data processing, reproducibility)
Metagenomic data:
» Most abundant
» Marker genes
» Statistically different features
Gomez Cabrero et al 2014 BMC SB
Reshetova et al 2013 BMC SB

Data integration
For a given system, multiple sources (and possibly types) of data are available, and we want to study them integratively to improve knowledge discovery.
What are the available data that can be used to characterize large-scale metabolic machineries?
How do we integrate them all to improve our understanding of the system?
Gomez Cabrero et al 2014 BMC SB
Reshetova et al 2013 BMC SB
Prior knowledge: to reduce the solution space and/or to focus the analysis on biologically meaningful regions (specific metabolic machineries)
(Targeted) Metabolism
Taxa involved in that particular metabolism
Proteins involved in that particular metabolism
Publicly available genomes?
Mathematical model
Relative entropy
Informative Score
MEBS
$$H' = \sum_i P(i)\,\log_2 \frac{P(i)}{Q(i)}$$
H' ≥ 1: Informative
H' ≤ 0: Non-informative

Large scale
dataset
Does it really work?
Can it capture an entire metabolic machinery?
Can it be used to evaluate, compare, and analyze complex data at large scale (genomes, metagenomes)?
Is it computationally efficient? Accurate, fast on large datasets, and reproducible
Data integration
Single value

Data integration: case study
Atmosphere
Solar
E°
Redox
reactions
Metabolic
guilds
Geological
processes
An entire biogeochemical cycle
S-cycle
CHONS-P
Taxa involved in
that particular
metabolism
that particular
metabolism
Large scale
datasets
They really capture the major processes involved in the mobilization and use of S-compounds through Earth's biosphere

Data integration: case study, S-cycle
https://metacyc.org/META/NEW-IMAGE?object=Sulfur-Metabolism
http://www.genome.jp/kegg-bin/show_pathway?map00920
Manually curated reconstruction of the S-metabolic machinery

Data integration: case study, S-cycle
Taxa: metabolic guilds Metabolic machinery
i) CLSB: 24 genera
ii) PSB: 25 genera
iii) GSB: 9 genera
iv) SRB: 40 genera
v) SRM:19 genera
vi) SO:4 genera
Suli
N=161
i) Sulfur
compounds
ii) Metabolic
pathways
iii) Genes
iv) Proteins
Complete nr sequenced
S-genomes
Sucy
N=152
txt
GCF_000006985.1 Chlorobium tepidum TLS
GCF_000007005.1 Sulfolobus solfataricus P2
GCF_000007305.1 Pyrococcus furiosus DSM 3638
GCF_000008545.1 Thermotoga maritima MSB8
GCF_000008625.1 Aquifex aeolicus VF5
GCF_000008665.1 Archaeoglobus fulgidus DSM 4304
GCF_000009965.1 Thermococcus kodakarensis KOD1
>Protein1
MIKPVGSDELKPLFVYDPEEHHKLSHEAESLPSVVISSQGPRVSSM
MGAGYFSPAGFMNV
>Protein 2
MAYKTIIEDGIDVLVVGAGLGGTGAAFEARYWGQDKKIVIAEKANID
>Protein 3
MPTFVYMTRCDGCGQCVDICPSDIMHIDTTIRRAYNIEPNMCWEC
YSCVKACPHNAIDVR
Evidence linking them with the S-cycle (curated DBs and primarily literature)
Evidence suggesting their physiological and biochemical involvement in the use of sulfur compounds.


Table 1. Metabolic pathways of the global biogeochemical S-cycle
Pathway | Metabolism (a) | Chemical process (b) | Sulfur compound | Type (c) | Chemical formula | Source (d) | Number of Pfam domains (e)
P1  | DS | O  | Sulfite                    | I | SO3^2-   | E     | 9
P2  | DS | O  | Thiosulfate                | I | S2O3^2-  | E     | 10
P3  | DS | O  | Tetrathionate              | I | S4O6^2-  | E     | 2
P4  | DS | R  | Tetrathionate              | I | S4O6^2-  | E     | 17
P5  | DS | R  | Sulfate                    | I | SO4^2-   | E     | 20
P6  | DS | R  | Elemental sulfur           | I | S°       | E     | 20
P7  | DS | D  | Thiosulfate                | I | S2O3^2-  | E     | 9
P8  | DS | O  | Carbon disulfide           | O | CS2      | E     | 1
P9  | A  | DE | Alkanesulfonate            | O | CH3O3SR  | S     | 5
P10 | A  | R  | Sulfate                    | I | SO4^2-   | S     | 20
P11 | DS | O  | Sulfide                    | I | H2S      | E/S   | 29
P12 | A  | DE | L-cysteate                 | O | C3H6NO5S | C/E   | 1
P13 | A  | DE | Dimethyl sulfone           | O | C2H6O2S  | C/E   | 3
P14 | A  | DE | Sulfoacetate               | O | C2H2O5S  | C/E   | 2
P15 | A  | DE | Sulfolactate               | O | C3H4O6S  | C/S   | 14
P16 | A  | DE | Dimethyl sulfide           | O | C2H6S    | C/S   | 16
P17 | A  | DE | Dimethylsulfoniopropionate | O | C5H10O2S | C/S/E | 12
P18 | A  | DE | Methylthiopropanoate       | O | C4H7O2S  | C/S   | 7
P19 | A  | DE | Sulfoacetaldehyde          | O | C2H3O4S  | C/S   | 7
P20 | DS | O  | Elemental sulfur           | I | S°       | C/S/E | 7
P21 | DS | D  | Elemental sulfur           | I | S°       | C/S/E | 1
P22 | A  | DE | Methanesulfonate           | O | CH3O3S   | C/S/E | 7
P23 | A  | DE | Taurine                    | O | C2H7NO3S | C/S/E | 11
P24 | DS | M  | Dimethyl sulfide           | O | C2H6S    | C     | 1
P25 | DS | M  | Methylthio-propanoate      | O | C4H7O2S  | C     | 1
P26 | DS | M  | Methanethiol               | O | CH4S     | C     | 1
P27 | A  | DE | Homotaurine                | O | C3H9NO3S | N     | 1
P28 | A  | B  | Sulfolipid                 | O | SQDG     |       | 4
P29 | Markers |  | Markers                |   |          |       | 12


Large omic datasets
characterize large-scale metabolic pathways?
Taxa involved in
that particular
metabolism
that particular
metabolism
txt
2,107 nr genomes (faa)
Gen: 1.5 GB
How many genomes were available at the time of analysis?
Number of complete prokaryotic genomes: ≈4,000 (NCBI RefSeq), Dec 2016
Non-redundant: 2,107, Dec 2016
Publicly available and manually curated data

Large omic datasets
Taxa: Suli Proteins: Sucy
txt
Gen MetGenF
104GB
≈ 500 GB
1,5 GB
How many metagenomes were available at the time of analysis? Those that:
i) were publicly available
ii) contained associated metadata
iii) had been isolated from well-defined environments (e.g., rivers, soil, biofilms)
iv) were not host-associated microbiome sequences (e.g., human, cow, chicken)

112-HMM of S-proteins
C
txt
GCF_000006985.1 Chlorobium tepidum TLS
GCF_000007005.1 Sulfolobus solfataricus P2
GCF_000007305.1 Pyrococcus furiosus DSM 3638
GCF_000008545.1 Thermotoga maritima MSB8
GCF_000008625.1 Aquifex aeolicus VF5
GCF_000008665.1 Archaeoglobus fulgidus DSM 4304
GCF_000009965.1 Thermococcus kodakarensis KOD1
>Protein1
MGAGYFSPAGFMNV
>Protein 2
>Protein 3
YSCVKACPHNAIDVR
Gen GenF
Stage 1: Manual curation and omic datasets
Stage 2: Domain composition
Stage 3: Relative entropy
Domains enriched among the microorganisms of interest
$$H' = \sum_i P(i)\,\log_2 \frac{P(i)}{Q(i)}$$
P(i) (observed) = frequency of protein domain i in the S genomes (161)
Q(i) (expected) = frequency of protein domain i in Gen (2,107)
H' ≥ 1: Informative
H' ≤ 0: Non-informative
Stage 4: Informative Score (single value)
Can it capture the S-metabolic machinery?
Can it be used to evaluate, compare, and analyze complex data at large scale (genomes, metagenomes)?
Is it computationally efficient? Accurate, fast on large datasets, and reproducible
Taxa: Suli Proteins: Sucy
MEBS: GENERAL OVERVIEW
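The relative entropy of Stage 3 can be illustrated with a minimal Python sketch. It is not the MEBS implementation: the function names are hypothetical, and it assumes that "frequency" means the fraction of genomes in which a domain occurs. The count 58/161 is the Suli occurrence of PF12139 reported in Table 2; the count used for Gen is invented purely for illustration.

```python
import math


def domain_frequencies(genomes_with_domain, n_genomes):
    """Turn per-domain genome counts into frequencies (assumed definition)."""
    return {d: c / n_genomes for d, c in genomes_with_domain.items()}


def relative_entropy(p, q):
    """Return the per-domain terms P(i) * log2(P(i) / Q(i)) and their sum H'.

    p: observed frequencies (genomes of interest, e.g. Suli)
    q: expected frequencies (background, e.g. Gen)
    """
    terms = {}
    for domain, p_i in p.items():
        if p_i == 0:
            # A domain never observed contributes nothing to the sum.
            terms[domain] = 0.0
            continue
        q_i = q.get(domain, 0.0)
        if q_i <= 0:
            raise ValueError(f"Q({domain}) must be > 0 when P({domain}) > 0")
        terms[domain] = p_i * math.log2(p_i / q_i)
    return terms, sum(terms.values())


# 58/161 comes from Table 2; the Gen count of 80 is a made-up placeholder.
p = domain_frequencies({"PF12139": 58}, 161)
q = domain_frequencies({"PF12139": 80}, 2107)
terms, h_prime = relative_entropy(p, q)
print(terms, h_prime)
```

A domain that is much more frequent in the genomes of interest than in the background gets a large positive term, while a domain that is equally frequent in both gets a term of zero.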

https://github.com/eead-csic-compbio/metagenome_Pfam_score
2,107 genomes 161 Suli +
935 metagenomes

Distribution of Sulfur Score (SS) in 2,107 nr genomes
Metagenomic reconstructions of hard-to-culture taxa: Sur (N=192)
» an unnamed endosymbiont of a scaly snail from a black smoker chimney
» the archaeon Geoglobus ahangari, sampled from a hydrothermal vent at 2,000 m depth
» Candidatus Desulforudis audaxviator MP104C
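How a single Sulfur Score could be obtained for one genome or metagenome is sketched below. This is an assumption about the scoring step, not a description confirmed by these slides: the score is taken to be the sum of the H' values of the informative domains detected in the sample, counting each domain once. The function name is hypothetical; the three H' values are the means listed in Table 2.

```python
def informative_score(sample_domains, entropies):
    """Sum the H' values of the scored domains found in a sample.

    sample_domains: iterable of Pfam IDs detected in the sample
    entropies: dict mapping Pfam ID -> H' (domains without an entry are ignored)
    """
    present = set(sample_domains) & set(entropies)
    # Sort for a deterministic summation order.
    return sum(entropies[d] for d in sorted(present))


# H' means taken from Table 2; the sample composition is invented.
entropies = {"PF12139": 1.2, "PF01747": 1.03, "PF04358": 0.52}
sample = ["PF12139", "PF01747", "PF00001", "PF12139"]
print(informative_score(sample, entropies))
```

Under this assumption, samples carrying many high-entropy sulfur domains receive high scores, and samples with none of them score zero.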

ROC CURVE
Positive instances: Suli (N=161)
Negative instances: Gen (N=1,946)
• Two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis.
• Depicts relative tradeoffs between benefits (true positives) and costs (false positives).
Perfect classification

Distribution of Sulfur Score (SS) in the metagenomic dataset (935 metagenomes)
Distribution of SS values observed in 935 metagenomes, classified in terms of features (X-axis) and colored according to their particular habitats. Features are sorted according to their median SS values. Green lines indicate the lowest and largest 95th percentiles observed across MSL classes.
Geo-localized metagenomes sampled around the globe are colored according to their SS values.

[Figure: MEBS summary: BG cycling, S genes, S genomes, informative vs. non-informative domains, markers, Comp]
Conclusions
» We present MEBS, new open-source software to evaluate, quantify, compare, and predict the metabolic machinery of interest in large ‘omic’ datasets using a single value.
» To test the applicability of this approach, we evaluated one of the most complex biogeochemical cycles: the sulfur cycle.
» Using data integration and manual curation, we reconstructed the entire sulfur machinery: Suli and Sucy.
» We show that the mathematical framework of relative entropy can be used to capture complex metabolic machineries in large-scale omic samples.
» MEBS is a powerful and broadly applicable approach to predict and classify microorganisms closely involved in the sulfur cycle, even in hard-to-culture microbial lineages.
» It is computationally efficient, accurate (AUC = 0.985), and reproducible.
» Not in the presentation: the entropy can be used to detect marker domains, and the completeness of the S-cycle pathways can be benchmarked at large scale.

[Figure: MEBS and BG cycling (C, N, O, S, Fe, P); possible application areas: bioremediation, antibiotics, extreme environments, agriculture, ?]
Perspectives
• We are currently finishing the analyses to demonstrate the applicability of this approach to other biogeochemical cycles (C, N, O, Fe, P).
• We hope that the MEBS pipeline will facilitate the analysis of biogeochemical cycles or complex metabolic networks carried out by specific prokaryotic guilds, such as bioremediation processes (e.g., degradation of hydrocarbons, toxic aromatic compounds, heavy metals, etc.).
• We look forward to collaborating with and helping other researchers by integrating comprehensive databases that might be helpful to the scientific community.
• Furthermore, we are working to improve the algorithm so that it needs only a list of sequenced genomes involved in the metabolism of interest, in order to reduce the manual curation effort.
• We are also considering using k-mers instead of peptide Hidden Markov Models to increase the speed of the pipeline.
• We anticipate that our platform will stimulate interest and involvement among the scientific community to explore uncultured genomes derived from large metagenomic sequences: exploring microbial dark matter.

Icoquih Zapata
Valeria Souza
Luis Eguiarte
Bruno Contreras
Cesar Poot-Hernandez
De Anda et al., 2017. MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle. GigaScience, in press.

LABORATORY OF MOLECULAR AND EXPERIMENTAL EVOLUTION, ECOLOGY INSTITUTE, UNAM, MEXICO
LABORATORY OF COMPUTATIONAL BIOLOGY
Thank you for your attention!

supplementary files

A B
Gen (n=2,107) Met (n=935)
D. acidiphilus
Hydrogenobaculum
A. caldus
A. ferrivorans
T. mobilis
D. aromatica
Thauera sp.
T. humireducens
A. denitrificans
S. tokodaii
A. hospitalis (among
other 12 genomes)
P. phaeoclathratiforme
C. chlorochromatii
C. tepidum
T. denitrificans
T. violascens
S. thiotaurini
Completeness
Supplementary files

The number of protein domains of a pathway present in a given sample is divided by the total number of domains in that pre-defined pathway
Completeness
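The completeness definition on this slide can be written as a short Python sketch. The function name and the assignment of Pfam IDs to a pathway are hypothetical; only the ratio itself (domains of the pathway found in the sample, over all domains of the pathway) comes from the slide.

```python
def completeness(sample_domains, pathway_domains):
    """Fraction of a pathway's domains that are present in a sample."""
    pathway = set(pathway_domains)
    if not pathway:
        raise ValueError("the pathway must contain at least one domain")
    found = pathway & set(sample_domains)
    return len(found) / len(pathway)


# Hypothetical pathway made of four Pfam domains; three occur in the sample.
pathway = ["PF12139", "PF01747", "PF04358", "PF13247"]
sample = ["PF12139", "PF01747", "PF13247", "PF00067"]
print(completeness(sample, pathway))
```

Multiplying the result by 100 gives a percentage, which may be more convenient when comparing pathways of very different sizes, such as P11 (29 domains) and P8 (1 domain) in Table 1.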
Supplementary files

Supplementary files

35 private metagenomes:
microbial mats, sediment
and lake water
Reads
Processing
ORF prediction
Gene Calling
(aa residues)
Mean Size Length
https://microbiome.wordpress.com/
Counts of prokaryotic genomes in each NCBI category as of July 2017
Non-redundant / Redundant
LARGE SCALE
Supplementary files

GenF size category | 5-percentile | 95-percentile
Real | -0.091 | 0.101
30   | -0.086 | 0.105
60   | -0.09  | 0.104
100  | -0.088 | 0.1
150  | -0.09  | 0.103
200  | -0.89  | 0.105
250  | -0.09  | 0.106
300  | -0.09  | 0.1
Completeness
Supplementary files

Table 2. Informative Pfam domains with high H’ and low std: novel proposed molecular marker domains in metagenomic data of variable MSL
Pfam ID (Suli occurrences) | H’ mean | H’ std | Description
PF12139 (58/161) | 1.2 | 0.01 | Adenosine-5'-phosphosulfate reductase beta subunit: Key protein domain for both sulfur oxidation/reduction metabolic pathways. Has been widely studied in the dissimilatory sulfate reduction metabolism. In all recognized sulfate-reducing prokaryotes, the dissimilatory process is mediated by three key enzymes: Sat, Apr and Dsr. Homologous proteins are also present in the anoxygenic photolithotrophic and chemolithotrophic sulfur-oxidizing bacteria (CLSB, PSB, GSB), in different cluster organization [35].
PF00374 (135/161) | 1.1 | 0.09 | Nickel-dependent hydrogenase: Hydrogenases with S-cluster and selenium containing Cys-x-x-Cys motifs involved in the binding of nickel. Among the homologues of this hydrogenase domain is the alpha subunit of the sulfhydrogenase I complex of Pyrococcus furiosus, which catalyzes the reduction of polysulfide to hydrogen sulfide with NADPH as the electron donor [55].
PF01747 (103/161) | 1.03 | 0.06 | ATP-sulfurylase: Key protein domain for both sulfur oxidation and reduction processes. The enzyme catalyzes the transfer of the adenylyl group from ATP to inorganic sulfate, producing adenosine 5′-phosphosulfate (APS) and pyrophosphate, or the reverse reaction [56].
PF02662 (62/161) | 0.82 | 0.03 | Methyl-viologen-reducing hydrogenase, delta subunit: One of the enzymes involved in methanogenesis and encoded in the mth-flp-mvh-mrt cluster of methane genes in Methanothermobacter thermautotrophicus. No specific functions have been assigned to the delta subunit [48].
PF10418 (122/161) | 0.78 | 0.06 | Iron-sulfur cluster binding domain of dihydroorotate dehydrogenase B: Among the homologous genes in this family are asrA and asrB from Salmonella enterica enterica serovar Typhimurium, which encode 1) a dissimilatory sulfite reductase, 2) a gamma subunit of the sulfhydrogenase I complex of Pyrococcus furiosus and 3) a gamma subunit of the sulfhydrogenase II complex of the same organism [12].
PF13247 (149/161) | 0.66 | 0.06 | 4Fe-4S dicluster domain: Homologues of this family include 1) DsrO, a ferredoxin-like protein related to the electron transfer subunits of respiratory enzymes, 2) dimethylsulfide dehydrogenase β subunit (ddhB), involved in dimethyl sulfide degradation in Rhodovulum sulfidophilum and 3) sulfur reductase FeS subunit (sreB) of Acidianus ambivalens, involved in sulfur reduction using H2 or organic substrates as electron donors [12].
PF04358 (73/161) | 0.52 | 0 | DsrC like protein: DsrC is present in all organisms encoding a dsrAB sulfite reductase (sulfate/sulfite reducers or sulfur oxidizers). Physiological studies suggest that sulfate reduction rates are determined by cellular levels of this protein. The dissimilatory sulfate reduction couples the four-electron reduction of the DsrC trisulfide to energy conservation [57]. DsrC was initially described as a subunit of DsrAB, forming a tight complex; however, it is not a subunit, but rather a protein with which DsrAB interacts. DsrC is involved in sulfur-transfer reactions; there is a disulfide bond between the two DsrC cysteines as a redox-active center in the sulfite reduction pathway. Moreover, DsrC is among the most highly expressed sulfur energy metabolism genes in isolated organisms and metatranscriptomes (Santos et al., 2015).
PF01058 (158/161) | 0.45 | 0.01 | NADH ubiquinone oxidoreductase, 20 Kd subunit: Homologous genes are found in the delta subunits of both sulfhydrogenase complexes of Pyrococcus furiosus [12].
PF01568 (156/161) | 0.4 | 0.05 | Molybdopterin dinucleotide binding domain: This domain corresponds to the C-terminal domain IV in dimethyl sulfoxide (DMSO) reductase [48].
Supplementary files

https://github.com/eead-csic-compbio/metagenome_Pfam_score
Advanced mode manual
» Biogeochemical cycles (C, N, O, P, Fe)
Supplementary files

Species | SS | Genus | Guild
Ammonifex degensii KC4 | 12.508 | Moorella group | SRB/SR
Archaeoglobus profundus DSM 5631 | 12.024 | Archaeoglobus | SRB
Candidatus Desulforudis audaxviator MP104C | 11.972 | Candidatus Desulforudis | Sur
Pelodictyon phaeoclathratiforme BU-1 | 11.836 | Chlorobium/Pelodictyon group | GSB
Chlorobium phaeobacteroides BS1 | 11.649 | Chlorobium/Pelodictyon group | GSB
Chlorobium chlorochromatii CaD3 | 11.625 | Chlorobium/Pelodictyon group | GSB
Thiobacillus denitrificans ATCC 25259 | 11.61 | Thiobacillus | CLSB
Desulfohalobium retbaense DSM 5692 | 11.511 | Desulfohalobium | SRB
Desulfovibrio alaskensis G20 | 11.5 | Desulfovibrio | SRB
Desulfovibrio vulgaris DP4 | 11.442 | Desulfovibrio | SRB
Chlorobium tepidum TLS | 11.354 | Chlorobaculum | GSB
endosymbiont of unidentified scaly snail isolate Monju | 11.205 | 0 | Sur
Desulfovibrio vulgaris str. 'Miyazaki F' | 11.093 | Desulfovibrio | SRB
Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774 | 11.034 | Desulfovibrio | SRB
Supplementary files


Sulfur: 112 H’
Nitrogen: 176 H’
Methane: 119 H’
Oxygen: 55 H’
Iron: 112 H’
Supplementary files

Biogeochemical cycle | Genes | Pfam domains | Genomes | AUC
Sulfur (S) | 152 | 112 | 161 | 0.9855
Nitrogen (N) | 267 | 176 | 144 | 0.791
Methane (C) | 135 | 119 | 90 | 0.988
Oxygenic Photosynthesis (O) | 50 | 55 | 53 | 0.983
Phosphorus (P) | | | |
Iron (Fe) | 36 | 33 | 34 | 0.863
Supplementary files

ID Description H’ mean std
PF00067 Cytochrome P450 0.644 0.033785
PF00115 Cytochrome C and Quinol oxidase polypeptide I 0.513 0.061551
PF01077 Nitrite and sulphite reductase 4Fe-4S domain 0.55825 0.049936
PF02560 Cyanate lyase C-terminal domain 0.93625 0.001389
PF03460 Nitrite/Sulfite reductase ferredoxin-like half domain 0.5525 0.040324
PF04898 Glutamate synthase central domain 0.479 0.034699
PF13442 Cytochrome C oxidase, cbb3-type, subunit III 0.6565 0.047093
python3 plot_entropy.py gen_genF_entropies.oxygen.tab -0.156 0.20625
Oxygen Markers
Supplementary files

PF01913 Formylmethanofuran-tetrahydromethanopterin formyltransferase 3.629125 0.0227
PF01993 methylene-5,6,7,8-tetrahydromethanopterin dehydrogenase 2.876 0
PF02240 Methyl-coenzyme M reductase gamma subunit 3.168 0
PF02241 Methyl-coenzyme M reductase beta subunit, C-terminal domain 3.168 0
PF02289 Cyclohydrolase (MCH) 3.353 0
PF02741 FTR, proximal lobe 3.63475 0.034648
PF02745 Methyl-coenzyme M reductase alpha subunit, N-terminal domain 3.168 0
PF02783 Methyl-coenzyme M reductase beta subunit, N-terminal domain 3.168 0
PF04206 Tetrahydromethanopterin S-methyltransferase, subunit E 3.032 0
PF04207 Tetrahydromethanopterin S-methyltransferase, subunit D 3.032 0
PF04208 Tetrahydromethanopterin S-methyltransferase, subunit A 2.903375 0.015203
PF04211 Tetrahydromethanopterin S-methyltransferase, subunit C 3.02575 0.017678
PF05440 Tetrahydromethanopterin S-methyltransferase subunit B 2.980125 0.036537
python3 plot_entropy.py gen_genF_entropies.methane.tab -0.121 0.1475
Supplementary files
Methane

PF00067 Cytochrome P450 0.57375 0.0056
PF00174 Oxidoreductase molybdopterin binding domain 0.528125 0.006578
PF00355 Rieske [2Fe-2S] domain 0.507 0.032076
PF00507 NADH-ubiquinone/plastoquinone oxidoreductase, chain 3 0.36975 0.010886
PF00547 Urease, gamma subunit 0.464 0
PF00699 Urease beta subunit 0.475125 0.001126
PF01077 Nitrite and sulphite reductase 4Fe-4S domain 0.47025 0.014568
PF02211 Nitrile hydratase beta subunit 0.405625 0.005041
PF02633 Creatinine amidohydrolase 0.58725 0.017466
PF03460 Nitrite/Sulfite reductase ferredoxin-like half domain 0.48 0.032715
PF05899 Protein of unknown function (DUF861) 0.52175 0.022914
PF09347 Domain of unknown function (DUF1989) 0.398875 0.007415
Nitrogen
Supplementary files

Iron
PF14522 Cytochrome c7 and related cytochrome c 1.010 0.104
PF00355 Rieske [2Fe-2S] domain 0.51912 0.02854
PF00033 Cytochrome b/b6/petB 0.55875 0.04974
PF00034 Cytochrome c 0.5061 0.1013
Supplementary files

ROC CURVE
Positive instances: Suli (N=161)
Negative instances: Gen (N=1,946)
• Two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis.
• Depicts relative tradeoffs between benefits (true positives) and costs (false positives).
Annotations on the curve:
» Perfect classification.
» Classifiers that make positive classifications only with strong evidence make few false positive errors.
» Never issuing a positive classification: such a classifier commits no false positive errors but also gains no true positives.
» Random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5; no realistic classifier should have an AUC less than 0.5.
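The ROC construction described above can be reproduced with a small, dependency-free Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, labels are 1 for a genome of interest (a Suli genome) and 0 for any other genome, and the scores and labels in the example are toy values rather than results from the talk.

```python
def roc_curve(scores, labels):
    """Sweep a threshold from high to low score and collect (FP rate, TP rate) points."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every instance tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Toy scores: two positives and two negatives, one negative ranked above a positive.
fpr, tpr = roc_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc(fpr, tpr))
```

A score that ranks every positive above every negative yields an area of 1 (perfect classification), and a score that cannot separate the two groups yields 0.5, matching the diagonal line mentioned above.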

Relative entropy H’
4Fe-4S dicluster domain
Molybdopterin dinucleotide binding domain
Cytochrome C oxidase, cbb3-type, subunit III
Nitrogenase component 1 type Oxidoreductase
Supplementary files
