Is it time for a (community) effort 
towards a soil reference 
database? 
Erick Cardenas, James Cole, Maude David, Aaron 
Garoutte, Adina Howe, Janet Jansson, Dave 
Myrold, James Tiedje, and you? 
Modified version of slides will be available after presentation: 
http://www.slideshare.net/adinachuanghowe
The most important hands in soil 
microbiology
Significance of a soil-specific reference 
• Need standardized resource to 
connect sequencing data at 
different levels 
• Integrate sequencing data 
towards soil health and 
productivity 
• Broadly enable “connecting the 
dots” 
Genes 
Organisms 
Communities 
Ecosystems
Soil metagenomic challenges 
• The amount we know… 
• Incredible microbial diversity 
• Spatial heterogeneity 
• Complex dynamics 
• Lack of reference genomes (bacteria, 
archaea, fungal)
HUMAN MICROBIOME PROJECT
Lessons from HMP 
• 2009 Goals: 
– Take advantage of high throughput technologies 
to characterize human microbiome of large 
number of samples 
– Determine whether there are associations between changes 
in the microbiome and health/disease 
– Provide a standardized data resource and new 
technological approaches to enable such studies 
to be undertaken broadly in scientific community
HMP metagenomic challenges 
Soil 
• Incredible microbial 
diversity 
• Spatial heterogeneity 
• Complex dynamics 
• Lack of reference 
genomes (bacteria, 
archaea, fungal) 
HMP 
• Microbial diversity 
• Individual variation 
• Complex host-associated 
dynamics 
• Lack of reference 
genomes?
The HMP reference genome effort 
• Add at least 900-3000 additional reference 
bacterial genome sequences to public 
database 
• Thorough representation of domains and 
major body sites
Not only sequencing… but access to 
data 
Currently, over 1000 bacterial genomes at various stages of 
sequencing
Tools: Opening doors broadly 
Metaphlan, Nature Methods 9, 811-814 (2012) 
Nature Reviews Genetics, 15, 
577-584 (2014) 
Vital et al., mBio, Vol 5., 2014
Another example: GEBA 
Comparison of 
• rRNA tree of life 
• genome 
sequence in the 
DSMZ culture 
collection 
Are there any general benefits that come from this 
"phylogeny driven" approach?
Impact of “targeted” sequencing of 
improved references 
Higher rate of discovery 
and characterization of 
new gene families 
New ways to link distantly 
related homologs that 
would otherwise go 
undetected 
Significant phylogenetic 
expansions of known 
protein families 
Enrichment of genetic 
diversity
Can a similar strategy benefit soil 
studies?
What could we use it for? 
• Target isolation and sequencing efforts; creation of a “most 
wanted” list 
• Soil specific framework for larger scale sequencing and 
proteomic efforts to identify taxonomic and functional 
information 
• Genome-centric investigation of soil genomes (e.g., distribution 
of shared genes among soil phyla); development of improved 
biomarkers for high throughput assays 
• Providing data to tool developers to make 
bioinformatics/visualization easier for soil-specific studies
What are the challenges? 
• How do we define a soil organism? 
– Origin from soil? 
– 16S rRNA gene sequence matches one from soil? 
– What level of finishing is adequate?
What are the challenges? 
• What is the most critical/practical metadata? 
– Soil location 
– Soil taxonomy 
– Links to RefSeq IDs 
– Is the strain available and where?
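
One way to make the metadata question concrete is to write the candidate fields down as a record. The sketch below is only an illustration: the class name `SoilGenomeRecord`, its field names, and the `missing_metadata` helper are assumptions made for this example, not part of RefSoil or any existing database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SoilGenomeRecord:
    """One entry in a hypothetical soil reference database (illustrative only)."""
    organism: str
    refseq_ids: List[str] = field(default_factory=list)  # links to RefSeq IDs
    soil_location: Optional[str] = None   # where the isolate was collected
    soil_taxonomy: Optional[str] = None   # soil classification of the source soil
    strain_source: Optional[str] = None   # where the strain is available, if anywhere

    def missing_metadata(self) -> List[str]:
        """Return the names of the metadata fields that are still empty."""
        missing = []
        if not self.refseq_ids:
            missing.append("refseq_ids")
        for name in ("soil_location", "soil_taxonomy", "strain_source"):
            if getattr(self, name) is None:
                missing.append(name)
        return missing


# A placeholder record with only a location filled in.
record = SoilGenomeRecord(organism="Example organism", soil_location="Example site")
print(record.missing_metadata())
```

A helper like `missing_metadata` could let curators rank entries by completeness, which ties in with the question of which metadata is most critical and practical to collect.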
What are the challenges? 
• Who to include? 
– Fungi! Archaea!
What are the challenges? 
• Expert curators? 
– You? 
– Tiered hierarchy of curation level
Some initial efforts 
RefSoil (2011) 
Erick Cardenas, Aaron Garoutte, Adina Howe, Jim Tiedje 
Bacterial genomes retrieved from Gold 
database , and , and selected those 
associated with soil habitats 
Manually curated to exclude obligate 
human pathogens and extremophiles 
Databases can be biased and redundant 
Proteobacteria, 267 
Firmicutes, 92 
Actinobacteria, 75 
Bacteroidetes, 12 
Cyanobacteria, 7 
Tenericutes, 5 
Acidobacteria, 5 
Other, 29 
492 organisms 
19 phyla
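
To put a number on the bias, the phylum counts from the RefSoil slide can be turned into percentages. This is a minimal sketch using only the counts shown above; the names `PHYLUM_COUNTS` and `phylum_shares` are made up for this example.

```python
# Genome counts per phylum, copied from the RefSoil slide (492 organisms in total).
PHYLUM_COUNTS = {
    "Proteobacteria": 267,
    "Firmicutes": 92,
    "Actinobacteria": 75,
    "Bacteroidetes": 12,
    "Cyanobacteria": 7,
    "Tenericutes": 5,
    "Acidobacteria": 5,
    "Other": 29,
}


def phylum_shares(counts):
    """Return each phylum's share of all genomes, in percent rounded to 0.1."""
    total = sum(counts.values())
    return {phylum: round(100 * n / total, 1) for phylum, n in counts.items()}


shares = phylum_shares(PHYLUM_COUNTS)

# Print phyla from most to least represented.
for phylum, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phylum:15s} {pct:5.1f}%")
```

Proteobacteria alone account for a little over half of the 492 genomes, while Acidobacteria contribute only 5, which is the kind of skew a "most wanted" list could target.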
NCBI Reference Genomes described as 
originating from soil 
Proteobacteria 
Actinobacteria 
Firmicutes 
Bacteroidetes 
Cyanobacteria 
Acidobacteria
Protein Models for Functions: 
FOAM Database 
Nucl. Acids Res. (2014) 
doi: 10.1093/nar/gku702
Some Motivation 
60 terrestrial NEON sites distributed across 20 ecoclimatic domains 
Terrestrial scale streaming of lots of data including sequencing data for each site
If you’d like to contribute 
• Join the breakout session Thursday evening 
(6-7 pm) 
• Know someone with genomes or a database? Let 
us know. Want to contribute? Have an 
opinion? Have funding? 
Adina Howe, adina.howe@gmail.com

ANL Soil Metagenomics 2014 Soil Reference Database - Let's do this


Editor's Notes

  • #13 Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.
  • #23 FOAM has (1) a semantic component: an ontology which organizes function into categories and subcategories. Click on "Ontology" in the menu bar to see it. FOAM is also a peptide profile DB turned to HMM