ANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
2. Is it time for a (community) effort
towards a soil reference
database?
Erick Cardenas, James Cole, Maude David, Aaron
Garoutte, Adina Howe, Janet Jansson, Dave
Myrold, James Tiedje, and you?
Modified version of slides will be available after presentation:
http://www.slideshare.net/adinachuanghowe
4. Significance of a soil-specific reference
• Need standardized resource to
connect sequencing data at
different levels
• Integrate sequencing data
towards soil health and
productivity
• Broadly enable “connecting the
dots”
Genes
Organisms
Communities
Ecosystems
5. Soil metagenomic challenges
• The amount we know…
• Incredible microbial diversity
• Spatial heterogeneity
• Complex dynamics
• Lack of reference genomes (bacteria,
archaea, fungi)
7. Lessons from HMP
• 2009 Goals:
– Take advantage of high throughput technologies
to characterize human microbiome of large
number of samples
– Determine whether there are associations between changes
in the microbiome and health/disease
– Provide a standardized data resource and new
technological approaches to enable such studies
to be undertaken broadly in scientific community
9. The HMP reference genome effort
• Add at least 900-3000 additional reference
bacterial genome sequences to public
database
• Thorough representation of domains and
major body sites
10. Not only sequencing… but access to
data
Currently, over 1000 bacterial genomes at various stages of
sequencing
12. Another example: GEBA
Comparison of
• rRNA tree of life
• genome
sequence in the
DSMZ culture
collection
Are there any general benefits that come from this
"phylogeny driven" approach?
13. Impact of “targeted” sequencing of
improved references
Higher rate of discovery
and characterization of
new gene families
New ways to link distantly
related homologs that
would otherwise go
undetected
Significant phylogenetic
expansions of known
protein families
Enrichment of genetic
diversity
15. What could we use it for?
• Target isolation and sequencing efforts; creation of a “most
wanted” list
• Soil specific framework for larger scale sequencing and
proteomic efforts to identify taxonomic and functional
information
• Genome-centric investigation of soil genomes (e.g., distribution
of shared genes among soil phyla); development of improved
biomarkers for high throughput assays
• Providing data to tool developers to make
bioinformatics/visualization easier for soil-specific studies
16. What are the challenges?
• How do we define a soil organism?
– Origin from soil?
– 16S rRNA gene sequence matches one from soil?
– What level of finishing is adequate?
17. What are the challenges?
• What is the most critical/practical metadata?
– Soil location
– Soil taxonomy
– Links to RefSeq IDs
– Is the strain available and where?
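One way to make the metadata question concrete is to sketch a record type. The following is a minimal sketch only: the class name `SoilGenomeRecord` and all field names are hypothetical, chosen to mirror the four items listed above, and are not an agreed schema.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class SoilGenomeRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    organism: str
    soil_location: Optional[str] = None     # where the soil was sampled
    soil_taxonomy: Optional[str] = None     # soil classification of the site
    refseq_id: Optional[str] = None         # link to a RefSeq entry, if any
    strain_available: Optional[bool] = None  # is the strain available?
    strain_source: Optional[str] = None     # where the strain can be obtained

    def missing_fields(self) -> List[str]:
        """Return the names of metadata fields that are still unfilled."""
        return [name for name, value in asdict(self).items() if value is None]


# Placeholder values, not real data.
record = SoilGenomeRecord(
    organism="example soil isolate",
    soil_location="placeholder site",
    strain_available=True,
)
print(record.missing_fields())
```

Listing the unfilled fields per record would give curators a quick view of which metadata is hardest to obtain in practice.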
18. What are the challenges?
• Who to include?
– Fungi! Archaea!
19. What are the challenges?
• Expert curators?
– You?
– Tiered hierarchy of curation level
20. Some initial efforts
RefSoil (2011)
Erick Cardenas, Aaron Garoutte, Adina Howe, Jim Tiedje
Bacterial genomes retrieved from the GOLD
database and other sources, selecting those
associated with soil habitats
Manually curated to exclude obligate
human pathogens and extremophiles
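The selection steps above can be sketched roughly as a filter over genome metadata. This is an illustration under stated assumptions, not the RefSoil procedure itself: the record keys (`habitat`, `obligate_human_pathogen`, `extremophile`) and the example records are hypothetical, and the boolean flags stand in for decisions that were made by manual curation.

```python
def select_soil_genomes(records):
    """Keep soil-associated genomes, dropping records flagged by curators.

    Assumes each record is a dict with a free-text 'habitat' field and
    optional boolean flags set during manual curation (hypothetical keys).
    """
    kept = []
    for rec in records:
        # Step 1: keep only genomes whose habitat annotation mentions soil.
        if "soil" not in rec["habitat"].lower():
            continue
        # Step 2: exclude obligate human pathogens and extremophiles.
        if rec.get("obligate_human_pathogen") or rec.get("extremophile"):
            continue
        kept.append(rec)
    return kept


# Placeholder records, not real database entries.
records = [
    {"name": "genome_A", "habitat": "Soil, rhizosphere"},
    {"name": "genome_B", "habitat": "Host, human gut"},
    {"name": "genome_C", "habitat": "Soil", "obligate_human_pathogen": True},
    {"name": "genome_D", "habitat": "Hot spring soil", "extremophile": True},
]
selected = select_soil_genomes(records)
print([r["name"] for r in selected])
```

A keyword match on habitat is only a first pass; it cannot replace the manual review step.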
Databases can be biased and redundant
Proteobacteria, 267
Firmicutes, 92
Tenericutes, 5
Cyanobacteria, 7
Bacteroidetes, 12
Actinobacteria, 75
Acidobacteria, 5
Other, 29
492 organisms
19 phyla
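To see how uneven this representation is, the phylum counts from this slide can be turned into percentages. A small sketch using only the numbers shown here (the variable names are illustrative):

```python
# Genome counts per phylum as listed on this slide.
counts = {
    "Proteobacteria": 267,
    "Firmicutes": 92,
    "Actinobacteria": 75,
    "Bacteroidetes": 12,
    "Cyanobacteria": 7,
    "Tenericutes": 5,
    "Acidobacteria": 5,
    "Other": 29,
}

total = sum(counts.values())

# Share of the collection held by each phylum, as a percentage.
shares = {phylum: 100 * n / total for phylum, n in counts.items()}

for phylum, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phylum:15s} {n:4d} {shares[phylum]:5.1f}%")
```

Proteobacteria alone account for more than half of the 492 genomes, while Acidobacteria contribute about 1%.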
21. NCBI Reference Genomes described as
originating from soil
Proteobacteria
Actinobacteria
Firmicutes
Bacteroidetes
Cyanobacteria
Acidobacteria
22. Protein Models for Functions:
FOAM Database
Nucl. Acids Res. (2014)
doi: 10.1093/nar/gku702
23. Some Motivation
60 terrestrial NEON sites distributed across 20 ecoclimatic domains
Terrestrial scale streaming of lots of data including sequencing data for each site
24. If you’d like to contribute
• Join the breakout session Thursday evening
(6-7 pm)
• Know someone with genomes or a database? Let
us know. Want to contribute? Have an
opinion? Have funding?
Adina Howe, adina.howe@gmail.com
Editor's Notes
Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.
FOAM has (1) a semantic component: an ontology which organizes functions into categories and subcategories.
FOAM is also (2) a peptide profile database converted to HMMs.