Big Data Field Museum
1. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
4. Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)
5. Effects of low-cost sequencing…
First free-living bacterium sequenced for billions of dollars and years of analysis
A personal genome can now be mapped in a few days for hundreds to a few thousand dollars
7. The era of big data in biology
[Figure: growth in DNA sequencing (Mbp per $) and disk storage (Mb per $), 1990–2012. NGS (shotgun) sequencing doubles every 5 months, computational hardware every 14 months, Sanger sequencing every 19 months. Stein, Genome Biology, 2010]
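The doubling times in the figure imply very different annual growth rates. A quick sketch (doubling times from the slide; the function name is my own) makes the widening gap concrete:

```python
# Fold-increase per year implied by each doubling time in the figure.
def fold_per_year(doubling_months):
    """How many times a quantity multiplies over 12 months."""
    return 2 ** (12 / doubling_months)

trends = {
    "NGS shotgun sequencing": 5,    # doubling time in months
    "computational hardware": 14,
    "Sanger sequencing": 19,
}

for name, months in trends.items():
    print(f"{name}: {fold_per_year(months):.2f}x per year")
```

Sequencing capacity per dollar multiplies more than 5x per year while hardware manages under 2x, which is exactly the growing computational gap the rest of the talk turns on.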
8. Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
9. TARGETED SEQUENCING STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targeting conserved regions of known genes
Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea
Who is there: community profiling based on sequence similarity
Must have previous knowledge of genes
Must infer function based on phylogeny – not advised
10. TARGETED SEQUENCING STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targeting conserved regions of known genes
Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea
$15 / sample
Who is there: community profiling based on sequence similarity
Must have previous knowledge of genes
Must infer function based on phylogeny – not advised
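The profiling step can be sketched as a toy classifier: assign each sampled read to the closest reference sequence and tally a community profile. The reference "signature" regions and read strings below are invented for illustration, not real 16S sequences.

```python
# Toy sketch of 16S-style community profiling by sequence similarity.
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

references = {           # hypothetical short signature regions
    "Bacillus":     "ACGTACGTAC",
    "Pseudomonas":  "TTGCAATGGC",
    "Nitrosomonas": "GGATCCGTTA",
}

def classify(read, max_mismatches=2):
    """Return the best-matching reference taxon, or None if too distant."""
    best = min(references, key=lambda t: hamming(read, references[t]))
    return best if hamming(read, references[best]) <= max_mismatches else None

reads = ["ACGTACGAAC", "TTGCAATGGC", "ACGTACGTAC", "CCCCCCCCCC"]
profile = Counter(t for t in map(classify, reads) if t)
print(profile)
```

The unmatched read illustrates the slide's caveat: without previous knowledge of the gene, a sequence simply drops out of the profile.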
11. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
12. THE DIRT ON SOIL
MAGNIFICENT BIODIVERSITY
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010. Photo: Jeremy Burgess
13. THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
15. THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
16. Our shared challenges
Climate Change
USGCRP 2009
Energy Supply
www.alutiiq.com
Human Health
http://guardianlv.com/
An understanding
of microbial ecology
17. SOIL MICROBIOLOGY: CARBON REGULATION
Anthropogenic CO2 production is only ~10% of that of the soil
Sustainable agriculture permits carbon sequestration in the range of 0.3–1 ton C/ha per year, roughly 10% of all carbon emitted by cars
(Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)
18. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
19. Lesson #1: Accessing information in data
http://siliconangle.com/files/2010/09/image_thumb69.png
20. de novo assembly
Raw sequencing data (“reads”) → computational algorithms → informative genes / genomes
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
22. Shotgun sequencing and de novo assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
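The sentence reconstruction above is the core of what an assembler does: find overlaps between fragments and merge them. A minimal greedy sketch of that idea follows; it uses error-free fragments for clarity, whereas real reads carry the "typos" shown in the slide and force assemblers to tolerate mismatches.

```python
# Greedy overlap assembly: repeatedly merge the fragment pair with the
# longest exact suffix/prefix overlap until no confident overlap remains.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags, min_olap=5):
    frags = list(frags)
    while len(frags) > 1:
        a, b, n = max(((x, y, overlap(x, y))
                       for x in frags for y in frags if x is not y),
                      key=lambda t: t[2])
        if n < min_olap:
            break                      # no confident overlap left
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])        # merge, dropping the shared region
    return frags

pieces = [
    "It was the best of times, it was the wor",
    "the worst of times, it was the age",
    "was the age of wisdom, it was the age of foolishness",
]
print(greedy_assemble(pieces))
```

In metagenomic assembly the same comparison runs over billions of fragments from thousands of genomes, which is why the next slides are about computational cost.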
23. Practical Challenges – Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al, 2014, PNAS
24. Practical Challenges – Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al, 2014, PNAS
Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
26. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Sample 1x
27. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Sample 1x, Sample 10x
28. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Overkill
Sample 1x, Sample 10x
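The uneven-sampling point can be simulated directly. The species names and relative abundances below are invented illustration values, not measurements:

```python
# Simulate shotgun sampling of a community with variable abundance:
# each read is drawn from a genome in proportion to its abundance.
import random

random.seed(0)                        # reproducible illustration
abundance = {"abundant_sp": 90, "common_sp": 9, "rare_sp": 1}
genomes, weights = zip(*abundance.items())

def sample_reads(n):
    """Draw n reads, each attributed to the genome it came from."""
    counts = dict.fromkeys(genomes, 0)
    for g in random.choices(genomes, weights=weights, k=n):
        counts[g] += 1
    return counts

print(sample_reads(1000))
```

To raise coverage of the rare species 10x you must sequence roughly 10x more in total, so about 90% of the extra reads are redundant copies of the abundant species, each carrying its own sequencing errors. That redundant region is the "overkill" the next slides target.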
34. Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Shrinks datasets for assembly by up to 95% with the same assembly outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
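A minimal sketch of the digital normalization idea (my own toy parameters, not the published implementation): stream the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only while that estimate is below a cutoff. Redundant high-coverage reads are discarded without ever assembling anything.

```python
# Toy digital normalization: discard reads whose k-mer coverage
# estimate shows they add only redundant (error-bearing) information.
from collections import Counter
from statistics import median

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def diginorm(reads, k=4, cutoff=3):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept

# Ten identical reads collapse to the few needed; a novel read survives.
print(diginorm(["ACGTACGT"] * 10 + ["TTTTGGGG"]))
```

Because the decision is made per read in a single pass, memory scales with the retained data rather than the full dataset, which is what turns months on a supercomputer into the 14 GB / 24 hour figure above.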
35. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
38. SOIL METAGENOME REALITY CHECK
Grand Challenge effort: 10% of soil biodiversity sampled
Incredible soil biodiversity (estimated to require 10 Tbp/sample)
“To boldly go where no man has gone before”: >60% unknown
[Figure: total counts of KEGG Orthology (KO) functional categories – amino acid metabolism, carbohydrate metabolism, membrane transport, signal transduction, and others – colored by whether each KO occurs in both corn and prairie soils, corn only, or prairie only.]
Howe et al, 2014, PNAS
Managed agricultural soils exhibit less diversity, likely due to their history of cultivation.
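The shared-vs-unique tallies in the figure are, at heart, set operations over the KO identifiers annotated in each soil. The identifiers below are placeholders for illustration, not the published annotations:

```python
# Shared and unique KEGG Orthology (KO) functions via set operations.
corn    = {"K00001", "K00002", "K00005", "K00010"}
prairie = {"K00001", "K00002", "K00003", "K00004", "K00010", "K00042"}

shared       = corn & prairie    # "corn and prairie"
corn_only    = corn - prairie    # "corn only"
prairie_only = prairie - corn    # "prairie only"

print(len(shared), len(corn_only), len(prairie_only))
```

In this toy the prairie set retains more unique functions, mirroring the pattern the slide reports for the never-tilled prairie versus the long-cultivated corn soils.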
39. Frustrating, but helpful
“Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)
Evaluation of sequencing as a tool
Broad characterization
“Right” kind of data
How much should I sequence?
Data characteristics
Breadth vs. depth of sampling
Computational tool development
40. Lesson #2: Connecting the dots from data to information
If 80% is unknown… what can one do?
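One answer, sketched below with invented abundance profiles: characterize an unknown gene by the known gene whose abundance pattern it tracks across samples. This "guilt by association" gives a clue, not a conclusion, and warrants validation.

```python
# Associate an unknown gene with a known gene by correlated abundance
# across samples. Profiles are made-up values over five samples.
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

known = {
    "cellulase":   [10, 2, 8, 1, 9],
    "nitrogenase": [1, 9, 2, 8, 1],
}
unknown_gene = [11, 1, 9, 2, 10]

best = max(known, key=lambda g: pearson(known[g], unknown_gene))
print(best, round(pearson(known[best], unknown_gene), 2))
```

Here the unknown gene co-varies tightly with the hypothetical cellulase profile, so one might hypothesize a role in cellulose degradation and then return to model systems to test it.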
47. Lesson #3: Is more data better?
Bottlenecks for the emerging microbiologist
48. Technical challenges – many solutions
Access to data and its value
Access to resources
Data volume and velocity “clog”
Data is very heterogeneous
49. Data intensive microbiology
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
52. Social obstacles – the main challenge
A shift of costs does not mean a shift of expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than the time it took you to do your experiment to analyze the data. Please do not write me for results within 24 hours of your sequences becoming available.
- Adina
58. Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center
Editor's Notes
Thank Beckett
Journey with big data
The questions we have in understanding microbes have not changed much…
Historically, we have been asking these questions in model organisms.
The challenge of model organisms…comparing them to what we know is in the environment…
First automated DNA sequencing machines late 80s,
New way of asking questions.
Highlighted in recent news
Opportunities and changes in the systems we study.
So then the question is not only who is there and what they are doing? But what are they doing together and how?
The growth – point out NGS impact
Accompanied by challenges of computation…even to store data on.
Data during my career really reflects this growth.
During my postdoc's first year, 50 million reads grew to about 40x that within literally 9 months; the data increased by orders of magnitude.
Notice the gap from 2010 – 2014, figuring stuff out.
The goal is to understand the communities in the soil. The challenge is that the community in the soil is too large to sample. Using the targeted approaches, you'll see many microbial soil and environmental studies report data on community membership and structure.
These investigations target the 16S ribosomal RNA gene, which is conserved in bacteria and archaea. Because this gene is conserved, sampling these genes allows a comparison of how similar these biomarkers are within a community. Basically you take each sequence of each gene and align it to the other genes you've sampled, and from that you can identify a community structure profile that you can then compare between samples.
You can compare sampled sequences to previously observed sequences and identify who and how much of that microorganism is in your soils.
Soil biodiversity is amazing.
Great Prairie – world's most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters the most carbon, produces a large amount of biomass annually, key for biofuels and security.
We know little about the who / what in these soils.
Excitement about what we could glean now with the technologies.
Most of us now recognize that microbial communities generally exhibit a high level of diversity, much higher than previously assumed from what was revealed by classical microscopy and basic culturing techniques.
In soil, even in one gram of soil, there are estimated to be more microbial species than there are stars in the galaxy. We have far to go for any comprehensive characterization of any single soil community. A key question then is why soil diversity is so high.
One reason may be that the soil structure provides unique niche that provide a high diversity of food resources.
Its varied structure provides stable, protective, and even ancient environments for microorganisms.
Soil investigations are further complicated by the primarily dormant state of the large majority of the soil microbial population. The turnover rate of soil microbes is predicted to be over 30 fold and even up to 300 fold slower than that of microbes in the oceans.
And these microbes live under relatively unpredictable patterns of perturbations – for example rainfall or leaf litter introduction. They also undergo defined temporal perturbations – diurnal energy input.
This complexity in the soil has formed a dynamic microbial ecosystem which interacts with nutrients, plants, and the soil structure itself at multiple scales.
I would argue that we as a field are still trying to find tractable methods of accessing these interactions and understanding the drivers of “healthy” or “productive” soils.
There are several grand challenges that our society is currently facing which I think are of paramount importance: predicting and managing the impacts of climate change, finding sustainable sources of liquid fuels, and understanding the emerging pandemics facing human health in recent years. From carbon emissions from land use (which is magnitudes more than that of car emissions), to degrading cellulosic biomass, to pathogens in our bodies, microbes are involved in complex communities that drive the health and productivity of either our natural resources or our own bodies. And it's building up the expertise to ask
For example, microbes in the soil help cycle important nutrients for plants to grow while also impacting global flows of important elements such as carbon and nitrogen.
In fact, when you estimate the production of CO2 in soils and compare it to automotive emissions, you'd find that anthropogenic sources of CO2 make up only 10% of that of the soils – which has a lot to do with the underlying microbes. As a consequence, you could capture roughly 10% of all carbon emitted by cars just by employing sustainable agriculture practices.
Understanding these processes in the soil can help us then learn how to both predict the impacts of our land management strategies on climate change but also help us understand how we can best manage our limited soil resources.
With growing volumes of data, the most obvious way to tractably access this data is to “smartly” reduce this data.
One genomic way to reduce data is a process known as assembly.
Assembly has been around since the sequencing of single organisms.
Metagenomics…a problem of scale
Assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments.
In this example, we are coming up with a solution of one sentence using 8 fragments.
In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments.
And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
Even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them, they would essentially crash any available assembly program that existed.
So I had to come up with a way to deal with all of this data, or essentially there were a handful of PIs that had just invested tens of thousands of dollars in a project where we couldn't tractably handle the datasets.
I’m going to tell you now a little bit about how we were able to do this and there actually two different strategies we had to combine.
First, start thinking about what the natural characteristics of environmental communities are.
Diverse: there are multiple genomes, and potentially millions of species, in a sample.
Variable abundance in nature; some are highly abundant, some are not.
This diversity and distribution of abundances means that we are unevenly sampling strains in the environment.
If we want to sample the rarest species… we need
A strategy we came up with: can we find the minimal dataset that you need for assembly, discarding reads from this overkill section?
From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
As we sample more, we will have some sequences which will have errors in it.
And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point, where we’ll have enough sequences where we can make a good guess at what the original sequence may have looked like.
For assembly, you need a minimum amount of information. So anything beyond this 6 is excessive or redundant information.
So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
minimal dataset needed for an assembly of the dataset here in pink and a redundant set of information which we have set aside.
In setting aside these reads here in the red, discard errors
Improve assembly
More than half, 50-80% sequences unknown in soil, gut microbiome
Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 years and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
Reducing data is only part of the problem when using sequences to inform microbiological processes.
Link data (largely unknown) to biological processes
One way is start linking unknowns to things we know about.
We can look at characteristics of something known that we may be able to describe to some degree and then find unknown entities that might exhibit similar patterns
Gives us a clue at what the unknown might be.
Example: three fridges, each with a set of objects inside that might describe the community that fridge is associated with.
We can then look for patterns in unknown parts of our dataset that exhibit similar patterns as these known entities. For example, entities that share similar abundance profiles.
These unknowns can then be characterized by association. Frat-boy communities, graduate-student communities, and healthy-chef communities might all have different characteristics.
The reliability of this analysis warrants caution and further validation.
I’ve found that this analysis almost always leads to more questions than answers.
Always turning back to model systems to help clarify hypotheses generated from this analysis.
Finally, as the last part of my perspective on big data, I wanted to talk about the theme of this workshop? Is more data better? For me, the answer is always yes. I always want more data. This is largely attributable to the fact that I have a lot of experience working with this data and the resources to play with it. But that is not always the case. So what are the challenges of big data to the microbiologist or biologist?
More challenging is the emerging role a microbiologist now has to fill and the changing teams we are now involved in.
I’m asked to play all these roles in various projects I’m involved in.
And definitely, I’m asked to communicate to people in all these roles and they are asked to communicate with me. This communication can be challenging.
For example, if you asked us all to describe a tire swing building project, you'd undoubtedly get many varied descriptions.
Communication and social obstacles are the most difficult,
The need to share and participate in interdisciplinary research comes along with a culture of needing to demonstrate individual impact.
Total reproducibility of all figures – one button
Change the dataset, redo entire analysis on your own data