Big Data Field Museum
1. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
4. Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)
5. Effects of low-cost sequencing…
First free-living bacterium sequenced for billions of dollars and years of analysis
A personal genome can now be mapped in a few days for hundreds to a few thousand dollars
7. The era of big data in biology
[Figure: growth in DNA sequencing (Mbp per $) and disk storage (Mb per $), 1990–2012. NGS (shotgun) sequencing doubles every 5 months, computational hardware every 14 months, Sanger sequencing every 19 months. Stein, Genome Biology, 2010]
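The doubling times in the figure imply very different annual growth rates. A quick sketch (doubling times from the slide; the function name is my own) makes the widening gap concrete:

```python
# Fold-increase per year implied by each doubling time in the figure.
def fold_per_year(doubling_months):
    """How many times a quantity multiplies over 12 months."""
    return 2 ** (12 / doubling_months)

trends = {
    "NGS shotgun sequencing": 5,    # doubling time in months
    "computational hardware": 14,
    "Sanger sequencing": 19,
}

for name, months in trends.items():
    print(f"{name}: {fold_per_year(months):.2f}x per year")
```

Sequencing capacity per dollar multiplies more than 5x per year while hardware manages under 2x, which is exactly the growing computational gap the rest of the talk turns on.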
8. Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
9. TARGETED SEQUENCING STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targeting conserved regions of known genes
Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea
Who is there: community profiling based on sequence similarity
Must have previous knowledge of genes
Must infer function based on phylogeny – not advised
10. TARGETED SEQUENCING STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targeting conserved regions of known genes
Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea
$15 / sample
Who is there: community profiling based on sequence similarity
Must have previous knowledge of genes
Must infer function based on phylogeny – not advised
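The profiling step can be sketched as a toy classifier: assign each sampled read to the closest reference sequence and tally a community profile. The reference "signature" regions and read strings below are invented for illustration, not real 16S sequences.

```python
# Toy sketch of 16S-style community profiling by sequence similarity.
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

references = {           # hypothetical short signature regions
    "Bacillus":     "ACGTACGTAC",
    "Pseudomonas":  "TTGCAATGGC",
    "Nitrosomonas": "GGATCCGTTA",
}

def classify(read, max_mismatches=2):
    """Return the best-matching reference taxon, or None if too distant."""
    best = min(references, key=lambda t: hamming(read, references[t]))
    return best if hamming(read, references[best]) <= max_mismatches else None

reads = ["ACGTACGAAC", "TTGCAATGGC", "ACGTACGTAC", "CCCCCCCCCC"]
profile = Counter(t for t in map(classify, reads) if t)
print(profile)
```

The unmatched read illustrates the slide's caveat: without previous knowledge of the gene, a sequence simply drops out of the profile.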
11. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
12. THE DIRT ON SOIL
MAGNIFICENT BIODIVERSITY
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010. Photo: Jeremy Burgess
13. THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
15. THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
16. Our shared challenges
Climate Change
USGCRP 2009
Energy Supply
www.alutiiq.com
Human Health
http://guardianlv.com/
An understanding
of microbial ecology
17. SOIL MICROBIOLOGY: CARBON REGULATION
Anthropogenic CO2 production is only ~10% of that of the soil
Sustainable agriculture permits carbon sequestration in the range of 0.3–1 ton C/ha per year, roughly 10% of all carbon emitted by cars
(Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)
18. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
19. Lesson #1: Accessing information in data
http://siliconangle.com/files/2010/09/image_thumb69.png
20. de novo assembly
Raw sequencing data (“reads”) → computational algorithms → informative genes / genomes
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
22. Shotgun sequencing and de novo assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
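The sentence reconstruction above is the core of what an assembler does: find overlaps between fragments and merge them. A minimal greedy sketch of that idea follows; it uses error-free fragments for clarity, whereas real reads carry the "typos" shown in the slide and force assemblers to tolerate mismatches.

```python
# Greedy overlap assembly: repeatedly merge the fragment pair with the
# longest exact suffix/prefix overlap until no confident overlap remains.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags, min_olap=5):
    frags = list(frags)
    while len(frags) > 1:
        a, b, n = max(((x, y, overlap(x, y))
                       for x in frags for y in frags if x is not y),
                      key=lambda t: t[2])
        if n < min_olap:
            break                      # no confident overlap left
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])        # merge, dropping the shared region
    return frags

pieces = [
    "It was the best of times, it was the wor",
    "the worst of times, it was the age",
    "was the age of wisdom, it was the age of foolishness",
]
print(greedy_assemble(pieces))
```

In metagenomic assembly the same comparison runs over billions of fragments from thousands of genomes, which is why the next slides are about computational cost.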
23. Practical Challenges – Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al, 2014, PNAS
24. Practical Challenges – Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al, 2014, PNAS
Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
26. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Sample 1x
27. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Sample 1x, Sample 10x
28. Natural community characteristics
Diverse: many organisms (genomes)
Variable abundance: most abundant organisms are sampled more often
Assembly requires a minimum amount of sampling
More sequencing, more errors
Overkill
Sample 1x, Sample 10x
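The uneven-sampling point can be simulated directly. The species names and relative abundances below are invented illustration values, not measurements:

```python
# Simulate shotgun sampling of a community with variable abundance:
# each read is drawn from a genome in proportion to its abundance.
import random

random.seed(0)                        # reproducible illustration
abundance = {"abundant_sp": 90, "common_sp": 9, "rare_sp": 1}
genomes, weights = zip(*abundance.items())

def sample_reads(n):
    """Draw n reads, each attributed to the genome it came from."""
    counts = dict.fromkeys(genomes, 0)
    for g in random.choices(genomes, weights=weights, k=n):
        counts[g] += 1
    return counts

print(sample_reads(1000))
```

To raise coverage of the rare species 10x you must sequence roughly 10x more in total, so about 90% of the extra reads are redundant copies of the abundant species, each carrying its own sequencing errors. That redundant region is the "overkill" the next slides target.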
34. Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Shrinks datasets for assembly by up to 95% with the same assembly outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
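A minimal sketch of the digital normalization idea (my own toy parameters, not the published implementation): stream the reads, estimate each read's coverage as the median count of its k-mers seen so far, and keep the read only while that estimate is below a cutoff. Redundant high-coverage reads are discarded without ever assembling anything.

```python
# Toy digital normalization: discard reads whose k-mer coverage
# estimate shows they add only redundant (error-bearing) information.
from collections import Counter
from statistics import median

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def diginorm(reads, k=4, cutoff=3):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept

# Ten identical reads collapse to the few needed; a novel read survives.
print(diginorm(["ACGTACGT"] * 10 + ["TTTTGGGG"]))
```

Because the decision is made per read in a single pass, memory scales with the retained data rather than the full dataset, which is what turns months on a supercomputer into the 14 GB / 24 hour figure above.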
35. Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
38. SOIL METAGENOME REALITY CHECK
Grand Challenge effort: 10% of soil biodiversity sampled
Incredible soil biodiversity (estimated to require 10 Tbp/sample)
“To boldly go where no man has gone before”: >60% unknown
[Figure: total counts of KEGG Orthology (KO) functional categories – amino acid metabolism, carbohydrate metabolism, membrane transport, signal transduction, and others – colored by whether each KO occurs in both corn and prairie soils, corn only, or prairie only.]
Howe et al, 2014, PNAS
Managed agricultural soils exhibit less diversity, likely due to their history of cultivation.
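The shared-vs-unique tallies in the figure are, at heart, set operations over the KO identifiers annotated in each soil. The identifiers below are placeholders for illustration, not the published annotations:

```python
# Shared and unique KEGG Orthology (KO) functions via set operations.
corn    = {"K00001", "K00002", "K00005", "K00010"}
prairie = {"K00001", "K00002", "K00003", "K00004", "K00010", "K00042"}

shared       = corn & prairie    # "corn and prairie"
corn_only    = corn - prairie    # "corn only"
prairie_only = prairie - corn    # "prairie only"

print(len(shared), len(corn_only), len(prairie_only))
```

In this toy the prairie set retains more unique functions, mirroring the pattern the slide reports for the never-tilled prairie versus the long-cultivated corn soils.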
39. Frustrating, but helpful
“Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)
Evaluation of sequencing as a tool
Broad characterization
“Right” kind of data
How much should I sequence?
Data characteristics
Breadth vs. depth of sampling
Computational tool development
40. Lesson #2: Connecting the dots from data to information
If 80% is unknown… what can one do?
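One answer, sketched below with invented abundance profiles: characterize an unknown gene by the known gene whose abundance pattern it tracks across samples. This "guilt by association" gives a clue, not a conclusion, and warrants validation.

```python
# Associate an unknown gene with a known gene by correlated abundance
# across samples. Profiles are made-up values over five samples.
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

known = {
    "cellulase":   [10, 2, 8, 1, 9],
    "nitrogenase": [1, 9, 2, 8, 1],
}
unknown_gene = [11, 1, 9, 2, 10]

best = max(known, key=lambda g: pearson(known[g], unknown_gene))
print(best, round(pearson(known[best], unknown_gene), 2))
```

Here the unknown gene co-varies tightly with the hypothetical cellulase profile, so one might hypothesize a role in cellulose degradation and then return to model systems to test it.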
47. Lesson #3: Is more data better?
Bottlenecks for the emerging microbiologist
48. Technical challenges – many solutions
Access to data and its value
Access to resources
Data volume and velocity “clog”
Data is very heterogeneous
49. Data intensive microbiology
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
52. Social obstacles – the main challenge
A shift of costs does not mean a shift of expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than the time it took you to do your experiment to analyze the data. Please do not write me for results within 24 hours of your sequences becoming available.
- Adina
58. Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center
Editor's Notes
Thank Beckett
Journey with big data
The questions we have in understanding microbes have not changed much…
Historically, we have been asking these questions in model organisms.
The challenge of model organisms…comparing them to what we know is in the environment…
First automated DNA sequencing machines late 80s,
New way of asking questions.
Highlighted in recent news
Opportunities and changes in the systems we study.
So then the question is not only who is there and what they are doing? But what are they doing together and how?
The growth – point out NGS impact
Accompanied by challenges of computation…even to store data on.
Data during my career really reflects this growth.
During my postdoc's first year, 50 million reads grew to about 40x that within literally 9 months; the data increased by orders of magnitude.
Notice the gap from 2010 – 2014, figuring stuff out.
The goal is to understand the communities in the soil. The challenge is that the community in the soil is too large to sample. Using the targeted approaches, you'll see many microbial soil and environmental studies report data on community membership and structure.
These investigations target the 16S ribosomal RNA gene, which is conserved in bacteria and archaea. Because this gene is conserved, sampling these genes allows a comparison of how similar these biomarkers are within a community. Basically you take each sequence of each gene and align it to the other genes you've sampled, and from that you can identify a community structure profile that you can then compare between samples.
You can compare sampled sequences to previously observed sequences and identify who and how much of that microorganism is in your soils.
Soil biodiversity is amazing.
Great Prairie – world's most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters the most carbon, produces a large amount of biomass annually, key for biofuels and security.
We know little about the who / what in these soils.
Excitement about what we could glean now with the technologies.
Most of us now recognize that microbial communities generally exhibit a high level of diversity, much higher than previously assumed from what was revealed by classical microscopy and basic culturing techniques.
In soil, even in one gram of soil, there are estimated to be more microbial species than there are stars in the galaxy. We have far to go for any comprehensive characterization of any single soil community. A key question then is why soil diversity is so high.
One reason may be that the soil structure provides unique niche that provide a high diversity of food resources.
Its varied structure provides stable, protective, and even ancient environments for microorganisms.
Soil investigations are further complicated by the primarily dormant state of the large majority of the soil microbial population. The turnover rate of soil microbes is predicted to be over 30 fold and even up to 300 fold slower than that of microbes in the oceans.
And these microbes live under relatively unpredictable patterns of perturbations – for example rainfall or leaf litter introduction. They also undergo defined temporal perturbations – diurnal energy input.
This complexity in the soil has formed a dynamic microbial ecosystem which interacts with nutrients, plants, and the soil structure itself at multiple scales.
I would argue that we as a field are still trying to find tractable methods of accessing these interactions and understanding the drivers of “healthy” or “productive” soils.
There are several grand challenges that our society is currently facing which I think are of paramount importance: predicting and managing the impacts of climate change, finding sustainable sources of liquid fuels, and understanding the emerging pandemics facing human health in recent years. From carbon emissions from land use (which is magnitudes more than that of car emissions), to degrading cellulosic biomass, to pathogens in our bodies, microbes are involved in complex communities that drive the health and productivity of either our natural resources or our own bodies. And it's building up the expertise to ask
For example, microbes in the soil help cycle important nutrients for plants to grow while also impacting global flows of important elements such as carbon and nitrogen.
In fact, when you estimate the production of CO2 in soils and compare it to automotive emissions, you'd find that anthropogenic sources of CO2 make up only 10% of that of the soils – which has a lot to do with the underlying microbes. As a consequence, you could capture roughly 10% of all carbon emitted by cars just by employing sustainable agriculture practices.
Understanding these processes in the soil can help us then learn how to both predict the impacts of our land management strategies on climate change but also help us understand how we can best manage our limited soil resources.
With growing volumes of data, the most obvious way to tractably access this data is to “smartly” reduce this data.
One genomic way to reduce data is a process known as assembly.
Assembly has been around since the sequencing of single organisms.
Metagenomics…a problem of scale
Assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments.
In this example, we are coming up with a solution of one sentence using 8 fragments.
In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments.
And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
Even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them, they would essentially crash any available assembly program that existed.
So I had to come up with a way to deal with all of this data, or essentially there were a handful of PIs that had just invested tens of thousands of dollars in a project where we couldn't tractably handle the datasets.
I’m going to tell you now a little bit about how we were able to do this and there actually two different strategies we had to combine.
First, start thinking about what the natural characteristics of environmental communities are.
Diverse: there are multiple genomes, and potentially millions of species, in a sample.
Variable abundance in nature; some are highly abundant, some are not.
This diversity and distribution of abundances means that we are unevenly sampling strains in the environment.
If we want to sample the rarest species… we need
A strategy we came up with: can we find the minimal dataset that you need for assembly, discarding reads from this overkill section?
From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
As we sample more, we will have some sequences which will have errors in it.
And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point, where we’ll have enough sequences where we can make a good guess at what the original sequence may have looked like.
For assembly, you need a minimum amount of information. So anything beyond this 6 is excessive or redundant information.
So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
minimal dataset needed for an assembly of the dataset here in pink and a redundant set of information which we have set aside.
In setting aside these reads here in the red, discard errors
Improve assembly
More than half, 50-80% sequences unknown in soil, gut microbiome
Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 years and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
Reducing data is only part of the problem when using sequences to inform microbiological processes.
Link data (largely unknown) to biological processes
One way is start linking unknowns to things we know about.
We can look at characteristics of something known that we may be able to describe to some degree and then find unknown entities that might exhibit similar patterns
Gives us a clue at what the unknown might be.
Example: three fridges, each with a set of objects inside that might describe the community that fridge is associated with.
We can then look for patterns in unknown parts of our dataset that exhibit similar patterns as these known entities. For example, entities that share similar abundance profiles.
These unknowns can then be characterized by association. Frat-boy communities, graduate-student communities, and healthy-chef communities might all have different characteristics.
The reliability of this analysis warrants caution and further validation.
I’ve found that this analysis almost always leads to more questions than answers.
Always turning back to model systems to help clarify hypotheses generated from this analysis.
Finally, as the last part of my perspective on big data, I wanted to talk about the theme of this workshop? Is more data better? For me, the answer is always yes. I always want more data. This is largely attributable to the fact that I have a lot of experience working with this data and the resources to play with it. But that is not always the case. So what are the challenges of big data to the microbiologist or biologist?
More challenging is the emerging role a microbiologist now has to fill and the changing teams we are now involved in.
I’m asked to play all these roles in various projects I’m involved in.
And definitely, I’m asked to communicate to people in all these roles and they are asked to communicate with me. This communication can be challenging.
For example, if you asked us all to describe a tire swing building project, you'd undoubtedly get many varied descriptions.
Communication and social obstacles are the most difficult,
The need to share and participate in interdisciplinary research comes along with a culture of needing to demonstrate individual impact.
Total reproducibility of all figures – one button
Change the dataset, redo entire analysis on your own data