RIDING THE BIG DATA 
TIDAL WAVE OF MODERN 
MICROBIOLOGY 
Adina Howe 
Argonne National Laboratory / Michigan State University 
Iowa State University, Ag & Biosystems Engr (January)
Understanding community dynamics
• Who is there?
• What are they doing?
• How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
• Collect samples
• Extract DNA
• Sequence DNA
• “Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)
Effects of low-cost sequencing…
First free-living bacterium sequenced for billions of dollars and years of analysis
A personal genome can now be mapped in a few days for a few hundred to a few thousand dollars
The experimental continuum 
Single Isolate 
Pure Culture 
Enrichment 
Mixed Cultures 
Natural systems
The era of big data in biology 
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $) by year, 1990 to 2012, log scale. Doubling times: NGS (shotgun) sequencing, 5 months; computational hardware, 14 months; Sanger sequencing, 19 months. Source: Stein, Genome Biology, 2010]
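To make the doubling times on this slide concrete, here is a small arithmetic sketch. The 5-year horizon and the function name are assumptions chosen for illustration; only the three doubling times come from the figure.

```python
def fold_increase(months_elapsed: float, doubling_time_months: float) -> float:
    """Growth factor after `months_elapsed` for a quantity that doubles
    every `doubling_time_months`."""
    return 2.0 ** (months_elapsed / doubling_time_months)


# Doubling times as quoted on the slide (Stein, Genome Biology, 2010).
DOUBLING_TIMES_MONTHS = {
    "NGS (shotgun) sequencing": 5,
    "Computational hardware": 14,
    "Sanger sequencing": 19,
}

HORIZON_MONTHS = 5 * 12  # assumption: compare growth over five years

for name, doubling_time in DOUBLING_TIMES_MONTHS.items():
    growth = fold_increase(HORIZON_MONTHS, doubling_time)
    print(f"{name}: about {growth:,.0f}-fold in {HORIZON_MONTHS // 12} years")
```

Under these assumptions, sequencing output per dollar outpaces hardware by orders of magnitude over the same period, which is the gap the rest of the talk deals with.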
Postdoc experience with data 
2003-2008 Cumulative sequencing in PhD = 2000 bp 
2008-2009 Postdoc Year 1 = 50 Gbp 
2009-2010 Postdoc Year 2 = 450 Gbp 
2014 = 50 Tbp 
2015 = 500 Tbp budgeted
TARGETED SEQUENCING STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targeting conserved regions of known genes
Most popular: 16S ribosomal RNA gene, conserved in bacteria and archaea
$15 / sample
• Who is there: community profiling based on sequence similarity
• Must have previous knowledge of genes
• Must infer function based on phylogeny (not advised)
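Community profiling by sequence similarity can be sketched roughly as below. This is a toy illustration, not the pipeline used in the talk: the reference sequences and taxon names are made up, the sequences are assumed to be pre-aligned and of equal length, and the 97% identity threshold is an assumption.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def assign_taxon(query: str, references: dict, min_identity: float = 97.0):
    """Return (best matching reference name, identity), or 'unclassified'
    when no reference reaches `min_identity`."""
    best_name, best_identity = "unclassified", 0.0
    for name, sequence in references.items():
        identity = percent_identity(query, sequence)
        if identity > best_identity:
            best_name, best_identity = name, identity
    if best_identity < min_identity:
        return "unclassified", best_identity
    return best_name, best_identity


# Hypothetical 40-bp reference fragments (not real 16S sequences).
REFERENCES = {
    "taxon_A": "ACGTACGTTA" "GCCGATAGCT" "AGGCTTAACG" "GATCCGATAA",
    "taxon_B": "TTGACCGGTA" "ACGTTCAGGC" "ATCGATTGCA" "AGGCTTACCG",
}

# A query that differs from taxon_A at a single position (C -> T at index 5).
QUERY = REFERENCES["taxon_A"][:5] + "T" + REFERENCES["taxon_A"][6:]

print(assign_taxon(QUERY, REFERENCES))
print(assign_taxon("G" * 40, REFERENCES))
```

The sketch also shows the limitation listed on the slide: a query with no close reference simply comes back unclassified, because the approach depends on previous knowledge of the gene.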
Tackling Soil Biodiversity 
Source: Chuck Haney 
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) 
Janet Jansson, Susannah Tringe (JGI)
THE DIRT ON SOIL 
MAGNIFICENT BIODIVERSITY 
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
THE DIRT ON SOIL 
SPATIAL HETEROGENEITY 
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL 
DYNAMIC
THE DIRT ON SOIL 
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES 
Philippot, 2013, Nature Reviews Microbiology
Our shared challenges 
Climate Change 
USGCRP 2009 
Energy Supply 
www.alutiiq.com 
Human Health 
http://guardianlv.com/ 
An understanding 
of microbial ecology
SOIL MICROBIOLOGY: CARBON 
REGULATION 
The anthropogenic CO2 production is only 10% of that of the soil 
Sustainable agriculture permits carbon 
sequestration in the range of 0.3 – 1 ton of 
C/ha.yr ~ 10% of all carbon emitted by cars 
(Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)
Lesson #1: Accessing information in 
data 
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly 
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genomes
Compresses dataset size significantly 
Improved data quality (longer sequences, gene order) 
Reference not necessary (novelty)
Metagenome assembly…a scaling 
problem.
Shotgun sequencing and de novo 
assembly 
It was the Gest of times, it was the wor 
, it was the worst of timZs, it was the 
isdom, it was the age of foolisXness 
, it was the worVt of times, it was the 
mes, it was Ahe age of wisdom, it was th 
It was the best of times, it Gas the wor 
mes, it was the age of witdom, it was th 
isdom, it was tIe age of foolishness 
It was the best of times, it was the worst of times, it was the 
age of wisdom, it was the age of foolishness
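The sentence example above can be turned into a toy assembler. This is a minimal sketch of greedy overlap merging, not the assembler used in the talk. It assumes error-free fragments (unlike the slide, where reads carry substitutions) and reads that are longer than any repeated phrase in the sentence; the read length, step, and function names are chosen for illustration.

```python
import random

TEXT = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")


def make_reads(text: str, read_len: int = 40, step: int = 10) -> list:
    """Cut overlapping, error-free fragments ("reads") out of the text."""
    starts = set(range(0, len(text) - read_len + 1, step))
    starts.add(len(text) - read_len)  # make sure the end of the text is covered
    return [text[s:s + read_len] for s in sorted(starts)]


def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list) -> list:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


reads = make_reads(TEXT)
random.Random(0).shuffle(reads)  # the assembler does not know the read order
contigs = greedy_assemble(reads)
print(contigs)
```

Every pair of sequences is compared on every round, so the work grows quickly with the number of reads. With shorter reads, the repeated phrase "it was the" would start to cause wrong merges, which hints at why repeats and errors make real assembly hard.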
Practical Challenges: Intensive computing
Months of “computer crunching” on a supercomputer
Howe et al., 2014, PNAS
Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
Natural community characteristics
• Diverse
• Many organisms (genomes)
• Variable abundance
• Most abundant organisms, sampled more often
• Assembly requires a minimum amount of sampling
• More sequencing, more errors
[Figure labels: Sample 1x, Sample 10x, Overkill]
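The link between abundance and sampling can be sketched with simple coverage arithmetic. All numbers below are hypothetical: three made-up genomes of 5 Mbp each, made-up shares of the sequenced DNA, and an assumed minimum of 10x coverage for assembly.

```python
def expected_coverage(total_bp: float, dna_fraction: float, genome_size_bp: float) -> float:
    """Average number of times each base of a genome is sequenced, given the
    total sequencing effort and the genome's share of the sequenced DNA."""
    return total_bp * dna_fraction / genome_size_bp


# Hypothetical community: name -> (share of sequenced DNA, genome size in bp).
COMMUNITY = {
    "abundant": (0.900, 5_000_000),
    "moderate": (0.099, 5_000_000),
    "rare": (0.001, 5_000_000),
}

MIN_COVERAGE = 10.0  # assumption: minimum coverage needed to assemble


def assemblable(total_bp: float) -> list:
    """Names of genomes whose expected coverage reaches MIN_COVERAGE."""
    return [name for name, (fraction, size) in COMMUNITY.items()
            if expected_coverage(total_bp, fraction, size) >= MIN_COVERAGE]


for label, total_bp in [("Sample 1x", 1e9), ("Sample 10x", 1e10)]:
    coverages = {name: round(expected_coverage(total_bp, fraction, size), 1)
                 for name, (fraction, size) in COMMUNITY.items()}
    print(label, coverages, "assemblable:", assemblable(total_bp))
```

In this toy case, ten times more sequencing still leaves the rare genome below the threshold while the abundant genome is sampled far beyond what assembly needs, which is the overkill the slide points at.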
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
• Scales down datasets for assembly by up to 95%, with the same assembly outputs.
• Genomes, mRNA-seq, metagenomes (soils, gut, water)
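The idea behind digital normalization (Brown et al., 2012) can be sketched as follows: keep a read only while the median count of its k-mers, among the reads kept so far, is below a cutoff. This is a simplified illustration, not the published implementation. It uses an exact in-memory counter where the real tool uses a memory-efficient probabilistic counting structure, it ignores reverse complements, and the k-mer size, cutoff, and toy reads are assumptions.

```python
from collections import Counter
from statistics import median


def kmers(read: str, k: int) -> list:
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k: int = 5, cutoff: int = 5) -> list:
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # reads shorter than k are dropped in this sketch
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Toy data: one read sampled 50 times, another sampled once.
ABUNDANT = "ACGTTGCAAGCTTAGGCATC"
RARE = "TTTTGGGGCCCCAAAATTGG"
reads = [ABUNDANT] * 50 + [RARE]

kept = digital_normalization(reads)
print(f"kept {len(kept)} of {len(reads)} reads")
```

Redundant copies of the abundant read are discarded once it has been seen often enough, while the rare read is retained, so the dataset shrinks without losing the low-abundance signal.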
The reality?
More like… 
Howe et al., 2014, PNAS. Source: Chuck Haney
SOIL METAGENOME REALITY CHECK
• Grand Challenge effort: 10% of soil biodiversity sampled
• Incredible soil biodiversity (an estimated 10 Tbp/sample required)
• “To boldly go where no man has gone before”: >60% unknown
[Figure: total count of KOs (axis 0 to 400) by functional category, split into corn and prairie, corn only, and prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility. Source: Howe et al., 2014, PNAS]
Managed agricultural soils exhibit less diversity, likely because of their history of cultivation.
Frustrating, but helpful
• “Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)
• Evaluation of sequencing as a tool
• Broad characterization
• “Right” kind of data
• How much should I sequence?
• Data characteristics
• Breadth vs. depth of sampling
• Computational tool development
• Tr
Lesson #2: Connecting the dots 
from data to information 
If 80% is 
unknown…what 
can one do?
The idea of co-occurrence
Co-occurrence to detect novelty 
Williams et al., 2014, Frontiers of Micro
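One simple way to look for co-occurrence is to correlate abundance profiles across samples and flag unknown genes that track a known one. The sketch below is an assumption-laden illustration, not the method of Williams et al.: the abundance table and gene names are made up, Pearson correlation stands in for whatever measure a real analysis would use, and the 0.9 threshold is arbitrary.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical abundances of three genes across six samples.
ABUNDANCE = {
    "known_cellulase": [1, 2, 3, 4, 5, 6],
    "unknown_gene_1": [2, 4, 6, 8, 10, 12],
    "unknown_gene_2": [6, 1, 5, 2, 4, 3],
}


def co_occurring(target: str, table: dict, min_r: float = 0.9) -> list:
    """Genes whose abundance profile correlates with `target` at or above `min_r`."""
    return [name for name, profile in table.items()
            if name != target and pearson(table[target], profile) >= min_r]


print(co_occurring("known_cellulase", ABUNDANCE))
```

A gene of unknown function that consistently rises and falls with a known gene becomes a candidate for follow-up. The correlation alone does not show that the two are functionally linked.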
Causation vs correlation 
Wikipedia.com
Success story in Human 
Microbiome
Lesson #3: Is more data better?
Bottlenecks for the emerging 
microbiologists
Technical challenges: many solutions
• Access to data and its value
• Access to resources
• Data volume and velocity “clog”
• Data is very heterogeneous
Data intensive microbiology 
Software Developers 
Computer Scientists 
Clinicians 
PIs 
Data generators 
Microbiologists 
Data Analyzers 
Statisticians 
Bioinformaticians 
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Social obstacles – the main 
challenge 
A shift in costs does not mean a shift in expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/ 
Dear PI, 
It will take longer than 
the time it took you to do 
your experiment to 
analyze the data. Please 
do not write me for 
results within 24 hours of 
your sequences 
becoming available. 
- Adina
Culture of sharing 
Metagenomic Datasets 
http://www.heathershumaker.com/
Training / Incentives 
Emails between collaborators don’t contain as 
much “science” as I’d like:
All analysis: accessible, 
reproducible, and automated
Acknowledgements
• C. Titus Brown (MSU)
• James Tiedje (MSU)
• Daina Ringus (UC)
• Folker Meyer (ANL)
• Eugene Chang (UC)
• NSF Biology Postdoc Fellowship
• DOE Great Lakes Bioenergy Research Center

 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 

Big Data Field Museum

  • 1. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY Adina Howe Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January)
  • 2. Understanding community dynamics  Who is there?  What are they doing?  How are they doing it?
  • 3. Understanding community dynamics  Who is there?  What are they doing?  How are they doing it? Kim Lewis, 2010
  • 4. Gene / Genome Sequencing  Collect samples  Extract DNA  Sequence DNA  “Analyze” DNA to identify its content and origin Taxonomy (e.g., pathogenic E. coli) Function (e.g., degrades cellulose)
  • 5. Effects of low cost sequencing… First free-living bacterium sequenced for billions of dollars and years of analysis Personal genome can be mapped in a few days and hundreds to few thousand dollars
  • 6. The experimental continuum Single Isolate Pure Culture Enrichment Mixed Cultures Natural systems
  • 7. The era of big data in biology [Figure: DNA sequencing (Mbp per $) and disk storage (Mb/$) versus year, 1990-2012, on log-scale axes. Series: NGS (shotgun) sequencing (doubling time 5 months), computational hardware (doubling time 14 months), Sanger sequencing (doubling time 19 months).] Stein, Genome Biology, 2010
  • 8. Postdoc experience with data 2003-2008 Cumulative sequencing in PhD = 2000 bp 2008-2009 Postdoc Year 1 = 50 Gbp 2009-2010 Postdoc Year 2 = 450 Gbp 2014 = 50 Tbp 2015 = 500 Tbp budgeted
  • 9. TARGETED SEQUENCING STRATEGY “Soil Census” to “Soil Catalogs”: Who is there? Targeting conserved regions of known genes Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea  Who is there - community profiling based on sequence similarity  Must have previous knowledge of genes  Must infer function based on phylogeny – not advised
  • 10. TARGETED SEQUENCING STRATEGY “Soil Census” to “Soil Catalogs”: Who is there? Targeting conserved regions of known genes Most popular: 16S ribosomal RNA gene – conserved in bacteria and archaea $15 / sample  Who is there - community profiling based on sequence similarity  Must have previous knowledge of genes  Must infer function based on phylogeny – not advised
  • 11. Tackling Soil Biodiversity Source: Chuck Haney C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) Janet Jansson, Susannah Tringe (JGI)
  • 12. THE DIRT ON SOIL MAGNIFICENT BIODIVERSITY Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
  • 13. THE DIRT ON SOIL SPATIAL HETEROGENEITY http://www.fao.org/ www.cnr.uidaho.edu
  • 14. THE DIRT ON SOIL DYNAMIC
  • 15. THE DIRT ON SOIL INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES Philippot, 2013, Nature Reviews Microbiology
  • 16. Our shared challenges Climate Change USGCRP 2009 Energy Supply www.alutiiq.com Human Health http://guardianlv.com/ An understanding of microbial ecology
  • 17. SOIL MICROBIOLOGY: CARBON REGULATION The anthropogenic CO2 production is only 10% of that of the soil Sustainable agriculture permits carbon sequestration in the range of 0.3 – 1 ton of C/ha.yr ~ 10% of all carbon emitted by cars (Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)
  • 18. Tackling Soil Biodiversity Source: Chuck Haney C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) Janet Jansson, Susannah Tringe (JGI)
  • 19. Lesson #1: Accessing information in data http://siliconangle.com/files/2010/09/image_thumb69.png
  • 20. de novo assembly Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes Compresses dataset size significantly Improved data quality (longer sequences, gene order) Reference not necessary (novelty)
  • 22. Shotgun sequencing and de novo assembly It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 23. Practical Challenges – Intensive computing Months of “computer crunching” on a supercomputer Howe et al, 2014, PNAS
  • 24. Practical Challenges – Intensive computing Months of “computer crunching” on a supercomputer Howe et al, 2014, PNAS Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
  • 25. Natural community characteristics  Diverse  Many organisms (genomes)
  • 26. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x
  • 27. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x Sample 10x
  • 28. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Overkill Sample 1x Sample 10x
  • 29. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 30. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 31. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 32. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 33. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 34. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One  Scales down datasets for assembly by up to 95%, with the same assembly outputs.  Genomes, mRNA-seq, metagenomes (soils, gut, water)
  • 35. Tackling Soil Biodiversity Source: Chuck Haney C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) Janet Jansson, Susannah Tringe (JGI)
  • 37. More like… Howe et al., 2014, PNAS Source: Chuck Haney
  • 38. SOIL METAGENOME REALITY CHECK  Grand Challenge effort – 10% of soil biodiversity sampled  Incredible soil biodiversity (an estimated 10 Tbp/sample required)  “To boldly go where no man has gone before”: >60% Unknown [Figure: total KO count (0-400) per functional category, for KOs found in corn and prairie, corn only, and prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility.] Howe et al, 2014, PNAS Managed agriculture soils exhibit less diversity, likely from their history of cultivation.
  • 39. Frustrating, but helpful  “Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)  Evaluation of sequencing as a tool  Broad characterization  “Right” kind of data  How much should I sequence?  Data characteristics  Breadth vs. depth of sampling  Computational tool development  Tr
  • 40. Lesson #2: Connecting the dots from data to information If 80% is unknown…what can one do?
  • 41. The idea of co-occurrence
  • 42. Co-occurrence to detect novelty Williams et al., 2014, Frontiers of Micro
  • 43. Causation vs correlation Wikipedia.com
  • 44.
  • 45.
  • 46. Success story in Human Microbiome
  • 47. #3 Is more data better? Bottlenecks for the emerging microbiologists
  • 48. Technical challenges – many solutions  Access to data and its value  Access to resources  Data volume and velocity “clog”  Data is very heterogeneous
  • 49. Data intensive microbiology Software Developers Computer Scientists Clinicians PIs Data generators Microbiologists Data Analyzers Statisticians Bioinformaticians http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
  • 50.
  • 51.
  • 52. Social obstacles – the main challenge A shift in costs does not mean a shift in expectations http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/ Dear PI, It will take longer than the time it took you to do your experiment to analyze the data. Please do not write me for results within 24 hours of your sequences becoming available. - Adina
  • 53. Culture of sharing Metagenomic Datasets http://www.heathershumaker.com/
  • 54. Training / Incentives Emails between collaborators don’t contain as much “science” as I’d like:
  • 55. All analysis: accessible, reproducible, and automated
  • 56. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY Adina Howe Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January)
  • 57. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY Adina Howe Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January)
  • 58. Acknowledgements  C. Titus Brown (MSU)  James Tiedje (MSU)  Daina Ringus (UC)  Folker Meyer (ANL)  Eugene Chang (UC)  NSF Biology Postdoc Fellowship  DOE Great Lakes Bioenergy Research Center
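
The digital normalization idea on slides 26-34 (abundant organisms are sampled far more often than assembly needs, so the "overkill" reads can be set aside) can be sketched roughly as below. This is a toy illustration, not the implementation from the cited papers: the function names, the k-mer size `k`, and the coverage `cutoff` are all assumptions chosen for the example, and an exact in-memory counter stands in for whatever memory-efficient counting structure a real tool would use. A read is kept only if the median count of its k-mers, among reads kept so far, is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=4, cutoff=3):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers
    among the reads that have already been kept (hypothetical toy
    version; parameter values are illustrative only).
    """
    counts = Counter()  # k-mer counts over kept reads only
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # otherwise the read is redundant for assembly and is set aside
    return kept


if __name__ == "__main__":
    abundant = "ACGTTGCAAG"   # sampled 10 times ("Sample 10x")
    rare = "TTGGCCAATC"       # sampled once ("Sample 1x")
    kept = digital_normalize([abundant] * 10 + [rare], k=4, cutoff=3)
    print(len(kept), "of 11 reads kept")
```

In this toy run the abundant read is retained only until its k-mers reach the cutoff, while the rare read is always retained, which is the behavior the "Overkill" slide describes. Because reads are judged in a single pass against what has already been kept, the approach does not require comparing every read with every other read.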

Editor's Notes

  1. Thank Beckett. Journey with big data.
  2. The questions we have in understanding microbes have not changed much…
  3. Historically, we have been asking these questions in model organisms. The challenge of model organisms…comparing them to what we know is in the environment…
  4. First automated DNA sequencing machines in the late 80s. New way of asking questions.
  5. Highlighted in recent news
  6. Opportunities and changes in the systems we study. So then the question is not only who is there and what are they doing, but what are they doing together, and how?
  7. The growth – point out NGS impact. Accompanied by challenges of computation…even to store data on.
  8. Data during my career really reflects this growth. During my postdoc, in the first year, I went from 50 million reads to about 40x that within literally 9 months. Data increased 25 million times…. Notice the gap from 2010 – 2014: figuring stuff out.
  9. The goal is to understand the communities in the soil. The challenge is that the community in the soil is too large to sample. Using targeted approaches, you’ll see many microbial soil and environmental studies report data on community membership and structure. These investigations target the 16S ribosomal RNA gene, which is conserved in bacteria and archaea. Because this gene is conserved, sampling it allows a comparison of how similar these biomarkers are within a community. Basically, you take each sampled gene sequence and align it against the other sequences you’ve sampled. From that you can identify a community structure profile that you can then compare between samples. You can also compare sampled sequences to previously observed sequences and identify which microorganisms, and how much of each, are in your soils.
  10. The goal is to understand the communities in the soil. The challenge is that the community in the soil is too large to sample. Using targeted approaches, you’ll see many microbial soil and environmental studies report data on community membership and structure. These investigations target the 16S ribosomal RNA gene, which is conserved in bacteria and archaea. Because this gene is conserved, sampling it allows a comparison of how similar these biomarkers are within a community. Basically, you take each sampled gene sequence and align it against the other sequences you’ve sampled. From that you can identify a community structure profile that you can then compare between samples. You can also compare sampled sequences to previously observed sequences and identify which microorganisms, and how much of each, are in your soils.
  11. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amount of biomass annually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could glean now with the technologies.
  12. Most of us now recognize that microbial communities generally exhibit a high level of diversity, much higher than previously assumed from what was revealed by classical microscopy and basic culturing techniques. In soil, even in one gram of soil, there is estimated to be more microbial species than there are stars in the galaxy. We have far to go for any comprehensive characterization of any single soil community. A key question then is: why is soil diversity so high?
  13. One reason may be that the soil structure provides unique niches that provide a high diversity of food resources. Its varied structure provides stable, protective, and even ancient environments for microorganisms.
  14. Soil investigations are further complicated by the primarily dormant state of the large majority of the soil microbial population. The turnover rate of soil microbes is predicted to be over 30 fold and even up to 300 fold slower than that of microbes in the oceans. And these microbes live in relatively unpredictable patterns of perturbations – for example rainfall or leaf litter introduction. They also undergo defined temporal perturbations – diurnal energy input.
  15. This complexity in the soil has formed a dynamic microbial ecosystem which interacts with nutrients, plants, and the soil structure itself at multiple scales. I would argue that we as a field are still trying to find tractable methods of accessing these interactions and understanding the drivers of “healthy” or “productive” soils.
  16. There are several grand challenges that our society is currently facing which I think are of paramount importance. These are predicting and managing the impacts of climate change, finding sustainable sources of liquid fuels, and understanding the emerging pandemics facing human health in recent years. From carbon emissions from land use (which is orders of magnitude more than that of car emissions), degrading cellulosic biomass, to pathogens in our bodies, microbes are involved in complex communities that drive the health and productivity of either our natural resources or our own bodies. And it’s building up the expertise to ask
  17. For example, microbes in the soil help cycle important nutrients for plants to grow while also impacting global flows of important elements such as carbon and nitrogen. In fact, when you estimate the production of CO2 in soils and compare it to automotive emissions, you’d find that anthropogenic sources of CO2 make up only 10% of that of the soils – which has a lot to do with the underlying microbes. As a consequence, you could capture roughly 10% of all carbon emitted by cars just by employing sustainable agriculture practices. Understanding these processes in the soil can help us learn how to predict the impacts of our land management strategies on climate change, and also help us understand how we can best manage our limited soil resources.
  18. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amount of biomass annually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could glean now with the technologies.
  19. With growing volumes of data, the most obvious way to tractably access this data is to “smartly” reduce this data.
  20. One genomic way to reduce data is a process known as assembly. Assembly has been around since the sequencing of single organisms.
  21. Metagenomics…a problem of scale
  22. Assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments. In this example, we are coming up with a solution of one sentence using 8 fragments. In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments. And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
  23. Even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them, they would essentially crash any available assembly program that existed. So I had to come up with a way to deal with all of this data or essentially, there were a handful of PIs that had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets.
  24. I’m going to tell you now a little bit about how we were able to do this, and there are actually two different strategies we had to combine.
  25. First, start thinking about what the natural characteristics of environmental communities are. Diverse: there are multiple genomes, and even potentially millions of species, in a sample.
  26. Variable abundance: in nature, some are highly abundant, some are not.
  27. This diversity and distribution of abundances means that we are unevenly sampling strains in the environment. If we want to sample the rarest species….we need
  28. The strategy we came up with was to ask: can we find the minimal dataset that you need for assembly, discarding the reads from this overkill section?
  29. From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
  30. As we sample more, we will have some sequences which will have errors in them.
  31. And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point, where we’ll have enough sequences where we can make a good guess at what the original sequence may have looked like.
  32. For assembly, you need a minimum amount of information. So anything beyond this 6 is excessive or redundant information.
  33. So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
  34. We end up with the minimal dataset needed for an assembly (here in pink) and a redundant set of information which we have set aside. In setting aside these reads (here in red), we discard errors and improve assembly.
  35. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amount of biomass annually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could glean now with the technologies.
  36. More than half, 50-80% sequences unknown in soil, gut microbiome
  37. Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 y and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
  38. Reducing data is only part of the problem when using sequences to inform microbiological processes. We also need to link data (largely unknown) to biological processes. One way is to start linking unknowns to things we know about.
  39. We can look at characteristics of something known that we may be able to describe to some degree, and then find unknown entities that might exhibit similar patterns. This gives us a clue as to what the unknown might be. Example: three fridges. The set of objects in each might describe the community that the fridge is associated with.
  40. We can then look for patterns in unknown parts of our dataset that exhibit similar patterns as these known entities. For example, entities that share similar abundance profiles. These unknowns can then be characterized by association. Frat boy, graduate student, and healthy chef communities might all have different characteristics.
  41. The reliability of this analysis warrants caution and further validation. I’ve found that this analysis almost always leads to more questions than answers. I am always turning back to model systems to help clarify hypotheses generated from this analysis.
  42. Finally, as the last part of my perspective on big data, I wanted to talk about the theme of this workshop: is more data better? For me, the answer is always yes. I always want more data. This is largely attributable to the fact that I have a lot of experience working with this data and the resources to play with it. But that is not always the case. So what are the challenges of big data to the microbiologist or biologist?
  43. More challenging is the emerging role a microbiologist now has to fill and the changing teams we are now involved in. I’m asked to play all these roles in various projects I’m involved in. And definitely, I’m asked to communicate to people in all these roles and they are asked to communicate with me. This communication can be challenging.
  44. For example, if you asked us all to describe a tire swing building project, you’d undoubtedly get many varied descriptions.
  45. Communication and social obstacles are the most difficult.
  46. The need to share and participate in interdisciplinary research comes along with a culture of needing to demonstrate individual impact.
  47. Total reproducibility of all figures – one button Change the dataset, redo entire analysis on your own data
  48. Journey begins with TG 2008
  49. 6 years later…
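
The co-occurrence idea in notes 39-41 (characterize unknown sequences by finding known ones with similar abundance profiles across samples) might be sketched as follows. This is only an illustration under stated assumptions: the feature names, abundance values, and the 0.9 threshold are invented for the example, and Pearson correlation is one possible similarity measure, not necessarily the one used in the cited work. As the notes stress, a high correlation is a hypothesis to validate, not evidence of shared function.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / (sd_x * sd_y)


def co_occurring(known, unknown, threshold=0.9):
    """List (unknown, known, r) pairs whose profiles correlate strongly.

    `known` and `unknown` map a feature name to its abundance in each
    sample (same sample order in both). The threshold is an assumption.
    """
    links = []
    for u_name, u_profile in unknown.items():
        for k_name, k_profile in known.items():
            r = pearson(u_profile, k_profile)
            if r >= threshold:
                links.append((u_name, k_name, round(r, 3)))
    return links


if __name__ == "__main__":
    # Hypothetical abundances across four samples.
    known = {"cellulose_degradation": [10, 20, 30, 40]}
    unknown = {"unknown_A": [1, 2, 3, 4], "unknown_B": [4, 3, 2, 1]}
    for u_name, k_name, r in co_occurring(known, unknown):
        print(f"{u_name} co-occurs with {k_name} (r = {r})")
```

Here only `unknown_A` tracks the known function across samples, so it alone is flagged as a candidate for follow-up in a model system.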