Data-intensive approaches to
investigating non-model
organisms
C. Titus Brown
ctb@msu.edu
Assistant Professor
Microbiology and Molecular Genetics; Computer Science and Engineering;
BEACON; Quantitative Biology Initiative
Outline
• My research!
• Opportunities for computational science training
• More unsolicited advice
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
Collaborators
• Jim Tiedje, MSU
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li
• Hans Cheng
Funding
USDA NIFA; NSF IOS;
BEACON; NIH.
My interests
I work primarily on organisms of agricultural, evolutionary, or
ecological importance, which tend to have poor reference
genomes and transcriptomes. Focus on:
• Improving assembly sensitivity to better recover
genomic/transcriptomic sequence, often from “weird”
samples.
• Scaling sequence assembly approaches so that huge
assemblies are possible and big assemblies are
straightforward.
• “Better science through superior software”
There is quite a bit of life left to sequence & assemble.
http://pacelab.colorado.edu/
“Weird” biological samples:
• Single genome
• Transcriptome
• High polymorphism data
• Whole genome amplified
• Metagenome (mixed
microbial community)
• Hard to sequence DNA
(e.g. GC/AT bias)
• Differential expression!
• Multiple alleles
• Often extreme
amplification bias
• Differential abundance
within community.
Single genome assembly is already
challenging --
Once you start sequencing
metagenomes…
DNA sequencing
• Observation of actual DNA sequence
• Counting of molecules
Image: Werner Van Belle
Fast, cheap, and easy to
generate.
Image: Werner Van Belle
New problem: data analysis &
integration!
• Once you can generate virtually any data set you want…
• …the next problem becomes finding your answer in the data
set!
• Think of it as a gigantic NSA treasure hunt: you know there are
terrorists out there, but to find them you have to hunt through 1 bn
phone calls a day…
“Heuristics”
• What do computers do when the answer is either really, really
hard to compute exactly, or actually impossible?
• They approximate! Or guess!
• The term “heuristic” refers to a guess, or shortcut
procedure, that usually returns a pretty good answer.
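A toy illustration of the point above (not from the talk): the sketch below compares an exact answer with a greedy shortcut on a small change-making problem. The coin set and the function names are made up for illustration; the shortcut is cheaper and usually right, but not always.

```python
from functools import lru_cache

COINS = (1, 3, 4)  # hypothetical coin set, chosen so the shortcut is sometimes wrong


def greedy_coins(amount: int) -> int:
    """Heuristic: always take the largest coin that fits. Fast, not always optimal."""
    count = 0
    for coin in sorted(COINS, reverse=True):
        count += amount // coin
        amount %= coin
    return count


@lru_cache(maxsize=None)
def exact_coins(amount: int) -> int:
    """Exact: try every coin at every step (memoized), which costs more work."""
    if amount == 0:
        return 0
    return 1 + min(exact_coins(amount - c) for c in COINS if c <= amount)


# For 7 the shortcut matches the exact answer (4 + 3); for 6 it does not
# (greedy picks 4 + 1 + 1, while the exact answer is 3 + 3).
print(greedy_coins(7), exact_coins(7))
print(greedy_coins(6), exact_coins(6))
```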
Often explicit or implicit tradeoffs between
compute “amount” and quality of result
http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
My actual research focus
What we do is think about ways to get computers to play chess
better, by:
• Identifying better ways to guess;
• Speeding up the guessing process;
• Improving people’s ability to use the chess playing computer
Now, replace “play chess” with
“analyze biological data”...
My actual research focus…
We build tools that help experimental biologists work efficiently
and correctly with large amounts of data, to help answer their
scientific questions.
This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and
computational scientists.
• Lack of training (biology and medical curricula devoid of math
and computing).
Not-so-secret sauce: “digital normalization”
• One primary step of one type of data analysis becomes 20-200x
faster, 20-150x “cheaper”.
Approach: Digital normalization
(a computational version of library normalization)
Suppose you have a
dilution factor of A (10) to
B (1). To get 10x coverage of B you
need to get 100x coverage of A!
Overkill!!
This 100x will consume disk
space and, because of
errors, memory.
We can discard it for you…
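The arithmetic behind the 10:1 example can be sketched as follows. This is only a worked calculation; the function name and the dictionary layout are illustrative assumptions, not part of any tool mentioned in the talk.

```python
from __future__ import annotations


def coverage_needed(abundances: dict[str, float], rare: str, target: float) -> dict[str, float]:
    """Coverage every member ends up with once the rare one reaches `target`."""
    scale = target / abundances[rare]
    return {name: abundance * scale for name, abundance in abundances.items()}


# A is ten times as abundant as B, and we want 10x coverage of B.
cov = coverage_needed({"A": 10, "B": 1}, rare="B", target=10)

# Coverage of A beyond the 10x that was actually wanted.
excess_a = cov["A"] - 10
print(cov, excess_a)
```

With these inputs A ends up at 100x, so 90x of its coverage is redundant for a 10x target.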
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
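The properties listed above can be sketched in a few lines. This is a minimal sketch, not the lab's implementation: it assumes a read's coverage is estimated by the median count of its k-mers, it uses an exact in-memory `Counter` where a real tool would need a memory-efficient counting structure, and the `k` and `cutoff` defaults are placeholder values.

```python
from __future__ import annotations

from collections import Counter
from statistics import median
from typing import Iterable, Iterator


def kmers(read: str, k: int) -> list[str]:
    """All overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads: Iterable[str], k: int = 20, cutoff: int = 20) -> Iterator[str]:
    """Single pass: keep a read only if its estimated coverage is below `cutoff`."""
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        # Median k-mer count so far serves as the coverage estimate (assumption).
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)  # only kept reads add k-mers to the table
            yield read


# Two made-up reads: one seen 30 times, one seen once.
abundant = "ACGTTGCAAGGCTTAC"
rare = "TTTTTCCCCCGGGGGA"
kept = list(digital_normalize([abundant] * 30 + [rare], k=5, cutoff=3))
print(len(kept))
```

In this toy run the abundant read is kept only until its coverage reaches the cutoff, while the rare read is kept. Because discarded reads never update the table, the k-mers from their sequencing errors are never stored.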
http://en.wikipedia.org/wiki/JPEG
Lossy compression
[Figure: raw data (~10-100 GB) goes through analysis to yield ~1 GB of "information"; several such "information" outputs feed into database & integration.]
Restated:
Can we use lossy compression approaches to make
downstream analysis faster and better? (Yes.)
~2 GB – 2 TB of single-chassis RAM
Soil metagenome assembly
• Observation: 99% of microbes cannot easily be cultured in the
lab. (“The great plate count anomaly”)
• Many reasons why you can’t or don’t want to culture:
• Syntrophic relationships
• Niche-specificity or unknown physiology
• Dormant microbes
• Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common
ways to investigate microbial communities.
Investigating soil microbial ecology
• What ecosystem level functions are present, and how do
microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun
sequencing:
• What kind of strain-level heterogeneity is present in the
population?
• What does the phage and viral population look like?
• What species are where?
SAMPLING LOCATIONS
A “Grand Challenge” dataset
(DOE/JGI)
[Bar chart: basepairs of sequencing (Gbp), GAII and HiSeq, for eight samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.]
For comparison: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.
Total: 1,846 Gbp soil metagenome
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2 x ~300 Gbp soil metagenomes)

Total assembly   Total contigs (> 300 bp)   % reads assembled   Predicted protein coding
2.5 bill         4.5 mill                   19%                 5.3 mill
3.5 bill         5.9 mill                   22%                 6.8 mill
Adina Howe
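One way to read the "~1200 bacterial genomes" figure is as the combined size of the two Iowa assemblies divided by a typical bacterial genome size. The sketch below works that through; the ~5 Mbp genome size is an assumption, not a number from the slides.

```python
# Total assembly sizes reported for the two Iowa samples (bp).
total_assembly_bp = 2.5e9 + 3.5e9

# ASSUMPTION: an "average" bacterial genome of ~5 Mbp (not stated in the talk).
bacterial_genome_bp = 5e6

# From the slide: human genome ~3 billion bp.
human_genome_bp = 3e9

bacterial_equivalents = total_assembly_bp / bacterial_genome_bp
human_equivalents = total_assembly_bp / human_genome_bp
print(bacterial_equivalents, human_equivalents)
```

Under that assumption the two assemblies together come to about 1200 bacterial genomes, or roughly two human genomes' worth of sequence.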
Strain variation?
[Plot: top two allele frequencies vs. position within contig]
Of the 5000 most abundant contigs, only 1 has a
polymorphism rate > 5%.
Can measure by read mapping.
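A rough sketch of how top-two allele frequencies might be computed once reads are mapped back to a contig. The input format (one string of observed bases per contig position), the function names, and the `min_minor` threshold are all assumptions for illustration; the talk does not say how "polymorphism rate" was defined.

```python
from __future__ import annotations

from collections import Counter


def top_two_allele_freqs(column: str) -> tuple[float, float]:
    """Frequencies of the two most common bases observed at one position."""
    counts = Counter(column.upper())
    ranked = [n for _, n in counts.most_common(2)] + [0, 0]  # pad if only one base seen
    depth = len(column)
    return ranked[0] / depth, ranked[1] / depth


def polymorphism_rate(columns: list[str], min_minor: float = 0.1) -> float:
    """Fraction of positions whose second allele reaches `min_minor` (assumed definition)."""
    polymorphic = sum(
        1 for col in columns if col and top_two_allele_freqs(col)[1] >= min_minor
    )
    return polymorphic / len(columns)


# Hypothetical pileup for a 4 bp contig at 10x depth.
columns = ["AAAAAAAAAA", "AAAAAAAGGG", "CCCCCCCCCT", "TTTTTTTTTT"]
print(top_two_allele_freqs(columns[1]), polymorphism_rate(columns, min_minor=0.2))
```

In this made-up pileup only the second position has a minor allele above the 0.2 threshold; the single T in the third column is the kind of low-frequency signal a threshold is meant to screen out as likely sequencing error.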
Tentative observations from our
soil samples:
• We need 100x as much data…
• Much of our sample may consist of phage.
• Phylogeny varies more than functional predictions.
• We see little to no strain variation within our samples
• Not bulk soil --
• Very small, localized, and low coverage samples
• We may be able to do selective, really deep sequencing and
then infer the rest from 16S.
• Implications for soil aggregate assembly?
I also work on…
• Genome assembly & analysis
• Transcriptome assembly and analysis
• Interpretation of annoying large data sets
What are the tissue level changes in gene expression that support regeneration?
Transcriptome analysis of a regenerating vertebrate after SCI
brain
spinal cord
RNA-Seq to determine
differential expression
profile after injury
Sampling >weekly
-/+ Dex
Ona Bloom
Training opportunities
• PLB/MMG 810 (Shiu; ??)
• CSE 801/Intro BEACON course (Brown; FS ‘13)
“Intro to Computational Science for Evolutionary Biologists”
• CSE 801 bootcamp (late Sep)
• Software Carpentry bootcamp(s) (late Sep)
• Workshops in Applied Bioinformatics (Buell; ‘14?)
• Next-Gen Sequence Analysis Workshop (Brown; summer ‘14)
+ a variety of genomics courses that I can’t keep track of!
Becky Mansel will have these slides.
Unsolicited advice
Consider both faculty and non-faculty careers.
• It’s a bad time to be looking for faculty positions, and it’s a bad
time to be looking for funding; maybe this will improve in 10
years, maybe not.
• A PhD qualifies you for many, many more things than we will
(or can) tell you about!
• Specific advice:
• Network with industry folk; think beyond your advisor’s career.
• Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html

2013 bms-retreat-talk

  • 1. Data-intensive approaches to investigating non-model organisms C. Titus Brown ctb@msu.edu Assistant Professor Microbiology and Molecular Genetics; Computer Science and Engineering; BEACON; Quantitative Biology Initiative
  • 2. Outline • My research! • Opportunities for computational science training • More unsolicited advice
  • 3. Acknowledgements Lab members involved Collaborators • Adina Howe (w/Tiedje) • Jason Pell • Arend Hintze • Rosangela Canino-Koning • Qingpeng Zhang • Elijah Lowe • Likit Preeyanon • Jiarong Guo • Tim Brom • Kanchan Pavangadkar • Eric McDonald • Jim Tiedje, MSU • Erich Schwarz, Caltech / Cornell • Paul Sternberg, Caltech • Robin Gasser, U. Melbourne • Weiming Li • Hans Cheng Funding USDA NIFA; NSF IOS; BEACON; NIH.
  • 4. My interests I work primarily on organisms of agricultural, evolutionary, or ecological importance, which tend to have poor reference genomes and transcriptomes. Focus on: • Improving assembly sensitivity to better recover genomic/transcriptomic sequence, often from “weird” samples. • Scaling sequence assembly approaches so that huge assemblies are possible and big assemblies are straightforward. • “Better science through superior software”
  • 5. There is quite a bit of life left to sequence & assemble. http://pacelab.colorado.edu/
  • 6. “Weird” biological samples: • Single genome • Transcriptome • High polymorphism data • Whole genome amplified • Metagenome (mixed microbial community) • Hard to sequence DNA (e.g. GC/AT bias) • Differential expression! • Multiple alleles • Often extreme amplification bias • Differential abundance within community.
  • 7. Single genome assembly is already challenging --
  • 8. Once you start sequencing metagenomes…
  • 9. DNA sequencing • Observation of actual DNA sequence • Counting of molecules Image: Werner Van Belle
  • 10. Fast, cheap, and easy to generate. Image: Werner Van Belle
  • 11. New problem: data analysis & integration! • Once you can generate virtually any data set you want… • …the next problem becomes finding your answer in the data set! • Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you to hunt through 1 bn phone calls a day…
  • 12. “Heuristics” • What do computers do when the answer is either really, really hard to compute exactly, or actually impossible? • They approximate! Or guess! • The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.
  • 13. Oftenexplicitor implicittradeoffs between compute“amount”and quality of result http://www.infernodevelopment.com/how- computer-chess-engines-think-minimax-tree
  • 14. My actual research focus What we do is think about ways to get computers to play chess better, by: • Identifying better ways to guess; • Speeding up the guessing process; • Improving people’s ability to use the chess playing computer Now, replace “play chess” with “analyze biological data”...
  • 15. My actual research focus… We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions. This touches on many problems, including: • Computational and scientific correctness. • Computational efficiency. • Cultural divides between experimental biologists and computational scientists. • Lack of training (biology and medical curricula devoid of math and computing).
  • 16. Not-so-secret sauce: “digital normalization” • One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.
  • 17. Approach: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 24. Digital normalization approach A digital analog to cDNA library normalization, diginorm: • Is single pass: looks at each read only once; • Does not “collect” the majority of errors; • Keeps all low-coverage reads; • Smooths out coverage of regions.
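
The single-pass idea on the digital normalization slides can be sketched as follows. This is a minimal illustration, not the lab's implementation: it assumes that a read's coverage is estimated as the median count of its k-mers among the reads kept so far, and it uses an exact in-memory `Counter` for clarity. The function names and the `k` and `cutoff` defaults are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass filter that keeps a read only while its region is below `cutoff` coverage.

    Assumption of this sketch: coverage is estimated as the median count of the
    read's k-mers among the reads kept so far, stored in an exact Counter.
    """
    counts = Counter()
    kept = []
    for read in reads:  # each read is examined exactly once
        read_kmers = kmers(read, k)
        # Reads shorter than k have no k-mers; treat them as zero coverage and keep them.
        coverage = median(counts[km] for km in read_kmers) if read_kmers else 0
        if coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to the counts
    return kept
```

With 100 copies of one sequence and 10 copies of another (the 10-to-1 dilution from the slide) and `cutoff=10`, this sketch keeps 10 copies of each: the high-coverage excess is discarded and every low-coverage read is retained.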
  • 30. [Diagram: Raw data (~10-100 GB) → Analysis → "Information" (~1 GB) → Database & integration] Restated: Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.) ~2 GB – 2 TB of single-chassis RAM
  • 31. Soil metagenome assembly • Observation: 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”) • Many reasons why you can’t or don’t want to culture: • Syntrophic relationships • Niche-specificity or unknown physiology • Dormant microbes • Abundance within communities Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
  • 32. Investigating soil microbial ecology • What ecosystem level functions are present, and how do microbes do them? • How does agricultural soil differ from native soil? • How does soil respond to climate perturbation? • Questions that are not easy to answer without shotgun sequencing: • What kind of strain-level heterogeneity is present in the population? • What does the phage and viral population look like? • What species are where?
  • 34. A “Grand Challenge” dataset (DOE/JGI) [Bar chart: base pairs of sequencing (Gbp), 0-600, by platform (GAII, HiSeq) for eight soil samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass.] Reference points: Rumen (Hess et al., 2011), 268 Gbp; Rumen K-mer Filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp. Total: 1,846 Gbp soil metagenome
  • 35. Putting it in perspective: Total equivalent of ~1200 bacterial genomes Human genome ~3 billion bp Assembly results for Iowa corn and prairie (2 x ~300 Gbp soil metagenomes) Columns: Total Assembly; Total Contigs (> 300 bp); % Reads Assembled; Predicted protein coding. Row 1: 2.5 bill; 4.5 mill; 19%; 5.3 mill. Row 2: 3.5 bill; 5.9 mill; 22%; 6.8 mill. Adina Howe
  • 36. Strain variation? Top two allele frequencies [Plot: top two allele frequencies vs. position within contig] Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5% Can measure by read mapping.
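
As a rough check on the "~1200 bacterial genomes" comparison, the arithmetic below assumes (this is not stated on the slide) that the figure refers to the two assemblies combined. Under that assumption it shows the implied average genome size and the ratio to the human genome. The variable names are illustrative only.

```python
# Total assembly sizes from the slide, in base pairs (2.5 and 3.5 billion).
assembly_sizes_bp = [2.5e9, 3.5e9]
human_genome_bp = 3e9            # "~3 billion bp" from the slide
bacterial_genome_equivalents = 1200

# Assumption: the ~1200 figure describes both assemblies taken together.
total_assembled_bp = sum(assembly_sizes_bp)
implied_genome_size_bp = total_assembled_bp / bacterial_genome_equivalents
human_genome_equivalents = total_assembled_bp / human_genome_bp

print(f"Combined assembly: {total_assembled_bp / 1e9:.1f} billion bp")
print(f"Implied average bacterial genome: {implied_genome_size_bp / 1e6:.1f} Mbp")
print(f"Human genome equivalents: {human_genome_equivalents:.1f}")
```

If the assumption holds, the comparison implies an average bacterial genome of about 5 Mbp and a combined assembly about twice the size of the human genome.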
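
Measuring strain variation by read mapping can be sketched as below. The slide does not define "polymorphism rate", so this example assumes it means the fraction of contig positions whose second most common base reaches a minimum frequency. The function names, the plain-string pileup input, and the threshold are all hypothetical.

```python
from collections import Counter


def top_two_allele_frequencies(pileup_column):
    """Return (major, minor) frequencies of the two most common bases at one position.

    `pileup_column` is a string of base calls from reads mapped over that
    position, e.g. "AAAAAAAAGA".
    """
    counts = Counter(pileup_column.upper())
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0)  # no reads cover this position
    freqs = [n / total for _, n in counts.most_common(2)]
    freqs += [0.0] * (2 - len(freqs))  # pad when only one allele is observed
    return (freqs[0], freqs[1])


def polymorphism_rate(columns, min_minor_freq=0.1):
    """Fraction of positions whose second allele reaches `min_minor_freq`.

    This definition is an assumption made for the sketch.
    """
    if not columns:
        return 0.0
    polymorphic = sum(
        1 for col in columns
        if top_two_allele_frequencies(col)[1] >= min_minor_freq
    )
    return polymorphic / len(columns)
```

In practice the per-position base calls would come from a pileup of reads mapped back to the assembled contigs. A real analysis would also need to separate sequencing errors from true variants, which this sketch does not attempt.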
  • 37. Tentative observations from our soil samples: • We need 100x as much data… • Much of our sample may consist of phage. • Phylogeny varies more than functional predictions. • We see little to no strain variation within our samples • Not bulk soil -- • Very small, localized, and low coverage samples • We may be able to do selective, really deep sequencing and then infer the rest from 16S. • Implications for soil aggregate assembly?
  • 38. I also work on… • Genome assembly & analysis • Transcriptome assembly and analysis • Interpretation of annoying large data sets
  • 40. Training opportunities • PLB/MMG 810 (Shiu; ??) • CSE 801/Intro BEACON course (Brown; FS ‘13) “Intro to Computational Science for Evolutionary Biologists” • CSE 801 bootcamp (late Sep) • Software Carpentry bootcamp(s) (late Sep) • Workshops in Applied Bioinformatics (Buell; ‘14?) • Next-Gen Sequence Analysis Workshop (Brown; summer ‘14) + a variety of genomics courses that I can’t keep track of! Becky Mansel will have these slides.
  • 41. Unsolicited advice Consider both faculty and non-faculty careers. • It’s a bad time to be looking for faculty positions, and it’s a bad time to be looking for funding; maybe this will improve in 10 years, maybe not. • A PhD qualifies you for many, many more things than we will (or can) tell you about! • Specific advice: • Network with industry folk; think beyond your advisor’s career. • Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html

Editor's Notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  2. Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.