SlideShare a Scribd company logo
1 of 50
STREAMING VARIANT 
CALLING? 
C. Titus Brown 
Michigan State University 
Sep 2014, NCI EDRN / Bethesda, MD
Mapping: locate reads in reference 
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping 
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
Problem 1: 
Analysis is done after sequencing. 
Sequencing Analysis
Problem 2: 
Much of your data is unnecessary. 
Shotgun data is randomly sampled; 
So, you need high coverage for high sensitivity.
Problem 3: 
Current variant calling approaches are multipass 
Data 
Mapping 
Sorting 
Calling Answer
Problem 4: 
Allelic mapping bias favors reference genome. 
Number of nbh differentiating polymorphisms. 
Stevenson et al., 2013 (BMC Genomics)
Problem 5: 
Current approaches are often insensitive to indels 
Iqbal et al., Nat Gen 2012
Why are we concerned at all!? 
Looking forward 5 years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp each cell… 
• …or 120 Tbp of data. 
• HiSeq X10 can do the sequencing in ~3 weeks. 
• The variant calling will require 2,000 CPU weeks… 
• …so, given ~2,000 computers, can do this all in one 
month.
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
• Two issues: 
•Volume of data & compute 
infrastructure; 
• Latency for clinical applications.
Can we improve this situation? 
• Tie directly into machine as it generates sequence 
(Illumina, PacBio, and Nanopore can all do streaming, in theory) 
• Analyze data as it comes off; for some (many?) 
applications, can stop run early if signal detected. 
• Avoid using a reference genome for primary variant 
calling. 
• Easier indel detection, less allelic mapping bias 
• Can use reference for interpretation. 
Does such a magical approach exist!?
~Digression: Digital normalization 
(a computational version of library normalization) 
Suppose you have 
a dilution factor of 
A (10) to B(1). To 
get 10x of B you 
need to get 100x 
of A! Overkill!! 
The high-coverage 
reads in sample A 
are unnecessary 
for assembly, and, 
in fact, distract.
Digital normalization
Digital normalization
Digital normalization
Digital normalization
Digital normalization is streaming
Digital normalization
Some key points -- 
• Digital normalization is streaming. 
• Digital normalizing is computationally efficient (lower 
memory than other approaches; parallelizable/multicore; 
single-pass) 
• Currently, primarily used for prefiltering for assembly, but 
relies on underlying abstraction (De Bruijn graph) that is 
also used in variant calling.
Assembly now scales with richness, not diversity. 
• 10-100 fold decrease in memory requirements 
• 10-100 fold speed up in analysis
Diginorm is widely useful: 
1. Assembly of the H. contortus parasitic nematode 
genome, a “high polymorphism/variable coverage” 
problem. 
(Schwarz et al., 2013; pmid 23985341) 
2. Reference-free assembly of the lamprey (P. marinus) 
transcriptome, a “big assembly” problem. (in prep) 
3. Osedax symbiont metagenome, a “contaminated 
metagenome” problem (Goffredi et al, 2013; pmid 
24225886)
Anecdata: diginorm is used in Illumina 
long-read sequencing (?)
Diginorm is “lossy compression” 
• Nearly perfect from an information theoretic perspective: 
• Discards 95% more of data for genomes. 
• Loses < 00.02% of information.
Digital normalization => graph alignment 
What we are actually doing this stage 
is building a graph of all the reads, 
and aligning new reads to that graph.
Error correction via graph alignment 
Jason Pell and Jordan Fish
Error correction on simulated E. coli data 
TP FP TN FN 
ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9% 
1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2% 
1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8% 
(corrected) (mistakes) (OK) (missed) 
1% error rate, 100x coverage. 
Jordan Fish and Jason Pell
Error correction  variant calling 
Single pass, reference free, tunable, streaming 
online variant calling.
Coverage is adjusted to retain signal
Graph alignment can detect read saturation
Streaming with reads… 
Sequence... 
Graph 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
.... 
Variants
Analysis is done after sequencing. 
Sequencing Analysis
Streaming with bases 
k bases... 
Graph 
k+1 
k bases... k+1 
k+2 
k bases... k+1 
k bases... k+1 
k bases... k+1 
... 
k bases... k+1 
Variants
Integrate sequencing and analysis 
Sequencing 
Analysis 
Are we done yet?
Streaming approach also supports more 
compute-intensive interludes – 
remapping, etc. 
Rimmer et al., 2014
Streaming algorithms can be very efficient 
Data 
1-pass 
Answer 
See also eXpress, Roberts et al., 2013.
So: reference-free variant calling 
• Streaming & online algorithm; single pass. 
• For real-time diagnostics, can be applied as bases are emitted from 
sequencer. 
• Reference free: independent of reference bias. 
• Coverage of variants is adaptively adjusted to retain all 
signal. 
• Parameters are easily tuned, although theory needs to be 
developed. 
• High sensitivity (e.g. C=50 in 100x coverage) => poor compression 
• Low sensitivity (C=20) => good compression. 
• Can “subtract” reference => novel structural variants. 
• (See: Cortex, Zam Iqbal.)
Two other features -- 
• More single-computer scalable approach than current: low 
disk access, high parallelizability. 
• Openness – our software is free to use, reuse, remix; no 
intellectual property restrictions. (Hence “We hear Illumina 
is using it…”)
Prospectus for streaming variant detection 
• Underlying concept is sound and offers many advantages over 
current approaches; 
• We have proofs of concept implemented; 
• We know that underlying approach works well in amplification 
situations, as well; 
• Tuning and math/theory needed! 
• …grad students keep on getting poached by Amazon and 
Google. (This is becoming a serious problem.)
Raw data 
(~10-100 GB) Analysis 
"Information" 
~1 GB 
"Information" 
"Information" 
"Information" 
"Information" 
Database & 
integration 
Compression 
(~2 GB) 
Lossy compression can substantially 
reduce data size while retaining 
information needed for later (re)analysis.
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
Raw data 
(~10-100 GB) Analysis 
"Information" 
~1 GB 
"Information" 
"Information" 
"Information" 
"Information" 
Database & 
integration 
Compression 
(~2 GB) 
Save in cold storage 
Save for reanalysis, 
investigation.
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014)
Data recipes 
Standardized (versioned, open, remixable, cloud) 
pipelines and protocols for sequence data analysis. 
See: khmer-recipes, khmer-protocols. 
Increases buy-in :)
Training! 
Lots of training planned at Davis – 
open workshops. 
ivory.idyll.org/blog/2014-davis-and-training.html 
Increases buy-in x 2!
Acknowledgements 
Lab members involved Collaborators 
• Adina Howe (w/Tiedje) 
• Jason Pell 
• Qingpeng Zhang 
• Tim Brom 
• Jordan Fish 
• Michael Crusoe 
• Jim Tiedje, MSU 
• Billie Swalla, UW 
• Janet Jansson, LBNL 
• Susannah Tringe, JGI 
• Eran Andrechek, MSU 
Funding 
USDA NIFA; NSF IOS; 
NIH NHGRI; NSF 
BEACON.

More Related Content

What's hot

Edge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine LearningEdge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine LearningZiqiang Feng
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Oleg Ovcharenko
 
Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Oleg Ovcharenko
 
Deep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and HypeDeep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and HypeSiby Jose Plathottam
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
Nearest neighbor, defect prediction
Nearest neighbor, defect predictionNearest neighbor, defect prediction
Nearest neighbor, defect predictionCS, NcState
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 

What's hot (8)

Edge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine LearningEdge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine Learning
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
 
Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...
 
Deep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and HypeDeep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and Hype
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
Nearest neighbor, defect prediction
Nearest neighbor, defect predictionNearest neighbor, defect prediction
Nearest neighbor, defect prediction
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 

Viewers also liked

Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
The Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial ManagementThe Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial ManagementKegler Brown Hill + Ritter
 
Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009guest5bbe75
 
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition
 
Enlightenment
EnlightenmentEnlightenment
EnlightenmentGregorio
 
Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009Eyeblaster Spain
 
Analizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonAnalizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonEgdares Futch H.
 
Contiguity Principle
Contiguity PrincipleContiguity Principle
Contiguity Principlejnpletcher
 
Whitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van SourcingWhitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van SourcingElitas Groep BV
 
Prithvi Cg Work Presentation
Prithvi Cg Work PresentationPrithvi Cg Work Presentation
Prithvi Cg Work Presentationprithvionline
 
Osss (Page Revisi)
Osss (Page Revisi)Osss (Page Revisi)
Osss (Page Revisi)@rtNya
 
13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional ResponsibilityKegler Brown Hill + Ritter
 
用寧靜心擁抱世界
用寧靜心擁抱世界用寧靜心擁抱世界
用寧靜心擁抱世界tina59520
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsKegler Brown Hill + Ritter
 

Viewers also liked (20)

Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
The Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial ManagementThe Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial Management
 
Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009
 
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
 
11i Logs
11i Logs11i Logs
11i Logs
 
Enlightenment
EnlightenmentEnlightenment
Enlightenment
 
TPSI by Competitive Analytics
TPSI by Competitive AnalyticsTPSI by Competitive Analytics
TPSI by Competitive Analytics
 
Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009
 
Analizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonAnalizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en Bison
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Contiguity Principle
Contiguity PrincipleContiguity Principle
Contiguity Principle
 
RealTimeSchool
RealTimeSchoolRealTimeSchool
RealTimeSchool
 
Tips And Tricks For Photos
Tips And Tricks For PhotosTips And Tricks For Photos
Tips And Tricks For Photos
 
Whitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van SourcingWhitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van Sourcing
 
Prithvi Cg Work Presentation
Prithvi Cg Work PresentationPrithvi Cg Work Presentation
Prithvi Cg Work Presentation
 
CG borodino
CG borodinoCG borodino
CG borodino
 
Osss (Page Revisi)
Osss (Page Revisi)Osss (Page Revisi)
Osss (Page Revisi)
 
13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility
 
用寧靜心擁抱世界
用寧靜心擁抱世界用寧靜心擁抱世界
用寧靜心擁抱世界
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
 

Similar to 2014 nci-edrn

2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talkc.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinarc.titus.brown
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assemblyc.titus.brown
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibilityc.titus.brown
 
Alternative Computing
Alternative ComputingAlternative Computing
Alternative ComputingShayshab Azad
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizonac.titus.brown
 
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsManaging & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsRaul Chong
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
Improving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsImproving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsChester Chen
 

Similar to 2014 nci-edrn (20)

2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assembly
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Alternative Computing
Alternative ComputingAlternative Computing
Alternative Computing
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsManaging & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
 
Deeplearning in finance
Deeplearning in financeDeeplearning in finance
Deeplearning in finance
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Improving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsImproving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN Applications
 

More from c.titus.brown

More from c.titus.brown (20)

2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 

Recently uploaded

Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 

Recently uploaded (20)

Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptx
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 

2014 nci-edrn

  • 1. STREAMING VARIANT CALLING? C. Titus Brown Michigan State University Sep 2014, NCI EDRN / Bethesda, MD
  • 2. Mapping: locate reads in reference http://en.wikipedia.org/wiki/File:Mapping_Reads.png
  • 3. Variant detection after mapping http://www.kenkraaijeveld.nl/genomics/bioinformatics/
  • 4. Problem 1: Analysis is done after sequencing. Sequencing Analysis
  • 5. Problem 2: Much of your data is unnecessary. Shotgun data is randomly sampled; So, you need high coverage for high sensitivity.
  • 6. Problem 3: Current variant calling approaches are multipass Data Mapping Sorting Calling Answer
  • 7. Problem 4: Allelic mapping bias favors reference genome. Number of nbh differentiating polymorphisms. Stevenson et al., 2013 (BMC Genomics)
  • 8. Problem 5: Current approaches are often insensitive to indels Iqbal et al., Nat Gen 2012
  • 9. Why are we concerned at all!? Looking forward 5 years… Navin et al., 2011
  • 10. Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp each cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month.
  • 11. Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: •Volume of data & compute infrastructure; • Latency for clinical applications.
  • 12. Can we improve this situation? • Tie directly into machine as it generates sequence (Illumina, PacBio, and Nanopore can all do streaming, in theory) • Analyze data as it comes off; for some (many?) applications, can stop run early if signal detected. • Avoid using a reference genome for primary variant calling. • Easier indel detection, less allelic mapping bias • Can use reference for interpretation. Does such a magical approach exist!?
  • 13. ~Digression: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! The high-coverage reads in sample A are unnecessary for assembly, and, in fact, distract.
  • 20. Some key points -- • Digital normalization is streaming. • Digital normalizing is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass) • Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
  • 21. Assembly now scales with richness, not diversity. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis
  • 22. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
  • 23. Anecdata: diginorm is used in Illumina long-read sequencing (?)
  • 24. Diginorm is “lossy compression” • Nearly perfect from an information theoretic perspective: • Discards 95% more of data for genomes. • Loses < 00.02% of information.
  • 25. Digital normalization => graph alignment What we are actually doing this stage is building a graph of all the reads, and aligning new reads to that graph.
  • 26. Error correction via graph alignment Jason Pell and Jordan Fish
  • 27. Error correction on simulated E. coli data TP FP TN FN ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9% 1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2% 1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8% (corrected) (mistakes) (OK) (missed) 1% error rate, 100x coverage. Jordan Fish and Jason Pell
  • 28. Error correction  variant calling Single pass, reference free, tunable, streaming online variant calling.
  • 29. Coverage is adjusted to retain signal
  • 30. Graph alignment can detect read saturation
  • 31. Streaming with reads… Sequence... Graph Sequence... Sequence... Sequence... Sequence... Sequence... Sequence... Sequence... .... Variants
  • 32. Analysis is done after sequencing. Sequencing Analysis
  • 33. Streaming with bases k bases... Graph k+1 k bases... k+1 k+2 k bases... k+1 k bases... k+1 k bases... k+1 ... k bases... k+1 Variants
  • 34. Integrate sequencing and analysis Sequencing Analysis Are we done yet?
  • 35. Streaming approach also supports more compute-intensive interludes – remapping, etc. Rimmer et al., 2014
  • 36. Streaming algorithms can be very efficient Data 1-pass Answer See also eXpress, Roberts et al., 2013.
  • 37. So: reference-free variant calling • Streaming & online algorithm; single pass. • For real-time diagnostics, can be applied as bases are emitted from sequencer. • Reference free: independent of reference bias. • Coverage of variants is adaptively adjusted to retain all signal. • Parameters are easily tuned, although theory needs to be developed. • High sensitivity (e.g. C=50 in 100x coverage) => poor compression • Low sensitivity (C=20) => good compression. • Can “subtract” reference => novel structural variants. • (See: Cortex, Zam Iqbal.)
  • 38. Two other features -- • More single-computer scalable approach than current: low disk access, high parallelizability. • Openness – our software is free to use, reuse, remix; no intellectual property restrictions. (Hence “We hear Illumina is using it…”)
  • 39. Prospectus for streaming variant detection • Underlying concept is sound and offers many advantages over current approaches; • We have proofs of concept implemented; • We know that underlying approach works well in amplification situations, as well; • Tuning and math/theory needed! • …grad students keep on getting poached by Amazon and Google. (This is becoming a serious problem.)
  • 40. Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB) Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
  • 46. Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB) Save in cold storage Save for reanalysis, investigation.
  • 47. Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
  • 48. Data recipes Standardized (versioned, open, remixable, cloud) pipelines and protocols for sequence data analysis. See: khmer-recipes, khmer-protocols. Increases buy-in :)
  • 49. Training! Lots of training planned at Davis – open workshops. ivory.idyll.org/blog/2014-davis-and-training.html Increases buy-in x 2!
  • 50. Acknowledgements Lab members involved Collaborators • Adina Howe (w/Tiedje) • Jason Pell • Qingpeng Zhang • Tim Brom • Jordan Fish • Michael Crusoe • Jim Tiedje, MSU • Billie Swalla, UW • Janet Jansson, LBNL • Susannah Tringe, JGI • Eran Andrechek, MSU Funding USDA NIFA; NSF IOS; NIH NHGRI; NSF BEACON.

Editor's Notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  2. Update from Jordan