STREAMING VARIANT 
CALLING? 
C. Titus Brown 
Michigan State University 
Sep 2014, NCI EDRN / Bethesda, MD
Mapping: locate reads in reference 
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping 
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
Problem 1: 
Analysis is done after sequencing. 
Sequencing → Analysis
Problem 2: 
Much of your data is unnecessary. 
Shotgun data is randomly sampled; 
So, you need high coverage for high sensitivity.
Problem 3: 
Current variant calling approaches are multipass 
Data → Mapping → Sorting → Calling → Answer
Problem 4: 
Allelic mapping bias favors reference genome. 
Number of nbh differentiating polymorphisms. 
Stevenson et al., 2013 (BMC Genomics)
Problem 5: 
Current approaches are often insensitive to indels 
Iqbal et al., Nat Gen 2012
Why are we concerned at all!? 
Looking forward 5 years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp each cell… 
• …or 120 Tbp of data. 
• HiSeq X10 can do the sequencing in ~3 weeks. 
• The variant calling will require 2,000 CPU weeks… 
• …so, given ~2,000 computers, can do this all in one 
month.
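The arithmetic on this slide can be checked with a few lines of Python. This is a back-of-envelope sketch only: the ~3 Gbp haploid human genome size is an assumption not stated on the slide, and the wall-clock estimate assumes the variant calling parallelizes perfectly across machines.

```python
# Back-of-envelope check of the slide's numbers.
# Assumption (not on the slide): haploid human genome is ~3 Gbp.
GENOME_GBP = 3
COVERAGE = 40          # slide: 40x haploid coverage
CELLS = 1000           # slide: 1000 single cells

gbp_per_cell = GENOME_GBP * COVERAGE        # Gbp sequenced per cell
total_tbp = gbp_per_cell * CELLS / 1000     # Tbp across all cells

SEQUENCING_WEEKS = 3   # slide: HiSeq X10, ~3 weeks
CPU_WEEKS = 2000       # slide: variant calling cost
COMPUTERS = 2000       # slide: machines available

# Assumes perfect parallelism: 2,000 CPU weeks over 2,000 machines.
calling_weeks = CPU_WEEKS / COMPUTERS
total_weeks = SEQUENCING_WEEKS + calling_weeks

print(f"{gbp_per_cell} Gbp per cell, {total_tbp:g} Tbp total")
print(f"{total_weeks:g} weeks end to end (sequencing, then calling)")
```

Under these assumptions the sequencing and the calling run back to back, which is where the "one month" figure comes from.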
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
Two issues: 
• Volume of data & compute infrastructure; 
• Latency for clinical applications.
Can we improve this situation? 
• Tie directly into machine as it generates sequence 
(Illumina, PacBio, and Nanopore can all do streaming, in theory) 
• Analyze data as it comes off; for some (many?) 
applications, can stop run early if signal detected. 
• Avoid using a reference genome for primary variant 
calling. 
• Easier indel detection, less allelic mapping bias 
• Can use reference for interpretation. 
Does such a magical approach exist!?
~Digression: Digital normalization 
(a computational version of library normalization) 
Suppose you have a dilution factor of A (10) to B (1). 
To get 10x of B you need to get 100x of A! Overkill!! 
The high-coverage reads in sample A are unnecessary for assembly, 
and, in fact, distract.
Digital normalization
Digital normalization is streaming
Some key points -- 
• Digital normalization is streaming. 
• Digital normalization is computationally efficient (lower 
memory than other approaches; parallelizable/multicore; 
single-pass) 
• Currently, primarily used for prefiltering for assembly, but 
relies on underlying abstraction (De Bruijn graph) that is 
also used in variant calling.
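The single-pass idea can be sketched in a few lines of Python: keep a read only if the median abundance of its k-mers, among the reads kept so far, is still below a coverage cutoff C. This is a minimal illustration, not the khmer implementation. It uses an exact in-memory `Counter` where a memory-efficient probabilistic counter would be used in practice, it ignores reverse complements, and the function names and the `k`/`cutoff` values are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below `cutoff`.

    Single pass: each read is examined once and either kept (and its
    k-mers counted) or discarded.
    """
    counts = Counter()  # exact counts; an assumption made for clarity
    for read in reads:
        km = kmers(read, k)
        if not km:          # read shorter than k: nothing to estimate
            continue
        # Median k-mer abundance is the read's coverage estimate.
        if median(counts[x] for x in km) < cutoff:
            counts.update(km)
            yield read


# Usage: 100 identical reads collapse to `cutoff` copies; a novel read is kept.
reads = ["ACGTTGCAAGCT"] * 100 + ["GGGGCCCCTTTT"]
kept = list(digital_normalization(reads, k=4, cutoff=5))
print(len(kept), "of", len(reads), "reads kept")
```

Because the function is a generator over an iterable of reads, it never needs the whole data set in memory at once, which is the sense in which the approach is streaming.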
Assembly now scales with richness, not diversity. 
• 10-100 fold decrease in memory requirements 
• 10-100 fold speed up in analysis
Diginorm is widely useful: 
1. Assembly of the H. contortus parasitic nematode 
genome, a “high polymorphism/variable coverage” 
problem. 
(Schwarz et al., 2013; pmid 23985341) 
2. Reference-free assembly of the lamprey (P. marinus) 
transcriptome, a “big assembly” problem. (in prep) 
3. Osedax symbiont metagenome, a “contaminated 
metagenome” problem (Goffredi et al., 2013; pmid 
24225886)
Anecdata: diginorm is used in Illumina 
long-read sequencing (?)
Diginorm is “lossy compression” 
• Nearly perfect from an information theoretic perspective: 
• Discards 95% or more of data for genomes. 
• Loses < 0.02% of information.
Digital normalization => graph alignment 
What we are actually doing at this stage 
is building a graph of all the reads, 
and aligning new reads to that graph.
Error correction via graph alignment 
Jason Pell and Jordan Fish
Error correction on simulated E. coli data 
(1% error rate, 100x coverage) 

            TP (corrected)       FP (mistakes)   TN (OK)        FN (missed) 
ideal       3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%) 
1-pass      2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%) 
1.2-pass    3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%) 
Jordan Fish and Jason Pell
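The percentages in the table are consistent with TP / (TP + FN) for the corrected fraction and FN / (TP + FN) for the missed fraction. The short Python check below recomputes them from the counts. The dictionary layout and function names are illustrative, not part of the original analysis.

```python
# Counts copied from the table above.
rows = {
    "ideal":    {"tp": 3_469_834, "fp": 8_186,  "tn": 460_655_449, "fn": 31_731},
    "1-pass":   {"tp": 2_827_839, "fp": 30_254, "tn": 460_633_381, "fn": 673_726},
    "1.2-pass": {"tp": 3_403_171, "fp": 8_764,  "tn": 460_654_871, "fn": 98_394},
}


def corrected_pct(r):
    """Percent of true errors that were corrected: TP / (TP + FN)."""
    return 100 * r["tp"] / (r["tp"] + r["fn"])


def missed_pct(r):
    """Percent of true errors that were missed: FN / (TP + FN)."""
    return 100 * r["fn"] / (r["tp"] + r["fn"])


for name, r in rows.items():
    print(f"{name:9s} corrected {corrected_pct(r):.1f}%  missed {missed_pct(r):.1f}%")
```

Every row has the same TP + FN total, as expected when all three runs are scored against the same set of simulated errors.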
Error correction  variant calling 
Single pass, reference free, tunable, streaming 
online variant calling.
Coverage is adjusted to retain signal
Graph alignment can detect read saturation
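One way to picture saturation detection in a stream: track how many recent reads still add new information, and stop once that fraction drops below a threshold. The sketch below is an illustration under loud assumptions. It uses the median k-mer count as a stand-in for graph alignment, and the function name, window size and threshold are all hypothetical choices, not the method behind the slide.

```python
from collections import Counter, deque
from statistics import median


def reads_until_saturation(reads, k=4, cutoff=5, window=10,
                           min_novel_fraction=0.1):
    """Consume reads until few recent reads are novel; return reads consumed.

    A read is "novel" if the median count of its k-mers is below `cutoff`
    (a proxy for graph alignment; an assumption of this sketch).
    """
    counts = Counter()
    recent = deque(maxlen=window)   # novelty flags for the last `window` reads
    consumed = 0
    for read in reads:
        consumed += 1
        km = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not km:                  # read shorter than k: skip it
            continue
        novel = median(counts[x] for x in km) < cutoff
        if novel:
            counts.update(km)
        recent.append(novel)
        # Stop once a full window contains too few novel reads.
        if len(recent) == window and sum(recent) / window < min_novel_fraction:
            break
    return consumed


# Usage: a stream of identical reads saturates quickly.
n = reads_until_saturation(["ACGTTGCAAGCT"] * 100)
print("stopped after", n, "reads")
```

In a real-time setting the same loop could sit directly on the sequencer's output, so the run stops when the window shows saturation rather than at a fixed depth.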
Streaming with reads… 
[Diagram: reads ("Sequence...") stream one after another into a Graph, which emits Variants.]
Analysis is done after sequencing. 
Sequencing → Analysis
Streaming with bases 
[Diagram: the first k bases of each read enter the Graph, followed by bases k+1, k+2, ... as they are emitted; the Graph emits Variants.]
Integrate sequencing and analysis 
Sequencing 
Analysis 
Are we done yet?
Streaming approach also supports more 
compute-intensive interludes – 
remapping, etc. 
Rimmer et al., 2014
Streaming algorithms can be very efficient 
Data → 1-pass → Answer 
See also eXpress, Roberts et al., 2013.
So: reference-free variant calling 
• Streaming & online algorithm; single pass. 
• For real-time diagnostics, can be applied as bases are emitted from 
sequencer. 
• Reference free: independent of reference bias. 
• Coverage of variants is adaptively adjusted to retain all 
signal. 
• Parameters are easily tuned, although theory needs to be 
developed. 
• High sensitivity (e.g. C=50 in 100x coverage) => poor compression 
• Low sensitivity (C=20) => good compression. 
• Can “subtract” reference => novel structural variants. 
• (See: Cortex, Zam Iqbal.)
Two other features -- 
• Scales better on a single computer than current approaches: low 
disk access, high parallelizability. 
• Openness – our software is free to use, reuse, remix; no 
intellectual property restrictions. (Hence “We hear Illumina 
is using it…”)
Prospectus for streaming variant detection 
• Underlying concept is sound and offers many advantages over 
current approaches; 
• We have proofs of concept implemented; 
• We know that the underlying approach also works well in 
amplification situations; 
• Tuning and math/theory needed! 
• …grad students keep on getting poached by Amazon and 
Google. (This is becoming a serious problem.)
[Diagram: Raw data (~10-100 GB) → Analysis → "Information" (~1 GB); several "Information" outputs → Database & integration. Raw data → Compression (~2 GB).] 
Lossy compression can substantially 
reduce data size while retaining 
information needed for later (re)analysis.
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
[Diagram: Raw data (~10-100 GB) → Analysis → "Information" (~1 GB); several "Information" outputs → Database & integration. Raw data → Compression (~2 GB). Additional labels: "Save in cold storage"; "Save for reanalysis, investigation."]
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014)
Data recipes 
Standardized (versioned, open, remixable, cloud) 
pipelines and protocols for sequence data analysis. 
See: khmer-recipes, khmer-protocols. 
Increases buy-in :)
Training! 
Lots of training planned at Davis – 
open workshops. 
ivory.idyll.org/blog/2014-davis-and-training.html 
Increases buy-in x 2!
Acknowledgements 
Lab members involved 
• Adina Howe (w/Tiedje) 
• Jason Pell 
• Qingpeng Zhang 
• Tim Brom 
• Jordan Fish 
• Michael Crusoe 
Collaborators 
• Jim Tiedje, MSU 
• Billie Swalla, UW 
• Janet Jansson, LBNL 
• Susannah Tringe, JGI 
• Eran Andrechek, MSU 
Funding 
USDA NIFA; NSF IOS; NIH NHGRI; NSF BEACON.

2014 nci-edrn

  • 1. STREAMING VARIANT CALLING? C. Titus Brown Michigan State University Sep 2014, NCI EDRN / Bethesda, MD
  • 2. Mapping: locate reads in reference http://en.wikipedia.org/wiki/File:Mapping_Reads.png
  • 3. Variant detection after mapping http://www.kenkraaijeveld.nl/genomics/bioinformatics/
  • 4. Problem 1: Analysis is done after sequencing. Sequencing Analysis
  • 5. Problem 2: Much of your data is unnecessary. Shotgun data is randomly sampled; So, you need high coverage for high sensitivity.
  • 6. Problem 3: Current variant calling approaches are multipass Data Mapping Sorting Calling Answer
  • 7. Problem 4: Allelic mapping bias favors reference genome. Number of nbh differentiating polymorphisms. Stevenson et al., 2013 (BMC Genomics)
  • 8. Problem 5: Current approaches are often insensitive to indels Iqbal et al., Nat Gen 2012
  • 9. Why are we concerned at all!? Looking forward 5 years… Navin et al., 2011
  • 10. Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp each cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month.
  • 11. Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: •Volume of data & compute infrastructure; • Latency for clinical applications.
  • 12. Can we improve this situation? • Tie directly into machine as it generates sequence (Illumina, PacBio, and Nanopore can all do streaming, in theory) • Analyze data as it comes off; for some (many?) applications, can stop run early if signal detected. • Avoid using a reference genome for primary variant calling. • Easier indel detection, less allelic mapping bias • Can use reference for interpretation. Does such a magical approach exist!?
  • 13. ~Digression: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! The high-coverage reads in sample A are unnecessary for assembly, and, in fact, distract.
  • 20. Some key points -- • Digital normalization is streaming. • Digital normalizing is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass) • Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
  • 21. Assembly now scales with richness, not diversity. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis
  • 22. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
  • 23. Anecdata: diginorm is used in Illumina long-read sequencing (?)
  • 24. Diginorm is “lossy compression” • Nearly perfect from an information theoretic perspective: • Discards 95% or more of data for genomes. • Loses < 0.02% of information.
  • 25. Digital normalization => graph alignment What we are actually doing at this stage is building a graph of all the reads, and aligning new reads to that graph.
  • 26. Error correction via graph alignment Jason Pell and Jordan Fish
  • 27. Error correction on simulated E. coli data (1% error rate, 100x coverage). Jordan Fish and Jason Pell
       Run        TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)
       ideal      3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)
       1-pass     2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)
       1.2-pass   3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
  • 28. Error correction => variant calling Single pass, reference free, tunable, streaming online variant calling.
  • 29. Coverage is adjusted to retain signal
  • 30. Graph alignment can detect read saturation
  • 31. Streaming with reads… [Diagram labels: Sequence... (repeated), Graph, Variants]
  • 32. Analysis is done after sequencing. Sequencing Analysis
  • 33. Streaming with bases [Diagram labels: k bases..., k+1, k+2 (repeated), Graph, Variants]
  • 34. Integrate sequencing and analysis Sequencing Analysis Are we done yet?
  • 35. Streaming approach also supports more compute-intensive interludes – remapping, etc. Rimmer et al., 2014
  • 36. Streaming algorithms can be very efficient Data 1-pass Answer See also eXpress, Roberts et al., 2013.
  • 37. So: reference-free variant calling • Streaming & online algorithm; single pass. • For real-time diagnostics, can be applied as bases are emitted from sequencer. • Reference free: independent of reference bias. • Coverage of variants is adaptively adjusted to retain all signal. • Parameters are easily tuned, although theory needs to be developed. • High sensitivity (e.g. C=50 in 100x coverage) => poor compression • Low sensitivity (C=20) => good compression. • Can “subtract” reference => novel structural variants. • (See: Cortex, Zam Iqbal.)
  • 38. Two other features -- • More single-computer scalable approach than current: low disk access, high parallelizability. • Openness – our software is free to use, reuse, remix; no intellectual property restrictions. (Hence “We hear Illumina is using it…”)
  • 39. Prospectus for streaming variant detection • Underlying concept is sound and offers many advantages over current approaches; • We have proofs of concept implemented; • We know that underlying approach works well in amplification situations, as well; • Tuning and math/theory needed! • …grad students keep on getting poached by Amazon and Google. (This is becoming a serious problem.)
  • 40. [Diagram labels: Raw data (~10-100 GB), Compression (~2 GB), Analysis, "Information" (~1 GB, repeated), Database & integration] Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
  • 46. [Diagram labels: Raw data (~10-100 GB), Compression (~2 GB), Analysis, "Information" (~1 GB, repeated), Database & integration] Save in cold storage. Save for reanalysis, investigation.
  • 47. Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
  • 48. Data recipes Standardized (versioned, open, remixable, cloud) pipelines and protocols for sequence data analysis. See: khmer-recipes, khmer-protocols. Increases buy-in :)
  • 49. Training! Lots of training planned at Davis – open workshops. ivory.idyll.org/blog/2014-davis-and-training.html Increases buy-in x 2!
  • 50. Acknowledgements Lab members involved Collaborators • Adina Howe (w/Tiedje) • Jason Pell • Qingpeng Zhang • Tim Brom • Jordan Fish • Michael Crusoe • Jim Tiedje, MSU • Billie Swalla, UW • Janet Jansson, LBNL • Susannah Tringe, JGI • Eran Andrechek, MSU Funding USDA NIFA; NSF IOS; NIH NHGRI; NSF BEACON.
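
The digital normalization idea from slides 13 and 20-24 can be sketched in a few lines. This is a minimal illustration, not the khmer implementation. It assumes the keep/discard rule is "keep a read only if the median count of its k-mers seen so far is below a cutoff C", and it uses an exact Python dict where a production tool would use a memory-bounded counting structure. The names `kmers`, `digital_normalization`, `k` and `cutoff` are hypothetical.

```python
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass, streaming read filter (illustrative sketch only).

    A read is kept if the median count of its k-mers, among the reads
    kept so far, is below `cutoff`. Only kept reads update the counts.
    """
    counts = {}  # assumption: exact counts; real tools use a compact sketch
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: coverage cannot be estimated
        estimated_coverage = median(counts.get(km, 0) for km in read_kmers)
        if estimated_coverage < cutoff:
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read


if __name__ == "__main__":
    # Ten copies of a high-coverage read, two copies of a low-coverage read.
    reads = ["ACGTTGCAAT"] * 10 + ["GGATCCTAGA"] * 2
    for kept in digital_normalization(reads, k=4, cutoff=3):
        print(kept)
```

Because the function is a generator that sees each read once, it can consume reads as they arrive. With `cutoff=3`, the ten redundant reads in the example are reduced to three while both low-coverage reads are retained, which is the "coverage is adjusted to retain signal" behaviour described on slide 29.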

Editor's Notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  2. Update from Jordan
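
The back-of-the-envelope numbers on slide 10 can be checked with a short calculation. The sketch below assumes a ~3 Gbp haploid genome (not stated on the slide, but implied by 40x coverage giving 120 Gbp per cell) and reads "one month" as sequencing time plus variant-calling time. The function name and its parameters are hypothetical.

```python
def sequencing_budget(cells=1000, coverage=40, genome_bp=3_000_000_000,
                      sequencing_weeks=3, cpu_weeks=2000, computers=2000):
    """Reproduce the slide-10 estimate under the stated assumptions."""
    per_cell_bp = genome_bp * coverage        # bases sequenced per cell
    total_bp = per_cell_bp * cells            # bases across all cells
    calling_weeks = cpu_weeks / computers     # wall-clock time if fully parallel
    return {
        "per_cell_gbp": per_cell_bp / 1e9,
        "total_tbp": total_bp / 1e12,
        "calling_weeks": calling_weeks,
        "total_weeks": sequencing_weeks + calling_weeks,
    }


if __name__ == "__main__":
    for name, value in sequencing_budget().items():
        print(f"{name}: {value:g}")
```

Under these assumptions the result is 120 Gbp per cell, 120 Tbp in total, and about four weeks end to end, consistent with the slide. The calling step is the part a single-pass streaming approach would aim to overlap with sequencing, in line with editor's note 1.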