Goal: do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Streaming approaches to reference-free variant calling
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
May 1, firstname.lastname@example.org
Open, online science
Much of the software and approaches I'm talking about today are available:
khmer software: github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Outline & Overview
- Motivation: lots of data, analyzed with "offline" approaches.
- Reference-based vs. reference-free approaches.
- Single-pass algorithms for lossy compression; application to resequencing data.
Shotgun sequencing
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Sequencers produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
Three basic problems
Resequencing, counting, and assembly.
Resequencing analysis
We know a reference genome, and want to find variants (blue) in a background of errors (red).
Counting
We have a reference genome (or gene set) and want to know how much we have. Think gene expression/microarrays, copy number variation.
Noisy observations <-> information
[Same overlapping, error-containing fragments as on the previous slide.]
"Three types of data scientists."
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore's Law.
2. Your data gathering rate matches Moore's Law.
3. Your data gathering rate exceeds Moore's Law.
"Three types of data scientists."
1. Your data gathering rate is slower than Moore's Law.
   => Be lazy, all will work out.
2. Your data gathering rate matches Moore's Law.
   => You need to write good software, but all will work out.
3. Your data gathering rate exceeds Moore's Law.
   => You need serious help.
Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (300 Gbp for human).
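Why such deep sampling? Under the standard Poisson coverage model (Lander-Waterman), the expected fraction of bases left entirely uncovered at mean depth c is e^-c, so ~1x leaves over a third of the genome unseen while 10x misses only ~4.5e-5 of it. A back-of-envelope illustration (the model, not any specific pipeline's calculation):

```python
import math

# P(a given base has zero coverage) under Poisson with mean depth c.
for c in (1, 5, 10, 30):
    missed = math.exp(-c)
    print(f"{c:>2}x coverage: ~{missed:.2e} of bases expected uncovered")
```

In practice error rates and non-uniform sampling push the required depth well above the idealized Poisson estimate, hence 10-100x.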
Applications in cancer genomics
- Single-cell cancer genomics will advance: e.g. ~60-300 Gbp of data for each of ~1000 tumor cells.
- Infer phylogeny of tumor => mechanistic insight.
- Current approaches are computationally intensive and data-heavy.
Current variant calling approach
Map reads to reference → "pileup" and do variant calling → downstream diagnostics.
Drawbacks of reference-based approaches
- Fairly narrowly defined heuristics.
- Allelic mapping bias: mapping is biased towards the reference allele.
- Ignorant of "unexpected" novelty:
  - Indels, especially large indels, are often ignored.
  - Structural variation is not easily retained or recovered.
  - True novelty is discarded.
- Most implementations are multipass on big data.
Challenges
- Considerable amounts of noise in the data (0.1-1% error).
- Reference-based approaches have several drawbacks:
  - Dependent on quality/applicability of the reference.
  - Detection of true novelty (SNPs vs. indels; SVs) is problematic.
- => The first major data reduction step (variant calling) is extremely lossy in terms of potential information.
A software & algorithms approach: can we develop lossy compression approaches that
1. Reduce data size & remove errors => efficient processing?
2. Retain all "information"? (think JPEG)
If so, then we can store only the compressed data for later reanalysis.
Short answer is: yes, we can.
[Diagram: raw data (~10-100 GB) → compression (~2 GB) → analysis → "information" (~1 GB) → database & integration.]
[Diagram, continued: raw data (~10-100 GB) is saved in cold storage; compressed data (~2 GB) is saved for reanalysis and investigation.]
My lab at MSU: theoretical => applied solutions.
Theoretical advances in data structures and algorithms → practically useful & usable implementations, at scale → demonstrated effectiveness on real data.
1. Time- and space-efficient k-mer counting
- To add an element: increment the associated counter at all hash locales.
- To get a count: retrieve the minimum counter across all hash locales.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
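The two operations above describe a CountMin sketch. A minimal sketch in Python to make the idea concrete (the table sizes and md5-based hashing are illustrative choices, not khmer's actual implementation):

```python
import hashlib

class CountMinSketch:
    """Approximate counter: reported counts may overestimate (hash
    collisions) but never underestimate the true count."""

    def __init__(self, num_tables=4, table_size=10007):
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.table_size = table_size

    def _locales(self, kmer):
        # One hash locale per table, from a per-table salted hash.
        for i in range(len(self.tables)):
            h = hashlib.md5((str(i) + kmer).encode()).hexdigest()
            yield int(h, 16) % self.table_size

    def add(self, kmer):
        # Increment the associated counter at all hash locales.
        for table, idx in zip(self.tables, self._locales(kmer)):
            table[idx] += 1

    def count(self, kmer):
        # Retrieve the minimum counter across all hash locales.
        return min(table[idx]
                   for table, idx in zip(self.tables, self._locales(kmer)))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

sketch = CountMinSketch()
for kmer in kmers("ATGGCATGGC", k=5):
    sketch.add(kmer)
print(sketch.count("ATGGC"))  # at least 2: "ATGGC" occurs twice in the read
```

Memory use is fixed up front (num_tables × table_size counters) regardless of how many distinct k-mers are seen, which is what makes the structure attractive at sequencing scale.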
2. Compressible assembly graphs (NOVEL)
[Figure: assembly graphs at 1%, 5%, 10%, and 15%.] Pell et al., PNAS, 2012.
3. Online, streaming, lossy compression (NOVEL)
- Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
- Core algorithm is single pass, "low" memory.
Brown et al., arXiv, 2012.
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
- Is reference free.
- Is single pass: looks at each read only once.
- Does not "collect" the majority of errors.
- Keeps all low-coverage reads & retains all information.
- Smooths out coverage of regions.
Can we apply this algorithmically efficient technique to variants? Yes.
Single-pass, reference-free, tunable, streaming, online variant calling.
Reference-free variant calling
- Streaming & online algorithm; single pass.
  - For real-time diagnostics, can be applied as bases are emitted from the sequencer.
- Reference free: independent of reference bias.
- Coverage of variants is adaptively adjusted to retain all signal.
- Parameters are easily tuned, although theory needs to be developed:
  - High sensitivity (e.g. C=50 in 100x coverage) => poor compression.
  - Low sensitivity (C=20) => good compression.
- Can "subtract" reference => novel structural variants. (See: Cortex, Zam Iqbal.)
Concluding thoughts
- This approach could provide substantial practical and theoretical leverage on a challenging problem.
- It provides a path to the future:
  - Many-core implementation; distributable?
  - Decreased memory footprint => cloud/rental computing can be used for many analyses.
- Still early days, but funded…
- Our other techniques are in use; ~dozens of labs are using digital normalization.
References & reading list
- Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 2012. (PubMed 22231483)
- Nordstrom et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotech. 2013. (PubMed 23475072)
- Brown et al., A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv 1203.4802.
Note: this talk is online at slideshare.net, c.titus.brown.
Acknowledgements
Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Chris Welcher.
Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI.
Funding: USDA NIFA; NSF IOS; BEACON.
Thank you for the invitation!