Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core

Luc Dehaspe, Genomics Core, UZ Leuven
WOUD – Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011

DNA sequencing determines the order of nucleotide bases in a genome
- DNA replication machinery: Human Genome (2 x 3 billion bases) to Human Genome (2 x 3 billion bases), in hours
- Computer's copy function: Human Genome (2 x 800 Mb text) to Human Genome (2 x 800 Mb text), in minutes
- Sequencing machine (final generation sequencing machine): from Human Genome as bases to Human Genome as text

Next generation sequencing
- Quality deteriorates after 100-1000 base pairs
- Solution:
  - Cut genomes in readable fragments
  - Sequence fragments -> reads
  - Use bioinformatics to reconstruct genomes from reads
- Flow: Human Genome (2 x 3 billion bases) -> Next Generation Sequencing machine -> reads in text format -> bioinformatics -> Human Genome (2 x 800 Mb text)
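
The cut, sequence and reconstruct idea can be sketched on toy data. This is only an illustration under loud assumptions: a 20-base string stands in for a genome, reads are fixed-length overlapping fragments, and "alignment" is exact string search. The names `fragment` and `align_exact`, the read length and the step are hypothetical; real aligners tolerate mismatches and work against an indexed reference.

```python
def fragment(genome: str, read_len: int, step: int) -> list[str]:
    """Cut a sequence into overlapping fixed-length fragments (toy 'reads')."""
    last_start = len(genome) - read_len
    return [genome[i:i + read_len] for i in range(0, last_start + 1, step)]


def align_exact(read: str, reference: str) -> int:
    """Return the first position where the read matches exactly, or -1."""
    return reference.find(read)


# Hypothetical 20-base "genome"; real reads come from the sequencer.
reference = "ACGTTGCAAGGCTTACGATC"
reads = fragment(reference, read_len=8, step=4)

# Each read is placed independently of all other reads.
positions = [align_exact(read, reference) for read in reads]
print(list(zip(positions, reads)))
```

Because every read is placed without looking at any other read, the same loop can later be split across machines.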

Sequencers vs Bioinformatics
- HiSeq 2000 v3: 55 billion bases per day (6 Human Genomes in 10 days)
- HiSeq 2000 v2: 18 billion bases per day
- Roche GS FLX: 1 billion bases per day
- Flow: Human Genome (2 x 3 billion bases) -> sequencer -> bioinformatics -> Human Genome (2 x 800 Mb text)
- Scale up bioinformatics or pile up sequencer output

A bioinformatics pipeline
Case: Human Exome, raw data = 1.1 billion reads, 2x100bp, HiSeq 2000 v3, ½ run
- Demultiplex: sort indexed reads per sample
- Alignment: align reads per sample to reference genome
- Variant Calling: compare pileup of reads at given locus to reference, identify SNPs, insertions and deletions
- Annotation: annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
Sequencing: 10 days
Above pipeline: > 60 days on 1 cpu
Scale up or pile up
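
The "scale up or pile up" trade-off can be put in numbers. The sketch below assumes ideal linear speedup and treats the "> 60 days" figure as a lower bound; the function name `min_workers` is hypothetical.

```python
import math

SEQUENCING_DAYS = 10         # from the slide
PIPELINE_DAYS_ONE_CPU = 60   # slide says "> 60 days", so this is a lower bound


def min_workers(pipeline_days: float, sequencing_days: float) -> int:
    """Smallest number of equal workers that finishes within one sequencing run,
    assuming the work splits perfectly (an idealisation)."""
    return math.ceil(pipeline_days / sequencing_days)


workers = min_workers(PIPELINE_DAYS_ONE_CPU, SEQUENCING_DAYS)
print(f"At least {workers} CPUs are needed to keep pace with the sequencer")
```

With fewer workers than this, each finished run adds to a backlog of unprocessed sequencer output.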

Favourable race conditions
- Same task performed on many reads or loci
  - FOR 1.1 billion indexed reads DO: identify sample
  - FOR 3 billion Human Genome loci DO: compare locus in aligned reads to reference and identify homo- and heterozygotic SNPs
- Results for one read/locus independent of results for other reads/loci
- Suggests natural scale up strategy …
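
The per-locus loop can be sketched as follows. This is not how production variant callers work (they use statistical models); the function `call_genotype`, the 0.2 threshold and the toy pileups are assumptions made only to show that each locus is decided from its own data.

```python
from collections import Counter


def call_genotype(ref_base: str, pileup: str, min_fraction: float = 0.2) -> str:
    """Classify one locus from the bases of the reads covering it.

    min_fraction is a hypothetical cut-off, not a recommended value.
    """
    if not pileup:
        return "no coverage"
    counts = Counter(pileup)
    alt_fraction = 1 - counts[ref_base] / len(pileup)
    if alt_fraction < min_fraction:
        return "reference"
    if alt_fraction > 1 - min_fraction:
        return "homozygous SNP"
    return "heterozygous SNP"


# Toy data: one reference base and one pileup string per locus.
reference = "ACGT"
pileups = ["AAAAAAAA", "CCTTCTCT", "TTTTTTTT", "TTTTTTTT"]

# FOR each locus DO: compare the pileup to the reference.
calls = [call_genotype(ref, pile) for ref, pile in zip(reference, pileups)]
print(calls)
```

No call reads the result of another call, which is the independence property the slide relies on.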

Data parallelism
- Reads or loci partitioned among nodes of computer cluster
- Each node demultiplexes, aligns, etc. on local partition
- Speed up (near) linear to number of cluster nodes
- Example: variant calling on 3 billion Human Genome loci is split into variant calling Chr1 … variant calling ChrY, on a cluster of 24 computers (nodes)
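
Partitioning by chromosome can be sketched with Python's standard `concurrent.futures` module. Threads stand in for cluster nodes here purely for illustration (they do not give real CPU speedup in CPython), and the toy loci and the function `count_variants` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_variants(partition):
    """Work done by one 'node': count loci that differ from the reference."""
    chrom, loci = partition
    return chrom, sum(1 for ref, observed in loci if ref != observed)


# Hypothetical toy data: (reference base, observed base) per locus,
# already partitioned by chromosome.
partitions = {
    "chr1": [("A", "A"), ("C", "T"), ("G", "G")],
    "chr2": [("T", "T"), ("G", "A"), ("C", "G")],
    "chrY": [("A", "A")],
}

# One worker per partition; each worker only sees its local partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    results = dict(pool.map(count_variants, partitions.items()))

print(results)
```

On a real cluster the same pattern is usually expressed through a job scheduler or a framework rather than a thread pool.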

Data parallelism: demultiplex HiSeq 2000 microplate
- 1 microplate: 1 node, 1.1 billion reads, 1600 reads per second: 8 days
- 8 lanes: 8 nodes, each 138 million reads: 1 day
- 384 tiles: 384 nodes, each 3 million reads: ½ hour
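
The three timings follow from the read count and the per-node rate given on the slide. The sketch below assumes the reads split perfectly evenly and that there is no start-up or communication overhead; `demultiplex_hours` is a hypothetical helper.

```python
TOTAL_READS = 1.1e9        # from the slide
READS_PER_SECOND = 1600    # per-node demultiplexing rate, from the slide


def demultiplex_hours(nodes: int) -> float:
    """Wall-clock hours if the reads are split evenly over `nodes` nodes."""
    reads_per_node = TOTAL_READS / nodes
    return reads_per_node / READS_PER_SECOND / 3600


for nodes in (1, 8, 384):
    hours = demultiplex_hours(nodes)
    print(f"{nodes:>3} nodes: {hours:8.2f} hours ({hours / 24:.2f} days)")
```

Splitting by lane or by tile works because the sequencer output is already organised in those units, so no extra partitioning step is needed.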

Favourable race conditions
MapReduce: data parallelism made easy
- Developed and extensively used at Google
- Open source library (C++) takes care of
  - Parallelization
  - Fault Tolerance
  - Data Distribution
  - Load Balancing
- No knowledge of parallel systems required
- User implements functions Map() and Reduce()

MapReduce: demultiplex reads
- 8 lanes, 8 Map tasks
- Map: sort reads per lane into Sample1, Sample2, Sample3
- Wait until map has finished
- Reduce: deduplicate reads, one Reduce task per sample (Sample1 reads, Sample2 reads, Sample3 reads)
- Output: Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz
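
The map, wait and reduce steps can be sketched in a single Python process. This is not the API of any MapReduce library: the functions `map_lane`, `shuffle` and `reduce_sample`, the index-to-sample table and the toy reads are all hypothetical, and a real framework would run the map and reduce tasks on separate nodes.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; toy reads are (index, sequence) pairs.
SAMPLE_BY_INDEX = {"ATCACG": "Sample1", "CGATGT": "Sample2", "TTAGGC": "Sample3"}


def map_lane(lane_reads):
    """Map: tag each read of one lane with its sample; skip unknown indexes."""
    return [(SAMPLE_BY_INDEX[index], seq)
            for index, seq in lane_reads if index in SAMPLE_BY_INDEX]


def shuffle(mapped_lanes):
    """Group reads by sample once every map task has finished."""
    grouped = defaultdict(list)
    for pairs in mapped_lanes:
        for sample, seq in pairs:
            grouped[sample].append(seq)
    return grouped


def reduce_sample(seqs):
    """Reduce: drop duplicate reads, keeping first-seen order."""
    return list(dict.fromkeys(seqs))


lanes = [
    [("ATCACG", "ACGT"), ("CGATGT", "GGCC"), ("ATCACG", "ACGT")],
    [("TTAGGC", "TTAA"), ("ATCACG", "CCGG"), ("NNNNNN", "AAAA")],
]
mapped = [map_lane(lane) for lane in lanes]          # one map task per lane
demultiplexed = {sample: reduce_sample(seqs)         # one reduce per sample
                 for sample, seqs in shuffle(mapped).items()}
print(demultiplexed)
```

The reduce step has to wait for all map tasks because duplicates of a read may sit in different lanes.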

Favourable race conditions
GATK: MapReduce for sequencing projects
- Genome Analysis Toolkit
- Developed by and used extensively at Broad Institute (Harvard and MIT)
- Open source, Java 1.6 framework
- Provides common data access patterns
  - Traversal by read
  - Traversal by locus
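
The difference between the two access patterns can be illustrated in a few lines of Python. This is a conceptual sketch only and does not reproduce GATK's Java API; the generator names and the toy alignments are hypothetical.

```python
from collections import defaultdict

# Toy aligned reads: (start position on the reference, read sequence).
aligned_reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]


def traverse_by_read(reads):
    """Visit each read once; here we report the interval it covers."""
    for start, seq in reads:
        yield start, start + len(seq)


def traverse_by_locus(reads):
    """Visit each covered locus once, with the pileup of bases seen there."""
    pileups = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileups[start + offset].append(base)
    for locus in sorted(pileups):
        yield locus, "".join(pileups[locus])


intervals = list(traverse_by_read(aligned_reads))
pileup_at = dict(traverse_by_locus(aligned_reads))
print(intervals)
print(pileup_at)
```

Per-read tasks such as demultiplexing fit the first pattern, while per-locus tasks such as variant calling fit the second.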

Favourable race conditions
- Data parallelism supported by many (open source) bioinformatics tools
  - Number of nodes is parameter
- Full analysis pipelines widely available
  - GATK
  - CASAVA
  - …

Conclusion
- Data parallelism is key
- Scale up by buying extra cluster nodes
  - Genomics Core recently added 400 nodes (shared)
- Canned solutions for common bioinformatics tasks
- Established programming frameworks for custom solutions
  - MapReduce
  - GATK

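Conclusion
- Bioinformaticians enjoy favourable conditions for keeping pace with sequencer …
- Next generation: Human Genome (2 x 3 billion bases) -> Next Generation Sequencing machine -> reads in text format -> bioinformatics using data parallelism -> Human Genome (2 x 800 Mb text)
- Final generation: Human Genome (2 x 3 billion bases) -> Final Generation Sequencing machine -> Human Genome (2 x 800 Mb text)
- … until made redundant by final generation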
