Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core

  • 1.
    Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
    Luc Dehaspe, Genomics Core, UZ Leuven
    WOUD – Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011
  • 2.
    DNA sequencing determines the order of nucleotide bases in a genome.
    Diagram: the DNA replication machinery copies a human genome (2 x 3 billion bases) into a human genome (2 x 3 billion bases) in hours. A "final generation" sequencing machine turns the human genome (2 x 3 billion bases) into text (2 x 800 Mb). A computer's copy function copies that text (2 x 800 Mb) in minutes.
  • 3.
    Next generation sequencing
    Quality deteriorates after 100-1000 base pairs.
    Solution: cut genomes into readable fragments; sequence the fragments to obtain reads; use bioinformatics to reconstruct genomes from the reads.
    Diagram: human genome (2 x 3 billion bases) -> next generation sequencing machine -> reads in text format -> bioinformatics -> human genome (2 x 800 Mb text).
  • 4.
    Sequencers vs bioinformatics
    Human genome: 2 x 3 billion bases.
    HiSeq 2000 v3: 55 billion bases per day (6 human genomes in 10 days).
    HiSeq 2000 v2: 18 billion bases per day.
    Roche GS FLX: 1 billion bases per day.
    Bioinformatics: scale up the bioinformatics or pile up sequencer output.
    Result: human genome as 2 x 800 Mb of text.
  • 5.
    Bioinformatics pipeline
    Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run.
    Demultiplex: sort indexed reads per sample.
    Alignment: align reads per sample to the reference genome.
  • 6.
    Bioinformatics pipeline
    Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run.
    Demultiplex: sort indexed reads per sample.
    Alignment: align reads per sample to the reference genome.
    Variant calling: compare the pileup of reads at a given locus to the reference; identify SNPs, insertions and deletions.
  • 7.
    A bioinformatics pipeline
    Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run.
    Demultiplex: sort indexed reads per sample.
    Alignment: align reads per sample to the reference genome.
    Variant calling: compare to the reference; identify SNPs, insertions and deletions.
    Annotation: annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …).
    Sequencing: 10 days. Above pipeline: > 60 days on 1 CPU.
    Scale up or pile up.
  • 8.
    Favourable race conditions
    The same task is performed on many reads or loci:
        FOR 1.1 billion indexed reads DO
            identify sample
        FOR 3 billion human genome loci DO
            compare locus in aligned reads to reference and identify homo- and heterozygous SNPs
    Results for one read/locus are independent of results for other reads/loci.
    This suggests a natural scale-up strategy …
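To make the per-locus loop concrete, here is a minimal Python sketch of a call that looks at one locus at a time. The function name `call_genotype`, the `min_fraction` threshold and the toy pileups are all assumptions for illustration; real variant callers use base qualities and statistical models rather than a fixed fraction.

```python
from collections import Counter


def call_genotype(ref_base, pileup, min_fraction=0.2):
    """Classify one locus from the bases of the reads that cover it.

    Hypothetical rule: an allele counts as present when at least
    min_fraction of the covering reads show it.
    Returns 'no_call', 'hom_ref', 'het_snp' or 'hom_snp'.
    """
    counts = Counter(pileup)
    depth = sum(counts.values())
    if depth == 0:
        return "no_call"
    alleles = {base for base, n in counts.items() if n / depth >= min_fraction}
    if not alleles:
        return "no_call"
    if alleles == {ref_base}:
        return "hom_ref"
    if len(alleles) == 1:
        return "hom_snp"  # a single allele, different from the reference
    return "het_snp"      # two or more alleles observed


# Toy data: one reference base and one pileup string per locus.
reference = "ACGT"
pileups = ["AAAA", "CCTT", "GGGG", "AAAA"]

# Each call uses only its own locus, so the loop body can run anywhere.
calls = [call_genotype(ref, pile) for ref, pile in zip(reference, pileups)]
```

Because no call reads the result of another, the loci can be processed in any order, or on different machines, without changing the outcome.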
  • 9.
    Data parallelism
    Reads or loci are partitioned among the nodes of a computer cluster.
    Each node demultiplexes, aligns, etc. on its local partition.
    Speed-up is (near) linear in the number of cluster nodes.
    Example: variant calling on 3 billion human genome loci is split into variant calling on Chr1 … ChrY, on a cluster of 24 computers (nodes).
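The partitioning idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Genomics Core setup: worker threads stand in for cluster nodes, and `process_partition` is a hypothetical placeholder (it counts G/C bases) for the real per-node work such as demultiplexing or alignment.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_nodes):
    """Split items into n_nodes contiguous, nearly equal partitions."""
    size, extra = divmod(len(items), n_nodes)
    parts, start = [], 0
    for i in range(n_nodes):
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def process_partition(reads):
    """Placeholder for per-node work: count the G/C bases in each read."""
    return [sum(base in "GC" for base in read) for read in reads]


def run_data_parallel(reads, n_nodes):
    """Process each partition on its own worker and concatenate the results."""
    parts = partition(reads, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(process_partition, parts)  # keeps partition order
        return [value for part in results for value in part]


reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
parallel = run_data_parallel(reads, n_nodes=3)
serial = process_partition(reads)
```

The parallel and serial results are identical because each read is handled independently. Threads in CPython will not speed up CPU-bound work like this; on a real cluster each partition would go to a separate machine or process.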
  • 10.
    Data parallelism: demultiplex HiSeq 2000 microplate
    1 node (1 microplate), 1.1 billion reads at 1600 reads per second: 8 days.
    8 nodes (8 lanes, 1 … 8), each 138 million reads: 1 day.
    384 nodes (384 tiles, 1 … 384), each 3 million reads: ½ hour.
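The timings on this slide follow from dividing the read count by the number of nodes and by the per-node rate. The short Python check below assumes perfectly linear speed-up and no overhead for distributing data; the function name is made up for the example.

```python
def demultiplex_time_seconds(total_reads, reads_per_second, n_nodes):
    """Wall-clock time when reads are split evenly over n_nodes (ideal case)."""
    return total_reads / n_nodes / reads_per_second


TOTAL_READS = 1_100_000_000  # 1.1 billion reads, from the slide
RATE = 1600                  # reads per second on one node, from the slide

for n_nodes in (1, 8, 384):
    seconds = demultiplex_time_seconds(TOTAL_READS, RATE, n_nodes)
    print(
        f"{n_nodes:>3} nodes: "
        f"{TOTAL_READS / n_nodes / 1e6:7.1f} million reads per node, "
        f"{seconds / 3600:6.1f} hours ({seconds / 86400:.2f} days)"
    )
```

Under these assumptions the three cases come out at roughly 8 days, 1 day and half an hour, in line with the figures above.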
  • 11.
    Favourable race conditions
    MapReduce: data parallelism made easy.
    Developed and extensively used at Google.
    Open source library (C++) takes care of parallelization, fault tolerance, data distribution and load balancing.
    No knowledge of parallel systems required.
    User implements the functions Map() and Reduce().
  • 12.
    MapReduce: demultiplex reads
    8 lanes, 8 Map tasks (1 … 8).
    Map: sort reads by sample (Sample1, Sample2, Sample3).
    Wait until map has finished.
    Reduce: deduplicate reads, one task each for the Sample1 reads, Sample2 reads and Sample3 reads.
    Output: Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz.
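The map, wait and reduce steps on this slide can be imitated in plain Python. This is a single-process sketch, not the API of any MapReduce library: the index-to-sample table, the function names and the rule that two reads are duplicates when their sequences are identical are all assumptions, and writing the `.fastq.gz` files is left out.

```python
from collections import defaultdict

# Hypothetical index (barcode) to sample table.
INDEX_TO_SAMPLE = {"ACGT": "Sample1", "TGCA": "Sample2", "GGCC": "Sample3"}


def map_lane(lane_reads):
    """Map task: tag every (index, sequence) read of one lane with its sample."""
    for index, sequence in lane_reads:
        sample = INDEX_TO_SAMPLE.get(index)
        if sample is not None:  # reads with an unknown index are dropped
            yield sample, sequence


def shuffle(mapped_lanes):
    """Barrier between the phases: group all mapped reads by sample."""
    groups = defaultdict(list)
    for pairs in mapped_lanes:
        for sample, sequence in pairs:
            groups[sample].append(sequence)
    return groups


def reduce_sample(sample, sequences):
    """Reduce task: drop repeated sequences, keeping the first occurrence."""
    return sample, list(dict.fromkeys(sequences))


def demultiplex(lanes):
    mapped = [list(map_lane(lane)) for lane in lanes]  # one map task per lane
    groups = shuffle(mapped)                           # wait until map has finished
    return dict(reduce_sample(s, seqs) for s, seqs in sorted(groups.items()))


lanes = [
    [("ACGT", "AAAA"), ("TGCA", "CCCC"), ("NNNN", "GGGG")],
    [("ACGT", "AAAA"), ("GGCC", "TTTT"), ("ACGT", "ACAC")],
]
per_sample = demultiplex(lanes)
```

In a real framework the map tasks would run on different nodes and the shuffle would move data across the network; only `map_lane` and `reduce_sample` correspond to the user-written Map() and Reduce() functions.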
  • 13.
    Favourable race conditions
    GATK: MapReduce for sequencing projects.
    Genome Analysis Toolkit.
    Developed by and used extensively at the Broad Institute (Harvard and MIT).
    Open source, Java 1.6 framework.
    Provides common data access patterns: traversal by read, traversal by locus.
  • 14.
    Favourable race conditions
    Data parallelism is supported by many (open source) bioinformatics tools; the number of nodes is a parameter.
    Full analysis pipelines are widely available: GATK, CASAVA, …
  • 15.
    Conclusion
    Data parallelism is key.
    Scale up by buying extra cluster nodes: the Genomics Core recently added 400 nodes (shared).
    Canned solutions exist for common bioinformatics tasks.
    Established programming frameworks exist for custom solutions: MapReduce, GATK.
  • 16.
    Conclusion
    Bioinformaticians enjoy favourable conditions for keeping pace with the sequencer …
    Diagram: human genome (2 x 3 billion bases) -> next generation sequencing machine -> reads in text format -> bioinformatics using data parallelism -> human genome (2 x 800 Mb text); a final generation sequencing machine would go from genome to text directly.
    … until made redundant by the final generation.