Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core

by dr. Luc Dehaspe - Genomics Core, UZ Leuven

To grow and function, living organisms unconsciously and continuously read instructions from the DNA sequence in each cell. Thanks to advances in DNA sequencing technology, scientists are increasingly able to consciously read along. In 2001, sequencing efforts resulted in a first draft of the human genome. Since then, the capacity of DNA reading machines has doubled every six months on average. While the first human genome sequencing project took years of worldwide collaboration, multiple genomes can now be sequenced in 10 days on a single machine at a service facility such as the Genomics Core.
Each sequencing run produces a few terabytes of raw data that must be processed with bioinformatics techniques before the next batch of data arrives.
I will discuss bioinformatics techniques that are commonly used in the Genomics Core and that have a chance to survive another generation of sequencing machines. A crucial feature of these techniques is that they keep up with the sequencing machines by creating sub-tasks that are distributed over an extensible network of computers.


Transcript

  • 1. Luc Dehaspe
      Genomics Core, UZ Leuven
      WOUD – Onderzoeksgroep Associatie Universiteit Gent - 28 Sept 2011
      Race against the sequencing machine: Processing of raw DNA sequence data at the Genomics Core
  • 2. DNA sequencing
      Determines the order of nucleotide bases in a genome
      DNA replication machinery
      Human Genome: 2 x 3 billion bases
      Human Genome: 2 x 3 billion bases
      hours
      Sequencing machine
      Final Generation Sequencing machine
      Computer's copy function
      Human Genome: 2 x 800 Mb text
      Human Genome: 2 x 800 Mb text
      minutes
  • 3. Next generation sequencing
      Quality deteriorates after 100-1000 base pairs
      Solution:
      Cut genomes into readable fragments
      Sequence fragments -> reads
      Use bioinformatics to reconstruct genomes from reads
      Human Genome: 2 x 3 billion bases
      Next Generation Sequencing machine
      Reads in text format
      bioinformatics
      Human Genome: 2 x 800 Mb text
  • 4. Sequencers vs Bioinformatics
      Human Genome: 2 x 3 billion bases
      HiSeq 2000 v3: 55 billion bases per day (6 Human Genomes in 10 days)
      HiSeq 2000 v2: 18 billion bases per day
      Roche GS FLX: 1 billion bases per day
      bioinformatics
      Scale up bioinformatics or pile up sequencer output
      Human Genome: 2 x 800 Mb text
  • 5. Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
      Bioinformatics pipeline
      Demultiplex: sort indexed reads per sample
      Alignment: align reads per sample to reference genome
  • 6. Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
      Bioinformatics pipeline
      Demultiplex: sort indexed reads per sample
      Alignment: align reads per sample to reference genome
      Variant Calling: compare pileup of reads at given locus to reference, identify SNPs, insertions and deletions
  • 7. A bioinformatics pipeline
      Case: Human Exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
      Demultiplex: sort indexed reads per sample
      Alignment: align reads per sample to reference genome
      Variant Calling: compare to reference, identify SNPs, insertions and deletions
      Annotation: annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
      Sequencing: 10 days
      Above pipeline: > 60 days on 1 cpu
      Scale up or pile up
  • 8. Favourable race conditions
      Same task performed on many reads or loci
      FOR 1.1 billion indexed reads DO
        Identify sample
      FOR 3 billion Human Genome loci DO
        Compare locus in aligned reads to reference and identify homo- and heterozygous SNPs
      Results for one read/locus are independent of results for other reads/loci
      Suggests natural scale-up strategy …
  • 9. Data parallelism
      Reads or loci are partitioned among the nodes of a computer cluster
      Each node demultiplexes, aligns, etc. on its local partition
      Speed-up is (near) linear in the number of cluster nodes
      Variant calling: 3 billion Human Genome loci
      Variant calling Chr1
      Variant calling ChrY
      Cluster of 24 computers (nodes)
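The partitioning idea on slide 9 can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the caller `call_variants`, the pileup strings, and the two-chromosome data set are all hypothetical, and a thread pool on one machine stands in for the nodes of a cluster. It only shows that each partition can be processed without looking at any other.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_variants(partition):
    """Toy caller: report loci where the most common observed base differs from the reference."""
    chrom, reference, pileups = partition
    variants = []
    for position, (ref_base, observed) in enumerate(zip(reference, pileups)):
        if not observed:
            continue  # no coverage at this locus
        consensus, _ = Counter(observed).most_common(1)[0]
        if consensus != ref_base:
            variants.append((chrom, position, ref_base, consensus))
    return variants

# Hypothetical toy data: one partition per chromosome (reference string, per-locus read bases).
partitions = [
    ("chr1", "ACGT", ["AAA", "CCC", "TTT", "TT"]),
    ("chrY", "GGCA", ["GG", "", "CC", "AAG"]),
]

# Each worker handles one partition; no worker needs results from another.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_partition = list(pool.map(call_variants, partitions))

variants = [variant for part in per_partition for variant in part]
print(variants)
```

Because the partitions share no state, adding workers (or cluster nodes) requires no change to `call_variants` itself, which is the property the slide relies on.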
  • 10. Data parallelism
      Demultiplex HiSeq 2000 microplate
      1 microplate: 1 node, 1.1 billion reads, 1600 reads per second, 8 days
      8 lanes (1 … 8): 8 nodes, each 138 million reads, 1 day
      384 tiles (1 … 384): 384 nodes, each 3 million reads, ½ hour
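The timings on slide 10 follow from dividing the read count by the per-node throughput. A minimal Python sketch of that arithmetic, assuming perfectly linear speed-up and no coordination overhead (the function name `demultiplex_seconds` is illustrative and not from the talk):

```python
READS = 1_100_000_000     # reads in the half run (from the slide)
READS_PER_SECOND = 1600   # single-node demultiplexing throughput (from the slide)

def demultiplex_seconds(nodes: int) -> float:
    """Wall-clock seconds if reads are split evenly over `nodes` (assumes linear speed-up)."""
    return READS / nodes / READS_PER_SECOND

for nodes in (1, 8, 384):
    seconds = demultiplex_seconds(nodes)
    print(f"{nodes:>3} nodes: {READS / nodes / 1e6:7.1f} million reads each, "
          f"{seconds / 86400:5.2f} days ({seconds / 3600:6.1f} hours)")
```

The three cases come out at roughly 8 days, 1 day, and half an hour, in line with the figures on the slide.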
  • 11. Favourable race conditions
      MapReduce: data parallelism made easy
      Developed and extensively used at Google
      Open source library (C++) takes care of
        Parallelization
        Fault Tolerance
        Data Distribution
        Load Balancing
      No knowledge of parallel systems required
      User implements functions Map() and Reduce()
  • 12. MapReduce: demultiplex reads
      8 lanes (1 … 8), 8 Map tasks
      Map: sort reads into Sample1, Sample2, Sample3 (one Map task per lane)
      Wait until map has finished
      Sample1 reads, Sample2 reads, Sample3 reads
      Reduce: deduplicate reads (one Reduce task per sample)
      Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz
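The flow on slide 12 can be mimicked in plain Python to make the Map and Reduce roles concrete. This is a single-process sketch, not the MapReduce library or the Genomics Core's code: the barcode table `INDEX_TO_SAMPLE`, the helper names, and the rule "duplicates are reads with identical sequence" are assumptions made for illustration, and results stay in memory instead of being written to `.fastq.gz` files.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; in practice this comes from the run's sample sheet.
INDEX_TO_SAMPLE = {"ATCACG": "Sample1", "CGATGT": "Sample2", "TTAGGC": "Sample3"}

def map_lane(reads):
    """Map: emit (sample, sequence) for each (index, sequence) read in one lane."""
    for index, sequence in reads:
        sample = INDEX_TO_SAMPLE.get(index)
        if sample is not None:  # reads with an unknown index are dropped
            yield sample, sequence

def shuffle(mapped_lanes):
    """Group map output by sample; runs only after every map task has finished."""
    grouped = defaultdict(list)
    for lane_output in mapped_lanes:
        for sample, sequence in lane_output:
            grouped[sample].append(sequence)
    return grouped

def reduce_sample(sequences):
    """Reduce: deduplicate the reads of one sample, keeping first-seen order."""
    return list(dict.fromkeys(sequences))

def demultiplex(lanes):
    mapped = [list(map_lane(lane)) for lane in lanes]  # one map task per lane
    return {sample: reduce_sample(seqs) for sample, seqs in shuffle(mapped).items()}

# Toy input: two lanes of (index, sequence) pairs.
lanes = [
    [("ATCACG", "ACGT"), ("CGATGT", "GGCC"), ("ATCACG", "ACGT")],
    [("TTAGGC", "TTAA"), ("ATCACG", "CCGG"), ("NNNNNN", "AAAA")],
]
result = demultiplex(lanes)
print(result)
```

In a real MapReduce framework the user would supply only the equivalents of `map_lane` and `reduce_sample`; the grouping step and the distribution over nodes are handled by the framework.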
  • 13. Favourable race conditions
      GATK: MapReduce for sequencing projects
      Genome Analysis Toolkit
      Developed by and used extensively at Broad Institute (Harvard and MIT)
      Open source, Java 1.6 framework
      Provides common data access patterns
        Traversal by read
        Traversal by locus
  • 14. Favourable race conditions
      Data parallelism supported by many (open source) bioinformatics tools
      Number of nodes is a parameter
      Full analysis pipelines widely available
        GATK
        CASAVA
        …
  • 15. Conclusion
      Data parallelism is key
      Scale up by buying extra cluster nodes
      Genomics Core recently added 400 nodes (shared)
      Canned solutions for common bioinformatics tasks
      Established programming frameworks for custom solutions
        MapReduce
        GATK
  • 16. Conclusion
      Bioinformaticians enjoy favourable conditions for keeping pace with sequencer …
      Human Genome: 2 x 3 billion bases
      Next Generation Sequencing machine
      Final Generation Sequencing machine
      Reads in text format
      Bioinformatics using data parallelism
      Human Genome: 2 x 800 Mb text
      … until made redundant by final generation
