Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
by dr. Luc Dehaspe - Genomics Core, UZ Leuven

To grow and function, living organisms unconsciously and continuously read instructions from the DNA sequence in each cell. Thanks to advances in DNA sequencing technology, scientists are increasingly able to consciously read along. In 2001, sequencing efforts resulted in a first draft of the human genome. Since then, the capacity of the DNA reading machines has doubled every six months on average. While the first human genome sequencing project took years of worldwide collaboration, multiple genomes can now be sequenced in 10 days on a single machine at a service facility such as the Genomics Core.
Each sequencing run gives rise to a few terabytes of raw data that, using bioinformatics techniques, must be processed in time, before the next batch of data arrives.
I will discuss bioinformatics techniques that are commonly used in the Genomics Core and that have a chance to survive another generation of sequencing machines. A crucial feature of these techniques is that they keep up with the sequencing machines by creating sub-tasks that are distributed over an extensible network of computers.

    Presentation Transcript

    • Luc Dehaspe
      Genomics Core, UZ Leuven
      WOUD – Research Group, Associatie Universiteit Gent – 28 Sept 2011
      Race against the sequencing machine: Processing of raw DNA sequence data at the Genomics Core
    • DNA sequencing
      determines the order of nucleotide bases in a genome
      [Diagram: inside the cell, the DNA replication machinery copies the human genome (2 x 3 billion bases) in hours; a hypothetical "final generation" sequencing machine would read it the way a computer's copy function reads 2 x 800 Mb of text, in minutes]
    • Next generation sequencing
      Quality deteriorates after 100-1000 base pairs
      Solution:
      Cut genomes into readable fragments
      Sequence fragments -> reads
      Use bioinformatics to reconstruct genomes from reads
      [Diagram: human genome (2 x 3 billion bases) -> next generation sequencing machine -> reads in text format -> bioinformatics -> human genome (2 x 800 Mb text)]
    • Sequencers vs Bioinformatics
      Sequencer throughput per machine: HiSeq 2000 v3: 55 billion bases per day (6 human genomes in 10 days); HiSeq 2000 v2: 18 billion bases per day; Roche GS FLX: 1 billion bases per day
      Bioinformatics must turn that output into finished genomes (2 x 800 Mb of text each)
      Scale up bioinformatics or pile up sequencer output
    • A bioinformatics pipeline
      Case: human exome, raw data = 1.1 billion reads of 2x100 bp, HiSeq 2000 v3, ½ run
      Demultiplex
      Sort indexed reads per sample
      Alignment
      Align reads per sample to the reference genome
      Variant Calling
      Compare the pileup of reads at a given locus to the reference, identify SNPs, insertions and deletions (a naive sketch of this step follows below)
      Annotation
      Annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
      Sequencing: 10 days
      Above pipeline: > 60 days on 1 CPU
      Scale up or pile up
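      The variant-calling step above can be illustrated with a deliberately naive sketch: at one locus, compare the bases piled up from the aligned reads against the reference base and call a SNP when a non-reference base is sufficiently frequent. This is not the Genomics Core pipeline; the function name and thresholds are illustrative, and real callers additionally model base qualities, mapping qualities and sequencing error.

      from collections import Counter

      def call_snp(ref_base, pileup_bases, min_depth=10, min_fraction=0.2):
          """Naive SNP call at one locus from the pileup of aligned read bases.

          Returns (alt_base, zygosity) or None. Thresholds are illustrative only.
          """
          depth = len(pileup_bases)
          if depth < min_depth:
              return None  # not enough coverage to call anything
          counts = Counter(pileup_bases)
          alt_base, alt_count = max(
              ((base, n) for base, n in counts.items() if base != ref_base),
              key=lambda item: item[1],
              default=(None, 0),
          )
          if alt_base is None or alt_count / depth < min_fraction:
              return None  # reads agree with the reference
          zygosity = "hom" if alt_count / depth > 0.8 else "het"
          return alt_base, zygosity

      # 12 reads cover the locus, 5 of them show a G where the reference has an A
      print(call_snp("A", list("AAAAAAAGGGGG")))  # -> ('G', 'het')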
    • Favourable race conditions
      Same task performed on many reads or loci
      FOR each of 1.1 billion indexed reads DO
      identify its sample
      FOR each of 3 billion human genome loci DO
      compare the locus in the aligned reads to the reference and identify homo- and heterozygous SNPs
      Results for one read/locus are independent of the results for other reads/loci
      Suggests a natural scale-up strategy …
    • Data parallelism
      Reads or loci partitioned among the nodes of a computer cluster
      Each node demultiplexes, aligns, etc. on its local partition
      Speed-up (near) linear in the number of cluster nodes
      [Diagram: variant calling over 3 billion human genome loci split into per-chromosome tasks (Chr1 … ChrY) spread over a cluster of 24 computers (nodes); a minimal sketch follows below]
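      A minimal sketch of this data-parallel strategy, using Python's multiprocessing module on one machine to stand in for a cluster: the genome is partitioned by chromosome and each worker processes its partition independently, so the speed-up is close to linear in the number of workers. The partitioning and the function name are illustrative, not the tools actually used at the Genomics Core.

      from multiprocessing import Pool

      def call_variants_in_region(region):
          """Process one partition (e.g. one chromosome) independently of all others."""
          chromosome, loci = region
          # ... load the aligned reads for this chromosome and call variants locus by locus ...
          return chromosome, len(loci)  # placeholder result

      if __name__ == "__main__":
          # One partition per chromosome; in practice one partition per cluster node.
          regions = [(f"chr{c}", range(1000)) for c in list(range(1, 23)) + ["X", "Y"]]
          with Pool(processes=24) as pool:  # 24 workers, like the 24-node cluster on the slide
              results = pool.map(call_variants_in_region, regions)
          print(dict(results))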
    • Data parallelism
      Demultiplexing one HiSeq 2000 flow cell, 1.1 billion reads at 1600 reads per second per node (the arithmetic is spelled out below):
      1 node, 1.1 billion reads (1 flow cell): 8 days
      8 nodes, 138 million reads each (8 lanes): 1 day
      384 nodes, 3 million reads each (384 tiles): ½ hour
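      These timings follow directly from the quoted single-node rate, assuming perfectly linear scaling; a few lines of arithmetic reproduce them:

      reads = 1.1e9   # reads in the half run
      rate = 1600     # reads per second on one node (from the slide)

      for nodes, per_unit, unit in [(1, 86400, "days"), (8, 86400, "days"), (384, 3600, "hours")]:
          seconds = reads / (rate * nodes)  # assumes perfectly linear scaling
          print(f"{nodes:4d} node(s): {seconds / per_unit:.1f} {unit}")
      # ->  1 node(s): 8.0 days / 8 node(s): 1.0 days / 384 node(s): 0.5 hours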

    • Favourable race conditions
      MapReduce: data parallelism made easy
      Developed and extensively used at Google
      Open source library (C++) takes care of
      Parallelization
      Fault Tolerance
      Data Distribution
      Load Balancing
      No knowledge of parallel systems required
      User implements functions Map() and Reduce()
    • MapReduce: demultiplex reads
      [Diagram: 8 lanes -> 8 Map tasks; each Map task sorts its lane's reads into Sample1 / Sample2 / Sample3; wait until all Map tasks have finished, then group the intermediate output per sample (Sample1 reads, Sample2 reads, Sample3 reads); one Reduce task per sample deduplicates its reads and writes Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz; a minimal sketch of this job follows below]
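      A minimal, single-machine sketch of this demultiplexing job in MapReduce terms (plain Python, not Google's C++ library and not CASAVA): each Map task takes one lane and emits (sample, read) pairs keyed on the read's index sequence; once all Map tasks have finished, one Reduce task per sample deduplicates that sample's reads and writes its output file. The index-to-sample table and the file layout are illustrative.

      import gzip
      from collections import defaultdict

      INDEX_TO_SAMPLE = {"ACGT": "Sample1", "TGCA": "Sample2", "GATC": "Sample3"}  # illustrative

      def map_lane(lane_reads):
          """Map task: sort one lane's reads into (sample, read) pairs by their index."""
          for index, read in lane_reads:
              sample = INDEX_TO_SAMPLE.get(index)
              if sample is not None:  # reads with an unknown index are simply dropped here
                  yield sample, read

      def reduce_sample(sample, reads):
          """Reduce task: deduplicate one sample's reads and write them to a gzipped file."""
          unique_reads = set(reads)
          with gzip.open(f"{sample}.fastq.gz", "wt") as out:
              for read in unique_reads:
                  out.write(read + "\n")
          return sample, len(unique_reads)

      def run_job(lanes):
          # Map phase: on a real cluster each lane would be handled by its own node.
          grouped = defaultdict(list)
          for lane in lanes:
              for sample, read in map_lane(lane):
                  grouped[sample].append(read)  # shuffle: group intermediate pairs by key
          # Reduce phase: starts only after every Map task has finished.
          return [reduce_sample(sample, reads) for sample, reads in grouped.items()]

      # Tiny usage example with two "lanes" of (index, read) tuples
      lanes = [[("ACGT", "read_A"), ("TGCA", "read_B")],
               [("ACGT", "read_A"), ("GATC", "read_C")]]
      print(run_job(lanes))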
    • Favourable race conditions
      GATK: MapReduce for sequencing projects
      Genome Analysis Toolkit
      Developed by and used extensively at the Broad Institute (Harvard and MIT)
      Open source, Java 1.6 framework
      Provides common data access patterns
      Traversal by read
      Traversal by locus
    • Favourable race conditions
      Data parallelism supported by many (open source) bioinformatics tools
      Number of nodes is a parameter
      Full analysis pipelines widely available
      GATK
      CASAVA

    • Conclusion
      Data parallelism is key
      Scale up by buying extra cluster nodes
      The Genomics Core recently added 400 (shared) nodes
      Canned solutions for common bioinformatics tasks
      Established programming frameworks for custom solutions
      MapReduce
      GATK
    • Conclusion
      Bioinformaticians enjoy favourable conditions for keeping pace with the sequencers …
      [Diagram: human genome (2 x 3 billion bases) -> next generation sequencing machine -> reads in text format -> bioinformatics using data parallelism -> human genome (2 x 800 Mb text); a hypothetical final generation sequencing machine would make the bioinformatics step unnecessary]
    • … until made redundant by the final generation