Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core

by Dr. Luc Dehaspe - Genomics Core, UZ Leuven

To grow and function, living organisms unconsciously and continuously read instructions from the DNA sequence in each cell. Thanks to advances in DNA sequencing technology, scientists are increasingly able to consciously read along. In 2001, sequencing efforts produced the first draft of the human genome. Since then, the capacity of DNA reading machines has doubled every six months on average. While the first human genome sequencing project took years of worldwide collaboration, multiple genomes can now be sequenced in 10 days on a single machine at a service facility such as the Genomics Core.
Each sequencing run produces a few terabytes of raw data that, using bioinformatics techniques, must be processed in time, before the next batch of data arrives.
I will discuss bioinformatics techniques that are commonly used at the Genomics Core and that have a chance of surviving another generation of sequencing machines. A crucial feature of these techniques is that they keep up with the sequencing machines by creating sub-tasks that are distributed over an extensible network of computers.


Transcript

  • 1. Luc Dehaspe
    Genomics Core, UZ Leuven
    WOUD – Onderzoeksgroep Associatie Universiteit Gent - 28 Sept 2011
    Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
  • 2. DNA sequencing
    determines the order of nucleotide bases in a genome
    DNA replication machinery
    Human Genome
    2 x 3 billion bases
    Human Genome
    2 x 3 billion bases
    hours
    Sequencing machine
    Final Generation Sequencing machine
    Computer's
    copy function
    Human Genome
    2 x 800 Mb text
    Human Genome
    2 x 800 Mb text
    minutes
  • 3. Next generation sequencing
    Quality deteriorates after 100-1000 base pairs
    Solution:
    Cut genomes into readable fragments
    Sequence fragments -> reads
    Use bioinformatics to reconstruct genomes from reads
    Human Genome
    2 x 3 billion bases
    Next Generation Sequencing machine
    Reads in text format
    bioinformatics
    Human Genome
    2 x 800 Mb text
  • 4. Sequencers vs Bioinformatics
    Human Genome
    2 x 3 billion bases
    HiSeq 2000 v3
    HiSeq 2000 v2
    Roche GS FLX
    55 billion bases
    per day
    6 Human Genomes in 10 days
    18 billion bases
    per day
    1 billion bases per day
    bioinformatics
    Scale up bioinformatics or
    pile up sequencer output
    Human Genome
    2 x 800 Mb text
  • 5. Case: Human Exome, raw data = 1.1 billion reads 2 x 100 bp, HiSeq 2000 v3, ½ run
    Bioinformatics pipeline
    Demultiplex
    Sort indexed reads per sample
    Alignment
    Align reads per sample to reference genome
  • 6. Case: Human Exome, raw data = 1.1 billion reads 2 x 100 bp, HiSeq 2000 v3, ½ run
    Bioinformatics pipeline
    Demultiplex
    Sort indexed reads per sample
    Alignment
    Align reads per sample to reference genome
    Variant Calling
    Compare pileup of reads at given locus to reference, identify SNPs, insertions and deletions
  • 7. A bioinformatics pipeline
    Case: Human Exome, raw data = 1.1 billion reads 2 x 100 bp, HiSeq 2000 v3, ½ run
    Demultiplex
    Sort indexed reads per sample
    Alignment
    Align reads per sample to reference genome
    Variant Calling
    Compare to reference, identify SNPs, insertions and deletions
    Annotation
    Annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
    Sequencing: 10 days
    Above pipeline: > 60 days on 1 CPU
    Scale up or pile up
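    The stage order above (demultiplex → align → call variants) can be sketched as a chain of functions. This is a deliberately tiny toy, not the Genomics Core's actual tools: reads are single bases tagged with a barcode, the reference is a short string, and each function is a stub standing in for a real program.

    ```python
    def demultiplex(reads):
        # Demultiplex: sort indexed reads into per-sample bins by barcode.
        samples = {}
        for barcode, base in reads:
            samples.setdefault(barcode, []).append(base)
        return samples

    def align(samples):
        # Toy "alignment": place each sample's reads at consecutive loci.
        return {s: list(enumerate(bases)) for s, bases in samples.items()}

    def call_variants(alignments, reference):
        # Variant calling: report loci where the aligned base differs
        # from the reference base at that position.
        return {s: [(pos, base) for pos, base in placed if base != reference[pos]]
                for s, placed in alignments.items()}

    reads = [("AC", "G"), ("AC", "A"), ("TT", "C")]
    result = call_variants(align(demultiplex(reads)), reference="AAA")
    # result: {"AC": [(0, "G")], "TT": [(0, "C")]}
    ```

    Each stage consumes only the previous stage's output, which is what lets the real pipeline split the work per sample and per locus.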
  • 8. Favourable race conditions
    Same task performed on many reads or loci
    FOR 1.1 billion indexed reads DO
    Identify sample
    FOR 3 billion Human Genome loci DO
    Compare locus in aligned reads to reference and identify homo- and heterozygous SNPs
    Results for one read/locus independent of results for other reads/loci
    Suggests natural scale-up strategy …
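    The independence claim in code: the per-read task is a pure function of the read alone, so processing order (and therefore which node handles which read) cannot change the result. The `identify_sample` logic and barcodes here are hypothetical.

    ```python
    def identify_sample(read):
        # A read's sample depends only on its own barcode,
        # never on any other read.
        barcode, _sequence = read
        return barcode

    reads = [("S1", "ACGT"), ("S2", "TTGA"), ("S1", "GGCC")]
    in_order = [identify_sample(r) for r in reads]
    reversed_order = [identify_sample(r) for r in reversed(reads)]
    assert in_order == list(reversed(reversed_order))  # order does not matter
    ```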
  • 9. Data parallelism
    Reads or loci partitioned among nodes of computer cluster
    Each node demultiplexes, aligns, etc. on its local partition
    Speed-up (near) linear in number of cluster nodes
    Variant calling: 3 billion Human Genome loci
    Variant calling Chr1
    Variant calling ChrY
    Cluster of 24 computers (nodes)
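    A minimal simulation of this per-chromosome partitioning, with Python worker processes standing in for cluster nodes. The mismatch count is a toy task, not real variant calling.

    ```python
    from multiprocessing import Pool

    def count_mismatches(partition):
        # One "node" processes its chromosome partition independently.
        chrom, pairs = partition  # pairs: (reference_base, observed_base) per locus
        return chrom, sum(1 for ref, obs in pairs if ref != obs)

    loci = {
        "chr1": [("A", "A"), ("C", "T"), ("G", "G")],
        "chrY": [("T", "T"), ("A", "G")],
    }

    if __name__ == "__main__":
        # Each worker gets one partition; results merge at the end.
        with Pool(processes=2) as pool:
            results = dict(pool.map(count_mismatches, loci.items()))
        # results: {"chr1": 1, "chrY": 1}
    ```

    Because no partition reads another's data, adding workers (or real cluster nodes) scales the throughput near-linearly, as the slide states.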
  • 10. Data parallelism
    Demultiplex HiSeq 2000 microplate
    1 node, 1.1 billion reads (1 microplate): 1600 reads per second, 8 days
    8 nodes, each 138 million reads (8 lanes): 1 day
    384 nodes, each 3 million reads (384 tiles): ½ hour
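    A back-of-the-envelope check of the slide's times, assuming the stated fixed throughput of 1,600 reads per second per node:

    ```python
    RATE = 1600  # reads per second, per node (from the slide)

    def wall_time_days(total_reads, nodes):
        # Perfect data parallelism: each node handles an equal share.
        return total_reads / nodes / RATE / 86400  # 86400 seconds per day

    one_node = wall_time_days(1_100_000_000, 1)      # whole microplate
    eight_nodes = wall_time_days(1_100_000_000, 8)   # one lane per node
    per_tile = wall_time_days(1_100_000_000, 384)    # one tile per node

    # ≈ 8.0 days, ≈ 1.0 day, ≈ 0.5 hours respectively
    ```

    The numbers match: 1.1 billion / 8 ≈ 138 million reads per lane and 1.1 billion / 384 ≈ 3 million reads per tile, exactly the per-node workloads the slide lists.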

  • 11. Favourable race conditions
    MapReduce: data parallelism made easy
    Developed and extensively used at Google
    Open source library (C++) takes care of
    Parallelization
    Fault Tolerance
    Data Distribution
    Load Balancing
    No knowledge of parallel systems required
    User implements functions Map() and Reduce()
  • 12. MapReduce: demultiplex reads
    8 lanes
    8 Map tasks
    Map: sort reads
    Map: sort reads
    Sample1
    Sample3
    Sample2
    Sample1
    Sample3
    Sample2
    Wait until map has finished
    Sample1 reads
    Sample3 reads
    Sample2 reads
    Reduce: deduplicate reads
    Reduce: deduplicate reads
    Reduce: deduplicate reads
    Sample1.fastq.gz
    Sample3.fastq.gz
    Sample2.fastq.gz
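    The slide's three phases (map per lane, shuffle by sample, reduce per sample) in plain Python; this mimics the MapReduce flow in-process rather than using the actual library, and the lane data is made up.

    ```python
    from collections import defaultdict

    def map_lane(lane_reads):
        # Map: emit (sample, read) pairs for one lane.
        return [(barcode, seq) for barcode, seq in lane_reads]

    def reduce_sample(reads):
        # Reduce: deduplicate one sample's reads, preserving order.
        seen, unique = set(), []
        for r in reads:
            if r not in seen:
                seen.add(r)
                unique.append(r)
        return unique

    lanes = [
        [("Sample1", "ACGT"), ("Sample2", "TTAA")],
        [("Sample1", "ACGT"), ("Sample3", "GGCC")],  # duplicate Sample1 read
    ]

    # Shuffle phase: group mapper output by sample
    # (runs only after all map tasks have finished).
    grouped = defaultdict(list)
    for lane in lanes:
        for sample, seq in map_lane(lane):
            grouped[sample].append(seq)

    deduped = {sample: reduce_sample(seqs) for sample, seqs in grouped.items()}
    # deduped: {"Sample1": ["ACGT"], "Sample2": ["TTAA"], "Sample3": ["GGCC"]}
    ```

    In the real library the map and reduce calls run on different machines; the user-visible contract is just the two functions, which is the slide's point.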
  • 13. Favourable race conditions
    GATK: MapReduce for sequencing projects
    Genome Analysis Toolkit
    Developed by and used extensively at Broad Institute (Harvard and MIT)
    Open source, Java 1.6 framework
    Provides common data access patterns
    Traversal by read
    Traversal by locus
  • 14. Favourable race conditions
    Data parallelism supported by many (open source) bioinformatics tools
    Number of nodes is a parameter
    Full analysis pipelines widely available
    GATK
    CASAVA

  • 15. Conclusion
    Data parallelism is key
    Scale up by buying extra cluster nodes
    Genomics Core recently added 400 nodes (shared)
    Canned solutions for common bioinformatics tasks
    Established programming frameworks for custom solutions
    MapReduce
    GATK
  • 16. Conclusion
    Bioinformaticians enjoy favourable conditions for keeping pace with the sequencer …
    Human Genome
    2 x 3 billion bases
    Next Generation Sequencing machine
    Final Generation
    Sequencing machine
    Reads in text format
    Bioinformatics using data parallelism
    Human Genome
    2 x 800 Mb text
    • … until made redundant by the final generation