Initial steps towards a production platform
   for DNA sequence analysis on the grid

           ISMB/ECCB conference – 18 July 2011

      Barbera van Schaik, Angela Luyf, Michel de Vries,
   Frank Baas, Antoine van Kampen and Silvia Olabarriaga

                b.d.vanschaik@amc.uva.nl
Overview

Grid computing and workflow technology
        Example: Virus discovery

     Analysis of larger data sets
 Example: Genome of the Netherlands

       Challenges and summary
Sequencing, Moore’s law and personnel



[Figure; axis label: Acceleration]
Note: Only slope is meaningful in this graph




                  http://www.politigenomics.com/2009/02/the-scale-up.html
What are the options?
Local cluster
Desktop grid
Supercomputer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national grid
DNA computing

Each system has its own interface
Need to learn how they all work

National computing facilities
Grids
    Distributed resources

             Computing
             Data storage


    Open protocols


    It's all about sharing

             Resources
             Methods
             Collaborations
Dutch grid (resources)








http://www.biggrid.nl/
People, resources and data flow

[Diagram labels: Sequence facility, Research laboratories, Bioinformatics NGS team, e-BioScience team, grid; one element is marked "My role"]
Example: Virus discovery
[Figure labels: VIDISCA method; Virus discovery unit; experiments exp1, exp2, exp3, exp6; GenBank - NR]

Goal: Identify known and discover new viruses in samples
Michel de Vries et al (2011) PLoS ONE
BLAST analysis workflow

    Input: sequence reads


 Conversion step (sff to fasta)


            BLAST


    Output: BLAST results
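One way to picture this workflow is as a chain of steps in which each step's output format must match the next step's input format. The sketch below is only an illustration of that idea, not the workflow engine used in this work. The `Step` class and `check_chain` function are hypothetical names, and the "txt" output format for BLAST is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One workflow step, described only by the file formats it consumes and produces."""
    name: str
    input_format: str
    output_format: str


def check_chain(steps):
    """Return (input format, output format) of the whole chain.

    Raises ValueError if a step's output does not match the next step's input.
    """
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.output_format != downstream.input_format:
            raise ValueError(
                f"{upstream.name} produces {upstream.output_format}, "
                f"but {downstream.name} expects {downstream.input_format}"
            )
    return steps[0].input_format, steps[-1].output_format


workflow = [
    Step("conversion", "sff", "fasta"),
    Step("BLAST", "fasta", "txt"),  # assumption: BLAST results stored as plain text
]

print(check_chain(workflow))
```

Describing steps by their formats makes it cheap to detect a mis-ordered or mis-wired workflow before any job is submitted.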
Implementation of workflow components
Workflow description (XML)

Component 1 (XML)
    In: sequences (sff)
    Executable/script: sff2fasta.pl
    Out: sequences (fasta)

Component 2 (XML)
    In: sequences (fasta)  X  In: database (fasta)
    Executable/script: BLAST
    Out: blast result (txt)

Tristan Glatard (2008) Future generation computer systems
http://gwendia.i3s.unice.fr/doku.php?id=gwendia
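The "X" between the two inputs of Component 2 presumably denotes a cross-product iteration strategy: every sequence file is combined with every database, and each pair can run as an independent BLAST job. Below is a minimal Python sketch of that pairing. The file and database names are hypothetical placeholders, and the function is an illustration rather than part of the workflow language itself.

```python
from itertools import product


def cross_product_jobs(sequence_files, databases):
    """Pair every sequence file with every database (cross-product iteration)."""
    return [
        {"query": seq, "database": db}
        for seq, db in product(sequence_files, databases)
    ]


# Hypothetical names, for illustration only.
sequence_files = ["sample_001.fasta", "sample_002.fasta", "sample_003.fasta"]
databases = ["human_ribosomal", "viruses"]

jobs = cross_product_jobs(sequence_files, databases)
for job in jobs:
    print(job["query"], "x", job["database"])
```

With 3 files and 2 databases this yields 6 independent pairs, which is what makes the analysis easy to spread over many grid nodes.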
Run workflow on the grid




Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology In Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
Graphical user interface: VBrowser




                                     http://www.vl-e.nl/vbrowser
Workflow monitoring
Speed up
[Figure: experiments (exp1, exp2, exp3, exp6) -> Blast]

15 experiments
722 samples
2 databases: Human ribosomal, Viruses

Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs
= 30x speed up
Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
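The speed-up figure follows directly from the two reported times. A small worked check in Python, reading "speed up" as total CPU time divided by elapsed workflow time (an assumption about how the figure was derived):

```python
def speed_up(total_cpu_hours, elapsed_hours):
    """Ratio of summed CPU time to wall-clock time of the workflow."""
    return total_cpu_hours / elapsed_hours


total_cpu_hours = 413.0  # reported total CPU time
elapsed_hours = 13.7     # reported elapsed time of the workflow

factor = speed_up(total_cpu_hours, elapsed_hours)
cpu_days = total_cpu_hours / 24

print(f"CPU time: {cpu_days:.1f} days")
print(f"Speed up: {factor:.1f}x")
```

The ratio comes out at about 30.1, consistent with the rounded 30x, and 413 hours is about 17.2 days.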
Benefits workflow technology

         Agile development

       Re-use of components

         Iteration strategy

      Knowledge about analysis
     steps captured in workflow
Analysis of larger data sets
Genome of the Netherlands (GoNL)

Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies

770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites

http://www.bbmri.nl/
http://www.nlgenome.nl/
GoNL alignment pipeline
Input: Pair1.fastq, Pair2.fastq, Reference genome

    BWA aln, sampe, sam-to-bam, sort bam, index
    Picard mark duplicates
    GATK realignment
    Picard fix mates
    GATK recalibration

Output: Result.bam

160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests: Nov 22, 2010 - now
Analysis: Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB

 Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
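The figures on this slide allow a rough back-of-envelope estimate of the average CPU time per job and the average number of cores kept busy. The sketch below assumes a year of 365.25 days, that the 5.5 CPU years refer to the Mar 25 - Jul 15, 2011 analysis window, and that the window ran without interruption. None of these assumptions is stated in the slides, so the results are indicative only.

```python
from datetime import date

HOURS_PER_YEAR = 365.25 * 24  # assumption: Julian year

total_cpu_hours = 5.5 * HOURS_PER_YEAR  # reported: 5.5 CPU years
jobs = 13_981                           # reported number of jobs

# Assumption: the analysis window was used continuously.
window_days = (date(2011, 7, 15) - date(2011, 3, 25)).days

cpu_hours_per_job = total_cpu_hours / jobs
average_busy_cores = total_cpu_hours / (window_days * 24)

print(f"Analysis window: {window_days} days")
print(f"Average CPU time per job: {cpu_hours_per_job:.1f} hours")
print(f"Average cores in use: {average_busy_cores:.1f}")
```

Under these assumptions a job averages a few CPU hours, and the grid was on average doing the work of a modest number of continuously busy cores; peak usage may have been considerably higher.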
Challenges

•   Error handling
•   Data management
•   Data protection
•   Provenance tracking
•   Transparent addition of other resources
Summary
More research and development needed in e-bioscience

Latest IT infrastructures needed for scaling up NGS data
   analysis (grids, clouds, big clusters)

Workflow technology assists agile implementation of
 bioinformatics software

Separate workflow development from IT infrastructure for
  easier migration and expansion (middleware)
Acknowledgements
Genome of the Netherlands, NL
    Cisca Wijmenga
    Morris Swertz
    All project partners

Virus discovery unit, AMC
    Lia van der Hoek
    Michel de Vries

Department of genome analysis, AMC
    Frank Baas
    Ted Bradley
    Marja Jakobs

University of Amsterdam
    Piter de Boer

BiG Grid
    Jan Just Keijser
    Tom Visser
    Grid support

Modalis, France
    Johan Montagnat

Creatis, France
    Tristan Glatard

Bioinformatics Laboratory, AMC
    Antoine van Kampen

NGS bioinformatics team
    Aldo Jongejan
    Marcel Willemsen

e-Bioscience team
    Silvia Olabarriaga
    Angela Luyf
    Mark Santcroos
    Shayan Shahand




                       http://www.bioinformaticslaboratory.nl/
BWA on grid – component description




BWA on grid – component description




BWA on grid – workflow description




e-BioInfra gateway
http://orange.ebioscience.amc.nl/ebioinfragateway/
No grid certificate needed
Data upload via sFTP (intranet)
Synced with grid storage
Workflows are started from web page
Implemented workflow components
       for next generation sequencing

Existing software
• BLAST
• BLAT
• BWA
• Annovar
• Varscan
• Newbler
• FastQC
• Roche software
• GATK
• Picard
• Samtools

In-house software
• Data format converters
• Quality trimming
• Alternative splice product detection
• CDR3 detection (T- and B-cell variation)
• Genome comparison (small genomes)
