Initial steps towards a production platform for DNA sequence analysis on the grid
Presented at the ISMB/ECCB 2011 conference. https://www.iscb.org/cms_addon/conferences/ismbeccb2011/highlights.php#HL13

Published in: Technology

  1. Initial steps towards a production platform for DNA sequence analysis on the grid. ISMB/ECCB conference, 18 July 2011. Barbera van Schaik, Angela Luyf, Michel de Vries, Frank Baas, Antoine van Kampen and Silvia Olabarriaga. b.d.vanschaik@amc.uva.nl
  2. Overview
     • Grid computing and workflow technology. Example: Virus discovery
     • Analysis of larger data sets. Example: Genome of the Netherlands
     • Challenges and summary
  3. Sequencing, Moore’s law and personnel [graph; annotation: Acceleration]. Note: Only slope is meaningful in this graph. http://www.politigenomics.com/2009/02/the-scale-up.html
  4. What are the options?
     • Local cluster
     • Desktop grid
     • Super computer
     • Hadoop cluster
     • GPU cluster
     • Cloud computing
     • (Inter)national Grid
     • DNA computing
     • National computing facilities
     Each system has its own interface. Need to learn how they all work.
  5. Grids
     • Distributed resources: Computing, Data storage
     • Open protocols
     • It's all about sharing: Resources, Methods, Collaborations
  6. Dutch grid (resources). http://www.biggrid.nl/
  7. People, resources and data flow [diagram labels: Sequence facility; Research laboratories; Bioinformatics NGS team; e-BioScience team; grid; My role]
  8. Example: Virus discovery [diagram labels: VIDISCA method; Virus discovery unit; experiments exp1, exp2, exp3, exp6; GenBank - NR]. Goal: Identify known and discover new viruses in samples. Michel de Vries et al (2011) PloS one
  9. BLAST analysis workflow
     • Input: sequence reads
     • Conversion step (sff to fasta)
     • BLAST
     • Output: BLAST results
  10. Implementation of workflow components
     • Workflow description (XML)
     • Component 1 (XML). In: sequences (sff). Executable/script: sff2fasta.pl. Out: sequences (fasta)
     • Component 2 (XML). In: sequences (fasta) X In: database (fasta). Executable/script: BLAST. Out: blast result (txt)
     Tristan Glatard (2008) Future generation computer systems. http://gwendia.i3s.unice.fr/doku.php?id=gwendia
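The two components on this slide can be read as a small plan: convert each sff file to fasta, then pair every converted file with every database. The sketch below only enumerates that plan and runs nothing. It assumes the "X" between the two inputs of component 2 denotes a cross product of sequence files and databases; the file names, the output naming rule and the function names are hypothetical and do not come from the slides.

```python
from itertools import product
from pathlib import PurePosixPath


def to_fasta_name(sff_file: str) -> str:
    """Name of the fasta file that component 1 (sff2fasta.pl) would write.

    Hypothetical naming rule: same file stem with a .fasta extension.
    """
    return str(PurePosixPath(sff_file).with_suffix(".fasta"))


def plan_blast_jobs(sff_files, databases):
    """List one BLAST job per (converted sequence file, database) pair."""
    fasta_files = [to_fasta_name(f) for f in sff_files]
    jobs = []
    # Cross product: every sequence file is paired with every database.
    for query, database in product(fasta_files, databases):
        result = f"{PurePosixPath(query).stem}_vs_{PurePosixPath(database).stem}.txt"
        jobs.append({"query": query, "database": database, "result": result})
    return jobs


if __name__ == "__main__":
    # Illustrative inputs only.
    example = plan_blast_jobs(
        ["exp1.sff", "exp2.sff"],
        ["human_ribosomal.fasta", "viruses.fasta"],
    )
    for job in example:
        print(job)
```

Each entry in the returned list is independent of the others, which is what makes this kind of workflow easy to spread over many grid nodes.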
  11. Run workflow on the grid. Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology in Biomedicine. Tristan Glatard (2008) International Journal of High Performance Computing Applications
  12. Graphical user interface: VBrowser. http://www.vl-e.nl/vbrowser
  13. Workflow monitoring
  14. Speed up [diagram labels: experiments exp1, exp2, exp3, exp6; BLAST]
     • 2 databases: Human ribosomal, Viruses
     • 15 experiments
     • 722 samples
     • Total CPU time: 413 hrs (~17 days)
     • Elapsed time workflow: 13.7 hrs = 30x speed up
     Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
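The 30x figure follows directly from the two times on this slide. A minimal check, assuming the speed-up is defined as total CPU time divided by elapsed workflow time (the function names are illustrative, not from the slides):

```python
def speedup(total_cpu_hours: float, elapsed_hours: float) -> float:
    """Ratio of summed CPU time to wall-clock time of the workflow."""
    return total_cpu_hours / elapsed_hours


def hours_to_days(hours: float) -> float:
    """Convert hours to days."""
    return hours / 24.0


if __name__ == "__main__":
    total_cpu_hours = 413.0  # from the slide
    elapsed_hours = 13.7     # from the slide
    print(f"speed-up: {speedup(total_cpu_hours, elapsed_hours):.1f}x")
    print(f"CPU time in days: {hours_to_days(total_cpu_hours):.1f}")
```

This treats the 413 CPU hours as the time a single core would have needed, and ignores queueing and data transfer overhead, so it is a rough comparison rather than a benchmark.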
  15. Benefits workflow technology
     • Agile development
     • Re-use of components
     • Iteration strategy
     • Knowledge about analysis steps captured in workflow
  16. Analysis of larger data sets: Genome of the Netherlands (GoNL)
     • Whole genome sequencing of 250 trios
     • Enrich biobanks
     • Reference set for disease studies
     • 770 samples
     • 45 TB raw data
     • Many partners (data sharing)
     • Analysis on distributed sites
     http://www.bbmri.nl/ http://www.nlgenome.nl/
  17. GoNL alignment pipeline
     • Input: Pair1.fastq, Pair2.fastq, Reference genome
     • BWA aln, sampe, sam-to-bam, sort bam, index
     • Picard mark duplicates
     • GATK realignment
     • Picard fix mates
     • GATK recalibration
     • Output: Result.bam
     160 samples (478 lanes) are currently analyzed on the Dutch grid. Development and small tests: Nov 22, 2010 - now. Analysis: Mar 25, 2011 - Jul 15, 2011. Jobs: 13,981. Total CPU time: 5.5 years. Disk space used: 315 TB. Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
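The totals on this slide can be turned into rough per-unit averages. The sketch below is back-of-the-envelope arithmetic only; it assumes a 365-day year and that the 13,981 jobs and 5.5 CPU years cover the same 478 lanes, which the slide does not state explicitly. The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year


def lanes_per_sample(lanes: int, samples: int) -> float:
    """Average number of sequencing lanes per sample."""
    return lanes / samples


def cpu_hours_per_job(total_cpu_years: float, jobs: int) -> float:
    """Average CPU hours consumed by one grid job."""
    return total_cpu_years * HOURS_PER_YEAR / jobs


def jobs_per_lane(jobs: int, lanes: int) -> float:
    """Average number of grid jobs submitted per lane."""
    return jobs / lanes


if __name__ == "__main__":
    # Figures taken from the slide.
    samples, lanes, jobs, cpu_years = 160, 478, 13_981, 5.5
    print(f"lanes per sample:  {lanes_per_sample(lanes, samples):.2f}")
    print(f"CPU hours per job: {cpu_hours_per_job(cpu_years, jobs):.2f}")
    print(f"jobs per lane:     {jobs_per_lane(jobs, lanes):.1f}")
```

The job count may include retries or test runs, so these averages indicate scale rather than the cost of a single pipeline step.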
  18. Challenges
     • Error handling
     • Data management
     • Data protection
     • Provenance tracking
     • Transparent addition of other resources
  19. Summary
     • More research and development needed in e-bioscience
     • Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)
     • Workflow technology assists agile implementation of bioinformatics software
     • Separate workflow development from IT infrastructure for easier migration and expansion (middleware)
  20. Acknowledgements
     • Genome of the Netherlands, NL: Cisca Wijmenga, Morris Swertz, All project partners
     • Virus discovery unit, AMC: Lia van der Hoek, Michel de Vries
     • Department of genome analysis, AMC: Frank Baas, Ted Bradley, Marja Jakobs
     • University of Amsterdam: Piter de Boer
     • BiG Grid: Jan Just Keijser, Tom Visser, Grid support
     • Modalis, France: Johan Montagnat
     • Creatis, France: Tristan Glatard
     • Bioinformatics Laboratory, AMC: Antoine van Kampen
     • NGS bioinformatics team: Aldo Jongejan, Marcel Willemsen
     • e-Bioscience team: Silvia Olabarriaga, Angela Luyf, Mark Santcroos, Shayan Shahand
     http://www.bioinformaticslaboratory.nl/
  21. BWA on grid – component description
  22. BWA on grid – component description
  23. BWA on grid – workflow description
  24. e-BioInfra gateway. http://orange.ebioscience.amc.nl/ebioinfragateway/
     • No grid certificate needed
     • Data upload via sFTP (intranet)
     • Synced with grid storage
     • Workflows are started from web page
  25. Implemented workflow components for next generation sequencing
     Existing software
     • BLAST
     • BLAT
     • BWA
     • Annovar
     • Varscan
     • Newbler
     • FastQC
     • Roche software
     • GATK
     • Picard
     • Samtools
     In-house software
     • Data format converters
     • Quality trimming
     • Alternative splice product detection
     • CDR3 detection (T- and B-cell variation)
     • Genome comparison (small genomes)
