Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands (Morris Swertz)

  • Phase information; accurate haplotypes Better characterization of Structural Variation Detection of de novo variants and new mutation rates

Transcript

  • 1. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands. Morris Swertz, UMC Groningen, Netherlands, and members of BBMRI-NL, NBIC, MOLGENIS. BOSC 2011, July 15, Vienna
  • 2. At BOSC 2010 we demonstrated the MOLGENIS software toolkit
    • Model (xml)
    • Generator (java)
    • Use (web): Animal Observatory, NextGenSeq, Mutation database, Model organisms
    Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 3. Get stuff for free as others build it already
    • Connect to annotation services
    • Plugin rich analysis tools
    • Connect to statistics
    • UML documentation of your model
    • Edit & trace your data
    • Import/export to Excel
    find.investigation() 102 downloaded
    obs<-find.observedvalue( 43,920 downloaded
    #some calculation
    add.inferredvalue(res) 36 added
  • 4. Three steps: Model –> Generate –> Use Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 5. Three steps: Model –> Generate –> Use
    9200 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainProtocolsForm.java
    9293 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainProtocolsProtocolMenuParametersForm.java
    9325 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainProtocolsProtocolMenuProtocolComponentsForm.java
    9496 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesOntologyTermsForm.java
    9528 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesOntologySourcesForm.java
    9606 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesOntologySourcesOntologyTermsForm.java
    9638 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesCodeListsForm.java
    9700 INFO [FormScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesCodeListsCodesForm.java
    9965 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMenu.java
    10012 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainMenu.java
    10059 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainInvestigationsInvestigationMenuMenu.java
    10152 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainInvestigationsInvestigationMenuProtocolApplicationsProtocolApplicationMenuMenu.java
    10230 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainObservationTargetsMenu.java
    10293 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainProtocolsProtocolMenuMenu.java
    10324 INFO [MenuScreenGen] generated generatedjavauiscreenTopMenuMainOntologiesMenu.java
    11354 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuMainReportPlugin.java
    11557 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuMainOntologiesOntologyManagerPlugin.java
    11604 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuModel_documentationPlugin.java
    11604 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuRprojectApiPlugin.java
    11620 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuHttpApiPlugin.java
    11635 INFO [PluginScreenGen] generated Molgenis33Workspacemolgenis4phenotypegeneratedjavauiscreenTopMenuWebServicesApiPlugin.java
    11651 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavapluginreportInvestigationOverview.ftl
    11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavapluginOntologyBrowserOntologyBrowserPlugin.ftl
    11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuDocumentationScreen.ftl
    11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuRprojectApiScreen.ftl
    11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuHttpAPiScreen.ftl
    11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuSoapApiScreen.ftl
    11854 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavapluginreportInvestigationOverview.java
    12057 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavapluginOntologyBrowserOntologyBrowserPlugin.java
    12072 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuDocumentationScreen.java
    12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuRprojectApiScreen.java
    12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuHttpAPiScreen.java
    12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwrittenjavaplugintopmenuSoapApiScreen.java
    12103 INFO [MolgenisServletContextGen] generated WebContentMETA-INFcontext.xml
    12259 INFO [SoapApiGen] generated generatedjavauiSoapApi.java
    12353 INFO [CsvExportGen] generated generatedjavatoolsCsvExport.java
    12431 INFO [CsvImportByNameGen] generated generatedjavatoolsCsvImportByName.java
    12636 INFO [CopyMemoryToDatabaseGen] generated generatedjavauitoolsCopyMemoryToDatabase.java
    Real example: Generates 150 files, 30k lines of Java, MySQL, CXF, Tomcat config, and R code + docs
  • 6. Three steps: Model –> Generate –> Use Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 7. Currently: Towards an integrated app suite XGAP for GWAS/GWL Disease specific databases BBMRI biobank catalogue GWAS central data manager NGS cyber infrastructure MAGE-TAB microarray AnimalDB Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 8. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
    • Background: Genome of the Netherlands project
        • Why: create a Dutch genetic hapmap to find rarer variants
        • Aim: genome sequence of 1000 chromosomes (12x)
    • Challenge: analyze 2250 Illumina lanes
        • Alignment and SNP calling for 760 samples
        • Data handling, QC, reports, etc
    • Solution: NGS software/hardware infrastructure
        • GPFS storage for >100TB of data files
        • Template system for compute protocols
        • Generators to automatically produce analysis scripts
        • MOLGENIS to run and track inputs, analyses, output data
    • Demo movie
    • Conclusion
  • 10. Motivation: GWAS revolution in human genetics
  • 15. GREAT! Ankylosing Spondylitis Celiac Disease Crohn’s disease Multiple Sclerosis Psoriasis Rheumatoid Arthritis Systemic Lupus Erythematosus Type 1 Diabetes Ulcerative Colitis
  • 16. BUT … these explain a small part of heritability
  • 17. Missing heritability? Where might it be hiding?
  • 18. However: Sequencing candidate loci implicates unknown (rare) variants
  • 19. More insight into the specific genetic architecture of individual populations is crucial First analysis of 1000G project data Durbin et al., Nature 2010 common known
  • 20. More insight into the specific genetic architecture of individual populations is crucial First analysis of 1000G project data shows that the majority of the newly identified and rare variants are population specific (and there are no Dutch in 1000G) Durbin et al., Nature 2010 common known new
  • 21.
    • Genome of the Netherlands (GoNL):
    • Unique family-based design: 250 trios
      • 230 x 2 parents – 1 offspring
      • 10 x 2 parents – 2 offspring
      • 10 x 2 parents – 1 MZ twin offspring
    • Immunochip microarray data for QC
    • Specifications:
      • Families equally distributed over the Dutch provinces
      • Genomic DNA, paired-end sequencing on HiSeq2000, 12x coverage
      • Trios allow phase information; accurate haplotypes
      • Other results: structural variation, detection of de novo variants
    Idea 1: sequence 1000 independent Dutch chromosomes Biobanks * analysis teams
  • 22. Idea 2: let's impute the existing GWAS data of 100,000 Dutch samples. Imputation is the process of inferring missing or untyped genetic variants from typed flanking variants, based on the known local LD relationship.
  • 23. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
    • Background: Genome of the Netherlands project
        • Why: create a Dutch genetic hapmap to find rarer variants
        • Aim: genome sequence of 1000 chromosomes (12x)
    • Challenge: analyze 2250 Illumina lanes
        • Alignment and SNP calling for 760 samples
        • Data handling, QC, reports, etc
    • Solution: NGS software/hardware infrastructure
        • GPFS storage for >100TB of data files
        • Template system for compute protocols
        • Generators to automatically produce analysis scripts
        • MOLGENIS to run and track inputs, analyses, output data
    • Demo movie
    • Conclusion
  • 24. GoNL: sequence 1000 independent Dutch chromosomes
    • Sequence analysis
    • 230 trios (690)
    • 10 quartets (40)
    • 10 MZ twin (40)
    • Immunochip GWAS data for QC (UMCG)
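The sample counts in parentheses, and the "1000 independent chromosomes" in the slide title, can be checked with a short calculation. This is only a sketch: the dictionary keys are illustrative names, and it assumes that only the two parents of each family are unrelated and that each parent contributes two copies of every autosome.

```python
# Families per design and sequenced individuals per family, as listed on the slide.
design = {"trio": (230, 3), "quartet": (10, 4), "mz_twin": (10, 4)}

# Sequenced individuals per design, matching the numbers in parentheses.
samples = {name: n * size for name, (n, size) in design.items()}
total_samples = sum(samples.values())

# Assumption: only the two parents per family are unrelated, and each parent
# carries two copies of every autosome; offspring add no independent copies.
n_families = sum(n for n, _ in design.values())
independent_chromosomes = n_families * 2 * 2

print(samples)
print(total_samples, independent_chromosomes)
```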
  • 25. GoNL: sequence 1000 independent Dutch chromosomes
    • Sequence analysis
    • 230 trios (690)
    • 10 quartets (40)
    • 10 MZ twin (40)
    • Immunochip GWAS data for QC (UMCG)
    • Data analysis &
    • Method development
    • ~ 75% of data aligned to reference (hg19)
    • In-depth analysis on 20 trios (pilot1)
  • 26. GoNL: sequence 1000 independent Dutch chromosomes
    • Sequence analysis
    • 230 trios (690)
    • 10 quartets (40)
    • 10 MZ twin (40)
    • Immunochip GWAS data for QC (UMCG)
    TODO: Imputation ~100,000 Dutch samples with GWAS data
    • Data analysis &
    • Method development
    • ~ 50% of data aligned to reference (hg19)
    • In-depth analysis on 20 trios (pilot)
  • 27. GoNL: sequence 1000 independent Dutch chromosomes
    • Sequence analysis
    • 230 trios (690)
    • 10 quartets (40)
    • 10 MZ twin (40)
    TODO: Imputation ~100,000 Dutch samples with GWAS data
    • Data analysis &
    • Method development
    • ~ 50% of data aligned to reference (hg19)
    • In-depth analysis on 20 trios (pilot)
    TODO: Further analysis Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA This is an open national project: please contact [email_address] [email_address] and [email_address] for analysis ideas.
  • 28. GoNL: sequence 1000 independent Dutch chromosomes
    • Data analysis &
    • Method development
    • ~ 75% of data aligned to reference (hg19)
    • In-depth analysis on 20 trios (pilot)
    • Sequence analysis
    • 230 trios (690)
    • 10 quartets (40)
    • 10 MZ twin (40)
    Imputation existing GWAS ~100,000 Dutch samples with GWAS data Further analysis Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA This is an open national project: please contact debakker@broadinstitute.org; m.a.swertz@rug.nl; [email_address] for analysis ideas.
  • 29. Challenge 1: Data storage
    • 45TB raw data (fq.gz)
    • 450TB intermediate data (bam)
    • 90TB results (bam + vcf)
  • 30. Challenge 2: Alignment, Variant Calling, and QC pipelines
    • Alignment (~ 1 week)
      • Alignment to human genome (Build 37)
      • Clean up alignment (mark duplicates, realignment, recalibration)
      • Quality control
    • Variant calling (~ 1 week)
      • SNP calling
      • Indel calling
      • Variant filtering
    • QC: Immunochip concordance
  • 31. 2300 lanes * 15 analysis steps => 34,500 commands needed
    • > 2300 * 15 files, 2300 + 750 QC reports, a nightmare to track
    /data/gcc/tools/bwa-0.5.8c_patched/bwa aln /data/gcc/resources/hg19/indices/human_g1k_v37.fa /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz -t 4 -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai
    /data/gcc/tools/bwa_45_patched/bwa sampe -P -p illumina -i L6 -m 24173 -l A80MP0ABXX /data/gcc/resources/hg19/indices/human_g1k_v37.fa /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe02.bwa_align_pair2.ftl.human_g1k_v37.2011_05_30_20_22.2.sai /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_2.fq.gz -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam
    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SamFormatConverter.jar INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=2000000 TMP_DIR=/local
    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SortSam.jar INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam SORT_ORDER=coordinate VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=1000000 TMP_DIR=/local
    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/BuildBamIndex.jar INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam.bai VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=1000000 TMP_DIR=/local
  • 32. Challenge 3: > 200,000 compute hours
    • Alignment 2300 lanes, 15 steps, ~75 hours per lane
    • SNP calling 760 samples, 6 steps, ~50 hours per sample
    • Immunochip QC 760 samples, 5 steps, 1 hour per sample
    • Bottlenecks: compute power, network and storage I/O
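The "> 200,000 hours" estimate follows from the three workloads listed on this slide. A minimal sketch of the arithmetic, assuming the quoted hours are totals per lane or per sample across all steps (the stage names are illustrative):

```python
# (number of units, approximate compute hours per unit), taken from the slide.
workload = {
    "alignment": (2300, 75),      # lanes
    "snp_calling": (760, 50),     # samples
    "immunochip_qc": (760, 1),    # samples
}

# Assumption: hours are per lane/sample in total, not per analysis step.
hours = {stage: units * per_unit for stage, (units, per_unit) in workload.items()}
total_hours = sum(hours.values())

for stage, h in hours.items():
    print(f"{stage}: {h} hours")
print(f"total: {total_hours} hours")
```

Under that reading, alignment accounts for most of the total.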
  • 33. Challenge 4: Did we analyze it all? Correctly? Completely?
    Batches: UModqR 60, HUMcriR 90, HUMhxsR 222, HUMrutR 235, HUMjxbR 153, HUMsnrR 10
  • 34. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
    • Background: Genome of the Netherlands project
        • Why: create a Dutch genetic hapmap to find rarer variants
        • Aim: genome sequence of 1000 chromosomes (12x)
    • Challenge: analyze 2250 Illumina lanes
        • Alignment and SNP calling for 760 samples
        • Data handling, QC, reports, etc
    • Solution: NGS software/hardware infrastructure
        • GPFS storage for >100TB of data files
        • Template system for compute protocols
        • Generators to automatically produce analysis scripts
        • MOLGENIS to run and track inputs, analyses, output data
    • Demo movie
    • Conclusion
  • 35. Kickstart the project building on NBIC/BioAssist
    • NGS task force
    • Biobanking task force
    • e-BioGrid team
  • 36. Solution 1: GPFS shared data storage
    • Primary storage in Groningen on ‘Target’
    • Backup storage in Amsterdam on ‘BigGrid’
    • Data transfer via hard drives
    • Systematic organization of raw data, result data, logs
    • GPFS: 2,000 TB, 750 x 3TB disks, 3200 tapes
    http://www.bbmriwiki.nl/wiki/DataManagement http://www.rug.nl/target/index
  • 37. Solution 2: data management via sample-lane worksheet
    sample  flowcell     lane  lib              machine  date    file
    A24a    FC80R35ABXX  L3    HUMhxsRJODIAAPE  I433     101119  101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
    A24a    FC80F2RABXX  L3    HUMhxsRJODIABPE  I481     101120  101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
    A24a    FC80GHKABXX  L2    HUMhxsRJODIBAPE  I114     101202  101202_I114_FC80GHKABXX_L2_HUMhxsRJODIBAPE
    A24b    FC80R35ABXX  L4    HUMhxsRJPDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
    A24b    FC80F2RABXX  L4    HUMhxsRJPDIABPE  I481     101120  101120_I481_FC80F2RABXX_L4_HUMhxsRJPDIABPE
    A24b    FC80GHKABXX  L3    HUMhxsRJPDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L3_HUMhxsRJPDIBAPE
    A24b    FC81C8UABXX  L3    HUMhxsRJPDIBAPE  I340     110114  110114_I340_FC81C8UABXX_L3_HUMhxsRJPDIBAPE
    A24c    FC80R35ABXX  L5    HUMhxsRJQDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L5_HUMhxsRJQDIAAPE
    A24c    FC80F2RABXX  L6    HUMhxsRJQDIABPE  I481     101120  101120_I481_FC80F2RABXX_L6_HUMhxsRJQDIABPE
    A24c    FC80GHKABXX  L4    HUMhxsRJQDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L4_HUMhxsRJQDIBAPE
    A25a    FC80R35ABXX  L6    HUMhxsRJRDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L6_HUMhxsRJRDIAAPE
    A25a    FC81C8UABXX  L2    HUMhxsRJRDIAAPE  I340     110114  110114_I340_FC81C8UABXX_L2_HUMhxsRJRDIAAPE
    A25a    FC80F54ABXX  L7    HUMhxsRJRDIABPE  I171     101122  101122_I171_FC80F54ABXX_L7_HUMhxsRJRDIABPE
    A25a    FC80GHKABXX  L5    HUMhxsRJRDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L5_HUMhxsRJRDIBAPE
    A25b    FC80R35ABXX  L7    HUMhxsRJSDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L7_HUMhxsRJSDIAAPE
    A25b    FC80EE1ABXX  L5    HUMhxsRJSDIABPE  I171     101122  101122_I171_FC80EE1ABXX_L5_HUMhxsRJSDIABPE
    A25b    FC80GHKABXX  L6    HUMhxsRJSDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L6_HUMhxsRJSDIBAPE
    A25b    FC80GHJABXX  L1    HUMhxsRJSDIBAPE  I117     101208  101208_I117_FC80GHJABXX_L1_HUMhxsRJSDIBAPE
    A25c    FC80R35ABXX  L8    HUMhxsRJTDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L8_HUMhxsRJTDIAAPE
    A25c    FC80F54ABXX  L5    HUMhxsRJTDIABPE  I171     101122  101122_I171_FC80F54ABXX_L5_HUMhxsRJTDIABPE
    A25c    FC80GHKABXX  L7    HUMhxsRJTDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L7_HUMhxsRJTDIBAPE
    A25c    FC81C7KABXX  L5    HUMhxsRJTDIBAPE  I125     110115  110115_I125_FC81C7KABXX_L5_HUMhxsRJTDIBAPE
    A26a    FC80PEWABXX  L5    HUMhxsRJUDIAAPE  I198     101120  101120_I198_FC80PEWABXX_L5_HUMhxsRJUDIAAPE
    A26a    FC80F2RABXX  L7    HUMhxsRJUDIABPE  I481     101120  101120_I481_FC80F2RABXX_L7_HUMhxsRJUDIABPE
    A26a    FC80GHKABXX  L8    HUMhxsRJUDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L8_HUMhxsRJUDIBAPE
    A26b    FC80N58ABXX  L5    HUMhxsRJVDIAAPE  I245     101120  101120_I245_FC80N58ABXX_L5_HUMhxsRJVDIAAPE
    A26b    FC80PNWABXX  L2    HUMhxsRJVDIABPE  I453     101119  101119_I453_FC80PNWABXX_L2_HUMhxsRJVDIABPE
    A26b    FC80G37ABXX  L1    HUMhxsRJVDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L1_HUMhxsRJVDIBAPE
    A26c    FC80LDLABXX  L1    HUMhxsRJWDIAAPE  I453     101119  101119_I453_FC80LDLABXX_L1_HUMhxsRJWDIAAPE
    A26c    FC80PNWABXX  L3    HUMhxsRJWDIABPE  I453     101119  101119_I453_FC80PNWABXX_L3_HUMhxsRJWDIABPE
    A26c    FC80G37ABXX  L2    HUMhxsRJWDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L2_HUMhxsRJWDIBAPE
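In the worksheet above, each file name appears to be the date, machine, flowcell, lane and library joined by underscores. The sketch below splits a name back into those fields. `LaneFile` and `parse_lane_file` are hypothetical helpers written for illustration, not part of MOLGENIS, and the five-field layout is inferred from the rows shown.

```python
from typing import NamedTuple


class LaneFile(NamedTuple):
    """Fields encoded in a worksheet file name (layout inferred from the slide)."""
    date: str
    machine: str
    flowcell: str
    lane: str
    lib: str


def parse_lane_file(name: str) -> LaneFile:
    """Split 'date_machine_flowcell_lane_lib' into its five fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields, got {len(parts)}: {name!r}")
    return LaneFile(*parts)


record = parse_lane_file("101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE")
print(record)
```

A check like this could flag worksheet rows whose file column disagrees with the other columns.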
  • 38. (of course it is a bit more advanced than that)
    • NB:
    • we have a beta Galaxy tool.xml mapper
    • based on GEN2PHEN ‘observation’ model
    • we would love to have a shared workflow model
  • 39. Solution 3: auto-generate all computational protocols
    • Auto-generate all the analysis commands:
    • Generate scripts:
      1. Create SampleLane list
      2. Generate pipeline from templates
      3. Submit to compute cluster
    • 15 templates (e.g. bwa aln ${lane}) => 34,500 scripts (e.g. bwa aln FC80R35ABXX_L3.fq.gz)
    http://www.bbmriwiki.nl/svn/ngs_pipelines/templates/ngs/
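The idea of filling a template once per lane can be sketched in a few lines. The talk names FreeMarker as the template engine; this example swaps in Python's standard `string.Template`, which happens to accept the same `${lane}` placeholder. The `generate_scripts` helper, the step name `bwa_align` and the second lane name are assumptions made for illustration only.

```python
from string import Template

# One template per analysis step; only this one is shown on the slide.
TEMPLATES = {"bwa_align": Template("bwa aln ${lane}")}


def generate_scripts(lanes, templates=TEMPLATES):
    """Return one filled-in command per (lane, step) pair."""
    return {
        (lane, step): template.substitute(lane=lane)
        for lane in lanes
        for step, template in templates.items()
    }


# The first lane is the slide's example; the second is a made-up companion.
scripts = generate_scripts(["FC80R35ABXX_L3.fq.gz", "FC80F2RABXX_L3.fq.gz"])
for key, command in scripts.items():
    print(key, "->", command)
```

With 15 templates and 2300 lanes, the same loop would yield the 34,500 scripts quoted on the slide.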
  • 40. Solution 4: distributed compute efforts > 200,000 hours
    • Alignment 2300 lanes, 15 steps, ~75 hours per lane
    • SNP calling 760 samples, 6 steps, ~50 hours per sample
    • Immunochip QC 760 samples, 5 steps, 1 hour per sample
    • RUG CIT/Target: ~900 lanes done, ~240 per week, 360 cpus
    • AMC/BigGrid: ~250 lanes done, ~30 per week, ~270 cpus
    • EMC, Hubrecht, other BigGrid
  • 41. Solution 5: a tool to submit and monitor compute jobs
  • 42. Solution 6: REST based services
    • To interact with R, Galaxy, Taverna (WSDL), Shell etc
    • e.g. simply upload a csv from shell
    • e.g. simply get data via R
    http://www.molgenis.org/wiki/MolgenisRestInterface
    http://www.molgenis.org/wiki/MolgenisRinterface
    curl -d 'data_type_input=org.molgenis.pheno.Individual&data_input=Name,Descriptio%0AInd1,Desc1%0AInd2,Desc2&data_action=ADD&data_silent=F&submit_input=submit' http://vm7.target.rug.nl/ngs_test/api/add
    source("http://a.host:8080/molgenis_ngs/api/R")
    res <- find.NgsSample();
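The form body of the curl call above can also be assembled programmatically. This sketch only builds and decodes the body and sends nothing. The field names are copied from the slide; `build_add_body` is a hypothetical helper, and whether the server accepts percent-encoded commas exactly like literal ones is an assumption.

```python
from urllib.parse import parse_qs, urlencode


def build_add_body(entity: str, header: list[str], rows: list[list[str]]) -> str:
    """Encode a small CSV table as the form body shown in the curl example."""
    csv_text = "\n".join(",".join(fields) for fields in [header, *rows])
    return urlencode({
        "data_type_input": entity,
        "data_input": csv_text,   # newlines become %0A, as on the slide
        "data_action": "ADD",
        "data_silent": "F",
        "submit_input": "submit",
    })


# Column names are copied verbatim from the slide.
body = build_add_body(
    "org.molgenis.pheno.Individual",
    ["Name", "Descriptio"],
    [["Ind1", "Desc1"], ["Ind2", "Desc2"]],
)
decoded = parse_qs(body)  # round-trip to check what a server would receive
print(body)
```

To send it, the encoded body could be posted to the `api/add` URL from the slide with any HTTP client.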
  • 43. All working together (beta)
    • MOLGENIS user interface for NGS (Java)
    • Lane & sample metadata and QC reports (MySQL)
    • Protocol catalogue (FreeMarker), e.g. bwa aln ${lane}
    • MOLGENIS/compute: generate ‘ProtocolApplications’, submit and monitor (GridGain)
    • Petabyte file storage (GPFS, GridFS?)
    • Compute cluster (PBS, Grid?)
    • Data & protocols, result exploration, test & play via:
      • API
      • R
      • Galaxy
      • Taverna
      • IGV
      • UCSC
  • 44. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
    • Background: Genome of the Netherlands project
        • Why: create a Dutch genetic hapmap to find rarer variants
        • Aim: genome sequence of 1000 chromosomes (12x)
    • Challenge: analyze 2250 Illumina lanes
        • Alignment and SNP calling for 760 samples
        • Data handling, QC, reports, etc
    • Solution: NGS software/hardware infrastructure
        • GPFS storage for >100TB of data files
        • Template system for compute protocols
        • Generators to automatically produce analysis scripts
        • MOLGENIS to run and track inputs, analyses, output data
    • Demo movie
    • Conclusion
  • 45. Download demo from DropBox
    • http://dl.dropbox.com/u/1839500/Swertz_BOSC_2011.mp4
  • 46. Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
    • Background: Genome of the Netherlands project
        • Why: create a Dutch genetic hapmap to find rarer variants
        • Aim: genome sequence of 1000 chromosomes (12x)
    • Challenge: analyze 2250 Illumina lanes
        • Alignment and SNP calling for 760 samples
        • Data handling, QC, reports, etc
    • Solution: NGS software/hardware infrastructure
        • GPFS storage for >100TB of data files
        • Template system for compute protocols
        • Generators to automatically produce analysis scripts
        • MOLGENIS to run and track inputs, analyses, output data
    • Demo movie
    • Conclusion
  • 47. Alignment results
    • Alignment (~ 1 week)
      • Alignment to human genome (Build 37)
      • Clean up alignment (mark duplicates, realignment, recalibration)
      • Quality control
    • Variant calling (~ 1 week)
      • Individual SNP calling
      • Indel calling
      • Variant filtering
    • >94% reads aligned, >13x avg coverage
  • 48. SNP calling result (GoNL Pilot Chr20 – 1KG Phase I)
    1KG estimated Chr20 Ti/Tv: 2.36
    set               SNPs     %dbSNP  Ti/Tv
    GoNL Pilot only   16,045   2.05    2.20
    1KG Phase 1 only  648,284  10.23   2.36
    Intersection      177,389  65.91   2.41
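The three counts on this slide are enough to work out how much of each call set is shared. A minimal sketch of that arithmetic, assuming the three sets are disjoint as in a two-set Venn diagram (the variable names are illustrative):

```python
# SNP counts on chromosome 20, taken from the slide.
gonl_only = 16_045
kg_only = 648_284
shared = 177_389

# Assumption: "only" and "intersection" sets do not overlap.
gonl_total = gonl_only + shared
kg_total = kg_only + shared

frac_gonl_in_kg = shared / gonl_total   # share of GoNL pilot SNPs also in 1KG
frac_kg_in_gonl = shared / kg_total     # share of 1KG Phase 1 SNPs also in GoNL

print(f"GoNL pilot SNPs: {gonl_total}, of which {frac_gonl_in_kg:.1%} also in 1KG")
print(f"1KG Phase 1 SNPs: {kg_total}, of which {frac_kg_in_gonl:.1%} also in GoNL")
```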
  • 49. Next…
    • Polish the software ... a lot
      • It's MOLGENIS, so anybody can download and customize (ideas anyone?)
      • Integrate the login/security module
      • Provide reports for the ‘end-users’
      • Enable trend analyses, etc
    • Integrate and run more pipelines for GoNL
      • Structural Variation Group
        • Finalize GoNL SV pipeline
        • Integrate SNP Calling / SV pipelines
      • Imputation Group
        • Phase Pilot data
        • Impute sequence data
        • Estimate gain of GoNL vs HapMap/1KG as Imputation panel
  • 50.
    • Acknowledgements
      • GoNL / MOLGENIS Infrastructure team
        • George Byelas, Martijn Dijkstra, Robert Wagner, Pieter Neerincx, Abhishek Narain, Jan Bot and indirectly GEN2PHEN, EBI, FIMM, ...
      • GoNL Analysis team (creating pipelines and tools)
        • Freerk van Dijk (UMCG), Barbera van Schaik (AMC), Ies Nijman (Hubrecht), Slavik Koval (EMC), Laurent Francioli (UU), Kai Ye (LUMC), Jeroen Laros (LUMC), Lennart Karssen (EMC), JoukeJan Hottenga (VU), Mathijs Kattenberg (VU), David van Enckvort (NBIC), Leon Mei (NBIC), Elise van Leeuwen (EMC), … and many, many others
      • GoNL Steering group (coordination)
        • Cisca Wijmenga (PI GoNL), Morris Swertz (PI analysis), Gertjan van Ommen (LUMC), Eline Slagboom (LUMC), Jasper Bovenberg (ELSI issues), Cornelia van Duijn (EMC), Dorret Boomsma (VU), Paul de Bakker (co-PI analysis, UU)
    • Get all as open source:
      • GoNL - http://www.nlgenome.nl
      • MOLGENIS - http://www.molgenis.org
      • Analysis team - http://www.bbmriwiki.nl
    • Contact? [email_address]