Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
Morris Swertz, UMC Groningen, Netherlands, and members of BBMRI-NL, NBIC, MOLGENIS
BOSC 2011, July 15, Vienna
At BOSC 2010 we demonstrated the MOLGENIS software toolkit: Model (xml) –> Generator (java) –> Use (web). Example applications: Animal Observatory, NextGenSeq, Mutation database, Model organisms. Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
Get stuff for free as others build it already
  Connect to annotation services
  Plugin rich analysis tools
  Connect to statistics
  UML documentation of your model
  Edit & trace your data
  Import/export to Excel
Example (R interface):
  find.investigation()          102 downloaded
  obs<-find.observedvalue(      43,920 downloaded
  #some calculation
  add.inferredvalue(res)        36 added
Three steps:  Model  –> Generate –> Use Swertz  et al  (2010)  BMC Bioinformatics  11(Suppl 12):S12,  http://www.molgenis.org
Three steps: Model –> Generate –> Use
9200 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\ProtocolsForm.java
9293 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ParametersForm.java
9325 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ProtocolComponentsForm.java
9496 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyTermsForm.java
9528 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySourcesForm.java
9606 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySources\OntologyTermsForm.java
9638 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeListsForm.java
9700 INFO  [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeLists\CodesForm.java
9965 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenuMenu.java
10012 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\MainMenu.java
10059 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenuMenu.java
10152 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenu\ProtocolApplications\ProtocolApplicationMenuMenu.java
10230 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\ObservationTargetsMenu.java
10293 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenuMenu.java
10324 INFO  [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\OntologiesMenu.java
11354 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\ReportPlugin.java
11557 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyManagerPlugin.java
11604 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Model_documentationPlugin.java
11604 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\RprojectApiPlugin.java
11620 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\HttpApiPlugin.java
11635 INFO  [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\WebServicesApiPlugin.java
11651 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.ftl
11807 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.ftl
11807 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.ftl
11807 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.ftl
11823 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.ftl
11823 WARN  [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.ftl
11854 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.java
12057 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.java
12072 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.java
12088 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.java
12088 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.java
12088 WARN  [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.java
12103 INFO  [MolgenisServletContextGen] generated WebContent\META-INF\context.xml
12259 INFO  [SoapApiGen] generated generated\java\ui\SoapApi.java
12353 INFO  [CsvExportGen] generated generated\java\tools\CsvExport.java
12431 INFO  [CsvImportByNameGen] generated generated\java\tools\CsvImportByName.java
12636 INFO  [CopyMemoryToDatabaseGen] generated generated\java\ui\tools\CopyMemoryToDatabase.java
Real example: Generates 150 files, 30k lines of Java, MySQL, CXF, Tomcat config, and R code + docs
Three steps: Model –> Generate –>  Use Swertz  et al  (2010)  BMC Bioinformatics  11(Suppl 12):S12,  http://www.molgenis.org
Currently: Towards an integrated app suite: XGAP for GWAS/GWL; Disease specific databases; BBMRI biobank catalogue; GWAS central data manager; NGS cyber infrastructure; MAGE-TAB microarray; AnimalDB. Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
Large scale NGS pipelines using the MOLGENIS platform: processing the Genome of the Netherlands
Background: Genome of the Netherlands project
  Why: create a Dutch genetic hapmap to find rarer variants
  Aim: genome sequence of 1000 chromosomes (12x)
Challenge: analyze 2250 Illumina lanes
  Alignment and SNP calls of 760 samples
  Data handling, QC, reports, etc.
Solution: NGS software/hardware infrastructure
  GPFS storage for >100TB of data files
  Template system for compute protocols
  Generators to automatically produce analysis scripts
  MOLGENIS to run and track inputs, analyses, output data
Demo movie
Conclusion
Motivation: GWAS revolution in human genetics
GREAT! Ankylosing Spondylitis, Celiac Disease, Crohn’s disease, Multiple Sclerosis, Psoriasis, Rheumatoid Arthritis, Systemic Lupus Erythematosus, Type 1 Diabetes, Ulcerative Colitis
BUT … these explain a small part of heritability
Missing heritability? Where might it be hiding?
However: Sequencing candidate loci implicates unknown (rare) variants
More insight into the specific genetic architecture of individual populations is crucial. First analysis of 1000G project data shows that the majority of the newly identified and rare variants are population specific (and there are no Dutch in 1000G). Durbin et al., Nature 2010. [Figure legend: common, known, new]
Genome of the Netherlands (GoNL)
Idea 1: sequence 1000 independent Dutch chromosomes
Unique family-based design: 250 trios
  230 x 2 parents – 1 offspring
  10 x 2 parents – 2 offspring
  10 x 2 parents – 1 MZ twin offspring
  Immunochip microarray data for QC
Specifications:
  Families equally distributed over the Dutch provinces
  Genomic DNA, paired-end sequencing on HiSeq2000, 12x coverage
  Trios allow phase information; accurate haplotypes
Other results: structural variation, detection of de novo variants
Biobanks * analysis teams
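One way to read the "1000 independent chromosomes" figure is sketched below. This is an interpretation rather than something the slide states: it assumes that only the two unrelated parents of each family contribute independent chromosomes and that each parent carries two copies of every autosome. The dictionary keys and function name are hypothetical.

```python
# Family counts as listed on the slide.
FAMILIES = {"trio": 230, "quartet": 10, "mz_twin_family": 10}

PARENTS_PER_FAMILY = 2      # from the slide: every family has 2 parents
COPIES_PER_PARENT = 2       # assumption: diploid genome, two copies per parent


def independent_chromosomes(families):
    """Count chromosome copies contributed by unrelated parents only."""
    total_families = sum(families.values())
    return total_families * PARENTS_PER_FAMILY * COPIES_PER_PARENT


if __name__ == "__main__":
    print(independent_chromosomes(FAMILIES))
```

Under these assumptions the 250 families give the 1000 chromosomes named in the project aim; offspring add phase information but no further independent copies.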
Idea 2: let's impute ~100,000 existing Dutch samples with GWAS data. Imputation is the process of inferring any missing or untyped genetic variants from typed flanking genetic variants, based on the known local LD relationship.
GoNL: sequence 1000 independent Dutch chromosomes. Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40). Immunochip GWAS data for QC (UMCG).
GoNL: sequence 1000 independent Dutch chromosomes. Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40). Immunochip GWAS data for QC (UMCG). Data analysis & Method development: ~75% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot1).
GoNL: sequence 1000 independent Dutch chromosomes. Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40). Immunochip GWAS data for QC (UMCG). TODO: Imputation: ~100,000 Dutch samples with GWAS data. Data analysis & Method development: ~50% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot).
GoNL: sequence 1000 independent Dutch chromosomes. Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40). TODO: Imputation: ~100,000 Dutch samples with GWAS data. Data analysis & Method development: ~50% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot). TODO: Further analysis: Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA. This is an open national project: please contact [email_address] [email_address] and [email_address] for analysis ideas.
GoNL: sequence 1000 independent Dutch chromosomes
Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40)
Data analysis & Method development: ~75% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot)
Imputation of existing GWAS: ~100,000 Dutch samples with GWAS data
Further analysis: Structural variation, Population Genetics, De novo mutations, Mitochondrial DNA
This is an open national project: please contact debakker@broadinstitute.org; m.a.swertz@rug.nl; [email_address] for analysis ideas.
Challenge 1: Data storage
  45TB raw data (fq.gz)
  450TB intermediate data (bam)
  90TB results (bam + vcf)
Challenge 2: Alignment, Variant Calling, and QC pipelines
Phases: Alignment (~1 week), Variant calling (~1 week)
Steps:
  Alignment to human genome (Build 37)
  Clean up alignment (mark duplicates, realignment, recalibration)
  Quality control
  SNP calling
  Indel calling
  Variant filtering
QC: Immunochip concordance
2300 lanes * 15 analysis steps => 34,500 commands needed
> 2300 * 15 files, 2300 + 750 QC reports: a nightmare to track

/data/gcc/tools/bwa-0.5.8c_patched/bwa aln \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  -t 4 \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai

/data/gcc/tools/bwa_45_patched/bwa sampe -P \
  -p illumina \
  -i L6 \
  -m 24173 \
  -l A80MP0ABXX \
  /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe02.bwa_align_pair2.ftl.human_g1k_v37.2011_05_30_20_22.2.sai \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
  /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_2.fq.gz \
  -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SamFormatConverter.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=2000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SortSam.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  SORT_ORDER=coordinate \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local

java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/BuildBamIndex.jar \
  INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
  OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam.bai \
  VALIDATION_STRINGENCY=LENIENT \
  MAX_RECORDS_IN_RAM=1000000 \
  TMP_DIR=/local
Challenge 3: > 200,000 compute hours
  Alignment: 2300 lanes, 15 steps, ~75 hours per lane
  SNP calling: 760 samples, 6 steps, ~50 hours per sample
  Immunochip QC: 760 samples, 5 steps, 1 hour per sample
Bottlenecks: compute power; network and storage I/O
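The "> 200,000 hours" estimate can be checked with a few lines of arithmetic. The sketch below assumes that each quoted per-lane or per-sample figure is the total over all steps of that stage (the slide does not say so explicitly); the names are illustrative.

```python
# (number of units, approximate hours per unit), as quoted on the slide.
STAGES = {
    "alignment": (2300, 75),      # lanes
    "snp_calling": (760, 50),     # samples
    "immunochip_qc": (760, 1),    # samples
}


def stage_hours(stages):
    """Return the compute hours per stage."""
    return {name: units * hours for name, (units, hours) in stages.items()}


def total_hours(stages):
    """Return the total compute hours over all stages."""
    return sum(stage_hours(stages).values())


if __name__ == "__main__":
    for name, hours in stage_hours(STAGES).items():
        print(f"{name}: {hours} hours")
    print(f"total: {total_hours(STAGES)} hours")
```

Alignment dominates the budget under this reading, which is consistent with the emphasis on distributing lanes over several clusters.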
Challenge 4: Did we analyze it all? Correctly? Completely?
Batches: UModqR 60, HUMcriR 90, HUMhxsR 222, HUMrutR 235, HUMjxbR 153, HUMsnrR 10
Kickstart the project building on NBIC/BioAssist: NGS task force, Biobanking task force, e-BioGrid team
Solution 1: GPFS shared data storage
  Primary storage in Groningen on ‘Target’
  Backup storage in Amsterdam on ‘BigGrid’
  Data transfer via hard drives
  Systematic organization of raw data, result data, logs
GPFS: 2,000 TB; 750 x 3TB disks; 3200 tapes
http://www.bbmriwiki.nl/wiki/DataManagement
http://www.rug.nl/target/index
Solution 2: data management via sample-lane worksheet
sample flowcell    lane lib             machine date   file
A24a   FC80R35ABXX L3   HUMhxsRJODIAAPE I433    101119 101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24a   FC80F2RABXX L3   HUMhxsRJODIABPE I481    101120 101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
A24a   FC80GHKABXX L2   HUMhxsRJODIBAPE I114    101202 101202_I114_FC80GHKABXX_L2_HUMhxsRJODIBAPE
A24b   FC80R35ABXX L4   HUMhxsRJPDIAAPE I433    101119 101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
A24b   FC80F2RABXX L4   HUMhxsRJPDIABPE I481    101120 101120_I481_FC80F2RABXX_L4_HUMhxsRJPDIABPE
A24b   FC80GHKABXX L3   HUMhxsRJPDIBAPE I114    101202 101202_I114_FC80GHKABXX_L3_HUMhxsRJPDIBAPE
A24b   FC81C8UABXX L3   HUMhxsRJPDIBAPE I340    110114 110114_I340_FC81C8UABXX_L3_HUMhxsRJPDIBAPE
A24c   FC80R35ABXX L5   HUMhxsRJQDIAAPE I433    101119 101119_I433_FC80R35ABXX_L5_HUMhxsRJQDIAAPE
A24c   FC80F2RABXX L6   HUMhxsRJQDIABPE I481    101120 101120_I481_FC80F2RABXX_L6_HUMhxsRJQDIABPE
A24c   FC80GHKABXX L4   HUMhxsRJQDIBAPE I114    101202 101202_I114_FC80GHKABXX_L4_HUMhxsRJQDIBAPE
A25a   FC80R35ABXX L6   HUMhxsRJRDIAAPE I433    101119 101119_I433_FC80R35ABXX_L6_HUMhxsRJRDIAAPE
A25a   FC81C8UABXX L2   HUMhxsRJRDIAAPE I340    110114 110114_I340_FC81C8UABXX_L2_HUMhxsRJRDIAAPE
A25a   FC80F54ABXX L7   HUMhxsRJRDIABPE I171    101122 101122_I171_FC80F54ABXX_L7_HUMhxsRJRDIABPE
A25a   FC80GHKABXX L5   HUMhxsRJRDIBAPE I114    101202 101202_I114_FC80GHKABXX_L5_HUMhxsRJRDIBAPE
A25b   FC80R35ABXX L7   HUMhxsRJSDIAAPE I433    101119 101119_I433_FC80R35ABXX_L7_HUMhxsRJSDIAAPE
A25b   FC80EE1ABXX L5   HUMhxsRJSDIABPE I171    101122 101122_I171_FC80EE1ABXX_L5_HUMhxsRJSDIABPE
A25b   FC80GHKABXX L6   HUMhxsRJSDIBAPE I114    101202 101202_I114_FC80GHKABXX_L6_HUMhxsRJSDIBAPE
A25b   FC80GHJABXX L1   HUMhxsRJSDIBAPE I117    101208 101208_I117_FC80GHJABXX_L1_HUMhxsRJSDIBAPE
A25c   FC80R35ABXX L8   HUMhxsRJTDIAAPE I433    101119 101119_I433_FC80R35ABXX_L8_HUMhxsRJTDIAAPE
A25c   FC80F54ABXX L5   HUMhxsRJTDIABPE I171    101122 101122_I171_FC80F54ABXX_L5_HUMhxsRJTDIABPE
A25c   FC80GHKABXX L7   HUMhxsRJTDIBAPE I114    101202 101202_I114_FC80GHKABXX_L7_HUMhxsRJTDIBAPE
A25c   FC81C7KABXX L5   HUMhxsRJTDIBAPE I125    110115 110115_I125_FC81C7KABXX_L5_HUMhxsRJTDIBAPE
A26a   FC80PEWABXX L5   HUMhxsRJUDIAAPE I198    101120 101120_I198_FC80PEWABXX_L5_HUMhxsRJUDIAAPE
A26a   FC80F2RABXX L7   HUMhxsRJUDIABPE I481    101120 101120_I481_FC80F2RABXX_L7_HUMhxsRJUDIABPE
A26a   FC80GHKABXX L8   HUMhxsRJUDIBAPE I114    101202 101202_I114_FC80GHKABXX_L8_HUMhxsRJUDIBAPE
A26b   FC80N58ABXX L5   HUMhxsRJVDIAAPE I245    101120 101120_I245_FC80N58ABXX_L5_HUMhxsRJVDIAAPE
A26b   FC80PNWABXX L2   HUMhxsRJVDIABPE I453    101119 101119_I453_FC80PNWABXX_L2_HUMhxsRJVDIABPE
A26b   FC80G37ABXX L1   HUMhxsRJVDIBAPE I127    101126 101126_I127_FC80G37ABXX_L1_HUMhxsRJVDIBAPE
A26c   FC80LDLABXX L1   HUMhxsRJWDIAAPE I453    101119 101119_I453_FC80LDLABXX_L1_HUMhxsRJWDIAAPE
A26c   FC80PNWABXX L3   HUMhxsRJWDIABPE I453    101119 101119_I453_FC80PNWABXX_L3_HUMhxsRJWDIABPE
A26c   FC80G37ABXX L2   HUMhxsRJWDIBAPE I127    101126 101126_I127_FC80G37ABXX_L2_HUMhxsRJWDIBAPE
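In the worksheet rows, the `file` value appears to be the `date`, `machine`, `flowcell`, `lane` and `lib` fields joined with underscores. The following sketch parses whitespace-separated worksheet rows and checks that naming convention. It is an illustration only, not the MOLGENIS importer; the class and function names are hypothetical, and the two rows are copied from the worksheet.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleLane:
    sample: str
    flowcell: str
    lane: str
    lib: str
    machine: str
    date: str

    def file_prefix(self) -> str:
        # Naming convention inferred from the worksheet rows.
        return "_".join([self.date, self.machine, self.flowcell, self.lane, self.lib])


def parse_worksheet(text):
    """Parse rows of: sample flowcell lane lib machine date file."""
    rows = []
    for line in text.strip().splitlines():
        sample, flowcell, lane, lib, machine, date, file_name = line.split()
        row = SampleLane(sample, flowcell, lane, lib, machine, date)
        if row.file_prefix() != file_name:
            raise ValueError(f"file name does not match fields: {file_name}")
        rows.append(row)
    return rows


def lanes_per_sample(rows):
    """Group lane records by sample identifier."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.sample].append(row)
    return dict(grouped)


WORKSHEET = """
A24a FC80R35ABXX L3 HUMhxsRJODIAAPE I433 101119 101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24b FC80R35ABXX L4 HUMhxsRJPDIAAPE I433 101119 101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
"""

if __name__ == "__main__":
    for sample, lanes in lanes_per_sample(parse_worksheet(WORKSHEET)).items():
        print(sample, [lane.file_prefix() for lane in lanes])
```

Validating the file name against the other columns at import time is one cheap way to catch worksheet typos before thousands of jobs are generated from it.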
(Of course it is a bit more advanced than that.) NB: we have a beta Galaxy tool.xml mapper. Based on the GEN2PHEN ‘observation’ model; we would love to have a shared workflow model.
Solution 3: auto-generate all computational protocols
Auto-generate all the analysis commands:
  1. Create SampleLane list
  2. Generate pipeline from templates
  3. Submit to compute cluster
Template: bwa aln ${lane}
Generated script: bwa aln FC80R35ABXX_L3.fq.gz
15 templates => 34,500 scripts
http://www.bbmriwiki.nl/svn/ngs_pipelines/templates/ngs/
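The generate-from-templates step can be sketched as follows. The actual protocols are FreeMarker (.ftl) templates; this sketch swaps in Python's `string.Template`, which happens to use the same `${lane}` placeholder syntax. The template texts and the step names `bwa_align` and `sam_sort` are made up for illustration.

```python
from string import Template

# Hypothetical stand-ins for the protocol templates (the real ones are .ftl files).
TEMPLATES = {
    "bwa_align": Template("bwa aln ${lane}.fq.gz"),
    "sam_sort": Template("sort ${lane}.bam"),
}


def generate_scripts(templates, lanes):
    """Fill every template once per lane; returns (lane, step, command) tuples."""
    scripts = []
    for lane in lanes:
        for step, template in templates.items():
            # substitute() raises KeyError if a placeholder is left unfilled.
            scripts.append((lane, step, template.substitute(lane=lane)))
    return scripts


def expected_script_count(n_lanes, n_templates):
    """Number of scripts when every template is applied to every lane."""
    return n_lanes * n_templates


if __name__ == "__main__":
    for lane, step, command in generate_scripts(TEMPLATES, ["FC80R35ABXX_L3"]):
        print(f"{lane}\t{step}\t{command}")
    print(expected_script_count(2300, 15))
```

Because every script is a pure function of the worksheet row and the template, regenerating the full set after a template fix is cheap, which matters when the count reaches tens of thousands.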
Solution 4: distributed compute efforts, > 200,000 hours
  Alignment: 2300 lanes, 15 steps, ~75 hours per lane
  SNP calling: 760 samples, 6 steps, ~50 hours per sample
  Immunochip QC: 760 samples, 5 steps, 1 hour per sample
Sites:
  RUG CIT/Target: ~900 lanes done, ~240 per week, 360 cpus
  AMC/BigGrid: ~250 lanes done, ~30 per week, ~270 cpus
  EMC, Hubrecht, other BigGrid
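As a rough planning aid, the per-site figures can be turned into an estimate of the remaining alignment time. This is a back-of-the-envelope sketch, not a figure from the talk: it assumes the quoted weekly rates stay constant, that only the two sites with quoted rates contribute, and that 2300 lanes is the total.

```python
TOTAL_LANES = 2300

# (lanes done, lanes per week), as quoted for the two sites with numbers.
SITES = {
    "RUG CIT/Target": (900, 240),
    "AMC/BigGrid": (250, 30),
}


def remaining_lanes(total, sites):
    """Lanes not yet aligned at any site."""
    return total - sum(done for done, _rate in sites.values())


def weeks_to_finish(total, sites):
    """Remaining lanes divided by the combined weekly throughput."""
    combined_rate = sum(rate for _done, rate in sites.values())
    return remaining_lanes(total, sites) / combined_rate


if __name__ == "__main__":
    print(remaining_lanes(TOTAL_LANES, SITES))
    print(round(weeks_to_finish(TOTAL_LANES, SITES), 1))
```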
Solution 5: a tool to submit and monitor compute jobs
Solution 6: REST based services
To interact with R, Galaxy, Taverna (WSDL), Shell etc.
http://www.molgenis.org/wiki/MolgenisRestInterface
http://www.molgenis.org/wiki/MolgenisRinterface

e.g. simply upload a csv from shell:
curl -d 'data_type_input=org.molgenis.pheno.Individual&data_input=Name,Descriptio%0AInd1,Desc1%0AInd2,Desc2&data_action=ADD&data_silent=F&submit_input=submit' http://vm7.target.rug.nl/ngs_test/api/add

e.g. simply get data via R:
source("http://a.host:8080/molgenis_ngs/api/R")
res <- find.NgsSample();
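For scripted uploads, the form body that the curl example sends can be built programmatically. The sketch below only constructs the URL-encoded body and does not contact any server. The field names (`data_type_input`, `data_input`, `data_action`, `data_silent`, `submit_input`) are taken from the curl example; the function name and the `Description` column header are assumptions.

```python
import csv
import io
from urllib.parse import parse_qs, urlencode


def build_add_request_body(entity_type, header, rows):
    """Return the URL-encoded form body for an 'add' request carrying CSV data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return urlencode({
        "data_type_input": entity_type,
        "data_input": buffer.getvalue().rstrip("\n"),  # newlines become %0A
        "data_action": "ADD",
        "data_silent": "F",
        "submit_input": "submit",
    })


if __name__ == "__main__":
    body = build_add_request_body(
        "org.molgenis.pheno.Individual",
        ["Name", "Description"],  # assumed column names
        [["Ind1", "Desc1"], ["Ind2", "Desc2"]],
    )
    print(body)
    # Decoding the body again recovers the original CSV text.
    print(parse_qs(body)["data_input"][0])
```

Using `urlencode` instead of hand-written escapes also percent-encodes the commas, so it stays correct when values contain characters such as `&` or `=`.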
All working together (beta)
  MOLGENIS user interface for NGS (Java)
  Lane & Sample metadata and QC reports (MySQL)
  Protocol catalogue (FreeMarker), e.g. bwa aln ${lane}
  MOLGENIS/compute: generate ‘ProtocolApplications’; submit and monitor (GridGain)
  Compute cluster (PBS, Grid?)
  Petabyte file storage (GPFS, GridFS?)
  API: used by R, Galaxy, Taverna, IGV, UCSC
  Data & protocols; result exploration; test & play
Download demo from DropBox: http://dl.dropbox.com/u/1839500/Swertz_BOSC_2011.mp4
Alignment results
Phases: Alignment (~1 week), Variant calling (~1 week)
Steps:
  Alignment to human genome (Build 37)
  Clean up alignment (mark duplicates, realignment, recalibration)
  Quality control
  Individual SNP calling
  Indel calling
  Variant filtering
Results: >94% reads aligned; >13x avg coverage
SNP calling result (GoNL Pilot Chr20 – 1KG Phase I)
Set                SNPs     %dbSNP  Ti/Tv
GoNL Pilot Only    16,045   2.05    2.20
Intersection       177,389  65.91   2.41
1KG Phase 1 Only   648,284  10.23   2.36
1KG Estimated Chr20 Ti/Tv: 2.36
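The three SNP counts describe a two-set overlap, so the share of GoNL pilot calls that 1KG Phase 1 also reports follows directly. A minimal sketch using only the counts from the slide (the variable and function names are illustrative):

```python
GONL_ONLY = 16_045
INTERSECTION = 177_389
KG_ONLY = 648_284


def total_calls(only, shared):
    """All SNPs called in one set: its private calls plus the shared ones."""
    return only + shared


def shared_fraction(only, shared):
    """Fraction of a set's calls that also appear in the other set."""
    return shared / total_calls(only, shared)


if __name__ == "__main__":
    print(total_calls(GONL_ONLY, INTERSECTION))
    print(f"{shared_fraction(GONL_ONLY, INTERSECTION):.1%} of GoNL pilot SNPs are also in 1KG")
    print(f"{shared_fraction(KG_ONLY, INTERSECTION):.1%} of 1KG SNPs are also in the GoNL pilot")
```

The asymmetry between the two fractions is expected when comparing a small pilot against the much larger 1KG call set.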
Next…
Polish the software ... a lot
  It's MOLGENIS, so anybody can download and customize (ideas anyone?)
  Integrate the login/security module
  Provide reports for the ‘end-users’
  Enable trend analyses, etc.
Integrate and run more pipelines for GoNL
  Structural Variation Group
    Finalize GoNL SV pipeline
    Integrate SNP Calling / SV pipelines
  Imputation Group
    Phase Pilot data
    Impute sequence data
    Estimate gain of GoNL vs HapMap/1KG as Imputation panel
Acknowledgements
GoNL / MOLGENIS Infrastructure team: George Byelas, Martijn Dijkstra, Robert Wagner, Pieter Neerincx, Abhishek Narain, Jan Bot and indirectly GEN2PHEN, EBI, FIMM, ...
GoNL Analysis team (creating pipelines and tools): Freerk van Dijk (UMCG), Barbera van Schaik (AMC), Ies Nijman (Hubrecht), Slavik Koval (EMC), Laurent Francioli (UU), Kai Ye (LUMC), Jeroen Laros (LUMC), Lennart Karssen (EMC), JoukeJan Hottenga (VU), Mathijs Kattenberg (VU), David van Enckvort (NBIC), Leon Mei (NBIC), Elise van Leeuwen (EMC), … and many, many others
GoNL Steering group (coordination): Cisca Wijmenga (PI GoNL), Morris Swertz (PI analysis), Gertjan van Ommen (LUMC), Eline Slagboom (LUMC), Jasper Bovenberg (ELSI issues), Cornelia van Duijn (EMC), Dorret Boomsma (VU), Paul de Bakker (co-PI analysis, UU)
Get all as open source:
  GoNL - http://www.nlgenome.nl
  MOLGENIS - http://www.molgenis.org
  Analysis team - http://www.bbmriwiki.nl
Contact? [email_address]

D02-NextGenSeq-MOLGENIS

  • 1.
    Large scale NGSpipelines using the MOLGENIS platform: processing the Genome of the Netherlands Morris Swertz , UMC Groningen, Netherlands and members of BBMRI-NL, NBIC, MOLGENIS BOSC 2011, July 15, Vienna
  • 2.
    BOSC 2010 wedemonstrated the MOLGENIS software toolkit Use (web) Animal Observatory NextGenSeq Mutation database Model organisms Model (xml) Generator (java) Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 3.
    Get stuff forfree as others build it already Connect to annotation services Plugin rich analysis tools Connect to statistics UML documentation of your model Edit & trace your data Import/export to Excel find.investigation() 102 downloaded obs<-find.observedvalue( 43,920 downloaded #some calculation add.inferredvalue(res) 36 added      
  • 4.
    Three steps: Model –> Generate –> Use Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 5.
    Three steps: Model–> Generate –> Use 9200 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\ProtocolsForm.java 9293 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ParametersForm.java 9325 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenu\ProtocolComponentsForm.java 9496 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyTermsForm.java 9528 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySourcesForm.java 9606 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\OntologySources\OntologyTermsForm.java 9638 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeListsForm.java 9700 INFO [FormScreenGen] generated generated\java\ui\screen\TopMenu\Main\Ontologies\CodeLists\CodesForm.java 9965 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenuMenu.java 10012 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\MainMenu.java 10059 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenuMenu.java 10152 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Investigations\InvestigationMenu\ProtocolApplications\ProtocolApplicationMenuMenu.java 10230 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\ObservationTargetsMenu.java 10293 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\Protocols\ProtocolMenuMenu.java 10324 INFO [MenuScreenGen] generated generated\java\ui\screen\TopMenu\Main\OntologiesMenu.java 11354 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\ReportPlugin.java 11557 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Main\Ontologies\OntologyManagerPlugin.java 11604 INFO [PluginScreenGen] 
generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\Model_documentationPlugin.java 11604 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\RprojectApiPlugin.java 11620 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\HttpApiPlugin.java 11635 INFO [PluginScreenGen] generated Molgenis33Workspace\molgenis4phenotype\generated\java\ui\screen\TopMenu\WebServicesApiPlugin.java 11651 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.ftl 11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.ftl 11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.ftl 11807 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.ftl 11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.ftl 11823 WARN [PluginScreenFTLTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.ftl 11854 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\report\InvestigationOverview.java 12057 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\OntologyBrowser\OntologyBrowserPlugin.java 12072 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\DocumentationScreen.java 12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\RprojectApiScreen.java 12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\HttpAPiScreen.java 12088 WARN [PluginScreenJavaTemplateGen] Skipped because exists: handwritten\java\plugin\topmenu\SoapApiScreen.java 12103 INFO [MolgenisServletContextGen] 
generated WebContent\META-INF\context.xml 12259 INFO [SoapApiGen] generated generated\java\ui\SoapApi.java 12353 INFO [CsvExportGen] generated generated\java\tools\CsvExport.java 12431 INFO [CsvImportByNameGen] generated generated\java\tools\CsvImportByName.java 12636 INFO [CopyMemoryToDatabaseGen] generated generated\java\ui\tools\CopyMemoryToDatabase.java Real example: Generates 150 files, 30k lines of Java, MySQL, CXF, Tomcat config, and R code + docs
  • 6.
    Three steps: Model–> Generate –> Use Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 7.
    Currently: Towards anintegrated app suite XGAP for GWAS/GWL Disease specific databases BBMRI biobank catalogue GWAS central data manager NGS cyber infrastructure MAGE-TAB microarray AnimalDB Swertz et al (2010) BMC Bioinformatics 11(Suppl 12):S12, http://www.molgenis.org
  • 8.
    Large scale NGSpipelines using the MOLGENIS platform: processing the Genome of the Netherlands Background: Genome of the Netherlands project Why: create a Dutch genetic hapmap to find rarer variants Aim: genome sequence of 1000 chromosomes (12x) Challenge: analyze 2250 Illumina lanes Alignment and SNP calls of 760 samples calls Data handling, QC, reports, etc Solution: NGS software/hardware infrastructure GPFS storage for >100TB of data files Template system for compute protocols Generators to automatically produce analysis scripts MOLGENIS to run and track inputs, analyses, output data Demo movie Conclusion
  • 9.
    Large scale NGSpipelines using the MOLGENIS platform: processing the Genome of the Netherlands Background: Genome of the Netherlands project Why: create a Dutch genetic hapmap to find rarer variants Aim: genome sequence of 1000 chromosomes (12x) Challenge: analyze 2250 Illumina lanes Alignment and SNP calls of 760 samples calls Data handling, QC, reports, etc Solution: NGS software/hardware infrastructure GPFS storage for >100TB of data files Template system for compute protocols Generators to automatically produce analysis scripts MOLGENIS to run and track inputs, analyses, output data Demo movie Conclusion
  • 10.
    Motivation: GWAS revolutionin human genetics
  • 11.
    Motivation: GWAS revolutionin human genetics
  • 12.
    Motivation: GWAS revolutionin human genetics
  • 13.
    Motivation: GWAS revolutionin human genetics
  • 14.
    Motivation: GWAS revolutionin human genetics
  • 15.
    GREAT! Ankylosing SpondylitisCeliac Disease Crohn’s disease Multiple Sclerosis Psoriasis Rheumatoid Arthritis Systemic Lupus Erythematosus Type 1 Diabetes Ulcerative Colitis
  • 16.
    BUT … theseexplain a small part of heritability
  • 17.
    Missing heritability? Wheremight it be hiding?
  • 18.
    However: Sequencing candidateloci implicates unknown (rare) variants
  • 19.
    More insight intothe specific genetic architecture of individual populations is crucial First analysis of 1000G project data Durbin et al., Nature 2010 common known
  • 20.
    More insight intothe specific genetic architecture of individual populations is crucial First analysis of 1000G project data shows that the majority of the newly identified and rare variants are population specific (and there are no Dutch in 1000G) Durbin et al., Nature 2010 common known new
  • 21.
    Genome of theNetherlands (GoNL): Unique family-based design: 250 trios 230 x 2 parents – 1 offspring 10 x 2 parents – 2 offspring 10 x 2 parents – 1 MZ twin offspring Immunochip microrray QC control data Specifications: Families equally distributed over the Dutch provinces Genomic DNA, paired-end sequencing on HiSeq2000, 12x coverage Trios allow phase information; accurate haplotypes Other results: Structural variation, detection de novo variants Idea 1: sequence 1000 independent Dutch chromosomes Biobanks * analysis teams
  • 22.
    Idea 2: let's impute 100,000 existing Dutch GWAS samples
    Imputation is the process of inferring any missing or untyped genetic variants from typed flanking genetic variants, based on the known local LD relationship
    Figure label: GWAS data
  • 24.
    GoNL: sequence 1000 independent Dutch chromosomes
    Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40); Immunochip GWAS data for QC (UMCG)
  • 25.
    GoNL: sequence 1000 independent Dutch chromosomes
    Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40); Immunochip GWAS data for QC (UMCG)
    Data analysis & method development: ~75% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot1)
  • 26.
    GoNL: sequence 1000 independent Dutch chromosomes
    Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40); Immunochip GWAS data for QC (UMCG)
    TODO: Imputation: ~100,000 Dutch samples with GWAS data
    Data analysis & method development: ~50% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot)
  • 27.
    GoNL: sequence 1000 independent Dutch chromosomes
    Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40)
    TODO: Imputation: ~100,000 Dutch samples with GWAS data
    Data analysis & method development: ~50% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot)
    TODO: Further analysis: structural variation, population genetics, de novo mutations, mitochondrial DNA
    This is an open national project: please contact [email_address] [email_address] and [email_address] for analysis ideas.
  • 28.
    GoNL: sequence 1000 independent Dutch chromosomes
    Data analysis & method development: ~75% of data aligned to reference (hg19); in-depth analysis on 20 trios (pilot)
    Sequence analysis: 230 trios (690), 10 quartets (40), 10 MZ twin (40)
    Imputation of existing GWAS: ~100,000 Dutch samples with GWAS data
    Further analysis: structural variation, population genetics, de novo mutations, mitochondrial DNA
    This is an open national project: please contact debakker@broadinstitute.org; m.a.swertz@rug.nl; [email_address] for analysis ideas.
  • 29.
    Challenge 1: Data storage
    45TB raw data (fq.gz)
    450TB intermediate data (bam)
    90TB results (bam + vcf)
  • 30.
    Challenge 2: Alignment, Variant Calling, and QC pipelines
    Alignment (~1 week): alignment to human genome (Build 37); clean up alignment (mark duplicates, realignment, recalibration); quality control
    Variant calling (~1 week): SNP calling; indel calling; variant filtering
    QC: Immunochip concordance
  • 31.
    2300 lanes * 15 analysis steps => 34,500 commands needed
    > 2300 * 15 files, 2300 + 750 QC reports: a nightmare to track

    /data/gcc/tools/bwa-0.5.8c_patched/bwa aln \
      /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
      /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
      -t 4 \
      -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai

    /data/gcc/tools/bwa_45_patched/bwa sampe -P \
      -p illumina \
      -i L6 \
      -m 24173 \
      -l A80MP0ABXX \
      /data/gcc/resources/hg19/indices/human_g1k_v37.fa \
      /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe01.bwa_align_pair1.ftl.human_g1k_v37.2011_05_30_20_22.1.sai \
      /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe02.bwa_align_pair2.ftl.human_g1k_v37.2011_05_30_20_22.2.sai \
      /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_1.fq.gz \
      /data/gcc/rawdata/ngs/in-house/28may11/24173/110303_SN163_0393_L6_A80MP0ABXX_AGAGAT_2.fq.gz \
      -f /data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam

    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SamFormatConverter.jar \
      INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe03.bwa_sampe.ftl.human_g1k_v37.2011_05_30_20_22.sam \
      OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
      VALIDATION_STRINGENCY=LENIENT \
      MAX_RECORDS_IN_RAM=2000000 \
      TMP_DIR=/local

    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/SortSam.jar \
      INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe04.sam_to_bam.ftl.human_g1k_v37.2011_05_30_20_22.bam \
      OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
      SORT_ORDER=coordinate \
      VALIDATION_STRINGENCY=LENIENT \
      MAX_RECORDS_IN_RAM=1000000 \
      TMP_DIR=/local

    java -jar -Xmx3g /data/gcc/tools/picard-tools-1.32/BuildBamIndex.jar \
      INPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam \
      OUTPUT=/data/gcc/rawdata/ngs/in-house/28may11/results/24173/24173.393_L6.HSpe05.sam_sort.ftl.human_g1k_v37.2011_05_30_20_22.sorted.bam.bai \
      VALIDATION_STRINGENCY=LENIENT \
      MAX_RECORDS_IN_RAM=1000000 \
      TMP_DIR=/local
  • 32.
    Challenge 3: >200,000 compute hours
    Alignment: 2300 lanes, 15 steps, ~75 hours per lane
    SNP calling: 760 samples, 6 steps, ~50 hours per sample
    Immunochip QC: 760 samples, 5 steps, 1 hour per sample
    Compute power; network and storage I/O
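The ">200,000 hours" headline can be checked from the three per-stage figures on this slide. A small sketch of that sum. The slide does not say whether these are cpu-hours or per-job wall-clock hours, so `wall_clock_days` is only an idealized estimate that assumes perfect scaling. The cpu count in the usage line is a made-up example, not a number from the talk.

```python
# (stage, number of units, hours per unit) as quoted on the slide.
WORKLOAD = [
    ("alignment", 2300, 75),      # lanes
    ("snp_calling", 760, 50),     # samples
    ("immunochip_qc", 760, 1),    # samples
]


def total_hours(workload=WORKLOAD):
    """Sum units * hours-per-unit over all pipeline stages."""
    return sum(units * hours for _, units, hours in workload)


def wall_clock_days(hours, cpus):
    """Idealized elapsed time: assumes perfect scaling and one job per cpu."""
    return hours / cpus / 24


if __name__ == "__main__":
    hours = total_hours()
    print(hours, "compute hours in total")
    # Hypothetical cluster size, for illustration only.
    print(round(wall_clock_days(hours, cpus=500), 1), "days on 500 cpus")
```

Alignment accounts for most of the total, which is why the per-lane alignment throughput is the number to watch.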
  • 33.
    Challenge 4: Did we analyze it all? Correctly? Completely?
    Batches: UModqR 60, HUMcriR 90, HUMhxsR 222, HUMrutR 235, HUMjxbR 153, HUMsnrR 10
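One way to answer "did we analyze it all?" is to compare what should exist with what has been recorded. A minimal sketch under stated assumptions: the batch counts are copied from the slide (what they count is not stated), while `missing_results`, the step names, and the example lanes are hypothetical and not part of MOLGENIS.

```python
# Batch counts as shown on the slide.
BATCHES = {
    "UModqR": 60, "HUMcriR": 90, "HUMhxsR": 222,
    "HUMrutR": 235, "HUMjxbR": 153, "HUMsnrR": 10,
}


def missing_results(expected_lanes, steps, finished):
    """Return the sorted (lane, step) pairs that have no recorded result."""
    expected = {(lane, step) for lane in expected_lanes for step in steps}
    return sorted(expected - set(finished))


if __name__ == "__main__":
    print("items across batches:", sum(BATCHES.values()))
    todo = missing_results(
        expected_lanes=["FC80R35ABXX_L3", "FC80F2RABXX_L3"],
        steps=["bwa_align", "sam_sort"],  # hypothetical step names
        finished=[
            ("FC80R35ABXX_L3", "bwa_align"),
            ("FC80R35ABXX_L3", "sam_sort"),
            ("FC80F2RABXX_L3", "bwa_align"),
        ],
    )
    print("still missing:", todo)
```

Correctness is a separate question: a check like this only shows that a result exists for every lane and step, not that the result is valid.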
  • 35.
    Kickstart the project building on NBIC/BioAssist: NGS task force, Biobanking task force, e-BioGrid team
  • 36.
    Solution 1: GPFS shared data storage
    Primary storage in Groningen on ‘Target’
    Backup storage in Amsterdam on ‘BigGrid’
    Data transfer via hard drives
    Systematic organization of rawdata, resultdata, logs
    2,000 TB; 750 x 3TB disks; 3200 tapes; GPFS
    http://www.bbmriwiki.nl/wiki/DataManagement
    http://www.rug.nl/target/index
  • 37.
    Solution 2: data management via sample-lane worksheet

    sample  flowcell     lane  lib              machine  date    file
    A24a    FC80R35ABXX  L3    HUMhxsRJODIAAPE  I433     101119  101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
    A24a    FC80F2RABXX  L3    HUMhxsRJODIABPE  I481     101120  101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
    A24a    FC80GHKABXX  L2    HUMhxsRJODIBAPE  I114     101202  101202_I114_FC80GHKABXX_L2_HUMhxsRJODIBAPE
    A24b    FC80R35ABXX  L4    HUMhxsRJPDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
    A24b    FC80F2RABXX  L4    HUMhxsRJPDIABPE  I481     101120  101120_I481_FC80F2RABXX_L4_HUMhxsRJPDIABPE
    A24b    FC80GHKABXX  L3    HUMhxsRJPDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L3_HUMhxsRJPDIBAPE
    A24b    FC81C8UABXX  L3    HUMhxsRJPDIBAPE  I340     110114  110114_I340_FC81C8UABXX_L3_HUMhxsRJPDIBAPE
    A24c    FC80R35ABXX  L5    HUMhxsRJQDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L5_HUMhxsRJQDIAAPE
    A24c    FC80F2RABXX  L6    HUMhxsRJQDIABPE  I481     101120  101120_I481_FC80F2RABXX_L6_HUMhxsRJQDIABPE
    A24c    FC80GHKABXX  L4    HUMhxsRJQDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L4_HUMhxsRJQDIBAPE
    A25a    FC80R35ABXX  L6    HUMhxsRJRDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L6_HUMhxsRJRDIAAPE
    A25a    FC81C8UABXX  L2    HUMhxsRJRDIAAPE  I340     110114  110114_I340_FC81C8UABXX_L2_HUMhxsRJRDIAAPE
    A25a    FC80F54ABXX  L7    HUMhxsRJRDIABPE  I171     101122  101122_I171_FC80F54ABXX_L7_HUMhxsRJRDIABPE
    A25a    FC80GHKABXX  L5    HUMhxsRJRDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L5_HUMhxsRJRDIBAPE
    A25b    FC80R35ABXX  L7    HUMhxsRJSDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L7_HUMhxsRJSDIAAPE
    A25b    FC80EE1ABXX  L5    HUMhxsRJSDIABPE  I171     101122  101122_I171_FC80EE1ABXX_L5_HUMhxsRJSDIABPE
    A25b    FC80GHKABXX  L6    HUMhxsRJSDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L6_HUMhxsRJSDIBAPE
    A25b    FC80GHJABXX  L1    HUMhxsRJSDIBAPE  I117     101208  101208_I117_FC80GHJABXX_L1_HUMhxsRJSDIBAPE
    A25c    FC80R35ABXX  L8    HUMhxsRJTDIAAPE  I433     101119  101119_I433_FC80R35ABXX_L8_HUMhxsRJTDIAAPE
    A25c    FC80F54ABXX  L5    HUMhxsRJTDIABPE  I171     101122  101122_I171_FC80F54ABXX_L5_HUMhxsRJTDIABPE
    A25c    FC80GHKABXX  L7    HUMhxsRJTDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L7_HUMhxsRJTDIBAPE
    A25c    FC81C7KABXX  L5    HUMhxsRJTDIBAPE  I125     110115  110115_I125_FC81C7KABXX_L5_HUMhxsRJTDIBAPE
    A26a    FC80PEWABXX  L5    HUMhxsRJUDIAAPE  I198     101120  101120_I198_FC80PEWABXX_L5_HUMhxsRJUDIAAPE
    A26a    FC80F2RABXX  L7    HUMhxsRJUDIABPE  I481     101120  101120_I481_FC80F2RABXX_L7_HUMhxsRJUDIABPE
    A26a    FC80GHKABXX  L8    HUMhxsRJUDIBAPE  I114     101202  101202_I114_FC80GHKABXX_L8_HUMhxsRJUDIBAPE
    A26b    FC80N58ABXX  L5    HUMhxsRJVDIAAPE  I245     101120  101120_I245_FC80N58ABXX_L5_HUMhxsRJVDIAAPE
    A26b    FC80PNWABXX  L2    HUMhxsRJVDIABPE  I453     101119  101119_I453_FC80PNWABXX_L2_HUMhxsRJVDIABPE
    A26b    FC80G37ABXX  L1    HUMhxsRJVDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L1_HUMhxsRJVDIBAPE
    A26c    FC80LDLABXX  L1    HUMhxsRJWDIAAPE  I453     101119  101119_I453_FC80LDLABXX_L1_HUMhxsRJWDIAAPE
    A26c    FC80PNWABXX  L3    HUMhxsRJWDIABPE  I453     101119  101119_I453_FC80PNWABXX_L3_HUMhxsRJWDIABPE
    A26c    FC80G37ABXX  L2    HUMhxsRJWDIBAPE  I127     101126  101126_I127_FC80G37ABXX_L2_HUMhxsRJWDIBAPE
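In the worksheet rows shown, the `file` column is the other columns joined as date_machine_flowcell_lane_lib, so the sheet can be validated mechanically before any job is generated. A minimal sketch, assuming a whitespace-separated export of the worksheet. `SampleLane`, `parse_worksheet` and `expected_file` are illustrative names, not MOLGENIS classes, and only three rows are included here.

```python
from collections import namedtuple

SampleLane = namedtuple("SampleLane", "sample flowcell lane lib machine date file")


def expected_file(row):
    """Rebuild the file name from the other columns: date_machine_flowcell_lane_lib."""
    return "_".join([row.date, row.machine, row.flowcell, row.lane, row.lib])


def parse_worksheet(text):
    """Parse whitespace-separated worksheet lines (header first) into SampleLane rows."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, body = lines[0], lines[1:]
    if header != list(SampleLane._fields):
        raise ValueError(f"unexpected worksheet header: {header}")
    return [SampleLane(*fields) for fields in body]


WORKSHEET = """\
sample flowcell lane lib machine date file
A24a FC80R35ABXX L3 HUMhxsRJODIAAPE I433 101119 101119_I433_FC80R35ABXX_L3_HUMhxsRJODIAAPE
A24a FC80F2RABXX L3 HUMhxsRJODIABPE I481 101120 101120_I481_FC80F2RABXX_L3_HUMhxsRJODIABPE
A24b FC80R35ABXX L4 HUMhxsRJPDIAAPE I433 101119 101119_I433_FC80R35ABXX_L4_HUMhxsRJPDIAAPE
"""

rows = parse_worksheet(WORKSHEET)

# Rows whose file name does not follow the naming pattern.
inconsistent = [row for row in rows if row.file != expected_file(row)]

# Group lane files per sample: one sample is usually sequenced on several lanes.
lanes_per_sample = {}
for row in rows:
    lanes_per_sample.setdefault(row.sample, []).append(row.file)

if __name__ == "__main__":
    print("inconsistent rows:", inconsistent)
    print({sample: len(files) for sample, files in lanes_per_sample.items()})
```

Grouping by sample matters later in the pipeline, because alignment runs per lane while SNP calling runs per sample.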
  • 38.
    (of course it is a bit more advanced than that)
    NB: we have a beta Galaxy tool.xml mapper
    Based on GEN2PHEN ‘observation’ model
    We would love to have a shared workflow model
  • 39.
    Solution 3: auto-generate all computational protocols
    Auto-generate all the analysis commands (generate scripts):
      1. Create SampleLane list
      2. Generate pipeline from templates
      3. Submit to compute cluster
    Template: bwa aln ${lane}
    Generated (one per lane): bwa aln FC80R35ABXX_L3.fq.gz
    15 templates => 34,500 scripts
    http://www.bbmriwiki.nl/svn/ngs_pipelines/templates/ngs/
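The template step can be sketched in a few lines. The talk names Freemarker as the template engine. The sketch below swaps in Python's `string.Template`, which happens to use the same `${lane}` placeholder syntax. The template text, the `generate_scripts` helper, and the script naming are assumptions for illustration, not the project's actual templates.

```python
from string import Template

# Hypothetical one-step protocol template (the real ones are Freemarker files).
BWA_ALN = Template("bwa aln ${index} ${lane}.fq.gz -t ${threads} -f ${lane}.sai\n")


def generate_scripts(lanes, templates, params):
    """Return {script_name: script_text}, one script per (lane, template) pair."""
    scripts = {}
    for lane in lanes:
        for step, template in templates.items():
            # substitute() raises KeyError if a placeholder has no value,
            # so an incomplete worksheet fails early instead of on the cluster.
            scripts[f"{lane}.{step}.sh"] = template.substitute(lane=lane, **params)
    return scripts


scripts = generate_scripts(
    lanes=["FC80R35ABXX_L3", "FC80F2RABXX_L3"],
    templates={"bwa_aln": BWA_ALN},
    params={"index": "human_g1k_v37.fa", "threads": 4},
)

if __name__ == "__main__":
    for name, text in scripts.items():
        print(name, "->", text, end="")
```

With 2300 lanes and 15 templates the same loop yields the 34,500 scripts quoted on the slide. Each generated script can then be submitted to the cluster scheduler.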
  • 40.
    Solution 4: distributed compute efforts, > 200,000 hours
    Alignment: 2300 lanes, 15 steps, ~75 hours per lane
    SNP calling: 760 samples, 6 steps, ~50 hours per sample
    Immunochip QC: 760 samples, 5 steps, 1 hour per sample
    RUG CIT/Target: ~900 lanes done, ~240 per week, 360 cpus
    AMC/BigGrid: ~250 lanes done, ~30 per week, ~270 cpus
    EMC, Hubrecht, other BigGrid
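The per-site progress figures allow a rough projection of the remaining alignment time. This is only a back-of-the-envelope sketch: it assumes the weekly rates stay constant, that only the two sites with numbers on the slide contribute, and that the total is the 2300 lanes quoted for alignment. `weeks_remaining` is an illustrative helper.

```python
TOTAL_LANES = 2300  # alignment workload quoted on the slide

# Progress per site as shown on the slide (approximate figures).
SITES = {
    "RUG CIT/Target": {"done": 900, "per_week": 240},
    "AMC/BigGrid": {"done": 250, "per_week": 30},
}


def weeks_remaining(total=TOTAL_LANES, sites=SITES):
    """Remaining lanes divided by the combined weekly throughput."""
    done = sum(site["done"] for site in sites.values())
    rate = sum(site["per_week"] for site in sites.values())
    return (total - done) / rate


if __name__ == "__main__":
    print(f"about {weeks_remaining():.1f} weeks of alignment left at the current rate")
```

Any additional site that comes online shortens this estimate, which is the point of distributing the work.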
  • 41.
    Solution 5: a tool to submit and monitor compute jobs
  • 42.
    Solution 6: REST based services
    To interact with R, Galaxy, Taverna (WSDL), Shell etc.
    http://www.molgenis.org/wiki/MolgenisRestInterface
    http://www.molgenis.org/wiki/MolgenisRinterface
    e.g. simply upload a csv from shell:
      curl -d 'data_type_input=org.molgenis.pheno.Individual' \
           -d 'data_input=Name,Descriptio%0AInd1,Desc1%0AInd2,Desc2' \
           -d 'data_action=ADD' \
           -d 'data_silent=F' \
           -d 'submit_input=submit' \
           http://vm7.target.rug.nl/ngs_test/api/add
    e.g. simply get data via R:
      source("http://a.host:8080/molgenis_ngs/api/R")
      res <- find.NgsSample();
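The same upload can be prepared from any language that can send a form-encoded POST. A minimal sketch using only the Python standard library. The form field names and the endpoint are taken from the curl example on the slide. The CSV header `Name,Description` is an assumption (the slide shows it as `Descriptio`), and `build_add_request` is an illustrative helper, not part of MOLGENIS. The request is built but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ADD = "http://vm7.target.rug.nl/ngs_test/api/add"  # endpoint shown on the slide


def build_add_request(entity, header, rows, url=API_ADD):
    """Build (but do not send) a POST request that uploads rows as CSV text."""
    csv_text = "\n".join(",".join(fields) for fields in [header, *rows])
    form = {
        "data_type_input": entity,
        "data_input": csv_text,
        "data_action": "ADD",
        "data_silent": "F",
        "submit_input": "submit",
    }
    # urlencode percent-encodes the newlines (%0A) and commas for us.
    return Request(url, data=urlencode(form).encode("ascii"), method="POST")


req = build_add_request(
    "org.molgenis.pheno.Individual",
    header=("Name", "Description"),  # assumed column names
    rows=[("Ind1", "Desc1"), ("Ind2", "Desc2")],
)
body = req.data.decode("ascii")

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    print(body)
```

Passing `req` to `urllib.request.urlopen` would perform the upload, provided the server is reachable.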
  • 43.
    All working together (beta)
    MOLGENIS user interface for NGS (Java)
    Petabyte file storage (GPFS, GridFS?)
    Compute cluster (PBS, Grid?): bwa aln ${lane}
    Protocol catalogue (Freemarker)
    Lane & sample metadata and QC reports (MySQL)
    MOLGENIS/compute: generate ‘ProtocolApplications’; submit and monitor (GridGain)
    API, used by R, Galaxy, Taverna, IGV, UCSC: data & protocols, result exploration, test & play
  • 45.
    Download demo from DropBox: http://dl.dropbox.com/u/1839500/Swertz_BOSC_2011.mp4
  • 47.
    Alignment results
    Alignment (~1 week): alignment to human genome (Build 37); clean up alignment (mark duplicates, realignment, recalibration); quality control
    Variant calling (~1 week): individual SNP calling; indel calling; variant filtering
    >94% reads aligned; >13x avg coverage
  • 48.
    SNP calling result (GoNL Pilot Chr20 – 1KG Phase I)

    Set               SNPs     %dbSNP  Ti/Tv
    GoNL Pilot only   16,045   2.05    2.20
    Intersection      177,389  65.91   2.41
    1KG Phase 1 only  648,284  10.23   2.36

    1KG estimated Chr20 Ti/Tv: 2.36
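The three counts describe a two-set Venn diagram, so the overlap can be expressed as fractions. A small sketch using only the numbers on the slide. `overlap_fractions` is an illustrative helper, and reading the counts as distinct chr20 SNP sites is an assumption.

```python
# Chr20 SNP counts from the slide.
GONL_ONLY = 16_045
SHARED = 177_389
KG_ONLY = 648_284


def overlap_fractions(a_only=GONL_ONLY, shared=SHARED, b_only=KG_ONLY):
    """Return (share of A also in B, share of B also in A, Jaccard index)."""
    in_a = a_only + shared
    in_b = b_only + shared
    union = a_only + shared + b_only
    return shared / in_a, shared / in_b, shared / union


if __name__ == "__main__":
    gonl_in_kg, kg_in_gonl, jaccard = overlap_fractions()
    print(f"GoNL pilot SNPs also in 1KG: {gonl_in_kg:.1%}")
    print(f"1KG SNPs also in GoNL pilot: {kg_in_gonl:.1%}")
    print(f"Jaccard index: {jaccard:.3f}")
```

About 92% of the GoNL pilot calls on chr20 are also in 1KG Phase 1. The GoNL-only set has the lowest dbSNP fraction and a lower Ti/Tv than the intersection, as the table shows.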
  • 49.
    Next…
    Polish the software ... a lot
      It's MOLGENIS so anybody can download and customize (ideas anyone?)
      Integrate the login/security module
      Provide reports for the ‘end-users’
      Enable trend analyses, etc.
    Integrate and run more pipelines for GoNL
      Structural Variation Group: finalize GoNL SV pipeline; integrate SNP calling / SV pipelines
      Imputation Group: phase pilot data; impute sequence data; estimate gain of GoNL vs HapMap/1KG as imputation panel
  • 50.
    Acknowledgements
    GoNL/MOLGENIS Infrastructure team: George Byelas, Martijn Dijkstra, Robert Wagner, Pieter Neerincx, Abhishek Narain, Jan Bot and indirectly GEN2PHEN, EBI, FIMM, ...
    GoNL Analysis team (creating pipelines and tools): Freerk van Dijk (UMCG), Barbera van Schaik (AMC), Ies Nijman (Hubrecht), Slavik Koval (EMC), Laurent Francioli (UU), Kai Ye (LUMC), Jeroen Laros (LUMC), Lennart Karssen (EMC), JoukeJan Hottenga (VU), Mathijs Kattenberg (VU), David van Enckvort (NBIC), Leon Mei (NBIC), Elise van Leeuwen (EMC), … and many, many others
    GoNL Steering group (coordination): Cisca Wijmenga (PI GoNL), Morris Swertz (PI analysis), Gertjan van Ommen (LUMC), Eline Slagboom (LUMC), Jasper Bovenberg (ELSI issues), Cornelia van Duijn (EMC), Dorret Boomsma (VU), Paul de Bakker (co-PI analysis, UU)
    Get all as open source:
      GoNL - http://www.nlgenome.nl
      MOLGENIS - http://www.molgenis.org
      Analysis team - http://www.bbmriwiki.nl
    Contact? [email_address]

Editor's Notes

  • #22 Phase information; accurate haplotypes Better characterization of Structural Variation Detection of de novo variants and new mutation rates