
QBI Centre for Brain Genomics (informatics side)



An overview of QBI’s production informatics framework with an emphasis on what service will be provided and how the resulting data is made available: from interactive quality control to integration with external data on the genome browser.

Published in: Technology, Education


  1. QBI’s Centre for Brain Genomics
     The informatics side of things
     [Sprengben [why not get a friend]]
     September 8, 2011
  2. Objective of QBI’s Centre for Brain Genomics
     On-time delivery
     Reliable data production
     Convincing data
     Easy delivery
     Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun. PMID: 21716280.
  3. Bird's-eye view of the facility’s workflow
  4. Detailed workflow
     [Figure: workflow diagram. Labels: cBot, HiSeq, flowcell, CASAVA, raw sequence reads, projects, cluster, 30 different programs]
  5. Overview of the production informatics framework
     [Figure: framework diagram. Labels: Automatic, Manual; Processing, Evaluation; Run/, Data/; armed, html; Unaligned/; bwa/, reCaAl/, variant/; Summary.html; //clusterstorage; Apache, IGV, R, UCSC; //cluster-vm]
  6.
     Keeping data separate from scripts
     Automating verification, quality control and summary HTML generation
     Rerunning the pipeline from any point
  7. Flexible generic names: header
     # Programs
     BWA="/clusterdata/hiseq_apps/bin/$MODE/bwa"
     SAMTOOLS="/clusterdata/hiseq_apps/bin/$MODE/samtools"
     IGVTOOLS="/clusterdata/hiseq_apps/bin/$MODE/igvtools/IGVTools/igvtools.jar"
     # Task names
     TASKFASTQC="fastQC"
     TASKBWA="bwa"
     TASKRCA="reCalAln"
     # File abbreviations
     READONE="read1"
     READTWO="read2"
     FASTQ="fastq.gz"
     ALN="aln" # aligned
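The value of these generic names is that file names can be derived rather than hard-coded. A minimal sketch, assuming bash and its `${var/pattern/replacement}` substitution (the same form used in the mapping commands on slide 12); the helper names `mate_of` and `sai_name` are hypothetical and do not appear in the slides:

```shell
#!/usr/bin/env bash
# Abbreviations copied from the slide's header.
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"

# Hypothetical helpers (illustrative names, not from the slides).
# Derive the read2 file name from the read1 file name.
mate_of()  { local f="$1"; echo "${f/$READONE/$READTWO}"; }
# Derive the bwa .sai output name from a fastq.gz file name.
sai_name() { local n="$1"; echo "${n/$FASTQ/sai}"; }

mate_of  "s_1_read1.fastq.gz"
sai_name "s_1_read1.fastq.gz"
```

Run as written, this prints `s_1_read2.fastq.gz` and `s_1_read1.sai`. Changing a naming convention then means editing one variable in the header instead of every script.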
  8. Config.txt
     #********************
     # Tasks
     #********************
     mappingBWA="1"
     recalibrateQualScore="1"
     #********************
     # Paths
     #********************
     FASTA="/clusterdata/resources/hg19/hg19.fasta"
     SEQREG="chr1:229994688-230071581"
     DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"
     #********************
     # PARAMETER
     #********************
     LIBRARY="QBI"
     ADDPARAMBWA="--force single"
     Tasks: specifies what to do, e.g. mapping and recalibration
     Paths: specifies where to find resources
     Parameter: customizes the standard scripts for this project
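Because the config file is plain shell variable assignments, a driver script can source it and decide which tasks to run. The following is only a sketch of that idea: the slides do not show how the flags are read, so treating an empty value as "task off" and the function name `enabled_tasks` are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: how the real scripts consume config.txt is not shown on the slides.

# Write a small stand-in config with one task switched off (assumed: empty = off).
CONFIG=$(mktemp)
cat > "$CONFIG" <<'EOF'
mappingBWA="1"
recalibrateQualScore=""
LIBRARY="QBI"
EOF

# Source the config so its variables become visible to this script.
. "$CONFIG"
rm -f "$CONFIG"

# Hypothetical helper: list the task names whose flag is non-empty.
enabled_tasks() {
    local tasks=""
    if [ -n "$mappingBWA" ]; then tasks="$tasks bwa"; fi
    if [ -n "$recalibrateQualScore" ]; then tasks="$tasks reCalAln"; fi
    echo "${tasks# }"
}

enabled_tasks
```

With the stand-in config above, only `bwa` is listed. Keeping project-specific switches in the config is what lets the same standard scripts serve every project.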
  9. Call
     trigger.sh config.txt armed
     trigger.sh config.txt html
     [Figure: files involved in a run]
     s_1_read1.fastq, s_1_read2.fastq, s_2_read1.fastq, s_2_read2.fastq, s_3_read1.fastq, s_3_read2.fastq, s_4_read1.fastq, s_4_read2.fastq
     s_1.bam, s_2.bam, s_3.bam, s_4.bam
     s_1.ashrr.bam, s_2.ashrr.bam, s_3.ashrr.bam, s_4.ashrr.bam
     Sub1_s_1.out, Sub1_s_2.out, Sub2_s_3.out, Sub2_s_4.out
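The two calls differ only in their last argument, which suggests a small dispatcher on a mode word. The sketch below is a guess at that shape, not the actual `trigger.sh`: what each mode really does is not spelled out in the slides, so the messages are placeholders and the function name `trigger` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the real trigger.sh is not shown; the mode behaviour is assumed.

trigger() {
    local config="$1" mode="$2"
    # Refuse to run without a readable config file.
    if [ ! -f "$config" ]; then
        echo "missing config: $config" >&2
        return 2
    fi
    case "$mode" in
        armed) echo "would submit processing jobs using $config" ;;
        html)  echo "would rebuild Summary.html using $config" ;;
        *)     echo "usage: trigger.sh CONFIG {armed|html}" >&2
               return 1 ;;
    esac
}

# Demo with a throwaway config file.
demo_config=$(mktemp)
trigger "$demo_config" armed
trigger "$demo_config" html
rm -f "$demo_config"
```

Separating the processing call from the report call means the summary can be regenerated at any time without resubmitting jobs.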
  10. Summary.html
      Project Cards
      Sequence statistics
      Run checkpoints
      Data visualization
      Mapping stats
      Download
      Interesting regions
  11. Scaffold of error catching
      Code example for setting up which errors to look out for
      # QCVARIABLES, loosing reads, unmapped read,no such file,file not found, line
      Output in Summary.html
      >>>>>>>>>> Errors
      QC_PASS .. 0 have We are loosing reads/184
      QC_PASS .. 0 have for unmapped read/184
      QC_PASS .. 0 have no such file/184
      QC_PASS .. 0 have file not found/184
      QC_PASS .. 0 have line/184
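The summary lines above read as "how many of the 184 job logs contain this error string". A minimal sketch of such a scan, under stated assumptions: logs are `*.out` files in one directory, a pattern passes when no log contains it, and the `QC_FAIL` label and the function name `report_errors` are invented here (the slide only shows `QC_PASS`).

```shell
#!/usr/bin/env bash
# Sketch only: QC_FAIL and the *.out log layout are assumptions, not from the slides.

# Print one summary line per error pattern: how many job logs contain it.
report_errors() {
    local logdir="$1"; shift
    local total hits status pattern
    total=$(find "$logdir" -maxdepth 1 -type f -name '*.out' | wc -l | tr -d ' ')
    echo ">>>>>>>>>> Errors"
    for pattern in "$@"; do
        # grep -l names each matching file once; -F matches the pattern literally.
        hits=$(grep -l -F -- "$pattern" "$logdir"/*.out 2>/dev/null | wc -l | tr -d ' ')
        if [ "$hits" -eq 0 ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $pattern/$total"
    done
}

# Demo with two throwaway logs, one of which contains an error string.
demo=$(mktemp -d)
echo "bwa: no such file or directory" > "$demo/Sub1_s_1.out"
echo "all good" > "$demo/Sub1_s_2.out"
report_errors "$demo" "no such file" "file not found"
rm -rf "$demo"
```

The patterns are passed as arguments so that the list of errors to watch for stays in one place, in the spirit of the QCVARIABLES line on the slide.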
  12. Scaffold of checkpoints
      qsub -by -jy [PBSOPTIONS] -k HISEQINF [PARAMETERS]
      Code example for setting up checkpoints
      echo "********* mapping"
      $BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
      $BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}
      Output in Summary.html
      >>>>>>>>>> CheckPoints
      QC_PASS .. 184 have mapping/184
      QC_PASS .. 184 have sorting and bam-conversion/184
      QC_PASS .. 184 have mark duplicates/184
      QC_PASS .. 184 have statistics/184
      QC_PASS .. 184 have coverage track/184
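Checkpoints work the opposite way to error patterns: each job echoes a marker before a step, and the summary expects every log to contain it. The sketch below assumes the `*********` marker from the slide's `echo` line is what gets counted; the `QC_FAIL` label, the `*.out` log layout and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: QC_FAIL and the *.out log layout are assumptions, not from the slides.

# Writer side: a job script announces each step before running it.
checkpoint() { echo "********* $1"; }

# Reader side: count the logs that reached each step; pass only if all did.
report_checkpoints() {
    local logdir="$1"; shift
    local total hits status step
    total=$(find "$logdir" -maxdepth 1 -type f -name '*.out' | wc -l | tr -d ' ')
    echo ">>>>>>>>>> CheckPoints"
    for step in "$@"; do
        hits=$(grep -l -F -- "********* $step" "$logdir"/*.out 2>/dev/null | wc -l | tr -d ' ')
        if [ "$hits" -eq "$total" ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $step/$total"
    done
}

# Demo: the second job stops before marking duplicates.
demo=$(mktemp -d)
{ checkpoint "mapping"; checkpoint "mark duplicates"; } > "$demo/Sub1_s_1.out"
{ checkpoint "mapping"; } > "$demo/Sub1_s_2.out"
report_checkpoints "$demo" "mapping" "mark duplicates"
rm -rf "$demo"
```

In the demo, `mapping` is reported as `2 have mapping/2` and passes, while `mark duplicates` is flagged because only one of the two logs reached it.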
  13. Availability: tailored to skills
      1. Website
      2. RStudio
      3. Command line
  14. The big picture
      Covering all aspects of design*, set-up*, maintenance* and usage (*except cluster)
      Documentation: Project Server
      //project
      5 TB raw data
      750 GB processed data
      57 GB external data
      7 project cards
      10 projects, 6 HiSeq runs
      40 wiki pages, 250 tasks, 551 h logged
      160 commits
      35 external programs
      41 custom scripts (4197 lines of code)
      [Figure: architecture diagram. Labels: Application; Backup/Version Control; Data Warehousing; Statistical Analysis; HiSeq Output; RStudio; Raw Data; Quality Control; Project Cards; Processed Data; Rsync; Hypothesis Generation; Software (BWA, GATK, samtools, etc.); Custom Scripts; Version Control; Data; Processing and Analysis; External Genomic Resources; Cluster; Genomes, Annotation, etc.; Project Server; Content; Galaxy; Visualization; IGV; Genome Browser; //cluster-vm; //clusterstorage; //groupshare, //ethan]
  15. Three things to remember
      Reliable data production: all projects share a similar structure and are processed in the same way
      Convincing data: all steps are tightly quality controlled and the QC report is accessible
      Easy delivery: we tailored data availability to skill levels (web page, RStudio, console)
      On-time delivery: production informatics has priority on the cluster
  16. Next week
      NGS Discussion group: Methylation analysis
      Kevin Dudley and Danay Baker-Andresen