QBI Centre for Brain Genomics (informatics side)

An overview of QBI’s production informatics framework with an emphasis on what service will be provided and how the resulting data is made available: from interactive quality control to integration with external data on the genome browser.

Published in: Technology, Education

  • Transcript

    • 1. QBI’s Centre for Brain Genomics
      The informatics side of things
      [Sprengben [why not get a friend]]
      September 8, 2011
    • 2. Objectives of QBI’s Centre for Brain Genomics
      On-time delivery
      Reliable data production
      Convincing data
      Easy delivery
      Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun PMID: 21716280.
    • 3. Bird’s-eye view of the facility’s workflow
    • 4. Detailed workflow
      [Figure: workflow diagram. Labels: cBot, HiSeq, flowcell, cluster, CASAVA, raw sequence reads, projects, 30 different programs]
    • 5. Overview of Production Informatics framework
      [Figure: framework diagram. Labels as follows]
      Modes: Automatic, Manual
      Stages: Processing, Evaluation
      Directories: Run/, Data/, Unaligned/, bwa/, reCalAln/, variant/
      Scripts: MakeFastq.sh, trigger.sh armed, trigger.sh html
      Report: Summary.html
      Tools: Apache, IGV, R, UCSC
      Hosts: //clusterstorage, //cluster-vm
    • 6. Trigger.sh
      Keeping data separate from scripts
      Automating verification, quality control and summary HTML generation
      Rerunning the pipeline from any point
    • 7. Flexible generic names: header
      # Programs
      BWA="/clusterdata/hiseq_apps/bin/$MODE/bwa"
      SAMTOOLS="/clusterdata/hiseq_apps/bin/$MODE/samtools"
      IGVTOOLS="/clusterdata/hiseq_apps/bin/$MODE/igvtools/IGVTools/igvtools.jar"
      # Task names
      TASKFASTQC="fastQC"
      TASKBWA="bwa"
      TASKRCA="reCalAln"
      # File abbreviations
      READONE="read1"
      READTWO="read2"
      FASTQ="fastq.gz"
      ALN="aln" # aligned
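The generic names in the header let scripts derive file names by substitution instead of hard-coding them. The following is a minimal sketch of that idea, assuming input files are named like `s_1_read1.fastq.gz`; the helper names `mate_of` and `sai_name` are hypothetical and do not appear in the slides.

```shell
#!/usr/bin/env bash
# File abbreviations as defined in the header slide.
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"

# Hypothetical helper: name of the read-2 file that pairs with a read-1 file.
mate_of() {
    local f="$1"
    echo "${f/$READONE/$READTWO}"
}

# Hypothetical helper: name of the bwa aln output (.sai) for a FASTQ file,
# with the directory part stripped.
sai_name() {
    local n
    n="$(basename "$1")"
    echo "${n/$FASTQ/sai}"
}

mate_of "Unaligned/s_1_read1.fastq.gz"
sai_name "Unaligned/s_1_read1.fastq.gz"
```

Because only the variables carry the naming convention, changing `READONE` or `FASTQ` in one place would be enough to adapt every script that uses these helpers.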
    • 8. Config.txt
      #********************
      # Tasks
      #********************
      mappingBWA="1"
      recalibrateQualScore="1"
      #********************
      # Paths
      #********************
      FASTA="/clusterdata/resources/hg19/hg19.fasta"
      SEQREG="chr1:229994688-230071581"
      DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"
      #********************
      # PARAMETER
      #********************
      LIBRARY="QBI"
      ADDPARAMBWA="--force single"
      Tasks: specifies what to do, e.g. mapping and recalibration
      Paths: specifies where to find resources
      Parameters: customizes the standard scripts for this project
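One way a pipeline script could consume such a config is to source it and run only the tasks whose flag is set to "1". This is a sketch under the assumption that config.txt is plain shell; the helper `enabled_tasks` is hypothetical, and the task names are taken from the header slide.

```shell
#!/usr/bin/env bash
# Sketch: a project config switches tasks on and off.
# Assumption (not from the slides): the config is plain shell and can be sourced.

# Hypothetical helper: print the tasks enabled in a config file, one per line.
enabled_tasks() {
    local config="$1"
    (
        # Source in a subshell so project settings do not leak into the caller.
        source "$config"
        if [ "${mappingBWA:-0}" = "1" ]; then echo "bwa"; fi
        if [ "${recalibrateQualScore:-0}" = "1" ]; then echo "reCalAln"; fi
    )
}

# Demo with a throwaway config in a temporary directory.
demo_dir="$(mktemp -d)"
printf '%s\n' 'mappingBWA="1"' 'recalibrateQualScore="0"' \
    'FASTA="/clusterdata/resources/hg19/hg19.fasta"' > "$demo_dir/config.txt"
enabled_tasks "$demo_dir/config.txt"
```

A flag that is missing from the config is treated as switched off, so a project only needs to list the tasks it wants.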
    • 9. call
      trigger.sh config.txt armed
      trigger.sh config.txt html
      [Figure: files produced per lane]
      Reads: s_1_read1.fastq, s_1_read2.fastq, s_2_read1.fastq, s_2_read2.fastq, s_3_read1.fastq, s_3_read2.fastq, s_4_read1.fastq, s_4_read2.fastq
      Alignments: s_1.bam, s_2.bam, s_3.bam, s_4.bam
      Processed alignments: s_1.ashrr.bam, s_2.ashrr.bam, s_3.ashrr.bam, s_4.ashrr.bam
      Logs: Sub1_s_1.out, Sub1_s_2.out, Sub2_s_3.out, Sub2_s_4.out
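The two calls suggest an entry point that takes a config file and a mode. The sketch below shows such a dispatcher. It assumes that `armed` submits the jobs and `html` rebuilds the summary page, and that omitting the mode means a dry run; the slides do not state the last point. The function only prints the action it would take, so it is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Sketch of a trigger-style entry point: trigger <config> [armed|html]
# Assumption (not from the slides): no mode means a dry run.

# Hypothetical dispatcher; it reports the action instead of performing it.
trigger() {
    local config="$1" mode="${2:-}"
    if [ ! -r "$config" ]; then
        echo "cannot read config: $config" >&2
        return 1
    fi
    case "$mode" in
        armed) echo "submit jobs for $config" ;;
        html)  echo "rebuild Summary.html for $config" ;;
        "")    echo "dry run for $config (nothing submitted)" ;;
        *)     echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}

demo_config="$(mktemp)"
trigger "$demo_config" armed
trigger "$demo_config" html
```

Requiring an explicit `armed` argument before anything is submitted would make accidental cluster submissions less likely, which may be why the mode carries that name.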
    • 10. Summary.html
      Project Cards
      Sequence statistics
      Run checkpoints
      Data Visualization
      Mapping stats
      Download
      Interesting Regions
    • 11. Scaffold of pbsScripts.sh: Error catching
      Code example for setting up what errors to look out for
      # QCVARIABLES, loosing reads, unmapped read,no such file,file not found,bwa.sh: line
      Output in Summary.html
      >>>>>>>>>> Errors
      QC_PASS .. 0 have We are loosing reads/184
      QC_PASS .. 0 have for unmapped read/184
      QC_PASS .. 0 have no such file/184
      QC_PASS .. 0 have file not found/184
      QC_PASS .. 0 have bwa.sh: line/184
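The report above counts, for each error phrase, how many of the 184 job logs contain it. A check of that kind can be sketched as follows. The helper `qc_errors`, the `.out` log suffix and the `QC_FAIL` label are assumptions; the slide only shows passing lines.

```shell
#!/usr/bin/env bash
# Sketch: count how many job logs contain each error phrase.
# Assumptions (not from the slides): logs end in .out, and QC_FAIL is the
# label printed when a phrase is found in at least one log.

# Hypothetical helper: qc_errors <logdir> <phrase>...
qc_errors() {
    local logdir="$1"; shift
    local logs=() f phrase hits verdict
    for f in "$logdir"/*.out; do
        if [ -f "$f" ]; then logs+=("$f"); fi
    done
    local total="${#logs[@]}"
    if [ "$total" -eq 0 ]; then
        echo "no .out logs in $logdir" >&2
        return 1
    fi
    for phrase in "$@"; do
        # -l lists each matching file once, -F matches the phrase literally.
        hits="$(grep -l -F -- "$phrase" "${logs[@]}" | wc -l | tr -d '[:space:]')"
        if [ "$hits" -eq 0 ]; then verdict="QC_PASS"; else verdict="QC_FAIL"; fi
        echo "$verdict .. $hits have $phrase/$total"
    done
}

# Demo with two throwaway logs, one of which contains an error.
demo_err="$(mktemp -d)"
echo "********* mapping" > "$demo_err/Sub1_s_1.out"
echo "bwa.sh: line 12: no such file" > "$demo_err/Sub1_s_2.out"
qc_errors "$demo_err" "loosing reads" "no such file" "bwa.sh: line"
```

Counting files rather than matching lines keeps the denominator meaningful: a single log that repeats the same error many times still counts as one failed job.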
    • 12. Scaffold of pbsScripts.sh: checkpoints
      qsub -b y -j y [PBSOPTIONS] pbsScript.sh -k HISEQINF [PARAMETERS]
      Code example for setting up checkpoints in the pbsScript.sh
      echo "********* mapping"
      $BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
      $BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}
      Output in Summary.html
      >>>>>>>>>> CheckPoints
      QC_PASS .. 184 have mapping/184
      QC_PASS .. 184 have sorting and bam-conversion/184
      QC_PASS .. 184 have mark duplicates/184
      QC_PASS .. 184 have statistics/184
      QC_PASS .. 184 have coverage track/184
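The checkpoint report is the mirror image of the error report: a checkpoint passes only if every log contains its marker. The sketch below assumes each job echoes a line of the form `********* <name>` as in the code example above. The helper `qc_checkpoints`, the `.out` suffix and the `QC_FAIL` label are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify that every job log reached each checkpoint.
# Assumptions (not from the slides): logs end in .out, and QC_FAIL is the
# label printed when at least one log lacks the checkpoint marker.

# Hypothetical helper: qc_checkpoints <logdir> <checkpoint>...
qc_checkpoints() {
    local logdir="$1"; shift
    local logs=() f name hits verdict
    for f in "$logdir"/*.out; do
        if [ -f "$f" ]; then logs+=("$f"); fi
    done
    local total="${#logs[@]}"
    if [ "$total" -eq 0 ]; then
        echo "no .out logs in $logdir" >&2
        return 1
    fi
    for name in "$@"; do
        # -F makes the asterisks in the marker literal characters.
        hits="$(grep -l -F -- "********* $name" "${logs[@]}" | wc -l | tr -d '[:space:]')"
        if [ "$hits" -eq "$total" ]; then verdict="QC_PASS"; else verdict="QC_FAIL"; fi
        echo "$verdict .. $hits have $name/$total"
    done
}

# Demo: the second job stopped before the statistics step.
demo_cp="$(mktemp -d)"
printf '%s\n' "********* mapping" "********* statistics" > "$demo_cp/Sub1_s_1.out"
printf '%s\n' "********* mapping" > "$demo_cp/Sub1_s_2.out"
qc_checkpoints "$demo_cp" "mapping" "statistics"
```

Used together with an error scan, this catches jobs that died silently: a log with no error phrase but a missing checkpoint still lowers the count.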
    • 13. Availability: tailored to skills
      1. Website
      2. RStudio
      3. Command line
    • 14. The big picture
      Covering all aspects of: design*, set-up*, maintenance*, usage
      (*except cluster)
      Documentation: Project Server
      //project
      5 TB raw data
      750 GB processed data
      57 GB external data
      7 project-cards
      10 Projects, 6 HiSeq-Runs
      40 wiki pages, 250 Tasks, 551h logged
      160 Commits
      35 external programs
      41 custom scripts (4197 lines of code)
      [Figure: architecture diagram. Labels as follows]
      Application; Backup/Version Control; Data Warehousing; Statistic; Analysis; Hypothesis Generation
      HiSeq Output; Raw Data; Quality Control; Project Cards; Processed Data; Rsync
      Software: BWA, GATK, samtools, etc.; Custom Scripts; Version Control
      Data; Processing and Analysis; Cluster
      External Genomic Resources: Genomes, Annotation, etc.
      Project Server; Content; RStudio; Galaxy
      Visualization: IGV, Genome Browser
      Hosts: //cluster-vm, //clusterstorage, //groupshare, //ethan
    • 15. Three things to remember
      Reliable data production
      All projects share a similar structure and are processed in the same way
      Convincing data
      All steps are tightly quality controlled and the QC report is accessible
      Easy delivery
      We tailored data availability to skill levels (web page, RStudio, console)
      On-time delivery
      Production informatics has priority on the cluster
    • 16. Next week
      NGS Discussion group:
      Methylation analysis
      Kevin Dudley and Danay Baker-Andresen