QBI Centre for Brain Genomics (Informatics side)
 

An overview of QBI’s production informatics framework with an emphasis on what service will be provided and how the resulting data is made available: from interactive quality control to integration with external data on the genome browser.


Presentation Transcript

  • QBI’s Centre for Brain Genomics
    The informatics side of things
    [Sprengben [why not get a friend]]
    September 8, 2011
  • Objectives of QBI’s Centre for Brain Genomics
    On-time delivery
    Reliable data production
    Convincing data
    Easy delivery
    Perkel JM. Coding your way out of a problem. Nat Methods. 2011 Jun PMID: 21716280.
  • Bird's-eye view of the facility’s workflow
  • Detailed workflow
    [Diagram labels: flowcell, cBot, HiSeq, CASAVA, raw sequence reads, projects, 30 different programs, cluster]
  • Overview of Production Informatics framework
    [Diagram labels]
    Modes: Automatic, Manual
    Stages: Processing, Evaluation
    Directories: Run/, Data/, Unaligned/, bwa/, reCaAl/, variant/
    Scripts: MakeFastq.sh, trigger.sh armed, trigger.sh html
    Report: Summary.html
    Hosts: //clusterstorage, //cluster-vm
    Tools: Apache, IGV, R, UCSC
  • Trigger.sh
    Keeping data separate from scripts
    Automating verification, quality control and summary HTML generation
    Rerunning the pipeline from any point
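The separation of data from scripts can be illustrated with a small dispatcher. This is only a sketch, not the original trigger.sh: the function names `trigger_action` and `trigger_main` and the action labels `submit-jobs` and `build-summary` are hypothetical, and the reading of `armed` as "submit the processing jobs" and `html` as "build the summary report" is an assumption based on the mode names in the slides.

```shell
# Hypothetical sketch of a trigger-style dispatcher; not the original trigger.sh.

# trigger_action MODE: print the action chosen for MODE.
# Assumption: "armed" submits processing jobs, "html" builds the report.
trigger_action() {
    case "$1" in
        armed) echo "submit-jobs" ;;
        html)  echo "build-summary" ;;
        *)     echo "unknown-mode"; return 1 ;;
    esac
}

# trigger_main CONFIG MODE: load the project config, then dispatch on MODE.
trigger_main() {
    if [ ! -r "$1" ]; then
        echo "cannot read config: $1" >&2
        return 2
    fi
    # The config is plain shell, so sourcing it keeps project-specific
    # values (paths, task switches) out of the generic scripts.
    . "$1"
    trigger_action "$2"
}
```

A call would then look like `trigger_main ./config.txt armed`, with everything project-specific living in the config file rather than in the script.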
  • Flexible generic names: header
    # Programs
    BWA="/clusterdata/hiseq_apps/bin/$MODE/bwa"
    SAMTOOLS="/clusterdata/hiseq_apps/bin/$MODE/samtools"
    IGVTOOLS="/clusterdata/hiseq_apps/bin/$MODE/igvtools/IGVTools/igvtools.jar"
    # Task names
    TASKFASTQC="fastQC"
    TASKBWA="bwa"
    TASKRCA="reCalAln"
    # File abbreviations
    READONE="read1"
    READTWO="read2"
    FASTQ="fastq.gz"
    ALN="aln" # aligned
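The generic names in the header pay off when file names are derived from one another. The sketch below shows one way this might look; the helper names `replace_first`, `mate_file` and `sai_file` are hypothetical. It uses portable prefix and suffix removal as a stand-in for the bash substitution `${f/$READONE/$READTWO}` that appears in the pipeline scripts.

```shell
# Generic names, as on the header slide.
READONE="read1"
READTWO="read2"
FASTQ="fastq.gz"

# replace_first STRING OLD NEW: replace the first occurrence of OLD in STRING.
# Portable stand-in for bash's ${STRING/OLD/NEW}; OLD is matched literally.
replace_first() {
    case "$1" in
        *"$2"*)
            prefix=${1%%"$2"*}   # text before the first OLD
            suffix=${1#*"$2"}    # text after the first OLD
            printf '%s\n' "$prefix$3$suffix"
            ;;
        *)
            printf '%s\n' "$1"   # OLD not present: return STRING unchanged
            ;;
    esac
}

# mate_file READ1_FILE: name of the paired read-2 file.
mate_file() { replace_first "$1" "$READONE" "$READTWO"; }

# sai_file FASTQ_FILE: name of the alignment index file for a FASTQ file.
sai_file() { replace_first "$1" "$FASTQ" "sai"; }
```

For example, `mate_file s_1_read1.fastq.gz` yields `s_1_read2.fastq.gz`, so changing the naming convention only requires editing the header variables.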
  • Config.txt
    #********************
    # Tasks
    #********************
    mappingBWA="1"
    recalibrateQualScore="1"
    #********************
    # Paths
    #********************
    FASTA="/clusterdata/resources/hg19/hg19.fasta"
    SEQREG="chr1:229994688-230071581"
    DBSNP="/clusterdata/resources/hg19/snpdb132.vcf"
    #********************
    # PARAMETER
    #********************
    LIBRARY="QBI"
    ADDPARAMBWA="--force single"
    Specifies what to do,
    e.g. mapping and recalibration
    Specifies where to find resources
    Customizes the standard scripts for this project
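Because the config file is plain shell, a script can source it and branch on the task switches. The following is a minimal sketch under that assumption; the function name `enabled_tasks` is hypothetical, and only the two switches shown on the slide are checked.

```shell
# enabled_tasks CONFIG: print the tasks that CONFIG switches on with "1".
enabled_tasks() {
    (
        # Run in a subshell so the config's variables do not leak
        # into the calling script.
        . "$1" || exit 1
        if [ "${mappingBWA:-0}" = "1" ]; then
            echo "mappingBWA"
        fi
        if [ "${recalibrateQualScore:-0}" = "1" ]; then
            echo "recalibrateQualScore"
        fi
    )
}
```

A pipeline driver could loop over the output of `enabled_tasks ./config.txt` and submit one job per task, leaving unset switches off by default.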
  • call
    trigger.sh config.txt armed
    trigger.sh config.txt html
    [Diagram: files shown for lanes 1 to 4]
    Read files: s_1_read1.fastq, s_1_read2.fastq, s_2_read1.fastq, s_2_read2.fastq, s_3_read1.fastq, s_3_read2.fastq, s_4_read1.fastq, s_4_read2.fastq
    BAM files: s_1.bam, s_2.bam, s_3.bam, s_4.bam
    Processed BAM files: s_1.ashrr.bam, s_2.ashrr.bam, s_3.ashrr.bam, s_4.ashrr.bam
    Output files (shown twice in the diagram): Sub1_s_1.out, Sub1_s_2.out, Sub2_s_3.out, Sub2_s_4.out
  • Summary.html
    Project Cards
    Sequence statistics
    Run checkpoints
    Data Visualization
    Mapping stats
    Download
    Interesting Regions
  • Scaffold of pbsScripts.sh: Error catching
    Code example for setting up what errors to look out for
    # QCVARIABLES, loosing reads, unmapped read,no such file,file not found,bwa.sh: line
    Output in Summary.html
    >>>>>>>>>> Errors
    QC_PASS .. 0 have We are loosing reads/184
    QC_PASS .. 0 have for unmapped read/184
    QC_PASS .. 0 have no such file/184
    QC_PASS .. 0 have file not found/184
    QC_PASS .. 0 have bwa.sh: line/184
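The report lines above could be produced by counting how many job logs contain each error phrase. The sketch below is one possible way to do that, not the original implementation: the function names, the `*.out` log naming and the `QC_FAIL` label are assumptions. Only the `QC_PASS .. N have PHRASE/TOTAL` line format is taken from the slide.

```shell
# count_logs_with DIR PHRASE: number of *.out logs in DIR containing PHRASE.
count_logs_with() {
    n=0
    for f in "$1"/*.out; do
        [ -e "$f" ] || continue
        if grep -qF -- "$2" "$f"; then n=$((n + 1)); fi
    done
    echo "$n"
}

# count_logs DIR: number of *.out logs in DIR.
count_logs() {
    n=0
    for f in "$1"/*.out; do
        [ -e "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

# report_errors DIR PHRASES: PHRASES is a comma-separated list of error
# phrases. A phrase passes when no log contains it.
report_errors() {
    dir=$1
    total=$(count_logs "$dir")
    old_ifs=$IFS
    IFS=,
    for phrase in $2; do
        IFS=$old_ifs
        hits=$(count_logs_with "$dir" "$phrase")
        if [ "$hits" -eq 0 ]; then status="QC_PASS"; else status="QC_FAIL"; fi
        echo "$status .. $hits have $phrase/$total"
    done
    IFS=$old_ifs
}
```

Calling `report_errors logs "no such file,file not found"` prints one line per phrase. The phrases are matched as fixed strings, so punctuation such as the colon in `bwa.sh: line` needs no escaping.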
  • Scaffold of pbsScripts.sh: checkpoints
    qsub -b y -j y [PBSOPTIONS] pbsScript.sh -k HISEQINF [PARAMETERS]
    Code example for setting up checkpoints in the pbsScript.sh
    echo "********* mapping"
    $BWA aln -t $THREADS $FASTA $f > $OUT/${n/$FASTQ/sai}
    $BWA aln -t $THREADS $FASTA ${f/$READONE/$READTWO} > $OUT/${n/$READONE.$FASTQ/$READTWO.sai}
    Output in Summary.html
    >>>>>>>>>> CheckPoints
    QC_PASS .. 184 have mapping/184
    QC_PASS .. 184 have sorting and bam-conversion/184
    QC_PASS .. 184 have mark duplicates/184
    QC_PASS .. 184 have statistics/184
    QC_PASS .. 184 have coverage track/184
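The checkpoint report can be read as the mirror image of the error report: each job echoes a marker line when it reaches a stage, and the summary counts how many logs contain it. A minimal sketch under that reading follows; the function names, the `*.out` log naming and the `QC_FAIL` label are hypothetical, while the marker format follows the `echo` line on the slide.

```shell
# checkpoint NAME: print a marker line that the report can search for later.
checkpoint() {
    echo "********* $1"
}

# report_checkpoint DIR NAME: count the *.out logs in DIR that contain the
# marker for NAME. The checkpoint passes only if every log reached it.
report_checkpoint() {
    total=0
    hits=0
    for f in "$1"/*.out; do
        [ -e "$f" ] || continue
        total=$((total + 1))
        if grep -qF -- "********* $2" "$f"; then hits=$((hits + 1)); fi
    done
    if [ "$hits" -eq "$total" ]; then status="QC_PASS"; else status="QC_FAIL"; fi
    echo "$status .. $hits have $2/$total"
}
```

Inside a job script, `checkpoint mapping` would precede the `$BWA aln` calls. A job that dies early then leaves fewer markers in its log, which shows up in the summary as a count below the total.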
  • Availability: tailored to skills
    1. Website
    2. RStudio
    3. Command line
  • The big picture
    Covering all aspects of: design*, set-up*, maintenance*, usage
    (*except cluster)
    Documentation: Project Server
    //project
    5 TB raw data
    750 GB processed data
    57 GB external data
    7 project-cards
    10 Projects, 6 HiSeq-Runs
    40 wiki pages, 250 Tasks, 551h logged
    160 Commits
    35 external programs
    41 custom scripts (4197 lines of code)
    [Diagram labels: Application; Backup/Version Control; Data Warehousing; Statistic; Analysis; HiSeq Output; RStudio; Raw Data; Quality Control; Project Cards; Processed Data; Rsync; Hypothesis Generation; Software (BWA, GATK, samtools, etc.); Custom Scripts; Version Control; Data; Processing and Analysis; External Genomic Resources (Genomes, Annotation, etc.); Cluster; Project Server; Content; Galaxy; Visualization; IGV; Genome Browser; //cluster-vm; //clusterstorage; //groupshare, //ethan]
  • Three things to remember
    Reliable data production
    All projects have a similar structure and are processed in the same way
    Convincing data
    All steps are tightly quality controlled and the QC report is accessible
    Easy delivery
    We tailored data availability to skill levels (webpage, RStudio, console)
    On-time delivery
    Production informatics has priority on the cluster
  • Next week
    NGS Discussion group:
    Methylation analysis
    Kevin Dudley and Danay Baker-Andresen