RNA-SEQ ANALYSIS
   고준수, 송상훈, 김현민

   Theragen Bio Institute
      2012. 2. 5
CONTENTS
• NGS
• RNA-seq
• File Format
• Workflow
• Preparation
• Filtering & QC
• Mapping
• PCR Duplication
• Expression
• DEG
• Report
TODAY’S KEYWORDS

• NGS: Illumina, Paired-End
• RNA-seq: mRNA, Reference-based
• File Format: Fastq, BAM
• Design: Replicates
• Mapping: TopHat
• Expression: Cufflinks, Cuffmerge
• DEG: Cuffdiff, DESeq
NEXT-GENERATION
  SEQUENCING
SEQUENCING
Sanger (1st Generation)
NEXT-GENERATION SEQUENCING
2nd Generation




                                                          3rd Generation




Nat Rev Genet. 2010 Jan;11(1):31-46. doi: 10.1038/
nrg2626. Epub 2009 Dec 8. Sequencing technologies - the
next generation. Metzker ML.
NGS WEAKNESS AND OVERCOMING




                                                                                  Sanger 0.001%




 Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing
 platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341.

 Nature Biotechnology 26, 1135 - 1145 (2008), Next-generation DNA
 sequencing, Shendure J. and Ji H.
NGS
Library Construction                      Sequencing




                                                        Raw
                                                       Reads
http://users.ugent.be/~avierstr/nextgen/nextgen.html
GENERAL NGS ANALYSIS PROCESS
                                                           Speed          3
                            1                                                        WGS
                                                                          Low depth < NT < High depth
                                    Mapping


     2

Depth
(Coverage)



                         Coverage
             Shearer AE, Hildebrand MS, Sloan CM, Smith RJ. Deafness
             in the genomics era. Hear Res. 2011 Dec;282(1-2):1-9. doi:
             10.1016/j.heares.2011.10.001. Epub 2011 Oct 8.
MAPPING TOOLS
•   BWA for WGS
•   TopHat for RNA
•   Mapper Type
    •   DNA
    •   RNA
    •   miRNA
    •   bisulphite




Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-
throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.
PCR DUPLICATION
       http://www.clcbio.com/clc-plugin/duplicate-reads-
       removal-plugin/




                            remove
ILLUMINA PAIRED-END
mate-pair inner distance

http://vallandingham.me/RNA_seq_differential_expression.html



                                             Quinlan AR, Boland MJ, Leibowitz ML, Shumilina S,
                                             Pehrson SM, Baldwin KK, Hall IM. Genome
                                             sequencing of mouse induced
                                             pluripotent stem cells reveals
                                             retroelement stability and infrequent
                                             DNA rearrangement during
                                             reprogramming. Cell Stem Cell. 2011 Oct
                                             4;9(4):366-73. doi: 10.1016/j.stem.2011.07.018.


fastq_1
Haas BJ, Zody MC. Advancing RNA-Seq analysis. Nat Biotechnol. 2010 May;28(5):421-3. doi: 10.1038/nbt0510-421.




fastq_2
http://users.ugent.be/~avierstr/nextgen/nextgen.html
SUMMARY
• NGS   platform : Short Reads, Depth, Coverage
• Sequencing    Protocol
• Analysis   Protocol
• Mapping

• PCR   duplication
• Illumina   Paired-end
TRANSCRIPTOME
   RNA-SEQ
TRANSCRIPTOME
•   The complete set of transcripts in a cell, and their quantity

•   The key aims of transcriptomics are:

    •   to catalogue all species of transcript, including mRNAs, non-coding RNAs and
        small RNAs

    •   to determine the transcriptional structure of genes, in terms of their start sites,
        5′ and 3′ ends, splicing patterns and other post-transcriptional modifications

    •   to quantify the changing expression levels of each transcript during
        development and under different conditions.

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for
transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/
nrg2484.
ADVANTAGES OF RNA-SEQ




Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary
tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
doi: 10.1038/nrg2484.
RNA-SEQ & MICROARRAY




Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for
transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/
nrg2484.
RNA-SEQ
•   Gene expression level

•   Relative expression level in sample

•   Differentially expressed gene

•   Identification of alternative spliced transcripts

•   Prediction of novel transcripts

•   Gene Fusion

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool
for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi:
10.1038/nrg2484.
RNA-SEQ VS. DNA-SEQ
            RNA-seq                            DNA-seq
Methods     Reference-based,                   WES,
            de novo assembly                   WGS re-sequencing,
                                               WGS de novo
Goal        Expression,                        SNPs, Indels, SV
            Differentially Expressed Genes,
            Novel transcript,
            Alternative splicing form,
            Gene fusion
Measure     Mapped Read Count                  Base accuracy
OVERVIEW OF A TYPICAL RNA-SEQ
RNA MAPPING

                                                               Oshlack A, Robinson MD, Young MD. From RNA-seq reads to
                                                               differential expression results. Genome Biol. 2010;11(12):220.




Trapnell C, Salzberg SL., How to map billions of short reads onto
genomes. Nat Biotechnol. 2009 May;27(5):455-7.
MAPPER
   Mapper         Data      Seq.Plat.       Input          Output       Cit.    Cit./year          Reference
  MapSplice       RNA           I         FASTA/Q        SAM, BED       50          28.17     Wang et al. (2010)
 MicroRazerS     miRNA         N          FASTA/Q         SAM, TSV       7           2.75     Emde et al. (2010)
   mrFAST        miRNA          I         FASTA/Q           SAM         158         58.34     Alkan et al. (2009)
   mrsFAST       miRNA        I,So        FASTA/Q           SAM         32          18.03     Hach et al. (2010)
    Passion       RNA        I,4,Sa,P     FASTA/Q           BED          -            -       Zhang et al. (2012)
   PatMaN        miRNA         N            FASTA           TSV         38           9.36     Prufer et al. (2008)
  QPALMA          RNA          I,4         Specific          TSV         75          21.11    De Bona et al. (2008)
  RNA-Mate        RNA          So          CFASTA       BED, Counts     28          10.04    Cloonan et al. (2009)
    RUM           RNA          I,4        FASTA/Q       SAM,TSV,BED      2           2.36     Grant et al. (2011)
  SOAPSplice      RNA          I,4        FASTA/Q           TSV          3           3.54     Huang et al. (2011)
  SpliceMap       RNA           I         FASTA/Q        SAM, BED       63          29.80       Au et al. (2010)
  Supersplat      RNA          N            FASTA           TSV         21           9.93    Bryant Jr et al. (2010)
   TopHat         RNA           I       FASTA/Q, GFF        BAM         389         121.04   Trapnell et al. (2009)
The number of citations (Cit.) was obtained from Google Scholar on April 14, 2012


Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping
high-throughput sequencing data. Bioinformatics. 2012
Dec 1;28(24):3169-77.
ANALYSIS STRATEGIES
Reference-based
•   Method
    •   Uses a reference genome
    •   The transcriptome assembly can be built upon it
•   Adv.
    •   Contamination or sequencing artefacts are not a major concern
    •   Very sensitive and can assemble transcripts of low abundance
    •   Can discover novel transcripts that are not present in the current annotation
•   Disadv.
    •   Depends on the quality of the reference genome being used
•   Depth: ~ 10x

de novo
•   Method
    •   Does not use a reference genome
•   Adv.
    •   Does not depend on a reference genome
    •   Does not depend on the correct alignment of reads to known splice sites or the prediction of novel splicing sites
    •   Trans-spliced transcripts can be assembled
•   Disadv.
    •   Computing resources
    •   Sensitive to sequencing errors
•   Depth: > 30x

Martin JA, Wang Z. Next-generation transcriptome
assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/
nrg3068.
REFERENCE-BASED




Martin JA, Wang Z. Next-generation transcriptome
assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi:
10.1038/nrg3068.
REFERENCE-BASED




Martin JA, Wang Z. Next-generation transcriptome
assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi:
10.1038/nrg3068.
SUMMARY

• Transcriptome

• RNA-seq     advantages
• Process

• Analysis   strategies
• Reference-based     method
NGS FILE FORMAT
FILE FORMAT
•   NGS
    •   Fastq
    •   SAM/BAM
    •   VCF
•   Reference
    •   Fasta
    •   GTF / GFF
FASTQ FORMAT
•   Sequencer → Fastq
•   De facto standard file format for raw reads
•   File extensions: fq, fastq, fq.gz, fastq.gz
•   Four lines per read
    •   1: @title identifier description
    •   2: Sequence
    •   3: + description
    •   4: Quality values
•   Paired-end: S01_1.fq, S01_2.fq
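The four-line record layout above can be read with a few lines of Python. This is only a minimal sketch: the helper names (`read_fastq`, `open_fastq`) and the two example reads are made up for illustration, and real data should usually go through an established parser.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    title: str      # line 1 without the leading '@'
    sequence: str   # line 2
    quality: str    # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {title!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield FastqRecord(title[1:], sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Made-up two-record example standing in for a file such as S01_1.fq
example = io.StringIO(
    "@read1/1\nACGTN\n+\nIIII#\n"
    "@read2/1\nGGCCA\n+read2/1\nHHHHH\n"
)
records = list(read_fastq(example))
print(len(records), records[0].title, records[0].sequence)
```

For paired-end data, the same reader can be run over both files in parallel (for example with `zip`), since mate N in S01_1.fq corresponds to mate N in S01_2.fq.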
QUALITY SCORE
•   The base-calling error probabilities.

•   Types

    •   Phred33 / Sanger, Illumina 1.8+
        •   Score 0 ~ 93
        •   ASCII 33 ~ 126
    •   Solexa / Illumina 1.0
        •   Score -5 ~ 62
        •   ASCII 59 ~ 126
    •   Phred64 / Illumina 1.3 ~ 1.5
        •   Score 0 ~ 62
        •   ASCII 64 ~ 126


                                  http://www.asciitable.com
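A short Python sketch of how the quality characters map to scores and error probabilities. The function names are illustrative, and the encoding guess is a simple heuristic based only on the lowest ASCII code seen; it cannot separate Phred64 from high-quality data in the other encodings. The probability formula applies to Phred-scaled scores, not to the older Solexa scale.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into integer scores (default: Phred33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def guess_encoding(quality_strings: Iterable[str]) -> str:
    """Heuristic guess from the lowest ASCII code observed in the data."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return "Phred33"   # codes below 59 are only valid with offset 33
    if lowest < 64:
        return "Solexa"    # codes 59..63 mean negative Solexa scores
    return "Phred64"       # ambiguous in principle; most likely offset 64


print(decode_quality('!"#$%&'))      # scores for the first six Phred33 symbols
print(error_probability(20))         # Q20 corresponds to a 1% error rate
print(guess_encoding(["IIII#"]))
```

Q20, the cutoff used later for filtering, therefore corresponds to an expected error of 1 base in 100.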
SAM / BAM FORMAT
•   Sequencer → Fastq → Mapper → SAM/BAM
•   SAM stands for Sequence Alignment/Map format.
•   TAB-delimited text format
•   11 mandatory fields
•   Fields labelled in the example: Read Name, Flag, Reference, Position, Quality, Pos. of Mate, Length
SAM / BAM
Flag


                      SAM




       CIGAR
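The Flag and CIGAR fields can be decoded as in the following sketch. The bit meanings follow the SAM specification; the function names and the example values (flag 99, CIGAR `50M1000N50M`) are chosen only for illustration.

```python
import re
from typing import List, Tuple

# Bit values of the SAM FLAG field and a short label for each.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("50M1000N50M"), reference_span("50M1000N50M"))
```

In RNA-seq alignments the `N` operation is the interesting one: it marks a skipped reference region, which a spliced aligner uses for an intron.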
TOOLS FOR SAM/BAM
•   Samtools
    •   index
    •   view
    •   sort
    •   faidx
    •   flagstat
    •   tview
    •   mpileup
•   Picard
    •   SortSam
    •   MarkDuplicates
    •   ......
GTF (ENSEMBL)


                                                     Gene ID   Transcript
                                                                  ID




protein_coding, mtRNA, miRNA, lincRNA, pseudogene......
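As a rough illustration of where Gene ID and Transcript ID live, the sketch below splits one GTF line into its nine tab-separated columns and parses the attribute column. The example line and its identifiers (`GENE0001`, `TRANS0001`) are invented placeholders, not real Ensembl records.

```python
from typing import Dict


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Parse one GTF line into its fixed columns plus an attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line must have 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes: Dict[str, str] = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Invented example line; the identifiers are placeholders.
example = (
    "1\tprotein_coding\texon\t1000\t1200\t.\t+\t.\t"
    'gene_id "GENE0001"; transcript_id "TRANS0001"; exon_number "1";'
)
record = parse_gtf_line(example)
print(record["source"], record["attributes"]["gene_id"],
      record["attributes"]["transcript_id"])
```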
SUMMARY
• Fastq   format
  • de    facto standard
• Quality   Score
  • Phred33/Illumina 1.8+, Solexa/Illumina 1.0, Phred64/Illumina 1.3~1.5
• SAM/BAM       format
• GTF
WORKFLOW
REFERENCE
REFERENCE WORKFLOW
•   Sample 1, Sample 2
•   TopHat → Mapped reads (per sample)
•   Cufflinks → Assembled transcripts (per sample)
•   Cuffmerge → Final transcriptome assembly
•   Cuffdiff → Differential expression results
•   CummeRbund → Expression plots
OUR WORKFLOW

•   Inputs: Samples, Reference, Geneset
•   Filtering: TBI-toolkit (QC: FastQC)
•   Read Mapping: TopHat, RUM, BWA, Bowtie2 (QC: RSeQC)
•   Duplication: Picard, Samtools
•   Expression Level: Cufflinks, HTseq-count
•   Gene Structure: Cuffmerge
•   DEG analysis: Cuffdiff, DEGseq, DESeq
•   Annotation: UniProt, GO, KEGG
•   Report: cummeRbund, GO
PREPARATION
DIRECTORY
/KOGO/RNA-seq
    ref                 chr.fa, ens.gtf, mask.gtf
    inputs              S01.fq.gz, S02.fq.gz
    outputs
        S01             accepted_hits.bam, transcripts.gtf
        ......          accepted_hits.bam, transcripts.gtf
        merged_asm      merged.gtf, transcripts.gtf
        Diff-S01-S02    gene_exp.diff, isoforms_exp.diff
    scripts
    Tools
SAMPLES
            Before exercise    After exercise
Horse 1     S01                S02
Horse 2     S03                S04
TOOLS
Category     Program       Version   Homepage
QC           FastQC        0.10.1    http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Mapper       Bowtie2       2.0.5     http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Mapper       TopHat        2.0.7     http://tophat.cbcb.umd.edu
Abundance    Cufflinks     2.0.2     http://cufflinks.cbcb.umd.edu
Abundance    HTseq-count   -         http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Abundance    DESeq         1.10.1    http://bioconductor.org/packages/release/bioc/html/DESeq.html
Annotation   goseq         1.10.0    http://www.bioconductor.org/packages/2.11/bioc/html/goseq.html
Tools        samtools      0.1.18    http://samtools.sourceforge.net
Tools        picard        1.83      http://picard.sourceforge.net
Tools        TBI-toolkit   0.1       http://dev.totalomics.kr/
Tools        R             2.15.0    http://www.r-project.org
Tools        Gnuplot       -         http://www.gnuplot.info
TBI-TOOLKIT
•   TBI NGS Toolkit
    •   http://dev.totalomics.kr
•   Application
    •   TBI-toolkit-qscore
    •   TBI-toolkit-fq_filter
    •   TBI-toolkit-gtf_selector
    •   TBI-toolkit-fa_spliter
    •   TBI-toolkit-make_matrix
REFERENCE
• Reference-based strategy

Name             FileType       Description
Reference        fasta          Genome Sequence
Geneset          GTF2.2/GFF3    Reference Geneset

Name             Source         Description
Mask Geneset     Geneset        Geneset that has ncRNA information (rRNA, tRNA, and other ncRNA)
Bowtie2 Index    Reference      Index files for running bowtie2 (optional)
GO information   GO             Gene ontology information for GO enrichment
REFERENCE SOURCE
•   Ensembl (http://www.ensembl.org)
    •   General file format for all species
    •   Geneset (GTF format)
    •   Constant Database schema for all species
    •   Comprehensive Annotation (GO, InterPro, Pfam, Prosite, Smart, ...... )
    •   Automated Update
•   UCSC (http://genome.ucsc.edu)
    •   Semi-general file format for all species
    •   Semi-constant database schema for all species
        •   Gene table dump (BED format compatible)
    •   Annotation (Pfam, Kegg)
    •   Comparative Analysis
•   NCBI
    •   Raw data bank
    •   GFF type geneset file
ENSEMBL
ensembl.org           plants.ensembl.org     fungi.ensembl.org




metazoa.ensembl.org   protists.ensembl.org   bacteria.ensembl.org
ENSEMBL
•   Homo Sapiens ( ftp://ftp.ensembl.org/pub/release-69 )
    •   fasta/homo_sapiens/
                                                                         chr.fa
        •   dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz
            •   dna/Homo_sapiens.GRCh37.69.dna.chromosome.1.fa.gz
        •   cdna/Homo_sapiens.GRCh37.69.cdna.all.fa.gz
                                                                           ens.gtf
    •   gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz
    •   mysql/homo_sapiens_core_69_37/
•   Arabidopsis thaliana ( ftp://ftp.ensemblgenomes.org/pub/release-16/plants )
    •   fasta/arabidopsis_thaliana
        •   dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz
        •   cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz
    •   gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz
    •   mysql/arabidopsis_thaliana_core_16_69_10/
PRE-PROCESSING
• Check   quality score type of input file
• Reference   files
• Reference   index
• Mask   geneset
SAMPLE QUALITY SCORE
Usage)
$ TBI-toolkit-qscore [FASTQ]
Sanger(Phred33) or Illumina 1.8+
   0 to 93 using ASCII 33 to 126

Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-qscore S01_1.fq.gz
Sanger(Phred33) or Illumina 1.8+
   0 to 93 using ASCII 33 to 126
   0:!, 1:", 2:#, 3:$, 4:%, 5:&, ......
REFERENCE INDEX
Index for bowtie2 mapper
Usage)
$ bowtie2-build [options] <reference_in> <bt2_base>

Run)
$ cd /KOGO/RNA-seq/ref
$ bowtie2-build chr.fa chr.fa
$ ls
chr.fa.1.bt2 chr.fa.2.bt2 ......

Fasta index
Usage)
$ samtools faidx <ref.fasta>

Run)
$ cd /KOGO/RNA-seq/ref
$ samtools faidx chr.fa
$ ls
chr.fa.fai
MASK GENESET
...... We recommend including any annotated rRNA, mitochondrial transcripts other
abundant transcripts you wish to ignore in your analysis in this file. Due to variable
efficiency of mRNA enrichment methods and rRNA depletion kits, masking these
transcripts often improves the overall robustness of transcript abundance estimates.
                                cufflinks manuals (http://cufflinks.cbcb.umd.edu/manual.html)

Usage)
$ TBI-toolkit-gtf_selector [IN GTF] [OUT GTF] [Source 1] [Source 2] ......


Run)
$ cd /KOGO/RNA-seq/ref
$ TBI-toolkit-gtf_selector ens.gtf mask.gtf tRNA rRNA Mt_tRNA Mt_rRNA
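How TBI-toolkit-gtf_selector works internally is not shown here. Judging only from its usage line, it appears to keep the GTF records whose source column (column 2) matches one of the given names. A minimal Python sketch under that assumption, with invented example lines:

```python
from typing import Iterable, Iterator, Set


def select_by_source(gtf_lines: Iterable[str], sources: Set[str]) -> Iterator[str]:
    """Yield GTF lines whose second column (source) is in `sources`.

    Assumption: the source column carries the biotype (rRNA, tRNA, ...),
    as in the GTF used in this tutorial.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] in sources:
            yield line


# Invented three-line geneset; identifiers are placeholders.
geneset = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\trRNA\texon\t500\t600\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
    'MT\tMt_tRNA\texon\t10\t80\t.\t+\t.\tgene_id "G3"; transcript_id "T3";\n',
]
mask = list(select_by_source(geneset, {"tRNA", "rRNA", "Mt_tRNA", "Mt_rRNA"}))
print(len(mask))
```

The selected lines would be written to mask.gtf and later passed to Cufflinks as the mask file.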
SUMMARY
• Directory

  • /KOGO/RNA-seq

• Tools

• Reference

• Pre-processing
FILTERING & QC

•   Improving assembly accuracy
•   Removing artifacts
    •   Sequencing adaptor
    •   Low quality reads
    •   Near-identical reads
        •   PCR amplification
    •   rRNA and other RNA
•   Applications
    •   Filtering - TBI-toolkit, fastx-toolkit
    •   QC - FastQC, SolexaQC, RSeQC
QUALITY CONTROL
•   FastQC ( v0.10.1 )
    •   A quality control tool for high throughput
        sequence data.
    •   Java
    •   http://www.bioinformatics.babraham.ac.uk/
        projects/fastqc/
•   RSeQC
    •   RSeQC package provides a number of useful
        modules that can comprehensively evaluate high
        throughput sequence data especially RNA-seq
        data
    •   http://code.google.com/p/rseqc/
FASTQC
Usage)
$ fastqc seqfile1 seqfile2 .. seqfileN
$ fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN

Arguments
  -f format bam,sam,bam_mapped,sam_mapped and fastq
  -t threads

Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_1.fq.gz S01_2.fq.gz

Output)
$ firefox S01_1.fq_fastqc/fastqc_report.html
$ firefox S01_2.fq_fastqc/fastqc_report.html
FASTQC
Per Base Sequence Quality   Per Sequence Quality Scores   Per Base Sequence Content      Per Base GC Content




Per Sequence GC Content        Per Base N Content         Sequence Length Distribution    Duplicate Sequences
RSEQC
READ FILTERING (CUTOFF)

               RNA-seq                    DNA-seq
Low Quality    N > 10%                    N > 10%
               Average QV < Q20           Average QV < Q20
               NT (<Q20) > 40%            NT (<Q20) > 5%
Trimming       No trimming or trimming    Trimming
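The cutoffs in the table can be expressed as a small predicate. This is a sketch of the criteria only, not the TBI-toolkit implementation; the function and parameter names are illustrative, Phred33 encoding is assumed, and dropping a pair when either mate fails is an assumption.

```python
def passes_filter(sequence: str, quality: str,
                  max_n_ratio: float = 0.1,    # discard if N > 10%
                  min_avg_qv: float = 20,      # discard if average QV < Q20
                  max_low_ratio: float = 0.4,  # discard if NT (<Q20) > 40% (RNA-seq)
                  qv_cutoff: int = 20,
                  offset: int = 33) -> bool:
    """Return True if a read survives the low-quality cutoffs."""
    length = len(sequence)
    if length == 0 or length != len(quality):
        return False
    scores = [ord(ch) - offset for ch in quality]
    n_ratio = sequence.upper().count("N") / length
    avg_qv = sum(scores) / length
    low_ratio = sum(1 for s in scores if s < qv_cutoff) / length
    return (n_ratio <= max_n_ratio
            and avg_qv >= min_avg_qv
            and low_ratio <= max_low_ratio)


def passes_pair(read1, read2, **cutoffs) -> bool:
    """Keep a pair only if both mates pass (assumed policy)."""
    return passes_filter(*read1, **cutoffs) and passes_filter(*read2, **cutoffs)


good = ("ACGTACGTAC", "IIIIIIIIII")    # all bases Q40
many_n = ("ACNNACGTAC", "IIIIIIIIII")  # 20% N
half_low = ("ACGTACGTAC", "#####IIIII")  # average Q21, but 50% of bases at Q2
print(passes_filter(*good), passes_filter(*many_n), passes_filter(*half_low))
```

For DNA-seq the same function can be called with `max_low_ratio=0.05` to match the stricter column of the table.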
FILTERING
Usage)
$ TBI-toolkit-fq_filter [option*] seqfile_1 seqfile_2 output_1 output_2

Option)
  -n N_ratio : maximum ratio of N bases in a read
  -a integer : minimum average QV of a read
  -m NT_ratio : maximum ratio of bases below the QV cutoff


Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-fq_filter -n 0.1 -m 0.4 -a 20 S01_1.fq.gz S01_2.fq.gz S01_Q20_1.fq.gz S01_Q20_2.fq.gz
$ ls
S01_Q20_1.fq.gz S01_Q20_2.fq.gz S01_Q20.log S01_Q20.err
$ cat S01_Q20.log
$ less S01_Q20.err
FASTQC
Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_Q20_1.fq.gz S01_Q20_2.fq.gz
SUMMARY
• Read     Quality
• FastQC

• RSeQC

• Filter
MAPPING READS
  (TOPHAT)

TOPHAT
•   TopHat is a fast splice junction mapper for RNA-Seq reads.
•   It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short
    read aligner Bowtie, and then analyzes the mapping results to identify splice junctions
    between exons.




    Trapnell C, Salzberg SL. How to map billions of short reads onto
    genomes. Nat Biotechnol. 2009 May;27(5):455-7.
USAGE
Usage
$ tophat [options] <bowtie_index_base> <reads1_1> <reads1_2>
Option                 Value             Description
-o/--output-dir        string            The default is "./tophat_out".
-p/--num-threads       int               Use this many threads to align reads. The default is 1.
-r/--mate-inner-dist   int               The expected (mean) inner distance between mate pairs. The default is 50bp.
--mate-std-dev         int               The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
--library-type         fr-unstranded     Standard Illumina
                       fr-firststrand    dUTP, NSR, NNSR
                       fr-secondstrand   Ligation, Standard SOLiD
--solexa-quals         -                 Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals      -                 Phred64/Illumina 1.3~1.5
-G/--GTF               Geneset           Geneset (GTF 2.2 or GFF3 formatted file)
--rg-id                string            Read group ID
--rg-sample            string            Sample ID
RUN
$ cd /KOGO/RNA-seq/outputs
$ tophat -o S01 -p 1 -r 170 \
       --library-type fr-unstranded -G ../ref/ens.gtf --rg-id S01_Q20 --rg-sample S01_Q20 \
       ../ref/chr.fa ../inputs/S01_Q20_1.fq.gz ../inputs/S01_Q20_2.fq.gz

Category              Option                 Value
Output                -o/--output-dir        /KOGO/RNA-seq/outputs/S01
Thread                -p/--num-threads       1
Inner Distance Mean   -r/--mate-inner-dist   170 (check for your library)
Inner Distance SD     --mate-std-dev         20 (default)
Library Type          --library-type         fr-unstranded (Standard Illumina)
Quality Score                                Phred33 (default)
Geneset               -G/--GTF               /KOGO/RNA-seq/ref/ens.gtf
Read Group            --rg-id, --rg-sample   S01_Q20
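The value for -r has to be checked for each library. The inner distance is the fragment length minus the two read lengths; the TopHat manual gives the example of 300bp fragments with 50bp ends, for which -r should be 200. A small sketch of that arithmetic (the function name is illustrative, and the fragment length is assumed to exclude adapters):

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Expected inner distance between mates: fragment minus both reads."""
    return fragment_length - 2 * read_length


# Example from the TopHat manual: 300bp fragments, 50bp reads at each end.
print(mate_inner_distance(300, 50))
```

A negative result means the two mates overlap, which happens when the fragments are shorter than twice the read length.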
ALGORITHM




Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice
junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.
TOPHAT
•   Two step method
    •   Extracting the transcript sequences and using Bowtie to align
        reads to this virtual transcriptome first.
    •    Only the reads that do not fully map to the transcriptome will
        then be mapped on the genome.
•   Optimized for reads >= 75bp
•   The values in the first column of the provided GTF/GFF file must
    match the name of the reference sequence in the Bowtie index
    you are using with TopHat.
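The last point is a common source of silent failures (for example "chr1" in the geneset versus "1" in the genome). A hedged sketch of a pre-flight check that compares GTF column 1 with the FASTA header names; it assumes the index was built from that FASTA and that names are taken up to the first whitespace. The function names and example data are illustrative.

```python
from typing import Iterable, List, Set


def fasta_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Sequence names from FASTA headers (text after '>' up to first whitespace)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_seqnames(gtf_lines: Iterable[str]) -> Set[str]:
    """Distinct values of the first GTF column, ignoring comments and blanks."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_reference(gtf_lines: Iterable[str],
                           fasta_lines: Iterable[str]) -> List[str]:
    """GTF sequence names that have no matching FASTA sequence."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_names(fasta_lines))


# Invented example: the GTF uses 'chrM' while the genome uses 'MT'.
fasta = [">1 dna:chromosome\n", "ACGT\n", ">MT dna:chromosome\n", "ACGT\n"]
gtf = [
    '1\trRNA\texon\t1\t4\t.\t+\t.\tgene_id "G1";\n',
    'chrM\tMt_tRNA\texon\t1\t4\t.\t+\t.\tgene_id "G2";\n',
]
print(missing_from_reference(gtf, fasta))
```

An empty list means every geneset record refers to a sequence that exists in the reference.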
OUTPUT
Filename            Type       Description
accepted_hits.bam   BAM        A list of read alignments in SAM format. Coordinate-sorted.
unmapped.bam        BAM        A list of unmapped reads in SAM format.
junctions.bed       UCSC BED   A track of junctions reported by TopHat.
insertions.bed      UCSC BED   chromLeft refers to the last genomic base before the insertion.
deletions.bed       UCSC BED   chromLeft refers to the first genomic base of the deletion.
SIMPLE ALIGNMENT VIEW
Usage
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ samtools tview accepted_hits.bam ../../ref/chr.fa

Example location: 25:413751

Key         Desc
?           This window
Arrows      Small scroll movement
H,J,K,L     Large scroll movement
space       Scroll one screen
backspace   Scroll back one screen
g           Go to specific location
m           Color for mapping qual
n           Color for nucleotide
b           Color for base quality
.           Toggle on/off dot view
q           Exit
MAPPING STATISTICS
Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools flagstat accepted_hits.bam
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
45338688 + 0 paired in sequencing
22757885 + 0 read1
22580803 + 0 read2
39796048 + 0 properly paired (87.78%:-nan%)
42308960 + 0 with itself and mate mapped
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ bam_stat.py -i accepted_hits.bam
Total Reads (Records):          45338688
QC failed:                      0
Optical/PCR duplicate:          0
Non Primary Hits                1861695
Unmapped reads:                 0
Multiple mapped reads:          586067
Uniquely mapped:                42890926
Read-1:                         21527100
Read-2:                         21363826
Reads map to '+':               21457407
Reads map to '-':               21433519
Non-splice reads:               32872272
Splice reads:                   10018654
Reads mapped in proper pairs:   38402964
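When many samples are processed, it is convenient to pull the flagstat counts into a table. The sketch below parses the text output shown above and recomputes two of the percentages; the parsing rules are an assumption based on this output format (samtools 0.1.18), and the function name is illustrative.

```python
import re
from typing import Dict

FLAGSTAT_TEXT = """\
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
45338688 + 0 paired in sequencing
39796048 + 0 properly paired (87.78%:-nan%)
3029728 + 0 singletons (6.68%:-nan%)
"""


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat label to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not match:
            continue
        label = match.group(3)
        # Drop a trailing percentage such as "(87.78%:-nan%)".
        label = re.sub(r"\s*\([^)]*%[^)]*\)$", "", label)
        if label.startswith("in total"):
            label = "in total"
        counts[label] = int(match.group(1))
    return counts


counts = parse_flagstat(FLAGSTAT_TEXT)
proper_pct = 100 * counts["properly paired"] / counts["in total"]
singleton_pct = 100 * counts["singletons"] / counts["in total"]
print(f"{proper_pct:.2f}% properly paired, {singleton_pct:.2f}% singletons")
```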
SUMMARY
• TopHat

 • Splice   junction
• Geneset

 • Two   step method
• accepted_hits.bam
PCR DUPLICATES
  (OPTIONAL)
                                                               DEG
             Read                                             analysis
Filtering                           Expression     Gene
            Mapping                                                       Report
                      Duplication     Level      Structure
FastQC      RSeQC                                            Annotation
PCR DUPLICATION

•    Removing reads that have the same mapping coordinates.
•    Tools
      •   samtools - rmdup
      •   Picard - MarkDuplicates

    Run)
    $ cd /KOGO/RNA-seq/outputs/S01/
    $ samtools rmdup accepted_hits.bam accepted_hits.rmdup.bam

    Run)
    $ cd /KOGO/RNA-seq/outputs/S01/
    $ java -jar /KOGO/RNA-seq/Tools/Picard/MarkDuplicates.jar \
        INPUT=accepted_hits.bam OUTPUT=accepted_hits.mark_dup.bam \
        ASSUME_SORTED=true REMOVE_DUPLICATES=true \
        METRICS_FILE=accepted_hits.metric
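The idea of "removing reads that have the same mapping coordinates" can be sketched in a few lines. This is a deliberately simplified illustration, not what samtools rmdup or Picard MarkDuplicates actually implement (both use additional criteria such as mate positions). The `Read` record and `remove_duplicates` function are hypothetical names for this example.

```python
from collections import namedtuple

# Hypothetical minimal read record; real tools operate on BAM records.
Read = namedtuple("Read", "name chrom pos strand mapq")

def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring the highest mapq."""
    best = {}  # coordinate key -> index of the read kept so far
    for i, read in enumerate(reads):
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > reads[best[key]].mapq:
            best[key] = i
    # Return the survivors in their original order.
    return [reads[i] for i in sorted(best.values())]

reads = [
    Read("r1", "chr1", 100, "+", 50),
    Read("r2", "chr1", 100, "+", 3),   # same coordinates as r1: duplicate
    Read("r3", "chr1", 100, "-", 50),  # opposite strand: kept
    Read("r4", "chr2", 100, "+", 50),
]
unique = remove_duplicates(reads)
print([r.name for r in unique])
```

In RNA-seq, highly expressed short transcripts naturally produce many reads at identical coordinates, which is one reason this step is marked optional.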
PCR DUPLICATION
samtools flagstat (QC-passed counts; QC-failed counts are 0 in every column)

Metric                                        accepted_hits.bam    samtools             Picard (Mark)        Picard (Remove)
in total                                      45338688             29259330             45338688             27621444
duplicates                                    0                    0                    17717244             0
mapped                                        45338688             29259330             45338688             27621444
paired                                        45338688             29259330             45338688             27621444
read1                                         22757885             14717809             22757885             13820471
read2                                         22580803             14541521             22580803             13800973
properly paired                               39796048 (87.78%)    24471885 (83.64%)    39796048 (87.78%)    24945306 (90.31%)
with itself and mate mapped                   42308960             26229602             42308960             26660814
singletons                                    3029728 (6.68%)      3029728 (10.35%)     3029728 (6.68%)      960630 (3.48%)
with mate mapped to a different chr           705846               705846               705846               655922
with mate mapped to a different chr (mapQ>=5) 92166                92166                92166                52614
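The duplication rate follows directly from the counts in the table. The short calculation below uses only numbers copied from the table; the variable names are chosen for this example.

```python
total_reads = 45338688        # "in total" for accepted_hits.bam
marked_duplicates = 17717244  # "duplicates" in the Picard (Mark) column
after_picard = 27621444       # "in total" in the Picard (Remove) column
after_samtools = 29259330     # "in total" in the samtools column

# Fraction of reads that Picard flags as duplicates.
dup_rate = marked_duplicates / total_reads
print(f"Picard duplicate rate: {dup_rate:.2%}")

# Removing the marked reads leaves exactly the Picard (Remove) total.
print(total_reads - marked_duplicates == after_picard)

# Fraction of reads that samtools rmdup removed.
samtools_removed = (total_reads - after_samtools) / total_reads
print(f"samtools rmdup removed: {samtools_removed:.2%}")
```

The two tools remove different numbers of reads from the same input, so results from one should not be compared directly with results from the other.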
EXPRESSION
                      (CUFFLINKS)


EXPRESSION & MODELING




Adam Roberts et al., Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 2011, 27:2325–2329
NORMALIZATION
 •   Read counts need to be properly normalized to extract meaningful
     expression estimates
     •   First, RNA fragmentation during library construction causes longer
         transcripts to generate more reads compared to shorter transcripts
         present at the same abundance in the sample
     •   Second, the variability in the number of reads produced for each run
         causes fluctuations in the number of fragments mapped across samples




Garber M, Grabherr MG, Guttman M, Trapnell C. Computational
methods for transcriptome annotation and quantification using
RNA-seq. Nat Methods. 2011 Jun;8(6):469-77.
RPKM
      Reads per kilobase of transcript per million mapped reads:
      the relative expression level of a gene within a sample

                         RPKM = (10^9 × C) / (N × L)

      •      C : the number of mappable reads that fell onto the gene’s exons
      •      N : the total number of mappable reads in the experiment
      •      L : the sum of the exons in base pairs


Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008).
Mapping and quantifying mammalian transcriptomes by rna-seq. Nat
Methods, 5(7):621-628.
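With C, N, and L as defined on the RPKM slide, the measure from Mortazavi et al. (2008) is RPKM = 10^9 * C / (N * L). The sketch below shows how the length term removes the bias toward longer transcripts; all read counts and lengths are hypothetical values chosen for illustration.

```python
def rpkm(c, n, l):
    """RPKM = 10^9 * C / (N * L).

    c: reads mapped to the gene's exons
    n: total mappable reads in the experiment
    l: summed exon length in base pairs
    """
    return 1e9 * c / (n * l)

# Hypothetical numbers for illustration only.
n_mapped = 20_000_000

# A 2 kb gene with twice the reads of a 1 kb gene has the same RPKM:
# the extra reads are explained by the extra length.
short_gene = rpkm(c=500, n=n_mapped, l=1_000)
long_gene = rpkm(c=1_000, n=n_mapped, l=2_000)
print(short_gene, long_gene)
```

FPKM, used by Cufflinks, applies the same formula to fragments (read pairs) instead of single reads.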
CUFFLINKS

•   Cufflinks assembles transcripts, estimates their abundances, and tests for
    differential expression and regulation in RNA-Seq samples
•   Cufflinks constructs a parsimonious set of transcripts that "explain" the
    reads observed in an RNA-Seq experiment

http://cufflinks.cbcb.umd.edu/index.html
CUFFLINKS PACKAGE
•   cufflinks
    •   assembles transcripts
    •   estimates their abundances
•   cuffmerge
    •   a script that merges several Cufflinks assemblies into one
•   cuffdiff
    •   tests for differential expression
USAGE
$ cufflinks [options] <aligned_reads.(sam/bam)>

     Option              Value             Description

  -o/--output-dir        String            Sets the name of the directory in which Cufflinks will write all of its
                                           output. The default is "./".
  -p/--num-threads       int               Use this many threads to align reads. The default is 1.
  -G/--GTF               geneset           [Quantification] Use the supplied reference annotation (a GFF file) to
                                           estimate isoform expression. It will not assemble novel transcripts.
  -g/--GTF-guide         geneset           [Novel isoforms] Use the supplied reference annotation (GFF) to guide RABT
                                           assembly. Output will include all reference transcripts as well as any
                                           novel genes and isoforms that are assembled.
  -M/--mask-file         mask geneset      [Improving accuracy] Ignore all reads that could have come from transcripts
                                           in this GTF file. We recommend including any annotated rRNA, mitochondrial
                                           transcripts, or other abundant transcripts you wish to ignore in your
                                           analysis in this file.
  --library-type         fr-unstranded     Standard Illumina
                         fr-firststrand    dUTP, NSR, NNSR
                         fr-secondstrand   Ligation, Standard SOLiD
RUN
$ cd /KOGO/RNA-seq/outputs
$ cufflinks -o S01 -p 1 --library-type fr-unstranded -g ../ref/ens.gtf -M ../ref/mask.gtf \
    S01/accepted_hits.bam

     Category                Option                                   Value

      Output             -o/--output-dir                  /KOGO/RNA-seq/outputs/S01

      Thread            -p/--num-threads                               1

  Guide Geneset          -g/--GTF-guide                    /KOGO/RNA-seq/ref/ens.gtf

   Mask Geneset          -M/--mask-file                    /KOGO/RNA-seq/ref/mask.gtf


    Library Type          --library-type                          fr-unstranded
ALGORITHM




Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL,
Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nat
Biotechnol. 2010 May;28(5):511-5.
CUFFLINKS EXPRESSION
•   FPKM
    •   Fragments Per Kilobase of exon per Million fragments mapped
    •   analogous to single-read “RPKM”
•   Isoform expression estimation
    •   maximum likelihood estimation
•   Normalization
    •   by total number of mapped reads
    •   by upper quartile method
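The difference between the two normalization choices can be illustrated on toy data. The sketch below is an assumption-laden illustration of the general idea (scaling by the 75th percentile of non-zero counts instead of the total), not Cufflinks' internal implementation. The counts and the function name `upper_quartile` are hypothetical.

```python
import statistics

def upper_quartile(counts):
    """75th percentile of the non-zero counts in one sample (illustrative)."""
    nonzero = sorted(c for c in counts if c > 0)
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]

# Hypothetical per-gene counts; sample_a has one extremely abundant gene.
sample_a = [0, 10, 20, 30, 40, 5000]
sample_b = [0, 20, 40, 60, 80, 100]

# Scaling by the total is dominated by the single abundant gene in sample_a.
total_ratio = sum(sample_a) / sum(sample_b)

# Scaling by the upper quartile is insensitive to that gene.
uq_ratio = upper_quartile(sample_a) / upper_quartile(sample_b)

print(f"ratio of totals:          {total_ratio:.2f}")
print(f"ratio of upper quartiles: {uq_ratio:.2f}")
```

By the totals, sample_a looks far more deeply sequenced than sample_b. By the upper quartiles it looks shallower, which matches what most genes show.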
OUTPUT

        File                    Description

  transcripts.gtf               The GTF file contains Cufflinks' assembled isoforms.
  isoforms.fpkm_tracking        The estimated isoform-level expression values in the generic FPKM Tracking Format.
  genes.fpkm_tracking           The estimated gene-level expression values in the generic FPKM Tracking Format.
TRANSCRIPTS.GTF
Col.    Name         Example     Description

 1      seqname      chrX        Chromosome or contig name
 2      source       Cufflinks   The name of the program that generated this file (always 'Cufflinks')
 3      feature      exon        The type of record (always either "transcript" or "exon")
 4      start        77696957    The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)
 5      end          77712009    The rightmost coordinate of this record, inclusive
 6      score        1000        The most abundant isoform for each gene is assigned a score of 1000. Minor
                                 isoforms are scored by the ratio (minor FPKM/major FPKM)
 7      strand       +           Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
 8      frame        .           Cufflinks does not predict where the start and stop codons (if any) are located
                                 within each transcript, so this field is not used.
 9      attributes   ...         See below.
TRANSCRIPTS.GTF
    Attribute           Example    Description

    gene_id             CUFF.1     Cufflinks gene id
    transcript_id       CUFF.1.1   Cufflinks transcript id
    FPKM                101.267    Isoform-level relative abundance in Fragments Per Kilobase of exon model per
                                   Million mapped fragments
    frac                0.7647     Reserved. Please ignore, as this attribute may be deprecated in the future
    conf_lo             0.07       Lower bound of the 95% confidence interval of the abundance of this isoform, as a
                                   fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo)
    conf_hi             0.1102     Upper bound of the 95% confidence interval of the abundance of this isoform, as a
                                   fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_hi)
    cov                 100.765    Estimate for the absolute depth of read coverage across the whole transcript
    full_read_support   yes        When RABT assembly is used, this attribute reports whether or not all introns and
                                   internal exons were fully covered by reads from the data.
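The attributes column can be split into key/value pairs and the confidence bounds derived from it. The sketch below follows the fractional interpretation given in the table; the attribute string is assembled from the example values, and `parse_attributes` is a name made up for this example. Check your own output before relying on this, since the bounds may be reported differently depending on the Cufflinks version.

```python
import re

# Hypothetical attribute string assembled from the example values in the table.
ATTRS = ('gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; '
         'frac "0.7647"; conf_lo "0.07"; conf_hi "0.1102"; cov "100.765";')

def parse_attributes(field):
    """Split a GTF attributes field into a {key: value} dictionary."""
    return dict(re.findall(r'(\S+) "([^"]*)";', field))

attrs = parse_attributes(ATTRS)
fpkm = float(attrs["FPKM"])

# Bounds as fractions of the abundance, as described in the table.
lower = fpkm * (1.0 - float(attrs["conf_lo"]))
upper = fpkm * (1.0 + float(attrs["conf_hi"]))
print(f"{attrs['transcript_id']}: FPKM {fpkm} (95% CI {lower:.3f} to {upper:.3f})")
```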
FPKM TRACKING FILES
Col.   name              Example                Description
 1     tracking_id       TCONS_00000001         A unique identifier describing the object (gene, transcript, CDS, primary transcript)
 2     class_code        =                      The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
 3     nearest_ref_id    NM_008866.1            The reference transcript to which the class code refers, if any
 4     gene_id           NM_008866              The gene_id(s) associated with the object
 5     gene_short_name   Lypla1                 The gene_short_name(s) associated with the object
 6     tss_id            TSS1                   The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
 7     locus             chr1:4797771-4835363   Genomic coordinates for easy browsing to the object
 8     length            2447                   The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
 9     coverage          43.4279                Estimate for the absolute depth of read coverage across the object
10     FPKM              8.01089                FPKM of the object in sample
11     FPKM_lo           7.03583                The lower bound of the 95% confidence interval on the FPKM of the object in sample
12     FPKM_hi           8.98595                The upper bound of the 95% confidence interval on the FPKM of the object in sample
13     status            OK                     OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced),
                                                HIDATA (too many fragments in locus), or FAIL
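A tracking file can be read by column position, using the numbering in the table (column 10 is FPKM, column 13 is status). The sketch below ranks the objects whose quantification succeeded. The first data row reuses the example values from the table; the other two rows and the function name are hypothetical. Header names are taken from the table and may differ in real files, which is why positions are used instead.

```python
import csv
import io

HEADER = ["tracking_id", "class_code", "nearest_ref_id", "gene_id", "gene_short_name",
          "tss_id", "locus", "length", "coverage", "FPKM", "FPKM_lo", "FPKM_hi", "status"]
ROWS = [
    # Example values from the table above.
    ["TCONS_00000001", "=", "NM_008866.1", "NM_008866", "Lypla1", "TSS1",
     "chr1:4797771-4835363", "2447", "43.4279", "8.01089", "7.03583", "8.98595", "OK"],
    # Hypothetical rows for illustration.
    ["TCONS_00000002", "-", "-", "GENE_B", "GeneB", "-",
     "chr1:1-100", "100", "0.5", "0.25", "0", "0.5", "LOWDATA"],
    ["TCONS_00000003", "-", "-", "GENE_C", "GeneC", "-",
     "chr2:1-500", "500", "90.0", "20.5", "18.0", "23.0", "OK"],
]
TRACKING = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"

def ok_rows_by_fpkm(text):
    """Return (tracking_id, FPKM) for status OK rows, highest FPKM first."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line
    rows = [(r[0], float(r[9])) for r in reader if r[12] == "OK"]
    return sorted(rows, key=lambda item: item[1], reverse=True)

ranked = ok_rows_by_fpkm(TRACKING)
for tracking_id, fpkm in ranked:
    print(tracking_id, fpkm)
```

Filtering on the status column keeps LOWDATA, HIDATA, and FAIL entries out of downstream summaries.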
SIMPLE STATISTICS
$ cd /KOGO/RNA-seq/outputs/S01
# Check the highest expressed genes (FPKM is column 10)
$ sort -r -g -k 10,10 genes.fpkm_tracking | head -n 30
# Select the tracking_id and FPKM columns
$ cut -f 1,10 genes.fpkm_tracking > gene_fpkm_s
# Start R
$ R
> data <- read.table("gene_fpkm_s", header=TRUE)
> fpkm_s <- as.numeric(data[,2])
>
> mean(fpkm_s)
> sd(fpkm_s)
>
> fpkm_s.log10 <- log(fpkm_s+1,10)
> bin_seq = seq(min(fpkm_s.log10-0.1),max(fpkm_s.log10+0.1),by=0.1)
> hist(fpkm_s.log10, breaks=bin_seq, xlab='log10(x+1)', ylab='Number of genes', axes=TRUE)
>
> boxplot(fpkm_s.log10)
SUMMARY
•   Expression Level
•   Normalization
    •   RPKM (FPKM)
    •   Length Bias
•   Cufflinks
    •   Isoforms
    •   maximum likelihood estimation
CUFFMERGE

CUFFMERGE

 •   Use to merge together several Cufflinks assemblies
 •   Automatically filters a number of transfrags that are probably artifacts
 •   The main purpose of this script is to make it easier to make an
     assembly GTF file suitable for use with Cuffdiff

Trapnell C. et al. Differential gene and transcript expression analysis of
RNA-seq experiments with TopHat and Cufflinks.
Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016.
USAGE
$ cuffmerge [options] <assembly_GTF_list.txt>

    Option              Value          Description

  -o                    <outprefix>    Write the summary stats into the text output file <outprefix> (instead of stdout)
  -g/--ref-gtf          geneset        An optional "reference" annotation GTF. The input assemblies are merged together
                                       with the reference GTF and included in the final output.
  -p/--num-threads      <int>          Use this many threads to align reads. The default is 1.
  -s/--ref-sequence     <seq_dir>/     This argument should point to the genomic DNA sequences for the reference. If a
                        <seq_fasta>    directory, it should contain one fasta file per contig. If a multifasta file, all
                                       contigs should be present. The merge script will pass this option to cuffcompare,
                                       which will use the sequences to assist in classifying transfrags and excluding
                                       artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of
                                       lower-case bases are classified as repeats. Note that <seq_dir> must contain one
                                       fasta file per reference chromosome, and each file must be named after the
                                       chromosome, and have a .fa or .fasta extension.
RUN
$ cd /KOGO/RNA-seq/outputs
$ find ./ -iname transcripts.gtf > gtf_list.txt
$ cuffmerge -p 1 -g ../ref/ens.gtf -s ../ref/chr.fa gtf_list.txt


     Category                  Option                                      Value

   Outputprefix                   -o                                /KOGO/RNA-seq/outputs

     Geneset                 -g/--ref-gtf                     /KOGO/RNA-seq/ref/ens.gtf

      Thread              -p/--num-threads                                   1


    Reference             -s/--ref-sequence                    /KOGO/RNA-seq/ref/chr.fa
RUN
$ cd /KOGO/RNA-seq/outputs/merged_asm
$ less transcripts.gtf
$ less merged.gtf
$ gffread -g /KOGO/ref/chr.fa -w transcripts.fa transcripts.gtf
$ head transcripts.fa
>CUFF.11.1 gene=CUFF.11
GTGCATGTAACCCAAGAAGGGTTTGGCTGGGGGCTGTGGCAGCGCCAGAGTTCTGTTCGAATCCCAATTG
GGTTCTGGTCACAGATTTGGCATGGAGCAGAAGAGAGATACAGCATGGTTGAAAAGCAGTTATTGGCTAC
$ grep '>' transcripts.fa | head -n 30
>CUFF.2.1 gene=CUFF.2
>CUFF.11.1 gene=CUFF.11
>ENSGALT00000015891 gene=CUFF.11
>CUFF.12.1 gene=CUFF.12
DEG ANALYSIS

DIFFERENTIALLY EXPRESSED GENE

• Differences in the abundance of transcripts between different conditions

 Robinson MD, Oshlack A. A scaling normalization method for differential expression
 analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25.
 Epub 2010 Mar 2.

 Zhang et al., Mol Cancer Res June 2006 4; 401
LENGTH BIAS




Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data
confounds systems biology. Biol Direct. 2009 Apr 16;4:14.
BIAS




Robinson MD, Oshlack A. A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
REPLICATES
             Technical Replicates                   Biological Replicates

Source       Same samples                           Different samples
Purpose      The reproducibility of the results     A quantity from different sources under the same conditions
Issue        The differences are based only on      What is similar in your replicates and how they are
             technical issues in the measurement    different from a different set of conditions

More variance, more useful

Taylor S, Wakem M, Dijkman G, Alsarraj M, Nguyen M. A practical approach to RT-qPCR-Publishing data that conform
to the MIQE guidelines. Methods. 2010 Apr;50(4):S1-5. doi: 10.1016/j.ymeth.2010.01.005.

http://wiki.answers.com/Q/What_is_defference_between_Biological_replicates_and_technical_replicates
DEG METHODS

              Cuffdiff               DEGseq                 DESeq

Model         -                      Poisson                Negative binomial
Level         Isoform                Gene                   Gene
Input         geneset, BAM files     Raw Read Count         Raw Read Count
Replicates    Technical Replicates   Technical Replicates   Biological Replicates
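The distribution row matters because the two count models imply different variances. Under a Poisson model the variance equals the mean. Under a negative binomial model, in its common mean and dispersion form, the variance is mean + dispersion * mean^2. The sketch below compares them; the mean and dispersion values are hypothetical.

```python
def poisson_variance(mean):
    """Poisson: variance equals the mean."""
    return mean

def negative_binomial_variance(mean, dispersion):
    """Negative binomial (mean/dispersion form): mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2

# Hypothetical gene with mean count 100 and dispersion 0.1.
mean_count = 100
dispersion = 0.1

var_poisson = poisson_variance(mean_count)
var_nb = negative_binomial_variance(mean_count, dispersion)
print(f"Poisson variance:           {var_poisson}")
print(f"Negative binomial variance: {var_nb:.1f}")
```

The extra dispersion term is what lets a negative binomial model absorb the additional variability between biological replicates, which a Poisson model would read as real differences.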
CUFFDIFF
• Use to find significant changes in transcript expression,
 splicing, and promoter use.
Usage)
$ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]>
<sample2_replicate1.sam[,...,sample2_replicateM.sam]>

    Option               Value                        Description
 -o/--output-dir         <string>                     Sets the name of the directory in which Cuffdiff will write all
                                                      of its output. The default is "./".
 -L/--labels             <label1,label2,...,labelN>   Specify a label for each sample, which will be included in
                                                      various output files produced by Cuffdiff.
 -p/--num-threads        <int>                        Use this many threads to align reads. The default is 1.
RUN
$ cd /KOGO/RNA-seq/outputs
$ cuffdiff -o Diff-S01-S02 -L S01,S02 -p 1 merged_asm/merged.gtf \
    S01/accepted_hits.bam S02/accepted_hits.bam


   Category            Option                            Value

    Output          -o/--output-dir       /KOGO/RNA-seq/outputs/Diff-S01-S02

     Label           -L / --labels                      S01,S02

    Thread        -p / --num-threads                       1
OUTPUT
   Type          Files                            Description

  Genes          genes.fpkm_tracking              Gene [FPKMs, counts, read group tracking]. Tracks the summed
                 genes.count_tracking             [FPKMs, counts, read group tracking] of transcripts sharing each
                 genes.read_group_tracking        gene_id

  Isoforms       isoforms.fpkm_tracking           Transcript [FPKMs, counts, read group tracking]
                 isoforms.count_tracking
                 isoforms.read_group_tracking

  CDS            cds.fpkm_tracking                Coding sequence [FPKMs, counts, read group tracking]. Tracks
                 cds.count_tracking               the summed [FPKMs, counts, read group tracking] of transcripts
                 cds.read_group_tracking          sharing each p_id, independent of tss_id

  Primary        tss_groups.fpkm_tracking         Primary transcript [FPKMs, counts, read group tracking]. Tracks
  Transcripts    tss_groups.count_tracking        the summed [FPKMs, counts, read group tracking] of transcripts
                 tss_groups.read_group_tracking   sharing each tss_id
FPKM TRACKING FILES
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,3,4,5,9,10,13,14,17 genes.fpkm_tracking | head

Col.   Column name       Example          Description
 1     tracking_id       TCONS_00000001   A unique identifier describing the object (gene, transcript, CDS, primary transcript)
 3     nearest_ref_id    NM_008866.1      The reference transcript to which the class code refers, if any
 4     gene_id           NM_008866        The gene_id(s) associated with the object
 5     gene_short_name   Lypla1           The gene_short_name(s) associated with the object
 9     coverage          43.4279          Estimate for the absolute depth of read coverage across the object
10     q0_FPKM           8.01089          FPKM of the object in sample 0
13     q0_status         OK               OK (deconvolution successful), LOWDATA (too complex or shallowly
                                          sequenced), HIDATA (too many fragments in locus), or FAIL
14     q1_FPKM           8.55155          FPKM of the object in sample 1
17     q1_status         OK               OK (deconvolution successful), LOWDATA (too complex or shallowly
                                          sequenced), HIDATA (too many fragments in locus), or FAIL
OUTPUT
   Type          Files                Description

  Genes          gene_exp.diff        Gene differential FPKM. Tests differences in the summed FPKM of
                                      transcripts sharing each gene_id
  Isoforms       isoform_exp.diff     Transcript differential FPKM.
  CDS            cds_exp.diff         Coding sequence differential FPKM. Tests differences in the summed FPKM
                                      of transcripts sharing each p_id independent of tss_id
  Primary        tss_group_exp.diff   Primary transcript differential FPKM. Tests differences in the summed FPKM
  Transcripts                         of transcripts sharing each tss_id
  Splicing       splicing.diff        How much differential splicing exists between isoforms processed from a
                                      single primary transcript
  CDS            cds.diff             The amount of overloading detected among its coding sequences, i.e. how
                                      much differential CDS output exists between samples
  Promoter       promoters.diff       The amount of overloading detected among its primary transcripts, i.e. how
                                      much differential promoter use exists between samples
GENE_EXP.DIFF
  $ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
  $ cut -f 1,2,7,8,9,10,11,12,13,14 gene_exp.diff | head

Col.   Name                Example       Description
 1     Tested id           XLOC_000001   A unique identifier
 2     gene                Lypla1        The gene_name(s) or gene_id(s) being tested
 6     Test status         NOTEST        OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too
                                         complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
 7     FPKMx               8.01089       FPKM of the gene in sample x
 8     FPKMy               8.551545      FPKM of the gene in sample y
 9     log2(FPKMy/FPKMx)   0.06531       The (base 2) log of the fold change y/x
10     test stat           0.860902      The value of the test statistic used to compute significance of the observed
                                         change in FPKM
11     p value             0.389292      The uncorrected p-value of the test statistic
12     q value             0.985216      The FDR-adjusted p-value of the test statistic
13     significant         no            Can be either "yes" or "no", depending on whether p is greater than the FDR after
                                         Benjamini-Hochberg correction for multiple-testing
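The step from the p value column to the q value column is a Benjamini-Hochberg adjustment. The sketch below is a generic implementation of that procedure, not Cuffdiff's own code. The last p-value is the example from the table, the other p-values are hypothetical, and the 0.05 threshold is an assumed FDR setting.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Last value is the example p value from the table; the rest are hypothetical.
pvals = [0.001, 0.008, 0.039, 0.041, 0.389292]
FDR = 0.05  # assumed threshold

qvals = benjamini_hochberg(pvals)
significant = ["yes" if q <= FDR else "no" for q in qvals]
for p, q, s in zip(pvals, qvals, significant):
    print(f"p={p:<9} q={q:.5f} significant={s}")
```

Two of the hypothetical genes have p < 0.05 yet are not significant after correction, which is why the q value column, not the p value column, should drive the final gene list.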
SIMPLE STATISTICS
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis lt -1
gnuplot> set xlabel 'log(FPKMs of S01)'
gnuplot> set ylabel 'log(FPKMs of S02)'
gnuplot> pl 'genes.fpkm_tracking' u (log($10)):(log($14)) w points notitle, x notitle
gnuplot> exit


$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ awk '$NF == "yes"' gene_exp.diff > gene_exp.diff.yes
$ less gene_exp.diff.yes
$ awk '$NF == "no"' gene_exp.diff > gene_exp.diff.no
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis lt 2
gnuplot> set xlabel 'log2foldchange'
gnuplot> set ylabel '-log(p-value)'
gnuplot> pl 'gene_exp.diff.no' u 10:(-log($12)) lt 0 notitle, \
         'gene_exp.diff.yes' u 10:(-log($12)) lt 1 pt 6 ps 2 t 'DE'
gnuplot> exit
DESEQ
•   Differential gene expression analysis based
    on the negative binomial distribution
•   R
•   raw count
•   biological replicates
•   http://bioconductor.org/packages/release/bioc/html/DESeq.html




    http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf
HTSEQ-COUNT
•   To count how many reads map to each feature

•   Reads not counted for any feature are tallied in special counters:

    •   no_feature: reads which could not be assigned to any
        feature

    •   ambiguous: reads which could have been assigned to more
        than one feature and hence were not counted for any of
        these

    •   too_low_aQual: reads which were not counted due to the
        -a option

    •   not_aligned: reads in the SAM file without alignment

    •   alignment_not_unique: reads with more than one reported
        alignment. These reads are recognized from the NH
        optional SAM field tag.

•   If you have paired-end data, you have to sort the SAM file by
    read name first

        http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
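How a read ends up as counted, no_feature, or ambiguous can be sketched for the union mode described in the HTSeq documentation: take the union of all features overlapping any position of the read, and count the read only if that union contains exactly one feature. The code below is a toy illustration on single-interval features, not HTSeq's implementation. The gene names, coordinates, and function names are hypothetical.

```python
def features_at(pos, features):
    """Names of all features covering a single position (inclusive ends)."""
    return {name for name, (start, end) in features.items() if start <= pos <= end}

def assign_union(read_start, read_end, features):
    """Union mode: collect every feature overlapped by any read position."""
    hits = set()
    for pos in range(read_start, read_end + 1):
        hits |= features_at(pos, features)
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return next(iter(hits))

# Hypothetical overlapping genes on one chromosome.
features = {"geneA": (100, 200), "geneB": (180, 300)}

print(assign_union(110, 150, features))  # read inside geneA only
print(assign_union(170, 190, features))  # read touches geneA and geneB
print(assign_union(400, 450, features))  # read outside every feature
```

Because ambiguous reads are discarded, not split, overlapping annotations lower the counts of both genes involved.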
HTSEQ-COUNT
If you have paired-end data, you have to sort the SAM file by read name first

Usage)
$ htseq-count [options] <sam_file> <gff_file>
    (<gff_file>: a GFF file or an Ensembl GTF file)

Options)
-m <mode>
    union, intersection-strict, or intersection-nonempty
-s, --stranded=<yes, no, or reverse>
    whether the data is from a strand-specific assay (default: yes)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools sort -n accepted_hits.bam accepted_hits.nameSorted
$ samtools view accepted_hits.nameSorted.bam | htseq-count \
    -m union -s no - ../merged_asm/merged.gtf > accepted_hits.count
$ less accepted_hits.count
# ..... for (S02, S03, S04)
RUN
Run)
$ cd /KOGO/RNA-seq/outputs
$ mkdir DESeq
$ TBI-toolkit-make_matrix S01/accepted_hits.count 2 S02/accepted_hits.count 2 \
    S03/accepted_hits.count 2 S04/accepted_hits.count 2 > DESeq/hits.mtx
$ cd DESeq
$ less hits.mtx
$ cp /KOGO/RNA-seq/scripts/DESeq.4samples.R .
$ R CMD BATCH DESeq.4samples.R
# Use DESeq.2samples for 2 samples
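DESeq expects one table of raw counts with a row per gene and a column per sample. TBI-toolkit-make_matrix is an in-house tool whose exact behavior is not shown here, so the sketch below is only a guess at the kind of merge it performs. The function names and count values are hypothetical, and the special counters are dropped on the assumption that they should not be tested as genes.

```python
import csv
import io

# Special counters reported by htseq-count; these are not genes.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}

def read_counts(text):
    """Parse two-column (feature, count) htseq-count output."""
    counts = {}
    for feature, value in csv.reader(io.StringIO(text), delimiter="\t"):
        if feature not in SPECIAL:
            counts[feature] = int(value)
    return counts

def make_matrix(samples):
    """samples: {sample_name: {gene: count}} -> list of rows, header first."""
    names = list(samples)
    genes = sorted(set().union(*(samples[name] for name in names)))
    rows = [["gene"] + names]
    for gene in genes:
        rows.append([gene] + [samples[name].get(gene, 0) for name in names])
    return rows

# Hypothetical count files for two samples.
S01 = "XLOC_000001\t10\nXLOC_000002\t0\nno_feature\t5\n"
S02 = "XLOC_000001\t12\nXLOC_000002\t3\nno_feature\t7\n"

matrix = make_matrix({"S01": read_counts(S01), "S02": read_counts(S02)})
for row in matrix:
    print("\t".join(str(cell) for cell in row))
```

The counts stay as raw integers: DESeq works on unnormalized counts and applies its own normalization.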
DEG METHODS
•   Cuffdiff, baySeq, DESeq, edgeR and NOISeq generated consistent results
•   edgeR identified more differentially expressed genes than the other
    methods at the same cut-off, which might imply weaker control of
    type I error with this method




Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis
from reads to differential gene expression and cross-comparison with microarrays: a case
study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012 Nov 1;40(20):10084-97.
SUMMARY
•   DEG
•   Replicate
    •   Technical replicates
    •   Biological replicates
•   Cuffdiff
•   HTSeq-count
•   DESeq
REPORT
(CUMMERBUND)
[Workflow diagram: Filtering -> Read Mapping -> Duplication -> Expression Level -> Gene Structure -> DEG analysis -> Annotation -> Report; QC tools: FastQC, RSeQC]
CUMMERBUND
• An R package designed to aid and simplify the task of
  analyzing Cufflinks RNA-Seq output.

• R

• Uses SQLite

  • cuffData.db
CUMMERBUND DB SCHEMA
RUN
Run)
$ cd /KOGO/outputs/Diff-S01-S02
$ R
> library(cummeRbund)
> cuff <- readCufflinks()
> cuff

# Global statistics and Quality Control
> disp <- dispersionPlot(genes(cuff))
> disp
# Density
> dens <- csDensity(genes(cuff))
> dens
# Boxplot
> b <- csBoxplot(genes(cuff))
> b
# Volcano
> v <- csVolcanoMatrix(genes(cuff))
> v
> v <- csVolcano(genes(cuff),"S01","S02")

# Pairwise Scatterplots
> s <- csScatter(genes(cuff),"S01","S02",smooth=T)
> s

# Geneset level plots
> data(sampleData)
> myGeneIds <- sampleIDs
> myGenes <- getGenes(cuff,myGeneIds)
> h <- csHeatmap(myGenes,cluster='both')
> h
# Barplot
> b <- expressionBarplot(myGenes)
> b
# Cluster
> ic <- csCluster(myGenes,k=4)
> icp <- csClusterPlot(ic)
> icp
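
Before loading the Cuffdiff results into cummeRbund, it can help to check on the command line which genes were called significant. The sketch below filters a `gene_exp.diff` file by its `significant` column. The header is assumed to follow the Cufflinks 2.x `gene_exp.diff` layout, columns are looked up by name and not by position, and the rows and the function name `list_significant` are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy gene_exp.diff with the Cufflinks 2.x header (assumed); values are made up.
diff_file=$(mktemp)
{
    printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
    printf 'XLOC_000001\tXLOC_000001\tgeneA\t1:100-900\tS01\tS02\tOK\t10.0\t40.0\t2.0\t3.1\t0.0001\t0.004\tyes\n'
    printf 'XLOC_000002\tXLOC_000002\tgeneB\t1:2000-2900\tS01\tS02\tOK\t5.0\t5.5\t0.1375\t0.2\t0.81\t0.93\tno\n'
    printf 'XLOC_000003\tXLOC_000003\tgeneC\t2:50-700\tS01\tS02\tOK\t80.0\t20.0\t-2.0\t-2.9\t0.0003\t0.008\tyes\n'
} > "$diff_file"

# list_significant FILE: print gene, log2 fold change and q-value for every
# row whose "significant" column is "yes".
list_significant() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
        $(col["significant"]) == "yes" {
            print $(col["gene"]), $(col["log2(fold_change)"]), $(col["q_value"])
        }' "$1"
}

list_significant "$diff_file"
```

In the workshop layout the real file would be `gene_exp.diff` inside the Diff-S01-S02 directory.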
ADDITIONAL ANALYSIS
VIEWER
•   IGV
    •   Integrative Genomics Viewer
    •   http://www.broadinstitute.org/igv/

Run) Generate BAM index
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ ls
accepted_hits.bam.bai
GO ENRICHMENT
• GO annotation
 • Using SwissProt
  • Blastx
 • Blast2Go
 • InterProScan

• GO Enrichment
 • GOseq
 • Fisher's exact test
 • DAVID
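
As a worked illustration of the Fisher's exact test entry above, the sketch below computes a one-sided enrichment p-value for a single GO term from the hypergeometric distribution: the probability of seeing at least k DEGs annotated to the term when n DEGs are drawn from N genes, of which K carry the term. The function name `fisher_enrichment` and the numbers are made up. This is only the basic test: GOseq additionally corrects for gene length bias, which is not modelled here.

```shell
#!/usr/bin/env bash
# fisher_enrichment N K n k
#   N = genes in the background, K = background genes annotated to the term,
#   n = DEGs,                    k = DEGs annotated to the term.
# Prints P(X >= k) under the hypergeometric distribution (one-sided test).
fisher_enrichment() {
    awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
        # log of the binomial coefficient C(a, b), for 0 <= b <= a
        function lchoose(a, b,    i, s) {
            s = 0
            for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
            return s
        }
        BEGIN {
            lo = n - (N - K); if (lo < k) lo = k   # smallest feasible overlap >= k
            hi = (K < n) ? K : n                   # largest feasible overlap
            p = 0
            for (x = lo; x <= hi; x++)
                p += exp(lchoose(K, x) + lchoose(N - K, n - x) - lchoose(N, n))
            printf "%.4f\n", p
        }'
}

# 20 genes, 5 annotated to the term, 6 DEGs, 3 of them annotated
fisher_enrichment 20 5 6 3
```

With real data the test is repeated for every GO term, so the resulting p-values need a multiple-testing correction before they are interpreted.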
CONCLUSION
•   (m)RNA-seq analysis
•   Reference-based method
•   NGS data analysis
    •   RNA-seq vs. DNA-seq
•   Filtering
    •   Low Quality
    •   PCR Duplication
•   Mapping
    •   RNA mapper
•   Gene Expression
    •   Normalization
•   DEG Analysis
    •   RPKM
    •   Replicates
END

Kogo 2013 RNA-seq analysis

  • 1.
    RNA-SEQ ANALYSIS 고준수, 송상훈, 김현민 테라젠 바이오 연구소 2012. 2. 5
  • 2.
    CONTENTS • NGS • Mapping • RNA-seq • PCR Duplication • File Forat • Expression • Workflow • DEG • Preparation • Report • Filtering & QC
  • 3.
    TODAY’S KEYWORDS NGS File Format Illumina, Paired-End Fastq, BAM DEG Cuffdiff, DESeq RNA-seq mRNA, Reference-based Expression Design Cufflinks, Cuffmerge Mapping Replicates TopHat
  • 4.
  • 5.
  • 6.
    NEXT-GENERATION SEQUENCING 2nd Generation 3rd Generation Nat Rev Genet. 2010 Jan;11(1):31-46. doi: 10.1038/ nrg2626. Epub 2009 Dec 8. Sequencing technologies - the next generation. Metzker ML.
  • 7.
    NGS WEAKNESS ANDOVERCOMING Sanger 0.001% Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341. Nature Biotechnology 26, 1135 - 1145 (2008), Next-generation DNA sequencing, Shendure J. and Ji H.
  • 8.
    NGS Library Construction Sequencing Raw Reads http://users.ugent.be/~avierstr/nextgen/nextgen.html
  • 9.
    GENERAL NGS ANALYSISPROCESS Speed 3 1 WGS Low depth < NT < High depth Mapping 2 Depth (Coverage) Coverage Shearer AE, Hildebrand MS, Sloan CM, Smith RJ. Deafness in the genomics era. Hear Res. 2011 Dec;282(1-2):1-9. doi: 10.1016/j.heares.2011.10.001. Epub 2011 Oct 8.
  • 10.
    MAPPING TOOLS BWA • Mapper Type for • DNA • RNA WGS • miRNA • bisulphite TopHat for RNA Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high- throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.
  • 11.
    PCR DUPLICATION http://www.clcbio.com/clc-plugin/duplicate-reads- removal-plugin/ remove
  • 12.
    ILLUMINA PAIRED-END mate-pair inner distnace http://vallandingham.me/ RNA_seq_differential_expr ession.html Quinlan AR, Boland MJ, Leibowitz ML, Shumilina S, Pehrson SM, Baldwin KK, Hall IM. Genome sequencing of mouse induced pluripotent stem cells reveals retroelement stability and infrequent DNA rearrangement during reprogramming. Cell Stem Cell. 2011 Oct 4;9(4):366-73. doi: 10.1016/j.stem.2011.07.018. Haas BJ, Zody MC. fastq_1 Advancing RNA-Seq analysis. Nat Biotechnol. 2010 May;28(5):421-3. doi: 10.1038/nbt0510-421. http://users.ugent.be/~avierstr/ fastq_2 nextgen/nextgen.html
  • 13.
    SUMMARY • NGS platform : Short Reads, Depth, Coverage • Sequencing Protocol • Analysis Protocol • Mapping • PCR duplication • Illumina Paired-end
  • 14.
  • 15.
    TRANSCRIPTOME • The complete set of transcripts in a cell, and their quantity • The key aims of transcriptomics are: • to catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs • to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications • to quantify the changing expression levels of each transcript during development and under different conditions. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/ nrg2484.
  • 16.
    ADVANTAGES OF RNA-SEQ WangZ, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
  • 17.
    RNA-SEQ & MICROARRAY WangZ, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/ nrg2484.
  • 18.
    RNA-SEQ • Gene expression level • Relative expression level in sample • Differentially expressed gene • Identification of alternative spliced transcripts • Prediction of novel transcripts • Gene Fusion Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
  • 19.
    RNA-SEQ VS. DNA-SEQ RNA-seq DNA-seq Reference-based, WES, Methods de novo assembly WGS re-sequencing, WGS de novo Expression, Differentially Expressed Genes, Goal Novel transcript, SNPs, Indels, SV Alternative splicing form, Gene fusion Measure Mapped Read Count Base accuracy
  • 20.
    OVERVIEW OF ATYPICAL RNA-SEQ
  • 21.
    RNA MAPPING Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11(12):220. Trapnell C, Salzberg SL., How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.
  • 22.
    MAPPER Mapper Data Seq.Plat. Input Output Cit. Cit/years Reference MapSplice RNA I FASTA/Q SAM, BED 50 28.17 Wang et al. (2010) MicroRazerS miRNA N FASTA/Q SAM, TSV 7 2.75 Emde et al. (2010) mrFAST miRNA I FASTA/Q SAM 158 58.34 Alkan et al. (2009) mrsFAST miRNA I,So FASTA/Q SAM 32 18.03 Hach et al. (2010) Passion RNA I,4,Sa,P FASTA/Q BED - - Zhang et al. (2012) PatMaN miRNA N FASTA TSV 38 9.36 Prufer et al. (2008) QPALMA RNA I,4 Specific TSV 75 21.11 De Bona et al. (2008) RNA-Mate RNA So CFASTA BED, Counts 28 10.04 Cloonan et al. (2009) RUM RNA I,4 FASTA/Q SAM,TSV,BED 2 2.36 Grant et al. (2011) SOAPSplice RNA I,4 FASTA/Q TSV 3 3.54 Huang et al. (2011) SpliceMap RNA I FASTA/Q SAM, BED 63 29.80 Au et al. (2010) Supersplat RNA N FASTA TSV 21 9.93 Bryant Jr et al. (2010) TopHat RNA I FASTA/Q, GFF BAM 389 121.04 Trapnell et al. (2009) The number of citations (Cit.) was obtained from Google Scholar on April 14, 2012 Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.
  • 23.
    ANALYSIS STRATEGIES Reference-based de novo •Using a reference genome •not use a reference genome Method •The transcriptome assembly can be built upon it • Contamination or sequencing • Not depend on a reference genome artefacts are not a major concern • Not depend on the correct alignment • Very sensitive and can assemble of reads to known splice sites or the Adv. transcripts of low abundance prediction of novel splicing sites • To discover novel transcripts that • Trans-spliced transcripts can be are not present in the current assembled annotation Disadv. • Depends on the quality of the • Computing resources reference genome being used. • Senstive to sequencing errors Depth ~ 10x > 30x Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/ nrg3068.
  • 24.
    REFERENCE-BASED Martin JA, WangZ. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.
  • 25.
    REFERENCE-BASED Martin JA, WangZ. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.
  • 26.
    SUMMARY • Transcriptome • RNA-seq advantages • Process • Analysis strategies • Reference-based method
  • 27.
  • 28.
    FILE FORMAT • NGS • Fastq • SAM/BAM • VCF • Reference • Fasta • GTF / GFF
  • 29.
    Sequencer FASTQ FORMAT Fastq • de factor standard file format for raw reads • fq, fastq, fq.gz, fastq.gz 1: @title identifier description S01_1.fq 2: Sequence 3: + description 4: Quality values Paired-end S01_2.fq
  • 30.
    QUALITY SCORE • The base-calling error probabilities. • Types • Pred33 / Illumina 1.8+ • Score 0~60 • ASCII 33 ~ 126 • Solexa / Illumina 1.0 • -5~62 • ASCII 56 ~ 126 • Pred64 / Illumina 1.3 ~ 1.5 • 0 ~ 62 • ASCII 64 ~126 http://www.asciitable.com
  • 31.
    SAM / BAMFORMAT Sequencer • SAM stands for Sequence Alignment/Map format. • TAB-delimited text format Fastq • 11 mandatory fields Mapper SAM/ BAM Read Name Flag Quality Reference Length Pos. of Position Mate
  • 32.
    SAM / BAM Flag SAM CIGAR
  • 33.
    TOOLS FOR SAM/BAM •Samtools • tview • index • mpileup • view • Picard • sort • SortSam • faidx • MarkDuplicates • flagstat • ......
  • 34.
    GTF (ENSEMBL) Gene ID Transcript ID protein_coding, mtRNA, miRNA, lincRNA, pseudogene......
  • 35.
    SUMMARY • Fastq format • de facto standard • Quality Score • Pred33/Illumina 1.8+, Illumina 1.0, Pred64/Illumina 1.3~1.5 • SAM/BAM format • GTF
  • 36.
  • 37.
  • 38.
    REFERENCE WORKFLOW Mapped Assembled Sample reads transcripts 1 Final TopHat Cufflinks Cuffmerge transcriptome Cuffdiff assembly Differential Sample Mapped Assembled expression 2 reads transcripts results Expression CummeRbund plots
  • 39.
    OUR WORKFLOW Samples Reference Geneset DEG analysis Read Expression Gene Cuffdiff Filtering DEGseq Report Mapping Level Structure Duplication DESeq TopHat Picard RUM Samtools Cufflinks cummeRbund TBI-toolkit Cuffmerge GO BWA HTseq-count Annotation Bowtie2 UniProt FastQC RSeQC GO KEGG
  • 40.
  • 41.
    DIRECTORY /KOGO/RNA-seq ref chr.fa, ens.gtf, mask.gtf inputs S01.fq.gz, S02.fq.gz outputs S01 accepted_hits.bam, transcripts.gtf ...... accepted_hits.bam, transcripts.gtf merged_asm merged.gtf, transcripts.gtf scripts Diff-S01-S02 gene_exp.diff, isoforms_exp.diff Tools
  • 42.
    SAMPLES 운동전 운동후 Horse 1 S01 S02 Horse I1 S03 S04
  • 43.
    TOOLS Category Programs Version Homepage QC FastQC 0.10.1 http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Bowtie2 2.0.5 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml Mapper TopHat 2.0.7 http://tophat.cbcb.umd.edu Cufflinks 2.0.2 http://cufflinks.cbcb.umd.edu Abundance HTseq-count - http://www-huber.embl.de/users/anders/HTSeq/doc/count.html DESeq 1.10.1 http://bioconductor.org/packages/release/bioc/html/DESeq.html Annotation goseq 1.10.0 http://www.bioconductor.org/packages/2.11/bioc/html/goseq.html samtools 0.1.18 http://samtools.sourceforge.net picard 1.83 http://picard.sourceforge.net Tools TBI-toolkit 0.1 http://dev.totalomics.kr/ R 2.15.0 http://www.r-project.org Gnuplot - http://www.gnuplot.info
  • 44.
    TBI-TOOLKIT • TBI NGS Toolkit • http://dev.totalomics.kr • Application • TBI-toolkit-qscore • TBI-toolkit-fq_filter • TBI-toolkit-gtf_selector • TBI-toolkit-fa_spliter • TBI-toolkit-make_matrix
  • 45.
    REFERENCE • Reference-based strategy Name FileType Description Reference fasta Genome Sequence Geneset GTF2.2/GFF3 Reference Geneset Name Source Description Geneset that has ncRNA information. Mask Geneset Geneset (rRNA, tRNA, and other ncRNA) Bowtie2 Index Reference Index files for running bowtie2 Optional GO information GO Gene ontology information for GO enrichment
  • 46.
    REFERENCE SOURCE • Ensembl (http://www.ensembl.org) • General file format for all species • Geneset (GTF format) • Constant Database schema for all species • Comprehensive Annotation (GO, InterPro, Pfam, Prosite Smart, ...... ) • Automated Update • UCSC (http://genome.ucsc.edu) • Semi general file format for all species • Semi constant Database schea for all species • Gene table dump (BED format compatible) • Annotation (Pfam, Kegg) • Comparative Analysis • NCBI • Raw data bank • GFF type geneset file
  • 47.
    ENSEMBL ensembl.org plants.ensembl.org fungi.ensembl.org metazoa.ensembl.org protists.ensembl.org bacteria.ensembl.org
  • 48.
    ENSEMBL • Homo Sapiens ( ftp://ftp.ensembl.org/pub/release-69 ) • fasta/homo_sapiens/ chr.fa • dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz • dna/Homo_sapiens.GRCh37.69.dna.chromosome.1.fa.gz • cdna/Homo_sapiens.GRCh37.69.cdna.all.fa.gz ens.gtf • gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz • mysql/homo_sapiens_core_69_37/ • Arabidopsis thaliana ( ftp://ftp.ensemblgenomes.org/pub/release-16/plants ) • fasta/arabidopsis_thaliana • dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz • cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz • gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz • mysql/arabidopsis_thaliana_core_16_69_10/
  • 49.
    PRE-PROCESSING • Check quality score type of input file • Reference files • Reference index • Mask geneset
  • 50.
    SAMPLE QUALITY SCORE Usage) $TBI-toolkit-qscore [FASTQ] Sanger(Phred33) or Illumina 1.8+ 0 to 93 using ASCII 33 to 126 Run) $ cd /KOGO/RNA-seq/inputs $ TBI-toolkit-qscore S01_1.fq.gz Sanger(Phred33) or Illumina 1.8+ 0 to 93 using ASCII 33 to 126 0:1, 1:”, 2:#, 3:$, 4:%, 5:&, ......
  • 51.
    REFERENCE INDEX Index forbowtie2 mapper Usage) $ bowtie2-build [options] <reference_in> <bt2_base> Run) $ cd /KOGO/RNA-seq/ref $ bowtie2-build chr.fa chr.fa $ ls chr.fa.1.bt2 chr.fa.2.bt2 ...... Fasta index Usage) $ samtools faidx <ref.fasta> Run) $ cd /KOGO/RNA-seq/ref $ samtools faidx chr.fa $ ls chr.fa.fai
  • 52.
    MASK GENESET ...... Werecommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates. cufflinks manuals (http://cufflinks.cbcb.umd.edu/manual.html) Usage) $ TBI-toolkit-gtf_selector [IN GTF] [OUT GTF] [Source 1] [Source 2] ...... Run) $ cd /KOGO/RNA-seq/ref $ TBI-toolkit-gtf_selector ens.gtf mask.gtf tRNA rRNA Mt_tRNA Mt_rRNA
  • 53.
    SUMMARY • Directory • /KOGO/RNA-seq • Tools • Reference • Pre-processing
  • 54.
    FILTERING & QC DEG Read analysis Filtering Expression Gene Mapping Report Duplication Level Structure FastQC RSeQC Annotation
  • 55.
    Filtering FILTERING & QC Mapping Duplication • Improving assembly accuracy Expression • Removing artifacts Gene • Sequencing adaptor Structure • Low quality reads DEG Annotation • Near-identical reads Report • PCR amplification • rRNA and other RNA • Applications • Filtering - TBI-toolkit, fastx-toolkit • QC - FastQC, SolexaQC, RSeQC
  • 56.
    QUALITY CONTROL • FastQC ( v0.10.1 ) • A quality control tool for high throughput sequence data. • Java • http://www.bioinformatics.babraham.ac.uk/ projects/fastqc/ • RSeQC • RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data • http://code.google.com/p/rseqc/
  • 57.
    FASTQC Usages) $ fastqc seqfile1seqfile2 .. seqfileN $ fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN Arguments -f format bam,sam,bam_mapped,sam_mapped and fastq -t threads Run) $ cd /KOGO/RNA-seq/inputs $ fastqc -f fastq -t 2 S01_1.fq.gz S01_2.fq.gz Output) $ firefox R01_1.fq_fastqc/fastqc_report.html $ firefox R01_2.fq_fastqc/fastqc_report.html
  • 58.
    FASTQC Per Base SequenceQuality Per Sequence Quality Scores Per Base Sequence Content Per Base GC Content Per Sequence GC Content Per Base N Content Sequence Length Distribution Duplicate Sequences
  • 59.
  • 60.
    READ FILTERING (CUTOFF) RNA-seq DNA-seq N > 10% N > 10% Low Average QV < Q20 Average AV < Q20 Quality NT (<Q20) > 40% NT (<Q20) > 5% No trimming Trimming or Trimming Trimming
  • 61.
    FILTERING Usages) $ TBI-toolkit filter[option*] seqfile_1 seqfile_2 output_1 output_2 Option) -n N_ratio -a integer : Average QV of read -m NT_ratio < QV Run) $ cd /KOGO/RNA-seq/inputs $ TBI-toolkit-fq_filter -n 0.1 -m 0.4 -a 20 S01_1.fq.gz S01_2.fq.gz S01_Q20_1.fq.gz S01_Q20_2.fq.gz $ ls S01_Q20_1.fq.gz S01_Q20_2.fq.gz S01_Q20.log S01_Q20.err $ cat S01_Q20.log $ less S01_Q20.err
  • 62.
    FASTQC Run) $ cd /KOGO/RNA-seq/inputs $fastqc -f fastq -t 2 S01_Q20_1.fq.gz S01_Q20_2.fq.gz
  • 63.
    SUMMARY • Read Quality • FastQC • RSeQC • Filter
  • 64.
    MAPPING READS (TOPHAT) DEG Read analysis Filtering Expression Gene Mapping Report Duplication Level Structure FastQC RSeQC Annotation
  • 65.
    Filtering TOPHAT Mapping Duplication • TopHat is a fast splice junction mapper for RNA- Expression Seq reads. Gene Structure • It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short DEG Annotation read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. Report Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.
  • 66.
    USAGE Usage $ tophat [options]<bowtie_index_base> <reads1_1> <reads1_2> Option Value Description -o/--output-dir string The default is "./tophat_out". -p/--num-threads int Use this many threads to align reads. The default is 1. -r/--mate-inner-dist int This is the expected (mean) inner distance between mate pairs.The default is 50bp The standard deviation for the distribution on inner distances between mate pairs. --mate-std-dev int The default is 20bp. fr-unstranded fr-unstranded : Standard Illumina --library-type fr-firststrand fr-firststrand : dUTP, NSR, NNSR fr-secondstrand fr-secondstrand : Ligation, Standard Solid --solexa-quals - Use the Solexa scale for quality values in FASTQ files. --solexa1.3-quals - Phred64/Illumina 1.3~1.5 -G/--GTF Geneset Geneset (GTF 2.2 or GFF3 formatted file) --rg-id string Read group ID --rg-sample string Sample ID
  • 67.
    RUN $ cd /KOGO/RNA-seq/outputs $tophat -o S01 -p 1 -r 170 --library-type fr-unstranded -G ../ref/ens.gtf --rg-id S01_Q20 --rg-sample S01_Q20 ../ref/chr.fa ../inputs/S01_Q20_1.fq.gz ../inputs/S01_Q20_2.fq.gz Category Option Value Output -o/--output-dir /KOGO/RNA-seq/outputs/S01 Thread -p/--num-threads 1 Inner Distance Mean -r/--mate-inner-dist 170 check Inner distance SD. --mate-std-dev 20 (default) Library Type --library-type fr-unstranded (Standard Illumina) Quality Score Phred33 (default) Geneset -G/--GTF /KOGO/RNA-seq/ref/ens_69.gtf --rg-id Read Group S01_Q20 --rg-sample
  • 68.
    ALGORITHM Trapnell C, PachterL, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.
  • 69.
    TOPHAT • Two step method • Extracting the transcript sequences and using Bowtie to align reads to this virtual transcriptome first. • Only the reads that do not fully map to the transcriptome will then be mapped on the genome. • Optimized for reads >= 75bp • The values in the first column of the provided GTF/GFF file must match the name of the reference sequence in the Bowtie index you are using with TopHat.
  • 70.
    OUTPUT Filename Types Description A list of read alignments in SAM format. accepted_hits.bam BAM Coordinate-sorted unmapped.bam BAM A list of unmapped read in SAM format. junctions.bed UCSC BED A track of junctions reported by TopHat chromLeft referes to the last genomic base insertions.bed UCSC BED before the insertion chromLeft referes to the first genomic base deletions.bed UCSC BED before the insertion
  • 71.
    SIMPLE ALIGNMENT VIEW Usage $cd /KOGO/RNA-seq/output/S01 $ samtools index accepted_hits.bam 25:413751 $ samtools tview accepted_hits.bam ../../ref/chr.fa Key Desc ? This window Arrows Small scroll movement H,J,K,L Large scroll movement space Scroll one screen backspace Scroll back one screen g Go to specific location m Color for mapping qual n Color for nucleotide b Color for base quality . Toggle on/off dot view q Exit
  • 72.
    MAPPING STATISTICS Run) Run) $ cd /KOGO/RNA-seq/outputs/S01 $ cd /KOGO/RNA-seq/outputs/S01 $ samtools flagstat accepted_hits.bam $ bam_stat.py -i accepted_hits.bam 45338688 + 0 in total (QC-passed reads + QC-failed Total Reads (Records): 45338688 reads) QC failed: 0 0 + 0 duplicates Optical/PCR duplicate: 0 45338688 + 0 mapped (100.00%:-nan%) Non Primary Hits 1861695 45338688 + 0 paired in sequencing Unmapped reads: 0 22757885 + 0 read1 Multiple mapped reads: 586067 22580803 + 0 read2 39796048 + 0 properly paired (87.78%:-nan%) Uniquely mapped: 42890926 42308960 + 0 with itself and mate mapped Read-1: 21527100 Read-2: 21363826 3029728 + 0 singletons (6.68%:-nan%) Reads map to '+': 21457407 705846 + 0 with mate mapped to a different chr Reads map to '-': 21433519 92166 + 0 with mate mapped to a different chr Non-splice reads: 32872272 (mapQ>=5) Splice reads: 10018654 Reads mapped in proper pairs: 38402964
  • 73.
    SUMMARY • TopHat •Splice junction • Geneset • Two step method • accepted_hits.bam
  • 74.
    PCR DUPLICATES (OPTIONAL) DEG Read analysis Filtering Expression Gene Mapping Report Duplication Level Structure FastQC RSeQC Annotation
  • 75.
    PCR DUPLICATION Filtering Mapping Duplication • Removing reads that have same mapping coordinates. Expression • Tools Gene • samtools - rmdup Structure • Picard - MarkDuplicates DEG Annotation Run) Report $ cd /KOGO/RNA-seq/outputs/S01/ $ samtools rmdup accepted_hits.bam accepted_hits.rmdup.bam Run) $ cd /KOGO/RNA-seq/outputs/S01/ $ java -jar /KOGO/RNA-seq/Tools/Picard/MarkDuplicates.jar INPUT=accepted_hits.bam OUTPUT=accpted_hits.mark_dup.bam ASSUME_SORTED=true REMOVE_DUPLICATES=true METRICS_FILE=accpeted_hits.metric
  • 76.
    PCR DUPLICATION accepted_hits.bam samtools Picard (Mark) Picard (Remove) 45338688 + 0 in total 29259330 + 0 in total 45338688 + 0 in total 27621444 + 0 in total 0 + 0 duplicates 0 + 0 duplicates 17717244 + 0 duplicates 0 + 0 duplicates 45338688 + 0 mapped 29259330 + 0 mapped 45338688 + 0 mapped 27621444 + 0 mapped 45338688 + 0 paired 29259330 + 0 paired 45338688 + 0 paired 27621444 + 0 paired 22757885 + 0 read1 14717809 + 0 read1 22757885 + 0 read1 13820471 + 0 read1 22580803 + 0 read2 14541521 + 0 read2 22580803 + 0 read2 13800973 + 0 read2 39796048 + 0 properly 24471885 + 0 properly 39796048 + 0 properly 24945306 + 0 properly paired (87.78%:-nan%) paired (83.64%:-nan%) paired (87.78%:-nan%) paired (90.31%:-nan%) 42308960 + 0 with itself and 26229602 + 0 with itself and 42308960 + 0 with itself and 26660814 + 0 with itself and mate mapped mate mapped mate mapped mate mapped 3029728 + 0 singletons 3029728 + 0 singletons 3029728 + 0 singletons 960630 + 0 singletons (6.68%:-nan%) (10.35%:-nan%) (6.68%:-nan%) (3.48%:-nan%) 705846 + 0 with mate 705846 + 0 with mate 705846 + 0 with mate 655922 + 0 with mate mapped to a different chr mapped to a different chr mapped to a different chr mapped to a different chr 92166 + 0 with mate 92166 + 0 with mate 92166 + 0 with mate 52614 + 0 with mate mapped to a different chr mapped to a different chr mapped to a different chr mapped to a different chr (mapQ>=5) (mapQ>=5) (mapQ>=5) (mapQ>=5)
  • 77.
    EXPRESSION (CUFFLINKS) DEG Read analysis Filtering Expression Gene Mapping Report Duplication Level Structure FastQC RSeQC Annotation
  • 78.
    EXPRESSINO & MODELING Adam  Roberts  et  al.,  Iden%fica%on  of  novel  transcripts  in  annotated   genomes  using  RNA-­‐Seq.  Bioinforma4cs,  2011,    27:2325–2329
  • 79.
    NORMALIZATION • Read counts need to be properly normalized to extract meaningful expression estimates • First, RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample • Second, the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 Jun;8(6):469-77.
  • 80.
    RPKM the reads per kilobase of transcript per million mapped reads Relative Expression Level in Sample • C : the number of mappable reads that fell onto the gene’s exons • N : the total number of mappable reads in the experiment • L : the sum of the exons in base pairs Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):621-628.
  • 81.
    Filtering CUFFLINKS Mapping Duplication • Cufflinks assembles transcripts, estimates their Expression abundances, and tests for differential expression and regulation in RNA-Seq Gene Structure samples DEG Annotation • Cufflinks constructs a parsimonious set of Report transcripts that "explain" the reads observed in an RNA-Seq experiment http://cufflinks.cbcb.umd.edu/index.html
  • 82.
    CUFFLINKS PACKAGE • cufflinks • assembles transcripts • estimates their abundances • cuffmerge • a script called cuffmerge that you can use to merge together several Cufflinks assemblies. • cuffdiff • tests for differential expression
  • 83.
    USAGE $ cufflinks [options]<aligned_reads.(sam/bam)> Option Value Description Sets the name of the directory in which Cufflinks will write all of its output. -o/--output-dir String The default is "./". Quantification -p/--num-threads int Use this many threads to align reads. The default is 1. Use the supplied reference annotation (a GFF file) to estimate isoform -G/--GTF geneset expression. It will not assemble novel transcripts. Novel Isoforms Use the supplied reference annotation (GFF) to guide RABT assembly. -g/--GTF-guide geneset Output will include all reference transcripts as well as any novel genes and isoforms that are assembled. Improving Ignore all reads that could have come from transcripts in this GTF file. We -M/--mask-file mask geneset accuracy recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. fr-unstranded fr-unstranded : Standard Illumina --library-type fr-firststrand fr-firststrand : dUTP, NSR, NNSR / fr-secondstrand fr-secondstrand : Ligation, Standard Solid
  • 84.
    RUN $ cd /KOGO/RNA-seq/outputs $cufflinks -o S01 -p 1 --library-type fr-unstranded -g ../ref/ens.gtf -M ../ref/mask.gtf S01/accepted_hits.bam Category Option Value Output -o/--output-dir /KOGO/RNA-seq/outputs/S01 Thread -p/--num-threads 1 Guide Geneset -g/--GTF-guide /KOGO/RNA-seq/ref/ens.gtf Mask Geneset -M/--mask-file /KOGO/RNA-seq/ref/mask.gtf Library Type --library-type fr-unstranded
  • 85.
    ALGORITHM Trapnell C, WilliamsBA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.
  • 86.
    CUFFLINKS EXPRESSION • FPKM • Fragments Per Kilobase of exon per Million fragments mapped • analogous to single-read “RPKM” • Isoform expression estimation • maximum likelihood estimation • Normalization • by total number of mapped reads • by upper quantile method
  • 87.
    OUTPUT File Description transcripts.gtf The GTF file contains Cufflinks ‘ assembled isoforms The estimated isoform-level expression values in the isoforms.fpkm_tracking generic FPKM Tracking Format. The estimated gene-level expression values in the genes.fpkm_tracking generic FPKM Tracking Format.
  • 88.
    TRANSCRIPTS.GTF Col. Name Example Description 1 seqname chrX Chromosome or contig name 2 source Cufflinks The name of the program that generated this file (always 'Cufflinks') 3 feature exon The type of record (always either "transcript" or "exon". 4 start 77696957 The leftmost coordinate of this record (where 1 is the leftmost possible coordinate) 5 end 77712009 The rightmost coordinate of this record, inclusive. The most abundant isoform for each gene is assigned a score of 1000. Minor 6 score 1000 isoforms are scored by the ratio (minor FPKM/major FPKM) 7 strand + Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." Cufflinks does not predict where the start and stop codons (if any) are located 7 frame . within each transcript, so this field is not used. 8 attributes ... See below.
  • 89.
    TRANSCRIPTS.GTF Attribute Example Description gene_id CUFF.1 Cufflinks gene id transcript_id CUFF.1.1 Cufflinks transcript id Isoform-level relative abundance in Fragments Per Kilobase of exon model per FPKM 101.267 Million mapped fragments frac 0.7647 Reserved. Please ignore, as this attribute may be deprecated in the future Lower bound of the 95% confidence interval of the abundance of this isoform, as a conf_lo 0.07 fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo) Upper bound of the 95% confidence interval of the abundance of this isoform, as a conf_hi 0.1102 fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_lo) cov 100.765 Estimate for the absolute depth of read coverage across the whole transcript When RABT assembly is used, this attribute reports whether or not all introns and full_read_support yes internal exons were fully covered by reads from the data.
  • 90.
    FPKM TRACKING FILES Col. name Example Description 1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript) The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't 2 class_code = present 3 nearest_ref_id NM_008866.1 The reference transcript to which the class code refers, if any 4 gene_id NM_008866 The gene_id(s) associated with the object 5 gene_short_name Lypla1 The gene_short_name(s) associated with the object The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if 6 tss_id TSS1 tss_id isn't present 7 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the object 8 length 2447 The number of base pairs in the transcript, or '-' if not a transcript/primary transcript 9 coverage 43.4279 Estimate for the absolute depth of read coverage across the object 10 FPKM 8.01089 FPKM of the object in sample 11 FPKM_lo 7.03583 the lower bound of the 95% confidence interval on the FPKM of the object in sample 12 FPKM_hi 8.98595 the upper bound of the 95% confidence interval on the FPKM of the object in sample OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), 13 status OK HIDATA (too many fragments in locus), or FAIL
  • 91.
    SIMPLE STATISTICS
    $ cd /KOGO/RNA-seq/outputs/S01
    # Check the highest expressed genes
    $ sort -r -g -k 10 genes.fpkm_tracking | head -n 30
    # Select the FPKM column
    $ cut -f 1,10 genes.fpkm_tracking > gene_fpkm_s
    $ R
    > data <- read.table("gene_fpkm_s", header=TRUE)
    > fpkm_s <- as.numeric(data[,2])
    >
    > mean(fpkm_s)
    > sd(fpkm_s)
    >
    > fpkm_s.log10 <- log(fpkm_s+1,10)
    > bin_seq = seq(min(fpkm_s.log10-0.1),max(fpkm_s.log10+0.1),by=0.1)
    > hist(fpkm_s.log10, breaks=bin_seq, xlab='log10(x+1)', ylab='Number of genes', axes=TRUE)
    >
    > boxplot(fpkm_s.log10)
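The same summary (mean, standard deviation, and a histogram of log10(FPKM + 1) in bins of width 0.1) can be sketched without R. This is only an illustrative equivalent using the Python standard library; the four FPKM values are made up, and the function names are hypothetical.

```python
import math
import statistics
from collections import Counter

def log10_plus1(values):
    """log10(x + 1), so that an FPKM of 0 maps to 0 instead of -infinity."""
    return [math.log10(v + 1.0) for v in values]

def histogram(values, bins_per_unit=10):
    """Count values per bin; bin k covers [k/bins_per_unit, (k+1)/bins_per_unit)."""
    counts = Counter(math.floor(v * bins_per_unit) for v in values)
    return dict(sorted(counts.items()))

# Invented FPKM values spanning several orders of magnitude.
fpkm_s = [0.0, 8.01089, 50.0, 500.0]

mean_fpkm = statistics.mean(fpkm_s)
sd_fpkm = statistics.stdev(fpkm_s)   # sample standard deviation, like R's sd()
bins = histogram(log10_plus1(fpkm_s))

print("mean:", mean_fpkm, "sd:", sd_fpkm)
for k, n in bins.items():
    print(f"[{k / 10:.1f}, {(k + 1) / 10:.1f}): {n}")
```

The log transform is what makes the histogram readable: on the raw scale a few highly expressed genes would push almost every other gene into the first bin.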
  • 92.
    SUMMARY • Expression Level • Normalization • RPKM (FPKM) • Length Bias • Cufflinks • Isoforms • maximum likelihood estimation
  • 93.
    CUFFMERGE
    [Workflow diagram: Read, Filtering, Mapping, Duplication, Expression Level, Gene Structure, DEG analysis, Annotation, Report; tools shown: FastQC, RSeQC]
  • 94.
    CUFFMERGE
    • Use to merge together several Cufflinks assemblies
    • Automatically filters a number of transfrags that are probably artifacts
    • The main purpose of this script is to make it easier to make an assembly GTF file suitable for use with Cuffdiff
    Trapnell C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016.
  • 95.
    USAGE
    $ cuffmerge [options] <assembly_GTF_list.txt>
    Option              Value                   Description
    -o                  <output_dir>            Directory where the merged assembly will be written. The default is ./merged_asm.
    -g/--ref-gtf        geneset                 An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
    -p/--num-threads    <int>                   Use this many threads to merge assemblies. The default is 1.
    -s/--ref-sequence   <seq_dir>/<seq_fasta>   This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present. The merge script will pass this option to cuffcompare, which will use the sequences to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.
  • 96.
    RUN
    $ cd /KOGO/RNA-seq/outputs
    $ find ./ -iname transcripts.gtf > gtf_list.txt
    $ cuffmerge -p 1 -g ../ref/ens.gtf -s ../ref/chr.fa gtf_list.txt
    Category       Option             Value
    Output prefix  -o                 /KOGO/RNA-seq/outputs (not passed in the command above, so cuffmerge writes to its default, ./merged_asm)
    Geneset        -g/--ref-gtf       /KOGO/RNA-seq/ref/ens.gtf
    Thread         -p/--num-threads   1
    Reference      -s/--ref-sequence  /KOGO/RNA-seq/ref/chr.fa
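The assembly list passed to cuffmerge is just a text file with one GTF path per line. As a rough equivalent of the `find` step, the sketch below collects every file named transcripts.gtf (case-insensitively) under a root directory and writes the list. The function name `write_gtf_list` is hypothetical, and the demonstration runs in a temporary directory with empty placeholder files rather than real Cufflinks output.

```python
import tempfile
from pathlib import Path

def write_gtf_list(root, list_name="gtf_list.txt"):
    """Write one path per line for every transcripts.gtf found under root."""
    root = Path(root)
    paths = sorted(p for p in root.rglob("*")
                   if p.is_file() and p.name.lower() == "transcripts.gtf")
    (root / list_name).write_text("".join(f"{p}\n" for p in paths))
    return paths

# Demonstration on a throwaway directory tree with two sample folders.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("S01", "S02"):
        sample_dir = Path(tmp) / sample
        sample_dir.mkdir()
        (sample_dir / "transcripts.gtf").write_text("")
    found = write_gtf_list(tmp)
    listed = (Path(tmp) / "gtf_list.txt").read_text().splitlines()
    relative = [p.relative_to(tmp).as_posix() for p in found]

print(relative)
```

Sorting the paths keeps the list stable between runs, which makes it easier to compare merged assemblies produced at different times.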
  • 97.
    RUN
    $ cd /KOGO/RNA-seq/outputs/merged_asm
    $ less transcripts.gtf
    $ less merged.gtf
    $ gffread -g /KOGO/ref/chr.fa -w transcripts.fa transcripts.gtf
    $ head transcripts.fa
    >CUFF.11.1 gene=CUFF.11
    GTGCATGTAACCCAAGAAGGGTTTGGCTGGGGGCTGTGGCAGCGCCAGAGTTCTGTTCGAATCCCAATTG
    GGTTCTGGTCACAGATTTGGCATGGAGCAGAAGAGAGATACAGCATGGTTGAAAAGCAGTTATTGGCTAC
    $ grep '>' transcripts.fa | head -n 30
    >CUFF.2.1 gene=CUFF.2
    >CUFF.11.1 gene=CUFF.11
    >ENSGALT00000015891 gene=CUFF.11
    >CUFF.12.1 gene=CUFF.12
  • 98.
    DEG ANALYSIS
    [Workflow diagram: Read, Filtering, Mapping, Duplication, Expression Level, Gene Structure, DEG analysis, Annotation, Report; tools shown: FastQC, RSeQC]
  • 99.
    DIFFERENTIALLY EXPRESSED GENE
    • A gene whose transcript abundance differs between conditions
    Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25. Epub 2010 Mar 2.
    Zhang et al., Mol Cancer Res June 2006 4; 401
  • 100.
    LENGTH BIAS
    Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009 Apr 16;4:14.
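Length bias is the reason FPKM divides by transcript length: at equal abundance, a longer transcript yields proportionally more fragments. The sketch below uses the standard definition, FPKM = fragments × 10^9 / (transcript length in bp × total mapped fragments). The two transcripts and the library size are invented numbers chosen so that raw counts differ fourfold while FPKM is identical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of exon model per Million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Invented example: two transcripts at the same abundance, one four times longer.
TOTAL_MAPPED = 20_000_000
short_tx = {"length": 1000, "fragments": 100}
long_tx = {"length": 4000, "fragments": 400}

fpkm_short = fpkm(short_tx["fragments"], short_tx["length"], TOTAL_MAPPED)
fpkm_long = fpkm(long_tx["fragments"], long_tx["length"], TOTAL_MAPPED)

print("raw fragment ratio (long/short):", long_tx["fragments"] / short_tx["fragments"])
print("FPKM short:", fpkm_short, "FPKM long:", fpkm_long)
```

The normalization corrects the expression estimate, but the longer transcript still contributes more fragments to any count-based test, so it retains more statistical power; that residual effect is the length bias discussed in the cited paper.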
  • 101.
    BIAS
    Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
  • 102.
    REPLICATES
    More variance, More useful
             Technical Replicates                                                  Biological Replicates
    Source   Same samples                                                          Different samples
    Purpose  The reproducibility of the results under the same conditions          A quantity from different sources
    Issue    The differences are based only on technical issues in the measurement What is similar in your replicates and how they are different from a different set of conditions
    Taylor S, Wakem M, Dijkman G, Alsarraj M, Nguyen M. A practical approach to RT-qPCR-Publishing data that conform to the MIQE guidelines. Methods. 2010 Apr;50(4):S1-5. doi: 10.1016/j.ymeth.2010.01.005.
    http://wiki.answers.com/Q/What_is_defference_between_Biological_replicates_and_technical_replicates
  • 103.
    DEG METHODS
                 Cuffdiff             DEGseq           DESeq
    Model        -                    Poisson          Negative binomial
    Level        Isoform              Gene             Gene
    Input        geneset, BAM files   Raw Read Count   Raw Read Count
    Replicates   Technical            Technical        Biological
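The Poisson and negative binomial rows differ in how much variance they allow at a given mean count. Under a Poisson model the variance equals the mean; under the negative binomial parameterization commonly used for count data it is mean + dispersion × mean². The sketch below compares the two; the dispersion value 0.1 is an arbitrary illustration, not an estimate from any data set.

```python
def poisson_variance(mean):
    """Poisson model: variance equals the mean."""
    return mean

def negative_binomial_variance(mean, dispersion):
    """Negative binomial model: variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2

DISPERSION = 0.1  # illustrative value only

table = []
for mean in (10, 100, 1000):
    table.append((mean,
                  poisson_variance(mean),
                  negative_binomial_variance(mean, DISPERSION)))

for mean, var_pois, var_nb in table:
    print(f"mean={mean:>5}  Poisson var={var_pois:>7}  NB var={var_nb:>10.1f}")
```

The gap widens with expression level, which is why a Poisson model suited to technical replicates tends to be too optimistic when the replicates are biological.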
  • 104.
    CUFFDIFF
    • Use to find significant changes in transcript expression, splicing, and promoter use.
    Usage)
    $ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>
    Option               Value                       Description
    -o / --output-dir    <string>                    Sets the name of the directory in which Cuffdiff will write all of its output. The default is "./".
    -L / --labels        <label1,label2,...,labelN>  Specify a label for each sample, which will be included in various output files produced by Cuffdiff.
    -p / --num-threads   <int>                       Number of threads used during quantification. The default is 1.
  • 105.
    RUN
    $ cd /KOGO/RNA-seq/outputs
    $ cuffdiff -o Diff-S01-S02 -L S01,S02 -p 1 merged_asm/merged.gtf S01/accepted_hits.bam S02/accepted_hits.bam
    Category   Option               Value
    Output     -o / --output-dir    /KOGO/RNA-seq/outputs/Diff-S01-S02
    Label      -L / --labels        S01,S02
    Thread     -p / --num-threads   1
  • 106.
    OUTPUT
    Type: Genes
      Files: genes.fpkm_tracking, genes.count_tracking, genes.read_group_tracking
      Description: Gene [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each gene_id
    Type: Isoforms
      Files: isoforms.fpkm_tracking, isoforms.count_tracking, isoforms.read_group_tracking
      Description: Transcript [FPKMs, counts, read group tracking]
    Type: CDS
      Files: cds.fpkm_tracking, cds.count_tracking, cds.read_group_tracking
      Description: Coding sequence [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each p_id, independent of tss_id
    Type: Primary Transcripts
      Files: tss_groups.fpkm_tracking, tss_groups.count_tracking, tss_groups.read_group_tracking
      Description: Primary transcript [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each tss_id
  • 107.
    FPKM TRACKING FILES
    $ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
    $ cut -f 1,3,4,5,9,10,13,14,17 genes.fpkm_tracking | head
    Col.  Column name      Example         Description
    1     tracking_id      TCONS_00000001  A unique identifier describing the object (gene, transcript, CDS, primary transcript)
    3     nearest_ref_id   NM_008866.1     The reference transcript to which the class code refers, if any
    4     gene_id          NM_008866       The gene_id(s) associated with the object
    5     gene_short_name  Lypla1          The gene_short_name(s) associated with the object
    9     coverage         43.4279         Estimate for the absolute depth of read coverage across the object
    10    q0_FPKM          8.01089         FPKM of the object in sample 0
    13    q0_status        OK              OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
    14    q1_FPKM          8.55155         FPKM of the object in sample 1
    17    q1_status        OK              OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
  • 108.
    OUTPUT
    Type                 File                Description
    Genes                gene_exp.diff       Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
    Isoforms             isoform_exp.diff    Transcript differential FPKM.
    CDS                  cds_exp.diff        Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
    Primary Transcripts  tss_group_exp.diff  Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
    Splicing             splicing.diff       How much differential splicing exists between isoforms processed from a single primary transcript
    CDS                  cds.diff            The amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples
    Promoter             promoter.diff       The amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples
  • 109.
    GENE_EXP.DIFF
    $ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
    $ cut -f 1,2,7,8,9,10,11,12,13,14 gene_exp.diff | head
    Col.  Name               Example      Description
    1     Tested id          XLOC_000001  A unique identifier
    2     gene               Lypla1       The gene_name(s) or gene_id(s) being tested
    6     Test status        NOTEST       OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
    7     FPKMx              8.01089      FPKM of the gene in sample x
    8     FPKMy              8.551545     FPKM of the gene in sample y
    9     log2(FPKMy/FPKMx)  0.06531      The (base 2) log of the fold change y/x
    10    test stat          0.860902     The value of the test statistic used to compute significance of the observed change in FPKM
    11    p value            0.389292     The uncorrected p-value of the test statistic
    12    q value            0.985216     The FDR-adjusted p-value of the test statistic
    13    significant        no           Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
    Note: the column numbers in this table follow the manual. Files written by Cuffdiff 2.x carry a separate gene_id column after the first column, so test status through significant sit in columns 7 to 14, which is the numbering the cut command above uses.
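The q value column is the p-value after Benjamini-Hochberg adjustment. The sketch below implements the textbook step-up procedure so the relationship between the two columns can be reproduced by hand; it is not taken from Cuffdiff's source, and the function name, the four p-values and the 0.05 FDR threshold are all illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values

# Invented p-values for four tested genes.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)

FDR = 0.05  # illustrative threshold
significant = ["yes" if q <= FDR else "no" for q in q_values]

for p, q, flag in zip(p_values, q_values, significant):
    print(f"p={p:<6} q={q:.3f} significant={flag}")
```

Because the adjustment depends on the number of tests, the same p-value can be significant in a small gene set and not in a genome-wide run.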
  • 110.
    SIMPLE STATISTICS
    $ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
    $ gnuplot
    gnuplot> set grid
    gnuplot> set zeroaxis lt -1
    gnuplot> set xlabel 'log(FPKMs of S01)'
    gnuplot> set ylabel 'log(FPKMs of S02)'
    gnuplot> plot 'genes.fpkm_tracking' u (log($10)):(log($14)) w points notitle, x notitle
    gnuplot> exit
    $ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
    # Split on the "significant" column (column 14) only; a plain grep for "no" would also match NOTEST and gene names
    $ awk -F'\t' '$14 == "yes"' gene_exp.diff > gene_exp.diff.yes
    $ less gene_exp.diff.yes
    $ awk -F'\t' '$14 == "no"' gene_exp.diff > gene_exp.diff.no
    $ gnuplot
    gnuplot> set grid
    gnuplot> set zeroaxis lt 2
    gnuplot> set xlabel 'log2foldchange'
    gnuplot> set ylabel '-log(p-value)'
    gnuplot> plot 'gene_exp.diff.no' u 10:(-log($12)) lt 0 notitle, 'gene_exp.diff.yes' u 10:(-log($12)) lt 1 pt 6 ps 2 t 'DE'
    gnuplot> exit
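Splitting gene_exp.diff into significant and non-significant genes is safest when the match is restricted to the `significant` column, because strings such as NOTEST or a gene name can also contain "no". The sketch below does the split by column name and computes the volcano-plot coordinates (log2 fold change, -ln p). It assumes the 14-column header written by Cuffdiff 2.x; the first data row reuses the example values from the slide and the other two rows are invented.

```python
import csv
import io
import math

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]
ROWS = [
    ["XLOC_000001", "XLOC_000001", "Lypla1", "chr1:4797771-4835363", "S01", "S02",
     "NOTEST", "8.01089", "8.551545", "0.06531", "0.860902", "0.389292",
     "0.985216", "no"],
    # Invented rows: a significant gene whose name contains "no", and one more "no".
    ["XLOC_000002", "XLOC_000002", "Nono", "chrX:1-100", "S01", "S02",
     "OK", "5.0", "40.0", "3.0", "4.2", "0.00005", "0.001", "yes"],
    ["XLOC_000003", "XLOC_000003", "GeneC", "chr2:1-100", "S01", "S02",
     "OK", "10.0", "11.0", "0.1375", "0.3", "0.7", "0.99", "no"],
]
SAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

def split_by_significance(handle):
    """Return (yes_rows, no_rows) using only the 'significant' column."""
    yes_rows, no_rows = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        (yes_rows if row["significant"] == "yes" else no_rows).append(row)
    return yes_rows, no_rows

def volcano_point(row):
    """(log2 fold change, -ln p); a p-value of 0 maps to infinity."""
    p = float(row["p_value"])
    return float(row["log2(fold_change)"]), (math.inf if p <= 0 else -math.log(p))

yes_rows, no_rows = split_by_significance(io.StringIO(SAMPLE))
points = [volcano_point(r) for r in yes_rows]
print([r["gene"] for r in yes_rows], points)
```

A substring search for "no" would have matched all three rows here, including the significant one.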
  • 111.
    DESEQ
    • Differential gene expression analysis based on the negative binomial distribution
    • R
    • raw count
    • biological replicates
    • http://bioconductor.org/packages/release/bioc/html/DESeq.html
    http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf
  • 112.
    HTSEQ-COUNT
    • Counts how many reads map to each feature
    • A read may be left uncounted for any feature for one of these reasons:
      • no_feature: reads which could not be assigned to any feature
      • ambiguous: reads which could have been assigned to more than one feature and hence were not counted for any of these
      • too_low_aQual: reads which were not counted due to the -a option
      • not_aligned: reads in the SAM file without alignment
      • alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag.
    • If you have paired-end data, you have to sort the SAM file by read name first
    http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
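The no_feature and ambiguous categories follow from how a read's overlapping features are combined. The toy sketch below mimics the "union" overlap mode on integer positions: it takes the union of the feature sets at every position the read covers, counts the read if exactly one feature remains, and otherwise files it under no_feature or ambiguous. It is a simplified illustration, not HTSeq code; the two genes, their coordinates and the four reads are invented.

```python
from collections import Counter

def build_position_index(features):
    """Map each covered position to the set of feature names overlapping it."""
    index = {}
    for name, (start, end) in features.items():
        for pos in range(start, end + 1):      # inclusive coordinates
            index.setdefault(pos, set()).add(name)
    return index

def assign_read_union(read, index):
    """Union mode: combine the feature sets of all positions the read covers."""
    start, end = read
    hit = set()
    for pos in range(start, end + 1):
        hit |= index.get(pos, set())
    if not hit:
        return "no_feature"
    if len(hit) > 1:
        return "ambiguous"
    return next(iter(hit))

# Invented features and reads; geneA and geneB overlap at positions 8-10.
features = {"geneA": (1, 10), "geneB": (8, 15)}
reads = [(2, 5), (7, 9), (20, 22), (11, 13)]

index = build_position_index(features)
counts = Counter(assign_read_union(read, index) for read in reads)
print(dict(counts))
```

The second read touches both genes, so it is set aside as ambiguous rather than being credited to either one.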
  • 113.
    HTSEQ-COUNT
    If you have paired-end data, you have to sort the SAM file by read name first
    Usage)
    $ htseq-count [options] <sam_file> <gff_file>    (the GFF file can be an Ensembl GTF)
    Options)
    -m [union,intersection-strict,intersection-nonempty]
    -s, --stranded=<yes, no, or reverse>    whether the data is from a strand-specific assay (default: yes)
    Run)
    $ cd /KOGO/RNA-seq/outputs/S01
    $ samtools sort -n accepted_hits.bam accepted_hits.nameSorted
    $ samtools view accepted_hits.nameSorted.bam | htseq-count -m union -s no - ../merged_asm/merged.gtf > accepted_hits.count
    $ less accepted_hits.count
    # ..... repeat for S02, S03, S04
  • 114.
    RUN
    Run)
    $ cd /KOGO/RNA-seq/outputs
    $ mkdir DESeq
    $ TBI-toolkit-make_matrix S01/hits.count 2 S02/hits.count 2 S03/hits.count 2 S04/hits.count 2 > DESeq/hits.mtx
    $ cd DESeq
    $ less hits.mtx
    $ cp /KOGO/RNA-seq/scripts/DESeq.4samples.R .
    $ R CMD BATCH DESeq.4samples.R
    # Use the DESeq.2samples script for 2 samples
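TBI-toolkit-make_matrix is an in-house tool whose behavior is not documented here, so the following is only a guess at the kind of output DESeq needs: one row per gene and one raw-count column per sample. The sketch merges per-sample count tables (gene, tab, count), drops the special counters that htseq-count appends, and fills missing genes with 0. All function names and count values are invented for illustration.

```python
# Special counters reported by htseq-count; newer versions prefix them with "__".
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}

def read_counts(text):
    """Parse 'gene<TAB>count' lines into a dict, skipping special counters."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t")
        if name.lstrip("_") in SPECIAL:
            continue
        counts[name] = int(value)
    return counts

def build_matrix(samples):
    """Return a tab-separated matrix: header row, then one row per gene."""
    per_sample = {name: read_counts(text) for name, text in samples.items()}
    genes = sorted(set().union(*per_sample.values()))
    lines = ["gene\t" + "\t".join(per_sample)]
    for gene in genes:
        row = [str(per_sample[name].get(gene, 0)) for name in per_sample]
        lines.append(gene + "\t" + "\t".join(row))
    return "\n".join(lines) + "\n"

# Invented count tables for two samples.
samples = {
    "S01": "geneA\t10\ngeneB\t0\nno_feature\t5\nambiguous\t1\n",
    "S02": "geneA\t12\ngeneC\t7\n__no_feature\t3\n",
}
matrix = build_matrix(samples)
print(matrix, end="")
```

Keeping raw integer counts, rather than FPKM, matters here because DESeq models the counts directly.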
  • 115.
    DEG METHODS
    • Cuffdiff, baySeq, DESeq, edgeR and NOISeq generated consistent results
    • edgeR identified more differentially expressed genes than the other methods at the same cutoff, which might imply less control of type I error with this method
    Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012 Nov 1;40(20):10084-97.
  • 116.
    SUMMARY • DEG • Replicate • Technical replicates • Biological replicates • Cuffdiff • HTSeq-count • DESeq
  • 117.
    REPORT (CUMMERBUND)
    [Workflow diagram: Read, Filtering, Mapping, Duplication, Expression Level, Gene Structure, DEG analysis, Annotation, Report; tools shown: FastQC, RSeQC]
  • 118.
    CUMMERBUND
    • An R package that is designed to aid and simplify the task of analyzing Cufflinks RNA-Seq output
    • R
    • Using SQLite
    • cuffData.db
  • 119.
  • 120.
    RUN
    Run)
    $ cd /KOGO/outputs/Diff-S01-S02
    $ R
    > library(cummeRbund)
    > cuff <- readCufflinks()
    > cuff
    # Global statistics and Quality Control
    > disp<-dispersionPlot(genes(cuff))
    > disp
    # Density
    > dens<-csDensity(genes(cuff))
    > dens
    # Boxplot
    > b<-csBoxplot(genes(cuff))
    > b
    # Volcano
    > v<-csVolcanoMatrix(genes(cuff))
    > v
    > v<-csVolcano(genes(cuff),"S01","S02")
    # Pairwise Scatterplots
    > s<-csScatter(genes(cuff),"S01","S02",smooth=T)
    > s
    # Geneset level plots
    > data(sampleData)
    > myGeneIds <- sampleIDs
    > myGenes <- getGenes(cuff,myGeneIds)
    > h<-csHeatmap(myGenes,cluster='both')
    > h
    # Barplot
    > b <- expressionBarplot(myGenes)
    > b
    # Cluster
    > ic <-csCluster(myGenes,k=4)
    > icp <- csClusterPlot(ic)
    > icp
  • 121.
  • 122.
    VIEWER
    • IGV
    • Integrative Genomics Viewer
    • http://www.broadinstitute.org/igv/
    Run) Generate BAM index
    $ cd /KOGO/RNA-seq/outputs/S01
    $ samtools index accepted_hits.bam
    $ ls
    accepted_hits.bam.bai
  • 123.
    GO ENRICHMENT
    • GO annotation
      • Using SwissProt
      • Blastx
      • Blast2Go
      • InterProScan
    • GO Enrichment
      • GOseq
      • Fisher's exact test
      • DAVID
  • 124.
    CONCLUSION
    • (m)RNA-seq analysis
    • Reference-based method
    • NGS data analysis
    • RNA-seq vs. DNA-seq
    • Filtering
    • Low Quality
    • PCR Duplication
    • Mapping
    • RNA mapper
    • Gene Expression
    • Normalization
    • DEG Analysis
    • RPKM
    • Replicates
  • 125.