RNA-seq: Mapping and quality control - part 3

Third part of 'RNA-seq for DE analysis'. See http://www.bits.vib.be for more details.

Published in: Education, Technology

  1. 1. Mapping to assign reads to genes. Joachim Jacob, 20 and 27 January 2014. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  2. 2. Goal: assign reads to genes. The result of the mapping is used to construct a summary of the counts, the count table, e.g. GeneA: 12, GeneB: 5.
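To make the idea of a count table concrete, here is a minimal Python sketch. It assumes the mapping step has already produced (read, gene) pairs; the function name `build_count_table` and the example assignments are hypothetical and only serve as illustration.

```python
from collections import Counter

def build_count_table(read_to_gene):
    """Count how many reads were assigned to each gene.

    read_to_gene: iterable of (read_id, gene_id) pairs; gene_id is None
    for reads that could not be assigned to a gene.
    """
    counts = Counter()
    for _read_id, gene_id in read_to_gene:
        if gene_id is not None:
            counts[gene_id] += 1
    return dict(counts)

# Hypothetical assignments, for illustration only.
assignments = [
    ("read1", "GeneA"),
    ("read2", "GeneB"),
    ("read3", "GeneA"),
    ("read4", None),  # unassigned read, not counted
]
count_table = build_count_table(assignments)
for gene, n in sorted(count_table.items()):
    print(f"{gene}: {n}")
```

In practice a dedicated counting tool produces this table from the alignment file and the annotation; the sketch only shows what the table contains.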
  3. 3. 2 scenarios: (1) Reference genome sequence available. (2) NO reference genome sequence available: ● De novo assembly of the reads (Trinity), i.e. transcriptome construction ● Map the reads to the assembly (RSEM mapper) ● Extract the count table (note: no removal of polyA is required. Computationally expensive!)
  4. 4. Reference genome sequence available. Preprocessed reads are mapped to the reference sequence, but will not match it perfectly: 1. The reference is a haplotype, whereas the sample contains a mixture of alleles, which leads to mismatches. 35 → for 2 alleles together. If we compare samples within the same specimen, this effect is similar for all samples. 2. Reads contain sequencing errors. 3. Reads are derived from mRNA, whereas the genome is DNA.
  5. 5. mRNA reads: some reads span introns ● Reads are derived from mRNA. (Figure: an mRNA, one isoform, with exons and introns labelled.) Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions. http://www.ensembl.org
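In SAM/BAM output, a gap that spans an intron is written as an `N` operation in the CIGAR string. The following is a minimal sketch of how junction coordinates could be recovered from one alignment, assuming you already have the POS and CIGAR fields; the helper name `splice_junctions` and the example values are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive.

    pos: 1-based leftmost mapping position (SAM POS field).
    cigar: SAM CIGAR string, e.g. "50M1000N50M".
    """
    junctions = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region on the reference: the intron
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return junctions

# Hypothetical read: 50 bases in one exon, a 1000 bp intron, 50 bases in the next exon.
print(splice_junctions(100, "50M1000N50M"))  # → [(150, 1149)]
```

A read without an `N` operation, such as `100M`, yields an empty list: it lies within a single exon.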
  6. 6. mRNA reads: multiple isoforms exist ● Isoforms are transcribed at different levels, contributing differently to the number of reads. http://www.ensembl.org
  7. 7. Algorithm: gapped read mapping ● Exon-first approach: TopHat (popular). A junction database is constructed to try to map the reads that remained unmapped. Reference: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, Vol. 25 no. 9 2009, pages 1105–1111, doi:10.1093/bioinformatics/btp120
  8. 8. Principle of gapped read mapping ● STAR: fast and suited for longer reads. Reference: STAR: ultrafast universal RNA-seq aligner, Alexander Dobin et al., Bioinformatics.
  9. 9. Checklist for mapping to a reference genome: 1. A reference genome sequence (fasta), to be indexed by the alignment software. 2. A genome annotation file (GFF3 or GTF) with the currently known annotations (optional, but highly recommended). 3. The cleaned (preprocessed) reads (fastq).
  10. 10. Getting your reference genome sequence ● Genomes to be used by TopHat can be fetched from iGenomes, and for STAR here. ● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and follow the instructions of your indexing software. ● If still no luck, try a specialized species website, e.g.
  11. 11. Indexing a genome ● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is preprocessed by indexing, which takes a lot of time but has to be done only once. ● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
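The trade-off between slow indexing and fast mapping can be illustrated with a toy k-mer index. This is only a sketch of the principle: real aligners use far more compact structures (Bowtie, which TopHat relies on, uses an FM-index; STAR uses suffix arrays) and tolerate mismatches and gaps, which this exact-match example does not. All names and sequences below are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Slow step, done once: record every position of every k-mer."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k):
    """Fast step, done per read: look up the first k-mer, then verify the rest.

    Returns 0-based start positions of exact matches only.
    """
    hits = []
    for start in index.get(read[:k], []):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits

reference = "ACGTACGTTAGCACGTTAGG"  # hypothetical toy reference
index = build_kmer_index(reference, k=4)
print(map_read("ACGTTAG", reference, index, k=4))  # → [4, 12]
```

The example read matches at two positions, which makes it a toy version of a read that maps to multiple locations.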
  12. 12. Using genome annotation information ● Annotation info is stored in text files formatted as GTF or GFF3. ● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants,... Cufflinks (http://cufflinks.cbcb.umd.edu/), for example, reconstructs the annotation from an alignment and generates a GFF file for further use. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL. ● We will use a GTF file from a respected genome database to assist the mapping of reads.
  13. 13. Using genome annotation information
  14. 14. GTF example
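A GTF file is tab-separated text with nine columns per feature, the last of which holds `key "value";` attributes such as `gene_id` and `transcript_id`. The sketch below parses one line into a dictionary; the example line is made up for illustration, and `parse_gtf_line` is a hypothetical helper, not part of any tool mentioned in these slides.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]
# Matches quoted attributes such as: gene_id "GeneA";
# Unquoted values (some files write numbers without quotes) are skipped.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_line(line):
    """Split one GTF feature line into a dict with typed coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    record["attributes"] = dict(ATTRIBUTE.findall(record["attributes"]))
    return record

# Hypothetical exon line, for illustration only.
example = "\t".join([
    "chr1", "example_source", "exon", "1000", "1200", ".", "+", ".",
    'gene_id "GeneA"; transcript_id "GeneA.1";',
])
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"],
      record["attributes"]["gene_id"])
```

The `gene_id` attribute is what later links an aligned read to a row of the count table.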
  15. 15. Mapping in Galaxy: Basic settings
  16. 16. Mapping in Galaxy
  17. 17. Mapping in Galaxy
  18. 18. Mapping QC ● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome. ● Set the mapping fairly liberal: map as much as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads'). ● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
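One way to separate uniquely mapped reads from multi reads is the optional `NH` tag of the SAM format, which holds the number of reported alignments for the read. The sketch below works on plain SAM text lines and assumes the mapper writes this tag; check the output of your own mapper before relying on it. The function name and the example lines are hypothetical.

```python
def is_uniquely_mapped(sam_line):
    """Return True if the alignment line is mapped and carries NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # FLAG bit 0x4: read is unmapped
        return False
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    # No NH tag: we cannot tell, so this sketch conservatively says no.
    return False

# Hypothetical SAM lines (11 mandatory fields plus the NH tag).
unique = "read1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1"
multi = "read2\t0\tchr1\t500\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3"
unmapped = "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"

kept = [line.split("\t")[0] for line in (unique, multi, unmapped)
        if is_uniquely_mapped(line)]
print(kept)  # → ['read1']
```

Mapping quality (column 5) is an alternative filter, but its scale differs between mappers, so a threshold has to be chosen per tool.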
  19. 19. Mapping QC The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV). Check whether this visualization matches what you expect for: - paired end - splice junctions - strandedness - ... Let's visualize
  20. 20. Practical tips ● Position on the reference genome sequence. ● Add the GTF to the visualization. ● These are the reads, in 2 colours because of the sense and antisense strand (obviously this library was not stranded!). ● Some reads span an intron.
  21. 21. Mapping QC - RSeQC After checking the mapping visually, determine more metrics with RSeQC. http://rseqc.sourceforge.net/
  22. 22. Mapping QC - RSeQC Duplication rate observed in the RNA-seq data. http://rseqc.sourceforge.net/
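As a rough illustration of what a mapping-based duplication rate measures, the sketch below treats reads that align to the same chromosome, position and strand as copies of each other. It is a simplified assumption for teaching purposes and not a description of how RSeQC computes its plot; the function name and the example alignments are hypothetical.

```python
from collections import Counter

def duplication_rate(alignments):
    """Fraction of reads that are extra copies of an earlier alignment.

    alignments: iterable of (chrom, pos, strand) tuples, one per read.
    """
    occurrences = Counter(alignments)
    total = sum(occurrences.values())
    if total == 0:
        return 0.0
    duplicates = total - len(occurrences)  # every copy beyond the first
    return duplicates / total

# Hypothetical alignments: three reads at one position, one read elsewhere.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
]
print(duplication_rate(reads))  # → 0.5
```

Keep in mind that in RNA-seq highly expressed genes naturally produce many reads at identical positions, so some duplication is expected.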
  23. 23. Mapping QC - RSeQC Read quality of aligned reads http://rseqc.sourceforge.net/
  24. 24. Mapping QC - RSeQC Sequence depth saturation. Q1 → Q4: from low count genes to high count genes. Early flattening points to saturation. http://rseqc.sourceforge.net/
  25. 25. Mapping QC - RSeQC Sequence depth saturation http://rseqc.sourceforge.net/
  26. 26. Mapping QC - RSeQC After checking visually, determine more metrics with RSeQC. http://rseqc.sourceforge.net/
  27. 27. Mapping QC - RSeQC After checking visually, determine more metrics with RSeQC. Deviating! http://rseqc.sourceforge.net/
  28. 28. Mapping QC - BamQC Another useful tool is BamQC of the Qualimap Suite. Be aware however: also useful for DNA-seq! http://qualimap.bioinfo.cipf.es/
  29. 29. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  30. 30. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  31. 31. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  32. 32. Mapping QC: BamQC Fraction of genome sequence not covered http://qualimap.bioinfo.cipf.es/
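The fraction of the genome that is not covered can be understood with a small interval calculation. The sketch below assumes a single reference sequence and alignments given as 0-based, half-open (start, end) intervals; it is a hypothetical illustration, not Qualimap's implementation. For RNA-seq a large uncovered fraction is normal, since reads come only from transcribed regions.

```python
def fraction_not_covered(genome_length, intervals):
    """Fraction of positions on one sequence that no alignment covers.

    intervals: iterable of (start, end), 0-based and half-open.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_end = 0  # rightmost position counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return 1 - covered / genome_length

# Hypothetical alignments on a 100 bp sequence; the first two overlap.
print(fraction_not_covered(100, [(0, 10), (5, 20), (50, 55)]))  # → 0.75
```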
  33. 33. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  34. 34. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  35. 35. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  36. 36. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  37. 37. Mapping QC: BamQC http://qualimap.bioinfo.cipf.es/
  38. 38. Mapping QC: BamQC Some examples to watch out for. http://qualimap.bioinfo.cipf.es/
  39. 39. Mapping QC: BamQC Some examples to watch out for. http://qualimap.bioinfo.cipf.es/
  40. 40. Mapping QC: BamQC Some examples to watch out for. http://qualimap.bioinfo.cipf.es/
  41. 41. Keywords: haplotype, gapped mapping, GTF, duplication, isoforms, strandedness, coverage. Write in your own words what the terms mean.
  42. 42. Exercise → Mapping exercise
  43. 43. Break
