Felicity Newell - NGS mapping, errors and quality control

Dr Felicity Newell, The University of Queensland Diamantina Institute

An important step in next generation sequencing is the alignment (mapping) of the short reads that are generated to a reference genome. Tools designed for mapping are required to efficiently and accurately align each read, and more than 60 applications are currently available for this purpose. In this presentation I will describe some of the approaches to sequence alignment, highlighting popular tools that are used such as BWA, Novoalign and Bowtie.

An important consideration for mapping and downstream sequence analysis is the ability to recognise and deal with common errors and biases that can occur during the process. I will discuss some of the common errors that occur in next generation sequencing and the approaches to quality control that should be applied in order to obtain high quality data.

First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/

Published in: Science

  1. 1. NGS mapping, errors and quality control Felicity Newell Winter School, 2014
  2. 2. Presentation Outline
     • Challenges of NGS mapping
     • Choosing an aligner
     • File formats
     • Errors and bias
     • Quality control
     • Summary
  3. 3. Introduction Assembly Alignment/mapping REFERENCE
  4. 4. Challenges of NGS mapping: Read length and number
     Platform | Read length | Reads produced
     Illumina HiSeq2000 | 100bp | 150 million/lane
     454 GS FLX | 700bp | 700 000/PTP
     Ion Torrent PGM 318 | 200bp | 4 million/chip
     PacBio RSII | 8500bp | 50 000/cell
  5. 5. Challenges of NGS mapping: Error rate (adapted from NGS field guide, Table 3c)
     Platform | Error rate (%)
     454 (all models) | 1
     Illumina (all models) | ~0.1
     Ion Torrent | ~1
     PacBio RSII | ≤15
  6. 6. Challenges of NGS mapping
     Genetic variation
     • Greater than 10 million SNPs in dbSNP
     Repeat regions
     • Greater than 50% of human genome contains repeat regions
  7. 7. NGS aligners
     • Why not BLAST or BLAT?
       - optimized for longer reads
       - too slow
     • Algorithms involve indexing to speed up mapping
     • 2 main categories:
       - Hash table indexing
       - Burrows-Wheeler transform
  8. 8. Hash table indexing
     • Cut reads and reference into small “seeds”
     • Store seeds in a lookup table (hash index)
     • e.g. spaced seeds
     • May hash index reads or the genome
     • Tools: MAQ, SOAP, MOSAIK, Novoalign
     Trapnell & Salzberg (2009) Nature Biotech 27, 455-457
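The seed-and-lookup idea on this slide can be sketched in a few lines of Python. This is an illustration only, assuming contiguous (not spaced) seeds, exact seed matches and an index built over the genome; the function names `build_seed_index` and `candidate_positions` are hypothetical and do not correspond to any of the tools listed. The reference string is the short example sequence used later in the talk.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash index: map every k-mer (seed) of the reference to its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Look up non-overlapping seeds of the read and count votes
    for the read start position each seed hit implies."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            votes[hit - offset] += 1
    return dict(votes)


reference = "AGCTTGTTGGTATGGCCCTGATGGTA"
index = build_seed_index(reference, k=4)
votes = candidate_positions("GGTATGGCCCTG", index, k=4)
best_start = max(votes, key=votes.get)
```

A real aligner would then verify each candidate with a full (gapped) alignment; here the position with the most seed votes is simply taken as the best candidate.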
  9. 9. Burrows-Wheeler Transform
     • sort genome and index (BWT)
     • align read base by base to find positions in the genome
     • Tools: BWA, Bowtie, SOAP2
     Trapnell & Salzberg (2009) Nature Biotech 27, 455-457
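As a rough illustration of "sort genome and index, then align base by base", the sketch below builds a suffix array and BWT naively and runs an exact-match backward search. It is a teaching sketch under simplifying assumptions (exact matching only, quadratic-time construction, no compressed occurrence tables); the names `bwt_index` and `backward_search` are made up for this example and are not the BWA or Bowtie implementation.

```python
def bwt_index(text):
    """Return (suffix array, BWT string) for text, using '$' as the end marker."""
    text += "$"
    # Naive suffix sort: fine for a toy example, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def backward_search(pattern, sa, bwt):
    """Find all exact occurrences of pattern by extending the match
    one base at a time, from the last base of the pattern to the first."""
    # C[c] = number of characters in the text that sort before c
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)

    def occ(c, i):
        """Occurrences of c in bwt[:i] (real tools precompute this)."""
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open interval of matching suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, bwt = bwt_index("AGCTTGTTGGTATGGCCCTGATGGTA")
hits = backward_search("GGTA", sa, bwt)
```

The search never scans the genome itself: each base of the read only narrows an interval in the sorted index, which is why BWT-based tools tend to be fast and memory-efficient.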
  10. 10. Presentation Outline (next section: Choosing an aligner)
  11. 11. List of aligners
      • More than 60 aligners are available
      • Many review papers that compare characteristics and performance
      Fonseca et al. Bioinformatics (2012) 28 (24): 3169-3177.
  12. 12. Choosing an aligner: Compute resources vs sensitivity
      Feature | Hash table index tools | BWT tools
      Speed | Slower | Faster
      Memory | Higher | Lower
      Sensitivity | Higher | Lower
      • Many have parallel modes to allow them to run on compute clusters
  13. 13. Choosing an aligner: Platform and experiment type
      • What mappers support my platform?
        - Illumina, 454, IonTorrent, SOLiD, PacBio
      • Does the mapper support my sequence type?
        - DNA, RNA, Bisulphite
      • What do other people use?
      Aligner | Data | Citations
      BWA | DNA | 224.2
      Bowtie | DNA | 363.42
      Novoalign | DNA | 34.49
      Fonseca et al. Bioinformatics (2012) 28 (24): 3169-3177.
  14. 14. Choosing an aligner: Input data requirements
      • read length?
      • ability to handle paired-end reads?
      • ability to use base quality generated by sequencer?
      [Figure: paired-end reads and INSERT SIZE, http://www.illumina.com]
  15. 15. Choosing an aligner: Input data requirements
      Mapper | Max. read length | PE | Quality score
      BWA | 200bp | Y | Y
      Bowtie | 1000bp | Y | Y
      Novoalign | 300bp | Y | Y
      Fonseca et al. Bioinformatics (2012) 28 (24): 3169-3177.
  16. 16. Choosing an aligner: Variation
      Reference  AGCTTGTTGGTATGGCCCTGATGGTA
      Read          TTGTT--TATGGCACTGAT
      The read differs from the reference by a deletion (indel) and a SNP.
      Polymorphism? Amplification error? Sequencing error?
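To make the example above concrete, the following sketch walks the two aligned rows column by column and reports mismatches and gaps. It assumes the alignment has already been computed and that `-` marks a gap; `classify_columns` is a hypothetical helper, and positions are 0-based offsets into the aligned reference segment (which starts at the fourth base of the slide's reference).

```python
def classify_columns(ref_aln, read_aln):
    """Compare two equal-length aligned rows and return
    (mismatches, deletions, insertions) relative to the reference."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned rows must have the same length")
    mismatches, deletions, insertions = [], [], []
    ref_pos = 0  # position in the ungapped reference segment
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-":
            # Base present in the read only: insertion, consumes no reference.
            insertions.append(ref_pos)
            continue
        if read_base == "-":
            deletions.append(ref_pos)  # reference base missing from the read
        elif ref_base != read_base:
            mismatches.append((ref_pos, ref_base, read_base))
        ref_pos += 1
    return mismatches, deletions, insertions


ref_aln = "TTGTTGGTATGGCCCTGAT"   # reference bases under the read
read_aln = "TTGTT--TATGGCACTGAT"  # read as drawn on the slide
mismatches, deletions, insertions = classify_columns(ref_aln, read_aln)
```

The code can only describe the differences; whether the C to A mismatch is a polymorphism, an amplification error or a sequencing error is exactly the question the slide raises and cannot be decided from a single read.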
  17. 17. Choosing an aligner: Variation
      • Does the mapper allow mismatches (SNPs)?
      • Does the mapper allow gapped alignments (indels)?
      Feature | Tool
      Set number of mismatches | Novoalign, ELAND, SOAP, SOAP2
      No constraint on mismatches (score) | Bowtie, Bowtie2, Mosaik
      Allow gapped alignment | BWA, Bowtie2
      No gapped alignment | Bowtie
  18. 18. Choosing an aligner: Repeats
      • Does the mapper deal with reads that map to more than one region (multi-mappers)?
        - all regions
        - best region
        - random
        - user defined number
        - unique only
  19. 19. Choosing an aligner: RNA-seq
      • need to deal with spliced reads spanning exon-exon boundaries
      • spliced aligners can handle these large gaps
      • use tool such as Bowtie to align continuous reads
      • align spliced reads using another algorithm:
        - known splice junctions
        - de-novo
        - de-novo with optional annotations
      • Tools include: TopHat, GSNAP, STAR
      [Figure: read spanning EXON 1 and EXON 2] Nature Methods 8, 469–477 (2011)
  20. 20. Choosing an aligner: Bisulphite sequencing (DNA methylation)
      [Figure: bisulphite treatment turns T C T C(CH3) G into T T T C(CH3) G; the unmethylated C reads as T, the methylated C is retained]
      • sequence conversions of reads/genomes
      • use aligners such as BWA or Bowtie to align to reference
      • Tools include: Bismark, BSMAP, BSeeker2
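The "sequence conversion" step can be illustrated with a small sketch. It assumes the simplest three-letter strategy (convert every C to T in both the read and the reference before alignment) and considers one strand only; `bisulfite_convert` and `in_silico_c_to_t` are hypothetical names, not functions from Bismark, BSMAP or BSeeker2.

```python
def bisulfite_convert(seq, methylated=()):
    """Simulate bisulphite treatment: unmethylated C reads as T,
    C at a methylated (protected) 0-based position stays C."""
    protected = set(methylated)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


def in_silico_c_to_t(seq):
    """Fully convert a sequence (C to T) so reads and reference share one alphabet."""
    return seq.replace("C", "T")


genomic = "TCTCG"                                   # sequence from the slide figure
read = bisulfite_convert(genomic, methylated=[3])   # C at index 3 carries CH3
read_converted = in_silico_c_to_t(read)
reference_converted = in_silico_c_to_t(genomic)
```

After conversion the read and the reference match exactly, so a standard aligner can place the read; methylation is then called by comparing the unconverted read ("TTTCG") with the original reference ("TCTCG") at each C.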
  21. 21. Choosing an aligner: Recent releases and updates
      • Bowtie2:
        - update to allow gapped alignment
      • BWA-MEM:
        - faster, more accurate
        - longer reads
  22. 22. Choosing an aligner: Recent releases and updates
      • Isaac (alignment + variant annotation):
        - much faster than BWA, but
        - lower sensitivity and specificity
        - useful for HiSeq X Ten data
      • BLASR:
        - PacBio reads
        - able to handle long reads with insertion and deletion errors
  23. 23. Choosing an aligner
      • Default options may not be best
      “… there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.”
      - Hatem et al. BMC Bioinformatics 2013, 14:184
  24. 24. Presentation Outline (next section: File formats)
  25. 25. Raw read input format
      • FASTQ:
        @SEQ_ID                             (id)
        GATTTGGGGTTCAAAGCAGTATCGATCAAATA    (sequence)
        +                                   (description line)
        !''*((((***+))%%%++)(%%%%).1***-    (base qualities)
      • Base qualities:
        - Phred-like: probability that the base is incorrect
      res.illumina.com/documents/products/technotes/technote_q-scores.pdf
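The relationship between the quality characters and the error probability can be shown on the record above. This sketch assumes the common Phred+33 ASCII encoding (an assumption; older Illumina pipelines used other offsets) and the standard Phred definition Q = -10 log10(P); the helper names are illustrative.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read id, sequence, quality string)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode quality characters into Phred scores (offset 33 is assumed)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATA",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-",
]
read_id, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
```

For example, `error_probability(20)` is 0.01 (1 error in 100) and `error_probability(30)` is 0.001; the first base of this record has quality character `!`, i.e. Q0.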
  26. 26. Mapping output format: SAM format
      • Sequence Alignment/Map (SAM) format
        - text tab-delimited format
        - Exome: >50Gb
        - Whole genome: 800Gb-1Tb
      • binary, compressed: BAM
        - index for quick lookup
        - Exome: 2-10Gb
        - Whole genome: 100-300Gb
  27. 27. Mapping output format: SAM/BAM header
      @HD VN:1.0 SO:coordinate
      @SQ SN:chr1 LN:249250621 AS:NCBI37
      @SQ SN:chr2 LN:243199373 AS:NCBI37
      @SQ SN:chr3 LN:198022430 AS:NCBI37
      @RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS17071
      @PG ID:bwa VN:0.5.4
      Legend: @HD sort order; @SQ chromosomes and length; @RG read description line; @PG program info
  28. 28. Mapping output format SAM/BAM read information Reads (x4)
  29. 29. Mapping output format: SAM format read information
      Field | Description | Alignment
      QNAME | Query template name | HWI-ST1359:47:H1178ADXX:2:2201:20747:42719
      FLAG | Bitwise flag | 99
      RNAME | Reference name | chr13
      POS | Reference position | 28540518
      MAPQ | Mapping quality | 60
      CIGAR | CIGAR string | 101M
      MRNM/RNEXT | Reference next read/mate | =
      MPOS/PNEXT | Position next read/mate | 28540656
      ISIZE/TLEN | Template length | 239
  30. 30. Mapping output format: SAM format read information (continued)
      Field | Description | Alignment
      SEQ | Sequence | CCTCAAACCCACTCCAGGCTGCCATTGGTACTCGCCCCTTTTTACAGATGAGGAAATGGAGAATCAGACCGGGTCACGCAGATAGTATCAGGCGGGGTTGG
      QUAL | Base qualities | 7783>446=;@6;B;@B56/>=7>;8<;?@0>A>=;;7A==@@9BB<;497>565179==2864:>C=083&&42;3769<8==>=;=>>D=768==8:79
      TAGs | Tags | X0:i:1 X1:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:60 XT:A:U
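The fields in this example record are consistent with one another, which a short calculation can show. The sketch below computes how many reference bases a CIGAR string covers and then reproduces the template length of 239. It assumes the mate also aligns as 101M, which is not shown on the slide; `reference_span` is an illustrative helper, not part of any SAM library.

```python
import re


def reference_span(cigar):
    """Number of reference bases covered by an alignment.
    M, D, N, = and X consume the reference; I, S, H and P do not."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"
    )


pos, cigar, pnext = 28540518, "101M", 28540656
mate_cigar = "101M"  # assumption: the mate's CIGAR is not given on the slide

# Template length: from the leftmost base of this read
# to the rightmost base of its mate, inclusive.
mate_end = pnext + reference_span(mate_cigar) - 1
tlen = mate_end - pos + 1
```

Under that assumption `tlen` comes out as 239, matching the ISIZE/TLEN column, and the 101-base SEQ and QUAL strings agree with the 101M CIGAR.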
  31. 31. Mapping output format Flags
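The FLAG field is a sum of bit values, so the value 99 from the example record can be decoded with a lookup table. The bit meanings below follow the SAM specification; the short label strings and the `decode_flag` helper are this sketch's own naming.

```python
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


example = decode_flag(99)    # 99 = 1 + 2 + 32 + 64
mate = decode_flag(147)      # 147 = 1 + 2 + 16 + 128
```

Flag 99 therefore describes the first read of a properly paired fragment whose mate is on the reverse strand; 147 is the flag typically seen on that mate.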
  32. 32. Mapping output format: Mapping Quality
      • Probability that a read is mapped incorrectly
      • Function of factors such as:
        - uniqueness (i.e. not a multi-mapper)
        - number of mismatches in read
        - number of indels in read
        - quality of bases in read
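For reference, the SAM specification expresses mapping quality on the same Phred-like scale as base quality. How each aligner estimates the underlying probability differs between tools, so values are not always comparable across aligners.

```latex
\mathrm{MAPQ} = -10 \log_{10} P(\text{mapping position is wrong})
\quad\Longleftrightarrow\quad
P = 10^{-\mathrm{MAPQ}/10}
```

Under this definition a MAPQ of 20 corresponds to an error probability of 0.01, and the MAPQ of 60 in the example record corresponds to roughly one chance in a million that the read is misplaced.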
  33. 33. Presentation Outline (next section: Errors and bias)
  34. 34. Errors and bias: Sequencing/PCR artifacts: duplication
      • PCR duplicates during sample prep
      • Optical duplicates: read the same cluster twice on sequencer
      • High duplication can lead to problems in downstream analysis, e.g. skewed allele frequencies
  35. 35. Errors and bias: Sequencing error
      • Poor quality at the end of reads
      Kircher et al. Genome Biology 2009 10:R83
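One common way to find duplicates after mapping is to group reads that start at the same position on the same strand and keep only the best one. The sketch below illustrates that idea under strong simplifications (single-end reads, a precomputed mean quality per read, no distinction between PCR and optical duplicates); the dictionary keys and `mark_duplicates` are hypothetical and this is not the algorithm of any particular tool.

```python
def mark_duplicates(reads):
    """Return the indices of reads flagged as duplicates.
    Reads sharing (chrom, pos, strand) form a group; the read with the
    highest mean quality is kept and the others are flagged."""
    best = {}           # key -> index of the best read seen so far
    duplicates = set()
    for i, read in enumerate(reads):
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best:
            best[key] = i
        elif read["mean_qual"] > reads[best[key]]["mean_qual"]:
            duplicates.add(best[key])  # previous best is now a duplicate
            best[key] = i
        else:
            duplicates.add(i)
    return duplicates


reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "mean_qual": 30},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mean_qual": 35},
    {"chrom": "chr1", "pos": 100, "strand": "-", "mean_qual": 20},
    {"chrom": "chr1", "pos": 250, "strand": "+", "mean_qual": 32},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mean_qual": 33},
]
dups = mark_duplicates(reads)
duplication_rate = len(dups) / len(reads)
```

Flagging rather than deleting duplicates lets downstream tools decide whether to ignore them, which matters when allele frequencies are being estimated.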
  36. 36. Errors and bias Coverage Minoche et al. Genome Biology 2011 12:R112
  37. 37. Errors and bias G/C content Rieber N, et al. (2013) PLoS ONE 8(6): e66621.
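A first step in checking for G/C bias is simply to compute the GC fraction of reads or of reference windows, which can then be compared against coverage. The following is a minimal sketch; the window size and helper names are arbitrary choices for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq, window):
    """GC fraction of consecutive non-overlapping windows (trailing partial window dropped)."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


profile = windowed_gc("AAAAGGGGACGT", window=4)
```

Plotting mean coverage against the GC fraction of each window is one way to see whether extreme-GC regions are under-represented.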
  38. 38. Errors and bias Repeat regions Rieber N, et al. (2013) PLoS ONE 8(6): e66621.
  39. 39. Errors and bias Homopolymers AAAAAAAA TTTTTTTT CCCCCCCC GGGGGGGG Bragg et al PLoS Comput Biol. 2013 Apr;9(4):e1003031.
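Homopolymer runs like the ones shown here are easy to locate in a reference or read, which helps when deciding which indel calls to treat with suspicion. A small sketch using a regular expression follows; the minimum run length of 4 is an arbitrary threshold chosen for this example.

```python
import re


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of a single base
    that is at least min_len long (0-based start positions)."""
    return [
        (m.start(), m.group()[0], len(m.group()))
        for m in re.finditer(r"A+|C+|G+|T+", seq)
        if len(m.group()) >= min_len
    ]


sequence = "ACG" + "A" * 8 + "TTT" + "CCCC" + "G"
runs = homopolymer_runs(sequence, min_len=4)
```

Here the run of eight A bases and the run of four C bases are reported, while the three T bases fall below the threshold.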
  40. 40. Errors of sequencing and mapping Strand bias + strand - strand
  41. 41. Errors and bias
      • Platforms have different strengths and weaknesses
      • Be aware of biases when performing downstream analysis
      Platform | Known weakness
      Ion Torrent | Homopolymers
      PacBio | High overall error rate (but random)
      Illumina | GC bias
      Complete Genomics | Not as uniform in coverage
  42. 42. Presentation Outline (next section: Quality control)
  43. 43. Quality control
      [Workflow: Sequencing → Quality control → Data cleaning → Mapping → Quality control → Downstream analysis]
  44. 44. Before mapping QC: raw reads
      • Vendor software: quality metrics
      • FastQC
        - Free Java program that reports quality profile of reads
        - Reports on FASTQ, SAM/BAM files
        - Identifies a number of issues
      http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  45. 45. Before mapping QC: raw reads Base qualities http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html Good Okay Bad
  46. 46. Before mapping QC: raw reads Adapter contamination http://www.bioinformatics.babraham.ac.uk/projects/fastqc/small_rna_fastqc.html
  47. 47. Before mapping QC: raw reads Duplication rate http://proteo.me.uk/2013/09/a-new-way-to-look-at-duplication-in-fastqc-v0-11/
  48. 48. Before mapping QC: data cleaning
      • Filter reads for poor base quality
      • Trim adapter sequence
      Tool | Function | Website
      Fastx-Toolkit | Fastq manipulation including base quality filtering | http://hannonlab.cshl.edu/fastx_toolkit/
      Cutadapt | Adapter trimmer | http://code.google.com/p/cutadapt
      Skewer | Adapter trimmer | https://sourceforge.net/projects/skewer
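The two cleaning steps can be sketched as follows. This is a deliberately naive illustration, assuming Phred+33 qualities and exact adapter matches only; the dedicated tools in the table handle mismatches, indels and paired reads. The function names and the Q20 threshold are choices made for this example, and the adapter string is just a sample sequence.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Cut the read at the adapter: either a full adapter anywhere in the read,
    or an adapter prefix of at least min_overlap bases at the 3' end."""
    cut = seq.find(adapter)
    if cut == -1:
        longest = min(len(adapter), len(seq)) - 1
        for n in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:n]):
                cut = len(seq) - n
                break
    if cut == -1:
        return seq, qual
    return seq[:cut], qual[:cut]


adapter = "AGATCGGAAGAGC"  # example adapter sequence, used for illustration
trimmed_seq, trimmed_qual = trim_adapter("ACGTACAGATCGG", "I" * 13, adapter)
clean_seq, clean_qual = trim_low_quality_tail("ACGTACGT", "IIIIII##")
```

In the first call the read ends with the first seven bases of the adapter, so it is cut back to `ACGTAC`; in the second, the two trailing bases with quality character `#` (Q2) are removed.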
  49. 49. Quality control
      [Workflow: Sequencing → Quality control → Data cleaning → Mapping → Quality control → Downstream analysis]
  50. 50. Post-mapping QC: Visualisation IGV https://www.broadinstitute.org/igv/home
  51. 51. Post-mapping QC: Mapping Statistics: Picard
      • Java-based command-line utilities that manipulate SAM/BAM files
        - CollectInsertSizeMetrics
        - CollectGcBiasMetrics
        - CollectAlignmentSummaryMetrics
        - QualityScoreDistribution
      http://picard.sourceforge.net/
  52. 52. Post-mapping QC: Mapping Statistics: Picard summary metrics
      CATEGORY | TOTAL_READS | PCT_PF_READS_ALIGNED
      FIRST_OF_PAIR | 22734444 | 0.999075
      SECOND_OF_PAIR | 22734444 | 0.984633
      PAIR | 45468888 | 0.991854
      Other reported metrics:
      • high quality reads aligned
      • mismatch rate
      • indel rate
      http://picard.sourceforge.net/
  53. 53. Post-mapping QC: Mapping Statistics
      [Figures: quality score distribution; insert size]
      http://picard.sourceforge.net/
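The PAIR row of the summary table is the combination of the two per-read rows, which is a useful sanity check when reading such reports. The sketch below recomputes it from the numbers on the slide; `combine_categories` is a hypothetical helper, and because the reported percentages are rounded to six decimals the result is only expected to agree to that precision.

```python
def combine_categories(first, second):
    """Combine FIRST_OF_PAIR and SECOND_OF_PAIR rows into a PAIR row:
    totals add up, and the aligned fraction is the read-weighted average."""
    total = first["total_reads"] + second["total_reads"]
    aligned = (
        first["total_reads"] * first["pct_aligned"]
        + second["total_reads"] * second["pct_aligned"]
    )
    return {"total_reads": total, "pct_aligned": aligned / total}


first_of_pair = {"total_reads": 22734444, "pct_aligned": 0.999075}
second_of_pair = {"total_reads": 22734444, "pct_aligned": 0.984633}
pair = combine_categories(first_of_pair, second_of_pair)
```

The combined values reproduce the PAIR row (45468888 reads, about 0.991854 aligned). The lower alignment rate of the second read relative to the first is the kind of difference worth following up in the quality score distribution.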
  54. 54. Post-mapping QC: Mapping Statistics SAMStat samstat.sourceforge.net/ Bioinformatics. Jan 1, 2011; 27(1): 130–131.
  55. 55. Post-mapping QC: Exome on-target NGSrich: http://sourceforge.net/projects/ngsrich/ EXONS On target reads
  56. 56. Post-mapping QC: RNA-seq statistics
      Picard: CollectRnaSeqMetrics http://picard.sourceforge.net
      RSeQC http://rseqc.sourceforge.net/
      RNA-SeQC http://www.broadinstitute.org/cancer/cga/rna-seqc
      • Metrics including:
        - read counts: mapped, % ribosomal RNA, transcript annotated (exonic, intronic)
        - uniformity of coverage: 5’ or 3’ bias
  57. 57. Post mapping QC: Downstream analysis • GATK indel realign BEFORE AFTER https://www.broadinstitute.org/gatk/
  58. 58. Post mapping QC: Downstream analysis
      • Many downstream analysis tools have:
        - filtering algorithms
        - options to only use high quality sequence data
      • Examples include:
        - Mapping quality cutoffs
        - Use only unique reads
        - Strand bias filtering
        - Base quality filtering
        - Repeat region filtering
  59. 59. Presentation Outline Challenges of NGS mapping Choosing an aligner File formats Errors and bias Quality control Summary
  60. 60. Summary
      [Workflow: Sequencing → Quality control → Data cleaning → Mapping → Quality control → Downstream analysis]
      • Check raw reads before mapping
      • Trim/filter if necessary
      • Many options for tools
      • Choice is dependent on aims and type of experiment
      • Many kinds of error and bias
      • Important to do quality checks of data
      • The quality of your data can have a great impact on the success of your downstream analysis
