Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR, Barcelona)

Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.

Published in: Science, Technology

  1. 1. INTRODUCTION TO NGS VARIANT CALLING ANALYSIS. Ferran Briansó (ferran.brianso@vhir.org), 15/05/2014. Course: Bioinformàtica per la Recerca Biomèdica (Bioinformatics for Biomedical Research), http://ueb.vhir.org/2014BRB. Hospital Universitari Vall d’Hebron, Institut de Recerca (VHIR); Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII).
  2. 2. PRESENTATION OUTLINE: 1. NGS WORKFLOW OVERVIEW 2. WET LAB STEPS 3. IMPORTANT SEQUENCING CONCEPTS 4. NGS ANALYSIS WORKFLOW (primary analysis: de-multiplexing, QC; secondary analysis: read mapping and variant calling; tertiary analysis: annotation, filtering...) 5. VISUALIZATION 6. COMMON PIPELINES AND FORMATS 7. CONCLUSIONS
  3. 3. NGS WORKFLOW OVERVIEW. Extracted from Dr Kassahn's publicly shared slides (2013).
  4. 4. LIBRARY PREPARATION: • Select target: hybridization-based capture or PCR • Add adapters: contain binding sequences, barcodes, primer sequences • Amplify material
  5. 5. LIBRARY PREPARATION: • Select target: hybridization-based capture or PCR • Add adapters: contain binding sequences, barcodes, primer sequences • Amplify material. Steps: A) Fragment DNA B) End-repair C) A-tailing, adapter ligation and PCR D) Final library contains • sample insert • indices (barcodes) • flowcell binding sequences • primer binding sequences
  6. 6. LIBRARY PREPARATION: • Select target: hybridization-based capture or PCR • Add adapters: contain binding sequences, barcodes, primer sequences • Amplify material
  7. 7. TEMPLATE PREPARATION: • Attachment of library (e.g. to Illumina flowcell) • Amplification of library molecules (e.g. bridge amplification)
  8. 8. BRIDGE AMPLIFICATION
  9. 9. SEQUENCING: Sequencing-by-synthesis. Detection by: • Illumina: fluorescence • Ion Torrent: pH • Roche 454: pyrophosphate release, detected as light
  10. 10. SEQUENCING-BY-SYNTHESIS (ILLUMINA)
  11. 11. IMPORTANT SEQUENCING CONCEPTS: • Barcoding/indexing: allows multiplexing of different samples • Single-end vs paired-end sequencing • Coverage: average number of reads per target • Quality scores (Q score): log scale!
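The log scale of quality scores and the idea of coverage can be made concrete with a short sketch. It assumes the standard Phred definition, Q = -10 * log10(P_error), and it treats coverage as average per-base depth (total sequenced bases divided by target size), which is one common reading of "average number of reads per target". The function names are illustrative and do not come from the slides.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) back to a Phred quality score."""
    return -10 * math.log10(p)

def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Average depth over the target: total sequenced bases / target size in bases."""
    return n_reads * read_length / target_size

# Each step of 10 in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4f}")
```

Because the scale is logarithmic, averaging Q scores directly is not the same as averaging error probabilities; converting to probabilities first is usually the safer choice.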
  12. 12. NGS DATA ANALYSIS WORKFLOW
  13. 13. DE-MULTIPLEXING (BARCODE SPLITTING)
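Barcode splitting can be sketched as follows. This is a minimal illustration, not how bcl2fastq or any vendor tool works internally: it assumes each read arrives as an (index sequence, read sequence) pair, that barcodes have equal length, and that a read is assigned only when exactly one barcode lies within a hypothetical `max_mismatches` threshold.

```python
from collections import defaultdict

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(reads, barcodes, max_mismatches=1):
    """Split reads into per-sample bins by their index (barcode) sequence.

    reads:    iterable of (index_sequence, read_sequence) tuples
    barcodes: dict mapping sample name -> expected barcode
    Reads matching no barcode, or more than one, go to "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            sample
            for sample, barcode in barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(read_seq)
    return dict(bins)
```

Allowing one mismatch recovers reads with a single sequencing error in the index, at the cost of requiring barcodes that differ from each other at several positions.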
  14. 14. FASTQ FORMAT (see en.wikipedia.org/wiki/FASTQ_format)
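A FASTQ record can be read with a few lines of code. The sketch below assumes the common four-lines-per-record layout (header starting with "@", sequence, "+" separator, quality string) and Phred+33 quality encoding; older files may use a different offset, which is why `phred_offset` is a parameter. The function name is illustrative.

```python
import io

def read_fastq(handle, phred_offset=33):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (a blank line is also treated as the end here)
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Each quality character encodes one Phred score as ASCII code minus offset.
        qualities = [ord(ch) - phred_offset for ch in quality]
        yield header[1:].strip(), sequence, qualities

example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, sequence, qualities in read_fastq(example):
    print(read_id, sequence, qualities)
```

For real data, an established parser (for example the one in Biopython) is generally preferable, since it also handles edge cases this sketch ignores.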
  15. 15. SEQUENCE QUALITY: FastQC. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Details of the output: https://docs.google.com/document/pub?id=16GwPmwYW7o_r-ZUgCu8-oSBBY1gC97TfTTinGDk98Ws
  16. 16. NGS DATA ANALYSIS WORKFLOW
  17. 17. READ MAPPING (BASIC ALIGNMENT): • Comparison against reference genome (! not assembly !) • Many aligners (short reads, longer reads, RNA-Seq...) • Examples: BWA, Bowtie • SAM/BAM files
  18. 18. BURROWS-WHEELER ALIGNMENT TOOL (BWA): • Popular tool for genomic sequence data (not RNA-Seq!) • Challenge: compare billions of short sequence reads (.fastq file) against the human genome (3 Gb) • Uses the Burrows-Wheeler Transform to “index” the human genome and allow memory-efficient and fast string matching between sequence reads and the reference genome • Li & Durbin 2009, Bioinformatics
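The idea of indexing a reference with the Burrows-Wheeler Transform can be illustrated on a toy string. This is a teaching sketch only: it builds the transform by sorting all rotations (quadratic memory, unusable for a genome) and counts exact matches by backward search. BWA itself uses a compressed index and also handles mismatches and gaps, none of which is shown here.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_string: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the original text by backward search."""
    # first_row[c] = number of characters in the text that sort before c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in the first i characters of the BWT.
        # Real indexes precompute this table instead of rescanning.
        return bwt_string[:i].count(ch)

    low, high = 0, len(bwt_string)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        low = first_row[ch] + occ(ch, low)
        high = first_row[ch] + occ(ch, high)
        if low >= high:
            return 0
    return high - low

index = bwt("ACGTACGTAC")
print(count_occurrences(index, "ACG"))
```

The search touches the pattern one character at a time, so its cost depends on read length rather than genome length, which is what makes this kind of index attractive for short-read mapping.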
  19. 19. SAM/BAM FILES (see http://samtools.sourceforge.net/SAMv1.pdf)
  20. 20. SAM/BAM FILES: @ Header (information regarding reference genome, alignment method...) 1) Read ID (QNAME) 2) Bitwise FLAG (first/second read in pair, both reads mapped...) 3) Reference sequence name (RNAME) 4) Position (POS, coordinate) 5) Mapping quality (MAPQ = -10 log10 P[wrong mapping position]) 6) CIGAR (describes alignment: matches, skipped regions, insertions...) 7) Reference sequence name of the mate (RNEXT) 8) Position of the mate (PNEXT) 9) Template length (TLEN) 10) Read sequence (SEQ) 11) Base qualities (QUAL, encoded as in FASTQ, '*' if not available) ...
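The eleven mandatory columns listed above can be pulled apart with a short parser. The sketch assumes a tab-separated SAM alignment line (not a header line and not binary BAM) and uses the flag bits and CIGAR operators defined in the SAM specification; the function names and the example read are made up for illustration.

```python
import re

# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

def parse_sam_line(line: str) -> dict:
    """Split one alignment line into its 11 mandatory fields (optional tags ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    return {
        "qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
        "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5],
        "rnext": fields[6], "pnext": int(fields[7]), "tlen": int(fields[8]),
        "seq": fields[9], "qual": fields[10],
    }

def decode_flag(flag: int) -> list:
    """Names of all bits set in a bitwise FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def parse_cigar(cigar: str) -> list:
    """Turn a CIGAR string such as '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def mapq_to_error_prob(mapq: int) -> float:
    """Invert MAPQ = -10 log10 P[wrong mapping position]."""
    return 10 ** (-mapq / 10)

record = parse_sam_line("read1\t99\tchr1\t100\t60\t5M1I4M\t=\t300\t250\tACGTACGTAC\tIIIIIIIIII")
print(decode_flag(record["flag"]), parse_cigar(record["cigar"]))
```

In practice, libraries built on htslib (such as pysam) are the usual way to read SAM and BAM files; the sketch is only meant to show what the columns contain.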
  21. 21. VARIANT CALLING: • Identify sequence variants • Distinguish signal vs noise • VCF files • Examples: SAMtools, SNVmix
  22. 22. SEQUENCE VARIANTS: Differences from the reference
  23. 23. SEQUENCE VARIANTS: • Sanger: is it real?? • NGS: read count provides confidence (statistics!) • Sensitivity is a tunable parameter (dependent on coverage)
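One way to see how read counts provide statistical confidence is to ask how likely the observed number of variant-supporting reads would be if they were all sequencing errors. The sketch below uses a simple binomial model with a single, uniform per-base error rate; that model and the function names are assumptions made for illustration, and real callers use richer models.

```python
from math import comb

def allele_fraction(alt_reads: int, depth: int) -> float:
    """Fraction of reads at a position that support the variant allele."""
    return alt_reads / depth if depth else 0.0

def prob_at_least_k_errors(depth: int, k: int, error_rate: float) -> float:
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    Small values mean k variant reads are unlikely to be noise alone.
    """
    return sum(
        comb(depth, i) * error_rate ** i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )

# Same allele fraction (10%), very different confidence at different depths.
for depth, alt in ((20, 2), (200, 20)):
    p = prob_at_least_k_errors(depth, alt, error_rate=0.01)
    print(f"depth={depth} alt={alt} AF={allele_fraction(alt, depth):.2f} P(noise)={p:.3g}")
```

This is why sensitivity depends on coverage: at higher depth, the same allele fraction is backed by more independent reads and is much harder to explain as noise.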
  24. 24. VARIANT CALLING: GATK. Genome Analysis Toolkit (Broad Institute) • Initially developed for 1000 Genomes Project • Single or multiple sample analysis (cohort) • Popular tool for germline variant calling • Evaluates probability of genotype given read data (see http://www.broadinstitute.org/gatk/ and McKenna et al. Genome Research 2010)
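"Probability of genotype given read data" can be illustrated with a deliberately simplified Bayesian calculation. This is not GATK's model: it assumes a diploid site with one alternate allele, independent reads, a single uniform error rate instead of per-base qualities, and flat priors unless others are supplied. All names are hypothetical.

```python
import math

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of genotypes 0/0, 0/1 and 1/1 given allele counts.

    error_rate must lie strictly between 0 and 1.
    """
    # Expected fraction of alternate alleles for each diploid genotype.
    expected_alt = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}
    if priors is None:
        priors = {g: 1 / 3 for g in expected_alt}
    log_post = {}
    for genotype, fraction in expected_alt.items():
        # A read shows the alt base if it truly carries alt and is read correctly,
        # or carries ref and is misread.
        p_alt = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        log_lik = alt_count * math.log(p_alt) + ref_count * math.log(1 - p_alt)
        log_post[genotype] = log_lik + math.log(priors[genotype])
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    unnormalised = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}

print(genotype_posteriors(ref_count=12, alt_count=9))
```

The binomial coefficient is omitted because it is identical for all three genotypes and cancels when the posteriors are normalised.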
  25. 25. SOMATIC VARIANT CALLING: Somatic mutations can occur at low frequency (<10%) due to: • Tumor heterogeneity (multiple clones) • Low tumor purity (% normal cells in tumor sample). Requires different thresholds than germline variant calling when evaluating signal vs noise. Trade-off between sensitivity (ability to detect mutation) and specificity (rate of false positives). Nature Reviews Cancer 12, 323-334 (May 2012)
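The sensitivity versus specificity trade-off for low-frequency variants can be sketched with the same kind of binomial reasoning. The model below is an assumption for illustration only: a variant is "called" when at least `min_alt_reads` reads support it, true variant reads appear at the variant allele fraction, and noise appears at a uniform error rate.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def threshold_tradeoff(depth, variant_fraction, error_rate, min_alt_reads):
    """Return (sensitivity, false_positive_rate) for one calling threshold.

    sensitivity:         chance a real variant at variant_fraction reaches the threshold
    false_positive_rate: chance sequencing errors alone reach the threshold
    """
    sensitivity = binomial_tail(depth, min_alt_reads, variant_fraction)
    false_positive_rate = binomial_tail(depth, min_alt_reads, error_rate)
    return sensitivity, false_positive_rate

# A 5% subclonal variant at 100x depth, for increasingly strict thresholds.
for threshold in (2, 4, 8):
    sens, fpr = threshold_tradeoff(100, 0.05, 0.01, threshold)
    print(f"min_alt_reads={threshold}: sensitivity={sens:.3f} false positives={fpr:.3g}")
```

Raising the threshold lowers both numbers at once, which is the trade-off the slide describes; deeper sequencing is what lets both improve together.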
  26. 26. INDEL DETECTION: Small insertions/deletions. The trouble with mapping approaches (modified from Heng Li, Broad Institute)
  27. 27. INDEL DETECTION: Small insertions/deletions. The trouble with mapping approaches
  28. 28. INDEL DETECTION: Small insertions/deletions. The trouble with mapping approaches
  29. 29. RE-ALIGNMENT: Re-align considering multi-read context and prior information on known SNPs and indels... (adapted from Andreas Schreiber)
  30. 30. EVALUATING VARIANT QUALITY. Taking into account: • Coverage at position • Number of independent reads supporting variant • Observed allele fraction vs expected (somatic / germline) • Strand bias • Base qualities at variant position • Mapping qualities of reads supporting variant • Variant position within reads (near ends or at centre)
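The criteria above can be combined into a simple per-variant check. Everything in this sketch is hypothetical: the field names, the thresholds, and the crude strand-imbalance measure (callers often use a statistical test instead). It only shows how such evidence might be turned into pass/fail flags.

```python
from dataclasses import dataclass

@dataclass
class VariantEvidence:
    depth: int                      # total reads covering the position
    alt_forward: int                # variant-supporting reads on the forward strand
    alt_reverse: int                # variant-supporting reads on the reverse strand
    mean_alt_base_quality: float    # mean Phred base quality at the variant base
    mean_alt_mapping_quality: float # mean MAPQ of variant-supporting reads
    alt_reads_near_read_end: int    # supporting reads with the variant near a read end

def failed_checks(ev, expected_fraction=0.5, min_depth=20, min_alt_reads=4,
                  fraction_tolerance=0.3, max_strand_imbalance=0.9,
                  min_base_quality=20, min_mapping_quality=30,
                  max_read_end_fraction=0.8):
    """Return the names of all quality checks this variant fails (empty list = pass)."""
    alt = ev.alt_forward + ev.alt_reverse
    failures = []
    if ev.depth < min_depth:
        failures.append("low_coverage")
    if alt < min_alt_reads:
        failures.append("few_supporting_reads")
    if alt == 0:
        return failures  # the remaining checks need at least one supporting read
    fraction = alt / ev.depth if ev.depth else 0.0
    if abs(fraction - expected_fraction) > fraction_tolerance:
        failures.append("unexpected_allele_fraction")
    if max(ev.alt_forward, ev.alt_reverse) / alt > max_strand_imbalance:
        failures.append("strand_bias")
    if ev.mean_alt_base_quality < min_base_quality:
        failures.append("low_base_quality")
    if ev.mean_alt_mapping_quality < min_mapping_quality:
        failures.append("low_mapping_quality")
    if ev.alt_reads_near_read_end / alt > max_read_end_fraction:
        failures.append("read_end_bias")
    return failures
```

For somatic calling, `expected_fraction` and `fraction_tolerance` would need very different values than for a germline heterozygous call, in line with the "somatic / germline" note on the slide.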
  31. 31. VCF FILES: Variant Call Format • Standard for reporting variants from NGS • Describes metadata of analysis and variant calls • Text file format (open in text editor or Excel) • !!! Not a MS Office vCard !!! (see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
  32. 32. VCF FILES
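Because VCF is a plain text format, its basic structure is easy to read programmatically. The sketch below assumes the layout described in the VCF 4.1 specification: metadata lines starting with "##", one column header line starting with "#", and tab-separated records whose first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. Per-sample genotype columns are ignored, and the function name is illustrative.

```python
def parse_vcf(lines):
    """Return (metadata_lines, records) from an iterable of VCF text lines."""
    metadata, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            metadata.append(line[2:])   # analysis metadata, e.g. fileformat=VCFv4.1
            continue
        if line.startswith("#"):
            continue                    # column header line
        chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
        info_fields = {}
        if info != ".":
            for item in info.split(";"):
                key, _, value = item.partition("=")
                info_fields[key] = value if value else True  # bare keys are flags
        records.append({
            "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","),                      # several ALT alleles possible
            "qual": None if qual == "." else float(qual),
            "filter": filt, "info": info_fields,
        })
    return metadata, records

example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
]
metadata, records = parse_vcf(example)
print(metadata, records[0]["info"])
```

INFO values are left as strings here; a fuller parser would use the types declared in the "##INFO" metadata lines to convert them.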
  33. 33. NGS DATA ANALYSIS WORKFLOW
  34. 34. VARIANT ANNOTATION: • Provide biological & clinical context • Identify disease-causing mutations (among 1000s of variants)
  35. 35. ANNOTATION OVERVIEW
  36. 36. VARIANT FILTERING AND PRIORITIZATION. PURPOSE: • Identify pathogenic or disease-associated mutation(s) • Reduce candidate variants to reportable set. COMMON STEPS: • Remove poor quality variant calls • Remove common polymorphisms • Prioritize variants with high functional impact • Compare against known disease genes • Consider mode of inheritance (autosomal recessive, X-linked...) • Consider segregation in family (where multiple samples available)
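The first four common steps can be sketched as a filter followed by a sort. The record layout, consequence labels, impact ranking and thresholds below are all hypothetical choices made for this example; annotation tools each use their own vocabularies. Mode of inheritance and family segregation are not modelled.

```python
# Hypothetical ranking of functional impact: lower number = higher priority.
IMPACT_RANK = {"frameshift": 0, "stop_gained": 1, "splice_site": 2, "missense": 3}

def filter_and_prioritize(variants, disease_genes, min_qual=30.0, max_population_af=0.01):
    """Reduce annotated variants to a ranked candidate list.

    variants:      list of dicts with keys "gene", "qual", "population_af", "consequence"
    disease_genes: set of gene names already linked to the phenotype
    """
    kept = []
    for variant in variants:
        if variant["qual"] < min_qual:
            continue  # remove poor quality variant calls
        if variant["population_af"] > max_population_af:
            continue  # remove common polymorphisms
        if variant["consequence"] not in IMPACT_RANK:
            continue  # keep only consequences considered high impact here
        kept.append(variant)
    # Known disease genes first, then by functional impact.
    return sorted(
        kept,
        key=lambda v: (v["gene"] not in disease_genes, IMPACT_RANK[v["consequence"]]),
    )
```

Keeping the thresholds as parameters makes it easy to see how strongly the reportable set depends on them, which echoes the point that different parametrizations give different results.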
  37. 37. NGS DATA ANALYSIS WORKFLOW
  38. 38. VISUALIZATION: IGV (or Genome Browser, Circos...) (provided by Katherine Pillman)
  39. 39. COMMON PIPELINE: • bcl2fastq (Illumina) • FastQC (open-source) • Exomes (HiSeq): BWA (open-source), GATK (Broad) • Gene panels (MiSeq, PGM): MiSeq Reporter (Illumina), Torrent Suite (Ion Torrent) • Custom scripts and third-party tools (Annovar, snpEff, PolyPhen, SIFT...) • Commercial annotation software (GeneticistAssistant, VariantStudio...)
  40. 40. COMMON DATA FORMATS: .bcl .fastq .BAM .VCF .csv .txt .xls .html ...
  41. 41. CONCLUSIONS: • NGS data: the new currency of (molecular) biology. • Broad applications (ecology, evolution, agricultural sciences, medical research and clinical diagnostics...). • Rapidly evolving (sequencing technologies, library preparation methods, analysis approaches, software). • Different tools, pipelines and parametrizations give different results (more standards needed). • Bioinformatics pipelines typically combine vendor software, third-party tools and custom scripts. • Requires skills in scripting, Linux/Unix, HPC. • Requires advanced hardware (not always available). • Understanding of the data (SE, PE, RNA-Seq) is important for successful analysis.
