Workshop NGS data analysis - 2

Mapping of reads to a reference genome, QC and first downstream analysis

Published in: Education

  1. Sequencing data analysis workshop, part 2: mapping to a reference genome
     Outline
     - Previously in this workshop…
     - Mapping to a reference genome: the steps
     - Mapping to a reference genome: the workshop
     Maté Ongenaert
  2. Previously in this workshop… Introduction: the real cost of sequencing
  3. Previously in this workshop… Introduction: the real cost of sequencing
  4. Previously in this workshop… The workflow of NGS data analysis
     Data analysis: raw machine reads… what's next?
     - Preprocessing (machine/technology dependent): adaptors, indexes, conversions, …
     - Reads with associated qualities (universal): FASTQ, QC check
     - Depending on the application (generally applicable):
       - 'De novo' assembly of the genome (bacterial genomes, …)
       - Mapping to a reference genome → mapped reads (SAM/BAM/…)
     - High-level analysis (application-specific): SNP calling, peak calling
  5. Previously in this workshop… The workflow of NGS data analysis
  6. Previously in this workshop… Main data formats
     Raw sequence reads:
     - Sequence only: FASTA
         >SEQUENCE_IDENTIFIER
         GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
     - Extension with a quality score per base: FASTQ (Q for quality)
       Each score is a Phred value, stored as the ASCII character with code Phred + 33 (Sanger encoding)
         @SEQUENCE_IDENTIFIER
         GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
         +
         !*((((***+))%%%++)(%%%%).1***-+*))**55CCF>>>>>>CCCCCCC65
     - Machine- and platform-independent, compressed archive format: SRA (NCBI)
       The original FASTQ file can be recovered using the SRA Toolkit (NCBI)
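The Phred + 33 encoding described on this slide can be illustrated with a minimal sketch. The function names `phred33_to_scores` and `error_probability` are hypothetical helpers chosen for this example, not part of any tool mentioned in the workshop; the error-probability formula follows the standard Phred definition.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Decode a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(phred: int) -> float:
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    # A few characters taken from a quality line; '!' is the lowest score (0).
    example = "!*(5C"
    for char, score in zip(example, phred33_to_scores(example)):
        print(char, score, round(error_probability(score), 4))
```

A score of 20 corresponds to an error probability of 1 in 100, and a score of 30 to 1 in 1000, which is why quality trimming thresholds are often expressed in these terms.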
  7. Previously in this workshop… Main data formats
     - Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
     - BAM: the binary (read: computer-readable, indexed, compressed) form of SAM
     Description of the 11 fields in the alignment section:
       # QNAME: template name
       # FLAG: bitwise flag
       # RNAME: reference name
       # POS: mapping position
       # MAPQ: mapping quality
       # CIGAR: CIGAR string
       # RNEXT: reference name of the mate/next fragment
       # PNEXT: position of the mate/next fragment
       # TLEN: observed template length
       # SEQ: fragment sequence
       # QUAL: ASCII of Phred-scaled base quality + 33
     Headers:
       @HD VN:1.3 SO:coordinate
       @SQ SN:ref LN:45
     Alignment block:
       r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
       r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
       r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
       r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
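As a rough illustration of how the alignment fields fit together, the sketch below splits one alignment line into the 11 mandatory fields and uses the CIGAR string to work out how many reference bases and read bases an alignment covers. The helper names (`parse_alignment`, `cigar_lengths`) are made up for this example. Real SAM files are tab-separated; the code splits on any whitespace only so that it also accepts the lines as printed on the slide.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that use up bases of the read


def parse_alignment(line):
    """Split one alignment line into the 11 mandatory SAM fields."""
    record = dict(zip(SAM_FIELDS, line.split()[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    return record


def cigar_lengths(cigar):
    """Return (bases covered on the reference, bases taken from the read)."""
    ref_len = query_len = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in CONSUMES_REFERENCE:
            ref_len += int(count)
        if op in CONSUMES_QUERY:
            query_len += int(count)
    return ref_len, query_len


if __name__ == "__main__":
    rec = parse_alignment(
        "r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
    ref_len, query_len = cigar_lengths(rec["CIGAR"])
    # POS is 1-based, so the last covered reference base is POS + ref_len - 1.
    print(rec["QNAME"], rec["POS"], rec["POS"] + ref_len - 1,
          query_len == len(rec["SEQ"]))
```

For read r001 the insertion (2I) adds to the read length but not to the reference span, while the deletion (1D) does the opposite.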
  8. Previously in this workshop… Main data formats
     - BED files (location / annotation / scores): Browser Extensible Data
       Used for mapping / annotation / peak locations / …; extension: bigBed (binary)
       FIELDS USED:
         # chr
         # start
         # end
         # name
         # score
         # strand
       Example:
         track name=pairedReads description="Clone Paired Reads" useScore=1
         #chr start end name score strand
         chr22 1000 5000 cloneA 960 +
         chr22 2000 6000 cloneB 900 -
     - bedGraph files (location, combined with a score)
       Used to represent peak scores
       Example:
         track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
         #chr start end score
         chr19 59302000 59302300 -1.0
         chr19 59302300 59302600 -0.75
         chr19 59302600 59302900 -0.50
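A small sketch of reading the six BED fields listed above. `BedRecord`, `parse_bed` and `length` are hypothetical names for this example. It relies on the BED convention that start is 0-based and end is exclusive, so an interval's length is simply end minus start.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed(lines):
    """Yield BedRecord tuples, skipping track, browser and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        yield BedRecord(chrom, int(start), int(end), name, int(score), strand)


def length(record):
    """BED intervals are 0-based and half-open, so the length is end - start."""
    return record.end - record.start


EXAMPLE = """track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
"""

if __name__ == "__main__":
    for rec in parse_bed(EXAMPLE.splitlines()):
        print(rec.name, rec.chrom, length(rec), rec.strand)
```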
  9. Previously in this workshop… Main data formats
     - WIG files (location / annotation / scores): wiggle
       Used to visualize or summarize data, in most cases count data or normalized count data (RPKM)
       Extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)
       Example:
         browser position chr19:59304200-59310700
         browser hide all
         #150 base wide bar graph at arbitrarily spaced positions,
         #threshold line drawn at y=11.76
         #autoScale off viewing range set to [0:25]
         #priority = 10 positions this as the first graph
         track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
         variableStep chrom=chr19 span=150
         59304701 10.0
         59304901 12.5
         59305401 15.0
         59305601 17.5
         59305901 20.0
         59306081 17.5
  10. Previously in this workshop… Main data formats
     - GFF format (General Feature Format) or GTF
       Used for annotation of genetic / genomic features, such as all coding genes in Ensembl
       Often used in downstream analysis to assign annotation to regions / peaks / …
       FIELDS USED:
         # seqname (the name of the sequence)
         # source (the program that generated this feature)
         # feature (the name of this type of feature, for example: exon)
         # start (the starting position of the feature in the sequence)
         # end (the ending position of the feature)
         # score (a score between 0 and 1000)
         # strand (valid entries include +, -, or .)
         # frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base; if the feature is not a coding exon, the value should be .)
         # group (all lines with the same group are linked together into a single item)
       Example:
         track name=regulatory description="TeleGene(tm) Regulatory Regions"
         #chr source feature start end score str fr group
         chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
         chr22 TeleGene promoter 1010000 1010100 900 + . touch1
         chr22 TeleGene promoter 1020000 1020000 800 - . touch2
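One practical difference between GFF and BED is the coordinate system: GFF positions are 1-based with an inclusive end, whereas BED uses a 0-based start and an exclusive end. The sketch below shows the conversion; `gff_to_bed_interval` is a hypothetical helper name. Real GFF files are tab-separated, so the code only falls back to splitting on whitespace for lines copied from the slide.

```python
def gff_to_bed_interval(gff_line):
    """Convert one GFF line (1-based, inclusive) into a BED-style
    (chrom, start, end) tuple (0-based start, exclusive end)."""
    if "\t" in gff_line:
        fields = gff_line.rstrip("\n").split("\t")
    else:
        fields = gff_line.split()  # fallback for space-separated slide text
    chrom, start, end = fields[0], int(fields[3]), int(fields[4])
    return chrom, start - 1, end


if __name__ == "__main__":
    for line in (
        "chr22 TeleGene enhancer 1000000 1001000 500 + . touch1",
        "chr22 TeleGene promoter 1020000 1020000 800 - . touch2",
    ):
        chrom, start, end = gff_to_bed_interval(line)
        print(chrom, start, end, "length:", end - start)
```

Note that a GFF feature with identical start and end, like the last promoter line, is one base long, not empty.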
  11. Previously in this workshop… Main data formats
     - VCF format (Variant Call Format): for SNP representation
  12. Previously in this workshop… Main data formats
     - http://genome.ucsc.edu/FAQ/FAQformat.html
     - UCSC browser data formats: covers the most commonly used and widely accepted formats
     - In addition: ENCODE data formats (narrowPeak / broadPeak)
  13. Sequencing data analysis workshop, part 2: mapping to a reference genome
     Outline
     - Previously in this workshop…
     - Mapping to a reference genome: the steps
     - Mapping to a reference genome: the workshop
     Maté Ongenaert
  14. Mapping to a reference genome: the workflow
     Mapping: aligning the raw sequence reads to a reference genome using an indexing strategy and an alignment algorithm, taking the quality scores into account, under specific conditions
     - Raw sequence reads with quality scores: FASTQ
     - Reference genome: FASTA files can be downloaded (UCSC/Ensembl)
     - Sequence reads <> reference genome: alignment
     - To perform an efficient alignment, an indexing strategy is used
     - For instance (BWA/Bowtie): FM-indexes (based on the Burrows-Wheeler transform) of the reference genome and/or the sequence reads
     - Specific conditions: single-end or paired-end; number of mismatches allowed; trade-off between speed, accuracy and specificity; local realignment afterwards for improved indel calling; …
     >> Result: mapped sequence reads (chr / start / end / quality) >> SAM file (>> BAM)
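To give a feeling for what an FM-index does, here is a toy sketch of exact matching by backward search over the Burrows-Wheeler transform. It is only an illustration under simplifying assumptions: the suffix array is built by naive sorting and character counts are recomputed on every step, whereas real aligners such as BWA and Bowtie use compact, precomputed structures and also allow mismatches and gaps. All function names are hypothetical.

```python
def build_index(text):
    """Return (suffix array, BWT string) of text with a '$' terminator."""
    text += "$"  # '$' sorts before A, C, G, T in ASCII
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def find(text, pattern):
    """Return sorted 0-based positions of exact matches of pattern in text."""
    suffix_array, bwt = build_index(text)
    first_column = sorted(bwt)
    # smaller[c] = number of characters in the text that sort before c
    smaller = {c: first_column.index(c) for c in set(bwt)}
    low, high = 0, len(bwt)  # current range of rows in the sorted suffix list
    for char in reversed(pattern):  # backward search: last character first
        if char not in smaller:
            return []
        low = smaller[char] + bwt[:low].count(char)
        high = smaller[char] + bwt[:high].count(char)
        if low >= high:
            return []
    return sorted(suffix_array[low:high])


if __name__ == "__main__":
    reference = "GATTACAGATTACA"
    for read in ("GATTACA", "ACAG", "CAT"):
        print(read, find(reference, read))
```

The search cost depends on the read length rather than on the genome length, which is what makes indexing the reference once worthwhile.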
  15. Mapping to a reference genome: the workflow
     The reference genome
     - Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or Ensembl
     - Difficulty: download in 2bit format (needs a converter) >> FASTA files (.fa)
     - Needs to be indexed by the mapping program you are going to use
       - BWA: bwa index
       - Bowtie: bowtie-build (pre-computed indexes available)
     - BWA example:
         bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
       Index database sequences in the FASTA format.
       OPTIONS:
         -c      Build color-space index. The input FASTA should be in nucleotide space.
         -p STR  Prefix of the output database [same as db filename]
         -a STR  Algorithm for constructing the BWT index. Available options are:
                 is     IS linear-time algorithm for constructing the suffix array. It requires 5.37N memory, where N is the size of the database.
                 bwtsw  Algorithm implemented in BWT-SW. This method works with the whole human genome.
  16. Mapping to a reference genome: the workflow
     The sequencing reads
     - Sequence reads with quality scores: FASTQ files from the machine
     - Depending on the mapping program, these need to be indexed as well
       - BWA: converts the reads to SA (suffix array) coordinates based on the reference genome index
       - Bowtie: not needed; indexing and aligning happen in one step
     - BWA:
       - Index the reference genome
       - Index the sequence reads (INPUT: FASTQ and reference genome) >> SA coordinates (OUTPUT: SAI)
       - SA coordinates (INPUT: SAI, FASTQ and reference genome) >> SAM/BAM (OUTPUT)
  17. Mapping to a reference genome: the workflow
     aln
         bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q] <in.db.fasta> <in.query.fq> > <out.sai>
     Find the SA coordinates of the input reads. Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and maximum maxDiff differences are allowed in the whole sequence.
     OPTIONS:
       -n NUM  Maximum edit distance if the value is INT
       -o INT  Maximum number of gap opens
       -e INT  Maximum number of gap extensions, -1 for k-difference mode
       -d INT  Disallow a long deletion within INT bp towards the 3'-end
       -i INT  Disallow an indel within INT bp towards the ends [5]
       -l INT  Take the first INT subsequence as seed
       -k INT  Maximum edit distance in the seed
       -t INT  Number of threads (multi-threading mode)
       -M INT  Mismatch penalty
       -O INT  Gap open penalty
       -E INT  Gap extension penalty
       -R INT  Proceed with suboptimal alignments
       -c      Reverse query but not complement it
       -N      Disable iterative search
       -q INT  Parameter for read trimming
       -I      The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
       -B INT  Length of barcode starting from the 5'-end
       -b      Specify the input read sequence file is the BAM format
       -0      When -b is specified, only use single-end reads in mapping
       -1      When -b is specified, only use the first read in a read pair in mapping
       -2      When -b is specified, only use the second read in a read pair in mapping
  18. Mapping to a reference genome: the workflow
     samse
         bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
     Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
     OPTIONS:
       -n INT  Maximum number of alignments to output in the XA tag for reads paired properly
       -r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar'
     sampe
         bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
     Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
     OPTIONS:
       -a INT  Maximum insert size for a read pair to be considered being mapped properly
       -o INT  Maximum occurrences of a read for pairing
       -P      Load the entire FM-index into memory to reduce disk operations
       -n INT  Maximum number of alignments to output in the XA tag for reads paired properly
       -N INT  Maximum number of alignments to output in the XA tag for discordant read pairs
       -r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar'
  19. Sequencing data analysis workshop, part 2: mapping to a reference genome
     Outline
     - Previously in this workshop…
     - Mapping to a reference genome: the steps
     - Mapping to a reference genome: the workshop
     Maté Ongenaert
  20. Mapping to a reference genome: the workshop
     Mapping using BWA
         bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
     - bwa-0.5.9: BWA and its version
     - aln: the alignment functionality of BWA
     - -t 4: use 4 threads (CPU cores) at the same time to speed things up
     - /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
     - SRR058523.fastq: FASTQ file to align to the reference
     - >: redirects the output to a file
     - SRR058523.sai: the output file (SA index file)
     Maps the input sequences (FASTQ) to the reference genome index → output: indexes of the reads
     This is not yet a 'real' genomic mapping, so a next step is needed…
  21. Mapping to a reference genome: the workshop
     Mapping using BWA
         bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
     - bwa-0.5.9: BWA and its version
     - samse: single-end mapping with output in SAM format
     - /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
     - SRR058523.sai: the reads index
     - SRR058523.fastq: the raw reads and quality scores
     On its own, this would output a SAM file (for instance with > SRR058523.sam)
     But we don't need the SAM file; we would like a BAM file → processing by samtools
     - |: the 'pipe' symbol, which hands the output of one command over to the next
     - samtools-0.1.18: samtools and its version
     - view: the command to process SAM files
     - -b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
     - PHF8-unsorted.bam: output file name
     - -: read the input from standard input, i.e. from the pipe (end of the second command)
  22. Mapping to a reference genome: the workshop
     Mapping using BWA
         bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
         bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
     Two-step process in BWA
     Next steps: process the BAM file → sort and index it (using samtools)
         samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted
     Creates a sorted BAM file (PHF8-sorted.bam)
         samtools-0.1.18 index PHF8-sorted.bam
     Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
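When the same four steps have to be repeated for many samples, it can help to generate the command lines programmatically. The sketch below only builds the command strings shown on this slide and does not run anything. The function name and its parameters are hypothetical, and the `samtools sort` call uses the output-prefix form of the 0.1.x series shown on the slide; newer samtools releases expect the output file to be given with `-o` instead.

```python
import shlex


def bwa_single_end_commands(index_prefix, fastq, sample, threads=4,
                            bwa="bwa", samtools="samtools"):
    """Return the shell commands for aln, samse | view, sort and index."""
    sai = fastq.rsplit(".", 1)[0] + ".sai"
    unsorted_bam = sample + "-unsorted.bam"
    sorted_prefix = sample + "-sorted"
    aln = shlex.join([bwa, "aln", "-t", str(threads), index_prefix, fastq])
    samse = shlex.join([bwa, "samse", index_prefix, sai, fastq])
    # The trailing "-" tells samtools view to read SAM from standard input.
    view = shlex.join([samtools, "view", "-bhSo", unsorted_bam, "-"])
    return [
        aln + " > " + shlex.quote(sai),
        samse + " | " + view,
        # Legacy samtools 0.1.x syntax: input BAM followed by output prefix.
        shlex.join([samtools, "sort", unsorted_bam, sorted_prefix]),
        shlex.join([samtools, "index", sorted_prefix + ".bam"]),
    ]


if __name__ == "__main__":
    for command in bwa_single_end_commands(
            "/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8",
            bwa="bwa-0.5.9", samtools="samtools-0.1.18"):
        print(command)
```

Printing the commands first makes it easy to check paths and file names before handing them to a shell or a job scheduler.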
  23. Mapping to a reference genome: the workshop
     BAM: what's next?
     So, now we have the sorted and indexed BAM file. What's next?
     This file is the starting point for all other analyses, depending on the application:
     - ChIP-seq: peak calling
     - SNP calling
     - RNA-seq: calculate gene-expression levels of the transcripts / find splice variants
     What are the first things to do?
     - Visualize it (IGV can load BAM files)
     - First downstream analysis: QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, …)
  24. Mapping to a reference genome: the workshop
     First downstream analysis
     - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
     Samstat
         /opt/samstat/samstat PHF8-sorted.bam
     - Outputs an HTML file with statistics
  25. Mapping to a reference genome: the workshop
     First downstream analysis
     - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
     BamUtil (stats)
         bam stats --in PHF8-sorted.bam --basic --phred --baseSum
     Output:
         Number of records read = 15732744
         TotalReads(e6)          15.73
         MappedReads(e6)         15.04
         PairedReads(e6)         15.73
         ProperPair(e6)          14.65
         DuplicateReads(e6)      0.00
         QCFailureReads(e6)      0.00
         MappingRate(%)          95.59
         PairedReads(%)          100.00
         ProperPair(%)           93.11
         DupRate(%)              0.00
         QCFailRate(%)           0.00
         TotalBases(e6)          802.37
         BasesInMappedReads(e6)  766.95
         Quality  Count
         33       0
         34       0
         35       71373
         36       0
         37       0
         38       203544
         39       403649
         40       921714
         41       2081099
         42       1974615
         43       2285826
  26. Mapping to a reference genome: the workshop
     First downstream analysis
     - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
     Samtools
         samtools-0.1.18 idxstats PHF8-sorted.bam
     Output (reference name, sequence length, mapped reads, unmapped reads):
         1   249250621  503714  0
         2   243199373  345217  0
         3   198022430  273477  0
         4   191154276  229016  0
         5   180915260  360339  0
         6   171115067  257468  0
         7   159138663  269704  0
         8   146364022  242656  0
         9   141213431  203505  0
         10  135534747  237496  0
         11  135006516  218116  0
         12  133851895  231426  0
         13  115169878  106831  0
         14  107349540  119062  0
         15  102531392  141351  0
         16  90354753   183004  0
         17  81195210   187024  0
         18  78077248   86101   0
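Raw read counts per chromosome are hard to compare because chromosomes differ in length. A simple sketch, assuming the four-column idxstats layout (reference name, sequence length, mapped reads, unmapped reads), normalizes the counts to mapped reads per megabase. The function names are hypothetical, and only three of the rows from the slide are used to keep the example short.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million reference bases, per reference sequence."""
    # Entries with length 0 (such as the unplaced-reads line) are skipped.
    return {name: mapped / (length / 1e6)
            for name, length, mapped, _ in rows if length > 0}


EXAMPLE = """
1 249250621 503714 0
13 115169878 106831 0
17 81195210 187024 0
"""

if __name__ == "__main__":
    rows = parse_idxstats(EXAMPLE)
    for name, density in reads_per_megabase(rows).items():
        print(name, round(density, 1))
```

In this subset chromosome 1 has the most reads in absolute terms, but chromosome 17 has the highest density once length is taken into account.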
  27. Mapping to a reference genome: the workshop
     First downstream analysis
     - Think about PCR duplicates → you may want to remove them (or set a 'flag' in the BAM file, indicating that a read is a duplicate)
     - Samtools rmdup or Picard MarkDuplicates
     - Find out how these tools work and what other flags are used in BAM files
     - Can you make statistics with the BAM flags?
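As a starting point for the flag exercise, the sketch below decodes the bitwise FLAG field and tallies how often each property occurs. The bit values follow the SAM specification (the supplementary bit, 0x800, was added in later versions of the specification than the one shown earlier); the descriptive names and both function names are choices made for this example.

```python
from collections import Counter

FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def flag_statistics(flags):
    """Count how many reads carry each flag property."""
    flags = list(flags)
    counts = Counter()
    for flag in flags:
        counts.update(decode_flag(flag))
    counts["total"] = len(flags)
    return counts


if __name__ == "__main__":
    print(decode_flag(163))  # FLAG of read r001 in the SAM example
    print(flag_statistics([163, 0, 1040, 4]))
```

A duplicate-marking tool sets the 0x400 bit rather than deleting the read, so downstream tools can decide whether to ignore flagged reads.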
  28. Mapping to a reference genome: the workshop
     Mapping: now let's start!
     - Mapping is only the starting point for most downstream analysis tools
     - What comes next depends on the application and on what you want to do:
       - Exome sequencing / whole genome sequencing: SNP calling (samtools), based on mapping quality / coverage / … → identification of SNPs (VCF output format)
       - ChIP-seq: peak calling; based on the coverage of ChIP and input, enriched regions are identified (BED output, bedGraph and/or WIG files)
       - RNA-seq: assign reads to the transcripts, normalize (by length of the exon and number of reads in the sequencing library = RPKM) → (relative) expression levels → identification of differentially expressed genes
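The RPKM normalization mentioned for RNA-seq divides a read count by the feature length in kilobases and by the library size in millions of mapped reads. A minimal sketch, with a hypothetical function name and made-up read count and exon length (the library size of 15.04 million mapped reads is borrowed from the BamUtil output shown earlier purely for illustration):

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    RPKM = reads * 10^9 / (feature length in bp * total mapped reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Illustrative values: 500 reads on a 2,000 bp exon, 15.04 million mapped reads.
    print(round(rpkm(500, 2000, 15_040_000), 2))
```

Because both the length and the library size are divided out, RPKM values can be compared between features of different length and between libraries of different sequencing depth.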
