Next generation sequencing course - part 2: sequence mapping
 

    Presentation Transcript

    • [I0D51A] Bioinformatics: High-Throughput Analysis
      Next-generation sequencing. Part 2: Mapping
      Prof Jan Aerts, Faculty of Engineering - ESAT/SCD (jan.aerts@esat.kuleuven.be)
      TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
    • Context
    • Assembly vs mapping
    • Challenges (figure: Trapnell & Salzberg, 2009):
      • How quickly can we align the reads to the genome?
      • What do we do with repetitive sequences?
    • Approaches: hash-based and Burrows-Wheeler transform (figure: Trapnell & Salzberg, 2009)
    • Hash-based mapping
      E.g. MAQ
      Steps:
      • Index the reference genome (or the sequence reads) => creates a hash index (a big file: >50GB)
      • Divide each read into segments (seeds) and look them up in the table

      seed    positions
      ...     ...
      AAGC    3,473,2738,...
      AAGG    34,236,1827,...
      AAGT    8,172,782,1921,...
      ...     ...
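
The seed table above can be illustrated with a small Python sketch. This is not how MAQ is implemented; it only shows the idea of a k-mer index and seed lookup. The function names, the toy reference sequence, and the seed length of 4 are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start + 1)
    return index


def candidate_positions(read, index, k):
    """Look up non-overlapping seeds of the read and count, for each
    candidate read start position, how many seeds support it."""
    support = Counter()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            support[hit - offset] += 1
    return support


# Toy data (hypothetical): "AAGC" occurs twice in the reference.
reference = "TTAAGCGGAAGTCCAAGC"
index = build_seed_index(reference, 4)
print(index["AAGC"])
print(candidate_positions("AAGCGGAAGT", index, 4))
```

A candidate position supported by several seeds is a better place to attempt a full alignment than one supported by a single seed.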
    • Burrows-Wheeler transform
      E.g. BWA
      Used in data compression (e.g. bzip2) => the index is much smaller than a hash-based index (<2GB)
      Alignment speed: 30x faster than MAQ
      Steps:
      • Create the BWT index of the genome
      • Align each read one character at a time to the BWT-transformed genome
    • Burrows-Wheeler transform (figure panels: creating the Burrows-Wheeler transform; read mapping)
    • Inverse BWT: recreating original text
      If BWT = O^OOGG$L => what was the original text?
      O^OOGG$L = last column L => first column F = L sorted

      Last column L   First column F
      O               G
      ^               G
      O               L
      O               O
      G               O
      G               O
      $               ^
      L               $
    • Inverse BWT: recreating original text
      The ith occurrence of a character in L is the same text occurrence as the ith occurrence of that character in F.

      F          L
      1st G      1st O
      2nd G      1st ^
      1st L      2nd O
      1st O      3rd O
      2nd O      1st G
      3rd O      2nd G
      1st ^      1st $
      1st $      1st L

      Starting from $ and following these links, the text is rebuilt from right to left, one character per step:
      $
      L$
      OL$
      GOL$
      OGOL$
      OOGOL$
      GOOGOL$
      ^GOOGOL$
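
The walk through the F and L columns can be sketched in Python as follows. The collation order (letters first, then ^, then $) is an assumption chosen to match the sorted first column shown on the slides, and the function names are made up for this example. Sorting all rotations is only practical for toy strings; real tools build the index differently.

```python
# Assumed collation, matching the F column on the slides:
# letters first, then "^" (start marker), then "$" (end marker).
RANK = {ch: i for i, ch in enumerate("GLO^$")}


def sort_key(s):
    return [RANK[ch] for ch in s]


def bwt(text):
    """Last column of the sorted rotations of the text."""
    rotations = sorted((text[i:] + text[:i] for i in range(len(text))),
                       key=sort_key)
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, end="$"):
    """Rebuild the text right to left using the rule that the ith
    occurrence of a character in L is the ith occurrence in F."""
    n = len(last)
    # A stable sort of L's row numbers by character yields F, and keeps
    # equal characters in their original relative order.
    f_to_l = sorted(range(n), key=lambda i: RANK[last[i]])
    l_to_f = [0] * n
    for f_row, l_row in enumerate(f_to_l):
        l_to_f[l_row] = f_row
    # Start in the row whose F character is the end marker.
    row = next(r for r in range(n) if last[f_to_l[r]] == end)
    text = end
    for _ in range(n - 1):
        text = last[row] + text   # L holds the character preceding F
        row = l_to_f[row]         # jump to the row where it sits in F
    return text


transformed = bwt("^GOOGOL$")
print(transformed)
print(inverse_bwt(transformed))
```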
    • Searching using BWT
      Uses the row index and the fact that rows are alphabetically sorted => binary search
      E.g. at what positions does “GO” occur in “^GOOGOL$”?
      • Take the middle position: is “GO” alphabetically before or after this position?
      • If before: take the middle position of the first half (if after: of the last half) and discard the other half
      • Repeat until the string is found
      • Row indices indicate the positions of the substring: “GO” is at positions 2 and 5
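
A minimal Python sketch of this binary search is given below. It searches a sorted list of suffixes directly rather than a compressed BWT index, which is a simplification, and it uses Python's default character ordering, which does not affect the reported positions. The function name is an assumption for this example.

```python
def find_occurrences(text, pattern):
    """Return the 1-based positions of pattern in text, found by binary
    search over the alphabetically sorted suffixes of text."""
    suffix_starts = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)

    def prefix(rank):
        start = suffix_starts[rank]
        return text[start:start + m]

    # First sorted row whose prefix is not smaller than the pattern.
    lo, hi = 0, len(suffix_starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First sorted row whose prefix is larger than the pattern.
    hi = len(suffix_starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every row in between starts with the pattern.
    return sorted(start + 1 for start in suffix_starts[first:lo])


print(find_occurrences("^GOOGOL$", "GO"))
```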
    • Issues
      • Placing reads in regions that do not exist in the reference genome
      • Sequencing errors and variations: the alignment between a read and its true source in the genome may have more differences than the alignment with some other copy of a repeat. What if there are many nucleotide differences with the closest fully sequenced genome?
      • Placing reads in repetitive regions: MAQ & bwa return only 1 mapping; if there are multiple: mapQ = 0
      • MAQ & bwa use paired-end information => might prefer correct distance over correct alignment
    • File formats
      SAM (Sequence Alignment/Map) format = unified format for storing read alignments to a reference genome
      BAM = binary version of SAM for fast querying
    • Example SAM records

      7172283  83   chr9  139389482  60  90M  =  139389330  -242  ACGGGAG...  #######...
      7172283  163  chr9  139389330  60  90M  =  139389482  242   TAGGAGG...  EHHHHHH...
      7705896  83   chr9  139389513  60  90M  =  139389512  -91   GCTGGGG...  EBCHHFC...
      7705896  163  chr9  139389512  60  90M  =  139389513  91    AGCTGGG...  HHHHHHH...

      column  field  description
      1       QNAME  query template name
      2       FLAG   bitwise flag
      3       RNAME  reference sequence name
      4       POS    1-based leftmost mapping position
      5       MAPQ   mapping quality
      6       CIGAR  CIGAR string
      7       RNEXT  reference name of mate
      8       PNEXT  position of mate
      9       TLEN   observed template length
      10      SEQ    sequence
      11      QUAL   ASCII of Phred-scaled base quality

      http://samtools.sourceforge.net/SAM1.pdf
    • Paired data
      In the records above, the two records that share a QNAME (7172283, 7705896) are the two reads of one pair: the POS of one read is the PNEXT of the other, and their TLEN values have opposite signs.
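
As a sketch, the eleven mandatory fields can be pulled out of a tab-separated SAM line in Python as shown here. The record type and function names are assumptions for this example; the SEQ and QUAL values are left truncated exactly as on the slide, so only the other fields are used. The last lines check how TLEN relates to the positions of the two reads when both have CIGAR 90M.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory SAM fields.
    Optional tag fields after column 11 are ignored in this sketch."""
    fields = line.rstrip("\n").split("\t")
    (qname, flag, rname, pos, mapq, cigar,
     rnext, pnext, tlen, seq, qual) = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


# Second record from the slide (SEQ and QUAL truncated as in the source).
line = "\t".join(["7172283", "163", "chr9", "139389330", "60", "90M",
                  "=", "139389482", "242", "TAGGAGG...", "EHHHHHH..."])
record = parse_sam_line(line)

# The mate starts at PNEXT and, with CIGAR 90M, covers 90 reference bases.
mate_end = record.pnext + 90 - 1
template_length = mate_end - record.pos + 1
print(record.qname, record.flag, template_length)
```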
    • SAM format: FLAG field

      numeric  binary    description
      1        00000001  template has multiple fragments in sequencing
      2        00000010  each fragment properly mapped according to aligner
      4        00000100  fragment is unmapped
      8        00001000  mate is unmapped
      16       00010000  sequence is reverse complemented
      32       00100000  sequence of mate is reversed
      64       01000000  is first fragment in template
      128      10000000  is second fragment in template
    • SAM FLAG: examples
      • 83 = 64 + 16 + 2 + 1 = 01010011
        template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is reverse complemented, sequence of mate is not reversed, this is the first fragment in the template, this is not the second fragment in the template
      • 163 = 128 + 32 + 2 + 1 = 10100011
        template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is not reverse complemented, sequence of mate is reversed, this is not the first fragment in the template, this is the second fragment in the template
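
Decoding a FLAG value is a matter of testing each bit. The Python sketch below covers only the eight bits listed in the table above, with the descriptions copied from that table; the names FLAG_BITS and decode_flag are assumptions for this example.

```python
# The eight FLAG bits from the table above (value, description).
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in the flag."""
    return [description for bit, description in FLAG_BITS if flag & bit]


for flag in (83, 163):
    print(flag, format(flag, "08b"))
    for description in decode_flag(flag):
        print("  ", description)
```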
    • SAM format: CIGAR string

      M  alignment match (can be sequence match or mismatch)
      I  insertion to the reference
      D  deletion from the reference
      N  skipped region from the reference
      S  soft clipping (clipped sequence is present in SEQ)
      H  hard clipping (clipped sequence is not present in SEQ)
      P  padding (silent deletion from padded reference)
      =  sequence match
      X  sequence mismatch
    • CIGAR string: example

      read       ACGCA-TGCAGTtagacgt
      reference  ACGCAGTG--GT
      CIGAR      5M1D2M2I2M7S
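
The example can be checked with a short Python sketch that parses a CIGAR string and counts how many read bases and reference bases it accounts for. M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases. The function names are assumptions for this example, and the special value "*" (no CIGAR available) is not handled.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def consumed_lengths(cigar):
    """Return (read bases, reference bases) covered by the CIGAR string."""
    ops = parse_cigar(cigar)
    read_length = sum(n for n, op in ops if op in CONSUMES_READ)
    reference_length = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_length, reference_length


read = "ACGCATGCAGTtagacgt"      # read from the slide, gap removed
reference = "ACGCAGTGGT"         # reference from the slide, gaps removed
print(parse_cigar("5M1D2M2I2M7S"))
print(consumed_lengths("5M1D2M2I2M7S"), len(read), len(reference))
```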
    • Running bwa (FASTQ -> BAM)
      http://bio-bwa.sourceforge.net
      Steps:
      1. Create an index for the genome (only has to be done once)
      2. Run “bwa aln” to find suffix array coordinates of good hits of each individual read
      3. Run “bwa samse/sampe”, which converts suffix array coordinates to chromosomal coordinates and pairs the reads (for sampe)
    • Running “bwa” without arguments returns help.
    • bwa: indexing the genome
      Only has to be done once!
      To index chromosome 17 only:
      1. Download chr17.fa.gz from UCSC Genome Browser (downloads section) and decompress it to chr17.fa
      2. Run bwa index -a is chr17.fa
    • bwa: finding suffix array coordinates for reads
    • bwa: converting suffix array coordinates to chromosome coordinates
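
The three bwa steps for paired-end reads can be written down as a small Python sketch that assembles the command lines. The input file names reads_1.fq and reads_2.fq and the output prefix are hypothetical, and the helper functions are assumptions for this example. The sketch only prints the commands; calling run() would execute them and requires bwa to be installed. Output of "bwa aln" and "bwa sampe" is captured from standard output here.

```python
import subprocess


def bwa_paired_end_commands(reference, fastq1, fastq2, prefix):
    """Return (argv, stdout file) pairs for indexing, aligning both read
    files, and converting to SAM. A stdout file of None means the command
    writes its own output files."""
    sai1 = f"{prefix}_1.sai"
    sai2 = f"{prefix}_2.sai"
    return [
        (["bwa", "index", "-a", "is", reference], None),
        (["bwa", "aln", reference, fastq1], sai1),
        (["bwa", "aln", reference, fastq2], sai2),
        (["bwa", "sampe", reference, sai1, sai2, fastq1, fastq2],
         f"{prefix}.sam"),
    ]


def run(commands):
    """Execute the commands in order, redirecting stdout where needed."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


# Hypothetical file names, for illustration only.
commands = bwa_paired_end_commands("chr17.fa", "reads_1.fq", "reads_2.fq", "aln")
for argv, stdout_path in commands:
    redirect = f" > {stdout_path}" if stdout_path else ""
    print(" ".join(argv) + redirect)
```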
    • Using Galaxy for read mapping
    • Viewing BAM files
      Many options:
      • Integrative Genomics Viewer (IGV) by Broad Institute
      • samtools tview
      • UCSC genome browser
      • bamview
      • bambino
      • ...
    • Viewing BAM files: IGV
      http://www.broadinstitute.org/software/igv/
      Java WebStart
    • IGV display (figure labels): coverage, reads, polymorphisms, gene model
    • Is this a known SNP?
    • File -> Load from Server...
    • Yes, it is...
    • Viewing BAM files: samtools tview
      http://samtools.sourceforge.net
    • Viewing BAM files: UCSC Genome Browser
      http://genome.ucsc.edu
      -> “Genome Browser”
      -> “Manage Custom Tracks”
      -> “Add Custom Tracks”
      -> In “Edit configuration”: track type=bam name="My BAM" bigDataUrl=http://med.kuleuven.be/lcb/teaching/aln.sorted.bam
      -> “Submit”
      aln.sorted.bam contains reads that map to the first 10Mb of chr17
    • Whole chromosome
    • Zoomed in
    • Zoomed in even further (figure label: query template names)
    • Read details
    • Manipulating SAM/BAM files
      • convert SAM <-> BAM
      • remove PCR duplicates
      • sort BAM file - necessary for loading into tools such as IGV
      • index BAM file - necessary for loading into tools such as IGV
      • local realignment around indels
      • base quality recalibration
      • pileup - i.e. convert from read-based to position-based; SNP calling
      • ...
    • Manipulating SAM/BAM files - tools: samtools
      Li et al, 2009
      http://samtools.sourceforge.net
    • samtools: convert SAM to BAM, sort, index
    • Manipulating SAM/BAM files - tools: PICARD
      http://picard.sourceforge.net
      = Java-based command-line utility with similar functionality to samtools
      Useful commands:
      • MarkDuplicates - flags duplicate records (i.e. due to PCR amplification bias)
      • CalculateHsMetrics - calculates a set of Hybrid Selection specific metrics
      • SamToFastq - extracts read sequences and qualities from a SAM file
    • Duplicate removal
      PCR amplification bias: some reads are better amplified than others => bias!!
      => keep only one (with highest mapping Q)
      Figure labels: PCR went well; PCR didn’t go so well; PCR didn’t work
    • Duplicate removal: commands
      Picard:
        java -Xmx2048m -jar /path_to_picard/MarkDuplicates.jar INPUT=input.bam OUTPUT=output.bam METRICS_FILE=output.metrics VALIDATION_STRINGENCY=LENIENT
      samtools:
        samtools rmdup input.bam output.bam
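
The idea of keeping only one read per set of duplicates can be sketched in Python as below. This is a deliberately simplified illustration and not how Picard or samtools work internally: it assumes single-end reads, treats reads with the same reference name, start position and strand as duplicates, and keeps the one with the highest mapping quality. The dictionary keys and the toy reads are hypothetical.

```python
def remove_duplicates(reads):
    """Keep, for each (reference, position, strand) combination, only the
    read with the highest mapping quality. The first read wins a tie."""
    best = {}
    for read in reads:
        key = (read["rname"], read["pos"], read["reverse"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


# Toy reads (hypothetical): r1, r2 and r3 start at the same position
# on the same strand, so only one of them should survive.
reads = [
    {"qname": "r1", "rname": "chr17", "pos": 1000, "reverse": False, "mapq": 37},
    {"qname": "r2", "rname": "chr17", "pos": 1000, "reverse": False, "mapq": 60},
    {"qname": "r3", "rname": "chr17", "pos": 1000, "reverse": False, "mapq": 20},
    {"qname": "r4", "rname": "chr17", "pos": 1000, "reverse": True, "mapq": 60},
    {"qname": "r5", "rname": "chr17", "pos": 2500, "reverse": False, "mapq": 50},
]
kept = remove_duplicates(reads)
print([read["qname"] for read in kept])
```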
    • Manipulating SAM/BAM files - tools: GATK
      GATK = Genome Analysis Toolkit, developed by Broad Institute
      http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
      • Full variant discovery workflow
      • Variant evaluation
      • ...
    • Base quality recalibration
      • Why? Correct for variation in quality with machine cycle, sequence context, lane, baseQ, ...
      • Steps:
        • Identify what to correct for
        • Calculate covariates
        • Apply covariates
        • Check (create plots)
    • Mapping quality dependent on sequence context
    • Base quality recalibration: commands
      Count covariates:
        java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta --DBSNP resources/dbsnp_129_hg18.rod -I my_reads.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -recalFile my_reads.recal_data.csv
      Recalibrate:
        java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta -I my_reads.bam -T TableRecalibration -outputBam my_reads.recal.bam -recalFile my_reads.recal_data.csv
    • Local realignment near indels
    • Local realignment near indels
    • Local realignment near indels: commands
      Create target intervals:
        java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/reference.fasta -o /path/to/output.intervals
      Realign:
        java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir -jar /path/to/GenomeAnalysisTK.jar -I input.bam -R ref.fasta -T IndelRealigner -targetIntervals /path/to/output.intervals -o realignedBam.bam
    • Exercises
    • Aligning reads to reference on the command line
      Log in on the server mentioned on Toledo, and:
      • From directory ~jaerts/i0d51a/: copy the files s_1_sequence_small.txt, s_2_sequence_small.txt and chr9.fa to your own home directory.
      • Given that s_1_sequence_small.txt and s_2_sequence_small.txt contain paired reads: align these against chr9. You’ll first have to create an index for chr9 (see slides). Also convert the resulting SAM file to a BAM file.
      • How many of the reads were mapped? How many could not be mapped? How many mapped without mismatches (i.e. CIGAR string equal to “=”)?
    • Aligning reads to reference using Galaxy
      Log into your account on Galaxy.
      • Align the reads in s_1_sequence_small.txt and s_2_sequence_small.txt (that you uploaded in the last lesson) against hg19. Perform the mapping using BWA for Illumina. Use the built-in index “Human (Homo sapiens): hg19 Full” (type “hg19” in the “Select a reference genome” box). Do not suppress the header in the output SAM file.
      • Using Galaxy: create a histogram of the insert sizes of this DNA sequencing library (tip: you’ll need some commands from the “Text Manipulation” and “Filter and Sort” groups).
    • Investigating BAM file with IGV
      Start the IGV application from http://www.broadinstitute.org/software/igv/download (750MB version) and open the first10Mbchr17.sorted.bam file, which you can download from Toledo.
      • Is this data from a whole-genome sequencing experiment, or rather from some type of pulldown? If the latter: what type of pulldown (i.e. what were the targets)?
      • Is the complete CDS of the KIF1C gene covered?
      • What is the left-most gene that is also in OMIM (you can find those at “Load from Server -> hg19 -> Phenotype and Disease Associations”)? Are all its exons covered?
      • At position 11,928 of chromosome 17: is this a SNP? If it is: is it already known in dbSNP? What about position 13,905?