Mapping Raw Reads to a Reference Genome
1. Read Processing and Mapping:
From Raw to Analysis-ready Reads
Ben Passarelli
Stem Cell Institute Genome Center
NGS Workshop
12 September 2012
2. Samples to Information
Variant calling
Gene expression
Chromatin structure
Methylome
Immunorepertoires
De novo assembly
…
3. Many Analysis Pipelines Start with Read Mapping
Genotyping (GATK)
http://www.broadinstitute.org/gsa/wiki/images/7/7a/Overall_flow.jpg
http://www.broadinstitute.org/gatk/guide/topic?name=intro
RNA-seq (Tuxedo)
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
4. From Raw to Analysis-ready Reads
Raw reads
Read assessment and prep
Mapping
Local realignment
Duplicate marking
Base quality recalibration
Analysis-ready reads
Session Topics
• Understand read data formats and quality scores
• Identify and fix some common read data problems
• Find and prepare a genomic reference for mapping
• Map reads to a genome reference
• Understand alignment output
• Sort, merge, index alignment for further analysis
• Locally realign at indels to reduce alignment artifacts
• Mark/eliminate duplicate reads
• Recalibrate base quality scores
• An easy way to get started
5. Instrument Output
Illumina MiSeq, Illumina HiSeq: Images (.tiff), Cluster intensity file (.cif), Base call file (.bcl)
IonTorrent PGM, Roche 454: Standard flowgram file (.sff)
Pacific Biosciences RS: Movie, Trace (.trc.h5), Pulse (.pls.h5), Base (.bas.h5)
All instruments: Sequence Data (FASTQ Format)
6. FASTQ Format (Illumina Example)
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
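Read Record (four lines each):
– Header
– Read Bases
– Separator (with optional repeated header)
– Read Quality Scores
Header fields, using the first record as the example:
– Flow Cell ID (D17DBACXX)
– Lane (2)
– Tile (1101)
– Tile Coordinates (12432:5554)
– Barcode (AGTCAA)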
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads
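The record layout above can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: the function names are hypothetical, the example record is the first one on the slide shortened to its first 10 bases, and the meaning of the header fields the slide does not label (instrument, run number, filter flag, control number) is assumed from the Illumina 1.8+ header convention.

```python
from typing import Dict, Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, bases, qualities) for each complete four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, separator, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"bases/qualities length mismatch at line {i + 1}")
        yield header[1:], bases, quals


def parse_illumina_header(header: str) -> Dict[str, str]:
    """Split a header (without the leading '@') into named fields.

    Field names other than flow cell, lane, tile, coordinates and barcode
    are assumptions; the slide does not label them.
    """
    name, _, comment = header.partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, barcode = comment.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "tile": tile, "x": x, "y": y,
        "read": read, "filtered": filtered, "control": control,
        "barcode": barcode,
    }


# First record from the slide, shortened to 10 bases for brevity.
EXAMPLE = """@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTT
+
BCCFFFDFHH
"""

records = list(read_fastq(EXAMPLE.splitlines()))
fields = parse_illumina_header(records[0][0])
```

For gzip-compressed files, the same generator can be fed from `gzip.open(path, "rt").readlines()`; for real work a maintained parser such as Biopython's is the safer choice.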
7. Base Call Quality: Phred Quality Scores
Phred* quality score Q with base-calling error probability P
Q = -10 log10(P)
* Phred is the name of the first program to assign accurate base quality scores. It dates from the Human Genome Project.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger Phred+33 range: 0 to 40
I - Illumina 1.3+ Phred+64 range: 0 to 40
L - Illumina 1.8+ Phred+33 range: 0 to 41
Q score | Probability of base error | Base confidence | Sanger-encoded (Q score + 33) ASCII character
10 | 0.1 | 90% | “+”
20 | 0.01 | 99% | “5”
30 | 0.001 | 99.9% | “?”
40 | 0.0001 | 99.99% | “I”
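As a quick check on the relationship Q = -10 log10(P) and the Phred+33 encoding, the conversions can be sketched as below. The function names are illustrative only; the offset defaults to 33 (Sanger / Illumina 1.8+) and would be 64 for Illumina 1.3+ data.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q (inverse of Q = -10 log10 P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for a base-calling error probability P."""
    return -10 * math.log10(p)


def encode_quality(q: int, offset: int = 33) -> str:
    """ASCII character that stores score q under the given offset."""
    return chr(q + offset)


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Phred scores for every character of a FASTQ quality line."""
    return [ord(ch) - offset for ch in quality_string]


# Reproduce the table rows: Q score -> (error probability, ASCII character)
table = {q: (phred_to_error_prob(q), encode_quality(q)) for q in (10, 20, 30, 40)}
```

Decoding with the wrong offset shifts every score by 31, which is why identifying the encoding comes before any quality-based filtering.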
8. File Organization
[benpass@solexalign]$ ls
Sample_FS53_EPCAM+_CD10-_IL2270-18
Sample_FS53_EPCAM+_CD10+_IL2269-19
Sample_COH77_CD49F-_IL2275-13
Sample_COH77_CD49F+_CD66-_IL2274-14
Sample_COH77_CD49F+_CD66+_IL2273-15
Sample_COH74_EPCAM+_CD10-_IL2272-16
Sample_COH74_EPCAM+_CD10+_IL2271-17
Sample_COH69_EPCAM+_CD10-_IL2268-20
Sample_COH69_EPCAM+_CD10+_IL2267-21
[benpass@solexalign]$ ls Sample_COH77_CD49F-_IL2275-13
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_004.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_003.fastq.gz
COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_004.fastq.gz
File name parts highlighted on the slide:
– Barcode (AGTCAA)
– Read (R1, R2)
– Format (fastq)
– gzip compressed (.gz)
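The file naming shown above lends itself to a small parser that pairs each R1 file with its R2 partner. The sketch below is an assumption-laden illustration: the regular expression and function names are hypothetical, and reading `L002` as the lane and the trailing `001` to `004` as a chunk number follows the usual Illumina naming convention rather than anything labeled on the slide.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# sample_barcode_Llane_Rread_chunk.fastq.gz (lane and chunk meanings assumed)
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGT]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(name: str) -> Optional[Dict[str, str]]:
    """Return the named parts of a FASTQ file name, or None if it does not match."""
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


def pair_chunks(names: List[str]) -> List[Tuple[str, str]]:
    """Pair R1 and R2 files that share sample, lane and chunk number."""
    grouped: Dict[Tuple[str, str, str], Dict[str, str]] = defaultdict(dict)
    for name in names:
        fields = parse_fastq_name(name)
        if fields is None:
            continue  # ignore files that do not follow the naming scheme
        key = (fields["sample"], fields["lane"], fields["chunk"])
        grouped[key][fields["read"]] = name
    return [
        (reads["1"], reads["2"])
        for _, reads in sorted(grouped.items())
        if "1" in reads and "2" in reads
    ]


listing = [
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R1_002.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_001.fastq.gz",
    "COH77_CD49F-_IL2275-13_AGTCAA_L002_R2_002.fastq.gz",
]
pairs = pair_chunks(listing)
```

Keeping R1 and R2 chunks paired matters because paired-end mapping expects the two files to list reads in the same order.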
9. Initial Read Assessment
Common problems that can affect analysis
• Low confidence base calls
– typically toward ends of reads
– criteria vary by application
• Presence of adapter sequence in reads
– poor fragment size selection
– protocol execution or artifacts
• Over-abundant sequence duplicates
• Library contamination
10. Initial Read Assessment: FastQC
• Free download
Download: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
• Analyzes a sample of the reads (200K by default): fast, low resource use
11. Read Assessment Examples
Sanger Quality Score by Cycle
Median, Interquartile Range, 10-90 percentile range, Mean
Read Duplication
http://proteo.me.uk/2011/05/interpreting-the-duplicate-sequence-plot-in-fastqc
~8% of sampled sequences occur twice
~6% of sequences occur more than 10x
~71.48% of sequences are duplicates
Note: Duplication is based on read identity, not alignment, at this point
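Identity-based duplication, as opposed to alignment-based duplicate marking later in the pipeline, amounts to counting exact sequence matches. The sketch below shows that idea only; it is not FastQC's actual calculation (which works on a sample of reads and bins the counts), and the function name and output keys are illustrative.

```python
from collections import Counter
from typing import Dict, List


def duplication_summary(sequences: List[str]) -> Dict[str, object]:
    """Summarize exact-sequence duplication in a list of read sequences."""
    copies = Counter(sequences)            # sequence -> number of occurrences
    total = len(sequences)
    distinct = len(copies)
    # Every occurrence beyond the first copy of a sequence counts as a duplicate.
    duplicate_fraction = (total - distinct) / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "duplicate_fraction": duplicate_fraction,
        # copy number -> how many distinct sequences occur that many times
        "by_copy_number": dict(Counter(copies.values())),
    }


reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG"]
summary = duplication_summary(reads)
```

Because this compares sequences character by character, a single sequencing error hides a duplicate, which is one reason duplicates are assessed again after mapping.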
12. Read Assessment Example (Cont’d)
Per base sequence content should resemble this…
13. Read Assessment Example (Cont’d)
14. Read Assessment Example (Cont’d)
TruSeq Adapter, Index 9
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
15. Read Assessment Example (Cont’d)
Trim for base quality or adapters (run or library issue)
Trim leading bases (library artifact)
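Both kinds of trimming can be illustrated with a deliberately simple sketch. The 3' rule used here (drop trailing bases until one meets the threshold) is an assumption chosen for clarity; it is not the algorithm used by any particular tool, and the function names and the default threshold of 20 are hypothetical.

```python
from typing import Tuple


def trim_3prime_by_quality(
    bases: str, quals: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(bases)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return bases[:end], quals[:end]


def trim_leading(bases: str, quals: str, n: int) -> Tuple[str, str]:
    """Remove the first n bases, e.g. a known library artifact."""
    return bases[n:], quals[n:]


# 'I' = Q40, '5' = Q20, '#' = Q2, '!' = Q0 under Phred+33
trimmed = trim_3prime_by_quality("ACGTACGT", "IIIII5#!")
clipped = trim_leading("ACGTACGT", "IIIII5#!", 3)
```

Bases and quality characters are always trimmed together so that the record stays valid FASTQ.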
16. Selected Tools to Process Reads
Fastx toolkit* http://hannonlab.cshl.edu/fastx_toolkit/
(partial list)
FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
FASTQ Trimmer: Shortening FASTQ/FASTA reads (removing barcodes or noise).
FASTQ Clipper: Removing sequencing adapters
FASTQ Quality Filter: Filters sequences based on quality
FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
*defaults to old Illumina fastq (ASCII offset 64). Use -Q33 option.
SeqPrep https://github.com/jstjohn/SeqPrep
Adapter trimming
Merge overlapping paired-end reads
Biopython http://biopython.org, http://biopython.org/DIST/docs/tutorial/Tutorial.html
(for Python programmers)
Especially useful for implementing custom/complex sequence analysis/manipulation
Galaxy http://galaxy.psu.edu
Great for beginners: upload data, point and click
Covers just about everything you’ll see in today’s presentations
17. Read Mapping
http://www.broadinstitute.org/igv/
18. Read Mapping: Aligning to a Reference
Feature | SOAP2 (2.20) | Bowtie (0.12.8) | BWA (0.6.2) | Novoalign (2.07.00)
License | GPL v3 | LGPL v3 | GPL v3 | Commercial
Mismatch allowed | exactly 0,1,2 | 0-3 max in read | user specified; max is function of read length and error rate | up to 8 or more
Alignments reported per read | random/all/none | user selected | user selected | random/all/none
Gapped alignment | 1-3bp gap | no | yes | up to 7bp
Pair-end reads | yes | yes | yes | yes
Best alignment | minimal number of mismatches | minimal number of mismatches | minimal number of mismatches | highest alignment score
Trim bases | 3’ end | 3’ and 5’ end | 3’ and 5’ end | 3’ end
19. Read Mapping: BWA
BWA Features
• Uses Burrows-Wheeler Transform
– fast
– modest memory footprint (<4GB)
• Accurate
• Tolerates base mismatches
– increased sensitivity
– reduces allele bias
• Gapped alignment for both single-end and paired-end reads
• Automatically adjusts parameters based on read lengths and error rates
• Native BAM/SAM output (the de facto standard)
• Large installed base, well-supported
• Open-source (no charge)
20. Sequence References and Annotations
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
http://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome
Comprehensive reference information
http://hgdownload.cse.ucsc.edu/downloads.html
Comprehensive reference, annotation, and translation information
ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle
Reference and SNP data used by GATK
Human only
http://cufflinks.cbcb.umd.edu/igenomes.html
Pre-indexed references and gene annotations for Tuxedo suite
Human, Mouse, Rat, Cow, Dog, Chicken, Drosophila, C. elegans, Yeast
http://www.repeatmasker.org/
21. Fasta Sequence Format
>chr1
…
TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt
agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga
attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga
aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa
taaccatatttgggatatactggattttaaaaatatatcactaatttcat
…
>chr2
…
>chr3
…
• One or more sequences per file
• “>” denotes beginning of sequence or contig
• Subsequent lines up to the next “>” define sequence
• Lowercase base denotes repeat masked base
• Contig ID may have comments delimited by “|”
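The rules in the bullets above are enough to write a small reader. The following is a minimal sketch with hypothetical function names; it keeps whole sequences in memory, which is fine for a demonstration but not for a human genome, and it treats everything before the first "|" as the contig ID.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {contig ID: sequence} from FASTA-formatted lines."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">" starts a new sequence; comments after "|" are dropped
            current = line[1:].split("|")[0].strip()
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def masked_fraction(sequence: str) -> float:
    """Fraction of bases written in lowercase (repeat-masked)."""
    if not sequence:
        return 0.0
    return sum(ch.islower() for ch in sequence) / len(sequence)


example = [">chr1", "ACGTac", "gtAC", ">chr2|illustrative comment", "NNNN"]
contigs = parse_fasta(example)
```

Most aligners ignore case, so soft-masked (lowercase) repeats are still mapped against, but the fraction is a quick sanity check on which flavor of reference was downloaded.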
22. Running BWA
Input files:
reference.fasta, read1.fastq.gz, read2.fastq.gz
Step 1: Index the genome (~3 CPU hours for a human genome reference):
bwa index -a bwtsw reference.fasta
Step 2: Generate alignments in Burrows-Wheeler transform suffix array coordinates:
bwa aln reference.fasta read1.fastq.gz > read1.sai
bwa aln reference.fasta read2.fastq.gz > read2.sai
Apply option -q <quality threshold> to trim poor-quality bases at the 3'-ends of reads
Step 3: Generate alignments in the SAM format (paired-end):
bwa sampe reference.fasta read1.sai read2.sai \
    read1.fastq.gz read2.fastq.gz > alignment_output.sam
http://bio-bwa.sourceforge.net/bwa.shtml
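For scripting the three steps, one option is to build the command lines in Python and hand them to `subprocess`. The sketch below assumes `bwa` is on the PATH and uses only the subcommands and options shown on the slide; the function names and the fixed `.sai` file names are illustrative choices, not part of BWA.

```python
import subprocess
from typing import List, Optional, Tuple

# (argument vector, file that receives standard output or None)
Step = Tuple[List[str], Optional[str]]


def bwa_paired_end_steps(
    reference: str,
    read1: str,
    read2: str,
    out_sam: str,
    trim_quality: Optional[int] = None,
) -> List[Step]:
    """Build the index, aln (x2) and sampe commands for one paired-end sample."""
    aln_options = [] if trim_quality is None else ["-q", str(trim_quality)]
    return [
        (["bwa", "index", "-a", "bwtsw", reference], None),
        (["bwa", "aln", *aln_options, reference, read1], "read1.sai"),
        (["bwa", "aln", *aln_options, reference, read2], "read2.sai"),
        (["bwa", "sampe", reference, "read1.sai", "read2.sai", read1, read2],
         out_sam),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


steps = bwa_paired_end_steps(
    "reference.fasta", "read1.fastq.gz", "read2.fastq.gz",
    "alignment_output.sam", trim_quality=20,
)
```

Calling `run_steps(steps)` would execute the pipeline; building the commands separately makes it easy to print and review them first. The index step only needs to run once per reference.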
24. SAM (BAM) Format
Sequence Alignment/Map format
– Universal standard
– Human-readable (SAM) and compact (BAM) forms
Structure
– Header
version, sort order, reference sequences, read groups,
program/processing history
– Alignment records
26. SAM/BAM Format: Alignment Records
[benpass align_genotype]$ samtools view allY.recalibrated.merge.bam
HW-ST605:127:B0568ABXX:2:1201:10933:3739 147 chr1 27675 60 101M = 27588 -188
TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC
=7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/
RG:Z:86-191
HW-ST605:127:B0568ABXX:3:1104:21059:173553 83 chr1 27682 60 101M = 27664 -119
ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA
8;8.7::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0
RG:Z:SDH023
* Each record is a single tab-delimited line, wrapped here for display. Many fields after column 12 (e.g., recalibrated base scores) have been deleted for improved readability
Columns called out on the slide: 1 (read name), 3 (reference sequence), 4 (position), 5 (mapping quality), 6 (CIGAR), 8 (mate position), 9 (template length), 10 (read bases), 11 (base qualities)
http://samtools.sourceforge.net/SAM1.pdf
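Column 2, the bitwise FLAG, is the one column not called out above, and it is the least self-explanatory. A minimal decoding sketch follows; the bit meanings are taken from the SAM specification, while the short labels, function names, and the abbreviated example line (with `*` standing in for the bases and qualities) are illustrative.

```python
from typing import Dict, List

# (bit, label) pairs; labels are shorthand for the SAM specification's wording
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
           "rnext", "pnext", "tlen", "seq", "qual"]


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment record into its 11 mandatory columns plus tags."""
    fields = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # Optional fields look like TAG:TYPE:VALUE, e.g. RG:Z:86-191
    record["tags"] = {t.split(":", 2)[0]: t.split(":", 2)[2] for t in fields[11:]}
    return record


# First record from the slide, with bases and qualities replaced by "*"
line = "\t".join(["HW-ST605:127:B0568ABXX:2:1201:10933:3739", "147", "chr1",
                  "27675", "60", "101M", "=", "27588", "-188", "*", "*",
                  "RG:Z:86-191"])
record = parse_sam_line(line)
flags = decode_flag(record["flag"])
```

Read this way, both records on the slide are properly paired reads mapped to the reverse strand: 147 marks the second read of its pair and 83 the first.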
27. Preparing for Next Steps
• Subsequent steps require sorted and indexed BAMs
– Sort orders: karyotypic, lexicographical
– Indexing improves analysis performance
• Picard tools: fast, portable, free
http://picard.sourceforge.net/command-line-overview.shtml
Sort: SortSam.jar
Merge: MergeSamFiles.jar
Index: BuildBamIndex.jar
• Order: sort, merge (optional), index
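The difference between the two sort orders named on this slide is easy to see on a handful of contig names. The sketch below is illustrative only: the key function is hypothetical, and placing chrM after chrY is an assumption, since references differ on where the mitochondrial contig goes. What matters in practice is that the BAM and the reference in use agree.

```python
from typing import List, Tuple

SPECIAL_ORDER = {"X": 1, "Y": 2, "M": 3, "MT": 3}  # assumed placement


def karyotypic_key(name: str) -> Tuple[int, int, str]:
    """Sort key: numbered chromosomes numerically, then X, Y, M, then others."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    return (1, SPECIAL_ORDER.get(label, 4), label)


names: List[str] = ["chr10", "chr2", "chrX", "chr1", "chrM", "chrY"]
lexicographical = sorted(names)
karyotypic = sorted(names, key=karyotypic_key)
```

Lexicographical order puts chr10 before chr2 because names are compared character by character, which is a common source of "contigs out of order" errors when files from different sources are combined.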
28. Local Realignment
• BWT-based alignment is fast for matching reads to reference
• Individual base alignments often sub-optimal at indels
• Approach
– Fast read mapping with BWT-based aligner
– Realign reads at indel sites using gold standard (but much slower) Smith-Waterman [1] algorithm
• Benefits
– Refines location of indels
– Reduces erroneous SNP calls
– Very high alignment accuracy in significantly less time, and with fewer resources, than aligning every read with Smith-Waterman
[1] Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5. PMID 7265238
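To make the cost of the Smith-Waterman step concrete, here is a compact sketch of the algorithm with a linear gap penalty. It is a teaching version, not what any realigner actually runs: the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, and production implementations typically use affine gap penalties and heavy optimization.

```python
from typing import List, Tuple


def smith_waterman(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> Tuple[int, str, str]:
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; every cell is floored at 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # base of a aligned to a gap
            left = H[i][j - 1] + gap    # base of b aligned to a gap
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a: List[str] = []
    out_b: List[str] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = H[i][j]
        if score == H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# A read missing one base relative to the reference segment
result = smith_waterman("ACGTACGT", "ACGTCGT")
```

The matrix has one cell per pair of positions, so time and memory grow with the product of the two lengths. That is why the approach above reserves it for short windows around indel sites.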
29. Local Realignment
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Figure panels: Raw BWA alignment; Post re-alignment at indels
30. Duplicate Marking
• Covered in genotyping presentation
• Note that this is done after alignment
31. Base Quality Recalibration
STEP 1: Find covariates at non-dbSNP sites using:
Reported quality score
The position within the read
The preceding and current nucleotide (sequencer properties)
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -I alignment.bam \
    -R hg19/ucsc.hg19.fasta \
    -knownSites hg19/dbsnp_135.hg19.vcf \
    -o alignment.recal_data.grp
STEP 2: Generate BAM with recalibrated base scores:
java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R hg19/ucsc.hg19.fasta \
    -I alignment.bam \
    -BQSR alignment.recal_data.grp \
    -o alignment.recalibrated.bam
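The idea behind recalibration can be sketched as a counting exercise: group bases by the covariates listed in Step 1, count how often each group disagrees with the reference at sites not in dbSNP, and convert that observed error rate back to a Phred score. The code below is only a conceptual illustration with hypothetical names; it is not GATK's model, and the +1/+2 smoothing of the error rate is an assumption made to avoid taking the log of zero.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (reported Q, position in read, preceding+current nucleotide, is mismatch)
Observation = Tuple[int, int, str, bool]
Covariates = Tuple[int, int, str]


def empirical_qualities(observations: Iterable[Observation]) -> Dict[Covariates, float]:
    """Empirical Phred score for each covariate bin.

    Observations are assumed to come only from sites that are not known
    variants, so that mismatches can be treated as sequencing errors.
    """
    counts: Dict[Covariates, List[int]] = defaultdict(lambda: [0, 0])
    for reported_q, position, context, is_mismatch in observations:
        bin_counts = counts[(reported_q, position, context)]
        bin_counts[0] += 1                  # bases observed
        bin_counts[1] += int(is_mismatch)   # bases that disagree with reference
    return {
        key: -10 * math.log10((errors + 1) / (total + 2))  # smoothed error rate
        for key, (total, errors) in counts.items()
    }


# 998 bases reported as Q40 at read position 5 after an "A", none mismatching
observed = [(40, 5, "AC", False)] * 998
recalibrated = empirical_qualities(observed)
```

In this toy bin the instrument reported Q40, but 998 error-free observations only support about Q30, so the recalibrated score is lower than the reported one.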
32. Base Quality Recalibration (Cont’d)
33. Getting Started
Is there an easier way to get started?!!
34. Getting Started
http://galaxy.psu.edu/ Click “Use Galaxy”
35. Getting Started
http://galaxy.psu.edu/ Click “Use Galaxy”
More samples, more data, more runs. And more customers. As HTS is adopted more widely, the average user is less computationally sophisticated, while the amount of data and the variety of data types are more complicated than ever.
Whether for genotyping, RNA-seq, ChIP-seq, or methylation analysis, the data requires processing. The number and makeup of the processing steps are evolving asynchronously.