Workshop NGS data analysis - 2
1. Sequencing data analysis
Workshop – part 2 / mapping to a reference genome
Outline
Previously in this workshop…
Mapping to a reference genome – the steps
Mapping to a reference genome – the workshop
Maté Ongenaert
6. Previously in this workshop…
Main data formats
Raw sequence reads:
- Represent the sequence ~ FASTA
>SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
- Extension: represent the quality, per base ~ FASTQ (Q for quality)
Score ~ Phred score, stored as one ASCII character per base: Phred + 33 = Sanger encoding
@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
- Machine and platform independent and compressed: SRA (NCBI)
Get the original FASTQ file using SRATools (NCBI)
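As a minimal sketch of the Phred + 33 (Sanger) encoding described above, the Python snippet below converts a FASTQ quality string into per-base Phred scores and error probabilities. The function names are illustrative and not part of any tool used in this workshop.

```python
def phred33_scores(quality_string):
    """Convert a Sanger (Phred+33) quality string to a list of Phred scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


# Quality line of the FASTQ record shown on this slide
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_scores(quality)

print(scores[:4])             # Phred scores of the first four bases
print(error_probability(20))  # Q20 corresponds to a 1% error rate
```

The character '!' has ASCII code 33, so it encodes the lowest possible quality (Phred 0).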
7. Previously in this workshop…
Main data formats
- Now moving to a common file format SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM
DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION
# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33
#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
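To make the FLAG and CIGAR fields more concrete, here is a hedged Python sketch that splits an alignment line into the 11 fields, decodes the bitwise FLAG and computes how many reference bases the CIGAR string covers. The helper names are illustrative; the flag bit values follow the SAM specification. Real SAM files are tab-delimited, while the slide shows the fields separated by spaces; `split()` handles both.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit values of the FLAG field (SAM specification)
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_alignment(line):
    """Split one alignment line into the 11 mandatory fields."""
    record = dict(zip(SAM_FIELDS, line.split()[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


def decode_flag(flag):
    """List the names of all bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Turn '8M2I4M1D3M' into [(8, 'M'), (2, 'I'), ...]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume the reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


record = parse_alignment("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
print(decode_flag(record["FLAG"]))
print(record["POS"], record["POS"] + reference_span(record["CIGAR"]) - 1)
```

For read r001, flag 163 = 1 + 2 + 32 + 128, and the alignment covers 16 reference bases starting at position 7.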
8. Previously in this workshop…
Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand
track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start end score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
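A small Python sketch of how such BED lines could be read is given below. It relies on the UCSC convention that BED coordinates are 0-based and half-open (the start is included, the end is not), so the length of a region is simply end minus start. The function names are illustrative only.

```python
def parse_bed_line(line):
    """Parse one BED line; the first three columns are mandatory."""
    fields = line.split()
    record = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    # Optional columns 4-6, present in the example on this slide
    for key, value in zip(("name", "score", "strand"), fields[3:6]):
        record[key] = value
    return record


def region_length(record):
    """Length in bases: BED is 0-based, half-open, so no +1 is needed."""
    return record["end"] - record["start"]


clone_a = parse_bed_line("chr22 1000 5000 cloneA 960 +")
print(clone_a["name"], region_length(clone_a))
```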
9. Previously in this workshop…
Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM); extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)
browser position chr19:59304200-59310700
browser hide all
#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph
track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
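The variableStep block above can be expanded into explicit intervals. The sketch below is one possible way to do this in Python, assuming (as in the UCSC format description) that WIG positions are 1-based and that each value covers `span` bases; the output uses 0-based, half-open coordinates like bedGraph. Only the declaration line and the data lines are passed in; wrapped track lines are not handled by this sketch.

```python
def variable_step_to_intervals(lines):
    """Convert variableStep WIG lines to (chrom, start, end, value) tuples."""
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chr19 span=150"
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        position, value = line.split()
        start = int(position) - 1  # 1-based WIG position -> 0-based start
        intervals.append((chrom, start, start + span, float(value)))
    return intervals


wig_lines = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in variable_step_to_intervals(wig_lines):
    print(*interval)
```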
10. Previously in this workshop…
Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:
# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)
track name=regulatory description="TeleGene(tm) Regulatory Regions"
#chr source feature start end score strand frame group
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
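When GFF annotation is combined with BED-style regions, the different coordinate conventions matter: GFF positions are 1-based and both start and end are included. The following Python sketch (illustrative helper names, not an existing library) parses one GFF line into the nine fields listed above and converts it to a BED-style interval.

```python
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Split a GFF line into its nine fields (the group field may contain spaces)."""
    record = dict(zip(GFF_FIELDS, line.strip().split(None, 8)))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def feature_length(record):
    """GFF is 1-based and inclusive, hence the +1."""
    return record["end"] - record["start"] + 1


def gff_to_bed(record):
    """Convert to 0-based, half-open BED coordinates."""
    return (record["seqname"], record["start"] - 1, record["end"], record["feature"])


enhancer = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(feature_length(enhancer))
print(gff_to_bed(enhancer))
```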
11. Previously in this workshop…
Main data formats
- VCF format (Variant Call Format)
For SNP representation
12. Previously in this workshop…
Main data formats
- http://genome.ucsc.edu/FAQ/FAQformat.html
- UCSC browser data formats, including the most commonly used and widely
accepted formats
- In addition, ENCODE data formats (narrowPeak / broadPeak)
13. Sequencing data analysis
Workshop – part 2 / mapping to a reference genome
Outline
Previously in this workshop…
Mapping to a reference genome – the steps
Mapping to a reference genome – the workshop
Maté Ongenaert
14. Mapping to a reference genome
The workflow
Mapping:
Aligning the raw sequence reads to a reference genome by using an indexing strategy and
aligning algorithm, taking into account the quality scores and with specific conditions
- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)
- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the
reference genome and/or the sequence reads
- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …
>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
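To give an intuition for the indexing strategy, the toy Python sketch below builds a Burrows-Wheeler transform of a short text and counts exact matches of a pattern with the backward search used in FM indexes. This is only an illustration under simplifying assumptions: it sorts all rotations in memory and recounts characters on every step, whereas BWA and Bowtie use far more compact, precomputed structures and also handle mismatches and gaps.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations of text + '$'."""
    text += "$"  # end marker, sorts before A, C, G, T and lowercase letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using FM-index backward search."""
    # first_index[c] = number of characters in the text that are smaller than c
    first_index = {}
    for i, char in enumerate(sorted(bwt_string)):
        first_index.setdefault(char, i)

    def occ(char, i):
        # Occurrences of char in bwt_string[:i]; real FM indexes store this in tables
        return bwt_string[:i].count(char)

    low, high = 0, len(bwt_string)  # current range of matching rotations [low, high)
    for char in reversed(pattern):
        if char not in first_index:
            return 0
        low = first_index[char] + occ(char, low)
        high = first_index[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low


reference = "GATTACAGATTACA"
index = bwt(reference)
print(index)
print(count_occurrences(index, "ATTA"))
```

The pattern is processed from its last base to its first, and each step narrows the range of sorted rotations that start with the suffix seen so far.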
15. Mapping to a reference genome
The workflow
The reference genome
- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or
Ensembl
- Difficulty: download in 2bit format (needs a converter, e.g. UCSC twoBitToFa) >> FASTA files (.fa)
- Need to be indexed by the mapping program you are going to use
- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)
- BWA example:
bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.
OPTIONS:
-c Build color-space index. The input FASTA should be in nucleotide space.
-p STR Prefix of the output database [same as db filename]
-a STR Algorithm for constructing BWT index. Available options are:
is IS linear-time algorithm for constructing suffix array.
It requires 5.37N memory where N is the size of the database.
bwtsw Algorithm implemented in BWT-SW. This method works with the whole human genome
16. Mapping to a reference genome
The workflow
The sequencing reads
- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, need to be indexed as well
- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
index
- Bowtie: not needed: indexing and aligning in one step
- BWA:
- Index reference genome
- Index sequence reads (INPUT: FASTQ and REF. GENOME) >> SA coordinates (OUTPUT:
SAI)
- SA coordinates (INPUT: SAI/FASTQ and REF. GENOME) >> SAM/BAM (OUTPUT)
17. Mapping to a reference genome
The workflow
aln bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
<in.db.fasta> <in.query.fq> > <out.sai>
Find the SA coordinates of the input reads.
Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and
maximum maxDiff differences are allowed in the whole sequence.
OPTIONS:
-n NUM Maximum edit distance if the value is INT, or the fraction of missing alignments given a 2% uniform base error rate if FLOAT
-o INT Maximum number of gap opens
-e INT Maximum number of gap extensions, -1 for k-difference mode
-d INT Disallow a long deletion within INT bp towards the 3’-end
-i INT Disallow an indel within INT bp towards the ends [5]
-l INT Take the first INT subsequence as seed.
-k INT Maximum edit distance in the seed
-t INT Number of threads (multi-threading mode)
-M INT Mismatch penalty
-O INT Gap open penalty
-E INT Gap extension penalty
-R INT Proceed with suboptimal alignments
-c Reverse query but not complement it
-N Disable iterative search.
-q INT Parameter for read trimming.
-I The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT Length of barcode starting from the 5’-end.
-b Specify the input read sequence file is the BAM format.
-0 When -b is specified, only use single-end reads in mapping.
-1 When -b is specified, only use the first read in a read pair in mapping
-2 When -b is specified, only use the second read in a read pair in mapping
18. Mapping to a reference genome
The workflow
samse bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.
OPTIONS:
-n INT Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
sampe bwa sampe [-a][-o][-n][-N][-P]<in.db.fasta>
<in1.sai><in2.sai><in1.fq><in2.fq> ><out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.
OPTIONS:
-a INT Maximum insert size for a read pair to be considered being mapped properly.
-o INT Maximum occurrences of a read for pairing.
-P Load the entire FM-index into memory to reduce disk operations
-n INT Maximum number of alignments to output in the XA tag for reads paired properly
-N INT Maximum number of alignments to output in the XA tag for disconcordant read pairs
-r STR Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
19. Sequencing data analysis
Workshop – part 2 / mapping to a reference genome
Outline
Previously in this workshop…
Mapping to a reference genome – the steps
Mapping to a reference genome – the workshop
Maté Ongenaert
20. Mapping to a reference genome
The workshop
Mapping using BWA
bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
bwa-0.5.9 BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (SA Index file)
Maps the input sequences (FASTQ) to the reference genome index; output: the SA coordinates
(indexes) of the reads
This is not yet a ‘real genomic mapping’, so a next step is needed…
21. Mapping to a reference genome
The workshop
Mapping using BWA
bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
bwa-0.5.9 BWA and its version
samse: single-end mapping and output to SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.sai: the reads index
SRR058523.fastq: the raw reads and quality scores
On its own, this would output a SAM file (> SRR058523.sam, for instance)
But we don’t need the SAM file; we would like a BAM file, so the output is processed by samtools
| is the ‘pipe’ symbol: hands over the output from one command to the other
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
PHF8-unsorted.bam: output file name
-: read the input from standard input, i.e. the output handed over by the | symbol
22. Mapping to a reference genome
The workshop
Mapping using BWA
bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
Two-step process in BWA
Next steps: process the BAM file: sort and index it (using samtools)
samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted
Creates a sorted BAM file (PHF8-sorted.bam)
samtools-0.1.18 index PHF8-sorted.bam
Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
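When the same steps have to be repeated for several samples, the command lines can be generated rather than typed. The Python sketch below only assembles and prints the four commands from this slide; it does not run them. The function name and its parameters are hypothetical, and the tool names and argument order are copied from the slides (bwa 0.5.9 and samtools 0.1.18). Newer samtools versions use a different syntax for sort, so the templates may need adjusting.

```python
def bwa_single_end_commands(index_prefix, fastq, sample, threads=4):
    """Return the shell commands for single-end mapping, sorting and indexing."""
    sai = fastq.rsplit(".", 1)[0] + ".sai"  # SRR058523.fastq -> SRR058523.sai
    unsorted_bam = f"{sample}-unsorted.bam"
    sorted_prefix = f"{sample}-sorted"
    return [
        # Step 1: find the SA coordinates of the reads
        f"bwa-0.5.9 aln -t {threads} {index_prefix} {fastq} > {sai}",
        # Step 2: generate SAM and convert it to BAM on the fly
        f"bwa-0.5.9 samse {index_prefix} {sai} {fastq} | "
        f"samtools-0.1.18 view -bhSo {unsorted_bam} -",
        # Step 3: sort the BAM file (samtools 0.1.18 takes an output prefix)
        f"samtools-0.1.18 sort {unsorted_bam} {sorted_prefix}",
        # Step 4: index the sorted BAM file
        f"samtools-0.1.18 index {sorted_prefix}.bam",
    ]


commands = bwa_single_end_commands(
    "/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8")
for command in commands:
    print(command)
```

Printing the commands first makes it easy to check them before executing them in a shell.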
23. Mapping to a reference genome
The workshop
BAM: what’s next?
So, now we have the sorted and indexed BAM file – what’s next?
This file is the starting point for all other analysis, depending on the application:
ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants
What are the first things?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
distribution, distribution across chromosomes, …)
24. Mapping to a reference genome
The workshop
First downstream analysis
- QC and basic statistics (how many mapped reads, quality distribution, distribution
across chromosomes, information on paired-end reads, …)
Samstat
/opt/samstat/samstat PHF8-sorted.bam
- Outputs an HTML file with statistics
25. Mapping to a reference genome
The workshop
First downstream analysis
- QC and basic statistics (how many mapped reads, quality distribution, distribution
across chromosomes, information on paired-end reads, …)
BamUtil (stats)
bam stats --in PHF8-sorted.bam --basic --phred --baseSum
Number of records read = 15732744
TotalReads(e6) 15.73
MappedReads(e6) 15.04
PairedReads(e6) 15.73
ProperPair(e6) 14.65
DuplicateReads(e6) 0.00
QCFailureReads(e6) 0.00
MappingRate(%) 95.59
PairedReads(%) 100.00
ProperPair(%) 93.11
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases(e6) 802.37
BasesInMappedReads(e6) 766.95
Quality Count
33 0
34 0
35 71373
36 0
37 0
38 203544
39 403649
40 921714
41 2081099
42 1974615
43 2285826
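The percentages in this output can be checked by hand. The short Python sketch below recomputes the mapping rate, the proper-pair rate and the mean read length from the values printed above. Because the read counts are only given in millions with two decimals, the recomputed percentages differ slightly from the reported ones.

```python
# Values copied from the bam stats output above
records_read = 15732744
stats = {
    "TotalReads(e6)": 15.73,
    "MappedReads(e6)": 15.04,
    "ProperPair(e6)": 14.65,
    "TotalBases(e6)": 802.37,
}


def percentage(part, total):
    return 100.0 * part / total


mapping_rate = percentage(stats["MappedReads(e6)"], stats["TotalReads(e6)"])
proper_pair_rate = percentage(stats["ProperPair(e6)"], stats["TotalReads(e6)"])
mean_read_length = stats["TotalBases(e6)"] * 1e6 / records_read

print(f"Mapping rate:     {mapping_rate:.2f} %")
print(f"Proper pairs:     {proper_pair_rate:.2f} %")
print(f"Mean read length: {mean_read_length:.1f} bases")
```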
27. Mapping to a reference genome
The workshop
First downstream analysis
- Think about PCR duplicates: you may want to remove them (or set a ‘flag’ in the BAM
file, indicating that a read is a duplicate)
- Samtools rmdup or Picard MarkDuplicates
- Find out how these tools work and what other flags are used in BAM files
- Can you make statistics with the BAM flags?
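One possible starting point for the last question is sketched below in Python: it reads SAM-formatted lines, looks only at the FLAG column and counts a few categories. The bit values come from the SAM specification; the two extra reads (dup1 and unm1) are made up for illustration and do not come from the workshop data. In practice, samtools flagstat reports similar counts directly from a BAM file.

```python
from collections import Counter


def flag_statistics(sam_lines):
    """Count reads per category based on the bitwise FLAG (column 2)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip empty lines and header lines
        flag = int(line.split()[1])
        counts["total"] += 1
        counts["unmapped" if flag & 0x4 else "mapped"] += 1
        if flag & 0x1:
            counts["paired"] += 1
        if flag & 0x2:
            counts["proper_pair"] += 1
        if flag & 0x400:
            counts["duplicate"] += 1
    return counts


sam_lines = [
    "@HD VN:1.3 SO:coordinate",
    "r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *",
    "r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *",
    "dup1 1024 ref 9 30 6M * 0 0 AGCTAA *",  # made-up read flagged as duplicate
    "unm1 4 * 0 0 * * 0 0 AGCTAA *",         # made-up unmapped read
]
for category, count in sorted(flag_statistics(sam_lines).items()):
    print(category, count)
```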
28. Mapping to a reference genome
The workshop
Mapping – now let’s start!
- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:
- Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
mapping quality / coverage / identification of SNPs (VCF output format)
- ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
identified (BED output, BEDgraph and/or WIG files)
- RNA-seq: assign reads to the transcripts, normalize for the length of the exon and the
number of reads in the sequencing library (= RPKM) to obtain (relative) expression levels;
identification of differentially expressed genes
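The RPKM normalization mentioned above (reads per kilobase of exon per million mapped reads) can be written as a one-line formula. The Python sketch below is a minimal illustration with made-up numbers; the function name is hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    # reads / (length in kb) / (library size in millions)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

A transcript that is twice as long, or a library that is twice as large, gives half the RPKM for the same read count, which is what makes values comparable between genes and samples.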