SlideShare a Scribd company logo
NGS data formats and analyses
Richard Orton
Viral Genomics and Bioinformatics
Centre for Virus Research
University of Glasgow
1
NGS – HTS – 1st – 2nd – 3rd Gen
• Next Generation Sequencing is now the
Current Generation Sequencing
• NGS = High-Throughput Sequencing (HTS)
• 1st Generation: The automated Sanger
sequencing method
• 2nd Generation: NGS (Illumina, Roche 454,
Ion Torrent etc)
• 3rd Generation: PacBio & Oxford Nanopore:
The Next Next Generation Sequencing.
Single molecule sequencing.
2
FASTA Format
• Text based format for representing nucleotide or protein sequences.
• Typical file extensions
– .fasta, .fas, .fa, .fna, .fsa
• Two components of a FASTA sequence
– Header line – begins with a `>` character – followed by a sequence name and an optional description
– Sequence line(s) – the sequence data either all on a single line or spread across multiple lines
• Typically, sequence data is spread across multiple lines of length 80, 70, or 60 bases
>gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
Name Description
Sequence lines
Header line `>`
3
FASTA Format
• Can have multiple sequences in a single FASTA file
• The ‘>’ character signifies the end of one sequence and the beginning of the next
one
• Sequences do not need to be the same length
>gi|123456780|gb|KM012345.1| Epizone test sequence 1
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAAT
>gi|123456781|gb|KM012346.1| Epizone test sequence 2
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAAC
>gi|123456782|gb|KM012347.1| Epizone test sequence 3
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAAC
>gi|123456783|gb|KM012348.1| Epizone test sequence 4
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGG
>gi|123456784|gb|KM012349.1| Epizone test sequence 5
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGAAAAAAAAAAAAAAAAAAAAGCAGGTCTGTCCG
4
Sequence Characters – IUPAC Codes
• The nucleic acid notation currently in use was first formalized by the
International Union of Pure and Applied Chemistry (IUPAC) in 1970.
Symbol Description Bases Represented Num
A Adenine A 1
C Cytosine C 1
G Guanine G 1
T Thymine T 1
U Uracil U 1
W Weak A T 2
S Strong C G 2
M aMino A C 2
K Keto G T 2
R puRine A G 2
Y pYrimidine C T 2
B not A (B comes after A) C G T 3
D not C (D comes after C) A G T 4
H not G (H comes after G) A C T 4
V not T (V comes after T & U) A C G 4
N Any Nucleotide A C G T 4
- Gap 0
5
FASTA Names
• GenBank, at the National Center for Biotechnology Information (NCBI), is the NIH genetic sequence
database, an annotated collection of all publicly available DNA sequences.
– Part of the International Nucleotide Sequence Database Collaboration, which also comprises the DNA
DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL).
• GI (genInfo identifier) number
– A unique integer number which identifies a particular sequence (DNA or prot)
– As of September 2016, the integer sequence identifiers known as "GIs" will no longer be included in the
GenBank
• Accession Number
– A unique alphanumeric unique identifier given to a sequence (DNA or prot)
– Accession.Version
>gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT
CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT
TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC
AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA
TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC
ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA
GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG
GI number
Description
Accession Number
Version Number
6
FASTQ Format
• Typically you will get your NGS reads in FASTQ format
– Oxford Nanopore – FAST5 format
– PacBio – HDF5 format
• FASTQ is a text-based format for storing both a nucleotide sequence and its corresponding quality
score.
• Four lines per sequence (read)
– @ - The UNIQUE sequence name
– The nucleotide sequence {A, C, G, T, N} – on 1 line
– + - The quality line break – sometimes see +SequenceName
– The quality scores {ASCII characters} – on 1 line
• The quality line is always the same length as the sequence line
– The 1st quality score corresponds to the quality of the 1st sequence
– The 14th quality score corresponds to the quality of the 14th sequence
• Multiple FASTQ sequence per file – millions
• Sequences can be different length
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ Quality scores
Sequence bases
Sequence name
Quality line break
7
FASTQ Names
• Specific to Illumina Sequences
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - FASTQ Sequence Start
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Instrument Name
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Run ID
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Flowcell ID
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Lane Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Tile Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - X-coordinate of cluster
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Y-coordinate of cluster
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - read Number (Paired 1/2)
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Filtered
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Control Number
• @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Index Sequence
• Filtered by Illumina Real Time Analysis (RTA) software during the run
– Y = is filtered (did not pass)
– N = is not filtered (did pass)
• Control number
– 0 if sequence NOT identified as control (PhiX)
– > 0 is a control sequence
• However, can be just: @SRR1553467.1 1 8
Illumina Flowcell
Surface of flow cell coated with a dense lawn of oligos
8 Lanes/Channels
2 Columns Per Lane
50 tiles per column, 100 tiles per lane (GAII)
Each column contains 50 tiles
Each tile is imaged 4 times per cycle – 1 image
per base {ACGT}
For 70bp run: 28,000 images per lane, 224,000
images per flowcell
FASTQ Quality Scores
• Each nucleotide in a sequence has an associated quality score
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
ACGTGTACACAGTCAGATGATAGCAGATAGGAAAT
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@
• FASTQ quality scores are ASCII characters
– ASCII - American Standard Code for Information Interchange
– letters {A-Z, a-z}, numbers {0-9}, special characters {@ £ # $ % ^ & * ( ) ,
. ? “ /  | ~ ` { } }
– Each character has a numeric whole number associated with it
• When people speak of quality scores, it is typically in the region
of 0 to 40
– For example, trimming the ends of reads below Q25
• To translate from ASCII characters to Phred scale quality scores
– YOU SUBTRACT 33 (-33)
• But … what does a quality score actually mean???
10
Q-Symbol Q-ASCII Q-Score
I 73 40
H 72 39
G 71 38
F 70 37
E 69 36
D 68 35
C 67 34
B 66 33
A 65 32
@ 64 31
? 63 30
> 62 29
= 61 28
< 60 27
; 59 26
: 58 25
9 57 24
8 56 23
7 55 22
6 54 21
5 53 20
4 52 19
3 51 18
2 50 17
1 49 16
0 48 15
/ 47 14
. 46 13
- 45 12
, 44 11
+ 43 10
* 42 9
) 41 8
( 40 7
' 39 6
& 38 5
% 37 4
$ 36 3
# 35 2
" 34 1
! 33 0
Q = Probability of Error
Q-Symbol Q-ASCII Q-Score P-Error
I 73 40 0.00010
H 72 39 0.00013
G 71 38 0.00016
F 70 37 0.00020
E 69 36 0.00025
D 68 35 0.00032
C 67 34 0.00040
B 66 33 0.00050
A 65 32 0.00063
@ 64 31 0.00079
? 63 30 0.00100
> 62 29 0.00126
= 61 28 0.00158
< 60 27 0.00200
; 59 26 0.00251
: 58 25 0.00316
9 57 24 0.00398
8 56 23 0.00501
7 55 22 0.00631
6 54 21 0.00794
5 53 20 0.01000
4 52 19 0.01259
3 51 18 0.01585
2 50 17 0.01995
1 49 16 0.02512
0 48 15 0.03162
/ 47 14 0.03981
. 46 13 0.05012
- 45 12 0.06310
, 44 11 0.07943
+ 43 10 0.10000
* 42 9 0.12589
) 41 8 0.15849
( 40 7 0.19953
' 39 6 0.25119
& 38 5 0.31623
% 37 4 0.39811
$ 36 3 0.50119
# 35 2 0.63096
" 34 1 0.79433
! 33 0 1.00000
Q40 is not the maximum quality score
• A Phred quality score is a measure of the quality of the
identification of the nucleotide generated by automated DNA
sequencing.
• The quality score is associated to a probability of error – the
probability of an incorrect base call
• The calculation takes into account the ambiguity of the signal
for the respective base as well as the quality of neighboring
bases and the quality of the entire read.
• P = 10(-Q/10)
• Q40 = 1 in 10,000 error rate
• Q = - 10 x log10(P)
• If you have high coverage (~30,000) you expect errors even if
everything was the best quality i.e you would expect 3 errors on
average.
Read Quality Scores
• Quality scores tend to decrease along the read
• The read represents the sequence of a Cluster
– Cluster is comprised of ~1,000 DNA molecules
• As the sequencing progresses more and more of the DNA
molecules in the cluster get out of sync
– Phasing: falling behind: missing and incorporation cycle,
incomplete removal of the 3’ terminators/fluorophores
– Pre-phasing: jumping ahead: incorporation of multiple bases in a
cycle due to NTs without effective 3’ blocking
• Proportion of sequences in each cluster which are affected
by Phasing & Pre-Phasing increases with cycle number –
hampering correct base identification.
Q-Score P-error
40 0.0001
39 0.000125893
38 0.000158489
37 0.000199526
36 0.000251189
35 0.000316228
34 0.000398107
33 0.000501187
32 0.000630957
31 0.000794328
30 0.001
Trim low quality
ends off reads
Delete poor quality
reads overall
FASTQC
FASTX
ConDeTri
Illumina Quality Scores
• Quality scores tend to decrease along the read
• The read represents the sequence of a Cluster
– Cluster is comprised of ~1,000 DNA molecules
• As the sequencing progresses more and more of the DNA
molecules in the cluster get out of sync
– Phasing: falling behind: missing and incorporation cycle,
incomplete removal of the 3’ terminators/fluorophores
– Pre-phasing: jumping ahead: incorporation of multiple bases in a
cycle due to NTs without effective 3’ blocking
• Proportion of sequences in each cluster which are affected
by Phasing & Pre-Phasing increases with cycle number –
hampering correct base identification.
– A/C and G/ T fluorophore spectra overlap
– T accumulation – in certain chemistries – due to a lower removal
rate of T fluorophores
Viral Reference Genome
Sequencing
HTS Reads
Genome
Mapping
Aligning Reads to a Reference
Viral
DNA
Short Read Alignment
• Find which bit of the genome the short read has come from
– Given a reference genome and a short read, find the part of the
reference genome that is identical (or most similar) to the read
• It is basically a String matching problem
– BLAST – use hash tables
– MUMmer – use suffix trees
– Bowtie – use Burrows-Wheeler Transform
• There are now many alignment tools for short reads:
– BWA, Bowtie, MAQ, Novoalign, Stampy … … … … …
– Lots of options such as maximum no of mismatches between read/ref,
gap penalties etc
• Aligning millions of reads can be time consuming process.
• So you want to save the alignments for each read to a file:
– So the process doesn’t need to be repeated
– So you can view the results
– In a generic format that other tools can utilise (variant calling etc)
SAM Header
• The first section of the SAM file is called the header.
• All header lines start with an @
• HD – The Header Line – 1st line
– SO – Sorting order of alignments (unknown (default), unsorted, queryname and coordinate)
• SQ - Reference sequence dictionary.
– SN - Reference sequence name.
– LN - Reference sequence length.
• PG – Program
– ID – The program ID
– PN – The program name
– VN – Program version number
– CL – The Command actually used to create the SAM file
• RG – Read Group
– "a set of reads that were all the product of a single sequencing run on one lane"
16
@HDVN:1.0 SO:coordinate
@SQ SN:KM034562.G3686.1 LN:18957
@PG ID:bowtie2PN:bowtie2 VN:2.2.4 CL:"/usr/bin/../lib/bowtie2/bin/bowtie2-align-s --wrapper basic-
0 --local -x ../ref/ebov -S ebov1.sam -1 ebov1_1.fastq -2 ebov1_2.fastq"
SAM File Format
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUALITY
Read1 10 FMDVgenome 3537 57 70M CCAGTACGTA >AAA=>?AA>
• SAM (Sequence Alignment/Map format) data files are outputted from aligners that read FASTQ files and
assign sequences/reads to a position with respect to a known reference genome.
– Readable Text format – tab delimited
– Each line contains alignment information for a read to the reference
• Each line contains:
– QNAME: Read Name
– FLAG: Info on if the read is mapped, part of a pair, strand etc
– RNAME: Reference Sequence Name that the read aligns to
– POS: Leftmost position of where this alignment maps to the reference
– MAPQ: Mapping quality of read to reference (phred scale P that mapping is wrong)
– CIGAR: Compact Idiosyncratic Gapped Alignment Report: 50M, 30M1I29M
– RNEXT: Paired Mate Read Name
– PNEXT: Paired Mate Position
– TLEN: Template length/Insert Size (difference in outer co-ordinates of paired reads)
– SEQ: The actual read DNA sequence
– QUAL: ASCII Phred quality scores (+33)
– TAGS: Optional data – Lots of options e.g. MD=String for mismatches
SAM FLAGS
• Example Flag = 99
• 99 < 128, 256, 512, 1024, 2048 … so all these flags don’t apply
• 99 > 64 … {99 – 64 = 35} … This read is the first in a pair
• 35 > 32 … {35 – 32 = 3} … The paired mate read is on the reverse strand
• 3 < 16 … {3} … This read is not on the reverse strand
• 3 < 8 … {3} … The paired mate read is not unmapped
• 3 > 2 … {3 – 2 = 1} … The read is mapped in proper pair
• 1 = 1 … {1 – 1 = 0} … The read is paired 18
# Flag Description
1 1 Read paired
2 2 Read mapped in proper pair
3 4 Read unmapped
4 8 Mate unmapped
5 16 Read reverse strand
6 32 Mate reverse strand
7 64 First in pair
8 128 Second in pair
9 256 Not primary alignment
10 512 Read fails platform/vendor quality checks
11 1024 Read is PCR or optical duplicate
12 2048 Supplementary alignment
8 > (4+2+1) = 7
16 > (8+4+2+1) = 15
32 > (16+8+4+2+1) = 31
etc
19
• H : Hard Clipping: the clipped nucleotides have been removed from the read
• S : Soft Clipping: the clipped nucleotides are still present in the read
• M : Match: the nucleotides align to the reference – BUT the nucleotide can be an alignment match or mismatch.
• X : Mismatch: the nucleotide aligns to the reference but is a mismatch
• = : Match: the nucleotide aligns to the reference and is a match to it
• I : Insertion: the nucleotide(s) is present in the read but not in the reference
• D : Deletion; the nucleotide(s) is present in the reference but not in the read
• N : Skipped region in reference (for mRNA-to-genome alignment, an N operation represents an intron)
• P: Padding (silent deletion from the padded reference sequence) – not a true deletion
CIGAR String
• 8M2I4M1D3M
• 8Matches
• 2 insertions
• 4 Matches
• 1 Deletion
• 3 Matches
BAM File
• SAM (Sequence Alignment/Map) files are typically converted to
Binary Alignment/Map (BAM) format
• They are binary
– Can NOT be opened like a text file
– Compressed = Smaller (great for server storage)
– Sorted and indexed = Daster
• Once you have a file in a BAM format you can delete your aligned
read files
– You can recover the FASTQ reads from the BAM format
– Delete the aligned reads – trimmed/filtered reads
– Do not delete the RAW reads – zip them and save them
20
Variant Call Format
• Once reads are aligned to a reference, you may want to call
mutations and indels to the reference
• Many tools output this data in the VCF (Variant Call Format)
• Text file – tab delimited
21
Variant Call Format
22
CHROM: Chromosome Name (i.e. viral genome name)
POS: Position the viral genome
ID: Name of the known variant from DB
REF: The reference allele (i.e. the reference base)
ALT: The alternate allele (i.e. the variant/mutation observed)
QUAL: The quality of the variant on a Phred scale
FILTER: Did the variant pass the tools filters
INFO: Information
DP = Depth
AF = Alternate Allele Frequency
SB = Strand Bias P-value
DP4 = Extra Depth Information
(forward ref; reverse ref; forward non-ref; reverse non-ref)
Acknowledgements
• Viral Genomics and
Bioinformatics group
– Centre for Virus Research
– University of Glasgow
23

More Related Content

What's hot

Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
Subhranil Bhattacharjee
 
Pairwise sequence alignment
Pairwise sequence alignmentPairwise sequence alignment
Pairwise sequence alignmentavrilcoghlan
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
Athira RG
 
Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)
LOGESWARAN KA
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
ishi tandon
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysis
saberhussain9
 
BLAST
BLASTBLAST
Structural genomics
Structural genomicsStructural genomics
Structural genomics
Vaibhav Maurya
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
Maté Ongenaert
 
Proteins databases
Proteins databasesProteins databases
Proteins databases
Hafiz Muhammad Zeeshan Raza
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
naveed ul mushtaq
 
Prosite
PrositeProsite
Introduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbjIntroduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbj
KAUSHAL SAHU
 
UniProt
UniProtUniProt
UniProt
AmnaA7
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
Karan Veer Singh
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
The Oxford College Engineering
 
Uni prot presentation
Uni prot presentationUni prot presentation
Uni prot presentation
Rida Khalid
 

What's hot (20)

RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Pairwise sequence alignment
Pairwise sequence alignmentPairwise sequence alignment
Pairwise sequence alignment
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)
 
Est database
Est databaseEst database
Est database
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysis
 
BLAST
BLASTBLAST
BLAST
 
Structural genomics
Structural genomicsStructural genomics
Structural genomics
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
Proteins databases
Proteins databasesProteins databases
Proteins databases
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Prosite
PrositeProsite
Prosite
 
Introduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbjIntroduction to ncbi, embl, ddbj
Introduction to ncbi, embl, ddbj
 
UniProt
UniProtUniProt
UniProt
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
Fasta
FastaFasta
Fasta
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
 
Uni prot presentation
Uni prot presentationUni prot presentation
Uni prot presentation
 

Similar to NGS data formats and analyses

Gwas.emes.comp
Gwas.emes.compGwas.emes.comp
Gwas.emes.comp
Richard Emes
 
rnaseq_from_babelomics
rnaseq_from_babelomicsrnaseq_from_babelomics
rnaseq_from_babelomicsFrancisco Garc
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
Li Shen
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
Dan Gaston
 
sequencea.ppt
sequencea.pptsequencea.ppt
sequencea.ppt
olusolaogunyewo1
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
cursoNGS
 
MagneTag Presentation Winter Quarter V3.0
MagneTag Presentation Winter Quarter V3.0MagneTag Presentation Winter Quarter V3.0
MagneTag Presentation Winter Quarter V3.0John-Paul Petersen
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
Li Shen
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Thermo Fisher Scientific
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
Eli Kaminuma
 
Key recovery attacks against commercial white-box cryptography implementation...
Key recovery attacks against commercial white-box cryptography implementation...Key recovery attacks against commercial white-box cryptography implementation...
Key recovery attacks against commercial white-box cryptography implementation...
CODE BLUE
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Prof. Wim Van Criekinge
 
RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
BITS
 
Wt 3000 btts datasheet
Wt 3000 btts datasheetWt 3000 btts datasheet
Wt 3000 btts datasheet
Bruce Petty
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014
Prof. Wim Van Criekinge
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
alizain9604
 
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Lucidworks
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Hong ChangBum
 
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
Kate Barlow
 

Similar to NGS data formats and analyses (20)

Gwas.emes.comp
Gwas.emes.compGwas.emes.comp
Gwas.emes.comp
 
rnaseq_from_babelomics
rnaseq_from_babelomicsrnaseq_from_babelomics
rnaseq_from_babelomics
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
sequencea.ppt
sequencea.pptsequencea.ppt
sequencea.ppt
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
MagneTag Presentation Winter Quarter V3.0
MagneTag Presentation Winter Quarter V3.0MagneTag Presentation Winter Quarter V3.0
MagneTag Presentation Winter Quarter V3.0
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
Key recovery attacks against commercial white-box cryptography implementation...
Key recovery attacks against commercial white-box cryptography implementation...Key recovery attacks against commercial white-box cryptography implementation...
Key recovery attacks against commercial white-box cryptography implementation...
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
 
RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
Wt 3000 btts datasheet
Wt 3000 btts datasheetWt 3000 btts datasheet
Wt 3000 btts datasheet
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
 
Galaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo ProtocolGalaxy RNA-Seq Analysis: Tuxedo Protocol
Galaxy RNA-Seq Analysis: Tuxedo Protocol
 
Macs course
Macs courseMacs course
Macs course
 
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...
 

Recently uploaded

RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
AADYARAJPANDEY1
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
ssuserbfdca9
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
sonaliswain16
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
anitaento25
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
muralinath2
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 

Recently uploaded (20)

RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
role of pramana in research.pptx in science
role of pramana in research.pptx in sciencerole of pramana in research.pptx in science
role of pramana in research.pptx in science
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 

NGS data formats and analyses

  • 1. NGS data formats and analyses Richard Orton Viral Genomics and Bioinformatics Centre for Virus Research University of Glasgow 1
  • 2. NGS – HTS – 1st – 2nd – 3rd Gen • Next Generation Sequencing is now the Current Generation Sequencing • NGS = High-Throughput Sequencing (HTS) • 1st Generation: The automated Sanger sequencing method • 2nd Generation: NGS (Illumina, Roche 454, Ion Torrent etc) • 3rd Generation: PacBio & Oxford Nanopore: The Next Next Generation Sequencing. Single molecule sequencing. 2
  • 3. FASTA Format • Text based format for representing nucleotide or protein sequences. • Typical file extensions – .fasta, .fas, .fa, .fna, .fsa • Two components of a FASTA sequence – Header line – begins with a `>` character – followed by a sequence name and an optional description – Sequence line(s) – the sequence data either all on a single line or spread across multiple lines • Typically, sequence data is spread across multiple lines of length 80, 70, or 60 bases >gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG Name Description Sequence lines Header line `>` 3
  • 4. FASTA Format • Can have multiple sequences in a single FASTA file • The ‘>’ character signifies the end of one sequence and the beginning of the next one • Sequences do not need to be the same length >gi|123456780|gb|KM012345.1| Epizone test sequence 1 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA TTTTCCTCTCATTGAAATTTATATCGGAATTTAAAT >gi|123456781|gb|KM012346.1| Epizone test sequence 2 CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAAC >gi|123456782|gb|KM012347.1| Epizone test sequence 3 AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAAC >gi|123456783|gb|KM012348.1| Epizone test sequence 4 ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGG >gi|123456784|gb|KM012349.1| Epizone test sequence 5 ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGAAAAAAAAAAAAAAAAAAAAGCAGGTCTGTCCG 4
  • 5. Sequence Characters – IUPAC Codes • The nucleic acid notation currently in use was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970. Symbol Description Bases Represented Num A Adenine A 1 C Cytosine C 1 G Guanine G 1 T Thymine T 1 U Uracil U 1 W Weak A T 2 S Strong C G 2 M aMino A C 2 K Keto G T 2 R puRine A G 2 Y pYrimidine C T 2 B not A (B comes after A) C G T 3 D not C (D comes after C) A G T 4 H not G (H comes after G) A C T 4 V not T (V comes after T & U) A C G 4 N Any Nucleotide A C G T 4 - Gap 0 5
  • 6. FASTA Names • GenBank, at the National Center for Biotechnology Information (NCBI), is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. – Part of the International Nucleotide Sequence Database Collaboration, which also comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL). • GI (genInfo identifier) number – A unique integer number which identifies a particular sequence (DNA or prot) – As of September 2016, the integer sequence identifiers known as "GIs" will no longer be included in the GenBank • Accession Number – A unique alphanumeric unique identifier given to a sequence (DNA or prot) – Accession.Version >gi|661348725|gb|KM034562.1| Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-G3686.1, complete genome CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA TTTTCCTCTCATTGAAATTTATATCGGAATTTAAATTGAAATTGTTACTGTAATCATACCTGGTTTGTTT CAGAGCCATATCACCAAGATAGAGAACAACCTAGGTCTCCGGAGGGGGCAAGGGCATCAGTGTGCTCAGT TGAAAATCCCTTGTCAACATCTAGGCCTTATCACATCACAAGTTCCGCCTTAAACTCTGCAGGGTGATCC AACAACCTTAATAGCAACATTATTGTTAAAGGACAGCATTAGTTCACAGTCAAACAAGCAAGATTGAGAA TTAACTTTGATTTTGAACCTGAACACCCAGAGGACTGGAGACTCAACAACCCTAAAGCCTGGGGTAAAAC ATTAGAAATAGTTTAAAGACAAATTGCTCGGAATCACAAAATTCCGAGTATGGATTCTCGTCCTCAGAAA GTCTGGATGACGCCGAGTCTCACTGAATCTGACATGGATTACCACAAGATCTTGACAGCAGGTCTGTCCG GI number Description Accession Number Version Number 6
  • 7. FASTQ Format • Typically you will get your NGS reads in FASTQ format – Oxford Nanopore – FAST5 format – PacBio – HDF5 format • FASTQ is a text-based format for storing both a nucleotide sequence and its corresponding quality score. • Four lines per sequence (read) – @ - The UNIQUE sequence name – The nucleotide sequence {A, C, G, T, N} – on 1 line – + - The quality line break – sometimes see +SequenceName – The quality scores {ASCII characters} – on 1 line • The quality line is always the same length as the sequence line – The 1st quality score corresponds to the quality of the 1st sequence – The 14th quality score corresponds to the quality of the 14th sequence • Multiple FASTQ sequence per file – millions • Sequences can be different length @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ Quality scores Sequence bases Sequence name Quality line break 7
  • 8. FASTQ Names • Specific to Illumina Sequences • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - FASTQ Sequence Start • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Instrument Name • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Run ID • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Flowcell ID • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Lane Number • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Tile Number • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - X-coordinate of cluster • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Y-coordinate of cluster • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - read Number (Paired 1/2) • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG - Filtered • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Control Number • @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG – Index Sequence • Filtered by Illumina Real Time Analysis (RTA) software during the run – Y = is filtered (did not pass) – N = is not filtered (did pass) • Control number – 0 if sequence NOT identified as control (PhiX) – > 0 is a control sequence • However, can be just: @SRR1553467.1 1 8
  • 9. Illumina Flowcell Surface of flow cell coated with a dense lawn of oligos 8 Lanes/Channels 2 Columns Per Lane 50 tiles per column, 100 tiles per lane (GAII) Each column contains 50 tiles Each tile is imaged 4 times per cycle – 1 image per base {ACGT} For 70bp run: 28,000 images per lane, 224,000 images per flowcell
  • 10. FASTQ Quality Scores • Each nucleotide in a sequence has an associated quality score @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG ACGTGTACACAGTCAGATGATAGCAGATAGGAAAT + BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@ • FASTQ quality scores are ASCII characters – ASCII - American Standard Code for Information Interchange – letters {A-Z, a-z}, numbers {0-9}, special characters {@ £ # $ % ^ & * ( ) , . ? “ / | ~ ` { } } – Each character has a numeric whole number associated with it • When people speak of quality scores, it is typically in the region of 0 to 40 – For example, trimming the ends of reads below Q25 • To translate from ASCII characters to Phred scale quality scores – YOU SUBTRACT 33 (-33) • But … what does a quality score actually mean??? 10 Q-Symbol Q-ASCII Q-Score I 73 40 H 72 39 G 71 38 F 70 37 E 69 36 D 68 35 C 67 34 B 66 33 A 65 32 @ 64 31 ? 63 30 > 62 29 = 61 28 < 60 27 ; 59 26 : 58 25 9 57 24 8 56 23 7 55 22 6 54 21 5 53 20 4 52 19 3 51 18 2 50 17 1 49 16 0 48 15 / 47 14 . 46 13 - 45 12 , 44 11 + 43 10 * 42 9 ) 41 8 ( 40 7 ' 39 6 & 38 5 % 37 4 $ 36 3 # 35 2 " 34 1 ! 33 0
  • 11. Q = Probability of Error Q-Symbol Q-ASCII Q-Score P-Error I 73 40 0.00010 H 72 39 0.00013 G 71 38 0.00016 F 70 37 0.00020 E 69 36 0.00025 D 68 35 0.00032 C 67 34 0.00040 B 66 33 0.00050 A 65 32 0.00063 @ 64 31 0.00079 ? 63 30 0.00100 > 62 29 0.00126 = 61 28 0.00158 < 60 27 0.00200 ; 59 26 0.00251 : 58 25 0.00316 9 57 24 0.00398 8 56 23 0.00501 7 55 22 0.00631 6 54 21 0.00794 5 53 20 0.01000 4 52 19 0.01259 3 51 18 0.01585 2 50 17 0.01995 1 49 16 0.02512 0 48 15 0.03162 / 47 14 0.03981 . 46 13 0.05012 - 45 12 0.06310 , 44 11 0.07943 + 43 10 0.10000 * 42 9 0.12589 ) 41 8 0.15849 ( 40 7 0.19953 ' 39 6 0.25119 & 38 5 0.31623 % 37 4 0.39811 $ 36 3 0.50119 # 35 2 0.63096 " 34 1 0.79433 ! 33 0 1.00000 Q40 is not the maximum quality score • A Phred quality score is a measure of the quality of the identification of the nucleotide generated by automated DNA sequencing. • The quality score is associated to a probability of error – the probability of an incorrect base call • The calculation takes into account the ambiguity of the signal for the respective base as well as the quality of neighboring bases and the quality of the entire read. • P = 10(-Q/10) • Q40 = 1 in 10,000 error rate • Q = - 10 x log10(P) • If you have high coverage (~30,000) you expect errors even if everything was the best quality i.e you would expect 3 errors on average.
  • 12. Read Quality Scores • Quality scores tend to decrease along the read • The read represents the sequence of a Cluster – Cluster is comprised of ~1,000 DNA molecules • As the sequencing progresses more and more of the DNA molecules in the cluster get out of sync – Phasing: falling behind: missing and incorporation cycle, incomplete removal of the 3’ terminators/fluorophores – Pre-phasing: jumping ahead: incorporation of multiple bases in a cycle due to NTs without effective 3’ blocking • Proportion of sequences in each cluster which are affected by Phasing & Pre-Phasing increases with cycle number – hampering correct base identification. Q-Score P-error 40 0.0001 39 0.000125893 38 0.000158489 37 0.000199526 36 0.000251189 35 0.000316228 34 0.000398107 33 0.000501187 32 0.000630957 31 0.000794328 30 0.001 Trim low quality ends off reads Delete poor quality reads overall FASTQC FASTX ConDeTri
  • 13. Illumina Quality Scores • Quality scores tend to decrease along the read • The read represents the sequence of a Cluster – Cluster is comprised of ~1,000 DNA molecules • As the sequencing progresses more and more of the DNA molecules in the cluster get out of sync – Phasing: falling behind: missing and incorporation cycle, incomplete removal of the 3’ terminators/fluorophores – Pre-phasing: jumping ahead: incorporation of multiple bases in a cycle due to NTs without effective 3’ blocking • Proportion of sequences in each cluster which are affected by Phasing & Pre-Phasing increases with cycle number – hampering correct base identification. – A/C and G/ T fluorophore spectra overlap – T accumulation – in certain chemistries – due to a lower removal rate of T fluorophores
  • 14. Viral Reference Genome Sequencing HTS Reads Genome Mapping Aligning Reads to a Reference Viral DNA
  • 15. Short Read Alignment • Find which bit of the genome the short read has come from – Given a reference genome and a short read, find the part of the reference genome that is identical (or most similar) to the read • It is basically a String matching problem – BLAST – use hash tables – MUMmer – use suffix trees – Bowtie – use Burrows-Wheeler Transform • There are now many alignment tools for short reads: – BWA, Bowtie, MAQ, Novoalign, Stampy … … … … … – Lots of options such as maximum no of mismatches between read/ref, gap penalties etc • Aligning millions of reads can be time consuming process. • So you want to save the alignments for each read to a file: – So the process doesn’t need to be repeated – So you can view the results – In a generic format that other tools can utilise (variant calling etc)
  • 16. SAM Header • The first section of the SAM file is called the header. • All header lines start with an @ • HD – The Header Line – 1st line – SO – Sorting order of alignments (unknown (default), unsorted, queryname and coordinate) • SQ - Reference sequence dictionary. – SN - Reference sequence name. – LN - Reference sequence length. • PG – Program – ID – The program ID – PN – The program name – VN – Program version number – CL – The Command actually used to create the SAM file • RG – Read Group – "a set of reads that were all the product of a single sequencing run on one lane" 16 @HDVN:1.0 SO:coordinate @SQ SN:KM034562.G3686.1 LN:18957 @PG ID:bowtie2PN:bowtie2 VN:2.2.4 CL:"/usr/bin/../lib/bowtie2/bin/bowtie2-align-s --wrapper basic- 0 --local -x ../ref/ebov -S ebov1.sam -1 ebov1_1.fastq -2 ebov1_2.fastq"
  • 17. SAM File Format QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUALITY Read1 10 FMDVgenome 3537 57 70M CCAGTACGTA >AAA=>?AA> • SAM (Sequence Alignment/Map format) data files are outputted from aligners that read FASTQ files and assign sequences/reads to a position with respect to a known reference genome. – Readable Text format – tab delimited – Each line contains alignment information for a read to the reference • Each line contains: – QNAME: Read Name – FLAG: Info on if the read is mapped, part of a pair, strand etc – RNAME: Reference Sequence Name that the read aligns to – POS: Leftmost position of where this alignment maps to the reference – MAPQ: Mapping quality of read to reference (phred scale P that mapping is wrong) – CIGAR: Compact Idiosyncratic Gapped Alignment Report: 50M, 30M1I29M – RNEXT: Paired Mate Read Name – PNEXT: Paired Mate Position – TLEN: Template length/Insert Size (difference in outer co-ordinates of paired reads) – SEQ: The actual read DNA sequence – QUAL: ASCII Phred quality scores (+33) – TAGS: Optional data – Lots of options e.g. MD=String for mismatches
  • 18. SAM FLAGS • Example Flag = 99 • 99 < 128, 256, 512, 1024, 2048 … so all these flags don’t apply • 99 > 64 … {99 – 64 = 35} … This read is the first in a pair • 35 > 32 … {35 – 32 = 3} … The paired mate read is on the reverse strand • 3 < 16 … {3} … This read is not on the reverse strand • 3 < 8 … {3} … The paired mate read is not unmapped • 3 > 2 … {3 – 2 = 1} … The read is mapped in proper pair • 1 = 1 … {1 – 1 = 0} … The read is paired 18 # Flag Description 1 1 Read paired 2 2 Read mapped in proper pair 3 4 Read unmapped 4 8 Mate unmapped 5 16 Read reverse strand 6 32 Mate reverse strand 7 64 First in pair 8 128 Second in pair 9 256 Not primary alignment 10 512 Read fails platform/vendor quality checks 11 1024 Read is PCR or optical duplicate 12 2048 Supplementary alignment 8 > (4+2+1) = 7 16 > (8+4+2+1) = 15 32 > (16+8+4+2+1) = 31 etc
  • 19. 19 • H : Hard Clipping: the clipped nucleotides have been removed from the read • S : Soft Clipping: the clipped nucleotides are still present in the read • M : Match: the nucleotides align to the reference – BUT the nucleotide can be an alignment match or mismatch. • X : Mismatch: the nucleotide aligns to the reference but is a mismatch • = : Match: the nucleotide aligns to the reference and is a match to it • I : Insertion: the nucleotide(s) is present in the read but not in the reference • D : Deletion; the nucleotide(s) is present in the reference but not in the read • N : Skipped region in reference (for mRNA-to-genome alignment, an N operation represents an intron) • P: Padding (silent deletion from the padded reference sequence) – not a true deletion CIGAR String • 8M2I4M1D3M • 8Matches • 2 insertions • 4 Matches • 1 Deletion • 3 Matches
  • 20. BAM File • SAM (Sequence Alignment/Map) files are typically converted to Binary Alignment/Map (BAM) format • They are binary – Can NOT be opened like a text file – Compressed = Smaller (great for server storage) – Sorted and indexed = Daster • Once you have a file in a BAM format you can delete your aligned read files – You can recover the FASTQ reads from the BAM format – Delete the aligned reads – trimmed/filtered reads – Do not delete the RAW reads – zip them and save them 20
  • 21. Variant Call Format • Once reads are aligned to a reference, you may want to call mutations and indels to the reference • Many tools output this data in the VCF (Variant Call Format) • Text file – tab delimited 21
  • 22. Variant Call Format 22 CHROM: Chromosome Name (i.e. viral genome name) POS: Position the viral genome ID: Name of the known variant from DB REF: The reference allele (i.e. the reference base) ALT: The alternate allele (i.e. the variant/mutation observed) QUAL: The quality of the variant on a Phred scale FILTER: Did the variant pass the tools filters INFO: Information DP = Depth AF = Alternate Allele Frequency SB = Strand Bias P-value DP4 = Extra Depth Information (forward ref; reverse ref; forward non-ref; reverse non-ref)
  • 23. Acknowledgements • Viral Genomics and Bioinformatics group – Centre for Virus Research – University of Glasgow 23