Data formats

Computational Biology and Genomics Facility, Indian Veterinary Research Institute

Data formats used in data processing
• Sequence formats: FASTA, FASTQ, SRF, SFF, SCARF, AB1
• Other sequence formats: GCG, IG, EMBL
• Sequence alignment formats: SAM, BAM, CRAM
• Visualisation formats: WIG, BED, GFF, GTF

I. Read / Sequence Formats
1. FASTA File Format
2. FASTQ File Format
Each record in a FASTQ file spans four lines:
1. The sequence identifier, which begins with the '@' character
2. The raw sequence read
3. A separator line that begins with the '+' character and may optionally repeat the identifier
4. The quality scores for each position along the read
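
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the function name read_fastq and the example record are hypothetical and serve only as illustration.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("identifier line must begin with '@'")
        if not separator.startswith("+"):
            raise ValueError("third line must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("one quality character is needed per base")
        yield header[1:], sequence, quality

# Hypothetical record, for illustration only.
example = "@read1\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
print(records)
```

The length check reflects point 4 of the list: the quality line carries one character per position of the read.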

The quality score (Q) is related to the probability of calling an incorrect
base. Phred quality scores are used for assessment of sequence quality,
recognition and removal of low-quality sequence and determination of
accurate consensus sequences.
Q = −10 log10(P)
where P is the probability of calling the incorrect base.
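
The relation between Q and P can be checked numerically. The sketch below simply applies the formula in both directions; the function names are illustrative and not part of any library.

```python
import math

def phred_to_error_probability(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (0, 10, 20, 30, 40):
    p = phred_to_error_probability(q)
    accuracy = (1 - p) * 100
    print(f"Q={q:2d}  P={p:g}  accuracy={accuracy:.2f}%")
```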
PHRED Score    Probability of Incorrect Base call    Accuracy of Base call
0              1 in 1                                0%
10             1 in 10                               90%
20             1 in 100                              99%
30             1 in 1000                             99.9%
40             1 in 10000                            99.99%

PHRED score encoding
The quality scores are encoded as ASCII characters:
Sanger: Phred + 33
Illumina: Phred + 64 (pipeline versions 1.3 to 1.7; Illumina 1.8 and later use Phred + 33)
Example: If the quality character of a base is 'd' (ASCII code 100) in the Illumina encoding, what is the quality score, what is P, and what is the encoded value on the Sanger scale?
−10 log10(P) + 64 = 100
−10 log10(P) = 36, i.e. Q = 36
log10(P) = −3.6
P = 10^(−3.6) = 0.000251
On the Sanger scale the encoded value is Q + 33 = 36 + 33 = 69, i.e. the ASCII character 'E'.
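
The worked example can be reproduced in a few lines. This is a sketch under the offsets stated in this section (64 for the older Illumina encoding, 33 for Sanger); the helper names are hypothetical.

```python
ILLUMINA_OFFSET = 64  # Phred + 64
SANGER_OFFSET = 33    # Phred + 33

def decode_quality(char, offset):
    """Quality character -> Phred score for the given ASCII offset."""
    return ord(char) - offset

def encode_quality(q, offset):
    """Phred score -> quality character for the given ASCII offset."""
    return chr(q + offset)

q = decode_quality("d", ILLUMINA_OFFSET)        # 100 - 64
p = 10 ** (-q / 10)                             # error probability
sanger_char = encode_quality(q, SANGER_OFFSET)  # 36 + 33 -> chr(69)
print(q, p, ord(sanger_char), sanger_char)
```

The quality score itself does not change between scales; only the character used to store it does.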

Alignment formats
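1. SAM (Sequence Alignment/Map) format
Header lines (as annotated on the slide):
• @HD: VN, the format version (digits 0-9); SO, the sorting order (sorted or unsorted)
• @SQ: SN, the reference sequence name; LN, the reference sequence length
• @PG: ID, the program ID; PN, the program name; VN, the program version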
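
A SAM header line is tab-separated: a record type such as @HD, @SQ or @PG, followed by TAG:VALUE pairs. The sketch below splits such lines into a dictionary. The header values (aligner name, reference length) are hypothetical, and comment lines (@CO), which carry free text rather than tags, are not handled.

```python
def parse_header_line(line):
    """Split one SAM header line into (record type, {TAG: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags

# Hypothetical header, for illustration only.
header_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:NDV\tLN:15186",
    "@PG\tID:aligner\tPN:aligner\tVN:0.1",
]
parsed = [parse_header_line(line) for line in header_lines]
for record_type, tags in parsed:
    print(record_type, tags)
```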

Apart from the header lines, which start with the ‘@’ symbol, each
alignment line consists of:
• QNAME: Query template/pair NAME
• FLAG: bitwise FLAG
• RNAME: Reference sequence NAME
• POS: 1-based leftmost POSition/coordinate of clipped sequence
• MAPQ: MAPping Quality (Phred-scaled)
• CIGAR: extended CIGAR string
• MRNM: Mate Reference sequence NaMe (‘=’ if same as RNAME); called RNEXT in the current specification
• MPOS: 1-based Mate POSition; called PNEXT in the current specification
• TLEN: inferred Template LENgth (insert size)
• SEQ: query SEQuence on the same strand as the reference
• QUAL: query QUALity (ASCII-33 gives the Phred base quality)
• OPT: variable OPTional fields in the format TAG:VTYPE:VALUE
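
The field list above maps directly onto a tab-separated line. The sketch below splits one alignment line into the eleven mandatory fields plus the optional TAG:VTYPE:VALUE fields. The record itself is hypothetical, and the attribute names follow the names used in the list, not any particular library.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    tlen: int
    seq: str
    qual: str
    optional: dict

def parse_alignment_line(line):
    """Parse one SAM alignment line (not a header line) into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    optional = {}
    for item in fields[11:]:
        tag, vtype, value = item.split(":", 2)
        optional[tag] = (vtype, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], optional,
    )

# Hypothetical alignment line, for illustration only.
line = "read1\t16\tNDV\t8617\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
record = parse_alignment_line(line)
base_qualities = [ord(c) - 33 for c in record.qual]  # ASCII-33 gives Phred
print(record.rname, record.pos, record.cigar, base_qualities)
```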
Note: The details of the SAM format given below are taken
from the document available at https://github.com/samtools/hts-specs or
at https://samtools.github.io/hts-specs/SAMv1.pdf
i. QNAME: Query template NAME. Reads/segments having identical QNAME
are regarded to come from the same template. A QNAME ‘*’ indicates the
information is unavailable.
I. Single end reads
• Query name: Ion Torrent specific

Our score    Bit      Description
0            0x1      template having multiple segments in sequencing
0            0x2      each segment properly aligned according to the aligner
0            0x4      segment unmapped
0            0x8      next segment in the template unmapped
1            0x10     SEQ being reverse complemented
0            0x20     SEQ of the next segment in the template being reverse complemented
0            0x40     the first segment in the template
0            0x80     the last segment in the template
0            0x100    secondary alignment
0            0x200    not passing filters, such as platform/vendor quality controls
0            0x400    PCR or optical duplicate
0            0x800    supplementary alignment
• FLAG - 16 = 00010000

Interpretation: Only bit 0x10 is set, so the read aligns to the reverse strand
and SEQ is stored as the reverse complement of the original read.
• Reference sequence (RNAME) is NDV
• The match starts at position 8617 (POS) of the NDV reference
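
The FLAG value of 16 above can be decoded by testing each bit in turn. The sketch below uses the bit descriptions from the flag table; the function name decode_flag is illustrative.

```python
FLAG_BITS = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reverse complemented",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing filters, such as platform/vendor quality controls",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG."""
    return [description for bit, description in FLAG_BITS.items() if flag & bit]

print(format(16, "08b"))  # 00010000
print(decode_flag(16))    # ['SEQ being reverse complemented']
```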

Mapping Quality

CIGAR - Concise Idiosyncratic Gapped Alignment Report
• CIGAR of 51M1I151M: 51 matches, one insertion in the query relative to
the reference, followed by 151 matches. This is clearly visible in the BLAST
output shown on the slide.
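
A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below also sums how many bases each string consumes on the query and on the reference, using the operation letters defined in the SAM specification; the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume bases of the read
REFERENCE_OPS = set("MDN=X")  # operations that consume bases of the reference

def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def cigar_lengths(cigar):
    """Return (bases consumed on the query, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in QUERY_OPS)
    reference = sum(n for n, op in ops if op in REFERENCE_OPS)
    return query, reference

print(parse_cigar("51M1I151M"))    # [(51, 'M'), (1, 'I'), (151, 'M')]
print(cigar_lengths("51M1I151M"))  # (203, 202)
```

The inserted base counts towards the read length (203) but not towards the reference span (202).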

II. Paired end reads
[Figure: cDNA conversion, cDNA fragments and adapter ligation, and sequencing of each fragment as reads R1 and R2]
R1 runs in the same direction as the reference.
R2 runs in the opposite direction to the reference.

(CIGAR - 6M102N94M1S)
6M is a 6-nucleotide match.
102N is an intron of 102 reference bases (marked in a box on the slide) that the read skips.
R1: The sequence in the SAM record matches in the same direction as the original read. Note that the matched
sequence and the sequence in the SAM record are the same.
R2: The sequence in the SAM record is the reverse complement of the original R2 read. It therefore matches in the same
direction as R1, indicating that the original R2 read matches in the reverse direction, as expected.
R1 Match to the Genome

92M: the match marked by the arrows, up to C
1S: one soft-clipped base, i.e. T
R2 Match to the Genome (CIGAR - 69M569N32M)
II. Details of R1 match

Hard clipping

II. Details of R2 match

BAM file
BAM is the compressed binary version of the SAM format. The BAM format
is computationally much more convenient. BAM is compressed in the
BGZF format. All multi-byte numbers in BAM are little-endian, regardless of
the machine endianness.
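
The little-endian convention can be illustrated by reading the first fields of a BAM stream: the magic bytes "BAM\1", the length of the header text, the header text itself and the number of reference sequences. This is a sketch only. It assumes that BGZF data can be decompressed as ordinary gzip, and it builds a tiny in-memory stream with plain gzip as a stand-in for a real BAM file; the header text is hypothetical.

```python
import gzip
import struct

def read_bam_header_text(data):
    """Return (header text, number of references) from compressed BAM bytes."""
    raw = gzip.decompress(data)
    if raw[:4] != b"BAM\x01":
        raise ValueError("not a BAM stream")
    (l_text,) = struct.unpack_from("<i", raw, 4)  # '<' forces little-endian
    text = raw[8:8 + l_text].decode("ascii")
    (n_ref,) = struct.unpack_from("<i", raw, 8 + l_text)
    return text, n_ref

# Hypothetical, minimal stream: header text only, no references, no alignments.
header_text = "@HD\tVN:1.0\tSO:unsorted\n"
payload = (
    b"BAM\x01"
    + struct.pack("<i", len(header_text))
    + header_text.encode("ascii")
    + struct.pack("<i", 0)
)
compressed = gzip.compress(payload)
text, n_ref = read_bam_header_text(compressed)
print(repr(text), n_ref)
```

Using the explicit "<" byte-order prefix keeps the result the same on any machine, which is the point of fixing the endianness in the format.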
Convert SAM to BAM using samtools (run samtools on the control file
using the following command):
./samtools view -bSh control_R1.sam > control_R1.bam
Here control_R1.sam is the input file and control_R1.bam is the output file.

Gene transfer format (GTF)
The Gene transfer format (GTF) is a file format used to hold information about
gene structure.

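Tip for the chapter: awk command to convert a FASTQ file to a FASTA
file:
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print $0}' ctrl.fastq > my.fasta
Here substr($0, 2) drops the leading '@' of the identifier line, so that the FASTA header begins with '>' only.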
