Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Data formats
Data Processing
• Sequence formats: FASTA, FASTQ, SRF, SFF, SCARF, AB1
• Other sequence formats: GCG, IG, EMBL
• Sequence alignment formats: SAM, BAM, CRAM
• Visualisation formats: WIG, BED, GFF, GTF
I. Read / Sequence Formats
1. FASTA File Format
2. FASTQ File Format
Each record in a FASTQ file spans four lines:
1. The sequence identifier, which begins with the '@' character
2. The raw sequence read
3. A separator line that begins with the '+' character and may optionally repeat the identifier
4. The quality scores for each position along the read, one character per base
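The four-line layout described above can be read with a few lines of Python. This is a minimal sketch that assumes strict four-line records (no wrapped sequence lines); the function name `read_fastq` and the example record are made up for illustration and do not come from the slides.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or an empty line) stops the reader
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Drop the leading '@' from the identifier line.
        yield header[1:], sequence, quality


# Hypothetical record, for illustration only.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
records = list(read_fastq(example))
```

The length check reflects the rule that line 4 carries one quality character for every base on line 2.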
The quality score (Q) is related to the probability of calling an incorrect
base. Phred quality scores are used for assessment of sequence quality,
recognition and removal of low-quality sequence and determination of
accurate consensus sequences.
Q_phred = -10 × log10(P)
where P is the probability of calling the incorrect base.
PHRED Score    Probability of Incorrect Base call    Accuracy of Base call
0              1 in 1                                0%
10             1 in 10                               90%
20             1 in 100                              99%
30             1 in 1000                             99.9%
40             1 in 10000                            99.99%
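The relationship between Q and P can be checked numerically. The sketch below simply inverts the formula (P = 10^(-Q/10)) and prints the rows of the table; the function names are illustrative choices, not part of any library.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Q = -10 * log10(P), where P is the probability of an incorrect base call."""
    return -10 * math.log10(p)


# Print score, error probability and base-call accuracy for the table rows.
for q in (0, 10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(q, p, f"{(1 - p) * 100:.2f}%")
```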
The quality scores are encoded as ASCII characters, using one of two offsets:
Sanger: Phred + 33
Illumina: Phred + 64 (used by older Illumina pipelines, versions 1.3 to 1.7; Illumina 1.8 and later use Phred + 33)
Example: the quality character of a base is 'd', whose ASCII code is 100, on the Illumina (Phred + 64) scale. What is the value after changing to the Sanger scale, and what is P?
-10 × log10(P) + 64 = 100
-10 × log10(P) = 36, i.e. Q = 36
log10(P) = -3.6
P = 10^-3.6 = 0.000251
On the Sanger scale the ASCII code is Q + 33 = 36 + 33 = 69, which is the character 'E'.
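The worked example can be reproduced in code. This is a small sketch assuming the input really is Phred + 64 encoded; the function names `decode_quality` and `convert_phred64_to_phred33` are hypothetical helpers written for this illustration.

```python
def decode_quality(char, offset):
    """Return (Q, P) for one quality character under the given ASCII offset."""
    q = ord(char) - offset
    p = 10 ** (-q / 10)
    return q, p


def convert_phred64_to_phred33(quality_string):
    """Re-encode a Phred+64 quality string on the Sanger (Phred+33) scale."""
    return "".join(chr(ord(c) - 64 + 33) for c in quality_string)


# The example from the slide: 'd' has ASCII code 100 on the Phred+64 scale.
q, p = decode_quality("d", 64)
sanger_char = convert_phred64_to_phred33("d")
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth confirming which scale a data set uses before filtering on quality.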
Alignment formats
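1. SAM (Sequence Alignment/Map) format
The header section describes the file. Its lines include:
• @HD (header line): VN, the format version (digits 0-9 in the form major.minor); SO, whether the alignments are sorted or unsorted
• @SQ (reference sequence): SN, the reference sequence name; LN, the reference sequence length
• @PG (program): ID, the program ID; PN, the program name; VN, the program version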
Apart from the header lines, which start with the '@' symbol, each alignment line consists of:
• QNAME: Query template/pair NAME
• FLAG: bitwise FLAG
• RNAME: Reference sequence NAME
• POS: 1-based leftmost POSition/coordinate of clipped sequence
• MAPQ: MAPping Quality (Phred-scaled)
• CIGAR: extended CIGAR string
• MRNM: Mate Reference sequence NaMe ('=' if same as RNAME); called RNEXT in the current specification
• MPOS: 1-based Mate POSition; called PNEXT in the current specification
• LEN: inferred Template LENgth (insert size); called TLEN in the current specification
• SEQ: query SEQuence on the same strand as the reference
• QUAL: query QUALity (ASCII-33 gives the Phred base quality)
• OPT: variable OPTional fields in the format TAG:VTYPE:VALUE
[Figure: example alignment line with its first six columns labelled Query Name, FLAG, Reference Sequence, 1-based leftmost mapping POSition, Quality of mapping in Phred, and CIGAR]
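The field list above maps directly onto a tab-delimited line. The following is a minimal sketch of splitting one alignment line into its 11 mandatory fields plus optional tags. It uses the field names of the current SAM specification, where the mate fields are called RNEXT, PNEXT and TLEN; the alignment line itself is made up for illustration, and `parse_sam_line` is a hypothetical helper, not a samtools function.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one alignment line into the mandatory fields and a dict of optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        # Optional fields have the form TAG:VTYPE:VALUE; VALUE may itself contain ':'.
        tag, vtype, value = item.split(":", 2)
        tags[tag] = (vtype, value)
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=tags,
    )


# Made-up alignment line, for illustration only.
line = "read1\t16\tNDV\t8617\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
# QUAL: ASCII code minus 33 gives the Phred base quality.
base_qualities = [ord(c) - 33 for c in record.qual]
```

For real data an established parser is the safer choice; this sketch only shows how the columns line up.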
Note: the details of the SAM format given below are taken from the specification available at https://github.com/samtools/hts-specs or at https://samtools.github.io/hts-specs/SAMv1.pdf
i. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME '*' indicates that the information is unavailable.
I. Single end reads
• Query name: Ion Torrent specific
Our score   Bit     Description
0           0x1     template having multiple segments in sequencing
0           0x2     each segment properly aligned according to the aligner
0           0x4     segment unmapped
0           0x8     next segment in the template unmapped
1           0x10    SEQ being reverse complemented
0           0x20    SEQ of the next segment in the template being reverse complemented
0           0x40    the first segment in the template
0           0x80    the last segment in the template
0           0x100   secondary alignment
0           0x200   not passing filters, such as platform/vendor quality controls
0           0x400   PCR or optical duplicate
0           0x800   supplementary alignment
• FLAG: 16 = 0x10 = binary 00010000
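Because FLAG is a bitwise field, a value is decoded by testing each bit in turn. The sketch below does this with the bit values and descriptions from the table; `decode_flag` is an illustrative helper name rather than part of any tool.

```python
FLAG_BITS = [
    (0x1, "template having multiple segments in sequencing"),
    (0x2, "each segment properly aligned according to the aligner"),
    (0x4, "segment unmapped"),
    (0x8, "next segment in the template unmapped"),
    (0x10, "SEQ being reverse complemented"),
    (0x20, "SEQ of the next segment in the template being reverse complemented"),
    (0x40, "the first segment in the template"),
    (0x80, "the last segment in the template"),
    (0x100, "secondary alignment"),
    (0x200, "not passing filters, such as platform/vendor quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of every bit that is set in a SAM FLAG value."""
    return [description for bit, description in FLAG_BITS if flag & bit]


# The single-end example from the slide: only bit 0x10 is set.
print(format(16, "08b"), decode_flag(16))
```

A paired-end FLAG such as 99 would set several bits at once (0x1, 0x2, 0x20 and 0x40), which is why the value has to be read bit by bit rather than as a plain number.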
Interpretation: FLAG 16 indicates that the read aligns to the reference as its reverse complement.
• Reference sequence is NDV
• The alignment starts at position 8617 of the NDV reference
Mapping Quality
CIGAR - Concise Idiosyncratic Gapped Alignment Report
• CIGAR of 51M1I151M: 51 aligned bases, one insertion in the query relative to the reference, then 151 aligned bases. In CIGAR notation M marks an aligned position, which may be a match or a mismatch. The single insertion in the query is visible in the BLAST output above.
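A CIGAR string can be split into (length, operation) pairs with a regular expression, and the read and reference lengths follow from which operations consume which sequence. This is a minimal sketch based on the operation table in the SAM specification; the function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume bases of the read
REFERENCE_OPS = set("MDN=X")  # operations that consume bases of the reference


def parse_cigar(cigar):
    """Split a CIGAR string such as '51M1I151M' into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: " + cigar)
    return ops


def cigar_lengths(cigar):
    """Return (read bases covered, reference bases spanned) for a CIGAR string."""
    ops = parse_cigar(cigar)
    query = sum(n for n, op in ops if op in QUERY_OPS)
    reference = sum(n for n, op in ops if op in REFERENCE_OPS)
    return query, reference


print(parse_cigar("51M1I151M"))
print(cigar_lengths("51M1I151M"))
```

For 51M1I151M the read is one base longer than the reference span, because the inserted base exists only in the query.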
II. Paired end reads
[Figure: library preparation for paired-end sequencing. A poly(A)-tailed transcript undergoes cDNA conversion, followed by cDNA fragmentation and adapter ligation, and each fragment is then sequenced from both ends as R1 and R2.]
R1 will run in the same direction as the reference.
R2 will run in the opposite direction to the reference.
II. Details of R1 match
R1 match to the genome (CIGAR: 6M102N94M1S)
• 6M: 6 aligned nucleotides
• 102N: a skipped region of 102 reference bases, i.e. an intron (marked by a box in the figure) that the read does not cover
• 94M: 94 aligned nucleotides, marked by the arrows up to the C
• 1S: one soft-clipped base, i.e. the T
R1 versus SAM: the sequence in the SAM record matches the original R1 read exactly and in the same direction.
Hard clipping
II. Details of R2 match
R2 match to the genome (CIGAR: 69M569N32M)
• 69M: 69 aligned nucleotides
• 569N: a skipped region of 569 reference bases, i.e. an intron
• 32M: 32 aligned nucleotides
R2 versus SAM: the sequence in the SAM record is the reverse complement of the original R2 read. It is therefore reported in the same direction as R1, which indicates that the original R2 read matches the reference in the reverse direction, as expected.
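BAM file
BAM is the compressed binary version of the SAM format. Because it is binary and compressed, BAM is smaller than SAM and much more convenient computationally. BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness.
Convert SAM to BAM using samtools (here for the control file):
./samtools view -bSh control_R1.sam > control_R1.bam
Here control_R1.sam is the input file and control_R1.bam is the output file. The options are -b (write BAM), -S (input is SAM; recent samtools versions detect this automatically and ignore the option) and -h (include the header).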
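The little-endian layout can be seen by reading the first fields of a BAM stream by hand. The sketch below relies on BGZF blocks being valid gzip members, so Python's `gzip` module can decompress them. For a self-contained demonstration it builds a tiny fake payload with plain gzip rather than true BGZF, which is an assumption made only for this example; `read_bam_header` is a hypothetical helper, not part of samtools.

```python
import gzip
import io
import struct


def read_bam_header(fileobj):
    """Return (header_text, n_ref) from the start of a BAM stream."""
    with gzip.GzipFile(fileobj=fileobj) as stream:
        magic = stream.read(4)
        if magic != b"BAM\x01":
            raise ValueError("not a BAM file")
        # '<i' is a little-endian signed 32-bit integer, whatever the machine uses.
        (l_text,) = struct.unpack("<i", stream.read(4))
        text = stream.read(l_text).decode("utf-8").rstrip("\x00")
        (n_ref,) = struct.unpack("<i", stream.read(4))
    return text, n_ref


# Fake, minimal payload for illustration: magic, header length, header text, zero references.
demo_text = "@HD\tVN:1.6\tSO:unsorted\n"
payload = (
    b"BAM\x01"
    + struct.pack("<i", len(demo_text))
    + demo_text.encode("utf-8")
    + struct.pack("<i", 0)
)
header_text, n_ref = read_bam_header(io.BytesIO(gzip.compress(payload)))
```

In practice `samtools view -H` prints the same header text; the point here is only to show where the little-endian integers sit.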
Gene transfer format (GTF)
The Gene transfer format (GTF) is a file format used to hold information about
gene structure.
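As a brief illustration of how gene structure is stored, a GTF line has nine tab-separated fields (sequence name, source, feature, start, end, score, strand, frame, attributes), with attributes written as key "value" pairs separated by semicolons. The sketch below parses one such line. The example line and the helper name `parse_gtf_line` are made up, and splitting on ';' is a simplification that would break on values containing a semicolon.

```python
def parse_gtf_line(line):
    """Parse one GTF line into a dict, with the attribute column as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line has nine tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame, attribute_text = fields
    attributes = {}
    for item in attribute_text.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Made-up annotation line, for illustration only.
gtf_line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";'
exon = parse_gtf_line(gtf_line)
```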
Tip for the chapter: an awk command to convert a FASTQ file to a FASTA file. substr($0, 2) drops the leading '@' so that the FASTA header does not read '>@...':
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print $0}' ctrl.fastq > my.fasta

More Related Content

What's hot (20)

Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Blast Algorithm
Blast AlgorithmBlast Algorithm
Blast Algorithm
 
Needleman-Wunsch Algorithm
Needleman-Wunsch AlgorithmNeedleman-Wunsch Algorithm
Needleman-Wunsch Algorithm
 
Dot matrix seminar
Dot matrix seminarDot matrix seminar
Dot matrix seminar
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
SEQUENCE ANALYSIS
SEQUENCE ANALYSISSEQUENCE ANALYSIS
SEQUENCE ANALYSIS
 
Fasta
FastaFasta
Fasta
 
GENOMICS AND BIOINFORMATICS
GENOMICS AND BIOINFORMATICSGENOMICS AND BIOINFORMATICS
GENOMICS AND BIOINFORMATICS
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Distance based method
Distance based method Distance based method
Distance based method
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
Genomics and bioinformatics
Genomics and bioinformatics Genomics and bioinformatics
Genomics and bioinformatics
 
Sequence file formats
Sequence file formatsSequence file formats
Sequence file formats
 
Composite and Specialized databases
Composite and Specialized databasesComposite and Specialized databases
Composite and Specialized databases
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)
 
Introduction of bioinformatics
Introduction of bioinformaticsIntroduction of bioinformatics
Introduction of bioinformatics
 
protein sequence analysis
protein sequence analysisprotein sequence analysis
protein sequence analysis
 
History and scope in bioinformatics
History and scope in bioinformaticsHistory and scope in bioinformatics
History and scope in bioinformatics
 

Similar to Data formats

Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema
 
BIM_2010_20_Bioinformatics_Project
BIM_2010_20_Bioinformatics_ProjectBIM_2010_20_Bioinformatics_Project
BIM_2010_20_Bioinformatics_ProjectSagar Nikam
 
TIS prediction in human cDNAs with high accuracy
TIS prediction in human cDNAs with high accuracyTIS prediction in human cDNAs with high accuracy
TIS prediction in human cDNAs with high accuracyAnax Fotopoulos
 
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...Multicopy reference assay (MRef) — a superior normalizer of sample input in D...
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...QIAGEN
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesProf. Wim Van Criekinge
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesGenome Reference Consortium
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence AlignmentRavi Gandham
 
Presentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticePresentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticezahid6
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packagesRavi Gandham
 
Computation and System Biology Assignment Help
Computation and System Biology Assignment HelpComputation and System Biology Assignment Help
Computation and System Biology Assignment HelpNursing Assignment Help
 
Wang labsummer2010
Wang labsummer2010Wang labsummer2010
Wang labsummer2010russodl
 
RNA Seq Data Analysis
RNA Seq Data AnalysisRNA Seq Data Analysis
RNA Seq Data AnalysisRavi Gandham
 
Ashg poster sp_compressed
Ashg poster sp_compressedAshg poster sp_compressed
Ashg poster sp_compressedAmy Cullinan
 

Similar to Data formats (20)

Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation OverviewPathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
Pathema Burkholderia Annotation Jamboree: Prokaryotic Annotation Overview
 
BIM_2010_20_Bioinformatics_Project
BIM_2010_20_Bioinformatics_ProjectBIM_2010_20_Bioinformatics_Project
BIM_2010_20_Bioinformatics_Project
 
TIS prediction in human cDNAs with high accuracy
TIS prediction in human cDNAs with high accuracyTIS prediction in human cDNAs with high accuracy
TIS prediction in human cDNAs with high accuracy
 
Seq alignment
Seq alignment Seq alignment
Seq alignment
 
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...Multicopy reference assay (MRef) — a superior normalizer of sample input in D...
Multicopy reference assay (MRef) — a superior normalizer of sample input in D...
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matrices
 
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic SequencesThe NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Aacr poster2007
Aacr poster2007Aacr poster2007
Aacr poster2007
 
Validaternai
ValidaternaiValidaternai
Validaternai
 
Presentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informaticePresentation for blast algorithm bio-informatice
Presentation for blast algorithm bio-informatice
 
FASTA
FASTAFASTA
FASTA
 
Cufflinks
CufflinksCufflinks
Cufflinks
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packages
 
Computation and System Biology Assignment Help
Computation and System Biology Assignment HelpComputation and System Biology Assignment Help
Computation and System Biology Assignment Help
 
Wang labsummer2010
Wang labsummer2010Wang labsummer2010
Wang labsummer2010
 
Snp genotyping
Snp genotypingSnp genotyping
Snp genotyping
 
RNA Seq Data Analysis
RNA Seq Data AnalysisRNA Seq Data Analysis
RNA Seq Data Analysis
 
Ashg poster sp_compressed
Ashg poster sp_compressedAshg poster sp_compressed
Ashg poster sp_compressed
 

Recently uploaded

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 

Recently uploaded (20)

Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

Data formats

  • 1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Data formats
  • 2. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Data formats Sequence formats Other sequence Visualisation formats Sequence alignment formats Data Processing FASTA FASTQ SRF SFF SCARF AB1 GCG IG EMBL SAM BAM CRAM WIG BED GFF GTF
  • 3. Computational Biology and Genomics Facility, Indian Veterinary Research Institute I. Read / Sequence Formats 1. FASTA File Format 2. FASTQ File Format Each of file spans four lines. 1. The sequence identifier that begins with '@' character 2. The raw sequence read 3. An alternate line for the identifier and begins with '+' character 4. The quality scores for each position along the read.
  • 4. Computational Biology and Genomics Facility, Indian Veterinary Research Institute The quality score (Q) is related to the probability of calling an incorrect base. Phred quality scores are used for assessment of sequence quality, recognition and removal of low-quality sequence and determination of accurate consensus sequences. Quantitation phred = −10log10P Where P is the probability of calling the incorrect base. PHRED Score Probability of Incorrect Base call Accuracy of Base call 0 1 in 1 0% 10 1 in 10 90% 20 1 in 100 99% 30 1 in 1000 99.9% 40 1 in 10000 99.99% PHRED Score
  • 5. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Sanger Phred + 33 Illumina Phred + 64 The quality scoring scheme are encoded in ASCII character - Example : If the quality score of a base is d and ASCII code for d is 100 for illumina, what is the quality score after changing to Sanger scale and what is the P? −10log10P + 64 =100; −10log10P = 36 i.e Q log10P = -3.6 P = 10-3.6 = 0.0002511 In Sanger's scale the value is Q+33 = 36+33 = 69
  • 6. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Alignment formats 1. SAM (Sequence Alignment/Map format.) Version (Accepted values from 0-9) Sorted or unsorted Program ID Reference sequence name Reference sequence length Program Name Program Version
  • 7. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Apart from the header lines, which are started with the ‘@’ symbol, each alignment line consists of: ¥ QNAME: Query template/pair NAME ¥ FLAG: bitwise FLAG ¥ RNAME: Reference sequence NAME ¥ POS: 1-based leftmost POSition/coordinate of clipped sequence ¥ MAPQ: MAPping Quality (Phred-scaled) ¥ CIGAR: extended CIGAR string ¥ MRNM: Mate Reference sequence NaMe (‘=’ if same as RNAME) ¥ MPOS: 1-based Mate POSistion ¥ LEN: inferred Template LENgth (insert size) Query Name FLAG Reference Sequence 1-based leftmost mapping POSition Quality of mapping in Phred CIGAR
  • 8. Computational Biology and Genomics Facility, Indian Veterinary Research Institute ¥ SEQ: query SEQuence on the same strand as the reference ¥ QUAL: query QUALity (ASCII-33 gives the Phred base quality) ¥ OPT: variable OPTional fields in the format TAG:VTYPE:VALUE Note :- The detail of SAM format as mentioned below has been taken from the document available at https://github.com/samtools/hts-specs or at https://samtools.github.io/hts-specs/SAMv1.pdf i.QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. Query Name FLAG Reference Sequence 1-based leftmost mapping POSition Quality of mapping in Phred CIGAR I. Single end reads • Query name - Ion torrent specific ;
  • 9. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Our score Bit Description 0 0 × 1 template having multiple segments in sequencing 0 0 × 2 each segment properly aligned according to the aligner 0 0 × 4 segment unmapped 0 0 × 8 next segment in the template unmapped 1 0 × 10 SEQ being reverse complemented 0 0 × 20 SEQ of the next segment in the template being reverse complemented0 0 × 40 the first segment in the template 0 0 × 80 the last segment in the template 0 0 × 100 secondary alignment 0 0 × 200 not passing filters, such as platform/vendor quality controls0 0 × 400 PCR or optical duplicate 0 0 × 800 supplementary alignment • FLAG - 16 = 00010000
  • 10. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Query Name FLAG Reference Sequence 1-based leftmost mapping POSition Quality of mapping in Phred CIGAR Interpretation : - This indicates that the read matching is the reverse compliment of the reference • Reference sequence is NDV • Matches at 8617 of the reference NDV
  • 11. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Mapping Quality
  • 12. Computational Biology and Genomics Facility, Indian Veterinary Research Institute CIGAR - Concise Idiosyncratic Gapped Alignment Report • CIGAR of 51M1I151M - 51 matches 1 insertion and 151 matches This indicates 51 matches, one insertion in the query followed by 151 matches. This is clearly shown in the BLAST output above (One insertion in the query in comparison to reference) Query Name FLAG Reference Sequence 1-based leftmost mapping POSition Quality of mapping in Phred CIGAR
  • 13. Computational Biology and Genomics Facility, Indian Veterinary Research Institute 3' 5' 5' 3' 5' 3' 5' 3' AAAA AAAA 3' 5'TTTT cDNA fragments and adapter ligation cDNA conversion R1 R2 5' 3' 3' 5' Sequencing of each fragment R1 will run in the same direction of the reference R2 will run in the opposite direction of the reference II. Paired end reads
  • 14. II. Paired end reads. R1 match to the genome (CIGAR 6M102N94M1S): 6M is a 6-nucleotide match; 102N is an intron, marked in a box, that does not match. R1 SAM: the sequence matches in exactly the same direction as the original read; the matched sequence and the sequence in the SAM record are the same. R2 SAM: the sequence is the reverse complement of the R2 sequence. It therefore matches in the same direction as R1, indicating that the original R2 read matches in the reverse direction, as expected.
  • 15. II. Details of R1 match. R1 match to the genome (CIGAR 6M102N94M1S): 6M is a 6-nucleotide match; 102N is an intron, marked in a box, that does not match; 94M is the aligned stretch marked by the arrows up to C (the slide annotation reads 92M); 1S is a soft-clipped base, i.e. T. R1 SAM: the sequence matches in exactly the same direction as the original read; the matched sequence and the sequence in the SAM record are the same. For comparison, R2 matches the genome with CIGAR 69M569N32M.
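A spliced CIGAR such as 6M102N94M1S can be turned into the reference blocks (exons) that the read covers. The following is a minimal Python sketch; the function name is illustrative and the start position 1000 is a hypothetical value, since the slides do not give the POS of this read.

```python
import re


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks, split at each N."""
    blocks = []
    reference = pos      # next reference position to be consumed
    block_start = None   # start of the block currently being extended
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=XD":
            # Aligned bases and deletions extend the current block.
            if block_start is None:
                block_start = reference
            reference += length
        elif op == "N":
            # A skipped region (intron) closes the current block.
            if block_start is not None:
                blocks.append((block_start, reference - 1))
                block_start = None
            reference += length
        # I, S, H and P do not consume reference bases.
    if block_start is not None:
        blocks.append((block_start, reference - 1))
    return blocks


# Hypothetical POS of 1000 for the R1 CIGAR from the slide.
print(aligned_blocks(1000, "6M102N94M1S"))  # [(1000, 1005), (1108, 1201)]
```

The soft-clipped base (1S) is present in the read but does not extend the second block, which is why the block ends after 94 bases.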
  • 16. Hard clipping
  • 17. II. Details of R2 match. R2 match to the genome (CIGAR 69M569N32M): 69M is a 69-nucleotide match; 569N is an intron that does not match; 32M is a 32-nucleotide match. R2 SAM: the sequence is the reverse complement of the R2 sequence. It therefore matches in the same direction as R1, indicating that the original R2 read matches in the reverse direction, as expected. For comparison, R1 matches the genome with CIGAR 6M102N94M1S, where 1S is a soft-clipped base, i.e. T.
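Because the SAM record of a reverse-strand read stores the reverse complement of the sequenced read (with the quality string reversed), the original read can be recovered by reversing that operation whenever FLAG bit 0x10 is set. This is a minimal Python sketch; the function names and the short example read are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read(seq, qual, flag):
    """Recover the read as sequenced from the SEQ, QUAL and FLAG of a SAM record."""
    if flag & 0x10:  # SEQ being reverse complemented
        return reverse_complement(seq), qual[::-1]
    return seq, qual


# Hypothetical 5-base record aligned to the reverse strand (FLAG 16).
print(original_read("AGCTT", "IIII#", 16))  # ('AAGCT', '#IIII')
```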
  • 18. BAM file. BAM is the compressed binary version of the SAM format and is much more convenient computationally. BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness. Convert SAM to BAM using samtools (run samtools on the control file with the following command):
      ./samtools view -bSh control_R1.sam > control_R1.bam
    Here control_R1.sam is the input file and control_R1.bam is the output file. The option -b writes BAM output, -S marks the input as SAM (required by older samtools versions; newer versions detect the input format automatically) and -h keeps the header.
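Since BGZF is compatible with gzip, the start of a BAM file can be inspected with a standard gzip reader. The following is a minimal Python sketch, assuming the BAM layout from the SAM/BAM specification (the magic bytes `BAM\1`, then a little-endian 32-bit header length, then the header text). The function name is illustrative, and the demonstration uses a tiny hand-built, gzip-compressed stand-in rather than a real BAM file.

```python
import gzip
import io
import struct


def read_bam_header_text(stream):
    """Return the SAM header text stored at the start of a BAM stream."""
    with gzip.open(stream, "rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("stream does not start with the BAM magic bytes")
        # "<I" = little-endian unsigned 32-bit integer, whatever the machine.
        (text_length,) = struct.unpack("<I", bam.read(4))
        text = bam.read(text_length)
    return text.decode("utf-8").rstrip("\x00")


# Hand-built stand-in for demonstration only: magic, header length,
# header text and a reference count of zero, compressed with plain gzip.
header = b"@HD\tVN:1.6\tSO:coordinate\n"
payload = b"BAM\x01" + struct.pack("<I", len(header)) + header + struct.pack("<I", 0)
demo_stream = io.BytesIO(gzip.compress(payload))

print(repr(read_bam_header_text(demo_stream)))
```

For a real file, passing the path of a BAM file (for example the control_R1.bam produced above) instead of the in-memory stream should work the same way.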
  • 19. Gene transfer format (GTF). The Gene transfer format (GTF) is a file format used to hold information about gene structure.
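A GTF record is one tab-separated line with nine columns: seqname, source, feature, start, end, score, strand, frame and attributes. The following minimal Python sketch splits such a line into named fields; the function name and the example record are hypothetical, and the attribute parsing is simplified (it assumes no semicolons inside quoted values).

```python
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_gtf_line(line):
    """Parse one GTF line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("a GTF line must have 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical exon record, for illustration only.
example = 'chr1\texample\texon\t1000\t1005\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
print(parse_gtf_line(example))
```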
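  • 20. Tip for the chapter: awk command to convert a FASTQ file to a FASTA file:
      awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print $0}' ctrl.fastq > my.fasta
    The first line of every four-line record is the identifier, and substr($0, 2) drops its leading '@' so that the FASTA header starts with '>' only. The second line is the sequence, which is printed unchanged.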
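The same conversion can be sketched in Python for readers who prefer a script to a one-liner. Like the awk command, it assumes every FASTQ record spans exactly four lines; the function name and the two example reads are hypothetical.

```python
import io


def fastq_to_fasta(fastq, fasta):
    """Copy identifier and sequence lines from a FASTQ handle to a FASTA handle."""
    for line_number, line in enumerate(fastq):
        position = line_number % 4
        if position == 0:
            # Identifier line: replace the leading '@' with '>'.
            fasta.write(">" + line[1:])
        elif position == 1:
            # Sequence line: copied unchanged.
            fasta.write(line)
        # Positions 2 ('+' line) and 3 (quality line) are skipped.


# Two hypothetical reads, for illustration only.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n##II\n"
output = io.StringIO()
fastq_to_fasta(io.StringIO(fastq_text), output)
print(output.getvalue(), end="")
```

With real files, the two handles would be opened with `open("ctrl.fastq")` and `open("my.fasta", "w")`.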