Next Generation Sequencing
File Formats.
Pierre Lindenbaum
pierre.lindenbaum@univ-nantes.fr
September 18, 2017
Pierre Lindenbaumpierre.lindenbaum@univ-nantes.fr Next Generation SequencingFile Formats.
You don’t need to have a deep knowledge of those formats.
(Unless you’re doing NGS)
Understand how people have solved their BIG data problems.
Why sequencing ?
Well, that’s a little more complicated ...
FASTQ
FASTQ
FASTQ: text-based format for storing both a DNA sequence and
its corresponding quality scores
FASTQ
FASTQ for single end
FASTQ
FASTQ for paired end
FASTQ Example
@IL31_4368:1:1:996:8507/1
NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA
+
(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%
@IL31_4368:1:1:996:21421/1
NAAGTTAATTCTTCATTGTCCATTCCTCTGAAATGATTCAGAAATACTGGTAGT
+
(**+*2396,@<+<:@@@;;5)<0)69606>4;5>;>6&<102)0*+8:&137;
@IL31_4368:1:1:997:10572/1
NAATGTATGTAGACCCTTCACATTCAAAGGCAAATACAATATCATCATGTCTTC
+
(/9**-0032>:>>9>4@@=>??@@:-66,;>;<;6+;255,1;7>>>>3676'
@IL31_4368:1:1:997:15684/1
NGCAATCAATGCTATGATTGATCCTGATGGAACTTTGGAGGCTCTGAACAACAT
+
()1,*37766>@@@>?@<?@@:>@0>>><-888>8;>*;966>;;;@8@4,.2.
@IL31_4368:1:1:997:15249/1
NCGTTATAATGGAATTATTTTTCTTCCTTTATTTAATGTGTTGACAAAGAGAAC
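The four-line layout shown above (name, bases, '+', qualities) can be read positionally. The sketch below is only an illustration: the class FastqDemo and its Record type are hypothetical names, and it assumes that sequences are not wrapped over several lines.

```java
import java.util.ArrayList;
import java.util.List;

class FastqDemo {
    /** One FASTQ record: name (without the leading '@'), bases, qualities. */
    static final class Record {
        final String name;
        final String seq;
        final String qual;
        Record(String name, String seq, String qual) {
            this.name = name;
            this.seq = seq;
            this.qual = qual;
        }
    }

    /** Parses 4-line records; lines are taken by position because a
     *  quality line may itself start with '@'. */
    static List<Record> parse(String text) {
        List<Record> out = new ArrayList<>();
        if (text.trim().isEmpty()) return out;
        String[] lines = text.split("\r?\n");
        if (lines.length % 4 != 0) {
            throw new IllegalArgumentException("truncated FASTQ: " + lines.length + " lines");
        }
        for (int i = 0; i < lines.length; i += 4) {
            String name = lines[i];
            String seq = lines[i + 1];
            String plus = lines[i + 2];
            String qual = lines[i + 3];
            if (!name.startsWith("@") || !plus.startsWith("+")
                    || seq.length() != qual.length()) {
                throw new IllegalArgumentException("malformed record at line " + (i + 1));
            }
            out.add(new Record(name.substring(1), seq, qual));
        }
        return out;
    }

    public static void main(String[] args) {
        // first record of the example above
        String text = "@IL31_4368:1:1:996:8507/1\n"
                + "NTGATAAAGTAATGACAAAATAATGACATTATTGTTACTATGGTTACTGTGGGA\n"
                + "+\n"
                + "(94**0-)*7=06>>><<<<<<22@>6;;;5;6:;63:4?-622647..-.5.%\n";
        for (Record r : parse(text)) {
            System.out.println(r.name + " : " + r.seq.length() + " bases");
        }
    }
}
```

The last record of the example is cut after its sequence line, so this parser would reject the example as a whole; only the complete records can be read.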
FASTQ name
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Value    Brief description
EAS139   the unique instrument name
136      the run id
FC706VJ  the flowcell id
2        flowcell lane
2104     tile number within the flowcell lane
15343    'x'-coordinate of the cluster within the tile
197393   'y'-coordinate of the cluster within the tile
1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y        Y if the read fails filter (read is bad), N otherwise
18       0 when none of the control bits are on, otherwise it is an even number
ATCACG   index sequence
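As a sketch, the name above can be split into its fields with two passes: first on the space, then on ':'. The class name IlluminaNameDemo and the map keys (instrument, run, ...) are hypothetical labels chosen for this example, not part of any specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IlluminaNameDemo {
    // hypothetical labels for the fields listed in the table above
    private static final String[] LEFT =
            {"instrument", "run", "flowcell", "lane", "tile", "x", "y"};
    private static final String[] RIGHT =
            {"member", "filtered", "controlBits", "index"};

    static Map<String, String> parse(String header) {
        String s = header.startsWith("@") ? header.substring(1) : header;
        String[] halves = s.split(" ", 2);
        if (halves.length != 2) {
            throw new IllegalArgumentException("expected two space-separated parts: " + header);
        }
        String[] left = halves[0].split(":");
        String[] right = halves[1].split(":");
        if (left.length != LEFT.length || right.length != RIGHT.length) {
            throw new IllegalArgumentException("unexpected number of fields: " + header);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < LEFT.length; i++) out.put(LEFT[i], left[i]);
        for (int i = 0; i < RIGHT.length; i++) out.put(RIGHT[i], right[i]);
        return out;
    }

    public static void main(String[] args) {
        String name = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG";
        for (Map.Entry<String, String> e : parse(name).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Older names such as @IL31_4368:1:1:996:8507/1 follow a different layout and are rejected by this sketch.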
FASTQ Quality
FASTQ Quality
A quality value Q is an integer mapping of p (i.e., the probability
that the corresponding base call is incorrect).
Qsanger = −10 log10 p
FASTQ Quality
Since a human-readable format is desired for SAM, 33 is added to
the calculated quality to turn it into a printable character
ranging from '!' (Q = 0) to '~' (Q = 93).
ASCII code = Qsanger + 33 = −10 log10 p + 33
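A minimal sketch of the conversion in both directions, assuming the Sanger offset of 33 described above. PhredDemo and its method names are hypothetical; rounding Q to the nearest integer is an assumption of this example.

```java
class PhredDemo {
    /** Q = -10 log10(p), rounded to the nearest integer. */
    static int toPhred(double p) {
        return (int) Math.round(-10.0 * Math.log10(p));
    }

    /** Printable character for a quality: ASCII code = Q + 33. */
    static char toChar(int q) {
        return (char) (q + 33);
    }

    /** Quality encoded by a character of the quality string. */
    static int fromChar(char c) {
        return c - 33;
    }

    /** Error probability for a quality: p = 10^(-Q/10). */
    static double toProbability(int q) {
        return Math.pow(10.0, -q / 10.0);
    }

    public static void main(String[] args) {
        // first characters of a quality string from the FASTQ example
        for (char c : "(94**0".toCharArray()) {
            int q = fromChar(c);
            System.out.println(c + " Q=" + q + " p=" + toProbability(q));
        }
    }
}
```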
Aligned Reads
44187101 44187111 44187121 44187131 44187141 44187151 44187161 44187171
aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc REFERENCE
............................................Y................................... CONSENSUS
aaa gagccaggtgtggtggtgcacaccgataggcccagctacgtaggaggctgaggtgggaggatcgcttaaa cggc
AAA GAGCCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAA CGGC
aaatga CCAGGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCC c
aaatgagcc GGTGTGGTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
AAATGAGCCAGG gtggtggtgcacacctatagtcccagcgacgtaggaggctgaggtgggaggatcgcttaaacccggc
AAATGAGCCAGGTG ggtggtgcacacctatagtcccagctaagtaggaggctgaggtgggaggatcgctttaacccggc
AAATGAGCCAGGTGT GTGGTGCACACCTATAGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGCTTAAACCGGGC
ACATGAGCCAGGTGTG tggtgcacacctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
aaatgagccaggtgtgg GCACACGTAAAGTCCCAGCTACGCAGGAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
CAATGAGCCAGTTGTGG cacacctatagtcccagctacgcacgaggctgaggtgggaggatcgctttaacccggc
AAATGAGCCAGGTGAGGT cacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc
AAATGAGCCAGGTGTGGT acacctatagtcccagctacgcaggaggctgaggtgggaggatcgctttaacccggc
aaatgagccaggtgtggtgg cctatagtcccagctacgtaggaggctgaggtgggaggatcgcttaaacccggc
AAATGAGCCAGGTGTGGTGG TATAGTCCCAGCTACGCAGGAGGCTGAGGTGGTAGGATCGCATAAACCCGGC
AAATGAGCCAGGTGTGGTGGT TAGTCCCAGCTACGTAGGAGGCTGAGTTGGGAGGATCTCTTAAACCCGGC
aaatgagccaggtgtggtggtg TCGTCCCAGCTACGCAGGAGGCTTAGGTGGGAGGATCGCTTAAACCCGGC
aaatgagccaggtgtggtggtgca AGTCCCAGCTACGTAGGAGGCTGAGGTGGGAGGATCGGTTAAACCCGGC
aaatgagccaggtgtggtggtgcac cccagctacgcaggaggctgaggtgggaccatcgcttaaaccccgc
aaatgagccaggtgtggtggtgcac CCAGCTACGTAGTAGGCTGAGGTGGGAGGATCGCTTAAACCCGGC
SAM
SAM Format
"SAM (Sequence Alignment/Map) format is a generic format for
storing large nucleotide sequence alignments"
SAM Format
Is flexible enough to store all the alignment information
generated by various alignment programs;
Is simple enough to be easily generated by alignment
programs or converted from existing alignment formats;
Is compact in file size;
Allows most operations on the alignment to work on a
stream without loading the whole alignment into memory;
Allows the file to be indexed by genomic position to
efficiently retrieve all reads aligning to a locus.
SAM Format
Structure
+ HEADER
-version
-program parameters
+GENOME
- chrom1 size
- chrom2 size
- chrom3 size
- (..)
+GROUPS
- group1 : sample1, lane 4
- group2 : sample2, lane 1
+ BODY
- READ1 -> group1
- READ2 -> group1
- READ3 -> group1
- READ4 -> group2
- (...)
SAM Example
Simple example
@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 83 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
SAM Header Section
SAM Header
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:human.fasta M5:1b22b98cdeb
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:human.fasta M5:a0d9851da00
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:human.fasta M5:fdfd811849c
@RG ID:UM0098:1 PL:ILLUMINA SM:SD37743 CN:Nantes
@RG ID:UM0098:2 PL:ILLUMINA SM:SD37743 CN:Nantes
@PG ID:bwa VN:0.5.4
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=(...)
SAM Alignment Section
SAM Example
Simple example
IL31_4368:1:1:996:8507 77 * 0 0 * * 0 0
IL31_4368:1:1:996:8507 141 * 0 0 * * 0 0
IL31_4368:1:1:996:21421 77 * 0 0 * * 0 0
IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0
IL31_4368:1:1:997:10572 77 * 0 0 * * 0 0
IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0
IL31_4368:1:1:997:15684 83 chr1 241356612 60 54M = 241356
IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356
IL31_4368:1:1:997:15249 77 * 0 0 * * 0 0
IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0
IL31_4368:1:1:997:6273 77 * 0 0 * * 0 0
IL31_4368:1:1:997:6273 141 * 0 0 * * 0 0
IL31_4368:1:1:997:1657 83 chr1 143630364 60 54M = 143630
IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630
IL31_4368:1:1:997:5609 77 * 0 0 * * 0 0
IL31_4368:1:1:997:5609 141 * 0 0 * * 0 0
IL31_4368:1:1:997:14262 77 * 0 0 * * 0 0
IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0
IL31_4368:1:1:998:19914 77 * 0 0 * * 0 0
IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0
SAM Example
Sorted SAM
One row is one read, NOT one fragment.
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21 58 (...)
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17 -58 (...)
IL31_4368:1:10:17817:9758 137 chr1 23 0 54M = 23 0 (...)
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44 61 (...)
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37 -61 (...)
SAM Specifications
Record Column
Col Field Type Brief description
1 QNAME String Query template NAME
2 FLAG Int bitwise FLAG
3 RNAME String Reference sequence NAME
4 POS Int 1-based leftmost mapping POSition
5 MAPQ Int MAPping Quality
6 CIGAR String CIGAR string
7 RNEXT String Ref. name of the mate/next read
8 PNEXT Int Position of the mate/next read
9 TLEN Int observed Template LENgth
10 SEQ String segment SEQuence
11 QUAL String ASCII of Phred-scaled base QUALity+33
12 META metadata
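The column table above maps directly onto a split of one alignment line on the TAB character. The following is a rough sketch: SamLineDemo is a hypothetical name, and the "OPTIONAL" key used to gather column 12 and beyond is an invention of this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class SamLineDemo {
    // the 11 mandatory columns, in file order
    static final String[] COLUMNS = {
        "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
        "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"
    };

    static Map<String, String> parse(String line) {
        String[] tokens = line.split("\t");
        if (tokens.length < COLUMNS.length) {
            throw new IllegalArgumentException("expected at least 11 columns, got " + tokens.length);
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            out.put(COLUMNS[i], tokens[i]);
        }
        // everything after column 11 is an optional TAG:TYPE:VALUE field
        out.put("OPTIONAL", String.join(" ",
                Arrays.copyOfRange(tokens, COLUMNS.length, tokens.length)));
        return out;
    }

    public static void main(String[] args) {
        // last record of the simple SAM example
        String line = "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1";
        for (Map.Entry<String, String> e : parse(line).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```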
SAM Specifications
Record Column
Col Field Value
1 QNAME IL31_4368:1:42:12530:7509
2 FLAG 137
3 RNAME chr1
4 POS 10
5 MAPQ 30
6 CIGAR 54M
7 RNEXT =
8 PNEXT 100
9 TLEN 90
10 SEQ TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
11 QUAL GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCF
12 META XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 X
SAM FLAGS
read paired.
read mapped in proper pair.
read unmapped.
mate unmapped.
read reverse strand.
mate reverse strand.
first in pair.
second in pair.
not primary alignment.
read fails platform/vendor quality checks.
read is PCR or optical duplicate.
supplementary alignment
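Each entry of the list above is one bit of the FLAG integer, from 0x1 (read paired) up to 0x800 (supplementary alignment), in the order listed. A small sketch of the decoding, with SamFlagDemo as a hypothetical class name:

```java
import java.util.ArrayList;
import java.util.List;

class SamFlagDemo {
    // NAMES[i] describes bit (1 << i): 0x1, 0x2, 0x4, ... 0x800
    static final String[] NAMES = {
        "read paired",
        "read mapped in proper pair",
        "read unmapped",
        "mate unmapped",
        "read reverse strand",
        "mate reverse strand",
        "first in pair",
        "second in pair",
        "not primary alignment",
        "read fails platform/vendor quality checks",
        "read is PCR or optical duplicate",
        "supplementary alignment"
    };

    /** Lists the names of the bits that are set in a FLAG value. */
    static List<String> decode(int flag) {
        List<String> out = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flag & (1 << bit)) != 0) out.add(NAMES[bit]);
        }
        return out;
    }

    public static void main(String[] args) {
        // flag values that appear in the SAM examples
        for (int flag : new int[] {77, 141, 83, 163}) {
            System.out.println(flag + " : " + decode(flag));
        }
    }
}
```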
SAM FLAGS
SAM FLAGS
Read Paired
SAM FLAGS
Read mapped in proper pair
Read mapped in proper pair
SAM FLAGS
Read unmapped
SAM FLAGS
Mate unmapped
SAM FLAGS
Read reverse strand
SAM FLAGS
Mate reverse strand
SAM FLAGS
First in pair
SAM FLAGS
Second in pair
SAM FLAGS
not primary alignment
SAM FLAGS
read fails platform/vendor quality checks
SAM FLAGS
read is PCR or optical duplicate
SAM FLAGS
supplementary alignment
SAM CIGAR
The CIGAR string is a sequence of base lengths and their
associated operations. It indicates which bases align with the
reference (as a match or a mismatch), which are deleted from the
reference, and which are insertions that are not in the
reference.
SAM Cigar
Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch
SAM Cigar
SAM Cigar
http://genome.sph.umich.edu/wiki/SAM
RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT
Aligning these two:
RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
or
CIGAR: 3=1I3=1D2=1X2=
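To check a CIGAR such as the ones above, one can sum the lengths of the operations that consume the reference (M, D, N, =, X) and those that consume the read (M, I, S, =, X). The sketch below does just that; CigarDemo and lengths are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CigarDemo {
    private static final Pattern OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /** Returns {bases consumed on the reference, bases consumed on the read}. */
    static int[] lengths(String cigar) {
        Matcher m = OP.matcher(cigar);
        int ref = 0;
        int read = 0;
        int pos = 0;
        while (m.find()) {
            if (m.start() != pos) {
                throw new IllegalArgumentException("bad CIGAR: " + cigar);
            }
            pos = m.end();
            int n = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            if ("MDN=X".indexOf(op) >= 0) ref += n;   // consumes the reference
            if ("MIS=X".indexOf(op) >= 0) read += n;  // consumes the read (present in SEQ)
        }
        if (pos == 0 || pos != cigar.length()) {
            throw new IllegalArgumentException("bad CIGAR: " + cigar);
        }
        return new int[] {ref, read};
    }

    public static void main(String[] args) {
        int pos = 5; // POS of the example alignment
        int[] len = lengths("3M1I3M1D5M");
        System.out.println("read length = " + len[1]);
        System.out.println("last aligned reference position = " + (pos + len[0] - 1));
    }
}
```

For the example, both sums are 12: the read ACTAGAATGGCT has 12 bases and the alignment covers reference positions 5 to 16.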
SAM Cigar
Soft Clip
SAM Cigar
Hard Clip
SAM Format
optional TAGs
Optional fields on a SAM/BAM alignment. A TAG is composed of
a two-character TAG key, the type of the value, and the value:
[A-Za-z][A-Za-z0-9]:[AifZH]:.*
The types, A, i, f, Z, H are used to indicate the type of value
stored in the tag.
Type Description
A character
i signed 32-bit integer
f single-precision float
Z string
H hex string
SAM Format
optional TAGs
XT:A:U - user defined tag called XT. It holds a character.
The value associated with this tag is 'U'.
NM:i:2 - predefined tag NM means: Edit distance to the
reference (number of changes necessary to make this equal
the reference, excluding clipping)
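A sketch of how a TAG:TYPE:VALUE field could be turned into a typed value, limited to the five types listed in the table (A, i, f, Z, H). SamTagDemo is a hypothetical name, and returning a plain Object is a simplification made for this example.

```java
class SamTagDemo {
    /** Converts one optional field, e.g. "NM:i:2", into a typed value. */
    static Object parse(String field) {
        // limit 3: the value itself may contain ':' characters
        String[] t = field.split(":", 3);
        if (t.length != 3 || !t[0].matches("[A-Za-z][A-Za-z0-9]") || t[1].length() != 1) {
            throw new IllegalArgumentException("bad optional field: " + field);
        }
        switch (t[1].charAt(0)) {
            case 'A':
                if (t[2].length() != 1) {
                    throw new IllegalArgumentException("type A needs one character: " + field);
                }
                return t[2].charAt(0);          // Character
            case 'i':
                return Integer.parseInt(t[2]);  // Integer
            case 'f':
                return Float.parseFloat(t[2]);  // Float
            case 'Z':
            case 'H':
                return t[2];                    // String (H is kept as hex text)
            default:
                throw new IllegalArgumentException("unknown type in: " + field);
        }
    }

    public static void main(String[] args) {
        for (String f : new String[] {"XT:A:U", "NM:i:2", "SA:Z:ref,29,-,6H5M,17,0;"}) {
            Object v = parse(f);
            System.out.println(f + " -> " + v.getClass().getSimpleName() + " " + v);
        }
    }
}
```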
SAM Example
Sorted SAM
IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21
IL31_4368:1:107:15207:19097 83 chr1 21 0 54M = 17
IL31_4368:1:54:13142:21400 163 chr1 37 0 54M = 44
IL31_4368:1:54:13142:21400 83 chr1 44 0 54M = 37
BAM
BGZF Format
The SAM/BAM file format (Sequence Alignment/Map) comes in a
plain text format (SAM), and a compressed binary format (BAM).
The latter uses a modified form of gzip compression called BGZF
(Blocked GNU Zip Format), which can be applied to any file
format to provide compression with efficient random access
BAM INDEX
CRAM
Other Sequencing technologies
HDF5
VCF
Variant Call Format
VCF Format
VCF is a text file format (most likely stored in a compressed
manner). It contains meta-information lines, a header line, and
then data lines each containing information about a position in the
genome.
VCF
Example
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA000
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,2
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:5
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2
VCF
Column
CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
SAMPLE-1
SAMPLE-2
SAMPLE-3
...
(...)
VCF
INFO
INFO fields should be described as follows
##INFO=<ID=ID,Number=number,Type=type,Description="description">
(...)
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
(...)
INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2
GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017
GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
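The INFO column is a ';'-separated list of KEY=VALUE pairs, where keys declared with Type=Flag (DB, H2) carry no value. A minimal sketch of the split, with VcfInfoDemo as a hypothetical name and the empty string standing in for a flag:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VcfInfoDemo {
    /** Splits an INFO column; flags are stored with an empty value. */
    static Map<String, String> parse(String info) {
        Map<String, String> out = new LinkedHashMap<>();
        if (info.equals(".")) return out; // no INFO at all
        for (String item : info.split(";")) {
            int eq = item.indexOf('=');
            if (eq < 0) {
                out.put(item, "");
            } else {
                out.put(item.substring(0, eq), item.substring(eq + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> info = parse("NS=3;DP=14;AF=0.5;DB;H2");
        System.out.println("depth = " + Integer.parseInt(info.get("DP")));
        System.out.println("in dbSNP = " + info.containsKey("DB"));
        // AF has Number=. : one value per ALT allele, separated by commas
        for (String af : parse("NS=2;DP=10;AF=0.333,0.667;AA=T;DB").get("AF").split(",")) {
            System.out.println("AF = " + Double.parseDouble(af));
        }
    }
}
```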
VCF
FILTERs
FILTERs that have been applied to the data should be described as
follows:
##FILTER=<ID=ID,Description="description">
(...)
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50 percent of samples have data">
(...)
#CHROM POS ID REF ALT QUAL FILTER (...)
20 14370 rs6054257 G A 29 PASS (...)
20 17330 . T A 3 q10 (...)
20 1110696 rs6040355 A G,T 67 PASS (...)
VCF
FORMAT
Genotype fields specified in the FORMAT field should be described
as follows:
##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
(...)
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
(...)
#(...) FORMAT NA00001 NA00002 NA00003
(...) GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
(...) GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
(...) GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
(...) GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
(...) GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
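Each sample column is read against the FORMAT column: both are ':'-separated, and a sample may drop its trailing fields (0/0:41:3 has no HQ). The sketch below pairs keys with values and resolves the GT allele indexes (0 = REF, 1 = first ALT, ...); GenotypeDemo and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GenotypeDemo {
    /** Pairs FORMAT keys with one sample's values; dropped fields become ".". */
    static Map<String, String> zip(String format, String sample) {
        String[] keys = format.split(":");
        String[] values = sample.split(":");
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            out.put(keys[i], i < values.length ? values[i] : ".");
        }
        return out;
    }

    /** '|' separates phased alleles, '/' unphased ones. */
    static boolean isPhased(String gt) {
        return gt.indexOf('|') >= 0;
    }

    /** Translates allele indexes into bases, given the REF and ALT columns. */
    static List<String> alleles(String gt, String ref, String alt) {
        String[] alts = alt.split(",");
        List<String> out = new ArrayList<>();
        for (String idx : gt.split("[|/]")) {
            if (idx.equals(".")) {
                out.add("."); // missing call
            } else {
                int i = Integer.parseInt(idx);
                out.add(i == 0 ? ref : alts[i - 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // rs6040355: REF=A, ALT=G,T, first sample
        Map<String, String> call = zip("GT:GQ:DP:HQ", "1|2:21:6:23,27");
        String gt = call.get("GT");
        System.out.println(alleles(gt, "A", "G,T") + " phased=" + isPhased(gt));
    }
}
```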
Tabix
Binning
Tabix INDEX
Building the TABIX index
$ bgzip -f file.vcf
$ tabix -p vcf file.vcf.gz
Querying the TABIX index
$ tabix file.vcf.gz chr3:1235-456778
API
Reading SAM with the samtools C library
/* Legacy samtools 0.1.x API (bam.h / sam.h); mode "rb" opens a BAM file. */
#include <stdlib.h>
#include <stdio.h>
#include "bam.h"
#include "sam.h"
int main(int argc, char *argv[]) {
    samfile_t *sam;
    bam1_t *b;
    long n = 0L;
    if (argc < 2) {
        fprintf(stderr, "Usage: %s in.bam\n", argv[0]);
        return EXIT_FAILURE;
    }
    sam = samopen(argv[1], "rb", 0);
    if (sam == NULL) {
        fprintf(stderr, "Cannot open %s\n", argv[1]);
        return EXIT_FAILURE;
    }
    b = bam_init1();
    /* count the reads whose "unmapped" flag is not set */
    while (samread(sam, b) > 0) {
        if (!(b->core.flag & BAM_FUNMAP)) ++n;
    }
    bam_destroy1(b);
    samclose(sam);
    printf("%ld\n", n);
    return 0;
}
Reading SAM with the java picard library
import java.io.File;
import java.io.IOException;
// SamReaderFactory lives in htsjdk, the successor of the net.sf.samtools package
import htsjdk.samtools.*;
public class CountMapped {
    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        // both the reader and the iterator are closed automatically
        try (SamReader sam = SamReaderFactory.makeDefault().open(f);
             SAMRecordIterator iter = sam.iterator()) {
            // count the reads that are not flagged as unmapped
            long n = iter.stream().
                filter(R -> !R.getReadUnmapped()).
                count();
            System.out.println(n);
        }
    }
}
End
Credits
Spec: https://samtools.github.io/hts-specs/
Angus: http://ged.msu.edu/angus/
Wikibooks: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Statements/Variables
Abecasis Group Wiki: http://genome.sph.umich.edu/wiki/SAM
Genome Research: http://genome.cshlp.org/content/12/6/996
Pierre Lindenbaumpierre.lindenbaum@univ-nantes.fr Next Generation SequencingFile Formats.

More Related Content

What's hot

BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1BITS
 
Scoring schemes in bioinformatics
Scoring schemes in bioinformaticsScoring schemes in bioinformatics
Scoring schemes in bioinformaticsSumatiHajela
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Safa Khalid
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysisyuvraj404
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformaticsatmapandey
 
Microarray technique
Microarray techniqueMicroarray technique
Microarray techniquearunchacko14
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matricesAshwini
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence AlignmentRavi Gandham
 
Needleman-Wunsch Algorithm
Needleman-Wunsch AlgorithmNeedleman-Wunsch Algorithm
Needleman-Wunsch AlgorithmProshantaShil
 
Oxford nanopore sequencing
Oxford nanopore sequencingOxford nanopore sequencing
Oxford nanopore sequencingSangeetha80717
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure predictionSiva Dharshini R
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques fikrem24yahoocom6261
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation SequencingArindam Ghosh
 

What's hot (20)

Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
 
BLAST
BLASTBLAST
BLAST
 
BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1BITS: UCSC genome browser - Part 1
BITS: UCSC genome browser - Part 1
 
Rna seq
Rna seqRna seq
Rna seq
 
Scoring schemes in bioinformatics
Scoring schemes in bioinformaticsScoring schemes in bioinformatics
Scoring schemes in bioinformatics
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysis
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Blast bioinformatics
Blast bioinformaticsBlast bioinformatics
Blast bioinformatics
 
Microarray technique
Microarray techniqueMicroarray technique
Microarray technique
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matrices
 
ILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptxILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptx
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Needleman-Wunsch Algorithm
Needleman-Wunsch AlgorithmNeedleman-Wunsch Algorithm
Needleman-Wunsch Algorithm
 
Oxford nanopore sequencing
Oxford nanopore sequencingOxford nanopore sequencing
Oxford nanopore sequencing
 
Dynamic programming
Dynamic programming Dynamic programming
Dynamic programming
 
Sequence Analysis
Sequence AnalysisSequence Analysis
Sequence Analysis
 
Secondary protein structure prediction
Secondary protein structure predictionSecondary protein structure prediction
Secondary protein structure prediction
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 

Similar to Next Generation Sequencing file Formats ( 2017 )

Cassandra : to be or not to be @ TechTalk
Cassandra : to be or not to be @ TechTalkCassandra : to be or not to be @ TechTalk
Cassandra : to be or not to be @ TechTalkAndriy Rymar
 
Overview of sparse and low-rank matrix / tensor techniques
Overview of sparse and low-rank matrix / tensor techniques Overview of sparse and low-rank matrix / tensor techniques
Overview of sparse and low-rank matrix / tensor techniques Alexander Litvinenko
 
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...
Self Managed and Automatically Reconfigurable Stream Processing - Vasiliki Ka...Flink Forward
 
Self-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingSelf-managed and automatically reconfigurable stream processing
Self-managed and automatically reconfigurable stream processingVasia Kalavri
 
Numerical Methods: curve fitting and interpolation
Numerical Methods: curve fitting and interpolationNumerical Methods: curve fitting and interpolation
Numerical Methods: curve fitting and interpolationNikolai Priezjev
 
Linking E-Mails and Source Code Artifacts
Linking E-Mails and Source Code ArtifactsLinking E-Mails and Source Code Artifacts
Linking E-Mails and Source Code ArtifactsAlberto Bacchelli
 
Fast and accurate metrics. Is it actually possible?
Fast and accurate metrics. Is it actually possible?Fast and accurate metrics. Is it actually possible?
Fast and accurate metrics. Is it actually possible?Bogdan Storozhuk
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing CoursePierre Lindenbaum
 
Tensorflow and python : fault detection system - PyCon Taiwan 2017
Tensorflow and python : fault detection system - PyCon Taiwan 2017Tensorflow and python : fault detection system - PyCon Taiwan 2017
Tensorflow and python : fault detection system - PyCon Taiwan 2017Eric Ahn
 
Daniel - Portfolio - Digital Copy
Daniel - Portfolio - Digital CopyDaniel - Portfolio - Digital Copy
Daniel - Portfolio - Digital CopyDaniel Lipinski
 
Simple Nested Sets and some other DB optimizations
Simple Nested Sets and some other DB optimizationsSimple Nested Sets and some other DB optimizations
Simple Nested Sets and some other DB optimizationsEli Aschkenasy
 
Ground Vibration Control Using Signature Hole Method - Thesis BE Mining, Univ...
Ground Vibration Control Using Signature Hole Method - Thesis BE Mining, Univ...Ground Vibration Control Using Signature Hole Method - Thesis BE Mining, Univ...
Ground Vibration Control Using Signature Hole Method - Thesis BE Mining, Univ...Muhamad Rizky
 
Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Alexander Litvinenko
 
ch03_block_ciphers_nemo (2) (1).ppt
ch03_block_ciphers_nemo (2) (1).pptch03_block_ciphers_nemo (2) (1).ppt
ch03_block_ciphers_nemo (2) (1).pptMrsPrabhaBV
 

Next Generation Sequencing File Formats (2017)

  • 1. Next Generation Sequencing File Formats. Pierre Lindenbaum, pierre.lindenbaum@univ-nantes.fr, September 18, 2017
  • 2. You don’t need to have a deep knowledge of those formats. (Unless you’re doing NGS)
  • 3. Understand how people have solved their BIG data problems.
  • 4. Why sequencing?
  • 7. Well, that’s a little more complicated ...
  • 9. FASTQ: a text-based format for storing both a DNA sequence and its corresponding quality scores.
  • 10. FASTQ for single end
  • 11. FASTQ for paired end
  • 13. FASTQ name
    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
    Col       Brief description
    EAS139    the unique instrument name
    136       the run id
    FC706VJ   the flowcell id
    2         flowcell lane
    2104      tile number within the flowcell lane
    15343     'x'-coordinate of the cluster within the tile
    197393    'y'-coordinate of the cluster within the tile
    1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
    Y         Y if the read fails filter (read is bad), N otherwise
    18        0 when none of the control bits are on, otherwise it is an even number
    ATCACG    index sequence
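    The field layout above can be sketched in a few lines of Python. This is only an illustration that assumes the header always has the two colon-separated groups shown on the slide; the names `IlluminaName` and `parse_illumina_name` are hypothetical and not part of any library.

```python
from typing import NamedTuple


class IlluminaName(NamedTuple):
    """Fields of an Illumina read name, in the order listed on the slide."""
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index_sequence: str


def parse_illumina_name(header: str) -> IlluminaName:
    """Split a FASTQ header line such as '@EAS139:136:... 1:Y:18:ATCACG'."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line starts with '@'")
    # The first group locates the cluster, the second describes the read.
    location, description = header[1:].strip().split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaName(
        instrument, int(run_id), flowcell_id, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


name = parse_illumina_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(name.flowcell_id, name.lane, name.fails_filter)
```

    Older headers, like the `@IL31_4368:1:1:996:8507/1` style used elsewhere in these slides, follow a different layout and would need their own parser.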
  • 14. FASTQ Quality
  • 15. FASTQ Quality. A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect): $Q_{\text{sanger}} = -10 \log_{10} p$
  • 16. FASTQ Quality. Since a human readable format is desired for SAM, 33 is added to the calculated quality in order to make it a printable character ranging from ! to ~. The stored character therefore has the ASCII code $Q_{\text{sanger}} + 33 = -10 \log_{10} p + 33$
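    The two formulas above can be checked with a short Python sketch. The function names are hypothetical, and rounding to the nearest integer is an assumption made here for illustration.

```python
import math
from typing import List


def phred_from_error_probability(p: float) -> int:
    """Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def error_probability(q: int) -> float:
    """Inverse mapping: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def encode_phred33(q: int) -> str:
    """Add 33 so that the quality becomes a printable ASCII character."""
    return chr(q + 33)


def decode_phred33(quality_string: str) -> List[int]:
    """Turn a FASTQ/SAM quality string back into integer qualities."""
    return [ord(c) - 33 for c in quality_string]


q = phred_from_error_probability(0.001)   # one error in a thousand calls
print(q, encode_phred33(q), decode_phred33("!I"))
```

    A base called with an error probability of 0.001 gets Q = 30, which is stored as the character with ASCII code 63.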
  • 17. Aligned Reads
    [Figure: text view of reads piled up under the reference, positions 44187101 to 44187171; upper-case and lower-case rows are reads on the two strands]
    REFERENCE  aaatgagccaggtgtggtggtgcacacctatagtcccagctacgcaggaggctgaggtgggaggatcgcttaaacccggc
    CONSENSUS  ............................................Y...................................
  • 19. SAM Format. "SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments"
  • 20. SAM Format
    - Is flexible enough to store all the alignment information generated by various alignment programs;
    - Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
    - Is compact in file size;
    - Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
    - Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
  • 21. SAM Format Structure
    + HEADER
      - version
      - program parameters
      + GENOME
        - chrom1 size
        - chrom2 size
        - chrom3 size
        - (..)
      + GROUPS
        - group1 : sample1, lane 4
        - group2 : sample2, lane 1
    + BODY
      - READ1 -> group1
      - READ2 -> group1
      - READ3 -> group1
      - READ4 -> group2
      - (...)
  • 22. SAM Example. Simple example
    @HD VN:1.5 SO:coordinate
    @SQ SN:ref LN:45
    r001  163 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
    r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
    r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
    r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
    r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
    r001   83 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
  • 23. SAM Header Section
  • 24. SAM Header
    @HD VN:1.0 SO:coordinate
    @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:human.fasta M5:1b22b98cdeb
    @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:human.fasta M5:a0d9851da00
    @SQ SN:3 LN:198022430 AS:NCBI37 UR:file:human.fasta M5:fdfd811849c
    @RG ID:UM0098:1 PL:ILLUMINA SM:SD37743 CN:Nantes
    @RG ID:UM0098:2 PL:ILLUMINA SM:SD37743 CN:Nantes
    @PG ID:bwa VN:0.5.4
    @PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=(...)
  • 25. SAM Alignment Section
  • 26. SAM Example. Simple example
    IL31_4368:1:1:996:8507   77 * 0 0 * * 0 0
    IL31_4368:1:1:996:8507  141 * 0 0 * * 0 0
    IL31_4368:1:1:996:21421  77 * 0 0 * * 0 0
    IL31_4368:1:1:996:21421 141 * 0 0 * * 0 0
    IL31_4368:1:1:997:10572  77 * 0 0 * * 0 0
    IL31_4368:1:1:997:10572 141 * 0 0 * * 0 0
    IL31_4368:1:1:997:15684  83 chr1 241356612 60 54M = 241356
    IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356
    IL31_4368:1:1:997:15249  77 * 0 0 * * 0 0
    IL31_4368:1:1:997:15249 141 * 0 0 * * 0 0
    IL31_4368:1:1:997:6273   77 * 0 0 * * 0 0
    IL31_4368:1:1:997:6273  141 * 0 0 * * 0 0
    IL31_4368:1:1:997:1657   83 chr1 143630364 60 54M = 143630
    IL31_4368:1:1:997:1657  163 chr1 143630066 60 54M = 143630
    IL31_4368:1:1:997:5609   77 * 0 0 * * 0 0
    IL31_4368:1:1:997:5609  141 * 0 0 * * 0 0
    IL31_4368:1:1:997:14262  77 * 0 0 * * 0 0
    IL31_4368:1:1:997:14262 141 * 0 0 * * 0 0
    IL31_4368:1:1:998:19914  77 * 0 0 * * 0 0
    IL31_4368:1:1:998:19914 141 * 0 0 * * 0 0
  • 27. SAM Example. Sorted SAM. One row is one read, NOT one fragment.
    IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21  58 (...)
    IL31_4368:1:107:15207:19097  83 chr1 21 0 54M = 17 -58 (...)
    IL31_4368:1:10:17817:9758   137 chr1 23 0 54M = 23   0 (...)
    IL31_4368:1:54:13142:21400  163 chr1 37 0 54M = 44  61 (...)
    IL31_4368:1:54:13142:21400   83 chr1 44 0 54M = 37 -61 (...)
  • 28. SAM Specifications. Record Column
    Col  Field  Type    Brief description
    1    QNAME  String  Query template NAME
    2    FLAG   Int     bitwise FLAG
    3    RNAME  String  Reference sequence NAME
    4    POS    Int     1-based leftmost mapping POSition
    5    MAPQ   Int     MAPping Quality
    6    CIGAR  String  CIGAR string
    7    RNEXT  String  Ref. name of the mate/next read
    8    PNEXT  Int     Position of the mate/next read
    9    TLEN   Int     observed Template LENgth
    10   SEQ    String  segment SEQuence
    11   QUAL   String  ASCII of Phred-scaled base QUALity+33
    12   META           metadata
  • 29. SAM Specifications. Record Column
    Col  Field  Value
    1    QNAME  IL31_4368:1:42:12530:7509
    2    FLAG   137
    3    RNAME  chr1
    4    POS    10
    5    MAPQ   30
    6    CIGAR  54M
    7    RNEXT  =
    8    PNEXT  100
    9    TLEN   90
    10   SEQ    TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
    11   QUAL   GGGGGGGFEGGGGCFGGGGGEGGFGEGGFGFGGFGFEGFCF
    12   META   XT:A:R NM:i:3 SM:i:0 AM:i:0 X0:i:11 X1:i:0 XM:i:3 XO:i:0 X
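    As a sketch of how the column table above maps onto code, the following Python splits one tab-separated alignment line into the eleven mandatory fields plus the optional tags. `SamRecord` and `parse_sam_line` are hypothetical names for this illustration, and no validation beyond the field count is attempted.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns, followed by the optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a '@' header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs at least 11 tab-separated columns")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


record = parse_sam_line("r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1")
print(record.rname, record.pos, record.cigar, record.tags)
```

    Because every record sits on its own line, a file can be processed as a stream, one `parse_sam_line` call at a time, without loading the whole alignment into memory.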
  • 30. SAM FLAGS
    - read paired
    - read mapped in proper pair
    - read unmapped
    - mate unmapped
    - read reverse strand
    - mate reverse strand
    - first in pair
    - second in pair
    - not primary alignment
    - read fails platform/vendor quality checks
    - read is PCR or optical duplicate
    - supplementary alignment
  • 31. SAM FLAGS
  • 32. SAM FLAGS: read paired
  • 33. SAM FLAGS: read mapped in proper pair
  • 34. Read mapped in proper pair
  • 35. SAM FLAGS: read unmapped
  • 36. SAM FLAGS: mate unmapped
  • 37. SAM FLAGS: read reverse strand
  • 38. SAM FLAGS: mate reverse strand
  • 39. SAM FLAGS: first in pair
  • 40. SAM FLAGS: second in pair
  • 41. SAM FLAGS: not primary alignment
  • 42. SAM FLAGS: read fails platform/vendor quality checks
  • 43. SAM FLAGS: read is PCR or optical duplicate
  • 44. SAM FLAGS: supplementary alignment
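    The FLAG column packs the twelve properties listed above into one integer, one bit per property. A minimal Python sketch of the decoding follows; the bit values are those of the SAM specification, while `SAM_FLAGS` and `decode_flag` are names made up for this example.

```python
from typing import List

# (bit, meaning) pairs, in the order used on the slides.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the meaning of every bit that is set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


for flag in (77, 83, 163, 2064):
    print(flag, decode_flag(flag))
```

    For instance 163 = 1 + 2 + 32 + 128, so the read is paired, mapped in a proper pair, has its mate on the reverse strand, and is the second read of the pair.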
  • 45. SAM CIGAR. The CIGAR string is a sequence of base lengths and their associated operations. It indicates which bases align with the reference (as either a match or a mismatch), which are deleted from the reference, and which are insertions that are not in the reference.
  • 46. SAM Cigar
    Op  BAM  Description
    M   0    alignment match (can be a sequence match or mismatch)
    I   1    insertion to the reference
    D   2    deletion from the reference
    N   3    skipped region from the reference
    S   4    soft clipping (clipped sequences present in SEQ)
    H   5    hard clipping (clipped sequences NOT present in SEQ)
    P   6    padding (silent deletion from padded reference)
    =   7    sequence match
    X   8    sequence mismatch
  • 47. SAM Cigar
  • 48. SAM Cigar (http://genome.sph.umich.edu/wiki/SAM)
    RefPos:      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
    Reference:   C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
    Read: ACTAGAATGGCT
    Aligning these two:
    RefPos:      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
    Reference:   C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
    Read:                    A  C  T  A  G  A  A     T  G  G  C  T
    With the alignment above, you get:
    POS: 5
    CIGAR: 3M1I3M1D5M
    or
    CIGAR: 3=1I3=1D2=1X2=
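    The example above can be verified with a small Python sketch that splits a CIGAR string into (length, operation) pairs and adds up how many reference and read bases it covers. The sets of operations that consume the reference or the read follow the SAM specification; the function names are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_STRING = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")   # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")       # operations that advance along the read (SEQ)


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split '3M1I3M1D5M' into [(3, 'M'), (1, 'I'), ...]."""
    if not CIGAR_STRING.fullmatch(cigar):
        raise ValueError("not a valid CIGAR string: %r" % cigar)
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases that must be present in SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


cigar = "3M1I3M1D5M"
print(parse_cigar(cigar), reference_span(cigar), query_length(cigar))
```

    With POS 5 and a reference span of 12, the read ACTAGAATGGCT covers reference positions 5 to 16, and its 12 bases agree with the query length computed from the CIGAR.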
  • 49. SAM Cigar: Soft Clip
  • 50. SAM Cigar: Hard Clip
  • 51. SAM Format optional TAGs. Optional fields on a SAM/BAM alignment. A TAG is composed of a two-character TAG key, the type of the value, and the value:
    [A-Za-z][A-Za-z0-9]:[AifZH]:.*
    The types A, i, f, Z and H indicate the type of the value stored in the tag.
    Type  Description
    A     character
    i     signed 32-bit integer
    f     single-precision float
    Z     string
    H     hex string
  • 52. SAM Format optional TAGs
    - XT:A:U : user-defined tag called XT. It holds a character. The value associated with this tag is 'U'.
    - NM:i:2 : predefined tag NM, meaning the edit distance to the reference (number of changes necessary to make the read equal to the reference, excluding clipping).
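    The KEY:TYPE:VALUE layout described above can be sketched in Python as follows. The pattern assumes that the second character of the key may be a digit, as in the X0 and X1 tags shown earlier, and `parse_tag` is a hypothetical helper, not a library function.

```python
import re
from typing import Tuple, Union

TAG_PATTERN = re.compile(r"([A-Za-z][A-Za-z0-9]):([AifZH]):(.*)")


def parse_tag(text: str) -> Tuple[str, str, Union[int, float, str]]:
    """Split an optional field into (key, type, value) and convert the value."""
    match = TAG_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("not a valid optional field: %r" % text)
    key, type_code, raw = match.groups()
    if type_code == "i":
        value = int(raw)        # signed integer
    elif type_code == "f":
        value = float(raw)      # floating point number
    else:
        value = raw             # A, Z and H are kept as text in this sketch
    return key, type_code, value


for field in ("XT:A:U", "NM:i:2", "SA:Z:ref,29,-,6H5M,17,0;"):
    print(parse_tag(field))
```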
  • 53. SAM Example. Sorted SAM
    IL31_4368:1:107:15207:19097 163 chr1 17 0 54M = 21
    IL31_4368:1:107:15207:19097  83 chr1 21 0 54M = 17
    IL31_4368:1:54:13142:21400  163 chr1 37 0 54M = 44
    IL31_4368:1:54:13142:21400   83 chr1 44 0 54M = 37
  • 55. BGZF Format. The SAM/BAM file format (Sequence Alignment/Map) comes in a plain text format (SAM) and a compressed binary format (BAM). The latter uses a modified form of gzip compression called BGZF (Blocked GNU Zip Format), which can be applied to any file format to provide compression with efficient random access.
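    The idea behind blocked compression can be illustrated with Python's standard library. This is a simplified sketch and not a BGZF implementation: it only shows that independently compressed gzip blocks can be concatenated into one valid gzip stream and still be read one at a time from a known offset. The helper names and the block size are assumptions made for the example.

```python
import gzip
import zlib
from typing import List, Tuple


def compress_in_blocks(data: bytes, block_size: int) -> Tuple[bytes, List[int]]:
    """Compress each chunk as its own gzip member and record where it starts."""
    compressed = bytearray()
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(compressed))
        compressed += gzip.compress(data[start:start + block_size])
    return bytes(compressed), offsets


def read_block(compressed: bytes, offset: int) -> bytes:
    """Decompress only the block that starts at the given byte offset."""
    decompressor = zlib.decompressobj(31)   # 31 = expect a gzip header and trailer
    # Decompression stops at the end of the first gzip member it meets.
    return decompressor.decompress(compressed[offset:])


data = b"ACGT" * 1000
blob, offsets = compress_in_blocks(data, 1024)
third_block = read_block(blob, offsets[2])        # random access to one block
print(len(offsets), third_block == data[2048:3072], gzip.decompress(blob) == data)
```

    An index then only has to remember, for each genomic region, the offset of the block that holds it, which is what makes the random access efficient.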
  • 56. BAM INDEX
  • 58. Other Sequencing technologies
  • 60. VCF: Variant Call Format
  • 61. VCF Format. VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
  • 62. VCF Example
    ##fileformat=VCFv4.0
    ##fileDate=20090805
    ##source=myImputationProgramV3.1
    ##reference=1000GenomesPilot-NCBI36
    ##phasing=partial
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
    ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA000
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:
    20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,
    20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,2
    20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:5
    20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2
  • 64. VCF INFO. INFO fields should be described as follows:
    ##INFO=<ID=ID,Number=number,Type=type,Description="description">
    (...)
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    (...) INFO FORMAT NA00001 NA00002 NA00003
    20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
    20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
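    A short Python sketch of how the INFO column of a data line can be split. It keeps every value as text and does not use the Type declared in the ##INFO header lines, which a real parser would; `parse_info` is a hypothetical helper name.

```python
from typing import Dict, List, Union


def parse_info(info: str) -> Dict[str, Union[List[str], bool]]:
    """Split 'NS=3;DP=14;AF=0.5;DB;H2' into a dictionary.

    KEY=VALUE entries become lists of strings (values may be comma-separated),
    entries without '=' are flags and are stored as True.
    """
    result: Dict[str, Union[List[str], bool]] = {}
    if info == ".":                      # '.' means that the column is empty
        return result
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value.split(",")
        else:
            result[entry] = True
    return result


print(parse_info("NS=3;DP=14;AF=0.5;DB;H2"))
print(parse_info("NS=2;DP=10;AF=0.333,0.667;AA=T;DB")["AF"])
```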
  • 65. VCF FILTERs. FILTERs that have been applied to the data should be described as follows:
    ##FILTER=<ID=ID,Description="description">
    (...)
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50 percent of samples have data">
    (...)
    #CHROM POS ID REF ALT QUAL FILTER (...)
    20 14370 rs6054257 G A 29 PASS (...)
    20 17330 . T A 3 q10 (...)
    20 1110696 rs6040355 A G,T 67 PASS (...)
  • 66. VCF FORMAT. Genotype fields specified in the FORMAT field should be described as follows:
    ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
    (...)
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    (...)
    # (...) FORMAT NA00001 NA00002 NA00003
    (...) GT:GQ:DP:HQ 0|0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.,.
    (...) GT:GQ:DP:HQ 0|0:49:3:58,50 0/1:3:5:65,3 0/0:41:3
    (...) GT:GQ:DP:HQ 1|2:21:6:23,27 2/1:2:0:18,2 2/2:35:4
    (...) GT:GQ:DP:HQ 0|0:54:7:56,60 0/0:48:4:51,51 0/0:61:2
    (...) GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
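    The FORMAT column names the colon-separated values found in each sample column. A minimal Python sketch of that pairing follows; the helper names are hypothetical, and the values are kept as text apart from the genotype alleles.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_sample(format_field: str, sample_field: str) -> Dict[str, str]:
    """Pair the keys of the FORMAT column with the values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip() stops at the shorter list, so omitted trailing values are simply absent.
    return dict(zip(keys, values))


def parse_genotype(gt: str) -> Tuple[List[Optional[int]], bool]:
    """Return (allele indexes, phased) for a GT value such as '0|1' or '1/1'."""
    phased = "|" in gt                   # '|' = phased, '/' = unphased
    alleles = [None if a == "." else int(a) for a in re.split(r"[|/]", gt)]
    return alleles, phased


sample = parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51")
print(sample, parse_genotype(sample["GT"]))
print(parse_sample("GT:GQ:DP:HQ", "0/0:41:3"))
```

    Allele index 0 refers to the REF column and 1, 2, ... to the comma-separated entries of ALT, so `1|2` on the line with `A G,T` means one G and one T allele.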
  • 69. Tabix INDEX
  • 70. Building the TABIX index
        $ bgzip -f file.vcf
        $ tabix -p vcf file.vcf.gz
  • 71. Querying the TABIX index
        $ tabix file.vcf.gz chr3:1235-456778
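The region argument follows the pattern `contig:start-end`. As a rough illustration of what such a query string encodes, the sketch below parses it and tests whether a position falls inside it. It is not how tabix itself is implemented: `RegionSketch`, `parse` and `contains` are hypothetical names, and the sketch assumes the coordinates are 1-based and inclusive at both ends.

```java
public class RegionSketch {
    public final String contig;
    public final int start; // assumed 1-based, inclusive
    public final int end;   // assumed 1-based, inclusive

    public RegionSketch(String contig, int start, int end) {
        this.contig = contig;
        this.start = start;
        this.end = end;
    }

    /** Parses a string such as "chr3:1235-456778". */
    public static RegionSketch parse(String region) {
        int colon = region.lastIndexOf(':');
        int dash = region.indexOf('-', colon + 1);
        if (colon < 0 || dash < 0) {
            throw new IllegalArgumentException("expected contig:start-end, got " + region);
        }
        return new RegionSketch(
            region.substring(0, colon),
            Integer.parseInt(region.substring(colon + 1, dash)),
            Integer.parseInt(region.substring(dash + 1)));
    }

    /** True when the given contig and position fall inside this region. */
    public boolean contains(String otherContig, int pos) {
        return contig.equals(otherContig) && start <= pos && pos <= end;
    }

    public static void main(String[] args) {
        RegionSketch r = parse("chr3:1235-456778");
        System.out.println(r.contig + " " + r.start + " " + r.end);
        System.out.println(r.contains("chr3", 2000));
        System.out.println(r.contains("chr20", 2000));
    }
}
```

The index exists so that only the compressed blocks overlapping the region need to be read; this sketch covers only the string handling, not the index lookup.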
  • 73. Reading SAM with the samtools C library
        #include <stdlib.h>
        #include <stdio.h>
        #include "bam.h"
        #include "sam.h"

        int main(int argc, char *argv[]) {
            if (argc < 2) return EXIT_FAILURE;
            /* mode "rb" opens the input as BAM, the binary form of SAM */
            samfile_t *sam = samopen(argv[1], "rb", 0);
            if (sam == NULL) return EXIT_FAILURE;
            bam1_t *b = bam_init1();
            long n = 0L;
            /* count the reads whose "unmapped" flag is not set */
            while (samread(sam, b) > 0) {
                if (!(b->core.flag & BAM_FUNMAP)) ++n;
            }
            bam_destroy1(b);
            samclose(sam);
            printf("%ld\n", n);
            return 0;
        }
  • 74. Reading SAM with the java picard library
        import java.io.File;
        import java.io.IOException;
        import htsjdk.samtools.*;

        public class CountMapped {
            public static void main(String[] args) throws IOException {
                File f = new File(args[0]);
                SamReader sam = SamReaderFactory.makeDefault().open(f);
                SAMRecordIterator iter = sam.iterator();
                // count the reads that are not flagged as unmapped
                long n = iter.stream().filter(R -> !R.getReadUnmapped()).count();
                iter.close();
                sam.close();
                System.out.println(n);
            }
        }
  • 76. Credits
        Spec: https://samtools.github.io/hts-specs/
        Angus: http://ged.msu.edu/angus/
        Wikibooks: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Statements/Variables
        Abecasis Group Wiki: http://genome.sph.umich.edu/wiki/SAM
        Genome Research: http://genome.cshlp.org/content/12/6/996