1/26/2014

NGS
Data Formats & QC Analysis

Karan Veer Singh
Scientist, NBAGR
NGS - QC & Dataformat

The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification. This presentation shows the parameters used for NGS data quality checks and the data formats of the major sequencing platforms.

Published in: Education, Technology
  1. NGS Data Formats & QC Analysis. Karan Veer Singh, Scientist, NBAGR. 1/26/2014
  2. Sequence Formats
     • Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (binary formats such as .sff and BAM are the exceptions)
     • Formats are designed to hold sequence data and other information about the sequence
  3. Why so many formats?
     • Created based on the information required for each step of analysis
     • Efficient data and time management
     Types of sequence file formats:
     • Raw sequence files
     • Co-ordinate files
     • Parameter files
     • Annotation files
     • Metadata files
     Each data format varies in the information it contains
  4. Read output formats
     • 454
     • Solexa/Illumina
     • SOLiD
  5. 454 output formats
     • .sff (Standard Flowgram Format)
     • .fna (FASTA sequences)
     • .qual (quality scores)
  6. Illumina output formats
     • .seq.txt
     • .prb.txt
     • Illumina FASTQ (ASCII value - 64 is the Illumina score)
     • Qseq (ASCII value - 64 is the Phred score)
     • Phred quality scores
     • Illumina single-line format
     • SCARF (Solexa Compact ASCII Read Format)
  7. Illumina FastQ
     • ASCII value for h = 104
     • Quality of base A at position 1 = 104 - 64 = 40
     • Where 40 is the Phred score
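The offset arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the older Illumina encoding in which the quality character's ASCII code minus 64 is the Phred score; the function names and the example quality string are made up for the sketch.

```python
def phred64_to_scores(quality_string):
    """Decode an offset-64 (older Illumina) quality string into integer scores."""
    return [ord(ch) - 64 for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# 'h' has ASCII code 104, so it decodes to a Phred score of 40.
scores = phred64_to_scores("hgfB")
print(scores)                        # [40, 39, 38, 2]
print(error_probability(scores[0]))  # about 1 error in 10,000 base calls
```

"Standard" (Sanger) FASTQ uses an offset of 33 instead of 64, so the same function with 33 substituted would apply there.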
  8. SOLiD output format(s)
     CSFASTA: color-space sequence reads in a FASTA format
     • These reads can be retained and analyzed in color space by software
     • The Format Conversion Tool offers options for cleaning of the CSFASTA files
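To make "color space" concrete, here is a small sketch of how a color-space read maps back to bases. It assumes the usual two-base encoding (color 0 keeps the previous base; 1, 2 and 3 swap it according to the table below) and a read that starts with a known primer base. The function name and example read are illustrative, and missing calls (`.`) are not handled.

```python
# For each color, the base that follows a given previous base.
NEXT_BASE = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "G": "A", "C": "T", "T": "C"},
    "3": {"A": "T", "T": "A", "C": "G", "G": "C"},
}


def decode_colorspace(read):
    """Translate e.g. 'T0123' (primer base + colors) into a base sequence."""
    base = read[0]          # primer base; not part of the sequenced fragment
    bases = []
    for color in read[1:]:
        base = NEXT_BASE[color][base]
        bases.append(base)
    return "".join(bases)


print(decode_colorspace("T0123"))  # TGAT
```

Because every decoded base depends on the one before it, a single wrong color call corrupts the rest of the decoded read, which is one reason to keep the reads in color space during analysis.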
  9. Read Length
     • Sanger read lengths ~800-2000 bp
     • Generally we define short reads as anything below 200 bp
       − Illumina (100 bp - 250 bp)
       − SOLiD (75 bp max)
       − Ion Torrent (200-300 bp max, currently...)
       − Roche 454 (400-800 bp)
     • Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) rather than 100 or 200 bp reads
     • Diminishing returns: for some applications 50 bp is more than sufficient
       − Resequencing of smaller organisms
       − Bacterial de novo assembly
       − ChIP-Seq
       − Digital gene expression profiling
       − Bacterial RNA-seq
  10. Alignment/Assembly Format
     Common ("standard") formats for read alignments:
     • SAM
     • BAM (= binary SAM)
     • MAQ
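As an illustration of what a SAM alignment record holds, the sketch below splits one tab-separated line into the eleven mandatory SAM fields plus any optional tags. The record shown is invented for the example, and `parse_sam_line` is a hypothetical helper, not part of any library.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Made-up record: a 5-base read aligned to the reverse strand of chr1.
line = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar)
print("reverse strand:", bool(rec.flag & 0x10))
print("unmapped:", bool(rec.flag & 0x4))
```

In practice a dedicated library or samtools would be used, especially for BAM, which is compressed binary and cannot be read line by line like this.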
  11. Sequencers & Sequence Assembly Packages
  12. Formats for Genome/Gene annotation
     • BED format (genome-browser tracks)
     • GFF format (gene/genome features)
     • BioXSD (XML) (any annotation; under development)
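One practical difference between BED and GFF is the coordinate convention: BED intervals are 0-based with an exclusive end, while GFF intervals are 1-based with an inclusive end. A minimal sketch of the conversion (helper names are illustrative):

```python
def bed_to_gff_interval(bed_start, bed_end):
    """BED (0-based, end exclusive) -> GFF (1-based, end inclusive)."""
    return bed_start + 1, bed_end


def gff_to_bed_interval(gff_start, gff_end):
    """GFF (1-based, end inclusive) -> BED (0-based, end exclusive)."""
    return gff_start - 1, gff_end


# The first 100 bases of a chromosome:
print(bed_to_gff_interval(0, 100))   # (1, 100)
print(gff_to_bed_interval(1, 100))   # (0, 100)
```

Forgetting this shift is a common source of off-by-one errors when annotation files are converted between the two formats.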
  13. If reads should be deposited in a public repository: SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
  14. Points to remember on Data Formats
     • For base-call data, "standard" FASTQ (Sanger, Phred)
     • For read alignments, SAM/BAM/MAQ format
     • For annotation results, e.g. GFF or BED format
  15. QC analysis
  16. All platforms have errors
     Platforms: Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent
     1. Removal of low-quality bases / low-complexity regions
     2. Removal of adaptor sequences
     3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
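The homopolymer definition on this slide (3 or more identical bases in a row) is easy to locate with a regular expression. The sketch below only finds such runs; it does not model the base-call errors themselves. The function name and example sequence are illustrative.

```python
import re

# One base followed by at least two more copies of itself.
HOMOPOLYMER = re.compile(r"([ACGT])\1{2,}")


def find_homopolymers(seq):
    """Return (0-based start, run) for every run of 3+ identical bases."""
    return [(m.start(), m.group(0)) for m in HOMOPOLYMER.finditer(seq.upper())]


print(find_homopolymers("ACGTTTTACCCGA"))  # [(3, 'TTTT'), (8, 'CCC')]
```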
  17. Illumina artefacts
     • Under-represented GC-rich regions
     • PCR
     • Sequencing
     • GGC/GCC motif is associated with low quality and mismatches
     • Low-quality reads: Phred score < 20
  18. Need for QC & Preprocessing
     QC analysis of sequence data is extremely important for meaningful downstream analysis
     • To analyze problems in quality scores/statistics of sequencing data
     • To check whether further analysis with the sequence is possible
     • To remove redundancy (filtering)
     • To remove low-quality reads from analysis
     • To remove adapter contamination
     Highly efficient and fast processing tools are required to handle large volumes of data
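Two of the preprocessing steps listed above, trimming low-quality bases and removing adapter contamination, can be sketched as follows. This is a deliberately simple illustration: the Phred cutoff of 20 is a common but arbitrary choice, the adapter sequence is a placeholder, and only exact adapter matches are handled, whereas real trimming tools tolerate mismatches and partial adapters.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def remove_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


print(trim_3prime("ACGTAC", [30, 30, 25, 20, 10, 5]))  # ('ACGT', [30, 30, 25, 20])
print(remove_adapter("ACGTAGATCGG", "AGATCGG"))        # ACGT
```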
  19. Need for QC & Preprocessing
     • The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification
     • Most of the programs available for downstream analyses do not provide a utility for quality checking and filtering of NGS data before processing
  20. NGS QC Toolkit & FastQC
     • NGS QC Toolkit performs quality checks and filtering to retain high-quality reads
     • The toolkit is a standalone, open-source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html
     • The applications are implemented in the Perl programming language
     • QC of sequencing data generated using the Roche 454 and Illumina platforms
     • Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)
     FastQC can be used only for preliminary analysis
  21. (no text on slide)
  22. (no text on slide)
  23. NGS QC Toolkit output
  24. NGS QC Toolkit output
  25. Comparison - QC tools
  26. FastQC
     • Basic statistics
     • Quality per base position
     • Per sequence quality distribution
     • Nucleotide content per position
     • Per sequence GC distribution
     • Per base GC distribution
     • Per base N content
     • Length distribution
     • Overrepresented/duplicated sequences
     • K-mer content
  27. FastQC (box-whisker plot). Y axis: quality score. X axis: base position
  28. 2. Quality per base position
  29. 2. Quality per base position
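The box-whisker plot of quality per base position summarises, for each position in the read, the spread of quality scores across all reads. A rough sketch of the underlying numbers, assuming equal-length reads already decoded to integer Phred scores (the function name and toy data are illustrative, and FastQC's exact whisker definition may differ):

```python
from statistics import quantiles


def per_base_quality(quality_rows):
    """Five-number summary of the quality scores at each read position."""
    summary = []
    for column in zip(*quality_rows):  # one column per base position
        q1, med, q3 = quantiles(column, n=4, method="inclusive")
        summary.append({"min": min(column), "q1": q1, "median": med,
                        "q3": q3, "max": max(column)})
    return summary


rows = [[10, 30], [20, 30], [30, 30], [40, 30], [50, 30]]
for position, stats in enumerate(per_base_quality(rows), start=1):
    print(position, stats)
```

A position whose median or lower quartile drops well below the rest of the read, typically towards the 3' end, is a candidate for trimming.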
  30. 3. Per sequence quality distribution
  31. 3. Per sequence quality distribution
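The per-sequence quality distribution is a histogram of each read's mean quality. A minimal sketch, assuming reads already decoded to integer Phred scores (names and data are illustrative):

```python
from collections import Counter


def mean_quality_histogram(quality_rows):
    """Count reads by their mean Phred score, rounded to the nearest integer."""
    return Counter(round(sum(row) / len(row)) for row in quality_rows)


hist = mean_quality_histogram([[30, 30], [28, 32], [10, 12]])
print(sorted(hist.items()))  # [(11, 1), (30, 2)]
```

A second peak at low mean quality usually points to a subset of reads that failed, rather than a uniform loss of quality.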
  32. 4. Nucleotide content per position
  33. 4. Nucleotide content per position
  34. 5. Per sequence GC distribution
  35. 5. Per sequence GC distribution
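Nucleotide content per position is the percentage of A, C, G, T (and N) observed at each read position. A small sketch with made-up reads; the function name is illustrative:

```python
from collections import Counter


def base_composition(reads):
    """Percentage of each base (A, C, G, T, N) at every read position."""
    table = []
    for i in range(max(len(r) for r in reads)):
        counts = Counter(r[i].upper() for r in reads if len(r) > i)
        total = sum(counts.values())
        table.append({b: 100.0 * counts[b] / total for b in "ACGTN"})
    return table


for position, row in enumerate(base_composition(["ACGT", "ACGA", "TCGN", "ACG"]), 1):
    print(position, row)
```

In an unbiased library the four lines stay roughly parallel along the read; strong position-specific skews often indicate adapters or priming bias.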
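The per-sequence GC distribution is built from the GC percentage of each individual read. A minimal sketch, in which ambiguous bases (N) are excluded from the denominator as a simplifying assumption:

```python
def gc_percent(seq):
    """GC percentage of one read, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


for read in ["GGCC", "ATGC", "AATTN"]:
    print(read, gc_percent(read))
```

Plotting these values as a histogram and comparing the shape with the expected genome GC content helps reveal contamination or GC bias.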
  36. 6. Per base GC distribution
  37. 6. Per base GC distribution
  38. 7. Per base N content
  39. 7. Length distribution
  40. 8. K-mer content
     Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.
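The idea of "overall enrichment" can be sketched as an observed/expected ratio, where the expected count of a k-mer comes from the base frequencies of the whole library. This is an assumption-laden illustration, not FastQC's actual implementation; it covers only the overall (3-fold) criterion, not the per-position (5-fold) one, and the names and reads are made up.

```python
from collections import Counter


def kmer_enrichment(reads, k=5):
    """Observed/expected ratio for every k-mer, assuming independent bases."""
    base_counts = Counter("".join(reads))
    total_bases = sum(base_counts.values())
    freq = {b: c / total_bases for b, c in base_counts.items()}

    kmers = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total_kmers = sum(kmers.values())

    ratios = {}
    for kmer, observed in kmers.items():
        expected = total_kmers
        for base in kmer:
            expected *= freq[base]
        ratios[kmer] = observed / expected
    return ratios


ratios = kmer_enrichment(["AAAA", "CCCC", "GGGG", "TTTT"], k=2)
flagged = {kmer: r for kmer, r in ratios.items() if r > 3}
print(flagged)
```

In this toy library each of AA, CC, GG and TT is seen 3 times against an expectation of 0.75, a 4-fold enrichment, so all four would be reported.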
  41. 9. Overrepresented/duplicate sequences
     • The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences
     • Too many duplicated sequences usually indicate a problem with library preparation or sequencing
     • This module will issue a warning if any sequence is found to represent more than 0.1% of the total
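The 0.1% rule can be illustrated by counting exact duplicates and reporting any sequence above the threshold. This sketch counts every read in memory, which is a simplification for illustration and not necessarily how the tool samples its input; names and data are made up.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.1):
    """Sequences that make up more than threshold_percent of all reads."""
    total = len(reads)
    result = {}
    for seq, count in Counter(reads).items():
        percent = 100.0 * count / total
        if percent > threshold_percent:
            result[seq] = percent
    return result


# 1000 reads: one sequence seen twice (0.2%), everything else seen once (0.1%).
reads = ["ADAPTERSEQ"] * 2 + ["read%d" % i for i in range(998)]
print(overrepresented(reads))
```

Sequences flagged this way are worth checking against known adapter and primer sequences before assuming they are biological.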
  42. QC Report
     Sequence Statistics
     • Total No. of Sequences: 6970943
     • Avg. Sequence Length: 54
     • Max Sequence Length: 54
     • Min Sequence Length: 54
     • Total Sequence Length: 376430922
     • Total N bases: 14254521
     • % N bases: 3.78676
     • No. of Sequences with Ns: 278635
     • % Sequences with Ns: 3.99709
     Quality Statistics
     • Total HQ bases: 334195496
     • % HQ bases: 88.78
     • Total HQ reads: 6350256
     • % HQ reads: 91.0961
     Alignment statistics
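The percentages in this report follow directly from the raw counts. The short check below recomputes them from the figures on the slide (variable names are illustrative):

```python
total_seqs = 6970943
read_length = 54
total_bases = total_seqs * read_length  # 376430922, as reported

n_bases = 14254521
seqs_with_n = 278635
hq_bases = 334195496
hq_reads = 6350256

pct_n_bases = 100 * n_bases / total_bases
pct_seqs_with_n = 100 * seqs_with_n / total_seqs
pct_hq_bases = 100 * hq_bases / total_bases
pct_hq_reads = 100 * hq_reads / total_seqs

print(total_bases)
print(round(pct_n_bases, 5), round(pct_seqs_with_n, 5))
print(round(pct_hq_bases, 2), round(pct_hq_reads, 4))
```

Base-level percentages use the total sequence length as the denominator, while read-level percentages use the number of sequences.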
