FASTQC Analysis of FASTQ Files Before and After Trimming

Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Workflow:
Data (FASTQ)
FastQC: data quality check and interpretation
PRINSEQ-lite: trimming/filtering

perl prinseq-lite.pl -fastq control.fastq -out_format 5 -min_len 50 -min_qual_mean 25
-fastq: input FASTQ file
-out_format: output format; 1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and FASTA), or 5 (FASTQ, FASTA and QUAL)
-min_len: filter sequences shorter than the minimum length (here, 50 nucleotides)
-min_qual_mean: filter sequences with a mean quality score below the minimum quality mean (here, 25)
PRINSEQ
• generates summary statistics of sequence and quality data
• is used to filter, reformat and trim next-generation sequence data
• is available through a user-friendly web interface or as a standalone version
Command for quality filtering:
perl prinseq-lite.pl -fastq control.fastq -out_format 5 -min_len 50 -min_qual_mean 25
For further help, type: perl prinseq-lite.pl -h
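
As a rough illustration of what the two filters in the command above do, the sketch below applies the same thresholds to FASTQ records in Python. It is not PRINSEQ's own code. It assumes well-formed four-line records and Phred+33 quality encoding, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)

def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from well-formed four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual

def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of one read (assumption: Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def split_good_bad(records: Iterable[Record], min_len: int = 50,
                   min_qual_mean: float = 25) -> Tuple[List[Record], List[Record]]:
    """Separate reads that pass both filters (good) from the rest (bad)."""
    good, bad = [], []
    for rec in records:
        _, seq, qual = rec
        # The length test runs first, so an empty read never reaches mean_quality.
        if len(seq) >= min_len and mean_quality(qual) >= min_qual_mean:
            good.append(rec)
        else:
            bad.append(rec)
    return good, bad
```

With a real file this could be called as split_good_bad(parse_fastq(open("control.fastq"))); for files with millions of reads, writing records out as they are classified would be preferable to holding two lists in memory.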

FastQC output was generated for three files: the original FASTQ file
(control_R1.fastq) and the good and bad FASTQ files generated by
prinseq-lite.

Basic Statistics:
Interpretation: The Basic Statistics module generates
simple composition statistics for the file analyzed. It
gives the filename, file type, number of sequences,
sequence length and %GC.
Here all three files (control_R1.fastq, control_R1_good.fastq
and control_R1_bad.fastq) were analyzed. The raw data file
showed > 17 million reads and the good file generated after
running prinseq-lite showed > 16 million reads. The bad file
showed only 0.4 million reads.
Note: Basic Statistics never raises a warning or an error.
Figure: Basic Statistics for Control_R1.fastq, Control_R1_good.fastq and Control_R1_bad.fastq
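
The figures reported by Basic Statistics can be approximated with a few lines of Python. This is a minimal sketch rather than FastQC's implementation; the function name and the returned dictionary keys are hypothetical, and %GC is taken here as the share of G and C among all bases in the file.

```python
def basic_statistics(sequences):
    """Return read count, length range and %GC for an iterable of read sequences."""
    total_reads = 0
    total_bases = 0
    gc_bases = 0
    min_len = None
    max_len = 0
    for seq in sequences:
        s = seq.upper()
        total_reads += 1
        total_bases += len(s)
        gc_bases += s.count("G") + s.count("C")
        min_len = len(s) if min_len is None else min(min_len, len(s))
        max_len = max(max_len, len(s))
    gc_percent = 100.0 * gc_bases / total_bases if total_bases else 0.0
    return {
        "total_sequences": total_reads,
        "min_length": min_len,
        "max_length": max_len,
        "gc_percent": gc_percent,
    }
```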

Per Base Sequence Quality
Interpretation: This view shows an overview of the range of quality
values across all bases at each position in the FASTQ file.
For each position a box-and-whisker plot is drawn. The central line (red)
is the median value. The box (yellow) represents the inter-quartile
range (25-75%). The lower and upper whiskers represent the 10% and
90% points. The line (blue) that runs across the graph represents the
mean quality.
The good file shows the best mean quality, better than the raw data file.
The bad file had a very low mean quality; therefore a failure was issued,
indicated by a red cross mark against Per Base Sequence Quality in the
main window for that data set.
Note: A warning is issued if the lower quartile for any base is less
than 10, or if the median for any base is less than 25. A failure is
raised if the lower quartile for any base is less than 5, or if the median
for any base is less than 20.

Figure: Per Base Sequence Quality, Control_R1.fastq
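
The warning and failure rules for Per Base Sequence Quality can be sketched as follows. This is an approximation under stated assumptions: Phred+33 encoding, a simple nearest-rank estimate of the lower quartile (FastQC's exact percentile method may differ), and a hypothetical function name.

```python
import statistics

def per_base_quality_flag(quals, offset=33):
    """Classify a list of quality strings as 'pass', 'warn' or 'fail'."""
    longest = max(len(q) for q in quals)
    flag = "pass"
    for pos in range(longest):
        # Phred scores of every read that is long enough to cover this position.
        scores = sorted(ord(q[pos]) - offset for q in quals if len(q) > pos)
        median = statistics.median(scores)
        lower_quartile = scores[len(scores) // 4]  # nearest-rank estimate
        if lower_quartile < 5 or median < 20:
            return "fail"
        if lower_quartile < 10 or median < 25:
            flag = "warn"
    return flag
```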

Per Sequence Quality Scores
Interpretation: The per sequence quality score report shows whether a
subset of the sequences has universally low quality values. The area
under the bell-shaped curve was greater for the good file than for the raw
data file, indicating that most sequences in the good file had very good
quality in comparison with the raw data file. Average quality per read is
also better in the good file. In the bad file most of the sequences had
very poor quality, and a failure was issued, indicated by a red cross mark
against Per Sequence Quality Scores in the main window for that data set.
Note: A warning is raised if the most frequently observed mean quality is
below 27; this equates to a 0.2% error rate. A failure is raised if the
most frequently observed mean quality is below 20; this equates to a 1%
error rate.
Figure: Per Sequence Quality Scores, Control_R1.fastq
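
The error rates quoted for quality 27 and quality 20 follow from the Phred definition, P = 10^(-Q/10). The short sketch below shows the conversion and applies the two thresholds; the function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)

def per_sequence_quality_flag(modal_mean_quality):
    """Apply the warning (below 27) and failure (below 20) thresholds."""
    if modal_mean_quality < 20:
        return "fail"
    if modal_mean_quality < 27:
        return "warn"
    return "pass"
```

For example, phred_to_error_probability(20) gives 0.01 (a 1% error rate), and quality 27 gives roughly 0.002 (0.2%).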

Per Base Sequence Content
Interpretation: Per Base Sequence Content plots, for each base position
in a file, the proportion of calls made for each of the four normal DNA
bases. In a random library you would expect little to no difference
between the different bases of a sequence run, so the lines in this plot
should run parallel with each other. The relative amount of each base
should reflect the overall amount of these bases in the genome, but in
any case they should not be hugely imbalanced from each other. However,
in all our cases a failure is issued because the A versus T and G versus C
percentages vary at the start of the reads, up to the 14th base. This
also indicates that the reads can be trimmed at the 5' end up to the
14th base.
Note: A warning is issued if the difference between A and T, or G
and C, is greater than 10% in any position, and a failure is issued if
the difference between A and T, or G and C, is greater than 20% in
any position.
Figure: Per Base Sequence Content, Control_R1.fastq
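
The A versus T and G versus C comparison can be sketched in Python as below. It is an illustration only: the function name is hypothetical, and percentages are computed over called A, C, G and T bases at each position, which may not match FastQC's denominator exactly.

```python
from collections import Counter

def base_content_flag(sequences):
    """Return (flag, worst_difference) for the largest A/T or G/C gap at any position."""
    longest = max(len(s) for s in sequences)
    worst = 0.0
    for pos in range(longest):
        counts = Counter(s[pos].upper() for s in sequences if len(s) > pos)
        called = sum(counts[b] for b in "ACGT")
        if called == 0:
            continue  # only N or other symbols at this position
        pct = {b: 100.0 * counts[b] / called for b in "ACGT"}
        worst = max(worst, abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
    if worst > 20:
        return "fail", worst
    if worst > 10:
        return "warn", worst
    return "pass", worst
```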

Per Sequence GC Content
Interpretation: This module measures the GC content across the whole
length of each sequence in a file and compares it to a modeled normal
distribution of GC content. In a normal random library you would expect to see
a roughly normal distribution of GC content where the central peak
corresponds to the overall GC content of the underlying genome. An unusually
shaped distribution could indicate a contaminated library or some other kind
of biased subset. In our data sets a warning is raised for the raw and good
files and a failure is issued for the bad file. The deviation from the normal
distribution is clearly greater in the bad file than in the good and raw
files.
Note: A warning is raised if the sum of the deviations from the normal
distribution represents more than 15% of the reads. A failure is issued if the
sum of the deviations from the normal distribution represents more than 30%
of the reads.
Figure: Per Sequence GC Content, Control_R1.fastq

Per Base N Content
Interpretation: This module plots the percentage of base calls at each
position for which an N was called. It's not unusual to see a very low
proportion of Ns appearing in a sequence, especially nearer the end of a
sequence. Here no Ns were found in any of the data sets.
Note: A warning is raised if any position shows an N content of >5% and a
failure is issued if any position shows an N content of >20%.
Figure: Per Base N Content, Control_R1.fastq
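
A minimal sketch of the per-position N check, assuming reads are available as plain strings; the function name is hypothetical and this is not FastQC's code.

```python
def n_content_flag(sequences):
    """Return (flag, worst_percent_N) across all read positions."""
    longest = max(len(s) for s in sequences)
    worst = 0.0
    for pos in range(longest):
        bases = [s[pos].upper() for s in sequences if len(s) > pos]
        pct_n = 100.0 * bases.count("N") / len(bases)
        worst = max(worst, pct_n)
    if worst > 20:
        return "fail", worst
    if worst > 5:
        return "warn", worst
    return "pass", worst
```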

Sequence Length Distribution
Interpretation: This module generates a graph showing the distribution of fragment
sizes in the file which was analysed. In many cases this will produce a simple graph
showing a peak at only one size, but for variable-length FASTQ files this will show
the relative amounts of each different size of sequence fragment. Here all the
files have reads with a length of 101 bp.
Note: A warning is raised if the sequences are not all the same length. A failure is
issued if any of the sequences have zero length.
Figure: Sequence Length Distribution, Control_R1.fastq
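
The length distribution and its two rules are simple to reproduce. The sketch below uses a hypothetical function name and returns both the distribution and the resulting flag.

```python
from collections import Counter

def length_distribution(sequences):
    """Return ({length: read_count}, flag) for an iterable of read sequences."""
    lengths = Counter(len(s) for s in sequences)
    if 0 in lengths:
        flag = "fail"   # at least one zero-length sequence
    elif len(lengths) > 1:
        flag = "warn"   # sequences are not all the same length
    else:
        flag = "pass"
    return dict(lengths), flag
```

For the files described here, every read is 101 bp, so the distribution would contain the single key 101 and the flag would be "pass".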

Overrepresented Sequences
Control_R1.fastq: No overrepresented sequences
Sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAACTCTCCGTATGC
Count: 3950
Percentage: 0.84781596
Possible Source: TruSeq Adapter, index 4 (110% over 50 bp)
16915967
Interpretation: Overrepresented sequences include the sequences that
are highly duplicated in your library, as well as any primer and/or adapter
dimers that were present in the original library. Adapter sequences are
always present in a sequencing experiment at some level, but aren't
problematic in small percentages. These adapters will not align to your
genome. They can be ignored, or you may use analysis software to
remove them. This module lists all the sequences which make up
more than 0.1% of the total. To conserve memory, only sequences which
appear in the first 200,000 sequences are tracked to the end of the file.
It is therefore possible that a sequence which is overrepresented but
doesn't appear at the start of the file could be missed by this module.
Here we find overrepresented adapter sequences in the bad file.
Note: A warning is raised if any sequence is found to represent more
than 0.1% of the total and a failure is issued if any sequence is found to
represent more than 1% of the total.
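
The 0.1% and 1% rules can be illustrated with a simple counter. This sketch counts every read, which is a simplification: it does not reproduce the memory-saving step of tracking only sequences first seen near the start of the file. The function name and parameter names are hypothetical.

```python
from collections import Counter

def overrepresented_sequences(reads, warn_pct=0.1, fail_pct=1.0):
    """Return ([(sequence, count, percentage)], flag) for sequences above warn_pct."""
    counts = Counter(reads)
    total = sum(counts.values())
    hits = []
    for seq, n in counts.most_common():
        pct = 100.0 * n / total
        if pct > warn_pct:
            hits.append((seq, n, pct))
    if any(pct > fail_pct for _, _, pct in hits):
        flag = "fail"
    elif hits:
        flag = "warn"
    else:
        flag = "pass"
    return hits, flag
```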

Duplicate Sequences
Interpretation: This module counts the degree of duplication for every
sequence in the set and creates a plot showing the relative number of
sequences with different degrees of duplication. The analysis covers only
the first 200,000 different sequences seen. The number of occurrences of
these sequences is then tracked through the rest of the file, but any new
sequences after the first 200,000 are discarded. Also, any sequences with
more than 10 duplicates are placed into the 10 duplicates category, so it's
not unusual to see a small rise in this final category. If the duplicate
plot falls slowly from the unique sequences, showing appreciable
proportions of the library with duplication levels of 3-5 and a small spike
in the 10+ bin, there may be a biological rather than a technical cause.
The most common type of library to produce this type of plot is an RNA-Seq
library. In this type of library it is expected that some sequences will
occur very frequently, and others will be very rare. If you want to see the
very rare sequences (e.g. low copy number transcripts), then you will have
to greatly over-sequence the most frequent sequences (e.g. housekeeping
genes), so a high level of duplication in part of the library is
unavoidable. Therefore a warning is seen for the raw and good files.
Note: A warning is raised if non-unique sequences make up more than
20% of the total. A failure is issued if non-unique sequences make up
more than 50% of the total.
Figure: Duplicate Sequences, Control_R1.fastq
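
The duplication levels and the 20% and 50% rules can be sketched as below. Two assumptions are made here and may not match FastQC exactly: every read is counted (no limit on the number of distinct sequences tracked), and "non-unique" is taken to mean reads whose sequence occurs more than once. The function name is hypothetical.

```python
from collections import Counter

def duplication_summary(reads):
    """Return ({duplication_level: distinct_sequences}, percent_non_unique, flag)."""
    counts = Counter(reads)
    total = sum(counts.values())
    levels = Counter()
    for n in counts.values():
        levels[min(n, 10)] += 1  # level 10 collects everything seen 10 or more times
    non_unique = sum(n for n in counts.values() if n > 1)
    pct = 100.0 * non_unique / total if total else 0.0
    if pct > 50:
        flag = "fail"
    elif pct > 20:
        flag = "warn"
    else:
        flag = "pass"
    return dict(levels), pct, flag
```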

Overrepresented Kmers
Interpretation: This module counts the enrichment of every 5-mer within the
sequence library. Any k-mer showing more than a 3-fold overall enrichment or a
5-fold enrichment at any given base position will be reported by this module.
Note: A warning is raised if any k-mer is enriched more than 3-fold overall, or
more than 5-fold at any individual position.
A failure is issued if any k-mer is enriched more than 10-fold at any individual
base position.
Figure: Overrepresented Kmers, Control_R1.fastq
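
Overall k-mer enrichment can be illustrated as an observed/expected ratio. In this sketch the expected count of a k-mer is assumed to be the total number of k-mers multiplied by the product of the library-wide frequencies of its bases; this is an assumption about the model, not FastQC's code. Only overall enrichment is covered, not the per-position test, and the function name is hypothetical.

```python
from collections import Counter
from math import prod

def kmer_enrichment(sequences, k=5):
    """Return {kmer: observed / expected} using library-wide base frequencies."""
    base_counts = Counter()
    kmer_counts = Counter()
    for seq in sequences:
        s = seq.upper()
        base_counts.update(s)
        kmer_counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    enrichment = {}
    for kmer, observed in kmer_counts.items():
        # Expected count if bases occurred independently at library-wide frequencies.
        expected = total_kmers * prod(base_counts[b] / total_bases for b in kmer)
        enrichment[kmer] = observed / expected
    return enrichment
```

K-mers with a ratio above 3 would correspond to the overall warning threshold described above.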
