1. Quality control with FastQC
of sequencing data obtained with the
Illumina platform
Hafiz.M.Zeeshan.Raza
Research Associate
COMSATS University Islamabad, Pakistan
Sahiwal Campus
hafizraza26@gmail.com
Cell# 0092-36-6155501
2. Basic data (BASIC STATISTICS)
The Basic Statistics module reports:
• File name: Sec_Ilumina.fastq.txt
• File type: conventional base calls
• Encoding (system used): Illumina
• Total sequences analyzed: 25,000. This is the number of
reads analyzed.
• Sequences flagged as poor quality: 0.
• Length of each read: 38 bases
• % GC: 45%
• Note: FastQC only reports on quality. It does not remove or
correct the low-quality reads; you have to filter or trim
them yourself with a separate tool.
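As an illustration of where these numbers come from, the following is a minimal sketch that recomputes the read count, read length and % GC from a FASTQ file. It assumes the simple four-lines-per-record layout; the function names `read_fastq` and `basic_statistics` are hypothetical and are not part of FastQC.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) stops the parser
            return
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().strip()
        yield header, seq, qual


def basic_statistics(records):
    """Return read count, shortest/longest read length and overall % GC."""
    total = 0
    bases = 0
    gc = 0
    lengths = []
    for _, seq, _ in records:
        seq = seq.upper()
        total += 1
        bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        lengths.append(len(seq))
    return {
        "total_sequences": total,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "percent_gc": 100 * gc / bases if bases else 0.0,
    }


# Tiny in-memory example; a real file would be opened with open(path).
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"
print(basic_statistics(read_fastq(StringIO(example))))
```

For the file described above, one would expect 25,000 reads, a minimum and maximum length of 38, and about 45% GC.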
4. Quality (Q) of the sequences per base
(PER BASE SEQUENCE QUALITY)
• For each base position, the yellow box shows the inter-quartile range (25-75%),
the red line is the median and the blue line is the mean of the quality. The X axis
shows the position in the read (each read has 38 bases). The Y axis shows the
quality scores, here 0-34, divided into three zones:
• Green zone: 28-34. Very good quality.
• Orange zone: 20-28. Intermediate quality.
• Red zone: 0-20. Poor quality.
• Each box summarizes the quality of all 25,000 reads at that base position.
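The quality Q is a Phred score, Q = -10·log10(p), where p is the probability that the base call is wrong, and it is stored in the FASTQ file as one character per base. The sketch below shows how the per-base summary could be rebuilt from the quality strings. It assumes the Phred+33 encoding (older Illumina pipelines used an offset of 64, hence the `offset` parameter); the function names are hypothetical.

```python
from statistics import mean, median


def phred_scores(quality_string, offset=33):
    """Decode one FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def per_base_quality(quality_strings, offset=33):
    """Return {position: {'mean': ..., 'median': ...}} with 1-based positions."""
    columns = {}
    for qual in quality_strings:
        for pos, score in enumerate(phred_scores(qual, offset), start=1):
            columns.setdefault(pos, []).append(score)
    return {
        pos: {"mean": mean(scores), "median": median(scores)}
        for pos, scores in sorted(columns.items())
    }


def zone(score):
    """Classify a score into the three zones of the plot."""
    if score >= 28:
        return "green"
    if score >= 20:
        return "orange"
    return "red"


summary = per_base_quality(["II", "5I"])
for pos, values in summary.items():
    print(pos, values["mean"], zone(values["mean"]))
```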
6. Continued…
• In the plot, the qualities assigned to the first base are all very good, since they
lie in the green zone. At base 38 the qualities are widely dispersed (that is why
the box is so tall): some reads have good quality at that position and others
very bad.
• In conclusion, the qualities are good up to base 22. From base 23 onwards they get
worse, because some of the 25,000 reads have poor quality at those positions.
Therefore, I will have to use a program that removes the bases with, for example,
Q < 25 (the threshold is chosen by the user), or the reads whose average Q is
below 25. As a result, the 25,000 reads will no longer all have the same length:
some will still have 38 bases and others will be shorter.
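The two strategies mentioned above can be sketched as follows. This is a deliberately simplified illustration, assuming Phred+33 encoding and a threshold of 25; dedicated trimming programs such as Trimmomatic or Cutadapt use more elaborate algorithms. The function names are hypothetical.

```python
def trim_3prime(seq, qual, threshold=25, offset=33):
    """Cut bases from the 3' end until a base with Q >= threshold is reached."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


def passes_mean_quality(qual, threshold=25, offset=33):
    """Return True if the average Q of the read is at least the threshold."""
    if not qual:
        return False
    return sum(ord(ch) - offset for ch in qual) / len(qual) >= threshold


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq, qual = trim_3prime("ACGTAC", "IIII##")
print(seq, qual, passes_mean_quality(qual))
```

After trimming, reads no longer share one length, which is why the length distribution should be checked again afterwards.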
7. Quality of the sequence by "tile"
(PER TILE SEQUENCE QUALITY)
• This plot only appears for Illumina libraries. It shows the
tiles of the flow cell on which the sequences were read, and
lets you compare the quality scores of each tile across all
base positions, to see whether a loss of quality was
associated with only one part of the flow cell.
• Marks (warmer colours) on the plot indicate tiles with poorer
quality. A possible cause is that the reagents were not
filtered or degassed, so that bubbles stay in the flow cell
and interfere with the optical reading, giving a bad quality.
• In our plot no fault is visible, since the whole background is
blue. This suggests that the reagents were degassed and filtered.
9. Levels of quality per sequence
(PER SEQUENCE QUALITY SCORES)
• This module gives an early idea of how many reads will have to
be removed, since it shows whether a subset of the sequences has
low quality values overall.
• In the plot, the X axis shows the mean quality of a read, and
the Y axis shows the number of reads that have that mean
quality.
• In our case, more than 3,500 reads have a mean quality of
29-31, and far fewer reads have low quality.
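A histogram of this kind could be computed as in the following sketch. It assumes Phred+33 encoding and rounds each read's mean quality down to an integer, which is a choice made for this illustration; the function name is hypothetical.

```python
from collections import Counter


def mean_quality_histogram(quality_strings, offset=33):
    """Return {mean quality (rounded down): number of reads}."""
    counts = Counter()
    for qual in quality_strings:
        if qual:  # skip empty reads
            mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
            counts[int(mean_q)] += 1
    return dict(sorted(counts.items()))


print(mean_quality_histogram(["IIII", "II##", "####"]))
```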
11. Content of the sequence per base
(PER BASE SEQUENCE CONTENT)
• This module shows the proportion of each of the four
bases at every position in the reads.
• In a random library there should be little difference
between the positions of a sequencing run, so the lines
in this plot should run parallel to each other.
• In our case, there are differences between some bases
and others, whereas the amount of A should be close to
that of T, and that of G close to that of C.
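The percentages behind this plot can be sketched as below: for every position, count how often A, C, G and T occur across all reads. The function name is hypothetical, and calls other than A, C, G and T (such as N) are counted in the total but not reported.

```python
from collections import Counter


def per_base_content(sequences):
    """Return a list (one entry per position) of {base: percentage}."""
    counts = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):  # first read that reaches this position
                counts.append(Counter())
            counts[i][base] += 1
    result = []
    for column in counts:
        total = sum(column.values())
        result.append({b: 100 * column[b] / total for b in "ACGT"})
    return result


for pos, content in enumerate(per_base_content(["AC", "AT", "GC", "TC"]), start=1):
    print(pos, content)
```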
13. Content of guanine and cytosine (GC) per sequence
(PER SEQUENCE GC CONTENT)
• This module measures the GC content of every read in the file (red line) and
compares it with a theoretical normal distribution of GC content (blue line).
• The X axis shows the mean GC percentage per read, and the Y axis shows the
number of reads.
• In our case (red line), there are several peaks where there should be a single
Gaussian curve. This may indicate:
Adapter dimers
Contamination with other DNA
• Low-quality positions, where the true base is uncertain, may also have been
called as G or C when they should not have been.
• This suggests that the sequencing was not carried out entirely correctly, but
after examining the quality values the reads can be corrected:
• If there was contamination with other DNA and it is of good quality, it will not
be removed.
• Bases that were misread as G or C should be removed when the reads are
corrected.
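The red line can be reproduced in outline with the sketch below, which computes the GC percentage of each read and counts how many reads fall on each whole percentage. The theoretical curve is not computed here, and the function names are hypothetical.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_distribution(sequences):
    """Return {GC percentage rounded to an integer: number of reads}."""
    counts = Counter(round(gc_percent(seq)) for seq in sequences)
    return dict(sorted(counts.items()))


print(gc_distribution(["ACGT", "GGCC", "AATT", "ACGT"]))
```

A library from a single organism would be expected to give one peak; additional peaks are what the slide interprets as adapter dimers or contamination.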
15. Content of N per base
(PER BASE N CONTENT)
• This module shows the percentage of base calls at each
position that are N (unassigned base).
• In our case, the content of N is practically nil: the
sequencer did not assign N to the uncertain bases, but
called an A, T, G or C instead. This indicates that the
quality of our reads is not so bad as to produce N
calls.
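As a minimal sketch of this calculation, the function below returns the percentage of N calls at every position. The function name is hypothetical.

```python
def per_base_n_percent(sequences):
    """Return a list with the percentage of N calls at each position."""
    totals = []
    n_counts = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                n_counts.append(0)
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [100 * n / t for n, t in zip(n_counts, totals)]


print(per_base_n_percent(["ACGN", "ACNN"]))
```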
17. Distribution of the length of the sequences
(SEQUENCE LENGTH DISTRIBUTION)
• Some high-throughput sequencers generate reads of
uniform length, but others produce reads of different
lengths.
• This module plots the distribution of read lengths in
the file that was analyzed.
• In our case, the reads are homogeneous: all 25,000
sequences have 38 bases.
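The length distribution is a simple count of reads per length, as in this short sketch (hypothetical function name). For the file described here it would contain a single entry, length 38 with 25,000 reads; after quality trimming it would contain several.

```python
from collections import Counter


def length_distribution(sequences):
    """Return {read length: number of reads}, sorted by length."""
    return dict(sorted(Counter(len(seq) for seq in sequences).items()))


print(length_distribution(["ACGT", "ACG", "ACGT"]))
```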
19. Levels of duplicate sequences
(SEQUENCE DUPLICATION LEVELS)
• This module counts the degree of duplication of every sequence in the
library and plots the relative number of sequences at each degree of
duplication.
• Ideally, sequencing samples the library at random, so most sequences
should occur only once.
• The plot shows the proportion of the library that falls into each
duplication-level bin. There are two lines:
• The blue line shows the complete set of sequences
• The red line shows the sequences after de-duplication
20. Continued…
• For the complete set of sequences we can observe 3 peaks:
• > 10: more than 10% of the sequences are copies of a fragment that occurs,
identical from beginning to end, more than 10 times
• > 100: of the 25,000 reads that I have, 25% are copies of a fragment that occurs,
identical from beginning to end, more than 100 times
• > 1K: 20% of the sequences are copies of a fragment that occurs, identical from
beginning to end, more than 1,000 times
• In genomic DNA, duplications should not be observed (red line). However, they
can arise. In general there are two possible types of duplicate in a library:
technical duplicates derived from PCR artifacts, and biological duplicates, which
are natural collisions where different copies of the same sequence are randomly
selected. There is no way to distinguish between these two types from the
sequences alone, and both are reported as duplicates here.
• In RNA-Seq libraries, some transcripts are expected to occur very frequently
and others very rarely (low copy number transcripts), so a high level of
duplication in part of the library is inevitable.
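The percentages read off the blue line can be approximated with the sketch below: exact duplicates are counted, and every read is assigned to the bin of its sequence's copy number. The bin labels imitate those on the plot, but the boundaries used here (for example ">10" meaning 10 to 49 copies) are an assumption of this sketch, and FastQC itself works on a sample of the file rather than on all reads. The function names are hypothetical.

```python
from collections import Counter


def duplication_bin(count):
    """Map a copy number to a bin label such as '1', '2', '>10' or '>1k'."""
    if count < 10:
        return str(count)
    upper_bins = ((10000, ">10k"), (5000, ">5k"), (1000, ">1k"),
                  (500, ">500"), (100, ">100"), (50, ">50"))
    for limit, label in upper_bins:
        if count >= limit:
            return label
    return ">10"


def duplication_levels(sequences):
    """Return {bin label: percentage of all reads that fall in that bin}."""
    copies = Counter(sequences)
    total = sum(copies.values())
    levels = Counter()
    for count in copies.values():
        levels[duplication_bin(count)] += count
    return {label: 100 * n / total for label, n in levels.items()}


reads = ["A"] * 12 + ["C"] * 2 + ["G", "T"]
print(duplication_levels(reads))
```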
22. Overrepresented sequences
• This module lists the sequences that appear far more often than
expected in the library. Such sequences can cause problems later,
at the time of mapping (aligning by homology against a reference
transcriptome) or when trying to do an assembly.
• It also reveals adapter dimers: since you know which adapter was
used, finding that sequence in the list tells you that the adapter
itself was sequenced, forming dimers.
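A list of this kind could be produced as in the following sketch, which reports every exact sequence that makes up more than a given percentage of the reads. The default of 0.1% follows the warning threshold described in the FastQC documentation; the function name is hypothetical.

```python
from collections import Counter


def overrepresented(sequences, min_percent=0.1):
    """Return [(sequence, count, percentage)] for sequences above min_percent."""
    total = len(sequences)
    if total == 0:
        return []
    copies = Counter(sequences)
    return [
        (seq, n, 100 * n / total)
        for seq, n in copies.most_common()
        if 100 * n / total > min_percent
    ]


# A high threshold is used here only because the example has four reads.
print(overrepresented(["AAAA", "AAAA", "AAAA", "CCCC"], min_percent=50))
```

Each reported sequence can then be compared with the known adapter sequence to check for adapter dimers.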
24. Content of the adapters
(ADAPTER CONTENT)
• One obvious class of sequences that you may want to
look for is adapter sequences. It is useful to know whether
the library contains a significant amount of adapter, in
order to decide whether adapter trimming is needed.
• This module therefore searches specifically for a set of
separately defined Kmers and shows the total proportion
of your library that contains these Kmers.
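In its simplest form, such a search can be sketched as the percentage of reads that contain a given adapter Kmer. The function name is hypothetical, and the Kmer in the example is a made-up placeholder, not a real adapter; in practice you would pass the adapter sequence of your own library kit.

```python
def adapter_percent(sequences, adapter):
    """Return the percentage of reads that contain the adapter Kmer."""
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if adapter in seq)
    return 100 * hits / len(sequences)


# "TTTGCA" is a placeholder Kmer used only for this example.
reads = ["ACGTTTGCA", "GGGGGGGGG", "TTTGCAAAA", "CCCCCCCCC"]
print(adapter_percent(reads, "TTTGCA"))
```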