RNA-seq: analysis of raw data and preprocessing - part 2
The document presents a detailed guide on the investigation and preprocessing of raw RNA-seq data, emphasizing the importance of experimental design, sequencing options, and quality control methods. It discusses the differences between single-end and paired-end sequencing, the relevance of strandedness, and the consequences of technical and biological contaminations in the data. Preprocessing steps such as removing low-quality and contaminant sequences are critical for accurate differential expression analysis and successful mapping to reference genomes.
1.
Raw data investigation
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
2.
Experimental setup
We have decided on:
● how many samples per condition
● how deep to sequence
This determines how reliable the statistics will be. Decide using experience and tools like Scotty. A wrong experimental design cannot be fixed afterwards. Best approach: pilot data (3 samples per condition, 10M reads).
But we have other sequencing options to choose from!
3.
PE versus SE (Illumina)
● Single end (SE): from each cDNA fragment only one end is read.
● Paired end (PE): the cDNA fragment is read from both ends.
[Figure: purify and fragment, then sequence as SE or PE]
4.
PE versus SE (Illumina)
Single end (SE):
● Gene level differential expression
Paired end (PE):
● Novel splice junction detection
● De novo assembly of the transcriptome
● Helps with correctly positioning reads on the reference genome sequence.
Note: PE is not the same as mate pairs.
5.
Strandedness
● Naive protocols obtain reads from cDNA fragments, BUT the link with the sense or antisense strand is broken.
● Stranded protocols generate reads from one strand, corresponding to the sense or antisense strand (depending on the protocol).
Example of a stranded protocol
● dUTP protocol to generate stranded reads.
8.
Importance of strandedness
● Strandedness can bias the read counts compared to non-stranded protocols.
● Whether you should apply it depends on the genome: e.g. where genes overlap, the benefit of assigning reads to the correct gene can outweigh the technical variation.
9.
Length of the reads
● Does not matter so much (when we want to quantify by aligning to a reference sequence): 50 bp will do.
● The most important point is to be able to accurately position the read on the reference genome sequence, to assign it to the correct gene.
● Length can become important if you want to assemble the transcriptome.
10.
For DE on the gene level
The 'cheapest' protocol for high-throughput sequencing suffices to achieve DE detection:
● SE
● 50 bp
● Option: strandedness.
Use the money you have left over for increasing the number of replicates.
Raw Illumina data
The data you get arrives as...
[Figure: list of delivered files; labels point at the barcode and the experiment]
Compressed, usually with gzip
13.
Raw Illumina data
(this one: 87196924 lines)
One read (minimum 4 lines): a header line starting with '@', the sequence, a '+' line, and the certainty of reading each base at that position ('quality').
@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG
CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA
+
@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA
@HWI-ST571:202:D1B86ACXX:2:1102:1073:2240 1:N:0:ACAGTG
CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT
+
CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI
@HWI-ST571:202:D1B86ACXX:2:1102:1385:2192 1:N:0:ACAGTG
GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT
+
CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE
@HWI-ST571:202:D1B86ACXX:2:1102:1352:2244 1:N:0:ACAGTG
ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC
+
@@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF
@HWI-ST571:202:D1B86ACXX:2:1102:1981:2152 1:N:0:ACAGTG
CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC
+
??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4
@HWI-ST571:202:D1B86ACXX:2:1102:1877:2165 1:N:0:ACAGTG
GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC
+
?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;
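To make the four-line record structure concrete, here is a minimal Python sketch that reads records and turns the quality string into numbers. It assumes exactly four lines per read and the Phred+33 encoding (an assumption, not stated on the slide); the function names `read_fastq` and `phred_scores` are made up for this illustration.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples, assuming 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'

def phred_scores(qual, offset=33):
    """Convert a quality string to integer scores (Phred+33 is assumed)."""
    return [ord(char) - offset for char in qual]

# First read of the slide, used as in-memory test data.
example = io.StringIO(
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG\n"
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA\n"
    "+\n"
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA\n"
)
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
```

Under these assumptions the uncalled bases (N) in this read line up with the '#' characters in the quality line, which decode to the lowest score in the read.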
http://wiki.bits.vib.be/index.php/.fas
14.
Exploring the raw data
1) Check whether the Fastq file is consistent
2) Make graphs of some metrics of the raw data
http://wiki.bits.vib.be/index.php/.fastq
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
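What a consistency check could look for, as a rough Python sketch. It assumes four lines per read and only tests a few obvious properties; `check_fastq` is a hypothetical helper, not the check that any particular tool performs.

```python
import io

def check_fastq(handle):
    """Return (number_of_reads, problems) for a 4-line-per-read Fastq stream."""
    problems = []
    n_reads = 0
    record = []
    for line in handle:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        n_reads += 1
        header, seq, plus, qual = record
        if not header.startswith("@"):
            problems.append(f"read {n_reads}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"read {n_reads}: third line does not start with '+'")
        if len(seq) != len(qual):
            problems.append(f"read {n_reads}: sequence and quality differ in length")
        record = []
    if record:
        problems.append(f"{len(record)} trailing line(s) do not form a complete read")
    return n_reads, problems

good = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
bad = io.StringIO("@r1\nACGT\n+\nIII\n")
good_result = check_fastq(good)
bad_result = check_fastq(bad)
```

With four lines per read, the 87196924 lines of the example file would correspond to 87196924 / 4 = 21799231 reads.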
FastQC – perfect example
Anna Karenina principle: “There is only one way
to be good, but there are many ways to be
wrong.”
We will start by showing a good sample.
Afterwards we will discuss a less good sample.
http://en.wikipedia.org/wiki/Anna_Karenina_principle
18.
FastQC – perfect example
Smooth histogram/density line towards the right.
19.
FastQC – perfect example
Steady nucleotide distribution.
Bias typical for Illumina.
20.
FastQC – perfect example
Not strongly fluctuating GC content.
Bias typical for Illumina.
FastQC – less good RNA-seq sample
Heavily skewed towards AT-rich reads.
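The quantity behind a per-read GC plot can be sketched in a few lines of Python. This is an illustration only: leaving uncalled bases (N) out of the denominator and using bins of 10% are assumptions of this sketch, and `gc_fraction` and `gc_histogram` are hypothetical names.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G and C among the called bases (N is ignored here)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

def gc_histogram(seqs, bin_width=10):
    """Count reads per GC-percentage bin; keys are the lower bin edges."""
    histogram = Counter()
    for seq in seqs:
        percent = int(gc_fraction(seq) * 100)
        histogram[percent // bin_width * bin_width] += 1
    return histogram

hist = gc_histogram(["ATATATAT", "ATATATGC", "GGCC"])
```

A sample skewed towards AT-rich reads would show most of its reads in the low bins.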
32.
FastQC – less good RNA-seq sample
Apparently a mixture of two sets of reads with different lengths.
33.
FastQC – less good RNA-seq sample
Duplication seems a bit on the low side (reported figures range from 60 to 75%).
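As a rough idea of what a duplication percentage means, the sketch below counts exact duplicate sequences in Python. It is a simple definition chosen for illustration (one minus the share of distinct sequences), not necessarily the estimate that FastQC reports; `duplication_rate` is a hypothetical name.

```python
from collections import Counter

def duplication_rate(seqs):
    """Share of reads that repeat an already seen sequence (exact matches only)."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total

rate = duplication_rate(["ACGT", "ACGT", "GGCC", "TTAA"])
```

Here one of the four reads repeats an earlier sequence, so the rate is 0.25.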
34.
FastQC – less good RNA-seq sample
Very highly skewed read numbers.
Often the sequence of the TruSeq adaptor, or of multiplex identifiers, can be found here.
BLAST can reveal more information!
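Finding candidate sequences to BLAST can be sketched as counting identical reads and listing the most frequent ones. The 0.1% cut-off used below is an assumption of this sketch, as is the name `overrepresented`.

```python
from collections import Counter

def overrepresented(seqs, min_fraction=0.001):
    """List (sequence, count, fraction) for sequences above min_fraction."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]

reads = ["AAAA"] * 3 + ["CCCC", "GGGG"]
hits = overrepresented(reads, min_fraction=0.5)
```

The sequences returned this way are the ones worth comparing against adaptor lists or submitting to BLAST.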
35.
FastQC – less good RNA-seq sample
Specific patterns of specific k-mers.
Note: A and T rich.
36.
Quality control of raw data
Proceed? Or rerun?
This QC can guide you to which preprocessing steps
you need to apply for sure. The extra time and
money needed to correct the biases can sometimes
justify a rerun of the experiment.
This QC shows which preprocessing steps have
already been made by the sequencing provider.
37.
Preprocessing
Removing unwanted parts of the raw data, so that the data helps as much as possible with reaching our goal: identifying differentially expressed genes.
1) removing technical contamination
● Low quality read parts
● Technical sequences: adaptors
● PhiX internal control sequences
2) removing biological contamination
● polyA-tails
● rRNA sequences
● mtDNA sequences
After this, we run FastQC again.
38.
Technical contamination
Our goal is to detect differential expression (DE). For this we need to assign reads with high confidence to the correct genomic location.
Removal of low quality read parts: they have a higher chance of containing errors, and cause noise in our read counts.
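A very simple form of quality trimming, sketched in Python: cut bases from the end of the read for as long as their quality stays below a threshold. The threshold of 20 and the Phred+33 encoding are assumptions, `trim_low_quality_tail` is a hypothetical name, and dedicated tools use more refined rules.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' decodes to 40 and '#' to 2 under Phred+33.
trimmed = trim_low_quality_tail("ACGTACGT", "IIIII###")
```

Reads that become very short after trimming are usually better discarded, since short reads are hard to position unambiguously.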
Technical contamination
Our goal is to detect differential expression (DE). For this we need to assign reads with high confidence to the correct genomic location.
Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome.
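The idea of adaptor removal, as a minimal Python sketch: look for the adaptor inside the read, or for the start of the adaptor at the very end of the read, and cut from there. Only exact matches are handled; `trim_adapter`, `EXAMPLE_ADAPTER` and the minimum overlap of 3 are assumptions made for illustration, and the adaptor string must be replaced by the sequences of your own library kit.

```python
# Illustrative adaptor sequence (assumption); use the ones for your kit.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(seq, qual, adapter, min_overlap=3):
    """Cut the read at the first exact adaptor match, full or partial at the end."""
    pos = seq.find(adapter)
    if pos == -1:
        # No full match: try ever shorter adaptor prefixes at the read end.
        longest = min(len(adapter) - 1, len(seq))
        for overlap in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                pos = len(seq) - overlap
                break
    if pos == -1:
        return seq, qual  # nothing to trim
    return seq[:pos], qual[:pos]

read = "ACGTACGTAC"
full = trim_adapter(read + EXAMPLE_ADAPTER + "TT", "I" * 25, EXAMPLE_ADAPTER)
partial = trim_adapter(read + "AGATC", "I" * 15, EXAMPLE_ADAPTER)
```

Real adaptor trimmers also tolerate sequencing errors in the adaptor, which this sketch does not.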
42.
Technical contamination
[Figure: tool settings. Callouts: list of technical sequences; advised to use defaults.]
http://code.google.com/p/ea-utils/wiki/FastqMcf
Technical contamination
● Never remove duplicate reads! Highly expressed genes can have genuine duplicate reads, which are not due to the PCR amplification step in the protocol.
● PhiX sequences: the DNA of the Phi X bacteriophage is spiked in to monitor and optimize sequencing on Illumina machines. Your sequencing provider should filter out those sequences before delivery. You can filter them out by aligning your reads to the PhiX genome.
http://en.wikipedia.org/wiki/Phi_X_174
Biological contamination
mRNAs are captured with oligo-dT coated beads. Occasionally, non-protein coding sequences are also captured (especially since mtRNA and rRNA can be relatively rich in AT).
We can remove them via homology searching (BLAST) against known non-protein coding sequences.
[Figure labels: Mitochondrial; rRNA and ncRNA; mRNA (5% of RNA)]
47.
Biological contamination
mRNAs are post-transcriptionally modified, e.g. by the addition of a poly-A tail (AAAAAAAAAAAAA). If our goal is to map the reads to a reference genome sequence, the polyA tails should be removed. This can be viewed as a source of 'biological contamination' in our sequences (…).
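Removing a polyA tail from the end of a read can be sketched as follows in Python. The minimum tail length of 10 is an assumption (shorter runs of A may be genuine sequence), `trim_poly_a` is a hypothetical name, and depending on the protocol the tail may instead show up as a run of T at the start of the read, which this sketch does not handle.

```python
def trim_poly_a(seq, qual, min_len=10):
    """Remove a trailing run of A's if it is at least min_len long."""
    stripped = seq.rstrip("A")
    tail_length = len(seq) - len(stripped)
    if tail_length >= min_len:
        return stripped, qual[:len(stripped)]
    return seq, qual

with_tail = trim_poly_a("ACGT" + "A" * 12, "I" * 16)
# The first read of the raw data slide ends in only 7 A's, so it is kept as is.
slide_read = "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA"
unchanged = trim_poly_a(slide_read, "I" * 50)
```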
Filter the biological contamination
Map your reads against the biological contaminant sequences (imported via Biomart).
We are interested in the reads that don't map!
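Once a mapper has reported which reads hit the contaminant sequences, keeping the rest is a simple filter. The Python sketch below assumes you already have the names of the mapped reads as a set, and that the mapper reports read names up to the first space; both are assumptions, and `keep_unmapped` is a hypothetical name.

```python
def keep_unmapped(records, mapped_names):
    """Yield (name, seq, qual) records whose read name is not in mapped_names."""
    for name, seq, qual in records:
        read_id = name.split()[0]  # name up to the first whitespace
        if read_id not in mapped_names:
            yield name, seq, qual

records = [
    ("r1 1:N:0:ACAGTG", "ACGT", "IIII"),
    ("r2 1:N:0:ACAGTG", "GGCC", "IIII"),
]
clean = list(keep_unmapped(records, {"r1"}))
```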
53.
Doing this in Galaxy
Useful: take a sample of your reads: fastq-to-tabular, select random lines, tabular-to-fastq
1. create a new history
2. load the sample data in
3. Run fastqMcf to remove technical sequences
4. Run bowtie to match against biological sequence
databases, and keep reads that don't match.
5. Summarize: fastqc
→ make a workflow of this sample history.
→ run the workflow on all your samples in parallel
→ store the cleaned reads in a data library.
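Outside Galaxy, taking a random sample of reads can be sketched with reservoir sampling, which needs only one pass over the file and keeps just the sample in memory. This is an illustration in Python, not what the Galaxy tools do internally; `sample_records` and the fixed seed are assumptions of the sketch.

```python
import random

def sample_records(records, k, seed=0):
    """Return k records chosen uniformly at random (fewer if the input is shorter)."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

all_reads = [(f"r{i}", "ACGT", "IIII") for i in range(100)]
subset = sample_records(all_reads, 10, seed=1)
```

For paired-end data the same sample has to be taken from both files, for example by sampling read names once and selecting them from each file.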
54.
Summary preprocessing
Your reads
→ Format consistent? Errors in quality?
Your groomed reads
→ Trends in raw data? QC report
→ Get technical contaminants
- …
Your groomed reads without technical contamination
→ Get biological contaminants
- …
- …
Your groomed reads without technical and biological contamination
→ How does your data look now? QC