RNA-seq: analysis of raw data and preprocessing - part 2

Second presentation slides of the 'RNA-seq for DE analysis' training. See http://www.bits.vib.be for more information.

Transcript

  • 1. Raw data investigation Joachim Jacob 20 and 27 January 2014 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  • 2. Experimental setup We have decided on: ● how many samples per condition ● how deep to sequence. These choices determine how reliable the statistics will be; base them on experience and on tools like Scotty. A wrong experimental design cannot be fixed afterwards. Best approach: pilot data (3 samples per condition, 10M reads). But there are other sequencing options to choose from!
  • 3. PE versus SE Illumina ● Single end (SE): from each cDNA fragment only one end is read. ● Paired end (PE): the cDNA fragment is read from both ends. [Figure: purify and fragment, then SE or PE sequencing]
  • 4. PE versus SE Illumina Single end (SE): ● Gene level differential expression. Paired end (PE): ● Novel splice junction detection ● De novo assembly of the transcriptome ● Helps with correctly positioning reads on the reference genome sequence. Note: PE is not the same as mate pairs.
  • 5. Strandedness ● Naive protocols obtain reads from cDNA fragments, BUT the link with the sense or antisense strand is broken. ● Stranded protocols generate reads from one strand, corresponding to the sense or antisense strand (depending on the protocol).
  • 6. Strandedness [Figure: not stranded versus stranded]
  • 7. Example of a stranded protocol ● dUTP protocol to generate stranded reads.
  • 8. Importance of strandedness ● Strandedness can bias the read counts compared to non-stranded protocols. ● Whether you should apply it depends on the genome: e.g. when genes overlap, the benefit of assigning reads to the correct gene can outweigh the added technical variation.
  • 9. Length of the reads ● Does not matter so much (when we want to quantify by aligning to a reference sequence): 50 bp will do. ● The most important point is to be able to accurately position the read on the reference genome sequence, to assign it to the correct gene. ● Length can become important if you want to assemble the transcriptome.
  • 10. For DE on the gene level The 'cheapest' protocol for high-throughput sequencing suffices to achieve DE detection: ● SE ● 50bp ● Option: strandedness. Use the money you have left over for increasing the number of replicates.
  • 11. Illumina TruSeq protocol
  • 12. Raw Illumina data The data you get arrives as... [Figure: file listing, with the barcode and experiment parts of the file names labelled] Compressed, usually with gzip.
  • 13. Raw Illumina data (this one: 87196924 lines) One read takes (minimum) 4 lines, including the sequence and the certainty of reading each base at each position ('quality').
    @HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG
    CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA
    +
    @@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA
    @HWI-ST571:202:D1B86ACXX:2:1102:1073:2240 1:N:0:ACAGTG
    CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT
    +
    CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI
    @HWI-ST571:202:D1B86ACXX:2:1102:1385:2192 1:N:0:ACAGTG
    GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT
    +
    CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE
    @HWI-ST571:202:D1B86ACXX:2:1102:1352:2244 1:N:0:ACAGTG
    ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC
    +
    @@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF
    @HWI-ST571:202:D1B86ACXX:2:1102:1981:2152 1:N:0:ACAGTG
    CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC
    +
    ??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4
    @HWI-ST571:202:D1B86ACXX:2:1102:1877:2165 1:N:0:ACAGTG
    GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC
    +
    ?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;
    http://wiki.bits.vib.be/index.php/.fas
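As a minimal sketch of how the four lines of a read fit together, the Python below parses the first record from slide 13 and converts its quality characters into Phred scores. It assumes the Phred+33 encoding (usual for this header style, but not stated on the slide) and strictly four lines per read. The names `read_fastq` and `phred_scores` are illustrative only.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # line 1 of the record, starts with '@'
    sequence: str  # line 2, the called bases
    quality: str   # line 4, one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header, sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters; offset 33 is an assumption (Phred+33)."""
    return [ord(char) - offset for char in quality]


# First record from the slide, wrapped in an in-memory file for the example.
example = io.StringIO(
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG\n"
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA\n"
    "+\n"
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA\n"
)
records = list(read_fastq(example))
scores = phred_scores(records[0].quality)

# With 4 lines per read, the line count of the file gives the read count.
reads_in_file = 87196924 // 4
```

In this read the two N bases carry the lowest quality character, '#'. With four lines per read, the 87196924 lines of the file correspond to 21,799,231 reads.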
  • 14. Exploring the raw data 1) Check whether the FASTQ file is consistent. 2) Make graphs of some metrics of the raw data. http://wiki.bits.vib.be/index.php/.fastq http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
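What "consistent" could mean in practice is sketched below: every record has four lines, the header starts with '@', the third line starts with '+', and sequence and quality have equal length. This is an illustrative check under those assumptions, not the tool used in the training. `check_fastq` and the file name in the comment are hypothetical.

```python
from typing import Iterable, List


def check_fastq(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means the file looks consistent."""
    problems: List[str] = []
    block: List[str] = []
    for number, line in enumerate(lines, start=1):
        block.append(line.rstrip("\n"))
        if len(block) < 4:
            continue
        header, sequence, separator, quality = block
        first = number - 3  # line number where this record starts
        if not header.startswith("@"):
            problems.append(f"line {first}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"line {first + 2}: separator does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(f"line {first}: sequence and quality differ in length")
        block = []
    if block:
        problems.append(f"incomplete final record ({len(block)} trailing lines)")
    return problems


# Usage on a gzip-compressed file (hypothetical file name):
#   import gzip
#   with gzip.open("sample.fastq.gz", "rt") as handle:
#       print(check_fastq(handle))
```

Reading through `gzip.open` in text mode avoids decompressing the file to disk first.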
  • 15. FastQC – graphical exploration http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • 16. FastQC – perfect example Reads have good quality!
  • 17. FastQC – perfect example Anna Karenina principle: “There is only one way to be good, but there are many ways to be wrong.” We will start by showing a good sample. Afterwards we will discuss a less good sample. http://en.wikipedia.org/wiki/Anna_Karenina_principle
  • 18. FastQC – perfect example Smooth histogram/density line towards the right.
  • 19. FastQC – perfect example Steady nucleotide distribution. Bias typical for Illumina.
  • 20. FastQC – perfect example GC content not strongly fluctuating. Bias typical for Illumina.
  • 21. FastQC – perfect example GC-content nicely bell shaped
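To make the "bell shape" concrete, the sketch below computes the GC percentage of each read and tallies how many reads fall at each percentage; plotting those counts gives a per-read GC curve. It is a simplified illustration, not FastQC's own implementation, and the function names are made up for this example.

```python
from collections import Counter
from typing import Iterable


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among the called bases (N and other symbols are ignored)."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def gc_distribution(sequences: Iterable[str]) -> Counter:
    """Count reads per rounded GC percentage (0 to 100)."""
    return Counter(round(100 * gc_fraction(seq)) for seq in sequences)


distribution = gc_distribution(["ATGC", "GGCC", "AATT", "ACGN"])
```

A single smooth peak suggests one population of sequences; a second peak or a strong skew hints at contamination or bias.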
  • 22. FastQC – perfect example No N's! (this should ring a bell)
  • 23. FastQC – perfect example All reads have a length of 50 bp.
  • 24. FastQC – perfect example Reads are nicely duplicated: some amount of duplication is to be expected in RNA-seq data.
  • 25. FastQC – perfect example Reads are nicely duplicated: some amount of duplication is to be expected in RNA-seq data.
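A rough way to put a number on duplication is sketched below: the fraction of reads that are repeats of a sequence already seen, plus the most frequent sequences. FastQC estimates duplication differently, so treat this as a simplified illustration; `duplication_summary` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplication_summary(
    sequences: Iterable[str], top: int = 3
) -> Tuple[float, List[Tuple[str, int]]]:
    """Return (fraction of reads that are duplicates, most common sequences)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    # Every read beyond the first copy of a sequence counts as a duplicate.
    duplicated_fraction = 1 - len(counts) / total
    return duplicated_fraction, counts.most_common(top)


fraction, most_common = duplication_summary(["AAA", "AAA", "AAA", "CCC"])
```

The list of most common sequences is also a starting point for spotting overrepresented technical sequences such as adaptors.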
  • 26. FastQC – perfect example Kmers are short sequence stretches. Sometimes they are overrepresented. But in RNA-seq this is not so important (duplication).
  • 27. FastQC – less good RNA-seq sample A relatively large portion of the reads has mistakes at the 3' end of the read.
  • 28. FastQC – less good RNA-seq sample There is an overrepresentation of reads with a low mean quality score
  • 29. FastQC – less good RNA-seq sample Not a steady level of different nucleotide fractions
  • 30. FastQC – less good RNA-seq sample Fluctuates
  • 31. FastQC – less good RNA-seq sample Heavily skewed towards AT-rich reads
  • 32. FastQC – less good RNA-seq sample Apparently a mixture of two sets of reads with different lengths
  • 33. FastQC – less good RNA-seq sample Duplication seems a bit on the low side (reported figures range from 60 to 75%)
  • 34. FastQC – less good RNA-seq sample Very highly skewed read numbers. Often the sequences of TruSeq adaptors or multiplex identifiers can be found here. BLAST can reveal more information!
  • 35. FastQC – less good RNA-seq sample Specific patterns of specific kmers. Note: A- and T-rich.
  • 36. Quality control of raw data Proceed? Or rerun? This QC shows which preprocessing steps you definitely need to apply. The extra time and money needed to correct the biases can sometimes justify a rerun of the experiment. This QC also shows which preprocessing steps have already been applied by the sequencing provider.
  • 37. Preprocessing Removing unwanted parts of the raw data, so that it helps as much as possible with reaching our goal: defining differentially expressed genes. 1) Removing technical contamination: ● low quality read parts ● technical sequences: adaptors ● PhiX internal control sequences. 2) Removing biological contamination: ● polyA-tails ● rRNA sequences ● mtDNA sequences. After this, we run FastQC again.
  • 38. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of low quality read parts: they have a higher chance of containing errors, and cause noise in our read counts.
  • 39. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of low quality read parts: they have a higher chance of containing errors, and cause noise in our read counts.
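The idea of removing low quality read parts can be sketched as trimming bases from the 3' end until a base reaches a minimum quality. This is a deliberately naive illustration, not the algorithm of fastq-mcf or any other tool. The threshold of 20 and the Phred+33 offset are assumptions, and `trim_3prime` is a hypothetical name.

```python
from typing import Tuple


def trim_3prime(
    sequence: str, quality: str, min_phred: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    # Sequence and quality are always cut at the same position.
    return sequence[:end], quality[:end]


trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGT", "IIIII###")
```

Real trimmers usually look at a window of bases instead of one base at a time, so that a single good base inside a bad stretch does not stop the trimming early.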
  • 40. Technical contamination
  • 41. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome.
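A minimal sketch of adaptor removal at the 3' end follows: cut the read where the full adaptor starts, or where the read ends in the beginning of the adaptor. It uses exact matching only, so it is much cruder than a dedicated tool such as fastq-mcf. `trim_adapter`, `min_overlap` and the adaptor string used in the example are assumptions for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adaptor (exact match only) and everything after it."""
    # Case 1: the complete adaptor occurs somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends in a prefix of the adaptor; try the longest prefix first.
    longest = min(len(adapter) - 1, len(read))
    for length in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return read[: len(read) - length]
    return read


EXAMPLE_ADAPTER = "AGATCGGAAGAGC"  # example adaptor start, chosen for illustration
full_hit = trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", EXAMPLE_ADAPTER)
partial_hit = trim_adapter("ACGTACGTAGATC", EXAMPLE_ADAPTER)
no_hit = trim_adapter("ACGTACGT", EXAMPLE_ADAPTER)
```

A lower `min_overlap` removes more partial adaptors but also clips more genuine bases that match the adaptor start by chance.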
  • 42. Technical contamination Our goal is to detect differential expression (DE); for this we need to assign reads with high confidence to the correct genomic location. Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome. [Figure labels: list of technical sequences; advised to use defaults] http://code.google.com/p/ea-utils/wiki/FastqMcf
  • 43. Fastq-mcf output http://code.google.com/p/ea-utils/wiki/FastqMcf
  • 44. Technical contamination ● Never remove duplicate reads! Highly expressed genes can have genuine duplicate reads, which are not due to the PCR amplification step in the protocol. ● PhiX sequences: the DNA of the Phi X bacteriophage is spiked in to monitor and optimize sequencing on Illumina machines. Your sequencing provider should filter out those sequences before delivery. You can filter them out by aligning your reads to the PhiX genome. http://en.wikipedia.org/wiki/Phi_X_174
  • 45. Biological contamination [Figure labels: cell; mitochondria contain rRNA, mRNA and mtDNA; nucleus; rRNA and non-coding (95% of RNA); mRNA (5% of RNA)]
  • 46. Biological contamination mRNAs are captured with oligo-dT coated beads. Occasionally, non-protein coding sequences are also captured (especially since mtRNA and rRNA can be relatively rich in AT). We can remove them via homology searching (BLAST) with known non-protein coding sequences. [Figure labels: mitochondrial; rRNA and nc; mRNA (5% of RNA)]
  • 47. Biological contamination mRNAs are post-transcriptionally modified, e.g. by the addition of a poly-A tail (AAAAAAAAAAAAA). If our goal is to map the reads to a reference genome sequence, the polyA tails should be removed. They can be viewed as a source of 'biological contamination' in our sequences (…).
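Removing a polyA tail from a read can be sketched as stripping a run of A's from the 3' end, but only when the run is long enough to look like a tail. The minimum length of 5 is an arbitrary assumption and `trim_poly_a` is a hypothetical name; dedicated tools also tolerate sequencing errors inside the tail, which this sketch does not.

```python
from typing import Tuple


def trim_poly_a(sequence: str, quality: str, min_tail: int = 5) -> Tuple[str, str]:
    """Strip a trailing run of A's if it is at least min_tail bases long."""
    stripped = sequence.rstrip("A")
    tail_length = len(sequence) - len(stripped)
    if tail_length < min_tail:
        return sequence, quality  # too short to call it a polyA tail
    # Keep the quality string in step with the shortened sequence.
    return stripped, quality[: len(stripped)]


with_tail = trim_poly_a("ACGTCAAAAAAA", "IIIIIIIIIIII")
without_tail = trim_poly_a("ACGTAA", "IIIIII")
```

For reads from the opposite strand the tail shows up as a run of T's at the 5' end, which would need a mirrored version of the same function.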
  • 48. Biological contamination ● Get the non-protein coding sequences via Biomart, and the mitochondrial genome sequence as well.
  • 49. Biological contamination
  • 50. Biological contamination
  • 51. Filter the biological contamination [Figure labels: your reads; the biological reads, imported via Biomart] We are interested in the reads that don't map!
  • 52. Filter the biological contamination [Figure labels: your reads; the biological reads, imported via Biomart] We are interested in the reads that don't map!
  • 53. Doing this in Galaxy Useful: take a sample of your reads (fastq-to-tabular, select random lines, tabular-to-fastq). 1. Create a new history. 2. Load the sample data. 3. Run fastqMcf to remove technical sequences. 4. Run bowtie to match against biological sequence databases, and keep the reads that don't match. 5. Summarize with FastQC. → Make a workflow of this sample history. → Run the workflow on all your samples in parallel. → Store the cleaned reads in a data library.
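Outside Galaxy, taking a random sample of reads can be sketched with reservoir sampling, which picks k records in a single pass without loading the whole file. The code assumes four lines per read; `group_records` and `sample_records` are hypothetical names, and the fixed seed only serves to make the sample reproducible.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # the 4 lines of one read


def group_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group a stream of lines into 4-line FASTQ records."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield tuple(block)
            block = []


def sample_records(lines: Iterable[str], k: int, seed: int = 0) -> List[Record]:
    """Return k records chosen uniformly at random (all records if fewer than k)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for index, record in enumerate(group_records(lines)):
        if index < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = record
    return reservoir


demo_lines: List[str] = []
for number in range(10):
    demo_lines += [f"@read{number}", "ACGT", "+", "IIII"]
demo_sample = sample_records(demo_lines, 3)
```

Sampling whole records, as here, keeps the four lines of each read together, which is the same reason the Galaxy route converts to tabular form before selecting random lines.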
  • 54. Summary preprocessing Your reads → format consistent? errors in quality? → your groomed reads → trends in raw data? (QC report) → get technical contaminants → your groomed reads without technical contamination → get biological contaminants → your groomed reads without technical and biological contamination → how does your data look now? (QC)
  • 55. Keywords ● Paired end ● Stranded reads ● gzip ● fastq ● Biological contamination ● Technical contamination ● Adapter sequence. Write in your own words what the terms mean.
  • 56. Exercise → investigating and preprocessing raw RNA-seq data
  • 57. Break