RNA-seq for DE analysis: extracting counts and QC - part 4

Part 4 of the training session 'RNA-seq for differential expression analysis' covers extracting the count table from a mapping and performing QC to detect sample biases. See http://www.bits.vib.be


  1. Generating the count table and validating assumptions. RNA-seq for DE analysis training. Joachim Jacob, 20 and 27 January 2014. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  2. Goal. Summarize the read counts per gene from a mapping result. The outcome is a raw count table on which we can perform some QC. This table is used by the differential expression algorithm to detect DE genes.
  3. Status
  4. The challenge. 'Exons' are the type of features used here. They are summarized per 'gene'. Figure labels: alternative splicing; overlaps no feature. Concept: GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads; GeneB = exon 1 + exon 2 + exon 3 = 180 reads. No normalization yet! Just pure counts, aka 'raw counts'.
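The summarization on slide 4 can be sketched in R as follows. The per-exon counts are hypothetical: the slide only gives the gene totals (215 and 180), so the exon values here were chosen purely so that they add up to those totals.

```r
# Hypothetical per-exon read counts; only the gene totals come from the slide
exon_counts <- data.frame(
  gene_id = c("GeneA", "GeneA", "GeneA", "GeneA", "GeneB", "GeneB", "GeneB"),
  exon    = c(1, 2, 3, 4, 1, 2, 3),
  reads   = c(60, 55, 70, 30, 50, 90, 40)
)

# Sum the exon-level counts per gene: raw counts, no normalization
gene_counts <- tapply(exon_counts$reads, exon_counts$gene_id, sum)
print(gene_counts)
```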
  5. Tools to count features. Different tools exist to accomplish this: http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
  6. Dealing with ambiguity. We focus on the gene level: merge all counts over different isoforms into one, taking into account:
       ● Reads that do not overlap a feature, but appear in introns. Take into account?
       ● Reads that align to more than one feature (exon or transcript). Transcripts can be overlapping, perhaps on different strands. (PE and strandedness can resolve this partially.)
       ● Reads that partially overlap a feature, not following known annotations.
  7. HTSeq-count has 3 modes. HTSeq-count recommends the 'union' mode. But depending on your genome, you may opt for the 'intersection_strict' mode. Galaxy allows experimenting! http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
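To make the difference between the two modes named on slide 7 concrete, here is a simplified R sketch of the decision rule. It is not HTSeq's own code: the function `assign_read` and the example read are hypothetical, and details such as strandedness and mapping quality are ignored. Each read is represented by the set of features found at each of its positions.

```r
# For each position of a read, the set of features overlapping that position.
# Hypothetical read: first half in an intron (no feature), second half in geneA.
read_sets <- list(character(0), character(0), "geneA", "geneA")

assign_read <- function(sets, mode = c("union", "intersection_strict")) {
  mode <- match.arg(mode)
  combined <- if (mode == "union") {
    Reduce(union, sets)        # features seen at any position of the read
  } else {
    Reduce(intersect, sets)    # features present at every position of the read
  }
  if (length(combined) == 0) return("no_feature")
  if (length(combined) > 1) return("ambiguous")
  combined
}

print(assign_read(read_sets, "union"))
print(assign_read(read_sets, "intersection_strict"))
```

With this rule, a read that only partly overlaps a gene is counted in 'union' mode but discarded in 'intersection_strict' mode, which is why the choice of mode changes the count table.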
  8. Indicate the SE or PE nature of your data (note: 'mate-pair' is not appropriate naming here). Annotation file: the file with the coordinates of the features to be counted. Mode. Reverse stranded: check with mapping visualization and with mapping QC (see earlier). For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file.
  9. Resulting count table column: one sample!
  10. Merging to create the experiment count table
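The merge on slide 10 can be sketched in base R as follows, assuming each sample yields a two-column table of gene identifier and count. The sample names, gene names, and counts are hypothetical, and the slides themselves do this step in Galaxy rather than in R.

```r
# Hypothetical per-sample count columns, one table per counted sample
sample1 <- data.frame(gene_id = c("geneA", "geneB", "geneC"), count = c(215, 180, 0))
sample2 <- data.frame(gene_id = c("geneA", "geneB", "geneC"), count = c(198, 240, 3))
per_sample <- list(WT_1 = sample1, MUT_1 = sample2)

# Rename each count column to its sample name, then join the tables on gene_id
renamed <- Map(function(df, name) setNames(df, c("gene_id", name)),
               per_sample, names(per_sample))
count_table <- Reduce(function(x, y) merge(x, y, by = "gene_id"), renamed)
print(count_table)
```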
  11. Resulting count table
  12. Quality control of count table. Relative numbers. Absolute numbers. In the end, we used about 70% of the reads. Check for your experiment.
  13. Quality control of count table. 2 types of QC: ● General metrics ● Sample-specific quality control
  14. QC: general metrics ● General numbers
  15. QC: general metrics. Which genes are most highly present? Which fractions do they occupy? Axes: gene, counts. 42 genes (0.63%) of the 6665 genes take 25% of all counts (TEF1alpha, putative ribosomal proteins, ...). This graph can be constructed from the count table.
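A figure like the one on slide 15 can be derived from the count table by sorting the counts and taking the cumulative fraction. The sketch below uses hypothetical counts for five genes, not the data behind the slide.

```r
# Hypothetical counts for one sample (illustrative values only)
counts <- c(geneA = 5000, geneB = 3000, geneC = 1000, geneD = 600, geneE = 400)

sorted   <- sort(counts, decreasing = TRUE)
cum_frac <- cumsum(sorted) / sum(sorted)

# Smallest number of top genes that together hold at least 25% of all counts
n_top <- which(cum_frac >= 0.25)[1]
cat(n_top, "gene(s), i.e.", round(100 * n_top / length(counts), 2),
    "% of the genes, take at least 25% of all counts\n")
```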
  16. QC: general metrics ● General numbers
  17. QC: general metrics. We can plot the counts per sample: filter out the zeros and transform to log2. The bulk of the genes have counts in the hundreds. A few are extremely highly expressed. A minority have extremely low counts. X-axis: log2(count).
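A minimal base R sketch of the plot on slide 17, assuming a numeric vector of counts for one sample. The counts are simulated here, so the shape of the curve is only illustrative.

```r
set.seed(1)
# Simulated counts for one sample (illustrative only, not the data on the slide)
counts <- rnbinom(2000, mu = 300, size = 0.5)

nonzero     <- counts[counts > 0]   # log2(0) is -Inf, so drop the zeros first
log2_counts <- log2(nonzero)
d <- density(log2_counts)

plot(d, xlab = "log2(count)", main = "Density of log2 counts, one sample")
```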
  18. QC: log2 density graph. We can do this for all samples and merge the graphs. All samples show nice overlap, and the peaks are similar. Figure annotation: strange deviation here.
  19. QC: log2, merging samples. Here, we take one sample, plot the log2 density graph, add the counts of another sample and plot again, add the counts of another sample, etc., until we have merged all samples. We see a horizontal shift of the graph rather than a vertical shift, pointing to no saturation.
  20. QC: log2, merging samples. Here, we take one sample, plot the log2 density graph, add the counts of another sample and plot again, add the counts of another sample, etc., until we have merged all samples.
  21. QC: rarefaction curve. What is the total number of detected features, and how does the feature space increase with each additional sample? There should be saturation, but here there is none. Code: library(ggplot2); ggplot(data = nonzero_counts, aes(x = total, y = counts)) + geom_line() + labs(x = "total number of sequenced reads", y = "number of genes with counts > 0")
  22. QC: rarefaction curve. Samples are added cumulatively: sample A, sample A + sample B, sample A + sample B + sample C, etc. Figure annotations: rRNA genes. Saturation: OK!
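The ggplot call on slide 21 expects a data frame `nonzero_counts` with columns `total` and `counts`, but the slides do not show how it was built. One way it could be constructed is sketched below in base R, with a hypothetical count matrix; whether the author computed it this way is an assumption.

```r
# Hypothetical count matrix: genes in rows, samples in columns
count_matrix <- cbind(
  A = c(10, 0, 0, 5, 0),
  B = c( 8, 3, 0, 0, 0),
  C = c(12, 0, 0, 7, 1)
)

# Cumulative counts per gene after adding sample A, then A + B, then A + B + C
cumulative <- t(apply(count_matrix, 1, cumsum))

nonzero_counts <- data.frame(
  total  = colSums(cumulative),      # total number of reads counted so far
  counts = colSums(cumulative > 0)   # number of genes with counts > 0 so far
)
print(nonzero_counts)
```

If the number of detected genes stops rising while the total keeps growing, the curve has saturated.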
  23. QC: transformations for viz. Regularized log (rLog) and 'Variance Stabilizing Transformation' (VST) as alternatives to log2. http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  24. QC: count transformations. Not normalizations! ● Techniques used for microarray can be applied on VST transformed counts. Panels: log2, rLog, VST. Links: http://www.biomedcentral.com/1471-2105/14/91 and http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
  25. QC including condition info. ● We can also include condition information, to interpret our QC better. ● For this, we need to gather sample information. Make a separate file in which sample info is provided (metadata).
  26. QC with condition info. What do the differences in counts between samples depend on? Here, counts depend on the treatment and the strain. Must match the sample descriptions file.
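The slide does not spell out what exactly "must match". The sketch below assumes it covers both the factor names (treatment, strain) and the sample names shared between the count table and the description file. All names and values are hypothetical.

```r
# Hypothetical sample description table (metadata); column names are illustrative
sample_info <- data.frame(
  sample    = c("WT_1", "WT_2", "MUT_1", "MUT_2"),
  strain    = c("WT", "WT", "MUT", "MUT"),
  treatment = c("control", "treated", "control", "treated"),
  stringsAsFactors = FALSE
)

# Hypothetical count table with one column per sample, in a different order
count_table <- matrix(1:8, nrow = 2,
                      dimnames = list(c("geneA", "geneB"),
                                      c("WT_1", "MUT_1", "WT_2", "MUT_2")))

# The factors of interest must exist as columns in the metadata,
# and every counted sample must be described exactly once
stopifnot(all(c("treatment", "strain") %in% colnames(sample_info)),
          setequal(colnames(count_table), sample_info$sample),
          !anyDuplicated(sample_info$sample))

# Put the count columns in the same order as the metadata rows
count_table <- count_table[, sample_info$sample]
```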
  27. QC with condition info. Clustering of the distance between samples based on transformed counts can reveal sample errors. Panels: VST transformed, rLog transformed. Colour scale: the distance measure between samples. Similar conditions should cluster together.
  28. QC with condition info. Clustering of transformed counts can reveal sample errors. Panels: VST transformed, rLog transformed.
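Sample-to-sample clustering can be sketched in base R with `dist()` and `hclust()`. The slides use VST or rLog transformed counts from DESeq2; the matrix below is simulated to stand in for such values, and the sample names are hypothetical.

```r
set.seed(42)
# Simulated transformed values (genes in rows, samples in columns);
# a stand-in for VST or rLog output, not real data
base_wt  <- rnorm(200, mean = 8, sd = 2)
base_mut <- base_wt + rnorm(200, mean = 0, sd = 1.5)
transformed <- cbind(
  WT_1  = base_wt  + rnorm(200, sd = 0.2),
  WT_2  = base_wt  + rnorm(200, sd = 0.2),
  MUT_1 = base_mut + rnorm(200, sd = 0.2),
  MUT_2 = base_mut + rnorm(200, sd = 0.2)
)

# dist() works on rows, so transpose to get sample-to-sample distances
sample_dist <- dist(t(transformed))
clustering  <- hclust(sample_dist)

# Cut the tree into two groups; replicates are expected to share a group
groups <- cutree(clustering, k = 2)
print(groups)
```

A sample that lands in the wrong group is a candidate for a swap or a labelling error.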
  29. QC with condition info. Principal component (PC) analysis displays the samples in a 2D scatterplot based on the variability between the samples. Samples close to each other resemble each other more.
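A PC plot of this kind can be sketched with base R's `prcomp()`. The input is simulated and the sample names are hypothetical. The slides apply the analysis to transformed counts, and their exact procedure is not shown.

```r
set.seed(7)
# Simulated transformed counts (genes in rows, samples in columns), illustrative only
condition_effect <- rnorm(300, sd = 2)
transformed <- cbind(
  WT_1  = rnorm(300, mean = 8, sd = 0.3),
  WT_2  = rnorm(300, mean = 8, sd = 0.3),
  MUT_1 = rnorm(300, mean = 8, sd = 0.3) + condition_effect,
  MUT_2 = rnorm(300, mean = 8, sd = 0.3) + condition_effect
)

# prcomp() expects observations (samples) in rows, so transpose
pca <- prcomp(t(transformed))
explained <- pca$sdev^2 / sum(pca$sdev^2)

plot(pca$x[, 1], pca$x[, 2],
     xlab = sprintf("PC1 (%.0f%% of variance)", 100 * explained[1]),
     ylab = sprintf("PC2 (%.0f%% of variance)", 100 * explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```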
  30. Collect enough metadata. Principal component (PC) analysis displays the samples in a 2D scatterplot based on the variability between the samples. Samples close to each other resemble each other more. Why do these resemble each other?
  31. QC with condition info. During library preparation, collect as much information as possible to add to the sample descriptions. Pay particular attention to differences between samples, e.g. day of preparation, centrifuges used, ... Why do these resemble each other?
  32. Collect enough metadata. In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples. Legend (additional metadata): Day 1, Day 2.
  33. Collect enough metadata. In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect). Legend (additional metadata): Day 1, Day 2.
  34. Collect enough metadata
  35. Next step. Now that we know our data inside out, we can run a DE algorithm on the count table!
  36. Keywords: raw counts, VST. Write in your own words what the terms mean.
  37. Break