With next-generation sequencing it has become possible to analyse the transcriptome of non-model organisms by performing a de novo assembly of RNA-seq reads. In particular, differential expression analysis can be undertaken without the need for a reference genome or annotation. While a number of studies have compared the relative merits of different transcriptome assembly programs, less attention has been given to the methodology for performing a differential expression analysis after the transcriptome has been assembled.
Differential expression analysis on a de novo assembly faces several challenges, including mapping reads to transcripts, clustering similar transcripts, and producing a summary of read counts for statistical testing. In particular, we have found that transcriptome assembly produces far more transcripts than would generally be expected. I will discuss the reasons for this and assess the different strategies for taking the de novo assembled transcripts and producing a list of differentially expressed genes.
I demonstrate that clustering transcripts into loci improves the interpretability of results and increases statistical power, but that the results depend strongly on the choice of clustering method. Most clustering tools are not optimised for de novo assembled sequences. To address this, we are developing a method that uses hierarchical clustering to group transcripts based on shared reads. We also explore possible choices for mapping reads and summarising read counts for gene clusters.
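
To make the idea of grouping transcripts by shared reads concrete, the following is a minimal sketch and not the method under development. It assumes that multi-mapping reads are available as a mapping from read name to the transcripts each read aligns to, that similarity between two transcripts is the number of shared reads divided by the read count of the smaller transcript, and that clusters are merged by single linkage until no pair reaches a threshold. The function names, the similarity measure, the linkage rule and the threshold of 0.5 are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def shared_read_similarity(read_to_transcripts):
    """Return pairwise similarities and the number of reads per transcript.

    Assumed similarity: shared reads / reads of the smaller transcript.
    """
    reads_per_transcript = defaultdict(int)
    shared = defaultdict(int)
    for transcripts in read_to_transcripts.values():
        unique = sorted(set(transcripts))
        for t in unique:
            reads_per_transcript[t] += 1
        # Every pair of transcripts hit by the same read shares that read.
        for a, b in combinations(unique, 2):
            shared[(a, b)] += 1
    similarity = {
        (a, b): n / min(reads_per_transcript[a], reads_per_transcript[b])
        for (a, b), n in shared.items()
    }
    return similarity, dict(reads_per_transcript)


def cluster_transcripts(read_to_transcripts, threshold=0.5):
    """Naive agglomerative clustering with single linkage (illustrative only)."""
    similarity, reads_per_transcript = shared_read_similarity(read_to_transcripts)
    clusters = [frozenset([t]) for t in sorted(reads_per_transcript)]

    def linkage(c1, c2):
        # Single linkage: the strongest similarity between any two members.
        return max(
            similarity.get((min(a, b), max(a, b)), 0.0) for a in c1 for b in c2
        )

    while len(clusters) > 1:
        best_score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break  # no remaining pair is similar enough to merge
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


def cluster_counts(read_to_transcripts, clusters):
    """Count each read once for every cluster it maps to."""
    owner = {t: idx for idx, members in enumerate(clusters) for t in members}
    counts = [0] * len(clusters)
    for transcripts in read_to_transcripts.values():
        for idx in {owner[t] for t in transcripts}:
            counts[idx] += 1
    return counts


# Toy input (invented for illustration): read name -> transcripts it maps to.
example_reads = {
    "r1": ["t1", "t2"], "r2": ["t1", "t2"], "r3": ["t1"],
    "r4": ["t2", "t3"], "r5": ["t3"], "r6": ["t3"], "r7": ["t3"],
    "r8": ["t4"],
}
example_clusters = cluster_transcripts(example_reads, threshold=0.5)
example_counts = cluster_counts(example_reads, example_clusters)
```

In this toy input, t1 and t2 share two of their three reads and are merged, whereas t2 and t3 share only one read and stay apart. Counting each read once per cluster avoids inflating counts when a read maps to several transcripts from the same locus. This is only one of the possible summarisation choices.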