Differential expression analysis of de novo assembled transcriptomes - Nadia Davidson


With next generation sequencing it has become possible to analyse the transcriptome of non-model organisms by performing a de novo assembly of RNA-seq reads. In particular, differential expression analysis can be undertaken without the need for a reference genome or annotation. While a number of studies have compared the relative merits of different transcriptome assembly programs, less attention has been given to the methodology for performing a differential expression analysis after the transcriptome has been assembled.

Differential expression analysis on a de novo assembly suffers from several challenges including mapping reads to transcripts, clustering similar transcripts and producing a summary of read counts for statistical testing. In particular, we have found that transcriptome assembly produces a much larger number of transcripts than would generally be expected. I will discuss the reasons for this and will assess the different strategies for taking the de novo assembled transcripts and producing a list of differentially expressed genes.

I demonstrate that clustering transcripts into loci improves the interpretability of results and increases statistical power, but that results are very dependent on the choice of clustering. Most clustering tools are not optimised for de novo assembled sequences, and to address this, we are developing a method which uses hierarchical clustering to group transcripts based on shared reads. We also explore possible choices for mapping and summarising read counts to gene clusters.
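The shared-read clustering idea can be illustrated with a minimal Python sketch. This is not the tool under development. It follows the outline on slides 23 to 25 of the transcript, with assumptions: the distance formula did not survive in the transcript, so the form `1 - R_ab / min(R_aa, R_bb)` is assumed here because it reproduces the slide's example values (0 for R=2, 0.5 for R=1). The merge updates follow one reading of slide 25 (reads on a merged cluster are `R_aa + R_bb - R_ab`, and reads shared with a third cluster are the maximum over the merged pair). The function name, argument names and example data are all hypothetical.

```python
from itertools import combinations


def cluster_by_shared_reads(read_hits, max_distance=0.5):
    """Agglomerative clustering of transcripts based on shared reads.

    read_hits: iterable of collections, one per read, holding the names of
               the transcripts that read maps to (multi-mapping allowed).
    max_distance: merging stops once the closest pair of clusters is
                  further apart than this threshold.
    Returns a sorted list of clusters, each a sorted list of transcript names.
    """
    members = {}  # cluster id -> transcript names in that cluster
    index = {}    # transcript name -> initial cluster id
    R = {}        # frozenset({i, j}) -> reads shared by i and j;
                  # frozenset({i}) -> reads mapping to i at all

    for hits in read_hits:
        ids = []
        for name in set(hits):
            if name not in index:
                index[name] = len(index)
                members[index[name]] = [name]
            ids.append(index[name])
        for i in ids:
            R[frozenset((i,))] = R.get(frozenset((i,)), 0) + 1
        for i, j in combinations(ids, 2):
            R[frozenset((i, j))] = R.get(frozenset((i, j)), 0) + 1

    def reads(i, j):
        return R.get(frozenset((i, j)), 0)

    def distance(i, j):
        # ASSUMED distance: the slide's formula is missing from the transcript.
        return 1.0 - reads(i, j) / min(reads(i, i), reads(j, j))

    next_id = len(members)
    while len(members) > 1:
        d, i, j = min(
            (distance(a, b), a, b) for a, b in combinations(sorted(members), 2)
        )
        if d > max_distance:
            break
        new, next_id = next_id, next_id + 1
        # Reads mapping to either of the merged clusters.
        R[frozenset((new,))] = reads(i, i) + reads(j, j) - reads(i, j)
        # Reads shared with every other cluster: maximum over the merged pair.
        for k in members:
            if k not in (i, j):
                R[frozenset((new, k))] = max(reads(i, k), reads(j, k))
        members[new] = members.pop(i) + members.pop(j)

    return sorted(sorted(m) for m in members.values())


# Hypothetical multi-mapping result: one inner list per read.
example_hits = [
    ["A1", "A2"], ["A1", "A2"], ["A2", "A3"], ["A3"], ["B1"], ["B1"],
]
```

With `example_hits`, a threshold of 0.5 groups A1, A2 and A3 and leaves B1 alone, while a threshold of 0.25 keeps A3 separate. Transcripts that share no reads are at distance 1, so any threshold below 1 never joins them, which is why the separate super-cluster step is omitted from this sketch.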

Published in: Technology

  1. WEHI Bioinformatics Seminar, April 9th 2013. Differential expression analysis of de novo assembled transcriptomes. Nadia Davidson, Murdoch Childrens Research Institute
  2. RNA-Seq on non-model organisms
     • RNA-Seq is a powerful technology for studying the transcriptome:
       – Gene annotation, splice variants
       – Estimating gene abundance and differential gene expression
     • In particular, these things can be done for non-model organisms
       – Without the need for a gene annotation
       – Without the need for a reference genome
     • By de novo assembling the transcriptome
       – But it has its challenges
  3. De novo assembly
  4. Transcriptome Assemblers
     – For genome assembly a k-mer length must be selected and optimised for the coverage level.
     – But transcriptomes have a high dynamic range of coverage
     – Solution 1: Use a genome assembler and perform multiple assemblies with different k-mer values and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
     – Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
     – Many studies compare the different assemblers.
     – Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled
       • Our aim
  5. Our RNA-Seq dataset
     • One HiSeq lane: 160 million 100 bp paired-end reads from chicken samples (4 female, 4 male)
       – Already had the data from another project
       – Model organism
     • Assembled the data using Trinity and Oases.
       – Starting with these assemblies we investigated how to perform a differential expression analysis
       – 300k and 600k transcripts from Trinity and Oases respectively.
  6. Q1. Why so many transcripts?
  7. Transcripts grow with reads
     [Figure: Francis et al., BMC Genomics 2013]
  8. De Bruijn Graph Complexity
     – Sequencing errors
     – Heterozygosity
     – Different isoforms
     – Paralogs
     [Figure: graph in which AGGTCTGA is followed by either ATTCGATG or ATTCCATG, then ACCTGAGA]
     Reported transcripts:
     – AGGTCTGA ATTCGATG ACCTGAGA
     – AGGTCTGA ATTCCATG ACCTGAGA
  9. Simulation Study
     Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7
     A simulation study of de novo transcriptome assemblies. 17K genes. 100 million, 100bp paired-end reads
     “Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”
  10. Variation in coverage
      • Across transcripts
      [Figure: two coverage examples, each labelled “Reported Transcripts”]
      Different coverage could mean different contigs assembled for each k-mer
  11. Other transcripts
      • We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
      • What are the other transcripts?
      [Figure: panels for Trinity Assembly and Oases Assembly; categories: Known genes, Novel in genome, Novel not in genome]
  12. Abundance of Gene Type from ENCODE with > 100 million reads
      [Figure: S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233]
  13. Our novel genes (Trinity assembly)
  14. Q2. Isoform or gene-level analysis?
  15. The Cons
      Isoforms:
      • List may be too long:
        – Difficult to interpret
        – Computationally expensive
        – Larger correction for multiple testing
      • Not obvious how to assign ambiguously aligned reads
        – Can lead to double counting if ignored, or
        – Less power if reads are split between transcripts
      • Not all transcripts represent different isoforms anyway
      Genes:
      • Not sensitive to differential splicing
      • Not obvious how transcripts should be clustered into genes
  16. Q3. How to cluster transcripts into genes?
  17. Which clusters to use?
      • This is not an obvious problem to solve:
        – group genes which share sequence, i.e. only differ by splicing, SNPs or in-dels.
        – but place paralogs in a different cluster
        – This is complicated by the quality of the assembly, e.g.
          • Cluster ✓: two incomplete sequences from the same gene (Gene A, Gene A)
          • Do not cluster ✗: low coverage repeat sequence past UTR (Gene A, Gene B)
  18. Clustering Options
      • What you can use:
        – The locus/component information from the assembler.
          • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
        – Sequence similarity clustering such as CD-HIT, Blastclust etc.
      We tested the accuracy of these clustering methods on our assemblies.
      “Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)
  19. How we assess clustering
      Scored based on correct/incorrect pairwise groupings (like for the Rand Index).
      Example: true positives = 2, true negatives = 4, false positives = 2, false negatives = 2
      • False positives indicate “over clustering”
        – “over clustered” example: TP = 4, TN = 0, FP = 6, FN = 0
      • False negatives indicate “under clustering”
        – “under clustered” example: TP = 2, TN = 6, FP = 0, FN = 2
  20. Trinity Assembly Clustering
      335,377 transcripts
      [Figure: clustering results with “over clustered” and “under clustered” regions and the number of clusters; legend: ✕ Ideal, CD-HIT-EST, Trinity]
      TP = true positives: number of pairs of transcripts which correctly share a cluster
      FN = false negatives: number of pairs of transcripts which are incorrectly split
  21. Oases Assembly Clustering
      540,933 transcripts
      [Figure: clustering results with “over clustered” and “under clustered” regions, the number of clusters and the max transcripts in a cluster; legend: ✕ Ideal, CD-HIT-EST, Oases]
  22. Can we do better?
      • CD-HIT-EST uses only the sequence information, but we also have the reads
        – We could down-weight regions which are expressed at a low level
        – We could separate sequences which show different expression between sample groups
        – Using paired-end reads gives extra leverage to group transcripts
      • We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster
  23. The idea
      – Multi-map reads to the assembly
      – Separate transcripts into super-clusters
        • Transcripts are grouped if they share ANY reads with another transcript
      – For each super-cluster
        • For each pair of transcripts, calculate the distance
          – Rab = number of reads which map to both transcript a and transcript b
        • To do: incorporate sample information too
      – Hierarchically cluster the transcripts using the distance metric
      – Stop when the distance between grouped transcripts is too large (this threshold is a parameter of the algorithm)
      – To do: output the counts for each cluster
  24. Step 1
  25. Step 1, Step 2 – make a distance matrix
      [Figure: worked example with pairwise distances of 0 (R=2) and 0.5 (R=1)]
      Distance = (formula shown as an image on the slide), R = reads
      Update: R2’2’ = R22 + R33 - R23, R12’ = max(R12, R23)
      Recalculate the distance
  26. Cutting the tree at 0.5 or less would give the correct clustering
      [Figure: dendrogram with levels Distance = 1, Distance = 0.5, Distance = 0]
  27. How do we do?
      [Figure: two panels, Oases assembly and Trinity assembly, each with “over clustered” and “under clustered” regions; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]
  28. Impact on differential expression (DE)
      • To assess this we:
        – Mapped reads back to all transcripts (“best” mapping - bowtie)
        – Counted the reads which overlapped a transcript (samtools)
        – Added up all the counts for each cluster
        – Performed a DE analysis in edgeR for males vs. females
        – Compared against a “truth” DE list
          • Obtained from a genome based analysis on RefSeq genes (5 thousand genes)
          • True positives – false discovery rate < 0.05
          • RefSeq genes were identified in the de novo assembly
          • Non-identified clusters were excluded
  29. DE results - Oases
      [Figure: DE results with “over clustered” and “under clustered” regions; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]
      Conclusion: Better to “under” cluster than “over” cluster
  30. DE results - Trinity
      [Figure: DE results with “over clustered” and “under clustered” regions; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]
  31. Q4. What is the best way to go from reads to counts?
  32. Approaches
      1. Do what we did before – add up counts for each cluster
      2. Trinity and Oases suggest:
         – Multi-mapping reads to transcripts
         – Then use a program which can deal with ambiguously mapped reads
           • RSEM can take the clustering as input and return gene-level counts.
      3. What people have actually done:
         – Select a set of representative transcripts (i.e. the longest one)
         – Map reads using their favorite mapper.
         – Count the number of reads which overlap the transcripts
         e.g. Sandmann et al., Genome Biology, 2011 12:R76
  33. The alternatives:
      • Representative transcript: select representative transcript (longest) → best-map reads (bowtie) → count reads overlapping transcripts (samtools)
      • Add up counts: “best” map reads (bowtie) → count reads overlapping transcripts (samtools) → add counts in a cluster (own script)
      • RSEM: multi-map reads (Trinity script) → get counts (RSEM)
      All three: gene-level counts → edgeR → gene-level DE results
      Use the same clustering for all three approaches
  34. Results
      [Figure: two panels, Oases Assembly and Trinity Assembly]
      Difference between methods is small - could probably do any of them
  35. Conclusions
      • Q1. Why so many transcripts?
        – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
        – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
        – Sequencing to a high depth may give you more intergenic non-coding transcripts.
      • Q2. Isoform or gene-level analysis?
        – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts
  36. Conclusions cont.
      • Q3. How to cluster transcripts into genes?
        – We found that Trinity’s clustering was good, but the clustering from Oases and CD-HIT-EST was poor
        – We are developing a tool for clustering which already works better than the alternatives based on differential expression results
      • Q4. What is the best way to go from reads to counts?
        – We compared three alternatives for mapping/abundance estimation
        – Results were similar for all three
        – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline
  37. Future Work
      • We have only looked at one RNA-Seq dataset.
        – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
          • Different species (all model organisms)
          • Different read depths
      • Our clustering tool:
        – Would like to output the gene-level counts for each cluster.
          • Then compare to other abundance estimation approaches.
        – Would like to incorporate differences in expression between groups to improve the clustering
      • More investigation into the pipeline methods
        – E.g. mapping
  38. Acknowledgements
      • MCRI Bioinformatics: Alicia Oshlack, The Bioinformatics Group
      • Chicken RNA-Seq Data from: Katie Ayers (MCRI), Craig Smith (MCRI)
      • VLSCI
      • AGRF
      [Image: Red Jungle Fowl (credit: NHGRI)]
  39. Extra Slides
  40. Trinity and Oases compared
      [Figure legend: Oases; Trinity – version from the end of 2012; Trinity – version from the start of 2012]
      frac_match = length of the longest matching assembled transcript / “true” length of the transcript
  41. Number of genes to transcripts (ordered by DE)
      [Figure panels: Yeast, Chicken (Trinity), Chicken (Oases)]
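
The pairwise scoring used to assess clustering on slide 19 can be sketched in a few lines of Python. Every unordered pair of transcripts is classified by whether the two transcripts share a cluster in the truth and in the prediction. The function name and the five-transcript example are hypothetical. The example was chosen so that its counts match the numbers quoted on the slide, and is not taken from the talk's data.

```python
from itertools import combinations


def pairwise_scores(truth, predicted):
    """Count pairwise agreement between two clusterings.

    truth, predicted: dicts mapping transcript ID -> cluster label.
    Returns (TP, TN, FP, FN) over all unordered pairs of transcripts.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            tp += 1   # correctly share a cluster
        elif same_truth:
            fn += 1   # incorrectly split ("under clustering")
        elif same_pred:
            fp += 1   # incorrectly joined ("over clustering")
        else:
            tn += 1   # correctly kept apart
    return tp, tn, fp, fn


# Hypothetical truth: gene A has three transcripts, gene B has two.
truth = {"t1": "A", "t2": "A", "t3": "A", "t4": "B", "t5": "B"}

mixed = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 2}
over = {t: 1 for t in truth}                            # everything in one cluster
under = {"t1": 1, "t2": 1, "t3": 2, "t4": 3, "t5": 3}   # gene A split in two
```

Here `pairwise_scores(truth, mixed)` gives TP=2, TN=4, FP=2, FN=2, the over-clustered case gives TP=4, TN=0, FP=6, FN=0, and the under-clustered case gives TP=2, TN=6, FP=0, FN=2. The Rand Index mentioned on the slide would be (TP + TN) divided by the total number of pairs.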