1. RNA-seq analysis
Mikael Huss, bioinformatics scientist at WABI (Wallenberg Advanced Infrastructure for Bioinformatics), Science for Life Laboratory / DBB, Stockholm University
February 13, 2013
2. Omics, biology and diseases
[Figure: genomics + RNA profiles + protein “parts list” + protein profiles + interactomics -> systems biology -> pathways, molecular targets, diagnostics]
3. Approximate contents of talk
- Gene expression analysis in general; differences between RNA-seq and microarrays
- Typical workflow(s) for RNA-seq analysis
- Normalization issues
- Visualization
- Differential expression analysis
I have tried to include many references so you can go back to these slides for reference afterwards.
4. How DNA gets transcribed to RNA (and then translated to proteins) varies between e.g.
- Tissues
- Cell types
- Cell states
- Individuals
5. What can gene expression tell us? Basic research
- How do gene expression patterns determine cellular identity? (tissues, cell types …)
- How does gene expression control early development in an embryo?
- What kinds of genes are expressed in response to specific stimuli (infections, smoking, environmental pollution, gym exercise …)?
- What kinds of genes do bacteria or other microorganisms express in the human gut / in soil / in oceans under different conditions?
… and much, much more …
6. What can gene expression tell us? Diseases
- Which genes are over- (or under-)expressed in patients vs. healthy controls?
- Which genes are correlated to disease progression?
- Can markers of hidden disease be found by sequencing blood plasma?
7. Gene expression signatures for disease?
Hypothesis: Cell types are stable states in a “space” of gene expression patterns. Diseases (e.g. cancers) distort the gene expression so that the cell ends up in the wrong stable state.
Furusawa and Kaneko, Biology Direct 2009, 4:17
8. Can the research community find such patterns?
On-line prediction competitions, objectively scored by the organizers:
- Diagnosing MS (multiple sclerosis), lung cancer, psoriasis, COPD (KOL in Swedish)
- Prognosticating breast cancer outcome
9. Human tissue RNA-seq data sets
- Genotype-Tissue Expression project (GTEx): http://commonfund.nih.gov/GTEx/
- Illumina Human Body Map: accessed via the ReCount database, bowtie-bio.sourceforge.net/recount/
- Wang 2008 data set of ~15 human tissues: accessed via ReCount
- RNA-seq Atlas: http://medicalgenomics.org/rna_seq_atlas
- Human Protein Atlas: http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)
10. Tools for genome-scale gene expression measurements
- Microarrays (ca. 1995): sometimes called “gene chips”; based on hybridization
- RNA sequencing (ca. 2008 in current form): based on sampling
12. Alternative: rRNA depletion
There are various kits for depleting rRNA instead.
Pluses:
- Can use for microorganisms that don’t have poly-A tails
- Thus, can use for simultaneous host/pathogen expression profiling
- Can find non-coding RNA
Minuses:
- Usually leaves in quite a lot of rRNA
- In practice, often variable efficiency between samples -> hard to compare results
13. Sequencing platforms
Platform     ABI 3730xl          454 Life Sciences   SOLiD + Illumina   Pacific Biosciences, Oxford Nanopore etc.
Method       Sanger sequencing   pyrosequencing      -                  single-molecule sequencing
Length/read  800 bp              400 bp              100 bp             20 000+ bp
Reads/run    96                  1 million           2 billion          5 million
Bases/run    60 kbp              400 Mbp             500 Gbp            100 Gbp
Speed        10 years/HG         1 month/HG          1 day/HG           10 min/HG
Generation   “old school”        “2nd gen”           “2nd gen”          “3rd gen”
14. Microarray: Hybridization
The design of the microarray determines what you can detect in a sample.
[Figure. Source: Wikipedia]
15. RNA sequencing: Sampling
It is possible to detect transcripts that are not known a priori (in advance).
16. RNA-seq advantages
The non-dependence on a reference makes possible:
- meta-transcriptomics
- detecting novel splice variants
- detecting novel transcripts
- fusion transcripts
- non-coding transcripts
17. Some examples
[Figure: panels for RNA-seq Atlas and Wang 2008]
18. Some examples
[Figure: panels for RNA-seq Atlas, Wang 2008 and HPA, with skeletal muscle and adipose tissue samples marked]
19. What does one do with RNA-seq reads?
- Mapping (also called alignment)
- (de novo) Assembly
20. Mapping (alignment) vs. assembly Imagine a book being ripped to pieces with word or sentence fragments ending up on each piece of paper. If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem. If you have no copy of the book, you have a de novo assembly problem.
21-25. Mapping to a reference genome
[Figure, animated over five slides: reads from the sequencer (CAATCAGA G TCCCACTGTGG, AGACG TCCCACTGTGGGGTG, GTGAAGTGTCCGTAGATGTGTG, GCAAATGCAATCAGACG TCCC) are placed one by one on the gene (or transcript) sequence; mismatches are labelled as sequencing error or genetic variation]
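To make the mapping idea concrete, here is a minimal sketch of a naive mapper that slides each read along a reference and accepts the best placement with at most one mismatch. The reference string and the reads are made-up toy data, and real aligners use indexed data structures rather than this brute-force scan, so treat it only as an illustration of the principle.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement of read, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        # Keep the first placement with the fewest mismatches.
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


# Hypothetical toy reference and reads (not taken from a real genome).
reference = "GCAAATGCAATCAGACGTCCCACTGTGGGGTG"
reads = [
    "AATCAGACG",    # exact match
    "TCCCACTGTGG",  # exact match
    "CAATCAGAGG",   # one mismatch: sequencing error or genetic variation
    "TTTTTTTTT",    # does not map
]
hits = {read: map_read(read, reference) for read in reads}
for read, hit in hits.items():
    print(read, hit)
```

A mismatch on its own does not tell you whether it is a sequencing error or a true variant; that needs coverage from several reads at the same position.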
26. Mapping to the genome vs. the transcriptome
Vs. the genome:
- Can (in principle) detect new transcripts, splice variants
- Less sensitive, need a lot of coverage to discover new things
- Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.
Vs. the transcriptome:
- Not unbiased anymore, tied to existing annotation
- Faster, more sensitive, need less coverage
The best of both worlds?
- Tools like TopHat (v1.4 and up) now do both
27. If it had been de novo assembly
[Figure: the same reads without a reference. Assembly merges the overlapping reads CAATCAGA G TCCCACTGTGG, AGACG TCCCACTGTGGGGTG and GCAAATGCAATCAGACG TCCC into consensus sequence(s); the read GTGAAGTGTCCGTAGATGTGTG is left over as a “singleton”]
28. Assembly of RNA-seq reads
Will not be discussed much further here.
Most popular de novo assemblers build de Bruijn graphs where overlapping k-mers are connected to each other. The programs then try to find paths through the graph.
Typically needs a LOT of RAM. Can try to pre-process using “digital normalization”.
Tools:
- Trinity
- Velvet/Oases
- CLC Bio (commercial)
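As a rough illustration of the de Bruijn idea, the sketch below splits reads into k-mers, links each (k-1)-mer prefix to its (k-1)-mer suffix, and then walks the unambiguous path to spell out a contig. The reads, the choice k=6 and the start node are made-up assumptions; real assemblers such as Trinity or Velvet/Oases additionally handle sequencing errors, branches from isoforms and coverage, none of which is attempted here.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow edges from start while the path is unambiguous; return the contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from one short transcript.
reads = ["GCAAATGCAATC", "TGCAATCAGACG", "CAGACGTCCC"]
graph = de_bruijn(reads, k=6)
contig = walk(graph, "GCAAA")
print(contig)
```

The graph stores every distinct k-mer, which hints at why memory use grows quickly on real data sets.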
29. Assembly of RNA-seq reads
Typical workflow could be:
- Clean the reads properly (remove adapters, low-quality reads)
  - Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.
- Run assembly tool of choice, resulting in a set of contigs
- BLAST the contigs against the nt database, check the % overlap with transcripts in related organisms
- Map your original reads back to the contigs and count the reads overlapping each <- combination of assembly & mapping
30. Quantifying expression with RNA-seq
- Microarrays give a continuous (floating-point) expression value for each gene
- RNA-seq gives an integer value for each gene (“digital expression”): read counts
31. Example (SciLifeLab) mapping workflow
FASTQ file(s)
-> TopHat 2.0 -> BAM file
-> Picard tools (SortSam, MarkDuplicates) -> sorted BAM file with duplicate reads removed, which is then used by:
   - HTSeq 0.5 -> gene-level count files (for DE analysis)
   - Cufflinks 2.0 -> gene- and isoform-level expression estimates (FPKM, for reporting)
33. (what it would look like mapped to the genome)
[Figure: reads mapped across Exon 1, Exon 2 and Exon 3]
Need a special mapping algorithm which allows large gaps, a “split-read aligner”.
34. (what we would actually observe; of course we don’t know which reads come from which isoform)
Statistical algorithms are needed to estimate what proportion of reads comes from which isoform (for example, maximum likelihood / expectation maximization).
35. Isoform quantification tools (F = free, C = commercial, D = description only)
Name                    F/C/D   Type of approach
Xing et al. 2006        D       Maximum likelihood
Partek                  C       Maximum likelihood
Li et al. 2010          D       Maximum likelihood
Avadis                  C       Maximum likelihood
IsoEM                   F       Maximum likelihood
MISO                    F       Maximum likelihood (MCMC)
Cufflinks               F       Maximum likelihood
rQuant                  F       Least squares (quadratic programming)
Rpkmforgenes.py         F       Least squares
Howard and Heber 2010   D       Least squares
FluxCapacitor           F       Linear programming
CLC Bio                 C       ?
NSMAP                   F       Nonnegative Sparse Maximum A Posteriori
ALEXA-SEQ               F       Use only reads that are compatible with a single isoform
NEUMA                   D       Normalization by Expected Uniquely Mappable Area
36. Some remarks on isoform quantification
- It is necessary for correct gene-level quantification as well, because straight read counting methods can never be fully correct (from the 2012 CuffDiff2 paper)
- Xing et al. (2006) gave the basic idea for EM-based isoform quantification, which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles” to
- It is actually pretty hard to do isoform quantification well, because there can be a lot of possible isoforms and not enough sequence coverage to estimate them
37. Basic idea of the EM approach
We have a set of reads mapping to some locus:
- Some fit one specific isoform
- Some fit several isoforms
If we knew the isoforms’ expression levels, we could distribute the reads proportionally to those. But we don’t!
On the other hand, if we knew the probability of each read to match each isoform, we could estimate the isoforms’ expression pretty well. But we don’t know that either.
So … start with a guess and iterate!
- Assign reads to isoforms according to some initial guess
- Re-estimate isoform expression levels
- Repeat until convergence!
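The guess-and-iterate loop can be sketched in a few lines. This is a deliberately stripped-down version that ignores isoform length, fragment bias and mapping quality, so it should not be read as what Cufflinks, MISO or IsoEM actually implement. The function name and the toy read compatibilities are assumptions made for illustration.

```python
def em_isoform_abundance(compat, n_isoforms, n_iter=200):
    """Estimate relative isoform abundances with a bare-bones EM loop.

    compat: one tuple per read, listing the isoform indices it is compatible with.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # initial guess: equal abundance
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms,
        # proportionally to the current abundance estimates.
        for isoforms in compat:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                expected[i] += theta[i] / total
        # M-step: re-estimate abundances from the expected read assignments.
        theta = [e / len(compat) for e in expected]
    return theta


# Toy locus with two isoforms: 60 reads fit only isoform 0,
# 20 reads fit only isoform 1, and 20 reads fit both.
compat = [(0,)] * 60 + [(1,)] * 20 + [(0, 1)] * 20
theta = em_isoform_abundance(compat, n_isoforms=2)
print(theta)
```

In this toy case the ambiguous reads end up split 3:1, the same ratio as the uniquely assigned reads.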
38. Gene fusion detection with RNA-seq
Beyond isoforms: detect pieces of different genes that have been fused.
Look for reads that map in “wrong” ways.
Wang et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs044
39. Some further comments on microarrays and RNA-seq
- Microarrays are still cheaper and faster.
- You may be able to run more replicates, which is important for statistical power.
- RNA-seq has a wider measurement range.
- Low expressed transcripts:
  - Microarrays have high background signal -> poor measurement
  - RNA-seq can measure well if you sequence very deeply
- Medium expressed transcripts:
  - Microarrays measure well
  - RNA-seq measures well if sequenced relatively deeply
- High expressed transcripts:
  - Microarrays measure poorly because of saturation
  - RNA-seq measures well
- Less is understood about how to pre-process and normalize RNA-seq data.
- One interesting aspect of RNA-seq: you can continue to sequence a sample more to obtain better gene expression estimates.
40. Analysis
- Pre-processing and normalization
- Visualization
- Differential gene expression analysis
- (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)
41. Pre-processing
Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
44. Pre-processing
Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
- To correct for batch effects
  - Different labs
  - Different preparation times
  - Etc.
- To correct for intrinsic technical biases in the technologies
- To make the expression value distributions conform to some assumptions in order to perform statistical tests
45. RNA-seq pre-processing
For RNA-seq data, it is still less understood than for microarrays how one should pre-process and normalize the data.
Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data).
46. R and Bioconductor
Very helpful for (e.g.) microarray and RNA-seq differential expression analysis.
Microarray:
- affy, lumi (read raw microarray signal files & preprocess)
- limma (differential expression analysis with complex designs)
RNA-seq:
- DESeq, edgeR, baySeq (differential expression analysis based on count data)
- SAMSeq (nonparametric differential expression analysis)
47. Variance stabilization
- Raw data (could be microarray signal or RNA-seq counts): higher value -> higher variability (noise)
- Log transform: lower value -> higher variability. Too aggressive.
- Variance stabilizing transform, e.g. voom() in the limma package
[Figure. Source: http://bridgecrest.blogspot.se/2011_09_01_archive.html]
48. Quantifying expression with RNA-seq
If you want to compare RNA-seq counts between different genes and/or samples, consider:
- Longer genes/transcripts are expected to generate more reads
- The more you sequence, the more reads you get from each gene
Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:
$$\mathrm{RPKM} = \frac{X_t / (l_{\mathrm{eff},t} / 10^3)}{N_{\mathrm{lib}} / 10^6} = \frac{10^9 \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}$$
where X_t is the number of reads mapped to transcript/gene/… t, N_lib is the number of mapped reads in the library, and l_eff,t is the effective length of transcript/gene/… t.
FPKM is a paired-end version of this.
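A direct transcription of the RPKM definition into code, as a small sketch. The function name and the library size of 20 million mapped reads are assumptions chosen for illustration.

```python
def rpkm(count, effective_length_bp, n_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * count / (n_mapped_reads * effective_length_bp)


# Hypothetical library with 20 million mapped reads.
n_lib = 20_000_000
long_transcript = rpkm(count=1000, effective_length_bp=30_000, n_mapped_reads=n_lib)
short_transcript = rpkm(count=10, effective_length_bp=300, n_mapped_reads=n_lib)
print(long_transcript, short_transcript)
```

Both transcripts get the same RPKM even though one value rests on 1000 reads and the other on only 10, which is worth keeping in mind when interpreting RPKM values.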
49. Alternatives
TPM: “transcripts per million”
A slightly modified RPKM measure that accounts for differences in gene length distribution in the transcript population.
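One common way to compute TPM, shown here as a sketch: first divide each count by the transcript length to get a per-base rate, then rescale the rates so that they sum to one million within the sample. The counts and lengths below are invented, and effective-length corrections are ignored.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical sample with three transcripts.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1000]
values = tpm(counts, lengths_bp)
print(values)
```

Because the values always sum to one million, TPM can be read as a proportion of the transcript population within a sample.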
50. Alternatives
TMM: “trimmed mean of M values”
Attempts to correct for differences in RNA composition between samples.
E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
[Figure: RNA population 1 vs. RNA population 2. Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2]
Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
51. Across-sample comparability
[Figure. Source: Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046]
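The composition effect can be reproduced with a toy calculation. The sketch below is a heavily simplified stand-in for TMM: it takes per-gene log ratios of read proportions (M values), trims the extremes and averages the rest. The real method in edgeR also trims on overall abundance and uses precision weights, which are left out here; the counts and the trim fraction are assumptions.

```python
import math


def trimmed_mean_factor(sample, reference, trim=0.2):
    """Scaling factor from a trimmed mean of per-gene log2 proportion ratios."""
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with zero counts give undefined log ratios
    )
    k = int(len(m_values) * trim)  # number of genes trimmed from each end
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts at equal depth (1000 reads each). Genes 1-4 are truly
# expressed at the same level in both populations; gene 5 dominates population 1.
population_1 = [100, 100, 100, 100, 600]
population_2 = [200, 200, 200, 200, 200]
factor = trimmed_mean_factor(population_1, population_2)

# Dividing the proportions by the factor brings genes 1-4 back in line.
corrected = [c / (sum(population_1) * factor) for c in population_1]
print(factor, corrected[:4])
```

Here the factor comes out as 0.5: genes 1 to 4 look half as abundant in population 1 only because gene 5 uses up most of the reads.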
54. Practical issues with normalization methods
- limma / voom can give negative values
- TMM cannot be done on a single sample
55. RNA-seq pre-processing
In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.
Normalized values like RPKM are usually only used for reporting expression values, not testing for differential expression. Why?
56. Count nature of RNA-seq data
These methods want to use the added statistical power provided by the count nature of RNA-seq data.
Simplified toy example:
- Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.
- Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in sample B.
Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM in sample A is the same in both scenarios, and likewise for sample B.
In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates of course!). By analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials.
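The coin-flip analogy can be checked numerically. The sketch below computes the exact probability of seeing at least a given number of heads with a fair coin, using integer arithmetic so that nothing underflows. This is only the analogy, not the test that DESeq or edgeR perform.

```python
from math import comb


def fair_coin_upper_tail(heads, flips):
    """Exact P(at least `heads` heads in `flips` flips of a fair coin)."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


p_small = fair_coin_upper_tail(7, 10)
p_large = fair_coin_upper_tail(700, 1000)
print(p_small, p_large)
```

Seven or more heads out of ten happens about 17% of the time by chance, while 700 or more out of 1000 is vanishingly unlikely, even though the proportion of heads is the same.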
57. Visualization
Can be useful for “sanity checking”, outlier detection and exploratory analysis in general.
Examples of useful visualizations:
- Heat maps
- PCA/MDS/NMF
- Box plots, violin plots etc.
58. Box plots
Useful for comparing groups.
Adding the actual data points is optional but can be interesting.
59. Sample correlation heat maps
- Heat maps are ubiquitous in transcriptomics
- Correlations between samples, hierarchical clustering
- Used for “sanity checks”, outlier detection
[Figure: two example heat maps, labelled “Two tissues” and “Batch effects”]
60. Gene / sample heat maps
With a smaller collection of genes, one sometimes looks at gene/sample heat maps.
61. PCA plots
Another way to see how samples cluster.
62. PCA plots
Nice thing with PCA: you can also see how much each gene contributes to each principal component -> a kind of feature selection.
63. Alternatives to PCA
NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA).
“A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580
64. PCA plot of human tissue RNA-seq
[Figure legend: red: GTEx; green: Body Map; black: Human Protein Atlas]
65. # of genes taking up X% of sequences
[Figure: GTEx, RPKM; labelled genes HBA1, HBB, HBA2]
66. # of genes taking up X% of sequences
[Figure: GTEx]
67. # of genes taking up X% of sequences
[Figure: Wang/Sandberg]
68. Differential expression analysis
Many tools available!
Easily the most common type of analysis, even though it is understood that gene expression levels are not independent of each other, and should in principle be considered together.
However, since the number of samples is typically << the number of measured genes, a full model is usually not feasible to construct in practice. Some sort of feature selection is needed.
72. Differential expression analysis
One would simply like to do a t-test or something like that for each gene, but…
- Assumes normal distribution & no mean-variance dependence
- Hard to estimate variance from few samples
- Multiple testing issue
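For the multiple testing issue, a widely used correction is the Benjamini-Hochberg false discovery rate procedure. The sketch below is a plain implementation of that procedure on made-up p values; it is offered as an illustration of the idea rather than as the exact code path used by any particular package.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p values.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
print(q_values)
```

With tens of thousands of genes, the adjustment matters much more than in this four-gene toy case.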
73. Parametric vs. non-parametric methods
It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics -> methods like SAM (Significance Analysis of Microarrays) or SAMSeq (equivalent for RNA-seq data).
However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates.
My rule of thumb:
- Many replicates (~ >10) in each group -> use SAM(Seq)
- Otherwise use DESeq or another parametric method
Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five: http://seqanswers.com/forums/showpost.php?p=74264&postcount=3
77. Standard DE methods
- limma (microarrays, RNA-seq)
- edgeR, DESeq (RNA-seq)
Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
Multiple testing issue: All of these packages report false discovery rate (corrected p values).
Variance estimation issue: These packages (in slightly different ways) “borrow” information across genes to get a better variance estimate. One says that the estimates “shrink” from gene-specific estimates towards a common mean value.
80. Complex designs
The simplest case is when you just want to compare two groups against each other.
But what if you have several factors that you want to control for?
E.g. you have taken tumor samples at two different time points from six patients, cultured the samples and treated them with two different anticancer drugs and a mock control treatment -> 2x6x3 = 36 samples.
Now you want to assess the differential expression in response to one of the anticancer drugs, drug X. You could just compare all “drug X” samples to all control samples, but the inter-subject variability might be larger than the specific drug effect.
Enter limma / DESeq / edgeR, which can work with factorial designs (SAMSeq cannot, which is another reason one might not want to use it).
81. Limma and factorial designs
limma stands for “linear models for microarray data”.
Essentially, the expression of each gene is modeled with a linear relation.
The design matrix describes all the conditions, e.g. treatment, patient, time etc.
y = a + b*treatment + c*time + d*patient + e
where a is the baseline/average and e is the error term/noise.
Source: http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf
82. Recent DE software comparison
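To show what a design matrix for such a model could look like, here is a sketch that builds one for a 3 treatments x 2 time points x 6 patients layout using treatment (dummy) coding. The level names (“mock”, “drugX”, “drugY”, “t1”, “P1” …) are hypothetical, and in practice one would let R’s model.matrix() build this for limma rather than code it by hand.

```python
from itertools import product

# Hypothetical factor levels; the first level of each factor is the baseline.
treatments = ["mock", "drugX", "drugY"]
times = ["t1", "t2"]
patients = [f"P{i}" for i in range(1, 7)]

columns = (
    ["intercept"]
    + [f"treatment_{t}" for t in treatments[1:]]
    + [f"time_{t}" for t in times[1:]]
    + [f"patient_{p}" for p in patients[1:]]
)


def design_row(treatment, time, patient):
    """One row of the design matrix: intercept plus 0/1 indicator columns."""
    row = [1]
    row += [int(treatment == t) for t in treatments[1:]]
    row += [int(time == t) for t in times[1:]]
    row += [int(patient == p) for p in patients[1:]]
    return row


samples = list(product(treatments, times, patients))
design = [design_row(*sample) for sample in samples]

print(columns)
print(samples[0], design[0])
print(len(design), "samples x", len(columns), "coefficients")
```

Fitting the model per gene then estimates the drug X coefficient while the patient columns absorb the inter-subject variability.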
83. Take-away messages from DE tool comparison
- CuffDiff2, which should theoretically be better, seems to work worse, probably due to the increased “statistical burden” from isoform expression estimation
- The HTSeq quantification, which is theoretically “wrong”, seems to give good results with downstream software
- It is practically always better to sequence more biological replicates than to sequence the same samples deeper
Omitted from this comparison:
- gains from ability to do complex designs
- non-parametric methods
84. The end Contact me at firstname.lastname@example.org if you have any questions