Outline
1. Why single cell?
2. Current platforms for scRNA-seq
3. How much does it cost?
4. Experimental design considerations
Prep work
Analysis
1. Transcript counting
2. QC and Filtering
3. Normalization
4. Generic exploration: Dimensionality reduction
5. Clustering
6. Differential expression / marker identification
7. Exploring a phenotypic continuum: Pseudotime analysis
8. Inferring cell dynamics: Predicting future states

Accuracy—average measurements from complex distributions are meaningless
X Mean: 54.26
Y Mean: 47.83
X SD: 16.76
Y SD: 26.93
Corr: -0.06
Matejka and Fitzmaurice (Autodesk Research, Toronto)
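
A minimal sketch of the same point in gene-expression terms. The two populations below are made up for illustration (they are not the data from the figure): both have identical mean and SD, yet in the bimodal one no cell actually sits at the average.

```python
import statistics

# Hypothetical expression values for one gene in two populations of 8 cells.
bimodal = [3, 7, 3, 7, 3, 7, 3, 7]    # two distinct cell states
unimodal = [1, 5, 5, 5, 5, 5, 5, 9]   # most cells near the centre

for name, values in [("bimodal", bimodal), ("unimodal", unimodal)]:
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD
    at_mean = sum(1 for v in values if v == mean)
    print(f"{name}: mean={mean}, sd={sd}, cells at the mean={at_mean}")
```

A bulk measurement would report the same summary for both populations; only per-cell measurements tell them apart.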

Biological systems are complex – Tissue Heterogeneity
~250,000 single cells from >40 mouse tissues
Han et al., Cell, 2018
[Figure: tSNE plot; axes: tSNE 1, tSNE 2]
“The Mouse Cell Atlas”
(Colours don’t match but it’s the same data)

Biological systems are complex – Tissue Heterogeneity
The Tabula Muris Consortium, BioRxiv, 2017

Biological systems are complex – Cellular Heterogeneity
Cell Cycle
Apoptosis/Senescence
Xu et al., Blood, 2006
The Human Cell Atlas, eLife, 2017
Phenotypic spectrum (eg Th17 T-cells)
Gaublomme et al., Cell, 2015
Gene-gene relationships
Azizi et al., BioRxiv, 2018

Exponential increase in throughput
Svensson et al., Nature Protocols, 2018

Main platforms – Plate methods (eg. SMART-Seq)
Pros
Compatible with full-length sequencing
Completely customizable
Better control over cell throughput
Cons
Dramatically lower throughput compared to other methods
High per-cell cost

Main platforms – Droplet methods (eg. 10x Genomics Chromium, Drop-Seq)
Pros
Very high throughput
Up to 8 unique samples per run
System cost relatively low
Cons
Limited customizability
Zheng et al., Nature Comm, 2017

Approximate cost breakdown for 10x Genomics experiment (costs in $CAD)
Consumables
10x Reagents (Beads, enzyme, etc): $2900/sample (max throughput = ~10,000 cells/sample)
Microfluidics chip: $320 for an 8-sample chip (consumable; no re-use)
Sequencing options
HiSeq4000 PE100 Lane (~350,000,000 reads): $2550
NextSeq500 150-cycle (~400,000,000 reads): $3700
NextSeq500 75-cycle (~400,000,000 reads): $1800
Enough for about 8000 cells*

How deep to sequence?
# of genes detected saturates around 1 million reads per cell (Ziegenhain et al., Molecular Cell, 2017)
But we don’t necessarily need to detect everything in every cell!
General Rule: “…when the number of genes required to answer a given biological question is small, then greater transcriptome coverage is more important than analyzing large number of cells.” Torre et al., Cell Systems, 2018
My current targets
Identifying different cell types present: 25-50k reads per cell
Identifying transcriptional dynamics within a population: 50-100k reads per cell
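
As a rough sketch, the depth targets translate into cells per sequencing run by simple division. The read yields are the approximate figures from the cost breakdown, the depths are the upper ends of the targets above, and the calculation assumes every read is usable (in practice some are lost to low-quality or unassigned barcodes).

```python
def cells_supported(total_reads: int, reads_per_cell: int) -> int:
    """Number of cells a run can cover at a target depth (integer division)."""
    return total_reads // reads_per_cell

# Approximate read yields per run.
runs = {"HiSeq4000 PE100 lane": 350_000_000, "NextSeq500 run": 400_000_000}
# Upper end of each depth target (reads per cell).
targets = {"cell type identification": 50_000, "transcriptional dynamics": 100_000}

for run, reads in runs.items():
    for goal, depth in targets.items():
        print(f"{run}, {goal} at {depth:,} reads/cell: "
              f"{cells_supported(reads, depth):,} cells")
```

At 50k reads per cell, a ~400,000,000-read run covers 8,000 cells.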

How many cells?
Two main things to consider
1) How many cell types are there?
2) What is the proportion of the rarest cell type you’re interested in?
10x Genomics currently allows for each run to yield anywhere from ~500-10,000 cells
Satija Lab “How Many Cells” power calculator: https://satijalab.org/howmanycells
Caution: Do you actually know how many cell types are there? What about cell states?
The current trend in the field seems more focused on increasing cell #
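
The two considerations above can be turned into a back-of-the-envelope estimate. This is a simple binomial sketch for a single rare cell type, not the implementation behind the Satija Lab calculator; the function names, the 1% frequency, and the 20-cell minimum are all hypothetical, and it assumes cells are captured independently and in proportion to their abundance.

```python
from math import comb

def prob_at_least(n_cells: int, frequency: float, min_cells: int) -> float:
    """P(X >= min_cells) for X ~ Binomial(n_cells, frequency)."""
    p_fewer = sum(
        comb(n_cells, k) * frequency**k * (1 - frequency) ** (n_cells - k)
        for k in range(min_cells)
    )
    return 1.0 - p_fewer

def cells_needed(frequency, min_cells, confidence=0.95, step=100, limit=100_000):
    """Smallest multiple of `step` cells that reaches the requested confidence.

    Fine for small `min_cells`; very large values would need a log-space sum.
    """
    for n_cells in range(step, limit + 1, step):
        if prob_at_least(n_cells, frequency, min_cells) >= confidence:
            return n_cells
    return None

# Hypothetical question: a cell type at 1% of the tissue, want at least 20 cells.
print(cells_needed(frequency=0.01, min_cells=20))
```

The estimate is only as good as the assumed frequency, which is the caution raised above: if you do not know which cell types or states exist, you cannot know their proportions either.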

What about replicates?
It kind of depends…
Replicates would always be good, but each replicate adds a significant cost right now, so we have to ask if it’s worth it
What biological variability are you trying to capture with replicates?
• Biological variability between cells is captured in scRNA-seq (many measurements in one “replicate”)
• Measurements are not confounded by differences in population composition
• Genetic differences could contribute to important variation (eg. Comparing two tumour samples)
What technical variability are you trying to capture with replicates?
• Batch effects are always possible, but not all experiments involve batching
• The use of UMIs has been said to reduce technical variability by ~50% (due to mitigation of PCR artifacts)
• Culture conditions (eg. Cell confluence, user-supplied growth factors, etc) could lead to variation
• Tissue dissociation protocols can lead to variable purification efficiencies
Technical replicate of PBMCs has near-perfect overlap
Cancer cells dramatically different between patients

What are your questions?
Common questions for scRNA-seq experiments
What cell types exist in this tissue?
Can we identify new cell types/subtypes?
What is the proportion of each cell type in this tissue?
What are the gene expression patterns that define each cell type?
How do cell types relate to each other in terms of their expression?
Can we reconstruct transcriptional dynamics associated with differentiation? (More on this later)
Can we construct gene regulatory networks?

QC, Filtering, and Normalization

Transcript Counting
CellRanger (10x Genomics)
Fastq file
Aligns each read to transcriptome
Removes duplicate UMIs
Sorts reads by cell (10x) barcode
Counts reads aligning to each gene for each cell
Gene-barcode matrix
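
A toy sketch of the counting logic, assuming reads have already been aligned and tagged. It is not how CellRanger is implemented (real pipelines also correct sequencing errors in barcodes and UMIs); the barcodes, UMIs, and gene names are made up.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene).
reads = [
    ("AAAC", "TTGCA", "GeneA"),
    ("AAAC", "TTGCA", "GeneA"),  # PCR duplicate: same barcode, UMI, and gene
    ("AAAC", "GGATC", "GeneA"),
    ("AAAC", "CCTAG", "GeneB"),
    ("TTTG", "TTGCA", "GeneA"),  # same UMI in a different cell still counts
]

def count_umis(reads):
    """Collapse duplicate UMIs, then count unique UMIs per gene per cell."""
    unique_molecules = set(reads)  # removes exact (barcode, UMI, gene) repeats
    matrix = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, gene in unique_molecules:
        matrix[gene][barcode] += 1
    return {gene: dict(cells) for gene, cells in matrix.items()}

gene_barcode_matrix = count_umis(reads)
print(gene_barcode_matrix)
```

Five reads collapse to four molecules: GeneA gets 2 UMIs in cell AAAC and 1 in cell TTTG, and GeneB gets 1 in cell AAAC.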

QC and Filtering
Common parameters worth exploring
UMI distribution
Number of genes detected
Percent of UMIs aligning to mitochondrial genes
Goal: Remove low-quality cells and potential doublets
Oddly-high nUMI/nGene could be doublets (~90 doublets per 1000 cells)
A high percentage of mitochondrial UMIs is associated with cell death (loss of membrane integrity > cytoplasmic loss > enrichment of mitochondrial content)
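
A small sketch of how these three QC metrics can be computed and used as filters. The count matrix, gene names, and thresholds are invented for illustration; real cutoffs should come from looking at the distributions in your own data.

```python
import numpy as np

# Hypothetical UMI count matrix: rows = genes, columns = cells.
genes = np.array(["Actb", "Gapdh", "Cd3e", "mt-Co1", "mt-Nd1"])
counts = np.array([
    [120,  4, 300, 110],
    [ 80,  2, 210,  90],
    [ 40,  0, 150,  35],
    [ 10, 30,  25,   8],
    [  5, 24,  15,   7],
])

n_umi = counts.sum(axis=0)                   # total UMIs per cell
n_gene = (counts > 0).sum(axis=0)            # genes detected per cell
is_mito = np.char.startswith(genes, "mt-")   # assumes mouse-style mito gene names
pct_mito = 100 * counts[is_mito].sum(axis=0) / n_umi

# Illustrative thresholds only.
keep = (n_umi >= 100) & (n_umi <= 600) & (pct_mito < 20)
print(n_umi, n_gene, np.round(pct_mito, 1), keep)
```

Here the second cell is dropped for low nUMI and high percent mito (a likely dying cell), and the third for unusually high nUMI (a possible doublet).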

Normalization (The inelegant way)
Goal: Make profiles of each cell comparable
Simplest Approach: Scaling library size to some arbitrary value (eg. 10,000)
Before Normalization
Cell 1 (5,000 UMI total): Gene A: 10 UMIs
Cell 2 (20,000 UMI total): Gene A: 40 UMIs
After Normalization
Cell 1: Gene A: (10 UMI / 5,000 UMI) x 10,000 UMI = 20 UMIs
Cell 2: Gene A: (40 UMI / 20,000 UMI) x 10,000 UMI = 20 UMIs
But this alone isn’t sufficient to remove the effect of seq depth on the structure of the data
Not uncommon to “regress out” the effect of nUMI and percent mito on each cell
[Figure: After regressing out nUMI and percent mito]
Normalization (The inelegant way)
In some systems, cell cycle stage can confound biological variation of interest
Cell cycle score can also be regressed out if you want to “look past” the effect of cell cycle

Normalization (The more elegant ways)
Scran – DESeq-style scaling factors modified to use w/ scRNA-seq data
SCNorm – Scales expression on a gene-by-gene basis so that each gene’s expression does not correlate w/ sequencing depth
scVI & DCA – Deep autoencoders that use neural networks to denoise data

Data exploration
The fun part! What patterns are there? Can we start generating some hypotheses?

Dimensionality Reduction
Goal: To visualize the structure of our data
Principal Component Analysis (PCA)
PCA effectively defines new axes through the data that capture the highest amount of variation possible
Small # of PCs capture large amount of total variation
Problem
Two new axes can only capture so much information, so very different profiles can end up superimposed on 2D plots
Ringnér, Nature Biotech, 2008
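
A minimal PCA sketch using SVD on simulated data, to show where “variation captured per PC” comes from. The data are random numbers with one hidden factor built in; nothing here is taken from a real dataset.

```python
import numpy as np

def pca(data, n_components=2):
    """PCA via SVD. `data` is cells x genes; returns scores and variance ratios."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # cell coordinates on the PCs
    variance_ratio = s**2 / np.sum(s**2)              # fraction of variance per PC
    return scores, variance_ratio[:n_components]

# Hypothetical data: 200 cells, 50 genes, driven by one hidden factor plus noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
loadings = rng.normal(size=(1, 50))
data = factor @ loadings + 0.3 * rng.normal(size=(200, 50))

scores, variance_ratio = pca(data)
print(scores.shape, np.round(variance_ratio, 2))
```

Because a single factor drives this toy dataset, PC1 captures most of the variance. Real tissues have many sources of variation, which is why two PCs are rarely enough.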

t-distributed stochastic neighbor embedding (tSNE)
tSNE focuses on keeping similar cells close to each other when moving from high dimensional space to 2d
Han et al., Cell, 2018
“The Mouse Cell Atlas”
Problems
-Very dependent on hyperparameter selection
-Cluster size not interpretable
-Cluster distances not interpretable
-Random noise can look non-random
-Topology can be distorted
https://distill.pub/2016/misread-tsne/

Diffusion maps
Dimensionality reduction based on diffusion distances through the data. Can identify non-linear, continuous structures
Eigenvectors = Diffusion components
Diffusion map
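
A bare-bones diffusion map sketch, assuming a fixed-bandwidth Gaussian kernel (`sigma` is a hypothetical parameter; published implementations add refinements that are left out here). The toy “cells” lie along a curved path, and the first diffusion component recovers their order along it.

```python
import numpy as np

def diffusion_map(data, sigma=1.0, n_components=2):
    """Basic diffusion map: Gaussian kernel -> Markov matrix -> eigenvectors."""
    sq_dists = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq_dists / (2 * sigma**2))
    degree = kernel.sum(axis=1)
    # Symmetric normalisation shares eigenvalues with the row-stochastic
    # transition matrix but lets us use the stable symmetric solver.
    symmetric = kernel / np.sqrt(np.outer(degree, degree))
    eigenvalues, eigenvectors = np.linalg.eigh(symmetric)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Convert back to right eigenvectors of the transition matrix.
    right = eigenvectors / np.sqrt(degree)[:, None]
    # Skip the first (trivial, constant) eigenvector with eigenvalue 1.
    return eigenvalues[1 : n_components + 1], right[:, 1 : n_components + 1]

# Hypothetical continuum: 60 cells spread along a half circle.
position = np.linspace(0.0, np.pi, 60)
cells = np.column_stack([np.cos(position), np.sin(position)])

eigenvalues, components = diffusion_map(cells, sigma=0.2)
dc1 = components[:, 0]
print(np.round(abs(np.corrcoef(dc1, position)[0, 1]), 2))
```

The sign of an eigenvector is arbitrary, so the ordering may come out reversed; only the relative order of cells along the component is meaningful.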

Force-directed graphs
Visualize cells based on nearest neighbor structures
Wagner et al., Science, 2018
Repulsion between nodes
Attractive forces added to edges connecting nodes (spring functions)

Clustering
Goal: Assign cells to groups of similar cells
Louvain clustering
Takes a nearest neighbor graph and identifies “communities” of nodes (ie. clusters of cells) by optimizing a modularity score, defined by the density of edges within the groupings. A tight cluster has many edges connecting the cells of the cluster but relatively few projecting to nodes outside it, and therefore a high edge density.
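
To make the modularity score concrete, here is a small sketch that computes it for a toy graph. It only evaluates the score for a given partition; it does not implement the Louvain search that optimizes it. The graph and cluster labels are made up.

```python
def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph partition.

    edges: list of (node, node) pairs; community: dict node -> cluster label.
    """
    m = len(edges)
    internal = {}      # edges with both ends in the same community
    degree_sum = {}    # total degree of the nodes in each community
    for a, b in edges:
        for node in (a, b):
            label = community[node]
            degree_sum[label] = degree_sum.get(label, 0) + 1
        if community[a] == community[b]:
            internal[community[a]] = internal.get(community[a], 0) + 1
    return sum(
        internal.get(label, 0) / m - (degree_sum[label] / (2 * m)) ** 2
        for label in degree_sum
    )

# Hypothetical nearest-neighbour graph: two triangles of cells joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_cluster = {node: "A" for node in range(6)}

print(round(modularity(edges, two_clusters), 3))  # 0.357
print(round(modularity(edges, one_cluster), 3))   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of partition the optimization is looking for.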

Differential Expression / Marker Identification
Goal: Find out what’s different between clusters
Lots of options—some complex, some simple
Pairwise comparisons (eg. Cluster A vs. Cluster B)
Marker identification (eg. Cluster A vs. Combined Cluster B,C,D)
Often many genes identified as “significant” due to large number of cells per cluster. May need to apply effect size (eg. Fold change) cutoffs to filter down to a smaller list of things to follow up on
Soneson & Robinson, Nature Methods, 2018
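
A simulated illustration of why an effect size cutoff helps. A simple large-sample z-test stands in here for whichever DE test you actually use; the gene names, Poisson means, and cutoffs are all invented.

```python
import math
import numpy as np

def z_test_p_value(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))

def log2_fold_change(a, b, pseudocount=1e-9):
    """log2 ratio of mean expression, with a tiny pseudocount to avoid log(0)."""
    return math.log2((a.mean() + pseudocount) / (b.mean() + pseudocount))

rng = np.random.default_rng(1)
n_cells = 20_000  # cells per cluster
# Hypothetical UMI counts for two genes in cluster A vs cluster B.
genes = {
    "subtle_gene": (rng.poisson(2.2, n_cells), rng.poisson(2.0, n_cells)),
    "marker_gene": (rng.poisson(8.0, n_cells), rng.poisson(2.0, n_cells)),
}

results = {}
for name, (cluster_a, cluster_b) in genes.items():
    p = z_test_p_value(cluster_a, cluster_b)
    lfc = log2_fold_change(cluster_a, cluster_b)
    keep = p < 0.05 and abs(lfc) >= 1.0  # illustrative cutoffs
    results[name] = (p, lfc, keep)
    print(f"{name}: p={p:.2e}, log2FC={lfc:.2f}, follow up={keep}")
```

With 20,000 cells per cluster, even a ~10% difference in mean expression is highly “significant”; only the fold-change cutoff separates it from the gene with a four-fold difference.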

Constructing transcriptional trajectories with pseudotime analysis

Cells are heterogeneous and their responses are not synchronous
Morphology in culture plate
Continuous trajectory from snapshot data

Evaluation of pseudotime algorithms
Saelens et al., BioRxiv, 2018

Analysis of continuous transcriptional dynamics
Differential expression model for branched trajectories
$Y_i = \mathrm{Pseudotime}_i + \mathrm{Branch}_i + (\mathrm{Pseudotime}_i \times \mathrm{Branch}_i)$
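
A minimal sketch of fitting this kind of model for one gene with ordinary least squares. The intercept term and the Gaussian least-squares fit are assumptions made for the example (dedicated tools typically fit generalized models instead), and the gene’s values are simulated without noise.

```python
import numpy as np

def fit_branch_model(expression, pseudotime, branch):
    """Least-squares fit of Y = b0 + b1*pseudotime + b2*branch + b3*pseudotime*branch."""
    design = np.column_stack([
        np.ones_like(pseudotime),
        pseudotime,
        branch,
        pseudotime * branch,   # interaction: branch-specific slope
    ])
    coefficients, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coefficients

# Hypothetical gene: flat in branch 0, increasing along pseudotime in branch 1.
pseudotime = np.tile(np.linspace(0.0, 1.0, 20), 2)
branch = np.repeat([0.0, 1.0], 20)
expression = 1.0 + 0.0 * pseudotime + 0.5 * branch + 2.0 * pseudotime * branch

intercept, b_time, b_branch, b_interaction = fit_branch_model(expression, pseudotime, branch)
print(np.round([intercept, b_time, b_branch, b_interaction], 3))
```

A large interaction coefficient is the signature of a branch-dependent gene: its expression changes along pseudotime in one branch but not the other.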

Predicting a cell’s future state
La Manno et al., BioRxiv, 2017
Glutamatergic neuronal development
(Forebrain from 10wk human embryo)
Neuroblast differentiation

Resources
10x Genomics Datasets: https://support.10xgenomics.com/single-cell-gene-expression/datasets
Mouse Cell Atlas Dataset (242,000 cells): https://satijalab.org/seurat/mca.html
Tabula Muris Dataset (50k mouse cells): https://figshare.com/articles/Single-cell_RNA-seq_data_from_microfluidic_emulsion_v2_/5968960
scRNA-Seq course: https://hemberg-lab.github.io/scRNA.seq.course/index.html
Database of single-cell tools (all 220 of them): https://www.scrna-tools.org/

scRNA-Seq Workshop Presentation - Stem Cell Network 2018
