Single-Cell Transcriptome Analysis of Pluripotent Stem Cells

Nacho Caballero
Center for Regenerative Medicine
Boston University
Jun 12, 2017
From raw data to insights

Analysis pipeline
Raw data -> Initial QC -> Alignment and Quantification -> Outlier analysis -> Gene selection and clustering -> Insights

Barcoded sequencing files -> Demultiplex -> One pair of sequencing files per cell

@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:T
CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTG
+
AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE<
…
Millions of reads

Metadata file (short, simple names)
Cell_id  Condition1  Condition2
Cell_01  BU3         red
Cell_02  BU3         green
Cell_03  C17         red
Cell_04  C17         green
Cell_05  BU3         red
Cell_06  BU3         green
…
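
The read shown above follows the four-line FASTQ layout (header, sequence, separator, quality string). A minimal sketch of how such a record can be parsed is given below. The function name `parse_fastq` is hypothetical, and the Phred+33 quality encoding is an assumption that holds for recent Illumina output but is not stated on the slide.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, one ASCII character per base.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# The single record shown on the slide.
example = [
    "@NB500996:64:HNM72BGX2:3:12510:12240:9366 2:N:0:T",
    "CTACTGTCTAGAGCTTGTCTCAATGGATCTAGAACTTCATCGCCCTCTG",
    "+",
    "AAAAAEEEE<E/EEEEEEEEE6EE/6AEEE//E/EEE/AEA/EAEEEE<",
]
records = list(parse_fastq(example))
read_id, sequence, scores = records[0]
print(read_id, len(sequence), round(sum(scores) / len(scores), 1))
```

This version loads every line into memory, which is fine for a demonstration but not for files with millions of reads.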


Read length is often inversely correlated with per-base sequencing quality
[Figure: average sequence quality vs. position in read; panels: good cDNA quality, average quality, bad quality]
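
The quality-by-position curves can be sketched as a per-position average over the quality strings of many reads. The function name and the three short quality strings below are hypothetical, and Phred+33 encoding is assumed.

```python
def mean_quality_by_position(quality_strings, offset=33):
    """Average Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, ch in enumerate(quality):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(ch) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings; quality drops toward the end of the read.
profile = mean_quality_by_position(["EEEA", "EEA/", "EE"])
print(profile)
```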

More reads are generally better than longer reads
(safe target: 200K reads per cell, 150 bp long)
[Figure: number of reads per cell (0, 1K, 10K, 1M) across 400 cells]

The Fluidigm protocol makes it extremely easy to lose entire rows or columns
[Figure: cells arranged by rows and columns]

We quantify the gene expression in a cell by counting how many
reads align to each gene

[Figure: reads such as AGGCAGAGGGGCGAGATGCA… aligned to the SFTPC gene]
1358 reads aligned to the SFTPC gene in this cell
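
Counting reads per gene can be sketched as a simple tally once each read has been assigned to a gene. The input format here (one gene name per read, or `None` when a read cannot be assigned to a single gene) and the example values are hypothetical; real pipelines derive these assignments from an alignment file and a gene annotation.

```python
from collections import Counter


def count_reads_per_gene(assignments):
    """Tally reads per gene, skipping reads without a single-gene assignment."""
    return Counter(gene for gene in assignments if gene is not None)


# Hypothetical assignments for five reads from one cell.
assignments = ["SFTPC", "SFTPC", None, "NKX2-1", "SFTPC"]
gene_counts = count_reads_per_gene(assignments)
print(gene_counts["SFTPC"], gene_counts["NKX2-1"])
```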

Read type                                    Number of reads per cell
Raw                                          333,229
Unaligned                                    81,673
Aligned, but non-uniquely                    28,813
Aligned uniquely, but not to a gene          32,774
Aligned uniquely, but span multiple genes    20,838
Aligned uniquely to a single gene            167,241

40-60% of the raw reads cannot be used to quantify gene expression
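
The arithmetic behind that statement can be checked directly from the table. The sketch below uses only the numbers shown above; fractions are computed against the raw count, since the listed categories do not sum exactly to it.

```python
raw_reads = 333_229
read_types = {
    "Unaligned": 81_673,
    "Aligned, but non-uniquely": 28_813,
    "Aligned uniquely, but not to a gene": 32_774,
    "Aligned uniquely, but span multiple genes": 20_838,
    "Aligned uniquely to a single gene": 167_241,
}

# Only reads aligned uniquely to a single gene contribute to quantification.
usable_fraction = read_types["Aligned uniquely to a single gene"] / raw_reads
unusable_fraction = 1 - usable_fraction

for name, count in read_types.items():
    print(f"{name}: {count / raw_reads:.1%}")
print(f"Unusable for quantification: {unusable_fraction:.1%}")
```

For this cell roughly half of the raw reads are usable, which falls inside the 40-60% range quoted on the slide.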

Filter out cells with fewer than 5K aligned reads
[Figure: number of aligned reads per cell (0, 1K, 10K, 1M); 120 cells]

Filter out cells with a high percentage of mitochondrial gene counts (indicative of a broken cell membrane)
[Figure: % of mitochondrial gene counts per cell (0 to 100%); 48 cells]

Filter out cells with fewer than 2K expressed genes
[Figure: number of expressed genes per cell (0 to 6K); 30 cells]
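
The three outlier filters (aligned reads, mitochondrial percentage, expressed genes) can be sketched as one predicate per cell. The 5K and 2K thresholds come from the slides; the 25% mitochondrial cutoff, the field names, and the example cells are assumptions made for illustration only.

```python
def passes_qc(cell, min_aligned_reads=5_000, min_expressed_genes=2_000,
              max_mito_fraction=0.25):
    """Return True if a cell survives all three outlier filters."""
    total = cell["total_counts"]
    # Treat a cell with no counts as fully mitochondrial so it is rejected.
    mito_fraction = cell["mito_counts"] / total if total else 1.0
    return (cell["aligned_reads"] >= min_aligned_reads
            and cell["expressed_genes"] >= min_expressed_genes
            and mito_fraction <= max_mito_fraction)  # 0.25 is an assumed cutoff


# Hypothetical per-cell summaries.
cells = {
    "Cell_01": {"aligned_reads": 150_000, "expressed_genes": 4_200,
                "mito_counts": 3_000, "total_counts": 120_000},
    "Cell_02": {"aligned_reads": 3_000, "expressed_genes": 2_500,
                "mito_counts": 100, "total_counts": 2_000},
    "Cell_03": {"aligned_reads": 90_000, "expressed_genes": 1_500,
                "mito_counts": 2_000, "total_counts": 70_000},
    "Cell_04": {"aligned_reads": 80_000, "expressed_genes": 3_000,
                "mito_counts": 48_000, "total_counts": 60_000},
}
kept = [name for name, cell in cells.items() if passes_qc(cell)]
print(kept)
```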

Raw count data -> Normalized expression data
Assume that most genes are not differentially expressed
Calculate scaling factors for each cell
Apply the scaling factors and log-transform
Normalization corrects for differences in capture efficiency, sequencing depth and other technical bias
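
One way to turn these steps into code is the median-of-ratios approach, sketched below. This is an assumption: the slides do not name the normalization method, and real single-cell data contain many zeros that usually call for more robust variants. The function names and the two example cells are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios scaling factor per cell; counts[cell][gene]."""
    n_genes = len(counts[0])
    # Use only genes detected in every cell so the geometric mean is defined.
    usable = [g for g in range(n_genes) if all(cell[g] > 0 for cell in counts)]
    if not usable:
        raise ValueError("no gene is detected in every cell")
    log_geo_mean = {
        g: sum(math.log(cell[g]) for cell in counts) / len(counts)
        for g in usable
    }
    # If most genes are unchanged, the median ratio reflects technical scale.
    return [
        math.exp(median(math.log(cell[g]) - log_geo_mean[g] for g in usable))
        for cell in counts
    ]


def normalize(counts):
    """Divide by the per-cell factor, then log-transform with a pseudocount."""
    factors = size_factors(counts)
    return [[math.log2(c / f + 1) for c in cell]
            for cell, f in zip(counts, factors)]


# Hypothetical counts: the second cell was sequenced about twice as deeply.
counts = [[10, 20, 30, 0], [20, 40, 60, 5]]
factors = size_factors(counts)
normalized = normalize(counts)
print(factors)
```

After normalization the first three genes have the same value in both cells, so the depth difference no longer looks like an expression difference.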

[Figure: variance vs. average expression; insets show expression per cell for genes with high expression and low variance, low expression and low variance, and high expression and high variance]

Typical questions
What are the expression differences between my experimental groups?
What are the subpopulations in my data?
What are the gene expression patterns in each subpopulation?

Flowchart: Treat conditions as groups? NO -> Select genes -> Assign cells to groups
A difference between the populations (signal) should appear among the most variable genes
Variance is a necessary but insufficient indicator of population differences
Unique populations consistently over- or under-express a set of genes
[Figure: variance vs. average expression]
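
Selecting the most variable genes can be sketched as ranking genes by the variance of their normalized expression across cells. This is a simplified stand-in, since practical methods usually model variance relative to average expression; the function name, gene names, and values are hypothetical.

```python
from statistics import mean, pvariance


def most_variable_genes(expression, n_top):
    """Rank genes by variance across cells; expression maps gene -> values."""
    stats = {gene: (mean(values), pvariance(values))
             for gene, values in expression.items()}
    ranked = sorted(stats, key=lambda gene: stats[gene][1], reverse=True)
    return ranked[:n_top], stats


# Hypothetical normalized expression for four cells.
expression = {
    "GENE_A": [5, 5, 5, 5],   # high expression, low variance
    "GENE_B": [0, 0, 8, 8],   # two groups of cells
    "GENE_C": [1, 9, 2, 8],   # variable, but without a clear grouping
}
top_genes, gene_stats = most_variable_genes(expression, n_top=2)
print(top_genes)
```

Both GENE_B and GENE_C rank highly, which illustrates why variance alone does not establish that subpopulations exist.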

Flowchart step: Assign cells to groups
The silhouette coefficient is a useful metric to determine the optimal number of groups
[Figure: clusterings for k = 2 (silhouette coefficient: 0.48), k = 3, and k = 4]
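
The silhouette coefficient compares, for each cell, the mean distance to its own group with the mean distance to the nearest other group. A minimal sketch follows, assuming Euclidean distance and candidate groupings that have already been produced by some clustering step (not shown). The points and labels are hypothetical.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("need at least two groups")
    scores = []
    for index, label in enumerate(labels):
        own = [j for j in clusters[label] if j != index]
        if not own:
            scores.append(0.0)  # convention: a singleton group scores 0
            continue
        a = sum(math.dist(points[index], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[index], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D coordinates with three well-separated pairs of cells.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
scores_by_k = {k: silhouette_coefficient(points, labels)
               for k, labels in candidates.items()}
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```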

Flowchart: Treat conditions as groups? YES -> Test genes for differential expression
[Figure: variance vs. average expression, highlighting differentially expressed genes and highly variable genes]
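
When the experimental conditions are treated as groups, each gene can be tested for a difference between them. The sketch below computes Welch's t statistic per gene as one possible test; the slides do not say which test is used, so this choice is an assumption. It stops at the statistic, whereas a real analysis would also need p-values and multiple-testing correction. Gene names and values are hypothetical; BU3 and C17 are the conditions from the metadata example.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(variance(group_a) / len(group_a)
                               + variance(group_b) / len(group_b))
    if standard_error == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical normalized expression per condition.
expression = {
    "GENE_X": {"BU3": [5.0, 5.2, 4.8], "C17": [1.0, 1.2, 0.8]},
    "GENE_Y": {"BU3": [3.0, 3.1, 2.9], "C17": [3.1, 2.9, 3.0]},
}
t_values = {gene: welch_t(groups["BU3"], groups["C17"])
            for gene, groups in expression.items()}
ranked = sorted(t_values, key=lambda gene: abs(t_values[gene]), reverse=True)
print(ranked)
```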

Real heatmaps are a rough-draft visualization

[Figure: heatmaps of NKX2-1 and CD47 expression, row scaling vs. global scaling]
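
The difference between the two scalings can be sketched as follows, assuming that row scaling means z-scoring each gene separately and global scaling means z-scoring all values together (the slide does not define either term). The expression values are hypothetical.

```python
from statistics import mean, pstdev


def row_scale(matrix):
    """Z-score each gene (row) on its own."""
    scaled = []
    for row in matrix:
        center, spread = mean(row), pstdev(row)
        scaled.append([(x - center) / spread if spread else 0.0 for x in row])
    return scaled


def global_scale(matrix):
    """Z-score every value against the mean and spread of the whole matrix."""
    values = [x for row in matrix for x in row]
    center, spread = mean(values), pstdev(values)
    return [[(x - center) / spread if spread else 0.0 for x in row]
            for row in matrix]


# Hypothetical expression across four cells: rows are NKX2-1 and CD47.
matrix = [[0, 0, 2, 2], [8, 8, 10, 10]]
by_row = row_scale(matrix)
by_matrix = global_scale(matrix)
print(by_row)
print(by_matrix)
```

Row scaling makes the two genes look identical because it discards their absolute levels, while global scaling keeps the level difference but compresses the pattern within each gene.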

Expression patterns are better conveyed by showing individual genes
[Figure: individual gene panels, CLUSTERED vs. RANDOM]

Gene set enrichment analysis depends on the quality of the gene set
MSigDB hallmark gene sets contain only 4000 genes
MAKE YOUR OWN GENE SETS FROM THE LITERATURE
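
A custom gene set can be checked against a list of selected genes with a hypergeometric over-representation test, sketched below. This is a simpler stand-in for rank-based enrichment methods, not the procedure used in the talk, and all numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 100 background genes, a 10-gene custom set,
# 10 selected genes, 5 of which belong to the set.
p_value = enrichment_p_value(n_background=100, n_in_set=10,
                             n_selected=10, n_overlap=5)
print(p_value)
```

The result depends directly on which genes are in the set and in the background, which is why the quality of the gene set matters.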

Takeaways
Remember to provide a metadata file
More reads are usually better than longer reads
Only about 50% of your reads will align uniquely to a single gene
Assume that 50% of your cells could fail
High variance doesn’t imply subpopulations
Make your own gene lists!

Slides available at: bit.ly/crem_bioinformatics
