This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
PhD Dissertation Talk, 22 April 2011
----
This thesis addresses the important problem of mining numerical data, especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.).
A difficult task consists in clustering genes to obtain classes of genes with similar behaviour, which are presumed to be involved together in a biological process.
Accordingly, we are interested in designing and comparing methods in the field of knowledge discovery from biological data. We propose to study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop sanity in the vast formal domain of information fusion.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
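A minimal sketch of what the horseshoe looks like as a global-local shrinkage prior, with the global scale fixed purely for illustration (a full model would place a prior on it as well):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from the horseshoe prior: beta_j ~ N(0, (lambda_j * tau)^2) with
# local scales lambda_j ~ Half-Cauchy(0, 1). The global scale tau is fixed
# here for illustration only.
tau = 0.1
n = 100_000
lam = np.abs(rng.standard_cauchy(n))        # local shrinkage scales
beta = rng.normal(0.0, 1.0, n) * lam * tau  # global-local shrinkage draws

# The pole at zero shrinks the bulk of coefficients aggressively, while
# the heavy Cauchy tails let a few coefficients escape shrinkage entirely.
print(np.median(np.abs(beta)), np.abs(beta).max())
```

The printout shows the characteristic shape: a tiny median magnitude (most draws are shrunk toward zero) alongside occasional very large draws from the heavy tails.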
Similarity encoding for learning on dirty categorical variables (Gael Varoquaux)
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
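A hypothetical minimal sketch of the 3-gram similarity encoding described above (the authors' implementation lives elsewhere; the category strings and prototypes here are invented):

```python
import numpy as np

def ngrams(s, n=3):
    s = f" {s} "  # pad so that short strings still produce n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)  # Jaccard overlap of n-gram sets

# Dirty categories: several strings denote the same underlying entity.
categories = ["accountant", "acountant", "senior accountant", "engineer"]
prototypes = ["accountant", "engineer"]  # chosen reference categories

# Each category is encoded by its similarity to every prototype, rather
# than by a one-hot indicator, so misspellings land near their entity.
encoded = np.array([[ngram_similarity(c, p) for p in prototypes]
                    for c in categories])
print(encoded.round(2))
```

The misspelled "acountant" gets a feature vector close to that of "accountant" instead of a wholly distinct one-hot column, which is exactly the redundancy the abstract says should be exposed to the learner.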
Dimensionality reduction by matrix factorization using concept lattice in dat... (eSAT Journals)
Abstract: Concept lattices are an important technique that has become standard in data analytics and knowledge representation in many fields, such as statistics, artificial intelligence, pattern recognition, machine learning, information theory, social networks, information retrieval systems, and software engineering. Formal concepts are adopted as the primitive notion: a concept is jointly defined as a pair consisting of an intension and an extension. FCA can handle huge amounts of data; it generates concepts and rules and supports data visualization. Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. In this paper, a novel method is proposed to decompose such concepts using Boolean matrix factorization for dimensionality reduction. The paper focuses on finding all the concepts and the object intersections. Keywords: data mining, formal concepts, lattice, matrix factorization, dimensionality reduction.
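To illustrate the idea, here is a toy Boolean matrix factorization of a small formal context, where two formal concepts (as rank-1 Boolean factors) exactly reconstruct the object-attribute matrix; the context is invented:

```python
import numpy as np

# A tiny formal context: rows are objects, columns are attributes.
I = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=bool)

# Two formal concepts of this context, each a maximal all-ones rectangle:
#   concept 1: extent {o0, o1}, intent {a0, a1}
#   concept 2: extent {o2},     intent {a1, a2}
extent1 = np.array([[1], [1], [0]], dtype=bool)
intent1 = np.array([[1, 1, 0]], dtype=bool)
extent2 = np.array([[0], [0], [1]], dtype=bool)
intent2 = np.array([[0, 1, 1]], dtype=bool)

# Boolean matrix factorization: the OR of the rank-1 concept rectangles
# reconstructs the context exactly, so 2 concepts suffice to cover it.
reconstruction = (extent1 & intent1) | (extent2 & intent2)
print(np.array_equal(reconstruction, I))
```

This is the sense in which formal concepts act as Boolean factors: a small set of concepts can cover the full incidence matrix, giving the dimensionality reduction the abstract describes.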
Dirty data science: machine learning on non-curated data (Gael Varoquaux)
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number one hassle of data scientists is cleaning the data in order to analyze it. Here, I survey the kinds of "dirtiness" that force time-consuming cleaning. We then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem leads us to revisit classic statistical results in the setting of supervised learning.
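As a sketch of one classic way to handle missing values directly inside supervised learning (mean imputation plus a missingness indicator; an illustration on synthetic data, not the exact approach from the course):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Fully observed toy regression data; then 20% of entries go missing at random.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.2] = np.nan

# Mean imputation plus a binary missingness-indicator column per feature,
# fitted jointly with the predictor inside one pipeline.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    Ridge(),
)
model.fit(X, y)
print(round(model.score(X, y), 2))
```

The point is that the learner consumes the table as-is, with missingness encoded as features, instead of requiring a separate curation pass.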
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
On the Classification of NP Complete Problems and Their Duality Feature (ijcsit)
NP Complete (abbreviated NPC) problems, standing at the crux of deciding whether P = NP, are among the hardest problems in computer science and related areas. For decades, NPC problems have been treated as one class. Observing that NPC problems have different natures, it is unlikely that they all have the same complexity. Our intensive study shows that NPC problems are not all equivalent in computational complexity, and they can be further classified. We then show that the classification of NPC problems may depend on their natures, reduction methods, exact algorithms, and the boundary between P and NP. A new perspective is provided: both P problems and NPC problems have the duality feature in terms of the computational complexity of asymptotic efficiency of algorithms. We also discuss
Study of Different Multi-instance Learning kNN Algorithms (Editor IJCATR)
Because of its applicability in various fields, multi-instance learning (the multi-instance problem) is becoming more popular in the machine learning research field. Different from supervised learning, multi-instance learning concerns the problem of classifying an unknown bag as positive or negative when the labels of the instances in the bags are ambiguous. This paper uses and studies three different k-nearest neighbor algorithms, namely Bayesian-kNN, citation-kNN, and Bayesian citation-kNN, for solving the multi-instance problem. Similarity between two bags is measured using the Hausdorff distance. To overcome the problem of false positive instances, a constructive covering algorithm is used. The problem definition, learning algorithms, and experimental data sets related to the multi-instance learning framework are also briefly reviewed in this paper.
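The Hausdorff distance between two bags of instances, mentioned above, can be sketched as follows (toy bags invented for illustration):

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    # Pairwise Euclidean distances between all instances of the two bags.
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    # Directed distances: each instance to its nearest neighbour in the
    # other bag; the Hausdorff distance is the larger of the two maxima.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])
bag2 = np.array([[0.0, 0.0], [4.0, 0.0]])
print(hausdorff(bag1, bag2))  # -> 3.0
```

This bag-level distance is what lets a nearest-neighbour rule such as citation-kNN operate on bags rather than on individual labelled instances.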
Machine learning for functional connectomes (Gael Varoquaux)
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
I updated the previous slides.
Previous slides: https://www.slideshare.net/DongMinLee32/causal-confusion-in-imitation-learning-238882277
I reviewed the "Causal Confusion in Imitation Learning" paper.
Paper link: https://papers.nips.cc/paper/9343-causal-confusion-in-imitation-learning.pdf
- Abstract
Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive “causal misidentification” phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions—either environment interaction or expert queries—to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.
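As a minimal sketch of the behavioral-cloning setup the abstract describes (toy data and a logistic-regression policy, both invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "expert demonstrations": observations and the actions an expert took.
# The expert reacts only to feature 0; the other features are distractors.
obs = rng.normal(size=(500, 4))
expert_actions = (obs[:, 0] > 0).astype(int)

# Behavioral cloning: fit a discriminative model p(action | observation).
policy = LogisticRegression().fit(obs, expert_actions)

# High accuracy on the demonstrations; yet the model has no notion of
# whether feature 0 causes the action or merely correlates with it, which
# is what causal misidentification exploits under distributional shift.
print(policy.score(obs, expert_actions))
```

If one of the distractor features happened to correlate with the action in the demonstrations (for example, a recording of the expert's own past action), the cloned policy could latch onto it, which is exactly the failure mode the paper studies.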
- Outline
1. Introduction
2. Causality and Causal Inference
3. Causality in Imitation Learning
4. Experiments Setting
5. Resolving Causal Misidentification
- Causal Graph-Parameterized Policy Learning
- Targeted Intervention
6. Experiments
Thank you!
Kernel methods and variable selection for exploratory analysis and multi-omic... (tuxette)
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Cancer cell metabolism: special reference to the lactate pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose, and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELLS:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
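A quick check of the arithmetic above, using the approximate 2 vs. 36 ATP figures quoted in these notes:

```python
# Approximate ATP yields per glucose molecule, as quoted above.
atp_glycolysis_only = 2    # glycolysis alone (Warburg-like cancer cell)
atp_full_respiration = 36  # glycolysis + Krebs cycle + Ox-Phos

# Glucose a glycolysis-only cell must consume to match the ATP output of
# one fully respired glucose molecule:
glucose_ratio = atp_full_respiration / atp_glycolysis_only
print(glucose_ratio)  # -> 18.0
```

So on these figures a cell relying on glycolysis alone needs roughly 18 times more glucose for the same energy budget, which is why cancer cells take up so much more sugar.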
Introduction to the Warburg phenomenon:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme."
WARBURG EFFECT: The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
(May 29th, 2024) Advancements in Intravital Microscopy: Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, as well as vascularization and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM utilized in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo, and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
Nutraceutical market, scope and growth: herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is expanding quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
This PDF is about schizophrenia.
For more details, visit SELF-EXPLANATORY on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks!
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Selective inference and single-cell differential analysis
1. Selective inference and single-cell differential analysis
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
Club Single-Cell
February 7th, 2022
2. Outline
Introduction: what is selective inference and why should we bother?
Sketch of basic ideas developed to answer this issue
4. Standard single-cell analysis pipeline and double dipping
Image taken from [Fang et al., 2021]
here: differential analysis
The dataset is used twice: first for clustering, then for differential analysis
7. Why is it a problem? Example on simulations...
How can we show the problem?
- simulate dummy data with no signal (e.g., n i.i.d. observations from N_d(0_d, σ²I_d))
- perform the test procedure: clustering, then differential analysis between clusters (Wald test), and obtain p-values
- what do we expect? Since there is no signal in the data (no true clusters, hence no marker genes), p-values ∼ U[0, 1]
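This can be reproduced in a few lines; a minimal sketch, assuming scikit-learn's k-means as the clustering step and a per-gene two-sample t-test as a stand-in for the Wald test:

```python
# Double-dipping demo: cluster pure-noise data, then test that same
# data for differences between the clusters just found.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))          # no signal: n i.i.d. draws from N_d(0, I)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-gene two-sample t-tests between the two clusters found above.
pvals = np.array([stats.ttest_ind(X[labels == 0, j],
                                  X[labels == 1, j]).pvalue
                  for j in range(d)])

# With no true clusters the p-values should look uniform on [0, 1];
# instead, clustering has manufactured separation and some are tiny.
print(pvals.min())
```

Instead of looking uniform, the smallest p-values come out far below any reasonable significance threshold.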
8. First question [Gao et al., 2021]
Is the average value of vector X in the first cluster different from what it is in the second
cluster?
9. First question using a train/test approach [Gao et al., 2021]
Is the average value of vector X in the first cluster different from what it is in the second
cluster?
10. Second question (at the level of marker gene)
[Zhang et al., 2019]
Is the average expression of a given gene, x_j, in the first cluster different from what it is
in the second cluster?
11. Why do we have this problem?
Main idea:
Clustering “forces” separation between expression measurements, whatever the true
underlying signal (or absence of signal).
12. Outline
Introduction: what is selective inference and why should we bother?
Sketch of basic ideas developed to answer this issue
13. Question 1 [Gao et al., 2021]
Denoting by D := ‖X̄(1) − X̄(2)‖ the distance between the two cluster means, and by φ a
random variable drawn from a χ² distribution (with parameters depending on X), define a
perturbed version of the data that:
- pulls the clusters apart if φ > D
- pushes the clusters together if φ < D
A valid p-value can then be obtained from the distribution of the clusters so obtained
(which depends on the random variable φ).
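The simulation route can be sketched with naive rejection sampling; this is a hypothetical simplification (k-means standing in for the clustering step, σ assumed known), not the clusterpval implementation, which characterizes the truncation exactly for hierarchical clustering or uses importance sampling:

```python
# Naive Monte Carlo sketch of the perturbation idea: draw a candidate
# distance phi, move the two cluster means to be exactly phi apart,
# re-cluster, and keep only draws where the original partition is
# recovered. Hypothetical simplification of [Gao et al., 2021].
import numpy as np
from sklearn.cluster import KMeans

def selective_pvalue(X, sigma=1.0, n_draws=300, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X)
    m0, m1 = X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)
    D = float(np.linalg.norm(m1 - m0))   # observed distance between means
    u = (m1 - m0) / D                    # unit direction between the means
    n0, n1 = int((labels == 0).sum()), int((labels == 1).sum())
    scale = sigma * np.sqrt(1.0 / n0 + 1.0 / n1)

    signs = np.where(labels == 1, 1.0, -1.0)[:, None]
    kept = exceed = 0
    for _ in range(n_draws):
        # Candidate distance: scaled chi with d degrees of freedom.
        phi = scale * np.sqrt(rng.chisquare(d))
        # Pull apart (phi > D) or push together (phi < D) along u.
        Xp = X + signs * ((phi - D) / 2.0) * u
        new = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(Xp)
        # Keep the draw only if the same partition is recovered.
        if (new == labels).all() or (new == 1 - labels).all():
            kept += 1
            exceed += int(phi >= D)
    return exceed / kept if kept else 1.0

rng = np.random.default_rng(42)
p = selective_pvalue(rng.normal(size=(60, 2)))
print(p)
```

In practice almost no draws survive the rejection step (the event that the same clusters are recovered sits far in the tail of φ), which is exactly why the simulation route needs so much computation and why an exact or importance-sampling treatment is preferable.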
18. Question 1 [Gao et al., 2021]
Is it usable? More or less...
1. either: you have an explicit description of the perturbed cluster assignment
Only available for hierarchical clustering in [Gao et al., 2021].
2. or: you simulate the distribution (using random draws of φ)
But this requires plenty of computation time.
The method is available as an R package: clusterpval
https://www.lucylgao.com/clusterpval/
21. Experiment
Data from [Zheng et al., 2017] with clustering of peripheral blood mononuclear cells
prior to sequencing (antibody-based bead enrichment + fluorescence-activated cell
sorting) ⇒ ground truth
Derivation of:
- a negative control (selection of 600 memory T cells)
- a positive control (selection of 200 memory T cells + 200 B cells + monocytes)
Method: clustering with HAC (3 clusters), then differential analysis (Wald test versus
their test)
23. Further discussion
- extension of this approach to marker gene detection is ongoing (work by Benjamin
Hivert, Boris Hejblum & Rodolphe Thiébaut)
- but extension beyond pairwise cluster comparisons remains challenging, as does the
estimation of a variance parameter needed for the method to work
24. Question 2 [Zhang et al., 2019]
Use a test based on a truncated distribution
26. Question 2 [Zhang et al., 2019]
Remarks on this approach:
- the separating hyperplane is supposed to be given ⇒ this constrains the clustering
method and requires that it be performed on a separate dataset
- genes are assumed to be uncorrelated (a very, very strong assumption...)
- the method is available as a Python tool at
https://github.com/jessemzhang/tn_test
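The truncation idea can be illustrated in one dimension; a minimal sketch, assuming the "cluster" is defined by thresholding the very gene being tested (in the actual TN test the separating hyperplane must come from a separate dataset):

```python
# Testing against a truncated null, the idea behind the TN test of
# [Zhang et al., 2019], in a hypothetical 1-D setting.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
c = 0.5                              # separating threshold, taken as given
x = rng.normal(size=5000)            # null expression: mean 0, no clusters
cluster = x[x > c]                   # cells selected by the threshold

# Naive z-test that ignores the selection: wildly anti-conservative,
# because the cluster mean is mechanically pulled above c.
z_naive = cluster.mean() / (cluster.std(ddof=1) / np.sqrt(cluster.size))
p_naive = stats.norm.sf(z_naive)

# Correct null: within the cluster, expression follows a standard
# normal truncated below at c; compare the sample mean to that.
tn = stats.truncnorm(a=c, b=np.inf)
z_tn = (cluster.mean() - tn.mean()) / (tn.std() / np.sqrt(cluster.size))
p_tn = 2 * stats.norm.sf(abs(z_tn))

print(p_naive, p_tn)
```

The naive p-value is essentially zero on pure noise, while the truncated-normal p-value behaves like an ordinary null p-value.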
27. Experiment 1
Again... data from [Zheng et al., 2017]...
Method:
- use Seurat for clustering (9 clusters)
- use Seurat and the TN test for differential analysis between the first two clusters
29. Experiment 2
Data from [Kolodziejczyk et al., 2015]
Impact of overclustering on results
30. References
Fang, R., Preissl, S., Li, Y., Hou, X., Lucero, J., Wang, X., Motamedi, A., Shiau, A. K., Zhou, X., Fangming, X., Mukamel, E. A., Zhang, K.,
Zhang, Y., Behrens, M. M., Ecker, J. R., and Ren, B. (2021).
Comprehensive analysis of single cell ATAC-seq data with SnapATAC.
Nature Communications, 12:1337.
Gao, L. L., Bien, J., and Witten, D. (2021).
Selective inference for hierarchical clustering.
arXiv preprint arXiv:2012.02936.
Kolodziejczyk, A. A., Kim, J. K., Tsang, J. C., Ilicic, T., Henriksson, J., Natarajan, K. N., Tuck, A. C., Gao, X., Bühler, M., Liu, P., Marioni,
J. C., and Teichmann, S. A. (2015).
Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.
Cell Stem Cell, 17(4):471–485.
Zhang, J. M., Kamath, G. M., and Tse, D. N. (2019).
Valid post-clustering differential analysis for single-cell RNA-seq.
Cell Systems, 9(4):383–392.e6.
Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J.,
Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W.,
Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson,
N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. (2017).
Massively parallel digital transcriptional profiling of single cells.
Nature Communications, 8:14049.