Selective inference and single-cell differential analysis
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
Club Single-Cell
February 7th, 2022
Outline
Introduction: what is selective inference and why should we bother?
Sketch of basic ideas developed to answer this issue
Standard single-cell analysis pipeline and double dipping
Image taken from [Fang et al., 2021]; the step considered here is differential analysis.
The dataset is used twice: first for clustering, then for differential analysis.
Why is it a problem? Example on simulations...
How can we show the problem?
- Simulate dummy data with no signal (e.g., $n$ i.i.d. observations from $\mathcal{N}_d(0_d, \sigma^2 I_d)$).
- Perform the test procedure: clustering, then differential analysis between clusters (Wald test), and obtain p-values.
- What do we expect? Since there is no signal in the data (no true clusters, so no marker genes), a valid procedure should give p-values $\sim \mathcal{U}[0, 1]$.
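The simulation described above can be sketched in Python. This is only a minimal illustration, not the code behind the talk: the sample size, dimension, number of repetitions, the use of Ward hierarchical clustering from SciPy, and a known-variance Wald test on the difference of cluster means are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, d, sigma, n_sim = 100, 2, 1.0, 200  # hypothetical settings

pvals = np.empty(n_sim)
for s in range(n_sim):
    # Null data: n i.i.d. draws from N_d(0, sigma^2 I_d), so no true clusters.
    X = rng.normal(0.0, sigma, size=(n, d))
    # Step 1: clustering into two groups (labels are 1 and 2).
    labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
    n1, n2 = np.sum(labels == 1), np.sum(labels == 2)
    # Step 2: naive Wald test of "same mean in both clusters" on the SAME data.
    diff = X[labels == 1].mean(axis=0) - X[labels == 2].mean(axis=0)
    stat = diff @ diff / (sigma**2 * (1.0 / n1 + 1.0 / n2))
    pvals[s] = chi2.sf(stat, df=d)

# Under a valid procedure this proportion would be close to 0.05.
reject_rate = np.mean(pvals < 0.05)
print(f"proportion of p-values below 0.05: {reject_rate:.2f}")
```

With these settings a large majority of the p-values should fall below 0.05 although the data contain no signal, which is the double-dipping problem.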
First question [Gao et al., 2021]
Is the average value of the vector X in the first cluster different from its average value in the second cluster?
First question using a train/test approach [Gao et al., 2021]
Is the average value of the vector X in the first cluster different from its average value in the second cluster?
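A train/test version of the same question can be sketched as follows. This is a hypothetical setup chosen for illustration (half of the cells used for clustering, test cells assigned to the nearest training centroid, known-variance Wald test on the test half); it is not taken from the slides. The point it illustrates is that the test-set labels are still a function of the test data, so splitting alone does not make the naive test valid.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, d, sigma, n_sim = 200, 2, 1.0, 200  # hypothetical settings


def split_pvalue(X):
    """Cluster the first half, label the second half, test on the second half."""
    train, test = X[: n // 2], X[n // 2:]
    train_labels = fcluster(linkage(train, method="ward"), t=2, criterion="maxclust")
    centroids = np.stack([train[train_labels == k].mean(axis=0) for k in (1, 2)])
    # Each test cell gets the label of its nearest training centroid.
    dists = np.linalg.norm(test[:, None, :] - centroids[None, :, :], axis=2)
    test_labels = dists.argmin(axis=1)
    n1, n2 = np.sum(test_labels == 0), np.sum(test_labels == 1)
    if min(n1, n2) == 0:
        return np.nan  # degenerate split, no test possible
    diff = test[test_labels == 0].mean(axis=0) - test[test_labels == 1].mean(axis=0)
    stat = diff @ diff / (sigma**2 * (1.0 / n1 + 1.0 / n2))
    return chi2.sf(stat, df=d)


# Null data again: no true clusters.
pvals = np.array([split_pvalue(rng.normal(0.0, sigma, size=(n, d))) for _ in range(n_sim)])
reject_rate = np.mean(pvals < 0.05)
print(f"proportion of p-values below 0.05 with sample splitting: {reject_rate:.2f}")
```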
Second question (at the level of marker genes) [Zhang et al., 2019]
Is the average expression of a given gene, $x_j$, in the first cluster different from its average expression in the second cluster?
Why do we have this problem?
Main idea:
Clustering “forces” separation between expression measurements whatever the true
underlying signal (or absence of signal).
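A one-gene worked example makes the "forced" separation concrete. It assumes an idealised situation that is not in the slides: a single gene with no signal, $n$ cells, and a clustering that reduces to splitting the cells at zero into two groups of about $n/2$ cells each.

```latex
X \sim \mathcal{N}(0, \sigma^2), \qquad
\mathbb{E}[X \mid X > 0] - \mathbb{E}[X \mid X < 0]
  = 2\sigma\sqrt{\tfrac{2}{\pi}} \approx 1.6\,\sigma,
\qquad
\frac{1.6\,\sigma}{\sigma\sqrt{\tfrac{2}{n} + \tfrac{2}{n}}} \approx 0.8\,\sqrt{n}.
```

The difference of cluster means is about $1.6\,\sigma$ even though there is no signal, and the corresponding Wald statistic grows like $\sqrt{n}$, so the naive p-value shrinks as more cells are sequenced.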
Question 1 [Gao et al., 2021]
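Denote by $D := \|X^{(1)} - X^{(2)}\|$ the distance between the averages of the two clusters, and let $\varphi$ be a random variable drawn from a $\chi^2$ distribution (with parameters depending on X). Define a perturbed version of the data that:
- pulls the clusters apart if $\varphi > D$
- pushes the clusters together if $\varphi < D$
A valid p-value can then be obtained from the distribution of the clusters obtained on the perturbed data (which depends on the random variable $\varphi$).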
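The perturbation can be sketched as below. The exact form used here (each of the two clusters is shifted along the line joining the two cluster means, with shifts weighted by the size of the other cluster) is my reading of [Gao et al., 2021] and should be treated as an assumption; the function names `perturb` and `mean_distance` are hypothetical.

```python
import numpy as np


def mean_distance(X, labels):
    """Distance D between the means of clusters 1 and 2."""
    return np.linalg.norm(X[labels == 1].mean(axis=0) - X[labels == 2].mean(axis=0))


def perturb(X, labels, phi):
    """Shift clusters 1 and 2 so that the distance between their means becomes phi.

    Points outside clusters 1 and 2 are left unchanged.
    """
    in1, in2 = labels == 1, labels == 2
    n1, n2 = in1.sum(), in2.sum()
    diff = X[in1].mean(axis=0) - X[in2].mean(axis=0)
    D = np.linalg.norm(diff)
    direction = diff / D
    Xp = X.astype(float)  # astype returns a copy, X itself is not modified
    Xp[in1] += (n2 / (n1 + n2)) * (phi - D) * direction
    Xp[in2] -= (n1 / (n1 + n2)) * (phi - D) * direction
    return Xp


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
labels = np.repeat([1, 2], [12, 8])  # arbitrary labels, for illustration only
D = mean_distance(X, labels)

for phi in (0.5 * D, D, 2.0 * D):
    new_D = mean_distance(perturb(X, labels, phi), labels)
    print(f"phi / D = {phi / D:.1f} -> new distance / D = {new_D / D:.1f}")
```

With $\varphi = D$ the data are unchanged; larger values move the two clusters apart and smaller values move them together.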
Is it usable? More or less...
1. either: you have an explicit description of the perturbed cluster definition.
   Only available for hierarchical clustering (HC) in [Gao et al., 2021].
2. or: you simulate the distribution (using random draws of $\varphi$).
   But you need plenty of time.
The method is available as an R package: clusterpval
https://www.lucylgao.com/clusterpval/
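Option 2 can be sketched in Python as below. This is an illustration under several assumptions, not the clusterpval implementation: the variance $\sigma^2$ is taken as known, $\varphi$ is taken to be $\sigma\sqrt{1/n_1 + 1/n_2}$ times a chi-distributed variable with $d$ degrees of freedom (my reading of [Gao et al., 2021]), Ward clustering from SciPy is used, and draws are made by importance sampling around the observed $D$ to keep their number small. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi, norm


def cluster(X, k):
    return fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")


def same_cluster(new_labels, members):
    """True if the cells in `members` form exactly one cluster of `new_labels`."""
    v = new_labels[members]
    return bool(np.all(v == v[0]) and np.sum(new_labels == v[0]) == members.sum())


def mc_selective_pvalue(X, k, c1, c2, sigma, n_draws=200, seed=None):
    """Return (Monte Carlo selective p-value, naive Wald p-value) for clusters c1, c2."""
    rng = np.random.default_rng(seed)
    labels = cluster(X, k)
    in1, in2 = labels == c1, labels == c2
    n1, n2 = in1.sum(), in2.sum()
    diff = X[in1].mean(axis=0) - X[in2].mean(axis=0)
    D = np.linalg.norm(diff)
    direction = diff / D
    scale = sigma * np.sqrt(1.0 / n1 + 1.0 / n2)
    d = X.shape[1]

    # Proposal: phi ~ N(D, scale^2); target: scale * chi_d. Weights = target / proposal.
    phis = rng.normal(D, scale, size=n_draws)
    log_w = chi.logpdf(phis, df=d, scale=scale) - norm.logpdf(phis, loc=D, scale=scale)

    keep = np.zeros(n_draws, dtype=bool)
    for i, phi in enumerate(phis):
        if phi <= 0:
            continue  # outside the support of the target distribution
        Xp = X.astype(float)  # copy of the data
        Xp[in1] += (n2 / (n1 + n2)) * (phi - D) * direction
        Xp[in2] -= (n1 / (n1 + n2)) * (phi - D) * direction
        new_labels = cluster(Xp, k)
        # Keep the draw only if re-clustering recovers both original clusters.
        keep[i] = same_cluster(new_labels, in1) and same_cluster(new_labels, in2)

    p_naive = float(chi.sf(D, df=d, scale=scale))
    if not keep.any():
        return np.nan, p_naive
    w = np.exp(log_w[keep] - log_w[keep].max())
    p_sel = float(np.sum(w[phis[keep] >= D]) / np.sum(w))
    return p_sel, p_naive


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))  # null data: no true clusters
p_sel, p_naive = mc_selective_pvalue(X, k=2, c1=1, c2=2, sigma=1.0, seed=0)
print(f"selective p-value (Monte Carlo): {p_sel:.3g}, naive p-value: {p_naive:.3g}")
```

Each draw requires a full re-clustering, which is why this route needs plenty of time on realistic datasets.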
Experiment
Data from [Zheng et al., 2017], in which peripheral blood mononuclear cells were sorted by cell type prior to sequencing (antibody-based bead enrichment + fluorescence-activated cell sorting) ⇒ ground truth
Derivation of:
- a negative control (selection of 600 memory T cells)
- a positive control (selection of 200 memory T cells + 200 B cells + monocytes)
Method: clustering with HAC (3 clusters), then differential analysis (Wald test versus the test of [Gao et al., 2021])
Further discussion
- extension of this approach to marker gene detection is ongoing (work by Benjamin Hivert, Boris Hejblum & Rodolphe Thiébaut)
- but extension beyond pairwise cluster comparison is still challenging, as is the estimation of the variance parameter that the method needs in order to work
Question 2 [Zhang et al., 2019]
Use a test based on a truncated distribution
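The idea of correcting with a truncated distribution can be illustrated on one gene. The sketch below is not the TN test of [Zhang et al., 2019]: it assumes a single gene, a known variance, and a separating threshold fixed in advance at 0, and the function name `truncated_mle_mean` is hypothetical. Each cluster is modelled as a normal distribution observed only on its side of the threshold, and the mean of the untruncated distribution is estimated by maximum likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm


def truncated_mle_mean(x, threshold, side, sigma=1.0):
    """MLE of mu for N(mu, sigma^2) data observed only on one side of `threshold`."""

    def neg_loglik(mu):
        log_dens = norm.logpdf(x, loc=mu, scale=sigma).sum()
        if side == "right":  # only values above the threshold are observed
            log_mass = norm.logsf(threshold, loc=mu, scale=sigma)
        else:  # only values below the threshold are observed
            log_mass = norm.logcdf(threshold, loc=mu, scale=sigma)
        return -(log_dens - x.size * log_mass)

    return minimize_scalar(neg_loglik, bounds=(-10.0, 10.0), method="bounded").x


rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)  # one gene, no signal
threshold = 0.0                      # separating "hyperplane", assumed given
left, right = x[x <= threshold], x[x > threshold]

naive_gap = right.mean() - left.mean()
corrected_gap = (truncated_mle_mean(right, threshold, "right")
                 - truncated_mle_mean(left, threshold, "left"))
print(f"naive difference of means: {naive_gap:.2f}")
print(f"difference after truncation correction: {corrected_gap:.2f}")
```

The naive difference is large because each cluster only contains one side of the distribution, while the corrected difference should be close to zero.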
Remarks on this approach:
- the separating hyperplane is assumed to be given ⇒ this constrains the clustering method and requires that the clustering be performed on a separate dataset
- genes are assumed to be uncorrelated (a very, very strong assumption...)
- method available as a Python tool at https://github.com/jessemzhang/tn_test
Experiment 1
Again... data from [Zheng et al., 2017]...
Method:
- use Seurat for clustering (9 clusters)
- use Seurat and the TN test for differential analysis between the first two clusters
Experiment 2
Data from [Kolodziejczyk et al., 2015]
Impact of overclustering on results
References
Fang, R., Preissl, S., Li, Y., Hou, X., Lucero, J., Wang, X., Motamedi, A., Shiau, A. K., Zhou, X., Xie, F., Mukamel, E. A., Zhang, K.,
Zhang, Y., Behrens, M. M., Ecker, J. R., and Ren, B. (2021).
Comprehensive analysis of single cell ATAC-seq data with SnapATAC.
Nature Communications, 12:1337.
Gao, L. L., Bien, J., and Witten, D. (2021).
Selective inference for hierarchical clustering.
Preprint arXiv:2012.02936.
Kolodziejczyk, A. A., Kim, J. K., Tsang, J. C., Ilicic, T., Henriksson, J., Natarajan, K. N., Tuck, A. C., Gao, X., Bühler, M., Liu, P., Marioni,
J. C., and Teichmann, S. A. (2015).
Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.
Cell Stem Cell, 17(4):471–485.
Zhang, J. M., Kamath, G. M., and Tse, D. N. (2019).
Valid post-clustering differential analysis for single-cell RNA-seq.
Cell Systems, 9(4):383–392.e6.
Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J.,
Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W.,
Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson,
N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. (2017).
Massively parallel digital transcriptional profiling of single cells.
Nature Communications, 8:14049.