This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates the shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
PhD Dissertation Talk, 22 April 2011
----
This thesis addresses the important problem of mining numerical data, especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.).
A difficult task consists in clustering genes to obtain classes of genes with similar behaviour, which are presumed to be involved together in a biological process.
Accordingly, we are interested in designing and comparing methods in the field of knowledge discovery from biological data. We propose to study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop sanity in the vast formal domain of information fusion.
Since the advent of the horseshoe priors for regularization, global-local shrinkage methods have proved to be a fertile ground for the development of Bayesian theory and methodology in machine learning. They have achieved remarkable success in computation, and enjoy strong theoretical support. Much of the existing literature has focused on the linear Gaussian case. The purpose of the current talk is to demonstrate that the horseshoe priors are useful more broadly, by reviewing both methodological and computational developments in complex models that are more relevant to machine learning applications. Specifically, we focus on methodological challenges in horseshoe regularization in nonlinear and non-Gaussian models; multivariate models; and deep neural networks. We also outline the recent computational developments in horseshoe shrinkage for complex models along with a list of available software implementations that allows one to venture out beyond the comfort zone of the canonical linear regression problems.
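A minimal sketch of what the horseshoe looks like as a global-local shrinkage prior, with the global scale fixed purely for illustration (a full model would place a prior on it as well):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draws from the horseshoe prior: beta_j ~ N(0, (lambda_j * tau)^2) with
# local scales lambda_j ~ Half-Cauchy(0, 1). The global scale tau is fixed
# here for illustration only.
tau = 0.1
n = 100_000
lam = np.abs(rng.standard_cauchy(n))        # local shrinkage scales
beta = rng.normal(0.0, 1.0, n) * lam * tau  # global-local shrinkage draws

# The pole at zero shrinks the bulk of coefficients aggressively, while
# the heavy Cauchy tails let a few coefficients escape shrinkage entirely.
print(np.median(np.abs(beta)), np.abs(beta).max())
```

The printout shows the characteristic shape: a tiny median magnitude (most draws are shrunk toward zero) alongside occasional very large draws from the heavy tails.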
Similarity encoding for learning on dirty categorical variables (Gael Varoquaux)
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
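A hypothetical minimal sketch of the 3-gram similarity encoding described above (the authors' implementation lives elsewhere; the category strings and prototypes here are invented):

```python
import numpy as np

def ngrams(s, n=3):
    s = f" {s} "  # pad so that short strings still produce n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)  # Jaccard overlap of n-gram sets

# Dirty categories: several strings denote the same underlying entity.
categories = ["accountant", "acountant", "senior accountant", "engineer"]
prototypes = ["accountant", "engineer"]  # chosen reference categories

# Each category is encoded by its similarity to every prototype, rather
# than by a one-hot indicator, so misspellings land near their entity.
encoded = np.array([[ngram_similarity(c, p) for p in prototypes]
                    for c in categories])
print(encoded.round(2))
```

The misspelled "acountant" gets a feature vector close to that of "accountant" instead of a wholly distinct one-hot column, which is exactly the redundancy the abstract says should be exposed to the learner.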
Dimensionality reduction by matrix factorization using concept lattice in dat... (eSAT Journals)
Abstract: Concept lattices are an important technique that has become standard in data analytics and knowledge representation in many fields, such as statistics, artificial intelligence, pattern recognition, machine learning, information theory, social networks, information retrieval systems, and software engineering. Formal concepts are adopted as the primitive notion: a concept is jointly defined as a pair consisting of an intension and an extension. FCA can handle huge amounts of data; it generates concepts and rules and supports data visualization. Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. In this paper, a novel method is proposed to decompose such concepts using Boolean matrix factorization for dimensionality reduction. The paper focuses on finding all the concepts and the object intersections. Keywords: data mining, formal concepts, lattice, matrix factorization, dimensionality reduction.
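To illustrate the idea, here is a toy Boolean matrix factorization of a small formal context, where two formal concepts (as rank-1 Boolean factors) exactly reconstruct the object-attribute matrix; the context is invented:

```python
import numpy as np

# A tiny formal context: rows are objects, columns are attributes.
I = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=bool)

# Two formal concepts of this context, each a maximal all-ones rectangle:
#   concept 1: extent {o0, o1}, intent {a0, a1}
#   concept 2: extent {o2},     intent {a1, a2}
extent1 = np.array([[1], [1], [0]], dtype=bool)
intent1 = np.array([[1, 1, 0]], dtype=bool)
extent2 = np.array([[0], [0], [1]], dtype=bool)
intent2 = np.array([[0, 1, 1]], dtype=bool)

# Boolean matrix factorization: the OR of the rank-1 concept rectangles
# reconstructs the context exactly, so 2 concepts suffice to cover it.
reconstruction = (extent1 & intent1) | (extent2 & intent2)
print(np.array_equal(reconstruction, I))
```

This is the sense in which formal concepts act as Boolean factors: a small set of concepts can cover the full incidence matrix, giving the dimensionality reduction the abstract describes.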
Dirty data science: machine learning on non-curated data (Gael Varoquaux)
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number one hassle of data scientists is cleaning the data in order to analyze it. Here, I survey the kinds of "dirtiness" that force time-consuming cleaning. We then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem leads us to revisit classic statistical results in the setting of supervised learning.
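As a sketch of one classic way to handle missing values directly inside supervised learning (mean imputation plus a missingness indicator; an illustration on synthetic data, not the exact approach from the course):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Fully observed toy regression data; then 20% of entries go missing at random.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.2] = np.nan

# Mean imputation plus a binary missingness-indicator column per feature,
# fitted jointly with the predictor inside one pipeline.
model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    Ridge(),
)
model.fit(X, y)
print(round(model.score(X, y), 2))
```

The point is that the learner consumes the table as-is, with missingness encoded as features, instead of requiring a separate curation pass.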
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
On the Classification of NP Complete Problems and Their Duality Feature (ijcsit)
NP Complete (abbreviated NPC) problems, standing at the crux of deciding whether P = NP, are among the hardest problems in computer science and related areas. For decades, NPC problems have been treated as one class. Observing that NPC problems have different natures, it is unlikely that they all have the same complexity. Our intensive study shows that NPC problems are not all equivalent in computational complexity, and they can be further classified. We then show that the classification of NPC problems may depend on their natures, reduction methods, exact algorithms, and the boundary between P and NP. A new perspective is provided: both P problems and NPC problems have the duality feature in terms of the computational complexity of asymptotic efficiency of algorithms. We also discuss
Study of Different Multi-instance Learning kNN Algorithms (Editor IJCATR)
Because of its applicability in various fields, multi-instance learning (the multi-instance problem) is becoming more popular in the machine learning research field. Different from supervised learning, multi-instance learning concerns the problem of classifying an unknown bag as positive or negative when the labels of the instances in the bags are ambiguous. This paper uses and studies three different k-nearest neighbor algorithms, namely Bayesian-kNN, citation-kNN, and Bayesian citation-kNN, for solving the multi-instance problem. Similarity between two bags is measured using the Hausdorff distance. To overcome the problem of false positive instances, a constructive covering algorithm is used. The problem definition, learning algorithms, and experimental data sets related to the multi-instance learning framework are also briefly reviewed in this paper.
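The Hausdorff distance between two bags of instances, mentioned above, can be sketched as follows (toy bags invented for illustration):

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    # Pairwise Euclidean distances between all instances of the two bags.
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    # Directed distances: each instance to its nearest neighbour in the
    # other bag; the Hausdorff distance is the larger of the two maxima.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])
bag2 = np.array([[0.0, 0.0], [4.0, 0.0]])
print(hausdorff(bag1, bag2))  # -> 3.0
```

This bag-level distance is what lets a nearest-neighbour rule such as citation-kNN operate on bags rather than on individual labelled instances.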
Machine learning for functional connectomes (Gael Varoquaux)
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
I updated the previous slides.
Previous slides: https://www.slideshare.net/DongMinLee32/causal-confusion-in-imitation-learning-238882277
I reviewed the "Causal Confusion in Imitation Learning" paper.
Paper link: https://papers.nips.cc/paper/9343-causal-confusion-in-imitation-learning.pdf
- Abstract
Behavioral cloning reduces policy learning to supervised learning by training a discriminative model to predict expert actions given observations. Such discriminative models are non-causal: the training procedure is unaware of the causal structure of the interaction between the expert and the environment. We point out that ignoring causality is particularly damaging because of the distributional shift in imitation learning. In particular, it leads to a counter-intuitive “causal misidentification” phenomenon: access to more information can yield worse performance. We investigate how this problem arises, and propose a solution to combat it through targeted interventions—either environment interaction or expert queries—to determine the correct causal model. We show that causal misidentification occurs in several benchmark control domains as well as realistic driving settings, and validate our solution against DAgger and other baselines and ablations.
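As a minimal sketch of the behavioral-cloning setup the abstract describes (toy data and a logistic-regression policy, both invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "expert demonstrations": observations and the actions an expert took.
# The expert reacts only to feature 0; the other features are distractors.
obs = rng.normal(size=(500, 4))
expert_actions = (obs[:, 0] > 0).astype(int)

# Behavioral cloning: fit a discriminative model p(action | observation).
policy = LogisticRegression().fit(obs, expert_actions)

# High accuracy on the demonstrations; yet the model has no notion of
# whether feature 0 causes the action or merely correlates with it, which
# is what causal misidentification exploits under distributional shift.
print(policy.score(obs, expert_actions))
```

If one of the distractor features happened to correlate with the action in the demonstrations (for example, a recording of the expert's own past action), the cloned policy could latch onto it, which is exactly the failure mode the paper studies.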
- Outline
1. Introduction
2. Causality and Causal Inference
3. Causality in Imitation Learning
4. Experiments Setting
5. Resolving Causal Misidentification
- Causal Graph-Parameterized Policy Learning
- Targeted Intervention
6. Experiments
Thank you!
Kernel methods and variable selection for exploratory analysis and multi-omic... (tuxette)
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
They monitor common gases, weather parameters, and particulates.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
Cancer cell metabolism: special reference to the lactate pathway (AADYARAJPANDEY1)
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose, and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELLS:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
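A quick check of the arithmetic above, using the approximate 2 vs. 36 ATP figures quoted in these notes:

```python
# Approximate ATP yields per glucose molecule, as quoted above.
atp_glycolysis_only = 2    # glycolysis alone (Warburg-like cancer cell)
atp_full_respiration = 36  # glycolysis + Krebs cycle + Ox-Phos

# Glucose a glycolysis-only cell must consume to match the ATP output of
# one fully respired glucose molecule:
glucose_ratio = atp_full_respiration / atp_glycolysis_only
print(glucose_ratio)  # -> 18.0
```

So on these figures a cell relying on glycolysis alone needs roughly 18 times more glucose for the same energy budget, which is why cancer cells take up so much more sugar.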
Introduction to the Warburg phenomenon:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme."
WARBURG EFFECT: The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
(May 29th, 2024) Advancements in Intravital Microscopy: Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, as well as vascularization and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM utilized in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo, and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Seminar on U.V. Spectroscopy by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
Nutraceutical market, scope and growth: herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market, which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition, is growing significantly. As healthcare expenses rise, the population ages, and people increasingly want natural and preventative health solutions, this industry is expanding quickly. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
This PDF is about schizophrenia.
For more details, visit SELF-EXPLANATORY on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks!
Richard's adventures in two entangled wonderlands (Richard Gill)
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Selective inference and single-cell differential analysis
1. Selective inference and single-cell differential analysis
Nathalie Vialaneix
nathalie.vialaneix@inrae.fr
http://www.nathalievialaneix.eu
Club Single-Cell
February 7th, 2022
2. Outline
Introduction: what is selective inference and why should we bother?
Sketch of basic ideas developed to answer this issue
4. Standard single-cell analysis pipeline and double dipping
Image taken from [Fang et al., 2021]
here: differential analysis
The dataset is used twice: first for clustering, then for differential analysis
7. Why is it a problem? Example on simulations...
How can we show the problem?
- simulate dummy data with no signal (e.g., n i.i.d. observations from N_d(0_d, σ²I_d))
- perform the test procedure: clustering, then differential analysis between clusters (Wald test), and obtain p-values
- what do we expect? Since there is no signal in the data (no true clusters, hence no marker genes), p-values ∼ U[0, 1]
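This can be reproduced in a few lines; a minimal sketch, assuming scikit-learn's k-means as the clustering step and a per-gene two-sample t-test as a stand-in for the Wald test:

```python
# Double-dipping demo: cluster pure-noise data, then test that same
# data for differences between the clusters just found.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))          # no signal: n i.i.d. draws from N_d(0, I)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-gene two-sample t-tests between the two clusters found above.
pvals = np.array([stats.ttest_ind(X[labels == 0, j],
                                  X[labels == 1, j]).pvalue
                  for j in range(d)])

# With no true clusters the p-values should look uniform on [0, 1];
# instead, clustering has manufactured separation and some are tiny.
print(pvals.min())
```

Instead of looking uniform, the smallest p-values come out far below any reasonable significance threshold.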
8. First question [Gao et al., 2021]
Is the average value of vector X in the first cluster different from what it is in the second
cluster?
9. First question using a train/test approach [Gao et al., 2021]
Is the average value of vector X in the first cluster different from what it is in the second
cluster?
10. Second question (at the level of marker gene)
[Zhang et al., 2019]
Is the average expression of a given gene, x_j, in the first cluster different from what it is
in the second cluster?
11. Why do we have this problem?
Main idea:
Clustering “forces” separation between expression measurements, whatever the true
underlying signal (or absence of signal).
12. Outline
Introduction: what is selective inference and why should we bother?
Sketch of basic ideas developed to answer this issue
13. Question 1 [Gao et al., 2021]
Denoting by D := ‖X̄(1) − X̄(2)‖ the distance between the two cluster means, and by φ a
random variable drawn from a χ² distribution (with parameters depending on X), define a
perturbed version of the data that:
- pulls the clusters apart if φ > D
- pushes the clusters together if φ < D
A valid p-value can then be obtained from the distribution of the clusters so obtained
(which depends on the random variable φ).
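The simulation route can be sketched with naive rejection sampling; this is a hypothetical simplification (k-means standing in for the clustering step, σ assumed known), not the clusterpval implementation, which characterizes the truncation exactly for hierarchical clustering or uses importance sampling:

```python
# Naive Monte Carlo sketch of the perturbation idea: draw a candidate
# distance phi, move the two cluster means to be exactly phi apart,
# re-cluster, and keep only draws where the original partition is
# recovered. Hypothetical simplification of [Gao et al., 2021].
import numpy as np
from sklearn.cluster import KMeans

def selective_pvalue(X, sigma=1.0, n_draws=300, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X)
    m0, m1 = X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)
    D = float(np.linalg.norm(m1 - m0))   # observed distance between means
    u = (m1 - m0) / D                    # unit direction between the means
    n0, n1 = int((labels == 0).sum()), int((labels == 1).sum())
    scale = sigma * np.sqrt(1.0 / n0 + 1.0 / n1)

    signs = np.where(labels == 1, 1.0, -1.0)[:, None]
    kept = exceed = 0
    for _ in range(n_draws):
        # Candidate distance: scaled chi with d degrees of freedom.
        phi = scale * np.sqrt(rng.chisquare(d))
        # Pull apart (phi > D) or push together (phi < D) along u.
        Xp = X + signs * ((phi - D) / 2.0) * u
        new = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(Xp)
        # Keep the draw only if the same partition is recovered.
        if (new == labels).all() or (new == 1 - labels).all():
            kept += 1
            exceed += int(phi >= D)
    return exceed / kept if kept else 1.0

rng = np.random.default_rng(42)
p = selective_pvalue(rng.normal(size=(60, 2)))
print(p)
```

In practice almost no draws survive the rejection step (the event that the same clusters are recovered sits far in the tail of φ), which is exactly why the simulation route needs so much computation and why an exact or importance-sampling treatment is preferable.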
18. Question 1 [Gao et al., 2021]
Is it usable? More or less...
1. either: you have an explicit description of the perturbed cluster assignment
Only available for hierarchical clustering in [Gao et al., 2021].
2. or: you simulate the distribution (using random draws of φ)
But this requires plenty of computation time.
The method is available as an R package: clusterpval
https://www.lucylgao.com/clusterpval/
21. Experiment
Data from [Zheng et al., 2017] with clustering of peripheral blood mononuclear cells
prior to sequencing (antibody-based bead enrichment + fluorescence-activated cell
sorting) ⇒ ground truth
Derivation of:
- a negative control (selection of 600 memory T cells)
- a positive control (selection of 200 memory T cells + 200 B cells + monocytes)
Method: clustering with HAC (3 clusters), then differential analysis (Wald test versus
their test)
23. Further discussion
- extension of this approach to marker gene detection is ongoing (work by Benjamin
Hivert, Boris Hejblum & Rodolphe Thiébaut)
- but extension beyond pairwise cluster comparisons remains challenging, as does the
estimation of a variance parameter needed for the method to work
24. Question 2 [Zhang et al., 2019]
Use a test based on a truncated distribution
26. Question 2 [Zhang et al., 2019]
Remarks on this approach:
- the separating hyperplane is supposed to be given ⇒ this constrains the clustering
method and requires that it be performed on a separate dataset
- genes are assumed to be uncorrelated (a very, very strong assumption...)
- the method is available as a Python tool at
https://github.com/jessemzhang/tn_test
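The truncation idea can be illustrated in one dimension; a minimal sketch, assuming the "cluster" is defined by thresholding the very gene being tested (in the actual TN test the separating hyperplane must come from a separate dataset):

```python
# Testing against a truncated null, the idea behind the TN test of
# [Zhang et al., 2019], in a hypothetical 1-D setting.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
c = 0.5                              # separating threshold, taken as given
x = rng.normal(size=5000)            # null expression: mean 0, no clusters
cluster = x[x > c]                   # cells selected by the threshold

# Naive z-test that ignores the selection: wildly anti-conservative,
# because the cluster mean is mechanically pulled above c.
z_naive = cluster.mean() / (cluster.std(ddof=1) / np.sqrt(cluster.size))
p_naive = stats.norm.sf(z_naive)

# Correct null: within the cluster, expression follows a standard
# normal truncated below at c; compare the sample mean to that.
tn = stats.truncnorm(a=c, b=np.inf)
z_tn = (cluster.mean() - tn.mean()) / (tn.std() / np.sqrt(cluster.size))
p_tn = 2 * stats.norm.sf(abs(z_tn))

print(p_naive, p_tn)
```

The naive p-value is essentially zero on pure noise, while the truncated-normal p-value behaves like an ordinary null p-value.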
27. Experiment 1
Again... data from [Zheng et al., 2017]...
Method:
- use Seurat for clustering (9 clusters)
- use Seurat and the TN test for differential analysis between the first two clusters
29. Experiment 2
Data from [Kolodziejczyk et al., 2015]
Impact of overclustering on results
30. References
Fang, R., Preissl, S., Li, Y., Hou, X., Lucero, J., Wang, X., Motamedi, A., Shiau, A. K., Zhou, X., Fangming, X., Mukamel, E. A., Zhang, K.,
Zhang, Y., Behrens, M. M., Ecker, J. R., and Ren, B. (2021).
Comprehensive analysis of single cell ATAC-seq data with SnapATAC.
Nature Communications, 12:1337.
Gao, L. L., Bien, J., and Witten, D. (2021).
Selective inference for hierarchical clustering.
arXiv preprint arXiv:2012.02936.
Kolodziejczyk, A. A., Kim, J. K., Tsang, J. C., Ilicic, T., Henriksson, J., Natarajan, K. N., Tuck, A. C., Gao, X., Bühler, M., Liu, P., Marioni,
J. C., and Teichmann, S. A. (2015).
Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation.
Cell Stem Cell, 17(4):471–485.
Zhang, J. M., Kamath, G. M., and Tse, D. N. (2019).
Valid post-clustering differential analysis for single-cell RNA-seq.
Cell Systems, 9(4):383–392.e6.
Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J.,
Gregory, M. T., Shuga, J., Montesclaros, L., Underwood, J. G., Masquelier, D. A., Nishimura, S. Y., Schnall-Levin, M., Wyatt, P. W.,
Hindson, C. M., Bharadwaj, R., Wong, A., Ness, K. D., Beppu, L. W., Deeg, H. J., McFarland, C., Loeb, K. R., Valente, W. J., Ericson,
N. G., Stevens, E. A., Radich, J. P., Mikkelsen, T. S., Hindson, B. J., and Bielas, J. H. (2017).
Massively parallel digital transcriptional profiling of single cells.
Nature Communications, 8:14049.