Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Kernel methods and variable selection for exploratory analysis and multi-omic... - tuxette
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Medical pathology images are visually evaluated by experts for disease diagnosis, but the connection between image features and the state of the cells in an image is typically unknown. To understand this relationship, we describe a multimodal modeling and inference framework that estimates shared latent structure of joint gene expression levels and medical image features. The method is built around probabilistic canonical correlation analysis (PCCA), which is jointly fit to image embeddings that are learned using convolutional neural networks and linear embeddings of paired gene expression data. We finally discuss a set of theoretical and empirical challenges in domain adaptation settings arising from genomics data. (Based on work in collaboration with Gregory Gundersen and Barbara E. Engelhardt.)
When Classifier Selection meets Information Theory: A Unifying View - Mohamed Farouk
Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection. It derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to solve an open issue in ensemble learning: classifier selection. We investigate a couple of information-theoretic selection criteria that are used to rank classifiers.
Autoregressive Convolutional Neural Networks for Asynchronous Time Series - Gautier Marti
In this talk, we present a CNN architecture for predicting autoregressive asynchronous time series. We illustrate its application to predicting traders’ quotes of credit default swaps (a proprietary dataset from Hellebore Capital), and to artificial time series. The paper is available here: http://proceedings.mlr.press/v80/binkowski18a/binkowski18a.pdf
Bayesian inference for mixed-effects models driven by SDEs and other stochast... - Umberto Picchini
An important, and well studied, class of stochastic models is given by stochastic differential equations (SDEs). In this talk, we consider Bayesian inference based on measurements from several individuals, to provide inference at the "population level" using mixed-effects modelling. We consider the case where dynamics are expressed via SDEs or other stochastic (Markovian) models. Stochastic differential equation mixed-effects models (SDEMEMs) are flexible hierarchical models that account for (i) the intrinsic random variability in the latent state dynamics, (ii) the variability between individuals, and (iii) measurement error. This flexibility gives rise to methodological and computational difficulties.
Fully Bayesian inference for nonlinear SDEMEMs is complicated by the typical intractability of the observed-data likelihood, which motivates the use of sampling-based approaches such as Markov chain Monte Carlo. A Gibbs sampler is proposed to target the marginal posterior of all parameters of interest. The algorithm is made computationally efficient through careful use of blocking strategies, particle filters (sequential Monte Carlo) and correlated pseudo-marginal approaches. The resulting methodology is flexible and general, and is able to deal with a large class of nonlinear SDEMEMs [1]. In more recent work [2], we also explored ways to make inference even more scalable to an increasing number of individuals, while also dealing with state-space models driven by stochastic dynamic models other than SDEs, e.g. Markov jump processes and the nonlinear solvers typically used in systems biology.
[1] S. Wiqvist, A. Golightly, A. T. McLean, U. Picchini (2020). Efficient inference for stochastic differential mixed-effects models using correlated particle pseudo-marginal algorithms, CSDA, https://doi.org/10.1016/j.csda.2020.107151
[2] S. Persson, N. Welkenhuysen, S. Shashkova, S. Wiqvist, P. Reith, G. W. Schmidt, U. Picchini, M. Cvijovic (2021). PEPSDI: Scalable and flexible inference framework for stochastic dynamic single-cell models, bioRxiv doi:10.1101/2021.07.01.450748.
Clustering Financial Time Series using their Correlations and their Distribut... - Gautier Marti
We have designed a distance that takes into account both the correlation between the time series and also the distribution of the individual time series. A tutorial with Python code is available: https://www.datagrapple.com/Tech/GNPR-tutorial-How-to-cluster-random-walks.html
This talk was given at the Paris Machine Learning Meetup.
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013 - Christian Robert
These are the slides for my conference talk at the 2013 WSC, in the session "Jacob Bernoulli's 'Ars Conjectandi' and the emergence of probability" organised by Adam Jakubowski.
Professor Timoteo Carletti presented a seminar titled "A journey in the zoo of Turing patterns: the topology does matter" as part of the SMART Seminar Series on 8th March 2018.
More information: http://www.uoweis.co/event/a-journey-in-the-zoo-of-turing-patterns-the-topology-does-matter/
Keep updated with future events: http://www.uoweis.co/events/category/smart-infrastructure-facility/
We provide a comprehensive convergence analysis of the asymptotic preserving implicit-explicit particle-in-cell (IMEX-PIC) methods for the Vlasov–Poisson system with a strong magnetic field. This study is of utmost importance for understanding the behavior of plasmas in magnetic fusion devices such as tokamaks, where such a large magnetic field needs to be applied in order to keep the plasma particles on desired tracks.
In classical data analysis, data are single values. This is the case if you consider a dataset of n patients whose age and size you know. But what if you record the blood pressure or the weight of each patient during a day? Then, for each patient, you do not have a single-valued datum but a set of values, since blood pressure and weight are not constant during the day.
Suppose now that you do not want to record blood pressure a thousand times for each patient and store it all in a database, because your memory space is limited. You therefore need to aggregate each set of values into symbols: intervals (lower and upper bounds only), box plots, histograms or even distributions (a distribution law with mean and variance)...
Thus, the issue is to adapt classical statistical tools to symbolic data analysis. More precisely, this article aims at proposing a method to fit a regression on Gaussian distributions. The paper is organized as follows: first, it presents the computation of the maximum likelihood estimator, and then it compares the new approach with the usual least squares regression.
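As a toy illustration of the aggregation step described above, the following Python sketch (hypothetical readings and helper names, not code from the article) condenses one patient's raw measurements into an interval symbol and a Gaussian symbol (mean, variance):

```python
import statistics

def to_symbols(readings):
    """Aggregate one patient's raw readings into symbolic data:
    an interval (min, max) and a Gaussian symbol (mean, variance)."""
    interval = (min(readings), max(readings))
    gaussian = (statistics.mean(readings), statistics.pvariance(readings))
    return {"interval": interval, "gaussian": gaussian}

# hypothetical blood-pressure readings for one patient over a day
bp = [118, 122, 130, 125, 121, 119]
print(to_symbols(bp)["interval"])  # (118, 130)
```

A regression on Gaussian symbols, as in the article, would then be fit on the (mean, variance) pairs rather than on the raw values.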
Wavelet-based Reflection Symmetry Detection via Textural and Color Histograms - Mohamed Elawady
Conference: ICCV 2017 Workshop: Detecting Symmetry in the Wild, Venice, Italy
Source Code: http://github.com/mawady/ColorSymDetect/
Authors: M. Elawady, C. Ducottet, O. Alata, C. Barat, & P. Colantoni
Affiliation: Université de Lyon, CNRS, UMR 5516, Laboratoire Hubert Curien, Université de Saint-Étienne, Jean-Monnet, F-42000 Saint-Étienne, France
Similar to Learning from (dis)similarity data (20)
Richard's adventures in two entangled wonderlands - Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that were causing the silencing, through RNA-RNA interactions.
Types of RNAi (non-coding RNA)
miRNA
Length: 23-25 nt
Trans-acting
Binds the target mRNA with mismatches
Translation inhibition
siRNA
Length: 21 nt
Cis-acting
Binds the target mRNA with a perfectly complementary sequence
piRNA
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex that triggers mRNA degradation in response to dsRNA.
Unwinding of the double-stranded siRNA by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: endonuclease (RNase III family)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity)
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitors common gases, weather parameters, and particulates.
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... - Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been achieved using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed-tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes, over time and space, in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking and cell-cell interaction, as well as vascularization and tumor metastasis, in exceptional detail. This webinar also gives an overview of IVM as used in drug development, offering a view into the intricate interactions between drugs or nanoparticles and tissues in vivo, and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Seminar on U.V. Spectroscopy - SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
Nutraceutical market, scope and growth: Herbal drug technology - Lokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market (which includes goods like functional foods, drinks, and dietary supplements that provide health advantages beyond basic nutrition) is growing significantly. The industry is expanding quickly as healthcare expenses rise, the population ages, and people increasingly seek natural and preventative health solutions. Product formulation innovations and the use of cutting-edge technology for customized nutrition further drive market expansion. With its worldwide reach, the nutraceutical industry is expected to keep growing and to provide significant opportunities for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
1. Learning from (dis)similarity data
Nathalie Vialaneix
nathalie.vialaneix@inra.fr
http://www.nathalievialaneix.eu
MelbURN 2018
July 16th, 2018 - Melbourne, Australia
Nathalie Vialaneix | Learning from (dis)similarity data 1/24
2. What are my data like?
3. A medieval social network [Boulet et al., 2008, Rossi et al., 2013]
a corpus with more than 6,000 transactions, spanning 3 centuries, all related to Castelnau-Montratier
4. A medieval social network [Boulet et al., 2008, Rossi et al., 2013]
a corpus with more than 6,000 transactions, spanning 3 centuries, all related to Castelnau-Montratier
[Figure: bipartite graph of Individual and Transaction nodes, with named individuals such as Ratier (II) Castelnau, Jean Laperarede and Bernard Audoy]
a bipartite network with more than 17,000 nodes (∼ 10,000 individuals)
What can we learn about French medieval society?
5. Career paths [Olteanu and Villa-Vialaneix, 2015]
Survey “Génération 98”: labor market status (9 categories) of more than 16,000 people who graduated in 1998, followed during 94 months. [1]
1. Available thanks to Génération 1998 à 7 ans - 2005, [producer] CEREQ, [diffusion] Centre Maurice Halbwachs (CMH).
6. Career paths [Olteanu and Villa-Vialaneix, 2015]
Survey “Génération 98”: labor market status (9 categories) of more than 16,000 people who graduated in 1998, followed during 94 months.
How to cluster career paths into homogeneous groups?
7. Career paths [Olteanu and Villa-Vialaneix, 2015]
Survey “Génération 98”: labor market status (9 categories) of more than 16,000 people who graduated in 1998, followed during 94 months.
How to cluster career paths into homogeneous groups?
It is all about distance...
the χ² dissimilarity emphasizes contemporary identical situations
optimal matching dissimilarities are more focused on similarities between the sequences [Needleman and Wunsch, 1970] (also called “edit distance” or “Levenshtein distance”)
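The optimal matching (edit) distance mentioned above is a standard dynamic program; a minimal Python sketch (the unit costs and example sequences are illustrative, not those used in the survey analysis):

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Levenshtein / optimal-matching distance between two state
    sequences (e.g. monthly labor-market status codes)."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * indel_cost          # delete all of a[:i]
    for j in range(m + 1):
        d[0][j] = j * indel_cost          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel_cost,      # deletion
                          d[i][j - 1] + indel_cost,      # insertion
                          d[i - 1][j - 1] + cost)        # match/substitution
    return d[n][m]

# two hypothetical 6-month paths over status codes
print(edit_distance("AAABBC", "AABBBC"))  # 1
```

Varying `sub_cost` and `indel_cost` recovers the different optimal-matching dissimilarities used for sequence data.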
8. and then I went into NGS data...
and again...
distances are everywhere
9. a collection of NGS data...
DNA barcoding
Astraptes fulgerator
optimal matching (edit) distances to differentiate species
10. a collection of NGS data...
DNA barcoding
Astraptes fulgerator
optimal matching (edit) distances to differentiate species
Hi-C data
pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale
11. a collection of NGS data...
DNA barcoding
Astraptes fulgerator
optimal matching (edit) distances to differentiate species
Hi-C data
pairwise measure (similarity) related to the physical 3D distance between loci in the cell, at genome scale
Metagenomics
dissimilarity between samples is better captured when the phylogeny between species is taken into account (UniFrac distances)
13. Basics on (standard) stochastic SOM
[Kohonen, 2001]
[Figure: a SOM grid, with observations x mapped to its units]
observations $(x_i)_{i=1,\dots,n} \subset \mathbb{R}^d$ are affected to a unit $f(x_i) \in \{1, \dots, U\}$
the grid is equipped with a “distance” between units, $d(u, u')$, and observations affected to close units are close in $\mathbb{R}^d$
every unit $u$ corresponds to a prototype $p_u$ in $\mathbb{R}^d$
14. Basics on (standard) stochastic SOM
[Kohonen, 2001]
Iterative learning (assignment step): $x_i$ is picked at random within $(x_k)_k$ and affected to the best matching unit:
$f^t(x_i) = \arg\min_{u=1,\dots,U} \|x_i - p_u^t\|^2$
15. Basics on (standard) stochastic SOM
[Kohonen, 2001]
Iterative learning (representation step): all prototypes in neighboring units are updated with a gradient-descent-like step:
$p_u^{t+1} \leftarrow p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
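The two steps above can be sketched in a few lines of Python (a didactic reimplementation, not SOMbrero's code; the function name and parameters are illustrative):

```python
import math
import random

def som_step(X, prototypes, grid, lr, radius):
    """One stochastic SOM iteration: pick an observation, find its best
    matching unit (assignment), then pull the prototypes of neighboring
    units towards it (representation)."""
    xi = random.choice(X)
    # assignment step: best matching unit minimizes ||xi - p_u||^2
    bmu = min(range(len(prototypes)),
              key=lambda u: sum((a - b) ** 2 for a, b in zip(xi, prototypes[u])))
    # representation step: Gaussian neighborhood H^t of the BMU on the grid
    for u, p in enumerate(prototypes):
        d = math.dist(grid[bmu], grid[u])
        h = math.exp(-d ** 2 / (2 * radius ** 2))
        prototypes[u] = [pk + lr * h * (xk - pk) for pk, xk in zip(p, xi)]
    return bmu

# toy example: two observations on a 2-unit grid
X = [(0.0, 0.0), (1.0, 1.0)]
grid = [(0, 0), (1, 0)]
protos = [[0.5, 0.5], [0.5, 0.5]]
bmu = som_step(X, protos, grid, lr=0.5, radius=1.0)
```

In a full run, `lr` (the step µ(t)) and `radius` (the neighborhood width) would both decrease over iterations.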
16. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathbb{R}^d$
1: Initialization: randomly set $p_1^0, \dots, p_U^0$ in $\mathbb{R}^d$
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} \|x_i - p_u^t\|^2$
5: for all u = 1 → U do (Representation)
6: $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7: end for
8: end for
17. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: randomly set $p_1^0, \dots, p_U^0$ in $\mathbb{R}^d$
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} \|x_i - p_u^t\|^2$
5: for all u = 1 → U do (Representation)
6: $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7: end for
8: end for
18. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: $p_u^0 = \sum_{i=1}^{n} \beta_{ui}^0 x_i$ (convex combination)
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} \|x_i - p_u^t\|^2$
5: for all u = 1 → U do (Representation)
6: $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7: end for
8: end for
19. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: $p_u^0 = \sum_{i=1}^{n} \beta_{ui}^0 x_i$ (convex combination)
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} D(p_u^t, x_i)$
5: for all u = 1 → U do (Representation)
6: $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7: end for
8: end for
20. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: $p_u^0 = \sum_{i=1}^{n} \beta_{ui}^0 x_i$ (convex combination)
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} D(p_u^t, x_i)$
5: for all u = 1 → U do (Representation)
6: $p_u^{t+1} = p_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (x_i - p_u^t)$
7: end for
8: end for
21. Extension of SOM to data described by a kernel or a dissimilarity
[Olteanu and Villa-Vialaneix, 2015]
Data: $(x_i)_{i=1,\dots,n} \in \mathcal{X}$
1: Initialization: $p_u^0 = \sum_{i=1}^{n} \beta_{ui}^0 x_i$ (convex combination)
2: for t = 1 → T do
3: pick at random i ∈ {1, . . . , n}
4: Assignment: $f^t(x_i) = \arg\min_{u=1,\dots,U} (\beta_u^t)^\top D(\cdot, x_i) - \frac{1}{2} (\beta_u^t)^\top D \beta_u^t$
5: for all u = 1 → U do (Representation)
6: $\beta_u^{t+1} = \beta_u^t + \mu(t)\, H^t(d(f^t(x_i), u))\, (\mathbf{1}_i - \beta_u^t)$
7: end for
8: end for
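One iteration of the final (relational) algorithm above can be sketched in Python, working only from the pairwise dissimilarity matrix D (didactic code, not the authors' implementation; the names and the toy matrix are illustrative):

```python
import math

def rsom_step(D, beta, i, grid, lr, radius):
    """One relational SOM iteration for observation i: prototypes are the
    convex combinations beta[u] of the observations and are never formed
    explicitly; assignment uses only the dissimilarity matrix D."""
    U, n = len(beta), len(D)

    def cost(u):
        # (beta_u)^T D(., x_i) - 1/2 (beta_u)^T D beta_u
        b = beta[u]
        bD = sum(b[k] * D[k][i] for k in range(n))
        bDb = sum(b[k] * D[k][l] * b[l] for k in range(n) for l in range(n))
        return bD - 0.5 * bDb

    bmu = min(range(U), key=cost)  # assignment step
    for u in range(U):
        # representation step: beta_u moves towards the indicator vector 1_i
        d = math.dist(grid[bmu], grid[u])
        h = lr * math.exp(-d ** 2 / (2 * radius ** 2))
        beta[u] = [(1 - h) * b + h * (1.0 if k == i else 0.0)
                   for k, b in enumerate(beta[u])]
    return bmu

# toy dissimilarity matrix for 3 observations and a 2-unit grid
D = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
beta = [[1 / 3] * 3, [1 / 3] * 3]
grid = [(0, 0), (1, 0)]
bmu = rsom_step(D, beta, i=0, grid=grid, lr=0.5, radius=1.0)
```

Note that the update keeps each beta[u] a convex combination (its entries stay non-negative and sum to 1) as long as the effective step h stays in [0, 1].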
22. Note on drawbacks of RSOM
Two main drawbacks:
For T ∼ γn iterations, the complexity of RSOM is O(γn³U) (compared to O(γUdn) for the numeric case) [Rossi, 2014]
23. Note on drawbacks of RSOM
Two main drawbacks:
For T ∼ γn iterations, the complexity of RSOM is O(γn³U) (compared to O(γUdn) for the numeric case) [Rossi, 2014]
An exact solution proposed in [Mariette et al., 2017] reduces the complexity to O(γn²U), with an additional storage cost of O(Un)
24. Note on drawbacks of RSOM
Two main drawbacks:
For T ∼ γn iterations, the complexity of RSOM is O(γn³U) (compared to O(γUdn) for the numeric case) [Rossi, 2014]
An exact solution proposed in [Mariette et al., 2017] reduces the complexity to O(γn²U), with an additional storage cost of O(Un)
For the non-Euclidean case, the learning algorithm can be very unstable (saddle points)
25. Note on drawbacks of RSOM
Two main drawbacks:
For T ∼ γn iterations, the complexity of RSOM is O(γn³U) (compared to O(γUdn) for the numeric case) [Rossi, 2014]
An exact solution proposed in [Mariette et al., 2017] reduces the complexity to O(γn²U), with an additional storage cost of O(Un)
For the non-Euclidean case, the learning algorithm can be very unstable (saddle points)
clip or flip? [Chen et al., 2009]
26. SOMbrero
[Villa-Vialaneix, 2017]
SOMbrero is an R package implementing stochastic variants of SOM for non-vectorial data
Specifically well adapted to...
non-expert use and teaching
use with graphs, to obtain simplified representations
first release: March 2013; latest release: Feb. 2018 (version 1.2.3)
depends on R (version ≥ 3.1.0) http://www.r-project.org and on several packages available on CRAN: wordcloud, igraph, RColorBrewer, scatterplot3d, knitr, shiny
available at https://cran.r-project.org/package=SOMbrero (licence GPL) and can be installed from inside R using install.packages("SOMbrero")
27. Training
mysom <- trainSOM(iris[ ,1:4], ...)
Options to train the SOM:
grid: square grid, with arbitrary width and length
distance between units: standard distances as in dist, or "letremy" (Euclidean, then "maximum")
neighborhood relationship: Gaussian or "letremy"
prototypes: initialized randomly, with a PCA, or with random observations from the training sample
preprocessing: centering, scaling to unit variance, or nothing
training: number of iterations, standard or Heskes’s assignment step
$f^t(x_i) \leftarrow \arg\min_{u=1,\dots,U} \sum_{u'=1}^{U} H^t(d(u, u'))\, \|x_i - p_{u'}^{t-1}\|^2$
28. Diagnostic tools
quality(mysom)
topographic error: average frequency (over the samples) with which the second-closest prototype belongs to the direct neighborhood, on the grid, of the BMU
quantization error:
$Q = \frac{1}{n} \sum_{i=1}^{n} \|x_i - p_{f(x_i)}\|^2$
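The quantization error above translates directly to code; a minimal Python sketch (illustrative helper names, not SOMbrero's API):

```python
def quantization_error(X, prototypes, assign):
    """Q = (1/n) * sum_i ||x_i - p_{f(x_i)}||^2: the mean squared distance
    between each observation and the prototype of its best matching unit."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, prototypes[assign(x)]))
               for x in X) / len(X)

# toy map with two prototypes; assign picks the nearest one
protos = [[0.0, 0.0], [2.0, 2.0]]
nearest = lambda x: min(range(len(protos)),
                        key=lambda u: sum((a - b) ** 2 for a, b in zip(x, protos[u])))
print(quantization_error([[0.0, 1.0], [2.0, 1.0]], protos, nearest))  # 1.0
```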
31. Start with SOMbrero
3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”)
32. Start with SOMbrero
3 datasets corresponding to the three types of data that SOMbrero can handle (iris, presidentielles2002 and lesmis, a graph from “Les Misérables”)
comprehensive (HTML) vignettes included in the package and available on the website
33. Start with SOMbrero
3 datasets corresponding to the three types of data that SOMbrero
can handle (iris, presidentielles2002 and lesmis, a graph from
“Les Misérables”)
comprehensive (HTML) vignettes included in the package and
available on the website
Web User Interface (made with shiny) for using the package even if
you do not know the R programming language (included in the package;
launch it with sombreroGUI()). Tested and approved on a historian!
34. RSOM for mining a medieval social network
with the heat kernel
[Figure: SOM clustering of the medieval social network; nodes are individuals and transactions, labeled with names such as Ratier (II) Castelnau, Jean Laperarede, Bernard Audoy, Guilhem Bernard Prestis and Jean Roquefeuil]
[Boulet et al., 2008]
Graph induced by clusters:
has nice relations with space and time
emphasizes leading people
has helped to identify problems in the
database (namesakes)
But: biggest communities are still
very complex
35. RSOM for typology of Astraptes fulgerator from DNA
barcoding
Edit distances between DNA sequences [Olteanu and Villa-Vialaneix, 2015]
Almost perfect clustering (identifying a possible labeling error on one sample),
with additional information on the relations between species.
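The edit (Levenshtein) distance underlying such dissimilarities can be sketched with standard dynamic programming. This is an illustration of the kind of distance fed to the relational SOM, not the exact distance used in [Olteanu and Villa-Vialaneix, 2015]:

```python
def edit_distance(a, b):
    """Levenshtein edit distance between two sequences (e.g. DNA strings):
    minimum number of insertions, deletions and substitutions turning a into b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all of a[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[n][m]
```

The resulting pairwise distance matrix, not the sequences themselves, is what the relational SOM consumes.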
36. RSOM for typology of school-to-time transitions
Edit distance between 12,000 categorical time series
37. Also in SOMbrero: KORRESP
[Cottrell and Letrémy, 2005]
Data: contingency table $T = (n_{ij})_{ij}$ with p rows and q columns, transformed into a numeric dataset X that stacks row profiles, column profiles and their augmented versions:
reduced profiles: $\forall\, i = 1, \dots, p$ and $\forall\, j = 1, \dots, q$, $x_{ij} = \frac{n_{ij}}{n_{i\cdot}} \times \frac{n}{n_{\cdot j}}$
augmented profiles: $\forall\, i = 1, \dots, p$ and $\forall\, j = q+1, \dots, q+p$, $x_{ij} = x_{k(i)+p,\, j}$ with $k(i) = \arg\max_{k=1,\dots,q} x_{ik}$
assignment uses the reduced profile
representation uses the augmented profile
row profiles and column profiles are processed alternately
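The reduced-profile scaling can be sketched as follows. This is a sketch of the formula on the slide only (the augmented-profile stacking and the alternation between rows and columns are omitted), not SOMbrero's implementation:

```python
import numpy as np

def korresp_profiles(T):
    """Reduced profiles of a contingency table, following the slide:
    x_ij = (n_ij / n_i.) * (n / n_.j)."""
    T = np.asarray(T, dtype=float)
    n = T.sum()                           # grand total n
    row = T.sum(axis=1, keepdims=True)    # row margins n_i.
    col = T.sum(axis=0, keepdims=True)    # column margins n_.j
    return (T / row) * (n / col)
```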
40. Also available in SOMbrero
mysom <- trainSOM(presidentielles2002, type = "korresp")
plot(mysom, what = "obs", type = "names")
41. SOMbrero
Madalina Olteanu,
Fabrice Rossi, Marie Cottrell,
Laura Bendhaïba and
Julien Boelaert
SOMbrero and mixKernel
Jérôme Mariette
adjclust
Pierre Neuvial, Guillem Rigaill, Christophe Ambroise and
Shubham Chaturvedi
42. Don’t miss useR! 2019
user2019.r-project.org
43. Credits for pictures
Slide 2: Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae,
Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
Slide 3: Picture of Castelnau Montratier from
https://commons.wikimedia.org/wiki/File:
Place_Gambetta,_Castelnau-Montratier.JPG by Duch.seb CC BY-SA 3.0
Slide 4: image based on ENCODE project, by Darryl Leja (NHGRI), Ian Dunham
(EBI) and Michael Pazin (NHGRI)
Slide 6: Astraptes picture is from
https://www.flickr.com/photos/39139121@N00/2045403823/ by Anne Toal
(CC BY-SA 2.0), Hi-C experiment is taken from the article Matharu et al., 2015
DOI:10.1371/journal.pgen.1005640 (CC BY-SA 4.0) and metagenomics illustration is
taken from the article Sommer et al., 2010 DOI:10.1038/msb.2010.16 (CC BY-NC-SA
3.0)
Slide 12: TADS picture is from the article Fraser et al., 2015
DOI:10.15252/msb.20156492 (CC BY-SA 4.0)
44. References
Boulet, R., Jouve, B., Rossi, F., and Villa, N. (2008).
Batch kernel SOM and related Laplacian methods for social network analysis.
Neurocomputing, 71(7-9):1257–1273.
Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009).
Similarity-based classification: concepts and algorithms.
Journal of Machine Learning Research, 10:747–776.
Cottrell, M. and Letrémy, P. (2005).
How to use the Kohonen algorithm to simultaneously analyse individuals in a survey.
Neurocomputing, 63:193–207.
Kohonen, T. (2001).
Self-Organizing Maps, 3rd Edition, volume 30.
Springer, Berlin, Heidelberg, New York.
Mariette, J., Rossi, F., Olteanu, M., and Villa-Vialaneix, N. (2017).
Accelerating stochastic kernel SOM.
In Verleysen, M., editor, XXVth European Symposium on Artificial Neural Networks, Computational Intelligence and Machine
Learning (ESANN 2017), pages 269–274, Bruges, Belgium. i6doc.
Needleman, S. and Wunsch, C. (1970).
A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology, 48(3):443–453.
Olteanu, M. and Villa-Vialaneix, N. (2015).
On-line relational and multiple relational SOM.
Neurocomputing, 147:15–30.
Rossi, F. (2014).
How many dissimilarity/kernel self organizing map variants do we need?
In Villmann, T., Schleif, F., Kaden, M., and Lange, M., editors, Advances in Self-Organizing Maps and Learning Vector
Quantization (Proceedings of WSOM 2014), volume 295 of Advances in Intelligent Systems and Computing, pages 3–23,
Mittweida, Germany. Springer Verlag, Berlin, Heidelberg.
Rossi, F., Villa-Vialaneix, N., and Hautefeuille, F. (2013).
Exploration of a large database of French notarial acts with social network methods.
Digital Medievalist, 9.
Villa-Vialaneix, N. (2017).
Stochastic self-organizing map variants with the R package SOMbrero.
In Lamirel, J., Cottrell, M., and Olteanu, M., editors, 12th International Workshop on Self-Organizing Maps and Learning Vector
Quantization, Clustering and Data Visualization (Proceedings of WSOM 2017), Nancy, France. IEEE.