1/52
MASSICCC: A SaaS Platform for
Clustering and Co-Clustering of Mixed Data
https://massiccc.lille.inria.fr/
F. Laporte
with B. Auder, C. Biernacki, G. Celeux, J. Demont, F. Langrognet, V. Kubicki, C. Poli, J. Renault, S. Iovleff
May 29th 2019, TechTalk, Paris
2/52
Outline
1 Introduction
2 Model-based clustering
3 Mixmod in MASSICCC
4 MixtComp in MASSICCC
5 BlockCluster in MASSICCC
6 Conclusion
3/52
MASSICCC?
massiccc.lille.inria.fr
SaaS: Software as a Service
4/52
MASSICCC: Examples of Applications
Market sales
Cities’ similarities
Predictive maintenance
Health
Data mining
Large dataset
Complex dataset
5/52
MASSICCC??
A high-quality, easy-to-use web platform
through which mature research software for clustering (and more) is transferred
to (non-academic) professionals
6/52
Here is the computer you need!
7/52
Clustering?
Detect hidden structures in data sets
[Figure: two scatter plots, each with the 1st MCA axis (horizontal) and the 2nd MCA axis (vertical); legend: Low income, Average income, High income]
8/52
Large data sets¹
¹ S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
9/52
An opportunity for detecting weak signal
10/52
Today's features: full mixed/missing
11/52
Notations
Data: n individuals: $x = \{x_1, \ldots, x_n\} = \{x^O, x^M\}$ in a space $\mathcal{X}$ of dimension $d$
Observed individuals $x^O$
Missing individuals $x^M$
Aim: estimation of the partition $z$ and the number of clusters $K$
Partition in $K$ clusters $G_1, \ldots, G_K$: $z = (z_1, \ldots, z_n)$, $z_i = (z_{i1}, \ldots, z_{iK})$
$x_i \in G_k \Leftrightarrow z_{ih} = \mathbb{I}_{\{h=k\}}$
Mixed, missing, uncertain
Individuals x                                          Partition z   ⇔   Group
?            0.5            red            5           ? ? ?         ⇔   ???
0.3          0.1            green          3           ? ? ?         ⇔   ???
0.3          0.6            {red,green}    3           ? ? ?         ⇔   ???
0.9          [0.25 0.45]    red            ?           ? ? ?         ⇔   ???
↓            ↓              ↓              ↓
continuous   continuous     categorical    integer
12/52
2 Model-based clustering
13/52
Parametric mixture model
Parametric assumption:
$$p_k(x_1) = p(x_1; \alpha_k)$$
thus
$$p(x_1) = p(x_1; \theta) = \sum_{k=1}^{K} \pi_k \, p(x_1; \alpha_k)$$
Mixture parameter:
$$\theta = (\pi, \alpha) \quad \text{with} \quad \alpha = (\alpha_1, \ldots, \alpha_K)$$
Model: it includes both the family $p(\cdot; \alpha_k)$ and the number of groups $K$
$$m = \{p(x_1; \theta) : \theta \in \Theta\}$$
The number of free continuous parameters is given by
$$\nu = \dim(\Theta)$$
Clustering becomes a well-posed problem. . .
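As a minimal sketch of the mixture density above, assume a univariate Gaussian component family, so that each $\alpha_k$ is a (mean, standard deviation) pair. The function names and the numeric values are illustrative and do not come from the slides.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density, used here as the family p(.; alpha_k)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, proportions, alphas):
    """p(x; theta) = sum_k pi_k * p(x; alpha_k), with alpha_k = (mean_k, sd_k)."""
    return sum(pi_k * gaussian_pdf(x, mean, sd)
               for pi_k, (mean, sd) in zip(proportions, alphas))


def n_free_parameters(K):
    """nu = dim(Theta) for this family: (K - 1) proportions, K means, K std devs."""
    return (K - 1) + 2 * K


# Illustrative two-component model (values chosen arbitrarily).
proportions = [0.3, 0.7]
alphas = [(0.0, 1.0), (4.0, 0.5)]
density_at_zero = mixture_pdf(0.0, proportions, alphas)
```

The count returned by `n_free_parameters` is the quantity that model selection criteria such as BIC penalise.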
14/52
The clustering process in mixtures
1 Estimation of $\theta$ by $\hat\theta$
2 Estimation of the conditional probability that $x_i \in G_k$
$$t_{ik}(\hat\theta) = p(Z_{ik} = 1 \mid X_i = x_i; \hat\theta) = \frac{\hat\pi_k \, p(x_i; \hat\alpha_k)}{p(x_i; \hat\theta)}$$
3 Estimation of $z_i$ by maximum a posteriori (MAP)
$$\hat z_{ik} = \mathbb{I}_{\{k = \arg\max_{h=1,\ldots,K} t_{ih}(\hat\theta)\}}$$
4 Model selection: BIC, ICL, ...
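Steps 2 to 4 can be sketched as follows, again assuming univariate Gaussian components. The helper names are hypothetical, and the BIC sign convention used here ($-2\ell + \nu \ln n$, smaller is better) is one common choice rather than something stated on the slide.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density (assumed component family)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(x_i, proportions, alphas):
    """t_ik = pi_k p(x_i; alpha_k) / p(x_i; theta) for k = 1..K."""
    weighted = [pi_k * gaussian_pdf(x_i, mean, sd)
                for pi_k, (mean, sd) in zip(proportions, alphas)]
    total = sum(weighted)  # this is the mixture density p(x_i; theta)
    return [w / total for w in weighted]


def map_label(t_i):
    """Index k maximising t_ik (maximum a posteriori assignment)."""
    return max(range(len(t_i)), key=lambda k: t_i[k])


def bic(log_likelihood, n_params, n):
    """BIC as -2 * log-likelihood + nu * ln(n); smaller values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n)
```

In practice the same functions are evaluated for several values of $K$, and the model with the best criterion value is retained.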
15/52
3 Mixmod in MASSICCC
16/52
Possible datasets
Continuous data
14 models on the correlation structure between variables (model selection)
Categorical data
Conditional independence
Mixed data
Conditional independence inter-type
Conditional independence intra-type (symmetry between types)
Distributions
Continuous: Gaussian
Categorical: Multinomial
17/52
Estimation of θ
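Maximize the complete-data log-likelihood over $(\theta, z)$ (CEM)
$$\ell_c(\theta; x, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \ln\{\pi_k \, p(x_i; \alpha_k)\}$$
Maximize the observed-data log-likelihood over $\theta$ (EM)
$$\ell(\theta; x) = \sum_{i=1}^{n} \ln p(x_i; \theta)$$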
18/52
Principle of EM and CEM
Initialization: $\theta^0$
Iteration no. $q$:
Step E: estimate probabilities $t^q = \{t_{ik}(\theta^q)\}$
Step C (CEM only): classify by setting $t^q = \mathrm{MAP}(\{t_{ik}(\theta^q)\})$
Step M: maximize $\theta^{q+1} = \arg\max_\theta \ell_c(\theta; x, t^q)$
Stopping rule: iteration number or criterion stability
Properties
⊕: simplicity, monotonicity, low memory requirement
⊖: local maxima (depends on $\theta^0$), linear convergence (EM)
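The E, C and M steps can be sketched for a univariate Gaussian mixture as below. This is a simplified illustration, not the Mixmod implementation: the function names, the `min_sd` floor and the fixed iteration count are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def e_step(data, pis, means, sds):
    """Step E: t_ik proportional to pi_k p(x_i; alpha_k)."""
    t = []
    for x in data:
        w = [p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, means, sds)]
        total = sum(w)
        t.append([v / total for v in w])
    return t


def c_step(t):
    """Step C: replace each row of t by its MAP indicator vector."""
    hard = []
    for row in t:
        best = max(range(len(row)), key=lambda k: row[k])
        hard.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return hard


def m_step(data, t, min_sd=1e-3):
    """Step M: weighted proportions, means and standard deviations."""
    n, K = len(data), len(t[0])
    pis, means, sds = [], [], []
    for k in range(K):
        nk = max(sum(row[k] for row in t), 1e-12)  # guard against an empty cluster
        mean = sum(row[k] * x for row, x in zip(t, data)) / nk
        var = sum(row[k] * (x - mean) ** 2 for row, x in zip(t, data)) / nk
        pis.append(nk / n)
        means.append(mean)
        sds.append(max(math.sqrt(var), min_sd))  # floor avoids a degenerate component
    return pis, means, sds


def log_likelihood(data, pis, means, sds):
    """Observed-data log-likelihood sum_i ln p(x_i; theta)."""
    return sum(math.log(sum(p * gaussian_pdf(x, m, s)
                            for p, m, s in zip(pis, means, sds))) for x in data)


def fit(data, pis, means, sds, n_iter=50, classify=False):
    """EM when classify is False, CEM when classify is True."""
    for _ in range(n_iter):
        t = e_step(data, pis, means, sds)
        if classify:
            t = c_step(t)
        pis, means, sds = m_step(data, t)
    return pis, means, sds
```

Running `fit` from several starting values $\theta^0$ and keeping the best criterion value is the usual way to limit the dependence on initialization.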
19/52
Prostate cancer data (without missing data)
Individuals: n = 475 patients with prostatic cancer grouped on clinical criteria
into two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates were measured on each patient, comprising
eight continuous variables (age, weight, systolic blood pressure, diastolic blood
pressure, serum haemoglobin, size of primary tumour, index of tumour stage and
histological grade, serum prostatic acid phosphatase) and four categorical variables
with various numbers of levels (performance rating, cardiovascular disease history,
electrocardiogram code, bone metastases)
Model: cond. indep. $p(x_1; \alpha_k) = p(x_1; \alpha_k^{\mathrm{cont}}) \cdot p(x_1; \alpha_k^{\mathrm{cat}})$
20/52
Mixed data
21/52
Why should I use mixmod?
Advantage(s)
Compare many different models (continuous data)
Analyse mixed data
Disadvantage(s)
Does not handle missing data
Handles only continuous or categorical data
22/52
4 MixtComp in MASSICCC
23/52
Full mixed data: conditional independence everywhere²
The aim is to combine continuous, categorical, integer, ordinal, ranking and
functional data
$$x_1 = (x_1^{\mathrm{cont}}, x_1^{\mathrm{cat}}, x_1^{\mathrm{int}}, \ldots)$$
The proposed solution is to mix all types by inter-type conditional independence
$$p(x_1; \alpha_k) = p(x_1^{\mathrm{cont}}; \alpha_k^{\mathrm{cont}}) \times p(x_1^{\mathrm{cat}}; \alpha_k^{\mathrm{cat}}) \times p(x_1^{\mathrm{int}}; \alpha_k^{\mathrm{int}}) \times \ldots$$
In addition, for symmetry between types, intra-type conditional independence
Only need to define the univariate pdf for each variable type!
Continuous: Gaussian
Categorical: multinomial
Integer: Poisson
...
² MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
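Under conditional independence, the class-conditional log-density of a mixed individual is a sum of univariate terms, one per variable. The sketch below assumes one continuous, one categorical and one integer variable; the dictionary layout of `alpha_k` and the function names are hypothetical and are not the MixtComp interface.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log-density of a univariate Gaussian (continuous variable)."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def categorical_logpmf(level, probs):
    """Log-probability of one level of a multinomial (categorical variable)."""
    return math.log(probs[level])


def poisson_logpmf(count, rate):
    """Log-probability of a Poisson count (integer variable)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def class_log_density(x, alpha_k):
    """ln p(x; alpha_k) as a sum of per-type univariate log-densities."""
    cont, cat, integer = x
    return (gaussian_logpdf(cont, *alpha_k["cont"])
            + categorical_logpmf(cat, alpha_k["cat"])
            + poisson_logpmf(integer, alpha_k["int"]))


# Illustrative parameters for one cluster (values chosen arbitrarily).
alpha_k = {"cont": (0.3, 0.2), "cat": {"red": 0.7, "green": 0.3}, "int": 4.0}
value = class_log_density((0.3, "red", 3), alpha_k)
```

Adding a new data type then only requires one more univariate log-density and one more term in the sum.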
24/52
Missing data: MAR assumption and estimation
Assumption on the missingness mechanism
Missing At Random (MAR): the probability that a variable is missing does not
depend on its own value given the observed variables.
Observed log-likelihood (under the MAR assumption)
$$\ell(\theta; x^O) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, p(x_i^O; \alpha_k) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \int_{x_i^M} p(x_i^O, x_i^M; \alpha_k) \, dx_i^M \right)$$
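When variables are conditionally independent within a cluster, integrating a product density over the missing coordinates simply drops their factors. The sketch below uses that fact for continuous variables, with `None` marking a missing value; the Gaussian family, the data layout and the function names are assumptions made for the illustration.

```python
import math


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def observed_class_density(x_i, alpha_k):
    """p(x_i^O; alpha_k): product over the observed coordinates only.

    alpha_k is a list of (mean, sd) pairs, one per variable. A missing
    coordinate integrates to 1, so it contributes no factor.
    """
    density = 1.0
    for value, (mean, sd) in zip(x_i, alpha_k):
        if value is not None:
            density *= gaussian_pdf(value, mean, sd)
    return density


def observed_log_likelihood(data, proportions, alphas):
    """l(theta; x^O) = sum_i log sum_k pi_k p(x_i^O; alpha_k)."""
    total = 0.0
    for x_i in data:
        total += math.log(sum(pi_k * observed_class_density(x_i, alpha_k)
                              for pi_k, alpha_k in zip(proportions, alphas)))
    return total
```

No imputation is needed to evaluate this likelihood, which is why the data can be uploaded without preprocessing.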
25/52
SEM algorithm³
A SEM algorithm to estimate $\theta$ by maximizing the observed-data log-likelihood
Initialisation: $\theta^{(0)}$
Iteration no. $q$:
E-step: compute conditional probabilities $p(x^M, z \mid x^O; \theta^{(q)})$
S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \theta^{(q)})$
M-step: maximize $\theta^{(q+1)} = \arg\max_\theta \ln p(x^O, x^{M(q)}, z^{(q)}; \theta)$
Stopping rule: iteration number
Properties: simpler than EM and interesting properties!
Avoids a possibly difficult E-step in an EM
Classical M-steps
Helps avoid getting trapped in local maxima
The mean of the sequence $(\theta^{(q)})$ approximates $\hat\theta$
The variance of the sequence $(\theta^{(q)})$ gives confidence intervals
³ MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
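A compact sketch of these iterations for continuous data with missing entries (`None`), assuming Gaussian variables that are independent within each cluster. It is not the MixtComp code: the function names, the burn-in length, the `min_sd` floor and the rule that keeps the previous parameters after a degenerate draw are all choices made for this illustration.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def s_step(data, pis, means, sds, rng):
    """Draw z_i given x_i^O, then each missing x_ij given z_i."""
    K = len(pis)
    completed, labels = [], []
    for row in data:
        weights = []
        for k in range(K):
            w = pis[k]
            for j, value in enumerate(row):
                if value is not None:  # only observed coordinates enter t_ik
                    w *= gaussian_pdf(value, means[k][j], sds[k][j])
            weights.append(w)
        z = rng.choices(range(K), weights=weights)[0]
        labels.append(z)
        completed.append([rng.gauss(means[z][j], sds[z][j]) if v is None else v
                          for j, v in enumerate(row)])
    return completed, labels


def m_step(completed, labels, K, previous, min_sd=0.05):
    """Per-cluster proportions, means and std devs on the completed data."""
    n, d = len(completed), len(completed[0])
    pis, means, sds = [], [], []
    for k in range(K):
        rows = [x for x, z in zip(completed, labels) if z == k]
        if len(rows) < 2:
            return previous  # degenerate draw: keep theta^(q) unchanged
        mean_k = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        sd_k = [max(min_sd, math.sqrt(sum((r[j] - mean_k[j]) ** 2 for r in rows) / len(rows)))
                for j in range(d)]
        pis.append(len(rows) / n)
        means.append(mean_k)
        sds.append(sd_k)
    return pis, means, sds


def sem(data, theta, n_iter=200, burn_in=50, seed=0):
    """Run SEM and return the post burn-in averages of proportions and means."""
    rng = random.Random(seed)
    K, d = len(theta[0]), len(data[0])
    kept = []
    for q in range(n_iter):
        completed, labels = s_step(data, theta[0], theta[1], theta[2], rng)
        theta = m_step(completed, labels, K, theta)
        if q >= burn_in:
            kept.append(theta)
    avg_pis = [sum(t[0][k] for t in kept) / len(kept) for k in range(K)]
    avg_means = [[sum(t[1][k][j] for t in kept) / len(kept) for j in range(d)]
                 for k in range(K)]
    return avg_pis, avg_means
```

The spread of the kept draws around these averages is what the last bullet above uses to build confidence intervals.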
26/52
Prostate cancer data (with missing data)⁴
Individuals: 506 patients with prostatic cancer grouped on clinical criteria into
two stages (3 and 4) of the disease
Variables: d = 12 pre-trial variates were measured on each patient, comprising
eight continuous variables (age, weight, systolic blood pressure, diastolic blood
pressure, serum haemoglobin, size of primary tumour, index of tumour stage and
histological grade, serum prostatic acid phosphatase) and four categorical variables
with various numbers of levels (performance rating, cardiovascular disease history,
electrocardiogram code, bone metastases)
Some missing data: 62 missing values (≈ 1%)
The classes (stages of the disease) are set aside when performing clustering
Questions
How many clusters?
Which partition?
⁴ Byar DP, Green SB (1980): Bulletin Cancer, Paris 67:477-488
27/52
Data upload without preprocessing
28/52
Run clustering analysis
29/52
Several quick result overviews. . . without post-processing
30/52
Variable significance on global partition
+ similarity between variables
31/52
Variable “SG” difference between clusters
32/52
Variable “BM” difference between clusters
33/52
Companies and MixtComp
Modal (Rougegorge)
Inriatech (Alstom, ArcelorMittal, Décathlon, ...)
DiagRAMS technologies (predictive maintenance)
34/52
Why should I use MixtComp?
Advantage(s)
Analyse different kinds of data
Handle missing and partially missing data
Disadvantage(s)
Does not use the correlation structure (even with continuous data)
35/52
5 BlockCluster in MASSICCC
36/52
High-dimensional (HD) data⁵
⁵ S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
37/52
From clustering to co-clustering
[Govaert, 2011]
38/52
Notations
$z_i$: the cluster of row $i$
$w_j$: the cluster of column $j$
$(z_i, w_j)$: the block of the element $x_{ij}$ (row $i$, column $j$)
$z = (z_1, \ldots, z_n)$: partition of individuals into $K$ clusters of rows
$w = (w_1, \ldots, w_d)$: partition of variables into $L$ clusters of columns
$(z, w)$: bi-partition of the whole data set $x$
The two partition spaces are denoted by $\mathcal{Z}$ and $\mathcal{W}$ respectively
Restriction
All variables are of the same kind (research in progress for overcoming that. . . )
39/52
MLE estimation: EM algorithm
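Observed log-likelihood: $\ell(\theta; x) = \log p(x; \theta)$
Complete log-likelihood:
$$\ell_c(\theta; x, z, w) = \log p(x, z, w; \theta) = \sum_{i,k} z_{ik} \log \pi_k + \sum_{j,l} w_{jl} \log \rho_l + \sum_{i,j,k,l} z_{ik} w_{jl} \log p(x_{ij}; \alpha_{kl})$$
E-step of EM (iteration $q$):
$$Q(\theta, \theta^{(q)}) = E[\ell_c(\theta; x, z, w) \mid x; \theta^{(q)}] = \sum_{i,k} \underbrace{p(z_i = k \mid x; \theta^{(q)})}_{t_{ik}^{(q)}} \ln \pi_k + \sum_{j,l} \underbrace{p(w_j = l \mid x; \theta^{(q)})}_{s_{jl}^{(q)}} \ln \rho_l + \sum_{i,j,k,l} \underbrace{p(z_i = k, w_j = l \mid x; \theta^{(q)})}_{e_{ijkl}^{(q)}} \ln p(x_{ij}; \alpha_{kl})$$
M-step of EM (iteration $q$): classical. For instance, for the Bernoulli case, it gives
$$\pi_k^{(q+1)} = \frac{\sum_i t_{ik}^{(q)}}{n}, \quad \rho_l^{(q+1)} = \frac{\sum_j s_{jl}^{(q)}}{d}, \quad \alpha_{kl}^{(q+1)} = \frac{\sum_{i,j} e_{ijkl}^{(q)} x_{ij}}{\sum_{i,j} e_{ijkl}^{(q)}}$$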
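The Bernoulli M-step can be sketched as below. Since the joint probabilities $e_{ijkl}$ are not available in closed form, this illustration substitutes the factorised form $e_{ijkl} \approx t_{ik} s_{jl}$, which is an assumption of the sketch (it matches a variational approximation) and not the exact EM update. The function name and the list-of-lists layout are hypothetical.

```python
def bernoulli_m_step(x, t, s):
    """Update (pi, rho, alpha) for a binary matrix x of size n x d.

    t[i][k] approximates p(z_i = k | x) and s[j][l] approximates p(w_j = l | x).
    e_ijkl is replaced by t[i][k] * s[j][l] (assumption of this sketch).
    """
    n, d = len(x), len(x[0])
    K, L = len(t[0]), len(s[0])
    pi = [sum(t[i][k] for i in range(n)) / n for k in range(K)]
    rho = [sum(s[j][l] for j in range(d)) / d for l in range(L)]
    alpha = []
    for k in range(K):
        row = []
        for l in range(L):
            num = sum(t[i][k] * s[j][l] * x[i][j] for i in range(n) for j in range(d))
            den = sum(t[i][k] * s[j][l] for i in range(n) for j in range(d))
            row.append(num / den)  # weighted proportion of ones in block (k, l)
        alpha.append(row)
    return pi, rho, alpha
```

Each $\alpha_{kl}$ is simply the weighted frequency of ones in block $(k, l)$, which makes the resulting block structure easy to read.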
40/52
MLE: intractable E step
$e_{ijkl}^{(q)}$ is usually intractable...
Consequence of the dependency between the $x_{ij}$'s (link between rows and columns)
Involves $K^n L^d$ computations (number of possible blocks)
Example: if $n = d = 20$ and $K = L = 2$ then about $10^{12}$ blocks
Example (cont'd): enumerating them takes about 33 years at 1,000 blocks/second, and still about 4 months at 100,000 blocks/second
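The size of this enumeration can be checked with a few lines of arithmetic. The helper names and the throughput values passed to them are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def n_configurations(n, d, K, L):
    """Number of (z, w) configurations: K**n * L**d."""
    return K ** n * L ** d


def enumeration_seconds(n, d, K, L, per_second):
    """Time needed to visit every configuration at a given throughput."""
    return n_configurations(n, d, K, L) / per_second


count = n_configurations(20, 20, 2, 2)  # 2**40, about 1.1e12
days_fast = enumeration_seconds(20, 20, 2, 2, 100_000) / SECONDS_PER_DAY
years_slow = enumeration_seconds(20, 20, 2, 2, 1_000) / SECONDS_PER_YEAR
```

The count doubles with every additional row or column, so even this small 20 × 20 example is already out of reach for exact computation.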
Alternatives to EM
Variational EM (numerical approx.): conditional independence assumption
p(z, w|x; θ) ≈ p(z|x; θ)p(w|x; θ)
SEM-Gibbs (stochastic approx.): replace E-step by a S-step approx. by Gibbs
z|x, w; θ and w|x, z; θ
41/52
Document clustering (1/2)
Mixture of 1033 medical summaries and 1398 aeronautics summaries
Rows: 2431 documents
Columns: words present in the documents (stop words excluded), giving 9275 unique words
Data matrix: document × word count matrix
Poisson model
42/52
Document clustering (2/2)
Results with 2×2 blocks
             Medline   Cranfield
Medline         1033           0
Cranfield          0        1398
43/52
Running BlockCluster
44/52
Running BlockCluster
45/52
Running BlockCluster
46/52
Running BlockCluster
47/52
Why should I use BlockCluster?
Advantage(s)
Co-clustering (HD data)
Disadvantage(s)
Analyse one kind of data at a time
48/52
6 Conclusion
49/52
Current work
mixmod
Missing Not At Random data (using a logit distribution) ⟹ missing values
impact the probability of missingness
Partnership between CMAP/INRIA/Traumabase
Work with C. Biernacki, G. Celeux, J. Josse and Y. Stroppa
MixtComp
Publish an R package on CRAN
With you?
New kind of data
Complex missingness
2D clusters’ plot
50/52
Use probabilistic modelling as a mathematical guideline
Use the MASSICCC platform for user-friendly implementation
User-friendly interpretation
"One for all" of clustering
Low computing requirements
Free software
https://massiccc.lille.inria.fr/
Also check R packages (https://cran.r-project.org/)
Rmixmod: https://cran.r-project.org/web/packages/Rmixmod/index.html
blockcluster: https://cran.r-project.org/web/packages/blockcluster/index.html
51/52
THANK YOU!
52/52

 
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 février
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 févrierWorkshop les bonnes pratiques pour scaler sur le marché américain - 18 février
Workshop les bonnes pratiques pour scaler sur le marché américain - 18 février
 
Workshop Financement par la CCIPARIS-IDF
Workshop Financement par la CCIPARIS-IDF Workshop Financement par la CCIPARIS-IDF
Workshop Financement par la CCIPARIS-IDF
 
Masterclass : les grands enjeux de la #Smartcity
Masterclass : les grands enjeux de la #SmartcityMasterclass : les grands enjeux de la #Smartcity
Masterclass : les grands enjeux de la #Smartcity
 
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentéeInria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
Inria Tech Talk : Améliorez vos applications de robotique & réalité augmentée
 
La Masterclass #RGPD #International @CNIL
La Masterclass #RGPD #International @CNILLa Masterclass #RGPD #International @CNIL
La Masterclass #RGPD #International @CNIL
 
Workshop IA : supercalculateur pour booster vos projets par GENCI
Workshop IA : supercalculateur pour booster vos projets par GENCIWorkshop IA : supercalculateur pour booster vos projets par GENCI
Workshop IA : supercalculateur pour booster vos projets par GENCI
 
Workshop Recrutement #Associés #Fondateurs
Workshop Recrutement #Associés #FondateursWorkshop Recrutement #Associés #Fondateurs
Workshop Recrutement #Associés #Fondateurs
 

Recently uploaded

Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Inria Tech Talk - Classifying complex data with MASSICCC

  • 2. MASSICCC: A SaaS Platform for Clustering and Co-Clustering of Mixed Data. https://massiccc.lille.inria.fr/ F. Laporte, with B. Auder, C. Biernacki, G. Celeux, J. Demont, F. Langrognet, V. Kubicki, C. Poli, J. Renault, S. Iovleff. May 29th 2019, TechTalk, Paris
  • 3. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 5. MASSICCC: Examples of Applications: market sales; cities' similarities; predictive maintenance; health; data mining; large datasets; complex datasets
  • 6. MASSICCC?? A high-quality, easy-to-use web platform through which mature research software for clustering (and more) is transferred to (non-academic) professionals
  • 7. Here is the computer you need!
  • 8. Clustering? Detect hidden structures in data sets. [Figure: two scatter plots of the same data on the 1st and 2nd MCA axes; the points are labelled Low income, Average income and High income]
  • 9. Large data sets [1]
    [1] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
  • 10. An opportunity for detecting weak signals
  • 11. Today's features: full mixed/missing
  • 12. Notations
    Data: $n$ individuals $x = \{x_1, \ldots, x_n\} = \{x^O, x^M\}$ in a space $X$ of dimension $d$
    Observed individuals: $x^O$
    Missing individuals: $x^M$
    Aim: estimation of the partition $z$ and of the number of clusters $K$
    Partition in $K$ clusters $G_1, \ldots, G_K$: $z = (z_1, \ldots, z_n)$, $z_i = (z_{i1}, \ldots, z_{iK})$, with $x_i \in G_k \Leftrightarrow z_{ih} = I_{\{h=k\}}$
    Mixed, missing, uncertain:
    continuous | continuous  | categorical | integer | partition z | group
    ?          | 0.5         | red         | 5       | ? ? ?       | ???
    0.3        | 0.1         | green       | 3       | ? ? ?       | ???
    0.3        | 0.6         | {red,green} | 3       | ? ? ?       | ???
    0.9        | [0.25 0.45] | red         | ?       | ? ? ?       | ???
  • 13. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 14. Parametric mixture model
    Parametric assumption: $p_k(x_1) = p(x_1; \alpha_k)$, thus $p(x_1) = p(x_1; \theta) = \sum_{k=1}^{K} \pi_k \, p(x_1; \alpha_k)$
    Mixture parameter: $\theta = (\pi, \alpha)$ with $\alpha = (\alpha_1, \ldots, \alpha_K)$
    Model: it includes both the family $p(\cdot; \alpha_k)$ and the number of groups $K$: $m = \{p(x_1; \theta) : \theta \in \Theta\}$
    The number of free continuous parameters is given by $\nu = \dim(\Theta)$
    Clustering becomes a well-posed problem...
  • 15. The clustering process in mixtures
    1. Estimation of $\theta$ by $\hat\theta$
    2. Estimation of the conditional probability that $x_i \in G_k$: $t_{ik}(\hat\theta) = p(Z_{ik} = 1 \mid X_i = x_i; \hat\theta) = \dfrac{\hat\pi_k \, p(x_i; \hat\alpha_k)}{p(x_i; \hat\theta)}$
    3. Estimation of $z_i$ by maximum a posteriori (MAP): $\hat z_{ik} = I_{\{k = \arg\max_{h=1,\ldots,K} t_{ih}(\hat\theta)\}}$
    4. Model selection: BIC, ICL, ...
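Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, assuming a univariate Gaussian family for $p(\cdot; \alpha_k)$ and parameter values chosen by hand; the function names and the toy data are hypothetical and are not part of MASSICCC.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_and_map(x, pi, means, sds):
    """Return t (n x K conditional probabilities) and the MAP labels."""
    # Numerator of t_ik: pi_k * p(x_i; alpha_k)
    weighted = pi * gaussian_pdf(x[:, None], means, sds)
    # Denominator: p(x_i; theta), the sum of the numerators over k
    t = weighted / weighted.sum(axis=1, keepdims=True)
    # MAP rule: each individual goes to the cluster with the largest t_ik
    return t, t.argmax(axis=1)

# Hypothetical data and parameters, for illustration only
x = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
t, z_hat = posterior_and_map(
    x,
    pi=np.array([0.5, 0.5]),
    means=np.array([-2.0, 2.0]),
    sds=np.array([1.0, 1.0]),
)
```

Each row of `t` sums to one, and `z_hat` holds the index of the most probable cluster for each individual.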
  • 16. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 17. Possible datasets
    Continuous data: 14 models on the correlation matrix between variables (model selection)
    Categorical data: conditional independence
    Mixed data: conditional independence inter-type; conditional independence intra-type (symmetry between types)
    Distributions: Gaussian for continuous data; multinomial for categorical data
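  • 18. Estimation of $\theta$
    CEM: maximize the complete log-likelihood over $(\theta, z)$: $\ell_c(\theta; x, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \ln\{\pi_k \, p(x_i; \alpha_k)\}$
    EM: maximize the observed log-likelihood over $\theta$: $\ell(\theta; x) = \sum_{i=1}^{n} \ln p(x_i; \theta)$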
  • 19. Principle of EM and CEM
    Initialization: $\theta^0$
    Iteration no. $q$:
      Step E: estimate the probabilities $t^q = \{t_{ik}(\theta^q)\}$
      Step C (CEM only): classify by setting $t^q = \mathrm{MAP}(\{t_{ik}(\theta^q)\})$
      Step M: maximize $\theta^{q+1} = \arg\max_\theta \ell_c(\theta; x, t^q)$
    Stopping rule: iteration number or criterion stability
    Properties
      ⊕: simplicity, monotonicity, low memory requirement
      ⊖: local maxima (depends on $\theta^0$), linear convergence (EM)
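The E, C and M steps can be sketched as follows. This is an illustrative implementation for a univariate Gaussian mixture, not the Mixmod code: the function name `fit_mixture`, the fixed iteration count used as stopping rule, the variance floor and the simulated data are all assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fit_mixture(x, means0, n_iter=50, classify=False):
    """EM (classify=False) or CEM (classify=True) for a 1-D Gaussian mixture."""
    K = len(means0)
    pi = np.full(K, 1.0 / K)
    means = np.asarray(means0, dtype=float).copy()
    sds = np.full(K, x.std())
    for _ in range(n_iter):                      # stopping rule: iteration number
        # Step E: conditional probabilities t_ik(theta^q)
        w = pi * gaussian_pdf(x[:, None], means, sds)
        t = w / w.sum(axis=1, keepdims=True)
        # Step C (CEM only): replace t by its MAP, a hard 0/1 assignment
        if classify:
            t = np.eye(K)[t.argmax(axis=1)]
        # Step M: weighted maximum-likelihood updates
        nk = np.maximum(t.sum(axis=0), 1e-12)    # guard against an empty cluster
        pi = nk / len(x)
        means = (t * x[:, None]).sum(axis=0) / nk
        var = (t * (x[:, None] - means) ** 2).sum(axis=0) / nk
        sds = np.sqrt(np.maximum(var, 1e-6))     # floor avoids degenerate variances
    return pi, means, sds

# Simulated two-cluster data (hypothetical), centred at -3 and +3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
pi_em, means_em, sds_em = fit_mixture(x, means0=[-1.0, 1.0])
pi_cem, means_cem, sds_cem = fit_mixture(x, means0=[-1.0, 1.0], classify=True)
```

Because both algorithms depend on $\theta^0$, it is common to run them from several starting points and keep the best solution.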
  • 20. Prostate cancer data (without missing data)
    Individuals: $n = 475$ patients with prostatic cancer, grouped on clinical criteria into two Stages, 3 and 4, of the disease
    Variables: $d = 12$ pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
    Model: conditional independence, $p(x_1; \alpha_k) = p(x_1; \alpha_k^{cont}) \cdot p(x_1; \alpha_k^{cat})$
  • 22. Why should I use mixmod?
    Advantage(s): compares a lot of different models (continuous data); analyses mixed data
    Disadvantage(s): does not handle missing data; only continuous or categorical data
  • 23. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 24. Full mixed data: conditional independence everywhere [2]
    The aim is to combine continuous, categorical, integer, ordinal, ranking and functional data: $x_1 = (x_1^{cont}, x_1^{cat}, x_1^{int}, \ldots)$
    The proposed solution is to mix all types by inter-type conditional independence: $p(x_1; \alpha_k) = p(x_1^{cont}; \alpha_k^{cont}) \times p(x_1^{cat}; \alpha_k^{cat}) \times p(x_1^{int}; \alpha_k^{int}) \times \ldots$
    In addition, for symmetry between types, intra-type conditional independence
    Only the univariate pdf of each variable type needs to be defined! Continuous: Gaussian; categorical: multinomial; integer: Poisson; ...
    [2] MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
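The product form means that the log-density of an individual within a cluster is a plain sum of univariate log-densities. A small sketch, assuming one variable of each of the three types named on the slide; the dictionary layout of `alpha_k` and all numerical values are hypothetical and do not reflect MixtComp's internal representation.

```python
import math

def log_gaussian(x, mean, sd):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2

def log_multinomial(level, probs):
    """Log-probability of one categorical level."""
    return math.log(probs[level])

def log_poisson(count, lam):
    """Log-probability of a Poisson count."""
    return count * math.log(lam) - lam - math.lgamma(count + 1)

def log_density_in_cluster(x_cont, x_cat, x_int, alpha_k):
    """Conditional independence: the log-density is the sum over variable types."""
    return (log_gaussian(x_cont, *alpha_k["cont"])
            + log_multinomial(x_cat, alpha_k["cat"])
            + log_poisson(x_int, alpha_k["int"]))

# Hypothetical cluster parameters: (mean, sd), level probabilities, Poisson rate
alpha_k = {"cont": (0.3, 0.2), "cat": {"red": 0.7, "green": 0.3}, "int": 4.0}
value = log_density_in_cluster(0.3, "green", 3, alpha_k)
```

Adding a new variable type only requires one more univariate log-density function, which is the point made on the slide.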
  • 25. Missing data: MAR assumption and estimation
    Assumption on the missingness mechanism: Missing At Random (MAR), i.e. the probability that a variable is missing does not depend on its own value given the observed variables
    Observed log-likelihood (under the MAR assumption): $\ell(\theta; x^O) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, p(x_i^O; \alpha_k) \right) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \int_{x_i^M} p(x_i^O, x_i^M; \alpha_k) \, dx_i^M \right)$
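When the variables are conditionally independent within a cluster, the integral over $x_i^M$ is straightforward: each missing coordinate integrates to one and simply drops out of the product. The sketch below illustrates this, assuming Gaussian variables only and `np.nan` as the marker for missing entries; the function name and array shapes are assumptions made for the example.

```python
import numpy as np

def observed_log_likelihood(x, pi, means, sds):
    """Observed log-likelihood of a Gaussian mixture with independent variables.

    x: (n, d) array with np.nan for missing entries; means, sds: (K, d) arrays.
    """
    observed = ~np.isnan(x)                                  # (n, d) mask
    x_filled = np.where(observed, x, 0.0)                    # placeholder, masked below
    z = (x_filled[:, None, :] - means[None]) / sds[None]     # (n, K, d)
    log_pdf = -0.5 * z ** 2 - np.log(sds[None] * np.sqrt(2.0 * np.pi))
    # Integrating a missing coordinate out of a product of univariate
    # densities leaves a factor 1, i.e. a log-term equal to 0.
    log_pdf = np.where(observed[:, None, :], log_pdf, 0.0)
    log_joint = np.log(pi)[None, :] + log_pdf.sum(axis=2)    # (n, K)
    # Log-sum-exp over clusters for numerical stability
    m = log_joint.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))).sum())

# One individual, second variable missing, single standard Gaussian cluster
ll_missing = observed_log_likelihood(
    np.array([[0.0, np.nan]]), np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2))
)
# Same individual, fully observed
ll_full = observed_log_likelihood(
    np.array([[0.0, 0.0]]), np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2))
)
```

In the toy example the partially observed individual contributes one Gaussian log-density term and the fully observed one contributes two.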
  • 26. SEM algorithm [3]
    A SEM algorithm to estimate $\theta$ by maximizing the observed-data log-likelihood
    Initialisation: $\theta^{(0)}$
    Iteration no. $q$:
      E-step: compute the conditional probabilities $p(x^M, z \mid x^O; \theta^{(q)})$
      S-step: draw $(x^{M(q)}, z^{(q)})$ from $p(x^M, z \mid x^O; \theta^{(q)})$
      M-step: maximize $\theta^{(q+1)} = \arg\max_\theta \ln p(x^O, x^{M(q)}, z^{(q)}; \theta)$
    Stopping rule: iteration number
    Properties: simpler than EM and interesting properties!
      Avoids a possibly difficult E-step in an EM
      Classical M-steps
      Avoids local maxima
      The mean of the sequence $(\theta^{(q)})$ approximates $\hat\theta$
      The variance of the sequence $(\theta^{(q)})$ gives confidence intervals
    [3] MixtComp software on the MASSICCC platform: https://massiccc.lille.inria.fr/
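The three steps can be illustrated on a deliberately simple case. The sketch below assumes a univariate Gaussian mixture in which some individuals are entirely missing (`np.nan`), so that the S-step draws both $z$ and $x^M$. It is not the MixtComp implementation: the function name, the burn-in length, the rule that skips the update of a nearly empty cluster and the simulated data are assumptions for the example.

```python
import numpy as np

def sem_gaussian_mixture(x, means0, n_iter=200, burn_in=50, seed=0):
    """SEM for a 1-D Gaussian mixture; returns the post-burn-in mean of the means."""
    rng = np.random.default_rng(seed)
    n, K = len(x), len(means0)
    missing = np.isnan(x)
    x_obs = np.where(missing, 0.0, x)            # placeholder value, masked below
    pi = np.full(K, 1.0 / K)
    means = np.asarray(means0, dtype=float).copy()
    sds = np.full(K, np.nanstd(x))
    kept = []
    for q in range(n_iter):                      # stopping rule: iteration number
        # E-step: p(z_i = k | x_i^O); with nothing observed it is the prior pi
        dens = np.exp(-0.5 * ((x_obs[:, None] - means) / sds) ** 2) / sds
        w = np.where(missing[:, None], pi, pi * dens)
        t = w / w.sum(axis=1, keepdims=True)
        # S-step: draw z from t (inverse-CDF sampling), then x^M given z
        u = rng.random(n)
        z = np.minimum((u[:, None] > t.cumsum(axis=1)).sum(axis=1), K - 1)
        x_full = np.where(missing, rng.normal(means[z], sds[z]), x)
        # M-step: classical complete-data estimates on (x_full, z)
        for k in range(K):
            members = x_full[z == k]
            if len(members) > 1:                 # keep old values if nearly empty
                pi[k] = len(members) / n
                means[k] = members.mean()
                sds[k] = max(members.std(), 1e-3)
        if q >= burn_in:
            kept.append(means.copy())
    # The mean of the sequence approximates the estimate
    return np.mean(kept, axis=0)

# Simulated data (hypothetical) with every tenth value missing
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(-3.0, 1.0, 150), data_rng.normal(3.0, 1.0, 150)])
x[::10] = np.nan
means_sem = sem_gaussian_mixture(x, means0=[-1.0, 1.0])
```

Keeping the whole sequence of draws instead of only its mean would also give access to its variance, which the slide mentions as a source of confidence intervals.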
  • 27. Prostate cancer data (with missing data) [4]
    Individuals: 506 patients with prostatic cancer, grouped on clinical criteria into two Stages, 3 and 4, of the disease
    Variables: $d = 12$ pre-trial variates were measured on each patient: eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histolic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
    Some missing data: 62 missing values (≈ 1%)
    We forget the classes (Stages of the disease) when performing clustering
    Questions: How many clusters? Which partition?
    [4] Byar DP, Green SB (1980): Bulletin Cancer, Paris 67:477-488
  • 28. Data upload without preprocessing
  • 30. Several quick result overviews... without post-processing
  • 31. Variable significance on global partition + similarity between variables
  • 32. Variable "SG" difference between clusters
  • 33. Variable "BM" difference between clusters
  • 34. Companies and MixtComp: Modal (Rougegorge); Inriatech (Alstom, ArcelorMittal, Décathlon, ...); DiagRAMS technologies (predictive maintenance)
  • 35. Why should I use MixtComp?
    Advantage(s): analyses different kinds of data; handles missing and partially missing data
    Disadvantage(s): does not use the correlation structure (even with continuous data)
  • 36. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 37. High-dimensional (HD) data [5]
    [5] S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
  • 38. From clustering to co-clustering [Govaert, 2011]
  • 39. Notations
    $z_i$: the cluster of row $i$
    $w_j$: the cluster of column $j$
    $(z_i, w_j)$: the block of the element $x_{ij}$ (row $i$, column $j$)
    $z = (z_1, \ldots, z_n)$: partition of the individuals into $K$ clusters of rows
    $w = (w_1, \ldots, w_d)$: partition of the variables into $L$ clusters of columns
    $(z, w)$: bi-partition of the whole data set $x$
    The two partition spaces are denoted by $Z$ and $W$ respectively
    Restriction: all variables are of the same kind (research in progress to overcome this...)
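  • 40. MLE estimation: EM algorithm
    Observed log-likelihood: $\ell(\theta; x) = \log p(x; \theta)$
    Complete log-likelihood: $\ell_c(\theta; x, z, w) = \log p(x, z, w; \theta) = \sum_{i,k} z_{ik} \log \pi_k + \sum_{j,l} w_{jl} \log \rho_l + \sum_{i,j,k,l} z_{ik} w_{jl} \log p(x_{ij}; \alpha_{kl})$
    E-step of EM (iteration $q$): $Q(\theta, \theta^{(q)}) = E[\ell_c(\theta; x, z, w) \mid x; \theta^{(q)}] = \sum_{i,k} t_{ik}^{(q)} \ln \pi_k + \sum_{j,l} s_{jl}^{(q)} \ln \rho_l + \sum_{i,j,k,l} e_{ijkl}^{(q)} \ln p(x_{ij}; \alpha_{kl})$
    with $t_{ik}^{(q)} = p(z_i = k \mid x; \theta^{(q)})$, $s_{jl}^{(q)} = p(w_j = l \mid x; \theta^{(q)})$ and $e_{ijkl}^{(q)} = p(z_i = k, w_j = l \mid x; \theta^{(q)})$
    M-step of EM (iteration $q$): classical. For instance, for the Bernoulli case, it gives $\pi_k^{(q+1)} = \dfrac{\sum_i t_{ik}^{(q)}}{n}$, $\rho_l^{(q+1)} = \dfrac{\sum_j s_{jl}^{(q)}}{d}$, $\alpha_{kl}^{(q+1)} = \dfrac{\sum_{i,j} e_{ijkl}^{(q)} x_{ij}}{\sum_{i,j} e_{ijkl}^{(q)}}$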
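The Bernoulli M-step formulas translate directly into array operations. In the sketch below the joint probabilities $e_{ijkl}$ are built as the product $t_{ik} s_{jl}$; this factorization is an assumption made only to obtain a usable input for the example, and the function name and toy matrix are hypothetical rather than taken from BlockCluster.

```python
import numpy as np

def bernoulli_m_step(x, t, s):
    """M-step updates for the Bernoulli latent block model.

    x: (n, d) binary matrix; t: (n, K) row probabilities; s: (d, L) column probabilities.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    # Assumption for this sketch: e_ijkl = t_ik * s_jl
    e = np.einsum("ik,jl->ijkl", t, s)
    pi = t.sum(axis=0) / n                                   # pi_k = sum_i t_ik / n
    rho = s.sum(axis=0) / d                                  # rho_l = sum_j s_jl / d
    # alpha_kl = sum_ij e_ijkl x_ij / sum_ij e_ijkl
    alpha = np.einsum("ijkl,ij->kl", e, x) / e.sum(axis=(0, 1))
    return pi, rho, alpha

# Toy 3 x 3 binary matrix with a clear 2 x 2 block structure (hypothetical)
x = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
t = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # rows 0, 1 in cluster 0
s = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # columns 0, 1 in cluster 0
pi, rho, alpha = bernoulli_m_step(x, t, s)
```

With these hard assignments, `alpha` is the proportion of ones inside each block.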
  • 41. MLE: intractable E-step
    $e_{ijkl}^{(q)}$ is usually intractable...
    Consequence of the dependency between the $x_{ij}$'s (link between rows and columns)
    Involves $K^n L^d$ computations (number of possible blocks)
    Example: if $n = d = 20$ and $K = L = 2$ then $10^{12}$ blocks
    Example (cont'd): 33 years with a computer calculating 100,000 blocks/second
    Alternatives to EM
      Variational EM (numerical approx.): conditional independence assumption $p(z, w \mid x; \theta) \approx p(z \mid x; \theta) \, p(w \mid x; \theta)$
      SEM-Gibbs (stochastic approx.): replace the E-step by an S-step, approximated by Gibbs sampling of $z \mid x, w; \theta$ and $w \mid x, z; \theta$
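The order of magnitude of the count can be checked directly. A small worked computation, assuming that "blocks" refers to the $K^n L^d$ joint assignments $(z, w)$ and that enumerating them once at the quoted rate is all that is measured; variable names are illustrative.

```python
n = d = 20
K = L = 2

# Number of joint assignments (z, w): K**n * L**d = 2**40, about 1.1e12
blocks = K ** n * L ** d

# Time for a single pass over all assignments at 100,000 per second
rate_per_second = 100_000
seconds = blocks / rate_per_second
days = seconds / 86_400
```

Under these assumptions a single pass takes about 127 days. The 33-year figure quoted on the slide is not reproduced by this single-pass count and may rest on assumptions that are not modelled here. In either case the count doubles with every additional row or column, which is what rules out an exact E-step.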
  • 42. Document clustering (1/2)
    Mixture of 1033 medical summaries and 1398 aeronautics summaries
    Rows: 2431 documents
    Columns: words present (stop words excluded), thus 9275 unique words
    Data matrix: document × word cross-counts
    Poisson model
  • 43. Document clustering (2/2)
    Results with 2×2 blocks:
              | Medline | Cranfield
    Medline   | 1033    | 0
    Cranfield | 0       | 1398
  • 48. Why should I use BlockCluster?
    Advantage(s): co-clustering (HD data)
    Disadvantage(s): analyses one kind of data at a time
  • 49. Outline: 1 Introduction; 2 Model-based clustering; 3 Mixmod in MASSICCC; 4 MixtComp in MASSICCC; 5 BlockCluster in MASSICCC; 6 Conclusion
  • 50. Current work
    mixmod
      Missing Not At Random data (using logit distribution) ⇒ missing values impact the probability of missingness
      Partnership between CMAP/INRIA/Traumabase
      Work with C. Biernacki, G. Celeux, J. Josse and Y. Stroppa
    MixtComp
      Publish an R package on CRAN
    With you?
      New kinds of data
      Complex missingness
      2D cluster plots
  • 51. Use probabilistic modelling as a mathematical guideline
    Use the MASSICCC platform for a user-friendly implementation
      User-friendly interpretation
      "One for all" of clustering
      Low computing requirements
      Free software
    https://massiccc.lille.inria.fr/
    Also check the R packages (https://cran.r-project.org/)
      Rmixmod: https://cran.r-project.org/web/packages/Rmixmod/index.html
      blockcluster: https://cran.r-project.org/web/packages/blockcluster/index.html