In this presentation, I discuss the topics I covered during my PhD:
Dependency measures between variables are fundamental for a number of important applications in machine learning. They are ubiquitously used: for feature selection, as splitting criteria in random forests, for clustering comparison and validation, and to infer biological networks, to name a few. Nonetheless, a number of problems arise when dependencies are estimated on finite data: the detection, quantification, and ranking of dependencies are all challenging.
This thesis proposes a series of contributions to improve performance on each of the three goals above. During the seminar I will demonstrate that:
- Adjusted measures can improve on the tasks of quantification and ranking. In particular, I will discuss some adjustments applied to the Maximal Information Coefficient (MIC), random forests, and clustering comparisons;
- A measure we designed, based on mutual information and randomisation, is competitive on the tasks of detection and ranking of relationships. We named this measure the Randomised Information Coefficient (RIC) and tested it on the applications of biological network inference and multi-variable feature selection.
PhD Completion Seminar
1. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions
Simone Romano’s PhD Completion Seminar
Design and Adjustment of
Dependency Measures Between Variables
November 30th 2015
Supervisor: Prof. James Bailey
Co-Supervisor: A/Prof. Karin Verspoor
Computing and Information Systems (CIS)
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Examples of Applications
Dependency Measures
A dependency measure D is used to assess
the amount of dependency between variables:
Example 1: After collecting weight and height for many people,
we can compute D(weight, height)
Example 2: Assess the amount of dependency between search queries in Google
https://www.google.com/trends/correlate/
They are fundamental for a number of applications in machine learning / data mining
Examples of Applications
Applications of Dependency Measures
Supervised learning
Feature selection [Guyon and Elisseeff, 2003];
Decision tree induction [Criminisi et al., 2012];
Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
External clustering validation [Strehl and Ghosh, 2003];
Generation of alternative or multi-view clusterings
[Müller et al., 2013, Dang and Bailey, 2015];
The exploration of the clustering space using results from the Meta-Clustering algorithm
[Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
Analysis of neural time-series data [Cohen, 2014].
Examples of Applications
Application example (1): feature selection / decision tree induction
Application: Identify whether the class C is dependent on a feature F
Toy Example: Is the class C = cancer dependent on the feature F = smoker, according to this data set of 20 patients?
Use of dependency measure: Compute D(F, C)
Smoker   Cancer
No       -
Yes      +
Yes      +
Yes      -
No       +
No       -
Yes      +
...      ...
Yes      +
Contingency table is a useful tool: it counts the co-occurrences of feature values and class values.

              +     -    total
Smoker        6     2        8
Non smoker    4     8       12
total        10    10       20

⇒ if F and C are dependent, then induce a split on smoker (yes / no) in the decision tree
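The split above can be scored with mutual information (a.k.a. information gain) computed directly from the contingency-table counts. A minimal sketch, not code from the thesis (the function name is mine):

```python
import math

def mutual_information(table):
    """Mutual information (in nats) between two categorical variables,
    computed from a contingency table of co-occurrence counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij > 0:
                mi += (nij / n) * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return mi

# Rows: smoker / non smoker; columns: cancer + / cancer -
print(mutual_information([[6, 2], [4, 8]]))  # ≈ 0.086 nats: a weak dependency
```

A value of 0 means the feature tells us nothing about the class; the larger D(F, C), the stronger the split candidate.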
Examples of Applications
Application example (2): external clustering validation
Application: Compare a clustering solution B to a reference clustering A.
Toy Example: N = 15 data points
reference clustering A with 2 clusters, stars and circles
clustering solution B with 2 clusters, red
and blue
Examples of Applications
Use of dependency measure: Compute D(A, B)
Once again the contingency table is a useful tool that assesses the amount of overlap between A and B:

              B
          red   blue   total
A  stars    4      4       8
   circles  2      5       7
   total    6      9      15
Examples of Applications
Application example (3): genetic network inference
Application: Identify whether the gene G1 is interacting with the gene G2
Toy Example: We have a time series of values for each of G1 and G2:
Use of dependency measure: Compute D(G1, G2)
time   G1        G2
t1     20.4400   19.7450
t2     19.0750   20.3300
t3     20.0650   20.1700
...    ...       ...
[Figure: time series of G1 and G2 over time, and scatter plot of G2 vs G1]
Here there is no contingency table because the variables are numerical.
Categories of Dependency measures
Categories of Dependency Measures
Dependency measures can be divided into two categories: measures between categorical variables and measures between numerical variables.
Between Categorical Variables
These measures can be computed naturally on a contingency table, for example for decision trees (the smoker/cancer table above) or for clustering comparisons (the A vs. B table above).
Information theoretic [Cover and Thomas, 2012]:
e.g. mutual information (a.k.a. information gain)
Based on pair-counting [Albatineh et al., 2006]: e.g. Rand Index, Jaccard similarity
Based on set-matching [Meilă, 2007]:
e.g. classification accuracy, agreement between annotators
Others: mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain,
Chi-square.
Categories of Dependency measures
Between Numerical Variables
No contingency table is available. For example, biological interaction between numerical variables (the scatter plot of G2 vs G1 above).
Estimators of mutual information [Khan et al., 2007]:
e.g. kNN estimator, kernel estimator, estimator based on grids
Correlation based:
e.g. Pearson’s correlation, distance correlation [Székely et al., 2009],
randomized dependence coefficient [Lopez-Paz et al., 2013]
Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]
Based on information theory:
e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011],
the mutual information dimension [Sugiyama and Borgwardt, 2013],
total information coefficient [Reshef et al., 2015].
Thesis Motivation
Thesis Motivation
Even if a dependency measure D has nice theoretical properties,
dependencies are estimated on finite data with ˆD.
The following goals of dependency measures are challenging:
Detection: Test for the presence of dependency.
E.g. test dependence between two genes
Example (3)
Quantification: Summarization of the amount of dependency in an interpretable fashion.
E.g. assessing the amount of overlap between two clusterings
Example (2)
Ranking: Sort the relationships of different variables.
E.g. ranking many features in decision trees
Example (1)
To improve performance on the three goals above:
We need information on the distribution of ˆD
Thesis Motivation
For Example, when Ranking Noisy Relationships
The distribution of ˆD(X, Y) when the relationship between X and Y is noisy should not overlap with the distribution of ˆD(X, Y) on a noiseless relationship.
Motivation
Mutual information I(X, Y) is good for ranking relationships with different levels of noise between variables:
high I ⇒ little noise
small I ⇒ much noise
It can also be computed between sets of variables, e.g.:
I(X, Y) = I({X1, X2}, Y) = I({weight, height}, BMI)
Mutual Information quantifies the information shared between two variables:

MI(X, Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)} \, dx \, dy
Importance of MI
It is based on a well-established theory and quantifies non-linear interactions which might be missed if, e.g., Pearson’s correlation coefficient r(X, Y) is used.
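As a concrete illustration of this last point, Pearson's r can be near zero on a noiseless but non-linear relationship. A small self-contained check (illustrative only, not from the thesis):

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(2000)]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # noiseless linear: r = 1
print(pearson_r(xs, [x * x for x in xs]))      # noiseless quadratic: r ≈ 0
```

Mutual information, in contrast, is high for both relationships, which is why it is preferred for ranking arbitrary relationships.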
Motivation
Estimation of Mutual Information
Many estimators of mutual information:
Acronym   Type                               Sets of vars.   Best Compl.   Worst Compl.
Iew       Discretization (equal width)       no              O(n^1.5)      O(n^1.5)
Ief       Discretization (equal frequency)   no              O(n^1.5)      O(n^1.5)
IA        Adaptive partitioning              no              O(n^1.5)      O(n^1.5)
Imean     Mean nearest neighbours            yes             O(n^2)        O(n^2)
IKDE      Kernel density estimation          yes             O(n^2)        O(n^2)
IkNN      Nearest neighbours                 yes             O(n^1.5)      O(n^2)
Discretization-based estimators of mutual information exhibit good complexity but are not applicable to sets of variables
Motivation
Discretization-based estimators use fixed grids and compute mutual information on the resulting contingency table. For example, Iew discretizes using equal-width binning:
[Scatter plot of Y vs X overlaid with an equal-width grid]
X is discretized into columns b_1, ..., b_c and Y into rows a_1, ..., a_r; n_ij counts the number of points in bin (i, j). Mutual information can be computed with:

Iew(X, Y) = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}}{N} \log \frac{n_{ij} N}{a_i b_j}

where a_i and b_j denote the row and column marginal totals.
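The equal-width estimator above can be written in a few lines. A sketch under my own naming (equal-width bins, counts n_ij, marginals a_i and b_j as in the formula):

```python
import math
import random

def iew(xs, ys, r=4, c=4):
    """Equal-width estimate of mutual information: bin X into c columns and
    Y into r rows, then apply sum_ij (n_ij/N) * log(n_ij * N / (a_i * b_j))."""
    N = len(xs)
    lox, loy = min(xs), min(ys)
    wx = (max(xs) - lox) / c or 1.0   # guard against constant data
    wy = (max(ys) - loy) / r or 1.0
    n = [[0] * c for _ in range(r)]
    for x, y in zip(xs, ys):
        i = min(int((y - loy) / wy), r - 1)   # row bin of y
        j = min(int((x - lox) / wx), c - 1)   # column bin of x
        n[i][j] += 1
    a = [sum(row) for row in n]               # row marginals a_i
    b = [sum(col) for col in zip(*n)]         # column marginals b_j
    return sum(nij / N * math.log(nij * N / (a[i] * b[j]))
               for i in range(r) for j, nij in enumerate(n[i]) if nij)

random.seed(1)
xs = [random.random() for _ in range(1000)]
ys = xs[:]                   # noiseless line: estimate near log(4) on a 4x4 grid
ys_shuffled = xs[:]
random.shuffle(ys_shuffled)  # independent data: estimate near 0 (small positive bias)
print(iew(xs, ys), iew(xs, ys_shuffled))
```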
Motivation
Criticism
The discretization approach is less popular for numerical variables because there is a systematic estimation bias which depends on the grid size.
However, when comparing dependencies, systematic estimation biases cancel each other out [Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010].
Thus it is not bad for comparing/ranking relationships!
Motivation
Comparing relationships/ comparing estimations of I
Task: Given a strong relationship s and a weak relationship w, compare the estimates ˆIs and
ˆIw of the true values Is and Iw
Systematic biases cancel out when comparing relationships
Systematic biases translate the distributions by a fixed amount
It is beneficial to reduce the variance
Challenge: Decreasing the variance of the estimation
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
21. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions
Design of the Randomized Information Coefficient (RIC)
Randomized Information Coefficient (RIC)
Idea:
Generate many random grids with different cardinality via random cut-offs
Estimate the normalized mutual information for each of them (normalization accounts for the different cardinalities)
Average the estimates
[Three scatter plots of Y vs X, each overlaid with a different random grid; their normalized mutual information values are averaged]
Parameters:
Kr - tunes the number of random grids
Dmax - tunes the maximum grid cardinality generated
Features:
Proved to decrease the variance like in random forests [Geurts, 2002]
Still good complexity: O(n^1.5)
Easy to extend to sets of variables
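The idea above can be sketched in a few lines. This is my own simplified reconstruction (random cut-off grids, NMI normalized by the larger marginal entropy); the actual RIC may differ in details such as the normalization and how grid cardinalities are drawn:

```python
import math
import random
from collections import Counter

def nmi(bx, by):
    """Normalized mutual information of two discretized sequences."""
    n = len(bx)
    cx, cy, cxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    hx = -sum(c / n * math.log(c / n) for c in cx.values())
    hy = -sum(c / n * math.log(c / n) for c in cy.values())
    if max(hx, hy) == 0:
        return 0.0
    mi = sum(c / n * math.log(c * n / (cx[i] * cy[j]))
             for (i, j), c in cxy.items())
    return mi / max(hx, hy)

def ric(xs, ys, kr=100, dmax=8, seed=0):
    """Average normalized MI over kr random grids built with random cut-offs."""
    rng = random.Random(seed)
    def random_bins(vs):
        k = rng.randint(2, dmax)                 # random grid cardinality
        lo, hi = min(vs), max(vs)
        cuts = sorted(rng.uniform(lo, hi) for _ in range(k - 1))
        return tuple(sum(v > c for c in cuts) for v in vs)
    total = 0.0
    for _ in range(kr):
        total += nmi(random_bins(xs), random_bins(ys))
    return total / kr

rng = random.Random(42)
xs = [rng.random() for _ in range(500)]
noise = [rng.random() for _ in range(500)]
print(ric(xs, xs), ric(xs, noise))   # high for the identity, near 0 for noise
```

Averaging over many grids is exactly where the variance reduction comes from, mirroring the random-forest analogy on the slide.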
Design of the Randomized Information Coefficient (RIC)
Random discretization of set of variables
Relationship between Y and X = {X1, X2}
[3D scatter plot of Y against (X1, X2), and its projection onto X0 = (X1 + X2)/2, plotted as Y vs X0]
Need to randomly discretize X ⇒ just choose some random seeds:
[Scatter plot of X1 vs X2 showing randomly chosen seed points]
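Random seeds give a simple way to discretize a set of variables jointly: pick k random data points as seeds and assign every point to its nearest seed. A sketch (the function name and the Euclidean metric are my choices):

```python
import random

def seed_discretize(points, k, rng):
    """Assign each multivariate point to the index of its nearest of k
    randomly sampled seed points (seeds are data points themselves)."""
    seeds = rng.sample(points, k)
    def nearest(p):
        return min(range(k),
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(p, seeds[s])))
    return [nearest(p) for p in points]

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(200)]
labels = seed_discretize(pts, 4, rng)
print(sorted(set(labels)))  # the 4 cells: [0, 1, 2, 3]
```

The resulting cell labels play the role of bins, so the contingency-table estimate of mutual information applies unchanged to X = {X1, X2}.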
Design of the Randomized Information Coefficient (RIC)
Detection of Relationship
Task: Using a permutation test, identify whether a relationship exists:
Generate 500 values of RIC under complete noise
Sort the values and identify the value x of RIC at position 500 × 95% = 475
Generate 500 values of RIC under a particular relationship
Count how many values are greater than x
⇒ the bigger the count, the bigger the power of RIC
[Grid of example relationships: Linear, Quadratic, Cubic, Sinusoidal low freq., Sinusoidal high freq., 4th Root, Circle, Step Function, Two Lines, X, Sinusoidal varying freq., Circle-bar; each at noise levels 1, 6, 11, 16, 21, 26]
Tested on many relationships and levels of noise
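The power computation above can be sketched as follows. Illustrative only: `abs_r` stands in for RIC as the dependency statistic, and `noisy_line` is a hypothetical relationship generator:

```python
import math
import random

def abs_r(xs, ys):
    """|Pearson correlation|, used here as a stand-in dependency statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / math.sqrt(vx * vy) if vx * vy else 0.0

def estimate_power(stat, noisy_rel, n_points=100, reps=500, alpha=0.05, seed=0):
    """Power of `stat` per the slide: take the 95% cutoff under complete
    noise, then count relationship samples whose statistic exceeds it."""
    rng = random.Random(seed)
    null = sorted(stat([rng.random() for _ in range(n_points)],
                       [rng.random() for _ in range(n_points)])
                  for _ in range(reps))
    cutoff = null[int(reps * (1 - alpha)) - 1]   # value at position 500 * 95% = 475
    hits = sum(stat(*noisy_rel(rng, n_points)) > cutoff for _ in range(reps))
    return hits / reps

def noisy_line(rng, n):
    xs = [rng.random() for _ in range(n)]
    return xs, [x + 0.5 * rng.gauss(0, 1) for x in xs]

print(estimate_power(abs_r, noisy_line))
```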
Design of the Randomized Information Coefficient (RIC)
Power as the number of random grids increases
Kr increases the number of random grids
[Figure: Area under the power curve vs the parameter Kr (50 to 200); every line is a relationship; RIC optimum at Kr = 200]
More random grids ⇒ less estimation variance ⇒ more power
Comparison Against Other Measures
Comparison with Other Measures
Extensively compared with other measures on the task of relationship detection
[Figure: Average rank by power across relationships (rank 1st when power is max on a relationship), for RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r2, IA, ACE, Imean, MID]
Comparison Against Other Measures
Comparison - Biological Network Inference
Reverse engineering of gene networks when the ground truth is known
[Figure: Average rank by mean average precision across networks (rank 1st when average precision is max on a network), for RIC, dCorr, IKDE, IkNN, HSIC, ACE, r2, GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID]
Also compared on:
Feature filtering for regression
Feature selection for regression
RIC shows competitive performance
Comparison Against Other Measures
Conclusion - Message
We proposed the Randomized Information Coefficient (RIC):
It reduces the variance of normalized mutual information computed via grids when comparing relationships
It randomly discretizes multiple variables
Take-away message:
There are different ways to generate random grids (random cut-offs / random seeds)
The more grids, the smaller the variance
The Randomized Information Coefficient: Ranking Dependencies in Noisy Data, Simone Romano, James Bailey, Nguyen Xuan
Vinh, and Karin Verspoor. Under review in the Machine Learning Journal
Comparison Against Other Measures
Hypotheses so far...
So far we have compared numerical variables on samples of fixed size n.
Dependency measures might be biased when they:
Compare samples with different n
Compare categorical variables
⇒ Need for adjustment in these cases
Motivation
Motivation for Adjustment for Quantification
Pearson’s correlation between two variables X and Y estimated on a data sample S_n = {(x_k, y_k)} of n data points:

r(S_n|X, Y) = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2 \sum_{k=1}^{n} (y_k - \bar{y})^2}}  (1)
[Figure: example scatter plots with r = 1, 0.8, 0.4, 0, -0.4, -0.8, -1; from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]
r2(Sn|X, Y) can be used as a proxy for the amount of noise in linear relationships: 1 if noiseless, 0 under complete noise.
Motivation
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has 499 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:
Figure : From supplementary material online in [Reshef et al., 2011]
MIC should be equal to:
1 if the relationship between X and Y is functional and noiseless
0 if there is complete noise
Motivation
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
[Histograms of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise]
Values can be high because of chance! The user expects values close to 0 in both cases
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Quantification
Adjustment for Chance
We define a framework for adjustment:

Adjustment for Quantification:
A\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\max \hat{D} - E[\hat{D}_0]}

It uses the distribution of \hat{D}_0 under independent variables:
r2_0: Beta distribution
MIC_0: can be computed using Monte Carlo permutations
This type of adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
Adjusted r2 ⇒ Ar2
Adjusted MIC ⇒ AMIC
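The Monte Carlo permutation scheme for this adjustment can be sketched as follows. Illustrative code (function names are mine; r² is shown because its maximum is 1):

```python
import math
import random

def r2(xs, ys):
    """Squared Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy) if vx * vy else 0.0

def adjusted(stat, xs, ys, d_max=1.0, n_perm=500, seed=0):
    """A(D) = (D - E[D0]) / (max D - E[D0]); E[D0] estimated by permuting Y."""
    rng = random.Random(seed)
    e_d0 = 0.0
    for _ in range(n_perm):
        yp = ys[:]
        rng.shuffle(yp)                # each permutation destroys any dependency
        e_d0 += stat(xs, yp)
    e_d0 /= n_perm
    return (stat(xs, ys) - e_d0) / (d_max - e_d0)

rng = random.Random(1)
xs = [rng.random() for _ in range(20)]
ys = [rng.random() for _ in range(20)]      # complete noise on a small sample
print(r2(xs, ys), adjusted(r2, xs, ys))     # adjusted value is pulled towards 0
print(adjusted(r2, xs, xs))                 # noiseless: stays 1.0
```

Here `d_max` is max ˆD, which equals 1 for both r² and MIC.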
Adjustment for Quantification
Adjusted measures enable better interpretability
Task: Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise    r2       Ar2
0%       1        1
20%      0.66     0.65
40%      0.39     0.37
60%      0.2      0.17
80%      0.073    0.044
100%     0.035    0.00046

Figure: Ar2 becomes zero on average at 100% noise

Noise    MIC      AMIC
0%       1        1
20%      0.7      0.6
40%      0.47     0.29
60%      0.34     0.11
80%      0.27     0.021
100%     0.26     0.0014

Figure: AMIC becomes zero on average at 100% noise
Adjustment for Quantification
Not biased towards small sample size n
Average value of ˆD for different % of noise
⇒ estimates can be high because of chance at small n (e.g. because of missing values)
[Figures: average value vs noise level for Raw r2 (n = 10, 20, 30, 40, 100, 200) and Raw MIC (n = 20, 40, 60, 80), alongside the adjusted Ar2 and AMIC versions]
Adjustment for Ranking
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using two equally unpredictive variables X1 and X2, defined as follows:
X1 ≡ patient had breakfast today, X1 = {yes, no};
X2 ≡ patient eye color, X2 = {green, blue, brown};
[Figure: category proportions for X1 (yes/no) and X2 (green/blue/brown)]
Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Adjustment for Ranking
Selection bias experiment
Experiment
n = 100 data points
Class C with 2 categories:
Generate a variable X1 with 2 categories
(independently from C)
Generate a variable X2 with 3 categories
(independently from C)
Compute Gini(X1, C) and Gini(X2, C); give a win to the variable that gets the highest value
REPEAT 10,000 times
[Bar chart: probability of selection for X1 vs X2]
Result: X2 gets selected 70% of the time ( Bad )
Given that they are equally unpredictive, we expected 50%
Challenge: adjust the estimated Gini gain to obtain unbiased rankings
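The experiment above can be reproduced in a few lines (Gini gain implemented from its textbook definition; function names are mine, and a smaller repetition count is used to keep it quick):

```python
import random
from collections import Counter

def gini_gain(feature, cls):
    """Gini gain of splitting class labels `cls` on a categorical `feature`."""
    n = len(cls)
    def impurity(labels):
        return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())
    gain = impurity(cls)
    for v, nv in Counter(feature).items():
        subset = [c for f, c in zip(feature, cls) if f == v]
        gain -= nv / n * impurity(subset)
    return gain

def selection_bias(reps=2000, n=100, seed=0):
    """How often X2 (3 categories) beats X1 (2 categories), both independent of C."""
    rng = random.Random(seed)
    wins2 = 0
    for _ in range(reps):
        c = [rng.randrange(2) for _ in range(n)]
        x1 = [rng.randrange(2) for _ in range(n)]
        x2 = [rng.randrange(3) for _ in range(n)]
        wins2 += gini_gain(x2, c) > gini_gain(x1, c)
    return wins2 / reps

print(selection_bias())  # well above the fair 50%, in line with the slide's ~70%
```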
Adjustment for Ranking
Adjustment for Ranking
We propose two adjustments for ranking:

Standardization:
SD̂ = (D̂ − E[D̂0]) / √Var(D̂0)
Quantifies statistical significance, like a p-value

Adjustment for Ranking:
AD̂(α) = D̂ − q0(1 − α)
Penalizes on statistical significance according to α; q0 is the quantile function of the null distribution D̂0
(a smaller α gives more penalization)
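Both adjustments only require a sample from the null distribution D̂0, obtained here by permuting one of the variables; a minimal sketch, where `measure` is any dependency estimate and `abs_cov` is a toy measure introduced only for the demo:

```python
import random
import statistics

def null_sample(measure, x, y, n_perm=200, seed=0):
    """Sample the null distribution D0 of a measure by permuting y."""
    rng = random.Random(seed)
    y = list(y)
    out = []
    for _ in range(n_perm):
        rng.shuffle(y)
        out.append(measure(x, y))
    return out

def standardized(measure, x, y, **kw):
    """S_D = (D - E[D0]) / sqrt(Var(D0))."""
    d0 = null_sample(measure, x, y, **kw)
    return (measure(x, y) - statistics.mean(d0)) / statistics.stdev(d0)

def adjusted(measure, x, y, alpha=0.05, **kw):
    """A_D(alpha) = D - q0(1 - alpha): penalize by a null quantile."""
    d0 = sorted(null_sample(measure, x, y, **kw))
    q0 = d0[min(len(d0) - 1, int((1 - alpha) * len(d0)))]
    return measure(x, y) - q0

# toy measure for the demo: absolute covariance (a hypothetical choice)
def abs_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)

x = list(range(50))
s = standardized(abs_cov, x, x)  # strongly significant for a perfect dependency
a = adjusted(abs_cov, x, x)      # stays positive after the quantile penalty
```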
Standardized Gini (SGini) corrects for selection bias
Task: select among unpredictive features X1 with 2 categories and X2 with 3 categories.
[Bar chart: probability of selection for X1 vs X2, both close to 0.5]
Result: X1 and X2 each get selected close to 50% of the time (good)
Being similar to a p-value, this is consistent with the literature on decision
trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006,
Strobl et al., 2007].
Nonetheless, we found that this is a simplistic scenario
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to the same constant ≠ 0
[Bar chart: probability of selection; X1 selected more often than X2]
Result: SGini becomes biased towards X1 because it is more statistically significant (bad)
This behavior has been overlooked in the decision tree community
Use AD̂(α) to penalize less, or even to tune the bias ⇒ AGini(α)
Application to random forest
Why random forest? A good classifier to try first when there are “meaningful” features
[Fernández-Delgado et al., 2014].
Plug in different splitting criteria
Experiment: 19 data sets with categorical variables
[Plot: mean AUC (90–91.5) as a function of α, comparing AGini(α) against SGini and Gini]
Figure: Using the same α for all data sets
And α can be tuned for each data set with cross-validation.
Conclusion - Message
Dependency estimates can be high by chance under finite samples.
Adjustments can help with:
Quantification, to obtain an interpretable value in [0, 1]
Ranking, to avoid biases towards:
missing values
categorical variables with more categories
A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and
Karin Verspoor. Under submission in SIAM International Conference on Data Mining 2016 (SDM-16)
Arxiv: http://arxiv.org/abs/1510.07786
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Motivation
Clustering Validation
Given a reference clustering V, we want to validate a clustering solution U
⇒ we need dependency measures
There are two very popular measures based on adjustments:
The Adjusted Rand Index (ARI)
[Hubert and Arabie, 1985]
∼ 3000 citations
The Adjusted Mutual Information (AMI)
[Vinh et al., 2009]
∼ 200 citations
No clear connection between them; users use them both
Both computed on a contingency table
Notation: contingency table M with entries nij;
ai = Σj nij are the row marginals and
bj = Σi nij are the column marginals.

             V
       b1  · · ·  bj  · · ·  bc
   a1  n11  · · ·  ·  · · ·  n1c
    ·
U  ai   ·    ·   nij   ·      ·
    ·
   ar  nr1  · · ·  ·  · · ·  nrc
ARI - adjustment of the Rand Index (RI), based on counting pairs of objects:
ARI = (RI − E[RI]) / (max RI − E[RI])

AMI - adjustment of Mutual Information (MI), based on information theory:
AMI = (MI − E[MI]) / (max MI − E[MI])
Link: generalized information theory
Generalized information theory is based on the Tsallis q-entropy:

Hq(V) = (1/(q − 1)) · (1 − Σj (bj/N)^q)

which generalizes Shannon's entropy:

lim(q→1) Hq(V) = H(V) = −Σj (bj/N) log(bj/N)

Link between measures:
the mutual information MIq based on Tsallis q-entropy links RI and MI:

MIq=2 ∝ RI,   lim(q→1) MIq = MI
Challenge: Compute E[MIq] to connect ARI and AMI
Challenge 2.0: Compute Var(MIq) for standardization
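The q-entropy and its Shannon limit can be checked numerically; a small sketch (the marginals `b` are a hypothetical example):

```python
import math

def tsallis_entropy(b, q):
    """H_q(V) = (1 - sum_j (b_j/N)^q) / (q - 1), for q != 1."""
    n = sum(b)
    return (1.0 - sum((bj / n) ** q for bj in b)) / (q - 1.0)

def shannon_entropy(b):
    """H(V) = -sum_j (b_j/N) log(b_j/N)."""
    n = sum(b)
    return -sum((bj / n) * math.log(bj / n) for bj in b)

b = [10, 10, 10, 70]              # column marginals of a 100-point clustering
h_limit = tsallis_entropy(b, 1.000001)
h_shannon = shannon_entropy(b)
# h_limit approaches h_shannon as q -> 1; at q = 2, H_2 = 1 - sum_j p_j^2
```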
We propose a technique applicable to a broader class of measures:
We can do:
Exact computation of measures in Lφ,
where S ∈ Lφ is a linear function of the entries of the contingency table:

S = α + β Σij φij(nij)      (α and β are constants)
Asymptotic approximation of measures in Nφ (non-linear)
[Venn diagram of measure families: Rand Index (RI), Jaccard (J), MI, NMI, VI, and the generalized information-theoretic measures]
Figure: Families of measures we can adjust
Detailed Analysis of Contingency Tables
Exact Expected Value by Permutation Model
E[S] is obtained by summation over all possible contingency tables M generated by permutations:

E[S] = ΣM S(M) P(M) = α + β ΣM Σij φij(nij) P(M)

There is no method to exhaustively generate all M with fixed marginals, and enumerating permutations is extremely expensive (O(N!)).
However, it is possible to swap the inner summation with the outer one:

ΣM Σi,j φij(nij) P(M) = Σi,j Σnij φij(nij) P(nij)

where nij has a known hypergeometric distribution.
Computation time is dramatically reduced ⇒ O(max{rN, cN})
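The swapped summation can be sanity-checked numerically: under the permutation model each nij is hypergeometric with parameters (N, ai, bj), so for the linear choice φ(n) = n the expected cell value must equal ai·bj/N. A sketch using only the standard library:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k successes) when drawing n items from N containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def expected_phi(phi, N, a_i, b_j):
    """E[phi(n_ij)] under the permutation (hypergeometric) model."""
    lo, hi = max(0, a_i + b_j - N), min(a_i, b_j)
    return sum(phi(k) * hypergeom_pmf(k, N, b_j, a_i) for k in range(lo, hi + 1))

# check: with phi(n) = n the expectation is a_i * b_j / N
N, a_i, b_j = 100, 30, 40
e = expected_phi(lambda n: n, N, a_i, b_j)
```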
Exact Variance Computation
We have to compute the second moment E[S²], which requires:

ΣM (Σi Σj φij(nij))² P(M) = ΣM Σi,j,i′,j′ φij(nij) · φi′j′(ni′j′) P(M)
= Σi,j,i′,j′ Σnij,ni′j′ φij(nij) · φi′j′(ni′j′) P(nij, ni′j′)

after swapping the summations.
Contribution: computing P(nij, ni′j′) is technically challenging.
We use the hypergeometric model: draws from an urn with N marbles of 3 colors (red, blue, and white).
Finally, we can define the adjustments
Definition: Adjusted Mutual Information q - AMIq

AMI2 = ARI,   lim(q→1) AMIq = AMI
We can finally relate ARI and AMI to generalized information theory!
Also define: a generalized Standardized Mutual Information q - SMIq for selection bias.
Their complexities:

Name | Computational complexity
AMIq | O(max{rN, cN})
SMIq | O(max{rcN³, c²N³})

Table: Complexity when comparing two clusterings: N objects, r and c clusters
Application Scenarios
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2
Example: Do you prefer U1 or U2?

U1 vs V (column marginals of V: 10, 10, 10, 70):
 8 |  8  0  0  0
 7 |  0  7  0  0
 7 |  0  0  7  0
78 |  2  3  3 70
AMI chooses this one because of the many 0's

U2 vs V (column marginals of V: 10, 10, 10, 70):
10 |  7  1  1  1
10 |  1  7  1  1
10 |  1  1  7  1
70 |  1  1  1 67
ARI chooses this one

When there are small clusters in V, use AMI because it likes 0's
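The ARI side of the preference can be verified directly from the two contingency tables above, using the standard pair-counting formula; a minimal sketch:

```python
from math import comb

def ari(table):
    """Adjusted Rand Index computed from a contingency table (list of rows)."""
    a = [sum(row) for row in table]        # row marginals
    b = [sum(col) for col in zip(*table)]  # column marginals
    n = sum(a)
    index = sum(comb(nij, 2) for row in table for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

u1 = [[8, 0, 0, 0], [0, 7, 0, 0], [0, 0, 7, 0], [2, 3, 3, 70]]
u2 = [[7, 1, 1, 1], [1, 7, 1, 1], [1, 1, 7, 1], [1, 1, 1, 67]]
# ARI prefers U2 despite the many zeros in U1
```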
Equal sized clusters...
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2
Example: Do you prefer U1 or U2?

U1 vs V (column marginals of V: 25, 25, 25, 25):
17 | 17  0  0  0
17 |  0 17  0  0
17 |  0  0 17  0
49 |  8  8  8 25
AMI chooses this one because of the many 0's

U2 vs V (column marginals of V: 25, 25, 25, 25):
24 | 20  2  1  1
25 |  2 20  2  1
23 |  1  1 20  1
28 |  2  2  2 22
ARI chooses this one

When there are big equal-sized clusters in V, use ARI because 0's are misleading
SMIq can be used to correct selection bias
Reference clustering with 4 clusters; solutions U with different numbers of clusters
[Plots: probability of selection (q = 1.001) as a function of the number of sets r in U (2–10), shown for SMIq, AMIq, and NMIq]
Correct for selection bias with SMIq for any q
Reference clustering with 4 clusters; solutions U with different numbers of clusters
[Plots: probability of selection (q = 2) as a function of the number of sets r in U (2–10), shown for SMIq, AMIq, and NMIq]
Conclusion - Message
We computed generalized information-theoretic measures to propose AMIq and SMIq, in order to:
identify the application scenarios of ARI and AMI
correct for selection bias
Take-away message:
Use AMI when the reference is unbalanced and has small clusters
Use ARI when the reference has big equal-sized clusters
Use SMIq to correct for selection bias
Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance, Simone Romano,
James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Published in Proceedings of the 31st International Conference on Machine
Learning 2014, pp. 1143–1151 (ICML-14)
Adjusting for Chance Clustering Comparison Measures, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor.
To submit to the Journal of Machine Learning Research
Summary
Studying the distribution of the estimates D̂, we:
Designed RIC
Adjusted for quantification
Adjusted for ranking
These results can aid detection, quantification, and ranking of relationships as follows:
Detection: RIC can be used to detect relationships between continuous variables because
it has high power
Quantification: Adjustment for quantification can be used to obtain a more interpretable
range of values.
E.g. AMIC and AMIq
Ranking: Adjustment for ranking can be used to correct for biases towards variables
with missing values or variables with many categories.
E.g. AGini(α) for random forests
Future Work
Dependency measure estimates can also obtain high values by chance when they are
computed on different numbers of dimensions
⇒ study adjustments that are unbiased across different dimensionalities
Adjustment via permutations is slow
⇒ compute more analytical adjustments, e.g. for MIC
The random-seed discretization technique for RIC might have problems in high
dimensions
⇒ generate random seeds in random subspaces
⇒ study multivariable discretization using random trees
Inject randomness in other estimators of mutual information
⇒ E.g. choose different random kernel widths for the IKDE estimator
Papers
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Adjusting for Chance Clustering Comparison Measures”. To submit to the
Journal of Machine Learning Research
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “A Framework to Adjust Dependency Measure Estimates for Chance”. Under
submission in SIAM International Conference on Data Mining 2016 (SDM-16)
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “The Randomized Information Coefficient: Ranking Dependencies in Noisy
Data”. Under review in the Machine Learning Journal
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized Mutual Information for Clustering Comparisons: One Step
Further in Adjustment for Chance”. Published in Proceedings of the 31st International Conference on Machine Learning 2014, pp.
1143–1151 (ICML-14)
Collaborations:
Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Extending information theoretic validity indices for fuzzy
clusterings”. Submitted to the Transactions on Fuzzy Systems Journal
N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, “Discovering outlying aspects in large
datasets”. Submitted to the Data Mining and Knowledge Discovery Journal
N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection”.
Published in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM,
2014, pp. 512–521
Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized information theoretic cluster validity indices for
soft clusterings”. Published in Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31
Thank You All
In particular
My supervisors:
James Bailey, Karin Verspoor, and Vinh Nguyen
Committee Chair:
Tim Baldwin
My fellow PhD students
Questions?
Code available online:
https://github.com/ialuronico
References I
Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006).
On similarity indices and correction for chance agreement.
Journal of Classification, 23(2):301–313.
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006).
Meta clustering.
In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 107–118. IEEE.
Cohen, M. X. (2014).
Analyzing neural time series data: theory and practice.
MIT Press.
Cover, T. M. and Thomas, J. A. (2012).
Elements of information theory.
John Wiley & Sons.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012).
Decision forests: A unified framework for classification, regression, density estimation, manifold
learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
Dang, X. H. and Bailey, J. (2015).
A framework to uncover multiple alternative clusterings.
Machine Learning, 98(1-2):7–30.
References II
Dobra, A. and Gehrke, J. (2001).
Bias correction in classification tree construction.
In ICML, pages 90–97.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014).
Do we need hundreds of classifiers to solve real world classification problems?
The Journal of Machine Learning Research, 15(1):3133–3181.
Frank, E. and Witten, I. H. (1998).
Using a permutation test for attribute selection in decision trees.
In ICML, pages 152–160.
Geurts, P. (2002).
Bias/Variance Tradeoff and Time Series Classification.
PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de
Liège.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005).
Measuring statistical dependence with Hilbert-Schmidt norms.
In Algorithmic learning theory, pages 63–77. Springer.
Guyon, I. and Elisseeff, A. (2003).
An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3:1157–1182.
References III
Hothorn, T., Hornik, K., and Zeileis, A. (2006).
Unbiased recursive partitioning: A conditional inference framework.
Journal of Computational and Graphical Statistics, 15(3):651–674.
Hubert, L. and Arabie, P. (1985).
Comparing partitions.
Journal of Classification, 2:193–218.
Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and
Ostrouchov, G. (2007).
Relative performance of mutual information estimation methods for quantifying the dependence among
short and noisy data.
Physical Review E, 76(2):026209.
Kononenko, I. (1995).
On biases in estimating multi-valued attributes.
In International Joint Conferences on Artificial Intelligence, pages 1034–1040.
Kraskov, A., Stögbauer, H., and Grassberger, P. (2004).
Estimating mutual information.
Physical review E, 69(6):066138.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014).
Filta: Better view discovery from collections of clusterings via filtering.
In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
References IV
Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013).
The randomized dependence coefficient.
In Advances in Neural Information Processing Systems, pages 1–9.
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano,
A. (2006).
Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular
context.
BMC bioinformatics, 7(Suppl 1):S7.
Meilă, M. (2007).
Comparing clusterings—an information based distance.
Journal of Multivariate Analysis, 98(5):873–895.
Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013).
Discovering multiple clustering solutions: Grouping objects in different views of the data.
Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J.,
Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011).
Detecting novel associations in large data sets.
Science, 334(6062):1518–1524.
Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015).
Measuring dependence powerfully and equitably.
arXiv preprint arXiv:1505.02213.
References V
Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010).
On estimating mutual information for feature selection.
In Artificial Neural Networks ICANN 2010, pages 362–367. Springer.
Strehl, A. and Ghosh, J. (2003).
Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007).
Unbiased split selection for classification trees based on the gini index.
Computational Statistics & Data Analysis, 52(1):483–501.
Sugiyama, M. and Borgwardt, K. M. (2013).
Measuring statistical dependence via the mutual information dimension.
In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages
1692–1698. AAAI Press.
Székely, G. J., Rizzo, M. L., et al. (2009).
Brownian distance covariance.
The annals of applied statistics, 3(4):1236–1265.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013).
Reverse engineering cellular networks with information theoretic methods.
Cells, 2(2):306–329.
References VI
Vinh, N. X., Epps, J., and Bailey, J. (2009).
Information theoretic measures for clusterings comparison: is a correction for chance necessary?
In ICML, pages 1073–1080. ACM.
Witten, I. H., Frank, E., and Hall, M. A. (2011).
Data Mining: Practical Machine Learning Tools and Techniques.
3rd edition.