In this presentation, I discuss the topics I covered during my PhD:
Dependency measures between variables are fundamental for a number of important applications in machine learning. They are ubiquitously used: for feature selection, as splitting criteria in random forests, for clustering comparison and validation, and to infer biological networks, to name a few. Nonetheless, a number of problems arise when dependencies are estimated on finite data: the detection, quantification, and ranking of dependencies are all challenging.
This thesis proposes a series of contributions to improve performance on each of the three goals above. During the seminar I will demonstrate that:
- Adjusted measures can improve on the tasks of quantification and ranking. In particular, I will discuss some adjustments applied to the Maximal Information Coefficient (MIC), random forests, and clustering comparisons;
- A measure we designed, based on mutual information and randomisation, is competitive on the tasks of detection and ranking of relationships. We named this measure the Randomised Information Coefficient (RIC) and tested it on the applications of biological network inference and multi-variable feature selection.
PhD Completion Seminar
1. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions
Simone Romano’s PhD Completion Seminar
Design and Adjustment of
Dependency Measures Between Variables
November 30th 2015
Supervisor: Prof. James Bailey
Co-Supervisor: A/Prof. Karin Verspoor
Computing and Information Systems (CIS)
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Examples of Applications
Dependency Measures
A dependency measure D is used to assess
the amount of dependency between variables:
Example 1: After collecting weight and height for many people,
we can compute D(weight, height)
Example 2: Assess the amount of dependency between search queries in Google
https://www.google.com/trends/correlate/
They are fundamental for a number of applications in machine learning / data mining
Examples of Applications
Applications of Dependency Measures
Supervised learning
Feature selection [Guyon and Elisseeff, 2003];
Decision tree induction [Criminisi et al., 2012];
Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
External clustering validation [Strehl and Ghosh, 2003];
Generation of alternative or multi-view clusterings
[Müller et al., 2013, Dang and Bailey, 2015];
The exploration of the clustering space using results from the Meta-Clustering algorithm
[Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
Analysis of neural time-series data [Cohen, 2014].
Examples of Applications
Application example (1): feature selection / decision tree induction
Application: Identify whether the class C is dependent on a feature F
Toy Example: Is the class C = cancer dependent on the feature F = smoker, according to this data set of 20 patients?
Use of dependency measure: Compute D(F, C)
Smoker   Cancer
No       -
Yes      +
Yes      +
Yes      -
No       +
No       -
Yes      +
...      ...
Yes      +
Contingency table is a useful tool: it counts the co-occurrences of feature values and class values.

              +     -    total
Smoker        6     2        8
Non smoker    4     8       12
total        10    10       20

⇒ if F and C are dependent, then induce a split on smoker (yes / no) in the decision tree
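The split above can be scored with mutual information (a.k.a. information gain) computed directly from the contingency-table counts. A minimal sketch, not code from the thesis (the function name is mine):

```python
import math

def mutual_information(table):
    """Mutual information (in nats) between two categorical variables,
    computed from a contingency table of co-occurrence counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij > 0:
                mi += (nij / n) * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return mi

# Rows: smoker / non smoker; columns: cancer + / cancer -
print(mutual_information([[6, 2], [4, 8]]))  # ≈ 0.086 nats: a weak dependency
```

A value of 0 means the feature tells us nothing about the class; the larger D(F, C), the stronger the split candidate.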
Examples of Applications
Application example (2): external clustering validation
Application: Compare a clustering solution B to a reference clustering A.
Toy Example: N = 15 data points
reference clustering A with 2 clusters, stars and circles
clustering solution B with 2 clusters, red
and blue
Examples of Applications
Use of dependency measure: Compute D(A, B)
Once again the contingency table is a useful tool that assesses the amount of overlap between A and B:

              B
          red   blue   total
A  stars    4      4       8
   circles  2      5       7
   total    6      9      15
Examples of Applications
Application example (3): genetic network inference
Application: Identify whether the gene G1 is interacting with the gene G2
Toy Example: We have a time series of values for each of G1 and G2:
Use of dependency measure: Compute D(G1, G2)
time   G1        G2
t1     20.4400   19.7450
t2     19.0750   20.3300
t3     20.0650   20.1700
...    ...       ...
[Figure: time series of G1 and G2 over time, and scatter plot of G2 vs G1]
Here there is no contingency table because the variables are numerical.
Categories of Dependency measures
Categories of Dependency Measures
Dependency measures can be divided into two categories: measures between categorical variables and measures between numerical variables.
Between Categorical Variables
These measures can be computed naturally on a contingency table, for example for decision trees (the smoker/cancer table above) or for clustering comparisons (the A vs. B table above).
Information theoretic [Cover and Thomas, 2012]:
e.g. mutual information (a.k.a. information gain)
Based on pair-counting [Albatineh et al., 2006]: e.g. Rand Index, Jaccard similarity
Based on set-matching [Meilă, 2007]:
e.g. classification accuracy, agreement between annotators
Others: mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain,
Chi-square.
Categories of Dependency measures
Between Numerical Variables
No contingency table is available. For example, biological interaction between numerical variables (the scatter plot of G2 vs G1 above).
Estimators of mutual information [Khan et al., 2007]:
e.g. kNN estimator, kernel estimator, estimator based on grids
Correlation based:
e.g. Pearson’s correlation, distance correlation [Székely et al., 2009],
randomized dependence coefficient [Lopez-Paz et al., 2013]
Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]
Based on information theory:
e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011],
the mutual information dimension [Sugiyama and Borgwardt, 2013],
total information coefficient [Reshef et al., 2015].
Thesis Motivation
Thesis Motivation
Even if a dependency measure D has nice theoretical properties,
dependencies are estimated on finite data with ˆD.
The following goals of dependency measures are challenging:
Detection: Test for the presence of dependency.
E.g. test dependence between two genes
Example (3)
Quantification: Summarization of the amount of dependency in an interpretable fashion.
E.g. assessing the amount of overlap between two clusterings
Example (2)
Ranking: Sort the relationships of different variables.
E.g. ranking many features in decision trees
Example (1)
To improve performance on the three goals above:
We need information on the distribution of ˆD
Thesis Motivation
For Example, when Ranking Noisy Relationships
The distribution of ˆD(X, Y) when the relationship between X and Y is noisy should not overlap with the distribution of ˆD(X, Y) on a noiseless relationship.
Motivation
Mutual information I(X, Y) is good for ranking relationships with different levels of noise between variables:
high I ⇒ little noise
small I ⇒ much noise
It can also be computed between sets of variables, e.g.:
I(X, Y) = I({X1, X2}, Y) = I({weight, height}, BMI)
Mutual Information quantifies the information shared between two variables:

MI(X, Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)} \, dx \, dy
Importance of MI
It is based on a well-established theory and quantifies non-linear interactions which might be missed if, e.g., Pearson’s correlation coefficient r(X, Y) is used.
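As a concrete illustration of this last point, Pearson's r can be near zero on a noiseless but non-linear relationship. A small self-contained check (illustrative only, not from the thesis):

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(2000)]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # noiseless linear: r = 1
print(pearson_r(xs, [x * x for x in xs]))      # noiseless quadratic: r ≈ 0
```

Mutual information, in contrast, is high for both relationships, which is why it is preferred for ranking arbitrary relationships.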
Motivation
Estimation of Mutual Information
Many estimators of mutual information:
Acronym   Type                               Sets of vars.   Best Compl.   Worst Compl.
Iew       Discretization (equal width)       no              O(n^1.5)      O(n^1.5)
Ief       Discretization (equal frequency)   no              O(n^1.5)      O(n^1.5)
IA        Adaptive partitioning              no              O(n^1.5)      O(n^1.5)
Imean     Mean nearest neighbours            yes             O(n^2)        O(n^2)
IKDE      Kernel density estimation          yes             O(n^2)        O(n^2)
IkNN      Nearest neighbours                 yes             O(n^1.5)      O(n^2)
Discretization-based estimators of mutual information exhibit good complexity but are not applicable to sets of variables
Motivation
Discretization-based estimators use fixed grids and compute mutual information on the resulting contingency table. For example, Iew discretizes using equal-width binning:
[Scatter plot of Y vs X overlaid with an equal-width grid]
X is discretized into columns b_1, ..., b_c and Y into rows a_1, ..., a_r; n_ij counts the number of points in bin (i, j). Mutual information can be computed with:

Iew(X, Y) = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}}{N} \log \frac{n_{ij} N}{a_i b_j}

where a_i and b_j denote the row and column marginal totals.
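The equal-width estimator above can be written in a few lines. A sketch under my own naming (equal-width bins, counts n_ij, marginals a_i and b_j as in the formula):

```python
import math
import random

def iew(xs, ys, r=4, c=4):
    """Equal-width estimate of mutual information: bin X into c columns and
    Y into r rows, then apply sum_ij (n_ij/N) * log(n_ij * N / (a_i * b_j))."""
    N = len(xs)
    lox, loy = min(xs), min(ys)
    wx = (max(xs) - lox) / c or 1.0   # guard against constant data
    wy = (max(ys) - loy) / r or 1.0
    n = [[0] * c for _ in range(r)]
    for x, y in zip(xs, ys):
        i = min(int((y - loy) / wy), r - 1)   # row bin of y
        j = min(int((x - lox) / wx), c - 1)   # column bin of x
        n[i][j] += 1
    a = [sum(row) for row in n]               # row marginals a_i
    b = [sum(col) for col in zip(*n)]         # column marginals b_j
    return sum(nij / N * math.log(nij * N / (a[i] * b[j]))
               for i in range(r) for j, nij in enumerate(n[i]) if nij)

random.seed(1)
xs = [random.random() for _ in range(1000)]
ys = xs[:]                   # noiseless line: estimate near log(4) on a 4x4 grid
ys_shuffled = xs[:]
random.shuffle(ys_shuffled)  # independent data: estimate near 0 (small positive bias)
print(iew(xs, ys), iew(xs, ys_shuffled))
```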
Motivation
Criticism
The discretization approach is less popular for numerical variables because there is a systematic estimation bias which depends on the grid size.
However, when comparing dependencies, systematic estimation biases cancel each other out [Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010].
Thus it is not bad for comparing/ranking relationships!
Motivation
Comparing relationships/ comparing estimations of I
Task: Given a strong relationship s and a weak relationship w, compare the estimates ˆIs and
ˆIw of the true values Is and Iw
Systematic biases cancel out when comparing relationships
Systematic biases translate the distributions by a fixed amount
It is beneficial to reduce the variance
Challenge: Decreasing the variance of the estimation
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
21. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions
Design of the Randomized Information Coefficient (RIC)
Randomized Information Coefficient (RIC)
Idea:
Generate many random grids with different cardinality via random cut-offs
Estimate the normalized mutual information for each of them (normalization accounts for the different cardinalities)
Average the estimates
[Three scatter plots of Y vs X, each overlaid with a different random grid; their normalized mutual information values are averaged]
Parameters:
Kr - tunes the number of random grids
Dmax - tunes the maximum grid cardinality generated
Features:
Proved to decrease the variance like in random forests [Geurts, 2002]
Still good complexity: O(n^1.5)
Easy to extend to sets of variables
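The idea above can be sketched in a few lines. This is my own simplified reconstruction (random cut-off grids, NMI normalized by the larger marginal entropy); the actual RIC may differ in details such as the normalization and how grid cardinalities are drawn:

```python
import math
import random
from collections import Counter

def nmi(bx, by):
    """Normalized mutual information of two discretized sequences."""
    n = len(bx)
    cx, cy, cxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    hx = -sum(c / n * math.log(c / n) for c in cx.values())
    hy = -sum(c / n * math.log(c / n) for c in cy.values())
    if max(hx, hy) == 0:
        return 0.0
    mi = sum(c / n * math.log(c * n / (cx[i] * cy[j]))
             for (i, j), c in cxy.items())
    return mi / max(hx, hy)

def ric(xs, ys, kr=100, dmax=8, seed=0):
    """Average normalized MI over kr random grids built with random cut-offs."""
    rng = random.Random(seed)
    def random_bins(vs):
        k = rng.randint(2, dmax)                 # random grid cardinality
        lo, hi = min(vs), max(vs)
        cuts = sorted(rng.uniform(lo, hi) for _ in range(k - 1))
        return tuple(sum(v > c for c in cuts) for v in vs)
    total = 0.0
    for _ in range(kr):
        total += nmi(random_bins(xs), random_bins(ys))
    return total / kr

rng = random.Random(42)
xs = [rng.random() for _ in range(500)]
noise = [rng.random() for _ in range(500)]
print(ric(xs, xs), ric(xs, noise))   # high for the identity, near 0 for noise
```

Averaging over many grids is exactly where the variance reduction comes from, mirroring the random-forest analogy on the slide.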
Design of the Randomized Information Coefficient (RIC)
Random discretization of set of variables
Relationship between Y and X = {X1, X2}
[3D scatter plot of Y against (X1, X2), and its projection onto X0 = (X1 + X2)/2, plotted as Y vs X0]
Need to randomly discretize X ⇒ just choose some random seeds:
[Scatter plot of X1 vs X2 showing randomly chosen seed points]
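Random seeds give a simple way to discretize a set of variables jointly: pick k random data points as seeds and assign every point to its nearest seed. A sketch (the function name and the Euclidean metric are my choices):

```python
import random

def seed_discretize(points, k, rng):
    """Assign each multivariate point to the index of its nearest of k
    randomly sampled seed points (seeds are data points themselves)."""
    seeds = rng.sample(points, k)
    def nearest(p):
        return min(range(k),
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(p, seeds[s])))
    return [nearest(p) for p in points]

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(200)]
labels = seed_discretize(pts, 4, rng)
print(sorted(set(labels)))  # the 4 cells: [0, 1, 2, 3]
```

The resulting cell labels play the role of bins, so the contingency-table estimate of mutual information applies unchanged to X = {X1, X2}.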
Design of the Randomized Information Coefficient (RIC)
Detection of Relationship
Task: Using a permutation test, identify whether a relationship exists:
Generate 500 values of RIC under complete noise
Sort the values and identify the value x of RIC at position 500 × 95% = 475
Generate 500 values of RIC under a particular relationship
Count how many values are greater than x
⇒ the bigger the count, the bigger the power of RIC
[Grid of example relationships: Linear, Quadratic, Cubic, Sinusoidal low freq., Sinusoidal high freq., 4th Root, Circle, Step Function, Two Lines, X, Sinusoidal varying freq., Circle-bar; each at noise levels 1, 6, 11, 16, 21, 26]
Tested on many relationships and levels of noise
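The power computation above can be sketched as follows. Illustrative only: `abs_r` stands in for RIC as the dependency statistic, and `noisy_line` is a hypothetical relationship generator:

```python
import math
import random

def abs_r(xs, ys):
    """|Pearson correlation|, used here as a stand-in dependency statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / math.sqrt(vx * vy) if vx * vy else 0.0

def estimate_power(stat, noisy_rel, n_points=100, reps=500, alpha=0.05, seed=0):
    """Power of `stat` per the slide: take the 95% cutoff under complete
    noise, then count relationship samples whose statistic exceeds it."""
    rng = random.Random(seed)
    null = sorted(stat([rng.random() for _ in range(n_points)],
                       [rng.random() for _ in range(n_points)])
                  for _ in range(reps))
    cutoff = null[int(reps * (1 - alpha)) - 1]   # value at position 500 * 95% = 475
    hits = sum(stat(*noisy_rel(rng, n_points)) > cutoff for _ in range(reps))
    return hits / reps

def noisy_line(rng, n):
    xs = [rng.random() for _ in range(n)]
    return xs, [x + 0.5 * rng.gauss(0, 1) for x in xs]

print(estimate_power(abs_r, noisy_line))
```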
Design of the Randomized Information Coefficient (RIC)
Power as the number of random grids increases
Kr increases the number of random grids
[Figure: Area under the power curve vs the parameter Kr (50 to 200); every line is a relationship; RIC optimum at Kr = 200]
More random grids ⇒ less estimation variance ⇒ more power
Comparison Against Other Measures
Comparison with Other Measures
Extensively compared with other measures on the task of relationship detection
[Figure: Average rank by power across relationships (rank 1st when power is max on a relationship), for RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r2, IA, ACE, Imean, MID]
Comparison Against Other Measures
Comparison - Biological Network Inference
Reverse engineering of gene networks when the ground truth is known
[Figure: Average rank by mean average precision across networks (rank 1st when average precision is max on a network), for RIC, dCorr, IKDE, IkNN, HSIC, ACE, r2, GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID]
Also compared on:
Feature filtering for regression
Feature selection for regression
RIC shows competitive performance
Comparison Against Other Measures
Conclusion - Message
We proposed the Randomized Information Coefficient (RIC):
It reduces the variance of normalized mutual information computed via grids when comparing relationships
It randomly discretizes multiple variables
Take-away message:
There are different ways to generate random grids (random cut-offs / random seeds)
The more grids, the smaller the variance
The Randomized Information Coefficient: Ranking Dependencies in Noisy Data, Simone Romano, James Bailey, Nguyen Xuan
Vinh, and Karin Verspoor. Under review in the Machine Learning Journal
Comparison Against Other Measures
Hypotheses so far...
So far we have compared numerical variables on samples of fixed size n.
Dependency measures might be biased when they:
Compare samples with different n
Compare categorical variables
⇒ Need for adjustment in these cases
Motivation
Motivation for Adjustment for Quantification
Pearson’s correlation between two variables X and Y estimated on a data sample S_n = {(x_k, y_k)} of n data points:

r(S_n|X, Y) = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n} (x_k - \bar{x})^2 \sum_{k=1}^{n} (y_k - \bar{y})^2}}  (1)
[Figure: example scatter plots with r = 1, 0.8, 0.4, 0, -0.4, -0.8, -1; from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]
r2(Sn|X, Y) can be used as a proxy for the amount of noise in linear relationships: 1 if noiseless, 0 under complete noise.
Motivation
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has 499 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:
Figure : From supplementary material online in [Reshef et al., 2011]
MIC should be equal to:
1 if the relationship between X and Y is functional and noiseless
0 if there is complete noise
Motivation
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
[Histograms of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise]
Values can be high because of chance! The user expects values close to 0 in both cases
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Quantification
Adjustment for Chance
We define a framework for adjustment:

Adjustment for Quantification:
A\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\max \hat{D} - E[\hat{D}_0]}

It uses the distribution of \hat{D}_0 under independent variables:
r2_0: Beta distribution
MIC_0: can be computed using Monte Carlo permutations
This type of adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
Adjusted r2 ⇒ Ar2
Adjusted MIC ⇒ AMIC
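The Monte Carlo permutation scheme for this adjustment can be sketched as follows. Illustrative code (function names are mine; r² is shown because its maximum is 1):

```python
import math
import random

def r2(xs, ys):
    """Squared Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy) if vx * vy else 0.0

def adjusted(stat, xs, ys, d_max=1.0, n_perm=500, seed=0):
    """A(D) = (D - E[D0]) / (max D - E[D0]); E[D0] estimated by permuting Y."""
    rng = random.Random(seed)
    e_d0 = 0.0
    for _ in range(n_perm):
        yp = ys[:]
        rng.shuffle(yp)                # each permutation destroys any dependency
        e_d0 += stat(xs, yp)
    e_d0 /= n_perm
    return (stat(xs, ys) - e_d0) / (d_max - e_d0)

rng = random.Random(1)
xs = [rng.random() for _ in range(20)]
ys = [rng.random() for _ in range(20)]      # complete noise on a small sample
print(r2(xs, ys), adjusted(r2, xs, ys))     # adjusted value is pulled towards 0
print(adjusted(r2, xs, xs))                 # noiseless: stays 1.0
```

Here `d_max` is max ˆD, which equals 1 for both r² and MIC.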
Adjustment for Quantification
Adjusted measures enable better interpretability
Task: Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise    r2       Ar2
0%       1        1
20%      0.66     0.65
40%      0.39     0.37
60%      0.2      0.17
80%      0.073    0.044
100%     0.035    0.00046

Figure: Ar2 becomes zero on average at 100% noise

Noise    MIC      AMIC
0%       1        1
20%      0.7      0.6
40%      0.47     0.29
60%      0.34     0.11
80%      0.27     0.021
100%     0.26     0.0014

Figure: AMIC becomes zero on average at 100% noise
Adjustment for Quantification
Not biased towards small sample size n
Average value of ˆD for different % of noise
⇒ estimates can be high because of chance at small n (e.g. because of missing values)
[Figures: average value vs noise level for Raw r2 (n = 10, 20, 30, 40, 100, 200) and Raw MIC (n = 20, 40, 60, 80), alongside the adjusted Ar2 and AMIC versions]
Adjustment for Ranking
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using two equally unpredictive variables X1 and X2, defined as follows:
X1 ≡ patient had breakfast today, X1 = {yes, no};
X2 ≡ patient eye color, X2 = {green, blue, brown};
[Figure: category proportions for X1 (yes/no) and X2 (green/blue/brown)]
Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Adjustment for Ranking
Selection bias experiment
Experiment
n = 100 data points
Class C with 2 categories:
Generate a variable X1 with 2 categories
(independently from C)
Generate a variable X2 with 3 categories
(independently from C)
Compute Gini(X1, C) and Gini(X2, C); give a win to the variable that gets the highest value
REPEAT 10,000 times
[Bar chart: probability of selection for X1 vs X2]
Result: X2 gets selected 70% of the time ( Bad )
Given that they are equally unpredictive, we expected 50%
Challenge: adjust the estimated Gini gain to obtain unbiased rankings
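The experiment above can be reproduced in a few lines (Gini gain implemented from its textbook definition; function names are mine, and a smaller repetition count is used to keep it quick):

```python
import random
from collections import Counter

def gini_gain(feature, cls):
    """Gini gain of splitting class labels `cls` on a categorical `feature`."""
    n = len(cls)
    def impurity(labels):
        return 1.0 - sum((c / len(labels)) ** 2 for c in Counter(labels).values())
    gain = impurity(cls)
    for v, nv in Counter(feature).items():
        subset = [c for f, c in zip(feature, cls) if f == v]
        gain -= nv / n * impurity(subset)
    return gain

def selection_bias(reps=2000, n=100, seed=0):
    """How often X2 (3 categories) beats X1 (2 categories), both independent of C."""
    rng = random.Random(seed)
    wins2 = 0
    for _ in range(reps):
        c = [rng.randrange(2) for _ in range(n)]
        x1 = [rng.randrange(2) for _ in range(n)]
        x2 = [rng.randrange(3) for _ in range(n)]
        wins2 += gini_gain(x2, c) > gini_gain(x1, c)
    return wins2 / reps

print(selection_bias())  # well above the fair 50%, in line with the slide's ~70%
```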
Adjustment for Ranking
Adjustment for Ranking
We propose two adjustments for ranking:

Standardization:
SD̂ = (D̂ − E[D̂0]) / √Var(D̂0)
Quantifies statistical significance, like a p-value

Adjustment for Ranking:
AD̂(α) = D̂ − q0(1 − α)
Penalizes on statistical significance according to α; q0 is the quantile function of the null distribution D̂0
(a smaller α gives more penalization)
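Both adjustments only require a sample from the null distribution D̂0, obtained here by permuting one of the variables; a minimal sketch, where `measure` is any dependency estimate and `abs_cov` is a toy measure introduced only for the demo:

```python
import random
import statistics

def null_sample(measure, x, y, n_perm=200, seed=0):
    """Sample the null distribution D0 of a measure by permuting y."""
    rng = random.Random(seed)
    y = list(y)
    out = []
    for _ in range(n_perm):
        rng.shuffle(y)
        out.append(measure(x, y))
    return out

def standardized(measure, x, y, **kw):
    """S_D = (D - E[D0]) / sqrt(Var(D0))."""
    d0 = null_sample(measure, x, y, **kw)
    return (measure(x, y) - statistics.mean(d0)) / statistics.stdev(d0)

def adjusted(measure, x, y, alpha=0.05, **kw):
    """A_D(alpha) = D - q0(1 - alpha): penalize by a null quantile."""
    d0 = sorted(null_sample(measure, x, y, **kw))
    q0 = d0[min(len(d0) - 1, int((1 - alpha) * len(d0)))]
    return measure(x, y) - q0

# toy measure for the demo: absolute covariance (a hypothetical choice)
def abs_cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)

x = list(range(50))
s = standardized(abs_cov, x, x)  # strongly significant for a perfect dependency
a = adjusted(abs_cov, x, x)      # stays positive after the quantile penalty
```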
Standardized Gini (SGini) corrects for selection bias
Task: select among unpredictive features X1 with 2 categories and X2 with 3 categories.
[Bar chart: probability of selection for X1 vs X2, both close to 0.5]
Result: X1 and X2 each get selected close to 50% of the time (good)
Being similar to a p-value, this is consistent with the literature on decision
trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006,
Strobl et al., 2007].
Nonetheless, we found that this is a simplistic scenario
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to the same constant ≠ 0
[Bar chart: probability of selection; X1 selected more often than X2]
Result: SGini becomes biased towards X1 because it is more statistically significant (bad)
This behavior has been overlooked in the decision tree community
Use AD̂(α) to penalize less, or even to tune the bias ⇒ AGini(α)
Application to random forest
Why random forest? A good classifier to try first when there are “meaningful” features
[Fernández-Delgado et al., 2014].
Plug in different splitting criteria
Experiment: 19 data sets with categorical variables
[Plot: mean AUC (90–91.5) as a function of α, comparing AGini(α) against SGini and Gini]
Figure: Using the same α for all data sets
And α can be tuned for each data set with cross-validation.
Conclusion - Message
Dependency estimates can be high by chance under finite samples.
Adjustments can help with:
Quantification, to obtain an interpretable value in [0, 1]
Ranking, to avoid biases towards:
missing values
categorical variables with more categories
A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and
Karin Verspoor. Under submission in SIAM International Conference on Data Mining 2016 (SDM-16)
Arxiv: http://arxiv.org/abs/1510.07786
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Motivation
Clustering Validation
Given a reference clustering V, we want to validate a clustering solution U
⇒ we need dependency measures
There are two very popular measures based on adjustments:
The Adjusted Rand Index (ARI)
[Hubert and Arabie, 1985]
∼ 3000 citations
The Adjusted Mutual Information (AMI)
[Vinh et al., 2009]
∼ 200 citations
No clear connection between them; users use them both
Both computed on a contingency table
Notation: contingency table M with entries nij;
ai = Σj nij are the row marginals and
bj = Σi nij are the column marginals.

             V
       b1  · · ·  bj  · · ·  bc
   a1  n11  · · ·  ·  · · ·  n1c
    ·
U  ai   ·    ·   nij   ·      ·
    ·
   ar  nr1  · · ·  ·  · · ·  nrc
ARI - adjustment of the Rand Index (RI), based on counting pairs of objects:
ARI = (RI − E[RI]) / (max RI − E[RI])

AMI - adjustment of Mutual Information (MI), based on information theory:
AMI = (MI − E[MI]) / (max MI − E[MI])
Link: generalized information theory
Generalized information theory is based on the Tsallis q-entropy:

Hq(V) = (1/(q − 1)) · (1 − Σj (bj/N)^q)

which generalizes Shannon's entropy:

lim(q→1) Hq(V) = H(V) = −Σj (bj/N) log(bj/N)

Link between measures:
the mutual information MIq based on Tsallis q-entropy links RI and MI:

MIq=2 ∝ RI,   lim(q→1) MIq = MI
Challenge: Compute E[MIq] to connect ARI and AMI
Challenge 2.0: Compute Var(MIq) for standardization
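The q-entropy and its Shannon limit can be checked numerically; a small sketch (the marginals `b` are a hypothetical example):

```python
import math

def tsallis_entropy(b, q):
    """H_q(V) = (1 - sum_j (b_j/N)^q) / (q - 1), for q != 1."""
    n = sum(b)
    return (1.0 - sum((bj / n) ** q for bj in b)) / (q - 1.0)

def shannon_entropy(b):
    """H(V) = -sum_j (b_j/N) log(b_j/N)."""
    n = sum(b)
    return -sum((bj / n) * math.log(bj / n) for bj in b)

b = [10, 10, 10, 70]              # column marginals of a 100-point clustering
h_limit = tsallis_entropy(b, 1.000001)
h_shannon = shannon_entropy(b)
# h_limit approaches h_shannon as q -> 1; at q = 2, H_2 = 1 - sum_j p_j^2
```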
We propose a technique applicable to a broader class of measures:
We can do:
Exact computation of measures in Lφ,
where S ∈ Lφ is a linear function of the entries of the contingency table:

S = α + β Σij φij(nij)      (α and β are constants)
Asymptotic approximation of measures in Nφ (non-linear)
[Venn diagram of measure families: Rand Index (RI), Jaccard (J), MI, NMI, VI, and the generalized information-theoretic measures]
Figure: Families of measures we can adjust
Detailed Analysis of Contingency Tables
Exact Expected Value by Permutation Model
E[S] is obtained by summation over all possible contingency tables M generated by permutations:

E[S] = ΣM S(M) P(M) = α + β ΣM Σij φij(nij) P(M)

There is no method to exhaustively generate all M with fixed marginals, and enumerating permutations is extremely expensive (O(N!)).
However, it is possible to swap the inner summation with the outer one:

ΣM Σi,j φij(nij) P(M) = Σi,j Σnij φij(nij) P(nij)

where nij has a known hypergeometric distribution.
Computation time is dramatically reduced ⇒ O(max{rN, cN})
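The swapped summation can be sanity-checked numerically: under the permutation model each nij is hypergeometric with parameters (N, ai, bj), so for the linear choice φ(n) = n the expected cell value must equal ai·bj/N. A sketch using only the standard library:

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k successes) when drawing n items from N containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def expected_phi(phi, N, a_i, b_j):
    """E[phi(n_ij)] under the permutation (hypergeometric) model."""
    lo, hi = max(0, a_i + b_j - N), min(a_i, b_j)
    return sum(phi(k) * hypergeom_pmf(k, N, b_j, a_i) for k in range(lo, hi + 1))

# check: with phi(n) = n the expectation is a_i * b_j / N
N, a_i, b_j = 100, 30, 40
e = expected_phi(lambda n: n, N, a_i, b_j)
```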
Exact Variance Computation
We have to compute the second moment E[S²], which requires:

ΣM (Σi Σj φij(nij))² P(M) = ΣM Σi,j,i′,j′ φij(nij) · φi′j′(ni′j′) P(M)
= Σi,j,i′,j′ Σnij,ni′j′ φij(nij) · φi′j′(ni′j′) P(nij, ni′j′)

after swapping the summations.
Contribution: computing P(nij, ni′j′) is technically challenging.
We use the hypergeometric model: draws from an urn with N marbles of 3 colors (red, blue, and white).
Finally, we can define the adjustments
Definition: Adjusted Mutual Information q - AMIq

AMI2 = ARI,   lim(q→1) AMIq = AMI
We can finally relate ARI and AMI to generalized information theory!
Also define: a generalized Standardized Mutual Information q - SMIq for selection bias.
Their complexities:

Name | Computational complexity
AMIq | O(max{rN, cN})
SMIq | O(max{rcN³, c²N³})

Table: Complexity when comparing two clusterings: N objects, r and c clusters
Application Scenarios
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2
Example: Do you prefer U1 or U2?

U1 vs V (column marginals of V: 10, 10, 10, 70):
 8 |  8  0  0  0
 7 |  0  7  0  0
 7 |  0  0  7  0
78 |  2  3  3 70
AMI chooses this one because of the many 0's

U2 vs V (column marginals of V: 10, 10, 10, 70):
10 |  7  1  1  1
10 |  1  7  1  1
10 |  1  1  7  1
70 |  1  1  1 67
ARI chooses this one

When there are small clusters in V, use AMI because it likes 0's
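The ARI side of the preference can be verified directly from the two contingency tables above, using the standard pair-counting formula; a minimal sketch:

```python
from math import comb

def ari(table):
    """Adjusted Rand Index computed from a contingency table (list of rows)."""
    a = [sum(row) for row in table]        # row marginals
    b = [sum(col) for col in zip(*table)]  # column marginals
    n = sum(a)
    index = sum(comb(nij, 2) for row in table for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

u1 = [[8, 0, 0, 0], [0, 7, 0, 0], [0, 0, 7, 0], [2, 3, 3, 70]]
u2 = [[7, 1, 1, 1], [1, 7, 1, 1], [1, 1, 7, 1], [1, 1, 1, 67]]
# ARI prefers U2 despite the many zeros in U1
```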
Equal sized clusters...
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2
Example: Do you prefer U1 or U2?

U1 vs V (column marginals of V: 25, 25, 25, 25):
17 | 17  0  0  0
17 |  0 17  0  0
17 |  0  0 17  0
49 |  8  8  8 25
AMI chooses this one because of the many 0's

U2 vs V (column marginals of V: 25, 25, 25, 25):
24 | 20  2  1  1
25 |  2 20  2  1
23 |  1  1 20  1
28 |  2  2  2 22
ARI chooses this one

When there are big equal-sized clusters in V, use ARI because 0's are misleading
SMIq can be used to correct selection bias
Reference clustering with 4 clusters; solutions U with different numbers of clusters
[Plots: probability of selection (q = 1.001) as a function of the number of sets r in U (2–10), shown for SMIq, AMIq, and NMIq]
Correct for selection bias with SMIq for any q
Reference clustering with 4 clusters; solutions U with different numbers of clusters
[Plots: probability of selection (q = 2) as a function of the number of sets r in U (2–10), shown for SMIq, AMIq, and NMIq]
Conclusion - Message
We computed generalized information-theoretic measures to propose AMIq and SMIq, in order to:
identify the application scenarios of ARI and AMI
correct for selection bias
Take-away message:
Use AMI when the reference is unbalanced and has small clusters
Use ARI when the reference has big equal-sized clusters
Use SMIq to correct for selection bias
Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance, Simone Romano,
James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Published in Proceedings of the 31st International Conference on Machine
Learning 2014, pp. 1143–1151 (ICML-14)
Adjusting for Chance Clustering Comparison Measures, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor.
To submit to the Journal of Machine Learning Research
Summary
Studying the distribution of the estimates D̂, we:
Designed RIC
Adjusted for quantification
Adjusted for ranking
These results can aid detection, quantification, and ranking of relationships as follows:
Detection: RIC can be used to detect relationships between continuous variables because
it has high power
Quantification: Adjustment for quantification can be used to obtain a more interpretable
range of values.
E.g. AMIC and AMIq
Ranking: Adjustment for ranking can be used to correct for biases towards variables
with missing values or variables with many categories.
E.g. AGini(α) for random forests
Future Work
Dependency measure estimates can also obtain high values by chance when they are
computed on different numbers of dimensions
⇒ study adjustments that are unbiased across different dimensionalities
Adjustment via permutations is slow
⇒ compute more analytical adjustments, e.g. for MIC
The random-seed discretization technique for RIC might have problems in high
dimensions
⇒ generate random seeds in random subspaces
⇒ study multivariable discretization using random trees
Inject randomness in other estimators of mutual information
⇒ E.g. choose different random kernel widths for the IKDE estimator
Papers
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Adjusting for Chance Clustering Comparison Measures”. To submit to the
Journal of Machine Learning Research
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “A Framework to Adjust Dependency Measure Estimates for Chance”. Under
submission in SIAM International Conference on Data Mining 2016 (SDM-16)
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “The Randomized Information Coefficient: Ranking Dependencies in Noisy
Data”. Under review in the Machine Learning Journal
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized Mutual Information for Clustering Comparisons: One Step
Further in Adjustment for Chance”. Published in Proceedings of the 31st International Conference on Machine Learning 2014, pp.
1143–1151 (ICML-14)
Collaborations:
Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Extending information theoretic validity indices for fuzzy
clusterings”. Submitted to the Transactions on Fuzzy Systems Journal
N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, “Discovering outlying aspects in large
datasets”. Submitted to the Data Mining and Knowledge Discovery Journal
N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection”.
Published in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM,
2014, pp. 512–521
Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized information theoretic cluster validity indices for
soft clusterings”. Published in Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31
Thank You All
In particular
My supervisors:
James Bailey, Karin Verspoor, and Vinh Nguyen
Committee Chair:
Tim Baldwin
My fellow PhD students
Questions?
Code available online:
https://github.com/ialuronico
References I
Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006).
On similarity indices and correction for chance agreement.
Journal of Classification, 23(2):301–313.
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006).
Meta clustering.
In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 107–118. IEEE.
Cohen, M. X. (2014).
Analyzing neural time series data: theory and practice.
MIT Press.
Cover, T. M. and Thomas, J. A. (2012).
Elements of information theory.
John Wiley & Sons.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012).
Decision forests: A unified framework for classification, regression, density estimation, manifold
learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
Dang, X. H. and Bailey, J. (2015).
A framework to uncover multiple alternative clusterings.
Machine Learning, 98(1-2):7–30.
References II
Dobra, A. and Gehrke, J. (2001).
Bias correction in classification tree construction.
In ICML, pages 90–97.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014).
Do we need hundreds of classifiers to solve real world classification problems?
The Journal of Machine Learning Research, 15(1):3133–3181.
Frank, E. and Witten, I. H. (1998).
Using a permutation test for attribute selection in decision trees.
In ICML, pages 152–160.
Geurts, P. (2002).
Bias/Variance Tradeoff and Time Series Classification.
PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de
Liège.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005).
Measuring statistical dependence with Hilbert-Schmidt norms.
In Algorithmic learning theory, pages 63–77. Springer.
Guyon, I. and Elisseeff, A. (2003).
An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3:1157–1182.
References III
Hothorn, T., Hornik, K., and Zeileis, A. (2006).
Unbiased recursive partitioning: A conditional inference framework.
Journal of Computational and Graphical Statistics, 15(3):651–674.
Hubert, L. and Arabie, P. (1985).
Comparing partitions.
Journal of Classification, 2:193–218.
Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and
Ostrouchov, G. (2007).
Relative performance of mutual information estimation methods for quantifying the dependence among
short and noisy data.
Physical Review E, 76(2):026209.
Kononenko, I. (1995).
On biases in estimating multi-valued attributes.
In International Joint Conferences on Artificial Intelligence, pages 1034–1040.
Kraskov, A., Stögbauer, H., and Grassberger, P. (2004).
Estimating mutual information.
Physical review E, 69(6):066138.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014).
Filta: Better view discovery from collections of clusterings via filtering.
In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
References IV
Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013).
The randomized dependence coefficient.
In Advances in Neural Information Processing Systems, pages 1–9.
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano,
A. (2006).
Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular
context.
BMC bioinformatics, 7(Suppl 1):S7.
Meilă, M. (2007).
Comparing clusterings—an information based distance.
Journal of Multivariate Analysis, 98(5):873–895.
Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013).
Discovering multiple clustering solutions: Grouping objects in different views of the data.
Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J.,
Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011).
Detecting novel associations in large data sets.
Science, 334(6062):1518–1524.
Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015).
Measuring dependence powerfully and equitably.
arXiv preprint arXiv:1505.02213.
References V
Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010).
On estimating mutual information for feature selection.
In Artificial Neural Networks ICANN 2010, pages 362–367. Springer.
Strehl, A. and Ghosh, J. (2003).
Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007).
Unbiased split selection for classification trees based on the gini index.
Computational Statistics & Data Analysis, 52(1):483–501.
Sugiyama, M. and Borgwardt, K. M. (2013).
Measuring statistical dependence via the mutual information dimension.
In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages
1692–1698. AAAI Press.
Székely, G. J., Rizzo, M. L., et al. (2009).
Brownian distance covariance.
The annals of applied statistics, 3(4):1236–1265.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013).
Reverse engineering cellular networks with information theoretic methods.
Cells, 2(2):306–329.
References VI
Vinh, N. X., Epps, J., and Bailey, J. (2009).
Information theoretic measures for clusterings comparison: is a correction for chance necessary?
In ICML, pages 1073–1080. ACM.
Witten, I. H., Frank, E., and Hall, M. A. (2011).
Data Mining: Practical Machine Learning Tools and Techniques.
3rd edition.