Background · Ranking Dependencies in Noisy Data · A Framework for Adjusting Dependency Measures · Adjustments for Clustering Comparison Measures · Conclusions
Simone Romano’s PhD Completion Seminar
Design and Adjustment of
Dependency Measures Between Variables
November 30th 2015
Supervisor: Prof. James Bailey
Co-Supervisor: A/Prof. Karin Verspoor
Computing and Information Systems (CIS)
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Examples of Applications
Dependency Measures
A dependency measure D is used to assess
the amount of dependency between variables:
Example 1: After collecting weight and height for many people,
we can compute D(weight, height)
Example 2: assess the amount of dependency between search queries in Google
https://www.google.com/trends/correlate/
They are fundamental for a number of applications in machine learning and data mining
Applications of Dependency Measures
Supervised learning
Feature selection [Guyon and Elisseeff, 2003];
Decision tree induction [Criminisi et al., 2012];
Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
External clustering validation [Strehl and Ghosh, 2003];
Generation of alternative or multi-view clusterings
[Müller et al., 2013, Dang and Bailey, 2015];
The exploration of the clustering space using results from the Meta-Clustering algorithm
[Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
Analysis of neural time-series data [Cohen, 2014].
Application example (1): feature selection / decision tree induction
Application: Identify if the class C is dependent on a feature F
Toy Example: Is the class C = cancer dependent on the feature F = smoker according to
this data set of 20 patients?
Use of dependency measure: Compute D(F, C)
Smoker  Cancer
No      -
Yes     +
Yes     +
Yes     -
No      +
No      -
Yes     +
...     ...
Yes     +

The contingency table is a useful tool: it counts the co-occurrences of feature values and class values.

             +    -   total
Smoker       6    2     8
Non-smoker   4    8    12
total       10   10    20

⇒ if F and C are dependent, then induce a split on smoker (yes/no) in the decision tree
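The computation of D(F, C) on this table can be sketched in a few lines; a minimal example, assuming D is mutual information and using the counts from the slide (smokers: 6 positive, 2 negative; non-smokers: 4 positive, 8 negative):

```python
import math

# Contingency table from the toy example: rows = feature values,
# columns = (+, -) class counts; 20 patients in total.
table = {"smoker": (6, 2), "non-smoker": (4, 8)}
n = sum(a + b for a, b in table.values())

def mutual_information(table, n):
    """Empirical mutual information (in nats) between feature and class."""
    col_tot = [sum(row[j] for row in table.values()) for j in (0, 1)]
    mi = 0.0
    for row in table.values():
        row_tot = sum(row)
        for j, nij in enumerate(row):
            if nij > 0:
                mi += (nij / n) * math.log(nij * n / (row_tot * col_tot[j]))
    return mi

print(round(mutual_information(table, n), 4))  # → 0.0863
```

A positive value suggests dependence between smoking and cancer in this toy sample; the later parts of the talk address how much of such a value can be due to chance.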
Application example (2): external clustering validation
Application: Compare a clustering solution B to a reference clustering A.
Toy Example: N = 15 data points
reference clustering A with 2 clusters, stars and circles
clustering solution B with 2 clusters, red and blue
Use of dependency measure: Compute D(A, B)
Once again the contingency table is a useful tool: it assesses the amount of overlap between A and B

              B
           red  blue  total
A   A1      4    4      8
    A2      2    5      7
    total   6    9     15
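As a sketch, a table like the one above can be rebuilt from point-level cluster labels; the label vectors below are hypothetical, chosen only to be consistent with the slide's counts:

```python
from collections import Counter

# Hypothetical point-level labels, consistent with the slide's table:
# cluster "star" (8 points) overlaps red in 4 and blue in 4,
# cluster "circle" (7 points) overlaps red in 2 and blue in 5.
A = ["star"] * 8 + ["circle"] * 7
B = ["red"] * 4 + ["blue"] * 4 + ["red"] * 2 + ["blue"] * 5

def contingency(A, B):
    """Count co-occurrences of labels: table[a][b] = |{k : A_k = a and B_k = b}|."""
    counts = Counter(zip(A, B))
    return {a: {b: counts[(a, b)] for b in sorted(set(B))} for a in sorted(set(A))}

table = contingency(A, B)
```

Any dependency measure between categorical variables (mutual information, Rand index, ...) can then be computed from `table`.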
Application example (3): genetic network inference
Application: Identify if the gene G1 is interacting with the gene G2
Toy Example: We have a time series of values for each of G1 and G2:
Use of dependency measure: Compute D(G1, G2)
time G1 G2
t1 20.4400 19.7450
t2 19.0750 20.3300
t3 20.0650 20.1700
...     ...        ...
[Figure: time series of G1 and G2]
Here there is no contingency table
because the variables are numerical
[Figure: scatter plot of G1 vs. G2]
Categories of Dependency Measures
Dependency measures can be divided into two categories: measures between categorical
variables and measures between numerical variables.
Between Categorical Variables
These measures can be computed naturally on a contingency table, for example on the decision-tree table (smoker vs. cancer) and the clustering-comparison table (A vs. B) shown in the earlier examples.
Information theoretic [Cover and Thomas, 2012]:
e.g. mutual information (a.k.a. information gain)
Based on pair-counting [Albatineh et al., 2006]: e.g. Rand Index, Jaccard similarity
Based on set-matching [Meilă, 2007]:
e.g. classification accuracy, agreement between annotators
Others: mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain,
Chi-square.
Between Numerical Variables
No contingency table is available. For example, biological interaction:
[Figure: scatter plot of G1 vs. G2]
Estimators of mutual information [Khan et al., 2007]:
e.g. kNN estimator, kernel estimator, estimator based on grids
Correlation based:
e.g. Pearson's correlation, distance correlation [Székely et al., 2009],
randomized dependence coefficient [Lopez-Paz et al., 2013]
Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]
Based on information theory:
e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011],
the mutual information dimension [Sugiyama and Borgwardt, 2013],
total information coefficient [Reshef et al., 2015].
Thesis Motivation
Even if a dependency measure D has nice theoretical properties,
dependencies are estimated on finite data with an estimator D̂.
The following goals of dependency measures are challenging:
Detection: Test for the presence of dependency.
E.g. test dependence between two genes
Example (3)
Quantification: Summarization of the amount of dependency in an interpretable fashion.
E.g. assessing the amount of overlap between two clusterings
Example (2)
Ranking: Sort the relationships of different variables.
E.g. ranking many features in decision trees
Example (1)
To improve performance on the three goals above,
we need information on the distribution of D̂
For Example, when Ranking Noisy Relationships
The distribution of D̂(X, Y) when the relationship between X and Y is noisy
should not overlap with the distribution of D̂(X, Y) on a noiseless relationship:
Ranking Dependencies in Noisy Data
Motivation
Mutual information I(X, Y) is well suited to ranking relationships with different levels of noise
between the variables:
high I ⇒ little noise
small I ⇒ much noise
It can also be computed between sets of variables: e.g.
I(X, Y ) = I({X1, X2}, Y ) = I({weight, height}, BMI)
Mutual Information quantifies the information shared between two variables
MI(X, Y) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / ( f_X(x) f_Y(y) ) ] dx dy
Importance of MI
It is based on a well-established theory and quantifies non-linear interactions which might be
missed if, e.g., Pearson's correlation coefficient r(X, Y) is used.
Estimation of Mutual Information
Many estimators of mutual information:

Acronym   Type                              Applicable to sets of vars.   Complexity
I_ew      discretization, equal width       no                            O(n^1.5)
I_ef      discretization, equal frequency   no                            O(n^1.5)
I_A       adaptive partitioning             no                            O(n^1.5)
I_mean    mean nearest neighbours           yes                           O(n^2)
I_KDE     kernel density estimation         yes                           O(n^2)
I_kNN     nearest neighbours                yes                           O(n^1.5) best, O(n^2) worst

Discretization-based estimators of mutual information exhibit good complexity but are not
applicable to sets of variables
Discretization-based estimators use fixed grids and compute mutual information on a
contingency table.
[Figure: scatter plot of X vs. Y overlaid with a fixed grid]
For example Iew discretizes
using equal width binning
                 Discretized X
                 b1   ···   bj   ···   bc
            a1   n11   ·     ·    ·    n1c
Discretized ai    ·    ·    nij   ·     ·
Y           ar   nr1   ·     ·    ·    nrc
nij counts the number of points in a
particular bin. Mutual information
can be computed with:
I_ew(X, Y) = Σ_{i=1}^{r} Σ_{j=1}^{c} (n_ij / N) log( (n_ij · N) / (a_i · b_j) )

where N is the total number of points and a_i, b_j are the row and column marginal totals.
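A minimal sketch of this estimator, using NumPy's `histogram2d` for the equal-width grid (the bin count is an assumed parameter, not a value from the talk):

```python
import numpy as np

def I_ew(x, y, bins=5):
    """Equal-width-grid (I_ew) plug-in estimate of mutual information, in nats."""
    pij, _, _ = np.histogram2d(x, y, bins=bins)
    pij /= len(x)                                 # joint probabilities n_ij / N
    pi = pij.sum(axis=1, keepdims=True)           # row marginals a_i / N
    pj = pij.sum(axis=0, keepdims=True)           # column marginals b_j / N
    mask = pij > 0
    ratio = np.divide(pij, pi * pj, out=np.zeros_like(pij), where=mask)
    return float((pij[mask] * np.log(ratio[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=500)
strong = I_ew(x, x + 0.1 * rng.normal(size=500))  # nearly noiseless linear
weak = I_ew(x, rng.normal(size=500))              # complete noise
```

As the next slide discusses, the estimate carries a grid-dependent bias, but the bias affects `strong` and `weak` similarly, so their ordering is informative.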
Criticism
The discretization approach is less popular for numerical variables because:
there is a systematic estimation bias which depends on the grid size.
However, when comparing dependencies, systematic estimation biases cancel each other out
[Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010].
Thus, it is still not bad for comparing/ranking relationships!
Comparing relationships / comparing estimates of I
Task: Given a strong relationship s and a weak relationship w, compare the estimates Î_s and
Î_w of the true values I_s and I_w
Systematic biases cancel out when comparing relationships
Systematic biases translate the distributions by a fixed amount
It is beneficial to reduce the variance
Challenge: Decreasing the variance of the estimation
Randomized Information Coefficient (RIC)
Idea:
Generate many random grids with different cardinalities using random cut-offs
Estimate the normalized mutual information on each grid (normalization is needed because the grids have different cardinalities)
Average the estimates
[Figure: three different random grids overlaid on the same scatter plot of X vs. Y, whose normalized mutual information values are averaged]
Parameters:
Kr - tunes the number of random grids
Dmax - tunes the maximum grid cardinality generated
Features:
Proved to decrease the variance like in random forests [Geurts, 2002]
Still good complexity O(n^1.5)
Easy to extend to sets of variables
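The three steps of the idea can be sketched as follows; this is a simplified illustration, not the published measure: it assumes grid cardinalities drawn uniformly in [2, Dmax] and cut-offs sampled from the data points themselves.

```python
import numpy as np

def nmi_on_grid(x, y, cx, cy):
    """Normalized mutual information of (x, y) discretized by cut-offs cx, cy."""
    dx = np.searchsorted(np.sort(cx), x)
    dy = np.searchsorted(np.sort(cy), y)
    pij = np.zeros((len(cx) + 1, len(cy) + 1))
    np.add.at(pij, (dx, dy), 1.0 / len(x))        # joint cell probabilities
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    mask = pij > 0
    ratio = np.divide(pij, np.outer(pi, pj), out=np.zeros_like(pij), where=mask)
    mi = float((pij[mask] * np.log(ratio[mask])).sum())
    hx = -float((pi[pi > 0] * np.log(pi[pi > 0])).sum())
    hy = -float((pj[pj > 0] * np.log(pj[pj > 0])).sum())
    return mi / max(np.sqrt(hx * hy), 1e-12)      # normalize by sqrt(Hx * Hy)

def ric(x, y, Kr=100, Dmax=6, seed=0):
    """Average normalized MI over Kr random grids of random cardinality."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(Kr):
        kx, ky = rng.integers(2, Dmax + 1, size=2)      # random grid cardinalities
        cx = rng.choice(x, size=kx - 1, replace=False)  # random cut-offs
        cy = rng.choice(y, size=ky - 1, replace=False)
        vals.append(nmi_on_grid(x, y, cx, cy))
    return float(np.mean(vals))

rng = np.random.default_rng(1)
x = rng.uniform(size=200)
strong = ric(x, x + 0.05 * rng.normal(size=200), Kr=50)
weak = ric(x, rng.uniform(size=200), Kr=50)
```

Averaging over many independent grids is what drives down the estimation variance, in the same spirit as averaging trees in a random forest.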
Random discretization of sets of variables
Relationship between Y and X = {X1, X2}
[Figure: 3D plot of Y against {X1, X2} and its 2D projection Y against X′ = (X1 + X2)/2]
Need to randomly discretize X ⇒ just choose some random seeds:
[Figure: random seed points partitioning the (X1, X2) plane into regions]
Detection of Relationship
Task: Using a permutation test, identify whether a relationship exists:
Generate 500 values of RIC under complete noise
Sort the values and identify the value x of RIC at position 500 × 95% = 475
Generate 500 values of RIC under a particular relationship
Count how many values are greater than x
⇒ the bigger the count, the bigger the power of RIC
[Figure: test relationships at increasing noise levels: linear, quadratic, cubic, sinusoidal (low/high/varying frequency), 4th root, circle, step function, two lines, X, circle-bar; noise levels 1–26]
Tested on many relationships and levels of noise
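The permutation-test recipe above can be sketched as follows; to keep the example self-contained, |Pearson r| stands in for RIC as the dependency measure:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 500

def measure(x, y):
    # stand-in dependency measure: |Pearson r| (the slides use RIC in this role)
    return abs(np.corrcoef(x, y)[0, 1])

# 1) 500 values of the measure under complete noise, sorted;
#    x_crit is the value at position 500 * 95% = 475
null = sorted(measure(rng.normal(size=n), rng.normal(size=n)) for _ in range(trials))
x_crit = null[int(trials * 0.95)]

# 2) 500 values under a noisy linear relationship;
#    power = fraction of values above x_crit
hits = 0
for _ in range(trials):
    x = rng.normal(size=n)
    hits += measure(x, x + rng.normal(size=n)) > x_crit
power = hits / trials
```

For this easy linear relationship the power is close to 1; harder relationships and higher noise levels drive it down, which is what the power curves measure.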
Power as the number of random grids increases
Kr controls the number of random grids
[Figure: area under the power curve vs. Kr (50 to 200); RIC optimum at Kr = 200]
Figure: Average power for each relationship; every line is a relationship
More random grids ⇒ less estimation variance ⇒ more power
Comparison with Other Measures
Extensively compared with other measures on the task of relationship detection
[Figure: average rank by power of RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r², IA, ACE, Imean, MID]
Figure: Average rank across relationships (e.g. a measure ranks 1st when its power is maximal on a relationship)
Comparison - Biological Network Inference
Reverse engineering of networks of genes when the ground truth is known
[Figure: average rank by mean average precision of RIC, dCorr, IKDE, IkNN, HSIC, ACE, r², GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID]
Figure: Average rank across networks (e.g. a measure ranks 1st when its average precision is maximal on a network)
Also compared on:
Feature filtering for regression
Feature selection for regression
RIC shows competitive performance
Conclusion - Message
We proposed the Randomized Information Coefficient (RIC)
Reduces the variance of normalized mutual information via grids when comparing
relationships
Randomly discretizes multiple variables
Take away message:
There are different ways to generate random grids (random cut-offs / random seeds)
The more grids, the smaller the variance
The Randomized Information Coefficient: Ranking Dependencies in Noisy Data, Simone Romano, James Bailey, Nguyen Xuan
Vinh, and Karin Verspoor. Under review in the Machine Learning Journal
Hypothesis so far...
So far we compared numerical variables on samples of fixed size n
Dependency measures might have biases if they:
Compare samples with different n
Compare categorical variables
Need for adjustment in these cases
A Framework for Adjusting Dependency Measures
Motivation for Adjustment For Quantification
Pearson’s correlation between two variables X and Y estimated on a data sample
Sn = {(xk , yk )} of n data points:
r(S_n|X, Y) = Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ) / √( Σ_{k=1}^{n} (x_k − x̄)² · Σ_{k=1}^{n} (y_k − ȳ)² )    (1)
Figure: scatter plots of relationships with Pearson correlation values from 1 to −1. From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
r²(S_n|X, Y) can be used as a proxy for the amount of noise in linear relationships:
1 if noiseless
0 if complete noise
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011]
and has 499 citations to date according to Google scholar.
MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:
Figure: From the supplementary material online in [Reshef et al., 2011]
MIC should be equal to:
1 if the relationship between X and Y is functional and noiseless
0 if there is complete noise
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
[Figure: distributions of MIC(S_20|X, Y) and MIC(S_80|X, Y) over the 10,000 fully noisy relationships]
Values can be high purely because of chance! The user expects values close to 0 in both cases
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Chance
We define a framework for adjustment:
Adjustment for Quantification
A_D̂ = (D̂ − E[D̂₀]) / (max D̂ − E[D̂₀])

It uses the distribution of D̂₀ under independent variables:
r²₀: Beta distribution
MIC₀: can be computed using Monte Carlo permutations
This adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
Adjusted r² ⇒ Ar²
Adjusted MIC ⇒ AMIC
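A sketch of the adjustment, estimating E[D̂₀] by Monte Carlo permutations (the generic route mentioned for MIC; for r² the null is also available in closed form as a Beta distribution):

```python
import numpy as np

def adjusted(D, x, y, D_max=1.0, n_perm=200, seed=0):
    """A_D = (D - E[D_0]) / (max D - E[D_0]), with E[D_0] estimated by
    Monte Carlo permutations of y (which simulate independence)."""
    rng = np.random.default_rng(seed)
    d = D(x, y)
    e_d0 = float(np.mean([D(x, rng.permutation(y)) for _ in range(n_perm)]))
    return (d - e_d0) / (D_max - e_d0)

def r2(x, y):
    return float(np.corrcoef(x, y)[0, 1] ** 2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)   # complete noise, small n
raw, adj = r2(x, y), adjusted(r2, x, y)           # adj is pulled towards 0
```

On complete noise the raw r² is inflated at small n, while the adjusted value sits near 0, which is exactly the interpretability argument of the next slides.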
Adjusted measures enable better interpretability
Task:
Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise   0%    20%    40%    60%    80%     100%
r²      1     0.66   0.39   0.2    0.073   0.035
Ar²     1     0.65   0.37   0.17   0.044   0.00046

Figure: Ar² becomes zero on average at 100% noise

Noise   0%    20%    40%    60%    80%     100%
MIC     1     0.7    0.47   0.34   0.27    0.26
AMIC    1     0.6    0.29   0.11   0.021   0.0014

Figure: AMIC becomes zero on average at 100% noise
Not biased towards small sample size n
Average value of D̂ for different % of noise
⇒ estimates can be high because of chance at small n (e.g. because of missing values)
[Figure: average Raw r² (n = 10 to 200) and Raw MIC (n = 20 to 80) vs. noise level; at 100% noise the raw values stay well above 0 for small n]
[Figure: average Ar² and AMIC vs. noise level for the same sample sizes; the adjusted curves reach 0 on average at 100% noise for every n]
Motivation for Adjustment for Ranking
Say we want to predict the risk of cancer C using two equally unpredictive variables X1 and
X2, defined as follows:
X1 ≡ patient had breakfast today, X1 = {yes, no};
X2 ≡ patient eye color, X2 = {green, blue, brown};
[Figure: splits of the data set by X1 = yes/no and by X2 = green/blue/brown]
Problem: When ranking variables, dependency measures are biased towards the selection of
variables with many categories
This still happens because of finite samples!
Selection bias experiment
Experiment
n = 100 data points
Class C with 2 categories:
Generate a variable X1 with 2 categories
(independently from C)
Generate a variable X2 with 3 categories
(independently from C)
Compute Gini(X1, C) and Gini(X2, C), and give a win to the variable that gets the highest value
REPEAT 10,000 times
[Figure: probability of selection of X1 and X2]
Result: X2 gets selected 70% of the time ( Bad )
Given that they are equally unpredictive, we expected 50%
Challenge: adjust the estimated Gini gain to obtain an unbiased ranking
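The experiment above can be reproduced in a few lines; a sketch with a plain Gini-gain implementation (fewer repetitions than the slide's 10,000, for speed):

```python
import numpy as np

def gini_gain(x, c):
    """Gini gain of class labels c from splitting on categorical feature x."""
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - float((p ** 2).sum())
    weighted = sum((x == v).mean() * gini(c[x == v]) for v in np.unique(x))
    return gini(c) - weighted

rng = np.random.default_rng(0)
n, reps, wins_x2 = 100, 2000, 0
for _ in range(reps):
    c = rng.integers(0, 2, n)    # class with 2 categories
    x1 = rng.integers(0, 2, n)   # 2 categories, independent of C
    x2 = rng.integers(0, 3, n)   # 3 categories, independent of C
    wins_x2 += gini_gain(x2, c) > gini_gain(x1, c)
share = wins_x2 / reps           # well above 50%: X2 wins by chance alone
```

Even though both features are pure noise, the extra category gives X2 more room to fit chance fluctuations, so its estimated gain tends to be larger.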
Adjustment for Ranking
We propose two adjustments for ranking:
Standardization

S_D̂ = (D̂ − E[D̂₀]) / √Var(D̂₀)

Quantifies statistical significance, like a p-value.

Adjustment for Ranking

A_D̂(α) = D̂ − q₀(1 − α)

Penalizes on statistical significance according to α, where q₀ is the quantile function of the
distribution of D̂₀ (small α means more penalization).
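Both adjustments can be sketched around a permutation-based null distribution; here q₀ is the empirical quantile of simulated D̂₀ values, and r² is used as an example measure:

```python
import numpy as np

def ranking_adjustments(D, x, y, alpha=0.05, n_perm=500, seed=0):
    """S_D = (D - E[D_0]) / sqrt(Var(D_0)) and A_D(alpha) = D - q_0(1 - alpha),
    with the null distribution D_0 simulated by permutations of y."""
    rng = np.random.default_rng(seed)
    d = D(x, y)
    d0 = np.array([D(x, rng.permutation(y)) for _ in range(n_perm)])
    SD = (d - d0.mean()) / d0.std()          # standardization
    AD = d - np.quantile(d0, 1.0 - alpha)    # empirical quantile q_0(1 - alpha)
    return SD, AD

def r2(x, y):
    return float(np.corrcoef(x, y)[0, 1] ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
SD, AD = ranking_adjustments(r2, x, x + 0.5 * rng.normal(size=100))
```

A large SD means the observed value is many null standard deviations above chance; AD instead subtracts a significance threshold, and α controls how harsh that penalty is.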
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.
[Figure: probability of selection of X1 and X2]
Experiment: X1 and X2 each get selected on average almost 50% of the time ( Good )
Being similar to a p-value, this is consistent with the literature on decision
trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006,
Strobl et al., 2007].
Nonetheless: we found that this is a simplistic scenario
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to an equal constant level ≠ 0
Figure: probability of selection for X1 and X2 under SGini when both features are equally predictive.
Experiment: SGini becomes biased towards X1 because it is more statistically significant (Bad)
This behavior has been overlooked in the decision tree community
Use A_ˆD(α) to penalize less or even tune the bias! ⇒ AGini(α)
Application to random forests
Why random forests? A good classifier to try first when there are “meaningful” features [Fernández-Delgado et al., 2014].
We plug in different splitting criteria.
Experiment: 19 data sets with categorical variables
Figure: mean AUC (≈ 90–91.5) of random forests using AGini(α), SGini, and Gini as splitting criteria, with the same α used for all data sets.
And α can be tuned for each data set with cross-validation.
Conclusion - Message
Dependency estimates can be high purely because of chance under finite samples.
Adjustments can help for:
Quantification, to obtain an interpretable value in [0, 1]
Ranking, to avoid biases towards:
variables with missing values
categorical variables with more categories
A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and
Karin Verspoor. Under submission in SIAM International Conference on Data Mining 2016 (SDM-16)
Arxiv: http://arxiv.org/abs/1510.07786
Background
Examples of Applications
Categories of Dependency measures
Thesis Motivation
Ranking Dependencies in Noisy Data
Motivation
Design of the Randomized Information Coefficient (RIC)
Comparison Against Other Measures
A Framework for Adjusting Dependency Measures
Motivation
Adjustment for Quantification
Adjustment for Ranking
Adjustments for Clustering Comparison Measures
Motivation
Detailed Analysis of Contingency Tables
Application Scenarios
Conclusions
Motivation
Clustering Validation
Given a reference clustering V, we want to validate a clustering solution U
⇒ we need dependency measures
There are two very popular measures based on adjustments:
The Adjusted Rand Index (ARI)
[Hubert and Arabie, 1985]
∼ 3000 citations
The Adjusted Mutual Information (AMI)
[Vinh et al., 2009]
∼ 200 citations
There is no clear connection between them, and users use both.
Both computed on a contingency table
Notation: contingency table M = [n_ij], with rows indexed by the clusters of U (i = 1, ..., r) and columns by the clusters of V (j = 1, ..., c).
a_i = Σ_j n_ij are the row marginals and b_j = Σ_i n_ij are the column marginals.
ARI - Adjustment of the Rand Index (RI), based on counting pairs of objects:

ARI = (RI − E[RI]) / (max RI − E[RI])

AMI - Adjustment of Mutual Information (MI), based on information theory:

AMI = (MI − E[MI]) / (max MI − E[MI])
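Both formulas share the same adjustment-for-chance pattern, which can be written as a one-line helper (a sketch; the baseline values E[·] and max must be supplied by the specific measure):

```python
def adjust_for_chance(d, expected_d, max_d):
    """Generic adjusted index (D - E[D]) / (max D - E[D]);
    ARI and AMI both instantiate this pattern with their own baselines."""
    return (d - expected_d) / (max_d - expected_d)

# A measure at its chance baseline adjusts to 0; at its maximum, to 1.
print(adjust_for_chance(0.5, 0.5, 1.0), adjust_for_chance(1.0, 0.5, 1.0))
```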
Link: generalized information theory
Generalized information theory is based on the Tsallis q-entropy:

H_q(V) = 1/(q − 1) · [1 − Σ_j (b_j / N)^q]

which generalizes Shannon's entropy:

lim_{q→1} H_q(V) = H(V) = −Σ_j (b_j / N) log(b_j / N)

Link between measures:
Mutual Information (MI_q) based on Tsallis q-entropy links RI and MI:

MI_{q=2} ∝ RI,   lim_{q→1} MI_q = MI
Challenge: Compute E[MIq] to connect ARI and AMI
Challenge 2.0: Compute Var(MIq) for standardization
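The q-entropy and its Shannon limit can be checked numerically (a minimal sketch using natural logarithms; the cluster sizes are illustrative):

```python
import math

def tsallis_entropy(counts, q):
    """Tsallis q-entropy H_q of a clustering with the given cluster sizes b_j;
    for q -> 1 it converges to Shannon entropy (natural logarithm)."""
    n = sum(counts)
    p = [b / n for b in counts]
    if q == 1:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

sizes = [10, 10, 10, 70]               # illustrative cluster sizes b_j
print(tsallis_entropy(sizes, 2))       # quadratic (Gini-like) entropy
print(tsallis_entropy(sizes, 1.0001))  # approaches the Shannon value
print(tsallis_entropy(sizes, 1))       # Shannon entropy
```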
We propose a technique applicable to a broader class of measures. We can do:

Exact computation of measures in Lφ, where S ∈ Lφ is a linear function of the entries of the contingency table:
S = α + β Σ_ij φ_ij(n_ij)   (α and β are constants)

Asymptotic approximation of measures in Nφ (non-linear)
Figure: families of measures we can adjust — the Rand Index (RI), Jaccard (J), MI, NMI, VI, and the generalized information-theoretic measures.
Detailed Analysis of Contingency Tables
Exact Expected Value by Permutation Model
E[S] is obtained by summation over all possible contingency tables M obtained by
permutations.
E[S] =
M
S(M)P(M) = α + β
M ij
φij (nij )P(M)
No method to exhaustively generate M fixing the marginals
extremely time expensive ( permutations O(N!))
However, it is possible to swap the inner summation with the outer summation:
M i,j
to swap
φij (nij )P(M) =
i,j nij
swapped
φij (nij )P(nij )
nij has a known hypergeometric distribution,
Computation time dramatically reduced! ⇒ O (max {rN, cN})
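A direct sketch of this computation in pure Python (the per-cell function φ and the marginals are illustrative; P(n_ij = k) is the hypergeometric pmf written with binomial coefficients):

```python
from math import comb

def expected_linear_stat(a, b, phi):
    """E[sum_ij phi(i, j, n_ij)] under the permutation model: each cell count
    n_ij is hypergeometric with parameters (N, a_i, b_j), so the expectation
    is a sum of per-cell expectations instead of a sum over all N! permutations."""
    n = sum(a)
    total = 0.0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            for k in range(max(0, ai + bj - n), min(ai, bj) + 1):
                p = comb(bj, k) * comb(n - bj, ai - k) / comb(n, ai)  # P(n_ij = k)
                total += phi(i, j, k) * p
    return total

# Expected number of co-clustered pairs, sum_ij C(n_ij, 2), for toy marginals;
# it matches the closed form sum_i C(a_i, 2) * sum_j C(b_j, 2) / C(N, 2).
exp_pairs = expected_linear_stat([5, 5], [5, 5], lambda i, j, k: comb(k, 2))
print(exp_pairs)
```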
Exact Variance Computation
We have to compute the second moment E[S²], which requires:

Σ_M [Σ_{i=1}^r Σ_{j=1}^c φ_ij(n_ij)]² P(M)
  = Σ_M Σ_{i,j,i',j'} φ_ij(n_ij) · φ_{i'j'}(n_{i'j'}) P(M)
  = Σ_{i,j,i',j'} Σ_{n_ij, n_{i'j'}} φ_ij(n_ij) · φ_{i'j'}(n_{i'j'}) P(n_ij, n_{i'j'})

after swapping the summations as before.
Contribution: the computation of the joint distribution P(n_ij, n_{i'j'}) is technically challenging.
We use a hypergeometric model: drawings from an urn with N marbles of 3 colors: red, blue, and white.
Finally, we can define the adjustments...
Definition: Adjusted Mutual Information of order q (AMI_q)

AMI_2 = ARI,   lim_{q→1} AMI_q = AMI

We can finally relate ARI and AMI to generalized information theory!
We also define a generalized Standardized Mutual Information of order q (SMI_q) to correct for selection bias.
Their complexities:

Name   Computational complexity
AMI    O(max{rN, cN})
SMI    O(max{rcN³, c²N³})

Table: complexity when comparing two clusterings of N objects into r and c clusters
Application Scenarios
Task: clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2.
Example: do you prefer U1 or U2?

Contingency table of U1 vs V (first row: column marginals of V; first column: row marginals of U1):
      10 10 10 70
  8 |  8  0  0  0
  7 |  0  7  0  0
  7 |  0  0  7  0
 78 |  2  3  3 70
AMI chooses this one because of the many 0's

Contingency table of U2 vs V:
      10 10 10 70
 10 |  7  1  1  1
 10 |  1  7  1  1
 10 |  1  1  7  1
 70 |  1  1  1 67
ARI chooses this one
When there are small clusters in V , use AMI because it likes 0’s
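The ARI preference in this example can be checked directly from the two contingency tables (a pure-Python sketch of the pair-counting formula; reproducing AMI's preference for U1 additionally needs E[MI], which is more involved):

```python
from math import comb

def ari_from_table(table):
    """Adjusted Rand Index computed directly from a contingency table
    (pair-counting form with the permutation-model expectation)."""
    a = [sum(row) for row in table]        # row marginals
    b = [sum(col) for col in zip(*table)]  # column marginals
    n = sum(a)
    sum_ij = sum(comb(nij, 2) for row in table for nij in row)
    sum_a = sum(comb(ai, 2) for ai in a)
    sum_b = sum(comb(bj, 2) for bj in b)
    expected = sum_a * sum_b / comb(n, 2)
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)

u1 = [[8, 0, 0, 0], [0, 7, 0, 0], [0, 0, 7, 0], [2, 3, 3, 70]]
u2 = [[7, 1, 1, 1], [1, 7, 1, 1], [1, 1, 7, 1], [1, 1, 1, 67]]
print(ari_from_table(u1), ari_from_table(u2))  # ARI ranks U2 above U1
```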
Equal sized clusters...
Task: clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2.
Example: do you prefer U1 or U2?

Contingency table of U1 vs V (first row: column marginals of V; first column: row marginals of U1):
      25 25 25 25
 17 | 17  0  0  0
 17 |  0 17  0  0
 17 |  0  0 17  0
 49 |  8  8  8 25
AMI chooses this one because of the many 0's

Contingency table of U2 vs V:
      25 25 25 25
 24 | 20  2  1  1
 25 |  2 20  2  1
 23 |  1  1 20  1
 28 |  2  2  2 22
ARI chooses this one
When there are big equal-sized clusters in V, use ARI because 0's are misleading
SMIq can be used to correct selection bias
Reference clustering with 4 clusters; solutions U with different numbers of clusters r = 2, ..., 10.
Figure: probability of selecting each r under SMIq, AMIq, and NMIq (q = 1.001).
Correct for selection bias with SMIq for any q
Reference clustering with 4 clusters; solutions U with different numbers of clusters r = 2, ..., 10.
Figure: probability of selecting each r under SMIq, AMIq, and NMIq (q = 2).
Conclusion - Message
We computed generalized information-theoretic measures to propose AMIq and SMIq, in order to:
identify the application scenarios of ARI and AMI
correct for selection bias
Take-away message:
Use AMI when the reference is unbalanced and has small clusters
Use ARI when the reference has big equal-sized clusters
Use SMIq to correct for selection bias
Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance, Simone Romano,
James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Published in Proceedings of the 31st International Conference on Machine
Learning 2014, pp. 1143–1151 (ICML-14)
Adjusting for Chance Clustering Comparison Measures, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor.
To submit to the Journal of Machine Learning Research
Summary
Studying the distribution of the estimates ˆD, we:
Designed RIC
Adjusted for quantification
Adjusted for ranking
These results can aid detection, quantification, and ranking of relationships as follows:
Detection: RIC can be used to detect relationships between continuous variables because it has high power
Quantification: adjustment for quantification yields a more interpretable range of values, e.g. AMIC and AMIq
Ranking: adjustment for ranking can be used to correct for biases towards variables with missing values or variables with many categories, e.g. AGini(α) for random forests
Future Work
Dependency measure estimates can also obtain high values by chance when they are computed on different numbers of dimensions
⇒ study adjustments that are unbiased towards different dimensionality
Adjustment via permutations is slow
⇒ compute more analytical adjustments, e.g. for MIC
The random-seed discretization technique for RIC might have problems in high dimensions
⇒ generate random seeds in random subspaces
⇒ study multivariable discretization using random trees
Inject randomness into other estimators of mutual information
⇒ e.g. choose different random kernel widths for the IKDE estimator
Papers
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Adjusting for Chance Clustering Comparison Measures”. To submit to the
Journal of Machine Learning Research
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “A Framework to Adjust Dependency Measure Estimates for Chance”. Under
submission in SIAM International Conference on Data Mining 2016 (SDM-16)
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “The Randomized Information Coefficient: Ranking Dependencies in Noisy
Data” Under review in the Machine Learning Journal
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized Mutual Information for Clustering Comparisons: One Step
Further in Adjustment for Chance”. Published in Proceedings of the 31st International Conference on Machine Learning 2014, pp.
1143–1151 (ICML-14)
Collaborations:
Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Extending information theoretic validity indices for fuzzy
clusterings”. Submitted to the Transactions on Fuzzy Systems Journal
N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, “Discovering outlying aspects in large
datasets”. Submitted to the Data Mining and Knowledge Discovery Journal
N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection”.
Published in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM,
2014, pp. 512–521
Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized information theoretic cluster validity indices for
soft clusterings”. Published in Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31
Thank You All
In particular
My supervisors:
James Bailey, Karin Verspoor, and Vinh Nguyen
Committee Chair:
Tim Baldwin
My fellow PhD students
Questions?
Code available online:
https://github.com/ialuronico
References I
Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006).
On similarity indices and correction for chance agreement.
Journal of Classification, 23(2):301–313.
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006).
Meta clustering.
In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 107–118. IEEE.
Cohen, M. X. (2014).
Analyzing neural time series data: theory and practice.
MIT Press.
Cover, T. M. and Thomas, J. A. (2012).
Elements of information theory.
John Wiley & Sons.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012).
Decision forests: A unified framework for classification, regression, density estimation, manifold
learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
Dang, X. H. and Bailey, J. (2015).
A framework to uncover multiple alternative clusterings.
Machine Learning, 98(1-2):7–30.
References II
Dobra, A. and Gehrke, J. (2001).
Bias correction in classification tree construction.
In ICML, pages 90–97.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014).
Do we need hundreds of classifiers to solve real world classification problems?
The Journal of Machine Learning Research, 15(1):3133–3181.
Frank, E. and Witten, I. H. (1998).
Using a permutation test for attribute selection in decision trees.
In ICML, pages 152–160.
Geurts, P. (2002).
Bias/Variance Tradeoff and Time Series Classification.
PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de Liège.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005).
Measuring statistical dependence with Hilbert-Schmidt norms.
In Algorithmic learning theory, pages 63–77. Springer.
Guyon, I. and Elisseeff, A. (2003).
An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3:1157–1182.
References III
Hothorn, T., Hornik, K., and Zeileis, A. (2006).
Unbiased recursive partitioning: A conditional inference framework.
Journal of Computational and Graphical Statistics, 15(3):651–674.
Hubert, L. and Arabie, P. (1985).
Comparing partitions.
Journal of Classification, 2:193–218.
Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and
Ostrouchov, G. (2007).
Relative performance of mutual information estimation methods for quantifying the dependence among
short and noisy data.
Physical Review E, 76(2):026209.
Kononenko, I. (1995).
On biases in estimating multi-valued attributes.
In International Joint Conferences on Artificial Intelligence, pages 1034–1040.
Kraskov, A., Stögbauer, H., and Grassberger, P. (2004).
Estimating mutual information.
Physical Review E, 69(6):066138.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014).
Filta: Better view discovery from collections of clusterings via filtering.
In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
References IV
Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013).
The randomized dependence coefficient.
In Advances in Neural Information Processing Systems, pages 1–9.
Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano,
A. (2006).
Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular
context.
BMC bioinformatics, 7(Suppl 1):S7.
Meilă, M. (2007).
Comparing clusterings—an information based distance.
Journal of Multivariate Analysis, 98(5):873–895.
Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013).
Discovering multiple clustering solutions: Grouping objects in different views of the data.
Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J.,
Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011).
Detecting novel associations in large data sets.
Science, 334(6062):1518–1524.
Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015).
Measuring dependence powerfully and equitably.
arXiv preprint arXiv:1505.02213.
References V
Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010).
On estimating mutual information for feature selection.
In Artificial Neural Networks ICANN 2010, pages 362–367. Springer.
Strehl, A. and Ghosh, J. (2003).
Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007).
Unbiased split selection for classification trees based on the gini index.
Computational Statistics & Data Analysis, 52(1):483–501.
Sugiyama, M. and Borgwardt, K. M. (2013).
Measuring statistical dependence via the mutual information dimension.
In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages
1692–1698. AAAI Press.
Székely, G. J., Rizzo, M. L., et al. (2009).
Brownian distance covariance.
The Annals of Applied Statistics, 3(4):1236–1265.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013).
Reverse engineering cellular networks with information theoretic methods.
Cells, 2(2):306–329.
References VI
Vinh, N. X., Epps, J., and Bailey, J. (2009).
Information theoretic measures for clusterings comparison: is a correction for chance necessary?
In ICML, pages 1073–1080. ACM.
Witten, I. H., Frank, E., and Hall, M. A. (2011).
Data Mining: Practical Machine Learning Tools and Techniques.
3rd edition.
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables

PhD Completion Seminar

  • 1. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Simone Romano’s PhD Completion Seminar Design and Adjustment of Dependency Measures Between Variables November 30th 2015 Supervisor: Prof. James Bailey Co-Supervisor: A/Prof. Karin Verspoor Computing and Information Systems (CIS) Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 2. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Background Examples of Applications Categories of Dependency measures Thesis Motivation Ranking Dependencies in Noisy Data Motivation Design of the Randomized Information Coefficient (RIC) Comparison Against Other Measures A Framework for Adjusting Dependency Measures Motivation Adjustment for Quantification Adjustment for Ranking Adjustments for Clustering Comparison Measures Motivation Detailed Analysis of Contingency Tables Application Scenarios Conclusions Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 3. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Dependency Measures A dependency measure D is used to assess the amount of dependency between variables: Example 1: After collecting weight and height for many people, we can compute D(weight, height) Example 2: Assess the amount of dependency between search queries in Google https://www.google.com/trends/correlate/ They are fundamental for a number of applications in machine learning and data mining Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 4. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Applications of Dependency Measures Supervised learning Feature selection [Guyon and Elisseeff, 2003]; Decision tree induction [Criminisi et al., 2012]; Evaluation of classification accuracy [Witten et al., 2011]. Unsupervised learning External clustering validation [Strehl and Ghosh, 2003]; Generation of alternative or multi-view clusterings [Müller et al., 2013, Dang and Bailey, 2015]; The exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014]. Exploratory analysis Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013]; Analysis of neural time-series data [Cohen, 2014]. Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 5. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Application example (1): feature selection / decision tree induction Application: Identify if the class C is dependent on a feature F Toy Example: Is the class C = cancer dependent on the feature F = smoker, according to this data set of 20 patients? Use of dependency measure: Compute D(F, C) Smoker Cancer No - Yes + Yes + Yes - No + No - Yes + ... ... Yes + The contingency table is a useful tool: it counts the co-occurrences of feature values and class values. + - 10 10 Smoker 8 6 2 Non smoker 12 4 8 ⇒ if it is dependent then induce a split in the decision tree yes no smoker Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
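The information gain used to score such a split is the mutual information between the feature and the class, and it can be computed directly from the contingency table. A minimal sketch in plain Python (not the thesis code), applied to the smoker/cancer table from the slide:

```python
import math

def mutual_information(table):
    """Mutual information (information gain) in bits between the row
    variable and the column variable of a contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # 0 * log(0) is taken as 0
            # p_ij * log2( p_ij / (p_i * p_j) ) rewritten with counts
            mi += (n_ij / n) * math.log2(n_ij * n / (row_tot[i] * col_tot[j]))
    return mi

# Table from the slide: rows = smoker / non-smoker, cols = + / -
table = [[6, 2], [4, 8]]
print(mutual_information(table))  # about 0.125 bits
```

On an exactly independent table such as `[[5, 5], [5, 5]]` every cell ratio is 1, so the mutual information is 0.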
  • 6. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Application example (2): external clustering validation Application: Compare a clustering solution B to a reference clustering A. Toy Example: N = 15 data points reference clustering A with 2 clusters, stars and circles Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 7. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Application example (2): external clustering validation Application: Compare a clustering solution B to a reference clustering A. Toy Example: N = 15 data points reference clustering A with 2 clusters, stars and circles clustering solution B with 2 clusters, red and blue Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 8. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Use of dependency measure: Compute D(A, B) Once again, the contingency table is a useful tool that assesses the amount of overlap between A and B B red blue 6 9 A 8 4 4 7 2 5 Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
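Pair-counting clustering comparison measures such as the Rand index (mentioned in the next slides) can also be computed from this contingency table alone. A minimal sketch, using the table from the toy example:

```python
from math import comb

def rand_index(table):
    """Rand index of two clusterings from their contingency table:
    the fraction of point pairs on which the two clusterings agree
    (together in both, or apart in both)."""
    n = sum(sum(row) for row in table)
    same_both = sum(comb(n_ij, 2) for row in table for n_ij in row)
    same_a = sum(comb(sum(row), 2) for row in table)
    same_b = sum(comb(sum(col), 2) for col in zip(*table))
    total = comb(n, 2)
    # pairs apart in both = total - same_a - same_b + same_both
    return (total + 2 * same_both - same_a - same_b) / total

# Table from the slide: rows = clusters of A, cols = clusters of B
table = [[4, 4], [2, 5]]
print(rand_index(table))  # 51/105, about 0.486
```

Two identical clusterings give a Rand index of 1; the toy solution B agrees with the reference A on roughly half of the point pairs.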
  • 9. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Examples of Applications Application example (3): genetic network inference Application: Identify if the gene G1 is interacting with the gene G2 Toy Example: We have a time series of values for each of G1 and G2: Use of dependency measure: Compute D(G1, G2) time G1 G2 t1 20.4400 19.7450 t2 19.0750 20.3300 t3 20.0650 20.1700 ... ... ... [time-series plot of G1 and G2, and scatter plot of G1 vs. G2] Here there is no contingency table because the variables are numerical Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
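With numerical variables the measure is computed directly on the raw values. As an illustrative baseline (not one of the thesis measures, and on made-up toy values rather than the gene series above), Pearson's correlation:

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient: the simplest numerical
    dependency measure, sensitive only to linear dependence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear toy series: correlation is exactly 1
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

Its blindness to nonlinear relationships is precisely why the measures on the next slide (mutual information estimators, distance correlation, kernel and grid-based measures) exist.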
  • 10. Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions Categories of Dependency measures Categories of Dependency Measures Dependency measures can be divided into two categories: measures between categorical variables and measures between numerical variables. Between Categorical Variables These measures can be computed naturally on a contingency table. For example on: Decision trees yes no smoker + -10 10 Smoker 8 6 2 Non smoker 12 4 8 Clustering comparisons B red blue 6 9 A 8 4 4 7 2 5 Information theoretic [Cover and Thomas, 2012]: e.g. mutual information (a.k.a. information gain) Based on pair-counting [Albatineh et al., 2006]: e.g. Rand Index, Jaccard similarity Based on set-matching [Meilă, 2007]: e.g. classification accuracy, agreement between annotators Others: mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain, Chi-square. Simone Romano University of Melbourne Design and Adjustment of Dependency Measures Between Variables
  • 11. Categories of Dependency Measures — Between numerical variables: no contingency table, for example in biological interaction [Figure: scatter plot of G1 vs G2]. Estimators of mutual information [Khan et al., 2007]: e.g. kNN estimator, kernel estimator, grid-based estimators. Correlation based: e.g. Pearson's correlation, distance correlation [Székely et al., 2009], randomized dependence coefficient [Lopez-Paz et al., 2013]. Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]. Based on information theory: e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011], the mutual information dimension [Sugiyama and Borgwardt, 2013], the total information coefficient [Reshef et al., 2015].
  • 12. Thesis Motivation — Even if a dependency measure D has nice theoretical properties, dependencies are estimated on finite data with D̂. The following goals of dependency measures are challenging: Detection — test for the presence of dependency, e.g. test dependence between two genes (Example 3). Quantification — summarize the amount of dependency in an interpretable fashion, e.g. assess the amount of overlap between two clusterings (Example 2). Ranking — sort the relationships of different variables, e.g. rank many features in decision trees (Example 1). To improve performance on the three goals above, we need information on the distribution of D̂.
  • 13. Thesis Motivation — For example, when ranking noisy relationships: the distribution of D̂(X, Y) when the relationship between X and Y is noisy should not overlap with the distribution of D̂(X, Y) on a noiseless relationship.
  • 14. Outline — Background: Examples of Applications; Categories of Dependency Measures; Thesis Motivation. Ranking Dependencies in Noisy Data: Motivation; Design of the Randomized Information Coefficient (RIC); Comparison Against Other Measures. A Framework for Adjusting Dependency Measures: Motivation; Adjustment for Quantification; Adjustment for Ranking. Adjustments for Clustering Comparison Measures: Motivation; Detailed Analysis of Contingency Tables; Application Scenarios. Conclusions.
  • 15. Motivation — Mutual information I(X, Y) is good for ranking relationships with different levels of noise between the variables: high I ⇒ little noise, small I ⇒ much noise. It can also be computed between sets of variables, e.g. I(X, Y) = I({X1, X2}, Y) = I({weight, height}, BMI). Mutual information quantifies the information shared between two variables: MI(X, Y) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy. Importance of MI: it is based on a well-established theory and quantifies non-linear interactions which might be missed if, e.g., Pearson's correlation coefficient r(X, Y) is used.
  • 16. Motivation — Estimation of mutual information. Many estimators exist: Iew (discretization, equal width), Ief (discretization, equal frequency) and IA (adaptive partitioning), all O(n^1.5); Imean (mean nearest neighbours) and IKDE (kernel density estimation), both O(n^2); IkNN (nearest neighbours), between O(n^1.5) and O(n^2). Discretization-based estimators of mutual information exhibit good complexity but are not applicable to sets of variables.
  • 17. Motivation — Discretization-based estimators use fixed grids and compute mutual information on a contingency table; for example, Iew discretizes using equal-width binning. [Figure: equal-width grid over a scatter plot of X vs Y] With rows a_1, ..., a_r for discretized Y and columns b_1, ..., b_c for discretized X, n_ij counts the number of points in bin (i, j), and mutual information can be computed with: Iew(X, Y) = Σ_{i=1}^{r} Σ_{j=1}^{c} (n_ij / N) log( n_ij N / (a_i b_j) ).
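The equal-width plug-in estimate above can be sketched in a few lines (a minimal illustration with NumPy, not the thesis implementation; the function name `mi_equal_width` and the `bins` parameter are illustrative):

```python
import numpy as np

def mi_equal_width(x, y, bins=5):
    """Plug-in mutual information (in nats) on an equal-width grid:
    I_ew = sum_ij (n_ij / N) * log(n_ij * N / (a_i * b_j))."""
    n, _, _ = np.histogram2d(x, y, bins=bins)  # contingency table n_ij
    N = n.sum()
    a = n.sum(axis=1, keepdims=True)           # row marginals a_i
    b = n.sum(axis=0, keepdims=True)           # column marginals b_j
    ab = a @ b                                 # outer product a_i * b_j
    nz = n > 0                                 # 0 * log 0 = 0 by convention
    return float(np.sum(n[nz] / N * np.log(N * n[nz] / ab[nz])))
```

On a noiseless y = x relationship with 5 equal-width bins this recovers log 5, while on independent data it stays near 0 (up to the systematic positive grid bias criticized on the next slide).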
  • 18. Motivation — Criticism: the discretization approach is less popular between numerical variables because there is a systematic estimation bias which depends on the grid size. However, when comparing dependencies, systematic estimation biases cancel each other out [Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010]. Thus, not bad for comparing/ranking relationships!
  • 20. Motivation — Comparing relationships / comparing estimations of I. Task: given a strong relationship s and a weak relationship w, compare the estimates Îs and Îw of the true values Is and Iw. Systematic biases cancel out when comparing relationships, because they translate the distributions by a fixed amount; it is beneficial to reduce the variance. Challenge: decreasing the variance of the estimation.
  • 21. Design of the Randomized Information Coefficient (RIC) — Idea: generate many random grids of different cardinalities via random cut-offs; estimate the normalized mutual information on each of them (normalized because of the different cardinalities); average. [Figure: three random grids over the same scatter plot, whose scores are averaged] Parameters: Kr tunes the number of random grids; Dmax tunes the maximum grid cardinality generated. Features: proven to decrease the variance, as in random forests [Geurts, 2002]; still good complexity O(n^1.5); easy to extend to sets of variables.
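A minimal sketch of the idea, assuming random cut-offs drawn uniformly over the data range and NMI normalized by the geometric mean of the marginal entropies (one of several common normalizations; names and parameter defaults are illustrative, not the thesis code):

```python
import numpy as np

rng = np.random.default_rng(0)

def nmi_on_grid(x, y, x_edges, y_edges):
    # Contingency table for one random grid, then normalized MI
    n, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = n / n.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(px, py)[nz]))
    return mi / max(np.sqrt(hx * hy), 1e-12)

def ric(x, y, Kr=20, Dmax=8):
    # Average NMI over Kr grids with random cut-offs and cardinalities
    vals = []
    for _ in range(Kr):
        cx, cy = rng.integers(2, Dmax + 1, size=2)
        x_edges = np.sort(np.r_[x.min(), rng.uniform(x.min(), x.max(), cx - 1), x.max()])
        y_edges = np.sort(np.r_[y.min(), rng.uniform(y.min(), y.max(), cy - 1), y.max()])
        vals.append(nmi_on_grid(x, y, x_edges, y_edges))
    return float(np.mean(vals))
```

Averaging over many randomized grids is what reduces the variance, in the same spirit as averaging randomized trees in a random forest.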
  • 22. Design of the Randomized Information Coefficient (RIC) — Random discretization of a set of variables. Relationship between Y and X = {X1, X2}, e.g. Y against X0 = (X1 + X2)/2. We need to randomly discretize X ⇒ just choose some random seeds. [Figure: random seed points partitioning the (X1, X2) plane]
  • 23. Design of the Randomized Information Coefficient (RIC) — Detection of relationships. Task: using a permutation test, identify whether a relationship exists: generate 500 values of RIC under complete noise; sort the values and identify the value x of RIC at position 500 × 95% = 475; generate 500 values of RIC under a particular relationship; count how many values are greater than x ⇒ the bigger the count, the bigger the power of RIC. Tested on many relationships (linear, quadratic, cubic, sinusoidal at low/high/varying frequency, 4th root, circle, step function, two lines, X, circle-bar) and noise levels (1, 6, 11, 16, 21, 26).
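The permutation-test recipe above can be sketched as follows; for brevity the sketch plugs in r² as a stand-in dependency measure instead of RIC, and `noisy_line` is an illustrative relationship generator:

```python
import numpy as np

rng = np.random.default_rng(1)

def power(dep, make_rel, n=50, trials=500, alpha=0.05):
    # (1) null distribution of the measure under complete noise
    null = sorted(dep(rng.uniform(0, 1, n), rng.uniform(0, 1, n))
                  for _ in range(trials))
    # (2) threshold at position trials * (1 - alpha), e.g. 475 of 500
    threshold = null[round(trials * (1 - alpha)) - 1]
    # (3) fraction of times the measure on the relationship beats it
    wins = sum(dep(*make_rel(n)) > threshold for _ in range(trials))
    return wins / trials

# Stand-in dependency measure (the deck uses RIC; r^2 keeps the sketch short)
r2 = lambda x, y: float(np.corrcoef(x, y)[0, 1] ** 2)

def noisy_line(n, noise=0.3):
    x = rng.uniform(0, 1, n)
    return x, x + noise * rng.normal(size=n)
```

By construction, power on complete noise sits near alpha, while a moderately noisy line is detected almost always.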
  • 24. Design of the Randomized Information Coefficient (RIC) — Power as the number of random grids increases: Kr tunes the number of random grids. [Figure: area under the power curve as Kr grows from 50 to 200; every line is a relationship; RIC optimum at Kr = 200] More random grids ⇒ less estimation variance ⇒ more power.
  • 25. Comparison Against Other Measures — Extensively compared with other measures on the task of relationship detection. [Figure: average rank in power across relationships (e.g. ranked 1st when power is maximal on a relationship); ordering from best: RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r², IA, ACE, Imean, MID]
  • 26. Comparison Against Other Measures — Biological network inference: reverse engineering of networks of genes where the ground truth is known. [Figure: average rank in mean average precision across networks (e.g. ranked 1st when average precision is maximal on a network); ordering from best: RIC, dCorr, IKDE, IkNN, HSIC, ACE, r², GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID] Also compared on feature filtering for regression and feature selection for regression. RIC shows competitive performance.
  • 27. Comparison Against Other Measures — Conclusion/message: we proposed the Randomized Information Coefficient (RIC), which reduces the variance of grid-based normalized mutual information when comparing relationships and randomly discretizes multiple variables. Take-away message: there are different ways to generate random grids (random cut-offs / random seeds), and the more grids, the smaller the variance. The Randomized Information Coefficient: Ranking Dependencies in Noisy Data, Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Under review in the Machine Learning Journal.
  • 29. Comparison Against Other Measures — Hypotheses so far: we compared numerical variables on samples of fixed size n. Dependency measures might have biases if they compare samples with different n, or compare categorical variables. Need for adjustment in these cases.
  • 30. Outline — Background: Examples of Applications; Categories of Dependency Measures; Thesis Motivation. Ranking Dependencies in Noisy Data: Motivation; Design of the Randomized Information Coefficient (RIC); Comparison Against Other Measures. A Framework for Adjusting Dependency Measures: Motivation; Adjustment for Quantification; Adjustment for Ranking. Adjustments for Clustering Comparison Measures: Motivation; Detailed Analysis of Contingency Tables; Application Scenarios. Conclusions.
  • 31. Motivation — Motivation for adjustment, for quantification. Pearson's correlation between two variables X and Y estimated on a data sample Sn = {(xk, yk)} of n data points: r(Sn|X, Y) = Σ_{k=1}^{n} (xk − x̄)(yk − ȳ) / √( Σ_{k=1}^{n} (xk − x̄)² · Σ_{k=1}^{n} (yk − ȳ)² ) (1). [Figure: example scatter plots with correlations 1, 0.8, 0.4, 0, −0.4, −0.8, −1, from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient] r²(Sn|X, Y) can be used as a proxy of the amount of noise for linear relationships: 1 if noiseless, 0 if complete noise.
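As a quick illustration of why raw estimates can be high by chance: under independence of two normal variables, r² follows a Beta(1/2, (n − 2)/2) distribution with mean 1/(n − 1), so small samples inflate r². A Monte Carlo check (illustrative code, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)

# Under independence r^2 follows Beta(1/2, (n-2)/2), whose mean is
# 1/(n-1): at n = 10, noise alone yields r^2 of about 0.11 on average.
def mean_null_r2(n, trials=5000):
    vals = [float(np.corrcoef(rng.normal(size=n),
                              rng.normal(size=n))[0, 1] ** 2)
            for _ in range(trials)]
    return float(np.mean(vals))
```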
  • 32. Motivation — The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has 499 citations to date according to Google Scholar. MIC(X, Y) can be used as a proxy of the amount of noise for functional relationships. [Figure: from the supplementary material online in [Reshef et al., 2011]] MIC should be equal to: 1 if the relationship between X and Y is functional and noiseless; 0 if there is complete noise.
  • 33. Motivation — Challenge: nonetheless, its estimation is challenging on a finite data sample Sn of n data points. We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points. [Figure: box plots of MIC(S80|X, Y) and MIC(S20|X, Y), spanning roughly 0.2 to 1] The value can be high because of chance! The user expects values close to 0 in both cases. Challenge: adjust the estimated MIC to better exploit the range [0, 1].
  • 34. Adjustment for Quantification — Adjustment for chance. We define a framework for adjustment: A(D̂) = (D̂ − E[D̂₀]) / (max D̂ − E[D̂₀]). It uses the distribution of D̂₀ under independent variables: r²₀ follows a Beta distribution; MIC₀ can be computed using Monte Carlo permutations. This type of adjustment is also used in κ-statistics. Its application is beneficial to other dependency measures: adjusted r² ⇒ Ar²; adjusted MIC ⇒ AMIC.
  • 35. Adjustment for Quantification — Adjusted measures enable better interpretability. Task: obtain 1 for a noiseless relationship and 0 for complete noise (on average). [Figure: linear relationship at noise levels 0–100% — r² = 1, 0.66, 0.39, 0.2, 0.073, 0.035 vs Ar² = 1, 0.65, 0.37, 0.17, 0.044, 0.00046; Ar² becomes zero on average at 100% noise] [Figure: MIC = 1, 0.7, 0.47, 0.34, 0.27, 0.26 vs AMIC = 1, 0.6, 0.29, 0.11, 0.021, 0.0014; AMIC becomes zero on average at 100% noise]
  • 37. Adjustment for Quantification — Not biased towards small sample size n. Average value of D̂ for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values). [Figure: raw r² (n = 10, 20, 30, 40, 100, 200) and raw MIC (n = 20, 40, 60, 80) against noise level, compared with the adjusted Ar² and AMIC]
  • 39. Adjustment for Ranking — Motivation for adjustment, for ranking. Say we want to predict the risk of a cancer C using the equally unpredictive variables X1 and X2, defined as follows: X1 ≡ patient had breakfast today, X1 ∈ {yes, no}; X2 ≡ patient eye color, X2 ∈ {green, blue, brown}. Problem: when ranking variables, dependency measures are biased towards the selection of variables with many categories. This still happens because of finite samples!
  • 44. Adjustment for Ranking — Selection bias experiment. Experiment: n = 100 data points, class C with 2 categories. Generate a variable X1 with 2 categories (independently from C); generate a variable X2 with 3 categories (independently from C); compute Gini(X1, C) and Gini(X2, C) and give a win to the variable that gets the highest value; repeat 10,000 times. [Figure: probability of selection of X1 vs X2] Result: X2 gets selected 70% of the times (bad). Given that they are equally unpredictive, we expected 50%. Challenge: adjust the estimated Gini gain to obtain unbiased rankings.
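The experiment can be reproduced in a few lines (an illustrative sketch with fewer repetitions than the slide's 10,000; `gini_gain` is the standard impurity-based Gini gain, and the ~70% figure is the slide's reported result):

```python
import numpy as np

rng = np.random.default_rng(4)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def gini_gain(x, c):
    # Gini gain of a categorical split x with respect to the class c
    g = gini(c)
    for v in np.unique(x):
        mask = x == v
        g -= mask.mean() * gini(c[mask])
    return g

def selection_bias(n=100, trials=2000):
    # How often a 3-category noise feature beats a 2-category one
    wins = 0
    for _ in range(trials):
        c = rng.integers(0, 2, n)    # class with 2 categories
        x1 = rng.integers(0, 2, n)   # 2 categories, independent of c
        x2 = rng.integers(0, 3, n)   # 3 categories, independent of c
        if gini_gain(x2, c) > gini_gain(x1, c):
            wins += 1
    return wins / trials
```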
  • 45. Adjustment for Ranking — We propose two adjustments for ranking. Standardization: S(D̂) = (D̂ − E[D̂₀]) / √Var(D̂₀), which quantifies statistical significance like a p-value. Adjustment for ranking: A(D̂)(α) = D̂ − q₀(1 − α), which penalizes on statistical significance according to α, where q₀ is the quantile function of the distribution of D̂₀ (the smaller α, the more penalization).
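Both adjustments can be sketched on top of a Monte Carlo permutation null (illustrative code with r² plugged in; the thesis works with analytic moments and quantiles where available):

```python
import numpy as np

rng = np.random.default_rng(5)

def null_distribution(D, x, y, n_perm=500):
    # Monte Carlo distribution of D_hat_0 under the permutation model
    return np.array([D(x, rng.permutation(y)) for _ in range(n_perm)])

def standardized(D, x, y, n_perm=500):
    """S(D_hat) = (D_hat - E[D_hat_0]) / sqrt(Var(D_hat_0))"""
    null = null_distribution(D, x, y, n_perm)
    return (D(x, y) - null.mean()) / null.std(ddof=1)

def adjusted_for_ranking(D, x, y, alpha=0.05, n_perm=500):
    """A(D_hat)(alpha) = D_hat - q_0(1 - alpha)"""
    null = null_distribution(D, x, y, n_perm)
    return D(x, y) - float(np.quantile(null, 1 - alpha))

r2 = lambda x, y: float(np.corrcoef(x, y)[0, 1] ** 2)
```

The standardized score reads like a z-score of the raw value against its chance distribution, while the α-adjusted score subtracts only a chance quantile, so α controls how much statistical significance is allowed to influence the ranking.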
  • 46. Adjustment for Ranking — Standardized Gini (SGini) corrects for selection bias. Select unpredictive features X1 with 2 categories and X2 with 3 categories. [Figure: probability of selection] Experiment: X1 and X2 each get selected on average almost 50% of the times (good). Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007]. Nonetheless, we found that this is a simplistic scenario.
  • 48. Adjustment for Ranking — Standardized Gini (SGini) might be biased. Fix the predictiveness of features X1 and X2 to a constant different from 0. [Figure: probability of selection] Experiment: SGini becomes biased towards X1 because it is more statistically significant (bad). This behavior has been overlooked in the decision tree community. Use A(D̂)(α) to penalize less, or even to tune the bias! ⇒ AGini(α)
  • 49. Adjustment for Ranking — Application to random forests. Why random forests? A good classifier to try first when there are "meaningful" features [Fernández-Delgado et al., 2014], and different splitting criteria can be plugged in. Experiment: 19 data sets with categorical variables. [Figure: mean AUC (roughly 90 to 91.5) as α varies from 0 to 0.8, comparing AGini(α), SGini and Gini, using the same α for all data sets] And α can be tuned for each data set with cross-validation.
  • 50. Adjustment for Ranking — Conclusion/message: dependency estimates are high because of chance under finite samples. Adjustments can help for: quantification, to obtain an interpretable value in [0, 1]; ranking, to avoid biases towards missing values and towards categorical variables with more categories. A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. Under submission at the SIAM International Conference on Data Mining 2016 (SDM-16). Arxiv: http://arxiv.org/abs/1510.07786
  • 51. Outline — Background: Examples of Applications; Categories of Dependency Measures; Thesis Motivation. Ranking Dependencies in Noisy Data: Motivation; Design of the Randomized Information Coefficient (RIC); Comparison Against Other Measures. A Framework for Adjusting Dependency Measures: Motivation; Adjustment for Quantification; Adjustment for Ranking. Adjustments for Clustering Comparison Measures: Motivation; Detailed Analysis of Contingency Tables; Application Scenarios. Conclusions.
  • 53. Motivation — Clustering validation: given a reference clustering V, we want to validate the clustering solution U (blue/red) ⇒ we need dependency measures. There are two very popular measures based on adjustments: the Adjusted Rand Index (ARI) [Hubert and Arabie, 1985], ~3000 citations; the Adjusted Mutual Information (AMI) [Vinh et al., 2009], ~200 citations. There is no clear connection between them, and users use them both.
  • 54. Motivation — Both are computed on a contingency table. Notation: contingency table M with entries n_ij; a_i = Σ_j n_ij are the row marginals (clustering U) and b_j = Σ_i n_ij are the column marginals (clustering V). ARI — adjustment of the Rand Index (RI), based on counting pairs of objects: ARI = (RI − E[RI]) / (max RI − E[RI]). AMI — adjustment of Mutual Information (MI), based on information theory: AMI = (MI − E[MI]) / (max MI − E[MI]).
  • 57. Motivation: Link via generalized information theory
    Generalized information theory is based on the Tsallis q-entropy
      H_q(V) = 1/(q−1) · (1 − Σ_j (b_j/N)^q),
    which generalizes Shannon's entropy:
      lim_{q→1} H_q(V) = H(V) = −Σ_j (b_j/N) log(b_j/N).
    Link between measures: the Mutual Information MI_q based on Tsallis q-entropy links RI and MI:
      MI_{q=2} ∝ RI,   lim_{q→1} MI_q = MI
    Challenge: compute E[MI_q] to connect ARI and AMI.
    Challenge 2.0: compute Var(MI_q) for standardization.
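The q-entropy above is easy to sketch in code (an illustrative implementation, not taken from the thesis), and the limit q → 1 recovering Shannon's entropy can be checked numerically:

```python
import math

def tsallis_entropy(counts, q):
    """Tsallis q-entropy H_q = (1 - sum_j p_j^q) / (q - 1) of a partition
    given its cluster sizes; q = 1 returns the Shannon limit."""
    n = sum(counts)
    probs = [c / n for c in counts]
    if q == 1.0:  # Shannon entropy (the q -> 1 limit)
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1)

sizes = [10, 10, 10, 70]
print(tsallis_entropy(sizes, 1.0))     # Shannon entropy H(V)
print(tsallis_entropy(sizes, 1.0001))  # nearly identical: the q -> 1 limit
print(tsallis_entropy(sizes, 2.0))     # the q = 2 case linked to RI
```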
  • 58. Motivation: We propose a technique applicable to a broader class of measures
    We can do:
    - Exact computation of measures in L_φ, where S ∈ L_φ is a linear function of the entries of the contingency table:
        S = α + β Σ_ij φ_ij(n_ij)   (α and β are constants)
    - Asymptotic approximation of measures in N_φ (non-linear)
    [Figure: families of measures we can adjust, e.g. Rand Index (RI), Jaccard (J), MI, VI, generalized information-theoretic measures in L_φ; NMI in N_φ]
  • 59. Detailed Analysis of Contingency Tables: Exact Expected Value by Permutation Model
    E[S] is obtained by summation over all possible contingency tables M generated by permutations:
      E[S] = Σ_M S(M) P(M) = α + β Σ_M Σ_ij φ_ij(n_ij) P(M)
    There is no method to exhaustively generate the tables M with fixed marginals, and enumeration is extremely expensive (O(N!) permutations).
    However, it is possible to swap the inner summation with the outer summation:
      Σ_M Σ_ij φ_ij(n_ij) P(M) = Σ_ij Σ_{n_ij} φ_ij(n_ij) P(n_ij)
    n_ij has a known hypergeometric distribution, so computation time is dramatically reduced: O(max{rN, cN}).
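The summation swap above can be sketched in a few lines (a toy illustration of the Σφ part, with α and β omitted; not the thesis code): each cell count n_ij follows a hypergeometric distribution with parameters (N, a_i, b_j), and for tiny N the result can be cross-checked against brute-force enumeration of all permutations:

```python
from itertools import permutations
from math import comb

def expected_phi(a, b, phi):
    """E[sum_ij phi(n_ij)] under the permutation model with fixed row
    marginals a and column marginals b: n_ij ~ Hypergeometric(N, a_i, b_j)."""
    n = sum(a)
    total = 0.0
    for ai in a:
        for bj in b:
            lo, hi = max(0, ai + bj - n), min(ai, bj)
            for k in range(lo, hi + 1):
                p = comb(ai, k) * comb(n - ai, bj - k) / comb(n, bj)
                total += phi(k) * p
    return total

def brute_force(u, v, phi):
    """Average of sum_ij phi(n_ij) over all O(N!) permutations of labels v."""
    vals = []
    for vp in permutations(v):
        cells = {}
        for ui, vi in zip(u, vp):
            cells[(ui, vi)] = cells.get((ui, vi), 0) + 1
        vals.append(sum(phi(c) for c in cells.values()))
    return sum(vals) / len(vals)

u, v = [0, 0, 1, 1, 2], [0, 0, 1, 1, 1]   # marginals a = [2, 2, 1], b = [2, 3]
exact = expected_phi([2, 2, 1], [2, 3], lambda n: n * n)
brute = brute_force(u, v, lambda n: n * n)
print(exact, brute)  # the two computations agree
```

Here phi(0) = 0, so cells absent from the brute-force dictionaries contribute nothing, matching the exact formula.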
  • 61. Detailed Analysis of Contingency Tables: Exact Variance Computation
    We have to compute the second moment E[S²], which requires
      Σ_M (Σ_i Σ_j φ_ij(n_ij))² P(M) = Σ_{i,j,i′,j′} Σ_{n_ij, n_i′j′} φ_ij(n_ij) · φ_i′j′(n_i′j′) P(n_ij, n_i′j′)
    after swapping the summations as before.
    Contribution: computing the joint P(n_ij, n_i′j′) is technically challenging. We use a hypergeometric model: draws from an urn with N marbles of 3 colors (red, blue, and white).
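As an independent sanity check on such moments, for tiny N both E[S] and Var(S) can be obtained by brute-force enumeration of all permutations (a toy sketch, not the bivariate-hypergeometric computation of the thesis), which also illustrates the standardization (S − E[S]) / sqrt(Var(S)) used later for SMI:

```python
from itertools import permutations
from math import sqrt

def permutation_moments(u, v, phi):
    """Exact E[S] and Var(S) of S = sum_ij phi(n_ij) under the permutation
    model, by exhaustive enumeration (O(N!), so only feasible for tiny N)."""
    vals = []
    for vp in permutations(v):
        cells = {}
        for ui, vi in zip(u, vp):
            cells[(ui, vi)] = cells.get((ui, vi), 0) + 1
        vals.append(sum(phi(c) for c in cells.values()))
    mean = sum(vals) / len(vals)
    var = sum((x - mean) ** 2 for x in vals) / len(vals)
    return mean, var

u, v = [0, 0, 1, 1, 2], [0, 0, 1, 1, 1]
phi = lambda n: n * n                 # phi(0) = 0, so empty cells drop out
mean, var = permutation_moments(u, v, phi)

obs = {}
for ui, vi in zip(u, v):              # observed (unpermuted) contingency cells
    obs[(ui, vi)] = obs.get((ui, vi), 0) + 1
s_obs = sum(phi(c) for c in obs.values())
z = (s_obs - mean) / sqrt(var)        # standardized S, cf. SMI
print(round(z, 3))
```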
  • 63. Detailed Analysis of Contingency Tables: Finally we can define adjustments
    Definition: Adjusted Mutual Information q (AMI_q), with
      AMI_2 = ARI,   lim_{q→1} AMI_q = AMI
    We can finally relate ARI and AMI to generalized information theory!
    We also define a generalized Standardized Mutual Information q (SMI_q) to address selection bias.
    Their complexities when comparing two clusterings of N objects with r and c clusters:
      AMI: O(max{rN, cN})
      SMI: O(max{rcN³, c²N³})
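A sketch of the q → 1 endpoint of this family (the classical AMI, assuming max-normalization and the cell-wise hypergeometric E[MI] of the permutation model; illustrative code, not the thesis implementation) could look like:

```python
from math import comb, log

def entropy(sizes, n):
    """Shannon entropy of a partition from its cluster sizes (nats)."""
    return -sum((s / n) * log(s / n) for s in sizes if s > 0)

def mi(table, a, b, n):
    """Mutual information of a contingency table (nats)."""
    return sum((nij / n) * log(n * nij / (ai * bj))
               for ai, row in zip(a, table)
               for bj, nij in zip(b, row) if nij > 0)

def expected_mi(a, b, n):
    """E[MI] under the permutation model: each n_ij is hypergeometric."""
    e = 0.0
    for ai in a:
        for bj in b:
            for k in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = comb(ai, k) * comb(n - ai, bj - k) / comb(n, bj)
                e += p * (k / n) * log(n * k / (ai * bj))
    return e

def ami(table):
    """AMI = (MI - E[MI]) / (max(H(U), H(V)) - E[MI])."""
    n = sum(sum(row) for row in table)
    a = [sum(row) for row in table]
    b = [sum(col) for col in zip(*table)]
    emi = expected_mi(a, b, n)
    return (mi(table, a, b, n) - emi) / (max(entropy(a, n), entropy(b, n)) - emi)

print(ami([[5, 0], [0, 5]]))  # identical clusterings -> 1.0
```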
  • 67. Application Scenarios
    Task: clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2.
    Example: do you prefer U1 or U2? (V has clusters of sizes 10, 10, 10, 70; row marginals shown before each row)

      U1:  8 | 8  0  0  0        U2: 10 | 7  1  1  1
           7 | 0  7  0  0            10 | 1  7  1  1
           7 | 0  0  7  0            10 | 1  1  7  1
          78 | 2  3  3 70            70 | 1  1  1 67

    AMI chooses U1 because of the many 0's; ARI chooses U2.
    When there are small clusters in V, use AMI because it likes 0's.
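Plugging the two contingency tables from this example into a quick pair-counting ARI implementation (a self-contained sketch; the numeric values are not stated on the slides) confirms the ARI preference:

```python
from math import comb

def ari(table):
    """Pair-counting Adjusted Rand Index from a contingency table."""
    n = sum(map(sum, table))
    a = [sum(r) for r in table]                     # row marginals
    b = [sum(c) for c in zip(*table)]               # column marginals
    idx = sum(comb(x, 2) for r in table for x in r)
    sa = sum(comb(x, 2) for x in a)
    sb = sum(comb(x, 2) for x in b)
    exp = sa * sb / comb(n, 2)
    return (idx - exp) / ((sa + sb) / 2 - exp)

# The two candidate solutions against V with cluster sizes 10, 10, 10, 70:
u1 = [[8, 0, 0, 0], [0, 7, 0, 0], [0, 0, 7, 0], [2, 3, 3, 70]]
u2 = [[7, 1, 1, 1], [1, 7, 1, 1], [1, 1, 7, 1], [1, 1, 1, 67]]
print(ari(u1), ari(u2))  # ARI ranks U2 above U1
```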
  • 71. Application Scenarios: Equal sized clusters
    Same task, but now V has four equal sized clusters of 25 objects each (row marginals shown before each row):

      U1: 17 | 17  0  0  0        U2: 24 | 20  2  1  1
          17 |  0 17  0  0            25 |  2 20  2  1
          17 |  0  0 17  0            23 |  1  1 20  1
          49 |  8  8  8 25            28 |  2  2  2 22

    AMI chooses U1 because of the many 0's; ARI chooses U2.
    When there are big equal sized clusters in V, use ARI because 0's are misleading.
  • 72. Application Scenarios: SMI_q can be used to correct selection bias
    Reference clustering with 4 clusters; solutions U with different numbers of sets r.
    [Figure: probability of selecting each number of sets r ∈ {2, ..., 10} in U under SMI_q, AMI_q, and NMI_q, for q = 1.001]
  • 73. Application Scenarios: Correct for selection bias with SMI_q for any q
    Reference clustering with 4 clusters; solutions U with different numbers of sets r.
    [Figure: probability of selecting each number of sets r ∈ {2, ..., 10} in U under SMI_q, AMI_q, and NMI_q, for q = 2]
  • 74. Conclusion - Message
    We computed generalized information-theoretic measures to propose AMI_q and SMI_q in order to:
    - identify the application scenarios of ARI and AMI
    - correct for selection bias
    Take-away message:
    - Use AMI when the reference is unbalanced and has small clusters
    - Use ARI when the reference has big equal sized clusters
    - Use SMI_q to correct for selection bias
    "Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance", Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1143–1151.
    "Adjusting for Chance Clustering Comparison Measures", Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. To submit to the Journal of Machine Learning Research.
  • 76. Summary
    By studying the distribution of the estimates D̂, we:
    - designed RIC
    - adjusted for quantification
    - adjusted for ranking
    These results can aid the detection, quantification, and ranking of relationships as follows:
    - Detection: RIC can be used to detect relationships between continuous variables because it has high power
    - Quantification: adjustment for quantification can be used to obtain a more interpretable range of values, e.g. AMIC and AMI_q
    - Ranking: adjustment for ranking can be used to correct for biases towards variables with missing values or variables with many categories, e.g. AGini(α) for random forests
  • 77. Future Work
    - Dependency measure estimates can also obtain high values by chance when computed on different numbers of dimensions ⇒ study adjustments that are unbiased towards dimensionality
    - Adjustment via permutations is slow ⇒ derive more analytical adjustments, e.g. for MIC
    - The random-seed discretization technique for RIC might have problems in high dimensions ⇒ generate random seeds in random subspaces; study multivariate discretization using random trees
    - Inject randomness into other estimators of mutual information ⇒ e.g. choose different random kernel widths for the IKDE estimator
  • 78. Papers
    S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "Adjusting for Chance Clustering Comparison Measures". To submit to the Journal of Machine Learning Research.
    S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "A Framework to Adjust Dependency Measure Estimates for Chance". Under submission at the SIAM International Conference on Data Mining 2016 (SDM-16).
    S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "The Randomized Information Coefficient: Ranking Dependencies in Noisy Data". Under review at the Machine Learning Journal.
    S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance". In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1143–1151.
    Collaborations:
    Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, "Extending information theoretic validity indices for fuzzy clusterings". Submitted to the Transactions on Fuzzy Systems Journal.
    N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, "Discovering outlying aspects in large datasets". Submitted to the Data Mining and Knowledge Discovery Journal.
    N. X. Vinh, J. Chan, S. Romano, and J. Bailey, "Effective global approaches for mutual information based feature selection". In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 512–521.
    Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, "Generalized information theoretic cluster validity indices for soft clusterings". In Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31.
  • 80. Thank You All
    In particular:
    - My supervisors: James Bailey, Karin Verspoor, and Vinh Nguyen
    - Committee Chair: Tim Baldwin
    - My fellow PhD students
    Questions? Code available online: https://github.com/ialuronico
  • 81. References I
    Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23(2):301–313.
    Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Sixth International Conference on Data Mining (ICDM'06), pages 107–118. IEEE.
    Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
    Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.
    Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
    Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7–30.
  • 82. References II
    Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90–97.
    Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181.
    Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152–160.
    Geurts, P. (2002). Bias/Variance Tradeoff and Time Series Classification. PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de Liège.
    Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer.
    Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.
  • 83. References III
    Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674.
    Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.
    Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2):026209.
    Kononenko, I. (1995). On biases in estimating multi-valued attributes. In International Joint Conferences on Artificial Intelligence, pages 1034–1040.
    Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6):066138.
    Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). Filta: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
  • 84. References IV
    Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9.
    Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1):S7.
    Meilă, M. (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895.
    Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.
    Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518–1524.
    Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015). Measuring dependence powerfully and equitably. arXiv preprint arXiv:1505.02213.
  • 85. References V
    Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010). On estimating mutual information for feature selection. In Artificial Neural Networks - ICANN 2010, pages 362–367. Springer.
    Strehl, A. and Ghosh, J. (2003). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.
    Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501.
    Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1692–1698. AAAI Press.
    Székely, G. J., Rizzo, M. L., et al. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265.
    Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306–329.
  • 86. References VI
    Vinh, N. X., Epps, J., and Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML, pages 1073–1080. ACM.
    Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.