Motivation Adjustment for Quantification Adjustment for Ranking Conclusions
SDM 2016 – May 6th 2016
A Framework to Adjust Dependency Measure Estimates for Chance
Simone Romano
me@simoneromano.com
@ialuronico
Nguyen Xuan Vinh, James Bailey, Karin Verspoor
(We won the Best Paper Award!)
Department of Computing and Information Systems,
The University of Melbourne, Victoria, Australia
I will soon start working as an applied scientist in London, UK
Simone Romano University of Melbourne
A Framework to Adjust Dependency Measure Estimates for Chance
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Dependency Measures
A dependency measure D is used to assess
the amount of dependency between variables:
Example 1: After collecting weight and height for many people,
we can compute D(weight, height)
Example 2: assess the amount
of dependency between search
queries in Google
https://www.google.com/
trends/correlate/
They are fundamental to a number of applications in machine learning and data mining
Applications of Dependency Measures
Supervised learning
Feature selection [Guyon and Elisseeff, 2003];
Decision tree induction [Criminisi et al., 2012];
Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
External clustering validation [Strehl and Ghosh, 2003];
Generation of alternative or multi-view clusterings
[Müller et al., 2013, Dang and Bailey, 2015];
Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
Analysis of neural time-series data [Cohen, 2014].
Motivation for Adjustment for Quantification
Pearson's correlation between two variables X and Y estimated on a data sample
Sn = {(xk, yk)} of n data points:

r(S_n|X, Y) = \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \sum_{k=1}^{n}(y_k - \bar{y})^2}}    (1)
[Figure: scatter plots with Pearson correlation coefficients 1, 0.8, 0.4, 0, −0.4, −0.8, −1. From
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]
r^2(S_n|X, Y) can be used as a proxy of the amount of noise for linear relationships:
1 if noiseless
0 if complete noise
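As a concrete illustration (a minimal sketch in plain Python, not the code used in the paper), r^2 computed from Equation (1) is 1 for a noiseless linear relationship and drops towards 0 as noise dominates:

```python
import random
import statistics

def pearson_r2(xs, ys):
    """r^2(Sn | X, Y): squared sample Pearson correlation, Equation (1) squared."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1000)]
noiseless = [2 * x for x in xs]                    # perfect linear relationship
noisy = [2 * x + random.gauss(0, 10) for x in xs]  # noise variance dominates the signal

r2_noiseless = pearson_r2(xs, noiseless)  # exactly 1 up to rounding
r2_noisy = pearson_r2(xs, noisy)          # close to 0
print(r2_noiseless, r2_noisy)
```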
The Maximal Information Coefficient (MIC) was published in Science
[Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy of the amount of noise for functional
relationships:
Figure : From supplementary material online in [Reshef et al., 2011]
MIC should be equal to:
1 if the relationship between X and Y is functional and noiseless
0 if there is complete noise
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data
points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data
points:
[Figure: distributions of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise, spread over the range 0.2–1.]
The value can be high because of chance!
The user expects values close to 0 in both cases
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Chance
We define a framework for adjustment:
Adjustment for Quantification
A\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\max\hat{D} - E[\hat{D}_0]}

It uses the distribution of \hat{D}_0 under independent variables:
r^2_0: Beta distribution
MIC_0: can be computed using Monte Carlo permutations.
This adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
Adjusted r^2 ⇒ Ar^2
Adjusted MIC ⇒ AMIC
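A permutation-based sketch of this adjustment (assumptions: max D̂ = 1, E[D̂_0] estimated by Monte Carlo permutations, and r^2 standing in for a generic measure D̂):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

def adjusted(D, xs, ys, n_perms=100, d_max=1.0):
    """A(D_hat) = (D_hat - E[D0]) / (max D_hat - E[D0]); E[D0] is estimated
    by shuffling y, which breaks any dependency but keeps the marginals."""
    d_hat = D(xs, ys)
    ys_perm = list(ys)
    null = []
    for _ in range(n_perms):
        random.shuffle(ys_perm)
        null.append(D(xs, ys_perm))
    e_d0 = statistics.mean(null)
    return (d_hat - e_d0) / (d_max - e_d0)

# Completely noisy relationships on a small sample (n = 20):
random.seed(1)
raw, adj = [], []
for _ in range(200):
    xs = [random.gauss(0, 1) for _ in range(20)]
    ys = [random.gauss(0, 1) for _ in range(20)]
    raw.append(r2(xs, ys))
    adj.append(adjusted(r2, xs, ys))
print(statistics.mean(raw))  # clearly above 0: inflated by chance
print(statistics.mean(adj))  # close to 0 on average
```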
Adjusted measures enable better interpretability
Task:
Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise level:  0%    20%    40%    60%    80%    100%
r^2:          1     0.66   0.39   0.2    0.073  0.035
Ar^2:         1     0.65   0.37   0.17   0.044  0.00046

Figure: Ar^2 becomes zero on average on 100% noise: r^2 = 0.035 vs Ar^2 = 0.00046.

Noise level:  0%    20%    40%    60%    80%    100%
MIC:          1     0.7    0.47   0.34   0.27   0.26
AMIC:         1     0.6    0.29   0.11   0.021  0.0014

Figure: AMIC becomes zero on average on 100% noise: MIC = 0.26 vs AMIC = 0.0014.
Not biased towards small sample size n
Average value of \hat{D} for different % of noise
⇒ estimates can be high because of chance at small n (e.g. because of missing values)

[Figure: average value vs. noise level (0–100%) for Raw r^2 (n = 10, 20, 30, 40, 100, 200) and Raw MIC (n = 20, 40, 60, 80), and for the adjusted versions Ar^2 and AMIC: the raw estimates are inflated at small n, while the adjusted versions remove the small-n inflation.]
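The small-n inflation of the raw estimate is easy to reproduce (an illustrative sketch; for independent bivariate normal data, E[r^2] is known to be 1/(n − 1)):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

random.seed(2)
means = {}
for n in (10, 20, 40, 100, 200):
    vals = []
    for _ in range(500):  # 500 fully noisy relationships per sample size
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [random.gauss(0, 1) for _ in range(n)]
        vals.append(r2(xs, ys))
    means[n] = statistics.mean(vals)
    print(n, round(means[n], 4))  # shrinks roughly like 1/(n - 1)
```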
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2 defined as follows:
X1 ≡ patient had breakfast today, X1 = {yes, no};
X2 ≡ patient eye color, X2 = {green, blue, brown};

[Figure: the sample partitioned by X1 = {yes, no} and by X2 = {green, blue, brown}.]

Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Selection bias experiment
Experiment:
n = 100 data points
Class C with 2 categories:
Generate a variable X1 with 2 categories (independently from C)
Generate a variable X2 with 3 categories (independently from C)
Compute Gini(X1, C) and Gini(X2, C). Give a win to the variable that gets the highest value.
REPEAT 10,000 times

[Figure: bar chart of the probability of selection for X1 and X2.]

Result: X2 gets selected 70% of the time ( Bad )
Given that they are equally unpredictive, we expected 50%.
Challenge: adjust the estimated Gini gain to obtain an unbiased ranking
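The experiment above can be reproduced in a few lines (an illustrative sketch using the standard Gini gain, run for 2,000 repetitions rather than 10,000 to keep it fast):

```python
import random
from collections import Counter, defaultdict

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(feature, labels):
    """Gini(X, C): decrease in impurity of class C after splitting on X."""
    n = len(labels)
    groups = defaultdict(list)
    for x, c in zip(feature, labels):
        groups[x].append(c)
    return gini_impurity(labels) - sum(
        len(g) / n * gini_impurity(g) for g in groups.values())

random.seed(3)
n, trials, wins_x2 = 100, 2000, 0
for _ in range(trials):
    C  = [random.randrange(2) for _ in range(n)]  # class with 2 categories
    X1 = [random.randrange(2) for _ in range(n)]  # 2 categories, independent of C
    X2 = [random.randrange(3) for _ in range(n)]  # 3 categories, independent of C
    if gini_gain(X2, C) > gini_gain(X1, C):
        wins_x2 += 1
print(wins_x2 / trials)  # well above 0.5: X2 wins purely because it has more categories
```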
Adjustment for Ranking
We propose two adjustments for ranking:
Standardization:

S\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\sqrt{Var(\hat{D}_0)}}

Quantifies statistical significance, like a p-value.

Adjustment for Ranking:

A\hat{D}(\alpha) = \hat{D} - q_0(1 - \alpha)

where q_0 is the quantile function of the distribution of \hat{D}_0. Penalizes on statistical significance according to α (smaller α, more penalization).
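Both adjustments can be estimated from one permutation null (a sketch under the assumption that the distribution of D̂_0 is approximated by Monte Carlo permutations, with r^2 again standing in for D̂):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

def rank_adjustments(D, xs, ys, alpha=0.05, n_perms=500):
    """SD = (D_hat - E[D0]) / sqrt(Var(D0));  AD(alpha) = D_hat - q0(1 - alpha)."""
    d_hat = D(xs, ys)
    ys_perm = list(ys)
    null = []
    for _ in range(n_perms):
        random.shuffle(ys_perm)          # sample the null distribution D0
        null.append(D(xs, ys_perm))
    sd = (d_hat - statistics.mean(null)) / statistics.stdev(null)
    q0 = sorted(null)[int((1 - alpha) * n_perms) - 1]  # empirical (1 - alpha) quantile
    return sd, d_hat - q0

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [x + random.gauss(0, 0.5) for x in xs]    # strong dependency
sd_dep, ad_dep = rank_adjustments(r2, xs, ys)

zs = [random.gauss(0, 1) for _ in range(100)]  # independent of xs
sd_ind, ad_ind = rank_adjustments(r2, xs, zs)

print(sd_dep, ad_dep)   # large z-score, penalized value stays well above 0
print(sd_ind, ad_ind)   # z-score near 0
```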
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.
[Figure: bar chart of the probability of selection for X1 and X2.]

Experiment: X1 and X2 each get selected about 50% of the time
( Good )
Being similar to a p-value, this is consistent with the literature on decision
trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006,
Strobl et al., 2007].
Nonetheless, we found that this is a simplistic scenario.
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to the same constant ≠ 0

[Figure: bar chart of the probability of selection for X1 and X2.]

Experiment: SGini becomes biased towards X1 because it is more statistically significant
( Bad )
This behavior has been overlooked in the decision tree community.
Use A\hat{D}(α) to penalize less or even tune the bias! ⇒ AGini(α)
Application to random forest
Why random forest? It is a good classifier to try first when there are "meaningful" features
[Fernández-Delgado et al., 2014].
We plug in different splitting criteria.
Experiment: 19 data sets with categorical variables.

[Figure: mean AUC (about 90–91.5) as a function of α for AGini(α), compared with SGini and Gini, using the same α for all data sets.]

And α can be tuned for each data set with cross-validation.
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Conclusion - Message
Dependency estimates are high because of chance under finite samples.
Adjustments can help for:
Quantification, to have an interpretable value in [0, 1]
Ranking, to avoid biases towards:
missing values
categorical variables with more categories
Future Work:
Adjust dependency measures between multiple variables D(X1, . . . , Xd), which show a bias towards large d
Thank you.
Questions?
Simone Romano
me@simoneromano.com
@ialuronico
Code available online:
https://github.com/ialuronico
References
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006).
Meta clustering.
In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 107–118. IEEE.
Cohen, M. X. (2014).
Analyzing neural time series data: theory and practice.
MIT Press.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012).
Decision forests: A unified framework for classification, regression, density estimation,
manifold learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
Dang, X. H. and Bailey, J. (2015).
A framework to uncover multiple alternative clusterings.
Machine Learning, 98(1-2):7–30.
Dobra, A. and Gehrke, J. (2001).
Bias correction in classification tree construction.
In ICML, pages 90–97.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014).
Do we need hundreds of classifiers to solve real world classification problems?
The Journal of Machine Learning Research, 15(1):3133–3181.
Frank, E. and Witten, I. H. (1998).
Using a permutation test for attribute selection in decision trees.
In ICML, pages 152–160.
Guyon, I. and Elisseeff, A. (2003).
An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3:1157–1182.
Hothorn, T., Hornik, K., and Zeileis, A. (2006).
Unbiased recursive partitioning: A conditional inference framework.
Journal of Computational and Graphical Statistics, 15(3):651–674.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014).
Filta: Better view discovery from collections of clusterings via filtering.
In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013).
Discovering multiple clustering solutions: Grouping objects in different views of the data.
Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh,
P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011).
Detecting novel associations in large data sets.
Science, 334(6062):1518–1524.
Strehl, A. and Ghosh, J. (2003).
Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007).
Unbiased split selection for classification trees based on the gini index.
Computational Statistics & Data Analysis, 52(1):483–501.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013).
Reverse engineering cellular networks with information theoretic methods.
Cells, 2(2):306–329.
Witten, I. H., Frank, E., and Hall, M. A. (2011).
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 3rd edition.
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEERDELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
 

A Framework to Adjust Dependency Measure Estimates for Chance

  • 4. Applications of Dependency Measures
    Supervised learning:
      Feature selection [Guyon and Elisseeff, 2003];
      Decision tree induction [Criminisi et al., 2012];
      Evaluation of classification accuracy [Witten et al., 2011].
    Unsupervised learning:
      External clustering validation [Strehl and Ghosh, 2003];
      Generation of alternative or multi-view clusterings [Müller et al., 2013, Dang and Bailey, 2015];
      Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
    Exploratory analysis:
      Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
      Analysis of neural time-series data [Cohen, 2014].
  • 5. Motivation for Adjustment for Quantification
    Pearson's correlation between two variables X and Y, estimated on a data sample Sn = {(xk, yk)} of n data points:

        r(Sn | X, Y) = [ Σ_{k=1}^{n} (xk − x̄)(yk − ȳ) ] / [ √(Σ_{k=1}^{n} (xk − x̄)²) · √(Σ_{k=1}^{n} (yk − ȳ)²) ]    (1)

    [Figure: scatter plots with correlation values 1, 0.8, 0.4, 0, −0.4, −0.8, −1, from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]

    r²(Sn | X, Y) can be used as a proxy for the amount of noise in linear relationships:
      1 if noiseless
      0 if complete noise
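The estimator in Eq. (1) is straightforward to compute directly. A minimal sketch (the function name `pearson_r` and the sample data are illustrative, not from the talk):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation r(S_n | X, Y) as in Eq. (1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y_noiseless = 2 * x + 1           # perfectly linear relationship: r^2 = 1
y_noisy = rng.uniform(0, 1, 200)  # complete noise, independent of x

print(pearson_r(x, y_noiseless) ** 2)  # 1.0 up to floating point
print(pearson_r(x, y_noisy) ** 2)      # close to 0
```

Squaring the result gives the r² noise proxy used on this slide.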
  • 6. The Maximal Information Coefficient (MIC)
    MIC was published in Science [Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
    MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:

    [Figure: noisy functional relationships and their MIC values, from the supplementary material of [Reshef et al., 2011]]

    MIC should be equal to:
      1 if the relationship between X and Y is functional and noiseless
      0 if there is complete noise
  • 7. Challenge
    Nonetheless, MIC estimation is challenging on a finite data sample Sn of n data points.
    We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:

    [Figure: distributions of MIC(S20 | X, Y) and MIC(S80 | X, Y) over the range 0.2 to 1]

    The value can be high because of chance! The user expects values close to 0 in both cases.
    Challenge: adjust the estimated MIC to better exploit the range [0, 1]
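The same chance inflation at small n is easy to reproduce. The sketch below uses r² rather than MIC, purely to stay dependency-free (computing MIC would need an external package such as minepy); the function name and trial counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_chance_r2(n, trials=10_000):
    """Average r^2 between two *independent* variables on n points."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)  # independent of x: no real dependency
        r = np.corrcoef(x, y)[0, 1]
        vals[t] = r * r
    return vals.mean()

print(mean_chance_r2(20))  # roughly 1/(n-1), about 0.05
print(mean_chance_r2(80))  # roughly 0.013: chance inflation shrinks with n
```

The smaller the sample, the larger the estimate under pure noise, which is exactly the effect the adjustment targets.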
  • 8. Adjustment for Chance
    We define a framework for adjustment:

    Adjustment for Quantification:  A(D̂) = (D̂ − E[D̂0]) / (max D̂ − E[D̂0])

    It uses the distribution of D̂0 under independent variables:
      r²0: Beta distribution
      MIC0: can be computed using Monte Carlo permutations
    Used in κ-statistics. Its application is beneficial to other dependency measures:
      Adjusted r² ⇒ Ar²
      Adjusted MIC ⇒ AMIC
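A minimal sketch of this adjustment, using Monte Carlo permutations to estimate E[D̂0] for any plug-in measure (for r² the paper uses the Beta distribution analytically; the permutation route shown here is the general fallback, and all function names are illustrative):

```python
import numpy as np

def adjust_for_quantification(d_hat, d_null, d_max=1.0):
    """A(D) = (D - E[D0]) / (max D - E[D0]), with E[D0] estimated
    from a sample d_null of the measure under independence."""
    e0 = np.mean(d_null)
    return (d_hat - e0) / (d_max - e0)

def null_distribution(x, y, stat, n_perm=1000, seed=0):
    """Monte Carlo permutation null: break the X-Y pairing."""
    rng = np.random.default_rng(seed)
    return np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])

# illustrative usage with r^2 as the dependency measure D
r2 = lambda x, y: np.corrcoef(x, y)[0, 1] ** 2
rng = np.random.default_rng(1)
x, y = rng.standard_normal(30), rng.standard_normal(30)  # independent pair
d0 = null_distribution(x, y, r2)
print(adjust_for_quantification(r2(x, y), d0))  # small: pulled toward 0
```

A noiseless relationship keeps its score of 1 (the numerator and denominator coincide), while complete noise is pulled toward 0 on average.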
  • 9. Adjusted measures enable better interpretability
    Task: obtain 1 for a noiseless relationship and 0 for complete noise (on average).

    Noise:  0%    20%    40%    60%    80%     100%
    r²:     1     0.66   0.39   0.2    0.073   0.035
    Ar²:    1     0.65   0.37   0.17   0.044   0.00046

    Figure: Ar² becomes zero on average on 100% noise: r² = 0.035 vs Ar² = 0.00046.

    Noise:  0%    20%    40%    60%    80%     100%
    MIC:    1     0.7    0.47   0.34   0.27    0.26
    AMIC:   1     0.6    0.29   0.11   0.021   0.0014

    Figure: AMIC becomes zero on average on 100% noise: MIC = 0.26 vs AMIC = 0.0014.
  • 10. Not biased towards small sample size n
    Average value of D̂ for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values).

    [Figures: raw r² vs noise level for n = 10, 20, 30, 40, 100, 200; raw MIC vs noise level for n = 20, 40, 60, 80]

  • 11. Not biased towards small sample size n
    The same comparison with the adjusted measures:

    [Figures: Ar² (adjusted) vs noise level for n = 10, 20, 30, 40, 100, 200; AMIC (adjusted) vs noise level for n = 20, 40, 60, 80]
  • 12. (Section outline) Motivation · Adjustment for Quantification · Adjustment for Ranking · Conclusions
  • 13-14. Motivation for Adjustment for Ranking
    Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:
      X1 ≡ patient had breakfast today, X1 = {yes, no};
      X2 ≡ patient eye color, X2 = {green, blue, brown};
    Problem: when ranking variables, dependency measures are biased towards the selection of variables with many categories.
    This still happens because of finite samples!
  • 15-19. Selection bias experiment
    Experiment: n = 100 data points, class C with 2 categories.
      Generate a variable X1 with 2 categories (independently from C)
      Generate a variable X2 with 3 categories (independently from C)
      Compute Gini(X1, C) and Gini(X2, C); give a win to the variable that gets the highest value
      REPEAT 10,000 times

    [Figure: probability of selection, X1 ≈ 30% vs X2 ≈ 70%]

    Result: X2 gets selected 70% of the time ( Bad )
    Given that they are equally unpredictive, we expected 50%.
    Challenge: adjust the estimated Gini gain to obtain unbiased rankings
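The steps above can be sketched directly. This is a minimal reproduction under stated assumptions (uniform categories, fewer repetitions than the talk's 10,000 for speed; the helper names are illustrative):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_gain(x, c):
    """Impurity decrease of class c after splitting on feature x."""
    gain = gini_impurity(c)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * gini_impurity(c[mask])
    return gain

rng = np.random.default_rng(0)
wins = np.zeros(2)  # wins[0]: X1, wins[1]: X2
for _ in range(2000):
    c  = rng.integers(0, 2, 100)  # binary class, n = 100
    x1 = rng.integers(0, 2, 100)  # 2 categories, independent of c
    x2 = rng.integers(0, 3, 100)  # 3 categories, independent of c
    wins[int(gini_gain(x2, c) > gini_gain(x1, c))] += 1

print(wins / wins.sum())  # X2 wins well over half the time
```

Even though both features are pure noise, the 3-category feature wins the Gini comparison far more often than 50% of the time, matching the bias reported on the slide.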
  • 20. Adjustment for Ranking
    We propose two adjustments for ranking:

    Standardization:  S(D̂) = (D̂ − E[D̂0]) / √Var(D̂0)
      Quantifies statistical significance, like a p-value.

    Adjustment for Ranking:  A(D̂)(α) = D̂ − q0(1 − α)
      Penalizes on statistical significance according to α, where q0 is the quantile function of the distribution of D̂0 (smaller α means more penalization).
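Both adjustments reduce to a few lines once a null sample of D̂0 is available (e.g. from Monte Carlo permutations). A minimal sketch; the Beta-distributed stand-in for the null and the function names are assumptions for illustration:

```python
import numpy as np

def standardize(d_hat, d_null):
    """S(D) = (D - E[D0]) / sqrt(Var(D0))."""
    return (d_hat - d_null.mean()) / d_null.std(ddof=1)

def adjust_for_ranking(d_hat, d_null, alpha=0.05):
    """A(D)(alpha) = D - q0(1 - alpha): subtract the (1 - alpha)
    quantile of the null distribution; smaller alpha penalizes more."""
    return d_hat - np.quantile(d_null, 1.0 - alpha)

rng = np.random.default_rng(0)
d_null = rng.beta(0.5, 5.0, 10_000)  # stand-in sample of D0 under independence

print(standardize(0.4, d_null))
print(adjust_for_ranking(0.4, d_null, alpha=0.01)
      < adjust_for_ranking(0.4, d_null, alpha=0.10))  # True: smaller alpha, more penalty
```

The standardized score is a z-score against the null, while A(D̂)(α) exposes α as a knob for how hard statistical significance is penalized, which the next slides exploit.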
  • 21. Standardized Gini (SGini) corrects for selection bias
    Select unpredictive features X1 with 2 categories and X2 with 3 categories.

    [Figure: probability of selection, X1 ≈ X2 ≈ 50%]

    Experiment: X1 and X2 each get selected on average almost 50% of the time ( Good )
    Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].
    Nonetheless: we found that this is a simplistic scenario.
  • 22-23. Standardized Gini (SGini) might be biased
    Fix the predictiveness of features X1 and X2 to a constant ≠ 0.

    [Figure: probability of selection, biased towards X1]

    Experiment: SGini becomes biased towards X1 because it is more statistically significant ( Bad )
    This behavior has been overlooked in the decision tree community.
    Use A(D̂)(α) to penalize less or even tune the bias! ⇒ AGini(α)
  • 24. Application to random forests
    Why random forests? A good classifier to try first when there are "meaningful" features [Fernández-Delgado et al., 2014]. We plug in different splitting criteria.
    Experiment: 19 data sets with categorical variables.

    [Figure: mean AUC (range 90 to 91.5) as a function of α for AGini(α), compared with SGini and Gini, using the same α for all data sets]

    And α can be tuned for each data set with cross-validation.
  • 25. (Section outline) Motivation · Adjustment for Quantification · Adjustment for Ranking · Conclusions
  • 26. Conclusion - Message
    Dependency estimates are high because of chance under finite samples. Adjustments can help for:
      Quantification, to have an interpretable value in [0, 1]
      Ranking, to avoid biases towards:
        missing values
        categorical variables with more categories
    Future work: adjust dependency measures between multiple variables D(X1, . . . , Xd) because of the bias towards large d
  • 27. Thank you. Questions?
    Simone Romano
    me@simoneromano.com
    @ialuronico
    Code available online: https://github.com/ialuronico
  • 28-30. References
    Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Data Mining, 2006. ICDM '06. Sixth International Conference on, pages 107-118. IEEE.
    Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
    Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227.
    Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7-30.
    Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90-97.
    Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133-3181.
    Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152-160.
    Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182.
    Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
    Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). FILTA: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145-160. Springer.
    Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.
    Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518-1524.
    Strehl, A. and Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583-617.
    Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483-501.
    Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306-329.
    Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.