Motivation Adjustment for Quantification Adjustment for Ranking Conclusions
SDM 2016 – May 6th 2016
A Framework to Adjust Dependency Measure Estimates for Chance
Simone Romano
me@simoneromano.com
@ialuronico
Nguyen Xuan Vinh, James Bailey, Karin Verspoor
(We won the Best Paper Award!)
Department of Computing and Information Systems,
The University of Melbourne, Victoria, Australia
I will soon start working as an applied scientist in London, UK
Simone Romano University of Melbourne
A Framework to Adjust Dependency Measure Estimates for Chance
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Dependency Measures
A dependency measure D is used to assess
the amount of dependency between variables:
Example 1: After collecting weight and height for many people,
we can compute D(weight, height)
Example 2: assess the amount
of dependency between search
queries in Google
https://www.google.com/
trends/correlate/
They are fundamental to a number of applications in machine learning and data mining
Applications of Dependency Measures
Supervised learning
Feature selection [Guyon and Elisseeff, 2003];
Decision tree induction [Criminisi et al., 2012];
Evaluation of classification accuracy [Witten et al., 2011].
Unsupervised learning
External clustering validation [Strehl and Ghosh, 2003];
Generation of alternative or multi-view clusterings
[Müller et al., 2013, Dang and Bailey, 2015];
Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
Exploratory analysis
Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
Analysis of neural time-series data [Cohen, 2014].
Motivation for Adjustment for Quantification
Pearson's correlation between two variables X and Y estimated on a data sample
Sn = {(xk, yk)} of n data points:

r(S_n|X, Y) = \frac{\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k - \bar{x})^2 \sum_{k=1}^{n}(y_k - \bar{y})^2}}    (1)
[Figure: scatter plots with Pearson correlation coefficients 1, 0.8, 0.4, 0, −0.4, −0.8, −1. From
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]
r^2(S_n|X, Y) can be used as a proxy of the amount of noise for linear relationships:
1 if noiseless
0 if complete noise
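As a concrete illustration (a minimal sketch in plain Python, not the code used in the paper), r^2 computed from Equation (1) is 1 for a noiseless linear relationship and drops towards 0 as noise dominates:

```python
import random
import statistics

def pearson_r2(xs, ys):
    """r^2(Sn | X, Y): squared sample Pearson correlation, Equation (1) squared."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1000)]
noiseless = [2 * x for x in xs]                    # perfect linear relationship
noisy = [2 * x + random.gauss(0, 10) for x in xs]  # noise variance dominates the signal

r2_noiseless = pearson_r2(xs, noiseless)  # exactly 1 up to rounding
r2_noisy = pearson_r2(xs, noisy)          # close to 0
print(r2_noiseless, r2_noisy)
```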
The Maximal Information Coefficient (MIC) was published in Science
[Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
MIC(X, Y) can be used as a proxy of the amount of noise for functional
relationships:
Figure : From supplementary material online in [Reshef et al., 2011]
MIC should be equal to:
1 if the relationship between X and Y is functional and noiseless
0 if there is complete noise
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data
points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data
points:
[Figure: distributions of MIC(S20|X, Y) and MIC(S80|X, Y) under complete noise, spread over the range 0.2–1.]
The value can be high because of chance!
The user expects values close to 0 in both cases
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Chance
We define a framework for adjustment:
Adjustment for Quantification
A\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\max\hat{D} - E[\hat{D}_0]}

It uses the distribution of \hat{D}_0 under independent variables:
r^2_0: Beta distribution
MIC_0: can be computed using Monte Carlo permutations.
This adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
Adjusted r^2 ⇒ Ar^2
Adjusted MIC ⇒ AMIC
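A permutation-based sketch of this adjustment (assumptions: max D̂ = 1, E[D̂_0] estimated by Monte Carlo permutations, and r^2 standing in for a generic measure D̂):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

def adjusted(D, xs, ys, n_perms=100, d_max=1.0):
    """A(D_hat) = (D_hat - E[D0]) / (max D_hat - E[D0]); E[D0] is estimated
    by shuffling y, which breaks any dependency but keeps the marginals."""
    d_hat = D(xs, ys)
    ys_perm = list(ys)
    null = []
    for _ in range(n_perms):
        random.shuffle(ys_perm)
        null.append(D(xs, ys_perm))
    e_d0 = statistics.mean(null)
    return (d_hat - e_d0) / (d_max - e_d0)

# Completely noisy relationships on a small sample (n = 20):
random.seed(1)
raw, adj = [], []
for _ in range(200):
    xs = [random.gauss(0, 1) for _ in range(20)]
    ys = [random.gauss(0, 1) for _ in range(20)]
    raw.append(r2(xs, ys))
    adj.append(adjusted(r2, xs, ys))
print(statistics.mean(raw))  # clearly above 0: inflated by chance
print(statistics.mean(adj))  # close to 0 on average
```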
Adjusted measures enable better interpretability
Task:
Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise level:  0%    20%    40%    60%    80%    100%
r^2:          1     0.66   0.39   0.2    0.073  0.035
Ar^2:         1     0.65   0.37   0.17   0.044  0.00046

Figure: Ar^2 becomes zero on average on 100% noise: r^2 = 0.035 vs Ar^2 = 0.00046.

Noise level:  0%    20%    40%    60%    80%    100%
MIC:          1     0.7    0.47   0.34   0.27   0.26
AMIC:         1     0.6    0.29   0.11   0.021  0.0014

Figure: AMIC becomes zero on average on 100% noise: MIC = 0.26 vs AMIC = 0.0014.
Not biased towards small sample size n
Average value of \hat{D} for different % of noise
⇒ estimates can be high because of chance at small n (e.g. because of missing values)

[Figure: average value vs. noise level (0–100%) for Raw r^2 (n = 10, 20, 30, 40, 100, 200) and Raw MIC (n = 20, 40, 60, 80), and for the adjusted versions Ar^2 and AMIC: the raw estimates are inflated at small n, while the adjusted versions remove the small-n inflation.]
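The small-n inflation of the raw estimate is easy to reproduce (an illustrative sketch; for independent bivariate normal data, E[r^2] is known to be 1/(n − 1)):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

random.seed(2)
means = {}
for n in (10, 20, 40, 100, 200):
    vals = []
    for _ in range(500):  # 500 fully noisy relationships per sample size
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [random.gauss(0, 1) for _ in range(n)]
        vals.append(r2(xs, ys))
    means[n] = statistics.mean(vals)
    print(n, round(means[n], 4))  # shrinks roughly like 1/(n - 1)
```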
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2 defined as follows:
X1 ≡ patient had breakfast today, X1 = {yes, no};
X2 ≡ patient eye color, X2 = {green, blue, brown};

[Figure: the sample partitioned by X1 = {yes, no} and by X2 = {green, blue, brown}.]

Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Selection bias experiment
Experiment:
n = 100 data points
Class C with 2 categories:
Generate a variable X1 with 2 categories (independently from C)
Generate a variable X2 with 3 categories (independently from C)
Compute Gini(X1, C) and Gini(X2, C). Give a win to the variable that gets the highest value.
REPEAT 10,000 times

[Figure: bar chart of the probability of selection for X1 and X2.]

Result: X2 gets selected 70% of the time ( Bad )
Given that they are equally unpredictive, we expected 50%.
Challenge: adjust the estimated Gini gain to obtain an unbiased ranking
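The experiment above can be reproduced in a few lines (an illustrative sketch using the standard Gini gain, run for 2,000 repetitions rather than 10,000 to keep it fast):

```python
import random
from collections import Counter, defaultdict

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(feature, labels):
    """Gini(X, C): decrease in impurity of class C after splitting on X."""
    n = len(labels)
    groups = defaultdict(list)
    for x, c in zip(feature, labels):
        groups[x].append(c)
    return gini_impurity(labels) - sum(
        len(g) / n * gini_impurity(g) for g in groups.values())

random.seed(3)
n, trials, wins_x2 = 100, 2000, 0
for _ in range(trials):
    C  = [random.randrange(2) for _ in range(n)]  # class with 2 categories
    X1 = [random.randrange(2) for _ in range(n)]  # 2 categories, independent of C
    X2 = [random.randrange(3) for _ in range(n)]  # 3 categories, independent of C
    if gini_gain(X2, C) > gini_gain(X1, C):
        wins_x2 += 1
print(wins_x2 / trials)  # well above 0.5: X2 wins purely because it has more categories
```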
Adjustment for Ranking
We propose two adjustments for ranking:
Standardization:

S\hat{D} = \frac{\hat{D} - E[\hat{D}_0]}{\sqrt{Var(\hat{D}_0)}}

Quantifies statistical significance, like a p-value.

Adjustment for Ranking:

A\hat{D}(\alpha) = \hat{D} - q_0(1 - \alpha)

where q_0 is the quantile function of the distribution of \hat{D}_0. Penalizes on statistical significance according to α (smaller α, more penalization).
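Both adjustments can be estimated from one permutation null (a sketch under the assumption that the distribution of D̂_0 is approximated by Monte Carlo permutations, with r^2 again standing in for D̂):

```python
import random
import statistics

def r2(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov * cov / (sum((x - mx) ** 2 for x in xs) *
                        sum((y - my) ** 2 for y in ys))

def rank_adjustments(D, xs, ys, alpha=0.05, n_perms=500):
    """SD = (D_hat - E[D0]) / sqrt(Var(D0));  AD(alpha) = D_hat - q0(1 - alpha)."""
    d_hat = D(xs, ys)
    ys_perm = list(ys)
    null = []
    for _ in range(n_perms):
        random.shuffle(ys_perm)          # sample the null distribution D0
        null.append(D(xs, ys_perm))
    sd = (d_hat - statistics.mean(null)) / statistics.stdev(null)
    q0 = sorted(null)[int((1 - alpha) * n_perms) - 1]  # empirical (1 - alpha) quantile
    return sd, d_hat - q0

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [x + random.gauss(0, 0.5) for x in xs]    # strong dependency
sd_dep, ad_dep = rank_adjustments(r2, xs, ys)

zs = [random.gauss(0, 1) for _ in range(100)]  # independent of xs
sd_ind, ad_ind = rank_adjustments(r2, xs, zs)

print(sd_dep, ad_dep)   # large z-score, penalized value stays well above 0
print(sd_ind, ad_ind)   # z-score near 0
```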
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.
[Figure: bar chart of the probability of selection for X1 and X2.]

Experiment: X1 and X2 each get selected about 50% of the time
( Good )
Being similar to a p-value, this is consistent with the literature on decision
trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006,
Strobl et al., 2007].
Nonetheless, we found that this is a simplistic scenario.
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to the same constant ≠ 0

[Figure: bar chart of the probability of selection for X1 and X2.]

Experiment: SGini becomes biased towards X1 because it is more statistically significant
( Bad )
This behavior has been overlooked in the decision tree community.
Use A\hat{D}(α) to penalize less or even tune the bias! ⇒ AGini(α)
Application to random forest
Why random forest? It is a good classifier to try first when there are "meaningful" features
[Fernández-Delgado et al., 2014].
We plug in different splitting criteria.
Experiment: 19 data sets with categorical variables.

[Figure: mean AUC (about 90–91.5) as a function of α for AGini(α), compared with SGini and Gini, using the same α for all data sets.]

And α can be tuned for each data set with cross-validation.
Motivation
Adjustment for Quantification
Adjustment for Ranking
Conclusions
Conclusion - Message
Dependency estimates are high because of chance under finite samples.
Adjustments can help for:
Quantification, to have an interpretable value in [0, 1]
Ranking, to avoid biases towards:
missing values
categorical variables with more categories
Future Work:
Adjust dependency measures between multiple variables D(X1, . . . , Xd), which show a bias towards large d
Thank you.
Questions?
Simone Romano
me@simoneromano.com
@ialuronico
Code available online:
https://github.com/ialuronico
References
Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006).
Meta clustering.
In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 107–118. IEEE.
Cohen, M. X. (2014).
Analyzing neural time series data: theory and practice.
MIT Press.
Criminisi, A., Shotton, J., and Konukoglu, E. (2012).
Decision forests: A unified framework for classification, regression, density estimation,
manifold learning and semi-supervised learning.
Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.
Dang, X. H. and Bailey, J. (2015).
A framework to uncover multiple alternative clusterings.
Machine Learning, 98(1-2):7–30.
Dobra, A. and Gehrke, J. (2001).
Bias correction in classification tree construction.
In ICML, pages 90–97.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014).
Do we need hundreds of classifiers to solve real world classification problems?
The Journal of Machine Learning Research, 15(1):3133–3181.
Frank, E. and Witten, I. H. (1998).
Using a permutation test for attribute selection in decision trees.
In ICML, pages 152–160.
Guyon, I. and Elisseeff, A. (2003).
An introduction to variable and feature selection.
The Journal of Machine Learning Research, 3:1157–1182.
Hothorn, T., Hornik, K., and Zeileis, A. (2006).
Unbiased recursive partitioning: A conditional inference framework.
Journal of Computational and Graphical Statistics, 15(3):651–674.
Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014).
Filta: Better view discovery from collections of clusterings via filtering.
In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013).
Discovering multiple clustering solutions: Grouping objects in different views of the data.
Tutorial at ICML.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh,
P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011).
Detecting novel associations in large data sets.
Science, 334(6062):1518–1524.
Strehl, A. and Ghosh, J. (2003).
Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617.
Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007).
Unbiased split selection for classification trees based on the gini index.
Computational Statistics & Data Analysis, 52(1):483–501.
Villaverde, A. F., Ross, J., and Banga, J. R. (2013).
Reverse engineering cellular networks with information theoretic methods.
Cells, 2(2):306–329.
Witten, I. H., Frank, E., and Hall, M. A. (2011).
Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 3rd edition.
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...
 
Object Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOADObject Oriented Analysis and Design - OOAD
Object Oriented Analysis and Design - OOAD
 
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEERDELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
DELTA V MES EMERSON EDUARDO RODRIGUES ENGINEER
 

A Framework to Adjust Dependency Measure Estimates for Chance

  • 4. Applications of Dependency Measures
    Supervised learning:
      Feature selection [Guyon and Elisseeff, 2003];
      Decision tree induction [Criminisi et al., 2012];
      Evaluation of classification accuracy [Witten et al., 2011].
    Unsupervised learning:
      External clustering validation [Strehl and Ghosh, 2003];
      Generation of alternative or multi-view clusterings [Müller et al., 2013, Dang and Bailey, 2015];
      Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].
    Exploratory analysis:
      Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
      Analysis of neural time-series data [Cohen, 2014].
  • 5. Motivation for Adjustment for Quantification
    Pearson's correlation between two variables X and Y, estimated on a data sample Sn = {(xk, yk)} of n data points:

        r(Sn | X, Y) = [ Σ_{k=1}^{n} (xk − x̄)(yk − ȳ) ] / [ √(Σ_{k=1}^{n} (xk − x̄)²) · √(Σ_{k=1}^{n} (yk − ȳ)²) ]    (1)

    [Figure: scatter plots with correlation values 1, 0.8, 0.4, 0, −0.4, −0.8, −1, from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient]

    r²(Sn | X, Y) can be used as a proxy for the amount of noise in linear relationships:
      1 if noiseless
      0 if complete noise
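The estimator in Eq. (1) is straightforward to compute directly. A minimal sketch (the function name `pearson_r` and the sample data are illustrative, not from the talk):

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation r(S_n | X, Y) as in Eq. (1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y_noiseless = 2 * x + 1           # perfectly linear relationship: r^2 = 1
y_noisy = rng.uniform(0, 1, 200)  # complete noise, independent of x

print(pearson_r(x, y_noiseless) ** 2)  # 1.0 up to floating point
print(pearson_r(x, y_noisy) ** 2)      # close to 0
```

Squaring the result gives the r² noise proxy used on this slide.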
  • 6. The Maximal Information Coefficient (MIC)
    MIC was published in Science [Reshef et al., 2011] and has ≈ 570 citations to date according to Google Scholar.
    MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:

    [Figure: noisy functional relationships and their MIC values, from the supplementary material of [Reshef et al., 2011]]

    MIC should be equal to:
      1 if the relationship between X and Y is functional and noiseless
      0 if there is complete noise
  • 7. Challenge
    Nonetheless, MIC estimation is challenging on a finite data sample Sn of n data points.
    We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:

    [Figure: distributions of MIC(S20 | X, Y) and MIC(S80 | X, Y) over the range 0.2 to 1]

    The value can be high because of chance! The user expects values close to 0 in both cases.
    Challenge: adjust the estimated MIC to better exploit the range [0, 1]
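The same chance inflation at small n is easy to reproduce. The sketch below uses r² rather than MIC, purely to stay dependency-free (computing MIC would need an external package such as minepy); the function name and trial counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_chance_r2(n, trials=10_000):
    """Average r^2 between two *independent* variables on n points."""
    vals = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)  # independent of x: no real dependency
        r = np.corrcoef(x, y)[0, 1]
        vals[t] = r * r
    return vals.mean()

print(mean_chance_r2(20))  # roughly 1/(n-1), about 0.05
print(mean_chance_r2(80))  # roughly 0.013: chance inflation shrinks with n
```

The smaller the sample, the larger the estimate under pure noise, which is exactly the effect the adjustment targets.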
  • 8. Adjustment for Chance
    We define a framework for adjustment:

    Adjustment for Quantification:  A(D̂) = (D̂ − E[D̂0]) / (max D̂ − E[D̂0])

    It uses the distribution of D̂0 under independent variables:
      r²0: Beta distribution
      MIC0: can be computed using Monte Carlo permutations
    Used in κ-statistics. Its application is beneficial to other dependency measures:
      Adjusted r² ⇒ Ar²
      Adjusted MIC ⇒ AMIC
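A minimal sketch of this adjustment, using Monte Carlo permutations to estimate E[D̂0] for any plug-in measure (for r² the paper uses the Beta distribution analytically; the permutation route shown here is the general fallback, and all function names are illustrative):

```python
import numpy as np

def adjust_for_quantification(d_hat, d_null, d_max=1.0):
    """A(D) = (D - E[D0]) / (max D - E[D0]), with E[D0] estimated
    from a sample d_null of the measure under independence."""
    e0 = np.mean(d_null)
    return (d_hat - e0) / (d_max - e0)

def null_distribution(x, y, stat, n_perm=1000, seed=0):
    """Monte Carlo permutation null: break the X-Y pairing."""
    rng = np.random.default_rng(seed)
    return np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])

# illustrative usage with r^2 as the dependency measure D
r2 = lambda x, y: np.corrcoef(x, y)[0, 1] ** 2
rng = np.random.default_rng(1)
x, y = rng.standard_normal(30), rng.standard_normal(30)  # independent pair
d0 = null_distribution(x, y, r2)
print(adjust_for_quantification(r2(x, y), d0))  # small: pulled toward 0
```

A noiseless relationship keeps its score of 1 (the numerator and denominator coincide), while complete noise is pulled toward 0 on average.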
  • 9. Adjusted measures enable better interpretability
    Task: obtain 1 for a noiseless relationship and 0 for complete noise (on average).

    Noise:  0%    20%    40%    60%    80%     100%
    r²:     1     0.66   0.39   0.2    0.073   0.035
    Ar²:    1     0.65   0.37   0.17   0.044   0.00046

    Figure: Ar² becomes zero on average on 100% noise: r² = 0.035 vs Ar² = 0.00046.

    Noise:  0%    20%    40%    60%    80%     100%
    MIC:    1     0.7    0.47   0.34   0.27    0.26
    AMIC:   1     0.6    0.29   0.11   0.021   0.0014

    Figure: AMIC becomes zero on average on 100% noise: MIC = 0.26 vs AMIC = 0.0014.
  • 10. Not biased towards small sample size n
    Average value of D̂ for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values).

    [Figures: raw r² vs noise level for n = 10, 20, 30, 40, 100, 200; raw MIC vs noise level for n = 20, 40, 60, 80]

  • 11. Not biased towards small sample size n
    The same comparison with the adjusted measures:

    [Figures: Ar² (adjusted) vs noise level for n = 10, 20, 30, 40, 100, 200; AMIC (adjusted) vs noise level for n = 20, 40, 60, 80]
  • 12. (Section outline) Motivation · Adjustment for Quantification · Adjustment for Ranking · Conclusions
  • 13-14. Motivation for Adjustment for Ranking
    Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:
      X1 ≡ patient had breakfast today, X1 = {yes, no};
      X2 ≡ patient eye color, X2 = {green, blue, brown};
    Problem: when ranking variables, dependency measures are biased towards the selection of variables with many categories.
    This still happens because of finite samples!
  • 15-19. Selection bias experiment
    Experiment: n = 100 data points, class C with 2 categories.
      Generate a variable X1 with 2 categories (independently from C)
      Generate a variable X2 with 3 categories (independently from C)
      Compute Gini(X1, C) and Gini(X2, C); give a win to the variable that gets the highest value
      REPEAT 10,000 times

    [Figure: probability of selection, X1 ≈ 30% vs X2 ≈ 70%]

    Result: X2 gets selected 70% of the time ( Bad )
    Given that they are equally unpredictive, we expected 50%.
    Challenge: adjust the estimated Gini gain to obtain unbiased rankings
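The steps above can be sketched directly. This is a minimal reproduction under stated assumptions (uniform categories, fewer repetitions than the talk's 10,000 for speed; the helper names are illustrative):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_gain(x, c):
    """Impurity decrease of class c after splitting on feature x."""
    gain = gini_impurity(c)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * gini_impurity(c[mask])
    return gain

rng = np.random.default_rng(0)
wins = np.zeros(2)  # wins[0]: X1, wins[1]: X2
for _ in range(2000):
    c  = rng.integers(0, 2, 100)  # binary class, n = 100
    x1 = rng.integers(0, 2, 100)  # 2 categories, independent of c
    x2 = rng.integers(0, 3, 100)  # 3 categories, independent of c
    wins[int(gini_gain(x2, c) > gini_gain(x1, c))] += 1

print(wins / wins.sum())  # X2 wins well over half the time
```

Even though both features are pure noise, the 3-category feature wins the Gini comparison far more often than 50% of the time, matching the bias reported on the slide.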
  • 20. Adjustment for Ranking
    We propose two adjustments for ranking:

    Standardization:  S(D̂) = (D̂ − E[D̂0]) / √Var(D̂0)
      Quantifies statistical significance, like a p-value.

    Adjustment for Ranking:  A(D̂)(α) = D̂ − q0(1 − α)
      Penalizes on statistical significance according to α, where q0 is the quantile function of the distribution of D̂0 (smaller α means more penalization).
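Both adjustments reduce to a few lines once a null sample of D̂0 is available (e.g. from Monte Carlo permutations). A minimal sketch; the Beta-distributed stand-in for the null and the function names are assumptions for illustration:

```python
import numpy as np

def standardize(d_hat, d_null):
    """S(D) = (D - E[D0]) / sqrt(Var(D0))."""
    return (d_hat - d_null.mean()) / d_null.std(ddof=1)

def adjust_for_ranking(d_hat, d_null, alpha=0.05):
    """A(D)(alpha) = D - q0(1 - alpha): subtract the (1 - alpha)
    quantile of the null distribution; smaller alpha penalizes more."""
    return d_hat - np.quantile(d_null, 1.0 - alpha)

rng = np.random.default_rng(0)
d_null = rng.beta(0.5, 5.0, 10_000)  # stand-in sample of D0 under independence

print(standardize(0.4, d_null))
print(adjust_for_ranking(0.4, d_null, alpha=0.01)
      < adjust_for_ranking(0.4, d_null, alpha=0.10))  # True: smaller alpha, more penalty
```

The standardized score is a z-score against the null, while A(D̂)(α) exposes α as a knob for how hard statistical significance is penalized, which the next slides exploit.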
  • 21. Standardized Gini (SGini) corrects for selection bias
    Select unpredictive features X1 with 2 categories and X2 with 3 categories.

    [Figure: probability of selection, X1 ≈ X2 ≈ 50%]

    Experiment: X1 and X2 each get selected on average almost 50% of the time ( Good )
    Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].
    Nonetheless: we found that this is a simplistic scenario.
  • 22-23. Standardized Gini (SGini) might be biased
    Fix the predictiveness of features X1 and X2 to a constant ≠ 0.

    [Figure: probability of selection, biased towards X1]

    Experiment: SGini becomes biased towards X1 because it is more statistically significant ( Bad )
    This behavior has been overlooked in the decision tree community.
    Use A(D̂)(α) to penalize less or even tune the bias! ⇒ AGini(α)
  • 24. Application to random forests
    Why random forests? A good classifier to try first when there are "meaningful" features [Fernández-Delgado et al., 2014]. We plug in different splitting criteria.
    Experiment: 19 data sets with categorical variables.

    [Figure: mean AUC (range 90 to 91.5) as a function of α for AGini(α), compared with SGini and Gini, using the same α for all data sets]

    And α can be tuned for each data set with cross-validation.
  • 25. (Section outline) Motivation · Adjustment for Quantification · Adjustment for Ranking · Conclusions
  • 26. Conclusion - Message
    Dependency estimates are high because of chance under finite samples. Adjustments can help for:
      Quantification, to have an interpretable value in [0, 1]
      Ranking, to avoid biases towards:
        missing values
        categorical variables with more categories
    Future work: adjust dependency measures between multiple variables D(X1, . . . , Xd) because of the bias towards large d
  • 27. Thank you. Questions?
    Simone Romano
    me@simoneromano.com
    @ialuronico
    Code available online: https://github.com/ialuronico
  • 28-30. References
    Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Data Mining, 2006. ICDM '06. Sixth International Conference on, pages 107-118. IEEE.
    Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.
    Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227.
    Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7-30.
    Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90-97.
    Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133-3181.
    Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152-160.
    Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182.
    Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
    Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). FILTA: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145-160. Springer.
    Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.
    Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518-1524.
    Strehl, A. and Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583-617.
    Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483-501.
    Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306-329.
    Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.