Know thy tools

Data Mining
tim.menzies@gmail.com

Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.

INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines
https://raw.githubusercontent.com/timm/axe/master/old/ediv.py
Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ]
Output: 1, 11
E = –Σ p*log2(p), where p is the proportion of each class
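To make the slide concrete, here is a minimal sketch of entropy-based discretization with the Fayyad and Irani MDL stopping rule. It is not the linked ediv.py: the function names (`entropy`, `mdl_accepts`, `ediv`) are illustrative, and reading the output "1, 11" as the lowest value of each bin is an assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E = -sum(p * log2(p)) over the class proportions p."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mdl_accepts(whole, left, right):
    """Fayyad-Irani MDL test: is the information gain worth the extra cut?"""
    n = len(whole)
    e, e1, e2 = entropy(whole), entropy(left), entropy(right)
    gain = e - (len(left) * e1 + len(right) * e2) / n
    k, k1, k2 = len(set(whole)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * e - k1 * e1 - k2 * e2)
    return gain > (log2(n - 1) + delta) / n

def ediv(pairs):
    """Return the lowest value of each bin (assumed output format)."""
    pairs = sorted(pairs)
    starts = []

    def recurse(lo, hi):
        labels = [label for _, label in pairs[lo:hi]]
        best, cut = entropy(labels), None
        for i in range(lo + 1, hi):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # never cut between identical values
            left, right = labels[:i - lo], labels[i - lo:]
            expected = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            if expected < best:
                best, cut = expected, i
        if cut is not None and mdl_accepts(labels, labels[:cut - lo], labels[cut - lo:]):
            recurse(lo, cut)
            recurse(cut, hi)
        else:
            starts.append(pairs[lo][0])  # no worthwhile cut: one bin

    recurse(0, len(pairs))
    return starts

data = [(1, "X"), (2, "X"), (3, "X"), (4, "X"),
        (11, "Y"), (12, "Y"), (13, "Y"), (14, "Y")]
print(ediv(data))  # → [1, 11]
```

On the slide's input, the only cut that passes the MDL test falls between 4 and 11. Each half is then pure (entropy 0), so no further cut is accepted.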


It doesn't matter what you do but
does matter who does it!
Martin Shepperd, Brunel University, West London, UK
http://crest.cs.ucl.ac.uk/?id=3695

Systematic Review
• Conducted by Tracy Hall and David Bowes
– T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic
literature review on fault prediction performance in software
engineering”, Accepted for publication in TSE (download from BURA).
• Located 208 relevant primary studies
• Reporting requirements narrowed these to 18
studies containing 194 results
– Required: binary classifiers, confusion matrix, context details

Matthews correlation coefficient
[Figure: histogram titled "MCC" (x-axis: Dataset$MCC, y-axis: frequency) and a plot of Dataset$MCC against rnorm(194)]
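TABLE IV
COMPOSITE PERFORMANCE MEASURES

Recall (probability of detection)
  Defined as: TP / (TP + FN)
  Description: Proportion of faulty units correctly classified

Precision
  Defined as: TP / (TP + FP)
  Description: Proportion of units correctly predicted as faulty

Probability of false alarm
  Defined as: FP / (FP + TN)
  Description: Proportion of non-faulty units incorrectly classified

Specificity
  Defined as: TN / (TN + FP)
  Description: Proportion of correctly classified non-faulty units

F-measure
  Defined as: (2 · Recall · Precision) / (Recall + Precision)
  Description: Most commonly defined as the harmonic mean of precision and recall

Accuracy
  Defined as: (TN + TP) / (TN + FN + FP + TP)
  Description: Proportion of correctly classified units

Matthews Correlation Coefficient (MCC)
  Defined as: (TP × TN – FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
  Description: Combines all quadrants of the confusion matrix to produce a value
  from –1 to +1, with 0 indicating random correlation between the prediction and
  the recorded results. MCC can be tested for statistical significance,
  with χ2 = N · MCC^2, where N is the total number of instances.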
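The measures in Table IV can all be computed from the four cells of a confusion matrix. The sketch below is only an illustration: the function name `measures`, the dictionary keys, and the convention of returning 0.0 when a denominator is zero are assumptions, not something taken from the study.

```python
from math import sqrt

def measures(tp, fp, fn, tn):
    """Composite performance measures from one binary confusion matrix."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    n = tp + fp + fn + tn
    mcc = ratio(tp * tn - fp * fn,
                sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "recall": recall,
        "precision": precision,
        "false_alarm": ratio(fp, fp + tn),
        "specificity": ratio(tn, tn + fp),
        "f_measure": ratio(2 * recall * precision, recall + precision),
        "accuracy": ratio(tn + tp, n),
        "mcc": mcc,
        "chi2": n * mcc ** 2,  # test statistic for the significance of MCC
    }

# Hypothetical counts, chosen only to exercise the formulas.
for name, value in measures(tp=40, fp=20, fn=10, tn=130).items():
    print(f"{name:12s} {value:.3f}")
```

Unlike recall or precision, MCC uses all four cells, so it stays near 0 for a classifier that simply predicts the majority class.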

ANOVA Results
Factor % of variance
Author group 61%
Metric family 3%
Author/metric 9%
Everything else 8% (but not significant)
Residuals 19%
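One way to read a "% of variance" figure is as the ratio of between-group to total sum of squares. The sketch below shows that ratio for a single factor only. The study's ANOVA had several factors, so this is a simplification, and the group names and scores are invented toy values, not data from the review.

```python
def variance_explained(groups):
    """Share of total variance explained by group membership (one factor)."""
    values = [v for scores in groups.values() for v in scores]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0

# Hypothetical MCC scores for three made-up author groups.
toy = {
    "group_a": [0.10, 0.15, 0.20],
    "group_b": [0.40, 0.45, 0.50],
    "group_c": [0.70, 0.75, 0.80],
}
print(f"{variance_explained(toy):.0%} of variance explained by group")
```

When results cluster tightly within each group but differ across groups, as in the toy data, the grouping factor accounts for most of the variance.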

Final word
We cannot ignore the fact that
the main determinant of a
validation study result is which
research group undertakes it.