Know thy tools

Recommender workshop, ICSE'14

  1. Data Mining
     tim.menzies@gmail.com
  2. Know thy tools
     Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed.
  3. INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines
     https://raw.githubusercontent.com/timm/axe/master/old/ediv.py
     Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ]
     Output: 1, 11
     E = Σ –p*log2(p)
  4. Know thy tools
     Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed.
  5. Know thy tools
     Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed.
  6. It doesn't matter what you do but does matter who does it!
     Martin Shepperd, Brunel University, West London, UK
     http://crest.cs.ucl.ac.uk/?id=3695
  7. Systematic Review
     • Conducted by Tracy Hall and David Bowes
       – T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic literature review on fault prediction performance in software engineering”, accepted for publication in TSE (download from BURA).
     • Located 208 relevant primary studies
     • Due to reporting requirements, used 18 studies that contain 194 results
       – binary classifiers, confusion matrix, context details
  8. Matthews correlation coefficient (MCC)
     [Figure: histogram of Dataset$MCC (x-axis: MCC, y-axis: frequency) and a plot of Dataset$MCC against rnorm(194)]
     TABLE IV: COMPOSITE PERFORMANCE MEASURES ([...] marks text cropped on the slide)
     • [...] detection). Defined as: TP / (TP + FN). Description: Proportion of faulty units cor[...]
     • [...]. Defined as: TP / (TP + FP). Description: Proportion of units correctl[...] faulty
     • [...]alse alarm). Defined as: FP / (FP + TN). Description: Proportion of non-faulty un[...] classified
     • [...]. Defined as: TN / (TN + FP). Description: Proportion of correctly classi[...] units
     • [...]. Defined as: 2 · Recall · Precision / (Recall + Precision). Description: Most commonly defined as [...] mean of precision and recall
     • [...]. Defined as: (TN + TP) / (TN + FN + FP + TP). Description: Proportion of correctly classifi[...]
     • [...]on Coefficient. Defined as: (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). Description: Combines all quadrants of the confusion matrix to produce a value [...] to +1 with 0 indicating random [...]tween the prediction and the r[...] MCC can be tested for statistic[...] with χ² = N · MCC² where [...] number of instances.
  9. (iv) Research Group
  10. ANOVA Results
      Factor            % of var
      Author group      61%
      Metric family     3%
      Author/metric     9%
      Everything else   8% (but not significant)
      Residuals         19%
  11. Final word
      We cannot ignore the fact that the main determinant of a validation study result is which research group undertakes it.
  12. Know thy tools
      Stop treating data miners as black boxes. Looking inside is (1) fun, (2) easy, (3) needed.
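
To make the "looking inside is easy" point concrete for the INFOGAIN slide, here is a minimal Python sketch of the entropy step only. It is not the linked ediv.py and it omits the Fayyad and Irani MDL stopping rule and the recursion; it just finds the single cut that minimizes expected entropy E = Σ –p*log2(p). The function names `entropy` and `best_split` are hypothetical, and reading the slide's "Output: 1, 11" as the first value of each bin is an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """E = sum of -p*log2(p) over the class frequencies in labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return (cut, expected_entropy) for the best single binary split.

    cut is the first value of the right-hand bin. Returns (None, E) when
    no split lowers the entropy of the unsplit data. Hypothetical helper:
    a real discretizer would recurse and apply an MDL stopping test.
    """
    pairs = sorted(pairs)
    labels = [label for _, label in pairs]
    n = len(pairs)
    best_cut, best_e = None, entropy(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two identical values
        # Expected entropy: each side's entropy weighted by its share of rows.
        e = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if e < best_e:
            best_cut, best_e = pairs[i][0], e
    return best_cut, best_e


# The input shown on the slide.
data = [(1, "X"), (2, "X"), (3, "X"), (4, "X"),
        (11, "Y"), (12, "Y"), (13, "Y"), (14, "Y")]
cut, e = best_split(data)
bins = [data[0][0], cut]  # assumed reading of "Output": first value of each bin
print(bins, e)
```

On the slide's input the unsplit data has an entropy of 1 bit, and cutting at 11 leaves two pure bins with expected entropy 0, so the bins start at 1 and 11.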

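
The MCC definition and the χ² = N · MCC² test from the Matthews correlation coefficient slide can be sketched in a few lines of Python. The function names `mcc` and `mcc_chi2` are hypothetical, the confusion-matrix counts are invented for illustration (they are not from the 194 results), and returning 0 when a row or column of the confusion matrix is empty is an assumed convention, not something the slide states.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when the ratio is undefined
    return (tp * tn - fp * fn) / denom


def mcc_chi2(tp, fp, tn, fn):
    """Chi-squared statistic N * MCC^2, where N is the number of instances."""
    n = tp + fp + tn + fn
    return n * mcc(tp, fp, tn, fn) ** 2


# Illustrative counts only.
score = mcc(tp=40, fp=10, tn=40, fn=10)
stat = mcc_chi2(tp=40, fp=10, tn=40, fn=10)
print(round(score, 3), round(stat, 3))
```

With these counts MCC is 0.6 and the statistic is about 36, well above 3.84, the usual 5% critical value for one degree of freedom.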