Data Mining
tim.menzies@gmail.com
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
INFOGAIN: (the Fayyad and Irani MDL discretizer) in 55 lines
https://raw.githubusercontent.com/timm/axe/master/old/ediv.py
Input: [ (1,X), (2,X), (3,X), (4,X), (11,Y), (12,Y), (13,Y), (14,Y) ]
Output: 1, 11
E = Σ –p*log2(p)
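A minimal sketch of the idea, not the code in ediv.py: it computes E = Σ –p*log2(p) and picks the single cut with the lowest expected entropy. It assumes one binary split only, with no recursion and no MDL stopping rule; the names `entropy` and `best_cut` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """E = sum of -p*log2(p) over the class proportions p."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def best_cut(pairs):
    """Return (cut value, expected entropy) of the best single binary split.

    Simplification (assumption): one split only, no MDL test, no recursion.
    """
    pairs = sorted(pairs)
    labels = [y for _, y in pairs]
    n = len(pairs)
    best = (None, entropy(labels))       # baseline: no cut at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                     # cannot cut between equal values
        left, right = labels[:i], labels[i:]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best[1]:
            best = (pairs[i][0], e)      # the cut is where the right bin starts
    return best

# The input from the slide above.
data = [(1, "X"), (2, "X"), (3, "X"), (4, "X"),
        (11, "Y"), (12, "Y"), (13, "Y"), (14, "Y")]
cut, e = best_cut(data)
print(cut, e)
```

On the slide's input the best cut falls at 11, which leaves two pure bins starting at 1 and 11.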
It doesn't matter what you do but
does matter who does it!
Martin Shepperd, Brunel University, West London, UK
http://crest.cs.ucl.ac.uk/?id=3695
Systematic Review
• Conducted by Tracy Hall and David Bowes
– T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. “A systematic
literature review on fault prediction performance in software
engineering”, Accepted for publication in TSE (download from BURA).
• Located 208 relevant primary studies
• Due to reporting requirements, used 18 studies that contain 194 results
– binary classifiers, confusion matrix, context details
Matthews correlation coefficient
[Figure: histogram of Dataset$MCC (x-axis: MCC, y-axis: frequency) and a plot of Dataset$MCC against rnorm(194)]
TABLE IV: COMPOSITE PERFORMANCE MEASURES
(measure: defined as; description)
• Recall (probability of detection): TP / (TP + FN); proportion of faulty units correctly classified
• Precision: TP / (TP + FP); proportion of units correctly predicted as faulty
• False positive rate (probability of false alarm): FP / (FP + TN); proportion of non-faulty units incorrectly classified
• Specificity: TN / (TN + FP); proportion of correctly classified non-faulty units
• F-measure: (2 · Recall · Precision) / (Recall + Precision); most commonly defined as the harmonic mean of precision and recall
• Accuracy: (TN + TP) / (TN + FN + FP + TP); proportion of correctly classified units
• Matthews Correlation Coefficient (MCC): (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)); combines all quadrants of the confusion matrix to produce a value from −1 to +1, with 0 indicating random correlation between the prediction and the r… MCC can be tested for statistical significance with χ² = N · MCC², where N is the number of instances.
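The MCC definition and its χ² = N · MCC² test from Table IV can be sketched as follows. The function names and the example counts are hypothetical; returning 0 when the denominator is zero is a common convention assumed here, not something the slides state.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0     # assumed convention for empty margins

def mcc_chi2(tp, tn, fp, fn):
    """Chi-squared statistic N * MCC^2, where N is the number of instances."""
    n = tp + tn + fp + fn
    return n * mcc(tp, tn, fp, fn) ** 2

# Hypothetical counts, for illustration only.
print(mcc(tp=40, tn=40, fp=10, fn=10))       # about 0.6
print(mcc_chi2(tp=40, tn=40, fp=10, fn=10))  # about 36
```

Unlike accuracy or F-measure, the numerator uses all four cells, so a classifier that merely predicts the majority class scores 0.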
(iv) Research Group
ANOVA Results
Factor: % of variance
Author group 61%
Metric family 3%
Author/metric 9%
Everything else 8% (but not significant)
Residuals 19%
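To show what "% of variance" means, here is a one-factor simplification: the share of the total sum of squares that lies between the levels of a factor. The study behind these slides used a multi-factor ANOVA, so this is only a sketch; the function name and the toy scores are hypothetical and are not data from the study.

```python
def percent_variance_explained(groups):
    """groups maps a factor level (e.g. an author group) to a list of scores."""
    scores = [x for xs in groups.values() for x in xs]
    grand = sum(scores) / len(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    # Between-level sum of squares: level size times squared mean offset.
    ss_between = sum(len(xs) * (sum(xs) / len(xs) - grand) ** 2
                     for xs in groups.values())
    return 100 * ss_between / ss_total

# Toy MCC scores for two made-up groups, for illustration only.
toy = {"group A": [0.1, 0.2, 0.3], "group B": [0.5, 0.6, 0.7]}
pct = percent_variance_explained(toy)
print(round(pct, 1))
```

When the levels have very different means, as in the toy data, most of the variance is attributed to the factor.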
Final word
We cannot ignore the fact that
the main determinant of a
validation study result is which
research group undertakes it.
Know thy tools
Stop treating data miners as black boxes.
Looking inside is (1) fun, (2) easy, (3) needed.
