Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
1
Empirical Evaluations in
Software Engineering Research:
A Personal Perspective
Ahmed E. Hassan
Canada Research Chair in ...
2
Today’s goal:
Give concrete ideas
Inspire change
Kingston
Home of the 1,000 Islands
3
Kingston
Home of the 1,000 Islands
3
4
5
Intelligence throughout
the lifetime of a software system
from inception to production
http://sail.cs.queensu.ca
1851
6
SAIL’s new home:
A 5,000 sq. ft. historical building
1851
6
SAIL’s new home:
A 5,000 sq. ft. historical building
1851
6
SAIL’s new home:
A 5,000 sq. ft. historical building
~15 years of Mining
Software Repositories
7
http://msrconf.org
Early MSR Days
8
9
Early Day Miner
MSR Today: Easy Peasy!
10
11
Data trackGHTorrent
Lots of
data!
Then
12
Now
13
Thanks to a
large scale
community
effort!
14
Yet low practitioner interest!!
15
Even though our friendly
reviewers are pleased!
Why?!?
16
Why?!?
16
We continue to raise
the bar for studies
Empirical evaluations
continues to mature and evolve
17
Empirical evaluations
continues to mature and evolve
17
No validation
Single
System
Multiple
Systems
GitHub
Scale
+ Practi...
18
No validation
Single
System
Multiple
Systems
GitHub
Scale
+ Practitioner
Feedback
18
No validation
Single
System
Multiple
Systems
GitHub
Scale
+ Practitioner
Feedback
Are things this bad?
What can we do t...
We are
killing
research
areas
19
Software
Visualization
Program
Understand.
We are following industry
instead of leading!!
Web Apps
(e.g, AJAX) Data Analytics
Scripting
Infrastructure
Mobile
Apps IoT
We are doing this
to ourselves!!
21
22
22
Practitioner
buy-in
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
23
Practitioner
buy-in
More
studies
(N+1)
Replication
data
23
Practitioner
buy-in
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
Not Software Enginering
24
Not Software Enginering
24
Mobile Apps
Logs
Build Systems
Load Testing
NoSQL Systems
25
25
"I found the overall presentation
disjointed…. This needs to focus more
on the IR issues and less on web
analysis.” SIG...
26
Practitioner
buy-in
Replication
data
Not Soft.
Eng!
26
Practitioner
buy-in
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
27
We can
early-detect
brain cancer!!!
28
How about
skin/lung
cancer?!!!
The “N+1” Critique
encourages
29
Case Study
Fishing
Shallow
Analysis
Generalization of approaches
instead of generalization of results
30
Encourage Meta Analysis
31
Encourage Meta Analysis
31
Encourage Meta Analysis
31
Encourage Meta Analysis
31
N=15..241
Total N=
2,659
Encourage Meta Analysis
32
Encourage Meta Analysis
32
1972
More is less!
33
34
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
34
Practitioner
buy-in
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
35
Smoking
kills!
36
What do
smokers think?!
37
54.9%of the time patients did not
receive the recommended care for their condition

[New England Journal of Medicine 20...
Less than half of
developers have
a CS degree
[stack overflow survey 2016]
38
“A couple billion
lines of code later:
static checking in
the real world”
Dawson Engler
39
“A couple billion
lines of code later:
static checking in
the real world”
Dawson Engler
39
“A couple billion
lines of code later:
static checking in
the real world”
Dawson Engler
39
Do practitioners
agree with
each other?
4141
Practitioners do not agree with
each other
42
Presentation
–---------
–-------
–---------
–-------
–---------
–-------
–--...
Practitioners do not agree with
each other
42
Presentation
–---------
–-------
–---------
–-------
–---------
–-------
–--...
Practitioners do not agree with
each other
42
Presentation
–---------
–-------
–---------
–-------
–---------
–-------
–--...
Practitioners do not agree with
each other
42
Presentation
–---------
–-------
–---------
–-------
–---------
–-------
–--...
Practitioners do not agree with
each other
42
Presentation
–---------
–-------
–---------
–-------
–---------
–-------
–--...
43
43
even within the same system
44
Practitioner
buy-in
More
studies
(N+1)
Not Soft.
Eng!
44
Practitioner
buy-in
More
studies
(N+1)
Replication
data
Not Soft.
Eng!
45
Replication data
is not sufficient!
MSR Today: Easy Peasy!
46
Limited knowledge of machine
learning is a serious risk
47
A Simple Example
A Simple Example
Logistic
Regression
Empirical Hypothesis Testing
(is complex code more risky?)
49
Bugs ~ f(LOC, CC_max)
NOT about predicting bugs
Including correlated metrics
50
Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, P...
Including correlated metrics
51
Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, P...
Including correlated metrics
51
Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, P...
Including correlated metrics
51
Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max)
Model M2: Bugs ~ f(CC_avg, CC_max, P...
Using F-measure in studies
52
Threshold = 0.2 Threshold = 0.5 Threshold = 0.8
Resampling
imbalanced data
53
Resampling
imbalanced data
53
Resampling
imbalanced data
54
Resampling
imbalanced data
54
Variable importance:
Original vs under sampled data
55
Variable importance:
Original vs under sampled data
55
Variable importance:
Original vs under sampled data
55
Not including control metrics
56
Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max...
Not including control metrics
57
Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max...
Not including control metrics
57
Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max)
Model M2: Bugs ~ f(TLOC, CC_max, PAR_max...
58
Unseen
100
out of sampl
bootstrap
10-fold cross
validation
10x10 Cross
Validation
Using 10 fold cross validation
59
Unseen out
b
100
out of sample
bootstrap
10-fold cross
validation
10x10 cross
validation
Using 10 fold cross validation
60
Unseen
25
out of sample
bootstrap
100
out of sample
bootstrap
10-fold cross
validation
10x10 cross
validation
Using 10 ...
61
Unseen
25
out of sample
bootstrap
100
out of sample
bootstrap
10-fold cross
validation
10x10 cross
validation
Using 10 ...
Using default ANOVA
62
Model M1: Bugs ~ f( TLOC, ….. )
Model M2: Bugs ~ f( ……,TLOC )
Using default ANOVA
63
Model M1: Bugs ~ f( TLOC, ….. )
Model M2: Bugs ~ f( ……,TLOC )
Using default ANOVA
63
Model M1: Bugs ~ f( TLOC, ….. )
Model M2: Bugs ~ f( ……,TLOC )
Pitfalls in Analytical Modelling
Pitfalls in Analytical Modelling
Joint work with
Kla Tantihamthavorn
65
Pitfalls in Analytical Modelling
66
Pitfalls in Analytical Modelling
Open Scripts
67
Open Scripts
67
Encourages community
mentorship
Open Scripts
67
Encourages community
mentorship
Works even for
industrial data
Open Scripts
67
Encourages community
mentorship
Works even for
industrial data
Ensures research
transparency
68
68
68
68
68
N
69
Where do
we go from
here?
• Trailblazing research should be encouraged and defended

• Inaccurate analysis is a serious and growing threat

• Deeper...
Think
Different!
Upcoming SlideShare
Loading in …5
×

Empirical Evaluations in Software Engineering Research: A Personal Perspective

Empirical Evaluations in Software Engineering Research: A Personal Perspective

  • Be the first to comment

Empirical Evaluations in Software Engineering Research: A Personal Perspective

  1. 1. 1 Empirical Evaluations in Software Engineering Research: A Personal Perspective Ahmed E. Hassan Canada Research Chair in Software Analytics Blackberry Ultra Large Scale Systems Chair School of Computing, Queen’s University Kingston, Canada
  2. 2. 2 Today’s goal: Give concrete ideas Inspire change
  3. 3. Kingston Home of the 1,000 Islands 3
  4. 4. Kingston Home of the 1,000 Islands 3
  5. 5. 4
  6. 6. 5 Intelligence throughout the lifetime of a software system from inception to production http://sail.cs.queensu.ca
  7. 7. 1851 6 SAIL’s new home: A 5,000 sq. ft. historical building
  8. 8. 1851 6 SAIL’s new home: A 5,000 sq. ft. historical building
  9. 9. 1851 6 SAIL’s new home: A 5,000 sq. ft. historical building
  10. 10. ~15 years of Mining Software Repositories 7 http://msrconf.org
  11. 11. Early MSR Days 8
  12. 12. 9 Early Day Miner
  13. 13. MSR Today: Easy Peasy! 10
  14. 14. 11 Data trackGHTorrent Lots of data!
  15. 15. Then 12 Now
  16. 16. 13 Thanks to a large scale community effort!
  17. 17. 14 Yet low practitioner interest!!
  18. 18. 15 Even though our friendly reviewers are pleased!
  19. 19. Why?!? 16
  20. 20. Why?!? 16 We continue to raise the bar for studies
  21. 21. Empirical evaluations continues to mature and evolve 17
  22. 22. Empirical evaluations continues to mature and evolve 17 No validation Single System Multiple Systems GitHub Scale + Practitioner Feedback
  23. 23. 18 No validation Single System Multiple Systems GitHub Scale + Practitioner Feedback
  24. 24. 18 No validation Single System Multiple Systems GitHub Scale + Practitioner Feedback Are things this bad? What can we do to prevent this?
  25. 25. We are killing research areas 19 Software Visualization Program Understand.
  26. 26. We are following industry instead of leading!! Web Apps (e.g, AJAX) Data Analytics Scripting Infrastructure Mobile Apps IoT
  27. 27. We are doing this to ourselves!! 21
  28. 28. 22
  29. 29. 22 Practitioner buy-in More studies (N+1) Replication data Not Soft. Eng!
  30. 30. 23 Practitioner buy-in More studies (N+1) Replication data
  31. 31. 23 Practitioner buy-in More studies (N+1) Replication data Not Soft. Eng!
  32. 32. Not Software Enginering 24
  33. 33. Not Software Enginering 24 Mobile Apps Logs Build Systems Load Testing NoSQL Systems
  34. 34. 25
  35. 35. 25 "I found the overall presentation disjointed…. This needs to focus more on the IR issues and less on web analysis.” SIGIR Rejection 98
  36. 36. 26 Practitioner buy-in Replication data Not Soft. Eng!
  37. 37. 26 Practitioner buy-in More studies (N+1) Replication data Not Soft. Eng!
  38. 38. 27 We can early-detect brain cancer!!!
  39. 39. 28 How about skin/lung cancer?!!!
  40. 40. The “N+1” Critique encourages 29 Case Study Fishing Shallow Analysis
  41. 41. Generalization of approaches instead of generalization of results 30
  42. 42. Encourage Meta Analysis 31
  43. 43. Encourage Meta Analysis 31
  44. 44. Encourage Meta Analysis 31
  45. 45. Encourage Meta Analysis 31 N=15..241 Total N= 2,659
  46. 46. Encourage Meta Analysis 32
  47. 47. Encourage Meta Analysis 32 1972
  48. 48. More is less! 33
  49. 49. 34 More studies (N+1) Replication data Not Soft. Eng!
  50. 50. 34 Practitioner buy-in More studies (N+1) Replication data Not Soft. Eng!
  51. 51. 35 Smoking kills!
  52. 52. 36 What do smokers think?!
  53. 53. 37 54.9%of the time patients did not receive the recommended care for their condition [New England Journal of Medicine 2003]
  54. 54. Less than half of developers have a CS degree [stack overflow survey 2016] 38
  55. 55. “A couple billion lines of code later: static checking in the real world” Dawson Engler 39
  56. 56. “A couple billion lines of code later: static checking in the real world” Dawson Engler 39
  57. 57. “A couple billion lines of code later: static checking in the real world” Dawson Engler 39
  58. 58. Do practitioners agree with each other?
  59. 59. 4141
  60. 60. Practitioners do not agree with each other 42 Presentation –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Survey (93 stakeholders)
  61. 61. Practitioners do not agree with each other 42 Presentation –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Survey (93 stakeholders) Semi-structured interviews (15 Key engineers) –--------- –------- –--------- –------- –--------- –------- –--------- –------- Interviewee’s list
  62. 62. Practitioners do not agree with each other 42 Presentation –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Survey (93 stakeholders) Semi-structured interviews (15 Key engineers) Validation survey Interviews (25 Senior engineers) –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Interviewee’s list –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Implications
  63. 63. Practitioners do not agree with each other 42 Presentation –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Survey (93 stakeholders) Semi-structured interviews (15 Key engineers) Validation survey Interviews (25 Senior engineers) –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Interviewee’s list –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Implications –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Verified implications
  64. 64. Practitioners do not agree with each other 42 Presentation –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Survey (93 stakeholders) Semi-structured interviews (15 Key engineers) Validation survey Interviews (25 Senior engineers) –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Interviewee’s list –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Implications –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- –--------- –------- Verified implications As low as 68% agreement
  65. 65. 43
  66. 66. 43 even within the same system
  67. 67. 44 Practitioner buy-in More studies (N+1) Not Soft. Eng!
  68. 68. 44 Practitioner buy-in More studies (N+1) Replication data Not Soft. Eng!
  69. 69. 45 Replication data is not sufficient!
  70. 70. MSR Today: Easy Peasy! 46
  71. 71. Limited knowledge of machine learning is a serious risk 47
  72. 72. A Simple Example
  73. 73. A Simple Example Logistic Regression
  74. 74. Empirical Hypothesis Testing (is complex code more risky?) 49 Bugs ~ f(LOC, CC_max) NOT about predicting bugs
  75. 75. Including correlated metrics 50 Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max) Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)
  76. 76. Including correlated metrics 51 Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max) Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)
  77. 77. Including correlated metrics 51 Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max) Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max)
  78. 78. Including correlated metrics 51 Model M1: Bugs ~ f(CC_max, CC_avg, PAR_max, FOUT_max) Model M2: Bugs ~ f(CC_avg, CC_max, PAR_max, FOUT_max) 60% of 2000-2011 studies do not handle correlated metrics [Shihab 2012]
  79. 79. Using F-measure in studies 52 Threshold = 0.2 Threshold = 0.5 Threshold = 0.8
  80. 80. Resampling imbalanced data 53
  81. 81. Resampling imbalanced data 53
  82. 82. Resampling imbalanced data 54
  83. 83. Resampling imbalanced data 54
  84. 84. Variable importance: Original vs under sampled data 55
  85. 85. Variable importance: Original vs under sampled data 55
  86. 86. Variable importance: Original vs under sampled data 55
  87. 87. Not including control metrics 56 Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max) Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)
  88. 88. Not including control metrics 57 Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max) Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)
  89. 89. Not including control metrics 57 Model M1: Bugs ~ f( , CC_max, PAR_max, FOUT_max) Model M2: Bugs ~ f(TLOC, CC_max, PAR_max, FOUT_max)
  90. 90. 58 Unseen 100 out of sampl bootstrap 10-fold cross validation 10x10 Cross Validation Using 10 fold cross validation
  91. 91. 59 Unseen out b 100 out of sample bootstrap 10-fold cross validation 10x10 cross validation Using 10 fold cross validation
  92. 92. 60 Unseen 25 out of sample bootstrap 100 out of sample bootstrap 10-fold cross validation 10x10 cross validation Using 10 fold cross validation
  93. 93. 61 Unseen 25 out of sample bootstrap 100 out of sample bootstrap 10-fold cross validation 10x10 cross validation Using 10 fold cross validation
  94. 94. Using default ANOVA 62 Model M1: Bugs ~ f( TLOC, ….. ) Model M2: Bugs ~ f( ……,TLOC )
  95. 95. Using default ANOVA 63 Model M1: Bugs ~ f( TLOC, ….. ) Model M2: Bugs ~ f( ……,TLOC )
  96. 96. Using default ANOVA 63 Model M1: Bugs ~ f( TLOC, ….. ) Model M2: Bugs ~ f( ……,TLOC )
  97. 97. Pitfalls in Analytical Modelling
  98. 98. Pitfalls in Analytical Modelling Joint work with Kla Tantihamthavorn
  99. 99. 65 Pitfalls in Analytical Modelling
  100. 100. 66 Pitfalls in Analytical Modelling
  101. 101. Open Scripts 67
  102. 102. Open Scripts 67 Encourages community mentorship
  103. 103. Open Scripts 67 Encourages community mentorship Works even for industrial data
  104. 104. Open Scripts 67 Encourages community mentorship Works even for industrial data Ensures research transparency
  105. 105. 68
  106. 106. 68
  107. 107. 68
  108. 108. 68
  109. 109. 68
  110. 110. N 69 Where do we go from here?
  111. 111. • Trailblazing research should be encouraged and defended • Inaccurate analysis is a serious and growing threat • Deeper analysis is more valuable than large scale studies • Generalization is not feasible nor desired by practitioners • Industry is a partner not an idol Parting thoughts 70 http://sail.cs.queensu.ca
  112. 112. Think Different!

×