SlideShare a Scribd company logo
1 of 86
Lessons from
10 Years of Automated Debugging Research
AST 2023 | Keynote
Shin Yoo
Shin Yoo
• Associate Prof@KAIST /
Republic of Korea
• Leads Computational
Intelligence for Software
Engineering Group
(https://coinse.github.io)
• Search Based Software
Engineering, Automated
Debugging, Software Testing…
Who am I?
2
When I say “lessons”, I mean that I learnt these things. I
do not mean to claim that these are some golden rules in
research or anything else. Treat this talk just as a story of
how certain techniques came to be… 🤠
3
Retrospective on My FL Work
4
2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
GP + FL Theory
SF1
B
⊂ SF2
B
HUMIES!
DP + FL
Entropy MBFL
MBFL 2.0 Industry
Diagnosability
Applications
Retrospective on My FL Work
5
GP + FL Theory
SF1
B
⊂ SF2
B
HUMIES! DP + FL
Entropy
MBFL MBFL 2.0 Industry
Diagnosability Applications
Everything started with people.
6
Gabin An
(PhD Candidate)
Dr. Jeongju Sohn
Juyeon Yoon
(PhD Candidate)
Dr. Jinhan Kim
Sungmin Kang
(PhD Candidate)
Dr. Bill Langdon
Prof. Mark Harman
Prof. Robert Feldt Dr. David Clark
Prof. Moonzoo Kim
Prof. Yunho Kim
Dongwon Hwang
SAP Labs Korea
Prof. T. Y. Chen
Prof. Xiaoyuan Xie
Jingun Hong
SAP Labs Korea
❤
• Bill joins CREST (Centre for
Research on Evolution, Search,
& Testing) at King’s College
London.
• He immediately gets on with
GPGPU for genetic
programming, Higher-order
Mutation, and what will later
become Genetic Improvement.
Year 2008
7
Dr. Bill Langdon
• Bill joins CREST (Centre for
Research on Evolution, Search,
& Testing) at King’s College
London.
• He immediately gets on with
GPGPU for genetic
programming, Higher-order
Mutation, and what will later
become Genetic Improvement.
Year 2008
7
Dr. Bill Langdon Prof. Mark Harman
Me circa 2010
‘That stuff looks incredibly cool - what can
I do with GP and GPGPU…?’
Evolving SBFL Formulas using GP
Yoo, SSBSE 2012
9
ef
ep
ep + np + 1
Risk Evaluation Formula
Ranking
Program
Tests
Spectrum
Evolving SBFL Formulas using GP
Yoo, SSBSE 2012
9
ef
ep
ep + np + 1
Risk Evaluation Formula
Ranking
Program
Tests
Spectrum
P S
P S
Training Data
Evolving SBFL Formulas using GP
Yoo, SSBSE 2012
9
Ranking
Program
Tests
Spectrum
P S
P S
GP
Training Data
Evolving SBFL Formulas using GP
Yoo, SSBSE 2012
9
Program
Tests
Spectrum
P S
P S
GP
Fitness
(minimise)
Training Data
Evolving SBFL Formulas using GP
Yoo, SSBSE 2012
9
Program
Tests
Spectrum
P S
P S
GP
Fitness
(minimise)
e2
f (2ep + 2ef + 3np)
e2
f (e2
f +
p
np)
. . .
Training Data
Evolved Formulas
Little did I know that these IDs would be their names…
10
ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were
moved.
ID Refined Formula ID Refined Formula
GP01 ef (np + ef (1 +
p
ef )) GP16
q
e
3
2
f + np
GP02 2(ef +
p
np) +
p
ep GP17
2ef +nf
ef np
+
np
p
ef
ef e2
f
GP03
q
|e2
f
p
ep| GP18 e3
f + 2np
GP04
q
|
np
ep np
ef | GP19 ef
p
|ep ef + nf np|
GP05
(ef +np)
p
ef
(ef +ep)(npnf +
p
ep)(ep+np)
p
|ep np|
GP20 2(ef +
np
ep+np
)
GP06 ef np GP21
p
ef +
p
ef + np
GP07 2ef (1 + ef + 1
2np
) + (1 +
p
2)
p
np GP22 e2
f + ef +
p
np
GP08 e2
f (2ep + 2ef + 3np) GP23
p
ef (e2
f +
np
ef
+
p
np + nf + np)
GP09
ef
p
np
np+np
+ np + ef + e3
f GP24 ef +
p
np
GP10
q
|ef
1
np
| GP25 e2
f +
p
np +
p
ef
p
|ep np|
+
np
(ef np)
GP11 e2
f (e2
f +
p
np) GP26 2e2
f +
p
np
GP12
p
ep + ef + np
p
ep GP27
np
p
(npnf ef )
ef +npnf
GP13 ef (1 + 1
2ep+ef
) GP28 ef (ef +
p
np + 1)
GP14 ef +
p
np GP29 ef (2e2
f +ef +np)+
(ef np)
p
npef
ep np
GP15 ef +
p
nf +
p
np GP30
q
|ef
nf np
ef +nf
|
existing metrics have been designed by human; this paper present the first
Evolved Formulas
Little did I know that these IDs would be their names…
10
ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were
moved.
ID Refined Formula ID Refined Formula
GP01 ef (np + ef (1 +
p
ef )) GP16
q
e
3
2
f + np
GP02 2(ef +
p
np) +
p
ep GP17
2ef +nf
ef np
+
np
p
ef
ef e2
f
GP03
q
|e2
f
p
ep| GP18 e3
f + 2np
GP04
q
|
np
ep np
ef | GP19 ef
p
|ep ef + nf np|
GP05
(ef +np)
p
ef
(ef +ep)(npnf +
p
ep)(ep+np)
p
|ep np|
GP20 2(ef +
np
ep+np
)
GP06 ef np GP21
p
ef +
p
ef + np
GP07 2ef (1 + ef + 1
2np
) + (1 +
p
2)
p
np GP22 e2
f + ef +
p
np
GP08 e2
f (2ep + 2ef + 3np) GP23
p
ef (e2
f +
np
ef
+
p
np + nf + np)
GP09
ef
p
np
np+np
+ np + ef + e3
f GP24 ef +
p
np
GP10
q
|ef
1
np
| GP25 e2
f +
p
np +
p
ef
p
|ep np|
+
np
(ef np)
GP11 e2
f (e2
f +
p
np) GP26 2e2
f +
p
np
GP12
p
ep + ef + np
p
ep GP27
np
p
(npnf ef )
ef +npnf
GP13 ef (1 + 1
2ep+ef
) GP28 ef (ef +
p
np + 1)
GP14 ef +
p
np GP29 ef (2e2
f +ef +np)+
(ef np)
p
npef
ep np
GP15 ef +
p
nf +
p
np GP30
q
|ef
nf np
ef +nf
|
existing metrics have been designed by human; this paper present the first
Evolved Formulas
Little did I know that these IDs would be their names…
10
ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were
moved.
ID Refined Formula ID Refined Formula
GP01 ef (np + ef (1 +
p
ef )) GP16
q
e
3
2
f + np
GP02 2(ef +
p
np) +
p
ep GP17
2ef +nf
ef np
+
np
p
ef
ef e2
f
GP03
q
|e2
f
p
ep| GP18 e3
f + 2np
GP04
q
|
np
ep np
ef | GP19 ef
p
|ep ef + nf np|
GP05
(ef +np)
p
ef
(ef +ep)(npnf +
p
ep)(ep+np)
p
|ep np|
GP20 2(ef +
np
ep+np
)
GP06 ef np GP21
p
ef +
p
ef + np
GP07 2ef (1 + ef + 1
2np
) + (1 +
p
2)
p
np GP22 e2
f + ef +
p
np
GP08 e2
f (2ep + 2ef + 3np) GP23
p
ef (e2
f +
np
ef
+
p
np + nf + np)
GP09
ef
p
np
np+np
+ np + ef + e3
f GP24 ef +
p
np
GP10
q
|ef
1
np
| GP25 e2
f +
p
np +
p
ef
p
|ep np|
+
np
(ef np)
GP11 e2
f (e2
f +
p
np) GP26 2e2
f +
p
np
GP12
p
ep + ef + np
p
ep GP27
np
p
(npnf ef )
ef +npnf
GP13 ef (1 + 1
2ep+ef
) GP28 ef (ef +
p
np + 1)
GP14 ef +
p
np GP29 ef (2e2
f +ef +np)+
(ef np)
p
npef
ep np
GP15 ef +
p
nf +
p
np GP30
q
|ef
nf np
ef +nf
|
existing metrics have been designed by human; this paper present the first
An Offer that I could not turn down…
(despite it being full of maths)
11
Prof. T. Y. Chen
Prof. Xiaoyuan Xie
SR2
B
SR1
B
SR1
F
SR1
A
Formula R1
SR2
F
SR2
A
Formula R2
Statement Ranking
SR
B = {si 2 S|R(si) > R(sf ), 1  i  n}
SR
F = {si 2 S|R(si) = R(sf ), 1  i  n}
SR
A = {si 2 S|R(si) < R(sf ), 1  i  n}
Xiaoyuan and T. Y. essentially formulated the
comparison between SBFL formulas as algebraic
proofs about set membership.
A Theoretical Analysis of the Risk Evaluation Formulas for Spectrum-
Based Fault Localisation, X. Xie, T.Y. Chen, F.C. Kuo, B. Xu,
ACM Transactions on Software Engineering and Methodology 22(4):1-40
An Offer that I could not turn down…
(despite it being full of maths)
12
Prof. T. Y. Chen
Prof. Xiaoyuan Xie
SR2
B
SR1
B
SR1
F
SR1
A
Formula R1
SR2
F
SR2
A
Formula R2
Statement Ranking
• Formula dominates formula ( )
if
• Formula is equivalent to formula ( )
if
• “Do you want to see where the evolved formulas
are ranked in the hierarchy?” (i.e., the o
ff
er)
R1 R2 R1 → R2
SR1
B
⊆ SR2
B
∧ SR2
A
⊆ SR1
A
R1 R2 R1 ↔ R2
SR1
B
= SR2
B
∧ SR1
F
= SR2
F
∧ SR1
A
= SR2
A
31:22 X. Xie et al.
Fig. 4. Another performance hierarchy chain of risk evaluation formulas.
LEMMA 4.12. For M2, we have
SM2
B =
!
si
"
"
"
"ai
ef > 0 and
P + a
f
ep
F
−
2F + P
ai
ef
+ 2 −
ai
ep
ai
ef
> 0, 1 ≤ i ≤ n
#
(4.18)
SM2
A =
!
si
"
"
"
"
$
ai
ef = 0
%
or
&
ai
ef > 0 and
P + a
f
ep
F
−
2F + P
ai
ef
+ 2 −
ai
ep
ai
ef
< 0
'
, 1 ≤ i ≤ n
#
.
(4.19)
LEMMA 4.13. For AMPLE2, we have
SA
B =
!
si
"
"
"
"ai
ef > 0 and
Pai
ef − PF + Fa
f
ep
Fai
ef
−
ai
ep
ai
ef
> 0, 1 ≤ i ≤ n
#
(4.20)
! "
"$ %
&
Pai
ef − PF + Fa
f
ep ai
ep
' #
31:16 X. Xie
Fig. 3. Performance hierarchy chains of risk evaluation formulas.
For Tarantula, si with ai
ef = 0 will be assigned with the lowest risk value of 0
hence is ranked lower than any sj with a
j
ef > 0. Consider a test suite with P =
F = 2. For re, if there exists si such that ai
ef = 0, we have RRE(si) = − P
F
− 1 =
Suppose there exists sj such that a
j
ef = 1 and a
j
ep = 5. Then, we have RRE(sj) = −
−5 < −4 = RRE(si), while for Tarantula, we have RT (sj) = 3
8
> 0 = RT (si). Ther
re does not produce the same ranking list as Tarantula, and hence re and Tara
are not equivalent with respect to Naish et al.’s equivalence.
4.3.3. Relations between Nonequivalent Formulas. With the aforeside six groups of e
alent formulas, we only need to search the maximal formulas from the followi
individual formulas or groups of equivalent formulas, namely, ER1, ER2, ER3,
Proof that GP was as good as humans
Xie et al., SSBSE 2013
13
Table 1. Investigated formulæ
Name Formula expression
ER1’
Naish1
⇢
1 if ef <F
P ep if ef =F
Naish2 ef
ep
ep+np+1
GP13 ef (1 + 1
2ep+ef
)
ER5
Wong1 ef
Russel & Rao
ef
ef +nf +ep+np
Binary
⇢
0 if ef <F
1 if ef =F
GP02 2(ef +
p
np) +
p
ep
GP03
q
|e2
f
p
ep|
GP19 ef
p
|ep ef + nf np|
Proof that GP was as good as humans
Xie et al., SSBSE 2013
13
Table 1. Investigated formulæ
Name Formula expression
ER1’
Naish1
⇢
1 if ef <F
P ep if ef =F
Naish2 ef
ep
ep+np+1
GP13 ef (1 + 1
2ep+ef
)
ER5
Wong1 ef
Russel & Rao
ef
ef +nf +ep+np
Binary
⇢
0 if ef <F
1 if ef =F
GP02 2(ef +
p
np) +
p
ep
GP03
q
|e2
f
p
ep|
GP19 ef
p
|ep ef + nf np|
has been permanently
expanded to because
joined the club.
ER1
ER1′

GP13
Proof that GP was as good as humans
Xie et al., SSBSE 2013
13
Table 1. Investigated formulæ
Name Formula expression
ER1’
Naish1
⇢
1 if ef <F
P ep if ef =F
Naish2 ef
ep
ep+np+1
GP13 ef (1 + 1
2ep+ef
)
ER5
Wong1 ef
Russel & Rao
ef
ef +nf +ep+np
Binary
⇢
0 if ef <F
1 if ef =F
GP02 2(ef +
p
np) +
p
ep
GP03
q
|e2
f
p
ep|
GP19 ef
p
|ep ef + nf np|
has been permanently
expanded to because
joined the club.
ER1
ER1′

GP13
Three GP formulas formed their
own maximal groups.
14
14
(Scene: T.Y., Mark, and Shin having lunch in the
basement canteen of LSHTM, one day in 2014)
14
(Scene: T.Y., Mark, and Shin having lunch in the
basement canteen of LSHTM, one day in 2014)
M: So, what
next?
14
(Scene: T.Y., Mark, and Shin having lunch in the
basement canteen of LSHTM, one day in 2014)
M: So, what
next?
TY: Perhaps we can use search to
fi
nd the
greatest formula…?
14
(Scene: T.Y., Mark, and Shin having lunch in the
basement canteen of LSHTM, one day in 2014)
M: So, what
next?
TY: Perhaps we can use search to
fi
nd the
greatest formula…?
S: Hmmmmm….
No Greatest SBFL Formula!
Yoo + Xie et al., TOSEM 2017
15
Soon after that lunch, I ended up with a
proof that such formula cannot exist.
Sent it to Xiaoyuan and TY for rigorous
check, kept my
fi
ngers crossed.
And then it was con
fi
rmed!
16
ef
np
Tarantula
ef
np
Ochiai
ef
np
gp13
Important Caveat: Consistent Tie-breaker
17
SR2
B
SR1
B
SR1
F
SR1
A
Formula R1
SR2
F
SR2
A
Formula R2
Statement Ranking
A choice of tie-breaker, combined with
speci
fi
c spectrum, can result in empirical
results that contradict theoretical
hierarchy.
Isn’t FL just DP but happening later?
Sohn & Yoo, ISSTA 2017
18
Dr. Jeongju Sohn
Isn’t FL just DP but happening later?
Sohn & Yoo, ISSTA 2017
18
Dr. Jeongju Sohn
Still using GP but this time to aggregate multiple scores,
inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016)
Isn’t FL just DP but happening later?
Sohn & Yoo, ISSTA 2017
18
Dr. Jeongju Sohn
Still using GP but this time to aggregate multiple scores,
inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016)
Multiple SBFL formulas are terminal nodes
Isn’t FL just DP but happening later?
Sohn & Yoo, ISSTA 2017
18
Dr. Jeongju Sohn
Still using GP but this time to aggregate multiple scores,
inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016)
Multiple SBFL formulas are terminal nodes
Traditional Defect Prediction (DP) features added:
age, churn, and complexity
“Get out of your (beep) comfort zone? It (beep) took me 20 years to
build a comfort zone!!”
- Noel Gallagher, Clash Magazine, 2011
Lessons from my SBFL story
19
Lessons from my SBFL story
19
Sometimes imitating the cool
stu
ff
(people) works out :)
Lessons from my SBFL story
19
Sometimes imitating the cool
stu
ff
(people) works out :)
Do mental exercise; test your
comfort zone.
Mutating buggy programs can help FL
Moon et al., ICST 2014
20
Mutating buggy programs can help FL
Moon et al., ICST 2014
20
Correct Code
Mutate
(mostly)
Mutating buggy programs can help FL
Moon et al., ICST 2014
20
Correct Code
Mutate
(mostly)
Buggy Code
Mutate
(sometimes!)
Mutating buggy programs can help FL
Moon et al., ICST 2014
20
Prof. Moonzoo Kim Prof. Yunho Kim
Correct Code
Mutate
(mostly)
Buggy Code
Mutate
(sometimes!)
Mutating buggy programs can help FL
Moon et al., ICST 2014
20
Prof. Moonzoo Kim Prof. Yunho Kim
Correct Code
Mutate
(mostly)
Buggy Code
Mutate
(sometimes!)
“But what did YOU do?”
Ranking is only for humans
Moon et al., ICST 2014
• Humans consume FL results as ranking: order matters.
• Machines (APR) consume FL results as weights: magnitude matters.
• I suggested Kullback-Leibler divergence as an FL evaluation metric… didn’t
catch on :)
• Qi et al., ISSTA 2013 is an independent con
fi
rmation.
21
Op2 (LIL=7.34)
Executed Statements
0.0
0.4
0.8
Faulty Statement
Suspiciousness
Ochiai (LIL=5.96)
Executed Statements
0.0
0.4
0.8
Faulty Statement
Suspiciousness
Jaccard (LIL=4.92)
Executed Statements
0.0
0.4
0.8
Faulty Statement
Suspiciousness
MUSE (LIL=0.40)
Executed Statements
0.0
0.4
0.8
Faulty Statement
Suspiciousness
Figure 4: Comparison of distributions of normalized suspiciousness score across executed statements of space v21
Table V: Expense, LIL, and NCP scores on look utx 4.3
FL % of executed Locality Average NCP
Technique stmts examined Information Loss over 100 runs
MUSE 11.25 3.52 25.3
Op2 42.50 3.77 31.0
Ochiai 42.50 3.83 32.2
Jaccard 42.50 3.89 35.5
among the top 14.62% on average, and among the top 2%
for seven out of 14 faulty versions we studied. Intuitively,
A small LIL score of a localization technique indicates
that the technique can be used to perform more efficient
automated program repair. In contrast, the Expense metric
values did not provide any information for the three SBFL
techniques. We plan to perform a further empirical study to
support the claim.
VII. RELATED WORK
A moment of reckoning years later…
22
MUSE
Uni
fi
ed Debugging, Lou et al., ISSTA 2020
I had a “doh!” moment when I read this paper - although they frame it as APR helping FL,
to me it was incremental mutation that made MUSE more lightweight. A really clever idea.
Can we pay first? 💸
Kim et al., ISSRE 2021
23
Dr. Jinhan Kim
Prof. Robert Feldt
Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same
location.
Can we pay first? 💸
Kim et al., ISSRE 2021
23
Dr. Jinhan Kim
Prof. Robert Feldt
Mutation
1. Do mutation analysis
1. Do mutation analysis
Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same
location.
Can we pay first? 💸
Kim et al., ISSRE 2021
23
Dr. Jinhan Kim
Prof. Robert Feldt
Mutation
1. Do mutation analysis
1. Do mutation analysis
Training
ML
2. Learn the reverse relation.
Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same
location.
Can we pay first? 💸
Kim et al., ISSRE 2021
23
Dr. Jinhan Kim
Prof. Robert Feldt
Mutation
1. Do mutation analysis
1. Do mutation analysis
Training
ML
2. Learn the reverse relation.
Real Fault
ML
3. Pretend if real bug is a mutant
Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same
location.
What if a new test fails?
Kim et al., TOSEM 2022
24
Linear
Softmax
Test Method
Name
Comparison
Word Embedding
Killed or Not Killed
Source Method
Name
Bidirectional GRU
Comparison
Word Embedding
Bidirectional GRU
Before After
Mutated
Line
Mutation
Operator
Concatenate
Linear
Linear
Concatenate
Learn the relationship between killed mutants and
killing tests via the natural language channel
We can predict whether which new test cases will
kill which new mutants pretty well.
Dr. Jinhan Kim
“I concurrency joke have one! a”
https://twitter.com/francesc/status/1286731919376265216
Lessons from my MBFL story
25
Lessons from my MBFL story
25
Lessons from my MBFL story
25
Ideas may come too late, but
they do come, and it is okay ;)
Your weakness may become
your strength: often it is already
a good problem to solve.
Information Theory can help FL
Yoo, Clark, and Harman, TOSEM 2013
26
Dr. David Clark Prof. Mark Harman
Certain test cases provide more “information” to FL than others.
Can we prioritize them?
Equally Suspicious Uniquely Suspicious
Test Execution
FL is entropy reduction! That way, we can quantify the contribution
from individual test cases.
19:16
grep, v3, F_KP_3
0.5
1.0
Suspiciousness
FLINT
TCP
Random
0 20 40 60 80 100
Percentage of Executed Tests
−20
−10
0
10
Expense
Reduction
Exp. Reduction FLINT
Exp. Reduction Greedy
(a)
flex, v1, F_JR_2
Quantitative Diagnosability of Tests
An & Yoo, ISSTA 2022
27
Faulty
Program Test Generation
Fault Localisation
Test
Suite
test
Test
Regression
Tests
Test Prioritisation
Test
Human Labelling
Test
Iterative
Process
Unlabelled
Labelled
If we have to augment the test suite for FL as we go
along, we want to involve the human oracle for the
smallest number of time. For this, we want to choose
the test case with the most information about fault
diagnosis as the next test.
Gabin An
(PhD Candidate)
Quantitative Diagnosability of Tests
An & Yoo, ISSTA 2022
27
Faulty
Program Test Generation
Fault Localisation
Test
Suite
test
Test
Regression
Tests
Test Prioritisation
Test
Human Labelling
Test
Iterative
Process
Unlabelled
Labelled
If we have to augment the test suite for FL as we go
along, we want to involve the human oracle for the
smallest number of time. For this, we want to choose
the test case with the most information about fault
diagnosis as the next test.
Gabin An
(PhD Candidate)
Taking FL to Industry
Our partner in crime…
28
Jingun Hong
SAP Labs Korea
Dongwon Hwang
SAP Labs Korea
SAP HANA: In-memory RDBMS, >11MLoc, >50K
fi
les in C/C++
Developers Commits
Push
Taking FL to Industry
Our partner in crime…
28
Jingun Hong
SAP Labs Korea
Dongwon Hwang
SAP Labs Korea
SAP HANA: In-memory RDBMS, >11MLoc, >50K
fi
les in C/C++
Developers Commits
Push
Post-submit Testing
Build & Test
> 3,000 test cases per run
Failed!
Taking FL to Industry
Our partner in crime…
28
Jingun Hong
SAP Labs Korea
Dongwon Hwang
SAP Labs Korea
SAP HANA: In-memory RDBMS, >11MLoc, >50K
fi
les in C/C++
Developers Commits
Push
All
reattempts
failed?
Test breakages
Flaky Tests
Rerun failing
tests three times
Yes
No
Identifying Test Breakages
~30 test breakages per run
Post-submit Testing
Build & Test
> 3,000 test cases per run
Failed!
Taking FL to Industry
Our partner in crime…
28
Jingun Hong
SAP Labs Korea
Dongwon Hwang
SAP Labs Korea
SAP HANA: In-memory RDBMS, >11MLoc, >50K
fi
les in C/C++
Developers Commits
Push
All
reattempts
failed?
Test breakages
Flaky Tests
Rerun failing
tests three times
Yes
No
Identifying Test Breakages
~30 test breakages per run
Post-submit Testing
Build & Test
> 3,000 test cases per run
Failed!
Breakage 1
Breakage 2
Breakage 3
.
.
.
Bug Ticket 1
Bug Ticket 2
Bug Ticket 3
Taking FL to Industry
Our partner in crime…
28
Jingun Hong
SAP Labs Korea
Dongwon Hwang
SAP Labs Korea
SAP HANA: In-memory RDBMS, >11MLoc, >50K
fi
les in C/C++
Developers Commits
Push
All
reattempts
failed?
Test breakages
Flaky Tests
Rerun failing
tests three times
Yes
No
Identifying Test Breakages
~30 test breakages per run
Post-submit Testing
Build & Test
> 3,000 test cases per run
Failed!
Breakage 1
Breakage 2
Breakage 3
.
.
.
Bug Ticket 1
Bug Ticket 2
Bug Ticket 3
Coverage 🤠
- Weekly Collection
- Overhead is ~3.0x
Triaging with FL
Sohn et al., ICST 2021 Industry
29
Dongwon Hwang
SAP Labs Korea
Gabin An
(PhD Candidate)
Dr. Jeongju Sohn
FL applied for bug triage: it can complement
the static test-component relationship.
As an automated technique, this can aid the
decision maker so that issues are assigned
faster.
performance when all files participate, which is contradictory
to our hypothesis that eliminating less suspicious files from
voting would improve the localisation results by reducing
noises.
We may find a potential explanation for this result from
the initial experimental results in Section III-C. In Fig. 3,
the median ranking of faulty files is 3, 015 with the min tie-
breaker, which means that the ground-truth faulty files might
not be located near the top. Thus, eliminating the low-rank
files may result in excluding the actual faulty files from the
voting. As a result, in SAP HANA2, utilising every file in
voting is more effective than eliminating the low-rank files in
terms of breaking the ties and localising faulty components.
Fig. 10: Comparison between the component mapping and
SBFL with voting VD within the top 10: VD located 9 more
faults compared to the baseline. Among 57 faults localised by
VD, 36 of them are newly localised.
4) Ho
formance
we exten
vote for
utilising
isation p
Table V
the row
test-com
mapping
(↵ = 1)
However
the one
the test-c
increases
only by
decrease
we conc
localise n
TABLE
VD,VDN
Voting
V
V
Jingun Hong
SAP Labs Korea
Similar Root Causes
Sohn et al., ICSE 2022 SEIP
30
Dongwon Hwang
SAP Labs Korea
Gabin An
(PhD Candidate)
Juyeon Yoon
(PhD Candidate)
Dr. Jeongju Sohn
Jingun Hong
SAP Labs Korea
FL in existence of multiple faults: we need to
isolate individual faults.
Given a pair of failures, classify whether they
share the same root cause?
Static
Historical
Dynamic
Automatically Identifying Shared Root Causes of Test Breakages in SAP HANA ICSE-SEIP ’22, May 21–29, 2022, Pi�sburgh, PA, U
Figure 6: F1 scores of the models using di�erent feature extraction strategies against the test data. Note that the ~-axis rang
from 0.40 to 0.85. The red and blue horizontal lines represent the F1 scores for the relevance classi�cations when using th
Static-Ronin-TFIDF-Cosine and Coverage-nlink similarities, respectively, with learnt thresholds. Each error bar indicates th
95% con�dence interval.
Precision 0.88, recall 0.64
Finding the Origins of Bugs
An et al., ICSE 2023 (WED 14:30)
• We can project the FL results back to the commits to
fi
nd BICs
• Filtering out stylistic changes
• Applying time-dependent decay
• Keep the information source minimal!
31
Gabin An
(PhD Candidate)
Jingun Hong
SAP Labs Korea
TABLE I
EXAMPLE OF THE VOTING POWER OF CODE ELEMENTS
Code Element e1 e2 e3 e4 e5
Score 1.0 0.6 0.6 0.6 0.3
rankmax 1 4 4 4 5
rankdense 1 2 2 2 3
vote
↵ = 0, ⌧ = max 1.00 0.25 0.25 0.25 0.20
↵ = 1, ⌧ = max 1.00 0.15 0.15 0.15 0.06
↵ = 0, ⌧ = dense 1.00 0.50 0.50 0.50 0.33
↵ = 1, ⌧ = dense 1.00 0.30 0.30 0.30 0.10
in CBIC in the order of their likelihood of being the BIC. As
mentioned earlier, our basic intuition is that if a commit had
created, or modified, more suspicious code elements for the
observed failures, it is more likely to be a BIC.
The suspiciousness of code elements can be quantified via
an FL technique. For example, we can apply SBFL [11] using
the coverage of the test suite T: note that SBFL uses only test
coverage and result information, both of which are available
at the time of observing a test failure. Assuming that we are
given the suspiciousness scores, let susp: Esusp ! [0, 1)
be the mapping function from each suspicious code element
in Esusp to its non-negative FL score.2
To convert the code-
Fig. 3. Example of computing the commit scores when = 0.1
power regardless of hyperparameters, i.e., vote(e) > vote(e0
)
if and only if susp(e) > susp(e0
).
Depth-based Decay: Wen et al. [5] showed that using the
information about commit time can boost the accuracy of
the BIC identification. Similarly, Wu et al. [2] observed that
the commit time of crash-inducing changes is closer to the
reporting time of the crashes. Based on those findings, we
hypothesise that older commits are less likely to be responsible
for the currently observed failure, because if an older commit
Fig. 6. MRR for each hyperparameter configuration of FONTE
TABLE IV
COMPARISON OF FONTE (↵ = 0, ⌧ = max, = 0.1) TO OTHER RANKING
TECHNIQUES (# TOTAL SUBJECTS = 130)
MRR
Accuracy
@1 @2 @3 @5 @10
FONTE 0.528
47 66 85 98 110
(36%) (51%) (65%) (75%) (85%)
Stage 1&2 of FONTE + Other Techniques (on CBIC )
Bug2Commit 0.380 27 42 64 85 96
FBL-BERT 0.338 27 40 47 69 90
Max Aggr. (Eq. 9) 0.317 17 36 50 73 97
Random 0.218 3 19 41 65 75
Fig.
using
comm
Fig.
BIC
redu
B. R
W
on
perf
Fig. 6. MRR for each hyperparameter configuration of FONTE
TABLE IV
COMPARISON OF FONTE (↵ = 0, ⌧ = max, = 0.1) TO OTHER RANKING
TECHNIQUES (# TOTAL SUBJECTS = 130)
MRR
Accuracy
@1 @2 @3 @5 @10
FONTE 0.528
47 66 85 98 110
(36%) (51%) (65%) (75%) (85%)
Stage 1&2 of FONTE + Other Techniques (on CBIC )
Bug2Commit 0.380 27 42 64 85 96
FBL-BERT 0.338 27 40 47 69 90
Max Aggr. (Eq. 9) 0.317 17 36 50 73 97
Random 0.218 3 19 41 65 75
Lower Bound 0.145 3 8 19 41 71
Other Techniques (on C)
Bug2Commit 0.155 11 18 22 25 39
FBL-BERT 0.037 1 3 5 7 10
Ablation Study for FONTE
Fig.
using
comm
Fig.
BIC
redu
B. R
W
on
perf
bar
for
bise
the
wei
Lessons from Industry Collaborations
32
Simplicity
Your technique may help unexpected
tasks, but as long as they are helping,
nothing else matters! A small delta
may still be appreciated.
Keeping the information source of
your technique to the minimum is very
important.
Lessons from Industry Collaborations
32
Simplicity
Your technique may help unexpected
tasks, but as long as they are helping,
nothing else matters! A small delta
may still be appreciated.
Keeping the information source of
your technique to the minimum is very
important.
Prof. Robert Feldt
What lies ahead?
Harnessing the power of AI/ML
34
• Well, we interact with a single pre-
trained model, at least…?
• CoT, ReAct…?
• Someone else maintains them…?
Harnessing the power of AI/ML
• Keeping the information source
minimal.
• Explainability.
• Maintenance of trained models.
34
• Well, we interact with a single pre-
trained model, at least…?
• CoT, ReAct…?
• Someone else maintains them…?
Reproducing the Bug to begin with…
Kang, Yoon & Yoo, ICSE 2023 (FRI 14:00)
35
Juyeon Yoon
(PhD Candidate)
Sungmin Kang
(PhD Candidate)
Tn
T3
T3
Example
Test
Bug Report
T1
T2
T3
…
Tn
Test
File
1
Test
File
2
T1
Tn
T2
(A) Prompt
Engineering
(B) LLM
Querying
(C) Post-
processing
(D) Selection
& Ranking
Example
Report
Prompt
LLM Testing 1
2
Test
Clusters
3
1) RQ3-1: We explore the performance of LIBRO when
operating on the GHRB dataset of recent bug reports. We find
that of the 31 bug reports we study, LIBRO can automatically
generate bug reproducing tests for 10 bugs based on 50 trials,
for a success rate of 32.2%. This success rate is similar to
the results from Defects4J presented in RQ1-1, suggesting
LIBRO generalizes to new bug reports. A breakdown of results
by project is provided in Table VII. Bugs are successfully
reproduced in AssertJ, Jsoup, Gson, and sslcontext, while they
were not reproduced in the other two. We could not reproduce
bugs from the Checkstyle project, despite it having a large
number of bugs; upon inspection, we find that this is because
the project’s tests rely heavily on external files, which LIBRO
has no access to, as shown in Section VI-C3. LIBRO also
does not generate BRTs for the Jackson project, but the small
number of bugs in the Jackson project make it difficult to draw
conclusions from it.
Answer to RQ3-1: LIBRO is capable of generating bug
reproducing tests even for recent data, suggesting it is not
simply remembering what it trained with.
2) RQ3-2: LIBRO uses several predictive factors correlated
with successful bug reproduction for selecting bugs and rank-
ing tests. In this research question, we check whether the
identified patterns based on the Defects4J dataset continue to
hold in the recent GHRB dataset.
the features used for our ranking strategy continue to be good
indicators of successful bug reproduction.
Answer to RQ3-2: We find that the factors used for the
ranking and selection of LIBRO consistently predict bug
reproduction in real-world data.
3) RQ3-3: We present case studies of attempts by LIBRO
to reproduce bugs that either succeeded or failed.
TABLE VIII: Bug Report Successfully Reproduced: URLs
are omitted for brevity (AssertJ-Core Issue #2666)
Title assertContainsIgnoringCase fails to compare i and I in
tr_TR locale
See org.assertj.core.internal.Strings#assertContainsIgnoringCase
[url]
I would suggest adding [url] verification to just ban
toLowerCase(), toUpperCase() and other unsafe methods: #2664
We first present a successfully reproduced bug report, issue
#2685 for the AssertJ-Core project, in Table VIII. The bug
is that, in tr_TR locale, assertContainsIgnoringCase method is
mishandling the letter I. For this bug report, LIBRO gener-
ates the bug reproducing test in Listing 4. Despite the bug
report not containing any executable code, LIBRO successfully
generated a bug reproducing test that compares I and i,
using the containsIgnoringCase method referenced within the
report. A BRT for this bug is ranked at the second place,
suggesting a developer would have been quickly provided with
a reproducing test.
Listing 5: Generated FIB test for Checkstyle-11365.
1 public void testFinalClass() throws Exception {
2 final DefaultConfiguration checkConfig =
3 createModuleConfig(FinalClassCheck.class);
public void testIssue952(){
Locale locale = new Locale("tr", "TR");
Locale.setDefault(locale);
assertThat("I").as("Checking in tr_TR locale")
.containsIgnoringCase("i");
}
PTLMs are interesting because…
36
Great
Autocompletion
+
Dynamic
Postprocessing
+
Emergent
Behaviour
+
NL/AL
Duality
+
Use of
Tools
+…
Di
ff
erential Testing
Metamorphic Testing
Test Input Generation
…
CoT
Wei et al., 2022
PAL
Gao et al., 2023
Dual Channel
Casalnuovo et al., 2020
ReAct
Yao et al., 2022
AutoGPT
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
ef
ef + nf + ep
ef
ef + 2(nf + ep)
2ef
2ef + nf + ep
2ef
ef + nf + ep
ef
nf + ep
1
2
(
ef
ef + nf
+
ef
ef + ep
)
ef
ef + nf + ep + np
ef + np nf ep
ef + nf + ep + np
ef + np
ef + nf + ep + np
ef
ef + np + 2(ep + nf )
ef + np
nf + ep
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Then developed as an automated tool for humans
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
ef
ef + nf + ep
ef
ef + 2(nf + ep)
2ef
2ef + nf + ep
2ef
ef + nf + ep
ef
nf + ep
1
2
(
ef
ef + nf
+
ef
ef + ep
)
ef
ef + nf + ep + np
ef + np nf ep
ef + nf + ep + np
ef + np
ef + nf + ep + np
ef
ef + np + 2(ep + nf )
ef + np
nf + ep
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Then developed as an automated tool for humans Then criticized for not helping humans
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
ef
ef + nf + ep
ef
ef + 2(nf + ep)
2ef
2ef + nf + ep
2ef
ef + nf + ep
ef
nf + ep
1
2
(
ef
ef + nf
+
ef
ef + ep
)
ef
ef + nf + ep + np
ef + np nf ep
ef + nf + ep + np
ef + np
ef + nf + ep + np
ef
ef + np + 2(ep + nf )
ef + np
nf + ep
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Then developed as an automated tool for humans Then criticized for not helping humans
Then redeemed to help machines to debug
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
ef
ef + nf + ep
ef
ef + 2(nf + ep)
2ef
2ef + nf + ep
2ef
ef + nf + ep
ef
nf + ep
1
2
(
ef
ef + nf
+
ef
ef + ep
)
ef
ef + nf + ep + np
ef + np nf ep
ef + nf + ep + np
ef + np
ef + nf + ep + np
ef
ef + np + 2(ep + nf )
ef + np
nf + ep
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Then developed as an automated tool for humans Then criticized for not helping humans
Then redeemed to help machines to debug
But now machines do debugging in a totally di
ff
erent way…?
History of FL and Changes of Perspectives…
37
Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode.
lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set
to the percentage of test cases that execute the respective statement for that mode.
The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed
statements on a spectrum from red to green. The hue of a line is determined by the percentage of the
number of failed test cases executing statement s to the total number of failed test cases in the test suite
T and the percentage of the number passed test cases executing s to the number of passed test cases in T.
These percentages are used to gauge the point in the hue spectrum from red to green for which to color s.
The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0
to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is
determined by the following equations.
hue(s) = low hue (red) + %passed(s)
%passed(s)+%failed(s) ∗ hue range
bright(s) = max(% passed(s),% failed(s))
For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases
and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and
75, respectively.
The long, thin rectangular region located above and to the right of the code view area shows graphically
the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is
color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the
Initially conceived as an aid for humans
ef
ef + nf + ep
ef
ef + 2(nf + ep)
2ef
2ef + nf + ep
2ef
ef + nf + ep
ef
nf + ep
1
2
(
ef
ef + nf
+
ef
ef + ep
)
ef
ef + nf + ep + np
ef + np nf ep
ef + nf + ep + np
ef + np
ef + nf + ep + np
ef
ef + np + 2(ep + nf )
ef + np
nf + ep
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Then developed as an automated tool for humans Then criticized for not helping humans
Then redeemed to help machines to debug
But now machines do debugging in a totally di
ff
erent way…?
?
Looking Both Ways…
38
Find ways to apply what we already
know to practice
- collaboration with Samsung SSDs
- Deploying FL into testing tools from
SuresoftTech
Reimagine the role of FL in the LLM era
- Many exciting work-in-progress: keep
watching COINSE :)
Looking Both Ways…
38
Find ways to apply what we already
know to practice
- collaboration with Samsung SSDs
- Deploying FL into testing tools from
SuresoftTech
Reimagine the role of FL in the LLM era
- Many exciting work-in-progress: keep
watching COINSE :)
39
Retrospective on My FL Work
2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
GP + FL Theory
SF1
B
⊂ SF2
B
HUMIES!
DP + FL
Entropy MBFL
MBFL 2.0 Industry
Diagnosability
Applications
Everything started with people.
Gabin An
(PhD Candidate)
Dr. Jeongju Sohn
Juyeon Yoon
(PhD Candidate)
Dr. Jinhan Kim
Sungmin Kang
(PhD Candidate)
Dr. Bill Langdon
Prof. Mark Harman
Prof. Robert Feldt Dr. David Clark
Prof. Moonzoo Kim
Prof. Yunho Kim
Dongwon Hwang
SAP Labs Korea
Prof. T. Y. Chen
Prof. Xiaoyuan Xie
Jingun Hong
SAP Labs Korea
❤
PTLMs are interesting because…
Great
Autocompletion
+
Dynamic
Postprocessing
+
Emergent
Behaviour
+
NL/AL
Duality
+
Use of
Tools
+…
Differential Testing
Metamorphic Testing
Test Input Generation
…
CoT
Wei et al., 2022
PAL
Gao et al., 2023
Dual Channel
Casalnuovo et al., 2020
ReAct
Yao et al., 2022
AutoGPT
shin.yoo@kaist.ac.kr
https://coinse.github.io
Looking Both Ways…
38
Find ways to apply what we already
know to practice
- collaboration with Samsung SSDs
- Deploying FL into testing tools from
SuresoftTech
Reimagine the role of FL in the LLM era
- Many exciting work-in-progress: keep
watching COINSE :)
40
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Tarantula =
ef
ef +nf
ep
ep+np
+
ef
ef +nf
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation
Structural Test Test Test Spectrum Tarantula Rank
Elements t1 t2 t3 ep ef np nf
s1 • 1 0 0 2 0.00 9
s2 • 1 0 0 2 0.00 9
s3 • 1 0 0 2 0.00 9
s4 • 1 0 0 2 0.00 9
s5 • 1 0 0 2 0.00 9
s6 • • 1 1 0 1 0.33 4
s7 (faulty) • • 0 2 1 0 1.00 1
s8 • • 1 1 0 1 0.33 4
s9 • • • 1 2 0 0 0.50 2
Result P F F
Spectrum Based Fault Localisation

More Related Content

What's hot

OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroNumenta
 
Introduction to TDD (Test Driven development) - Ahmed Shreef
Introduction to TDD (Test Driven development) - Ahmed ShreefIntroduction to TDD (Test Driven development) - Ahmed Shreef
Introduction to TDD (Test Driven development) - Ahmed ShreefAhmed Shreef
 
LLM presentation final
LLM presentation finalLLM presentation final
LLM presentation finalRuth Griffin
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3Ishan Jain
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
Open ai’s gpt 3 language explained under 5 mins
Open ai’s gpt 3 language explained under 5 minsOpen ai’s gpt 3 language explained under 5 mins
Open ai’s gpt 3 language explained under 5 minsAnshul Nema
 
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016Taehoon Kim
 
Particle swarm optimization
Particle swarm optimization Particle swarm optimization
Particle swarm optimization Ahmed Fouad Ali
 
Jawad's presentation on GPT.pptx
Jawad's presentation on GPT.pptxJawad's presentation on GPT.pptx
Jawad's presentation on GPT.pptxJawadNadeem3
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기CONNECT FOUNDATION
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentationbhavesh_physics
 
Тестирование требований
Тестирование требованийТестирование требований
Тестирование требованийNickola14
 
Google BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewGoogle BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewDR. Ram Kumar Pathak
 
risk based testing and regression testing
risk based testing and regression testingrisk based testing and regression testing
risk based testing and regression testingToshi Patel
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 
TDD Flow: The Mantra in Action
TDD Flow: The Mantra in ActionTDD Flow: The Mantra in Action
TDD Flow: The Mantra in ActionDionatan default
 

What's hot (20)

OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
 
Introduction to TDD (Test Driven development) - Ahmed Shreef
Introduction to TDD (Test Driven development) - Ahmed ShreefIntroduction to TDD (Test Driven development) - Ahmed Shreef
Introduction to TDD (Test Driven development) - Ahmed Shreef
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
LLM presentation final
LLM presentation finalLLM presentation final
LLM presentation final
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
Open ai’s gpt 3 language explained under 5 mins
Open ai’s gpt 3 language explained under 5 minsOpen ai’s gpt 3 language explained under 5 mins
Open ai’s gpt 3 language explained under 5 mins
 
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
텐서플로우 설치도 했고 튜토리얼도 봤고 기초 예제도 짜봤다면 TensorFlow KR Meetup 2016
 
BERT
BERTBERT
BERT
 
Particle swarm optimization
Particle swarm optimization Particle swarm optimization
Particle swarm optimization
 
Jawad's presentation on GPT.pptx
Jawad's presentation on GPT.pptxJawad's presentation on GPT.pptx
Jawad's presentation on GPT.pptx
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
 
BERT Finetuning Webinar Presentation
BERT Finetuning Webinar PresentationBERT Finetuning Webinar Presentation
BERT Finetuning Webinar Presentation
 
Тестирование требований
Тестирование требованийТестирование требований
Тестирование требований
 
Google BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A reviewGoogle BARD v/s ChatGPT _ A review
Google BARD v/s ChatGPT _ A review
 
risk based testing and regression testing
risk based testing and regression testingrisk based testing and regression testing
risk based testing and regression testing
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
TDD Flow: The Mantra in Action
TDD Flow: The Mantra in ActionTDD Flow: The Mantra in Action
TDD Flow: The Mantra in Action
 

Similar to Lessons from 10 Years of Automated Debugging Research

CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctness
CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program CorrectnessCMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctness
CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctnessallyn joy calcaben
 
On Certain Classess of Multivalent Functions
On Certain Classess of Multivalent Functions On Certain Classess of Multivalent Functions
On Certain Classess of Multivalent Functions iosrjce
 
Some properties of m sequences over finite field fp
Some properties of m sequences over finite field fpSome properties of m sequences over finite field fp
Some properties of m sequences over finite field fpIAEME Publication
 
CMSC 56 | Lecture 8: Growth of Functions
CMSC 56 | Lecture 8: Growth of FunctionsCMSC 56 | Lecture 8: Growth of Functions
CMSC 56 | Lecture 8: Growth of Functionsallyn joy calcaben
 
Quantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averagesQuantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averagesVjekoslavKovac1
 
Knutt Morris Pratt Algorithm by Dr. Rose.ppt
Knutt Morris Pratt Algorithm by Dr. Rose.pptKnutt Morris Pratt Algorithm by Dr. Rose.ppt
Knutt Morris Pratt Algorithm by Dr. Rose.pptsaki931
 
Laplace1 8merged
Laplace1 8mergedLaplace1 8merged
Laplace1 8mergedcohtran
 
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIR
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIRMATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIR
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIREditor IJMTER
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Lecture 4 - Growth of Functions (1).ppt
Lecture 4 - Growth of Functions (1).pptLecture 4 - Growth of Functions (1).ppt
Lecture 4 - Growth of Functions (1).pptZohairMughal1
 
Lecture 4 - Growth of Functions.ppt
Lecture 4 - Growth of Functions.pptLecture 4 - Growth of Functions.ppt
Lecture 4 - Growth of Functions.pptZohairMughal1
 
arithmatic progression.pptx
arithmatic progression.pptxarithmatic progression.pptx
arithmatic progression.pptxKirtiChauhan62
 
Master method
Master method Master method
Master method Rajendran
 
Arithmetic and geometric_sequences
Arithmetic and geometric_sequencesArithmetic and geometric_sequences
Arithmetic and geometric_sequencesDreams4school
 

Similar to Lessons from 10 Years of Automated Debugging Research (20)

algorithm_analysis1
algorithm_analysis1algorithm_analysis1
algorithm_analysis1
 
CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctness
CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program CorrectnessCMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctness
CMSC 56 | Lecture 12: Recursive Definition & Algorithms, and Program Correctness
 
On Certain Classess of Multivalent Functions
On Certain Classess of Multivalent Functions On Certain Classess of Multivalent Functions
On Certain Classess of Multivalent Functions
 
ch3.ppt
ch3.pptch3.ppt
ch3.ppt
 
Some properties of m sequences over finite field fp
Some properties of m sequences over finite field fpSome properties of m sequences over finite field fp
Some properties of m sequences over finite field fp
 
CMSC 56 | Lecture 8: Growth of Functions
CMSC 56 | Lecture 8: Growth of FunctionsCMSC 56 | Lecture 8: Growth of Functions
CMSC 56 | Lecture 8: Growth of Functions
 
Quantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averagesQuantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averages
 
Knutt Morris Pratt Algorithm by Dr. Rose.ppt
Knutt Morris Pratt Algorithm by Dr. Rose.pptKnutt Morris Pratt Algorithm by Dr. Rose.ppt
Knutt Morris Pratt Algorithm by Dr. Rose.ppt
 
Laplace1 8merged
Laplace1 8mergedLaplace1 8merged
Laplace1 8merged
 
Sequence and series
Sequence and seriesSequence and series
Sequence and series
 
Per4 induction
Per4 inductionPer4 induction
Per4 induction
 
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIR
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIRMATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIR
MATHEMATICAL MODELING OF COMPLEX REDUNDANT SYSTEM UNDER HEAD-OF-LINE REPAIR
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Nbvtalkatbzaonencryptionpuzzles
NbvtalkatbzaonencryptionpuzzlesNbvtalkatbzaonencryptionpuzzles
Nbvtalkatbzaonencryptionpuzzles
 
Nbvtalkatbzaonencryptionpuzzles
NbvtalkatbzaonencryptionpuzzlesNbvtalkatbzaonencryptionpuzzles
Nbvtalkatbzaonencryptionpuzzles
 
Lecture 4 - Growth of Functions (1).ppt
Lecture 4 - Growth of Functions (1).pptLecture 4 - Growth of Functions (1).ppt
Lecture 4 - Growth of Functions (1).ppt
 
Lecture 4 - Growth of Functions.ppt
Lecture 4 - Growth of Functions.pptLecture 4 - Growth of Functions.ppt
Lecture 4 - Growth of Functions.ppt
 
arithmatic progression.pptx
arithmatic progression.pptxarithmatic progression.pptx
arithmatic progression.pptx
 
Master method
Master method Master method
Master method
 
Arithmetic and geometric_sequences
Arithmetic and geometric_sequencesArithmetic and geometric_sequences
Arithmetic and geometric_sequences
 

Recently uploaded

Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 

Recently uploaded (20)

Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 

Lessons from 10 Years of Automated Debugging Research

  • 1. Lessons from 10 Years of Automated Debugging Research AST 2023 | Keynote Shin Yoo
  • 2. Shin Yoo • Associate Prof@KAIST / Republic of Korea • Leads Computational Intelligence for Software Engineering Group (https://coinse.github.io) • Search Based Software Engineering, Automated Debugging, Software Testing… Who am I? 2
  • 3. When I say “lessons”, I mean that I learnt these things. I do not mean to claim that these are some golden rules in research or anything else. Treat this talk just as a story of how certain techniques came to be… 🤠 3
  • 4. Retrospective on My FL Work 4 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 GP + FL Theory SF1 B ⊂ SF2 B HUMIES! DP + FL Entropy MBFL MBFL 2.0 Industry Diagnosability Applications
  • 5. Retrospective on My FL Work 5 GP + FL Theory SF1 B ⊂ SF2 B HUMIES! DP + FL Entropy MBFL MBFL 2.0 Industry Diagnosability Applications
  • 6. Everything started with people. 6 Gabin An (PhD Candidate) Dr. Jeongju Sohn Juyeon Yoon (PhD Candidate) Dr. Jinhan Kim Sungmin Kang (PhD Candidate) Dr. Bill Langdon Prof. Mark Harman Prof. Robert Feldt Dr. David Clark Prof. Moonzoo Kim Prof. Yunho Kim Dongwon Hwang SAP Labs Korea Prof. T. Y. Chen Prof. Xiaoyuan Xie Jingun Hong SAP Labs Korea ❤
  • 7. • Bill joins CREST (Centre for Research on Evolution, Search, & Testing) at King’s College London. • He immediately gets on with GPGPU for genetic programming, Higher-order Mutation, and what will later become Genetic Improvement. Year 2008 7 Dr. Bill Langdon
  • 8. • Bill joins CREST (Centre for Research on Evolution, Search, & Testing) at King’s College London. • He immediately gets on with GPGPU for genetic programming, Higher-order Mutation, and what will later become Genetic Improvement. Year 2008 7 Dr. Bill Langdon Prof. Mark Harman
  • 9. Me circa 2010 ‘That stuff looks incredibly cool - what can I do with GP and GPGPU…?’
  • 10. Evolving SBFL Formulas using GP Yoo, SSBSE 2012 9 ef ep ep + np + 1 Risk Evaluation Formula Ranking Program Tests Spectrum
  • 11. Evolving SBFL Formulas using GP Yoo, SSBSE 2012 9 ef ep ep + np + 1 Risk Evaluation Formula Ranking Program Tests Spectrum P S P S Training Data
  • 12. Evolving SBFL Formulas using GP Yoo, SSBSE 2012 9 Ranking Program Tests Spectrum P S P S GP Training Data
  • 13. Evolving SBFL Formulas using GP Yoo, SSBSE 2012 9 Program Tests Spectrum P S P S GP Fitness (minimise) Training Data
  • 14. Evolving SBFL Formulas using GP Yoo, SSBSE 2012 9 Program Tests Spectrum P S P S GP Fitness (minimise) e2 f (2ep + 2ef + 3np) e2 f (e2 f + p np) . . . Training Data
  • 15. Evolved Formulas Little did I know that these IDs would be their names… 10 ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were moved. ID Refined Formula ID Refined Formula GP01 ef (np + ef (1 + p ef )) GP16 q e 3 2 f + np GP02 2(ef + p np) + p ep GP17 2ef +nf ef np + np p ef ef e2 f GP03 q |e2 f p ep| GP18 e3 f + 2np GP04 q | np ep np ef | GP19 ef p |ep ef + nf np| GP05 (ef +np) p ef (ef +ep)(npnf + p ep)(ep+np) p |ep np| GP20 2(ef + np ep+np ) GP06 ef np GP21 p ef + p ef + np GP07 2ef (1 + ef + 1 2np ) + (1 + p 2) p np GP22 e2 f + ef + p np GP08 e2 f (2ep + 2ef + 3np) GP23 p ef (e2 f + np ef + p np + nf + np) GP09 ef p np np+np + np + ef + e3 f GP24 ef + p np GP10 q |ef 1 np | GP25 e2 f + p np + p ef p |ep np| + np (ef np) GP11 e2 f (e2 f + p np) GP26 2e2 f + p np GP12 p ep + ef + np p ep GP27 np p (npnf ef ) ef +npnf GP13 ef (1 + 1 2ep+ef ) GP28 ef (ef + p np + 1) GP14 ef + p np GP29 ef (2e2 f +ef +np)+ (ef np) p npef ep np GP15 ef + p nf + p np GP30 q |ef nf np ef +nf | existing metrics have been designed by human; this paper present the first
  • 16. Evolved Formulas Little did I know that these IDs would be their names… 10 ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were moved. ID Refined Formula ID Refined Formula GP01 ef (np + ef (1 + p ef )) GP16 q e 3 2 f + np GP02 2(ef + p np) + p ep GP17 2ef +nf ef np + np p ef ef e2 f GP03 q |e2 f p ep| GP18 e3 f + 2np GP04 q | np ep np ef | GP19 ef p |ep ef + nf np| GP05 (ef +np) p ef (ef +ep)(npnf + p ep)(ep+np) p |ep np| GP20 2(ef + np ep+np ) GP06 ef np GP21 p ef + p ef + np GP07 2ef (1 + ef + 1 2np ) + (1 + p 2) p np GP22 e2 f + ef + p np GP08 e2 f (2ep + 2ef + 3np) GP23 p ef (e2 f + np ef + p np + nf + np) GP09 ef p np np+np + np + ef + e3 f GP24 ef + p np GP10 q |ef 1 np | GP25 e2 f + p np + p ef p |ep np| + np (ef np) GP11 e2 f (e2 f + p np) GP26 2e2 f + p np GP12 p ep + ef + np p ep GP27 np p (npnf ef ) ef +npnf GP13 ef (1 + 1 2ep+ef ) GP28 ef (ef + p np + 1) GP14 ef + p np GP29 ef (2e2 f +ef +np)+ (ef np) p npef ep np GP15 ef + p nf + p np GP30 q |ef nf np ef +nf | existing metrics have been designed by human; this paper present the first
  • 17. Evolved Formulas Little did I know that these IDs would be their names… 10 ble 7. GP-evolved risk evaluation formulæ. Trivial bloats, such as nf nf , were moved. ID Refined Formula ID Refined Formula GP01 ef (np + ef (1 + p ef )) GP16 q e 3 2 f + np GP02 2(ef + p np) + p ep GP17 2ef +nf ef np + np p ef ef e2 f GP03 q |e2 f p ep| GP18 e3 f + 2np GP04 q | np ep np ef | GP19 ef p |ep ef + nf np| GP05 (ef +np) p ef (ef +ep)(npnf + p ep)(ep+np) p |ep np| GP20 2(ef + np ep+np ) GP06 ef np GP21 p ef + p ef + np GP07 2ef (1 + ef + 1 2np ) + (1 + p 2) p np GP22 e2 f + ef + p np GP08 e2 f (2ep + 2ef + 3np) GP23 p ef (e2 f + np ef + p np + nf + np) GP09 ef p np np+np + np + ef + e3 f GP24 ef + p np GP10 q |ef 1 np | GP25 e2 f + p np + p ef p |ep np| + np (ef np) GP11 e2 f (e2 f + p np) GP26 2e2 f + p np GP12 p ep + ef + np p ep GP27 np p (npnf ef ) ef +npnf GP13 ef (1 + 1 2ep+ef ) GP28 ef (ef + p np + 1) GP14 ef + p np GP29 ef (2e2 f +ef +np)+ (ef np) p npef ep np GP15 ef + p nf + p np GP30 q |ef nf np ef +nf | existing metrics have been designed by human; this paper present the first
  • 18. An Offer that I could not turn down… (despite it being full of maths) 11 Prof. T. Y. Chen Prof. Xiaoyuan Xie SR2 B SR1 B SR1 F SR1 A Formula R1 SR2 F SR2 A Formula R2 Statement Ranking SR B = {si 2 S|R(si) > R(sf ), 1  i  n} SR F = {si 2 S|R(si) = R(sf ), 1  i  n} SR A = {si 2 S|R(si) < R(sf ), 1  i  n} Xiaoyuan and T. Y. essentially formulated the comparison between SBFL formulas as algebraic proofs about set membership. A Theoretical Analysis of the Risk Evaluation Formulas for Spectrum- Based Fault Localisation, X. Xie, T.Y. Chen, F.C. Kuo, B. Xu, ACM Transactions on Software Engineering and Methodology 22(4):1-40
  • 19. An Offer that I could not turn down… (despite it being full of maths) 12 Prof. T. Y. Chen Prof. Xiaoyuan Xie SR2 B SR1 B SR1 F SR1 A Formula R1 SR2 F SR2 A Formula R2 Statement Ranking • Formula dominates formula ( ) if • Formula is equivalent to formula ( ) if • “Do you want to see where the evolved formulas are ranked in the hierarchy?” (i.e., the o ff er) R1 R2 R1 → R2 SR1 B ⊆ SR2 B ∧ SR2 A ⊆ SR1 A R1 R2 R1 ↔ R2 SR1 B = SR2 B ∧ SR1 F = SR2 F ∧ SR1 A = SR2 A 31:22 X. Xie et al. Fig. 4. Another performance hierarchy chain of risk evaluation formulas. LEMMA 4.12. For M2, we have SM2 B = ! si " " " "ai ef > 0 and P + a f ep F − 2F + P ai ef + 2 − ai ep ai ef > 0, 1 ≤ i ≤ n # (4.18) SM2 A = ! si " " " " $ ai ef = 0 % or & ai ef > 0 and P + a f ep F − 2F + P ai ef + 2 − ai ep ai ef < 0 ' , 1 ≤ i ≤ n # . (4.19) LEMMA 4.13. For AMPLE2, we have SA B = ! si " " " "ai ef > 0 and Pai ef − PF + Fa f ep Fai ef − ai ep ai ef > 0, 1 ≤ i ≤ n # (4.20) ! " "$ % & Pai ef − PF + Fa f ep ai ep ' # 31:16 X. Xie Fig. 3. Performance hierarchy chains of risk evaluation formulas. For Tarantula, si with ai ef = 0 will be assigned with the lowest risk value of 0 hence is ranked lower than any sj with a j ef > 0. Consider a test suite with P = F = 2. For re, if there exists si such that ai ef = 0, we have RRE(si) = − P F − 1 = Suppose there exists sj such that a j ef = 1 and a j ep = 5. Then, we have RRE(sj) = − −5 < −4 = RRE(si), while for Tarantula, we have RT (sj) = 3 8 > 0 = RT (si). Ther re does not produce the same ranking list as Tarantula, and hence re and Tara are not equivalent with respect to Naish et al.’s equivalence. 4.3.3. Relations between Nonequivalent Formulas. With the aforeside six groups of e alent formulas, we only need to search the maximal formulas from the followi individual formulas or groups of equivalent formulas, namely, ER1, ER2, ER3,
  • 20. Proof that GP was as good as humans Xie et al., SSBSE 2013 13 Table 1. Investigated formulæ Name Formula expression ER1’ Naish1 ⇢ 1 if ef <F P ep if ef =F Naish2 ef ep ep+np+1 GP13 ef (1 + 1 2ep+ef ) ER5 Wong1 ef Russel & Rao ef ef +nf +ep+np Binary ⇢ 0 if ef <F 1 if ef =F GP02 2(ef + p np) + p ep GP03 q |e2 f p ep| GP19 ef p |ep ef + nf np|
  • 21. Proof that GP was as good as humans Xie et al., SSBSE 2013 13 Table 1. Investigated formulæ Name Formula expression ER1’ Naish1 ⇢ 1 if ef <F P ep if ef =F Naish2 ef ep ep+np+1 GP13 ef (1 + 1 2ep+ef ) ER5 Wong1 ef Russel & Rao ef ef +nf +ep+np Binary ⇢ 0 if ef <F 1 if ef =F GP02 2(ef + p np) + p ep GP03 q |e2 f p ep| GP19 ef p |ep ef + nf np| has been permanently expanded to because joined the club. ER1 ER1′  GP13
  • 22. Proof that GP was as good as humans Xie et al., SSBSE 2013 13 Table 1. Investigated formulæ Name Formula expression ER1’ Naish1 ⇢ 1 if ef <F P ep if ef =F Naish2 ef ep ep+np+1 GP13 ef (1 + 1 2ep+ef ) ER5 Wong1 ef Russel & Rao ef ef +nf +ep+np Binary ⇢ 0 if ef <F 1 if ef =F GP02 2(ef + p np) + p ep GP03 q |e2 f p ep| GP19 ef p |ep ef + nf np| has been permanently expanded to because joined the club. ER1 ER1′  GP13 Three GP formulas formed their own maximal groups.
  • 23. 14
  • 24. 14 (Scene: T.Y., Mark, and Shin having lunch in the basement canteen of LSHTM, one day in 2014)
  • 25. 14 (Scene: T.Y., Mark, and Shin having lunch in the basement canteen of LSHTM, one day in 2014) M: So, what next?
  • 26. 14 (Scene: T.Y., Mark, and Shin having lunch in the basement canteen of LSHTM, one day in 2014) M: So, what next? TY: Perhaps we can use search to fi nd the greatest formula…?
  • 27. 14 (Scene: T.Y., Mark, and Shin having lunch in the basement canteen of LSHTM, one day in 2014) M: So, what next? TY: Perhaps we can use search to fi nd the greatest formula…? S: Hmmmmm….
  • 28. No Greatest SBFL Formula! Yoo + Xie et al., TOSEM 2017 15 Soon after that lunch, I ended up with a proof that such formula cannot exist. Sent it to Xiaoyuan and TY for rigorous check, kept my fi ngers crossed. And then it was con fi rmed!
  • 30. Important Caveat: Consistent Tie-breaker 17 SR2 B SR1 B SR1 F SR1 A Formula R1 SR2 F SR2 A Formula R2 Statement Ranking A choice of tie-breaker, combined with speci fi c spectrum, can result in empirical results that contradict theoretical hierarchy.
  • 31. Isn’t FL just DP but happening later? Sohn & Yoo, ISSTA 2017 18 Dr. Jeongju Sohn
  • 32. Isn’t FL just DP but happening later? Sohn & Yoo, ISSTA 2017 18 Dr. Jeongju Sohn Still using GP but this time to aggregate multiple scores, inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016)
  • 33. Isn’t FL just DP but happening later? Sohn & Yoo, ISSTA 2017 18 Dr. Jeongju Sohn Still using GP but this time to aggregate multiple scores, inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016) Multiple SBFL formulas are terminal nodes
  • 34. Isn’t FL just DP but happening later? Sohn & Yoo, ISSTA 2017 18 Dr. Jeongju Sohn Still using GP but this time to aggregate multiple scores, inspired by Xuan & Monperrus (ICSME 2014) as well as Le et al. (ISSTA 2016) Multiple SBFL formulas are terminal nodes Traditional Defect Prediction (DP) features added: age, churn, and complexity
  • 35. “Get out of your (beep) comfort zone? It (beep) took me 20 years to build a comfort zone!!” - Noel Gallagher, Clash Magazine, 2011 Lessons from my SBFL story 19
  • 36. Lessons from my SBFL story 19 Sometimes imitating the cool stu ff (people) works out :)
  • 37. Lessons from my SBFL story 19 Sometimes imitating the cool stu ff (people) works out :) Do mental exercise; test your comfort zone.
  • 38. Mutating buggy programs can help FL Moon et al., ICST 2014 20
  • 39. Mutating buggy programs can help FL Moon et al., ICST 2014 20 Correct Code Mutate (mostly)
  • 40. Mutating buggy programs can help FL Moon et al., ICST 2014 20 Correct Code Mutate (mostly) Buggy Code Mutate (sometimes!)
  • 41. Mutating buggy programs can help FL Moon et al., ICST 2014 20 Prof. Moonzoo Kim Prof. Yunho Kim Correct Code Mutate (mostly) Buggy Code Mutate (sometimes!)
  • 42. Mutating buggy programs can help FL Moon et al., ICST 2014 20 Prof. Moonzoo Kim Prof. Yunho Kim Correct Code Mutate (mostly) Buggy Code Mutate (sometimes!) “But what did YOU do?”
  • 43. Ranking is only for humans Moon et al., ICST 2014 • Humans consume FL results as ranking: order matters. • Machines (APR) consume FL results as weights: magnitude matters. • I suggested Kullback-Leibler divergence as an FL evaluation metric… didn’t catch on :) • Qi et al., ISSTA 2013 is an independent con fi rmation. 21 Op2 (LIL=7.34) Executed Statements 0.0 0.4 0.8 Faulty Statement Suspiciousness Ochiai (LIL=5.96) Executed Statements 0.0 0.4 0.8 Faulty Statement Suspiciousness Jaccard (LIL=4.92) Executed Statements 0.0 0.4 0.8 Faulty Statement Suspiciousness MUSE (LIL=0.40) Executed Statements 0.0 0.4 0.8 Faulty Statement Suspiciousness Figure 4: Comparison of distributions of normalized suspiciousness score across executed statements of space v21 Table V: Expense, LIL, and NCP scores on look utx 4.3 FL % of executed Locality Average NCP Technique stmts examined Information Loss over 100 runs MUSE 11.25 3.52 25.3 Op2 42.50 3.77 31.0 Ochiai 42.50 3.83 32.2 Jaccard 42.50 3.89 35.5 among the top 14.62% on average, and among the top 2% for seven out of 14 faulty versions we studied. Intuitively, A small LIL score of a localization technique indicates that the technique can be used to perform more efficient automated program repair. In contrast, the Expense metric values did not provide any information for the three SBFL techniques. We plan to perform a further empirical study to support the claim. VII. RELATED WORK
  • 44. A moment of reckoning years later… 22 MUSE Uni fi ed Debugging, Lou et al., ISSTA 2020 I had a “doh!” moment when I read this paper - although they frame it as APR helping FL, to me it was incremental mutation that made MUSE more lightweight. A really clever idea.
  • 45. Can we pay first? 💸 Kim et al., ISSRE 2021 23 Dr. Jinhan Kim Prof. Robert Feldt Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same location.
  • 46. Can we pay first? 💸 Kim et al., ISSRE 2021 23 Dr. Jinhan Kim Prof. Robert Feldt Mutation 1. Do mutation analysis 1. Do mutation analysis Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same location.
  • 47. Can we pay first? 💸 Kim et al., ISSRE 2021 23 Dr. Jinhan Kim Prof. Robert Feldt Mutation 1. Do mutation analysis 1. Do mutation analysis Training ML 2. Learn the reverse relation. Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same location.
  • 48. Can we pay first? 💸 Kim et al., ISSRE 2021 23 Dr. Jinhan Kim Prof. Robert Feldt Mutation 1. Do mutation analysis 1. Do mutation analysis Training ML 2. Learn the reverse relation. Real Fault ML 3. Pretend if real bug is a mutant Ahead-of-Time MBFL: mutants and bugs elicit similar reactions from test cases if they share the same location.
  • 49. What if a new test fails? Kim et al., TOSEM 2022 24 Linear Softmax Test Method Name Comparison Word Embedding Killed or Not Killed Source Method Name Bidirectional GRU Comparison Word Embedding Bidirectional GRU Before After Mutated Line Mutation Operator Concatenate Linear Linear Concatenate Learn the relationship between killed mutants and killing tests via the natural language channel We can predict whether which new test cases will kill which new mutants pretty well. Dr. Jinhan Kim
  • 50. “I concurrency joke have one! a” https://twitter.com/francesc/status/1286731919376265216 Lessons from my MBFL story 25
  • 51. Lessons from my MBFL story 25
  • 52. Lessons from my MBFL story 25 Ideas may come too late, but they do come, and it is okay ;) Your weakness may become your strength: often it is already a good problem to solve.
  • 53. Information Theory can help FL Yoo, Clark, and Harman, TOSEM 2013 26 Dr. David Clark Prof. Mark Harman Certain test cases provide more “information” to FL than others. Can we prioritize them? Equally Suspicious Uniquely Suspicious Test Execution FL is entropy reduction! That way, we can quantify the contribution from individual test cases. 19:16 grep, v3, F_KP_3 0.5 1.0 Suspiciousness FLINT TCP Random 0 20 40 60 80 100 Percentage of Executed Tests −20 −10 0 10 Expense Reduction Exp. Reduction FLINT Exp. Reduction Greedy (a) flex, v1, F_JR_2
  • 54. Quantitative Diagnosability of Tests An & Yoo, ISSTA 2022 27 Faulty Program Test Generation Fault Localisation Test Suite test Test Regression Tests Test Prioritisation Test Human Labelling Test Iterative Process Unlabelled Labelled If we have to augment the test suite for FL as we go along, we want to involve the human oracle for the smallest number of time. For this, we want to choose the test case with the most information about fault diagnosis as the next test. Gabin An (PhD Candidate)
  • 55. Quantitative Diagnosability of Tests An & Yoo, ISSTA 2022 27 Faulty Program Test Generation Fault Localisation Test Suite test Test Regression Tests Test Prioritisation Test Human Labelling Test Iterative Process Unlabelled Labelled If we have to augment the test suite for FL as we go along, we want to involve the human oracle for the smallest number of time. For this, we want to choose the test case with the most information about fault diagnosis as the next test. Gabin An (PhD Candidate)
  • 56. Taking FL to Industry Our partner in crime… 28 Jingun Hong SAP Labs Korea Dongwon Hwang SAP Labs Korea SAP HANA: In-memory RDBMS, >11MLoc, >50K fi les in C/C++ Developers Commits Push
  • 57. Taking FL to Industry Our partner in crime… 28 Jingun Hong SAP Labs Korea Dongwon Hwang SAP Labs Korea SAP HANA: In-memory RDBMS, >11MLoc, >50K fi les in C/C++ Developers Commits Push Post-submit Testing Build & Test > 3,000 test cases per run Failed!
  • 58. Taking FL to Industry Our partner in crime… 28 Jingun Hong SAP Labs Korea Dongwon Hwang SAP Labs Korea SAP HANA: In-memory RDBMS, >11MLoc, >50K fi les in C/C++ Developers Commits Push All reattempts failed? Test breakages Flaky Tests Rerun failing tests three times Yes No Identifying Test Breakages ~30 test breakages per run Post-submit Testing Build & Test > 3,000 test cases per run Failed!
  • 59. Taking FL to Industry Our partner in crime… 28 Jingun Hong SAP Labs Korea Dongwon Hwang SAP Labs Korea SAP HANA: In-memory RDBMS, >11MLoc, >50K fi les in C/C++ Developers Commits Push All reattempts failed? Test breakages Flaky Tests Rerun failing tests three times Yes No Identifying Test Breakages ~30 test breakages per run Post-submit Testing Build & Test > 3,000 test cases per run Failed! Breakage 1 Breakage 2 Breakage 3 . . . Bug Ticket 1 Bug Ticket 2 Bug Ticket 3
  • 60. Taking FL to Industry Our partner in crime… 28 Jingun Hong SAP Labs Korea Dongwon Hwang SAP Labs Korea SAP HANA: In-memory RDBMS, >11MLoc, >50K fi les in C/C++ Developers Commits Push All reattempts failed? Test breakages Flaky Tests Rerun failing tests three times Yes No Identifying Test Breakages ~30 test breakages per run Post-submit Testing Build & Test > 3,000 test cases per run Failed! Breakage 1 Breakage 2 Breakage 3 . . . Bug Ticket 1 Bug Ticket 2 Bug Ticket 3 Coverage 🤠 - Weekly Collection - Overhead is ~3.0x
  • 61. Triaging with FL Sohn et al., ICST 2021 Industry 29 Dongwon Hwang SAP Labs Korea Gabin An (PhD Candidate) Dr. Jeongju Sohn FL applied for bug triage: it can complement the static test-component relationship. As an automated technique, this can aid the decision maker so that issues are assigned faster. performance when all files participate, which is contradictory to our hypothesis that eliminating less suspicious files from voting would improve the localisation results by reducing noises. We may find a potential explanation for this result from the initial experimental results in Section III-C. In Fig. 3, the median ranking of faulty files is 3, 015 with the min tie- breaker, which means that the ground-truth faulty files might not be located near the top. Thus, eliminating the low-rank files may result in excluding the actual faulty files from the voting. As a result, in SAP HANA2, utilising every file in voting is more effective than eliminating the low-rank files in terms of breaking the ties and localising faulty components. Fig. 10: Comparison between the component mapping and SBFL with voting VD within the top 10: VD located 9 more faults compared to the baseline. Among 57 faults localised by VD, 36 of them are newly localised. 4) Ho formance we exten vote for utilising isation p Table V the row test-com mapping (↵ = 1) However the one the test-c increases only by decrease we conc localise n TABLE VD,VDN Voting V V Jingun Hong SAP Labs Korea
  • 62. Similar Root Causes Sohn et al., ICSE 2022 SEIP 30 Dongwon Hwang SAP Labs Korea Gabin An (PhD Candidate) Juyeon Yoon (PhD Candidate) Dr. Jeongju Sohn Jingun Hong SAP Labs Korea FL in existence of multiple faults: we need to isolate individual faults. Given a pair of failures, classify whether they share the same root cause? Static Historical Dynamic Automatically Identifying Shared Root Causes of Test Breakages in SAP HANA ICSE-SEIP ’22, May 21–29, 2022, Pi�sburgh, PA, U Figure 6: F1 scores of the models using di�erent feature extraction strategies against the test data. Note that the ~-axis rang from 0.40 to 0.85. The red and blue horizontal lines represent the F1 scores for the relevance classi�cations when using th Static-Ronin-TFIDF-Cosine and Coverage-nlink similarities, respectively, with learnt thresholds. Each error bar indicates th 95% con�dence interval. Precision 0.88, recall 0.64
  • 63. Finding the Origins of Bugs An et al., ICSE 2023 (WED 14:30) • We can project the FL results back to the commits to fi nd BICs • Filtering out stylistic changes • Applying time-dependent decay • Keep the information source minimal! 31 Gabin An (PhD Candidate) Jingun Hong SAP Labs Korea TABLE I EXAMPLE OF THE VOTING POWER OF CODE ELEMENTS Code Element e1 e2 e3 e4 e5 Score 1.0 0.6 0.6 0.6 0.3 rankmax 1 4 4 4 5 rankdense 1 2 2 2 3 vote ↵ = 0, ⌧ = max 1.00 0.25 0.25 0.25 0.20 ↵ = 1, ⌧ = max 1.00 0.15 0.15 0.15 0.06 ↵ = 0, ⌧ = dense 1.00 0.50 0.50 0.50 0.33 ↵ = 1, ⌧ = dense 1.00 0.30 0.30 0.30 0.10 in CBIC in the order of their likelihood of being the BIC. As mentioned earlier, our basic intuition is that if a commit had created, or modified, more suspicious code elements for the observed failures, it is more likely to be a BIC. The suspiciousness of code elements can be quantified via an FL technique. For example, we can apply SBFL [11] using the coverage of the test suite T: note that SBFL uses only test coverage and result information, both of which are available at the time of observing a test failure. Assuming that we are given the suspiciousness scores, let susp: Esusp ! [0, 1) be the mapping function from each suspicious code element in Esusp to its non-negative FL score.2 To convert the code- Fig. 3. Example of computing the commit scores when = 0.1 power regardless of hyperparameters, i.e., vote(e) > vote(e0 ) if and only if susp(e) > susp(e0 ). Depth-based Decay: Wen et al. [5] showed that using the information about commit time can boost the accuracy of the BIC identification. Similarly, Wu et al. [2] observed that the commit time of crash-inducing changes is closer to the reporting time of the crashes. Based on those findings, we hypothesise that older commits are less likely to be responsible for the currently observed failure, because if an older commit Fig. 6. MRR for each hyperparameter configuration of FONTE TABLE IV COMPARISON OF FONTE (↵ = 0, ⌧ = max, = 0.1) TO OTHER RANKING TECHNIQUES (# TOTAL SUBJECTS = 130) MRR Accuracy @1 @2 @3 @5 @10 FONTE 0.528 47 66 85 98 110 (36%) (51%) (65%) (75%) (85%) Stage 1&2 of FONTE + Other Techniques (on CBIC ) Bug2Commit 0.380 27 42 64 85 96 FBL-BERT 0.338 27 40 47 69 90 Max Aggr. (Eq. 9) 0.317 17 36 50 73 97 Random 0.218 3 19 41 65 75 Fig. using comm Fig. BIC redu B. R W on perf Fig. 6. MRR for each hyperparameter configuration of FONTE TABLE IV COMPARISON OF FONTE (↵ = 0, ⌧ = max, = 0.1) TO OTHER RANKING TECHNIQUES (# TOTAL SUBJECTS = 130) MRR Accuracy @1 @2 @3 @5 @10 FONTE 0.528 47 66 85 98 110 (36%) (51%) (65%) (75%) (85%) Stage 1&2 of FONTE + Other Techniques (on CBIC ) Bug2Commit 0.380 27 42 64 85 96 FBL-BERT 0.338 27 40 47 69 90 Max Aggr. (Eq. 9) 0.317 17 36 50 73 97 Random 0.218 3 19 41 65 75 Lower Bound 0.145 3 8 19 41 71 Other Techniques (on C) Bug2Commit 0.155 11 18 22 25 39 FBL-BERT 0.037 1 3 5 7 10 Ablation Study for FONTE Fig. using comm Fig. BIC redu B. R W on perf bar for bise the wei
  • 64. Lessons from Industry Collaborations 32 Simplicity Your technique may help unexpected tasks, but as long as they are helping, nothing else matters! A small delta may still be appreciated. Keeping the information source of your technique to the minimum is very important.
  • 65. Lessons from Industry Collaborations 32 Simplicity Your technique may help unexpected tasks, but as long as they are helping, nothing else matters! A small delta may still be appreciated. Keeping the information source of your technique to the minimum is very important. Prof. Robert Feldt
  • 67. Harnessing the power of AI/ML 34 • Well, we interact with a single pre- trained model, at least…? • CoT, ReAct…? • Someone else maintains them…?
  • 68. Harnessing the power of AI/ML • Keeping the information source minimal. • Explainability. • Maintenance of trained models. 34 • Well, we interact with a single pre- trained model, at least…? • CoT, ReAct…? • Someone else maintains them…?
  • 69. Reproducing the Bug to begin with… Kang, Yoon & Yoo, ICSE 2023 (FRI 14:00) 35 Juyeon Yoon (PhD Candidate) Sungmin Kang (PhD Candidate) Tn T3 T3 Example Test Bug Report T1 T2 T3 … Tn Test File 1 Test File 2 T1 Tn T2 (A) Prompt Engineering (B) LLM Querying (C) Post- processing (D) Selection & Ranking Example Report Prompt LLM Testing 1 2 Test Clusters 3 1) RQ3-1: We explore the performance of LIBRO when operating on the GHRB dataset of recent bug reports. We find that of the 31 bug reports we study, LIBRO can automatically generate bug reproducing tests for 10 bugs based on 50 trials, for a success rate of 32.2%. This success rate is similar to the results from Defects4J presented in RQ1-1, suggesting LIBRO generalizes to new bug reports. A breakdown of results by project is provided in Table VII. Bugs are successfully reproduced in AssertJ, Jsoup, Gson, and sslcontext, while they were not reproduced in the other two. We could not reproduce bugs from the Checkstyle project, despite it having a large number of bugs; upon inspection, we find that this is because the project’s tests rely heavily on external files, which LIBRO has no access to, as shown in Section VI-C3. LIBRO also does not generate BRTs for the Jackson project, but the small number of bugs in the Jackson project make it difficult to draw conclusions from it. Answer to RQ3-1: LIBRO is capable of generating bug reproducing tests even for recent data, suggesting it is not simply remembering what it trained with. 2) RQ3-2: LIBRO uses several predictive factors correlated with successful bug reproduction for selecting bugs and rank- ing tests. In this research question, we check whether the identified patterns based on the Defects4J dataset continue to hold in the recent GHRB dataset. the features used for our ranking strategy continue to be good indicators of successful bug reproduction. Answer to RQ3-2: We find that the factors used for the ranking and selection of LIBRO consistently predict bug reproduction in real-world data. 3) RQ3-3: We present case studies of attempts by LIBRO to reproduce bugs that either succeeded or failed. TABLE VIII: Bug Report Successfully Reproduced: URLs are omitted for brevity (AssertJ-Core Issue #2666) Title assertContainsIgnoringCase fails to compare i and I in tr_TR locale See org.assertj.core.internal.Strings#assertContainsIgnoringCase [url] I would suggest adding [url] verification to just ban toLowerCase(), toUpperCase() and other unsafe methods: #2664 We first present a successfully reproduced bug report, issue #2685 for the AssertJ-Core project, in Table VIII. The bug is that, in tr_TR locale, assertContainsIgnoringCase method is mishandling the letter I. For this bug report, LIBRO gener- ates the bug reproducing test in Listing 4. Despite the bug report not containing any executable code, LIBRO successfully generated a bug reproducing test that compares I and i, using the containsIgnoringCase method referenced within the report. A BRT for this bug is ranked at the second place, suggesting a developer would have been quickly provided with a reproducing test. Listing 5: Generated FIB test for Checkstyle-11365. 1 public void testFinalClass() throws Exception { 2 final DefaultConfiguration checkConfig = 3 createModuleConfig(FinalClassCheck.class); public void testIssue952(){ Locale locale = new Locale("tr", "TR"); Locale.setDefault(locale); assertThat("I").as("Checking in tr_TR locale") .containsIgnoringCase("i"); }
  • 70. PTLMs are interesting because… 36 Great Autocompletion + Dynamic Postprocessing + Emergent Behaviour + NL/AL Duality + Use of Tools +… Di ff erential Testing Metamorphic Testing Test Input Generation … CoT Wei et al., 2022 PAL Gao et al., 2023 Dual Channel Casalnuovo et al., 2020 ReAct Yao et al., 2022 AutoGPT
  • 71. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans
  • 72. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans ef ef + nf + ep ef ef + 2(nf + ep) 2ef 2ef + nf + ep 2ef ef + nf + ep ef nf + ep 1 2 ( ef ef + nf + ef ef + ep ) ef ef + nf + ep + np ef + np nf ep ef + nf + ep + np ef + np ef + nf + ep + np ef ef + np + 2(ep + nf ) ef + np nf + ep ef ef +nf ep ep+np + ef ef +nf Then developed as an automated tool for humans
  • 73. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans ef ef + nf + ep ef ef + 2(nf + ep) 2ef 2ef + nf + ep 2ef ef + nf + ep ef nf + ep 1 2 ( ef ef + nf + ef ef + ep ) ef ef + nf + ep + np ef + np nf ep ef + nf + ep + np ef + np ef + nf + ep + np ef ef + np + 2(ep + nf ) ef + np nf + ep ef ef +nf ep ep+np + ef ef +nf Then developed as an automated tool for humans Then criticized for not helping humans
  • 74. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans ef ef + nf + ep ef ef + 2(nf + ep) 2ef 2ef + nf + ep 2ef ef + nf + ep ef nf + ep 1 2 ( ef ef + nf + ef ef + ep ) ef ef + nf + ep + np ef + np nf ep ef + nf + ep + np ef + np ef + nf + ep + np ef ef + np + 2(ep + nf ) ef + np nf + ep ef ef +nf ep ep+np + ef ef +nf Then developed as an automated tool for humans Then criticized for not helping humans Then redeemed to help machines to debug
  • 75. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans ef ef + nf + ep ef ef + 2(nf + ep) 2ef 2ef + nf + ep 2ef ef + nf + ep ef nf + ep 1 2 ( ef ef + nf + ef ef + ep ) ef ef + nf + ep + np ef + np nf ep ef + nf + ep + np ef + np ef + nf + ep + np ef ef + np + 2(ep + nf ) ef + np nf + ep ef ef +nf ep ep+np + ef ef +nf Then developed as an automated tool for humans Then criticized for not helping humans Then redeemed to help machines to debug But now machines do debugging in a totally di ff erent way…?
  • 76. History of FL and Changes of Perspectives… 37 Figure 1: An screen snapshot of the Tarantula system in Shaded Summary mode. lines executed in both passed and failed test cases. In each of these modes, the brightness for each line is set to the percentage of test cases that execute the respective statement for that mode. The final mode, Shaded Summary, is the most informative and complex mapping. It renders all executed statements on a spectrum from red to green. The hue of a line is determined by the percentage of the number of failed test cases executing statement s to the total number of failed test cases in the test suite T and the percentage of the number passed test cases executing s to the number of passed test cases in T. These percentages are used to gauge the point in the hue spectrum from red to green for which to color s. The brightness is determined by the greater of the two percentages, assuming brightness is measured on a 0 to 100 scale. Specifically, the color of the line for a statement s that is executed by at least one test case is determined by the following equations. hue(s) = low hue (red) + %passed(s) %passed(s)+%failed(s) ∗ hue range bright(s) = max(% passed(s),% failed(s)) For example, for a test suite of 100 test cases, a statement s that is executed by 15 of 20 failed test cases and 40 of 80 passed test cases, and a hue range of 0 (red) to 100 (green), the hue and brightness are 40 and 75, respectively. The long, thin rectangular region located above and to the right of the code view area shows graphically the results of the entire test suite. A small rectangle is drawn for each test case from left-to-right and is color coded to its outcome—green for success and red for failure. This lets the viewer, at a glance, see the Initially conceived as an aid for humans ef ef + nf + ep ef ef + 2(nf + ep) 2ef 2ef + nf + ep 2ef ef + nf + ep ef nf + ep 1 2 ( ef ef + nf + ef ef + ep ) ef ef + nf + ep + np ef + np nf ep ef + nf + ep + np ef + np ef + nf + ep + np ef ef + np + 2(ep + nf ) ef + np nf + ep ef ef +nf ep ep+np + ef ef +nf Then developed as an automated tool for humans Then criticized for not helping humans Then redeemed to help machines to debug But now machines do debugging in a totally di ff erent way…? ?
  • 77. Looking Both Ways… 38 Find ways to apply what we already know to practice - collaboration with Samsung SSDs - Deploying FL into testing tools from SuresoftTech Reimagine the role of FL in the LLM era - Many exciting work-in-progress: keep watching COINSE :) Looking Both Ways… 38 Find ways to apply what we already know to practice - collaboration with Samsung SSDs - Deploying FL into testing tools from SuresoftTech Reimagine the role of FL in the LLM era - Many exciting work-in-progress: keep watching COINSE :)
  • 78. 39 Retrospective on My FL Work 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 GP + FL Theory SF1 B ⊂ SF2 B HUMIES! DP + FL Entropy MBFL MBFL 2.0 Industry Diagnosability Applications Everything started with people. Gabin An (PhD Candidate) Dr. Jeongju Sohn Juyeon Yoon (PhD Candidate) Dr. Jinhan Kim Sungmin Kang (PhD Candidate) Dr. Bill Langdon Prof. Mark Harman Prof. Robert Feldt Dr. David Clark Prof. Moonzoo Kim Prof. Yunho Kim Dongwon Hwang SAP Labs Korea Prof. T. Y. Chen Prof. Xiaoyuan Xie Jingun Hong SAP Labs Korea ❤ PTLMs are interesting because… Great Autocompletion + Dynamic Postprocessing + Emergent Behaviour + NL/AL Duality + Use of Tools +… Differential Testing Metamorphic Testing Test Input Generation … CoT Wei et al., 2022 PAL Gao et al., 2023 Dual Channel Casalnuovo et al., 2020 ReAct Yao et al., 2022 AutoGPT shin.yoo@kaist.ac.kr https://coinse.github.io Looking Both Ways… 38 Find ways to apply what we already know to practice - collaboration with Samsung SSDs - Deploying FL into testing tools from SuresoftTech Reimagine the role of FL in the LLM era - Many exciting work-in-progress: keep watching COINSE :)
  • 79. 40
  • 80. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation
  • 81. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation
  • 82. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation
  • 83. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation Tarantula = ef ef +nf ep ep+np + ef ef +nf
  • 84. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation
  • 85. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation
  • 86. Structural Test Test Test Spectrum Tarantula Rank Elements t1 t2 t3 ep ef np nf s1 • 1 0 0 2 0.00 9 s2 • 1 0 0 2 0.00 9 s3 • 1 0 0 2 0.00 9 s4 • 1 0 0 2 0.00 9 s5 • 1 0 0 2 0.00 9 s6 • • 1 1 0 1 0.33 4 s7 (faulty) • • 0 2 1 0 1.00 1 s8 • • 1 1 0 1 0.33 4 s9 • • • 1 2 0 0 0.50 2 Result P F F Spectrum Based Fault Localisation