Computer Science
Learning to Recognize Actionable Static Code Warnings (is Intrinsically Easy) – Journal First
Xueqi Yang, Jianfeng Chen, Rahul Yedida, Zhe Yu and Tim Menzies
ICSE ’22, 2022, Pittsburgh, PA, USA
Introduction
What is static analysis?
• Debugging without actually executing programs
• Checking code against coding guidelines or rules
• Performed early in development, before software testing
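The idea above can be made concrete with a tiny sketch. This is not any tool discussed in this talk; it is an illustrative checker built on Python's `ast` module that flags comparisons against `None` written with `==` instead of `is`, purely by inspecting the parsed source, never running it:

```python
import ast

def none_eq_warnings(source: str) -> list[int]:
    """Return line numbers where `== None` or `!= None` appears.

    A minimal static check: the code is parsed, never executed.
    """
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            # Suspicious operator: == or != (idiomatic Python uses `is None`)
            suspicious = any(isinstance(op, (ast.Eq, ast.NotEq))
                             for op in node.ops)
            # Right-hand side is the literal None
            compares_none = any(
                isinstance(c, ast.Constant) and c.value is None
                for c in node.comparators)
            if suspicious and compares_none:
                warnings.append(node.lineno)
    return warnings

code = "x = f()\nif x == None:\n    pass\n"
print(none_eq_warnings(code))  # → [2]
```

Even this toy check shows the trade-off the next slides discuss: it fires on every syntactic match, with no idea whether the developer will ever act on the warning.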
Introduction
What is a static analysis (SA) tool?
• Manual checking is time-consuming & expensive
• SA tools inspect programs for occurrences of known bug patterns
Challenges
• High false positive rate
Why so high?
- SA tools are over-cautious
- 35%–91% of generated warnings cannot be acted on [Heckman’ 2011]
Empirical Study
FindBugs:
- One of the most commonly used SA tools
- An open-source SA tool for Java programs
- Downloaded over one million times
Empirical Study
FindBugs:
Analyzes Java code with 424 bug patterns grouped into 9 types:
- bad practice
- correctness
- experimental
- internationalization
- malicious code vulnerability
- multithreaded correctness
- performance
- security
- dodgy code
Roadmap
● Wang et al. (EMSE’18)
○ Data and feature collection
○ Golden features
● Yang et al. (ESWA’21 and EMSE’21)
○ Incremental active learning
○ Deep learning and data simplicity
● Kang et al. (ICSE’22)
○ Data leakage (features and instances)
○ Data refactoring
● Yedida et al. (targeted at TSE; preprint available upon request)
○ Boundary engineering, label engineering, learner engineering, and instance engineering
Dataset
Ground truth:
- Label: actionable & unactionable [Liang’ 2010]
- Mined from the version control system & issue tracking system
Dataset
Feature extraction [Wang’ 2018]:
- Features sliced from 9 SE projects with a Java tool
- For each feature: name, category, meaning, extraction method
Dataset
8 categories of the 23 golden features (count per category):
Warning combination (6)
Code characteristics (5)
Warning characteristics (4)
File history (3)
Code analysis (2)
Code history (2)
Warning history (1)
File characteristics (0)
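The title's claim that this task is "intrinsically easy" rests on the data having low intrinsic dimensionality: a handful of golden features nearly separate actionable from unactionable warnings. The toy sketch below uses synthetic numbers (not the study's data) to show how well-separated two-feature data yields to even a nearest-centroid rule:

```python
def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def nearest_centroid_label(x, centroids):
    """Return the label whose class centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(x, centroids[lbl]))

# Synthetic 2-feature vectors (e.g. normalized warning lifetime, defect density):
actionable   = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
unactionable = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]
cents = {"actionable": centroid(actionable),
         "unactionable": centroid(unactionable)}
print(nearest_centroid_label([0.12, 0.88], cents))  # actionable
```

When the classes separate this cleanly, elaborate learners add little; that is the sense in which the recognition problem is easy once the right features are in hand.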
My refuted EMSE ’21 paper
What happens when you write a paper
and it gets refuted by an ICSE paper?
- Response 1: go home and cry
- Response 2: sit down with your
critics to work out what to do next
Roadmap
● Wang et al. (EMSE’18)
○ Data and feature collection
○ Golden features
● Yang et al. (ESWA’21 and EMSE’21)
○ Incremental active learning
○ Deep learning and data simplicity
● Kang et al. (ICSE’22)
○ Data leakage (features and instances)
○ Data refactoring
● Yedida et al. (targeted at TSE; preprint available upon request)
○ Boundary, label, learner, and instance engineering
Data refactoring
Kang et al.:
● Instance leakage
○ Manually relabeled 1,357 warnings; 768 remained
● Feature leakage
○ 5 leaking features (warning context in method, warning context in file, warning context for warning type, defect likelihood, discretization of defect likelihood)
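Feature leakage here means a feature whose value is computed using labels the model is supposed to predict. A minimal sketch with hypothetical data: "defect likelihood" for a warning type is legitimate when aggregated over training labels only, and leaky when the aggregation also sees test labels:

```python
def defect_likelihood(warning_type, labeled_warnings):
    """Fraction of warnings of this type labeled actionable,
    computed only from the (type, label) pairs passed in."""
    same = [lbl for (wt, lbl) in labeled_warnings if wt == warning_type]
    return sum(same) / len(same) if same else 0.0

# (warning_type, label) pairs; label 1 = actionable, 0 = unactionable
train = [("NP_NULL", 1), ("NP_NULL", 1), ("SE_BAD", 0)]
test  = [("NP_NULL", 0), ("SE_BAD", 0)]

# Correct: the feature is computed from training labels only.
print(defect_likelihood("NP_NULL", train))         # 1.0
# Leaky: the feature also sees the test labels it should predict,
# quietly encoding the answer into the input.
print(defect_likelihood("NP_NULL", train + test))  # about 0.67
```

A leaky variant of such a feature inflates measured performance, which is why Kang et al. flag these five features.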
Data refactoring
Kang et al. applied an off-the-shelf SVM model:
● Data leakage removed
● Open issue: still can’t get good predictors
What happened when all these people worked together?
Collaboration in open science:
● RAISE Lab & SOAR Group
- What’s the problem?
- What does Kang et al. say about Yang et al.?
- What new results did we come up with?
Collaboration in open science
Collaboration of two labs:
● RAISE Lab & SOAR Group
● Under submission to TSE
○ Boundary engineering
(GHOSTing)
○ Label engineering
(SMOOTHing)
○ Learner engineering
○ Instance engineering
(SMOTEing)
Collaboration in open science
Preliminary results of the ablation study:
References
[Kang’ 2022] Kang, Hong Jin, Khai Loong Aw, and David Lo. "Detecting False Alarms from Automatic Static Analysis Tools: How Far Are We?" arXiv preprint arXiv:2202.05982 (2022).
[Yang’ 2021a] Yang, Xueqi, et al. "Learning to recognize actionable static code warnings (is intrinsically easy)." Empirical Software Engineering 26.3 (2021): 1-24.
[Yang’ 2021b] Yang, Xueqi, et al. "Understanding static code warnings: An incremental AI approach." Expert Systems with Applications 167 (2021): 114134.
[Wang’ 2018] Wang, Junjie, Song Wang, and Qing Wang. "Is there a 'golden' feature set for static warning identification? An experimental evaluation." Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 2018.
[Heckman’ 2011] Heckman, Sarah, and Laurie Williams. "A systematic literature review of actionable alert identification techniques for automated static code analysis." Information and Software Technology 53.4 (2011): 363-387.
[Levina’ 2004] Levina, Elizaveta, and Peter Bickel. "Maximum likelihood estimation of intrinsic dimension." Advances in Neural Information Processing Systems 17 (2004).