Bug prediction + sdlc automation

Bug prediction
based on your code history

● 3y: Developer @ Various companies
● 4y: Developer & Founder @ Startup
● 5y: Team Lead & Architect @ Yandex
● 2y: VP of Engineering @ WorldAPP
13 years in practical engineering
80 people at the department
10+ projects from initial commit to production

[Diagram: backlog, DEV + QA, automation engineers]

What has already been automated
[Pipeline stages: Develop, Test, Maintenance]
● Unit and Integration tests run on every commit to MR branches
● Static code analysis on each push
● Cross references between GitLab and Jira
● HipChat notifications about created Merge Requests

● Deploy a successful build to the test environment
● UI and Performance tests run on every commit to a develop branch
● Check against different types of supported DBMS

● Deploy a successful build to the production environment
● Grafana alerting to HipChat

Issues with these opportunities
● static code analyzers find only non-conceptual issues
● automated tests cover only predefined scenarios
● code reviews are aimed at sharing and enforcing best practices, and fewer
than 10% of all discussions uncover logical issues
● and, finally, QA has no idea which parts of the system could be affected by a
code change… and neither does the programmer

20 bugs in a production environment per week

A guess. Let's examine the human factor
● a tired engineer makes more mistakes
● the more an engineer knows about a certain module, the fewer bugs (s)he will
produce
● small changes have fewer bugs than long listings
● some parts of the system are more complicated than others, so the risk of
introducing a bug there is higher
● huge changes made in a short period of time contain more bugs (done in a hurry)

Hypothesis
If we know that a certain commit fixed a bug, then we know that the commit in
which the changed lines were introduced contained the bug.
Commit A. Author: John
public int sum( int a, int b )
{
    return a + b;
}
Commit B. Author: Bob
{
    return a * b;
}
Commit C
{
    return a + b;
}
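A minimal Java sketch of that backtrace, under stated assumptions: the `BlameBacktrace` class, the `markBuggy` method and the commit ids are hypothetical, and the in-memory blame map stands in for whatever a real blame query against the version control system would return for the file just before the fix.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: given the lines a bug-fixing commit changed and a
// blame table (line number -> commit that last touched the line before the
// fix), collect the commits assumed to have introduced the bug.
final class BlameBacktrace {

    static Set<String> markBuggy(List<Integer> linesChangedByFix,
                                 Map<Integer, String> blameBeforeFix) {
        Set<String> buggy = new LinkedHashSet<>();
        for (int line : linesChangedByFix) {
            String introducedBy = blameBeforeFix.get(line);
            if (introducedBy != null) { // the line existed before the fix
                buggy.add(introducedBy);
            }
        }
        return buggy;
    }

    public static void main(String[] args) {
        // Blame of sum() just before the fix: John wrote lines 1, 2 and 4,
        // Bob rewrote line 3 ("return a * b;").
        Map<Integer, String> blame = Map.of(
                1, "commit-by-John",
                2, "commit-by-John",
                3, "commit-by-Bob",
                4, "commit-by-John");
        // The fixing commit touched only line 3.
        System.out.println(markBuggy(List.of(3), blame)); // prints [commit-by-Bob]
    }
}
```

Lines that the fix only adds have no entry in the blame map and are skipped, so a fix made purely of new lines marks nothing as buggy in this sketch.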

What tools can help us?
● ticket types
● action history
● exact code changes
● author of modifications
● class complexity
● code metrics

Our new team member. Overlord

WebHooks
java.util.concurrent.ScheduledExecutorService
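As an illustration of the scheduled half, here is a small sketch of a periodic job built on `ScheduledExecutorService`. The `PeriodicCheck` class, the `runPeriodically` method and the printed message are hypothetical; the slides do not show how Overlord actually schedules its jobs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic job: run a task at a fixed rate and
// shut the scheduler down once the task has fired a given number of times.
final class PeriodicCheck {

    // Returns true if the task ran at least `times` times before the timeout.
    static boolean runPeriodically(Runnable task, int times, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(times);
        try {
            scheduler.scheduleAtFixedRate(() -> {
                task.run();
                remaining.countDown();
            }, 0, periodMillis, TimeUnit.MILLISECONDS);
            // Wait for the expected duration plus a generous safety margin.
            return remaining.await(times * periodMillis + 5_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean done = runPeriodically(
                () -> System.out.println("checking for stale merge requests..."), 3, 100);
        System.out.println("finished: " + done);
    }
}
```

A long-running service would keep the scheduler alive instead of shutting it down; the countdown is there only so that the example terminates.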

Improve cross references between tools
● Notifies about a missing ticket key in the MR title
● Fills the MR with information from Jira
● Fixes common mistakes made when creating an MR
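The first check could look roughly like the sketch below. It assumes ticket keys of the form `PROJ-123` (an uppercase project code, a dash, a number); the `TicketKeyCheck` class and the sample titles are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: look for a Jira-style ticket key in an MR title.
final class TicketKeyCheck {

    // Assumption: keys look like "PROJ-123".
    private static final Pattern JIRA_KEY = Pattern.compile("\\b[A-Z][A-Z0-9]+-\\d+\\b");

    static Optional<String> findTicketKey(String mrTitle) {
        Matcher matcher = JIRA_KEY.matcher(mrTitle);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findTicketKey("PROJ-42 Fix rounding in sum()")); // prints Optional[PROJ-42]
        System.out.println(findTicketKey("Fix rounding in sum()"));         // prints Optional.empty
    }
}
```

An empty result is the signal to notify the author. Titles such as "UTF-8 fix" would match this simple pattern too, so a real check would probably also validate the project code against the known Jira projects.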

Propose the best reviewers based on MR changeset
● Who has previously edited the touched lines of code
● Who has written more code than others in the affected files
● Who is the team lead / owner of the service / package
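One way to combine these three signals is a weighted score per candidate. The sketch below is only an assumption about how that could work: the `ReviewerRanking` class, the weights and the input maps (line counts per author, as a blame query might produce) are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: rank reviewer candidates by a weighted score.
final class ReviewerRanking {

    // Illustrative weights, not taken from the talk.
    static final int TOUCHED_LINE_WEIGHT = 3;
    static final int FILE_LINE_WEIGHT = 1;
    static final int OWNER_BONUS = 50;

    static List<String> rank(Map<String, Integer> touchedLinesByAuthor,
                             Map<String, Integer> fileLinesByAuthor,
                             List<String> owners,
                             String mrAuthor,
                             int limit) {
        Map<String, Integer> score = new HashMap<>();
        touchedLinesByAuthor.forEach(
                (who, lines) -> score.merge(who, lines * TOUCHED_LINE_WEIGHT, Integer::sum));
        fileLinesByAuthor.forEach(
                (who, lines) -> score.merge(who, lines * FILE_LINE_WEIGHT, Integer::sum));
        owners.forEach(who -> score.merge(who, OWNER_BONUS, Integer::sum));
        score.remove(mrAuthor); // authors do not review their own MR

        // Highest score first; ties are broken alphabetically for stable output.
        return score.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> reviewers = rank(
                Map.of("alice", 10, "bob", 2),    // previous editors of the touched lines
                Map.of("bob", 100, "carol", 40),  // lines written in the affected files
                List.of("carol"),                 // team lead / owner
                "dave",                           // MR author
                2);
        System.out.println(reviewers); // prints [bob, carol]
    }
}
```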

Task updates according to the workflow
● Transitions the task status
● Assigns the proper person for the next step
● Marks whether the task has SQL changes
● Adds a label with the branch the MR was merged into

Check that MR has 2 upvotes before merging
● Check that the rules are followed
● Notify the team lead / dev manager about any violation
● Push the author to ask colleagues to look at their masterpiece

Other automated processes
● Notifies the author about an old MR without any reactions
● Notifies the assignee that the MR can be merged
● Notifies you if you have lots of “In Progress” tickets or none at all
● Provides a list of tasks merged into a particular branch

Algorithm of metrics collection
● Export all tasks from Jira to an in-memory dictionary
● For each commit, run a backtrace to mark it as buggy, fixing or regular
● Collect all meaningful data about the commit:
○ Month of year, day of week, hour of day, who, how many lines and files, which classes and
packages, class complexity and number of notices, how long the task was in progress
● Append a line with the data to an Attribute-Relation File Format (ARFF) file
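To make the last step concrete, here is a sketch of what such an ARFF file could contain. The attribute names, the nominal value sets and the `ArffSketch` class are assumptions chosen for illustration; only a subset of the listed features is shown, and the file is built by hand rather than through the WEKA API.

```java
// Hypothetical sketch: build ARFF text for a few commit features by hand.
final class ArffSketch {

    // ARFF header: relation name, one @attribute line per feature, then @data.
    static String header() {
        return String.join("\n",
                "@relation commits",
                "@attribute day_of_week {Mon,Tue,Wed,Thu,Fri,Sat,Sun}",
                "@attribute hour_of_day numeric",
                "@attribute author {John,Bob}",
                "@attribute lines_changed numeric",
                "@attribute files_changed numeric",
                "@attribute buggy {true,false}",
                "@data");
    }

    // One comma-separated data line per commit, in attribute order.
    static String row(String dayOfWeek, int hourOfDay, String author,
                      int linesChanged, int filesChanged, boolean buggy) {
        return String.join(",",
                dayOfWeek,
                String.valueOf(hourOfDay),
                author,
                String.valueOf(linesChanged),
                String.valueOf(filesChanged),
                String.valueOf(buggy));
    }

    public static void main(String[] args) {
        System.out.println(header());
        System.out.println(row("Fri", 18, "Bob", 320, 7, true));
        System.out.println(row("Tue", 11, "John", 12, 1, false));
    }
}
```

The class to predict (`buggy` here) is placed last because WEKA tools conventionally treat the last attribute as the class unless told otherwise.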

Getting educated. WEKA
Waikato Environment for Knowledge Analysis (WEKA) is a suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.
● Parsers
● Classifiers
● Training/test splits

WEKA challenges
● Convert your data to corresponding vectors
● Choose proper data transformers
● Select and tweak desired Classifiers
● Run experiments and adjust your settings
Good materials about WEKA for beginners:
● How to Run Your First Classifier in Weka
● Data mining with WEKA, Part 2. Classification and clustering
● Document Classification using WEKA

Decision Tree
Ease of results interpretation
Any data can be fed to the method
Can work with scalars and intervals

Decision Tree
[Figure: example decision tree. Decision nodes: "Changed more than 50 lines?", "Author is John?", "Author is Bob?", "Changed less than 300 lines?", "Is it Friday?". Leaves: "Has no bugs :)" and "Has a bug :(".]
● John never has bugs!
● Everybody except John and Bob has bugs on Friday.
● Bob has bugs only if he changed more than 300 lines of code.

Decision Tree
The simplest method for building a tree is ID3 (Iterative Dichotomiser 3*).
Build steps:
● Find the attribute whose split gives the lowest entropy (that is, the largest information gain)
● Split the data set by the found attribute
● Recursively build a tree for each of the subsets
* the fates of ID2 and ID1 are lost to history
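The quantity ID3 compares in its first step can be sketched as follows for a two-class problem (buggy vs. clean commits) and a binary split. The `InfoGain` class and the sample counts are hypothetical; the formulas are the standard Shannon entropy and information gain.

```java
// Hypothetical sketch: entropy and information gain for a binary split
// of commits labeled buggy / clean.
final class InfoGain {

    // Shannon entropy (in bits) of a set with the given class counts.
    static double entropy(int buggy, int clean) {
        int total = buggy + clean;
        if (total == 0) {
            return 0.0;
        }
        return term(buggy, total) + term(clean, total);
    }

    // -p * log2(p), with the usual convention that 0 * log2(0) = 0.
    private static double term(int count, int total) {
        if (count == 0) {
            return 0.0;
        }
        double p = (double) count / total;
        return -p * Math.log(p) / Math.log(2);
    }

    // Entropy of the parent set minus the size-weighted entropy of the two
    // subsets produced by the split. Assumes at least one commit overall.
    static double gain(int leftBuggy, int leftClean, int rightBuggy, int rightClean) {
        int left = leftBuggy + leftClean;
        int right = rightBuggy + rightClean;
        int total = left + right;
        double parent = entropy(leftBuggy + rightBuggy, leftClean + rightClean);
        double children = (double) left / total * entropy(leftBuggy, leftClean)
                + (double) right / total * entropy(rightBuggy, rightClean);
        return parent - children;
    }

    public static void main(String[] args) {
        // A split that separates the classes perfectly gains a full bit...
        System.out.println(gain(5, 0, 0, 5));
        // ...while a split that leaves both halves mixed 50/50 gains nothing.
        System.out.println(gain(2, 2, 2, 2));
    }
}
```

ID3 would evaluate this gain for every candidate attribute and split on the one with the highest value.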

Naive Bayes classifier
≈80% accuracy*
Simple implementation
Easy to understand

Naive Bayes classifier
30% of all commits with bugs were made by Bob: P(Bob|bug) = 0.3
10% of all commits without bugs were made by Bob: P(Bob|~bug) = 0.1
40% of all commits have bugs: P(bug) = 0.4
60% of all commits have no bugs: P(~bug) = 0.6
What is the probability that the next commit from Bob will have a bug? P(bug|Bob) = ?
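The question can be answered with Bayes' theorem: P(bug|Bob) = P(Bob|bug) P(bug) / (P(Bob|bug) P(bug) + P(Bob|~bug) P(~bug)). The sketch below plugs in the numbers from this slide; the `BayesExample` class and its method name are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: Bayes' theorem for a single feature (the author).
final class BayesExample {

    // likelihood = P(feature|bug), likelihoodGivenNot = P(feature|~bug),
    // prior = P(bug); P(~bug) is taken as 1 - prior.
    static double posterior(double likelihood, double likelihoodGivenNot, double prior) {
        double evidence = likelihood * prior + likelihoodGivenNot * (1.0 - prior);
        return likelihood * prior / evidence;
    }

    public static void main(String[] args) {
        // 0.3 * 0.4 / (0.3 * 0.4 + 0.1 * 0.6) = 0.12 / 0.18, about 0.667
        double pBugGivenBob = posterior(0.30, 0.10, 0.40);
        System.out.println(String.format(Locale.ROOT, "P(bug|Bob) = %.3f", pBugGivenBob));
    }
}
```

With these numbers, roughly two out of three of Bob's commits are expected to contain a bug. A naive Bayes classifier applies the same rule to every feature and multiplies the likelihoods, assuming the features are independent.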

Output results example (Bayes)
Correctly Classified Instances       14381    77.4755 %
Incorrectly Classified Instances      4181    22.5245 %
Kappa statistic                          0.3085
Mean absolute error                      0.2637
Root mean squared error                  0.3963
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.856     0.544     0.861       0.856    0.858       0.761      false
               0.456     0.144     0.444       0.456    0.45        0.761      true
Weighted Avg.  0.775     0.463     0.777       0.775    0.776       0.761
=== Confusion Matrix ===
     a      b   <-- classified as
 12670   2140 |  a = false
  2041   1711 |  b = true
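The headline figures follow directly from the confusion matrix. As a sanity check, the sketch below recomputes accuracy, precision, recall and F-measure for the class `true` (commit has a bug) using the standard definitions; the `ConfusionMetrics` class is hypothetical and is not part of WEKA.

```java
import java.util.Locale;

// Hypothetical sketch: recompute the reported metrics for class "true"
// from the four cells of the confusion matrix.
final class ConfusionMetrics {

    static double accuracy(long tn, long fp, long fn, long tp) {
        return (double) (tn + tp) / (tn + fp + fn + tp);
    }

    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    static double fMeasure(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Rows are actual classes, columns are predictions:
        // actual false -> 12670 predicted false, 2140 predicted true
        // actual true  ->  2041 predicted false, 1711 predicted true
        long tn = 12670, fp = 2140, fn = 2041, tp = 1711;
        System.out.println(String.format(Locale.ROOT, "accuracy  = %.4f", accuracy(tn, fp, fn, tp)));
        System.out.println(String.format(Locale.ROOT, "precision = %.3f", precision(tp, fp)));
        System.out.println(String.format(Locale.ROOT, "recall    = %.3f", recall(tp, fn)));
        System.out.println(String.format(Locale.ROOT, "F-measure = %.3f", fMeasure(tp, fp, fn)));
    }
}
```

The recomputed values agree with the report: the classifier is right about 77% of the time overall, but it catches fewer than half of the buggy commits (recall 0.456 for class `true`).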

Summary
● we found that certain classes are too complex: almost every change to them
ends up with a bug
● some engineers shouldn't open certain packages at all (or at least we should
properly educate them)
● there is still plenty of room for improvement (overlapping hiding commits,
other meaningful features, more accurate code history, etc.)
● It does not show you where an error is. But you will be able to analyze a
commit more carefully.
● It was fun! :)

Questions?
Alexey@Tokar.net.ua
VP of Engineering @ WorldAPP