Pull requests can be analyzed quantitatively to evaluate developer performance. Metrics like cycle time, number of comments, and pull request size are captured from version control systems and used to generate scorecards for both developers and reviewers. Natural language processing techniques like BERT are used to classify comment types which factor into developer scores. This provides an objective way to assess skills and opportunities for improvement within agile teams. The presented approach is currently used for quarterly reviews at a company and has led to focused training and more efficient task allocation.
2. Agenda
• Who we are
• Use case discussion
• Use of NLP in the solution
4. Bizom Simplifies Supply Chain
With Bizom: Brand's salesman visits the outlet → order is punched into the Bizom App → order is delivered
Without Bizom: Brand's salesman visits the outlet → takes the inventory order using pen & paper → order is manually handed to the distributor → distributor checks inventory → order is delivered
5. Retail Intelligence Platform
With the enormous retail data we gather using our App, we are developing a series of AI/ML capabilities to solve the industry's prickliest challenges:
• Hawk: image recognition model(s)
• Automated targets: dynamic sales targets
• Suggested order: predictive demand forecasting
• Route optimisation: optimize salesman productivity
• Delivery optimisation: optimize van sales / direct delivery
• Augmented Reality: merchandising optimisation
• Eagle Eye: hyperlocal geospatial analytics
• Decision optimization across supply chain, sales automation, and category management
8. What is Pull Request Analytics?
Analysis of pull request metadata to gain insights into the efficiency & quality of developers & reviewers can be construed as PR Analytics.
9. How does PR analytics fit in with Agile?
It is complementary.
10. Where is the PR Data?
Bitbucket (git-based source code management) allows us to fetch repository data using APIs. The data points that can be accessed are:
• PR Comments
• PR Summary
• PR Files
• PR Activities (audit trail)
• PR Commits
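As a sketch of where these data points come from, the helpers below build the Bitbucket Cloud REST API (v2.0) URLs for the pull-request resources listed above. The endpoint shapes match Bitbucket Cloud's documented `/pullrequests` resource; the workspace and repository names are placeholders, and actual fetching (authentication, pagination) is left out.

```python
# Sketch: Bitbucket Cloud REST API v2.0 URLs for the PR data points
# used in the analysis. Workspace/repo names below are placeholders.

BASE = "https://api.bitbucket.org/2.0"

def pr_list_url(workspace: str, repo_slug: str, state: str = "MERGED") -> str:
    """Build the URL for listing pull requests in a repository."""
    return f"{BASE}/repositories/{workspace}/{repo_slug}/pullrequests?state={state}"

def pr_detail_urls(workspace: str, repo_slug: str, pr_id: int) -> dict:
    """URLs for the per-PR resources named on the slide."""
    base = f"{BASE}/repositories/{workspace}/{repo_slug}/pullrequests/{pr_id}"
    return {
        "summary": base,                  # PR Summary
        "comments": f"{base}/comments",   # PR Comments
        "commits": f"{base}/commits",     # PR Commits
        "activity": f"{base}/activity",   # PR Activities (audit trail)
        "diffstat": f"{base}/diffstat",   # PR Files (lines changed)
    }
```

Each URL can then be fetched with any HTTP client using an app password or OAuth token; list endpoints return paginated JSON.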
12. How are we using the PR data?
We plot a 2D chart of Quality of work (mean performance index) against Quantity of work (mean no. of PRs), divided into quadrants Q1-Q4: the Scorecard.
• An objective way to assess the quality of a developer or reviewer with respect to the team
• Gives timely feedback to developers on their performance within the team
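The quadrant placement described above can be sketched as a simple comparison against the team means. The Q1-Q4 labels below are an assumption about how the slide's quadrants are laid out, not taken from the deck.

```python
def quadrant(mean_prs: float, mean_perf: float,
             team_mean_prs: float, team_mean_perf: float) -> str:
    """Place a developer on the quantity/quality plane relative to team means.

    x-axis: mean number of PRs (quantity of work)
    y-axis: mean performance index (quality of work)
    The Q1..Q4 labelling is an assumed layout.
    """
    high_quantity = mean_prs >= team_mean_prs
    high_quality = mean_perf >= team_mean_perf
    if high_quantity and high_quality:
        return "Q1: high quantity, high quality"
    if high_quality:
        return "Q2: low quantity, high quality"
    if not high_quantity:
        return "Q3: low quantity, low quality"
    return "Q4: high quantity, low quality"
```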
13. Developer Trends
Use the scorecard to see the shift in the performance of a particular developer over multiple quarters.
14. Different Beneficiaries of the Scorecard
Developer
• Timely feedback
• Motivates them to excel in their work (gamification)
• Gap assessment tool (self-assessment)
Team Leads
• Capability assessment tool: an objective way of appraisal
• Gap assessment tool for the team
Scrum Master
• Better resource & time planning
• Allocate critical tasks to quality developers (de-risk the critical path)
15. Cycle Time
PR opened → (Time to Review) → First Comment → (Time to Approve) → PR Approved → (Time to Merge) → PR Merged
Cycle Time: the time between the opening of the PR and the closing of the PR, measured in days.
Small cycle time: <1 day or 1-2 days
Large cycle time: >14 days
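The cycle-time measurement above can be sketched directly from the PR's open and close timestamps, with the small/large thresholds taken from the slide:

```python
from datetime import datetime

def cycle_time_days(opened: str, closed: str) -> float:
    """Cycle time in days between PR open and PR close (ISO 8601 timestamps)."""
    t0 = datetime.fromisoformat(opened)
    t1 = datetime.fromisoformat(closed)
    return (t1 - t0).total_seconds() / 86400

def cycle_time_bucket(days: float) -> str:
    """Bucket a cycle time using the slide's thresholds."""
    if days < 2:
        return "small"   # <1 day or 1-2 days: eligible for a bonus
    if days > 14:
        return "large"   # long-lived PRs are penalized
    return "normal"
```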
16. Additional Data Points
Indicates quantity of work:
• PR Frequency: number of PRs worked upon by a developer
Indicates quality of work:
• Comment Type: category of comments given by the reviewer
• PR Size: number of lines for reviewing
• Comment Count: number of comments received during the review
Context:
• Project Experience: how long has the developer been working on this project?
• Industry Experience: long industry experience indicates that the resource has some idea of good development practices and a better chance of doing the work correctly
17. Developer & Reviewer Profile
Developer profile
• Cycle Time: how much time it took for the PR to be accepted/rejected; too much time is not desired
• Number of comments: has the developer received a lot of comments?
• PR Size: how many lines of code for review in a single PR
• PR Frequency: how many PRs have been worked upon
• Category of comments: coding issues, guideline issues, general
Reviewer profile
• Cycle Time w.r.t. PR size (review time): is the reviewer spending time proportional to the PR size?
• Number of comments: how many comments is the reviewer giving on PRs?
• PR Frequency: how many PRs are being reviewed
• Types of comment: shows the depth of understanding of the code
19. Preparation of the Developer Scorecard
Depends on the following:
• Cycle time: we give a bonus for small cycle times and a penalty for large cycle times
• PR Size: we penalize large PR sizes and give a bonus for small PR sizes
• Comment Types: each comment type has been assigned penalty points; we sum the penalty across all the PRs
20. Preparation of the Reviewer Scorecard
Depends on the following:
• Quality of the comments the reviewer has given: we give a bonus for good-quality comments
• Number of PRs the reviewer has reviewed
• Cycle time w.r.t. PR size: penalized for being lazy or not doing the review properly
21. Final Scorecard
● The team member may play a role of reviewer as
well as developer so we combine the developer
score and reviewer score together to get the final
score
● In general, the more experienced folks in the
team are expected to play reviewer role more
than the developer role and vice versa. This is
reflected in the way we combine the scores.
For e.g.
‘A’ is highly experienced team member, score will
be 0.4 * development score+0.6* reviewer score
(as we expect more of a mentorship role)
While ‘ L’ is a fresh out of college, we expect 100%
development effort and so the reviewer
weightage will be 0
Combined Score = Development_Weightage * Development Score + Review_Weightage * Review Score
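The combination formula above can be sketched as follows, assuming (as the slide's examples imply) that the two weightages sum to 1; the example score values are illustrative only.

```python
def combined_score(dev_score: float, rev_score: float,
                   dev_weight: float) -> float:
    """Combined Score = Development_Weightage * Development Score
                      + Review_Weightage * Review Score
    Assumes the two weightages sum to 1; the split reflects seniority."""
    assert 0.0 <= dev_weight <= 1.0
    return dev_weight * dev_score + (1.0 - dev_weight) * rev_score

# Slide examples (scores are illustrative): senior member 'A' gets a
# 0.4 dev / 0.6 review split; fresher 'L' gets 1.0 dev / 0.0 review.
senior = combined_score(dev_score=70, rev_score=85, dev_weight=0.4)
fresher = combined_score(dev_score=70, rev_score=0, dev_weight=1.0)
```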
23. Comment Types
Logical Suggestions
• Code logic suggestions
• Requirement/use-case validations with the product team
Code Suggestions
• The comment contains the exact code that needs to be incorporated
• These comments show the proficiency of the reviewer
Buggy Code
• The component is not working as expected
• Functionality can break in the future
• No exception handling
• Code optimization issues
Spelling Suggestions
• Comments related to the use of standard naming conventions
• Rename files or variables
Questioning
• The reviewer needs more clarification from the developer
• Seeks more information
Completeness Tasks
• Ensure that the code has been unit tested
• Some functionality is missing and needs to be completed before merge
Refactoring Suggestions
• Duplicate code that needs to be deleted
• Indentation issues
• Empty spaces/rows
• Removal of commented-out code
Documentation Suggestions
• Some code comments required
• Adding some background on the approach
24. A Few Challenges While Using Algorithms to Classify Comments
• High accuracy is desired
• Not everyone writing comments uses proper English sentences/grammar
• Sometimes comments are written as a rhetorical question, e.g. "Don't you think we should have used XYZ in the code?"
Comment-type penalty weights for developers:
Comment Type              Penalty Weight
BUGGYCODE SUGGESTION      -10
CODE SUGGESTION           -8
REFACTORING SUGGESTION    -7
REUSE SUGGESTION          -6
COMPLETENESS SUGGESTION   -5
QUESTIONING               -4
LOGICAL SUGGESTION        -3
SPELLING                  -2
DOCUMENTATION SUGGESTION  -1
25. How to Detect Comment Types
Why can't we use traditional NLP based on word counts or regex? Consider spelling suggestions vs. reuse suggestions: it is difficult to classify such comments into the correct bucket just by word counts.
26. How to Detect Comment Types
• Classification of comments is a multiclass classification problem
• We have tried a text-generation-model approach to solve the multiclass classification problem
Why a generative model instead of a classification model?
Seq2seq using attention: every task, including translation, question answering, and classification, is converted to a seq2seq problem. The T5 model is fed text as input and generates some target text.
https://arxiv.org/abs/1910.10683
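In the text-to-text framing, each labelled comment becomes an (input text, target text) pair rather than an (input, class index) pair. A minimal sketch of that data preparation is below; the "classify comment:" task prefix is an assumption (T5 uses task prefixes, but the deck does not state the one used), and the label list covers the nine types named in the penalty table (the deck counts ten classes, so this list is partial).

```python
# Frame multiclass comment classification as a text-to-text task, as T5
# does for every task. The "classify comment:" prefix is an assumed task
# prefix; the label list names the nine types in the penalty table.

LABELS = [
    "BUGGYCODE SUGGESTION", "CODE SUGGESTION", "REFACTORING SUGGESTION",
    "REUSE SUGGESTION", "COMPLETENESS SUGGESTION", "QUESTIONING",
    "LOGICAL SUGGESTION", "SPELLING", "DOCUMENTATION SUGGESTION",
]

def to_seq2seq_pair(comment: str, label: str) -> tuple:
    """Turn a (comment, label) example into (input text, target text)."""
    assert label in LABELS
    return (f"classify comment: {comment}", label.lower())

src, tgt = to_seq2seq_pair("Don't you think we should have used XYZ?",
                           "QUESTIONING")
```

At inference time the model generates the target text, which is then matched back against the label set.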
27. Data
• Number of manually classified comments: 4800
• Training data: 3700 comments
• Validation data: 900 comments
• Number of classes (comment types): 10
28. LSTM (Baseline)
• Training accuracy increases with epochs but validation accuracy has plateaued: overfitting
• Validation accuracy is ~40%
• Validation loss stops changing after 10 epochs, another sign of overfitting
34. BERT (Next Step)
• Training accuracy increases with epochs but validation accuracy has plateaued: overfitting
• Validation accuracy is ~65%, which is better than the LSTM
• Validation loss is not decreasing much while the training loss has decreased: overfitting
https://mccormickml.com/2019/07/22/BERT-fine-tuning/
35. Comment-Type Penalty Calculation for Developers
1. For each developer, find the various types of comments and their counts
2. Calculate the penalty based on the comment-type weights
3. Aggregate the penalty for each developer column-wise
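The three steps above can be sketched as a count-then-weight aggregation, using the penalty weights from the earlier comment-type table:

```python
from collections import Counter

# Penalty weights from the comment-type penalty table.
PENALTY = {
    "BUGGYCODE SUGGESTION": -10, "CODE SUGGESTION": -8,
    "REFACTORING SUGGESTION": -7, "REUSE SUGGESTION": -6,
    "COMPLETENESS SUGGESTION": -5, "QUESTIONING": -4,
    "LOGICAL SUGGESTION": -3, "SPELLING": -2,
    "DOCUMENTATION SUGGESTION": -1,
}

def developer_penalty(comment_types: list) -> int:
    """Total penalty for one developer: count each comment type across
    all their PRs, multiply by its weight, and sum."""
    counts = Counter(comment_types)                      # step 1
    return sum(PENALTY[t] * n for t, n in counts.items())  # steps 2-3
```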
36. Successes
• The PR Scorecard is currently being used for the quarterly review of 80% of the R&D teams
• ~10% of developers who were not performing well were given focused training to improve their productivity; we have seen an upward trend in their performance since then
• Fewer slippages on deadlines, as scrum masters consider the relative efficiency of developers before giving task estimates
37. Next Steps
• We are improving BERT & other transformer models on the comment classification task by adding more training data & trying different pretrained models
• We are replicating a similar scorecard for the Customer Support team, where the metrics will depend on First Response Time, Total Resolution Time, Number of Times a Case Was Reopened, Customer Comments, etc.