By J. Henry Hinnefeld
PyData New York City 2017
Git is a powerful tool for code versioning. If you follow its best practices and have good ‘commit hygiene’, it can also be a source of valuable data about your coding practices. In this talk I’ll describe a system we built at Civis that uses the metadata git collects, along with its logging and ‘blaming’ functionality, to score commits in real time on their likelihood of introducing a bug.
5. 5Civis Analytics | Proprietary and Confidential
what is git?
1. a version control tool
2. a source of interesting metadata about the history of your codebase
what is a git commit?
git records changes to your code in units called ‘commits’.
● a single commit can contain changes to multiple files
● each commit gets assigned a unique identifier
● each commit also has metadata attached to it.
getting at git metadata
three git commands are particularly useful for extracting git
metadata:
● git log shows the history of changes to a project.
different arguments expose different data, e.g.
○ the commit message
○ which files and lines were changed
○ who made the change
● git diff shows the changes between two commits
● git blame identifies the last commit to modify a
particular line of code
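As a rough illustration of pulling this metadata into Python, the sketch below asks git log for a few fields per commit and parses them into dictionaries. The format string, the choice of fields, and the helper names parse_log and read_log are assumptions made for this example; they are not from the talk.

```python
import subprocess

# git expands %x1f / %x1e to the ASCII unit- and record-separator bytes,
# which are very unlikely to appear inside commit text.
# %H = commit hash, %an = author name, %ad = author date, %s = subject line
LOG_FORMAT = "%H%x1f%an%x1f%ad%x1f%s%x1e"
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"


def parse_log(raw):
    """Turn raw `git log --pretty=format:LOG_FORMAT` output into dicts."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # Strip only newlines: Python treats \x1f as whitespace, so a bare
        # strip() could eat the separator before an empty subject.
        record = record.strip("\n")
        if not record:
            continue
        sha, author, date, subject = record.split(FIELD_SEP)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


def read_log(repo_path="."):
    """Run git log in repo_path and return one dict per commit."""
    raw = subprocess.check_output(
        ["git", "log", "--date=iso-strict", f"--pretty=format:{LOG_FORMAT}"],
        cwd=repo_path,
        text=True,
    )
    return parse_log(raw)
```

Keeping the parsing separate from the subprocess call makes the parser easy to check against a hand-written string.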
what can you do with git metadata?
build a model of code bug risk!
1. identify and label commits which introduced bugs
2. build commit-level features
3. train a binary classifier on the features and labels
4. score each new commit with the model using git hooks
step 1: identify commits which introduced bugs
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
● identify the lines which were changed in that commit
using git diff
● find the last commit to modify those lines using git blame
● label that commit as having introduced a bug
caveat: this depends heavily on having good ‘commit
hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
at the end we’ll go over some tips for better commit hygiene
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
git log -i --grep='bug' --grep='fix' --pretty=oneline --abbrev-commit
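Note that --grep patterns match anywhere in the commit message, so this command also returns commits that merely mention a bug or a fix. If you want the stricter ‘starts with’ rule, one option is to filter the one-line output in Python. The sketch below is an assumption about how that could look: the regular expression and the names is_bugfix and bugfix_commits are illustrative, not from the talk.

```python
import re

# Subjects that begin with "bug", "bugfix", "fix", "fixes" or "fixed",
# in any letter case. The trailing \b keeps words such as "fixture"
# from matching.
BUGFIX_PREFIX = re.compile(r"^(bug(fix)?|fix(es|ed)?)\b", re.IGNORECASE)


def is_bugfix(subject):
    """True if the commit subject starts with a bug-fix marker."""
    return BUGFIX_PREFIX.match(subject.strip()) is not None


def bugfix_commits(oneline_output):
    """Filter `git log --pretty=oneline --abbrev-commit` output.

    Each input line is "<abbreviated hash> <subject>"; the result is the
    list of hashes whose subject passes is_bugfix.
    """
    fixes = []
    for line in oneline_output.splitlines():
        sha, _, subject = line.partition(" ")
        if sha and is_bugfix(subject):
            fixes.append(sha)
    return fixes
```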
● identify the lines which were changed in that commit
using git diff
git diff <commit hash>^ <commit hash> -U0
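With -U0 the diff carries no context lines, so each hunk header names exactly the lines that changed. A minimal sketch of the ‘string parsing’ side, assuming standard unified-diff output; the function name changed_old_ranges is hypothetical. It returns line ranges numbered as in the parent commit, which is the numbering git blame needs in the next step.

```python
import re

# Hunk header: @@ -<old start>[,<old count>] +<new start>[,<new count>] @@
# A missing count means 1; an old count of 0 means the hunk only adds lines.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def changed_old_ranges(diff_text):
    """Return (path, start, stop) for every block of lines the diff
    removed or replaced, using the parent commit's line numbers."""
    ranges = []
    path = None
    in_header = False  # True between "diff --git" and the first hunk
    for line in diff_text.split("\n"):
        if line.startswith("diff --git "):
            in_header, path = True, None
        elif in_header and line.startswith("--- "):
            # "--- a/<path>" names the pre-change file;
            # "--- /dev/null" means the file is new, so nothing to blame.
            target = line[4:]
            path = target[2:] if target.startswith("a/") else None
        elif line.startswith("@@ "):
            in_header = False
            match = HUNK_RE.match(line)
            if match and path is not None:
                start = int(match.group(1))
                count = 1 if match.group(2) is None else int(match.group(2))
                if count > 0:
                    ranges.append((path, start, start + count - 1))
    return ranges
```

Hunks that only add lines are skipped because there are no earlier lines to blame for them.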
● find the last commit to modify those lines using git blame
git blame <file path> <commit hash>^ -L<start line #>,<stop line #>
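The default blame output is laid out for people to read. For scripting, git blame also has a --porcelain mode with a stable, line-oriented format. Adding that flag is an assumption of this sketch (the slide uses the default output), and the names blamed_commits and blame_range are hypothetical.

```python
import re
import subprocess

# In --porcelain output every blamed line is introduced by a header:
#   <commit hash> <original line> <final line> [<lines in this group>]
# File content lines start with a tab, so they can never match.
HEADER_RE = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+(?: \d+)?$")


def blamed_commits(porcelain_text):
    """Return the set of commit hashes named in porcelain blame output."""
    shas = set()
    for line in porcelain_text.split("\n"):
        match = HEADER_RE.match(line)
        if match:
            shas.add(match.group(1))
    return shas


def blame_range(path, fix_sha, start, stop, repo_path="."):
    """Blame lines start..stop of path as of the parent of fix_sha."""
    out = subprocess.check_output(
        ["git", "blame", "--porcelain", f"-L{start},{stop}",
         f"{fix_sha}^", "--", path],
        cwd=repo_path,
        text=True,
    )
    return blamed_commits(out)
```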
● label that commit as having introduced a bug
with a little python string parsing and some
subprocess.check_output calls we can repeat this
process for each bugfix commit, and so label all commits by
whether or not they introduced a bug.
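Once each bugfix commit has been mapped to the commits that blame points at, the labels follow directly. A minimal sketch, assuming that mapping has already been collected; label_commits and its argument names are hypothetical.

```python
def label_commits(all_shas, blamed_by_fix):
    """Map every commit hash to 1 if some bugfix blamed it, else 0.

    all_shas      -- iterable of every commit hash in the history
    blamed_by_fix -- dict: bugfix commit hash -> set of blamed hashes
    """
    buggy = set()
    for blamed in blamed_by_fix.values():
        buggy.update(blamed)
    return {sha: int(sha in buggy) for sha in all_shas}
```

The hashes must be in the same form on both sides (all full or all abbreviated to the same length), otherwise the lookups silently miss.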
step 2: build commit-level features
if we want to build a model we need features and labels.
we just made some labels, so next let’s make some features.
git log --stat
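The feature list itself did not survive in this transcript, so the sketch below only shows one plausible starting point: reading the per-commit summary line that git log --stat prints. The three features and the name stat_features are assumptions for illustration, not the features used in the talk.

```python
import re

# Default git log output starts each commit with "commit <hash>".
COMMIT_RE = re.compile(r"^commit ([0-9a-f]{7,64})")
# Summary line, e.g. " 2 files changed, 11 insertions(+), 2 deletions(-)".
# Either the insertions or the deletions part may be absent.
SUMMARY_RE = re.compile(
    r"^ (\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?"
)


def stat_features(log_stat_output):
    """Return {commit hash: feature dict} from `git log --stat` output."""
    features = {}
    current = None
    for line in log_stat_output.split("\n"):
        commit = COMMIT_RE.match(line)
        if commit:
            current = commit.group(1)
            # Commits with no stat block (e.g. merges) keep these zeros.
            features[current] = {
                "files_changed": 0, "insertions": 0, "deletions": 0,
            }
            continue
        summary = SUMMARY_RE.match(line)
        if summary and current is not None:
            files, ins, dels = summary.groups()
            features[current] = {
                "files_changed": int(files),
                "insertions": int(ins or 0),
                "deletions": int(dels or 0),
            }
    return features
```

Commit message lines are indented by four spaces in this output, so they cannot be mistaken for either the commit line or the summary line.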
step 3: train a binary classifier on the features and labels
AUC = 0.76
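For readers unfamiliar with the metric: AUC is the probability that the model scores a randomly chosen bug-introducing commit higher than a randomly chosen clean one, so 0.5 is chance and 1.0 is a perfect ranking. The pure-Python sketch below computes it by direct pair counting. It illustrates the definition only and is not the evaluation code from the talk.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly.

    labels -- 1 for commits that introduced a bug, 0 otherwise
    scores -- model scores, higher meaning riskier
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Pair counting is quadratic in the number of commits, which is fine for a sketch; library implementations use a sort-based method instead.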
again, this really requires good ‘commit hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
step 4: score each new commit with the model using git hooks
git hooks are small scripts in <your repo>/.git/hooks/ that
run automatically based on certain git actions.
some available hooks are:
● pre-commit
● commit-msg
● post-commit
for model scoring we want the post-commit hook:
<repo>/.git/hooks/post-commit
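The hook script shown in the talk is not part of this transcript. As a stand-in, here is a minimal sketch of what a Python post-commit hook could look like. Everything model-related is hypothetical: predict_risk is a toy placeholder for the trained classifier, and RISK_THRESHOLD and the message format are invented for the example. The file must be executable for git to run it.

```python
#!/usr/bin/env python3
"""Sketch of <repo>/.git/hooks/post-commit."""
import subprocess

RISK_THRESHOLD = 0.5  # hypothetical cut-off for printing a warning


def predict_risk(features):
    """Hypothetical stand-in for the trained classifier.

    Toy rule only: bigger commits score higher, capped at 1.0.
    """
    return min(1.0, features["lines_changed"] / 500.0)


def format_message(sha, risk, threshold=RISK_THRESHOLD):
    """Build the one-line report printed after each commit."""
    level = "WARNING" if risk >= threshold else "ok"
    return f"[bug-risk] {sha[:7]} score={risk:.2f} ({level})"


def main():
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
        # --numstat prints "<added>\t<deleted>\t<path>" per file;
        # the empty --format= suppresses the commit header.
        numstat = subprocess.check_output(
            ["git", "show", "--numstat", "--format=", sha],
            text=True, stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return  # git unavailable or no commit to inspect: stay silent
    lines_changed = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        # Binary files report "-" instead of counts; skip them.
        if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
            lines_changed += int(parts[0]) + int(parts[1])
    risk = predict_risk({"lines_changed": lines_changed})
    print(format_message(sha, risk))


if __name__ == "__main__":
    main()
```

The post-commit hook runs after the commit has been created, so it can report a score but cannot block the commit.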
why care about commit hygiene?
● clean commit histories can help you understand the
evolution of the codebase at a glance.
● it adds ‘documentation’ to every line of code: to
understand some confusing lines read the commit
messages of the commits which added those lines.
● it’s not that hard.
improving git commit messages
use the commit-msg hook to
● reject commit messages that are too short
● enforce standards around message tags (e.g. BUG, TST, WIP)
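A minimal sketch of such a hook in Python. Git calls commit-msg with one argument, the path of the file holding the proposed message, and aborts the commit if the hook exits non-zero. The minimum length, the "TAG: subject" convention, and the function names are assumptions made for this example; only the tags BUG, TST and WIP come from the slide.

```python
#!/usr/bin/env python3
"""Sketch of <repo>/.git/hooks/commit-msg (the file must be executable)."""
import sys

ALLOWED_TAGS = ("BUG", "TST", "WIP")  # tags named on the slide; extend as needed
MIN_LENGTH = 15  # hypothetical minimum subject length


def check_message(message):
    """Return a list of problems; an empty list means the message passes."""
    # Lines starting with "#" are comments that git strips from the message.
    lines = [
        ln.strip() for ln in message.splitlines()
        if ln.strip() and not ln.startswith("#")
    ]
    subject = lines[0] if lines else ""
    problems = []
    if len(subject) < MIN_LENGTH:
        problems.append(f"subject is shorter than {MIN_LENGTH} characters")
    # Assumed convention: "TAG: short description"
    tag = subject.split(":", 1)[0] if ":" in subject else ""
    if tag not in ALLOWED_TAGS:
        problems.append("subject must start with one of: " + ", ".join(ALLOWED_TAGS))
    return problems


def main(message_path):
    with open(message_path, encoding="utf-8") as handle:
        problems = check_message(handle.read())
    for problem in problems:
        print(f"commit-msg: {problem}", file=sys.stderr)
    return 1 if problems else 0


# git passes the message file path as the only argument
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```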
what if I already made lots of edits to one file?
git add --patch <file>
if you’ve made lots of edits to a single file you can split those
edits into separate commits with the --patch option
what if I already have a messy commit history?
git rebase -i <base commit hash>
interactive rebasing lets you rewrite your commit history:
you can combine, split, reorder, and reword commits
with rebase -i it is super easy to make tons of small commits
as you work and then quickly combine them into a clean
commit history.
warning: only rebase commits you haven’t shared with anyone
yet. <base commit hash> should be the most recent commit
that other people can see (e.g. the tip of upstream master).