git risky
using git metadata to predict code bug risk
J. Henry Hinnefeld
hhinnefeld@civisanalytics.com
hinnefe2
DrJSomeday
Outline
1. intro to git
2. the model
3. git tips
intro to git & git metadata
Civis Analytics | Proprietary and Confidential
what is git?
1. a version control tool
2. a source of interesting metadata about the history of your codebase
what is a git commit?
git records changes to your code in units called ‘commits’.
● a single commit can contain changes to multiple files
● each commit is assigned a unique identifier (its hash)
● each commit also has metadata attached to it
getting at git metadata
three git commands are particularly useful for extracting git
metadata:
● git log shows the history of changes to a project.
different arguments expose different data, e.g.
○ the commit message
○ which files and lines were changed
○ who made the change
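A minimal sketch of pulling these fields out of git log from Python. The placeholders %H (hash), %an (author name), %ad (author date) and %s (subject) are standard git log format codes; the separator character and the helper names parse_log and read_log are assumptions made for this example.

```python
import subprocess

# Field separator: the ASCII "unit separator", unlikely to appear in commit text.
SEP = "\x1f"
# %H = commit hash, %an = author name, %ad = author date, %s = subject line.
LOG_FORMAT = SEP.join(["%H", "%an", "%ad", "%s"])


def parse_log(raw):
    """Turn `git log --pretty=format:<LOG_FORMAT>` output into a list of dicts."""
    commits = []
    for line in raw.split("\n"):
        if not line.strip():
            continue
        # maxsplit=3 keeps the subject whole even if it contains the separator.
        commit_hash, author, date, subject = line.split(SEP, 3)
        commits.append(
            {"hash": commit_hash, "author": author, "date": date, "subject": subject}
        )
    return commits


def read_log(repo_path="."):
    """Run git log inside repo_path (needs git on the PATH) and parse the result."""
    raw = subprocess.check_output(
        ["git", "log", "--pretty=format:" + LOG_FORMAT],
        cwd=repo_path,
        text=True,
    )
    return parse_log(raw)
```

Calling read_log() inside a repository returns one dict per commit, newest first.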
● git diff shows the changes between two commits
● git blame identifies the last commit to modify a
particular line of code
building a code bug risk model
what can you do with git metadata?
build a model of code bug risk!
1. identify and label commits which introduced bugs
2. build commit-level features
3. train a binary classifier on the features and labels
4. score each new commit with the model using git hooks
step 1: identify commits which introduced bugs
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
● identify the lines which were changed in that commit
using git diff
● find the last commit to modify those lines using git blame
● label that commit as having introduced a bug
step 1: identify commits which introduced bugs
caveat: this depends heavily on having good ‘commit
hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
at the end we’ll go over some tips for better commit hygiene
step 1: identify commits which introduced bugs
● find commits which fix a bug by checking for commit
messages which start with ‘BUG’ or ‘Fix’
git log -i --grep='bug' --grep='fix' --pretty=oneline --abbrev-commit
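Note that --grep matches the pattern anywhere in the commit message. A sketch of the stricter "starts with" check, applied in Python to the --pretty=oneline --abbrev-commit output, might look like this; the helper names parse_oneline and bugfix_hashes are made up for the example.

```python
import re

# Matches subjects that start with "bug" or "fix", ignoring case.
BUGFIX_PREFIX = re.compile(r"^(bug|fix)", re.IGNORECASE)


def parse_oneline(raw):
    """Split `git log --pretty=oneline --abbrev-commit` output into (hash, subject) pairs."""
    pairs = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        # Each line is "<abbreviated hash> <subject>".
        commit_hash, _, subject = line.partition(" ")
        pairs.append((commit_hash, subject))
    return pairs


def bugfix_hashes(raw):
    """Return the hashes of commits whose subject starts with 'bug' or 'fix'."""
    return [h for h, subject in parse_oneline(raw) if BUGFIX_PREFIX.match(subject)]
```

How well this works depends entirely on how consistently the team tags its bugfix commits.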
step 1: identify commits which introduced bugs
● identify the lines which were changed in that commit
using git diff
git diff <commit hash>^ <commit hash> -U0
step 1: identify commits which introduced bugs
● find the last commit to modify those lines using git blame
git blame <file path> <commit hash>^ -L<start line #>,<stop line #>
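For scripting, the --porcelain option of git blame is easier to parse than the default output: every blamed line is introduced by a header that starts with the full commit hash followed by line numbers. The sketch below assumes 40-character SHA-1 hashes; the helper names blamed_commits and blame_range are made up for the example.

```python
import re
import subprocess

# Porcelain header: "<40-hex hash> <original line> <final line> [<lines in group>]".
PORCELAIN_HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def blamed_commits(porcelain_text):
    """Return the set of commit hashes named in `git blame --porcelain` output."""
    hashes = set()
    for line in porcelain_text.split("\n"):
        # Source lines are prefixed with a tab, so they can never match the header.
        match = PORCELAIN_HEADER.match(line)
        if match:
            hashes.add(match.group(1))
    return hashes


def blame_range(path, fix_hash, start, stop, repo_path="."):
    """Blame lines start..stop of path as of the parent of the bugfix commit."""
    cmd = [
        "git", "blame", "--porcelain",
        "-L", f"{start},{stop}",
        fix_hash + "^", "--", path,
    ]
    return blamed_commits(subprocess.check_output(cmd, cwd=repo_path, text=True))
```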
step 1: identify commits which introduced bugs
● label that commit as having introduced a bug
with a little python string parsing and some
subprocess.check_output calls we can repeat this
process for each bugfix commit, and so label all commits by
whether or not they introduced a bug.
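Once every bugfix commit has been mapped to the commits that last touched the lines it fixed, the labeling itself is a small set operation. A sketch, assuming the hashes are all in the same form (all full or all abbreviated to the same length); the name label_commits and the shape of blamed_by_fix are assumptions for illustration.

```python
def label_commits(all_hashes, blamed_by_fix):
    """Label each commit 1 if some bugfix commit blamed it, else 0.

    all_hashes:     iterable of every commit hash in the history
    blamed_by_fix:  dict mapping a bugfix commit hash to the set of
                    commit hashes that git blame pointed at
    """
    buggy = set()
    for blamed in blamed_by_fix.values():
        buggy.update(blamed)
    return {commit: int(commit in buggy) for commit in all_hashes}
```

For example, label_commits(["c1", "c2", "c3"], {"c3": {"c1"}}) marks c1 as bug-introducing and leaves c2 and c3 labeled 0.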
step 2: build commit-level features
if we want to build a model we need features and labels.
we just made some labels, so next let’s make some features.
git log --stat
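The transcript does not preserve which features were highlighted on the slides, so the following is only an illustration. It parses the summary line that git log --stat prints for each commit (for example " 3 files changed, 15 insertions(+), 2 deletions(-)") into three numeric features; the feature names are assumptions.

```python
import re

FILES = re.compile(r"(\d+) files? changed")
INSERTIONS = re.compile(r"(\d+) insertions?\(\+\)")
DELETIONS = re.compile(r"(\d+) deletions?\(-\)")


def stat_features(summary_line):
    """Extract simple size features from a `git log --stat` summary line."""

    def count(pattern):
        # git omits the insertions or deletions part when the count is zero.
        match = pattern.search(summary_line)
        return int(match.group(1)) if match else 0

    return {
        "files_changed": count(FILES),
        "insertions": count(INSERTIONS),
        "deletions": count(DELETIONS),
    }
```

Other commit-level features (author, time of day, message length) could be built the same way from the other git log fields.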
step 3: train a binary classifier on the features and labels
AUC = 0.76
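The transcript does not say which classifier produced this number. As a reminder of what it measures: AUC is the probability that a randomly chosen bug-introducing commit receives a higher score than a randomly chosen clean commit, so 0.5 is chance and 1.0 is a perfect ranking. A small pure-Python sketch of that definition (the function name auc is just for this example; it is quadratic in the number of commits, which is fine for illustration):

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs that the scores rank correctly.
    Ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], three of the four positive/negative pairs are ordered correctly, so auc returns 0.75.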
again, this really requires good ‘commit hygiene’
● each commit should only do one thing
● commit messages should be meaningful and standardized
step 4: score each new commit with the model using git hooks
git hooks are small scripts in <your repo>/.git/hooks/ that
run automatically based on certain git actions.
some available hooks are:
● pre-commit
● commit-msg
● post-commit
for model scoring we want the post-commit hook:
<repo>/.git/hooks/post-commit
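The hook script itself is not preserved in this transcript. A rough sketch of what a Python post-commit hook could look like is below; score_commit is a hypothetical stand-in for the trained model, and the feature names are assumptions. The file has to be executable for git to run it, and because it runs after the commit is recorded it can only report a score, not block the commit.

```python
#!/usr/bin/env python3
import subprocess


def score_commit(features):
    """Hypothetical stand-in for the trained model: returns a score in [0, 1].

    A real hook would load the fitted classifier and call it here.
    """
    churn = features["insertions"] + features["deletions"]
    return min(1.0, churn / 500.0)


def parse_numstat(raw):
    """Sum `git show --numstat` rows of the form "<added>\t<deleted>\t<path>"."""
    features = {"files_changed": 0, "insertions": 0, "deletions": 0}
    for line in raw.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        added, deleted, _path = parts
        features["files_changed"] += 1
        # Binary files report "-" instead of line counts.
        features["insertions"] += int(added) if added.isdigit() else 0
        features["deletions"] += int(deleted) if deleted.isdigit() else 0
    return features


def main():
    try:
        # --format= suppresses the commit header so only the numstat rows remain.
        raw = subprocess.check_output(
            ["git", "show", "--numstat", "--format=", "HEAD"],
            text=True,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return  # git is unavailable or there is no commit to inspect: stay silent
    score = score_commit(parse_numstat(raw))
    print(f"bug risk score for this commit: {score:.2f}")


if __name__ == "__main__":
    main()
```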
tips for better commit hygiene
why care about commit hygiene?
● clean commit histories can help you understand the
evolution of the codebase at a glance.
● it adds ‘documentation’ to every line of code: to
understand some confusing lines read the commit
messages of the commits which added those lines.
● it’s not that hard.
improving git commit messages
use the commit-msg hook to
● reject commit messages that are too short
● enforce standards around message tags
(e.g. BUG, TST, WIP, etc.)
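A sketch of such a commit-msg hook in Python. Git calls the hook with the path of the file holding the proposed message as its only argument, and a non-zero exit status aborts the commit. BUG, TST and WIP come from the slide; the other tags, the "TAG:" prefix convention and the length threshold are assumptions for illustration.

```python
#!/usr/bin/env python3
import sys

# BUG, TST and WIP come from the slide; ENH and DOC are assumed additions.
ALLOWED_TAGS = ("BUG", "TST", "WIP", "ENH", "DOC")
MIN_LENGTH = 15  # assumed threshold for "too short"


def check_message(message):
    """Return a list of problems with the commit message (empty if it is fine)."""
    # Skip blank lines and the "#" comment lines git puts in the message template.
    lines = [
        ln.strip()
        for ln in message.splitlines()
        if ln.strip() and not ln.startswith("#")
    ]
    subject = lines[0] if lines else ""
    problems = []
    if len(subject) < MIN_LENGTH:
        problems.append(f"subject is shorter than {MIN_LENGTH} characters")
    if not subject.startswith(tuple(tag + ":" for tag in ALLOWED_TAGS)):
        problems.append("subject must start with one of: " + ", ".join(ALLOWED_TAGS))
    return problems


def main(message_path):
    with open(message_path, encoding="utf-8") as handle:
        problems = check_message(handle.read())
    for problem in problems:
        print("commit rejected: " + problem, file=sys.stderr)
    return 1 if problems else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Saved as an executable file at <repo>/.git/hooks/commit-msg, this would reject a message such as "oops" and accept "BUG: fix off-by-one in pager".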
what if I already made lots of edits to one file?
git add --patch <file>
if you’ve made lots of edits to a single file you can split those
edits into separate commits with the --patch option
what if I already have a messy commit history?
git rebase -i <base commit hash>
interactive rebasing lets you rewrite your commit history:
you can combine, split, reorder, and reword commits
with rebase -i it is super easy to make tons of small commits
as you work and then quickly combine them into a clean
commit history.
warning: only rebase code you haven’t shared with anyone
yet. <base commit hash> should be the most recent commit
that other people can see (e.g. the upstream master).
questions?

Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Git risky: using git metadata to predict code bug risk

  • 1. git risky: using git metadata to predict code bug risk
  • 3. Outline: 1. intro to git 2. the model 3. git tips
  • 4. intro to git & git metadata
  • 5. what is git? (slide footer: Civis Analytics | Proprietary and Confidential)
      1. a version control tool
  • 6. what is git?
      1. a version control tool
      2. a source of interesting metadata about the history of your codebase
  • 7. what is a git commit? git records changes to your code in units called ‘commits’.
      ● a single commit can contain changes to multiple files
      ● each commit gets assigned a unique identifier
      ● each commit also has metadata attached to it
  • 8. getting at git metadata: three git commands are particularly useful for extracting git metadata:
      ● git log shows the history of changes to a project. different arguments expose different data, e.g.
          ○ the commit message
          ○ which files and lines were changed
          ○ who made the change
  • 11. getting at git metadata: three git commands are particularly useful for extracting git metadata:
      ● git diff shows the changes between two commits
  • 12. getting at git metadata: three git commands are particularly useful for extracting git metadata:
      ● git blame identifies the last commit to modify a particular line of code
  • 13. building a code bug risk model
  • 14. what can you do with git metadata? build a model of code bug risk!
      1. identify and label commits which introduced bugs
      2. build commit-level features
      3. train a binary classifier on the features and labels
      4. score each new commit with the model using git hooks
  • 15. step 1: identify commits which introduced bugs
      ● find commits which fix a bug by checking for commit messages which start with ‘BUG’ or ‘Fix’
      ● identify the lines which were changed in that commit using git diff
      ● find the last commit to modify those lines using git blame
      ● label that commit as having introduced a bug
  • 16. step 1: identify commits which introduced bugs. caveat: this depends heavily on having good ‘commit hygiene’
      ● each commit should only do one thing
      ● commit messages should be meaningful and standardized
      at the end we’ll go over some tips for better commit hygiene
  • 17. step 1: identify commits which introduced bugs
      ● find commits which fix a bug by checking for commit messages which start with ‘BUG’ or ‘Fix’
      git log -i --grep='bug' --grep='fix' --pretty=oneline --abbrev-commit
  • 18. step 1: identify commits which introduced bugs
      ● identify the lines which were changed in that commit using git diff
      git diff <commit hash>^ <commit hash> -U0
  • 19. step 1: identify commits which introduced bugs
      ● find the last commit to modify those lines using git blame
      git blame <file path> <commit hash>^ -L<start line #>,<stop line #>
  • 20. step 1: identify commits which introduced bugs
      ● label that commit as having introduced a bug
      with a little python string parsing and some subprocess.check_output calls we can repeat this process for each bugfix commit, and so label all commits by whether or not they introduced a bug.
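The labeling loop from step 1 can be sketched in Python along the lines the slides suggest (string parsing plus subprocess.check_output). This is a minimal sketch, not the speaker's code: the function names are hypothetical, it uses git blame's --porcelain output (an assumption chosen because it is easier to parse), and it must be run from inside a repository. Note that the git log command as written on slide 17 matches 'bug' or 'fix' anywhere in the message, case-insensitively, which is broader than "starts with".

```python
import re
import subprocess

# "@@ -<old start>[,<old count>] +<new start>[,<new count>] @@"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")
# Porcelain blame header: "<40-hex sha> <orig line> <final line> [<count>]"
BLAME_RE = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def git(*args):
    """Run a git command in the current repository and return its stdout."""
    return subprocess.check_output(["git", *args], text=True)


def changed_old_ranges(diff_text):
    """Yield (path, start, stop) for pre-change lines touched by a -U0 diff."""
    path, in_header = None, False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            path, in_header = None, True
        elif in_header and line.startswith("--- "):
            # "--- a/<path>" names the old file; "--- /dev/null" is a new file.
            path = line[6:] if line.startswith("--- a/") else None
        else:
            match = HUNK_RE.match(line)
            if match:
                in_header = False
                start = int(match.group(1))
                count = 1 if match.group(2) is None else int(match.group(2))
                # count == 0 is a pure addition: no old lines to blame.
                if path is not None and count > 0:
                    yield path, start, start + count - 1


def blamed_commits(porcelain_text):
    """Return the set of commit hashes named in `git blame --porcelain` output."""
    matches = (BLAME_RE.match(line) for line in porcelain_text.splitlines())
    return {m.group(1) for m in matches if m}


def bug_introducing_commits():
    """Label commits whose lines were later rewritten by a bugfix commit."""
    fixes = git("log", "-i", "--grep=bug", "--grep=fix",
                "--pretty=format:%H").split()
    introducers = set()
    for fix in fixes:
        try:
            diff = git("diff", "-U0", f"{fix}^", fix)
        except subprocess.CalledProcessError:
            continue  # e.g. a root commit, which has no parent to diff against
        for path, start, stop in changed_old_ranges(diff):
            blame = git("blame", "--porcelain", f"-L{start},{stop}",
                        f"{fix}^", "--", path)
            introducers |= blamed_commits(blame)
    return introducers
```

Calling bug_introducing_commits() returns a set of full commit hashes; every other commit in the history would then get the negative label.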
  • 21. step 2: build commit-level features. if we want to build a model we need features and labels. we just made some labels, so next let’s make some features.
      git log --stat
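The transcript does not record which features the talk's model used, so the following is only an illustrative sketch of turning git log output into per-commit numbers. It swaps in git log --numstat (a machine-readable variant of --stat) and computes three assumed features: files changed, lines added, and lines deleted. The marker string and function names are hypothetical.

```python
import subprocess

MARK = ">>"  # hypothetical prefix used to spot commit lines in the log output


def parse_numstat(log_text):
    """Map commit hash -> size features, from `git log --numstat` output."""
    features = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith(MARK):
            current = line[len(MARK):]
            features[current] = {
                "files_changed": 0, "lines_added": 0, "lines_deleted": 0,
            }
            continue
        parts = line.split("\t", 2)  # "<added>\t<deleted>\t<path>"
        if current is None or len(parts) != 3:
            continue  # blank separator lines and anything unexpected
        added, deleted, _path = parts
        row = features[current]
        row["files_changed"] += 1
        # Binary files are reported as "-" for both counts.
        row["lines_added"] += int(added) if added != "-" else 0
        row["lines_deleted"] += int(deleted) if deleted != "-" else 0
    return features


def commit_features():
    """Collect features for every commit reachable from HEAD."""
    out = subprocess.check_output(
        ["git", "log", "--numstat", f"--pretty=format:{MARK}%H"], text=True
    )
    return parse_numstat(out)
```

Joining this dictionary with the step 1 labels on the commit hash gives one row per commit for training.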
  • 24. step 3: train a binary classifier on the features and labels. AUC = 0.76
  • 25. step 3: train a binary classifier on the features and labels
  • 26. step 3: train a binary classifier on the features and labels. again, this really requires good ‘commit hygiene’
      ● each commit should only do one thing
      ● commit messages should be meaningful and standardized
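The slides report AUC = 0.76 for the classifier but do not say which model or library produced it. As a reminder of what that number measures, here is a small pure-Python sketch of ROC AUC: the probability that a randomly chosen bug-introducing commit receives a higher score than a randomly chosen clean one, with ties counted as half. The function name and the toy inputs are illustrative only, not data from the talk.

```python
def roc_auc(labels, scores):
    """ROC AUC computed by comparing every positive/negative score pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = introduced a bug, 0 = did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.76 indicates the model ranks a buggy commit above a clean one about three times out of four.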
  • 27. step 4: score each new commit with the model using git hooks. git hooks are small scripts in <your repo>/.git/hooks/ that run automatically based on certain git actions. some available hooks are:
      ● pre-commit
      ● commit-msg
      ● post-commit
  • 28. step 4: score each new commit with the model using git hooks. for model scoring we want the post-commit hook:
      <repo>/.git/hooks/post-commit
  • 29. step 4: score each new commit with the model using git hooks
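The contents of the post-commit hook shown in the talk are not preserved in the transcript. A minimal sketch of what such a hook could look like in Python follows. The weights and bias are HYPOTHETICAL placeholders standing in for a trained model, and the three size features are an assumption; a real hook would load the classifier from step 3 instead.

```python
#!/usr/bin/env python3
import math
import subprocess

# HYPOTHETICAL logistic-regression parameters, for illustration only.
WEIGHTS = {"files_changed": 0.1, "lines_added": 0.01, "lines_deleted": 0.01}
BIAS = -3.0


def head_features():
    """Compute size features for the commit that was just created (HEAD)."""
    out = subprocess.check_output(
        ["git", "log", "-1", "--numstat", "--pretty=format:"], text=True
    )
    feats = {"files_changed": 0, "lines_added": 0, "lines_deleted": 0}
    for line in out.splitlines():
        parts = line.split("\t", 2)  # "<added>\t<deleted>\t<path>"
        if len(parts) != 3:
            continue
        added, deleted, _path = parts
        feats["files_changed"] += 1
        feats["lines_added"] += int(added) if added != "-" else 0
        feats["lines_deleted"] += int(deleted) if deleted != "-" else 0
    return feats


def risk_score(features):
    """Logistic score in (0, 1) from the placeholder weights above."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def main():
    try:
        features = head_features()
    except (OSError, subprocess.CalledProcessError):
        return  # scoring is advisory, so fail quietly
    print(f"estimated bug risk for this commit: {risk_score(features):.2f}")


if __name__ == "__main__":
    main()
```

To try it, save the script as <repo>/.git/hooks/post-commit and make it executable. Because post-commit runs after the commit has been recorded, it can only report the score and cannot block the commit.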
  • 30. tips for better commit hygiene
  • 31. why care about commit hygiene?
      ● clean commit histories can help you understand the evolution of the codebase at a glance.
      ● it adds ‘documentation’ to every line of code: to understand some confusing lines, read the commit messages of the commits which added those lines.
      ● it’s not that hard.
  • 32. improving git commit messages: use the commit-msg hook to
      ● reject commit messages that are too short
      ● enforce standards around message tags (e.g. BUG, TST, WIP, etc.)
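A commit-msg hook of this kind could be sketched in Python as below. Git passes the path of the file holding the proposed message as the hook's first argument, and a non-zero exit status aborts the commit. The minimum length and the rule that the first word must be one of the tags are assumptions for illustration; the slide names only BUG, TST and WIP as example tags.

```python
#!/usr/bin/env python3
import sys

MIN_LENGTH = 10                       # assumed threshold, tune per team
ALLOWED_TAGS = ("BUG", "TST", "WIP")  # example tags from the slide; extend as needed


def message_problems(message):
    """Return a list of human-readable problems with a commit message."""
    # Ignore blank lines and the "#" comment lines git puts in the template.
    lines = [
        line.strip() for line in message.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    subject = lines[0] if lines else ""
    first_word = subject.split(maxsplit=1)[0].rstrip(":") if subject else ""
    problems = []
    if len(subject) < MIN_LENGTH:
        problems.append(f"subject is shorter than {MIN_LENGTH} characters")
    if first_word not in ALLOWED_TAGS:
        problems.append(f"subject must start with one of {', '.join(ALLOWED_TAGS)}")
    return problems


def main(argv):
    with open(argv[1], encoding="utf-8") as handle:
        problems = message_problems(handle.read())
    for problem in problems:
        print(f"commit-msg: {problem}", file=sys.stderr)
    return 1 if problems else 0  # non-zero makes git abort the commit


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Saved as <repo>/.git/hooks/commit-msg and made executable, this would accept "BUG: fix off-by-one in pagination" and reject a bare "fix".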
  • 33. what if I already made lots of edits to one file?
      git add --patch <file>
      if you’ve made lots of edits to a single file you can split those edits into separate commits with the --patch option
  • 34. what if I already have a messy commit history?
      git rebase -i <base commit hash>
      interactive rebasing lets you rewrite your commit history: you can combine, split, reorder, and reword commits
  • 35. what if I already have a messy commit history?
      git rebase -i <base commit hash>
      with rebase -i it is super easy to make tons of small commits as you work and then quickly combine them into a clean commit history.
      warning: only rebase code you haven’t shared with anyone yet. <base commit hash> should be the most recent commit that other people can see (e.g. the upstream master).