Bot or not? Detecting bots in GitHub pull request activity based on comment similarity

-- Mehdi Golzadeh
-- Damien Legay
-- Alexandre Decan
-- Tom Mens
---------------------------------------
{firstname.lastname}@umons.ac.be
Software Engineering Lab - University of Mons
2nd International Workshop on Bots in Software Engineering
May 24th, 2020, Seoul, South Korea. In conjunction with ICSE 2020.

> Introduction
> Online distributed social coding platforms such as GitHub
-- Commit changes
-- Submit pull requests
-- Code reviews
> Projects are increasingly relying on bots
-- Carry out routine tasks
-- Interact with project members
> Bots have been shown to improve collaborative development
-- Improving software quality by automated refactoring {Wyrich2019}
-- Generating patches for bugs {Monperrus2019}
-- Continuous integration {Ablett2007}
-- Automatically closing abandoned issues and pull requests {Wessel2019}
1/13

> What is a pull request?
-- A method of code contribution
-- Participate in a project without having direct commit access
> Pull-based development
2/13

> Problems
> Empirical studies about human behaviour
-- Distinguish human behaviour from automated behaviour {Liu2019}
-- Identity merging {GoeminneComparison2012}
-- Contributor onboarding {Casalnuovo2015}
-- Contributor abandonment {Constantinou2017ISSE}
-- Team productivity {Meyer2014,Vasilescu2015}
-- Team collaboration {Tamburri2019}
Figure: Proportion of accepted pull requests with and without bots in the commenting activity
3/13

> Dataset
> Current methods to identify bots on GitHub
-- A manual, effort-intensive process
-- Not easily reproducible
-- Presence of the string "bot" in the user identity
> Create ground truth
-- Accounts that are currently active in the Cargo ecosystem
-- Two authors of this paper manually and independently verified all identities
--------------------------------------------------
|                     | Humans | Bots   | Total  |
--------------------------------------------------
| GitHub identities   | 250    | 42     | 292    |
--------------------------------------------------
| PR comments         | 16,430 | 3,360  | 20,090 |
--------------------------------------------------
| GitHub repositories | 692    | 694    | 1,262  |
--------------------------------------------------
4/13

> Underlying Idea
> An automated approach for detecting bots in GitHub
-- Based on the similarity of comments
5/13

> Approach
> Measure the similarity between comments
> Levenshtein edit distance
-- Captures similarity of structure
-- Counts the number of single-character edits
> Jaccard distance
-- Captures similarity of content
-- Compares the sets of words in two strings
-- We computed the Levenshtein and Jaccard distances between all pairs of PR comments
     | C1   | C2   | C3   |
---------------------------
C1   | 0    |      |      |
---------------------------
C2   | 0.42 | 0    |      |
---------------------------
C3   | 0.63 | 0.2  | 0    |
---------------------------
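
The pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The normalisation of the edit distance by the longer string, the whitespace tokenisation for Jaccard, and the sample comments are all assumptions made for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the sets of lower-cased words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def distance_matrix(comments, distance):
    """Symmetric matrix of distances between all pairs of comments."""
    n = len(comments)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(comments[i], comments[j])
    return matrix


# Hypothetical comments, for illustration only.
comments = [
    "Build passed for commit abc123",
    "Build failed for commit def456",
    "Thanks, could you add a test for the edge case?",
]
for row in distance_matrix(comments, normalized_levenshtein):
    print([round(value, 2) for value in row])
for row in distance_matrix(comments, jaccard_distance):
    print([round(value, 2) for value in row])
```

The two template-like comments end up close to each other under both distances, while the free-form comment is far from both.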
6/13

> Results
-- Humans have higher mean distances between their comments than bots
-- The two distances give very similar, but not redundant, results
-- Some bots are detected by one distance but not by the other
7/13

> Idea behind clustering
> Pairwise similarity does not work for some bots
-- different message patterns
8/13

> PR comment clustering
> Clusters of comments
-- Messages follow several distinct patterns
-- Human comments spread across many small clusters
-- Bots have a small number of large clusters
> DBSCAN
-- Does not require specifying the number of expected clusters
-- Can generate clusters of unequal size
-- Allows clusters to contain a single item
-- Combined Levenshtein-Jaccard distance
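
To make the clustering step concrete, here is a small pure-Python sketch of DBSCAN over a precomputed distance matrix. It is a simplified illustration and not the authors' code. Averaging the two distances, the `eps` value and the toy matrices are assumptions. In practice a library implementation such as scikit-learn's `DBSCAN` with `metric="precomputed"` would be the usual choice.

```python
def combine(lev, jac):
    """Element-wise mean of two distance matrices (assumed combination)."""
    return [[(l + j) / 2 for l, j in zip(row_l, row_j)]
            for row_l, row_j in zip(lev, jac)]


def dbscan(matrix, eps, min_samples=1):
    """Minimal DBSCAN on a precomputed distance matrix.

    Returns one cluster label per item. The label -1 marks noise, which
    cannot occur when min_samples is 1 because every item is a core point.
    """
    n = len(matrix)
    # Each item counts as its own neighbour (distance 0).
    neighbours = [[j for j in range(n) if matrix[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Toy distance matrices for four comments (illustrative values only).
lev = [[0.0, 0.1, 0.8, 0.9],
       [0.1, 0.0, 0.7, 0.9],
       [0.8, 0.7, 0.0, 0.6],
       [0.9, 0.9, 0.6, 0.0]]
jac = [[0.0, 0.3, 0.9, 1.0],
       [0.3, 0.0, 0.9, 1.0],
       [0.9, 0.9, 0.0, 0.8],
       [1.0, 1.0, 0.8, 0.0]]

labels = dbscan(combine(lev, jac), eps=0.5, min_samples=1)
print(labels, "clusters:", len(set(labels)))
```

With `min_samples=1`, a comment that resembles no other comment forms a cluster of its own, which matches the single-item clusters mentioned on the slide.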
9/13

> Approach and results
> Clusters of comments
-- Bots have a lower number of clusters regardless of the number of messages
-- All 42 bots have 10 clusters or fewer
-- 249 out of the 250 humans have at least 14 clusters
-- Classifying bots based on a threshold of 10 clusters
-- Recall of 100% and precision of 97.7%
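
The threshold rule can be sketched as follows. This is only an illustration. The account names and cluster counts are hypothetical toy data, not the study's dataset, and the helper names `classify` and `metrics` are made up for this example.

```python
def classify(cluster_counts, threshold=10):
    """Flag an account as a bot when it has at most `threshold` clusters."""
    return {account: count <= threshold
            for account, count in cluster_counts.items()}


def metrics(predicted, actual):
    """Precision, recall and accuracy of the bot predictions."""
    tp = sum(predicted[a] and actual[a] for a in actual)
    fp = sum(predicted[a] and not actual[a] for a in actual)
    fn = sum(not predicted[a] and actual[a] for a in actual)
    tn = sum(not predicted[a] and not actual[a] for a in actual)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(actual) if actual else 0.0,
    }


# Hypothetical accounts: number of comment clusters and ground-truth label.
cluster_counts = {"ci-helper": 1, "dep-updater": 4, "greeter": 10,
                  "alice": 8, "bob": 14, "carol": 30, "dave": 52}
is_bot = {"ci-helper": True, "dep-updater": True, "greeter": True,
          "alice": False, "bob": False, "carol": False, "dave": False}

predicted = classify(cluster_counts, threshold=10)
print(metrics(predicted, is_bot))
```

In this toy data one human falls below the threshold, so recall stays at 100% while precision drops. A single misclassified human has the same kind of effect on any dataset.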
10/13

> Limitations and Future work
> Limitations
-- Only 292 GitHub identities from Cargo
-- The number of comment clusters is the only decision factor
> Future work
-- Already manually verified 5,000 accounts from 37 ecosystems
-- Creating a classification model instead of a fixed threshold on the number of clusters
-- Number of comments, empty comments and Gini value as distinguishing metrics
-- Developing a tool
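
As a rough idea of how a Gini value could serve as a feature, the sketch below computes the Gini coefficient of a list of non-negative numbers. The slides do not say what the Gini value is computed over. Applying it to the sizes of an account's comment clusters is an assumption made here for illustration, and the sample sizes are invented.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical cluster sizes for two accounts.
bot_like = [120, 3, 1]           # one dominant message pattern
human_like = [2, 1, 1, 1, 1, 1]  # comments spread over small clusters
print(round(gini(bot_like), 2), round(gini(human_like), 2))
```

A value near 0 means the comments are spread evenly over the clusters, and a value near 1 means a few clusters hold almost all comments.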
11/13

> Outcome
> Pull-based model and bots
-- Presence of bots in social coding platforms
-- Problems raised by the presence of bots
> Our approach and the results
-- Based on comment similarity and clusters
-- Clustering based on a combination of the Jaccard and Levenshtein distances
-- Our approach classifies accounts with high precision and recall
> Future work
-- Create a bigger dataset
-- Consider more features to build a better classification model
-- Create a tool that is useful for the community
12/13

> Ctrl + Z
> Thanks
-- Any questions?
13/13
Joint FNRS / FWO Excellence of Science project SECO-ASSIST
https://secoassist.github.io
