Bot or Not?
Detecting bots in GitHub
pull request activity based
on comment similarity
-- Mehdi Golzadeh
-- Damien Legay
-- Alexandre Decan
-- Tom Mens
---------------------------------------
{firstname.lastname}@umons.ac.be
Software Engineering Lab - University of Mons
2nd International Workshop on Bots in Software Engineering
May 24th, 2020, Seoul, South Korea, in conjunction with ICSE 2020
> Introduction
> Online distributed social coding platforms such as
GitHub
-- Commit changes
-- Submit pull requests
-- Code reviews
> Projects are increasingly relying on bots
-- Carry out routine tasks
-- Interact with project members
> Bots have been shown to improve collaborative
development
-- Improving software quality by automated refactoring {Wyrich2019}
-- Generating patches for bugs {Monperrus2019}
-- Continuous integration {Ablett2007}
-- Automatically closing abandoned issues and pull requests {Wessel2019}
1/13
> What is a pull request?
-- A method of code contribution
-- Participate in a project without having direct commit access
> Pull-based development
2/13
> Problems
> Empirical studies about human behaviour
-- Distinguish human behaviour from automated behaviour {Liu2019}
-- Identity merging {GoeminneComparison2012}
-- Contributor onboarding {Casalnuovo2015}
-- Contributor abandonment {Constantinou2017ISSE}
-- Team productivity {Meyer2014,Vasilescu2015}
-- Team collaboration {Tamburri2019}
Proportion of accepted pull requests with and without presence of bots in commenting activity
3/13
> Dataset
> Current methods to identify bots on GitHub
-- A manual and effort-prone process
-- Not easily reproducible
-- Presence of the string "bot" in the user identity
> Create ground truth
-- Accounts that are currently active in the Cargo ecosystem
-- Two distinct authors of this paper manually and independently verified all
identities
-------------------------------------------------
|                     | Humans | Bots  | Total  |
-------------------------------------------------
| GitHub identities   | 250    | 42    | 292    |
-------------------------------------------------
| PR comments         | 16,430 | 3,360 | 20,090 |
-------------------------------------------------
| GitHub repositories | 692    | 694   | 1,262  |
-------------------------------------------------
4/13
> Underlying Idea
> An automated approach for detecting bots in GitHub
-- Based on the similarity of comments
5/13
> Approach
> Measure the similarity between comments
> Levenshtein edit distance
-- Similarity of structure
-- Counting the number of single-character edits
> Jaccard distance
-- Similarity of content
-- Compares words in two strings
-- We computed the Levenshtein and Jaccard distance of all pairs of PR messages
---------------------------
|    | C1   | C2   | C3   |
---------------------------
| C1 | 0    |      |      |
---------------------------
| C2 | 0.42 | 0    |      |
---------------------------
| C3 | 0.63 | 0.2  | 0    |
---------------------------
6/13
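The two distances can be sketched in a few lines of pure Python. This is an illustration only, not the authors' implementation: the normalisation of the edit distance by the longer string, the lowercase whitespace tokenisation, and the names `normalized_levenshtein`, `jaccard_distance` and `pairwise_distances` are assumptions, and the sample comments are made up.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the sets of words of the two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1 - len(wa & wb) / len(union) if union else 0.0


def pairwise_distances(comments, distance):
    """Distance for every unordered pair (i, j), i < j, of comments."""
    return {
        (i, j): distance(comments[i], comments[j])
        for i, j in combinations(range(len(comments)), 2)
    }


# Made-up comments: two templated bot-like messages and one human-like message.
comments = [
    "Coverage increased by 0.4%",
    "Coverage decreased by 0.1%",
    "Thanks, could you add a test for this?",
]
lev = pairwise_distances(comments, normalized_levenshtein)
jac = pairwise_distances(comments, jaccard_distance)
```

Both dictionaries hold the lower triangle of a matrix like the one above, and the two templated comments end up closer to each other than either is to the free-form one.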
> Results
-- Humans have higher mean distances between their comments than bots
-- Levenshtein and Jaccard give very similar, but not redundant, results
-- Some bots are detected by one distance but not by the other
7/13
> Idea behind clustering
> Pairwise similarity does not work for some bots
-- different message patterns
8/13
> PR comment clustering
> Clusters of comments
-- Messages follow several distinct patterns
-- Human comments spread across many small clusters
-- Bots have a small number of large clusters
> DBSCAN
-- Does not require specifying the number of expected clusters
-- Generate clusters of unequal size
-- Allows clusters to contain a single item
-- Combined Levenshtein-Jaccard distance
9/13
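A minimal DBSCAN over a precomputed distance matrix shows why the algorithm fits this setting: the number of clusters falls out of a distance threshold rather than being given up front, and with `min_samples=1` a lone comment forms its own cluster. This is a sketch, not the authors' code. The slides do not say how the Levenshtein and Jaccard distances are combined or which parameters were used, so the matrix values, `eps` and `min_samples` below are hypothetical.

```python
def dbscan(dist, eps, min_samples=1):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    dist is a symmetric matrix of distances; a point's neighbourhood
    includes the point itself, so min_samples=1 makes every point a
    core point and allows single-item clusters.
    """
    n = len(dist)
    labels = [None] * n          # None = not visited yet
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbours = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbours) < min_samples:
            labels[p] = -1       # noise, may later become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster      # border point: join, do not expand
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbours) >= min_samples:
                queue.extend(q_neighbours)   # q is a core point: expand
    return labels


# Hypothetical combined distances between five comments of one account:
# comments 0-2 follow the same pattern, comments 3 and 4 are unrelated.
combined = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.7],
    [0.8, 0.9, 0.9, 0.7, 0.0],
]
labels = dbscan(combined, eps=0.3, min_samples=1)
n_clusters = len(set(labels) - {-1})
```

Here the three similar comments fall into one cluster and the two others each form a cluster of size one, giving three clusters of unequal size.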
> Approach and results
> Clusters of comments
-- Bots have a lower number of clusters regardless of the number of messages
-- All 42 bots have 10 clusters or fewer
-- 249 out of the 250 humans have at least 14 clusters
-- Classifying bots based on a threshold of 10 clusters
-- Recall of 100% and precision of 97.7%
10/13
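The reported scores can be re-derived from the counts on this slide. The sketch below assumes that the single human with fewer than 14 clusters falls at or below the threshold and is therefore classified as a bot; the slide does not state this explicitly, but it is the reading under which 97.7% equals 42/43. The function name `classify` is illustrative.

```python
def classify(n_clusters: int, threshold: int = 10) -> str:
    """Threshold rule from the slide: few comment clusters means bot."""
    return "bot" if n_clusters <= threshold else "human"


# Confusion matrix with "bot" as the positive class.
true_positives = 42    # all 42 bots have 10 clusters or fewer
false_negatives = 0
false_positives = 1    # assumption: the one remaining human is at or below 10
true_negatives = 249   # humans with at least 14 clusters

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (
    true_positives + false_positives + false_negatives + true_negatives
)

print(f"precision={precision:.1%} recall={recall:.1%} accuracy={accuracy:.1%}")
```

Under that assumption 42/43 gives the 97.7% figure as precision, while accuracy would be 291/292.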
> Limitations and Future work
> Limitations
-- Only 292 GitHub identities from Cargo
-- The number of comment clusters is the only decision factor
> Future work
-- Already manually verified 5,000 accounts from 37 ecosystems
-- Creating a classification model instead of a threshold for clusters
-- Number of comments, empty comments and Gini value as distinguishing metrics
-- Developing a tool
11/13
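As an illustration of the Gini value mentioned above, the sketch below computes a Gini coefficient over cluster sizes. The slide does not say which quantity the Gini value is computed on, so applying it to the number of comments per cluster is an assumption, and the sample sizes are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even,
    (n - 1) / n = everything concentrated in one of n items."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form over the sorted values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical cluster sizes: one dominant pattern versus an even spread.
bot_cluster_sizes = [48, 1, 1]
human_cluster_sizes = [2, 1, 1, 1, 1]

bot_gini = gini(bot_cluster_sizes)
human_gini = gini(human_cluster_sizes)
```

An account whose comments concentrate in one large cluster scores higher than one whose comments are spread evenly, which is the kind of contrast such a metric could feed into a classification model.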
> Outcome
> Pull-based model and bots
-- Presence of bots in social coding platforms
-- Problems raised by the presence of bots
> Our approach and the results
-- Based on comment similarity and clusters
-- Clustering on a combination of the Jaccard and Levenshtein distances
-- Our approach classifies accounts with high precision and recall
> Future work
-- Create a bigger dataset
-- Consider more features to make a better model to classify
-- Create a tool to make it useful for the community
12/13
> Ctrl + Z
> Thanks
-- Any questions?
13/13
Joint FNRS / FWO Excellence of Science project SECO-ASSIST
https://secoassist.github.io
