1. AN INSIGHT INTO THE PULL
REQUESTS OF GITHUB
Mohammad Masudur Rahman, Chanchal K. Roy
Department of Computer Science
University of Saskatchewan
11th Working Conference on Mining Software
Repositories (MSR 2014) (Challenge Track)
Hyderabad, India
2. RESEARCH PROBLEM: HIGHER RATE OF PULL
REQUEST FAILURE IN GITHUB
Base repo.
Forked repos.
Pull Requests
• 88 Base repositories
• 103,192+ Fork repos.
• 20,142 developers
• 78,955 Pull requests made in
4+ years.
• Only 42.95% pull request
commits merged
• About 57.05% pull
requests failed.
RQ: Why and how did those pull requests fail?
3. ASPECTS OF STUDY
Studied and analyzed 7 aspects related to technical
problems in the pull requests, programming
languages, projects and developers.
Technical issues in pull request commits
Programming language
Application domain
Age of project
Maturity of project
Number of developers
Experience of developers
4. WHICH TECHNICAL PROBLEMS DID HINDER THE
SUCCESS OF THE PULL REQUESTS?
2. Recursion & Refactoring (7.57%, 10.78%)
3. Database query execution (6.98%, 9.18%)
16. Arrays & functions (14.40%, 17.29%)
29. Actor model (7.11%, 5.11%)
31. OOP paradigm (7.12%, 9.17%)
33. Space & indentation (3.07%, 7.32%)
Arrays & functions
(31.69%)
Recursion & Refactoring
(18.35%)
Database query
execution (16.16%)
5. DID AN AVERAGE PROJECT FROM DIFFERENT
PROGRAMMING LANGUAGES SHOW DIFFERENT
BEHAVIOUR IN TERMS OF PULL REQUEST?
•Ruby (16.92/m, 40.11/m)
•PHP (21.72/m, 21.21/m)
•Java (2.75/m, 13.21/m)
•Scala (10.39/m, 4.08/m)
•C (11.89/m, 6.72/m)
•JavaScript (5.92/m, 14.87/m)
PHP(42.93/m)
Ruby (57.03/m)
Java(15.96/m)
6. WAS THERE A DOMAIN-SPECIFIC TREND IN
PULL REQUESTS?
•Framework (20.67/m, 15.49/m)
•IDE (19.43/m, 9.31/m)
•Client Apps (10.27/m, 6.37/m)
•Database (1.40/m, 3.94/m)
•Statistics(1.15/m, 0.80/m)
•Library(6.59/m, 9.18/m)
IDE
(28.84/m)
Framework
(36.16/m)
7. HOW DID PROJECT AGE AFFECT PULL
REQUEST RATE?
2012-2013
(43.12/m)
2009-2010
(19.34/m)
11. TAKE-AWAY MESSAGES
57.05% of the pull requests failed. The issues that
kept the requests from being merged relate to a limited
number of topics: recursion & refactoring,
database query execution, arrays & functions and
so on.
Projects written in Java, JavaScript and Ruby
received an exceptionally high no. of failed pull
requests. PHP projects received an almost equal no.
of successful and unsuccessful pull requests on
average per month.
Projects from the IDE and Framework domains showed
the maximum activity in terms of pull requests.
12. TAKE-AWAY MESSAGES
As the age of a project increases, both merged and
failed pull request rates increase almost
proportionally.
With the increase in forks, the average no. of pull
requests per month did not increase regularly.
However, projects with 2000+ forks received an increased
number of failed pull requests.
With new participation (developers) in a project, the no. of
pull requests per month did not increase regularly.
However, a project with 4000+ developers received
an excessive no. of failed pull requests.
Projects with developers of 20-50 months experience
showed the maximum activity in terms of pull
requests.
Introduce yourself
Today, I am going to talk about our findings from mining the pull requests of GitHub.
The challenge dataset contains data about 88 base repositories and about 103,192 forked repositories, where 20 thousand developers are involved.
Developers usually create forks and submit their code to the base repository as pull requests.
Statistics show that about 79 thousand pull requests were made to those 88 base projects within a time span of three years.
Only 42.95% of the requests were accepted and the commits were successfully merged.
The remaining 57.05% of the requests failed, which is a matter of concern.
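The shares above can be cross-checked with a few lines of arithmetic. This is only a sketch: the total and the merged share come from the dataset summary, while the absolute counts it derives are approximations that assume the percentage applies to whole pull requests. The function name is ours.

```python
def split_pull_requests(total, merged_percent):
    """Derive the failed share and approximate counts from the merged share."""
    failed_percent = round(100.0 - merged_percent, 2)
    approx_merged = round(total * merged_percent / 100)
    approx_failed = total - approx_merged
    return failed_percent, approx_merged, approx_failed


# 78,955 pull requests and 42.95% merged are the figures reported for the dataset.
failed_percent, approx_merged, approx_failed = split_pull_requests(78_955, 42.95)
print(failed_percent, approx_merged, approx_failed)
```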
In this research, we investigate why pull requests succeed and fail in GitHub.
We identify an intuitive list of 7 factors, and investigate whether they have any interesting influences on the success or failure of the pull requests.
Question we asked:
Which types of technical issues are hindering the developers in getting their pull requests merged?
--We collect the pull request commit comments of 9421 pull requests made to 78 base repositories.
--We apply LDA topic modeling with Gibbs sampling, retrieve 100 topics, and label 64 of them.
--We found 8 dominant topics, six of which could be labeled; these are shown here.
--We found that certain topics are frequently discussed in the commit discussions of pull requests, such as recursion and refactoring, database query execution, arrays and functions, and so on.
--We even noticed that space and indentation is one of the frequently discussed topics.
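To make the topic-modeling step concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is an illustration under stated assumptions, not the tooling used in the study: the hyperparameters (alpha, beta), the iteration count, the two-topic setting, and the toy comments are all hypothetical, whereas the study retrieved 100 topics from real commit comments.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                old = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][old] -= 1
                topic_word[old][v] -= 1
                topic_total[old] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][v] += 1
                topic_total[new] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Hypothetical commit comments, already lowercased and tokenized.
comments = [
    "please fix indentation and trailing space".split(),
    "space and indentation differ from the style guide".split(),
    "the database query fails on execution".split(),
    "query execution is slow on this database".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(comments, num_topics=2)
topics = top_words(topic_word, vocab)
print(topics)
```

Labeling is then a manual step: a person reads the top words of each topic and assigns a name such as "space & indentation" where one fits.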
Question we asked: Does an average project from different programming languages show different behaviour in terms of pull requests?
--We chose 10 programming languages with a reasonable number of base projects (maximum 10, minimum 3).
--We then find out the number of pull requests made to a base project each month on average.
--We found interesting differences across programming languages. For example, Ruby projects received the maximum number of pull requests each month, and R and Java projects received the minimum.
--We also note an interesting pattern for PHP projects: their successful and failed pull requests are almost equal. Ruby projects, on the other hand, have a relatively higher number of failed pull requests.
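The per-language averages can be sketched as follows. This assumes each project's rate is its pull request count divided by the number of months it was observed, and that a language's rate is the mean over its projects; the exact definition used in the study may differ, and the project names, month counts, and records below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_rates_by_language(pull_requests, project_info):
    """Average (merged, failed) pull requests per month for a project of each language.

    pull_requests: iterable of (project, merged) pairs, merged being a bool.
    project_info:  project -> (language, months observed).
    """
    counts = defaultdict(lambda: [0, 0])  # project -> [merged, failed]
    for project, merged in pull_requests:
        counts[project][0 if merged else 1] += 1

    per_language = defaultdict(list)
    for project, (language, months) in project_info.items():
        merged, failed = counts[project]
        per_language[language].append((merged / months, failed / months))

    return {
        language: (mean(r[0] for r in rates), mean(r[1] for r in rates))
        for language, rates in per_language.items()
    }


# Hypothetical projects and pull request records.
project_info = {"p1": ("Ruby", 4), "p2": ("Ruby", 2), "p3": ("PHP", 5)}
pull_requests = (
    [("p1", True)] * 8 + [("p1", False)] * 12
    + [("p2", True)] * 2 + [("p2", False)] * 6
    + [("p3", True)] * 10 + [("p3", False)] * 10
)
rates = monthly_rates_by_language(pull_requests, project_info)
print(rates)
```

Averaging per project first keeps one very large project from dominating its language, which matches the "average project" framing of the question.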
Question we asked: Does the application domain matter in case of the success or failure of the pull requests?
--We identify seven major domains by consulting the README descriptions of the projects, and determine the average number of pull requests received per month by a project from each domain.
--We found that framework and reusable-library projects are the most frequent.
--We note that framework and IDE projects received a higher rate of pull requests each month than projects of other domains.
--For example, IDE projects received 19.43 successful requests per month, whereas database projects received only 1.40 successful requests per month.
We also investigate how the age of a base project (i.e., how long it has been on GitHub) contributes to the number of pull requests it receives per month.
--We found that the average pull request rates increase regularly over time for a project.
--This finding is intuitive: forks and the developer pool increase over time, and thus the number of pull requests also grows.
--However, the dataset shows that the earliest pull requests started in October 2010, although the project had existed since February 2008.
--It is notable that both successful and failed pull requests grew over time, which means that both developers and management really need to pay heed to the issue of failed pull requests.
We consider the number of forks of a base project as a heuristic estimate of its maturity.
Question we asked: How does the number of forks of a base project contribute to its pull request rate?
--We did not find a regular change in pull request rate with the addition of new forks to the base project.
--However, projects with more than 2000 forks show a higher rate of failed pull requests.
--For example, we found 19 base projects with more than 2000 forks, 7 of which have more than 3000 forks, and they show an extremely high number of unsuccessful pull requests.
It is intuitive that a base project with a higher number of developers is likely to receive a higher number of pull requests.
However, our finding does not support that intuition much.
We did not find any regular increase in pull request rate with the increase in the number of developers.
Moreover, we note that 8 projects with more than 500 developers received an increasing number of unsuccessful pull requests.
For example, one project with 4000+ developers received an extreme number of failed pull requests each month.
We consider the working experience of the developers of a project as an important factor that is likely to contribute to the pull request rates.
--We average the experience of all the developers of a project, and determine six ranges from 10 months to 70 months.
--We note that 53 projects whose developers average 20-50 months of experience received the maximum number of pull requests on average each month.
--However, we note that 10 projects whose developers have 60-70 months of experience showed relatively lower activity.
--More interestingly, 10 projects with 50-60 months of developer experience showed extremely high unsuccessful pull request rates.
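The grouping by experience can be sketched as below. The six ranges (10-20 up to 60-70 months) follow the description above; treating each range as half-open, so that exactly 20 months falls in 20-30, and leaving averages outside 10-70 months unassigned are our assumptions. The developer data is hypothetical.

```python
from statistics import mean


def experience_range(avg_months, low=10, high=70, width=10):
    """Map a project's average developer experience to one of six 10-month ranges."""
    if not (low <= avg_months < high):
        return None  # outside the studied ranges (assumed handling)
    start = low + int((avg_months - low) // width) * width
    return f"{start}-{start + width}"


# Hypothetical projects with their developers' experience in months.
developer_experience = {
    "project_a": [12, 30, 48],
    "project_b": [55, 58, 61],
    "project_c": [66, 69],
}
ranges = {
    project: experience_range(mean(months))
    for project, months in developer_experience.items()
}
print(ranges)
```

With the ranges assigned, the per-month pull request rates of the projects in each range can be averaged in the same way as for languages and domains.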
From the mining, we find the following take-away messages.