What is in GitHub?
Daniel M German
dmg@uvic.ca
Researcher states:
“40% of pull requests are not merged”
● Based on simply querying GHTorrent data
● But it ignores what r...
What is GitHub used for?
"I store my presentations in github. I don't
need USB stick anymore!"
Are there potential threats to validity for studies
that assume GitHub is about software engineering
only?
Methodology
● Reuse:
– Surveys
– Data analysis for other papers
● Mixed methods:
– Quantitative, and
– Qualitative
Uses:
Most projects are inactive
Social?
67% of projects are personal repos
95% have 3 or fewer committers
Self contained?
“Any serious project would have to have some
separate infrastructure - mailing lists, forums, irc
channels...
But... what about the users?
Switch to http://osrc.dfm.io/dmgerman
Is it still worth exploring GitHub?
Definitely!
The Perils of Doing Software Engineering Research using GitHub Data

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet.
Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software.
However, so far there have been no studies describing the quality and properties of the data available from GitHub.
We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features, namely commits, pull requests, and issues.
Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration.
We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.
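
To make the pull-request peril concrete, here is a minimal sketch of how a researcher might look for pull requests that were merged without being flagged as merged. It is an illustration under stated assumptions, not necessarily the procedure used in the study: the `PullRequest` and `Commit` records, the `classify` function, and the labels it returns are all hypothetical, and the two heuristics (the pull request's head commit appears on the target branch, or a commit message closes the pull request with a GitHub closing keyword) will miss merges done by squashing or rebasing.

```python
import re
from dataclasses import dataclass

# GitHub closing keywords followed by "#<number>", e.g. "fixes #12".
CLOSING = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)\b",
    re.IGNORECASE,
)


@dataclass
class PullRequest:  # hypothetical record, not a GHTorrent schema
    number: int
    head_sha: str
    merged_flag: bool  # what the hosting site reports


@dataclass
class Commit:  # hypothetical record for a commit on the target branch
    sha: str
    message: str


def classify(pr, branch_commits):
    """Label a pull request using the merged flag plus two fallback heuristics."""
    if pr.merged_flag:
        return "merged (flagged)"
    # Heuristic 1: the PR's head commit is already on the target branch.
    if pr.head_sha in {c.sha for c in branch_commits}:
        return "merged (head commit on branch)"
    # Heuristic 2: some commit message closes this PR by number.
    for commit in branch_commits:
        numbers = CLOSING.findall(commit.message)
        if any(int(n) == pr.number for n in numbers):
            return "merged (closing keyword)"
    return "unmerged"


commits = [Commit("aaa", "Add parser"), Commit("bbb", "Tidy up, closes #7")]
print(classify(PullRequest(7, "zzz", False), commits))
```

Counting only the `merged (flagged)` label reproduces the naive query; the other two labels are the cases that query would report as unmerged.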

Published in: Technology

  1. What is in GitHub? Daniel M German dmg@uvic.ca
  2. Researcher states: “40% of pull requests are not merged”
     ● Based on simply querying GHTorrent data
     ● But it ignores what really happens
     ● Many pull requests are merged without being marked as merged in GitHub
     ● GHTorrent data has many potential threats to validity
  3. What is GitHub used for?
  4. "I store my presentations in github. I don't need USB stick anymore!"
  5. Are there potential threats to validity for studies that assume GitHub is about software engineering only?
  6. Methodology
     ● Reuse:
       – Surveys
       – Data analysis for other papers
     ● Mixed methods:
       – Quantitative, and
       – Qualitative
  7. Uses:
  8. Most projects are inactive
  9. Social? 67% of projects are personal repos. 95% have 3 or fewer committers.
  10. Self contained? “Any serious project would have to have some separate infrastructure - mailing lists, forums, irc channels and their archives, build farms, etc. [...] Thus while GitHub and all other project hosts are used for collaboration, they are not and can not be a complete solution.”
  11. But... what about the users?
  12. Switch to http://osrc.dfm.io/dmgerman
  13. Is it still worth exploring GitHub? Definitely!
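
The committer figure on slide 9 is the kind of number that is easy to recompute on a sample before trusting it. The sketch below shows one way to do that, assuming the commit data has already been reduced to `(repository, committer)` pairs; the function names and the input shape are hypothetical and do not come from the talk.

```python
from collections import defaultdict


def committers_per_repo(commits):
    """Map each repository to its number of distinct committers.

    `commits` is an iterable of (repository, committer) pairs.
    """
    seen = defaultdict(set)
    for repo, committer in commits:
        seen[repo].add(committer)
    return {repo: len(people) for repo, people in seen.items()}


def share_with_at_most(counts, limit=3):
    """Fraction of repositories with at most `limit` distinct committers."""
    if not counts:
        return 0.0
    small = sum(1 for n in counts.values() if n <= limit)
    return small / len(counts)


sample = [
    ("a", "x"), ("a", "y"),
    ("b", "x"),
    ("c", "p"), ("c", "q"), ("c", "r"), ("c", "s"),
]
counts = committers_per_repo(sample)
print(counts, share_with_at_most(counts))
```

Committer identities are often duplicated across email addresses, so any real count would need an identity-merging step first; that step is left out here.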
