Revisiting the Applicability of the Pareto Principle to Core Development Teams in Open Source Software Projects

Presented at IWPSE 2015

Published in: Software

  1. Is the Pareto Principle Applicable to the Core Teams of GitHub Projects? Kazuhiro Yamashita, Yasutaka Kamei, Shane McIntosh, Naoyasu Ubayashi, Ahmed E. Hassan
  2. Core developers play a critical role in software development. Core developers are responsible for guiding and coordinating the development of an OSS project (Nakakoji). They are the most productive developers, who have made roughly 80% of the total contributions (Mockus).
  3. In fact, some argue that core developers in OSS projects follow the Pareto Principle. [Figure: Effort and Result bars, each split 20% / 80%.]
  4. Pareto Principle in Software Development. [Figure: within a project, Developers and Artifacts are each split 20% / 80%.]
  5. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle. Studies, grouped on the slide as Pareto, Non-Pareto, or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE). The results depend on a small number of case study systems.
  6. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle. The same studies, grouped on the slide by reported core team size (< 10 or 15, or Other): Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE).
  7. Overview of our study of core teams on GitHub: Applicability of the Pareto Principle; Number of Core Developers.
  8. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers.
  9. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  10. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  11. Preprocessing GitHub data to handle forks and duplicates, and to remove immature projects: 8,510,504 repositories -> 2,496 repositories.
  12. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  13. Using heuristics to identify core team members: commit-based, LOC-based, and access-based.
  14. Our commit-based core contributor heuristic. [Figure: contributors A, B, C, and D with their number of commits; legend: one mark = one commit.]
  15. Step 1: Sort contributors by their number of commits. [Figure: A, B, C, and D ordered by number of commits.]
  16. Step 2: Compute the proportion of commits that each contributor made. Commit ratios: A 60%, B 20%, C 10%, D 10%.
  17. Step 3: Core contributors are those developers at or below the 0.8 cumulative contribution cutoff. The cumulative ratio is 0.6 at A, 0.8 at B, and 1.0 at D, so A and B are core. Number of core developers: 2. Percentage of core developers: 2/4 * 100 = 50%.
  18. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  19. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers.
  20. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers.
  21. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  22. Our approach to studying core team size. Compliance with the Pareto Principle is read from the percentage of core developers (scale on the slide: 10%, 20%, 30%). Projects are stratified along the confounding factors LOC, Total Authors, and Age, each split into Small, Medium, and Large. The example project does not follow the Pareto Principle.
  23. Core team proportions are widely spread. [Figure: commit-based heuristic, projects divided by LOC.]
  24. Often, there are fewer than 15 core developers in a project. Number of core developers in projects: commit-based 88%, LOC-based 98%, access-based 96%.
  25. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers. Findings so far: more than half of the projects do not follow the Pareto Principle; most projects have 15 or fewer core developers.
  26. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers. Findings so far: more than half of the projects do not follow the Pareto Principle; most projects have 15 or fewer core developers.
  27. Collecting and analyzing GitHub data to study core team activity. [Diagram: Projects, Filter, Heuristics (Core / Non-Core), Calc Prop, Classify Commits; outputs: Core Team Size and Activity.]
  28. Our approach to studying activity. We classify the commits by using keywords. Activity types and their keywords:
      • Development, Forward Engineering: implement, add, request
      • Maintenance, Reengineering: optimiz, adjust
      • Maintenance, Corrective Engineering: bug, fix, issue, error
      • Maintenance, Management: license, formatting, TODO
  29. No big differences in proportions of development activities. [Figure: panels for the commit-based, LOC-based, and access-based heuristics.]
  30. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers. Findings: more than half of the projects do not follow the Pareto Principle; most projects have 15 or fewer core developers; there are no big differences between core and non-core activities.
  31. Overview of our study of core teams on GitHub: Core and Non-Core Developers Activities; Applicability of the Pareto Principle; Number of Core Developers. Findings: more than half of the projects do not follow the Pareto Principle; most projects have 15 or fewer core developers; there are no big differences between core and non-core activities.
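  32. Extremely large core teams may be interesting. Number of projects by core team size (columns on the slide: -15, 16-20, 21-50, 51-100, 101-):
      • Commit-based: 2,197 (-15), 98 (16-20), 137 (21-50), 17 (51-100), 47 (101-)
      • LOC-based: 2,454 (-15), 15 (16-20), 13 (21-50), 4 (51-100), 10 (101-)
      • Access-based: 1,164 (-15), 24 (16-20), 24 (21-50), 0 (51-100), 0 (101-)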
  33. Many projects face a bus factor risk. Commit-based: 43% (Core=1: 8%); LOC-based: 81% (Core=1: 24%); access-based: 54% (Core=1: 21%). In fact, most projects have fewer than 5 core developers.
  34. Conclusion
  35. (no text on this slide)
  36. Core Developer: additional slides
  37. Additional description of our definition. [Figure: cumulative ratio for contributors A, B, C, D, and E with the 0.8 cutoff; slide label: "Depend on Name".]
  38. Commit-based. [Figure panels: Age, Total Author.]
  39. LOC-based. [Figure panels: Age, Total Author, LOC.]
  40. Access-based. [Figure panels: Age, Total Author, LOC.]
  41. Data Extraction: 8,510,504 repositories -> 4,618 repositories
  42. Data Extraction
  43. Data Extraction, step (1): Filter projects by GHTorrent. Filter out forked repositories.
  44. Fork: one of the features of GitHub. [Diagram: Original Repository, Fork (clone), Fork Repository, Pull Request.]
  45. Data Extraction, step (1): Filter projects by GHTorrent. Filter out forked repositories. Filter out repositories with fewer than 10 developers.
  46. Data Extraction, step (1): Filter projects by GHTorrent. Filter out forked repositories. Filter out repositories with fewer than 10 developers. Filter out repositories that are developed outside of GitHub.
  47. Data Extraction, step (1): Filter projects by GHTorrent. Filter out forked repositories. Filter out repositories with fewer than 10 developers. Filter out repositories that are developed outside of GitHub. 8,510,504 repositories -> 4,618 repositories
  48. Data Extraction
  49. Data Extraction, step (2): Clone repositories to a local server. 4,618 repositories -> 4,154 repositories
  50. Data Extraction
  51. Data Extraction, step (3): Filter duplicate projects. [Diagram: Project A; a fork of A; Project B, registered from a clone of A.]
  52. Data Extraction, step (3): Filter duplicate projects by comparing the commit SHAs of Project A and Project B. 4,618 repositories -> 3,533 repositories. SHAs shown on the slide: c87cce1, e1a7260, f40ccb5, 455e44c, 8b67f28, 651fa5e, 655b8be, 757dd93, a4cf371, 8145880, cf484e3, 4e63bde
  53. Data Extraction
  54. Data Extraction, step (4): Calculate metrics for each repository: LOC, Total Commits, Total Authors, Age.
  55. Data Extraction
  56. Data Extraction, step (5): Filter projects by metrics. 4,618 repositories -> 2,496 repositories. Filter out repositories with fewer than 10 developers. Filter out repositories with fewer than 1,000 LOC.
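
The commit-based heuristic on slides 14 to 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `core_contributors` and `percent_core` are hypothetical, ties are broken by name (an assumption, loosely suggested by the "Depend on Name" label on slide 37), and a contributor whose commits bring the cumulative share exactly to the 0.8 cutoff is counted as core, which matches the slide example where A and B are core.

```python
from collections import Counter


def core_contributors(commit_authors, cutoff=0.8):
    """Return the contributors counted as core under a commit-based heuristic.

    commit_authors: iterable with one author name per commit.
    cutoff: cumulative share of commits that defines the core team.
    """
    counts = Counter(commit_authors)
    total = sum(counts.values())
    # Step 1: sort by number of commits, descending (ties by name: an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    core = []
    cumulative = 0
    for name, commits in ranked:
        # Steps 2 and 3: stop once the core team already covers the cutoff share.
        if total and cumulative / total >= cutoff:
            break
        core.append(name)
        cumulative += commits
    return core


def percent_core(commit_authors, cutoff=0.8):
    """Percentage of all contributors that are core."""
    authors = list(commit_authors)
    contributors = set(authors)
    if not contributors:
        return 0.0
    return 100.0 * len(core_contributors(authors, cutoff)) / len(contributors)


# Slide example: A made 60% of the commits, B 20%, C 10%, D 10%.
example = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
print(core_contributors(example), percent_core(example))
```

With the slide's example data this yields two core developers out of four, i.e. 50%, which is above the 20% that the Pareto Principle would suggest.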

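The keyword-based classification on slide 28 could look roughly like the following sketch. The keyword lists come from the slide; everything else is an assumption: the slides do not say which text is searched (commit messages are assumed here), how a commit matching several activity types is handled (all matches are returned here), or how matching is done (case-insensitive substring matching is used because "optimiz" is a word stem, at the cost of false positives such as "add" in "address"). The names `KEYWORDS` and `classify_commit` are hypothetical.

```python
# Activity types and keywords as listed on slide 28.
KEYWORDS = {
    "forward engineering": ("implement", "add", "request"),
    "reengineering": ("optimiz", "adjust"),
    "corrective engineering": ("bug", "fix", "issue", "error"),
    "management": ("license", "formatting", "todo"),
}


def classify_commit(message):
    """Return every activity type whose keywords occur in the commit message.

    Matching is a case-insensitive substring search (an assumption),
    so a message can fall into zero, one, or several activity types.
    """
    text = message.lower()
    return [
        activity
        for activity, words in KEYWORDS.items()
        if any(word in text for word in words)
    ]


print(classify_commit("Fix null pointer bug"))
print(classify_commit("Optimize parser and add tests"))
print(classify_commit("Update README"))
```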
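Step (3) of the data extraction (slides 51 and 52) filters duplicate projects by comparing commit SHAs. A minimal sketch of that idea follows, assuming that two repositories count as duplicates when they share at least one commit SHA; the slides only say "Compare SHAs", so the exact criterion is an assumption. The function name `find_duplicates` and the sample SHAs are hypothetical.

```python
def find_duplicates(repo_shas):
    """Return pairs of repository names that share at least one commit SHA.

    repo_shas: dict mapping a repository name to the set of its commit SHAs.
    The pairwise comparison is quadratic in the number of repositories,
    which is fine for a sketch but not tuned for millions of repositories.
    """
    names = sorted(repo_shas)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # A non-empty intersection means shared history (assumed criterion).
            if repo_shas[first] & repo_shas[second]:
                pairs.append((first, second))
    return pairs


# Hypothetical data: project_b was registered from a clone of project_a.
sample = {
    "project_a": {"aaa1111", "bbb2222", "ccc3333"},
    "project_b": {"aaa1111", "bbb2222", "ddd4444"},
    "project_c": {"eee5555"},
}
print(find_duplicates(sample))
```

Fork flags alone would miss this case, because a repository that was cloned and then registered as a new project is not marked as a fork on GitHub, while its commit history still overlaps with the original.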