
Analysing the evolution of social aspects of open source software ecosystems

Presented as an invited talk at the IWSECO 2011 workshop in Brussels on June 7, 2011.
Also presented as an invited talk at the GDR GPL 2011 days in Lille, France on June 8, 2011.

Published in: Technology

  1. 1. Analysing the evolution of social aspects of open source software ecosystems
      Tom Mens & Mathieu Goeminne
      Service de Génie Logiciel, Département des Sciences Informatiques, Faculté des Sciences, Université de Mons, Belgium
  2. 2. Research topic
      • Study of software evolution
        • in particular, the software process and software quality
      • Focus on software ecosystems (see next slide)
      • Focus on social aspects (see next slide)
      • By analysing and visualising historical data of open source projects
        • obtained by mining and combining data from different types of repositories
      Université de Mons; Tuesday 7 June 2011, IWSECO workshop, Brussels; Tom Mens & Mathieu Goeminne
  3. 3. Goal
      • Analyse the success factors (e.g. popularity, quality) of OSS projects
      • What are good and bad practices of OSS evolution? What lessons can we learn?
      • Study how social aspects co-evolve with, and affect, the software product and process
      • Exploit this knowledge to improve the current practice (e.g., guidelines, tool support)
  4. 4. Software ecosystem
      • There are many views on a software ecosystem. Our focus:
        • the ecosystem of a single project, containing all artefacts (code, documentation, etc.) and persons involved in using, producing or modifying these artefacts
        • the ecosystem of a coherent collection or distribution of individual projects
      • Example: GNOME desktop environment for Linux, containing hundreds of applications/projects
        • The ecosystem of each individual GNOME project can be studied
        • The interaction between all GNOME projects and their contributors can be studied
      • Other examples: Linux OS distributions (e.g. Debian, Ubuntu)
  5. 5. Social aspects
      • Which communities (e.g. users, developers) are involved in a software project?
        • relation between project quality and popularity?
      • How do communities communicate / interact?
        • e.g. how are developers driven by user requests and how does this influence the project evolution?
      • How are communities structured?
        • relation between community structure and software quality or maintainability?
      • How is work distributed among persons?
        • relation between distribution of work / responsibility and maintainability?
      • Which processes (formal or informal) are used?
  6. 6. Why open source?
      • Free access to source code, defect data, developer and user communication
      • Historical data available in open repositories
      • Observable communities
      • Observable activities
      • Increasing popularity for personal and commercial use
      • A huge range of community and software sizes
  7. 7. Methodology
      • Exploit available data from different repositories
        • code repositories (e.g. SVN, Git, ...)
        • mail repositories (mailing lists)
        • bug repositories (bug trackers)
      • Select open source projects
      • Use Herdsman framework
        • Based on FLOSSMetrics data extraction
      • Use identity merging tool
      • Use of statistical analysis and visualisation
  8. 8. Selecting Projects
      • Criteria for selecting projects
        • Availability of data from repositories
        • Data processable by FLOSSMetrics tools
          • CVSAnaly2, MLStats, Bicho
        • Size of considered projects: persons involved, code size, activity in each repository
        • Age of considered projects
  9. 9. Herdsman Framework
  10. 10. Identity Merging
  11. 11. Comparing Merge Algorithms
      • Based on a reference merge model
        • manually created
        • iterative approach
        • relying on information contained in different files (COMMITTERS, MAINTAINERS, AUTHORS, NEWS, README)
      • Compute, for each algorithm, precision and recall w.r.t. the reference model
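
The comparison against a reference model can be illustrated with a small sketch. It assumes, since the slides do not say, that a merge model is a list of sets of identity labels and that precision and recall are computed over pairs of identities grouped into the same person. The function names and sample identities are hypothetical.

```python
from itertools import combinations

def merged_pairs(model):
    """Return the set of unordered identity pairs that a merge model groups together."""
    pairs = set()
    for group in model:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def precision_recall(candidate, reference):
    """Pairwise precision and recall of a candidate merge model w.r.t. a reference model."""
    cand = merged_pairs(candidate)
    ref = merged_pairs(reference)
    true_pos = len(cand & ref)
    precision = true_pos / len(cand) if cand else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    return precision, recall

# Made-up identities: the reference merges three labels into one person.
reference = [{"jdoe", "john.doe@example.org", "John Doe"}, {"asmith"}]
candidate = [{"jdoe", "john.doe@example.org"}, {"John Doe", "asmith"}]
print(precision_recall(candidate, reference))
```

Here the candidate proposes two merged pairs, one of which is in the reference, and the reference contains three pairs in total.
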
  12. 12. Comparing Merge Algorithms
      • Simple
      • Bird (code and mail repositories only)
        • based on Levenshtein distance
      • Bird, extended for bug repositories
      • Improved
        • Combining ideas from Bird and Robles
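
As an illustration of the Levenshtein-based matching mentioned above, the following sketch computes the edit distance between two names and applies a normalised similarity threshold. The normalisation, the threshold value and the function `similar` are assumptions made for this example, not the parameters of Bird's algorithm.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similar(a, b, threshold=0.8):
    """Hypothetical rule: two labels match if their normalised similarity reaches the threshold."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - levenshtein(a, b) / longest >= threshold

print(levenshtein("kitten", "sitting"))
print(similar("John Doe", "Jon Doe"))
```

Varying `threshold` trades precision against recall: a lower value merges more identities, at the risk of merging distinct persons.
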
  13. 13. Comparing Merge Algorithms [figure panels: Brasero, Evince]
  14. 14. Comparing Merge Algorithms (varying parameter values) - Evince [figure panels: Simple, Improved, Bird, Bird extended]
  15. 15. Analysing Activity Distribution
  16. 16. Analysing Activity distribution
      Three different longitudinal studies:
      1. Distribution of work, for a single project, using coarse-grained data from different repositories
         • How are developer activities (commits, mails, bug fixes) distributed across repositories?
         • How is activity distributed across developers?
         • How does this evolve over time?
      2. Distribution of work, for a single project, using fine-grained data from a single repository
         • Based on commits of different file types in version repository
      3. Distribution of work across different projects belonging to the same community (e.g. GNOME)
  17. 17. Activity distribution: First study
      • Analyse, for individual GNOME projects, historical data from version repositories, bug trackers and mailing lists
      • Focus on 3 types of activities per person: commits, mails, bug report changes
      • How are activities distributed: over different persons, over time?
  18. 18. Activity distribution [figure: one plot each for Brasero and Evince; series: commits, mails, bug report changes; both axes range from 0 to 1]
  19. 19. Activity distribution: First study
      • Evidence of Pareto principle (20/80 rule)?
        • Most activity is carried out by a small group of persons
        • Typically: 20% do 80% of the job
        • Distribution of code activity more unequal than mail activity
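
A quick way to check the 20/80 claim on a list of per-person activity counts is sketched below. The function name `top_share` and the commit counts are made up for illustration. They are not data from the studied projects.

```python
def top_share(activity_counts, fraction=0.2):
    """Share of total activity produced by the most active `fraction` of persons."""
    counts = sorted(activity_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # At least one person is always counted in the top group.
    top_n = max(1, round(len(counts) * fraction))
    return sum(counts[:top_n]) / total

commits = [400, 250, 60, 40, 30, 20, 10, 10, 5, 5]  # hypothetical commits per person
print(f"top 20% of persons account for {top_share(commits):.0%} of commits")
```
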
  20. 20. Activity distribution: First study
      • Analyse activity distribution over time
      • Use econometrics
        • express inequality in a distribution
        • aggregation metrics: Gini, Hoover, Theil (normalised)
        • Values between 0 and 1
        • 0 = perfect equality; 1 = perfect inequality
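
The three indices can be sketched with their textbook definitions. The slides do not state how the Theil index was normalised. Dividing by ln(n), its maximum for n persons, is an assumption made here, and the function names are illustrative.

```python
import math

def gini(values):
    """Gini index via the rank-weighted formula on values sorted in ascending order."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def hoover(values):
    """Hoover index: share of the total that must be redistributed to reach equality."""
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    mean = total / n
    return sum(abs(x - mean) for x in values) / (2 * total)

def theil_normalised(values):
    """Theil index divided by ln(n) (assumed normalisation); zero entries contribute 0."""
    n, total = len(values), sum(values)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    theil = sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n
    return theil / math.log(n)

equal = [5, 5, 5, 5]      # everyone equally active
single = [0, 0, 0, 10]    # one person does everything
for name, index in (("Gini", gini), ("Hoover", hoover), ("Theil", theil_normalised)):
    print(name, index(equal), index(single))
```

With a finite number of persons n, the Gini and Hoover indices reach at most (n - 1)/n, so they approach 1 only for large groups.
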
  21. 21. Activity distribution: First study [figure: Evolution of aggregation indices for Evince; series: Gini, Hoover, Theil (normalised); y-axis from 0 to 1; x-axis from Apr-99 to Aug-08]
  22. 22. Activity distribution: First study [figure: Evolution of Gini index; panels: Brasero, Evince; legend and axis labels not recoverable]
  23. 23. Activity distribution: First study
      • Identifying core groups
        • Display Venn diagrams of most active (top 20) persons, according to each activity type (committing, mailing, bug report changing)
        • For each person, show percentage of activity attributable to this person
        • Take into account identity merges
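
The core-group idea can be sketched as set operations on per-activity rankings. The data layout (a dictionary of per-person counts for each activity type, after identity merging), the function names and the sample counts are assumptions for illustration only.

```python
def top_k(activity, k=20):
    """Persons with the k highest counts for one activity type (ties broken by name)."""
    ranked = sorted(activity, key=lambda person: (-activity[person], person))
    return set(ranked[:k])

def core_groups(activities, k=20):
    """Map each top-k person to the set of activity types in which they rank in the top k."""
    membership = {}
    for kind, counts in activities.items():
        for person in top_k(counts, k):
            membership.setdefault(person, set()).add(kind)
    return membership

def shares(activity):
    """Percentage of one activity type attributable to each person."""
    total = sum(activity.values())
    return {p: 100 * c / total for p, c in activity.items()} if total else {}

# Hypothetical counts per merged person.
activities = {
    "commits": {"ann": 50, "bob": 30, "cid": 5},
    "mails": {"ann": 10, "dan": 40, "bob": 2},
    "bug report changes": {"bob": 7, "dan": 1, "eve": 9},
}
groups = core_groups(activities, k=2)
print({p: kinds for p, kinds in groups.items() if len(kinds) >= 2})
```

Persons that appear in two or three of the sets correspond to the overlapping regions of the Venn diagram.
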
  24. 24. Activity distribution: Identifying core groups [figure panels: Brasero, Evince]
  25. 25. Activity distribution: First study
      • Conclusion
        • Activity seems to become more and more unequally distributed
        • The Pareto principle is clearly present in studied projects
        • For Brasero and Evince, the activity is led by a few persons involved in 2 or 3 of the defined activities
  26. 26. Activity distribution: First study
      • Future work
        • Identify the type of statistical distribution
        • Use sliding window approach
          • detect impact of personnel turnover: ignore inactive persons, and discover new active persons
        • Automate detection of core groups
        • Study the evolution of core groups over time
  27. 27. Activity distribution: Second study
  28. 28. Activity distribution: Second study
      • Analysing fine-grained activities, restricted to source code repository
      • Analyse contributors to a project based on types of activity they contribute to

      type of activity  file type
      coding            .c, .h, .cc, .pl, .java, .ada, .cpp, .chh, .py
      documenting       .html, .txt, .ps, .tex, .sgml, .pdf
      translation       .po, .pot, .mo, .charset
      multimedia        .mp3, .ogg, .wav, .au, .mid
      ...
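
The file-type table can be turned into a simple classifier. The sketch below uses only the extensions listed on the slide (including `.chh` as written there). The fallback category `"other"`, the function names and the `(author, path)` input format are assumptions for this example.

```python
import os
from collections import defaultdict

EXTENSIONS_BY_ACTIVITY = {
    "coding": [".c", ".h", ".cc", ".pl", ".java", ".ada", ".cpp", ".chh", ".py"],
    "documenting": [".html", ".txt", ".ps", ".tex", ".sgml", ".pdf"],
    "translation": [".po", ".pot", ".mo", ".charset"],
    "multimedia": [".mp3", ".ogg", ".wav", ".au", ".mid"],
}
# Invert the table into an extension -> activity lookup.
ACTIVITY_BY_EXTENSION = {
    ext: activity for activity, exts in EXTENSIONS_BY_ACTIVITY.items() for ext in exts
}

def activity_type(path):
    """Classify a committed file by its extension; unknown extensions map to 'other'."""
    extension = os.path.splitext(path)[1].lower()
    return ACTIVITY_BY_EXTENSION.get(extension, "other")

def activities_per_person(commits):
    """commits: iterable of (author, path) pairs; returns author -> set of activity types."""
    result = defaultdict(set)
    for author, path in commits:
        result[author].add(activity_type(path))
    return dict(result)

sample = [("ann", "src/main.c"), ("ann", "po/fr.po"), ("bob", "README.txt")]
print(activities_per_person(sample))
```
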
  29. 29. Activity distribution: Second study
      • Example: Rhythmbox
        • overlap between 5 activity types
          A = code, B = build, C = devel-doc, D = translation
      • Visualised using Venn diagram
        • values represent number of persons involved in a particular activity (i.e., having committed at least one file of this type)
      [figure: Venn diagram over sets A, B, C, D]
  30. 30. Activity distribution: Second study
      Evince - scatter plots of correlation [figure panels: coding versus developer documentation; translation versus developer documentation; coding versus translation]
  31. 31. Activity distribution: Second study
      Evince - correlation matrix
      Correlations with p-value > 0.01 not shown; medium (>0.5) to strong (>0.8) correlations are coloured.
      Column order: documentation, images, i18n, ui, multimedia, code, meta, config, build, devel.doc, other.
      Values above the diagonal, per row (diagonal = 1):
      • documentation: 0.14715687 (column not recoverable)
      • images: 0.16954996, 0.20213002, 0.23079965, 0.41266264, 0.4596056, 0.4757661, 0.44918361, 0.19411888 (8 of 9 cells shown; columns not recoverable)
      • i18n: 0.15575507, 0.13010625, 0.17912559, 0.15815786, 0.30906678 (5 of 8 cells shown; columns not recoverable)
      • ui: multimedia 0.16142962, code 0.56632106, meta 0.31157703, config 0.54728747, build 0.55080516, devel.doc 0.51066149, other 0.16779495
      • multimedia: 0.21869572, 0.41933948, 0.33361174, 0.36632992, 0.20712209 (5 of 6 cells shown; columns not recoverable)
      • code: meta 0.30763128, config 0.8242121, build 0.90169181, devel.doc 0.87365497, other 0.4087696
      • meta: config 0.31914846, build 0.54909454, devel.doc 0.2434247, other 0.61723848
      • config: build 0.89979553, devel.doc 0.95124778, other 0.43475402
      • build: devel.doc 0.90540444, other 0.58183399
      • devel.doc: other 0.41281866
  32. 32. Activity distribution: Second study
      • Correlation graph for Evince
      • Nodes represent relative amount of activity of each type
      • Edges represent a statistically significant (p<0.01) high correlation (>0.8)
      • Edges with weak or not statistically significant correlation, and activity nodes <0.5%, are not shown.
      [figure: correlation graph; node labels: documentation 2.2%, code 46.2%, config 0.8%, images 1.2%, meta 1.6%, build 16.1%, devel-doc 19%, i18n 12.6%]
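
The edge selection for such a correlation graph can be sketched as follows. The sketch assumes that each activity type has a vector of per-person file counts (same person order in every vector) and that Pearson correlation is used. The slides do not name the coefficient. The significance filter (p < 0.01) is left out here, and all names and counts are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for a constant vector; treated as no edge
    return cov / (sd_x * sd_y)

def correlation_edges(counts, threshold=0.8):
    """Return {(type_a, type_b): r} for all pairs whose correlation exceeds the threshold."""
    edges = {}
    for a, b in combinations(sorted(counts), 2):
        r = pearson(counts[a], counts[b])
        if r > threshold:
            edges[(a, b)] = r
    return edges

# Hypothetical number of files touched by five persons, per activity type.
counts = {
    "code": [10, 8, 0, 1, 0],
    "build": [5, 4, 0, 1, 0],
    "i18n": [0, 0, 9, 0, 7],
}
print(correlation_edges(counts))
```

In this made-up data only the pair (build, code) exceeds the threshold, so the graph would contain a single edge.
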
  33. 33. Activity distribution: Second study
      • Evince: Interpretation of results
        • The activities of building, coding and development documentation are done by the same group of persons
        • Translators (i18n) are also involved in development documentation but not in coding
        • Other documentation is largely independent from these activities
      [figure: correlation graph for Evince; node labels: documentation 2.2%, code 46.2%, config 0.8%, images 1.2%, meta 1.6%, build 16.1%, devel-doc 19%, i18n 12.6%]
  34. 34. Activity distribution: Second study
      • Conclusion
        • Some activities are done by the same group of persons, other activities are largely independent
      • Future work
        • Study and compare the activity patterns on other projects
        • Study how the separation of activities influences the process
        • Compare the activity distribution of different types of activities
          • E.g. translators have a more equal distribution of work than coders
  35. 35. Activity distribution: Third study
      • Study the activity of persons across software projects in the GNOME ecosystem
      • Case: Large long-lived GNOME projects using Git

      ID  Project name  Authors  Years  Files
      A   Banshee       268      5.9    2700
      B   Rhythmbox     364      8.9    937
      C   Tomboy        290      6.6    766
      D   Evince        381      12.1   699
      E   Brasero       193      4.2    797
  36. 36. Activity distribution: Third study
      • How are activities distributed in different GNOME projects?
        • counted in terms of number of files modified
        • code files are most frequently committed, followed by build, development documentation and translation files
      [figure: per-project charts; labels not recoverable]
  37. 37. Activity distribution: Third study
      • How are activities distributed in different GNOME projects?
        • counted in terms of number of persons involved in modifying files
        • most persons are involved in translations, followed by development documentation, followed by code and build
      [figure: per-project chart; labels not recoverable]
  38. 38. Activity distribution: Third study
      • How many persons are contributing to more than 1 GNOME project?
      [figure: Venn diagram over projects A, B, C, D, E]
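
Counting persons who contribute to more than one project amounts to counting set memberships. The sketch below assumes that the author sets per project are already identity-merged. The project labels reuse the IDs A to C from the table for readability, and the author names are invented.

```python
from collections import Counter

def projects_per_person(authors_by_project):
    """Count, for each person, the number of projects they contributed to."""
    counts = Counter()
    for authors in authors_by_project.values():
        counts.update(set(authors))
    return counts

def multi_project_contributors(authors_by_project, minimum=2):
    """Persons who contributed to at least `minimum` projects."""
    counts = projects_per_person(authors_by_project)
    return {person for person, n in counts.items() if n >= minimum}

# Hypothetical author sets.
authors_by_project = {
    "A": {"ann", "bob"},
    "B": {"bob", "cid"},
    "C": {"bob", "cid", "dan"},
}
print(sorted(multi_project_contributors(authors_by_project)))
```
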
  39. 39. Activity distribution: Third study
      • Which types of activities are carried out by persons involved in multiple projects?
        • Mainly translators. Coders tend to stick to one project.
      [figure: two Venn diagrams over projects A, B, C, D, E; panels: coders, translators]
  40. 40. Activity distribution: Third study
      • Work in progress
        • Study activities of developers across different projects
        • How, when and why do developers contribute to different projects?
          • Simultaneously
          • Over time
  41. 41. Thank you
