Mining Open Source Software(OSS) Data Using Association Rules ...

  1. 1. Mining SourceForge Data to Discover Models of Open Source Software (OSS) Project Performance. Joseph Davis, Bavani Arunasalam, Simon Poon, Sanjay Chawla. Knowledge Management Research Group, School of Information Technologies, The University of Sydney
  2. 2. Outline <ul><li>Motivation for this project </li></ul><ul><li>Open Source Software (OSS) Development </li></ul><ul><li>SourceForge data repository </li></ul><ul><li>Data Mining Possibilities </li></ul><ul><li>Association Rule Mining and Association Rules Network (ARN) </li></ul><ul><li>Application of ARNs to OSS data </li></ul><ul><li>Theory Building using Data Mining </li></ul><ul><li>Conclusions and Future Research </li></ul>
  3. 3. Motivation <ul><li>Steady success of Open Source Software (OSS): Linux, Apache, Samba, Python, MySQL </li></ul><ul><li>The KM group is studying a range of OSS-related questions using theoretical and data-driven approaches </li></ul><ul><li>Availability of extensive data on most aspects of OSS projects </li></ul><ul><li>Question: What are the key factors that can explain ‘success’ in OSS projects? </li></ul>
  4. 4. Open Source Software Development <ul><li>Non-proprietary and perceived to be socially beneficial model of software development </li></ul><ul><li>OS software in the public domain; source code freely available for modification and distribution </li></ul><ul><li>Nearly 200,000 projects in progress, each involving dozens to hundreds of (geographically distributed) developers who coordinate their work through the internet </li></ul><ul><li>Increasingly viewed as a viable model for building robust, secure, and scalable software - commons-based peer production model/distributed innovation. </li></ul>
  5. 5. OSS Trends <ul><li>Growing acceptance of OS software in organizations </li></ul><ul><li>Increasing participation by large software companies such as IBM, Sun, and HP in OSS development </li></ul><ul><li>Increasingly viable software distribution business models </li></ul><ul><li>Large and growing communities of OSS developers and users </li></ul>
  6. 6. Untested Claims regarding OSS development <ul><li>Good software evolves when a dedicated community (of developers and programmers) works cooperatively, in contrast to the more traditional hierarchical and closed model (OSI, 2001); this is the ‘cathedral’ and ‘bazaar’ analogy. </li></ul><ul><li>Quality, speed, portability, and scalability of the resulting software. </li></ul><ul><li>Taming complexity, fewer bugs (many eyeballs phenomenon) </li></ul><ul><li>Offers a viable model for the emerging ‘virtual organisations’. </li></ul>
  7. 7. Open Research Questions <ul><li>How do we discover crucial relationships that characterise successful and unsuccessful OSS projects? </li></ul><ul><li>How can we develop models (specifying hypotheses) of the critical determinants of OSS project performance? </li></ul><ul><li>What constitutes good performance in OSS development? </li></ul>
  8. 8. Field Data for OSS Research <ul><li>SourceForge.net is the largest OSS development website. </li></ul><ul><li>Besides hosting, SourceForge.net provides services for version control, bug-tracking etc. </li></ul><ul><li>Nearly 200,000 projects grouped under 17 categories; over 2 million users. </li></ul><ul><li>Great source of ‘field’ data to research OSS development. </li></ul>
  9. 10. Problems with SourceForge <ul><li>The number of ongoing OSS projects is misleading. Most of the overall activity is accounted for by fewer than 10% of the projects (Pareto distribution) </li></ul><ul><li>Need for purposeful sampling and careful data cleaning: extreme variations across projects and noise </li></ul>
  10. 11. Problem Definition <ul><li>GIVEN: OSS data downloaded from SourceForge.net </li></ul><ul><li>OBJECTIVE: Find patterns which characterize a high performing OSS project </li></ul><ul><li>CONSTRAINTS: The surrogate variable for performance is the number of downloads. </li></ul>
  11. 12. Why not statistical models? <ul><li>Attributes were of heterogeneous types: numerical and discrete </li></ul><ul><li>Data plagued with missing values </li></ul><ul><li>Downloads followed a Pareto distribution </li></ul><ul><ul><li>Most projects have few downloads, but the distribution has a long tail </li></ul></ul><ul><ul><li>Ex: the median download count is 70, but counts can be up to 600,000 </li></ul></ul>
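The slides do not say how numeric attributes were turned into the High/Low items that appear in the later networks. The following is only a minimal sketch of one plausible approach, assuming a median split; the function name `discretize` and the sample values are hypothetical (only the median of 70 and the maximum of 600,000 echo the figures quoted above).

```python
from statistics import mean, median

def discretize(values, low_label="Low", high_label="High"):
    """Label each value Low/High relative to the median (an assumed cut point)."""
    cut = median(values)
    labels = [high_label if v > cut else low_label for v in values]
    return labels, cut

# Hypothetical download counts with a long right tail.
downloads = [3, 12, 40, 70, 95, 800, 600000]

labels, cut = discretize(downloads)
average = mean(downloads)  # dominated by the single extreme project

print("median cut:", cut)
print("mean:", average)
print("labels:", labels)
```

With a tail this heavy the mean says little about a typical project, which is one reason a rank-based cut such as the median is a natural choice for building discrete items.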
  12. 13. Association Rules <ul><li>Association rule mining: </li></ul><ul><ul><li>Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories. </li></ul></ul><ul><li>Applications: </li></ul><ul><ul><li>Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. </li></ul></ul><ul><li>Examples: </li></ul><ul><ul><li>Rule form: “Body → Head [support, confidence]”. </li></ul></ul><ul><ul><li>buys(x, “diapers”) → buys(x, “beers”) [1%, 60%] </li></ul></ul><ul><ul><li>major(x, “CS”) ^ takes(x, “DB”) → grade(x, “high”) [1%, 75%] </li></ul></ul>
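To make the [support, confidence] pair in the rule form concrete, here is a small sketch that computes both measures from a list of transactions. The helper names `support` and `confidence` and the toy baskets are assumptions for illustration, not data from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, body, head):
    """Estimate of P(head | body): support(body and head) / support(body)."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(body) | set(head)) / body_support

# Hypothetical toy baskets, made up purely for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(f"diapers -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Support measures how often the whole rule applies in the data, while confidence measures how reliable the head is once the body holds.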
  13. 14. Typical Association Rule Mining Approaches <ul><li>Discover robust association rules that are non-obvious and actionable </li></ul><ul><li>Discover frequent item sets as features that serve as discriminators for classification and prediction (based on a class variable) </li></ul><ul><li>Our approach seeks to discover a graph structure that characterises performance based on the mined association rules. </li></ul>
  14. 15. Association Rules <ul><li>Given: (1) database of transactions (OSS projects), (2) each transaction is a list of items (project variable values) </li></ul><ul><li>Find: all rules that correlate the presence of one set of items with that of another set of items </li></ul><ul><ul><li>E.g., {bug fixing activity = ‘high’, number of developers = ‘high’} → {number of downloads = ‘high’}: 72% of OSS projects with a high bug fixing activity level and a high number of developers also have a high number of downloads </li></ul></ul>
  15. 16. Problems with Association Rule Mining <ul><li>Too many (irrelevant/redundant) rules generated </li></ul><ul><li>Measures of “interestingness” still primitive and not general </li></ul><ul><li>Our solution: a pruning strategy that creates an Association Rules Network in a recursive manner </li></ul><ul><li>Related Work: </li></ul><ul><li>S. Chawla, J. Davis, G. Pandey, "On Local Pruning of Association Rules Using Directed Hypergraphs", IEEE International Conference on Data Engineering (ICDE’04) </li></ul>
  16. 17. Association Rules Network <ul><li>Consider a binary table R(A,B,C,D,E,F,G) </li></ul><ul><li>{B=1, C=1} -> {A=1} </li></ul><ul><li>{D=1} -> {A=1} </li></ul><ul><li>{F=0} -> {B=1} </li></ul><ul><li>{F=0, E=1} -> {C=1} </li></ul><ul><li>{E=1, G=0} -> {D=1} </li></ul><ul><li>{A=1, G=1} -> {E=1} </li></ul>Fix a consequent {A=1}. [Figure: the resulting network, with nodes B=1, C=1, F=0, A=1, D=1, E=1, G=0]
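The recursive construction can be sketched on the six rules above. This is an illustrative reading, not the authors' published algorithm: it assumes a breadth-first walk backwards from the fixed consequent, and it drops any rule whose antecedent contains an item already downstream of its consequent (which would create a cycle). The names `build_arn` and `downstream` are hypothetical.

```python
from collections import deque

def downstream(node, edges):
    """Items reachable from `node` (inclusive) by following accepted rules forward."""
    seen, stack = {node}, [node]
    while stack:
        current = stack.pop()
        for antecedent, consequent in edges:
            if current in antecedent and consequent not in seen:
                seen.add(consequent)
                stack.append(consequent)
    return seen

def build_arn(rules, sink):
    """Grow the network backwards from `sink`, one level of rules at a time."""
    edges, visited, frontier = [], {sink}, deque([sink])
    while frontier:
        target = frontier.popleft()
        for antecedent, consequent in rules:
            if consequent != target:
                continue
            # Skip rules that would make some item reachable from itself.
            if antecedent & downstream(consequent, edges):
                continue
            edges.append((antecedent, consequent))
            for item in sorted(antecedent):
                if item not in visited:
                    visited.add(item)
                    frontier.append(item)
    return edges, visited

# The six rules from the slide, as (antecedent, consequent) pairs.
rules = [
    (frozenset({"B=1", "C=1"}), "A=1"),
    (frozenset({"D=1"}), "A=1"),
    (frozenset({"F=0"}), "B=1"),
    (frozenset({"F=0", "E=1"}), "C=1"),
    (frozenset({"E=1", "G=0"}), "D=1"),
    (frozenset({"A=1", "G=1"}), "E=1"),
]

edges, nodes = build_arn(rules, "A=1")
for antecedent, consequent in edges:
    print(sorted(antecedent), "->", consequent)
```

Under these assumptions the rule {A=1, G=1} -> {E=1} is pruned, because A=1 is already reachable from E=1, and the surviving items match the seven nodes listed for the figure.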
  17. 18. ARN Definition <ul><li>An ARN (R, z) is a weighted directed hypergraph G = (V ∪ {z}, E), where z is a distinguished sink item (node) and R is the set of association rules, such that </li></ul><ul><li>Each hyperedge in E corresponds to a rule in R whose consequent is a singleton, </li></ul><ul><li>There is a hyperedge which corresponds to a rule r whose consequent is the single item z. </li></ul>
  18. 19. ARN Definition cont. <ul><li>The distinguished vertex z is reachable from any other vertex in G. </li></ul><ul><li>Any vertex p not equal to z is not reachable from z. </li></ul><ul><li>The weight of each edge corresponds to the confidence of the rule that it encapsulates. </li></ul>
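The reachability conditions in the definition can be checked mechanically. The sketch below is one possible check, assuming that a vertex reaches the consequent of any hyperedge whose antecedent contains it (a simplification of hypergraph reachability); edge weights are omitted, and `is_valid_arn` and `forward_reachable` are hypothetical names.

```python
def forward_reachable(start, edges):
    """Vertices reachable from `start` (inclusive) by following hyperedges forward."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for antecedent, consequent in edges:
            if node in antecedent and consequent not in seen:
                seen.add(consequent)
                stack.append(consequent)
    return seen

def is_valid_arn(edges, sink):
    """Check the sink-related conditions of the ARN definition."""
    vertices = {sink}
    for antecedent, consequent in edges:
        vertices |= set(antecedent) | {consequent}
    has_sink_rule = any(consequent == sink for _, consequent in edges)
    sink_is_terminal = forward_reachable(sink, edges) == {sink}
    all_reach_sink = all(sink in forward_reachable(v, edges) for v in vertices)
    return has_sink_rule and sink_is_terminal and all_reach_sink

# Hyperedges as (antecedent, consequent); each consequent is a single item.
arn_edges = [
    ({"B=1", "C=1"}, "A=1"),
    ({"D=1"}, "A=1"),
    ({"F=0"}, "B=1"),
    ({"F=0", "E=1"}, "C=1"),
    ({"E=1", "G=0"}, "D=1"),
]
with_back_edge = arn_edges + [({"A=1", "G=1"}, "E=1")]

print(is_valid_arn(arn_edges, "A=1"))
print(is_valid_arn(with_back_edge, "A=1"))
```

Representing each hyperedge with a single consequent item enforces the singleton condition by construction, so only the sink conditions need an explicit test.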
  19. 20. Sampling <ul><li>Results based on a sample of 2301 ‘stable’ or ‘production’ projects which were initiated in the second half of 1999. </li></ul>
  20. 21. ARN for High Download [Figure: association rules network with sink #Download = High. Nodes: #Support Request = High, #Patches Completed = High, #Bugs Found = High, #Forum Messages = High, #Developers = High, OS = POSIX, #CVS Committed = High, #Bugs Fixed = High, #Public Forums = High, #Administrators = High. Edge confidences shown: 78.7%, 73.8%, 68.4%, 67.9%, 90%, 55.3%, 93.3%]
  21. 22. ARN for Low Download [Figure: association rules network with sink #Download = Low. Nodes: #Support Completed = Low, #Public Forum = Low, #Bugs Found = Low, #Forum Messages = Low, #Developers = Low, #OS = 1, #CVS Committed = Low, #Bugs Fixed = Low, #Mailing Lists = Low, #Administrators = Low, #Support Requested = Low, #Task Completed = Low, Environment = Web based, #Patches = Low, #Surveys = Low, #Environments = 1. Edge confidences shown: 95.3%, 77.9%, 92.1%, 60.1%]
  22. 23. Resulting Network [Figure: network with nodes #Bugs Found, #Forum Messages, #Developers, #Public Forum = Low, #CVS Committed, #Bugs Fixed, #CVS Committed, #Administrators, #Download]
  23. 24. Critical Factors <ul><li>Coding and bug fixing activity levels </li></ul><ul><li>Communication intensity </li></ul><ul><li>Core development team strength </li></ul>
  24. 25. Validation with Factor Analysis (FA) <ul><li>Independently applied FA. </li></ul><ul><li>Factors are mutually orthogonal variables which are linear combinations of subsets of original variables. </li></ul><ul><li>The factor structures were generally consistent with the ARN results. </li></ul>
  25. 27. Related Research Projects <ul><li>Temporal analysis of OSS project evolution </li></ul><ul><li>Studies of OSS communities </li></ul><ul><li>Analysis of OS software code and community co-evolution (Samba) </li></ul><ul><li>Study of open source software deployment in organisations. </li></ul>
  26. 28. Conclusion <ul><li>Need to understand the key drivers for OSS beyond experience-based intuition and isolated case studies </li></ul><ul><li>Association Rules Networks (ARNs) give some insight into the process </li></ul><ul><li>These insights are consistent with results from software engineering </li></ul><ul><li>Factor Analysis serves as a form of validation </li></ul>
