Your SlideShare is downloading. ×
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
The Comment Density of Open Source Software Code
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

The Comment Density of Open Source Software Code

4,013

Published on

The development processes of open source software are different from traditional closed source development processes. Still, open source software is frequently of high quality. Thus, we are …

The development processes of open source software are different from traditional closed source development processes. Still, open source software is frequently of high quality. Thus, we are investigating how open source software creates high quality and whether it can maintain this quality for ever larger project sizes. In this paper, we look at one particular quality indicator, the density of comments in open source software code. In a large-scale study of more than 5,000 projects, we find that active open source projects document their source code, and we find that the comment density is independent of team and project size, but not of project age. In future work, we intend to correlate comment density with project success or failure.

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
4,013
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. The Comment Density of Open Source Software Code Oliver Arafat Dirk Riehle Siemens AG, Corporate Technology SAP Research, SAP Labs LLC oarafat@gmail.com dirk@riehle.org, http://dirkriehle.com http://twitter.com/dirkriehle  Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software  Code.” In Companion to Proceedings of the 31st International Conference  on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198. [Link: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]
  • 2. Overview Summary (blog post): The Sweet Spot of Code Commenting in Open Source [Link: http://dirkriehle.com/2009/02/04/the­sweet­spot­of­code­commenting­in­open­source/] Abstract: The development processes of open source software are different from  traditional closed source development processes. Still, open source software is  frequently of high quality. Thus, we are investigating how open source software  creates high quality and whether it can maintain this quality for ever larger project  sizes. In this paper, we look at one particular quality indicator, the density of comments  in open source software code. In a large­scale study of more than 5,000 projects, we  find that active open source projects document their source code, and we find that the  comment density is independent of team and project size, but not of project age. In  future work, we intend to correlate comment density with project success or failure.  Reference: Oliver Arafat, Dirk Riehle.  “The Comment Density of Open Source Software Code.” In Companion to  Proceedings of the 31st International Conference on Software Engineering (ICSE  2009). IEEE Press, 2009. Page 195­198. [Link:  http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 2
  • 3. Analysis Process and Tool Chain Raw data source Raw Data • Local database  (ohloh.net snapshot , crawled sources ) (ohloh.net) • Web services access  (ohloh.net, sourceforge .net, others) Pre­processing SQL, Java • Database querying using SQL and scripts • Java library for computationally heavyweight filters , aggregation Aggregated data source Aggr. Data • Output of pre ­processing stage for specific analytical tasks (RDBMS) • Aggregated data significantly improves analysis speed Excel / Calc, Analytical processing R, spec. tools • Mines aggregated  (and raw ) data for insights , hypothesis testing • At present basic processing  (Excel), machine learning next Analysis output • Results of analytical processing : averages , distributions , correlations 3 4 5 • Presented as models , tables, graphs, charts, etc. 3 2 1 April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved.
  • 4. Commenting Practice in Open Source 1.E+08 y = 0.1596x 1.E+07 R2 = 0.872 1.E+06 Comment Lines 1.E+05 1.E+04 1.E+03 1.E+02 1.E+01 1.E+00 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 Project Size in Lines of Code (LoC = CL + SLoC) April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 4
  • 5. Comment Density 100% mean = 0.1867 90% median = 0.1674 stdev = 0.1088 80% correl = ­0.00787 70% Comment Density 60% 50% 40% 30% 20% 10% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 Project Size in Lines of Code (LoC = CL + SLoC) • Comment density • Definitions – Percentage of lines of code that are comments – CL = comment lines – Measure of how well documented code is – SLoC = source lines of code – CD = CL / (CL + SLoC) – Indicator of likelihood of survival April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 5
  • 6. Variation by Programming Language Java 1.E+08 100% 70% y = 0.2798x 90% mean = 0.2587 1.E+07 median = 0.2566 60% R2 = 0.9334 Percentage of Occurrences 80% stdev = 0.1111 1.E+06 70% 50% Comment Density Comment Lines 1.E+05 60% 40% 0.3327 1.E+04 50% 30% 40% 0.2450 0.2340 1.E+03 30% 20% 1.E+02 20% 0.0841 0.0841 10% 1.E+01 10% 0.0165 0.0037 0.0000 0.0000 1.E+00 0% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 [0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[ y = 0.2798x; R2 = 0.9334; mean = 0.2587; median = 0.2566; stdev = 0.1111; population size = 1085 Project Size in Lines of Code (CL + SLoC) Project Size in Lines of Code (CL + SLoC) Comment Density PERL 1.E+07 100% 70% y = 0.1602x 90% mean = 0.1044 0.5855 1.E+06 median = 0.0902 60% R2 = 0.9464 Percentage of Occurrences 80% stdev = 0.0713 1.E+05 70% 50% Comment Density Comment Lines 60% 1.E+04 40% 0.3382 50% 1.E+03 30% 40% 1.E+02 30% 20% 20% 1.E+01 y = 0.1602x; R2 = 0.9464; mean = 0.1044; median = 0.0902; stdev = 0.0713; population size = 273 10% 10% 0.0582 0.0073 0.0109 0.0000 0.0000 0.0000 0.0000 1.E+00 0% 0% 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 [0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[ Project Size in Lines of Code (CL + SLoC) Project Size in Lines of Code (CL + SLoC) Comment Density More here: How Open Source Comments (by Programming Language) [Link: http://dirkriehle.com/2008/11/10/how­open­source­comments­by­programming­language/] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 6
  • 7. Comment Density by Commit Size 100% sloc size = 1­100 sloc size = 50­100 sloc size = 80­100 Comment Density by SLoC Size 90% mean = 0.2513 mean = 0.2234 mean = 0.2224 Total Average Comment Density median = 0.2340 median = 0.2190 median = 0.2171 80% stdev = 0.0626 stdev = 0.0169 stdev = 0.0209 70% Comment Density 60% 50% 40% 30% 20% 10% 0% 0 10 20 30 40 50 60 70 80 90 100 Source Lines of Code in Commit [SLoC] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 7
  • 8. Comment Density by Team Size 50% 50% team size = 1­20 team size = 1­50 team size = 1­100 Comment Density by Team Size 45% mean = 0.1914 mean = 0.1922 mean = 0.1856 45% Total Average Comment Density median = 0.1878 median = 0.1906 median = 0.1857 40% Percent Projects with Team Size 40% stdev = 0.0255 stdev = 0.0425 stdev = 0.0641 correl = ­0.0550 35% 35% Comment Density 30% 30% 25% 25% 20% 20% 15% 15% 10% 10% 5% 5% 0% 0% 0 10 20 30 40 50 60 70 80 90 100 Team Size [Number of Committers] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 8
  • 9. Comment Density by Project Age 19.50% correl = ­0.9054 19.25% Average Comment Density 19.00% 18.75% 18.50% 18.25% 18.00% 17.75% 17.50% 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 Project Age [months] April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 9
  • 10. Commenting Practice Summary • Successful open source projects follow a consistent commenting practice – 1 out of 20 commits serves commenting purposes only – 1 out of 5 non­empty lines is a comment line • Average comment density is independent of project size • Average comment density is independent of team size • Average comment density slowly falls with project age April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved. 10
  • 11. References • Dirk Riehle, John Ellenberger, Tamir Menahem, Boris Mikhailovski, Yuri Natchetoi, Barak Naveh, Thomas  Odenwald. “Bringing Open Source Best Practices into Corporations Using a Software Forge.” IEEE  Software, 2009. See:  http://dirkriehle.com/2009/02/11/open­collaboration­within­corporations­using­software­forges/  • Dirk Riehle. “The Economic Motivation of Open Source: Stakeholder Perspectives.” IEEE Computer, vol.  40, no. 4 (April 2007). Page 25­32. See:  http://dirkriehle.com/computer­science/research/2007/computer­2007.html  • Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of  the 42nd Hawaiian International Conference on System Sciences (HICSS 42). IEEE Press, 2009. See:  http://dirkriehle.com/2008/09/23/the­commit­size­distribution­of­open­source­software/  • Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth  Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197­209. See:  http://dirkriehle.com/2008/03/14/the­total­growth­of­open­source/  • Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In  Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008.  Page 273­280. See:  http://dirkriehle.com/2008/03/08/continuous­integration­in­open­source­software­development/  • Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th  International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Page 105­115.  See: http://dirkriehle.com/2009/02/11/estimating­commit­sizes­efficiently/  • Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to  Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press,  2009. Page 195­198. See:  http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/  11 April 6, 2009. Version 1.1.  © 2009 Dirk Riehle. All Rights Reserved.
  • 12. Thank you! Questions? For any feedback or questions, please email the authors: Oliver Arafat Dirk Riehle Siemens AG, Corporate Technology SAP Research, SAP Labs LLC oarafat@gmail.com dirk@riehle.org, http://dirkriehle.com http://twitter.com/dirkriehle 

×