Open Source By The Numbers


Rich Sands, Director of Developer Communities at Black Duck, presented these interesting statistics on open source projects at the 2012 Linux Foundation Collaboration Summit.

Published in: Technology

  • Introduce self. Explain role. Use Ohloh data to tease out some interesting facts about FOSS.
  • How do we even answer this question? Is it about the number of repositories? The number of projects? How much code is under an approved license? The number of developers contributing? The number of commits? We know it is BIG, but what does that mean?
  • Most of the size estimates don’t span multiple forges. Not all the repositories on a site like GitHub are part of FOSS projects. There aren’t any “complete” directories of FOSS projects. But that doesn’t really matter. For the purposes of this presentation we’re going to look at the data in Ohloh. It spans multiple forges and includes projects that host their own code too. It doesn’t include everything. Still, it is a sizeable fraction of everything and a representative cross-section of projects. Those huge numbers of projects and repositories miss the point. The FOSS that matters is the FOSS that has activity and community. So we’ll focus on that subset for this presentation.
  • So how many projects are active? A little under ½ of the projects on Ohloh have a working repository with code. Only about 36% of those (96,824 of 271,372) have had activity in the past 2 years.
  • Here’s how the curve looks for projects with a code analysis. The vertical axis shows how long it has been since the last commit. The horizontal axis shows the percentage of projects that have had at least one commit within a particular timeframe. Let’s pick a nice, arbitrary but reasonable definition of “active”: at least one commit in the past year. About 17.3% of all the projects with an analysis (46,883) have had a commit within the past year. The rest of the projects can be considered “abandoned”. FOSS plants a lot of seeds but only a small percentage take root. So does that mean that we need to pay attention to about 47,000 projects? Not quite.
  • There is more to a project than commits. FOSS is collaborative. It is the product of a community. Let’s set a bar for “community”. A pretty low bar. We’ll set it at 2. How many of the active projects have had more than one committer, ever? A bit less than ½ (49.3%) of the active projects have ever had more than one committer. The rest are someone’s private thing. So only 8.5% of all the analyzed projects have had a commit in the past year and have a team of at least 2 people working on them. That is a little over 23,000 projects. Let’s declare these projects to be “live” FOSS projects.
  • So how do we measure liveness? Can we come up with a score that puts projects onto a scoring continuum in sensible relationships, spreads out values enough that well-known projects don’t just bunch up at the high end of the scale, and still gives smaller projects a meaningful “spread”? Start with the basic definition of “liveness” as we’ve seen it so far: projects that don’t clear the commit-within-a-year and team hurdles get a score of 0. Now make more recent activity count more than older activity. For the analysis in this presentation we used committer count. We could use commit count, or LOC deltas, or any number of other approaches combining these; we need to experiment more to see what works best. Exponential decay helps spread out the projects and makes the most active ones, those with both large and active teams, really stand out. One project in particular has a much higher liveness using this method: the Linux Kernel. That makes sense, so let’s set the Kernel to 1000 and normalize everything else to that.
  • Using this method, here is a liveness list for the top 50. We see the Kernel at the top, and a number of famous projects with a lot of activity (Chrome, KDE, Firefox, GNOME, OpenStack, Android, ...) represented at the top of this list. There are some duplicate and mirror repos here as well. Boot To Gecko, a project that recently got a lot of buzz at Mobile World Congress as a new approach to mobile OS design, also shows up in the top 10. Further experimentation is needed to come up with a really solid “liveness” metric, but even this basic approach seems to hold promise.
  • Let’s start comparing projects on different dimensions using this liveness score. How big are the most live projects? Here is a scatter plot of the top 5000 live projects. What are we looking at? We can see that “famous” projects, those with large, active teams and lots of code activity, spread out from the dense cluster of smaller, less well-known but still live efforts. A few projects with enormous code bases show up; these are “distros” that aggregate a lot of other projects and code. There are no really HUGE, active projects. In fact, the larger the project, the less “live” it is. There are always exceptions, like the Linux Kernel, which “prove the rule”.
  • One of the dimensions of liveness is committer count, and we see a similar effect. Bigger projects have fewer committers. Anyone want to speculate as to why that might be? Is it harder to understand a really big code base, so fewer devs are able to dive in? Are smaller projects written in more popular languages?
  • Speaking of languages, here is a breakdown of the primary language for the top 5000 live projects. As Stephen O’Grady discussed at FOSDEM in his presentation “Java in the Age of the JVM”, Java is [still] not dead. C-family languages are also very heavily used: Java + C + C++ is about half of all the actively developed projects. The rest are the hot dynamic languages, with JavaScript and Python the big ones. “Other”, which is everything else (Scala, Groovy, Clojure, Haskell, whatever), totals about the same as Python.
  • So which languages are the primary languages of the most live projects? Here is the average “liveness” of the projects by their primary language. The C-family languages are heavily represented in the largest, most active projects. Why? Perhaps because these infrastructure and OS projects are “system software” with a long history. Once we account for that history, we see more recent and popular languages “duking” it out with Java.
  • We saw that project size (LOC) is inversely correlated with liveness. Java is used in a number of very large projects as well. Could that account for its placement on the previous chart behind Ruby and PHP? Is the small typical project size of Ruby and Python a factor in these languages’ popularity?
  • Which languages attract the largest numbers of developers to live projects? Note the LOG scale here; otherwise the red and green bars would be too short to compare. No surprise here: Java, C, and C++ projects get the most committers, because these well-established languages are very well known by a large number of developers. We don’t see any unusual outliers when we look at committers in the last year, or in the last month.
  • But things start getting more interesting when we look at the primary language of NEW projects. (Note: this is for all projects, not just live ones. A lot of the ones started 5 years ago have since been abandoned.) In fact, let’s look at the primary languages of projects STARTED during 2006-2007, 5 years ago, vs. projects STARTED in the past year. We can see that when developers are cranking up something new, they’re experimenting with new languages (Other) and adopting Python, PHP, and JavaScript significantly more than they did five years ago.
  • Is there anything we can learn by looking at this past year’s new projects, and looking at the most live ones? A few things stand out. Most of the new, most-live projects are backed by major corporations, often as part of a consortium (oVirt, Twitter Bootstrap, Wikimedia Puppet, Katello, Cloud Foundry). Some build a reference implementation for an industry standard (WebRTC). Some are dead-simple to contribute to and leverage crowdsourcing (Khan Academy Exercises). More than anything else though, new, successful projects, those that have the best chance of emerging from the cluster of live projects to join the “famous” family, aim to solve a burning problem for a large user base. It seems like the corporate support may be a consequence of the burning problem effect.
  • What can we learn from all of this? This is just a snapshot of the kinds of analysis possible with a big pool of metadata on FOSS. There are many other ways to slice this data, and it doesn’t even really touch on the kinds of contributor or what could be gleaned from a people-centric view, rather than a project-centered view. Yes, all that code out there is available for the taking and yes, even some abandoned projects are being checked out and used, or inspiring new projects. But only a tiny fraction (about 4.2%) of the total projects in Ohloh are “live”. By looking at the most live projects from different angles (size, language, etc.) we can start to see some patterns. Big code bases get fewer contributors and less activity. Famous projects have a long history, and get a huge share of the overall activity in FOSS. This skews language use and adoption towards the languages that were most popular back when these famous projects were young. Newer projects are able to adopt newer languages, and do. The most important factor in new project success is trying to tackle a really important problem. This leads to big backers, and marketing to drive awareness. New projects also gain contributors because they’re small enough that developers can make a real difference. “First mover” contributors on new projects can have a huge impact on how these projects evolve.
  • Open Source By The Numbers
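
The “liveness” score described in the notes (zero unless a project has a commit in the past year and at least 2 committers ever, an exponentially time-weighted roll-up of committer counts, normalized so the Linux Kernel scores 1000) can be sketched roughly as follows. This is only an illustration, not the code behind the talk: the function names, the decay constant of 0.5 per month, and the sample committer counts are all assumptions, since the presentation gives no formula.

```python
import math

# Assumed decay constant: the slides give no number, only that the most
# recent month counts fully and 11 months back counts nearly zero.
DECAY_RATE = 0.5


def raw_liveness(monthly_committers, all_time_committers, decay_rate=DECAY_RATE):
    """Time-weighted roll-up of committer counts (hypothetical helper).

    monthly_committers[0] is the most recent month; later entries are older.
    """
    window = monthly_committers[:12]
    # Gate from the talk: no activity in the past year, or fewer than
    # 2 committers ever, means a liveness of zero. A year with zero
    # committers is used here as a stand-in for "no commit in the past year".
    if all_time_committers < 2 or sum(window) == 0:
        return 0.0
    # The weight is exp(0) = 1 for the current month and exp(-5.5), roughly
    # 0.004, for 11 months back when decay_rate is 0.5.
    return sum(count * math.exp(-decay_rate * age)
               for age, count in enumerate(window))


def normalize(raw, reference_raw):
    """Scale so the reference project (the Linux Kernel in the talk) is 1000."""
    return 1000.0 * (raw / reference_raw)


# Made-up monthly committer counts, newest month first.
kernel_like = [900, 880, 870, 860, 850, 840, 830, 820, 810, 800, 790, 780]
small_team = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1]

reference = raw_liveness(kernel_like, all_time_committers=5000)
print(normalize(reference, reference))
print(normalize(raw_liveness(small_team, all_time_committers=4), reference))
```

Swapping committer counts for commit counts or LOC deltas, as the notes suggest, would only change what goes into `monthly_committers`; the gate and the weighting stay the same.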

    1. Open Source By The Numbers. Rich Sands, Director of Developer Communities, Black Duck Software, Inc.
    2. How Big is FOSS?
        • GitHub: 4,751,000 repositories
        • SourceForge: 324,000 projects
        • Ohloh: 550,000 projects
        BIG
    3. No, REALLY, How Big is FOSS?
        • It depends on how you count.
        • Lots of projects, but:
            – How many are active, how many abandoned?
            – How many have a team?
        A better question to ask: How much FOSS is actually being worked on now?
    4. How Many Projects are Active?
        • 550,000+ projects on Ohloh.
        • 271,372 with a code analysis.
        • 96,824 with a commit in the past 2 years.
        • 46,883 with a commit in the past year.
        • 29,303 with a commit in the past 6 months.
        • 21,251 with a commit in the past 3 months.
        • 12,870 with a commit in the past month.
        • 5,629 with a commit in the past week.
        • 1,224 with a commit in the past day (3/30-3/31, a weekend).
    5. How Many Projects Are Active? [Chart: days since last commit (vertical axis, 0 to 6000) against % of analyzed projects with a commit in the last Y days (horizontal axis). Marked point: 17.3% of analyzed projects have a commit within the past year.]
    6. But Do All These Projects Have a Team? [Chart: number of committers (vertical axis, topping out at 2,827) against % of active projects with at least Y committers (horizontal axis). Marked point: 49.3% of active projects have 2 or more committers, which is 8.5% of all analyzed projects.]
    7. What is a “Live” Project, Anyway?
        • Let’s invent a new metric, “Liveness”:
            – At least one commit in the last year, and at least 2 committers, for liveness to be non-zero.
            – Time-weighted roll-up of activity, where older activity counts less than more recent activity.
            – For this presentation, activity is committer count.
            – Exponential time-weighting decay such that the most recent month’s activity counts fully, and activity 11 months back counts nearly zero.
            – Normalized; liveness of the Linux Kernel = 1000.
    8. Sniff Test – What Are the Top 50 Live Projects?
        1000.00  Linux Kernel
         711.40  Chromium (Google Chrome)
         516.68  KDE
         491.68  Mozilla Firefox
         491.37  Mozilla Core
         473.17  Boot To Gecko
         396.51  GNOME
         322.54  Homebrew
         319.47  Gentoo Linux
         300.32  WebKit
         273.38  Qt 5
         226.36  FreeBSD Ports
         194.87  OpenStack
         163.50  docrails
         163.19  Ruby on Rails
         159.54  Android
         155.82  LibreOffice
         154.54  MediaWiki
         146.83  FreeBSD
         145.55  GNU Compiler Collection
         129.94  FFmpeg
         124.62  OpenERP
         123.33  SBo-git
         123.33  (unnamed)
         118.32  OpenStack Nova
         118.32  openstacks nova
         117.52  The LLVM Compiler Infrastructure
         115.64  llvm-mirror
         115.20  NetBeans IDE
         114.01  JBoss Application Server
         113.96  NetBSD
         112.73  JBossAS7
         112.73  JBoss Application Server 7
         109.69  Jenkins
         109.26  U-Boot
         108.83  Go programming language
         107.60  tavs go
         105.34  QEMU
         103.62  pkgsrc: The NetBSD Packages Collection
         101.50  platform_frameworks_base
         101.39  Trinity Core
         100.86  LLVM C/Objective-C/C++ frontend (old)
         100.86  LLVM/Clang C family frontend
         100.16  Symfony
          95.76  WSO2 Business Process Server
          94.23  Intellij Community
          90.95  Wine
          89.46  Qt 4
          89.01  XBMC Media Center
          88.39  Chromium Tools (Google Chrome)
        Note – there are a few duplicates and mirrors in this list.
    9. How Big Are Live Projects? [Scatter plot of the top 5000 live projects: lines of code (vertical axis, 0 to 60M) against liveness (horizontal axis, 0-1000 scale). Labeled points: Aptosid (Debian distro), Android Platform Frameworks Base, Linux Kernel, Android, KDE, LibreOffice, FreeBSD, Firefox, GCC, GNOME, Chromium, MySQL, WebKit, Git, Qt, Ruby on Rails. Labeled groups: “Distros” and “Famous” projects. Annotations: no really big, really active projects; for most projects, bigger means less active.]
    10. How Does Size Relate to Committer Count? [Scatter plot of the top 5000 live projects: lines of code (vertical axis, 0 to 60M) against 1-year committer count (horizontal axis, 0 to 3000). Labeled points: Linux Kernel, Android, KDE, LibreOffice, Firefox, GNOME, GCC, Chromium, Qt, Ruby on Rails. Annotation: similar effect, larger means fewer committers.]
    11. Languages of Live Projects. [Chart of primary language for the top 5000 live projects: Perl, C#, Ruby, Java, PHP, JavaScript, C, Python, C++, Other.]
    12. Average Liveness By Language. [Bar chart for the top 5000 live projects, liveness axis 0 to 18. Languages in the order shown: C#, JavaScript, Perl, Python, Java, PHP, Ruby, C, C++.]
    13. Average Project Size By Language. [Bar chart for the top 5000 live projects, axis 0 to 8 million lines of code. Languages in the order shown: Ruby, Python, Perl, C#, JavaScript, PHP, Java, C, C++.]
    14. Language vs. Number of Committers. [Bar chart for the top 5000 live projects, total committers on a log scale (1,000 to 100,000). Languages in the order shown: Java, C, C++, Python, JavaScript, PHP, Ruby, C#, Perl. Series: all-time committers, 1-year committers, 30-day committers.]
    15. Languages of New Projects – Then and Now. [Bar chart: % of new projects by primary language, axis 0% to 30%. Languages in the order shown: Java, Other, C++, C, Python, PHP, JavaScript, C#, Ruby, Perl. Series: started 5 years ago, started in the past year.]
    16. The 8 Most Live New Projects in the Past Year (project, description, why so active?)
        • oVirt-engine: KVM Management System (Liveness: 50.5, 457K LOC, Java). Open governance, backed by Cisco, Red Hat, IBM, Canonical, Intel, ... and aims at a burning problem.
        • WebRTC: Implements W3C RFC for streaming media JavaScript API (Liveness: 47.3, 407K LOC, C++). Supported by Google (Chrome), Mozilla, and Opera; a core HTML5 streaming media std.
        • Khan Academy Exercises: Crowdsourced exercises for a self-service educational platform (Liveness: 44.3, 90K LOC, JavaScript). Very easy to contribute, taps altruistic impulses of educators worldwide.
        • Twitter Bootstrap: CSS, HTML, JavaScript toolkit for rapid webapp development (Liveness: 40.0, 41K LOC, JavaScript). Heavily promoted by Twitter, high-quality, aims at a burning problem.
        • Wikimedia Puppet: Wikimedia’s Puppet configuration (Liveness: 33.8, 37K LOC, Puppet). Exemplary Puppet implementation for a very heavily trafficked site.
        • Katello: RHEL server system lifecycle mgmt. (Liveness: 31.6, 137K LOC, Ruby). Announced at Red Hat 2011 Summit, follow-on to Satellite project.
        • Cloud Foundry: VMWare’s PaaS platform (Liveness: 30.7, 29K LOC, Ruby). Substantial industry support, marketing. Aims at a burning problem.
        • Composer: Package manager for PHP (Liveness: 30.3, 14K LOC, PHP). Aims at a burning problem for PHP.
    17. Open Source by the Numbers
        • Only a small fraction of all the projects ever started gain long-term traction.
        • Less than 5% of all projects on Ohloh are “live”: a commit in the past year, and more than 1 committer, ever.
        • The larger the code base, the fewer the contributors and the less the activity.
        • “Famous” projects are mostly Java and C-family, and these older, established languages retain their dominant mindshare.
        • New live projects are trending towards Python, PHP, and JavaScript and away from C-family languages.
        • The “most likely to succeed” new projects:
            – Have big backers and marketing behind them.
            – Are still small enough for people joining them to have an impact.
    18. Questions? Your guide to open source: join the Ohloh community and gain critical insights into the world of open source projects.
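
The percentages quoted in the talk can be rechecked from the counts on slide 4 and the 49.3% team share on slide 6. The short script below is a sketch of that arithmetic; the constant and function names are made up for illustration, and because slide 4 says “550,000+”, the shares of the Ohloh total are approximate.

```python
# Counts as listed on slide 4; the team share is from slide 6.
TOTAL_ON_OHLOH = 550_000       # the slide says "550,000+", so this is a floor
ANALYZED = 271_372             # projects with a code analysis
COMMIT_IN_2_YEARS = 96_824
COMMIT_IN_1_YEAR = 46_883      # "active" in the talk's definition
TEAM_SHARE_OF_ACTIVE = 0.493   # active projects with 2 or more committers


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# "Live" = active in the past year AND at least 2 committers ever.
live_projects = round(COMMIT_IN_1_YEAR * TEAM_SHARE_OF_ACTIVE)

print(f"analyzed, of all Ohloh projects:   {pct(ANALYZED, TOTAL_ON_OHLOH):.1f}%")
print(f"commit in 2 years, of analyzed:    {pct(COMMIT_IN_2_YEARS, ANALYZED):.1f}%")
print(f"commit in 1 year, of analyzed:     {pct(COMMIT_IN_1_YEAR, ANALYZED):.1f}%")
print(f"live projects (count):             {live_projects}")
print(f"live, of analyzed:                 {pct(live_projects, ANALYZED):.1f}%")
print(f"live, of all Ohloh projects:       {pct(live_projects, TOTAL_ON_OHLOH):.1f}%")
```

Run as written, this gives roughly 23,000 live projects, which matches the 17.3%, 8.5%, and 4.2% figures in the slides and notes.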