tranSMART Community Meeting 5-7 Nov 13 - Session 2: Herding Cat


tranSMART Community Meeting 5-7 Nov 13 - Session 2: Herding Cats: Managing Open Source Projects and Communities
Peter Rice, Imperial College London
Bioinformatics in academia was an early adopter of the open source approach to software projects, after first trying commercialisation and proprietary approaches. A selection of projects highlights the issues that arose and how they were successfully resolved.

Published in: Health & Medicine, Technology

  1. 1. Herding Cats: Managing Open Source Projects and Communities Peter Rice 5th November 2013
  2. 2. Herding Cats • Managing an open source community is like herding cats. • ‘Cat, come with me.’ ‘Nenni!’ said the Cat. ‘I am the Cat who walks by himself, and all places are alike to me. I will not come.’ But all the same, he followed. (Rudyard Kipling, Just So Stories)
  3. 3. A brief history of open source software – EMACS – Linux – GPL – Apache
  4. 4. Open source licensing • Source code provided • Users free to inspect, modify and redistribute • Restrictions may be applied • Freedoms may be guaranteed • Several licenses may be combined – If they are compatible
  5. 5. GNU General Public License • Originally written for the EMACS editor and the GNU project • Based on copyleft – Copyright holder usually restricts rights – In GPL, copyright holder requires all further distributions to ensure free access – No further restrictions may be imposed – “Free as in speech, not as in beer”
  6. 6. Lesser General Public License • The full GNU General Public License makes it difficult to combine with other licenses as the whole binary is covered by GPL. • The Lesser (Library) GPL only preserves the interface and requires LGPL library source code to be made available. • Applications can be under any license • GPL code requires unlinked interfaces (APIs)
  7. 7. Other Open Source Licenses • Apache 2.0 allows modified code to use another license (including proprietary). “Indemnity clause” can be scary but is safe • Perl Artistic License has issues with redistributed code • The BSD license imposed a “restriction” requiring citation of the original authors, usually removed in the several “modified BSD” versions
  8. 8. A brief history of bioinformatics • 1980 Staden package – in support of Fred Sanger • 1982 EMBL/GenBank – Free sequence databases, later also SwissProt • 1984 Genetics Computer Group – Free (initially) sequence analysis package • 1990 Sequence Retrieval System • 1990 BLAST • 1997 EMBOSS
  9. 9. Copyright and Ownership • The Staden package was developed from 1987 to 2003 by Rodger Staden at the MRC-funded Laboratory of Molecular Biology • To get a copy of the software, users mailed a cheque for £100 to the Medical Research Council • In 2003, renewal of funding was rejected
  10. 10. Copyright and Ownership • The software was still owned by the funders • The authors had no right to apply for alternative funding • … nor did anyone else • Two years later it was formally re-released as open source, but developers had left.
  11. 11. Multiple licensing • The HMMER package provides standard Hidden Markov Model applications for multiple alignments of protein sequences • HMMER 2 had a dual licensing model – GNU General Public License – Commercial license • Only one of these can include third-party contributions. The commercial license cannot.
  12. 12. From academia to commercial • The Sequence Retrieval System was developed by Thure Etzold as a PhD project, then at EMBL Heidelberg and the European Bioinformatics Institute. • LION Bioscience in Cambridge started up to maintain and develop SRS commercially • LION merged with competitors (e.g. NetGenics)
  13. 13. From academia to commercial • NetGenics software was withdrawn – Customers had to purchase an SRS license instead • LION merged with BioWisdom • BioWisdom merged with Instem • Lesson: commercial software is high quality, well supported, but can disappear at any time. • Open source software avoids this risk
  14. 14. Branching • BLAST was developed at NCBI as a successor to FASTA • Development split into BLAST and WU-BLAST (Washington University) providing new features • WU-BLAST in turn became the commercial AB-BLAST
  15. 15. Competition • BLAST and the NCBI Toolkit were an early example of open source bioinformatics • Most software at the time was commercial • In 1990 the commercial providers wrote to Congress asking for withdrawal of funding for NCBI software because it competed with US industry. • They failed.
  16. 16. Competition • The GCG package was developed by the Genetics Computer Group at the University of Wisconsin • One of the most cited papers in biology – If you change more than 25% of the code, you can remove the GCG copyright • Changed to an annual source code license model • Extensions (EGCG) distributed as source code by EMBL Heidelberg and then by Sanger
  17. 17. Competition • Social scientists have reported in detail on GCG as an example of the development of bioinformatics. • Intelligenetics Inc objected to GCG’s unfair competition • Wisconsin spun off GCG Inc • Software license fee doubled • Usage continued • EGCG developed to 50% of the GCG code base
  18. 18. Competition • GCG Inc looked for a new owner • Source code deemed to be their major asset • Source code distribution was withdrawn • Increased fee for source code • Very restrictive terms of distribution • EGCG was abandoned with 150 applications • EMBOSS written from scratch to replace both – GPL/LGPL licensing – Created by the former EGCG community
  19. 19. Competition • So, to summarise – 1984 GCG started as open source – 1990 Became GCG Inc – 1997 Acquired by Oxford Molecular – 2000 EMBOSS 1.0 released as open source Harvey, M. and McMeekin, A. (2007) “Public or Private Economies of Knowledge? Turbulence in the Biological Sciences”
  20. 20. Back to Herding Cats
  21. 21. Managing an Open Source Community • The developers are only the beginning – Users – Installers – Technical authors – Helpdesk and support – Communication – Quality assurance – Competitors
  22. 22. Contributions by developers • New source code, new functionality • Maintaining source code – Bug fixes, coding standards • Interfaces – APIs, third party integration • Competitors (including open source) – New features and functionality – Integration and active collaboration
  23. 23. Contributions (continued) • Branches – Someone needs to merge branches • Original developers should agree to help • Often merged by others wanting to use new features – Ideally, merge with a single core – Useful to merge any set of branches – Combine with test suite(s)
  24. 24. Contributions (continued) • New data types • ETL procedures • Standards • Project-specific requirements
  25. 25. Contributions (continued) • Documentation – Users are good at writing/updating manuals • Training – Shared examples with public data and common standards • Support • Feature requests
  26. 26. Hosting solutions • Git: GitHub etc. • SourceForge • Open Bio Foundation • Locally hosted solutions: – CVS or Subversion • Wiki – Documentation: developers, users, installers
  27. 27. Coordination • Projects need a coordinator – Linux: Linus Torvalds – Emacs: Richard Stallman – GCG: John Devereux – EMBOSS: Peter Rice
  28. 28. Coordination • Maintaining a standard code base • Tracking branches and modified copies elsewhere • Selecting best solutions from available branches • Merging conflicting changes • Continuous testing
  29. 29. Communication • Community meetings (London, Amsterdam, Paris, …) for developers and users • Regular technical developer meetings / TCs • Mailing lists – Provide a useful archive • Trackers (JIRA, Pivotal, …) – Defining tasks/issues and resolving them • Wiki – Community documentation
  30. 30. Efficiency • Quality assurance – More tests are always helpful • Automated documentation – Creating screenshots from test outputs • Create tests for documented examples • Automated update when results change • Ensure documented functionality still functions
  31. 31. Cat incentives • In a small community, sanctions work – Financial penalties for breaking the code – Small fines for bugs – Fines are put back into the community, e.g. funding Xmas drinks
  32. 32. Cat treats • Acknowledge contributions • Benefit from sharing code in other branches • Developers need to support one another – Put out any flame wars • Involve the user community – Encourage non-developers to contribute • Keep everything public – Support the community – Attract new cats
  33. 33. Herding Cats: Managing Open Source Projects and Communities
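
Slide 30's advice to "create tests for documented examples" and to "ensure documented functionality still functions" can be illustrated with a minimal sketch using Python's standard doctest module. The function names `gc_content` and `failed_examples` are hypothetical and chosen only for illustration; they are not taken from the talk, EMBOSS, or tranSMART. The idea shown is that the examples in the documentation are themselves executed as tests, so a change in results is caught as soon as it no longer matches the documented output.

```python
import doctest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence.

    The examples below are user documentation and also act as tests.

    >>> gc_content("GGCC")
    1.0
    >>> gc_content("ATGC")
    0.5
    >>> gc_content("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def failed_examples(func) -> int:
    """Run the docstring examples of one function; return how many failed."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs makes the function visible
    # to the examples by its own name.
    tests = finder.find(
        func,
        name=func.__name__,
        module=False,
        globs={func.__name__: func},
    )
    for test in tests:
        runner.run(test)
    return runner.summarize(verbose=False).failed


if __name__ == "__main__":
    print("documented examples failing:", failed_examples(gc_content))
```

Running a check like this in continuous testing means a documented example cannot silently drift away from the behaviour of the code.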