Talk at ICPC 2017.
Abstract: Software provided under open source licenses is widely used, both as high-profile stand-alone applications (e.g., Mozilla Firefox) and embedded in commercial offerings (e.g., network routers). Despite how often open source licenses are used, there has been little work on whether software developers understand them. To help fill this gap, we conducted a survey that posed development scenarios involving three popular open source licenses (GNU GPL 3.0, GNU LGPL 3.0 and MPL 2.0), both alone and in combination.
Do software developers understand open source licenses?
1. DO SOFTWARE DEVELOPERS
UNDERSTAND OPEN SOURCE LICENSES?
Daniel A. Almeida and
Gail C. Murphy
University of British
Columbia
Greg Wilson
Rangle.io
Mike Hoye
Mozilla
Corporation
@_DanielAlmeida
3. • Java applications using the Central Repository relied on more than 100
open source components in 2014. [1]
• More than 25 different licenses in use in a sample of Java GitHub
projects. [2]
• License compliance problems in 150+ products found by gpl-violations.org. [3]
[1] Sonatype, “2015 State of the Software Supply Chain Report: Hidden speed bumps on the road to ‘continuous’”.
[2] C. Vendome, “A large scale study of license usage on GitHub”.
[3] A. Hemel, K. T. Kalleberg, R. Vermaas, and E. Dolstra, “Finding software license violations through binary code clone detection”.
Do software developers understand
open source licenses?
4. • Survey consisting of 7 hypothetical software
development scenarios.
• 375 participants from many countries recruited on
social media and mailing lists.
• Our participants:
• Software developers (67%)
• At least 3 years of development experience (93%)
• Had to choose a project’s license before (85%)
• Responsible for licensing decisions (63%)
• Often contribute to open source projects (74%)
5. Licenses
Restrictive/copyleft: generally require the same
rights on derivative work (e.g., GNU GPL).
Weak copyleft: generally, not all of the derivative
work needs to be released under the copyleft
license.
Permissive: require only attribution, allowing
derivative work to be proprietary (e.g., MIT
License and BSD License).
6. Scenario 1
John has been working on ToDoApp, his own personal task management application. ToDoApp
will be used exclusively by John on his own computer. John will use LightDB to persist ToDoApp’s
data.
If LightDB is distributed under the following licenses, would John be allowed to use it as part of
ToDoApp?
LightDB license   Choices
GNU GPL 3.0       Yes / No / Unsure
GNU LGPL 3.0      Yes / No / Unsure
MPL 2.0           Yes / No / Unsure
[Diagram: John, ToDoApp, LightDB]
7. Scenario 1
LightDB license   Choices
GNU GPL 3.0       Yes / No / Unsure
GNU LGPL 3.0      Yes / No / Unsure
MPL 2.0           Yes / No / Unsure
Could you explain why you are not sure about your answer for MPL 2.0?
“I’m not very familiar with MPL.”
Are there any assumptions you've made about this scenario?
Is anything unclear or confusing to you?
“This is all assuming John follows the details of each license.”
8. Scenario 2
[Diagram: ToDoApp, LightDB]
If LightDB, the lightweight library used to persist ToDoApp’s data, is
distributed under [LightDB LICENSE], would John be allowed to make
ToDoApp available under [ToDoApp LICENSE]?
LightDB license   ToDoApp license   Choices
GPL               GPL               Yes / No / Unsure
GPL               LGPL              Yes / No / Unsure
GPL               MPL               Yes / No / Unsure
LGPL              GPL               Yes / No / Unsure
LGPL              LGPL              Yes / No / Unsure
…                 …                 …
9. Method
Our oracle: an intellectual property lawyer with more than a decade
of experience in software licensing.
Quantitative analysis: all 7 scenarios and their 45 cases.
Qualitative analysis: open-coding of the comments and assumptions
for cases where (1) over 30% of the participants disagreed with
the expert; OR (2) at least 10% of the participants answered “Unsure”.
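The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis script: the function name, the constants, and the answer counts are all hypothetical, and it assumes that any answer other than the expert's (including “Unsure”) counts as disagreement.

```python
from collections import Counter

# Thresholds taken from the selection rule on the slide.
DISAGREE_THRESHOLD = 0.30  # over 30% disagreed with the expert
UNSURE_THRESHOLD = 0.10    # at least 10% answered "Unsure"


def needs_open_coding(answers, expert_answer):
    """Return True if a case meets either criterion for qualitative analysis.

    Assumption: every answer that differs from the expert's, including
    "Unsure", counts as disagreement.
    """
    total = len(answers)
    if total == 0:
        raise ValueError("a case needs at least one answer")
    counts = Counter(answers)
    disagree_share = (total - counts[expert_answer]) / total
    unsure_share = counts["Unsure"] / total
    return disagree_share > DISAGREE_THRESHOLD or unsure_share >= UNSURE_THRESHOLD


# Made-up answer counts, for illustration only (not survey data).
clear_case = ["Yes"] * 90 + ["No"] * 5 + ["Unsure"] * 5
contested_case = ["Yes"] * 60 + ["No"] * 25 + ["Unsure"] * 15

print(needs_open_coding(clear_case, "Yes"))      # False
print(needs_open_coding(contested_case, "Yes"))  # True
```

The two criteria are kept as separate shares so that a case with broad agreement but many “Unsure” answers is still selected.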
10. Overview
[Chart legend: Yes, No, Unsure, right answer]
• 7 scenarios (total of 45 cases) involving 3 open source licenses.
• Participants were given access to the licenses used in the survey
and were free to use external resources as needed.
• Participants agreed with our oracle in 26 out of 42 cases (62%).
• Open-coding focused on the 19 cases (and related scenarios)
where over 30% of the participants disagreed with our oracle or
at least 10% of the participants answered “Unsure”.
11. Observation #1
Developers cope well with single
licenses even in complex scenarios,
but have difficulty when more than
one license is in use.
13. Observation #2
Developers understand that how the software
is built affects license interactions, but don't
have a deep grasp of what technical details
matter.
15. Scenario 2: comments coding
Technical Detail
(concerns about technical
aspects of the case)
A: Assumption, Am: Ambiguity, I: Invalid, LI: License Interactions, SC: Specific Case, U: Unsure
16. Scenario 2 (GPL-LGPL): comments
“It depends on how ToDoApp is distributed. If ToDoApp
was only distributed as source then this would be fine.
For binary distributions, if ToDoApp is statically linked
against LightDB it must be distributed under GPL. The
case is less clear for dynamically linked code - I
understand the FSF and other organizations disagree!”.
“I think it might depend on how the two libraries are
linked together”.
18. Assumptions coding
License Interactions
(ramifications of more
than one license)
License Assumption
(characteristics of
license)
AG: Authorship, CD: Change Dependent, I: Invalid, IQ: Invalid Question, PA: Patent Assumption, SC: Specific Case, TA: Technical Assumption, TeA: Term Assumption, U: Unsure
19. Scenario 2: comments coding
License Interaction
(what actions are possible with more
than one license)
A: Assumption, Am: Ambiguity, I: Invalid, LI: License Interactions, SC: Specific Case, U: Unsure
Specific Case
(dual licensing or relicensing)
20. Scenario 2 (GPL-MPL): comments
“I don’t understand how the secondary license
restriction and GPL interact”
“MPL/(L)GPL dual licensing is popular, so I assume
there is a reason for that”
“Have not studied the details; generically expect
trouble when mixing non-GPL licenses with GPL so
would have guessed ’No’ if forced”
21. Other observations
• Questions that arise about the use of multiple open
source licenses are situationally dependent.
• A number of developers lack knowledge of the
details of open source licenses.
23. DO DEVELOPERS UNDERSTAND OPEN SOURCE LICENSES?
• Cope well with single licenses even in complex scenarios,
but struggle when more than one license is in use.
• Understand that technical details affect license interactions.
• Don't have a deep grasp of what technical details matter or
of the intricacies of how licenses interact.
@_DanielAlmeida
Editor's Notes
+ Reuse of high-quality components
+ Fast production of software
+ Low cost
Why these licenses:
Common licenses in use
Range from restrictive to permissive
Different resulting restrictions (GPL vs LGPL)
GPL: strong copyleft, requires licensed works or modifications to be open source
LGPL: weak copyleft. Mostly used for shared libraries. Allows us to use the licensed library without making the rest of the code/product open source
MPL: weak copyleft. Similar to LGPL, but at a file level.
Quantitative analysis of all scenarios: to account for some degree of ambiguity, we consider that the participants did well when at least 70% of the answers matched the expert’s.
We focused our qualitative analysis on the cases where participants disagreed with the expert (that is, less than 70% of them agreed) or more than 10% of them answered “Unsure”
Four scenarios have cases where more than 30% of the participants disagreed with the expert, or cases where more than 10% of them answered “Unsure”.
We noticed an issue with Scenario 5. Our legal expert and the participants made such different assumptions that we decided to focus our analysis on the other scenarios.
Different combinations of codes appear for the same license combinations (e.g., S2-GPL-LGPL had a lot of LI, but that was not an issue for S3-GPL-LGPL or S6-LGPL-GPL). The difference is in how the software is used, changed and combined.
The most frequent code for the case comments was Unsure, indicating a lack of knowledge of the licenses used in this survey.
German and Hassan: models for identifying possible mismatches and a number of “patterns of integration”.
Vendome and Poshyvanyk: find, explain and recommend a fix for license incompatibility (either license change or code restructuring).
Our participants described ways of restructuring their code or the open source component’s code to avoid license incompatibility. We believe we may need a more robust recommender system that can formally model license interactions in terms of how a component is used in the code and in which other ways it could be used. A model such as the one introduced by Alspaugh and others might be a starting point for building tools that recommend how the code can be refactored to resolve license compliance issues.
Open source software is not a small, self-contained set of licenses and components.
Software developers struggle to interpret the implications of license interactions and the relevant technical details.
We need tools to help developers identify and resolve license incompatibility issues.
Open source components are released under a variety of licenses and used in closed and open-source projects. There are thousands of components released under many different licenses.
Developers have a good understanding of at least three licenses, but in many cases they struggle to identify the relevant technical details and correctly interpret the license interactions.
We need tools to help developers identify and solve license incompatibility problems. One possibility, based on existing work by Germán and Hassan, is to use formal models to identify mismatches. We can go further and help developers change the code structures that are causing the mismatch.