Hi, I’m Rich Sands, product and community wrangler for Ohloh.net, Black Duck Software’s comprehensive databank of FOSS metrics.
We recently put Ohloh’s data under a Creative Commons Attribution license. I didn’t know all that much about Open Data before we decided to take this approach. I’ve learned a bit about it in the course of implementing our Open Data initiative, which we rolled out at OSCON in July of this year. That is what this talk is about: understanding some of the pragmatic and legal complications of Open Data, and thinking about how to make the right things happen.
So what is open data?
The concept of Open Data owes a lot to the FOSS movement. This definition by the Open Knowledge Foundation is simple, easily understood, and derived mostly from the Open Source definition: (read)
This very short, simple definition incorporates all of the critical bits. For data to be open, it must be freely redistributable; allow for derived works; place no restrictions on field of use, bundling, or use by particular persons or groups; and not demand particular implementations or technologies.
Basically, let anyone do anything, subject at most to a requirement to attribute and/or offer modifications.
Sounds good, right?
This is a pretty uncontroversial sounding definition: motherhood and apple pie. Folks who spend time thinking about free software, free culture, open source, and open access probably would expect such a definition for Open Data as essentially allowing complete freedom.
It lays out principles that help us judge whether specific data is Open Data. It helps us evaluate existing licenses, and new ones.
But this definition, when applied in the real world, with existing laws and legal precedents, and the realities of how data is used and potentially abused, can create some unexpected and possibly unintended results.
So let’s go through this definition piece by piece and look at some of these consequences, and the differences between data and code that underlie some of these unintended outcomes.
Let’s start with the thing itself we’re defining as open: some content or data. Right away we see the definition wrestling with one of the biggest challenges in this area: content and data are quite different, legally. And our definition needs to work within existing legal frameworks, treaties, and agreements so that open data is open everyplace, and there are laws that will allow us to enforce its openness.
Content is something that has been created by a person through original thought. Something written by a person, and made concrete by a visible expression. So a poem, a written paragraph, a song, a drawing, that sort of thing. A piece of source code is content. A map, the visual representation of geodata, is content. Content can be copyrighted. Copyright law is specifically intended to protect content, and there are treaties and conventions that mostly harmonize copyright law around the world so that the protections offered to content in different jurisdictions are similar: things like what is covered, for how long, the specific rights that are protected, and so on.
For the purposes of copyright, something is protectable as long as it has some originality. Some spark of creativity. Not much is needed. Really, if it requires any human mental energy to create it or express it, or if a human applies any judgment to its choice, arrangement, or presentation, then it is copyrightable content.
When something is copyrightable, then certain rights to that content are reserved to the author for a period of time. Copyright locks down content such that only the author gets to decide how it is copied and distributed, whether you have to pay someone to make a copy and use it, and any conditions or requirements placed on such copies.
Copyright is a really old legal idea: it’s been around for hundreds of years. In the U.S., the Constitution grants Congress the power to write copyright laws, “…to Promote the Progress of Science and useful Arts.” Its purpose isn’t to reward authors specifically. Rather, its purpose is to establish a limited monopoly on the exploitation of an author’s original content, so that there is an incentive to create. Authors get this monopoly, but only for a time, and copyright is intended to allow others to build on and use the ideas, information, and facts that are conveyed in a work.
The whole structure of copyright law and precedent and ideas like “fair use” all derive from both this basic monopoly grant, and the underlying rationale as to why granting it is good for society.
And without copyright, open source, open data, and free culture could not be protected. Isn’t it interesting how this works? When you make something open, you are applying a copyright license: a set of rules that govern when and how and to whom your work may be copied and distributed. But the rules you apply state that your work may be freely copied!
You get to set the rules. You have the copyright, the monopoly. But you use that right to explicitly free your work, to grant the right to copy and use and distribute your work without any further restrictions. You can’t give those rights away if you don’t have them in the first place! So if you love free culture and open source, you shouldn’t be too keen to abolish copyright.
If content is the expression of original or creative ideas, data is just the facts. Some examples of facts are the temperature at which water freezes or lead melts. A tide table for Nantucket. A list of names and phone numbers.
Data, meaning the information, the fact itself, is not copyrightable.
Where do we draw the line though? How little creativity is required to make something content, rather than data?
And this leads us to another aspect of the Open Definition…
What do we mean by a “piece” of content or data? The open definition talks about what you can do with an individual chunk of information – content or data. But we care about compilations of information – that is, after all, what a database is, right? If there are separate rights in the collection of information that limit someone’s freedom to use it – well that piece or all pieces like it aren’t open.
There is a famous Supreme Court case in the U.S. from 1991 that established the modern day boundary between content and data and that also addresses the rights in compilations of information. The Feist case held that an alphabetized list of names and phone numbers does not have enough originality to be copyrighted. If you collect a bunch of facts, and write them down in an arrangement that does not entail any human mental originality, the result cannot be copyrighted. In other words, the law does not grant a monopoly on controlling copies of these facts or an unoriginal collection of these facts.
A list of cities and their populations cannot be copyrighted. But a list of cities and their populations, chosen by someone as “The 10 Best Cities For Barbecue” is protected. Human judgment went into the choice of the cities to be included in the list.
Let’s say you spend a huge amount of time and money and employ a bunch of people to collect some set of facts. Say, the DNA sequence for a human being. Or the topography of the surface of the moon, down to a one foot resolution. Whatever. Feist holds that only the compiler’s selection and arrangements of facts in a compilation may be copyrighted. The individual facts themselves cannot be copyrighted: there is no right to control copying of facts, and they can be copied at will. This is true no matter how hard it is to collect the facts. Copyright does not protect or reward the labor of collecting facts.
If you compile that mega database, you can’t copyright it. That means…
Familiar copyright licenses like our friends the GPL, or the Apache License, or Creative Commons Attribution, can’t protect databases of facts, or the facts inside databases.
Anyone who has access to a database can copy the facts (but not content held in a database, which is separately copyrighted!) and use them any way they want. Mash them up with other data. Lock them up behind a paywall on the Internet. Use them as part of the knowledge needed to build a bomb.
“This data is copyright Rich Sands, and licensed under Creative Commons Attribution ShareAlike” is an interesting bit of verbiage that has no effect.
It turns out though that if what you’re trying to do is make Open Data, you don’t need this. You need only to have a license that calls out the parts of a database that can be copyrighted, and that uses copyright in those parts to free the database. Those parts are any content, as opposed to data, stored in the database, and any original or creative selection, arrangement, or presentation of the data. The facts themselves are already free.
The Feist case is U.S. law. Things are different in Europe. In 1996, the E.U. adopted a Database Directive that grants additional “sui generis” database rights which protect the labor inherent in collecting a body of facts, even if the facts and their arrangement are entirely unoriginal.
This means that an Open Data license must acknowledge these sui generis database rights and, along with the copyrightable elements of a database, explicitly grant others the free use of the collection of data.
There are a number of such Open Data licenses out there, but they’re not well-known or understood, and have not been tested. Tricky!
Let’s move on to a different aspect of the Open Data definition. This one seems pretty straightforward as well. It isn’t Open Data unless anyone can use content or data for any purpose whatsoever. No discrimination. No restrictions on field of use (“non-commercial” is non-free). This is a familiar idea from the world of FOSS. But where the consequences of freeing code seem overall to be pretty benign (sure, people can use open source code to accomplish evil purposes), the use of data for any purpose whatsoever runs into more fundamental and troubling potential for bad consequences. So do we really mean anyone? Like…
Recruiters? Government agencies and law enforcement? Employers?
Corporations, no matter what they do?
Financial service and insurance companies?
Free culture advocates usually defend freedom on principle: while the specific consequences of granting unfettered freedom to information and code may be bad, we must defend freedom anyway, because the enormous benefits of freedom can only be had if we accept the downsides as well.
I can defend this principle for code. But I don’t know that Open Data’s potential negative consequences can be tolerated without some sort of limitations.
One problem we ran into when opening up the Ohloh data is that authorship is inherent in the way source code is developed, and a part of what is open about open source. There is a long-standing conviction and convention in the world of FOSS that attribution, knowing and crediting who did what, is central to the establishment of working communities. When you commit code to a FOSS project, you’re publishing, as part of that code, the fact that you’re the one who created it. Your identity gets tied to your commits, and as a consequence of how projects work and SCMs are implemented, this means your email becomes public.
FOSS developers know this. They accept it as an inevitable consequence of participating in an open process. But Ohloh is collecting all the committer IDs of everyone who has ever contributed to FOSS, and collating those identities in a centralized and conveniently queried database. Sure, recruiters might be able to go into the individual repositories of projects with developers they think could be interesting, and scrape the email addresses. But most recruiters wouldn’t know how to do that, or how to combine search and selection of projects with extracting the IDs from various SCM repositories around the web. Ohloh is far more convenient. And yes, Ohloh could hide the identities of all the contributors. But that would defeat the purpose of Ohloh!
Did someone contributing to a project using their email address as their committer ID really think that this data would be repurposed and republished in a form ready-made for recruiters and spammers to use in targeting them? Probably not. It would be bad indeed if developers decided to curtail their participation in FOSS because participating becomes an inherent privacy nightmare. It is easy to imagine other, even more disturbing scenarios where data collected for entirely different purposes is used in ways not originally contemplated, to bad effect. It seems to me unavoidable that, somehow, there need to be some limits on how Open Data is used, and by whom. But isn’t that the very opposite of Open Data?
Let’s move on to another bit of the definition, about reuse.
This is really about how Open Data can be combined.
There are two options here: is the Open Data under a “ShareAlike” license, or not? “ShareAlike” is like the GPL’s copyleft: if you modify some data you must share the modifications also as Open Data.
ShareAlike has similar effects on data as copyleft has on code.
Open Data licensed under a ShareAlike license may be more attractive to contributors. They know that their contributions cannot be mashed up or modified by someone and then those modifications locked up. So inbound information flow may be enhanced. But outbound use of the data may be inhibited if commercial users believe that using such data puts too much burden of disclosure on their own data. This is not dissimilar to how commercial entities often look at the GPL and copyleft.
Likewise, Open Data licensed under a more permissive license requiring only attribution might dissuade some contributors from participating, since their participation might end up aiding efforts they don’t wish to support. But commercial interests will have less worry about using it, since there is no requirement to disclose their own data or data mixed and modified by them.
Wouldn’t it be nice if there was some sort of compromise that everyone could agree is Open Data, and that creates strong incentives for both contributions and use?
Redistribution is another interesting challenge for Open Data. The challenge comes in what happens to the community, and to the integrity of the data when lots of copies are out there, fulfilling lots of uses, by different players in the ecosystem.
When a community is contributing to a body of data, a virtuous cycle develops where useful data attracts people who want to make it even more useful, which in turn attracts more adoption. You have seen this before.
When you’re trying to build a large and accurate body of knowledge and put it out as Open Data to the world, this dynamic is your friend. What drives this cycle is the aggregation of the data into a single data set with consistent format and expectations on quality and coverage. This authoritative consistency attracts more participation and makes the data more trustworthy. This is what I mean by integrity.
So what happens when you have Open Data without a ShareAlike requirement? Because the data can be freely redistributed, multiple copies of the data set may spring up and be used by different players for different purposes. Where do contributions get made in such a situation? If contributions are made where data is used, there won’t be a single, authoritative version of data. Rather, the data will end up fragmented, and users will have to evaluate which version of the data is more accurate for their particular needs. Some copies might be missing some data, and over time these versions will diverge. Community engagement will be spread across lots of versions.
When you have Open Data with ShareAlike, there is less chance that the data will fragment. But there is also less likelihood that the data will be as widely adopted. What if you need some combination of Open Data with proprietary data to solve your problem, but the proprietary owners don’t think using a ShareAlike data set fits their model?
This is just what happens with mobile mapping. Google Maps does not use OpenStreetMap data today. Apple’s new iOS 6 Maps app needed OpenStreetMap data to fill out some parts of its geo database. If you notice a mistake in iOS 6 Maps and submit a correction, and it is a correction to OpenStreetMap info, Apple will be required to share that correction. But Apple will only use OpenStreetMap data when there are no commercially viable geodata sets for a particular region. Otherwise the corrections will go to its proprietary partners.
If OpenStreetMap were not ShareAlike, it is possible that Google, Apple, and others might use it more heavily, and even contribute back. But there wouldn’t be a requirement to do so, and the FOSS cartographic community would be much less keen on contributing.
Another difference between open data and open source: you can read diffs of a fork and understand what is happening and decide whether the original or the forked code best meets your needs.
Data is often not like that. It may be much harder to know which version of a fragmented data set is the one you want to use.
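To make the contrast with code concrete, here is a minimal Python sketch of what a "diff" between two diverged copies of a data set might look like. The function name `diff_records` and the city population records are hypothetical, made up purely for illustration, and the sketch assumes both copies share a stable record key, which real fragmented data sets often lack.

```python
def diff_records(copy_a, copy_b):
    """Compare two copies of a data set, each a dict keyed by record ID.

    Returns the keys found only in copy_a, the keys found only in
    copy_b, and the records present in both whose values disagree.
    """
    only_a = sorted(copy_a.keys() - copy_b.keys())
    only_b = sorted(copy_b.keys() - copy_a.keys())
    conflicts = {
        key: (copy_a[key], copy_b[key])
        for key in sorted(copy_a.keys() & copy_b.keys())
        if copy_a[key] != copy_b[key]
    }
    return only_a, only_b, conflicts


# Hypothetical city-population records in two copies that have diverged.
copy_a = {"springfield": 116250, "shelbyville": 42000, "ogdenville": 9800}
copy_b = {"springfield": 117000, "shelbyville": 42000, "capital_city": 1200000}

only_a, only_b, conflicts = diff_records(copy_a, copy_b)
print("only in A:", only_a)
print("only in B:", only_b)
print("conflicting values:", conflicts)
```

Even in this toy case, the comparison only shows where the copies disagree. Nothing in the output says which population figure for the conflicting record is the accurate one, and that is exactly the judgment a user of fragmented data is left to make.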
So this brings us to the last element of the Open Data definition I’ll look at today. The Open Data Definition has a familiar bit of language that says you cannot add additional limitations or conditions that constrain the basic freedoms. This stops end-runs around the licenses and rules which lock down data. But it also prevents many of the most likely solutions to the challenges I’ve been talking about.
Here is the usual concept behind using Open Data: grab it in a free download, in aggregate, and have at it.
This is the easiest possible way to gain value from Open Data.
It is also the fastest route to spam, privacy problems, fragmentation, and other potential evils. What to do?
Our host has just published this very insightful piece on the ultimate value of data in driving revenue, building barriers to entry, and constructing sustainable competitive advantage. Steve’s right – anyone thinking software and data are two separate and distinct things is living in the last decade, or century. Steve concludes that companies would be smart to lock down their data as a “moat” – a barrier to entry – to compensate for the commoditization of software driven by FOSS. As a business strategy that makes sense. But I think it would be profoundly bad for the planet if data becomes a weapon, instead of a tool that can be leveraged for common benefit. It isn’t as easy as just declaring Open Data to be like Open Source.
How do we meet these challenges of fragmentation, intolerable “bad uses”, differing laws in multiple jurisdictions, and the tension between ShareAlike and commercial use, while gaining the benefits of Open Data?
Thanks!
Open Data ≠ Open Source
Rich Sands – Product Manager @ Ohloh.net @ohloh #ohloh
Open Definition
A piece of content or data is open if anyone is free to use, reuse, and redistribute it – subject to only, at most, the requirement to attribute and/or share-alike.
[T]he first person to find and report a particular fact has not created the fact; he or she has merely discovered its existence. . . . Census-takers, for example, do not “create” the population figures that emerge from their efforts; in a sense, they copy these figures from the world around them. . . . Census data therefore do not trigger copyright because these data are not “original” in the constitutional sense.
- Justice Sandra Day O’Connor, Feist Publications, Inc. v. Rural Telephone Service
Our Data is Open, But Thou Shalt Not Use Our API To:
Do bad stuff with the data
Violate people’s privacy
Use without attribution
Modify without giving back
Fragment the data by copying the DB
Make $ without our permission
Do other stuff with the data we don’t like
Thanks for asking, but no data dump is available. Sorry!