Why open drug discovery needs four simple rules for licensing data and models

When we look at the rapid growth of scientific databases on the Internet in the past decade, we tend to take the accessibility and provenance of the data for granted. As we see a future of increased database integration, the licensing of the data may be a hurdle that hampers progress and usability. We have formulated four rules for licensing data for open drug discovery, which we propose as a starting point for consideration by databases and for their ultimate adoption. This work could also be extended to the computational models derived from such data. We suggest that scientists in the future will need to consider data licensing before they embark upon re-using such content in databases they construct themselves.


Perspective

Antony J. Williams1*, John Wilbanks2, Sean Ekins3

1 Royal Society of Chemistry, Wake Forest, North Carolina, United States of America
2 Consent to Research, Oakland, California, United States of America
3 Collaborations in Chemistry, Fuquay-Varina, North Carolina, United States of America
* E-mail: tony27587@gmail.com

Citation: Williams AJ, Wilbanks J, Ekins S (2012) Why Open Drug Discovery Needs Four Simple Rules for Licensing Data and Models. PLoS Comput Biol 8(9): e1002706. doi:10.1371/journal.pcbi.1002706
Editor: Philip E. Bourne, University of California San Diego, United States of America
Published September 27, 2012
Copyright: © 2012 Williams et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors received no specific funding for this article.
Competing Interests: Sean Ekins consults for Collaborative Drug Discovery, Inc. and is on the Board of Directors of the Pistoia Alliance. Antony J. Williams is employed by The Royal Society of Chemistry, which hosts the ChemSpider database discussed in this article. John Wilbanks consults for and sits on the Board of Directors at Sage Bionetworks, which runs an open access database of genomic and health information.

PLOS Computational Biology | www.ploscompbiol.org | September 2012 | Volume 8 | Issue 9 | e1002706

Introduction

Public online databases [1] supporting life sciences research have become valuable resources for researchers depending on data for use in cheminformatics, bioinformatics, systems biology, translational medicine, and drug repositioning efforts, to name just a few of the potential end user groups. Worldwide funding agencies (governments and not-for-profits) have invested in public domain chemistry platforms. In the United States these include PubChem [2], ChemIDPlus [3], and the Environmental Protection Agency's ACToR [4], while the United Kingdom has funded ChEMBL [5] and ChemSpider [6], among others, and new databases continue to appear annually [7].

We have argued recently that the data quality contained within many of these databases is suspect [8] and scientists should consider issues of data quality [9] when using these resources. By assimilating various data sources together and meshing data on drugs, proteins, and diseases, these various databases and network and computational methods may be useful to accelerate drug discovery efforts. The development of related cheminformatics platforms or derived models without care given to data quality is a poor strategy for long-term science [10] as errors become perpetuated in additional databases. There is real evidence that the integration of large, heterogeneous sets of databases and other types of content is "unreasonably effective" at accelerating the conversion of data into knowledge [11]. This implies the need for technical and semantic work to bring databases together that were never designed for interoperability [12], which is in itself a significant task [13,14].

As we and others have argued previously, there is another dimension to interoperability than technical formats [12] and ontological agreement [15]: the complex interactions of database licenses and terms of use around intellectual property. Many of these online databases have either obscure or confused licensing terms [16], and even in those cases where data are freely available for download and reuse there are often no clear definitions. Many databases simply "cut and paste" prohibitive copyright schema from traditional websites, or fail to address download and reintegration entirely (ibid). Since copyright law requires explicit permissions in advance to make use of copyrighted works, it is certainly unsafe to assume data licensing rights for any database that does not explicitly allow it.

The availability of data for download and reuse is an important offering to the community, as these data may be used for the purpose of modeling to develop prediction tools [17]. In addition, data can be ingested into internal systems inside pharmaceutical companies to mesh with their existing private data [18], including in the expanding Linked Open Data cloud or in freely available online databases, and can be downloaded and used to enhance their content and to establish linking between data. The Open PHACTS project [19,20] utilizes a semantic web approach to integrate chemistry and biology data across a myriad of data sources, including for chemistry ChEBI, ChEMBL, and DrugBank, and for biology UniProt, Wikipathways, and many others. The chemical structure representations are obtained from ChemSpider, which has previously imported the chemical databases and standardized according to their data model and are making the data available as open data to the project.

Many of the primary online databases already have multiple links to external systems. This linking may be achieved by using available database services to form transitory links by, for example, using a chemical representation such as an InChI [21] to probe an application programming interface, search for the compound, and generate the linking URL in real time. Commonly, however, the links are more permanent in nature and are generated by downloading data from the various data sources, depositing a subset of the data (generally the chemical compound and associated database identifier), and using the particular database URL structure to form permanent links. This act of download and deposition of multiple data sources is commonly mixing the various licenses, if
licenses are even declared, which, in many cases, they are not.

In some ways, there are analogous difficulties in the exchange of computational models like quantitative structure activity relationship (QSAR) datasets [22]: while there are efforts to standardize how the data and models are stored, queried, and exchanged, there has been little consideration of licenses required to enable making the sharing of open source models a reality [23]. Similarly, one could consider the creation of maps of disease and how they are shared and reused [24] in the same manner.

The potential legal fragility of knowledge products derived from online databases with poorly understood licensing for each of the databases is a real problem, and one that will only increase in severity over time. This realization is not novel; indeed, the chemical blogosphere has been host to many discussions regarding the need for clear data licensing definitions on chemistry-related data. Many scientists likely echo these comments, but we will provide some examples. In particular, Peter Murray-Rust [25] espouses the value of "open data" [26] to the scientific discovery process and encourages clear licensing of all chemistry data according to Open Knowledge Definition (OKD) [27] and the Panton Principles [28].

Herein we provide an extensive background to the intellectual property around data and databases in the sciences involved in drug discovery, those of biology, chemistry, and related fields, as well as discussion of open data licensing, openness, and open license limitations (Text S1). More importantly, we provide a set of rules that practitioners might apply when making data or databases available via the Internet or mobile apps [29]. Our ultimate goal is to illuminate the legal fragility of the database ecosystem in the drug discovery sciences, and to initiate a conversation about creating best practices.

Simple Rules for Licensing "Open" Data

We suggest based on our analysis of the current data situation (Text S1) the ideal is to use strong default rules for openness. From a copyright and database rights perspective, the public domain gives the most clarity and should be the default setting for data deposit, although it may not always be achievable. Understanding this is vital, because it sets the bar at the right height. Justifications for additional controls should be subject to argument; one often finds those controls are unnecessary when the discussion is framed this way.

It is also important to avoid noncommercial or share-alike approaches whenever possible. These are attractive terms to many data providers, but create significant barriers to interoperability. Noncommercial data might be incompatible for researchers at a pharmaceutical company, even to run a simple web-based query. It is important to realize data under a share-alike license from one entity is probably not combinable with data under a share-alike license from another entity (this lack of interoperability kept Creative Commons licensed images out of Wikipedia for years, and is not one we wish to introduce into the ecosystem again!).

Thus, we propose the following simple rules for developing data licensing approaches inside scientific projects.

  1. Before you begin a database project, convene a meeting of all of the stakeholders. Expose all of the expectations of the group and decide if your goals are primarily scientific, commercial, or mixed. If mixed, take a stern look at the actual commercial potential of the project. Invite technology transfer offices to join you; they have greater experience in the realities of commercialization.
  2. If your project is scientific in nature, and not commercial, explore the benefits of open licensing and drawbacks of enclosure. Go through the various definitions and find the most common ground possible, always placing the burden of proof on those who want more control and not less. This will create less "default enclosure" but allow for those increasingly rare situations in which "open" is not appropriate. Attempt to hew as closely as possible to the admittedly rigorous open definitions and standards, and do not write your own intellectual property licenses; instead, use existing and well deployed ones.
  3. Develop simple explanations of your terms of use, and make them easy to find for users. Make sure that your licensing, expectations for attribution, terms of use, and more are linked in many ways to your data and database. Do not expect your users to read the legal text of your terms and conditions and licenses; instead, create simple summaries with linkages to the detailed text for users to access. Whenever possible, use metadata to indicate the licensing terms explicitly; the Creative Commons Rights Expression Language [30] is a good tool for this.
  4. Don't ever lock up metadata. A significant swath of data will be incompatible with an open regime, whether it's to protect trade secrets or patient privacy. But the metadata that describes closed data, and how to access closed data, can be almost as valuable. If you can't make the data public domain, make the metadata public domain.

As a general rule, these four simple rules should allow us to build a more stable data and model sharing ecosystem while we live with some uncertainties until the courts rule on where the line of property stops and starts. We can't wait for the certainty to emerge, but we also want our systems to work when the courts do finally rule on issues such as where data and metadata stop and start, where copyright attaches, how data rights really affect re-use, and what it means to move towards a "cloud world" where copies aren't made of data at all. Following these heuristics when providing and/or accepting data is an approach that creates at least the opportunity to be forward-compatible for the future development of technologies.

But it is also important to pay close attention to licensing sanitation as a data consumer and user. No matter how tempting it is, do not copy a batch of informally open, but formally closed, data, run a database integration, and release the new database as "open"; that hurts the community. Instead, look for the terms of use, ask if it is "open", post your enquiry, and only when you are certain, redistribute. We think databases funded by the government should at the very least be open, and if not this should be stated prominently.

Conclusions

Although most scientists are likely unaware of this at present, data licenses are going to become increasingly important in science in the future, especially as we see more scientists embracing open notebook science, open science, and open-access publishing, and funding bodies promoting the increased accessibility of the fruits of their funding. We are likely not too far from funding bodies mandating immediate release of all data and results produced by each of their grantees, which is something we would advocate as potentially disruptive in its own right (S. Ekins et al., unpublished data).

We can hence imagine a near future in which many scientists will blog some or all of their research results while data aggregators will in turn consume this content and repackage it for others [31]. The licensing of this and other data will need to be clear if we are to build on the shoulders
of giants and not have to face legal battles that pit Davids versus Goliaths. Considering data licensing as a part of the "scientific process" is vital for its future usability, and we strongly encourage scientists to consider data licensing before they embark upon re-using such content in databases they construct themselves or in the course of their research.

The four simple rules we have formulated for licensing data for open drug discovery represent a proposed starting point for consideration by database producers. These licenses could equally be used by individual scientists on their blogs and other online environments or accounts in which they make their data and models available for others.

Supporting Information

Text S1 This consists of a discussion in three sections:
  • Intellectual property rights in data: Copyright and Database Rights.
  • Trends in legal certainty: Open Data Licensing.
  • "Informal" Openness and Open License Limitations.
(PDF)

References

1. Williams AJ, Tkachenko V, Lipinski C, Tropsha A, Ekins S (2009) Free online resources enabling crowdsourced drug discovery. Drug Discovery World 10, Winter: 33–38.
2. National Center for Biotechnology Information (n.d.) The PubChem database. Available: http://pubchem.ncbi.nlm.nih.gov/. Accessed August 2012.
3. US National Library of Medicine (n.d.) ChemIDPlus Advanced. Available: http://chem.sis.nlm.nih.gov/chemidplus/. Accessed August 2012.
4. Judson R, Richard A, Dix D, Houck K, Elloumi F, et al. (2008) ACToR–Aggregated Computational Toxicology Resource. Toxicol Appl Pharmacol 233: 7–13.
5. EMBL-EBI (n.d.) ChEMBL. Available: http://www.ebi.ac.uk/chembldb/index.php. Accessed August 2012.
6. Pence H, Williams AJ (2010) ChemSpider: an online chemical information resource. J Chem Educ 87: 1123–1124.
7. Galperin MY, Cochrane GR (2011) The 2011 Nucleic Acids Research Database issue and the online Molecular Biology Database Collection. Nucleic Acids Res 39: D1–D6.
8. Williams AJ, Ekins S, Tkachenko V (2012) Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 17: 685–701.
9. Williams AJ, Ekins S (2011) A quality alert and call for improved curation of public chemistry databases. Drug Discov Today 16: 747–750.
10. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50: 1189–1204.
11. Halevy A, Norvig P, Pereira F (2009) The unreasonable effectiveness of data. Intelligent Systems 24: 8–12.
12. Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, et al. (2012) Toward interoperable bioscience data. Nat Genet 44: 121–126.
13. NeuroCommons (n.d.) NeuroCommons project. Available: http://neurocommons.org. Accessed August 2012.
14. Ruttenberg A, Rees JA, Samwald M, Marshall MS (2009) Life sciences on the Semantic Web: the Neurocommons and beyond. Brief Bioinform 10: 193–204.
15. Hastings J, Chepelev L, Willighagen E, Adams N, Steinbeck C, et al. (2011) The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web. PLoS ONE 6: e25513. doi:10.1371/journal.pone.0025513
16. de Rosnay MD (2008) Check your data freedom: a taxonomy to assess life science database openness. Nature Precedings. Available: http://dx.doi.org/10.1038/npre.2008.2083.1. Accessed August 2012.
17. Ekins S, Williams AJ (2010) Precompetitive preclinical ADME/Tox Data: set it free on the web to facilitate computational model building to assist drug development. Lab on a Chip 10: 13–22.
18. Zhu Q, Lajiness MS, Ding Y, Wild DJ (2010) WENDI: a tool for finding non-obvious relationships between compounds and biological properties, genes, diseases and scholarly publications. J Cheminform 2: 6.
19. Azzaoui K, Jacoby E, Senger S, Rodríguez EC, Loza M, et al. (2012) Analysis of the scientific competency questions followed by the IMI OpenPHACTS consortium for the development of the semantic web-based molecular information system OPS. Drug Discov Today. In press.
20. Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, et al. (2012) Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today. In press. Available: http://dx.doi.org/10.1016/j.drudis.2012.05.016. Accessed August 2012.
21. Wikipedia (n.d.) InChIKey on the InChI Wikipedia page. Available: http://en.wikipedia.org/wiki/International_Chemical_Identifier#InChIKey. Accessed August 2012.
22. Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE (2010) Towards interoperable and reproducible QSAR analyses: exchange of datasets. J Cheminform 2: 5.
23. Gupta RR, Gifford EM, Liston T, Waller CL, Bunin B, et al. (2010) Using open source computational tools for predicting human metabolic stability and additional ADME/TOX properties. Drug Metab Dispos 38: 2083–2090.
24. Derry JM, Mangravite LM, Suver C, Furia MD, Henderson D, et al. (2012) Developing predictive molecular maps of human disease through community-based modeling. Nat Genet 44: 127–130.
25. Murray-Rust P (n.d.) Dr Peter Murray-Rust. Available: http://www.ch.cam.ac.uk/person/pm286. Accessed August 2012.
26. Wikipedia (n.d.) Open data. Available: http://en.wikipedia.org/wiki/Open_data. Accessed August 2012.
27. Open Knowledge Foundation (n.d.) Open data licensing. Available: http://wiki.okfn.org/Open_Data_Licensing. Accessed August 2012.
28. Murray-Rust P, Neylon C, Pollock R, Wilbanks J, Open Knowledge Foundation Working Group on Open Data in Science (2010) The Panton principles. Available: http://pantonprinciples.org/. Accessed August 2012.
29. Williams AJ, Ekins S, Clark AM, Jack JJ, Apodaca RL (2011) Mobile apps for chemistry in the world of drug discovery. Drug Discov Today 16: 928–939.
30. Creative Commons (n.d.) ccREL: Creative Commons rights expression language. Available: http://www.w3.org/Submission/ccREL/. Accessed August 2012.
31. Ekins S, Clark AM, Williams AJ (2012) Open drug discovery teams: a chemistry mobile app for collaboration. Molecular Informatics. In press. doi:10.1002/minf.201200034.
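
Illustrative sketch (not part of the published article). The article's argument about mixing licenses during database integration can be made concrete with a small check run before sources are merged. The Python below is a minimal sketch under stated assumptions: the `Source` record, the `merge_problems` function, and every license label are hypothetical names invented for illustration, and the rule that two different share-alike licenses cannot be combined is a deliberate simplification of the article's "probably not combinable". It is no substitute for reading the actual terms of use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one upstream database."""
    name: str
    license: Optional[str]       # None: no license declared at all
    share_alike: bool = False    # derived works must keep this same license
    noncommercial: bool = False  # commercial use is not permitted


def merge_problems(sources: List[Source], commercial_use: bool) -> List[str]:
    """List reasons why combining these sources may be legally fragile."""
    problems = []
    for src in sources:
        if src.license is None:
            # Rights cannot be assumed when nothing explicitly grants them.
            problems.append(f"{src.name}: no declared license; reuse rights cannot be assumed")
        elif src.noncommercial and commercial_use:
            problems.append(f"{src.name}: noncommercial terms exclude commercial users")
    # Simplification: two different share-alike licenses are treated as not combinable.
    share_alike = sorted({s.license for s in sources if s.share_alike and s.license})
    if len(share_alike) > 1:
        problems.append("share-alike conflict: " + ", ".join(share_alike))
    return problems


# Hypothetical sources; names and license labels are placeholders.
example = [
    Source("db_a", "public-domain"),
    Source("db_b", "example-sa-1", share_alike=True),
    Source("db_c", "example-sa-2", share_alike=True),
    Source("db_d", None),
]
for problem in merge_problems(example, commercial_use=True):
    print(problem)
```

Keeping the license as explicit metadata on every source record is what makes such a check possible at all, which is the practical point behind rules 3 and 4.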
