Social, Political and Legal Aspects of Text and Data Mining (TDM)


As part of the ContentMine project, this talk gives an overview of text and data mining, up to date as of 2014.

Written by Michelle Brook, Charles Oppenheim and Peter Murray-Rust.

This presentation was given by Charles Oppenheim at WOSP 2014.

  1. 1. SOCIAL, POLITICAL AND LEGAL ASPECTS OF TEXT AND DATA MINING (TDM) Michelle Brook, The Content Mine; Peter Murray-Rust, University of Cambridge and Shuttleworth Fellow; Charles Oppenheim, Visiting Professor at City, Northampton and Robert Gordon Universities
  2. 2. SO WHAT ARE THE NON-TECHNICAL PROBLEMS OF TDM?
     • LEGAL - copyright, database rights and licensing
     • SOCIAL - the lack of awareness of TDM, and the gap between the technical demands of many TDM tools and the skills of many academics
     • POLITICAL - the massive gap between publishers’ approaches to TDM and researchers’ needs; also the lack of specific TDM exceptions to copyright in most countries’ laws
  3. 3. COPYING IS OFTEN INVOLVED IN TDM
     • PDFs, the lingua franca of academic journals, are not machine readable
     • For TDM purposes, they must be transferred into a different digital form
     • That form is often custom and specific to the research question being asked and the most appropriate tools to answer that question
     • So there is a need to copy/adapt the original PDF
  4. 4. COPYRIGHT/DATABASE RIGHT
     • Gives the owner the right to authorise, or to refuse to authorise, any of the so-called restricted acts, including: copying; adapting; redisseminating all, or a “substantial” part, of a copyright work (similar rules apply to databases)
     • Substantial does not mean “most of”, but rather “what is important”
     • If someone does such restricted acts without permission, they have infringed the right and can be sued
     • However, there are certain (but very restricted) exceptions to copyright, whereby someone CAN copy, etc., without having to ask for permission or pay fees
     • Only a few countries (the UK recently being one of them) have a specific exception for TDM in their laws
     • In the absence of such an exception in a country’s national law, researchers must ask for permission (request a licence) from the copyright owners. Generally, the copyright owners are publishers, because authors have (foolishly) assigned their copyright to them
  5. 5. THE NEW UK LAW
     • Came into force in June 2014
     • Specific exception to copyright for TDM
     • UK researchers do not have to ask for permission, pay fees, etc., to do TDM as long as it is for “non-commercial” purposes and as long as they have “lawful access” to the raw materials
     • What is, or is not, “non-commercial” is controversial, but what is clear is that the question must be asked as of the time the TDM was undertaken, so unexpected commercial benefits at the end of the project are OK, so long as the intent at the time was non-commercial
     • “Lawful access” usually means licensed content, whether OA or a subscription to the materials
  6. 6. THE PROBLEMS OF APPROACHING PUBLISHERS FOR LICENCES FOR TDM
     • Many publishers want unreasonably high fees, and/or place restrictions on what can be done with their materials after TDM, and/or require researchers to use their API, and/or take an extremely long time to decide how to respond to a TDM request
     • TDM researchers have to approach multiple publishers, each of whom has different attitudes, conditions and speeds of response to such requests
     • This is very costly to a researcher, and has a significant impact upon the take-up of TDM, as well as inhibiting academics from sharing the outputs of their TDM research
     • These problems are inhibiting the take-up of TDM, thereby limiting the potential benefits this technology enables
     • This also explains why so many TDM experiments are limited to OA materials
  7. 7. PUBLISHER TDM LICENCE INITIATIVES GENERALLY DO NOT HELP
     • Publishers have started offering their own TDM licences and policies
     • Their licences often impose unfair (and in the case of the UK, unenforceable) constraints on researchers’ freedom to exploit TDM
     • Why “unenforceable”? Because UK law specifically states that any contract or licence term that prevents anyone from doing TDM in the manner prescribed in the new exception shall be deemed null and void
     • There are exceptions of course – Springer and the Royal Society in particular offer generous TDM provisions
     • So why are publishers offering restrictive licences in the UK?
     • One can only surmise that they hope licensees are ignorant of the new law, or that the publishers themselves do not know about it. So they are either deliberately misleading, or ignorant
  8. 8. WHAT POLITICAL INITIATIVES ARE NEEDED?
     • Under EU law, countries in the EU are able to introduce exceptions for non-commercial TDM research
     • However, so far only the UK has taken advantage of this. The EC is considering an EU-wide exception for TDM, and the Republic of Ireland is also considering such a change to its national law
     • Outside of Europe, only one or two Far East countries have introduced such exceptions
     • There needs to be an international treaty requiring all countries to include an exception for TDM in their national laws
  9. 9. WHAT CAN PUBLISHERS DO TO HELP?
     • Offer all researchers world-wide the same freedom as is now available to UK researchers to undertake TDM for non-commercial research purposes, so long as the user has lawful access to the original materials
     • Earn goodwill amongst the TDM research community by offering user-friendly APIs (without, of course, REQUIRING a researcher to use them), free advice, and discussion fora for the exchange of experience and ideas in the theory and practice of TDM
     • Develop clear agreed statements as to which types of research they regard as “non-commercial” and which as “commercial”
  10. 10. ADDRESSING THE RESEARCHER/TECHNOLOGY GAP
     • Current TDM researchers are very technologically adept; work is needed to make the existing tools easier to use for those with less expertise
     • While The Content Mine and other organisations such as Software Carpentry are running workshops to help academics become more technically confident, much more needs to be done
     • The TDM community needs to help close the gaps in knowledge, ability and awareness
     • Funders and institutions also have a responsibility to ensure academics and PhD students are trained in such skills and technologies
  11. 11. IN CONCLUSION
     • The main barriers to the uptake of TDM are a lack of awareness among academics, a skills gap, legal issues around copyright and database rights, and restrictions imposed by publishers’ licences. These problems are all solvable
     • Other countries should change their laws to make TDM lawful
     • Publishers should work with the TDM academic community to develop agreed statements as to which types of research they regard as “non-commercial” and which as “commercial”, and so prevent any possible chilling effect from ambiguity around these terms
     • Funders and institutions should be exploring how to teach TDM techniques to interested academics and research students
     • Thank you for your attention.
  12. 12. SOME USEFUL RESOURCES/ACKNOWLEDGEMENT
     • Use of TDM to detect scientific fraud - 1.15859
     • General overview of benefits of TDM - D. McDonald and U. Kelly, The value and benefits of text mining (2012), benefits-of-text-mining
     • Official guidance on the new UK copyright exception for TDM - ata/file/315014/copyright-guidance-research.pdf
     • Excellent general overview of the change to UK law and its implications - - provides link to the precise wording in the law
     • Details of Springer’s and Royal Society’s initiatives at 29056 and
     • Image shown in this presentation is from Wikipedia and is covered by a Creative Commons CC BY licence
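
Slide 3 notes that, for TDM, source material must be transferred into a different digital form that is specific to the research question. A minimal Python sketch of such an adaptation step follows. It assumes the text has already been extracted from a PDF to which the researcher has lawful access; the sample text, the function name `term_frequencies` and the list of terms are hypothetical illustrations, not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical sample standing in for text already extracted from a PDF.
SAMPLE_TEXT = (
    "Aspirin reduced fever in 12 patients. "
    "No patient given aspirin reported nausea."
)


def term_frequencies(text, terms):
    """Count case-insensitive occurrences of each term of interest.

    The returned table is the derived, question-specific form of the
    original text: an adaptation rather than a verbatim copy.
    """
    # Lower-case the text and keep alphabetic runs only (digits are dropped).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # A Counter returns 0 for words that never occur.
    return {term: counts[term.lower()] for term in terms}


if __name__ == "__main__":
    print(term_frequencies(SAMPLE_TEXT, ["aspirin", "fever", "nausea", "rash"]))
```

Even this small example copies the source text into memory and derives a new representation from it, which is the kind of copying and adapting that the slide says TDM often involves.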