Developing a research Library position statement on Text and Data Mining in the UK

These are slides from a workshop held during the RLUK2017 Conference http://rlukconference.com/ presented by Dr Danny Kingsley, Dr Deborah Hansen and Anna Vernon.
The Abstract:
"The library community has been almost silent on the issue of text and data mining (T&DM) partly due to concerns about the risk of having institutions ‘cut off’ from subscriptions due to large downloads of research articles for the purpose of mining. This workshop is an intention to identify where the information rests about T&DM - including looking at the details as they appear in Jisc negotiated licenses - consider some case studies and develop together a set of principles that identify the position of research libraries on the issue of T&DM."

Published in: Education

  1. Office of Scholarly Communication. Developing a research library position statement on Text and Data Mining in the UK. RLUK 2017. Dr Danny Kingsley, Dr Debbie Hansen - University of Cambridge; Anna Vernon - Jisc. British Library - 9th March 2017
  2. Who are we?
  7. What is TDM? “the use of large online text collections to discover new facts and trends about the world itself” (Hearst, 1999†) †Hearst, M. A., Untangling Text Data Mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper)
  8. Why do you do it? Fast literature review; Extract new facts; Answer research questions; Access wide range of sources for a topic; Saves time; More back for same cost; Not achievable through manual searches; Research Innovation
  9. Mining of scholarly publications
  10. TDM basics - how
  11. What is the legal situation? Hargreaves exception; Licensing
  12. Hargreaves Review • Independent review of UK Intellectual Property system, focus UK Copyright law • Recommendations from Hargreaves Review: – introduce a copyright exception – allow TDM for non-commercial purposes – prohibit exclusion of TDM through contract • Government introduction of exception June 2014: – if user has lawful access, works can be copied for TDM for non-commercial research
  13. Current practicalities: What do publishers say about TDM?
  14. Comment from OASPA • Open Access Scholarly Publishers Association • Expressed their support of TDM efforts – By signing Hague Declaration in 2016 • Support the adoption of best practice and behaviours regarding TDM • Statement: – ‘reasonable best practice for those engaged in TDM to inform publishers … that such content mining is planned’ OASPA, December 2016, http://oaspa.org/oaspa-comment-text-data-mining-proposed-eu-copyright-reform/
  15. University College London study: £500K per annum • Estimate for TDM activity compliance – Before TDM exception • Figure from OA staffing costs – OA staffing costs to monitor TDM activity – TDM clearance fees with rightsholders
  16. Do the publishers have statements? Oxford University Press ● No permission needed for non-commercial TDM ○ But can contact for consultation on TDM (e.g. to avoid technical safeguard triggers) ● Can contact to request TDM for commercial purposes https://academic.oup.com/journals/pages/help/third_party_data_mining Royal Society ● Support use of computers to extract from scholarly publications ● Members of subscribing institutions have permission to mine ○ For non-commercial and commercial purposes ○ Respect copyright and cite where possible ● Let them know when you intend to do TDM ○ To prevent automatic lock-out https://royalsociety.org/journals/ethics-policies/data-sharing-mining/ Now also: Cambridge University Press, International Union of Crystallography
  17. Do the publishers have statements? Elsevier: Different licenses, different rules, e.g.: ● CC BY - yes to TDM ● CC BY-NC-SA - yes to TDM for non-commercial purposes ● CC BY-NC-ND - no to TDM ● Open Archives (content made available after an embargo) ○ Yes to TDM for non-commercial purposes and cite authors and source https://www.elsevier.com/connect/what-changes-when-publishing-open-access-understanding-the-fine-print
  18. Hindawi • Hindawi facilitate the use of their content for data mining purposes - https://www.hindawi.com/corpus/ • Full XML content available for download – as single .zip file – .zip file updated daily – (XML files adhere to the US National Library of Medicine Document Type Definition) • Not advertised widely – over last 12 months, 1,770 unique visits – between 60 and 90 downloads per month – roughly 720-1080 downloads for the year
  19. Example situation: Negotiations between a publisher and Cambridge University in May 2015 over TDM. – Original contract would have been binding for whole University – Data only available on a hard drive and not downloadable onto a server – Charge of £1,100 for the cost of the hard drive – Substantial number of limitations and restrictions
  20. Group discussion - about your experiences: Talk about any experiences you have had with TDM. Feedback into the group: * Challenges encountered? * Concerns? * Successes?
  21. Feedback from discussion. Situation: • Hard drive provided. • Do not know what is being asked for - who is responsible? • What is the IT responsibility here? • Copyright and compliance officer needed to do a lot of work. Solutions: • Clearer understanding of the licensing situation. • Mechanism of where to go for advice. • Procedures of what to do with it - policy issue
  22. Feedback from discussion 2 • Issue: – Researcher behaviour - academics not concerned by copyright • Library implications: – Librarians are not always aware of TDM taking place. – Help if have better understanding. – New legislation, so we are currently reactive to it – Change of role of the library - traditionally to preserve access to items. – TDM could threaten access, so internal disquiet – Would like to be enabling this activity rather than saying no you can’t • Solutions? – Help if publishers deliver material in different ways - not a hard drive. Could this be part of a platform? – Good if material was produced in a format that allowed TDM (at no extra cost)
  23. International activity in this area: There are several large initiatives looking at Text and Data Mining
  24. Work in this area - FutureTDM • Background: America and Asia lead activity in TDM • FutureTDM seek to increase TDM activity in EU • Engagement with stakeholders (e.g. researchers, developers, publishers) – Why is uptake lower in EU? – Raise awareness – Develop solutions
  25. Work in this area - European Commission. Proposal: New copyright exception for research organisations carrying out research in public interest – to carry out TDM of copyright protected content – if they have lawful access (e.g. subscription) – without prior authorisation
  26. You can have your say on the EU reform • Sign the Hague Declaration and ask your researchers to sign it http://thehaguedeclaration.com – (not just about copyright reform but about advancing research more generally) • Ask your institutions to support this joint letter for LIBER, LERU etc. http://libereurope.eu/blog/2017/01/10/eu-copyright-reform-liber-joins-leading-research-groups-call-change/ • Write to your local MEP saying why you support a European exception on TDM. Mary Honeyball and Catherine Stihler are 2 key UK MEPs • Collect examples of TDM projects, problems, solutions, share and promote them. Make the UK Intellectual Property Office aware of issues that you have with the UK legislation. • Once the report goes through European Parliament it will go to the European Council (EU heads of state) so contacting your national representatives (ministers for research etc., IP Office) will be key at this point. European Commission MEMO (MEMO/16/3011)
  27. UK-based TDM activity: British Library; ContentMine - Wikimedia project; ChemDataExtractor; NaCTeM
  28. British Library EThOS • E-Thesis On-line Service • British Library opportunity for PhD student placement* • TDM on 150,000 theses held in EThOS – Extract new metadata information – E.g. Names of supervisors from Acknowledgements, funding information – Outputs feed into future initiatives British Library, 2017, https://www.bl.uk/news/2016/november/british-library-phd-placements-call-for-applications *Applications closed 20 February 2017
  29. ContentMine and WikiFactMine • ContentMine supplies open source TDM software to access and analyse documents • Project grant to develop WikiFactMine – ContentMine partnering with Wikimedia Foundation – Project aims to make scientific data available to editors of Wikidata and Wikipedia http://contentmine.org/
  30. ChemDataExtractor • Molecular Engineering Group, University of Cambridge • Chemical information from scientific documentation (e.g. text, tables) • Open source software package • Extracted data for onward analysis Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904, doi:10.1021/acs.jcim.6b00207
  31. Biomedical Text Mining • Manchester Institute of Biotechnology - National Centre for Text Mining (NaCTeM) • Text mining tools and services in the biomedical field http://www.nactem.ac.uk/index.php
  32. The problem we are trying to solve: Libraries are worried about getting cut off from their subscription by publishers due to large downloads of papers through TDM activity
  33. Being cut off - how it works • Publishers' systems pre-programmed to react to suspicious activity • TDM may invoke automated investigation, may cause access block • Universities need to maintain a support mechanism to ensure continuity of access – Require workflows for swift resolution, fast communication, team of communicators • Also requires education of researchers about potential issues
  34. Discussion: Write on three separate post-it notes your top three reasons why your organisation is not actively supporting TDM (Yellow post-its). If your organisation is supporting TDM write the top three challenges you face (Green post-its)
  35. Discussion feedback: Why not supporting? ● Practical ○ Challenges of handling physical media ○ Risk of lockout ● Lack of demand ○ We are not getting enquiries. Perhaps not coming to the Library. Someone in IT supporting research computing may not even pass on the queries. Internal discussion needed. ○ Not much call ● Who is responsible? ○ No institutional view on TDM because the issues are not raised at academic level. - POLICY NEEDED ○ How can a library provide a service - responding to individual queries, how do we scale it up? ○ Not joined up - assumption in the discussion that the Library is at the centre of all this and we are not joined up as organisations
  36. Discussion feedback: Challenges? ● When carrying out research within a specific environment it should be relatively straightforward if it remains within the environment. ● Complicated ○ In order to provide access to the data, there are requirements at the content owner level - everyone needs to understand the need. ○ Intrusive on the researcher process. ○ Need to ensure it is not commercial use, and ensuring people know their responsibilities ● Time ○ A contract with a particular publisher to allow our researchers to TDM took two years to finalise.
  37. Proposal: To draft a statement for a Service Level Agreement for publishers to assure us that if the activity is legal we will be reinstated within 1 hour (or something like that). Discuss - What are the issues if we did this?
  38. Discussion feedback 1: Expectation of publishers? • Publishers contact the library to give a grace period to investigate rather than being cut off • Way publisher platforms operate - – LOCKSS crawls publisher software without getting trapped. – This could work in the same way with a bank of IP addresses that is secured for this purpose. – Avoid some of the manual work. Third party IP registry. • Basis for the conversation over the SLA – The law is on the subscriber’s side if everyone is doing it legally. – We need an understanding of the extent of infringing activity going on with University networks (understanding that people can ‘mask’ themselves). – Useful for thinking of thresholds.
  39. Discussion feedback 2: Expectation of libraries? • Would not like to do a register. • Range of IP addresses to be part of the license agreement • Create a safe space for TDM? Or is this a barrier? • Trying to design something which is bolted onto a different use content. Large scale computational reading is something totally different. • Two issues – How do we manage the licenses we are currently signed up to? – How do we manage licensing into the future so we separate the different uses?
  40. Discussion feedback 3: Time frames? – Being cut off for a week or two weeks with no redress is unusual at best!
  41. Agreed Expectations: 1. Don’t cut us off! Have a conversation first (and if you want to cut us off - prove there are all these activities happening in the UK) 2. If you do cut us off and it turns out to be legitimate then we expect compensation for the time we were cut off 3. Mechanisms for TDM where certain behaviours are expected - built into separate licensing agreements for TDM
  43. Next steps: Getting the statement endorsed by RLUK, funding councils etc. Take to publishers.