Alex Fenlon - University of Birmingham, Lisa Bird - University of Birmingham
In this session we look at how Library Services at Birmingham responded to researchers wanting to leverage the UK’s copyright rules around text and data mining (TDM) for non-commercial research purposes. Our talk will cover our journey from initial engagement with researchers, to exploring infrastructure issues with IT colleagues, and encountering skills gaps as we look to develop new services and activities that meet the needs of those using TDM, artificial intelligence (AI), machine learning (ML) or Big Data methodologies in teaching and research. Contributions from others, whether just starting their journey or travelling a well-trodden path, are most welcome.
UKSG 2023 - A TDM journey: understanding user needs and developing library support
1. A TDM journey: understanding user needs and developing library support
James Barnett, Research Skills Advisor, @jameswbarnett
Lisa Bird, Copyright & Licensing Advisor, @CopyadvisorBird
Alex Fenlon, Head of Copyright & Licensing, @Afenoer
2. Outline of the session
▪ How the Library got involved in TDM.
▪ The Library’s TDM Support Project.
▪ What’s next for the Library and TDM.
4. Library Services
▪ 5 library locations including Dubai.
▪ 3 special collection libraries.
▪ 6,000 – 7,000 visits per day.
▪ Collections:
▪ 1.5 million print books
▪ 850,000 electronic books
▪ 127,000 print and electronic journals.
5. Key actors supporting TDM
▪ Library Services: Academic Skills Centre; Collection Management and Development (Research Skills, Copyright and Licensing, Collection Development); Scholarly Communications
▪ IT Services: BEAR
▪ Wider University: Institute for Interdisciplinary Data Science & Artificial Intelligence; Interdisciplinary Data Science & AI – MS Teams Network
7. The Beginning
▪ Aware of 2014 changes to UK law
▪ TDM exception for non-commercial research
▪ Supporting some researchers with ad hoc queries
▪ No practical skills or experience
▪ No ‘service’ offer.
▪ Identified potential for growing interest.
“Artificial intelligence and copyright intellectual property brain” image generated by DALL-E / OpenAI, 07/03/23
8. Corpus linguistics
▪ Funded research project looking at the use of language in literature:
▪ Dickens novels
▪ Times Digital Archive (TDA)
▪ Called for Library Support
▪ Copyright
▪ Access & use of data
▪ Hosting
Home screen from clic.bham.ac.uk
9. Supporting the project: How do we enable this?
▪ Make it available
▪ Shelf vs research infrastructure
▪ Just for this project or for others?
▪ Data security, preservation, loss prevention
▪ Uses by researchers
▪ Sharing of data internally and externally
▪ Reproducibility
▪ Publication activity
10. Supporting the project: IT Infrastructure
▪ BEAR
▪ Golden copy
▪ Usable version
▪ Researcher space
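A minimal sketch of how a golden copy and a usable version might be kept in step, assuming both are ordinary directories of files. The talk does not describe how this is done on BEAR; the checksum approach and the function names (`sha256_of`, `build_manifest`, `changed_files`) are illustrative assumptions, not the Library's actual process.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(golden: Path, working: Path) -> list[str]:
    """List golden-copy files that are missing from, or differ in, the working copy."""
    golden_manifest = build_manifest(golden)
    working_manifest = build_manifest(working)
    return [
        name
        for name, checksum in golden_manifest.items()
        if working_manifest.get(name) != checksum
    ]
```

Calling `changed_files(Path("golden"), Path("research_copy"))` (hypothetical directory names) returns the relative paths that would need restoring from the golden copy; an empty list means the working copy still matches it.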
11. Supporting the project: Legal issues.
▪ Understanding legal issues
▪ Contracts/ licences/ copyright exceptions
▪ Uses in research
▪ Retention/ preservation
▪ Publication
▪ Snippets
▪ Communicating conditions of use
12. Building on this: TDM Project 2018
Aims:
▪ Explore how this research activity might be expanded
▪ Can we make the TDA data available to others for TDM purposes?
▪ Is there interest in using it more widely?
▪ Are others already using it?
▪ Support & enable other researchers to access the data
▪ How can access be provided?
▪ Requesting mechanism
▪ Provide guidance on TDM related issues
13. TDM Project: Access mechanism
▪ Use Research infrastructure
▪ Secure
▪ Managed
▪ Familiar
▪ Existing
▪ Cooperation with colleagues vital
14. TDM Project: Usability of data
▪ Is the data usable?
▪ What editing is needed to enable access?
▪ Golden copy vs Research copy
▪ Library staff can administer
▪ Scalability?
▪ Accessibility?
15. Increase in queries
▪ Variety of queries
▪ Newspapers
▪ Citation data
▪ Legal cases
▪ Medical data
▪ Literature
▪ Can we have access to the API please?
17. TDM Project: Survey of activity
In 2022 the Library sent a TDM survey to:
▪ 150 academic staff via targeted email
▪ PGRs via a story in the weekly online newsletter
Asked:
▪ If they were involved in TDM activity
▪ What tools they used
▪ What were the barriers to TDM
▪ How useful they found the Library’s TDM webpages.
18. TDM Survey: responses
▪ Had a 15% response rate.
▪ Responses from all five Colleges across the University.
▪ 78% of respondents were already involved in TDM projects.
▪ Those not involved wanted to be, but need training.
19. TDM Survey: Tools
▪ 35 tools mentioned.
Name of Tool               %
R / RStudio                39%
Python                     30%
Self-created / In-house    13%
Matlab                     9%
“I use Python for pre-processing and text mining, followed by R for statistical analysis.”
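To illustrate the kind of Python pre-processing the respondent describes, here is a minimal sketch using only the standard library. The stop-word list and the function names (`tokenize`, `term_frequencies`) are hypothetical, and the sample sentence is the opening of Dickens's A Tale of Two Cities, chosen only as a public-domain example; real projects would likely use a fuller stop-word list and a dedicated text-mining library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "is"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and extract word tokens (allowing one internal apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def term_frequencies(text: str, stop_words: set[str] = STOP_WORDS) -> Counter:
    """Count how often each non-stop-word token occurs in the text."""
    return Counter(token for token in tokenize(text) if token not in stop_words)


sample = "It was the best of times, it was the worst of times"
print(term_frequencies(sample).most_common(1))  # [('times', 2)]
```

The resulting counts could then be exported (for example as CSV) for statistical analysis in R, matching the two-stage workflow in the quote.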
20. TDM Survey: Barriers
Barrier                      %
Skill/Technical Knowledge    17%
Data Availability            17%
Skills:
Data cleaning
Programming
Which tool is best to use.
Data:
API costs
Contractual issues
Locating data sets.
21. TDM Survey: web pages
Positives:
▪ Good starting point.
▪ Helpful for beginners
Development:
▪ User case studies
▪ Skills guidance
▪ More dynamic - interactive.
22. TDM Project: Outcomes
▪ Data now available for access and use by others
▪ Other dataset also uploaded
▪ Webpage revised, further revisions in development
▪ Established connections with more researchers
▪ Membership of Professional Services Group working with Institute for Interdisciplinary Data Science & AI
▪ Increased awareness and skills base
24. Context: Current Landscape
▪ Government focus on nurturing & supporting AI growth in UK
▪ Sector bodies looking to develop services and upskill staff & students
▪ Legal issues uncertain
▪ Ethical issues uncertain
25. Hello ChatGPT!
▪ How will generative AI tools like ChatGPT impact TDM activity?
▪ In theory, ChatGPT’s open API will make TDM easier:
▪ Might be incorporated into third party TDM tools
▪ Can be used to code data with >90% accuracy (Mellon et al., 2022)
▪ ChatGPT plugins now available – potential to prompt specific queries based on own data sets?
26. ChatGPT and TDM issues
▪ However, are there trust/integrity issues?
▪ Can researchers see ‘under the bonnet’ of AI tools enough for large-scale TDM projects?
▪ What are the copyright & licensing implications of using and contributing to such tools?
▪ Reproducibility concerns
▪ ChatGPT inaccuracies (Davis, 2023)
▪ Librarian role – equipping researchers with skills to critically evaluate AI tools (Upshall, 2022)
27. Addressing the TDM Skills Gap
Who supports that development?
▪ Copyright & Licensing Team
▪ Research Skills Team
▪ Academic Skills Centre
▪ BEAR (IT Services)
28. Addressing the TDM Skills Gap
How do we upskill Library staff?
▪ LinkedIn Learning
▪ BEAR Coding Basic Course
▪ Code First Girls
▪ RLUK DSN
▪ CILIP UKeiG
▪ Gale Digital Scholar Lab
29. Current state of support: Content and tool provision
▪ Budget is tailored towards content
▪ Can budgets be used to provide APIs?
▪ API costs vs skills/ capacity / resource
▪ New platforms/ tools to support/ enable access
▪ Biases, privacy/ data protection/ user tracking
▪ Preservation, interoperability, sustainability, reproducibility
30. TDM and Print Collections: Digitise to Mine Project
▪ Explore how existing digitisation workflows might be adjusted to support AI/ TDM/ ML research
▪ Engage with researchers to explore needs
▪ What do they do?
▪ What formats do they need data in?
▪ Is there best practice in the sector to do this?
▪ Can we support them?
▪ Review resourcing requirements
31. Digitise to Mine Project: Service Development
▪ If the DS outputs can be used, could this be a service?
▪ What are the implications?
▪ Needs of researchers
▪ Scale & scope
▪ Infrastructure
▪ Resourcing/ staffing
▪ What are the copyright/ Data Protection issues that need addressing?
32. 2023 & Beyond:
▪ Skills/ expertise development & Knowledge sharing
▪ Researcher engagement
▪ Content Access
▪ Improved web guidance
▪ Horizon scan
▪ Observe + participate + support
▪ Seek input and guidance from peers
▪ TDM Canvas course
▪ ‘AI Tools for Researchers’ workshop
33. Summary: Takeaways
▪ Collaboration is essential
▪ Starting small is fine
▪ Be confident using the legal exception for TDM
▪ Upskill staff
▪ It’s not just for researchers
▪ Think electronic and print.
34. A question for the audience – yes you!
What is the Library’s role in Text and Data Mining?
36. Reference List
Slide 24:
▪ Jisc (n.d.) Explore AI. Available at: https://exploreai.jisc.ac.uk/ (Accessed: 27 March 2023)
▪ RLUK (2022) RLUK ADN Digital Workforce Development Strategy 2022-25. Available at: https://www.rluk.ac.uk/digital-workforce-strategy/ (Accessed: 27 March 2023)
▪ UK HM Government (2021) National AI Strategy. Available at: https://www.gov.uk/government/publications/national-ai-strategy (Accessed: 27 March 2023)
37. Reference List
Slide 25:
▪ Mellon, J., Bailey, J., Scott, R., Breckwoldt, J. and Miori, M. (2022) ‘Does GPT-3 know what the Most Important Issue is? Using Large Language Models to Code Open-Text Social Survey Responses At Scale’, SSRN, 27 December 2022. Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4310154 (Accessed: 27 March 2023)
▪ OpenAI (2023) ChatGPT plugins. Available at: https://openai.com/blog/chatgpt-plugins (Accessed: 27 March 2023)
38. Reference List
Slide 26:
▪ Davis, P. (2023) ‘Did ChatGPT Just Lie To Me?’, The Scholarly Kitchen, 13 January. Available at: https://scholarlykitchen.sspnet.org/2023/01/13/did-chatgpt-just-lie-to-me/ (Accessed: 27 March 2023)
▪ Upshall, M. (2022) ‘An AI toolkit for libraries’, Insights, 35, pp. 18. doi: https://doi.org/10.1629/uksg.592