This session will demystify (generative) AI by exploring its workings as an advanced statistical modelling tool (suitable for any level of technical knowledge). Not only will this session explain the technological underpinnings of AI, it will also address concerns and (long-term) requirements around the ethical and practical use of AI. This includes data preparation and cleaning, data ownership, and the value of data generated, but not owned, by libraries. It will also discuss potential (hypothetical) use cases of AI in collections environments and making collections data AI-ready, providing examples of AI capabilities and applications beyond chatbots.
5. About me
• 2014-17: BA Liberal Arts & Sciences (Computer Science, Statistics, Neuroscience, Linguistics)
• 2018-19: MA Applied Linguistics (Conversation analysis, Pragmatics, Corpus linguistics)
• 2019-: PhD Applied Linguistics (Conversation analysis, Ambiguity in stories, Human–computer interaction)
• 2021-23: Collections Assistant, Acquisitions & Reading Lists, University of Leeds
• 2023-: Publishing Technologies Librarian, Open Library of Humanities / Janeway Systems
8. Demystifying AI
Opening the black box
What is AI?
“[algorithm-based technologies that aim to] simulate human intelligence and problem-solving capabilities”
– IBM
9. Machine learning
“We say that a machine learns with respect to a particular task (T), performance metric P, and type of [training] experience (E), if the system reliably improves its performance (P) at task (T), following experience (E). Depending on how we specify T, P, and E, the learning task might also be called by names such as data mining, autonomous discovery, database updating, programming by example, etc.”
– Tom M. Mitchell, “The Discipline of Machine Learning”

Task                  | Performance measure            | Training experience
Playing chess matches | % of games won against players | Practice games against itself
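As a toy illustration of Mitchell's framing, the sketch below watches a performance measure (P, held-out accuracy) improve at a task (T, classification) as training experience (E) grows. The synthetic dataset and model are illustrative assumptions, not from the talk.

```python
# Toy T/P/E demonstration with scikit-learn (a synthetic classification
# task stands in for chess; all choices here are illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# T: classify synthetic points into two classes.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# E: progressively more training examples; P: accuracy on unseen data.
for n in (50, 200, 1500):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    p = accuracy_score(y_test, model.predict(X_test))
    print(f"experience={n:4d} examples -> performance={p:.2f}")
```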
16. Rise of the Transformers

‘Old AI’                                                                      | Transformers
Specialised use cases                                                         | Broader use cases
Re-training and re-writing                                                    | Fine-tuning
1000s – 10,000s tokens (a library dataset or a large set of usage statistics) | Millions – billions of tokens (significant portions of the internet)
Relatively easy and transparent                                               | Complex – ‘black box’
Cheap                                                                         | Expensive
Find the right tool for the job!
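The "Fine-tuning" row of the table can be made concrete. Below is a minimal sketch, assuming the Hugging Face transformers library, a small pretrained checkpoint (distilbert-base-uncased) and a hypothetical two-label task; rather than re-training from scratch, a few gradient steps adapt the pretrained weights.

```python
# Minimal fine-tuning sketch (the model choice, texts and labels are
# illustrative assumptions, not the presenter's setup).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # assumed small checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Shipment received and invoiced.", "Title missing from reading list."]
labels = torch.tensor([0, 1])  # hypothetical task: routine vs. needs attention

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for _ in range(3):  # a few steps stand in for a real training loop
    out = model(**batch, labels=labels)  # loss computed against the toy labels
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```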
23. DATA VALUE
• Scarcity and commercial value
• Data can make or break development
• Copyright
• Contracts and agreements
• Can our data be used to train products that will be sold back to us (and that generate yet more data)?
Define commercially sensitive/valuable data
24. DATA SECURITY & PRIVACY
• Is our data secure?
• Do users have a right to privacy?
! Microsoft Copilot and the US Congress
Don’t feed AI GDPR-protected or commercially sensitive data
25. DATA QUALITY
• Completeness, accuracy and consistency
• The correct data for the task
• For what purpose was the data collected?
• What is its ecosystem, and what are the underlying assumptions?

Quantitative examples                | Qualitative examples
Citation metrics ↛ Research quality  | Country of pub./nationality ↛ Diversity
Budget ↛ Service quality             | Academic book reviews
Reading list statistics ↛ Engagement | Surveys
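Checks like these can start very simply. Below is a hedged first-pass sketch in pandas for the completeness and consistency points; the file name and the 'isbn' column are assumptions for illustration.

```python
# First-pass data quality checks with pandas (file/column names are assumed).
import pandas as pd

df = pd.read_csv("collections_export.csv")  # hypothetical catalogue export

# Completeness: share of non-missing values per column.
completeness = 1 - df.isna().mean()
print(completeness.sort_values().head(10))

# Consistency: repeated identifiers are a common red flag.
print("duplicate ISBNs:", df["isbn"].duplicated().sum())
```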
26. DATA BIAS
• Algorithmic bias and data provenance

“Managing bias rather than working to eliminate bias is a distinction born of the sense that elimination is not possible because elimination would be a kind of bias itself—essentially a well-meaning, if ultimately futile, ouroboros.” (Padilla, 2019)

ALL DATASETS ARE BIASED
27. Use cases
• Proof of concept for use cases and in-house development
• Generative AI and ‘regular’ machine learning
• Failing fast and failing often
• Other future applications
29. Generating MARC – ESTHER
Run No. 1 – Web interface
1. Limits on file size for upload
2. Could not always access links
3. Only older / better-known titles and/or ISBNs
4. No control
Run No. 2 – API (langchain)
1. Could use various file types and sizes
2. Could access weblinks
3. Print based on weblinks was variable
4. Control!
ALWAYS USE THE API FOR ANALYSES
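For a rough idea of what the API route looks like, here is a minimal sketch (not the ESTHER code itself), assuming a recent langchain-openai package, an OPENAI_API_KEY in the environment, and an assumed model name; placeholder metadata stands in for a real record, and any generated MARC still needs human review.

```python
# Hedged sketch: asking an LLM via LangChain to draft a MARC 21 record.
# Model name, prompt wording and metadata are illustrative assumptions.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

metadata = {
    "title": "Demystifying AI for Libraries",
    "author": "Example, Author",
    "isbn": "9780000000000",  # placeholder
    "year": "2024",
}

prompt = (
    "Draft a MARC 21 bibliographic record (mnemonic format, one field per "
    f"line) for this book: {metadata}. Include only 020, 100, 245 and 264."
)

record = llm.invoke(prompt).content  # .invoke returns a message; .content is text
print(record)
```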
32. Title comparisons – MARY
• Run No. 1 – CSV upload to web (failed)
• Run No. 1.5 – Cleaning CSV in web (failed)
• Run No. 2 – API (pandas / langchain)
  • Worked reasonably well, but expensive?
• Run No. 3 – Machine learning
  • Neural network (numpy, PyTorch, sklearn, pandas)
  • Training on outliers and low-confidence answers
  • More work, testing and refining – but it works* and is cheaper!
  • Data processing and cleaning
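A much simpler stand-in gives the flavour of the non-generative route in Run No. 3: character n-gram TF-IDF plus cosine similarity with scikit-learn. The titles and the 0.6 threshold below are assumptions, and the slide's actual solution was a neural network, not this.

```python
# Title comparison sketch: char n-gram TF-IDF + cosine similarity.
# A simplified stand-in for the neural network described on the slide.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalogue = ["The pragmatics of conversation", "Corpus linguistics: an introduction"]
incoming = "Corpus Linguistics - An Introduction (2nd ed.)"

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vec.fit_transform(catalogue + [incoming])

# Compare the incoming title against every catalogue title.
scores = cosine_similarity(matrix[-1], matrix[:-1])[0]
for title, score in zip(catalogue, scores):
    verdict = "likely match" if score > 0.6 else "low confidence"  # assumed cut-off
    print(f"{score:.2f}  {title}  ->  {verdict}")
```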
33. Fail fast, fail often
JUDITH – Recommender
• Combined usage statistics, reading list statistics and reading list information using predictive modelling (see the sketch after this slide)
• Worked ‘theoretically’ but lacked data, and thus meaningful testing
• Best integrated into LMS
MIRIAM – Print book usage
• Determining high and low usage?
• Combined print usage, ILL and reading list statistics
• Lacked contextual data / understanding – unable to do meaningful development or testing
MAGDALENE – Finding new editions
• Building on MARY: comparing titles and then identifying new editions
• More complex than expected
• Required an additional API
• Best integrated in LMS
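For the JUDITH idea, a hedged sketch of the "combine statistics, then model" step is below. Every file and column name is an assumption, and the proxy target is invented for illustration; as the slide notes, the real blocker was the lack of data for meaningful testing.

```python
# JUDITH-style sketch: merge usage and reading-list statistics, fit a simple
# predictive model, and rank titles. All names below are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

usage = pd.read_csv("usage_stats.csv")         # assumed: isbn, loans, downloads
lists = pd.read_csv("reading_list_stats.csv")  # assumed: isbn, lists, clicks

df = usage.merge(lists, on="isbn", how="inner")
X = df[["loans", "downloads", "lists"]]
y = (df["clicks"] > df["clicks"].median()).astype(int)  # invented proxy target

model = LogisticRegression().fit(X, y)
df["recommend_score"] = model.predict_proba(X)[:, 1]
print(df.sort_values("recommend_score", ascending=False).head())
```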
38. Looking to the future
• Model sustainability
  • Financially
  • Ecologically
• Copyright
• Open source
  • HuggingFace
  • OpenLLMs
  • Specialised LLMs? (GLiNER on GitHub)
• Transformer technology
  • The engine, rather than the tech
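Specialised models like GLiNER illustrate the "right tool for the job" point: a compact model for one capability (named entity recognition) rather than a general-purpose chatbot. The sketch below follows GLiNER's published usage examples; the checkpoint name, labels and sample text are assumptions.

```python
# Hedged sketch of a specialised model: entity extraction with GLiNER.
# Checkpoint, labels and text are illustrative assumptions.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Ordered three copies of Corpus Linguistics by Tony McEnery, published 2019."
labels = ["person", "book title", "date"]  # GLiNER takes arbitrary label strings

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```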
39. Short-term actions – library

DATA
▪ Commercial value of data
▪ Data governance
▪ Data assessment criteria
▪ Data audits

MODELS
▪ Benchmarks and quality standards
▪ Appropriate tools
▪ Experiment!
▪ Assess the context

PEOPLE
▪ Find the right people
▪ Cross-department collaboration
▪ Working with the data and workflows
▪ Problem formulation