Presentation from the MDG 2020 online conference.
Presentation abstract:
Aims
The purpose of this paper is twofold:
1. To describe how the British Library used machine learning techniques to automatically assign language codes to catalogue records, as part of ongoing metadata enhancement work;
2. To demonstrate that it is possible to undertake machine learning tasks using commonly-available computers and software.
The project
The project itself arose from the identification of a ‘language gap’ between the British Library’s current cataloguing practice and the quality of its legacy metadata: as of October 2018, 30% of legacy records were lacking language codes. The intention was to develop automated language identification tools which could assign language codes to catalogue records, in order to provide information about the languages of content of the resources described.
In the first phase of the project, language codes were assigned to 1.15 million records with 99.7% confidence, and it is intended that approximately 4 million legacy records will eventually be enhanced in this way.
The work has been well-received by Library patrons and staff, who have been able to use the additional language coding to better understand the collection, and to identify curatorial responsibilities.
Themes of the paper
The paper will cover:
1. A brief introduction to machine learning in this context, outlining what it encompasses, and how it might realistically be employed in the library/bibliographic context – for example, in metadata enhancement or cleaning of legacy data;
2. A definition of the problem of language identification: the process of determining the natural language in which a given piece of text is written;
3. Various approaches to tackling the problem of language identification, including linguistic and statistical modelling, and the advantages and disadvantages of these;
4. Particular challenges posed by the bibliographic/library context, including the number of languages to consider – there are nearly 500 MARC language codes – and the fact that the data to be analysed are text strings extracted from catalogue records, often less than 200 characters in length;
5. The statistical model developed as part of the project, focusing on features of the model which make it appropriate for its intended usage, rather than on the details of statistical analysis or computer programming;
6. Suggested skills, techniques and tools applicable to projects of this nature;
7. An overview of lessons learnt, and the implications that these might have for future work in the same area, or for other metadata enhancement projects.
Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records
Victoria Morris
Metadata Standards Team, British Library
8th September 2020
3. Language identification
• Determining the natural language in which a given piece of text is written
• Texts analysed are referred to as documents
• Analysis of short documents received relatively little attention until …
4. The problem
• Most records include information about the language of content …
• … but some (~ 4.6 million) do not.
[Chart: Language of Content, 2013–2017 – Foundation catalogues, Integrated Catalogue, Annual Production]
5. Possible approaches?
Linguistic modelling:
• Analysis of grammatical structure (morphological properties of nouns, verbs, adjectives, etc.)
• Analysis of alphabets, diacritics etc.
Statistical models:
• Analysis of features without semantics
• Features may be words, character n-grams (sequences of n adjacent characters) or word n-grams (sketched below)
[Slide annotations: Realistic; Complex; Can be done on a PC; A bit naïve?]
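To make the feature types concrete, here is a minimal Python sketch (illustrative only, not the project's code) of extracting word and character n-gram features from a title string:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams: every sequence of n adjacent characters in the string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def words(text):
    """Word features: a simple lowercase whitespace split."""
    return text.lower().split()

title = "Geschichte der deutschen Literatur"
print(words(title))                                # ['geschichte', 'der', 'deutschen', 'literatur']
print(Counter(char_ngrams(title)).most_common(5))  # the five most frequent character trigrams
```

Character n-grams are often more robust than whole words for very short texts, since even a title of under 200 characters yields many overlapping trigram features.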
6. Statistical models
• Build a model using records where we do have language information
• Compare other records to the model
• Predict (guess?) what the language is
[Illustration: ‘books’ sorted into French, Italian and Jongor buckets – “This ‘book’ is in Italian?”]
Could model any property of the metadata – not just language
7. Comprehensible example: Rank order statistics
• Rank words by frequency of occurrence:
  • In each language
  • In each document
• Documents which rank words in the same order are likely to be in the same language? (sketched below)
Doesn’t work for short documents
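The rank-order idea might be sketched as follows: build a ranked word list for each language from training text, rank the words of the document, and choose the language whose ranking is least “out of place”. This is a toy illustration only – the language profiles are invented and far too small, and the specific distance measure is an assumption – but it also hints at why the method struggles with short documents: a short title supplies very few ranked words to compare.

```python
from collections import Counter

def rank_profile(text, top=50):
    """Rank words by frequency of occurrence; rank 0 = most frequent."""
    counts = Counter(text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; smaller means the two rankings agree more closely."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile.get(word, max_penalty))
        for word, rank in doc_profile.items()
    )

# Toy 'training' text per language (invented, far too small to be realistic)
training = {
    "eng": "the history of the english language and the people who speak it",
    "fre": "la langue et la litterature de la france au moyen age",
}
lang_profiles = {lang: rank_profile(text) for lang, text in training.items()}

title = "la grammaire de la langue francaise"
doc_profile = rank_profile(title)
best = min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))
print(best)  # 'fre' for this toy example
```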
8. Incomprehensible example: Bayesian models
• Calculate the probability of a document being in a particular language, based on:
  • words
  • n-grams
A bit naïve? Features are not really independent

$\mathbb{P}(D \text{ is in language } l \text{ given that it has features } f_1 \dots f_n) \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{feature } f_i \text{ arises in language } l)$
9. The Bayesian idea
• Create a statistical model to analyse the words in the title and make a prediction about the language(s) of the content
• Dependency on a large “training set” of data to create the word–language frequency model
definitely a Hungarian word ⇒ probably a Hungarian title
12. Matrix of probabilities

$\mathbb{P}(\text{word } w_i \text{ arises in language } l) = \dfrac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$

$\mathbb{P}(\text{‘rieser’ is a German word}) = \dfrac{\text{number of occurrences of ‘rieser’ in German titles}}{\text{total number of words in all German titles}} = \dfrac{\text{number of times ‘rieser’ appears in the German bucket}}{\text{total number of words in the German bucket}}$
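A minimal sketch of how such a word–language frequency “matrix” might be built from records that already carry language codes (the training data and names here are invented for illustration; this is not the Library's code):

```python
from collections import Counter, defaultdict

# Toy training data: (title, MARC language code) pairs taken from records
# that already have language coding. Real training sets are far larger.
training_records = [
    ("Geschichte der deutschen Literatur", "ger"),
    ("Der Zauberberg", "ger"),
    ("Histoire de la litterature francaise", "fre"),
    ("La peste", "fre"),
]

# 'Matrix' of counts: word_counts[language][word] = number of occurrences
word_counts = defaultdict(Counter)
for title, lang in training_records:
    word_counts[lang].update(title.lower().split())

def word_probability(word, lang):
    """P(word arises in language) = occurrences of the word in that language's
    'bucket' divided by the total number of words in that bucket."""
    total = sum(word_counts[lang].values())
    return word_counts[lang][word] / total if total else 0.0

print(word_probability("der", "ger"))   # 2/6 for this toy data
print(word_probability("der", "fre"))   # 0.0 - 'der' never seen in the French bucket
```

Note that a word never seen in a language's bucket gets probability zero, which would wipe out the whole product in the Bayesian formula; the prediction sketch below therefore adds simple smoothing.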
14. The maths
• Naïve Bayesian probabilities
• Scary maths … but it’s all based on counting words, which computers are good at

$\mathbb{P}(D \text{ is in language } l \text{ given that it has words } w_1 \dots w_n)$
$\quad \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{word } w_i \text{ arises in language } l)$
$\quad \propto \prod_{i=1}^{n} \dfrac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$
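Continuing the toy illustration above, a prediction step might look like the sketch below. It is not the project's code: it repeats the invented training data, adds add-one (Laplace) smoothing so that unseen words do not zero the product, and sums log-probabilities rather than multiplying raw probabilities, to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

training_records = [
    ("Geschichte der deutschen Literatur", "ger"),
    ("Der Zauberberg", "ger"),
    ("Histoire de la litterature francaise", "fre"),
    ("La peste", "fre"),
]

word_counts = defaultdict(Counter)
for title, lang in training_records:
    word_counts[lang].update(title.lower().split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict_language(title):
    """Score each language by log P(l) + sum of log P(word | l); return the best."""
    words = title.lower().split()
    n_records = len(training_records)
    scores = {}
    for lang, counts in word_counts.items():
        total = sum(counts.values())
        # Prior: proportion of training records in this language
        prior = sum(1 for _, l in training_records if l == lang) / n_records
        score = math.log(prior)
        for word in words:
            # Add-one (Laplace) smoothing so unseen words don't give probability zero
            p = (counts[word] + 1) / (total + len(vocabulary))
            score += math.log(p)
        scores[lang] = score
    return max(scores, key=scores.get)

print(predict_language("Die Geschichte der Stadt"))   # 'ger' for this toy model
```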
15. How do we measure success?
• Precision: does the model predict the correct language?
• Recall: does the model find everything in a particular language?

                      Micro-averaged          Macro-averaged
                      Precision   Recall      Precision   Recall
Bayesian (words)      86.1%       65.4%       63.7%       64.1%
Bayesian (n-grams)    37.2%       17.9%       23.1%       23.1%
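The slide does not define the two averaging schemes, but the standard definitions are: micro-averaging pools true positives (TP), false positives (FP) and false negatives (FN) across all languages before computing a single precision and recall, so frequent languages dominate; macro-averaging computes precision and recall per language and takes the unweighted mean, so rare languages count equally.

$\text{Precision}_{\text{micro}} = \dfrac{\sum_l TP_l}{\sum_l (TP_l + FP_l)}, \qquad \text{Recall}_{\text{micro}} = \dfrac{\sum_l TP_l}{\sum_l (TP_l + FN_l)}$

$\text{Precision}_{\text{macro}} = \dfrac{1}{|L|} \sum_l \dfrac{TP_l}{TP_l + FP_l}, \qquad \text{Recall}_{\text{macro}} = \dfrac{1}{|L|} \sum_l \dfrac{TP_l}{TP_l + FN_l}$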
16. Assumptions
• Language codes already present within the catalogue are correct (!)
• Catalogue records are monolingual
• Title, edition and series title are in the same language as the resource
• No new languages
20. The impact
3 tranches of ~ 1 million language codes:
1. 99.7% confidence
2. 99.4% confidence
3. 99.1% confidence
Curators able to identify collection responsibilities more accurately
Researchers able to discover more texts
[Chart: Language of Content (provisional), 2013–2020 – Foundation catalogues, Integrated Catalogue, Annual Production]
21. Write-up
• Paper in Cataloging & Classification Quarterly available at: https://doi.org/10.1080/01639374.2019.1700201
• In the BL research repository at: https://bl.iro.bl.uk/work/sc/6c99ffcb-0003-477d-8a58-64cf8c45ecf5
• Source code will appear on GitHub one day … https://github.com/victoriamorris