Presentation from the MDG 2020 online conference.
Presentation abstract:
Aims
The purpose of this paper is twofold:
1. To describe how the British Library used machine learning techniques to automatically assign language codes to catalogue records, as part of ongoing metadata enhancement work;
2. To demonstrate that it is possible to undertake machine learning tasks using commonly-available computers and software.
The project
The project itself arose from the identification of a ‘language gap’ between the British Library’s current cataloguing practice and the quality of its legacy metadata: as of October 2018, 30% of legacy records were lacking language codes. The intention was to develop automated language identification tools which could assign language codes to catalogue records, in order to provide information about the languages of content of the resources described.
In the first phase of the project, language codes were assigned to 1.15 million records with 99.7% confidence, and it is intended that approximately 4 million legacy records will eventually be enhanced in this way.
The work has been well-received by Library patrons and staff, who have been able to use the additional language coding to better understand the collection, and to identify curatorial responsibilities.
Themes of the paper
The paper will cover:
1. A brief introduction to machine learning in this context, outlining what it encompasses, and how it might realistically be employed in the library/bibliographic context – for example, in metadata enhancement or cleaning of legacy data;
2. A definition of the problem of language identification: the process of determining the natural language in which a given piece of text is written;
3. Various approaches to tackling the problem of language identification, including linguistic and statistical modelling, and the advantages and disadvantages of these;
4. Particular challenges posed by the bibliographic/library context, including the number of languages to consider – there are nearly 500 MARC language codes – and the fact that the data to be analysed are text strings extracted from catalogue records, often less than 200 characters in length;
5. The statistical model developed as part of the project, focusing on features of the model which make it appropriate for its intended usage, rather than on the details of statistical analysis or computer programming;
6. Suggested skills, techniques and tools applicable to projects of this nature;
7. An overview of lessons learnt, and the implications that these might have for future work in the same area, or for other metadata enhancement projects.
Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records
Victoria Morris
Metadata Standards Team, British Library
8th September 2020
3. Language identification
• Determining the natural language in which a given piece of text is written
• Texts analysed are referred to as documents
• Analysis of short documents received relatively little attention until …
4. The problem
• Most records include information about the language of content …
• … but some (~ 4.6 million) do not.
[Chart: Language of Content, 2013–2017 – Foundation catalogues, Integrated Catalogue, Annual Production]
5. Possible approaches?
Linguistic modelling:
• Analysis of grammatical structure (morphological properties of nouns, verbs, adjectives, etc.)
• Analysis of alphabets, diacritics etc.
Statistical models:
• Analysis of features without semantics
• Features may be words, character n-grams (sequences of n adjacent characters) or word n-grams (sketched below)
[Slide annotations: Realistic; Complex; Can be done on a PC; A bit naïve?]
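To make the feature types concrete, here is a minimal Python sketch (illustrative only, not the project's code) of extracting word and character n-gram features from a title string:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams: every sequence of n adjacent characters in the string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def words(text):
    """Word features: a simple lowercase whitespace split."""
    return text.lower().split()

title = "Geschichte der deutschen Literatur"
print(words(title))                                # ['geschichte', 'der', 'deutschen', 'literatur']
print(Counter(char_ngrams(title)).most_common(5))  # the five most frequent character trigrams
```

Character n-grams are often more robust than whole words for very short texts, since even a title of under 200 characters yields many overlapping trigram features.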
6. Statistical models
• Build a model using records where we do have language information
• Compare other records to the model
• Predict (guess?) what the language is
[Illustration: ‘books’ sorted into French, Italian and Jongor buckets – “This ‘book’ is in Italian?”]
Could model any property of the metadata – not just language
7. Comprehensible example: Rank order statistics
• Rank words by frequency of occurrence:
  • In each language
  • In each document
• Documents which rank words in the same order are likely to be in the same language? (sketched below)
Doesn’t work for short documents
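The rank-order idea might be sketched as follows: build a ranked word list for each language from training text, rank the words of the document, and choose the language whose ranking is least “out of place”. This is a toy illustration only – the language profiles are invented and far too small, and the specific distance measure is an assumption – but it also hints at why the method struggles with short documents: a short title supplies very few ranked words to compare.

```python
from collections import Counter

def rank_profile(text, top=50):
    """Rank words by frequency of occurrence; rank 0 = most frequent."""
    counts = Counter(text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; smaller means the two rankings agree more closely."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile.get(word, max_penalty))
        for word, rank in doc_profile.items()
    )

# Toy 'training' text per language (invented, far too small to be realistic)
training = {
    "eng": "the history of the english language and the people who speak it",
    "fre": "la langue et la litterature de la france au moyen age",
}
lang_profiles = {lang: rank_profile(text) for lang, text in training.items()}

title = "la grammaire de la langue francaise"
doc_profile = rank_profile(title)
best = min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))
print(best)  # 'fre' for this toy example
```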
8. Incomprehensible example: Bayesian models
• Calculate the probability of a document being in a particular language, based on:
  • words
  • n-grams
A bit naïve? Features are not really independent

$\mathbb{P}(D \text{ is in language } l \text{ given that it has features } f_1 \dots f_n) \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{feature } f_i \text{ arises in language } l)$
9. The Bayesian idea
• Create a statistical model to analyse the words in the title and make a prediction about the language(s) of the content
• Dependency on a large “training set” of data to create the word–language frequency model
definitely a Hungarian word ⇒ probably a Hungarian title
12. Matrix of probabilities

$\mathbb{P}(\text{word } w_i \text{ arises in language } l) = \dfrac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$

$\mathbb{P}(\text{‘rieser’ is a German word}) = \dfrac{\text{number of occurrences of ‘rieser’ in German titles}}{\text{total number of words in all German titles}} = \dfrac{\text{number of times ‘rieser’ appears in the German bucket}}{\text{total number of words in the German bucket}}$
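A minimal sketch of how such a word–language frequency “matrix” might be built from records that already carry language codes (the training data and names here are invented for illustration; this is not the Library's code):

```python
from collections import Counter, defaultdict

# Toy training data: (title, MARC language code) pairs taken from records
# that already have language coding. Real training sets are far larger.
training_records = [
    ("Geschichte der deutschen Literatur", "ger"),
    ("Der Zauberberg", "ger"),
    ("Histoire de la litterature francaise", "fre"),
    ("La peste", "fre"),
]

# 'Matrix' of counts: word_counts[language][word] = number of occurrences
word_counts = defaultdict(Counter)
for title, lang in training_records:
    word_counts[lang].update(title.lower().split())

def word_probability(word, lang):
    """P(word arises in language) = occurrences of the word in that language's
    'bucket' divided by the total number of words in that bucket."""
    total = sum(word_counts[lang].values())
    return word_counts[lang][word] / total if total else 0.0

print(word_probability("der", "ger"))   # 2/6 for this toy data
print(word_probability("der", "fre"))   # 0.0 - 'der' never seen in the French bucket
```

Note that a word never seen in a language's bucket gets probability zero, which would wipe out the whole product in the Bayesian formula; the prediction sketch below therefore adds simple smoothing.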
14. The maths
• Naïve Bayesian probabilities
• Scary maths … but it’s all based on counting words, which computers are good at

$\mathbb{P}(D \text{ is in language } l \text{ given that it has words } w_1 \dots w_n)$
$\quad \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{word } w_i \text{ arises in language } l)$
$\quad \propto \prod_{i=1}^{n} \dfrac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$
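Continuing the toy illustration above, a prediction step might look like the sketch below. It is not the project's code: it repeats the invented training data, adds add-one (Laplace) smoothing so that unseen words do not zero the product, and sums log-probabilities rather than multiplying raw probabilities, to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

training_records = [
    ("Geschichte der deutschen Literatur", "ger"),
    ("Der Zauberberg", "ger"),
    ("Histoire de la litterature francaise", "fre"),
    ("La peste", "fre"),
]

word_counts = defaultdict(Counter)
for title, lang in training_records:
    word_counts[lang].update(title.lower().split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict_language(title):
    """Score each language by log P(l) + sum of log P(word | l); return the best."""
    words = title.lower().split()
    n_records = len(training_records)
    scores = {}
    for lang, counts in word_counts.items():
        total = sum(counts.values())
        # Prior: proportion of training records in this language
        prior = sum(1 for _, l in training_records if l == lang) / n_records
        score = math.log(prior)
        for word in words:
            # Add-one (Laplace) smoothing so unseen words don't give probability zero
            p = (counts[word] + 1) / (total + len(vocabulary))
            score += math.log(p)
        scores[lang] = score
    return max(scores, key=scores.get)

print(predict_language("Die Geschichte der Stadt"))   # 'ger' for this toy model
```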
15. How do we measure success?
• Precision: does the model predict the correct language?
• Recall: does the model find everything in a particular language?

                      Micro-averaged          Macro-averaged
                      Precision   Recall      Precision   Recall
Bayesian (words)      86.1%       65.4%       63.7%       64.1%
Bayesian (n-grams)    37.2%       17.9%       23.1%       23.1%
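The slide does not define the two averaging schemes, but the standard definitions are: micro-averaging pools true positives (TP), false positives (FP) and false negatives (FN) across all languages before computing a single precision and recall, so frequent languages dominate; macro-averaging computes precision and recall per language and takes the unweighted mean, so rare languages count equally.

$\text{Precision}_{\text{micro}} = \dfrac{\sum_l TP_l}{\sum_l (TP_l + FP_l)}, \qquad \text{Recall}_{\text{micro}} = \dfrac{\sum_l TP_l}{\sum_l (TP_l + FN_l)}$

$\text{Precision}_{\text{macro}} = \dfrac{1}{|L|} \sum_l \dfrac{TP_l}{TP_l + FP_l}, \qquad \text{Recall}_{\text{macro}} = \dfrac{1}{|L|} \sum_l \dfrac{TP_l}{TP_l + FN_l}$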
16. Assumptions
• Language codes already present within the catalogue are correct (!)
• Catalogue records are monolingual
• Title, edition and series title are in the same language as the resource
• No new languages
20. The impact
3 tranches of ~ 1 million language codes:
1. 99.7% confidence
2. 99.4% confidence
3. 99.1% confidence
Curators able to identify collection responsibilities more accurately
Researchers able to discover more texts
[Chart: Language of Content (provisional), 2013–2020 – Foundation catalogues, Integrated Catalogue, Annual Production]
21. Write-up
• Paper in Cataloging & Classification Quarterly available at: https://doi.org/10.1080/01639374.2019.1700201
• In the BL research repository at: https://bl.iro.bl.uk/work/sc/6c99ffcb-0003-477d-8a58-64cf8c45ecf5
• Source code will appear on GitHub one day … https://github.com/victoriamorris