Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records
Victoria Morris
Metadata Standards Team, British Library
8th September 2020
www.bl.uk
Language identification
Image from: https://www.slideshare.net/bigshum/automatic-language-identification-9937750
Language identification
• Determining the natural language in which a given piece of text is written
• Texts analysed are referred to as documents
• Analysis of short documents received relatively little attention until …
[Chart: Language of Content, 2013 to 2017, vertical axis 0 to 100. Series: Foundation catalogues, Integrated Catalogue, Annual Production.]
The problem
• Most records include information about the
language of content …
• … but some (~ 4.6 million) do not.
Possible approaches?
Linguistic modelling:
• Analysis of grammatical structure
(morphological properties of nouns, verbs,
adjectives, etc.)
• Analysis of alphabets, diacritics etc.
Statistical models
• Analysis of features without semantics
• Features may be words, character n-grams
(sequences of n adjacent characters) or
word n-grams
Linguistic modelling: realistic, but complex
Statistical models: can be done on a PC, but a bit naïve?
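As a rough illustration of the features mentioned above, the following sketch extracts character n-grams and word n-grams from a title. The function names and the sample title are assumptions made for this example; the slides do not say how features were actually extracted.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every sequence of n adjacent characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return every sequence of n adjacent whitespace-separated words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


title = "little red riding hood"  # sample title, for illustration only
print(char_ngrams("hood", 3))     # ['hoo', 'ood']
print(word_ngrams(title, 2))      # [('little', 'red'), ('red', 'riding'), ('riding', 'hood')]
```

Neither kind of feature needs any knowledge of what the words mean, which is why this sort of analysis is cheap to run.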
Statistical models
• Build a model using records where we
do have language information
• Compare other records to the model
• Predict (guess?) what the language is
[Diagram labels: French, Italian, Jongor. Caption: This ‘book’ is in Italian?]
Could model any property of the
metadata – not just language
Comprehensible example: Rank order statistics
• Rank words by frequency of occurrence:
• In each language
• In each document
• Documents which rank words in the same
order are likely to be in the same language?
Doesn’t work for short documents
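One possible reading of the rank-order idea is sketched below: words are ranked by frequency in each language and in the document, and the language whose ranking differs least from the document's is chosen. The distance used here follows the "out-of-place" measure from Cavnar and Trenkle's n-gram text categorization work, applied to words. That choice, the function names and the toy training text are all assumptions for illustration, not the method used in the slides.

```python
from collections import Counter


def rank_profile(text: str, top: int = 300) -> dict[str, int]:
    """Map each of the `top` most frequent words to its rank (0 = most frequent)."""
    counts = Counter(text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: rank for rank, word in enumerate(ordered[:top])}


def out_of_place(doc: dict[str, int], lang: dict[str, int]) -> int:
    """Sum of rank differences; words missing from the language get a maximum penalty."""
    penalty = len(lang)
    return sum(abs(rank - lang[w]) if w in lang else penalty for w, rank in doc.items())


def closest_language(text: str, profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose word ranking is closest to the document's."""
    doc = rank_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Toy training text, invented for illustration only.
profiles = {
    "English": rank_profile("the cat and the dog and the bird the end"),
    "French": rank_profile("le chat et le chien et le oiseau la fin"),
}
print(closest_language("the dog and the cat", profiles))  # English
```

With only a handful of words, almost every word in a document occurs once, so the ranking carries little information. That is consistent with the remark that the approach does not work for short documents.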
Incomprehensible example: Bayesian models
• Calculate the probability of a document being in a particular language, based on:
• words
• n-grams
\[
\mathbb{P}(D \text{ is in language } l \mid f_1, \dots, f_n) \;\propto\; \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{feature } f_i \text{ arises in language } l)
\]
A bit naïve? Features are not really independent
The Bayesian idea
• Create a statistical model to analyse the words in the title and make a prediction
about the language(s) of the content
• Depends on a large “training set” of data to build the word-language frequency model
definitely a Hungarian word ⇒ probably a Hungarian title
Buckets of words
[Diagram: buckets of words labelled English, Hungarian and Volapuk. Words shown: red, riding, hood, little, nehéz, szerelem, regénye, lira, poedotas, flolemil, nugänik]
Bucket of words
Matrix of probabilities
\[
\mathbb{P}(\text{word } w_i \text{ arises in language } l) = \frac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}
\]
\[
\mathbb{P}(\text{'rieser' is a German word}) = \frac{\text{number of occurrences of 'rieser' in German titles}}{\text{total number of words in all German titles}} = \frac{\text{number of times 'rieser' appears in the German bucket}}{\text{total number of words in the German bucket}}
\]
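The bucket counts can be sketched in a few lines. This is a minimal illustration that assumes titles are split on whitespace and lower-cased. The sample titles, the `titles_by_language` structure and the `word_probability` helper are hypothetical and do not come from the catalogue.

```python
from collections import Counter

# Toy "buckets": every word from the titles already coded with each language.
# The titles are chosen for illustration only.
titles_by_language = {
    "English": ["little red riding hood", "the little prince"],
    "Hungarian": ["a nehéz szerelem regénye"],
}

buckets = {
    lang: Counter(word for title in titles for word in title.lower().split())
    for lang, titles in titles_by_language.items()
}


def word_probability(word: str, lang: str) -> float:
    """Occurrences of `word` in the language's bucket / total words in that bucket."""
    bucket = buckets[lang]
    return bucket[word] / sum(bucket.values())


print(word_probability("little", "English"))    # 2 of the 7 words in the English bucket
print(word_probability("little", "Hungarian"))  # 0.0, never seen in this bucket
```

Evaluating `word_probability` for every word and every language gives one cell of the matrix of probabilities per (word, language) pair.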
Matrix of probabilities
The maths
• Naïve Bayesian probabilities
• Scary maths … but it’s all based on counting words, which computers are good at
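\[
\mathbb{P}(D \text{ is in language } l \mid w_1, \dots, w_n) \;\propto\; \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{word } w_i \text{ arises in language } l) \;\propto\; \prod_{i=1}^{n} \frac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}
\]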
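A minimal sketch of the counting behind this product follows. It makes three assumptions that are not stated in the slides: every language is treated as equally likely beforehand, logarithms are summed instead of multiplying small probabilities, and add-one smoothing stops an unseen word from zeroing the whole product. The training titles and function names are hypothetical.

```python
import math
from collections import Counter

# Toy training titles, chosen for illustration only.
training = {
    "English": ["little red riding hood", "the little prince"],
    "Hungarian": ["a nehéz szerelem regénye", "a kis herceg"],
}
buckets = {
    lang: Counter(w for title in titles for w in title.lower().split())
    for lang, titles in training.items()
}
vocabulary = {w for bucket in buckets.values() for w in bucket}


def log_score(title: str, lang: str) -> float:
    """Sum of log word probabilities, with add-one smoothing (an assumption)."""
    bucket = buckets[lang]
    total = sum(bucket.values()) + len(vocabulary)
    return sum(math.log((bucket[w] + 1) / total) for w in title.lower().split())


def predict(title: str) -> str:
    """Return the language whose bucket makes the title's words most probable."""
    return max(buckets, key=lambda lang: log_score(title, lang))


print(predict("a szerelem"))      # Hungarian
print(predict("the red prince"))  # English
```

Everything here reduces to counting words in buckets, which is the point made on the slide.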
How do we measure success?
• Precision: does the model predict the correct language?
• Recall: does the model find everything in a particular language?
Model                 Micro-averaged         Macro-averaged
                      Precision   Recall     Precision   Recall
Bayesian (words)      86.1%       65.4%      63.7%       64.1%
Bayesian (n-grams)    37.2%       17.9%      23.1%       23.1%
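The difference between the two kinds of average can be sketched as follows. Micro-averaging pools the counts across all languages, so common languages dominate. Macro-averaging computes a score per language and then averages those scores, so rare languages count equally. The sketch assumes the model may decline to predict (represented as `None`); this is an assumption, as are the language codes and sample data.

```python
from collections import Counter


def precision_recall(actual: list[str], predicted: list):
    """Return ((micro_p, micro_r), (macro_p, macro_r)); None means "no prediction"."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if p == a:
            tp[a] += 1          # correct prediction
        else:
            fn[a] += 1          # the true language was missed
            if p is not None:
                fp[p] += 1      # a wrong language was claimed
    langs = set(actual) | {p for p in predicted if p is not None}

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    all_tp = sum(tp.values())
    micro = (ratio(all_tp, all_tp + sum(fp.values())),
             ratio(all_tp, all_tp + sum(fn.values())))
    macro = (sum(ratio(tp[lang], tp[lang] + fp[lang]) for lang in langs) / len(langs),
             sum(ratio(tp[lang], tp[lang] + fn[lang]) for lang in langs) / len(langs))
    return micro, macro


# Invented sample: five records, one left without a prediction.
actual = ["eng", "eng", "eng", "fre", "hun"]
predicted = ["eng", "eng", None, "eng", "hun"]
micro, macro = precision_recall(actual, predicted)
print(micro)  # micro-averaged (precision, recall)
print(macro)  # macro-averaged (precision, recall)
```

In this sample the single French record is misclassified, which pulls the macro averages down further than the micro averages because French counts as a full third of the macro score.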
Assumptions
• Language codes already present within the catalogue are correct (!)
• Catalogue records are monolingual
• Title, edition and series title are in the same language as the resource
• No new languages
Latin? English?
English? Latvian?
Words can belong to more than one language
Language of title = language of content?
Results
• Shaded cells on the diagonal = correct predictions
The impact
[Chart: Language of Content (provisional), 2013 to 2020, vertical axis 0 to 120. Series: Foundation catalogues, Integrated Catalogue, Annual Production.]
3 tranches of ~ 1 million language codes:
1. 99.7% confidence
2. 99.4% confidence
3. 99.1% confidence
Curators able to identify collection
responsibilities more accurately
Researchers able to discover more texts
Write-up
• Paper in Cataloging & Classification Quarterly available at:
https://doi.org/10.1080/01639374.2019.1700201
• In the BL research repository at: https://bl.iro.bl.uk/work/sc/6c99ffcb-0003-477d-8a58-64cf8c45ecf5
• Source code will appear on GitHub one day … https://github.com/victoriamorris
Questions?
victoria.morris@bl.uk

More Related Content

Similar to Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records

Corpus study design
Corpus study designCorpus study design
Corpus study designbikashtaly
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingYasir Khan
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Lviv Data Science Summer School
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsRoelof Pieters
 
Mtvectorspace 161101214722
Mtvectorspace 161101214722Mtvectorspace 161101214722
Mtvectorspace 161101214722LinkedIn
 
Mtvectorspace 161101214722
Mtvectorspace 161101214722Mtvectorspace 161101214722
Mtvectorspace 161101214722LinkedIn
 
Using corpora in instruction
Using corpora in instructionUsing corpora in instruction
Using corpora in instructionJonathan Smart
 
5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 SVTaylor123
 
5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf  5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf SVTaylor123
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana
 
CLIL + selections + brainwave 2013
CLIL + selections + brainwave 2013CLIL + selections + brainwave 2013
CLIL + selections + brainwave 2013Majid Safadaran
 
Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic Andrej Muhic
 
Building Reading Fluency Through Blended & Flipped Learning
Building Reading Fluency Through Blended & Flipped LearningBuilding Reading Fluency Through Blended & Flipped Learning
Building Reading Fluency Through Blended & Flipped LearningSaint Michael's College
 
5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 SVTaylor123
 

Similar to Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records (20)

Corpus study design
Corpus study designCorpus study design
Corpus study design
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Linguascope2018
Linguascope2018Linguascope2018
Linguascope2018
 
Mtvectorspace 161101214722
Mtvectorspace 161101214722Mtvectorspace 161101214722
Mtvectorspace 161101214722
 
Mtvectorspace 161101214722
Mtvectorspace 161101214722Mtvectorspace 161101214722
Mtvectorspace 161101214722
 
Using corpora in instruction
Using corpora in instructionUsing corpora in instruction
Using corpora in instruction
 
5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 5810 day 3 sept 20 2014
5810 day 3 sept 20 2014
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf  5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
 
Wordpower for ks3
Wordpower for ks3Wordpower for ks3
Wordpower for ks3
 
Intro
IntroIntro
Intro
 
Intro
IntroIntro
Intro
 
CLIL + selections + brainwave 2013
CLIL + selections + brainwave 2013CLIL + selections + brainwave 2013
CLIL + selections + brainwave 2013
 
Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic
 
intro.ppt
intro.pptintro.ppt
intro.ppt
 
Building Reading Fluency Through Blended & Flipped Learning
Building Reading Fluency Through Blended & Flipped LearningBuilding Reading Fluency Through Blended & Flipped Learning
Building Reading Fluency Through Blended & Flipped Learning
 
5810 day 3 sept 20 2014
5810 day 3 sept 20 2014 5810 day 3 sept 20 2014
5810 day 3 sept 20 2014
 

More from CILIP MDG

UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...
UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...
UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...CILIP MDG
 
Challenges to implementation - Jenny Wright
Challenges to implementation - Jenny WrightChallenges to implementation - Jenny Wright
Challenges to implementation - Jenny WrightCILIP MDG
 
Application Profiles in RDA - Jenny Wright
Application Profiles in RDA - Jenny WrightApplication Profiles in RDA - Jenny Wright
Application Profiles in RDA - Jenny WrightCILIP MDG
 
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan Young
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan YoungThe Official RDA Toolkit - Opportunities for Efficiency - Thurstan Young
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan YoungCILIP MDG
 
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan Youing
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan YouingThe Official RDA Toolkit - Opportunities for Enrichment - Thurstan Youing
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan YouingCILIP MDG
 
UKCoR RDA Day 2023 - "Only" Connect
UKCoR RDA Day 2023 - "Only" ConnectUKCoR RDA Day 2023 - "Only" Connect
UKCoR RDA Day 2023 - "Only" ConnectCILIP MDG
 
RDA methods, scenarios, tools - Gordon Dunsire
RDA methods, scenarios, tools - Gordon DunsireRDA methods, scenarios, tools - Gordon Dunsire
RDA methods, scenarios, tools - Gordon DunsireCILIP MDG
 
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...Poster: What’s in a name? Re-Discovering cataloguing and index through metada...
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...CILIP MDG
 
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...CILIP MDG
 
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...CILIP MDG
 
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...CILIP MDG
 
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...Poster: Updating the Wessex Classification Scheme for UK health libraries : a...
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...CILIP MDG
 
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...Revamping in-house cataloguing training / Victoria Parkinson (King's College ...
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...CILIP MDG
 
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...CILIP MDG
 
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...CILIP MDG
 
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...CILIP MDG
 
RDA implementation at the British Library / Thurstan Young (British Library)
RDA implementation at the British Library / Thurstan Young (British Library)RDA implementation at the British Library / Thurstan Young (British Library)
RDA implementation at the British Library / Thurstan Young (British Library)CILIP MDG
 
Community forward : developing descriptive cataloguing of rare materials (RDA...
Community forward : developing descriptive cataloguing of rare materials (RDA...Community forward : developing descriptive cataloguing of rare materials (RDA...
Community forward : developing descriptive cataloguing of rare materials (RDA...CILIP MDG
 
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...The West Midlands Evidence Repository (WMER) : a regional collaboration proje...
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...CILIP MDG
 
Authority of assertion in repository contributions to the PID graph / George ...
Authority of assertion in repository contributions to the PID graph / George ...Authority of assertion in repository contributions to the PID graph / George ...
Authority of assertion in repository contributions to the PID graph / George ...CILIP MDG
 

More from CILIP MDG (20)

UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...
UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...
UK Committee on RDA, RDA Day: New Tools for the Future of Cataloguing - Jenny...
 
Challenges to implementation - Jenny Wright
Challenges to implementation - Jenny WrightChallenges to implementation - Jenny Wright
Challenges to implementation - Jenny Wright
 
Application Profiles in RDA - Jenny Wright
Application Profiles in RDA - Jenny WrightApplication Profiles in RDA - Jenny Wright
Application Profiles in RDA - Jenny Wright
 
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan Young
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan YoungThe Official RDA Toolkit - Opportunities for Efficiency - Thurstan Young
The Official RDA Toolkit - Opportunities for Efficiency - Thurstan Young
 
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan Youing
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan YouingThe Official RDA Toolkit - Opportunities for Enrichment - Thurstan Youing
The Official RDA Toolkit - Opportunities for Enrichment - Thurstan Youing
 
UKCoR RDA Day 2023 - "Only" Connect
UKCoR RDA Day 2023 - "Only" ConnectUKCoR RDA Day 2023 - "Only" Connect
UKCoR RDA Day 2023 - "Only" Connect
 
RDA methods, scenarios, tools - Gordon Dunsire
RDA methods, scenarios, tools - Gordon DunsireRDA methods, scenarios, tools - Gordon Dunsire
RDA methods, scenarios, tools - Gordon Dunsire
 
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...Poster: What’s in a name? Re-Discovering cataloguing and index through metada...
Poster: What’s in a name? Re-Discovering cataloguing and index through metada...
 
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...
Poster: Revamping our in-house cataloguing training / Victoria Parkinson (Kin...
 
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...
Poster: FAST : can it lighten the load, and what is the impact? / Jenny Wrigh...
 
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...
Poster: The West Midlands Evidence Repository (WMER) : a regional collaborati...
 
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...Poster: Updating the Wessex Classification Scheme for UK health libraries : a...
Poster: Updating the Wessex Classification Scheme for UK health libraries : a...
 
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...Revamping in-house cataloguing training / Victoria Parkinson (King's College ...
Revamping in-house cataloguing training / Victoria Parkinson (King's College ...
 
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...
UK NACO funnel : progress, obstacles, and solutions / Martin Kelleher (Univer...
 
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...
Ship[w]right[e]s? : the challenges of cataloguing reports from scientific exp...
 
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...
BFI Reuben Library : an RDA implementation story / Anastasia Kerameos (BFI Re...
 
RDA implementation at the British Library / Thurstan Young (British Library)
RDA implementation at the British Library / Thurstan Young (British Library)RDA implementation at the British Library / Thurstan Young (British Library)
RDA implementation at the British Library / Thurstan Young (British Library)
 
Community forward : developing descriptive cataloguing of rare materials (RDA...
Community forward : developing descriptive cataloguing of rare materials (RDA...Community forward : developing descriptive cataloguing of rare materials (RDA...
Community forward : developing descriptive cataloguing of rare materials (RDA...
 
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...The West Midlands Evidence Repository (WMER) : a regional collaboration proje...
The West Midlands Evidence Repository (WMER) : a regional collaboration proje...
 
Authority of assertion in repository contributions to the PID graph / George ...
Authority of assertion in repository contributions to the PID graph / George ...Authority of assertion in repository contributions to the PID graph / George ...
Authority of assertion in repository contributions to the PID graph / George ...
 

Recently uploaded

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records

  • 1. Closing the language gap: developing machine learning tools to detect the language of legacy catalogue records • Victoria Morris, Metadata Standards Team, British Library • 8th September 2020
  • 2. www.bl.uk Language identification • Image from: https://www.slideshare.net/bigshum/automatic-language-identification-9937750
  • 3. Language identification • Determining the natural language in which a given piece of text is written • Texts analysed are referred to as documents • Analysis of short documents received relatively little attention until …
  • 4. The problem • Most records include information about the language of content … • … but some (~ 4.6 million) do not. [Chart: Language of Content, 2013 to 2017, vertical axis 0 to 100; series: Foundation catalogues, Integrated Catalogue, Annual Production]
  • 5. Possible approaches? Linguistic modelling: • Analysis of grammatical structure (morphological properties of nouns, verbs, adjectives, etc.) • Analysis of alphabets, diacritics etc. Statistical models: • Analysis of features without semantics • Features may be words, character n-grams (sequences of n adjacent characters) or word n-grams. Slide annotations: Realistic; Complex; Can be done on a PC; A bit naïve?
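
The feature types named on slide 5 can be illustrated with a minimal Python sketch. The function names, the lower-casing and the space padding at word boundaries are assumptions made for illustration; the slides do not say how features were extracted.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return every sequence of n adjacent characters in text, in order."""
    # Assumption: lower-case the text and pad with one space on each side
    # so that the start and end of the text also produce features.
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Return every sequence of n adjacent words in text, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Features are counted without any reference to their meaning.
trigram_counts = Counter(char_ngrams("little red riding hood"))
bigram_counts = Counter(word_ngrams("little red riding hood"))
```

Either counter could then serve as the feature set for a statistical model; single words are simply the n = 1 case of `word_ngrams`.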
  • 6. Statistical models • Build a model using records where we do have language information • Compare other records to the model • Predict (guess?) what the language is • Could model any property of the metadata – not just language [Diagram labels: French, Italian, Jongor; caption: This ‘book’ is in Italian?]
  • 7. Comprehensible example: Rank order statistics • Rank words by frequency of occurrence: • In each language • In each document • Documents which rank words in the same order are likely to be in the same language? • Doesn’t work for short documents
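
A rough sketch of the rank-order idea on slide 7 follows. The comparison measure used here, an "out-of-place" sum of rank differences, is one common choice and is an assumption; the slides do not say which measure was tried. The training titles, the language codes and all function names are toy values made up for illustration.

```python
from collections import Counter


def rank_profile(texts: list[str], top: int = 300) -> dict[str, int]:
    """Map each of the most frequent words to its rank (0 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered[:top])}


def out_of_place(doc: dict[str, int], lang: dict[str, int]) -> int:
    """Sum of rank differences; words unknown to the language get a fixed penalty."""
    penalty = len(lang)
    return sum(abs(rank - lang[w]) if w in lang else penalty
               for w, rank in doc.items())


def guess_language(text: str, profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose rank profile is closest to the document's."""
    doc = rank_profile([text])
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))


# Toy training data (hypothetical titles, not taken from the catalogue).
training = {
    "eng": ["the little red hood", "the history of the red house"],
    "fre": ["le petit chaperon rouge", "histoire de la maison rouge et le jardin"],
}
profiles = {name: rank_profile(titles) for name, titles in training.items()}
```

With a title of only a few words, nearly every word occurs once and the ranking inside the document is decided by tie-breaking rather than by frequency, which is consistent with the slide's remark that the approach does not work for short documents.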
  • 8. Incomprehensible example: Bayesian models • Calculate the probability of a document being in a particular language, based on: • words • n-grams • A bit naïve? Features are not really independent
    $$\mathbb{P}(D \text{ is in language } l \mid D \text{ has features } f_1, \dots, f_n) \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{feature } f_i \text{ arises in language } l)$$
  • 9. The Bayesian idea • Create a statistical model to analyse the words in the title and make a prediction about the language(s) of the content • Dependency on large “training set” of data to create word-language frequency model • Slide annotation: definitely a Hungarian word ⇒ probably a Hungarian title
  • 10. Buckets of words [Diagram: words sorted into buckets labelled English, Hungarian and Volapuk; words shown: red, riding, hood, little, nehéz, szerelem, regénye, lira, poedotas, flolemil, nugänik]
  • 12. Matrix of probabilities
    $$\mathbb{P}(\text{word } w_i \text{ arises in language } l) = \frac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$$
    $$\mathbb{P}(\text{‘rieser’ is a German word}) = \frac{\text{number of occurrences of ‘rieser’ in German titles}}{\text{total number of words in all German titles}} = \frac{\text{number of times ‘rieser’ appears in the German bucket}}{\text{total number of words in the German bucket}}$$
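
The word-language probability matrix on slide 12 amounts to counting words per bucket and dividing by the bucket total. A minimal sketch, assuming whitespace tokenisation and lower-casing (neither is stated in the slides); the bucket contents and the name `word_probabilities` are toy values for illustration only.

```python
from collections import Counter


def word_probabilities(buckets: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """For each language: P(word | language) = count of word / all words in the bucket."""
    matrix: dict[str, dict[str, float]] = {}
    for language, titles in buckets.items():
        counts = Counter(word for title in titles for word in title.lower().split())
        total = sum(counts.values())
        matrix[language] = {word: n / total for word, n in counts.items()}
    return matrix


# Toy buckets of titles whose language is already known (hypothetical data).
buckets = {
    "eng": ["little red riding hood", "little women"],
    "hun": ["nehéz szerelem", "szerelem regénye"],
}
matrix = word_probabilities(buckets)
```

Each inner dictionary is one row of the matrix and sums to 1; a word that never occurs in a bucket is simply absent from that row.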
  • 14. The maths • Naïve Bayesian probabilities • Scary maths … but it’s all based on counting words, which computers are good at
    $$\mathbb{P}(D \text{ is in language } l \mid D \text{ has words } w_1, \dots, w_n) \propto \mathbb{P}(D \text{ is in language } l) \prod_{i=1}^{n} \mathbb{P}(\text{word } w_i \text{ arises in language } l) \propto \prod_{i=1}^{n} \frac{\text{number of occurrences of word } w_i \text{ in language } l}{\text{number of occurrences of all words in language } l}$$
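
The naïve Bayesian product on slide 14 can be sketched in a few lines of Python. Two details below are assumptions that do not appear in the slides: the product is computed as a sum of logarithms to avoid numerical underflow, and add-one smoothing is applied so that a word missing from a bucket does not force the whole product to zero. The training titles and function names are hypothetical.

```python
import math
from collections import Counter


def train(buckets: dict[str, list[str]]):
    """Count words per language and estimate the prior from the number of titles."""
    counts = {lang: Counter(w for t in titles for w in t.lower().split())
              for lang, titles in buckets.items()}
    vocabulary = {w for c in counts.values() for w in c}
    n_titles = sum(len(titles) for titles in buckets.values())
    priors = {lang: len(titles) / n_titles for lang, titles in buckets.items()}
    return counts, vocabulary, priors


def log_score(title: str, lang: str, counts, vocabulary, priors) -> float:
    """log P(lang) + sum over the title's words of log P(word | lang)."""
    total = sum(counts[lang].values())
    score = math.log(priors[lang])
    for word in title.lower().split():
        # Assumption: add-one (Laplace) smoothing, not mentioned in the slides.
        score += math.log((counts[lang][word] + 1) / (total + len(vocabulary)))
    return score


def predict(title: str, counts, vocabulary, priors) -> str:
    """Return the language with the highest score for this title."""
    return max(counts, key=lambda lang: log_score(title, lang, counts, vocabulary, priors))


# Toy training set (hypothetical titles with known language codes).
buckets = {
    "eng": ["little red riding hood", "the little house"],
    "hun": ["nehéz szerelem", "szerelem regénye"],
}
counts, vocabulary, priors = train(buckets)
```

Because only the ranking of languages matters, the proportionality constant in the formula never has to be computed.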
  • 15. How do we measure success? • Precision: does the model predict the correct language? • Recall: does the model find everything in a particular language?

                            Micro-averaged          Macro-averaged
      Model                 Precision   Recall      Precision   Recall
      Bayesian (words)      86.1%       65.4%       63.7%       64.1%
      Bayesian (n-grams)    37.2%       17.9%       23.1%       23.1%
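
The difference between micro- and macro-averaging on slide 15 can be made concrete with a small sketch. It assumes that the model may decline to predict (represented as `None`), which is one way micro-averaged precision and recall can differ; the slides do not confirm this. Treating a zero denominator as 0.0 is likewise a convention chosen here, and the example data is made up.

```python
def precision_recall(actual: list[str], predicted: list) -> dict[str, float]:
    """Micro- and macro-averaged precision and recall for single-label predictions."""
    languages = sorted(set(actual) | {p for p in predicted if p is not None})
    tp = dict.fromkeys(languages, 0)  # predicted l, and l was correct
    fp = dict.fromkeys(languages, 0)  # predicted l, but l was wrong
    fn = dict.fromkeys(languages, 0)  # record is in l, but l was not predicted
    for a, p in zip(actual, predicted):
        if p == a:
            tp[a] += 1
        else:
            fn[a] += 1
            if p is not None:  # an abstention is not a false positive
                fp[p] += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0  # convention: undefined ratios count as 0

    n = len(languages)
    return {
        # Micro: pool the counts over all languages, then divide once.
        "micro_precision": ratio(sum(tp.values()), sum(tp.values()) + sum(fp.values())),
        "micro_recall": ratio(sum(tp.values()), sum(tp.values()) + sum(fn.values())),
        # Macro: compute the ratio per language, then take the plain average.
        "macro_precision": sum(ratio(tp[l], tp[l] + fp[l]) for l in languages) / n,
        "macro_recall": sum(ratio(tp[l], tp[l] + fn[l]) for l in languages) / n,
    }


# Made-up example: six records, two of which receive no prediction.
actual = ["eng", "eng", "eng", "fre", "fre", "hun"]
predicted = ["eng", "eng", None, "fre", "eng", None]
scores = precision_recall(actual, predicted)
```

Micro-averaging is dominated by the languages with the most records, while macro-averaging gives every language equal weight, so rarely seen languages pull the macro figures down.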
  • 16. Assumptions • Language codes already present within the catalogue are correct (!) • Catalogue records are monolingual • Title, edition and series title are in the same language as the resource • No new languages
  • 17. Words can belong to more than one language [Image annotations: Latin? English? English? Latvian?]
  • 18. Language of title = language of content?
  • 19. Results • Shaded cells on the diagonal = correct predictions
  • 20. The impact • 3 tranches of ~ 1 million language codes: 1. 99.7% confidence 2. 99.4% confidence 3. 99.1% confidence • Curators able to identify collection responsibilities more accurately • Researchers able to discover more texts [Chart: Language of Content (provisional), 2013 to 2020, vertical axis 0 to 120; series: Foundation catalogues, Integrated Catalogue, Annual Production]
  • 21. Write-up • Paper in Cataloging & Classification Quarterly available at: https://doi.org/10.1080/01639374.2019.1700201 • In the BL research repository at: https://bl.iro.bl.uk/work/sc/6c99ffcb-0003-477d-8a58-64cf8c45ecf5 • Source code will appear on GitHub one day … https://github.com/victoriamorris