How to build a multi-lingual text classifier?
July 24, 2019 Axel de Romblay
© 2019 CONFIDENTIAL
Motivation
WHAT FOR?
Let’s introduce some important
figures, goals & KPIs
A multi-lingual
video catalog
Dailymotion hosts hundreds of millions of videos in more than 20 languages.
Our purpose is to share the most compelling music, entertainment, news and sports content around.
Content
categorization for
a better user
experience
Why do we care at Dailymotion about being able to accurately categorize
content at scale?
• Watching interface
• Search engine
• SEO & acquisition
Criteria for a good classification
• High precision/coverage tradeoff: tag as many videos as possible with a minimum error rate
• Fast & up-to-date annotation: get updated topics/categories
• Relevance & quality: get relevant & meaningful topics/categories
• Multi-lingual annotation: tag videos in all languages
First steps
WHAT ARE THE REQUIREMENTS?
Let’s introduce the settings: language detection, video annotation using
Named Entity Linking (NEL) & unsupervised categorization of topics.
Topic annotation
for English &
French videos
Reference: https://medium.com/dailymotion/topic-annotation-automatic-algorithms-data-377079d27936
Topic annotation pipeline
Text extraction & language detection
How to detect the language?
• Polyglot Python package (based on cld2)
• Naïve Bayes classifier trained on millions of web pages for each language
• Optimized to run at scale
• Supports 196 languages
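To make the naïve Bayes idea concrete, here is a minimal sketch of a character-trigram language detector. It does not call Polyglot or cld2: the two tiny training texts, the trigram features and the add-one smoothing are assumptions chosen for illustration, not a description of how cld2 is built.

```python
import math
from collections import Counter

# Toy training corpora (made up for illustration; a real detector is trained
# on far more text per language).
TRAINING_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is "
          "watching the news about sports and music",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat "
          "regarde les nouvelles sur le sport et la musique",
}


def char_trigrams(text):
    """Return the overlapping character trigrams of a lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(corpora):
    """Count trigrams per language and collect the shared trigram vocabulary."""
    counts = {lang: Counter(char_trigrams(text)) for lang, text in corpora.items()}
    vocabulary = set().union(*counts.values())
    return counts, vocabulary


def detect_language(text, counts, vocabulary):
    """Pick the language with the highest naive Bayes log-likelihood (uniform prior)."""
    scores = {}
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        # Add-one smoothing so unseen trigrams do not zero out the score.
        scores[lang] = sum(
            math.log((lang_counts[gram] + 1) / (total + len(vocabulary)))
            for gram in char_trigrams(text)
        )
    return max(scores, key=scores.get)


counts, vocabulary = train(TRAINING_TEXT)
print(detect_language("the dog is watching the cat", counts, vocabulary))
print(detect_language("le chat regarde le chien", counts, vocabulary))
```

With so little training text the sketch only separates sentences that reuse the training words; the point is the scoring rule, not the accuracy.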
Topic generator : Named Entity Linking
with Wikidata knowledge graph
• Open source knowledge graph
• Multilingual
• Updated
• 50M interconnected entities
Preprocessing
• Standard preprocessing: stop words …
• Tokenization: detection of overlapping words.
Disambiguation
Given a word a, we choose the appropriate Wikidata entity p_a using:
• The commonness of p_a: Pr(p_a | a)
• A relatedness score between a and p_a (formula shown as an image on the slide)
Pruning
We keep relevant Wikidata entities using:
• The link probability of a word
• The coherence of the word (formula shown as an image on the slide)
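As an illustration of two of the scores named above, the sketch below computes a commonness and a link probability from anchor statistics. The definitions used (commonness as the share of an anchor's links that point to an entity, link probability as the share of a word's occurrences that are links) are the ones commonly found in entity-linking work and are an assumption here, since the slide does not spell them out. The counts and entity labels are invented.

```python
# Hypothetical anchor statistics: how often a surface form links to each entity.
ANCHOR_TO_ENTITY_COUNTS = {
    "jaguar": {"Jaguar Cars": 60, "jaguar (animal)": 30, "Jacksonville Jaguars": 10},
}
# Hypothetical corpus statistics per word:
# (times seen as a link anchor, times seen in text at all).
ANCHOR_OCCURRENCES = {"jaguar": (100, 400)}


def commonness(anchor, entity):
    """Estimate Pr(entity | anchor) as the share of the anchor's links pointing to the entity."""
    entity_counts = ANCHOR_TO_ENTITY_COUNTS.get(anchor, {})
    total = sum(entity_counts.values())
    return entity_counts.get(entity, 0) / total if total else 0.0


def link_probability(anchor):
    """Estimate how often the word is used as a link when it appears in text."""
    linked, seen = ANCHOR_OCCURRENCES.get(anchor, (0, 0))
    return linked / seen if seen else 0.0


def most_common_entity(anchor):
    """Return the candidate entity with the highest commonness, or None without candidates."""
    entity_counts = ANCHOR_TO_ENTITY_COUNTS.get(anchor, {})
    return max(entity_counts, key=entity_counts.get) if entity_counts else None


print(commonness("jaguar", "Jaguar Cars"))
print(link_probability("jaguar"))
print(most_common_entity("jaguar"))
```

Commonness alone always picks the most frequent sense, which is why the slide pairs it with a relatedness score that looks at the other words around a.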
Topic filter : feature engineering &
centrality classification problem
Training Set
Candidate topics
Features: coherence, popularity, disambiguation score, location score …
Machine Learning
Topic
categorization
Unsupervised & uncontextual topic
categorization using Wikidata
What criteria for a good categorization?
• Group the topics: get fewer classes than topics
• Get different levels of classification (hierarchy)
  ◦ Set a good number of levels
  ◦ Set good splits for each level (and avoid very small & very big classes)
• Get a relevant number of classes for each topic
  ◦ Have at least one class (coverage)
  ◦ Limit the number of classes for each topic
• Get a good label for each class: match the IAB taxonomy or at least get a Wikidata QID
• Coherence/consistency: similar classes must be at the same level (e.g. countries, teams, ...)
[Figure: example hierarchy with the labels Humans, Not Humans, Football Player, Model, Singer, RM, Cristiano Ronaldo, Rap, Rock. Count: 20 topics; count: 11+2+2=15 classes.]
Preprocessing: selecting a Wikidata subgraph
with product rules
• Wikidata: 44M topics (vertices) & 4k relations (types of edges)
• Our subgraph: 500k topics & 10 relations
From a graph of connected classes to
a hierarchy of classes
Compute a granularity measure G
It ranks all the classes and allows us to split them into different levels depending on G:
G(c) = number of leaves having a path to the class c
Compute a correlation measure C
It detects and drops correlated classes, which drastically reduces the number of classes per topic.
C(c1, c2) = (formula shown as an image on the slide)
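The granularity measure can be sketched on a toy graph as follows. The edge list is invented, and treating every node without children as a leaf (a topic) is an assumption about how the subgraph is oriented; the correlation measure C is left out because its formula is not available.

```python
from collections import defaultdict

# Toy "child -> parent" edges (made up): topics point to classes,
# classes point to broader classes.
EDGES = [
    ("Cristiano Ronaldo", "Football Player"),
    ("Adele", "Singer"),
    ("Football Player", "Human"),
    ("Singer", "Human"),
    ("Rap", "Music genre"),
    ("Rock", "Music genre"),
]


def granularity(edges):
    """Return G(c) for every class c: the number of leaves with a path to c."""
    parents = defaultdict(set)
    has_children = set()
    nodes = set()
    for child, parent in edges:
        parents[child].add(parent)
        has_children.add(parent)
        nodes.update((child, parent))
    leaves = nodes - has_children  # nodes that no other node points to

    g = defaultdict(int)
    for leaf in leaves:
        # Walk upwards from the leaf, visiting each reachable class once.
        seen, stack = set(), [leaf]
        while stack:
            node = stack.pop()
            for parent in parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        for cls in seen:
            g[cls] += 1
    return dict(g)


g = granularity(EDGES)
# Rank classes from the most general (largest G) to the most specific.
for cls in sorted(g, key=lambda c: (-g[c], c)):
    print(cls, g[cls])
```

Classes with a large G sit near the top of the hierarchy; cutting the ranking at chosen values of G is one way to obtain the levels mentioned above.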
A concrete example…
Multi-lingual
text classifier
HOW WE DO THIS?
Now we can move on to NLP algorithms :)
Where do we stand
What we have
57% of our video catalog (French + English) is annotated & categorized into different levels.
The categorization is uncontextual (it only depends on the topic and not on the video).
What we want to do
Get a contextual categorization of our video catalog… for all languages!
How we do this
1. Get a robust representation of our video catalog (multi-lingual embeddings)
2. Train a predictive model on top of it, on French & English videos only!
3. Transfer to other languages
4. Evaluate performance
Multi-label classification
Get a contextual classification for French
& English videos with sparse inputs
Pros
• Fast & accurate model
  ◦ BOW trained using DataFlow
  ◦ Top-1 accuracy = 0.9
• Low memory usage: sparse model implemented using tf.keras
Cons
• Not transferable to other languages
  ◦ BOW vocabulary for French/English only
Reference: https://medium.com/dailymotion/how-to-design-deep-learning-models-with-sparse-inputs-in-tensorflow-keras-fd5e754abec1
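The limitation of the bag-of-words (BOW) model can be seen on a small example. The sketch below builds a vocabulary from a few invented French and English titles and encodes new titles as sparse index-to-count dicts; the whitespace tokenizer and the dict representation are simplifying assumptions, not the tf.keras model from the reference.

```python
# Toy French/English titles (made up) used to build the bag-of-words vocabulary.
TRAINING_TITLES = [
    "football match highlights",
    "résumé du match de football",
    "new music video",
    "nouveau clip de musique",
]


def build_vocabulary(titles):
    """Map each distinct lowercase token to a column index."""
    vocabulary = {}
    for title in titles:
        for token in title.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_sparse_bow(title, vocabulary):
    """Encode a title as a sparse {column index: count} dict, ignoring unknown tokens."""
    vector = {}
    for token in title.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = vector.get(index, 0) + 1
    return vector


vocabulary = build_vocabulary(TRAINING_TITLES)
english = to_sparse_bow("football video", vocabulary)
korean = to_sparse_bow("축구 경기 하이라이트", vocabulary)  # "football match highlights"
print(english)
print(korean)
```

The Korean title shares no token with the French/English vocabulary, so its vector is empty and the classifier has nothing to work with. This is the gap that multi-lingual embeddings are meant to close.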
Robust multi-lingual embeddings with
BERT using a Tesla V100
Reference: https://github.com/google-research/bert
Fine-tuning on French/English videos by
adding prediction layers
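A minimal sketch of what a prediction layer on top of a sentence embedding computes in a multi-label setting: one dense layer followed by an independent sigmoid per class. The number of layers, the class names and the random weights are assumptions for illustration; only the forward pass is shown, and the embedding is a random stand-in for the encoder output. The size 768 is the hidden size of the BERT-Base multilingual checkpoints.

```python
import math
import random

EMBEDDING_SIZE = 768  # hidden size of the BERT-Base multilingual checkpoints
CLASSES = ["sports", "music", "news"]  # hypothetical category labels


def init_head(n_inputs, n_classes, seed=0):
    """Create small random weights and zero biases for one dense layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(n_inputs)] for _ in range(n_classes)]
    biases = [0.0] * n_classes
    return weights, biases


def predict_proba(embedding, weights, biases):
    """Apply the dense layer, then an independent sigmoid per class (multi-label)."""
    probabilities = []
    for class_weights, bias in zip(weights, biases):
        logit = sum(w * x for w, x in zip(class_weights, embedding)) + bias
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return probabilities


weights, biases = init_head(EMBEDDING_SIZE, len(CLASSES))
# Stand-in for a sentence embedding produced by the encoder.
embedding_rng = random.Random(1)
fake_embedding = [embedding_rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_SIZE)]
for name, probability in zip(CLASSES, predict_proba(fake_embedding, weights, biases)):
    print(name, round(probability, 3))
```

Because each class gets its own sigmoid rather than sharing a softmax, a video can be assigned several categories at once, or none.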
Some qualitative results on other
languages
• We display the most confident predictions on other languages (test set)
• To decrease the number of false positives, we can set a threshold to reach 85% precision and deduce the recall
• But how can we make sure that this threshold is the same on the test set?
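The thresholding step can be sketched as follows: on labelled validation data for one class, take the lowest score threshold whose precision reaches the target, then read off the recall. The scores and labels below are invented, and the helper names are hypothetical.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [score >= threshold for score in scores]
    true_positives = sum(1 for p, y in zip(predicted, labels) if p and y)
    n_predicted = sum(predicted)
    n_positive = sum(labels)
    precision = true_positives / n_predicted if n_predicted else 1.0
    recall = true_positives / n_positive if n_positive else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target_precision=0.85):
    """Return (threshold, precision, recall) for the lowest threshold reaching the target."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(threshold, scores, labels)
        if precision >= target_precision:
            return threshold, precision, recall
    return None  # no threshold reaches the target precision


# Hypothetical validation scores for one class and their true labels (1 = class applies).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]
print(lowest_threshold_for_precision(scores, labels))
```

Choosing the lowest qualifying threshold keeps recall as high as possible, since recall can only drop as the threshold rises. The threshold is tuned on labelled French/English data, so nothing in this procedure guarantees the same precision on other languages, which is the open question raised above.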
The next
step(s)
CONCLUSION
Let’s conclude with the next steps
First find a quantitative metric for transfer learning.
Two biases:
• Content probably depends on the language (Korean videos tend to display more news, English videos more sports...)
• BERT is supposed to align multi-lingual embeddings.
Some ideas:
• Translate some videos into English using the Google Translation API
• Apply the BOW model to get a ground truth on other languages
Then tune hyperparameters using state-of-the-art optimization (BOHB)
Reference: https://www.automl.org/blog_bohb/
Push to production environment.
• Code on GitHub (SQL / Python / TensorFlow / DataFlow) & dump models / tables (Google Cloud).
• Check that CI passes: run unit tests, style checks, code reviews…
• Build the Docker image and push it to the Quay repository
• Deploy (Kubernetes)
• Schedule the tasks (Airflow)
• Monitor (Datadog & Tableau)
Thanks to my squad & thank you!
Contact: https://www.linkedin.com/in/axel-de-romblay-6444a990/

Meetup "Paris NLP"
