NLP in the WILD
-or-
Building a System for
Text Language Identification
Vsevolod Dyomkin
12/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Roles
Langid Problem
* 150+ langs in Wikipedia
* >10 writing systems
(script/alphabet) in active use
* script-lang: 1:1, 1:2, 1:n, n:1 :)
* Latin >50 langs, Cyrillic >20
* Long texts: easy; short: hmm
* Internet texts (mixed langs)
* Small task => resource-constrained
Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-identification-performance
Prior Art
* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java: 
https://github.com/shuyo/language-detection/
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages, always evolving
* Wanted to do in Lisp
Linguistics
(domain knowledge)
* Polyglots?
* ISO 639
* Internet lang bias
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
* Rule-based ideas:
- 1:1/1:2 scripts
- unique letters
* Per-script/per-lang segmentation
insight
data
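The rule-based ideas above (1:1 scripts, unique letters) can be sketched roughly as follows. This is only an illustration in Python, not the talk's Lisp implementation: the function names, the `SINGLE_LANG_SCRIPTS` table and the `UNIQUE_LETTERS` table are assumptions made for the example, and a real table of unique letters would need careful verification.

```python
import unicodedata
from collections import Counter

# Assumption: scripts treated as 1:1 with a language, for illustration only.
SINGLE_LANG_SCRIPTS = {"GREEK": "el", "ARMENIAN": "hy", "GEORGIAN": "ka", "THAI": "th"}

# Assumption: illustrative "near-unique" letters; not an exhaustive or verified list.
UNIQUE_LETTERS = {"ї": "uk", "ў": "be"}


def char_script(ch):
    """Rough script label: the first word of the Unicode character name."""
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the Unicode database
        return None


def dominant_script(text):
    """Most frequent script among the letters of the text, or None."""
    counts = Counter(s for s in map(char_script, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def rule_based_guess(text):
    """Return a language code if a rule fires, else None (defer to a model)."""
    for ch in text.lower():
        if ch in UNIQUE_LETTERS:
            return UNIQUE_LETTERS[ch]
    return SINGLE_LANG_SCRIPTS.get(dominant_script(text))
```

When no rule fires (for example, any Latin-script text), the function returns None, which is where a statistical model per script would take over.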
Data
* evaluation data:
- smoke test
- in-/out-of-domain data
- precision-/recall-oriented
* training data
- where to get? Wikidata
- how to get? SAX parsing
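As a rough sketch of the SAX approach: a streaming handler never holds the whole dump in memory, which matters for multi-gigabyte files. The example below is Python rather than the talk's Lisp, and it assumes a layout in which each abstract sits in an `<abstract>` element; the element name and the `extract_abstracts` helper are assumptions for illustration.

```python
import xml.sax


class AbstractHandler(xml.sax.ContentHandler):
    """Collects the text of every <abstract> element while streaming."""

    def __init__(self):
        super().__init__()
        self.abstracts = []
        self._buf = None  # None means "not inside an <abstract>"

    def startElement(self, name, attrs):
        if name == "abstract":
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "abstract" and self._buf is not None:
            text = "".join(self._buf).strip()
            if text:  # skip empty abstracts
                self.abstracts.append(text)
            self._buf = None


def extract_abstracts(xml_bytes):
    """Parse an in-memory XML document and return the list of abstracts."""
    handler = AbstractHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.abstracts
```

For a real dump file, `xml.sax.parse(path, handler)` would stream from disk in the same way.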
Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms,…)
* ~100 langs
Wikipedia
* >150 langs
* size? Wikipedia abstracts
* automation?
* filtering?
Alternatives
* API
;; Uses rutils shorthand: ^(...) is a lambda over %, ? is a generic accessor,
;; fmt is (format nil ...).
(defun get-examples (word)
  (remove-if-not
   ^(upper-case-p (char % 0))
   (mapcar ^(substitute #\Space #\_ (? % "text"))
           (? (yason:parse
               (drakma:http-request
                (fmt "http://api.wordnik.com/v4/word.json/~A/examples"
                     (drakma:url-encode word :utf-8))
                :additional-headers *wordnik-auth-headers*))
              "examples"))))
* Web scraping
;; ($ name) captures the matched text; |...| marks parts elided on the slide.
(defmethod scrape ((site (eql :linguaholic)) source)
  (match-html source
              '(>> article
                (aside (>> a ($ user))
                       (>> li (strong "Native Tongue:") ($ lang)))
                (div |...| (>> (div :data-role "commentContent")
                               ($ text) (span) |...|))
                !!!)))
Research
(quality)
* Simple task => simple models (NB)
* Challenges
- short texts
- mixed langs
- 90% of data - cryptic
ideas
experiments
Naive Bayes
* Features: 3-/4-char ngrams
* Improvement ideas:
- add words (word unigrams)
- factor in word lengths
- use Internet lang bias
Formula:
(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* ^(? 3g-probs %)
                             (word-3gs word)))))
        langs)
http://www.paulgraham.com/spam.html
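A minimal Python sketch of the formula above, under stated assumptions: the score of a language is its prior times, for each word, either the word's own probability or a normalized product of its character-trigram probabilities. The slide does not define `norm`, so this sketch assumes it means averaging the trigram log-probabilities over the word; the class name `TinyLangid`, the `floor` value for unseen trigrams and the space padding of words are likewise illustrative choices, not the talk's implementation.

```python
import math
from collections import Counter, defaultdict


def word_3gs(word):
    """Character trigrams of a word padded with spaces on both sides."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TinyLangid:
    def __init__(self, floor=1e-6):
        self.floor = floor      # assumed probability for unseen trigrams
        self.priors = {}        # lang -> P(lang)
        self.word_probs = {}    # lang -> {word: P(word | lang)}
        self.tri_probs = {}     # lang -> {trigram: P(trigram | lang)}

    def train(self, samples):
        """samples: iterable of (lang, text) pairs."""
        words, tris, docs = defaultdict(Counter), defaultdict(Counter), Counter()
        for lang, text in samples:
            docs[lang] += 1
            for w in text.lower().split():
                words[lang][w] += 1
                tris[lang].update(word_3gs(w))
        total = sum(docs.values())
        for lang in docs:
            self.priors[lang] = docs[lang] / total
            wn = sum(words[lang].values())
            tn = sum(tris[lang].values())
            self.word_probs[lang] = {w: c / wn for w, c in words[lang].items()}
            self.tri_probs[lang] = {t: c / tn for t, c in tris[lang].items()}

    def score(self, lang, text):
        """Log-score: log prior plus one term per word."""
        s = math.log(self.priors[lang])
        for w in text.lower().split():
            p = self.word_probs[lang].get(w)
            if p is not None:
                s += math.log(p)          # known word: use its probability
            else:
                grams = word_3gs(w)       # unknown word: back off to trigrams
                logs = (math.log(self.tri_probs[lang].get(g, self.floor))
                        for g in grams)
                s += sum(logs) / len(grams)   # assumed "norm": average log-prob
        return s

    def classify(self, text):
        return max(self.priors, key=lambda lang: self.score(lang, text))
```

Replacing the document-frequency priors with Internet language shares would be one way to apply the "Internet lang bias" idea.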
Experiments
* Usual ML setup (70:30) doesn't
work here
* “If you torment the data too
much...” (~c) Yaser Abu-Mostafa
* Comparison with existing systems
helps
Confusion Matrix
AB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10
Total quality: 0.90
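Rows in the format above can be produced from (gold, predicted) pairs roughly as follows. This is an illustrative Python sketch; the function names are assumptions, and how "Total quality" was computed is not stated on the slide, so it is not reproduced here.

```python
from collections import Counter, defaultdict


def confusion_rows(pairs):
    """pairs: iterable of (gold_lang, predicted_lang).

    Returns {gold: (accuracy, {wrong_lang: share})}.
    """
    by_gold = defaultdict(Counter)
    for gold, pred in pairs:
        by_gold[gold][pred] += 1
    rows = {}
    for gold, preds in by_gold.items():
        n = sum(preds.values())
        acc = preds.get(gold, 0) / n
        errors = {p: c / n for p, c in preds.items() if p != gold}
        rows[gold] = (acc, errors)
    return rows


def format_row(gold, acc, errors):
    """Render one row like 'BG: 0.60 | RU:0.40', errors sorted by share."""
    ranked = sorted(errors.items(), key=lambda kv: -kv[1])
    errs = " ".join(f"{p.upper()}:{v:.2f}" for p, v in ranked)
    return f"{gold.upper()}: {acc:.2f} | {errs}"
```

Reading the rows this way makes the hard pairs visible at a glance: closely related languages sharing a script account for most of the confusion.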
The Ladder of NLP
Rule-based
Linear ML
Decision Trees & co.
Sequence models
Artificial Neural networks
Better Models
What can be improved?
* Account for word order
* Discriminative models per script
* DeepLearning™ model
Marginal gain is not huge…
Engineer
(efficiency)
* Just a small piece
of the pipeline:
- good-enough speed
- minimize space usage
- minimize external dependencies
* Proper floating-point calculations
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
implementation
optimization
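One reading of "proper floating-point calculations" is avoiding underflow when many small probabilities are multiplied. The Python sketch below illustrates that general technique (summing logarithms instead of multiplying); it is an assumption about what the bullet refers to, and the function names are made up for the example.

```python
import math


def product_score(probs):
    """Naive product of probabilities; underflows to 0.0 on long inputs."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_score(probs):
    """Same quantity in log space; stays finite and comparable."""
    return sum(math.log(p) for p in probs)


# Two hypothetical languages scored on a 100-feature text.
lang_a = [1e-5] * 100
lang_b = [2e-5] * 100
```

With plain products both scores collapse to 0.0 and cannot be ranked, while the log scores still show that `lang_b` fits better.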
Model Optimization
Initial model size: ~1G
Target: ~10M :)
How to do it?
- Lossy compression: pruning
- Lossless compression:
Huffman coding, efficient DS
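Both compression steps can be sketched in a few lines of Python. The thresholds (`keep_top`, `min_prob`) and the choice of what to Huffman-encode are assumptions for illustration; the slide does not say which parts of the model were pruned or encoded.

```python
import heapq
from itertools import count


def prune_model(ngram_probs, keep_top=1000, min_prob=1e-5):
    """Lossy step: per language, drop rare n-grams and keep only the top ones."""
    pruned = {}
    for lang, table in ngram_probs.items():
        frequent = [(g, p) for g, p in table.items() if p >= min_prob]
        frequent.sort(key=lambda gp: gp[1], reverse=True)
        pruned[lang] = dict(frequent[:keep_top])
    return pruned


def huffman_codes(freqs):
    """Lossless step: build a prefix code, shorter for more frequent symbols."""
    if not freqs:
        return {}
    tiebreak = count()  # unique counter so heap never compares the dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]
```

Pruning changes the model's answers slightly and so needs re-evaluation, whereas Huffman coding only changes the storage format.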
API
* Levels of detail:
- text-langs
- word-langs
- window?
* UI: library, REPL & Web APIs
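The three levels of detail could look roughly like this in Python. The names `text_langs` and `word_langs` come from the slide, but their signatures, the `window_langs` helper and the toy lexicon standing in for a real model are all assumptions made for this sketch.

```python
from collections import Counter

# Assumption: a toy word -> language lexicon standing in for a trained model.
TOY_LEXICON = {"the": "en", "is": "en", "cat": "en",
               "der": "de", "ist": "de", "katze": "de"}


def word_langs(text, lexicon=TOY_LEXICON):
    """Per-word labels; 'und' marks words the model cannot place."""
    return [(w, lexicon.get(w.lower(), "und")) for w in text.split()]


def text_langs(text, lexicon=TOY_LEXICON):
    """Whole-text distribution over languages (empty if nothing is known)."""
    votes = Counter(l for _, l in word_langs(text, lexicon) if l != "und")
    total = sum(votes.values())
    return {l: c / total for l, c in votes.items()} if total else {}


def window_langs(text, size=3, lexicon=TOY_LEXICON):
    """Label each sliding window of `size` words with its top language."""
    words = text.split()
    out = []
    for i in range(max(1, len(words) - size + 1)):
        chunk = " ".join(words[i:i + size])
        dist = text_langs(chunk, lexicon)
        out.append((chunk, max(dist, key=dist.get) if dist else "und"))
    return out
```

A windowed view sits between the other two: it smooths over single misclassified words while still locating where a mixed-language text switches.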
Recap
* Triple view of any
knowledge-related problem
* Ladder of approaches to solving
NLP problems
* Importance of productive env:
general- & special-purpose
- REPL lang
- API access to data
- efficient testing
* Main stages of problem solving:
data → experiment →
implementation → optimization
