SlideShare a Scribd company logo
NLP Project
Full Circle
Vsevolod Dyomkin, @vseloved
Kyiv Dataspring 18, 2018-05-19
A Bit about Me
* 10+ Lisp programmer
* 5+ years of NLP work (Grammarly & my own)
* co-author of Projector NLP course
https://vseloved.github.io
Roles
NLP Project Stages
1) Domain analysis
2) Data preparation
3) Iterating the solution
4) Productizing the result
5) Gathering feedback
and returning to stages 1/2/3
1. Domain Analysis
1) Problem formulation
2) Existing solutions
3) Possible datasets
4) Identifying Challenges
5) Metrics, baseline & SOTA
6) Selecting possible
approaches and devising
a plan of attack
Domain:
text language identification
Problem Formulation
* Not an unsolved problem
* Short texts
* Support “all” languages
* The issue of UND/MULT texts
* Dialects?
* Encodings?
Interesting Examples
https://blog.twitter.com/2015/evaluating-language-
identification-performance
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
* https://github.com/CLD2Owners/cld2
* https://github.com/google/cld3
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
* https://github.com/CLD2Owners/cld2
* https://github.com/google/cld3
Possible Datasets
* Debian i18n (~90 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
* Wiktionary (~150 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
* Wiktionary (~150 langs)
* Wikipedia (175 langs)
Test Data
* Twitter evaluation dataset
* Datasets with fewer langs
* Extract from Wikipedia
+ smoke test
Challenges
* Linguistic challenges
- know next to nothing about
90% of the languages
- languages and scripts
http://www.omniglot.com/writing/langalph.htm
- language distributions
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
- word segmentation?
* Engineering challenges
Metrics
NLP Evaluation
Intrinsic: use a gold-
standard dataset and some
metric to measure the system’s
performance directly.
(in-domain & out-of-domain)
Extrinsic: use a system as
part of an upstream task(s)
and measure performance change
there.
Dev-Test Split
f1 et al.
1)
f1 et al.
Confusion Matrix
AF: 1.00 |
DE: 1.00 |
EN: 1.00 |
ES: 0.94 | IT:0.06
NL: 0.85 | IT:0.03 CA:0.03 AF:0.03 DE:0.03 FR:0.03
RU: 0.82 | UK:0.12 EN:0.03 DA:0.02 FR:0.02
UK: 0.93 | TG:0.07
Total quality: 0.91
Evaluating NLG
1) Word Error Rate (WER)
2) BLEU, ROUGE, METEOR
Other Metrics
1) Cross-entropy
2) Perplexity
3) Custom: MaxMatch, smatch,
parseval, UAS/LAS…
4) Your own?
Baseline
Quality that can be achieved
using some reasonable “basic”
approach
(on the same data).
2 ways to measure improvement:
* quality improvement
* error reduction
Recent SNLI Example
https://arxiv.org/pdf/1803.02324.pdf
- majority class baseline: 0.34
- SOTA models: 0.87
Recent SNLI Example
https://arxiv.org/pdf/1803.02324.pdf
- majority class baseline: 0.34
- SOTA models: 0.87
- basic fasttext classifier: 0.67
State-of-the-Art
(SOTA)
The highest publicly known
result on a well-established
dataset.
https://aclweb.org/aclwiki/Sta
te_of_the_art
2. Getting Data
The Unreasonable
Effectiveness of Data
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, http://youtu.be/yvDCzhbjYWs
https://twitter.com/shivon/status/864889085697024000
Uses of Data in NLP
* understanding the problem
* statistical analysis
* connecting separate domains
* evaluation data set
* training data set
* real-time feedback
* marketing/PR
* external competitions
* what else?
Data Workflow
Types of Data
* Structured
* Semi-structured
* Unstructured
Getting Data
* Get existing
* Create your own
* Acquire from users
Corpus
Annotated collection of docs
in a certain format.
Prominent Corpora
* National: OANC/MASC,
British (non-free)
* LDC(non-free): Penn Treebank,
OntoNotes, Web Treebank
* Books: Gutenberg, GoogleBooks
* Corporate: Reuters, Enron
* Research: SNLI, SQuAD
* Multilang: UDeps, Europarl
Corpora Pitfalls
* Requires licensing (often)
* Tied to a domain
* Annotation quality
* Requires processing
of custom formats (often)
DBs & KBs
* Wikimedia
* RDF knowledge bases (Freebase)
* Wordnet, Conceptnet, Babelnet
* Custom DBs
Dictionaries
* Wordlists (example: COCA)
* Dictionaries, grammar dicts
- Wiktionary
* Thesauri
Unstructured Data
* Internet
* CommonCrawl (also, NewsCrawl)
* UMBC, ClueWeb
Raw Text Pros & Cons
+ Can collect stats
=> build LMs, word vectors…
+ Can have a variety of domains
- … but hard to control the
distribution of domains
- Web artifacts
- Web noise
- Huge processing effort
Creating Own Data
* Scraping
* Annotation
* Crowdsourcing
* Getting from users
* Generating
Data Scraping
* Full-page web-scraping
(see: readability)
* Custom web-scraping
(see: scrapy, crawlik)
* Extracting from non-HTML
formats (.pdf, .doc…)
* Getting from API
(see: twitter, webhose)
Scraping Pros & Cons
+ possible to get needed domain
+ may be the only “fast” way
+ it’s the “google” way
- licensing issues
- getting volume requires time
- need to deal with antirobot
protection
- other engineering challenges
Corpus Annotation
* Initial data gathering
* Annotation guidelines
* Annotator selection
* Annotation tools
* Processing of raw annotations
* Result analysis & feedback
Annotation Variants
* professional
* mturk
* volunteer
Crowdsourcing
- Like annotation but harder
requires a custom solution
With gamification
+ If successful, can get
volumes and diversity
User Feedback Data
+ Your product, your domain
+ Real-time, allows adaptation
+ Ties into customer support
- Chicken & egg problem
- Requires anonymization
Approach: “lean startup”
Generating Data
+ Potentially unlimited volume
+ Control of the parameters
- Artificial (if you can code
a generation function,
why not code a reverse one?)
Data Best Practices
* Proper ML dataset handling
* Domain adequacy, diversity
* Inter-annotator agreement,
reasonable baselines
* Error analysis
* Real-time tracking
Data Tools
* (grep & co)
* other Shell powertools
* statistical analysis tools
+ plotting
* annotation tools
* web-scraping tools
* metrics databases (Graphite)
* Hadoop, Spark
3. Possible Approaches
1) Symbolical/rule-based
“The (Weak) Empire of Reason”
2) Counting-based
“The Empiricist Invasion”
3) Deep Learning-based
“The Revenge of the Spherical Cows”
http://www.earningmyturns.org/2017/06/
a-computational-linguistic-farce-in.ht
ml
Rule-based Approach
English Tokenization
(small problem)
* Simplest regex:
[^s]+
* More advanced regex:
w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|] «»“”‘’-]⟨⟩ ‒–—―
* Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']?
|["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—―
|[.!?]+
|-+
Add post-processing:
* concatenate abbreviations and decimals
* split contractions with regexes
- 2-character abbreviations regex:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
- 3-character abbreviations regex:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokeni
ze_uk.py
Error Correction
(inherently rule-based)
LanguageTool:<category name=" " id="STYLE" type="style">Стиль
<rulegroup id="BILSH_WITH_ADJ" name=" ">Більш з прикметниками
<rule>
<pattern>
<token> </token>більш
<token postag_regexp="yes" postag="ad[jv]:.*compc.*">
<exception> </exception>менш
</token>
</pattern>
<message> « » </message>Після більш не може стояти вища форма прикметника
<!-- TODO: when we can bind comparative forms togher again
<suggestion><match no="2"/></suggestion>
<suggestion><match no="1"/> <match no="2" postag_regexp="yes"
postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion>
<example correction=" | "><marker>Світліший Більш світлий Більш
</marker>.</example>світліший
-->
<example correction=""><marker> </marker>.</example>Більш світліший
<example> </example>все закінчилось більш менш
</rule>
...
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modu
les/uk/src/main/resources/org/languagetool/rules/uk/
Dialog Systems
(inherently scripted)
ELIZA:
(defparameter *eliza-rules*
'((((?* ?x) hello (?* ?y))
(How do you do. Please state your problem.))
(((?* ?x) I want (?* ?y))
(What would it mean if you got ?y)
(Why do you want ?y) (Suppose you got ?y soon))
(((?* ?x) if (?* ?y))
(Do you really think its likely that ?y) (Do you wish that ?y)
(What do you think about ?y) (Really-- if ?y))
(((?* ?x) no (?* ?y))
(Why not?) (You are being a bit negative)
(Are you saying "NO" just to be negative?))
(((?* ?x) I was (?* ?y))
(Were you really?) (Perhaps I already knew you were ?y)
(Why do you tell me you were ?y now?))
(((?* ?x) I feel (?* ?y))
(Do you often feel ?y ?))
(((?* ?x) I felt (?* ?y))
(What other feelings do you have?))))
https://norvig.com/paip/eliza1.lisp
Fighting Spam
(just a bad idea :)
SpamAssasin:
# bodyn
# example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam,
zwykle pisze tu - http://hanna.3xa.info
uri __LOCAL_LINK_INFO /http://w{3,8}.www.info/
header __LOCAL_FROM_O2 From =~ /@o2.pl/
meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2
describe LOCAL_LINK_INFO Link postaci http://cos.cos.info i z o2.pl
score LOCAL_LINK_INFO 5
https://wiki.apache.org/spamassassin/CustomRulesets
(…which continues being
reimplemented)
Facebook antispam (using HAXL):
fpSpammer :: Haxl Bool
fpSpammer =
talkingAboutFP .&&
numFriends .> 100 .&&
friendsLikeCPlusPlus
where
talkingAboutFP =
strContains “Functional Programming” <$> postContent
friendsLikeCPlusPlus = do
friends <- getFriends
cppFriends <- filterM likesCPlusPlus friends
return (length cppFriends >= length friends `div` 2)
http://multicore.doc.ic.ac.uk/iPr0gram/slides/2015-2016/Marlow-fighting-s
pam.pdf
https://petrimazepa.com/m_li_ili_kak_algoritm_facebook_blokiruet_polzovat
elei_na_osnovanii_stop_slov
Languages for Rules
* regexes
Languages for Rules
* regexes
* XML, JSON
Languages for Rules
* regexes
* XML, JSON
* external DSLs (example: JBOSS drules)
Languages for Rules
* regexes
* XML, JSON
* external DSLs
* internal DSLs
(match-html source
'(>> article
(aside (>> a ($ user))
(>> li (strong "Native Tongue:") ($ lang)))
(div |...| (>> (div :data-role "commentContent"
($ text) (span) |...|))
!!!))
Rules Pros & Cons
+ compact & fast
+ full control
+ arbitrary recall
+ iterative
+ best interpetability
+ perfectly accommodate domain experts
- precision ceiling
- non-optimal weights
- require a lot of human labor
- hard to interpret/calibrate score
Classic ML Approach
NLP Viewpoints
* Bag-of-words
* Sequence
* Tree
* Graph
NB Model for LangID
* The problem of using words
* Character ngrams to the rescue
* Combining them
Deep Learning Approach
BiLSTM+attention as SOTA
“Basically, if you want to do an
NLP task, no matter what it is,
what you should do is throw your
data into a Bi-directional
long-short term memory network,
and augment its information flow
with the attention mechanism.”
– Chris Manning
https://twitter.com/mayurbhangale
/status/988332845708886016
Hybrid Approach
“Rule-based” framework that
incorporates Machine Learning
models.
+ control
+ best of both worlds
- more complex
- not sexy
Bias
* Domain bias - systematic statistical
differences between “domains”, usually
in the form of vocab, class priors and
conditional probabilities
* Dataset bias - sampling bias in a
given dataset, giving rise to (often
unintentional) content bias
* Model bias - bias in the output of a
model wrt gold standard
* Social bias - model bias
towards/against particular socio-
demographic groups of users
Counteracting Bias
* Rule-based post-processing
* Mix multiple datasets
+ special feature-selection
* Train multiple models on
different datasets and
create an ensemble
* Use domain adaptation
techniques
4. Productization
Delivering your NLP
application to the users
Product variants
* end-user product
(web, mobile, desktop)
* internal feature
* API
* library
* scientific paper
Requirements
* speed (processing, startup)
* memory usage
* storage
* interoperability
* ease of use
* ease of update
Storage Optimization
WILD Initial model size ~ 1G
(compared to megabytes for CLD,
target: 10-20 MB)
Ways to reduce:
- partly rule-based
- model pruning
- clever algorithms: perfect hash
tables, Huffman coding
- model compression
https://arxiv.org/abs/1612.03651
Model Formats
* pickle & co.
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
* HDF5
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
* HDF5
* Custom gzipped text-based
Adaptation
(manual vs automatic)
* for RB systems: add more
rules :D
* for ML systems: online
learning algorithms
* for DL systems: recent
“universal” models
* Lambda architecture
The Pipeline
“Getting a data pipeline into production for the
first time entails dealing with many different
moving parts in these cases, spend more time—
thinking about the pipeline itself, and how it
will enable experimentation, and less about its
first algorithm.”
https://medium.com/@neal_lathia/five-lessons-from
-building-machine-learning-systems-d703162846ad
Read More
• https://www.slideshare.net/frandzi/native-langua
ge-identification-brief-review-to-the-state-of-t
he-art•
• http://people.cs.umass.edu/~brenocon/inlp2015/15
-eval.pdf
• https://ehudreiter.com/2017/05/03/metrics-nlg-ev
aluation/
• https://nlpers.blogspot.com/2014/11/the-myth-of-
strong-baseline.html
•
https://www.slideshare.net/grammarly/grammarly-a
inlp-club-1-domain-and-social-bias-in-nlp-case-s
tudy-in-language-identification-tim-baldwin-8025
2288

More Related Content

What's hot

Training python (new Updated)
Training python (new Updated)Training python (new Updated)
Training python (new Updated)
University of Technology
 
Why I Love Python
Why I Love PythonWhy I Love Python
Why I Love Python
didip
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
Nowell Strite
 
Basics of python
Basics of pythonBasics of python
Basics of python
Jatin Kochhar
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on android
Richard Chang
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHP
stubbles
 
Introduction to programming with python
Introduction to programming with pythonIntroduction to programming with python
Introduction to programming with python
Porimol Chandro
 
Phython Programming Language
Phython Programming LanguagePhython Programming Language
Phython Programming Language
R.h. Himel
 
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Steven Francia
 
Python Tutorial for Beginner
Python Tutorial for BeginnerPython Tutorial for Beginner
Python Tutorial for Beginner
rajkamaltibacademy
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitBen Healey
 
Python final presentation kirti ppt1
Python final presentation kirti ppt1Python final presentation kirti ppt1
Python final presentation kirti ppt1
Kirti Verma
 
Go lang introduction
Go lang introductionGo lang introduction
Go lang introduction
yangwm
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | Edureka
Edureka!
 
Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...
Dongsun Kim
 
Python - Lesson 1
Python - Lesson 1Python - Lesson 1
Python - Lesson 1
Andrew Frangos
 
FaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search EngineFaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search Engine
Dongsun Kim
 
Executable specifications for xtext
Executable specifications for xtextExecutable specifications for xtext
Executable specifications for xtextmeysholdt
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to Go
Matt Stine
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 

What's hot (20)

Training python (new Updated)
Training python (new Updated)Training python (new Updated)
Training python (new Updated)
 
Why I Love Python
Why I Love PythonWhy I Love Python
Why I Love Python
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Basics of python
Basics of pythonBasics of python
Basics of python
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on android
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHP
 
Introduction to programming with python
Introduction to programming with pythonIntroduction to programming with python
Introduction to programming with python
 
Phython Programming Language
Phython Programming LanguagePhython Programming Language
Phython Programming Language
 
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
 
Python Tutorial for Beginner
Python Tutorial for BeginnerPython Tutorial for Beginner
Python Tutorial for Beginner
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language Toolkit
 
Python final presentation kirti ppt1
Python final presentation kirti ppt1Python final presentation kirti ppt1
Python final presentation kirti ppt1
 
Go lang introduction
Go lang introductionGo lang introduction
Go lang introduction
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | Edureka
 
Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...
 
Python - Lesson 1
Python - Lesson 1Python - Lesson 1
Python - Lesson 1
 
FaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search EngineFaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search Engine
 
Executable specifications for xtext
Executable specifications for xtextExecutable specifications for xtext
Executable specifications for xtext
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to Go
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 

Similar to NLP Project Full Circle

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Esteve Castells
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
Ad van der Veer
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
Sease
 
Mdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primerMdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primer
Daniel M. Farrell
 
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...Frédéric Harper
 
Mini-Training: Redis
Mini-Training: RedisMini-Training: Redis
Mini-Training: Redis
Betclic Everest Group Tech Team
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
Antonios Giannopoulos
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Code Documentation. That ugly thing...
Code Documentation. That ugly thing...Code Documentation. That ugly thing...
Code Documentation. That ugly thing...
Christos Manios
 
Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Tony Frame
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
Dainius Jocas
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Model
chomas kandar
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Modelchomas kandar
 
Ruby On Rails Intro
Ruby On Rails IntroRuby On Rails Intro
Ruby On Rails Intro
Sarah Allen
 
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA
 
MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)
Mike Dirolf
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
Ihor Bobak
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
Doug Needham
 

Similar to NLP Project Full Circle (20)

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
Mdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primerMdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primer
 
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
 
Mini-Training: Redis
Mini-Training: RedisMini-Training: Redis
Mini-Training: Redis
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Code Documentation. That ugly thing...
Code Documentation. That ugly thing...Code Documentation. That ugly thing...
Code Documentation. That ugly thing...
 
10 Ways To Improve Your Code
10 Ways To Improve Your Code10 Ways To Improve Your Code
10 Ways To Improve Your Code
 
Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Model
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Model
 
Ruby On Rails Intro
Ruby On Rails IntroRuby On Rails Intro
Ruby On Rails Intro
 
Html5 and css3
Html5 and css3Html5 and css3
Html5 and css3
 
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
 
MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 

More from Vsevolod Dyomkin

The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
Vsevolod Dyomkin
 
Loading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp ImageLoading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp Image
Vsevolod Dyomkin
 
NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language Identification
Vsevolod Dyomkin
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
Vsevolod Dyomkin
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st Century
Vsevolod Dyomkin
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
Vsevolod Dyomkin
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
Vsevolod Dyomkin
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
Vsevolod Dyomkin
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
CL-NLP
CL-NLPCL-NLP
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная оберткаVsevolod Dyomkin
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python ProgrammersVsevolod Dyomkin
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
Vsevolod Dyomkin
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
Vsevolod Dyomkin
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelinesVsevolod Dyomkin
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхVsevolod Dyomkin
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Vsevolod Dyomkin
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common Lisp
Vsevolod Dyomkin
 

More from Vsevolod Dyomkin (19)

The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
 
Loading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp ImageLoading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp Image
 
NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language Identification
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st Century
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Lisp Machine Prunciples
Lisp Machine PrunciplesLisp Machine Prunciples
Lisp Machine Prunciples
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
CL-NLP
CL-NLPCL-NLP
CL-NLP
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная обертка
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python Programmers
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelines
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данных
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common Lisp
 

Recently uploaded

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 

Recently uploaded (20)

National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

NLP Project Full Circle

  • 1. NLP Project Full Circle Vsevolod Dyomkin, @vseloved Kyiv Dataspring 18, 2018-05-19
  • 2. A Bit about Me * 10+ Lisp programmer * 5+ years of NLP work (Grammarly & my own) * co-author of Projector NLP course https://vseloved.github.io
  • 4. NLP Project Stages 1) Domain analysis 2) Data preparation 3) Iterating the solution 4) Productizing the result 5) Gathering feedback and returning to stages 1/2/3
  • 5. 1. Domain Analysis 1) Problem formulation 2) Existing solutions 3) Possible datasets 4) Identifying Challenges 5) Metrics, baseline & SOTA 6) Selecting possible approaches and devising a plan of attack
  • 7. Problem Formulation * Not an unsolved problem * Short texts * Support “all” languages * The issue of UND/MULT texts * Dialects? * Encodings?
  • 9. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++)
  • 10. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++) * https://github.com/CLD2Owners/cld2 * https://github.com/google/cld3
  • 11. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++) * https://github.com/CLD2Owners/cld2 * https://github.com/google/cld3
  • 12. Possible Datasets * Debian i18n (~90 langs)
  • 13. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs)
  • 14. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs) * Wiktionary (~150 langs)
  • 15. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs) * Wiktionary (~150 langs) * Wikipedia (175 langs)
  • 16. Test Data * Twitter evaluation dataset * Datasets with fewer langs * Extract from Wikipedia + smoke test
  • 17. Challenges * Linguistic challenges - know next to nothing about 90% of the languages - languages and scripts http://www.omniglot.com/writing/langalph.htm - language distributions https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites - word segmentation? * Engineering challenges
  • 19. NLP Evaluation Intrinsic: use a gold- standard dataset and some metric to measure the system’s performance directly. (in-domain & out-of-domain) Extrinsic: use a system as part of an upstream task(s) and measure performance change there.
  • 23. Confusion Matrix AF: 1.00 | DE: 1.00 | EN: 1.00 | ES: 0.94 | IT:0.06 NL: 0.85 | IT:0.03 CA:0.03 AF:0.03 DE:0.03 FR:0.03 RU: 0.82 | UK:0.12 EN:0.03 DA:0.02 FR:0.02 UK: 0.93 | TG:0.07 Total quality: 0.91
  • 24. Evaluating NLG 1) Word Error Rate (WER) 2) BLEU, ROUGE, METEOR
  • 25.
  • 26. Other Metrics 1) Cross-entropy 2) Perplexity 3) Custom: MaxMatch, smatch, parseval, UAS/LAS… 4) Your own?
  • 27. Baseline Quality that can be achieved using some reasonable “basic” approach (on the same data). 2 ways to measure improvement: * quality improvement * error reduction
  • 28. Recent SNLI Example https://arxiv.org/pdf/1803.02324.pdf - majority class baseline: 0.34 - SOTA models: 0.87
  • 29. Recent SNLI Example https://arxiv.org/pdf/1803.02324.pdf - majority class baseline: 0.34 - SOTA models: 0.87 - basic fasttext classifier: 0.67
  • 30. State-of-the-Art (SOTA) The highest publicly known result on a well-established dataset. https://aclweb.org/aclwiki/Sta te_of_the_art
  • 32. The Unreasonable Effectiveness of Data “Data is ten times more powerful than algorithms.” -- Peter Norvig, http://youtu.be/yvDCzhbjYWs https://twitter.com/shivon/status/864889085697024000
  • 33. Uses of Data in NLP * understanding the problem * statistical analysis * connecting separate domains * evaluation data set * training data set * real-time feedback * marketing/PR * external competitions * what else?
  • 35. Types of Data * Structured * Semi-structured * Unstructured
  • 36. Getting Data * Get existing * Create your own * Acquire from users
  • 37. Corpus Annotated collection of docs in a certain format.
  • 38. Prominent Corpora * National: OANC/MASC, British (non-free) * LDC(non-free): Penn Treebank, OntoNotes, Web Treebank * Books: Gutenberg, GoogleBooks * Corporate: Reuters, Enron * Research: SNLI, SQuAD * Multilang: UDeps, Europarl
  • 39. Corpora Pitfalls * Requires licensing (often) * Tied to a domain * Annotation quality * Requires processing of custom formats (often)
  • 40. DBs & KBs * Wikimedia * RDF knowledge bases (Freebase) * Wordnet, Conceptnet, Babelnet * Custom DBs
  • 41. Dictionaries * Wordlists (example: COCA) * Dictionaries, grammar dicts - Wiktionary * Thesauri
  • 42. Unstructured Data * Internet * CommonCrawl (also, NewsCrawl) * UMBC, ClueWeb
  • 43. Raw Text Pros & Cons + Can collect stats => build LMs, word vectors… + Can have a variety of domains - … but hard to control the distribution of domains - Web artifacts - Web noise - Huge processing effort
  • 44. Creating Own Data * Scraping * Annotation * Crowdsourcing * Getting from users * Generating
  • 45. Data Scraping * Full-page web-scraping (see: readability) * Custom web-scraping (see: scrapy, crawlik) * Extracting from non-HTML formats (.pdf, .doc…) * Getting from API (see: twitter, webhose)
  • 46. Scraping Pros & Cons + possible to get needed domain + may be the only “fast” way + it’s the “google” way - licensing issues - getting volume requires time - need to deal with antirobot protection - other engineering challenges
  • 47. Corpus Annotation * Initial data gathering * Annotation guidelines * Annotator selection * Annotation tools * Processing of raw annotations * Result analysis & feedback
  • 49. Crowdsourcing - Like annotation but harder requires a custom solution With gamification + If successful, can get volumes and diversity
  • 50. User Feedback Data + Your product, your domain + Real-time, allows adaptation + Ties into customer support - Chicken & egg problem - Requires anonymization Approach: “lean startup”
  • 51. Generating Data + Potentially unlimited volume + Control of the parameters - Artificial (if you can code a generation function, why not code a reverse one?)
  • 52. Data Best Practices * Proper ML dataset handling * Domain adequacy, diversity * Inter-annotator agreement, reasonable baselines * Error analysis * Real-time tracking
  • 53. Data Tools * (grep & co) * other Shell powertools * statistical analysis tools + plotting * annotation tools * web-scraping tools * metrics databases (Graphite) * Hadoop, Spark
  • 54. 3. Possible Approaches 1) Symbolical/rule-based “The (Weak) Empire of Reason” 2) Counting-based “The Empiricist Invasion” 3) Deep Learning-based “The Revenge of the Spherical Cows” http://www.earningmyturns.org/2017/06/ a-computational-linguistic-farce-in.ht ml
  • 56. English Tokenization (small problem) * Simplest regex: [^s]+ * More advanced regex: w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|] «»“”‘’-]⟨⟩ ‒–—― * Even more advanced regex: [+-]?[0-9](?:[0-9,.]*[0-9])? |[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']? |["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—― |[.!?]+ |-+ Add post-processing: * concatenate abbreviations and decimals * split contractions with regexes - 2-character abbreviations regex: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$ - 3-character abbreviations regex: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$ https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokeni ze_uk.py
  • 57. Error Correction (inherently rule-based) LanguageTool:<category name=" " id="STYLE" type="style">Стиль <rulegroup id="BILSH_WITH_ADJ" name=" ">Більш з прикметниками <rule> <pattern> <token> </token>більш <token postag_regexp="yes" postag="ad[jv]:.*compc.*"> <exception> </exception>менш </token> </pattern> <message> « » </message>Після більш не може стояти вища форма прикметника <!-- TODO: when we can bind comparative forms togher again <suggestion><match no="2"/></suggestion> <suggestion><match no="1"/> <match no="2" postag_regexp="yes" postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion> <example correction=" | "><marker>Світліший Більш світлий Більш </marker>.</example>світліший --> <example correction=""><marker> </marker>.</example>Більш світліший <example> </example>все закінчилось більш менш </rule> ... https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modu les/uk/src/main/resources/org/languagetool/rules/uk/
  • 58. Dialog Systems (inherently scripted) ELIZA: (defparameter *eliza-rules* '((((?* ?x) hello (?* ?y)) (How do you do. Please state your problem.)) (((?* ?x) I want (?* ?y)) (What would it mean if you got ?y) (Why do you want ?y) (Suppose you got ?y soon)) (((?* ?x) if (?* ?y)) (Do you really think its likely that ?y) (Do you wish that ?y) (What do you think about ?y) (Really-- if ?y)) (((?* ?x) no (?* ?y)) (Why not?) (You are being a bit negative) (Are you saying "NO" just to be negative?)) (((?* ?x) I was (?* ?y)) (Were you really?) (Perhaps I already knew you were ?y) (Why do you tell me you were ?y now?)) (((?* ?x) I feel (?* ?y)) (Do you often feel ?y ?)) (((?* ?x) I felt (?* ?y)) (What other feelings do you have?)))) https://norvig.com/paip/eliza1.lisp
  • 59.
  • 60. Fighting Spam (just a bad idea :) SpamAssasin: # bodyn # example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam, zwykle pisze tu - http://hanna.3xa.info uri __LOCAL_LINK_INFO /http://w{3,8}.www.info/ header __LOCAL_FROM_O2 From =~ /@o2.pl/ meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2 describe LOCAL_LINK_INFO Link postaci http://cos.cos.info i z o2.pl score LOCAL_LINK_INFO 5 https://wiki.apache.org/spamassassin/CustomRulesets
  • 61. (…which continues being reimplemented) Facebook antispam (using HAXL): fpSpammer :: Haxl Bool fpSpammer = talkingAboutFP .&& numFriends .> 100 .&& friendsLikeCPlusPlus where talkingAboutFP = strContains “Functional Programming” <$> postContent friendsLikeCPlusPlus = do friends <- getFriends cppFriends <- filterM likesCPlusPlus friends return (length cppFriends >= length friends `div` 2) http://multicore.doc.ic.ac.uk/iPr0gram/slides/2015-2016/Marlow-fighting-s pam.pdf https://petrimazepa.com/m_li_ili_kak_algoritm_facebook_blokiruet_polzovat elei_na_osnovanii_stop_slov
  • 63. Languages for Rules * regexes * XML, JSON
  • 64. Languages for Rules * regexes * XML, JSON * external DSLs (example: JBOSS drules)
  • 65. Languages for Rules * regexes * XML, JSON * external DSLs * internal DSLs (match-html source '(>> article (aside (>> a ($ user)) (>> li (strong "Native Tongue:") ($ lang))) (div |...| (>> (div :data-role "commentContent" ($ text) (span) |...|)) !!!))
  • 66. Rules Pros & Cons + compact & fast + full control + arbitrary recall + iterative + best interpetability + perfectly accommodate domain experts - precision ceiling - non-optimal weights - require a lot of human labor - hard to interpret/calibrate score
  • 68. NLP Viewpoints * Bag-of-words * Sequence * Tree * Graph
  • 69. NB Model for LangID * The problem of using words * Character ngrams to the rescue * Combining them
  • 71. BiLSTM+attention as SOTA “Basically, if you want to do an NLP task, no matter what it is, what you should do is throw your data into a Bi-directional long-short term memory network, and augment its information flow with the attention mechanism.” – Chris Manning https://twitter.com/mayurbhangale /status/988332845708886016
  • 72. Hybrid Approach “Rule-based” framework that incorporates Machine Learning models. + control + best of both worlds - more complex - not sexy
  • 73. Bias * Domain bias - systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities * Dataset bias - sampling bias in a given dataset, giving rise to (often unintentional) content bias * Model bias - bias in the output of a model wrt gold standard * Social bias - model bias towards/against particular socio- demographic groups of users
  • 74. Counteracting Bias * Rule-based post-processing * Mix multiple datasets + special feature-selection * Train multiple models on different datasets and create an ensemble * Use domain adaptation techniques
  • 75. 4. Productization Delivering your NLP application to the users
  • 76. Product variants * end-user product (web, mobile, desktop) * internal feature * API * library * scientific paper
  • 77. Requirements * speed (processing, startup) * memory usage * storage * interoperability * ease of use * ease of update
  • 78. Storage Optimization WILD Initial model size ~ 1G (compared to megabytes for CLD, target: 10-20 MB) Ways to reduce: - partly rule-based - model pruning - clever algorithms: perfect hash tables, Huffman coding - model compression https://arxiv.org/abs/1612.03651
  • 80. Model Formats * pickle & co. * JSON, ProtoBuf, ...
  • 81. Model Formats * pickle & co. * JSON, ProtoBuf, ... * HDF5
  • 82. Model Formats * pickle & co. * JSON, ProtoBuf, ... * HDF5 * Custom gzipped text-based
  • 83. Adaptation (manual vs automatic) * for RB systems: add more rules :D * for ML systems: online learning algorithms * for DL systems: recent “universal” models * Lambda architecture
  • 84. The Pipeline “Getting a data pipeline into production for the first time entails dealing with many different moving parts in these cases, spend more time— thinking about the pipeline itself, and how it will enable experimentation, and less about its first algorithm.” https://medium.com/@neal_lathia/five-lessons-from -building-machine-learning-systems-d703162846ad
  • 85. Read More • https://www.slideshare.net/frandzi/native-langua ge-identification-brief-review-to-the-state-of-t he-art• • http://people.cs.umass.edu/~brenocon/inlp2015/15 -eval.pdf • https://ehudreiter.com/2017/05/03/metrics-nlg-ev aluation/ • https://nlpers.blogspot.com/2014/11/the-myth-of- strong-baseline.html • https://www.slideshare.net/grammarly/grammarly-a inlp-club-1-domain-and-social-bias-in-nlp-case-s tudy-in-language-identification-tim-baldwin-8025 2288