NLP Project
Full Circle
Vsevolod Dyomkin, @vseloved
Kyiv Dataspring 18, 2018-05-19
A Bit about Me
* 10+ years as a Lisp programmer
* 5+ years of NLP work (Grammarly & my own)
* co-author of Projector NLP course
https://vseloved.github.io
Roles
NLP Project Stages
1) Domain analysis
2) Data preparation
3) Iterating the solution
4) Productizing the result
5) Gathering feedback
and returning to stages 1/2/3
1. Domain Analysis
1) Problem formulation
2) Existing solutions
3) Possible datasets
4) Identifying challenges
5) Metrics, baseline & SOTA
6) Selecting possible
approaches and devising
a plan of attack
Domain:
text language identification
Problem Formulation
* Not an unsolved problem
* Short texts
* Support “all” languages
* The issue of UND/MULT texts
* Dialects?
* Encodings?
Interesting Examples
https://blog.twitter.com/2015/evaluating-language-identification-performance
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
* https://github.com/CLD2Owners/cld2
* https://github.com/google/cld3
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
* Wiktionary (~150 langs)
* Wikipedia (175 langs)
Test Data
* Twitter evaluation dataset
* Datasets with fewer langs
* Extract from Wikipedia
+ smoke test
Challenges
* Linguistic challenges
- we know next to nothing about 90% of the languages
- languages and scripts
http://www.omniglot.com/writing/langalph.htm
- language distributions
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
- word segmentation?
* Engineering challenges
Metrics
NLP Evaluation
Intrinsic: use a gold-
standard dataset and some
metric to measure the system’s
performance directly.
(in-domain & out-of-domain)
Extrinsic: use the system as
part of a downstream task (or
several) and measure the
performance change there.
Dev-Test Split
F1 et al.
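For reference, the standard definitions behind this slide: precision P, recall R, and their harmonic mean F1:

\[
\mathrm{P} = \frac{TP}{TP + FP},\qquad
\mathrm{R} = \frac{TP}{TP + FN},\qquad
F_1 = \frac{2\,\mathrm{P}\,\mathrm{R}}{\mathrm{P} + \mathrm{R}}
\]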
Confusion Matrix
AF: 1.00 |
DE: 1.00 |
EN: 1.00 |
ES: 0.94 | IT:0.06
NL: 0.85 | IT:0.03 CA:0.03 AF:0.03 DE:0.03 FR:0.03
RU: 0.82 | UK:0.12 EN:0.03 DA:0.02 FR:0.02
UK: 0.93 | TG:0.07
Total quality: 0.91
Evaluating NLG
1) Word Error Rate (WER)
2) BLEU, ROUGE, METEOR
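For reference, WER counts the substitutions S, deletions D, and insertions I needed to turn the system output into a reference of length N:

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]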
Other Metrics
1) Cross-entropy
2) Perplexity
3) Custom: MaxMatch, smatch,
parseval, UAS/LAS…
4) Your own?
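The first two are tightly coupled; for a model q evaluated on test words w_1…w_N:

\[
H(q) = -\frac{1}{N}\sum_{i=1}^{N}\log_2 q(w_i \mid w_1 \dots w_{i-1}),
\qquad
\mathrm{PP}(q) = 2^{H(q)}
\]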
Baseline
Quality that can be achieved
using some reasonable “basic”
approach
(on the same data).
2 ways to measure improvement:
* quality improvement
* error reduction
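A worked example of the difference, for a baseline accuracy of 0.90 improved to 0.95:

\[
\frac{0.95 - 0.90}{0.90} \approx 5.6\%\ \text{quality improvement},
\qquad
\frac{0.10 - 0.05}{0.10} = 50\%\ \text{error reduction}
\]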
Recent SNLI Example
https://arxiv.org/pdf/1803.02324.pdf
- majority class baseline: 0.34
- SOTA models: 0.87
- basic fasttext classifier: 0.67
State-of-the-Art
(SOTA)
The highest publicly known
result on a well-established
dataset.
https://aclweb.org/aclwiki/State_of_the_art
2. Getting Data
The Unreasonable
Effectiveness of Data
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, http://youtu.be/yvDCzhbjYWs
https://twitter.com/shivon/status/864889085697024000
Uses of Data in NLP
* understanding the problem
* statistical analysis
* connecting separate domains
* evaluation data set
* training data set
* real-time feedback
* marketing/PR
* external competitions
* what else?
Data Workflow
Types of Data
* Structured
* Semi-structured
* Unstructured
Getting Data
* Get existing
* Create your own
* Acquire from users
Corpus
Annotated collection of docs
in a certain format.
Prominent Corpora
* National: OANC/MASC,
British (non-free)
* LDC (non-free): Penn Treebank,
OntoNotes, Web Treebank
* Books: Gutenberg, GoogleBooks
* Corporate: Reuters, Enron
* Research: SNLI, SQuAD
* Multilang: UDeps, Europarl
Corpora Pitfalls
* Requires licensing (often)
* Tied to a domain
* Annotation quality
* Requires processing
of custom formats (often)
DBs & KBs
* Wikimedia
* RDF knowledge bases (Freebase)
* Wordnet, Conceptnet, Babelnet
* Custom DBs
Dictionaries
* Wordlists (example: COCA)
* Dictionaries, grammar dicts
- Wiktionary
* Thesauri
Unstructured Data
* Internet
* CommonCrawl (also, NewsCrawl)
* UMBC, ClueWeb
Raw Text Pros & Cons
+ Can collect stats
=> build LMs, word vectors…
+ Can have a variety of domains
- … but hard to control the
distribution of domains
- Web artifacts
- Web noise
- Huge processing effort
Creating Own Data
* Scraping
* Annotation
* Crowdsourcing
* Getting from users
* Generating
Data Scraping
* Full-page web-scraping
(see: readability)
* Custom web-scraping
(see: scrapy, crawlik)
* Extracting from non-HTML
formats (.pdf, .doc…)
* Getting from API
(see: twitter, webhose)
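A minimal full-page scraping sketch in Python, assuming the readability-lxml package (one implementation of the readability idea above); the URL is a placeholder:

import requests
from readability import Document  # pip install readability-lxml

def scrape_main_text(url):
    """Fetch a page and keep only its main readable content."""
    html = requests.get(url, timeout=10).text
    doc = Document(html)
    return doc.title(), doc.summary()  # summary() returns cleaned HTML

title, body_html = scrape_main_text("https://example.com/article")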
Scraping Pros & Cons
+ possible to get needed domain
+ may be the only “fast” way
+ it’s the “google” way
- licensing issues
- getting volume requires time
- need to deal with anti-robot
protection
- other engineering challenges
Corpus Annotation
* Initial data gathering
* Annotation guidelines
* Annotator selection
* Annotation tools
* Processing of raw annotations
* Result analysis & feedback
Annotation Variants
* professional
* mturk
* volunteer
Crowdsourcing
- Like annotation, but harder:
requires a custom solution
with gamification
+ If successful, can get
volumes and diversity
User Feedback Data
+ Your product, your domain
+ Real-time, allows adaptation
+ Ties into customer support
- Chicken & egg problem
- Requires anonymization
Approach: “lean startup”
Generating Data
+ Potentially unlimited volume
+ Control of the parameters
- Artificial (if you can code
a generation function,
why not code a reverse one?)
Data Best Practices
* Proper ML dataset handling
* Domain adequacy, diversity
* Inter-annotator agreement,
reasonable baselines
* Error analysis
* Real-time tracking
Data Tools
* (grep & co)
* other Shell powertools
* statistical analysis tools
+ plotting
* annotation tools
* web-scraping tools
* metrics databases (Graphite)
* Hadoop, Spark
3. Possible Approaches
1) Symbolical/rule-based
“The (Weak) Empire of Reason”
2) Counting-based
“The Empiricist Invasion”
3) Deep Learning-based
“The Revenge of the Spherical Cows”
http://www.earningmyturns.org/2017/06/a-computational-linguistic-farce-in.html
Rule-based Approach
English Tokenization
(small problem)
* Simplest regex:
[^\s]+
* More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…()⟨⟩{}[\]|«»“”‘’‒–—―-]
* Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…()⟨⟩{}[\]|«»“”‘’‒–—―']
|[.!?]+
|-+
Add post-processing:
* concatenate abbreviations and decimals
* split contractions with regexes
- 2-character contractions regex:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
- 3-character contractions regex:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokenize_uk.py
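A runnable Python sketch of the first two patterns; the long punctuation class is abbreviated here to [^\w\s], a simplification rather than the exact tokenize-uk rule:

import re

WS_TOKEN = re.compile(r"[^\s]+")         # simplest: whitespace-separated chunks
WORD_PUNCT = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks

def tokenize(text, pattern=WORD_PUNCT):
    return pattern.findall(text)

print(tokenize("Don't panic, it works..."))
# ['Don', "'", 't', 'panic', ',', 'it', 'works', '.', '.', '.']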
Error Correction
(inherently rule-based)
LanguageTool:
<category name="Стиль" id="STYLE" type="style">
<rulegroup id="BILSH_WITH_ADJ" name="Більш з прикметниками">
<rule>
 <pattern>
  <token>більш</token>
  <token postag_regexp="yes" postag="ad[jv]:.*compc.*">
   <exception>менш</exception>
  </token>
 </pattern>
 <message>Після «більш» не може стояти вища форма прикметника</message>
 <!-- TODO: when we can bind comparative forms together again
 <suggestion><match no="2"/></suggestion>
 <suggestion><match no="1"/> <match no="2" postag_regexp="yes"
  postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion>
 <example correction="Світліший|Більш світлий"><marker>Більш світліший</marker>.</example>
 -->
 <example correction=""><marker>Більш світліший</marker>.</example>
 <example>все закінчилось більш менш</example>
</rule>
...
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/rules/uk/
Dialog Systems
(inherently scripted)
ELIZA:
(defparameter *eliza-rules*
'((((?* ?x) hello (?* ?y))
(How do you do. Please state your problem.))
(((?* ?x) I want (?* ?y))
(What would it mean if you got ?y)
(Why do you want ?y) (Suppose you got ?y soon))
(((?* ?x) if (?* ?y))
(Do you really think its likely that ?y) (Do you wish that ?y)
(What do you think about ?y) (Really-- if ?y))
(((?* ?x) no (?* ?y))
(Why not?) (You are being a bit negative)
(Are you saying "NO" just to be negative?))
(((?* ?x) I was (?* ?y))
(Were you really?) (Perhaps I already knew you were ?y)
(Why do you tell me you were ?y now?))
(((?* ?x) I feel (?* ?y))
(Do you often feel ?y ?))
(((?* ?x) I felt (?* ?y))
(What other feelings do you have?))))
https://norvig.com/paip/eliza1.lisp
Fighting Spam
(just a bad idea :)
SpamAssassin:
# body
# example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam,
# zwykle pisze tu - http://hanna.3xa.info
uri __LOCAL_LINK_INFO /http:\/\/\w{3,8}\.\w\w\w\.info/
header __LOCAL_FROM_O2 From =~ /@o2\.pl/
meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2
describe LOCAL_LINK_INFO Link of the form http://cos.cos.info sent from o2.pl
score LOCAL_LINK_INFO 5
https://wiki.apache.org/spamassassin/CustomRulesets
(…which continues being
reimplemented)
Facebook antispam (using HAXL):
fpSpammer :: Haxl Bool
fpSpammer =
talkingAboutFP .&&
numFriends .> 100 .&&
friendsLikeCPlusPlus
where
talkingAboutFP =
strContains "Functional Programming" <$> postContent
friendsLikeCPlusPlus = do
friends <- getFriends
cppFriends <- filterM likesCPlusPlus friends
return (length cppFriends >= length friends `div` 2)
http://multicore.doc.ic.ac.uk/iPr0gram/slides/2015-2016/Marlow-fighting-spam.pdf
https://petrimazepa.com/m_li_ili_kak_algoritm_facebook_blokiruet_polzovatelei_na_osnovanii_stop_slov
Languages for Rules
* regexes
* XML, JSON
* external DSLs (example: JBoss Drools)
* internal DSLs
(match-html source
'(>> article
(aside (>> a ($ user))
(>> li (strong "Native Tongue:") ($ lang)))
(div |...| (>> (div :data-role "commentContent"
($ text) (span) |...|))
!!!))
Rules Pros & Cons
+ compact & fast
+ full control
+ arbitrary recall
+ iterative
+ best interpretability
+ perfectly accommodate domain experts
- precision ceiling
- non-optimal weights
- require a lot of human labor
- hard to interpret/calibrate score
Classic ML Approach
NLP Viewpoints
* Bag-of-words
* Sequence
* Tree
* Graph
NB Model for LangID
* The problem of using words
* Character ngrams to the rescue
* Combining them
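A minimal sketch of this idea: a character-trigram Naive Bayes classifier with add-one smoothing. This is an illustration, not the talk's actual model; the class and method names are made up:

import math
from collections import Counter

class CharNgramNB:
    """Character n-gram Naive Bayes language identifier (illustrative)."""
    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # lang -> Counter of char n-grams
        self.priors = {}   # lang -> log prior

    def _ngrams(self, text):
        text = " %s " % text.lower()   # pad so word edges form n-grams
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def fit(self, texts, langs):
        for text, lang in zip(texts, langs):
            self.counts.setdefault(lang, Counter()).update(self._ngrams(text))
        doc_freq = Counter(langs)
        self.priors = {l: math.log(c / len(langs)) for l, c in doc_freq.items()}
        self.vocab = len(set().union(*self.counts.values()))

    def predict(self, text):
        grams = self._ngrams(text)
        def loglike(lang):
            c, total = self.counts[lang], sum(self.counts[lang].values())
            # add-one smoothing over the joint n-gram vocabulary
            return self.priors[lang] + sum(
                math.log((c[g] + 1) / (total + self.vocab)) for g in grams)
        return max(self.counts, key=loglike)

clf = CharNgramNB()
clf.fit(["the quick brown fox", "der schnelle braune fuchs"], ["en", "de"])
print(clf.predict("the lazy dog"))  # -> en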
Deep Learning Approach
BiLSTM+attention as SOTA
“Basically, if you want to do an
NLP task, no matter what it is,
what you should do is throw your
data into a Bi-directional
long-short term memory network,
and augment its information flow
with the attention mechanism.”
– Chris Manning
https://twitter.com/mayurbhangale/status/988332845708886016
Hybrid Approach
“Rule-based” framework that
incorporates Machine Learning
models.
+ control
+ best of both worlds
- more complex
- not sexy
Bias
* Domain bias - systematic statistical
differences between “domains”, usually
in the form of vocab, class priors and
conditional probabilities
* Dataset bias - sampling bias in a
given dataset, giving rise to (often
unintentional) content bias
* Model bias - bias in the output of a
model wrt gold standard
* Social bias - model bias
towards/against particular socio-
demographic groups of users
Counteracting Bias
* Rule-based post-processing
* Mix multiple datasets
+ special feature-selection
* Train multiple models on
different datasets and
create an ensemble
* Use domain adaptation
techniques
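For the ensemble option above, a sketch of the simplest combination rule: majority voting over models trained on different datasets. Model objects are assumed to expose a predict method, as in the earlier NB sketch:

from collections import Counter

def ensemble_predict(models, text):
    """Majority vote across per-dataset models; ties go to the first-seen label."""
    votes = Counter(model.predict(text) for model in models)
    return votes.most_common(1)[0][0]

# e.g. one model per corpus (Debian i18n, TED, Wikipedia):
# lang = ensemble_predict([nb_debian, nb_ted, nb_wiki], "Bonjour tout le monde")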
4. Productization
Delivering your NLP
application to the users
Product variants
* end-user product
(web, mobile, desktop)
* internal feature
* API
* library
* scientific paper
Requirements
* speed (processing, startup)
* memory usage
* storage
* interoperability
* ease of use
* ease of update
Storage Optimization
Initial WILD model size: ~1 GB
(compared to megabytes for CLD;
target: 10-20 MB)
Ways to reduce:
- partly rule-based
- model pruning
- clever algorithms: perfect hash
tables, Huffman coding
- model compression
https://arxiv.org/abs/1612.03651
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
* HDF5
* Custom gzipped text-based
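A sketch of the last option, assuming the model is just n-gram counts per language (as in the NB example earlier): one tab-separated record per line, gzipped:

import gzip

def save_model(path, counts):
    """Write 'lang<TAB>ngram<TAB>count' lines; assumes n-grams contain no tabs/newlines."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for lang, grams in counts.items():
            for gram, n in grams.items():
                f.write("%s\t%s\t%d\n" % (lang, gram, n))

def load_model(path):
    counts = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            lang, gram, n = line.rstrip("\n").split("\t")
            counts.setdefault(lang, {})[gram] = int(n)
    return counts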
Adaptation
(manual vs automatic)
* for RB systems: add more
rules :D
* for ML systems: online
learning algorithms
* for DL systems: recent
“universal” models
* Lambda architecture
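For the ML-system case, a minimal online-learning sketch with scikit-learn (an assumed stack, not the talk's): a hashing vectorizer needs no fixed vocabulary, so the classifier can keep updating from streamed user feedback via partial_fit.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(analyzer="char", ngram_range=(1, 3))
clf = SGDClassifier(loss="log_loss")
CLASSES = ["en", "de", "fr"]  # hypothetical label set, fixed up front

def update(texts, labels):
    # incremental update on a new batch of user feedback
    clf.partial_fit(vec.transform(texts), labels, classes=CLASSES)

def predict(text):
    return clf.predict(vec.transform([text]))[0]

update(["hello world", "hallo welt", "bonjour le monde"],
       ["en", "de", "fr"])
print(predict("hello there"))  # -> en (after enough updates)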
The Pipeline
“Getting a data pipeline into production for the
first time entails dealing with many different
moving parts. In these cases, spend more time
thinking about the pipeline itself, and how it
will enable experimentation, and less about its
first algorithm.”
https://medium.com/@neal_lathia/five-lessons-from-building-machine-learning-systems-d703162846ad
Read More
• https://www.slideshare.net/frandzi/native-language-identification-brief-review-to-the-state-of-the-art
• http://people.cs.umass.edu/~brenocon/inlp2015/15-eval.pdf
• https://ehudreiter.com/2017/05/03/metrics-nlg-evaluation/
• https://nlpers.blogspot.com/2014/11/the-myth-of-strong-baseline.html
• https://www.slideshare.net/grammarly/grammarly-ainlp-club-1-domain-and-social-bias-in-nlp-case-study-in-language-identification-tim-baldwin-80252288