NLP
It’s not neuro-linguistic programming

Topic
Auto-suggestion
Normalization
Stemming
Lemmatization
Spelling Correction
BLEU Score
Morphological Analysis
Transliteration
Sentiment Analysis
Summarization
Confidence Score

Funny Autocomplete
“autocomplete is not a function” is currently Google’s top autocomplete suggestion for “autocomplete is”.

Autocomplete is not A function
• Neither is auto-suggestion
• They are many-to-many relations with scores.
• Remember this?

Many-to-many Scoring
• Map by prefix, rank by popularity
• Google search box autocomplete
• Map by occurrence, rank by similarity
• Search (information retrieval)
• Map by information, rank by knowledge
• Translation
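
The first pairing, map by prefix and rank by popularity, can be sketched in a few lines. This is only an illustration: the query log, its counts, and the name `suggest` are made up here and do not describe how any real search box works.

```python
from collections import Counter

# Hypothetical query log; the counts stand in for "popularity".
QUERY_LOG = Counter({
    "autocomplete is not a function": 50,
    "autocomplete is not working": 30,
    "autocomplete is off": 10,
    "auto-suggestion": 5,
})


def suggest(prefix, log=QUERY_LOG, k=3):
    """Map by prefix, then rank by popularity.

    One prefix maps to many scored candidates, so this is a relation
    with scores rather than a function with a single answer.
    """
    matches = [(query, count) for query, count in log.items()
               if query.startswith(prefix)]
    # Highest count first; ties broken alphabetically to stay deterministic.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:k]


print(suggest("autocomplete is"))
```

Swapping the `startswith` test for an occurrence test and the count for a similarity score would give the second pairing on the slide.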

Prefix, Occurrence
• Surface pattern
• Regular
• Context-free
• Context-sensitive
• Recursively blahblah……

Information?
• Surface patterns and……
• Imaginations
• Will get back to that later.

Popularity & Similarity
• Popular: famous or infamous?
• Off-topic
• Similarity
• Distance

Map & Rank
• Regular expression
• Edit distance

Regular expression
• [a-z]+
• Colours of cats and dogs.
• [^o]{2}
• cat|dog
• Colou?rs?
• Colors of cats and dogs.
• Color of a cat.
• <[A-Za-z][A-Za-z]*>
• <html>Colours of cats and dogs.</html>
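
The patterns above can be tried directly with Python's standard `re` module. The variable names are only for illustration; the patterns and sample sentences are the ones from the slide.

```python
import re

text = "Colours of cats and dogs."

# Runs of lowercase letters: the capital "C" is not part of any match.
lowercase_runs = re.findall(r"[a-z]+", text)

# Alternation: either literal word.
animals = re.findall(r"cat|dog", text)

# Non-overlapping pairs of characters that are not "o".
no_o_pairs = re.findall(r"[^o]{2}", text)

# Optional "u" and optional plural "s" cover several spellings.
spellings = [re.search(r"Colou?rs?", s).group()
             for s in (text, "Colors of cats and dogs.", "Color of a cat.")]

# Opening tags made of letters only; "</html>" does not match.
tags = re.findall(r"<[A-Za-z][A-Za-z]*>", "<html>" + text + "</html>")

print(lowercase_runs)  # ['olours', 'of', 'cats', 'and', 'dogs']
print(animals)         # ['cat', 'dog']
print(spellings)       # ['Colours', 'Colors', 'Color']
print(tags)            # ['<html>']
```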

Edit Distance
• Colors
• Delete s
• Color
• Insert u
• Colour
• Replace C with c
• colour
• Distance from Colors to colour: 3
(or 4 if the cost of replacing is 2)
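
A minimal dynamic-programming sketch of this distance, assuming insertions and deletions always cost 1. The function name and the `sub_cost` parameter are illustrative choices, not something from the slides.

```python
def edit_distance(source, target, sub_cost=1):
    """Levenshtein distance, keeping only two rows of the table.

    Insertions and deletions cost 1; substitutions cost `sub_cost`.
    """
    # Distance from the empty prefix of `source` to each prefix of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting the first i characters of `source`
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (0 if s_char == t_char else sub_cost)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("Colors", "colour"))              # 3
print(edit_distance("Colors", "colour", sub_cost=2))  # 4
```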

– One may ask
“What if I wanted to map 1, １, one, and ONE?”
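
One possible answer, sketched with the standard library: Unicode compatibility normalization (NFKC) folds the full-width digit, case folding handles “ONE”, and a lookup table handles the number word. The table `NUMBER_WORDS` and the function name are hypothetical and only cover this one example.

```python
import unicodedata

# Hypothetical lookup table from number words to digits.
NUMBER_WORDS = {"one": "1"}


def normalize_token(token):
    """Map surface variants of a token to one canonical form."""
    # NFKC turns compatibility characters such as full-width "１" into "1".
    folded = unicodedata.normalize("NFKC", token).casefold()
    # Anything not in the table is returned in its folded form.
    return NUMBER_WORDS.get(folded, folded)


print({normalize_token(t) for t in ["1", "１", "one", "ONE"]})  # {'1'}
```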

Normalization
• time flies like an arrow. fruit flies like bananas.
• Case restoration
• Time flies like an arrow. Fruit flies like bananas.
• Sentence segmentation
• time flies like an arrow.
• fruit flies like bananas.
• Word normalization: stemming or lemmatization?
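
The first two steps can be sketched naively as follows. This assumes sentences end with “.”, “!” or “?” followed by whitespace, so abbreviations such as “Dr.” would be split incorrectly; the function names are illustrative.

```python
import re


def segment(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def restore_case(sentence):
    """Capitalize only the first character and leave the rest untouched."""
    return sentence[:1].upper() + sentence[1:]


raw = "time flies like an arrow. fruit flies like bananas."
sentences = segment(raw)
print(sentences)
print(" ".join(restore_case(s) for s in sentences))
```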

Stemming
• Porter Stemmer (mainly suffix stripping)
• flies → fli
• bananas → banana
• How about “flies → fly”?
• Lemmatization
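
To see where “fli” comes from, here is a sketch of only the first plural-stripping rules (step 1a) of the Porter algorithm. It is not the full stemmer, and the function name is illustrative.

```python
def porter_step_1a(word):
    """Apply the first matching plural rule, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # flies -> fli
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # bananas -> banana
    return word


for w in ["flies", "bananas", "caresses", "caress"]:
    print(w, "->", porter_step_1a(w))
```

The rule is purely about surface suffixes, which is why it cannot produce “fly”.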

Lemmatization
• flies → fly
• better → good
• meeting
• meet?
• axes
• axe?
• axis?
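
The question marks above are the point: a lemmatizer needs a lexicon and usually a part of speech, and it can still end up with more than one candidate. A toy lookup makes this concrete; the mini-lexicon, the tag names, and the function are hypothetical.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> possible lemmas.
LEXICON = {
    ("flies", "V"): ["fly"],
    ("flies", "N"): ["fly"],
    ("better", "Adj"): ["good"],
    ("meeting", "V"): ["meet"],
    ("meeting", "N"): ["meeting"],
    ("axes", "N"): ["axe", "axis"],
}


def lemmatize(word, pos):
    """Return every lemma the lexicon allows; fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), [word])


print(lemmatize("better", "Adj"))  # ['good']
print(lemmatize("meeting", "V"))   # ['meet']
print(lemmatize("meeting", "N"))   # ['meeting']
print(lemmatize("axes", "N"))      # ['axe', 'axis']
```

Choosing between “axe” and “axis” needs context, which a lookup table alone does not have.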

Stemming or lemmatization, which is better?
“Battlestar Galactica is frakking wierd.”

Spelling Correction
• Hello again, edit distance.
• Just one step (a transposition) from “wierd” to “weird”
• Language modeling
• “Battlestar Galactica” often comes with “frak”
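
A rough sketch of how the two ideas combine: generate every string one edit away (counting a swap of adjacent letters as one edit), keep the known words, and let a unigram language model pick the winner. The word counts are invented for illustration, and the function names are not from the slides.

```python
from collections import Counter

# Hypothetical word counts standing in for a corpus-derived unigram model.
WORD_COUNTS = Counter({"weird": 40, "wired": 25, "wield": 5, "is": 900})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=WORD_COUNTS):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=lambda w: counts[w]) if candidates else word


print(correct("wierd"))  # weird
```

With these counts, “weird”, “wired”, and “wield” are all one edit from “wierd”; frequency decides among them. A model that also looked at neighbouring words could make use of cues such as “Battlestar Galactica” near “frak”.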

Language modeling
• Information (entropy) about encoding
• Horse race analogy, assuming winners were
• B A C B C C D C
• P(A) = 1/8, P(B) = 2/8, P(C) = 4/8, P(D) = 1/8
• C = 0(00), B = 10(0), A = 110, D = 111
• n-gram
• B won fewer times than C, but what if B always won when A was next to D?
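
The numbers in the horse race analogy can be checked with a short script. It estimates the probabilities from the winner sequence, computes the entropy in bits, compares it with the average length of the variable-length code above (ignoring the parenthesized padding), and counts bigrams as a first step toward an n-gram model. Variable names are illustrative.

```python
import math
from collections import Counter

winners = "B A C B C C D C".split()
counts = Counter(winners)
total = len(winners)
probabilities = {horse: n / total for horse, n in counts.items()}

# Entropy: the lower bound on the average number of bits per winner.
entropy = -sum(p * math.log2(p) for p in probabilities.values())

# The variable-length code from the slide, without the parenthesized padding.
codes = {"C": "0", "B": "10", "A": "110", "D": "111"}
average_length = sum(probabilities[h] * len(code) for h, code in codes.items())

# Bigram counts: how often each winner is followed by each other winner.
bigrams = Counter(zip(winners, winners[1:]))

print(entropy)         # 1.75
print(average_length)  # 1.75
print(bigrams.most_common(3))
```

Here the code length matches the entropy exactly because every probability is a power of 1/2.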

Are we doing well?
Evaluate it!

BLEU Score
• Horse race analogy
• “B A C B C C D C” vs. “C C C C C C C C”
• Sequence precision: 4/8 = 0.5
• Unigram precision (as long as a unigram matched): 8/8 = 1
• When “natural-ness” matters
• “there is a cat on the mat | the cat is on the mat” vs. “the the the the the the the”
• Sequence precision: ?
• Unigram precision: 7/7 = 1
• Modified unigram precision: 2/7
• Modified bigram precision: 0/6
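
Modified n-gram precision clips each candidate n-gram count at the largest count seen in any single reference. The sketch below covers only that component of BLEU (no brevity penalty, no averaging over n) and uses a candidate of seven repetitions of “the”, which contains seven unigrams and six bigrams. Function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the largest count found in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


references = ["there is a cat on the mat".split(),
              "the cat is on the mat".split()]
candidate = "the the the the the the the".split()

print(modified_precision(candidate, references, 1))  # (2, 7)
print(modified_precision(candidate, references, 2))  # (0, 6)
```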

I want more info
Less is more.

Imagine there’s……
• No more vocabularies than
• N
• V
• Adj
• Adv

Read the signs
• Morphology? Word-formation? Part-of-speech?
• They are sequential structures.
• Remember this?

Sequential Structures
• Morphological typology
• Analytic (isolating)
• Chinese
• Synthetic
• Agglutinative
• Japanese, Korean
• Fusional (inflecting)
• Arabic, English, French, Italian, Spanish
• Syntax? Morphosyntax?
• Morphological word? Prosodic word?

Read my lips
It’s not only about sound

Transliteration is not……
• Romanization
• Transcription

Transliteration
• Alignment
• Alignment
• Alignment
• [Figure: excerpt from a cited paper on alignment. It discusses which alignments are possible, noting that units can combine to produce a single phoneme and that a single letter can sometimes produce two phonemes, and it shows an English word aligned with its Chinese transliteration in segments (A | BE | RT).]

The name of the rose
Sounds negative? Let’s try it anyway……

Sentiment Analysis
• Classification
• Polarity
• やばい
• Subjectivity
• In my opinion……
• Emotion

Semantics?
• Classification vs.
• Ranking (as we’ve seen so far)
• Clustering
• Regression
• ……

Summarization
• Extraction
• Classification
• Discriminative
• Abstraction
• Aggregation
• Generative

Classification
• Surface patterns and ……
• Imaginations

Machine Learning
• Generative models
• Hidden-Markov models
• Language models
• Discriminative models
• Support Vector Machine
• Logistic Regression
• Conditional Random Fields
• Maximum Entropy
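
As a small instance of the generative family, here is a sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not one of the models listed above, just a compact relative of the language-model idea: one unigram model per class plus a class prior. The training examples and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled examples, for illustration only.
TRAINING = [
    ("pos", "great fun great cast"),
    ("pos", "weird but fun"),
    ("neg", "boring and weird"),
    ("neg", "boring plot"),
]


def train(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return P(label | text) for every label."""
    label_counts, word_counts, vocabulary = model
    n_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_label in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # class prior
        for word in text.split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the joint scores into posterior probabilities.
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


model = train(TRAINING)
print(predict("fun cast", model))
print(predict("boring", model))
```

The returned probabilities are the kind of normalized score that can be reported alongside a predicted label.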

Confidence Score
• Confidence interval? Confidence level?
• Not really
• But it can be
• Just a buzzword from speech recognition
• Shannon’s game
• Hidden-Markov models
• Generative
• The Italian who went to Malta
• Can be any reasonable score
• Mostly probability

Wrap up
https://class.coursera.org/nlp/lecture
