10. Regular expression
• [a-z]+
• Colours of cats and dogs.
• [^o]{2}
• Colours of cats and dogs.
• cat|dog
• Colours of cats and dogs.
• Colou?rs?
• Colours of cats and dogs.
• Colors of cats and dogs.
• Color of a cat.
• <[A-Za-z][A-Za-z]*>
• <html>Colours of cats and dogs.</html>
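The patterns on this slide can be tried directly with Python's standard `re` module. This is a minimal sketch: the variable names are illustrative and not part of the original slides.

```python
import re

text = "Colours of cats and dogs."

# [a-z]+ : runs of one or more lowercase letters (the capital "C" is skipped)
lower_runs = re.findall(r"[a-z]+", text)

# [^o]{2} : two consecutive characters, neither of which is "o"
non_o_pairs = re.findall(r"[^o]{2}", text)

# cat|dog : either literal string
animals = re.findall(r"cat|dog", text)

# Colou?rs? : the "u" and the trailing "s" are each optional
samples = [
    "Colours of cats and dogs.",
    "Colors of cats and dogs.",
    "Color of a cat.",
]
variants = [re.match(r"Colou?rs?", s).group() for s in samples]

# <[A-Za-z][A-Za-z]*> : "<", one or more letters, ">" (the closing tag has "/")
tags = re.findall(r"<[A-Za-z][A-Za-z]*>", "<html>Colours of cats and dogs.</html>")

print(lower_runs, non_o_pairs, animals, variants, tags, sep="\n")
```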
11. Edit Distance
• Colors
• Delete s
• Color
• Insert u
• Colour
• Replace C with c
• colour
• Distance from Colors to colour: 3
(or 4 if the cost of replacing is 2)
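The two distances quoted above can be checked with a short dynamic-programming sketch of Levenshtein distance. The function name `edit_distance` and the `sub_cost` parameter are illustrative choices, not taken from the slides.

```python
def edit_distance(source, target, sub_cost=1):
    """Levenshtein distance with unit insert/delete cost and a
    configurable substitution cost (assumed parameter name)."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # deleting i characters turns the prefix into ""
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,      # delete s_ch
                curr[j - 1] + 1,  # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost),  # keep or replace
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Colors", "colour"))              # replace costs 1
print(edit_distance("Colors", "colour", sub_cost=2))  # replace costs 2
```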
12. One may ask
“What if I wanted to map 1,1, one, and ONE?”
13. Normalization
• time flies like an arrow. fruit flies like bananas.
• Case restoration
• Time flies like an arrow. Fruit flies like bananas.
• Sentence segmentation
• time flies like an arrow.
• fruit flies like bananas.
• Word normalization: stemming or lemmatization?
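A deliberately naive sketch of the first two steps, using only the standard library. The helper names `segment_sentences` and `restore_case` are hypothetical, and the rules are simplifications: real text needs handling for abbreviations, proper nouns, and so on.

```python
import re


def segment_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def restore_case(sentence):
    """Capitalize only the first character; leave the rest untouched."""
    return sentence[:1].upper() + sentence[1:]


raw = "time flies like an arrow. fruit flies like bananas."
sentences = segment_sentences(raw)
restored = " ".join(restore_case(s) for s in sentences)

print(sentences)
print(restored)
```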
17. Spelling Correction
• Hello again, edit distance.
• Just one step (a transposition) from “wierd” to “weird”
• Language modeling
• “Battlestar Galactica” often comes with “frak”
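“wierd” to “weird” is a single step only if swapping two adjacent letters counts as one edit; with insert, delete, and replace alone it takes two. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance, which is an assumption about what the slide intends. The function name is illustrative.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, replace,
    and transposition of two adjacent characters, each at cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # keep or replace
            )
            # Adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("wierd", "weird"))
```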
18. Language modeling
• Information (entropy) about encoding
• Horse race analogy, assuming winners were
• B A C B C C D C
• P(A) =1/8, P(B) = 2/8, P(C) = 4/8, P(D) = 1/8
• C = 0(00), B = 10(0), A = 110, D = 111
• n-gram
• B won fewer times than C, but what if B always
won when A was next to D?
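The numbers in the horse race analogy can be reproduced in a few lines. This sketch computes the probabilities from the winner sequence, the entropy in bits, and the average length of the variable-length code shown above (taking the shortest form of each codeword), then counts bigrams as the raw material for an n-gram model. Variable names are illustrative.

```python
import math
from collections import Counter

winners = "B A C B C C D C".split()
counts = Counter(winners)
total = len(winners)

# Relative frequencies: P(A)=1/8, P(B)=2/8, P(C)=4/8, P(D)=1/8
probs = {horse: c / total for horse, c in counts.items()}

# Entropy in bits: -sum p * log2(p)
entropy = -sum(p * math.log2(p) for p in probs.values())

# The prefix code from the slide: frequent winners get shorter codewords
codes = {"C": "0", "B": "10", "A": "110", "D": "111"}
avg_code_len = sum(probs[h] * len(codes[h]) for h in probs)
encoded = "".join(codes[h] for h in winners)

# Bigram counts: which winner follows which
bigrams = Counter(zip(winners, winners[1:]))

print(entropy, avg_code_len, len(encoded))
print(bigrams)
```

With these probabilities the average code length equals the entropy, so the code is as compact as any symbol-by-symbol code can be, compared with 2 bits per race for a fixed-length code.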
20. BLEU Score
• Horse race analogy
• “B A C B C C D C” vs. “C C C C C C C C”
• Sequence precision: 4/8 = 0.5
• Unigram precision (as long as a unigram matched): 8/8 = 1
• When “natural-ness” matters
• “there is a cat on the mat | the cat is on the mat” vs. “the the the the the the the”
• Sequence precision: ?
• Unigram precision: 7/7 = 1
• Modified unigram precision: 2/7
• Modified bigram precision: 0/6
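Modified n-gram precision clips each candidate n-gram count at the largest count seen in any single reference. The sketch below covers only this component of BLEU (no brevity penalty or geometric mean) and assumes a seven-token candidate, matching the denominator of 7 in the unigram figures. Note that a seven-token candidate contains only six bigrams. Function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the maximum count found in any one reference
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


references = [
    "there is a cat on the mat".split(),
    "the cat is on the mat".split(),
]
candidate = "the the the the the the the".split()

print(modified_precision(candidate, references, 1))  # unigrams
print(modified_precision(candidate, references, 2))  # bigrams
```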
27. Transliteration
• Alignment
• [Figure: cropped excerpt from a paper on letter-to-phoneme alignment]
• Alignment is not always one letter to one phoneme: several letters may combine to produce a single phoneme, and a single letter can sometimes produce two phonemes
• Example: an English word and its Chinese transliteration, aligned in the segments A | BE | RT [15]
28. The name of the rose
Sounds negative? Let’s try it anyway...
33. Machine Learning
• Generative models
• Hidden-Markov models
• Language models
• Discriminative models
• Support Vector Machine
• Logistic Regression
• Conditional Random Fields
• Maximum Entropy
34. Confidence Score
• Confidence interval? Confidence level?
• Not really
• But it can be
• Just a buzzword from speech recognition
• Shannon’s game
• Hidden-Markov models
• Generative
• The Italian who went to Malta
• Can be any reasonable score
• Mostly probability