Cl.week5-6


  1. Topics in Computational Linguistics, Week 5: N-grams and the language model
      Shu-Kai Hsieh, Lab of Ontologies, Language Processing and e-Humanities, GIL, National Taiwan University
      March 28, 2014
  2. Outline
      1 N-grams model (Evaluation; Smoothing Techniques)
      2 Web-scaled N-grams
      3 Related Topics
      4 The Entropy of Natural Languages
      5 Lab
  3. Language models
      • Statistical/probabilistic language models aim to compute either the probability of a sentence or sequence of words, P(S) = P(w_1, w_2, w_3, ..., w_n), or the probability of the upcoming word, P(w_n | w_1, w_2, w_3, ..., w_{n-1}), which turns out to be closely related to computing the probability of a sequence of words.
      • The N-gram model is one of the most important tools in speech and language processing.
      • Varied applications: spelling checking, MT, speech recognition, QA, etc.
  4. Outline (section: N-grams model)
  5. Simple n-gram model
      • Let's start by calculating P(S), say, P(S) = P(學, 語言, 很, 有趣) ('learning languages is fun').
  6. Review of Joint and Conditional Probability
      • Recall that the conditional probability of X given Y, P(X|Y), is defined in terms of the probability of Y, P(Y), and the joint probability of X and Y, P(X, Y):
          P(X|Y) = P(X, Y) / P(Y)
  7. Review of the Chain Rule of Probability
      Conversely, the joint probability P(X, Y) can be expressed in terms of the conditional probability P(X|Y):
          P(X, Y) = P(X|Y) P(Y)
      which leads to the chain rule
          P(X_1, X_2, X_3, ..., X_n) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) ... P(X_n|X_1, ..., X_{n-1}) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, ..., X_{i-1})
  8. The Chain Rule applied to the joint probability of the words in a sentence
      Chain rule of probability:
          P(S) = P(w_1^n) = P(w_1) P(w_2|w_1) P(w_3|w_1^2) ... P(w_n|w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})
      For our example sentence:
          P(S) = P(學) × P(語言|學) × P(很|學 語言) × P(有趣|學 語言 很)
  9. How to Estimate these Probabilities?
      • Maximum Likelihood Estimation (MLE): simply count in a corpus and normalize the counts so that they lie between 0 and 1. (There are of course more sophisticated algorithms.)
      Count and divide:
          P(嗎 | 學 語言 很 有趣) = Count(學 語言 很 有趣 嗎) / Count(學 語言 很 有趣)
      (MLE estimates are sometimes called relative frequencies.)
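      A minimal Python sketch of the count-and-divide (MLE) estimate above; the toy corpus and the helper count_ngram are illustrative assumptions, not part of the slides:

          corpus = [
              ["學", "語言", "很", "有趣", "嗎"],
              ["學", "語言", "很", "有趣"],
              ["學", "梵文", "很", "有趣"],
          ]

          def count_ngram(sentences, ngram):
              """Count how often a token sequence occurs across all sentences."""
              n = len(ngram)
              return sum(1 for sent in sentences
                           for i in range(len(sent) - n + 1)
                           if tuple(sent[i:i + n]) == tuple(ngram))

          history = ("學", "語言", "很", "有趣")
          # P(嗎 | 學 語言 很 有趣) = Count(學 語言 很 有趣 嗎) / Count(學 語言 很 有趣)
          p = count_ngram(corpus, history + ("嗎",)) / count_ngram(corpus, history)
          print(p)  # 0.5 on this toy corpus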
  10. Markov Assumption: Don't look too far into the past
      Simplified idea: instead of computing the probability of a word given its entire history, we approximate the history by just the last few words.
          P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 有趣)
      or
          P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 很 有趣)
  11. In other words
      • Bi-gram model: approximate the probability of a word given all the previous words, P(w_n | w_1^{n-1}), by the conditional probability given only the preceding word, P(w_n | w_{n-1}). This generalizes to
          P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
      • Tri-gram: (your turn)
      • We can extend to trigrams, 4-grams, 5-grams, knowing that in general this is an insufficient model of language, because language has long-distance dependencies:
          我 在 一 個 非常 奇特 的 機緣巧合 之下 學 梵文 ('It was under a very peculiar coincidence that I learned Sanskrit.')
  12. In other words
      • So given the bi-gram assumption for the probability of an individual word, we can compute the probability of the entire sentence as
          P(S) = P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
      • Recall the MLE equations (4.13)-(4.14) in the JM book.
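      A minimal sketch of a bigram MLE model scoring a sentence; the toy corpus and the <s>/</s> padding convention are illustrative assumptions:

          from collections import Counter

          corpus = [
              ["<s>", "學", "語言", "很", "有趣", "</s>"],
              ["<s>", "學", "語言", "很", "難", "</s>"],
          ]

          unigram_counts = Counter(w for sent in corpus for w in sent)
          bigram_counts = Counter((sent[i], sent[i + 1])
                                  for sent in corpus for i in range(len(sent) - 1))

          def p_bigram(w, prev):
              """MLE estimate: P(w | prev) = Count(prev, w) / Count(prev)."""
              return bigram_counts[(prev, w)] / unigram_counts[prev]

          def p_sentence(words):
              """P(S) under the bigram approximation: product of P(w_k | w_{k-1})."""
              padded = ["<s>"] + words + ["</s>"]
              prob = 1.0
              for prev, w in zip(padded, padded[1:]):
                  prob *= p_bigram(w, prev)
              return prob

          print(p_sentence(["學", "語言", "很", "有趣"]))  # 0.5 on this toy corpus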
  13. Example: Language Modeling of Alice.txt
  14. (Alice.txt example, continued; figure-only slide)
  15. Exercise
      • Walk through the example of the Berkeley Restaurant Project sentences (pp. 90-91).
      By the way, in practice we do everything in log space to avoid underflow (adding is also faster than multiplying):
          log(p1 * p2 * p3) = log p1 + log p2 + log p3
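      A minimal sketch of the log-space trick, reusing the hypothetical p_bigram from the bigram sketch above:

          import math

          def logp_sentence(words, p_bigram):
              """Sum log probabilities instead of multiplying probabilities."""
              padded = ["<s>"] + words + ["</s>"]
              return sum(math.log(p_bigram(w, prev))
                         for prev, w in zip(padded, padded[1:]))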
  16. Google n-gram and Google Suggestion
  17. Generating the Wall Street Journal vs. Generating Shakespeare
  18. Generating the Wall Street Journal vs. Generating Shakespeare (continued)
  19. • Quadrigrams look like Shakespeare because it is Shakespeare.
      • The N-gram model is very sensitive to the training corpus: the overfitting issue.
      • N-grams only work well for word prediction if the test corpus looks like the training corpus, but in real life it often doesn't.
      • We need to train a more robust model that generalizes, e.g. to handle the zeros issue: things that never occur in the training set but do occur in the test set.
  20. Outline (subsection: Evaluation)
  21. Evaluating n-gram models
      How good is our model? How can we make it better (more robust)?
      • N-gram language models are evaluated by separating the corpus into a training set and a test set, training the model on the training set, and evaluating it on the test set. An evaluation metric tells us how well our model does on the test set.
      • Extrinsic (in vivo) evaluation.
      • Intrinsic evaluation: perplexity (2^H of the language model on a test set is used to compare language models).
  22. Evaluating the N-gram Model
      The model relies heavily on the corpus it was trained on, and thus often overfits.
      Example
      • Given a vocabulary of 20,000 types, the potential number of bigrams is 20,000^2 = 400,000,000, and with trigrams it amounts to the astronomical figure of 20,000^3. No corpus yet has the size to cover the corresponding word combinations.
      • MLE gives no hint on how to estimate their probabilities.
      • Here we use smoothing (or discounting) techniques to estimate the probabilities of unseen n-grams, presumably because a distribution without zeros is smoother than one with zeros.
  23. Perplexity
      • The best language model is the one that best predicts an unseen test set (i.e., gives the highest P(sentence)).
      • Perplexity is defined as the inverse probability of the test set, normalized by the number of words:
          PP(W) = P(w_1 w_2 ... w_N)^(-1/N)
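      A minimal sketch of computing perplexity from per-sentence log probabilities; logp_sentence is the hypothetical helper from the earlier sketch:

          import math

          def perplexity(test_sentences, p_bigram):
              """PP = exp(-(1/N) * total log probability of the test set)."""
              total_logp, total_words = 0.0, 0
              for sent in test_sentences:
                  total_logp += logp_sentence(sent, p_bigram)
                  total_words += len(sent) + 1  # +1 for predicting </s>
              return math.exp(-total_logp / total_words)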
  24. The intuition of smoothing (from Dan Klein)
  25. Smoothing n-gram probabilities
      • Sparse data: the corpus is not big enough to cover all the bigrams with a realistic estimate.
      • Smoothing algorithms provide a better way of estimating the probabilities of n-grams than Maximum Likelihood Estimation.
  26. Smoothing Techniques
      • Laplace Smoothing (a.k.a. the add-one method)
      • Interpolation
      • Backoff
      • Good-Turing Estimation (Discounting)
      • Kneser-Ney Smoothing
  27. Laplace Smoothing
      • Pretend we saw each word one more time than we actually did.
      • Re-estimate the counts by just adding one to all the counts:
          P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
      • Read the BeRP examples (JM pp. 99-100).
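      A minimal sketch of add-one (Laplace) smoothing over bigram counts; the Counters are assumed to come from the earlier toy sketch:

          def p_laplace(w, prev, bigram_counts, unigram_counts, vocab_size):
              """P_Laplace(w | prev) = (Count(prev, w) + 1) / (Count(prev) + V)."""
              return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

          # vocab_size = len(unigram_counts) would be the V of the training vocabulary.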
  28. Laplace Smoothing: Comparing with Raw Bigram Counts
  29. Laplace Smoothing: It's a blunt estimate
      • Too much probability mass is moved to all the zeros. The tail wags the dog (喧賓奪主): to cover the huge number of zeros, the count of "Chinese food" can shrink by a factor of 10!
  30. (Katz) Backoff and Interpolation
      Intuition: sometimes it helps to use less context. Condition on less context for contexts you haven't learned much about.
      • Backoff and interpolation are two further strategies that use n-grams of variable length.
      • Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram.
      • Interpolation: mix unigram, bigram, and trigram.
  31. Katz Back-off
      • The idea is to use the frequency of the longest available n-gram; if that n-gram is unavailable, back off to the (n-1)-gram, then to the (n-2)-gram, and so on.
      • If n = 3, we first try trigrams, then bigrams, and finally unigrams.
  32. P* and α?
      • P*: a discounted probability rather than the MLE probability, e.g. from Good-Turing.
      • α: the normalizing factor (back-off weight).
  33. Linear Interpolation
      Linear interpolation takes a linear combination of higher-order and lower-order models, e.g.
          P_hat(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n), with λ_1 + λ_2 + λ_3 = 1
      • Simple interpolation
      • Lambdas conditioned on the context
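      A minimal sketch of simple linear interpolation; the lambda values are illustrative and would normally be tuned on held-out data:

          def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri,
                             lambdas=(0.5, 0.3, 0.2)):
              """Weights are (trigram, bigram, unigram) and must sum to 1."""
              l3, l2, l1 = lambdas
              return l3 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l1 * p_uni(w)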
  34. Advanced Discounting Techniques
      Intuition: use the count of things you've seen once to help estimate the count of things you've never seen.
      • Good-Turing
      • Witten-Bell
      • Kneser-Ney
  35. Good-Turing Smoothing: Notations
      • A word or N-gram (or any event) that occurs once is called a singleton or a hapax legomenon.
      • N_c: the number of things we've seen c times, i.e., the frequency of frequency c.
      Example (in terms of bigrams): N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), etc.
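      A minimal sketch of the frequency-of-frequency counts N_c and the Good-Turing re-estimated count c* = (c + 1) N_{c+1} / N_c; the bigram Counter is assumed from the earlier sketch:

          from collections import Counter

          def freq_of_freq(bigram_counts):
              """N_c: how many bigram types were seen exactly c times."""
              return Counter(bigram_counts.values())

          def good_turing_count(c, n_c):
              """Discounted count c* for items seen c times (needs N_c and N_{c+1} > 0)."""
              return (c + 1) * n_c[c + 1] / n_c[c]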
  36. Good-Turing Smoothing: Intuition ([2], pp. 101-102)
  37. Good-Turing Smoothing: Answer
  38. Other Advanced Smoothing Techniques
  39. Outline (section: Web-scaled N-grams)
  40. How to deal with huge web-scaled n-grams
      How might one build a language model (n-gram model) that scales to very large amounts of training data?
      • Naive pruning: only store N-grams with count ≥ a threshold, and remove singletons of higher-order n-grams.
      • Entropy-based pruning
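      A minimal sketch of naive count-threshold pruning; ngram_counts is an assumed mapping from n-gram tuples to counts:

          def prune(ngram_counts, threshold=2):
              """Keep only n-grams seen at least `threshold` times (drops singletons by default)."""
              return {ng: c for ng, c in ngram_counts.items() if c >= threshold}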
  41. Smoothing for Web-scaled N-grams
      "Standard backoff" uses variations of context-dependent backoff, where the p are pre-computed and stored probabilities and the λ are back-off weights.
  42. Smoothing for Web-scaled N-grams
      "Stupid backoff" [1] applies no discounting and instead uses the relative frequencies directly (S is used instead of P to emphasize that these are scores, not probabilities).
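      A minimal sketch of stupid backoff; the back-off factor 0.4 follows [1], but the data layout (a counts dictionary mapping token tuples of any order, with the empty tuple holding the total token count) is an illustrative assumption:

          def stupid_backoff(word, context, counts, alpha=0.4):
              """S(word | context): a relative-frequency score, not a normalized probability."""
              full = context + (word,)
              if counts.get(full, 0) > 0:
                  return counts[full] / counts[context]
              if not context:
                  return 0.0  # the word was never seen, even as a unigram
              return alpha * stupid_backoff(word, context[1:], counts, alpha)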
  43. LM Tools and n-gram Resources
      • CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
      • SRILM: http://www.speech.sri.com/projects/srilm/
      • Google Web1T 5-gram: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
      • Google Books N-grams
      • Chinese Web 5-gram: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06
  44. Quick demo of CMU-LM
  45. Google Books n-grams
  46. From Corpus-based to Google-based Linguistics
      Enhancing Linguistic Search with the Google Books Ngram Viewer
  47. From Corpus-based to Google-based Linguistics
      Syntactic N-grams are coming out too: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
  48. Exercise
      The Google Web 1T 5-Gram Database: SQLite Index & Web Interface
  49. Applications
      What can next-word prediction (based on probabilistic language models) do today? (source: fandywang, 2012)
  50. You'd definitely like to try this
      An Automatic CS Paper Generator: http://pdos.csail.mit.edu/scigen/
  51. Collocations
      • Collocations are recurrent combinations of words.
      Example
      • Simple collocations are fixed n-grams, such as "The Wall Street".
      • Collocations with predicative relations involve morpho-syntactic variation, such as the one linking make and decision: to make a decision, decisions to be made, made an important decision, etc.
  52. Collocations
      • Statistically, collocates are events that co-occur more often than by chance.
      • Measures used to calculate the strength of word preference are Mutual Information (MI), the t-score, and the likelihood ratio, e.g.
          MI(x, y) = log_2 ( P(x, y) / (P(x) P(y)) )
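      A minimal sketch of (pointwise) Mutual Information for a word pair, MI(x, y) = log_2(P(x, y) / (P(x) P(y))), with the counts passed in as assumed inputs:

          import math

          def mutual_information(count_xy, count_x, count_y, total_bigrams, total_words):
              """MI of the pair (x, y) from raw co-occurrence and unigram counts."""
              p_xy = count_xy / total_bigrams
              p_x = count_x / total_words
              p_y = count_y / total_words
              return math.log2(p_xy / (p_x * p_y))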
  53. Lab
      • ngramR for Google Books n-grams
      • Python NLTK (see the extra IPython notebook)
      Example
      • For newbies in Python: https://www.coursera.org/course/interactivepython
      • For a quick start (develop and host Python from your browser): https://www.pythonanywhere.com/
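      A minimal NLTK warm-up for the lab, counting bigrams in a toy sentence (the sample text is an illustrative assumption):

          import nltk
          from nltk.util import ngrams

          text = "learning a language is fun and learning a language takes time"
          bigram_freq = nltk.FreqDist(ngrams(text.split(), 2))
          print(bigram_freq.most_common(3))  # ('learning', 'a') and ('a', 'language') occur twice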
  54. Homework, week 5
      80%: exercise 4.3, JM book p. 122.
      20%: prepare chapter 5 of [2].
  55. Homework, week 6
      20%: read the Academia Sinica Balanced Corpus user manual (http://app.sinica.edu.tw/kiwi/mkiwi/98-04.pdf) and prepare chapter 6.
      80%: implement a language model over the 服貿 (service trade agreement) debate texts (data will be provided), and use it to build an automatic PRO/CON text generator.
  56. References
      [1] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.
      [2] Dan Jurafsky and James H. Martin. Speech & Language Processing. Pearson Education India, 2000.
