
- 1. [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
  [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
  2014/7/12 ACL Reading @ PFI
  Nakatani Shuyo, Cybozu Labs Inc.
- 2. Kneser-Ney Smoothing [Kneser+ 1995]
  • Discounting & Interpolation:
    $P(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{\max\left(c(w_{i-n+1}^{i}) - D,\, 0\right)}{c(w_{i-n+1}^{i-1})} + \dfrac{D}{c(w_{i-n+1}^{i-1})}\, N_{1+}(w_{i-n+1}^{i-1}\,\bullet) \cdot P(w_i \mid w_{i-n+2}^{i-1})$
  • where $w_m^n = w_m \cdots w_n$ and $N_{1+}(w_m^n\,\bullet) = \left|\{w_i : c(w_m^n w_i) > 0\}\right|$ (the number of n-gram types being discounted)
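A minimal sketch of this interpolated form for the bigram case (n = 2), with a single fixed discount D; the function and variable names are illustrative, not from the paper or any library:

```python
from collections import defaultdict

def kneser_ney_bigram(corpus, D=0.75):
    """Interpolated Kneser-Ney for a bigram model with one fixed discount D."""
    bigram_count = defaultdict(int)    # c(w_{i-1} w_i)
    context_count = defaultdict(int)   # c(w_{i-1} .)
    followers = defaultdict(set)       # distinct w_i after w_{i-1}  -> N_{1+}(w_{i-1} .)
    histories = defaultdict(set)       # distinct w_{i-1} before w_i -> N_{1+}(. w_i)

    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            bigram_count[(prev, cur)] += 1
            context_count[prev] += 1
            followers[prev].add(cur)
            histories[cur].add(prev)

    n_bigram_types = len(bigram_count)  # N_{1+}(. .)

    def prob(prev, cur):
        # lower-order continuation probability P(w_i) = N_{1+}(. w_i) / N_{1+}(. .)
        p_cont = len(histories[cur]) / n_bigram_types
        c_ctx = context_count[prev]
        if c_ctx == 0:                  # unseen context: back off completely
            return p_cont
        discounted = max(bigram_count[(prev, cur)] - D, 0) / c_ctx
        # interpolation weight (D / c(w_{i-1} .)) * N_{1+}(w_{i-1} .), as on the slide
        lam = D / c_ctx * len(followers[prev])
        return discounted + lam * p_cont

    return prob

# Usage on a toy corpus
p = kneser_ney_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("the", "cat"))
```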
- 3. Modified KN-Smoothing [Chen+ 1999]
  $P(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{c(w_{i-n+1}^{i}) - D\left(c(w_{i-n+1}^{i})\right)}{c(w_{i-n+1}^{i-1})} + \gamma(w_{i-n+1}^{i-1})\, P(w_i \mid w_{i-n+2}^{i-1})$
  • where $D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \ge 3 \end{cases}$
  • $\gamma(w_{i-n+1}^{i-1}) = \dfrac{\text{amount of discounting}}{c(w_{i-n+1}^{i-1})}$
  • Weighted discounting (the discounts $D_1, D_2, D_{3+}$ are estimated by leave-one-out cross-validation)
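The slide notes that the discounts come from leave-one-out; [Chen+ 1999] also gives the standard closed-form estimates computed from the counts-of-counts $n_1, \ldots, n_4$. A minimal sketch (identifiers are mine; it assumes $n_1, \ldots, n_4$ are all nonzero, as they are for any realistic corpus):

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Closed-form estimates of D1, D2, D3+ from counts-of-counts
    (Chen & Goodman's approximation to the leave-one-out solution)."""
    n = Counter(ngram_counts.values())  # n[r] = number of n-gram types seen exactly r times
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3plus

# Usage: ngram_counts maps each n-gram (a tuple of words) to its raw count, e.g.
# d1, d2, d3p = modified_kn_discounts(bigram_count)
```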
- 4. [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
  • When each sentence has a fractional weight, e.g.:
    – domain adaptation
    – the EM algorithm on word alignment
  • Proposes KN smoothing using expected fractional counts (I'm interested in it!)
- 5. Model
  • $\boldsymbol{u}$ means $w_{i-n+1}^{i-1}$, and $\boldsymbol{u}'$ means $w_{i-n+2}^{i-1}$
  • A sequence $\boldsymbol{u}w$ occurs $k$ times, and each occurrence carries a probability $p_i$ $(i = 1, \cdots, k)$ as its weight;
  • then the count $c(\boldsymbol{u}w)$ is distributed according to the Poisson binomial distribution:
  • $p(c(\boldsymbol{u}w) = r) = s(k, r)$, where
    $s(k, r) = \begin{cases} 1 & \text{if } k = r = 0 \\ s(k-1, r)(1 - p_k) + s(k-1, r-1)\, p_k & \text{if } 0 \le r \le k \\ 0 & \text{otherwise} \end{cases}$
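A minimal dynamic-programming sketch of this recursion (the function name is mine):

```python
def count_distribution(weights):
    """Distribution of the fractional count c(uw): the sequence uw occurs
    len(weights) times, the i-th occurrence kept with probability weights[i].
    Implements s(k, r) = s(k-1, r)(1 - p_k) + s(k-1, r-1) p_k with s(0, 0) = 1."""
    dist = [1.0]                        # s(0, 0) = 1
    for p in weights:
        new = [0.0] * (len(dist) + 1)
        for r, s in enumerate(dist):
            new[r] += s * (1.0 - p)     # k-th occurrence dropped
            new[r + 1] += s * p         # k-th occurrence kept
        dist = new
    return dist                         # dist[r] = p(c(uw) = r)

# Usage: three occurrences weighted 0.9, 0.5, 0.1
# count_distribution([0.9, 0.5, 0.1]) -> [p(c=0), p(c=1), p(c=2), p(c=3)]
```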
- 6. MLE on this model
  • Expectations:
    – $\mathbb{E}[c(\boldsymbol{u}w)] = \sum_r r \cdot p(c(\boldsymbol{u}w) = r)$
    – $\mathbb{E}[N_r(\boldsymbol{u}\,\bullet)] = \sum_w p(c(\boldsymbol{u}w) = r)$
    – $\mathbb{E}[N_{r+}(\boldsymbol{u}\,\bullet)] = \sum_w p(c(\boldsymbol{u}w) \ge r)$
  • Maximize the (expected) likelihood:
    – $\mathbb{E}[L] = \mathbb{E}\left[\sum_{\boldsymbol{u}w} c(\boldsymbol{u}w) \log p(w \mid \boldsymbol{u})\right] = \sum_{\boldsymbol{u}w} \mathbb{E}[c(\boldsymbol{u}w)] \log p(w \mid \boldsymbol{u})$
    – obtain $p_{\mathrm{MLE}}(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[c(\boldsymbol{u}w)]}{\mathbb{E}[c(\boldsymbol{u}\,\bullet)]}$
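Building on the previous sketch, the three expectations follow directly from the per-word count distributions. `dists` below is an assumed dict mapping each word w to `count_distribution(...)` for a fixed context u; the names are mine:

```python
def expected_count(dist):
    """E[c(uw)] = sum_r r * p(c(uw) = r)."""
    return sum(r * p for r, p in enumerate(dist))

def expected_N_r(dists, r):
    """E[N_r(u .)] = sum over word types w of p(c(uw) = r)."""
    return sum(d[r] if r < len(d) else 0.0 for d in dists.values())

def expected_N_r_plus(dists, r):
    """E[N_{r+}(u .)] = sum over word types w of p(c(uw) >= r)."""
    return sum(sum(d[r:]) for d in dists.values())

# E[c(u .)] is then sum(expected_count(d) for d in dists.values()).
```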
- 7. Expected Kneser-Ney
  • Adjusted count: $\tilde{c}(\boldsymbol{u}w) = \max\left(0,\, c(\boldsymbol{u}w) - D\right) + N_{1+}(\boldsymbol{u}\,\bullet)\, D\, p'(w \mid \boldsymbol{u}')$
  • So $\mathbb{E}[\tilde{c}(\boldsymbol{u}w)] = \mathbb{E}[c(\boldsymbol{u}w)] - p(c(\boldsymbol{u}w) > 0)\, D + \mathbb{E}[N_{1+}(\boldsymbol{u}\,\bullet)]\, D\, p'(w \mid \boldsymbol{u}')$
    – where $p'(w \mid \boldsymbol{u}') = \dfrac{\mathbb{E}[N_{1+}(\bullet\,\boldsymbol{u}' w)]}{\mathbb{E}[N_{1+}(\bullet\,\boldsymbol{u}'\,\bullet)]}$
  • then $p(w \mid \boldsymbol{u}) = \dfrac{\mathbb{E}[\tilde{c}(\boldsymbol{u}w)]}{\mathbb{E}[\tilde{c}(\boldsymbol{u}\,\bullet)]}$
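A sketch that combines the pieces above into the expected-KN probability for one context. It uses the observation that, since $p'$ sums to one over the vocabulary, the normalizer $\mathbb{E}[\tilde{c}(\boldsymbol{u}\,\bullet)]$ telescopes back to $\mathbb{E}[c(\boldsymbol{u}\,\bullet)]$; all identifiers are mine, not the paper's:

```python
def expected_kn_prob(dists, w, D, p_lower):
    """Expected KN probability p(w | u) for a fixed context u.
    dists[w] = count_distribution(...) for uw; p_lower = p'(w | u')."""
    d_w = dists[w]
    e_n1plus = expected_N_r_plus(dists, 1)       # E[N_{1+}(u .)]
    numerator = (expected_count(d_w)             # E[c(uw)]
                 - (1.0 - d_w[0]) * D            # p(c(uw) > 0) * D
                 + e_n1plus * D * p_lower)       # E[N_{1+}(u .)] * D * p'(w | u')
    # Discounted mass is exactly redistributed, so the normalizer is E[c(u .)].
    denominator = sum(expected_count(d) for d in dists.values())
    return numerator / denominator
```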
- 8. Language model adaptation
  • Our corpus consists of
    – large general-domain data and
    – small domain-specific data
  • Sentence $\boldsymbol{w}$'s weight:
    – $p(\boldsymbol{w} \text{ is in-domain}) = \dfrac{1}{1 + \exp(-H(\boldsymbol{w}))}$
    – where $H(\boldsymbol{w}) = \dfrac{\log p_{\mathrm{in}}(\boldsymbol{w}) - \log p_{\mathrm{out}}(\boldsymbol{w})}{|\boldsymbol{w}|}$
    – $p_{\mathrm{in}}$: language model trained on the in-domain data; $p_{\mathrm{out}}$: the out-of-domain one
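A minimal sketch of this sentence weighting, assuming `logprob_in` and `logprob_out` are callables returning the sentence's total log-probability under the in-domain and out-of-domain language models (both names are hypothetical):

```python
import math

def in_domain_weight(sentence, logprob_in, logprob_out):
    """p(sentence is in-domain) = sigmoid(H), with H the length-normalized
    difference of log-probabilities under the two language models."""
    h = (logprob_in(sentence) - logprob_out(sentence)) / len(sentence)
    return 1.0 / (1.0 + math.exp(-h))
```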
- 9. Figure 1: On the language model adaptation task, expected KN outperforms all other methods across all sizes of subsets selected from the general-domain data. Integral KN is applied to unweighted instances, while fractional WB, fractional KN and expected KN are applied to weighted instances. (via [Zhang+ ACL2014])
  • In-domain data: training 54k, testing 3k
  • Why isn't there Modified KN as a baseline?
- 10. [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
  • Higher-order n-grams are very sparse
    – especially noticeable on small data (e.g. domain-specific data!)
  • Improves performance on small data by combining skipped n-grams with Modified KN smoothing
    – perplexity is reduced by 25.7% on very small training data of only 736 KB of text
- 11. "Generalized Language Models"
  • $\partial_3\, w_1 w_2 w_3 w_4 = w_1 w_2\, \_\, w_4$
    – "_" means a word placeholder
  • $P_{\mathrm{GLM}}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{c(w_{i-n+1}^{i}) - D\left(c(w_{i-n+1}^{i})\right)}{c(w_{i-n+1}^{i-1})} + \gamma_{\mathrm{high}}(w_{i-n+1}^{i-1})\, \dfrac{1}{n-1} \sum_{j=1}^{n-1} P_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1})$
  • $P_{\mathrm{GLM}}(w_i \mid \partial_j w_{i-n+1}^{i-1}) = \dfrac{N_{1+}(\partial_j w_{i-n}^{i}) - D\left(c(\partial_j w_{i-n+1}^{i})\right)}{N_{1+}(\partial_j w_{i-n+1}^{i-1}\, *)} + \gamma_{\mathrm{mid}}(\partial_j w_{i-n+1}^{i-1})\, \dfrac{1}{n-2} \sum_{k=1,\, k \ne j}^{n-1} P_{\mathrm{GLM}}(w_i \mid \partial_j \partial_k w_{i-n+1}^{i-1})$
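A minimal sketch of the placeholder operator $\partial_j$ as it appears on the slide (replace the j-th word with "_"); the helper names are mine:

```python
def skip(j, ngram, placeholder="_"):
    """Replace the j-th word (1-based) of an n-gram with a placeholder,
    e.g. skip(3, ("w1", "w2", "w3", "w4")) -> ("w1", "w2", "_", "w4")."""
    return tuple(placeholder if i == j else w for i, w in enumerate(ngram, start=1))

def skipped_contexts(context):
    """All skip variants of a context, as enumerated by the 1/(n-1) sum in P_GLM."""
    return [skip(j, context) for j in range(1, len(context) + 1)]

# Usage:
# skipped_contexts(("w1", "w2", "w3"))
#   -> [('_', 'w2', 'w3'), ('w1', '_', 'w3'), ('w1', 'w2', '_')]
```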
- 12. • The bold arrows correspond to interpolation of models in traditional modified Kneser-Ney smoothing. The lighter arrows illustrate the additional interpolations introduced by our generalized language models. (via [Pickhardt+ ACL2014])
- 13. • Shrunk training data sets for the English Wikipedia (small domain-specific data)
- 14. Space complexity
  • model size = 9.5 GB, # of entries = 427M
  • model size = 15 GB, # of entries = 742M
- 15. References
  • [Zhang+ ACL2014] Kneser-Ney Smoothing on Expected Count
  • [Pickhardt+ ACL2014] A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser-Ney Smoothing
  • [Kneser+ 1995] Improved backing-off for m-gram language modeling
  • [Chen+ 1999] An Empirical Study of Smoothing Techniques for Language Modeling
