
# Construction and Evaluation of Language Models on a Large-Scale Japanese Blog Corpus


{yookuno, msassano}@yahoo-corp.jp

## 1 Introduction

N-gram language models are a fundamental component of statistical natural language processing [1, 2]. This work builds N-gram language models from a large-scale corpus of Japanese blog text collected from the Web and evaluates them.

## 2 Related Work

Smoothing techniques for N-gram models have been studied empirically [3] and scaled to tera-word corpora [4]. Large N-gram models have been built with MapReduce [5], pruned with streaming frequent-item methods [6], compressed with succinct LOUDS-based data structures [7], and served from distributed clusters [8].

## 3 Language Models

### 3.1 N-gram Models

Write a word sequence of length $n$ as $w_1^n = w_1, \ldots, w_n$. A language model assigns the sequence a probability $P(w_1^n)$.
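As a minimal illustration of the notation, the following Python sketch enumerates the N-grams of a word sequence $w_1^n$; whitespace tokenization and the English sentence are stand-ins for the output of a Japanese morphological analyzer (the paper uses mecab):

```python
def ngrams(words, n):
    """All length-n subsequences w_i .. w_{i+n-1} of a word sequence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy tokenized sentence standing in for morphological-analyzer output.
sentence = ["I", "read", "blogs", "every", "day"]
bigrams = ngrams(sentence, 2)
# bigrams == [("I", "read"), ("read", "blogs"), ("blogs", "every"), ("every", "day")]
```

A sequence of length $n$ contains $n - N + 1$ N-grams, which is why the counting loops below run to `size(words)-N+1`.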
The N-gram model [1] approximates each word's conditional probability by conditioning on only the preceding $N-1$ words:

$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1}) \quad (1)$$

The maximum-likelihood estimate of the conditional probability is

$$P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})} \quad (2)$$

where $C(w_i^j)$ denotes the frequency of the sequence $w_i^j$ in the training corpus. Because (2) assigns zero probability to any unseen N-gram, smoothing is required.

### 3.2 Dirichlet Smoothing

Dirichlet smoothing interpolates the N-gram counts with the (N-1)-gram model $P(w_i \mid w_{i-N+2}^{i-1})$ using a hierarchical Dirichlet prior with parameter $\alpha$ [9]:

$$P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i}) + \alpha P(w_i \mid w_{i-N+2}^{i-1})}{C(w_{i-N+1}^{i-1}) + \alpha} \quad (3)$$

The recursion bottoms out at the 1-gram model $P(w) = C(w)/C$, where $C$ is the total number of words in the corpus.

### 3.3 Absolute Discounting

Absolute discounting subtracts a constant $D$ from every observed count and backs off to the lower-order model. For a 3-gram $abc$,

$$P(c \mid ab) = \frac{\max(0, C(abc) - D) + D\, N(ab*)\, P(c \mid b)}{C(ab*)} \quad (4)$$

where $N(ab*)$ is the number of distinct words observed after the context $ab$.

### 3.4 Kneser-Ney Smoothing

Kneser-Ney smoothing [10] applies absolute discounting to context counts rather than raw counts:

$$P(c \mid ab) = \frac{\max(0, N({*}bc) - D) + D\, R({*}b{*})\, P(c \mid b)}{N({*}b{*})} \quad (5)$$

where $N({*}bc)$ is the number of distinct words preceding the 2-gram $bc$, $N({*}b{*}) = \sum_c N({*}bc)$, and $R({*}b{*}) = |\{c : N({*}bc) > 0\}|$.

### 3.5 Evaluation

Models are evaluated by the cross entropy over a test word sequence $w_1^n$:

$$H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1^{i-1}) \quad (6)$$

$H$ is measured in bits, and the perplexity is $PP = 2^H$.

### 3.6 Counting N-grams with MapReduce

Following [4], the N-gram frequencies $C(w_{i-N+1}^{i})$ required by the models above are computed with MapReduce.
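Equations (3) and (6) can be sketched in a few lines of Python. This is an illustrative toy implementation, not the paper's code: the corpus, the bigram order, and $\alpha = 1$ are all assumptions for the example.

```python
from collections import Counter
import math

def ngram_counts(words, n):
    """Count all n-grams (as tuples) in a word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def dirichlet_prob(w, context, counts, total, alpha=1.0):
    """Eq. (3): interpolate the N-gram estimate with the (N-1)-gram model.
    The recursion bottoms out at the unigram P(w) = C(w) / C."""
    if not context:
        return counts[1][(w,)] / total
    n = len(context) + 1
    num = counts[n][context + (w,)] + alpha * dirichlet_prob(w, context[1:], counts, total, alpha)
    den = counts[n - 1][context] + alpha
    return num / den

def cross_entropy(test_words, counts, total, n=2, alpha=1.0):
    """Eq. (6): H = -(1/n) sum_i log2 P(w_i | context); perplexity is 2**H."""
    h = 0.0
    for i, w in enumerate(test_words):
        context = tuple(test_words[max(0, i - n + 1):i])
        h -= math.log2(dirichlet_prob(w, context, counts, total, alpha))
    return h / len(test_words)

train = "a b a b a c".split()          # toy training corpus
counts = {n: ngram_counts(train, n) for n in (1, 2)}
total = len(train)
H = cross_entropy("a b a b".split(), counts, total, n=2, alpha=1.0)
```

A useful sanity check on (3) is that the smoothed probabilities over the vocabulary sum to one for any context that never ends a sentence, since the numerators then sum to exactly the denominator.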
Figure 1 shows the MapReduce pseudocode for N-gram counting:

```
Map(int id, string doc):
    string[] words = MorphologicalAnalyze(doc)
    for i = 1 to size(words)-N+1
        Emit(words[i..i+N-1], 1)

Reduce(string[] words, int[] counts):
    sum = 0
    for each count in counts
        sum += count
    Emit(words, sum)
```

Figure 1: N-gram counting with MapReduce [11]

## 4 Experiments

### 4.1 Comparison of Smoothing Methods

Documents are segmented into words with the morphological analyzer mecab 0.98. Table 1 compares the cross entropy of Dirichlet and Kneser-Ney smoothing on Wikipedia and blog data for N = 1 to 7. Kneser-Ney reaches its minimum at N = 3 on both corpora.

Table 1: Cross entropy (bit)

| N | Wikipedia, Dirichlet | Wikipedia, Kneser-Ney | Blog, Dirichlet | Blog, Kneser-Ney |
|---|---|---|---|---|
| 1 | 10.65 | 10.65 | 10.77 | 10.77 |
| 2 | 8.71 | 8.52 | 9.63 | 9.44 |
| 3 | 7.72 | 5.15 | 9.21 | 6.87 |
| 4 | 7.09 | 5.23 | 9.35 | 7.70 |
| 5 | 6.64 | 5.69 | 9.43 | 8.73 |
| 6 | 6.73 | 6.25 | 9.48 | 9.33 |
| 7 | 6.47 | 6.23 | 9.49 | 9.62 |

### 4.2 MapReduce Experiments

The blog data consists of one year of Yahoo! blog posts (October 2009 to October 2010), 2TB in LZO-compressed form. Counting is run with Hadoop on a cluster of 20 machines (1 master + 19 workers), each with 1 CPU, 12GB of memory, and 4 × 1TB HDDs. Table 2 shows the MapReduce running times on the 860GB and 2TB data sets.

Table 2: MapReduce running time (hours:minutes)

| Step | 860GB | 2TB |
|---|---|---|
| Morphological analysis | 9:50 | 28:16 |
| 1-gram | 2:14 | 7:42 |
| 2-gram | 3:34 | 13:45 |
| 3-gram | 5:02 | 20:43 |
| 4-gram | 8:58 | |
| 5-gram | 11:12 | |
| 6-gram | 13:00 | |
| 7-gram | 14:48 | |
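The pseudocode in Figure 1 can be simulated on a single machine in a few lines of Python; the shuffle step that Hadoop performs between Map and Reduce is modeled with a dictionary, and whitespace splitting (plus English toy documents) stands in for MorphologicalAnalyze:

```python
from collections import defaultdict

N = 2  # bigram counting for the example

def map_fn(doc_id, doc):
    """Map: emit (n-gram, 1) for every n-gram in the document."""
    words = doc.split()  # stand-in for MorphologicalAnalyze
    for i in range(len(words) - N + 1):
        yield tuple(words[i:i + N]), 1

def reduce_fn(ngram, partial_counts):
    """Reduce: sum the partial counts for one n-gram."""
    yield ngram, sum(partial_counts)

docs = {1: "the cat sat", 2: "the cat ran"}

# Shuffle: group the mappers' outputs by key, as the framework would.
groups = defaultdict(list)
for doc_id, doc in docs.items():
    for key, value in map_fn(doc_id, doc):
        groups[key].append(value)

counts = dict(pair for key, values in groups.items()
              for pair in reduce_fn(key, values))
# counts[("the", "cat")] == 2
```

Because each Reduce call sees every partial count for its key, the final dictionary holds the exact corpus-wide frequency $C(w_{i-N+1}^{i})$ of each N-gram.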
### 4.3 Count Cutoffs

To keep the model small, N-grams whose frequency in the blog data falls below a cutoff threshold are discarded. Table 3 shows the resulting cross entropy and model size for thresholds of 10000, 1000, and 100.

Table 3: Cross entropy (bit) and model size (byte) by cutoff threshold

| N | 10000 (bit) | 1000 (bit) | 100 (bit) | 10000 (byte) | 1000 (byte) | 100 (byte) |
|---|---|---|---|---|---|---|
| 1 | 16.25 | 17.21 | 17.80 | 2.8M | 9.1M | 40M |
| 2 | 7.71 | 6.48 | 7.66 | 21M | 127M | 683M |
| 3 | 8.88 | 6.41 | 6.51 | 30M | 293M | 2.5G |
| 4 | 8.93 | 6.71 | 6.18 | 23M | 201M | 3.6G |
| 5 | 8.66 | 6.20 | 5.97 | 15M | 232M | 3.5G |
| 6 | 8.28 | 5.98 | 5.74 | 8.2M | 160M | 1.6G |
| 7 | 7.81 | 5.68 | 5.65 | 5.2M | 113M | 1.1G |

## 5 Conclusion

With a cutoff threshold of 1000, the 1-gram through 7-gram counts total about 1.1GB, small enough to handle on a single PC, and the 7-gram model achieves a cross entropy of 5.68 bit.

## References

[1] (In Japanese), 1999.

[2] (In Japanese), Vol.40, No.7, pp.2946-2953, 1999.

[3] Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, 1998.

[4] Deniz Yuret. Smoothing a Tera-word Language Model. ACL-08: HLT, pp.141-144, June 2008.

[5] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Large Language Models in Machine Translation. EMNLP-CoNLL, pp.858-867, June 2007.

[6] Graham Cormode, Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. VLDB, Vol.1, Issue 2, August 2008.

[7] Taro Watanabe, Hajime Tsukada, Hideki Isozaki. A Succinct N-gram Language Model. ACL-IJCNLP, pp.341-344, August 2009.

[8] Ahmad Emami, Kishore Papineni, Jeffrey Sorensen. Large-Scale Distributed Language Model. ICASSP, pp.IV-37-IV-40, April 2007.

[9] David J. C. MacKay, Linda C. Bauman Peto. A Hierarchical Dirichlet Language Model. Natural Language Engineering, Vol.1, Issue 3, pp.289-308, 1995.

[10] Reinhard Kneser, Hermann Ney. Improved Backing-off for M-gram Language Modeling. ICASSP, Vol.1, pp.181-184, 1995.

[11] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, December 2004.

[12] (In Japanese), Web N-gram, 2007.