
大規模日本語ブログコーパスにおける言語モデルの構築と評価 (Construction and Evaluation of Language Models on a Large-Scale Japanese Blog Corpus)


  1. {yookuno, msassano}@yahoo-corp.jp

     1 Introduction
     The paper motivates building and evaluating N-gram language models over a large-scale Japanese blog corpus, citing [1] and [2].

     2 Related Work
     Prior work on Web-scale language modeling includes smoothing techniques for N-gram models [3], smoothing of a tera-word language model [4], distributed N-gram model construction with MapReduce [5], frequent-item counting over data streams [6], succinct N-gram storage based on LOUDS [7], and large-scale distributed language models [8].

     3 Language Models
     3.1 N-gram model
     For a word sequence w_1^n = w_1, ..., w_n, a language model assigns a probability P(w_1^n).
  2. The sentence probability factorizes by the chain rule and each conditional is approximated by conditioning on only the preceding N-1 words [1]:

       P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1})    (1)

     The maximum likelihood estimate of each conditional probability is

       P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})}    (2)

     where C(w_i^j) is the frequency of the word sequence w_i^j in the training corpus.

     3.2 Dirichlet smoothing
     Interpolating the N-gram counts with the (N-1)-gram distribution with weight α gives the hierarchical Dirichlet language model [9]:

       P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i}) + \alpha P(w_i \mid w_{i-N+2}^{i-1})}{C(w_{i-N+1}^{i-1}) + \alpha}    (3)

     The recursion bottoms out at the unigram model P(w) = C(w)/C, where C is the total number of words in the corpus.

     3.3 Absolute discounting
     A fixed discount D is subtracted from every count and the freed probability mass is redistributed through the lower-order model:

       P(c \mid ab) = \frac{\max(0, C(abc) - D) + D \, N(ab{*}) \, P(c \mid b)}{C(ab{*})}    (4)

     where N(ab*) is the number of distinct words that follow the context ab.

     3.4 Kneser-Ney smoothing
     Kneser-Ney smoothing [10] replaces the raw counts by continuation counts N(*bc), the number of distinct words that precede bc:

       P(c \mid ab) = \frac{\max(0, N({*}bc) - D) + D \, R({*}b{*}) \, P(c \mid b)}{N({*}b{*})}    (5)

       R({*}b{*}) = |\{ c : N({*}bc) > 0 \}|

     3.5 Evaluation
     Models are compared by the cross-entropy on test data w_1^n,

       H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1^{i-1})    (6)

     measured in bits per word; the corresponding perplexity is PP = 2^H.

     3.6 Counting N-grams with MapReduce
     The N-gram frequencies C(w_{i-N+1}^{i}) required by the models above are counted with MapReduce, as in [4]; the Map and Reduce functions are shown in Figure 1.
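     To make Eq. (3) and Eq. (6) concrete, the following Python sketch estimates Dirichlet-smoothed probabilities from raw k-gram counts. It is an illustration only, not the implementation used in the paper: the function names, the whitespace view of tokens, the default α = 1000, and the lack of unknown-word handling are assumptions made for this example.

       import math
       from collections import defaultdict

       def ngram_counts(tokens, max_n):
           """Count every k-gram for k = 1..max_n; the empty tuple stores the total token count C."""
           counts = defaultdict(int)
           counts[()] = len(tokens)
           for k in range(1, max_n + 1):
               for i in range(len(tokens) - k + 1):
                   counts[tuple(tokens[i:i + k])] += 1
           return counts

       def dirichlet_prob(word, context, counts, alpha=1000.0):
           """P(word | context) interpolated with the (N-1)-gram model as in Eq. (3).
           The recursion bottoms out at the unigram P(w) = C(w)/C; unseen words are not handled here."""
           if not context:
               return counts[(word,)] / counts[()]
           lower = dirichlet_prob(word, context[1:], counts, alpha)
           return (counts[context + (word,)] + alpha * lower) / (counts[context] + alpha)

       def cross_entropy(tokens, counts, n=3, alpha=1000.0):
           """Cross-entropy H in bits per word, Eq. (6); the perplexity is 2 ** H."""
           h = 0.0
           for i, word in enumerate(tokens):
               context = tuple(tokens[max(0, i - n + 1):i])
               h -= math.log2(dirichlet_prob(word, context, counts, alpha))
           return h / len(tokens)

     In a real evaluation the counts would come from the MapReduce job of Section 3.6, the cross-entropy would be measured on held-out test text, and α and the treatment of unknown words would have to be chosen carefully.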
  3. Figure 1: N-gram counting with MapReduce.

       Map(int id, string doc):
         string[] words = MorphologicalAnalyze(doc)
         for i = 1 to size(words) - N + 1
           Emit(words[i..i+N-1], 1)

       Reduce(string[] words, int[] counts):
         sum = 0
         for each count in counts
           sum += count
         Emit(words, sum)

     Table 1: Cross-entropy (bit) by N-gram order for the Wikipedia and Blog corpora, with Dirichlet and Kneser-Ney smoothing.

       N    Wikipedia                Blog
            Dirichlet  Kneser-Ney    Dirichlet  Kneser-Ney
       1    10.65      10.65         10.77      10.77
       2     8.71       8.52          9.63       9.44
       3     7.72       5.15          9.21       6.87
       4     7.09       5.23          9.35       7.70
       5     6.64       5.69          9.43       8.73
       6     6.73       6.25          9.48       9.33
       7     6.47       6.23          9.49       9.62

     4 Experiments
     4.1 Data
     The blog corpus consists of Yahoo! blog articles collected through the Yahoo! API from October 2009 to October 2010, about 2TB with LZO compression. Morphological analysis uses mecab 0.98. A Wikipedia corpus is used for comparison, and the Web Japanese N-gram corpus [12] is referenced.

     4.2 MapReduce
     The counting job of Figure 1 runs on a Hadoop [11] cluster of 20 machines (1 master + 19 workers), each with 1 CPU, 12GB of memory, and 4 x 1TB HDDs, following the usual Map, Shuffle, and Reduce phases.

     4.3 Results
     Table 2 lists the processing time for morphological analysis and for counting each N-gram order; Tables 1 and 3 report cross-entropy for the Dirichlet and Kneser-Ney models on the Wikipedia and blog data.

     Table 2: Processing time (hours:minutes) on the 860GB and 2TB corpora.

       Step                      860GB    2TB
       Morphological analysis     9:50    28:16
       1-gram                     2:14     7:42
       2-gram                     3:34    13:45
       3-gram                     5:02    20:43
       4-gram                     8:58
       5-gram                    11:12
       6-gram                    13:00
       7-gram                    14:48
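     The Map/Reduce pair of Figure 1 can be imitated in a single process to see the counting scheme end to end. The Python sketch below is only a local stand-in for the Hadoop job described in Section 4.2: whitespace splitting replaces the morphological analyzer (MeCab in the paper), an in-memory grouping step plays the role of the shuffle phase, and the document ids and N = 3 are arbitrary example values.

       from collections import defaultdict
       from itertools import chain

       N = 3  # order of the N-grams to count

       def map_doc(doc_id, doc):
           """Map step of Figure 1: emit (N-gram, 1) for every N-gram in one document."""
           words = doc.split()  # stand-in for MorphologicalAnalyze(doc)
           for i in range(len(words) - N + 1):
               yield tuple(words[i:i + N]), 1

       def shuffle(pairs):
           """Group the emitted values by key, as Hadoop's shuffle phase would."""
           grouped = defaultdict(list)
           for key, value in pairs:
               grouped[key].append(value)
           return grouped.items()

       def reduce_counts(ngram, counts):
           """Reduce step of Figure 1: sum the partial counts of one N-gram."""
           yield ngram, sum(counts)

       docs = {1: "a b c a b c", 2: "a b c d"}
       pairs = chain.from_iterable(map_doc(i, d) for i, d in docs.items())
       for ngram, counts in shuffle(pairs):
           for key, total in reduce_counts(ngram, counts):
               print(key, total)

     On the cluster of Section 4.2 each Map call runs over a shard of the corpus and Hadoop performs the shuffle and writes the reduced counts to the distributed file system, so the same three steps scale to the 2TB corpus of Table 2.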
  4. For the 860GB corpus, models up to 7-gram were built with frequency cutoffs of 100, 1000, and 10000 under Dirichlet smoothing; Table 3 shows the resulting cross-entropy and model size.

     Table 3: Cross-entropy (bit) and model size (byte) by frequency cutoff.

       N    Cross-entropy (bit)        Model size (byte)
            10000   1000    100        10000   1000   100
       1    16.25   17.21   17.80      2.8M    9.1M   40M
       2     7.71    6.48    7.66      21M     127M   683M
       3     8.88    6.41    6.51      30M     293M   2.5G
       4     8.93    6.71    6.18      23M     201M   3.6G
       5     8.66    6.20    5.97      15M     232M   3.5G
       6     8.28    5.98    5.74      8.2M    160M   1.6G
       7     7.81    5.68    5.65      5.2M    113M   1.1G

     5 Conclusion
     N-gram language models were built from the large-scale blog corpus with MapReduce and evaluated by cross-entropy. With a frequency cutoff of 1000 the 7-gram model reaches 5.68 bit, and even with a cutoff of 100 the 7-gram model is about 1.1GB, small enough to be handled on a single PC.

     References
     [1] (in Japanese), 1999.
     [2] (in Japanese), Vol.40, No.7, pp.2946-2953, 1999.
     [3] Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-98, Computer Science Group, Harvard University, 1998.
     [4] Deniz Yuret. Smoothing a Tera-word Language Model. ACL-08: HLT, pp.141-144, June 2008.
     [5] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Large Language Models in Machine Translation. EMNLP-CoNLL, pp.858-867, June 2007.
     [6] Graham Cormode, Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. VLDB, Vol.1, Issue 2, August 2008.
     [7] Taro Watanabe, Hajime Tsukada, Hideki Isozaki. A Succinct N-gram Language Model. ACL-IJCNLP, pp.341-344, August 2009.
     [8] Ahmad Emami, Kishore Papineni, Jeffrey Sorensen. Large-Scale Distributed Language Model. ICASSP, pp.IV-37 - IV-40, April 2007.
     [9] David J. C. MacKay, Linda C. Bauman Peto. A Hierarchical Dirichlet Language Model. Natural Language Engineering, Vol.1, Issue 3, pp.289-308, 1995.
     [10] R. Kneser, H. Ney. Improved Backing-off for M-gram Language Modeling. ICASSP, Vol.1, pp.181-184, 1995.
     [11] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, December 2004.
     [12] (in Japanese), Web Japanese N-gram, 2007.
