{yookuno, msassano}@yahoo-corp.jp

1 Introduction
(The introductory text was lost in extraction; it motivates building N-gram language models from large-scale data and cites the standard reference on statistical language models [1].)

2 Related Work

The use of Web text for language modeling is discussed in [2]. Smoothing techniques for N-gram models are compared empirically in [3]. Yuret [4] smooths a tera-word N-gram model, and Brants et al. [5] build large language models for machine translation on MapReduce. Streaming algorithms for finding frequent items [6] trade exact counts for bounded memory. On the storage side, succinct data structures such as LOUDS have been applied to compress N-gram language models [7], and a large-scale distributed language model architecture is described in [8].

3 N-gram Language Models

This section reviews N-gram models, the smoothing methods compared in this paper, the evaluation measure, and how the required frequencies are counted with MapReduce.
3.1 N-gram Models

The probability of a word sequence w_1^n = w_1, ..., w_n decomposes by the chain rule, and an N-gram model approximates each factor by conditioning only on the preceding N-1 words [1]:

    P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1})        (1)

The conditional probability P(w_i \mid w_{i-N+1}^{i-1}) can be estimated by maximum likelihood from corpus frequencies:

    P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})}        (2)

where C(w_i^j) is the frequency of the word sequence w_i^j in the training corpus. Estimate (2) assigns probability zero to every N-gram w_{i-N+1}^{i} that never occurs in the corpus, so in practice it must be smoothed.

3.2 Dirichlet Smoothing

Dirichlet smoothing interpolates the N-gram counts with the (N-1)-gram model, which corresponds to placing a Dirichlet prior over the word distribution of each context [9]:

    P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i}) + \alpha P(w_i \mid w_{i-N+2}^{i-1})}{C(w_{i-N+1}^{i-1}) + \alpha}        (3)

In (3), the lower-order probability P(w_i \mid w_{i-N+2}^{i-1}) is itself smoothed recursively in the same way. The recursion terminates at the 1-gram model P(w) = C(w)/C, where C is the total number of words in the corpus.
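As a concrete illustration of the recursion in (3), the following is a minimal Python sketch; the class name, corpus format, and the choice of a single shared alpha for all orders are our own assumptions, not part of the paper.

    from collections import defaultdict

    # Minimal sketch of Dirichlet smoothing, eq. (3).
    class DirichletLM:
        def __init__(self, sentences, N, alpha=1.0):
            self.N, self.alpha = N, alpha
            self.counts = defaultdict(int)  # C(.): frequency of each k-gram, k <= N
            self.total = 0                  # C: total number of tokens
            for words in sentences:
                self.total += len(words)
                for k in range(1, N + 1):
                    for i in range(len(words) - k + 1):
                        self.counts[tuple(words[i:i + k])] += 1

        def prob(self, word, context):
            """P(word | context), applying eq. (3) recursively."""
            context = tuple(context)[-(self.N - 1):] if self.N > 1 else ()
            if not context:
                # base case: P(w) = C(w) / C; unseen words get probability 0
                # here, whereas a real model would reserve mass for them
                return self.counts[(word,)] / self.total
            lower = self.prob(word, context[1:])  # shorter context
            num = self.counts[context + (word,)] + self.alpha * lower
            den = self.counts[context] + self.alpha
            return num / den

    lm = DirichletLM([["a", "b", "a", "b", "c"]], N=3, alpha=1.0)
    print(lm.prob("c", ["a", "b"]))  # smoothed trigram probability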
3.3 Absolute Discounting

Absolute discounting subtracts a fixed amount D from every observed count and redistributes the subtracted mass to the lower-order model [4]. Writing a trigram as abc, with C(ab*) = \sum_c C(abc) and N(ab*) = |{c : C(abc) > 0}|, the smoothed probability is

    P(c \mid ab) = \frac{\max(0, C(abc) - D) + D\,N(ab*)\,P(c \mid b)}{C(ab*)}        (4)

3.4 Kneser-Ney Smoothing

Kneser-Ney smoothing [10] replaces the raw frequencies in absolute discounting with N-gram type counts: a word is credited not for being frequent but for appearing after many distinct contexts. With N(*bc) = |{a : C(abc) > 0}|, N(*b*) = \sum_c N(*bc), and R(*b*) = |{c : N(*bc) > 0}|,

    P(c \mid ab) = \frac{\max(0, N(*bc) - D) + D\,R(*b*)\,P(c \mid b)}{N(*b*)}        (5)
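A short Python sketch of (4) may help; the function names and the backing-off bigram interface are our own. Equation (5) has exactly the same shape: replace C(abc), C(ab*), and N(ab*) with the type counts N(*bc), N(*b*), and R(*b*).

    from collections import defaultdict

    # Sketch of absolute discounting, eq. (4). trigram_counts maps
    # (a, b, c) -> C(abc); bigram_model(c, b) returns a normalized P(c | b).
    def make_absolute_discounting(trigram_counts, bigram_model, D=0.5):
        C_ab = defaultdict(int)   # C(ab*) = sum over c of C(abc)
        N_ab = defaultdict(int)   # N(ab*) = number of distinct c after ab
        for (a, b, c), n in trigram_counts.items():
            C_ab[(a, b)] += n
            N_ab[(a, b)] += 1

        def prob(c, a, b):
            if C_ab[(a, b)] == 0:            # unseen context: back off entirely
                return bigram_model(c, b)
            seen = max(0.0, trigram_counts.get((a, b, c), 0) - D)
            backoff = D * N_ab[(a, b)] * bigram_model(c, b)
            return (seen + backoff) / C_ab[(a, b)]

        return prob

Summing over all c shows the distribution normalizes: the discounted mass D N(ab*) is exactly what the backoff term adds back.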
3.5 Evaluation

Language models are evaluated by cross entropy on a held-out test corpus w_1^n:

    H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1^{i-1})        (6)

H is measured in bits per word, and perplexity is defined as PP = 2^H. Lower values mean the model predicts the test text better.
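Equation (6) translates directly into code. A small sketch with our own helper names, reusable with any conditional model such as the DirichletLM above (it assumes prob() never returns 0, i.e. the model is smoothed):

    import math

    # Sketch of eq. (6): cross entropy H (bits/word) and perplexity PP = 2^H.
    def cross_entropy(prob, words, N):
        h = 0.0
        for i, w in enumerate(words):
            context = words[max(0, i - (N - 1)):i]  # up to N-1 preceding words
            h -= math.log2(prob(w, context))
        return h / len(words)

    def perplexity(prob, words, N):
        return 2.0 ** cross_entropy(prob, words, N)

    # usage: perplexity(lm.prob, test_words, N=3)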
3.6 N-gram Counting with MapReduce

Building the models above requires the N-gram frequencies C(w_{i-N+1}^{i}), and for large corpora we compute them with MapReduce [11]. Figure 1 shows the pseudocode: Map morphologically analyzes each document and emits every N-gram with a count of 1; Reduce sums the counts emitted for each distinct N-gram.
Map(int id, string doc):
  string[] words = MorphologicalAnalyze(doc)
  for i = 1 to size(words) - N + 1
    Emit(words[i..i+N-1], 1)

Reduce(string[] words, int[] counts):
  sum = 0
  for each count in counts
    sum += count
  Emit(words, sum)
Figure 1: Counting N-gram frequencies with MapReduce (pseudocode).
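To make the dataflow concrete, here is a single-process Python simulation of Figure 1 (our own sketch; tokenize() stands in for the morphological analyzer, and the shuffle phase is modeled as in-memory grouping by key):

    from collections import defaultdict

    N = 3

    def tokenize(doc):
        return doc.split()  # placeholder for MorphologicalAnalyze(doc)

    def map_fn(doc_id, doc):
        words = tokenize(doc)
        for i in range(len(words) - N + 1):
            yield tuple(words[i:i + N]), 1      # Emit(N-gram, 1)

    def reduce_fn(ngram, counts):
        yield ngram, sum(counts)                # Emit(N-gram, total)

    def run(docs):
        groups = defaultdict(list)              # shuffle: group values by key
        for doc_id, doc in enumerate(docs):
            for key, value in map_fn(doc_id, doc):
                groups[key].append(value)
        return dict(kv for key, values in groups.items()
                    for kv in reduce_fn(key, values))

    print(run(["a b c a b c", "b c a b"]))      # {('a','b','c'): 2, ...}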
4 Experiments

4.1 Comparison of Smoothing Methods

We first compared Dirichlet and Kneser-Ney smoothing on Japanese Wikipedia text and on blog text, segmenting words with mecab 0.98 and holding the hyperparameters α and D fixed. Cross entropy was measured for N = 1, ..., 7 (cf. the Web Japanese N-gram corpus [12], which is likewise distributed up to 7-grams); Table 1 shows the results. Kneser-Ney reaches its best value at N = 3 on both corpora (5.15 bit on Wikipedia, 6.87 bit on blog text) and clearly outperforms Dirichlet smoothing there. For larger N the entropy rises again under both methods, and at N = 7 on blog text Dirichlet (9.49 bit) is in fact slightly better than Kneser-Ney (9.62 bit).

Table 1: Cross entropy (bit) by N-gram order, corpus, and smoothing method.

            Wikipedia                 Blog
  N    Dirichlet  Kneser-Ney   Dirichlet  Kneser-Ney
  1      10.65      10.65        10.77      10.77
  2       8.71       8.52         9.63       9.44
  3       7.72       5.15         9.21       6.87
  4       7.09       5.23         9.35       7.70
  5       6.64       5.69         9.43       8.73
  6       6.73       6.25         9.48       9.33
  7       6.47       6.23         9.49       9.62
4.2 Distributed Counting on Hadoop

We then counted N-grams from one year of blog articles (October 2009 to October 2010) collected at Yahoo! Japan and stored LZO-compressed, at two data sizes, 860GB and 2TB. Large-scale N-gram counting with MapReduce has also been reported in [5]. The job of Figure 1, consisting of Map, Shuffle, and Reduce phases, ran on a Hadoop cluster of 20 machines (1 master plus 19 workers), each with 1 CPU, 12GB of memory, and four 1TB HDDs. Table 2 shows the processing times: on 860GB all orders up to 7-grams completed, the 7-gram count taking 14:48, while on the full 2TB counting was carried out up to 3-grams.

Table 2: Processing time (h:mm) by data size.

  Step                860GB     2TB
  Word segmentation    9:50    28:16
  1-gram               2:14     7:42
  2-gram               3:34    13:45
  3-gram               5:02    20:43
  4-gram               8:58      -
  5-gram              11:12      -
  6-gram              13:00      -
  7-gram              14:48      -
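The paper does not say how the job was implemented on Hadoop; one minimal way to express Figure 1 is a pair of Hadoop Streaming scripts, sketched below with all names our own and split() standing in for the morphological analyzer. Hadoop Streaming sorts the mapper output by key before the reducer sees it, so equal N-grams arrive adjacent.

    #!/usr/bin/env python3
    # mapper.py: emit "<N-gram>\t1" for every N-gram in the input line.
    import sys

    N = 3

    for line in sys.stdin:
        words = line.split()
        for i in range(len(words) - N + 1):
            print(" ".join(words[i:i + N]) + "\t1")

    #!/usr/bin/env python3
    # reducer.py: sum counts of adjacent identical keys in sorted input.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").rsplit("\t", 1)
        if key != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(current + "\t" + str(total))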
4.3 Count Cutoffs

To keep the models small enough to deploy, N-grams whose frequency falls below a cutoff are discarded. We applied cutoffs of 10000, 1000, and 100 to the blog N-gram counts, built Dirichlet-smoothed models, and measured cross entropy and model size for N = 1, ..., 7. Table 3 shows the results, and the sketch below shows where a cutoff fits in the counting job.
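A cutoff can be applied inside Reduce so that rare N-grams never reach the output. A sketch, our own construction including the choice of ">=" for the threshold test, reusing reduce_fn's interface from the simulation above:

    CUTOFF = 1000

    def reduce_with_cutoff(ngram, counts):
        total = sum(counts)
        if total >= CUTOFF:   # assumption: keep N-grams at or above the cutoff
            yield ngram, total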
Table 3: Cross entropy (bit) and model size (byte) by cutoff.

        Cross entropy (bit)      Model size (byte)
  N    10000   1000    100      10000   1000    100
  1    16.25  17.21  17.80       2.8M   9.1M    40M
  2     7.71   6.48   7.66        21M   127M   683M
  3     8.88   6.41   6.51        30M   293M   2.5G
  4     8.93   6.71   6.18        23M   201M   3.6G
  5     8.66   6.20   5.97        15M   232M   3.5G
  6     8.28   5.98   5.74       8.2M   160M   1.6G
  7     7.81   5.68   5.65       5.2M   113M   1.1G

With the smallest cutoff of 100, the 7-gram model reaches the best cross entropy in Table 3 (5.65 bit) but grows to 1.1GB. Raising the cutoff to 1000 gives nearly the same cross entropy at 7-grams (5.68 bit) while shrinking the model to 113MB, a size that fits comfortably in the memory of a single PC of the 1GB class.

5 Conclusion

We built N-gram language models from large-scale Japanese text using MapReduce. The main findings are:

• On both Wikipedia and blog text, Kneser-Ney-smoothed 3-grams gave the lowest cross entropy (Table 1).
• On a 20-machine Hadoop cluster, 1- through 7-gram counts were extracted from 860GB of blog text, and up to 3-gram counts from the full 2TB (Table 2).
• With a count cutoff of 1000, a 7-gram model achieving 5.68 bit fits in about 113MB, small enough for a single PC (Table 3).

References

[1] (Japanese authors and title lost in extraction.) 1999.
[2] (Japanese authors and title lost in extraction.) Vol. 40, No. 7, pp. 2946-2953, 1999.
[3] Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-98, Computer Science Group, Harvard University, 1998.
[4] Deniz Yuret. Smoothing a Tera-word Language Model. ACL-08: HLT, pp. 141-144, June 2008.
[5] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Large Language Models in Machine Translation. EMNLP-CoNLL, pp. 858-867, June 2007.
[6] Graham Cormode, Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. VLDB, Vol. 1, Issue 2, August 2008.
[7] Taro Watanabe, Hajime Tsukada, Hideki Isozaki. A Succinct N-gram Language Model. ACL-IJCNLP, pp. 341-344, August 2009.
[8] Ahmad Emami, Kishore Papineni, Jeffrey Sorensen. Large-Scale Distributed Language Model. ICASSP, pp. IV-37 - IV-40, April 2007.
[9] David J. C. MacKay, Linda C. Bauman Peto. A Hierarchical Dirichlet Language Model. Natural Language Engineering, Vol. 1, No. 3, pp. 289-308, 1995.
[10] R. Kneser, H. Ney. Improved Backing-off for M-gram Language Modeling. ICASSP, Vol. 1, pp. 181-184, 1995.
[11] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, December 2004.
[12] (Japanese authors lost in extraction.) Web Japanese N-gram, 2007.