Construction and Evaluation of Language Models on a Large-Scale Japanese Blog Corpus
{yookuno, msassano}@yahoo-corp.jp


1

… 90 … [1] … [2] … Web … N-gram …

2

… Web … N-gram … [3] … [4] … MapReduce … [5] … [6] … LOUDS … [7] … N-gram … [8] …




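3

3.1 N-gram

Let $w_1^n = w_1, \ldots, w_n$ be a word sequence and $P(w_1^n)$ its probability. An N-gram model conditions each word on the preceding N−1 words [1]:

\[
P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1}) \tag{1}
\]

The conditional probability $P(w_i \mid w_{i-N+1}^{i-1})$ is estimated from counts:

\[
P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})} \tag{2}
\]

where $C(w_i^j)$ is the number of occurrences of $w_i^j$. With (2), the probability of $w_i$ after $w_{i-N+1}^{i-1}$ becomes 0 whenever the N-gram count is 0, which happens more often as N grows.

3.2 Dirichlet smoothing

Dirichlet smoothing estimates the N-gram probability $P(w_i \mid w_{i-N+1}^{i-1})$ with the help of the (N-1)-gram probability [9]:

\[
P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i}) + \alpha P(w_i \mid w_{i-N+2}^{i-1})}{C(w_{i-N+1}^{i-1}) + \alpha} \tag{3}
\]

In (3), the (N-1)-gram probability $P(w_i \mid w_{i-N+2}^{i-1})$ is in turn estimated with Dirichlet smoothing. The 1-gram probability is $P(w) = C(w) / C$, where $C$ is the total count.

3.3 Absolute discounting

Following [4], an N-gram $w_i^j$ is written abc, where a is the first word, c is the last word, and b is the words in between.

\[
P(c \mid ab) = \frac{\max(0, C(abc) - D) + D \, N(ab*) \, P(c \mid b)}{C(ab*)} \tag{4}
\]

Here D is the discount, $N(ab*)$ is the number of distinct words that follow ab, and $C(ab*)$ is the total count of N-grams that begin with ab.

3.4 Kneser-Ney smoothing

Kneser-Ney smoothing [10] builds on absolute discounting, using $N(*bc)$, the number of distinct words that precede bc, in place of the N-gram count:

\[
P(c \mid ab) = \frac{\max(0, N(*bc) - D) + D \, R(*b*) \, P(c \mid b)}{N(*b*)} \tag{5}
\]

where $R(*b*) = |\{c : N(*bc) > 0\}|$.

3.5

For a test word sequence $w_1^n$:

\[
H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1^{i-1}) \tag{6}
\]

H is measured in bits, and the perplexity is $PP = 2^H$.

3.6 N-gram counting with MapReduce

The N-gram counts $C(w_{i-N+1}^{i})$ are computed with MapReduce.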
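The Map and Reduce functions of Figure 1 can be tried out locally. The following is a minimal single-process sketch in Python, not the Hadoop job used in the paper. `morphological_analyze` is a hypothetical stand-in that splits on whitespace (a real setup would call a morphological analyzer such as mecab), and `run_mapreduce` is an illustrative helper that imitates the shuffle phase with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

NGram = Tuple[str, ...]


def morphological_analyze(doc: str) -> List[str]:
    # Hypothetical stand-in: assumes the text is already segmented by spaces.
    return doc.split()


def map_fn(doc_id: int, doc: str, n: int) -> Iterator[Tuple[NGram, int]]:
    # Emit (N-gram, 1) for every window of n consecutive words.
    words = morphological_analyze(doc)
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1


def reduce_fn(ngram: NGram, counts: List[int]) -> Iterator[Tuple[NGram, int]]:
    # Sum all partial counts that share the same N-gram key.
    yield ngram, sum(counts)


def run_mapreduce(docs: Dict[int, str], n: int = 2) -> Dict[NGram, int]:
    # Shuffle: group the mapper output by key, as the framework would do.
    shuffled: Dict[NGram, List[int]] = defaultdict(list)
    for doc_id, doc in docs.items():
        for key, value in map_fn(doc_id, doc, n):
            shuffled[key].append(value)
    # Reduce: one call per distinct key.
    result: Dict[NGram, int] = {}
    for key in sorted(shuffled):
        for ngram, total in reduce_fn(key, shuffled[key]):
            result[ngram] = total
    return result


counts = run_mapreduce({1: "a b a b", 2: "b a"}, n=2)
print(counts)
```

`counts` maps each N-gram tuple to its frequency over all documents; changing `n` yields other orders.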
Map(int id, string doc):
    string[] words = MorphologicalAnalyze(doc)
    for i = 1 to size(words)-N+1
        Emit(words[i..i+N-1], 1)

Reduce(string[] words, int[] counts):
    sum = 0
    for each count in counts
        sum += count
    Emit(words, sum)

Figure 1: N-gram counting with MapReduce

Table 1: Cross-entropy (bit) for each N

          Wikipedia                  Blog
N    Dirichlet  Kneser-Ney    Dirichlet  Kneser-Ney
1      10.65      10.65         10.77      10.77
2       8.71       8.52          9.63       9.44
3       7.72       5.15          9.21       6.87
4       7.09       5.23          9.35       7.70
5       6.64       5.69          9.43       8.73
6       6.73       6.25          9.48       9.33
7       6.47       6.23          9.49       9.62
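Table 1 reports values in bits, the unit of H in equation (6). As a rough illustration of how such a number can be obtained, the sketch below restricts everything to N = 2 and implements Dirichlet smoothing (equation (3)), absolute discounting (equation (4)) and the cross-entropy of equation (6). It does not implement Kneser-Ney smoothing. The class and function names, the toy corpus and the default values α = 1 and D = 1 are assumptions made for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BigramModel:
    unigram: Counter = field(default_factory=Counter)    # C(w)
    bigram: Counter = field(default_factory=Counter)     # C(v w)
    context: Counter = field(default_factory=Counter)    # C(v*) = sum over w of C(v w)
    followers: Counter = field(default_factory=Counter)  # N(v*) = number of distinct w after v
    total: int = 0                                       # C, the total word count

    def p1(self, w):
        """1-gram probability P(w) = C(w) / C."""
        return self.unigram[w] / self.total


def train(words):
    m = BigramModel(unigram=Counter(words),
                    bigram=Counter(zip(words, words[1:])),
                    total=len(words))
    for (v, _w), c in m.bigram.items():
        m.context[v] += c
        m.followers[v] += 1
    return m


def dirichlet(w, v, m, alpha=1.0):
    """Equation (3) for N = 2: P(w | v) backed by the 1-gram P(w)."""
    return (m.bigram[(v, w)] + alpha * m.p1(w)) / (m.context[v] + alpha)


def absolute(w, v, m, d=1.0):
    """Equation (4) for N = 2; an unseen context falls back to P(w)."""
    if m.context[v] == 0:
        return m.p1(w)
    numerator = max(0.0, m.bigram[(v, w)] - d) + d * m.followers[v] * m.p1(w)
    return numerator / m.context[v]


def cross_entropy(test, m, prob):
    """Equation (6): H in bits per word; the first word uses the 1-gram."""
    log_p = math.log2(m.p1(test[0]))
    for v, w in zip(test, test[1:]):
        log_p += math.log2(prob(w, v, m))
    return -log_p / len(test)


model = train("a b a b a c".split())
test = "a b a c".split()  # every test word occurs in the training data
h_dir = cross_entropy(test, model, dirichlet)
h_abs = cross_entropy(test, model, absolute)
print(f"Dirichlet: H = {h_dir:.2f} bit, PP = {2 ** h_dir:.2f}")
print(f"Absolute:  H = {h_abs:.2f} bit, PP = {2 ** h_abs:.2f}")
```

A word that never occurs in the training data has P(w) = 0 under this 1-gram estimate, so the sketch only evaluates test words that were seen in training.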
… MapReduce [11] … 1 … Map … Reduce … [5] … Map … Shuffle … Reduce … MapReduce … Hadoop …

4

4.1

… N … [12] … Wikipedia … 1000 … mecab 0.98 … α … D … 1 … 10000 … 10 … 1 …

    • … N … Wikipedia …
    • … Wikipedia … Kneser-Ney …

4.2

… Yahoo! … 2009 … 10 … 2010 … 10 … 1 … LZO … 2TB … Hadoop … 1CPU/12GB Memory/1TB*4 HDD … 20 … 1 + 19 … Yahoo! … API …

4.3

… LZO … N … 2 …

Table 2: … : …

            860GB      2TB
…            9:50    28:16
1-gram       2:14     7:42
2-gram       3:34    13:45
3-gram       5:02    20:43
4-gram       8:58
5-gram      11:12
6-gram      13:00
7-gram      14:48

… 2TB … 4-gram … 3 … 860GB … 1 … 7-gram … N … 1000 … Dirichlet … 100 … 10000 …
… N … N-gram … N-gram …

Table 3: … (bit) … (byte)

               (bit)                      (byte)
N     10000    1000     100      10000    1000     100
1     16.25   17.21   17.80       2.8M    9.1M     40M
2      7.71    6.48    7.66        21M    127M    683M
3      8.88    6.41    6.51        30M    293M    2.5G
4      8.93    6.71    6.18        23M    201M    3.6G
5      8.66    6.20    5.97        15M    232M    3.5G
6      8.28    5.98    5.74       8.2M    160M    1.6G
7      7.81    5.68    5.65       5.2M    113M    1.1G

… 3 … 1 … PC … PC … 1GB … 3 … 1000 … 1.1GB … 5.68bit …

5

… N-gram …

[1] …, …. …. …, 1999.
[2] …, …, …, …. …. …, Vol.40, No.7, pp.2946-2953, 1999.
[3] Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-09, Computer Science Group, Harvard University, 1998.
[4] Deniz Yuret. Smoothing a Tera-word Language Model. ACL-08: HLT, pp.141-144, June 2008.
[5] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Large Language Models in Machine Translation. EMNLP-ACL, pp.858-867, June 2007.
[6] Graham Cormode, Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. VLDB, vol.1 Issue 2, August 2008.
[7] Taro Watanabe, Hajime Tsukada, Hideki Isozaki. A Succinct N-gram Language Model. ACL-IJCNLP, pp.341-344, August 2009.
[8] Ahmad Emami, Kishore Papineni, Jeffrey Sorensen. Large-Scale Distributed Language Model. ICASSP, IV-37-IV-40, April 2007.
[9] David J. C. MacKay, Linda C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, vol.1 Issue 03, pp.289-308, 1995.
[10] Kneser R., Ney H. Improved backing-off for M-gram language modeling. ICASSP, pp.181-184, vol.1, 1995.
[11] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, December, 2004.
[12] …, …, Web … N …, …, 2007.
More Related Content

What's hot

An evaluation of gnss code and phase solutions
An evaluation of gnss code and phase solutionsAn evaluation of gnss code and phase solutions
An evaluation of gnss code and phase solutionsAlexander Decker
 
Study of the variation of power loss with frequency along a rectangular
Study of the variation of power loss with frequency along a rectangularStudy of the variation of power loss with frequency along a rectangular
Study of the variation of power loss with frequency along a rectangularIAEME Publication
 
Munich07 Foils
Munich07 FoilsMunich07 Foils
Munich07 FoilsAntonini
 
Module 13 Gradient And Area Under A Graph
Module 13  Gradient And Area Under A GraphModule 13  Gradient And Area Under A Graph
Module 13 Gradient And Area Under A Graphguestcc333c
 
Marking Scheme Worksheet 2
Marking Scheme Worksheet 2Marking Scheme Worksheet 2
Marking Scheme Worksheet 2Hira Rizvi
 
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization Problems
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization ProblemsEfficient Hill Climber for Constrained Pseudo-Boolean Optimization Problems
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization Problemsjfrchicanog
 
Module 11 Tansformation
Module 11  TansformationModule 11  Tansformation
Module 11 Tansformationguestcc333c
 
Signal Processing Course : Compressed Sensing
Signal Processing Course : Compressed SensingSignal Processing Course : Compressed Sensing
Signal Processing Course : Compressed SensingGabriel Peyré
 
A Dimension Abstraction Approach to Vectorization in Matlab
A Dimension Abstraction Approach to Vectorization in MatlabA Dimension Abstraction Approach to Vectorization in Matlab
A Dimension Abstraction Approach to Vectorization in MatlabaiQUANT
 
Surface Area
Surface Area Surface Area
Surface Area cadteach
 
Module 7 The Straight Lines
Module 7 The Straight LinesModule 7 The Straight Lines
Module 7 The Straight Linesguestcc333c
 
M schemes(work, energy and power)
M schemes(work, energy and power)M schemes(work, energy and power)
M schemes(work, energy and power)Hira Rizvi
 
Signal Processing Course : Approximation
Signal Processing Course : ApproximationSignal Processing Course : Approximation
Signal Processing Course : ApproximationGabriel Peyré
 
Bouguet's MatLab Camera Calibration Toolbox
Bouguet's MatLab Camera Calibration ToolboxBouguet's MatLab Camera Calibration Toolbox
Bouguet's MatLab Camera Calibration ToolboxYuji Oyamada
 
Sparsity and Compressed Sensing
Sparsity and Compressed SensingSparsity and Compressed Sensing
Sparsity and Compressed SensingGabriel Peyré
 

What's hot (20)

Future CMB Experiments
Future CMB ExperimentsFuture CMB Experiments
Future CMB Experiments
 
add mad
add madadd mad
add mad
 
An evaluation of gnss code and phase solutions
An evaluation of gnss code and phase solutionsAn evaluation of gnss code and phase solutions
An evaluation of gnss code and phase solutions
 
Study of the variation of power loss with frequency along a rectangular
Study of the variation of power loss with frequency along a rectangularStudy of the variation of power loss with frequency along a rectangular
Study of the variation of power loss with frequency along a rectangular
 
Munich07 Foils
Munich07 FoilsMunich07 Foils
Munich07 Foils
 
Module 13 Gradient And Area Under A Graph
Module 13  Gradient And Area Under A GraphModule 13  Gradient And Area Under A Graph
Module 13 Gradient And Area Under A Graph
 
Marking Scheme Worksheet 2
Marking Scheme Worksheet 2Marking Scheme Worksheet 2
Marking Scheme Worksheet 2
 
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization Problems
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization ProblemsEfficient Hill Climber for Constrained Pseudo-Boolean Optimization Problems
Efficient Hill Climber for Constrained Pseudo-Boolean Optimization Problems
 
Module 11 Tansformation
Module 11  TansformationModule 11  Tansformation
Module 11 Tansformation
 
Signal Processing Course : Compressed Sensing
Signal Processing Course : Compressed SensingSignal Processing Course : Compressed Sensing
Signal Processing Course : Compressed Sensing
 
03 image transform
03 image transform03 image transform
03 image transform
 
A Dimension Abstraction Approach to Vectorization in Matlab
A Dimension Abstraction Approach to Vectorization in MatlabA Dimension Abstraction Approach to Vectorization in Matlab
A Dimension Abstraction Approach to Vectorization in Matlab
 
Surface Area
Surface Area Surface Area
Surface Area
 
Module 7 The Straight Lines
Module 7 The Straight LinesModule 7 The Straight Lines
Module 7 The Straight Lines
 
M schemes(work, energy and power)
M schemes(work, energy and power)M schemes(work, energy and power)
M schemes(work, energy and power)
 
Signal Processing Course : Approximation
Signal Processing Course : ApproximationSignal Processing Course : Approximation
Signal Processing Course : Approximation
 
Presentation
PresentationPresentation
Presentation
 
Bouguet's MatLab Camera Calibration Toolbox
Bouguet's MatLab Camera Calibration ToolboxBouguet's MatLab Camera Calibration Toolbox
Bouguet's MatLab Camera Calibration Toolbox
 
Sparsity and Compressed Sensing
Sparsity and Compressed SensingSparsity and Compressed Sensing
Sparsity and Compressed Sensing
 
Module 5 Sets
Module 5 SetsModule 5 Sets
Module 5 Sets
 

Similar to 大規模日本語ブログコーパスにおける言語モデルの構築と評価

Form 5 formulae and note
Form 5 formulae and noteForm 5 formulae and note
Form 5 formulae and notesmktsj2
 
Formulario de matematicas
Formulario de matematicasFormulario de matematicas
Formulario de matematicasCarlos
 
5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spmzabidah awang
 
2 senarai rumus add maths k2 trial spm sbp 2010
2 senarai rumus add maths k2 trial spm sbp 20102 senarai rumus add maths k2 trial spm sbp 2010
2 senarai rumus add maths k2 trial spm sbp 2010zabidah awang
 
2 senarai rumus add maths k1 trial spm sbp 2010
2 senarai rumus add maths k1 trial spm sbp 20102 senarai rumus add maths k1 trial spm sbp 2010
2 senarai rumus add maths k1 trial spm sbp 2010zabidah awang
 
5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spmzabidah awang
 
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi Koivisto
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi KoivistoD-Branes and The Disformal Dark Sector - Danielle Wills and Tomi Koivisto
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi KoivistoCosmoAIMS Bassett
 

Similar to 大規模日本語ブログコーパスにおける言語モデルの構築と評価 (9)

Form 5 formulae and note
Form 5 formulae and noteForm 5 formulae and note
Form 5 formulae and note
 
Formulario de matematicas
Formulario de matematicasFormulario de matematicas
Formulario de matematicas
 
Cheat Sheet
Cheat SheetCheat Sheet
Cheat Sheet
 
確率伝播その2
確率伝播その2確率伝播その2
確率伝播その2
 
5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm
 
2 senarai rumus add maths k2 trial spm sbp 2010
2 senarai rumus add maths k2 trial spm sbp 20102 senarai rumus add maths k2 trial spm sbp 2010
2 senarai rumus add maths k2 trial spm sbp 2010
 
2 senarai rumus add maths k1 trial spm sbp 2010
2 senarai rumus add maths k1 trial spm sbp 20102 senarai rumus add maths k1 trial spm sbp 2010
2 senarai rumus add maths k1 trial spm sbp 2010
 
5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm5 marks scheme for add maths paper 2 trial spm
5 marks scheme for add maths paper 2 trial spm
 
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi Koivisto
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi KoivistoD-Branes and The Disformal Dark Sector - Danielle Wills and Tomi Koivisto
D-Branes and The Disformal Dark Sector - Danielle Wills and Tomi Koivisto
 

More from Yahoo!デベロッパーネットワーク

ヤフーでは開発迅速性と品質のバランスをどう取ってるか
ヤフーでは開発迅速性と品質のバランスをどう取ってるかヤフーでは開発迅速性と品質のバランスをどう取ってるか
ヤフーでは開発迅速性と品質のバランスをどう取ってるかYahoo!デベロッパーネットワーク
 
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2Yahoo!デベロッパーネットワーク
 
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtc
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtcヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtc
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtcYahoo!デベロッパーネットワーク
 
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtc
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtcYahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtc
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtcYahoo!デベロッパーネットワーク
 
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtc
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtcヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtc
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtcYahoo!デベロッパーネットワーク
 
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtcYahoo!デベロッパーネットワーク
 
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtc
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtcPC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtc
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtcYahoo!デベロッパーネットワーク
 
モブデザインによる多職種チームのコミュニケーション改善 #yjtc
モブデザインによる多職種チームのコミュニケーション改善 #yjtcモブデザインによる多職種チームのコミュニケーション改善 #yjtc
モブデザインによる多職種チームのコミュニケーション改善 #yjtcYahoo!デベロッパーネットワーク
 
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtc
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtcユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtc
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtcYahoo!デベロッパーネットワーク
 

More from Yahoo!デベロッパーネットワーク (20)

ゼロから始める転移学習
ゼロから始める転移学習ゼロから始める転移学習
ゼロから始める転移学習
 
継続的なモデルモニタリングを実現するKubernetes Operator
継続的なモデルモニタリングを実現するKubernetes Operator継続的なモデルモニタリングを実現するKubernetes Operator
継続的なモデルモニタリングを実現するKubernetes Operator
 
ヤフーでは開発迅速性と品質のバランスをどう取ってるか
ヤフーでは開発迅速性と品質のバランスをどう取ってるかヤフーでは開発迅速性と品質のバランスをどう取ってるか
ヤフーでは開発迅速性と品質のバランスをどう取ってるか
 
オンプレML基盤on Kubernetes パネルディスカッション
オンプレML基盤on Kubernetes パネルディスカッションオンプレML基盤on Kubernetes パネルディスカッション
オンプレML基盤on Kubernetes パネルディスカッション
 
LakeTahoe
LakeTahoeLakeTahoe
LakeTahoe
 
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
オンプレML基盤on Kubernetes 〜Yahoo! JAPAN AIPF〜
 
Persistent-memory-native Database High-availability Feature
Persistent-memory-native Database High-availability FeaturePersistent-memory-native Database High-availability Feature
Persistent-memory-native Database High-availability Feature
 
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2
データの価値を最大化させるためのデザイン~データビジュアライゼーションの方法~ #devsumi 17-E-2
 
eコマースと実店舗の相互利益を目指したデザイン #yjtc
eコマースと実店舗の相互利益を目指したデザイン #yjtceコマースと実店舗の相互利益を目指したデザイン #yjtc
eコマースと実店舗の相互利益を目指したデザイン #yjtc
 
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtc
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtcヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtc
ヤフーを支えるセキュリティ ~サイバー攻撃を防ぐエンジニアの仕事とは~ #yjtc
 
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtc
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtcYahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtc
Yahoo! JAPANのIaaSを支えるKubernetesクラスタ、アップデート自動化への挑戦 #yjtc
 
ビッグデータから人々のムードを捉える #yjtc
ビッグデータから人々のムードを捉える #yjtcビッグデータから人々のムードを捉える #yjtc
ビッグデータから人々のムードを捉える #yjtc
 
サイエンス領域におけるMLOpsの取り組み #yjtc
サイエンス領域におけるMLOpsの取り組み #yjtcサイエンス領域におけるMLOpsの取り組み #yjtc
サイエンス領域におけるMLOpsの取り組み #yjtc
 
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtc
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtcヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtc
ヤフーのAIプラットフォーム紹介 ~AIテックカンパニーを支えるデータ基盤~ #yjtc
 
Yahoo! JAPAN Tech Conference 2022 Day2 Keynote #yjtc
Yahoo! JAPAN Tech Conference 2022 Day2 Keynote #yjtcYahoo! JAPAN Tech Conference 2022 Day2 Keynote #yjtc
Yahoo! JAPAN Tech Conference 2022 Day2 Keynote #yjtc
 
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc
新技術を使った次世代の商品の見せ方 ~ヤフオク!のマルチビュー機能~ #yjtc
 
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtc
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtcPC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtc
PC版Yahoo!メールリニューアル ~サービスのUI/UX統合と改善プロセス~ #yjtc
 
モブデザインによる多職種チームのコミュニケーション改善 #yjtc
モブデザインによる多職種チームのコミュニケーション改善 #yjtcモブデザインによる多職種チームのコミュニケーション改善 #yjtc
モブデザインによる多職種チームのコミュニケーション改善 #yjtc
 
「新しいおうち探し」のためのAIアシスト検索 #yjtc
「新しいおうち探し」のためのAIアシスト検索 #yjtc「新しいおうち探し」のためのAIアシスト検索 #yjtc
「新しいおうち探し」のためのAIアシスト検索 #yjtc
 
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtc
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtcユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtc
ユーザーの地域を考慮した検索入力補助機能の改善の試み #yjtc
 

Recently uploaded

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........LeaCamillePacle
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
Construction and Evaluation of Language Models on a Large-Scale Japanese Blog Corpus

  • 1. {yookuno, msassano}@yahoo-corp.jp
    1 … [1] … [2] … Web …
    2 … N-gram … [3] … [4] … MapReduce [5] … [6] … LOUDS [7] … N-gram … [8]
    3
    3.1 N-gram
    A word sequence $w_1^n = w_1, \ldots, w_n$ has probability $P(w_1^n)$.
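  • 2. The N-gram model assumes an (N-1)-th order Markov process [1]:

      $$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1}) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{i-1}) \tag{1}$$

    $P(w_i \mid w_{i-N+1}^{i-1})$ is estimated from counts, where $C(w_i^j)$ is the number of occurrences of $w_i^j$:

      $$P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i})}{C(w_{i-N+1}^{i-1})} \tag{2}$$

    3.2 Dirichlet
    Dirichlet smoothing mixes the N-gram estimate of $P(w_i \mid w_{i-N+1}^{i-1})$ with the (N-1)-gram probability [9]:

      $$P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i}) + \alpha P(w_i \mid w_{i-N+2}^{i-1})}{C(w_{i-N+1}^{i-1}) + \alpha} \tag{3}$$

    In (3), the (N-1)-gram probability $P(w_i \mid w_{i-N+2}^{i-1})$ is itself Dirichlet-smoothed. The 1-gram probability is $P(w) = C(w) / C$, where $C$ is the total count.

    3.3 Absolute
    An N-gram $w_i^j$ is written $abc$ [4], with context words $a$, $b$ and discount $D$:

      $$P(c \mid ab) = \frac{\max(0, C(abc) - D) + D \, N(ab*) \, P(c \mid b)}{C(ab*)} \tag{4}$$

    $N(ab*)$ is the number of distinct words that follow $ab$.

    3.4 Kneser-Ney
    Kneser-Ney smoothing [10] modifies Absolute smoothing by replacing N-gram counts with the number of distinct contexts:

      $$P(c \mid ab) = \frac{\max(0, N(*bc) - D) + D \, R(*b*) \, P(c \mid b)}{N(*b*)} \tag{5}$$

      $$R(*b*) = |\{c : N(*bc) > 0\}|$$

    3.5
    For a test word sequence $w_1^n$:

      $$H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1^{i-1}) \tag{6}$$

    $H$ is measured in bits. Perplexity is $PP = 2^H$.

    3.6 MapReduce
    N-gram counts $C(w_{i-N+1}^{i})$ …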
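A minimal Python sketch of equations (3) and (6) may help make the recursion concrete. It assumes a toy, already-segmented corpus and an arbitrary alpha of 1.0; the function names `count_ngrams`, `dirichlet_prob`, and `cross_entropy` are illustrative and do not come from the paper. The 1-gram case uses the unsmoothed P(w) = C(w) / C, so the sketch only handles words that occur in the counted corpus.

```python
import math
from collections import Counter

def count_ngrams(sentences, max_n):
    """Count all n-grams of order 1..max_n; the empty tuple stores the total count C."""
    counts = Counter()
    for words in sentences:
        counts[()] += len(words)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def dirichlet_prob(counts, context, word, alpha):
    """Equation (3), applied recursively; the 1-gram base case is C(w) / C."""
    if not context:
        return counts[(word,)] / counts[()]
    # Probability under the context shortened by one word (the (N-1)-gram).
    lower = dirichlet_prob(counts, context[1:], word, alpha)
    return (counts[context + (word,)] + alpha * lower) / (counts[context] + alpha)

def cross_entropy(counts, words, n, alpha):
    """Equation (6), with each history truncated to the last n-1 words."""
    total = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - n + 1):i])
        total += math.log2(dirichlet_prob(counts, context, word, alpha))
    return -total / len(words)

# Toy corpus (an assumption for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = count_ngrams(corpus, max_n=2)

# (C(the cat) + alpha * P(cat)) / (C(the) + alpha) = (1 + 1/6) / (2 + 1) = 7/18
p = dirichlet_prob(counts, ("the",), "cat", alpha=1.0)

h = cross_entropy(counts, ["the", "cat", "sat"], n=2, alpha=1.0)
perplexity = 2 ** h
```

Because every level divides by the context count plus alpha, a context that was never observed simply falls back to the lower-order probability, which is the behaviour equation (3) describes.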
  • 3. Figure 1: N-gram counting with MapReduce

        Map(int id, string doc):
            string[] words = MorphologicalAnalyze(doc)
            for i = 1 to size(words)-N+1
                Emit(words[i..i+N-1], 1)

        Reduce(string[] words, int[] counts):
            sum = 0
            for each count in counts
                sum += count
            Emit(words, sum)

    MapReduce [11] … Map … Reduce … [5] … Map … Shuffle … Reduce … MapReduce … Hadoop …

    4
    4.1
    … N … [12] … Wikipedia … 1000 … mecab 0.98 … 1 … α … D … 1 … 10000 … 10 … 1 …

    Table 1: Cross entropy (bit) for each N

             Wikipedia                 Blog
        N    Dirichlet   Kneser-Ney    Dirichlet   Kneser-Ney
        1    10.65       10.65         10.77       10.77
        2     8.71        8.52          9.63        9.44
        3     7.72        5.15          9.21        6.87
        4     7.09        5.23          9.35        7.70
        5     6.64        5.69          9.43        8.73
        6     6.73        6.25          9.48        9.33
        7     6.47        6.23          9.49        9.62

    4.2
    Yahoo! … October 2009 to October 2010 (1 year) … LZO … 2TB … Hadoop … 1CPU / 12GB Memory / 1TB*4 HDD … 20 (1 + 19) … Yahoo! API …

    4.3
    … LZO … N …

    Table 2: Processing time

                       860GB    2TB
        (unlabeled)     9:50    28:16
        1-gram          2:14     7:42
        2-gram          3:34    13:45
        3-gram          5:02    20:43
        4-gram          8:58
        5-gram         11:12
        6-gram         13:00
        7-gram         14:48

      • … N … Wikipedia … 2TB … 4-gram …
      • … Wikipedia … Kneser-Ney …
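The Map and Reduce pseudocode on this slide can be tried out on a single machine with the Python sketch below. It is an illustration under assumptions, not the authors' Hadoop job: `run_job` is a hypothetical in-memory stand-in for the framework's shuffle step, and whitespace splitting via `str.split` replaces `MorphologicalAnalyze`, which for Japanese text would be a real morphological analyzer.

```python
from collections import defaultdict

def map_ngrams(doc_id, doc, n, tokenize=str.split):
    """Map: emit (n-gram, 1) for every window of n consecutive words."""
    words = tokenize(doc)  # stand-in for MorphologicalAnalyze(doc)
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1

def reduce_counts(ngram, counts):
    """Reduce: sum the partial counts collected for one n-gram."""
    return ngram, sum(counts)

def run_job(docs, n):
    """Single-process stand-in for the framework: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for key, value in map_ngrams(doc_id, doc, n):
            shuffled[key].append(value)
    return dict(reduce_counts(key, values) for key, values in shuffled.items())

# Two toy documents (an assumption for illustration only).
bigram_counts = run_job(["a b a b", "b a"], n=2)
```

Grouping by the n-gram key before reducing mirrors what the shuffle phase does between Map and Reduce, so each reducer call sees every partial count for exactly one n-gram.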
  • 4. 860GB … 1- to 7-gram … N … 1000 … Dirichlet … 100 … 10000 … N … N-gram …

    Table 3: Cross entropy (bit) and model size (byte)

             (bit)                      (byte)
        N    10000    1000     100      10000    1000     100
        1    16.25    17.21    17.80    2.8M     9.1M     40M
        2     7.71     6.48     7.66    21M      127M     683M
        3     8.88     6.41     6.51    30M      293M     2.5G
        4     8.93     6.71     6.18    23M      201M     3.6G
        5     8.66     6.20     5.97    15M      232M     3.5G
        6     8.28     5.98     5.74    8.2M     160M     1.6G
        7     7.81     5.68     5.65    5.2M     113M     1.1G

    … Table 3 … 1 PC … PC … 1GB … Table 3 … 1000 … 1.1GB … 5.68 bit …

    5

    [1] …, 1999.
    [2] …, Vol.40, No.7, pp.2946-2953, 1999.
    [3] Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-09, Computer Science Group, Harvard University, 1998.
    [4] Deniz Yuret. Smoothing a Tera-word Language Model. ACL-08: HLT, pp.141-144, June 2008.
    [5] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. Large Language Models in Machine Translation. EMNLP-ACL, pp.858-867, June 2007.
    [6] Graham Cormode, Marios Hadjieleftheriou. Methods for Finding Frequent Items in Data Streams. VLDB, vol.1 Issue 2, August 2008.
    [7] Taro Watanabe, Hajime Tsukada, Hideki Isozaki. A Succinct N-gram Language Model. ACL-IJCNLP, pp.341-344, August 2009.
    [8] Ahmad Emami, Kishore Papineni, Jeffrey Sorensen. Large-Scale Distributed Language Model. ICASSP, IV-37-IV-40, April 2007.
    [9] David J. C. MacKay, Linda C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, vol.1 Issue 03, pp.289-308, 1995.
    [10] Kneser R., Ney H. Improved backing-off for M-gram language modeling. ICASSP, pp.181-184, vol.1, 1995.
    [11] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, December, 2004.
    [12] … Web … N …, N-gram …, 2007.