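Natural Language Processing Laboratory, Nagaoka University of Technology
西山 浩気, first-year master's student
• Proposal of a spell correction method based on neural character embeddings
◦ Captures pronunciation information
◦ Computation time stays short regardless of edit distance
• Achieves high accuracy and fast computation
◦ 99% accuracy at edit distance 2 or less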
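• Spell correction
◦ Examples: word processors, browsers, search engines, ...
◦ Uses a dictionary of correctly spelled words
• Conventional methods
◦ Use a distance measure between the correct spelling and the misspelling
◦ Levenshtein distance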
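As an illustration of the distance measure named above, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. The function name `levenshtein` is chosen for this example and does not come from the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("affilaites", "affiliates"))
```

Each call costs time proportional to the product of the two word lengths, which is why comparing a misspelling against every dictionary entry becomes expensive.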
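• Noisy channel model [Norvig, 2009]
◦ Computes the Levenshtein distance between the misspelled word and every word in the dictionary
▪ Computation is slow, so real-time correction is difficult
▪ For Japanese and Korean: 500,000 words
◦ Proposed improvement
▪ Generate all strings within a threshold edit distance and replace the misspelling with any that appears in the word dictionary
▪ When the edit distance exceeds 2, the computational cost is high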
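The candidate-generation idea above can be sketched as follows. This is a minimal illustration, assuming a lowercase ASCII alphabet and Levenshtein-style edits only (insert, delete, substitute); the names `edits1` and `candidates` are hypothetical and not taken from the slides.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one insertion, deletion, or substitution away
    (plus `word` itself, when a substitution reuses the same letter)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + replaces + inserts)


def candidates(word, dictionary, max_distance=2):
    """Dictionary words reachable from `word` within `max_distance` edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Expand every string found in the previous round by one more edit.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return seen & set(dictionary)


if __name__ == "__main__":
    print(candidates("affilaites", {"affiliates", "affiliate", "apple"}))
```

The number of generated strings grows roughly by a factor of several hundred per extra edit for a ten-letter word, which matches the slide's point that thresholds above 2 become costly.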
• Proposal: a high-accuracy search space reduction method
◦ Unsupervised learning
◦ Aims to speed up computation by shrinking the search space
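1. Preparation for building the C2Vmap
◦ Obtain pronunciation information from the words in the dictionary
▪ Split each word at the boundaries between runs of consecutive vowels and runs of consecutive consonants
▪ Example: “affiliates” ⇒ “a ff i l ia t e s”
2. Learn pronunciation vectors (the C2Vmap) with word2vec
◦ Skip-gram is used for training
◦ Turns the conditional probabilities of pronunciation units into n-dimensional vector representations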
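The splitting rule in step 1 can be sketched with a regular expression. This assumes the vowels are a, e, i, o, u and that every other character (including y) counts as a consonant; the slides do not spell out the vowel set, and `split_units` is a hypothetical name.

```python
import re

# A unit is either a maximal run of vowels or a maximal run of non-vowels.
_UNIT = re.compile(r"[aeiou]+|[^aeiou]+")


def split_units(word):
    """Split a word into alternating vowel runs and consonant runs."""
    return _UNIT.findall(word.lower())


if __name__ == "__main__":
    print(" ".join(split_units("affiliates")))
```

Under this assumption, each dictionary word becomes a short sequence of units, and a list of such sequences is the kind of input a Skip-gram trainer expects, with each word playing the role of a sentence.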
[Figure: the C2Vmap, which assigns an n-dimensional vector to each pronunciation unit such as “ai” and “tr”]
3. Apply the C2Vmap to each pronunciation unit of every word in the dictionary
▪ Example: “affiliates”
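The step above can be sketched as a lookup over pronunciation units. The slides do not say how the per-unit vectors are combined into a single word vector, so this example assumes an element-wise sum. The toy C2Vmap values, the vowel set, the skipping of unknown units, and the names `split_units` and `word_vector` are all made up for illustration.

```python
import re

_UNIT = re.compile(r"[aeiou]+|[^aeiou]+")


def split_units(word):
    """Split a word into alternating vowel runs and consonant runs."""
    return _UNIT.findall(word.lower())


def word_vector(word, c2vmap, n):
    """Sum the C2Vmap vectors of a word's units (assumed composition rule)."""
    vec = [0.0] * n
    for unit in split_units(word):
        unit_vec = c2vmap.get(unit)
        if unit_vec is None:
            continue  # assumption: units missing from the map are skipped
        for d, value in enumerate(unit_vec):
            vec[d] += value
    return vec


if __name__ == "__main__":
    # Toy 2-dimensional map; real vectors would come from Skip-gram training.
    toy_c2vmap = {"a": [1.0, 0.0], "ff": [0.0, 1.0], "i": [0.5, 0.5]}
    print(word_vector("affi", toy_c2vmap, n=2))
```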
[Figure: “affiliates” mapped to a sequence of vectors, one per pronunciation unit]
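1. Build the n-dimensional vector of the misspelled word
◦ Uses the C2Vmap built from the word dictionary
◦ Example: “affilaites”
2. Retrieve the words nearest to the misspelling
◦ Uses the Ball Tree algorithm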
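The retrieval step can be illustrated with an exhaustive nearest-neighbor scan. Note that this is a brute-force stand-in, not a Ball Tree: it returns the same neighbors a Ball Tree would under Euclidean distance, but without the speed-up from the tree index. The word vectors below are toy values, and `k_nearest` is a hypothetical name; `k` is assumed to be the number of neighbors retrieved.

```python
import heapq
import math


def k_nearest(query, word_vectors, k):
    """Return the k dictionary words whose vectors lie closest to `query`,
    ordered from nearest to farthest (Euclidean distance)."""
    return heapq.nsmallest(
        k, word_vectors, key=lambda w: math.dist(query, word_vectors[w]))


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for C2Vmap-based word vectors.
    toy_vectors = {
        "affiliates": [1.0, 1.0],
        "affiliate": [1.0, 2.0],
        "apple": [5.0, 5.0],
    }
    print(k_nearest([1.0, 1.1], toy_vectors, k=2))
```

In practice a tree index such as scikit-learn's `sklearn.neighbors.BallTree` could replace the scan; the slides do not state which implementation was used.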
[Figure: “affilaites” mapped to a sequence of vectors, one per pronunciation unit]
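• Word dictionary
◦ 109,582 words
• Evaluation data: Birkbeck spelling error corpus
◦ 109,897 words
◦ Includes words with an edit distance of 10 or more
• Evaluation metric
◦ Accuracy:
$$\text{Accuracy} = \frac{\text{number of words whose correct spelling was estimated}}{\text{number of words in the evaluation data}} \times 100\ [\%]$$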
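The accuracy metric above amounts to a simple ratio. A minimal sketch, assuming one predicted correction and one gold spelling per evaluation word; the function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(predicted, gold):
    """Percentage of evaluation words whose predicted spelling is correct."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("evaluation data must not be empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    print(accuracy_percent(["affiliates", "apple", "pear", "plum"],
                           ["affiliates", "apple", "peer", "plum"]))
```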
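◦ Maximum accuracy of 88.20% at k = 5000, n = 100
▪ Computation time: 52 ms per sample
◦ Even at n = 25 (low-dimensional), accuracy drops only slightly compared with n = 100
▪ k and n can be varied flexibly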
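• Proposed a search method for spell correction
◦ Demonstrated its effectiveness in both search space reduction and accuracy
▪ Edit distance 2 or less: 99.6%
▪ Edit distance 3 or less: 97.9%
• Uses pronunciation information to build the C2Vmap
◦ Splits words where consonants or vowels run consecutively
◦ Further improvements will be explored in future work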

Effective search space reduction for spell correction using character neural embeddings