Paper introduction (H26/7/1)
Improving SMT quality
with morpho-syntactic analysis
Nagaoka University of Technology, Kanji Takahashi
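Overview
• Adding linguistic information is expected to improve the quality of statistical machine translation (SMT)
• 40% of the words occur only once in the corpus
• In German-English SMT, using morpho-syntactic information improves performance
• Sonja Nießen, Hermann Ney, COLING, 2000, Vol. 2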
Corpus statistics: words that occur only once
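As a rough illustration of the statistic on this slide, the share of word types that occur exactly once can be computed from a tokenized corpus. This is a minimal sketch: the function name `singleton_ratio` and whitespace tokenization are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter


def singleton_ratio(tokens):
    """Return the fraction of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


if __name__ == "__main__":
    # Toy corpus, whitespace-tokenized (an assumption for this example).
    corpus = "a b b c c c d".split()
    print(singleton_ratio(corpus))
```

A high ratio means that many words are seen too rarely for the translation model to learn reliable statistics for them.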
Overall diagram
Separable verbs
• English: go out
• German: out go -> outgo
• ausgehen (to go out), aus|gehen
• [separable prefix + base verb]
• Frank is going out with Petra this evening.
• Frank heute Abend mit Petra ausgehen.
• Frank geht heute Abend mit Petra aus.
Rewriting separable verbs (verb prefixes)
• Frank is going out with Petra this evening.
• Frank geht heute Abend mit Petra aus.
↓
• Frank heute Abend mit Petra ausgehen.
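The rewrite shown on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be already POS-tagged with STTS-style tags (`VVFIN` for the finite verb, `PTKVZ` for the detached prefix), and `lemmas` is a hypothetical lookup table from finite verb forms to infinitives.

```python
def rewrite_separable_verb(tagged, lemmas):
    """Join a detached verb prefix with its verb, as in the slide's example.

    tagged: list of (word, tag) pairs; the tag names are assumptions.
    lemmas: hypothetical dict mapping finite verb forms to infinitives.
    """
    words = [w for w, _ in tagged]
    # Position of the first detached prefix, if any.
    prefix_idx = next(
        (i for i, (_, t) in enumerate(tagged) if t == "PTKVZ"), None
    )
    if prefix_idx is None:
        return words
    # Nearest finite verb to the left of the prefix.
    verb_idx = next(
        (i for i in range(prefix_idx - 1, -1, -1) if tagged[i][1] == "VVFIN"),
        None,
    )
    if verb_idx is None:
        return words
    verb = words[verb_idx]
    merged = words[prefix_idx] + lemmas.get(verb, verb)
    # Drop the finite verb and put the merged form where the prefix was.
    return (
        words[:verb_idx]
        + words[verb_idx + 1:prefix_idx]
        + [merged]
        + words[prefix_idx + 1:]
    )


if __name__ == "__main__":
    sentence = [
        ("Frank", "NE"), ("geht", "VVFIN"), ("heute", "ADV"),
        ("Abend", "NN"), ("mit", "APPR"), ("Petra", "NE"),
        ("aus", "PTKVZ"), (".", "$."),
    ]
    print(" ".join(rewrite_separable_verb(sentence, {"geht": "gehen"})))
```

The point of the rewrite is that the verb and its prefix become a single token, so the translation model sees one unit for "go out" instead of two distant words.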
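Compounds (split compounds)
• The compound "Früchtetee" cannot be translated
• Its components "Früchte" and "Tee" each occur in the corpus -> translatable
• Compounds that do not occur in the training data are split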
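One simple way to picture the splitting step is a vocabulary lookup over all split points. The slides do not say how the split is actually computed, so the heuristic below is a swapped-in simplification; `split_compound`, `vocab`, and `min_len` are names introduced only for this sketch.

```python
def split_compound(word, vocab, min_len=3):
    """Split an unseen compound into two parts that are both in the vocabulary.

    Words already in the vocabulary are returned unchanged. This is a
    plain lookup heuristic, not a morphological analyzer.
    """
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        # German nouns are capitalized, so restore the capital on the tail.
        tail = word[i:].capitalize()
        if head in vocab and tail in vocab:
            return [head, tail]
    return [word]


if __name__ == "__main__":
    training_vocab = {"Früchte", "Tee"}
    print(split_compound("Früchtetee", training_vocab))
```

A real system would also need to handle linking elements and compounds with more than two parts, which this sketch ignores.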
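Part-of-speech tagging (pos)
• Part-of-speech tags are used as cues for word sense disambiguation
• Aber
• adverb, conjunction
• Zu
• adverb, preposition, separated verb prefix, infinitive marker
• Der, die, das
• definite article, pronoun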
Easily mistranslated
• "Das würde mir sehr gut passen."
• Correct: "That would suit me very well."
• Wrong: "The would suit me very well."
• "Das war zu schnell."
• Correct: "That was too fast."
• Wrong: "That was to fast."
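The idea of attaching a part-of-speech tag to ambiguous words such as "aber", "zu", and "der/die/das" can be sketched like this. The tags in the example (`PDS`, `PTKA`, and so on) follow the STTS convention and are an assumption here, as are the `word|TAG` output format and the function name `annotate_ambiguous`.

```python
# Words from the slide whose translation depends on their part of speech.
AMBIGUOUS = {"aber", "zu", "der", "die", "das"}


def annotate_ambiguous(tagged):
    """Append the POS tag to ambiguous words only; leave other words as-is.

    tagged: list of (word, tag) pairs from some tagger (assumed given).
    """
    return [
        f"{word}|{tag}" if word.lower() in AMBIGUOUS else word
        for word, tag in tagged
    ]


if __name__ == "__main__":
    sentence = [("Das", "PDS"), ("war", "VAFIN"), ("zu", "PTKA"), ("schnell", "ADJD")]
    print(" ".join(annotate_ambiguous(sentence)))
```

After annotation, "Das" as a pronoun and "das" as an article are distinct tokens, so the translation model can learn "that" for one and "the" for the other.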
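Merging multi-word phrases (merge)
• Multi-word phrases of two or more words behave quite differently in a sentence from their component words
• "irgend etwas" ("anything")
• 21 multi-word phrases are entered as single words
• "irgend-etwas"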
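The merge step can be pictured as a left-to-right scan that joins any listed phrase into one hyphenated token. This is a minimal sketch; the function name `merge_phrases` and the representation of the phrase list as tuples are assumptions, and only the "irgend etwas" entry comes from the slide.

```python
def merge_phrases(tokens, phrases):
    """Replace each occurrence of a listed multi-word phrase with one token.

    phrases: iterable of word tuples, e.g. [("irgend", "etwas")].
    """
    # Try longer phrases first so they are not shadowed by shorter ones.
    ordered = sorted(phrases, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == tuple(phrase):
                out.append("-".join(phrase))
                i += n
                break
        else:
            # No phrase starts here; copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(merge_phrases(["Haben", "Sie", "irgend", "etwas", "?"], [("irgend", "etwas")]))
```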
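Unknown words
• Proper nouns that do not occur in the training data are output unchanged
• The position of such proper nouns in the output sentence is usually correct
• Overlapping with the earlier point, splitting compounds reduces the number of unknown German words
• Converting an unknown word to its base form sometimes makes it possible to translate the intended meaning
• "kaltes" -> "kalt" (cold), "Jahre" -> "Jahr" (years)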
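The fallback described above can be sketched as: keep known words, replace an unknown word by its base form when that base form is known, and otherwise pass the word through unchanged (as for proper nouns). The function name `handle_unknown` and the `base_forms` table are hypothetical; in practice the base forms would come from a morphological analyzer.

```python
def handle_unknown(tokens, vocab, base_forms):
    """Map unknown words to a known base form where possible.

    vocab: set of words seen in training.
    base_forms: hypothetical dict from inflected form to base form.
    """
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
            continue
        base = base_forms.get(tok)
        # Use the base form only if the model actually knows it;
        # otherwise copy the word through unchanged.
        out.append(base if base in vocab else tok)
    return out


if __name__ == "__main__":
    vocab = {"kalt", "Jahr"}
    base_forms = {"kaltes": "kalt", "Jahre": "Jahr"}
    print(handle_unknown(["kaltes", "Jahre", "Petra"], vocab, base_forms))
```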
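Translation
• Corpus
• VERBMOBIL
• Corpus of appointment-scheduling dialogues
• Input
• Two kinds: text and speech recognition output (recognition accuracy 69%)
• Training set
• 45,680 sentence pairs
• Test set
• 147 sentences containing no unknown words
• Evaluated with SSER, the subjective sentence error rate (the authors, 2000)
• Scored manually
• 0.0: both meaning and syntax are correct
• 1.0: completely wrong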
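Assuming the corpus-level SSER is the mean of the per-sentence human scores on the 0.0 to 1.0 scale above (the slides do not spell out the aggregation), it can be computed as follows. The function name `sser` is introduced only for this sketch.

```python
def sser(scores):
    """Mean of per-sentence human error scores, each in [0.0, 1.0].

    0.0 means the translation is semantically and syntactically correct,
    1.0 means it is completely wrong. Averaging is an assumption here.
    """
    if not scores:
        raise ValueError("no scores given")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical scores for three test sentences.
    print(sser([0.0, 1.0, 0.5]))
```

Lower values are better, since the score measures errors rather than quality.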
Results
• Splitting compounds reduces the number of distinct word types.
• The number of words that occur only once decreases by 2.8%.
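Results
• Translating text • Translating speech recognition output
POS tagging, merging multi-word phrases, and verb generalization contribute to translation quality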
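Summary
• Morpho-syntactic information is used to improve the accuracy of statistical machine translation
• Compounds
• Separable verbs
• Part-of-speech tagging
• Multi-word phrases
• Unknown words
• Effectiveness confirmed on natural dialogue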

20150701 Improving SMT quality with morpho-syntactic analysis