Vietnamese Word Segmentation
with CRFs and SVMs: An Investigation
Natural Language Processing Laboratory, Nagaoka University of Technology
Kanji Takahashi
C.T. Nguyen, T.K. Nguyen, X.H. Phan, L.M. Nguyen, Q.T. Ha,
Proceedings of the 20th PACLIC, pp. 215-222, 2006.10
Literature review, December 17, 2015
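Overview
• Perform Vietnamese word segmentation with SVMs and CRFs and compare the two
• Build an annotated corpus and use it to investigate word segmentation
• Investigate how much the features and the corpus size affect performance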
Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources
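Introduction
• Statistical and machine-learning approaches to Vietnamese word segmentation are reported to reach about 91% accuracy
• Existing studies do not compare their method against other methods
• This work investigates the methods by comparing them against a baseline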
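Motivation
• SVMs and CRFs have been successful on segmentation and labeling problems in NLP
• They should therefore also work well for Vietnamese word segmentation
• However, both methods require feature selection
• → Which features affect accuracy?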
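A quick review of Vietnamese words
• Syllables
• Words
  - Single-syllable words: tôi (I), bạn (you), nhà (house)
  - Compound words: bơi lội (to swim), đường sắt (railway)
  - Reduplicative words: words formed by repetition, like the Japanese 神々しい (kōgōshii, "divine")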
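New words as defined in this paper
• Words that appear neither in the dictionary used for analysis nor in the training corpus (unknown words)
• Abbreviations
  - CAND (Công An Nhân Dân – police officer)
• Named entities
  - Hồ Chí Minh, Công ty Hải Hà (Hải Hà Company)
• Foreign words
  - They cannot be told apart by script, because Vietnamese is also written in the Latin alphabet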
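Sequence labeling
• CRF
  - Discriminative model
  - Outputs the probability of each class assignment
• SVM
  - Discriminant function
  - Classifies the data directly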
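
For reference, the contrast above can be written out with the textbook forms of the two models. The symbols below (feature functions f_k, weights λ_k, normalizer Z(x), weight vector w, bias b, feature map φ) are standard notation and are not taken from the slides.

```latex
% Linear-chain CRF: a normalized probability over whole tag sequences y
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% SVM: a discriminant function whose sign gives the class of a single instance
g(x) = \operatorname{sign}\big( w^{\top} \phi(x) + b \big)
```

The CRF scores the whole tag sequence at once, while the SVM classifies one position at a time, so a sequence tagger built on an SVM has to combine per-position decisions.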
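Corpus construction
• 305 news articles on a variety of topics were collected from a variety of websites
• Collecting from many different sites keeps the word distribution from being biased
• Person-name corpus (2,000 people)
• Place-name corpus (707 places)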
Corpus contents
• The corpus is publicly available
• Each syllable is assigned one of three tags: B_W, I_W, or O
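
The tagging scheme can be illustrated with a small sketch. It assumes the usual reading of the tags (B_W marks the first syllable of a word, I_W a later syllable of the same word, O anything else such as punctuation). The function names and the punctuation set are hypothetical, not taken from the paper or its corpus tools.

```python
# Assumed tag meanings: B_W = first syllable of a word,
# I_W = non-initial syllable of a word, O = other (e.g. punctuation).
PUNCTUATION = set(".,;:!?()\"'")  # hypothetical set, for illustration only


def tag_syllables(words):
    """Turn segmented words (syllables separated by spaces) into (syllable, tag) pairs."""
    tagged = []
    for word in words:
        syllables = word.split()
        if len(syllables) == 1 and syllables[0] in PUNCTUATION:
            tagged.append((syllables[0], "O"))
            continue
        for position, syllable in enumerate(syllables):
            tagged.append((syllable, "B_W" if position == 0 else "I_W"))
    return tagged


def tags_to_words(tagged):
    """Rebuild the word segmentation from (syllable, tag) pairs."""
    words = []
    for syllable, tag in tagged:
        if tag == "I_W" and words:
            words[-1] += " " + syllable  # continue the current word
        else:
            words.append(syllable)  # B_W or O starts a new unit
    return words


sentence = ["tôi", "bơi lội", "."]
tagged = tag_syllables(sentence)
restored = tags_to_words(tagged)
```

With this encoding, word segmentation becomes a per-syllable labeling task, which is what lets CRF and SVM taggers be applied to it.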
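Feature selection
• The numbers in parentheses give the window width of each feature
• Words of four or more syllables are rare, so (-2,2) is the largest window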
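
A minimal sketch of what window-based syllable features might look like for the (-2,2) window. The exact feature templates of the paper are not reproduced in this transcript, so the unigram and adjacent-bigram templates, the feature-name format, and the `<PAD>` symbol are assumptions for illustration.

```python
def window_features(syllables, i, left=-2, right=2):
    """Collect syllable features around position i within the window (left, right)."""
    def at(j):
        # Positions outside the sentence are replaced by a padding symbol.
        return syllables[j] if 0 <= j < len(syllables) else "<PAD>"

    features = []
    # Single syllables at each offset in the window.
    for offset in range(left, right + 1):
        features.append(f"S[{offset}]={at(i + offset)}")
    # Conjunctions of two adjacent syllables inside the window.
    for offset in range(left, right):
        features.append(
            f"S[{offset}]|S[{offset + 1}]={at(i + offset)}|{at(i + offset + 1)}"
        )
    return features


feats = window_features(["tôi", "bơi", "lội"], i=1)
```

Widening the window adds context but also multiplies the number of distinct features, which is one reason to cap it where longer words become rare.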
Experiments
• Existing tools are used
  - CRF: FlexCRFs
  - SVM: YamCha
• Various feature sets are tried with 5-fold cross-validation
  - SC: Syllable Conjunction, Misc: Miscellaneous, ERS: External Resources, VSD: Vietnamese Syllable Detection
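
The 5-fold protocol can be sketched as follows. This only shows how the data might be partitioned. Whether the authors split by article or by sentence is not stated in the slides, so splitting the 305 articles is an assumption, and the training and tagging calls to FlexCRFs or YamCha are deliberately left out.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs so that each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to k folds
    for held_out in range(k):
        test = folds[held_out]
        train = [
            item
            for index, fold in enumerate(folds)
            if index != held_out
            for item in fold
        ]
        yield train, test


# Hypothetical article identifiers standing in for the 305 collected articles.
articles = [f"article_{n}" for n in range(305)]
splits = list(k_fold_splits(articles, k=5))
```

Each feature set (SC, Misc, ERS, VSD and their combinations) would be trained and scored on the same five splits so that the results are comparable.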
Results
• The CRF keeps improving as more features are added
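CRF
• The syllable-conjunction features and the word-dictionary features contribute the most
• The other (miscellaneous) features have little effect
  - Whether the token is the first syllable
  - Numbers and dates are rare in the corpus to begin with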
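
As a rough illustration of the two feature groups contrasted above, the sketch below computes dictionary-lookup features and a couple of miscellaneous features for one syllable. The feature names, the two-syllable lookup span, and the choice of miscellaneous checks are assumptions; the paper's actual templates are not given in this transcript.

```python
def external_and_misc_features(syllables, i, dictionary):
    """Return dictionary-based and miscellaneous features for the syllable at position i."""
    features = []
    current = syllables[i]

    # Dictionary features: does the current syllable form a dictionary word
    # together with its right or left neighbour?
    if i + 1 < len(syllables) and f"{current} {syllables[i + 1]}".lower() in dictionary:
        features.append("IN_DICT[0,1]")
    if i > 0 and f"{syllables[i - 1]} {current}".lower() in dictionary:
        features.append("IN_DICT[-1,0]")

    # Miscellaneous features: surface properties of the syllable itself.
    if current.isdigit():
        features.append("IS_NUMBER")
    if current[:1].isupper():
        features.append("INIT_CAP")
    return features


toy_dictionary = {"bơi lội", "đường sắt"}  # toy entries taken from the word examples
tokens = ["Tôi", "bơi", "lội", "2006"]
per_token = [external_and_misc_features(tokens, i, toy_dictionary) for i in range(len(tokens))]
```

Features such as IS_NUMBER fire on very few tokens, which is consistent with the observation that they add little.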
Results
• VSD helps the CRF but hurts the SVM
• The SVM is effective even with few features
Results
• Comparison of the CRF and the SVM with the SC+VSD+Dict feature set
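Summary
• The paper investigates the use of CRFs and SVMs for Vietnamese word segmentation
• Contributions
  - Creation of an annotated corpus
  - Comparison of results across various feature sets
  - Interesting findings from the experimental results
• Future work: examine how accuracy changes with corpus size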
