Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Ling Zhu, and Shuo Li
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory (NLP2CT)
Department of Computer and Information Science, University of Macau, Macau
Acknowledgements
The authors are grateful to the Science and
Technology Development Fund of Macau and the
Research Committee of the University of Macau
for funding this research under reference
Nos. 017/2009/A and RG060/09-10S/CS/FST.
Motivation
Word segmentation of Chinese text is difficult
because many kinds of ambiguity arise and
abbreviations are widely used, both of which can
lead to different segmentations of the same string.
Conventional research, however, tends to
emphasize the algorithms employed and the
workflow designed, with little introduction or
discussion of the linguistic aspects of Chinese
word segmentation (CWS), such as the
characteristics of Chinese.
This paper analyzes the characteristics of Chinese
and several categories of ambiguity and
abbreviation in Chinese expressions in order to
explore potential solutions.
A Study of Chinese Word Segmentation Based on the
Characteristics of Chinese
25th International Conference of the German Society for Computational Linguistics and Language Technology, Darmstadt, Germany, September 25–27, 2013
Characteristics of Chinese
Structural Ambiguity
A Chinese character can combine with either the
preceding or the following characters, and both
combinations yield valid Chinese words.
Case 1: Both candidate segmentations are correct
sentences, but with different meanings.
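The ambiguity described above can be made concrete by enumerating every split of an unsegmented string that a word list allows. The following is a minimal sketch, not the method used in this work: the function name `segmentations` and the toy `LEXICON` are hypothetical, and the lexicon is chosen only so that the character 只 can attach either to the preceding 船 or stand alone.

```python
from typing import List, Set


def segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every way to split `text` into words that all occur in `lexicon`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this head word.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon (an assumption made for illustration only).
LEXICON = {"他的", "船只", "船", "只", "靠在", "維多利亞港"}

for words in segmentations("他的船只靠在維多利亞港", LEXICON):
    print("/".join(words))
```

With this toy lexicon the string has two admissible segmentations, so a lexicon alone cannot decide between them; some contextual model is needed to choose.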
CRF with Optimized Features
In the CRF model, $X$ is a variable representing
the input sequence and $Y$ represents the
corresponding labels to be attached to $X$. The
graph $G = (V, E)$ comprises a set $V$ of vertices
(nodes) together with a set $E$ of edges. The
parameters $\lambda_k$ and $\mu_k$ are learned from
the training data, and the symbol "|" denotes
conditioning: the right-hand part is the
precondition of the left.
Experiments
Training data (SIGHAN Bakeoff-4):
- 36,228 sentences for the CityU corpus
- 23,444 sentences for the Chinese Treebank (CTB) corpus
Testing data (SIGHAN Bakeoff-4):
- 8,094 sentences for the CityU corpus
- 2,772 sentences for the CTB corpus
Number of words for the testing corpora:
IV (in vocabulary): test words that also occur in
the training corpus.
OOV (out of vocabulary): test words that do not
occur in the training corpus.
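The IV/OOV distinction amounts to a membership test against the training vocabulary. The sketch below is illustrative only: `split_iv_oov` is a hypothetical helper, the word lists are toy data, and counting word tokens (rather than distinct word types) is an assumption about how the corpus statistics are tallied.

```python
from typing import Iterable, List, Tuple


def split_iv_oov(train_words: Iterable[str],
                 test_words: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition test word tokens into in-vocabulary and out-of-vocabulary."""
    vocab = set(train_words)  # every word form seen in the training corpus
    iv, oov = [], []
    for word in test_words:
        (iv if word in vocab else oov).append(word)
    return iv, oov


# Toy data, not taken from the Bakeoff corpora.
train = ["他的", "船只", "靠在", "維多利亞港"]
test = ["他的", "船", "只", "靠在", "維多利亞港"]

iv, oov = split_iv_oov(train, test)
print("IV:", len(iv), "OOV:", len(oov), "Total:", len(iv) + len(oov))
```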
Results evaluated by F-scores:
Closed track means only using the information
found in the provided training data.
Open track means any external data can be used
in addition to the provided training data.
Case 2: One of the possible segmented sentences
is grammatically correct, while the other is not.
Abbreviations in Named Entities
Abbreviation of personal names
Abbreviation of place/location names
Example of structural ambiguity, Case 1:
他的/船只/靠在/維多利亞港
His ship moors at Victoria Harbor.
他的/船/只/靠在/維多利亞港
His ship moors only at Victoria Harbor.
Comparisons with some related works (F-scores)

System  Track   IV      IV      OOV     OOV     Total   Total
                CityU   CTB     CityU   CTB     CityU   CTB
[17]    Closed  .9483   .9556   .6093   .6286   .9183   .9354
[19]    Closed  .9386   .9290   .5234   .5128   .9083   .9077
[18]    Closed  .9101   .8939   .6072   .6273   .8850   .8780
[20]    Open    .9401   .9753   .6090   .8839   .9098   .9702
[21]    Open    N/A     .9398   N/A     .6581   N/A     .9256
Ours    Closed  .9541   .9590   .6420   .6562   .9268   .9405
Example of structural ambiguity, Case 2:
水/快速/凍/成了/冰
The water quickly froze into ice.
水/快/速凍/成了/冰
water / fast / fast-frozen / into / ice

Example of an abbreviated personal name:
許又/從/街坊/口中/得知
Xu You heard from the neighbors.
許/又/從/街坊/口中/得知
Xu once again heard from the neighbors.

Example of an abbreviated place name:
敵人/襲擊/巴/西北部
The enemy attacks the northwestern part of Pakistan.
敵人/襲擊/巴西/北部
The enemy attacks northern Brazil.
$$
p_\theta(Y \mid X) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, Y|_e, X) + \sum_{v \in V,\, k} \mu_k g_k(v, Y|_v, X) \right) \tag{1}
$$
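For sequence labeling, Equation (1) is commonly specialized to a linear chain. The form below is a sketch under that assumption, which the source does not state explicitly: the vertices are the character positions $i = 1, \dots, n$, the edges connect adjacent positions, and $y_i$ (a symbol introduced here) denotes the label at position $i$.

```latex
% Assumption: G is a linear chain over the n character positions.
p_\theta(Y \mid X) \propto \exp\left(
    \sum_{i=2}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i)
  + \sum_{i=1}^{n} \sum_{k} \mu_k g_k(y_i, X, i)
\right)
```

Here the edge features $f_k$ score transitions between neighboring labels, and the vertex features $g_k$ score a single label against the observed characters.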
Features                              Meaning
$U_n,\ n \in (-4, 1)$                 Unigram features
$B_{n,n+1},\ n \in (-2, 0)$           Bigram features
$B_{-1,1}$                            Jump bigram features
$T_{n,n+1,n+2},\ n \in (-2, -1)$      Trigram features
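The feature templates can be read as windows of characters around the current position. The sketch below shows one possible reading, with several assumptions: the ranges are taken as inclusive integer offsets relative to the current character (offset 0), positions outside the sentence are filled with a hypothetical `<PAD>` symbol, and the function names and feature-string format are illustrative rather than those of any particular CRF toolkit.

```python
from typing import List

PAD = "<PAD>"  # hypothetical filler for offsets that fall outside the sentence

UNIGRAM_OFFSETS = range(-4, 2)   # U_n,            n = -4 .. 1 (assumed inclusive)
BIGRAM_OFFSETS = range(-2, 1)    # B_{n,n+1},      n = -2 .. 0
TRIGRAM_OFFSETS = range(-2, 0)   # T_{n,n+1,n+2},  n = -2 .. -1


def char_at(chars: List[str], index: int) -> str:
    """Return the character at `index`, or PAD when the index is out of range."""
    return chars[index] if 0 <= index < len(chars) else PAD


def extract_features(chars: List[str], i: int) -> List[str]:
    """Build the unigram, bigram, jump-bigram and trigram features for position i."""
    def c(n: int) -> str:
        return char_at(chars, i + n)

    feats = [f"U{n}={c(n)}" for n in UNIGRAM_OFFSETS]
    feats += [f"B{n},{n + 1}={c(n)}{c(n + 1)}" for n in BIGRAM_OFFSETS]
    feats.append(f"B-1,1={c(-1)}{c(1)}")  # jump bigram skips the current character
    feats += [f"T{n},{n + 1},{n + 2}={c(n)}{c(n + 1)}{c(n + 2)}"
              for n in TRIGRAM_OFFSETS]
    return feats


# Features for the third character (速) of an unsegmented example sentence.
for feature in extract_features(list("水快速凍成了冰"), 2):
    print(feature)
```

Under this reading each character position yields 12 feature strings (6 unigrams, 3 bigrams, 1 jump bigram, 2 trigrams).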
Corpus   Total     IV        OOV
CityU    235,631   216,249   19,382
CTB      80,700    76,200    4,480
Example of structural ambiguity, Case 2:
新疆/維吾爾自治區/分外/妖嬈
The Xinjiang Uygur Autonomous Region is extraordinarily enchanting.
新疆/維吾爾/自治/區分/外/妖嬈
Xinjiang / Uygur / autonomy / distinguish / out / enchanting

Example of structural ambiguity, Case 1:
由/三/局/處理/食物
Three bureaus handle the food.
由/三局/處理/食物
The third bureau handles the food.
Example of an abbreviated personal name:
张/明白了/事情原因
Zhang understood the reason for the matter.
张明/白/了/事情原因
Zhang Ming / white / already / reason for the matter

Example of an abbreviated place name:
事件/發生/在/法/國家劇院
The incident occurred at the French national theatre.
事件/發生/在/法國/家/劇院
the incident / occurred / in / France / home / theatre
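$$
\mathit{Precision} = \frac{\#\,\text{correct in output}}{\#\,\text{number of output}} \tag{2}
$$

$$
\mathit{Recall} = \frac{\#\,\text{correct in output}}{\#\,\text{number of truth}} \tag{3}
$$

$$
\mathit{F\_score} = \frac{2\,\mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{4}
$$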
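Equations (2) to (4) can be computed for one sentence by comparing the words of the system output with those of the reference. The sketch below assumes that a word counts as correct only when both its start and end character offsets match the reference, which is a common convention for segmentation scoring but is not spelled out in the source; `word_spans` and `prf` are hypothetical helper names.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(truth: List[str], output: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) following Equations (2) to (4)."""
    truth_spans, output_spans = word_spans(truth), word_spans(output)
    correct = len(truth_spans & output_spans)  # words with matching boundaries
    precision = correct / len(output_spans) if output_spans else 0.0
    recall = correct / len(truth_spans) if truth_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


truth = "水/快速/凍/成了/冰".split("/")
output = "水/快/速凍/成了/冰".split("/")
precision, recall, f_score = prf(truth, output)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```

In this toy case three of the five output words (水, 成了, 冰) match the reference boundaries, so precision, recall and F-score are all 0.6.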
