A Study of Chinese Word Segmentation Based on Characteristics

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Ling Zhu, and Shuo Li
Natural Language Processing & Portuguese - Chinese Machine Translation Laboratory (NLP2CT)
Department of Computer and Information Science, University of Macau, Macau
Acknowledgements
The authors are grateful to the Science and
Technology Development Fund of Macau and the
Research Committee of the University of Macau
for funding this research under reference
numbers 017/2009/A and RG060/09-10S/CS/FST.
Motivation
Chinese word segmentation (CWS) is difficult
because Chinese expressions contain many kinds
of ambiguity and widely used abbreviations, both
of which often admit more than one segmentation.
Conventional research, however, usually
emphasizes the algorithms employed and the
workflow designed, with little discussion of the
linguistic aspects of CWS, such as the
characteristics of Chinese.
This paper analyzes the characteristics of Chinese
and several categories of ambiguities and
abbreviations in Chinese expressions in order to
explore potential solutions.
A Study of Chinese Word Segmentation Based on the Characteristics of Chinese
25th International Conference of the German Society for Computational Linguistics and Language Technology, Darmstadt, Germany, September 25–27, 2013
Characteristics of Chinese
Structural Ambiguity
A Chinese character can combine with either the
preceding characters or the following characters,
and both combinations yield valid Chinese words.
Case 1: Both candidate segmentations are correct,
but they have different meanings.
CRF with Optimized Features
In the CRF model, X is a variable representing the
input sequence and Y represents the corresponding
labels to be attached to X. The graph G = (V, E)
comprises a set V of vertices (nodes) together with
a set E of edges. The parameters λ_k and μ_k are
trained from the training data, and the symbol "|"
indicates that the part on its right is the
condition for the part on its left.
Experiments
Training data (SIGHAN Bakeoff-4):
- 36,228 sentences for the CityU corpus
- 23,444 sentences for the Chinese Treebank (CTB) corpus
Testing data (SIGHAN Bakeoff-4):
- 8,094 sentences for the CityU corpus
- 2,772 sentences for the CTB corpus
Number of words for the testing corpora:
IV: in vocabulary, i.e. test words that already
appear in the training corpus.
OOV: out of vocabulary, i.e. test words that do
not appear in the training corpus.
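The IV/OOV split defined above can be sketched in a few lines of Python. This is only an illustration: the function name `split_iv_oov` and the toy sentences are hypothetical, and counting word tokens (rather than distinct word types) is an assumption, since the poster does not say which was counted.

```python
def split_iv_oov(train_sentences, test_sentences):
    """Split test words into in-vocabulary (IV) and out-of-vocabulary (OOV).

    Each sentence is a list of already segmented words.
    Assumption: word tokens are counted, not distinct word types.
    """
    vocab = {word for sentence in train_sentences for word in sentence}
    iv = [w for s in test_sentences for w in s if w in vocab]
    oov = [w for s in test_sentences for w in s if w not in vocab]
    return iv, oov


# Toy data for illustration only, not taken from the SIGHAN corpora.
train = [["他的", "船只", "靠在", "維多利亞港"]]
test = [["他的", "船", "只", "靠在", "維多利亞港"]]

iv, oov = split_iv_oov(train, test)
oov_rate = len(oov) / (len(iv) + len(oov))
print(len(iv), len(oov), oov_rate)
```

In this toy case the words 船 and 只 never occur as separate words in the training sentence, so they are counted as OOV.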
Results evaluated by F-scores:
Closed track means only using the information
found in the provided training data.
Open track means any external data can be used
in addition to the provided training data.
Case 2: One of the possible segmented sentences
is grammatically correct, while the other is not.
Abbreviations in Named Entities
Abbreviation of personal names
Abbreviation of place/location names
他的/船只/靠在/維多利亞港
His ship is moored at Victoria Harbor
他的/船/只/靠在/維多利亞港
His ship only moors at Victoria Harbor
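Structural ambiguity of this kind can be reproduced with a small sketch that enumerates every way of splitting a sentence into dictionary words. The function `segmentations` and the toy lexicon are hypothetical and serve only as an illustration; they are not the method used in this work.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that occur in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon assumed for illustration only.
lexicon = {"他的", "船只", "船", "只", "靠在", "維多利亞港"}
candidates = segmentations("他的船只靠在維多利亞港", lexicon)
for words in candidates:
    print("/".join(words))
```

With this lexicon the sentence has exactly two segmentations, which correspond to the two readings of the example: a dictionary alone cannot choose between them.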
System  Track    IV F-score       OOV F-score      Total F-score
                 CityU   CTB      CityU   CTB      CityU   CTB
[17]    Closed   .9483   .9556    .6093   .6286    .9183   .9354
[19]    Closed   .9386   .9290    .5234   .5128    .9083   .9077
[18]    Closed   .9101   .8939    .6072   .6273    .8850   .8780
[20]    Open     .9401   .9753    .6090   .8839    .9098   .9702
[21]    Open     N/A     .9398    N/A     .6581    N/A     .9256
Ours    Closed   .9541   .9590    .6420   .6562    .9268   .9405
Comparisons with some related works
水/快速/凍/成了/冰
The water is quickly frozen into ice
水/快/速凍/成了/冰
water / fast / fast-frozen / into / ice
許又/從/街坊/口中/得知
XuYou heard from the neighbors
許/又/從/街坊/口中/得知
Xu once more heard from the neighbors
敵人/襲擊/巴/西北部
The enemy attacks the northwestern part of Pakistan
敵人/襲擊/巴西/北部
The enemy attacks northern Brazil
\[
p_\theta(Y \mid X) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, Y|_e, X) + \sum_{v \in V,\,k} \mu_k g_k(v, Y|_v, X) \Big) \tag{1}
\]
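Equation (1) can be made concrete with a toy sketch. The poster does not state the graph structure, so the code below assumes a linear chain (vertices are character positions, edges link adjacent positions); the two-label tag set, the indicator features, and the hand-set weights are all hypothetical. Normalization is done by brute-force enumeration, which is feasible only for very short inputs.

```python
import itertools
import math

# Hypothetical tag set: B = character begins a word, I = character is inside a word.
LABELS = ("B", "I")


def vertex_features(chars, i, label):
    """Names of the active vertex features g_k(v, Y|v, X) at position i."""
    return [f"char={chars[i]}|label={label}"]


def edge_features(prev_label, label):
    """Names of the active edge features f_k(e, Y|e, X) for one adjacent pair."""
    return [f"prev={prev_label}|label={label}"]


def score(chars, labels, lam, mu):
    """Weighted feature sum inside exp(...) in Equation (1), linear chain assumed."""
    total = 0.0
    for i, label in enumerate(labels):
        total += sum(mu.get(name, 0.0) for name in vertex_features(chars, i, label))
        if i > 0:
            total += sum(lam.get(name, 0.0)
                         for name in edge_features(labels[i - 1], label))
    return total


def probability(chars, labels, lam, mu):
    """Normalize exp(score) over every possible label sequence (brute force)."""
    z = sum(math.exp(score(chars, y, lam, mu))
            for y in itertools.product(LABELS, repeat=len(chars)))
    return math.exp(score(chars, labels, lam, mu)) / z


# Hand-set toy weights, not trained values.
chars = list("快速")
mu = {"char=快|label=B": 1.0, "char=速|label=I": 1.0}
lam = {"prev=B|label=I": 0.5}

p_word = probability(chars, ("B", "I"), lam, mu)   # 快速 kept as one word
p_split = probability(chars, ("B", "B"), lam, mu)  # 快 / 速 split apart
print(p_word, p_split)
```

With these weights the labeling that keeps 快速 together receives the higher probability. In a trained model the weights would be estimated from the training data rather than set by hand.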
Features                        Meaning
U_n, n ∈ (−4, 1)                Unigram features
B_{n,n+1}, n ∈ (−2, 0)          Bigram features
B_{−1,1}                        Jump bigram features
T_{n,n+1,n+2}, n ∈ (−2, −1)     Trigram features
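The feature templates in the table can be illustrated with a short sketch that generates the feature strings for one character position. It assumes that each range such as (−4, 1) is an inclusive range of integer offsets relative to the current character, which is the only reading under which the trigram range (−2, −1) is non-empty. The function name `extract_features`, the padding symbol, and the string format are hypothetical.

```python
def extract_features(chars, i, pad="<PAD>"):
    """Build unigram, bigram, jump-bigram and trigram features for position i.

    Assumption: every range in the feature table is inclusive, with offsets
    taken relative to the current character (offset 0).
    """
    def at(offset):
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else pad

    feats = []
    for n in range(-4, 2):   # U_n, n = -4 .. 1
        feats.append(f"U[{n}]={at(n)}")
    for n in range(-2, 1):   # B_{n,n+1}, n = -2 .. 0
        feats.append(f"B[{n},{n + 1}]={at(n)}{at(n + 1)}")
    feats.append(f"B[-1,1]={at(-1)}{at(1)}")   # jump bigram skips offset 0
    for n in range(-2, 0):   # T_{n,n+1,n+2}, n = -2 .. -1
        feats.append(f"T[{n},{n + 1},{n + 2}]={at(n)}{at(n + 1)}{at(n + 2)}")
    return feats


chars = list("水快速凍成了冰")
feats = extract_features(chars, 2)   # features for the character 速
for feature in feats:
    print(feature)
```

Under this reading each position yields 12 features: 6 unigrams, 3 bigrams, 1 jump bigram, and 2 trigrams.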
Type    Total     IV        OOV
CityU   235,631   216,249   19,382
CTB     80,700    76,200    4,480




新疆/維吾爾自治區/分外/妖嬈
The Xinjiang Uygur Autonomous Region is extraordinarily enchanting
新疆/維吾爾/自治/區分/外/妖嬈
Xinjiang / Uygur / autonomy / distinguish / out / enchanting
由/三/局/處理/食物
Three bureaus handle the food
由/三局/處理/食物
The third bureau handles the food




张/明白了/事情原因
Zhang understood the cause of the matter
张明/白/了/事情原因
ZhangMing / white / already / cause of the matter
事件/發生/在/法/國家劇院
The incident occurred at the French national theatre
事件/發生/在/法國/家/劇院
The incident / occurred / in / France / home / theatre








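\[
\mathit{Precision} = \frac{\#\,\text{correct in output}}{\#\,\text{number of output}} \tag{2}
\]
\[
\mathit{Recall} = \frac{\#\,\text{correct in output}}{\#\,\text{number of truth}} \tag{3}
\]
\[
F\_\mathit{score} = \frac{2\,\mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{4}
\]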
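Equations (2) to (4) can be computed for a single sentence as sketched below. The sketch assumes that an output word counts as correct only when both of its character boundaries match a word in the reference segmentation, which is a common convention but is not spelled out on the poster. The helper names `to_spans` and `prf` are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(output_words, truth_words):
    """Precision, recall and F-score as in Equations (2) to (4)."""
    out, gold = to_spans(output_words), to_spans(truth_words)
    correct = len(out & gold)   # words whose boundaries match exactly
    precision = correct / len(out) if out else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = ["水", "快速", "凍", "成了", "冰"]
output = ["水", "快", "速凍", "成了", "冰"]
p, r, f = prf(output, truth)
print(p, r, f)
```

Here three of the five output words (水, 成了, 冰) match the reference, so precision, recall, and F-score all come to 0.6.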
