GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of Chinese


Published on

Language Processing and Knowledge in the Web - Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, (GSCL 2013), Darmstadt, Germany, on September 25–27, 2013. LNCS Vol. 8105, Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. (EI)

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of Chinese

  1. 1. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao Liangye He, Ling Zhu, and Shuo Li University of Macau, Department of Computer and Information Science Av. Padre Toms Pereira Taipa, Macau, China, {derekfw,lidiasc} {wutianshui0515,imecholing,leevis1987} Abstract. This paper introduces the research on Chinese word segmen- tation (CWS). The word segmentation of Chinese expressions is difficult due to the fact that there is no word boundary in Chinese expressions and that there are some kinds of ambiguities that could result in differ- ent segmentations. To distinguish itself from the conventional research that usually emphasizes more on the algorithms employed and the work- flow designed with less contribution to the discussion of the fundamen- tal problems of CWS, this paper firstly makes effort on the analysis of the characteristics of Chinese and several categories of ambiguities in Chinese to explore potential solutions. The selected conditional random field models are trained with a quasi-Newton algorithm to perform the sequence labeling. To consider as much of the contextual information as possible, an augmented and optimized set of features is developed. The experiments show promising evaluation scores as compared to some related works.1 Keywords: Natural language processing, Chinese word segmentation, Characteristics of Chinese, Optimized features. 1 Introduction With the rapid development of natural language processing and knowledge man- agement, word segmentation becomes more and more important due to its crucial impacts on text mining, information extraction and word alignment, etc. The Chinese word segmentation (CWS) faces more challenges because of a lack of clear word boundary in Chinese texts and the many kinds of ambiguities in Chi- nese. The algorithms explored on the Chinese language processing (CLP) include the Maximum Matching Method (MMM) [1], the Stochastic Finite-State model [2], the Hidden Markov Model [3], the Maximum Entropy method [4] and the Conditional Random Fields [5] [6], etc. The workflow includes self-supervised models [7], unsupervised models [8] [9], and a combination of the supervised and unsupervised methods [10]. Generally speaking, the supervised methods gain 1 The final publication is available at
  2. 2. 2 A.L.F. Han, D.F. Wong, L.S. Chao, L. He, L. Zhu, S. Li higher accuracy scores than the unsupervised ones. The Hidden Markov Model (HMM) assumes a strong independence between variables forming an obstacle for the consideration of contextual information. The Maximum Entropy (ME) model has the label bias problem seeking the local optimization rather than global. Conditional random fields (CRF) overcome the above two problems but face new challenges, such as the selection of the optimized features for some con- crete issues. Furthermore, the conventional research tends to emphasize more on the algorithms utilized or the workflow designed while the exploration of the essential issues under CWS is less mentioned. 2 Characteristics of Chinese 2.1 Structural Ambiguity The structural ambiguity phenomenon usually exists between the adjacent words in a Chinese sentence. This means one Chinese character can be combined with the antecedent characters or subsequent characters, and both combinations re- sult in reasonable Chinese words. The structural ambiguity leads to two prob- lems. a). It results in two differently segmented sentences, and each of the sentences has a correct language structure and sensible meaning. For instance, the Chi- nese sentence “他的船只靠在維多利亞港” has a structural ambiguity between the adjacent words “ 船只靠在”. The Chinese character “船” and the combined characters “船只” have the same meaning, “ship” as a noun. The Chinese charac- ter “只” also has the meaning of “only” as an adverb. So the adjacent characters “船只靠在” could be segmented as “ 船只/靠在” (a ship moors in) or “ 船/只/靠 在” (ships only moor in). Thus, the original Chinese sentence has two different structures and the corresponding meanings “他的/船只/靠在/維多利亞港” (His ship is mooring in Victoria Harbour) and “他的/船/只/靠在/維多利亞港” (His ship only moors in Victoria Harbour). b). It results in two differently segmented sentences with one sentence hav- ing a correct structure while the other one being wrongly structured and not forming a normal Chinese sentence. For instance, the Chinese sentence “水快 速凍成了冰” has a structural ambiguity between the characters “快速凍”. The combined characters “快速” mean “fast” as an adjective or an adverb, while “速 凍” is also a Chinese word usually as an adjective to specify some kind of food, e.g. “速凍/食品” (fast-frozen food). So the original Chinese sentence may be automatically segmented as “水/快速/凍/成了/冰” (the water is quickly frozen into ice.) or “水/快/速凍/成了/冰” (water / fast /fast frozen / into / ice). The second segmentation result does not have a normal Chinese structure. 2.2 Abbreviation and Ellipsis The two phenomena of abbreviation and ellipsis usually lead to ambiguity in Chinese sentences. Firstly, let’s see the abbreviation phenomenon in Chinese hu- man name entities. People sometimes use the Chinese family name to represent
  3. 3. A Study of Chinese Word Segmentation 3 one person with his or her given name omitted in the expression. For instance, the Chinese sentence “ 許又從街坊口中得知” can be understood as “許又/從/街 坊/口中/得知” (XuYou heard from the neighbors) with the word “許又” (XuY- ou) as a Chinese full name because “許” (Xu) is a commonly used Chinese family name which is to be followed by one or two characters as the given name. How- ever, the sentence can also be structured as “ 許/又/從/街坊/口中/得知” (Xu once more heard from the neighbors) with the surname “許” (Xu) representing a person and the character “ 又” (once more) as an adverb to describe the verb “得知” (heard). Secondly, let’s see the ellipsis phenomenon in place name entities (including the translated foreign place names). People usually use the first Chinese char- acter (the beginning character) of a place name entity to stand for the place especially when the string length of the entity is large (four or more characters). For example, the Chinese sentence “敵人襲擊巴西北部” could be understood as “敵人/襲擊/巴/西北部” (The enemy attacks the northwestern part of Pakistan) with the character “巴” (Ba) as the representation of “巴基斯坦” (Pakistan) which is very common in Chinese international news reports, and the word “西 北部” (northwestern part) is a noun of locality. However, the sentence can al- so be understood as “敵人/襲擊/巴西/北部” (the enemy attacks the northern Brazil) because the two characters “巴西” (Brazil) combine to mean the Chinese name of Brazil and the following characters “ 北部” (northern part) combine to function as a noun of locality. Each of these two kinds of segmentations results in a well structured Chinese sentence. 3 CRF Model The CRFs are both conditional probabilistic models and statistics-based undi- rected graph models that can be used to seek the globally optimized optimization results when dealing with the sequence labeling problems [11]. Assume X is a variable representing input sequence, and Y is another variable representing the corresponding labels to be attached to X. The two variables interact as condi- tional probability P(Y |X) mathematically. Then comes the definition of CRF: Let a graph model G = (V, E) comprise a set V of vertices or nodes together with a set E of edges or lines and Y = {Yv|v ∈ V }, such that Y is indexed by the vertices of graph G, then (X, Y) shapes a CRF model. This set meets the following form: Pθ(Y |X) ∝ exp   e∈E,k λkfk(e, Y |e, X) + v∈V,k µkgk(v, Y |v, X)   (1) where X and Y represent the data sequence and label sequence respectively; fk and gk are the features to be defined; λk and µk are the parameters trained from the data sets; the bar “|” is the mathematical symbol to express that the right part is the precondition of the left. The training methods for the parameters include the Conjugate-gradient algorithm [12], the Iterative Scaling Algorithms
  4. 4. 4 A.L.F. Han, D.F. Wong, L.S. Chao, L. He, L. Zhu, S. Li [11], the Voted Perceptron Training [13] etc. In this paper we select a quasi- newton algorithm [14] and some online tools2 to train the model. 4 Features Designed As discussed in the previous sections of this paper about the characteristics of Chinese, CWS is highly reliant on the contextual information to deal with the ambiguity phenomena and yield the correct segmentations of sentences. So we will employ a large set of features to consider as much of the contextual information as possible. Furthermore, the name entities play an important role in sentence segmentations [15]. For example, the Chinese sentence “新疆維吾爾 自治區/分外/妖嬈” (the Xinjiang Uygur Autonomous Region is extraordinarily enchanting) is probable wrongly segmented as “ 新疆/維吾爾/自治/區分/外/妖 嬈” (Xinjiang / Uygur / autonomy / distinguish / out / enchanting) due to that the location name entity “新疆維吾爾自治區” (Xinjiang Uygur Autonomous Region) is a long string (8 characters) and its last character “ 區” (Qu) can be combined with the following character “ 分” (Fen) to form a commonly used Chinese word “ 區分” (distinguish). There are several characteristics of Chinese name entities. First, some of the name entities contain more than five or six characters that are much longer than the common names, e.g. the location name “古爾班通古特沙漠” (Gu Er Ban Tong Gu Te Desert) and the organization name “中華人民共和國” (People’s Republic of China) contain eight and seven Chinese characters respectively. Sec- ond, one name entity usually can be understood as composed by two inner parts i.e. its proprietary name (in the beginning of the name entity) and the commonly used categorical name which serves as the suffix, and the suffixes usually con- tain one or two characters thus are shorter than the proprietary names. Thus, in the designing of the features, we put more consideration into the antecedent characters than the subsequent characters for each token studied to avoid un- necessary consideration of the subsequent characters which may generate noises in the model. The final optimized feature set used in the experiment is shown in Table 1 which includes unigram (U), bigram (B) and trigram (T) features. 5 Experiments 5.1 Corpora The corpora used in the experiments are CityU (City University of Hong Kong corpus) and CTB (Pennsylvania Chinese Tree Library corpus) which we select from the SIGHAN (a special interest group in Association of Computational Linguistics) Chinese Language Processing Bakeoff-4 [16] for the testing of tra- ditional Chinese and simplified Chinese respectively. CityU and CTB contain 36,228 and 23,444 Chinese sentences respectively in the training corpora, while the corresponding test corpora contain 8,094 and 2,772 sentences. 2
  5. 5. A Study of Chinese Word Segmentation 5 Table 1. Designed feature sets Features Meaning Un, n ∈ (−4, 1) Unigram, from antecedent 4th to subsequent 1st character Bn,n+1, n ∈ (−2, 0) Bigram, three pairs of characters from the antecedent 2nd to the subsequent 1st B−1,1 Jump Bigram, the antecedent 1st and subsequent 1st char- acter Tn,n+1,n+2, n ∈ (−2, −1) Trigram, the characters from the antecedent 2nd to the fol- lowing 1st Table 2. Information of the test corpora Number of Words Type Total IV OOV CityU 235,631 216,249 19,382 CTB 80,700 76,200 4,480 The detailed information of the testing corpora is shown in Table 2. IV (in vocabulary) means the number of words existing in the training corpus while OOV means the number of words never appears in the training corpus. 5.2 Experiment Results The evaluation scores of experiments are shown in Table 3 with three indicators. The evaluation scores on precision and recall are similar without big differences, leading to the total F-scores 92.68% and 94.05% respectively on CityU and CTB. On the other hand, the evaluation scores also demonstrate that the OOV words are the main challenges in the CWS research (64.20% and 65.62% respectively of the F-score on OOV words). Table 3. Evaluation scores of results Total IV OOV Recall Precision F Recall Precision F Recall Precision F CityU 0.9271 0.9265 0.9268 0.9490 0.9593 0.9541 0.6828 0.6057 0.6420 CTB 0.9387 0.9423 0.9405 0.9515 0.9666 0.9590 0.7214 0.6018 0.6562 5.3 Comparison with Related works Lu [17] conducted the CWS research using synthetic word analysis with tree- based structure information and annotation of morphologically derived words
  6. 6. 6 A.L.F. Han, D.F. Wong, L.S. Chao, L. He, L. Zhu, S. Li from training dictionary in closed track (only using the information found in the provided training data). Zhang and Sumita [18] created a new Chinese morphological analyzer Achilles by integrating rule-based, dictionary-based, and statistical machine learning methods and CRF. Keong [19] performed CWS using a Maximum Entropy based Model. Wu [20] employed CRF as the primary model and adopts a transformation based learning technique for the post-processing in the open track (any external data could be used in addition to the provided training data). Qin [21] proposed a segmenter using forward-backward maximum matching algorithm as the basic model with the external knowledge of the news wire of Chinese People’s Daily in the year of 2000 to achieve the disambiguation. Table 4. Comparison with related works Track IV F-score OOV F-score Total F-score CityU CTB CityU CTB CityU CTB [17] Closed 0.9483 0.9556 0.6093 0.6286 0.9183 0.9354 [19] Closed 0.9386 0.9290 0.5234 0.5128 0.9083 0.9077 [18] Closed 0.9101 0.8939 0.6072 0.6273 0.8850 0.8780 [20] Open 0.9401 0.9753 0.6090 0.8839 0.9098 0.9702 [21] Open N/A 0.9398 N/A 0.6581 N/A 0.9256 Ours Closed 0.9541 0.9590 0.6420 0.6562 0.9268 0.9405 The comparison with related works shows that this paper yields promising results without using external resources. The closed-track total F-score on CityU corpus (0.9268) even outperforms [20] what is tested on open track using external resources (0.9098). 6 Conclusion and Future Works This paper focuses on the research work of CWS that is a difficult problem in NLP literature due to the fact that there is no clear boundary in Chinese sen- tences. Firstly, the authors introduce several kinds of ambiguities inherent with Chinese that underlie the thorniness of CWS to explore the potential solutions. Then the CRF model is employed to conduct the sequence labeling. In the selec- tion of features, in consideration of the excessive length of certain name entities and for the disambiguation of some sentences, an augmented and optimized fea- ture set is designed. Without using any external resources, the experiments yield promising evaluation scores as compared to some related works including both those using the closed track and those using the open track. Acknowledgments. The authors wish to thank the anonymous reviewers for many helpful comments. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University
  7. 7. A Study of Chinese Word Segmentation 7 of Macau for the funding support for our research, under the reference No. 017/2009/A and RG060/09-10S/CS/FST. The authors also wish to thank Mr. Yunsheng Xie who is a Master Candidate of Arts in English Studies from the Department of English in the Faculty of Social Sciences and Humanities at the University of Macau for his kind help in improving the language quality. References 1. Wong Pak-kwong, and Chan Chorkin.: Chinese word segmentation based on max- imum matching and word binding force. In Proceedings of the 16th conference on Computational linguistics – Vol. 1. (COLING ’96), Association for Computational Linguistics. Stroudsburg, PA, USA, 200-203. (1996) 2. Sproat Richard, Gale Willian, Shih Chilin, and Chang Nancy: A stochastic finite- state word-segmentation algorithm for Chinese. Computational Linguistics, Vol. 22(3), 377-404. (1996) 3. Zhang Hua-Ping, Liu Qun, Cheng Xue-Qi, Zhang Hao, and Yu Hong-Kui: Chinese lexical analysis using hierarchical hidden Markov model. In Proceedings of the sec- ond SIGHAN workshop on Chinese language processing – Vol. 17, (SIGHAN ’03), Association for Computational Linguistics. Stroudsburg, PA, USA, 63-70. (2003) 4. Low Kiat Jin, Ng Tou Hwee and Guo Wenyuan: A maximum entropy approach to Chinese word segmentation. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Pro-cessing. vol. 164. (2005) 5. Peng Fuchun, Feng Fangfang, and McCallum An-drew: Chinese segmentation and new word detection using conditional random fields. In Pro-ceedings of the 20th international conference on Computational Linguistics (COLING ’04). Associ-ation for Computational Linguistics. Stroudsburg, PA, USA, Article 562. (2004) 6. Yang Ting-hao, Jiang Tian-Jian, Kuo Chan-hung, Tsai Tzong-han Richard and Hsu Wen-lian: Unsupervised overlapping feature selection for conditional random fields learning in Chinese word segmentation. In Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING ’11), Association for Computational Linguistics. Stroudsburg, PA, USA, 109-122. (2011) 7. Peng Fuchun, Huang Xiangji, Schuurmans Dale, Cer-cone Nick and Robertson Stephen: Using self-supervised word segmentation in Chinese in-formation retrieval. In Proceedings of the 25th an-nual international ACM SIGIR conference on Re- search and development in information retrieval (SIGIR ’02). ACM, New York, NY, USA, 349-350. (2002) 8. Wang Hanshi, Zhu Jian, Tang Shiping, and Fan Xiaozhong: A new unsupervised approach to word segmentation. Computational Linguistics, Vol. 37(3), 421-454. (2011) 9. Song Yan, Kit Chunyu, Xu Ruifeng, and Zhao Hai: How unsupervised learning affects character tagging based Chinese Word Segmentation: A quantitative investi- gation. Machine Learning and Cybernetics, International Conference. Vol. 6, 2009. 3481-3486. (2009) 10. Zhao Hai and Kit Chunyu: Integrating unsupervised and supervised word segmen- tation: The role of goodness measures. Information Sciences, vol. 181(1), 163-183. (2011) 11. Lafferty John, McCallum Andrew, and Pereira C.N. Ferando: Conditional ran- dom fields: Proba-bilistic models for segmenting and labeling se-quence data. In Proceeding of 18th International Conference on Machine Learning. 282-289. (2001)
  8. 8. 8 A.L.F. Han, D.F. Wong, L.S. Chao, L. He, L. Zhu, S. Li 12. Shewchuk, J. R.: An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMUCS-TR-94-125, Carnegie Mellon University (1994) 13. Collins Michael, Duffy Nigel and Park Florham: New ranking algorithms for pars- ing and tag-ging: kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Lin- guistics (ACL ’02), Association for Computational Linguistics. Stroudsburg, PA, USA, 263-270. (2002) 14. The Numerical Algorithms Group: E04 - Min-imizing or Maximizing a Function, NAG Library Manual, Mark 23. Retrieved (2012) 15. Peng, L., Liu, Z., Zhang, L.: A Recognition Approach Study on Chinese Field Term Based Mutual Information /Conditional Random Fields. 2012 International Workshop on Information and Electronics Engineering. 1952-1956. (2012) 16. Jin Guangjin and Chen Xiao: The Fourth Inter-national Chinese Language Pro- cessing Bakeoff: Chinese Word Segmentation, Name Entity Recog-nition and Chi- nese POS Tagging. In Proceedings of the Sixth SIGHAN Workshop on Chinese Lan-guage Processing. Hyderabad, India, 83-95. (2008) 17. Lu J., Masayuki Asahara and Yuji Matsumoto: Analyzing Chinese Synthetic Word- s with Tree-based Information and a Survey on Chinese Mor-phologically Derived Words. In Proceedings of the Sixth SIGHAN Workshop on Chinese Language Pro- cessing. Hyderabad, India, pp. 53-60. (2008) 18. Zhang R. and Eiichiro Sumita: Achilles: NiCT/ATR Chinese Morphological Ana- lyzer for the Fourth Sighan Bakeoff. In Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing. Hyderabad, India, 178-182. (2008) 19. Leong K. S., Fai Wong, Yiping Li and M. Dong: Chinese Tagging Based on Max- imum Entropy Model. In Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing. Hy-derabad, India, 138-142. (2008) 20. Wu X., Xiaojun Lin, XinhaoWang, ChunyaoWu, Yaozhong Zhang and Dianhai Yu: An Im-proved CRF based Chinese Language Processing System for SIGHAN Bakeoff 2007. In Proceed-ings of the Sixth SIGHAN Workshop on Chinese Language Processing. Hyderabad, India, 155-160. (2008) 21. Qin Y., Caixia Yuan, Jiashen Sun and Xiaojie Wang: BUPT Systems in the SIGHAN Bakeoff 2007. In Proceedings of the Sixth SIGHAN Work-shop on Chinese Language Processing. Hyderabad, India, 94-97. (2008)