Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing

This paper gives two contributions to dependency parsing in Korean. First, we build a Korean dependency Treebank from an existing constituent Treebank. For a morphologically rich language like Korean, dependency parsing shows some advantages over constituent parsing. Since there is not much training data available, we automatically generate dependency trees by applying head-percolation rules and heuristics to the constituent trees. Second, we show how to extract useful features for dependency parsing from rich morphology in Korean. Once we build the dependency Treebank, any statistical parsing approach can be applied. The challenging part is how to extract features from tokens consisting of multiple morphemes. We suggest a way of selecting important morphemes and use only these as features to avoid sparsity. Our parsing approach is evaluated on three different genres using both gold-standard and automatic morphological analysis. We also test the impact of fine vs. coarse-grained morphologies on dependency parsing. With automatic morphological analysis, we achieve labeled attachment scores of 80%+. To the best of our knowledge, this is the first time that Korean dependency parsing has been evaluated on labeled edges with such a large variety of data.


  1. Statistical Dependency Parsing in Korean: From Corpus Generation To Automatic Parsing
     Workshop on Statistical Parsing of Morphologically-Rich Languages
     12th International Conference on Parsing Technologies
     Jinho D. Choi & Martha Palmer, University of Colorado at Boulder
     October 6th, 2011, choijd@colorado.edu
  2. Dependency Parsing in Korean
     • Why dependency parsing in Korean?
       - Korean is a flexible word order language.
     [Figure: two constituent trees, "She still him loved" (SOV construction) and the scrambled "Him she still loved" (with a trace *T*), and the single dependency analysis (SBJ, ADV, OBJ) that covers both.]
  3. Dependency Parsing in Korean
     • Why dependency parsing in Korean?
       - Korean is a flexible word order language.
       - Rich morphology makes dependency parsing easier: "she" + auxiliary particle marks the subject, "him" + objective case marker marks the object, regardless of word order.
     [Figure: dependency tree for "She still him loved" with SBJ, ADV, and OBJ arcs.]
  4. Dependency Parsing in Korean
     • Statistical dependency parsing in Korean
       - Sufficiently large training data is required.
         • Not much training data is available for Korean dependency parsing.
     • Constituent Treebanks in Korean
       - Penn Korean Treebank: 15K sentences.
       - KAIST Treebank: 30K sentences.
       - Sejong Treebank: 60K sentences.
         • The most recent and largest Treebank in Korean.
         • Contains Penn Treebank style constituent trees.
  5. Sejong Treebank
     • Phrase structure
       - Includes phrase tags, POS tags, and function tags.
       - Each token can be broken into several morphemes.
       - Tokens are mostly separated by white spaces.
     [Figure: constituent tree for "She still him loved"; morpheme analyses per token: NP+JX (she + auxiliary particle), MAG (still), NP+JKO (him + objective case particle), NNG+XSV+EP+EF (love + verbalizing suffix + prefinal ending + final ending).]
  6. Sejong Treebank
     • Tagsets
       - Function tags are also used to determine dependency relations during the conversion.

     Table 2: Phrase-level tags (left) and function tags (right) in the Sejong Treebank.
       S    Sentence              SBJ  Subject
       Q    Quotative clause      OBJ  Object
       NP   Noun phrase           CMP  Complement
       VP   Verb phrase           MOD  Noun modifier
       VNP  Copula phrase         AJT  Predicate modifier
       AP   Adverb phrase         CNJ  Conjunctive
       DP   Adnoun phrase         INT  Vocative
       IP   Interjection phrase   PRN  Parenthetical

     Table 1: POS tags in the Sejong Treebank (PM: predicate marker, CP: case particle, EM: ending marker, DS: derivational suffix, PR: particle, SF SP SS SE SO: different types of punctuation; SS and SE contain only left and right brackets, respectively).
       NNG  General noun          JKS  Subjective CP      EP   Prefinal EM
       NNP  Proper noun           JKC  Complemental CP    EF   Final EM
       NNB  Bound noun            JKG  Adnomial CP        EC   Conjunctive EM
       NP   Pronoun               JKO  Objective CP       ETN  Nominalizing EM
       NR   Numeral               JKB  Adverbial CP       ETM  Adnominalizing EM
       VV   Verb                  JKV  Vocative CP        XPN  Noun prefix
       VA   Adjective             JKQ  Quotative CP       XSN  Noun DS
       VX   Auxiliary predicate   JX   Auxiliary PR       XSV  Verb DS
       VCP  Copula                JC   Conjunctive PR     XSA  Adjective DS
       VCN  Negation adjective    MM   Adnoun             XR   Base morpheme
       MAG  General adverb        SN   Number             NF   Noun-like word
       MAJ  Conjunctive adverb    SL   Foreign word       NV   Predicate-like word
       IC   Interjection          SH   Chinese word       NA   Unknown word
  7. Dependency Conversion
     • Conversion steps
       - Find the head of each phrase using head-percolation rules.
         • All other nodes in the phrase become dependents of the head.
       - Re-direct dependencies for empty categories.
         • Empty categories are not annotated in the Sejong Treebank.
         • Skipping this step generates only projective dependency trees.
       - Label (automatically generated) dependencies.
     • Special cases
       - Coordination, nested function tags.
  8. Dependency Conversion
     • Head-percolation rules
       - Achieved by analyzing each phrase in the Sejong Treebank.
       - Korean is a head-final language.
       - Some approaches have treated each morpheme as an individual token to parse (Chung et al., 2010).
       - No rules to find the head morpheme of each token.

     Table 3: Head-percolation rules for the Sejong Treebank. l/r implies looking for the leftmost/rightmost constituent. * implies any phrase-level tag. | implies a logical OR and ; is a delimiter between tags.
       S      r  VP;VNP;S;NP|AP;Q;*
       Q      l  S|VP|VNP|NP;Q;*
       NP     r  NP;S;VP;VNP;AP;*
       VP     r  VP;VNP;NP;S;IP;*
       VNP    r  VNP;NP;S;*
       AP     r  AP;VP;NP;S;*
       DP     r  DP;VP;*
       IP     r  IP;VNP;*
       X|L|R  r  *
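The head-finding step can be sketched as a small interpreter for the rule strings in Table 3. This is an illustrative reconstruction, not the authors' code; the function name `find_head` and the input encoding (a phrase tag plus a list of child tags, with function tags attached as `-SBJ` style suffixes) are assumptions.

```python
# Head-percolation rules from Table 3: (search direction, priority list).
# 'r'/'l' = try the rightmost/leftmost child first; '|' = logical OR;
# '*' matches any phrase-level tag.
RULES = {
    "S":   ("r", "VP;VNP;S;NP|AP;Q;*"),
    "Q":   ("l", "S|VP|VNP|NP;Q;*"),
    "NP":  ("r", "NP;S;VP;VNP;AP;*"),
    "VP":  ("r", "VP;VNP;NP;S;IP;*"),
    "VNP": ("r", "VNP;NP;S;*"),
    "AP":  ("r", "AP;VP;NP;S;*"),
    "DP":  ("r", "DP;VP;*"),
    "IP":  ("r", "IP;VNP;*"),
}

def find_head(phrase_tag, child_tags):
    """Return the index of the head child of a phrase; every other
    child then becomes a dependent of that head."""
    direction, rule = RULES.get(phrase_tag, ("r", "*"))
    order = (range(len(child_tags) - 1, -1, -1) if direction == "r"
             else range(len(child_tags)))
    for alternatives in rule.split(";"):       # tried in priority order
        wanted = alternatives.split("|")
        for i in order:
            tag = child_tags[i].split("-")[0]  # drop function tags like -SBJ
            if "*" in wanted or tag in wanted:
                return i
    return len(child_tags) - 1                 # head-final fallback
```

For example, `find_head("S", ["NP-SBJ", "VP"])` picks the `VP` (index 1), matching the head-final analyses shown on the earlier slides.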
  9. Dependency Conversion
     • Dependency labels
       - Labels retained from the function tags: when a node with a function tag is determined to be a dependent of some other node by the headrules, the tag is taken as the dependency label (e.g., SBJ, OBJ).
       - Labels inferred from constituent relations.

     Algorithm 1: Getting inferred labels.
       input : (c, p), where c is a dependent of p.
       output: a dependency label l as c ← p.
       begin
         if   p = root          then l ← ROOT
         elif c.pos = AP        then l ← ADV
         elif p.pos = AP        then l ← AMOD
         elif p.pos = DP        then l ← DMOD
         elif p.pos = NP        then l ← NMOD
         elif p.pos = VP|VNP|IP then l ← VMOD
         else                        l ← DEP
       end
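Algorithm 1 transcribes directly into Python. The function name and the explicit root flag are our additions; the branch order and labels follow the slide, where `c_pos`/`p_pos` are the phrase tags of the dependent c and its head p in the original constituent tree.

```python
def infer_label(c_pos, p_pos, p_is_root=False):
    """Infer a dependency label l for the relation c <- p (Algorithm 1)."""
    if p_is_root:
        return "ROOT"
    if c_pos == "AP":
        return "ADV"
    if p_pos == "AP":
        return "AMOD"
    if p_pos == "DP":
        return "DMOD"
    if p_pos == "NP":
        return "NMOD"
    if p_pos in ("VP", "VNP", "IP"):
        return "VMOD"
    return "DEP"
```

Function-tag labels (SBJ, OBJ, ...) take precedence; the inferred label is used only when no function tag applies.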
  10. Dependency Conversion
      • Coordination
        - Previous conjuncts become dependents of the following conjuncts.
      • Nested function tags
        - Nodes with nested f-tags become the heads of the phrases.
      [Figure: "I_and he_and she home left"; I_and depends on he_and and he_and on she, both labeled CNJ, with she as SBJ and home as OBJ of left.]
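The coordination rule above can be sketched in a few lines. This is a hypothetical helper (name and triple format ours): given the conjuncts of a coordinate phrase in surface order, each previous conjunct is attached to the conjunct that follows it with the CNJ label, and the final conjunct keeps the head assigned by the headrules.

```python
def chain_conjuncts(conjuncts):
    """Return (dependent, head, label) triples chaining each conjunct
    to the following one, per the coordination rule on slide 10."""
    return [(conjuncts[i], conjuncts[i + 1], "CNJ")
            for i in range(len(conjuncts) - 1)]
```

For the figure's example, `chain_conjuncts(["I_and", "he_and", "she"])` yields the two CNJ arcs, leaving "she" to be attached as SBJ.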
  11. Dependency Parsing
      • Dependency parsing algorithm
        - Transition-based, non-projective parsing algorithm (Choi & Palmer, 2011).
        - Selectively performs transitions from both projective and non-projective dependency parsing algorithms.
        - Linear time parsing speed in practice, even for non-projective trees.
      • Machine learning algorithm
        - Liblinear L2-regularized L1-loss SVM.
      Jinho D. Choi & Martha Palmer. 2011. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of ACL:HLT'11.
  12. Dependency Parsing
      • Feature selection
        - Each token consists of multiple morphemes (up to 21).
        - POS tag feature of each token?
          • (NNG & XSV & EP & EF & SF) vs. (NNG | XSV | EP | EF | SF)
          • Sparse information vs. lack of information. Happy medium?
      [Figure 5: three tokens and their morphemes: Nakrang + Princess + JX (NNP+NNG+JX), Hodong + Prince + JKO (NNP+NNG+JKO), Love + XSV + EP + EF + . (NNG+XSV+EP+EF+SF).]
  13. Dependency Parsing
      • Feature extraction
        - Each token consists of one or many morphemes annotated with different POS tags; this morphology makes it difficult to extract features for dependency parsing.
        - Joining all POS tags within a token into a single tag (e.g., NNP+NNG+JX for the first token in Figure 5) causes very sparse feature vectors when used as features.
        - As a compromise, certain types of morphemes are selected and only these are used as features.

      Table 6: Types of morphemes in each token used to extract features for the parsing models.
        FS  The first morpheme
        LS  The last morpheme before JO|DS|EM
        JK  Particles (J* in Table 1)
        DS  Derivational suffixes (XS* in Table 1)
        EM  Ending markers (E* in Table 1)
        PY  The last punctuation, only if there is no other morpheme followed by the punctuation

        - These morphemes can be used either individually (e.g., the POS tag of JK for the 1st token is JX) or jointly (e.g., a joined feature of POS tags between LS and JK for the 1st token is NNG+JX). From our experiments, features extracted from the JK and EM morphemes are found to be the most useful.
      [Figure 6: Morphemes extracted from the tokens in Figure 5 with respect to the types in Table 6.]
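A minimal sketch of Table 6's morpheme selection, under stated assumptions: a token is a list of `(form, pos)` morphemes, tag classes are tested by Table 1's prefixes (J* particles, XS* derivational suffixes, E* ending markers), LS is taken as the morpheme immediately before the first particle/suffix/ending, and the PY check is simplified to "the last morpheme is a punctuation tag". The function name and the placeholder forms in the usage example are ours, not from the paper.

```python
def select_morphemes(token):
    """Map Table 6 types (FS, LS, JK, DS, EM, PY) to morphemes of a token.
    token: list of (form, pos) pairs in surface order."""
    sel = {"FS": token[0]}                      # FS: the first morpheme
    for form, pos in token:
        if pos.startswith("J"):                 # JK: particles (J*)
            sel["JK"] = (form, pos)
        elif pos.startswith("XS"):              # DS: derivational suffixes (XS*)
            sel["DS"] = (form, pos)
        elif pos.startswith("E"):               # EM: ending markers (E*)
            sel["EM"] = (form, pos)
    # LS: the last morpheme before any particle/suffix/ending morpheme
    for i, (form, pos) in enumerate(token):
        if pos.startswith(("J", "XS", "E")):
            if i > 0:
                sel["LS"] = token[i - 1]
            break
    # PY: the last punctuation, only if no other morpheme follows it
    if token[-1][1].startswith("S"):
        sel["PY"] = token[-1]
    return sel
```

For the third token in Figure 5 (`Love + XSV + EP + EF + .`), this selects Love/NNG as FS and LS, the XSV as DS, the EF (the later of the two endings) as EM, and the period as PY.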
  14. Dependency Parsing
      • Feature extraction
        - Extract features using only important morphemes.
        - Individual POS tag features of the 1st and 3rd tokens:
          NNP1, NNG1, JK1, NNG3, XSV3, EF3
        - Joined features of POS tags between the 1st and 3rd tokens:
          NNP1_NNG3, NNP1_XSV3, NNP1_EF3, JK1_NNG3, JK1_XSV3
        - Tokens used: wi, wj, wi±1, wj±1
        - For n-grams where n > 1, it is not obvious which combinations of these morphemes across different tokens are helpful.
      • Learning: Liblinear (Hsieh et al., 2008), applying c = 0.1, e = 0.1 (termination criterion), B = 0 (bias).
      [Figure: the three tokens of Figure 5, Nakrang + Princess + JX, Hodong + Prince + JKO, Love + XSV + EP + EF + ., with their selected morphemes.]
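The individual and joined POS-tag features above can be generated as follows. This is a sketch under an assumed layout: each token has already been reduced to a dict mapping Table 6 types to `(form, pos)` morphemes, and subscripts in the feature strings mark the token index; the function name is ours.

```python
def pos_features(sel_i, sel_j, i, j):
    """Build individual and joined POS-tag feature strings for a pair of
    tokens w_i and w_j from their selected morphemes."""
    feats = []
    # individual features, e.g. NNP1, JK1, XSV3
    for sel, idx in ((sel_i, i), (sel_j, j)):
        feats += [f"{pos}{idx}" for _, pos in sel.values()]
    # joined features across the two tokens, e.g. NNP1_XSV3
    feats += [f"{pi}{i}_{pj}{j}"
              for _, pi in sel_i.values() for _, pj in sel_j.values()]
    return feats
```

With the 1st and 3rd tokens of Figure 5 this produces exactly the feature families listed on the slide (individual tags plus their cross-product).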
  15. Experiments
      • Corpora
        - Dependency trees converted from the Sejong Treebank.
        - Consist of 20 sources in 6 genres: Newspaper (NP), Magazine (MZ), Fiction (FI), Memoir (ME), Informative Book (IB), and Educational Cartoon (EC).
        - For the development and evaluation sets, one newspaper about art, one fiction text, and one informative book about trans-nationalism are picked; the first half of each is used for development and the second half for evaluation.
        - Evaluation sets are very diverse compared to training sets.
          • This ensures the robustness of the parsing models, which is important because the goal is to parse various texts on the web.

      Table 10: Number of sentences in training (T), development (D), and evaluation (E) sets for each genre.
            NP     MZ     FI      ME     IB     EC
        T   8,060  6,713  15,646  5,053  7,983  1,548
        D   2,048  -      2,174   -      1,307  -
        E   2,048  -      2,175   -      1,308  -
  16. Experiments
      • Morphological analysis
        - Two automatic morphological analyzers are used.
        - Intelligent Morphological Analyzer
          • Developed by the Sejong project.
          • Provides the same morphological analysis as their Treebank.
          • Considered as fine-grained morphological analysis.
        - Mach (Shim and Yang, 2002)
          • Analyzes 1.3M words per second.
          • Provides more coarse-grained morphological analysis.
      Kwangseob Shim & Jaehyung Yang. 2002. A Supersonic Korean Morphological Analyzer. In Proceedings of COLING'02.
  17. Experiments
      • Evaluations
        - Gold-standard vs. automatic morphological analysis.
          • Relatively low performance from the automatic system.
        - Fine vs. coarse-grained morphological analysis.
          • Differences are not too significant.
        - Robustness across different genres.

      Table 11: Parsing accuracies achieved by three models (in %). LAS: labeled attachment score, UAS: unlabeled attachment score, LS: label accuracy score.
              Gold, fine-grained     Auto, fine-grained     Auto, coarse-grained
              LAS    UAS    LS       LAS    UAS    LS       LAS    UAS    LS
        NP    82.58  84.32  94.05    79.61  82.35  91.49    79.00  81.68  91.50
        FI    84.78  87.04  93.70    81.54  85.04  90.95    80.11  83.96  90.24
        IB    84.21  85.50  95.82    80.45  82.14  92.73    81.43  83.38  93.89
        Avg.  83.74  85.47  94.57    80.43  83.01  91.77    80.14  82.89  91.99
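The three scores in Table 11 follow the standard definitions: UAS counts tokens whose head is correct, LS counts tokens whose label is correct, and LAS requires both. A generic implementation (not code from the paper), assuming per-token `(head_index, label)` pairs:

```python
def attachment_scores(gold, pred):
    """gold/pred: parallel lists of (head_index, label) per token.
    Returns (LAS, UAS, LS) as percentages."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # correct heads
    ls  = sum(g[1] == p[1] for g, p in zip(gold, pred))  # correct labels
    las = sum(g == p for g, p in zip(gold, pred))        # both correct
    return 100.0 * las / n, 100.0 * uas / n, 100.0 * ls / n
```

Since LAS requires both the head and the label to match, it is always bounded above by both UAS and LS, as in every row of Table 11.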
  18. Conclusion
      • Contributions
        - Generating a Korean dependency Treebank.
        - Selecting important morphemes for dependency parsing.
        - Evaluating the impact of fine vs. coarse-grained morphological analysis on dependency parsing.
        - Evaluating the robustness across different genres.
      • Future work
        - Increase the feature span beyond bigrams.
        - Find head morphemes of individual tokens.
        - Insert empty categories.
  19. Acknowledgements
      • Special thanks are due to
        - Professor Kong Joo Lee of Chungnam National University.
        - Professor Kwangseob Shim of Shungshin Women's University.
      • We gratefully acknowledge the support of the National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
