Building Universal Dependency Treebanks in Korean

This paper presents three Korean treebanks consisting of dependency trees derived from existing treebanks (the Google UD Treebank, the Penn Korean Treebank, and the KAIST Treebank) and pseudo-annotated according to the latest guidelines from the Universal Dependencies (UD) project. The Korean portion of the Google UD Treebank is re-tokenized to match the morpheme-level annotation suggested by the other corpora, and systematically assessed for errors. Phrase structure trees in the Penn Korean Treebank and the KAIST Treebank are automatically converted into dependency trees using head-finding rules and linguistic heuristics. Additionally, part-of-speech tags in all treebanks are converted into the UD tagset. In total, 38K+ dependency trees are generated, comprising a coherent set of dependency relations for over half a million tokens. To the best of our knowledge, this is the first time that these Korean corpora have been analyzed together and transformed into dependency trees following the latest UD guidelines, version 2.
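The totals quoted above can be checked against the per-corpus token and sentence counts that the poster reports in its Statistics section. The following is a minimal sketch of that arithmetic; the dictionary layout and function names are illustrative, and it assumes the reported token counts correspond to dependency nodes.

```python
# Per-corpus counts copied from the poster's Statistics section.
# The data layout and function names are illustrative, not from the paper.
CORPORA = {
    "GKT": {"tokens": 80_392, "sentences": 6_339},
    "PKT": {"tokens": 132_041, "sentences": 5_010},
    "KTB": {"tokens": 350_090, "sentences": 27_363},
}


def totals(corpora):
    """Return (total tokens, total sentences) summed over all corpora."""
    total_tokens = sum(c["tokens"] for c in corpora.values())
    total_sentences = sum(c["sentences"] for c in corpora.values())
    return total_tokens, total_sentences


def tokens_per_sentence(corpora):
    """Return the average number of tokens per sentence for each corpus."""
    return {name: c["tokens"] / c["sentences"] for name, c in corpora.items()}


if __name__ == "__main__":
    print(totals(CORPORA))
    for name, avg in tokens_per_sentence(CORPORA).items():
        print(f"{name}: {avg:.1f} tokens per sentence")
```

Under that assumption, the averages agree with the poster's description of roughly 26 nodes per sentence for PKT and roughly 12 to 13 for GKT and KTB.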

Jayeol Chun (1), Na-Rae Han (2), Jena D. Hwang (3), Jinho D. Choi (1)
(1) Emory University; (2) University of Pittsburgh; (3) IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us
Language Resources and Evaluation Conference, May 7-12, 2018; Miyazaki, Japan

Objectives

This paper presents three dependency treebanks in Korean derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of the Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in the Penn Korean Treebank and the KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.

Google UD Treebank

The Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under old UD guidelines. We systematically correct GKT to bring it up to the standards of UDv2.

[Figure panels: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling]

Corpus Analytics

• At approximately 26 dependency nodes per sentence, PKT includes on average the longest and most complex sentences among the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.

Part-of-Speech Tags

• NOUN, VERB, ADV and PUNCT are the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KTB, where ADJ (adjective) takes that spot instead.
• NUM (number) is prominent in PKT, which is likely a reflection of its news domain.
• The absence of SCONJ in GKT is due to a tokenization that does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in GKT, were partially introduced into the revised GKT as a result of the tokenization of symbols and punctuation marks.

Dependency Labels

• PKT and KTB appear consistent except in compound, nummod, dislocated and nsubj. As briefly mentioned, compound and nummod are likely domain-specific particularities.
• GKT's abundant annotation of flat is a remnant of coarse tokenization that led to embedded tokens being labeled flat as a whole.

Statistics

            GKT      PKT      KTB      Total
Tokens      80,392   132,041  350,090  562,523
Sentences   6,339    5,010    27,363   38,712

Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
The Penn Korean Universal Dependency Treebank will be released officially through LDC.

Penn Korean Treebank & KAIST Treebank

Two Korean phrase structure treebanks are analyzed and converted into dependency trees following UDv2.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.

[Figure: conversion stages (Empty Categories, Coordination, Part-of-Speech Tags, Dependency Relations) shown for Penn and KAIST, with one entry marked N/A]

Empty Categories

• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, which needs further investigation.

Coordination Structures

• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes the head of its left sibling, such that the rightmost conjunct becomes the head of the coordination structure.

Part-of-Speech Tags

• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some post-position markers (josa) and verbal endings (eomi) were identified as encoding conjunction: CCONJ, SCONJ. The rest were mapped to adpositions (ADP) and particles (PART), respectively.

Dependency Relations

• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
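The coordination rule stated in the poster (each conjunct heads its left sibling, so the rightmost conjunct heads the whole structure) can be sketched as a small function. This is an illustration only: the function name, the use of integer token IDs, and the return format are assumptions, and the sketch does not cover how the authors detect coordination or which labels they assign.

```python
def attach_conjuncts(conjunct_ids):
    """Chain conjuncts so that each one is headed by its right neighbor.

    conjunct_ids: token IDs of the conjuncts in left-to-right order
                  (a hypothetical input format for this sketch).
    Returns (head_of_structure, arcs), where arcs maps each dependent
    conjunct ID to the ID of its head.
    """
    if not conjunct_ids:
        raise ValueError("a coordination needs at least one conjunct")

    arcs = {}
    # Pair every conjunct with its immediate right sibling.
    for left, right in zip(conjunct_ids, conjunct_ids[1:]):
        arcs[left] = right

    # The rightmost conjunct has no right sibling, so it heads the structure.
    return conjunct_ids[-1], arcs


if __name__ == "__main__":
    head, arcs = attach_conjuncts([2, 5, 8])
    print(head, arcs)
```

For three conjuncts with IDs 2, 5 and 8, the sketch attaches 2 to 5 and 5 to 8, and reports 8 as the head of the coordination.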
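Head-percolation conversion in general can be sketched as follows: pick one head child per constituent using a rule table, then attach the lexical heads of the remaining children to the lexical head of that child. This is a generic illustration, not the authors' rules. The `HEAD_RULES` table, the phrase and tag labels, and the `Node` class are all hypothetical placeholders, and the sketch leaves out empty categories, coordination and dependency labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str                                   # phrasal or POS tag (placeholder names)
    children: List["Node"] = field(default_factory=list)
    token_id: Optional[int] = None               # set on leaves only


# Hypothetical rule table: phrase label -> (search direction, preferred child labels).
# Real rules for PKT and KTB would differ; these are placeholders.
HEAD_RULES = {
    "S": ("right", ["VP"]),
    "VP": ("right", ["VV"]),
    "NP": ("right", ["NN"]),
}


def find_head_child(node: Node) -> Node:
    """Pick the head child of a constituent according to HEAD_RULES."""
    direction, priorities = HEAD_RULES.get(node.label, ("right", []))
    ordered = node.children if direction == "left" else list(reversed(node.children))
    for wanted in priorities:
        for child in ordered:
            if child.label == wanted:
                return child
    return ordered[0]  # fallback: first child in the search direction


def to_dependencies(node: Node, arcs: Dict[int, int]) -> int:
    """Fill arcs with {dependent token ID: head token ID}; return node's lexical head."""
    if not node.children:
        return node.token_id
    head_child = find_head_child(node)
    child_heads = [(child, to_dependencies(child, arcs)) for child in node.children]
    head_token = next(tok for child, tok in child_heads if child is head_child)
    for child, tok in child_heads:
        if child is not head_child:
            arcs[tok] = head_token  # non-head children depend on the head
    return head_token


if __name__ == "__main__":
    tree = Node("S", [
        Node("NP", [Node("NN", token_id=1)]),
        Node("VP", [Node("NP", [Node("NN", token_id=2)]), Node("VV", token_id=3)]),
    ])
    arcs: Dict[int, int] = {}
    root = to_dependencies(tree, arcs)
    print(root, arcs)
```

In the toy tree, the verb (token 3) becomes the root and both nouns attach to it. Searching from the right in every rule is a choice made here for a head-final toy example, not a claim about the rule sets the authors established for PKT and KTB.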
