Building Universal Dependency Treebanks in Korean
Jayeol Chun¹, Na-Rae Han², Jena D. Hwang³, Jinho D. Choi¹
¹ Emory University; ² University of Pittsburgh; ³ IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us
Objectives
This paper presents three dependency treebanks in Korean, derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in Penn Korean Treebank and KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.
Google UD Treebank
Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under old UD guidelines. We carry out a systematic correction of GKT, bringing it up to the standards of UDv2.
[Figure: GKT correction example. Panels: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling.]
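The poster names a Head ID Remapping step but does not spell it out. The following is only a minimal sketch of why such a step is needed once tokenization splits a token: every later token ID shifts, so head pointers must be rewritten. The function name `remap_heads`, the `splits` argument, and the choice to let the first piece inherit the original head are all assumptions made for illustration, not the authors' procedure.

```python
def remap_heads(tokens, splits):
    """Re-index heads after some tokens are split into several pieces.

    tokens: list of (form, head) pairs; IDs are 1-based positions, head 0 = root.
    splits: dict mapping an old token ID to the list of forms it is split into.
    Assumption (hypothetical): the first piece inherits the original head and
    the remaining pieces attach to that first piece.
    """
    # First pass: compute the new ID of the first piece of every old token.
    first_new_id = {0: 0}
    next_id = 1
    for old_id, (form, _) in enumerate(tokens, start=1):
        first_new_id[old_id] = next_id
        next_id += len(splits.get(old_id, [form]))

    # Second pass: emit the new tokens with remapped heads.
    remapped = []
    for old_id, (form, head) in enumerate(tokens, start=1):
        pieces = splits.get(old_id, [form])
        anchor = first_new_id[old_id]
        remapped.append((pieces[0], first_new_id[head]))
        for piece in pieces[1:]:
            remapped.append((piece, anchor))
    return remapped


# Toy example: the first token "X," is split into "X" and ",".
example = remap_heads([("X,", 3), ("Y", 3), ("Z", 0)], {1: ["X", ","]})
```

In this toy case the old head ID 3 becomes 4 because one extra token was inserted before it.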
Corpus Analytics
• At approximately 26 dependency nodes per sentence, PKT has on average the longest and most complex sentences among the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.
Part-of-Speech Tags
• NOUN, VERB, ADV and PUNCT are the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KTB, where ADJ (adjective) takes that spot instead.
• NUM (number) is prominent in PKT, which is likely a reflection of its news domain.
• The absence of SCONJ in GKT is due to its tokenization, which does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in the original GKT, were partially introduced into the revised GKT as a result of tokenizing symbols and punctuation marks.
Dependency Labels
• PKT and KTB appear consistent except in compound, nummod, dislocated and
nsubj. As briefly mentioned, compound and nummod are likely domain-specific
particularities.
• GKT’s abundant annotation of flat is a remnant of coarse tokenization that led
to embedded tokens labeled flat as a whole.
Statistics
            GKT      PKT      KTB      Total
Tokens      80,392   132,041  350,090  562,523
Sentences   6,339    5,010    27,363   38,712
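As a quick sanity check, the per-sentence averages quoted under Corpus Analytics can be recomputed from the table above. This sketch assumes the token counts are a reasonable proxy for the number of dependency nodes; the dictionary and function names are illustrative only.

```python
# Counts copied from the Statistics table.
STATS = {
    "GKT": {"tokens": 80_392, "sentences": 6_339},
    "PKT": {"tokens": 132_041, "sentences": 5_010},
    "KTB": {"tokens": 350_090, "sentences": 27_363},
}


def tokens_per_sentence(stats):
    """Return the average number of tokens per sentence for each corpus."""
    return {name: c["tokens"] / c["sentences"] for name, c in stats.items()}


averages = tokens_per_sentence(STATS)
total_tokens = sum(c["tokens"] for c in STATS.values())
total_sentences = sum(c["sentences"] for c in STATS.values())

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} tokens per sentence")
```

The averages come out at roughly 12.7 (GKT), 26.4 (PKT) and 12.8 (KTB), in line with the figures of about 12 and 26 cited on the poster, and the column sums match the Total column.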
Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
Penn Korean Universal Dependency Treebank will be released officially through LDC.
Language Resources and Evaluation Conference
May 7-12, 2018; Miyazaki, Japan
Penn Korean Treebank & KAIST Treebank
Two Korean phrase structure treebanks are analyzed and converted into dependency trees using UDv2.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.
[Table: conversion overview with rows Penn and KAIST and columns Empty Categories, Coordination, Part-of-Speech Tags, Dependency Relations. The KAIST row contains an N/A entry.]
Empty Categories
• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, which needs to be further investigated.
Coordination Structures
• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes the head of its left sibling such that the rightmost conjunct becomes the head of the coordination structure.
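The attachment rule for conjuncts can be illustrated with a short sketch. It assumes the conjuncts have already been detected and are given as token IDs in left-to-right order; the function names are hypothetical and the detection heuristics themselves are not modeled here.

```python
def attach_conjuncts(conjunct_ids):
    """Map each conjunct to its head: the next conjunct to its right.

    conjunct_ids: token IDs of the conjuncts in left-to-right order.
    The rightmost conjunct gets no entry because it heads the whole structure.
    """
    return {left: right for left, right in zip(conjunct_ids, conjunct_ids[1:])}


def coordination_head(conjunct_ids):
    """The rightmost conjunct is the head of the coordination structure."""
    return conjunct_ids[-1]


# Three conjuncts at token positions 2, 5 and 8.
heads = attach_conjuncts([2, 5, 8])   # {2: 5, 5: 8}
top = coordination_head([2, 5, 8])    # 8
```

Chaining each conjunct to its right neighbor gives a right-headed structure, which fits a head-final language such as Korean.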
Part-of-Speech Tags
• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some postposition markers (josa) and verbal endings (eomi) were identified as encoding conjunction: CCONJ, SCONJ. The rest are mapped to adpositions (ADP) and particles (PART), respectively.
Dependency Relations
• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
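To make the head-percolation step concrete, here is a minimal sketch of how a constituency tree can be turned into head assignments. The rule table `HEAD_RULES`, the phrase labels, and the fallback to the rightmost child are assumptions chosen for illustration; the actual rules for PKT and KTB are defined separately by the authors and are not reproduced here. Relation labeling is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    token_id: int = 0  # meaningful for leaves only


# Hypothetical rule table: for each phrase label, child labels in priority order.
HEAD_RULES = {"S": ["VP", "S"], "VP": ["VV", "VP"], "NP": ["NN", "NP"]}


def head_child(node):
    """Pick the head child by rule priority, scanning children right to left."""
    for wanted in HEAD_RULES.get(node.label, []):
        for child in reversed(node.children):
            if child.label == wanted:
                return child
    return node.children[-1]  # assumed fallback: rightmost child


def lexical_head(node):
    """Follow head children down to a leaf and return its token ID."""
    while node.children:
        node = head_child(node)
    return node.token_id


def to_dependencies(node, heads=None, governor=0):
    """Return {token_id: head_token_id}, with 0 marking the root."""
    if heads is None:
        heads = {}
    if not node.children:
        heads[node.token_id] = governor
        return heads
    head = head_child(node)
    head_token = lexical_head(head)
    for child in node.children:
        # The head child inherits the governor; siblings attach to the head token.
        to_dependencies(child, heads, governor if child is head else head_token)
    return heads


# Toy tree: (S (NP (NN 1)) (VP (NP (NN 2)) (VV 3)))
tree = Node("S", [
    Node("NP", [Node("NN", token_id=1)]),
    Node("VP", [Node("NP", [Node("NN", token_id=2)]), Node("VV", token_id=3)]),
])
deps = to_dependencies(tree)
```

In this toy tree the verb (token 3) becomes the root and both nouns attach to it.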

Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

Building Universal Dependency Treebanks in Korean

Jayeol Chun,1 Na-Rae Han,2 Jena D. Hwang,3 Jinho D. Choi1
1 Emory University; 2 University of Pittsburgh; 3 IHMC
{che.yeol.chun, jinho.choi}@emory.edu, naraehan@pitt.edu, jhwang@ihmc.us

Objectives
This paper presents three dependency treebanks in Korean, derived from existing corpora and pseudo-annotated according to the latest UD guidelines, version 2 (UDv2).
• Fix several issues with the Korean portion of Google UD Treebank with respect to UDv2.
• Convert phrase structure trees in Penn Korean Treebank and KAIST Treebank into dependency trees following UDv2.
• Provide corpus analytics that include statistics of the new dependency treebanks and remaining issues with the current annotation.

Google UD Treebank
Google UD Treebank (GKT) includes 6K+ sentences from weblogs and newswire annotated under old UD guidelines. We carry out a systematic correction of GKT, bringing it up to the standards of UDv2.
Figure panels: Original Tree, Morphological Analysis, Tokenization, Head ID Remapping, Dependency Labeling.

Corpus Analytics
• At approximately 26 dependency nodes per sentence, PKT has on average the longest and most complex sentences among the three corpora, likely reflecting its news domain.
• KTB is by far the largest corpus in this study, with sentence complexity comparable to that of GKT at approximately 12 dependency nodes per sentence.

Part-of-Speech Tags
• NOUN, VERB, ADV and PUNCT are the top parts of speech.
• In both PKT and GKT, PROPN (proper noun) is the fifth-highest ranking POS, while it ranks much lower in KAIST, which instead has ADJ (adjective) in that spot.
• NUM (number) is prominent in PKT, which likely reflects its news domain.
• The absence of SCONJ in GKT is due to tokenization that does not analyze particles as separate tokens.
• Notably, AUX (auxiliary) and PART (particle), lacking in GKT, were partially introduced into the revised GKT as a result of the tokenization of symbols and punctuation marks.

Dependency Labels
• PKT and KTB appear consistent except in compound, nummod, dislocated and nsubj. As briefly mentioned, compound and nummod are likely domain-specific particularities.
• GKT’s abundant annotation of flat is a remnant of coarse tokenization that led to embedded tokens labeled flat as a whole.

Statistics
            GKT      PKT       KTB       Total
Tokens      80,392   132,041   350,090   562,523
Sentences   6,339    5,010     27,363    38,712

Official UD Project: http://universaldependencies.org
Korean UD Project: https://github.com/emorynlp/ud-korean
Penn Korean Universal Dependency Treebank will be released officially through LDC.
Language Resources and Evaluation Conference, May 7-12, 2018; Miyazaki, Japan

Penn Korean Treebank & KAIST Treebank
Two Korean phrase structure treebanks are analyzed and converted into dependency trees using UDv2.
• Penn Korean Treebank (PKT): 5K+ sentences from newswire.
• KAIST Treebank (KTB): 27K+ sentences from literature, newswire, and academic manuscripts.
Figure panels: Empty Categories, Coordination, Part-of-Speech Tags, Dependency Relations, shown for Penn and KAIST (with an N/A entry).

Empty Categories
• Heuristics are used for matching constituency tags at both phrasal and morpheme levels.
• Elided predicates caused by gapping relations are handled as fixed conjuncts, which needs further investigation.

Coordination Structures
• Coordination structures are detected by heuristics discovered from corpus analytics.
• Each conjunct becomes the head of its left sibling, such that the rightmost conjunct becomes the head of the coordination structure.

Part-of-Speech Tags
• Part-of-speech tags are mapped to UDv2 via manually analyzed heuristics. With a few exceptions, the mappings are categorical for both PKT and KTB.
• Some post-position markers (josa) and verbal endings (eomi) were identified as encoding conjunction: CCONJ, SCONJ. The rest are mapped to adpositions (ADP) and particles (PART), respectively.

Dependency Relations
• Once the empty categories are handled, each constituency node is assigned its head with head-percolation rules established separately for PKT and KTB.
• The dependency relation between the node and its head is inferred by investigating the function tags, phrasal tags and morphemes from the original treebanks.
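
The coordination rule stated on the poster (each conjunct heads its left sibling, so the rightmost conjunct heads the whole structure) can be illustrated with a minimal Python sketch. The function name `attach_conjuncts` and the use of integer token indices are assumptions made for illustration only; this is not the authors' conversion code, and it does not assign dependency labels.

```python
def attach_conjuncts(conjuncts):
    """Map each conjunct to its head under a right-headed coordination rule.

    conjuncts: token indices of the conjuncts, in left-to-right order
               (hypothetical input format, chosen for this sketch).
    Returns (heads, root): heads maps every non-final conjunct to the
    conjunct immediately to its right; root is the rightmost conjunct,
    which heads the whole coordination structure.
    """
    if not conjuncts:
        raise ValueError("a coordination needs at least one conjunct")
    # Pair each conjunct with its right neighbour: the neighbour is its head.
    heads = {left: right for left, right in zip(conjuncts, conjuncts[1:])}
    return heads, conjuncts[-1]


heads, root = attach_conjuncts([2, 5, 8])
print(heads, root)
```

With three conjuncts at positions 2, 5 and 8, token 2 attaches to 5, token 5 attaches to 8, and token 8 is left to be attached to whatever governs the coordination as a whole.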
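
The per-sentence figures quoted under Corpus Analytics can be checked against the Statistics table. The sketch below assumes that "dependency nodes per sentence" corresponds to tokens divided by sentences, which the poster does not state explicitly; the variable names are illustrative.

```python
# (tokens, sentences) per corpus, copied from the Statistics table.
corpora = {
    "GKT": (80_392, 6_339),
    "PKT": (132_041, 5_010),
    "KTB": (350_090, 27_363),
}

# Assumption: one dependency node per token.
avg_nodes = {name: tokens / sentences for name, (tokens, sentences) in corpora.items()}

total_tokens = sum(tokens for tokens, _ in corpora.values())
total_sentences = sum(sentences for _, sentences in corpora.values())

for name, avg in avg_nodes.items():
    print(f"{name}: {avg:.1f} tokens per sentence")
print(total_tokens, total_sentences)
```

Under that assumption the averages agree with the poster's "approximately 26" for PKT and "approximately 12" for GKT and KTB, and the column sums match the Total column.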