SlideShare a Scribd company logo
1 of 24
Download to read offline
ACBiMA: Advanced Chinese Bi-Character Word
Morphological Analyzer
1
Ting-Hao (Kenneth) Huang
Yun-Nung (Vivian) Chen
Lingpeng Kong
HTTP://ACBIMA.ORG
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
2:: ACBIMA.ORG ::
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
3:: ACBIMA.ORG ::
Introduction
● NLP tasks usually focus on segmented words
● Morphology is how words are composed with morphemes
● Usages of Chinese morphological structures
o Sentiment Analysis (Ku, 2009; Huang, 2009)
o POS Tagging (Qiu, 2008)
o Word Segmentation (Gao, 2005)
o Parsing (Li, 2011; Li, 2012; Zhang, 2013)
● Challenge for Chinese morphology
o Lack of complete theories
o Lack of category schema
o Lack of toolkits
4:: ACBIMA.ORG ::
抗 菌
anti bacteria
verb object
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
5:: ACBIMA.ORG ::
Related Work
6:: ACBIMA.ORG ::
● Focus on longer unknown words
o Tseng, 2002; Tseng, 2005; Lu, 2008; Qiu, 2008
● Focus on the functionality of morphemic characters
o Bruno, 2010
● Focus on Chinese bi-character words
o Huang, et al., 2010 (LREC)
o 52% multi-character Chinese tokens are bi-character
o analyze Chinese morphological types
o developed a suite of classifiers for type prediction
Issue: covers only a subset of Chinese content words and has limited scalability
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
7:: ACBIMA.ORG ::
Morphological Type Scheme
8:: ACBIMA.ORG ::
Chinese Bi-Char
Content Word
Single-Morpheme Word
Synthetic
Word
Derived
Word
Compound
Word
dup, pfx, sfx, neg, ec
a-head, conj, n-head, nsubj,
v-head, vobj, vprt, els
els
Derived Word
9:: ACBIMA.ORG ::
Compond Word
10:: ACBIMA.ORG ::
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
11:: ACBIMA.ORG ::
Morphological Type Classification
12:: ACBIMA.ORG ::
● Assumption: Chinese morphological structures are independent
from word-level contexts (Tseng, 2002; Li, 2011)
● Derived words
o Rule-based approach
● Compound words
o ML-based approach
Derived Word: Rule-Based
13:: ACBIMA.ORG ::
● Idea
o a morphologically derived word can be recognized based on its
formation
● Approach
o pattern matching rules
● Evaluation
o Data: Chinese Treebank 7.0
o Result:
o 2.9% of bi-char content words are annotated as derived words
o Precision = 0.97
Rule-based methods are able to effectively recognizing derived words.
Compond Word: ML-Based
14:: ACBIMA.ORG ::
● Idea
o The characteristics of individual characters can help decide the
type of compond words
● ML classification models
o Naïve Bayes
o Random Forest
o SVM
Classification Feature
15:: ACBIMA.ORG ::
o Dict: Revised Mandarin Chinese Dictionary (MoE, 1994)
o CTB: Chinese Treebank 5.1 (Xue et al., 2005)
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
16:: ACBIMA.ORG ::
ACBiMA Corpus 1.0
17:: ACBIMA.ORG ::
● Initial Set
o 3,052 words
o Extracted from CTB5
o Annotated with difficulty level
● Whole Set
o 11,366 words
o Initial Set +
3k words from CTB 5.1 +
6.5k words from (Huang, 2010)
Baseline Models
18:: ACBIMA.ORG ::
1) Majority
2) Stanford Dependency Map
3) Tabular Models
o Step 1: assign the POS tags to each known character based
on different heuristics
o Step 2: assign the most frequent morphological type
obtained from training data to each POS combination, e.g.,
“(VV, NN) = vobj”
Experimental Result
19:: ACBIMA.ORG ::
o Setting: 10-fold cross-validation
o Metrics: Macro F-measure (MF), Accuracy (ACC)
Approach nsubj
v-
head
a-
head
n-
head
vprt vobj conj els MF ACC
Majority 0 0 0 .507 0 0 0 0 .172 .340
Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388
Tabular
Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395
CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508
Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526
Tablular approaches perform better among all baselines.
Experimental Result
20:: ACBIMA.ORG ::
o Setting: 10-fold cross-validation
o Metrics: Macro F-measure (MF), Accuracy (ACC)
Approach nsubj
v-
head
a-
head
n-
head
vprt vobj conj els MF ACC
Majority 0 0 0 .507 0 0 0 0 .172 .340
Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388
Tabular
Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395
CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508
Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526
Naïve Base .273 .406 .195 .523 .679 .566 .547 .188 .519 .518
Random Forest .250 .421 .063 .760 .803 .643 .656 .076 .647 .674
SVM .413 .541 .288 .748 .791 .657 .636 .271 .662 .665
ML-based methods outperform all baselines, where SVM & RF perform best.
Experimental Result
21:: ACBIMA.ORG ::
o Setting: 10-fold cross-validation
o Metrics: Macro F-measure (MF), Accuracy (ACC)
Approach nsubj
v-
head
a-
head
n-
head
vprt vobj conj els MF ACC
Majority 0 0 0 .507 0 0 0 0 .172 .340
Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388
Tabular
Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395
CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508
Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526
Naïve Base .273 .406 .195 .523 .679 .566 .547 .188 .519 .518
Random Forest .250 .421 .063 .760 .803 .643 .656 .076 .647 .674
SVM .413 .541 .288 .748 .791 .657 .636 .271 .662 .665
Avg Difficulty 1.74 1.55 1.64 1.36 1.38 1.38 1.47 1.95 - -
Outline
● Introduction
● Related Work
● Morphological Type Scheme
● Morphological Type Classification
o Drived Word: Rule-Based Approach
o Compond Word: ML Approach
● Experiments
o ACBiMA Corpus 1.0
o Experimental Results
● Conclusion & Future Work
22:: ACBIMA.ORG ::
Conclusion & Future Work
23:: ACBIMA.ORG ::
● Contribution
o Linguistic
 Propose a morphological type scheme
 Develop a corpus containing about 11K words
o Technical
 Develop an effective morphological classifier
o Practical
 Data and tool available
 Additional features for any Chinese task
● Future
o Improve other NLP tasks by using ACBiMA
Q & A
Thanks for your attentions!!
24
● HTTP://ACBIMA.ORG

More Related Content

Similar to Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer

NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.pptbutest
 
Yves Peirsman - Deep Learning for NLP
Yves Peirsman - Deep Learning for NLPYves Peirsman - Deep Learning for NLP
Yves Peirsman - Deep Learning for NLPHendrik D'Oosterlinck
 
Chunking in manipuri using crf
Chunking in manipuri using crfChunking in manipuri using crf
Chunking in manipuri using crfijnlc
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Lviv Data Science Summer School
 
It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool
It Depends: Dependency Parser Comparison Using A Web-based Evaluation ToolIt Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool
It Depends: Dependency Parser Comparison Using A Web-based Evaluation ToolJinho Choi
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingeXascale Infolab
 
Unknown Word 08
Unknown Word 08Unknown Word 08
Unknown Word 08Jason Yang
 
Merghani-SACNAS Poster
Merghani-SACNAS PosterMerghani-SACNAS Poster
Merghani-SACNAS PosterTaha Merghani
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFJayavardhan Reddy Peddamail
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationSeonghyun Kim
 
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Marcin Junczys-Dowmunt
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languageshs0041
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...AI Frontiers
 

Similar to Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer (20)

NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
Yves Peirsman - Deep Learning for NLP
Yves Peirsman - Deep Learning for NLPYves Peirsman - Deep Learning for NLP
Yves Peirsman - Deep Learning for NLP
 
Chunking in manipuri using crf
Chunking in manipuri using crfChunking in manipuri using crf
Chunking in manipuri using crf
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
 
It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool
It Depends: Dependency Parser Comparison Using A Web-based Evaluation ToolIt Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool
It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
 
Unknown Word 08
Unknown Word 08Unknown Word 08
Unknown Word 08
 
Merghani-SACNAS Poster
Merghani-SACNAS PosterMerghani-SACNAS Poster
Merghani-SACNAS Poster
 
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRFEnd-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
 
UWB semeval2016-task5
UWB semeval2016-task5UWB semeval2016-task5
UWB semeval2016-task5
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languages
 
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
Training at AI Frontiers 2018 - Lukasz Kaiser: Sequence to Sequence Learning ...
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
L046056365
L046056365L046056365
L046056365
 
p138-jiang
p138-jiangp138-jiang
p138-jiang
 
Eskm20140903
Eskm20140903Eskm20140903
Eskm20140903
 

More from Association for Computational Linguistics

Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Association for Computational Linguistics
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Association for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Association for Computational Linguistics
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Association for Computational Linguistics
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Association for Computational Linguistics
 

More from Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Recently uploaded

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 

Recently uploaded (20)

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 

Ting-Hao (Kenneth) Huang - 2015 - ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer

  • 1. ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong HTTP://ACBIMA.ORG
  • 2. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 2:: ACBIMA.ORG ::
  • 3. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 3:: ACBIMA.ORG ::
  • 4. Introduction ● NLP tasks usually focus on segmented words ● Morphology is how words are composed with morphemes ● Usages of Chinese morphological structures o Sentiment Analysis (Ku, 2009; Huang, 2009) o POS Tagging (Qiu, 2008) o Word Segmentation (Gao, 2005) o Parsing (Li, 2011; Li, 2012; Zhang, 2013) ● Challenge for Chinese morphology o Lack of complete theories o Lack of category schema o Lack of toolkits 4:: ACBIMA.ORG :: 抗 菌 anti bacteria verb object
  • 5. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 5:: ACBIMA.ORG ::
  • 6. Related Work 6:: ACBIMA.ORG :: ● Focus on longer unknown words o Tseng, 2002; Tseng, 2005; Lu, 2008; Qiu, 2008 ● Focus on the functionality of morphemic characters o Bruno, 2010 ● Focus on Chinese bi-character words o Huang, et al., 2010 (LREC) o 52% multi-character Chinese tokens are bi-character o analyze Chinese morphological types o developed a suite of classifiers for type prediction Issue: covers only a subset of Chinese content words and has limited scalability
  • 7. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 7:: ACBIMA.ORG ::
  • 8. Morphological Type Scheme 8:: ACBIMA.ORG :: Chinese Bi-Char Content Word Single-Morpheme Word Synthetic Word Derived Word Compound Word dup, pfx, sfx, neg, ec a-head, conj, n-head, nsubj, v-head, vobj, vprt, els els
  • 11. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 11:: ACBIMA.ORG ::
  • 12. Morphological Type Classification 12:: ACBIMA.ORG :: ● Assumption: Chinese morphological structures are independent from word-level contexts (Tseng, 2002; Li, 2011) ● Derived words o Rule-based approach ● Compound words o ML-based approach
  • 13. Derived Word: Rule-Based 13:: ACBIMA.ORG :: ● Idea o a morphologically derived word can be recognized based on its formation ● Approach o pattern matching rules ● Evaluation o Data: Chinese Treebank 7.0 o Result: o 2.9% of bi-char content words are annotated as derived words o Precision = 0.97 Rule-based methods are able to effectively recognizing derived words.
  • 14. Compond Word: ML-Based 14:: ACBIMA.ORG :: ● Idea o The characteristics of individual characters can help decide the type of compond words ● ML classification models o Naïve Bayes o Random Forest o SVM
  • 15. Classification Feature 15:: ACBIMA.ORG :: o Dict: Revised Mandarin Chinese Dictionary (MoE, 1994) o CTB: Chinese Treebank 5.1 (Xue et al., 2005)
  • 16. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 16:: ACBIMA.ORG ::
  • 17. ACBiMA Corpus 1.0 17:: ACBIMA.ORG :: ● Initial Set o 3,052 words o Extracted from CTB5 o Annotated with difficulty level ● Whole Set o 11,366 words o Initial Set + 3k words from CTB 5.1 + 6.5k words from (Huang, 2010)
  • 18. Baseline Models 18:: ACBIMA.ORG :: 1) Majority 2) Stanford Dependency Map 3) Tabular Models o Step 1: assign the POS tags to each known character based on different heuristics o Step 2: assign the most frequent morphological type obtained from training data to each POS combination, e.g., “(VV, NN) = vobj”
  • 19. Experimental Result 19:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approach nsubj v- head a- head n- head vprt vobj conj els MF ACC Majority 0 0 0 .507 0 0 0 0 .172 .340 Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388 Tabular Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395 CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508 Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526 Tablular approaches perform better among all baselines.
  • 20. Experimental Result 20:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approach nsubj v- head a- head n- head vprt vobj conj els MF ACC Majority 0 0 0 .507 0 0 0 0 .172 .340 Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388 Tabular Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395 CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508 Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526 Naïve Base .273 .406 .195 .523 .679 .566 .547 .188 .519 .518 Random Forest .250 .421 .063 .760 .803 .643 .656 .076 .647 .674 SVM .413 .541 .288 .748 .791 .657 .636 .271 .662 .665 ML-based methods outperform all baselines, where SVM & RF perform best.
  • 21. Experimental Result 21:: ACBIMA.ORG :: o Setting: 10-fold cross-validation o Metrics: Macro F-measure (MF), Accuracy (ACC) Approach nsubj v- head a- head n- head vprt vobj conj els MF ACC Majority 0 0 0 .507 0 0 0 0 .172 .340 Stanford Dep. Map 0 0 0 .525 .351 .438 .213 .010 .332 .388 Tabular Stanford 0 .296 0 .524 .389 .434 .162 .064 .349 .395 CTB .021 .337 .009 .645 .397 .529 .421 .095 .479 .508 Dict 0 .292 .060 .670 .253 .572 .484 .035 .495 .526 Naïve Base .273 .406 .195 .523 .679 .566 .547 .188 .519 .518 Random Forest .250 .421 .063 .760 .803 .643 .656 .076 .647 .674 SVM .413 .541 .288 .748 .791 .657 .636 .271 .662 .665 Avg Difficulty 1.74 1.55 1.64 1.36 1.38 1.38 1.47 1.95 - -
  • 22. Outline ● Introduction ● Related Work ● Morphological Type Scheme ● Morphological Type Classification o Drived Word: Rule-Based Approach o Compond Word: ML Approach ● Experiments o ACBiMA Corpus 1.0 o Experimental Results ● Conclusion & Future Work 22:: ACBIMA.ORG ::
  • 23. Conclusion & Future Work 23:: ACBIMA.ORG :: ● Contribution o Linguistic  Propose a morphological type scheme  Develop a corpus containing about 11K words o Technical  Develop an effective morphological classifier o Practical  Data and tool available  Additional features for any Chinese task ● Future o Improve other NLP tasks by using ACBiMA
  • 24. Q & A Thanks for your attentions!! 24 ● HTTP://ACBIMA.ORG