Word Segmentation and
Lexical Normalization for
Unsegmented Languages
Doctoral Defense
December 16, 2021.
Shohei Higashiyama
NLP Lab, Division of Information Science, NAIST
These slides are a slightly modified version of those used for the author’s
doctoral defense at NAIST on December 16, 2021.
The main contents of these slides are taken from the following papers.
• [Study 1] Higashiyama et al., “Incorporating Word Attention into Character-Based Word
Segmentation”, NAACL-HLT, 2019
https://www.aclweb.org/anthology/N19-1276
• [Study 1] Higashiyama et al., “Character-to-Word Attention for Word Segmentation”,
Journal of Natural Language Processing, 2020 (Paper Award)
https://www.jstage.jst.go.jp/article/jnlp/27/3/27_499/_article/-char/en
• [Study 2] Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word
Segmentation”, Journal of Natural Language Processing, 2020
https://www.jstage.jst.go.jp/article/jnlp/27/3/27_573/_article/-char/en
• [Study 3] Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese
Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021
https://www.aclweb.org/anthology/2021.naacl-main.438/
• [Study 4] Higashiyama et al., “A Text Editing Approach to Joint Japanese Word
Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021 (Best Paper Award)
https://aclanthology.org/2021.wnut-1.9/
Overview
◆Research theme
- Word Segmentation (WS) and
Lexical Normalization (LN) for Unsegmented Languages
◆Studies in this dissertation
[Study 1] Japanese/Chinese WS for general domains
[Study 2] Japanese/Chinese WS for specialized domains
[Study 3] Construction of a Japanese user-generated text (UGT)
corpus for WS and LN
[Study 4] Japanese WS and LN for UGT domains
◆Structure of this presentation
- Background → Details of each study → Conclusion
Background (1/4)
◆Segmentation/Tokenization
- The (almost always) necessary first step of NLP,
which segments a sentence into tokens
◆Word
- Human-understandable unit
- Processing unit of traditional NLP
- Mandatory unit for linguistic analysis (e.g., parsing and PAS)
- Useful as a feature or as an intermediate unit for deriving
subwords in application-oriented tasks (e.g., NER and MT)
Example: ニューラルネットワークによる自然言語処理
‘Natural language processing based on neural networks’
- Char: ニ,ュ,ー,ラ,ル,ネ,ッ,ト,ワ,ー,ク,に,よ,る,自,然,言,語,処,理
- Subword: ニュー,ラル,ネット,ワーク,による,自然,言語,処理
- Word: ニューラル,ネットワーク,に,よる,自然,言語,処理
Background (2/4)
◆Word Segmentation (WS) in unsegmented languages
- Task to segment sentences into words
using annotated data based on a segmentation standard
- Nontrivial task because of the ambiguity problem
- Segmentation accuracy degrades in domains w/o sufficient
labeled data mainly due to the unknown word problem.
Research issue 1
- How to achieve high accuracy in various text domains,
including those w/o labeled data
Example: 彼は日本人だ → 彼 | は | 日本 | 人 | だ ‘He is Japanese.’
Competing candidate words around 本: 日本 ‘Japan’, 本 ‘book’, 本人 ‘the person’
Background (3/4)
◆Effective WS approaches for different domain types
- [Study 1] General domains:
Use of labeled data (and other resources)
- [Study 2] Specialized domains:
Use of general domain labeled data and target domain resources
- [Study 3&4] User-generated text (UGT):
Handling nonstandard words → Lexical normalization
Domain Type | Example | Labeled data | Unlabeled data | Lexicon | Other characteristics
General dom. | News | ✓ | ✓ | ✓ | -
Specialized dom. | Scientific documents | ✕ | ✓ | △ | -
UGT dom. | Social media | ✕ | ✓ | △ | Nonstandard words
(✓: available, △: sometimes available, ✕: almost unavailable)
Background (4/4)
◆The frequent use of nonstandard words in UGT
- Examples: オハヨー ohayoo ‘good morning’ (おはよう)
すっっげええ suggee ‘awesome’ (すごい)
- Achieving accurate WS and downstream processing is difficult.
◆Lexical Normalization (LN)
- Task to transform nonstandard words into standard forms
- Main problem: the lack of public labeled data for evaluating
and training Japanese LN models
Research issue 2
- How to train/evaluate WS and LN models for Japanese UGT
under the low-resource situation
Example: two online translators (A, B) translate the raw sentence
日本語 まぢ ムズカシイ as “Japanese Majimu Zukashii” / “Japanese Majimuzukashii”,
but translate the normalized sentence 日本 語 まじ 難しい/むずかしい
as “Japanese is really difficult” / “Japanese is difficult”.
Contributions of This Dissertation (1/2)
1. How to achieve accurate WS in various text domains
- We proposed an effective approach for each of the three domain types.
➢Our methods can be effective options to achieve
accurate WS and downstream tasks in these domains.
- General domains: [Study 1] Neural model combining character and word features
- Specialized domains: [Study 2] Auxiliary prediction task based on unlabeled data and a lexicon
- UGT domains: [Study 4] Joint prediction of WS and LN
Contributions of This Dissertation (2/2)
2. How to train/evaluate WS and LN models for Ja UGT
- We constructed manually/automatically-annotated corpora.
➢Our evaluation corpus can be a useful benchmark to compare
and analyze existing and future systems.
➢Our LN method can be a good baseline for developing more
practical Japanese LN methods in the future.
- UGT domains: [Study 3] Evaluation corpus annotation; [Study 4] Pseudo-training data generation
Overview of Studies in This Dissertation
◆I focused on improving WS and LN accuracy for each domain type.
[Figure: studies plotted by domain (general, specialized, UGT) against a
prerequisite→performance axis. Corpus annotation for fair evaluation
(prerequisite): Study 3. Development of more accurate models (performance):
Studies 1, 2, and 4. Development of faster models is a separate direction.]
Study 1: Word Segmentation for
General Domains
Higashiyama et al., “Incorporating Word Attention into Character-Based
Word Segmentation”, NAACL-HLT, 2019
Higashiyama et al., “Character-to-Word Attention for Word Segmentation”,
Journal of Natural Language Processing, 2020 (Paper Award)
Study 1: WS for General Domains
◆Goal: Achieve more accurate WS in general domains
◆Background
- Limited efforts have been devoted to leveraging
complementary char and word information for neural WS.
◆Our contributions
- Proposed a char-based model incorporating word information
- Achieved performance better than or competitive with existing
SOTA models on Japanese and Chinese datasets
Example input: テキストの分割
- Char-based: label each character of テ キ ス ト の 分 割 →
テキスト|の|分割; efficient prediction via first-order sequence labeling
(a minimal sketch of this framing follows below)
- Word-based: directly predict the words テキスト, の, 分割;
easy use of word-level info
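Below is a minimal, self-contained sketch (not the dissertation’s code) of the character-based framing: segmented words are converted to per-character B/I/E/S labels for training, and predicted labels are converted back into words.

```python
def words_to_bies(words):
    """Convert segmented words to per-character B/I/E/S labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")  # single-character word
        else:
            labels.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return labels

def bies_to_words(chars, labels):
    """Recover words from characters and predicted B/I/E/S labels."""
    words, buf = [], ""
    for c, label in zip(chars, labels):
        buf += c
        if label in ("E", "S"):  # a word ends here
            words.append(buf)
            buf = ""
    if buf:  # tolerate a prediction that does not end with E/S
        words.append(buf)
    return words

print(words_to_bies(["テキスト", "の", "分割"]))
# ['B', 'I', 'I', 'E', 'S', 'B', 'E']
print(bies_to_words(list("テキストの分割"), ["B", "I", "I", "E", "S", "B", "E"]))
# ['テキスト', 'の', '分割']
```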
[Study 1] Proposed Model Architecture
◆Char-based model with char-to-word attention
to learn the importance of candidate words
[Figure: for the input sentence 彼は日本人, char embeddings are looked up and
fed to a BiLSTM, yielding a char context vector hi for each character; candidate
words covering each character (e.g., 日本 ‘Japan’, 本 ‘book’, 本人 ‘the person’)
are looked up in the word vocab, and their embeddings ew_j are attended to and
aggregated into a word summary vector ai; a further BiLSTM and a CRF layer
predict the labels S S B E S.]
[Study 1] Character-to-Word Attention
[Figure: for the input sentence 彼は日本人だ。, each character attends to the
candidate words covering it in the word vocab (本, 日本, 本人, は日本, 日本人,
本人だ, …; max word length = 4), whose embeddings ew_j are retrieved by lookup.]
Attend: given the char context vector $h_i$ and word embeddings $e^w_j$,
$$\alpha_{ij} = \frac{\exp(h_i^\top W e^w_j)}{\sum_k \exp(h_i^\top W e^w_k)}$$
Aggregate: the word summary vector $a_i$ is computed from the weighted
embeddings $\alpha_{ij} e^w_j$ by either WAVG (weighted average) or
WCON (weighted concatenation). A minimal sketch of the WAVG variant follows below.
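A minimal PyTorch sketch of the WAVG variant of this attention; the padding-free setup and tensor names are illustrative assumptions, not the paper’s implementation (which also masks padded candidates and offers WCON).

```python
import torch
import torch.nn.functional as F

def char_to_word_attention(h, E_w, W):
    """WAVG char-to-word attention (sketch; no masking of padded words).

    h:   (T, d_h)    char context vectors from the BiLSTM
    E_w: (T, J, d_w) embeddings of the J candidate words covering each char
    W:   (d_h, d_w)  bilinear attention parameter
    Returns a: (T, d_w) word summary vectors.
    """
    scores = torch.einsum("td,de,tje->tj", h, W, E_w)  # h_i^T W e^w_j
    alpha = F.softmax(scores, dim=-1)                  # attention weights
    return torch.einsum("tj,tje->te", alpha, E_w)      # weighted average

T, J, d_h, d_w = 5, 6, 600, 300  # dims loosely follow the slides' settings
a = char_to_word_attention(torch.randn(T, d_h),
                           torch.randn(T, J, d_w),
                           torch.randn(d_h, d_w))
print(a.shape)  # torch.Size([5, 300])
```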
[Study 1] Construction of Word Vocabulary
- The word vocabulary comprises words in the training set and words
automatically segmented by the baseline.
[Figure: a BiLSTM-CRF baseline is trained on the training set and used to
decode unlabeled text; Word2Vec then pre-trains word embeddings on the
resulting segmented text (min word freq = 5). A sketch of this step follows below.]
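A sketch of this step under the assumption that gensim’s Word2Vec is used as the trainer (the slides say only “Word2Vec”); `segmented` stands in for the auto-segmented unlabeled text.

```python
from gensim.models import Word2Vec

# Auto-segmented sentences produced by decoding unlabeled text with the
# BiLSTM-CRF baseline (illustrative toy data; the real input is millions
# of sentences, so min_count=5 then yields a non-trivial vocabulary).
segmented = [
    ["テキスト", "の", "分割"],
    ["自然", "言語", "処理"],
    # ... millions of auto-segmented sentences ...
]

model = Word2Vec(
    sentences=segmented,
    vector_size=300,  # word_emb_dim=300, as in the slides
    min_count=5,      # min word freq = 5, as in the slides
)
# The word vocabulary = training-set words + frequent auto-segmented words.
auto_vocab = set(model.wv.index_to_key)
```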
[Study 1] Experimental Datasets
◆Training/Test data
- Chinese: 2 source domains
- Japanese: 4 source domains and 7 target domains
◆Unlabeled text for pre-training word embeddings
- Chinese: 48M sentences in Chinese Gigaword 5
- Japanese: 5.9M sentences in BCCWJ non-core data
[Study 1] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2 or 3, num_BiLSTM_units=600,
char/word_emb_dim=300, min_word_freq=5, max_word_length=4, etc.
◆Evaluation
1. Comparison of baseline and proposed model variants
(and analysis on model size)
2. Comparison with existing methods on in-domain
and cross-domain datasets
3. Effect of semi-supervised learning
4. Effect of word frequency and length
5. Effect of attention for segmentation performance
6. Effect of additional word embeddings from target domains
7. Analysis of segmentation examples
[Study 1] Exp 1. Comparison of Model Variants
◆F1 on development sets (mean of three runs)
- Word-integrated models outperformed BASE by up to 1.0 F1 points
(significant in 20 of 24 cases).
- Attention-based models outperformed non-attention
counterparts in 10 of 12 cases (significant in 4 cases).
- WCON achieved the best performance,
which may be because of word length and char position info.
[Table: dev-set F1 for the BiLSTM-CRF baseline and the attention-based variants.
† significant at the 0.01 level over the baseline;
‡ significant at the 0.01 level over the variant w/o attention]
[Study 1] Exp 2. Comparison with Existing Methods
◆F1 on test sets (mean of three runs)
- WCON achieved performance better than or competitive with
existing methods.
(More recent work achieved further improvements on Chinese datasets.)
[Study 1] Exp 5. Effect of Attention for Segmentation
[Figure: during training, each character’s attention distribution over its
candidate words (e.g., 本, 日本, 本人) is replaced with a synthetic one: with
p ~ Uniform(0,1), the weight 0.8 is assigned to the correct candidate word
(0.1 to the others) with probability pt, and to an incorrect candidate otherwise.]
◆Character-level accuracy on BCCWJ-dev
(Most frequent cases where both correct and incorrect candidate words
exist for a character)
- Segmentation label accuracy: 99.54%
- Attention accuracy for proper words: 93.25%
◆Segmentation accuracy of the trained model increased
for larger “correct attention probability” pt
[Study 1] Conclusion
- We proposed a neural word segmenter with attention,
which incorporates word information into
a character-level sequence labeling framework.
- Our experiments showed that
• the proposed method, WCON, achieved performance better than or
competitive with existing methods, and
• learning appropriate attention weights contributed to accurate
segmentation.
Study 2: Word Segmentation for
Specialized Domains
Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain Word
Segmentation”, Journal of Natural Language Processing, 2020
Study 2: WS for Specialized Domains
◆Goal
- Improve WS performance for specialized domains
where labeled data is unavailable
➢Our focus: how to use linguistic resources in the target domain
◆Our contributions
- Proposed a WS method to learn signals of word occurrences
from unlabeled sentences and a lexicon (in the target domain)
- Our method improved performance for various Chinese and
Japanese target domains.
Domain Type | Labeled data | Unlabeled data | Lexicon
Specialized domains | ✕ | ✓ | △
(✓: available, △: sometimes available, ✕: almost unavailable)
[Study 2] Cross-Domain WS with Linguistic Resources
◆Methods for cross-domain WS
- Modeling lexical information: lexicon features, or neural
representation learning ((Liu+ 2019), (Gan+ 2019), Ours)
- Modeling statistical information: generating pseudo-labeled data
((Neubig+ 2011), (Zhang+ 2018))
◆Our model
- To overcome the limitation of lexicon features, we model
lexical information via an auxiliary task for neural models.
➢Assumption:
Domain | Labeled data | Unlabeled data | Lexicon
Source | ✓ | ✓ | ✓
Target | ✕ | ✓ | ✓
[Study 2] Lexicon Feature
- Four binary features indicate, for each character, whether some
multi-character lexicon word begins at ([B]), continues through ([I]),
or ends at ([E]) that position, or the character itself is a
single-character lexicon word ([S]).
- Source sentence: 週末の外出自粛要請 ‘self-restraint request in the weekend’,
gold labels BESBEBEBE; the lexicon {週,末,の,外,出,自,粛,要,請,
週末,外出,出自,自粛,要請, …} generates the features
[B] 100111010, [I] 000000000, [E] 010011101, [S] 111111111.
- Target sentence: 長短期記憶ネットワーク ‘long short-term memory’;
the lexicon {長,短,期,記,憶,長短,短期,記憶,ネット,ワーク,ネットワーク, …}
generates [B] 11010100100, [I] 00000011110, [E] 01101001001, [S] 11111000000.
- Models cannot learn the relationship between feature values and
segmentation labels in a target domain w/o labeled data.
[Study 2] Our Lexicon Word Prediction
- We introduce auxiliary tasks to predict whether each character
corresponds to specific positions (B/I/E/S) in lexicon words.
- The same four binary sequences as above are generated from the lexicon
as auxiliary labels for both source and target sentences, so the model
learns parameters also from target unlabeled sentences: segmentation labels
are predicted/learned only on the source (e.g., 週末の外出自粛要請 with
gold labels BESBEBEBE), while the auxiliary labels are predicted/learned
on both sides (e.g., 長短期記憶ネットワーク on the target side).
A sketch of the label generation follows below.
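A minimal sketch (not the dissertation’s code) of generating the four binary auxiliary label sequences from a lexicon; it reproduces the source-side example above.

```python
def lexicon_word_labels(sent, lexicon, max_len=4):
    """Binary B/I/E/S label sequences from lexicon-word occurrences."""
    n = len(sent)
    B, I, E, S = [0] * n, [0] * n, [0] * n, [0] * n
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            if sent[i:j] not in lexicon:
                continue
            if j - i == 1:
                S[i] = 1                 # single-char lexicon word
            else:
                B[i], E[j - 1] = 1, 1    # word begins at i, ends at j-1
                for k in range(i + 1, j - 1):
                    I[k] = 1             # word continues through k
    return B, I, E, S

lex = {"週", "末", "の", "外", "出", "自", "粛", "要", "請",
       "週末", "外出", "出自", "自粛", "要請"}
for name, seq in zip("BIES", lexicon_word_labels("週末の外出自粛要請", lex)):
    print(name, "".join(map(str, seq)))
# B 100111010
# I 000000000
# E 010011101
# S 111111111
```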
[Study 2] Methods and Experimental Data
◆Linguistic resources for training
- Source domain labeled data
- General and domain-specific unlabeled data
- Lexicon: UniDic (JA) or Jieba (ZH) and
semi-automatically constructed domain-specific lexicons
(390K-570K source words & 0-134K target words)
◆Methods
- Baselines: BiLSTM (BASE), BASE + self-training (ST), and
BASE + lexicon feature (LF)
- Proposed: BASE + MLPs for the segmentation and auxiliary LWP tasks
JNL: CS Journal; JPT, CPT: Patent; RCP: Recipe; C-ZX, P-ZX, FR, DL: Novel; DM: Medical
[Study 2] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2, num_BiLSTM_units=600, char_emb_dim=300,
num_MLP_units=300, min_word_len=1, max_word_len=4, etc.
◆Evaluation
1. In-domain results
2. Cross-domain results
3. Comparison with SOTA methods
4. Influence of weight for auxiliary loss
5. Results for non-adapted domains
6. Performance on unknown words
[Study 2] Exp 2. Cross-Domain Results
◆F1 on test sets (mean of three runs)
- LWP-S (source) outperformed BASE and ST.
- LWP-T (target) significantly outperformed the three baselines.
(+3.2 over BASE, +3.0 over ST, +1.2 over LF on average)
- Results of LWP-O (oracle) using gold test words
indicate that higher-coverage lexicons would bring further improvements.
[Tables: Japanese and Chinese test-set F1.
★ significant at the 0.001 level over BASE
† significant at the 0.001 level over ST
‡ significant at the 0.001 level over LF]
[Study 2] Exp 3. Comparison with SOTA Methods
◆F1 on test sets
- Our method achieved performance better than or competitive with
SOTA methods on the Japanese and Chinese datasets, including
Higashiyama+’19 (our method in the first study).
[Tables: Japanese and Chinese test-set F1]
[Study 2] Exp 6. Performance for Unknown Words
◆Recall of top 10 frequent OOTV words
- For out-of-training-vocabulary (OOTV) words in test sets,
our method achieved better recall for words in the lexicon (Ls∪Lt),
but worse recall for words outside it.
[Tables: recall of the top-10 frequent OOTV words on JPT (Patent, JA) and FR (Novel, ZH)]
[Study 2] Conclusion
- We proposed a cross-domain WS method to incorporate lexical
knowledge via an auxiliary prediction task.
- Our method achieved better performance for various target
domains than the lexicon feature baseline and existing methods
(while preventing performance degradation for source domains).
Study 3: Construction of a Japanese
UGT Corpus for WS and LN
Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese
Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021
Study 3: UGT Corpus Construction
◆Background
- The lack of a public evaluation corpus for Japanese WS and LN
◆Goal
- Construct a public evaluation corpus for development and
fair comparison of Japanese WS and LN systems
◆Our contributions
- Constructed a corpus of blog and Q&A forum text annotated
with morphological and normalization information
- Conducted a detailed evaluation of existing methods on
UGT-specific problems
Example: 日本語まぢムズカシイ → 日本語 | まぢ | ムズカシイ, with まぢ → まじ
and ムズカシイ → 難しい/むずかしい ‘Japanese is really difficult.’
[Study 3] Corpus Construction Policies
1. Available and restorable
- Use blog and Chiebukuro (Yahoo! Answers) sentences in
the BCCWJ non-core data and publish annotation information
2. Compatible with an existing segmentation standard
- Follow NINJAL’s SUW (short unit word, 短単位) standard and
extend the specification regarding nonstandard words
3. Enabling a detailed evaluation of UGT-specific phenomena
- Organize frequently observed linguistic phenomena
into several categories and
annotate every token with a category
[Study 3] Example Sentence in Our Corpus
Raw sentence: イイ歌ですねェ ii uta desu nee ‘It’s a good song, isn’t it?’
Word boundaries: イイ ii ‘good’ | 歌 uta ‘song’ | です desu (copula) | ねェ nee (emphasis marker)
Part-of-speech: 形容詞 (adjective) | 名詞 (noun) | 助動詞 (auxiliary verb) | 助詞 (particle)
Standard forms: 良い,よい,いい | - | - | ね
Categories: Char type variant | - | - | Sound change variant
[Study 3] Corpus Details
◆Word categories
- 11 categories were defined for non-general or nonstandard words that
may often cause segmentation errors.
◆Corpus statistics
- The 11 categories: 新語/スラング (neologism/slang), 固有名 (proper name),
オノマトペ (onomatopoeia), 感動詞 (interjection), 方言 (dialect),
外国語 (foreign language), 顔文字/AA (emoticon/ASCII art),
異文字種 (character type variant), 代用表記 (alternative representation; AR),
音変化 (sound change variant; SCV), 誤表記 (misspelling)
- Most of our categories overlap with (Kaji+ 2015)’s classification.
[Table: corpus statistics]
[Study 3] Experiments
Using our corpus, we evaluated two existing systems
trained only with an annotated corpus for WS and POS tagging.
• MeCab (Kudo+ 2004) with UniDic v2.3.0
- A popular Japanese morphological analyzer based on CRFs
• MeCab+ER (Expansion Rules)
- Our MeCab-based implementation of (Sasano+ 2013)’s
rule-based lattice expansion method
[Figure cited from (Sasano+ 2014): a word lattice for a sentence meaning
‘It was delicious.’; nodes are dynamically added by human-crafted rules.]
[Study 3] Experiments
◆Evaluation
1. Overall results
2. Results for each category
3. Analysis of segmentation results
4. Analysis of normalization results
[Study 3] Exp 1. Overall Performance
◆Results
- MeCab+ER achieved better Seg and POS performance by
2.5-2.9 F1 points, but poor Norm recall.
[Study 3] Exp 2. Recall for Each Category
◆Results
- Both systems achieved high Seg and POS performance for general and
standard words, but lower performance for UGT-characteristic words.
- MeCab+ER correctly normalized 30-40% of SCV and AR nonstandard
words, but none of those in the other two categories.
[Study 3] Conclusion
- We constructed a public Japanese UGT corpus
annotated with morphological and normalization information.
(https://github.com/shigashiyama/jlexnorm)
- Experiments on the corpus demonstrated the limited performance
of the existing systems for non-general and non-standard words.
Study 4: WS and LN for Japanese
UGT
Higashiyama et al., “A Text Editing Approach to Joint Japanese Word
Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021
(Best Paper Award)
Study 4: WS and LN for Japanese UGT
◆Goal
- Develop a Japanese WS and LN model with better
performance than existing systems, under the condition that
labeled normalization data for LN is unavailable
◆Our contributions
- Proposed methods for generating pseudo-labeled data and
a text editing-based method for Japanese WS, POS tagging,
and LN
- Achieved better normalization performance than
an existing method
[Study 4] Background and Motivation
◆Frameworks for text generation
⚫ Encoder-decoder model for Japanese sentence normalization (Ikeda+ 2016)
⚫ Text editing method for English lexical normalization (Chrupała 2014)
◆Our approach
- Generate pseudo-labeled data for LN using lexical knowledge
- Use a text editing-based model to learn efficiently from
a small amount of (high-quality) training data
[Study 4] Task Formulation
◆Formulation as multiple sequence labeling tasks
Sentence x  = 日 本 語 ま ぢ ム ズ カ シ ー
Seg     ys = B E S B E B I I I E
POS     yp = Noun Noun Noun Adv Adv Adj Adj Adj Adj Adj
Norm    ye = KEEP KEEP KEEP KEEP REP(じ) KEEP KEEP KEEP KEEP REP(い)
        yc = KEEP KEEP KEEP KEEP KEEP HIRA HIRA HIRA HIRA KEEP
⇒ まぢ → まじ, ムズカシー → むずかしい
◆Normalization tags for Japanese char sets
- String edit operation (SEdit):
{KEEP, DEL, INS_L(c), INS_R(c), REP(c)} (c: hiragana or katakana)
- Character type conversion (CConv): {KEEP, HIRA, KATA, KANJI}
◆Kana-kanji conversion
- Spans tagged KANJI are passed to a kana-kanji converter (n-gram LM):
e.g., もうあきだ ‘It’s already autumn.’ with CConv tags
KEEP KEEP KANJI KANJI KEEP sends あき to the converter,
which chooses 秋 ‘autumn’ over 空き ‘vacancy’, 飽き ‘bored’, ….
(A sketch of applying these tags follows below.)
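A minimal sketch (not the dissertation’s code) of applying SEdit and CConv tags to a character sequence. Katakana→hiragana conversion uses the fixed Unicode offset between the kana blocks; KATA and KANJI conversion are omitted, since KANJI requires the external kana-kanji converter.

```python
def kata_to_hira(c):
    # Katakana U+30A1..U+30F6 maps to hiragana at an offset of -0x60.
    return chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c

def apply_tags(chars, sedit, cconv):
    """Decode SEdit + CConv tag sequences into a normalized string."""
    out = []
    for c, e, t in zip(chars, sedit, cconv):
        if e == "DEL":
            continue
        if e.startswith("REP("):
            c = e[4:-1]          # replace with the argument character
        if t == "HIRA":
            c = kata_to_hira(c)  # character type conversion
        if e.startswith("INS_L("):
            out.append(e[6:-1])  # insert to the left
        out.append(c)
        if e.startswith("INS_R("):
            out.append(e[6:-1])  # insert to the right
    return "".join(out)

x = list("日本語まぢムズカシー")
ye = ["KEEP"] * 4 + ["REP(じ)"] + ["KEEP"] * 4 + ["REP(い)"]
yc = ["KEEP"] * 5 + ["HIRA"] * 4 + ["KEEP"]
print(apply_tags(x, ye, yc))  # 日本語まじむずかしい
```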
[Study 4] Variant Pair Acquisition
◆Standard and nonstandard word variant pairs for
pseudo-labeled data generation
A) Dictionary-based:
Extract variant pairs from UniDic, which has a hierarchical
lemma definition ⇒ 404K pairs
B) Rule-based:
Apply hand-crafted rules to transform standard forms into
nonstandard forms ⇒ 47K pairs
(6 out of 10 rules are similar to those in (Sasano+ 2013) and (Ikeda+ 2016).)
[Study 4] Pseudo-labeled Data Generation
◆Input
- (Auto-)segmented sentence x and
- Pair v of source (nonstandard) and target (standard) word variants
(K=KEEP, H=HIRA, D=DEL, IR=INS_R)
⚫ Target-side distant supervision (DStgt)
- Tag an actual sentence containing the source variant so that it maps
to a synthetic target sentence (Pro: actual sentences can be used).
- Example: x = スゴく|気|に|なる “(I’m) very curious.”, v = (スゴく, すごく);
ye = K K K K K K K, yc = H H K K K K K ⇒ すごく 気になる
⚫ Source-side distant supervision (DSsrc)
- Substitute the source variant into a sentence containing the target
variant, yielding a synthetic source sentence (Pro: any number of
synthetic sentences can be generated).
- Example: x = ほんとう|に|心配 “(I’m) really worried.”, v = (ほんっと, ほんとう);
synthetic source = ほんっとに心配 with ye = K K D IR(う) K K K,
yc = K K K K K K K ⇒ ほんとう に心配
(A sketch of deriving edit tags from a variant pair follows below.)
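A sketch of deriving SEdit tags from a variant pair; the alignment here uses Python’s difflib as an illustrative stand-in, not necessarily the paper’s alignment procedure, and gives up on pairs that the simple tag set cannot express.

```python
from difflib import SequenceMatcher

def edit_tags(src, tgt):
    """Character-level SEdit tags mapping a nonstandard form to a standard one."""
    tags = ["KEEP"] * len(src)
    for op, i1, i2, j1, j2 in SequenceMatcher(None, src, tgt).get_opcodes():
        if op == "equal":
            continue
        if op == "delete":
            for i in range(i1, i2):
                tags[i] = "DEL"
        elif op == "replace" and (i2 - i1) == (j2 - j1):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                tags[i] = f"REP({tgt[j]})"
        elif (op == "insert" and j2 - j1 == 1
              and i1 > 0 and tags[i1 - 1] == "KEEP"):
            tags[i1 - 1] = f"INS_R({tgt[j1]})"  # attach to preceding char
        else:
            return None  # not expressible with these simple tags
    return tags

print(edit_tags("ほんっと", "ほんとう"))
# ['KEEP', 'KEEP', 'DEL', 'INS_R(う)']
print(edit_tags("まぢ", "まじ"))
# ['KEEP', 'REP(じ)']
```

For DSsrc, the synthetic source sentence is then created by substituting the nonstandard variant into a real sentence containing the standard form; the tags above label the substituted characters, and all other characters receive KEEP.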
[Study 4] Experimental Data
◆Pseudo-labeled data for training (and development)
- Dict/rule-derived variant pairs Vd and Vr (top np=20K frequent pairs)
- BCCWJ: a mixed-domain corpus of news, blog, Q&A forum, etc.
• core data Dt with manual Seg&POS tags (57K sent.)
• non-core data Du with auto Seg&POS tags (3.5M sent.)
◆Test data: BQNC
- 929 manually annotated sentences constructed in our third study
[Figure: Dt and Du are combined with the variant pairs (at most ns=10
sentences extracted per pair for DStgt) to produce the training sets
At, Ad, and Ar of 173K, 170K, and 57K sentences.]
[Study 4] Experimental Settings
◆Our model
- BiLSTM + task-specific softmax layers
- Character embedding, pronunciation embedding, and
nonstandard word lexicon binary features
- Hyperparameters
• num_BiLSTM_layers=2, num_BiLSTM_units=1,000, char_emb_d=200, pron_emb_d=30, etc.
◆Baseline methods
- MeCab and MeCab+ER (Sasano+ 2013)
◆Evaluation
1. Main results
2. Effect of dataset size
3. Detailed results of normalization
4. Performance for known and unknown normalization instances
5. Error analysis
[Study 4] Exp 1. Main Results
◆Results
- Our method achieved better Norm performance
when trained on more types of pseudo-labeled data
- MeCab+ER achieved the best performance on Seg and POS
[Table: Seg/POS/Norm results for MeCab, MeCab+ER, and our BiLSTM model
trained on combinations of At = DSsrc(Vdic), Ad = DStgt(Vdic), and
Ar = DStgt(Vrule)]
[Study 4] Exp 5. Error Analysis
◆Detailed normalization performance
- Our method outperformed MeCab+ER for all categories.
- Major errors by our model were mis-detection and
invalid tag prediction.
- Kanji conversion accuracy was 97% (67/70).
Examples of TPs:
ほんと (に少人数で) → ほんとう ‘actually’; すげぇ → すごい ‘great’;
フツー (の話をして) → 普通 ‘ordinary’; そーゆー → そう|いう ‘such’;
な~に (言ってんの) → なに ‘what’; まぁるい → まるい ‘round’
Examples of FPs:
ガコンッ → ガコン ‘thud’; ゴホゴホ → ごほごホ ‘coughing sound’;
はぁぁ → はああ ‘sighing sound’; おお~~ → 王 ‘king’;
ケータイ → ケイタイ ‘cell phone’; ダルい → だるい ‘dull’
[Study 4] Conclusion
- We proposed a text editing-based method for Japanese WS,
POS tagging, and LN.
- We proposed effective methods for generating pseudo-labeled data
for Japanese LN.
- The proposed method outperformed an existing method
on the joint segmentation and normalization task.
Summary and Future Directions
Summary of This Dissertation
1. How to achieve accurate WS in various text domains
- We proposed approaches for three domain types,
which can be effective options to achieve accurate
WS and downstream tasks in these domains.
2. How to train/evaluate WS and LN models for Ja UGT
- We constructed a public evaluation corpus, which can be
a useful benchmark to compare existing and future systems.
- We proposed a joint WS&LN method trained on pseudo-
labeled data, which can be a good baseline for developing
more practical Japanese LN methods in the future.
Directions for Future Work
◆Model size and inference speed
- Knowledge distillation is a prospective approach to train a fast and
lightweight student model from an accurate teacher model.
◆Investigation of optimal segmentation units
- Optimal units and effective combinations of different units
(char/subword/word) for downstream tasks remain to be explored.
◆Performance improvement on UGT processing
- Incorporating knowledge in large pretrained LMs may be effective.
◆Evaluation on broader UGT domains and phenomena
- Constructing evaluation data in various UGT domains would be
beneficial for evaluating system performance on frequently occurring
phenomena, such as proper names and neologisms, in other UGT domains.
Directions for Future Work (Summary)
[Figure: word segmentation/tokenization across general, specialized, and
UGT domains. Addressed in this dissertation: more accurate models
(Studies 1, 2, 4) and corpus annotation for fair evaluation (Study 3).
Future directions: faster models, optimal tokens for tokenization,
broader domain corpora, and further improvement.]

More Related Content

What's hot

PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
Rommel Carvalho
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
paperpublications3
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
KozoChikai
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
inscit2006
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
RIILP
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
Ali Kabbadj
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
cseij
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
Sebastian Ruder
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
Surya Sg
 
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course LecturesAutomatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
Yun-Nung (Vivian) Chen
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for TranslationRIILP
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course Lectures
Yun-Nung (Vivian) Chen
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET Journal
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
Sameer Wadkar
 
17. Anne Schuman (USAAR) Terminology and Ontologies 2
17. Anne Schuman (USAAR) Terminology and Ontologies 217. Anne Schuman (USAAR) Terminology and Ontologies 2
17. Anne Schuman (USAAR) Terminology and Ontologies 2RIILP
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
Lidia Pivovarova
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
IJERA Editor
 

What's hot (20)

PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On SilenceSegmentation Words for Speech Synthesis in Persian Language Based On Silence
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
 
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course LecturesAutomatic Key Term Extraction and Summarization from Spoken Course Lectures
Automatic Key Term Extraction and Summarization from Spoken Course Lectures
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation
 
Automatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course LecturesAutomatic Key Term Extraction from Spoken Course Lectures
Automatic Key Term Extraction from Spoken Course Lectures
 
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
IRJET- Survey on Deep Learning Approaches for Phrase Structure Identification...
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
 
17. Anne Schuman (USAAR) Terminology and Ontologies 2
17. Anne Schuman (USAAR) Terminology and Ontologies 217. Anne Schuman (USAAR) Terminology and Ontologies 2
17. Anne Schuman (USAAR) Terminology and Ontologies 2
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Ny3424442448
Ny3424442448Ny3424442448
Ny3424442448
 

Similar to Word Segmentation and Lexical Normalization for Unsegmented Languages

THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
kevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
kevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
kevig
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
kevig
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...butest
 
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
IJERA Editor
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
cscpconf
 
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Normunds Grūzītis
 
Customizable Segmentation of
Customizable Segmentation ofCustomizable Segmentation of
Customizable Segmentation ofAndi Wu
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
kevig
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
kevig
 
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET Journal
 
neural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classificationneural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classification
JEE HYUN PARK
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
Journal For Research
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
alessio_ferrari
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
Lifeng (Aaron) Han
 
Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
Universidad Nacional de San Martin
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
ijnlc
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
cscpconf
 

Similar to Word Segmentation and Lexical Normalization for Unsegmented Languages (20)

THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...
 
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
 
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...
 
Customizable Segmentation of
Customizable Segmentation ofCustomizable Segmentation of
Customizable Segmentation of
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
 
neural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classificationneural based_context_representation_learning_for_dialog_act_classification
neural based_context_representation_learning_for_dialog_act_classification
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
 

Recently uploaded

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

- Examples: オハヨー ohayoo ‘good morning’ (おはよう),
すっっげええ suggee ‘awesome’ (すごい)
- Achieving accurate WS and downstream processing is difficult.
◆Lexical Normalization (LN)
- Task to transform nonstandard words into standard forms
- Main problem: the lack of public labeled data for evaluating
and training Japanese LN models
Research issue 2
- How to train/evaluate WS and LN models for Japanese UGT
under the low-resource situation
[Figure: two online translators (A, B) render the raw sentence
日本語まぢムズカシイ as “Japanese Majimu Zukashii” / “Japanese
Majimuzukashii”, but translate the normalized 日本 語 まじ
難しい/むずかしい as “Japanese is really difficult” /
“Japanese is difficult”.]
7
Contributions of This Dissertation (1/2)
1. How to achieve accurate WS in various text domains
- We proposed an effective approach for each of three domain types:
• General domains → [Study 1] Neural model combining character
and word features
• Specialized domains → [Study 2] Auxiliary prediction task based
on unlabeled data and a lexicon
• UGT domains → [Study 4] Joint prediction of WS and LN
➢Our methods can be effective options to achieve accurate WS
and downstream tasks in these domains.
8
Contributions of This Dissertation (2/2)
2. How to train/evaluate WS and LN models for Ja UGT
- We constructed manually/automatically-annotated corpora:
• UGT domains → [Study 3] Evaluation corpus annotation and
[Study 4] Pseudo-training data generation
➢Our evaluation corpus can be a useful benchmark to compare
and analyze existing and future systems.
➢Our LN method can be a good baseline for developing more
practical Japanese LN methods in the future.
9
Overview of Studies in This Dissertation
◆I focused on improving WS and LN accuracy for each domain type.
- Performance: development of more accurate models for general,
specialized, and UGT domains [Studies 1, 2, 4]
- Prerequisite: corpus annotation for fair evaluation [Study 3]
- (Development of faster models is left as a future direction.)
10
Study 1: Word Segmentation for General Domains
Higashiyama et al., “Incorporating Word Attention into Character-Based
Word Segmentation”, NAACL-HLT, 2019
Higashiyama et al., “Character-to-Word Attention for Word Segmentation”,
Journal of Natural Language Processing, 2020 (Paper Award)
Study 1: WS for General Domains
◆Goal: Achieve more accurate WS in general domains
◆Background
- Limited efforts have been devoted to leveraging complementary
character and word information for neural WS.
• Char-based: テ キ ス ト の 分 割 → テキスト|の|分割
(efficient prediction via first-order sequence labeling)
• Word-based: テキストの分割 → テキスト の 分割
(easy use of word-level info)
◆Our contributions
- Proposed a char-based model incorporating word information
- Achieved performance better than or competitive with existing
SOTA models on Japanese and Chinese datasets
12
[Study 1] Proposed Model Architecture
◆Char-based model with char-to-word attention to learn
the importance of candidate words
[Architecture: input chars (彼 は 日 本 人) → char embedding lookup →
BiLSTM → char context vector hi → attend over candidate words from
the word vocab (e.g., 本 ‘book’, 日本 ‘Japan’, 本人 ‘the person’)
with word embeddings ew_j → aggregate into word summary vector ai →
second BiLSTM → CRF → segmentation labels (S S B E S)]
13
[Study 1] Character-to-Word Attention
- For each character i in the input sentence (彼 は 日 本 人 だ 。),
candidate words containing it (up to the max word length of 4),
e.g., 本, 日本, 本人, は日本, 日本人, 本人だ, …, are looked up in
the word vocab.
- Attention weight of char i over candidate word j:
  α_ij = exp(h_i^T W e^w_j) / Σ_k exp(h_i^T W e^w_k)
  (h_i: char context vector, e^w_j: word embedding)
- The word summary vector a_i aggregates the e^w_j by WAVG
(weighted average) or WCON (weighted concat), as sketched below.
14
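To make the attention computation concrete, here is a minimal numpy
sketch. The dimensions, variable names, and the fixed candidate-slot
handling for WCON are illustrative assumptions, not the authors'
implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_summary(h_i, cand_word_embs, W, mode="WAVG"):
    """h_i: (d_c,) char context vector from the BiLSTM.
    cand_word_embs: (K, d_w) embeddings of the K candidate words
    containing character i (word length <= 4).
    W: (d_c, d_w) bilinear attention matrix."""
    scores = cand_word_embs @ (W.T @ h_i)   # s_j = h_i^T W e^w_j
    alpha = softmax(scores)                 # attention weights α_ij
    if mode == "WAVG":                      # weighted average
        return alpha @ cand_word_embs       # a_i: (d_w,)
    # WCON: weighted concat; keeps per-slot (length/position) info,
    # so K must be fixed per position (padded in practice).
    return (alpha[:, None] * cand_word_embs).reshape(-1)

rng = np.random.default_rng(0)
h_i = rng.normal(size=300)                  # char context vector
E = rng.normal(size=(3, 300))               # e.g., 本, 日本, 本人
W = rng.normal(size=(300, 300)) * 0.01
a_i = word_summary(h_i, E, W)               # fed to the upper BiLSTM-CRF
```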
[Study 1] Construction of Word Vocabulary
- The word vocabulary comprises words in the training set and words
obtained by automatically segmenting unlabeled text with the
BiLSTM-CRF baseline (min word freq = 5); see the sketch below.
- Word embeddings are pre-trained on the segmented text with Word2Vec.
15
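A minimal sketch of this pipeline, where `baseline_segment` stands in
for decoding with the trained BiLSTM-CRF baseline. That the frequency
threshold applies only to auto-segmented words (gold training words
are always kept) is my assumption.

```python
from collections import Counter

def build_word_vocab(train_sentences, unlabeled_sentences,
                     baseline_segment, min_freq=5):
    # Gold words from the labeled training set are always included.
    vocab = set(w for sent in train_sentences for w in sent)
    counts = Counter()
    for raw in unlabeled_sentences:
        counts.update(baseline_segment(raw))   # auto-segmented words
    vocab |= {w for w, c in counts.items() if c >= min_freq}
    return vocab
```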
[Study 1] Experimental Datasets
◆Training/Test data
- Chinese: 2 source domains
- Japanese: 4 source domains and 7 target domains
◆Unlabeled text for pre-training word embeddings
- Chinese: 48M sentences in Chinese Gigaword 5
- Japanese: 5.9M sentences in BCCWJ non-core data
16
[Study 1] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2 or 3, num_BiLSTM_units=600,
char/word_emb_dim=300, min_word_freq=5, max_word_length=4, etc.
◆Evaluation
1. Comparison of baseline and proposed model variants
(and analysis of model size)
2. Comparison with existing methods on in-domain and
cross-domain datasets
3. Effect of semi-supervised learning
4. Effect of word frequency and length
5. Effect of attention on segmentation performance
6. Effect of additional word embeddings from target domains
7. Analysis of segmentation examples
17
[Study 1] Exp 1. Comparison of Model Variants
◆F1 on development sets (mean of three runs)
- Word-integrated models outperformed BASE (BiLSTM-CRF) by up to
1.0 points (significant in 20 of 24 cases).
- Attention-based models outperformed their non-attention
counterparts in 10 of 12 cases (significant in 4 cases).
- WCON achieved the best performance, possibly because it preserves
word length and char position information.
† significant at the 0.01 level over the baseline
‡ significant at the 0.01 level over the variant w/o attention
18
[Study 1] Exp 2. Comparison with Existing Methods
◆F1 on test sets (mean of three runs)
- WCON achieved performance better than or competitive with
existing methods.
(More recent work has achieved further improvements on Chinese
datasets.)
19
[Study 1] Exp 5. Effect of Attention on Segmentation
◆Character-level accuracy on BCCWJ-dev (most frequent cases where
both correct and incorrect candidate words exist for a character)
- Segmentation label accuracy: 99.54%
- Attention accuracy for proper words: 93.25%
◆Controlled experiment: synthetic attention distributions put high
weight (0.8) on the correct candidate word (e.g., 日本) with
probability pt and on an incorrect one (e.g., 本人) otherwise
(p ~ Uniform(0,1); correct if p < pt). Segmentation accuracy of
the trained model increased for larger “correct attention
probability” pt.
20
[Study 1] Conclusion
- We proposed a neural word segmenter with attention, which
incorporates word information into a character-level sequence
labeling framework.
- Our experiments showed that
• the proposed method, WCON, achieved performance better than or
competitive with existing methods, and
• learning appropriate attention weights contributed to accurate
segmentation.
21
Study 2: Word Segmentation for Specialized Domains
Higashiyama et al., “Auxiliary Lexicon Word Prediction for Cross-Domain
Word Segmentation”, Journal of Natural Language Processing, 2020
Study 2: WS for Specialized Domains
◆Goal
- Improve WS performance for specialized domains where labeled
data is unavailable
➢Our focus: how to use linguistic resources in the target domain
◆Our contributions
- Proposed a WS method that learns signals of word occurrences
from unlabeled sentences and a lexicon in the target domain
- Our method improved performance for various Chinese and Japanese
target domains.

Domain Type       Labeled data  Unlabeled data  Lexicon
Specialized dom.  ✕             ✓               △
(✓: available, △: sometimes available, ✕: almost unavailable)
23
[Study 2] Cross-Domain WS with Linguistic Resources
◆Methods for cross-domain WS
- Modeling lexical information: lexicon features and neural
representation learning ((Liu+ 2019), (Gan+ 2019), ours)
- Modeling statistical information: generating pseudo-labeled data
((Neubig+ 2011), (Zhang+ 2018))
◆Our model
- To overcome the limitation of lexicon features, we model lexical
information via an auxiliary task for neural models.
➢Assumption:
Domain  Labeled data  Unlabeled data  Lexicon
Source  ✓             ✓               ✓
Target  ✕             ✓               ✓
24
[Study 2] Lexicon Feature
- For each character, binary features indicate whether it can be the
Beginning, Inside, End, or whole (Single) of a lexicon word; see
the sketch below.
- Source sentence 週末の外出自粛要請 ‘self-restraint request on the
weekend’: lexicon matches (週, 末, の, …, 週末, 外出, 出自, 自粛,
要請, …) yield [B]/[I]/[E]/[S] feature vectors, which the model
learns jointly with the gold labels (B E S B E B E B E).
- Target sentence 長短期記憶ネットワーク ‘long short-term memory
network’: features can be generated from the lexicon (長短, 短期,
記憶, ネット, ワーク, ネットワーク, …), but the model cannot learn
the relationship b/w feature values and segmentation labels in a
target domain w/o labeled data.
25
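The feature generation itself is simple dictionary matching. A sketch
under the slide's setting (max word length 4), with an illustrative
toy lexicon:

```python
def lexicon_features(sent, lexicon, max_len=4):
    """Binary [B]/[I]/[E]/[S] indicators per character."""
    n = len(sent)
    feats = {tag: [0] * n for tag in "BIES"}
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            if sent[i:j] not in lexicon:
                continue
            if j - i == 1:
                feats["S"][i] = 1          # single-char word
            else:
                feats["B"][i] = 1          # word beginning
                feats["E"][j - 1] = 1      # word end
                for k in range(i + 1, j - 1):
                    feats["I"][k] = 1      # word inside
    return feats

lex = {"週", "末", "の", "週末", "外出", "出自", "自粛", "要請"}
print(lexicon_features("週末の外出自粛要請", lex))
```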
[Study 2] Our Lexicon Word Prediction
- We introduce auxiliary tasks to predict whether each character
corresponds to specific positions ([B]/[I]/[E]/[S]) in lexicon
words.
- The same [B]/[I]/[E]/[S] sequences are generated from the lexicon,
but are predicted as auxiliary labels rather than fed in as input
features: for source sentences together with segmentation labels,
and for target unlabeled sentences on their own.
- The model thus learns parameters also from target unlabeled
sentences, as sketched below.
26
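A sketch of how the auxiliary lexicon word prediction (LWP) objective
lets unlabeled target sentences contribute to training. The shared
encoder, the `seg_head`/`lwp_heads` names, and the loss weighting are
my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def training_loss(model, batch, lam=1.0):
    h = model.encode(batch.chars)           # shared BiLSTM states (N, L, d)
    loss = h.new_zeros(())
    if batch.has_gold_segmentation:         # source-domain batch only
        loss = F.cross_entropy(model.seg_head(h).transpose(1, 2),
                               batch.seg_labels)
    for tag in "BIES":                      # LWP labels come from lexicon
        logits = model.lwp_heads[tag](h)    # matching, so target-domain
        loss = loss + lam * F.binary_cross_entropy_with_logits(
            logits.squeeze(-1), batch.lwp_labels[tag].float())
    return loss                             # unlabeled batches also train
```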
[Study 2] Methods and Experimental Data
◆Linguistic resources for training
- Source domain labeled data
- General and domain-specific unlabeled data
- Lexicons: UniDic (JA) or Jieba (ZH), and semi-automatically
constructed domain-specific lexicons (390K-570K source words &
0-134K target words)
◆Methods
- Baselines: BiLSTM (BASE), BASE + self-training (ST), and
BASE + lexicon feature (LF)
- Proposed: BASE + MLPs for segmentation and auxiliary LWP tasks
(Target domains. JNL: CS journal; JPT, CPT: patent; RCP: recipe;
C-ZX, P-ZX, FR, DL: novel; DM: medical)
27
[Study 2] Experimental Settings
◆Hyperparameters
- num_BiLSTM_layers=2, num_BiLSTM_units=600, char_emb_dim=300,
num_MLP_units=300, min_word_len=1, max_word_len=4, etc.
◆Evaluation
1. In-domain results
2. Cross-domain results
3. Comparison with SOTA methods
4. Influence of the weight for the auxiliary loss
5. Results for non-adapted domains
6. Performance on unknown words
28
[Study 2] Exp 2. Cross-Domain Results
◆F1 on test sets (mean of three runs)
- LWP-S (source lexicon) outperformed BASE and ST.
- LWP-T (target lexicon) significantly outperformed all three
baselines (+3.2 over BASE, +3.0 over ST, +1.2 over LF on average).
- Results of LWP-O (oracle), which uses gold test words, indicate
that higher-coverage lexicons would yield further improvements.
★ significant at the 0.001 level over BASE
† significant at the 0.001 level over ST
‡ significant at the 0.001 level over LF
29
[Study 2] Exp 3. Comparison with SOTA Methods
◆F1 on test sets
- Our method achieved performance better than or competitive with
SOTA methods on Japanese and Chinese datasets, including
Higashiyama+ ’19 (our method in the first study).
30
[Study 2] Exp 6. Performance on Unknown Words
◆Recall of the top 10 frequent OOTV words (JA: JPT (patent);
ZH: FR (novel))
- For out-of-training-vocabulary (OOTV) words in the test sets, our
method achieved better recall for words in the lexicon (Ls∪Lt),
but worse recall for words not in it.
31
[Study 2] Conclusion
- We proposed a cross-domain WS method that incorporates lexical
knowledge via an auxiliary prediction task.
- Our method achieved better performance for various target domains
than the lexicon feature baseline and existing methods (while
preventing performance degradation in source domains).
32
Study 3: Construction of a Japanese UGT Corpus for WS and LN
Higashiyama et al., “User-Generated Text Corpus for Evaluating Japanese
Morphological Analysis and Lexical Normalization”, NAACL-HLT, 2021
Study 3: UGT Corpus Construction
◆Background
- The lack of a public evaluation corpus for Japanese WS and LN
◆Goal
- Construct a public evaluation corpus for the development and fair
comparison of Japanese WS and LN systems
◆Our contributions
- Constructed a corpus of blog and Q&A forum text annotated with
morphological and normalization information
- Conducted a detailed evaluation of UGT-specific problems of
existing methods
Example: 日本語まぢムズカシイ ‘Japanese is really difficult.’
→ segmented 日本語|まぢ|ムズカシイ with normalizations
まぢ → まじ and ムズカシイ → 難しい/むずかしい
34
[Study 3] Corpus Construction Policies
1. Available and restorable
- Use blog and Chiebukuro (Yahoo! Answers) sentences in the BCCWJ
non-core data and publish the annotation information
2. Compatible with an existing segmentation standard
- Follow NINJAL’s SUW (short unit word, 短単位) standard and extend
the specification regarding nonstandard words
3. Enabling a detailed evaluation of UGT-specific phenomena
- Organize frequently observed linguistic phenomena into several
categories and annotate every token with a category
35
[Study 3] Example Sentence in Our Corpus
Raw sentence: イイ歌ですねェ ii uta desu nee
‘It’s a good song, isn’t it?’

Word boundary   イイ            歌          です          ねェ
Gloss           ii ‘good’      uta ‘song’  desu (copula) nee (emphasis marker)
Part-of-speech  形容詞          名詞        助動詞        助詞
Standard forms  良い,よい,いい  -           -             ね
Categories      Char type      -           -             Sound change
                variant                                  variant
36
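As one way to picture the annotation layers, a hypothetical per-token
record mirroring the example above; the actual file format of the
released corpus (https://github.com/shigashiyama/jlexnorm) may differ.

```python
# Hypothetical record layout for one annotated token; field names
# are illustrative, not the corpus's real schema.
token = {
    "surface": "ねェ",
    "pos": "助詞",              # particle
    "standard_forms": ["ね"],
    "category": "音変化",       # sound change variant
}
sentence = {
    "raw": "イイ歌ですねェ",
    "tokens": ["イイ", "歌", "です", "ねェ"],  # word boundaries
}
```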
[Study 3] Corpus Details
◆Word categories
- 11 categories were defined for non-general or nonstandard words
that may often cause segmentation errors:
新語/スラング (neologism/slang), 固有名 (proper name), オノマトペ
(onomatopoeia), 感動詞 (interjection), 方言 (dialect), 外国語
(foreign language), 顔文字/AA (emoticon/ASCII art), 異文字種
(character type variant), 代用表記 (alternative representation),
音変化 (sound change variant), 誤表記 (misspelling)
- Most of our categories overlap with (Kaji+ 2015)’s classification.
◆Corpus statistics
37
[Study 3] Experiments
Using our corpus, we evaluated two existing systems trained only with
an annotated corpus for WS and POS tagging:
• MeCab (Kudo+ 2004) with UniDic v2.3.0
- A popular Japanese morphological analyzer based on CRFs
(basic usage is sketched below)
• MeCab+ER (Expansion Rules)
- Our MeCab-based implementation of (Sasano+ 2013)’s rule-based
lattice expansion method, which dynamically adds nodes to the
analysis lattice by human-crafted rules
[Figure cited from (Sasano+ 2014): analysis lattice for a sentence
meaning ‘It was delicious.’]
38
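For reference, MeCab can be driven from Python via the mecab-python3
bindings roughly as follows, assuming MeCab and a UniDic dictionary
are installed; the output columns depend on the installed dictionary.

```python
import MeCab

tagger = MeCab.Tagger()        # uses the system default dictionary
print(tagger.parse("日本語まぢムズカシイ"))
# A plain MeCab model trained on standard text tends to mishandle
# nonstandard spellings such as まぢ and ムズカシイ, which is exactly
# what the corpus above is designed to measure.
```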
[Study 3] Experiments
◆Evaluation
1. Overall results
2. Results for each category
3. Analysis of segmentation results
4. Analysis of normalization results
39
[Study 3] Exp 1. Overall Performance
◆Results
- MeCab+ER achieved better Seg and POS performance by 2.5-2.9 F1
points, but poor Norm recall.
40
[Study 3] Exp 2. Recall for Each Category
◆Results
- Both systems achieved high Seg and POS performance for general
and standard words, but lower performance for UGT-characteristic
words.
- MeCab+ER correctly normalized 30-40% of SCV (sound change variant)
and AR (alternative representation) nonstandard words, but none of
those in the other two categories.
41
[Study 3] Conclusion
- We constructed a public Japanese UGT corpus annotated with
morphological and normalization information
(https://github.com/shigashiyama/jlexnorm).
- Experiments on the corpus demonstrated the limited performance of
existing systems for non-general and nonstandard words.
42
Study 4: WS and LN for Japanese UGT
Higashiyama et al., “A Text Editing Approach to Joint Japanese Word
Segmentation, POS Tagging, and Lexical Normalization”, W-NUT, 2021
(Best Paper Award)
Study 4: WS and LN for Japanese UGT
◆Goal
- Develop a Japanese WS and LN model with better performance than
existing systems, under the condition that labeled data for LN is
unavailable
◆Our contributions
- Proposed generation methods for pseudo-labeled data and a text
editing-based method for Japanese WS, POS tagging, and LN
- Achieved better normalization performance than an existing method
44
[Study 4] Background and Motivation
◆Frameworks for text generation
⚫ Text editing method for English lexical normalization
(Chrupała 2014)
⚫ Encoder-decoder model for Japanese sentence normalization
(Ikeda+ 2016)
◆Our approach
- Generate pseudo-labeled data for LN using lexical knowledge
- Use a text editing-based model to learn efficiently from a small
amount of (high-quality) training data
45
[Study 4] Task Formulation
◆Formulation as multiple sequence labeling tasks
- For an input character sequence x, the model predicts
segmentation labels ys (Seg), POS labels yp, and normalization
tags ye and yc (Norm).
◆Normalization tags for Japanese character sets
- String edit operation (SEdit): {KEEP, DEL, INS_L(c), INS_R(c),
REP(c)} (c: hiragana or katakana)
- Character type conversion (CConv): {KEEP, HIRA, KATA, KANJI}
Example (decoding is sketched below):
  x  = 日 本 語 ま ぢ ム ズ カ シ ー
  ys = B E S B E B I I I E
  yp = Noun Noun Noun Adv Adv Adj Adj Adj Adj Adj
  ye = KEEP KEEP KEEP KEEP REP(じ) KEEP KEEP KEEP KEEP REP(い)
  yc = KEEP KEEP KEEP KEEP KEEP HIRA HIRA HIRA HIRA KEEP
  ⇒ まぢ → まじ, ムズカシー → むずかしい
◆Kana-kanji conversion
- Characters tagged KANJI are converted by a kana-kanji converter
(n-gram LM): e.g., in も う あ き だ ‘It’s already autumn.’, あき
is tagged KANJI → candidates 秋 ‘autumn’, 空き ‘vacancy’,
飽き ‘bored’, …
46
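A minimal sketch of decoding the SEdit and CConv tag sequences back
into a normalized string, using the example above. The decoding order
and the kana-to-hiragana mapping are my assumptions, and kana-kanji
conversion (for KANJI tags) is stubbed out.

```python
def kata_to_hira(ch):
    # Katakana and hiragana blocks are offset by 0x60 in Unicode.
    return chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch

def apply_tags(chars, sedit, cconv):
    out = []
    for ch, e, c in zip(chars, sedit, cconv):
        if e == "DEL":
            continue
        if e.startswith("REP("):
            ch = e[4:-1]                # REP(c): replace with c
        if e.startswith("INS_L("):
            out.append(e[6:-1])         # INS_L(c): insert c before
        if c == "HIRA":
            ch = kata_to_hira(ch)       # char type conversion
        # KATA/KANJI conversions omitted; KANJI spans would go
        # through a kana-kanji converter (n-gram LM) in the paper.
        out.append(ch)
        if e.startswith("INS_R("):
            out.append(e[6:-1])         # INS_R(c): insert c after
    return "".join(out)

x = list("日本語まぢムズカシー")
ye = ["KEEP"]*4 + ["REP(じ)"] + ["KEEP"]*4 + ["REP(い)"]
yc = ["KEEP"]*5 + ["HIRA"]*4 + ["KEEP"]
print(apply_tags(x, ye, yc))            # → 日本語まじむずかしい
```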
[Study 4] Variant Pair Acquisition
◆Standard and nonstandard word variant pairs for pseudo-labeled
data generation
A) Dictionary-based: Extract variant pairs from UniDic, which has
hierarchical lemma definitions ⇒ 404K pairs
B) Rule-based: Apply hand-crafted rules to transform standard forms
into nonstandard forms ⇒ 47K pairs
(6 out of 10 rules are similar to those in (Sasano+ 2013) and
(Ikeda+ 2016).)
47
[Study 4] Pseudo-labeled Data Generation
◆Input
- (Auto-)segmented sentence x, and
- Pair v of source (nonstandard) and target (standard) word variants
⚫ Target-side distant supervision (DStgt): an actual sentence
containing the nonstandard form is tagged so that it rewrites to a
synthetic target sentence. Pro: actual sentences can be used.
  x = スゴく|気|に|なる ‘(I’m) very curious.’, v = (スゴく, すごく)
  ye = K K K K K K K, yc = H H K K K K K ⇒ すごく 気になる
⚫ Source-side distant supervision (DSsrc): the standard form in an
actual sentence is replaced by the nonstandard form, yielding a
synthetic source sentence (see the sketch below). Pro: any number
of synthetic sentences can be generated.
  x = ほんとう|に|心配 ‘(I’m) really worried.’, v = (ほんっと, ほんとう)
  synthetic source ほんっと|に|心配 with
  ye = K K D IR(う) K K K ⇒ ほんとう に心配
(K=KEEP, H=HIRA, D=DEL, IR=INS_R)
48
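A sketch of DSsrc-style generation: substitute the nonstandard
variant into a segmented sentence, then derive the char-level edit
tags by aligning the two strings. `difflib` here is a rough stand-in
for whatever alignment procedure the authors actually used, and the
REP branch assumes equal-length replacements.

```python
import difflib

def derive_tags(src, tgt):
    """Char-level KEEP/DEL/REP/INS_R tags that rewrite src into tgt."""
    tags = []
    sm = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            tags += ["KEEP"] * (i2 - i1)
        elif op == "delete":
            tags += ["DEL"] * (i2 - i1)
        elif op == "replace":
            tags += [f"REP({c})" for c in tgt[j1:j2]]
        elif op == "insert" and tags:
            tags[-1] = f"INS_R({tgt[j1:j2]})"  # attach to previous char
    return tags

def make_dssrc_example(seg_sentence, nonstd, std):
    """seg_sentence: list of words containing the standard form std."""
    src = "".join(nonstd if w == std else w for w in seg_sentence)
    return src, derive_tags(src, "".join(seg_sentence))

src, ye = make_dssrc_example(["ほんとう", "に", "心配"],
                             "ほんっと", "ほんとう")
print(src, ye)  # ほんっとに心配, tags rewriting it to ほんとうに心配
```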
[Study 4] Experimental Data
◆Pseudo-labeled data for training (and development)
- Dict/Rule-derived variant pairs: Vd and Vr
- BCCWJ: a mixed-domain corpus of news, blogs, Q&A forums, etc.
• Dt: core data with manual Seg&POS tags (57K sent.)
• Du: non-core data with auto Seg&POS tags (3.5M sent.)
- Generated data
• At = DSsrc over Dt with Vd (top np=20K frequent pairs):
57K sent. → 173K synthetic sent.
• Ad = DStgt over Du with Vd: 170K synthetic sent.
• Ar = DStgt over Du with Vr: 57K sent.
(at most ns=10 sent. were extracted for each pair)
◆Test data: BQNC
- 929 manually annotated sentences constructed in our third study
49
[Study 4] Experimental Settings
◆Our model
- BiLSTM + task-specific softmax layers
- Character embeddings, pronunciation embeddings, and nonstandard
word lexicon binary features
- Hyperparameters: num_BiLSTM_layers=2, num_BiLSTM_units=1,000,
char_emb_d=200, pron_emb_d=30, etc.
◆Baseline methods
- MeCab and MeCab+ER (Sasano+ 2013)
◆Evaluation
1. Main results
2. Effect of dataset size
3. Detailed results of normalization
4. Performance on known and unknown normalization instances
5. Error analysis
50
[Study 4] Exp 1. Main Results
◆Results (At: DSsrc(Vdic), Ad: DStgt(Vdic), Ar: DStgt(Vrule);
kana-kanji conversion applied as postprocessing)
- Our (BiLSTM) method achieved better Norm performance when trained
on more types of pseudo-labeled data.
- MeCab+ER achieved the best performance on Seg and POS.
51
[Study 4] Exp 5. Error Analysis
◆Detailed normalization performance
- Our method outperformed MeCab+ER for all categories.
- Major errors by our model were mis-detection and invalid tag
prediction.
- Kanji conversion accuracy was 97% (67/70).
Examples of TPs: ほんと (に少人数で) → ほんとう ‘actually’;
すげぇ → すごい ‘great’; フツー (の話をして) → 普通 ‘ordinary’;
そーゆー → そう|いう ‘such’; な~に (言ってんの) → なに ‘what’;
まぁるい → まるい ‘round’
Examples of FPs: ガコンッ → ガコン ‘thud’; ゴホゴホ → ごほごホ
‘coughing sound’; はぁぁ → はああ ‘sighing sound’; おお~~ → 王
‘king’; ケータイ → ケイタイ ‘cell phone’; ダルい → だるい ‘dull’
52
[Study 4] Conclusion
- We proposed a text editing-based method for Japanese WS, POS
tagging, and LN.
- We proposed effective generation methods for pseudo-labeled data
for Japanese LN.
- The proposed method outperformed an existing method on the joint
segmentation and normalization task.
53
Summary and Future Directions
Summary of This Dissertation
1. How to achieve accurate WS in various text domains
- We proposed approaches for three domain types, which can be
effective options for achieving accurate WS and downstream tasks
in these domains.
2. How to train/evaluate WS and LN models for Ja UGT
- We constructed a public evaluation corpus, which can be a useful
benchmark for comparing existing and future systems.
- We proposed a joint WS&LN method trained on pseudo-labeled data,
which can be a good baseline for developing more practical
Japanese LN methods in the future.
55
Directions for Future Work
◆Model size and inference speed
- Knowledge distillation is a promising approach to train a fast and
lightweight student model from an accurate teacher model.
◆Investigation of optimal segmentation units
- Optimal units and effective combinations of different units
(char/subword/word) for downstream tasks remain to be explored.
◆Performance improvement on UGT processing
- Incorporating knowledge from large pretrained LMs may be effective.
◆Evaluation on broader UGT domains and phenomena
- Constructing evaluation data in various UGT domains would help
evaluate system performance on frequently occurring phenomena in
other UGT domains, such as proper names and neologisms.
56
Directions for Future Work (Summary)
- Completed: more accurate word segmentation models for general,
specialized, and UGT domains [Studies 1, 2, 4]; corpus annotation
for fair evaluation [Study 3]
- Future: faster models, further improvement of UGT processing,
broader-domain corpora, and exploring optimal tokens for
tokenization beyond word segmentation
57