Warnik Chow - 2018 HCLT

Human Interface Laboratory
Concept and Application of
Deep learning-based Automatic Spacing
2018. 10. 13
Won Ik Cho

Contents
• Introduction
 Motivation & related works
• System Concept
 Corpus: conversation-style and non-normalized utterances
 Deep learning-based automatic spacer
 Web interface for user-generated text
• Experiment
 Training specification
 Qualitative analysis
 Sample application
• Conclusion & Future Work
1

Introduction
• Ambiguity from being non-spaced:
Just a carelessness?
2
Photos from: https://brunch.co.kr/@bong/642

Introduction
• Motivation: For someone, spacing can be a task that requires effort
3
Image created by Visual Dive

Introduction
• How about automatically spacing the user-generated texts?
 For both a real-time chat and a post upload?
 Not mandatory; options can be provided
 Useful for digitally marginalized people
(and also for the users who find annoyance in spacing)
• Limitations in related approaches
 Related approaches include both web services and published algorithms
 Usually targets “correction” of an input sentence
• e.g. PNU module or NAVER checker
 or predicts the spacing based on preceding characters
• Probabilistic models based on character n-grams (Lee, 2007)
• Pegasos algorithm, Structural SVM (Lee, 2013/2014)
4

Introduction
• Recent approach based on deep learning architecture
 PyKoSpacing (Jeon, 2018, github.com/haven-jeon/PyKoSpacing)
• Utilizes CNN, Bidirectional GRU, character n-gram on News articles (>100M)
5

Introduction
• Reported performance
• Deep learning-based approach can provide a spacing that is useful
as a preprocessing for Korean text ...
 But how can it be applied to real-time tasks with noisy input?
• Our contribution
 Corpus: conversation-style and non-normalized utterances
• which covers noisy user input
 Deep learning-based automatic spacer
• which utilizes dense character vectors
 Web interface for user-generated text
• which can be applied to various web-based services
6
Test Set Accuracy
Sejong(colloquial style) Corpus(1M) 97.1%
OOOO(literary style) Corpus(3M) 94.3%

System Concept
• Corpus: conversation-style and non-normalized utterances
 Semi-automatically collected 2M lines of drama script
• Manual collection of freely available drama scripts
• Depunctuation via text processing toolkits
– Directions for action were also removed
– Errata were not corrected (for noisy user inputs)
– Dialects, non-standard language, and transliterations were also preserved
• Spacing was checked, but not strictly correct
– Spacing for spoken/colloquial language can be quite generous, unlike for literal
language; for example: 나너본지두달다돼감
» 나 너 본 지 두 달 다 돼 감 (strictly correct)
» 나 너 본지 두 달 다 돼감 (o.k.)
» 나 너 본지 두달 다돼감 (o.k.)
» 나 너 본지두 달다 돼감 (awkward)
» 나너 본 지 두달다 돼감 (awkward)
» ...
7

System Concept
• Deep learning-based automatic spacer
 Utilizes CNN and BiLSTM as in PyKoSpacing, but
• Dense character vectors are used
– The character vectors are trained with the corpus via fastText (Bojanowski et al., 2016)
» Skip-gram, subword n-gram
» Suitable for noisy input (especially in case the corpus contains eratta)
• The architecture is parallel rather than hierarchical
– CNN conveys distributive information / BiLSTM conveys sequential information
8

System Concept
• Web interface for user-generated text
 Most web-based services that deal with text data have a server which
restores the query and performs uploading
• Proposal of application: How about giving user an option?
– If the user wants direct upload, then provide it in the server level (may be
appropriate for the digitally marginalized people) – Direct spacing
– If the user wants only a recommendation (may be appropriate for the ones who are
familiar with the devices but feel annoyance), then the user correction can be taken
into account for the better system – Recommendation
9

Experiment
• Training specification
 Input: Character vector (fastText) [100]
 CNN: Single channel, 2 convolutional layers, 1 max-pooling layer
• Input dimension: (100, 100)
• Conv-Window (3,100) Max pooling-window (2,1)
• Output by flattening
 BiLSTM: Sequence of hidden layers (100, 64)
• Output by concatenating hidden layers
 MLP: 2 hidden layers with dimension [64]
 Dropout: 0.3 (for MLP)
 Activation: ReLU (for conv layers and MLPs)
 Optimizer: Adam (0.0005)
 Loss function: Mean-squared error (MSE)
 Output: Binary prediction vector [100] activated with ReLU
• spacing was given if >0.5
10

Experiment
• Limitation in evaluation
 Issues in correctness
• Strictly correct spacing is not necessarily required in conversation
– Although the spacing is not strictly correct, the readability can be improved
» e.g. ‘나 너 본 지 두 달 다 돼 감’ (strictly correct) vs ‘나 너 본지 두달 다 돼감’
• Multiple answers are available
 Issues in naturalness
• Feeling naturalness to an output may differ from person to person
– Objective measure is not appropriate
-> Qualitatively, comparison with other tools/algorithms is available
-> How about quantitative one? (as future work; e.g., modified MOS test)
11

Experiment
• Qualitative analysis
12
입력 문장 부산대 Naver Proposed
1. 그것도안알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐
2. 왜말을못해왜말을못하냐구 왜 말을 못해 왜 말을 못하냐고 왜 말을 못 해 왜 말을 못 하냐고 왜 말을 못해 왜 말을 못하냐구
3. 아버지친구분당선되셨더라 아버지 친구분 당선되셨더라 아버지 친구 분당서 되셨더라 아버지 친구분 당선 되셨더라
4. 엄마가죽을병에넣어뒀어 엄마가 죽을병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어
5. 나얼만큼사랑해 [대치어 없음] 나올 만큼 사랑해 나 얼만큼 사랑해
6. 상하이스파이스치킨콤보부탁해 상하이 스파이스치킨 캄보 부탁해 상하이 스파이스 치킨 콤보 부탁해 상하이 스파이스 치킨콤 보부탁해
7. 인싸연구자가되는방법은 인사연구자가 되는 방법은 인사 연구자가 되는 방법은 인싸 연구자가 되는 방법은
8. 네이버와부산대의맞춤법검사 네 이번과 부산대의 맞춤법검사 네이버와 부산대의 맞춤법검사 네 이버와 부산대의 맞춤 법 검사
9. 그는눈을질끈감았다 그는 눈을 질끈 감았다 그는 눈을 질끈 감았다 그는 눈을 질끈감았다
10. 이함무봐라 [대치어 없음] 이함 무 봐라 이 함 무봐라
11. 정신똑띠차리라 정신 똑똑히 차리라 정신똑띠차리라 정신 똑띠 차리라
12. 뭣이중헌지도모름서 뭣이 중한지도 모르면서 뭣이중헌지도모름서 뭣이 중헌지도 모름서

Experiment
• Sample application
13

Conclusion & Future Work
• Deep learning-based automatic spacer is implemented based on
the corpus with conversation-style and non-canonical utterances
• The system utilizes dense character vectors from fastText and
parallel placement of CNN/BiLSTM
• Web-based service for digitally marginalized people is suggested
• Subjective measure and test utterance set for quantitative analysis
is under construction
• Demonstration
 https://github.com/warnikchow/ttuyssubot
 https://www.youtube.com/watch?v=mcPZVpKCH94&feature=youtu.be
14

Warnik Chow - 2018 HCLT

Recommended

Recommended

More Related Content

Similar to Warnik Chow - 2018 HCLT

Similar to Warnik Chow - 2018 HCLT (20)

More from WarNik Chow

More from WarNik Chow (20)

Recently uploaded

Recently uploaded (20)

Warnik Chow - 2018 HCLT

Editor's Notes