조원익 띄쓰봇 2018_HCLT
Ttuyssubot: An automatic spacer for conversation-style and non-canonical, non-normalized Korean utterances
# Motivation
- How about automatically spacing the user-generated texts?
- For both a real-time chat and a post upload?
- Not mandatory; options can be provided
Useful for digitally marginalized people(and also for the users who find annoyance in spacing)
# Conclusion
- Deep learning-based automatic spacer is implemented based on the corpus with conversation-style and non-canonical utterances
- The system utilizes dense character vectors from fastText and parallel placement of CNN/BiLSTM
- Web-based service for digitally marginalized people is suggested
- Subjective measure and test utterance set for quantitative analysis is under construction
Demonstration
https://github.com/warnikchow/ttuyssubot
https://www.youtube.com/watch?v=mcPZVpKCH94&feature=youtu.be
2. Contents
• Introduction
Motivation & related works
• System Concept
Corpus: conversation-style and non-normalized utterances
Deep learning-based automatic spacer
Web interface for user-generated text
• Experiment
Training specification
Qualitative analysis
Sample application
• Conclusion & Future Work
1
5. Introduction
• How about automatically spacing the user-generated texts?
For both a real-time chat and a post upload?
Not mandatory; options can be provided
Useful for digitally marginalized people
(and also for the users who find annoyance in spacing)
• Limitations in related approaches
Related approaches include both web services and published algorithms
Usually targets “correction” of an input sentence
• e.g. PNU module or NAVER checker
or predicts the spacing based on preceding characters
• Probabilistic models based on character n-grams (Lee, 2007)
• Pegasos algorithm, Structural SVM (Lee, 2013/2014)
4
6. Introduction
• Recent approach based on deep learning architecture
PyKoSpacing (Jeon, 2018, github.com/haven-jeon/PyKoSpacing)
• Utilizes CNN, Bidirectional GRU, character n-gram on News articles (>100M)
5
7. Introduction
• Reported performance
• Deep learning-based approach can provide a spacing that is useful
as a preprocessing for Korean text ...
But how can it be applied to real-time tasks with noisy input?
• Our contribution
Corpus: conversation-style and non-normalized utterances
• which covers noisy user input
Deep learning-based automatic spacer
• which utilizes dense character vectors
Web interface for user-generated text
• which can be applied to various web-based services
6
Test Set Accuracy
Sejong(colloquial style) Corpus(1M) 97.1%
OOOO(literary style) Corpus(3M) 94.3%
8. System Concept
• Corpus: conversation-style and non-normalized utterances
Semi-automatically collected 2M lines of drama script
• Manual collection of freely available drama scripts
• Depunctuation via text processing toolkits
– Directions for action were also removed
– Errata were not corrected (for noisy user inputs)
– Dialects, non-standard language, and transliterations were also preserved
• Spacing was checked, but not strictly correct
– Spacing for spoken/colloquial language can be quite generous, unlike for literal
language; for example: 나너본지두달다돼감
» 나 너 본 지 두 달 다 돼 감 (strictly correct)
» 나 너 본지 두 달 다 돼감 (o.k.)
» 나 너 본지 두달 다돼감 (o.k.)
» 나 너 본지두 달다 돼감 (awkward)
» 나너 본 지 두달다 돼감 (awkward)
» ...
7
9. System Concept
• Deep learning-based automatic spacer
Utilizes CNN and BiLSTM as in PyKoSpacing, but
• Dense character vectors are used
– The character vectors are trained with the corpus via fastText (Bojanowski et al., 2016)
» Skip-gram, subword n-gram
» Suitable for noisy input (especially in case the corpus contains eratta)
• The architecture is parallel rather than hierarchical
– CNN conveys distributive information / BiLSTM conveys sequential information
8
10. System Concept
• Web interface for user-generated text
Most web-based services that deal with text data have a server which
restores the query and performs uploading
• Proposal of application: How about giving user an option?
– If the user wants direct upload, then provide it in the server level (may be
appropriate for the digitally marginalized people) – Direct spacing
– If the user wants only a recommendation (may be appropriate for the ones who are
familiar with the devices but feel annoyance), then the user correction can be taken
into account for the better system – Recommendation
9
11. Experiment
• Training specification
Input: Character vector (fastText) [100]
CNN: Single channel, 2 convolutional layers, 1 max-pooling layer
• Input dimension: (100, 100)
• Conv-Window (3,100) Max pooling-window (2,1)
• Output by flattening
BiLSTM: Sequence of hidden layers (100, 64)
• Output by concatenating hidden layers
MLP: 2 hidden layers with dimension [64]
Dropout: 0.3 (for MLP)
Activation: ReLU (for conv layers and MLPs)
Optimizer: Adam (0.0005)
Loss function: Mean-squared error (MSE)
Output: Binary prediction vector [100] activated with ReLU
• spacing was given if >0.5
10
12. Experiment
• Limitation in evaluation
Issues in correctness
• Strictly correct spacing is not necessarily required in conversation
– Although the spacing is not strictly correct, the readability can be improved
» e.g. ‘나 너 본 지 두 달 다 돼 감’ (strictly correct) vs ‘나 너 본지 두달 다 돼감’
• Multiple answers are available
Issues in naturalness
• Feeling naturalness to an output may differ from person to person
– Objective measure is not appropriate
-> Qualitatively, comparison with other tools/algorithms is available
-> How about quantitative one? (as future work; e.g., modified MOS test)
11
13. Experiment
• Qualitative analysis
12
입력 문장 부산대 Naver Proposed
1. 그것도안알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐
2. 왜말을못해왜말을못하냐구 왜 말을 못해 왜 말을 못하냐고 왜 말을 못 해 왜 말을 못 하냐고 왜 말을 못해 왜 말을 못하냐구
3. 아버지친구분당선되셨더라 아버지 친구분 당선되셨더라 아버지 친구 분당서 되셨더라 아버지 친구분 당선 되셨더라
4. 엄마가죽을병에넣어뒀어 엄마가 죽을병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어
5. 나얼만큼사랑해 [대치어 없음] 나올 만큼 사랑해 나 얼만큼 사랑해
6. 상하이스파이스치킨콤보부탁해 상하이 스파이스치킨 캄보 부탁해 상하이 스파이스 치킨 콤보 부탁해 상하이 스파이스 치킨콤 보부탁해
7. 인싸연구자가되는방법은 인사연구자가 되는 방법은 인사 연구자가 되는 방법은 인싸 연구자가 되는 방법은
8. 네이버와부산대의맞춤법검사 네 이번과 부산대의 맞춤법검사 네이버와 부산대의 맞춤법검사 네 이버와 부산대의 맞춤 법 검사
9. 그는눈을질끈감았다 그는 눈을 질끈 감았다 그는 눈을 질끈 감았다 그는 눈을 질끈감았다
10. 이함무봐라 [대치어 없음] 이함 무 봐라 이 함 무봐라
11. 정신똑띠차리라 정신 똑똑히 차리라 정신똑띠차리라 정신 똑띠 차리라
12. 뭣이중헌지도모름서 뭣이 중한지도 모르면서 뭣이중헌지도모름서 뭣이 중헌지도 모름서
15. Conclusion & Future Work
• Deep learning-based automatic spacer is implemented based on
the corpus with conversation-style and non-canonical utterances
• The system utilizes dense character vectors from fastText and
parallel placement of CNN/BiLSTM
• Web-based service for digitally marginalized people is suggested
• Subjective measure and test utterance set for quantitative analysis
is under construction
• Demonstration
https://github.com/warnikchow/ttuyssubot
https://www.youtube.com/watch?v=mcPZVpKCH94&feature=youtu.be
14