SlideShare a Scribd company logo
1 of 16
Human Interface Laboratory
Concept and Application of
Deep learning-based Automatic Spacing
2018. 10. 13
Won Ik Cho
Contents
• Introduction
 Motivation & related works
• System Concept
 Corpus: conversation-style and non-normalized utterances
 Deep learning-based automatic spacer
 Web interface for user-generated text
• Experiment
 Training specification
 Qualitative analysis
 Sample application
• Conclusion & Future Work
1
Introduction
• Ambiguity from being non-spaced:
Just a carelessness?
2
Photos from: https://brunch.co.kr/@bong/642
Introduction
• Motivation: For someone, spacing can be a task that requires effort
3
Image created by Visual Dive
Introduction
• How about automatically spacing the user-generated texts?
 For both a real-time chat and a post upload?
 Not mandatory; options can be provided
 Useful for digitally marginalized people
(and also for the users who find annoyance in spacing)
• Limitations in related approaches
 Related approaches include both web services and published algorithms
 Usually targets “correction” of an input sentence
• e.g. PNU module or NAVER checker
 or predicts the spacing based on preceding characters
• Probabilistic models based on character n-grams (Lee, 2007)
• Pegasos algorithm, Structural SVM (Lee, 2013/2014)
4
Introduction
• Recent approach based on deep learning architecture
 PyKoSpacing (Jeon, 2018, github.com/haven-jeon/PyKoSpacing)
• Utilizes CNN, Bidirectional GRU, character n-gram on News articles (>100M)
5
Introduction
• Reported performance
• Deep learning-based approach can provide a spacing that is useful
as a preprocessing for Korean text ...
 But how can it be applied to real-time tasks with noisy input?
• Our contribution
 Corpus: conversation-style and non-normalized utterances
• which covers noisy user input
 Deep learning-based automatic spacer
• which utilizes dense character vectors
 Web interface for user-generated text
• which can be applied to various web-based services
6
Test Set Accuracy
Sejong(colloquial style) Corpus(1M) 97.1%
OOOO(literary style) Corpus(3M) 94.3%
System Concept
• Corpus: conversation-style and non-normalized utterances
 Semi-automatically collected 2M lines of drama script
• Manual collection of freely available drama scripts
• Depunctuation via text processing toolkits
– Directions for action were also removed
– Errata were not corrected (for noisy user inputs)
– Dialects, non-standard language, and transliterations were also preserved
• Spacing was checked, but not strictly correct
– Spacing for spoken/colloquial language can be quite generous, unlike for literal
language; for example: 나너본지두달다돼감
» 나 너 본 지 두 달 다 돼 감 (strictly correct)
» 나 너 본지 두 달 다 돼감 (o.k.)
» 나 너 본지 두달 다돼감 (o.k.)
» 나 너 본지두 달다 돼감 (awkward)
» 나너 본 지 두달다 돼감 (awkward)
» ...
7
System Concept
• Deep learning-based automatic spacer
 Utilizes CNN and BiLSTM as in PyKoSpacing, but
• Dense character vectors are used
– The character vectors are trained with the corpus via fastText (Bojanowski et al., 2016)
» Skip-gram, subword n-gram
» Suitable for noisy input (especially in case the corpus contains eratta)
• The architecture is parallel rather than hierarchical
– CNN conveys distributive information / BiLSTM conveys sequential information
8
System Concept
• Web interface for user-generated text
 Most web-based services that deal with text data have a server which
restores the query and performs uploading
• Proposal of application: How about giving user an option?
– If the user wants direct upload, then provide it in the server level (may be
appropriate for the digitally marginalized people) – Direct spacing
– If the user wants only a recommendation (may be appropriate for the ones who are
familiar with the devices but feel annoyance), then the user correction can be taken
into account for the better system – Recommendation
9
Experiment
• Training specification
 Input: Character vector (fastText) [100]
 CNN: Single channel, 2 convolutional layers, 1 max-pooling layer
• Input dimension: (100, 100)
• Conv-Window (3,100) Max pooling-window (2,1)
• Output by flattening
 BiLSTM: Sequence of hidden layers (100, 64)
• Output by concatenating hidden layers
 MLP: 2 hidden layers with dimension [64]
 Dropout: 0.3 (for MLP)
 Activation: ReLU (for conv layers and MLPs)
 Optimizer: Adam (0.0005)
 Loss function: Mean-squared error (MSE)
 Output: Binary prediction vector [100] activated with ReLU
• spacing was given if >0.5
10
Experiment
• Limitation in evaluation
 Issues in correctness
• Strictly correct spacing is not necessarily required in conversation
– Although the spacing is not strictly correct, the readability can be improved
» e.g. ‘나 너 본 지 두 달 다 돼 감’ (strictly correct) vs ‘나 너 본지 두달 다 돼감’
• Multiple answers are available
 Issues in naturalness
• Feeling naturalness to an output may differ from person to person
– Objective measure is not appropriate
-> Qualitatively, comparison with other tools/algorithms is available
-> How about quantitative one? (as future work; e.g., modified MOS test)
11
Experiment
• Qualitative analysis
12
입력 문장 부산대 Naver Proposed
1. 그것도안알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐
2. 왜말을못해왜말을못하냐구 왜 말을 못해 왜 말을 못하냐고 왜 말을 못 해 왜 말을 못 하냐고 왜 말을 못해 왜 말을 못하냐구
3. 아버지친구분당선되셨더라 아버지 친구분 당선되셨더라 아버지 친구 분당서 되셨더라 아버지 친구분 당선 되셨더라
4. 엄마가죽을병에넣어뒀어 엄마가 죽을병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어
5. 나얼만큼사랑해 [대치어 없음] 나올 만큼 사랑해 나 얼만큼 사랑해
6. 상하이스파이스치킨콤보부탁해 상하이 스파이스치킨 캄보 부탁해 상하이 스파이스 치킨 콤보 부탁해 상하이 스파이스 치킨콤 보부탁해
7. 인싸연구자가되는방법은 인사연구자가 되는 방법은 인사 연구자가 되는 방법은 인싸 연구자가 되는 방법은
8. 네이버와부산대의맞춤법검사 네 이번과 부산대의 맞춤법검사 네이버와 부산대의 맞춤법검사 네 이버와 부산대의 맞춤 법 검사
9. 그는눈을질끈감았다 그는 눈을 질끈 감았다 그는 눈을 질끈 감았다 그는 눈을 질끈감았다
10. 이함무봐라 [대치어 없음] 이함 무 봐라 이 함 무봐라
11. 정신똑띠차리라 정신 똑똑히 차리라 정신똑띠차리라 정신 똑띠 차리라
12. 뭣이중헌지도모름서 뭣이 중한지도 모르면서 뭣이중헌지도모름서 뭣이 중헌지도 모름서
Experiment
• Sample application
13
Conclusion & Future Work
• Deep learning-based automatic spacer is implemented based on
the corpus with conversation-style and non-canonical utterances
• The system utilizes dense character vectors from fastText and
parallel placement of CNN/BiLSTM
• Web-based service for digitally marginalized people is suggested
• Subjective measure and test utterance set for quantitative analysis
is under construction
• Demonstration
 https://github.com/warnikchow/ttuyssubot
 https://www.youtube.com/watch?v=mcPZVpKCH94&feature=youtu.be
14
Thank you
15

More Related Content

Similar to Warnik Chow - 2018 HCLT

Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Saurabh Kaushik
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?NAVER Engineering
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Ana Marasović
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Sequence to sequence model speech recognition
Sequence to sequence model speech recognitionSequence to sequence model speech recognition
Sequence to sequence model speech recognitionAditya Kumar Khare
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problemJaeHo Jang
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsSanghamitra Deb
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Zachary S. Brown
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Personalized spell checking using neural networks
Personalized spell checking using neural networksPersonalized spell checking using neural networks
Personalized spell checking using neural networksAmir Shokri
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERTAbdurrahimDerric
 

Similar to Warnik Chow - 2018 HCLT (20)

2106 ACM DIS
2106 ACM DIS2106 ACM DIS
2106 ACM DIS
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?Deep Learning, Where Are You Going?
Deep Learning, Where Are You Going?
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...Colloquium talk on modal sense classification using a convolutional neural ne...
Colloquium talk on modal sense classification using a convolutional neural ne...
 
NLP Bootcamp
NLP BootcampNLP Bootcamp
NLP Bootcamp
 
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Sequence to sequence model speech recognition
Sequence to sequence model speech recognitionSequence to sequence model speech recognition
Sequence to sequence model speech recognition
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problem
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
NLP and Deep Learning for non_experts
NLP and Deep Learning for non_expertsNLP and Deep Learning for non_experts
NLP and Deep Learning for non_experts
 
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
Deep Learning and Modern Natural Language Processing (AnacondaCon2019)
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
Personalized spell checking using neural networks
Personalized spell checking using neural networksPersonalized spell checking using neural networks
Personalized spell checking using neural networks
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Turkish language modeling using BERT
Turkish language modeling using BERTTurkish language modeling using BERT
Turkish language modeling using BERT
 

More from WarNik Chow

2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inpersonWarNik Chow
 
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech datasetWarNik Chow
 
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2eWarNik Chow
 
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminarWarNik Chow
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categoriesWarNik Chow
 
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate SpeechWarNik Chow
 
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLPWarNik Chow
 

More from WarNik Chow (20)

2312 PACLIC
2312 PACLIC2312 PACLIC
2312 PACLIC
 
2311 EAAMO
2311 EAAMO2311 EAAMO
2311 EAAMO
 
2211 HCOMP
2211 HCOMP2211 HCOMP
2211 HCOMP
 
2211 APSIPA
2211 APSIPA2211 APSIPA
2211 APSIPA
 
2211 AACL
2211 AACL2211 AACL
2211 AACL
 
2210 CODI
2210 CODI2210 CODI
2210 CODI
 
2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inperson
 
2206 Modupop!
2206 Modupop!2206 Modupop!
2206 Modupop!
 
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset
 
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e
 
2106 PRSLLS
2106 PRSLLS2106 PRSLLS
2106 PRSLLS
 
2106 JWLLP
2106 JWLLP2106 JWLLP
2106 JWLLP
 
2104 Talk @SSU
2104 Talk @SSU2104 Talk @SSU
2104 Talk @SSU
 
2103 ACM FAccT
2103 ACM FAccT2103 ACM FAccT
2103 ACM FAccT
 
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminar
 
2011 NLP-OSS
2011 NLP-OSS2011 NLP-OSS
2011 NLP-OSS
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
 
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate Speech
 
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP
 

Recently uploaded

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Recently uploaded (20)

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 

Warnik Chow - 2018 HCLT

  • 1. Human Interface Laboratory Concept and Application of Deep learning-based Automatic Spacing 2018. 10. 13 Won Ik Cho
  • 2. Contents • Introduction  Motivation & related works • System Concept  Corpus: conversation-style and non-normalized utterances  Deep learning-based automatic spacer  Web interface for user-generated text • Experiment  Training specification  Qualitative analysis  Sample application • Conclusion & Future Work 1
  • 3. Introduction • Ambiguity from being non-spaced: Just a carelessness? 2 Photos from: https://brunch.co.kr/@bong/642
  • 4. Introduction • Motivation: For someone, spacing can be a task that requires effort 3 Image created by Visual Dive
  • 5. Introduction • How about automatically spacing the user-generated texts?  For both a real-time chat and a post upload?  Not mandatory; options can be provided  Useful for digitally marginalized people (and also for the users who find annoyance in spacing) • Limitations in related approaches  Related approaches include both web services and published algorithms  Usually targets “correction” of an input sentence • e.g. PNU module or NAVER checker  or predicts the spacing based on preceding characters • Probabilistic models based on character n-grams (Lee, 2007) • Pegasos algorithm, Structural SVM (Lee, 2013/2014) 4
  • 6. Introduction • Recent approach based on deep learning architecture  PyKoSpacing (Jeon, 2018, github.com/haven-jeon/PyKoSpacing) • Utilizes CNN, Bidirectional GRU, character n-gram on News articles (>100M) 5
  • 7. Introduction • Reported performance • Deep learning-based approach can provide a spacing that is useful as a preprocessing for Korean text ...  But how can it be applied to real-time tasks with noisy input? • Our contribution  Corpus: conversation-style and non-normalized utterances • which covers noisy user input  Deep learning-based automatic spacer • which utilizes dense character vectors  Web interface for user-generated text • which can be applied to various web-based services 6 Test Set Accuracy Sejong(colloquial style) Corpus(1M) 97.1% OOOO(literary style) Corpus(3M) 94.3%
  • 8. System Concept • Corpus: conversation-style and non-normalized utterances  Semi-automatically collected 2M lines of drama script • Manual collection of freely available drama scripts • Depunctuation via text processing toolkits – Directions for action were also removed – Errata were not corrected (for noisy user inputs) – Dialects, non-standard language, and transliterations were also preserved • Spacing was checked, but not strictly correct – Spacing for spoken/colloquial language can be quite generous, unlike for literal language; for example: 나너본지두달다돼감 » 나 너 본 지 두 달 다 돼 감 (strictly correct) » 나 너 본지 두 달 다 돼감 (o.k.) » 나 너 본지 두달 다돼감 (o.k.) » 나 너 본지두 달다 돼감 (awkward) » 나너 본 지 두달다 돼감 (awkward) » ... 7
  • 9. System Concept • Deep learning-based automatic spacer  Utilizes CNN and BiLSTM as in PyKoSpacing, but • Dense character vectors are used – The character vectors are trained with the corpus via fastText (Bojanowski et al., 2016) » Skip-gram, subword n-gram » Suitable for noisy input (especially in case the corpus contains eratta) • The architecture is parallel rather than hierarchical – CNN conveys distributive information / BiLSTM conveys sequential information 8
  • 10. System Concept • Web interface for user-generated text  Most web-based services that deal with text data have a server which restores the query and performs uploading • Proposal of application: How about giving user an option? – If the user wants direct upload, then provide it in the server level (may be appropriate for the digitally marginalized people) – Direct spacing – If the user wants only a recommendation (may be appropriate for the ones who are familiar with the devices but feel annoyance), then the user correction can be taken into account for the better system – Recommendation 9
  • 11. Experiment • Training specification  Input: Character vector (fastText) [100]  CNN: Single channel, 2 convolutional layers, 1 max-pooling layer • Input dimension: (100, 100) • Conv-Window (3,100) Max pooling-window (2,1) • Output by flattening  BiLSTM: Sequence of hidden layers (100, 64) • Output by concatenating hidden layers  MLP: 2 hidden layers with dimension [64]  Dropout: 0.3 (for MLP)  Activation: ReLU (for conv layers and MLPs)  Optimizer: Adam (0.0005)  Loss function: Mean-squared error (MSE)  Output: Binary prediction vector [100] activated with ReLU • spacing was given if >0.5 10
  • 12. Experiment • Limitation in evaluation  Issues in correctness • Strictly correct spacing is not necessarily required in conversation – Although the spacing is not strictly correct, the readability can be improved » e.g. ‘나 너 본 지 두 달 다 돼 감’ (strictly correct) vs ‘나 너 본지 두달 다 돼감’ • Multiple answers are available  Issues in naturalness • Feeling naturalness to an output may differ from person to person – Objective measure is not appropriate -> Qualitatively, comparison with other tools/algorithms is available -> How about quantitative one? (as future work; e.g., modified MOS test) 11
  • 13. Experiment • Qualitative analysis 12 입력 문장 부산대 Naver Proposed 1. 그것도안알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐 그것도 안 알아봤냐 2. 왜말을못해왜말을못하냐구 왜 말을 못해 왜 말을 못하냐고 왜 말을 못 해 왜 말을 못 하냐고 왜 말을 못해 왜 말을 못하냐구 3. 아버지친구분당선되셨더라 아버지 친구분 당선되셨더라 아버지 친구 분당서 되셨더라 아버지 친구분 당선 되셨더라 4. 엄마가죽을병에넣어뒀어 엄마가 죽을병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어 엄마가 죽을 병에 넣어뒀어 5. 나얼만큼사랑해 [대치어 없음] 나올 만큼 사랑해 나 얼만큼 사랑해 6. 상하이스파이스치킨콤보부탁해 상하이 스파이스치킨 캄보 부탁해 상하이 스파이스 치킨 콤보 부탁해 상하이 스파이스 치킨콤 보부탁해 7. 인싸연구자가되는방법은 인사연구자가 되는 방법은 인사 연구자가 되는 방법은 인싸 연구자가 되는 방법은 8. 네이버와부산대의맞춤법검사 네 이번과 부산대의 맞춤법검사 네이버와 부산대의 맞춤법검사 네 이버와 부산대의 맞춤 법 검사 9. 그는눈을질끈감았다 그는 눈을 질끈 감았다 그는 눈을 질끈 감았다 그는 눈을 질끈감았다 10. 이함무봐라 [대치어 없음] 이함 무 봐라 이 함 무봐라 11. 정신똑띠차리라 정신 똑똑히 차리라 정신똑띠차리라 정신 똑띠 차리라 12. 뭣이중헌지도모름서 뭣이 중한지도 모르면서 뭣이중헌지도모름서 뭣이 중헌지도 모름서
  • 15. Conclusion & Future Work • Deep learning-based automatic spacer is implemented based on the corpus with conversation-style and non-canonical utterances • The system utilizes dense character vectors from fastText and parallel placement of CNN/BiLSTM • Web-based service for digitally marginalized people is suggested • Subjective measure and test utterance set for quantitative analysis is under construction • Demonstration  https://github.com/warnikchow/ttuyssubot  https://www.youtube.com/watch?v=mcPZVpKCH94&feature=youtu.be 14

Editor's Notes

  1. .
  2. 감사합니다.