Human Interface Laboratory
Investigating an Effective Character-level
Embedding in Korean Sentence Classification
2019. 9. 13 @PACLIC 33
Won Ik Cho, Seok Min Kim, Nam Soo Kim
Contents
• Why character-level embedding?
 Case of Korean writing system
 Previous approaches
• Task description
• Experiment
 Feature engineering
 Implementation
• Result & Analysis
• Done and afterward
1
Why character-level embedding?
• Word-level vs. subword-level vs. (sub-)character-level
 In English (using alphabet)
• hello (word level)
• hel ##lo (subword level ~ word piece)
• h e l l o (character level)
 In Korean (using Hangul)
• 반갑다 (pan-kap-ta, word level)
• 반가- / -ㅂ- / -다 (morpheme level)
• 반갑 다 (subword level ~ word piece)
• 반 갑 다 (character level ~ word piece)
• ㅂㅏㄴㄱㅏㅂㄷㅏ# (Jamo level)
2
* Jamo: letters of Korean Hangul, where Ja denotes consonants and Mo denotes vowels
Why character-level embedding?
• On Korean (sub-)character-level (Jamo) system
3
[Figure: decomposition of the character 반 (pan)]
First sound (cho-seng): ㅂ
Second sound (cung-seng): ㅏ
Third sound (cong-seng): ㄴ
Structure: {Syllable: CV(C)}
# First sounds (C): 19
# Second sounds (V): 21
# Third sounds (C): 27 + ‘ ‘ (none)
Total: 19 * 21 * 28 = 11,172 characters! (see the decomposition sketch below)
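For concreteness, below is a minimal sketch (not from the paper) of how a precomposed Hangul syllable can be split into its first/second/third sounds via the Unicode code-point layout, which directly encodes the 19 × 21 × 28 structure above.

```python
# Minimal sketch: decompose a precomposed Hangul syllable via the Unicode layout
#   code point = 0xAC00 + (first * 21 + second) * 28 + third
N_FIRST, N_SECOND, N_THIRD = 19, 21, 28   # 19 * 21 * 28 = 11,172 syllables

def decompose(syllable: str):
    """Return (first, second, third) indices; third == 0 means no final sound."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < N_FIRST * N_SECOND * N_THIRD:
        raise ValueError("not a precomposed Hangul syllable")
    first, rest = divmod(code, N_SECOND * N_THIRD)
    second, third = divmod(rest, N_THIRD)
    return first, second, third

print(decompose("반"))   # indices of ㅂ (first), ㅏ (second), ㄴ (third)
```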
Why character-level embedding?
• Research questions
 What (sub-)character-level embedding styles can be considered
in Korean NLP?
 Which kind of embedding best fits the sentence classification
tasks (binary, multi-class)?
 How and why does the performance differ between the
embeddings?
4
Why character-level embedding?
• Previous approaches
 Zhang et al. (2017)
• Which encoding is the best for text classification in C/E/J/K?
5
[Figure: test errors reported by Zhang et al.]
Mistake in the article!
Character-level features far outperform the word-level features ...
Fortunately, this does not affect the goal of our study.
We use no morphological analysis, only decomposition of the blocks!
Why character-level embedding?
• Previous approaches
 Shin et al. (2017): Jamo (sub-character)-level padding
 Cho et al. (2018c): Jamo-level + solely used Jamos
 Cho et al. (2018a)-Sparse: About 2.5K frequently used
characters in a conversation-style corpus
 Cho et al. (2018a)-Dense: Utilizing subword representations
in dense embedding (fastText) ← Pretrained!
 Song et al. (2018): Multi-hot encoding
6
Why character-level embedding?
• On Korean (sub-)character-level embedding
 Sparse embedding
• One-hot or multi-hot representation (see the sketch after this slide)
• Narrow but long sequence for Jamo-level features
• Wide but shorter sequence for character-level features
 Dense embedding
• word2vec-style representation utilizing skip-gram
• Narrow and short sequence, especially for character-level features
– The number of Jamos is too small to train them as ‘words’
– But for characters, around 2,500 tokens appear in real-life usage
» A kind of subword/word piece!
7
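To make the sparse options above concrete, here is a small illustration (not the authors' exact featurizers) of a Jamo-level one-hot sequence and a Song et al. (2018)-style multi-hot character vector over a 67-slot layout (19 + 21 + 27, with the empty third sound left as zero padding); the Unicode-based `decompose` helper repeats the sketch shown earlier.

```python
import numpy as np

D = 19 + 21 + 27   # 67 slots: first, second, third sounds (empty third sound -> no bit set)

def decompose(s):
    """Unicode-based split into (first, second, third) indices, as sketched earlier."""
    code = ord(s) - 0xAC00
    return code // (21 * 28), (code // 28) % 21, code % 28

def char_multihot(s):
    """Multi-hot (2- or 3-hot) character vector in the style of Song et al. (2018)."""
    first, second, third = decompose(s)
    v = np.zeros(D, dtype=np.float32)
    v[first] = 1.0                        # first sounds: slots 0..18
    v[19 + second] = 1.0                  # second sounds: slots 19..39
    if third:                             # third sounds: slots 40..66 (0 means none)
        v[19 + 21 + third - 1] = 1.0
    return v

def jamo_onehot_seq(word):
    """Jamo-level one-hot sequence: a narrow alphabet, but a roughly 3x longer sequence."""
    rows = []
    for s in word:
        for slot in np.flatnonzero(char_multihot(s)):
            row = np.zeros(D, dtype=np.float32)
            row[slot] = 1.0
            rows.append(row)
    return np.stack(rows)

print(int(char_multihot("반").sum()))     # 3 for a CVC character, 2 for CV
print(jamo_onehot_seq("반갑다").shape)     # (8, 67): 3 + 3 + 2 Jamos
```

Note that the actual widths in the works above differ (67 for Shin et al. 2017, 118 for Cho et al. 2018c with solely used Jamos, 2,534 for the sparse character vocabulary); the 67-slot layout here is only for illustration.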
Task description
• Two classification tasks
 Sentiment analysis (NSMC): binary
• aims to cover lexical semantics
 Intention identification (3i4K): 7-class
• aims to cover syntax-semantics
 Why classification?
• Easy/fast to train and featurize
• Results are clearly analyzable (straightforward metrics such as F1,
accuracy)
• Featurization methodology can be extended to other tasks
– Role labeling, entity recognition, translation, generation, etc.
8
Task description
• Sentiment analysis
 Naver sentiment movie corpus (NSMC)
• Widely used benchmark for evaluation of Korean LMs
• Annotation follows Maas et al. (2011)
• Positive label for reviews with score > 8 and negative for < 5
– Neutral reviews are removed; thus BINARY classification
• Contains various non-Jamo symbols
– e.g., ^^, @@, ...
• 150K/50K reviews for training/test, respectively (loader sketch below)
• https://github.com/e9t/nsmc
9
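A minimal loader sketch, assuming the tab-separated `ratings_train.txt` / `ratings_test.txt` layout (columns `id`, `document`, `label`) of the NSMC repository linked above:

```python
import csv

def load_nsmc(path):
    """Load NSMC reviews and binary labels (1 = positive, 0 = negative)."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["document"]:            # skip the few empty reviews
                texts.append(row["document"])
                labels.append(int(row["label"]))
    return texts, labels

train_x, train_y = load_nsmc("ratings_train.txt")   # ~150K reviews
test_x, test_y = load_nsmc("ratings_test.txt")      # ~50K reviews
```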
Task description
• Intention identification
 Intonation-aided intention identification for Korean (3i4K)
• Recently distributed open data for speech act classification (Cho
et al., 2018b)
• Total seven categories
– Fragments, statement, question, command, rhetorical question
(RQ), rhetorical command (RC), intonation-dependent utterances
• Contains only full Hangul characters (no solely used
sub-characters or non-letter symbols)
• https://github.com/warnikchow/3i4k
10
Experiment
• Feature engineering
 Sequence length
• NSMC: 420 (for Jamo-level) and 140 (for character-level)
• 3i4K: 240 (for Jamo-level) and 80 (for character-level)
 Sequence width
• Shin et al. (2017): 67 = 19 + 21 + 27 (‘ ‘ zero-padded)
• Cho et al. (2018c): 118 = Shin et al. (2017) + 51 (solely-used
Jamos)
– e.g., ㅜ, ㅠ, ㅡ, ㅋ, ...
• Cho et al. (2018a) – Sparse: 2,534
• Cho et al. (2018a) – Dense: 100
– length-1 subwords only!
• Song et al. (2018): 67 (specifically, 2- or 3-hot; featurization sketch below)
11
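As a sketch of how these settings turn a sentence into a fixed-size input array (the padding/truncation policy and the Hangul-only filtering here are assumptions, not the paper's exact recipe):

```python
import numpy as np

def featurize(sentence, encode_char, max_len=140, width=67):
    """Stack per-character vectors into a fixed (max_len, width) array:
    truncate long sentences, zero-pad short ones, and (as a simplifying
    assumption) keep only precomposed Hangul characters."""
    chars = [c for c in sentence if 0xAC00 <= ord(c) < 0xAC00 + 11172]
    X = np.zeros((max_len, width), dtype=np.float32)
    for i, c in enumerate(chars[:max_len]):
        X[i] = encode_char(c)     # e.g., the multi-hot encoder sketched earlier
    return X

# Dummy encoder only to demonstrate the shape of the NSMC character-level setting
dummy = featurize("반갑다", lambda c: np.ones(67, dtype=np.float32))
print(dummy.shape)   # (140, 67)
```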
Experiment
• Implementation
 Bidirectional long short-term memory (BiLSTM; Schuster and
Paliwal, 1997)
• Representative recurrent neural network (RNN) model
• Strong at representing sequential information
 Self-attentive embedding (SA; Lin et al., 2017)
• Different from self-attention, but frequently utilized in sentence
embedding
• Utilizes a context vector to build an attention weight layer
12
• BiLSTM Architecture
 Input dimension: (L, D)
 RNN hidden layer width: 64 (= 32 × 2, bidirectional)
 The width of FCN connected to the last hidden layer: 128 (Activation: ReLU)
 Output layer width: N (Activation: softmax)
• SA Architecture (Figure)
 Input dimension identical
 Context vector width: 64
(Activation: ReLU, Dropout 0.3)
 Additional MLPs and dropouts
after a weighted sum (Keras sketch below)
13
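A rough Keras sketch of the two classifiers described above; the layer sizes follow the slide, while the single attention hop and the remaining details are assumptions rather than the original code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def bilstm_classifier(L, D, n_classes):
    """BiLSTM baseline: 32x2 hidden units, a 128-wide ReLU layer, softmax output."""
    inp = layers.Input(shape=(L, D))
    h = layers.Bidirectional(layers.LSTM(32))(inp)            # 64-dim sequence summary
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

def self_attentive_classifier(L, D, n_classes):
    """Self-attentive (Lin et al., 2017-style) variant with a single attention hop."""
    inp = layers.Input(shape=(L, D))
    H = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(inp)  # (L, 64)
    a = layers.Dense(64, activation="relu")(H)                 # context-vector layer
    a = layers.Dropout(0.3)(a)
    a = layers.Dense(1)(a)                                     # unnormalized scores, (L, 1)
    a = layers.Softmax(axis=1)(a)                              # attention weights over time
    s = layers.Dot(axes=1)([a, H])                             # weighted sum, (1, 64)
    s = layers.Flatten()(s)
    s = layers.Dropout(0.3)(s)
    h = layers.Dense(128, activation="relu")(s)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

model = bilstm_classifier(L=140, D=67, n_classes=2)            # NSMC character-level setting
model.compile(optimizer=tf.keras.optimizers.Adam(5e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

For NSMC, n_classes = 2 with one-hot labels to match the categorical loss; for 3i4K, n_classes = 7.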
Experiment
• Implementation
 Python 3, Keras, Hangul toolkit, fastText
• Keras (Chollet et al., 2015) for NN training
– TensorFlow backend; very concise implementation
• Hangul toolkit for decomposing the characters
– Decomposes characters into a sub-character sequence (length × 3)
• fastText (Bojanowski et al., 2016) for dense character-level
embeddings
– Dense character vectors obtained from a drama script corpus (2M lines)
– Appropriate for colloquial expressions (training sketch below)
 Optimizer: Adam 5e-4
 Loss function: Categorical cross-entropy
 Batch size: 64 for NSMC, 16 for 3i4K
 Device: Nvidia Tesla M40 24GB
14
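A sketch of how such 100-dimensional dense character vectors might be trained with the fastText Python bindings; the corpus file name, its character-level tokenization, and the minn/maxn choice are assumptions made here, not details taken from the paper.

```python
import fasttext

# "drama_script.txt" is a hypothetical file: one sentence per line, with
# characters separated by spaces so that each Hangul character is a token.
model = fasttext.train_unsupervised(
    "drama_script.txt",
    model="skipgram",   # word2vec-style skip-gram objective
    dim=100,            # 100-dimensional dense vectors, as on the previous slide
    minn=1, maxn=1,     # restrict character n-grams to length 1
)

vec = model.get_word_vector("반")   # dense vector for a single character
print(vec.shape)                     # (100,)
```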
Result & Analysis
• Result
 Only accuracy is reported for NSMC (positive/negative ratio 5:5)
• Why lower than the results reported in the literature (about 0.88)?
– Because no non-letter tokens were utilized ...
» and the data are very sensitive to non-letter expressions (emojis
and solely used sub-characters, e.g., ㅠㅠ, ㅋㅋ)
 F1 score: harmonic mean of precision and recall
15
Result & Analysis
• Analysis 1: Performance comparison
 Dense character-level embeddings outperform sparse ones
• Injection of distributional information (word2vec-like) into the tokens
• Some characters function as a case marker or a short ‘word’
 One-hot/multi-hot
character-level?
• No additional information is injected
• Could be powerful with a bigger and more balanced dataset
 Low performance of Jamo-level features in NSMC?
• Decomposition may be meaningful for syntax-semantic tasks (rather than
lexical semantics)?
 Using self-attention greatly improves Jamo-level embeddings
16
Result & Analysis
• Analysis 2: Using self-attentive embedding
 The effect is most pronounced for Jamo-level features
 Least pronounced for one-hot character-level encoding
 Why?
• Decomposability of the blocks
– How the sub-character information is projected onto the embedding
• e.g., 이상한 (i-sang-han, strange)
– In morphemes: 이상하 (i-sang-ha, the root) + -ㄴ (-n, a particle)
– The presence and role of the morphemes can be pointed out
– This is not guaranteed in block-preserving networks
– Strengthens syntax-semantic analysis?
17
Result & Analysis
• Analysis 3: Decomposability vs. Local semantics
 Disadvantage of character-level embeddings:
• Characters cannot be decomposed,
even in the multi-hot case
 Then, where does the
outperformance come from?
• It seems to originate from
the preservation of the letter clusters
– which stably indicate where token separation takes place, e.g., for 반갑다 (pan-
kap-ta, hello):
» (Jamo-level) ㅂ/ㅏ/ㄴ/ㄱ/ㅏ/ㅂ/ㄷ/ㅏ/<empty>
» (character-level) <ㅂㅏㄴ><ㄱㅏㅂ><ㄷㅏ>
– The tendency may differ if 1) a sub-character-level word piece model (byte pair
encoding) is implemented or 2) sub-character properties (first & third sound) are
additionally attached to the tokens
18
Result & Analysis
• Analysis 4: Computation efficiency
 Computation for NSMC models
• Jamo-level
– Moderate parameter
size, but slow in
training
• Dense/Multi-hot
– Smaller parameter size (in case of SA)
– Faster training time
– Equal to or better performance
19
Discussion
• Primary goal of the paper:
 To search for a Jamo/character-level embedding that best fits
with the given Korean NLP task
• The utility of the comparison result
 Can it also be applied to Japanese/Chinese NLP?
• Japanese: morae (e.g., the small tsu) roughly match the
third sound of Korean
• Chinese/Japanese: Hanzi or Kanji can be further decomposed
into smaller glyph components (Nguyen et al., 2017)
– e.g., 鯨 "whale" into 魚 "fish" and 京 "capital city"
• Many South/Southeast Asian languages
– Composition of consonants and vowels
– Maybe decomposing into these properties is better than ..?
20
Done & Afterward
• Reviewed five (sub-)character-level embeddings for a
character-rich and agglutinative language (Korean)
 Dense and multi-hot character-level representations perform best
• for the dense one, distributional information may be what matters
 Multi-hot has the potential to be utilized beyond the given tasks
• conciseness & computational efficiency
 Sub-character-level features are useful in tasks that require
morphological decomposition
• they have the potential to be improved via a word piece approach or
information attachment
 The overall tendency is useful for the text processing of other character-
rich languages with conjunct forms in their writing systems
21
Reference (order of appearance)
• Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.
In Advances in neural information processing systems, pages 649–657.
• Haebin Shin, Min-Gwan Seo, and Hyeongjin Byeon. 2017. Korean alphabet level convolution neural network for
text classification. In Proceedings of Korea Computer Congress 2017 [in Korean], pages 587–589.
• Yong Woo Cho, Gyu Su Han, and Hyuk Jun Lee. 2018c. Character level bi-directional lstm-cnn model for movie
rating prediction. In Proceedings of Korea Computer Congress 2018 [in Korean], pages 1009–1011.
• Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, and Nam Soo Kim. 2018a. Real-time automatic
word segmentation for user-generated text. arXiv preprint arXiv:1810.13113.
• Won Ik Cho, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim, and Nam Soo Kim. 2018b. Speech intention
understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing, 45(11):2673–2681.
• Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio.
2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
• Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606.
• Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
• Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character neural language modelling in Japanese.
In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153.
22
Thank you!
  • #3 overview: gender bias in NLP – various problems translation: real-world problem - example e.g. Turkish, Korean..? How is it treated in previous works? Why should it be guaranteed? problem statement: with KR-EN example why not investigated in previous works? why appropriate for investigating gender bias? what examples are observed? construction: what are to be considered? formality (걔 vs 그 사람) politeness (-어 vs –어요) lexicon sentiment polarity (positive & negative & occupation) + things to be considered in... (not to threaten the fairness) - Measure? how the measure is defined, and proved to be bounded (and have optimum when the condition fits with the ideal case) concept of Vbias and Sbias – how they are aggregated into the measure << disadvantage? how the usage is justified despite disadvantages the strong points? - Experiment? how the EEC is used in evaluation, and how the arithmetic averaging is justified the result: GT > NP > KT? - Analysis? quantitative analysis – Vbias and Sbias, significant with style-related features qualitative analysis – observed with the case of occupation words Done: tgbi for KR-EN, with an EEC Afterward: how Sbias can be considered more explicitly? what if among context? how about with other target/source language?