Human Interface Laboratory
Investigating an Effective Character-level
Embedding in Korean Sentence Classification
2019. 9. 13 @PACLIC 33
Won Ik Cho, Seok Min Kim, Nam Soo Kim
Contents
• Why character-level embedding?
 Case of Korean writing system
 Previous approaches
• Task description
• Experiment
 Feature engineering
 Implementation
• Result & Analysis
• Done and afterward
1
Why character-level embedding?
• Word-level vs. subword-level vs. (sub-)character-level
 In English (using alphabet)
• hello (word level)
• hel ##lo (subword level ~ word piece)
• h e l l o (character level)
 In Korean (using Hangul)
• 반갑다 (pan-kap-ta, word level)
• 반가- / -ㅂ- / -다 (morpheme level)
• 반갑 다 (subword level ~ word piece)
• 반 갑 다 (character level ~ word piece)
• ㅂㅏㄴㄱㅏㅂㄷㅏ# (Jamo level)
2
* Jamo: letters of Korean Hangul, where Ja denotes consonants and Mo denotes vowels
Why character-level embedding?
• On Korean (sub-)character-level (Jamo) system
3
[Figure: decomposition of the character 반 (pan)]
First sound (cho-seng): ㅂ
Second sound (cung-seng): ㅏ
Third sound (cong-seng): ㄴ
Structure: {Syllable: CV(C)}
# First sounds (C): 19
# Second sounds (V): 21
# Third sounds (C): 27 + ‘ ‘ (none)
Total: 19 * 21 * 28 = 11,172 characters! (see the decomposition sketch below)
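For concreteness, below is a minimal sketch (not from the paper) of how a precomposed Hangul syllable can be split into its first/second/third sounds via the Unicode code-point layout, which directly encodes the 19 × 21 × 28 structure above.

```python
# Minimal sketch: decompose a precomposed Hangul syllable via the Unicode layout
#   code point = 0xAC00 + (first * 21 + second) * 28 + third
N_FIRST, N_SECOND, N_THIRD = 19, 21, 28   # 19 * 21 * 28 = 11,172 syllables

def decompose(syllable: str):
    """Return (first, second, third) indices; third == 0 means no final sound."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < N_FIRST * N_SECOND * N_THIRD:
        raise ValueError("not a precomposed Hangul syllable")
    first, rest = divmod(code, N_SECOND * N_THIRD)
    second, third = divmod(rest, N_THIRD)
    return first, second, third

print(decompose("반"))   # indices of ㅂ (first), ㅏ (second), ㄴ (third)
```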
Why character-level embedding?
• Research questions
 What (sub-)character-level embedding styles can be considered
in Korean NLP?
 Which kind of embedding best fits the sentence classification
tasks (binary, multi-class)?
 How and why does the performance differ between the
embeddings?
4
Why character-level embedding?
• Previous approaches
 Zhang et al. (2017)
• Which encoding is the best for text classification in C/E/J/K?
5
[Figure: test errors reported by Zhang et al.]
Mistake in the article!
Character-level features far outperform the word-level features ...
Fortunately, this does not affect the goal of our study.
We use no morphological analysis, only decomposition of the blocks!
Why character-level embedding?
• Previous approaches
 Shin et al. (2017): Jamo (sub-character)-level padding
 Cho et al. (2018c): Jamo-level + solely used Jamos
 Cho et al. (2018a)-Sparse: About 2.5K frequently used
characters in a conversation-style corpus
 Cho et al. (2018a)-Dense: Utilizing subword representations
in dense embedding (fastText) ← Pretrained!
 Song et al. (2018): Multi-hot encoding
6
Why character-level embedding?
• On Korean (sub-)character-level embedding
 Sparse embedding
• One-hot or multi-hot representation (see the sketch after this slide)
• Narrow but long sequence for Jamo-level features
• Wide but shorter sequence for character-level features
 Dense embedding
• word2vec-style representation utilizing skip-gram
• Narrow and short sequence, especially for character-level features
– The number of Jamos is too small to train them as ‘words’
– But for characters, around 2,500 tokens appear in real-life usage
» A kind of subword/word piece!
7
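To make the sparse options above concrete, here is a small illustration (not the authors' exact featurizers) of a Jamo-level one-hot sequence and a Song et al. (2018)-style multi-hot character vector over a 67-slot layout (19 + 21 + 27, with the empty third sound left as zero padding); the Unicode-based `decompose` helper repeats the sketch shown earlier.

```python
import numpy as np

D = 19 + 21 + 27   # 67 slots: first, second, third sounds (empty third sound -> no bit set)

def decompose(s):
    """Unicode-based split into (first, second, third) indices, as sketched earlier."""
    code = ord(s) - 0xAC00
    return code // (21 * 28), (code // 28) % 21, code % 28

def char_multihot(s):
    """Multi-hot (2- or 3-hot) character vector in the style of Song et al. (2018)."""
    first, second, third = decompose(s)
    v = np.zeros(D, dtype=np.float32)
    v[first] = 1.0                        # first sounds: slots 0..18
    v[19 + second] = 1.0                  # second sounds: slots 19..39
    if third:                             # third sounds: slots 40..66 (0 means none)
        v[19 + 21 + third - 1] = 1.0
    return v

def jamo_onehot_seq(word):
    """Jamo-level one-hot sequence: a narrow alphabet, but a roughly 3x longer sequence."""
    rows = []
    for s in word:
        for slot in np.flatnonzero(char_multihot(s)):
            row = np.zeros(D, dtype=np.float32)
            row[slot] = 1.0
            rows.append(row)
    return np.stack(rows)

print(int(char_multihot("반").sum()))     # 3 for a CVC character, 2 for CV
print(jamo_onehot_seq("반갑다").shape)     # (8, 67): 3 + 3 + 2 Jamos
```

Note that the actual widths in the works above differ (67 for Shin et al. 2017, 118 for Cho et al. 2018c with solely used Jamos, 2,534 for the sparse character vocabulary); the 67-slot layout here is only for illustration.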
Task description
• Two classification tasks
 Sentiment analysis (NSMC): binary
• aims to cover lexical semantics
 Intention identification (3i4K): 7-class
• aims to cover syntax-semantics
 Why classification?
• Easy/fast to train and featurize
• Results are clearly analyzable (straightforward metrics such as F1,
accuracy)
• Featurization methodology can be extended to other tasks
– Role labeling, entity recognition, translation, generation, etc.
8
Task description
• Sentiment analysis
 Naver sentiment movie corpus (NSMC)
• Widely used benchmark for evaluation of Korean LMs
• Annotation follows Maas et al. (2011)
• Positive label for reviews with score > 8 and negative for < 5
– Neutral reviews are removed; thus BINARY classification
• Contains various non-Jamo symbols
– e.g., ^^, @@, ...
• 150K/50K reviews for training/test, respectively (loader sketch below)
• https://github.com/e9t/nsmc
9
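A minimal loader sketch, assuming the tab-separated `ratings_train.txt` / `ratings_test.txt` layout (columns `id`, `document`, `label`) of the NSMC repository linked above:

```python
import csv

def load_nsmc(path):
    """Load NSMC reviews and binary labels (1 = positive, 0 = negative)."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["document"]:            # skip the few empty reviews
                texts.append(row["document"])
                labels.append(int(row["label"]))
    return texts, labels

train_x, train_y = load_nsmc("ratings_train.txt")   # ~150K reviews
test_x, test_y = load_nsmc("ratings_test.txt")      # ~50K reviews
```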
Task description
• Intention identification
 Intonation-aided intention identification for Korean (3i4K)
• Recently distributed open data for speech act classification (Cho
et al., 2018b)
• Total seven categories
– Fragments, statement, question, command, rhetorical question
(RQ), rhetorical command (RC), intonation-dependent utterances
• Contains only full Hangul characters (no solely used
sub-characters or non-letter symbols)
• https://github.com/warnikchow/3i4k
10
Experiment
• Feature engineering
 Sequence length
• NSMC: 420 (for Jamo-level) and 140 (for character-level)
• 3i4K: 240 (for Jamo-level) and 80 (for character-level)
 Sequence width
• Shin et al. (2017): 67 = 19 + 21 + 27 (‘ ‘ zero-padded)
• Cho et al. (2018c): 118 = Shin et al. (2017) + 51 (solely-used
Jamos)
– e.g., ㅜ, ㅠ, ㅡ, ㅋ, ...
• Cho et al. (2018a) – Sparse: 2,534
• Cho et al. (2018a) – Dense: 100
– length-1 subwords only!
• Song et al. (2018): 67 (specifically, 2- or 3-hot; featurization sketch below)
11
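As a sketch of how these settings turn a sentence into a fixed-size input array (the padding/truncation policy and the Hangul-only filtering here are assumptions, not the paper's exact recipe):

```python
import numpy as np

def featurize(sentence, encode_char, max_len=140, width=67):
    """Stack per-character vectors into a fixed (max_len, width) array:
    truncate long sentences, zero-pad short ones, and (as a simplifying
    assumption) keep only precomposed Hangul characters."""
    chars = [c for c in sentence if 0xAC00 <= ord(c) < 0xAC00 + 11172]
    X = np.zeros((max_len, width), dtype=np.float32)
    for i, c in enumerate(chars[:max_len]):
        X[i] = encode_char(c)     # e.g., the multi-hot encoder sketched earlier
    return X

# Dummy encoder only to demonstrate the shape of the NSMC character-level setting
dummy = featurize("반갑다", lambda c: np.ones(67, dtype=np.float32))
print(dummy.shape)   # (140, 67)
```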
Experiment
• Implementation
 Bidirectional long short-term memory (BiLSTM; Schuster and
Paliwal, 1997)
• Representative recurrent neural network (RNN) model
• Strong at representing sequential information
 Self-attentive embedding (SA; Lin et al., 2017)
• Different from self-attention, but frequently utilized in sentence
embedding
• Utilizes a context vector to build an attention weight layer
12
• BiLSTM Architecture
 Input dimension: (L, D)
 RNN hidden layer width: 64 (= 32 × 2, bidirectional)
 The width of FCN connected to the last hidden layer: 128 (Activation: ReLU)
 Output layer width: N (Activation: softmax)
• SA Architecture (Figure)
 Input dimension identical
 Context vector width: 64
(Activation: ReLU, Dropout 0.3)
 Additional MLPs and dropouts
after a weighted sum (Keras sketch below)
13
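A rough Keras sketch of the two classifiers described above; the layer sizes follow the slide, while the single attention hop and the remaining details are assumptions rather than the original code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def bilstm_classifier(L, D, n_classes):
    """BiLSTM baseline: 32x2 hidden units, a 128-wide ReLU layer, softmax output."""
    inp = layers.Input(shape=(L, D))
    h = layers.Bidirectional(layers.LSTM(32))(inp)            # 64-dim sequence summary
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

def self_attentive_classifier(L, D, n_classes):
    """Self-attentive (Lin et al., 2017-style) variant with a single attention hop."""
    inp = layers.Input(shape=(L, D))
    H = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(inp)  # (L, 64)
    a = layers.Dense(64, activation="relu")(H)                 # context-vector layer
    a = layers.Dropout(0.3)(a)
    a = layers.Dense(1)(a)                                     # unnormalized scores, (L, 1)
    a = layers.Softmax(axis=1)(a)                              # attention weights over time
    s = layers.Dot(axes=1)([a, H])                             # weighted sum, (1, 64)
    s = layers.Flatten()(s)
    s = layers.Dropout(0.3)(s)
    h = layers.Dense(128, activation="relu")(s)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

model = bilstm_classifier(L=140, D=67, n_classes=2)            # NSMC character-level setting
model.compile(optimizer=tf.keras.optimizers.Adam(5e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

For NSMC, n_classes = 2 with one-hot labels to match the categorical loss; for 3i4K, n_classes = 7.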
Experiment
• Implementation
 Python 3, Keras, Hangul toolkit, fastText
• Keras (Chollet et al., 2015) for NN training
– TensorFlow backend; very concise implementation
• Hangul toolkit for decomposing the characters
– Decomposes characters into a sub-character sequence (length × 3)
• fastText (Bojanowski et al., 2016) for dense character-level
embeddings
– Dense character vectors obtained from a drama script corpus (2M lines)
– Appropriate for colloquial expressions (training sketch below)
 Optimizer: Adam 5e-4
 Loss function: Categorical cross-entropy
 Batch size: 64 for NSMC, 16 for 3i4K
 Device: Nvidia Tesla M40 24GB
14
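A sketch of how such 100-dimensional dense character vectors might be trained with the fastText Python bindings; the corpus file name, its character-level tokenization, and the minn/maxn choice are assumptions made here, not details taken from the paper.

```python
import fasttext

# "drama_script.txt" is a hypothetical file: one sentence per line, with
# characters separated by spaces so that each Hangul character is a token.
model = fasttext.train_unsupervised(
    "drama_script.txt",
    model="skipgram",   # word2vec-style skip-gram objective
    dim=100,            # 100-dimensional dense vectors, as on the previous slide
    minn=1, maxn=1,     # restrict character n-grams to length 1
)

vec = model.get_word_vector("반")   # dense vector for a single character
print(vec.shape)                     # (100,)
```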
Result & Analysis
• Result
 Only accuracy is reported for NSMC (positive/negative ratio 5:5)
• Why lower than the results reported in the literature (about 0.88)?
– Because no non-letter tokens were utilized ...
» and the data are very sensitive to non-letter expressions (emojis
and solely used sub-characters, e.g., ㅠㅠ, ㅋㅋ)
 F1 score: harmonic mean of precision and recall
15
Result & Analysis
• Analysis 1: Performance comparison
 Dense character-level embeddings outperform sparse ones
• Injection of distributional information (word2vec-like) into the tokens
• Some characters function as a case marker or a short ‘word’
 One-hot/multi-hot
character-level?
• No additional information is injected
• Could be powerful with a bigger and more balanced dataset
 Low performance of Jamo-level features in NSMC?
• Decomposition may be meaningful for syntax-semantic tasks (rather than
lexical semantics)?
 Using self-attention greatly improves Jamo-level embeddings
16
Result & Analysis
• Analysis 2: Using self-attentive embedding
 The effect is most pronounced for Jamo-level features
 Least pronounced for one-hot character-level encoding
 Why?
• Decomposability of the blocks
– How the sub-character information is projected onto the embedding
• e.g., 이상한 (i-sang-han, strange)
– In morphemes: 이상하 (i-sang-ha, the root) + -ㄴ (-n, a particle)
– The presence and role of the morphemes can be pointed out
– This is not guaranteed in block-preserving networks
– Strengthens syntax-semantic analysis?
17
Result & Analysis
• Analysis 3: Decomposability vs. Local semantics
 Disadvantage of character-level embeddings:
• Characters cannot be decomposed,
even in the multi-hot case
 Then, where does the
outperformance come from?
• It seems to originate from
the preservation of the letter clusters
– which stably indicate where token separation takes place, e.g., for 반갑다 (pan-
kap-ta, hello):
» (Jamo-level) ㅂ/ㅏ/ㄴ/ㄱ/ㅏ/ㅂ/ㄷ/ㅏ/<empty>
» (character-level) <ㅂㅏㄴ><ㄱㅏㅂ><ㄷㅏ>
– The tendency may differ if 1) a sub-character-level word piece model (byte pair
encoding) is implemented or 2) sub-character properties (first & third sound) are
additionally attached to the tokens
18
Result & Analysis
• Analysis 4: Computation efficiency
 Computation for NSMC models
• Jamo-level
– Moderate parameter
size, but slow in
training
• Dense/Multi-hot
– Smaller parameter size (in case of SA)
– Faster training time
– Equal to or better performance
19
Discussion
• Primary goal of the paper:
 To search for a Jamo/character-level embedding that best fits
with the given Korean NLP task
• The utility of the comparison result
 Can it also be applied to Japanese/Chinese NLP?
• Japanese: morae (e.g., the small tsu) roughly match the
third sound of Korean
• Chinese/Japanese: Hanzi or Kanji can be further decomposed
into smaller glyph components (Nguyen et al., 2017)
– e.g., 鯨 "whale" into 魚 "fish" and 京 "capital city"
• Many South/Southeast Asian languages
– Composition of consonants and vowels
– Maybe decomposing into these properties is better than ..?
20
Done & Afterward
• Reviewed five (sub-)character-level embeddings for a
character-rich and agglutinative language (Korean)
 Dense and multi-hot character-level representations perform best
• for the dense one, distributional information may be what matters
 Multi-hot has the potential to be utilized beyond the given tasks
• conciseness & computational efficiency
 Sub-character-level features are useful in tasks that require
morphological decomposition
• they have the potential to be improved via a word piece approach or
information attachment
 The overall tendency is useful for the text processing of other character-
rich languages with conjunct forms in their writing systems
21
Reference (order of appearance)
• Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.
In Advances in neural information processing systems, pages 649–657.
• Haebin Shin, Min-Gwan Seo, and Hyeongjin Byeon. 2017. Korean alphabet level convolution neural network for
text classification. In Proceedings of Korea Computer Congress 2017 [in Korean], pages 587–589.
• Yong Woo Cho, Gyu Su Han, and Hyuk Jun Lee. 2018c. Character level bi-directional lstm-cnn model for movie
rating prediction. In Proceedings of Korea Computer Congress 2018 [in Korean], pages 1009–1011.
• Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, and Nam Soo Kim. 2018a. Real-time automatic
word segmentation for user-generated text. arXiv preprint arXiv:1810.13113.
• Won Ik Cho, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim, and Nam Soo Kim. 2018b. Speech intention
understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing, 45(11):2673–2681.
• Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio.
2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
• Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606.
• Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
• Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character neural language modelling in Japanese.
In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153.
22
Thank you!
  • #3 overview: gender bias in NLP – various problems translation: real-world problem - example e.g. Turkish, Korean..? How is it treated in previous works? Why should it be guaranteed? problem statement: with KR-EN example why not investigated in previous works? why appropriate for investigating gender bias? what examples are observed? construction: what are to be considered? formality (걔 vs 그 사람) politeness (-어 vs –어요) lexicon sentiment polarity (positive & negative & occupation) + things to be considered in... (not to threaten the fairness) - Measure? how the measure is defined, and proved to be bounded (and have optimum when the condition fits with the ideal case) concept of Vbias and Sbias – how they are aggregated into the measure << disadvantage? how the usage is justified despite disadvantages the strong points? - Experiment? how the EEC is used in evaluation, and how the arithmetic averaging is justified the result: GT > NP > KT? - Analysis? quantitative analysis – Vbias and Sbias, significant with style-related features qualitative analysis – observed with the case of occupation words Done: tgbi for KR-EN, with an EEC Afterward: how Sbias can be considered more explicitly? what if among context? how about with other target/source language?