Zeng, X., Yang, C., Tu, C., Liu, Z., & Sun, M. (2018). Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention. AAAI 2018.
LIWC
• The Linguistic Inquiry and Word Count
• Widely used for computerized text analysis in social science
• Consists of 92 predefined word categories (e.g., Work, Leisure, Money, Time, …)
• For each category, LIWC reports the proportion of words in the given text that fall under that category:
  $\text{proportion}(c) = n_c / N$, where $n_c$ is the number of words of the text belonging to category $c$ and $N$ is the total word count
• The word lists were first constructed by Pennebaker et al. in the early 1990s
• LIWC-based applications:
  - Depression detection (Ramirez-Esparza et al. 2008)
  - Social status prediction (Kacewicz and Pennebaker 2013)
  - Personality prediction (Golbeck 2011)
• LIWC in other languages:
  - Traditional and Simplified Chinese (Huang et al. 2012)
  - Italian (Alparone 2004)
  - Dutch (van Wissen 2017)
• Although a Chinese LIWC has been developed:
  - The included lexicon is insufficient
  - It does not cover emerging words and netspeak
• Expanding the Chinese LIWC manually is impractical:
  - Annotating new words by hand is too time-consuming
  - Language expertise is needed
• This paper proposes to expand the LIWC lexicon automatically
• Problems in automated lexicon expansion:
  - Polysemy: multiple meanings carried by a single word
  - Indistinctness: categories are too fine-grained, making them hard to distinguish from one another
• LIWC categories form a tree-structured hierarchy
  - Hierarchical SVMs do not account for the polysemy and indistinctness properties
  - This paper proposes to use HowNet together with an attention mechanism for expansion learning
• Hierarchical multilabel classification
• LIWC lexicon expansion is a hierarchical multilabel classification problem with non-mandatory leaf-node prediction
• < ", $%&, %' >
- %' (Partial Depth Labeling) indicates that some
instances have a partial depth of labeling
• Most previous works on lexicon expansion are based on feature engineering techniques
  - They require substantial human knowledge for feature design
  - They are non-trivial to adopt
• Hierarchical classification approaches:
  - Flat Classifier (Barbedo and Lopes 2007)
  - Local Classifier Per Node (Fagni and Sebastiani 2007)
  - Local Classifier Per Parent Node (Silla and Freitas 2009)
  - Local Classifier Per Level (Clare and King 2003)
  - Global Classifier (Kiritchenko et al. 2006)
• Deep-learning-based hierarchical classification:
  - Use the predictions of one neural network as inputs for the next network, associated with the next hierarchical level (Cerri, Barros, and De Carvalho 2014)
  - Use an RNN encoder-decoder for entity mention classification, generating paths in the hierarchy from the top node to leaf nodes (Karn, Waltinger, and Schütze 2017)
• Model the hierarchical classification problem as sequence-to-sequence decoding
  - Input: word embedding of a given word $w$
  - Output: hierarchical label sequence $c$ of $w$
• Notation:
  - $\mathcal{C}$ denotes the label set
  - $\pi: \mathcal{C} \to \mathcal{C}$ denotes the parent relationship, where $\pi(c)$ is the parent node of $c \in \mathcal{C}$
  - Given a word $w$, its labels form a tree structure. Each path from the root node to a leaf node can be transformed into a sequence $c = (c_1, c_2, \dots, c_L)$, where $\pi(c_i) = c_{i-1}, \forall i \in [2, L]$, and $L$ is the number of levels in the hierarchy
• The Hierarchical Decoder (HD) predicts a label $c_i$ by taking the parent label sequence $(c_1, c_2, \dots, c_{i-1})$ into consideration
!" = %('( ) ℎ"+,, ." + 0()
2" = %('3 ) ℎ"+,, ." + 03)
4̃" = tanh(': ) ℎ"+,, ." + 0:)
4" = !" ∘ 4"+, + 2" ∘ 4̃"
<" = %('= ) ℎ"+,, >" + 0=)
ℎ" = <" ∘ tanh(4")
• Denotations:
- ∘ is element-wise multiplication
- % is the sigmoid function
- 0∗ are the biases
- '∗ are weights
- <", 2", and !" are output gate, input gate and forget
gate respectively
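A minimal NumPy sketch of one decoder step implementing these gate equations; the dimensions and random initialization below are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM decoder step following the gate equations above.
    Each W[k] maps the concatenated [h_{t-1}; x_t] to one gate; b[k] are the biases."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

d_h, d_x = 4, 3  # illustrative dimensions
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_h + d_x)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.normal(size=d_x), W, b)
```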
• Take advantage of word embeddings:
  - Initial state $h_0 = \mathbf{w}$, where $\mathbf{w}$ is the word embedding of word $w$
  - $V \in \mathbb{R}^{|V| \times d_w}$ denotes the word embedding matrix, where $d_w$ is the embedding dimension
  - $L \in \mathbb{R}^{|\mathcal{C}| \times d_c}$ denotes the label embedding matrix, where $d_c$ is the embedding dimension
  - Word embeddings are pre-trained and fixed during training
• The hierarchical decoder alone is insufficient:
  - Each word has only one representation
  - It cannot handle polysemy and indistinctness
  - Propose to incorporate sememe information
• Sememe attention:
  - Different sememes represent different meanings of a word
  - The same sememe can appear under different categories
  - Incorporate sememe weights when predicting labels
• Consider the tree and HowNet structure:
  - When the decoder chooses among the sub-classes of PersonalConcerns, the sememe "location" should have a relatively lower weight
• Propose to utilize the attention mechanism (Bahdanau, Cho, and Bengio 2014)
  - Apply word embeddings as the initial state
  - The conditional probability is defined as $p(c_i \mid c_1, \dots, c_{i-1}, w) = g(c_{i-1}, s_i, m_i)$, where $m_i$ is the context vector, a weighted sum of the sememe embeddings: $m_i = \sum_{j} \alpha_{ij} h_j$
  - $S \in \mathbb{R}^{|S| \times d_s}$ denotes the sememe embedding matrix, where $d_s$ is the embedding dimension
  - $\{h_1, \dots, h_N\}$ is the set of sememe embeddings from $S$ for the sememes of word $w$
• The weight $\alpha_{ij}$ of each sememe embedding $h_j$ is defined as
  $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{N} \exp(e_{ik})}$, with $e_{ij} = v^\top \tanh(W_1 s_{i-1} + W_2 h_j)$
  where
  - $e_{ij}$ is the score indicating how well the sememe embedding $h_j$ is associated with predicting the current label $c_i$
  - $v \in \mathbb{R}^{A}$, $W_1 \in \mathbb{R}^{A \times d_h}$, and $W_2 \in \mathbb{R}^{A \times d_s}$ are weight matrices, with $d_h$ the decoder state dimension
  - $A$ is the number of hidden units in the attention model
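A minimal NumPy sketch of this Bahdanau-style attention over a word's sememe embeddings; shapes and random parameters are illustrative only:

```python
import numpy as np

def sememe_attention(s_prev, H, v, W1, W2):
    """Compute attention weights alpha_ij over sememe embeddings and the context vector.
    s_prev: decoder state s_{i-1}; H: (N, d_s) matrix whose rows are the h_j."""
    scores = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h_j) for h_j in H])  # e_ij
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()           # softmax over the word's sememes
    m_i = alpha @ H                # context vector: weighted sum of sememe embeddings
    return alpha, m_i

A, d_h, d_s, N = 8, 6, 6, 4        # illustrative sizes (A = attention hidden units)
rng = np.random.default_rng(1)
alpha, m = sememe_attention(rng.normal(size=d_h), rng.normal(size=(N, d_s)),
                            rng.normal(size=A), rng.normal(size=(A, d_h)),
                            rng.normal(size=(A, d_s)))
```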
• At each time step, HDSA:
  - Chooses which sememes to pay attention to when predicting the current word label
  - Different sememes can have different weights
  - The same sememe can have different weights under different categories
• For optimization, the objective function is the cross-entropy loss
  $\mathcal{L} = -\sum_{j=1}^{M} \sum_{c} \big( y_{cj} \log y'_{cj} + (1 - y_{cj}) \log (1 - y'_{cj}) \big)$
  where
  - $y_{cj} \in \{0, 1\}$ represents whether word $w_j$ belongs to label $c$
  - $y'_{cj}$ represents the predicted probability of word $w_j$ belonging to label $c$
  - $M$ is the number of words
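Assuming the standard binary cross-entropy reading of this objective, a short NumPy sketch:

```python
import numpy as np

def lexicon_loss(Y, Y_pred, eps=1e-9):
    """Binary cross-entropy summed over labels c (rows) and words j (columns),
    matching the objective above: Y[c, j] in {0, 1}, Y_pred[c, j] in (0, 1)."""
    Y_pred = np.clip(Y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(Y * np.log(Y_pred) + (1 - Y) * np.log(1 - Y_pred))

Y = np.array([[1, 0], [0, 1]])
Y_pred = np.array([[0.9, 0.2], [0.1, 0.7]])
print(lexicon_loss(Y, Y_pred))  # small loss for mostly-correct predictions
```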
• In the implementation:
  - The Adam algorithm is used to automatically adapt the learning rate of each parameter
  - Beam search: a word is assigned to a label sequence $c$ only when $p(c \mid w) > \tau$, where $\tau$ is a threshold
  - Recurrent dropout and layer normalization are employed to prevent overfitting
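A toy sketch of the thresholding step; the label paths and probabilities below are illustrative, not from the paper:

```python
def accept(sequences, tau):
    """Keep only beam-search label sequences whose probability exceeds the threshold tau.
    `sequences` is a list of (label_path, probability) pairs from the beam."""
    return [path for path, prob in sequences if prob > tau]

beams = [(("Affect", "PosEmo"), 0.62), (("Affect", "NegEmo"), 0.03)]
print(accept(beams, tau=0.3))  # only the first path survives
```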
• Use the Chinese LIWC by Huang et al. 2012
  - Hierarchy depth is 3
  - Lexicon statistics (table not reproduced)
• Employ the sememe annotation in HowNet
  - 1,617 distinct sememes are used
• Use the Sogou-T corpus to learn word embeddings and sememe embeddings
  - Sogou-T contains over 130 million web pages provided by a Chinese commercial search engine
• The following hierarchical algorithms are chosen for comparison:
  - Top-down k-NN (TD k-NN): top-down decision making using k-NN at each parent label
  - Top-down SVM (TD SVM): top-down decision making using SVM at each parent label
  - Structural SVM: a margin-rescaled structural SVM using the 1-slack formulation and cutting-plane method
  - Condensing Sort and Select Algorithm (CSSA): a hierarchical classification algorithm usable on both tree- and DAG-structured hierarchies
  - Hierarchical Decoder (HD): the hierarchical decoder without sememe attention
• The word embedding matrix $V$ is pre-trained using word2vec with the following settings:
  - Directly use the embeddings from $V$ as sememe embeddings
  - Dimensions $d_w = d_s = 300$
  - Window size $K = 5$
  - Negative sampling $N_s = 5$
  - Only consider words with frequency > 50
• Randomly initialize the label embedding matrix $L$, and use backpropagation to update its values during training
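A hedged gensim sketch of this pre-training setup; the skip-gram choice, the toy corpus, and `min_count=1` (so the tiny example runs) are assumptions, not from the slides:

```python
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很好"], ["他", "在", "公司", "工作"]]  # toy sentences; the paper uses Sogou-T
model = Word2Vec(
    sentences=corpus,
    vector_size=300,  # d_w = d_s = 300
    window=5,         # window size K = 5
    negative=5,       # negative sampling N_s = 5
    min_count=1,      # the slides keep words with frequency > 50; 1 here so the toy corpus works
    sg=1,             # skip-gram (assumption; the slides do not state the variant)
)
V = model.wv          # pre-trained word vectors, kept fixed during decoder training
```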
• Transform tree structure to label sequence
- If a word has more than one path in the tree, it will be
transformed into multiple label sequences
"
#$%
#$&
#$' #$(
" #$% #$& #$'
" #$% #$& #$(
)
)
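A short Python sketch that enumerates the label sequences for the example tree above; the dict-based tree encoding is hypothetical:

```python
def label_sequences(tree, root):
    """Enumerate every root-to-leaf path as one label sequence.
    `tree` maps each label to the list of its children."""
    children = tree.get(root, [])
    if not children:
        return [[root]]
    return [[root] + path for child in children for path in label_sequences(tree, child)]

# The tree from the example: c3 and c4 share the parent chain root -> c1 -> c2
tree = {"root": ["c1"], "c1": ["c2"], "c2": ["c3", "c4"]}
print(label_sequences(tree, "root"))
# [['root', 'c1', 'c2', 'c3'], ['root', 'c1', 'c2', 'c4']]
```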
• Micro-$F_1$
• Macro-$F_1$
  - Weighted Macro-F1 (W-M-F1) is used because categories in LIWC are highly imbalanced, with some categories containing more than 1,000 instances while others have fewer than 40
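In scikit-learn terms, these metrics correspond to the `micro`, `macro`, and `weighted` averaging modes of `f1_score` (assuming W-M-F1 is the support-weighted macro average); a toy single-label example:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]   # imbalanced toy labels
y_pred = [0, 0, 0, 1, 1, 1, 0]
print(f1_score(y_true, y_pred, average="micro"))     # Micro-F1
print(f1_score(y_true, y_pred, average="macro"))     # Macro-F1
print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted Macro-F1 (W-M-F1)
```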
• Both HD and HDSA outperform all other baselines
• HDSA outperforms HD by 2%
• The performance gap between HDSA and the top-down approaches grows larger as the level goes deeper
• The threshold $\tau$ has a direct impact on both precision and recall
  - A higher $\tau$ means a stricter standard, resulting in higher precision and lower recall
• Micro-F1 and W-M-F1 do not change much as $\tau$ increases
• Contributions:
  - Utilize a sequence-to-sequence model for hierarchical classification in LIWC lexicon expansion
  - Propose to use sememes and an attention mechanism to select the senses of words
  - Show that the model is capable of understanding the meaning of words with the help of sememe attention
• Future directions
- Consider the complicated structure and relations of
sememes and words in HowNet
  - Explore the possibility of updating word embeddings during training instead of keeping the pre-trained embeddings fixed