SMU Classification: Restricted
Zeng, X., Yang, C., Tu, C., Liu, Z., & Sun, M. (2018). Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention. AAAI 2018.
LIWC
• The Linguistic Inquiry and Word Count
• Widely used for computerized text analysis in social
science
• Consists of 92 predefined word categories
(e.g., Work, Leisure, Money, Time,…)
• For each category, LIWC reports the proportion of
words in a given text that fall under that category:
proportion(c) = (# words in the text belonging to
category c) / (total # words in the text)
• The word lists were first constructed by Pennebaker
et al. in the early 1990s
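The proportion computation can be sketched in a few lines (the toy category lists below are hypothetical placeholders, not the real LIWC dictionary):

```python
# Minimal sketch of the LIWC proportion computation.
# The category word lists here are toy examples, not real LIWC entries.
CATEGORIES = {
    "Work": {"job", "boss", "project"},
    "Money": {"cash", "salary", "price"},
}

def liwc_proportions(tokens, categories=CATEGORIES):
    """Return, per category, the fraction of tokens that fall in it."""
    total = len(tokens)
    return {
        cat: sum(1 for t in tokens if t in words) / total
        for cat, words in categories.items()
    }

text = "my boss cut my salary on the project".split()
props = liwc_proportions(text)  # 8 tokens; 'Work' matches 'boss' and 'project'
```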
• LIWC-based applications:
- Depression detection (Ramirez-Esparza et al. 2008)
- Social status prediction (Kacewicz and Pennebaker 2013)
- Personality prediction (Golbeck 2011)
• LIWC in other languages:
- Traditional and Simplified Chinese (Huang et al. 2012)
- Italian (Alparone 2004)
- Dutch (van Wissen 2017)
• Although a Chinese LIWC has been developed
- The included lexicon is insufficient
- It does not cover emerging words and netspeak
• Expanding the Chinese LIWC manually is impractical
- Annotating new words by hand is too time-consuming
- Language expertise is needed
• This paper proposes to automatically expand the LIWC
lexicon
• Problems in automated lexicon expansion
- Polysemy: multiple meanings carried by one word
- Indistinctness: categories are too fine-grained,
making it hard to distinguish one from another
• LIWC categories form a tree-structured hierarchy
- Hierarchical SVMs do not account for the polysemy and
indistinctness properties
- This paper proposes to use HowNet together with an
attention mechanism for expansion learning
• Hierarchical multilabel classification
• LIWC lexicon expansion is a non-mandatory
hierarchical multilabel classification problem
• The task can be characterized by the tuple ⟨T, MPL, PD⟩
- T indicates that the classes are arranged into a tree
structure
- MPL (Multiple Paths of Labels) is equivalent to
the term hierarchical multilabel
- PD (Partial Depth Labeling) indicates that some
instances have only a partial depth of labeling
• Most previous works on lexicon expansion are
based on feature engineering techniques
- They require substantial human knowledge for feature
design
- They are non-trivial to adopt
• Hierarchical classification
- Flat Classifier (Barbedo and Lopes 2007)
- Local Classifier Per Node (Fagni and Sebastiani 2007)
- Local Classifier Per Parent Node (Silla and Freitas 2009)
- Local Classifier Per Level (Clare and King 2003)
- Global Classifier (Kiritchenko et al. 2006)
• Deep learning based hierarchical classification
- Use the predictions of the neural network as inputs
for the next neural network associated with the next
hierarchical level
(Cerri, Barros, and De Carvalho 2014)
- Use RNN encoder-decoder for entity mention
classification by generating paths in the hierarchy
from top node to leaf nodes
(Karn, Waltinger, and Schütze 2017)
• Model the hierarchical classification problem as
sequence-to-sequence decoding
- Input: the word embedding of a given word w
- Output: the hierarchical labels ℓ of w
• Denotations:
- L denotes the label set
- π: L → L denotes the parent relationship,
where π(ℓ) is the parent node of ℓ ∈ L
- Given a word w, its labels form a tree structure.
Each path from the root node to a leaf node can be
transformed into a sequence ℓ = (ℓ1, ℓ2, …, ℓT)
where π(ℓt) = ℓt−1, ∀t ∈ [2, T],
and T is the number of levels in the hierarchy
• The Hierarchical Decoder (HD) predicts a label ℓt by
taking the parent label sequence (ℓ1, ℓ2, …, ℓt−1) into
consideration, factorizing the probability of a label
sequence as
p(ℓ | w) = ∏t p(ℓt | ℓ1, …, ℓt−1, w)
• Using an LSTM, each conditional probability is
computed as
p(ℓt | ℓ1, …, ℓt−1, w) = g(ℓt−1, st)
where g is a softmax output layer over the labels and
st is the LSTM hidden state at step t
(Figure: LSTM cell diagram, with cell state c and hidden
state h flowing through the gates below)

ft = σ(Wf · [ht−1, xt] + bf)
it = σ(Wi · [ht−1, xt] + bi)
c̃t = tanh(Wc · [ht−1, xt] + bc)
ct = ft ∘ ct−1 + it ∘ c̃t
ot = σ(Wo · [ht−1, xt] + bo)
ht = ot ∘ tanh(ct)
• Denotations:
- ∘ is element-wise multiplication
- σ is the sigmoid function
- b∗ are the biases
- W∗ are the weights
- ot, it, and ft are the output gate, input gate, and
forget gate, respectively
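As a sanity check, these gate equations can be written out directly in NumPy (a sketch with illustrative dimensions and random weights, not the paper's configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.
    W[k] maps the concatenated [h_{t-1}, x_t] to gate k's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_h, d_x = 4, 3  # illustrative sizes, not the paper's
W = {k: rng.normal(size=(d_h, d_h + d_x)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
```

Because h = o ∘ tanh(c), every component of the hidden state stays strictly inside (−1, 1).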
• Take advantage of word embeddings:
- Initial state s0 = Ew,
where Ew represents the word embedding of word w
- W ∈ ℝ^(|V|×dw) denotes the word embedding matrix,
where dw is the embedding dimension
- Y ∈ ℝ^(|L|×dl) denotes the label embedding matrix,
where dl is the embedding dimension
- Word embeddings are pre-trained and fixed during
training
• The hierarchical decoder alone is insufficient
- Each word has only one representation
- It cannot handle polysemy and indistinctness
- Propose to incorporate sememe information
• Sememe attention
- Different sememes represent different meanings of a
word
- The same sememe can appear under different categories
- Incorporate attention weights when predicting
• Consider the tree and the HowNet structure
- When the decoder chooses among the sub-classes of
PersonalConcerns, the location sememe should receive a
relatively lower weight
• Propose to utilize the attention mechanism
- Following Bahdanau, Cho, and Bengio (2014)
- Apply word embeddings as the initial state
- The conditional probability is extended with a context
vector ct:
p(ℓt | ℓ1, …, ℓt−1, w) = g(ℓt−1, st, ct)
where ct is a weighted sum of sememe embeddings,
S ∈ ℝ^(|S|×ds) denotes the sememe embedding matrix,
ds is the embedding dimension, and
{h1, …, hN} is the set of sememe embeddings drawn from S
• The weight αti of each sememe embedding hi is
defined as the softmax
αti = exp(eti) / Σj exp(etj)
where
- eti = v⊤ tanh(W1 st + W2 hi) is the score indicating
how well the sememe embedding hi is associated with
predicting the current label ℓt
- v ∈ ℝ^k, W1 ∈ ℝ^(k×dl), and W2 ∈ ℝ^(k×ds) are weight
parameters
- k is the number of hidden units in the attention model
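A minimal NumPy sketch of this Bahdanau-style scoring, softmax, and context vector (shapes and names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def sememe_attention(s_t, H, v, W1, W2):
    """Bahdanau-style attention over sememe embeddings (illustrative sketch).
    s_t: decoder state (d_l,); H: sememe embeddings (N, d_s).
    Returns the weights alpha (N,) and the context vector (d_s,)."""
    # e_ti = v^T tanh(W1 s_t + W2 h_i): one score per sememe
    e = np.array([v @ np.tanh(W1 @ s_t + W2 @ h_i) for h_i in H])
    e = e - e.max()                       # subtract max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # softmax over sememes
    context = alpha @ H                   # weighted sum of sememe embeddings
    return alpha, context

rng = np.random.default_rng(1)
k, d_l, d_s, N = 5, 4, 4, 3  # illustrative sizes
alpha, ctx = sememe_attention(
    rng.normal(size=d_l), rng.normal(size=(N, d_s)),
    rng.normal(size=k), rng.normal(size=(k, d_l)), rng.normal(size=(k, d_s)))
```

The weights are non-negative and sum to 1, so the context vector is a convex combination of the sememe embeddings.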
• At each time step, HDSA (the Hierarchical Decoder with
Sememe Attention)
- Chooses which sememes to pay attention to when
predicting the current word label
- Different sememes can have different weights
- The same sememe can have different weights under
different categories
• For optimization, the objective function is the
cross-entropy
O = − Σi Σj [ yij log y′ij + (1 − yij) log(1 − y′ij) ]
where
- yij ∈ {0,1} represents whether word wj belongs to
label yi
- y′ij represents the predicted probability of word wj
belonging to label yi
- the sum over j runs over the n words
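Assuming the loss is averaged over all (label, word) pairs, the objective can be sketched as:

```python
import numpy as np

def objective(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy over (label, word) pairs -- a sketch of the
    objective above, averaged rather than summed for readability."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```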
• In the implementation
- The Adam algorithm is utilized to automatically adapt
the learning rate of each parameter
- Beam search:
A word is assigned to a label sequence ℓ only
when the sequence's log-probability clears a threshold,
log p(ℓ | w) > δ,
where δ is the threshold
- Recurrent Dropout and Layer Normalization are
employed to prevent overfitting
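A hedged sketch of the thresholded beam search (the `children` map and `log_prob` callback below are hypothetical stand-ins; in the model the decoder supplies the conditional log-probabilities):

```python
import math

def beam_search(children, log_prob, beam_size=5, delta=-1.6):
    """Expand label sequences level by level, keeping the top `beam_size`;
    accept a finished sequence only if its log-probability exceeds delta."""
    beams = [(("root",), 0.0)]
    finished = []
    while beams:
        expanded = []
        for seq, score in beams:
            kids = children.get(seq[-1], [])
            if not kids:                          # reached a leaf label
                finished.append((seq, score))
            for lab in kids:
                expanded.append((seq + (lab,), score + log_prob(seq, lab)))
        expanded.sort(key=lambda p: p[1], reverse=True)
        beams = expanded[:beam_size]              # prune to the beam width
    # keep only sequences whose log-probability clears the threshold
    return [(seq, s) for seq, s in finished if s > delta]

# toy hierarchy and conditional probabilities (hypothetical numbers)
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
lp = {"A": math.log(0.7), "B": math.log(0.3),
      "A1": math.log(0.6), "A2": math.log(0.4)}
results = beam_search(children, lambda seq, lab: lp[lab])
```

With δ = −1.6 all three leaf sequences in this toy tree clear the threshold; raising δ prunes the lower-probability paths first.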
• Use the Chinese LIWC by Huang et al. (2012)
- Hierarchy depth is 3
- Statistics (table not shown)
• Employ the sememe annotations in HowNet
- 1,617 distinct sememes are used
• Use the Sogou-T corpus to learn word embeddings
and sememe embeddings
- Sogou-T contains over 130 million web pages
provided by a Chinese commercial search engine
• The following hierarchical algorithms are chosen for comparison
- Top-down k-NN (TD k-NN): Top-down decision
making using k-NN at each parent label
- Top-down SVM (TD SVM): Top-down decision making
using SVM at each parent label
- Structural SVM: A margin rescaled structural SVM
using the 1-slack formulation and cutting plane method
- Condensing Sort and Select Algorithm (CSSA): A
hierarchical classification algorithm which can be used on
both tree- and DAG-structured hierarchies
- Hierarchical Decoder (HD): The Hierarchical
Decoder without sememe attention
• The word embedding matrix W is pre-trained using
word2vec, with the following settings
- Directly use the embeddings from W as sememe
embeddings
- Dimensions dw = ds = 300
- Window size K = 5
- Negative sampling with 5 samples
- Consider only words with frequency > 50
• Randomly initialize the label embedding matrix Y,
and use backpropagation to update its values during
training
• Comparative algorithm settings
- TD k-NN:
k = 5
- TD SVM and Structural SVM:
C = 1, ε = 0.01
- CSSA:
Each sample is given 4 labels (1 label in the other approaches)
• Experiment settings
- k = dl = 300
- Beam size = 5, δ = −1.6
- Adam algorithm:
α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^−8
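One Adam update with these hyperparameters looks like the following (a standard sketch of Kingma and Ba's algorithm, not the authors' code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using the hyperparameters listed above."""
    m = beta1 * m + (1 - beta1) * grad        # 1st-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(x) = x^2 for a few steps: x should move toward 0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

The per-parameter step size is roughly α, regardless of the raw gradient magnitude, which is why Adam is described here as adapting the learning rate of each parameter.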
• Transform the tree structure into label sequences
- If a word has more than one path in the tree, it is
transformed into multiple label sequences
(Figure: a tree rooted at r in which label ℓ2 has two
children ℓ3 and ℓ4 is flattened into the two sequences
(r, ℓ1, ℓ2, ℓ3) and (r, ℓ1, ℓ2, ℓ4))
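The flattening step can be sketched as a recursive root-to-leaf path enumeration (the label names below are placeholders, not real LIWC categories):

```python
def label_sequences(children, root="root"):
    """Flatten a word's label tree into one sequence per root-to-leaf path."""
    kids = children.get(root, [])
    if not kids:
        return [[root]]  # a leaf terminates its sequence
    return [[root] + path
            for k in kids
            for path in label_sequences(children, k)]

# l2 has two children, so the word gets two training sequences
tree = {"root": ["l1"], "l1": ["l2"], "l2": ["l3", "l4"]}
seqs = label_sequences(tree)
```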
• Micro-F1
• Macro-F1
- Weighted Macro-F1 (W-M-F1) is used because the
categories in LIWC are highly imbalanced: some
categories contain more than 1,000 instances while
others have fewer than 40
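Under the usual multilabel definitions (our reading; the paper may differ in details such as the exact weighting), the two metrics can be sketched as:

```python
def micro_f1(gold, pred):
    """Micro-F1: pool TP/FP/FN over all (word, label) pairs.
    gold and pred are lists of label sets, one set per word."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def weighted_macro_f1(gold, pred):
    """Per-label F1 averaged with weights proportional to label support,
    so 1,000-instance categories do not count the same as 40-instance ones."""
    labels = sorted(set().union(*gold))
    total = sum(len(g) for g in gold)
    score = 0.0
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += f1 * sum(1 for g in gold if lab in g) / total
    return score

gold = [{"Work"}, {"Work"}, {"Money"}]
pred = [{"Work"}, {"Money"}, {"Money"}]
```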
• Both HD and HDSA outperform all other baselines
• HDSA outperforms HD by 2%
• The performance gap between HDSA and the top-down
approaches grows larger as the level goes deeper
• δ has a direct impact on both precision and recall
- A higher δ means a stricter standard, resulting in
higher precision and lower recall
• Micro-F1 and W-M-F1 do not change much as δ
increases
• Polysemy
• Imprecise word embeddings
• Indistinctness
• Misleading sememes
• Contributions
- Utilize a sequence-to-sequence model for the
hierarchical classification underlying LIWC lexicon
expansion
- Propose to use sememes and utilize the attention
mechanism to select word senses
- Show that the model is capable of capturing the
meanings of words with the help of sememe attention
• Future directions
- Consider the complicated structure and relations of
sememes and words in HowNet
- Explore the possibility of updating word embeddings
during training instead of using pre-trained
embeddings
SMU Classification: Restricted

More Related Content

Similar to Chinese liwc lexicon expansion via hierarchical classification of word embeddings with sememe attention

Mining the social web 6
Mining the social web 6Mining the social web 6
Mining the social web 6HyeonSeok Choi
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text ClassificationSai Srinivas Kotni
 
Active learning from crowds
Active learning from crowdsActive learning from crowds
Active learning from crowdsPC LO
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learningtelss09
 
Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Zakaria Zubi
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadfalizain9604
 
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdfLecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdfssuserf86fba
 
Task oriented word embedding for text classification
Task oriented word embedding for text classificationTask oriented word embedding for text classification
Task oriented word embedding for text classificationPC LO
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Implementing the Database Server session 01
Implementing the Database Server  session 01Implementing the Database Server  session 01
Implementing the Database Server session 01Guillermo Julca
 
Semi-Supervised.pptx
Semi-Supervised.pptxSemi-Supervised.pptx
Semi-Supervised.pptxTamer Nadeem
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeBioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeProf. Wim Van Criekinge
 
Hierarchical text classification
Hierarchical text classificationHierarchical text classification
Hierarchical text classificationKausal Malladi
 
Lecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxLecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxHanzlaNaveed1
 
A Framework for Computing Linguistic Hedges in Fuzzy Queries
A Framework for Computing Linguistic Hedges in Fuzzy QueriesA Framework for Computing Linguistic Hedges in Fuzzy Queries
A Framework for Computing Linguistic Hedges in Fuzzy Queriesijdms
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Prof. Wim Van Criekinge
 

Similar to Chinese liwc lexicon expansion via hierarchical classification of word embeddings with sememe attention (20)

Mining the social web 6
Mining the social web 6Mining the social web 6
Mining the social web 6
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Moviereview prjct
Moviereview prjctMoviereview prjct
Moviereview prjct
 
Active learning from crowds
Active learning from crowdsActive learning from crowds
Active learning from crowds
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdfLecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
Lecture-2-Relational-Algebra-and-SQL-Advanced-DataBase-Theory-MS.pdf
 
Datamining
DataminingDatamining
Datamining
 
Task oriented word embedding for text classification
Task oriented word embedding for text classificationTask oriented word embedding for text classification
Task oriented word embedding for text classification
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Implementing the Database Server session 01
Implementing the Database Server  session 01Implementing the Database Server  session 01
Implementing the Database Server session 01
 
Semi-Supervised.pptx
Semi-Supervised.pptxSemi-Supervised.pptx
Semi-Supervised.pptx
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekingeBioinformatics t5-database searching-v2013_wim_vancriekinge
Bioinformatics t5-database searching-v2013_wim_vancriekinge
 
Hierarchical text classification
Hierarchical text classificationHierarchical text classification
Hierarchical text classification
 
Lecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptxLecture-2 - Relational Model.pptx
Lecture-2 - Relational Model.pptx
 
A Framework for Computing Linguistic Hedges in Fuzzy Queries
A Framework for Computing Linguistic Hedges in Fuzzy QueriesA Framework for Computing Linguistic Hedges in Fuzzy Queries
A Framework for Computing Linguistic Hedges in Fuzzy Queries
 
Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014Bioinformatics t5-databasesearching v2014
Bioinformatics t5-databasesearching v2014
 

More from PC LO

ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...PC LO
 
Sentiment analysis and opinion mining Ch.7
Sentiment analysis and opinion mining Ch.7Sentiment analysis and opinion mining Ch.7
Sentiment analysis and opinion mining Ch.7PC LO
 
On Joint Modeling of Topical Communities and Personal Interest in Microblogs
On Joint Modeling of Topical Communities and Personal Interest in MicroblogsOn Joint Modeling of Topical Communities and Personal Interest in Microblogs
On Joint Modeling of Topical Communities and Personal Interest in MicroblogsPC LO
 
Campclass
CampclassCampclass
CampclassPC LO
 
MIS報告 供應鏈管理
MIS報告 供應鏈管理MIS報告 供應鏈管理
MIS報告 供應鏈管理PC LO
 
資料庫期末報告
資料庫期末報告資料庫期末報告
資料庫期末報告PC LO
 
新事業期末報告
新事業期末報告新事業期末報告
新事業期末報告PC LO
 
ERP 期末報告
ERP 期末報告ERP 期末報告
ERP 期末報告PC LO
 
User Acceptance of Information Technology
User Acceptance of Information TechnologyUser Acceptance of Information Technology
User Acceptance of Information TechnologyPC LO
 
SCM_B2B
SCM_B2BSCM_B2B
SCM_B2BPC LO
 
Ubuntu
UbuntuUbuntu
UbuntuPC LO
 
Travelution
TravelutionTravelution
TravelutionPC LO
 
社團整合系統
社團整合系統社團整合系統
社團整合系統PC LO
 
禿窄痘胖醜_期中
禿窄痘胖醜_期中禿窄痘胖醜_期中
禿窄痘胖醜_期中PC LO
 

More from PC LO (14)

ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...ext mining for the Vaccine Adverse Event Reporting System: medical text class...
ext mining for the Vaccine Adverse Event Reporting System: medical text class...
 
Sentiment analysis and opinion mining Ch.7
Sentiment analysis and opinion mining Ch.7Sentiment analysis and opinion mining Ch.7
Sentiment analysis and opinion mining Ch.7
 
On Joint Modeling of Topical Communities and Personal Interest in Microblogs
On Joint Modeling of Topical Communities and Personal Interest in MicroblogsOn Joint Modeling of Topical Communities and Personal Interest in Microblogs
On Joint Modeling of Topical Communities and Personal Interest in Microblogs
 
Campclass
CampclassCampclass
Campclass
 
MIS報告 供應鏈管理
MIS報告 供應鏈管理MIS報告 供應鏈管理
MIS報告 供應鏈管理
 
資料庫期末報告
資料庫期末報告資料庫期末報告
資料庫期末報告
 
新事業期末報告
新事業期末報告新事業期末報告
新事業期末報告
 
ERP 期末報告
ERP 期末報告ERP 期末報告
ERP 期末報告
 
User Acceptance of Information Technology
User Acceptance of Information TechnologyUser Acceptance of Information Technology
User Acceptance of Information Technology
 
SCM_B2B
SCM_B2BSCM_B2B
SCM_B2B
 
Ubuntu
UbuntuUbuntu
Ubuntu
 
Travelution
TravelutionTravelution
Travelution
 
社團整合系統
社團整合系統社團整合系統
社團整合系統
 
禿窄痘胖醜_期中
禿窄痘胖醜_期中禿窄痘胖醜_期中
禿窄痘胖醜_期中
 

Recently uploaded

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Recently uploaded (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

Chinese liwc lexicon expansion via hierarchical classification of word embeddings with sememe attention

  • 1. SMU Classification: Restricted Zeng, X., Yang, C., Tu, C., Liu, Z.,&Sun, M. (2018). Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention, AAAI18
  • 3. SMU Classification: Restricted 2 LIWC • The Linguistic Inquiry and Word Count • Widely used for computerized text analysis in social science • Consists of 92 predefined word categories (e.g., Work, Leisure, Money, Time,…)
  • 4. SMU Classification: Restricted 3 • For each category, LIWC gives the proportion of words under that category of the given text • The word lists are first constructed by Pennebaker et al. in early 1990s ! "" # ""
  • 5. SMU Classification: Restricted 4 • LIWC-based applications: - Depression detection (Ramirez-Esparza, N et al. 2008) - Social status prediction (Kacewicz, W., Pennebaker, J.W. 2013) - Personality prediction (Golbeck, J. 2011) • LIWC in multi-language: - Traditional and Simplified Chinese (Huang et al. 2012) - Italian (Alparone, F 2004) - Dutch (van Wissen 2017)
  • 6. SMU Classification: Restricted 5 • Although Chinese LIWC has been developed - The lexicon included are insufficient - Do not consist emerging word and netspeak • In order to expand the Chinese LIWC - Too time-consuming to manually annotate new words - Language expertise is needed • This paper proposes to automatically expand LIWC lexicon
  • 7. SMU Classification: Restricted 6 • Problems in automated lexicon expansion - Polysemy: multiple meaning carried by a word - Indistinctness: categories are too fine-grained, making it hard to distinguish one from another
  • 8. SMU Classification: Restricted 7 • LIWC categories form a tree structure hierarchy - Hierarchical SVM do not consider polysemy and indistinctness property - This paper proposes to use HowNet together with attention mechanism for expansion learning
  • 9. SMU Classification: Restricted 8 • Hierarchical multilabel • LIWC lexicon expansion is a non-mandatory hierarchical multilabel classification
  • 10. SMU Classification: Restricted 9 • < ", $%&, %' > - " indicates the classes are arranged into a tree structure
  • 11. SMU Classification: Restricted 10 • < ", $%&, %' > - $%& (Multiple Paths of Labels) is equivalent to the term hierarchical multilabel
  • 12. SMU Classification: Restricted 11 • < ", $%&, %' > - %' (Partial Depth Labeling) indicates that some instances have a partial depth of labeling
  • 13. SMU Classification: Restricted 12 • Most previous works on lexicon expansion are based on feature engineering techniques - Requires lots of human knowledge on feature designing - Non-trivial to be adopted • Hierarchical classification - Flat Classifier ( Barbedo and Lopes 2007) - Local Classifier Per Node (Fagni and Sebastiani 2007) - Local Classifier Per Parent Node (Silla and Freitas 2009) - Local Classifier Per Level (Clare and King 2003) - Global Classifier (Kiritchenko et al. 2006)
  • 14. SMU Classification: Restricted 13 • Deep learning based hierarchical classification - Use the predictions of the neural network as inputs for the next neural network associated with the next hierarchical level (Cerri, Barros, and De Carvalho 2014) - Use RNN encoder-decoder for entity mention classification by generating paths in the hierarchy from top node to leaf nodes (Karn, Waltinger, and Sch utze 2017)
• 16.
• Model the hierarchical classification problem as sequence-to-sequence decoding
- Input: word embedding of a given word w
- Output: hierarchical label sequence l of w
• 17.
• Denotations:
- L denotes the label set
- φ: L → L denotes the parent relationship, where φ(l) is the parent node of l ∈ L
- Given a word w, its labels form a tree structure. Each path from the root node to a leaf node can be transformed into a sequence l = (l_1, l_2, …, l_k), where φ(l_i) = l_{i−1}, ∀i ∈ [2, k], and k is the number of levels in the hierarchy
• 18.
• The Hierarchical Decoder (HD) predicts a label l_t by taking the parent label sequence (l_1, l_2, …, l_{t−1}) into consideration
• 19.
• Using an LSTM, each conditional probability is computed as
p(l_t | l_1, …, l_{t−1}, w) = g(l_{t−1}, h_t)
where g is the output function producing the label distribution and h_t is the LSTM hidden state at step t
• 20.
Cell state c, hidden state h:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∘ tanh(c_t)
• 21.
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∘ tanh(c_t)
• Denotations:
- ∘ is element-wise multiplication
- σ is the sigmoid function
- b_* are the biases
- W_* are the weights
- o_t, i_t, and f_t are the output gate, input gate, and forget gate respectively
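The LSTM step on these slides can be sketched directly with NumPy. This is a minimal, illustrative implementation of the standard equations above; the dimensions and random weights are placeholders, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the slide's equations. Each W[k] maps the
    concatenated [h_{t-1}, x_t] to one gate's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Tiny usage example with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W = {k: rng.normal(size=(d_h, d_h + d_x)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
```

Because h_t = o_t ∘ tanh(c_t) with o_t ∈ (0, 1), every component of the hidden state stays strictly inside (−1, 1).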
• 22.
• Take advantage of word embeddings:
- Initial state h_0 = w_x, where w_x represents the word embedding of word x
- V ∈ ℝ^{|V|×d_w} denotes the word embedding matrix, where d_w is the dimension number
- Y ∈ ℝ^{|L|×d_l} denotes the label embedding matrix, where d_l is the dimension number
- Word embeddings are pre-trained and fixed during training
• 23.
• The hierarchical decoder is insufficient
- Each word only has one representation
- It cannot handle polysemy and indistinctness
- Propose to incorporate sememe information
• Sememe attention
- Different sememes represent different meanings of a word
- The same sememe could appear under different categories
- Incorporate sememe weights when predicting
• 24.
• Consider the tree and HowNet structure
- When the decoder chooses among the sub-classes of PersonalConcerns, the sememe location should have a relatively lower weight
• 25.
• Propose to utilize the attention mechanism
- Introduced by Bahdanau, Cho, and Bengio 2014
- Apply word embeddings as the initial state
- The conditional probability is defined as
p(l_t | l_1, …, l_{t−1}, w) = g(l_{t−1}, h_t, c_t)
where c_t is the context vector, a weighted sum of the sememe embeddings
- S ∈ ℝ^{|S|×d_s} denotes the sememe embedding matrix, where d_s is the dimension number
- {h_1, …, h_N} is the set of sememe embeddings from S
• 26.
• The weight α_{tj} of each sememe embedding h_j is defined as
α_{tj} = exp(e_{tj}) / Σ_k exp(e_{tk})
where
- e_{tj} = v^T tanh(W_1 s_{t−1} + W_2 h_j) is the score indicating how well the sememe embedding h_j is associated with predicting the current label l_t, with s_{t−1} the decoder state
- v ∈ ℝ^k, W_1 ∈ ℝ^{k×d_l}, and W_2 ∈ ℝ^{k×d_s} are weight matrices
- k is the number of hidden units in the attention model
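The attention weights above can be sketched as follows. This is an illustrative Bahdanau-style scorer with a softmax over sememes and a weighted-sum context vector; the dimensions and random parameters are assumptions for the example, not the paper's settings.

```python
import numpy as np

def sememe_attention(s_prev, sememe_embs, v, W1, W2):
    """Compute e_tj = v^T tanh(W1 s_{t-1} + W2 h_j), the softmax
    weights alpha_tj, and the context vector c_t = sum_j alpha_tj h_j."""
    scores = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h_j)
                       for h_j in sememe_embs])
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    context = (alpha[:, None] * sememe_embs).sum(axis=0)
    return alpha, context

# Usage with random parameters (illustrative sizes only).
rng = np.random.default_rng(1)
d_l, d_s, k = 4, 5, 6
sememes = rng.normal(size=(3, d_s))        # three sememes of one word
alpha, ctx = sememe_attention(
    rng.normal(size=d_l), sememes,
    rng.normal(size=k),
    rng.normal(size=(k, d_l)),
    rng.normal(size=(k, d_s)))
```

The softmax guarantees the weights are positive and sum to one, so the context vector is a convex combination of the word's sememe embeddings.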
• 27.
• At each time step, HDSA
- Chooses which sememes to pay attention to when predicting the current word label
- Different sememes can have different weights
- The same sememe can have different weights under different categories
• 28.
• For optimization, the objective function is defined as the cross-entropy loss
O = −(1/n) Σ_i Σ_j [ l_{ji} log l′_{ji} + (1 − l_{ji}) log(1 − l′_{ji}) ]
where
- l_{ji} ∈ {0, 1} represents whether word w_i belongs to label l_j
- l′_{ji} represents the predicted probability of word w_i belonging to label l_j
- n is the number of words
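A minimal sketch of this objective, assuming binary cross-entropy averaged over the n words (the exact normalization on the slide was lost to extraction):

```python
import math

def lexicon_bce(y_true, y_pred, eps=1e-12):
    """Cross-entropy over (label, word) pairs: y_true[j][i] in {0,1} says
    whether word i belongs to label j; y_pred[j][i] is the predicted
    probability. Probabilities are clipped for numerical stability."""
    n = len(y_true[0])  # number of words
    total = 0.0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            p = min(max(p, eps), 1 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / n
```

For a single word with a single label predicted at probability 0.5, the loss is −log(0.5) = log 2, and it tends to zero as predictions approach the true {0, 1} targets.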
• 29.
• In the implementation
- The Adam algorithm is utilized to automatically adapt the learning rate of each parameter
- Beam search: a word is assigned to a label sequence l only when log p(l | w) > δ, where δ is a threshold
- Recurrent dropout and layer normalization are employed to prevent overfitting
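The thresholding step after beam search can be sketched as below. This assumes the criterion is a log-probability cutoff (consistent with the negative δ used in the experiments); the candidate sequences and scores are invented for illustration.

```python
def assign_sequences(candidates, delta=-1.6):
    """Keep only the beam-search label sequences whose log-probability
    exceeds the threshold delta; a word may receive zero, one, or
    several sequences (non-mandatory, multilabel)."""
    return [seq for seq, logp in candidates.items() if logp > delta]

# Hypothetical beam-search output for one word.
beams = {
    ("Affect", "NegEmo"): -0.4,          # confident path: kept
    ("PersonalConcerns", "Work"): -2.3,  # below threshold: dropped
}
```

Raising δ makes the standard stricter (fewer, more confident assignments), which matches the precision/recall trade-off discussed later.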
• 30.
• Use the Chinese LIWC by Huang et al. 2012
- Hierarchy depth is 3
- Statistics
• Employ the sememe annotations in HowNet
- 1,617 distinct sememes are used
• Use the Sogou-T corpus to learn word embeddings and sememe embeddings
- Sogou-T contains over 130 million web pages provided by a Chinese commercial search engine
• 31.
• Hierarchical algorithms are chosen for comparison
- Top-down k-NN (TD k-NN): top-down decision making using k-NN at each parent label
- Top-down SVM (TD SVM): top-down decision making using SVM at each parent label
- Structural SVM: a margin-rescaled structural SVM using the 1-slack formulation and the cutting-plane method
- Condensing Sort and Select Algorithm (CSSA): a hierarchical classification algorithm that can be used on both tree- and DAG-structured hierarchies
- Hierarchical Decoder (HD): the hierarchical decoder without sememe attention
• 32.
• The word embedding matrix V is pre-trained using word2vec, setting
- Directly use the embeddings from V as sememe embeddings
- Dimension d_w = d_s = 300
- Window size K = 5
- Negative sampling NS = 5
- Consider words with frequency > 50
• Randomly initialize the label embedding matrix Y, and use backpropagation to update its values during training
• 33.
• Comparative algorithm settings
- TD k-NN: k = 5
- TD SVM and Structural SVM: C = 1, eps = 0.01
- CSSA: each sample is given 4 labels (1 label in the other approaches)
• Experiment settings
- d_l = 300
- Beam size = 5, δ = −1.6
- Adam algorithm: α = 0.001, β_1 = 0.9, β_2 = 0.999, ε = 10^{−8}
• 34.
• Transform the tree structure into label sequences
- If a word has more than one path in the tree, it is transformed into multiple label sequences
(figure: a word whose label tree branches is split into one root-to-leaf label sequence per path)
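The tree-to-sequence transformation can be sketched recursively. This is an illustrative helper, assuming the word's labels are given as a child map; the node names are placeholders.

```python
def tree_to_sequences(tree, root):
    """Enumerate one label sequence per root-to-leaf path, as the slide
    describes. `tree` maps a node to its children; nodes absent from the
    map (or with no children) are leaves."""
    children = tree.get(root, [])
    if not children:
        return [(root,)]
    paths = []
    for child in children:
        for sub in tree_to_sequences(tree, child):
            paths.append((root,) + sub)
    return paths

# A word whose label tree branches below the first level yields
# two training sequences.
word_tree = {"w": ["Cat1"], "Cat1": ["Cat2a", "Cat2b"]}
```

So a single multi-path word contributes several (word, label-sequence) training examples to the decoder, one per path.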
• 35.
• Micro-F1
• Macro-F1
- Weighted Macro-F1 (W-M-F1) is used because the categories in LIWC are highly imbalanced, with some categories containing more than 1,000 instances while others have fewer than 40
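The two metrics can be sketched as follows. This is a minimal implementation assuming the standard definitions: micro-F1 pools true/false positives across labels, while weighted macro-F1 averages per-label F1 weighted by each label's gold support (the natural choice for LIWC's imbalanced categories).

```python
from collections import Counter

def f1_scores(gold, pred):
    """Micro-F1 and weighted Macro-F1 over multi-label predictions.
    `gold` and `pred` are lists of label sets, one entry per word."""
    labels = set().union(*gold, *pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for lab in labels:
            tp[lab] += int(lab in g and lab in p)
            fp[lab] += int(lab not in g and lab in p)
            fn[lab] += int(lab in g and lab not in p)
    def f1(t, f_p, f_n):
        return 2 * t / (2 * t + f_p + f_n) if t else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    support = {lab: sum(lab in g for g in gold) for lab in labels}
    total = sum(support.values())
    weighted_macro = sum(support[lab] * f1(tp[lab], fp[lab], fn[lab])
                         for lab in labels) / total
    return micro, weighted_macro

# Toy example: three words, one wrong prediction.
gold = [{"Affect"}, {"Affect"}, {"Work"}]
pred = [{"Affect"}, {"Work"}, {"Work"}]
micro, wmf1 = f1_scores(gold, pred)
```

Support weighting keeps a rare 40-instance category from dominating the average the way plain macro-F1 would let it.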
• 37.
• Both HD and HDSA outperform all the other baselines
• HDSA outperforms HD by 2%
• The performance difference between HDSA and the top-down approaches grows larger as the level gets deeper
• 38.
• δ has a direct impact on both precision and recall
- A higher δ means a stricter standard, resulting in higher precision and lower recall
• Micro-F1 and W-M-F1 do not change much as δ increases
• 39.
• Polysemy
• Imprecise word embeddings
• Indistinctness
• 40.
• Sememes being misleading
• 41.
• Contributions
- Utilize the sequence-to-sequence model for hierarchical classification in LIWC lexicon expansion
- Propose to use sememes and utilize the attention mechanism to select the senses of words
- Show that the model is capable of understanding the meanings of words with the help of sememe attention
• 42.
• Future directions
- Consider the complicated structure and relations of sememes and words in HowNet
- Explore the possibility of updating word embeddings during training instead of using fixed pre-trained embeddings