Incorporating Word Embeddings into Open Directory Project
based Large-Scale Classification
Presented by KANG-MIN KIM
kangmin89@korea.ac.kr
Data Intelligence Laboratory, Korea University
5th June, 2018
Kang-Min Kim, Alieva Dinara, Byung-Ju Choi, SangKeun Lee
Korea University
PAKDD 2018
1
Contents
1. Problem Statement
2. Contributions
3. Preliminary
4. Joint Model of Explicit and Implicit Representation
5. Semantic Similarity Measure
6. Experiments
7. Conclusion
2
1. Problem Statement
• How to incorporate word embeddings into explicit
knowledge representation to improve the
performance of large-scale text classification?
[Figure: combining the explicit representation and the implicit representation]
3
2. Contributions
1. We propose a novel methodology for large-scale text classification that utilizes both the explicit and implicit representations.
2. We develop two novel joint models of ODP-based classification and word2vec to generate category vectors that represent the semantics of ODP categories. In addition, we develop a new semantic similarity measure that utilizes both the category and word vectors.
3. We demonstrate the efficacy of the proposed methodology through extensive experiments on real-world datasets. Our approach significantly outperforms the state-of-the-art techniques.
4
* Previous Works
• Text classification using NN [1][2]
▫ Requires a lot of training data per class
▫ Limited to small- or moderate-scale classification
[1] Kim, Y., “Convolutional neural networks for sentence classification”, EMNLP 2014, 1746-1751
[2] Le, Q.V., Mikolov, T., “Distributed representations of sentences and documents”, ICML 2014, 1188-1196
5
* Previous Works
• Text classification using knowledge bases [1][2]
[1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S., “Semantic contextual advertising based on the open directory project”, ACM Trans. on the Web 7(4), 2013, 24:1-24:22
[2] Shin, H., Lee, G., Ryu, W.J., Lee, S., “Utilizing Wikipedia knowledge in open directory project based text classification”, SAC 2017, 309-314
6
* Open Directory Project (ODP)
• Large-scale and tree-structured (maximum of 15 levels) web directory
constructed by humans.
• Contains about 1 million categories and 4 million web pages.
[Figure: ODP taxonomy tree — Top, with children Game and Society; Online Gambling under Game; Crime and Government under Society]
7
3. Preliminary
3.1 ODP-based Knowledge Representation [1]
[Figure: the ODP taxonomy tree with ODP documents attached to its categories; e.g., a document under Society/Government reads “List of government officials and their title, updated weekly…”]
Bag-of-words
word frequency
government 1
officials 1
updated 1
weekly 1
... ...
[1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S., “Semantic contextual advertising based on the open directory project”, ACM Trans. on the Web 7(4), 2013, 24:1-24:22
8
Averaged term vector (TF-IDF scheme):
Society/Government
government 0.47
election 0.29
president 0.10
… …
3.1 ODP-based Knowledge Representation [1]
[Figure: the ODP taxonomy tree, with averaged term vectors computed per category (illustrative weights 0.51 and 0.31)]
9
[1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S., “Semantic contextual advertising based on the open directory project”, ACM Trans. on the Web 7(4), 2013, 24:1-24:22
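To make the construction above concrete, here is a minimal sketch of computing an averaged TF-IDF term vector (a centroid) per ODP category, assuming scikit-learn is available; the toy documents and category paths are illustrative, not taken from the actual ODP data.

```python
# Minimal sketch: centroid (averaged) TF-IDF term vectors per ODP category.
# Toy documents and category paths are for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

category_docs = {
    "Society/Government": [
        "list of government officials and their title updated weekly",
        "government election results and president announcements",
    ],
    "Game/Gambling": [
        "online casino card games and gambling odds",
    ],
}

all_docs = [d for docs in category_docs.values() for d in docs]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(all_docs)

# Average the TF-IDF vectors of the documents belonging to each category.
centroids, start = {}, 0
for cat, docs in category_docs.items():
    rows = tfidf[start:start + len(docs)]
    centroids[cat] = np.asarray(rows.mean(axis=0)).ravel()
    start += len(docs)

# Highest-weighted terms of a category centroid, as on the slide.
vocab = vectorizer.get_feature_names_out()
top = centroids["Society/Government"].argsort()[::-1][:3]
print([(vocab[i], round(centroids["Society/Government"][i], 2)) for i in top])
```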
Society/Government/President
president 0.44
government 0.31
trump 0.10
… …
Game/Gambling/Casinos
casino 0.53
trump 0.38
card 0.26
… …
3.1 ODP-based Knowledge Representation
• This approach cannot capture the semantic relations between words
(e.g., prez and president).
[Figure: a given document d, “Trump became prez”, has term weights trump 0.67 and prez 0.51; the ODP-based representation cannot match prez in the document against president in the ODP taxonomy]
10
• Word2vec projects words through a shallow layer structure and directly learns word representations from context words.
• Trained word vectors with similar semantic meanings are located in close proximity within the vector space.
3.2 Implicit Representation Model: Word2Vec [1][2]
[Figure: 2-D view of the vector space — prez, president, and election lie close together, while casino and gambling form a separate cluster]
[1] Mikolov, T., Chen, K., Corrado, G., Dean, J., “Efficient estimation of word representations in vector space”, ICLR (Workshop) 2013
[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J., “Distributed representations of words and phrases and their
compositionality”, NIPS 2013, 3111-3119
11
“Additive compositionality”
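As an illustration of these two properties, the following sketch (assuming gensim 4.x; the tiny corpus is made up, purely for demonstration) trains skip-gram vectors, queries nearest neighbors, and composes vectors by element-wise addition:

```python
# Sketch: skip-gram word vectors and additive compositionality with gensim.
# A real setup would train on a large corpus such as the One Billion Word
# benchmark; this toy corpus only demonstrates the API.
from gensim.models import Word2Vec

corpus = [
    ["trump", "became", "president"],
    ["the", "president", "urged", "congress"],
    ["prez", "urged", "congress"],
    ["casino", "card", "gambling", "odds"],
] * 50  # repeat so the toy vocabulary gets enough training updates

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=20)

# Semantically related words should be close (e.g., prez ~ president).
print(model.wv.most_similar("president", topn=3))

# Additive compositionality: combine word vectors element-wise.
composed = model.wv["president"] + model.wv["congress"]
print(model.wv.similar_by_vector(composed, topn=3))
```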
4. Joint Model of Explicit and Implicit Representation
[Figure: the explicit representation (e.g., Society/Government/President — president 0.44, government 0.31, trump 0.10, …) is joined with the implicit representation]
12
• We propose two joint models of ODP-based text classification and word2vec.
• These joint models generate category vectors, which represent the semantics
of ODP categories.
4.1 Generating Category Vector with Algebraic Operation
[Figure: the category vector of Society/Government/President is built from the explicit representation (term weights: president 0.44, government 0.31, trump 0.10, …) and the implicit representation (word vectors): each word vector is multiplied by its term weight, and the weighted vectors are added element-wise]
13
• The first approach generates the category vector using scalar multiplication and element-wise addition of word vectors, as sketched below.
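A minimal sketch of this algebraic operation, assuming a trained gensim KeyedVectors object such as model.wv from the earlier sketch; the term weights are the illustrative ones from the slide:

```python
# Sketch: category vector = sum over words of (ODP term weight) x (word vector).
# Term weights are illustrative; in the paper they come from the ODP-based
# knowledge representation (the averaged TF-IDF centroid).
import numpy as np

def category_vector(term_weights, wv):
    """Scale each word vector by its term weight and sum element-wise."""
    vec = np.zeros(wv.vector_size, dtype=np.float32)
    for word, weight in term_weights.items():
        if word in wv:  # skip out-of-vocabulary terms
            vec += weight * wv[word]
    return vec

president_terms = {"president": 0.44, "government": 0.31, "trump": 0.10}
# cv = category_vector(president_terms, model.wv)  # model from the earlier sketch
```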
4.2 Generating Category Vector with Embedding
[Figure: while skip-gram trains the target word “Trump” to predict its context words (“US”, “President”, “urged”, “congress”), ODP-based classification identifies candidate categories (e.g., Game/Gambling, Society/Government/President); the category that best fits the context is selected and its category vector is trained jointly with the word vectors]
14
• The second approach extends word2vec to represent category vectors.
• The category vector of an ODP category is expected to represent the
collective semantics of words under this category.
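The paper trains category vectors jointly with skip-gram; one rough way to emulate the idea (an approximation under stated assumptions, not the paper's exact training procedure) is to inject the category selected by the ODP-based classifier as a pseudo-token into each training sentence, so its vector is learned from the same contexts as the words. The classify stub below stands in for the real ODP-based classifier.

```python
# Rough emulation of category embedding: prepend the ODP category chosen for
# each sentence as a pseudo-token, so skip-gram learns a vector for it from
# the same contexts as the words.
from gensim.models import Word2Vec

def with_category_tokens(sentences, classify):
    """Prepend the predicted category label (a pseudo-token) to each sentence."""
    return [[classify(s)] + s for s in sentences]

def classify(sentence):
    # Trivial stand-in for the ODP-based classifier over candidate categories.
    return ("CAT:Society/Government/President"
            if "president" in sentence else "CAT:Game/Gambling")

sentences = [["us", "president", "trump", "urged", "congress"],
             ["casino", "card", "gambling"]] * 50
model = Word2Vec(with_category_tokens(sentences, classify),
                 vector_size=50, window=3, min_count=1, sg=1, epochs=20)

# The category token's vector should land near the words it governs.
print(model.wv.most_similar("CAT:Society/Government/President", topn=3))
```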
5. Semantic Similarity Measure
[Figure: a given document d, “Trump became prez” (term weights: trump 0.67, prez 0.51), is mapped to a document vector; each ODP category, e.g., Society/Government/President (president 0.44, government 0.31, trump 0.10, …) and Game/Gambling/Casinos (casino 0.53, trump 0.38, card 0.26, …), has both a term vector and a category vector]
15
• We develop a novel semantic similarity measure, which captures both the
semantic relations between words and the semantics of ODP categories.
5.1 Using Word-level Semantics
Society/Government/President
president 0.44
government 0.31
trump 0.10
… …
Game/Gambling/Casinos
casino 0.53
trump 0.38
card 0.26
… …
[Figure: for a given document d, “Trump became prez” (trump 0.67, prez 0.51), the embedding similarity between prez and president is 0.7; since 0.7 exceeds the threshold 0.6, the two words are matched and their term weights are combined as 0.51 * 0.44 * 0.7]
16
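A sketch of the word-level matching rule described on this slide, assuming numpy and a gensim-style KeyedVectors object; the threshold 0.6 and the example weights come from the slide, while the function and variable names are ours:

```python
# Sketch: word-level semantic matching between a document term vector and a
# category term vector. Exact word overlaps multiply the two term weights;
# different words that are similar enough in the embedding space
# (cosine > 0.6) additionally contribute weight_d * weight_c * similarity,
# e.g. 0.51 (prez) * 0.44 (president) * 0.7.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def soft_match_score(doc_terms, cat_terms, wv, threshold=0.6):
    score = 0.0
    for dw, dweight in doc_terms.items():
        for cw, cweight in cat_terms.items():
            if dw == cw:
                score += dweight * cweight           # exact word overlap
            elif dw in wv and cw in wv:
                sim = cosine(wv[dw], wv[cw])
                if sim > threshold:                  # semantically related words
                    score += dweight * cweight * sim
    return score

doc = {"trump": 0.67, "prez": 0.51}
president_terms = {"president": 0.44, "government": 0.31, "trump": 0.10}
# print(soft_match_score(doc, president_terms, model.wv))
```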
5.2 Using Category- and Word-level Semantics
[Figure: document d is scored against the candidate categories …/President and …/Casinos]
Society/Government/President
president 0.44
government 0.31
trump 0.10
… …
Game/Gambling/Casinos
casino 0.53
trump 0.38
card 0.26
… …
A given document d,
“Trump became prez”
0.67 trump
0.51 prez
… …
17
[Figure annotation: α denotes the pseudo term weight assigned to the category and document vectors]
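A hedged sketch of the combined measure: following the speaker notes, the category and document vectors act as pseudo words whose weight is α (α = 0.9 performs best in the experiments reported later). This convex combination is our reading of the slide, not necessarily the paper's exact formula; soft_match_score() comes from the previous sketch.

```python
# Sketch: combined similarity. The category and document vectors act as
# pseudo words weighted by alpha, added on top of the word-level matching.
# Assumes soft_match_score() from the previous sketch.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_score(doc_terms, doc_vec, cat_terms, cat_vec, wv, alpha=0.9):
    word_level = soft_match_score(doc_terms, cat_terms, wv)  # word-level semantics
    category_level = cosine(doc_vec, cat_vec)                # category-level semantics
    return alpha * category_level + (1 - alpha) * word_level
```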
6. Experiments
6.1 Dataset
• We train our category embedding model and word2vec model on the
“One Billion Word Language Model Benchmark” dataset.
18
6.2 Experimental Setup
• Baselines
▫ ODP [1]
▫ PV [2]
▫ CNN [3]
• Proposed models
▫ ODP_CV (category-level semantics)
▫ ODP_WV (word-level semantics)
▫ ODP_CV+WV (category- and word-level semantics)
19
[1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S., “Semantic contextual advertising based on the open directory project”, ACM Trans. on the Web 7(4), 2013, 24:1-24:22
[2] Le, Q.V., Mikolov, T., “Distributed representations of sentences and documents”, ICML 2014, 1188-1196
[3] Kim, Y., “Convolutional neural networks for sentence classification”, EMNLP 2014, 1746-1751
6.3 Experimental Results
20
• Comparison of Category Vector Generations
6.3 Experimental Results
• We found that the curve reaches a peak at α = 0.9. This result shows that the category vector plays a major role in the performance of ODP_CV+WV.
• We observed that when the weight of the category vector is 1.0, the performance drops sharply. This means that the word overlap feature is still helpful.
21
• Classification Performance based on Different α Values
6.3 Experimental Results
• ODP_CV+WV outperforms ODP by over 9%, 12%, and 10% on average in terms of precision, recall, and F1-score, respectively.
• We observe that CNN exhibits better performance than ODP in moderate-scale text classification.
22
• Classification Performance on the ODP Dataset
[1] Lee, J.H., Ha, J., Jung, J.Y., Lee, S., “Semantic contextual advertising based on the open directory project”, ACM Trans. on the Web 7(4), 2013, 24:1-24:22
[2] Le, Q.V., Mikolov, T., “Distributed representations of sentences and documents”, ICML 2014, 1188-1196
[3] Kim, Y., “Convolutional neural networks for sentence classification”, EMNLP 2014, 1746-1751
6.3 Experimental Results
23
• Classification Performance on the NYT Dataset
(2,735 categories)
• ODP_CV+WV outperforms all the other methods.
• We also observe that both ODP_CV and ODP_WV outperform ODP. These results clearly demonstrate that both category and word vectors are effective for text classification.
6.4 Analysis
24
• Nearest Words of Category Vector and Highly Weighted
Words in Centroid Vector of ODP Categories
• We also observe that the category vector identifies semantically related words that do not appear in the ODP knowledge base.
7. Conclusion
1. We have proposed novel joint models of the explicit and implicit representation techniques to handle large-scale text classification.
2. We have incorporated the well-known word2vec model into the ODP-based classification framework.
3. We have verified the large-scale classification performance of the proposed methodology using real-world datasets.
25
Q & A
26
* Explicit Knowledge Representation
27
* Implicit Knowledge Representation: Embedding
28

Editor's Notes

  • #2 Hello, this is Kang-Min Kim. I'll talk about “Incorporating Word Embeddings into Open Directory Project based Large-Scale Classification”.
  • #4 I will start with the problem statement. In our paper, the research question is “How to incorporate word embeddings into explicit knowledge representation to improve the performance of large-scale text classification?” For large-scale classification, many studies have been done with an explicit representation model, which uses a popular knowledge base such as Probase, Wikipedia, or the Open Directory Project. Explicit representation models are effective in text classification. However, the performance of these models is limited by the associated knowledge bases. To alleviate this limitation, we incorporate implicit representation, such as word embeddings, into the explicit representation.
  • #5 So, our technical contributions are as follows. We propose a novel methodology for large-scale text classification that utilizes both the explicit and implicit representations. We develop two novel joint models of ODP-based classification and word2vec to generate category vectors that represent the semantics of ODP categories. In addition, we develop a new semantic similarity measure that utilizes both the category and word vectors. Finally, we demonstrate the efficacy of the proposed methodology through extensive experiments on real-world datasets. Our approach significantly outperforms the state-of-the-art techniques.
  • #6 Recently, implicit representation models, such as embeddings or deep learning, have been successfully adopted for the text classification task due to their outstanding performance. However, neural network-based classification usually requires a lot of training data per class. In large-scale text classification, the training data for each class is relatively insufficient and skewed. Thus, such implicit representation models are limited to small- or moderate-scale text classification.
  • #7 To handle large-scale text classification, our previous works have utilized popular knowledge bases, such as the Open Directory Project.
  • #8 The Open Directory Project is a large-scale, taxonomy-structured web directory constructed by volunteer editors.
  • #9 We have used an ODP-based knowledge representation to represent ODP knowledge: all ODP documents in each category are represented as bags-of-words.
  • #10 We then calculate the averaged term vector with the TF-IDF scheme. However, due to the large-scale taxonomy structure of the ODP, each ODP category contains a different number of documents, sometimes resulting in sparsity or unavailability of training documents. Our previous work therefore proposed a machine learning technique that utilizes the hierarchical structure: it merges the centroids of the descendant categories and exhibits robust performance in large-scale text classification.
  • #11 However, this approach cannot capture the semantic relations between words. For example, given a document “Trump became prez”, the document and each ODP category are represented by the ODP-based knowledge representation, and the similarity is computed by cosine similarity. As you can see, the term weight of trump in the given document is matched against the term weight of trump in the President and Casinos categories. However, the existing ODP-based classification cannot relate the term weights of prez and president. Therefore, this classifier may classify the document into the Game/Gambling/Casinos category.
  • #12 To complement the ODP-based text classification, we adopt word2vec, a popular word embedding technique. Word2vec directly learns the representation of words from context words. Trained word vectors with similar semantic meanings are located in close proximity. In addition, word vectors can be composed by element-wise addition; this property is called “additive compositionality”.
  • #13 In order to utilize this property of word2vec, we propose two joint models of ODP-based text classification and word2vec. These joint models generate category vectors, which represent the semantics of ODP categories.
  • #14 The first approach generates the category vector by using algebraic operations. For example, the ODP-based knowledge representation contains words, such as president, government, and trump, along with the term weight of each word. We multiply the term weight of each word by its word vector obtained from the pre-trained word2vec model. Then, the weighted word vectors are composed into a category vector by element-wise addition. Finally, we obtain the category vector of Society/Government/President.
  • #15 Our second approach extends word2vec to embed both category and word vectors. For example, when Trump is the target word and “US President … urged congress” are the context words, trump is trained by predicting the context words; at the same time, ODP-based classification identifies candidate categories such as Game/Gambling and Society/Government/President. We then select the most appropriate ODP category for the current context. When the context is “US President urged congress”, the most appropriate category may be Society/Government/President. Finally, we train the category vector of Society/Government/President. The President category is expected to represent the collective semantics of words under this category, such as Trump, president, White House, and so on.
  • #16 We can classify the given document into the most relevant category by calculating the cosine similarity between the category and document vectors. However, word overlap is still an important factor for text classification. So, we develop a novel semantic similarity measure, which captures both the semantic relations between words and the semantics of ODP categories.
  • #17 First, we propose a semantic similarity measure that considers word-level semantics. In our similarity measure, if two words have similar meanings, we combine their term weights by using the word vectors. We first calculate the similarity between all word pairs. In this example, the similarity between prez and president is 0.7, which is larger than the threshold 0.6, so we consider them to have similar meanings. Then, we multiply the term weight of prez by the term weight of president and by 0.7.
  • #18 Second, we develop a robust similarity measure by utilizing both the category and word vectors. A category vector is utilized as a pseudo word in the process of computing semantic similarity. Therefore, the similarity measure captures the semantic relations between the category and the document and between words and the category, as well as the semantic relations between words. We examined how to determine the pseudo term weights of the category and document through extensive parameter experiments.
  • #19 Now, I will proceed with the experiment part. First, I will describe the datasets. To obtain a well-organized ODP taxonomy, we apply heuristic rules and build our own taxonomy with 2,735 categories. To construct the moderate-scale classification dataset, we use only the 13 top-level categories. In addition to the ODP test datasets, we use 120 New York Times articles.
  • #20 The baselines include the existing ODP-based text classification, the paragraph vector (PV) text classifier, and the convolutional neural network (CNN) text classifier. There are three proposed models: ODP_CV, ODP_WV, and ODP_CV+WV.
  • #21 We first compare the two methods for generating category vectors. This table shows the results; unexpectedly, we observed that the simple algebraic-operation variant clearly outperforms the relatively elaborate embedding variant.
  • #22 Next, we performed a parameter-setting experiment to determine the pseudo term weight. This figure shows the classification performance for different α values. We found that the curve reaches a peak at α = 0.9. This result shows that the category vector plays a major role in the performance of ODP_CV+WV.
  • #23 The table on the left summarizes the experimental results for text classification on the ODP test dataset with 2,735 categories. We observed that ODP_CV+WV outperforms all the other proposed methods, as well as the baselines. Our experimental results show that CNN performs the worst among the six methods, so we also compare the performance of CNN with the ODP baseline at moderate scale. From the table on the right, we observed that CNN exhibits better performance than ODP in moderate-scale text classification. However, from both tables, we confirm that CNN is indeed limited to moderate-scale text classification.
  • #24 This table shows the results for classification performance on the New York Times dataset. Again, ODP_CV+WV outperforms all the other methods.
  • #25 This slide shows the qualitative analysis of the nearest words of ODP category vectors. We observed that the category vector captures the semantics better than the centroid vector. Interestingly, we also observe that the category vector identifies semantically related words that do not appear in the ODP knowledge base, such as Henin, a former Belgian professional tennis player.
  • #26 We conclude as follows.
  • #28 Recently, text classification techniques based on neural networks have shown good performance and gained popularity. However, such neural-network-based classification techniques require a lot of training data per class. Therefore, these models are limited to small- or moderate-scale classification results. Moreover, they are difficult for humans to interpret (implicit).
  • #29 Recently, text classification techniques based on neural networks have shown good performance and gained popularity. However, such neural-network-based classification techniques require a lot of training data per class. Therefore, these models are limited to small- or moderate-scale classification results. Moreover, they are difficult for humans to interpret (implicit).