Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Text Conversation Task

Kozo Chikai, Graduate School of Information Science and Technology, Osaka University
Yuki Arase, Graduate School of Information Science and Technology, Osaka University
AGENDA
① Introduction
  → Overview of the STC task and the goal of our study
② Methods & System Design
  • WTMF model
  • Word2vec
  • Random Forest and its learning method
③ Experiment
④ Result & Discussion
INTRODUCTION
• Short Text Conversation (STC) Task
  ✓ Retrieve suitable replies from a pool of tweets
    → System's output = a ranking of replies
  ✓ System design: Input → Text (similar to the input) → Output (reply)
METHODS
• Weighted Text Matrix Factorization (WTMF)
  ✓ Directly models text-to-word information, without leveraging correlations between short texts
  ✓ X: term-document matrix (row: document vector; cell: TF-IDF value)
  → Factorized into two matrices such that X ≈ PᵀQ (where P ∈ ℝ^(K×M), Q ∈ ℝ^(K×N), and K is the latent dimension); see the sketch below

Source: Modeling Sentences in the Latent Space
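To make the factorization concrete, here is a minimal sketch of the weighted alternating-least-squares updates used by WTMF, in Python with NumPy. The hyperparameter values (missing-cell weight, regularization, latent dimension) are illustrative assumptions, not the values used in our runs.

```python
import numpy as np

def wtmf(X, k=100, w_missing=0.01, lam=20.0, n_iters=10, seed=0):
    """Weighted Text Matrix Factorization via alternating least squares.
    X: dense (n_docs, n_terms) TF-IDF matrix; returns P (k, n_docs) and
    Q (k, n_terms) such that X ~= P.T @ Q."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    P = 0.01 * rng.standard_normal((k, n_docs))
    Q = 0.01 * rng.standard_normal((k, n_terms))
    W = np.where(X != 0, 1.0, w_missing)  # down-weight unobserved (zero) cells
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        for i in range(n_docs):   # update the latent vector of document i
            A = (Q * W[i]) @ Q.T + reg
            P[:, i] = np.linalg.solve(A, (Q * W[i]) @ X[i])
        for j in range(n_terms):  # update the latent vector of word j
            A = (P * W[:, j]) @ P.T + reg
            Q[:, j] = np.linalg.solve(A, (P * W[:, j]) @ X[:, j])
    return P, Q
```

The similarity between two short texts is then the cosine of their corresponding columns in P.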
METHODS
• Word Embedding
  ✓ Generates word vectors using a neural network
  ✓ Linear operations between word vectors are possible
    (e.g., "king" - "man" + "woman" ≈ "queen")
  ✓ Adopted the implementation known as the Word2vec tool
  → Output: low-dimensional vectors representing words

Source: Efficient Estimation of Word Representations in Vector Space
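As a concrete illustration of the analogy above, the following sketch queries a pretrained word2vec model via the gensim library; the model file name is a placeholder assumption.

```python
from gensim.models import KeyedVectors

# Load a pretrained word2vec model ("word2vec.bin" is a hypothetical path).
vecs = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# "king" - "man" + "woman" ~= "queen"
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```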
METHODS
• Learning Post-Comment Pairs using Random Forest
  The training pipeline (see the sketch after this list):
  1. Collect a set of (post, comment) tweet pairs
  2. Randomly sample 300,000 positive/negative examples
  3. Reduce the dimension of the tweet vectors to 100 using SVD, giving a feature matrix

       a₁,₁ ⋯ a₁,₁₀₀
        ⋮  ⋱  ⋮
       a₃₀₀₀₀₀,₁ ⋯ a₃₀₀₀₀₀,₁₀₀

  4. Train a Random Forest on this matrix to decide whether an input pair is a likely post-comment pair
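A hedged sketch of this pipeline with scikit-learn: TruncatedSVD stands in for the SVD-based dimension reduction, and the feature layout (concatenated 100-dim post and comment vectors) is our reading of the diagram, not a verified detail of the original system.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

def train_pair_classifier(X_post, X_comment, y, dim=100, seed=0):
    """X_post, X_comment: sparse TF-IDF matrices (n_pairs, n_terms) for the
    post side and comment side of each sampled pair;
    y: 1 for true post-comment pairs, 0 for randomly paired (negative) tweets."""
    svd = TruncatedSVD(n_components=dim, random_state=seed)
    svd.fit(vstack([X_post, X_comment]))   # learn one 100-dim tweet space
    feats = np.hstack([svd.transform(X_post), svd.transform(X_comment)])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(feats, y)                      # learn "is this a plausible reply?"
    return svd, clf
```

At test time, `clf.predict_proba(feats)[:, 1]` gives the post-comment probability used in methods ④ and ⑤.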
METHODS
• Combination of Random Forest and TF-IDF
  For an input post, the Random Forest assigns each tweet in the tweet set a probability of forming a post-comment pair with it (e.g., 0.89, 0.6, 0.2). This probability is combined with the TF-IDF value in two ways (sketched below):
  A) Random Forest → TF-IDF
     Keep only the tweets whose probability exceeds 0.5, then rank them by TF-IDF
  B) Random Forest + TF-IDF
     Sum the probability and the TF-IDF (cosine similarity) value
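The two combination rules fit in a few lines. This sketch assumes the cascade simply drops candidates at probability ≤ 0.5 and that the sum is unweighted, as the slide suggests.

```python
import numpy as np

def rank_replies(rf_probs, tfidf_sims, mode="cascade"):
    """rf_probs: Random Forest P(post-comment) per candidate tweet;
    tfidf_sims: TF-IDF cosine similarity of each candidate to the input post."""
    rf_probs, tfidf_sims = np.asarray(rf_probs), np.asarray(tfidf_sims)
    if mode == "cascade":      # A) Random Forest -> TF-IDF
        score = np.where(rf_probs > 0.5, tfidf_sims, -np.inf)
    else:                      # B) Random Forest + TF-IDF
        score = rf_probs + tfidf_sims
    return np.argsort(-score)  # candidate indices, best reply first
```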
DESIGN OF CONVERSATION SYSTEM
• Five techniques produce the reply ranking:
  ① TF-IDF
  ② WTMF
  ③ Word2vec → TF-IDF
  ④ Random Forest → TF-IDF
  ⑤ Random Forest + TF-IDF
→ Output: a ranking of replies
EXPERIMENT
• Setting
  ✓ Goal: evaluate a ranked list of potential replies to an input tweet
  ✓ 10 annotators assigned a score to each reply (tweet)
    → Scores: +2, +1, and 0
    → The larger the score, the better the reply
  ✓ Evaluation criteria: nDCG@1, nERR@5, and Accuracy (a metric sketch follows the run table below)
    → Each criterion gives each method a value between 0.0 and 1.0
    → Larger scores are better

Name of method | Technique
Oni-J-R1 | ④ Random Forest → TF-IDF
Oni-J-R2 | ⑤ Random Forest + TF-IDF
Oni-J-R3 | ① TF-IDF
Oni-J-R4 | ③ Word2vec → TF-IDF
Oni-J-R5 | ② Weighted Text Matrix Factorization
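For reference, here is a sketch of how the graded scores (+2/+1/0) map to the reported metrics. It follows the standard nDCG and nERR definitions and is not code from the task organizers.

```python
import numpy as np

def ndcg_at_k(gains, k, max_grade=2):
    """gains: graded relevance (0, 1, 2) of replies in ranked order."""
    g = np.asarray(gains, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, g.size + 2))  # 1/log2(rank+1)
    dcg = ((2**g - 1) * discounts).sum()
    ideal = np.sort(np.asarray(gains, dtype=float))[::-1][:k]
    idcg = ((2**ideal - 1) * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def err_at_k(gains, k, max_grade=2):
    g = np.asarray(gains, dtype=float)[:k]
    stop = (2**g - 1) / 2**max_grade      # probability the user is satisfied
    err, p_continue = 0.0, 1.0
    for rank, r in enumerate(stop, start=1):
        err += p_continue * r / rank
        p_continue *= 1.0 - r
    return err

def nerr_at_k(gains, k, max_grade=2):
    denom = err_at_k(sorted(gains, reverse=True), k, max_grade)
    return err_at_k(gains, k, max_grade) / denom if denom > 0 else 0.0
```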
RESULT
[Figure: evaluation results (nDCG@1, nERR@5, and Accuracy) of the five runs]

DISCUSSION
[Figure comparing Oni-J-R1 ('Random Forest → TF-IDF') and Oni-J-R3 ('TF-IDF')]

DISCUSSION (2)
[Figure: Oni-J-R1's method is 'Random Forest → TF-IDF']

DISCUSSION (3)
[Figure comparing Oni-J-R3 ('TF-IDF') and Oni-J-R4 ('Word2vec → TF-IDF')]
SUMMARY
• In this study, we compared conventional methods for handling short text.
• We used unsupervised methods to generate text vectors.
• We also used a supervised method to learn whether a pair of tweets can form a post-comment pair.
• In the formal run of the STC task, ④ Random Forest → TF-IDF (run Oni-J-R1) outperformed the other methods.
• The method using Word2vec showed interesting results in some contexts; however, there is room for improvement.
REFERENCES
• W. Guo and M. Diab. Modeling Sentences in the Latent Space. In Proceedings of ACL (2012).
• Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML (2014).
• L. Shang, T. Sakai, Z. Lu, H. Li, R. Higashinaka, and Y. Miyao. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of the NTCIR-12 Workshop Meeting on Evaluation of Information Access Technologies (2016).
APPENDIX (1)
• Procedure of ③ Word2vec → TF-IDF (a code sketch follows the example)
  1. Extract nouns and verbs from an input post
  2. Extract the 20 most similar words for each noun and verb using word2vec
  3. Search the corpus for tweets containing these words
  4. Generate the reply list from these tweets in the same manner as ① TF-IDF

Example: input post "Did you finish your homework?"
  1. Extracted terms: "finish", "homework"
  2. Top-20 similar words — for "finish": "end", "complete", …, "close"; for "homework": "assignment", "work", …, "question"
  3. Candidate tweets containing these words, e.g., "Have you got the homework done?", "I've finished today's work."
  4. Generate the reply list using cosine similarity
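The expansion-then-rank procedure can be sketched as follows; `corpus_index` (an inverted index from word to tweet ids) and `tfidf_rank` (the plain ① TF-IDF ranker) are hypothetical helpers standing in for the actual retrieval components.

```python
def expand_and_rank(post_terms, vecs, corpus_index, tfidf_rank, topn=20):
    """post_terms: nouns/verbs extracted from the input post (steps 1-2);
    vecs: a trained gensim KeyedVectors model."""
    expanded = set(post_terms)
    for term in post_terms:
        if term in vecs:  # step 2: add the 20 most similar words
            expanded.update(w for w, _ in vecs.most_similar(term, topn=topn))
    candidates = set()
    for word in expanded:  # step 3: look up tweets containing these words
        candidates.update(corpus_index.get(word, ()))
    return tfidf_rank(candidates)  # step 4: rank by TF-IDF cosine similarity
```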
APPENDIX (2)
• Method of generating the reply list
  ✓ We use the post-comment relationship of tweet pairs on Twitter.
  ✓ Why is the similar tweet itself included in the reply list?
    → Because a similar tweet can itself be a useful reply in some contexts.
  ✓ In such cases, the direction of post-comment can be reversed, so either tweet of a pair can serve as a reply to the input post.

For example…
  Post: It is pretty cool in Hokkaido today.
  Comment: Summer is the best season in Hokkaido.
