Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Text Conversation Task

Kozo Chikai, Graduate School of Information Science and Technology, Osaka University
Yuki Arase, Graduate School of Information Science and Technology, Osaka University
AGENDA
① Introduction
  → Overview of the STC task and the goal of our study
② Methods & System Design
  • WTMF model
  • Word2vec
  • Random Forest and its learning method
③ Experiment
④ Result & Discussion
INTRODUCTION
• Short Text Conversation (STC) Task
  ✓ Retrieve suitable replies from a pool of tweets
    → System's output = a ranking of replies
  ✓ System design: Input → Text (similar to the input) → Output (reply)
METHODS
• Weighted Text Matrix Factorization (WTMF)
  ✓ Directly models text-to-word information, without leveraging correlations between short texts
  ✓ X: term-document matrix (row: document vector; cell: TF-IDF value)
  → Factorized into two matrices such that X ≈ PᵀQ (where P ∈ ℝ^(K×M), Q ∈ ℝ^(K×N), and K is the latent dimension); see the sketch below

Source: Modeling Sentences in the Latent Space
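To make the factorization concrete, here is a minimal sketch of the weighted alternating-least-squares updates used by WTMF, in Python with NumPy. The hyperparameter values (missing-cell weight, regularization, latent dimension) are illustrative assumptions, not the values used in our runs.

```python
import numpy as np

def wtmf(X, k=100, w_missing=0.01, lam=20.0, n_iters=10, seed=0):
    """Weighted Text Matrix Factorization via alternating least squares.
    X: dense (n_docs, n_terms) TF-IDF matrix; returns P (k, n_docs) and
    Q (k, n_terms) such that X ~= P.T @ Q."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    P = 0.01 * rng.standard_normal((k, n_docs))
    Q = 0.01 * rng.standard_normal((k, n_terms))
    W = np.where(X != 0, 1.0, w_missing)  # down-weight unobserved (zero) cells
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        for i in range(n_docs):   # update the latent vector of document i
            A = (Q * W[i]) @ Q.T + reg
            P[:, i] = np.linalg.solve(A, (Q * W[i]) @ X[i])
        for j in range(n_terms):  # update the latent vector of word j
            A = (P * W[:, j]) @ P.T + reg
            Q[:, j] = np.linalg.solve(A, (P * W[:, j]) @ X[:, j])
    return P, Q
```

The similarity between two short texts is then the cosine of their corresponding columns in P.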
METHODS
• Word Embedding
  ✓ Generates word vectors using a neural network
  ✓ Linear operations between word vectors are possible
    (e.g., "king" - "man" + "woman" ≈ "queen")
  ✓ Adopted the implementation known as the Word2vec tool
  → Output: low-dimensional vectors representing words

Source: Efficient Estimation of Word Representations in Vector Space
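As a concrete illustration of the analogy above, the following sketch queries a pretrained word2vec model via the gensim library; the model file name is a placeholder assumption.

```python
from gensim.models import KeyedVectors

# Load a pretrained word2vec model ("word2vec.bin" is a hypothetical path).
vecs = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

# "king" - "man" + "woman" ~= "queen"
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```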
METHODS
• Learning Post-Comment Pairs using Random Forest
  The training pipeline (see the sketch after this list):
  1. Collect a set of (post, comment) tweet pairs
  2. Randomly sample 300,000 positive/negative examples
  3. Reduce the dimension of the tweet vectors to 100 using SVD, giving a feature matrix

       a₁,₁ ⋯ a₁,₁₀₀
        ⋮  ⋱  ⋮
       a₃₀₀₀₀₀,₁ ⋯ a₃₀₀₀₀₀,₁₀₀

  4. Train a Random Forest on this matrix to decide whether an input pair is a likely post-comment pair
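A hedged sketch of this pipeline with scikit-learn: TruncatedSVD stands in for the SVD-based dimension reduction, and the feature layout (concatenated 100-dim post and comment vectors) is our reading of the diagram, not a verified detail of the original system.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

def train_pair_classifier(X_post, X_comment, y, dim=100, seed=0):
    """X_post, X_comment: sparse TF-IDF matrices (n_pairs, n_terms) for the
    post side and comment side of each sampled pair;
    y: 1 for true post-comment pairs, 0 for randomly paired (negative) tweets."""
    svd = TruncatedSVD(n_components=dim, random_state=seed)
    svd.fit(vstack([X_post, X_comment]))   # learn one 100-dim tweet space
    feats = np.hstack([svd.transform(X_post), svd.transform(X_comment)])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(feats, y)                      # learn "is this a plausible reply?"
    return svd, clf
```

At test time, `clf.predict_proba(feats)[:, 1]` gives the post-comment probability used in methods ④ and ⑤.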
METHODS
• Combination of Random Forest and TF-IDF
  For an input post, the Random Forest assigns each tweet in the tweet set a probability of forming a post-comment pair with it (e.g., 0.89, 0.6, 0.2). This probability is combined with the TF-IDF value in two ways (sketched below):
  A) Random Forest → TF-IDF
     Keep only the tweets whose probability exceeds 0.5, then rank them by TF-IDF
  B) Random Forest + TF-IDF
     Sum the probability and the TF-IDF (cosine similarity) value
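The two combination rules fit in a few lines. This sketch assumes the cascade simply drops candidates at probability ≤ 0.5 and that the sum is unweighted, as the slide suggests.

```python
import numpy as np

def rank_replies(rf_probs, tfidf_sims, mode="cascade"):
    """rf_probs: Random Forest P(post-comment) per candidate tweet;
    tfidf_sims: TF-IDF cosine similarity of each candidate to the input post."""
    rf_probs, tfidf_sims = np.asarray(rf_probs), np.asarray(tfidf_sims)
    if mode == "cascade":      # A) Random Forest -> TF-IDF
        score = np.where(rf_probs > 0.5, tfidf_sims, -np.inf)
    else:                      # B) Random Forest + TF-IDF
        score = rf_probs + tfidf_sims
    return np.argsort(-score)  # candidate indices, best reply first
```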
DESIGN OF CONVERSATION SYSTEM
• Five techniques produce the reply ranking:
  ① TF-IDF
  ② WTMF
  ③ Word2vec → TF-IDF
  ④ Random Forest → TF-IDF
  ⑤ Random Forest + TF-IDF
→ Output: a ranking of replies
EXPERIMENT
• Setting
  ✓ Goal: evaluate a ranked list of potential replies to an input tweet
  ✓ 10 annotators assigned a score to each reply (tweet)
    → Scores: +2, +1, and 0
    → The larger the score, the better the reply
  ✓ Evaluation criteria: nDCG@1, nERR@5, and Accuracy (a metric sketch follows the run table below)
    → Each criterion gives each method a value between 0.0 and 1.0
    → Larger scores are better

Name of method | Technique
Oni-J-R1 | ④ Random Forest → TF-IDF
Oni-J-R2 | ⑤ Random Forest + TF-IDF
Oni-J-R3 | ① TF-IDF
Oni-J-R4 | ③ Word2vec → TF-IDF
Oni-J-R5 | ② Weighted Text Matrix Factorization
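For reference, here is a sketch of how the graded scores (+2/+1/0) map to the reported metrics. It follows the standard nDCG and nERR definitions and is not code from the task organizers.

```python
import numpy as np

def ndcg_at_k(gains, k, max_grade=2):
    """gains: graded relevance (0, 1, 2) of replies in ranked order."""
    g = np.asarray(gains, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, g.size + 2))  # 1/log2(rank+1)
    dcg = ((2**g - 1) * discounts).sum()
    ideal = np.sort(np.asarray(gains, dtype=float))[::-1][:k]
    idcg = ((2**ideal - 1) * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def err_at_k(gains, k, max_grade=2):
    g = np.asarray(gains, dtype=float)[:k]
    stop = (2**g - 1) / 2**max_grade      # probability the user is satisfied
    err, p_continue = 0.0, 1.0
    for rank, r in enumerate(stop, start=1):
        err += p_continue * r / rank
        p_continue *= 1.0 - r
    return err

def nerr_at_k(gains, k, max_grade=2):
    denom = err_at_k(sorted(gains, reverse=True), k, max_grade)
    return err_at_k(gains, k, max_grade) / denom if denom > 0 else 0.0
```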
RESULT
[Figure: evaluation results (nDCG@1, nERR@5, and Accuracy) of the five runs]

DISCUSSION
[Figure comparing Oni-J-R1 ('Random Forest → TF-IDF') and Oni-J-R3 ('TF-IDF')]

DISCUSSION (2)
[Figure: Oni-J-R1's method is 'Random Forest → TF-IDF']

DISCUSSION (3)
[Figure comparing Oni-J-R3 ('TF-IDF') and Oni-J-R4 ('Word2vec → TF-IDF')]
SUMMARY
• In this study, we compared conventional methods for handling short text.
• We used unsupervised methods to generate text vectors.
• We also used a supervised method to learn whether a pair of tweets can form a post-comment pair.
• In the formal run of the STC task, ④ Random Forest → TF-IDF (run Oni-J-R1) outperformed the other methods.
• The method using Word2vec showed interesting results in some contexts; however, there is room for improvement.
REFERENCES
• W. Guo and M. Diab. Modeling Sentences in the Latent Space. In Proceedings of ACL (2012).
• Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML (2014).
• L. Shang, T. Sakai, Z. Lu, H. Li, R. Higashinaka, and Y. Miyao. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of the NTCIR-12 Workshop Meeting on Evaluation of Information Access Technologies (2016).
APPENDIX (1)
• Procedure of ③ Word2vec → TF-IDF (a code sketch follows the example)
  1. Extract nouns and verbs from an input post
  2. Extract the 20 most similar words for each noun and verb using word2vec
  3. Search the corpus for tweets containing these words
  4. Generate the reply list from these tweets in the same manner as ① TF-IDF

Example: input post "Did you finish your homework?"
  1. Extracted terms: "finish", "homework"
  2. Top-20 similar words — for "finish": "end", "complete", …, "close"; for "homework": "assignment", "work", …, "question"
  3. Candidate tweets containing these words, e.g., "Have you got the homework done?", "I've finished today's work."
  4. Generate the reply list using cosine similarity
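The expansion-then-rank procedure can be sketched as follows; `corpus_index` (an inverted index from word to tweet ids) and `tfidf_rank` (the plain ① TF-IDF ranker) are hypothetical helpers standing in for the actual retrieval components.

```python
def expand_and_rank(post_terms, vecs, corpus_index, tfidf_rank, topn=20):
    """post_terms: nouns/verbs extracted from the input post (steps 1-2);
    vecs: a trained gensim KeyedVectors model."""
    expanded = set(post_terms)
    for term in post_terms:
        if term in vecs:  # step 2: add the 20 most similar words
            expanded.update(w for w, _ in vecs.most_similar(term, topn=topn))
    candidates = set()
    for word in expanded:  # step 3: look up tweets containing these words
        candidates.update(corpus_index.get(word, ()))
    return tfidf_rank(candidates)  # step 4: rank by TF-IDF cosine similarity
```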
APPENDIX (2)
• Method of generating the reply list
  ✓ We use the post-comment relationship of tweet pairs on Twitter.
  ✓ Why is the similar tweet itself included in the reply list?
    → Because a similar tweet can itself be a useful reply in some contexts.
  ✓ In such cases, the direction of post-comment can be reversed, so either tweet of a pair can serve as a reply to the input post.

For example…
  Post: It is pretty cool in Hokkaido today.
  Comment: Summer is the best season in Hokkaido.
