2. Motivation
Comments left by readers on Web documents contain valuable
information that can be utilized in different information retrieval tasks
including document search, visualization, and summarization.
In this project we aim to summarize a Web document (e.g. a blog post) by
considering the comments left by its readers.
Web documents are now presented with annotations from their
readers in the form of tags, comments, ratings, and so on.
These annotations, comments in particular, are valuable input from users
and can be utilized in many IR tasks.
By considering these comments, the generated summary can better
capture the input of the readers, rather than that of the document's
author alone.
A comments-oriented summary thus provides a balanced view from both
the author and the readers.
3. Introduction
Problem Statement
Given a blog post consisting of a set of sentences P = {s1, s2, . . . , sn} and
a set of comments C = {c1, c2, . . . , cm} associated with the post, the task of
comments-oriented blog summarization is to extract a subset of P,
denoted Sr (Sr ⊂ P), that best represents the discussion in C.
Solution
Score blog sentences based on their similarity with the top-scored
relevant comments.
Comments are scored using an RQT graph/tensor-based approach and a
Named Entity Similarity score.
5. Approach
For summary generation it is important to retrieve relevant comments.
A comment is considered relevant if it reflects the topic discussed in the
blog post or has attracted many replies.
Each comment is scored using the RQT model and the Named Entity
Similarity model, and the top-scored comments are selected as the relevant ones.
A similarity score for each sentence is then calculated by summing the
cosine similarity between that sentence and the top comments.
The top-scored sentences are those that resonated with most
commenters and hence are the most relevant for the summary.
6. RQT Model
Three factors determine the RQT score (Rc) of a comment:
Response Count (Cr): number of replies to the comment.
Topic-Related Cluster Count (Ct): comments are clustered by cosine
similarity to identify topic-related groups.
Quotation Count (Cq): number of times the comment is quoted in other comments.
Rc = Cr + Ct + Cq
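The RQT score is a plain unweighted sum of the three counts. A minimal sketch (the function name and example values are assumptions, not from the original system):

```python
def rqt_score(response_count, topic_cluster_count, quotation_count):
    """RQT score of a comment: Rc = Cr + Ct + Cq (unweighted sum)."""
    return response_count + topic_cluster_count + quotation_count

# A comment with 3 replies (Cr), topic-cluster count 2 (Ct),
# and quoted once by other comments (Cq):
print(rqt_score(3, 2, 1))  # → 6
```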
7. Additional Factors
Likes Count (Cl): our dataset comes from Techcrunch.com, where people
comment using Facebook. Likes on a particular comment also increase its
relevance significantly, so the number of likes (Cl) is added to the weighting:
Rc = Cr + Ct + Cq + Cl
Named Entity Similarity: named entities in a comment are identified with
the Stanford POS tagger, and the named entity score (Ec) is the number of
named entities in the comment.
The final score for each comment is calculated as
Score(C) = Rc + Ec
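The combined comment score can be sketched as below. Counting tokens tagged as proper nouns (NNP/NNPS) is one simple way to approximate the named-entity count from tagger output; the tag choice and helper names here are assumptions:

```python
def named_entity_score(tagged_tokens):
    """Ec: number of named-entity tokens in a comment.
    tagged_tokens: list of (token, POS tag) pairs from a tagger;
    proper-noun tags (NNP, NNPS) are treated as named entities here."""
    return sum(1 for _tok, tag in tagged_tokens if tag in ("NNP", "NNPS"))

def comment_score(rc, ec):
    """Final comment score: Score(C) = Rc + Ec."""
    return rc + ec

# Hypothetical tagger output for "Apple released the iPhone":
tagged = [("Apple", "NNP"), ("released", "VBD"),
          ("the", "DT"), ("iPhone", "NNP")]
ec = named_entity_score(tagged)  # 2 proper-noun tokens
print(comment_score(6, ec))      # → 8
```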
8. Sentence Scoring
Comments whose scores are greater than a threshold value are chosen as
top comments.
The cosine similarity of each sentence with each top comment is
calculated, and sentences are assigned a score based on these similarities:
Score(Si) = Σ CS(Si, c) over the top comments c
CS: cosine similarity; Si: blog sentence
Only the top-scoring 5~7 sentences that contain more than 6~8 words are
selected as the summary of the blog. More or fewer sentences can be
selected depending on the required summary percentage.
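The scoring and selection steps above can be sketched with a simple bag-of-words cosine similarity (a TF-IDF weighting could be substituted; the helper names and the defaults k=5 and min_words=6 are assumptions):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def sentence_score(sentence, top_comments):
    """Score(Si) = sum over top comments c of CS(Si, c)."""
    return sum(cosine_similarity(sentence, c) for c in top_comments)

def select_summary(sentences, top_comments, k=5, min_words=6):
    """Pick the k best-scoring sentences with at least min_words words,
    returned in their original order in the post."""
    eligible = [s for s in sentences if len(s.split()) >= min_words]
    best = set(sorted(eligible,
                      key=lambda s: sentence_score(s, top_comments),
                      reverse=True)[:k])
    return [s for s in sentences if s in best]
```

Filtering on sentence length before ranking keeps very short fragments out of the summary, matching the 6~8-word cutoff described above.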
9. Experiments And Results
10 blogs with a large number of comments were randomly chosen, and
summaries of 30% and 20% of the words were generated.
Reference summaries were produced using online tools, and the system-
generated summaries were compared against them using the ROUGE
Summary Evaluation Package by Chin-Yew Lin.
10. Conclusions
Our approach depends on a number of factors, such as the number of likes
each comment has received and the length of the blog content.
The generated summary was awarded a lower ROUGE score when the
number of comments was not sufficient.
The generated summary was less accurate when none of the comments
were relevant.
By scoring the comments based on named entities, the accuracy of
comment ranking increased significantly.
This system needs more testing and a larger dataset in order to obtain
optimal values of the constants.