Comments oriented blog summarization by sentence extraction


  1. Comments-Oriented Blog Summarization by Sentence Extraction
     Author: Meishan Hu, Aixin Sun, Ee-Peng Lim
     Publication: CIKM’07
     Presenter: Jhih-Ming Chen
  2. Outline
     Introduction
     Problem Definition
     ReQuT Model
     Reader-, Quotation- and Topic- Measures
     Word Representativeness Score
     Sentence Selection
     User Study and Experiments
     User Study
     Experimental Results
     Conclusion
  3. Introduction
     Readers treat comments associated with a post as an inherent part of the post.
       Existing research largely ignores comments by focusing on blog posts only.
     This paper conducted a user study on summarizing blog posts by labeling representative sentences in those posts,
       to find out whether reading the comments changes a reader’s understanding of the post.
  4. Introduction
     This paper focuses on the problem of comments-oriented blog post summarization:
       summarize a blog post by extracting representative sentences from the post using information hidden in its comments.
  5. Introduction
     Sentence Detection
       splits blog post content into sentences
     Word Representativeness Measure
       weighs words appearing in comments
     Sentence Selection
       computes a representativeness score for each sentence based on the representativeness of the words it contains
  6. Problem Definition
     Given a blog post P
       P = {s1, s2, ..., sn}
       si = {w1, w2, ..., wm}
       C = {c1, c2, ..., ck} associated with P
     The task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr ⊂ P), that best represents the discussion in C.
  7. Problem Definition
     One straightforward approach
       Compute a representativeness score for each sentence si, denoted by Rep(si), and select sentences with representativeness scores above a given threshold.
     Intuitively, word representativeness can be measured by counting the number of occurrences of a word in comments.
       Binary
         Rep(wk) = 1 if wk appears in at least one comment and Rep(wk) = 0 otherwise.
       Comment Frequency (CF)
         Rep(wk) is the number of comments containing word wk.
       Term Frequency (TF)
         Rep(wk) is the number of occurrences of wk in all comments associated with a blog post.
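
The three baseline measures can be sketched in a few lines of Python. This is a minimal illustration, assuming each comment has already been tokenized into a list of words; the function name `baseline_measures` and the toy comments are hypothetical and not taken from the paper.

```python
from collections import Counter


def baseline_measures(comments):
    """Return (binary, cf, tf) word scores for a list of tokenized comments."""
    tf = Counter()  # total occurrences of each word across all comments
    cf = Counter()  # number of comments that contain each word
    for tokens in comments:
        tf.update(tokens)
        cf.update(set(tokens))  # count each word once per comment
    binary = {word: 1 for word in cf}  # words absent from this dict score 0
    return binary, cf, tf


# Toy data (assumption): two short comments on one post.
comments = [["dark", "matter", "dark"], ["matter", "halo"]]
binary, cf, tf = baseline_measures(comments)
```

A word that never appears in a comment is simply missing from all three mappings, so `binary.get(word, 0)` gives the Binary score for any word of the post.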
  8. Problem Definition
     Binary captures minimum information; CF and TF capture slightly more.
     Other potentially useful information available in comments is ignored,
       e.g., authors of comments, quotations among comments, and so on.
     All three measures suffer from spam comments.
  9. ReQuT Model
     A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink.
     Three observations provide guidelines for measuring word representativeness:
       Observation 1: A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s).
       Observation 2: A comment may contain quoted sentences from one or more comments.
       Observation 3: Discussion in comments often branches into several topics.
  10. Reader-, Quotation- and Topic- Measures
      Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.
  11. Reader-, Quotation- and Topic- Measures
      With Observation 1
        Given the full set of comments to a blog, we construct a directed reader graph GR = (VR, ER).
          ra ∈ VR is a reader.
          eR(rb, ra) ∈ ER exists if rb mentions ra in one of rb’s comments.
          WR(rb, ra) is the ratio of the number of times rb mentions ra to the number of times rb mentions any reader (including ra).
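
As an illustration of the edge weights, the sketch below computes WR from a list of mention events. The input format (one `(rb, ra)` pair per mention) and the name `reader_edge_weights` are assumptions made for this example; how mentions are detected in the comment text is not covered.

```python
from collections import Counter, defaultdict


def reader_edge_weights(mentions):
    """Map each edge (r_b, r_a) to W_R(r_b, r_a).

    mentions: iterable of (r_b, r_a) pairs, one pair per time r_b mentions r_a.
    """
    counts = defaultdict(Counter)
    for rb, ra in mentions:
        counts[rb][ra] += 1
    weights = {}
    for rb, targets in counts.items():
        total = sum(targets.values())  # all mentions made by r_b
        for ra, n in targets.items():
            weights[(rb, ra)] = n / total
    return weights


# Toy data (assumption): bob mentions alice twice and carol once.
weights = reader_edge_weights(
    [("bob", "alice"), ("bob", "alice"), ("bob", "carol"), ("carol", "alice")]
)
```

By construction, the outgoing weights of every reader who mentions someone sum to 1.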
  12. Reader-, Quotation- and Topic- Measures
      With Observation 1
        Compute reader authority
          |R| denotes the total number of readers of the blog.
          d is the damping factor (usually set to 0.85).
        The reader measure of a word wk is denoted by RM(wk).
          tf(wk, ci) is the term frequency of word wk in comment ci.
          ci ← ra means that ci is authored by reader ra.
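
The authority formula itself is not preserved in this transcript. The sketch below assumes a conventional PageRank-style update over the reader graph, with d weighting the link term, and assumes the reader measure sums tf(wk, ci) weighted by the authority of each comment's author, without normalization. Both are assumptions and may differ from the paper's exact definitions; the function names are hypothetical.

```python
def reader_authority(readers, weights, d=0.85, iterations=50):
    """PageRank-style authority (assumed form, not confirmed by the slides).

    weights: {(r_b, r_a): W_R(r_b, r_a)} for every edge of the reader graph.
    """
    n = len(readers)
    authority = {r: 1.0 / n for r in readers}
    for _ in range(iterations):
        updated = {r: (1.0 - d) / n for r in readers}
        for (rb, ra), w in weights.items():
            updated[ra] += d * w * authority[rb]
        authority = updated
    return authority


def reader_measure(word, comments, authority):
    """Assumed form: sum of tf(word, c_i) * authority of c_i's author.

    comments: list of (author, tokens) pairs.
    """
    return sum(tokens.count(word) * authority[author] for author, tokens in comments)


# Toy data (assumption): bob and carol both mention alice.
readers = ["alice", "bob", "carol"]
weights = {("bob", "alice"): 1.0, ("carol", "alice"): 1.0}
authority = reader_authority(readers, weights)
rm_dark = reader_measure("dark", [("alice", ["dark", "dark"]), ("bob", ["dark"])], authority)
```

In this toy graph alice ends up with the highest authority because she is the only reader anyone mentions, so words in her comments receive more weight.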
  13. Reader-, Quotation- and Topic- Measures
      With Observation 2
        For the set of comments associated with each blog post, we construct a directed acyclic quotation graph GQ = (VQ, EQ).
          ci ∈ VQ is a comment.
          eQ(cj, ci) ∈ EQ indicates that cj quoted sentences from ci.
          WQ(cj, ci) is 1 over the number of comments that cj ever quoted.
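
The quotation edge weights follow directly from the definition of WQ. A minimal sketch, assuming the quotation relations have already been extracted into a mapping from each comment to the set of comments it quotes (the input format and the name `quotation_edge_weights` are hypothetical):

```python
def quotation_edge_weights(quotes):
    """Map each edge (c_j, c_i) to W_Q(c_j, c_i) = 1 / (number of comments c_j quoted).

    quotes: {c_j: set of comments that c_j quoted}.
    """
    return {
        (cj, ci): 1.0 / len(quoted)
        for cj, quoted in quotes.items()
        for ci in quoted
    }


# Toy data (assumption): c3 quotes c1 and c2, and c2 quotes c1.
weights = quotation_edge_weights({"c3": {"c1", "c2"}, "c2": {"c1"}})
```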
  14. Reader-, Quotation- and Topic- Measures
      With Observation 2
        Derive the quotation degree D(ci) of a comment ci.
          |C| is the number of comments associated with the given post.
          A comment that is not quoted by any other comment receives a quotation degree of 1/|C|.
        The quotation measure of a word wk is denoted by QM(wk).
          wk ∈ ci means that word wk appears in comment ci.
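
The formula for D(ci) is not preserved in this transcript. One form consistent with the statement that an unquoted comment receives 1/|C| is D(ci) = 1/|C| + Σ WQ(cj, ci) ∙ D(cj), summed over the comments cj that quote ci. The sketch below assumes that form, and further assumes QM(wk) sums D(ci) over the comments containing wk; both are assumptions, and the function names are hypothetical.

```python
from functools import lru_cache


def quotation_degrees(comment_ids, weights):
    """Assumed form: D(c_i) = 1/|C| + sum of W_Q(c_j, c_i) * D(c_j) over quoters c_j.

    weights: {(c_j, c_i): W_Q(c_j, c_i)}, where c_j quotes c_i. The graph must be acyclic.
    """
    quoters = {c: [] for c in comment_ids}
    for (cj, ci), w in weights.items():
        quoters[ci].append((cj, w))
    base = 1.0 / len(comment_ids)

    @lru_cache(maxsize=None)
    def degree(ci):
        # Recursion terminates because the quotation graph has no cycles.
        return base + sum(w * degree(cj) for cj, w in quoters[ci])

    return {c: degree(c) for c in comment_ids}


def quotation_measure(word, comment_tokens, degrees):
    """Assumed form: sum of D(c_i) over comments c_i that contain the word."""
    return sum(degrees[c] for c, tokens in comment_tokens.items() if word in tokens)


# Toy data (assumption): c3 quotes c1 and c2, and c2 quotes c1.
degrees = quotation_degrees(
    ["c1", "c2", "c3"],
    {("c3", "c1"): 0.5, ("c3", "c2"): 0.5, ("c2", "c1"): 1.0},
)
```

Under this assumed form, a comment quoted by well-quoted comments accumulates a larger degree, which matches the intuition that words in widely quoted comments are more representative.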
  15. Reader-, Quotation- and Topic- Measures
      With Observation 3
        Given the set of comments associated with each blog post, we group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm.
          The similarity threshold in clustering comments was empirically set to 0.4.
          A hotly discussed topic has a large number of comments all close to the topic cluster centroid.
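
A single-pass incremental clustering loop can be sketched as follows. This is a generic version of the technique, not the authors' implementation: it assumes raw term-count vectors, cosine similarity against the running cluster centroid, and assignment to the most similar cluster when the similarity reaches the 0.4 threshold. The term weighting and the function names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(comments, threshold=0.4):
    """Assign each tokenized comment, in order, to the closest cluster or a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [comment indices]}
    for i, tokens in enumerate(comments):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Summed counts point in the same direction as the mean vector,
            # so the sum serves as the centroid for cosine similarity.
            best["centroid"].update(vec)
            best["members"].append(i)
        else:
            clusters.append({"centroid": vec, "members": [i]})
    return clusters


# Toy data (assumption): two comments on one topic, one on another.
clusters = single_pass_cluster(
    [["dark", "matter", "halo"], ["dark", "matter", "mass"], ["css", "layout", "bug"]]
)
```

Because comments are processed once in arrival order, the result can depend on that order, which is a known property of single-pass clustering.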
  16. Reader-, Quotation- and Topic- Measures
      With Observation 3
        Compute the importance of a topic cluster.
          |ci| is the length of comment ci in number of words.
          C is the set of comments.
          sim(ci, tu) is the cosine similarity between comment ci and the centroid of topic cluster tu.
        The topic measure of a word wk is denoted by TM(wk).
          ci ∈ tu denotes that comment ci is clustered into topic cluster tu.
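
The importance formula is not preserved in this transcript. Based on the symbols defined on the slide, the sketch below assumes the importance of a cluster tu is the length-weighted similarity of its comments, Σ |ci| ∙ sim(ci, tu) over ci ∈ tu, divided by the total length of all comments in C, and that TM(wk) adds the importance of tu once for every comment in tu containing wk. These forms are assumptions; the similarities are passed in precomputed and the function names are hypothetical.

```python
def topic_importance(members, lengths, sims):
    """Assumed form: sum(|c_i| * sim(c_i, t_u) for c_i in t_u) / sum(|c_j| for c_j in C).

    members: comment ids in cluster t_u.
    lengths: {comment id: length in words} for ALL comments of the post.
    sims: {comment id: cosine similarity to the centroid of its own cluster}.
    """
    total_length = sum(lengths.values())
    return sum(lengths[c] * sims[c] for c in members) / total_length


def topic_measure(word, clusters, comment_tokens, importance):
    """Assumed form: add T(t_u) once per comment in t_u that contains the word."""
    return sum(
        importance[topic]
        for topic, members in clusters.items()
        for c in members
        if word in comment_tokens[c]
    )


# Toy data (assumption): cluster t1 holds c1 and c2, cluster t2 holds c3.
comment_tokens = {
    "c1": ["dark", "matter", "dark", "halo"],
    "c2": ["dark", "mass", "and", "energy"],
    "c3": ["css", "bug"],
}
clusters = {"t1": ["c1", "c2"], "t2": ["c3"]}
lengths = {c: len(tokens) for c, tokens in comment_tokens.items()}
sims = {"c1": 1.0, "c2": 0.5, "c3": 1.0}
importance = {t: topic_importance(m, lengths, sims) for t, m in clusters.items()}
tm_dark = topic_measure("dark", clusters, comment_tokens, importance)
```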
  17. Word Representativeness Score
      Rep(wk) is the combination of reader-, quotation- and topic- measures in the ReQuT model.
        Rep(wk) = α ∙ RM(wk) + β ∙ QM(wk) + γ ∙ TM(wk)
        0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0
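
The weighted combination translates directly into code. In this minimal sketch the function name `representativeness` and the equal default weights are illustrative choices; whether the three measures are normalized to a common scale before being combined is not stated on the slide, so the sketch takes them as given.

```python
def representativeness(rm, qm, tm, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Rep(w) = alpha * RM(w) + beta * QM(w) + gamma * TM(w)."""
    coefficients = (alpha, beta, gamma)
    if any(c < 0.0 or c > 1.0 for c in coefficients):
        raise ValueError("alpha, beta and gamma must each lie in [0, 1]")
    if abs(sum(coefficients) - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * rm + beta * qm + gamma * tm


# Toy values (assumption): measures already computed for one word.
score = representativeness(rm=0.3, qm=0.6, tm=0.9)
```

Setting one coefficient to 1 and the others to 0 isolates a single measure, which is a convenient way to compare the contribution of each.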
  18. Sentence Selection
      Density-based selection (DBS)
        Words appearing in comments are treated as keywords and the rest as non-keywords.
          K is the total number of keywords contained in si.
          Score(wj) is the score of keyword wj.
          distance(wj, wj+1) is the number of non-keywords (including stopwords) between the two adjacent keywords wj and wj+1 in si.
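
The DBS formula is not preserved in this transcript. The sketch below assumes a commonly used density-based form built from the same symbols: the sum over adjacent keyword pairs of Score(wj) ∙ Score(wj+1) / distance(wj, wj+1)², scaled by 1 / (K ∙ (K + 1)). It also assumes a minimum distance of 1, so that directly adjacent keywords do not cause a division by zero. Both choices are assumptions, and the function name is hypothetical.

```python
def density_based_score(sentence, scores):
    """Assumed DBS form over adjacent keyword pairs of a tokenized sentence.

    scores: {keyword: Score(w)}; words missing from scores are non-keywords.
    """
    positions = [i for i, word in enumerate(sentence) if word in scores]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to measure density on
    total = 0.0
    for left, right in zip(positions, positions[1:]):
        gap = right - left - 1  # number of non-keywords between the pair
        gap = max(gap, 1)       # assumption: floor at 1 to avoid dividing by zero
        total += scores[sentence[left]] * scores[sentence[right]] / gap ** 2
    return total / (k * (k + 1))


# Toy data (assumption): three keywords in a six-word sentence.
score = density_based_score(
    ["dark", "matter", "is", "not", "a", "halo"],
    {"dark": 2.0, "matter": 1.0, "halo": 3.0},
)
```

Under this form, sentences whose keywords sit close together score higher than sentences with the same keywords spread far apart.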
  19. Sentence Selection
      Summation-based selection (SBS)
        Gives a higher representativeness score to a sentence if it contains more representative words.
          |si| is the length of sentence si in number of words (including stopwords).
          τ > 0 is a parameter to flexibly control the contribution of a word’s representativeness score.
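
The SBS formula is not preserved in this transcript. The sketch below assumes one plausible reading of the symbols: each word's representativeness is raised to the power τ, the results are summed over the sentence, and the sum is divided by the sentence length |si|. It also assumes a repeated word contributes once per occurrence. These are assumptions, and the function name is hypothetical.

```python
def summation_based_score(sentence, rep, tau=0.2):
    """Assumed SBS form: (1 / |s_i|) * sum of Rep(w) ** tau over the words of s_i.

    rep: {word: Rep(w)}; words that never appear in comments default to 0.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not sentence:
        return 0.0
    return sum(rep.get(word, 0.0) ** tau for word in sentence) / len(sentence)


# Toy data (assumption): tau = 0.5 makes the arithmetic easy to follow.
score = summation_based_score(
    ["dark", "matter", "is", "dark"], {"dark": 4.0, "matter": 9.0}, tau=0.5
)
```

With τ below 1, large word scores are dampened, so a sentence cannot dominate the ranking on the strength of a single highly scored word.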
  20. User Study and Experiments
      We collected data from two famous blogs, Cosmic Variance and IEBlog, both having relatively large readership and being widely commented.
        Cosmic Variance has more loyal but fewer readers, with very diverse topics covered in posts.
        IEBlog has less loyal but more readers, with topics mainly in Web development.
  21. User Study
      Our hypothesis is that one’s understanding about a blog post does not change after reading the comments associated with the post.
      The user study was conducted in two phases.
        3 human summarizers, 20 blog posts, nearly 1000 comments associated with the 20 posts.
        Select approximately 30% of sentences from each post, without its comments, as its summary.
          The selected sentences served as a labeled dataset known as RefSet-1.
        Select approximately 30% of sentences from each post, with its comments, as its summary.
          The selected sentences served as a labeled dataset known as RefSet-2.
  22. User Study
      For each human summarizer, we computed the level of self-agreement shown in the table.
        Self-agreement level is defined by the percentage of sentences labeled in both reference sets against sentences in RefSet-1 by the same summarizer.
      That is, reading comments does change one’s understanding about blog posts.
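
The self-agreement level defined above reduces to a set intersection. A minimal sketch, assuming each reference set is given as a collection of sentence identifiers selected by one summarizer (the function name and the toy identifiers are hypothetical):

```python
def self_agreement(refset1, refset2):
    """Fraction of a summarizer's RefSet-1 sentences that also appear in their RefSet-2."""
    first, second = set(refset1), set(refset2)
    if not first:
        raise ValueError("RefSet-1 must contain at least one sentence")
    return len(first & second) / len(first)


# Toy data (assumption): sentence ids chosen by one summarizer in each phase.
level = self_agreement({1, 2, 3, 4}, {2, 3, 5, 6})
```

A level of 1.0 would mean the summarizer kept every sentence after reading the comments; lower values indicate that the comments changed which sentences they considered representative.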
  23. Experimental Results
      RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures.
        τ = 0.2
        α = β = γ = 0.33
        Normalized Discounted Cumulative Gain (NDCG)
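
For readers unfamiliar with NDCG, the sketch below shows one common formulation, in which the gain at rank i is discounted by log2(i + 1) and the result is divided by the score of the ideal ordering. The slide does not say which NDCG variant was used or how gains were assigned to sentences, so this is a generic illustration; the function names and the toy gains are assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain; the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg(gains):
    """DCG of the given ranking divided by the DCG of the same gains in ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy data (assumption): gains of the sentences in the order a method ranked them.
quality = ndcg([1, 3, 2])
```

The list passed to `ndcg` should cover every candidate sentence of the post, otherwise the ideal ordering used for normalization is incomplete.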
  24. Conclusion
      Reading comments does affect one’s understanding about a blog post.
      Evaluated two sentence selection methods with four word representativeness measures.
      ReQuT gives the flexibility to measure word representativeness through three aspects: reader, quotation and topic.
