
Comments oriented blog summarization by sentence extraction





  1. 1. Authors: Meishan Hu, Aixin Sun, Ee-Peng Lim<br />Publication: CIKM’07<br />Presenter: Jhih-Ming Chen<br />Comments-Oriented Blog Summarization by Sentence Extraction<br />
  2. 2. Outline<br />Introduction<br />Problem Definition<br />ReQuT Model<br />Reader-, Quotation- and Topic- Measures<br />Word Representativeness Score<br />Sentence Selection<br />User Study and Experiments<br />User Study<br />Experimental Results<br />Conclusion<br />
  3. 3. Introduction<br />Readers treat the comments associated with a post as an inherent part of the post.<br />Existing research largely ignores comments and focuses on blog posts only.<br />This paper conducted a user study on summarizing blog posts by labeling representative sentences in those posts.<br />The goal was to find out whether reading comments changes a reader’s understanding of the post.<br />
  4. 4. Introduction<br />This paper focuses on the problem of comments-oriented blog post summarization.<br />The task is to summarize a blog post by extracting representative sentences from the post using information hidden in its comments.<br />
  5. 5. Introduction<br />Sentence Detection: splits blog post content into sentences<br />Word Representativeness Measure: weighs words appearing in comments<br />Sentence Selection: computes a representativeness score for each sentence based on the representativeness of the words it contains<br />
  6. 6. Problem Definition<br />Given a blog post P consisting of n sentences: P = {s1, s2, ... , sn}<br />Each sentence si is a set of words: si = {w1, w2, ... , wm}<br />C = {c1, c2, ... , ck} is the set of comments associated with P<br />The task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr ⊂ P), that best represents the discussion in C.<br />
  7. 7. Problem Definition<br />One straightforward approach: compute a representativeness score for each sentence si, denoted by Rep(si), and select sentences with representativeness scores above a given threshold.<br />Intuitively, word representativeness can be measured by counting the occurrences of a word in comments.<br />Binary: Rep(wk) = 1 if wk appears in at least one comment and Rep(wk) = 0 otherwise.<br />Comment Frequency (CF): Rep(wk) is the number of comments containing word wk.<br />Term Frequency (TF): Rep(wk) is the number of occurrences of wk in all comments associated with a blog post.<br />
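
The three baseline measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming comments are already tokenized into lowercase word lists; the function name `baseline_measures` and the sample comments are hypothetical.

```python
from collections import Counter

def baseline_measures(comments):
    """Return (binary, cf, tf) dictionaries for a list of tokenized comments."""
    tf = Counter()  # Term Frequency: occurrences across all comments
    cf = Counter()  # Comment Frequency: number of comments containing the word
    for tokens in comments:
        tf.update(tokens)       # every occurrence counts
        cf.update(set(tokens))  # each comment counts at most once per word
    binary = {word: 1 for word in tf}  # words never seen in comments score 0
    return binary, dict(cf), dict(tf)

# Hypothetical comments on one post.
comments = [["dark", "matter", "dark"], ["matter", "halo"]]
binary, cf, tf = baseline_measures(comments)
print(binary.get("galaxy", 0), cf["dark"], tf["dark"])
```

Looking words up with `.get(word, 0)` gives the "0 otherwise" case for words that never occur in any comment.
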
  8. 8. Problem Definition<br />Binary captures minimal information; CF and TF capture slightly more.<br />Other potentially useful information available in comments is ignored, e.g., the authors of comments and quotations among comments.<br />All three measures suffer from spam comments.<br />
  9. 9. ReQuT Model<br />A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink.<br />Three observations provide guidelines for measuring word representativeness:<br />Observation 1: A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s).<br />Observation 2: A comment may contain quoted sentences from one or more comments.<br />Observation 3: Discussion in comments often branches into several topics.<br />
  10. 10. Reader-, Quotation- and Topic- Measures<br />Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.<br />
  11. 11. Reader-, Quotation- and Topic- Measures<br />With Observation 1<br />Given the full set of comments to a blog, we construct a directed reader graph GR = (VR, ER)<br />ra ∈ VR is a reader<br />eR(rb, ra) ∈ ER exists if rb mentions ra in one of rb’s comments<br />WR(rb, ra) is the ratio of the number of times rb mentions ra to the number of times rb mentions any reader (including ra)<br />
  12. 12. Reader-, Quotation- and Topic- Measures<br />With Observation 1<br />Compute reader authority<br />|R| denotes the total number of readers of the blog<br />d is the damping factor (usually set to 0.85)<br />The reader measure of a word wk is denoted by RM(wk)<br />tf(wk, ci) is the term frequency of word wk in comment ci<br />ci ← ra means that ci is authored by reader ra<br />
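
The formulas for reader authority and RM(wk) did not survive in this transcript, so the sketch below fills them in with assumptions: a standard PageRank-style update, A(ra) = (1 - d)/|R| + d · Σ WR(rb, ra) · A(rb), and RM(wk) as the sum of tf(wk, ci) weighted by the authority of each comment's author. The function names and sample readers are hypothetical, and the paper's exact form may differ.

```python
from collections import Counter, defaultdict

def reader_weights(mentions):
    """mentions: list of (r_b, r_a) pairs, one per time r_b mentions r_a.
    Returns W_R[(r_b, r_a)]: the share of r_b's mentions that go to r_a."""
    pair_counts = Counter(mentions)
    out_counts = Counter(rb for rb, _ in mentions)
    return {(rb, ra): n / out_counts[rb] for (rb, ra), n in pair_counts.items()}

def reader_authority(readers, weights, d=0.85, iterations=50):
    """Assumed PageRank-style update (not confirmed by the slides):
    A(r_a) = (1 - d) / |R| + d * sum_b W_R(r_b, r_a) * A(r_b)."""
    n = len(readers)
    authority = {r: 1.0 / n for r in readers}
    for _ in range(iterations):
        incoming = defaultdict(float)
        for (rb, ra), w in weights.items():
            incoming[ra] += w * authority[rb]
        authority = {r: (1 - d) / n + d * incoming[r] for r in readers}
    return authority

def reader_measure(comments, authority):
    """comments: list of (author, tokens). Assumed form:
    RM(w) = sum over comments c_i of tf(w, c_i) * A(author of c_i)."""
    rm = defaultdict(float)
    for author, tokens in comments:
        for word, tf in Counter(tokens).items():
            rm[word] += tf * authority[author]
    return dict(rm)

# Hypothetical readers: bob and cy both mention ann; cy also mentions bob.
readers = ["ann", "bob", "cy"]
weights = reader_weights([("bob", "ann"), ("cy", "ann"), ("cy", "bob")])
authority = reader_authority(readers, weights)
rm = reader_measure([("ann", ["dark", "dark"]), ("cy", ["halo"])], authority)
```

In this toy graph ann is mentioned most, so words she writes receive the largest reader measure.
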
  13. 13. Reader-, Quotation- and Topic- Measures<br />With Observation 2<br />For the set of comments associated with each blog post, we construct a directed acyclic quotation graph GQ = (VQ, EQ)<br />ci ∈ VQ is a comment<br />eQ(cj, ci) ∈ EQ indicates that cj quoted sentences from ci<br />WQ(cj, ci) is 1 over the number of comments that cj quoted<br />
  14. 14. Reader-, Quotation- and Topic- Measures<br />With Observation 2<br />Derive the quotation degree D(ci) of a comment ci<br />|C| is the number of comments associated with the given post<br />A comment that is not quoted by any other comment receives a quotation degree of 1/|C|<br />The quotation measure of a word wk is denoted by QM(wk)<br />wk ∈ ci means that word wk appears in comment ci<br />
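
The quotation-degree formula itself is missing from the transcript. The sketch below assumes D(ci) = 1/|C| + Σ WQ(cj, ci) · D(cj), summed over the comments cj that quote ci, which is consistent with the stated base case of 1/|C| for unquoted comments. QM(wk) is assumed to be the sum of tf(wk, ci) · D(ci) over comments containing wk. Comments are identified by integer index; all names are hypothetical.

```python
from collections import Counter, defaultdict
from functools import lru_cache

def quotation_degrees(num_comments, quotes):
    """quotes: set of (j, i) pairs meaning comment j quotes comment i.
    Assumed form: D(c_i) = 1/|C| + sum_j W_Q(c_j, c_i) * D(c_j),
    with W_Q(c_j, c_i) = 1 / (number of comments that c_j quotes)."""
    quoted_by = defaultdict(list)  # i -> comments that quote i
    out_degree = Counter()         # j -> number of comments j quotes
    for j, i in quotes:
        quoted_by[i].append(j)
        out_degree[j] += 1

    @lru_cache(maxsize=None)
    def degree(i):
        # The recursion terminates because the quotation graph is acyclic.
        return 1.0 / num_comments + sum(
            degree(j) / out_degree[j] for j in quoted_by[i])

    return [degree(i) for i in range(num_comments)]

def quotation_measure(comments, degrees):
    """Assumed form: QM(w) = sum over comments c_i of tf(w, c_i) * D(c_i)."""
    qm = defaultdict(float)
    for tokens, deg in zip(comments, degrees):
        for word, tf in Counter(tokens).items():
            qm[word] += tf * deg
    return dict(qm)

# Hypothetical thread: comment 1 quotes 0; comment 2 quotes both 0 and 1.
degrees = quotation_degrees(3, {(1, 0), (2, 0), (2, 1)})
qm = quotation_measure([["dark"], ["matter"], ["halo"]], degrees)
```

Here comment 2 is never quoted and gets 1/3, comment 1 gets 1/3 plus half of comment 2's degree, and comment 0 collects from both.
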
  15. 15. Reader-, Quotation- and Topic- Measures<br />With Observation 3<br />Given the set of comments associated with each blog post, we group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm<br />The similarity threshold in clustering comments was empirically set to 0.4<br />A hotly discussed topic has a large number of comments all close to the topic cluster centroid<br />
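
A single-pass incremental clustering step can be sketched as follows. The slide fixes only the algorithm family and the 0.4 threshold; the raw term-frequency vectors, the summed centroid, and the names `cosine` and `single_pass_cluster` are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def single_pass_cluster(comments, threshold=0.4):
    """Assign each comment, in arrival order, to the most similar existing
    cluster if the similarity reaches the threshold; otherwise open a new one."""
    clusters = []
    for idx, tokens in enumerate(comments):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            # Summed vector: cosine ignores scale, so this acts as the centroid.
            best["centroid"].update(vec)
        else:
            clusters.append({"members": [idx], "centroid": Counter(vec)})
    return clusters

# Hypothetical comments: two on cosmology, one on web development.
comments = [["dark", "matter"], ["dark", "matter", "halo"], ["css", "bug"]]
clusters = single_pass_cluster(comments)
print([c["members"] for c in clusters])
```

Because each comment is seen once and never reassigned, the result depends on comment order, which is the usual trade-off of single-pass clustering.
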
  16. 16. Reader-, Quotation- and Topic- Measures<br />With Observation 3<br />Compute the importance of a topic cluster<br />|ci| is the length of comment ci in number of words<br />C is the set of comments<br />sim(ci, tu) is the cosine similarity between comment ci and the centroid of topic cluster tu<br />The topic measure of a word wk is denoted by TM(wk)<br />ci ∈ tu denotes that comment ci is clustered into topic cluster tu<br />
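
The topic-importance and TM(wk) formulas are not preserved here. One form consistent with the listed symbols, assumed in the sketch below, is T(tu) = Σ_{ci ∈ tu} |ci| · sim(ci, tu) / Σ_{cj ∈ C} |cj| and TM(wk) = Σ tf(wk, ci) · T(tu) over comments ci in cluster tu. The cluster assignment is passed in as lists of comment indices; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def topic_measure(comments, clusters):
    """comments: list of token lists; clusters: list of lists of comment indices.
    Returns (importance per cluster, TM per word) under the assumed formulas."""
    total_len = sum(len(c) for c in comments)
    importance = []
    tm = defaultdict(float)
    for members in clusters:
        centroid = Counter()
        for i in members:
            centroid.update(comments[i])
        # Long comments that sit close to the centroid contribute the most.
        weight = sum(len(comments[i]) * cosine(Counter(comments[i]), centroid)
                     for i in members) / total_len
        importance.append(weight)
        for i in members:
            for word, tf in Counter(comments[i]).items():
                tm[word] += tf * weight
    return importance, dict(tm)

comments = [["dark", "matter"], ["dark", "matter", "halo"], ["css", "bug"]]
importance, tm = topic_measure(comments, [[0, 1], [2]])
```

Under these assumptions the two-comment cosmology cluster outweighs the single web-development comment, so its words receive a higher topic measure.
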
  17. 17. Word Representativeness Score<br />Rep(wk) is the combination of the reader, quotation and topic measures in the ReQuT model.<br />Rep(wk) = α ∙ RM(wk) + β ∙ QM(wk) + γ ∙ TM(wk)<br />0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0<br />
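
The weighted combination can be written directly from the formula on this slide. The sketch assumes the three measures are supplied as dictionaries on comparable scales (whether and how the paper normalizes them is not stated here) and that a word missing from a measure scores 0; `combine_measures` is a hypothetical name.

```python
def combine_measures(rm, qm, tm, alpha=1/3, beta=1/3, gamma=1/3):
    """Rep(w) = alpha * RM(w) + beta * QM(w) + gamma * TM(w)."""
    weights = (alpha, beta, gamma)
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    words = set(rm) | set(qm) | set(tm)
    return {w: alpha * rm.get(w, 0.0) + beta * qm.get(w, 0.0)
               + gamma * tm.get(w, 0.0)
            for w in words}

# Hypothetical measure values for two words.
rep = combine_measures({"a": 0.3}, {"a": 0.6}, {"b": 0.9})
```

Setting one weight to 1 and the others to 0 isolates a single measure, which is a convenient way to compare their individual contributions.
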
  18. 18. Sentence Selection<br />Density-based selection (DBS)<br />Words appearing in comments are treated as keywords and the rest as non-keywords<br />K is the total number of keywords contained in si<br />Score(wj) is the score of keyword wj<br />distance(wj, wj+1) is the number of non-keywords (including stopwords) between the two adjacent keywords wj and wj+1 in si<br />
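
The DBS formula is missing from the transcript. The sketch below assumes a common density-based form, Rep(si) = 1/(K · (K + 1)) · Σ Score(wj) · Score(wj+1) / distance(wj, wj+1)², summed over adjacent keyword pairs. Since the distance as defined on the slide is 0 for directly adjacent keywords, the code adds 1 to it to avoid division by zero; that adjustment and the name `dbs_score` are assumptions, not the paper's stated method.

```python
def dbs_score(sentence, word_scores):
    """sentence: list of tokens; word_scores: Rep(w) for words seen in comments.
    Keywords are the tokens that have an entry in word_scores."""
    positions = [i for i, w in enumerate(sentence) if w in word_scores]
    k = len(positions)
    if k < 2:
        return 0.0  # no adjacent keyword pair to score
    total = 0.0
    for a, b in zip(positions, positions[1:]):
        gap = b - a - 1  # non-keywords between the two keywords
        # Assumption: use (gap + 1) so that adjacent keywords do not divide by zero.
        total += word_scores[sentence[a]] * word_scores[sentence[b]] / (gap + 1) ** 2
    return total / (k * (k + 1))

# Hypothetical scores and sentences.
scores = {"dark": 2.0, "matter": 1.0}
dense = dbs_score(["dark", "matter", "is", "still", "dark"], scores)
sparse = dbs_score(["dark", "is", "not", "the", "only", "matter"], scores)
```

The score rewards sentences whose keywords are both highly weighted and packed closely together, so `dense` comes out higher than `sparse`.
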
  19. 19. Sentence Selection<br />Summation-based selection (SBS)<br />Gives a higher representativeness score to a sentence if it contains more representative words<br />|si| is the length of sentence si in number of words (including stopwords)<br />τ > 0 is a parameter to flexibly control the contribution of a word’s representativeness score<br />
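
The SBS formula is also missing. A simple form that matches the listed symbols, assumed here, is Rep(si) = (1/|si|) · Σ_{wk ∈ si} Rep(wk)^τ; the paper's exact expression may differ. The name `sbs_score` is hypothetical.

```python
def sbs_score(sentence, word_scores, tau=0.2):
    """Assumed form: average of Rep(w) ** tau over all tokens of the sentence.
    Tokens without a score (stopwords, words absent from comments) count as 0."""
    if not sentence:
        return 0.0
    return sum(word_scores.get(w, 0.0) ** tau for w in sentence) / len(sentence)

# Hypothetical scores: "dark" is far more representative than "matter".
scores = {"dark": 16.0, "matter": 1.0}
flat = sbs_score(["dark", "matter"], scores, tau=0.25)   # damped contribution
steep = sbs_score(["dark", "matter"], scores, tau=1.0)   # raw contribution
```

A small τ compresses the gap between highly and weakly representative words, so the score depends more on how many representative words a sentence contains than on any single word's weight.
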
  20. 20. User Study and Experiments<br />We collected data from two famous blogs, Cosmic Variance and IEBlog, both of which have relatively large readerships and are widely commented on.<br />Cosmic Variance has fewer but more loyal readers, and its posts cover very diverse topics.<br />IEBlog has more but less loyal readers, with topics mainly in Web development.<br />
  21. 21. User Study<br />Our hypothesis is that one’s understanding of a blog post does not change after reading the comments associated with the post.<br />The user study was conducted in two phases with 3 human summarizers, 20 blog posts, and nearly 1000 comments associated with the 20 posts.<br />Phase 1: without reading the comments, select approximately 30% of the sentences from each post as its summary.<br />The selected sentences served as a labeled dataset known as RefSet-1.<br />Phase 2: after reading each post together with its comments, select approximately 30% of the sentences from the post as its summary.<br />The selected sentences served as a labeled dataset known as RefSet-2.<br />
  22. 22. User Study<br />For each human summarizer, we computed the level of self-agreement shown in the table.<br />Self-agreement level is defined as the percentage of sentences in RefSet-1 that the same summarizer also labeled in RefSet-2.<br />The finding is that reading comments does change one’s understanding of blog posts.<br />
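
The self-agreement definition translates into a one-line set computation. The sketch below assumes each reference set is given as a collection of sentence identifiers for one summarizer; the function name and sample identifiers are hypothetical.

```python
def self_agreement(refset1, refset2):
    """Fraction of a summarizer's RefSet-1 sentences that also appear in RefSet-2."""
    first, second = set(refset1), set(refset2)
    if not first:
        raise ValueError("RefSet-1 must contain at least one sentence")
    return len(first & second) / len(first)

# Hypothetical labels: sentence indices chosen before and after reading comments.
level = self_agreement({1, 2, 3, 4}, {2, 4, 7})
```

A level well below 1 means the summarizer picked noticeably different sentences after reading the comments.
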
  23. 23. Experimental Results<br />RefSet-2 was used to evaluate the two sentence selection methods with four word representativeness measures.<br />Parameter settings: τ = 0.2 and α = β = γ ≈ 0.33<br />Evaluation metric: Normalized Discounted Cumulative Gain (NDCG)<br />
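
For reference, NDCG at a cutoff k can be computed as below. This uses the common log2 discount; the gain assigned to each sentence (for example, how many summarizers selected it in RefSet-2) is an assumption, since the slides do not say how the paper defines it.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains_in_ranked_order, k):
    """gains_in_ranked_order: gain of every sentence, in the order the system
    ranked them. Normalizes DCG@k by the DCG@k of the ideal ordering."""
    ideal = sorted(gains_in_ranked_order, reverse=True)
    best = dcg(ideal[:k])
    return dcg(gains_in_ranked_order[:k]) / best if best > 0 else 0.0

# Hypothetical gains: number of summarizers (0 to 3) who selected each sentence.
perfect = ndcg([3, 2, 1, 0], k=3)
imperfect = ndcg([0, 1, 3, 2], k=3)
```

A ranking that places the most frequently selected sentences first scores 1.0, and any other ordering scores lower.
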
  24. 24. Conclusion<br />Reading comments does affect one’s understanding of a blog post.<br />Evaluated two sentence selection methods with four word representativeness measures.<br />ReQuT gives the flexibility to measure word representativeness through three aspects: reader, quotation and topic.<br />