Comments-Oriented Blog Summarization by Sentence Extraction
Authors: Meishan Hu, Aixin Sun, Ee-Peng Lim
Publication: CIKM'07
Presenter: Jhih-Ming Chen
Outline
- Introduction
- Problem Definition
- ReQuT Model
  - Reader-, Quotation- and Topic-Measures
  - Word Representativeness Score
  - Sentence Selection
- User Study and Experiments
  - User Study
  - Experimental Results
- Conclusion
Introduction
- Readers treat comments associated with a post as an inherent part of the post.
- Existing research largely ignores comments by focusing on blog posts only.
- This paper conducted a user study on summarizing blog posts by labeling representative sentences in those posts, to find out whether reading the comments would change a reader's understanding of the post.
Introduction
- This paper focuses on the problem of comments-oriented blog post summarization: summarize a blog post by extracting representative sentences from the post, using information hidden in its comments.
Introduction
- Sentence Detection: splits blog post content into sentences.
- Word Representativeness Measure: weighs words appearing in comments.
- Sentence Selection: computes a representativeness score for each sentence based on the representativeness of its contained words.
Problem Definition
Given a blog post P:
- P = {s1, s2, ..., sn}, where each sentence si = {w1, w2, ..., wm}
- C = {c1, c2, ..., ck} is the set of comments associated with P
The task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr ⊂ P), that best represents the discussion in C.
Problem Definition
One straightforward approach: compute a representativeness score for each sentence si, denoted by Rep(si), and select sentences with representativeness scores above a given threshold.
Intuitively, word representativeness can be measured by counting the occurrences of a word in comments:
- Binary: Rep(wk) = 1 if wk appears in at least one comment, and Rep(wk) = 0 otherwise.
- Comment Frequency (CF): Rep(wk) is the number of comments containing word wk.
- Term Frequency (TF): Rep(wk) is the number of occurrences of wk in all comments associated with the blog post.
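As a minimal sketch (the function and variable names are my own, not the paper's), the three baseline measures can be computed from tokenized comments as follows:

```python
from collections import Counter

def baseline_measures(comments):
    """Compute the Binary, CF, and TF baselines for every word
    seen in a list of tokenized comments (illustrative sketch)."""
    cf = Counter()  # Comment Frequency: number of comments containing the word
    tf = Counter()  # Term Frequency: occurrences across all comments
    for comment in comments:
        tf.update(comment)
        cf.update(set(comment))  # count each word at most once per comment
    binary = {w: 1 for w in tf}  # 1 if the word appears in any comment
    return binary, dict(cf), dict(tf)
```

For example, over the comments [["great", "post"], ["great", "summary", "great"]], CF("great") is 2 while TF("great") is 3.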
Problem Definition
- Binary captures minimum information; CF and TF capture slightly more.
- Other information available in comments that could be very useful is ignored, e.g., authors of comments and quotations among comments.
- All three measures suffer from spam comments.
ReQuT Model
A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink.
- Observation 1: A reader often mentions another reader's name to indicate that the current comment is a reply to previous comment(s).
- Observation 2: A comment may contain sentences quoted from one or more other comments.
- Observation 3: Discussion in comments often branches into several topics.
These observations provide guidelines for measuring word representativeness.
Reader-, Quotation- and Topic-Measures
Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.
Reader-, Quotation- and Topic-Measures
With Observation 1: given the full set of comments to a blog, we construct a directed reader graph GR = (VR, ER).
- ra ∈ VR is a reader.
- eR(rb, ra) ∈ ER exists if rb mentions ra in one of rb's comments.
- WR(rb, ra) is the ratio of the number of times rb mentions ra to the number of times rb mentions any reader (including ra).
Reader-, Quotation- and Topic-Measures
With Observation 1: compute reader authority.
- |R| denotes the total number of readers of the blog.
- d is the damping factor (usually set to 0.85).
The reader measure of a word wk is denoted by RM(wk).
- tf(wk, ci) is the term frequency of word wk in comment ci.
- ci ← ra means that ci is authored by reader ra.
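The authority and RM formulas themselves did not survive extraction; the sketch below computes a PageRank-style reader authority over the reader graph and then a reader measure from it. The iteration form, the symbol A, and the combining form of RM are assumptions consistent with the listed ingredients (|R|, d, WR, tf):

```python
def reader_authority(readers, W, d=0.85, iters=50):
    """PageRank-style authority over the reader graph (sketch).
    W maps (rb, ra) to the normalized mention weight WR(rb, ra)."""
    n = len(readers)
    A = {r: 1.0 / n for r in readers}
    for _ in range(iters):
        A = {ra: (1 - d) / n
                 + d * sum(W.get((rb, ra), 0.0) * A[rb] for rb in readers)
             for ra in readers}
    return A

def reader_measure(word, comments, A):
    """RM(word): sum of tf(word, ci) * A(ra) over comments ci authored by
    reader ra; comments is a list of (ra, tokens) pairs (assumed form,
    before any normalization)."""
    return sum(tokens.count(word) * A[ra] for ra, tokens in comments)
```

A reader who is mentioned often ends up with higher authority, so words in that reader's comments weigh more.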
Reader-, Quotation- and Topic-Measures
With Observation 2: for the set of comments associated with each blog post, we construct a directed acyclic quotation graph GQ = (VQ, EQ).
- ci ∈ VQ is a comment.
- eQ(cj, ci) ∈ EQ indicates cj quoted sentences from ci.
- WQ(cj, ci) is 1 over the number of comments that cj ever quoted.
Reader-, Quotation- and Topic-Measures
With Observation 2: derive the quotation degree D(ci) of a comment ci.
- |C| is the number of comments associated with the given post.
- A comment that is not quoted by any other comment receives a quotation degree of 1/|C|.
The quotation measure of a word wk is denoted by QM(wk).
- wk ∈ ci means that word wk appears in comment ci.
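Because the quotation graph is acyclic, a quotation degree matching the description (base value 1/|C|, plus weighted contributions from quoting comments) can be computed by memoized recursion. The exact recursive form is my assumption:

```python
def quotation_degree(comments, quotes, W):
    """D(ci) = 1/|C| + sum over quoting comments cj of WQ(cj, ci) * D(cj).
    quotes[cj] lists the comments cj quoted; the graph is a DAG (sketch)."""
    n = len(comments)
    quoted_by = {c: [] for c in comments}
    for cj, targets in quotes.items():
        for ci in targets:
            quoted_by[ci].append(cj)
    memo = {}
    def D(ci):
        if ci not in memo:
            memo[ci] = 1.0 / n + sum(W[(cj, ci)] * D(cj)
                                     for cj in quoted_by[ci])
        return memo[ci]
    return {c: D(c) for c in comments}
```

An unquoted comment receives exactly 1/|C|, matching the slide's statement.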
Reader-, Quotation- and Topic-Measures
With Observation 3: given the set of comments associated with each blog post, we group these comments into topic clusters using a Single-Pass Incremental Clustering algorithm.
- The similarity threshold in clustering comments was empirically set to 0.4.
- A hotly discussed topic has a large number of comments, all close to the topic cluster centroid.
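A minimal sketch of single-pass incremental clustering over bag-of-words comment vectors, using the 0.4 cosine threshold from the slide (the mean-of-members centroid update is my choice; implementations vary):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(members):
    """Centroid as the coordinate-wise mean of member vectors."""
    total = {}
    for m in members:
        for w, x in m.items():
            total[w] = total.get(w, 0.0) + x
    return {w: x / len(members) for w, x in total.items()}

def single_pass_cluster(vectors, threshold=0.4):
    """Assign each comment vector to the most similar existing centroid,
    or start a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: [centroid, members]
    for vec in vectors:
        best, best_sim = None, threshold
        for cl in clusters:
            sim = cosine(vec, cl[0])
            if sim >= best_sim:
                best, best_sim = cl, sim
        if best is None:
            clusters.append([dict(vec), [vec]])
        else:
            best[1].append(vec)
            best[0] = mean_vector(best[1])
    return clusters
```

A single pass keeps the clustering incremental: each comment is seen once, in order of arrival.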
Reader-, Quotation- and Topic-Measures
With Observation 3: compute the importance of a topic cluster.
- |ci| is the length of comment ci in number of words.
- C is the set of comments.
- sim(ci, tu) is the cosine similarity between comment ci and the centroid of topic cluster tu.
The topic measure of a word wk is denoted by TM(wk).
- ci ∈ tu denotes that comment ci is clustered into topic cluster tu.
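The importance and topic-measure formulas were lost with the slide images; the forms below are assumptions consistent with the listed ingredients (|ci|, sim(ci, tu), and the set of comments C), not the paper's verbatim equations:

```python
def cluster_importance(lengths, sims, total_length):
    """imp(tu): length-weighted closeness of a cluster's comments to its
    centroid, normalized by the total length of all comments (assumed form).
    lengths[i] = |ci| and sims[i] = sim(ci, tu) for each comment in tu."""
    return sum(l * s for l, s in zip(lengths, sims)) / total_length

def topic_measure(tf_per_cluster, importances):
    """TM(wk): sum over clusters of the word's term frequency within the
    cluster's comments times the cluster importance (assumed form)."""
    return sum(tf * imp for tf, imp in zip(tf_per_cluster, importances))
```

Under this form, a cluster scores highly when many long comments sit close to its centroid, i.e., a hotly discussed topic.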
Word Representativeness Score
Rep(wk) is the combination of the reader-, quotation- and topic-measures in the ReQuT model:
Rep(wk) = α·RM(wk) + β·QM(wk) + γ·TM(wk)
where 0 ≤ α, β, γ ≤ 1.0 and α + β + γ = 1.0.
Sentence Selection
Density-based selection (DBS): treats words appearing in comments as keywords and the rest as non-keywords.
- K is the total number of keywords contained in si.
- Score(wj) is the score of keyword wj.
- distance(wj, wj+1) is the number of non-keywords (including stopwords) between the two adjacent keywords wj and wj+1 in si.
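The DBS scoring formula did not survive extraction; this sketch follows the usual density-based passage-scoring pattern, rewarding high-scoring keywords that sit close together. The exact combination below, including using the keyword position gap as the distance, is an assumption:

```python
def dbs_score(tokens, rep):
    """Density-based selection sketch: tokens with an entry in rep are
    keywords; adjacent keyword pairs contribute more when their scores
    are high and the gap between them is small."""
    kw = [(i, rep[t]) for i, t in enumerate(tokens) if t in rep]
    K = len(kw)
    if K < 2:
        return kw[0][1] if K == 1 else 0.0
    total = 0.0
    for (i, s1), (j, s2) in zip(kw, kw[1:]):
        gap = j - i  # position gap between adjacent keywords; at least 1
        total += (s1 * s2) / (gap ** 2)
    return total / (K * (K + 1))
```

Intuitively, a sentence packed with representative words scores higher than one where the same words are spread out.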
Sentence Selection
Summation-based selection (SBS): gives a higher representativeness score to a sentence if it contains more representative words.
- |si| is the length of sentence si in number of words (including stopwords).
- τ > 0 is a parameter to flexibly control the contribution of a word's representativeness score.
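A hedged sketch of SBS consistent with the description: sum a power of each contained word's representativeness and normalize by sentence length. The exact placement of τ is my assumption:

```python
def sbs_score(tokens, rep, tau=0.2):
    """Summation-based selection sketch: length-normalized sum of
    Rep(w) ** tau over the sentence's words found in rep."""
    return sum(rep[t] ** tau for t in tokens if t in rep) / len(tokens)
```

With τ below 1, a power of the score dampens the dominance of any single very representative word.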
User Study and Experiments
We collected data from two famous blogs, Cosmic Variance and IEBlog, both having relatively large readerships and being widely commented.
- Cosmic Variance has fewer but more loyal readers, with very diverse topics covered in posts.
- IEBlog has more but less loyal readers, with topics mainly in Web development.
User Study
Our hypothesis is that one's understanding of a blog post does not change after reading the comments associated with the post.
The user study was conducted in two phases, with 3 human summarizers, 20 blog posts, and nearly 1000 comments associated with the 20 posts.
- Phase 1: select approximately 30% of the sentences from each post, without reading its comments, as its summary. The selected sentences served as a labeled dataset known as RefSet-1.
- Phase 2: select approximately 30% of the sentences from each post, after reading its comments, as its summary. The selected sentences served as a labeled dataset known as RefSet-2.
User Study
For each human summarizer, we computed the level of self-agreement shown in the table.
- The self-agreement level is defined as the percentage of sentences labeled in both reference sets against the sentences in RefSet-1 by the same summarizer.
The observed self-agreement levels indicate that reading comments does change one's understanding of blog posts.
Experimental Results
RefSet-2 was used to evaluate the two sentence selection methods with the four word representativeness measures.
- Parameters: τ = 0.2, α = β = γ = 0.33.
- Evaluation metric: Normalized Discounted Cumulative Gain (NDCG).
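For reference, one common formulation of the NDCG metric named on the slide (gain values and cut-off conventions vary between papers):

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

NDCG is 1.0 when the most relevant sentences are ranked first and decreases as relevant sentences slip down the ranking.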
Conclusion
- Reading comments does affect one's understanding of a blog post.
- Evaluated two sentence selection methods with four word representativeness measures.
- ReQuT gives the flexibility to measure word representativeness through three aspects: reader, quotation, and topic.
