Comments oriented blog summarization by sentence extraction
Upcoming SlideShare
Loading in...5
×
 

Comments oriented blog summarization by sentence extraction

on

  • 1,128 views

 

Statistics

Views

Total Views
1,128
Views on SlideShare
1,102
Embed Views
26

Actions

Likes
0
Downloads
5
Comments
0

2 Embeds 26

http://web204seminar.blogspot.com 18
http://web204seminar.blogspot.tw 8

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Comments oriented blog summarization by sentence extraction Comments oriented blog summarization by sentence extraction Presentation Transcript

  • Author: MeishanHu, Aixin Sun, Ee-Peng Lim
    Publication: CIKM’07
    Presenter: Jhih-Ming Chen
    Comments-Oriented Blog Summarization by Sentence Extraction
    1
  • Outline
    Introduction
    Problem Definition
    ReQuT Model
    Reader-, Quotation- and Topic- Measures
    Word Representativeness Score
    Sentence Selection
    User Study and Experiments
    User Study
    Experimental Results
    Conclusion
    2
  • Introduction
    Readers treat comments associated with a post as an inherent part of the post.
    existing research largely ignore comments by focusing on blog posts only
    This paper conducted a user study on summarizing blog posts by labeling representative sentences in those posts.
    to find out whether the reading of comments would change a reader’s understanding about the post
    3
  • Introduction
    This paper focus on the problem of comments-oriented blog post summarization.
    summarize a blog post by extracting representative sentences from the post using information hidden in its comments
    4
  • Introduction
    5
    Sentence Detection
    splits blog post content into sentences
    Word Representativeness Measure
    weighs words appearing in comments
    Sentence Selection
    computes a representativeness score for each sentence based on representativeness of its contained words
  • Problem Definition
    6
    Given a blog post P
    P = {s1, s2, ... , sn}
    si = {w1, w2, ... , wm}
    C = {c1, c2,... , ck}associated with P
    The task of comments-oriented blog summarization is to extract a subset of sentences from P, denoted by Sr (Sr ⊂ P), that best represents the discussion in C.
  • Problem Definition
    7
    One straightforward approach
    compute a representativeness score for each sentence si, denoted by Rep(si), and select sentences with representativeness scores above a given threshold
    Intuitively, word representativeness can be measured by counting the number of occurrences of a word in comments
    Binary
    Rep(wk)= 1if wkappears in at least one comment and Rep(wk) = 0 otherwise.
    Comment Frequency (CF)
    Rep(wk) is the number of comments containing word wk
    Term Frequency (TF)
    Rep(wk)is the number of occurrences of wkin all comments associated with a blog post.
  • Problem Definition
    8
    Binary captures minimum information; CF and TF capture slightly more.
    Other information available in comments that could be very useful are ignored.
    e.g., authors of comments, quotations among comments and so on.
    All three measures suffer from spam comments.
  • REQUT Model
    9
    A comment, other than its content, is often associated with an author, a time-stamp, and even a permalink.
    These observations provide us guidelines on measuring word representativeness.
    A reader often mentions another reader’s name to indicate that the current comment is a reply to previous comment(s).
    A comment may contain quoted sentences from one or more comments.
    Discussion in comments often branches into several topics.
  • Reader-, Quotation- and Topic- Measures
    10
    Based on the three observations, we believe that a word is representative if it is written by authoritative readers, appears in widely quoted comments, and represents hotly discussed topics.
  • Reader-, Quotation- and Topic- Measures
    11
    With Observation 1
    given the full set of comments to ablog, we construct a directed reader graph GR=(VR, ER)
    ra ∈ VRis a reader
    eR(rb,ra)∈ ERexists if rbmentionsrain one of rb’scomments
    WR(rb,ra) is the ratio between the numberof times rbmention raagainst all timesrbmention otherreaders (including ra)
  • Reader-, Quotation- and Topic- Measures
    12
    With Observation 1
    Compute reader authority
    |R|denotes the total number of readers of the blog
    d is the damping factor (usually set d to 0.85)
    The reader measure of a word wkdenoted by RM(wk)
    tf(wk,ci)is the term frequency of word wkin comment ci
    ci← rameansthatciisauthored by reader ra
  • Reader-, Quotation- and Topic- Measures
    13
    With Observation 2
    for the set of comments associatedwith each blog post, we construct a directed acyclic quotationgraph
    GQ = (VQ,EQ)
    ci∈ VQis acomment
    eQ(cj,ci)∈ EQindicatescjquotedsentences from ci
    WQ(cj,ci)is 1over the number of comments that cjever quoted
  • Reader-, Quotation- and Topic- Measures
    14
    With Observation 2
    Derive the quotation degree D(ci)of a comment ci
    |C| is the number of comments associated with the givenpost
    A comment that is not quoted by anyother comment receives a quotation degree of 1/|C|
    The quotation measure of a word wkdenoted by QM(wk)
    wk∈cimeans that word wkappears in comment ci
  • Reader-, Quotation- and Topic- Measures
    15
    With Observation 3
    given the set of comments associatedwith each blog post, we group these comments intotopic clusters using a Single-Pass Incremental Clustering algorithm
    the similarity threshold in clustering comments was empirically set to 0.4
    hotly discussed topic has a large numberof comments all close to the topic cluster centroid
  • Reader-, Quotation- and Topic- Measures
    16
    With Observation 3
    Compute the importance of a topiccluster
    |ci| is the length of comment ciinnumberof words
    C is the set of comments
    sim(ci,tu)is thecosine similarity between comment ciand the centroid oftopic cluster tu
    The topic measure of a word wkdenotedby TM(wk)
    ci∈ tudenotes comment ciis clustered into topic cluster tu
  • Word Representativeness Score
    17
    Rep(wk) is thecombination of reader-, quotation- and topic- measures inReQuTmodel.
    Rep(wk)=α ∙ RM(wk)+β ∙ QM(wk)+γ ∙ TM(wk)
    0 ≤ α, β, γ ≤ 1.0and α + β + γ = 1.0
  • Sentence Selection
    18
    Density-based selection (DBS)
    wordsappearing in comments as keywords and therest non-keywords
    K is the total number ofkeywords contained in si
    Score(wj) is the score of keywordwj
    distance(wj, wj+1)is the number of non-keywords(including stopwords) between the two adjacent keywordswj and wj+1 in si
  • Sentence Selection
    19
    Summation-based selection (SBS)
    give a higher representativeness score to a sentence ifit contains more representative words
    |si|is the length of sentence siin number of words (includingstopwords)
    τ > 0is a parameter to flexibly control the contribution of a word’s representativeness score
  • User Study and Experiments
    20
    We collected data from two famous blogs, Cosmic Variance and IEBlog, both having relatively large readershipand being widely commented.
    Cosmic Variancehas more loyalbut fewer readers with very diverse topics covered in posts.
    IEBlog has less loyal but more readers, with topicsmainly in Web development.
  • User Study
    21
    Our hypothesis is thatone’s understanding about a blog post does not change afterreading the comments associated with the post.
    The user study was conducted in two phrases
    3 human summarizers, 20 blog posts, nearly 1000 comments associated with the 20 posts.
    Select approximately 30%of sentences from each post withoutcomments as its summary
    The selectedsentences served as a labeled dataset known as RefSet-1.
    Select approximately 30%of sentences from each posts and their comments as its summary
    The selectedsentences served as a labeled dataset known as RefSet-2.
  • User Study
    22
    For each human summarizer, we computed the level ofself-agreementshown in Table.
    Self-agreement level is definedby the percentage of sentences labeled in both referencesets against sentences in RefSet-1 by the same summarizer.
    That is,reading comments does change one’s understanding aboutblog posts.
  • Experimental Results
    23
    RefSet-2 was used to evaluate the two sentence selection methodswith four word representativeness measures.
    τ=0.2
    α = β = γ = 0.33
    Normalized Discounted Cumulative Gain
  • Conclusion
    24
    Reading commentsdoes affect one’s understanding about a blog post.
    Evaluated two sentence selection methods with four word representativeness measures.
    ReQuTgives the flexibility to measure word representativeness through three aspects, reader, quotation and topic.