Summarizing discussion threads

• Suzan Verberne
• SAKE, 12-12-2016

About DISCOSUMO
• Automatic summarization of discussion forum threads
• Radboud University:
- Antal van den Bosch
- Suzan Verberne
• Tilburg University:
- Emiel Krahmer
- Sander Wubben
• Sanoma Media

Problem
• Discussion forums on the web are an important source of information.
• But: forum threads can be extremely long
• → Finding information in a forum thread can be a challenge, especially
when accessing the forum from a mobile device
Can we serve mobile forum users better by showing them summaries of
long threads?

Problem
How to summarize a forum thread?
• Question answering forums (e.g. StackOverflow):
- the opening post is a (technical) question and the responses are
answers to that question
- the best answer may be selected by the forum community through
voting
• Discussion forums (e.g. Viva, Autoweek, reddit):
- opinions and experiences are shared
- there is generally no such thing as the best answer
- threads can consist of dozens or hundreds of posts

Case: Viva forum
Viva Forum (forum.viva.nl/)
• Dutch
• predominantly female user community
• 19 million page views per month (1.5 million unique visitors)
• readable for everyone; sample obtained from Sanoma
• most threads: experience and opinion sharing
• no hierarchy in the threads (‘flat structure’, but quotes possible)
• no liking/upvoting
• 21% of threads on the Viva forum have 20 or more posts

Approach
Post/sentence selection:
• Show the user only the most important information
• Hide the less relevant information in between

How is it made?
1. Collect example data
2. Train classifiers to learn which posts and sentences in a thread are
the most important
3. Apply the classifiers to unseen threads
4. Use a threshold on the classifier prediction to show more or fewer
posts and sentences
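
Step 4 can be sketched as follows. This is a minimal illustration, not the DISCOSUMO implementation: the scores are assumed to be classifier outputs between 0 and 1, the values are made up, and the names `select_posts` and `threshold` are hypothetical.

```python
def select_posts(scores, threshold):
    """Return the indices of posts whose predicted importance meets the threshold.

    scores: one float per post, in thread order (assumed classifier output).
    The original thread order is preserved in the result.
    """
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical predicted importance for a six-post thread.
scores = [0.91, 0.15, 0.62, 0.40, 0.08, 0.77]

short_summary = select_posts(scores, threshold=0.7)  # high threshold: fewer posts
long_summary = select_posts(scores, threshold=0.3)   # low threshold: more posts
```

Lowering the threshold only ever adds posts to the summary, which is what lets a reader expand or shrink it.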

Collect example data
• If you ask five humans to create a summary of a discussion thread, they
create five different summaries
• But: a post selected by four of them is more important than a post
selected by one of them
• We showed each of 106 long Viva threads to 10 different raters and
asked them to select the posts that they consider to be the most
important for the thread (number of selected posts decided by rater)
• 57 subjects participated in the study: all female, average age 27
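
The idea that a post selected by more raters is more important can be sketched as a simple vote count. The data below is invented for illustration, and `count_votes` is a hypothetical helper, not part of the study's tooling.

```python
from collections import Counter


def count_votes(selections):
    """Count how many raters selected each post.

    selections: one set of selected post ids per rater.
    """
    votes = Counter()
    for selected in selections:
        votes.update(selected)  # each rater contributes at most one vote per post
    return votes


# Five hypothetical raters summarizing the same thread.
raters = [{1, 3, 7}, {1, 7}, {1, 2, 7}, {1, 3}, {9}]
votes = count_votes(raters)

# Posts ordered from most to least selected (ties broken by post id).
ranking = sorted(votes, key=lambda post: (-votes[post], post))
```

The vote counts can then serve as a graded importance label for each post instead of a single yes/no decision.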

Results: Usefulness of thread summarization
• Median usefulness score: 3 (on a 5-point scale)
• Standard deviation: 1.14 (averaged over threads)
• For 92% of the threads, at least one subject gave a usefulness score of 3
or higher
• For 62% of the threads, at least half of the subjects gave a usefulness
score of 3 or higher
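
The two percentages above are shares of threads that satisfy a condition on their per-subject scores. A minimal sketch of that computation, using invented scores and hypothetical function names:

```python
def share_with_any_high(scores_per_thread, min_score=3):
    """Fraction of threads where at least one subject gave min_score or higher."""
    hits = sum(any(s >= min_score for s in scores) for scores in scores_per_thread)
    return hits / len(scores_per_thread)


def share_with_half_high(scores_per_thread, min_score=3):
    """Fraction of threads where at least half of the subjects gave min_score or higher."""
    hits = 0
    for scores in scores_per_thread:
        high = sum(s >= min_score for s in scores)
        if 2 * high >= len(scores):
            hits += 1
    return hits / len(scores_per_thread)


# Invented usefulness scores (5-point scale), one list per thread.
threads = [[1, 2, 3], [4, 4, 2, 1], [1, 1, 2], [5, 3, 3, 2]]

any_high = share_with_any_high(threads)
half_high = share_with_half_high(threads)
```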

Results: Agreement between human raters
• Median number of posts selected per thread: 7, with a large standard
deviation over raters (6.4)
• The agreement between the human summarizers was low (as expected):
- Mean Cohen’s Kappa: 0.117
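
For reference, Cohen's Kappa for two raters compares the observed agreement with the agreement expected by chance. The sketch below assumes each rater's summary is encoded as one 0/1 label per post (1 = selected) and averages Kappa over all rater pairs; whether the study averaged in exactly this way is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's Kappa for two equal-length lists of 0/1 labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both select the post, or both skip it.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave the same constant label; Kappa is undefined here.
        # Returning 0.0 is a choice made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(raters):
    """Average Kappa over all pairs of raters."""
    pairs = list(combinations(raters, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Two hypothetical raters on a ten-post thread.
rater_a = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(rater_a, rater_b)
```

Because most posts are not selected by either rater, raw agreement is high by chance alone; Kappa corrects for that, which is why the values are so much lower than the raw overlap suggests.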

What determines the importance of a post or sentence?
• Number of words (longer = more important)
• Position in the thread (early response = more important)
• Punctuation and emoticons (fewer = more important)
• Similarity to the complete thread (higher = more important)
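
The four signals above could be computed per post roughly as follows. This is a sketch under assumptions: the slides do not specify the exact feature definitions, so the tokenizer, the punctuation count (which also catches the characters of text emoticons such as ":)"), and the bag-of-words cosine similarity are illustrative choices, and `post_features` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def post_features(posts, index):
    """Feature dictionary for posts[index] within its thread."""
    post = posts[index]
    words = tokenize(post)
    thread_counts = Counter(w for p in posts for w in tokenize(p))
    return {
        "num_words": len(words),
        # 0.0 for the first post, 1.0 for the last post.
        "relative_position": index / max(len(posts) - 1, 1),
        # Non-word, non-space characters: punctuation and emoticon characters.
        "punctuation_count": len(re.findall(r"[^\w\s]", post)),
        "thread_similarity": cosine(Counter(words), thread_counts),
    }


# Invented three-post thread.
posts = [
    "Does anyone have experience with sleep training?",
    "Yes!!! :)",
    "We tried sleep training and it worked after a week.",
]
first = post_features(posts, 0)
second = post_features(posts, 1)
```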

Evaluation setup
• 5-fold cross validation of threads
• Evaluation measures:
- Cohen’s Kappa (agreement with humans)
- Precision/Recall/F1 (using the human summaries as reference)
• Baselines:
- Random: select 7 posts randomly
- Position-based: select the first 7 posts
- Length-based: select the 7 longest posts
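
The three baselines and the Precision/Recall/F1 measures can be sketched as below. The code scores a selection against a single human reference summary; how the scores were aggregated over the ten raters per thread is not stated on the slide, so that part is left out. All function names are hypothetical.

```python
import random


def position_baseline(posts, k=7):
    """Select the first k posts."""
    return set(range(min(k, len(posts))))


def length_baseline(posts, k=7):
    """Select the k longest posts, measured in whitespace-separated words."""
    ranked = sorted(range(len(posts)), key=lambda i: len(posts[i].split()), reverse=True)
    return set(ranked[:k])


def random_baseline(posts, k=7, seed=0):
    """Select k posts uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return set(rng.sample(range(len(posts)), min(k, len(posts))))


def precision_recall_f1(selected, reference):
    """Score a set of selected post indices against a reference set."""
    true_positives = len(selected & reference)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented ten-post thread; each post is just n repeated words.
posts = [("w " * n).strip() for n in (5, 1, 9, 2, 3, 8, 1, 1, 7, 2)]
reference = {0, 2, 5, 8}  # hypothetical human summary

p, r, f1 = precision_recall_f1(position_baseline(posts), reference)
longest = length_baseline(posts, k=3)
```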

Results of the automatic summarization
(human-human Kappa: 0.117)

                     Kappa     F1
random baseline     -0.085    22.8%
position baseline    0.060    35.9%
length baseline      0.092    38.2%
our model            0.138    45.2%

• Two different summaries can still both be good summaries
• Is it possible that readers are satisfied by a summary, even though the
summary is different from the summary that they would create
themselves?
→ Pairwise (side-by-side) blind comparison and judgment by human
subjects

• Pairwise (side-by-side) blind comparison and judgment by human
subjects: a human summary vs. our model’s summary
- human summary wins 48.3% of the comparisons
- model summary wins 35.7% of the comparisons
- tie: 16.1% of the comparisons
→ In 51.7% of the direct comparisons, the summary by our model is
considered equal to or better than the human-made summary

Conclusions
• Subjects value the idea of thread summarization through post selection
• But inter-rater agreement for this task is low
• Despite the low agreement, we can automatically generate summaries
that will in half of the cases be judged equal to or better than
summaries created by another human
• Also, the agreement between the model and human subjects is not lower
than the agreement among human subjects
• Two different summaries can both be good summaries

Thank you! Questions?
• http://discosumo.ruhosting.nl/
• http://sverberne.ruhosting.nl/
