1. Tutorial on Automatic Summarization
by Shilpa Subrahmanyam
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
2. The World Today
• We live in an age in which a massive amount
of content is available at our fingertips
– 500 million tweets are posted every day
– More than 2 million articles are posted daily
– The average article length at the New York Times
is about 1,200 words.
• Automatic summarization can help extract insights and themes from such large corpora.
3. What is Automatic Summarization?
• Definition: Employing an algorithm to distill a
text corpus to a considerably smaller body of
important ideas, sentences, phrases, etc.
• Key Challenges:
– Determining how to rank the importance of a
sentence, word, or phrase.
– Eliminating redundancy in the summary while capitalizing on redundancy in the source as a signal of importance
– Incorporating sentiment into the summary
– Ensuring the summary is readable
– Avoiding a search through an exponential
solution space
4. Types of Automatic Summarization
• Extractive Summarization
– Select a subset of existing words, phrases, or
sentences in the original text in order to
generate a summary.
• Abstractive Summarization
– Aims to create a summary that is closer to
what a human might generate.
– Phrases in the summary don’t necessarily
need to have appeared in the original text
• Keyword/Key-phrase extraction
5. Applications and Use Cases
• Summarizing tweets in order to determine
the timeline of a sports game
• Product owners summarizing highly
redundant product reviews (in order to get
popular opinions and key insights)
• Distilling a long article to a set of key
points.
6. Summarization of Small Data (e.g., tweets and microblogs)
• These are summarization methods that are targeted at tweets, microblogs, and other content-limited data.
• A lot of recent work has been done in this area because Twitter and similar sites that favor short content have made large amounts of content-limited data readily available.
7. Basic Approaches
• Topical Keyphrase Extraction from
Twitter (Zhao, et al.)
• Summarizing Microblogs Automatically
(Sharifi, et al.)
8. Topical Keyphrase Extraction from Twitter (Xin Zhao, Jiang, He, Song, Achananuparp, Lim, Li)
• The method proposed by the authors is a
context-sensitive topical PageRank method
for keyword ranking.
• This is paired with a probabilistic scoring function that ranks key phrases by two factors:
– relevance
– interestingness
Key idea: Generate a list of topical key phrases that will serve as a
summarization of a corpus of tweets.
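To make the graph-ranking idea concrete, here is a minimal sketch of plain PageRank over a word co-occurrence graph. It is not the authors' context-sensitive topical variant: the window size, damping factor, and function names are assumptions chosen for illustration, and no topic model is involved.

```python
from collections import defaultdict

def cooccurrence_graph(tweets, window=2):
    """Undirected weighted graph: words are linked when they appear within `window` positions."""
    graph = defaultdict(lambda: defaultdict(float))
    for tweet in tweets:
        words = tweet.lower().split()
        for i, w in enumerate(words):
            for v in words[i + 1:i + window + 1]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Plain (topic-agnostic) PageRank by power iteration; an assumed simplification."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor m passes on its rank in proportion to the edge weight m -> n.
            incoming = sum(rank[m] * graph[m][n] / sum(graph[m].values())
                           for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank
```

Sorting the words by their rank gives a candidate keyword list; a topical version would run one such ranking per topic with topic-specific edge weights.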
9. Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita)
• Start with a topic or phrase and retrieve tweets that are related to that topic or phrase.
• Isolate the longest sentence in each tweet that contains the topic phrase. This set of sentences is the input to the algorithm.
• Build a graph representing the common sequences of words (the common phrases) that occur before and after the key topic phrase.
– The root node represents the topic phrase.
– Each word is represented by a node and a count that indicates how many times the word occurs within the set of input sequences. A phrase is represented in the graph by a sequence of nodes starting with the root.
Key Idea: take a trending phrase, collect a huge number of tweets
containing that trending phrase, and provide an automatically generated
summary of the tweets that were collected.
10. Summarizing Microblogs Automatically (Sharifi, Hutton, Kalita)
• Assign a weight to each node, in order to prevent longer phrases from dominating the output.
– Words are given weights that are proportional to their count.
• Construct a “partial summary” by searching the graph for the path with the largest total weight (the search covers all paths that begin with the root node and end with a non-root node).
– This path represents the most common phrase occurring either before or after the topic phrase.
• Run the algorithm once more, this time initializing the root node with the partial summary and rebuilding the graph. The most heavily weighted path in the new graph is the final summary produced by the algorithm.
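The graph search can be sketched as follows. This is a simplified illustration, not the authors' implementation: it only follows words that come after the topic phrase, it ignores word sequences seen fewer than `min_count` times, and the weighting in `node_weight` (count minus a distance penalty) is an assumption standing in for whatever weighting the paper uses.

```python
import math
from collections import Counter

def tail_counts(sentences, topic):
    """Count every word sequence (path from the root) that directly follows the topic phrase."""
    topic_words = topic.lower().split()
    n = len(topic_words)
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == topic_words:
                tail = words[i + n:]
                for depth in range(1, len(tail) + 1):
                    counts[tuple(tail[:depth])] += 1
                break  # use only the first occurrence of the topic in a sentence
    return counts

def node_weight(count, depth):
    # Assumed weighting: the count, minus a penalty that grows with the
    # distance from the root, so that long phrases do not dominate.
    return count - depth * math.log10(count)

def best_phrase(sentences, topic, min_count=2):
    """Return the topic phrase extended by the heaviest root-to-node path."""
    counts = tail_counts(sentences, topic)
    best_path, best_score = (), 0.0
    for path, count in counts.items():
        if count < min_count:
            continue
        # Total weight of a path is the sum of the weights of its nodes.
        score = sum(node_weight(counts[path[:d]], d) for d in range(1, len(path) + 1))
        if score > best_score:
            best_path, best_score = path, score
    return " ".join(topic.lower().split() + list(best_path))
```

A fuller version would build the same structure for the words before the topic phrase, then rerun the search with the partial summary as the new root.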
11. Why are these approaches suboptimal?
• Topical Key phrase Extraction from Twitter
– Key phrase extraction can often produce noisy results (i.e.
key phrases that are common but don’t help to identify
themes).
– Moreover, oftentimes, it may not be sufficient for
summative purposes to simply look through a list of key
phrases.
• More detail may be required for a sufficient grasp of the original
text.
• Summarizing Microblogs Automatically
– This approach depends on retrieving, as input, only tweets that pertain to the same trending phrase.
– It extracts only the longest sentence that contains the topic keyword(s) from each tweet as an input to the graph algorithm.
• Equating sentence length with importance could prove fallacious. Furthermore, this could mean discarding valuable text.
12. What are some more advanced approaches?
• Twitter Topic Summarization by
Ranking Tweets Using Social Influence
and Content Quality (YaJuan, et al.)
• Summarizing Sporting Events Using
Twitter (Nichols, et al.)
• Sumblr: continuous summarization of
evolving tweet streams (Shou, et al.)
13. Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung)
• This approach takes advantage of follower-followee relationships on Twitter, which are the main signal from which the social influence of users is inferred. The quality of tweets is judged based on a few factors that are incorporated into the graph-based ranking algorithm:
– readability
– content richness
– a measure of the regularity of written language
– the degree to which the content is pointless
• In order to curb redundancy within the final summary, the
model selects tweets from the ranking results using a Maximal
Marginal Relevance algorithm (Carbonell and Goldstein, 1998).
Key idea: Algorithm models and formulates the ranking of tweets in a
unified mutual reinforcement graph.
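The redundancy-curbing step can be illustrated with a minimal Maximal Marginal Relevance sketch. The relevance scores are assumed to come from an earlier ranking stage; the Jaccard word-overlap similarity and the trade-off value `lam` are illustrative assumptions, not the settings used in the paper.

```python
def jaccard(a, b):
    """Word-overlap similarity between two tweets (an assumed stand-in similarity)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(ranked, k, lam=0.7):
    """Pick k tweets from `ranked`, a list of (tweet, relevance) pairs.

    Each step picks the tweet maximizing
    lam * relevance - (1 - lam) * (max similarity to the tweets already picked).
    """
    candidates = dict(ranked)
    selected = []
    while candidates and len(selected) < k:
        def mmr(tweet):
            redundancy = max((jaccard(tweet, s) for s in selected), default=0.0)
            return lam * candidates[tweet] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        del candidates[best]
    return selected
```

With `lam` close to 1 the selection follows the ranking; lowering it trades relevance for diversity.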
14. Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality (YaJuan, ZhuMin, FuRu, Ming, Heung-Yeung)
• The algorithm models the problem of tweet ranking in a unified mutual reinforcement graph.
– In this model, the social influence of users and a measure of the quality of the tweet content are considered simultaneously, in a mutually reinforcing manner.
15. Summarizing Sporting Events Using Twitter (Nichols, Mahmud, Drews)
• Takes advantage of the fact that throughout the course
of sports games, viewers generally tend to make Twitter
updates expressing opinions about different events that
occur throughout the game.
• Aims to generate a natural summary of the event that
incorporates temporal cues, such as spikes in the
volume of status updates, in order to identify important
moments throughout the course of the game.
• Aims to implement a sentence ranking method that is
used to extract relevant sentences from the tweet
corpus -- each presumably referring to an important
moment in the game.
Key idea: Summarize sporting events from a live corpus of tweets.
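The temporal cue can be sketched as a simple spike detector over per-minute tweet counts. The z-score threshold used here is an assumption made for illustration; the paper's own spike criterion may differ.

```python
from statistics import mean, pstdev

def find_spikes(counts_per_minute, z=2.0):
    """Return the indices of minutes whose tweet volume is more than
    `z` standard deviations above the mean volume."""
    mu = mean(counts_per_minute)
    sigma = pstdev(counts_per_minute)
    if sigma == 0:
        return []  # perfectly flat volume: no spikes
    return [i for i, c in enumerate(counts_per_minute) if (c - mu) / sigma > z]
```

The tweets posted during each flagged minute would then be passed to the sentence ranking step to pick one representative sentence per moment.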
16. Sumblr: continuous summarization of evolving tweet streams (Shou, Wang, Ke Chen, Gang Chen)
• Traditional automatic summarization
methods for text documents primarily focus
on static and small-scale data.
• Sumblr (SUMmarization By stream
cLusteRing) aims to summarize tweet streams
-- thereby providing a dynamic
summarization framework.
Key idea: Timeline-based framework for topic summarization of tweets. The algorithm ranks and selects a diverse set of important tweets from each of several sub-topic groups. These tweets serve as the basis of the summary that is composed for each sub-topic.
17. Sumblr: continuous summarization of evolving tweet streams (Shou, Wang, Ke Chen, Gang Chen)
• During tweet stream clustering, it is necessary to maintain statistics for tweets to facilitate summary generation. For this reason, the authors of the paper introduce a representation called the “tweet cluster vector” (TCV).
• The Sumblr framework operates as follows:
– At the start of the stream, collect a small number of tweets and use a k-means clustering algorithm to create the initial clusters. The corresponding TCVs are initialized.
– Incrementally update the TCVs whenever a new tweet arrives. At various points in time, the algorithm has to decide whether to create a new cluster, add a tweet to an existing cluster, or merge/delete existing clusters.
– A high-level summarization step produces online and historical summaries.
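The incremental step can be sketched as follows. This is a heavily simplified stand-in for the framework: the `Cluster` class keeps only a summed term-frequency vector and a size (a real TCV holds more statistics), the k-means initialization and the merge/delete decisions are omitted, and the cosine threshold is an assumed parameter.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    """Simplified cluster statistics: summed term counts, size, member tweets."""
    def __init__(self):
        self.term_sum = Counter()
        self.size = 0
        self.tweets = []

    def add(self, tweet, vec):
        self.term_sum.update(vec)  # incremental update of the cluster statistics
        self.size += 1
        self.tweets.append(tweet)

def cluster_stream(tweets, threshold=0.3):
    """Assign each arriving tweet to the closest cluster, or open a new one."""
    clusters = []
    for tweet in tweets:
        vec = Counter(tweet.lower().split())
        best = max(clusters, key=lambda c: cosine(vec, c.term_sum), default=None)
        if best is None or cosine(vec, best.term_sum) < threshold:
            best = Cluster()
            clusters.append(best)
        best.add(tweet, vec)
    return clusters
```

Because only summed statistics are kept per cluster, each new tweet is processed without revisiting earlier tweets, which is what makes the approach suitable for a stream.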
18. Comparison of Microblog Methods
Paper | Pros | Cons
--- | --- | ---
Twitter Topic Summarization by Ranking Tweets Using Social Influence and Content Quality | Takes advantage of the social influence of authors when ranking tweets; takes into account readability, content richness, a measure of the regularity of written language, and how pointless the content is. | Algorithm structure could thwart the summarization of more niche topics that have sparse follower-followee adjacency matrices.
Summarizing Sporting Events Using Twitter | Incorporates temporal cues; deals with a live corpus. | Does not use valuable social metadata to rank tweets.
Topical Keyphrase Extraction from Twitter | Considers both relevance and interestingness. | Key phrase extraction may not be as helpful as full-sentence summarization for some use cases, especially for data that exhibits low topical phrase redundancy.
Sumblr: continuous summarization of evolving tweet streams | Provides a streaming summarization (as well as historical summaries). | Implementation is more complicated.
Summarizing Microblogs Automatically | Graph algorithm’s relevance calculation does not let long sentences have an unfair advantage over shorter sentences with just as much important content. | Extracts only the longest sentence that contains the topic keyword(s) from each tweet as an input to the graph algorithm.
19. Summarization of Larger Data
• The following summarization methods can also be applied to larger data, including reviews, documents, news articles, and so forth.
21. Extraction based approach for text summarization using k-means clustering (Agrawal, Gupta)
• At a high level, the algorithm proposed by the authors of this paper is an unsupervised learning approach that can be broken down into five steps:
– tokenizing the document
– computing a score for each sentence
– clustering the sentences using k-means
– extracting important sentences
– combining those sentences in order to form a summary
Key idea: incorporates k-means clustering, TF-IDF, and tokenization in
order to perform extractive text summarization.
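The tokenization and sentence-scoring steps can be sketched as below; the k-means clustering step is left out. Treating each sentence as a "document" for the IDF term, the regular expressions, and the function names are all assumptions made for this illustration rather than details taken from the paper.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def split_sentences(document):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def tfidf_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its terms."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum((count / len(tokens)) * math.log(n / df[term])
                    for term, count in tf.items())
        scores.append(score)
    return scores
```

In the full pipeline these scores (or the underlying TF-IDF vectors) would feed the clustering step, and the top-scoring sentence of each cluster would go into the summary.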
22. What makes this approach suboptimal?
• Does not take advantage of redundancy to
rank importance.
• The method for extracting the important sentence(s) from each cluster is naïve and can be gamed.
23. More Advanced Approaches
• Product review summarization from a
deeper perspective (Ly, et al.)
• Mining and Summarizing Customer
Reviews
(Hu, et al.)
• Micropinion Generation: An Unsupervised
Approach to Generating Ultra-Concise
Summaries of Opinions (Ganesan, et al.)
• Opinosis: A Graph-Based Approach to
Abstractive Summarization of Highly
Redundant Opinions (Ganesan, et al.)
24. Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan)
• The first step is Product Facet Identification.
– In order to identify candidate facets, we need to
preprocess the input reviews.
• This involves tagging part-of-speech, stemming,
assigning syntactic rules, and stop word removal.
– We then deploy the Stanford Dependency Parser
in order to detect the role of each noun.
• We want to discard nouns that aren’t subjects or
objects.
• We then use association rule mining to identify
frequent product facets.
Key idea: algorithm automatically summarizes a massive collection of
product reviews and generates a concise, non-redundant summary. Not
only does this system extract review sentiments but it also extracts the
underlying justifications behind the review sentiments.
25. Product review summarization from a deeper perspective (Ly, Sugiyama, Lin, Kan)
• The second step is summarization.
– For each of the facets mined in the previous step, we
want to associate it with relevant opinion sentences
that match the appropriate polarity expressed by the
majority of the opinions in the reference text.
– We first restrict our algorithm to run only on
opinionated sentences from the reviews.
• Furthermore, we perform sentiment analysis on the
sentences to assign a polarity score to each sentence (the
sum of the polarity of each word in a sentence).
– We then calculate content-based pairwise similarities
between all of the resultant opinion sentences. Using
these scores, we perform clustering on the sentences.
– The final task is to select the most representative sentence from each cluster for the final summary.
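The polarity score described above (the sum of the polarity of each word in a sentence) can be sketched in a few lines. The lexicon below is a tiny hypothetical one made up for this example; a real system would use a proper sentiment lexicon.

```python
# Hypothetical word-polarity lexicon, for illustration only.
POLARITY = {"good": 1, "great": 2, "happy": 1, "bad": -1, "terrible": -2, "slow": -1}

def sentence_polarity(sentence, lexicon=POLARITY):
    """Sum the polarity of each word; words missing from the lexicon count as 0."""
    return sum(lexicon.get(word.strip(".,!?").lower(), 0) for word in sentence.split())
```

A plain sum ignores negation: "I am happy" and "I am not happy" both score 1 under this sketch, which shows why a more sophisticated treatment of sentiment is needed.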
26. Mining and Summarizing Customer Reviews (Minqing Hu, Bing Liu)
• The paper focuses on the problem of feature-based
summaries of customer reviews of products sold
online. In this context, “features” refers to product
attributes.
• Given a customer review corpus that pertains to a
given product, summarization is split into three
subtasks:
– First, we must identify the product features that customers
are speaking about.
– Second, for each feature, we have to identify sentences in
the reviews that have positive or negative opinions.
– Last, we must produce a summary that aggregates all of
Key ideas: Algorithm assists merchants in extracting the main ideas and
themes from hundreds, if not thousands, of customer reviews through
product feature extraction and consideration of sentence sentiment
28. Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viegas)
Key ideas: greedy approach that heuristically prunes the exponential
solution space so that we only have to deal with promising candidates.
Ultimate goal is to generate a compact and informative summary using a
set of micro-opinions.
• Micro-opinion: a phrase of 2 to 5 words
• Formal problem set-up:
– Suppose we have a set of sentences $Z = \{z_i\}$, $i \in [1, k]$, from an opinion document.
– The goal is to generate a micro-opinion summary $M = \{m_i\}$, $i \in [1, k]$, where $|m_i| \in [2, 5]$ and each $m_i$ conveys a key opinion from $Z$.
– Note that while we require that each $m_i$ use words that occur at least once in $Z$, we do not require $m_i$ to be an exact subsequence of any of the sentences in $Z$.
• This makes the set-up more of an abstractive summarization problem than an extractive one.
29. Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions (Ganesan, Zhai, Viegas)
• Algorithm:
– Start with a set of high frequency unigrams from the
original corpus.
– Then, start to merge these unigrams to generate higher
order bigrams, trigrams, and n-grams.
– At each merge step, we make sure that the candidate n-grams have reasonably high readability and representativeness scores.
– The candidate generation process stops when an attempt to grow an existing candidate leads to low readability or representativeness scores.
– The final step is to sort all the candidate n-grams by their objective function values (i.e., the sum of $S_{\text{representativeness}}$ and $S_{\text{readability}}$) and generate a micro-opinion summary $M$ by gradually adding the highest-scoring phrases until the accumulated summary length reaches the length threshold.
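The final selection step can be sketched as below. The two scores per candidate are assumed to have been computed already by the earlier steps; measuring summary length in words and stopping at the first phrase that would exceed the limit are assumptions of this sketch.

```python
def select_micropinions(candidates, max_words):
    """Greedy final step: `candidates` is a list of
    (phrase, s_representativeness, s_readability) triples."""
    # Sort by the objective value: the sum of the two scores, highest first.
    ranked = sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)
    summary, used = [], 0
    for phrase, _s_rep, _s_read in ranked:
        n_words = len(phrase.split())
        if used + n_words > max_words:
            break  # the length threshold has been reached
        summary.append(phrase)
        used += n_words
    return summary
```

For example, with a 7-word budget the two best phrases below are kept and the third is dropped:

```python
candidates = [
    ("easy to use", 0.6, 0.5),
    ("battery life is short", 0.9, 0.8),
    ("very clear screen", 0.7, 0.9),
]
print(select_micropinions(candidates, max_words=7))
```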
30. Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han)
• The results of the evaluation studies show that when compared
to the baseline extractive method, the Opinosis summaries are
closer to human summaries.
• The high-level picture of the algorithm: generate an abstractive summary by repeatedly searching the Opinosis graph for sub-graphs that represent semantically valid, meaningful sentences with high redundancy scores.
• It is important that these sentences have high redundancy
scores because that means that they are representative of a
major opinion.
• The sentences that are represented by these sub-graphs can be
combined to form an abstractive summary.
Key idea: graph-based approach to automatic text summarization. The
summarization framework generates concise abstractive summaries and
capitalizes on the presence of large amounts of redundancy in the
opinions.
31. Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions (Ganesan, Zhai, Han)
• Opinosis constructs a graph that represents the original text. The authors isolate three properties of this graph that they exploit in order to explore and score various sub-paths throughout the graph. These sub-paths are what help to generate the candidate abstractive summaries.
– Properties:
• Redundancy Capture: extremely redundant textual
occurrences are naturally captured by sub-graphs.
• Gapped Subsequence Capture: existing sentence structures
create “lexical links”. These links then facilitate the discovery
of new sentences.
• Collapsible Structures: nodes that resemble hubs can
potentially be collapsed
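A word graph of this kind can be sketched as follows. This is a simplified illustration rather than the Opinosis graph itself: nodes here are bare lowercase words, and the redundancy of a path is counted only over sentences where its words appear contiguously, with no gaps allowed. Both simplifications are assumptions of this sketch.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """One node per unique word. Each node records where the word occurs,
    and each directed edge links a word to a word that follows it."""
    positions = defaultdict(list)  # word -> [(sentence id, position), ...]
    edges = defaultdict(set)       # word -> set of words that directly follow it
    for sid, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for pid, word in enumerate(words):
            positions[word].append((sid, pid))
            if pid + 1 < len(words):
                edges[word].add(words[pid + 1])
    return positions, edges

def path_redundancy(path, positions):
    """Number of sentences in which the words of `path` occur contiguously, in order."""
    support = set()
    for sid, pid in positions[path[0]]:
        if all((sid, pid + i) in positions[word] for i, word in enumerate(path)):
            support.add(sid)
    return len(support)
```

Words such as "is" that occur in many sentences become hub-like nodes with several outgoing edges, and paths shared by many sentences receive high redundancy counts.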
32. Comparison of Methods for Larger Data
Paper | Pros | Cons
--- | --- | ---
Mining and Summarizing Customer Reviews | Takes sentiment into consideration. | Algorithm is restricted to run only on opinionated sentences. This could discard potentially valuable text.
Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions | Aims to capitalize on existing redundancy and maximize readability; abstractive summarization aims to mimic human summarization; prunes unpromising candidates. | Does not take sentiment into consideration; essentially provides key phrases, which may not be optimal for all use cases.
Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions | Capitalizes on redundancy; emphasizes readability; abstractive summarization mimics human summarization. | Does not take sentiment into consideration.
Extraction based approach for text summarization using k-means clustering | Implementation is simple and straightforward. | Does not take sentiment into consideration; doesn’t take advantage of redundancy to rank importance.
Product review summarization from a deeper perspective | Takes sentiment into consideration. | Sentiment consideration needs to be more sophisticated in order to account for complex English phrase structure (e.g., the sentences “I am happy” and “I am not happy” should have an extremely high sentiment differential, which might not happen under this approach).
33. Future Strides
• Ideally, we want to push towards better
abstractive summarization approaches.
– We want to emulate human summarization as
closely as possible
• Applications of deep learning to automatic
summarization
• Highly visual automatic summarizations