Ire final

Diversity Measure for text
Summarization
Guided by: Dr.Vasudev Verma
Mentor: Litton J Kurisinkel
Submitted By:- Group No.27
Nirat Attri (201201013)
Dhruva Das (201301151)
Siddharth Saklecha (201505570)

Introduction
➢ A summary is brief but detailed outline of a
document, which conveys the essence of
the document.
➢ The goal of a text Summarizer is
condensing the source text into a shorter
version while its overall information and
meaning remains same.
➢ The Generated Summary should cover
relevant topics in the original corpus and
diverse enough.

Motivation
➢ A short summary, which conveys the
essence of the document, helps in finding
relevant information quickly
➢ Text summarization also provides a way to
cluster similar documents and present a
summary.
➢ Text summarization has become an
important and timely tool for assisting and
interpreting text information in today’ fast-
growing information age.

Problem Statement :
➢ Derive a method to improve efficiency of the task
of Text Summarization.
➢ We design novel methods to improve efficiency of the
task of Text Summarization using a class of sub
modular functions.
➢ These functions each combine two terms, one which
encourages the summary to be representative of the
corpus (coverage), and the other which positively
rewards diversity.
➢ Our functions are monotone non-decreasing and
submodular, which means that an efficient scalable
greedy optimization scheme has a constant factor
guarantee of optimality.
Proposed Solution :

Submodular Function:
➢ Sub-modular functions are those that satisfy the
property of diminishing returns: for any A⊆B⊆ V
v, a sub-modular function F must satisfy
F(A+v) - F(A) >= F(B + v) - F(B).
That is, the incremental value of v decreases as
the context in which v is considered grows from A
to B.
➢ An equivalent definition, useful mathematically, is
that for any A,B⊆V, we must have that
F(A)+F(B) >= F(AUB)+F(AnB).
If this is satisfied everywhere with equality, then
the function F is called modular.

Summarization Tools :
➢ CLUTO
It is a software package for clustering low
and high dimensional datasets and for
analyzing the characteristics of the various
clusters.
➢ ROUGE
Recall-Oriented Understudy for Gisting
Evaluation, is a set of metrics and a
software package used for evaluating
automatic summarization and machine
translation software in natural language
processing.

Approach
➢ Two properties of a good summary are relevance and non
redundancy.
➢ Objective functions for extractive summarization usually
measure these two separately and then mix them together
trading off encouraging relevance and penalizing
Redundancy.
➢ The redundancy penalty usually violates the monotonicity
of the objective functions.
➢ In particular, we model the summary quality as
F(S) = L(S) + λ R(S)
where, L(S) measures the coverage, or fidelity, of summary
set S to the document, R(S) rewards diversity in S, and λ is
a trade-off coefficient.
➢ Coverage Measure:
L(S) can be interpreted either as a set function that
measures the similarity of summary set S to the document
to be summarized, or as a function representing some
form of coverage of V by S.
L(S) should be monotone, as coverage improves with a
larger summary.

Approach Continue…
Shannon entropy is a well-known monotone submodular
function. So, we take our coverage function as:
L(S) = Σ min { Ci(S) , α Ci(V) } and i ∈ V
Basically, Ci(S) measures how similar S is to element i, or how
much of i is covered by S and Ci (V) is just the largest value
that Ci(S) can achieve.
➢Diversity Measure :
where Pi, i = 1, ...,K is a partition of the ground set V into
separate clusters, and ri ≥ 0 indicates the singleton reward of i
(i.e., the reward of adding i into the empty set). The value ri
estimates the
importance of i to the summary.

Approach 1 : K-means Clustering
➢ The dataset was fed into CLUTO to perform K-means and
Hierarchical Clustering to obtain clusters referring to
similar data.
➢ We ran a Grid Search on the values of and to get the best
optimal value to maximize the sub modular function.
➢ Algorithm:
➢ Summary → ⌀
➢ allowedClusters ← allClusters
➢ while size(Summary) ≤ 665:
• pick the cluster most similar to corpus from
allowedClusters → chosenCluster
• chosenSentence ← highest ranking sentence of
chosenCluster based on coverage and diversity
measure
• Summary ← chosenSentence

Approach 2 : Agglomerative
Clustering
➢ Agglomerative clustering(also called Hierarchical clustering
analysis or HCA) is a method of cluster analysis which
seeks to build a hierarchy of cluster.
➢ It is a bottom up approac. Each observation starts in its
own cluster, and pairs of clusters are merged as one moves
up the hierarchy.
➢ O(n^3) approach.
➢ Each observation starts in its own cluster and clusters are
succesively merged together. The linkage criteria
determines the metric used for the merge strategy.
➢ In computing the clusters, we used Cosine Similarity
criteria.

Challenges Faced
➢ One the challenges that we faced was to understand the
Output given by the Tool Cluto, which Clusters the given
sentences. After which finding the summary of the folder
was just a basic implementation of the formula given in the
paper.
➢ Another challenge was that the time taken to find the
summary for a single folder was quite large. This time
could be drastically reduced by not using the condition of
Sub-modular formula f(A+v) - f(A) > f(B+v) - f(B), that is, the
incremental value of v decreases as the context in which v
is considered grows from A to B. But this would be at the
cost of accuracy.

Results and Conclusion
➢ The values for λ and α values for the diversity and
coverage measures giving us the submodular function were
calculated using a sweep search for the best values of
ROGUE scores.
➢ We recovered the best values as follows :
α = 15
λ = 4
➢ The clusters were summarized and their ROGUE scores
calculated to estimate the efficiency
➢ We can conclude by saying that K-means and Heirarchical
clustering using a weighted consideration for both, the
diversity and coverage, gives us the best scores in Text
Summarization.
Apporach ROUGE-R ROUGE-F
K-Means and hierarchical 0.3843 0.3792
Agglomerative 0.3724 0.3674

References
➢ Vishal Gupta and Gurpreet Singh Lehal, A Survey of Text
Summarization Extractive Techniques.
➢ Hui Lin and Jeff Bilmes, A Class of Submodular Functions
for Document Summerization.
➢ Hui Lin and Jeff Bilmes, Multi-document Summarization
via Budgeted Maximization of Submodular Function.
➢ Ying Zhao and George Karypis, Criterion Functions for
Document Clustering Experiments and Analysis.
➢ Chin-Yew LIN, A Package for Automatic Evaluation of
Summaries.
➢ DUC 2004,http://duc.nist.gov/data.html

Ire final

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Ire final

Similar to Ire final (20)

Recently uploaded

Recently uploaded (20)

Ire final