IRE Final
1. Diversity Measure for Text Summarization
Guided by: Dr. Vasudev Verma
Mentor: Litton J Kurisinkel
Submitted by: Group No. 27
Nirat Attri (201201013)
Dhruva Das (201301151)
Siddharth Saklecha (201505570)
2. Introduction
➢ A summary is a brief but comprehensive
outline of a document that conveys its
essence.
➢ The goal of a text summarizer is to
condense the source text into a shorter
version while preserving its overall
information and meaning.
➢ The generated summary should cover the
relevant topics in the original corpus and
be sufficiently diverse.
3. Motivation
➢ A short summary that conveys the
essence of a document helps in finding
relevant information quickly.
➢ Text summarization also provides a way to
cluster similar documents and present a
single summary.
➢ Text summarization has become an
important and timely tool for assisting in
interpreting text information in today's
fast-growing information age.
4. Problem Statement:
➢ Derive a method to improve the efficiency of the
task of Text Summarization.
Proposed Solution:
➢ We design novel methods to improve the efficiency of
Text Summarization using a class of submodular
functions.
➢ These functions each combine two terms: one
encourages the summary to be representative of the
corpus (coverage), and the other positively
rewards diversity.
➢ Our functions are monotone non-decreasing and
submodular, which means that an efficient, scalable
greedy optimization scheme carries a constant-factor
guarantee of optimality.
5. Submodular Function:
➢ Submodular functions are those that satisfy the
property of diminishing returns: for any A ⊆ B ⊆ V
and any v ∈ V \ B, a submodular function F must satisfy
F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B).
That is, the incremental value of v decreases as
the context in which v is considered grows from A
to B.
➢ An equivalent definition, useful mathematically, is
that for any A, B ⊆ V, we must have
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).
If this holds everywhere with equality, then
the function F is called modular.
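The diminishing-returns property can be checked on a toy example. The sketch below uses set cover, a classic monotone submodular function; the sets, elements, and names are our own illustrative choices, not from the project.

```python
# Toy check of diminishing returns on F(S) = |union of sets indexed by S|
# (set cover, a classic monotone submodular function).

SETS = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}

def F(S):
    """Number of distinct elements covered by the sets chosen in S."""
    covered = set()
    for i in S:
        covered |= SETS[i]
    return len(covered)

A = {0}          # A ⊆ B
B = {0, 1}
v = 2            # v ∉ B

gain_A = F(A | {v}) - F(A)   # marginal value of v in the smaller context A
gain_B = F(B | {v}) - F(B)   # marginal value of v in the larger context B

# Diminishing returns: the gain shrinks as the context grows.
print(gain_A, gain_B)  # gain_A >= gain_B
```

Here v's marginal gain drops from 3 (given A) to 2 (given B), since element 3 is already covered by B.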
6. Summarization Tools :
➢ CLUTO
A software package for clustering low-
and high-dimensional datasets and for
analyzing the characteristics of the various
clusters.
➢ ROUGE
Recall-Oriented Understudy for Gisting
Evaluation: a set of metrics and a
software package used for evaluating
automatic summarization and machine
translation systems in natural language
processing.
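The core idea behind ROUGE's recall-oriented metrics can be shown with a minimal unigram version. This is a simplified sketch; the real ROUGE package adds n-gram variants, stemming, and stopword handling.

```python
# Toy ROUGE-1 recall: overlapping unigrams between candidate and
# reference, divided by the number of reference unigrams.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)  # clipped counts
    return overlap / sum(ref.values())

# 3 of the 6 reference tokens are matched -> recall 0.5
print(rouge1_recall("the cat sat", "the cat sat on the mat"))
```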
7. Approach
➢ Two properties of a good summary are relevance and
non-redundancy.
➢ Objective functions for extractive summarization usually
measure these two separately and then mix them together,
trading off encouraging relevance against penalizing
redundancy.
➢ The redundancy penalty usually violates the monotonicity
of the objective function.
➢ In particular, we model the summary quality as
F(S) = L(S) + λ R(S),
where L(S) measures the coverage, or fidelity, of summary
set S to the document, R(S) rewards diversity in S, and λ ≥ 0 is
a trade-off coefficient.
➢ Coverage Measure:
L(S) can be interpreted either as a set function that
measures the similarity of summary set S to the document
to be summarized, or as a function representing some
form of coverage of V by S.
L(S) should be monotone, as coverage only improves with a
larger summary.
8. Approach (continued)
Shannon entropy is a well-known monotone submodular
function; in the same spirit, we take our coverage function to be
L(S) = Σ_{i ∈ V} min { C_i(S), α C_i(V) }.
Here C_i(S) measures how similar S is to element i, i.e., how
much of i is covered by S, and C_i(V) is simply the largest value
that C_i(S) can achieve.
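A minimal sketch of this coverage term, taking C_i(S) as the summed pairwise similarity of sentence i to the sentences in S (as in Lin and Bilmes). The similarity matrix and the α value below are illustrative, not values from the project.

```python
# Coverage term L(S) = sum_i min{ C_i(S), alpha * C_i(V) } over a toy
# sentence-similarity matrix. C_i(S) = sum of similarities of i to S.

sim = [[1.0, 0.6, 0.1],
       [0.6, 1.0, 0.2],
       [0.1, 0.2, 1.0]]   # illustrative pairwise sentence similarities

def coverage(S, sim, alpha):
    total = 0.0
    for i in range(len(sim)):
        C_iS = sum(sim[i][j] for j in S)   # how much S covers sentence i
        C_iV = sum(sim[i])                 # largest value C_i(S) can take
        total += min(C_iS, alpha * C_iV)   # saturate at alpha * C_i(V)
    return total

print(coverage({0}, sim, alpha=0.5))
```

The min-with-α cap saturates each sentence's contribution, so adding near-duplicate sentences stops paying off once a sentence is "covered enough".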
➢ Diversity Measure:
R(S) = Σ_{k=1..K} √( Σ_{j ∈ P_k ∩ S} r_j ),
where P_k, k = 1, ..., K is a partition of the ground set V into
separate clusters, and r_j ≥ 0 indicates the singleton reward of j
(i.e., the reward of adding j to the empty set). The value r_j
estimates the importance of j to the summary.
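The diversity reward from Lin and Bilmes, R(S) = Σ_k √(Σ_{j ∈ S ∩ P_k} r_j), can be sketched as follows. The partition, singleton rewards, and sets below are made-up illustrative values.

```python
# Diversity reward: square root per cluster gives diminishing returns
# within a cluster, so spreading the summary across clusters pays more.
import math

partition = [{0, 1}, {2, 3}]          # clusters P_1, P_2 of the ground set
r = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # singleton rewards r_j >= 0

def diversity(S):
    return sum(math.sqrt(sum(r[j] for j in S & P)) for P in partition)

spread = diversity({0, 2})   # one sentence from each cluster
packed = diversity({0, 1})   # two sentences from the same cluster
print(spread > packed)
```

Even though the packed pair has a larger raw reward sum (0.7 vs. 0.6), the cross-cluster pair scores higher: √0.4 + √0.2 ≈ 1.08 versus √0.7 ≈ 0.84.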
9. Approach 1: K-means Clustering
➢ The dataset was fed into CLUTO to perform K-means
clustering, grouping similar sentences into clusters.
➢ We ran a grid search over the parameter values to find
those that maximize the submodular function.
➢ Algorithm:
➢ Summary ← ∅
➢ allowedClusters ← allClusters
➢ while size(Summary) ≤ 665 bytes:
• chosenCluster ← the cluster in allowedClusters
most similar to the corpus
• chosenSentence ← the highest-ranking sentence of
chosenCluster under the coverage and diversity
measures
• Summary ← Summary ∪ {chosenSentence}
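The greedy loop above can be sketched in a few lines. The objective here is a stand-in scoring function and the sentence strings are dummies; only the structure (marginal-gain selection under a byte budget) mirrors the algorithm.

```python
# Greedy selection: repeatedly add the sentence with the largest
# marginal gain in the objective until the byte budget is exceeded.
def greedy_summary(sentences, objective, budget=665):
    summary = []
    remaining = list(range(len(sentences)))
    while remaining:
        # marginal gain of each candidate given the current summary
        gains = {i: objective(summary + [i]) - objective(summary)
                 for i in remaining}
        best = max(gains, key=gains.get)
        if sum(len(sentences[j]) for j in summary) + len(sentences[best]) > budget:
            break                       # adding best would bust the budget
        summary.append(best)
        remaining.remove(best)
    return summary

# Stub objective with diminishing returns; dummy 300-byte "sentences".
sentences = ["a" * 300, "b" * 300, "c" * 300]
objective = lambda S: sum(1.0 / (k + 1) for k in range(len(S)))
print(greedy_summary(sentences, objective))
```

With 300-byte sentences and a 665-byte budget, the loop admits two sentences and stops before the third.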
10. Approach 2: Agglomerative
Clustering
➢ Agglomerative clustering (also called hierarchical cluster
analysis, or HCA) is a method of cluster analysis which
seeks to build a hierarchy of clusters.
➢ It is a bottom-up approach: each observation starts in its
own cluster, and pairs of clusters are successively merged
as one moves up the hierarchy. The linkage criterion
determines the metric used for the merge strategy.
➢ The naive implementation is an O(n³) approach.
➢ In computing the clusters, we used cosine similarity as the
linkage metric.
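A bare-bones version of this bottom-up procedure, using cosine similarity with average linkage. The toy 2-D vectors stand in for sentence vectors; a real run would use CLUTO or a library. The naive pairwise scan below is what makes the approach O(n³).

```python
# Pure-Python agglomerative clustering: start with singletons, repeatedly
# merge the most similar pair (average-linkage cosine) until k remain.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def agglomerative(vectors, k):
    clusters = [[i] for i in range(len(vectors))]   # each point alone
    while len(clusters) > k:
        best, best_sim = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean pairwise similarity across clusters
                sim = sum(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                sim /= len(clusters[a]) * len(clusters[b])
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        a, b = best
        clusters[a] += clusters.pop(b)              # merge the best pair
    return clusters

print(agglomerative([(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)], k=2))
```

On the toy data the two near-parallel vector pairs end up in separate clusters, as expected.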
11. Challenges Faced
➢ One challenge was understanding the output produced by
the CLUTO tool, which clusters the given sentences.
Producing the summary of a folder was then a
straightforward implementation of the formula given in the paper.
➢ Another challenge was that the time taken to find the
summary for a single folder was quite large. This time
could be drastically reduced by dropping the submodularity
condition f(A ∪ {v}) − f(A) ≥ f(B ∪ {v}) − f(B), i.e., that the
incremental value of v decreases as the context in which v
is considered grows from A to B, but only at the
cost of accuracy.
12. Results and Conclusion
➢ The values of λ and α in the diversity and coverage
measures making up the submodular function were
chosen via a sweep search for the best
ROUGE scores.
➢ The best values we recovered were:
α = 15
λ = 4
➢ The clusters were summarized and their ROUGE scores
calculated to estimate the efficiency.
➢ We conclude that agglomerative clustering
with a weighted combination of diversity and
coverage gives the best scores in Text Summarization.
Approach        ROUGE-R   ROUGE-F
Agglomerative   0.3843    0.3792
K-Means         0.3724    0.3674
13. References
➢ Vishal Gupta and Gurpreet Singh Lehal, A Survey of Text
Summarization Extractive Techniques.
➢ Hui Lin and Jeff Bilmes, A Class of Submodular Functions
for Document Summarization.
➢ Hui Lin and Jeff Bilmes, Multi-document Summarization
via Budgeted Maximization of Submodular Functions.
➢ Ying Zhao and George Karypis, Criterion Functions for
Document Clustering: Experiments and Analysis.
➢ Chin-Yew Lin, ROUGE: A Package for Automatic Evaluation of
Summaries.
➢ DUC 2004, http://duc.nist.gov/data.html