Multi-document summarization is an automatic procedure aimed at extraction of information.
Multi-document summarization creates information reports that are both concise and comprehensive.
1. Multi-Document Summarization using Closed Patterns
Supervisor: Ms. Zakia Jalil
Co-Supervisor: Ms. Sabina Irum
Presented by
Hafsa Sattar [2896-FBAS/BSCS/F14]
Uswa Ihsan [2822-FBAS/BSCS/F14]
Department of Computer Science & Software Engineering
Faculty of Basic & Applied Sciences
International Islamic University, Islamabad.
2. 2
Overview of Presentation
• Introduction.
• Literature Review.
• Problem Statement.
• Proposed Solution.
• Block Diagram.
• Evaluation.
• Tools, Technology, and Dataset.
• References.
DCSSE/IIUI Multi-Document Summarization using Closed Patterns
3. 3
Introduction
• The internet provides access to a huge volume of documents.
• We propose multi-document summarization to extract the key information from multiple documents.
• This saves users the time and effort of reading every document in full.
4. 4
Multi-Document Summarization
• Multi-document summarization is an automatic
procedure aimed at extraction of information.
• Multi-document summarization creates information
reports that are both concise and comprehensive.
5. 5
Literature Review
• Multi-document summarization methods can be classified into two classes:
• Extractive summarization:
Extractive summarization extracts the most informative document components.
• Abstractive summarization:
Abstractive summarization involves reformulation of the contents.
6. 6
Existing System
• Term-Based Method:
A term-based method has the advantages of efficiency and maturity in term weight calculation.
Term-based methods can be divided into the following categories:
1. Centroid-Based Method:
This method uses clustering algorithms to generate sentence clusters by calculating sentence similarity.
7. 7
Existing System
2. Graph-Based Method:
a) Graph-based approaches also belong to extractive summarization.
b) This method builds a graph-based model of the sentences.
c) It then selects sentences by means of votes from their neighbors.
• Ontology-Based Method:
Ontology-based approaches take the meanings of the vocabulary into account.
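The neighbor voting used by graph-based methods can be illustrated with a small sketch. It is not any specific published algorithm: the Jaccard similarity, the 0.2 threshold, and the helper names `jaccard` and `rank_by_votes` are all assumptions made for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_by_votes(sentences, threshold=0.2):
    """Give each sentence one vote per neighbor whose similarity reaches the threshold."""
    tokens = [s.lower().split() for s in sentences]
    votes = [0] * len(sentences)
    for i, j in combinations(range(len(sentences)), 2):
        # An edge exists between two sentences when they are similar enough.
        if jaccard(tokens[i], tokens[j]) >= threshold:
            votes[i] += 1
            votes[j] += 1
    return votes

# Hypothetical sentences, not taken from the presentation's tables.
sentences = [
    "obama met the senate leader",
    "the senate leader met obama today",
    "stocks fell sharply on monday",
]
votes = rank_by_votes(sentences)
print(votes)
```

Sentences with more votes are better connected in the graph and would be preferred for the summary.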
8. 8
Problem Statement
• The explosion of electronic documents presents a serious challenge for readers who need to extract information.
• The extracted information can be false or incomplete, which may cause problems later.
• The main problem in multi-document summarization (MDS) arises because the data is extracted from a collection of multiple sources.
9. 9
Proposed Solution
• A pattern-based model for general multi-document summarization is proposed.
• It extracts the most informative sentences from a document collection.
• It reduces redundancy in the summary.
• It calculates the weight of each sentence in the document collection by accumulating the weights of its covering closed patterns.
10. 10
Closed Pattern
• Closed patterns represent sets of terms that occur with high frequency in the document collection.
• The weight of each sentence is determined by the number of closed patterns in the sentence.
• Sentences containing more closed patterns have higher scores.
• Sentences that do not contain any closed pattern have the minimal score of zero.
11. 11
Block Diagram
• Block diagram of MDS using Closed Patterns
12. 12
Example
• Table 1. A set of sentences from two news reports
• A sentence is denoted by $S_j^i$, where $i$ is the document number and $j$ is the sentence number.
14. 14
Example (Cont…)
Table 3. All frequent patterns are shown in this table with minimum support 3.
15. 15
Example (Cont…)
• A pattern is a super-pattern of another pattern if it contains that pattern as a subset.
• A frequent pattern is closed if every one of its super-patterns has a smaller support; for a closed pattern with support 4, the support of all super-patterns is smaller than 4.
• There are 23 frequent patterns in Table 3, but not all of them are closed patterns.
• Only the longest patterns for a given support, namely the closed patterns, are considered.
• Closed patterns are shown in bold font in Table 3.
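The step from frequent patterns to closed patterns can be sketched with a brute-force miner. Table 1 is not reproduced in this transcript, so the four term sets below are an assumption, chosen to agree with the closed patterns listed on the Sentence Representation slides. The function names are hypothetical, and a real system would use a dedicated closed-pattern mining algorithm rather than exhaustive enumeration.

```python
from itertools import combinations

def frequent_patterns(sentences, min_support):
    """Return {termset: support} for every termset found in >= min_support sentences."""
    vocab = sorted(set().union(*sentences))
    patterns = {}
    for size in range(1, len(vocab) + 1):
        found = False
        for combo in combinations(vocab, size):
            support = sum(1 for s in sentences if set(combo) <= s)
            if support >= min_support:
                patterns[frozenset(combo)] = support
                found = True
        if not found:  # no frequent pattern of this size, so none larger exists
            break
    return patterns

def closed_patterns(patterns):
    """Keep patterns that have no proper super-pattern with the same support."""
    return {
        p: sup for p, sup in patterns.items()
        if not any(p < q and sup == patterns[q] for q in patterns)
    }

# Assumed term sets for the four example sentences (two per document).
sentences = [
    {"Obama", "Leader", "Republican", "Mcconnell", "Senate"},
    {"Obama", "Leader", "Republican", "President"},
    {"Obama", "Leader", "Republican", "President", "Mcconnell", "Senate"},
    {"Obama", "Leader", "President", "Mcconnell", "Senate"},
]
frequent = frequent_patterns(sentences, min_support=3)
closed = closed_patterns(frequent)
for pattern, support in closed.items():
    print(sorted(pattern), support)
```

With these assumed term sets and a minimum support of 3, the sketch finds 23 frequent patterns, of which 4 are closed.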
16. 16
Sentence Representation
• A closed pattern can be expressed as a term-weight pair.
• Let $tw$ be a term-weight pair consisting of a set of terms and their weights, such as $tw = \{(t_1, a_1), \ldots, (t_i, a_i), \ldots, (t_x, a_x)\}$,
• where $t_i$ denotes a single term and $a_i$ is its weight.
• For example, $tw = \{(\text{Obama}, 3), (\text{Republican}, 3), (\text{Leader}, 3)\}$, where 3 is the weight of this closed pattern.
• The weight of a closed pattern $p_i$ is $w(p_i) = |coverSent(p_i)| \times |coverDoc(p_i)| / N$.
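A minimal sketch of the pattern weight follows. The slide does not define its symbols, so the sketch assumes that $coverSent(p)$ is the set of sentences containing the pattern, that $coverDoc(p)$ is the set of documents containing at least one such sentence, and that $N$ is the number of documents. The documents below are hypothetical term sets, and `pattern_weight` is an illustrative name.

```python
def pattern_weight(pattern, documents):
    """w(p) = |coverSent(p)| * |coverDoc(p)| / N, with N assumed to be the document count."""
    cover_sent = 0
    cover_doc = 0
    for doc in documents:
        # Number of sentences in this document that contain every term of the pattern.
        hits = sum(1 for sentence in doc if pattern <= sentence)
        cover_sent += hits
        cover_doc += 1 if hits else 0
    return cover_sent * cover_doc / len(documents)

# Two assumed documents, each a list of sentence term sets.
documents = [
    [{"Obama", "Leader", "Republican", "Mcconnell", "Senate"},
     {"Obama", "Leader", "Republican", "President"}],
    [{"Obama", "Leader", "Republican", "President", "Mcconnell", "Senate"},
     {"Obama", "Leader", "President", "Mcconnell", "Senate"}],
]
w_obama_leader = pattern_weight({"Obama", "Leader"}, documents)
w_republican = pattern_weight({"Obama", "Republican", "Leader"}, documents)
print(w_obama_leader, w_republican)
```

Under these assumptions the two patterns receive weights 4 and 3, the same values that appear in the term-weight pairs of the example.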
17. 17
Sentence Representation (Cont…)
• Using Table 3, the term-weight pairs of all closed patterns are:
a) $tw_1 = \{(\text{Obama}, 4), (\text{Leader}, 4)\}$
b) $tw_2 = \{(\text{Obama}, 3), (\text{Republican}, 3), (\text{Leader}, 3)\}$
c) $tw_3 = \{(\text{Obama}, 3), (\text{President}, 3), (\text{Leader}, 3)\}$
d) $tw_4 = \{(\text{Obama}, 3), (\text{Leader}, 3), (\text{Mcconnell}, 3), (\text{Senate}, 3)\}$
• Using $tw_1$, $tw_2$, $tw_3$ and $tw_4$, each sentence is described by the closed patterns that cover it:
$S_1^1 = \{tw_1, tw_2, tw_4\}$, $S_2^1 = \{tw_1, tw_2, tw_3\}$, $S_1^2 = \{tw_1, tw_2, tw_3, tw_4\}$, $S_2^2 = \{tw_1, tw_3, tw_4\}$.
18. 18
Sentence Representation (Cont…)
• Let $tw_1 = \{(t_1, a_1), \ldots, (t_i, a_i), \ldots, (t_x, a_x)\}$ and $tw_2 = \{(w_1, b_1), \ldots, (w_j, b_j), \ldots, (w_y, b_y)\}$ be two term-weight pairs associated with two patterns.
• The composition operation $\oplus$ between $tw_1$ and $tw_2$ is used to obtain the sentence representation from closed patterns.
• For example, if $tw_1 = \{(t_1, 3), (t_2, 2), (t_4, 4)\}$ and $tw_2 = \{(t_2, 5), (t_3, 2), (t_5, 1)\}$, then $tw_1 \oplus tw_2 = \{(t_1, 3), (t_2, 7), (t_3, 2), (t_4, 4), (t_5, 1)\}$.
• Only closed patterns whose size is greater than 1 are used.
• The representation of sentence $S_i$ is obtained with the formula $tw(S_i) = tw_{i1} \oplus tw_{i2} \oplus \cdots \oplus tw_{ir}$.
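The composition operation in the example on this slide can be sketched as a dictionary merge that adds the weights of shared terms. Storing a term-weight pair as a Python `dict` and the function name `compose` are assumptions made for illustration.

```python
def compose(tw1, tw2):
    """Merge two term-weight sets, adding the weights of terms that occur in both."""
    result = dict(tw1)
    for term, weight in tw2.items():
        result[term] = result.get(term, 0) + weight
    return result

# The two term-weight pairs from the example on this slide.
tw1 = {"t1": 3, "t2": 2, "t4": 4}
tw2 = {"t2": 5, "t3": 2, "t5": 1}
composed = compose(tw1, tw2)
print(sorted(composed.items()))
```

Only `t2` occurs in both inputs, so its weights are summed to 7 while every other term keeps its original weight.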
19. 19
Sentence Representation (Cont…)
The representation of each sentence is then obtained as:
• $tw(S_1^1) = \{(\text{Senate}, 3), (\text{Leader}, 10), (\text{Obama}, 10), (\text{Republican}, 3), (\text{Mcconnell}, 3)\}$
• $tw(S_2^1) = \{(\text{Obama}, 10), (\text{Republican}, 3), (\text{President}, 3), (\text{Leader}, 10)\}$
• $tw(S_1^2) = \{(\text{Senate}, 3), (\text{Mcconnell}, 3), (\text{Leader}, 13), (\text{Obama}, 13), (\text{Republican}, 3), (\text{President}, 3)\}$
• $tw(S_2^2) = \{(\text{Senate}, 3), (\text{Leader}, 10), (\text{Obama}, 10), (\text{President}, 3), (\text{Mcconnell}, 3)\}$
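These representations can be reproduced by composing the closed patterns that cover each sentence. The sketch below uses `collections.Counter` addition as the composition, which is an implementation choice for illustration rather than part of the presentation. Sentences are keyed by hypothetical `(document, sentence)` tuples.

```python
from collections import Counter

# Term-weight pairs of the four closed patterns from the example.
tw = {
    "tw1": Counter({"Obama": 4, "Leader": 4}),
    "tw2": Counter({"Obama": 3, "Republican": 3, "Leader": 3}),
    "tw3": Counter({"Obama": 3, "President": 3, "Leader": 3}),
    "tw4": Counter({"Obama": 3, "Leader": 3, "Mcconnell": 3, "Senate": 3}),
}

# Closed patterns covering each sentence, keyed by (document i, sentence j).
covering = {
    (1, 1): ["tw1", "tw2", "tw4"],
    (1, 2): ["tw1", "tw2", "tw3"],
    (2, 1): ["tw1", "tw2", "tw3", "tw4"],
    (2, 2): ["tw1", "tw3", "tw4"],
}

def represent(names):
    """Compose the named term-weight pairs by summing the weights of shared terms."""
    return dict(sum((tw[name] for name in names), Counter()))

representations = {key: represent(names) for key, names in covering.items()}
for key, rep in representations.items():
    print(key, sorted(rep.items()))
```

The first sentence of the second document is covered by all four closed patterns, which is why Obama and Leader reach a weight of 13 there.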
20. 20
Sentence Ranking
• Pattern-based summarization ranks all sentences according to their sentence representation.
• Sentences that contain more closed patterns with high weights receive higher scores.
• $\mathrm{Score}(S_j^i) = \frac{\mathrm{weight}(S_j^i)}{|S_j^i|} \times \left(1 - \frac{j-1}{|d_i|}\right)$
• The position factor reflects that the starting sentence in a document contains more new information than the following sentences.
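The score can be sketched as below. The slide does not define its symbols, so the sketch assumes that $\mathrm{weight}(S_j^i)$ is the sum of the term weights in the sentence representation, that $|S_j^i|$ is the number of terms in the sentence, and that $|d_i|$ is the number of sentences in document $i$. The sentence lengths in the example are hypothetical.

```python
def sentence_score(representation, num_terms, position, doc_length):
    """Score = weight(S) / |S| * (1 - (j - 1) / |d_i|), under the stated assumptions."""
    weight = sum(representation.values())             # assumed: sum of term weights
    position_factor = 1 - (position - 1) / doc_length  # earlier sentences score higher
    return weight / num_terms * position_factor

# Representations of the two sentences of document 1 from the example.
rep_first = {"Senate": 3, "Leader": 10, "Obama": 10, "Republican": 3, "Mcconnell": 3}
rep_second = {"Obama": 10, "Republican": 3, "President": 3, "Leader": 10}

# Hypothetical sentence lengths of 10 and 13 terms in a 2-sentence document.
score_first = sentence_score(rep_first, num_terms=10, position=1, doc_length=2)
score_second = sentence_score(rep_second, num_terms=13, position=2, doc_length=2)
print(score_first, score_second)
```

Dividing by the sentence length keeps long sentences from winning on size alone, and the position factor halves the score of the second sentence in this two-sentence document.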
21. 21
Sentence Selection
• The summary of the document collection is generated by selecting ranked sentences.
• Selection considers both content coverage and non-redundancy.
• Sentences are added until a given summary length is reached.
• Some methods measure the similarity of the next candidate sentence to the previously selected ones and select it only if that similarity is below a threshold.
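The threshold-based selection described above can be sketched as a greedy loop. The Jaccard similarity, the 0.5 threshold, the word budget, and the scored sentences are all assumptions for illustration. The presentation does not state which similarity measure its method uses.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences (assumed measure)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def select_sentences(scored, max_words, threshold=0.5):
    """Greedily pick high-scoring sentences that are not too similar to earlier picks."""
    summary, used = [], 0
    for sentence, _ in sorted(scored, key=lambda pair: pair[1], reverse=True):
        length = len(sentence.split())
        if used + length > max_words:
            continue  # sentence would exceed the summary length budget
        if all(jaccard(sentence, chosen) < threshold for chosen in summary):
            summary.append(sentence)
            used += length
    return summary

# Hypothetical (sentence, score) pairs.
scored = [
    ("obama met the senate leader", 2.9),
    ("the senate leader met obama", 2.5),
    ("the president spoke to republican voters", 1.0),
]
summary = select_sentences(scored, max_words=12)
print(summary)
```

The second candidate uses the same words as the first and is rejected as redundant, so the lower-scoring third sentence is selected instead.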
22. 22
Dataset
• The standard benchmark dataset DUC2004 is used.
• It comes from the Document Understanding Conference (DUC) and targets generic summarization evaluation.
• There are 50 document clusters in DUC2004.
• Each cluster consists of 10 English documents.
• DUC2004 provides at least four human-generated summaries for each cluster.
• Participants in the DUC2004 contest submitted their own summaries, which were evaluated against the human-generated summaries.
23. 23
Evaluation (Cont…)
• ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.
• It is a set of metrics and a software package used for evaluating automatic summarization.
• ROUGE tells us how effective our summary is compared with a human-made summary.
• The metrics compare an automatically produced summary against a reference summary or a set of reference summaries.
• Because ROUGE-N is recall-oriented, it is calculated as:
• number of n-grams present in both the human-made and the system summary / total number of n-grams in the human-made summary
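A minimal sketch of ROUGE-N recall against a single reference summary is shown below. It is an illustration only, not the official ROUGE package: it uses whitespace tokenization, applies no stemming or stop-word handling, and does not support multiple references. The two summaries are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, reference, n=1):
    """ROUGE-N recall: matched n-grams divided by the n-grams in the reference."""
    sys_counts = ngrams(system.split(), n)
    ref_counts = ngrams(reference.split(), n)
    # Counter intersection clips each match at the smaller of the two counts.
    overlap = sum((sys_counts & ref_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

system = "obama met the senate leader"
reference = "obama met the republican senate leader today"
rouge_1 = rouge_n(system, reference, n=1)
rouge_2 = rouge_n(system, reference, n=2)
print(rouge_1, rouge_2)
```

Here 5 of the 7 reference unigrams and 3 of the 6 reference bigrams also appear in the system summary.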
24. 24
Tools, Technology, and Dataset
The following tools and technologies are used for this project:
• Python
• DUC 2004 (Document Understanding Conference)
• Matlab 2017