Multi-Document Summarization using
Closed Patterns
Supervisor: Ms. Zakia Jalil
Co-Supervisor: Ms. Sabina Irum
Presented by
Hafsa Sattar [2896-FBAS/BSCS/F14]
Uswa Ihsan [2822-FBAS/BSCS/F14]
Department of Computer Science & Software Engineering
Faculty of Basic & Applied Sciences
International Islamic University, Islamabad.
Overview of Presentation
• Introduction.
• Literature Review.
• Problem Statement.
• Proposed Solution.
• Block Diagram.
• Evaluation.
• Tools, Technology, and Dataset.
• References.
DCSSE/IIUI Multi-Document Summarization using Closed Patterns
Introduction
• The internet provides access to a huge volume of
documents.
• We propose multi-document summarization to extract the most
important information from multiple documents.
• This saves users the time and effort of reading every
document in full.
Multi-Document Summarization
• Multi-document summarization is an automatic
procedure aimed at extracting information from multiple documents.
• Multi-document summarization creates information
reports that are both concise and comprehensive.
Literature Review
• Multi-document summarization methods can be
classified into two classes:
• Extractive summarization:
Extractive summarization extracts the most informative
document components.
• Abstractive summarization:
Abstractive summarization involves reformulation of
contents.
Existing System
• Term-Based Method:
A term-based method has the advantages of efficiency and
maturity for term weight calculation.
Term-Based methods can be divided into the following
categories:
1. Centroid-Based Method:
This method uses clustering algorithms to generate
sentence clusters by calculating sentence similarity.
Existing System (Cont…)
2. Graph-Based Method:
a) Graph-based approaches also belong to extractive
summarization.
b) This method builds a graph-based model.
c) It then selects sentences by means of voting from their
neighbors.
• Ontology-Based Method:
Ontology-based approaches take the meanings of
vocabulary into account.
Problem Statement
• The explosion of electronic documents presents a
serious challenge for readers trying to extract information.
• The extracted information can be false or incomplete,
which might cause trouble in the future.
• The main problem in multi-document summarization (MDS)
arises from the multiple sources from which the data is
extracted.
Proposed Solution
• A pattern-based model for general multi-document
summarization is proposed.
• It extracts the most informative sentences from a
document collection.
• It reduces redundancy in the summary.
• It calculates the weight of each sentence in the document
collection by accumulating the weights of its covering closed
patterns.
Closed Pattern
• Closed patterns represent sets of terms that occur with high
frequency in the document collection.
• The weight of each sentence is determined by the number of
closed patterns in the sentence.
• Sentences containing more closed patterns have higher
scores.
• Sentences that do not contain any closed pattern have the
minimum score of zero.
Block Diagram
• Block diagram of MDS using Closed Patterns
Example
• Table 1. A set of sentences from two news reports
• A sentence is represented by $S_i^j$, where $i$ is the document
number and $j$ is the sentence number.
Example (Cont…)
• Table 2. The terms that occur more than 3 times are listed in the
table below.
Example (Cont…)
Table 3. All frequent patterns are shown in this table with minimum support 3.
Example (Cont…)
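• A pattern is a super-pattern of another pattern if it
contains that pattern as a subset.
• A frequent pattern is closed if none of its super-patterns has
the same support; for example, a pattern with support 4 is closed
when the support of all its super-patterns is smaller than 4.
• There are 23 frequent patterns in Table 3, but not all of them
are closed patterns.
• Only the longest patterns, those that cannot be extended
without lowering their support, are considered; these are the
closed patterns.
• Closed patterns are shown in bold font in Table 3.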
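The definition of closed patterns can be checked with a small brute-force sketch. The sentence term sets below are hypothetical (Table 1 is not reproduced in this text) and were chosen only so that they agree with the counts quoted on the slides; the enumeration is a naive illustration, not the mining algorithm used in the referenced work.

```python
from itertools import combinations

# Hypothetical term sets keyed by (document, sentence); assumed for illustration.
sentences = {
    (1, 1): {"Obama", "Leader", "Republican", "Mcconnell", "Senate"},
    (1, 2): {"Obama", "Leader", "Republican", "President"},
    (2, 1): {"Obama", "Leader", "Republican", "President", "Mcconnell", "Senate"},
    (2, 2): {"Obama", "Leader", "President", "Mcconnell", "Senate"},
}

def frequent_patterns(sentences, min_support):
    """Brute force: map every term set with enough support to its support."""
    terms = sorted(set().union(*sentences.values()))
    result = {}
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            pattern = frozenset(combo)
            support = sum(1 for s in sentences.values() if pattern <= s)
            if support >= min_support:
                result[pattern] = support
    return result

def closed_patterns(frequent):
    """Keep patterns with no proper super-pattern of the same support."""
    return {
        p: sup
        for p, sup in frequent.items()
        if not any(p < q and sup == frequent[q] for q in frequent)
    }

frequent = frequent_patterns(sentences, min_support=3)
closed = closed_patterns(frequent)
for pattern, support in closed.items():
    print(sorted(pattern), support)
```

With these assumed term sets the sketch finds 23 frequent patterns, of which 4 are closed, matching the numbers stated for Table 3.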
Sentence Representation
• A closed pattern can be expressed as a term-weight
pair.
• Let tw be a term-weight pair consisting of a set of
terms and their weights, such as {(t1, a1),…,(ti,
ai),…,(tx, ax)},
• where ti denotes a single term and ai is its weight.
• For example, tw={(Obama, 3), (Republican, 3), (Leader, 3)},
where 3 is the weight of this closed pattern.
• The weight of a closed pattern pi is
w(pi)=|coverSent(pi)|*|coverDoc(pi)|/N, where coverSent(pi) and
coverDoc(pi) are the sets of sentences and documents that contain pi.
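The weight formula can be tried out on a toy collection. In this sketch the sentence term sets are hypothetical, and N is assumed to be the number of documents in the collection, since the slide does not define it; under that assumption the weights 4 and 3 quoted in the example are reproduced.

```python
# Hypothetical term sets keyed by (document, sentence); assumed for illustration.
sentences = {
    (1, 1): {"Obama", "Leader", "Republican", "Mcconnell", "Senate"},
    (1, 2): {"Obama", "Leader", "Republican", "President"},
    (2, 1): {"Obama", "Leader", "Republican", "President", "Mcconnell", "Senate"},
    (2, 2): {"Obama", "Leader", "President", "Mcconnell", "Senate"},
}

def pattern_weight(pattern, sentences, num_docs):
    """w(p) = |coverSent(p)| * |coverDoc(p)| / N (N assumed = number of documents)."""
    cover_sent = [key for key, terms in sentences.items() if pattern <= terms]
    cover_doc = {doc for doc, _ in cover_sent}
    return len(cover_sent) * len(cover_doc) / num_docs

closed = [
    frozenset({"Obama", "Leader"}),
    frozenset({"Obama", "Republican", "Leader"}),
    frozenset({"Obama", "President", "Leader"}),
    frozenset({"Obama", "Leader", "Mcconnell", "Senate"}),
]

num_docs = len({doc for doc, _ in sentences})
weights = {p: pattern_weight(p, sentences, num_docs) for p in closed}
for pattern, weight in weights.items():
    print(sorted(pattern), weight)
```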
Sentence Representation (Cont…)
• Using Table 3, the term-weight pairs of all closed patterns
are:
a) tw1={(Obama, 4), (Leader, 4)}
b) tw2={(Obama, 3), (Republican, 3), (Leader, 3)}
c) tw3={(Obama, 3), (President, 3), (Leader, 3)}
d) tw4={(Obama, 3), (Leader, 3), (Mcconnell, 3),
(Senate, 3)}
• Using tw1, tw2, tw3 and tw4 from Table 3, and writing $S_i^j$ for
sentence $j$ of document $i$, we have
$S_1^1 = \{tw_1, tw_2, tw_4\}$, $S_1^2 = \{tw_1, tw_2, tw_3\}$,
$S_2^1 = \{tw_1, tw_2, tw_3, tw_4\}$, $S_2^2 = \{tw_1, tw_3, tw_4\}$.
Sentence Representation (Cont…)
• Let tw1={(t1, a1),…,(ti, ai),…,(tx, ax)} and tw2={(w1, b1),…,(wj,
bj),…,(wy, by)} be two term-weight pairs associated with two
patterns.
• The composition operation ⊕ between tw1 and tw2 is used to
obtain the sentence representation using closed patterns; it merges
the two sets and adds the weights of terms that appear in both.
• For example, if tw1={(t1, 3), (t2, 2), (t4, 4)} and tw2={(t2, 5), (t3, 2), (t5,
1)},
• then the composition of tw1 and tw2 is tw1 ⊕ tw2={(t1,
3), (t2, 7), (t3, 2), (t4, 4), (t5, 1)}.
• Only closed patterns whose size is more than 1 are used.
• The representation of sentence Si can be obtained using the formula
tw(Si)= twi1 ⊕ twi2 ⊕ … ⊕ twir, where twi1,…,twir are the term-weight
pairs of the closed patterns covering Si.
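The composition operation can be sketched in a few lines. Representing a term-weight pair as a Python dictionary is an assumption made for illustration, and `compose` is a hypothetical helper name.

```python
from collections import Counter

def compose(*term_weights):
    """Merge term-weight pairs, adding the weights of shared terms."""
    total = Counter()
    for tw in term_weights:
        total.update(tw)  # Counter.update adds counts instead of replacing them
    return dict(total)

tw1 = {"t1": 3, "t2": 2, "t4": 4}
tw2 = {"t2": 5, "t3": 2, "t5": 1}
composed = compose(tw1, tw2)
print(composed)
```

Only t2 occurs in both inputs, so its weights 2 and 5 are added to give 7, while every other term keeps its original weight.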
Sentence Representation (Cont…)
The representation of each sentence can then be obtained as:
• $tw(S_1^1)$ = {(Senate, 3), (Leader, 10), (Obama, 10),
(Republican, 3), (Mcconnell, 3)}
• $tw(S_1^2)$ = {(Obama, 10), (Republican, 3), (President, 3),
(Leader, 10)}
• $tw(S_2^1)$ = {(Senate, 3), (Mcconnell, 3), (Leader, 13),
(Obama, 13), (Republican, 3), (President, 3)}
• $tw(S_2^2)$ = {(Senate, 3), (Leader, 10), (Obama, 10),
(President, 3), (Mcconnell, 3)}
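The four sentence representations can be recomputed from the closed patterns that cover each sentence. The sketch below assumes dictionaries for term-weight pairs and (document, sentence) tuples as sentence keys; both are illustrative choices, not part of the original method description.

```python
from collections import Counter

# Term-weight pairs of the four closed patterns, as listed in the example.
closed = {
    "tw1": {"Obama": 4, "Leader": 4},
    "tw2": {"Obama": 3, "Republican": 3, "Leader": 3},
    "tw3": {"Obama": 3, "President": 3, "Leader": 3},
    "tw4": {"Obama": 3, "Leader": 3, "Mcconnell": 3, "Senate": 3},
}

# Closed patterns covering each sentence, keyed by (document, sentence).
covering = {
    (1, 1): ["tw1", "tw2", "tw4"],
    (1, 2): ["tw1", "tw2", "tw3"],
    (2, 1): ["tw1", "tw2", "tw3", "tw4"],
    (2, 2): ["tw1", "tw3", "tw4"],
}

def represent(pattern_names):
    """Compose the term-weight pairs of all covering closed patterns."""
    total = Counter()
    for name in pattern_names:
        total.update(closed[name])
    return dict(total)

representations = {key: represent(names) for key, names in covering.items()}
for key, rep in representations.items():
    print(key, rep)
```

Obama and Leader appear in every closed pattern, which is why their accumulated weights (10 or 13) dominate each representation.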
Sentence Ranking
• Pattern-based summarization ranks all sentences
according to their sentence representation.
• Sentences that contain more closed patterns with high
weights receive higher scores.
• $Score(S_i^j) = \frac{weight(S_i^j)}{|S_i^j|} \times \left(1 - \frac{j-1}{|d_i|}\right)$,
where $j$ is the position of the sentence in document $d_i$.
• The factor $1 - \frac{j-1}{|d_i|}$ favors earlier sentences, because
the starting sentence in a document contains more new
information than the following sentences.
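A minimal sketch of the ranking score follows. Several readings are assumptions here because the slide does not spell them out: weight(S) is taken as the sum of the term weights in tw(S), |S| as the number of words in the sentence, and |d_i| as the number of sentences in the document. The sentence length of 20 words is invented for the example.

```python
def sentence_score(term_weights, sentence_length, position, doc_length):
    """Score = weight(S) / |S| * (1 - (j - 1) / |d_i|).

    term_weights    -- tw(S) as a dict of term -> weight
    sentence_length -- |S|, assumed to be the word count of the sentence
    position        -- j, the 1-based position of the sentence in its document
    doc_length      -- |d_i|, assumed to be the number of sentences in the document
    """
    weight = sum(term_weights.values())  # assumed definition of weight(S)
    return weight / sentence_length * (1 - (position - 1) / doc_length)

tw_first = {"Senate": 3, "Leader": 10, "Obama": 10, "Republican": 3, "Mcconnell": 3}
tw_second = {"Obama": 10, "Republican": 3, "President": 3, "Leader": 10}

score_first = sentence_score(tw_first, sentence_length=20, position=1, doc_length=2)
score_second = sentence_score(tw_second, sentence_length=20, position=2, doc_length=2)
print(score_first, score_second)
```

The first sentence keeps its full length-normalized weight, while the second sentence of a two-sentence document is scaled by one half.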
Sentence Selection
• The summary of the document collection is generated by
selecting ranked sentences.
• Selection considers both content coverage and
non-redundancy.
• Sentences are added until a given summary length is reached.
• Some methods measure the similarity of the next
candidate sentence to the previously selected ones
• and select it only if its similarity is below a threshold.
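The selection step can be illustrated with a greedy loop. This is a sketch under stated assumptions: Jaccard similarity over lowercased words stands in for whatever similarity measure is actually used, the threshold of 0.5 and the word limit are arbitrary, and the candidate sentences and scores are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_sentences(scored, max_words, threshold=0.5):
    """Greedily pick high-scoring sentences that are not too similar
    to those already selected, within a word budget."""
    summary, used = [], 0
    for text, _score in sorted(scored, key=lambda item: item[1], reverse=True):
        words = text.split()
        if used + len(words) > max_words:
            continue  # skip sentences that would exceed the length limit
        tokens = {w.lower() for w in words}
        if all(jaccard(tokens, {w.lower() for w in s.split()}) < threshold
               for s in summary):
            summary.append(text)
            used += len(words)
    return summary

candidates = [
    ("Obama met the Republican leader", 3.0),
    ("Obama met the Republican leader today", 2.5),
    ("The Senate passed the bill", 2.0),
]
summary = select_sentences(candidates, max_words=12)
print(summary)
```

The second candidate is rejected because it shares five of six distinct words with the first, so the third, lower-scoring but non-redundant sentence is chosen instead.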
Dataset
• The standard benchmark dataset DUC2004, from the Document
Understanding Conference (DUC), is used for generic
summarization evaluation.
• There are 50 document clusters in DUC2004.
• Each cluster consists of 10 English documents.
• DUC2004 provides at least four human-generated
summaries for each cluster.
• Participants in the DUC2004 contest submitted their
own summaries, which were evaluated against the
human-generated summaries.
Evaluation (Cont…)
• ROUGE stands for Recall-Oriented Understudy for Gisting
Evaluation.
• It is a set of metrics and a software package used for
evaluating automatic summarization.
• ROUGE tells us how well our summary matches a
human-made summary.
• The metrics compare an automatically produced summary
against a reference summary or a set of reference summaries.
• The formula for the ROUGE-N recall calculation is:
number of n-grams present in both the human-made and the
system summary / total number of n-grams in the human-made
summary
• Dividing by the total number of n-grams in the system
summary instead gives the corresponding precision.
Tools, Technology, and Dataset
The following tools and technologies are used for
this project:
• Python
• DUC 2004 (Document Understanding Conference)
• Matlab 2017
References
[1] Ji-Peng Qiang, Peng Chen, Wei Ding, Fei Xie, Xindong Wu,
Multi-document Summarization using Closed Patterns,
Knowledge-Based Systems (2016).
[2] https://en.wikipedia.org/wiki/Multi-document_summarization
[3] https://www.slideshare.net/LiaRatna1/sinonim-38250183
[4] https://www.hindawi.com/journals/tswj/2016/1784827/
[5] https://en.wikipedia.org/wiki/ROUGE_(metric)
[6] https://www.quora.com/What-is-the-meaning-and-formula-for-the-ROUGE-SU-metric-for-evaluating-summaries
[7] https://en.wikipedia.org/wiki/N-gram
[8] https://en.wikipedia.org/wiki/F1_score
Questions & Answers
Q & A
Thank You