DOCUMENT SUMMARIZATION
Abhirut Gupta
Mandar Joshi
Piyush Dungarwal
MOTIVATION
 The advent of the World Wide Web has created a large reservoir
of data
 A short summary, which conveys the essence of the
document, helps in finding relevant information
quickly
 Document summarization also provides a way to
cluster similar documents and present a summary
OUTLINE
 Definition
 Types: Extractive and Abstractive
 Techniques: Supervised and Unsupervised
 Single document summarization
 Approaches
 TextRank
 Multi document summarization
 Challenges
 NEATS
 Multilingual summarization
 Summarization competitions
 Evaluation
WHAT IS SUMMARIZATION?
 A summary is a text that is produced from one or
more texts, that contains a significant portion of the
information in the original text(s), and that is no
longer than half of the original text(s).
 Summaries may be classified as:
 Extractive
 Abstractive
TYPES OF SUMMARIES
EXTRACTIVE SUMMARIES
 Extractive summaries are created by reusing
portions (words, sentences, etc.) of the input text
verbatim.
 For example, search engines typically generate
extractive summaries from webpages.
 Most of the summarization research today is on
extractive summarization.
[Figure: a sample text alongside its extractive summary]
ABSTRACTIVE SUMMARIES
 In abstractive summarization, information from the
source text is re-phrased.
 Human beings generally write abstractive
summaries (except when they do their assignments).
 Abstractive summarization has not reached a
mature stage because allied problems such as
semantic representation, inference and natural
language generation are relatively harder.
ABSTRACTIVE SUMMARY: BOOK REVIEW
 An innocent hobbit of The Shire journeys with eight
companions to the fires of Mount Doom to destroy
the One Ring and the dark lord Sauron forever.
SUMMARIZATION TECHNIQUES
 Summarization techniques can be supervised or
unsupervised.
 Supervised techniques use a collection of
documents and their human-generated summaries
to train a classifier, which is then applied to the given text.
SUPERVISED TECHNIQUES
 Features (e.g. position of the sentence, number of
words in the sentence etc.) of sentences that make
them good candidates for inclusion in the summary
are learnt.
 Sentences in an original training document can be
labelled as “in summary” or “not in summary”.
SUPERVISED TECHNIQUES
 The main drawback of supervised techniques is
that training data is expensive to produce and
relatively sparse.
 Also, most readily available human generated
summaries are abstractive in nature.
MAJOR SUPERVISED APPROACHES
 Wong et al. use SVM to judge the importance of a
sentence using feature categories:
 Surface features: position, length of sentence etc.
 Content features: statistics of content-bearing words
 Relevance features: exploit inter-sentence relationships,
e.g. similarity of the sentence with the first sentence.
 Sentences are then ranked accordingly and top
ranked sentences are included in the summary.
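The three feature categories can be sketched as follows. This is a minimal illustration, not Wong et al.'s system: the feature definitions, the `WEIGHTS` table and the function names are assumptions made for the example, and a fixed linear score stands in for the decision function of a trained SVM.

```python
import re

def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def sentence_features(sentences, content_words):
    """Return one feature dict per sentence (surface, content, relevance)."""
    first = set(tokenize(sentences[0]))
    rows = []
    for i, sent in enumerate(sentences):
        toks = tokenize(sent)
        tokset = set(toks)
        rows.append({
            # Surface: earlier sentences get a higher position value.
            "position": 1.0 - i / len(sentences),
            # Surface: sentence length in tokens.
            "length": len(toks),
            # Content: fraction of tokens that are content-bearing words.
            "content": sum(t in content_words for t in toks) / max(len(toks), 1),
            # Relevance: Jaccard overlap with the first sentence.
            "sim_first": len(tokset & first) / max(len(tokset | first), 1),
        })
    return rows

# Hypothetical weights standing in for a trained classifier.
WEIGHTS = {"position": 1.0, "length": 0.0, "content": 2.0, "sim_first": 1.0}

def rank_sentences(sentences, content_words, top_k=2):
    """Score every sentence and return the top_k in document order."""
    feats = sentence_features(sentences, content_words)
    scores = [sum(WEIGHTS[k] * f[k] for k in WEIGHTS) for f in feats]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(order[:top_k])]
```

In a supervised setting the weights would be learnt from sentences labelled "in summary" or "not in summary" rather than set by hand.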
MAJOR SUPERVISED APPROACHES
 Lin and Hovy use the concept of topic signatures to
rank sentences.
 TS = {topic, signature}
= {topic, <t1, w1>, …, <tn, wn>}
 TS = {restaurant-visit, <food, 0.5>, <menu, 0.2>,
<waiter, 0.15>, … }
 Topic signatures are learnt using a set of
documents pre-classified as relevant or non-
relevant for each topic.
EXAMPLE
 If a document is classified as relevant to restaurant
visit, we know the important words of this document
will form the topic signature of the topic “restaurant
visit”. This is a supervised process.
 Extraction of the important words from a document
is an unsupervised process.
E.g., "food" occurs many times in a cookbook and is an
important word in that document.
TOPIC SIGNATURES
 During deployment topic signatures are used to find
the topic or theme of the text.
 Sentences are then ranked according to the sum of
weights of terms relevant to the topic in the
sentence.
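Ranking by topic signature can be sketched as below. The signature reuses the three terms and weights shown in the restaurant-visit example; the remaining terms are elided on the slide and are not filled in here. The function names are illustrative, and the choice to count every occurrence of a term is an assumption.

```python
import re

# Topic signature from the example above; only the three terms shown there.
RESTAURANT_VISIT = {"food": 0.5, "menu": 0.2, "waiter": 0.15}

def signature_score(sentence, signature):
    """Sum the signature weights of the terms that occur in the sentence.

    Each occurrence counts, so a repeated term contributes more than once.
    """
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(signature.get(tok, 0.0) for tok in tokens)

def rank_by_signature(sentences, signature):
    """Return (score, sentence) pairs, highest score first."""
    scored = [(signature_score(s, signature), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For example, "The food was great." scores 0.5, while "The waiter brought the menu." scores 0.15 + 0.2 = 0.35.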
UNSUPERVISED TECHNIQUES
UNSUPERVISED TECHNIQUES: TEXTRANK AND
LEXRANK
 This approach models the document as a graph
and uses an algorithm similar to Google’s
PageRank algorithm to find top-ranked sentences.
 The key intuition is the notion of centrality or
prestige in social networks i.e. a sentence should
be highly ranked if it is recommended by many
other highly ranked sentences.
INTUITION
 “If Sachin Tendulkar says Malinga is a good
batsman, he should be regarded highly. But then if
Sachin is a gentleman, who talks highly of
everyone, Malinga might not really be as good.”
FORMULA
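The formula on this slide did not survive extraction. For reference, and on the assumption that the slide showed the scoring function from the TextRank paper (Mihalcea and Tarau, 2004), the score of a vertex V_i is given below, first for an unweighted graph and then for a graph with edge weights w_ji. In(V_i) is the set of vertices pointing to V_i, Out(V_j) is the set of vertices that V_j points to, and d is a damping factor, commonly set to 0.85.

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{1}{|\mathrm{Out}(V_j)|} \, S(V_j)

WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j)
```

Dividing by |Out(V_j)| captures the intuition above: a recommendation from a vertex that recommends everyone is worth less.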
AN EXAMPLE GRAPH:
Image courtesy: Wikipedia
TEXT AS A GRAPH
 Sentences in the text are modelled as vertices of
the graph.
 Two vertices are connected if there exists a
similarity relation between them.
 Similarity formula
 After the ranking algorithm is run on the graph,
sentences are sorted in descending order of their
score, and the top-ranked sentences are selected.
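The steps above can be sketched end to end. The similarity formula on the slide was not preserved; this sketch assumes the word-overlap measure from the TextRank paper (shared words divided by the sum of the logs of the two sentence lengths). The function names, the fixed iteration count and the guard for one-word sentences are choices made for the illustration.

```python
import math
import re

def tokenize(sentence):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def similarity(a, b):
    """Shared words, normalised by the log lengths of both sentences."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 could make the denominator vanish
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))

def textrank(sentences, d=0.85, iterations=50):
    """Return one score per sentence from a weighted PageRank-style iteration."""
    toks = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Edge weight between every pair of distinct sentences.
    w = [[similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize(sentences, k=2):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other sentence receives only the baseline score 1 - d, so it is ranked last.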
WHY DOES TEXTRANK WORK?
 Through the graphs, TextRank identifies
connections between various entities, and
implements the concept of recommendation.
 A text unit recommends other related text units, and
the strength of the recommendation is recursively
computed based on the importance of the units
making the recommendation.
 The sentences that are highly recommended by
other sentences in the text are likely to be more
informative
NEWSBLASTER DEMO
MULTI-DOCUMENT AND MULTILINGUAL SUMMARIZATION
MULTI DOCUMENT SUMMARIZATION
 A large set of documents may have thematic
diversity.
 Individual summaries may have overlapping
content.
 Many desirable features of an ideal summary are
relatively difficult to achieve in a multi document
setting.
 Clear structure
 Meaningful paragraphs
 Gradual transition from general to specific
 Good readability
NEATS
 Summarization is done in three stages – content
selection, content filtering and presentation.
 Selection stage is similar to that used in single
document summarization.
 Content Filtering uses the following techniques
 Sentence position
 Stigma Words
 MMR
NEATS
 Stigma Words
 conjunctions (e.g., but, although, however),
 the verb say and its derivatives,
 quotation marks,
 pronouns such as he, she, and they
 MMR
 Maximal Marginal Relevance:
 Search for the sentence that is most relevant to the query
and most dissimilar to the sentences already in the summary
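The MMR selection rule can be sketched as a greedy loop. This is an illustration rather than the NeATS implementation: Jaccard word overlap stands in for whatever similarity measure a real system uses, and the trade-off parameter `lam` and the function names are assumptions.

```python
import re

def tokens(text):
    """Set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Set overlap in [0, 1]; stands in for any similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mmr_select(query, sentences, k=2, lam=0.7):
    """Greedily pick k sentences, trading query relevance against redundancy."""
    q = tokens(query)
    sets = [tokens(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            relevance = jaccard(sets[i], q)
            # Similarity to the closest sentence already in the summary.
            redundancy = max((jaccard(sets[i], sets[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=1.0` the redundancy term vanishes and the loop reduces to ranking by relevance alone, so near-duplicate sentences can both be selected.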
PRESENTATION
 The presentation stage proposes to solve two major
challenges – definite noun phrases and events
spread along an extended timeline.
 Definite noun phrase problem
 “The Great Depression of 1929 caused severe strain on
the economy. The President proposed the New Deal to
tackle this challenge.”
 NeATS uses a buddy system where each sentence is
paired with a suitable introductory sentence.
PRESENTATION
 In multi-document summarization, a date
expression such as Monday occurring in two
different documents might mean the same date or
different dates.
 Time Annotation is used to tackle such problems.
 Publication dates are used as reference points to
compute the actual date/time for date expressions –
weekdays (Sunday, Monday, etc), (past | next | coming)
+ weekdays, today, yesterday, last night.
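Resolving date expressions against a publication date can be sketched with the standard `datetime` module. The handled expressions follow the list above; the function name `resolve` and the reading of a bare weekday as "the most recent such day, possibly the publication date itself" are assumptions, since the slide does not state how NeATS resolves that case.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve(expression, published):
    """Map a date expression to a calendar date, relative to `published`."""
    words = expression.lower().split()
    if words == ["today"]:
        return published
    if words in (["yesterday"], ["last", "night"]):
        return published - timedelta(days=1)
    if len(words) == 1:
        modifier, day = None, words[0]
    elif len(words) == 2:
        modifier, day = words
    else:
        return None  # expression not covered by this sketch
    if day not in WEEKDAYS or modifier not in (None, "past", "next", "coming"):
        return None
    delta = WEEKDAYS.index(day) - published.weekday()
    if modifier in ("next", "coming"):
        # Strictly after the publication date.
        return published + timedelta(days=delta % 7 or 7)
    if modifier == "past":
        # Strictly before the publication date.
        return published - timedelta(days=(-delta) % 7 or 7)
    # Assumption: a bare weekday is the most recent such day,
    # including the publication date itself.
    return published - timedelta(days=(-delta) % 7)
```

For example, "Monday" in a document published on Friday 12 October 2012 resolves to 8 October, while the same word in a document published on Wednesday 17 October 2012 resolves to 15 October.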
MULTI LINGUAL SUMMARIZATION
 Two approaches exist
 Documents from the source language are translated to
the target language and then summarization is
performed on the target language.
 Language specific tools are used to perform
summarization in the source language and then
machine translation is applied on the summarized text.
COMPARISON
 Machine Translation is an involved process and not
very precise, hence the first approach tends to be
expensive as translation is applied to a large text (all
the documents in the source language to be
summarized) and also leads to error propagation.
 For the second approach, if the source language is
not very widely used, language specific tools for
summarization in the source language may not
exist and creation of such tools may be expensive.
EXAMPLE: APPROACH 1
 Hindi text –
 ऐसे कई तरीके हैं जिससे आप स्कूल में अपने बच्चे/बच्ची की मदद कर सकते हैं।
वह स्कूल नियमित रूप से और समय पर जाता/जाती है यह निश्चित करने के लिए
आप कानूनी तौर पर उत्तरदायी हैं। परंतु स्कूल के नियमों और होमवर्क के लिए
इसके द्वारा किये जाने वाले व्यवस्थाओं का समर्थन करते हुए भी आप मदद
कर सकते हैं। आप स्कूल की नीतियों का समर्थन करते हैं यह बात आपका बच्चा
जानता है यह निश्चित करें।
EXAMPLE: APPROACH 1
 English translation –
 There are many ways you can help your child in
school. By law you are responsible for making sure
he or she goes to school regularly, and on time. But
you can also help by supporting the school's rules,
and its arrangements for homework. Make sure
your child knows that you support the school's
policies.
 Summary of the translated text – You are equally
responsible to ensure your child follows school
rules.
EXAMPLE: APPROACH 2
 Hindi text –
 स्कूल में अपने बच्चे की मदद आप किस प्रकार कर सकते हैं इस विषय में
आप अपने बच्चे के शिक्षकों से पूछें। ऐसा कुछ भी जिससे कि स्कूल में आपके
बच्चे के कार्य पर प्रभाव पड़ सकता है उस बारे में आप उन्हें बताएं, यदि आप
अपने बच्चे के प्रगति के विषय में चिन्तित हैं तो उनसे बातचीत करें। निश्चित करें
स्कूल इस बात से अवगत है कि आप चाहते हैं कि किसी भी समस्या के उत्पन्न
होने पर, जिसमें कि आपका बच्चा शामिल है, आपको तुरंत ही इस बारे में
बताया जाए।
EXAMPLE: APPROACH 2
 Hindi summary – आपके बच्चे के शिक्षकों के साथ अच्छा संपर्क
आपके बच्चे की प्रगति में लाभदायक है।
 Translated Summary – Good communication with
your child’s teachers is beneficial for your child’s
progress.
EVALUATION
COMPETITIONS: DUC AND MUC
 Research forums for encouraging development of
new technologies for information extraction and text
summarization.
 Have resulted in the development of evaluation
criteria of summarization systems
EVALUATION
 ROUGE stands for Recall-Oriented Understudy for
Gisting Evaluation.
 Recall based score to compare system generated
summary with one or more human generated
summaries.
 Scores can be computed by matching n-grams of any order.
 Unigram matching is found to be the best indicator
for evaluation.
 ROUGE-1 is computed as the number of unigrams in the reference
summary that also appear in the system summary, divided by the
total number of unigrams in the reference summary.
EXAMPLES
Reference-summary: Beijing hosted the summer
Olympics.
System-summary: The summer Olympics were held in
Beijing.
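ROUGE-1 score: 0.75 (3 of the 4 reference unigrams match once the stopword "the" is dropped; with "the" counted, the score is 4/5 = 0.8)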
Reference-summary: The policemen killed the gunman
System-summary: The gunman killed the policemen
ROUGE-1 score: 1
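The two examples can be checked with a small ROUGE-1 recall function. This is a sketch of the definition given above, not the official ROUGE toolkit: the tokenisation (lowercased word characters), the optional `stopwords` parameter and the function names are assumptions.

```python
import re
from collections import Counter

def unigrams(text, stopwords=frozenset()):
    """Lowercased word counts, optionally dropping stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

def rouge_1(reference, system, stopwords=frozenset()):
    """Recall: matched reference unigrams / all reference unigrams."""
    ref = unigrams(reference, stopwords)
    sys_counts = unigrams(system, stopwords)
    # Clip each match at the number of times the word occurs
    # in the system summary.
    matched = sum(min(count, sys_counts[word]) for word, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0
```

On the Beijing example this gives 0.75 when "the" is passed as a stopword and 0.8 when it is counted. The policemen example gives 1.0 either way, even though the two sentences mean opposite things, because unigram matching ignores word order.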
EVALUATION
 The ROUGE score is averaged for multiple
references.
 ROUGE-1 does not determine if the result is
coherent or if the sentences flow together in a
sensible manner.
 A higher order n-gram ROUGE score can measure
fluency to some degree.
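Both points can be illustrated with a generic ROUGE-N sketch. The function names and the tokenisation are assumptions, and the plain mean over references follows the description above; the official ROUGE toolkit may aggregate multiple references differently.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Counts of lowercased word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(reference, system, n):
    """Recall of reference n-grams, with clipped counts."""
    ref, sys_counts = ngrams(reference, n), ngrams(system, n)
    matched = sum(min(c, sys_counts[g]) for g, c in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

def rouge_n_avg(references, system, n):
    """Mean of the per-reference scores."""
    return sum(rouge_n(r, system, n) for r in references) / len(references)
```

For the reference "The policemen killed the gunman" and the system summary "The gunman killed the policemen", the score falls from 1.0 for unigrams to 0.75 for bigrams and 0.0 for trigrams, so higher-order n-grams do penalise the reversed word order.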
CONCLUSION
 Most of the current research is based on extractive
multi-document summarization.
 Current summarization systems are widely used to
summarize news and other online articles.
 Top level view:
 Query-based vs. Generic
 Query based techniques give consideration to user
preferences which can be formulated as a query
 Rule-based vs. ML/ Statistical
 Most of the early techniques were rule-based whereas the
current ones apply statistical approaches
 Rules (such as sentence position) can still be used as heuristics
CONCLUSION
 Keyword vs. Graph-based
 Keyword based techniques rank sentences based on
the occurrence of relevant keywords.
 Graph based techniques rank sentences based on
content overlap.
REFERENCES
 Wikipedia page on summarization
http://en.wikipedia.org/wiki/Automatic_summarization
 Mihalcea R. and Tarau P. 2004. TextRank: Bringing
Order into Text. In Proc. of EMNLP 2004.
 Lin C.Y. and Hovy E. 2000. The automated acquisition of
topic signatures for text summarization. In Proc. of the
18th conference on Computational linguistics - Volume
1.
 Lin C.Y. and Hovy E. 2002. From single to multi-
document summarization: a prototype system and its
evaluation. In Proc. of the 40th Annual Meeting on
Association for Computational Linguistics (ACL '02).
 Wong K.F., Wu M. and Li W. 2008. Extractive
summarization using supervised and semi-supervised
learning. In Proc. of the 22nd International Conference
on Computational Linguistics - Volume 1.
 Newsblaster: http://newsblaster.cs.columbia.edu/
Q & A

More Related Content

What's hot (7)

1Intro to Tech Writing
1Intro to Tech Writing1Intro to Tech Writing
1Intro to Tech Writing
 
Technical writing and communication
Technical writing and communicationTechnical writing and communication
Technical writing and communication
 
Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and origin
 
SCTUR: A Sentiment Classification Technique for URDU
SCTUR: A Sentiment Classification Technique for URDUSCTUR: A Sentiment Classification Technique for URDU
SCTUR: A Sentiment Classification Technique for URDU
 
Four Point Persuasive Blog Rubric
Four Point Persuasive Blog RubricFour Point Persuasive Blog Rubric
Four Point Persuasive Blog Rubric
 
10. technical communication
10. technical communication10. technical communication
10. technical communication
 
Technical writing
Technical writingTechnical writing
Technical writing
 

Similar to summarization-oct12.pptx

reading skills task cards SHS UPDATED.pptx
reading skills task cards SHS UPDATED.pptxreading skills task cards SHS UPDATED.pptx
reading skills task cards SHS UPDATED.pptxJasmineLinogon
 
Week 4: Summarising and paraphrasing
Week 4: Summarising and paraphrasingWeek 4: Summarising and paraphrasing
Week 4: Summarising and paraphrasingADT U2B
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Sentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersSentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersMOHDSAIFWAJID1
 
Writing effective paragraphs 1
Writing effective paragraphs 1Writing effective paragraphs 1
Writing effective paragraphs 1philodias
 
Survey on Key Phrase Extraction using Machine Learning Approaches
Survey on Key Phrase Extraction using Machine Learning ApproachesSurvey on Key Phrase Extraction using Machine Learning Approaches
Survey on Key Phrase Extraction using Machine Learning ApproachesYogeshIJTSRD
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representationsuthi
 
1 Summary Assignment Rakesh Mittoo 1 THE UNIVE.docx
1 Summary Assignment  Rakesh Mittoo  1 THE UNIVE.docx1 Summary Assignment  Rakesh Mittoo  1 THE UNIVE.docx
1 Summary Assignment Rakesh Mittoo 1 THE UNIVE.docxjeremylockett77
 
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...TELKOMNIKA JOURNAL
 
How to achieve efl students´ reading comprehension
How to achieve efl students´ reading comprehensionHow to achieve efl students´ reading comprehension
How to achieve efl students´ reading comprehensionMayito0105
 
1. Research Paper75 points (See Grading Rubric Below)The purp.docx
1. Research Paper75 points (See Grading Rubric Below)The purp.docx1. Research Paper75 points (See Grading Rubric Below)The purp.docx
1. Research Paper75 points (See Grading Rubric Below)The purp.docxstilliegeorgiana
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for TranslationRIILP
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkishcsandit
 
Uses knowledge of text structure to glean the information he/she needs
Uses knowledge of text structure to glean the information he/she needsUses knowledge of text structure to glean the information he/she needs
Uses knowledge of text structure to glean the information he/she needsMichelle390295
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHcscpconf
 

Similar to summarization-oct12.pptx (20)

reading skills task cards SHS UPDATED.pptx
reading skills task cards SHS UPDATED.pptxreading skills task cards SHS UPDATED.pptx
reading skills task cards SHS UPDATED.pptx
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
 
Week 4: Summarising and paraphrasing
Week 4: Summarising and paraphrasingWeek 4: Summarising and paraphrasing
Week 4: Summarising and paraphrasing
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Sentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersSentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clusters
 
Writing effective paragraphs 1
Writing effective paragraphs 1Writing effective paragraphs 1
Writing effective paragraphs 1
 
Chp 9
Chp 9 Chp 9
Chp 9
 
Chp 9
Chp 9Chp 9
Chp 9
 
Survey on Key Phrase Extraction using Machine Learning Approaches
Survey on Key Phrase Extraction using Machine Learning ApproachesSurvey on Key Phrase Extraction using Machine Learning Approaches
Survey on Key Phrase Extraction using Machine Learning Approaches
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representation
 
1 Summary Assignment Rakesh Mittoo 1 THE UNIVE.docx
1 Summary Assignment  Rakesh Mittoo  1 THE UNIVE.docx1 Summary Assignment  Rakesh Mittoo  1 THE UNIVE.docx
1 Summary Assignment Rakesh Mittoo 1 THE UNIVE.docx
 
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
 
How to achieve efl students´ reading comprehension
How to achieve efl students´ reading comprehensionHow to achieve efl students´ reading comprehension
How to achieve efl students´ reading comprehension
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
1. Research Paper75 points (See Grading Rubric Below)The purp.docx
1. Research Paper75 points (See Grading Rubric Below)The purp.docx1. Research Paper75 points (See Grading Rubric Below)The purp.docx
1. Research Paper75 points (See Grading Rubric Below)The purp.docx
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 
Uses knowledge of text structure to glean the information he/she needs
Uses knowledge of text structure to glean the information he/she needsUses knowledge of text structure to glean the information he/she needs
Uses knowledge of text structure to glean the information he/she needs
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
 

Recently uploaded

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

summarization-oct12.pptx

  • 2. MOTIVATION  The advent of WWW has created a large reservoir of data  A short summary, which conveys the essence of the document, helps in finding relevant information quickly  Document summarization also provides a way to cluster similar documents and present a summary
  • 4. OUTLINE  Definition  Types: Extractive and Abstractive  Techniques: Supervised and Unsupervised-  Single document summarization  Approaches  TextRank  Multi document summarization  Challenges  NEATS  Multilingual summarization  Summarization competitions  Evaluation
  • 5. WHAT IS SUMMARIZATION?  A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s).  Summaries may be classified as:  Extractive  Abstractive
  • 7. EXTRACTIVE SUMMARIES  Extractive summaries are created by reusing portions (words, sentences, etc.) of the input text verbatim.  For example, search engines typically generate extractive summaries from webpages.  Most of the summarization research today is on extractive summarization.
  • 9. ABSTRACTIVE SUMMARIES  In abstractive summarization, information from the source text is re-phrased.  Human beings generally write abstractive summaries (except when they do their assignments).  Abstractive summarization has not reached a mature stage because allied problems such as semantic representation, inference and natural language generation are relatively hard.
  • 10. ABSTRACTIVE SUMMARY: BOOK REVIEW  An innocent hobbit of The Shire journeys with eight companions to the fires of Mount Doom to destroy the One Ring and the dark lord Sauron forever.
  • 11. SUMMARIZATION TECHNIQUES  Summarization techniques can be supervised or unsupervised.  Supervised techniques use a collection of documents and human-generated summaries for them to train a classifier for the given text.
  • 13. SUPERVISED TECHNIQUES  Features (e.g. position of the sentence, number of words in the sentence etc.) of sentences that make them good candidates for inclusion in the summary are learnt.  Sentences in an original training document can be labelled as “in summary” or “not in summary”.
  • 14. SUPERVISED TECHNIQUES  The main drawback of supervised techniques is that training data is expensive to produce and relatively sparse.  Also, most readily available human generated summaries are abstractive in nature.
  • 15. MAJOR SUPERVISED APPROACHES  Wong et al. use an SVM to judge the importance of a sentence using three feature categories:  Surface features: position, length of the sentence, etc.  Content features: statistics of content-bearing words.  Relevance features: exploit inter-sentence relationships, e.g. similarity of the sentence with the first sentence.  Sentences are then ranked accordingly and the top-ranked sentences are included in the summary.
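The surface and relevance features named above can be sketched roughly as follows. This is a minimal illustration, not the feature set of Wong et al.: the function names, the normalised position, and the use of Jaccard overlap as the "similarity with the first sentence" are all assumptions made for the example, and the SVM training step is left out.

```python
import re


def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sentence_features(sentences):
    """Return one feature dict per sentence (hypothetical feature set)."""
    first = set(tokenize(sentences[0]))
    last_index = max(len(sentences) - 1, 1)
    features = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        vocab = set(words)
        union = first | vocab
        features.append({
            # Surface feature: relative position, 0.0 for the first sentence.
            "position": index / last_index,
            # Surface feature: sentence length in words.
            "length": len(words),
            # Relevance feature: Jaccard overlap with the first sentence
            # (an assumed stand-in for the similarity measure).
            "sim_to_first": len(first & vocab) / len(union) if union else 0.0,
        })
    return features
```

Each dict could then be turned into a feature vector and paired with an "in summary" / "not in summary" label to train a classifier.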
  • 16. MAJOR SUPERVISED APPROACHES  Lin and Hovy use the concept of topic signatures to rank sentences.  TS = {topic, signature} = {topic, <t1, w1>,…,<tn, wn>}  TS = {restaurant-visit, <food, 0.5>, <menu, 0.2>, <waiter, 0.15>, … }  Topic signatures are learnt using a set of documents pre-classified as relevant or non-relevant for each topic.
  • 17. EXAMPLE  If a document is classified as relevant to “restaurant visit”, we know that the important words of this document will form part of the topic signature of the topic “restaurant visit”. Because it relies on labelled documents, this is a supervised process.  Extracting the important words from a document is an unsupervised process. E.g., “food” occurs many times in a cookbook and is an important word in that document.
  • 18. TOPIC SIGNATURES  During deployment topic signatures are used to find the topic or theme of the text.  Sentences are then ranked according to the sum of weights of terms relevant to the topic in the sentence.
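Ranking by "sum of weights of terms relevant to the topic" can be sketched as below. The weights are the three shown in the restaurant-visit example; the function names and the simple regex tokenizer are assumptions for illustration, and learning the signature itself is not shown.

```python
import re

# Weights taken from the restaurant-visit example; a real signature
# would contain many more terms.
RESTAURANT_SIGNATURE = {"food": 0.5, "menu": 0.2, "waiter": 0.15}


def score_sentence(sentence, signature):
    """Sum the signature weights of every word occurrence in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(signature.get(word, 0.0) for word in words)


def rank_sentences(sentences, signature):
    """Return the sentences sorted from highest to lowest signature score."""
    return sorted(
        sentences,
        key=lambda s: score_sentence(s, signature),
        reverse=True,
    )
```

Sentences with no signature terms score zero and fall to the bottom of the ranking.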
  • 20. UNSUPERVISED TECHNIQUES: TEXTRANK AND LEXRANK  This approach models the document as a graph and uses an algorithm similar to Google’s PageRank algorithm to find top-ranked sentences.  The key intuition is the notion of centrality or prestige in social networks i.e. a sentence should be highly ranked if it is recommended by many other highly ranked sentences.
  • 21. INTUITION  “If Sachin Tendulkar says Malinga is a good batsman, Malinga should be regarded highly. But then if Sachin is a gentleman who talks highly of everyone, Malinga might not really be as good.” FORMULA
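The formula shown on this slide is not preserved in the transcript. As a hedged reconstruction, the TextRank paper listed in the references uses the PageRank recurrence below, where In(V_i) is the set of vertices pointing to V_i, Out(V_j) is the set of vertices V_j points to, and d is a damping factor (usually set to 0.85):

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{S(V_j)}{\lvert \mathrm{Out}(V_j) \rvert}
```

The division by |Out(V_j)| captures the second half of the intuition: a vertex that recommends everyone passes on only a small share of its score to each.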
  • 22. AN EXAMPLE GRAPH: Image courtesy: Wikipedia
  • 23. TEXT AS A GRAPH  Sentences in the text are modelled as vertices of the graph.  Two vertices are connected if there exists a similarity relation between them.  Similarity formula  After the ranking algorithm is run on the graph, sentences are sorted in descending order of their score, and the top-ranked sentences are selected.
  • 24. WHY DOES TEXTRANK WORK?  Through the graphs, TextRank identifies connections between various entities, and implements the concept of recommendation.  A text unit recommends other related text units, and the strength of the recommendation is recursively computed based on the importance of the units making the recommendation.  The sentences that are highly recommended by other sentences in the text are likely to be more informative.
  • 27. MULTI DOCUMENT SUMMARIZATION  A large set of documents may have thematic diversity.  Individual summaries may have overlapping content.  Many desirable features of an ideal summary are relatively difficult to achieve in a multi document setting.  Clear structure  Meaningful paragraphs  Gradual transition from general to specific  Good readability
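A compact sketch of sentence-level TextRank, under stated assumptions: the similarity is the word-overlap measure from the TextRank paper (shared words divided by the sum of the log sentence lengths), the tokenizer is a plain regex, and the function names, iteration count, and damping factor 0.85 are illustrative choices rather than anything fixed by the slides.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def similarity(a, b):
    """Shared words normalised by the log lengths of both sentences."""
    wa, wb = tokenize(a), tokenize(b)
    if not wa or not wb:
        return 0.0
    denominator = math.log(len(wa)) + math.log(len(wb))
    if denominator == 0:  # both sentences contain a single word
        return 0.0
    return len(set(wa) & set(wb)) / denominator


def textrank(sentences, d=0.85, iterations=50):
    """Return one score per sentence using weighted PageRank iterations."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence j passes on its score in proportion to edge weight.
        scores = [
            (1 - d) + d * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Pick the k best-scoring sentences, kept in document order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with any other receives no recommendations and ends up with the minimum score of 1 - d.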
  • 28. NEATS  Summarization is done in three stages – content selection, content filtering and presentation.  Selection stage is similar to that used in single document summarization.  Content Filtering uses the following techniques  Sentence position  Stigma Words  MMR
  • 29. NEATS  Stigma Words  conjunctions (e.g., but, although, however),  the verb say and its derivatives,  quotation marks,  pronouns such as he, she, and they  MMR  Maximum Marginal Relevancy:  search for the sentence that is most relevant to the query and most dissimilar to the sentences already in the summary.
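The MMR selection rule can be sketched as a greedy loop. This is an illustrative version only: Jaccard word overlap stands in for both the relevance and the redundancy measure, and the trade-off parameter `lam` and the function names are assumptions, not details of NeATS.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two texts, in the range 0..1."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mmr_select(query, candidates, k=2, lam=0.5):
    """Greedily pick k sentences that are relevant but not redundant."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(sentence):
            # Redundancy is the closest match among sentences already chosen.
            redundancy = max(
                (jaccard(sentence, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * jaccard(sentence, query) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to plain relevance ranking; lower values penalise sentences that repeat what is already in the summary.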
  • 30. PRESENTATION  The presentation stage proposes to solve two major challenges – definite noun phrases and events spread along an extended timeline.  Definite noun phrase problem  “The Great Depression of 1929 caused severe strain on the economy. The President proposed the New Deal to tackle this challenge.”  NeATS uses a buddy system where each sentence is paired with a suitable introductory sentence.
  • 31. PRESENTATION  In multi-document summarization, a date expression such as Monday occurring in two different documents might mean the same date or different dates.  Time Annotation is used to tackle such problems.  Publication dates are used as reference points to compute the actual date/time for date expressions – weekdays (Sunday, Monday, etc), (past | next | coming) + weekdays, today, yesterday, last night.
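Resolving date expressions against a publication date can be sketched as below. The interpretation rules are assumptions made for the example, not the NeATS rules: a bare weekday is taken to mean the most recent such day on or before publication, "past" means strictly before, and "next" or "coming" means strictly after.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_date(expression, published):
    """Map a relative date expression to a calendar date."""
    words = expression.lower().split()
    if words == ["today"]:
        return published
    if words in (["yesterday"], ["last", "night"]):
        return published - timedelta(days=1)
    if words and words[-1] in WEEKDAYS:
        target = WEEKDAYS.index(words[-1])  # Monday is 0, as in weekday()
        today = published.weekday()
        if len(words) == 2 and words[0] in ("next", "coming"):
            # Strictly after publication: 1 to 7 days ahead.
            return published + timedelta(days=(target - today - 1) % 7 + 1)
        if len(words) == 2 and words[0] == "past":
            # Strictly before publication: 1 to 7 days back.
            return published - timedelta(days=(today - target - 1) % 7 + 1)
        if len(words) == 1:
            # Assumed reading: most recent such day, possibly today.
            return published - timedelta(days=(today - target) % 7)
    raise ValueError(f"unsupported date expression: {expression!r}")
```

Two documents that both mention "Monday" resolve to the same date only when their publication dates fall in the same window, which is the ambiguity the slide describes.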
  • 32. MULTI LINGUAL SUMMARIZATION  Two approaches exist  Documents from the source language are translated to the target language and then summarization is performed on the target language.  Language specific tools are used to perform summarization in the source language and then machine translation is applied on the summarized text.
  • 33. COMPARISON  Machine translation is an involved process and not very precise. The first approach therefore tends to be expensive, because translation is applied to a large text (all the source-language documents to be summarized), and translation errors propagate into the summary.  For the second approach, if the source language is not very widely used, language-specific tools for summarization in the source language may not exist, and creating such tools may be expensive.
  • 34. EXAMPLE APPROACH 1-  Hindi text –  ऐसे कई तरीके हैं जिससे आप स्कूल में अपने बच्चे/बच्ची की मदद कर सकते हैं | वह स्कूल नियमित रुप से और समय पर जाता/जाती है यह निश्चित करने के लिए आप कानूनी तौर पर उत्तरदायी हैं | परंतु स्कूल के नियमों और होमवर्क के लिए इसके द्वारा किये जाने वाले व्यवस्थाओं का समर्थन करते हुए भी आप मदद कर सकते हैं | आप स्कूल की नीतियों का समर्थन करते हैं यह बात आपका बच्चा जानता है यह निश्चित करें |
  • 35. EXAMPLE APPROACH 1-  English translation –  There are many ways you can help your child in school. By law you are responsible for making sure he or she goes to school regularly, and on time. But you can also help by supporting the school's rules, and its arrangements for homework. Make sure your child knows that you support the school's policies.  Summary of the translated text – You are equally responsible to ensure your child follows school rules.
  • 36. EXAMPLE APPROACH 2-  Hindi Text –  स्कूल में अपने बच्चे की मदद आप किस प्रकार कर सकते हैं इस विषय में आप अपने बच्चे के शिक्षकों से पूछें | ऐसा कुछ भी जिससे कि स्कूल में आपके बच्चे के कार्य पर प्रभाव पड़ सकता है उस बारे में आप उन्हें बताएं, यदि आप अपने बच्चे के प्रगति के विषय में चिन्तित हैं तो उनसे बातचीत करें | निश्चित करें स्कूल इस बात से अवगत है कि आप चाहते हैं कि किसी भी समस्या के उत्पन्न होने पर, जिसमें कि आपका बच्चा शामिल है, आपको तुरंत ही इस बारे में बताया जाए |
  • 37. EXAMPLE APPROACH 2-  Hindi summary - आपके बच्चे के शिक्षकों के साथ अच्छा संपर्क आपके बच्चे की प्रगति में लाभदायक है |  Translated Summary – Good communication with your child’s teachers is beneficial for your child’s progress.
  • 39. COMPETITIONS: DUC AND MUC  DUC (Document Understanding Conference) and MUC (Message Understanding Conference) are research forums for encouraging the development of new technologies for information extraction and text summarization.  They have resulted in the development of evaluation criteria for summarization systems.
  • 40. EVALUATION  ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.  It is a recall-based score that compares a system-generated summary with one or more human-generated summaries.  It can be computed over n-gram matches.  Unigram matching is found to be the best indicator for evaluation.  ROUGE-1 is the number of reference-summary unigrams that also appear in the system summary, divided by the total number of unigrams in the reference summary.
  • 41. EXAMPLES Reference-summary: Beijing hosted the summer Olympics. System-summary: The summer Olympics were held in Beijing. ROUGE-1 score: 0.75 (3 of the 4 reference unigrams match once the stopword “the” is ignored; counting “the” as well gives 4 of 5, i.e. 0.8). Reference-summary: The policemen killed the gunman. System-summary: The gunman killed the policemen. ROUGE-1 score: 1 (every reference unigram appears in the system summary, even though the meaning is reversed).
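A minimal ROUGE-1 recall sketch for a single reference, to make the arithmetic in the examples concrete. It is not the official ROUGE toolkit: the regex tokenizer, the optional stopword set, and the function names are assumptions, and matches are clipped so a word cannot be credited more often than it occurs in the system summary.

```python
import re
from collections import Counter


def unigrams(text, stopwords=frozenset()):
    """Count lowercase word tokens, skipping any listed stopwords."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


def rouge_1(reference, system, stopwords=frozenset()):
    """Fraction of reference unigrams that also occur in the system summary."""
    ref_counts = unigrams(reference, stopwords)
    sys_counts = unigrams(system, stopwords)
    matched = sum(
        min(count, sys_counts[word]) for word, count in ref_counts.items()
    )
    total = sum(ref_counts.values())
    return matched / total if total else 0.0
```

For the Beijing pair this gives 0.8 when every word is counted and 0.75 when "the" is passed as a stopword; the policemen pair scores 1.0 either way because unigram matching ignores word order.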
  • 42. EVALUATION  The ROUGE score is averaged for multiple references.  ROUGE-1 does not determine if the result is coherent or if the sentences flow together in a sensible manner.  A higher order n-gram ROUGE score can measure fluency to some degree.
  • 43. CONCLUSION  Most of the current research is based on extractive multi-document summarization.  Current summarization systems are widely used to summarize news and other online articles.  Top level view:  Query-based vs. Generic  Query-based techniques take user preferences into account, formulated as a query.  Rule-based vs. ML/Statistical  Most of the early techniques were rule-based, whereas current ones apply statistical approaches.  Rules (such as sentence position) can be used as heuristics.
  • 44. CONCLUSION  Keyword vs. Graph-based  Keyword based techniques rank sentences based on the occurrence of relevant keywords.  Graph based techniques rank sentences based on content overlap.
  • 45. REFERENCES  Wikipedia page on summarization: http://en.wikipedia.org/wiki/Automatic_summarization  Mihalcea R. and Tarau P. 2004. TextRank: Bringing Order into Text. In Proc. of EMNLP 2004.  Lin C.Y. and Hovy E. 2000. The automated acquisition of topic signatures for text summarization. In Proc. of the 18th conference on Computational linguistics - Volume 1.  Lin C.Y. and Hovy E. 2002. From single to multi-document summarization: a prototype system and its evaluation. In Proc. of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02).
  • 46. REFERENCES  Wong K.F., Wu M. and Li W. 2008. Extractive summarization using supervised and semi-supervised learning. In Proc. of the 22nd International Conference on Computational Linguistics - Volume 1.  Newsblaster: http://newsblaster.cs.columbia.edu/
  • 47. Q & A