Automatic Text Summarization
Trends, Challenges and Opportunities
Siddhartha Banerjee
Research Scientist, Content Platform
Yahoo! (now Oath, a Verizon Company)
September 22, 2017
2Talk @ Saama Technologies Siddhartha Banerjee
My background
❑ Undergraduate degree
• Industrial Engineering - 2009 (IIT Kharagpur)
❑ Professional Experience: 2009 – 2012
• Sabre Airline Solutions and Oracle Retail
❑ Ph.D. @ Penn State Information Sciences (2012 - Dec 2016)
• Advised by Prof. Prasenjit Mitra
• Natural Language Processing
❑ Back to Industry: 2017
• Yahoo! (March 2017 - present)
• Question Answering
• Relationship extraction using distant supervision
• Deep Learning
Outline
● What is Text Summarization?
● Overview of existing work
● Challenges
● Current Trends
● My experiences
● The Future of Summarization
● Q&A
What is Text Summarization?
Single-document summarization
Multi-document summarization
An “ideal” summary
● Informativeness
● Coherence
● Grammaticality
Types of Summarization
● Extractive
○ “Extract” certain sentences
○ Easier
○ No issues with grammaticality
● Abstractive
○ Produce “abstracts”
○ Content understanding
○ Generation
Extractive Summarization
1958: Sentences that mention words that occur frequently in the document are more important.
We have come a long way since then!
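A minimal sketch of the word-frequency idea on this slide, not a reconstruction of the 1958 system itself: each sentence is scored by the average document frequency of its non-stop-words, and the top-scoring sentences are returned in their original order. The stop-word list, the naive sentence splitter, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "it", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased tokens of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in STOPWORDS]


def frequency_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

For the text "The battery lasts long. The battery is bulky. The screen is bright.", the second sentence is selected: "battery" is the most frequent content word and that sentence has the highest average frequency.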
Extractive Techniques
• Word-statistics based techniques
○ Centroid [Radev et al., 2004]
○ TextRank [Mihalcea and Tarau, 2004]
• Supervised techniques
○ Train on documents whose sentences have been ranked
○ Learning to Rank
• Topic-model based techniques
○ Model sentences as topic vectors [Blei et al., 2003]
○ Select sentences that are more “central” to the document vector.
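A rough sketch of TextRank-style sentence ranking, under stated assumptions: sentences are graph nodes, edge weights are word overlap normalised by the log of the sentence lengths (the similarity used by Mihalcea and Tarau), and scores come from a damped PageRank-style iteration with damping 0.85. The tokeniser, the fixed iteration count, and the function names are illustrative choices; practical implementations usually add stemming and stop-word filtering.

```python
import math
import re


def tokens(sentence):
    """Set of lowercased word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalised by log sentence lengths."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one centrality score per sentence."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            updated.append((1 - damping) + damping * rank)
        scores = updated
    return scores
```

A sentence that shares words with many other sentences ends up with the highest score, which is the sense in which it is "central" to the document.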
Why “abstractive”?
❑ Consider opinions on the iPhone:
• The iPhone’s battery lasts long…have to charge it once every few days.
• iPhone’s battery is bulky but it is cheap.
• iPhone’s battery is bulky but it lasts long!
❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days.
• Limit on summary length
❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky.
• HARD!!
• Preferred (Murray et al., 2010 – user study)
Abstractive Summarization techniques
❏ Text-to-text generation at sentence level – Independent of other sentences
❏ Sentence compression (Cohn and Lapata, 2009)
• I: But a month ago, she returned to Britain, taking the children with her.
• O: She returned to Britain, taking the children
❏ Extractive to abstractive: Not possible using just compression
❏ Sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008)
❏ Template-based (Genest and Lapalme, 2011)
❏ Domain-specific templates - Lot of manual effort
Current Trends
● Deep Learning!!
● Neural Attention Model for Sentence Summarization (FAIR, 2015)
○ Headline generation
○ Feed-forward neural network
○ Attention model
● RNN-based summarization (FAIR, 2016)
Sequence to Sequence models
❏ Originally modelled for machine translation
RNNs with attention
http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
● Rare-word problem: Reproducing factual details inaccurately
● Pointer-Generator Networks to the rescue! Copy words directly from the source text into the summary.
● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
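To make the copy mechanism concrete, here is a small sketch of how a pointer-generator decoder combines its two distributions at one decoding step: a generation probability p_gen weights the vocabulary distribution, and the remaining mass is spread over source tokens according to the attention weights. The function name and all numbers are made up for illustration; in the real model p_gen, the attention weights, and the vocabulary distribution are produced by the network.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on positions
    where the source token equals w.
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Out-of-vocabulary source words get probability only through copying.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy decoding step: "enterovirus" is not in the vocabulary but can be copied.
dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"the": 0.5, "outbreak": 0.3, "[UNK]": 0.2},
    attention=[0.75, 0.25],
    source_tokens=["enterovirus", "outbreak"],
)
```

In this toy step the rare source word "enterovirus" becomes the most likely output even though the vocabulary distribution cannot produce it, which is how copying addresses the rare-word problem.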
Evaluation
Automatic Evaluation
• ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)
Manual Evaluation
• Ask human judges to rate summaries on quality
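As a simplified illustration of what ROUGE measures, the sketch below computes ROUGE-N recall: the fraction of reference n-grams that also appear in the candidate summary, with clipped counts. It assumes whitespace tokenisation and a single reference; the official toolkit adds options such as stemming, stop-word removal, and multiple references, none of which are modelled here.

```python
from collections import Counter


def ngrams(token_list, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(token_list[i:i + n])
                   for i in range(len(token_list) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the battery lasts long and is cheap but is bulky"
candidate = "the battery lasts long"
r1 = rouge_n_recall(candidate, reference, n=1)  # 4 of 10 unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 9 bigrams matched
```

Because the score only counts overlapping n-grams, a fluent paraphrase can score low and an ungrammatical copy can score high, which is one reason manual evaluation is still used alongside it.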
Datasets
• News articles
• CNN/Daily Mail dataset
• Document Understanding Conference datasets [DUC, now TAC]
• Several topics: Each topic with 8-10 documents
• Meeting conversations
• Single meeting transcript
• AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml]
• 139 meeting transcripts: 119 training + 20 test
My Summarization Experience
Automatically authoring content for Wikipedia
• Improving existing articles
• Constructing new articles
Pipeline: Web information → Assign to Wiki sections → Summarization
Summary sentence generation
Graph Construction
❑ Multi-sentence compression (Filippova, 2010)
• Directed graph
• Nodes are words (with POS)
• Edges are adjacencies
❑ Graph traversal: overgenerate and select
Input
S1 The outbreak is the largest ever reported in North America.
S2 Enterovirus D68 caused outbreak of respiratory disease.
S3 Clusters of the outbreak in the United States were reported in August.
Output
1: Enterovirus D68 caused outbreak is the largest ever reported in North America.
2: Enterovirus D68 caused outbreak in the United States were reported in August.
3: The outbreak is the largest ever reported in August.
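The overgenerate step on this slide can be sketched with a toy word graph. This is a simplification, not Filippova's full method: it ignores POS tags, merges content words by lowercase form, and (as an assumption for this sketch) never merges stop words across sentences, so that each stop word keeps its own node. Every start-to-end path is emitted as a candidate; a separate selection step would then pick among them. The stop-word list and function names are hypothetical.

```python
import re

# Hypothetical stop-word list covering the example sentences only.
STOPWORDS = {"the", "is", "of", "in"}
START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Directed graph: nodes are words, edges link adjacent words."""
    edges = {START: []}   # node -> ordered list of successor nodes
    surface = {}          # node -> first surface form seen, used for output
    for s_idx, sentence in enumerate(sentences):
        prev = START
        for w_idx, word in enumerate(re.findall(r"\w+", sentence)):
            key = word.lower()
            # Stop words stay sentence-specific; content words are merged.
            node = (key, s_idx, w_idx) if key in STOPWORDS else key
            surface.setdefault(node, word)
            edges.setdefault(node, [])
            if node not in edges[prev]:
                edges[prev].append(node)
            prev = node
        if END not in edges[prev]:
            edges[prev].append(END)
    return edges, surface


def generate_candidates(sentences):
    """Enumerate all start-to-end paths that are not original sentences."""
    edges, surface = build_word_graph(sentences)
    results = []

    def walk(node, path):
        for nxt in edges[node]:
            if nxt == END:
                results.append(" ".join(surface[n] for n in path) + ".")
            elif nxt not in path:  # never revisit a node on the same path
                walk(nxt, path + [nxt])

    walk(START, [])
    originals = {" ".join(re.findall(r"\w+", s)) + "." for s in sentences}
    return [r for r in results if r not in originals]


sentences = [
    "The outbreak is the largest ever reported in North America.",
    "Enterovirus D68 caused outbreak of respiratory disease.",
    "Clusters of the outbreak in the United States were reported in August.",
]
candidates = generate_candidates(sentences)
```

On the three input sentences from the slide, the shared nodes "outbreak" and "reported" let paths switch between sentences, and the three outputs shown on the slide appear among the generated candidates, alongside less useful ones that the selection step has to filter out.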
A comprehensive model (Banerjee and Mitra, 2016)
[Figure: Word graph → generated sentences (paths p1, p2, p3, …, pk) → select few sentences]
Selection criteria:
• Informativeness: information coverage
• Linguistic Quality: grammaticality
• Coherence: ordering of sentences (Bollegala et al. 2012)
Mathematical formulation
[Equation: objective to maximize, subject to constraints; the formula is not recoverable from the extracted text]
❑ Three factors:
• I – Information coverage [TextRank (2004)]
• LQ – Language model [Heafield et al. 2013]
• Coh – Regression based scoring
Experimental Results: News dataset
• ROUGE evaluation on Document Understanding Conference (DUC) datasets
Experimental Results (contd.)
❑ Manual Evaluation: 10 evaluators
• Informative coverage: ~5% improvement over the ‘best’ extractive system
• Readability: ~4% reduction compared to extractive system
❑ Error Cases
• The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed.
• The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
Disaster-event Tweet Summarization (Rudra et al., 2016)
Content-word based summary quality optimization
Content words: Numerals, nouns, locations, main verbs
• Constraint 5: a selected content word must appear in at least one selected sentence
• Constraint 6: a selected sentence determines the content words to be selected
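A toy sketch of content-word based selection under the two constraints named on this slide. The objective is not shown in the extracted text, so the one used here (number of selected tweets plus the scores of the content words they cover, within a word budget) is an assumption, as are the function name, the scores, and the example tweets. With positive scores the two constraints together mean the selected content words are exactly those covered by the selected tweets, so a brute-force search over tweet subsets is enough for tiny inputs; a real system would hand the same formulation to an ILP solver.

```python
from itertools import combinations


def select_tweets(tweets, word_scores, max_words):
    """Pick the tweet subset with the best assumed objective within budget.

    tweets: list of token lists; word_scores: content word -> positive score.
    Returns (indices of chosen tweets, objective value).
    """
    best_value, best_subset = 0.0, ()
    for size in range(1, len(tweets) + 1):
        for subset in combinations(range(len(tweets)), size):
            if sum(len(tweets[i]) for i in subset) > max_words:
                continue  # length budget exceeded
            # Constraints 5 and 6: selected content words are exactly the
            # content words of the selected tweets.
            covered = {w for i in subset for w in tweets[i] if w in word_scores}
            value = len(subset) + sum(word_scores[w] for w in covered)
            if value > best_value:
                best_value, best_subset = value, subset
    return list(best_subset), best_value


tweets = [
    ["7.8", "earthquake", "hits", "nepal"],
    ["earthquake", "in", "nepal", "kills", "150"],
    ["pray", "for", "nepal"],
]
word_scores = {"7.8": 1.0, "earthquake": 1.0, "hits": 1.0,
               "nepal": 1.0, "kills": 1.0, "150": 1.0}
chosen, value = select_tweets(tweets, word_scores, max_words=9)
```

With a budget of nine words, the first two tweets are chosen because together they cover six distinct content words, while the third adds only a word that is already covered.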
Experimental Results
• Readability evaluation (COWABS is our proposed technique)
Meeting summarization using fusion (Banerjee and Mitra, 2015)
• “Um well this is the kick-off meeting for our project.”
• “so we’re designing a new remote control and um.”
• “Um, as you can see it is supposed to be original, trendy and user friendly.”
Results: Meeting data
❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml)
• 139 meeting transcripts: 119 training + 20 test (for extractive)
❑ ROUGE Evaluation
• ~17% improvement in R-2 (ROUGE-2) score over another abstractive system (Filippova, 2010)
❑ Readability Analysis
• Our model: Slightly curved around the sides like up to the main display as well. It was voice activated.
• Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
Resources
• https://github.com/miso-belica/sumy
• Lots of simple extractive summarization techniques
• https://github.com/facebookarchive/NAMAS
• Abstractive summarization: headline generation task
• http://kavita-ganesan.com/opinosis-summarizer-library
• Summarizing redundant opinions/ reviews
• http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/
• Tutorial using a seq2seq model in TensorFlow
• https://github.com/g-deoliveira/TextSummarization
• Topic model-based summarization
• https://github.com/StevenLOL/AbTextSumm
• My abstractive summarization technique.
Future of Summarization
❏ The importance of summarization is undeniable
❏ Growth of data
❏ Automatic authoring in journalism
❏ Medical report summarization
❏ Deep Learning (RNNs)
❏ Still a long way to go!
❏ Sequence to sequence models are hard to control
❏ Better metrics. ROUGE is not good enough.
❏ Making sense of an entire summary -- mimicking human capabilities.
Publications
• Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI-16).
• Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM HyperText, 2016.
• Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association for Computational Linguistics (ACL 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW 2015, poster), Florence, Italy.
• Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
Email id: sidd.iitkgp@gmail.com

Text Summarization Talk @ Saama Technologies

  • 1. Automatic Text Summarization Trends, Challenges and Opportunities Siddhartha Banerjee Research Scientist, Content Platform Yahoo! (now Oath, a Verizon Company) September 22, 2017
  • 2. 2Talk @ Saama Technologies Siddhartha Banerjee ❑ Undergraduate degree • Industrial Engineering - 2009 (IIT Kharagpur) ❑ Professional Experience: 2009 – 2012 • Sabre Airline Solutions and Oracle Retail ❑ Ph.D. @Penn State Information Sciences (2012 - Dec’ 2016) • Advised by Prof. Prasenjit Mitra • Natural Language Processing ❑ Back to Industry: 2017 • Yahoo! (March 2017 - present) • Question Answering • Relationship extraction using distant supervision • Deep Learning My background
  • 3. 3Talk @ Saama Technologies Siddhartha Banerjee Outline ● What is Text Summarization? ● Overview of existing work ● Challenges ● Current Trends ● My experiences ● The Future of Summarization ● Q&A
  • 4. 4Talk @ Saama Technologies Siddhartha Banerjee What is Text Summarization? Single-document summarization Multi-document summarization
  • 5. 5Talk @ Saama Technologies Siddhartha Banerjee An “ideal” summary Informativeness Coherence Grammaticality
  • 6. 6Talk @ Saama Technologies Siddhartha Banerjee Types of Summarization ● Extractive ○ “Extract” certain sentences ○ Easier ○ No issues with grammaticality ● Abstractive ○ Produce “abstracts” ○ Content understanding ○ Generation
  • 7. 7Talk @ Saama Technologies Siddhartha Banerjee Extractive Summarization 1958 We have come a long way since then! Sentences that mention words that occur frequently in the document are more important.
  • 8. 8Talk @ Saama Technologies Siddhartha Banerjee Extractive Techniques • Word-statistics based techniques • Centroid [Radev et. al, 2004] • TextRank [Mihalcea and Tarau, 2004] • Supervised techniques • Provide ranked sentences to train from documents • Learning to Rank • Topic-model based techniques • Model sentences as topic vectors [Blei et. al, 2003] • Select sentences that are more “central” to the document vector.
  • 9. 9Talk @ Saama Technologies Siddhartha Banerjee Why “abstractive”? ❑ Consider opinions on iphone: • The iPhone’s battery lasts long…have to charge it once every few days. • iPhone’s battery is bulky but it is cheap.. • iPhone’s battery is bulky but it lasts long! ❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days. • Limit on summary length ❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky. • HARD!! • Preferred (Murray et. al, 2010 – user study)
  • 10. 10Talk @ Saama Technologies Siddhartha Banerjee Abstractive Summarization techniques ❏ Text-to-text generation at sentence level – Independent of other sentences ❏ Sentence compression (Cohn and Lapata’ 2009) ❏ Extractive to abstractive: Not possible using just compression ❏ Sentence fusion (Barzilay and McKeown’ 2005, Filippova and Strube, 2008) Template-based (Genest and Lapalme’, 2011) ❏ Domain-specific templates - Lot of manual effort I: But a month ago, she returned to Britain, taking the children with her. O: She returned to Britain, taking the children
  • 11. 11Talk @ Saama Technologies Siddhartha Banerjee Current Trends ● Deep Learning!! ● Neural Attention Model for Sentence Summarization (FAIR, 2015) ○ Headline generation ○ Feed-forward neural network ○ Attention model ● RNN-based summarization (FAIR, 2016)
  • 12. 12Talk @ Saama Technologies Siddhartha Banerjee Sequence to Sequence models ❏ Originally modelled for machine translation ❏ ❏
  • 13. 13Talk @ Saama Technologies Siddhartha Banerjee RNN’s with attention http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html ● Rare-word problem: Reproducing factual details inaccurately ● Pointer-Generator Networks to the rescue! Copy words from source to text. ● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
  • 14. 14Talk @ Saama Technologies Siddhartha Banerjee Evaluation Automatic Evaluation • ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004) Manual Evaluation •Ask human judges and rate summaries on quality
  • 15. 15Talk @ Saama Technologies Siddhartha Banerjee Datasets • News articles • CNN/Daily News dataset • Document Understanding Conference datasets [DUC, now TAC] • Several topics: Each topic with 8-10 documents • Meeting conversations • Single meeting transcript • AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml] • 139 meeting transcripts: 119 training + 20 test
  • 16. 16Talk @ Saama Technologies Siddhartha Banerjee My Summarization Experience Automatically authoring content for Wikipedia Improving existing articles Constructing new articles Web information Assign to Wiki Sections Summarization
  • 17. 17Talk @ Saama Technologies Siddhartha Banerjee Summary sentence generation S1 The outbreak is the largest ever reported in North America. S2 Enterovirus D68 caused outbreak of respiratory disease. S3 Clusters of the outbreak in the United States were reported in August. 1: Enterovirus D68 caused outbreak is the largest ever reported in North America. 2: Enterovirus D68 caused outbreak in the United States were reported in August. 3: The outbreak is the largest ever reported in August. Output Graph Construction ❑ Multi-sentence compression (Filippova’ 2010) • Directed Graph • Nodes are words ■ (with POS) • Edges are adjacencies ❑ Graph traversal Overgenerate and Select
  • 18. A comprehensive model (Banerjee and Mitra, 2016) ❑ Word graph → generated sentences (paths p1, p2, p3, …, pk) → select a few sentences (✔ kept, ❌ discarded) ❑ Selection criteria • Informativeness: information coverage • Linguistic quality: grammaticality • Coherence: ordering of sentences (Bollegala et al., 2012)
  • 19. Mathematical formulation [objective function (Maximize) and constraints: equations not preserved in the transcript] ❑ Three factors: • I – Information coverage [TextRank (2004)] • LQ – Language model [Heafield et al., 2013] • Coh – Regression-based scoring
  • 20. Experimental Results: News dataset • ROUGE evaluation on Document Understanding Conference (DUC) datasets
  • 21. Experimental Results (contd.) ❑ Manual Evaluation: 10 evaluators • Information coverage: ~5% improvement over the 'best' extractive system • Readability: ~4% reduction compared to the extractive system ❑ Error Cases (ungrammatical generated sentences) • The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed. • The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
  • 22. Disaster-event Tweet Summarization (Rudra et al., 2016) ❑ Content words: numerals, nouns, locations, main verbs ❑ Content-word based summary quality optimization • (5): each selected content word must be covered by at least one selected sentence • (6): selecting a sentence forces the content words it contains to be selected
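The equations behind constraints (5) and (6) are not preserved in this transcript. One common way to write such content-word constraints as an integer linear program is sketched below; it is an assumption-laden sketch, not necessarily the exact formulation from the paper. The symbols are introduced here for illustration: $x_i \in \{0,1\}$ marks whether sentence (tweet) $i$ is selected, $y_j \in \{0,1\}$ marks whether content word $j$ is selected, $T_j$ is the set of sentences containing word $j$, $C_i$ is the set of content words in sentence $i$, $\mathrm{Score}(j)$ is a word weight, and $L$ is the summary length budget.

```latex
\begin{align*}
\max \quad & \sum_{i=1}^{n} x_i + \sum_{j=1}^{m} \mathrm{Score}(j)\, y_j \\
\text{s.t.} \quad
  & \sum_{i=1}^{n} \mathrm{Length}(i)\, x_i \le L
      && \text{(length budget)} \\
  & \sum_{i \in T_j} x_i \ge y_j, \quad j = 1, \dots, m
      && \text{(a selected word appears in at least one selected sentence)} \\
  & \sum_{j \in C_i} y_j \ge |C_i|\, x_i, \quad i = 1, \dots, n
      && \text{(a selected sentence selects all of its content words)}
\end{align*}
```

The last two constraints correspond to the slide's items (5) and (6) under this reading.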
  • 23. Experimental Results • Readability evaluation (COWABS is our proposed technique)
  • 24. Meeting summarization using fusion (Banerjee and Mitra, 2015) • “Um well this is the kick-off meeting for our project.” • “so we’re designing a new remote control and um.” • “Um, as you can see it is supposed to be original, trendy and user friendly.”
  • 25. Results: Meeting data ❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml) • 139 meeting transcripts: 119 training + 20 test (for extractive) ❑ ROUGE Evaluation • ~17% improvement in ROUGE-2 (R-2) score over another abstractive system (Filippova, 2010) ❑ Readability Analysis • Our model: Slightly curved around the sides like up to the main display as well. It was voice activated. • Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
  • 26. Resources • https://github.com/miso-belica/sumy – Lots of simple extractive summarization techniques • https://github.com/facebookarchive/NAMAS – Abstractive summarization: headline generation task • http://kavita-ganesan.com/opinosis-summarizer-library – Summarizing redundant opinions/reviews • http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/ – Tutorial using a seq2seq model on TensorFlow • https://github.com/g-deoliveira/TextSummarization – Topic model-based summarization • https://github.com/StevenLOL/AbTextSumm – My abstractive summarization technique
  • 27. Future of Summarization ❏ The importance of summarization is undeniable ❏ Growth of data ❏ Automatic authoring in journalism ❏ Medical report summarization ❏ Deep Learning (RNNs) ❏ Still a long way to go! ❏ Sequence-to-sequence models are hard to control ❏ Better metrics are needed: ROUGE is not good enough ❏ Making sense of an entire summary, mimicking human capabilities
  • 28. Publications • Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI 2016). • Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM HyperText 2016. • Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015). • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015). • Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association for Computational Linguistics (ACL 2015). • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI 2015). • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW 2015, poster), Florence, Italy. • Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
  • 29. Email id: sidd.iitkgp@gmail.com