Text Summarization Talk @ Saama Technologies

These are the slides of the talk I gave at Saama Technologies on Text Summarization.

Published in: Data & Analytics

  1. Automatic Text Summarization: Trends, Challenges and Opportunities
     Siddhartha Banerjee, Research Scientist, Content Platform, Yahoo! (now Oath, a Verizon Company)
     September 22, 2017
  2. My background
     ❑ Undergraduate degree
       • Industrial Engineering, 2009 (IIT Kharagpur)
     ❑ Professional experience: 2009 – 2012
       • Sabre Airline Solutions and Oracle Retail
     ❑ Ph.D. at Penn State, Information Sciences (2012 – Dec 2016)
       • Advised by Prof. Prasenjit Mitra
       • Natural Language Processing
     ❑ Back to industry: 2017
       • Yahoo! (March 2017 – present)
       • Question answering
       • Relationship extraction using distant supervision
       • Deep learning
  3. Outline
     ● What is Text Summarization?
     ● Overview of existing work
     ● Challenges
     ● Current Trends
     ● My experiences
     ● The Future of Summarization
     ● Q&A
  4. What is Text Summarization?
     ● Single-document summarization
     ● Multi-document summarization
  5. An “ideal” summary
     ● Informativeness
     ● Coherence
     ● Grammaticality
  6. Types of Summarization
     ● Extractive
       ○ “Extract” certain sentences
       ○ Easier
       ○ No issues with grammaticality
     ● Abstractive
       ○ Produce “abstracts”
       ○ Content understanding
       ○ Generation
  7. Extractive Summarization
     ● 1958: sentences that mention words occurring frequently in the document are more important.
     ● We have come a long way since then!
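
The 1958 heuristic on this slide can be sketched in a few lines. This is a minimal illustration, not the original system: the stopword list, the averaging of word frequencies, and the function names `tokenize` and `frequency_summary` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the talk).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "it", "on", "for"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def frequency_summary(sentences, k=1):
    """Return the k sentences whose words are most frequent in the document."""
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average document frequency of the sentence's words (0 if it has none).
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Emit the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(ranked[:k])]
```

Averaging rather than summing keeps long sentences from winning on length alone; other normalizations are equally plausible.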
  8. Extractive Techniques
     • Word-statistics based techniques
       • Centroid [Radev et al., 2004]
       • TextRank [Mihalcea and Tarau, 2004]
     • Supervised techniques
       • Provide ranked sentences to train from documents
       • Learning to Rank
     • Topic-model based techniques
       • Model sentences as topic vectors [Blei et al., 2003]
       • Select sentences that are more “central” to the document vector.
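
As a rough illustration of the graph-based ranking idea behind TextRank, the sketch below scores sentences by a PageRank-style iteration over a sentence-similarity graph. It uses the length-normalized word-overlap similarity from the TextRank paper, but the tokenizer, the fixed iteration count, and the function names are assumptions for this example, and no stemming or stopword filtering is applied.

```python
import math
import re


def words(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log lengths of both sentences."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids a zero denominator: log(1) + log(1) == 0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; higher means more central."""
    n = len(sentences)
    sim = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion
        # to the share of the neighbour's total edge weight it holds.
        scores = [
            (1 - damping)
            + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

An extractive summary would then take the top-scoring sentences up to the length limit.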
  9. Why “abstractive”?
     ❑ Consider opinions on the iPhone:
       • The iPhone’s battery lasts long…have to charge it once every few days.
       • iPhone’s battery is bulky but it is cheap..
       • iPhone’s battery is bulky but it lasts long!
     ❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days.
       • Limit on summary length
     ❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky.
       • HARD!!
       • Preferred (Murray et al., 2010 – user study)
  10. Abstractive Summarization techniques
      ❏ Text-to-text generation at the sentence level, independent of other sentences
      ❏ Sentence compression (Cohn and Lapata, 2009)
      ❏ Extractive to abstractive: not possible using compression alone
      ❏ Sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008)
      ❏ Template-based (Genest and Lapalme, 2011)
        ❏ Domain-specific templates: a lot of manual effort
      Example (I: input, O: output)
        I: But a month ago, she returned to Britain, taking the children with her.
        O: She returned to Britain, taking the children
  11. Current Trends
      ● Deep Learning!!
      ● Neural Attention Model for Sentence Summarization (FAIR, 2015)
        ○ Headline generation
        ○ Feed-forward neural network
        ○ Attention model
      ● RNN-based summarization (FAIR, 2016)
  12. Sequence to Sequence models
      ❏ Originally modelled for machine translation
  13. RNNs with attention
      http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
      ● Rare-word problem: factual details are reproduced inaccurately
      ● Pointer-Generator Networks to the rescue! Copy words from the source text.
      ● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
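
The copy mechanism on this slide can be illustrated with the final word distribution used by pointer-generator networks: P(w) = p_gen · P_vocab(w) + (1 − p_gen) · (sum of attention over source positions holding w). The sketch below shows only that mixing step with plain dictionaries; the function name, argument names, and toy numbers are assumptions for illustration, and a real model would compute `p_gen`, `vocab_dist` and `attention` with a neural network.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities into one distribution over words.

    p_gen:         probability of generating from the fixed vocabulary
    vocab_dist:    dict mapping vocabulary word -> generation probability
    attention:     attention weight per source position (sums to 1)
    source_tokens: source words, aligned with `attention`
    """
    # Generation part, scaled by p_gen.
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    # Copy part: every source position adds its attention mass to its word,
    # so out-of-vocabulary source words become producible.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

With a small vocabulary that lacks a rare name, the copy term is the only way that name can receive probability mass, which is how the rare-word problem is addressed.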
  14. Evaluation
      Automatic Evaluation
        • ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)
      Manual Evaluation
        • Ask human judges to rate summaries on quality
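
To make the recall-oriented idea behind ROUGE concrete, here is a minimal sketch of ROUGE-N recall against a single reference: the fraction of reference n-grams that also appear in the candidate, with clipped counts. The whitespace tokenization and function names are assumptions for this example; the official toolkit additionally supports stemming, stopword removal, multiple references and other variants such as ROUGE-L.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Share of the reference's n-grams that the candidate reproduces."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match at the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For example, scoring the candidate "the battery lasts long" against the reference "the battery lasts long and is cheap" matches 3 of the 6 reference bigrams.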
  15. Datasets
      • News articles
        • CNN/Daily Mail dataset
        • Document Understanding Conference datasets [DUC, now TAC]
          • Several topics: each topic with 8-10 documents
      • Meeting conversations
        • Single meeting transcript
        • AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml]
          • 139 meeting transcripts: 119 training + 20 test
  16. My Summarization Experience
      Automatically authoring content for Wikipedia
        • Improving existing articles
        • Constructing new articles
      Pipeline: Web information, Assign to Wiki Sections, Summarization
  17. Summary sentence generation
      Input sentences
        S1: The outbreak is the largest ever reported in North America.
        S2: Enterovirus D68 caused outbreak of respiratory disease.
        S3: Clusters of the outbreak in the United States were reported in August.
      Graph Construction
        ❑ Multi-sentence compression (Filippova, 2010)
          • Directed graph
          • Nodes are words (with POS)
          • Edges are adjacencies
        ❑ Graph traversal: overgenerate and select
      Output
        1: Enterovirus D68 caused outbreak is the largest ever reported in North America.
        2: Enterovirus D68 caused outbreak in the United States were reported in August.
        3: The outbreak is the largest ever reported in August.
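
The "overgenerate" half of this slide can be sketched as a word graph whose paths are candidate sentences. This is a simplified illustration, not the method from the cited papers: nodes here are (word, occurrence-within-sentence) pairs rather than word/POS pairs, no stopword handling is done, and the names `build_word_graph` and `generate_paths` and the length limits are assumptions. The "select" step (scoring and choosing among candidates) is left out.

```python
import re
from collections import defaultdict

START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Directed graph: nodes are words, edges link adjacent words."""
    graph = defaultdict(set)
    for sentence in sentences:
        seen = defaultdict(int)
        prev = START
        for word in re.findall(r"[a-z0-9]+", sentence.lower()):
            # The k-th occurrence of a word in a sentence gets its own node,
            # so no single sentence maps two of its words to one node.
            node = (word, seen[word])
            seen[word] += 1
            graph[prev].add(node)
            prev = node
        graph[prev].add(END)
    return graph


def generate_paths(graph, min_words=6, max_words=14):
    """Enumerate START-to-END paths that never revisit a node."""
    results = set()

    def walk(node, path, visited):
        for nxt in graph[node]:
            if nxt == END:
                if len(path) >= min_words:
                    results.add(" ".join(word for word, _ in path))
            elif nxt not in visited and len(path) < max_words:
                walk(nxt, path + [nxt], visited | {nxt})

    walk(START, [], frozenset())
    return results
```

Run on S1 to S3 from the slide, the candidate set contains the original sentences as well as fused ones such as the three outputs listed above (lowercased and without punctuation), including ungrammatical ones that a later selection step would need to filter.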
  18. A comprehensive model (Banerjee and Mitra, 2016)
      Word graph: paths p1, p2, p3, …, pk are the generated sentences; select a few of them
      Selection criteria
        • Informativeness: information coverage
        • Linguistic Quality: grammaticality
        • Coherence: ordering of sentences (Bollegala et al. 2012)
  19. Mathematical formulation
      Maximize an objective subject to constraints [equations not preserved in this transcript]
      ❑ Three factors:
        • I – Information coverage [TextRank (2004)]
        • LQ – Language model [Heafield et al. 2013]
        • Coh – Regression-based scoring
  20. Experimental Results: News dataset
      • ROUGE evaluation on Document Understanding Conference (DUC) datasets
  21. Experimental Results (contd.)
      ❑ Manual Evaluation: 10 evaluators
        • Informative coverage: ~5% improvement over ‘best’ extractive system
        • Readability: ~4% reduction compared to extractive system
      ❑ Error Cases
        • The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed.
        • The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
  22. Disaster-event Tweet Summarization (Rudra et al., 2016)
      Content-word based summary quality optimization
      Content words: numerals, nouns, locations, main verbs
        • Constraint 5: a selected content word must appear in at least one selected sentence
        • Constraint 6: a selected sentence determines the content words to be selected
  23. Experimental Results
      • Readability evaluation (COWABS is our proposed technique)
  24. Meeting summarization using fusion (Banerjee and Mitra, 2015)
      • “Um well this is the kick-off meeting for our project.”
      • “so we’re designing a new remote control and um.”
      • “Um, as you can see it is supposed to be original, trendy and user friendly.”
  25. Results: Meeting data
      ❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml)
        • 139 meeting transcripts: 119 training + 20 test (for extractive)
      ❑ ROUGE Evaluation
        • ~17% higher R-2 (ROUGE-2) score than the other abstractive system (Filippova, 2010)
      ❑ Readability Analysis
        • Our model: Slightly curved around the sides like up to the main display as well. It was voice activated.
        • Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
  26. Resources
      • https://github.com/miso-belica/sumy
        • Lots of simple extractive summarization techniques
      • https://github.com/facebookarchive/NAMAS
        • Abstractive summarization: headline generation task
      • http://kavita-ganesan.com/opinosis-summarizer-library
        • Summarizing redundant opinions/reviews
      • http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/
        • Tutorial using a seq2seq model on TensorFlow
      • https://github.com/g-deoliveira/TextSummarization
        • Topic model-based summarization
      • https://github.com/StevenLOL/AbTextSumm
        • My abstractive summarization technique
  27. Future of Summarization
      ❏ The importance of summarization is undeniable
        ❏ Growth of data
        ❏ Automatic authoring in journalism
        ❏ Medical report summarization
      ❏ Deep Learning (RNNs)
      ❏ Still a long way to go!
        ❏ Sequence to sequence models are hard to control
        ❏ Better metrics are needed. ROUGE is not good enough.
        ❏ Making sense of an entire summary: mimicking human capabilities.
  28. Publications
      • Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI-16).
      • Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM HyperText, 2016.
      • Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
      • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
      • Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association of Computational Linguistics (ACL, 2015).
      • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI, 2015).
      • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW, poster), 2015, Florence, Italy.
      • Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
  29. Email: sidd.iitkgp@gmail.com
