
Lecture: Summarization



abstracting, extractive summarization, abstractive summarization, summarization in question answering, single vs. multiple documents, query-focused summarization, snippets, unsupervised content selection, topic signature-based content selection, ROUGE (Recall-Oriented Understudy for Gisting Evaluation), semantics in language technology

Published in: Education

  1. 1. Semantic Analysis in Language Technology: Summarization
     Marina Santini, Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden. Spring 2016
  2. 2. Previous Lecture: Relation Extraction
  3. 3. What’s a relation?
     • A relation can be formally defined as a tuple t = (e1, e2, …, en), where the ei are entities in a predefined relation r within document D.
     • Most relation extraction systems focus on extracting binary relations.
     • Examples of binary relations include:
       • located-in(CMU, Pittsburgh)
       • father-of(Manuel Blum, Avrim Blum)
     • It is also possible to go to higher-order relations and extract more complex relations (e.g. in biomedicine).
  4. 4. Why Relation Extraction?
     • There exists a vast amount of unstructured electronic text on the Web, including newswire, blogs, emails, governmental documents, chats, and so on.
     • The whole idea of IE is to turn unstructured text into structured text by annotating semantic information.
     • RE is the task of recognizing relations between entities in unstructured text.
     • If a query to a search engine is “When was Gandhi born?”, then the expected answer would be “Gandhi was born in 1869”. The template of the answer is <PERSON> born-in <YEAR>, which is simply the relational triple born-in(PERSON, YEAR), where PERSON and YEAR are the entities.
  5. 5. Watch out!
     • RE = extracting facts from unstructured texts, i.e. relations that exist between entities such as dates, proper names, and companies.
     • Other relations (related to word senses): semantic relations between concepts, such as hypernyms and hyponyms, as in WordNet.
  6. 6. How to build relation extractors
     1. Hand-written patterns
     2. Supervised machine learning
     3. Semi-supervised and unsupervised
        • Bootstrapping (using seeds)
        • Distant supervision
        • Unsupervised learning from the web
  7. 7. Seed-based or bootstrapping approaches to relation extraction
     • No training set? Maybe you have:
       • A few seed tuples, or
       • A few high-precision patterns
     • Can you use those seeds to do something useful?
     • Bootstrapping: use the seeds to directly learn to populate a relation
     Roughly speaking: use seeds to initialize a process of annotation, then refine through iterations.
  8. 8. DIPRE: Extract <author, book> pairs
     • Start with 5 seeds:
       Author               | Book
       Isaac Asimov         | The Robots of Dawn
       David Brin           | Startide Rising
       James Gleick         | Chaos: Making a New Science
       Charles Dickens      | Great Expectations
       William Shakespeare  | The Comedy of Errors
     • Find instances:
       The Comedy of Errors, by William Shakespeare, was
       The Comedy of Errors, by William Shakespeare, is
       The Comedy of Errors, one of William Shakespeare's earliest attempts
       The Comedy of Errors, one of William Shakespeare's most
     • Extract patterns (group by middle, take longest common prefix/suffix):
       ?x , by ?y ,
       ?x , one of ?y 's
     • Now iterate, finding new seeds that match the pattern
     Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
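The pattern-extraction step on the DIPRE slide (group occurrences by their middle context, then take the longest common prefix/suffix) can be sketched roughly as follows. This is a simplified illustration, not Brin's full algorithm: it ignores URL prefixes and argument order, and the function names `contexts` and `induce_patterns` as well as the `min_support` threshold are assumptions made for this example.

```python
import os
from collections import defaultdict

def contexts(sentence, x, y):
    """Split a sentence around seed strings x and y (x before y); None if absent."""
    i = sentence.find(x)
    if i < 0:
        return None
    j = sentence.find(y, i + len(x))
    if j < 0:
        return None
    return sentence[:i], sentence[i + len(x):j], sentence[j + len(y):]

def common_suffix(strings):
    """Longest common suffix, computed as the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def induce_patterns(sentences, seeds, min_support=2):
    """Group seed occurrences by middle context and generalise prefix and suffix."""
    groups = defaultdict(list)
    for x, y in seeds:
        for sentence in sentences:
            ctx = contexts(sentence, x, y)
            if ctx is not None:
                groups[ctx[1]].append(ctx)
    patterns = []
    for middle, found in groups.items():
        if len(found) < min_support:  # assumed threshold, not from the slide
            continue
        prefix = common_suffix([c[0] for c in found])        # text just before ?x
        suffix = os.path.commonprefix([c[2] for c in found])  # text just after ?y
        patterns.append((prefix, middle, suffix))
    return patterns

sentences = [
    "The Comedy of Errors, by William Shakespeare, was performed early on.",
    "The Comedy of Errors, by William Shakespeare, is a farce.",
    "The Comedy of Errors, one of William Shakespeare's earliest attempts.",
    "The Comedy of Errors, one of William Shakespeare's most farcical plays.",
]
seeds = [("The Comedy of Errors", "William Shakespeare")]
patterns = induce_patterns(sentences, seeds)
```

Each resulting triple (prefix, middle, suffix) corresponds to a pattern of the form prefix ?x middle ?y suffix, which could then be matched against new text to harvest further seed pairs.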
  9. 9. Practical Activity
     Search for phrasal patterns on the web.
     Our seeds:
     • "* is a novel by *"
     • "* wrote the novel *"
     • "the novel * was written by *"
     • optionally add more phrases…
     Further refinements that we felt were needed:
     • Get rid of non-informative text included in the returned strings (maybe by adding additional patterns to the regular expressions)
     • Identify named entities
       • Maybe via regular expressions (e.g. identify words starting with uppercase)
       • Maybe by combining seeds and a NER system
     • etc.
     Google is fantastic, but also unpredictable: different behaviours depending on the machines, domains, and some “hidden” criteria…
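As a rough illustration of the refinement idea in the practical activity, the three seed phrases can be turned into regular expressions in which each `*` is replaced by a run of capitalised words, a crude stand-in for named-entity recognition. The names `CAP`, `PATTERNS` and `extract_pairs` are assumptions for this sketch; titles containing lowercase words (such as "The Comedy of Errors") would be truncated by this heuristic.

```python
import re

# A run of capitalised words: a crude approximation of a named entity (assumption).
CAP = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"

# Each seed phrase with its wildcard slots and the role each slot plays.
PATTERNS = [
    (re.compile(rf"({CAP}) is a novel by ({CAP})"), ("book", "author")),
    (re.compile(rf"({CAP}) wrote the novel ({CAP})"), ("author", "book")),
    (re.compile(rf"the novel ({CAP}) was written by ({CAP})"), ("book", "author")),
]

def extract_pairs(text):
    """Return the set of (author, book) pairs matched by any seed pattern."""
    pairs = set()
    for regex, roles in PATTERNS:
        for match in regex.finditer(text):
            found = dict(zip(roles, match.groups()))
            pairs.add((found["author"], found["book"]))
    return pairs

text = (
    "Great Expectations is a novel by Charles Dickens. "
    "Jane Austen wrote the novel Emma. "
    "Critics agree that the novel Dracula was written by Bram Stoker."
)
pairs = extract_pairs(text)
```

Combining such patterns with a proper NER system, as the slide suggests, would likely be more robust than the uppercase heuristic used here.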
  10. 10. End of previous lecture
  11. 11. Acknowledgements
      Most slides borrowed or adapted from Dan Jurafsky and Christopher Manning, Coursera. Some inspiration from Dragomir Radev, Coursera. J&M (2009)
  12. 12. Text Summarization
  13. 13. Summary
  14. 14. News Summarization
  15. 15. Book Summaries
      Cliff’s Notes are a series of student study guides available primarily in the United States.
  16. 16. Movie Summaries
  17. 17. Search Engine Snippets
  18. 18. Genres
  19. 19. Types of Summaries
  20. 20. Stages
  21. 21. Summarization
  22. 22. Human Summarization and Abstracting
  23. 23. Extractive Summarization
  24. 24. Question Answering Summarization in Question Answering
  25. 25. Text Summarization
      • Goal: produce an abridged version of a text that contains information that is important or relevant to a user.
      • Summarization applications:
        • outlines or abstracts of any document, article, etc.
        • summaries of email threads
        • action items from a meeting
        • simplifying text by compressing sentences
  26. 26. What to summarize? Single vs. multiple documents
      • Single-document summarization
        • Given a single document, produce:
          • abstract
          • outline
          • headline
      • Multiple-document summarization
        • Given a group of documents, produce a gist of the content:
          • a series of news stories on the same event
          • a set of web pages about some topic or question
  27. 27. Query-focused Summarization & Generic Summarization
      • Generic summarization:
        • Summarize the content of a document
      • Query-focused summarization:
        • Summarize a document with respect to an information need expressed in a user query.
        • A kind of complex question answering:
          • Answer a question by summarizing a document that has the information to construct the answer
  28. 28. Summarization for Question Answering: Snippets
      • Create snippets summarizing a web page for a query
        • Google: 156 characters (about 26 words) plus title and link
  29. 29. Summarization for Question Answering: Multiple documents
      Create answers to complex questions summarizing multiple documents.
      • Instead of giving a snippet for each document
      • Create a cohesive answer that combines information from each document
  30. 30. Extractive summarization & Abstractive summarization
      • Extractive summarization:
        • create the summary from phrases or sentences in the source document(s)
      • Abstractive summarization:
        • express the ideas in the source documents using (at least in part) different words
  31. 31. Simple baseline: take the first sentence
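The first-sentence baseline is simple enough to sketch in a few lines. The sentence splitter below is a naive assumption (break after ".", "!" or "?" followed by whitespace); a real system would use a proper sentence segmenter.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def first_sentence_baseline(text):
    """Return the first sentence of the document as its summary."""
    sentences = split_sentences(text)
    return sentences[0] if sentences else ""

summary = first_sentence_baseline(
    "Water spinach is a leaf vegetable. It is grown in the tropics."
)
```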
  32. 32. Question Answering Generating Snippets and other Single- Document Answers
  33. 33. Snippets: query-focused summaries
  34. 34. Summarization: Three Stages
      1. content selection: choose sentences to extract from the document
      2. information ordering: choose an order to place them in the summary
      3. sentence realization: clean up the sentences
      [Figure: pipeline diagram. Document → Content Selection (Sentence Segmentation, Sentence Simplification, Sentence Extraction) → Information Ordering → Sentence Realization → Summary]
  35. 35. Basic Summarization Algorithm
      1. content selection: choose sentences to extract from the document
      2. information ordering: just use document order
      3. sentence realization: keep original sentences
      [Figure: the same pipeline diagram. Document → Content Selection (Sentence Segmentation, Sentence Simplification, Sentence Extraction) → Information Ordering → Sentence Realization → Summary]
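The three steps of the basic algorithm might be wired together as in the sketch below. The scoring function used for content selection (average in-document word frequency) is only a placeholder assumption, since the slide leaves the selection method open; the splitter and tokenizer are likewise simplistic stand-ins.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter (assumption): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def summarize(text, k=2):
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        # Placeholder salience: mean document frequency of the sentence's words.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # 1. Content selection: keep the k highest-scoring sentences.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # 2. Information ordering: just use document order.
    chosen = sorted(ranked[:k])
    # 3. Sentence realization: keep the original sentences unchanged.
    return " ".join(sentences[i] for i in chosen)

doc = "Cats chase mice. Dogs bark. Cats and mice share the barn with cats."
result = summarize(doc, k=2)
```

Swapping in a different `score` (for example one based on tf-idf or topic signatures) changes only step 1; steps 2 and 3 stay as trivial as the slide describes.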
  36. 36. Unsupervised content selection
      • Intuition dating back to Luhn (1958):
        • Choose sentences that have salient or informative words
      • Two approaches to defining salient words:
        1. tf-idf: weigh each word w_i in document j by tf-idf
           $\mathrm{weight}(w_i) = tf_{ij} \times idf_i$
        2. topic signature: choose a smaller set of salient words
           • mutual information
           • log-likelihood ratio (LLR): Dunning (1993), Lin and Hovy (2000)
           $\mathrm{weight}(w_i) = \begin{cases} 1 & \text{if } -2\log\lambda(w_i) > 10 \\ 0 & \text{otherwise} \end{cases}$
      H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development. 2:2, 159-165.
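Both word-weighting schemes can be sketched as small functions. The tf-idf version assumes idf is computed as log(N / df), which is one common variant; the slide does not fix a particular definition. The LLR version assumes the binomial log-likelihood ratio test in the style of Dunning (1993), comparing a word's rate in the input text with its rate in a background corpus; the parameter names (`k1`, `n1`, `k2`, `n2`) are choices made for this example.

```python
import math

def tfidf_weight(tf_ij, df_i, n_docs):
    """weight(w_i) = tf_ij * idf_i, assuming idf_i = log(N / df_i)."""
    return tf_ij * math.log(n_docs / df_i)

def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k successes in n trials, treating 0*log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1.0 - p)
    return result

def llr(k1, n1, k2, n2):
    """-2 log(lambda) for a word seen k1 times in n1 input tokens
    and k2 times in n2 background tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    return 2.0 * (
        _log_likelihood(k1, n1, p1) + _log_likelihood(k2, n2, p2)
        - _log_likelihood(k1, n1, p) - _log_likelihood(k2, n2, p)
    )

def topic_signature_weight(k1, n1, k2, n2, threshold=10.0):
    """1 if the word passes the LLR threshold from the slide, else 0.
    Note: the statistic is also large for words that are unusually rare
    in the input; a fuller version would additionally require p1 > p2."""
    return 1 if llr(k1, n1, k2, n2) > threshold else 0
```

For instance, a word occurring 50 times in a 1,000-token input but only 50 times in a 100,000-token background would be assigned weight 1, while a word with the same relative frequency in both would be assigned 0.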
  37. 37. Topic signature-based content selection with queries
      • Choose words that are informative either
        • by log-likelihood ratio (LLR)
        • or by appearing in the query
        $\mathrm{weight}(w_i) = \begin{cases} 1 & \text{if } -2\log\lambda(w_i) > 10 \\ 1 & \text{if } w_i \in \text{question} \\ 0 & \text{otherwise} \end{cases}$
        (could learn more complex weights)
      • Weigh a sentence (or window) by the weight of its words:
        $\mathrm{weight}(s) = \frac{1}{|S|} \sum_{w \in S} \mathrm{weight}(w)$
      Conroy, Schlesinger, and O’Leary 2006
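A minimal sketch of the sentence weighting above, assuming the topic signature (the set of words passing the LLR threshold) has already been computed and is passed in as a set. The tokenizer is a simplistic assumption, and no stopword filtering is done, so function words that happen to appear in the query also count.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def word_weight(word, signature, query_words):
    """1 if the word is in the topic signature or in the query, else 0."""
    return 1 if word in signature or word in query_words else 0

def sentence_weight(sentence, signature, query):
    """Average word weight over the words of the sentence."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    query_words = set(tokenize(query))
    return sum(word_weight(w, signature, query_words) for w in words) / len(words)

signature = {"tropical", "vegetable"}  # hypothetical topic signature
query = "What is water spinach?"
w = sentence_weight("Water spinach is a tropical vegetable.", signature, query)
```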
  38. 38. Supervised content selection
      • Given: a labeled training set of good summaries for each document
      • Align: the sentences in the document with sentences in the summary
      • Extract features:
        • position (first sentence?)
        • length of sentence
        • word informativeness, cue phrases
        • cohesion
      • Train: a binary classifier (put sentence in summary? yes or no)
      • Problems:
        • hard to get labeled training data
        • alignment difficult
        • performance not better than unsupervised algorithms
      • So in practice: unsupervised content selection is more common
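To make the feature-extraction and alignment steps more concrete, here is a rough sketch of how training examples for the binary classifier might be assembled. Everything here is illustrative: the cue-phrase list is hypothetical, alignment is reduced to exact string match (real alignment is harder, as the slide notes), and the cohesion feature is omitted.

```python
import re

# Hypothetical cue phrases; a real list would be tuned to the genre.
CUE_PHRASES = ("in summary", "in conclusion", "significantly")

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def sentence_features(sentences, index, salient_words):
    """Features for one sentence: position, length, informativeness, cue phrases."""
    sentence = sentences[index]
    words = tokenize(sentence)
    salient = sum(1 for w in words if w in salient_words)
    return {
        "is_first": index == 0,
        "length": len(words),
        "informativeness": salient / len(words) if words else 0.0,
        "has_cue_phrase": any(cue in sentence.lower() for cue in CUE_PHRASES),
    }

def training_examples(doc_sentences, summary_sentences, salient_words):
    """Pair each sentence's features with a yes/no label.
    Alignment is exact string match, a deliberately crude assumption."""
    in_summary = set(summary_sentences)
    return [
        (sentence_features(doc_sentences, i, salient_words), s in in_summary)
        for i, s in enumerate(doc_sentences)
    ]

doc = ["Water spinach is a tropical vegetable.", "In summary, it is widely eaten."]
examples = training_examples(doc, [doc[0]], {"tropical", "vegetable"})
```

The resulting (features, label) pairs could then be fed to any binary classifier of choice.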
  39. 39. Question Answering Evaluating Summaries: ROUGE
  40. 40. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
      • Intrinsic metric for automatically evaluating summaries
        • Based on BLEU (a metric used for machine translation)
        • Not as good as human evaluation (“Did this answer the user’s question?”)
        • But much more convenient
      • Given a document D and an automatic summary X:
        1. Have N humans produce a set of reference summaries of D
        2. Run the system, giving automatic summary X
        3. What percentage of the bigrams from the reference summaries appear in X?
      $\mathrm{ROUGE\text{-}2} = \dfrac{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\text{bigrams } i \in S} \min(\mathrm{count}(i, X), \mathrm{count}(i, S))}{\sum_{S \in \{\mathrm{RefSummaries}\}} \sum_{\text{bigrams } i \in S} \mathrm{count}(i, S)}$
      Lin and Hovy 2003
  41. 41. A ROUGE example:
      Q: “What is water spinach?”
      Human 1: Water spinach is a green leafy vegetable grown in the tropics.
      Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable.
      Human 3: Water spinach is a commonly eaten leaf vegetable of Asia.
      • System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia.
      • ROUGE-2 = (3 + 3 + 6) / (10 + 9 + 9) = 12/28 = .43
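The ROUGE-2 computation for this example can be sketched as below: clipped bigram matches summed over the references, divided by the total number of reference bigrams. The tokenizer is an assumption (lowercased words, with hyphenated words kept as one token); the denominator depends on such tokenization choices, so the total produced by this sketch need not equal the slide's 28 exactly.

```python
import re
from collections import Counter

def bigrams(text):
    """Bag of bigrams over lowercased word tokens (hyphenated words stay whole)."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_counts(system, references):
    """Return (matched, total): clipped bigram matches and reference bigram count."""
    sys_bigrams = bigrams(system)
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        matched += sum(min(c, sys_bigrams[b]) for b, c in ref_bigrams.items())
        total += sum(ref_bigrams.values())
    return matched, total

def rouge_2(system, references):
    matched, total = rouge_2_counts(system, references)
    return matched / total if total else 0.0

references = [
    "Water spinach is a green leafy vegetable grown in the tropics.",
    "Water spinach is a semi-aquatic tropical plant grown as a vegetable.",
    "Water spinach is a commonly eaten leaf vegetable of Asia.",
]
system = "Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia."
matched, total = rouge_2_counts(system, references)
score = rouge_2(system, references)
```

The per-reference matches are 3, 3 and 6, as in the numerator on the slide.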
  42. 42. Question Answering Summarization for Complex Questions
  43. 43. Definition questions
      Q: What is water spinach?
      A: Water spinach (Ipomoea aquatica) is a semi-aquatic leafy green plant with long hollow stems and spear- or heart-shaped leaves, widely grown throughout Asia as a leaf vegetable. The leaves and stems are often eaten stir-fried, flavored with salt, or in soups. Other common names include morning glory vegetable, kangkong (Malay), rau muong (Viet.), ong choi (Cant.), and kong xin cai (Mand.). It is not related to spinach, but is closely related to sweet potato and convolvulus.
  44. 44. Medical questions
      Q: In children with an acute febrile illness, what is the efficacy of single medication therapy with acetaminophen or ibuprofen in reducing fever?
      A: Ibuprofen provided greater temperature decrement and longer duration of antipyresis than acetaminophen when the two drugs were administered in approximately equal doses. (PubMedID: 1621668, Evidence Strength: A)
      Demner-Fushman and Lin (2007)
  45. 45. Other complex questions
      1. How is compost made and used for gardening (including different types of compost, their uses, origins and benefits)?
      2. What causes train wrecks and what can be done to prevent them?
      3. Where have poachers endangered wildlife, what wildlife has been endangered and what steps have been taken to prevent poaching?
      4. What has been the human toll in death or injury of tropical storms in recent years?
      Modified from the DUC 2005 competition (Hoa Trang Dang 2005)
  46. 46. Answering harder questions: Query-focused multi-document summarization
      • The (bottom-up) snippet method
        • Find a set of relevant documents
        • Extract informative sentences from the documents
        • Order and modify the sentences into an answer
      • The (top-down) information extraction method
        • Build specific answerers for different question types:
          • definition questions
          • biography questions
          • certain medical questions
  47. 47. Query-Focused Multi-Document Summarization
      [Figure: pipeline diagram. Input docs and query → Sentence Segmentation (all sentences from documents) → Sentence Simplification (all sentences plus simplified versions) → Content Selection / Sentence Extraction: LLR, MMR (extracted sentences) → Information Ordering → Sentence Realization → Summary]
  48. 48. Information Ordering
      • Chronological ordering:
        • Order sentences by the date of the document (for summarizing news). (Barzilay, Elhadad, and McKeown 2002)
      • Coherence:
        • Choose orderings that make neighboring sentences similar (by cosine).
        • Choose orderings in which neighboring sentences discuss the same entity (Barzilay and Lapata 2007)
      • Topical ordering
        • Learn the ordering of topics in the source documents
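Two of the ordering strategies can be sketched briefly. Chronological ordering is a sort on document date. For cosine-based coherence, the sketch below uses a greedy heuristic (start from the first sentence, repeatedly append the remaining sentence most similar to the last one placed); this greedy strategy is an assumption chosen for brevity, not the method of the cited papers, and the bag-of-words vectors are unweighted.

```python
import math
import re
from collections import Counter

def vector(sentence):
    """Unweighted bag-of-words vector (assumption: no tf-idf, no stopword removal)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chronological_order(dated_sentences):
    """Sort (date, sentence) pairs by date; ISO date strings sort correctly."""
    return [s for _, s in sorted(dated_sentences, key=lambda pair: pair[0])]

def greedy_coherence_order(sentences):
    """Greedily chain sentences so that neighbours are similar by cosine.
    Ties are broken in favour of the earlier sentence."""
    if not sentences:
        return []
    vectors = [vector(s) for s in sentences]
    order = [0]
    remaining = set(range(1, len(sentences)))
    while remaining:
        last = vectors[order[-1]]
        nxt = max(sorted(remaining), key=lambda i: cosine(last, vectors[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return [sentences[i] for i in order]

sentences = [
    "The storm hit the coast.",
    "Rescue teams arrived on Monday.",
    "The coast was flooded by the storm.",
    "Teams rescued forty residents on Monday.",
]
ordered = greedy_coherence_order(sentences)
```

In this toy input the two storm sentences end up adjacent, because they share the most vocabulary.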
  49. 49. Domain-specific answering: The Information Extraction method
      • A good biography of a person contains:
        • a person’s birth/death, fame factor, education, nationality and so on
      • A good definition contains:
        • genus or hypernym
          • The Hajj is a type of ritual
      • A medical answer about a drug’s use contains:
        • the problem (the medical condition),
        • the intervention (the drug or procedure), and
        • the outcome (the result of the study).
  50. 50. Information that should be in the answer for 3 kinds of questions
  51. 51. The end