Text Summarization - Machine Learning
    TEXT SUMMARIZATION
Kareem El-Sayed Hashem
    Mohamed Mohsen Brary
TEXT SUMMARIZATION
   Goal: reducing a text with a computer program in order to create a
    summary that retains the most important points of the original text.




   Summarization Applications
     summaries of email threads
     action items from a meeting
     simplifying text by compressing sentences




WHAT TO SUMMARIZE?
SINGLE VS. MULTIPLE DOCUMENTS
   Single Document Summarization
        Given a single document, produce:
          Abstract
          Outline
          Headline




   Multiple Document Summarization
        Given a group of documents, produce a gist of their content
          A series of news stories about the same event
          A set of webpages about some topic or question


QUERY-FOCUSED SUMMARIZATION
& GENERIC SUMMARIZATION
   Generic Summarization
       Summarize the content of a document




   Query-focused Summarization
     Summarize a document with respect to an
      information need expressed in a user query
     A kind of complex question answering
           Answer a question by summarizing a document that has
            the information to construct the answer




SUMMARIZATION FOR QUESTION
ANSWERING:
   Snippets
       Create snippets summarizing a web page for a query




   Multiple Documents
        Create an answer to a complex question by summarizing
         multiple documents
          Instead of giving a snippet for each document
          Create a cohesive answer that combines information from
           each document




EXTRACTIVE SUMMARIZATION
& ABSTRACTIVE SUMMARIZATION
   Extractive Summarization:
       Create the summary from phrases or sentences in the
        source document(s)




   Abstractive Summarization
       Express the ideas in the source document using
        different words




SUMMARIZATION: THREE STAGES
 Content Selection: choose sentences to extract
  from the document




 Information Ordering: choose an order to place
  them in the summary
 Sentence Realization: clean up the sentences
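
To make the three stages concrete, here is a minimal extractive pipeline skeleton; the function names (summarize, select_content, order_sentences, realize) and the placeholder behavior are illustrative assumptions, not code from the lecture.

```python
# Minimal extractive-summarization skeleton following the three stages above.
# The placeholders would be filled in with the techniques on the later slides
# (LLR/MMR content selection, coherence ordering, sentence cleanup).
def summarize(document_sentences, query=None, max_sentences=3):
    selected = select_content(document_sentences, query, max_sentences)  # content selection
    ordered = order_sentences(selected)                                  # information ordering
    return " ".join(realize(s) for s in ordered)                         # sentence realization

def select_content(sentences, query, k):
    return sentences[:k]          # placeholder: take the first k sentences

def order_sentences(sentences):
    return sentences              # placeholder: keep the original order

def realize(sentence):
    return sentence.strip()       # placeholder: trivial cleanup
```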




UNSUPERVISED CONTENT SELECTION
   Intuition Dating Back to Luhn (1958):
       Choose sentences that have distinguished or
        informative words




   Two Approaches to Defining Distinguished Words
       tf-idf: weigh each word wi in document j by its tf-idf value
       Topic signature: choose a smaller set of distinguished
        words
           Log-likelihood ratio (LLR)
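
As a concrete illustration of the tf-idf option, here is a small sketch that weights each word w in document j by tf(w, j) x log(N / df(w)); the toy documents and function names are made up for the example. The topic-signature variant would instead keep only the words whose log-likelihood ratio against a background corpus exceeds a cutoff.

```python
# Sketch of tf-idf word weighting for unsupervised content selection:
# weight(w, j) = tf(w in document j) * log(N / document_frequency(w)).
import math
from collections import Counter

def tfidf_weights(docs, j):
    """docs: list of tokenized documents; returns tf-idf weight of each word in docs[j]."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                        # document frequency of each word
    tf = Counter(docs[j])                          # term frequency in document j
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

docs = [
    "water spinach is a leaf vegetable commonly eaten in asia".split(),
    "the game was delayed because of heavy rain".split(),
    "spinach and other leafy vegetables grow well in the tropics".split(),
]
print(sorted(tfidf_weights(docs, 0).items(), key=lambda kv: -kv[1])[:5])
```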


TOPIC SIGNATURE-BASED CONTENT
SELECTION WITH QUERIES

   Choose words that are informative either
       By log-likelihood ratio (LLR)




       Or by appearing in the query




       Weigh a sentence by weight of its words:
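
A common way to write this weighting, and presumably what the missing formula on this slide shows, is weight(s) = (1/|S|) * sum of weight(w) over the words w in s, with weight(w) = 1 if w is a topic-signature (high-LLR) word or appears in the query and 0 otherwise. A minimal sketch under that assumption, with invented word sets:

```python
# Sketch: weigh a sentence by the average weight of its words, where a word
# gets weight 1 if it is a topic-signature (high-LLR) word or a query word.
def word_weight(w, signature_words, query_words):
    return 1.0 if w in signature_words or w in query_words else 0.0

def sentence_weight(sentence, signature_words, query_words):
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(word_weight(t, signature_words, query_words) for t in tokens) / len(tokens)

signature = {"spinach", "vegetable", "tropics"}    # e.g., words with high LLR vs. a background corpus
query = {"water", "spinach"}
print(sentence_weight("Water spinach is a leaf vegetable grown in the tropics", signature, query))
```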


SUPERVISED CONTENT SELECTION
   Given
       A labeled training set of good summaries for each
        document




   Align
       The sentences in the document with sentences in the
        summary
   Extract Features
     Position
     Length of sentence
     Word informativeness
     Cohesion
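
A sketch of what these per-sentence features might look like; the exact definitions (and the omission of a cohesion feature) are illustrative assumptions, not the lecture's.

```python
# Illustrative per-sentence features for supervised content selection:
# relative position in the document, sentence length, and word informativeness
# (share of topic-signature words). A cohesion feature is left out for brevity.
def sentence_features(sentence, index, num_sentences, signature_words):
    tokens = sentence.lower().split()
    return [
        index / max(1, num_sentences - 1),                                 # relative position
        len(tokens),                                                       # sentence length
        sum(t in signature_words for t in tokens) / max(1, len(tokens)),   # informativeness
    ]
```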
SUPERVISED CONTENT SELECTION
   Train
       A binary classifier (put sentence in summary? Yes or
        no)
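
A minimal sketch of this training step using scikit-learn's logistic regression as the binary classifier (any binary classifier would do); the feature rows and labels below are invented just to make the snippet runnable.

```python
# Train a binary "include this sentence in the summary?" classifier on
# feature vectors like the ones sketched above. The tiny dataset is fake.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 12, 0.40],   # [position, length, informativeness]
              [0.5,  7, 0.10],
              [1.0, 15, 0.35],
              [0.8,  5, 0.00]])
y = np.array([1, 0, 1, 0])       # 1 = sentence was aligned to the reference summary

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # probability of including each sentence
```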




   Problems
       Hard to get labeled training data
       Alignment is difficult
        Performance is not better than unsupervised algorithms




EVALUATING SUMMARIES: ROUGE
   ROUGE: "Recall-Oriented Understudy for Gisting Evaluation"




   An intrinsic metric for automatically evaluating
    summaries
      Based on BLEU (a metric used for machine
       translation)
      Not as good as human evaluation, but much more
       convenient

EVALUATING SUMMARIES: ROUGE
   Given a document D, and an automatic
    summary X:




     Have N humans produce a set of reference
      summaries of D
      Run the system, producing automatic summary X
     What percentage of the bigrams from the reference
      summaries appear in X?




EXAMPLE
 Human 1: water spinach is a green leafy
  vegetable grown in the tropics.




 Human 2: water spinach is a semi-aquatic
  tropical plant grown as a vegetable.
 Human 3: water spinach is a commonly eaten
  leaf vegetable of Asia.

   System: water spinach is a leaf vegetable
    commonly eaten in tropical areas of Asia.

   ROUGE-2 = (reference bigrams also in X) / (total bigrams in the reference summaries) = 12/28 = 0.43
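
A sketch of ROUGE-2 recall as described on these slides, run on the example above. Real ROUGE implementations differ in tokenization, stemming, and stopword handling, so the exact 12/28 count depends on those choices; this shows only the core bigram-recall idea.

```python
# ROUGE-2 recall sketch: fraction of reference-summary bigrams that also
# appear in the system summary, with per-bigram counts clipped by the system.
from collections import Counter

def bigrams(text):
    toks = text.lower().replace(".", "").split()
    return Counter(zip(toks, toks[1:]))

def rouge2_recall(references, system):
    sys_bigrams = bigrams(system)
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        total += sum(ref_bigrams.values())
        matched += sum(min(c, sys_bigrams[b]) for b, c in ref_bigrams.items())
    return matched / total

refs = [
    "water spinach is a green leafy vegetable grown in the tropics.",
    "water spinach is a semi-aquatic tropical plant grown as a vegetable.",
    "water spinach is a commonly eaten leaf vegetable of Asia.",
]
system = "water spinach is a leaf vegetable commonly eaten in tropical areas of Asia."
print(rouge2_recall(refs, system))
```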
ANSWERING HARDER QUESTIONS:
QUERY-FOCUSED MULTI-DOCUMENT
SUMMARIZATION

   The (bottom-up) snippet method
       Find a set of relevant documents




        Extract informative sentences from the documents
       Order and modify the sentences into an answer



   The (top-down) information extraction method
       Build specific answers for different question types:
         Definition questions
         Biography questions

         Certain medical questions

QUERY-FOCUSED MULTI-DOCUMENT
SUMMARIZATION




MAXIMAL MARGINAL RELEVANCE (MMR)
 An iterative method for content selection from
  multiple documents




 Iteratively (greedily) choose the best sentence to
  insert in the summary/answer so far:
       Relevant: maximally relevant to the user query
           High cosine similarity to the query
       Novel: minimally redundant with the summary so
        far:
           Low cosine similarity to the summary




    Stop when the summary reaches the desired length
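
A sketch of greedy MMR selection under the usual scoring MMR(s) = lambda * sim(s, query) - (1 - lambda) * max sim(s, s') over sentences s' already chosen; the bag-of-words cosine, lambda value, and length limit are illustrative assumptions.

```python
# Greedy MMR: repeatedly add the sentence that is most relevant to the query
# and least redundant with what is already in the summary.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_summary(sentences, query, max_sentences=3, lam=0.7):
    summary, candidates = [], list(sentences)
    while candidates and len(summary) < max_sentences:       # stop at desired length
        def score(s):
            redundancy = max((cosine(s, t) for t in summary), default=0.0)
            return lam * cosine(s, query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        summary.append(best)
        candidates.remove(best)
    return summary
```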
LLR + MMR CHOOSING INFORMATIVE YET
NON-REDUNDANT SENTENCES

   One of many ways to combine the intuitions of
    LLR and MMR:




     Score each sentence based on LLR (including query
      words)
     Include the sentence with highest score in the
      summary
     Iteratively add into the summary high-scoring
      sentences that are not redundant with the summary
      so far.
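
One possible sketch of this LLR + MMR combination, assuming the sentence_weight() and cosine() helpers from the earlier sketches are in scope; the redundancy cutoff of 0.5 is an assumed value, not from the slides.

```python
# Score sentences by topic-signature/query-word weight, then greedily keep
# high-scoring sentences that are not too similar to the summary so far.
def llr_mmr_summary(sentences, signature_words, query_words,
                    max_sentences=3, redundancy_cutoff=0.5):
    ranked = sorted(sentences,
                    key=lambda s: sentence_weight(s, signature_words, query_words),
                    reverse=True)
    summary = []
    for s in ranked:                               # highest-scoring sentences first
        if len(summary) >= max_sentences:
            break
        if all(cosine(s, t) < redundancy_cutoff for t in summary):
            summary.append(s)                      # keep only non-redundant sentences
    return summary
```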


INFORMATION ORDERING
   Chronological ordering:
       Order sentences by the date of the document (e.g., for
        summarizing news)




   Coherence:
     Choose an ordering that makes neighboring sentences
      similar (by cosine similarity)
     Choose an ordering in which neighboring sentences
      discuss the same entity


   Topical ordering
       Learn the ordering of topics in the source document
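
As a sketch of the coherence idea, one cheap approximation is to greedily chain sentences so that each next sentence is the one most similar to the previous one (reusing cosine() from the MMR sketch); this is an illustration, not the lecture's algorithm.

```python
# Greedy coherence ordering: start from the first selected sentence and keep
# appending the remaining sentence most similar to the one just placed.
def coherence_order(sentences):
    if not sentences:
        return []
    remaining = list(sentences)
    ordered = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda s: cosine(ordered[-1], s))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```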
DOMAIN-SPECIFIC ANSWERING:
THE INFORMATION EXTRACTION METHOD

   A good biography of a person contains:
        A person's birth/death, fame factor, education, etc.




   A good definition contains
        Type or category, e.g., "The Hajj is a type of ritual"
   A medical answer about a drug’s use contains:
      The problem: medical condition
      The intervention: drug or procedure
      The outcome: the result of the study
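
For the top-down method, these answer specifications act as templates to fill before generating text. Below is an illustrative sketch of the medical-answer template as a data structure; the class name, field names, and filled-in values are assumptions for illustration, not an API from the lecture.

```python
# Slot-filling template for the drug-use answer type described above.
from dataclasses import dataclass

@dataclass
class DrugUseAnswer:
    problem: str        # the medical condition
    intervention: str   # the drug or procedure
    outcome: str        # the result of the study

# Example of a filled template (toy values).
answer = DrugUseAnswer(problem="migraine",
                       intervention="sumatriptan",
                       outcome="reduced headache duration reported in the study")
print(answer)
```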




INFORMATION THAT SHOULD BE IN THE
ANSWER FOR 3 KINDS OF QUESTIONS




ARCHITECTURE FOR ANSWERING COMPLEX
QUESTIONS




REFERENCES:
      NLP Stanford course.
                        THANK YOU 
