AUTOMATIC DOCUMENT SUMMARIZATION
             FINDWISE
Single document summarization




Proposed use for Findwise:
 • Metadata for indexing service
Unsupervised:
 • No need for a training set
 • Relative domain independence
 • Relative language independence
Preprocessing




      Mandatory              Additional
      • Sentence splitting   •  Named Entity Recognition
      • Tokenization         •  Keyword extraction
      • Stemming             •  TF-IDF term weighting
      • PoS-tagging
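
The mandatory steps above can be sketched as follows. This is a minimal illustration, assuming a naive regex-based sentence splitter and tokenizer and a toy suffix-stripping stemmer that stands in for a real stemmer; PoS-tagging is left out because it needs a trained tagger. All function names are hypothetical and not taken from the slides.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercased alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", sentence.lower())

def toy_stem(token):
    # Toy stemmer (assumption): strip a few common English suffixes,
    # keeping at least three characters of the stem.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # One list of stemmed tokens per sentence.
    return [[toy_stem(t) for t in tokenize(s)] for s in split_sentences(text)]
```

Calling `preprocess` on a raw document yields the per-sentence token lists that later ranking steps could consume.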
Sentence extraction

      Sentence ranking
      •  Real value ranking
      •  Relevance ordering

      Sentence selection
      •  Desired summary length

      Sentence ordering
      •  Final presentation
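
The three stages above (ranking, selection, ordering) might be wired together as in this sketch. It assumes that scores are already computed by some ranker, that the desired summary length is expressed as a word budget, and that final presentation follows original document order; the function name and the greedy selection strategy are illustrative assumptions.

```python
def extract_summary(sentences, scores, max_words):
    # Ranking: sort sentence indices by descending real-valued score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Selection: greedily take sentences that still fit the word budget.
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_words:
            chosen.append(i)
            used += n
    # Ordering: present the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```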
TextRank

     Graph based
     •  Sentences as vertices
     •  Similarity as edges

     Iterative ranking
     •   PageRank
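
A minimal sketch of the iterative ranking step: weighted PageRank run over a sentence similarity graph. It assumes a precomputed symmetric similarity matrix, a damping factor of 0.85 and a fixed iteration count, none of which are stated on the slide.

```python
def textrank(similarity, damping=0.85, iterations=50):
    # similarity: symmetric n x n matrix of non-negative edge weights
    # between sentences (vertices); self-loops are ignored.
    n = len(similarity)
    # Total outgoing edge weight per vertex.
    out = [sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score,
            # proportional to the weight of the edge j -> i.
            incoming = sum(
                similarity[j][i] / out[j] * scores[j]
                for j in range(n)
                if j != i and out[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores
```

Sentences that are similar to many other well-connected sentences end up with the highest scores.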
Sentence Similarity

     What makes two sentences similar?




      Explored variations
      •  Shared words
      •  Word importance
      •  Lexical filtering
      •  Length normalization
      •  Advanced analysis
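
One way to combine two of the variations listed above, shared words and length normalization, is the word-overlap measure from the original TextRank paper. The sketch below shows only that measure; whether it matches any of the variants evaluated in this work is an assumption.

```python
import math

def overlap_similarity(tokens_a, tokens_b):
    # Empty sentences have no similarity to anything.
    if not tokens_a or not tokens_b:
        return 0.0
    # Shared words: distinct tokens that occur in both sentences.
    shared = len(set(tokens_a) & set(tokens_b))
    # Length normalization: dividing by the log lengths keeps long
    # sentences from scoring high merely because they hold more words.
    norm = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if shared == 0 or norm <= 0:
        return 0.0
    return shared / norm
```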
K-means clustering




     Approach:
      • Sentences as points
      • Divide into clusters
      • Select sentences from each cluster
      • Diverse summaries
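
The approach above can be sketched in plain Python. This illustration assumes sentences are already represented as equal-length numeric vectors (for example word counts), uses the first k vectors as initial centroids for determinism, and picks the sentence closest to each centroid as that cluster's representative. These choices are assumptions, not details from the slide.

```python
import math

def kmeans_select(vectors, k, iterations=20):
    # vectors: one equal-length list of numbers per sentence.
    # Deterministic initialisation (assumption): the first k vectors.
    centroids = [list(v) for v in vectors[:k]]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def nearest(v):
        return min(range(k), key=lambda c: dist(v, centroids[c]))

    for _ in range(iterations):
        # Assignment step: attach every sentence to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            clusters[nearest(v)].append(i)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(len(vectors[0]))
                ]

    # Selection: one representative per non-empty cluster, the member
    # closest to the centroid, returned in original sentence order.
    picked = [
        min(members, key=lambda i: dist(vectors[i], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    ]
    return sorted(picked)
```

Because each selected sentence comes from a different cluster, the summary tends to cover different parts of the document rather than repeating one theme.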
Domain customization




      Domain: short news articles in English
      • Sentence position important
      • Use domain knowledge to improve performance
      • Other boosting for other domains
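
For news articles, where early sentences tend to carry the key facts, a position boost could look like the sketch below. The multiplier formula and the `strength` parameter are hypothetical; the slide does not say which boosting function was used.

```python
def boost_by_position(scores, strength=1.0):
    # Earlier sentences get a larger multiplier:
    # 1 + strength / (position + 1), with positions counted from zero.
    return [s * (1.0 + strength / (i + 1)) for i, s in enumerate(scores)]
```

Setting `strength` to 0 disables the boost, which is one way to fall back to domain-independent ranking.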
Multi document summarization




      Sentence ranking       Sentence selection
      • TextRank             • Similarity threshold
      • K-means clustering
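
A similarity threshold during selection helps avoid near-duplicate sentences, which are common when several documents cover the same event. The sketch below assumes sentences arrive already ranked best-first and uses Jaccard word overlap as a stand-in similarity measure; both function names and the choice of Jaccard are illustrative assumptions.

```python
def jaccard(a, b):
    # Stand-in similarity (assumption): shared words over all words.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def select_non_redundant(ranked_sentences, similarity, threshold, max_sentences):
    selected = []
    for candidate in ranked_sentences:
        if len(selected) >= max_sentences:
            break
        # Keep the candidate only if it is not too similar to anything
        # that has already been selected.
        if all(similarity(candidate, s) < threshold for s in selected):
            selected.append(candidate)
    return selected
```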
Sentence Ordering




     Paragraph selection      Paragraph merging
      • Topical closeness     • Date of publication
      • Sentence similarity   • Original position
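
One reading of the merging criteria above is to sort selected text units by publication date of their source document and then by original position within it. The sketch below assumes exactly that interpretation and a hypothetical tuple layout; the slide does not specify how the two criteria are combined.

```python
from datetime import date

def order_for_presentation(items):
    # items: (publication_date, original_position, text) tuples
    # (hypothetical layout). Older documents come first; within a
    # document, the original order is preserved.
    return [text for _, _, text in sorted(items, key=lambda it: (it[0], it[1]))]
```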
Results single document

      Algorithm            ROUGE Ngram(1,1)
      TextRank             0.4797
      K-means              0.4680
      One-class SVM        0.4343
      TextRank Original    0.4708
      K-means Original     0.4791
      Baseline 1           0.4649
      Baseline 2           0.3998
Results multi document



      Algorithm            ROUGE Ngram(1,1)
      TextRank             0.2537
      K-means              0.2400
      MetaRank             0.2561
      Baseline 1           0.2317
      Baseline 2           0.2054
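
The ROUGE Ngram(1,1) scores in the results tables measure unigram overlap between a generated summary and a reference summary. A minimal sketch of the recall variant follows, assuming whitespace tokenization and a single reference; the official ROUGE toolkit offers further options (stemming, stop-word removal, multiple references) that are not reproduced here, and the exact configuration behind the reported numbers is not stated.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    # Count unigrams in both summaries (case-insensitive).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped overlap: a reference word is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

For example, a candidate "the cat sat" scored against the reference "the cat sat on the mat" matches three of the six reference unigrams.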
