Automatic Document Summarization

Published in: Technology

  1. AUTOMATIC DOCUMENT SUMMARIZATION
     FINDWISE
  2. Single document summarization
     Proposed use for Findwise:
       • Metadata for indexing service
     Unsupervised:
       • No need for training set
       • Relative domain independence
       • Relative language independence
  3. Preprocessing
     Mandatory
       • Sentence splitting
       • Tokenization
       • Stemming
       • PoS-tagging
     Additional
       • Named Entity Recognition
       • Keyword extraction
       • tf-idf term weighting
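The tf-idf term weighting step listed above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes each sentence is treated as its own "document", uses a crude regex tokenizer, and skips stemming. The function names `tokenize` and `tfidf_weights` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase and keep runs of ASCII letters (a crude stand-in for real tokenization)."""
    return re.findall(r"[a-z]+", sentence.lower())

def tfidf_weights(sentences):
    """Return one {term: weight} dict per sentence, treating each sentence as a document."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: the number of sentences that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        weights.append({t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights
```

A term that occurs in every sentence gets weight zero under this formula, so very common words contribute nothing to later similarity computations.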
  4. Sentence extraction
     Sentence ranking
       • Real value ranking
       • Relevance ordering
     Sentence selection
       • Desired summary length
     Sentence ordering
       • Final presentation
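The selection and ordering stages can be sketched as below, assuming the ranking stage has already produced one real-valued score per sentence. The word budget `max_words` and the choice to restore original document order are illustrative assumptions. The slides only say that selection depends on the desired summary length.

```python
def select_and_order(sentences, scores, max_words):
    """Pick top-scoring sentences within a word budget, then restore document order."""
    # Sentence indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        # Greedy selection: skip any sentence that would exceed the budget.
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Present the selected sentences in their original positions.
    return [sentences[i] for i in sorted(chosen)]
```

For example, with scores `[0.1, 0.9, 0.8, 0.5]` and a budget of 7 words, the lowest-ranked sentence is dropped if it no longer fits, and the rest are returned in document order.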
  5. TextRank
     Graph based
       • Sentences as vertices
       • Similarity as edges
     Iterative ranking
       • PageRank
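The graph-based ranking can be sketched as a weighted PageRank iteration over a sentence-similarity matrix. This is a minimal sketch under stated assumptions: the matrix is a symmetric list of lists, the damping factor 0.85 and the fixed iteration count are conventional defaults rather than values from the slides, and the function name `textrank` is hypothetical.

```python
def textrank(similarity, damping=0.85, iterations=50):
    """Weighted PageRank over a symmetric sentence-similarity matrix."""
    n = len(similarity)
    scores = [1.0] * n
    # Total edge weight leaving each vertex, ignoring self-loops.
    out_weight = [sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to edge weight.
            incoming = sum(
                similarity[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores
```

In a three-sentence graph where sentence 0 is similar to both others and they are not similar to each other, sentence 0 ends up with the highest score.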
  6. Sentence Similarity
     What makes two sentences similar?
     Explored variations
       • Shared words
       • Word importance
       • Lexical filtering
       • Length normalization
       • Advanced analysis
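One way to combine three of the variations above (shared words, lexical filtering, length normalization) is sketched below. It assumes a tiny illustrative stopword list as the lexical filter and a log-length normalization in the style of the TextRank similarity measure, computed here over distinct content words. The slides do not say which exact formula was used, so the names and constants are hypothetical.

```python
import math
import re

# Illustrative stopword list only; a real lexical filter would be larger or PoS-based.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}

def content_words(sentence):
    """Distinct lowercase words that survive the stopword filter."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}

def overlap_similarity(s1, s2):
    """Shared content words, normalized by the log of each sentence's word count."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom == 0:  # both sentences have exactly one content word
        return 0.0
    return len(w1 & w2) / denom
```

The normalization keeps long sentences from scoring high merely because they contain more words.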
  7. K-means clustering
     Approach:
       • Sentences as points
       • Divide into clusters
       • Select sentences from each cluster
       • Diverse summaries
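The clustering approach can be sketched with plain Lloyd's algorithm. The sketch assumes sentences have already been turned into numeric vectors (for example, term weights) and that the representative of each cluster is the point nearest its centroid. Both are assumptions, since the slides do not specify the vector representation or the selection rule. The names `kmeans` and `pick_representatives` are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster label per point)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

def pick_representatives(points, k):
    """Indices of the point nearest each centroid, in original order."""
    centroids, labels = kmeans(points, k)
    picks = set()
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            picks.add(min(members, key=lambda i: squared_distance(points[i], centroids[c])))
    return sorted(picks)
```

Taking one sentence per cluster is what makes the summary diverse: sentences that say similar things fall into the same cluster and compete for a single slot.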
  8. Domain customization
     Domain: short news articles in English
       • Sentence position important
       • Use domain knowledge to improve performance
       • Other boosting for other domains
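A position-based boost for news articles could look like the sketch below. The formula and the `boost` parameter are purely illustrative assumptions. The slides state only that sentence position matters in this domain, not how the boost was computed.

```python
def boost_by_position(scores, boost=0.5):
    """Scale each score by 1 + boost / (position + 1), favouring early sentences.

    The decay formula is a hypothetical choice for illustration.
    """
    return [score * (1 + boost / (i + 1)) for i, score in enumerate(scores)]
```

With equal input scores, the first sentence receives the largest multiplier and the effect fades for later sentences, which matches the lead-heavy structure of short news articles.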
  9. Multi document summarization
     Sentence Ranking
       • TextRank
       • K-Means clustering
     Sentence selection
       • Similarity threshold
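Selection with a similarity threshold can be sketched as a greedy redundancy filter: walk down the ranked sentences and keep one only if it is not too similar to anything already kept. The Jaccard measure, the threshold of 0.5, and the function names are illustrative assumptions, not details from the slides.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences (a simple stand-in similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def select_diverse(ranked_sentences, similarity=jaccard, threshold=0.5, max_sentences=3):
    """Greedily keep ranked sentences that stay below the similarity threshold."""
    selected = []
    for candidate in ranked_sentences:
        if len(selected) == max_sentences:
            break
        # Reject candidates that nearly repeat an already selected sentence.
        if all(similarity(candidate, chosen) < threshold for chosen in selected):
            selected.append(candidate)
    return selected
```

This matters more in the multi-document setting, where several source articles often contain nearly identical sentences about the same event.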
  10. Sentence Ordering
      Paragraph selection
        • Topical closeness
        • Sentence Similarity
      Paragraph merging
        • Date of publication
        • Original position
  11. Results single document
      Algorithm            ROUGE Ngram(1,1)
      TextRank             0.4797
      K-means              0.4680
      One-class SVM        0.4343
      TextRank Original    0.4708
      K-means Original     0.4791
      Baseline 1           0.4649
      Baseline 2           0.3998
  12. Results multi document
      Algorithm            ROUGE Ngram(1,1)
      TextRank             0.2537
      K-means              0.2400
      MetaRank             0.2561
      Baseline 1           0.2317
      Baseline 2           0.2054
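To make the reported metric concrete, here is a minimal sketch of a unigram ROUGE score. It assumes that "ROUGE Ngram(1,1)" refers to ROUGE-1 and computes the recall variant with whitespace tokenization. The slides do not say which variant or preprocessing produced the numbers above, so this is not a way to reproduce them.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # A reference word is matched at most as often as it appears in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

For the candidate "the cat sat on the mat" against the reference "the cat was on the mat", five of the six reference unigrams are matched, giving a recall of 5/6.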
