Extractive Document Summarization - An Unsupervised Approach



Transcript

Extractive document summarization - an unsupervised approach

Jonatan Bengtsson*, Christoffer Skeppstedt†, Svetoslav Marinov*
*Findwise AB, †Tickster AB, Gothenburg, Sweden
*{jonatan.bengtsson,svetoslav.marinov}@findwise.com, †christoffer@tickster.com

Abstract

In this paper we present and evaluate a system for automatic extractive document summarization. We employ three different unsupervised algorithms for sentence ranking - TextRank, K-means clustering and, previously unexplored in this field, one-class SVM. By adding language and domain specific boosting we achieve state-of-the-art performance for English, measured in ROUGE Ngram(1,1) score on the DUC 2002 dataset - 0.4797. In addition, the system can be used for both single and multi document summarization. We also present results for Swedish, based on a new corpus - featured Wikipedia articles.

1. Introduction

An extractive summarization system tries to identify the most relevant sentences in an input document (aka single document summarization, SDS) or cluster of similar documents (aka multi-document summarization, MDS) and uses these to create a summary (Nenkova and McKeown, 2011). This task can be divided into four subtasks: document processing, sentence ranking, sentence selection and sentence ordering. Section 2 describes all necessary document processing.

We have chosen to work entirely with unsupervised machine learning algorithms to achieve maximum domain independence. We utilize three algorithms for sentence ranking - TextRank (Mihalcea and Tarau, 2004), K-means clustering (García-Hernández et al., 2008) and One-class Support Vector Machines (oSVM) (Schölkopf et al., 2001). In three of the summarization subtasks, a crucial component is the calculation of sentence similarities. Both the ranking algorithms and the similarity measures are presented in more detail in Section 3.

Sections 4 and 5 deal with the different customizations of the system to tackle the tasks of SDS and MDS respectively. Here we also describe the subtasks of sentence selection and ordering.

In Section 6 we present the results from the evaluation on Swedish and English corpora.

2. Document processing

The minimal preprocessing required by an extractive document summarization system is sentence splitting. While this is sufficient to create a baseline, its performance will be suboptimal. Further linguistic processing is central to optimizing the system. The basic requirements are tokenization, stemming and part-of-speech (POS) tagging. Given a language where we have all the basic resources, our system will be able to produce a summary of a document.

Sentence splitting, tokenization and POS tagging are done using OpenNLP (http://opennlp.apache.org) for both English and Swedish. In addition, we have explored several other means of linguistic analysis such as Named Entity Recognition (NER), keyword extraction, dependency parsing and noun phrase (NP) chunking. Finally, we can augment the importance of words by calculating their term frequency times inverse document frequency (TF*IDF) score.

NER for English is performed with the Stanford Named Entity Recognizer (http://nlp.stanford.edu/software), while for Swedish we use OpenNLP. The latter library is also used for NP chunking for English. Dependency parsing is performed by MaltParser (http://maltparser.org).

3. Sentence Ranking

A central component in an extractive summarization system is the sentence ranking algorithm. Its role is to assign a real-valued rank to each input sentence, or to order the sentences according to their relevance.

3.1 TextRank

TextRank (Mihalcea and Tarau, 2004) models a document as a graph, where nodes correspond to sentences from the document, and edges carry a weight describing the similarity between the nodes. Once the graph has been constructed, the nodes are ranked by an iterative algorithm based on Google's PageRank (Brin and Page, 1998).

The notion of sentence similarity is crucial to TextRank. Mihalcea and Tarau (2004) use the following similarity measure: Similarity(Si, Sj) = |Si ∩ Sj| / (log|Si| + log|Sj|), where Si is a sentence from the document, |Si| is its length in words and |Si ∩ Sj| is the word overlap between Si and Sj. We have tested several other enhanced approaches, such as cosine, TF*IDF, POS tag and dependency tree based similarity measures.
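To make the ranking step concrete, here is a minimal Python sketch of TextRank over tokenized sentences, using the word-overlap similarity defined above. The damping factor and iteration count are conventional PageRank defaults, and all function names are ours for illustration; this is not the authors' implementation.

```python
import math

def overlap_similarity(si, sj):
    # Similarity(Si, Sj) = |Si ∩ Sj| / (log|Si| + log|Sj|)
    # si, sj are tokenized sentences (lists of words).
    common = len(set(si) & set(sj))
    denom = math.log(len(si)) + math.log(len(sj))
    return common / denom if denom > 0 else 0.0

def textrank(sentences, d=0.85, iterations=30):
    # Build the weighted similarity graph, then run a
    # PageRank-style iteration for a fixed number of steps.
    n = len(sentences)
    w = [[0.0 if i == j else overlap_similarity(sentences[i], sentences[j])
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) or 1.0 for row in w]  # guard against isolated nodes
    rank = [1.0] * n
    for _ in range(iterations):
        rank = [(1 - d) + d * sum(w[j][i] * rank[j] / out_weight[j]
                                  for j in range(n))
                for i in range(n)]
    return rank  # one real-valued rank per input sentence
```

Swapping in another similarity measure (cosine, TF*IDF-weighted, and so on) only changes the edge weights; the iteration itself is untouched.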
3.2 K-means clustering

García-Hernández et al. (2008) adapt the well-known K-means clustering algorithm to the task of document summarization. We use the same approach and divide sentences into k clusters, from which we then select the most salient ones. We have tested three different ways for sentence relevance ordering - position, centroid and TextRank-based. The value of k is conditioned on the mean sentence length in a document and the desired summary length.

Each sentence is converted to a word vector before the clustering begins. The vectors can contain all unique words of the document or a subset based on POS tags, document keywords or named entities.
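A sketch of how this clustering step could look with scikit-learn, using TF*IDF word vectors and centroid-based relevance ordering. The paper does not give the exact formula for k, so the heuristic below (summary word budget divided by mean sentence length) is an assumption of ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def kmeans_summary(sentences, summary_words=100):
    # Condition k on the mean sentence length and the desired
    # summary length (exact formula assumed, not from the paper).
    mean_len = np.mean([len(s.split()) for s in sentences])
    k = min(len(sentences), max(1, round(summary_words / mean_len)))

    # One TF*IDF word vector per sentence.
    vectors = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

    # Centroid-based relevance: from each cluster, keep the
    # sentence closest to the cluster centre.
    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members].toarray()
                               - km.cluster_centers_[c], axis=1)
        chosen.append(members[int(np.argmin(dists))])
    return [sentences[i] for i in sorted(chosen)]
```

Position-based or TextRank-based relevance ordering would replace only the per-cluster selection loop.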
3.3 One-class SVM

oSVM is a previously unexplored approach when it comes to unsupervised, extractive summarization. Similarly to the K-means algorithm, sentences are seen as points in a coordinate system, but the task is to find the outline boundary, i.e. the support vectors that enclose all points. These vectors (or sentences) arguably define the document and are therefore interesting from a summarization point of view.

For the kernel function we choose the sentence similarity measures (cf. 3.1). Similarly to choosing k in 3.2, the number of support vectors is dependent on the mean sentence length and the desired summary length.

4. Single document summarization

For SDS we can use domain specific knowledge in order to boost the sentence rank and thus improve the performance of the system. As an example, in the domain of newspaper articles the sentence position tends to have a significant role, with initial sentences containing the gist of the article. We use an inverse square root function to update the sentence ranks: Boost(Si) = Si.rank * (1 + 1/√Si.pos), where Si.rank is the prior value and Si.pos is the position of the sentence in the document. We see such boosting functions as important steps for domain customization.

Once the sentences have been ranked, the selection and ordering tasks are relatively straightforward - we take the highest ranked sentences until a word limit is reached and order these according to their original position in the text.
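The boost and the selection/ordering steps translate directly into code. Below is a sketch assuming 1-based sentence positions and a 100 word limit (the DUC 2002 setting); the helper names are illustrative, not taken from the system.

```python
import math

def boost(rank, pos):
    # Boost(Si) = Si.rank * (1 + 1 / sqrt(Si.pos)); pos is 1-based,
    # so the first sentences of a news article get the largest lift.
    return rank * (1.0 + 1.0 / math.sqrt(pos))

def select_summary(sentences, ranks, word_limit=100):
    # Apply the position boost, take the highest-ranked sentences
    # until the word limit is reached, then restore document order.
    boosted = [boost(r, i + 1) for i, r in enumerate(ranks)]
    by_rank = sorted(range(len(sentences)),
                     key=boosted.__getitem__, reverse=True)
    picked, used = [], 0
    for i in by_rank:
        n = len(sentences[i].split())
        if used + n > word_limit:
            break
        picked.append(i)
        used += n
    return [sentences[i] for i in sorted(picked)]
```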
5. Multi document summarization

When it comes to MDS, two different approaches have been tested. The first one is to summarize a document cluster by taking all sentences in it. The other approach is based on the work of Mihalcea and Tarau (2005), who use a two stage approach where each document is first summarized and then only the summaries are summarized (2-stageSum).

MDS shares the same ranking algorithms as SDS, coupled with specific sentence selection and ordering. We rely on similarity measures (cf. 3.1) to avoid selecting near sentence duplicates, and adopt a topic/publication date-based approach for sentence ordering (Bollegala et al., 2006).

6. Evaluation

The system is evaluated on the DUC 2002 corpus, which consists of 567 English news articles in 59 clusters, paired with 100 word summaries. For Swedish we use a corpus of 251 featured Wikipedia articles from 2010, where the introduction is considered to be the summary.

We rely on the ROUGE toolkit to evaluate the automatically generated summaries and use the Ngram(1,1) F1 settings, without stemming and stop word removal, as these have been shown to closely relate to human ratings (Lin and Hovy, 2003). Two kinds of baseline systems are also tested - random selection and leading sentence selection (see Table 1).

Table 1: Results and Comparison (ROUGE Ngram(1,1) F1)

Algorithm                          English SDS   English MDS   Swedish SDS
TextRank (TF*IDF, POS)             0.4797        0.2537        0.3593
K-means (TF*IDF, POS)              0.4680        0.2400        0.3539
oSVM (cosine)                      0.4343        -             0.3399
2-stageSum (TextRank-based)        -             0.2561        -
(Mihalcea and Tarau, 2004)         0.4708        -             -
(García-Hernández et al., 2008)    0.4791        -             -
Baseline-lead                      0.4649        0.2317        0.3350
Baseline-rand                      0.3998        0.2054        0.3293

7. Conclusion

In this paper we have presented a system capable of doing both SDS and MDS. By relying on unsupervised machine learning algorithms we achieve domain independence. With relatively little language dependent processing, the system can be ported to new languages and domains. We have evaluated three different algorithms for sentence ranking, of which oSVM is previously unexplored in this field. By adding domain knowledge in the form of sentence rank boosting with the TextRank algorithm, we receive higher ROUGE scores than other systems tested on the DUC 2002 dataset. In addition, we have tested the system for Swedish on a new corpus, with promising results.

8. References

Danushka Bollegala, Naoaki Okazaki, and Mitsuru Ishizuka. 2006. A bottom-up approach to sentence ordering for multi-document summarization. In Proceedings of the COLING/ACL, pages 385–392.

Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April.

René Arnulfo García-Hernández, Romyna Montiel, Yulia Ledeneva, Eréndira Rendón, Alexander Gelbukh, and Rafael Cruz. 2008. Text Summarization by Sentence Extraction Using Unsupervised Learning. In Proceedings of the 7th Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence, MICAI '08, pages 133–143, Berlin, Heidelberg. Springer-Verlag.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 1 of NAACL '03, pages 71–78, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Texts. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.

Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the AAAI Spring Symposium on Knowledge Collection from Volunteer Contributors.

Ani Nenkova and Kathleen McKeown. 2011. Automatic Summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.

Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Comput., 13(7):1443–1471, July.
