Automatically Generating Wikipedia Articles: A Structure-Aware Approach


Christina Sauper and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{csauper,regina}

Abstract

In this paper, we investigate an approach for creating a comprehensive textual overview of a subject composed of information drawn from the Internet. We use the high-level structure of human-authored texts to automatically induce a domain-specific template for the topic structure of a new overview. The algorithmic innovation of our work is a method to learn topic-specific extractors for content selection jointly for the entire template. We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. The results of our evaluation confirm the benefits of incorporating structural information into the content selection process.

1 Introduction

In this paper, we consider the task of automatically creating a multi-paragraph overview article that provides a comprehensive summary of a subject of interest. Examples of such overviews include actor biographies from IMDB and disease synopses from Wikipedia. Producing these texts by hand is a labor-intensive task, especially when relevant information is scattered throughout a wide range of Internet sources. Our goal is to automate this process. We aim to create an overview of a subject (e.g., 3-M Syndrome) by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning. Generating a well-rounded overview article requires proactive strategies to gather relevant material, such as searching the Internet. Moreover, the challenge of maintaining output readability is magnified when creating a longer document that discusses multiple topics.

In our approach, we explore how the high-level structure of human-authored documents can be used to produce well-formed comprehensive overview articles. We select relevant material for an article using a domain-specific automatically generated content template. For example, a template for articles about diseases might contain diagnosis, causes, symptoms, and treatment. Our system induces these templates by analyzing patterns in the structure of human-authored documents in the domain of interest. Then, it produces a new article by selecting content from the Internet for each part of this template. An example of our system's output [1] is shown in Figure 1.

[1] This system output was added to Wikipedia at http:// on June 26, 2008. The page's history provides examples of changes performed by human editors to articles created by our system.

The algorithmic innovation of our work is a method for learning topic-specific extractors for content selection jointly across the entire template. Learning a single topic-specific extractor can be easily achieved in a standard classification framework. However, the choices for different topics in a template are mutually dependent; for example, in a multi-topic article, there is potential for redundancy across topics. Simultaneously learning content selection for all topics enables us to explicitly model these inter-topic connections.

We formulate this task as a structured classification problem. We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.

The key features of this structure-aware approach are twofold:
• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.
• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

    Diagnosis
    . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified in an affected family member in a research or clinical laboratory.

    Causes
    Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .

    Symptoms
    . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with Three M syndrome.

    Treatment
    . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected individuals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and supportive.

Figure 1: A fragment from the automatically created article for 3-M Syndrome.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question-answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events (e.g., occupation and marital status) that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task: the generation of new overview articles that follow the structure of Wikipedia articles.
3 Method

The goal of our system is to produce a comprehensive overview article given a title (e.g., Cancer). We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of $n$ documents $d_1 \ldots d_n$ in the same domain (e.g., Diseases). Each document $d_i$ has a title and a set of delineated sections [2] $s_{i1} \ldots s_{im}$. The number of sections $m$ varies between documents. Each section $s_{ij}$ also has a corresponding heading $h_{ij}$ (e.g., Treatment).

[2] In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated in both learning and application.
   (a) Template Induction To create a content template, we cluster all section headings $h_{i1} \ldots h_{im}$ for all documents $d_i$. Each cluster is labeled with the most common heading $h_{ij}$ within the cluster. The largest $k$ clusters are selected to become topics $t_1 \ldots t_k$, which form the domain-specific content template.
   (b) Search For each document that we wish to create, we retrieve from the Internet a set of $r$ excerpts $e_{j1} \ldots e_{jr}$ for each topic $t_j$ from the template. We define appropriate search queries using the requested document title and topics $t_j$.

2. Learning Content Selection (Section 3.2) For each topic $t_j$, we learn the corresponding topic-specific parameters $w_j$ to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document $d_i$ and each topic $t_j$. For training, we assume the best excerpt is the original human-authored text $s_{ij}$.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly using learned parameters $w_1 \ldots w_k$ and the same ILP formulation for global optimization as in training. The result is a new document with $k$ excerpts, one for each topic.

3.1 Preprocessing

Template Induction A content template specifies the topical structure of documents in one domain. For instance, the template for articles about actors consists of four topics $t_1 \ldots t_4$: biography, early life, career, and personal life. Using this template to create the biography of a new actor will ensure that its information coverage is consistent with existing human-authored documents. We aim to derive these templates by discovering common patterns in the organization of documents in a domain of interest. There has been a sizable amount of research on structure induction ranging from linear segmentation (Hearst, 1994) to content modeling (Barzilay and Lee, 2004). At the core of these methods is the assumption that fragments of text conveying similar information have similar word distribution patterns. Therefore, often a simple segment clustering across domain texts can identify strong patterns in content structure (Barzilay and Elhadad, 2003). Clusters containing fragments from many documents are indicative of topics that are essential for a comprehensive summary. Given the simplicity and robustness of this approach, we utilize it for template induction.

We cluster all section headings $h_{i1} \ldots h_{im}$ from all documents $d_i$ using a repeated bisectioning algorithm (Zhao et al., 2005). As a similarity function, we use cosine similarity weighted with TF*IDF. We eliminate any clusters with low internal similarity (i.e., smaller than 0.5), as we assume these are "miscellaneous" clusters that will not yield unified topics.
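The filtering and labeling part of template induction can be illustrated with a short Python sketch. It assumes the headings have already been grouped into clusters (the repeated bisectioning step itself is not implemented here), and the function names, the smoothed IDF formula, and the toy data are all hypothetical choices made for illustration rather than details taken from the system described above.

```python
import math
from collections import Counter


def tfidf_vectors(headings):
    """Turn each heading into a sparse TF*IDF vector (word -> weight)."""
    docs = [Counter(h.lower().split()) for h in headings]
    n = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    # Smoothed IDF: an assumption of this sketch, not taken from the paper.
    idf = {w: math.log((1 + n) / (1 + df)) + 1.0 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def internal_similarity(vectors):
    """Average pairwise cosine similarity within one cluster."""
    pairs = [(a, b) for i, a in enumerate(vectors) for b in vectors[i + 1:]]
    if not pairs:
        return 1.0  # a singleton cluster is trivially coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def induce_template(clusters, k, min_similarity=0.5):
    """Drop incoherent clusters, keep the k largest, label each by its
    most common heading. `clusters` is a list of lists of heading strings."""
    headings = [h for cluster in clusters for h in cluster]
    vectors = iter(tfidf_vectors(headings))
    kept = []
    for cluster in clusters:
        cluster_vectors = [next(vectors) for _ in cluster]
        if internal_similarity(cluster_vectors) >= min_similarity:
            kept.append(cluster)
    kept.sort(key=len, reverse=True)
    return [Counter(c).most_common(1)[0][0] for c in kept[:k]]


# Toy, hand-made clusters (hypothetical data).
clusters = [
    ["Symptoms", "Symptoms", "Signs and symptoms"],
    ["Treatment", "Treatment"],
    ["History", "See also", "External links"],  # a "miscellaneous" cluster
]
template = induce_template(clusters, k=2)
```

In this toy input the third cluster shares no vocabulary across its headings, so its internal similarity falls below the 0.5 threshold and it never becomes a topic, whatever value of `k` is requested.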
We determine the average number of sections $k$ over all documents in our training set, then select the $k$ largest section clusters as topics. We order these topics as $t_1 \ldots t_k$ using a majority ordering algorithm (Cohen et al., 1998). This algorithm finds a total order among clusters that is consistent with a maximal number of pairwise relationships observed in our data set.

Each topic $t_j$ is identified by the most frequent heading found within the cluster (e.g., Causes). This set of topics forms the content template for a domain.

Search To retrieve relevant excerpts, we must define appropriate search queries for each topic $t_1 \ldots t_k$. Query reformulation is an active area of research (Agichtein et al., 2001). We have experimented with several of these methods for drawing search queries from representative words in the body text of each topic; however, we find that the best performance is provided by deriving queries from a conjunction of the document title and topic (e.g., "3-M syndrome" diagnosis).

Using these queries, we search using Yahoo! and retrieve the first ten result pages for each topic. From each of these pages, we extract all possible excerpts consisting of chunks of text between standardized boundary indicators (such as <p> tags). In our experiments, there are an average of 6 excerpts taken from each page. For each topic $t_j$ of each document we wish to create, the total number of excerpts $r$ found on the Internet may differ. We label the excerpts $e_{j1} \ldots e_{jr}$.

3.2 Selection Model

Our selection model takes the content template $t_1 \ldots t_k$ and the candidate excerpts $e_{j1} \ldots e_{jr}$ for each topic $t_j$ produced in the previous steps. It then selects a series of $k$ excerpts, one from each topic, to create a coherent summary.

One possible approach is to perform individual selections from each set of excerpts $e_{j1} \ldots e_{jr}$ and then combine the results. This strategy is commonly used in multi-document summarization (Barzilay et al., 1999; Goldstein et al., 2000; Radev et al., 2000), where the combination step eliminates the redundancy across selected excerpts. However, separating the two steps may not be optimal for this task: the balance between coverage and redundancy is harder to achieve when a multi-paragraph summary is generated. In addition, a more discriminative selection strategy is needed when candidate excerpts are drawn directly from the web, as they may be contaminated with noise.

We propose a novel joint training algorithm that learns selection criteria for all the topics simultaneously. This approach enables us to maximize both local fit and global coherence. We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007).

In this section, we first describe the structure and decoding procedure of our model. We then present an algorithm to jointly learn the parameters of all topic models.

3.2.1 Model Structure

The model inputs are as follows:

• The title of the desired document
• $t_1 \ldots t_k$: topics from the content template
• $e_{j1} \ldots e_{jr}$: candidate excerpts for each topic $t_j$

In addition, we define feature and parameter vectors:

• $\phi(e_{jl})$: feature vector for the $l$th candidate excerpt for topic $t_j$
• $w_1 \ldots w_k$: parameter vectors, one for each of the topics $t_1 \ldots t_k$

Our model constructs a new article by following these two steps:

Ranking First, we attempt to rank candidate excerpts based on how representative they are of each individual topic. For each topic $t_j$, we induce a ranking of the excerpts $e_{j1} \ldots e_{jr}$ by mapping each excerpt $e_{jl}$ to a score:

$$\mathrm{score}_j(e_{jl}) = \phi(e_{jl}) \cdot w_j$$

Candidates for each topic are ranked from highest to lowest score. After this procedure, the position $l$ of excerpt $e_{jl}$ within the topic-specific candidate vector is the excerpt's rank.

Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given $k$ topics, we would like to select one excerpt $e_{jl}$ for each topic $t_j$, such that the rank is minimized; that is, $\mathrm{score}_j(e_{jl})$ is high.

To select the optimal excerpts, we employ integer linear programming (ILP). This framework is
commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables, $x_{jl}$. For each excerpt $e_{jl}$, the corresponding indicator variable $x_{jl} = 1$ if the excerpt is included in the final document, and $x_{jl} = 0$ otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

$$\min \sum_{j=1}^{k} \sum_{l=1}^{r} l \cdot x_{jl}$$

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator $x_{jl}$ is nonzero for each topic $t_j$. These constraints are formulated as follows:

$$\sum_{l=1}^{r} x_{jl} = 1 \quad \forall j \in \{1 \ldots k\}$$

Redundancy Constraints We also want to prevent redundancy across topics. We define $\mathrm{sim}(e_{jl}, e_{j'l'})$ as the cosine similarity between excerpts $e_{jl}$ from topic $t_j$ and $e_{j'l'}$ from topic $t_{j'}$. We introduce constraints that ensure no pair of excerpts has similarity above 0.5:

$$(x_{jl} + x_{j'l'}) \cdot \mathrm{sim}(e_{jl}, e_{j'l'}) \le 1 \quad \forall j, j' = 1 \ldots k, \; \forall l, l' = 1 \ldots r$$

If excerpts $e_{jl}$ and $e_{j'l'}$ have cosine similarity $\mathrm{sim}(e_{jl}, e_{j'l'}) > 0.5$, only one excerpt may be selected for the final document; i.e., either $x_{jl}$ or $x_{j'l'}$ may be 1, but not both. Conversely, if $\mathrm{sim}(e_{jl}, e_{j'l'}) \le 0.5$, both excerpts may be selected.

Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve, an efficient mixed integer programming solver which implements the Branch-and-Bound algorithm. On a larger scale, there are several alternatives to approximate the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).

    Feature                        Value
    UNI word_i                     count of word occurrences
    POS word_i                     first position of word in excerpt
    BI word_i word_{i+1}           count of bigram occurrences
    SENT                           count of all sentences
    EXCL                           count of exclamations
    QUES                           count of questions
    WORD                           count of all words
    NAME                           count of title mentions
    DATE                           count of dates
    PROP                           count of proper nouns
    PRON                           count of pronouns
    NUM                            count of numbers
    FIRST word_1                   1 ∗
    FIRST word_1 word_2            1 †
    SIMS                           count of similar excerpts ‡

Table 1: Features employed in the ranking model.
∗ Defined as the first unigram in the excerpt.
† Defined as the first bigram in the excerpt.
‡ Defined as excerpts with cosine similarity > 0.5.

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt $e_{jl}$, the value of the SIMS feature is the count of excerpts $e_{jl'}$ in the same topic $t_j$ for which $\mathrm{sim}(e_{jl}, e_{jl'}) > 0.5$. This feature quantifies the degree of repetition within a topic, often indicative of an excerpt's accuracy and relevance.

3.2.2 Model Training

Generating Training Data For training, we are given $n$ original documents $d_1 \ldots d_n$, a content template consisting of topics $t_1 \ldots t_k$, and a set of candidate excerpts $e_{ij1} \ldots e_{ijr}$ for each document $d_i$ and topic $t_j$. For each section of each document, we add the gold excerpt $s_{ij}$ to the corresponding vector of candidate excerpts $e_{ij1} \ldots e_{ijr}$. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this "optimal" excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.

First, we define $\mathrm{Rank}(e_{ij1} \ldots e_{ijr}, w_j)$, which
ranks all excerpts from the candidate excerpt vector $e_{ij1} \ldots e_{ijr}$ for document $d_i$ and topic $t_j$. Excerpts are ordered by $\mathrm{score}_j(e_{jl})$ using the current parameter values. We also define $\mathrm{Optimize}(e_{ij1} \ldots e_{ijr})$, which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts $e_{ij1} \ldots e_{ijr}$ for each document $d_i$ and topic $t_j$. These functions follow the ranking and optimization procedures described in Section 3.2.1. The algorithm maintains $k$ parameter vectors $w_1 \ldots w_k$, one associated with each topic $t_j$ desired in the final article. During initialization, all parameter vectors are set to zeros (line 2).

To learn the optimal parameters, this algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

    Input:
      d_1 ... d_n: A set of n documents, each containing k sections s_i1 ... s_ik
      e_ij1 ... e_ijr: Sets of candidate excerpts for each topic t_j and document d_i
    Define:
      Rank(e_ij1 ... e_ijr, w_j):
        As described in Section 3.2.1:
        Calculates score_j(e_ijl) for all excerpts for document d_i and topic t_j, using parameters w_j.
        Orders the list of excerpts by score_j(e_ijl) from highest to lowest.
      Optimize(e_i11 ... e_ikr):
        As described in Section 3.2.1:
        Finds the optimal selection of excerpts to form a final article, given ranked lists of excerpts for each topic t_1 ... t_k.
        Returns a list of k excerpts, one for each topic.
      φ(e_ijl):
        Returns the feature vector representing excerpt e_ijl
    Initialization:
     1  For j = 1 ... k
     2    Set parameters w_j = 0
    Training:
     3  Repeat until convergence or while iter < iter_max:
     4    For i = 1 ... n
     5      For j = 1 ... k
     6        Rank(e_ij1 ... e_ijr, w_j)
     7      x_1 ... x_k = Optimize(e_i11 ... e_ikr)
     8      For j = 1 ... k
     9        If sim(x_j, s_ij) < 0.8
    10          w_j = w_j + φ(s_ij) − φ(x_j)
    11    iter = iter + 1
    12  Return parameters w_1 ... w_k

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.

The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). However, in our training, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited for the global decoding procedure for content selection.

4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.

Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query (e.g., Bacillary Angiomatosis), this method selects the web page that is ranked first by the search engine. From this page we select the first $k$ paragraphs, where $k$ is defined in the same way as in our full model. If there are fewer than $k$ paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of comparable size with the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008).

In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the $k$ non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than perform an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.

In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model's candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available, and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

                         Avg. Excerpts   Avg. Sources
    Amer. Film Actors
      Search                  2.3            1
      No Template             4              4.0
      Disjoint                4              2.1
      Full Model              4              3.4
      Oracle                  4.3            4.3
    Diseases
      Search                  3.1            1
      No Template             4              2.5
      Disjoint                4              3.0
      Full Model              4              3.2
      Oracle                  5.8            3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1. We use the Wilcoxon Signed Rank Test to determine statistical significance.

Analysis of Human Edits In addition to our automatic evaluation, we perform a study of reactions to system-produced articles by the general public. To achieve this goal, we insert automatically created articles [4] into Wikipedia itself and examine the feedback of Wikipedia editors. Selection of specific articles is constrained by the need to find topics which are currently of "stub" status that have enough information available on the Internet to construct a valid article. After a period of time, we analyzed the edits made to the articles to determine the overall editor reaction. We report results on 15 articles in the Diseases category [5].

[4] In addition to the summary itself, we also include proper citations to the sources from which the material is extracted.
[5] We are continually submitting new articles; however, we report results on those that have at least a 6 month history at time of writing.
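As a rough illustration of the automatic evaluation metric mentioned above, the following Python sketch computes ROUGE-1 recall, precision, and F-score as clipped unigram overlap between a generated text and a human-authored reference. It is a simplified stand-in, not the ROUGE toolkit itself: whitespace tokenization, lowercasing, the absence of stemming or stopword handling, and the function name `rouge_1` are all assumptions of this sketch.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision, f_score) based on unigram overlap.

    Simplified: lowercased whitespace tokens, no stemming (assumption).
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in either text.
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# Toy example (hypothetical sentences, not drawn from the evaluation data).
recall, precision, f_score = rouge_1(
    "the disease is inherited",
    "the disease is a rare inherited disorder",
)
```

Here all four candidate words appear in the seven-word reference, so precision is 1 while recall is 4/7: a short excerpt can be fully accurate yet still miss much of the human-authored text, which is why recall, precision, and F-score are reported together.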
Since Wikipedia is a live resource, we do not repeat this procedure for our baseline systems. Adding articles from systems which have previously demonstrated poor quality would be improper, especially in Diseases. Therefore, we present this analysis as an additional observation rather than a rigorous technical study.

                       Recall   Precision   F-score
  Amer. Film Actors
    Search              0.09      0.37      0.13 ∗
    No Template         0.33      0.50      0.39 ∗
    Disjoint            0.45      0.32      0.36 ∗
    Full Model          0.46      0.40      0.41
    Oracle              0.48      0.64      0.54 ∗
  Diseases
    Search              0.31      0.37      0.32 †
    No Template         0.32      0.27      0.28 ∗
    Disjoint            0.33      0.40      0.35 ∗
    Full Model          0.36      0.39      0.37
    Oracle              0.59      0.37      0.44 ∗

Table 3: Results of ROUGE-1 evaluation.
∗ Significant with respect to our full model for p ≤ 0.05.
† Significant with respect to our full model for p ≤ 0.10.

  Type                      Count
  Total articles              15
  Promoted articles           10
  Edit types
    Intra-wiki links          36
    Formatting                25
    Grammar                   20
    Minor topic edits          2
    Major topic changes        1
  Total edits                 85

Table 4: Distribution of edits on Wikipedia.

5 Results

Automatic Evaluation The results of this evaluation are shown in Table 3. Our full model outperforms all of the baselines in F-score on both domains. By surpassing the Disjoint baseline, we demonstrate the benefits of joint classification. Furthermore, the high performance of both our full model and the Disjoint baseline relative to the other baselines shows the importance of structure-aware content selection. The Oracle system, which represents an upper bound on our system's capabilities, performs well.

The remaining baselines have different flaws. Articles produced by the No Template baseline tend to focus on a single topic extensively at the expense of breadth, because there are no constraints to ensure diverse topic selection. On the other hand, performance of the Search baseline varies dramatically. This is expected; this baseline relies heavily on both the search engine and individual web pages. The search engine must correctly rank relevant pages, and the web pages must provide the important material first.

Analysis of Human Edits The results of our observation of editing patterns are shown in Table 4. These articles have resided on Wikipedia for a period of time ranging from 5 to 11 months. All of them have been edited, and no articles were removed due to lack of quality. Moreover, ten automatically created articles have been promoted by human editors from stubs to regular Wikipedia entries based on the quality and coverage of the material. Information was removed in three cases for being irrelevant: one entire section and two smaller pieces. The most common changes were small edits to formatting and the introduction of links to other Wikipedia articles in the body text.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph overview article by selecting relevant material from the web and organizing it into a single coherent text. Our algorithm yields significant gains over a structure-agnostic approach. Moreover, our results demonstrate the benefits of structured classification, which outperforms independently trained topical classifiers. Overall, the results of our evaluation combined with our analysis of human edits confirm that the proposed method can effectively produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American Film Actors exhibit fairly consistent article structures, which are successfully captured by a simple template creation process. However, with categories that exhibit structural variability, more sophisticated statistical approaches may be required to produce accurate templates. Moreover, a promising direction is to consider hierarchical discourse formalisms such as RST (Mann and Thompson, 1988) to supplement our template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their helpful suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.
References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169–178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP, pages 25–32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, pages 113–120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL, pages 550–557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using wikipedia. In Proceedings of ACL/HLT, pages 807–815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of EMNLP-CoNLL, pages 1–11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to order things. In Proceedings of NIPS, pages 451–457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of ACL, pages 489–496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992. Introduction to Algorithms. The MIT Press.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT/EMNLP, pages 97–104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63–103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I'll tell you what you are: Learning occupation-related activities for biographies. In Proceedings of HLT/EMNLP, pages 113–120.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of ACL, pages 207–214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, pages 40–48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL, pages 74–81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of CoNLL, pages 136–143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of ECIR, pages 557–564.

Vivi Nastase and Michael Strube. 2008. Decoding wikipedia categories for knowledge acquisition. In Proceedings of AAAI, pages 1219–1224.

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of EMNLP, pages 763–772.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL, pages 21–29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of HLT-NAACL, pages 300–307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to answering biographical questions. In New Directions in Question Answering, pages 59–70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying wikipedia. In Proceedings of CIKM, pages 41–50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography summarization. In Proceedings of EMNLP, pages 434–441.