Adaptive web page content identification


  1. Adaptive Web-page Content Identification
     Authors: John Gibson, Ben Wellner, Susan Lubar
     Publication: WIDM 2007
     Presenter: Jhih-Ming Chen

  2. Outline
     - Introduction
     - Content Identification as Sequence Labeling
     - Models
       - Conditional Random Fields
       - Maximum Entropy Classifier
       - Maximum Entropy Markov Models
     - Data
       - Harvesting and Annotation
       - Dividing into Blocks
     - Experimental Setup
     - Results and Analysis
     - Conclusion and Future Work

  3. Introduction
     - Web pages containing news stories also include many other pieces of extraneous
       information, such as navigation bars, JavaScript, images, and advertisements.
     - This paper's goal:
       - Detect identical and near-duplicate articles within a set of web pages from a
         conglomerate of websites.
     - Why do this work?
       - To provide input for an application such as a natural language processing
         tool, or for a search-engine index.
       - To re-display content on a small screen such as a cell phone or PDA.

  4. Introduction
     - Typically, content extraction is done via a hand-crafted tool targeted at a
       single web-page format.
     - Shortcomings:
       - When the page format changes, the extractor is likely to break.
       - It is labor intensive: web-page formats change fairly quickly, and custom
         extractors often become obsolete a short time after they are written.
       - Some websites use multiple formats concurrently; identifying each one and
         handling it properly makes this a complex task.
     - In general, site-specific extractors are unworkable as a long-term solution.
     - The approach described in this paper is meant to overcome these issues.

  5. Introduction
     - The data set for this work consisted of web pages from 27 different news sites.
     - Identifying portions of relevant content in web pages can be construed as a
       sequence labeling problem:
       - i.e., each document is broken into a sequence of blocks, and the task is to
         label each block as Content or NotContent.
     - The best system, based on Conditional Random Fields, can correctly identify
       individual Content blocks with recall above 99.5% and precision above 97.9%.

  6. Content Identification as Sequence Labeling
     - Problem description:
       - Identify the portions of news-source web pages that contain relevant content,
         i.e., the news article itself.
     - Two general ways to approach this problem:
       - Boundary detection method:
         - Identify positions where content begins and ends.
       - Sequence labeling method (see the sketch below):
         - Divide the original web document into a sequence of appropriately sized
           units, or blocks.
         - Categorize each block as Content or NotContent.

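To make the sequence-labeling formulation concrete, here is a minimal, hypothetical
illustration (all block texts and labels are invented for this sketch, not taken from
the paper's data); note the non-contiguous Content run, which is exactly what the next
slide argues boundary detection handles poorly:

```python
# Hypothetical page-as-block-sequence: the task is to predict one
# Content/NotContent label per block (texts invented for illustration).
blocks = [
    "Home | World | Sports | Opinion",        # navigation bar
    "Quake Strikes Off Coast",                # headline
    "A magnitude 6.8 earthquake struck ...",  # article body
    "Buy one, get one free!",                 # advertisement interrupting the article
    "Rescue teams arrived within hours ...",  # article body resumes (non-contiguous)
    "Copyright 2007 Example News",            # footer
]
labels = ["NotContent", "Content", "Content", "NotContent", "Content", "NotContent"]

for block, label in zip(blocks, labels):
    print(f"{label:>10}  {block}")
```
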
  7. Content Identification as Sequence Labeling
     - In this work, the authors focus largely on the sequence labeling method.
     - Shortcomings of boundary detection methods:
       - A number of web pages contain non-contiguous content: the paragraphs of the
         article body have other page content, such as advertisements, interspersed.
       - Boundary detection methods cannot cleanly model transitions from Content to
         NotContent and back to Content.
       - If a boundary is not identified at all, an entire section of content can be
         missed.
       - When developing a statistical classifier to identify boundaries, there are
         many more negative examples of boundaries than positive ones. It may be
         possible to sub-sample negative examples, or to identify some reasonable set
         of candidate boundaries and train a classifier on those, but this complicates
         matters greatly.

  8. Models
     - This section describes the three statistical sequence labeling models employed
       in the experiments:
       - Conditional Random Fields (CRF)
       - Maximum Entropy Classifiers (MaxEnt)
       - Maximum Entropy Markov Models (MEMM)

  9. Conditional Random Fields
     - Let x = x_1, x_2, ..., x_n be a sequence of observations, such as a sequence of
       words, paragraphs, or, as in this setting, a sequence of HTML "segments".
     - Given a set of possible output values (i.e., labels), sequence CRFs define the
       conditional probability of a label sequence y = y_1, y_2, ..., y_n as:

         p(y | x) = (1 / Z_x) * exp( sum_i sum_k lambda_k * f_k(y_{i-1}, y_i, x, i) )

     - Z_x is a normalization term over all possible label sequences (a brute-force
       sketch follows below).
     - The current position is i; the current and previous labels are y_i and y_{i-1}.
     - Often, the range of the feature functions f_k is {0, 1}.
     - Associated with each feature function f_k is a learned parameter lambda_k that
       captures the strength and polarity of the correlation between the label
       transition y_{i-1} -> y_i and the feature.

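As a minimal sketch of what the normalizer Z_x means (the scoring function, features,
and weights below are toys, not the paper's), one can compute p(y | x) by brute force
over every label sequence; this also previews why decoding needs dynamic programming:

```python
# Brute-force CRF probability on a toy example: Z_x sums the exponentiated
# score of all N**M label sequences, so this only works for tiny inputs.
import itertools
import math

LABELS = ["Content", "NotContent"]

def score(y, x):
    """Toy stand-in for sum_i sum_k lambda_k * f_k(y_{i-1}, y_i, x, i)."""
    s = 0.0
    for i, label in enumerate(y):
        # toy "emission" feature: Content blocks tend to mention 'article'
        s += 1.5 if (label == "Content") == ("article" in x[i]) else -1.5
        # toy transition feature: adjacent labels tend to repeat
        if i > 0 and y[i - 1] == label:
            s += 0.5
    return s

x = ["nav links", "article text", "article continues", "footer"]
Z = sum(math.exp(score(y, x)) for y in itertools.product(LABELS, repeat=len(x)))
y = ("NotContent", "Content", "Content", "NotContent")
print(f"p(y | x) = {math.exp(score(y, x)) / Z:.3f}")
```
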
  10. Conditional Random Fields
      - Training
        - D = {(y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(m), x^(m))}: a set of training
          data consisting of pairs of sequences.
        - The model parameters are learned by maximizing the conditional
          log-likelihood of the training data, which is simply the sum of the
          log-probabilities assigned by the model to each label/observation sequence
          pair, plus a regularization term:

            L(Lambda) = sum_{j=1}^{m} log p(y^(j) | x^(j))
                        - sum_k lambda_k^2 / (2 * sigma^2)

        - The second term is a Gaussian prior over the parameters, which helps the
          model overcome over-fitting.

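For completeness, the gradient that standard maximum-likelihood CRF trainers follow is
reproduced below; this is textbook background rather than material from the slide,
with F_k(y, x) = sum_i f_k(y_{i-1}, y_i, x, i) denoting a feature summed over
positions:

```latex
% Standard background (not on the slide): each gradient component is the
% observed feature count minus the model's expected count, less the prior term.
\frac{\partial L}{\partial \lambda_k}
  = \sum_{j=1}^{m}\left(
      F_k\!\left(\mathbf{y}^{(j)}, \mathbf{x}^{(j)}\right)
      - \mathbb{E}_{p_{\Lambda}(\mathbf{y}\mid\mathbf{x}^{(j)})}\!\left[
          F_k\!\left(\mathbf{y}, \mathbf{x}^{(j)}\right)\right]
    \right)
  - \frac{\lambda_k}{\sigma^{2}}
```
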
  11. Conditional Random Fields
      - Decoding (Testing)
        - Given a trained model, decoding in sequence CRFs involves finding the most
          likely label sequence for a given observation sequence.
        - There are N^M possible label sequences, where N is the number of labels and
          M is the length of the sequence.
        - Dynamic programming, specifically a variation on the Viterbi algorithm, can
          find the optimal sequence in time linear in the length of the sequence and
          quadratic in the number of labels (see the sketch below).

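Here is a minimal Viterbi sketch for the two-label case (toy emission and transition
scores, not the paper's learned weights); it recovers the argmax sequence in
O(M * N^2) time rather than enumerating N^M paths:

```python
# Minimal Viterbi decoder: dynamic programming over (position, label) cells
# with backpointers, instead of brute-force enumeration of label sequences.
def viterbi(x, labels, emit, trans):
    # best[i][y] = (score of best path ending in label y at position i, backpointer)
    best = [{y: (emit(x[0], y), None) for y in labels}]
    for i in range(1, len(x)):
        row = {}
        for y in labels:
            prev, s = max(((p, best[i - 1][p][0] + trans(p, y)) for p in labels),
                          key=lambda t: t[1])
            row[y] = (s + emit(x[i], y), prev)
        best.append(row)
    # backtrace from the best final label
    y = max(labels, key=lambda l: best[-1][l][0])
    path = [y]
    for i in range(len(x) - 1, 0, -1):
        y = best[i][y][1]
        path.append(y)
    return path[::-1]

labels = ["Content", "NotContent"]
emit = lambda tok, y: 1.5 if (y == "Content") == ("article" in tok) else -1.5
trans = lambda p, y: 0.5 if p == y else 0.0
print(viterbi(["nav bar", "article text", "article more", "footer"],
              labels, emit, trans))
# ['NotContent', 'Content', 'Content', 'NotContent']
```
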
  12. Maximum Entropy Classifier
      - MaxEnt classifiers are conditional models that, given a set of parameters,
        produce a conditional multinomial distribution according to:

          p(y | x) = exp( sum_k lambda_k * f_k(y, x) ) / Z_x,
          where Z_x = sum_{y'} exp( sum_k lambda_k * f_k(y', x) )

      - In contrast with CRFs, MaxEnt models (and classifiers generally) are
        "state-less" and do not model any dependence between different positions in
        the sequence (a toy illustration follows).

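A tiny sketch of the "state-less" point (feature names and weights invented for
illustration): each block's label distribution depends only on its own features, with
no transition term anywhere:

```python
# Toy MaxEnt classifier: a per-block softmax over label scores.
import math

# Hypothetical weights lambda_k keyed by (feature, label) pairs.
weights = {("has_long_text", "Content"): 2.0,
           ("has_link_list", "NotContent"): 2.0}

def p_label(features, label, labels=("Content", "NotContent")):
    score = lambda l: sum(weights.get((f, l), 0.0) for f in features)
    return math.exp(score(label)) / sum(math.exp(score(l)) for l in labels)

print(round(p_label({"has_long_text"}, "Content"), 3))  # 0.881
```
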
  13. Maximum Entropy Markov Models
      - MEMMs model a state sequence just as CRFs do, but use a "local" training
        method rather than performing global inference over the sequence at each
        training iteration.
      - Viterbi decoding is employed, as with CRFs, to find the best label sequence.
      - MEMMs are prone to various biases that can reduce their accuracy: they do not
        normalize over the entire sequence the way CRFs do (see the factorization
        below).

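The standard MEMM factorization (textbook background, not from the slide) makes the
locality concrete: each factor is normalized on its own, per position and previous
label, which is the root of the label-bias problem:

```latex
% Locally normalized MEMM: Z is computed per position given y_{i-1},
% unlike the CRF's single sequence-level normalizer Z_x.
p(\mathbf{y} \mid \mathbf{x})
  = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, \mathbf{x})
  = \prod_{i=1}^{n}
    \frac{\exp\left(\sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\right)}
         {Z(\mathbf{x}, i, y_{i-1})}
```
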
  14. Data
      - Harvesting and Annotation
        - 1,620 labeled documents from the 27 sites.
        - 388 distinct articles within the 1,620 documents.
        - This large number of duplicate and near-duplicate documents introduced bias
          into the training set.

  15. Data
      - Dividing into Blocks (a sketch follows below):
        - Sanitize the raw HTML and transform it into XHTML.
        - Exclude all words inside style and script tags.
        - Tokenize the document.
        - Divide the entire sequence of <lex> tokens into smaller sequences called
          blocks, breaking at every tag except inline ones such as <a>, <span>, and
          <strong>.
        - Wrap each block as tightly as possible with a <span>.
        - Create features based on the material inside the <span> tags of each block.
      - There is a large skew of NotContent vs. Content blocks:
        - 234,436 NotContent blocks.
        - 24,388 Content blocks.

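Here is a minimal sketch of the blocking step using only the Python standard library.
The inline-tag set is an assumption extrapolated from the slide's "<a>, <span>,
<strong>, etc."; the authors' actual pipeline sanitizes to XHTML first and wraps each
block in a <span>, which this sketch omits:

```python
# Sketch of blocking: walk the document, drop style/script text, and start a
# new block at every tag except inline ones.
from html.parser import HTMLParser

INLINE_TAGS = {"a", "span", "strong", "em", "b", "i"}  # assumed inline-tag set

class Blocker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = [[]]      # list of blocks, each a list of tokens
        self.skip_depth = 0     # > 0 while inside <style>/<script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1
        elif tag not in INLINE_TAGS and self.blocks[-1]:
            self.blocks.append([])        # block boundary at non-inline tag

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag not in INLINE_TAGS and self.blocks[-1]:
            self.blocks.append([])

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.blocks[-1].extend(data.split())  # crude whitespace tokenization

blocker = Blocker()
blocker.feed("<body><h1>Headline</h1><p>First <strong>bold</strong> text.</p></body>")
print([b for b in blocker.blocks if b])
# [['Headline'], ['First', 'bold', 'text.']]
```
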
  16. Experimental Setup

  17. Experimental Setup
      - Data Set Creation
        - The authors ran four separate cross-validation experiments to measure the
          bias introduced by duplicate articles and mixed sources (see the sketch
          below for the source-bundling step):
        - Duplicates, Mixed Sources:
          - Split documents with 75% in the training set and 25% in the testing set.
        - No Duplicates, Mixed Sources:
          - This split prevented duplicate documents from spanning the
            training/testing boundary.
        - Duplicates Allowed, Separate Sources:
          - Create four separate bundles, each containing 6 or 7 sources.
          - Fill the bundles in round-robin fashion by selecting a source at random.
          - Three bundles were assigned to training and one to testing.
        - No Duplicates, Separate Sources:
          - Additionally ensure that no duplicates crossed the training/test boundary.

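A minimal sketch of the "Separate Sources" bundling (source names are invented
placeholders): the 27 sources are dealt round-robin into four bundles of 6 or 7, and
each fold tests on one bundle while training on the other three:

```python
# Round-robin source bundling for source-separated cross validation.
import random

sources = [f"site{i:02d}" for i in range(27)]  # hypothetical source names
random.shuffle(sources)                        # "selecting a source at random"

bundles = [[] for _ in range(4)]
for i, source in enumerate(sources):
    bundles[i % 4].append(source)              # bundle sizes: 7, 7, 7, 6

for fold, test_bundle in enumerate(bundles):
    train = [s for j, b in enumerate(bundles) if j != fold for s in b]
    print(f"fold {fold}: {len(train)} train sources, {len(test_bundle)} test sources")
```
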
  18. Results and Analysis
      - Each of the three models (CRF, MEMM, and MaxEnt) is evaluated on each of the
        four data setups.
      - All results are a weighted average of four-fold cross validation.

  19. Results and Analysis
      - Feature Type Analysis
        - Shows results for the CRF when each class of feature is individually
          removed, using the "No Duplicates; Separate Sources" data set.

  20. Results and Analysis
      - Contiguous vs. Non-Contiguous Performance
        - Document-level results.

  21. Results and Analysis
      - Amount of Training Data
        - Shows the results for the CRF with varying quantities of training data,
          using the "No Duplicates; Separate Sources" experimental setup.

  22. Results and Analysis
      - Error Analysis
        - NotContent blocks interspersed within portions of Content were falsely
          labeled as Content.
        - Section headers were sometimes incorrectly singled out as NotContent.
        - At the beginning and end of articles, the first or last block was sometimes
          incorrectly labeled NotContent.

  23. Conclusion and Future Work
      - Sequence labeling emerged as the clear winner, with the CRF edging out the
        MEMM.
      - MaxEnt was only competitive on the easiest of the four data sets and is not a
        viable alternative to site-specific wrappers.
      - Future work includes applying the techniques to additional data: more
        sources, different languages, and other types of data such as weblogs.
      - Another interesting avenue: semi-Markov CRFs.
