Content Extraction via Tag Ratios<br />Author: Tim Weninger, William H. Hsu, Jiawei Han<br />Publication: WWW’2010<br />Pr...
Outline<br />Introduction<br />Tag Ratio<br />Threshold Method<br />Smoothing<br />Selecting Content from the Threshold<br...
Introduction<br />Typical webpage contains:<br />A title banner (or something similar) towards the top of the page.<br />A...
Introduction<br />Most current content extraction techniques make use of particular HTML cues such as tables, fonts, sizes...
Introduction<br />Content Extraction via Tag Ratios (CETR)<br />This paper construct a tag ratio (TR) array with the conte...
Introduction<br />Three clustering approaches are investigated in this work.<br />The first two (CETR-TM, CETR-KM) are pre...
Tag Ratio<br />Tag Ratios, essentially, are the ratios of the count of non-HTML-tag characters to the count of HTML-tags p...
Tag Ratio<br />Example 1 shows a snippet of code from an article published on The Hutchinson News’ website with the corres...
Threshold Method<br />Determine a threshold τ which discriminates TRs into content and non-content sections.<br />That is,...
Smoothing<br />Without smoothing, many important content lines might be lost.<br />i.e.the page title, the news article by...
Smoothing<br />k is normalized to form k’.<br />The Gaussian kernel k is convolved with T in order to form a smoothed hist...
Smoothing<br />12<br />
Selecting Content from the Threshold<br />Let C be the set of content lines such that Di∈ C iffTi’≥ τwhere Di≌Ti’and τ ← λ...
Selecting Content via Clustering<br />Alternatively, this paper apply the k-means clustering method togroup content C and ...
2D Modal<br />One shortcoming of the Threshold Clustering and k-Means: <br />They view the TR histogram as a set of values...
2D Modal<br />To compute G:<br />First smooth T.<br />Find the derivatives for each element in the array<br />subtract Ti ...
2D Modal<br />17<br />
Constructing 2D Tag Ratios<br />Combining the Smoothed Tag Ratios T’ from Figure 3and the Absolute Smoothed Derivatives of...
Clustering Tag Ratios<br />Based on the k-Means algorithm:<br />k← 3.<br />One cluster has a centroid which is always set ...
Implementation details<br />Without HTML tags, the method will fail.<br />Assume that tagless webpages contain only conten...
Experiments<br />Data Set<br />News site data from Pasternak and Roth’s 2009 WWW paper on maximum subsequence segmentation...
Results(CETR-TM)<br />The threshold τ is set to 1.0σ, i.e., λ ←1.0.<br />If λ is increased (e.g., 1.1σ, 1.2σ), then the se...
Results (CETR-KM)<br />The CETR-KM method typically achieves either a high recall or a high precision but rarely both at t...
Results (CETR)<br />The relatively low precision reported by NY Post, Freep and Techweb is likely due to the fact that the...
Methods Comparison<br />Some of the MSS results are not listed because the original work did not perform experiments on th...
Discussion<br />Unlike MSS, CETR is a completely unsupervised algorithm and therefore does not require labeled training ex...
Summary<br />Results show that when compared to several other leadingcontent extraction methods CETR performs best on aver...
Future Work<br />The authors intend to incorporate this method into standard searchengines in order to see what effect, if...
Upcoming SlideShare
Loading in …5
×

Content extraction via tag ratios

992 views
912 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
992
On SlideShare
0
From Embeds
0
Number of Embeds
38
Actions
Shares
0
Downloads
15
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Content extraction via tag ratios

  1. 1. Content Extraction via Tag Ratios<br />Author: Tim Weninger, William H. Hsu, Jiawei Han<br />Publication: WWW’2010<br />Presenter: Jhih-Ming Chen<br />1<br />
  2. 2. Outline<br />Introduction<br />Tag Ratio<br />Threshold Method<br />Smoothing<br />Selecting Content from the Threshold<br />Selecting Content via Clustering<br />2D Modal<br />Constructing 2D Tag Ratios<br />Clustering Tag Ratios<br />Implementation details<br />Experiments<br />Summary<br />2<br />
  3. 3. Introduction<br />Typical webpage contains:<br />A title banner (or something similar) towards the top of the page.<br />A list of hyperlinks on the left or right side of the page with advertisements interspersed.<br />Most usually the meaningful content of the page is located in the middle.<br />Of course, this layout is not standard among all webpages; therefore a flexible, robust content extraction tool is necessary.<br />3<br />
  4. 4. Introduction<br />Most current content extraction techniques make use of particular HTML cues such as tables, fonts, sizes, lines, etc..<br />since modern webpages no longer include these cues, many content extraction algorithms have begun to perform poorly.<br />One difference:<br />This paper make no assumptions about the particular structure of a given webpage.<br />4<br />
  5. 5. Introduction<br />Content Extraction via Tag Ratios (CETR)<br />This paper construct a tag ratio (TR) array with the contention that for each line i in the array.<br />The higher the tag ratio is for the ith line the more likely that i represents a line of content-text within the HTML document.<br />TR array can be represented as a histogram, wherein each histogram bucket represents the tag ratio of a line in the document.<br />The problem is reduced to a histogram clustering task.<br />Appropriate clusters should discriminate between TR-lines which correspond to webpage content and those TR-lines which do not.<br />5<br />
  6. 6. Introduction<br />Three clustering approaches are investigated in this work.<br />The first two (CETR-TM, CETR-KM) are preliminary versions of the approach.<br />The final version (CETR) is claimed to be the most advanced and best performing.<br />6<br />
  7. 7. Tag Ratio<br />Tag Ratios, essentially, are the ratios of the count of non-HTML-tag characters to the count of HTML-tags per line.<br />The count of HTML-tags on a particular line is 0 then the ratio is set to the length of the line.<br />7<br />
  8. 8. Tag Ratio<br />Example 1 shows a snippet of code from an article published on The Hutchinson News’ website with the corresponding tag ratios.<br />Figure 2 shows the resulting TR-histogram T.<br />the high tag ratio portion to be indicative of the webpage’s content.<br />8<br />
  9. 9. Threshold Method<br />Determine a threshold τ which discriminates TRs into content and non-content sections.<br />That is, any TR value greater than or equal to τ should be labeled content, and conversely, any TR value less than τ should be labeled not content.<br />The problem becomes a task of finding the best value for τ.<br />9<br />
  10. 10. Smoothing<br />Without smoothing, many important content lines might be lost.<br />i.e.the page title, the news article byline or dateline, short or one sentence paragraphs.<br />To resolve this problem, this paper apply a Gaussian smoothing pass to T.<br />Construct a Gaussian kernel k with a radius of 1 standard deviation 1σ, giving a total window size of 2(σ) + 1.<br />The size of and values within k vary according to σ because as the variance of T increases, smoothing necessity also increases.<br />10<br />
  11. 11. Smoothing<br />k is normalized to form k’.<br />The Gaussian kernel k is convolved with T in order to form a smoothed histogram (T’)<br />11<br />
  12. 12. Smoothing<br />12<br />
  13. 13. Selecting Content from the Threshold<br />Let C be the set of content lines such that Di∈ C iffTi’≥ τwhere Di≌Ti’and τ ← λσwhere λ is a parameter and σ is the standard deviation.<br />λ ←1.<br />λ is discussed further in the experiment result.<br />This threshold method is hereafter referredto as CETR-TM.<br />13<br />
  14. 14. Selecting Content via Clustering<br />Alternatively, this paper apply the k-means clustering method togroup content C and non-content N lines by using T’ asthe only similarity measure.<br />k←3.<br />The resulting k clusters S1, S2. . . Skare labeled by selectingthe cluster which has its centroid closest to the origin (i.e.zero in 1-dimensional space) Sminand assigning N ←Smin.The remaining clusters are assigned to C.<br />This 1-dimensional k-means clustering method is hereafter referredto as CETR-KM.<br />14<br />
  15. 15. 2D Modal<br />One shortcoming of the Threshold Clustering and k-Means: <br />They view the TR histogram as a set of values rather than an ordered sequence of values, and as a result they are not sensitive to jumps in the TR histogram.<br />Transforming the histogram data so that it may be represented in 2-dimensions we can capture the ordered nature of the histogram data and obtain more accurate results.<br />Define the two dimensions:<br />A smoothed TR histogram (T’).<br />The absolute smoothed derivatives of the smoothed TR histogram (G).<br />15<br />ˆ<br />
  16. 16. 2D Modal<br />To compute G:<br />First smooth T.<br />Find the derivatives for each element in the array<br />subtract Ti from the mean of the next α elements in order to differentiate on the moving average instead of line-by-line.<br />α = 3.<br />Next, smooth G by the above way to get G’.<br />Finally, compute G such that Gi= |Gi’| for each i in G’.<br />16<br />ˆ<br />ˆ<br />
  17. 17. 2D Modal<br />17<br />
  18. 18. Constructing 2D Tag Ratios<br />Combining the Smoothed Tag Ratios T’ from Figure 3and the Absolute Smoothed Derivatives of Smoothed Tag Ratios G from Figure 4.<br />Manually label each point <br />content (×) <br />non-content (+)<br />18<br />ˆ<br />ˆ<br />Gi<br />Ti’<br />
  19. 19. Clustering Tag Ratios<br />Based on the k-Means algorithm:<br />k← 3.<br />One cluster has a centroid which is always set to the origin.<br />Define mjito be the centroid of Siat iteration j and then force m1j= (0,0).<br />This approach is beneficial in 2 ways:<br />It forces the remaining clusters to migrate away from the origin where the non-content points are located.<br />It provides an easy means for labeling the resulting clusters.<br />i.e. N ← S1.<br />i.e. C ← S2, . . . , Sk.<br />19<br />
  20. 20. Implementation details<br />Without HTML tags, the method will fail.<br />Assume that tagless webpages contain only content and then return the entire text.<br />There exist some webpages wherein the HTML markup is written in a single line.<br />Resolve this issue by detecting these instances and inserting line breaks every 65 characters.<br />If the 65thcharacter is located within a tag, then the line break is inserted at the next non-tag location.<br />20<br />
  21. 21. Experiments<br />Data Set<br />News site data from Pasternak and Roth’s 2009 WWW paper on maximum subsequence segmentation (MSS).<br />Training and test data sets from the CleanEval competition.<br />CleanEval:<br />The CleanEval project is a shared task for cleaning arbitrary webpages.<br />This corpus includes four divisions: a development/training set and an evaluation set in both English and Chinese languages which are all hand-labeled.<br />Because our approach does not require training there is no need to separate between training/development and evaluation documents.<br />21<br />
  22. 22. Results(CETR-TM)<br />The threshold τ is set to 1.0σ, i.e., λ ←1.0.<br />If λ is increased (e.g., 1.1σ, 1.2σ), then the selectivity of the threshold would increase causing the precision to increase and the recall to decrease.<br />If λ is decreased (e.g., 0.9σ, 0.8σ), then the selectivity of the threshold would decrease resulting in a lower precision and a higher recall.<br />22<br />
  23. 23. Results (CETR-KM)<br />The CETR-KM method typically achieves either a high recall or a high precision but rarely both at the same time.<br />These results show that CETR-KM typically outperforms CETR-TM.<br />23<br />
  24. 24. Results (CETR)<br />The relatively low precision reported by NY Post, Freep and Techweb is likely due to the fact that these sources contain user comments, feedback, etc. after each article.<br />CETR typically does not include short comments as content whereas the gold standard extractions include these comments.<br />24<br />
  25. 25. Methods Comparison<br />Some of the MSS results are not listed because the original work did not perform experiments on these data sources.<br />The MSS algorithm performs highest on the Big 5 data set.<br />comment effect!<br />25<br />
  26. 26. Discussion<br />Unlike MSS, CETR is a completely unsupervised algorithm and therefore does not require labeled training examples.<br />Finding a good threshold value is difficult because λ must be empirically found for each domain.<br />26<br />
  27. 27. Summary<br />Results show that when compared to several other leadingcontent extraction methods CETR performs best on average.<br />The complete CETR algorithm contains:<br />no parameters to adjust (k ← 3 and α ← 3 works for most cases).<br />no training to be done.<br />no classifier models to build.<br />all that is required is to give, as input, an HTML document and the approximate content will be returned.<br />27<br />
  28. 28. Future Work<br />The authors intend to incorporate this method into standard searchengines in order to see what effect, if any, it has on the resultrelevance.<br />For instance, many webpages include word strings and links in order to boost their search engine rank, if we can filter the irrelevant text from the page during indexing then it may be possible to present more relevant search results.<br />Another area for further investigation is the clustering algorithm used in CETR.<br />In fact, a linear max-margin clustering approach may give better results.<br />28<br />

×