
Web2Text: Deep Structured Boilerplate Removal


Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text elements in an HTML page as either boilerplate or main content. Our method uses convolutional networks on top of DOM tree features to learn unary classification potentials for each block of text on the page and pairwise potentials for each pair of neighboring text blocks. We find the most likely labeling according to these potentials using the Viterbi algorithm.

The proposed method improves page cleaning performance on the CleanEval benchmark compared to the state-of-the-art. As a component of information retrieval pipelines it improves retrieval performance on the ClueWeb12 collection.
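Stated as an equation, the decoding step solves a chain labeling problem. As a sketch (the interpolation weight λ between unary and pairwise terms is our notation, not necessarily the paper's exact formulation):

    \hat{l} = \operatorname*{arg\,max}_{l \in \{0,1\}^n}
              \sum_{i=1}^{n} \log p_i(l_i)
              + \lambda \sum_{i=1}^{n-1} \log p_{i,i+1}(l_i, l_{i+1})

where p_i(l_i) are the CNN-predicted unary potentials for block i and p_{i,i+1} the pairwise potentials for neighboring blocks. The Viterbi algorithm recovers the maximizing label sequence in time linear in the number of blocks.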



  1. Web2Text: Deep Structured Boilerplate Removal. Thijs Vogels (Disney Research), Octavian-Eugen Ganea (ETH Zurich), Carsten Eickhoff (Brown University)
  2. Outline: 1. Background & Goals 2. Feature Extraction 3. Representation Learning 4. Inference 5. Experiments
  3.-10. Boilerplate Removal [figure build-up across slides: an example CFP page with regions progressively labeled Header, Navigation, Search, and Dates, followed by a second annotated example ("Edgar")]
  11.-12. Beyond HTML Cleaning ● Main content ● Boilerplate ○ Ads ○ Banners ○ Navigation ○ Feeds ○ Next-article preview ○ Link lists ○ etc.
  13. Downstream Benefits ● More accurate content extraction ● Improved quality of derived systems ● More compact dataset/index sizes
  14. Method
  15. DOM Parsing ● JSoup ● Remove empty nodes: ○ empty content ○ whitespace only ● Remove non-content nodes: ○ <br> ○ <checkbox> ○ <hr> ○ etc.
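The slides name JSoup (a Java library); to keep all examples in one language, here is a rough Python equivalent of this preprocessing step using lxml. The tag list is an illustrative subset we assume, not the paper's exact list.

    # Sketch of the DOM cleaning on slide 15, using lxml in place of JSoup.
    # NON_CONTENT_TAGS is an abridged stand-in for the paper's full list.
    from lxml import html

    NON_CONTENT_TAGS = {"br", "hr", "input", "script", "style"}

    def clean_dom(root):
        """Recursively drop non-content nodes and empty/whitespace-only nodes."""
        for child in list(root):
            if child.tag in NON_CONTENT_TAGS:
                root.remove(child)
                continue
            clean_dom(child)
            # drop elements left with no children and no visible text
            if len(child) == 0 and not (child.text or "").strip():
                root.remove(child)
        return root

    tree = html.fromstring("<html><body><p> </p><p>Main content</p><hr></body></html>")
    clean_dom(tree)
    print(html.tostring(tree))  # b'<html><body><p>Main content</p></body></html>'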
  16. Collapsing the DOM Tree ● Inflated node hierarchies can hinder expressiveness ● Especially troublesome for distance calculation
  17. Collapsing DOM Trees
  18. Collapsing DOM Trees ● Merge single-child nodes with their respective children ● Repeat until no single-child nodes remain
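A minimal sketch of that collapsing rule on a toy tree (Node is a hypothetical stand-in for a parsed DOM element, not a class from the paper's code):

    # Merge every node that has exactly one child with that child, repeating
    # until no single-child chains remain; merged nodes keep the tag path.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        tags: list                          # merged tag path, e.g. ["div", "p"]
        children: list = field(default_factory=list)

    def collapse(node: Node) -> Node:
        while len(node.children) == 1:      # fold single-child chains into one node
            child = node.children[0]
            node.tags += child.tags
            node.children = child.children
        node.children = [collapse(c) for c in node.children]
        return node

    # <body><div><p/></div><ul/></body>  ->  body with children [div,p] and [ul]
    tree = Node(["body"], [Node(["div"], [Node(["p"])]), Node(["ul"])])
    print([c.tags for c in collapse(tree).children])  # [['div', 'p'], ['ul']]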
  19. Collapsed DOM Trees (CDOM)
  20. Webpages as Sequences of Blocks
  21. Unary Features (128 per block) ● Avg. word length ● Stopword ratio ● Numeric character ratio ● Relative distance from root ● Parent/grandparent information ● ...
  22. Pairwise Features (25 per neighboring pair) ● Tree distance in the CDOM ● Are the blocks separated by a line break? ● Node features of the common ancestor ● ...
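To make the feature lists concrete, here is how a few of the named unary features could be computed for one text block. The exact definitions (tokenization, stopword list) are our assumptions; the paper fixes its own.

    # Illustrative versions of three unary features from slide 21.
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}  # tiny sample list

    def unary_features(text: str) -> dict:
        words = text.split()
        n_words, n_chars = max(len(words), 1), max(len(text), 1)
        return {
            "avg_word_length": sum(len(w) for w in words) / n_words,
            "stopword_ratio": sum(w.lower() in STOPWORDS for w in words) / n_words,
            "numeric_char_ratio": sum(c.isdigit() for c in text) / n_chars,
        }

    print(unary_features("The 2 quick foxes"))
    # {'avg_word_length': 3.5, 'stopword_ratio': 0.25, 'numeric_char_ratio': 0.0588...}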
  23.-25. Representation Learning [built up across slides] ● 128/25 hand-crafted features ● Consolidate the "raw" features in two CNNs: ● Unary ○ 5 layers ○ ReLU ○ Softmax ○ Cross-entropy loss ○ Outputs: pi(li = 1), pi(li = 0) ● Pairwise ○ 5 layers ○ ReLU ○ Softmax ○ Cross-entropy loss ○ Outputs: pi,i+1(li, li+1) for all four label combinations (1,1), (1,0), (0,1), (0,0)
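As a concrete picture of the unary network, a hedged PyTorch sketch: five convolutional layers with ReLU over the sequence of 128-dimensional block feature vectors, ending in a 2-way softmax trained with cross-entropy. The hidden widths and kernel sizes are our assumptions, not the paper's; the pairwise network is analogous, with four output classes over the n-1 neighboring pairs.

    import torch
    import torch.nn as nn

    # Unary CNN: input (batch, 128 features, n blocks) -> per-block logits.
    unary_net = nn.Sequential(
        nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv1d(64, 2, kernel_size=1),            # logits for li in {0, 1}
    )

    blocks = torch.randn(1, 128, 188)               # one page, 188 text blocks
    logits = unary_net(blocks)                      # shape (1, 2, 188)
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (1, 188)))
    probs = logits.softmax(dim=1)                   # pi(li = 0), pi(li = 1) per block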
  26.-31. Inference [diagram, built up across slides: a chain over blocks b1 ... bn with unknown labels l1 ... ln; each block bi carries unary potentials pi(li = 1) and pi(li = 0), and each neighboring pair carries the four pairwise potentials pi,i+1(li, li+1)]
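Decoding the chain above is standard Viterbi. A minimal NumPy sketch, taking log unary potentials per block and log pairwise potentials per neighboring pair:

    import numpy as np

    def viterbi(log_unary, log_pairwise):
        """log_unary: (n, 2); log_pairwise: (n-1, 2, 2) indexed by (li, li+1)."""
        n = len(log_unary)
        score = log_unary[0].copy()                 # best score ending in each label
        backptr = np.zeros((n, 2), dtype=int)
        for i in range(1, n):
            # cand[a, b]: best sequence ending at block i with label b via label a
            cand = score[:, None] + log_pairwise[i - 1] + log_unary[i][None, :]
            backptr[i] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        labels = [int(score.argmax())]              # backtrack from the best end label
        for i in range(n - 1, 0, -1):
            labels.append(int(backptr[i][labels[-1]]))
        return labels[::-1]                         # 1 = main content, 0 = boilerplate

    lu = np.log([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])   # three toy blocks
    lp = np.log(np.full((2, 2, 2), 0.25))               # uniform pairwise potentials
    print(viterbi(lu, lp))                              # [1, 0, 0] -- unary wins here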
  32. Overview
  33. Experiments
  34. Experiment I: Boilerplate Removal ● CleanEval data ● 736 manually annotated Web pages ● On average 188 blocks per page ● Measured: accuracy, precision, recall, F1 ● Baselines: ○ BTE heuristic ○ CRF ○ Unfluff ○ Boilerpipe
  35. Experiment I: Boilerplate Removal
          Method              Acc    P      R      F1
          BTE                 0.75   0.76   0.84   0.80
          CRF                 0.82   0.88   0.81   0.84
          Boilerpipe (def)    0.79   0.89   0.74   0.81
          Boilerpipe (art)    0.67   0.89   0.50   0.64
          Boilerpipe (text)   0.59   0.93   0.33   0.48
          Unfluff             0.68   0.90   0.51   0.65
          Web2Text            0.86   0.87   0.90   0.88
  36. Experiment II: Document Retrieval ● Effect on ad-hoc retrieval ● ClueWeb12 collection (733M docs) ● 50 TREC 2013 Web Track queries ● Indri: query likelihood (QL) and relevance-based language model (RM)
  37.-40. Experiment II: Document Retrieval [results figure, findings built up across slides] ● Low-recall extractors (BTE, Boilerpipe art/text, Unfluff) hurt retrieval performance ● CRF and Web2Text extraction are significantly better than indexing the raw content ● Web2Text extraction is significantly better than all compared methods
  41. Run Times ● Average runtime per Web page ● MacBook with 2.8 GHz Intel Core i5 processor ● Total: 54 ms ○ DOM parsing & feature extraction: 35 ms ○ NN forward pass & Viterbi: 19 ms
  42. Conclusion ● Deep structured prediction pipeline for Web content extraction ○ Collapsed DOMs ○ Unary and pairwise potentials ○ CNN representation learning ○ HMM-style Viterbi inference ● Solid content extraction performance ● Can translate into increased downstream effectiveness
  43. Thank You! Code available at: https://github.com/dalab/web2text
