Web pages are a valuable source of information for many natural language processing and information retrieval tasks. Extracting the main content from those documents is essential for the performance of derived applications. To address this issue, we introduce a novel model that performs sequence labeling to collectively classify all text elements in an HTML page as either boilerplate or main content. Our method uses convolutional networks on top of DOM tree features to learn unary classification potentials for each block of text on the page and pairwise potentials for each pair of neighboring text blocks. We find the most likely labeling according to these potentials using the Viterbi algorithm.
The proposed method improves page cleaning performance on the CleanEval benchmark compared to the state-of-the-art. As a component of information retrieval pipelines it improves retrieval performance on the ClueWeb12 collection.