Content extraction: By Hadi Mohammadzadeh

.

Content Extraction

Identifying The Main Content in Html Documents

By : Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 6th of July . 2010


.

Outline

1. Introduction
2. Basic Terms and Concepts
3. New Single document Algorithms
4. Template clustering and detection


.

Part One

Introduction


.

What is the Problem

• Most HTML documents on the World Wide Web contain far more than the article or text
which forms their main content.
Navigation menus, functional and design elements or commercial banners are typical
examples of additional contents.


.

What is the Problem-Cont

• What, then, is Content Extraction (CE)?
CE is the process of identifying the main content and/or removing the
additional contents.

• Two different kinds of approaches have evolved to solve the CE task:
– Heuristic approaches on single documents.
– Template Detection (TD) approaches on multiple documents. These exploit the fact that
the template portions occur more frequently across documents, or even in every document.


.

What is the Problem-Cont

• Several applications benefit from CE under different aspects:
– Web Mining (WM) and Information Retrieval (IR) applications use CE to
preprocess the raw HTML data to reduce noise and to obtain more accurate results.
– Other applications use CE to reduce the document size for presentation on screen
readers and small screen devices.


.

Part Two

Basic Terms and Concepts


.

What you need to know before ….

• Three essential fields you need to know:
– Some common data models for web documents and their representations
• XHTML (Extensible Hypertext Markup Language), XML (Extensible Markup Language), XSLT
(Extensible Stylesheet Language Transformations), XPath
• SAX (Simple API for XML)
• DOM (Document Object Model)
• Templates, Content Management System (CMS)
– Including : Main navigation, Location display, Date of publication, News article,
Commercials, Related links, External links
– Basic issues from the field of Information Retrieval
• Concepts, Instances and Attributes
• Distance and Similarity Measures
• Query, Result Set, and Gold Standard
• Evaluation and Visualization
– Recall, Precision, F1-measure
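
As an illustration of the evaluation measures listed above, the following is a minimal sketch of how recall, precision and F1 could be computed for an extracted text against a gold standard. Comparing the two texts as bags of words is an assumption made here for simplicity; the function name and the sample texts are hypothetical.

```python
from collections import Counter


def precision_recall_f1(extracted_words, gold_words):
    """Compare an extracted text with a gold standard, both given as word lists."""
    extracted = Counter(extracted_words)
    gold = Counter(gold_words)
    # Multiset intersection: words present in both, counted with multiplicity.
    overlap = sum((extracted & gold).values())
    precision = overlap / sum(extracted.values()) if extracted else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the extractor kept two navigation words and lost one content word.
gold = "the main article text".split()
extracted = "home news the main article".split()
print(precision_recall_f1(extracted, gold))
```

Precision drops when additional contents are kept; recall drops when main content is lost.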


.

What you need to know before ….

1. Methods and data structures that can be used to represent documents for data and
text mining applications
• Document Representation
• Methods for classifications and clustering
– Instance based methods
» K-means for clustering
» K nearest neighbor for classification
– Statistical method
» Naïve Bayes (NB)
– Kernel based method
» Support vector machine


.

Part Three

New Single document Algorithms
Content Code Blurring (CCB)


.

Single Document Content Extraction

• CE methods which are based on single documents perform the extraction by analyzing
only the document at hand.
• CE algorithms and framework:
– Crunch framework
– Body Text Extraction (BTE) algorithm interprets an HTML document as a sequence of word
and tag tokens. It identifies a single, continuous region which contains most words while
excluding most tags. The problems of BTE are its quadratic complexity and its restriction to
discovering only a single, continuous text passage as main content.

– Document Slope Curves (DSC) algorithm is an extended BTE. Using a windowing
technique, it can also locate several document regions in which word
tokens are more frequent than tag tokens, while reducing the complexity to linear
runtime.

– Link Quota Filters (LQF) is a quite common heuristic for identifying link lists and
navigation elements. The basic idea is to find DOM elements which consist mainly of text
in hyperlink anchors.

– Content Code Blurring (CCB) is based on finding regions in the source code character
sequence which represent homogeneously formatted text. Its ACCB variation, which
ignores format changes caused by hyperlinks, performed better than all previous CE
heuristics.
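
The BTE idea described above can be sketched as follows. This is a simplified illustration, not a reference implementation: the regular-expression tokenizer and the function names are assumptions, and the objective used here (tags outside the region plus words inside it) is one common way to formalize "most words while excluding most tags". The double loop makes the quadratic complexity visible.

```python
import re


def tokenize(html):
    """Split HTML into tokens; bit 1 marks a tag token, bit 0 a word token."""
    tokens = re.findall(r"<[^>]*>|[^<\s]+", html)
    bits = [1 if tok.startswith("<") else 0 for tok in tokens]
    return tokens, bits


def bte(html):
    """Return the words of the single region that maximizes the BTE-style score."""
    tokens, bits = tokenize(html)
    n = len(bits)
    # prefix[k] = number of tag tokens among the first k tokens
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)
    total_tags = prefix[n]
    best_score, best_span = -1, (0, -1)
    for i in range(n):              # region start
        for j in range(i, n):       # region end (inclusive)
            tags_inside = prefix[j + 1] - prefix[i]
            words_inside = (j - i + 1) - tags_inside
            score = (total_tags - tags_inside) + words_inside
            if score > best_score:
                best_score, best_span = score, (i, j)
    i, j = best_span
    return [tok for tok, b in zip(tokens[i:j + 1], bits[i:j + 1]) if b == 0]


# Hypothetical page: a navigation list, one article paragraph, a footer link.
page = ("<ul><li><a>Home</a></li><li><a>News</a></li></ul>"
        "<p>This is the long main article text of the page.</p>"
        "<div><a>Imprint</a></div>")
print(" ".join(bte(page)))
```

Because only one region is returned, a second article paragraph separated by many tags would be lost, which is the restriction DSC addresses.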

.

Evaluation of Content Extraction Algorithms

• Human User Evaluation
• Application Specific Evaluation
• Evaluation based on Information Retrieval Measures


.

Introduction of CCB

• CCB is a novel CE algorithm.

• CCB is:
– robust to invalid or badly formatted HTML documents,
– fast, and delivers very good results on most documents.

• The idea underlying content code blurring is to take advantage of visual features
of the main and the additional contents.
Additional contents are usually highly formatted and contain only a few short texts.

• The main text content, on the other hand, is long and homogeneously formatted.

• As in the source code of an HTML document any change of format is indicated by a tag,
we will try to identify those parts of the document which contain a lot of text and few or
no tags.


.

Concept and Idea of CCB

• There are two different ways to obtain a suitable document representation:
– The first strikes a new path for document representations in the CE context by
determining for each single character whether it is content or code.
– The second is based on a token sequence, as used by BTE and DSC.

• Both ways lead to a representation of a document as a sequence of atomic
elements which are either content or code. We will refer to this vector from now
on as the content code vector (CCV).
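
A minimal sketch of the character-based variant might look like this. It assumes that every character between "<" and ">" (inclusive) is code and everything else is content; a real implementation would also have to deal with scripts, styles, comments and entities, which are ignored here. The function name is hypothetical.

```python
def content_code_vector(html):
    """Return one entry per character: 1 for content, 0 for code."""
    ccv = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        # The brackets themselves count as code.
        ccv.append(0 if in_tag else 1)
        if ch == ">":
            in_tag = False
    return ccv


print(content_code_vector("<p>Hi</p>"))  # [0, 0, 0, 1, 1, 0, 0, 0, 0]
```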


.

Concept and Idea of CCB

• For each single element in the CCV we determine a ratio of content to code in its
vicinity to find out if it is surrounded mainly by content or by code.

• If this content code ratio (CCR) is high for several elements in a row, i.e. they
are surrounded mainly by text and only by a few tags, these elements are taken to be
part of the main content.


.

Blurring the Content Code Vector

• Each entry in the CCV is initialized with a value of 1 if the corresponding element is
of type content and with a value of 0 for code.

• To obtain the CCR we calculate for each entry a weighted and local average of
the values in a neighborhood with a fixed symmetric range. In inhomogeneous
neighborhoods the average value will be between 0 and 1. If they are mainly
content, the ratio will be high, if they are mainly code, the ratio will be low. So,
the average values have exactly the properties we need for our CCR values.
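
The blurring step can be sketched as a weighted moving average. The slides only state that the average is weighted, local and symmetric; the Gaussian weights, the default `radius` and `sigma`, and the handling of the document borders below are assumptions made for illustration.

```python
import math


def blur(ccv, radius=3, sigma=1.5):
    """Weighted local average of each CCV entry over a symmetric window."""
    offsets = range(-radius, radius + 1)
    # Assumed weighting: Gaussian, so close neighbors count more than distant ones.
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    ccr = []
    for i in range(len(ccv)):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < len(ccv):  # positions outside the document are ignored
                num += w * ccv[j]
                den += w
        ccr.append(num / den)
    return ccr


# Code on the left, content on the right: the ratio rises smoothly from 0 to 1.
print([round(x, 2) for x in blur([0, 0, 0, 0, 1, 1, 1, 1])])
```

In a homogeneous neighborhood the average stays at exactly 0 or 1; only near a change between code and content does it take intermediate values.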


.

Implementation and Adaptations

• Finding the main content corresponds to selecting those elements of the CCV which have
a high CCR value, i.e. a value close to 1.

• An element in the CCV is considered to be part of the main content, if it has a CCR
value above a fixed threshold t.
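
The selection step can be sketched as a simple threshold filter. The function name, the default value of `t` and the grouping of selected characters into contiguous runs are assumptions for illustration; the input is the document's character sequence together with one CCR value per character.

```python
def select_main_content(chars, ccr, t=0.75):
    """Keep the elements whose CCR exceeds t, grouped into contiguous runs."""
    runs, current = [], []
    for ch, ratio in zip(chars, ccr):
        if ratio > t:
            current.append(ch)
        elif current:
            # A low-ratio element ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# Hypothetical CCR values for a six-character sequence.
print(select_main_content("abcdef", [0.1, 0.9, 0.8, 0.2, 0.95, 0.3]))  # ['bc', 'e']
```

A practical extractor would presumably also drop any code characters that end up inside a selected run.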


.

Part Four

Clustering

Template Based Web Documents
(TBWD)


.

Abstract
• More and more documents on the World Wide Web are based on templates.

• On a technical level this causes those documents to have a quite similar source
code and DOM tree structure.

• Grouping together documents which are based on the same template is an
important task for applications that analyze the template structure and need
clean training data.

• This paper develops and compares several distance measures for clustering web
documents according to their underlying templates. In other words we take a
closer look at web document distance measures which are supposed to reflect
template related structural similarities and dissimilarities.


.

General Information

• As more and more documents on the World Wide Web are generated
automatically by Content Management Systems (CMS), more and more of them
are based on templates.

• Templates can be seen as framework documents which are filled with different
contents to compile the final documents.

• A technical side effect is that the source code of template generated documents
is always very similar.


.

Related Works - 1
for Recognizing Template Structures in HTML Documents

• First, Bar-Yossef and Rajagopalan proposed a template recognition algorithm
based on DOM tree segmentation and segment selection.
(Template detection via data mining and its applications, 2002)

• Lin and Ho developed InfoDiscoverer, which is based on the idea that, in contrast
to the main content, template generated contents appear more frequently.
(Discovering informative content blocks from web documents, 2002)

• Debnath et al. used a similar assumption of redundant blocks in
ContentExtractor but take into account not only words and text but also other
features like image or script elements.
(Automatic extraction of informative blocks from webpages, 2005)


.

Related Works - 2

• The Site Style Tree (SST) approach of Yi, Liu and Li instead concentrates
more on the visual impression single DOM tree elements are supposed to achieve
and declares identically formatted DOM sub-trees to be template generated.
(Eliminating noisy information in web pages for data mining, 2003)

• Cruz et al. describe several distance measures for web documents. They
distinguish between distance measures based on tag vectors, parametric functions
or tree edit distances.
(Measuring structural similarity among web documents: preliminary results, 1998)

• In the more general context of comparing XML documents, Buttler stated that tree
edit distances are probably the best similarity measures, but also very expensive.
Buttler therefore proposes the path shingling approach, which makes
use of the shingling technique.
(A short survey of document structure similarity algorithms, 2004)


.

Related Works - 3

• Shi et al. propose an alignment based on a simplified DOM tree representation to
find parallel versions of web documents in different languages.
(A DOM tree alignment model for mining parallel data from the web, 2006)


.

Distance Measures for TBWD Structures

There are six measures for calculating distances between TBWD,
based either on the DOM tree structure or on the tag sequence.

• RTDM (Restricted Top-Down Mapping) Algorithm – Tree Edit Distance
This distance measure is based on calculating the cost for transforming a
source tree into a target tree structure.

• CP – Common Paths
Another way is to look at the paths leading from the root node to the leaf
nodes in the DOM tree.

• CPS – Common Path Shingles
The idea is not to compare complete paths but rather to break them up into
smaller pieces of equal length – the shingles.
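
The two path-based measures can be sketched as set comparisons. In this illustration the DOM tree is given as nested `(tag, [children])` tuples, and the normalisation `1 - |A ∩ B| / max(|A|, |B|)` is an assumption; the shingle length `k` and all function names are hypothetical.

```python
def root_to_leaf_paths(node, prefix=()):
    """Yield every root-to-leaf tag path of a tree given as (tag, [children])."""
    tag, children = node
    path = prefix + (tag,)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


def shingles(path, k):
    """Break one path into all contiguous pieces of length k."""
    if len(path) <= k:
        return {path}  # assumption: short paths are kept whole
    return {path[i:i + k] for i in range(len(path) - k + 1)}


def set_distance(a, b):
    """Assumed normalisation: 1 - |a & b| / max(|a|, |b|)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / max(len(a), len(b))


def cp_distance(tree1, tree2):
    """Common Paths: compare the sets of complete root-to-leaf paths."""
    return set_distance(set(root_to_leaf_paths(tree1)), set(root_to_leaf_paths(tree2)))


def cps_distance(tree1, tree2, k=2):
    """Common Path Shingles: compare the sets of path pieces of length k."""
    s1 = set().union(*(shingles(p, k) for p in root_to_leaf_paths(tree1)))
    s2 = set().union(*(shingles(p, k) for p in root_to_leaf_paths(tree2)))
    return set_distance(s1, s2)


# Two hypothetical documents sharing one of their two paths.
doc_a = ("html", [("body", [("div", [("p", [])]), ("ul", [("li", [])])])])
doc_b = ("html", [("body", [("div", [("p", [])]), ("table", [("tr", [])])])])
print(cp_distance(doc_a, doc_b), cps_distance(doc_a, doc_b))
```

Shingles let two documents be recognized as similar even when their complete paths differ only in a small part.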


.

Distance Measures for TBWD Structures

• TV – Tag Vector
Counting how many times each possible tag appears converts a document D
into a vector v(D) of fixed dimension N.

• LCTS – Longest Common Tag Subsequence
The distance of two documents can be expressed based on their longest
common tag subsequence.

• CTSS – Common Tag Sequence Shingles
To overcome the computational costs of the previous distance measure, we
again use the shingling technique.
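
The three tag sequence based measures can be sketched as follows. The regular-expression tag extraction (opening tags only), the Euclidean distance for tag vectors, the normalisation by the longer sequence and the shingle length `k` are all assumptions made for illustration; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tag_sequence(html):
    """Sequence of opening-tag names in document order (closing tags are ignored)."""
    return [t.lower() for t in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html)]


def tv_distance(seq1, seq2):
    """Tag Vector: Euclidean distance between the tag-count vectors."""
    c1, c2 = Counter(seq1), Counter(seq2)
    return math.sqrt(sum((c1[t] - c2[t]) ** 2 for t in set(c1) | set(c2)))


def lcs_length(a, b):
    """Dynamic programming for the longest common subsequence, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcts_distance(seq1, seq2):
    """Longest Common Tag Subsequence, normalized by the longer sequence."""
    longest = max(len(seq1), len(seq2))
    return 0.0 if longest == 0 else 1.0 - lcs_length(seq1, seq2) / longest


def ctss_distance(seq1, seq2, k=3):
    """Common Tag Sequence Shingles: compare the sets of k-grams of tags."""
    s1 = {tuple(seq1[i:i + k]) for i in range(len(seq1) - k + 1)}
    s2 = {tuple(seq2[i:i + k]) for i in range(len(seq2) - k + 1)}
    if not s1 and not s2:
        return 0.0
    return 1.0 - len(s1 & s2) / max(len(s1), len(s2))


a = tag_sequence("<html><body><div><p>x</p><p>y</p></div></body></html>")
b = tag_sequence("<html><body><div><p>x</p></div><ul><li>z</li></ul></body></html>")
print(tv_distance(a, b), lcts_distance(a, b), ctss_distance(a, b))
```

The quadratic table in `lcs_length` is the computational cost that the shingle-based measure avoids.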


.

Clustering Techniques

In this paper we have applied two different techniques for clustering TBWD.

1. K-Median Clustering
2. Single Linkage
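
As an illustration of the second technique, the following is a minimal sketch of single linkage clustering on a precomputed distance matrix. It is a naive version for small inputs, not an optimized implementation; the function name and the sample matrix are hypothetical.

```python
def single_linkage(dist, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = distance of the closest member pair.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Hypothetical distances: documents 0 and 1 share a template, as do 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(single_linkage(dist, 2))  # [[0, 1], [2, 3]]
```

Any of the distance measures from the previous slides could supply the matrix.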


.

Experiments

• To evaluate the different distance measures we collected a corpus
of 500 documents from five different German news web sites.

• Each web site contributed 20 documents from each of five different
topical categories: national and international politics, sports,
business and IT related news.

• Once the distance matrices had been computed, the different
cluster analysis methods were applied to each of them.


.

Experiments-Cont

• Evaluation of Clustering:
We used three different measures to evaluate the k-median and
the single linkage algorithms:
– The Rand index
• The Rand index (or Rand measure) measures how close the clustering results
are to the original classes. A value of one means perfect clustering.
– Cluster purity
– Mutual information
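
Two of these measures can be sketched in a few lines. The Rand index below counts the pairs of documents on which the clustering and the original classes agree. The purity function uses the common definition (share of documents that belong to the majority class of their cluster), which is an assumption and may differ from how the averaged purity reported in the slides was computed. Both functions expect at least two documents.

```python
from collections import Counter
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Share of document pairs treated the same way by classes and clusters."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    majority = sum(max(c.values()) for c in by_cluster.values())
    return majority / len(labels_true)


# Hypothetical result: one document of site "a" ended up in the cluster of site "b".
true = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
print(rand_index(true, pred), purity(true, pred))
```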


.

Experiments-Cont

Evaluation of k-median clustering for k = 5 (average of 100 repetitions)
based on the different distance measures
RTDM, CP, CPS, TV, LCTS, CTSS
and using the performance measures
Rand index, cluster purity, mutual information

Distance RTDM TV CP CPS LCTS CTSS
Measure
Rand Index 0.9399 0.9140 0.9157 0.9293
0.9608 0.9560
Ave. Purity 0.9235 0.9057 0.8629 0.9218
0.9613 0.9535
Mutual 0.1354 0.1302 0.1250 0.1350
0.1444 0.1432
Information

RTDM provides the best results, followed by the common path measures.


.

Experiments-Cont

Evaluation of single linkage clustering for five clusters
based on the different distance measures
RTDM, CP, CPS, TV, LCTS, CTSS
and using the performance measures
Rand index, cluster purity, mutual information

Distance Measure     RTDM    TV      CP      CPS     LCTS    CTSS
Rand Index           0.9200  0.9200  1.0000  1.0000  1.0000  1.0000
Ave. Purity          0.9005  0.9005  1.0000  1.0000  1.0000  1.0000
Mutual Information   0.1287  0.1287  0.1553  0.1553  0.1553  0.1553

We can deduce that single linkage is a better way to form clusters for template based documents.


.

References
• Thomas Gottron. Evaluating content extraction on HTML documents. In ITA ’07: Proceedings of the 2nd
International Conference on Internet Technologies and Applications, pages 123–132, September 2007.

• Thomas Gottron. Combining content extraction heuristics: the combine system. In iiWAS ’08: Proceedings
of the 10th International Conference on Information Integration and Web-based Applications & Services,
pages 591–595, New York, NY, USA, 2008. ACM.

• Thomas Gottron. Content code blurring: A new approach to content extraction. In DEXA ’08: 19th
International Workshop on Database and Expert Systems Applications, pages 29–33. IEEE Computer
Society, September 2008.

• Thomas Gottron. Clustering template based web documents. In Proceedings of the 30th European
Conference on Information Retrieval, pages 40–51, 2008.

