E-MINE: A NOVEL WEB MINING
In recent years, government agencies and
industrial enterprises have been using the web as a
medium of publication.
It has become increasingly difficult to identify
relevant pieces of information, since pages
are cluttered with irrelevant content such as
advertisements, copyright notices…
surrounding the main content.
Thus we propose a technique that mines the
relevant data regions from a web page.
Several attempts have been made to extract
the regularly structured data from the web
The main disadvantage of the existing
approaches is the assumption that the relevant
information of a data record is contained in the
HTML code, which is not always true.
So, we propose a more effective method to
mine the data region in the web page.
MDR (Mining Data Records) is a technique
mainly used in the area of data mining.
It exploits the regularities in HTML tag
The MDR algorithm makes use of the HTML
tag tree of the web page to extract data
records from the page.
The algorithm is based on two observations:
(a) A group of data records is always
presented in a contiguous region of the web
page and is formatted using similar HTML
tags. Such a region is called a Data Region.
(b) The nested structure of the HTML tags in
a web page usually forms a tag tree, and a set
of similar data records is formed by some
child sub-trees of the same parent node.
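The two observations above can be illustrated with a minimal sketch. It builds a tag tree using Python's standard `html.parser` module and reports runs of adjacent sibling sub-trees that share the same tag structure. The names `Node`, `TagTreeBuilder`, `signature` and `find_data_regions` are hypothetical, and exact signature equality is a simplifying assumption made here; it is not MDR's actual similarity measure.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML source, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Tag structure of a sub-tree written as a nested string."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def similar_runs(parent, min_run=2):
    """Runs of adjacent children whose sub-trees have identical signatures."""
    sigs = [signature(c) for c in parent.children]
    runs, start = [], 0
    for i in range(1, len(sigs) + 1):
        if i == len(sigs) or sigs[i] != sigs[start]:
            if i - start >= min_run:
                runs.append(parent.children[start:i])
            start = i
    return runs


def find_data_regions(node):
    """Return (parent tag, number of similar records) for every run found."""
    regions = [(node.tag, len(run)) for run in similar_runs(node)]
    for child in node.children:
        regions.extend(find_data_regions(child))
    return regions


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    return builder.root


page = (
    "<body><div><img></div>"
    "<ul><li><b>A</b><i>1</i></li><li><b>B</b><i>2</i></li>"
    "<li><b>C</b><i>3</i></li></ul></body>"
)
print(find_data_regions(build_tree(page)))
```

In this toy page the three `<li>` sub-trees under the same `<ul>` parent form the only candidate data region, while the `<div>` holding an image does not repeat and is ignored.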
This proposed technique can help the system in three ways:
a) It enables the system to identify gaps that separate
records, which helps to segment data records
b) The visual information also contains information about the
hierarchical structure of the tags.
c) By observing a webpage, it can be seen that
the relevant data region occupies the major central
SYSTEM MODEL OF AN E-MINE
[Figure: system model. Recovered labels: HTML source of a web page, Largest Rectangle Identifier, Relevant Data Region]
The system model mainly consists of three components:
Largest Rectangle Identifier,
Container Identifier and
The output of each component is the input of the next.
The e-mine technique is based on three observations:
A group of data records is typically presented in
a contiguous region of a page.
The area covered by a rectangle that bounds the
data region is more than the area covered by the
rectangles bounding other regions, e.g.
advertisements and links.
The height of an irrelevant data record within a
collection of data records is less than the average
height of relevant data records within that region.
INPUT: HTML source of the web page.
STEP 1: Determine the height and width of all the bounding
rectangles in the HTML document.
STEP 2: Calculate the areas of all the bounding rectangles.
STEP 3: Identify the Maximum Rectangle from all the bounding rectangles.
STEP 4: Identify the container within the Maximum
Rectangle obtained from step 3.
STEP 5: Identify the Data Region in the container
obtained from step 4.
STEP 6: Filter the Data Region obtained after step 5 to
remove further irrelevant data.
HOW THE ALGORITHM WORKS?
Determining the height and width of all bounding rectangles.
Identification of the largest rectangle.
Identification of the container within the largest rectangle.
Identification of the data region containing data
records within the container.
DETERMINING HEIGHT AND
WIDTH OF ALL BOUNDING
RECTANGLES
In the first step of the proposed technique,
we determine the dimensions of all the
bounding rectangles in the web page.
If the dimensions are not specified in the HTML
source, the MSHTML parsing and rendering engine of
Microsoft Internet Explorer 6.0 can be used to obtain them.
The parsing and rendering engine of the web
browser gives us the co-ordinates of a
IDENTIFICATION OF THE
LARGEST RECTANGLE
Based on the height and width of the bounding
rectangles obtained in the previous step, we
determine the area of each bounding rectangle.
Among these rectangles determine the
The reason for doing so is that the largest
bounding rectangle will always contain the
most relevant data in the web page.
PROCEDURE FOR IDENTIFICATION
OF LARGEST RECTANGLE
Input: <body> of the HTML source
Maximum Rectangle = none
for each child of the <body> tag
    find the coordinates of the bounding rectangle of the child
    if the area of the bounding rectangle >
       area of Maximum Rectangle
    then Maximum Rectangle = child
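The procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Box` class and the function name `largest_rectangle` are hypothetical, and the widths and heights are assumed to have already been obtained from a rendering engine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Box:
    """Bounding rectangle of one tag (dimensions assumed to come from a renderer)."""
    tag: str
    width: float
    height: float
    children: List["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def largest_rectangle(body: Box) -> Optional[Box]:
    """Return the child of <body> whose bounding rectangle has the largest area."""
    maximum = None
    for child in body.children:
        if maximum is None or child.area > maximum.area:
            maximum = child
    return maximum


# Hypothetical page layout: a header, a main column and a sidebar.
body = Box("body", 1000, 800, [
    Box("div#header", 1000, 100),
    Box("div#main", 800, 600),
    Box("div#sidebar", 200, 600),
])
print(largest_rectangle(body).tag)  # div#main
```

Returning `None` for a `<body>` without children is a choice made for this sketch; the procedure itself does not say what should happen in that case.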
IDENTIFICATION OF THE
CONTAINER WITHIN THE LARGEST
RECTANGLE
Once we have obtained the largest
rectangle, we form a set of the entire
The reason is that the most important data of
the web page must occupy a significant portion of
the web page.
Determine the bounding rectangle having the
largest area in the set because only the
largest rectangle will contain the data
PROCEDURE FOR IDENTIFICATION
OF CONTAINER WITHIN THE
LARGEST RECTANGLE
Input: The largest rectangle out of all bounding rectangles (Maximum Rectangle).
List_of_Children = depth-first listing of all the
children of the tag associated with Maximum Rectangle.
for each tag in List_of_Children
    if area of bounding rectangle of tag > half the area of
       Maximum Rectangle
    then container = tag
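A minimal Python sketch of this procedure follows. The `Box` class and the names `depth_first` and `find_container` are hypothetical, and the dimensions are assumed to be supplied by a rendering engine. Falling back to the largest rectangle itself when no descendant qualifies is also an assumption, since the procedure does not cover that case.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Box:
    """Bounding rectangle of one tag (dimensions assumed to come from a renderer)."""
    tag: str
    width: float
    height: float
    children: List["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def depth_first(node: Box) -> Iterator[Box]:
    """Depth-first (pre-order) listing of all descendants of node."""
    for child in node.children:
        yield child
        yield from depth_first(child)


def find_container(maximum: Box) -> Box:
    """Last tag in depth-first order covering more than half of the largest rectangle."""
    container = maximum  # assumed fallback when no descendant qualifies
    for tag in depth_first(maximum):
        if tag.area > maximum.area / 2:
            container = tag
    return container


# Hypothetical layout inside the largest rectangle.
maximum = Box("div#main", 800, 600, [
    Box("div#wrapper", 780, 580, [
        Box("ul#results", 760, 500, [Box("li", 760, 100) for _ in range(5)]),
        Box("div#footer", 780, 60),
    ]),
])
print(find_container(maximum).tag)  # ul#results
```

If sibling rectangles do not overlap, at most one child at each level can cover more than half of the largest rectangle, so the last qualifying tag in the depth-first listing is also the deepest one.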
IDENTIFICATION OF DATA REGION
CONTAINING DATA RECORDS WITHIN
THE CONTAINER
To remove the irrelevant data from the
container we use a filter.
The filter determines the average height of the
data within the container.
Those data whose heights are less than the
average height are identified as irrelevant
PROCEDURE FOR FILTER
Input: The container obtained from the previous step.
totalHeight = 0
for each child tag within container
    totalHeight += height of the bounding rectangle of child
averageHeight = totalHeight / number of children of container
for each child within container
    if height of child's bounding rectangle < averageHeight
    then discard child from container
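The filter can be sketched in a few lines of Python. This is an illustrative sketch only: the `Child` class and the function name `filter_container` are hypothetical, and the heights are assumed to be known from the rendering step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    """One child tag of the container with the height of its bounding rectangle."""
    tag: str
    height: float


def filter_container(children: List[Child]) -> List[Child]:
    """Keep only the children whose height is not below the average height."""
    if not children:
        return []
    average_height = sum(c.height for c in children) / len(children)
    return [c for c in children if c.height >= average_height]


# Hypothetical container: three records and one thin separator.
container = [
    Child("li.record", 120),
    Child("li.record", 120),
    Child("li.separator", 10),
    Child("li.record", 130),
]
kept = filter_container(container)
print([c.height for c in kept])  # [120, 120, 130]
```

Here the average height is 95, so only the thin separator is discarded. Because the comparison is against the plain average, a genuine record that happens to be shorter than the average would be discarded as well.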
MDR VS E-MINE
Here the proposed technique is evaluated and
compared with MDR (Mining Data
Records). This evaluation consists of three parts:
Data Region Extraction,
Data Record Extraction,
Overall Time Complexity.
DATA REGION EXTRACTION
MDR is dependent on certain tags like
<table>, <tbody>, etc. for identifying data regions.
A data region can be contained in some tags
like <table>, <tbody>, <p>, <li>, <form>, etc.
In the proposed e-mine system, the data
region identification is independent of
specific tags and forms.
DATA RECORD EXTRACTION
MDR identifies data records based on
keyword search, e.g. "$".
MDR not only identifies the relevant data
region containing the search result records
but also extracts records from all other
sections of the page.
OVERALL TIME COMPLEXITY
The existing algorithm, MDR, has a
complexity of the order O(nk), where
n is the total number of nodes and
k is the maximum number of tags.
In this paper we proposed a new approach to
extract structured data from web pages.
Although there are several techniques, e-mine is
a purely visual-structure-oriented method that can
correctly identify the data regions.
Most current algorithms fail to correctly
determine the data region when the data region
consists of only one data record.
Thus e-mine overcomes the drawbacks of the
existing methods and performs significantly
better than existing techniques.