
E mine by V.DINESH KUMAR KSRCT


E-MINE is a novel web mining technique that extracts only the important data from a website.

Published in: Education, Technology


  1. E-MINE: A NOVEL WEB MINING APPROACH
     Submitted by V.DINESH KUMAR, II-MCA.
  2. ABSTRACT
     In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. It has become increasingly difficult to identify relevant pieces of information, since pages are cluttered with irrelevant content, such as advertisements and copyright notices, surrounding the main content. We therefore propose a technique that mines the relevant data regions from a web page.
  3. INTRODUCTION
     Several attempts have been made to extract regularly structured data from web pages. The main disadvantage of the existing approaches is their assumption that the relevant information of a data record is contained in the HTML code, which is not always true. We therefore propose a more effective method to mine the data region of a web page.
  4. RELATED WORK
     MDR (Mining Data Records) is a technique mainly used in the area of data mining. It exploits the regularities in the HTML tag structure directly. The MDR algorithm uses the entire HTML tag tree of the web page to extract data records from the page.
  5. The MDR algorithm is based on two observations:
     (a) A group of data records is always presented in a contiguous region of the web page and is formatted using similar HTML tags. Such a region is called a Data Region.
     (b) The nested structure of the HTML tags in a web page usually forms a tag tree, and a set of similar data records is formed by some child sub-trees of the same parent node.
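
Observation (b) can be illustrated with a small sketch. This is not the MDR algorithm itself, which the slides do not spell out; it only groups contiguous children of one parent whose tag sub-trees are identical. The tuple-based tree encoding, the function names and the `min_run` parameter are assumptions made for the example, and exact matching stands in for whatever similarity measure MDR actually uses.

```python
from itertools import groupby

# Hypothetical encoding: a node is a (tag, [child nodes]) tuple.
def signature(node):
    """Nested tuple describing the tag structure of a sub-tree."""
    tag, children = node
    return (tag, tuple(signature(child) for child in children))

def contiguous_similar_runs(parent, min_run=2):
    """Runs of adjacent children of `parent` that share one tag structure."""
    _, children = parent
    runs = []
    for _, group in groupby(children, key=signature):
        group = list(group)
        if len(group) >= min_run:
            runs.append(group)
    return runs

row = ("tr", [("td", []), ("td", [])])
table = ("table", [("tr", [("th", [])]), row, row, row, ("div", [])])
runs = contiguous_similar_runs(table)
print(len(runs), len(runs[0]))
```

Here the three identical rows form the only candidate run, while the header row and the trailing `div` are left out.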
  6. PROPOSED TECHNIQUE
     The proposed technique helps the system in three ways:
     a) It enables the system to identify the gaps that separate records, which helps to segment data records correctly.
     b) The visual information also contains information about the hierarchical structure of the tags.
     c) By observing a web page, it can be seen that the relevant data region occupies the major central
  7. SYSTEM MODEL OF THE E-MINE TECHNIQUE
     HTML source of a web page -> Largest Rectangle Identifier -> Container Identifier -> Filter -> Relevant Data Region
  8. The system model consists of three main components:
     Largest Rectangle Identifier,
     Container Identifier, and
     Filter.
     The output of each component is the input of the next component.
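
The chaining of the three components can be sketched as a simple function pipeline. The stage functions below are placeholders that only record that they ran; their names and the `run_pipeline` helper are assumptions for illustration, not part of the original system.

```python
from functools import reduce

def run_pipeline(stages, page):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, page)

# Placeholder stages standing in for the three components (assumed names).
def largest_rectangle_identifier(trace):
    return trace + ["largest rectangle"]

def container_identifier(trace):
    return trace + ["container"]

def data_filter(trace):
    return trace + ["relevant data region"]

result = run_pipeline(
    [largest_rectangle_identifier, container_identifier, data_filter],
    ["HTML source"],
)
print(" -> ".join(result))
```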
  9. The e-mine technique is based on three observations:
     A group of data records is typically presented in a neighbouring region of a page.
     The area covered by the rectangle that bounds the data region is larger than the area covered by the rectangles bounding other regions, e.g. advertisements and links.
     The height of an irrelevant data record within a collection of data records is less than the average height of the relevant data records within that region.
  10. ALGORITHM e-Mine
      INPUT: HTML source of a web page.
      STEP 1: Determine the height and width of all the bounding rectangles in the HTML document.
      STEP 2: Calculate the areas of all the bounding rectangles.
      STEP 3: Identify the Maximum Rectangle from all the bounding rectangles.
      STEP 4: Identify the container within the Maximum Rectangle obtained from step 3.
      STEP 5: Identify the Data Region in the container obtained from step 4.
      STEP 6: Filter the Data Region obtained after step 5 to remove some more irrelevant data.
  11. HOW DOES THE ALGORITHM WORK?
      Determining the height and width of all bounding rectangles.
      Identification of the largest rectangle.
      Identification of the container within the largest rectangle.
      Identification of the data region containing data records within the container.
  12. DETERMINING HEIGHT AND WIDTH OF ALL BOUNDING RECTANGLES
      In the first step of the proposed technique, we determine the dimensions of all the bounding rectangles in the web page. If they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used. The parsing and rendering engine of the web browser gives us the co-ordinates of each bounding rectangle.
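
Once a rendering engine has reported the co-ordinates of a bounding rectangle, its width, height and area follow by subtraction and multiplication. The sketch below assumes the co-ordinates arrive as left, top, right and bottom pixel values; the `Rect` class and the sample values are made up for illustration and do not reflect the MSHTML API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    """Bounding rectangle given by its edge co-ordinates (assumed layout)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self):
        return max(0, self.right - self.left)

    @property
    def height(self):
        return max(0, self.bottom - self.top)

    @property
    def area(self):
        return self.width * self.height

# Made-up co-ordinates for an advertisement banner and a results block.
ad_banner = Rect(left=0, top=0, right=728, bottom=90)
results = Rect(left=100, top=120, right=900, bottom=1320)
print(ad_banner.area, results.area)
```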
  13. IDENTIFICATION OF THE LARGEST RECTANGLE
      Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of each bounding rectangle. Among these rectangles, we determine the largest one. The reason for doing so is that the technique assumes the largest bounding rectangle always contains the most relevant data in the web page.
  14. PROCEDURE FOR IDENTIFICATION OF LARGEST RECTANGLE
      Procedure getMaxRect
      Input: <body> of the HTML source
      Maximum Rectangle = none
      for each child of <body> tag
      begin
          Find the coordinates of the bounding rectangle for the child
          if Maximum Rectangle is none or the area of the bounding rectangle > area of Maximum Rectangle then
              Maximum Rectangle = child
          endif
      end
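
The getMaxRect procedure could be written in Python roughly as follows. The `Node` class, which carries a tag name and the rendered width and height of its bounding rectangle, is an assumption made so the example is self-contained; in practice these values would come from a rendering engine.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical rendered element: tag name plus bounding-rectangle size."""
    tag: str
    width: float
    height: float
    children: list = field(default_factory=list)

    @property
    def area(self):
        return self.width * self.height

def get_max_rect(body):
    """Return the direct child of <body> with the largest bounding-rectangle area."""
    max_rect = None
    for child in body.children:
        if max_rect is None or child.area > max_rect.area:
            max_rect = child
    return max_rect

# Made-up page layout: header, sidebar and a main content block.
body = Node("body", 1000, 1400, [
    Node("header", 1000, 80),
    Node("aside", 200, 600),
    Node("main", 760, 1200),
])
print(get_max_rect(body).tag)
```

The function returns `None` for a body without children, a case the procedure does not address.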
  15. IDENTIFICATION OF THE CONTAINER WITHIN THE LARGEST RECTANGLE
      Once we have obtained the largest rectangle, we form a set of all the bounding rectangles within it. The reason is that the most important data of a web page must occupy a significant portion of the page. We then determine the bounding rectangle having the largest area in the set, because only the largest rectangle will contain the data records.
  16. PROCEDURE FOR IDENTIFICATION OF CONTAINER WITHIN THE LARGEST RECTANGLE
      Procedure getContainer
      Input: the largest rectangle (Maximum Rectangle) out of all bounding rectangles
      List_of_Children = depth-first listing of all the children of the tag associated with Maximum Rectangle
      for each tag in List_of_Children
      begin
          if area of the bounding rectangle of tag > half the area of Maximum Rectangle then
              container = tag
          endif
      end
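
A minimal Python sketch of getContainer, under the same kind of assumptions: `Node` is a hypothetical stand-in for a rendered element, and the layout values are made up. As in the procedure, the last tag in the depth-first listing whose area exceeds half the area of the Maximum Rectangle wins. Returning `None` when no tag qualifies is a choice made here, since the procedure leaves that case open.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical rendered element: tag name plus bounding-rectangle size."""
    tag: str
    width: float
    height: float
    children: list = field(default_factory=list)

    @property
    def area(self):
        return self.width * self.height

def depth_first(node):
    """Yield every descendant of `node` in depth-first (pre-order) order."""
    for child in node.children:
        yield child
        yield from depth_first(child)

def get_container(max_rect):
    """Last descendant, in depth-first order, covering more than half of max_rect."""
    container = None
    for tag in depth_first(max_rect):
        if tag.area > max_rect.area / 2:
            container = tag
    return container

rows = [Node("li", 700, 150) for _ in range(3)]
results = Node("ul", 700, 900, rows)
wrapper = Node("div", 740, 1100, [results, Node("nav", 700, 40)])
main = Node("main", 760, 1200, [wrapper])
print(get_container(main).tag)
```

In this layout both the wrapper and the results list exceed the threshold, and the results list is selected because it comes later in the listing.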
  17. IDENTIFICATION OF DATA REGION CONTAINING DATA RECORDS WITHIN THE CONTAINER
      To remove the irrelevant data from the container, we use a filter. The filter determines the average height of the children within the container. Children whose heights are less than the average height are identified as irrelevant and discarded.
  18. PROCEDURE FOR FILTER
      Procedure Filter
      Input: the container obtained from the previous step
      totalHeight = 0
      for each child tag within container
          totalHeight += height of the bounding rectangle of child
      end for
      averageHeight = totalHeight / number of children of container
      for each child within container
          if height of child's bounding rectangle < averageHeight then
              discard child from container
          endif
      end for
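
The Filter procedure translates to two passes over the children of the container: one to compute the average height and one to discard the children that fall below it. In the sketch below the `Child` class and the sample heights are assumptions for illustration; the empty-container guard is also an addition, since the procedure would otherwise divide by zero.

```python
from dataclasses import dataclass

@dataclass
class Child:
    """Hypothetical child of the container: a label and its rendered height."""
    label: str
    height: float

def filter_children(children):
    """Keep the children whose height is at least the average height."""
    if not children:
        return []  # guard added for the sketch; not covered by the procedure
    average_height = sum(child.height for child in children) / len(children)
    return [child for child in children if child.height >= average_height]

# Made-up container: three records, a thin separator and a short footer link.
container = [
    Child("record 1", 150),
    Child("record 2", 150),
    Child("separator", 20),
    Child("record 3", 160),
    Child("footer link", 15),
]
kept = filter_children(container)
print([child.label for child in kept])
```

With these heights the average is 99, so the separator and the footer link are discarded and the three records remain.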
  19. MDR VS E-MINE
      Here the proposed technique is evaluated and compared with MDR (Mining Data Records). The evaluation covers three aspects:
      Data Region Extraction,
      Data Record Extraction,
      Overall Time Complexity.
  20. DATA REGION EXTRACTION
      MDR depends on certain tags, such as <table> and <tbody>, to identify a data region. However, a data region can be contained in many tags, such as <table>, <tbody>, <p>, <li>, <form>, etc. In the proposed e-mine system, data region identification is independent of specific tags and forms.
  21. DATA RECORD EXTRACTION
      MDR identifies data records based on keyword search, e.g. "$". MDR not only identifies the relevant data region containing the search result records but also extracts records from all other sections of the page.
  22. OVERALL TIME COMPLEXITY
      The existing algorithm, MDR, has a time complexity of the order O(nk), where
      n is the total number of nodes and
      k is the maximum number of tags.
  23. CONCLUSION
      In this paper we proposed a new approach to extract structured data from web pages. Although several techniques exist, e-mine is a purely visual, structure-oriented method that can correctly identify the data regions. Most current algorithms fail to determine the data region correctly when it consists of only one data record. Thus e-mine overcomes the drawbacks of the existing methods and performs significantly better than the existing techniques.
  24. QUERIES?
  25. THANK YOU
