E mine by V.DINESH KUMAR KSRCT


E-MINE is a novel web mining technique used to extract only the important data from a website.

    Presentation Transcript

    • E-MINE: A NOVEL WEB MINING APPROACH Submitted By, V.DINESH KUMAR, II-MCA.
    • ABSTRACT
      • In recent years, government agencies and industrial enterprises have been using the web as a medium of publication.
      • It has become increasingly difficult to identify relevant pieces of information, since pages are cluttered with irrelevant content, such as advertisements and copyright notices, surrounding the main content.
      • Thus we propose a technique that mines the relevant data regions from a web page.
    • INTRODUCTION
      • Several attempts have been made to extract regularly structured data from web pages.
      • The main disadvantage of the existing approaches is the assumption that the relevant information of a data record is contained in the HTML code, which is not always true.
      • So, we propose a more effective method to mine the data region in the web page.
    • RELATED WORK
      • MDR (Mining Data Records) is a technique mainly used in the area of data mining.
      • It exploits the regularities in the HTML tag structure directly.
      • The MDR algorithm uses the HTML tag tree of the web page to extract data records from the page.
      • The algorithm is based on two observations:
      • (a) A group of data records is always presented in a contiguous region of the web page and is formatted using similar HTML tags. Such a region is called a Data Region.
      • (b) The nested structure of the HTML tags in a web page usually forms a tag tree, and a set of similar data records is formed by some child sub-trees of the same parent node.
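
Observation (b) can be illustrated with a small sketch. This is not the MDR algorithm itself, whose comparison is more tolerant than the exact-match test used here; it only shows the idea of looking for adjacent child sub-trees of one parent that share the same tag structure. The `Node` class and the functions `signature` and `similar_runs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name plus its child nodes."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def signature(node: Node) -> str:
    """Serialise the tag structure of a sub-tree, e.g. 'tr(td,td)'."""
    if not node.children:
        return node.tag
    inner = ",".join(signature(child) for child in node.children)
    return f"{node.tag}({inner})"


def similar_runs(parent: Node, min_len: int = 2) -> list[list[Node]]:
    """Return runs of adjacent children of `parent` with identical signatures.

    Assumption: two sub-trees count as "similar" only when their tag
    structure matches exactly. This is a simplification for illustration.
    """
    runs: list[list[Node]] = []
    current: list[Node] = []
    for child in parent.children:
        if current and signature(child) == signature(current[-1]):
            current.append(child)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [child]
    if len(current) >= min_len:
        runs.append(current)
    return runs
```

For a `table` node whose children are one `caption` followed by three identically structured `tr` rows, `similar_runs` returns a single run holding the three rows, which would be the candidate data region under this simplified test.
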
    • PROPOSED TECHNIQUE
      • The proposed technique uses visual information, which helps the system in three ways:
      • a) It enables the system to identify gaps that separate records, which helps to segment data records correctly.
      • b) The visual information also reflects the hierarchical structure of the tags.
      • c) By observing a web page, one can see that the relevant data region occupies the major central part of the page.
    • SYSTEM MODEL OF THE E-MINE TECHNIQUE
      • Diagram: HTML source of a web page → Largest Rectangle Identifier → Container Identifier → Filter → Relevant Data Region
      • The system model consists of three main components:
        • Largest Rectangle Identifier,
        • Container Identifier and
        • Filter.
        • The output of each component is the input of the next component.
      • The e-mine technique is based on three observations:
        • A group of data records is typically presented in a contiguous region of a page.
        • The area covered by the rectangle that bounds the data region is larger than the area covered by the rectangles bounding other regions, e.g. advertisements and links.
        • The height of an irrelevant data record within a collection of data records is less than the average height of the relevant data records within that region.
    • ALGORITHM e-Mine
      • INPUT : HTML source of web-page.
      • STEP 1: Determine the height and width of all the bounding rectangles in the HTML document.
      • STEP 2: Calculate the areas of all the bounding rectangles.
      • STEP 3: Identify the Maximum Rectangle from all the bounding rectangles.
      • STEP 4: Identify the container within the Maximum Rectangle obtained from step 3.
      • STEP 5: Identify the Data Region in the container obtained from step 4.
      • STEP 6: Filter the Data Region obtained after step 5 to remove some more irrelevant data.
    • HOW THE ALGORITHM WORKS
      • Determining the height and width of all bounding rectangles.
      • Identification of the largest rectangle.
      • Identification of the container within the largest rectangle.
      • Identification of the data region containing data records within the container.
    • DETERMINING HEIGHT AND WIDTH OF ALL BOUNDING RECTANGLES
      • In the first step of the proposed technique, we determine the dimensions of all the bounding rectangles in the web page.
      • If the dimensions are not specified in the HTML source, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used to obtain them.
      • The parsing and rendering engine of the web browser gives us the co-ordinates of a bounding rectangle.
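
As a minimal sketch of how width, height and area follow from the coordinates, assume the rendering engine reports each bounding rectangle as left, top, right and bottom pixel coordinates. The `Rect` class below is a hypothetical helper for illustration and is not part of the MSHTML API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding rectangle in page pixel coordinates (assumed layout:
    left/top is the upper-left corner, right/bottom the lower-right)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        # Clamp at zero so a degenerate rectangle never has negative size.
        return max(0.0, self.right - self.left)

    @property
    def height(self) -> float:
        return max(0.0, self.bottom - self.top)

    @property
    def area(self) -> float:
        return self.width * self.height
```

For example, `Rect(10, 20, 310, 220)` has a width of 300, a height of 200 and an area of 60000 square pixels.
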
    • IDENTIFICATION OF THE LARGEST RECTANGLE
      • Based on the height and width of the bounding rectangles obtained in the previous step, we determine the area of each bounding rectangle.
      • Among these rectangles, we determine the largest one.
      • The reason for doing this is that the largest bounding rectangle will always contain the most relevant data in the web page.
    • PROCEDURE FOR IDENTIFICATION OF LARGEST RECTANGLE
      • Procedure getMaxRect
      • Input: <body> of the HTML source
      • Maximum Rectangle = none (area 0)
      • for each child of the <body> tag
      • begin
      • Find the coordinates of the bounding rectangle of the child
      • if the area of the bounding rectangle > area of Maximum Rectangle
      • then Maximum Rectangle = child
      • endif
      • end
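
The getMaxRect procedure above could be written in Python roughly as follows. This is a sketch under assumptions: the `Box` class is a hypothetical stand-in for a rendered element whose width and height have already been obtained from the browser engine, and `get_max_rect` returns `None` for an empty `<body>`, a case the slides do not cover.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Box:
    """Hypothetical rendered element: tag name, size in pixels, children."""
    tag: str
    width: float
    height: float
    children: list["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def get_max_rect(body: Box) -> Optional[Box]:
    """Return the direct child of `body` whose bounding rectangle is largest.

    Returns None when `body` has no children (an assumption; the original
    pseudocode does not say what happens in that case).
    """
    max_rect: Optional[Box] = None
    for child in body.children:
        if max_rect is None or child.area > max_rect.area:
            max_rect = child
    return max_rect
```

With a body containing a header of 800x100, a main block of 600x900 and an advertisement column of 200x900, the main block is selected because its area (540000) exceeds the other two (80000 and 180000).
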
    • IDENTIFICATION OF THE CONTAINER WITHIN THE LARGEST RECTANGLE
      • Once we have obtained the largest rectangle, we form a set of all the bounding rectangles inside it whose area is more than half the area of the largest rectangle.
      • The reason is that the most important data of a web page must occupy a significant portion of the page.
      • We then determine the bounding rectangle having the largest area in the set, because only the largest rectangle will contain the data records.
    • PROCEDURE FOR IDENTIFICATION OF CONTAINER WITHIN THE LARGEST RECTANGLE
      • Procedure getContainer
      • Input: The Largest Rectangle out of all Bounding Rectangles.
      • List_of_Children = depth first listing of all the children of the tag associated with Maximum Rectangle.
      • for each tag in List_of_Children
      • begin
      • if area of bounding rectangle of a tag > half the area of Maximum Rectangle
      • then container = tag
      • endif
      • end
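
A minimal Python sketch of the getContainer procedure, following the pseudocode literally: it walks the descendants of the Maximum Rectangle depth first and keeps the last tag whose area is more than half of the Maximum Rectangle's area, which for nested rectangles is the innermost qualifying one. The `Box` class and `depth_first` helper are hypothetical, and falling back to the Maximum Rectangle itself when no descendant qualifies is an assumption, since the pseudocode leaves that case undefined.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Box:
    """Hypothetical rendered element: tag name, size in pixels, children."""
    tag: str
    width: float
    height: float
    children: list["Box"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def depth_first(box: Box) -> Iterator[Box]:
    """Yield every descendant of `box` in depth-first (pre-order) order."""
    for child in box.children:
        yield child
        yield from depth_first(child)


def get_container(max_rect: Box) -> Box:
    """Return the last descendant covering more than half of `max_rect`."""
    container = max_rect  # assumption: fall back to the rectangle itself
    for tag in depth_first(max_rect):
        if tag.area > max_rect.area / 2:
            container = tag
    return container
```

For a 600x900 main block holding a 100x900 sidebar and a 500x800 results block with five 500x150 records, the results block is chosen: its area (400000) exceeds half of 540000, while the sidebar and the individual records do not.
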
    • IDENTIFICATION OF DATA REGION CONTAINING DATA RECORDS WITHIN THE CONTAINER
      • To remove the irrelevant data from the container we use a filter.
      • The filter determines the average height of the data records within the container.
      • Records whose height is less than the average height are identified as irrelevant and discarded.
    • PROCEDURE FOR FILTER
      • Procedure Filter
      • Input: The container obtained from the previous step.
      • totalHeight = 0
      • for each child tag within container
      • totalHeight += height of the bounding rectangle of child
      • end for
      • averageHeight = totalHeight / number of children of container
      • for each child within container
      • if height of child’s bounding rectangle < averageHeight
      • then Discard child from container
      • endif
      • end for
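
The Filter procedure can be sketched in Python as below. The `Child` class and the function name `filter_children` are hypothetical; the sketch assumes the height of each child's bounding rectangle is already known, returns a new list instead of deleting from the container while iterating over it, and returns an empty list for an empty container to avoid dividing by zero.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Hypothetical child of the container: tag name and rendered height."""
    tag: str
    height: float


def filter_children(children: list[Child]) -> list[Child]:
    """Keep only the children at least as tall as the average height."""
    if not children:
        return []  # assumption: nothing to filter in an empty container
    total_height = sum(child.height for child in children)
    average_height = total_height / len(children)
    # Strict '<' discards, as in the pseudocode, so a child whose height
    # equals the average is kept.
    return [child for child in children if child.height >= average_height]
```

With child heights of 150, 150, 20, 150 and 30 pixels, the average is 100, so the two short children (for example a separator and a footer link) are discarded and the three 150-pixel records remain.
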
    • MDR VS E-MINE
      • Here the proposed technique is evaluated and compared with MDR (Mining Data Records). The evaluation covers three aspects:
        • Data Region Extraction,
        • Data Record Extraction,
        • Overall Time Complexity.
    • DATA REGION EXTRACTION
      • MDR depends on certain tags, such as <table> and <tbody>, for identifying the data region.
      • A data region can be contained in various tags, such as <table>, <tbody>, <p>, <li>, <form>, etc.
      • In the proposed e-mine system, data region identification is independent of specific tags and forms.
    • DATA RECORD EXTRACTION
      • MDR identifies data records based on keyword search, e.g. "$".
      • MDR not only identifies the relevant data region containing the search result records but also extracts records from all other sections of the page.
    • OVERALL TIME COMPLEXITY
      • The existing algorithm MDR has a complexity of the order O(nk), where
      • n is the total number of nodes,
      • k is the maximum number of tags.
    • CONCLUSION
      • In this paper we proposed a new approach to extract structured data from web pages.
      • Although there are several techniques, e-mine is a purely visual-structure-oriented method that can correctly identify the data regions.
      • Most of the current algorithms fail to correctly determine the data region when the data region consists of only one data record.
      • Thus e-mine overcomes the drawbacks of the existing methods and performs significantly better than the existing techniques.
      • QUERIES???
      • THANK YOU…