DEByE─Data Extraction By Example

Alberto H. F. Laender, Berthier Ribeiro-Neto and Altigran S. Da Silva

2008/12/30 regular meeting

Presentation Transcript

  • DEByE: Data Extraction By Example. Alberto H. F. Laender, Berthier Ribeiro-Neto and Altigran S. Da Silva. Data & Knowledge Engineering, 2002
  • Abstract
    • Data Extraction By Example (DEByE): an approach to extracting data from Web sources, based on a small set of examples specified by the user.
    • The examples provided by the user are then used to generate patterns which allow extracting data from new documents.
  • Outline
    • Introduction
    • The DEByE approach
    • Data Extraction
    • The DEByE tool
    • Experimental results
    • Conclusion
  • Introduction(1/3)
    • The spread of modern digital libraries and the popularization of the Web have made a huge volume of textual information available to a very large audience.
    • Finding the desired information in the large text databases on the Web is not a trivial task.
      • One main reason: users might be interested in semi-structured data that is not recognized by traditional Web interfaces and search engines.
    • Problem: how to gain access to the semi-structured data present in Web pages.
  • Introduction(2/3)
    • The structure is said to be implicit because it has not been declared explicitly as done when we specify the schema of a database.
    • The structure might vary from one object to another, so the data is said to be semi-structured.
    • The problem of extracting data from Web sources can be stated as follows: given a Web page S containing a set of implicit objects, determine a mapping W that populates a data repository R with the objects in S.
    • An important point concerning the generation of such a mapping is defining what an object is.
    • The authors present a new approach, called DEByE, for generating wrappers that implement the mapping W.
    • Motivation: to let the user or database designer specify a target structure for the data to be extracted.
  • Introduction(3/3)
    • Advantages:
      • Shields the user from many of the details.
      • The user can map the data into a structure of his preference.
      • The steps of the extraction procedure are simpler and more intuitive.
    • The DEByE tool represents the structure of the data through nested tables.
    • The examples provided by the user are used to generate extraction patterns, which allow extracting new data from new Web pages.
  • The DEByE approach(1/4)
    • Text or pages which present such a type of inherent structure and whose data is restricted to a specific domain are said to be data rich and narrow in ontological breadth.
    • In the DEByE approach, the target structure is specified indirectly, through an example object.
    • This is accomplished by cutting pieces of data from a sample page and inserting these pieces into the nested table.
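As a rough illustration of the nested-table idea, the sketch below models an example object as a Python dictionary whose cells are either atomic strings or nested tables (lists of rows). The attribute names (Author, Book, Title, Price), the sample values, and both helper functions are hypothetical; they are not taken from the paper's figures or from the DEByE tool.

```python
# Hypothetical example object: an author with a nested table of books.
# Atomic cells hold strings cut from the sample page; a nested table is a list of rows.
example_object = {
    "Author": "Jane Doe",
    "Book": [
        {"Title": "First Book", "Price": "10.95"},
        {"Title": "Second Book", "Price": "7.50"},
    ],
}


def nesting_depth(value):
    """Return how many levels of tables are nested inside `value`."""
    if isinstance(value, list):  # a nested table: a list of rows
        return 1 + max((nesting_depth(row) for row in value), default=0)
    if isinstance(value, dict):  # a row: attribute name -> cell
        return max((nesting_depth(cell) for cell in value.values()), default=0)
    return 0  # an atomic value


def atomic_pairs(value, path=()):
    """Yield (attribute path, atomic value) pairs found anywhere in the object."""
    if isinstance(value, dict):
        for name, cell in value.items():
            yield from atomic_pairs(cell, path + (name,))
    elif isinstance(value, list):
        for row in value:
            yield from atomic_pairs(row, path)
    else:
        yield path, value
```

Each pair yielded by `atomic_pairs` corresponds to one piece of data the user would cut from the sample page and paste into a cell.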
  • The DEByE approach(2/4)
    • The nested tables are not as powerful as XML or OEM.
    • A critical issue: how to automatically extract new data from new Web pages to populate a given nested table.
  • The DEByE approach(3/4)
    • Two modules: Graphical User Interface (GUI) and Extractor
      • The GUI provides the user with a Java interface that he uses to assemble the example objects.
      • The Extractor takes the extraction patterns generated from these examples and applies them to new pages from the target Web source. (The set of extracted objects is coded in an XML-based format.)
  • The DEByE approach(4/4)
  • Data Extraction─Notation and terminology
    • In DEByE, the example objects are used to generate Object Extraction Patterns (oe-patterns). These oe-patterns combine structural and textual information that is used to recognize and extract new objects.
    • The notion of an object type is used to represent complex objects. An object of a specific object type is called an instance of that type.
  • Data Extraction─Notation and terminology
    • Instances of a variant type (v-type) are objects of any type from a list of types called the alternatives of the variant type.
  • Data Extraction─Object Extraction Patterns
    • The oe-patterns describe the hierarchical structure of the example objects.
    • The nodes at the bottom of the hierarchy are used to match attribute-value pairs (AVPs) and are called attribute-value pair patterns (avp-patterns).
    • To each such AVP, we associate a local syntactic context that can be derived from the strings surrounding the AVP value in the text.
    • Use the concept of a passage (or window) and techniques from information retrieval.
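One way to picture a local syntactic context is as a small window of tokens on either side of the example value. The sketch below is a minimal illustration under that assumption, not the pattern-generation procedure of the paper: the function name `build_avp_pattern`, the `window` parameter, and the sample markup are all hypothetical.

```python
import re


def build_avp_pattern(page, value, window=1):
    """Build a regex from `window` whitespace-delimited tokens on each side of `value`."""
    start = page.index(value)  # raises ValueError if the example value is absent
    end = start + len(value)
    before = page[:start].split()[-window:]
    after = page[end:].split()[:window]
    if not before or not after:
        raise ValueError("the example value needs context on both sides")
    prefix = r"\s+".join(re.escape(token) for token in before)
    suffix = r"\s+".join(re.escape(token) for token in after)
    # The value itself is generalised to "anything up to the suffix" (non-greedy).
    return re.compile(prefix + r"\s*(.+?)\s*" + suffix)


# Hypothetical sample page and a new page from the same source.
sample = "Title: <b>Data on the Web</b> Price: <i>10.95</i>"
new_page = "Title: <b>Foo</b> Price: <i>23.00</i>"

narrow = build_avp_pattern(sample, "10.95", window=1)
print(narrow.findall(new_page))  # ['23.00']

wide = build_avp_pattern(sample, "10.95", window=3)
print(wide.findall(new_page))    # [] because the context now includes title words
```

The second call shows the trade-off in choosing the window: a wider passage captures text specific to the example and stops matching new objects.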
  • Data Extraction ─Object Extraction Patterns
    • The key problem is that this pattern includes too much information about the local context in which the value 10.95 appeared.
  • Data Extraction ─Object Extraction Patterns
    • Oe-patterns are essentially trees containing information on the structure of the objects and on their associated AVPs.
    • The sub-trees of an oe-pattern are themselves oe-patterns, modeling the structure of component objects.
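The tree view of an oe-pattern can be sketched with a small recursive data structure. This is an assumed representation for illustration only: the `OEPattern` class, its field names, and the regular expressions are hypothetical and do not reflect the internal format used by DEByE.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OEPattern:
    """Hypothetical oe-pattern node: internal nodes carry structure, leaves carry an avp-pattern."""
    name: str
    avp_regex: Optional[re.Pattern] = None  # set only on leaf nodes
    children: List["OEPattern"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List["OEPattern"]:
        """Collect the avp-pattern leaves of this (sub-)tree, left to right."""
        if self.is_leaf():
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


book = OEPattern("Book", children=[
    OEPattern("Title", re.compile(r"<b>(.+?)</b>")),
    OEPattern("Price", re.compile(r"<i>(.+?)</i>")),
])
author = OEPattern("Author", children=[
    OEPattern("Name", re.compile(r"<h1>(.+?)</h1>")),
    book,  # a sub-tree that is itself a complete oe-pattern
])

print([leaf.name for leaf in author.leaves()])  # ['Name', 'Title', 'Price']
```

Because `book` is an ordinary `OEPattern`, it can be used on its own or as a component of `author`, which mirrors the statement that sub-trees are themselves oe-patterns.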
  • Data Extraction─Extraction strategies
    • Top-down extraction strategy:
    • The algorithm assembles an oe-pattern and uses this pattern to identify new objects in new pages.
    • This procedure is repeated if more than one example object is provided.
    • This top-down extraction strategy works well with pages that are well structured.
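A simplified way to think about the top-down strategy is as a single pattern for the whole object, applied in one pass. The sketch below assumes the whole-object pattern can be written as one regular expression built from two hypothetical avp-patterns; the markup, the names `TITLE`, `PRICE`, `BOOK`, and `extract_top_down` are illustrative and not from the paper.

```python
import re

# Hypothetical avp-patterns for the two components of a Book object.
TITLE = r"<b>(?P<Title>[^<]+)</b>"
PRICE = r"<i>(?P<Price>[^<]+)</i>"

# Top-down: one pattern for the whole object, components in the example's order.
BOOK = re.compile(TITLE + r"\s*" + PRICE)


def extract_top_down(page):
    """Return one dict per complete Book object matched by the whole-object pattern."""
    return [match.groupdict() for match in BOOK.finditer(page)]


well_structured = "<b>Data on the Web</b> <i>10.95</i> <b>Modern IR</b> <i>55.00</i>"
print(extract_top_down(well_structured))
# [{'Title': 'Data on the Web', 'Price': '10.95'}, {'Title': 'Modern IR', 'Price': '55.00'}]

irregular = "<b>A</b> <b>B</b> <i>5.00</i>"  # the first book has no price
print(extract_top_down(irregular))
# [{'Title': 'B', 'Price': '5.00'}]
```

On the irregular page the first book is skipped entirely, which is the kind of case where a whole-object pattern depends on the page being well structured.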
  • Data Extraction ─Extraction strategies
    • Bottom-up extraction strategy
    • The main feature of the bottom-up extraction strategy is that it recognizes and extracts atomic components prior to the recognition of the object itself.
    • The extracted AVPs are then used to assemble the object through a bottom-up composition operation.
  • Data Extraction─Extraction strategies
    • The bottom-up extraction strategy is considerably more complex than the top-down strategy.
    • Contiguous pairs of Title and Price values are combined to form Book instances. Each of these instances is labeled with the smaller of its components' position values (also called the lowest component).
  • Data Extraction─Extraction strategies
    • The assembling phase is based on two fundamental assumptions:
        • AVPs can be correctly identified and extracted from a text (page).
        • The presence of any component of an instance indicates the existence of such an instance.
    • In DEByE, many of the problems caused by imperfect avp-patterns can be alleviated by the features of the interface.
    • Advantages: ease of use, quick prototyping, and coverage of a variety of data sources with variations in structure.
    • The bottom-up strategy is more flexible than the top-down strategy because it assembles complex objects through a composition of simpler object components.
    • It is suitable for cases where missing components or components out of order are expected.
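The assembling phase for the Title/Price example can be sketched as a greedy grouping of already-extracted AVPs by page position. This is a deliberately simple heuristic written for illustration, assuming that a repeated attribute marks the start of a new instance; it is not the composition algorithm of the paper, and the type aliases, function name, and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

AVP = Tuple[int, str, str]             # (position in the page, attribute, value)
Instance = Tuple[int, Dict[str, str]]  # (label, components)


def assemble_bottom_up(avps: List[AVP]) -> List[Instance]:
    """Group position-adjacent AVPs into instances, starting a new one when an attribute repeats."""
    instances: List[Instance] = []
    label = -1
    components: Dict[str, str] = {}
    for position, attribute, value in sorted(avps):
        if attribute in components:  # attribute seen already: close the current instance
            instances.append((label, components))
            components = {}
        if not components:
            label = position  # label = smallest position among the components
        components[attribute] = value
    if components:
        instances.append((label, components))
    return instances


# Hypothetical AVPs, given out of order; the book at position 75 has no price.
extracted = [
    (40, "Price", "10.95"),
    (12, "Title", "Data on the Web"),
    (75, "Title", "Untitled Draft"),
    (90, "Title", "Modern IR"),
    (120, "Price", "55.00"),
]
for instance in assemble_bottom_up(extracted):
    print(instance)
```

The instance labeled 75 is still produced with only a Title, in line with the assumption that the presence of any component indicates the existence of an instance.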
  • The DEByE tool
  • Column operation
  • Experimental results(1/3)
    • Table 1: the percentage figures for the number of objects retrieved are relative to the total number of objects identified manually in the source pages.
    • The number of examples used in the extraction is determined by trial and error.
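The percentage figures described for Table 1 amount to a simple ratio. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and the counts in the usage line are made up for illustration, not results from the paper.

```python
def percent_retrieved(retrieved: int, identified_manually: int) -> float:
    """Objects retrieved as a percentage of the objects identified manually in the source pages."""
    if identified_manually <= 0:
        raise ValueError("the manual count must be positive")
    return 100.0 * retrieved / identified_manually


# Made-up counts for illustration only; they are not results from the paper.
print(percent_retrieved(47, 50))  # 94.0
```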
  • Experimental results(2/3)
    • RISE (Repository of Online Information Sources Used in Information Extraction Tasks)
    • Purpose: provide a preliminary comparison of DEByE with three other systems.
  • Experimental results(3/3)
    • The results do not allow a full and direct comparison between DEByE and the three other systems.
    • In DEByE, there is no knowledge base or set of heuristics to guide the extraction procedure.
    • These preliminary results with the RISE repository suggest that DEByE is as effective as the known alternatives based on wrapper induction.
  • Comparison and related work
    • Data model
        • OEM
        • UnQL query language
    • Wrapper induction
    • Machine Learning
        • Do not deal with missing or out of order components of the objects.
        • The solution provided is computationally intractable and has not been implemented in the WIEN system.
    • NLP techniques
        • Rapier
        • SRV
    • Ontology-based extraction approaches
        • Rely mainly on the expected contents of the pages, according to what was anticipated by the pre-specified ontology.
    • NoDoSE: the way examples are provided by the user to generate the extraction patterns.
    • XWRAP: the explicit use of the HTML syntax and structure.
  • Conclusions
    • An example-based approach to automatically extracting semi-structured data.
        • User specifies examples according to a structure of his liking.
        • The bottom-up procedure can take advantage of the user-provided examples to recognize and extract new data with great efficacy.
    • Convenience in the specification of the examples can be achieved through a user interface that adopts the nested table as its fundamental paradigm.
    • The bottom-up extraction procedure is quite effective.
    • A few examples are enough to allow the recognition and extraction of almost all of the objects present in the Web sources considered.