1.2M .pdf


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

1.2M .pdf

  1. 1. A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System A Thesis Proposal Presented to the Department of Computer Science Brigham Young University In Partial Fulfillment of the Requirements for the Degree Master of Science By Alan Wessman March 2003
  2. 2. I.Introduction The proposed thesis deals with the subject of data extraction from unstructured or semi- structured documents such as Web pages. Specifically, the proposed thesis is concerned with abstracting the extraction process and related heuristics for an ontology-based data- extraction system. As such, it is closely connected with the Ontos research project currently underway in the Data Extraction Group at Brigham Young University. The fundamental tasks of data extraction are to identify and extract relevant information within a semi-structured or unstructured document, and to map the identified data to a more formal structure to permit easier manipulation of the data [3,4,6,9,18,24]. Data extraction is particularly important in obtaining information from the World Wide Web, which although entirely unpredictable in the way its documents present useful data, is also the largest and most accessible repository for information. Perhaps the most common solution to the data-extraction problem is the construction and use of wrappers [1,2,3,9,12,14,19,24], which are basically templates that use clues in the document’s structure and syntax to locate useful data. The “holes” or slots in the wrapper, filled with data, are mapped to fields of a common semantic model, such as a data model in a relational database. But wrappers suffer from an important weakness: being based on the structure and syntax of the document, they become invalid when the document structure or presentation changes, even if the information contained in the document is still relevant. This is of course a major shortcoming in Web-based data extraction, since Web documents are notorious for their tendency to change suddenly and drastically. Even without the problem of evolving pages, though, wrappers are still specific to a narrow set of documents, rather than applying to all documents belonging to an application domain. Thus an application may require the ongoing construction of new wrappers as additional data sources are discovered. Roadrunner [3] is one approach that seeks to mitigate this problem by automating the process of wrapper construction. However, the Roadrunner method only works for fairly regular data sources. The BYU-Ontos approach [6] is one data-extraction solution that automatically compensates for structural variations across data sources as well as variations over time by a single data source, by taking a novel approach to the problem of data extraction. In contrast with the Roadrunner approach, BYU-Ontos also works for highly unstructured data sources. It is centered on a human-designed ontology that describes the domain of interest and provides certain general sets of patterns (termed “data frames”, from [5]) for recognizing elements of the domain. Independent of any particular document format or syntax, the recognition patterns of the ontology detect strings and symbols using lexicons and regular expressions. Recognized tokens are mapped to elements of the ontology, or are discarded using ontology constraints and a set of built-in heuristics. BYU-Ontos (hereafter “Ontos”) has been shown to be quite successful in extracting data for several ontologically narrow applications, such as used car advertisements and obituaries. Its precision and recall results are competitive with those of other approaches, and it is resilient in the face of document change and data source variety, since its rules are constructed independent of any single document or format. But Ontos in its current incarnation has several limitations. One notable problem is that several discrepancies exist between the Ontos specification and its implementation. In particular, the Ontos implementation is sensitive to the order in which object sets and 2
  3. 3. relationship sets are given in the domain ontology; such a dependency does not exist in the specification. This results in errors such as the one depicted in Figure 1 below. In this case, the extracted date matches with two object sets: Funeral Date and Marriage Date. Despite the nearby keyword indicating that the date is a Funeral Date, the system extracts it as a Marriage Date. This error is due to the one-by-one processing of object sets in the order of their appearance in the ontology. Since Marriage Date was processed first and yielded a match on the date value, it accepted the value and excluded it from inclusion in the correct Funeral Date object set. Simply changing the order of the object sets in the ontology will not solve the general problem, for it is easy to conceive of a scenario in which one record has only a funeral date and the other has only a marriage date. In such a case, the ordering will be wrong for one of the records. Figure 1: Example of errors due to object-set order dependencies. Figure 2 provides additional insight into the difficulties posed by the heuristics specification and implementation. The figure first lists the specified set of heuristics. It then gives pseudocode for the actual implementation of the heuristics. It is difficult to identify from the pseudocode any of the five specified heuristics. Hence verifying program correctness or making modifications to the heuristics involves a great deal of attention and effort, and changes to the heuristics are likely to introduce bugs into the code. 3
  4. 4. Specified Heuristics • Keyword proximity • Subsumed/overlapping constants • Functional relationships • Nonfunctional relationships • First occurrence without constraint violation Heuristics Pseudocode Method 1: FOR EACH object set IN the ontology: IF a keyword exists for the object set THEN Find the closest constant to the keyword END IF WHILE keyword exists AND constant is not found DO Find the next keyword Find the closest constant to the keyword END WHILE IF keyword doesn’t exist THEN Find the first constant matching this object set END IF IF constant exists THEN Update data structures: - add the constant to the output table - delete subsumed data lines - delete other data lines for this object set ELSE IF this object set is required THEN Abort ELSE Continue with next object set END IF END IF END FOR Method 2: FOR EACH relational table DO FOR EACH keyword in dataline list DO IF keyword corresponds to an attribute of this table THEN Find context surrounding this keyword (look backward for first preceding keyword, then look forward for next succeeding keyword) Clear this table’s template tuple Fill in object places of the template with object(s) FOR EACH constant in this context DO IF constant corresponds to an attribute in template THEN IF attribute slot is empty AND adding constant would not violate any constraints for this table THEN Fill slot with constant Mark the constant as used (delete from dataline list, remove other overlapping constants) IF this slot is a nonlexical object generator AND the corresponding nonlexical object isn’t yet generated THEN Generate nonlexical object END IF IF the tuple is now complete THEN Add it to the rows of this table Re-initialize template tuple END IF ELSE IF slot is full THEN IF slot is functionally determined by another slot THEN Replace the slot’s contents with this constant Mark the constant as used ELSE IF template tuple is sufficiently complete to write THEN Add template tuple to the table Replace the slot’s contents with this constant Mark the constant as used ELSE Ignore the constant END IF END IF ELSE Ignore the constant END IF END IF END FOR IF template tuple is sufficiently complete to write THEN Add template tuple to the table END IF END IF END FOR END FOR Figure 2: Specification and implementation of Ontos heuristics. 4
  5. 5. If these were the only limitations of Ontos, it would be a straightforward matter to define more carefully the heuristics and perform some reworking of the code. But there are a number of data-extraction tasks that Ontos simply cannot handle at all, that represent important advances and opportunities in data-extraction research. Aside from fixing broken code, there is a great need to provide Ontos with the capability to support various data-extraction techniques with such modularity and flexibility that new features and heuristics can be “plugged in” without requiring extensive redesign of the Ontos code. We list several examples of desired new Ontos functionality below. Full support of object set and data frame specialization (inheritance). Ontos currently has limited support of inheritance of object sets and data frames, but lacks such important features as overriding and hiding of inherited extraction rules. These features are needed to implement heuristics that correctly assign extracted values to object sets, since a specialized extraction rule should cause matching values to be mapped to the specializing object set. Access to and use of previously extracted data. A major area of research that has hitherto been unexplored in Ontos is learning-enabled data extraction. We would like the performance of Ontos to improve over time as more data is recognized and mapped to the ontology. This necessitates a feedback cycle wherein previously extracted values can influence decisions to be made regarding new data. It follows that access to previously extracted data is a necessary condition for learning-enabled data extraction. At present, Ontos emits extracted values in the form of SQL insert statements. To access this data, the SQL must be parsed and mapped back to the ontology. This process is inefficient. In contrast, populating the ontology directly with extracted values would permit immediate use of the values in the heuristics, and would also permit the construction of translation code to traverse the ontology structure and emit the values in whatever format is desired. Support of complex document structures. Researchers have recently investigated applying ontology-based data extraction to differing document structures, such as tabular data or HTML input forms [7,16]. These reflect an expansion of the research from the original assumption that the input documents contain a relatively flat, linear multi-record structure. Unfortunately, Ontos has had to be retooled for each different document structure. We desire to provide Ontos with the ability to recognize different documents and choose a structure parser that will break the documents down into appropriate sections for more accurate processing by Ontos. The goal is that Ontos can extract from complex documents such as a university course catalog, which is rich both in terms of data as well as structural complexity. Declarative specification of extraction plans. The extraction plan specifies the order in which a data-extraction engine processes documents and target structure elements during the extraction phase, as well as how it reconciles the heuristics results. The pseudocode for the Ontos heuristics represents an example of an extraction plan. There is doubt, however, that the Ontos implementation can be claimed to apply the optimal extraction plan for all cases. We hypothesize instead that different extraction algorithms are better for different applications. Indeed, even to measure the utility of different extraction plans requires the ability to contrast them in controlled experiments. We wish therefore to allow an ontology researcher or designer to specify a particular extraction plan declaratively for a given ontology. Declarative extraction plans facilitate experimentation and dynamic tailoring of extraction plans for various applications. 5
  6. 6. Parameterization of heuristics. The Ontos heuristics produce, at each step, a Boolean decision on membership of a value in a particular object set or relationship set. Once a value has been accepted, it is no longer available for assignment to a different set. As the obituaries example above demonstrated, this decision can sometimes be incorrect. We thus desire the heuristics to produce real-numbered confidence values for each value mapping. This will also permit research to be conducted in how to weight certain heuristics and otherwise combine confidence values to achieve a final set of Boolean decisions for proposed value mappings. This will also permit the use of probabilistic and vector-based approaches, such as a hidden Markov model (HMM) learner for classification [8,10,15,17,20,22,23]. One of the factors motivating the proposed work is the considerable effort spent to develop a hidden Markov model learner for the current Ontos implementation. In the process of attempting to integrate the learner into Ontos, it was discovered that the heuristics could not work with such a different paradigm. In particular, the Boolean decisions of the heuristics could not easily be reconciled with the probabilities output by the HMM. The HMM work helped to demonstrate how flexible the Ontos code needs to be to accommodate such novel approaches. The examples given above are a subset of the many possible ideas and techniques applicable to ontology-based data extraction that have not been achievable with the existing Ontos code. Combined with the errors and general rigidity of the current codebase, it is clear that a more flexible, understandable, modular, and powerful design for Ontos is required. The proposed thesis addresses these problems by defining an abstract object-oriented framework that permits powerful extraction plans and heuristics to be developed, configured, and tuned for an ontology-based data-extraction system. II.Thesis Statement A generalized framework of interfaces and abstract classes written in Java can decouple the value-mapping heuristics from the rest of the Ontos data-extraction code. The framework will be sufficiently flexible both to allow the current heuristics to be re- implemented under the framework and to enable new heuristics to be created and incorporated into the Ontos system without requiring significant rewrites of the non- heuristics code. III.Methods This section describes the steps involved in accomplishing the work for the proposed thesis. The work may be broken down into five major steps, which are listed below: 1. Analysis of data-extraction functionality 2. Construction of the framework 3. Implementation of modules under the framework 4. Upgrades to existing tools 5. Evaluation of the new framework We provide details in the following sections. A.Analysis of Data Extraction Functionality The first step in constructing a framework for data extraction is to determine common functionality at each level of abstraction. The abstract class DataExtractionEngine, 6
  7. 7. which encompasses any data-extraction system (not just those that are ontology-based), represents the highest level of abstraction. A survey of current data-extraction research [4,6,15,18] provides the basis for defining the functionality at the highest levels of abstraction. Examination of the research centered on Ontos [6,7,16] supplies insight into abstracting ontology-based data- extraction systems. Much of this work has already been done for the proposed thesis. As an example of the work done, the following paragraph describes the abstract functionality of data-extraction systems [25]: In general, a data-extraction engine is initialized with a corpus of source documents (often mediated by an information retrieval system), a set of extraction rules to apply to the documents, and a target data structure to which extracted values are mapped. After initialization the engine executes an extraction plan, yielding a mapping between information in the documents and elements of the target data structure. Following the extraction step, the engine performs finalization tasks such as canonicalization of the extracted data or queries against the target data structure. Figure 3 shows a UML diagram of a partial framework that supports the functionality described above for the Ontos system. Subclasses of ExtractionPlan implement the execute() method, accessing methods of the DocumentRetriever, ValueMappingHeuristic, Ontology, and DocumentStructureRecognizer classes to perform the extraction. DataExtractionEngine #extractionPlan : ExtractionPlan ExtractionPlan +setInitializationParameters() +initialize() +execute() +doExtraction() +finalize() OntosExtractionPlan Ontos #documentRetriever : DocumentRetriever #documentRecognizer : DocumentStructureRecognizer #ontology : Ontology +setExtractionPlan(in plan : OntosExtractionPlan) «uses» DocumentRetriever DocumentStructureRecognizer ValueMappingHeuristic Ontology DocumentStructureParser ObjectSet RelationshipSet Figure 3: UML diagram of partial Ontos framework. There is still some work to be done in determining at what level to abstract certain features. Most data-extraction systems make a number of assumptions regarding how they obtain documents to process, how the documents are structured, what language the document is written in, and so forth. It is therefore a significant challenge to decide how 7
  8. 8. to allow these assumptions to be made explicit when implementing the abstraction. This work will be completed for the proposed thesis. B.Construction of a Framework The main piece of work to be performed for the proposed thesis is development of the framework. The framework will be implemented in the Java language [13], which is the implementation language for the existing Ontos system and provides the object-oriented features necessary for abstraction of data-extraction functionality. A set of abstract classes and interfaces will comprise the framework. In Java, abstract classes may contain instance attributes and method implementations, while interfaces may only contain static (class) attributes and method declarations. On the other hand, a Java class may implement multiple interfaces but inherit from only one parent class. Thus, most of the framework will be contained in interfaces to avoid problems with having to inherit from multiple abstract classes. A key part of designing an abstract framework is specifying the data structures that are published by parts of the interface. For example, a domain ontology is a data structure that would presumably be passed into the initialization portion of a data-extraction engine. It is reasonable therefore to declare an Ontology abstract class or interface as part of the framework. Other data types will be defined to represent documents, extracted values, value-mapping heuristics, data frames, and so forth. For data extraction using OSM ontologies, the OSM grammar must be extended to accommodate new capabilities such as declaring a reference to an existing extraction plan. The grammar must also be implemented in a data structure with an underlying concrete representation so that ontologies may be read from and written to a file. XML has been chosen for the concrete representation of OSM ontologies, and an XML Schema defining the grammar, termed OSM-X, has been designed. Work is currently underway to wrap the XML structure within interfaces that translate XML nodes to OSM ontology elements. C.Implementing Modules under the Framework The proposed thesis will not only entail construction of the framework, but will also involve incorporation of the current Ontos code into the new framework. This will require reconstruction of the heuristics code as an ExtractionPlan subclass and individual Heuristic subclasses. Unintended features such as the object-set order dependency will be corrected to harmonize with the heuristics specification. In addition to wrapping the current functionality of Ontos into the framework, a few additional features (namely, full generalization/specialization support and access to previously extracted data) will be newly implemented under the framework as a demonstration of its capabilities. However, many of the desired new features will be left for implementation by future researchers. These may include various document structure recognizers and parsers, specialized heuristics, learning-enabled extraction plans, and so forth. 8
  9. 9. D.Upgrades to Existing Tools In addition to the core Ontos data-extraction system, a graphical ontology editor tool [11] exists for use in constructing ontologies visually rather than textually. This tool will need to be rewritten to a sufficient degree to accommodate the new framework, particularly the changes to the OSM grammar. We will perform the work required to upgrade the ontology editor, coordinating with other researchers insofar as they seek to contribute to the tool upgrade. E.Evaluation of the Framework We will demonstrate the improved performance and capability of our new framework by means of a series of experiments that measure precision and recall in several data- extraction domains Support for existing functionality. The capabilities of Ontos under the new framework must be no less accurate than its present capabilities. A series of experiments will exercise the features of both the new and old Ontos, and the results will show whether the new framework is capable of performing the same tasks with comparable (or better) accuracy. To achieve this, we will select three different application domains—obituaries, used-car advertisements, and campground listings—and a corpus of documents on which to test the old and new Ontos implementations. We will process the documents and compare the extracted data to values extracted manually by a domain expert. This comparison will provide precision and recall figures for each implementation. We will then compare the precision and recall of the two implementations to gauge the performance of Ontos under the new framework. Despite the weaknesses in the current implementation of Ontos, its performance has been quite good in certain domains. We therefore face the prospect of achieving lower precision and recall figures from the new heuristics code in certain cases, since the original code’s order dependencies may have serendipitously resulted in correct classification of extracted values. We anticipate, however, that such differences are likely to be minor and counterbalanced by the new code’s greater reliability and faithfulness to theoretical underpinnings. Also, with the ability to reorder and tune the heuristics beyond the scope of the original ontology-based extraction specification, we expect to see overall improvements in precision and recall over the current implementation. Support for new functionality. We will demonstrate that our new framework will be capable of more functionality than the current Ontos by successfully applying the new version of Ontos to problems that the old version was incapable of handling. Because the old version cannot even handle the problems, we will have no accuracy results to compare against. The first test of new functionality will focus on support of generalization/specialization semantics. We will construct an ontology that contains a significant generalization/specialization relationship between some of its object sets. The selected application domain is product descriptions for digital display devices—i.e., monitors, projectors, etc. This ontology will be applied to a set of documents in the application 9
  10. 10. domain to determine the effectiveness of the extraction. We will obtain precision and recall figures in a similar manner as that described earlier. A second test of new functionality will be centered on use of previously extracted data. We will construct an extraction plan that will treat previously extracted data as an “on- the-fly” lexicon for recognizing related values later in the document. Our application domain will be a university catalog, in which we will seek out course codes to comprise a lexicon for identifying prerequisites or cross-listings. We will again calculate precision and recall figures as described above. IV.Contribution to Computer Science The thesis will contribute the following to the field of data extraction: • An abstracted approach to data extraction, especially ontology-based data extraction • The concept of extraction plans for data-extraction systems, comparable to query execution plans in relational database systems • An object-oriented framework that permits extensive experimentation with different data-extraction techniques • A revised Ontos codebase and ontology editor to aid in data-extraction research • More powerful Ontos data extraction with at least the same accuracy as before • The capability to process applications previously beyond the ability of Ontos V.Delimitations of the Thesis The following will not be addressed by the proposed thesis: • Implementation of information retrieval modules • Implementation of document structure recognizers and parsers • Implementation of non-ontology-based data-extraction systems under the framework • Support for full declarative specification of extraction plans as part of the ontology definition • Canonicalization of extracted data VI.Thesis Outline 1. Introduction (5-10 pages) 2. Related Work (2-4 pages) 3. Scenarios in Data Extraction (15-20 pages) a. Value Ambiguity b. Data Frame Inheritance c. Learning Capability d. Complex Documents e. Tuning 4. Data Extraction Engines, Plans, and Heuristics (15-20 pages) a. General Data Extraction b. Ontologies and Data Frames c. Extraction Plans d. Value-Mapping Heuristics 5. A Framework for Extraction Plans and Heuristics (10-15 pages) a. Design Patterns and Frameworks 10
  11. 11. b. Plan-Level Classes and Interfaces c. Heuristics-Level Classes and Interfaces 6. Implementation of Heuristics and Extraction Plans (2-5 pages) 7. Experimental Results (5-8 pages) a. Comparison of Existing Functionality b. Evaluation of New Functionality 8. Conclusions and Future Work (3-5 pages) 9. Appendices a. Documentation on Framework b. OSM-X Definition VII.Thesis Schedule Literature Search and Reading September 2001 – March 2003 Design and Coding March 2003 – June 2003 Chapter 3 April 2003 – May 2003 Chapter 4 May 2003 – June 2003 Chapter 5 June 2003 – July 2003 Chapter 6 July 2003 Chapter 7 August 2003 Chapters 1,2 & 8 August 2003 – September 2003 Thesis Revision and Defense September 2003 VIII.Bibliography [1] David Buttler, Ling Liu, Calton Pu. A fully automated object extraction system for the World Wide Web. In Proceedings of the 2001 International Conference on Distributed Computing Systems, 2001. This paper reports excellent precision and recall results for a fully-automated data- extraction system for HTML documents. Its primary limitations are its requirement for the documents to be well formed and its dependence on document structure for locating tokens of interest. It is also highly reliant upon a limited set of built-in heuristics that decide how to identify token separator tags. [2] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for information extraction. Working Notes on AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, 1998. Rapier, the system described in this paper, is a relational learner of pattern matching rules. It is more general than most wrapper approaches since its patterns are based on parts of speech and lexicons of related terms, instead of specific delimiters found in a document. However, it requires text that is grammatical enough to be tagged for parts of speech. [3] Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. Roadrunner: toward automatic data extraction from large Web sites. The VLDB Journal, 2001. The authors introduce a system for fully-automated wrapper generation to extract data from HTML pages. Their system is novel in several ways: it does not rely on user interaction, it has no a priori knowledge about the page contents, and it works with two pages at a time instead of just one. 11
  12. 12. [4] Line Eikvil. Information extraction from World Wide Web: A survey. Technical Report 945, Norwegian Computing Center, 1999. A general survey of Web-based information extraction, this paper explains the history, metrics, and nature of the information extraction problem. It focuses primarily on wrapper generation and systems centered on that approach. [5] David W. Embley. Programming with data frames for everyday data items. AFIPS Conference Proceedings, 1980. The author introduces the concept of data frames, which are similar to abstract data types and classes. Data frames define value recognizers, context keywords, internal representations, operators, and so forth. This work has influenced the design and use of data frames in Ontos. [6] D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, R.D. Smith. Conceptual-model-based data extraction from multiple-record Web pages. Data & Knowledge Engineering, 31, 1999. This paper introduces the notion of ontology-based data extraction and describes the BYU-Ontos system. A domain ontology is a conceptual model for a particular subject, and may be composed of object sets, relationship sets, constraints, and various other elements. A data-extraction system can use ontologies together with lexicons and pattern-matching rules to recognize and extract portions of a document related to the domain of interest. This paper is the foundation for most of the work of the Data Extraction Group at BYU. [7] D.W. Embley, C. Tao, S.W. Liddle. Automatically extracting ontologically specified data from HTML tables with unknown structure. Proceedings of the 21st International Conference on Conceptual Modeling, Tampere, Finland, 2002. The paper describes an ontology-based approach to extracting tabular data from HTML by interpreting how values appear in HTML table structures. It represents a departure from the sequential-record assumption of earlier ontology-based data extraction work at BYU. [8] Shai Fine, Voram Singer, Naftali Tishby. The hierarchical hidden Markov model: analysis and applications. Machine Learning, 32, 1998. This paper describes hierarchical hidden Markov models and algorithms for using them. A hierarchical HMM is a recursive generalization of a classical HMM: it is an HMM with one or more hidden states replaced by “nested” HMMs, which themselves may contain other HMMs in place of some of any of their states. The hierarchical nature of the HHMMs may provide a natural fit to the relational aspect of domain ontologies. [9] Dayne Freitag. Information extraction from HTML: application of a general machine learning approach. AAAI-98, 1998. This paper presents SRV, which is a wrapper-related information extraction system. SRV is an extraction rule learner based on the FOIL algorithm. Its learning algorithm is domain- and syntax-independent, but the rules it learns are specific to the documents from which it learned them. [10] Dayne Freitag and Andrew Kachites McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999. 12
  13. 13. The authors provide a good explanation of using HMMs in information extraction. Using HMMs with certain constraints, such as manually-defined topologies, they apply a statistical smoothing technique called “shrinkage” to the model parameters in order to make the accuracy of the model more robust while requiring fewer training instances. [11] Kimball Hewitt. An integrated ontology development environment for data extraction. Master’s thesis, Brigham Young University, 2000. The thesis describes an implementation of a graphical ontology editor still in use by the Data Extraction Group for construction of ontologies and data frames. [12] Chun-Nan Hsu. Initial results on wrapping semistructured Web pages with finite-state transducers and contextual rules. AAAI-98 Workshop on AI and Information Integration, 1998. This paper introduces SoftMealy, a wrapper induction algorithm. It constructs a finite state transducer for each type of item to extract, based on the delimiters encountered during document scanning. The finite state transducers are more flexible in dealing with missing attributes and various permutations of attributes. [13] The Java tutorial. At http://java.sun.com/docs/books/tutorial/, 2003. This site explains features of the Java programming language, the primary implementation language of Ontos and the proposed framework. [14] Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper induction for information extraction. International Joint Conference on Artificial Intelligence (IJCAI), 1997. The system presented in this paper follows the wrapper generation orientation of early information extraction work. Using PAC-learning principles and syntactic analysis, wrappers are induced for specific documents. [15] T.R. Leek. Information extraction using hidden Markov models. Master’s thesis, UC San Diego, 1997. The thesis introduces the concept of using a hidden Markov model to extract factual information from a corpus of English prose. It includes a sound justification for using HMMs in information extraction, focusing on the technique’s independence from human engineering and prior knowledge, as well as the usefulness of computing a probability for class labels. [16] S.W. Liddle, D.W. Embley, D.T. Scott, S.H. Yau. Extracting data behind Web forms. Proceedings of the Workshop on Conceptual Modeling Approaches for e-Business, Tampere, Finland, 2002. The paper describes an approach to document retrieval and data extraction for the “hidden Web” (data available only via HTML form requests). The approach is an example of the need for access to previously extracted data, which in this case can be used to provide values for a form request. 13
  14. 14. [17] Andrew McCallum, Dayne Freitag, Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML 2000), 2000. The authors present an alternative to hidden Markov models that still retains many of the advantages of HMMs. Maximum entropy Markov models allow observations to be treated as arbitrary overlapping features, and more accurately represent the probability of a sequence by focusing on the conditional probability of reaching a state given an observation and the previous state, rather than the joint probability that maximizes the likelihood of the observation sequence(s) used to train the model. [18] Ion Muslea. Extraction patterns for information extraction tasks: a survey. The AAAI-99 Workshop on Machine Learning for Information Extraction, 1999. Muslea surveys the existing systems for rule-based information extraction and categorizes them based on the type of text (free, structured, semi-structured) targeted for extraction. The survey does not encompass BYU-Ontos. [19] Ion Muslea, Steve Minton, Craig Knoblock. A hierarchical approach to wrapper induction. Proceedings of the Third International Conference on Autonomous Agents (Agents ’99), 1999. The authors present an intriguing variation on wrapper induction that accommodates richly structured documents. They view a document as a collection of nested lists, and therefore construct a wrapper that is hierarchical to accommodate the nested lists. [20] L.R. Rabiner and B.H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3 (1), 1986. This article is an oft-cited overview of hidden Markov models, the three primary problems of interest associated with them, and algorithms for solving these problems. HMMs consist of a finite set of hidden states that transition to each other according to a probability distribution; a finite set of symbols that may be emitted from any hidden state according to another probability distribution; and a probability distribution that determines the initial state of the model. Efficient algorithms exist for determining the likelihood of a particular sequence of symbols given a model, or the most likely sequence of hidden states that generated a particular sequence. [21] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. National Conference on Artificial Intelligence, 1993. Riloff’s paper describes AutoSlog, which is a system that uses selective concept extraction in a one-shot machine learning approach to extract terms to build a dictionary for a given domain. This paper provides an example of earlier and alternative approaches to machine-learned information extraction. Of particular note is the set of core heuristics upon which AutoSlog relies for its capability; this dependence on heuristics resembles BYU-Ontos’ use of heuristics for its class labeler. [22] Tobias Scheffer, Christian Decomain, Stefan Wrobel. Active hidden Markov models for information extraction. In Proceedings of the International Symposium on Intelligent Data Analysis, 2001. This paper gives an algorithm for training a hidden Markov model on partially labeled data using active learning methods. The research examines how to train a model when the observation labels are not fully specified, although this paper focuses more on computing the error of estimated labels in order to present the more difficult observations interactively for a user to label. 14
  15. 15. [23] Kristie Seymore, Andrew McCallum, Ronald Rosenfeld. Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI Workshop on Machine Learning for Information Extraction, 1999. This paper focuses on learning the underlying topology of a hidden Markov model for information extraction purposes, since the optimal topology is often not known a priori. An important result is that a topology containing only one hidden state per class label may be suboptimal. The paper also proposes some techniques for learning with distantly-labeled data, which is labeled data from another domain whose labels partially overlap those of the target domain. [24] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning 34, 1999. Soderland presents WHISK, a rule-based extraction approach. WHISK learns to detect information based on delimiting strings in the document, as well as tagged grammatical elements. The system's main contribution is its ability to learn extraction rules for free, semi-structured, and structured texts. [25] Alan Wessman. Unpublished design document, 2003. This document contains working specifications for the framework proposed for the thesis. IX.Artifacts Aside from the thesis itself, the following will be produced as artifacts: • An XML Schema defining the underlying data structures for Ontos • The proposed framework, realized in Java • An extraction plan and heuristics implemented under the new framework • Revision of a graphical ontology editor and the Ontos system using the framework • Documentation on using the framework to implement extraction plans and heuristics 15
  16. 16. X.Signatures This thesis proposal by Alan Wessman is accepted in its present form by the Department of Computer Science of Brigham Young University as satisfying the thesis proposal requirement for the degree of Master of Science. ______________________________ _______________________________ Date David W. Embley, Committee Chair _______________________________ Stephen W. Liddle, Committee Member _______________________________ Thomas W. Sederberg, Committee Member _______________________________ David W. Embley, Graduate Coordinator 16