This presentation covers work performed as a collaboration between IBM Research and Stanford; one of the participants is now at the University of Texas at El Paso. The authors greatly appreciate the nomination of this paper for a best paper award.
The context of this work is relatively common. There is a lot of important information out there that is not structured. We want to extract that information, combine it with formal knowledge, and reason about it. In this talk we are focusing on coherent explanations of end-to-end systems that perform these steps.
For example, a user may make some request for information and get some result. In some cases, the user may be satisfied with that result as it is. However, in other cases, the user may want to know why the answer should be believed. A traditional solution to that problem is to provide some sort of logical proof that shows how facts and axioms combine to establish the result. However, in some cases the user will want to drill down even further. The user may want to know where the facts and axioms came from. Some may be directly asserted in some hand-coded knowledge base, but others may have been automatically extracted from documents. The user may wish to find out what text the fact was derived from, how that text was annotated, and even which components were responsible for each part of the extraction.
One part of the background of this work is UIMA. UIMA is an architecture for analyzing unstructured information such as text or video. The architecture is undergoing standardization through OASIS. A reference implementation of UIMA is available as open source. UIMA provides shared programming interfaces and data structures for analysis; this makes it possible to develop generic tools that are not specific to a particular analysis component because they operate at the level of the structures defined by the architecture. For example, it is possible to record provenance for analysis without having to instrument individual components by developing the recording mechanisms at the level of the architecture and framework.
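To make that last point concrete, here is a minimal Python sketch of recording provenance at the framework level. It does not use the UIMA API: `SharedAnalysis`, `ProvenanceRecord`, `run_with_provenance`, and the two toy annotators are hypothetical names invented for illustration. The only idea it is meant to show is that when all components write to one shared structure, the framework can log what each component added without any change to the components themselves.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedAnalysis:
    """Hypothetical stand-in for a shared analysis structure: text plus annotations."""
    text: str
    annotations: List[Dict] = field(default_factory=list)


@dataclass
class ProvenanceRecord:
    component: str       # name of the component that ran
    added: List[Dict]    # annotations that component added


def run_with_provenance(
    analysis: SharedAnalysis,
    components: List[Tuple[str, Callable[[SharedAnalysis], None]]],
) -> List[ProvenanceRecord]:
    """Run each component in order and record what it added to the shared structure."""
    log = []
    for name, component in components:
        before = len(analysis.annotations)
        component(analysis)  # the component itself is not instrumented
        log.append(ProvenanceRecord(name, analysis.annotations[before:]))
    return log


def annotate_phrase(analysis: SharedAnalysis, phrase: str, label: str) -> None:
    """Toy helper: label the first occurrence of a phrase, if present."""
    begin = analysis.text.find(phrase)
    if begin >= 0:
        analysis.annotations.append(
            {"type": label, "begin": begin, "end": begin + len(phrase)}
        )


# Two toy components that know nothing about provenance.
def entity_annotator(analysis: SharedAnalysis) -> None:
    annotate_phrase(analysis, "Major Julian Allen", "Person")


def relation_annotator(analysis: SharedAnalysis) -> None:
    annotate_phrase(analysis, "director of", "managerOf")


analysis = SharedAnalysis(
    "Major Julian Allen, Ph.D., director of the Automated System Project"
)
log = run_with_provenance(
    analysis,
    [("EntityAnnotator1", entity_annotator),
     ("OrganizationalRelationAnnotator", relation_annotator)],
)
for record in log:
    print(record.component, record.added)
```

The recording logic lives entirely in `run_with_provenance`, so adding a third component would require no new provenance code.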
Another part of the background of this work is Inference Web. Inference Web provides infrastructure for storing and browsing provenance. It encodes process descriptions as graphs of inferences. It has been applied to a variety of different technologies that naturally lend themselves to a formal inference perspective. In this work we are using Inference Web to record provenance for knowledge extraction. We show that it is possible to view extraction as a form of inference.
Specifically, we have identified nine types of extraction inferences. Six of these involve the analysis of the unstructured sources and three involve integrating the analyses into a target ontology. Here we show two of the inference types. Entity Recognition involves labeling a span of text with an entity type such as person. Relation Argument Identification involves connecting text labeled as an entity to text labeled as a relationship via a role such as “subject.”
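The two inference types just described can be sketched as small data structures. This is an illustrative Python sketch only: the `Span` and `ArgumentLink` classes, the helper functions, and the choice of "director of" as the relation span are assumptions made for the example, not the representation used in the actual system.

```python
from dataclasses import dataclass

TEXT = "Major Julian Allen, Ph.D., director of the Automated System Project."


@dataclass(frozen=True)
class Span:
    """A labeled region of text (hypothetical structure)."""
    begin: int
    end: int
    label: str

    def covered(self, text: str) -> str:
        return text[self.begin:self.end]


@dataclass(frozen=True)
class ArgumentLink:
    """Connects an entity span to a relation span via a role."""
    relation: Span
    argument: Span
    role: str


def label_span(text: str, phrase: str, label: str) -> Span:
    """Label the first occurrence of `phrase`; raises ValueError if absent."""
    begin = text.index(phrase)
    return Span(begin, begin + len(phrase), label)


def identify_argument(relation: Span, entity: Span, role: str) -> ArgumentLink:
    """Relation Argument Identification: attach an entity to a relation in a role."""
    return ArgumentLink(relation, entity, role)


# Entity Recognition: label a span of text with an entity type.
person = label_span(TEXT, "Major Julian Allen", "Person")

# Assumed relation span for illustration.
relation = label_span(TEXT, "director of", "managerOf")

# Relation Argument Identification: the Person span is the subject of managerOf.
link = identify_argument(relation, person, "subject")

print(person.covered(TEXT), "->", person.label)
print(link.argument.covered(TEXT), "is the", link.role, "of", link.relation.label)
```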
Let’s revisit our motivating example, looking more closely at how the result was produced. The end-to-end system began with some text and some assertions in a knowledge base. Analysis of text begins by labeling spans of text with entity types and relation types. Given those labels, it is possible to assign arguments to relation annotations and to perform coreference over entities. All that information in combination allows us to conclude a formal logical assertion. That assertion can be combined with other assertions to draw a conclusion via theorem proving. I would like to emphasize that this trace spans two distinct kinds of technology: extraction and inference. We can look at these as two distinct modules, but the provenance shown here has a consistent form throughout the end-to-end system.
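The uniform shape of that trace can be sketched in a few lines of Python. This is a simplified illustration, not the system's implementation: the `Step` class and the `apply_transitivity` and `explain` functions are hypothetical, and the several extraction steps are collapsed into a single step labeled "extraction" for brevity. The assertions themselves are the ones from the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One node in the trace: a conclusion plus how it was reached."""
    conclusion: Tuple[str, ...]
    rule: str
    source: str
    antecedents: Tuple["Step", ...] = ()


def apply_transitivity(first: Step, second: Step, declaration: Step, engine: str) -> Step:
    """Derive (p a c) from (p a b), (p b c), and (transitiveProperty p)."""
    predicate, a, b = first.conclusion
    predicate2, b2, c = second.conclusion
    if declaration.conclusion != ("transitiveProperty", predicate):
        raise ValueError("predicate is not declared transitive")
    if predicate != predicate2 or b != b2:
        raise ValueError("antecedents do not chain")
    return Step((predicate, a, c), "Transitive Property Inference", engine,
                (first, second, declaration))


def explain(step: Step, depth: int = 0) -> List[str]:
    """Render the trace as indented lines, one per step, same form throughout."""
    line = "  " * depth + "(" + " ".join(step.conclusion) + ")"
    lines = [f"{line}  <- {step.rule} [{step.source}]"]
    for antecedent in step.antecedents:
        lines.extend(explain(antecedent, depth + 1))
    return lines


# Extraction side (collapsed into one step for this sketch).
text = Step(("Major Julian Allen, Ph.D., director of the Automated System Project",),
            "Direct assertion", "pressrelease/1107628109.html")
extracted = Step(("managerOf", "MJAllen1", "MASProject1"),
                 "Relation Identification", "extraction", (text,))

# Knowledge-base side.
kb_fact = Step(("managerOf", "MASProject1", "MissDataInfrastructure1"),
               "Direct assertion", "KB1.owl")
declaration = Step(("transitiveProperty", "managerOf"), "Direct assertion", "KB1.owl")

# Theorem-proving side.
answer = apply_transitivity(extracted, kb_fact, declaration, "JTP")
print("\n".join(explain(answer)))
```

Because extracted and hand-coded assertions are both `Step` objects, `explain` walks from the final answer down to the source text without caring which kind of technology produced each node.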
This is one of the graphical interfaces that Inference Web provides for browsing provenance. Steps in the process can be viewed a level at a time...
... or they can be expanded out to see a more complete view. The interface is highly interactive. For example, a user can click a button on each node to see a description of the component that performed the inference.
This is an example of the OWL-based representation that Inference Web is based on. The inference engine responsible for this step in the process was IBM’s statistical ACE annotator. The step had three antecedents, which are identified by URIs, so they could potentially be distributed across different locations. The inference rule that was used in this step is Relation Identification. The conclusion of this step is that entity 184 is the manager of entity 199. The language used to encode that conclusion is KIF.
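As a rough illustration of the information such a step carries, here is a Python sketch that holds the same fields in a plain record and serializes it to JSON. It is not the OWL-based encoding itself: the `InferenceStep` class, its field names, the `example.org` URIs, and the identifiers `entity184` and `entity199` are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


def to_kif(predicate: str, *arguments: str) -> str:
    """Render a simple ground assertion in KIF-style prefix notation."""
    return "(" + " ".join([predicate, *arguments]) + ")"


@dataclass
class InferenceStep:
    """Hypothetical record of one step: who inferred what, from what, by which rule."""
    engine: str
    rule: str
    antecedents: List[str]   # URIs, so antecedents may live in different locations
    conclusion: str
    language: str


step = InferenceStep(
    engine="IBM statistical ACE annotator",
    rule="Relation Identification",
    antecedents=[
        # Placeholder URIs; the real ones are not given in the talk.
        "http://example.org/trace/antecedent1",
        "http://example.org/trace/antecedent2",
        "http://example.org/trace/antecedent3",
    ],
    conclusion=to_kif("managerOf", "entity184", "entity199"),
    language="KIF",
)

record = json.dumps(asdict(step), indent=2)
print(record)
```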
Our main result here is that we provide coherent provenance for an end-to-end system that reasons over both hand-coded and extracted knowledge. To that end we have represented extraction as a form of inference. UIMA has supported this work by making it possible to work with analysis components in terms of what they do instead of being forced to dig into the internal technical details of each component separately. Inference Web has supported this work by providing a formal interlingua for encoding provenance and an interface that allows us to view that provenance for complex end-to-end systems that include extraction and logical deduction.
Explaining Conclusions from Diverse Knowledge Sources. J. William Murdock¹, Deborah McGuinness², Paulo Pinheiro da Silva³, Chris Welty¹, David Ferrucci¹. ¹IBM Research, ²Stanford, ³U. Texas El Paso
Core Ideas Lots of important information is currently unstructured (e.g., natural language text on an HTML page) <ul><li>Extracting that information </li></ul><ul><li>Combining that information with existing KBs </li></ul><ul><li>Automated reasoning about that information </li></ul><ul><li>Coherent explanations for results </li></ul>
Motivating Example <ul><li>Question: Who manages the Mississippi automated data infrastructure? </li></ul><ul><li>Answer: Major Julian Allen. Why should I believe this? </li></ul><ul><li>Proof: Major Julian Allen managerOf Mississippi Automated Systems Project; Mississippi Automated Systems Project managerOf Mississippi automated data infrastructure; transitivity of managerOf. Why should I believe these? </li></ul><ul><li>Sources: kb1.owl and pressrelease/1107628109.html, which contains the text “Major Julian Allen, Ph.D., director of the Automated System Project”. Why should I believe that the unstructured text says that? </li></ul><ul><li>Extraction components: EntityAnnotator1, EntityAnnotator2, OrganizationalRelationAnnotator, CoreferenceResolver </li></ul>
Pre-Existing UIMA Technology <ul><li>In-progress OASIS standard architecture for analysis of unstructured sources (e.g., text, video, audio, images). </li></ul><ul><li>Open-source framework implementation: http://uima-framework.sourceforge.net/ </li></ul><ul><li>Shared APIs and data structures </li></ul><ul><ul><li>Thus generic tools can interact with components (e.g., recording provenance of analysis tasks as inferences) </li></ul></ul>
Pre-Existing Inference Web Technology <ul><li>Enables browsing representations of processes (i.e., knowledge provenance) </li></ul><ul><li>Uses descriptions of processes as inferences </li></ul><ul><li>Has been used with theorem-proving technology, task execution engines, web services, etc. </li></ul><ul><ul><li>These technologies lend themselves to a formal inference representation </li></ul></ul><ul><li>We are applying it to knowledge extraction </li></ul><ul><ul><li>Requires a new perspective: extraction as inference </li></ul></ul>
Taxonomy of Extraction Methods <ul><li>Identified 9 types of extraction inferences </li></ul><ul><ul><li>6 for analysis, and 3 for knowledge integration </li></ul></ul><ul><li>E.g., for the text “Major Julian Allen, Ph.D., director of the Automated System Project.” </li></ul><ul><ul><li>Entity Recognition: labels a span of the text as Person </li></ul></ul><ul><ul><li>Relation Argument Identification: connects the Person span to managerOf as its subject </li></ul></ul>
Motivating Example: Details <ul><li>Extraction </li></ul><ul><ul><li>Direct assertion from pressrelease/1107628109.html: “Major Julian Allen, Ph.D., director of the Automated System Project” </li></ul></ul><ul><ul><li>Entity Recognition (IBM EAnnotator): Major Julian Allen [Person], Ph.D., director of the Automated System Project [Organization] </li></ul></ul><ul><ul><li>Entity Identification (IBM Coreference): Major Julian Allen [Person] [refers to MJAllen1], Ph.D., director of the Automated System Project [Organization] [refers to MASProject1] </li></ul></ul><ul><ul><li>Relation Recognition (IBM Relation Detector): Major Julian Allen, Ph.D., director of the Automated System Project [managerOf] </li></ul></ul><ul><ul><li>Relation Argument Identification (IBM Relation Detector): Major Julian Allen [subject], Ph.D., director of the Automated System Project [object] </li></ul></ul><ul><ul><li>Relation Identification (IBM Coreference): (managerOf MJAllen1 MASProject1) </li></ul></ul><ul><li>Theorem Proving </li></ul><ul><ul><li>Direct assertion from KB1.owl: (managerOf MASProject1 MissDataInfrastructure1) </li></ul></ul><ul><ul><li>Direct assertion from KB1.owl: (transitiveProperty managerOf) </li></ul></ul><ul><ul><li>Transitive Property Inference (JTP Java Theorem Prover): (managerOf MJAllen1 MissDataInfrastructure1) </li></ul></ul>
Conclusions <ul><li>Uniform provenance for end-to-end system reasoning over hand-coded and extracted knowledge </li></ul><ul><ul><li>Representing extraction as inference </li></ul></ul><ul><li>UIMA provides a framework for integrating extraction systems. </li></ul><ul><ul><li>Data structures, APIs, etc. are shared </li></ul></ul><ul><ul><li>Possible to build traces of different components in the same way </li></ul></ul><ul><li>Inference Web provides mechanisms for encoding, browsing, and analyzing traces. </li></ul><ul><ul><li>Includes a provenance interlingua and supporting tools </li></ul></ul><ul><ul><li>Used for explaining </li></ul></ul><ul><ul><ul><li>Logical inference systems </li></ul></ul></ul><ul><ul><ul><li>Knowledge extraction from text </li></ul></ul></ul><ul><ul><ul><li>Complex processes that combine both </li></ul></ul></ul>
References <ul><li>Ferrucci, D. 2004. Text Analysis as Formal Inference for the Purposes of Uniform Tracing and Explanation Generation. IBM Research Report RC23372. </li></ul><ul><li>Ferrucci, D. and Lally, A. 2004. UIMA by Example. IBM Systems Journal 43(3): 455-475. </li></ul><ul><li>McGuinness, D. and Pinheiro da Silva, P. 2004. Explaining Answers from the Semantic Web: The Inference Web Approach. Journal of Web Semantics 1(4): 397-413. </li></ul><ul><li>Pinheiro da Silva, P., McGuinness, D. and Fikes, R. 2006. A Proof Markup Language for Semantic Web Services. Information Systems 31(4-5): 381-395. </li></ul><ul><li>Welty, C., Murdock, J.W., Pinheiro da Silva, P., McGuinness, D., Ferrucci, D. 2005. Tracking Information Extraction from Intelligence Documents. Proceedings of the International Conference on Intelligence Analysis. McLean, VA. </li></ul>