
Knowledge intensive query processing copy



Knowledge-Intensive Query Processing

Barbara Starr, SAIC, San Diego
Vinay K. Chaudhri, Artificial Intelligence Center, SRI International
Adam Farquhar, Knowledge Systems Laboratory, Stanford University
Richard Waldinger, Artificial Intelligence Center, SRI International, waldinger@ai.sri.com

The copyright of this paper belongs to the paper's authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.
Proceedings of the 5th KRDB Workshop, Seattle, WA, 31-May-1998 (A. Borgida, V. Chaudhri, M. Staudt, eds.)
B. Starr, V. Chaudhri, A. Farquhar, R. Waldinger 18-1

1 Introduction

Innovative query interfaces to knowledge and database systems must go beyond simply returning the requested information. They must be capable of producing intensional answers when a description improves the understanding of an answer [Mot94], producing conditional answers when no one answer matches the conditions of a query, and using ontological information in processing a query. They should be able to call upon stand-alone reasoning modules that are most suitable for a given query. When answering a question involves reasoning beyond a simple lookup, the system must be able to explain the answer to the user.

We are building a question answering system with these objectives. The heart of the system is a knowledge base (KB) and a collection of reasoning methods. The KB is being constructed by a combination of manual and semiautomatic methods. The reasoning methods include conventional database query processing, frame-based reasoning, and full first-order theorem proving. The performance of this system will be tested on the Crisis Management Benchmark (CMB), which defines a collection of queries of interest to a crisis analyst.

We begin the paper with a description of the CMB. We describe the architecture of our system and then sketch some design ideas for two of its components. We conclude the paper by listing a few of the research challenges faced in building such a system. The work described in the paper is preliminary and is aimed at suggesting directions for future work rather than at describing the work in technical depth.

2 Crisis Management Benchmark

The Crisis Management Benchmark (CMB) defines a collection of approximately 100 queries, their expected answers, and the knowledge sources that can be used to answer those queries [IACR97]. The CMB has been motivated by the needs of a crisis analyst who is monitoring a part of the world with the objective of predicting a crisis. The CMB has been defined in the context of a scenario involving a conflict among Persian Gulf nations. Many of the queries in the CMB are, however, of general interest. Two features distinguish the CMB from other benchmarks for measuring database performance [Gra93]: (1) it measures the knowledge content of the system and not just the time to process a query, and (2) it is designed for queries that require processing beyond a simple lookup or join of relations.

2.1 Knowledge Benchmark

For a system to be useful to a crisis analyst, it must have access to sufficiently broad geo-political knowledge. The knowledge benchmark represents a user's view of the system in the sense that it defines the domains in which a user is interested. The CMB does not require the system to collect all the knowledge in one place. We list here the domain areas in which the knowledge must be encoded. For each domain area we give an example question from the CMB that is representative of that type of knowledge.
World trade and economic information. How much oil does Japan purchase from the Persian Gulf states?
Geography and demographics. Which states border the Persian Gulf?
History of international behavior. Has Japan ever refused to trade with some other country?
Country policies. What is the US policy on illegal immigrants?
Country capabilities. Is Iraq capable of refusing inspection by UN officials?
International organizations. What is the International Monetary Fund and who are its members?

2.2 Processing Benchmark

The processing benchmark represents an implementor's view of the system, as it defines the processing capabilities that are necessary for answering the CMB questions. The CMB does not mandate any specific reasoning method that must be used for answering a given question. The CMB queries can be classified into the following categories.

Retrieval queries. At what fraction of its current sustainable capacity is Iran producing oil?
What-if queries. Assuming constant production by Iran and others, would a 5% increase in production by Saudi Arabia have a positive or negative effect on the economy of Iran? Would a 5% increase by Kuwait have as large an effect?
Analysis queries. Is Iraq capable of refusing inspection by UN officials?

2.3 Operation of the Benchmark

The initial evaluation criteria for the CMB are qualitative, and the answers produced by the system will be judged by a team of experts. Each answer will be tested on the following criteria. Is the answer correct and accurate? Does the answer include any correct, nontrivial analysis which was not obvious? Are the assumptions behind the knowledge appropriate? Does it constitute a realistic model for the question's purpose? Are simplifications appropriate? Is the level of generality or detail appropriate? Is an explanation offered? Is it intelligible to a nonexpert? The benchmark is being refined to include quantitative measures to evaluate these aspects of the answers.

The system will be tested over a period of three years. Each year will consist of a development period of 11 months and a testing period of one month. Each year will begin with a specific scenario and a list of questions that are relevant for it. During the 11-month development period, a system will be developed to satisfy this scenario. During the final month, the system will be tested in three phases. In the first phase, the KB and the scenario will be kept fixed and new questions will be asked. In the second phase, the KB will be fixed and the scenario will be changed. In the third phase, new knowledge will be added to the KB to address the new questions and the changed scenario.

3 System Architecture

The components in our system can be classified into three categories: user interface, knowledge services, and question answering. Many of the components have been developed by different research groups working independently.

An overview of our system architecture is shown in Figure 1. The components are held together by a common application programming interface: Open Knowledge Base Connectivity (OKBC) [CFF+98]. OKBC interfaces for some of the components are already functional, whereas others are being constructed.

Two kinds of user interface are supported by our system. HIKE is a form-based interface that allows a user to construct queries by using pull-down menus. Forms are provided to construct any of the queries in the CMB. START (Syntactic Analysis using Reversible Transformations) is a natural language interface that accepts queries in English [Kat97]. START generates a formal representation of a query and transmits it for evaluation to one of the question answering systems. Automatic generation of a formal representation for an English query is supported only for a subset of queries. Even though the goal of START is to accept arbitrary questions expressed in English, during the course of the current project the use of START will be restricted to the questions defined in the CMB.

Knowledge services are provided by three components: GKB-Editor, WebKB, and Ontolingua. GKB-Editor is a graphical tool for browsing and editing large knowledge bases [KCP98]. It is primarily used for manual knowledge acquisition. WebKB is a semiautomatic tool for extracting information from the World Wide Web (WWW) [CDF+98]. Given an ontology and a few examples of the information to be extracted, WebKB can extract objects, relations, and probabilistic rules from the text sources on the Internet. The extraction of knowledge is done in a semiautomatic fashion. Ontolingua is the knowledge server and stores all the knowledge in the system [FFR97]. Since it has a focal role in our architecture, we discuss it in detail.

Ontolingua is a tool for creating, evaluating, accessing, using, and maintaining reusable ontologies. It contains a collection of tools and services to support not only the development of ontologies by individuals, but also the process of achieving consensus on common ontologies by distributed groups. These tools include a
semiformal representation language that supports the description of terms in a representation language that is an extension of the Knowledge Interchange Format (KIF) [GF92], browsing and retrieval of ontologies, and facilities for translating ontologies into multiple representation languages.

[Figure 1 diagram; labels recoverable from the source: OKBC bus, SPOOK, ATP, SNARK, SKC, WebKB, GKB Editor, Ontolingua, HIKE, START, WWW, Training Data, Knowledge Engineer, Analyst]

Figure 1: System Architecture: ATP – Abstract Theorem Prover, GKB-Editor – Generic KB Editor, HIKE – form-based GUI, OKBC – Open Knowledge Base Connectivity, Ontolingua – Knowledge Server, SKC – Scalable Knowledge Composition, SNARK – SRI's New Automated Reasoning Kit, SPOOK – System for Probabilistic Object-Oriented Knowledge

Question answering is supported by multiple reasoning methods. SNARK is a first-order theorem prover [SWL+94]. SPOOK is a reasoner based on Bayes nets [KP97]. When answers are returned by multiple reasoning methods or by using alternative knowledge sources, the answers should be ranked before being presented to the user. Such a ranking will be done by SKC [WG95]. SKC is not yet a part of our system and will be integrated in the later phases of development. Finally, some of the question answering is done by START, which uses information retrieval methods to process a query expressed in English on a text-based source. We will illustrate a sample answer produced by one of these components later in the paper.

Let us consider an example of how a question is answered by our system. A user poses her query by using either HIKE or START. In either case, the query is mapped to a formal representation expressed in a language which is an extension of KIF. The query is shipped to one or more of the question answering components. The components may query the knowledge server (Ontolingua) during query evaluation. The resulting answer is returned to HIKE and/or START.

4 Knowledge Services

Two knowledge bases have been playing an active role in our work so far: the HPKB upper-level ontology (HPKB-UL) and the World Fact Book Knowledge Base (WFBKB).

4.1 Upper Ontology

The HPKB-UL is the upper-level ontology of the Cyc KB [LG89] augmented by some links to the Sensus ontology [KL94]. It is being used in DARPA's High Performance Knowledge Bases (HPKB) program. HPKB-UL provides a taxonomy of about 3000 terms and relations for general terms such as tangible-object, action, and transportation. It also defines some relations between them, such as the starting time of an event, the relationship between an object and its parts, and the borders of a geographic region. The X3T2 working group of ANSI has adopted HPKB-UL as the current draft for a standard upper ontology.

There are at least two advantages in using HPKB-UL. First, for formalizing the CMB questions, we need vocabulary. HPKB-UL provides a significant subset
of the vocabulary necessary for this purpose. Second, since our knowledge base is being developed by multiple research groups, we need a terminology that will help us in combining KBs developed by different groups. Using a common ontology makes the task of sharing knowledge easier.

For formalizing the CMB queries, we used the following approach: map nouns to concepts, verbs to events and actions, and adjectives to quality attributes; identify individuals; and finally specify temporal and spatial information. If a necessary term is not found in HPKB-UL, we extend HPKB-UL appropriately. As an example, consider the question:

    Has post-Shah Iran launched ballistic missiles in wartime?

Our formalization of this query is as follows.

    (and
      (attack ?act)
      (performed-by ?act Iran)
      (device-used ?act ballistic-missile)
      (later-than (start-of ?act)
                  (start-of post-shah-iran)))

In this formalization, performed-by, device-used, later-than, and start-of are predicates defined in HPKB-UL. For example, device-used relates an action to the device that was used in performing it. The predicate attack represents a collection of actions in which an agent attacks another agent, and post-shah-iran is a constant representing the time interval after the Shah of Iran. This formalization does not explicitly represent wartime, as it assumes that an attack is performed only during wartime.

Since all the question answering components subscribe to the same ontology, the above question can be sent for evaluation to any of them. A component may decide to transform this representation into an alternative representation that it can evaluate more efficiently.

4.2 World Fact Book KB

The World Fact Book KB, being developed by the Knowledge Systems Laboratory at Stanford, is a substantial knowledge base covering basic geographic, economic, political, and demographic knowledge about the world's nations. The goal of the project is to provide a useful knowledge resource, to explore ways for structuring large knowledge bases, and to develop a knowledge base large enough to stress existing knowledge representation systems.

The primary source for the World Fact Book knowledge base is the CIA World Fact Book, which collects a broad range of information about the countries and territories of the world (see footnote 1). The fact book includes geographical, economic, demographic, and some historical information.

Footnote 1: For more information on the World Fact Book, see

We have been augmenting HPKB-UL and linking it to WFBKB. HPKB-UL stops well above many of the concrete terms that appear in the fact book, such as bauxite mining, the food-and-beverage industry, sandy beaches, ethnic minorities, and spoken languages. We are working to construct a richer ontology that spans the substantial gaps between the HPKB-UL concepts and the terms that are introduced by the World Fact Book. We expect the WFBKB to provide knowledge necessary to answer many of the CMB questions.

5 Question Answering

For the analysis queries defined in the CMB, a simple yes or no answer is not appropriate. Instead, the system must return a descriptive answer and provide some justification for that answer. Let us consider an example of a conditional answer for an analysis question. This answer was produced using SNARK, which is one of the question answering components.

Consider the following question: What will be the likely position of Iraq on allowing inspection by UN officials? For an analysis question like this, we usually want to know more than what the question literally requires. On one hand, if Iraq is likely to refuse inspection, we would like to know the reasons. On the other hand, if Iraq is likely to go along with such an inspection, we need confirmation of it; simply failing to find a proof that Iraq can refuse UN inspection does not necessarily imply that it will go along, since there are many true things we cannot prove. In either case, we would like some indication of the reasoning by which the conclusion was reached.

Suppose we have a domain knowledge rule in the KB which states that Iraq is likely to refuse the inspection if it has political support from Russia:

    if (supports Russia Iraq)
    then (refuses Iraq UN-Inspection)
    else (delays Iraq UN-Inspection)

In finding a proof for this question, SNARK performs a case analysis, depending on whether Iraq has support from Russia. If Iraq has support from Russia, it returns an answer (refuses Iraq UN-Inspection). If Iraq does not have support from Russia, it returns an answer (delays Iraq UN-Inspection). If it cannot determine whether Iraq has support from Russia, it returns a conditional answer stating that if Iraq has support from Russia, it can refuse UN inspection; otherwise, it will try to delay it. The capability to produce conditional answers is not supported in any of the existing query processing systems.
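The three outcomes of this case analysis can be illustrated with a small sketch. The Python below is not SNARK's proof procedure and performs no theorem proving; it only mimics the observable behavior under the assumption that the knowledge base can report a condition as true, false, or undetermined. The names `Rule`, `Answer`, `answer_by_cases`, and `INSPECTION_RULE` are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """An if/then/else rule whose parts are KIF-like strings (illustrative)."""
    condition: str
    then: str
    otherwise: str


@dataclass(frozen=True)
class Answer:
    text: str                      # the answer presented to the user
    conditional: bool              # True when the condition could not be decided
    explanation: Tuple[str, ...]   # the reasoning steps, in order


def answer_by_cases(rule: Rule, known: Dict[str, bool]) -> Answer:
    """Three-way case analysis on the rule's condition.

    `known` maps a condition to True or False. A missing key means the
    condition is undetermined, which is different from being false.
    """
    status: Optional[bool] = known.get(rule.condition)
    if status is True:
        steps = (f"{rule.condition} is known to hold", f"so {rule.then}")
        return Answer(rule.then, False, steps)
    if status is False:
        steps = (f"{rule.condition} is known not to hold", f"so {rule.otherwise}")
        return Answer(rule.otherwise, False, steps)
    # Undetermined: report both branches instead of guessing one of them.
    text = f"if {rule.condition} then {rule.then} else {rule.otherwise}"
    steps = (f"{rule.condition} cannot be determined", "both cases are reported")
    return Answer(text, True, steps)


# The rule from the text, written with the same predicate names.
INSPECTION_RULE = Rule(
    condition="(supports Russia Iraq)",
    then="(refuses Iraq UN-Inspection)",
    otherwise="(delays Iraq UN-Inspection)",
)

if __name__ == "__main__":
    cases = (
        {"(supports Russia Iraq)": True},
        {"(supports Russia Iraq)": False},
        {},  # the knowledge base cannot decide either way
    )
    for known in cases:
        result = answer_by_cases(INSPECTION_RULE, known)
        print(result.text, "| conditional:", result.conditional)
```

The sketch keeps "known to be false" separate from "not known", which is what makes a conditional answer possible: a system that treated every unproved condition as false would always fall into the second case.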
SNARK also prints an explanation for the answer it produces. The explanation shows all the axioms used and the intermediate inference steps. Each inference step shows the axioms, inference method, and rewrites used in that step, the conclusion derived, and the current answer term. If an English description of an axiom is available, it is shown. The output is produced in HTML, making it easier for a user to navigate among several inference steps.

6 Summary and Conclusions

We have presented initial design ideas of an innovative query interface for a knowledge and database system. The system is capable of producing conditional answers when no one answer matches the conditions of a query, and uses ontological information in processing a query. The heart of the system is a knowledge base (KB) and a collection of reasoning methods. The KB is being constructed by a combination of manual and semiautomatic methods. The reasoning methods include conventional database query processing, frame-based reasoning, and full first-order theorem proving. The performance of this system will be tested on the Crisis Management Benchmark (CMB), which defines a collection of queries that are of interest to a crisis analyst.

We believe that innovative query interfaces such as the one described here represent a major advance in the query processing capabilities of current knowledge and database systems. They open up several challenging research problems that must be addressed. We believe that the following problems are fundamental to enabling the construction of such interfaces.

  • How can one take a KB such as WFBKB or HPKB-UL developed for one purpose and use it in a different context?
  • How can we reformulate a KB into a form that allows efficient evaluation of a query with a new reasoner?
  • When, and in what forms, are conditional answers useful?
  • How effective can a generic API such as OKBC be in integrating diverse technology components into one system?
  • Given a query, on what basis should it be dispatched to a component subsystem, and how should the results of sub-queries returned by different systems be combined?
  • How can we quantitatively measure the performance of answering analysis questions?
  • What are the principles and techniques for designing a large knowledge base that would enable knowledge reuse?
  • What is a good interface for allowing a user to construct queries by using ontological information?
  • What techniques are useful for explaining the answers of a system that uses multiple reasoning methods?

Acknowledgments

At Science Applications International Corporation (SAIC), this work was supported by a DARPA/SSC contract entitled Knowledge-Base Technology Development and Integration (Contract Number N66007-97-C-8546). At SRI International, it was supported by a DARPA contract entitled Ontology Construction Toolkit (Contract Number N66001-97-C-8550). At Stanford University, it was supported by a DARPA contract entitled Large-Scale Repositories of Highly Expressive Reusable Knowledge (Contract Number N66001-97-C-8554).

References

[CDF+98] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the National Conference on Artificial Intelligence, July 1998. To appear.

[CFF+98] Vinay K. Chaudhri, Adam Farquhar, Richard Fikes, Peter D. Karp, and James P. Rice. OKBC: A Foundation for Knowledge Base Interoperability. In Proceedings of the National Conference on Artificial Intelligence, July 1998. To appear.

[FFR97] Adam Farquhar, Richard Fikes, and James P. Rice. A Collaborative Tool for Ontology Construction. International Journal of Human Computer Studies, 46:707–727, 1997.

[GF92] Michael R. Genesereth and Richard E. Fikes. Knowledge Interchange Format, Version 3.0 Reference Manual. Technical Report Logic-92-1, Computer Science Department, Stanford University, Stanford, CA, 1992.

[Gra93] James N. Gray. The Benchmark Handbook for Database and Transaction Processing
Systems. Morgan Kaufmann Publishers, 1993.

[IACR97] IET, Alphatech, Paul Cohen, and Pacific Sierra Research. HPKB year 1 end-to-end challenge problem specification, version 1.1. Technical report, Information Extraction and Transport (IET) Inc., Rosslyn, Virginia, December 1997. See

[Kat97] Boris Katz. From Sentence Processing to Information Access on the World Wide Web. In AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford, CA, 1997.

[KCP98] Peter D. Karp, Vinay K. Chaudhri, and Suzanne M. Paley. A Collaborative Environment for Authoring Large Knowledge Bases. Journal of Intelligent Information Systems, 1998. To appear.

[KL94] K. Knight and S. Luk. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the National Conference on Artificial Intelligence, Seattle, WA, August 1994.

[KP97] D. Koller and A. Pfeffer. Object-Oriented Bayesian Networks. In Proceedings of the 13th Annual Conference on Uncertainty in AI (UAI), Providence, RI, August 1997.

[LG89] Douglas B. Lenat and R.V. Guha. Building Large Knowledge-based Systems: Representation and Inference in the Cyc Project. Reading, MA, Addison-Wesley Publishing Co., 1989.

[Mot94] Amihai Motro. Intensional Answers to Database Queries. IEEE Transactions on Knowledge and Data Engineering, 6(3):444–454, 1994.

[SWL+94] M. Stickel, R. Waldinger, M. Lowry, T. Pressburger, and I. Underwood. Deductive Composition of Astronomical Software from Subroutine Libraries. In Proceedings of the Twelfth International Conference on Automated Deduction (CADE-12), pages 341–355, June 1994.

[WG95] Gio Wiederhold and Michael Genesereth. The Conceptual Basis for Mediation Services. In Proceedings of the International Conference on Cooperative Information Systems, Vienna, Austria, May 1995.