HPKB Year 1 Results
AI Magazine Volume 19 Number 4 (1998) (© AAAI)
Articles

The DARPA High-Performance Knowledge Bases Project

Paul Cohen, Robert Schrag, Eric Jones, Adam Pease, Albert Lin, Barbara Starr, David Gunning, and Murray Burke

Copyright © 1998, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1998 / $2.00

Now completing its first year, the High-Performance Knowledge Bases Project promotes technology for developing very large, flexible, and reusable knowledge bases. The project is supported by the Defense Advanced Research Projects Agency and includes more than 15 contractors in universities, research laboratories, and companies. The evaluation of the constituent technologies centers on two challenge problems, in crisis management and battlespace reasoning, each demanding powerful problem solving with very large knowledge bases. This article discusses the challenge problems, the constituent technologies, and their integration and evaluation.

Although a computer has beaten the world chess champion, no computer has the commonsense of a six-year-old child. Programs lack knowledge about the world sufficient to understand and adjust to new situations as people do. Consequently, programs have been poor at interpreting and reasoning about novel and changing events, such as international crises and battlefield situations. These problems are more open ended than chess. Their solution requires shallow knowledge about motives, goals, people, countries, adversarial situations, and so on, as well as deeper knowledge about specific political regimes, economies, geographies, and armies.

The High-Performance Knowledge Base (HPKB) Project is sponsored by the Defense Advanced Research Projects Agency (DARPA) to develop new technology for knowledge-based systems.¹ It is a three-year program, ending in fiscal year 1999, with funding totaling $34 million. HPKB technology will enable developers to rapidly build very large knowledge bases (on the order of 10^6 rules, axioms, or frames), enabling a new level of intelligence for military systems. These knowledge bases should be comprehensive and reusable across many applications, and they should be maintained and modified easily. Clearly, these goals require innovation in many areas, from knowledge representation to formal reasoning and special-purpose problem solving, from knowledge acquisition to information gathering on the web to machine learning, from natural language understanding to semantic integration of disparate knowledge bases.

For roughly one year, HPKB researchers have been developing knowledge bases containing tens of thousands of axioms concerning crises and battlefield situations. Recently, the technology was tested in a month-long evaluation involving sets of open-ended test items, most of which were similar to sample (training) items but otherwise novel. Changes to the crisis and battlefield scenarios were introduced during the evaluation to test the comprehensiveness and flexibility of knowledge in the HPKB systems. The requirement for comprehensive, flexible knowledge about general scenarios forces knowledge bases to be large. Challenge problems, which define the scenarios and thus drive knowledge base development, are a central innovation of HPKB. This article discusses HPKB challenge problems, technologies and integrated systems, and the evaluation of these systems.

The challenge problems require significant developments in three broad areas of knowledge-based technology. First, the overriding goal of HPKB (to be able to select, compose, extend, specialize, and modify components from a library of reusable ontologies, common domain theories, and generic problem-solving strategies) is not immediately achievable and requires some research into foundations of very large knowledge bases, particularly research in knowledge representation and ontological engineering. Second, there is the
problem of building on these foundations to populate very large knowledge bases. The goal is for collaborating teams of domain experts (who might lack training in computer science) to easily extend the foundation theories, define additional domain theories and problem-solving strategies, and acquire domain facts. Third, because knowledge is not enough, one also requires efficient problem-solving methods. HPKB supports research on efficient, general inference methods and optimized task-specific methods.

HPKB is a timely impetus for knowledge-based technology, although some might think it overdue. Some of the tenets of HPKB were voiced in 1987 by Doug Lenat and Ed Feigenbaum (Lenat and Feigenbaum 1987), and some have been around for longer. Lenat’s CYC Project has also contributed much to our understanding of large knowledge bases and ontologies. Now, 13 years into the CYC Project and more than a decade after Lenat and Feigenbaum’s paper, there seems to be consensus on the following points:

The first and most intellectually taxing task when building a large knowledge base is to design an ontology. If you get it wrong, you can expect ongoing trouble organizing the knowledge you acquire in a natural way. Whenever two or more systems are built for related tasks (for example, medical expert systems, planning, modeling of physical processes, scheduling and logistics, natural language understanding), the architects of the systems realize, often too late, that someone else has already done, or is in the process of doing, the hard ontological work. HPKB challenges the research community to share, merge, and collectively develop large ontologies for significant military problems. However, an ontology alone is not sufficient. Axioms are required to give meaning to the terms in an ontology. Without them, users of the ontology can interpret the terms differently.

Most knowledge-based systems have no common sense; so, they cannot be trusted. Suppose you have a knowledge-based system for scheduling resources such as heavy-lift helicopters, and none of its knowledge concerns noncombatant evacuation operations. Now, suppose you have to evacuate a lot of people. Lacking common sense, your system is literally useless. With a little common sense, it could not only support human planning but might be superior to it because it could think outside the box and consider using the helicopters in an unconventional way. Common sense is needed to recognize and exploit opportunities as well as avoid foolish mistakes.

Often, one will accept an answer that is roughly correct, especially when the alternatives are no answer at all or a very specific but wrong answer. This is Lenat and Feigenbaum’s breadth hypothesis: “Intelligent performance often requires the problem solver to fall back on increasingly general knowledge, and/or to analogize to specific knowledge from far-flung domains” (Lenat and Feigenbaum 1987, p. 1173). We must, therefore, augment high-power knowledge-based systems, which give specific and precise answers, with weaker but adequate knowledge and inference. The inference methods might not all be sound and complete. Indeed, one might need a multitude of methods to implement what Polya called plausible inference. HPKB encompasses work on a variety of logical, probabilistic, and other inference methods.

It is one thing to recognize the need for commonsense knowledge, another to integrate it seamlessly into knowledge-based systems. Lenat observes that ontologies often are missing a middle level, the purpose of which is to connect very general ontological concepts such as human and activity with domain-specific concepts such as the person who is responsible for navigating a B-52 bomber. Because HPKB is grounded in domain-specific tasks, the focus of much ontological engineering is this middle layer.

The Participants

The HPKB participants are organized into three groups: (1) technology developers, (2) integration teams, and (3) challenge problem developers. Roughly speaking, the integration teams build systems with the new technologies to solve challenge problems. The integration teams are led by SAIC and Teknowledge. Each integration team fields systems to solve challenge problems in an annual evaluation. University participants include Stanford University, Massachusetts Institute of Technology (MIT), Carnegie Mellon University (CMU), Northwestern University, University of Massachusetts (UMass), George Mason University (GMU), and the University of Edinburgh (AIAI). In addition, SRI International, the University of Southern California Information Sciences Institute (USC-ISI), the Kestrel Institute, and TextWise, Inc., have developed important components. Information Extraction and Transport (IET), Inc., with Pacific Sierra Research (PSR), Inc., developed and evaluated the crisis-management challenge problem, and Alphatech, Inc., is responsible for the battlespace challenge problem.
Challenge Problems

A programmatic innovation of HPKB is challenge problems. The crisis-management challenge problem, developed by IET and PSR, is designed to exercise broad, relatively shallow knowledge about international tensions. The battlespace challenge problem, developed by Alphatech, Inc., has two parts, each designed to exercise relatively specific knowledge about activities in armed conflicts. Movement analysis involves interpreting vehicle movements detected and tracked by idealized sensors. The workaround problem is concerned with finding military engineering solutions to traffic-obstruction problems, such as destroyed bridges and blocked tunnels.

Good challenge problems must satisfy several, often conflicting, criteria. A challenge problem must be challenging: It must raise the bar for both technology and science. A problem that requires only technical ingenuity will not hold the attention of the technology developers, nor will it help the United States maintain its preeminence in science. Equally important, a challenge problem for a DARPA program must have clear significance to the United States Department of Defense (DoD). Challenge problems should serve for the duration of the program, becoming more challenging each year. This continuity is preferable to designing new problems every year because the infrastructure to support challenge problems is expensive.

A challenge problem should require little or no access to military subject-matter experts. It should not introduce a knowledge-acquisition bottleneck that results in delays and low productivity from the technology developers. As much as possible, the problem should be solvable with accessible, open-source material. A challenge problem should exercise all (or most) of the contributions of the technology developers, and it should exercise an integration of these technologies. A challenge problem should have unambiguous criteria for evaluating its solutions. These criteria need not be so objective that one can write algorithms to score performance (for example, human judgment might be needed to assess scores), but they must be clear and they must be published early in the program. In addition, although performance is important, challenge problems that value performance above all else encourage “one-off” solutions (a solution developed for a specific problem, once only) and discourage researchers from trying to understand why their technologies work well and poorly. A challenge problem should provide a steady stream of results, so progress can be assessed not only by technology developers but also by DARPA management and involved members of the DoD community.

The HPKB challenge problems are designed to support new and ongoing DARPA initiatives in intelligence analysis and battlespace information systems. Crisis-management systems will assist strategic analysts by evaluating the political, economic, and military courses of action available to nations engaged at various levels of conflict. Battlespace systems will support operations officers and intelligence analysts by inferring militarily significant targets and sites, reasoning about road network trafficability, and anticipating responses to military strikes.

Crisis-Management Challenge Problem

The crisis-management challenge problem is intended to drive the development of broad, relatively shallow commonsense knowledge bases to facilitate intelligence analysis. The client program at DARPA for this problem is Project GENOA: Collaborative Crisis Understanding and Management. GENOA is intended to help analysts more rapidly understand emerging international crises to preserve U.S. policy options. Proactive crisis management (before a situation has evolved into a crisis that might engage the U.S. military) enables more effective responses than reactive management. Crisis-management systems will assist strategic analysts by evaluating the political, economic, and military courses of action available to nations engaged at various levels of conflict.

The challenge problem development team worked with GENOA representatives to identify areas for the application of HPKB. This work took three or four months, but the crisis-management challenge problem specification has remained fairly stable since its initial release in draft form in July 1997.

The first step in creating the challenge problem was to develop a scenario to provide context for intelligence analysis in time of crisis. To ensure that the problem should require development of real knowledge about the world, the scenario includes real national actors with a fictional yet plausible story. The scenario, which takes place in the Persian Gulf, involves hostilities between Saudi Arabia and Iran that culminate in closing the Strait of Hormuz to international shipping.

Next, IET worked with experts at PSR to develop a description of the intelligence analysis process, which involves the following tasks: information gathering (what happened), situation assessment (what does it mean), and scenario development (what might happen next). Situation assessment (or interpretation) includes factors that pertain to the specific situation at hand, such as motives, intents, risks, rewards, and ramifications, and factors that make up a general context, or “strategic culture,” for a state actor’s behavior in international relations, such as capabilities, interests, policies, ideologies, alliances, and enmities. Scenario development, or speculative prediction, starts with the generation of plausible actions for each actor. Then, options are evaluated with respect to the same factors for situation assessment, and a likelihood rating is produced. The most plausible actions are reported back to policy makers.

These analytic tasks afford many opportunities for knowledge-based systems. One is to use knowledge bases to retain or multiply corporate expertise; another is to use knowledge and reasoning to “think outside the box,” to generate analytic possibilities that a human analyst might overlook. The latter task requires extensive commonsense knowledge, or “analyst’s sense,” about the domain to rule out implausible options.

The crisis-management challenge problem includes an informal specification for a prototype crisis-management assistant to support analysts. The assistant is tested by asking questions. Some are simple requests for factual information, others require the assistant to interpret the actions of nations in the context of strategic culture. Actions are motivated by interests, balancing risks and rewards. They have impacts and require capabilities. Interests drive the formation of alliances, the exercise of influence, and the generation of tensions among actors. These factors play out in a current and a historical context. Crises can be represented as events or as larger episodes tracking the evolution of a conflict over time, from inception or trigger, through any escalation, to eventual resolution or stasis. The representations being developed in HPKB are intended to serve as a crisis corporate memory to help analysts discover historical precedents and analogies for actions. Much of the challenge-problem specification is devoted to sample questions that are intended to drive the development of general models for reasoning about crisis events.

Sample questions are embedded in an analytic context. The question “What might happen next?” is instantiated as “What might happen following the Saudi air strikes?” as shown in figure 1. Q51 is refined to Q83 in a way that is characteristic of the analytic process; that is, higher-level questions are refined into sets of lower-level questions that provide detail.

    III. What of significance might happen following the Saudi air strikes?
      B. Options evaluation
         Evaluate the options available to Iran.
         Close the Strait of Hormuz to shipping.
           Evaluation: Probable
           Motivation: Respond to Saudi air strikes and deter future strikes.
           Capability:
             (Q51) Can Iran close the Strait of Hormuz to international shipping?
             (Q83) Is Iran capable of firing upon tankers in the Strait of Hormuz? With what weapons?
           Negative outcomes:
             (Q53) What risks would Iran face in closing the strait?

    Figure 1. Sample Questions Pertaining to the Responses to an Event.

The challenge-problem developers (IET with PSR) developed an answer key for sample questions, a fragment of which is shown in figure 2. Although simple factual questions (for example, “What is the gross national product of the United States?”) have just one answer, questions such as Q53 usually have several. The answer key actually lists five answers, two of which are shown in figure 2. Each is accompanied by suitable explanations, including source material. The first source (Convention on the Law of the Sea) is electronic. IET maintains a web site with links to pages that are expected to be useful in answering the questions. The second source is a fragment of a model developed by IET and published in the challenge-problem specification. IET developed these fragments to
indicate the kinds of reasoning they would be testing in the challenge problem.

    Answer(s):
    1. Economic sanctions from {Saudi Arabia, GCC, U.S., U.N.}
       • The closure of the Strait of Hormuz would violate an international norm promoting freedom of the seas and would jeopardize the interests of many states.
       • In response, states might act unilaterally or jointly to impose economic sanctions on Iran to compel it to reopen the strait.
       • The United Nations Security Council might authorize economic sanctions against Iran.
    2. Limited military response from {Saudi Arabia, GCC, U.S., others}…

    Source(s):
       • The Convention on the Law of the Sea.
       • (B5) States may act unilaterally or collectively to isolate and/or punish a group or state that violates international norms. Unilateral and collective action can involve a wide range of mechanisms, such as intelligence collection, military retaliation, economic sanction, and diplomatic censure/isolation.

    Figure 2. Part of the Answer Key for Question 53.

For the challenge-problem evaluation, held in June 1998, IET developed a way to generate test questions through parameterization. Test questions deviate from sample questions in specified, controlled ways, so the teams participating in the challenge problem know the space of questions from which test items will be selected. This space includes billions of questions so the challenge problem cannot be solved by relying on question-specific knowledge. The teams must rely on general knowledge to perform well in the evaluation. (Semantics provide practical constraints on the number of reasonable instantiations of parameterized questions, as do online sources provided by IET.) To illustrate, Q53 is parameterized in figure 3. Parameterized question 53, PQ53, actually covers 8 of the roughly 100 sample questions in the specification.

    PQ53 [During/After <TimeInterval>,] what {risks, rewards} would <InternationalAgent> face in <InternationalActionType>?

    <InternationalActionType> =
      {[exposure of its] {supporting, sponsoring} <InternationalAgentType> in <InternationalAgent2>,
       successful terrorist attacks against <InternationalAgent2>’s <EconomicSector>,
       <InternationalActionType>(PQ51),
       taking hostage citizens of <InternationalAgent2>,
       attacking targets <SpatialRelationship> <InternationalAgent2> with <Force>}

    <InternationalAgentType> =
      {terrorist group, dissident group, political party, humanitarian organization}

    Figure 3. A Parameterized Question Suitable for Generating Sample Questions and Test Questions.

Parameterized questions and associated class definitions are based on natural language, giving the integration teams responsibility for developing (potentially different) formal representations of the questions. This decision was made at the request of the teams. An instance of a parameterized question, say, PQ53, is mechanically generated, then the teams must create a formal representation and reason with it, without human intervention.

Battlespace Challenge Problems

The second challenge-problem domain for HPKB is battlespace reasoning. Battlespace is an abstract notion that includes not only the
physical geography of a conflict but also the plans, goals, and activities of all combatants prior to, and during, a battle and during the activities leading to the battle. Three battlespace programs within DARPA were identified as potential users of HPKB technologies: (1) the dynamic multiinformation fusion program, (2) the dynamic database program, and (3) the joint forces air-component commander (JFACC) program. Two battlespace challenge problems have been developed.

The Movement-Analysis Challenge Problem

The movement-analysis challenge problem concerns high-level analysis of idealized sensor data, particularly the airborne JSTARS moving target indicator radar. This Doppler radar can generate vast quantities of information: one reading every minute for each vehicle in motion within a 10,000-square-mile area.² The movement-analysis scenario involves an enemy mobilizing a full division of ground forces (roughly 200 military units and 2000 vehicles) to defend against a possible attack. A simulation of the vehicle movements of this division was developed, the output of which includes reports of the positions of all the vehicles in the division at 1-minute intervals over a 4-day period for 18 hours each day. These military vehicle movements were then interspersed with plausible civilian traffic to add the problem of distinguishing military from nonmilitary traffic. The movement-analysis task is to monitor the movements of the enemy to detect and identify types of military site and convoy.

Because HPKB is not concerned with signal processing, the inputs are not real JSTARS data but are instead generated by a simulator and preprocessed into vehicle tracks. There is no uncertainty in vehicle location and no radar shadowing, and each vehicle is always accurately identified by a unique bumper number. However, vehicle tracks do not precisely identify vehicle type but instead define each vehicle as either light wheeled, heavy wheeled, or tracked. Low-speed and stationary vehicles are not reported.

Vehicle-track data are supplemented by small quantities of high-value intelligence data, including accurate identification of a few key enemy sites, electronic intelligence reports of locations and times at which an enemy radar is turned on, communications intelligence reports that summarize information obtained by monitoring enemy communications, and human intelligence reports that provide detailed information about the numbers and types of vehicle passing a given location. Other inputs include a detailed road network in electronic form and an order of battle that describes the structure and composition of the enemy forces in the scenario region.

Given these inputs, movement analysis comprises the following tasks:

First is to distinguish military from nonmilitary traffic. Almost all military traffic travels in convoys, which makes this task fairly straightforward except for very small convoys of two or three vehicles.

Second is to identify the sites between which military convoys travel, determine which of these sites are militarily significant, and determine the types of each militarily significant site. Site types include battle positions, command posts, support areas, air-defense sites, artillery sites, and assembly-staging areas.

Third is to identify which units (or parts of units) in the enemy order of battle are participating in each military convoy.

Fourth is to determine the purpose of each convoy movement. Purposes include reconnaissance, movement of an entire unit toward a battle position, activities by command elements, and support activities.

Fifth is to infer the exact types of the vehicles that make up each convoy. About 20 types of military vehicle are distinguished in the enemy order of battle, all of which show up in the scenario data.

To help the technology base and the integration teams develop their systems, a portion of the simulation data was released in advance of the evaluation phase, accompanied by an answer key that supplied model answers for each of the inference tasks listed previously.

Movement analysis is currently carried out manually by human intelligence analysts, who appear to rely on models of enemy behavior at several levels of abstraction. These include models of how different sites or convoys are structured for different purposes and models of military systems such as logistics (supply and resupply). For example, in a logistics model, one might find the following fragment: “Each echelon in a military organization is responsible for resupplying its subordinate echelons. Each echelon, from battalion on up, has a designated area for storing supplies. Supplies are provided by higher echelons and transshipped to lower echelons at these areas.” Model fragments such as these are thought to constitute the knowledge of intelligence analysts and, thus, should be the content of HPKB movement-analysis systems. Some such knowledge was elicited from military intelligence analysts during programwide meetings. These same analysts also scripted the simulation scenario.
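The first movement-analysis task, separating military from nonmilitary traffic by exploiting the fact that military vehicles travel in convoys, can be illustrated with a small sketch. This is not the method used by any HPKB team; it is a minimal illustration under stated assumptions. The `Sighting` record, the road-segment identifiers, and the `max_gap` and `min_size` thresholds are all hypothetical. The only details taken from the text are that each vehicle carries a unique bumper number and that readings arrive at 1-minute intervals.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Sighting:
    vehicle_id: str  # unique bumper number, as in the simulated tracks
    segment: str     # road-segment identifier (hypothetical naming)
    minute: int      # minutes since the start of the scenario


def first_entries(sightings):
    """Keep only the earliest sighting of each vehicle on each segment."""
    earliest = {}
    for s in sightings:
        key = (s.segment, s.vehicle_id)
        if key not in earliest or s.minute < earliest[key].minute:
            earliest[key] = s
    return list(earliest.values())


def candidate_convoys(sightings, max_gap=2, min_size=4):
    """Group vehicles that enter the same segment close together in time.

    max_gap and min_size are illustrative thresholds, not values from
    the challenge-problem specification.
    """
    convoys = []
    ordered = sorted(first_entries(sightings),
                     key=lambda s: (s.segment, s.minute, s.vehicle_id))
    for segment, group in groupby(ordered, key=lambda s: s.segment):
        run = []
        for s in group:
            # A long gap since the previous vehicle closes the current run.
            if run and s.minute - run[-1].minute > max_gap:
                if len(run) >= min_size:
                    convoys.append((segment, [r.vehicle_id for r in run]))
                run = []
            run.append(s)
        if len(run) >= min_size:
            convoys.append((segment, [r.vehicle_id for r in run]))
    return convoys
```

With `min_size=4`, a group of two or three vehicles is never reported, which mirrors the difficulty with very small convoys noted above. Lowering the threshold would catch them at the cost of flagging civilian vehicles that happen to travel together.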
The Workaround Challenge Problem

The workaround challenge problem supports air-campaign planning by the JFACC and his/her staff. One task for the JFACC is to determine suitable targets for air strikes. Good targets allow one to achieve maximum military effect with minimum risk to friendly forces and minimum loss of life on all sides. Infrastructure often provides such targets: It can be sufficient to destroy supplies at a few key sites or critical nodes in a transportation network, such as bridges along supply routes. However, bridges and other targets can be repaired, and there is little point in destroying a bridge if an available fording site is nearby. If a plan requires an interruption in traffic of several days, and the bridge can be repaired in a few hours, then another target might be more suitable. Target selection, then, requires some reasoning about how an enemy might “work around” the damage to the target.

The task of the workaround challenge problem is to automatically assess how rapidly and by what method an enemy can reconstitute or bypass damage to a target and, thereby, help air-campaign planners rapidly choose effective targets. The focus of the workaround problem in the first year of HPKB is automatic workaround generation.

The workaround task involves detailed representation of targets and the local terrain around the target and detailed reasoning about actions the enemy can take to reconstitute or bypass this damage. Thus, the inputs to workaround systems include the following elements:

First is a description of a target (for example, a bridge or a tunnel), the damage to it (for example, one span of a bridge is dropped; the bridge and vicinity are mined), and key features of the local terrain (for example, the slope and soil types of a terrain cross section coincident with the road near the bridge, together with the maximum depth and the speed of any river or stream the bridge crosses).

Second is a specific enemy unit or capability to be interdicted, such as a particular armored battalion or supply trucks carrying ammunition.

Third is a time period over which this unit or capability is to be denied access to the targeted route. The presumption is that the enemy will try to repair the damage within this time period; a target is considered to be effective if there appears to be no way for the enemy to make this repair.

Fourth is a detailed description of the enemy resources in the area that could be used to repair the damage. For the most part, repairs to battle damage are carried out by Army engineers; so, this description takes the form of a detailed engineering order of battle.

All inputs are provided in a formal representation language.

The workaround generator is expected to provide three outputs:

First is a reconstitution schedule, giving the capacity of the damaged link as a function of time since the damage was inflicted. For example, the workaround generator might conclude that the capacity of the link is zero for the first 48 hours, but thereafter, a temporary bridge will be in place that can sustain a capacity of 170 vehicles an hour.

Second is a time line of engineering actions that the enemy might carry out to implement the repair, the time these actions require, and the temporal constraints among them. If there appears to be more than one viable repair strategy, a time line should be provided for each.

Third is a set of required assets for each time line of actions, a description of the engineering resources that are used to repair the damage and pointers to the actions in the time line that utilize these assets.

The reconstitution schedule provides the minimal information required to evaluate the suitability of a given target. The time line of actions provides an explanation to justify the reconstitution schedule. The set of required assets is easily derived from the time line of actions and can be used to suggest further targets for preemptive air strikes against the enemy to frustrate its repair efforts.

A training data set was provided to help developers build their systems. It supplied inputs and outputs for several sample problems, together with detailed descriptions of the calculations carried out to compute action durations; lists of simplifying assumptions made to facilitate these calculations; and pointers to text sources for information on engineering resources and their use (mainly Army field manuals available on the World Wide Web).

Workaround generation requires detailed knowledge about what the capabilities of the enemy’s engineering equipment are and how it is typically used by enemy forces. For example, repairing damage to a bridge typically involves mobile bridging equipment, such as armored vehicle-launched bridges (AVLBs), medium girder bridges, Bailey bridges, or float bridges such as ribbon bridges or M4T6 bridges, together with a range of earthmoving equipment such as bulldozers. Each kind of mobile bridge takes a characteristic amount of time to deploy, requires different kinds of bank preparation, and is “owned” by different echelons in the military hierarchy, all of which
affect the time it takes to bring the bridge to a damage site and effect a repair. Because HPKB operates in an entirely unclassified environment, U.S. engineering resources and doctrine were used throughout. Information from Army field manuals was supplemented by a series of programwide meetings with an Army combat engineer, who also helped construct sample problems and solutions.

Integrated Systems

The challenge problems are solved by integrated systems fielded by integration teams led by Teknowledge and SAIC. Teknowledge favors a centralized architecture that contains a large commonsense ontology (CYC); SAIC has a distributed architecture that relies on sharing specialized domain ontologies and knowledge bases, including a large upper-level ontology based on the merging of CYC, SENSUS, and other knowledge bases.

Teknowledge Integration

The Teknowledge integration team comprises Teknowledge, Cycorp, and Kestrel Institute. Its focus is on semantic integration and the creation of massive amounts of knowledge.

Semantic Integration

Three issues make software integration difficult. Transport issues concern mechanisms to get data from one process or machine to another. Solutions include sockets, remote-method invocation (RMI), and CORBA. Syntactic issues concern how to convert number formats, “syntactic sugar,” and the labels of data. The more challenging issues concern semantic integration: To integrate elements properly, one must understand the meaning of each. The database community has addressed this issue (Wiederhold 1996); it is even more pressing in knowledge-based systems.

The current state of practice in software integration consists largely of interfacing pairs of systems, as needed. Pairwise integration of this kind does not scale up, unanticipated uses are hard to cover later, and chains of integrated systems at best evolve into stovepipe systems. Each integration is only as general as it needs to be to solve the problem at hand.

Some success has been achieved in low-level integration and reuse; for example, systems that share scientific subroutine libraries or graphics packages are often forced into similar representational choices for low-level data. DARPA has invested in early efforts to create reuse libraries for integrating large systems at higher levels. Some effort has gone into expressing a generic semantics of plans in an object-oriented format (Pease and Carrico 1997a, 1997b), and applications of this generic semantics to domain-specific tasks are promising (Pease and Albericci 1998). The development of ontologies for integrating manufacturing planning applications (Tate 1998) and work flow (Lee et al. 1996) is ongoing.

Another option for semantic integration is software mediation (Park, Gennari, and Musen 1997). This software mediation can be seen as a variant on pairwise integration, but because integration is done by knowledge-based means, one has an explicit expression of the semantics of the conversion. Researchers at Kestrel Institute have successfully defined formal specifications for data and used these theories to integrate formally specified software. In addition, researchers at Cycorp have successfully applied CYC to the integration of multiple databases.

The Teknowledge approach to integration is to share knowledge among applications and create new knowledge to support the challenge problems. Teknowledge is defining formal semantics for the input and output of each application and the information in the challenge problems.

Many concepts defy simple definitions. Although there has been much success in defining the semantics of mathematical concepts, it is harder to be precise about the semantics of the concepts people use every day. These concepts seem to acquire meaning through their associations with other concepts, their use in situations and communication, and their relations to instances. To give the concepts in our integrated system real meaning, we must provide a rich set of associations, which requires an extremely large knowledge base. CYC offers just such a knowledge base.

CYC (Lenat 1995; Lenat and Guha 1990) consists of an immense, multicontextual knowledge base; an efficient inference engine; and associated tools and interfaces for acquiring, browsing, editing, and combining knowledge. Its premise is that knowledge-based software will be less brittle if and only if it has access to a foundation of basic commonsense knowledge. This semantic substratum of terms, rules, and relations enables application programs to cope with unforeseen circumstances and situations.

The CYC knowledge base represents millions of hand-crafted axioms entered during the 13 years since CYC’s inception. Through careful policing and generalizing, there are now slightly fewer than 1 million axioms in the knowledge base, interrelating roughly 50,000
atomic terms. Fewer than two percent of these axioms represent simple facts about proper nouns of the sort one might find in an almanac. Most embody general consensus information about the concepts. For example, one axiom says one cannot perform volitional actions while one sleeps, another says one cannot be in two places at once, and another says you must be at the same place as a tool to use it. The knowledge base spans human capabilities and limitations, including information on emotions, beliefs, expectations, dreads, and goals; common everyday objects, processes, and situations; and the physical universe, including such phenomena as time, space, causality, and motion.

The CYC inference engine comprises an epistemological and a heuristic level. The epistemological level is an expressive nth-order logical language with clean formal semantics. The heuristic level is a set of some three dozen special-purpose modules that each contains its own algorithms and data structures and can recognize and handle some commonly occurring sorts of inference. For example, one heuristic-level module handles temporal reasoning efficiently by converting temporal relations into a before-and-after graph and then doing graph searching rather than theorem proving to derive an answer. A truth maintenance system and an argumentation-based explanation and justification system are tightly integrated into the system and are efficient enough to be in operation at all times. In addition to these inference engines, CYC includes numerous browsers, editors, and consistency checkers. A rich interface has been defined.

Crisis-Management Integration

The crisis-management challenge problem involves answering test questions presented in a structured grammar. The first step in answering a test question is to convert it to a form that CYC can reason with, a declarative decision tree. When the tree is applied to the test question input, a CYC query is generated and sent to CYC.

Answering the challenge problem questions takes a great deal of knowledge. For the first year’s challenge problem alone, the Cycorp and Teknowledge team added some 8,000 concepts and 80,000 assertions to CYC. To meet the needs of this challenge problem, the team created significant amounts of new knowledge, some developed by collaborators and merged into CYC, some added by automated processes.

The Teknowledge integrated system includes two natural language components: START and TextWise. The START system was created by Boris Katz (1997) and his group at MIT. For each of the crisis-management questions, Teknowledge has developed a template into which user-specified parameters can be inserted. START parses English queries for a few of the crisis-management questions to fill in these templates. Each filled template is a legal CYC query. TextWise Corporation has been developing natural language information-retrieval software primarily for news articles (Liddy, Paik, and McKenna 1995). Teknowledge intends to use the TextWise knowledge base information tools (KNOW-IT) to supply many instances to CYC of facts discovered from news stories. The system can parse English text and return a series of binary relations that express the content of the sentences. There are several dozen relation types, and the constants that instantiate each relation are WORDNET synset mappings (Miller et al. 1993). Each of the concepts has been mapped to a CYC expression, and a portion of WORDNET has been mapped to CYC. For those synsets not in CYC, the WORDNET hyponym links are traversed until a mapped CYC term is found.

Battlespace Integration

Teknowledge supported the movement-analysis workaround problem.

Movement analysis: Several movement-analysis systems were to be integrated, and much preliminary integration work was done. Ultimately, the time pressure of the challenge problem evaluation precluded a full integration. The MIT and UMass movement-analysis systems are described briefly here; the SMI and SRI systems are described in the section entitled SAIC Integrated Systems.

The MIT MAITA system provides tools for constructing and controlling networks of distributed-monitoring processes. These tools provide access to large knowledge bases of monitoring methods, organized around the hierarchies of tasks performed, knowledge used, contexts of application, the alerting of utility models, and other dimensions. Individual monitoring processes can also make use of knowledge bases representing commonsense or expert knowledge in conducting their reasoning or reporting their findings. MIT built monitoring processes for sites and convoys with these tools.

The UMass group tried to identify convoys and sites with very simple rules. Rules were developed for three site types: (1) battle positions, (2) command posts, and (3) assembly-staging areas. The convoy detector simply looked for vehicles traveling at fixed distances from one another. Initially, UMass was going to recognize convoys from their dynamics, in which the distances between vehicles fluctuate in a characteristic way, but in the simulated
data, the distances between vehicles remained fixed. UMass also intended to detect sites by the dynamics of vehicle movements between them, but no significant dynamic patterns could be found in the movement data.

Workarounds: Teknowledge developed two workaround integrations, one an internal Teknowledge system, the other from AIAI at the University of Edinburgh.

Teknowledge developed a planning tool based on CYC, essentially a wrapper around CYC’s existing knowledge base and inference engine. A plan is a proof that there is a path from the final goal to the initial situation through a partially ordered set of actions. The rules in the knowledge base driving the planner are rules about action preconditions and about which actions can bring about a certain state of affairs. There is no explicit temporal reasoning; the partial order of temporal precedence between actions is established on the basis of the rules about preconditions and effects.

The planner is a new kind of inference engine, performing its own search but in a much smaller search space. However, each step in the search involves interaction with the existing inference engine by hypothesizing actions and microtheories and doing asks and asserts in these microtheories. This hypothesizing and asserting on the fly in effect amounts to dynamically updating the knowledge base in the course of inference; this capability is new for the CYC inference engine.

Consistent with the goals of HPKB, the Teknowledge workaround planner reused CYC’s knowledge, although it was not knowledge specific to workarounds. In fact, CYC had never been the basis of a planner before, so even stating things in terms of an action’s preconditions was new. What CYC provided, however, was a rich basis on which to build workaround knowledge. For example, the Teknowledge team needed to write only one rule to state “to use something as a device you must have control over that device,” and this rule covered the cases of using an M88 to clear rubble, a mine plow to breach a minefield, a bulldozer to cut into a bank or narrow the gap, and so on. The reason one rule can cover so many cases is because clearing rubble, demining an area, narrowing a gap, and cutting into a bank are all specializations of IntrinsicStateChangeEvent, an extant part of the CYC ontology.

The AIAI workaround planner was also implemented in CYC and took data from Teknowledge’s FIRE&ISE-TO-MELD translator as its input. The central idea was to use the scriptlike structure of workaround plans to guide the reasoning process. For this reason, a hierarchical task network (HTN) approach was taken. A planning-specific ontology was defined within the larger CYC ontology, and planning rules only referenced concepts within this more constrained context. The planning application was essentially embedded in CYC.

CYC had to be extended to represent composite actions that have several alternative decompositions and complex preconditions-effects. Although it is not a commonsense approach, AIAI decided to explore HTN planning because it appeared suitable for the workaround domain. It was possible to represent actions, their conditions and effects, the plan-node network, and plan resources in a relational style. The structure of a plan was implicitly represented in the proof that the corresponding composite action was a relation between particular sets of conditions and effects. Once proved, action relations are retained by CYC and are potentially reusable. An advantage of implementing the AIAI planner in CYC was the ability to remove brittleness from the planner-input knowledge format; for example, it was not necessary to account for all the possible permutations of argument order in predicates such as bordersOn and between.

SAIC Integrated System

SAIC built an HPKB integrated knowledge environment (HIKE) to support both crisis-management and battlespace challenge problems. The architecture of HIKE for crisis management is shown in figure 4. For battlespace, the architecture is similar in that it is distributed and relies on the open knowledge base connectivity (OKBC) protocol, but of course, the components integrated by the battlespace architecture are different. HIKE’s goals are to address the distributed communications and interoperability requirements among the HPKB technology components (knowledge servers, knowledge-acquisition tools, question-and-answering systems, problem solvers, process monitors, and so on) and provide a graphic user interface (GUI) tailored to the end users of the HPKB environment.

HIKE provides a distributed computing infrastructure that addresses two types of communications needs: First are input and output data-transportation and software connectivities. These include connections between the HIKE server and technology components, connections between components, and connections between servers. HIKE encapsulates information content and data transportation through JAVA objects, hypertext transfer protocol (HTTP), remote-method invocation (JAVA
RMI), and database access (JDBC). Second, HIKE provides for knowledge-content assertion and distribution and query requests to knowledge services.

Figure 4. SAIC Crisis-Management Challenge Problem Architecture.

The OKBC protocol proved essential. SRI used it to interface the theorem prover SNARK to an OKBC server storing the Central Intelligence Agency (CIA) World Fact Book knowledge base. Because this knowledge base is large, SRI did not want to incorporate it into SNARK but instead used the procedural attachment feature of SNARK to look up facts that were available only in the World Fact Book. MIT’s START system used OKBC to connect to SRI’s OCELOT-SNARK OKBC server. This connection will eventually give users the ability to pose questions in English, which are then transformed to a formal representation by START and shipped to SNARK using OKBC; the result is returned using OKBC. ISI built an OKBC server for their LOOM system for GMU. SAIC built a front end to the OKBC server for LOOM that was extensively used by the members of the battlespace challenge problem team.

With OKBC and other methods, the HIKE infrastructure permits the integration of new technology components (either clients or servers) in the integrated end-to-end HPKB system without introducing major changes, provided that the new components adhere to the specified protocols.

Crisis-Management Integration

The SAIC crisis-management architecture is focused around a central OKBC bus, as shown in figure 4. The technology components provide user interfaces, question answering, and knowledge services. Some components have overlapping roles. For example, MIT’s START system serves both as a user interface and a question-answering component. Similarly, CMU’s
WEBKB supports both question answering and knowledge services.

HIKE provides a form-based GUI with which users can construct queries with pull-down menus. Query-construction templates correspond to the templates defined in the crisis-management challenge problem specification. Questions also can be entered in natural language. START and the TextWise component accept natural language queries and then attempt to answer the questions. To answer questions that involve more complex types of reasoning, START generates a formal representation of the query and passes it to one of the theorem provers.

The Stanford University Knowledge Systems Laboratory ONTOLINGUA, SRI International’s graphic knowledge base (GKB) editor, WEBKB, and TextWise provide the knowledge service components. The GKB editor is a graphic tool for browsing and editing large knowledge bases, used primarily for manual knowledge acquisition. WEBKB supports semiautomatic knowledge acquisition. Given some training data and an ontology as input, a web spider searches in a directed manner and populates instances of classes and relations defined in the ontology. Probabilistic rules are also extracted. TextWise extracts information from text and newswire feeds, converting them into knowledge interchange format (KIF) triples, which are then loaded into ONTOLINGUA. ONTOLINGUA is SAIC’s central knowledge server and information repository for the crisis-management challenge problem. ONTOLINGUA supports KIF as well as compositional modeling language (CML). Flow models developed by Northwestern University (NWU) answer challenge problem questions related to world oil-transportation networks and reside within ONTOLINGUA. Stanford’s system for probabilistic object-oriented knowledge (SPOOK) provides a language for class frames to be annotated with probabilistic information, representing uncertainty about the properties of instances in this class. SPOOK is capable of reasoning with the probabilistic information based on Bayesian networks.

Question answering is implemented in several ways. SRI International’s SNARK and Stanford’s abstract theorem prover (ATP) are first-order theorem provers. WEBKB answers questions based on the information it gathers. Question answering is also accomplished by START and TextWise taking a query in English as input and using information retrieval to extract the answers from text-based sources (such as the web, newswire feeds).

The guiding philosophy during knowledge base development for crisis management was to reuse knowledge whenever it made sense. The SAIC team reused three knowledge bases: (1) the HPKB upper-level ontology developed by Cycorp, (2) the World Fact Book knowledge base from the CIA, and (3) the Units and Measures Ontology from Stanford. Reusing the upper-level ontology required translation, comprehension, and reformulation. The ontology was released in MELD (a language used by Cycorp) and was not directly readable by the SAIC system. In conjunction with Stanford, SRI developed a translator to load the upper-level ontology into any OKBC-compliant server. Once the ontology was loaded into the OCELOT server, the GKB editor was used to comprehend it. The graphic nature of the GKB editor illuminated the interrelationships between classes and predicates of the upper-level ontology. Because the upper-level ontology represents functional relationships as predicates but SNARK reasons efficiently with functions, it was necessary to reformulate the ontology to use functions whenever a functional relationship existed.

Battlespace Integration

The distributed HIKE infrastructure is well suited to support an integrated battlespace challenge problem as it was originally designed: a single information system for movement analysis, trafficability, and workaround reasoning. However, the trafficability problem (establishing routes for various kinds of vehicle given the characteristics) was dropped, and the integration of the other problems was delayed. The components that solved these problems are described briefly later.

Movement analysis: The movement-analysis problem is solved by MIT’s monitoring, analysis, and interpretation tools arsenal (MAITA); Stanford University’s Section on Medical Informatics’ (SMI) problem-solving methods; and SRI International’s Bayesian networks. The MIT effort was described briefly in the section entitled Teknowledge Integration. Here, we focus on the SMI and SRI movement-analysis systems.

For scalability, SMI adopted a three-layered approach to the challenge problem: The first layer consisted primarily of simple, context-free data processing that attempted to find important preliminary abstractions in the data set. The most important of these were traffic centers (locations that were either the starting or stopping points for a significant number of vehicles) and convoy segments (a number of vehicles that depart from the same traffic center at roughly the same time, going in roughly the same direction). Spotting these abstractions required setting a number of parameters (for example, how big a traffic center is). Once trained, these first-layer algorithms are linear in
the size of the data set and enabled SMI to use knowledge-intensive techniques on the resulting (much smaller) set of data abstractions.

The second layer was a repair layer, which used knowledge of typical convoy behaviors and locations on the battlespace to construct a “map” of militarily significant traffic and traffic centers. The end result was a network of traffic centers connected by traffic. Three main tasks remain: (1) classify the traffic centers, (2) figure out what the convoys are doing, and (3) identify which units are involved. SMI iteratively answered these questions by using repeated layers of heuristic classification and constraint satisfaction. The heuristic-classification components operated independently of the network, using known (and deduced) facts about single convoys or traffic centers. Consider the following rule for trying to identify a main supply brigade (MSB) site (paraphrased into English, with abstractions in boldface):

If we have a current site which is unclassified
and it’s in the Division support area,
and the traffic is high enough
and the traffic is balanced
and the site is persistent with no major deployments emanating from it
then it’s probably an MSB

SMI used similar rules for the constraint-satisfaction component of its system, allowing information to propagate through the network in a manner similar to Waltz’s (1975) well-known constraint-satisfaction algorithm for edge labeling.

The goal of the SRI group was to induce a knowledge base to characterize and identify types of site such as command posts and battle positions. Its approach was to induce a Bayesian classifier and use a generative model approach, producing a Bayesian network that could serve as a knowledge base. This modeling required transforming raw vehicle tracks into features (for example, the frequency of certain vehicles at sites, number of stops) that could be used to predict sites. Thus, it was also necessary to have hypothetical sites to test. SRI relied on SMI to provide hypothetical sites, and it also used some of the features that SMI computed. As a classifier, SRI used tree-augmented naive (TAN) Bayes (Friedman, Geiger, and Goldszmidt 1997).

Workarounds: The SAIC team integrated two approaches to workaround generation, one developed by USC-ISI, the other by GMU.

ISI developed course-of-action–generation problem solvers to create alternative solutions to workaround problems. In fact, two alternative course-of-action–generation problem solvers were developed. One is a novel planner whose knowledge base is represented in the ontologies, including its operators, state descriptions, and problem-specific information. It uses a novel partial-match capability developed in LOOM (MacGregor 1991). The other is based on a state-of-the-art planner (Veloso et al. 1995). Each solution lists several engineering actions for this workaround (for example, deslope the banks of the river, install a temporary bridge), includes information about the resources used (for example, what kind of earthmoving equipment or bridge is used), and asserts temporal constraints among the individual actions to indicate which can be executed in parallel. A temporal estimation-assessment problem solver evaluates each of the alternatives and selects one as the most likely choice for an enemy workaround. This problem solver was developed in EXPECT (Swartout and Gil 1995; Gil 1994).

Several general battlespace ontologies (for example, military units, vehicles), anchored on the HPKB upper ontology, were used and augmented with ontologies needed to reason about workarounds (for example, engineering equipment). Besides these ontologies, the knowledge bases used included a number of problem-solving methods to represent knowledge about how to solve the task. Both ontologies and problem-solving knowledge were used by two main problem solvers.

EXPECT’s knowledge-acquisition tools were used throughout the evaluation to detect missing knowledge. EXPECT uses problem-solving knowledge and ontologies to analyze which information is needed to solve the task. This capability allows EXPECT to alert a user when there is missing knowledge about a problem (for example, unspecified bridge lengths) or a situation. It also helps debug and refine ontologies by detecting missing axioms and overgeneral definitions.

GMU developed the DISCIPLE98 system. DISCIPLE is an apprenticeship multistrategy learning system that learns from examples, from explanations, and by analogy and can be taught by an expert how to perform domain-specific tasks through examples and explanations in a way that resembles how experts teach apprentices (Tecuci 1998). For the workaround domain, DISCIPLE was extended into a baseline-integrated system that creates an ontology by acquiring concepts from a domain expert or importing them (through OKBC) from shared ontologies. It learns task-decomposition rules from a domain expert and uses this knowledge to solve workaround problems through hierarchical nonlinear planning.
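To make the heuristic-classification step more concrete, the MSB rule paraphrased earlier in this section can be read as a conjunction of tests on a single site. The following is only a minimal sketch of that reading, not SMI's implementation: the `Site` fields, the traffic measure, and both thresholds are hypothetical, because the article does not say how "high enough" or "balanced" were quantified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    """Features of a candidate site; every field name here is hypothetical."""
    classification: Optional[str]   # None means the site is still unclassified
    in_division_support_area: bool
    inbound_vehicles: int
    outbound_vehicles: int
    persistent: bool
    major_deployments: int          # deployments emanating from the site


# Illustrative thresholds only; the article gives no numeric values.
MIN_TRAFFIC = 50        # "the traffic is high enough"
MAX_IMBALANCE = 0.2     # "the traffic is balanced"


def looks_like_msb(site: Site) -> bool:
    """Return True when every condition of the paraphrased MSB rule holds."""
    total = site.inbound_vehicles + site.outbound_vehicles
    if total == 0:
        return False
    # Relative difference between inbound and outbound traffic.
    imbalance = abs(site.inbound_vehicles - site.outbound_vehicles) / total
    return (
        site.classification is None
        and site.in_division_support_area
        and total >= MIN_TRAFFIC
        and imbalance <= MAX_IMBALANCE
        and site.persistent
        and site.major_deployments == 0
    )


candidate = Site(None, True, 40, 35, True, 0)
print(looks_like_msb(candidate))
```

A rule of this shape uses only facts about one site, which matches the statement that the heuristic-classification components operated independently of the network; propagating labels between neighboring sites would be the job of the separate constraint-satisfaction component.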
First, with DISCIPLE’s ontology-building tools, a domain expert assisted by a knowledge engineer built the object ontology from several sources, including expert’s manuals, Alphatech’s FIRE&ISE document and ISI’s LOOM ontology. Second, a task taxonomy was defined by refining the task taxonomy provided by Alphatech. This taxonomy indicates principled decompositions of generic workaround tasks into subtasks but does not indicate the conditions under which such decompositions should be performed. Third, the examples of hierarchical workaround plans provided by Alphatech were used to teach DISCIPLE. Each such plan provided DISCIPLE with specific examples of decompositions of tasks into subtasks, and the expert guided DISCIPLE to “understand” why each task decomposition was appropriate in a particular situation. From these examples and the explanations of why they are appropriate in the given situations, DISCIPLE learned general task-decomposition rules. After a knowledge base consisting of an object ontology and task-decomposition rules was built, the hierarchical nonlinear planner of DISCIPLE was used to automatically generate workaround plans for new workaround problems.

Evaluation

The SAIC and Teknowledge integrated systems for crisis management, movement analysis, and workarounds were tested in an extensive study in June 1998. The study followed a two-phase, test-retest schedule. In the first phase, the systems were tested on problems similar to those used for system development, but in the second, the problems required significant modifications to the systems. Within each phase, the systems were tested and retested on the same problems. The test at the beginning of each phase established a baseline level of performance, but the test at the end measured improvement during the phase.

We claim that HPKB technology facilitates rapid modification of knowledge-based systems. This claim was tested in both phases of the experiment because each phase allows time to improve performance on test problems. Phase 2 provides a more stringent test: Only some of the phase 2 problems can be solved by the phase 1 systems, so the systems were expected to perform poorly in the test at the beginning of phase 2. The improvement in performance on these problems during phase 2 is a direct measure of how well HPKB technology facilitates knowledge capture, representation, merging, and modification.

Each challenge problem is evaluated by different metrics. The test items for crisis management were questions, and the test was similar to an exam. Overall competence is a function of the number of questions answered correctly, but the crisis-management systems are also expected to “show their work” and provide justifications (including sources) for their answers. Examples of questions, answers, and justifications for crisis management are shown in the section entitled Crisis-Management Challenge Problem.

Performance metrics for the movement-analysis problem are related to recall and precision. The basic problem is to identify sites, vehicles, and purposes given vehicle track data; so, performance is a function of how many of these entities are correctly identified and how many incorrect identifications are made. In general, identifications can be marked down on three dimensions: First, the identified entity can be more or less like the actual entity; second, the location of the identified entity can be displaced from the actual entity’s true location; and third, the identification can be more or less timely.

The workaround problem involves generating workarounds to military actions such as bombing a bridge. Here, the criteria for successful performance include coverage (the generation of all workarounds generated), appropriateness (the generation of workarounds appropriate given the action), specificity (the exact implementation of the workaround), and accuracy of timing inferences (the length each step in the workaround takes to implement).

Performance evaluation, although essential, tells us relatively little about the HPKB integrated systems, still less about the component technologies. We also want to know why the systems perform well or poorly. Answering this question requires credit assignment because the systems comprise many technologies. We also want to gather evidence pertinent to several important, general claims. One claim is that HPKB facilitates rapid construction of knowledge-based systems because ontologies and knowledge bases can be reused. The challenge problems by design involve broad, relatively shallow knowledge in the case of crisis management and deep, fairly specific knowledge in the battlespace problems. It is unclear which kind of problem most favors the reuse claim and why. We are developing analytic models of reuse. Although the predictions of these models will not be directly tested in the first year’s evaluation, we will gather data to calibrate these models for a later evaluation.
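The movement-analysis metrics are described above only as "related to recall and precision." As a point of reference, the textbook versions of the two measures can be computed as in the sketch below. It assumes an exact-match notion of a correct identification, which is a simplification: the evaluation marked identifications down by similarity, displacement, and timeliness rather than treating them as simply right or wrong. The function name and the example entities are illustrative.

```python
def precision_recall(identified: set, actual: set) -> tuple:
    """Standard precision and recall over sets of identifications.

    precision = correct identifications / all identifications made
    recall    = correct identifications / all entities actually present
    """
    correct = len(identified & actual)
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: four identifications, three of them correct,
# against six entities actually present in the scenario.
identified = {"cp-1", "cp-2", "bp-1", "bp-9"}
actual = {"cp-1", "cp-2", "bp-1", "bp-2", "bp-3", "asa-1"}
precision, recall = precision_recall(identified, actual)
print(precision, recall)
```

Precision penalizes the "incorrect identifications" mentioned in the text, and recall penalizes entities that were never identified; a graded version would replace the set intersection with a partial-credit match score.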
  • 15. Results of the Challenge Problem Evaluation

We present the results of the crisis-management evaluation first, followed by the results of the battle-space evaluation.

Crisis Management

The evaluation of the SAIC and Teknowledge crisis-management systems involved 7 trials or batches of roughly 110 questions. Thus, more than 1500 answers were manually graded by the challenge problem developer, IET, and subject matter experts at PSR on criteria ranging from correctness to completeness of source material to the quality of the representation of the question. Each question in a batch was posed in English accompanied by the syntax of the corresponding parameterized question (figure 3). The crisis-management systems were supposed to translate these questions into an internal representation, MELD for the Teknowledge system and KIF for the SAIC system. The MELD translator was operational for all the trials; the KIF translator was used to a limited extent on later trials.

The first trial involved testing the systems on the sample questions that had been available for several months for training. The remaining trials implemented the “test and retest with scenario modification” strategy discussed earlier. The first batch of test questions, TQA, was repeated four days later as a retest; it was designated TQA’ for scoring purposes. The difference in scores between TQA and TQA’ represents improvements in the systems. After solving the questions in TQA’, the systems tackled a new set, TQB, designed to be “close to” TQA. The purpose of TQB was to check whether the improvements to the systems generalized to new questions. After a short break, a modification was introduced into the crisis-management scenario, and new fragments of knowledge about the scenario were released. Then, the cycle repeated: A new batch of questions, TQC, tested how well the systems coped with the scenario modification; then after four days, the systems were retested on the same questions, TQC’, and on the same day, a final batch, TQD, was released and answered.

Each question in a trial was scored according to several criteria, some official and others optional. The four official criteria were (1) the correctness of the answer, (2) the quality of the explanation of the answer, (3) the completeness and quality of cited sources, and (4) the quality of the representation of the question. The optional criteria included lay intelligibility of explanations, novelty of assumptions, quality of the presentation of the explanation, the automatic production by the system of a representation of the question, source novelty, and reconciliation of multiple sources. Each question could garner a score between 0 and 3 on each criterion, and the criteria were themselves weighted. Some questions had multiple parts, and the number of parts was a further weighting criterion. In retrospect, it might have been clearer to assign each question a percentage of the points available, thus standardizing all scores, but in the data that follow, scores are on an open-ended scale. Subject-matter experts were assisted with scoring the quality of knowledge representations when necessary.

A web-based form was developed for scoring, with clear instructions on how to assign scores. For example, on the correct-answer criterion, the subject-matter expert was instructed to award “zero points if no top-level answer is provided and you cannot infer an intended answer; one point for a wrong answer without any convincing arguments, or most required answer elements; two points for a partially correct answer; three points for a correct answer addressing most required elements.”

When you consider the difficulty of the task, both systems performed remarkably well. Scores on the sample questions were relatively high, which is not surprising because these questions had been available for training for several months (figure 5). It is also not surprising that scores on the first batch of test questions (TQA) were not high. It is gratifying, however, to see how scores improve steadily between test and retest (TQA and TQA’, TQC and TQC’) and that these gains are general: The scores on TQA’ and TQB and TQC’ and TQD were similar.

The scores designated auto in figure 5 refer to questions that were translated automatically from English into a formal representation. The Teknowledge system translated all questions automatically, the SAIC system very few. Initially, the Teknowledge team did not manipulate the resulting representations, but in later batches, they permitted themselves minor modifications. The effects of these can be seen in the differences between TQB and TQB-Auto, TQC and TQC-Auto, and TQD and TQD-Auto.

Although the scores of the Teknowledge and SAIC systems appear close in figure 5, differences between the systems appear in other views of the data. Figure 6 shows the performance of the systems on all official questions plus a few optional questions. Although these extra questions widen the gap between the systems, the real effect comes from adding optional components to the scores. Here,
  • 16. Teknowledge got credit for its automatic translation of questions and also for the lay intelligibility of its explanations, the novelty of its assumptions, and other criteria.

[Figure 5: bar chart titled “Official Questions and Official Criteria Only”; vertical axis is average score (0 to 140); bars for the TFS and SAIC systems on each batch from SQ through TQD, including the Auto variants.]
Figure 5. Scores on the Seven Batches of Questions That Made Up the Crisis-Management Challenge Problem Evaluation. Some questions were automatically translated into a formal representation. The scores for these are denoted with the suffix Auto.

Recall that each question is given a score between zero and three on each of four official criteria. Figure 7 represents the distribution of the scores 0, 1, 2, and 3 for the correctness criterion over all questions for each batch of questions and each system. A score of 0 generally means the question was not answered. The Teknowledge system answered more questions than the SAIC system, and it had a larger proportion of perfect scores.

Interestingly, if one looks at the number of perfect scores as a fraction of the number of questions answered, the systems are not so different. One might argue that this comparison is invalid, that one’s grade on a test depends on all the test items, not just those one answers. However, suppose one is comparing students who have very different backgrounds and preparation; then, it can be informative to compare them on both the number and quality of their answers. The Teknowledge-SAIC comparison is analogous to a comparison of students with very different backgrounds. Teknowledge used CYC, a huge, mature knowledge base, whereas SAIC used CYC’s upper-level ontology and a variety of disparate knowledge bases. The systems were not at the same level of preparation at the time of the experiment, so it is informative to ask how each system performed on the questions it answered. On the sample questions, Teknowledge got perfect scores on nearly 80 percent of the questions it answered, and SAIC got perfect scores on nearly 70 percent of the questions it answered. However, on TQA, TQA’, and TQB, the systems are very similar, getting perfect scores on roughly 60 percent, 70 percent, and 60 percent of the questions they answered. The gap widens to a roughly 10-percent spread in Teknowledge’s favor on TQC, TQC’, and TQD. Similarly, when one averages the scores of answered questions in each trial, the averages for the Teknowledge and SAIC systems are similar on all but the sample-question trial. Even so, the Teknowledge system had both higher coverage (questions answered) and higher correctness on all trials.

If one looks at the minimum of the official component scores for each question (answer correctness, explanation adequacy, source adequacy, and representation adequacy), the difference between the SAIC and Teknowledge systems is quite pronounced. Calculating the minimum component score is like finding something to complain about in each answer: the answer or the explanation, the sources or the question representation. It is the statistic of most interest to a potential user who would dismiss a system for failure on any of these criteria. Figure 8 shows that neither system did very well on this stringent criterion, but the Teknowledge system got a perfect or good minimum score (3 or 2) roughly twice as often as the SAIC system in each trial.
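The two views of the correctness data discussed above (perfect scores as a share of all questions versus as a share of answered questions) can be made concrete with a short Python sketch. This is only an illustration, not the evaluators’ scoring code: the function name and the sample batches are hypothetical, and the sketch assumes that a correctness score of 0 marks an unanswered question, which the text says is generally, not always, the case.

```python
def perfect_score_rates(correctness_scores):
    """Return (rate over all questions, rate over answered questions).

    Assumption (not exact in the article): a score of 0 means the
    question was not answered.
    """
    total = len(correctness_scores)
    answered = [s for s in correctness_scores if s > 0]
    perfect = sum(1 for s in correctness_scores if s == 3)
    rate_all = perfect / total if total else 0.0
    rate_answered = perfect / len(answered) if answered else 0.0
    return rate_all, rate_answered


# Hypothetical batches of ten questions (not data from the evaluation).
system_a = [3, 3, 3, 2, 3, 1, 3, 3, 0, 2]  # answers 9 of 10 questions
system_b = [3, 3, 0, 0, 3, 2, 0, 3, 0, 1]  # answers 6 of 10 questions

print(perfect_score_rates(system_a))
print(perfect_score_rates(system_b))
```

In this made-up example the two systems differ on the all-questions rate but tie on the answered-questions rate, which is the pattern the comparison above describes.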
  • 17. [Figure 6: bar chart titled “All Questions and All Criteria”; vertical axis is average score (0 to 180); bars for the TFS and SAIC systems on each batch from SQ through TQD, including the Auto variants.]
Figure 6. Scores on All Questions (Including Optional Ones) and Both Official and Optional Criteria.

Movement Analysis

The task for movement analysis was to identify sites (command posts, artillery sites, battle positions, and so on) given data about vehicle movements and some intelligence sources. Another component of the problem was to identify convoys. The data were generated by simulation for an area of approximately 10,000 square kilometers over a period of approximately 4 simulated days. The scenario involved 194 military units arranged into 5 brigades, 1848 military vehicles, and 7666 civilian vehicles. Some units were collocated at sites; others stood alone. There were 726 convoys and nearly 1.8 million distinct reports of vehicle movement.

Four groups developed systems for all or part of the movement-analysis problem, and SAIC and Teknowledge provided integration support. The groups were Stanford’s SMI, SRI, UMass, and MIT. Each site identification was scored for its accuracy, and recall and precision scores were maintained for each site. Suppose an identification asserts at time t that a battalion command post exists at a location (x, y). To score this identification, we find all sites within a fixed radius of (x, y). Some site types are very similar to others. For example, all command posts have similar characteristics; so, if one mistakes a battalion command post for, say, a division command post, then one should get partial credit for the identification. The identification is incorrect but not as incorrect as, say, mistaking a command post for an artillery site. Entity error ranges from 0 for completely correct identifications to 1 for hopelessly wrong identifications. Fractional entity errors provide partial credit for near-miss identifications. If one of the sites within the radius of an identification matches the identification (for example, a battalion command post), then the identification score is 0, but if none matches, then the score is the average entity error for the identification matched against all the sites in the radius. If no site exists within the radius of an identification, then the identification is a false positive.

Recall and precision rates for sites are defined in terms of entity error. Let H be the number of sites identified with zero entity error, M be the number of sites identified with entity error less than one (near misses), and R be the number of sites identified with maximum entity error. Let T be the total number of sites, N be the total number of identifications, and F be the number of false positives. The following statistics describe the performance of the movement-analysis systems:

zero-entity error recall = H / T
non-one-entity error recall = (H + M) / T
maximum-entity error recall = (H + M + R) / T
zero-entity error precision = H / N
non-one-entity error precision = (H + M) / N
maximum-entity error precision = (H + M + R) / N
false positive rate = F / N
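The scoring procedure and the seven statistics above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scoring program used in the evaluation: the entity-error table, site type names, coordinates, and radius are all hypothetical placeholders, M is taken to mean an entity error strictly between 0 and 1, and each identification is counted once (the text does not say how repeated identifications of one site were handled).

```python
import math

# Hypothetical entity errors between site types (0 = same, 1 = hopelessly
# wrong). The real values used in the evaluation are not given in the text.
ENTITY_ERROR = {
    ("battalion_cp", "division_cp"): 0.3,
    ("artillery", "battalion_cp"): 1.0,
    ("artillery", "division_cp"): 1.0,
}


def entity_error(claimed, actual):
    """Entity error between a claimed and an actual site type."""
    if claimed == actual:
        return 0.0
    key = tuple(sorted((claimed, actual)))
    return ENTITY_ERROR.get(key, 1.0)


def score_identification(claimed, x, y, sites, radius):
    """Return the entity error of one identification, or None if it is a
    false positive (no site within the radius)."""
    nearby = [t for (t, sx, sy) in sites
              if math.hypot(sx - x, sy - y) <= radius]
    if not nearby:
        return None
    errors = [entity_error(claimed, t) for t in nearby]
    if 0.0 in errors:                 # some nearby site matches exactly
        return 0.0
    return sum(errors) / len(errors)  # average error over nearby sites


def site_statistics(scores, total_sites):
    """Compute the recall, precision, and false positive statistics."""
    n = len(scores)
    h = sum(1 for s in scores if s == 0.0)
    m = sum(1 for s in scores if s is not None and 0.0 < s < 1.0)
    r = sum(1 for s in scores if s == 1.0)
    f = sum(1 for s in scores if s is None)
    return {
        "zero_ee_recall": h / total_sites,
        "non_one_ee_recall": (h + m) / total_sites,
        "max_ee_recall": (h + m + r) / total_sites,
        "zero_ee_precision": h / n,
        "non_one_ee_precision": (h + m) / n,
        "max_ee_precision": (h + m + r) / n,
        "false_positive_rate": f / n,
    }


# Hypothetical scenario: three true sites and four identifications.
sites = [("battalion_cp", 0, 0), ("artillery", 10, 10), ("division_cp", 20, 0)]
identifications = [
    ("battalion_cp", 0.5, 0.5),  # exact match
    ("battalion_cp", 20, 1),     # near miss (division command post)
    ("battalion_cp", 10, 9),     # wrong type (artillery site)
    ("artillery", 50, 50),       # nothing nearby: false positive
]
scores = [score_identification(c, x, y, sites, radius=2.0)
          for (c, x, y) in identifications]
stats = site_statistics(scores, total_sites=len(sites))
print(scores)
print(stats)
```

The sketch assumes at least one site and one identification; a production scorer would need to guard against empty inputs.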
  • 18. [Figure 7: stacked horizontal bars, one per system (SAIC, TFS) and batch, showing the share of correctness scores equal to 3, 2, 1, and 0; horizontal axis is “Distribution of Scores” (0% to 100%).]
Figure 7. The Distribution of Correctness Scores for the Crisis-Management Challenge Problem. Correctness is one of the official scoring criteria and takes values in {0, 1, 2, 3}, with 3 being the highest score. This graph shows the proportion of the correctness scores that are 0, 1, 2, and 3. The lower bar, for example, shows that roughly 70 percent of the correctness scores were 3, and roughly 10 percent were 1 or 0.
  • 19. [Figure 8: stacked horizontal bars, one per system (SAIC, TFS) and batch, showing the share of minimum component scores equal to 3, 2, 1, and 0; horizontal axis is “Distribution of Minimum Scores” (0% to 100%).]
Figure 8. The Distribution of Minimum Scores for the Crisis-Management Challenge Problem. Each question is awarded four component scores corresponding to the four official criteria. This graph shows the distribution of the minimum of these four components over questions. With the lowest bar, for example, on 20 percent of the questions, the minimum of the 4 component scores was 3.
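The statistic plotted in figure 8 can be reproduced in a few lines of Python. The following is a sketch only: the criterion keys, function names, and the five sample questions are hypothetical, and the weighting of criteria used elsewhere in the evaluation is deliberately ignored because the minimum is taken over the raw 0 to 3 component scores.

```python
from collections import Counter

# The four official criteria; the key names are illustrative.
CRITERIA = ("correctness", "explanation", "sources", "representation")


def minimum_component(question_scores):
    """Lowest of the four official component scores for one question."""
    return min(question_scores[c] for c in CRITERIA)


def minimum_distribution(questions):
    """Fraction of questions whose minimum component is 0, 1, 2, or 3."""
    if not questions:
        return {v: 0.0 for v in (0, 1, 2, 3)}
    counts = Counter(minimum_component(q) for q in questions)
    return {v: counts.get(v, 0) / len(questions) for v in (0, 1, 2, 3)}


# Hypothetical batch of five scored questions (not evaluation data).
batch = [
    {"correctness": 3, "explanation": 3, "sources": 3, "representation": 3},
    {"correctness": 3, "explanation": 2, "sources": 3, "representation": 3},
    {"correctness": 3, "explanation": 3, "sources": 1, "representation": 2},
    {"correctness": 0, "explanation": 0, "sources": 0, "representation": 0},
    {"correctness": 3, "explanation": 3, "sources": 3, "representation": 2},
]

distribution = minimum_distribution(batch)
print(distribution)
```

A question scores well on this statistic only if every component is good, which is why it suits a user who would reject a system for failing on any single criterion.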
  • 20. [Figure 9: horizontal bar chart (axis 0 to 80) showing non-one EE precision, zero EE precision, and max EE recall for each trial: UMass B Test Mil; UMass A Retest Mil; UMass A Test Mil; SMI B Test Mil; SMI A Test Mil/Mil+Civ; SMI A Test Mil; MIT B Test Mil/Mil+Civ; MIT B Test Mil; MIT A Retest Mil/Mil+Civ; MIT A Retest Mil; MIT A Test Mil/Mil+Civ; MIT A Test Mil.]
Figure 9. Scores for Detecting and Identifying Sites in the Movement-Analysis Problem. Trials labeled A are from the first phase of the experiment, before the scenario modification. Trials labeled B are from after the scenario modification. Mil and Mil + Civ refer to data sets with military traffic only and with both military and civilian traffic, respectively.

Recall and precision scores for the groups are shown in figure 9. The experiment design involved two phases with a test and retest within each phase. Additionally, the data sets for each test included either military traffic only or military plus civilian traffic. Had each group run their systems in each experimental condition, there would have been 4 tests, 2 data sets with 2 versions of each (military only and military plus civilian), and 4 groups, or 64 conditions to score. Not all of these conditions were run. Delays were introduced by the process of reformatting data to make it compatible with a scoring program, so scores were not released in time for groups to modify their systems; one group devoted much time to scoring and did not participate in all trials, and another group participated in no trials for reasons discussed at the end of this section.

The trials that were run are reported in figure 9. Recall rates range from 40 percent to just over 60 percent, but these are for maximum-entity error recall, that is, the frequency of detecting a site when one exists but not identifying it correctly. Rates for correct and near-miss identification are lower, ranging from 10 percent to 15 percent and are not shown in figure 9. Of the participating research groups, MIT did not attempt to identify the types of site, only whether sites are present; so, their entity error is always maximum. The other groups did not attempt to identify all kinds of site; for example, UMass tried to identify only battle positions, command posts, and assembly-staging areas. Even so, each group was scored against all site types. Obviously, the scores would be somewhat higher had the groups been scored against only the kinds of site they intended to identify (for example, recall rates for UMass ranged from 20 percent to 39 percent for the sites they tried to identify).

Precision rates are reported only for the SMI and UMass teams because MIT did not try to identify the types of site. UMass’s precision was highest on its first trial; a small modification to its system boosted recall but at a significant cost to precision. SMI’s precision hovered around 20 percent on all trials.

Scores were much higher for convoy detection. Although scoring convoy identifications posed some interesting technical challenges, the results were plain enough: SMI, UMass,
  • 21. and MIT detected 507 (70 percent), 616 (85 percent), and 465 (64 percent) of the 726 convoys, respectively. The differences between these figures do not indicate that one technology was superior to another because each group emphasized a slightly different aspect of the problem. For example, SMI didn’t try to detect small convoys (fewer than four vehicles). In any case, the scores for convoy identifications are quite similar.

In sum, none of the systems identified site types accurately, although they identified all detected sites and convoys pretty well. The reason for these results is that sites can be detected by observing vehicle halts, just as convoys can be detected by observing clusters of vehicles. However, it is difficult to identify site types without more information. It would be worth studying the types of information that humans require for the task.

The experience of SRI is instructive: The SRI system used features of movement data to infer site types; unfortunately, the data did not seem to support the task. There were insufficient quantities of training data to train classifiers (fewer than 40 instances of sites), but more importantly, the data did not have sufficient underlying structure: it was too random. The SRI group tried a variety of classifier technologies, including identification trees, nearest neighbor, and C4.5, but classification accuracy remained in the low 30-percent range. This figure is comparable to those released by UMass on the performance of its system on three kinds of site. Even when the UMass recall figures are not diluted by sites they weren’t looking for, they did not exceed 40 percent. SRI and UMass both performed extensive exploratory data analysis, looking for statistically significant relationships between features of movement data and site types, with little success, and the disclosed regularities were not strongly predictive. The reason these teams found little statistical regularity is probably that the data were artificial. It is extraordinarily difficult to build simulators that generate data with the rich dependency structure of natural data.

Some simulated intelligence and radar information were released with the movement data. Although none of the teams reported finding it useful, we believe these kinds of data probably [help] human analysts. One assumes humans perform the task better than the systems reported here, but because no human ever solved these movement-analysis problems, one cannot tell whether the systems performed comparatively well or poorly. In any case, the systems took a valuable first step toward movement analysis, detecting most convoys and sites. Identifying the sites accurately will be a challenge for the future.

Workarounds

The workaround problem was evaluated in much the same way as the other problems: The evaluation period was divided into two phases, and within each phase, the systems were given a test and a retest on the same problems. In the first phase, the systems were tested on 20 problems and retested after a week on the same problems; in the second phase, a modification was introduced into the scenario, the systems were tested on five problems and after a week retested on the same five problems and five new ones.

Solutions were scored along five equally weighted dimensions: (1) viability of enumerated workaround options, (2) correctness of the overall time estimate for a workaround, (3) correctness of solution steps provided for each viable option, (4) correctness of temporal constraints among these steps, and (5) appropriateness of engineering resources used. Scores were assigned by comparing the systems’ answers with those of human experts. Bonus points were awarded when, occasionally, systems gave better answers than the experts. These answers became the gold standard for scoring answers when the systems were retested.

Four systems were fielded by ISI, GMU, Teknowledge, and AIAI. The results are shown in figures 10 and 11. In each figure, the upper extreme of the vertical axis represents the maximum score a system could get by answering all the questions correctly (that is, 200 points for the initial phase, 50 points for the first test in the modification phase, 100 points for the retest in the modification phase). The bars represent the number of points scored, and the circles represent the number of points that could have been awarded given the number of questions that were actually answered. For example, in the initial phase, ISI answered all questions, so it could have been awarded 200 points on the test and 200 on the retest; GMU covered only a portion of the domain, and it could have been awarded a maximum of 90 and 100 points, respectively. The bars represent the number of points actually scored by each system.

How one views the performance of these systems depends on how one values correctness; coverage (the number of questions answered); and, more subtly, the prospects for scaling the systems to larger problem sets. An assistant should answer any question posed to it, but if the system is less than ideal, should it
  • 22. answer more questions with some errors or fewer questions with fewer errors? Obviously, the answer depends on the severity of errors, the application, and the prospect for improving system coverage and accuracy. What we might call errors of specificity, in which an answer is less specific or complete than it should be, are not inconsistent with the philosophy of HPKB, which expects systems to give partial (even commonsense) answers when they lack specific knowledge.

[Figure 10: bar chart titled “Initial Phase: Test and Retest Scores and Scopes”; vertical axis is total available points (0 to 200); bars for AIAI, GMU, ISI, and TFS.]
Figure 10. Results of the Initial Phase of the Workaround Evaluation. Lighter bars represent test scores; darker bars represent retest scores. Circles represent the scope of the task attempted by each system, that is, the best score that the system could have received given the number of questions it answered.

Figures 10 and 11 show that USC-ISI was the only group to attempt to solve all the workaround problems, although its answers were not all correct, whereas GMU solved fewer problems with higher overall correctness. AIAI solved fewer problems still, but quite correctly, and Teknowledge solved more problems than AIAI with more variable correctness. One can compute coverage and correctness scores as follows: Coverage is the number of questions attempted divided by the total number of questions in the experiment (55 in this case). Correctness is the total number of points awarded divided by the number of points that might have been awarded given the number of answers attempted. Figure 12 shows a plot of coverage against correctness for all the workaround systems. Points above and to the right of other points are superior; thus, the ISI and GMU systems are preferred to the other systems, but the ranking of these systems depends on how one values coverage and correctness.

Two factors complicate comparisons between the systems. First, if one is allowed to select a subset of problems for one’s system to solve, one might be expected to select problems on which one hopes the system will do well. Thus, ISI’s correctness scores are not strictly comparable to those of the other systems because ISI did not select problems but solved them all. Said differently, what we call correctness is really “correctness in one’s declared scope.” It is not difficult to get a perfect correctness score if one selects just one problem; conversely, it is not easy to get a high correctness score if one solves all problems.

Second, figure 12 shows that coverage can be increased by knowledge engineering, but at what cost to correctness? GMU believes correctness need not suffer with increased coverage and cites the rapid growth of its knowledge base. At the beginning of the evaluation period, the coverage of the knowledge base was about 40 percent (11841 binary predicates), and two weeks later, the coverage had grown significantly to roughly 80 percent of the domain (20324 binary predicates). In terms of coverage and correctness (figure 12), GMU’s coverage increased from 47.5 percent in the test phase to 80 percent in the modification phase, with almost no decrement in correctness. However, ISI suggests that the complexity of the workaround task increases significantly as the coverage is extended, to the point where a perfect score might be out of reach given the
  • 23. state of the art in AI technology. For example, in some cases, the selection of resources requires combining plan generation with reasoning about temporal aspects of the (partially generated) plan. Moreover, interdependencies in the knowledge bases grow nonlinearly with coverage because different aspects of the domain interact and must be coordinated. For example, including mine-clearing operations changes the way one looks at how to ford a river. Such interactions in large knowledge bases might degrade the performance of a system when changes are made during a retest and modification phase.

[Figure 11: bar chart titled “Modification Phase: Test and Retest Scores and Scopes”; vertical axis is total available points (0 to 100); bars for AIAI, GMU, ISI, and TFS.]
Figure 11. Results of the Modification Phase of the Workaround Evaluation. Lighter bars represent test scores; darker bars represent retest scores. Circles represent the scope of the task attempted by each system, that is, the best score that the system could have received given the number of questions it answered. Note that only 5 problems were released for the test, 10 for the retest; so, the maximum available points for each are 50 and 100, respectively.

The coverage versus correctness debate should not cloud the accomplishments of the workaround systems. Both ISI and GMU were judged to perform the workaround task at an [expert] level. All the systems were developed quite rapidly (indeed, GMU’s knowledge base doubled during the experimental period itself), and knowledge reuse was prevalent. These results take us some way toward the HPKB goal of very rapid development of powerful and flexible knowledge base systems.

Evaluation Lessons Learned

The experiments reported here constitute one of the larger studies of AI technology. In addition to the results of these experiments, much was learned about how to conduct such studies. The primary lesson is that one should release the challenge problem specifications early. Many difficulties can be traced to delays in the release of the battlespace specifications. Experiment procedures that involve multiple sites and technologies must be debugged in dry-run experiments before the evaluation period begins. There must be agreement on the formats of input data, answers, and answer keys, a mundane requirement but one that caused delays in movement-analysis evaluation.

A major contributor to the success of the crisis-management problem was the parameterized question syntax, which gave all participants a concise representation of the kinds of problem they would solve during the evaluation. Similarly, evaluation metrics for crisis management were published relatively [early].

Initially, there was a strong desire for evaluation criteria to be objective enough for a machine to score performance. Relaxing this requirement had a positive effect in crisis management. The criteria were objective in the sense that experts followed scoring guidelines, but we could ask the experts to assess the quality of a system’s explanation, something no program can do. However, scoring roughly 1500 answers in crisis management took its toll on the experts and IET. It is worth considering whether equally informative results could be had at a lower cost.

  • 24. The HPKB evaluations were set up as friendly competitions between the integration teams, SAIC and Teknowledge, each incorporating a subset of the technology in a unique architecture. However, some technology was used by both teams, and some was not tested in integrated systems but rather as stand-alone components. The purpose of the friendly competition, to evaluate merits of different technologies and architectures, was not fully realized. Given the cost of developing HPKB systems, we might reevaluate the purported and actual benefits of competitions.

[Figure 12: scatter plot of correctness (vertical axis, 0 to 1.0) against coverage (horizontal axis, 0 to 1.0), with points for GMU, ISI, AIAI, and Teknowledge.]
Figure 12. A Plot of Overall Coverage against Overall Correctness for Each System on the Workaround Challenge Problem.

Conclusion

HPKB is an ambitious program, a high-risk, high-payoff venture beyond the edge of knowledge-based technology. In its first year, HPKB suffered some setbacks, but these were informative and will help the program in future. More importantly, HPKB celebrated some extraordinary successes. It has long been a dream in AI to ask a program a question in natural language about almost anything and receive a comprehensive, intelligible, well-informed answer. Although we have not achieved this aim, we are getting close. In HPKB, questions were posed in natural language, questions on a hitherto impossible range of topics were answered, and the answers were supported by page after page of sources and justifications. Knowledge bases were constructed rapidly from many sources, reuse of knowledge was a significant factor, and semantic integration across knowledge bases started to become feasible. HPKB has integrated many technologies, cutting through traditional groupings in AI, to develop new ways of acquiring, organizing, sharing, and reasoning with knowledge, and it has nourished a remarkable assault on perhaps the grandest of the AI grand challenges: to build intelligent programs that answer questions about everyday things, important things like international relations but also ordinary things like river banks.

Acknowledgments

The authors wish to thank Stuart Aitken, Vinay Chaudhri, Cleo Condoravdi, Jon Doyle, John Gennari, Yolanda Gil, Moises Goldszmidt, William Grosso, Mark Musen, Bill Swartout, and Gheorghe Tecuci for their help preparing this article.

Notes

1. Many HPKB resources and documents are found at [...]. A web site devoted to the Year 1 HPKB Challenge Problem Conference can be found at www.[...]ing/. Definitive information on the year 1 crisis-management challenge problem, including updated specification, evaluation procedures, and evaluation results, is available at www.iet.com/Projects/HPKB/Y1Eval/. Earlier, draft information about both challenge problems is available at [...]HPKB/Combined-CPs.doc. Scoring procedures and data for the movement-analysis problem are available at [...]search/hpkb/scores/. For additional sources, contact the authors.

2. These numbers and all information regarding MTI radar are approximate; actual figures are classified.

References

Friedman, N.; Geiger, D.; and Goldszmidt, M. 1997. Bayesian Network Classifiers. Machine Learning 29:131–163.

Gil, Y. 1994. Knowledge Refinement in a Reflective Architecture. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 520–526. Menlo Park, Calif.: American Association for Artificial Intelligence.

Jones, E. 1998. Formalized Intelligence Report Ensemble (FIRE) & Intelligence Stream Encoding (ISE). Memo, Alphatech.

Katz, B. 1997. From Sentence Processing to Information Access on the World Wide Web. Paper presented at the AAAI Spring Symposium on Natural Language
  • 25. Paul Cohen is professor of computer science at the University of Massachusetts. He received his Ph.D. from Stanford University in 1983. His publications include Empirical Methods for Artificial Intelligence (MIT Press, 1995) and The Handbook of Artificial Intelligence, Volumes 3 and 4 (Addison-Wesley), which he edited with Edward A. Feigenbaum and Avron B. Barr. Cohen is developing theories of how sensorimotor agents such as robots and infants learn ontologies.

Robert Schrag is a principal scientist at Information Extraction & Transport, Inc., in Rosslyn, Virginia. Schrag has served as principal investigator to develop the high-performance knowledge base crisis-management challenge problem. Besides evaluating emerging AI technologies, his research interests include constraint satisfaction, propositional reasoning, and temporal reasoning.

Eric Jones is a senior member of the technical staff at Alphatech Inc. From 1992 to 1996, he was a member of the academic staff in the Department of Computer Science at Victoria University, Wellington, New Zealand. He received his Ph.D. in AI from Yale University in 1992. He also has degrees in mathematics and geography from the University of Otago, New Zealand. His research interests include case-based reasoning, natural language processing, and statistical approaches to machine learning. He serves on the American Meteorological Society Committee on AI Applications to Environmental Science.

Adam Pease received his B.S. and M.S. in computer science from Worcester Polytechnic Institute. He has worked in the area of knowledge-based systems for the past 10 years. His current interests are in planning ontologies and large knowledge-based systems. He leads an integration team for Teknowledge on the High-Performance Knowledge Base Project.

Albert D. Lin received a B.S.E.E. from Feng Chia University in Taiwan in 1984 and an M.S. in electrical and computer engineering from San Diego State University in 1990. He is a technical lead at Science Applications International Corporation, currently leading the High-Performance Knowledge Base Project.

Barbara H. Starr received a B.S. from the University of Witwatersrand in South Africa in 1980 and an Honors degree in computer science at the University of Witwatersrand in 1981. She has been a consultant in the areas of AI and object-oriented software development in the United States for the past 13 years and in South Africa for several years before that. She is currently leading the crisis-management integration effort for the High-Performance Knowledge Base Project at Science Applications International Corporation.

David Gunning has been the program manager for the High-Performance Knowledge Base Project since its inception in 1995 and just became the assistant director for Command and Control in the Defense Advanced Research Projects Agency’s research projects in Command and Control, which include work in intelligent reasoning and decision aids for crisis management, air campaign planning, course-of-action analysis, and logistics planning. Gunning has an M.S. in computer science from Stanford University and an M.S. in cognitive psychology from the University of Dayton.

Murray Burke has served as the High-Performance Knowledge Base Project deputy program manager for the past year, and the Defense Advanced Research Projects Agency has selected him as its new program manager. He has over 30 years in command and control information systems research and development. He was cofounder and executive vice president of Knowledge Systems Concepts, Inc., a software company specializing in defense applications of advanced information technologies. Burke is a member of the American Association for Artificial Intelligence, the Association of Computing Machinery, and the Institute of Electrical and Electronics Engineers.