The document summarizes the High-Performance Knowledge Bases Project, which aims to develop very large, flexible, reusable knowledge bases through the integration of multiple technologies. It discusses two challenge problems involving crisis management and battlespace reasoning being used to evaluate systems. The project involves researchers developing ontologies and knowledge bases containing tens of thousands of rules to handle these scenarios. It also discusses the need for commonsense knowledge and different types of inference methods to achieve intelligent performance on open-ended problems.
2. Articles
problem of building on these foundations to Often, one will accept an answer that is
populate very large knowledge bases. The goal roughly correct, especially when the alterna-
is for collaborating teams of domain experts tives are no answer at all or a very specific but
(who might lack training in computer science) wrong answer. This is Lenat and Feigenbaum’s
to easily extend the foundation theories, breadth hypothesis: “Intelligent performance
define additional domain theories and prob- often requires the problem solver to fall back
lem-solving strategies, and acquire domain on increasingly general knowledge, and/or to
facts. Third, because knowledge is not enough, analogize to specific knowledge from far-flung
one also requires efficient problem-solving domains” (Lenat and Feigenbaum 1987, p.
methods. HPKB supports research on efficient, 1173). We must, therefore, augment high-pow-
general inference methods and optimized task- er knowledge-based systems, which give spe-
specific methods. cific and precise answers, with weaker but ade-
HPKB is a timely impetus for knowledge- quate knowledge and inference. The inference
based technology, although some might think methods might not all be sound and complete.
it overdue. Some of the tenets of HPKB were Indeed, one might need a multitude of meth-
voiced in 1987 by Doug Lenat and Ed Feigen- ods to implement what Polya called plausible
baum (Lenat and Feigenbaum 1987), and some inference. HPKB encompasses work on a vari-
have been around for longer. Lenat’s CYC Pro- ety of logical, probabilistic, and other infer-
ject has also contributed much to our under- ence methods.
standing of large knowledge bases and ontolo- It is one thing to recognize the need for
gies. Now, 13 years into the CYC Project and commonsense knowledge, another to inte-
more than a decade after Lenat and Feigen- grate it seamlessly into knowledge-based sys-
baum’s paper, there seems to be consensus on tems. Lenat observes that ontologies often are
the following points: missing a middle level, the purpose of which is
The first and most intellectually taxing task to connect very general ontological concepts
when building a large knowledge base is to such as human and activity with domain-specif-
design an ontology. If you get it wrong, you ic concepts such as the person who is responsible
can expect ongoing trouble organizing the for navigating a B-52 bomber. Because HPKB is
knowledge you acquire in a natural way. grounded in domain-specific tasks, the focus of
Whenever two or more systems are built for much ontological engineering is this middle
related tasks (for example, medical expert sys- layer.
tems, planning, modeling of physical process-
es, scheduling and logistics, natural language
understanding), the architects of the systems
The Participants
realize, often too late, that someone else has The HPKB participants are organized into three
already done, or is in the process of doing, the groups: (1) technology developers, (2) integra-
hard ontological work. HPKB challenges the tion teams, and (3) challenge problem devel-
research community to share, merge, and col- opers. Roughly speaking, the integration teams
lectively develop large ontologies for signifi- build systems with the new technologies to
cant military problems. However, an ontology solve challenge problems. The integration
alone is not sufficient. Axioms are required to teams are led by SAIC and Teknowledge. Each
give meaning to the terms in an ontology. integration team fields systems to solve chal-
Without them, users of the ontology can inter- lenge problems in an annual evaluation. Uni-
pret the terms differently. versity participants include Stanford Universi-
Most knowledge-based systems have no ty, Massachusetts Institute of Technology
common sense; so, they cannot be trusted. (MIT), Carnegie Mellon University (CMU),
Suppose you have a knowledge-based system Northwestern University, University of Massa-
for scheduling resources such as heavy-lift heli- chusetts (UMass), George Mason University
copters, and none of its knowledge concerns (GMU), and the University of Edinburgh
noncombatant evacuation operations. Now, (AIAI). In addition, SRI International, the Uni-
suppose you have to evacuate a lot of people. versity of Southern California Information Sci-
Lacking common sense, your system is literally ences Institute (USC-ISI), the Kestrel Institute,
useless. With a little common sense, it could and TextWise, Inc., have developed important
not only support human planning but might components. Information Extraction and
be superior to it because it could think outside Transport (IET), Inc., with Pacific Sierra
the box and consider using the helicopters in Research (PSR), Inc., developed and evaluated
an unconventional way. Common sense is the crisis-management challenge problem, and
needed to recognize and exploit opportunities Alphatech, Inc., is responsible for the battle-
as well as avoid foolish mistakes. space challenge problem.
26 AI MAGAZINE
3. Articles
Challenge Problems be assessed not only by technology developers
but also by DARPA management and involved
A programmatic innovation of HPKB is chal- members of the DoD community.
lenge problems. The crisis-management chal- The HPKB challenge problems are designed
lenge problem, developed by IET and PSR, is to support new and ongoing DARPA initiatives
designed to exercise broad, relatively shallow in intelligence analysis and battlespace infor-
knowledge about international tensions. The mation systems. Crisis-management systems
battlespace challenge problem, developed by will assist strategic analysts by evaluating the
Alphatech, Inc., has two parts, each designed political, economic, and military courses of
to exercise relatively specific knowledge about action available to nations engaged at various
activities in armed conflicts. Movement analysis levels of conflict. Battlespace systems will sup-
involves interpreting vehicle movements port operations officers and intelligence ana-
detected and tracked by idealized sensors. The lysts by inferring militarily significant targets
workaround problem is concerned with finding and sites, reasoning about road network traffi-
military engineering solutions to traffic- cability, and anticipating responses to military
obstruction problems, such as destroyed strikes.
bridges and blocked tunnels.
Good challenge problems must satisfy sever- Crisis-Management
al, often conflicting, criteria. A challenge prob- Challenge Problem
lem must be challenging: It must raise the bar
The crisis-management challenge problem is
for both technology and science. A problem
intended to drive the development of broad,
that requires only technical ingenuity will not
relatively shallow commonsense knowledge
hold the attention of the technology develop-
bases to facilitate intelligence analysis. The
ers, nor will it help the United States maintain
client program at DARPA for this problem is
its preeminence in science. Equally important,
Project GENOA—Collaborative Crisis Under-
a challenge problem for a DARPA program
standing and Management. GENOA is intended
must have clear significance to the United
States Department of Defense (DoD). Chal- to help analysts more rapidly understand
lenge problems should serve for the duration emerging international crises to preserve U.S.
of the program, becoming more challenging policy options. Proactive crisis management—
each year. This continuity is preferable to before a situation has evolved into a crisis that
designing new problems every year because might engage the U.S. military—enables more
the infrastructure to support challenge prob- effective responses than reactive management.
lems is expensive. Crisis-management systems will assist strategic
A challenge problem should require little or analysts by evaluating the political, economic,
no access to military subject-matter experts. It and military courses of action available to
should not introduce a knowledge-acquisition nations engaged at various levels of conflict.
bottleneck that results in delays and low pro- The challenge problem development team
ductivity from the technology developers. As worked with GENOA representatives to identify
much as possible, the problem should be solv- areas for the application of HPKB technology.
able with accessible, open-source material. A This work took three or four months, but the
challenge problem should exercise all (or crisis-management challenge problem specifi-
most) of the contributions of the technology cation has remained fairly stable since its ini-
developers, and it should exercise an integra- tial release in draft form in July 1997.
tion of these technologies. A challenge prob- The first step in creating the challenge prob-
lem should have unambiguous criteria for lem was to develop a scenario to provide con-
evaluating its solutions. These criteria need text for intelligence analysis in time of crisis.
not be so objective that one can write algo- To ensure that the problem should require
rithms to score performance (for example, development of real knowledge about the
human judgment might be needed to assess world, the scenario includes real national
scores), but they must be clear and they must actors with a fictional yet plausible story line.
be published early in the program. In addition, The scenario, which takes place in the Persian
although performance is important, challenge Gulf, involves hostilities between Saudi Arabia
problems that value performance above all else and Iran that culminate in closing the Strait of
encourage “one-off” solutions (a solution Hormuz to international shipping.
developed for a specific problem, once only) Next, IET worked with experts at PSR to
and discourage researchers from trying to develop a description of the intelligence analy-
understand why their technologies work well sis process, which involves the following tasks:
and poorly. A challenge problem should pro- information gathering—what happened, situ-
vide a steady stream of results, so progress can ation assessment—what does it mean, and sce-
WINTER 1998 27
4. Articles
III. What of significance might happen following the Saudi air strikes?
B. Options evaluation
Evaluate the options available to Iran.
Close the Strait of Hormuz to shipping.
Evaluation: Probable
Motivation: Respond to Saudi air strikes and deter future strikes.
Capability:
(Q51) Can Iran close the Strait of Hormuz to international shipping?
(Q83) Is Iran capable of firing upon tankers in the Strait of Hormuz? With what weapons?
Negative outcomes:
(Q53) What risks would Iran face in closing the strait?
Figure 1. Sample Questions Pertaining to the Responses to an Event.
nario development—what might happen next. rent and a historical context. Crises can be rep-
Situation assessment (or interpretation) resented as events or as larger episodes tracking
includes factors that pertain to the specific sit- the evolution of a conflict over time, from
uation at hand, such as motives, intents, risks, inception or trigger, through any escalation, to
rewards, and ramifications, and factors that eventual resolution or stasis. The representa-
make up a general context, or “strategic cul- tions being developed in HPKB are intended to
ture,” for a state actor’s behavior in interna- serve as a crisis corporate memory to help ana-
tional relations, such as capabilities, interests, lysts discover historical precedents and analo-
policies, ideologies, alliances, and enmities. gies for actions. Much of the challenge-prob-
Scenario development, or speculative predic- lem specification is devoted to sample
tion, starts with the generation of plausible questions that are intended to drive the devel-
actions for each actor. Then, options are eval- opment of general models for reasoning about
uated with respect to the same factors for situ- crisis events.
ation assessment, and a likelihood rating is Sample questions are embedded in an ana-
produced. The most plausible actions are lytic context. The question “What might hap-
reported back to policy makers. pen next?” is instantiated as “What might
These analytic tasks afford many opportuni- happen following the Saudi air strikes?” as
ties for knowledge-based systems. One is to use shown in figure 1. Q51 is refined to Q83 in a
knowledge bases to retain or multiply corpo- way that is characteristic of the analytic
rate expertise; another is to use knowledge and process; that is, higher-level questions are
reasoning to “think outside the box,” to gener- refined into sets of lower-level questions that
ate analytic possibilities that a human analyst provide detail.
might overlook. The latter task requires exten- The challenge-problem developers (IET with
sive commonsense knowledge, or “analyst’s PSR) developed an answer key for sample ques-
sense,” about the domain to rule out implausi- tions, a fragment of which is shown in figure 2.
ble options. Although simple factual questions (for exam-
The crisis-management challenge problem ple, “What is the gross national product of the
includes an informal specification for a proto- United States?”) have just one answer; ques-
type crisis-management assistant to support ana- tions such as Q53 usually have several. The
lysts. The assistant is tested by asking ques- answer key actually lists five answers, two of
tions. Some are simple requests for factual which are shown in figure 2. Each is accompa-
information, others require the assistant to nied by suitable explanations, including source
interpret the actions of nations in the context material. The first source (Convention on the
of strategic culture. Actions are motivated by Law of the Sea) is electronic. IET maintains a
interests, balancing risks and rewards. They web site with links to pages that are expected to
have impacts and require capabilities. Interests be useful in answering the questions. The sec-
drive the formation of alliances, the exercise of ond source is a fragment of a model developed
influence, and the generation of tensions by IET and published in the challenge-problem
among actors. These factors play out in a cur- specification. IET developed these fragments to
28 AI MAGAZINE
5. Articles
Answer(s):
1. Economic sanctions from {Saudi Arabia, GCC, U.S., U.N.}
• The closure of the Strait of Hormuz would violate an international norm promoting freedom of the seas and
would jeopardize the interests of many states.
• In response, states might act unilaterally or jointly to impose economic sanctions on Iran to compel it to
reopen the strait.
• The United Nations Security Council might authorize economic sanctions against Iran.
2. Limited military response from {Saudi Arabia, GCC, U.S., others}…
Source(s):
• The Convention on the Law of the Sea.
• (B5) States may act unilaterally or collectively to isolate and/or punish a group or state that violates interna-
tional norms. Unilateral and collective action can involve a wide range of mechanisms, such as intelligence
collection, military retaliation, economic sanction, and diplomatic censure/isolation.
Figure 2. Part of the Answer Key for Question 53.
indicate the kinds of reasoning they would be
testing in the challenge problem.
For the challenge-problem evaluation, held
PQ53 [During/After <TimeInterval>,] what {risks, rewards}
in June 1998, IET developed a way to generate
would <InternationalAgent> face in <InternationalAction-
test questions through parameterization. Test
Type>?
questions deviate from sample questions in
specified, controlled ways, so the teams partic-
<InternationalActionType> =
ipating in the challenge problem know the
{[exposure of its] {supporting, sponsoring}
space of questions from which test items will
<InternationalAgentType> in <InternationalAgent2>,
be selected. This space includes billions of
successful terrorist attacks against <InternationalAgent2>’s
questions so the challenge problem cannot be
<EconomicSector>,
solved by relying on question-specific knowl- <InternationalActionType>(PQ51),
edge. The teams must rely on general knowl- taking hostage citizens of <InternationalAgent2>,
edge to perform well in the evaluation. attacking targets <SpatialRelationship>
(Semantics provide practical constraints on the <InternationalAgent2> with <Force>}
number of reasonable instantiations of para-
meterized questions, as do online sources pro- <InternationalAgentType> =
vided by IET.) To illustrate, Q53 is parameter- {terrorist group, dissident group, political party, humani-
ized in figure 3. Parameterized question 53, tarian organization}
PQ53, actually covers 8 of the roughly 100
sample questions in the specification.
Parameterized questions and associated class
definitions are based on natural language, giv- Figure 3. A Parameterized Question Suitable for Generating
ing the integration teams responsibility for Sample Questions and Test Questions.
developing (potentially different) formal rep-
resentations of the questions. This decision
was made at the request of the teams. An
instance of a parameterized question, say,
PQ53, is mechanically generated, then the
teams must create a formal representation and
reason with it—without human intervention.
Battlespace Challenge Problems
The second challenge-problem domain for
HPKB is battlespace reasoning. Battlespace is an
abstract notion that includes not only the
WINTER 1998 29
6. Articles
physical geography of a conflict but also the form and an order of battle that describes the
plans, goals, and activities of all combatants structure and composition of the enemy forces
prior to, and during, a battle and during the in the scenario region.
activities leading to the battle. Three battle- Given these input, movement analysis com-
space programs within DARPA were identified prises the following tasks:
as potential users of HPKB technologies: (1) the First is to distinguish military from nonmil-
dynamic multiinformation fusion program, itary traffic. Almost all military traffic travels in
(2) the dynamic database program, and (3) the convoys, which makes this task fairly straight-
joint forces air-component commander forward except for very small convoys of two
(JFACC) program. Two battlespace challenge or three vehicles. Second is to identify the sites
problems have been developed. between which military convoys travel, deter-
The second mine which of these sites are militarily signifi-
The Movement-Analysis Challenge
challenge- Problem The movement-analysis challenge cant, and determine the types of each militar-
problem problem concerns high-level analysis of ideal- ily significant site. Site types include battle
ized sensor data, particularly the airborne positions, command posts, support areas, air-
domain for JSTARS moving target indicator radar. This defense sites, artillery sites, and assembly-stag-
HPKB is Doppler radar can generate vast quantities of ing areas.
information—one reading every minute for Third is to identify which units (or parts of
battlespace each vehicle in motion within a 10,000- units) in the enemy order of battle are partici-
reasoning. square-mile area.2 The movement-analysis sce- pating in each military convoy.
Fourth is to determine the purpose of each
Battlespace is nario involves an enemy mobilizing a full divi-
convoy movement. Purposes include recon-
sion of ground forces—roughly 200 military
an abstract units and 2000 vehicles—to defend against a naissance, movement of an entire unit toward
notion that possible attack. A simulation of the vehicle a battle position, activities by command ele-
ments, and support activities.
movements of this division was developed, the
includes not output of which includes reports of the posi- Fifth is to infer the exact types of the vehi-
only the tions of all the vehicles in the division at 1- cles that make up each convoy. About 20 types
minute intervals over a 4-day period for 18 of military vehicle are distinguished in the
physical hours each day. These military vehicle move- enemy order of battle, all of which show up in
geography of ments were then interspersed with plausible the scenario data.
civilian traffic to add the problem of distin- To help the technology base and the integra-
a conflict but guishing military from nonmilitary traffic. The tion teams develop their systems, a portion of
also the movement-analysis task is to monitor the the simulation data was released in advance of
plans, goals, movements of the enemy to detect and identi- the evaluation phase, accompanied by an
answer key that supplied model answers for
fy types of military site and convoy.
and activities Because HPKB is not concerned with signal each of the inference tasks listed previously.
of all processing, the input are not real JSTARS data Movement analysis is currently carried out
but are instead generated by a simulator and manually by human intelligence analysts,
combatants preprocessed into vehicle tracks. There is no who appear to rely on models of enemy
prior to, and uncertainty in vehicle location and no radar behavior at several levels of abstraction. These
shadowing, and each vehicle is always accu- include models of how different sites or con-
during, a rately identified by a unique bumper number. voys are structured for different purposes and
battle and However, vehicle tracks do not precisely iden- models of military systems such as logistics
(supply and resupply). For example, in a logis-
during the tify vehicle type but instead define each vehi-
tics model, one might find the following frag-
cle as either light wheeled, heavy wheeled, or
activities tracked. Low-speed and stationary vehicles are ment: “Each echelon in a military organiza-
leading to the not reported. tion is responsible for resupplying its
Vehicle-track data are supplemented by subordinate echelons. Each echelon, from bat-
battle. small quantities of high-value intelligence talion on up, has a designated area for storing
data, including accurate identification of a few supplies. Supplies are provided by higher ech-
key enemy sites, electronic intelligence reports of elons and transshipped to lower echelons at
locations and times at which an enemy radar is these areas.” Model fragments such as these
turned on, communications intelligence reports are thought to constitute the knowledge of
that summarize information obtained by mon- intelligence analysts and, thus, should be the
itoring enemy communications, and human content of HPKB movement-analysis systems.
intelligence reports that provide detailed infor- Some such knowledge was elicited from mili-
mation about the numbers and types of vehi- tary intelligence analysts during programwide
cle passing a given location. Other input meetings. These same analysts also scripted
include a detailed road network in electronic the simulation scenario.
30 AI MAGAZINE
7. Articles
The Workaround Challenge Problem battle damage are carried out by Army engi-
The workaround challenge problem supports air- neers; so, this description takes the form of a
campaign planning by the JFACC and his/her detailed engineering order of battle.
staff. One task for the JFACC is to determine All input are provided in a formal represen-
suitable targets for air strikes. Good targets tation language.
allow one to achieve maximum military effect The workaround generator is expected to
with minimum risk to friendly forces and min- provide three output: First is a reconstitution
imum loss of life on all sides. Infrastructure schedule, giving the capacity of the damaged
often provides such targets: It can be sufficient link as a function of time since the damage was
to destroy supplies at a few key sites or critical inflicted. For example, the workaround gener-
nodes in a transportation network, such as ator might conclude that the capacity of the
bridges along supply routes. However, bridges link is zero for the first 48 hours, but thereafter,
and other targets can be repaired, and there is a temporary bridge will be in place that can
little point in destroying a bridge if an avail- sustain a capacity of 170 vehicles an hour. Sec-
able fording site is nearby. If a plan requires an ond is a time line of engineering actions that the
interruption in traffic of several days, and the enemy might carry out to implement the
bridge can be repaired in a few hours, then repair, the time these actions require, and the
another target might be more suitable. Target temporal constraints among them. If there
selection, then, requires some reasoning about appears to be more than one viable repair strat-
how an enemy might “work around” the dam- egy, a time line should be provided for each.
age to the target. Third is a set of required assets for each time line
The task of the workaround challenge prob- of actions, a description of the engineering
lem is to automatically assess how rapidly and resources that are used to repair the damage
by what method an enemy can reconstitute or and pointers to the actions in the time line The challenge
bypass damage to a target and, thereby, help that utilize these assets. The reconstitution problems are
air-campaign planners rapidly choose effective schedule provides the minimal information
targets. The focus of the workaround problem required to evaluate the suitability of a given
solved by
in the first year of HPKB is automatic target. The time line of actions provides an integrated
workaround generation. explanation to justify the reconstitution
schedule. The set of required assets is easily
systems
The workaround task involves detailed rep-
resentation of targets and the local terrain derived from the time line of actions and can fielded by
around the target and detailed reasoning about be used to suggest further targets for preemp- integration
actions the enemy can take to reconstitute or tive air strikes against the enemy to frustrate its
bypass this damage. Thus, the input to repair efforts. teams led by
workaround systems include the following ele- A training data set was provided to help Teknowledge
ments: developers build their systems. It supplied
First is a description of a target (for example, input and output for several sample problems,
and SAIC.
a bridge or a tunnel), the damage to it (for together with detailed descriptions of the cal-
example, one span of a bridge is dropped; the culations carried out to compute action dura-
bridge and vicinity are mined), and key fea- tions; lists of simplifying assumptions made to
tures of the local terrain (for example, the facilitate these calculations; and pointers to
slope and soil types of a terrain cross section text sources for information on engineering
coincident with the road near the bridge, resources and their use (mainly Army field
together with the maximum depth and the manuals available on the World Wide Web).
speed of any river or stream the bridge crosses). Workaround generation requires detailed
Second is a specific enemy unit or capability knowledge about what the capabilities of the
to be interdicted, such as a particular armored enemy’s engineering equipment are and how
battalion or supply trucks carrying ammuni- it is typically used by enemy forces. For exam-
tion. ple, repairing damage to a bridge typically
Third is a time period over which this unit involves mobile bridging equipment, such as
or capability is to be denied access to the tar- armored vehicle-launched bridges (AVLBs),
geted route. The presumption is that the ene- medium girder bridges, Bailey bridges, or float
my will try to repair the damage within this bridges such as ribbon bridges or M4T6
time period; a target is considered to be effec- bridges, together with a range of earthmoving
tive if there appears to be no way for the ene- equipment such as bulldozers. Each kind of
my to make this repair. mobile bridge takes a characteristic amount of
Fourth is a detailed description of the enemy time to deploy, requires different kinds of bank
resources in the area that could be used to preparation, and is “owned” by different eche-
repair the damage. For the most part, repairs to lons in the military hierarchy, all of which
WINTER 1998 31
8. Articles
affect the time it takes to bring the bridge to a object-oriented format (Pease and Carrico
damage site and effect a repair. Because HPKB 1997a, 1997b), and applications of this generic
operates in an entirely unclassified environ- semantics to domain-specific tasks are promis-
ment, U.S. engineering resources and doctrine ing (Pease and Albericci 1998). The develop-
were used throughout. Information from ment of ontologies for integrating manufactur-
Army field manuals was supplemented by a ing planning applications (Tate 1998) and
series of programwide meetings with an Army work flow (Lee et al. 1996) is ongoing.
combat engineer, who also helped construct Another option for semantic integration is
sample problems and solutions. software mediation (Park, Gennari, and Musen
1997). This software mediation can be seen as
a variant on pairwise integration, but because
Integrated Systems integration is done by knowledge-based
The challenge problems are solved by integrat- means, one has an explicit expression of the
ed systems fielded by integration teams led by semantics of the conversion. Researchers at
Teknowledge and SAIC. Teknowledge favors a Kestrel Institute have successfully defined for-
centralized architecture that contains a large mal specifications for data and used these the-
commonsense ontology (CYC); SAIC has a dis- ories to integrate formally specified software.
tributed architecture that relies on sharing spe- In addition, researchers at Cycorp have suc-
cialized domain ontologies and knowledge cessfully applied CYC to the integration of mul-
bases, including a large upper-level ontology tiple databases.
based on the merging of CYC, SENSUS, and other The Teknowledge approach to integration is
knowledge bases. to share knowledge among applications and
create new knowledge to support the challenge
Teknowledge Integration
Teknowledge The Teknowledge integration team comprises
problems. Teknowledge is defining formal
semantics for the input and output of each
favors a Teknowledge, Cycorp, and Kestrel Institute. Its application and the information in the chal-
centralized focus is on semantic integration and the cre- lenge problems.
ation of massive amounts of knowledge. Many concepts defy simple definitions.
architecture Semantic Integration Three issues make Although there has been much success in
that contains software integration difficult. Transport issues defining the semantics of mathematical con-
cepts, it is harder to be precise about the
a large concern mechanisms to get data from one
semantics of the concepts people use every
process or machine to another. Solutions
commonsense include sockets, remote-method invocation day. These concepts seem to acquire meaning
ontology (RMI), and CORBA. Syntactic issues concern how through their associations with other con-
cepts, their use in situations and communica-
to convert number formats, “syntactic sugar,”
(CYC) …. and the labels of data. The more challenging tion, and their relations to instances. To give
issues concern semantic integration: To integrate the concepts in our integrated system real
elements properly, one must understand the meaning, we must provide a rich set of associ-
meaning of each. The database community ations, which requires an extremely large
has addressed this issue (Wiederhold 1996); it knowledge base. CYC offers just such a knowl-
is even more pressing in knowledge-based sys- edge base.
tems. CYC (Lenat 1995; Lenat and Guha 1990)
The current state of practice in software inte- consists of an immense, multicontextual
gration consists largely of interfacing pairs of knowledge base; an efficient inference engine;
systems, as needed. Pairwise integration of this and associated tools and interfaces for acquir-
kind does not scale up, unanticipated uses are ing, browsing, editing, and combining knowl-
hard to cover later, and chains of integrated edge. Its premise is that knowledge-based soft-
systems at best evolve into stovepipe systems. ware will be less brittle if and only if it has
Each integration is only as general as it needs access to a foundation of basic commonsense
to be to solve the problem at hand. knowledge. This semantic substratum of
Some success has been achieved in low-level terms, rules, and relations enables application
integration and reuse; for example, systems programs to cope with unforeseen circum-
that share scientific subroutine libraries or stances and situations.
graphics packages are often forced into similar The CYC knowledge base represents millions
representational choices for low-level data. of hand-crafted axioms entered during the 13
DARPA has invested in early efforts to create years since CYC’s inception. Through careful
reuse libraries for integrating large systems at policing and generalizing, there are now
higher levels. Some effort has gone into slightly fewer than 1 million axioms in the
expressing a generic semantics of plans in an knowledge base, interrelating roughly 50,000
32 AI MAGAZINE
9. Articles
atomic terms. Fewer than two percent of these Teknowledge has developed a template into
axioms represent simple facts about proper which user-specified parameters can be insert-
nouns of the sort one might find in an ed. START parses English queries for a few of the
almanac. Most embody general consensus crisis-management questions to fill in these
information about the concepts. For example, templates. Each filled template is a legal CYC
one axiom says one cannot perform volitional query. TextWise Corporation has been devel-
actions while one sleeps, another says one can- oping natural language information-retrieval
not be in two places at once, and another says software primarily for news articles (Liddy,
you must be at the same place as a tool to use Paik, and McKenna 1995). Teknowledge
it. The knowledge base spans human capabili- intends to use the TextWise knowledge base
ties and limitations, including information on information tools (KNOW-IT) to supply many
emotions, beliefs, expectations, dreads, and instances to CYC of facts discovered from news
goals; common everyday objects, processes, stories. The system can parse English text and
and situations; and the physical universe, return a series of binary relations that express
including such phenomena as time, space, the content of the sentences. There are several
causality, and motion. dozen relation types, and the constants that
The CYC inference engine comprises an epis- instantiate each relation are WORDNET synset
temological and a heuristic level. The epistemo- mappings (Miller et al. 1993). Each of the con-
logical level is an expressive nth-order logical cepts has been mapped to a CYC expression,
language with clean formal semantics. The and a portion of WORDNET has been mapped to
heuristic level is a set of some three dozen spe- CYC. For those synsets not in CYC, the WORDNET
cial-purpose modules that each contains its hyponym links are traversed until a mapped …
own algorithms and data structures and can CYC term is found.
recognize and handle some commonly occur- Battlespace Integration Teknowledge
SAIC has a
ring sorts of inference. For example, one supported the movement-analysis workaround distributed
heuristic-level module handles temporal rea-
soning efficiently by converting temporal rela-
problem.
Movement analysis: Several movement-
architecture
tions into a before-and-after graph and then analysis systems were to be integrated, and that relies
doing graph searching rather than theorem
proving to derive an answer. A truth mainte-
much preliminary integration work was done. on sharing
Ultimately, the time pressure of the challenge
nance system and an argumentation-based problem evaluation precluded a full integra- specialized
explanation and justification system are tight-
ly integrated into the system and are efficient
tion. The MIT and UMass movement-analysis domain
systems are described briefly here; the SMI and
enough to be in operation at all times. In addi- SRI systems are described in the section enti-
ontologies
tion to these inference engines, CYC includes tled SAIC Integrated Systems. and
numerous browsers, editors, and consistency The MIT MAITA system provides tools for
checkers. A rich interface has been defined. constructing and controlling networks of dis-
knowledge
Crisis-Management Integration The cri- tributed-monitoring processes. These tools bases ….
sis-management challenge problem involves provide access to large knowledge bases of
answering test questions presented in a struc- monitoring methods, organized around the
tured grammar. The first step in answering a hierarchies of tasks performed, knowledge
test question is to convert it to a form that CYC used, contexts of application, the alerting of
can reason with, a declarative decision tree. utility models, and other dimensions. Individ-
When the tree is applied to the test question ual monitoring processes can also make use of
input, a CYC query is generated and sent to CYC. knowledge bases representing commonsense
Answering the challenge problem questions or expert knowledge in conducting their rea-
takes a great deal of knowledge. For the first soning or reporting their findings. MIT built
year’s challenge problem alone, the Cycorp monitoring processes for sites and convoys
and Teknowledge team added some 8,000 con- with these tools.
cepts and 80,000 assertions to CYC. To meet the The UMass group tried to identify convoys
needs of this challenge problem, the team cre- and sites with very simple rules. Rules were
ated significant amounts of new knowledge, developed for three site types: (1) battle posi-
some developed by collaborators and merged tions, (2) command posts, and (3) assembly-
into CYC, some added by automated processes. staging areas. The convoy detector simply
The Teknowledge integrated system in- looked for vehicles traveling at fixed distances
cludes two natural language components: from one another. Initially, UMass was going
START and TextWise. The START system was cre- to recognize convoys from their dynamics, in
ated by Boris Katz (1997) and his group at MIT. which the distances between vehicles fluctuate
For each of the crisis-management questions, in a characteristic way, but in the simulated
WINTER 1998 33
10. Articles
data, the distances between vehicles remained soning process. For this reason, a hierarchical
fixed. UMass also intended to detect sites by task network (HTN) approach was taken. A
the dynamics of vehicle movements between planning-specific ontology was defined within
them, but no significant dynamic patterns the larger CYC ontology, and planning rules
could be found in the movement data. only referenced concepts within this more
Workarounds: Teknowledge developed two constrained context. The planning application
workaround integrations, one an internal was essentially embedded in CYC.
Teknowledge system, the other from AIAI at CYC had to be extended to represent com-
the University of Edinburgh. posite actions that have several alternative
Teknowledge developed a planning tool decompositions and complex preconditions-
based on CYC, essentially a wrapper around effects. Although it is not a commonsense
CYC’s existing knowledge base and inference approach, AIAI decided to explore HTN plan-
engine. A plan is a proof that there is a path ning because it appeared suitable for the
from the final goal to the initial situation workaround domain. It was possible to repre-
through a partially ordered set of actions. The sent actions, their conditions and effects, the
rules in the knowledge base driving the plan- plan-node network, and plan resources in a
ner are rules about action preconditions and relational style. The structure of a plan was
about which actions can bring about a certain implicitly represented in the proof that the
state of affairs. There is no explicit temporal corresponding composite action was a relation
reasoning; the partial order of temporal prece- between particular sets of conditions and
dence between actions is established on the effects. Once proved, action relations are
basis of the rules about preconditions and retained by CYC and are potentially reusable.
effects. An advantage of implementing the AIAI plan-
The planner is a new kind of inference ner in CYC was the ability to remove brittleness
engine, performing its own search but in a from the planner-input knowledge format; for
much smaller search space. However, each step example, it was not necessary to account for all
in the search involves interaction with the the possible permutations of argument order
existing inference engine by hypothesizing in predicates such as bordersOn and between.
actions and microtheories and doing asks and
asserts in these microtheories. This hypothe-
SAIC Integrated System
sizing and asserting on the fly in effect SAIC built an HPKB integrated knowledge envi-
amounts to dynamically updating the knowl- ronment (HIKE) to support both crisis-manage-
edge base in the course of inference; this capa- ment and battlespace challenge problems. The
bility is new for the CYC inference engine. architecture of HIKE for crisis management is
Consistent with the goals of HPKB, the shown in figure 4. For battlespace, the architec-
Teknowledge workaround planner reused CYC’s ture is similar in that it is distributed and relies
knowledge, although it was not knowledge on the open knowledge base connectivity
specific to workarounds. In fact, CYC had never (OKBC) protocol, but of course, the compo-
been the basis of a planner before, so even stat- nents integrated by the battlespace architecture
ing things in terms of an action’s precondi- are different. HIKE’s goals are to address the dis-
tions was new. What CYC provided, however, tributed communications and interoperability
was a rich basis on which to build workaround requirements among the HPKB technology
knowledge. For example, the Teknowledge components—knowledge servers, knowledge-
team needed to write only one rule to state “to acquisition tools, question-and-answering sys-
use something as a device you must have con- tems, problem solvers, process monitors, and so
trol over that device,” and this rule covered the on—and provide a graphic user interface (GUI)
cases of using an M88 to clear rubble, a mine tailored to the end users of the HPKB environ-
plow to breach a minefield, a bulldozer to cut ment.
into a bank or narrow the gap, and so on. The HIKE provides a distributed computing infra-
reason one rule can cover so many cases is structure that addresses two types of commu-
because clearing rubble, demining an area, nications needs: First are input and output
narrowing a gap, and cutting into a bank are data-transportation and software connectivi-
all specializations of IntrinsicStateChange- ties. These include connections between the
Event, an extant part of the CYC ontology. HIKE server and technology components, con-
The AIAI workaround planner was also nections between components, and connec-
implemented in CYC and took data from tions between servers. HIKE encapsulates infor-
Teknowledge’s FIRE&ISE-TO-MELD translator as its mation content and data transportation
input. The central idea was to use the scriptlike through JAVA objects, hypertext transfer proto-
structure of workaround plans to guide the rea- col (HTTP), remote-method invocation ( JAVA
34 AI MAGAZINE
11. Articles
Electronic WWW
archival
sources
Real time
news feeds
Question Answering
Qu
es
KNOW tio
SKC n
e -IT An
f ac START sw
r er
te in
In ATP g
s er
U
OKBC
Hike (Open Knowledge SNARK
Servers Base Connectivity)
BUS
SPOOK
HIKE
Clients
Analyst GKB- webKB
Editor Ontolingua WWW
Knowledge Services Training
Data
Knowledge Engineer
Figure 4. SAIC Crisis-Management Challenge Problem Architecture.
RMI), and database access (JDBC). Second, HIKE system for GMU. SAIC built a front end to the
provides for knowledge-content assertion and OKBC server for LOOM that was extensively
distribution and query requests to knowledge used by the members of the battlespace chal-
services. lenge problem team.
The OKBC protocol proved essential. SRI With OKBC and other methods, the HIKE
used it to interface the theorem prover SNARK to infrastructure permits the integration of new
an OKBC server storing the Central Intelli- technology components (either clients or
gence Agency (CIA) World Fact Book knowledge servers) in the integrated end-to-end HPKB sys-
base. Because this knowledge base is large, SRI tem without introducing major changes, pro-
did not want to incorporate it into SNARK but vided that the new components adhere to the
instead used the procedural attachment fea- specified protocols.
ture of SNARK to look up facts that were avail- Crisis-Management Integration The
able only in the World Fact Book. MIT’s START SAIC crisis-management architecture is
system used OKBC to connect to SRI’s OCELOT- focused around a central OKBC bus, as shown
SNARK OKBC server. This connection will even- in figure 4. The technology components pro-
tually give users the ability to pose questions vide user interfaces, question answering, and
in English, which are then transformed to a knowledge services. Some components have
formal representation by START and shipped to overlapping roles. For example, MIT’s START sys-
SNARK using OKBC; the result is returned using tem serves both as a user interface and a ques-
OKBC. ISI built an OKBC server for their LOOM tion-answering component. Similarly, CMU’s
WINTER 1998 35
12. Articles
WEBKB supports both question answering and to reuse knowledge whenever it made sense.
knowledge services. The SAIC team reused three knowledge bases:
HIKE provides a form-based GUI with which (1) the HPKB upper-level ontology developed
users can construct queries with pull-down by Cycorp, (2) the World Fact Book knowledge
menus. Query-construction templates corre- base from the CIA, and the Units and Measures
spond to the templates defined in the crisis- Ontology from Stanford. Reusing the upper-
management challenge problem specification. level ontology required translation, compre-
Questions also can be entered in natural lan- hension, and reformulation. The ontology was
guage. START and the TextWise component released in MELD (a language used by Cycorp)
accept natural language queries and then and was not directly readable by the SAIC sys-
attempt to answer the questions. To answer tem. In conjunction with Stanford, SRI devel-
questions that involve more complex types of oped a translator to load the upper-level ontol-
reasoning, START generates a formal representa- ogy into any OKBC-compliant server. Once
tion of the query and passes it to one of the loaded into the OCELOT server, the GKB editor
theorem provers. was used to comprehend the upper ontology.
The Stanford University Knowledge Systems The graphic nature of the GKB editor illuminat-
Laboratory ONTOLINGUA, SRI International’s ed the interrelationships between classes and
graphic knowledge base (GKB) editor, WEBKB, predicates of the upper-level ontology. Because
and TextWise provide the knowledge service the upper-level ontology represents functional
components. The GKB editor is a graphic tool relationships as predicates but SNARK reasons
for browsing and editing large knowledge bases, efficiently with functions, it was necessary to
used primarily for manual knowledge acquisi- reformulate the ontology to use functions
The guiding tion. WEBKB supports semiautomatic knowledge whenever a functional relationship existed.
philosophy acquisition. Given some training data and an
ontology as input, a web spider searches in a
Battlespace Integration The distributed
HIKE infrastructure is well suited to support an
during directed manner and populates instances of integrated battlespace challenge problem as it
knowledge classes and relations defined in the ontology. was originally designed: a single information
Probabilistic rules are also extracted. TextWise system for movement analysis, trafficability,
base develop- extracts information from text and newswire and workaround reasoning. However, the traffi-
ment for crisis feeds, converting them into knowledge inter- cability problem (establishing routes for various
change format (KIF) triples, which are then kinds of vehicle given the characteristics) was
management loaded into ONTOLINGUA. ONTOLINGUA is SAIC’s dropped, and the integration of the other prob-
was to reuse central knowledge server and information lems was delayed. The components that solved
knowledge repository for the crisis-management challenge
problem. ONTOLINGUA supports KIF as well as
these problems are described briefly later.
Movement analysis: The movement-analy-
whenever it compositional modeling language (CML). Flow sis problem is solved by MIT’s monitoring,
made sense. models developed by Northwestern University analysis, and interpretation tools arsenal (MAI-
(NWU) answer challenge problem questions TA); Stanford University’s Section on Medical
related to world oil-transportation networks Informatics’ (SMI) problem-solving methods;
and reside within ONTOLINGUA. Stanford’s system and SRI International’s Bayesian networks. The
for probabilistic object-oriented knowledge MIT effort was described briefly in the section
(SPOOK) provides a language for class frames to entitled Teknowledge Integration. Here, we
be annotated with probabilistic information, focus on the SMI and SRI movement-analysis
representing uncertainty about the properties systems.
of instances in this class. SPOOK is capable of rea- For scalability, SMI adopted a three-layered
soning with the probabilistic information based approach to the challenge problem: The first
on Bayesian networks. layer consisted primarily of simple, context-
Question answering is implemented in sev- free data processing that attempted to find
eral ways. SRI International’s SNARK and Stan- important preliminary abstractions in the data
ford’s abstract theorem prover (ATP) are first- set. The most important of these were traffic
order theorem provers. WEBKB answers centers (locations that were either the starting
questions based on the information it gathers. or stopping points for a significant number of
Question answering is also accomplished by vehicles) and convoy segments (a number of
START and TextWise taking a query in English as vehicles that depart from the same traffic cen-
input and using information retrieval to ter at roughly the same time, going in roughly
extract the answers from text-based sources the same direction). Spotting these abstractions
(such as the web, newswire feeds). required setting a number of parameters (for
The guiding philosophy during knowledge example, how big a traffic center is). Once
base development for crisis management was trained, these first-layer algorithms are linear in
36 AI MAGAZINE
13. Articles
the size of the data set and enabled SMI to use ers were developed. One is a novel planner
knowledge-intensive techniques on the result- whose knowledge base is represented in the
ing (much smaller) set of data abstractions. ontologies, including its operators, state
The second layer was a repair layer, which descriptions, and problem-specific informa-
used knowledge of typical convoy behaviors tion. It uses a novel partial-match capability
and locations on the battlespace to construct a developed in LOOM (MacGregor 1991). The oth-
“map” of militarily significant traffic and traf- er is based on a state-of-the-art planner (Veloso
fic centers. The end result was a network of et al. 1995). Each solution lists several engi-
traffic connected by traffic. Three main tasks neering actions for this workaround (for exam-
remain: (1) classify the traffic centers, (2) figure ple, deslope the banks of the river, install a
out what the convoys are doing, and (3) iden- temporary bridge), includes information about
tify which units are involved. SMI iteratively the sources used (for example, what kind of
answered these questions by using repeated earthmoving equipment or bridge is used), and
layers of heuristic classification and constraint asserts temporal constraints among the indi-
satisfaction. The heuristic-classification com- vidual actions to indicate which can be execut-
ponents operated independently of the net- ed in parallel. A temporal estimation-assess-
work, using known (and deduced) facts about ment problem solver evaluates each of the
single convoys or traffic centers. Consider the alternatives and selects one as the most likely
following rule for trying to identify a main choice for an enemy workaround. This prob-
supply brigade (MSB) site (paraphrased into lem solver was developed in EXPECT (Swartout
English, with abstractions in boldface): and Gil 1995; Gil 1994).
If we have a current site which is unclas- Several general battlespace ontologies (for
sified example, military units, vehicles), anchored
and it’s in the Division support area, on the HPKB upper ontology, were used and
and the traffic is high enough augmented with ontologies needed to reason
and the traffic is balanced about workarounds (for example, engineering
and the site is persistent with no major equipment). Besides these ontologies, the
deployments emanating from it knowledge bases used included a number of
then it’s probably an MSB problem-solving methods to represent knowl-
SMI used similar rules for the constraint-satis- edge about how to solve the task. Both ontolo-
faction component of its system, allowing gies and problem-solving knowledge were used
information to propagate through the network by two main problem solvers.
in a manner similar to Waltz’s (1975) well- EXPECT’s knowledge-acquisition tools were
known constraint-satisfaction algorithm for used throughout the evaluation to detect miss-
edge labeling. ing knowledge. EXPECT uses problem-solving
The goal of the SRI group was to induce a knowledge and ontologies to analyze which
knowledge base to characterize and identify information is needed to solve the task. This
types of site such as command posts and battle capability allows EXPECT to alert a user when
positions. Its approach was to induce a there is missing knowledge about a problem
Bayesian classifier and use a generative model (for example, unspecified bridge lengths) or a
approach, producing a Bayesian network that situation. It also helps debug and refine
could serve as a knowledge base. This model- ontologies by detecting missing axioms and
ing required transforming raw vehicle tracks overgeneral definitions.
into features (for example, the frequency of GMU developed the DISCIPLE98 system. DISCI-
certain vehicles at sites, number of stops) that PLE is an apprenticeship multistrategy learning
could be used to predict sites. Thus, it was also system that learns from examples, from expla-
necessary to have hypothetical sites to test. SRI nations, and by analogy and can be taught by
relied on SMI to provide hypothetical sites, an expert how to perform domain-specific
and it also used some of the features that SMI tasks through examples and explanations in a
computed. As a classifier, SRI used tree-aug- way that resembles how experts teach appren-
mented naive (TAN) Bayes (Friedman, Geiger, tices (Tecuci 1998). For the workaround
and Goldszmidt 1997). domain, DISCIPLE was extended into a baseline-
Workarounds: The SAIC team integrated integrated system that creates an ontology by
two approaches to workaround generation, acquiring concepts from a domain expert or
one developed by USC-ISI, the other by GMU. importing them (through OKBC) from shared
ISI developed course-of-action–generation ontologies. It learns task-decomposition rules
problem solvers to create alternative solutions from a domain expert and uses this knowledge
to workaround problems. In fact, two alterna- to solve workaround problems through hierar-
tive course-of-action–generation problem solv- chical nonlinear planning.
WINTER 1998 37
14. Articles
First, with DISCIPLE’s ontology-building tools, ferent metrics. The test items for crisis manage-
a domain expert assisted by a knowledge engi- ment were questions, and the test was similar
neer built the object ontology from several to an exam. Overall competence is a function
sources, including expert’s manuals, Alphate- of the number of questions answered correctly,
ch’s FIRE&ISE document and ISI’s LOOM ontology. but the crisis-management systems are also
Second, a task taxonomy was defined by refin- expected to “show their work” and provide
ing the task taxonomy provided by Alphatech. justifications (including sources) for their
This taxonomy indicates principled decomposi- answers. Examples of questions, answers, and
tions of generic workaround tasks into subtasks justifications for crisis management are shown
but does not indicate the conditions under in the section entitled Crisis-Management
which such decompositions should be per- Challenge Problem.
formed. Third, the examples of hierarchical Performance metrics for the movement-
workaround plans provided by Alphatech were analysis problem are related to recall and pre-
used to teach DISCIPLE. Each such plan provided cision. The basic problem is to identify sites,
DISCIPLE with specific examples of decomposi- vehicles, and purposes given vehicle track
We claim that tions of tasks into subtasks, and the expert guid- data; so, performance is a function of how
HPKB ed DISCIPLE to “understand” why each task many of these entities are correctly identified
decomposition was appropriate in a particular
technology situation. From these examples and the expla-
and how many incorrect identifications are
made. In general, identifications can be
facilitates nations of why they are appropriate in the giv- marked down on three dimensions: First, the
en situations, DISCIPLE learned general task-
rapid decomposition rules. After a knowledge base
identified entity can be more or less like the
actual entity; second, the location of the iden-
modification consisting of an object ontology and task-
tified entity can be displaced from the actual
decomposition rules was built, the hierarchical
of nonlinear planner of DISCIPLE was used to auto-
entity’s true location; and third, the identifica-
knowledge- matically generate workaround plans for new
tion can be more or less timely.
The workaround problem involves generat-
based workaround problems.
ing workarounds to military actions such as
systems. This bombing a bridge. Here, the criteria for suc-
claim was Evaluation cessful performance include coverage (the
generation of all workarounds generated),
The SAIC and Teknowledge integrated systems
tested in both for crisis management, movement analysis,
appropriateness (the generation of work-
arounds appropriate given the action), speci-
phases of the and workarounds were tested in an extensive
ficity (the exact implementation of the work-
experiment study in June 1998. The study followed a two-
around), and accuracy of timing inferences
phase, test-retest schedule. In the first phase,
because each the systems were tested on problems similar to
(the length each step in the workaround takes
to implement).
phase allows those used for system development, but in the
Performance evaluation, although essential,
second, the problems required significant
time to modifications to the systems. Within each
tells us relatively little about the HPKB inte-
grated systems, still less about the component
improve phase, the systems were tested and retested on
technologies. We also want to know why the
the same problems. The test at the beginning
performance of each phase established a baseline level of systems perform well or poorly. Answering this
on test performance, but the test at the end measured question requires credit assignment because
the systems comprise many technologies. We
problems. improvement during the phase.
We claim that HPKB technology facilitates also want to gather evidence pertinent to sev-
rapid modification of knowledge-based sys- eral important, general claims. One claim is
tems. This claim was tested in both phases of that HPKB facilitates rapid construction of
the experiment because each phase allows knowledge-based systems because ontologies
time to improve performance on test prob- and knowledge bases can be reused. The chal-
lems. Phase 2 provides a more stringent test: lenge problems by design involve broad, rela-
Only some of the phase 2 problems can be tively shallow knowledge in the case of crisis
solved by the phase 1 systems, so the systems management and deep, fairly specific knowl-
were expected to perform poorly in the test at edge in the battlespace problems. It is unclear
the beginning of phase 2. The improvement in which kind of problem most favors the reuse
performance on these problems during phase claim and why. We are developing analytic
2 is a direct measure of how well HPKB tech- models of reuse. Although the predictions of
nology facilitates knowledge capture, represen- these models will not be directly tested in the
tation, merging, and modification. first year’s evaluation, we will gather data to
Each challenge problem is evaluated by dif- calibrate these models for a later evaluation.
38 AI MAGAZINE
15. Articles
Results of the Challenge ity of the presentation of the explanation, the
automatic production by the system of a repre-
Problem Evaluation sentation of the question, source novelty, and
We present the results of the crisis-manage- reconciliation of multiple sources. Each ques-
ment evaluation first, followed by the results tion could garner a score between 0 and 3 on
of the battle-space evaluation. each criterion, and the criteria were themselves
weighted. Some questions had multiple parts, When you
Crisis Management and the number of parts was a further weight-
The evaluation of the SAIC and Teknowledge ing criterion. In retrospect, it might have been consider the
crisis-management systems involved 7 trials or clearer to assign each question a percentage of difficulty of the
the points available, thus standardizing all
batches of roughly 110 questions. Thus, more
than 1500 answers were manually graded by scores, but in the data that follow, scores are
task, both
the challenge problem developer, IET, and sub- on an open-ended scale. Subject-matter systems
ject matter experts at PSR on criteria ranging experts were assisted with scoring the quality performed
from correctness to completeness of source of knowledge representations when necessary.
material to the quality of the representation of A web-based form was developed for scor- remarkably
the question. Each question in a batch was ing, with clear instructions on how to assign well. Scores on
posed in English accompanied by the syntax of scores. For example, on the correct-answer cri-
the corresponding parameterized question (fig- terion, the subject-matter expert was instruct- the sample
ure 3). The crisis-management systems were ed to award “zero points if no top-level answer questions were
supposed to translate these questions into an is provided and you cannot infer an intended
internal representation, MELD for the Teknowl- answer; one point for a wrong answer without
relatively high,
edge system and KIF for the SAIC system. The any convincing arguments, or most required which is not
MELD translator was operational for all the tri- answer elements; two points for a partially cor-
rect answer; three points for a correct answer
surprising
als; the KIF translator was used to a limited
extent on later trials. addressing most required elements.” because these
The first trial involved testing the systems When you consider the difficulty of the task, questions had
on the sample questions that had been avail- both systems performed remarkably well.
able for several months for training. The Scores on the sample questions were relatively been available
remaining trials implemented the “test and high, which is not surprising because these for training for
retest with scenario modification” strategy dis- questions had been available for training for
cussed earlier. The first batch of test questions, several months (figure 5). It is also not surpris-
several months
TQA, was repeated four days later as a retest; it ing that scores on the first batch of test ques- ….
was designated TQA’ for scoring purposes. The tions (TQA) were not high. It is gratifying,
however, to see how scores improve steadily
It is also not
difference in scores between TQA and TQA’
represents improvements in the systems. After between test and retest (TQA and TQA’, TQC surprising that
solving the questions in TQA’, the systems and TQC’) and that these gains are general: scores on the
tackled a new set, TQB, designed to be “close The scores on TQA’ and TQB and TQC’ and
to” TQA. The purpose of TQB was to check TQD were similar. first batch of
whether the improvements to the systems gen- The scores designated auto in figure 5 refer test questions
eralized to new questions. After a short break, to questions that were translated automatically
a modification was introduced into the crisis- from English into a formal representation. The
(TQA) were not
management scenario, and new fragments of Teknowledge system translated all questions high. It is
automatically, the SAIC system very few. Ini-
knowledge about the scenario were released.
Then, the cycle repeated: A new batch of ques- tially, the Teknowledge team did not manipu-
gratifying,
tions, TQC, tested how well the systems coped late the resulting representations, but in later however, to see
with the scenario modification; then after four batches, they permitted themselves minor how scores
days, the systems were retested on the same modifications. The effects of these can be seen
questions, TQC’, and on the same day, a final in the differences between TQB and TQB-Auto, improve
batch, TQD, was released and answered. TQC and TQC-Auto, and TQD and TQD-Auto. steadily
Each question in a trial was scored according Although the scores of the Teknowledge and
to several criteria, some official and others SAIC systems appear close in figure 5, differ-
between test
optional. The four official criteria were (1) the ences between the systems appear in other and retest…
correctness of the answer, (2) the quality of the views of the data. Figure 6 shows the perfor-
explanation of the answer, (3) the complete- mance of the systems on all official questions
ness and quality of cited sources, and (4) the plus a few optional questions. Although these
quality of the representation of the question. extra questions widen the gap between the sys-
The optional criteria included lay intelligibility tems, the real effect comes from adding
of explanations, novelty of assumptions, qual- optional components to the scores. Here,
WINTER 1998 39