Automated identification of sensitive information

Cover Page

Using Ultra‐Structure for
Automated Identification of
Sensitive Information in
Documents

Author: Jeffrey G. Long (jefflong@aol.com)

Date: October 21, 1999

Forum: Talk presented at the 20th annual conference of the American Society for
Engineering Management.

Contents
Pages 1‐5: Preprint of paper

Pages 6‐24: Slides (but no text) for presentation

License
This work is licensed under the Creative Commons Attribution‐NonCommercial
3.0 Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by‐nc/3.0/ or send a letter to Creative
Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Uploaded July 1, 2011

USING ULTRA-STRUCTURE FOR
AUTOMATED IDENTIFICATION OF SENSITIVE INFORMATION IN DOCUMENTS [1]
Jeffrey G. Long, Sr. Knowledge Engineer, DynCorp NSP

[1] This work was funded by the U.S. Department of Energy under contract DE-AC01-98NN50049.

Abstract. The Government has a strong interest in protecting nuclear and national security
information while maximizing the availability of information about its operations. Towards
this end federal agencies have developed tens of thousands of 'guidance rules' to help
determine what is or is not classified, and at what level. Trained and certified document
reviewers apply these rules. The Reviewer's Assistant System (RAS) at the Department of
Energy is a new type of expert system that uses a relational database to store very large
numbers of rules. It is being developed to automatically apply DOE guidance rules to text
documents. This requires far more than a mere keyword search, as the ideas being
distinguished can be quite subtle. The purpose of the system is not to 'understand' the
documents it processes, but rather to simply check them for the existence of certain ideas
or facts. This technology for 'concept-spotting' can be applied to other arenas, such as
detecting junk email or searching web sites. This paper will discuss the features, goals and
general methods of the Reviewer's Assistant System.

Background

The traditional approach to managing any engineering project is structured: it moves from
general planning to requirements analysis, design, implementation, and then long-term
maintenance, and has explicit criteria to determine whether and when to move to the next
stage. This approach works quite well for creating most types of systems. If a system is
simple and the requirements change, the traditional structured approach works because the
system can be affordably modified. Alternatively, if a system is complex but the
requirements never change, the system can be successfully built. Traditional structured
approaches have proven to be better than completely unstructured approaches and have led to
the development of many successful systems. However, standard structured approaches have
failed to satisfactorily address the problems involved in creating time-viable software
systems, especially over long times. Instead, they have led to the frequent replacement of
systems with wholly new ones, most often at great cost for both developers and users of
software. Attempts to get better user requirements through rapid prototyping and better
charting tools for notating work processes have helped somewhat, but they too have failed
to make systems more flexible in the face of changing user requirements. Indeed, if both
(a) the system being developed is complex and (b) the user requirements are subject to
significant change over time, then the existing structured approaches usually do not work.
The greatest engineering accomplishments of the 20th Century are of the former type.
Complex systems ranging from computers to the Space Shuttle have of course been
successfully built – but only on the condition that their user requirements change little
if at all after the design stage.

Unfortunately for us, software systems are both complex and ever changing. As such, they
require a different engineering management approach. The theory of “Ultra-Structure” was
developed in part because the application of traditional engineering approaches has failed
when applied to designing software systems. This theory was described in (Long and Denning,
1995), and less technically in (Long, 1999), but will be briefly and nontechnically
described here.

This article primarily discusses the application of Ultra-Structure theory to a new area,
namely the analysis of text documents for sensitive information. The U.S. Department of
Energy (DOE) Declassification Productivity Initiative (DPI), which will be described later
in this paper, has funded this work. The use of Ultra-Structure theory has thus far allowed
us to address several difficult problems that have limited and hindered previous efforts at
Natural Language Processing (NLP) systems, expert systems, and large knowledgebases; in
particular:

• the ability to manage large numbers of rules in a knowledgebase, numbering in the tens of
  thousands now and eventually in the hundreds of thousands
• the ability to give knowledge engineers a set of tools to help them visualize and manage
  large knowledgebases of rules (“rulebases”)
• the ability to manage and maintain both metadata and content information regarding large
  numbers of documents.

Ultra-Structure Fundamentals

Ultra-Structure theory is based on a different way of looking at the world – a different
paradigm or
worldview. The traditional Western, Aristotelian worldview sees the world as composed of
objects having attributes and relationships to other objects. Ultra-Structure theory sees
the world as being a process which, as a minor by-product, occasionally generates physical
entities and new relationships among physical entities. The first task of an
Ultra-Structure analysis is to understand these processes and to represent them very
accurately. The development task of an Ultra-Structure analysis is to ensure that the
representation models not just the processes that exist currently, but all logically
possible processes within that family of systems.

Ultra-Structure theory suggests that any process can, in principle and in practice, be
analyzed into or reduced to a set of If-Then rules. Seemingly simple processes usually
follow just a few simple rules; and seemingly complex processes may follow either a few, or
many, simple rules. One of the interesting discoveries in cellular automata studies has
been that very simple rules can generate very complex behaviors. This has also been
observed in the work done in the last 20 years in fractal geometry, where the recursive
execution of a single remarkably simple formula, which is a type of rule, can specify very
complex shapes.

The kind of rules that humans usually work with may be thought of as complex rules.
However, a good analyst can analyze complex rules into simple or atomic rules, according to
Ultra-Structure theory. Rules may be defined in terms of sets, with each set having a
specified and limited possible domain or set of values. A particular rule in this
definition is a particular ordered sequence of these values.

Ultra-Structure theory suggests that a good analyst can always define an ordered sequence
of domains (in Ultra-Structure terminology, universals) which will contain any particular
instance of a rule. Such an ordered set of universals is called a ruleform. A ruleform is
to a rule what ordinary algebra is to arithmetic: it is a more generalized way of
specifying the essential structural ideas and relationships of a system. While ordinary
algebra uses symbols to represent numbers and various arithmetical operations,
Ultra-Structure uses universals to represent various domains whose contents may be numeric,
alphabetical, or indeed the tokens of any notational system.

Exhibit 1 shows proposed terminology and distinctions for the different layers of structure
of any system. Simply put, the surface structure is defined to be the physical
manifestation of any system, consisting of its physical entities, relationships and
processes. The middle structure is the set of all rules governing the system, which
generate the surface structure. The deep structure is the set of ordered domains from which
particular rules may be constructed. The sub-structure represents the set of all possible
domains (universals). Finally, the notational structure is the set of tokens used by the
rules to represent various abstractions. For RAS, these tokens are numbers and letters; for
(say) a music system they would be letters and other signs interpreted as musical “notes”
or instructions.

Ultra-Structure theory specifies how these various layers can be represented on a computer
as tables in a relational database. To implement a ruleform, we create a table where each
row is a separate simple rule, and each column is a separate universal. Typically a complex
rule will require several simple rules that are stored in different ruleforms; this group
of ruleforms must be examined as a whole in order for the system to make decisions; it is
referred to as a cluster. There are several types of ruleforms, as will be illustrated
later.

Properly specifying these structures for a model of a system enables the users of the model
to enter the rules of the system as data, which is easily changed as necessary when the
rules change. Approximately 99% of the rules of the system are specified as data, so the
model itself – its software and data structures – has no knowledge of the outside world.
All the model itself knows is the order in which to read the ruleforms. This control logic,
called animation procedures, is very small; typically just a few thousand lines of code
even for a very complex system.

Natural Language: A New Application Area
for Ultra-Structure

Ultra-Structure theory has been applied in a number of application areas, mostly in
business. This has included the traditional application areas of order entry, inventory
control, billing and cash application, and similar business functions. It has also been
applied experimentally to other areas, as indicated by (Shostko, 1999) and (Oh and Scotti,
1999).

Starting with part-time work in 1995, it has now been applied to the automated
identification of sensitive information in text documents. The present application area is
new for Ultra-Structure: attempting to specify the rules by which natural language can be
at least partially understood, i.e. partially interpreted and assigned meaning.

The Government has a strong interest in protecting national security information (NSI),
while facilitating government openness to public scrutiny and not spending significant
amounts of money and time protecting information that does not truly need to be classified.
President Clinton issued Executive Order 12958 (E.O.), the most recent codification of the
Government’s intentions in this area, on April 17, 1995. It states, among other things,
that all classified documents containing only NSI (not RD or FRD, discussed below) will be
automatically declassified
after 25 years unless one of nine specified conditions for exemption is met.

The estimated volume of documents to be reviewed for declassification under the E.O.
exceeds one billion, and the five year grace period specified by the E.O. for reviews to
identify exemptions from automatic declassification has been extended for an additional 18
months to help the Agencies meet the enormous workloads.

Under the E.O., any information that is covered under the Atomic Energy Act (AEA) is exempt
from automatic declassification. Such information includes anything pertaining to the
construction, design or use of nuclear weapons, nuclear propulsion systems, and other
special nuclear materials. This information was exempted because the President does not
have the authority to unilaterally change the AEA (a law), and also because it is generally
recognized that even “old” nuclear design information would still be of current value to a
would-be proliferant. It is simply not in the interests of the United States to provide
such information.

To help identify this kind of information, called “Restricted Data” (RD) or “Formerly
Restricted Data” (FRD), as well as other kinds of national security information (NSI), the
Department of Energy has developed about 65,000 specific guidance topics. Their purpose is
to help determine what is or is not classified as RD, FRD or NSI, and at what level
(confidential, secret, or top secret). Trained and certified document reviewers apply these
topics. Moreover, under the Freedom of Information Act (FOIA), the public is entitled to
request documents and DOE must be prepared to justify any classification actions it takes
before a federal judge. The Department must have a clear rationale tracing back to the
65,000 guidance topics and from there to either the AEA or to the latest E.O. pertaining to
national security.

Document reviewing is a manually intensive process requiring years of education and
training. Congress funded the Declassification Productivity Initiative (DPI) at the
Department of Energy (DOE) in order to develop advanced tools to help reviewers in various
ways. One of the primary tools we have been developing under DPI is called the “Reviewer’s
Assistant System” (RAS), which was built using Ultra-Structure theory.

Reviewer’s Assistant System Functions

We are building RAS using Ultra-Structure theory because the number of rules is quite large
and these rules are likely to change over time. RAS is designed to:

• rigorously apply DOE Guidance Rules to text documents
• highlight any segments of the text to which a Guidance Rule applies
• highlight for the user (i.e. a certified document reviewer) the specific Guidance Rule(s)
  that caused any particular sections of text to be selected.

The purpose of RAS is not to “understand” the documents it processes, but rather to detect
the existence of any classified concepts or facts. While this is simpler than true document
understanding, it nevertheless requires far more than mere keyword searching, where a
system simply scans a document for the existence of one or more specified terms. It also
requires more than a Boolean keyword search, where specific terms can be ANDed, ORed, etc.
We are seeking specific concepts having specific relations to one another, which we refer
to as ideas or propositions. The ideas being sought can be quite subtle.

Merging Databases and Text Markup Languages

Traditionally the task of defining the elements of document structure would be performed
using a text markup language such as a derivative of the Standard Generalized Markup
Language (SGML). In this kind of language, “tags” indicating different structural features
of the document are inserted into the document at the beginning and end of each structural
feature. Following Ultra-Structure theory, RAS represents the information in terms of
rules, and the rules are stored as records of data in various tables. In RAS, therefore,
all structural markup information is stored in a database. This kind of markup does not use
in-line tags, but instead uses different fields in a table.

There are a number of advantages to storing structured text information in database tables
rather than in a flat file with tags. Chief among these are the following general
capabilities of relational databases over flat files:

• control access to the data through a security system and audit trail
• enforce referential integrity, such that when a value changes in one part of the system
  it is immediately changed in all parts of the system
• permit use of complex queries using (e.g.) Structured Query Language (SQL)
• give users quick access to volumes of data through easy-to-use forms and reports
• store and retrieve various types of objects in addition to standard text (e.g. images,
  sounds).

Merging Databases and Knowledgebases

Ever since mankind first used an abacus about 5,000 years ago, and possibly since we first
notched tallies on a stick 30,000 years ago, we have distinguished algorithms from data.
This has been a useful distinction, but the veritable wall between the two began to break
down when John von Neumann proposed in a memo in 1945 that not just data but also
algorithms (as computer instructions) could be stored on a computer in a binary form. This
insight – based on work done by him and others at the University of Pennsylvania, including
John Mauchly and J. Presper Eckert – led to programmable (stored program) computers.

Although both parts are stored in the same way as binary digits (bits), computer
applications are still viewed as consisting of two very different things: algorithms and
data. An algorithm is a finite series of steps taken to compute an answer. Data are the
values or parameters used by an algorithm to reach its conclusions; these may have initial,
intermediate and final values.

Database applications are generally viewed as applications that provide storage places and
access methods for the safe storage and retrieval of persistent data, and the safe adding,
changing and deleting of data following certain integrity rules regardless of whether the
application software using the database enforces those rules or not. Under this paradigm,
databases store and protect “facts” or “data,” and the algorithms that read and use these
facts are stored in software programs, queries, stored procedures, job control language
procedures, etc. Examples of such applications are order entry, inventory, purchasing, and
accounting systems. This class of systems is concerned primarily with data storage,
arithmetic and logical calculations, and information retrieval. For this class of systems,
changing the rules of a business area requires changing the software – a frequently
difficult task.

Expert systems are a different class of applications which consist of rules and an
inference engine, and which are concerned primarily with applying reasoning to facts in
order to simulate the behavior of human experts in a particular subject domain. The
inference engine processes the rules, which are stored in a “knowledgebase” rather than a
database. These rules may include executable code, or they may be mere data. The reasoning
process may be similar to that of a human expert, or it may be completely different. The
behavior of the system as a whole is intended to mimic, and hopefully outperform, a human
expert. Examples of such applications include bank credit approval, medical diagnosis, and
hardware configuration systems. These systems are usually intended to aid rather than
replace human decision-makers. They offer the benefits of high speed, high consistency, and
perfect attention to detail.

There have been a number of attempts in the last ten years to bridge the gap between these
two classes of applications – to merge databases and knowledgebases and their associated
technologies. There is a growing belief that modern database systems must evolve towards
knowledgebase systems, and that more “inferencing” is necessary for better understanding
and use of data. This could lead to applications involving hundreds of thousands of complex
rules that make decisions that seem truly “intelligent.”

The Ultra-Structure paradigm does not make these conventional distinctions between
algorithms and data. Rather it defines whatever is stored in a relational database table to
be rules which have two different types of parts, called factors and considerations.
Factors are primary keys in a table that determine under what general conditions a rule
should be looked at; and following standard normalization rules it requires that there be
unique keys (factors) for each record (rule). What is traditionally considered to be data
(i.e., a fact) is usually stored as a consideration (a non-primary-key attribute) in the
record, and this attribute serves merely to guide the execution of a rule cluster. In an
inventory system, for example, the quantity-on-hand of a particular item is simply a
consideration determining where the item may be sourced for an order. That and other rules
in the cluster must all be examined in order for the inventory system to make an
intelligent sourcing decision. The inference engine (called animation procedures) consists
of just a few thousand lines of code. All knowledge of the external world lies in the
rulebase, and none in the animation procedures. RAS is an example of a new type of system
that uses a relational database to store a very large number of rules as data.

This perspective requires a new and broader understanding of the nature of rules. If we
broaden our concept of rules from

    IF x THEN do y and z

to

    IF x THEN consider y and z before deciding what to do,

then y and z can serve the role traditionally reserved for data, that is they can represent
the facts of the world. They do this as an integral part of a larger and more comprehensive
cluster of rules, acting as considerations for the execution of individual rules.

This means that all the business rules of an organization can be stored as data, and the
only software that is necessary is the inference engine, which should never need to change.
This puts all knowledge of the world and all the knowledge of rules in a format which is
easy to update, easy to review, and can be managed easily by a standard relational
database.
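
The factor and consideration split described above can be illustrated with a small sketch. It is not the RAS or any production Ultra-Structure schema: the table name sourcing_rule, its columns, the warehouse data and the function source_order are all hypothetical, chosen only to mirror the inventory example in the text. The factors form the primary key, the considerations are ordinary columns, and the "animation procedure" is a few lines that read the rules in order.

```python
import sqlite3


def build_rulebase() -> sqlite3.Connection:
    """Create an in-memory rulebase holding one hypothetical ruleform."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE sourcing_rule (
            item        TEXT    NOT NULL,  -- factor (part of the primary key)
            warehouse   TEXT    NOT NULL,  -- factor (part of the primary key)
            priority    INTEGER NOT NULL,  -- consideration
            qty_on_hand INTEGER NOT NULL,  -- consideration
            PRIMARY KEY (item, warehouse)
        )
        """
    )
    db.executemany(
        "INSERT INTO sourcing_rule VALUES (?, ?, ?, ?)",
        [
            ("widget", "east", 1, 3),
            ("widget", "west", 2, 40),
            ("gadget", "east", 1, 12),
        ],
    )
    return db


def source_order(db: sqlite3.Connection, item: str, qty: int):
    """Tiny animation procedure: read the rules for `item` in priority
    order and return the first warehouse whose quantity on hand covers
    the order, or None if no rule allows it."""
    rows = db.execute(
        "SELECT warehouse, qty_on_hand FROM sourcing_rule "
        "WHERE item = ? ORDER BY priority",
        (item,),
    )
    for warehouse, on_hand in rows:
        if on_hand >= qty:
            return warehouse
    return None


db = build_rulebase()
print(source_order(db, "widget", 10))  # west: east holds only 3

# Changing behavior means changing data, not code.
db.execute(
    "UPDATE sourcing_rule SET qty_on_hand = 25 "
    "WHERE item = 'widget' AND warehouse = 'east'"
)
print(source_order(db, "widget", 10))  # east: its rule now allows it
```

The procedure knows only the order in which to read the ruleform; which warehouses exist and what they hold lives entirely in the table.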

Rules in RAS

As used by RAS, Ultra-Structure defines several basic kinds of existential rules or types
of entities:

• semantic entities, which can be letters, words, phrases, guide topics, or entire guides
• documents, which are the entities being analyzed by the system
• markings, which indicate what to do in the event that certain ideas are found in a text,
  e.g. mark the document as “confidential”
• users, which define the authorized users of the system.

These entities typically have complex relations to one another.

If entities are related to other entities of the same type, the relations are called
network rules. RAS has several kinds of network rule:

• entities network relates semantic entities to one another
• markings network relates markings to one another, and in particular indicates a hierarchy
  of markings
• documents network relates documents to one another, indicating (e.g.) that one document
  replaces another, or is a duplicate of another, etc.

If existential entities of one kind are related to entities of another kind, we represent
that with an authorization rule. RAS has several kinds of authorization rules:

• document detail contains the results of the pre-analysis of a document, specifying the
  semantic entities and their characteristics and order in the document
• document analysis contains the results of the analysis of a document
• entity markings relates semantic entities (e.g. guide topics) and their associated
  markings (if any)

Note that each ruleform (table) may be interpreted as defining rules. For example, the
Document Detail table may be interpreted as specifying rules for the (re-)construction of
the original document. The Markings Network specifies how markings are ordered in a
hierarchy, e.g. if a marking of “confidential” and a marking of “secret” both apply to the
same document, then the overall classification of the document is “secret”.

There are other ruleforms (tables) in RAS, but these give the general idea of what the
system contains.

Executing (Animating) the RAS Rules

In order to search for concepts in a text, the text must first be “pre-analyzed.” This
involves the determination of various boundaries (e.g. sentence boundaries) and the
determination of the nature of certain kinds of lexical entities, e.g. whether a specific
entity is numeric or non-numeric, and whether a period is part of a number (a decimal
point), is used as part of an acronym or abbreviation, or is indeed the end of a sentence.
Each word in a document is usually treated as a separate “semantic entity.” But since words
in a phrase often have meanings very different from the same words outside the phrase (e.g.
“A horse of a different color” has nothing to do with either horses or colors!), there is
frequently a need to indicate that several words must always be treated as a single phrase,
in which case the entire phrase becomes a single semantic entity. This is defined by a
replacement rule in the Entities Network ruleform.

Each semantic entity has a number of attributes such as character position in the document,
word number, sentence number, paragraph number, whether it is numeric, etc. These and other
attributes for each semantic entity are stored as a rule on a single record in the Document
Detail ruleform, in lieu of using an SGML-type markup language. The system is thus
generating new “rules” based on other rules, which facilitates subsequent analysis.

After performing an analysis it is necessary to indicate which portions of the text are
considered classified or are otherwise marked, and what specific guidance topic(s) caused
the text to be selected. These rules, also generated by the system based on other rules,
are stored in the Document Analysis ruleform.

Performing the analysis itself requires looking for the tokens in the target documents, and
applying the markings indicated. Since each guidance topic is translated into one or more
propositions, and there are about 65,000 guidance topics, we anticipate that there will be
about 100,000 propositions to be represented and searched for in each text. This number
accounts for and excludes duplicate guidance topics. This number of rules alone would make
RAS a very large expert system.

As indicated in Exhibit 2, specifying a proposition (in the sense used here) means
specifying usually two to four concepts which occur within a defined proximity of one
another in a text, e.g. 5 sentences or 15 words. We have found thus far that even the most
complex propositions require only six concepts.

Specifying several concepts to search for is not by itself adequate: the computer must also
know all the
possible ways that each concept can be tokenized (i.e. lexically expressed) in any text
document. This calls for a large tree of relationships specifying how concepts can be
tokenized. This mapping requirement – essentially a large ontology of all areas of DOE
activity – will probably add another 500,000+ rules. We keep these rules in the Entities
Network ruleform, first defining all concepts and tokenizations in the Semantic Entities
ruleform. Note that in many cases there is no need to specify all forms of an entity, e.g.
singular, plural, possessive, etc.; using a wildcard before or after a word stem is
sometimes adequate so long as the knowledge engineer is aware that use of stems may
increase the false positive hit rate.

Of course, not all concepts and tokenizations are equally related to one another. We
represent this degree of closeness with a fuzzy fitness number from 0 to 1 to indicate the
degree to which the two are related. We then need to test this rulebase against a large set
of documents, and to go back and correct the rules to minimize or eliminate false positives
and false negatives, either by adding new entities to look for, changing the relationships
of existing entities, or specifying tighter hit ranges.

In the long term, we will need to be able to keep the rules up-to-date as the original
guidance topics change over time and as we apply them to different corpora having
variations in their lexical representations, e.g., using a different (former) name for a
national laboratory. We are still in the early stages of creating and validating these
rules, and we expect this to be the most difficult part of building the system due to the
wide range of subject areas to be covered.

Results to Date

We have run RAS on several different corpora having very different characteristics, in
order to see how it performs with these different corpora. Characteristics of interest
include whether the documents are known or believed to be classified or unclassified; the
size of each document in the corpus, ranging from a few sentences to hundreds of pages; the
size of the total corpus, ranging so far up to about 3 million words; and whether the
corpus was originally created electronically or whether it was OCRed and therefore has some
number of OCR errors in it.

The rulebase tested has over 700 guidance rules in it, and maps to about 20,000 tokens. We
expect soon to greatly increase the number of guidance rules applied.

Results so far show a typical false positive hit rate of about ten percent of the documents
reviewed, meaning that of 100 documents, RAS will incorrectly identify ten as “hot” when
they are not. We hope of course to reduce this rate by broadening and deepening the
rulebase, and work is underway to automatically generate RAS rules from the written English
guidance. Of these false positives, about ten percent are what we call “good false
positives”, i.e. items that are not in fact classified but which a reviewer would want to
look at closely before making that determination.

In terms of missing items that should have been identified as “hot” (i.e., false
negatives), results are harder to determine since sometimes even human reviewers may
disagree about what is sensitive. But results to date indicate that almost all missed items
are readily accounted for as outside the domain of the rulebase, either pertaining to a
different subject area or including tokens that the rulebase was unaware of.

This low false negative and low false positive rate is in great contrast to other
approaches which have often been found unusable based on 50%+ false positive hit rates,
using that term as defined above.

Other Possible RAS Applications

The RAS technology for “concept-spotting” – reviewing documents for the existence of
specific propositions – can theoretically be applied to other arenas, such as:

• identifying unsolicited commercial email (“spam”)
• searching web sites for certain ideas
• searching patents for certain ideas
• scanning computer source code for Y2K date issues.

We are still a long way from completing the RAS system. The results to date have been very
promising: RAS is demonstrating the advantages of Ultra-Structure theory for concept
detection and large knowledgebases. The Declassification Productivity Research Center
(DPRC) at The George Washington University is carrying out other Ultra-Structure based
research projects, which are also showing positive results (Oh and Scotti, 1999).

Summary and Conclusions

A million records is small by database system standards, but a million rules is essentially
an impossible number for a traditional expert system to manage. We expect to be able to
effectively handle very large numbers of rules, numbering in the hundreds of thousands,
using the techniques being followed for RAS. Ultra-Structure theory may constitute a real
merger of knowledgebase and database technologies. If so, it has the potential to usher in
a new era of vastly larger expert systems for carrying out policies and procedures of
extreme complexity.

References

Long, Jeffrey G. and Denning, Dorothy E., “Ultra-Structure: A design theory for complex
systems and processes,” Communications of the ACM 38(1), (1995) 103-120.

Long, Jeffrey G., “A new notation for representing business and other rules,” Semiotica
125-1/3, (1999) 215-228.

Oh, Youngsuck and Scotti, Richard, “Analysis and Design of a Database using Ultra-Structure
Theory (UST) – Conversion of a Traditional Software System to One Based on UST,”
Proceedings of the 20th Annual Conference, American Society for Engineering Management
(1999).

Shostko, Alexander, “Design of an automatic course-scheduling system using
Ultra-Structure,” Semiotica 125-1/3, (1999) 197-214.

About the Author

Mr. Long is Senior Knowledge Engineer on the DPI project. He is also Director of the
Notational Engineering Laboratory, an effort to create a clearinghouse for people
interested in problems of representation in any field of science, art or other activity.
His experience includes 25 years of consulting on various kinds of applications software
development, with a particular focus on studying complex systems and the problems of
representing them.

Standard Terminology (if any)                             Ultra-Structure Instance Name  Ultra-Structure Level Name  U-S Implementation
behavior, physical entities and relationships, processes  particular(s)                  surface structure           system behavior
rules, laws, constraints, guidelines, rules of thumb      rule(s)                        middle structure            data and some software (animation procedures)
(no standard or common term)                              ruleform(s)                    deep structure              tables
(no standard or common term)                              universal(s)                   sub-structure               attributes, fields
tokens, signs or symbols                                  token(s)                       notational structure        character set

Exhibit 1: Layers of Structure in Any System, According to Ultra-Structure Theory

Exhibit 2: RAS Breakdown of Topics to Tokens

Using Ultra-Structure for
Automated Identification of
Sensitive Information in
Documents

Jeffrey G. Long
Sr. Knowledge Engineer, DynMeridian
notate@aol.com

Traditional Engineering Approaches Work
Only Under Certain Conditions

Unfortunately, Complex and Changing
Needs Exist in Every Organization

[Figure: Needs compared with SW & DB over time (time 1, time 2, time 3...)]

Ultra-Structure Theory Was Created to
Support Complex and Changing Rules

• New theory of systems design, developed 1985
• Focuses on optimal computer representation of
  complex, conditional and changing rules
• Based on a new abstraction called ruleforms

• The breakthrough was to find the unchanging
  features of changing systems

The Theory Offers a Different Way to
Look at Complex Systems and Processes

observable behaviors      surface structure
    generates
rules                     middle structure
    constrains
form of rules             deep structure

This Creates New Levels for Analysis
and Representation

Standard Terminology (if any)                             Ultra-Structure Instance Name  Ultra-Structure Level Name  U-S Implementation
behavior, physical entities and relationships, processes  particular(s)                  surface structure           system behavior
rules, laws, constraints, guidelines, rules of thumb      rule(s)                        middle structure            data and some software (animation procedures)
(no standard or common term)                              ruleform(s)                    deep structure              tables
(no standard or common term)                              universal(s)                   sub-structure               attributes, fields
tokens, signs or symbols                                  token(s)                       notational structure        character set

The Ruleform Hypothesis

Complex system structures are created by not-necessarily
complex processes; and these processes are created by the
animation of operating rules. Operating rules can be grouped
into a small number of classes whose form is prescribed by
"ruleforms". While the operating rules of a system change over
time, the ruleforms remain constant. A well-designed collection
of ruleforms can anticipate all logically possible operating rules
that might apply to the system, and constitutes the deep
structure of the system.
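
The hypothesis can be illustrated with a minimal sketch, which is not the RAS implementation. It assumes a single hypothetical ruleform, a table named token_rule with the columns token, concept and marking; the table name, column names, sample rows and the animate function are all invented for illustration. The point it tries to show is that adding an operating rule means adding a row, while the table structure and the code that animates the rules stay the same.

```python
import sqlite3

# Hypothetical ruleform: one table whose columns never change.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_rule ("
    " token TEXT NOT NULL,"     # lexical expression to look for
    " concept TEXT NOT NULL,"   # guidance concept it expresses
    " marking TEXT NOT NULL)"   # marking to propose on a hit
)


def animate(conn, text):
    """Generic 'animation procedure': apply whatever rules are stored."""
    lowered = text.lower()
    hits = []
    rows = conn.execute(
        "SELECT token, concept, marking FROM token_rule ORDER BY token"
    )
    for token, concept, marking in rows:
        if token in lowered:
            hits.append((token, concept, marking))
    return hits


doc = "The report describes widget alloy and gizmo yield in detail."

# Invented sample rule; the names carry no real-world meaning.
conn.execute(
    "INSERT INTO token_rule VALUES ('widget alloy', 'material-x', 'REVIEW')"
)
first = animate(conn, doc)

# A new rule is a new row; neither the schema nor animate() changes.
conn.execute(
    "INSERT INTO token_rule VALUES ('gizmo yield', 'performance-y', 'REVIEW')"
)
second = animate(conn, doc)

print(first)
print(second)
```

In this sketch the second call reports one more hit than the first, although only data was changed between the two calls.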

The CoRE Hypothesis

There exist Competency Rule Engines, or CoREs, consisting of
<50 ruleforms, that are sufficient to represent all rules found
among systems sharing broad family resemblances, e.g. all
corporations. Their definitive deep structure will be permanent,
unchanging, and robust for all members of the family, whose
differences in manifest structures and behaviors will be
represented entirely as differences in operating rules. The
animation procedures for each engine will be relatively simple
compared to current applications, requiring less than 100,000 lines
of code in a third generation language.

DOE Reviewer’s Assistant System
Requirements

• 650 guides defining 65,000 topics that are or may be
  classified
• Extensive background knowledge required to
  interpret guidance
• Guidance changes over time
• Terminology in documents changes over time
• Current backlog of 300+ million pages
• Objective is concept spotting, not document
  understanding

Normally This Would be Done Using
an Expert System Shell

• ES often have trouble with > 100 rules
• DOE system will require about 500,000 rules
• Key issue: maintainability of rules
• Many benefits from using relational database to store
  rules as data
• Built-in referential integrity
• Easy report-writing and queries
• Simple user interface for KE and Reviewers
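
Two of these benefits, referential integrity and easy queries, can be sketched with Python's standard sqlite3 module. This is an illustration only and assumes a hypothetical two-table layout (concept and concept_token); the table names, column names and sample values are invented and are not taken from RAS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE concept (name TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE concept_token ("
    " concept TEXT NOT NULL REFERENCES concept(name),"
    " token TEXT NOT NULL,"
    " PRIMARY KEY (concept, token))"
)

# Invented sample data: one concept with two lexical expressions.
conn.execute("INSERT INTO concept VALUES ('material-x')")
conn.executemany(
    "INSERT INTO concept_token VALUES (?, ?)",
    [("material-x", "widget alloy"), ("material-x", "w-alloy")],
)

# A rule that points at an undefined concept is rejected by the database.
try:
    conn.execute("INSERT INTO concept_token VALUES ('no-such-concept', 'gizmo')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A simple report: how many tokens express each concept.
report = conn.execute(
    "SELECT concept, COUNT(*) FROM concept_token"
    " GROUP BY concept ORDER BY concept"
).fetchall()

print(rejected, report)
```

The integrity check and the report both come from the database engine; no rule-specific code had to be written for either.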

RAS Defines Guidance Concepts and
All Possible Lexical Expressions of
Those Concepts

[Figure: process flow. Convert Guides, Define Interpretations, System Ready; Read Document, Apply Guidance, Document Reviewed]

Rules Specify Relations Between
Concepts, Tokens and Markings

Results to Date are Promising

• In a corpus of 3,750 unclassified documents, the
  false positive rate was less than 10%
• In another corpus of 16,500 unclassified documents,
  the false positive rate was 2.5%

• In other approaches (e.g. keyword and statistical systems),
  false positive and false negative rates
  are often in excess of 50%
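
As a rough worked example of what these rates mean in document counts, the sketch below assumes that the false positive rate is the fraction of unclassified documents that are wrongly flagged. The function name is illustrative and not part of RAS.

```python
def false_positive_count(documents, rate):
    """Documents wrongly flagged, given a corpus size and a false positive rate."""
    return documents * rate


# The paper's own illustration: ten percent of 100 documents.
paper_example = false_positive_count(100, 0.10)

# First corpus: the rate was "less than 10%", so this is an upper bound.
corpus_a_bound = false_positive_count(3750, 0.10)

# Second corpus: 2.5% of 16,500 documents.
corpus_b = false_positive_count(16500, 0.025)

print(paper_example, corpus_a_bound, corpus_b)
```

Under that assumption, fewer than 375 documents in the first corpus and roughly 412 in the second would be flagged incorrectly.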

The Ultra-Structure-Based RAS
System Offers Substantial Benefits to
Reviewers and Knowledge Engineers

• System can provide precise and rigorous
  interpretation of DOE Classification Guidance
• Rules can become more complex if necessary
• Rules are easy to specify, change and review
• Implications and consequences of changes can be
  better foreseen
• Changes to rules do not require changing software or
  table structures – just data

Next Steps for RAS Development

• Work with subject experts to expand scope and
  improve quality and completeness of rulebase
• Continue testing system against many types of
  documents
• Improve design to minimize/eliminate false negatives
  and false positives
• Work with end-users to improve user interface
• Integrate into other systems
• Improve design to increase speed: parallel
  processing, stored queries, etc.

As the CoRE Hypothesis Promises, RAS
Could be Used in Other Areas Also

• Categorize documents by subject
• Scan email for spam/UCE
• Scan websites, e.g. for compliance to a standard
• Categorize patents or scan them for specified
  concepts
• Scan source code, e.g. Y2K
• Scan any machine-readable corpus for specified
  ideas
