Cover Page 

 



      Using Ultra‐Structure for 
     Automated Identification of 
      Sensitive Information in 
            Documents 
 

Author: Jeffrey G. Long (jefflong@aol.com) 

Date: October 21, 1999 

Forum: Talk presented at the 20th annual conference of the American Society for 
Engineering Management.


                                 Contents 
Pages 1‐5: Preprint of paper 

Pages 6‐24: Slides (but no text) for presentation 

 


                                  License 
This work is licensed under the Creative Commons Attribution‐NonCommercial 
3.0 Unported License. To view a copy of this license, visit 
http://creativecommons.org/licenses/by‐nc/3.0/ or send a letter to Creative 
Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA. 


                                 Uploaded July 1, 2011 
USING ULTRA-STRUCTURE FOR
           AUTOMATED IDENTIFICATION OF SENSITIVE INFORMATION IN DOCUMENTS1
                                     Jeffrey G. Long, Sr. Knowledge Engineer, DynCorp NSP


    Abstract. The Government has a strong interest in
    protecting nuclear and national security information
    while maximizing the availability of information about
    its operations. Towards this end federal agencies have
    developed tens of thousands of 'guidance rules' to help
    determine what is or is not classified, and at what level.
    Trained and certified document reviewers apply these
    rules. The Reviewer's Assistant System (RAS) at the
    Department of Energy is a new type of expert system
    that uses a relational database to store very large
    numbers of rules. It is being developed to automatically
    apply DOE guidance rules to text documents. This
    requires far more than a mere keyword search, as the
    ideas being distinguished can be quite subtle. The
    purpose of the system is not to 'understand' the
    documents it processes, but rather to simply check them
    for the existence of certain ideas or facts. This
    technology for 'concept-spotting' can be applied to other
    arenas, such as detecting junk email or searching web
    sites. This paper will discuss the features, goals and
    general methods of the Reviewer's Assistant System.

                          Background

The traditional approach to managing any engineering
project is structured: it moves from general planning to
requirements analysis, design, implementation, and then
long-term maintenance, and has explicit criteria to
determine whether and when to move to the next stage.
This approach works quite well for creating most types
of systems. If a system is simple and the requirements
change, the traditional structured approach works
because the system can be affordably modified.
Alternatively, if a system is complex but the
requirements never change, the system can be
successfully built. Traditional structured approaches
have proven to be better than completely unstructured
approaches and have led to the development of many
successful systems. However, standard structured
approaches have failed to satisfactorily address the
problems involved in creating time-viable software
systems, especially over long times. Instead, they have
led to the frequent replacement of systems with wholly
new ones, most often at great cost for both developers
and users of software. Attempts to get better user
requirements through rapid prototyping and better
charting tools for notating work processes have helped
somewhat, but they too have failed to make systems
more flexible in the face of changing user requirements.
Indeed, if both (a) the system being developed is
complex and (b) the user requirements are subject to
significant change over time, then the existing structured
approaches usually do not work. The greatest
engineering accomplishments of the 20th Century are of
the former type. Complex systems ranging from
computers to the Space Shuttle have of course been
successfully built – but only on the condition that their
user requirements change little if at all after the design
stage.
    Unfortunately for us, software systems are both
complex and ever changing. As such, they require a
different engineering management approach. The theory
of “Ultra-Structure” was developed in part because
traditional engineering approaches have failed when
applied to designing software systems. This
theory was described in (Long and Denning, 1995), and
less technically in (Long, 1999), but will be briefly and
nontechnically described here.
    This article primarily discusses the application of
Ultra-Structure theory to a new area, namely the analysis
of text documents for sensitive information. The U.S.
Department of Energy (DOE) Declassification
Productivity Initiative (DPI), which will be described
later in this paper, has funded this work. The use of
Ultra-Structure theory has thus far allowed us to address
several difficult problems that have limited and hindered
previous efforts at Natural Language Processing (NLP)
systems, expert systems, and large knowledgebases; in
particular:

• the ability to manage large numbers of rules in a
  knowledgebase, numbering in the tens of thousands
  now and eventually in the hundreds of thousands
• the ability to give knowledge engineers a set of tools
  to help them visualize and manage large
  knowledgebases of rules (“rulebases”)
• the ability to manage and maintain both metadata
  and content information regarding large numbers of
  documents.

                Ultra-Structure Fundamentals

Ultra-Structure theory is based on a different way of
looking at the world – a different paradigm or
1 -- This work was funded by the U.S. Department of Energy under contract DE-AC01-98NN50049.
worldview. The traditional Western, Aristotelian
worldview sees the world as composed of objects having
attributes and relationships to other objects.
Ultra-Structure theory sees the world as being a process which,
as a minor by-product, occasionally generates physical
entities and new relationships among physical entities.
The first task of an Ultra-Structure analysis is to
understand these processes and to represent them very
accurately. The development task of an Ultra-Structure
analysis is to ensure that the representation models not
just the processes that exist currently, but all logically
possible processes within that family of systems.
    Ultra-Structure theory suggests that any process can,
in principle and in practice, be analyzed into or reduced
to a set of If-Then rules. Seemingly simple processes
usually follow just a few simple rules; and seemingly
complex processes may follow either a few, or many,
simple rules. One of the interesting discoveries in
cellular automata studies has been that very simple rules
can generate very complex behaviors. This has also
been observed in the work done in the last 20 years in
fractal geometry, where the recursive execution of a
single remarkably simple formula, which is a type of
rule, can specify very complex shapes.
    The kind of rules that humans usually work with may
be thought of as complex rules. However, a good analyst
can analyze complex rules into simple or atomic rules,
according to Ultra-Structure theory. Rules may be
defined in terms of sets, with each set having a specified
and limited possible domain or set of values. A
particular rule in this definition is a particular ordered
sequence of these values.
    Ultra-Structure theory suggests that a good analyst
can always define an ordered sequence of domains (in
Ultra-Structure terminology, universals) which will
contain any particular instance of a rule. Such an
ordered set of universals is called a ruleform. A
ruleform is to a rule what ordinary algebra is to
arithmetic: it is a more generalized way of specifying the
essential structural ideas and relationships of a system.
While ordinary algebra uses symbols to represent
numbers and various arithmetical operations,
Ultra-Structure uses universals to represent various domains
whose contents may be numeric, alphabetical, or indeed
the tokens of any notational system.
    Exhibit 1 shows proposed terminology and
distinctions for the different layers of structure of any
system. Simply put, the surface structure is defined to be
the physical manifestation of any system, consisting of
its physical entities, relationships and processes. The
middle structure is the set of all rules governing the
system, which generate the surface structure. The deep
structure is the set of ordered domains from which
particular rules may be constructed. The sub-structure
represents the set of all possible domains (universals).
Finally, the notational structure is the set of tokens used
by the rules to represent various abstractions. For RAS,
these tokens are numbers and letters; for (say) a music
system they would be letters and other signs interpreted
as musical “notes” or instructions.
    Ultra-Structure theory specifies how these various
layers can be represented on a computer as tables in a
relational database. To implement a ruleform, we create
a table where each row is a separate simple rule, and
each column is a separate universal. Typically a
complex rule will require several simple rules that are
stored in different ruleforms; this group of ruleforms
must be examined as a whole in order for the system to
make decisions; it is referred to as a cluster. There are
several types of ruleforms, as will be illustrated later.
    Properly specifying these structures for a model of a
system enables the users of the model to enter the rules
of the system as data, which is easily changed as
necessary when the rules change. Approximately 99%
of the rules of the system are specified as data, so the
model itself – its software and data structures – has no
knowledge of the outside world. All the model itself
knows is the order in which to read the ruleforms. This
control logic, called animation procedures, is very small;
typically just a few thousand lines of code even for a
very complex system.

          Natural Language: A New Application Area
                    for Ultra-Structure

Ultra-Structure theory has been applied in a number of
application areas, mostly in business. This has included
the traditional application areas of order entry, inventory
control, billing and cash application, and similar business
functions. It has also been applied experimentally to
other areas, as indicated by (Shostko, 1999) and (Oh and
Scotti, 1999).
    Starting with part-time work in 1995, it has now been
applied to the automated identification of sensitive
information in text documents. The present application
area is new for Ultra-Structure: attempting to specify the
rules by which natural language can be at least partially
understood, i.e. partially interpreted and assigned
meaning.
    The Government has a strong interest in protecting
national security information (NSI), while facilitating
government openness to public scrutiny and not
spending significant amounts of money and time
protecting information that does not truly need to be
classified. President Clinton issued Executive Order
12958 (E.O.), the most recent codification of the
Government’s intentions in this area, on April 17, 1995.
It states, among other things, that all classified
documents containing only NSI (not RD or
FRD, discussed below) will be automatically declassified
after 25 years unless one of nine specified conditions for
exemption is met.
    The estimated volume of documents to be reviewed
for declassification under the E.O. exceeds one billion,
and the five year grace period specified by the E.O. for
reviews to identify exemptions from automatic
declassification has been extended for an additional 18
months to help the Agencies meet the enormous
workloads.
    Under the E.O., any information that is covered under
the Atomic Energy Act (AEA) is exempt from automatic
declassification. Such information includes anything
pertaining to the construction, design or use of nuclear
weapons, nuclear propulsion systems, and other special
nuclear materials. This information was exempted
because the President does not have the authority to
unilaterally change the AEA (a law), and also because it
is generally recognized that even “old” nuclear design
information would still be of current value to a would-be
proliferant. It is simply not in the interests of the United
States to provide such information.
    To help identify this kind of information, called
“Restricted Data” (RD) or “Formerly Restricted Data”
(FRD), as well as other kinds of national security
information (NSI), the Department of Energy has
developed about 65,000 specific guidance topics. Their
purpose is to help determine what is or is not classified
as RD, FRD or NSI, and at what level (confidential,
secret, or top secret). Trained and certified document
reviewers apply these topics. Moreover, under the
Freedom of Information Act (FOIA), the public is
entitled to request documents and DOE must be prepared
to justify any classification actions it takes before a
federal judge. They must have a clear rationale tracing
back to the 65,000 guidance topics and from there to
either the AEA or to the latest E.O. pertaining to national
security.
    Document reviewing is a manually intensive process
requiring years of education and training. Congress
funded the Declassification Productivity Initiative (DPI)
at the Department of Energy (DOE) in order to develop
advanced tools to help reviewers in various ways. One
of the primary tools we have been developing under DPI
is called the “Reviewer’s Assistant System” (RAS),
which was built using Ultra-Structure theory.

          Reviewer’s Assistant System Functions

We are building RAS using Ultra-Structure theory
because the number of rules is quite large and these rules
are likely to change over time. RAS is designed to:

• rigorously apply DOE Guidance Rules to text
  documents
• highlight any segments of the text to which a
  Guidance Rule applies
• highlight for the user (i.e. a certified document
  reviewer) the specific Guidance Rule(s) that caused
  any particular sections of text to be selected.

The purpose of RAS is not to “understand” the
documents it processes, but rather to detect the existence
of any classified concepts or facts. While this is simpler
than true document understanding, it nevertheless
requires far more than mere keyword searching, where a
system simply scans a document for the existence of one
or more specified terms. It also requires more than a
Boolean keyword search, where specific terms can be
ANDed, ORed, etc. We are seeking specific concepts
having specific relations to one another, which we refer
to as ideas or propositions. The ideas being sought can
be quite subtle.

      Merging Databases and Text Markup Languages

Traditionally the task of defining the elements of
document structure would be performed using a text
markup language such as a derivative of the Standard
Generalized Markup Language (SGML). In this kind of
language, “tags” indicating different structural features
of the document are inserted into the document at the
beginning and end of each structural feature. Following
Ultra-Structure theory, RAS represents the information
in terms of rules, and the rules are stored as records of
data in various tables. In RAS, therefore, all structural
markup information is stored in a database. This kind of
markup does not use in-line tags, but instead uses
different fields in a table.
    There are a number of advantages to storing
structured text information in database tables rather than
in a flat file with tags. Chief among these are the
following general capabilities of relational databases
over flat files:

• control access to the data through a security system
  and audit trail
• enforce referential integrity, such that when a value
  changes in one part of the system it is immediately
  changed in all parts of the system
• permit use of complex queries using (e.g.) Structured
  Query Language (SQL)
• give users quick access to volumes of data through
  easy-to-use forms and reports
• store and retrieve various types of objects in
  addition to standard text (e.g. images, sounds).

          Merging Databases and Knowledgebases

Ever since mankind first used an abacus about 5,000
years ago, and possibly since we first notched tallies on a
stick 30,000 years ago, we have distinguished algorithms
from data. This has been a useful distinction, but the
veritable wall between the two began to break down
when John von Neumann proposed in a memo in 1945
that not just data but also algorithms (as computer
instructions) could be stored on a computer in a binary
form. This insight – based on work done by him and
others at the University of Pennsylvania, including John
Mauchly and J. Presper Eckert – led to programmable
(stored program) computers.
    Although both parts are stored in the same way as
binary digits (bits), computer applications are still
viewed as consisting of two very different things:
algorithms and data. An algorithm is a finite series of
steps taken to compute an answer. Data is the values or
parameters used by an algorithm to reach its conclusions,
which data may have initial, intermediate and final
values.
    Database applications are generally viewed as
applications that provide storage places and access
methods for the safe storage and retrieval of persistent
data, and the safe adding, changing and deleting of data
following certain integrity rules regardless of whether
the application software using the database enforces
those rules or not. Under this paradigm, databases store
and protect “facts” or “data,” and the algorithms that
read and use these facts are stored in software programs,
queries, stored procedures, job control language
procedures, etc. Examples of such applications are order
entry, inventory, purchasing, and accounting systems.
This class of systems is concerned primarily with data
storage, arithmetic and logical calculations, and
information retrieval. For this class of systems,
changing the rules of a business area requires changing
the software – a frequently difficult task.
    Expert systems are a different class of applications
which consist of rules and an inference engine, and
which are concerned primarily with applying reasoning
to facts in order to simulate the behavior of human
experts in a particular subject domain. The inference
engine processes the rules, which are stored in a
“knowledgebase” rather than a database. These rules
may include executable code, or they may be mere data.
The reasoning process may be similar to that of a human
expert, or it may be completely different. The behavior
of the system as a whole is intended to mimic, and
hopefully outperform, a human expert. Examples of
such applications include bank credit approval, medical
diagnosis, and hardware configuration systems. These
systems are usually intended to aid rather than replace
human decision-makers. They offer the benefits of high
speed, high consistency, and perfect attention to detail.
    There have been a number of attempts in the last ten
years to bridge the gap between these two classes of
applications – to merge databases and knowledgebases
and their associated technologies. There is a growing
belief that modern database systems must evolve towards
knowledgebase systems, and that more “inferencing” is
necessary for better understanding and use of data. This
could lead to applications involving hundreds of
thousands of complex rules that make decisions that
seem truly “intelligent.”
    The Ultra-Structure paradigm does not make these
conventional distinctions between algorithms and data.
Rather it defines whatever is stored in a relational
database table to be rules which have two different types
of parts, called factors and considerations. Factors are
primary keys in a table that determine under what
general conditions a rule should be looked at; and
following standard normalization rules it requires that
there be unique keys (factors) for each record (rule).
What is traditionally considered to be data (i.e., a fact) is
usually stored as a consideration (a non-primary-key
attribute) in the record, and this attribute serves merely
to guide the execution of a rule cluster. In an inventory
system, for example, the quantity-on-hand of a particular
item is simply a consideration determining where the
item may be sourced for an order. That and other rules
in the cluster must all be examined in order for the
inventory system to make an intelligent sourcing
decision. The inference engine (called animation
procedures) consists of just a few thousand lines of
code. All knowledge of the external world lies in the
rulebase, and none in the animation procedures. RAS is
an example of a new type of system that uses a
relational database to store a very large number of
rules as data.
    This perspective requires a new and broader
understanding of the nature of rules. If we broaden our
concept of rules from

    IF x THEN do y and z

to

    IF x THEN consider y and z before deciding what to do,

then y and z can serve the role traditionally reserved for
data, that is they can represent the facts of the world.
They do this as an integral part of a larger and more
comprehensive cluster of rules, acting as considerations
for the execution of individual rules.
    This means that all the business rules of an
organization can be stored as data, and the only software
that is necessary is the inference engine, which should
never need to change. This puts all knowledge of the
world and all the knowledge of rules in a format which
is easy to update, easy to review, and can be managed
easily by a standard relational database.
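
As an illustration only, and not the RAS or any published Ultra-Structure implementation, the factor/consideration distinction in the inventory example can be sketched in Python with the standard sqlite3 module. The table name sourcing_rule, its columns, the preference ordering and the function source_item are all hypothetical, invented for this sketch. The point it tries to show is that the animation procedure only knows how to read the ruleform, while the sourcing decision changes when a row of data changes.

```python
import sqlite3


def build_rulebase():
    """Create an in-memory rulebase holding one hypothetical ruleform (table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sourcing_rule (
            item        TEXT    NOT NULL,   -- factor
            warehouse   TEXT    NOT NULL,   -- factor
            preference  INTEGER NOT NULL,   -- consideration (lower is tried first)
            qty_on_hand INTEGER NOT NULL,   -- consideration
            PRIMARY KEY (item, warehouse)   -- factors uniquely identify a rule
        )
        """
    )
    # Each row is one simple rule; these values are made up for the example.
    conn.executemany(
        "INSERT INTO sourcing_rule VALUES (?, ?, ?, ?)",
        [
            ("widget", "east", 1, 3),
            ("widget", "west", 2, 50),
        ],
    )
    return conn


def source_item(conn, item, qty_needed):
    """Hypothetical animation procedure: read the rules for `item` and decide.

    The procedure knows only the order in which to read the ruleform.
    Which warehouses exist and what they hold is entirely in the data.
    """
    rows = conn.execute(
        "SELECT warehouse, qty_on_hand FROM sourcing_rule "
        "WHERE item = ? ORDER BY preference",
        (item,),
    )
    for warehouse, qty_on_hand in rows:
        if qty_on_hand >= qty_needed:  # weigh the consideration
            return warehouse
    return None  # no rule allows sourcing this quantity


conn = build_rulebase()
first_choice = source_item(conn, "widget", 10)  # "east" holds too few

# Change a rule (a row of data), not the software.
conn.execute(
    "UPDATE sourcing_rule SET qty_on_hand = 20 "
    "WHERE item = 'widget' AND warehouse = 'east'"
)
second_choice = source_item(conn, "widget", 10)
print(first_choice, second_choice)
```

With the sample rows above, the first call selects the west warehouse and, after the update, the same unchanged procedure selects east. A real cluster would consult several ruleforms before deciding; this sketch uses a single table to keep the idea visible.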

                       Rules in RAS

As used by RAS, Ultra-Structure defines several basic
kinds of existential rules or types of entities:

• semantic entities, which can be letters, words,
  phrases, guide topics, or entire guides
• documents, which are the entities being analyzed by
  the system
• markings, which indicate what to do in the event
  that certain ideas are found in a text, e.g. mark the
  document as “confidential”
• users, which define the authorized users of the
  system.

    These entities typically have complex relations to one
another.

    If related to other entities of the same type they are
called network rules. RAS has several kinds of network
rules:

• entities network relates semantic entities to one
  another
• markings network relates markings to one another,
  and in particular indicates a hierarchy of markings
• documents network relates documents to one
  another, indicating (e.g.) that one document replaces
  another, or is a duplicate of another, etc.

    If existential entities of one kind are related to entities
of another kind, we represent that with an authorization
rule. RAS has several kinds of authorization rules:

• document detail contains the results of the
  pre-analysis of a document, specifying the semantic
  entities and their characteristics and order in the
  document
• document analysis contains the results of the
  analysis of a document
• entity markings relates semantic entities (e.g. guide
  topics) and their associated markings (if any)

    Note that each ruleform (table) may be interpreted as
defining rules. For example, the Document Detail table
may be interpreted as specifying rules for the
(re-)construction of the original document. The Markings
Network specifies how markings are ordered in a
hierarchy, e.g. if a marking of “confidential” and a
marking of “secret” both apply to the same document,
then the overall classification of the document is
“secret”.
    There are other ruleforms (tables) in RAS, but these
give the general idea of what the system contains.

            Executing (Animating) the RAS Rules

In order to search for concepts in a text, the text must
first be “pre-analyzed.” This involves the determination
of various boundaries (e.g. sentence boundaries) and the
determination of the nature of certain kinds of lexical
entities, e.g. whether a specific entity is numeric or
non-numeric, and whether a period is part of a number (a
decimal point), is used as part of an acronym or
abbreviation, or is indeed the end of a sentence. Each
word in a document is usually treated as a separate
“semantic entity.” But since words in a phrase often
have meanings very different from the same words
outside the phrase (e.g. “A horse of a different color” has
nothing to do with either horses or colors!), there is
frequently a need to indicate that several words must
always be treated as a single phrase, in which case the
entire phrase becomes a single semantic entity. This is
defined by a replacement rule in the Entities Network
ruleform.
    Each semantic entity has a number of attributes such
as character position in the document, word number,
sentence number, paragraph number, whether it is
numeric, etc. These and other attributes for each
semantic entity are stored as a rule on a single record in
the Document Detail ruleform, in lieu of using an
SGML-type markup language. The system is thus
generating new “rules” based on other rules, which
facilitates subsequent analysis.
    After performing an analysis it is necessary to
indicate which portions of the text are considered
classified or are otherwise marked, and what specific
guidance topic(s) caused the text to be selected. These
rules, also generated by the system based on other rules,
are stored in the Document Analysis ruleform.
    Performing the analysis itself requires looking for the
tokens in the target documents, and applying the
markings indicated. Since each guidance topic is
translated into one or more propositions, and there are
about 65,000 guidance topics, we anticipate that there
will be about 100,000 propositions to be represented and
searched for in each text. This number accounts for and
excludes duplicate guidance topics. This number of
rules alone would make RAS a very large expert system.
    As indicated in Exhibit 2, specifying a proposition (in
the sense used here) means specifying usually two to
four concepts which occur within a defined proximity of
one another in a text, e.g. 5 sentences or 15 words. We
have found thus far that even the most complex
propositions require only six concepts.
    Specifying several concepts to search for is not by
itself adequate: the computer must also know all the
possible ways that each concept can be tokenized (i.e. lexically expressed)
in any text document. This calls for a large tree of relationships
specifying how concepts can be tokenized. This mapping requirement,
essentially a large ontology of all areas of DOE activity, will probably
add another 500,000+ rules. We keep these rules in the Entities Network
ruleform, first defining all concepts and tokenizations in the Semantic
Entities ruleform. Note that in many cases there is no need to specify all
forms of an entity, e.g. singular, plural, possessive, etc.; using a
wildcard before or after a word stem is sometimes adequate so long as the
knowledge engineer is aware that use of stems may increase the false
positive hit rate.
    Of course, not all concepts and tokenizations are equally related to
one another. We represent the degree to which two of them are related with
a fuzzy fitness number from 0 to 1. We then need to test this rulebase
against a large set of documents, and to go back and correct the rules to
minimize or eliminate false positives and false negatives, either by adding
new entities to look for, changing the relationships of existing entities,
or specifying tighter hit ranges.
    In the long term, we will need to be able to keep the rules up-to-date
as the original guidance topics change over time and as we apply them to
different corpora having variations in their lexical representations, e.g.,
using a different (former) name for a national laboratory. We are still in
the early stages of creating and validating these rules, and we expect this
to be the most difficult part of building the system due to the wide range
of subject areas to be covered.

                     Results to Date

    We have run RAS on several corpora having very different
characteristics, in order to see how it performs on each. Characteristics
of interest include whether the documents are known or believed to be
classified or unclassified; the size of each document in the corpus,
ranging from a few sentences to hundreds of pages; the size of the total
corpus, ranging so far up to about 3 million words; and whether the corpus
was originally created electronically or whether it was OCRed and therefore
has some number of OCR errors in it.
    The rulebase tested has over 700 guidance rules in it, and maps to
about 20,000 tokens. We expect soon to greatly increase the number of
guidance rules applied.
    Results so far show a typical false positive hit rate of about ten
percent of the documents reviewed, meaning that of 100 documents, RAS will
incorrectly identify ten as “hot” when they are not. We hope of course to
reduce this rate by broadening and deepening the rulebase, and work is
underway to automatically generate RAS rules from the written English
guidance. Of these false positives, about ten percent are what we call
“good false positives”, i.e. items that are not in fact classified but
which a reviewer would want to look at closely before making that
determination.
    In terms of missing items that should have been identified as “hot”
(i.e., false negatives), results are harder to determine since sometimes
even human reviewers may disagree about what is sensitive. But results to
date indicate that almost all missed items are readily accounted for as
outside the domain of the rulebase, either pertaining to a different
subject area or including tokens that the rulebase was unaware of.
    These low false negative and false positive rates contrast sharply with
other approaches, which have often been found unusable because of false
positive hit rates above 50%, using that term as defined above.

              Other Possible RAS Applications

    The RAS technology for “concept-spotting”, i.e. reviewing documents for
the existence of specific propositions, can theoretically be applied to
other arenas, such as:

   identifying unsolicited commercial email (“spam”)
   searching web sites for certain ideas
   searching patents for certain ideas
   scanning computer source code for Y2K date issues.

    We are still a long way from completing RAS, but the results to date
have been very promising: RAS is demonstrating the advantages of
Ultra-Structure theory for concept detection and large knowledgebases. The
Declassification Productivity Research Center (DPRC) at The George
Washington University is carrying out other Ultra-Structure based research
projects, which are also showing positive results (Oh and Scotti, 1999).

                 Summary and Conclusions

    A million records is small by database system standards, but a million
rules is essentially an impossible number for a traditional expert system
to manage. We expect to be able to effectively handle very large numbers of
rules, numbering in the hundreds of thousands, using the techniques being
followed for RAS. Ultra-Structure theory may constitute a real merger of
knowledgebase and database technologies. If so, it has the potential to
usher in a new era of vastly larger expert systems for carrying out
policies and procedures of extreme complexity.

                       References
Long, Jeffrey G. and Denning, Dorothy E., “Ultra-Structure: A design theory
for complex systems and processes,” Communications of the ACM 38(1), (1995)
103-120.

Long, Jeffrey G., “A new notation for representing business and other
rules,” Semiotica 125-1/3, (1999) 215-228.

Oh, Youngsuck and Scotti, Richard, “Analysis and Design of a Database using
Ultra-Structure Theory (UST) – Conversion of a Traditional Software System
to One Based on UST,” Proceedings of the 20th Annual Conference, American
Society for Engineering Management (1999).

Shostko, Alexander, “Design of an automatic course-scheduling system using
Ultra-Structure,” Semiotica 125-1/3, (1999) 197-214.

                    About the Author

    Mr. Long is Senior Knowledge Engineer on the DPI project. He is also
Director of the Notational Engineering Laboratory, an effort to create a
clearinghouse for people interested in problems of representation in any
field of science, art or other activity. His experience includes 25 years
of consulting on various kinds of applications software development, with a
particular focus on studying complex systems and the problems of
representing them.



Standard Terminology (if any)     Ultra-Structure Instance Name    Ultra-Structure Level Name    U-S Implementation
behavior, physical entities and   particular(s)                    surface structure             system behavior
relationships, processes
rules, laws, constraints,         rule(s)                          middle structure              data and some software
guidelines, rules of thumb                                                                       (animation procedures)
(no standard or common term)      ruleform(s)                      deep structure                tables
(no standard or common term)      universal(s)                     sub-structure                 attributes, fields
tokens, signs or symbols          token(s)                         notational structure          character set

Exhibit 1: Layers of Structure in Any System, According to Ultra-Structure Theory
Exhibit 2: RAS Breakdown of Topics to Tokens
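
The breakdown named in Exhibit 2, from concepts to their tokenizations and then to hits within a proximity window, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RAS implementation: RAS keeps these mappings as rows in relational ruleforms, whereas here TOKENIZATIONS, concept_positions and proposition_hit are hypothetical names, the sample concepts and patterns are invented, proximity is measured in words only, and a trailing "*" stands in for the wildcard stems described in the paper.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical tokenizations: concept -> lexical patterns.
# A trailing "*" is a wildcard stem (assumption, for illustration only).
TOKENIZATIONS = {
    "reactor": ["reactor*", "pile"],
    "fuel": ["fuel*"],
    "enrichment": ["enrich*"],
}


def concept_positions(words, patterns):
    """Return the indices of the words that match any pattern of one concept."""
    return [i for i, w in enumerate(words)
            if any(fnmatchcase(w, p) for p in patterns)]


def proposition_hit(text, concepts, window):
    """True if every concept has a token inside some run of `window` words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    positions = {c: concept_positions(words, TOKENIZATIONS[c]) for c in concepts}
    if any(not found for found in positions.values()):
        return False  # at least one concept never appears
    # Any qualifying window can be shifted to start at a concept occurrence,
    # so it is enough to try each occurrence as the left edge.
    for start in sorted(i for found in positions.values() for i in found):
        end = start + window  # exclusive right edge
        if all(any(start <= i < end for i in found)
               for found in positions.values()):
            return True
    return False


sentence = "The reactor's fuel was enriched last year."
rule = ["reactor", "fuel", "enrichment"]
print(proposition_hit(sentence, rule, window=15))
print(proposition_hit(sentence, rule, window=3))
```

With a 15-word window the sample sentence counts as a hit; with a 3-word window it does not, because the three tokens span four words. The stem "reactor*" also catches the possessive form, which is both the convenience and the false positive risk that the paper attributes to wildcards.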
Using Ultra-Structure for
Automated Identification of
 Sensitive Information in
       Documents


            Jeffrey G. Long
  Sr. Knowledge Engineer, DynMeridian
            notate@aol.com
Traditional Engineering Approaches Work
Only Under Certain Conditions
Unfortunately, Complex and Changing
Needs Exist in Every Organization

[Figure: “Needs” and “SW & DB” plotted over time 1, time 2, time 3...]
      Ultra-Structure Theory Was Created to
      Support Complex and Changing Rules

   New theory of systems design, developed 1985
   Focuses on optimal computer representation of
    complex, conditional and changing rules
   Based on a new abstraction called ruleforms

   The breakthrough was to find the unchanging
    features of changing systems
The Theory Offers a Different Way to
     Look at Complex Systems and Processes


  observable
   behaviors                    surface structure
                        generates
        rules                   middle structure
                        constrains
form of rules                   deep structure
This Creates New Levels for Analysis
and Representation

Standard Terminology (if any)     Ultra-Structure Instance Name    Ultra-Structure Level Name    U-S Implementation
behavior, physical entities and   particular(s)                    surface structure             system behavior
relationships, processes
rules, laws, constraints,         rule(s)                          middle structure              data and some software
guidelines, rules of thumb                                                                       (animation procedures)
(no standard or common term)      ruleform(s)                      deep structure                tables
(no standard or common term)      universal(s)                     sub-structure                 attributes, fields
tokens, signs or symbols          token(s)                         notational structure          character set
The Ruleform Hypothesis

Complex system structures are created by not-necessarily
complex processes; and these processes are created by the
animation of operating rules. Operating rules can be grouped
into a small number of classes whose form is prescribed by
"ruleforms". While the operating rules of a system change over
time, the ruleforms remain constant. A well-designed collection
of ruleforms can anticipate all logically possible operating rules
that might apply to the system, and constitutes the deep
structure of the system.
The CoRE Hypothesis

There exist Competency Rule Engines, or CoREs, consisting of
<50 ruleforms, that are sufficient to represent all rules found
among systems sharing broad family resemblances, e.g. all
corporations. Their definitive deep structure will be permanent,
unchanging, and robust for all members of the family, whose
differences in manifest structures and behaviors will be
represented entirely as differences in operating rules. The
animation procedures for each engine will be relatively simple
compared to current applications, requiring less than 100,000 lines
of code in a third generation language.
DOE Reviewer’s Assistant System
       Requirements

   650 guides defining 65,000 topics that are or may be
    classified
   Extensive background knowledge required to
    interpret guidance
   Guidance changes over time
   Terminology in documents changes over time
   Current backlog of 300+ million pages
   Objective is concept spotting, not document
    understanding
Normally This Would be Done Using
        an Expert System Shell

   ES often have trouble with > 100 rules
   DOE system will require about 500,000 rules
   Key issue: maintainability of rules
   Many benefits from using relational database to store
    rules as data
       Built-in referential integrity
       Easy report-writing and queries
       Simple user interface for KE and Reviewers
RAS Defines Guidance Concepts and
All Possible Lexical Expressions of
Those Concepts

[Diagram boxes: Convert Guides; System Ready; Define Interpretations; Read Document; Apply Guidance; Document Reviewed]
Rules Specify Relations Between
Concepts, Tokens and Markings
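
One relation of this kind, described in the paper, is the ordering of markings: when “confidential” and “secret” both apply to a document, the document as a whole is “secret”. A minimal Python sketch of that rule follows. It rests on assumptions: the paper names only those two markings, so the “unclassified” level, the list MARKING_ORDER and the function overall_classification are hypothetical, and in RAS the ordering would be held as rows in the Markings Network ruleform rather than in code.

```python
# Hypothetical marking hierarchy, lowest to highest. The paper mentions only
# "confidential" and "secret"; "unclassified" is added here as an assumption.
MARKING_ORDER = ["unclassified", "confidential", "secret"]
RANK = {marking: level for level, marking in enumerate(MARKING_ORDER)}


def overall_classification(markings):
    """Return the highest-ranked marking among those applied to a document."""
    if not markings:
        raise ValueError("no markings supplied")
    return max(markings, key=RANK.__getitem__)


print(overall_classification(["confidential", "secret"]))
```

Because the hierarchy is plain data, adding or reordering a level changes only MARKING_ORDER, which mirrors the claim that rule changes should not require changes to software.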
Results to Date are Promising

   In a corpus of 3,750 unclassified documents, the
    false positive rate was less than 10%
   In another corpus of 16,500 unclassified documents,
    the false positive rate was 2.5%

   In other (e.g. keyword and statistical systems)
    approaches, false positive and false negative rates
    are often in excess of 50%
The Ultra-Structure-Based RAS
       System Offers Substantial Benefits to
       Reviewers and Knowledge Engineers
   System can provide precise and rigorous
    interpretation of DOE Classification Guidance
   Rules can become more complex if necessary
   Rules are easy to specify, change and review
   Implications and consequences of changes can be
    better foreseen
   Changes to rules do not require changing software or
    table structures – just data
Next Steps for RAS Development

   Work with subject experts to expand scope and
    improve quality and completeness of rulebase
   Continue testing system against many types of
    documents
   Improve design to minimize/eliminate false negatives
    and false positives
   Work with end-users to improve user interface
   Integrate into other systems
   Improve design to increase speed: parallel
    processing, stored queries, etc.
As the CoRE Hypothesis Promises, RAS
      Could be Used in Other Areas Also

   Categorize documents by subject
   Scan email for spam/UCE
   Scan websites, e.g. for compliance to a standard
   Categorize patents or scan them for specified
    concepts
   Scan source code, e.g. Y2K
   Scan any machine-readable corpus for specified
    ideas

Why we dont understand complex systems
 
Mathematics rules and scientific representations
Mathematics rules and scientific representationsMathematics rules and scientific representations
Mathematics rules and scientific representations
 
Notational engineering
Notational engineeringNotational engineering
Notational engineering
 
The evolution of abstractions
The evolution of abstractionsThe evolution of abstractions
The evolution of abstractions
 
A metaphsical system that includes numbers rules and bricks
A metaphsical system that includes numbers rules and bricksA metaphsical system that includes numbers rules and bricks
A metaphsical system that includes numbers rules and bricks
 
The nature of notational engineering
The nature of notational engineeringThe nature of notational engineering
The nature of notational engineering
 
The co evolution of symbol systems and society
The co evolution of symbol systems and societyThe co evolution of symbol systems and society
The co evolution of symbol systems and society
 
Towards a new metaphysics of complex processes
Towards a new metaphysics of complex processesTowards a new metaphysics of complex processes
Towards a new metaphysics of complex processes
 
Call for a new notation
Call for a new notationCall for a new notation
Call for a new notation
 
Notation as a basis for societal evolution
Notation as a basis for societal evolutionNotation as a basis for societal evolution
Notation as a basis for societal evolution
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Automated identification of sensitive information

  • 1. Cover Page

Using Ultra-Structure for Automated Identification of Sensitive Information in Documents

Author: Jeffrey G. Long (jefflong@aol.com)
Date: October 21, 1999
Forum: Talk presented at the 20th annual conference of the American Society for Engineering Management.

Contents
Pages 1-5: Preprint of paper
Pages 6-24: Slides (but no text) for presentation

License
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Uploaded July 1, 2011
  • 2. USING ULTRA-STRUCTURE FOR AUTOMATED IDENTIFICATION OF SENSITIVE INFORMATION IN DOCUMENTS [1]

Jeffrey G. Long, Sr. Knowledge Engineer, DynCorp NSP

[1] This work was funded by the U.S. Department of Energy under contract DE-AC01-98NN50049.

Abstract. The Government has a strong interest in protecting nuclear and national security information while maximizing the availability of information about its operations. Towards this end federal agencies have developed tens of thousands of 'guidance rules' to help determine what is or is not classified, and at what level. Trained and certified document reviewers apply these rules. The Reviewer's Assistant System (RAS) at the Department of Energy is a new type of expert system that uses a relational database to store very large numbers of rules. It is being developed to automatically apply DOE guidance rules to text documents. This requires far more than a mere keyword search, as the ideas being distinguished can be quite subtle. The purpose of the system is not to 'understand' the documents it processes, but rather to simply check them for the existence of certain ideas or facts. This technology for 'concept-spotting' can be applied to other arenas, such as detecting junk email or searching web sites. This paper will discuss the features, goals and general methods of the Reviewer's Assistant System.

Background

The traditional approach to managing any engineering project is structured: it moves from general planning to requirements analysis, design, implementation, and then long-term maintenance, and has explicit criteria to determine whether and when to move to the next stage. This approach works quite well for creating most types of systems. If a system is simple and the requirements change, the traditional structured approach works because the system can be affordably modified. Alternatively, if a system is complex but the requirements never change, the system can be successfully built.

Traditional structured approaches have proven to be better than completely unstructured approaches and have led to the development of many successful systems. However, standard structured approaches have failed to satisfactorily address the problems involved in creating time-viable software systems, especially over long times. Instead, they have led to the frequent replacement of systems with wholly new ones, most often at great cost for both developers and users of software. Attempts to get better user requirements through rapid prototyping and better charting tools for notating work processes have helped somewhat, but they too have failed to make systems more flexible in the face of changing user requirements. Indeed, if both (a) the system being developed is complex and (b) the user requirements are subject to significant change over time, then the existing structured approaches usually do not work. The greatest engineering accomplishments of the 20th Century are of the former type. Complex systems ranging from computers to the Space Shuttle have of course been successfully built – but only on the condition that their user requirements change little if at all after the design stage.

Unfortunately for us, software systems are both complex and ever changing. As such, they require a different engineering management approach. The theory of “Ultra-Structure” was developed in part because the application of traditional engineering approaches has failed when applied to designing software systems. This theory was described in (Long and Denning, 1995), and less technically in (Long, 1999), but will be briefly and nontechnically described here.

This article primarily discusses the application of Ultra-Structure theory to a new area, namely the analysis of text documents for sensitive information. The U.S. Department of Energy (DOE) Declassification Productivity Initiative (DPI), which will be described later in this paper, has funded this work. The use of Ultra-Structure theory has thus far allowed us to address several difficult problems that have limited and hindered previous efforts at Natural Language Processing (NLP) systems, expert systems, and large knowledgebases; in particular:

    • the ability to manage large numbers of rules in a knowledgebase, numbering in the tens of thousands now and eventually in the hundreds of thousands
    • the ability to give knowledge engineers a set of tools to help them visualize and manage large knowledgebases of rules (“rulebases”)
    • the ability to manage and maintain both metadata and content information regarding large numbers of documents.

Ultra-Structure Fundamentals

Ultra-Structure theory is based on a different way of looking at the world – a different paradigm or
  • 3. worldview. The traditional Western, Aristotelian worldview sees the world as composed of objects having attributes and relationships to other objects. Ultra-Structure theory sees the world as being a process which, as a minor by-product, occasionally generates physical entities and new relationships among physical entities. The first task of an Ultra-Structure analysis is to understand these processes and to represent them very accurately. The development task of an Ultra-Structure analysis is to ensure that the representation models not just the processes that exist currently, but all logically possible processes within that family of systems.

Ultra-Structure theory suggests that any process can, in principle and in practice, be analyzed into or reduced to a set of If-Then rules. Seemingly simple processes usually follow just a few simple rules; and seemingly complex processes may follow either a few, or many, simple rules. One of the interesting discoveries in cellular automata studies has been that very simple rules can generate very complex behaviors. This has also been observed in the work done in the last 20 years in fractal geometry, where the recursive execution of a single remarkably simple formula, which is a type of rule, can specify very complex shapes.

The kind of rules that humans usually work with may be thought of as complex rules. However, a good analyst can analyze complex rules into simple or atomic rules, according to Ultra-Structure theory. Rules may be defined in terms of sets, with each set having a specified and limited possible domain or set of values. A particular rule in this definition is a particular ordered sequence of these values.

Ultra-Structure theory suggests that a good analyst can always define an ordered sequence of domains (in Ultra-Structure terminology, universals) which will contain any particular instance of a rule. Such an ordered set of universals is called a ruleform. A ruleform is to a rule what ordinary algebra is to arithmetic: it is a more generalized way of specifying the essential structural ideas and relationships of a system. While ordinary algebra uses symbols to represent numbers and various arithmetical operations, Ultra-Structure uses universals to represent various domains whose contents may be numeric, alphabetical, or indeed the tokens of any notational system.

Exhibit 1 shows proposed terminology and distinctions for the different layers of structure of any system. Simply put, the surface structure is defined to be the physical manifestation of any system, consisting of its physical entities, relationships and processes. The middle structure is the set of all rules governing the system, which generate the surface structure. The deep structure is the set of ordered domains from which particular rules may be constructed. The sub-structure represents the set of all possible domains (universals). Finally, the notational structure is the set of tokens used by the rules to represent various abstractions. For RAS, these tokens are numbers and letters; for (say) a music system they would be letters and other signs interpreted as musical “notes” or instructions.

Ultra-Structure theory specifies how these various layers can be represented on a computer as tables in a relational database. To implement a ruleform, we create a table where each row is a separate simple rule, and each column is a separate universal. Typically a complex rule will require several simple rules that are stored in different ruleforms; this group of ruleforms must be examined as a whole in order for the system to make decisions; it is referred to as a cluster. There are several types of ruleforms, as will be illustrated later.

Properly specifying these structures for a model of a system enables the users of the model to enter the rules of the system as data, which is easily changed as necessary when the rules change. Approximately 99% of the rules of the system are specified as data, so the model itself – its software and data structures – has no knowledge of the outside world. All the model itself knows is the order in which to read the ruleforms. This control logic, called animation procedures, is very small; typically just a few thousand lines of code even for a very complex system.

Natural Language: A New Application Area for Ultra-Structure

Ultra-Structure theory has been applied in a number of application areas, mostly in business. This has included the traditional application areas of order entry, inventory control, billing and cash application, and similar business functions. It has also been applied experimentally to other areas, as indicated by (Shostko, 1999) and (Oh and Scotti, 1999).

Starting with part-time work in 1995, it has now been applied to the automated identification of sensitive information in text documents. The present application area is new for Ultra-Structure: attempting to specify the rules by which natural language can be at least partially understood, i.e. partially interpreted and assigned meaning.

The Government has a strong interest in protecting national security information (NSI), while facilitating government openness to public scrutiny and not spending significant amounts of money and time protecting information that does not truly need to be classified. President Clinton issued Executive Order 12958 (E.O.), the most recent codification of the Government’s intentions in this area, on April 17, 1995. It states, among other things, that all classified documents containing only NSI (not RD or FRD, discussed below) will be automatically declassified
  • 4. after 25 years unless one of nine specified conditions for exemption is met.

The estimated volume of documents to be reviewed for declassification under the E.O. exceeds one billion, and the five-year grace period specified by the E.O. for reviews to identify exemptions from automatic declassification has been extended for an additional 18 months to help the Agencies meet the enormous workloads.

Under the E.O., any information that is covered under the Atomic Energy Act (AEA) is exempt from automatic declassification. Such information includes anything pertaining to the construction, design or use of nuclear weapons, nuclear propulsion systems, and other special nuclear materials. This information was exempted because the President does not have the authority to unilaterally change the AEA (a law), and also because it is generally recognized that even “old” nuclear design information would still be of current value to a would-be proliferant. It is simply not in the interests of the United States to provide such information.

To help identify this kind of information, called “Restricted Data” (RD) or “Formerly Restricted Data” (FRD), as well as other kinds of national security information (NSI), the Department of Energy has developed about 65,000 specific guidance topics. Their purpose is to help determine what is or is not classified as RD, FRD or NSI, and at what level (confidential, secret, or top secret). Trained and certified document reviewers apply these topics. Moreover, under the Freedom of Information Act (FOIA), the public is entitled to request documents and DOE must be prepared to justify any classification actions it takes before a federal judge. They must have a clear rationale tracing back to the 65,000 guidance topics and from there to either the AEA or to the latest E.O. pertaining to national security.

Document reviewing is a manually intensive process requiring years of education and training. Congress funded the Declassification Productivity Initiative (DPI) at the Department of Energy (DOE) in order to develop advanced tools to help reviewers in various ways. One of the primary tools we have been developing under DPI is called the “Reviewer’s Assistant System” (RAS), which was built using Ultra-Structure theory.

Reviewer’s Assistant System Functions

We are building RAS using Ultra-Structure theory because the number of rules is quite large and these rules are likely to change over time. RAS is designed to:

    • rigorously apply DOE Guidance Rules to text documents
    • highlight any segments of the text to which a Guidance Rule applies
    • highlight for the user (i.e. a certified document reviewer) the specific Guidance Rule(s) that caused any particular sections of text to be selected.

The purpose of RAS is not to “understand” the documents it processes, but rather to detect the existence of any classified concepts or facts. While this is simpler than true document understanding, it nevertheless requires far more than mere keyword searching, where a system simply scans a document for the existence of one or more specified terms. It also requires more than a Boolean keyword search, where specific terms can be ANDed, ORed, etc. We are seeking specific concepts having specific relations to one another, which we refer to as ideas or propositions. The ideas being sought can be quite subtle.

Merging Databases and Text Markup Languages

Traditionally the task of defining the elements of document structure would be performed using a text markup language such as a derivative of the Standard Generalized Markup Language (SGML). In this kind of language, “tags” indicating different structural features of the document are inserted into the document at the beginning and end of each structural feature. Following Ultra-Structure theory, RAS represents the information in terms of rules, and the rules are stored as records of data in various tables. In RAS, therefore, all structural markup information is stored in a database. This kind of markup does not use in-line tags, but instead uses different fields in a table.

There are a number of advantages to storing structured text information in database tables rather than in a flat file with tags. Chief among these are the following general capabilities of relational databases over flat files:

    • control access to the data through a security system and audit trail
    • enforce referential integrity, such that when a value changes in one part of the system it is immediately changed in all parts of the system
    • permit use of complex queries using (e.g.) Structured Query Language (SQL)
    • give users quick access to volumes of data through easy-to-use forms and reports
    • store and retrieve various types of objects in addition to standard text (e.g. images, sounds).

Merging Databases and Knowledgebases
  • 5. Ever since mankind first used an abacus about 5,000 years ago, and possibly since we first notched tallies on a stick 30,000 years ago, we have distinguished algorithms from data. This has been a useful distinction, but the veritable wall between the two began to break down when John von Neumann proposed in a memo in 1945 that not just data but also algorithms (as computer instructions) could be stored on a computer in a binary form. This insight – based on work done by him and others at the University of Pennsylvania, including John Mauchly and J. Presper Eckert – led to programmable (stored program) computers.
Although both parts are stored in the same way as binary digits (bits), computer applications are still viewed as consisting of two very different things: algorithms and data. An algorithm is a finite series of steps taken to compute an answer. Data is the values or parameters used by an algorithm to reach its conclusions, which data may have initial, intermediate and final values.
Database applications are generally viewed as applications that provide storage places and access methods for the safe storage and retrieval of persistent data, and the safe adding, changing and deleting of data following certain integrity rules regardless of whether the application software using the database enforces those rules or not. Under this paradigm, databases store and protect “facts” or “data,” and the algorithms that read and use these facts are stored in software programs, queries, stored procedures, job control language procedures, etc. Examples of such applications are order entry, inventory, purchasing, and accounting systems. This class of systems is concerned primarily with data storage, arithmetic and logical calculations, and information retrieval. For this class of systems, changing the rules of a business area requires changing the software – a frequently difficult task.
Expert systems are a different class of applications which consist of rules and an inference engine, and which are concerned primarily with applying reasoning to facts in order to simulate the behavior of human experts in a particular subject domain. The inference engine processes the rules, which are stored in a “knowledgebase” rather than a database. These rules may include executable code, or they may be mere data. The reasoning process may be similar to that of a human expert, or it may be completely different. The behavior of the system as a whole is intended to mimic, and hopefully outperform, a human expert. Examples of such applications include bank credit approval, medical diagnosis, and hardware configuration systems. These systems are usually intended to aid rather than replace human decision-makers. They offer the benefits of high speed, high consistency, and perfect attention to detail.
There have been a number of attempts in the last ten years to bridge the gap between these two classes of applications – to merge databases and knowledgebases and their associated technologies. There is a growing belief that modern database systems must evolve towards knowledgebase systems, and that more “inferencing” is necessary for better understanding and use of data. This could lead to applications involving hundreds of thousands of complex rules that make decisions that seem truly “intelligent.”
The Ultra-Structure paradigm does not make these conventional distinctions between algorithms and data. Rather it defines whatever is stored in a relational database table to be rules which have two different types of parts, called factors and considerations. Factors are primary keys in a table that determine under what general conditions a rule should be looked at; and following standard normalization rules it requires that there be unique keys (factors) for each record (rule). What is traditionally considered to be data (i.e., a fact) is usually stored as a consideration (a non-primary-key attribute) in the record, and this attribute serves merely to guide the execution of a rule cluster. In an inventory system, for example, the quantity-on-hand of a particular item is simply a consideration determining where the item may be sourced for an order. That and other rules in the cluster must all be examined in order for the inventory system to make an intelligent sourcing decision. The inference engine (called animation procedures) consists of just a few thousand lines of code. All knowledge of the external world lies in the rulebase, and none in the animation procedures. RAS is an example of a new type of system that uses a relational database to store a very large number of rules as data.
This perspective requires a new and broader understanding of the nature of rules. If we broaden our concept of rules from
    IF x THEN do y and z
to
    IF x THEN consider y and z before deciding what to do,
then y and z can serve the role traditionally reserved for data, that is they can represent the facts of the world. They do this as an integral part of a larger and more comprehensive cluster of rules, acting as considerations for the execution of individual rules.
This means that all the business rules of an organization can be stored as data, and the only software that is necessary is the inference engine, which should never need to change. This puts all knowledge of the world and all the knowledge of rules in a format which is easy to update, easy to review, and can be managed easily by a standard relational database.
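As a minimal sketch of the factor/consideration split described above, the inventory example could be laid out as a single ruleform in a relational table. The table name `sourcing_rules`, its columns, the sample rows and the `choose_source` procedure are all illustrative assumptions, not taken from RAS or from any Ultra-Structure implementation; Python's built-in `sqlite3` module simply stands in for whatever database a real system would use.

```python
import sqlite3

# Hypothetical ruleform: which location may source an item for an order.
# Factors (the primary key) say when a rule is relevant; considerations
# (non-key columns) guide what the animation procedure decides.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sourcing_rules (
        item        TEXT    NOT NULL,  -- factor
        location    TEXT    NOT NULL,  -- factor
        qty_on_hand INTEGER NOT NULL,  -- consideration
        priority    INTEGER NOT NULL,  -- consideration (lower is preferred)
        PRIMARY KEY (item, location)
    )
    """
)
conn.executemany(
    "INSERT INTO sourcing_rules VALUES (?, ?, ?, ?)",
    [
        ("widget", "east", 0, 1),
        ("widget", "west", 40, 2),
        ("widget", "north", 15, 3),
        ("gadget", "east", 5, 1),
    ],
)


def choose_source(item, qty_needed):
    """Animation procedure: examine the whole rule cluster for an item and
    return the preferred location that can fill the order, or None."""
    cluster = conn.execute(
        "SELECT location, qty_on_hand FROM sourcing_rules "
        "WHERE item = ? ORDER BY priority",
        (item,),
    ).fetchall()
    for location, qty_on_hand in cluster:
        if qty_on_hand >= qty_needed:
            return location
    return None
```

Under these assumptions, `choose_source("widget", 10)` returns `"west"` because the preferred location `east` has nothing on hand. Updating the quantity on the `east` row changes the decision without touching the procedure, which is the point of keeping the rules as data.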
  • 6. Rules in RAS
As used by RAS, Ultra-Structure defines several basic kinds of existential rules or types of entities:
      • semantic entities, which can be letters, words, phrases, guide topics, or entire guides
      • documents, which are the entities being analyzed by the system
      • markings, which indicate what to do in the event that certain ideas are found in a text, e.g. mark the document as “confidential”
      • users, which define the authorized users of the system.
These entities typically have complex relations to one another.
If related to other entities of the same type they are called network rules. RAS has several kinds of network rule:
      • entities network relates semantic entities to one another
      • markings network relates markings to one another, and in particular indicates a hierarchy of markings
      • documents network relates documents to one another, indicating (e.g.) that one document replaces another, or is a duplicate of another, etc.
If existential entities of one kind are related to entities of another kind, we represent that with an authorization rule. RAS has several kinds of authorization rules:
      • document detail contains the results of the pre-analysis of a document, specifying the semantic entities and their characteristics and order in the document
      • document analysis contains the results of the analysis of a document
      • entity markings relates semantic entities (e.g. guide topics) and their associated markings (if any)
Note that each ruleform (table) may be interpreted as defining rules. For example, the Document Detail table may be interpreted as specifying rules for the (re-)construction of the original document. The Markings Network specifies how markings are ordered in a hierarchy, e.g. if a marking of “confidential” and a marking of “secret” both apply to the same document, then the overall classification of the document is “secret”.
There are other ruleforms (tables) in RAS, but these give the general idea of what the system contains.
Executing (Animating) the RAS Rules
In order to search for concepts in a text, the text must first be “pre-analyzed.” This involves the determination of various boundaries (e.g. sentence boundaries) and the determination of the nature of certain kinds of lexical entities, e.g. whether a specific entity is numeric or non-numeric, and whether a period is part of a number (a decimal point), is used as part of an acronym or abbreviation, or is indeed the end of a sentence. Each word in a document is usually treated as a separate “semantic entity.” But since words in a phrase often have meanings very different than the same words outside the phrase (e.g. “A horse of a different color” has nothing to do with either horses or colors!), there is frequently a need to indicate that several words must always be treated as a single phrase, in which case the entire phrase becomes a single semantic entity. This is defined by a replacement rule in the Entities Network ruleform.
Each semantic entity has a number of attributes such as character position in the document, word number, sentence number, paragraph number, whether it is numeric, etc. These and other attributes for each semantic entity are stored as a rule on a single record in the Document Detail ruleform, in lieu of using an SGML-type markup language. The system is thus generating new “rules” based on other rules, which facilitates subsequent analysis.
After performing an analysis it is necessary to indicate which portions of the text are considered classified or are otherwise marked, and what specific guidance topic(s) caused the text to be selected. These rules, also generated by the system based on other rules, are stored in the Document Analysis ruleform.
Performing the analysis itself requires looking for the tokens in the target documents, and applying the markings indicated. Since each guidance topic is translated into one or more propositions, and there are about 65,000 guidance topics, we anticipate that there will be about 100,000 propositions to be represented and searched for in each text. This number accounts for and excludes duplicate guidance topics. This number of rules alone would make RAS a very large expert system.
As indicated in Exhibit 2, specifying a proposition (in the sense used here) means specifying usually two to four concepts which occur within a defined proximity of one another in a text, e.g. 5 sentences or 15 words. We have found thus far that even the most complex propositions require only six concepts.
Specifying several concepts to search for is not by itself adequate: the computer must also know all the
  • 7. possible ways that each concept can be tokenized (i.e. lexically expressed) in any text document. This calls for a large tree of relationships specifying how concepts can be tokenized. This mapping requirement – essentially a large ontology of all areas of DOE activity – will probably add another 500,000+ additional rules. We keep these rules in the Entities Network ruleform, first defining all concepts and tokenizations in the Semantic Entities ruleform. Note that in many cases there is no need to specify all forms of an entity, e.g. singular, plural, possessive, etc.; using a wildcard before or after a word stem is sometimes adequate so long as the knowledge engineer is aware that use of stems may increase the false positive hit rate.
Of course, not all concepts and tokenizations are equally related to one another. We represent this degree of closeness with a fuzzy fitness number from 0 to 1 to indicate the degree to which the two are related.
We then need to test this rulebase against a large set of documents, and to go back and correct the rules to minimize or eliminate false positives and false negatives, either by adding new entities to look for, changing the relationships of existing entities, or specifying tighter hit ranges.
In the long term, we will need to be able to keep the rules up-to-date as the original guidance topics change over time and as we apply it to different corpora having variations in their lexical representations, e.g., using a different (former) name for a national laboratory. We are still in the early stages of creating and validating these rules, and we expect this to be the most difficult part of building the system due to the wide range of subject areas to be covered.
Results to Date
We have run RAS on several different corpora having very different characteristics, in order to see how it performs with these different corpora. Characteristics of interest include whether the documents are known or believed to be classified or unclassified; the size of each document in the corpus, ranging from a few sentences to hundreds of pages; the size of the total corpus, ranging so far up to about 3 million words; and whether the corpus was originally created electronically or whether it was OCRed and therefore has some number of OCR errors in it.
The rulebase tested has over 700 guidance rules in it, and maps to about 20,000 tokens. We expect soon to greatly increase the number of guidance rules applied. Results so far show a typical false positive hit rate of about ten percent of the documents reviewed, meaning that of 100 documents, RAS will incorrectly identify ten as “hot” when they are not. We hope of course to reduce this rate by broadening and deepening the rulebase, and work is underway to automatically generate RAS rules from the written English guidance. Of these, about ten percent are what we call “good false positives”, i.e. items that are not in fact classified but which a reviewer would want to look at closely before making that determination.
In terms of missing items that should have been identified as “hot” (i.e., false negatives), results are harder to determine since sometimes even human reviewers may disagree about what is sensitive. But results to date indicate that almost all missed items are readily accounted for as outside the domain of the rulebase, either pertaining to a different subject area or including tokens that the rulebase was unaware of.
This low false negative and low false positive rate is in great contrast to other approaches which have often been found unusable based on 50%+ false positive hit rates, using that term as defined above.
Other Possible RAS Applications
The RAS technology for “concept-spotting” – reviewing documents for the existence of specific propositions – can theoretically be applied to other arenas, such as:
      • identifying unsolicited commercial email (“spam”)
      • searching web sites for certain ideas
      • searching patents for certain ideas
      • scanning computer source code for Y2K date issues.
We are still a long way from completing the RAS system. The results to date have been very promising: RAS is demonstrating the advantages of Ultra-Structure theory for concept detection and large knowledgebases. The Declassification Productivity Research Center (DPRC) at The George Washington University is carrying out other Ultra-Structure based research projects, which are also showing positive results (Oh and Scotti, 1999).
Summary and Conclusions
A million records is small by database system standards, but a million rules is essentially an impossible number for a traditional expert system to manage. We expect to be able to effectively handle very large numbers of rules, numbering in the hundreds of thousands, using the techniques being followed for RAS.
Ultra-Structure theory may constitute a real merger of knowledgebase and database technologies. If so, it has the potential to usher in a new era of vastly larger expert systems for carrying out policies and procedures of extreme complexity.
References
  • 8. Long, Jeffrey G. and Denning, Dorothy E., “Ultra-Structure: A design theory for complex systems and processes,” Communications of the ACM 38(1), (1995) 103-120.
Long, Jeffrey G., “A new notation for representing business and other rules,” Semiotica 125-1/3, (1999) 215-228.
Oh, Youngsuck and Scotti, Richard, “Analysis and Design of a Database using Ultra-Structure Theory (UST) – Conversion of a Traditional Software System to One Based on UST,” Proceedings of the 20th Annual Conference, American Society for Engineering Management (1999).
Shostko, Alexander, “Design of an automatic course-scheduling system using Ultra-Structure,” Semiotica 125-1/3, (1999) 197-214.
About the Author
Mr. Long is Senior Knowledge Engineer on the DPI project. He is also Director of the Notational Engineering Laboratory, an effort to create a clearinghouse for people interested in problems of representation in any field of science, art or other activity. His experience includes 25 years of consulting on various kinds of applications software development, with a particular focus on studying complex systems and the problems of representing them.
Standard Terminology (if any) | Ultra-Structure Instance Name | Ultra-Structure Level Name | U-S Implementation
behavior, physical entities and relationships, processes | particular(s) | surface structure | system behavior
rules, laws, constraints, guidelines, rules of thumb | rule(s) | middle structure | data and some software (animation procedures)
(no standard or common term) | ruleform(s) | deep structure | tables
(no standard or common term) | universal(s) | sub-structure | attributes, fields
tokens, signs or symbols | token(s) | notational structure | character set
Exhibit 1: Layers of Structure in Any System, According to Ultra-Structure Theory
  • 9. Exhibit 2: RAS Breakdown of Topics to Tokens
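The paper describes a proposition as usually two to four concepts occurring within a defined proximity of one another, with each concept expressed by one or more tokenizations, possibly using a wildcard on a word stem. The sketch below illustrates that breakdown under stated assumptions: the concept names, the token lists, the 15-word window and every function name are hypothetical, and the brute-force search over occurrences is a simplification, not the RAS animation procedure.

```python
import re
from itertools import product

# Hypothetical tokenizations: concept -> lexical forms. A trailing "*"
# marks a word stem that matches any continuation.
TOKENIZATIONS = {
    "facility": ["laboratory", "lab", "facilit*"],
    "funding": ["budget*", "funding"],
}

# Hypothetical proposition: all listed concepts within `window` words.
PROPOSITION = {"concepts": ["facility", "funding"], "window": 15}


def token_matches(pattern, word):
    """Exact match, or prefix match when the pattern ends with a wildcard."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def concept_positions(concept, words):
    """Word indexes at which any tokenization of the concept occurs."""
    patterns = TOKENIZATIONS[concept]
    return [
        i for i, word in enumerate(words)
        if any(token_matches(p, word) for p in patterns)
    ]


def proposition_hits(proposition, text):
    """True if one occurrence of every concept fits inside the window."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions = [concept_positions(c, words) for c in proposition["concepts"]]
    if not all(positions):
        return False  # at least one concept never occurs
    return any(
        max(combo) - min(combo) <= proposition["window"]
        for combo in product(*positions)
    )
```

As the paper cautions, stem wildcards trade rule count for precision: `budget*` here would also match unrelated words that happen to share the stem.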
  • 10. Using Ultra-Structure for Automated Identification of Sensitive Information in Documents Jeffrey G. Long Sr. Knowledge Engineer, DynMeridian notate@aol.com
  • 11. Traditional Engineering Approaches Work Only Under Certain Conditions
  • 12. Unfortunately, Complex and Changing Needs Exist in Every Organization
      [Figure: Needs versus SW & DB at time 1, time 2, time 3...]
  • 13. Ultra-Structure Theory Was Created to Support Complex and Changing Rules
      • New theory of systems design, developed 1985
      • Focuses on optimal computer representation of complex, conditional and changing rules
      • Based on a new abstraction called ruleforms
      • The breakthrough was to find the unchanging features of changing systems
  • 14. The Theory Offers a Different Way to Look at Complex Systems and Processes
      [Diagram: observable behaviors (surface structure); generates; rules (middle structure); constrains; form of rules (deep structure)]
  • 15. This Creates New Levels for Analysis and Representation
Standard Terminology (if any) | Ultra-Structure Instance Name | Ultra-Structure Level Name | U-S Implementation
behavior, physical entities and relationships, processes | particular(s) | surface structure | system behavior
rules, laws, constraints, guidelines, rules of thumb | rule(s) | middle structure | data and some software (animation procedures)
(no standard or common term) | ruleform(s) | deep structure | tables
(no standard or common term) | universal(s) | sub-structure | attributes, fields
tokens, signs or symbols | token(s) | notational structure | character set
  • 16. The Ruleform Hypothesis
Complex system structures are created by not-necessarily complex processes; and these processes are created by the animation of operating rules. Operating rules can be grouped into a small number of classes whose form is prescribed by "ruleforms". While the operating rules of a system change over time, the ruleforms remain constant. A well-designed collection of ruleforms can anticipate all logically possible operating rules that might apply to the system, and constitutes the deep structure of the system.
  • 17. The CoRE Hypothesis
There exist Competency Rule Engines, or CoREs, consisting of <50 ruleforms, that are sufficient to represent all rules found among systems sharing broad family resemblances, e.g. all corporations. Their definitive deep structure will be permanent, unchanging, and robust for all members of the family, whose differences in manifest structures and behaviors will be represented entirely as differences in operating rules. The animation procedures for each engine will be relatively simple compared to current applications, requiring less than 100,000 lines of code in a third generation language.
  • 18. DOE Reviewer’s Assistant System Requirements
      • 650 guides defining 65,000 topics that are or may be classified
      • Extensive background knowledge required to interpret guidance
      • Guidance changes over time
      • Terminology in documents changes over time
      • Current backlog of 300+ million pages
      • Objective is concept spotting, not document understanding
  • 19. Normally This Would be Done Using an Expert System Shell
      • ES often have trouble with > 100 rules
      • DOE system will require about 500,000 rules
      • Key issue: maintainability of rules
      • Many benefits from using relational database to store rules as data
      • Built-in referential integrity
      • Easy report-writing and queries
      • Simple user interface for KE and Reviewers
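To illustrate the referential-integrity benefit listed on this slide, here is a hedged sketch in which a network rule can only relate semantic entities that have already been defined. The schema, column names, sample entities and the `add_network_rule` helper are assumptions made for this example (loosely modeled on the paper's Semantic Entities and Entities Network ruleforms and its 0-to-1 fitness number), with SQLite standing in for the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE semantic_entities (
        name TEXT PRIMARY KEY,
        kind TEXT NOT NULL               -- e.g. 'concept' or 'token'
    );
    CREATE TABLE entities_network (
        parent  TEXT NOT NULL REFERENCES semantic_entities(name),
        child   TEXT NOT NULL REFERENCES semantic_entities(name),
        fitness REAL NOT NULL CHECK (fitness BETWEEN 0 AND 1),
        PRIMARY KEY (parent, child)
    );
    """
)
conn.executemany(
    "INSERT INTO semantic_entities VALUES (?, ?)",
    [("facility", "concept"), ("laboratory", "token"), ("lab", "token")],
)


def add_network_rule(parent, child, fitness):
    """Try to store a rule relating two entities; the database itself
    rejects undefined entities and out-of-range fitness values."""
    try:
        conn.execute(
            "INSERT INTO entities_network VALUES (?, ?, ?)",
            (parent, child, fitness),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, `add_network_rule("facility", "observatory", 0.5)` returns `False` because `observatory` was never defined as a semantic entity; no application code has to check for that case.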
  • 20. RAS Defines Guidance Concepts and All Possible Lexical Expressions of Those Concepts
      [Flow diagram boxes: Define Interpretations; Convert Guides; System Ready; Read Document; Apply Guidance; Document Reviewed]
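Before guidance can be applied to a document, the paper says the text is pre-analyzed: boundaries are determined, periods are disambiguated, and each semantic entity is recorded with attributes such as character position, word number and sentence number. A rough sketch of that step follows. The regular expression, the tiny abbreviation list and the record field names are assumptions for illustration only; a real rulebase would hold such knowledge as rules rather than in code, and number formats such as "16,500" are not handled here.

```python
import re

# Hypothetical abbreviation list; a real rulebase would hold many more.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "u.s."}

# A token is a number with an optional decimal part, or otherwise a run
# of non-space characters.
NUMBER = r"\d+(?:\.\d+)?"
TOKEN_RE = re.compile(NUMBER + r"|\S+")


def pre_analyze(text):
    """Return one Document Detail-style record per semantic entity."""
    records = []
    sentence = 1
    for word_number, match in enumerate(TOKEN_RE.finditer(text), start=1):
        token = match.group()
        records.append({
            "token": token,
            "char_position": match.start(),
            "word_number": word_number,
            "sentence_number": sentence,
            "is_numeric": re.fullmatch(NUMBER, token) is not None,
        })
        # A trailing period ends the sentence unless the token is a known
        # abbreviation; a decimal point never ends a token.
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentence += 1
    return records
```

For the input `"Dr. Smith paid 3.5 dollars. He left."` this yields seven records, with `3.5` flagged as numeric and only the last two tokens placed in sentence 2.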
  • 21. Rules Specify Relations Between Concepts, Tokens and Markings
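The paper gives one concrete instance of these relations: if markings of “confidential” and “secret” both apply to the same document, the overall classification is “secret”. The sketch below shows one way such a hierarchy could be resolved from rules held as data. The rule tuples, the topic names, the fallback to "unclassified" and the function names are all hypothetical choices for this example, not the RAS ruleforms themselves.

```python
# Hypothetical markings network: each rule says the first marking
# outranks the second.
MARKINGS_NETWORK = [
    ("secret", "confidential"),
    ("confidential", "unclassified"),
]

# Hypothetical entity markings: semantic entity (e.g. a guide topic)
# mapped to the marking it calls for.
ENTITY_MARKINGS = {
    "topic-a": "confidential",
    "topic-b": "secret",
}


def rank(marking):
    """Length of the longest chain of markings below this one."""
    below = [lower for higher, lower in MARKINGS_NETWORK if higher == marking]
    return 0 if not below else 1 + max(rank(m) for m in below)


def overall_marking(topics_found):
    """Highest-ranked marking among the topics found in a document.
    Falling back to 'unclassified' is an assumption of this sketch."""
    markings = [ENTITY_MARKINGS[t] for t in topics_found if t in ENTITY_MARKINGS]
    if not markings:
        return "unclassified"
    return max(markings, key=rank)
```

Because the ordering lives in `MARKINGS_NETWORK` rather than in the functions, adding a new level means adding a rule, not changing code.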
  • 22. Results to Date are Promising
      • In a corpus of 3,750 unclassified documents, the false positive rate was less than 10%
      • In another corpus of 16,500 unclassified documents, the false positive rate was 2.5%
      • In other (e.g. keyword and statistical systems) approaches, false positive and false negative rates are often in excess of 50%
  • 23. The Ultra-Structure-Based RAS System Offers Substantial Benefits to Reviewers and Knowledge Engineers
      • System can provide precise and rigorous interpretation of DOE Classification Guidance
      • Rules can become more complex if necessary
      • Rules are easy to specify, change and review
      • Implications and consequences of changes can be better foreseen
      • Changes to rules do not require changing software or table structures – just data
  • 24. Next Steps for RAS Development
      • Work with subject experts to expand scope and improve quality and completeness of rulebase
      • Continue testing system against many types of documents
      • Improve design to minimize/eliminate false negatives and false positives
      • Work with end-users to improve user interface
      • Integrate into other systems
      • Improve design to increase speed: parallel processing, stored queries, etc.
  • 25. As the CoRE Hypothesis Promises, RAS Could be Used in Other Areas Also
      • Categorize documents by subject
      • Scan email for spam/UCE
      • Scan websites, e.g. for compliance to a standard
      • Categorize patents or scan them for specified concepts
      • Scan source code, e.g. Y2K
      • Scan any machine-readable corpus for specified ideas