Automated identification of sensitive information

October 21, 1999: "Using Ultra-Structure for Automated Identification of Sensitive Information in Documents". Presented at the 20th annual conference of the American Society for Engineering Management. Paper published in conference proceedings.


1. Cover Page

Using Ultra-Structure for Automated Identification of Sensitive Information in Documents

Author: Jeffrey G. Long (jefflong@aol.com)
Date: October 21, 1999
Forum: Talk presented at the 20th annual conference of the American Society for Engineering Management.
Contents: Pages 1-5: Preprint of paper; Pages 6-24: Slides (but no text) for presentation

License: This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Uploaded July 1, 2011
USING ULTRA-STRUCTURE FOR AUTOMATED IDENTIFICATION OF SENSITIVE INFORMATION IN DOCUMENTS [1]

Jeffrey G. Long, Sr. Knowledge Engineer, DynCorp NSP

[1] This work was funded by the U.S. Department of Energy under contract DE-AC01-98NN50049.

Abstract. The Government has a strong interest in protecting nuclear and national security information while maximizing the availability of information about its operations. Towards this end federal agencies have developed tens of thousands of guidance rules to help determine what is or is not classified, and at what level. Trained and certified document reviewers apply these rules. The Reviewers Assistant System (RAS) at the Department of Energy is a new type of expert system that uses a relational database to store very large numbers of rules. It is being developed to automatically apply DOE guidance rules to text documents. This requires far more than a mere keyword search, as the ideas being distinguished can be quite subtle. The purpose of the system is not to understand the documents it processes, but rather to simply check them for the existence of certain ideas or facts. This technology for concept-spotting can be applied to other arenas, such as detecting junk email or searching web sites. This paper will discuss the features, goals and general methods of the Reviewers Assistant System.

Background

The traditional approach to managing any engineering project is structured: it moves from general planning to requirements analysis, design, implementation, and then long-term maintenance, and has explicit criteria to determine whether and when to move to the next stage. This approach works quite well for creating most types of systems. If a system is simple and the requirements change, the traditional structured approach works because the system can be affordably modified. Alternatively, if a system is complex but the requirements never change, the system can be successfully built. Traditional structured approaches have now proven to be better than completely unstructured approaches and have led to the development of many successful systems. However, standard structured approaches have failed to satisfactorily address the problems involved in creating time-viable software systems, especially over long times. Instead, they have led to the frequent replacement of systems with wholly new ones, most often at great cost for both developers and users of software. Attempts to get better user requirements through rapid prototyping and better charting tools for notating work processes have helped somewhat, but they too have failed to make systems more flexible in the face of changing user requirements. Indeed, if both (a) the system being developed is complex and (b) the user requirements are subject to significant change over time, then the existing structured approaches usually do not work. The greatest engineering accomplishments of the 20th Century are of the former type. Complex systems ranging from computers to the Space Shuttle have of course been successfully built – but only on the condition that their user requirements change little if at all after the design stage.

Unfortunately for us, software systems are both complex and ever changing. As such, they require a different engineering management approach. The theory of "Ultra-Structure" was developed in part because the application of traditional engineering approaches has failed when applied to designing software systems. This theory was described in (Long and Denning, 1995), and less technically in (Long, 1999), but will be briefly and nontechnically described here.

This article primarily discusses the application of Ultra-Structure theory to a new area, namely the analysis of text documents for sensitive information. The U.S. Department of Energy (DOE) Declassification Productivity Initiative (DPI), which will be described later in this paper, has funded this work. The use of Ultra-Structure theory has thus far allowed us to address several difficult problems that have limited and hindered previous efforts at Natural Language Processing (NLP) systems, expert systems, and large knowledgebases; in particular:

• the ability to manage large numbers of rules in a knowledgebase, numbering in the tens of thousands and eventually in the hundreds of thousands
• the ability to give knowledge engineers a set of tools to help them visualize and manage large knowledgebases of rules ("rulebases")
• the ability to manage and maintain both metadata and content information regarding large numbers of documents.
Ultra-Structure Fundamentals

Ultra-Structure theory is based on a different way of looking at the world – a different paradigm or worldview. The traditional Western, Aristotelian worldview sees the world as composed of objects having attributes and relationships to other objects. Ultra-Structure theory sees the world as being a process which, as a minor by-product, occasionally generates physical entities and new relationships among physical entities. The first task of an Ultra-Structure analysis is to understand these processes and to represent them very accurately. The development task of an Ultra-Structure analysis is to ensure that the representation models not just the processes that exist currently, but all logically possible processes within that family of systems.

Ultra-Structure theory suggests that any process can, in principle and in practice, be analyzed into or reduced to a set of If-Then rules. Seemingly simple processes usually follow just a few simple rules; and seemingly complex processes may follow either a few, or many, simple rules. One of the interesting discoveries in cellular automata studies has been that very simple rules can generate very complex behaviors. This has also been observed in the work done in the last 20 years in fractal geometry, where the recursive execution of a single remarkably simple formula, which is a type of rule, can specify very complex shapes.

The kind of rules that humans usually work with may be thought of as complex rules. However, a good analyst can analyze complex rules into simple or atomic rules, according to Ultra-Structure theory. Rules may be defined in terms of sets, with each set having a specified and limited possible domain or set of values. A particular rule in this definition is a particular ordered sequence of these values.

Ultra-Structure theory suggests that a good analyst can always define an ordered sequence of domains (in Ultra-Structure terminology, universals) which will contain any particular instance of a rule. Such an ordered set of universals is called a ruleform. A ruleform is to a rule what ordinary algebra is to arithmetic: it is a more generalized way of specifying the essential structural ideas and relationships of a system. While ordinary algebra uses symbols to represent numbers and various arithmetical operations, Ultra-Structure uses universals to represent various domains whose contents may be numeric, alphabetical, or indeed the tokens of any notational system.

Exhibit 1 shows proposed terminology and distinctions for the different layers of structure of any system. Simply put, the surface structure is defined to be the physical manifestation of any system, consisting of its physical entities, relationships and processes. The middle structure is the set of all rules governing the system, which generate the surface structure. The deep structure is the set of ordered domains from which particular rules may be constructed. The sub-structure represents the set of all possible domains (universals). Finally, the notational structure is the set of tokens used by the rules to represent various abstractions. For RAS, these tokens are numbers and letters; for (say) a music system they would be letters and other signs interpreted as musical "notes" or instructions.

Ultra-Structure theory specifies how these various layers can be represented on a computer as tables in a relational database. To implement a ruleform, we create a table where each row is a separate simple rule, and each column is a separate universal. Typically a complex rule will require several simple rules that are stored in different ruleforms; this group of ruleforms must be examined as a whole in order for the system to make decisions; it is referred to as a cluster. There are several types of ruleforms, as will be illustrated later. Properly specifying these structures for a model of a system enables the users of the model to enter the rules of the system as data, which is easily changed as necessary when the rules change. Approximately 99% of the rules of the system are specified as data, so the model itself – its software and data structures – has no knowledge of the outside world. All the model itself knows is the order in which to read the ruleforms. This control logic, called animation procedures, is very small; typically just a few thousand lines of code even for a very complex system.
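As a rough, minimal sketch of this ruleform-as-table idea (the table names, columns, rows, and markings below are invented for illustration and are not the actual RAS schema), a ruleform can be declared as an ordinary relational table in which each column is a universal and each row is one simple rule, and a tiny "animation procedure" then reads a cluster of such tables in a fixed order:

```python
# A minimal sketch of the "ruleform = relational table" idea described above.
# Table and column names are invented for illustration; they are not the
# actual RAS schema.
import sqlite3

con = sqlite3.connect(":memory:")

# One ruleform: each column is a universal, each row is one simple rule.
con.execute("""
    CREATE TABLE entity_marking_ruleform (
        semantic_entity TEXT PRIMARY KEY,  -- universal 1
        marking         TEXT               -- universal 2
    )""")

# A second ruleform in the same cluster: orders markings into a hierarchy.
con.execute("""
    CREATE TABLE marking_network_ruleform (
        marking TEXT PRIMARY KEY,
        rank    INTEGER        -- higher rank dominates in the hierarchy
    )""")

# The rules of the system are entered as data, not code.
con.executemany("INSERT INTO entity_marking_ruleform VALUES (?, ?)",
                [("weapon design detail", "secret"),
                 ("site visitor log", "confidential")])
con.executemany("INSERT INTO marking_network_ruleform VALUES (?, ?)",
                [("unclassified", 0), ("confidential", 1), ("secret", 2)])

def animate(found_entities):
    """A tiny 'animation procedure': it knows only the order in which to read
    the ruleforms; all knowledge of the subject matter lives in the rows."""
    markings = [row[0]
                for entity in found_entities
                for row in con.execute(
                    "SELECT marking FROM entity_marking_ruleform"
                    " WHERE semantic_entity = ?", (entity,))]
    if not markings:
        return "unclassified"
    # Consult the second ruleform of the cluster to pick the dominant marking.
    return max(markings,
               key=lambda m: con.execute(
                   "SELECT rank FROM marking_network_ruleform WHERE marking = ?",
                   (m,)).fetchone()[0])

print(animate(["site visitor log", "weapon design detail"]))  # prints: secret
```

In this sketch, changing what the system knows means inserting or updating rows; the animation procedure itself does not change.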
Natural Language: A New Application Area for Ultra-Structure

Ultra-Structure theory has been applied in a number of application areas, mostly in business. This has included the traditional application areas of order entry, inventory control, billing and cash application, and similar business functions. It has also been applied experimentally to other areas, as indicated by (Shostko, 1999) and (Oh and Scotti, 1999).

Starting with part-time work in 1995, it has now been applied to the automated identification of sensitive information in text documents. The present application area is new for Ultra-Structure: attempting to specify the rules by which natural language can be at least partially understood, i.e. partially interpreted and assigned meaning.

The Government has a strong interest in protecting national security information (NSI), while facilitating government openness to public scrutiny and not spending significant amounts of money and time protecting information that does not truly need to be classified. President Clinton issued Executive Order 12958 (E.O.), the most recent codification of the Government's intentions in this area, on April 14, 1995. It states, among other things, that all classified documents containing only NSI (not RD or FRD, discussed below) will be automatically declassified after 25 years unless one of nine specified conditions for exemption is met.

The estimated volume of documents to be reviewed for declassification under the E.O. exceeds one billion, and the five year grace period specified by the E.O. for reviews to identify exemptions from automatic declassification has been extended for an additional 18 months to help the Agencies meet the enormous work loads.

Under the E.O., any information that is covered under the Atomic Energy Act (AEA) is exempt from automatic declassification. Such information includes anything pertaining to the construction, design or use of nuclear weapons, nuclear propulsion systems, and other special nuclear materials. This information was exempted because the President does not have the authority to unilaterally change the AEA (a law), and also because it is generally recognized that even "old" nuclear design information would still be of current value to a would-be proliferant. It is simply not in the interests of the United States to provide such information.

To help identify this kind of information, called "Restricted Data" (RD) or "Formerly Restricted Data" (FRD), as well as other kinds of national security information (NSI), the Department of Energy has developed about 65,000 specific guidance topics. Their purpose is to help determine what is or is not classified as RD, FRD or NSI, and at what level (confidential, secret, or top secret). Trained and certified document reviewers apply these topics. Moreover, under the Freedom of Information Act (FOIA), the public is entitled to request documents and DOE must be prepared to justify any classification actions it takes before a federal judge. They must have a clear rationale tracing back to the 65,000 guidance topics and from there to either the AEA or to the latest E.O. pertaining to national security.

Document reviewing is a manually intensive process requiring years of education and training. Congress funded the Declassification Productivity Initiative (DPI) at the Department of Energy (DOE) in order to develop advanced tools to help reviewers in various ways. One of the primary tools we have been developing under DPI is called the "Reviewer's Assistant System" (RAS), which was built using Ultra-Structure theory.

Reviewer's Assistant System Functions

We are building RAS using Ultra-Structure theory because the number of rules is quite large and these rules are likely to change over time. RAS is designed to:

• rigorously apply DOE Guidance Rules to text documents
• highlight any segments of the text to which a Guidance Rule applies
• highlight for the user (i.e. a certified document reviewer) the specific Guidance Rule(s) that caused any particular sections of text to be selected.

The purpose of RAS is not to "understand" the documents it processes, but rather to detect the existence of any classified concepts or facts. While this is simpler than true document understanding, it nevertheless requires far more than mere keyword searching, where a system simply scans a document for the existence of one or more specified terms. It also requires more than a Boolean keyword search, where specific terms can be ANDed, ORed, etc. We are seeking specific concepts having specific relations to one another, which we refer to as ideas or propositions. The ideas being sought can be quite subtle.

Merging Databases and Text Markup Languages

Traditionally the task of defining the elements of document structure would be performed using a text markup language such as a derivative of the Standard Generalized Markup Language (SGML). In this kind of language, "tags" indicating different structural features of the document are inserted into the document at the beginning and end of each structural feature. Following Ultra-Structure theory, RAS represents the information in terms of rules, and the rules are stored as records of data in various tables. In RAS, therefore, all structural markup information is stored in a database. This kind of markup does not use in-line tags, but instead uses different fields in a table.

There are a number of advantages to storing structured text information in database tables rather than in a flat file with tags. Chief among these are the following general capabilities of relational databases over flat files:

• control access to the data through a security system and audit trail
• enforce referential integrity, such that when a value changes in one part of the system it is immediately changed in all parts of the system
• permit use of complex queries using (e.g.) Structured Query Language (SQL)
• give users quick access to volumes of data through easy-to-use forms and reports
• store and retrieve various types of objects in addition to standard text (e.g. images, sounds).
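To make the contrast with in-line tagging concrete, here is a hypothetical sketch (the table, field names, and example values are invented and are not the actual Document Detail layout): the same structural facts can be recorded either as tags embedded in the text or as fields of rows in a table that leave the text untouched.

```python
# Illustrative contrast between in-line (SGML-style) tags and table-based
# markup. The table and field names are invented; they are not the actual
# RAS ruleforms.
import sqlite3

# In-line markup embeds the structure in the text itself:
tagged = "<sentence><word>Reactor</word> <word>output</word> ...</sentence>"
print(tagged)

# Table-based markup leaves the raw text untouched and records each
# structural feature as fields of a row.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE structural_markup (
        doc_id     TEXT,
        feature    TEXT,      -- e.g. 'word', 'sentence', 'paragraph'
        start_char INTEGER,   -- where the feature begins in the raw text
        end_char   INTEGER    -- where it ends
    )""")
con.executemany("INSERT INTO structural_markup VALUES (?, ?, ?, ?)",
                [("doc-001", "word", 0, 7),        # "Reactor"
                 ("doc-001", "word", 8, 14),       # "output"
                 ("doc-001", "sentence", 0, 15)])

# Ordinary SQL queries then provide the kinds of capability listed above,
# e.g. "find every sentence boundary recorded for doc-001":
for row in con.execute("SELECT start_char, end_char FROM structural_markup "
                       "WHERE doc_id = ? AND feature = 'sentence'",
                       ("doc-001",)):
    print(row)
```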
Merging Databases and Knowledgebases

Ever since mankind first used an abacus about 5,000 years ago, and possibly since we first notched tallies on a stick 30,000 years ago, we have distinguished algorithms from data. This has been a useful distinction, but the veritable wall between the two began to break down when John von Neumann proposed in a memo in 1945 that not just data but also algorithms (as computer instructions) could be stored on a computer in a binary form. This insight – based on work done by him and others at the University of Pennsylvania, including John Mauchly and J. Presper Eckert – led to programmable (stored program) computers.

Although both parts are stored in the same way as binary digits (bits), computer applications are still viewed as consisting of two very different things: algorithms and data. An algorithm is a finite series of steps taken to compute an answer. Data is the values or parameters used by an algorithm to reach its conclusions; such data may have initial, intermediate and final values.

Database applications are generally viewed as applications that provide storage places and access methods for the safe storage and retrieval of persistent data, and the safe adding, changing and deleting of data following certain integrity rules, regardless of whether the application software using the database enforces those rules or not. Under this paradigm, databases store and protect "facts" or "data," and the algorithms that read and use these facts are stored in software programs, queries, stored procedures, job control language procedures, etc. Examples of such applications are order entry, inventory, purchasing, and accounting systems. This class of systems is concerned primarily with data storage, arithmetic and logical calculations, and information retrieval. For this class of systems, changing the rules of a business area requires changing the software – a frequently difficult task.

Expert systems are a different class of applications which consist of rules and an inference engine, and which are concerned primarily with applying reasoning to facts in order to simulate the behavior of human experts in a particular subject domain. The inference engine processes the rules, which are stored in a "knowledgebase" rather than a database. These rules may include executable code, or they may be mere data. The reasoning process may be similar to that of a human expert, or it may be completely different. The behavior of the system as a whole is intended to mimic, and hopefully outperform, a human expert. Examples of such applications include bank credit approval, medical diagnosis, and hardware configuration systems. These systems are usually intended to aid rather than replace human decision-makers. They offer the benefits of high speed, high consistency, and perfect attention to detail.

There have been a number of attempts in the last ten years to bridge the gap between these two classes of applications – to merge databases and knowledgebases and their associated technologies. There is a growing belief that modern database systems must evolve towards knowledgebase systems, and that more "inferencing" is necessary for better understanding and use of data. This could lead to applications involving hundreds of thousands of complex rules that make decisions that seem truly "intelligent."

The Ultra-Structure paradigm does not make these conventional distinctions between algorithms and data. Rather it defines whatever is stored in a relational database table to be rules which have two different types of parts, called factors and considerations. Factors are primary keys in a table that determine under what general conditions a rule should be looked at; following standard normalization rules, there must be unique keys (factors) for each record (rule). What is traditionally considered to be data (i.e., a fact) is usually stored as a consideration (a non-primary-key attribute) in the record, and this attribute serves merely to guide the execution of a rule cluster. In an inventory system, for example, the quantity-on-hand of a particular item is simply a consideration determining where the item may be sourced for an order. That and other rules in the cluster must all be examined in order for the inventory system to make an intelligent sourcing decision. The inference engine (called animation procedures) consists of just a few thousand lines of code. All knowledge of the external world lies in the rulebase, and none in the animation procedures. RAS is an example of a new type of system that uses a relational database to store a very large number of rules as data.

This perspective requires a new and broader understanding of the nature of rules. If we broaden our concept of rules from

IF x THEN do y and z

to

IF x THEN consider y and z before deciding what to do,

then y and z can serve the role traditionally reserved for data, that is, they can represent the facts of the world. They do this as an integral part of a larger and more comprehensive cluster of rules, acting as considerations for the execution of individual rules.

This means that all the business rules of an organization can be stored as data, and the only software that is necessary is the inference engine, which should never need to change. This puts all knowledge of the world and all the knowledge of rules in a format which is easy to update, easy to review, and can be managed easily by a standard relational database.
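The factor/consideration distinction and the broadened rule form can be sketched with the inventory example above (the table layout, data, and sourcing policy below are invented for illustration):

```python
# Sketch of "factors" (primary-key columns that say when a rule applies) and
# "considerations" (non-key values that guide, rather than dictate, the
# decision), using the inventory example above. Table layout, data and the
# sourcing policy are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sourcing_ruleform (
        item        TEXT,      -- factor
        warehouse   TEXT,      -- factor
        qty_on_hand INTEGER,   -- consideration: a "fact" stored inside a rule
        PRIMARY KEY (item, warehouse)
    )""")
con.executemany("INSERT INTO sourcing_ruleform VALUES (?, ?, ?)",
                [("widget", "east", 3), ("widget", "west", 40)])

def source_order(item, qty):
    """IF the item matches (factor) THEN consider quantity-on-hand before
    deciding which warehouse fills the order; the whole cluster of matching
    rules is examined, not just a single IF-THEN."""
    candidates = con.execute(
        "SELECT warehouse, qty_on_hand FROM sourcing_ruleform WHERE item = ?",
        (item,)).fetchall()
    feasible = [(w, q) for w, q in candidates if q >= qty]
    # Invented policy: prefer the feasible warehouse holding the most stock.
    return max(feasible, key=lambda wq: wq[1])[0] if feasible else None

print(source_order("widget", 10))  # prints: west
```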
Rules in RAS

As used by RAS, Ultra-Structure defines several basic kinds of existential rules or types of entities:

• semantic entities, which can be letters, words, phrases, guide topics, or entire guides
• documents, which are the entities being analyzed by the system
• markings, which indicate what to do in the event that certain ideas are found in a text, e.g. mark the document as "confidential"
• users, which define the authorized users of the system.

These entities typically have complex relations to one another. If related to other entities of the same type, they are called network rules. RAS has several kinds of network rule:

• entities network relates semantic entities to one another
• markings network relates markings to one another, and in particular indicates a hierarchy of markings
• documents network relates documents to one another, indicating (e.g.) that one document replaces another, or is a duplicate of another, etc.

If existential entities of one kind are related to entities of another kind, we represent that with an authorization rule. RAS has several kinds of authorization rules:

• document detail contains the results of the pre-analysis of a document, specifying the semantic entities and their characteristics and order in the document
• document analysis contains the results of the analysis of a document
• entity markings relates semantic entities (e.g. guide topics) and their associated markings (if any)

Note that each ruleform (table) may be interpreted as defining rules. For example, the Document Detail table may be interpreted as specifying rules for the (re)construction of the original document. The Markings Network specifies how markings are ordered in a hierarchy, e.g. if a marking of "confidential" and a marking of "secret" both apply to the same document, then the overall classification of the document is "secret". There are other ruleforms (tables) in RAS, but these give the general idea of what the system contains.

Executing (Animating) the RAS Rules

In order to search for concepts in a text, the text must first be "pre-analyzed." This involves the determination of various boundaries (e.g. sentence boundaries) and the determination of the nature of certain kinds of lexical entities, e.g. whether a specific entity is numeric or non-numeric, and whether a period is part of a number (a decimal point), is used as part of an acronym or abbreviation, or is indeed the end of a sentence. Each word in a document is usually treated as a separate "semantic entity." But since words in a phrase often have meanings very different than the same words outside the phrase (e.g. "A horse of a different color" has nothing to do with either horses or colors!), there is frequently a need to indicate that several words must always be treated as a single phrase, in which case the entire phrase becomes a single semantic entity. This is defined by a replacement rule in the Entities Network ruleform.

Each semantic entity has a number of attributes such as character position in the document, word number, sentence number, paragraph number, whether it is numeric, etc. These and other attributes for each semantic entity are stored as a rule on a single record in the Document Detail ruleform, in lieu of using an SGML-type markup language. The system is thus generating new "rules" based on other rules, which facilitates subsequent analysis.
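A deliberately simplified sketch of this pre-analysis step is shown below (the tokenization heuristics and record fields are illustrative stand-ins, not the actual RAS pre-analyzer):

```python
# Simplified sketch of document pre-analysis: split text into word-level
# semantic entities, decide whether each is numeric, and record attributes
# for each one. Heuristics and field names are illustrative only.
import re
from dataclasses import dataclass

@dataclass
class DetailRecord:            # stand-in for one row of a Document Detail ruleform
    word: str
    char_pos: int              # character position in the document
    word_num: int
    sentence_num: int
    is_numeric: bool

def pre_analyze(text):
    """Emit one record per word. The period handling is a crude heuristic:
    a trailing period ends a sentence unless the token is just a number, so
    the decimal point in '3.5' is not treated as a sentence break."""
    records, sentence_num = [], 1
    for word_num, match in enumerate(re.finditer(r"\S+", text), start=1):
        token = match.group()
        bare = token.strip(".,;:")
        records.append(DetailRecord(
            word=bare,
            char_pos=match.start(),
            word_num=word_num,
            sentence_num=sentence_num,
            is_numeric=bool(re.fullmatch(r"\d+(\.\d+)?", bare))))
        if token.endswith(".") and not re.fullmatch(r"\d+\.", token):
            sentence_num += 1
    return records

for record in pre_analyze("The test used 3.5 kg of material. Results follow."):
    print(record)
```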
After performing an analysis it is necessary to indicate which portions of the text are considered classified or are otherwise marked, and what specific guidance topic(s) caused the text to be selected. These rules, also generated by the system based on other rules, are stored in the Document Analysis ruleform.

Performing the analysis itself requires looking for the tokens in the target documents, and applying the markings indicated. Since each guidance topic is translated into one or more propositions, and there are about 65,000 guidance topics, we anticipate that there will be about 100,000 propositions to be represented and searched for in each text. This number accounts for and excludes duplicate guidance topics. This number of rules alone would make RAS a very large expert system.

As indicated in Exhibit 2, specifying a proposition (in the sense used here) means specifying usually two to four concepts which occur within a defined proximity of one another in a text, e.g. 5 sentences or 15 words. We have found thus far that even the most complex propositions require only six concepts.
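A toy version of such proposition spotting might look as follows (the concepts, tokens, and 15-word window are invented examples of the "two to four concepts within a defined proximity" idea):

```python
# Toy sketch of proposition spotting: a proposition is a set of concepts that
# must all occur within a given window of words. Concepts, tokens and window
# sizes here are invented examples, not real guidance rules.

# Each concept maps to the tokens that can express it.
CONCEPT_TOKENS = {
    "reactor":  {"reactor", "pile"},
    "quantity": {"kilograms", "kg", "tonnes"},
}

def spot_proposition(words, concepts, window=15):
    """Return (start, end) word positions if every concept in `concepts`
    appears at least once inside some `window`-word span, else None."""
    lowered = [w.lower().strip(".,;:") for w in words]
    for start in range(len(lowered)):
        span = lowered[start:start + window]
        if all(any(tok in span for tok in CONCEPT_TOKENS[c]) for c in concepts):
            return start, min(start + window, len(lowered))
    return None

text = "The pile was fueled with 3 kg of material last year".split()
print(spot_proposition(text, ["reactor", "quantity"]))  # prints: (0, 11)
```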
Specifying several concepts to search for is not by itself adequate: the computer must also know all the possible ways that each concept can be tokenized (i.e. lexically expressed) in any text document. This calls for a large tree of relationships specifying how concepts can be tokenized. This mapping requirement – essentially a large ontology of all areas of DOE activity – will probably add another 500,000+ rules. We keep these rules in the Entities Network ruleform, first defining all concepts and tokenizations in the Semantic Entities ruleform. Note that in many cases there is no need to specify all forms of an entity, e.g. singular, plural, possessive, etc.; using a wildcard before or after a word stem is sometimes adequate, so long as the knowledge engineer is aware that use of stems may increase the false positive hit rate.

Of course, not all concepts and tokenizations are equally related to one another. We represent this degree of closeness with a fuzzy fitness number from 0 to 1 to indicate the degree to which the two are related. We then need to test this rulebase against a large set of documents, and to go back and correct the rules to minimize or eliminate false positives and false negatives, either by adding new entities to look for, changing the relationships of existing entities, or specifying tighter hit ranges.
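The wildcard-on-a-stem and fitness-number ideas could be expressed along these lines (a sketch only; the stems, fitness values, and threshold are made up):

```python
# Sketch of (a) matching tokens by word stem with a trailing wildcard, and
# (b) weighting each concept-token link with a fitness number from 0 to 1.
# Stems, fitness values and the 0.5 threshold are invented for illustration.
import re

# concept -> list of (token pattern, fitness); '*' is a wildcard on the stem.
TOKEN_RULES = {
    "centrifuge": [("centrifug*", 0.9),   # centrifuge, centrifuges, centrifugal
                   ("rotor*", 0.4)],      # loosely related; raises false-positive risk
}

def fitness(concept, word, threshold=0.5):
    """Return the best fitness of `word` for `concept`, or 0.0 if below threshold."""
    best = 0.0
    for pattern, fit in TOKEN_RULES[concept]:
        regex = "^" + re.escape(pattern).replace(r"\*", r"\w*") + "$"
        if re.match(regex, word.lower()) and fit > best:
            best = fit
    return best if best >= threshold else 0.0

print(fitness("centrifuge", "centrifuges"))  # prints: 0.9
print(fitness("centrifuge", "rotors"))       # prints: 0.0 (below threshold)
```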
In the long term, we will need to be able to keep the rules up-to-date as the original guidance topics change over time and as we apply it to different corpora having variations in their lexical representations, e.g., using a different (former) name for a national laboratory. We are still in the early stages of creating and validating these rules, and we expect this to be the most difficult part of building the system due to the wide range of subject areas to be covered.

Results to Date

We have run RAS on several different corpora having very different characteristics, in order to see how it performs with these different corpora. Characteristics of interest include whether the documents are known or believed to be classified or unclassified; the size of each document in the corpus, ranging from a few sentences to hundreds of pages; the size of the total corpus, ranging so far up to about 3 million words; and whether the corpus was originally created electronically or whether it was OCRed and therefore has some number of OCR errors in it.

The rulebase tested has over 700 guidance rules in it, and maps to about 20,000 tokens. We expect soon to greatly increase the number of guidance rules applied.

Results so far show a typical false positive hit rate of about ten percent of the documents reviewed, meaning that of 100 documents, RAS will incorrectly identify ten as "hot" when they are not. We hope of course to reduce this rate by broadening and deepening the rulebase, and work is underway to automatically generate RAS rules from the written English guidance. Of these false positives, about ten percent are what we call "good false positives", i.e. items that are not in fact classified but which a reviewer would want to look at closely before making that determination.

In terms of missing items that should have been identified as "hot" (i.e., false negatives), results are harder to determine since sometimes even human reviewers may disagree about what is sensitive. But results to date indicate that almost all missed items are readily accounted for as outside the domain of the rulebase, either pertaining to a different subject area or including tokens that the rulebase was unaware of. This low false negative and low false positive rate is in great contrast to other approaches, which have often been found unusable based on 50%+ false positive hit rates, using that term as defined above.

Other Possible RAS Applications

The RAS technology for "concept-spotting" – reviewing documents for the existence of specific propositions – can theoretically be applied to other arenas, such as:

• identifying unsolicited commercial email ("spam")
• searching web sites for certain ideas
• searching patents for certain ideas
• scanning computer source code for Y2K date issues.

We are still a long way from completing the RAS system. The results to date have been very promising: RAS is demonstrating the advantages of Ultra-Structure theory for concept detection and large knowledgebases. The Declassification Productivity Research Center (DPRC) at The George Washington University is carrying out other Ultra-Structure based research projects, which are also showing positive results (Oh and Scotti, 1999).

Summary and Conclusions

A million records is small by database system standards, but a million rules is essentially an impossible number for a traditional expert system to manage. We expect to be able to effectively handle very large numbers of rules, numbering in the hundreds of thousands, using the techniques being followed for RAS. Ultra-Structure theory may constitute a real merger of knowledgebase and database technologies. If so, it has the potential to usher in a new era of vastly larger expert systems for carrying out policies and procedures of extreme complexity.
References

Long, Jeffrey G. and Denning, Dorothy E., "Ultra-Structure: A design theory for complex systems and processes," Communications of the ACM 38(1), (1995) 103-120.

Long, Jeffrey G., "A new notation for representing business and other rules," Semiotica 125-1/3, (1999) 215-228.

Oh, Youngsuck and Scotti, Richard, "Analysis and Design of a Database using Ultra-Structure Theory (UST) – Conversion of a Traditional Software System to One Based on UST," Proceedings of the 20th Annual Conference, American Society for Engineering Management (1999).

Shostko, Alexander, "Design of an automatic course-scheduling system using Ultra-Structure," Semiotica 125-1/3, (1999) 197-214.

About the Author

Mr. Long is Senior Knowledge Engineer on the DPI project. He is also Director of the Notational Engineering Laboratory, an effort to create a clearinghouse for people interested in problems of representation in any field of science, art or other activity. His experience includes 25 years of consulting on various kinds of applications software development, with a particular focus on studying complex systems and the problems of representing them.

Standard Terminology (if any) | Ultra-Structure Instance Name | Ultra-Structure Level Name | U-S Implementation
behavior, physical entities and relationships, processes | particular(s) | surface structure | system behavior
rules, laws, constraints, guidelines, rules of thumb | rule(s) | middle structure | data and some software (animation procedures)
(no standard or common term) | ruleform(s) | deep structure | tables
(no standard or common term) | universal(s) | sub-structure | attributes, fields
tokens, signs or symbols | token(s) | notational structure | character set

Exhibit 1: Layers of Structure in Any System, According to Ultra-Structure Theory
Exhibit 2: RAS Breakdown of Topics to Tokens (figure)
10. Using Ultra-Structure for Automated Identification of Sensitive Information in Documents. Jeffrey G. Long, Sr. Knowledge Engineer, DynMeridian. notate@aol.com

11. Traditional Engineering Approaches Work Only Under Certain Conditions
12. Unfortunately, Complex and Changing Needs Exist in Every Organization (chart: Needs vs. SW & DB at time 1, time 2, time 3 ...)
13. Ultra-Structure Theory Was Created to Support Complex and Changing Rules
• New theory of systems design, developed 1985
• Focuses on optimal computer representation of complex, conditional and changing rules
• Based on a new abstraction called ruleforms
• The breakthrough was to find the unchanging features of changing systems
14. The Theory Offers a Different Way to Look at Complex Systems and Processes (diagram: the deep structure, the form of rules, constrains the middle structure, the rules, which generates the surface structure, the observable behaviors)
15. This Creates New Levels for Analysis and Representation
Standard Terminology (if any) | Ultra-Structure Instance Name | Ultra-Structure Level Name | U-S Implementation
behavior, physical entities and relationships, processes | particular(s) | surface structure | system behavior
rules, laws, constraints, guidelines, rules of thumb | rule(s) | middle structure | data and some software (animation procedures)
(no standard or common term) | ruleform(s) | deep structure | tables
(no standard or common term) | universal(s) | sub-structure | attributes, fields
tokens, signs or symbols | token(s) | notational structure | character set
16. The Ruleform Hypothesis
Complex system structures are created by not-necessarily complex processes; and these processes are created by the animation of operating rules. Operating rules can be grouped into a small number of classes whose form is prescribed by "ruleforms". While the operating rules of a system change over time, the ruleforms remain constant. A well-designed collection of ruleforms can anticipate all logically possible operating rules that might apply to the system, and constitutes the deep structure of the system.

17. The CoRE Hypothesis
There exist Competency Rule Engines, or CoREs, consisting of <50 ruleforms, that are sufficient to represent all rules found among systems sharing broad family resemblances, e.g. all corporations. Their definitive deep structure will be permanent, unchanging, and robust for all members of the family, whose differences in manifest structures and behaviors will be represented entirely as differences in operating rules. The animation procedures for each engine will be relatively simple compared to current applications, requiring less than 100,000 lines of code in a third generation language.

18. DOE Reviewer's Assistant System Requirements
• 650 guides defining 65,000 topics that are or may be classified
• Extensive background knowledge required to interpret guidance
• Guidance changes over time
• Terminology in documents changes over time
• Current backlog of 300+ million pages
• Objective is concept spotting, not document understanding

19. Normally This Would be Done Using an Expert System Shell
• ES often have trouble with > 100 rules
• DOE system will require about 500,000 rules
• Key issue: maintainability of rules
• Many benefits from using relational database to store rules as data
  - Built-in referential integrity
  - Easy report-writing and queries
  - Simple user interface for KE and Reviewers
20. RAS Defines Guidance Concepts and All Possible Lexical Expressions of Those Concepts (flow diagram: Convert Guides, Define Interpretations, System Ready; Read Document, Apply Guidance, Document Reviewed)
21. Rules Specify Relations Between Concepts, Tokens and Markings

22. Results to Date are Promising
• In a corpus of 3,750 unclassified documents, the false positive rate was less than 10%
• In another corpus of 16,500 unclassified documents, the false positive rate was 2.5%
• In other approaches (e.g. keyword and statistical systems), false positive and false negative rates are often in excess of 50%

23. The Ultra-Structure-Based RAS System Offers Substantial Benefits to Reviewers and Knowledge Engineers
• System can provide precise and rigorous interpretation of DOE Classification Guidance
• Rules can become more complex if necessary
• Rules are easy to specify, change and review
• Implications and consequences of changes can be better foreseen
• Changes to rules do not require changing software or table structures – just data

24. Next Steps for RAS Development
• Work with subject experts to expand scope and improve quality and completeness of rulebase
• Continue testing system against many types of documents
• Improve design to minimize/eliminate false negatives and false positives
• Work with end-users to improve user interface
• Integrate into other systems
• Improve design to increase speed: parallel processing, stored queries, etc.

25. As the CoRE Hypothesis Promises, RAS Could be Used in Other Areas Also
• Categorize documents by subject
• Scan email for spam/UCE
• Scan websites, e.g. for compliance to a standard
• Categorize patents or scan them for specified concepts
• Scan source code, e.g. Y2K
• Scan any machine-readable corpus for specified ideas
