Horizontal Integration of Big Intelligence Data


  1. HORIZONTAL INTEGRATION OF BIG INTELLIGENCE DATA: The Role of Ontology in the Era of Big Data
     T. Malyuta, Ph.D., New York City College of Technology, New York, NY
     B. Smith, Ph.D., University at Buffalo, Buffalo, NY
     R. Rudnicki, CUBRC, Buffalo, NY
  2. Big Data Problem
     • Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”
     • Gartner defines Big Data with three ‘V’s:
       • Volume
       • Velocity (of production and analysis)
       • Variety
     • This means that Big Data are beyond our control, in contrast to big, complex systems with diverse and changing data whose complexity is known
  3. Big Data Solution – Agility
     • Dimensions of agility
       • Storage paradigms that accommodate massive volumes of heterogeneous data
       • Data processing paradigms that can deal with the massive volumes of heterogeneous data coming onstream
       • Dynamic data stores that can easily accommodate diverse and a priori unknown data types and semantics
       • Methods and tools that leverage dynamic and diverse content
  4. Agile Integration and Interoperability
     • Today, the main problem with Big Data is using it
     • Exploiting ‘Variety’ (diverse types and semantics) requires data integration and interoperability
     • Traditional integration approaches fail
     • Agile integration paradigms are needed
  5. The Problem of Horizontal Integration of Big Intelligence Data
     • HI =def. the ability to exploit multiple data sources as if they were one
     • Recognized issues for HI with existing approaches
       • Data silos
       • Lexicon/semantics silos
     • Requirement for HI of Big Intelligence Data: Agile Semantic Interoperability
       • A strategy for HI must be agile in the sense that it can be quickly extended to new zones of emerging data according to need
       • Ontology allows an incremental approach that pays off from the very first investment (as we showed in the I2WD work)
       • Ontology can provide the needed agility
  6. Agile Semantic Interoperability
     • A good solution has to be
       • Able to grow incrementally
       • Able to be developed in a distributed manner without losing consistency
       • Independent of particular implementations, and of data producers and consumers
       • Applicable to data in an agile manner
     • We call our solution ‘semantic enhancement’ (SE) of data
  7. SE
     • SE is realized with the help of ontologies that are used to annotate (tag) data
       • The vocabulary of the ontologies used for annotations provides agile horizontal integration
       • Ontologies, by virtue of their nature and organization, provide semantic enhancement of data
     [Figure: a Skill hierarchy (Skill, ComputerSkill, ProgrammingSkill, with SQL, Java, and C++) shown alongside an Education hierarchy (Education, Technical Education), annotating the sample table below.]

       PersonID  Name  Description
       111       Java  Programming
       222       SQL   Database
  8. The Meaning of ‘Enhancement’
     • Semantic enhancement (enrichment) of data is an arm’s-length approach (no change to the data): through simple annotation we associate an entire knowledge system with a database field
       • This enables analytics to process data, e.g. about computer skills, “vertically” along the Skill hierarchy as well as “horizontally” via relations between Skill and Education
       • Further, while the data in the database do not change, their analysis can become richer and richer as our understanding of reality changes
     • For this richness to be leveraged by different communities, persons, and applications, it needs to have the properties mentioned above and be constructed in accordance with the SE principles
  9. SE Principles
     • Create a Shared Semantic Resource (SSR) of ontologies to be used for annotation
     • Establish an agile strategy for building ontologies within this SSR, and apply and extend these ontologies to annotate new source data as they come onstream
     • The strategy was pioneered in biomedical and other scientific fields: it leaves data as they are, and incrementally tags data sources with terms from a growing, consistent, non-redundant set of ontologies
     • Problem: given the immense and growing variety of data sources, the development methodology must be applied by multiple different groups
  10. Achieving the Goal
     • Methodology of incremental distributed ontology development
     • A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)
     • A shared governance and change management process
     • A simple, repeatable process for ontology development
     • An ontology registry
     • A process of intelligence data capture through ‘annotation’ or ‘tagging’ of source data artifacts
  11. Main Methodological Points
     • Ontological realism
       • Based on doctrine
       • Involves SMEs in label selection and definition
       • Thoroughly tested*
     • Arm’s-length process, with minimal disturbance to existing data and data semantics
     • Reference ontologies capture generic content and are designed for aggressive reuse in multiple different types of context
       • A single reference ontology for each domain of interest
     • Application ontologies are tied to specific local applications
       • An application ontology is created by combining local content with generic content taken over from relevant reference ontologies
       • Application ontologies remain interoperable because they are based on a common set of reference ontologies
     * Barry Smith and Werner Ceusters, “Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies”, Applied Ontology, 5 (2010), 139–188.
  12. Arm’s-length Process
     • The focus is on the terms (labels, acronyms, codes) used in our source data
     • Where multiple distinct terms {t1, …, tn} are used in separate data sources with one and the same meaning, they are associated with a single preferred label drawn from a standard set of such labels
     • All the separate data items associated with {t1, …, tn} are thereby linked together through the corresponding preferred labels
     • Preferred labels form the basis for the ontologies we build
     [Figure: SE ontology labels linked to heterogeneous contents drawn from multiple sources.]
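The linking step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the source names, native terms, and the preferred label `Vehicle` are hypothetical placeholders. The only point is that source records stay untouched while a separate mapping links them through a shared label.

```python
from collections import defaultdict

# Hypothetical mapping from (source, native term) to a preferred SE label.
# The sources keep their own terms; only this side table is maintained.
PREFERRED_LABEL = {
    ("SourceA", "veh"): "Vehicle",
    ("SourceB", "VEHICLE_TYPE"): "Vehicle",
    ("SourceC", "transport"): "Vehicle",
}

# Hypothetical source records: (source, native term, value).
RECORDS = [
    ("SourceA", "veh", "tractor"),
    ("SourceB", "VEHICLE_TYPE", "crane"),
    ("SourceC", "transport", "tractor"),
    ("SourceC", "colour", "green"),  # not annotated yet, still stored as-is
]


def link_by_preferred_label(records, mapping):
    """Group records under their preferred label without modifying them."""
    linked = defaultdict(list)
    for record in records:
        source, term, _value = record
        label = mapping.get((source, term))
        if label is not None:
            linked[label].append(record)
    return dict(linked)


linked = link_by_preferred_label(RECORDS, PREFERRED_LABEL)
```

Records whose terms have no preferred label yet are simply left out of the linked view; they remain in the source and can be annotated later.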
  13. Reference and Application Ontologies
     Reference Ontology
       • vehicle =def. an object used for transporting people or goods
       • tractor =def. a vehicle that is used for towing
       • crane =def. a vehicle that is used for lifting and moving heavy objects
       • vehicle platform =def. means of providing mobility to a vehicle
       • wheeled platform =def. a vehicle platform that provides mobility through the use of wheels
       • tracked platform =def. a vehicle platform that provides mobility through the use of continuous tracks
     Application Definitions
       • artillery vehicle =def. a vehicle designed for the transport of one or more artillery weapons
       • wheeled tractor =def. a tractor that has a wheeled platform
       • tracked tractor =def. a tractor that has a tracked platform
       • artillery tractor =def. an artillery vehicle that is a tractor
       • wheeled artillery tractor =def. an artillery tractor that has a wheeled platform
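The definitions on this slide follow a genus-plus-differentia pattern, which makes the is-a structure easy to check mechanically. The sketch below is illustrative only: the dictionary layout and the function names `ancestors` and `reference_ancestors` are assumptions, while the terms and their parents are read off the definitions on the slide.

```python
# term -> (kind, is-a parents named in the term's definition)
TERMS = {
    "vehicle": ("reference", []),
    "tractor": ("reference", ["vehicle"]),
    "crane": ("reference", ["vehicle"]),
    "vehicle platform": ("reference", []),
    "wheeled platform": ("reference", ["vehicle platform"]),
    "tracked platform": ("reference", ["vehicle platform"]),
    "artillery vehicle": ("application", ["vehicle"]),
    "wheeled tractor": ("application", ["tractor"]),
    "tracked tractor": ("application", ["tractor"]),
    "artillery tractor": ("application", ["artillery vehicle", "tractor"]),
    "wheeled artillery tractor": ("application", ["artillery tractor"]),
}


def ancestors(name):
    """Return every term reachable by following is-a links upward."""
    seen = set()
    stack = list(TERMS[name][1])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TERMS[current][1])
    return seen


def reference_ancestors(name):
    """Return only the ancestors that belong to a reference ontology."""
    return {a for a in ancestors(name) if TERMS[a][0] == "reference"}


# Every application term should be grounded in at least one reference term.
ungrounded = [
    name for name, (kind, _parents) in TERMS.items()
    if kind == "application" and not reference_ancestors(name)
]
```

Two application terms that share reference ancestors (here, `wheeled tractor` and `wheeled artillery tractor` both trace back to `tractor` and `vehicle`) can be queried together through those shared terms, which is the sense in which application ontologies stay interoperable.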
  14. Illustration of Ontology Types (Toy Example)
     [Figure: hierarchy of the terms Vehicle, Artillery Vehicle, Tractor, Wheeled Tractor, Artillery Tractor, and Wheeled Artillery Tractor. Legend: black = reference ontologies, red = application ontologies.]
  15. Role of Reference Ontologies
     • Normalized
       • Maintains a set of consistent ontologies
       • Eliminates redundancy
     • Modular
       • A set of plug-and-play ontology modules
       • Enables distributed consistent development
     • Surveyable
  16. SE Architecture
     • The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)
     • The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of the types that arise in successively narrower domains, until we reach the Lowest Level Ontologies (LLOs)
     • The LLOs are maximally specific representations of the entities in a particular one-dimensional domain
  17. Architecture Illustration
  18. Current State
     • Completed
       • Data Representation and Integration Framework (DRIF): architectural solution and implementation to create the Dataspace (a cloud of intelligence data)
       • Lossless representation of sources with their native semantics
       • Semantic Enhancement (SE): a suite of prototype ontologies with coverage allowing annotation of these native semantics
       • Index exposing the content of the Dataspace via SE, with proven benefits
       • Methodology and architecture for ontology development
     • In progress
       • Assembling the Shared Semantic Resource (SSR) as a separate store and enabling its use outside the Dataspace; in discussions with various agencies
  19. The SSR
     [Figure: agencies (DoD, Air Force, Navy, NSA, …) use reference ontologies (the Shared Semantic Resource) and application ontologies, with labels including Geospatial, Weapon, Agent-related, Agent, Organization, Information, Weapon-related, Artifact, and Event, for purposes of intelligence reporting, video analysis, and NLP analysis.]
  20. Challenges to HI
     • Too many lexicons
     • The scope of the domain: signal, sensor, image, … intelligence about … the whole world
     • Difficult to conduct governance and management of ontology development to ensure consistent evolution
     • Lack of expertise
  21. Preventing Failure
     • The method we use offers solutions to some of the common reasons for failure
     • Lack of Consensus
       • Realism offers an objective standard for settling disputes over terminology. Ontology development becomes an empirical science instead of an exercise in the publication of dialects
       • Governance helps to resolve conflicts and achieve consensus
     • High Maintenance
       • Arm’s-length implementation places no additional overhead on applications
     • Parochialism
       • Architecture and methodology prevent development of vocabularies that apply only to a single perspective
     • Poor Quality
       • Experience prevents common mistakes in vocabularies that cause downstream problems with search and analytics
  22. Distributed Common Ground System – Army (DCGS-A)
     Semantic Enhancement of the Dataspace on the Cloud
  23. Integrated Store of Intelligence Data
     • Lossless integration without heavy pre-processing
     • Ability to:
       • Incorporate multiple integration models, approaches, and points of view on data and data-semantics
       • Perform continuous semantic enrichment of the integrated store
     • Scalability
  24. Solution Components
     • Cloud implementation
       • Cloudbase (Accumulo)
     • Data Representation and Integration Framework
       • Comprehensive unified representation of data, data semantics, and metadata
     • This work was funded by US Army CERDEC Intelligence and Information Warfare Directorate (I2WD)
  25. Dealing with Semantic Heterogeneity
     • Physical integration. A separate data store homogenizing semantics in a particular data-model: works only for special cases, entails loss and distortion of data and semantics, and creates a new data silo.
     • Virtual integration. A projection onto a homogeneous data-model exposed to users: is more flexible, but may have the problem of data availability (e.g. military, intelligence). Also, a particular homogeneous model has limited usage, does not expose all content, and does not support enrichment.
  26. Pursuit of the Holy Grail of Intelligence Data Integration
     • In a highly dynamic semantic environment evolving in ad hoc ways:
       • How to have it all, and have it available immediately and at any time?
       • How to use these data resources efficiently (integrate, query, and analyze)?
     • Traditional physical and virtual integration approaches fail to respond to these requirements
  27. Workable Solution
     • A physical store incorporating heterogeneous contents. The Data Representation and Integration Framework (DRIF) is based on a decomposed representation of structured data (RDF-style); it allows collection of data resources without loss or distortion and thereby achieves representational integration.
     • Lightweight Semantic Enhancement (SE) supports semantic integration and provides a decent utilization capability without adding storage and processing weight to the already storage- and processing-heavy Dataspace.
  28. DRIF Dataspace
     • Integration without heavy pre-processing (ad hoc rapid integration):
       • Of any data artifact regardless of the model (or absence of it) and modality
       • Without loss or distortion of data and data-semantics
     • Continuous evolution and enrichment
     • Pay-as-you-go solution
       • While data and data-semantics are expected to be enriched and refined, they can be efficiently utilized immediately after entering the Dataspace through querying, navigation, and drilling
  29. Organization of the DRIF Dataspace
     [Figure: Dataspace stages: Registration, Ingestion, Extraction, [Transformation] / Enrichment.]
  30. Semantic Enhancement of the Dataspace
     • Simple yet efficient harmonization strategy
       • Takes place not by changing the data semantics to which it is applied, but rather by adding an extra semantic layer to it
       • Long-lasting solution that can be applied consistently and in cumulative fashion to new models entering the Dataspace
     • Strategy compliant with and complementing the DRIF
       • Source data models are not changed
     • Can be used efficiently, and in a unified fashion, in search, reasoning, and analytics
       • Provides views of the Dataspace at different levels of detail
     • Mapping to a particular Über-model, or choosing a single comprehensive model for harmonization, does not provide the benefits described
  31. Illustration
     • The DRIF Dataspace accommodates many data models and is a microcosm of a collection of systems with diverse and heterogeneous data
     • Incremental annotations of these data models through SE ontologies
     • Preserving the native content of data resources
     • Presenting the native content via the SE annotations
     • Benefits of the approach
  32. Sources
     • Source database Db1, with tables Person and Skill, containing person data and data pertaining to skills of different kinds, respectively.

       Person:  PersonID  SkillID
                111       222

       Skill:   SkillID  Name  Description
                222      Java  Programming

     • Source database Db2, with the table Person, containing data about IT personnel and their skills:

       Person:  ID   SkillDescr
                333  SQL

     • Source database Db3, with the table ProgrSkill, containing data about programmers’ skills:

       ProgrSkill:  EmplID  SkillName
                    444     Java
  33. Representation in the Dataspace
     Representation of data-models, SE, and SE annotations as Concepts and ConceptAssociations (blue: SE annotations; red: SE hierarchies):

       Label                Relation  SE Label
       Db1.Name             Is-a      SE.Skill
       Db2.SkillDescr       Is-a      SE.ComputerSkill
       Db3.SkillName        Is-a      SE.ProgrammingSkill
       Db1.PersonID         Is-a      SE.PersonID
       Db2.ID               Is-a      SE.PersonID
       Db3.EmplID           Is-a      SE.PersonID
       SE.ComputerSkill     Is-a      SE.Skill
       SE.ProgrammingSkill  Is-a      SE.ComputerSkill

     Native representation of structured data:

       Value and Associated Label  Relation        Value and Associated Label
       111, Db1.PersonID           hasSkillID      222, Db1.SkillID
       222, Db1.SkillID            hasName         Java, Db1.Name
       222, Db1.SkillID            hasDescription  Programming, Db1.Description
       333, Db2.ID                 hasSkillDescr   SQL, Db2.SkillDescr
       444, Db3.EmplID             hasSkillName    Java, Db3.SkillName
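One way to read this slide is as two sets of statements: annotations that tie native labels to SE labels, and value-level statements in the native vocabulary. The Python sketch below encodes both and derives SE-based index entries from them. It is a rough illustration under stated assumptions: the function `build_index` and the rule that only leaf values are indexed (so the intermediate key 222 is skipped) are guesses made for this example, not a description of how DRIF builds its index.

```python
# Native label -> SE label (the Is-a rows that link Db labels to SE labels).
ANNOTATIONS = {
    "Db1.Name": "SE.Skill",
    "Db2.SkillDescr": "SE.ComputerSkill",
    "Db3.SkillName": "SE.ProgrammingSkill",
    "Db1.PersonID": "SE.PersonID",
    "Db2.ID": "SE.PersonID",
    "Db3.EmplID": "SE.PersonID",
}

# ((subject value, subject label), relation, (object value, object label))
STATEMENTS = [
    (("111", "Db1.PersonID"), "hasSkillID", ("222", "Db1.SkillID")),
    (("222", "Db1.SkillID"), "hasName", ("Java", "Db1.Name")),
    (("222", "Db1.SkillID"), "hasDescription", ("Programming", "Db1.Description")),
    (("333", "Db2.ID"), "hasSkillDescr", ("SQL", "Db2.SkillDescr")),
    (("444", "Db3.EmplID"), "hasSkillName", ("Java", "Db3.SkillName")),
]


def build_index(statements, annotations):
    """Collect, for each person identifier, the leaf values reachable from it.

    Assumes the statements contain no cycles, which holds for this toy data.
    """
    outgoing = {}
    for subject, _relation, obj in statements:
        outgoing.setdefault(subject, []).append(obj)

    index = {}
    for node, targets in outgoing.items():
        value, label = node
        if annotations.get(label) != "SE.PersonID":
            continue  # only person identifiers become index entries
        fields = {}
        stack = list(targets)
        while stack:
            reached = stack.pop()
            if reached in outgoing:
                stack.extend(outgoing[reached])  # intermediate key: keep walking
            else:
                leaf_value, leaf_label = reached
                # Use the SE label when one exists, else keep the native label.
                fields[annotations.get(leaf_label, leaf_label)] = leaf_value
        index[value] = fields
    return index


INDEX = build_index(STATEMENTS, ANNOTATIONS)
```

Note that `Db1.Description` has no SE annotation in the toy example, so it surfaces in the index under its native label, while the native statements themselves are never rewritten.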
  34. Indexed Contents Based on the SE
     Index entries based on the SE and native (blue) vocabularies:

       Index Entry    Associated Field-Value
       111, PersonID  Type: Person; Skill: Java; Db1.Description: Programming
       333, PersonID  Type: Person; ComputerSkill: SQL
       444, PersonID  Type: Person; ProgrammingSkill: Java
  35. Benefits of DRIF + SE
     • Leverages the syntactic integration provided by DRIF, the semantic integration provided by the SE vocabulary and annotations of native sources, and the rich semantics provided by ontologies in general
       • Entering Skill = Java (which will be rewritten at run time as: Skill = Java OR ComputerSkill = Java OR ProgrammingSkill = Java OR NetworkSkill = Java) will return: persons 111 and 444
       • Entering ComputerSkill = Java OR ComputerSkill = SQL will return: persons 333 and 444
       • Entering ProgrammingSkill = Java will return: person 444
       • Entering Description = Programming will return: person 111
     • Allows querying, searching, and manipulating native representations
     • Lightweight, non-intrusive approach that can be improved and refined without impacting the Dataspace
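The run-time rewriting described on this slide amounts to expanding a query label to include all of its descendants in the SE hierarchy before matching against the index. The following Python sketch reproduces the four example queries. The function names `expand` and `search` are hypothetical, the placement of NetworkSkill under ComputerSkill is an assumption (the slide only shows that it falls under Skill), and matching `Description` against `Db1.Description` by dropping the source prefix is a simplification made for this example.

```python
# Index entries as shown on the "Indexed Contents Based on the SE" slide.
INDEX = {
    "111": {"Skill": "Java", "Db1.Description": "Programming"},
    "333": {"ComputerSkill": "SQL"},
    "444": {"ProgrammingSkill": "Java"},
}

# child -> parent in the SE hierarchy.
IS_A = {
    "ComputerSkill": "Skill",
    "ProgrammingSkill": "ComputerSkill",
    "NetworkSkill": "ComputerSkill",  # assumed position in the hierarchy
}


def expand(label):
    """Return the label together with all of its descendants."""
    labels = {label}
    changed = True
    while changed:
        changed = False
        for child, parent in IS_A.items():
            if parent in labels and child not in labels:
                labels.add(child)
                changed = True
    return labels


def search(label, value):
    """Return the sorted person identifiers matching label = value."""
    wanted = expand(label)
    hits = []
    for person, fields in INDEX.items():
        for field, field_value in fields.items():
            short = field.split(".")[-1]  # "Db1.Description" -> "Description"
            if short in wanted and field_value == value:
                hits.append(person)
                break
    return sorted(hits)


skill_java = search("Skill", "Java")
computer_java_or_sql = sorted(
    set(search("ComputerSkill", "Java")) | set(search("ComputerSkill", "SQL"))
)
programming_java = search("ProgrammingSkill", "Java")
description_programming = search("Description", "Programming")
```

Refining the hierarchy (for example, adding a new subtype of ProgrammingSkill) changes only `IS_A`; neither the index entries nor the underlying sources need to be touched.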
  36. Index Contents without the SE
     Index entries based on native vocabularies:

       Index Entry    Associated Field-Value
       111, PersonID  Type: Person; Name: Java; Description: Programming
       333, ID        Type: Person; SkillDescr: SQL
       444, EmplID    Type: Person; SkillName: Java
  37. Problems
     • Even in our toy example we can see how much manual effort the analyst needs to apply when searching without SE, and even then the information gained will be meager in comparison with what is made available through the Index with SE.
       • For example, an analyst who is familiar with the labels used in Db1, and is thus in a position to enter Name = Java, will still get back only person 111. Directly salient Db3 information will thus be missed.
  38. Additional Notes on the SE Process
     • Original data and data-semantics are included in the Dataspace without loss or distortion; thus there is no need to cover all semantics of the Dataspace. What is unlikely to be used in search, or is not important for integration, will still be available when needed
     • A complex ontology is not needed: a common and shared vocabulary is sufficient for virtual semantic integration and search/analytics
     • The approach is very flexible, and investments can be made in specific areas according to need (pay-as-you-go)
     • The approach is tunable: if the chosen annotations of a particular subset of a source data-model are too general for data analyses, the respective ontologies can be further developed and the source models re-annotated
  39. Benefits of the Approach
     • Does not interfere with the source content
     • Enhancement enables this content to evolve in a cumulative fashion as it accommodates new kinds of data
     • Does not depend on the data resources and can be developed independently from them in an incremental and distributed fashion
     • Provides a more consistent, homogeneous, and well-articulated presentation of content that originates in multiple internally inconsistent and heterogeneous systems
     • Makes management and exploitation of the content more cost-effective
     • The use of the selected ontologies brings integration with other government initiatives and brings the system closer to the federally mandated net-centric data strategy
     • Creates integrated content that is effectively searchable and to which more powerful analytics can be applied
  40. Towards Globalization and Sharing
     • Using the SE approach to create a Shared Semantic Resource for the Intelligence Community to enable interoperability across systems
     • Applying it directly to, or projecting its contents onto, a particular integration solution
  41. References
     • Smith B. et al., “Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community”, STIDS Conference, 2012.
     • Smith B. et al., “Ontology for the Intelligence Analyst”, Crosstalk: The Journal of Defense Software Engineering, 2012.
     • Salmen D. et al., “Integration of Intelligence Data through Semantic Enhancement”, STIDS Conference, 2011.
  42. Data Tactics Corporation, 7901 Jones Branch Dr., Suite 700, McLean, VA 22102, www.data-tactics-corp.com
