Alignment and Analytics of Large Scale, Disparate Data from IARPA's Knowledge Discovery and Dissemination (KDD) Program

  • Speak about Big Data in terms
  • - Left side shows KDD-developed resources. - Right side shows the CUBRC KDD architecture. - Data flows through the system as follows:
    Visualization: The analyst issues a query through the visualization tool.
    Query Expansion: Expands the query using the semantic index developed in alignment. Queries are also ranked by relevance to the original question.
    Query Execution: The query is executed against the data services API. Aligned data models negotiate which sources the query is executed against.
    Graph Creation: Query results are passed to the unstructured and structured graph creation modules. This is a highly parallelized process in which the Hadoop framework is leveraged to create and assemble RDF graphs quickly.
    Global Model: Graphs are written as part of a global model. The model connects all graphs for all analyst queries yet leaves them easily segregable by query, analyst, date of creation, data sources used, etc.
    Association: Takes the RDF graph and associates all people, places, events, etc. that are the same. Again a highly parallelized process, in which the Hadoop framework is leveraged to reduce the overall time spent comparing entities to determine similarity.
    Task Answering: Dirty graph matching techniques use templates to find specific answers to analyst questions. Fuzzy computations determine answers when the underlying graph structure is missing information.
    Re-query: Finds likely queries the analyst would want to issue next. For example, if the graph indicated that someone worked as a software engineer but not which company the person worked for, abductive re-query would suggest a query to search for the name of the company.
    Visualization: The analyst is presented with answers in the visualization tool, along with re-query options.
  • 1. ontology (backbone of this project) -- Why is an ontology important; It speaks the language. -- Here are our ontologies -- Here is data that we have developed. -- Maybe some statistics on the explosion of data -- How overlaying a model to truly network information together is the best approach -- Show the exotic queries from Phase 1; very very powerful -- Query can go from the raw data to the extracted types
  • This architecture will allow us to integrate more machine learning algorithms and create a hybrid system for producing alignment predictions.
    - Supports weighting of alignment learners.
    - A Learner can be a Mega-Learner, so multiple levels of prediction are supported.
    - All learners can use the data contained within the Learner Context.
    - Each learner posts its alignment result and score to the Alignment Data Cube for other learners to access if needed.
    - The Alignment Data Cube has an architecture similar to a data cube used in data mining.
    - All scores are normalized between 0 and 1.
    Data Value Characterization: regular expressions determine the overall categorization of the data in the column.
    Lucene Based Alignment: a TF/IDF-based learner that uses WordNet to expand the search terms.
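
The learner notes above describe learners posting normalized scores to a shared Alignment Data Cube, with a Mega-Learner weighting them. A minimal Python sketch of that idea follows. It is illustrative only: the class name, learner names, target terms, weights, and the weighted-average rule are all assumptions, not the AIRS implementation, and it assumes each learner posts at most one score per target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentResult:
    learner: str   # hypothetical learner name, e.g. "data_value" or "lucene"
    target: str    # hypothetical ontology term the column may align with
    score: float   # normalized to [0, 1], as the notes require


def mega_learner(cube, weights):
    """Combine the scores posted to the cube into one weighted score per target.

    A learner that posts nothing for a target contributes 0 for that target.
    """
    for result in cube:
        if not 0.0 <= result.score <= 1.0:
            raise ValueError(f"score out of range: {result}")
    learners = {result.learner for result in cube}
    total_weight = sum(weights.get(name, 1.0) for name in learners)
    combined = {}
    for result in cube:
        weight = weights.get(result.learner, 1.0)
        combined[result.target] = combined.get(result.target, 0.0) + weight * result.score
    return {target: value / total_weight for target, value in combined.items()}


# Hypothetical cube contents for one column.
cube = [
    AlignmentResult("data_value", "Person.birthDate", 0.9),
    AlignmentResult("lucene", "Person.birthDate", 0.6),
    AlignmentResult("lucene", "Event.date", 0.4),
]
scores = mega_learner(cube, weights={"data_value": 2.0, "lucene": 1.0})
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the output of `mega_learner` is again a set of scores in the 0 to 1 range, it could itself be posted to the cube, which is one way to read the "multiple levels of prediction" remark.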

Transcript

  • 1. Advantage Through Technology. Actionable Intelligence Retrieval System (AIRS) Overview. 27 November 2012. CUBRC KDD AIRS System
  • 2. Alignment of Data Models - Single representation for all data sources - Easily plug in new data sources. [Diagram: a Report (Observer ID: 556AS4, Date: 15 May 08), a Transcript (Date: 10 Apr 08, "Situation in certain areas dire as lack of rain …"), and a Newsletter (Date: 26 Apr 08, "Crop outlook for early summer …") each record the same event. Aligned graph labels: Event, Type: Crop Failure, OccursOn: Apr – Jul 2008, OccursAt: Western Afghanistan, RecordedBy: Report, Transcript, Newsletter; Observer. Captions: "Data Model Perspective Confounds Data Integration", "Remove Perspective", "Crop Failure Extent", "Detection".]
  • 3. Advanced Analytics Algorithms. [Diagram: event timeline, ranging from quantitative to qualitative.] “Easy” Analyst Questions - Identify All Event Information. “Harder” Analyst Questions - Identify Similar Events. “Hardest” Analyst Question - Identify Predictor Events.
  • 4. Probe Tasks • Fully automated tasks • Test system plumbing • Ex: Find all associates of Jim Johnson and list each person’s affiliation to Jim. Use only data sets A, E, M. • 20 questions like these
    Analyst Tasks • Manual tasks executed by actual analysts • Test usability and applicability of developed algorithms to realistic tasks • Ex: Find all information that may have predicted an attack was imminent in Khost, Afghanistan on 3 June, 2008. • 10 questions like these
  • 5. Many Sources, Many Records, Many Types. [Chart: data sets DS 1 through DS 8 against a record-count scale of 1K, 100K, 1M. Types shown: Reports, Articles, Blogs, Transcripts, Structured, DOMEX, Semi-Structured, Social Media.]
  • 6. Three Essential Components: Architecture, Research Tasks, Integrated Prototype
  • 7. 9 High Level Research Areas, 30 Research Tasks in Phase 2
    ALIGNMENT: 1. Ontology Development 2. Structured Data Alignment 3. Unstructured Data Alignment 4. Alignment Reasoner 5. Alignment Optimization
    ADVANCED ANALYTICS: 6. Workflow Optimization 7. Application of Analyst Context 8. Data Association for Entity Resolution 9. Distributed Graph Matching
    Tasks: •Task 1.1.3 (CUBRC) April - PreProto •Task 1.1.4 (CUBRC) Aug - Lab •Task 1.1.5 (CUBRC) Aug - Lab •Task 1.2.2 (CUBRC) April - Lab •Task 2.1.2 (ISS) April - Lab •Task 2.1.3.a (ISS) April - Lab •Task 2.1.3.c (ISS) August - Lab •Task 3.1.2.a (GDIT) April Lab | Aug PreProto •Task 3.1.3 (GDIT) •Task 3.1.4 (GDIT) April PreProto April Lab | Aug PreProto •Task 3.2.1.a (GDIT) •Task 3.2.1.b (GDIT) April Lab | Aug PreProto April Lab | Aug PreProto •Task 3.2.1.c (GDIT) April Lab | Aug PreProto •Task 3.2.1.d (GDIT) April Lab | Aug PreProto •Task 3.2.3 (GDIT) April PreProto •Task 4.2.1 (Securboration) Aug Lab •Task 5.1.1 (CUBRC) Aug Lab •Task 6.1.2 (CUBRC/UB) April Lab | Aug PreProto •Task 6.1.4 (CUBRC) April Lab | Aug PreProto •Task 6.1.5 (CUBRC) •Task 6.3.1 (CUBRC) April Lab •Task 7.3.1 (Securboration) •Task 7.4.1 (UB) April Lab Aug PreProto •Task 8.1.1 (UB) Aug PreProto •Task 8.3.1 (UB) April PreProto •Task 8.3.2 (UB) April Lab | Aug PreProto •Task 9.1.1 (UB) Aug PreProto •Task 9.1.2 (CUBRC) Aug Lab •Task 9.2.1 (UB) Aug Lab •Task 9.3.1 (UB) Aug Lab
  • 8. Data Flow. [Architecture diagram. Stages: Visualization; Query: Query Expansion (single threaded, produces ranked queries); Query: Query Execution (uses aligned models, searches Data Services over raw data sources); Graph Creation: Structured (parallelized); Graph Creation: Unstructured (parallelized); Global Model (KDD RDF, read and write); Association: Entities & Events (parallelized); Query: SPARQL; Analytics (invoke algorithm); Requery: Abduction (analyze query, evaluate results); Answers returned to Visualization.]
  • 9. Backbone of Project. [Ontology stack diagram: Basic Formal Ontology – Relation Ontology; Artifact Ontology; Time Ontology; Extended Relation Ontology; Information Technology Ontology; Agent Ontology; Event Ontology; Geospatial Ontology; Quality Ontology; AIRS Mid-Level Ontology (defines input & output format); Counterterrorism Ontology (most processes).]
  • 10. Information Entity Ontology • 76 local classes • 21 equivalence class axioms • 1 superclass axiom • 28 local object properties • 7 datatype properties
    Agent Ontology • 787 local classes • 231 equivalence class axioms (mostly persons with roles, e.g. Physician, Lawyer) • 70 local object properties (mostly familial relationships) • SPARQL Inferencing Notation (SPIN) rules that infer familial relationships from the primitive relationships child_of and parent_of and the qualities of male and female gender.
    [Figure: sample document annotated with #Note, #Paragraph, #SectionOfText, #Person, #Place.]
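
Slide 10 mentions SPIN rules that infer familial relationships from `child_of`, `parent_of`, and gender. The same kind of inference can be sketched in plain Python. This is an illustration, not the SPIN rules themselves: the derived relation names (`mother_of`, `father_of`, `sister_of`, `brother_of`), the people, and the rule that siblings share at least one parent are all assumptions.

```python
def infer_family(parent_of, gender):
    """Derive named relationships from parent_of pairs and gender qualities."""
    inferred = set()
    for parent, child in parent_of:
        inferred.add((child, "child_of", parent))
        if gender.get(parent) == "female":
            inferred.add((parent, "mother_of", child))
        elif gender.get(parent) == "male":
            inferred.add((parent, "father_of", child))
    # Two different children of the same parent are treated as siblings.
    for parent_a, child_a in parent_of:
        for parent_b, child_b in parent_of:
            if parent_a == parent_b and child_a != child_b:
                if gender.get(child_a) == "female":
                    inferred.add((child_a, "sister_of", child_b))
                elif gender.get(child_a) == "male":
                    inferred.add((child_a, "brother_of", child_b))
    return inferred


# Hypothetical primitive facts.
parent_of = {("Ann", "Cat"), ("Ann", "Dan")}
gender = {"Ann": "female", "Cat": "female", "Dan": "male"}
family = infer_family(parent_of, gender)
```

Keeping only the primitive relations in the data and deriving the rest on demand is the design choice the slide points at: the 70 object properties do not all need to be asserted by the extraction step.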
  • 11. Analytics Query ‘Soup-to-Nuts’: “Documents where Smyth is a Person && has Associates && footnote contains ‘XY’ && from data set 4 or 5”. [Diagram: a single SPARQL query spanning the Graph, the Ontology, and the Raw data of data sets 4 and 5.]
  • 12. Data Flow (same architecture diagram as slide 8)
  • 13. Architecture Implementation. [Diagram, built on the Spring Framework: a Column Alignment Request is handled by Learners (Data Value Characterization, Column Categorical Based Alignment, Lucene Based Alignment) and Mega-Learners, which share a Learner Context and an Alignment Data Cube and produce a Column Alignment Prediction.]
    Data Value Characterization • Used metadata, data values, regular expressions, and neural networks to classify columns • Combined with a collection of heuristics • Date Time • Person’s Name, Alias, and Birth Date • Recognizing unstructured data within structured data
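
The Data Value Characterization bullets on slide 13 describe classifying a column from its values with regular expressions. A minimal Python sketch of the regex part only is below. The three patterns, the category names, and the 0.8 threshold are hypothetical; the slides do not list the actual expressions, and the metadata and neural-network parts are not modeled.

```python
import re

# Hypothetical category patterns (assumptions, not the AIRS expressions).
PATTERNS = {
    "date": re.compile(r"\d{1,2} [A-Z][a-z]{2} \d{2,4}|\d{4}-\d{2}-\d{2}"),
    "person_name": re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"),
    "number": re.compile(r"-?\d+(?:\.\d+)?"),
}


def characterize_column(values, threshold=0.8):
    """Return (category, fraction of non-empty cells matching that category)."""
    cells = [v.strip() for v in values if v and v.strip()]
    if not cells:
        return "unknown", 0.0
    counts = dict.fromkeys(PATTERNS, 0)
    for cell in cells:
        for category, pattern in PATTERNS.items():
            if pattern.fullmatch(cell):
                counts[category] += 1
                break  # first matching category wins
    best = max(counts, key=counts.get)
    fraction = counts[best] / len(cells)
    return (best, fraction) if fraction >= threshold else ("unknown", fraction)


dates = characterize_column(["15 May 08", "10 Apr 08", "26 Apr 08"])
names = characterize_column(["Jim Johnson", "Jane Doe", ""])
mixed = characterize_column(["Jim Johnson", "42", "15 May 08"])
```

The returned fraction is already in the 0 to 1 range, so a learner built this way could report it directly as its normalized score.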
  • 14. D2RQ Mapping File • Enables dynamic RDF generation
  • 15. Method
    1. Document Type Identification: • Determine document type with pattern-based configurations
    2. Passage & Metadata Retrieval: • With the document type, identify and extract data using: a. Template / Grammar Process b. Generic Heuristic Process
    3. Document Genre Association: • Link associated document genres
    [Pipeline diagram: Document → Document Type Identification Process (identification configuration) → document type annotations → Passage & Metadata Retrieval via (a) Template / Grammar Process (template grammars) or (b) Generic Heuristic Process → passages and metadata → Document Genre Association Process → document genre links.]
  • 16. Methods • Extraction of Entity types (People, Place, Location, Facility, etc.) • Extraction of Events and Relationships - Uses an external file of patterns to extract attributes, relationships, and events. • Speed is 100 - 250K per second for information extraction. [Figure: pattern language example with Purchaser and Seller roles; caption "Quickly Define".]
  • 17. Developed Tools: Create Corpora Tool 1. Pulls down documents from data sources (uses samples) 2. Performs document analysis 3. Generates Core Types. ~20 minutes for full markup of 1200 documents
  • 18. Developed Tools: Corner Case Coverage; Text to RDF tool
  • 19. Data Flow (same architecture diagram as slide 8)
  • 20. [Diagram labels: Many Data Sources; Data Service; Keyword-based Query; Keyword Index; Fast Core Analytics (Structured Data Processing, Natural Language Processing); Dynamic Graph Generation; Custom Analytics; Consistent Ontology; Realistic 5 Minute Running Time Goal; Scalable (Hadoop).]
  • 21. Purpose: To create a component that selects the workflow definition that satisfies a set of QoS requirements, maximizing the expected outcome of the workflow.
    Method: Solve Composite Service Problem • The problem is decomposed into a sequence of functionalities. • Functionalities (service classes) can be executed by many candidate services. • Candidates have associated benefits/costs (QoS parameters). • Candidates are substitute and complementary within a service class. • Given QoS requirements, e.g., algorithm runtime ≤ 5 minutes
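
The composite service problem on slide 21 can be sketched as choosing candidates per service class so that total benefit is maximal while total runtime stays within the QoS limit. The Python below is a minimal illustration under stated assumptions: it picks exactly one candidate per class, enumerates every combination exhaustively, and uses made-up service names, runtimes, and benefit values. The slides do not say how AIRS actually solves the problem.

```python
from itertools import product


def select_workflow(service_classes, max_runtime_s):
    """Pick one candidate per service class, maximizing summed benefit
    subject to summed runtime <= max_runtime_s. Returns None if infeasible."""
    best_plan, best_benefit = None, float("-inf")
    for plan in product(*service_classes.values()):
        runtime = sum(candidate["runtime_s"] for candidate in plan)
        benefit = sum(candidate["benefit"] for candidate in plan)
        if runtime <= max_runtime_s and benefit > best_benefit:
            best_plan, best_benefit = plan, benefit
    if best_plan is None:
        return None
    return dict(zip(service_classes, (candidate["name"] for candidate in best_plan)))


# Hypothetical candidates and QoS parameters.
service_classes = {
    "extraction": [
        {"name": "fast_nlp", "runtime_s": 60, "benefit": 0.6},
        {"name": "deep_nlp", "runtime_s": 200, "benefit": 0.9},
    ],
    "association": [
        {"name": "fastest", "runtime_s": 30, "benefit": 0.5},
        {"name": "lagrangian", "runtime_s": 150, "benefit": 0.85},
    ],
}
plan = select_workflow(service_classes, max_runtime_s=300)  # 5-minute limit
```

Exhaustive enumeration grows multiplicatively with the number of candidates per class, so it only suits small workflows; a larger catalogue would call for an integer programming or dynamic programming formulation.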
  • 22. • Implemented in prototype system as runtime QoS. [Workflow diagram with a 5-minute budget; stages shown: Search, Structured Processing, Unstructured Processing, Write to Model, Write SPARQL Query, VIZ.] • Developers must adhere to QoS parameters • Phenomenal feedback loop developed with analysts; analysts understood and diagnosed the system • Choose two additional QoS metrics for Phase 3 (memory)
  • 23. [Diagram: Event Representation (Location, Time, Description); Similarity measures: Euclidean, String, Spatial/Hierarchical, TFIDF (0.80), Semantic (0.64); Method (Max F): Dynamic Weighting (.80), Static Weighting, Logistic Regression (.75), Neural Network (.77), SVM (0.75).]
    Major Research Tasks: • Identified succinct, easily extractable event representation • Tested Location and Description similarity measures • Tested Event Similarity Algorithms • Tested performance on natural language and structured data sources
  • 24. GTD: 200804060007. 04/06/2008: On Sunday, unknown gunmen set up a fake checkpoint and intercepted two college buses, one carrying male students and one carrying female students, in Mosul, Nineveh province, Iraq. The bus carrying the female students managed to escape but the gunmen held the 42 male college students…
    WITS: 200804509. On 6 April 2008, in the morning, in Jurn, Ninawa, Iraq, armed assailants stopped two school buses carrying students to Mosul University at a fake checkpoint. The assailants then fired upon one of the busses as it managed to escape, wounding three students and damaging the bus. Assailants kidnapped all 42 students on board the second bus…
    Jurn ≈ Mosul (Mosul to Jurn: 25 km). Gaza ≠ Sderot. Close Distance ≠ Similarity
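
Slide 24's point that close distance is not the same as similarity can be made concrete by comparing a great-circle distance with a hierarchical (administrative path) similarity. The Python sketch below is an assumption-laden illustration: the shared-prefix measure is one possible hierarchical similarity, not the measure the project tested; the gazetteer paths are illustrative, with "Ninawa" and "Nineveh" assumed to be normalized to one spelling; and the Gaza and Sderot coordinates are approximate values that do not come from the slides.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def hierarchical_similarity(path_a, path_b):
    """Share of the longer administrative path covered by the common prefix."""
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_a), len(path_b))


# Illustrative gazetteer paths (assumptions).
MOSUL = ("Iraq", "Ninawa", "Mosul")
JURN = ("Iraq", "Ninawa", "Jurn")
GAZA = ("Gaza Strip", "Gaza")
SDEROT = ("Israel", "Southern District", "Sderot")

mosul_jurn = hierarchical_similarity(MOSUL, JURN)
gaza_sderot = hierarchical_similarity(GAZA, SDEROT)
# Approximate coordinates (assumptions): Gaza 31.50 N 34.47 E, Sderot 31.52 N 34.60 E.
gaza_sderot_km = haversine_km(31.50, 34.47, 31.52, 34.60)
```

Under these assumptions the Gaza and Sderot pair is physically closer than the 25 km the slide gives for Mosul and Jurn, yet it scores lower on the hierarchical measure, which matches the slide's contrast.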
  • 25. Processing Pipelines for Speed vs. Quality Decision. [Pipeline diagram: text files and ontology models (Ont Model 1 to Ont Model 4) in <RDF INPUT DIRECTORY>; EntityResolutionSubproblemConstruction.java produces subproblems, e.g. Subproblem (1,2) … Subproblem (3,4), in <SUBPROBLEM DIRECTORY>; solvers FastestEntityResolutionSolverLocal.java, LREntityResolutionSolverLocal.java, FastestEntityResolutionSolverMR.java, and LREntityResolutionSolverMR.java; a new ontology model is written to <NEW-RDF OUTPUT DIRECTORY>. Legend: Implements JavaJobRunner; Implements JavaJobRunner, but runs MR Jobs; Implements MapReduceJobRunner. Associate: Person, Location, Organization, Date, Artifact.]
  • 26. Method: Lagrangian relaxation of an integer programming formulation of the clustering problem. This algorithm iteratively adjusts scores to resolve inconsistencies, and also provides a performance guarantee (optimality gap) on the solutions. [Figure: example graph on entities P1, P2, P3 with pairwise scores 55, 65, and -85. Plots: Objective Value vs. Iteration Number; Run Time per Iteration (minutes) vs. # Processors.]
  • 27. Results: Cluster AIRS Search. [Figure: 300 distinct results clustered by similar content; cluster labels shown include Arrest, Trial, and Group Information.]
  • 28. • Analyst Context and Current State – Analyst may come to the system with some information • “There was a Terrorist Act at time X” • “I am interested in this suspected Insurgent” • “I want to know about a relationship between groups A and B” – Initial queries may produce statements aligned with CTO • Abductive Requery is applied – Select weighted fragments whose bound variables match CTO elements used in Context/State – Select the rules those fragments correspond to, weighting by selected fragments – Combine rule statements with known Context/State – Produce subsequent query with known values ‘filled in’
    Context: “Jane Doe” wife “John Doe”
    Fragment 1: { ?p1 wife ?p2 . }
    Rule: CONSTRUCT { ?p1 wife ?p2 . ?p2 husband ?p1 . } WHERE { ?w1 rdf:type Wedding . ?p1 bride ?w1 . ?p2 groom ?w1 . }
    Subsequent query: SELECT ?w1 WHERE { “Jane Doe” bride ?w1 . “John Doe” groom ?w1 . ?w1 rdf:type Wedding . }
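
The wedding example on slide 28 can be traced with a small Python sketch: a known statement is unified with the head of a rule, and the rule's body becomes the next query with the known values filled in. This is a simplified illustration that ignores fragment weighting and rule selection; the function names are hypothetical, and the output follows the slide's shorthand of writing names as quoted strings in subject position.

```python
def unify(pattern, fact):
    """Bind the ?variables of a triple pattern to a fact, or return None."""
    bindings = {}
    for term, value in zip(pattern, fact):
        if term.startswith("?"):
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def abductive_requery(head, body, known_fact):
    """Build a query over the rule body with known values filled in."""
    bindings = unify(head, known_fact)
    if bindings is None:
        return None
    unknown, clauses = [], []
    for triple in body:
        terms = []
        for term in triple:
            if term in bindings:
                terms.append(f'"{bindings[term]}"')
            else:
                terms.append(term)
                if term.startswith("?") and term not in unknown:
                    unknown.append(term)
        clauses.append(" ".join(terms) + " .")
    return f"SELECT {' '.join(unknown)} WHERE {{ {' '.join(clauses)} }}"


head = ("?p1", "wife", "?p2")
body = [
    ("?p1", "bride", "?w1"),
    ("?p2", "groom", "?w1"),
    ("?w1", "rdf:type", "Wedding"),
]
query = abductive_requery(head, body, ("Jane Doe", "wife", "John Doe"))
print(query)
```

Running the rule "backwards" in this way is what makes the step abductive: the system knows the conclusion (a marriage) and asks for the evidence (a wedding event) that would explain it.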
  • 29. Data Flow (same architecture diagram as slide 8)
  • 30. • Developed on the Hadoop/MapReduce framework • Distributed services used in AIRS – Algorithms are written within the MapReduce and HDFS (file-system) environment – Single-threaded algorithms are a single “slot” algorithm – Oozie is the workflow coordination service; all jobs are monitored, dispatched, and logged – HBase and HDFS are used as distributed data stores for document metadata and RDF graphs
    [Stack diagram, top to bottom: AIRS Software; HBase Database, Oozie Workflow Coordination Service, MySQL Database; MapReduce Processing Framework; Hadoop Distributed File System (HDFS); Server / Cluster Hardware.]
  • 31. SELECT DISTINCT ?personNameText WHERE { ?act rdf:type event:Act . ?act ro:has_participant ?person . ?person rdf:type agent:Person . ?person ero:designated_by ?personName . ?personName ero:bearer_of ?personNameBearer . ?personNameBearer info:has_text_value ?personNameText . }
    Steps:
    Initial Query • ?act rdf:type event:Act
    Merging Query • ?act ro:has_participant ?person
    Merging Query • ?person rdf:type agent:Person
    Merging Query • ?person ero:designated_by ?personName
    Merging Query • ?personName ero:bearer_of ?personNameBearer
    Merging Query • ?personNameBearer info:has_text_value ?personNameText
    Distinct Query • Filter on distinct ?personNameText values
    Save a result iterator and return results to the user
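
The step sequence on slide 31 (an initial query, a series of merging queries, then a distinct filter) amounts to evaluating one triple pattern at a time and joining each result with the bindings found so far. A minimal in-memory Python sketch follows. The function names and sample triples are hypothetical, and the real system runs these steps on Hadoop rather than over a Python list.

```python
def extend(binding, pattern, triple):
    """Return binding extended so that pattern matches triple, or None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def run_steps(patterns, triples, select_var):
    bindings = [{}]              # the initial query starts from one empty binding
    for pattern in patterns:     # each merging step adds one triple pattern
        merged = []
        for binding in bindings:
            for triple in triples:
                candidate = extend(binding, pattern, triple)
                if candidate is not None:
                    merged.append(candidate)
        bindings = merged
    distinct = []                # final step: keep distinct selected values
    for binding in bindings:
        value = binding[select_var]
        if value not in distinct:
            distinct.append(value)
    return distinct


patterns = [
    ("?act", "rdf:type", "event:Act"),
    ("?act", "ro:has_participant", "?person"),
    ("?person", "rdf:type", "agent:Person"),
    ("?person", "ero:designated_by", "?personName"),
    ("?personName", "ero:bearer_of", "?personNameBearer"),
    ("?personNameBearer", "info:has_text_value", "?personNameText"),
]
# Hypothetical data: one person taking part in two acts.
triples = [
    ("act1", "rdf:type", "event:Act"),
    ("act2", "rdf:type", "event:Act"),
    ("act1", "ro:has_participant", "p1"),
    ("act2", "ro:has_participant", "p1"),
    ("p1", "rdf:type", "agent:Person"),
    ("p1", "ero:designated_by", "n1"),
    ("n1", "ero:bearer_of", "nb1"),
    ("nb1", "info:has_text_value", "Jim Johnson"),
]
names = run_steps(patterns, triples, "?personNameText")
```

The variable name must be spelled identically in every pattern and in the selection, because SPARQL variable names are case-sensitive and `extend` likewise treats `?PersonNameText` and `?personNameText` as different variables.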
  • 32. “Raw” Algorithms “Secondary” AlgorithmsAccept Model Query Airs QueryData Association Query Ingestion Cluster ResultsData Association Only Query Inprocess Extract All OrganizationsIngestion Query Structured Extract All PersonsIngestion Only Translation Data Filter By Date Association Find EventsSparql Translation Ingestion Topic Filters (32 variants)Structured • Leadership • Corruption • Dirty bombs • Drugs, etc.
  • 33. Probe Task - Wrapper Algorithms. [Bar chart: Total Wrapper Lines of Code (0 to 1,400) per Probe Task (1 to 20). Pie chart: share of code by type.] • Total Lines: 13,958* – Wrapper Code: 6,778 (49%) – Implementation Code: 3,186* (23%) – Validation Code: 3,994 (29%) * Less code developed before Test & Evaluation
  • 34. Task: Find Life Events of an Individual. [Timeline, Day 0 to Day 5: Tune Life Event Extraction (NLP & SDA); Develop Algorithm (glue code) to Align Events.] New Analytic Capabilities in Days
  • 35. Over 1200 workflows were issued by analysts over a 3-day period
  • 36. Cluster Monitoring (Ganglia) • System Load • CPU Usage • Memory Usage • Network Bandwidth
  • 37. • Fast translation technologies for structured and unstructured data • Many analytics successes - more to come in Phase 3 • All open source software, written entirely in Java • Full Government Purpose Rights • Installation manual and user manual ready to go
  • 38. Justin Del Vecchio, delvecchio@cubrc.org, 716-204-5139