SlideShare a Scribd company logo
1 of 20
A Scalable Architecture for
Extracting, Aligning, Linking, and
Visualizing Multi-Int Data
Craig Knoblock & Pedro Szekely
University of Southern California
Introduction
• Massive quantities of data available for
analysis
– OSING, HUMINT, SIGINT, MASINT, GEOINT, …
• Data is spread across multiple sources,
multiple sites and multiple formats
– Databases, text, web sites, XML, JSON, etc...
• If an analyst could exploit all of this data, it
could transform analysis
– Disruptive technology for analysis
University of Southern California 2
Solution:
Domain-specific Insight Graphs
• Innovative architecture
– Extracting, aligning, linking, and
visualizing massive amounts of data
– Domain-specific content from structured
and unstructured sources
• State-of-the art open source software
– Open architecture with flexible APIs
– Cloud-based infrastructure (HDFS,
Hadoop, ElasticSearch, etc.)
University of Southern California 3
Example Scenario
• Want to determine the nuclear know-how of a
given country from open source data
• Analyze the universities, academics, publications,
reports, articles within the country
University of Southern California 4
Scenario Results
• Exploit the data
available from
– Web pages,
publications,
articles, etc.
• Produce a
knowledge graph
– Key people and
connections
– Technical
capabilities and how
they have changed
over time
University of Southern California 5
DIG Pipeline
• Crawling
• Extracting
• Cleaning
• Integration
• Computing simlarity
• Entity resolution
• Graph construction
• Query, analysis, and visualization
University of Southern California 6
Crawling
• Challenge: how to crawl just the relevant pages
• Approach:
– Uses the Apache Nutch framework for Web pages
– Uses Karma to extract pages from the deep Web
University of Southern California 7
Extracting
• Need to produce a structured
representation for indexing
and linking
• Highly configurable
architecture for extractors
– Learning of landmark
extractors for structured data
– Trainable CRF-based
extractors for unstructured
data
– Uses Mechanical Turk to
crowd source training data
University of Southern California 8
Cleaning
• Cleaning and normalization to support
analysis and linking
– Visualization showing data distribution
– Learned transformations from examples
– Cleaning programs written in Python
University of Southern California 9
Integration
• Need to align the data across extracted data and
structured sources
• Performed using a data integration tool called Karma
• Karma maps arbitrary sources into a shared domain vocabulary
(schema alignment)
• Uses machine learning to minimize user effort
University of Southern California 10
Integration Using Karma
University of Southern California 11
Similarity
• Computes similarity
across text fields and
images
– Image similarity done
using DeepSentiBank
– Text similarity done
using Minhash/LSH
University of Southern California 12
Entity Resolution
• Finds matching entities
• Reference source
– Match against source to
disambiguate entities
– E.g., geonames for
locations
• No reference source
– Combine entities by
considering the similarity
across multiple fields
University of Southern California 13
Graph Construction
• Data is integrated into a
graph that can be
queries and analyzed
– Data stored in HDFS
– Data represented in a
common language JSON-
LD
– Represented using a
common terminology
University of Southern California 14
Query, Analysis and Visualization
• Challenge: support efficient querying against the
graph
• Employ ElasticSearch to provide keyword querying,
faceted browsing, and aggregation queries
University of Southern California 15
Query, Analysis & Visualization
• Visualization interface that provides faceted
queries, timeslines, maps, etc.
University of Southern California 16
Discussion
• Technology that can provide dramatic new
insights from data that is already available
• Applies to a wide range of problems
– Determining the nuclear know-how of a given country
• Technologies, key scientists, relevant organizations
– Combating human trafficking
– Understanding trends in technical areas
• E.g., Material Science
– Analyzing the competitive landscape of companies
– and many other domains with massive quantities of
data
University of Southern California 17
USC DIG Team
University of Southern California 18
Acknowledgements
• Collaborators
– Next Century Technologies
– InferLink Inc.
– JPL
– Columbia University
• Sponsor
– DARPA
• AFRL contract number FA8750-14-C-0240
University of Southern California 19
Thanks!
• More information:
– Homepage
• isi.edu/~knoblock
– DIG
• usc-isi-i2.github.io/dig
– Karma
• usc-isi-i2.github.io/karma
University of Southern California 20

More Related Content

What's hot

Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsFaisal Razzak
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Andrea Scharnhorst
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)infoblog
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Demonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsDemonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsGESIS
 
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'ScienceWorks
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsarx-deidentifier
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
Techniques of information retrieval
Techniques of information retrieval Techniques of information retrieval
Techniques of information retrieval Tariq Hassan
 
Managing international comparative data
Managing international comparative dataManaging international comparative data
Managing international comparative dataEOSC-hub project
 
Query reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webQuery reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webLeandro Tabares-Martin
 
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 ÚISK FF UK
 
Tracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a NutshellTracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a Nutshellenoch1982
 

What's hot (20)

Text mining
Text miningText mining
Text mining
 
Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviews
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...
 
Text mining
Text miningText mining
Text mining
 
Text MIning
Text MIningText MIning
Text MIning
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
06 Scott, David
06 Scott, David06 Scott, David
06 Scott, David
 
Project progress
Project progressProject progress
Project progress
 
Demonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsDemonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations Systems
 
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithms
 
Resume
ResumeResume
Resume
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Mike Dietze PEcAn
Mike Dietze PEcAnMike Dietze PEcAn
Mike Dietze PEcAn
 
Techniques of information retrieval
Techniques of information retrieval Techniques of information retrieval
Techniques of information retrieval
 
Managing international comparative data
Managing international comparative dataManaging international comparative data
Managing international comparative data
 
Query reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webQuery reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic web
 
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
 
Tracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a NutshellTracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a Nutshell
 

Similar to A scalable architecture for extracting, aligning, linking, and visualizing multi-int data

Describing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgDescribing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgOCLC
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Dataaba-sah
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...giuseppe_futia
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016Manjula Ambur
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNADaniel S. Katz
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...Yongyao Jiang
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna
 
Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...rmacneil88
 

Similar to A scalable architecture for extracting, aligning, linking, and visualizing multi-int data (20)

Open University Data
Open University DataOpen University Data
Open University Data
 
Summary of 3DPAS
Summary of 3DPASSummary of 3DPAS
Summary of 3DPAS
 
Describing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgDescribing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.org
 
Hmp 201512
Hmp 201512Hmp 201512
Hmp 201512
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Shifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data ProviderShifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data Provider
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNA
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...
 

More from Craig Knoblock

Learning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresLearning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresCraig Knoblock
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...Craig Knoblock
 
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Craig Knoblock
 
Lessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeLessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeCraig Knoblock
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsCraig Knoblock
 
Building and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingBuilding and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingCraig Knoblock
 
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeFrom Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeCraig Knoblock
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisCraig Knoblock
 
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...Craig Knoblock
 
Discovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataDiscovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataCraig Knoblock
 

More from Craig Knoblock (10)

Learning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresLearning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and Failures
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
 
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
 
Lessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeLessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art Collaborative
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
 
Building and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingBuilding and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human Trafficking
 
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeFrom Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and Analysis
 
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
 
Discovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataDiscovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked Data
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 

A scalable architecture for extracting, aligning, linking, and visualizing multi-int data

  • 1. A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California
  • 2. Introduction • Massive quantities of data available for analysis – OSING, HUMINT, SIGINT, MASINT, GEOINT, … • Data is spread across multiple sources, multiple sites and multiple formats – Databases, text, web sites, XML, JSON, etc... • If an analyst could exploit all of this data, it could transform analysis – Disruptive technology for analysis University of Southern California 2
  • 3. Solution: Domain-specific Insight Graphs • Innovative architecture – Extracting, aligning, linking, and visualizing massive amounts of data – Domain-specific content from structured and unstructured sources • State-of-the art open source software – Open architecture with flexible APIs – Cloud-based infrastructure (HDFS, Hadoop, ElasticSearch, etc.) University of Southern California 3
  • 4. Example Scenario • Want to determine the nuclear know-how of a given country from open source data • Analyze the universities, academics, publications, reports, articles within the country University of Southern California 4
  • 5. Scenario Results • Exploit the data available from – Web pages, publications, articles, etc. • Produce a knowledge graph – Key people and connections – Technical capabilities and how they have changed over time University of Southern California 5
  • 6. DIG Pipeline • Crawling • Extracting • Cleaning • Integration • Computing simlarity • Entity resolution • Graph construction • Query, analysis, and visualization University of Southern California 6
  • 7. Crawling • Challenge: how to crawl just the relevant pages • Approach: – Uses the Apache Nutch framework for Web pages – Uses Karma to extract pages from the deep Web University of Southern California 7
  • 8. Extracting • Need to produce a structured representation for indexing and linking • Highly configurable architecture for extractors – Learning of landmark extractors for structured data – Trainable CRF-based extractors for unstructured data – Uses Mechanical Turk to crowd source training data University of Southern California 8
  • 9. Cleaning • Cleaning and normalization to support analysis and linking – Visualization showing data distribution – Learned transformations from examples – Cleaning programs written in Python University of Southern California 9
  • 10. Integration • Need to align the data across extracted data and structured sources • Performed using a data integration tool called Karma • Karma maps arbitrary sources into a shared domain vocabulary (schema alignment) • Uses machine learning to minimize user effort University of Southern California 10
  • 11. Integration Using Karma University of Southern California 11
  • 12. Similarity • Computes similarity across text fields and images – Image similarity done using DeepSentiBank – Text similarity done using Minhash/LSH University of Southern California 12
  • 13. Entity Resolution • Finds matching entities • Reference source – Match against source to disambiguate entities – E.g., geonames for locations • No reference source – Combine entities by considering the similarity across multiple fields University of Southern California 13
  • 14. Graph Construction • Data is integrated into a graph that can be queries and analyzed – Data stored in HDFS – Data represented in a common language JSON- LD – Represented using a common terminology University of Southern California 14
  • 15. Query, Analysis and Visualization • Challenge: support efficient querying against the graph • Employ ElasticSearch to provide keyword querying, faceted browsing, and aggregation queries University of Southern California 15
  • 16. Query, Analysis & Visualization • Visualization interface that provides faceted queries, timeslines, maps, etc. University of Southern California 16
  • 17. Discussion • Technology that can provide dramatic new insights from data that is already available • Applies to a wide range of problems – Determining the nuclear know-how of a given country • Technologies, key scientists, relevant organizations – Combating human trafficking – Understanding trends in technical areas • E.g., Material Science – Analyzing the competitive landscape of companies – and many other domains with massive quantities of data University of Southern California 17
  • 18. USC DIG Team University of Southern California 18
  • 19. Acknowledgements • Collaborators – Next Century Technologies – InferLink Inc. – JPL – Columbia University • Sponsor – DARPA • AFRL contract number FA8750-14-C-0240 University of Southern California 19
  • 20. Thanks! • More information: – Homepage • isi.edu/~knoblock – DIG • usc-isi-i2.github.io/dig – Karma • usc-isi-i2.github.io/karma University of Southern California 20