SlideShare a Scribd company logo
1 of 20
A Scalable Architecture for
Extracting, Aligning, Linking, and
Visualizing Multi-Int Data
Craig Knoblock & Pedro Szekely
University of Southern California
Introduction
• Massive quantities of data available for
analysis
– OSING, HUMINT, SIGINT, MASINT, GEOINT, …
• Data is spread across multiple sources,
multiple sites and multiple formats
– Databases, text, web sites, XML, JSON, etc...
• If an analyst could exploit all of this data, it
could transform analysis
– Disruptive technology for analysis
University of Southern California 2
Solution:
Domain-specific Insight Graphs
• Innovative architecture
– Extracting, aligning, linking, and
visualizing massive amounts of data
– Domain-specific content from structured
and unstructured sources
• State-of-the art open source software
– Open architecture with flexible APIs
– Cloud-based infrastructure (HDFS,
Hadoop, ElasticSearch, etc.)
University of Southern California 3
Example Scenario
• Want to determine the nuclear know-how of a
given country from open source data
• Analyze the universities, academics, publications,
reports, articles within the country
University of Southern California 4
Scenario Results
• Exploit the data
available from
– Web pages,
publications,
articles, etc.
• Produce a
knowledge graph
– Key people and
connections
– Technical
capabilities and how
they have changed
over time
University of Southern California 5
DIG Pipeline
• Crawling
• Extracting
• Cleaning
• Integration
• Computing simlarity
• Entity resolution
• Graph construction
• Query, analysis, and visualization
University of Southern California 6
Crawling
• Challenge: how to crawl just the relevant pages
• Approach:
– Uses the Apache Nutch framework for Web pages
– Uses Karma to extract pages from the deep Web
University of Southern California 7
Extracting
• Need to produce a structured
representation for indexing
and linking
• Highly configurable
architecture for extractors
– Learning of landmark
extractors for structured data
– Trainable CRF-based
extractors for unstructured
data
– Uses Mechanical Turk to
crowd source training data
University of Southern California 8
Cleaning
• Cleaning and normalization to support
analysis and linking
– Visualization showing data distribution
– Learned transformations from examples
– Cleaning programs written in Python
University of Southern California 9
Integration
• Need to align the data across extracted data and
structured sources
• Performed using a data integration tool called Karma
• Karma maps arbitrary sources into a shared domain vocabulary
(schema alignment)
• Uses machine learning to minimize user effort
University of Southern California 10
Integration Using Karma
University of Southern California 11
Similarity
• Computes similarity
across text fields and
images
– Image similarity done
using DeepSentiBank
– Text similarity done
using Minhash/LSH
University of Southern California 12
Entity Resolution
• Finds matching entities
• Reference source
– Match against source to
disambiguate entities
– E.g., geonames for
locations
• No reference source
– Combine entities by
considering the similarity
across multiple fields
University of Southern California 13
Graph Construction
• Data is integrated into a
graph that can be
queries and analyzed
– Data stored in HDFS
– Data represented in a
common language JSON-
LD
– Represented using a
common terminology
University of Southern California 14
Query, Analysis and Visualization
• Challenge: support efficient querying against the
graph
• Employ ElasticSearch to provide keyword querying,
faceted browsing, and aggregation queries
University of Southern California 15
Query, Analysis & Visualization
• Visualization interface that provides faceted
queries, timeslines, maps, etc.
University of Southern California 16
Discussion
• Technology that can provide dramatic new
insights from data that is already available
• Applies to a wide range of problems
– Determining the nuclear know-how of a given country
• Technologies, key scientists, relevant organizations
– Combating human trafficking
– Understanding trends in technical areas
• E.g., Material Science
– Analyzing the competitive landscape of companies
– and many other domains with massive quantities of
data
University of Southern California 17
USC DIG Team
University of Southern California 18
Acknowledgements
• Collaborators
– Next Century Technologies
– InferLink Inc.
– JPL
– Columbia University
• Sponsor
– DARPA
• AFRL contract number FA8750-14-C-0240
University of Southern California 19
Thanks!
• More information:
– Homepage
• isi.edu/~knoblock
– DIG
• usc-isi-i2.github.io/dig
– Karma
• usc-isi-i2.github.io/karma
University of Southern California 20

More Related Content

What's hot

Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsFaisal Razzak
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Andrea Scharnhorst
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)infoblog
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Demonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsDemonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsGESIS
 
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'ScienceWorks
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsarx-deidentifier
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text MiningMichel Bruley
 
Techniques of information retrieval
Techniques of information retrieval Techniques of information retrieval
Techniques of information retrieval Tariq Hassan
 
Managing international comparative data
Managing international comparative dataManaging international comparative data
Managing international comparative dataEOSC-hub project
 
Query reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webQuery reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webLeandro Tabares-Martin
 
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 ÚISK FF UK
 
Tracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a NutshellTracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a Nutshellenoch1982
 

What's hot (20)

Text mining
Text miningText mining
Text mining
 
Recommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviewsRecommendations for selection process automation in systematic reviews
Recommendations for selection process automation in systematic reviews
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...
 
Text mining
Text miningText mining
Text mining
 
Text MIning
Text MIningText MIning
Text MIning
 
Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)Claremont Report on Database Research: Research Directions (Le Gruenwald)
Claremont Report on Database Research: Research Directions (Le Gruenwald)
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
06 Scott, David
06 Scott, David06 Scott, David
06 Scott, David
 
Project progress
Project progressProject progress
Project progress
 
Demonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations SystemsDemonstrating a Framework for KOS-based Recommendations Systems
Demonstrating a Framework for KOS-based Recommendations Systems
 
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
Erwin Folmer - Congres 'Data gedreven Beleidsontwikkeling'
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithms
 
Resume
ResumeResume
Resume
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Mike Dietze PEcAn
Mike Dietze PEcAnMike Dietze PEcAn
Mike Dietze PEcAn
 
Techniques of information retrieval
Techniques of information retrieval Techniques of information retrieval
Techniques of information retrieval
 
Managing international comparative data
Managing international comparative dataManaging international comparative data
Managing international comparative data
 
Query reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic webQuery reverse engineering in the context of the semantic web
Query reverse engineering in the context of the semantic web
 
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1 Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
Jan Dvořák: CERIF - evropský formát pro informace o výzkumu, část 1
 
Tracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a NutshellTracing Networks: Ontology Software in a Nutshell
Tracing Networks: Ontology Software in a Nutshell
 

Similar to A scalable architecture for extracting, aligning, linking, and visualizing multi-int data

Describing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgDescribing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgOCLC
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Dataaba-sah
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...giuseppe_futia
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016Manjula Ambur
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNADaniel S. Katz
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...Yongyao Jiang
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna
 
Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...rmacneil88
 

Similar to A scalable architecture for extracting, aligning, linking, and visualizing multi-int data (20)

Open University Data
Open University DataOpen University Data
Open University Data
 
Summary of 3DPAS
Summary of 3DPASSummary of 3DPAS
Summary of 3DPAS
 
Describing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.orgDescribing Theses and Dissertations Using Schema.org
Describing Theses and Dissertations Using Schema.org
 
Hmp 201512
Hmp 201512Hmp 201512
Hmp 201512
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
Big Data e tecnologie semantiche - Utilizzare i Linked data come driver d'int...
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016AIAA Conference - Big Data Session_ Final - Jan 2016
AIAA Conference - Big Data Session_ Final - Jan 2016
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Shifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data ProviderShifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data Provider
 
NSF Software @ ApacheConNA
NSF Software @ ApacheConNANSF Software @ ApacheConNA
NSF Software @ ApacheConNA
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...Integrating an electronic lab notebook with a data repository; American Chemi...
Integrating an electronic lab notebook with a data repository; American Chemi...
 

More from Craig Knoblock

Learning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresLearning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresCraig Knoblock
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...Craig Knoblock
 
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Craig Knoblock
 
Lessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeLessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeCraig Knoblock
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsCraig Knoblock
 
Building and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingBuilding and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingCraig Knoblock
 
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeFrom Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeCraig Knoblock
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisCraig Knoblock
 
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...Craig Knoblock
 
Discovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataDiscovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataCraig Knoblock
 

More from Craig Knoblock (10)

Learning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and FailuresLearning to Adapt to Sensor Changes and Failures
Learning to Adapt to Sensor Changes and Failures
 
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs u...
 
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
Automatic Spatio-temporal Indexing to Integrate and Analyze the Data of an Or...
 
Lessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art CollaborativeLessons Learned in Building Linked Data for the American Art Collaborative
Lessons Learned in Building Linked Data for the American Art Collaborative
 
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge GraphsExtracting, Aligning, and Linking Data to Build Knowledge Graphs
Extracting, Aligning, and Linking Data to Build Knowledge Graphs
 
Building and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human TraffickingBuilding and Using a Knowledge Graph to Combat Human Trafficking
Building and Using a Knowledge Graph to Combat Human Trafficking
 
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked KnowledgeFrom Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
From Virtual Museums to Peacebuilding: Creating and Using Linked Knowledge
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and Analysis
 
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...A Semantic Approach to Retrieving, Linking, and  Integrating Heterogeneous Ge...
A Semantic Approach to Retrieving, Linking, and Integrating Heterogeneous Ge...
 
Discovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked DataDiscovering Alignments in Ontologies of Linked Data
Discovering Alignments in Ontologies of Linked Data
 

Recently uploaded

Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...marcuskenyatta275
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessUXDXConf
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfFIDO Alliance
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptx
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptxBT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptx
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptxNeo4j
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 

Recently uploaded (20)

Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptx
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptxBT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptx
BT & Neo4j _ How Knowledge Graphs help BT deliver Digital Transformation.pptx
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 

A scalable architecture for extracting, aligning, linking, and visualizing multi-int data

  • 1. A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California
  • 2. Introduction • Massive quantities of data available for analysis – OSING, HUMINT, SIGINT, MASINT, GEOINT, … • Data is spread across multiple sources, multiple sites and multiple formats – Databases, text, web sites, XML, JSON, etc... • If an analyst could exploit all of this data, it could transform analysis – Disruptive technology for analysis University of Southern California 2
  • 3. Solution: Domain-specific Insight Graphs • Innovative architecture – Extracting, aligning, linking, and visualizing massive amounts of data – Domain-specific content from structured and unstructured sources • State-of-the art open source software – Open architecture with flexible APIs – Cloud-based infrastructure (HDFS, Hadoop, ElasticSearch, etc.) University of Southern California 3
  • 4. Example Scenario • Want to determine the nuclear know-how of a given country from open source data • Analyze the universities, academics, publications, reports, articles within the country University of Southern California 4
  • 5. Scenario Results • Exploit the data available from – Web pages, publications, articles, etc. • Produce a knowledge graph – Key people and connections – Technical capabilities and how they have changed over time University of Southern California 5
  • 6. DIG Pipeline • Crawling • Extracting • Cleaning • Integration • Computing simlarity • Entity resolution • Graph construction • Query, analysis, and visualization University of Southern California 6
  • 7. Crawling • Challenge: how to crawl just the relevant pages • Approach: – Uses the Apache Nutch framework for Web pages – Uses Karma to extract pages from the deep Web University of Southern California 7
  • 8. Extracting • Need to produce a structured representation for indexing and linking • Highly configurable architecture for extractors – Learning of landmark extractors for structured data – Trainable CRF-based extractors for unstructured data – Uses Mechanical Turk to crowd source training data University of Southern California 8
  • 9. Cleaning • Cleaning and normalization to support analysis and linking – Visualization showing data distribution – Learned transformations from examples – Cleaning programs written in Python University of Southern California 9
  • 10. Integration • Need to align the data across extracted data and structured sources • Performed using a data integration tool called Karma • Karma maps arbitrary sources into a shared domain vocabulary (schema alignment) • Uses machine learning to minimize user effort University of Southern California 10
  • 11. Integration Using Karma University of Southern California 11
  • 12. Similarity • Computes similarity across text fields and images – Image similarity done using DeepSentiBank – Text similarity done using Minhash/LSH University of Southern California 12
  • 13. Entity Resolution • Finds matching entities • Reference source – Match against source to disambiguate entities – E.g., geonames for locations • No reference source – Combine entities by considering the similarity across multiple fields University of Southern California 13
  • 14. Graph Construction • Data is integrated into a graph that can be queries and analyzed – Data stored in HDFS – Data represented in a common language JSON- LD – Represented using a common terminology University of Southern California 14
  • 15. Query, Analysis and Visualization • Challenge: support efficient querying against the graph • Employ ElasticSearch to provide keyword querying, faceted browsing, and aggregation queries University of Southern California 15
  • 16. Query, Analysis & Visualization • Visualization interface that provides faceted queries, timeslines, maps, etc. University of Southern California 16
  • 17. Discussion • Technology that can provide dramatic new insights from data that is already available • Applies to a wide range of problems – Determining the nuclear know-how of a given country • Technologies, key scientists, relevant organizations – Combating human trafficking – Understanding trends in technical areas • E.g., Material Science – Analyzing the competitive landscape of companies – and many other domains with massive quantities of data University of Southern California 17
  • 18. USC DIG Team University of Southern California 18
  • 19. Acknowledgements • Collaborators – Next Century Technologies – InferLink Inc. – JPL – Columbia University • Sponsor – DARPA • AFRL contract number FA8750-14-C-0240 University of Southern California 19
  • 20. Thanks! • More information: – Homepage • isi.edu/~knoblock – DIG • usc-isi-i2.github.io/dig – Karma • usc-isi-i2.github.io/karma University of Southern California 20