This document discusses building knowledge graphs using DIG (Distributed Information Graphs) to integrate heterogeneous data sources. It describes the steps involved, including data acquisition, feature extraction, mapping to an ontology, entity resolution, graph construction, and deployment. As a use case, DIG has been used to build a knowledge graph from over 100 million web pages related to human trafficking to help law enforcement identify victims and prosecute traffickers.
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialized and scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
Reflected Intelligence: Real world AI in Digital Transformation - Trey Grainger
The goal of most digital transformations is to create competitive advantage by enhancing customer experience and employee success, so giving these stakeholders the ability to find the right information at their moment of need is paramount. Employees and customers increasingly expect an intuitive, interactive experience where they can simply type or speak their questions or keywords into a search box, their intent will be understood, and the best answers and content are then immediately presented.
Providing this compelling experience, however, requires a deep understanding of your content, your unique business domain, and the collective and personalized needs of each of your users. Modern artificial intelligence (AI) approaches are able to continuously learn from both your content and the ongoing stream of user interactions with your applications, and to automatically reflect back that learned intelligence in order to instantly and scalably deliver contextually-relevant answers to employees and customers.
In this talk, we'll discuss how AI is currently being deployed across the Fortune 1000 to accomplish these goals, both in the digital workplace (helping employees more efficiently get answers and make decisions) and in digital commerce (understanding customer intent and connecting them with the best information and products). We'll separate fact from fiction as we break down the hype around AI and show how it is being practically implemented today to power many real-world digital transformations for the next generation of employees and customers.
"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.
Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - rife with its own jargon and linguistic and conceptual nuances.
This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fit together to deliver a final solution. We'll leverage several recently-released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working Semantic Search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.
Big Data and AI in P2P Industry: Knowledge Graph and Inference - sfbiganalytics
Title: Knowledge graph and inference: use cases in online financial market
Abstract: While the knowledge graph is an active research field in the machine learning community, this powerful tool is still less known to people in industry. In this talk, I will first introduce knowledge graphs and inference techniques, including recent developments that combine them with deep learning. Then I will discuss several use cases in the online financial market: fraud/anomaly detection, lost-contact discovery, intelligent search, name disambiguation, etc. I will also briefly mention how to build a knowledge graph from different data sources using neo4j.
Better Search Through Query Understanding
Presented as a Data Talk at Intuit on April 22, 2014
Search is a fundamental problem of our time — we use search engines daily to satisfy a variety of personal and professional information needs. But search engine development still feels stuck in an information retrieval paradigm that focuses on result ranking. In this talk, I’ll advocate an emphasis on query understanding. I’ll talk about how we implement query understanding at LinkedIn, and I’ll present examples from the broader web. Hopefully you’ll come out with a different perspective on search and share my appreciation for how we can improve search through query understanding.
About the Speaker
Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Semantic search helps business people find answers to pressing questions by wading through oceans of information to find meaningful nuggets. In this presentation we’ll discuss how semantic search and content analysis technologies are starting to appear in the marketplace today. We’ll provide a recap of what semantic search is and what the key benefits are, then we’ll answer the following questions:
• Is semantic search a feature, an application, or an enterprise system?
• How can I add semantic search to my existing work processes?
• Will I need to replace my existing content technologies?
• What will I need to do to prepare my content for semantic search?
• Is semantic search just for documents or can I search my data too?
• Can I use semantic search to find information on the internet and other public data sources?
• Are there standards to consider?
Hadoop and Neo4j: A Winning Combination for Bioinformatics - osintegrators
This presentation includes an intro to bioinformatics with an emphasis on human genome re-sequencing and how Hadoop and Neo4j can be used together to open striking possibilities.
This project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Language Processing model using Maximum Entropy, and a Deep Neural Network model using pre-trained word embeddings. Accuracy results of both models show similar performance, but the requirements and limitations of each are different and can help determine what type of model is best suited for each specific use case.
Overview of structured search technology. Using the structure of a document to create better search results for document search and retrieval.
How both search precision and recall are improved when the structure of a document is used.
How a keyword match in a title of a document can be used to boost the search score.
Case studies with the eXist native XML database.
Steps to set up a pilot project.
The year of the graph: do you really need a graph database? How do you choose... - George Anadiotis
Graph databases have been around for more than 15 years, but it was AWS and Microsoft entering the domain that attracted widespread interest. If they are into this, there must be a reason.
Everyone wants to know more; few can really keep up and provide answers. And as this hitherto niche domain moves into the mainstream, the dynamics are changing dramatically. Besides new entries, existing players keep evolving.
I’ve done the hard work of evaluating solutions, so you don’t have to. An overview of the domain and selection methodology, as presented in Big Data Spain 2018
RDAP 15: You’re in good company: Unifying campus research data services - ASIS&T
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23
Cynthia Hudson-Vitale, Digital Data Outreach Librarian, Washington University
Brianna Marshall, Digital Curation Coordinator, University of Wisconsin-Madison
Amy Nurnberger, Research Data Manager, Columbia University
IoT data: more and faster is not automatically better.
On optimal sampling strategies, how to calculate whether IoT pays off, and why it does not always have to be deep learning and real-time analytics. (Slides in German/English)
Keynote talk at the 3rd International Conference on Supercomputing in Mexico: www.isum.mx. A great group of people!
A substantially revised version of a talk with the same title given on previous occasions.
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the... - Chris Bizer
Slideset of the keynote talk given by Christian Bizer at the Language Data and Knowledge (LDK 2019) conference in Leipzig, Germany
Abstract of the talk:
Millions of websites have started to annotate data describing products, local businesses, events, jobs, places, recipes, and reviews within their HTML pages using the schema.org vocabulary. These annotations are widely used by search engines to render rich snippets within search results. Surprisingly, the annotations are hardly used by the research community. In the talk, Christian Bizer investigates the potential of schema.org annotations for being used as training data for tasks such as entity matching, information extraction, and sentiment analysis. Web pages that offer semantic annotations often also contain additional structured data in the form of HTML tables. In the second part of the talk, Christian Bizer discusses the interplay of semantic annotations and web tables for information extraction as well as the general potential of relational HTML tables for complementing knowledge bases such as DBpedia, focusing on the discovery of formerly unknown long tail entities as well as the extraction of n-ary relations.
Link to the conference website:
http://2019.ldk-conf.org/invited-speakers/
Cloud computing and networking course: paper presentation - Data Mining for In... - Cristian Consonni
This is the presentation for the course "Cloud Computing and Networking" of the ICT Doctoral School of the University of Trento.
The paper presented is
"Data mining for internet of things: A survey."
by Tsai, Chun-Wei, et al.
(Communications Surveys & Tutorials, IEEE 16.1 (2014): 77-97.)
I gave this talk at the 'Digital Twin Conference' hosted by LH Corp at COEX, Seoul on August 8th, 2019.
Abstract: 'Digital Twin' is a digital replication of real-world objects, processes, and phenomena that can be used for various purposes. The digital twin concept goes back to the manufacturing industry in the early 2000s for PLM (Product Lifecycle Management) purposes. It is based on the idea that a digital informational construct about a physical system could be created as an entity on its own. Definitions of digital twin emphasize three important levels or characteristics. First, there should be a connection between the real physical world and a corresponding virtual world. To do this, a Level 1 digital twin provides virtual 3D models. Secondly, this connection between the real world and the virtual world is established by generating (near) real-time data using sensors or IoT. This is called a Level 2 digital twin. Thirdly, a Level 3 digital twin carries out certain analyses, predictions, and simulations using virtual 3D models and (near) real-time data. ‘Smart Spaces’ are interactive environments where humans and technology can openly communicate with each other in a physical or digital setting. Examples of smart spaces include smart cities, smart factories, and smart homes. ‘Smart Spaces’ is one of Gartner’s Top 10 Tech Trends for 2019. As spaces go through digital transformation with the 4th industrial revolution, there are many attempts to apply digital twin technology to manage urban, spatial, and industrial issues around the world. Those attempts look set to play an increasingly important role in the creation of smart cities, smart factories, and smart homes. Bringing the virtual and real worlds together in this way can help bring better analysis, visualization, and simulation to the decision-making process. This will be a multi-way process with iterative feedback among stakeholders.
In this talk, I'll share my real experiences in carrying out digital twin and smart space projects. Also I’ll talk about what I’ve learnt from these projects.
Talk given at FBK, Trento with my views on how we could progress towards Smarter Cities: those cities that not only pursue resource efficiency but mainly focus on addressing citizens’ actual needs in their daily interactions with the city. This presentation addresses: a) how an enabling platform for Smarter Cities must support developers by providing well-known interfaces and data management languages (REST, JSON and SQL) and b) how it must also support end-users, by enabling them to contribute data while the platform continuously analyzes the quality of the data they provide.
Using A Distributed Graph Database To Make Sense Of Disparate Data Stores - InfiniteGraph
Presented at DataWeek SF Oct 13
Most analytics depend on data mining and statistical correlation of information held in single data stores. It is generally inefficient to replicate diverse data, which may be stored in enterprise databases or NoSQL "Big Data" repositories, and consolidate it using a single database technology. Although federated queries can help with statistical correlation of data values across data stores, the technique is not very good at handling the data stored in relationships, because the data stores generally have no knowledge of one another. The speaker describes a different approach that uses graph (relationship) analytics to extract structural data from existing repositories, store representations of the nodes and connections in a graph database, and then analyze them to extract additional value.
Risk Analytics Using Knowledge Graphs / FIBO with Deep LearningCambridge Semantics
This EDM Council webinar, sponsored by Cambridge Semantics Inc. and featuring FI Consulting, explores the challenges common to a risk analytics pipeline, application of graph analytics to mortgage loan data and use cases in adjacent areas including customer service, collections, fraud and AML.
Accelerating Data Lakes and Streams with Real-time AnalyticsArcadia Data
As organizations modernize their data and analytics platforms, the data lake concept has gained momentum as a shared enterprise resource for supporting insights across multiple lines of business. The perception is that data lakes are vast, slow-moving bodies of data, but innovations like Apache Kafka for streaming-first architectures put real-time data flows at the forefront. Combining real-time alerts and fast-moving data with rich historical analysis lets you respond quickly to changing business conditions with powerful data lake analytics to make smarter decisions.
Join this complimentary webinar with industry experts from 451 Research and Arcadia Data who will discuss:
- Business requirements for combining real-time streaming and ad hoc visual analytics.
- Innovations in real-time analytics using tools like Confluent’s KSQL.
- Machine-assisted visualization to guide business analysts to faster insights.
- Elevating user concurrency and analytic performance on data lakes.
- Applications in cybersecurity, regulatory compliance, and predictive maintenance on manufacturing equipment that benefit from streaming visualizations.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to keep growing and supply to keep evolving, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
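As a minimal illustration of the first optimization above (skipping computation on converged vertices), here is a hedged sketch in Python; it is not the STICD implementation, and the graph, damping factor, and tolerances are illustrative:

```python
def pagerank_skip_converged(out_links, d=0.85, eps=1e-10, max_iter=100):
    """PageRank via power iteration, skipping vertices that have converged.

    out_links[u] lists the vertices u links to. Dangling nodes are not
    handled, and freezing a vertex once its own change drops below eps is
    an approximation: its in-neighbors may still be drifting slightly.
    """
    n = len(out_links)
    in_links = [[] for _ in range(n)]
    for u, targets in enumerate(out_links):
        for v in targets:
            in_links[v].append(u)
    rank = [1.0 / n] * n
    converged = [False] * n
    for _ in range(max_iter):
        done = True
        new_rank = rank[:]
        for v in range(n):
            if converged[v]:
                continue  # reuse the old value: saves per-iteration work
            r = (1 - d) / n + d * sum(rank[u] / len(out_links[u]) for u in in_links[v])
            converged[v] = abs(r - rank[v]) < eps
            done = done and converged[v]
            new_rank[v] = r
        rank = new_rank
        if done:
            break
    return rank

# Example: a 4-vertex graph (0->1, 0->2, 1->2, 2->0, 3->2)
print(pagerank_skip_converged([[1, 2], [2], [0], [2]]))
```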
Building Knowledge Graphs in DIG
1. Building Knowledge Graphs in DIG
Pedro Szekely and Craig Knoblock
University of Southern California
Information Sciences Institute
dig.isi.edu
2. Goal
USC Information Sciences Institute CC-By 2.0 2
raw, messy, disconnected → clean, organized, linked
hard to query, analyze & visualize → easy to query, analyze & visualize
3. Use Case: Human Trafficking
USC Information Sciences Institute CC-By 2.0 3
raw, messy, disconnected → clean, organized, linked
hard to query, analyze & visualize → easy to query, analyze & visualize
4. Use Case: Human Trafficking
USC Information Sciences Institute CC-By 2.0 4
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
5. Salient Statistics on Human Trafficking
• Profits per Year: $32 Billion
• Average Age of Entry to Prostitution in the US: 14
• Pimp’s Profit per Victim per Year: $150,000
• Advertising Budget on the Web: $45 Million
USC Information Sciences Institute CC-By 2.0 5
6. Task: Tracking the Victim’s Locations
> 100 million pages advertising adult services
USC Information Sciences Institute CC-By 2.0 6
7. Example: Investigating a Reported Victim
San Diego, where else?
USC Information Sciences Institute CC-By 2.0 7
8. DIG Interface: Find the locations where a potential victim was advertised
CC-By 2.0 8
9. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 9
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
10. Data Acquisition
USC Information Sciences Institute CC-By 2.0 10
downloading relevant data
batch and real-time
Web pages, Web services, databases, CSV, Excel, XML, JSON
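As a hedged illustration of the batch path (this is not DIG's crawler; the seed URL and file names are hypothetical), a minimal downloader might store each raw page with provenance for the extraction step:

```python
import json
import time
import urllib.request

# Hypothetical seed list; the deck describes crawling ~100 sites in batch
# and real-time modes. URL and output file names are illustrative.
SEEDS = ["https://example.com/listings?page=1"]

def acquire(urls):
    """Batch download: fetch each page and keep the raw payload plus provenance."""
    for url in urls:
        with urllib.request.urlopen(url, timeout=30) as resp:
            raw = resp.read().decode("utf-8", errors="replace")
        yield {"url": url, "fetched_at": time.time(), "raw": raw}

with open("raw_pages.jsonl", "w") as out:
    for page in acquire(SEEDS):
        out.write(json.dumps(page) + "\n")
```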
11. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 11
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
12. Feature Extraction
USC Information Sciences Institute CC-By 2.0 12
from raw sources to structured data
• trainable text extractors
• extraction from structured Web pages
• image features
• PDF extractor
13. Feature Extraction from Text
USC Information Sciences Institute CC-By 2.0 13
"YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses"
Extracted fields:
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min, $80/30min, $120/60min
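Obfuscated fields like the phone number above can often be recovered with normalization rules. A minimal rule-based sketch (illustrative only; the deck's extractors are trainable, per slide 12):

```python
import re

# Spelled-out digits and look-alike glyphs used to evade keyword filters.
WORD_DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def extract_phone(text):
    """Recover a 10-digit US phone number from an obfuscated ad text."""
    t = text.lower()
    for word, digit in WORD_DIGITS.items():
        t = t.replace(word, digit)
    t = t.replace("○", "0")  # circle glyph standing in for a zero
    # A run of exactly 10 digits, allowing separator junk between them.
    m = re.search(r"(?:\d[\s~\-._]*){10}", t)
    if m is None:
        return None
    digits = re.sub(r"\D", "", m.group())
    return "%s-%s-%s" % (digits[0:3], digits[3:6], digits[6:10])

ad = ("YOU don't wanna miss out on ME :) Perfect lil booty Green eyes "
      "Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) "
      "Kim 7○7~7two7~7four77 HH 80 roses Hour 120 roses 15 mins 60 roses")
print(extract_phone(ad))  # 707-727-7477
```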
20. Extraction Evaluation
Field-level extraction accuracy (correct/total); 10 websites, 5 pages each:

             Title        Desc         Seller       Date         Price        Loc          Cat          Member Since  Expires      Views        ID
Perfect      1.0 (50/50)  .76 (37/49)  .95 (40/42)  .83 (40/48)  .87 (39/45)  .51 (23/45)  .68 (34/50)  1.0 (35/35)   .52 (15/29)  .76 (19/25)  .97 (35/36)
Pretty Good  1.0 (50/50)  .98 (48/49)  .95 (40/42)  .83 (40/48)  .98 (44/45)  .84 (38/45)  .88 (44/50)  1.0 (35/35)   .55 (16/29)  1.0 (25/25)  1.0 (36/36)

USC Information Sciences Institute CC-By 2.0 20
21. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 21
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
22. Feature Alignment
USC Information Sciences Institute CC-By 2.0 22
from multiple schemas to a common domain schema
Multiple schemas: CSV and Excel files, database tables, Web services, extractor outputs (with differing nomenclature and spelling)
23. Karma: Mapping Data to Ontologies
[Diagram: relational sources, hierarchical sources, and services flow into Karma, which maps them to schema.org and emits JSON-LD]
USC Information Sciences Institute CC-By 2.0 23
karma.isi.edu
24. Karma Solves Feature Alignment
USC Information Sciences Institute CC-By 2.0 24
[Screenshot: Karma modeling interface showing provenance and the domain schema]
Took ~30 minutes to align the output of the Stanford name extractor
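Karma emits JSON-LD aligned to schema.org. A hand-written sketch of what a mapped record could look like, following the structure shown on slide 28 (availableAt, itemProvided, and AdultService are the deck's domain extensions of schema.org; values are illustrative):

```python
import json

# Shape a flat extractor record the way a Karma model maps it onto the ontology.
jsonld = {
    "@context": "http://schema.org",
    "@type": "Offer",
    "price": "250/hour",
    "availableAt": {"@type": "Place", "name": "Santa Barbara"},
    "seller": {"@type": "Person", "telephone": "619-319-7315"},
    "itemProvided": {
        "@type": "AdultService",  # domain class from the deck, not core schema.org
        "name": "Jessica",
        "eyeColor": "blue",
        "hairColor": "red",
    },
}
print(json.dumps(jsonld, indent=2))
```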
25. Feature Alignment Statistics
• 5 contractors provided data
• ~ 15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour processing in 20 node Hadoop cluster
USC Information Sciences Institute CC-By 2.0 25
26. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 26
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
27. Entity Resolution
USC Information Sciences Institute CC-By 2.0 27
merging records that refer to the same entity
Challenges we are currently working on techniques to address: missing data, incorrect data, scale (~50 million records)
28. Entity Resolution on Strong Attributes
[Diagram: Offer-1 (price 250/hour, startDate 2014-12-07, availableAt Santa Barbara) has itemProvided AdultService-1 (name Jessica, eyeColor blue, hairColor red) and seller Person-1, whose phone is 619-319-7315. Offer-2 (price 250/hour, startDate 2014-05-28, availableAt Washington DC) has itemProvided AdultService-2 (name Jessica, eyeColor blue) and seller Person-2, who has the same phone plus an email. The shared phone number links the two person records.]
USC Information Sciences Institute CC-By 2.0 28
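The simplest resolution pass groups records on strong identifiers. A minimal sketch of that idea (toy records mirroring the diagram; not DIG's actual pipeline):

```python
from collections import defaultdict

# Toy records mirroring the slide: two sellers that share a phone number.
people = [
    {"id": "Person-1", "phone": "619-319-7315", "name": "Jessica"},
    {"id": "Person-2", "phone": "619-319-7315", "email": "j@example.com"},
]

# Group records on strong identifiers; each multi-member group is a merge candidate.
clusters = defaultdict(list)
for p in people:
    for key in ("phone", "email"):
        if p.get(key):
            clusters[(key, p[key])].append(p["id"])

for key, members in clusters.items():
    if len(members) > 1:
        print("merge %s via %s" % (members, key))
# merge ['Person-1', 'Person-2'] via ('phone', '619-319-7315')
```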
29. Linking Using Text Similarity
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
USC Information Sciences Institute CC-By 2.0 29
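Per the speaker notes, this stylometric comparison maps each ad to character n-grams and scores overlap with Jaccard similarity. A minimal sketch of that pipeline (normalization and n-gram size are illustrative):

```python
def char_ngrams(text, n=3):
    """Character n-grams over a lightly normalized string (stylometric features)."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

ad1 = "EMILY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me."
ad2 = "LAYLA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me."
sim = jaccard(char_ngrams(ad1), char_ngrams(ad2))
print(round(sim, 2))  # high overlap suggests the same ad template/author
```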
30. Linking Using Image Similarity
USC Information Sciences Institute CC-By 2.0 30
100 million images; technology: deep learning
33. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 33
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
34. Graph Construction
USC Information Sciences Institute CC-By 2.0 34
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
35. Elastic Search Data Model
[Diagram: documents rooted on the classes Adult Service, Offer, Person, Phone, and Web Page]
USC Information Sciences Institute CC-By 2.0 35
36. Indexing for High-Performance Knowledge Graph Queries
[Chart: average query times in milliseconds under a single-user query load over 1.2 billion triples, comparing a state-of-the-art graph database (RDF) with DIG's indexing deployed in ElasticSearch]
USC Information Sciences Institute CC-By 2.0 36
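Part of this speedup comes from denormalization: neighborhood content such as the seller's phone is copied into each document, so a lookup becomes a flat match query rather than a graph traversal. A sketch of such a query body (index and field names are assumptions):

```python
import json

# Because the seller's phone number is copied into each AdultService document,
# finding services by phone needs no join across Person and Offer records.
query = {"query": {"match": {"seller.phone": "619-319-7315"}}}
print(json.dumps(query, indent=2))
# e.g. POST /adultservices/_search with this body
```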
37. Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 37
[Pipeline diagram: Data Acquisition (crawling, extraction) → Feature Extraction → Feature Alignment (mapping to ontology: schema.org, geonames) → Entity Resolution (entity linking & similarity) → Graph Construction (knowledge graph deployment: ElasticSearch, graph DB) → User Interface (query & visualization)]
38.
39.
40. DIG Deployment for Human Trafficking
USC Information Sciences Institute CC-By 2.0 40
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
42. DIG Applications
• Human Trafficking: large, real users
• Material Science Research: 70,000 paper abstracts (built in 1 week)
• Arms Trafficking: identify illegal sales
• Patent Trolls: identify patent trolls
• Cyber Attacks: predict cyber attacks from dark web data
USC Information Sciences Institute CC-By 2.0 42
43. Conclusions
• Complete tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement
USC Information Sciences Institute CC-By 2.0 43
Simplest kind of linking we do – linking based on strong, explicit attributes (phones, emails, websites, etc.)
So person-1 and person-2 might be the same person … but can we find more attributes to improve our confidence …
Estimating text similarity is challenging – here we are emphasizing stylometric similarity; map -> n-grams -> Jaccard similarity
Clever scheme for storing pair-wise similarities in a database that can be updated incrementally (so we can bypass hashing that leverages elastic search w/lucene)
Why is linking significant in this domain? Slide shows why.
There are some clever tricks here.
We produce JSON documents rooted on the classes we care about. They contain enough of the graph neighborhood that keyword queries can work: I can search for an AdultService using a phone number even though the phone number really belongs to the seller, or search for all offers that share a phone number. Basically, we copy over some content.
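Following that note, here is a hedged sketch of one such denormalized document and the Elasticsearch bulk lines that would index it (index name, identifiers, and values are illustrative):

```python
import json

# One document rooted on AdultService, with enough graph neighborhood copied in
# (seller's phone, offers' locations) for keyword queries to hit directly.
doc = {
    "uri": "AdultService-1",
    "name": "Jessica",
    "eyeColor": "blue",
    "seller": {"uri": "Person-1", "phone": "619-319-7315"},
    "offers": [
        {"uri": "Offer-1", "availableAt": "Santa Barbara", "startDate": "2014-12-07"},
        {"uri": "Offer-2", "availableAt": "Washington DC", "startDate": "2014-05-28"},
    ],
}

# Elasticsearch bulk format: an action line followed by the document source.
with open("bulk.jsonl", "w") as f:
    f.write(json.dumps({"index": {"_index": "adultservices", "_id": doc["uri"]}}) + "\n")
    f.write(json.dumps(doc) + "\n")
```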