Building knowledge graphs
in DIG
Pedro Szekely and Craig Knoblock
University of Southern California
Information Sciences Institute
dig.isi.edu
Goal
USC Information Sciences Institute CC-By 2.0 2
raw w messy w disconnected clean w organized w linked
hard to query, analyze & visualize easy to query, analyze & visualize
Use Case: Human Trafficking
USC Information Sciences Institute CC-By 2.0 3
raw w messy w disconnected clean w organized w linked
hard to query, analyze & visualize easy to query, analyze & visualize
Use Case: Human Trafficking
USC Information Sciences Institute CC-By 2.0 4
100 million pages
~ 100 Web sites
help victims
prosecute traffickers
Salient Statistics on
Human Trafficking
• Profits per Year: $32 Billion
• Average Age of Entry To Prostitution in the US: 14
• PIMP’s Profit Per Victim Per Year: $150,000
• Advertising Budget On the Web:$45 Million
CC-By 2.0 5USC Information Sciences Institute
Task: Tracking the Victim’s
Locations
>	100	million	pages	advertising	adult	services
USC Information Sciences Institute CC-By 2.0 6
Example: Investigating a Reported Victim
San	Diego,	where	else?
USC Information Sciences Institute CC-By 2.0 7
DIG Interface: Find the locations where a
potential victim was advertised
CC-By 2.0 8
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 9
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
Data
Acquisition
Data Acquisition
USC Information Sciences Institute CC-By 2.0 10
downloading relevant data
batch w real-time
Web pagesw Web service w database w
CSV w Excel w XML w JSON
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 11
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
Feature Extraction
USC Information Sciences Institute CC-By 2.0 12
from raw sources to structured data
• trainable text extractors
• extraction from structured Web pages
• image features
• PDF extractor
Feature Extraction from Text
USC Information Sciences Institute CC-By 2.0 13
“YOU don't wanna miss out on
ME :) Perfect lil booty Green
eyes Long curly black hair Im a
Irish,Armenian and Filipino
mixed princess :) ❤ Kim ❤
7○7~7two7~7four77 ❤ HH 80
roses ❤ Hour 120 roses ❤ 15
mins 60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min
$80/30min
$120/60min
20 Examples
CC-By 2.0 14USC Information Sciences Institute
1,000’s of Tasks (2 Cents/Sentence)
CC-By 2.0 15
Performance of CRF Extractors
80
10
18
99
91 94
0
20
40
60
80
100
120
Precision Recall F
Regular	Expressions DIG
80
6
12
99
73
84
0
20
40
60
80
100
120
Precision Recall F
Regular	Expressions DIG
Eyes Hair
USC Information Sciences Institute CC-By 2.0 16
Structured Extraction
CC-By 2.0 17
Automated Extraction
input:	
a pile	of	pages
Classify	by
Templates
pages	clustered
by	template	
Infer
Extractor
Infer
Extractor
Infer
Extractor
Infer
Extractor
extractor
USC Information Sciences Institute CC-By 2.0 18
Unsupervised Extraction Tool
CC-By 2.0 19
Extraction Evaluation
Title Desc Seller Date Price Loc Cat
Member
Since
Expires Views ID
Perfect 1.0
(50/50)
.76
(37/49)
.95
(40/42)
.83
(40/48
)
.87
(39/45
)
.51
(23/45)
.68
(34/50)
1.0
(35/35)
.52
(15/29)
.76
(19/25)
.97
(35/36
)
Pretty
Good
1.0
(50/50)
.98
(48/49)
.95
(40/42)
.83
(40/48
)
.98
(44/45
)
.84
(38/45)
.88
(44/50)
1.0
(35/35)
.55
(16/29)
1.0
(25/25)
1.0
(36/36
)
10	websites,	5	pages	each
fields
USC Information Sciences Institute CC-By 2.0 20
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 21
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
Feature Alignment
USC Information Sciences Institute CC-By 2.0 22
from multiple schemas to a common domain schema
- CSV, Excel
- Database tables
- Web services
- Extractors
- Nomenclature
- Spelling
Multiple Schemas
Karma: Mapping Data to Ontologies
Services
Relational
Sources
Karma
{	JSON-LD	}
Hierarchical	
Sources
Schema.org
USC Information Sciences Institute CC-By 2.0 23
karma.isi.edu
Karma Solves Feature Alignment
CC-By 2.0 24USC Information Sciences Institute
Provenance
Domain Schema
took ~30 minutes to align
the output of the Stanford name extractor
Feature Alignment Statistics
• 5 contractors provided data
• ~ 15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour processing in 20 node Hadoop cluster
CC-By 2.0 25USC Information Sciences Institute
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 26
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
Entity Resolution
USC Information Sciences Institute CC-By 2.0 27
merging records that refer to the same entity
missing data
incorrect data
scale (~50 million records)
currently working on techniques to address
Entity Resolutuion on Strong Attributes
AdultService-1
Person-1
Offer-1
availableAt
seller
phone
619-319-7315
Santa Barbara
hairColor
red
price
250/hour
startDate
2014-12-07
eyeColor
blue
name
Jessica
itemProvided
Offer-2
Person-2
availableAt
Washington DC
phone
seller
email
price
250/hour
startDate
2014-05-28
AdultService-2
eyeColor
blue
name
Jessica
itemProvided
USC Information Sciences Institute CC-By 2.0 28
Linking Using Text Similarity
E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O____U____T____C___A___L____L____S
L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me.
O_U_T_C___A___L_L_S
USC Information Sciences Institute CC-By 2.0 29
Linking Using Image Similarity
CC-By 2.0 30USC Information Sciences Institute
100 Million Images Technology: Deep Learning
AdultService-1
Person-1
Offer-1
availableAt
seller
phone
619-319-7315
Santa Barbara
hairColor
red
price
250/hour
startDate
2014-12-07
eyeColor
blue
name
Jessica
itemProvided
Offer-2
Person-2
availableAt
Washington DC
phone
seller
email
price
250/hour
startDate
2014-05-28
AdultService-2
eyeColor
blue
name
Jessica
itemProvided
same victim
same Trafficker
Unsupervised Collective Entity Resolution
USC Information Sciences Institute CC-By 2.0 31
Unsupervised Collective Entity
Resolution
USC Information Sciences Institute CC-By 2.0 32
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 33
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
Graph Construction
USC Information Sciences Institute CC-By 2.0 34
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics
- bulk loading: massive data imports
- real-time updates: live, changing data
Elastic Search Data Model
Adult
Service
Offer Person Phone
Web
Page
USC Information Sciences Institute CC-By 2.0 35
Indexing for High Performance
Knowledge Graph Queries
Avg.	Query	Times	in	Milliseconds
Single	User	Query	Load
1.2	billion	triples
State	of	the	Art	Graph	Database	(RDF)
DIG	indexing	deployed	in	ElasticSearch
USC Information Sciences Institute CC-By 2.0 36
Steps To Build a DIG
USC Information Sciences Institute CC-By 2.0 37
Crawling Extraction
Data Acquisition
Mapping To
Ontology
Entity Linking
& Similarity
Knowledge Graph
Deployment
Query &
Visualization
Elastic
Search
Graph
DB
schema.org geonames
Data
Acquisition
Feature
Extraction
Feature
Alignment
Entity
Resolution
Graph
Construction
User
Interface
DIG Deployment for Human Trafficking
USC Information Sciences Institute CC-By 2.0 40
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)
- District Attorney
- Law Enforcement
- NGOs
Deployed	to	6
Law	Enforcement	
Agencies	and	Successfully	
Used	to	Prosecute	
Traffickers
USC Information Sciences Institute CC-By 2.0 41
DIG Applications
Human Trafficking
large, real users
Material Science Research
70,000 paper abstracts (built in 1 week)
Arms Trafficking
Identify illegal sales
Patent Trolls
Identify patent trolls
Cyber Attacks
Predict cyber attacks from dark web data
CC-By 2.0 42USC Information Sciences Institute
Conclusions
• Complete tool-chain to build domain-specific
knowledge graphs
• Integrates heterogeneous data: web pages,
databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement
USC Information Sciences Institute CC-By 2.0 43
Questions?
dig.isi.edu
Open Source, Apache 2 License
CC-By 2.0 44USC Information Sciences Institute

Building Knowledge Graphs in DIG

  • 1.
    Building knowledge graphs inDIG Pedro Szekely and Craig Knoblock University of Southern California Information Sciences Institute dig.isi.edu
  • 2.
    Goal USC Information SciencesInstitute CC-By 2.0 2 raw w messy w disconnected clean w organized w linked hard to query, analyze & visualize easy to query, analyze & visualize
  • 3.
    Use Case: HumanTrafficking USC Information Sciences Institute CC-By 2.0 3 raw w messy w disconnected clean w organized w linked hard to query, analyze & visualize easy to query, analyze & visualize
  • 4.
    Use Case: HumanTrafficking USC Information Sciences Institute CC-By 2.0 4 100 million pages ~ 100 Web sites help victims prosecute traffickers
  • 5.
    Salient Statistics on HumanTrafficking • Profits per Year: $32 Billion • Average Age of Entry To Prostitution in the US: 14 • PIMP’s Profit Per Victim Per Year: $150,000 • Advertising Budget On the Web:$45 Million CC-By 2.0 5USC Information Sciences Institute
  • 6.
    Task: Tracking theVictim’s Locations > 100 million pages advertising adult services USC Information Sciences Institute CC-By 2.0 6
  • 7.
    Example: Investigating aReported Victim San Diego, where else? USC Information Sciences Institute CC-By 2.0 7
  • 8.
    DIG Interface: Findthe locations where a potential victim was advertised CC-By 2.0 8
  • 9.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 9 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface Data Acquisition
  • 10.
    Data Acquisition USC InformationSciences Institute CC-By 2.0 10 downloading relevant data batch w real-time Web pagesw Web service w database w CSV w Excel w XML w JSON
  • 11.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 11 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface
  • 12.
    Feature Extraction USC InformationSciences Institute CC-By 2.0 12 from raw sources to structured data • trainable text extractors • extraction from structured Web pages • image features • PDF extractor
  • 13.
    Feature Extraction fromText USC Information Sciences Institute CC-By 2.0 13 “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses” name: Kim eye-color: green hair-color: black phone: 707-727-7477 rate: $60/15min $80/30min $120/60min
  • 14.
    20 Examples CC-By 2.014USC Information Sciences Institute
  • 15.
    1,000’s of Tasks(2 Cents/Sentence) CC-By 2.0 15
  • 16.
    Performance of CRFExtractors 80 10 18 99 91 94 0 20 40 60 80 100 120 Precision Recall F Regular Expressions DIG 80 6 12 99 73 84 0 20 40 60 80 100 120 Precision Recall F Regular Expressions DIG Eyes Hair USC Information Sciences Institute CC-By 2.0 16
  • 17.
  • 18.
  • 19.
  • 20.
    Extraction Evaluation Title DescSeller Date Price Loc Cat Member Since Expires Views ID Perfect 1.0 (50/50) .76 (37/49) .95 (40/42) .83 (40/48 ) .87 (39/45 ) .51 (23/45) .68 (34/50) 1.0 (35/35) .52 (15/29) .76 (19/25) .97 (35/36 ) Pretty Good 1.0 (50/50) .98 (48/49) .95 (40/42) .83 (40/48 ) .98 (44/45 ) .84 (38/45) .88 (44/50) 1.0 (35/35) .55 (16/29) 1.0 (25/25) 1.0 (36/36 ) 10 websites, 5 pages each fields USC Information Sciences Institute CC-By 2.0 20
  • 21.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 21 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface
  • 22.
    Feature Alignment USC InformationSciences Institute CC-By 2.0 22 from multiple schemas to a common domain schema - CSV, Excel - Database tables - Web services - Extractors - Nomenclature - Spelling Multiple Schemas
  • 23.
    Karma: Mapping Datato Ontologies Services Relational Sources Karma { JSON-LD } Hierarchical Sources Schema.org USC Information Sciences Institute CC-By 2.0 23 karma.isi.edu
  • 24.
    Karma Solves FeatureAlignment CC-By 2.0 24USC Information Sciences Institute Provenance Domain Schema took ~30 minutes to align the output of the Stanford name extractor
  • 25.
    Feature Alignment Statistics •5 contractors provided data • ~ 15 datasets • > 30 Karma models • > 200 million records • 1 hour processing in 20 node Hadoop cluster CC-By 2.0 25USC Information Sciences Institute
  • 26.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 26 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface
  • 27.
    Entity Resolution USC InformationSciences Institute CC-By 2.0 27 merging records that refer to the same entity missing data incorrect data scale (~50 million records) currently working on techniques to address
  • 28.
    Entity Resolutuion onStrong Attributes AdultService-1 Person-1 Offer-1 availableAt seller phone 619-319-7315 Santa Barbara hairColor red price 250/hour startDate 2014-12-07 eyeColor blue name Jessica itemProvided Offer-2 Person-2 availableAt Washington DC phone seller email price 250/hour startDate 2014-05-28 AdultService-2 eyeColor blue name Jessica itemProvided USC Information Sciences Institute CC-By 2.0 28
  • 29.
    Linking Using TextSimilarity E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S USC Information Sciences Institute CC-By 2.0 29
  • 30.
    Linking Using ImageSimilarity CC-By 2.0 30USC Information Sciences Institute 100 Million Images Technology: Deep Learning
  • 31.
  • 32.
    Unsupervised Collective Entity Resolution USCInformation Sciences Institute CC-By 2.0 32
  • 33.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 33 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface
  • 34.
    Graph Construction USC InformationSciences Institute CC-By 2.0 34 assembling the data for efficient query & analysis - ElasticSearch: scalable, efficient query - graph databases: network analytics - NoSQL: scalable analytics - bulk loading: massive data imports - real-time updates: live, changing data
  • 35.
    Elastic Search DataModel Adult Service Offer Person Phone Web Page USC Information Sciences Institute CC-By 2.0 35
  • 36.
    Indexing for HighPerformance Knowledge Graph Queries Avg. Query Times in Milliseconds Single User Query Load 1.2 billion triples State of the Art Graph Database (RDF) DIG indexing deployed in ElasticSearch USC Information Sciences Institute CC-By 2.0 36
  • 37.
    Steps To Builda DIG USC Information Sciences Institute CC-By 2.0 37 Crawling Extraction Data Acquisition Mapping To Ontology Entity Linking & Similarity Knowledge Graph Deployment Query & Visualization Elastic Search Graph DB schema.org geonames Data Acquisition Feature Extraction Feature Alignment Entity Resolution Graph Construction User Interface
  • 40.
    DIG Deployment forHuman Trafficking USC Information Sciences Institute CC-By 2.0 40 - 100 million Web pages - Live updates (~5,000 pages/hour) - ElasticSearch database (7 nodes) - Hadoop workflows (20 nodes) - District Attorney - Law Enforcement - NGOs
  • 41.
  • 42.
    DIG Applications Human Trafficking large,real users Material Science Research 70,000 paper abstracts (built in 1 week) Arms Trafficking Identify illegal sales Patent Trolls Identify patent trolls Cyber Attacks Predict cyber attacks from dark web data CC-By 2.0 42USC Information Sciences Institute
  • 43.
    Conclusions • Complete tool-chainto build domain-specific knowledge graphs • Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc. • Scales to ~100 million pages, ~3 billion facts • Deployed to law enforcement USC Information Sciences Institute CC-By 2.0 43
  • 44.
    Questions? dig.isi.edu Open Source, Apache2 License CC-By 2.0 44USC Information Sciences Institute

Editor's Notes

  • #24 Karma offers suggestions on how to do the mapping
  • #29 Simplest kind of linking we do – linking based on strong, explicit attributes (phones, emails, websites, etc.) So person-1 and person-2 might be the same person … but can we find more attributes to improve our confidence …
  • #30 Estimating text similarity is challenging – here we are emphasizing stylometric similarity; map->n-grams->jacquard similarity
  • #31 Clever scheme for storing pair-wise similarities in a database that can be updated incrementally (so we can bypass hashing that leverages elastic search w/lucene)
  • #32 Why is linking significant in this domain? Slide shows why.
  • #36 There is some clever tricks We produce json documents rooted on the classes we care about .. Contain enough of the graph-neighborhood so that keyword queries can work so that I can search for an adultservice using a phone number even though the phone number is really part of the seller. Or search all offers that have the same phone number. Basically copying over some content.