Building Knowledge Graphs in DIG

DIG Slides, Pedro Szekely and Craig Knoblock

Published in: Data & Analytics

  1. Building knowledge graphs in DIG. Pedro Szekely and Craig Knoblock, University of Southern California, Information Sciences Institute. dig.isi.edu
  2. Goal: raw · messy · disconnected (hard to query, analyze & visualize) → clean · organized · linked (easy to query, analyze & visualize). USC Information Sciences Institute, CC-By 2.0
  3. Use Case: Human Trafficking. raw · messy · disconnected (hard to query, analyze & visualize) → clean · organized · linked (easy to query, analyze & visualize)
  4. Use Case: Human Trafficking. 100 million pages · ~100 Web sites · help victims · prosecute traffickers
  5. Salient Statistics on Human Trafficking
      • Profits per Year: $32 Billion
      • Average Age of Entry To Prostitution in the US: 14
      • Pimp’s Profit Per Victim Per Year: $150,000
      • Advertising Budget On the Web: $45 Million
  6. Task: Tracking the Victim’s Locations. > 100 million pages advertising adult services
  7. Example: Investigating a Reported Victim. San Diego, where else?
  8. DIG Interface: Find the locations where a potential victim was advertised
  9. Steps To Build a DIG. Pipeline: Data Acquisition → Feature Extraction → Feature Alignment → Entity Resolution → Graph Construction → User Interface. Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization; Elastic Search, Graph DB, schema.org, geonames. Current step: Data Acquisition
  10. Data Acquisition: downloading relevant data. batch · real-time. Web pages · Web service · database · CSV · Excel · XML · JSON
  11. Steps To Build a DIG (pipeline diagram repeated from slide 9)
  12. Feature Extraction: from raw sources to structured data
      • trainable text extractors
      • extraction from structured Web pages
      • image features
      • PDF extractor
  13. Feature Extraction from Text
      Input: “YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish,Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
      Output:
      name: Kim
      eye-color: green
      hair-color: black
      phone: 707-727-7477
      rate: $60/15min $80/30min $120/60min
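The phone number in this example is obfuscated with a look-alike glyph and spelled-out digits. As a rough illustration of the normalization such an extractor has to perform, here is a minimal rule-based sketch in Python. It is not DIG's extractor (the slides describe trainable extractors); the function name `normalize_phone` and both lookup tables are hypothetical.

```python
import re

# Hypothetical lookup tables (not from the slides): spelled-out digits and
# look-alike glyphs that stand in for digits.
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
GLYPH_DIGITS = {"○": "0"}


def normalize_phone(raw):
    """Return a 10-digit US number as NNN-NNN-NNNN, or None if it does not fit."""
    text = raw.lower()
    # Replace spelled-out digits first, then look-alike glyphs.
    for word, digit in WORD_DIGITS.items():
        text = text.replace(word, digit)
    for glyph, digit in GLYPH_DIGITS.items():
        text = text.replace(glyph, digit)
    # Drop every remaining non-digit separator.
    digits = re.sub(r"\D", "", text)
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


print(normalize_phone("7○7~7two7~7four77"))  # 707-727-7477
```

Hand-written rules like these only cover the obfuscations someone anticipated, which is one reason a trained extractor can reach much higher recall.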
  14. 20 Examples
  15. 1,000’s of Tasks (2 Cents/Sentence)
  16. Performance of CRF Extractors (two bar charts, axis 0 to 120)
      Attribute  Method               Precision  Recall  F
      Eyes       Regular Expressions  80         10      18
      Eyes       DIG                  99         91      94
      Hair       Regular Expressions  80         6       12
      Hair       DIG                  99         73      84
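For readers unfamiliar with the F column, the sketch below recomputes it from the precision and recall figures on this slide, assuming "F" means the usual F1 score (harmonic mean of precision and recall). The assignment of the first chart to "Eyes" and the second to "Hair" is also an assumption based on label order. Because the slide shows rounded values, the recomputed numbers can differ from the slide by about a point.

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall (same scale in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as read from the slide; the Eyes/Hair labels are assumed.
rows = [
    ("Eyes, Regular Expressions", 80, 10),
    ("Eyes, DIG", 99, 91),
    ("Hair, Regular Expressions", 80, 6),
    ("Hair, DIG", 99, 73),
]
for label, p, r in rows:
    print(f"{label}: F = {f_measure(p, r):.1f}")
```

The harmonic mean is dominated by the smaller of the two inputs, which is why the regular-expression baselines score so low despite a precision of 80.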
  17. Structured Extraction
  18. Automated Extraction. input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (shown four times) → extractor
  19. Unsupervised Extraction Tool
  20. Extraction Evaluation (10 websites, 5 pages each)
      Field         Perfect      Pretty Good
      Title         1.0 (50/50)  1.0 (50/50)
      Desc          .76 (37/49)  .98 (48/49)
      Seller        .95 (40/42)  .95 (40/42)
      Date          .83 (40/48)  .83 (40/48)
      Price         .87 (39/45)  .98 (44/45)
      Loc           .51 (23/45)  .84 (38/45)
      Cat           .68 (34/50)  .88 (44/50)
      Member Since  1.0 (35/35)  1.0 (35/35)
      Expires       .52 (15/29)  .55 (16/29)
      Views         .76 (19/25)  1.0 (25/25)
      ID            .97 (35/36)  1.0 (36/36)
  21. Steps To Build a DIG (pipeline diagram repeated from slide 9)
  22. Feature Alignment: from multiple schemas to a common domain schema. Multiple Schemas:
      - CSV, Excel
      - Database tables
      - Web services
      - Extractors
      - Nomenclature
      - Spelling
  23. Karma: Mapping Data to Ontologies. Diagram labels: Services, Relational Sources, Hierarchical Sources, Karma, { JSON-LD }, Schema.org. karma.isi.edu
  24. Karma Solves Feature Alignment. Provenance · Domain Schema. Took ~30 minutes to align the output of the Stanford name extractor
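To make "from multiple schemas to a common domain schema" concrete, here is a minimal Python sketch of the idea: each source gets a mapping from its own field names to the shared property names, and aligned records carry a note of where they came from. This is not Karma's API or its model format. The mapping table, the `align` function, and the `source` field are all hypothetical, and the output is plain JSON rather than full JSON-LD.

```python
import json

# Hypothetical mapping for one source: extractor field name -> domain property.
# The target names echo properties seen on the graph slides; the table is invented.
TEXT_EXTRACTOR_MAPPING = {
    "name": "name",
    "eye-color": "eyeColor",
    "hair-color": "hairColor",
    "phone": "phone",
}


def align(record, mapping, source):
    """Rename mapped fields to the common schema and attach simple provenance."""
    aligned = {mapping[key]: value for key, value in record.items() if key in mapping}
    aligned["source"] = source  # minimal provenance: which source produced this
    return aligned


extracted = {
    "name": "Kim",
    "eye-color": "green",
    "hair-color": "black",
    "phone": "707-727-7477",
    "rate": "$60/15min",  # no mapping entry, so this field is dropped
}
print(json.dumps(align(extracted, TEXT_EXTRACTOR_MAPPING, "text-extractor"), indent=2))
```

Keeping one mapping per source means a new dataset only needs a new table, while everything downstream keeps reading the same property names.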
  25. Feature Alignment Statistics
      • 5 contractors provided data
      • ~ 15 datasets
      • > 30 Karma models
      • > 200 million records
      • 1 hour of processing on a 20-node Hadoop cluster
  26. Steps To Build a DIG (pipeline diagram repeated from slide 9)
  27. Entity Resolution: merging records that refer to the same entity. Issues: missing data, incorrect data, scale (~50 million records); currently working on techniques to address these
  28. Entity Resolution on Strong Attributes (two graph fragments)
      Fragment 1: Offer-1, Person-1, AdultService-1; edges: seller, itemProvided; availableAt: Santa Barbara; phone: 619-319-7315; price: 250/hour; startDate: 2014-12-07; hairColor: red; eyeColor: blue; name: Jessica
      Fragment 2: Offer-2, Person-2, AdultService-2; edges: seller, itemProvided; availableAt: Washington DC; phone; email; price: 250/hour; startDate: 2014-05-28; eyeColor: blue; name: Jessica
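One common way to merge records on strong attributes such as phone or email is to treat every shared value as a link and take connected components. The Python sketch below does this with a small union-find. It is an illustration only, not DIG's implementation: the record layout, the choice of strong attributes, the second phone number, the email address, and the assumption that both persons on the slide share the phone 619-319-7315 are all hypothetical.

```python
from collections import defaultdict


def resolve(records, strong_keys=("phone", "email")):
    """Cluster record ids that share a value for any strong attribute."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    seen = {}  # (attribute, value) -> first record id seen with that value
    for rid, rec in records.items():
        for key in strong_keys:
            value = rec.get(key)
            if value is None:
                continue  # missing data never links two records
            if (key, value) in seen:
                union(rid, seen[(key, value)])
            else:
                seen[(key, value)] = rid

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Illustrative records; the shared phone on Person-2 is an assumption.
records = {
    "Person-1": {"phone": "619-319-7315", "name": "Jessica"},
    "Person-2": {"phone": "619-319-7315", "email": "a@example.com", "name": "Jessica"},
    "Person-3": {"phone": "555-0100", "name": "Jessica"},
}
print(resolve(records))  # [['Person-1', 'Person-2'], ['Person-3']]
```

Person-3 shares only the name, which is not treated as a strong attribute here, so it stays in its own cluster.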
  29. Linking Using Text Similarity
      Ad 1: E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
      Ad 2: LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S
      Ad 3: L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
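The three ads on this slide differ mainly in spacing, casing, and filler characters. The slides do not say which similarity measure DIG uses, so the following Python sketch picks one standard option as an assumption: normalize the text, break it into character 3-grams, and compare the sets with Jaccard similarity. The function names and the choice of k = 3 are hypothetical.

```python
import re


def shingles(text, k=3):
    """Character k-grams of the text after dropping case, spaces, and punctuation."""
    norm = re.sub(r"[^a-z0-9]", "", text.lower())
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, between 0 and 1."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


ad1 = "E M I LY SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S"
ad2 = "LAY LA SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S"
ad3 = "L I LA SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S"

for left, right in [(ad1, ad2), (ad1, ad3), (ad2, ad3)]:
    print(round(jaccard(left, right), 2))
```

Stripping everything except letters and digits before shingling is what makes the spaced-out names and underscore padding irrelevant; at the ~100 million page scale quoted in the deck, pairwise comparison would need an approximate scheme such as MinHash, which this sketch does not cover.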
  30. Linking Using Image Similarity. 100 Million Images. Technology: Deep Learning
  31. Unsupervised Collective Entity Resolution. Same two graph fragments as slide 28, annotated "same victim" and "same Trafficker"
  32. Unsupervised Collective Entity Resolution
  33. Steps To Build a DIG (pipeline diagram repeated from slide 9)
  34. Graph Construction: assembling the data for efficient query & analysis
      - ElasticSearch: scalable, efficient query
      - graph databases: network analytics
      - NoSQL: scalable analytics
      - bulk loading: massive data imports
      - real-time updates: live, changing data
  35. Elastic Search Data Model: Adult Service · Offer · Person · Phone · Web Page
  36. Indexing for High Performance Knowledge Graph Queries. Chart: Avg. Query Times in Milliseconds, Single User Query Load, 1.2 billion triples; State of the Art Graph Database (RDF) vs. DIG indexing deployed in ElasticSearch
  37. Steps To Build a DIG (pipeline diagram repeated from slide 9)
  38. DIG Deployment for Human Trafficking
      - 100 million Web pages
      - Live updates (~5,000 pages/hour)
      - ElasticSearch database (7 nodes)
      - Hadoop workflows (20 nodes)
      - District Attorney
      - Law Enforcement
      - NGOs
  39. Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers
  40. DIG Applications
      • Human Trafficking: large, real users
      • Material Science Research: 70,000 paper abstracts (built in 1 week)
      • Arms Trafficking: Identify illegal sales
      • Patent Trolls: Identify patent trolls
      • Cyber Attacks: Predict cyber attacks from dark web data
  41. Conclusions
      • Complete tool-chain to build domain-specific knowledge graphs
      • Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
      • Scales to ~100 million pages, ~3 billion facts
      • Deployed to law enforcement
  42. Questions? dig.isi.edu. Open Source, Apache 2 License
