Extracting, Aligning, and Linking Data to Build Knowledge Graphs

Invited Talk at the Workshop on Industrial Knowledge Graphs as part of the 2017 Web Science Conference. June 25, 2017

Published in: Data & Analytics

  1. Extracting, Aligning, and Linking Data to Build Knowledge Graphs
     Craig Knoblock, University of Southern California
     Thanks to my collaborators: Pedro Szekely, Linhong Zhu, Majid Ghasemi-Gol, Mohsen Taheriyan, Minh Pham, and Steve Minton
  2. Goal
     USC Information Sciences Institute, CC-By 2.0
     - raw, messy, disconnected → clean, organized, linked
     - hard to query, analyze & visualize → easy to query, analyze & visualize
  3. Use Case: Human Trafficking
     - raw, messy, disconnected → clean, organized, linked
     - hard to query, analyze & visualize → easy to query, analyze & visualize
  4. Use Case: Human Trafficking
     - 100 million pages
     - ~100 Web sites
     - help victims
     - prosecute traffickers
  5. Example: Investigating a Reported Victim
     San Diego, where else?
  6. DIG Interface: Find the locations where a potential victim was advertised
  7. Steps To Build a KG
     - Stages: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface
     - Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames
     - Highlighted stage: Data Acquisition
  8. Data Acquisition: downloading relevant data
     - batch, real-time
     - Web pages, Web service, database, CSV, Excel, XML, JSON
  9. Traditional Web Crawler (e.g., Nutch, Scrapy)
  10. Web Crawling
      - 24/7
      - 5,000 pages/hour
      - ~100,000,000 pages total
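A back-of-the-envelope check relates the two figures on the crawling slide. The sketch below assumes a single crawl running at a constant 5,000 pages/hour, which the slide does not state; it only shows the order of magnitude such a rate implies.

```python
# Figures taken from the Web Crawling slide.
PAGES_PER_HOUR = 5_000
TOTAL_PAGES = 100_000_000

# Assumption (not from the slide): one crawl, constant rate, running 24/7.
hours = TOTAL_PAGES / PAGES_PER_HOUR
days = hours / 24
years = days / 365

print(f"{hours:,.0f} hours = {days:,.0f} days = {years:.1f} years")
```

Under that assumption, reaching roughly 100 million pages takes on the order of a couple of years of continuous crawling.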
  11. Steps To Build a KG (pipeline diagram, as on slide 7)
  12. Feature Extraction: from raw sources to structured data
      • extraction from text
      • extraction from structured Web pages
      • extraction of image features
  13. Extraction
  14. Structured Extraction
  15. Automated Extraction [Minton et al., Inferlink]
      • Title
      • Description
      • Seller
      • Post Date
      • Expiry Date
      • Price
      • Location
      • Category
      • Member Since
      • Num Views
      • Post ID
  16. Automated Extraction
      Input: A Pile of Pages
  17. Automated Extraction
      input: a pile of pages → Classify by Templates → pages clustered by template
  18. Automated Extraction
      input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (repeated for each cluster) → extractor
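The "classify by templates" step can be illustrated with a small sketch. This is not the Inferlink method, which the slides do not describe. It is a simplified stand-in that groups pages whose HTML yields the same set of root-to-element tag paths. The class and function names and the sample pages are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.stack and self.stack.pop() != tag:
                pass


def template_signature(html):
    """A page's signature is the set of tag paths it contains."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def cluster_by_template(pages):
    """Group page names whose signatures are identical."""
    clusters = defaultdict(list)
    for name, html in pages.items():
        clusters[template_signature(html)].append(name)
    return list(clusters.values())


# Hypothetical pages: two ads built from one template, one forum page.
pages = {
    "ad1.html": "<html><body><h1>Rifle</h1><div><span>$700</span></div></body></html>",
    "ad2.html": "<html><body><h1>Revolver</h1><div><span>$1,000</span></div></body></html>",
    "forum1.html": "<html><body><ul><li>post</li><li>reply</li></ul></body></html>",
}
clusters = cluster_by_template(pages)
print(clusters)
```

Exact signature equality is the crudest possible grouping rule. A real system would likely tolerate optional page elements, for example by comparing signatures with a set-similarity threshold. Once pages are grouped, one extractor can be inferred per cluster, as the slide indicates.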
  19. Unsupervised Extraction Tool
  20. Pretty Good Extractions
      Case     Want           Extracted
      Extra    Jan. 23, 2015  Jan. 23, 2015 expires Feb
      Partial  Jan. 23, 2015  Jan. 23
  21. Extraction Evaluation (10 websites, 5 pages each)
      Field         Perfect      Pretty Good
      Title         1.0 (50/50)  1.0 (50/50)
      Desc          .76 (37/49)  .98 (48/49)
      Seller        .95 (40/42)  .95 (40/42)
      Date          .83 (40/48)  .83 (40/48)
      Price         .87 (39/45)  .98 (44/45)
      Loc           .51 (23/45)  .84 (38/45)
      Cat           .68 (34/50)  .88 (44/50)
      Member Since  1.0 (35/35)  1.0 (35/35)
      Expires       .52 (15/29)  .55 (16/29)
      Views         .76 (19/25)  1.0 (25/25)
      ID            .97 (35/36)  1.0 (36/36)
  22. Steps To Build a KG (pipeline diagram, as on slide 7)
  23. Feature Alignment: from multiple schemas to a common domain schema
      Multiple Schemas:
      - CSV, Excel
      - Database tables
      - Web services
      - Extractors
      - Nomenclature
      - Spelling
  24. Karma: Mapping Data to Ontologies
      Diagram labels: Services, Relational Sources, Hierarchical Sources, Karma, { JSON-LD }, Schema.org
  25. Semantic Labeling [Pham et al., ISWC’16]
      Ontology labels: Offer, Place, Person; name, price, id, name
      Column-1 | Column-2 | Column-3 | Column-4
      British Lee-Enfield No 4 MK 2 still … | 1,000 | | 68155c13de2f2532
      Cabelas Millenium Revolver in .45 colt | 700 | 1711 Anderson Rd | 12155a1a2938bc1e
  26. Learning Semantic Types
      Requirements:
      - Learn from a small number of examples
      - Distinguish both string and numeric values
      - Can be learned quickly and is highly scalable to large numbers of semantic types
      Domain Ontology labels: Person, Organization, City, State; name, birthdate, name, name, name
      Person
         name           date      city      state  workplace
      1  Fred Collins   Oct 1959  Seattle   WA     Microsoft
      2  Tina Peterson  May 1980  New York  NY     Google
  27. Learning Semantic Types: Textual Data
      - Treat each column of data as a document
      - Apply TF-IDF Cosine Similarity
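The textual-data idea (each column is a document, compared by TF-IDF cosine similarity) can be sketched in pure Python. This is a minimal illustration, not the implementation from Pham et al. The smoothed IDF formula, the function names, and the unlabeled column are assumptions. The labeled values come from the example table on the previous slide.

```python
import math
from collections import Counter


def tokenize(column):
    """Flatten a column's cell values into one lowercase token list (the 'document')."""
    return [tok.lower() for value in column for tok in value.split()]


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict) per document, using a smoothed IDF."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_labels(new_column, labeled_columns):
    """Rank known semantic labels by similarity to an unlabeled column."""
    labels = list(labeled_columns)
    docs = [tokenize(labeled_columns[label]) for label in labels] + [tokenize(new_column)]
    vectors = tfidf_vectors(docs)
    new_vec = vectors[-1]
    scores = {label: cosine(vec, new_vec) for label, vec in zip(labels, vectors)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Labeled columns use values from the slide's example table.
labeled = {
    "Person name": ["Fred Collins", "Tina Peterson"],
    "City name": ["Seattle", "New York"],
    "Organization name": ["Microsoft", "Google"],
}
# Hypothetical unlabeled column that shares tokens with the person names.
ranking = rank_labels(["Tina Collins", "Fred Smith"], labeled)
best_label = ranking[0][0]
print(ranking)
```

Returning a ranked list rather than a single label matches how the later results slides report both top-1 accuracy and "correct label in the top 4".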
  28. Learning Semantic Types: Numeric Data
      - Apply statistical hypothesis testing to determine which distribution fits best
      - Apply Kolmogorov-Smirnov Test
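For numeric columns, the two-sample Kolmogorov-Smirnov statistic measures the largest gap between two empirical distribution functions. The sketch below computes only that statistic and uses it as a distance: the smallest value is taken as the best fit. It omits the p-value a full hypothesis test would report. The column values and label names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical labeled numeric columns.
labeled = {
    "price": [700, 1000, 450, 1200, 300],
    "year": [1959, 1980, 1975, 1990, 2001],
}
# Hypothetical unlabeled column.
new_column = [650, 900, 500, 1100]

distances = {label: ks_statistic(new_column, values) for label, values in labeled.items()}
best_label = min(distances, key=distances.get)
print(distances, best_label)
```

Because both empirical distribution functions are step functions that only change at observed values, evaluating the gap at every observed value is enough to find the maximum.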
  29. Features for Semantic Labeling
      • Features
        – KS = Kolmogorov-Smirnov
        – MW = Mann-Whitney
  30. Combining the Features for Semantic Labeling
  31. Automatically Assigned Semantic Labels
      Assigned labels, one per column: Offer name | CreativeWork fragment | Offer description | Offer identifier | Offer datePosted | CreativeWork Fragment
      - 35 Whelen Handi-Rifle | No Tags | 35 Whelen Handi-rifle. Black synthetic stock/forearm, blued barrel. Text 601-813-7280 …. | 245625390711756 | October 19, 2015 12:43 pm
      - Cabelas Millenium Revolver in .45 colt | No Tags | This single action is built to shoot and is a great way for any level of shooter to get involved with a single action. … | 12155a1a2938bc1e | July 11, 2015 5:17 pm | 1711 Anderson Rd
      - swap stocks | No Tags | want to trade butler creek folding stock for black stock ruger mini stock folder by butler creek will swap even for full rifle stock …. | 5815600fd181fe3b | September 22, 2015 1:05 am | white
      Note: streetAddress does not appear in training data -> more similar to noisy data
  32. Results on www.msguntrader.com
      - Number of attributes: 19
      - Correct prediction: 16
      - Correct label is in the top 4 predictions: 18
      - Accuracy: 84%
      - MRR: 89%
  33. Results on Gun Sites (Evaluation Dataset)
      - Average number of attributes: 18
      - Total number of attributes: 176
      - Correct prediction (Accuracy): 56%
      - Correct label is in the top 4 predictions: 89%
      - MRR: 70%
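Accuracy and mean reciprocal rank (MRR) are both computed from the rank at which the correct label appears for each attribute. The sketch below uses the www.msguntrader.com counts (19 attributes, 16 correct at rank 1, 18 within the top 4). The per-attribute ranks are not given in the slides. The rank list here is one hypothetical assignment that is consistent with those counts.

```python
def accuracy(ranks):
    """Fraction of attributes whose correct label is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; an attribute whose correct label is never predicted (None) adds 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks: 16 attributes at rank 1, 2 at rank 2, 1 never predicted.
ranks = [1] * 16 + [2] * 2 + [None]

acc = accuracy(ranks)
mrr = mean_reciprocal_rank(ranks)
print(f"accuracy = {acc:.0%}, MRR = {mrr:.0%}")
```

With this assignment the accuracy is 16/19 and the MRR is 17/19, which round to the 84% and 89% reported on the slide. Other rank assignments within the top 4 would give a slightly lower MRR.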
  34. Steps To Build a KG (pipeline diagram, as on slide 7)
  35. Entity Resolution: merging records that refer to the same entity
      techniques to address:
      - missing data
      - incorrect data
      - scale (~100 million records)
  36. Unsupervised Collective Entity Resolution
  37. Unsupervised Collective Entity Resolution
      Figure labels: same victim, same trafficker
  38. Collective Entity Resolution [Zhu et al, ISWC’16]
      Identifying and linking instances of the same real world entity
      Multi-Type Graph (node types: product, description, manufacturer, price)
      - products: Product 1, Product 2, Product 3, Product 4, Product 5
      - descriptions: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones
      - manufacturers: Bose Electronic, Sony, Bosch, Bose
      - prices: 292, 599, 229, 299
  39. Collective Entity Resolution [Zhu et al, ISWC’16] (multi-type graph, as on slide 38)
  40. Common Approach: Pairwise Comparisons
      Product    Price     Title                                        Manufacturer
      Product 1            Noise Cancelling Headphones                  Sony
      Product 2  292       Premium Noise Cancelling Headphones          Sony
      Product 3  599       Dish Washer                                  Bosch
      Product 4  299, 229  Bose Noise Cancelling Headphones             Bose
      Product 5  299       Quiet Comfort 25 Noise Cancelling Headphone  Bose Electronic
      Similarity measures and weights: Jaro 0.5, distance 0.2, Jaccard 0.3
      Acceptance Threshold: 0.8
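The pairwise approach can be sketched as a weighted sum of per-field similarities compared against the acceptance threshold. The slide gives the measures (Jaro 0.5, distance 0.2, Jaccard 0.3) but not which field each applies to. This sketch assumes Jaro on manufacturer, numeric distance on price, and Jaccard on title tokens. It also substitutes `difflib.SequenceMatcher` for Jaro, which is a different string similarity, and uses a single price per record.

```python
from difflib import SequenceMatcher

# Weights and threshold from the slide; the field assignment is an assumption.
WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}
THRESHOLD = 0.8


def string_sim(a, b):
    """Stand-in for Jaro: difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def price_sim(a, b):
    """Numeric closeness in [0, 1]: 1 when equal, lower as prices diverge."""
    return 1.0 - abs(a - b) / max(a, b)


def jaccard(a, b):
    """Jaccard overlap of whitespace-separated lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pair_score(r1, r2):
    return (WEIGHTS["manufacturer"] * string_sim(r1["manufacturer"], r2["manufacturer"])
            + WEIGHTS["price"] * price_sim(r1["price"], r2["price"])
            + WEIGHTS["title"] * jaccard(r1["title"], r2["title"]))


# Records as read from the slide's product table (one price kept for Product 4).
product4 = {"manufacturer": "Bose", "title": "Bose Noise Cancelling Headphones", "price": 299}
product5 = {"manufacturer": "Bose Electronic",
            "title": "Quiet Comfort 25 Noise Cancelling Headphone", "price": 299}

score = pair_score(product4, product5)
print(f"score = {score:.3f}, accepted = {score >= THRESHOLD}")
```

Under these assumptions the two Bose headphone records score well below the threshold even though their prices are identical. The sketch also has no answer for a record with a missing price or with several prices, which are the limitations the following slides raise.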
  41. Missing Values (same product table and weights as slide 40)
  42. Multiple Values (same product table and weights as slide 40)
  43. Weights (same product table as slide 40; weights 0.5, 0.2, 0.3)
  44. Unidirectional (same product table as slide 40; weights 0.5, 0.2, 0.3)
  45. Graph Summarization: Original Graph (multi-type graph, as on slide 38)
  46. Similar Nodes: sim_t(x, y)
  47. Graph Summarization: Super-Nodes
  48. Super-nodes C_t(x)
      - probability that a node x belongs to each super-node
      - one matrix for each type: C_t
      Nodes shown: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones
      Matrix rows shown: (0.7, 0.2, 0.1), (0.7, 0.2, 0.1), (0.2, 0.7, 0.1), (0.2, 0.7, 0.1), (0.1, 0.1, 0.8)
  49. Similar Nodes Should Be In The Same Super-Node
      Nodes shown: Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Quiet Comfort 25 Noise Cancelling Headphone; Bose Noise Cancelling Headphones
  50. Super-Links
  51. Super-Links
  52. Predict Links In Original Graph
      Nodes shown: Bose Electronic, Bosch, Bose; Product 3, Product 4, Product 5
  53. Predict Links In Original Graph
  54. Predict Links In Original Graph
  55. Re-Clustering Improves Reconstruction Quality
      Nodes shown: Bose Electronic, Bosch, Bose; Product 3, Product 4, Product 5
  56. Comparable Approaches
      Columns: Pairwise, Clustering, Unsupervised, Supervised (the column each ✔ falls under is not preserved in this transcript)
      - Limes, Ngomo’11: ✔ ✔
      - SILK, Isele’10: ✔ ✔ ✔
      - Serf, Benjelloun’10: ✔ ✔
      - *Commercial, Köpcke’10: ✔ ✔
      - GraphSum, Riondato’14: ✔ ✔
      - *AuthorLDA, Bhattacharya’07: ✔ ✔
      - CoSum (proposed): ✔ ✔
  57. Quality Comparison
                  Precision                Recall                   F-measure
      Method      Author  Paper  Product   Author  Paper  Product   Author  Paper  Product
      Limes-F     0.958   0.827  0.446     0.864   0.761  0.16      0.909   0.792  0.236
      Silk-F      0.846   0.877  0.459     0.986   0.756  0.348     0.91    0.812  0.395
      Gsum        0.727   0.668  0.01      0.569   0.624  0.587     0.638   0.645  0.02
      CoSum-B     0.993   0.871  0.58      0.94    0.611  0.477     0.966   0.718  0.524
      Limes-MO    0.912   0.827  0.446     0.944   0.761  0.16      0.928   0.792  0.236
      Silk-MO     0.932   0.877  0.459     0.958   0.756  0.348     0.945   0.812  0.395
      Serf        0.985   0.837  0.436     0.687   0.808  0.186     0.809   0.822  0.261
      CoSum-P     0.999   0.771  0.639     0.997   0.997  0.695     0.998   0.87   0.666
      Partial rows (column positions not preserved): Commercial 0.615 0.63 0.622; AuthorLDA 0.995
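The F-measure columns in the quality comparison are the harmonic mean of the matching precision and recall columns. The sketch below recomputes the Author-dataset F-measure for three rows and compares it with the reported value. The function name and the 0.001 tolerance, which allows for rounding in the slide's figures, are choices made for this sketch.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the Author dataset, from the slide.
author_rows = {
    "Limes-F": (0.958, 0.864, 0.909),
    "CoSum-B": (0.993, 0.94, 0.966),
    "CoSum-P": (0.999, 0.997, 0.998),
}

for method, (p, r, reported) in author_rows.items():
    computed = f_measure(p, r)
    print(f"{method}: computed {computed:.3f}, reported {reported}")
```

The same check confirms how the flattened numbers map to columns: the first three values in each row are precision, the next three recall, and the last three F-measure.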
  58. Steps To Build a KG (pipeline diagram, as on slide 7)
  59. Graph Construction: assembling the data for efficient query & analysis
      - ElasticSearch: scalable, efficient query
      - graph databases: network analytics
      - NoSQL: scalable analytics
      - bulk loading: massive data imports
      - real-time updates: live, changing data
  60. elasticsearch
      • Cloud-based search engine
      • Based on Apache Lucene
      • Horizontal scaling, replication, load balancing
      • Blazingly fast!
      • Everything is a document
        – Documents are JSON objects
        – Index what you want to find
        – Fields can contain strings, numbers, booleans, etc.
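The "everything is a document" point can be made concrete with a small JSON example. The sketch below does not call Elasticsearch. It builds one offer as a JSON object and a toy in-memory inverted index over a chosen field, to illustrate "index what you want to find". The field names, the `available` flag, and the helper function are hypothetical. The sample values are taken from the gun-listing examples earlier in the deck.

```python
import json
from collections import defaultdict

# A document is a JSON object whose fields hold strings, numbers, booleans, etc.
offer = {
    "@type": "Offer",
    "identifier": "12155a1a2938bc1e",
    "name": "Cabelas Millenium Revolver in .45 colt",
    "price": 700,
    "datePosted": "July 11, 2015 5:17 pm",
    "available": True,  # hypothetical boolean field
}
document = json.dumps(offer)


def build_inverted_index(docs, fields):
    """Toy stand-in for a search index: token -> identifiers of matching documents."""
    index = defaultdict(set)
    for doc in docs:
        for field in fields:
            for token in str(doc.get(field, "")).lower().split():
                index[token].add(doc["identifier"])
    return index


# Only the fields passed in become searchable.
index = build_inverted_index([json.loads(document)], fields=["name"])
print(index["revolver"])
```

Choosing which fields to index is the design decision the slide highlights: here a search for "revolver" finds the offer, while the price and date are stored in the document but are not searchable.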
  61. ElasticSearch Data Model: efficient indexing and query
      Diagram labels: AdultService, Offer, Person, Phone, WebPage
  62. Offers As Roots
  63. Products (AdultService) As Roots
  64. Indexing for High Performance Knowledge Graph Queries
      - Chart: average query times in milliseconds, single-user query load, 1.2 billion triples
      - Compared: state-of-the-art graph database (RDF) vs. DIG indexing deployed in ElasticSearch
  65. Steps To Build a KG (pipeline diagram, as on slide 7)
  66. DIG Deployment for Human Trafficking
      - 100 million Web pages
      - Live updates (~5,000 pages/hour)
      - ElasticSearch database (7 nodes)
      - Hadoop workflows (20 nodes)
      - District Attorney
      - Law Enforcement
      - NGOs
  67. DIG Applications
      - Human Trafficking: large, real users
      - Material Science Research: 70,000 paper abstracts (built in 1 week)
      - Arms Trafficking: identifies illegal sales
      - Patent Trolls: identifies patent trolls
      - Predicting Cyber Attacks: combines diverse sources about vulnerabilities, exploits, etc.
  68. Conclusions
      • Presented the end-to-end tool-chain to build domain-specific knowledge graphs
      • Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
      • Approach scales to millions of pages and billions of facts
      • Has been used to build real-world deployed applications
