
From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies

Over the last few years we have been building domain-specific knowledge graphs for a variety of real-world problems, including creating virtual museums, combating human trafficking, identifying illegal arms sales, and predicting cyber attacks. We have developed a variety of techniques to construct such knowledge graphs, including techniques for extracting data from online sources, aligning the data to a domain ontology, and linking the data across sources. In this talk I will present these techniques and describe our experience in applying Semantic Web technologies to build knowledge graphs for real-world problems.

Published in: Technology

From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies

  1. 1. From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies. Craig Knoblock, USC Information Sciences Institute. U.S. Semantic Technologies Symposium, March 1, 2018
  2. 2. Center on Knowledge Graphs: People
  3. 3. Center on Knowledge Graphs: People (cont.)
  4. 4. Center on Knowledge Graphs: Projects (Center on Knowledge Graphs, USC Information Sciences Institute)
  5. 5. Goal: Building Knowledge Graphs. From raw, messy, disconnected data (hard to query, analyze & visualize) to clean, organized, linked data (easy to query, analyze & visualize).
  6. 6. Questions Addressed in this Talk: (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? (2) What is the “best” representation of the data in a knowledge graph? Very detailed domain-specific ontologies? (3) How should we deal with incomplete and incorrect information? Manual curation? Automated data cleaning? (4) How do we organize and store the data for efficient access? RDF? Triplestore?
  7. 7. Steps To Build a KG (pipeline diagram). Stages: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames.
  8. 8. Illegal Arms Sales • 100s of web sites • ATF wants to find people buying and selling across state lines • Challenge: extract and align the data across sites
  9. 9. Extraction
  10. 10. Structured Extraction
  11. 11. Automated Extraction [Minton et al., Inferlink]. Input: a pile of pages
  12. 12. Automated Extraction. Input: a pile of pages → Classify by Templates → pages clustered by template
  13. 13. Automated Extraction. Input: a pile of pages → Classify by Templates → pages clustered by template → Infer Extractor (×4) → extractor
  14. 14. Unsupervised Extraction Tool
  15. 15. Extraction Evaluation (10 websites, 5 pages each), by field:
        Field        | Perfect      | Including partial and extra data
        Title        | 1.0 (50/50)  | 1.0 (50/50)
        Desc         | .76 (37/49)  | .98 (48/49)
        Seller       | .95 (40/42)  | .95 (40/42)
        Date         | .83 (40/48)  | .83 (40/48)
        Price        | .87 (39/45)  | .98 (44/45)
        Loc          | .51 (23/45)  | .84 (38/45)
        Cat          | .68 (34/50)  | .88 (44/50)
        Member Since | 1.0 (35/35)  | 1.0 (35/35)
        Expires      | .52 (15/29)  | .55 (16/29)
        Views        | .76 (19/25)  | 1.0 (25/25)
        ID           | .97 (35/36)  | 1.0 (36/36)
  16. 16. Steps To Build a KG (pipeline diagram). Stages: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames.
  17. 17. Knowledge Graph for Predicting Cyber Attacks (architecture diagram). Sources: Blogs, Twitter, Conferences, CPEs, Darkweb marketplaces, News, CVEs, Darkweb Forums, Abuse.ch, Microsoft Bulletins. Other labels: Karma, Model, Cyber Domain Ontology, Elastic Search.
  18. 18. Cyber Domain Ontology: 28 classes, 97 properties, based on Schema.org
  19. 19. Karma: Mapping Data to Ontologies [Knoblock, Szekely, et al., ISWC 2012]. Diagram labels: Services, Relational Sources, Hierarchical Sources, Karma, Cyber Ontology, { JSON-LD }.
  20. 20. Map Source to Domain Ontology. Semantic Model: maps source to domain ontology. Domain ontology diagram (legend: object property, data property). Classes: Software, Vulnerability, Topic, Post, Person. Property labels: name, version, author, hasVulnerability, description, isTopicOf, isVulnerabilityOf, location, mentions, datePublished, topic, hasTopic, username, isAuthorOf. Sample source rows (Column 1 to Column 5, cells in extracted order):
        Bro can you give me a.. | English | windows xp sp3 | CVE-2016-1052 | 303828
        … أنا جربت البرنامج وعمل ع | Arabic | jp2_cdef_destroy | 147075
        salve a tutti, ultimamento … | Italian | cve-2012-4969 execcommand vuln | cve-2012-4969 | 107075
  21. 21. Semantic Types. Diagram labels: classes Post, Topic, Vulnerability, Person; semantic types text, language, name, userId, name. Sample source rows as on slide 20.
  22. 22. Relationships. Diagram labels: classes Post, Topic, Vulnerability, Person; semantic types text, language, name, userId, name; relationships mentions, hasTopic, author. Sample source rows as on slide 20.
  23. 23. Cyber KG Dashboard
  24. 24. Karma Learns the Source Models [Taheriyan et al., ISWC 2013, ICSC 2014]. Inputs: Domain Ontology, Sample Data, Known Semantic Models. Steps: Learn Semantic Types → Construct a Graph → Generate Candidate Models → Rank Results.
  25. 25. Learning Semantic Types. Requirements: • Learn from a small number of examples • Distinguish both string and numeric values • Can be learned quickly and is highly scalable to large numbers of semantic types. Example: a domain ontology with classes Person, Organization, City, State (properties name, birthdate) and a source table with columns name, date, city, state, workplace: (1) Fred Collins, Oct 1959, Seattle, WA, Microsoft; (2) Tina Peterson, May 1980, New York, NY, Google.
  26. 26. Training machine learning model [Pham et al., ISWC 2016]
  27. 27. Predicting new attribute
  28. 28. Construct a Graph: construct a graph from semantic types and ontology
  29. 29. Determine Relationships: select minimal tree that connects all semantic types, using a customized Steiner tree algorithm [Kou & Markowsky, 1981]. Result: Initial Model.
  30. 30. Refining the Model: impose constraints on Steiner Tree Algorithm. Result: Correct Model.
  31. 31. Knowledge Graphs: Karma uses semantic models to create knowledge graphs. Karma semi-automatically builds semantic models.
  32. 32. American Art Collaborative [Knoblock et al., ISWC 2017] • Consortium of 14 American art museums • Explore the use of Linked Data for research, education, and outreach • Build 5-star Linked Data for the museums • Create tools to support the construction of Linked Data
  33. 33. Example Model of Actor for Amon Carter
  34. 34. Complete Model of Actor for Amon Carter
  35. 35. AAC Data Statistics
  36. 36. AAC Target Mappings
  37. 37. AAC Mapping Validator
  38. 38. Statistics on What Was Mapped
  39. 39. Steps To Build a KG (pipeline diagram). Stages: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames.
  40. 40. Collective Entity Resolution [Zhu et al., ISWC’16]: identifying and linking instances of the same real-world entity. Multi-Type Graph with node types product, description, manufacturer, price. Products: Product 1 to Product 5. Descriptions: Quiet Comfort 25 Noise Cancelling Headphone; Noise Cancelling Headphones; Premium Noise Cancelling Headphones; Dish Washer; Bose Noise Cancelling Headphones. Manufacturers: Bose Electronic, Sony, Bosch, Bose. Prices: 292, 599, 229, 299.
  41. 41. Collective Entity Resolution [Zhu et al., ISWC’16]: identifying and linking instances of the same real-world entity. Multi-Type Graph (same example graph as slide 40).
  42. 42. Common Approach: Pairwise Comparisons. Similarity measures: Jaro 0.5, distance 0.2, Jaccard 0.3. Acceptance Threshold: 0.8. Example table (columns Price, Title, Manufacturer; cells in extracted order): Product 5 | 299 | Quiet Comfort 25 Noise Cancelling Headphone | Bose Electronic | 299, 229 | Bose Noise Cancelling Headphones | Bose | Product 4 | 599 | Dish Washer | Bosch | Product 3 | 292 | Premium Noise Cancelling Headphones | Sony | Product 2 | Noise Cancelling Headphones | Sony | Product 1
  43. 43. Graph Summarization: Original Graph (same example graph as slide 40; node types price, description, manufacturer, product)
  44. 44. Similar Nodes: simt(x, y) (same example graph as slide 40)
  45. 45. Graph Summarization: Super-Nodes (same example graph as slide 40)
  46. 46. Super-Links (same example graph as slide 40)
  47. 47. Super-Links (same example graph as slide 40)
  48. 48. Predict Links In Original Graph. Diagram nodes: Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5.
  49. 49. Predict Links In Original Graph. Diagram nodes (shown twice): Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5.
  50. 50. Predict Links In Original Graph. Diagram nodes: Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5.
  51. 51. Re-Clustering Improves Reconstruction Quality. Diagram nodes (shown twice): Bose Electronic, Bosch, Bose, Product 3, Product 4, Product 5.
  52. 52. Quality Comparison (each cell: Author / Paper / Product)
        Method     | Precision             | Recall                | F-measure
        Limes-F    | 0.958 / 0.827 / 0.446 | 0.864 / 0.761 / 0.16  | 0.909 / 0.792 / 0.236
        Silk-F     | 0.846 / 0.877 / 0.459 | 0.986 / 0.756 / 0.348 | 0.91 / 0.812 / 0.395
        Gsum       | 0.727 / 0.668 / 0.01  | 0.569 / 0.624 / 0.587 | 0.638 / 0.645 / 0.02
        CoSum-B    | 0.993 / 0.871 / 0.58  | 0.94 / 0.611 / 0.477  | 0.966 / 0.718 / 0.524
        Limes-MO   | 0.912 / 0.827 / 0.446 | 0.944 / 0.761 / 0.16  | 0.928 / 0.792 / 0.236
        Silk-MO    | 0.932 / 0.877 / 0.459 | 0.958 / 0.756 / 0.348 | 0.945 / 0.812 / 0.395
        Serf       | 0.985 / 0.837 / 0.436 | 0.687 / 0.808 / 0.186 | 0.809 / 0.822 / 0.261
        CoSum-P    | 0.999 / 0.771 / 0.639 | 0.997 / 0.997 / 0.695 | 0.998 / 0.87 / 0.666
        Commercial | Product only: 0.615   | Product only: 0.63    | Product only: 0.622
        AuthorLDA  | 0.995 (single value; its column is not recoverable from the transcript)
  53. 53. Steps To Build a KG (pipeline diagram). Stages: Data Acquisition, Feature Extraction, Feature Alignment, Entity Resolution, Graph Construction, User Interface. Diagram labels: Crawling, Extraction, Mapping To Ontology, Entity Linking & Similarity, Knowledge Graph Deployment, Query & Visualization, Elastic Search, Graph DB, schema.org, geonames.
  54. 54. Counter Human Trafficking
  55. 55. DIG for Counter Human Trafficking
  56. 56. Find the locations where a potential victim was advertised
  57. 57. Successfully deployed and used to find victims and prosecute traffickers
  58. 58. Graph Construction: assembling the data for efficient query & analysis. Data represented in JSON-LD. Stored in ElasticSearch: • Cloud-based search engine based on Apache Lucene • Horizontal scaling, replication, load balancing • Queries are fast! • Everything is a document. Bulk loading: massive data imports (> 100M web pages). Real-time updates: live, changing data (~5,000 pages/hour).
  59. 59. ElasticSearch Data Model: efficient indexing and query. Diagram labels: Adult Service, Offer, Person, Phone, Web Page.
  60. 60. Indexing for High Performance Knowledge Graph Queries. Chart: average query times in milliseconds, single-user query load, 1.2 billion triples. Compared: state-of-the-art graph database (RDF) vs. DIG indexing deployed in ElasticSearch.
  61. 61. Knowledge Graph Performance in Cyber Domain • Index time for 16 million documents: ~2.5 hours • Query times: • Average query time for keyword searches: 8 msec • Find a specific CVE: 14 msec • Get all mentions of a MS Bulletin in all sources: 48 msec • Get all Malware named ‘Locky’ and sort results by observedDate: 68 msec • Get all blogs mentioning keyword ‘microsoft’ with a date range: 98 msec • Aggregate and give document counts for each publisher/sensor: 409 msec
  62. 62. Questions Addressed in This Talk: Lessons Learned. (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? (2) What is the best representation of the data in a knowledge graph? Do we want to use the most detailed ontology possible? (3) How should we deal with missing and incomplete information? Manual curation? Automated data cleaning? (4) How do we organize and store the data for efficient access? RDF? Triplestore?
  63. 63. Questions Addressed in This Talk: Lessons Learned. (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? Answer: The Web! Waiting for the rest of the world to adopt the Semantic Web and provide the data in RDF is an approach doomed to failure!
  64. 64. Questions Addressed in This Talk: Lessons Learned. (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? (2) What is the “best” representation of the data in a knowledge graph? Do we want to use the most detailed ontology possible? Answer: The simplest one you need for the problem you are trying to solve. Overly complicated ontologies that attempt to be comprehensive for a domain get in the way of solving the real problems.
  65. 65. Questions Addressed in This Talk: Lessons Learned. (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? (2) What is the best representation of the data in a knowledge graph? Carefully curated domain-specific ontologies? (3) How should we deal with missing and incorrect information? Manual curation? Automated data cleaning? Answer: Clean where possible, but we need techniques that can cope with these problems. The world is a messy place, and the ability to deal with it allows us to solve real-world problems.
  66. 66. Questions Addressed in This Talk: Lessons Learned. (1) Where should the Semantic Web data come from? Triplestores? Linked data? Schema.org? (2) What is the best representation of the data in a knowledge graph? Carefully curated domain-specific ontologies? (3) How should we deal with missing and incomplete information? Manual curation? Automated data cleaning? (4) How do we organize and store the data for efficient access? RDF? Triplestore? Answer: In whatever datastore best meets the goals of the problem! It is a mistake to equate the Semantic Web with triples and triplestores.
  67. 67. Important Directions for Future Research: (1) Techniques for extracting data from the online sources. (2) Approaches to quickly build, refine, and extend ontologies to solve specific problems. (3) Methods for semantically annotating data from extracted sources. (4) Scalable and configurable techniques for entity resolution. (5) Highly scalable algorithms for querying and reasoning. (6) Ability to publish and query semantic data on web pages.
  68. 68. Thanks!
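
Slide 29 describes selecting a minimal tree that connects all semantic types with a customized Steiner tree algorithm [Kou & Markowsky, 1981]. The sketch below shows only the classic, uncustomized approximation associated with that citation (metric closure, minimum spanning tree, path expansion, pruning); it is not Karma's implementation. The toy graph, its class names, and its edge weights are assumptions made up for illustration.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def dijkstra(graph, source):
    """Shortest-path distances and predecessors from source (undirected, weighted)."""
    dist, prev, heap = {source: 0}, {}, [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def kruskal(nodes, edges):
    """Minimum spanning tree over (weight, u, v) edges, using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def steiner_tree(graph, terminals):
    """Approximate Steiner tree; returns {(u, v): weight} with u < v."""
    terminals = list(terminals)
    paths = {t: dijkstra(graph, t) for t in terminals}
    # 1. Metric closure: complete graph on the terminals, weighted by distance.
    closure = [(paths[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    # 2. Minimum spanning tree of the closure.
    # 3. Expand each closure edge back into its shortest path in the graph.
    sub_edges = set()
    for _, a, b in kruskal(terminals, closure):
        node = b
        while node != a:
            p = paths[a][1][node]
            sub_edges.add((graph[p][node],) + tuple(sorted((p, node))))
            node = p
    # 4. Minimum spanning tree of the expanded subgraph.
    sub_nodes = {n for _, u, v in sub_edges for n in (u, v)}
    tree = {(u, v): w for w, u, v in kruskal(sub_nodes, sub_edges)}
    # 5. Repeatedly prune leaves that are not terminals.
    while True:
        degree = defaultdict(int)
        for u, v in tree:
            degree[u] += 1
            degree[v] += 1
        leaves = {n for n, d in degree.items() if d == 1 and n not in terminals}
        if not leaves:
            return tree
        tree = {e: w for e, w in tree.items() if not leaves.intersection(e)}


# Hypothetical ontology graph: nodes are classes, weights are link costs.
graph = defaultdict(dict)
for u, v, w in [("Post", "Person", 1), ("Post", "Topic", 1),
                ("Post", "Vulnerability", 1), ("Person", "Software", 2),
                ("Software", "Vulnerability", 1)]:
    graph[u][v] = graph[v][u] = w

tree = steiner_tree(graph, ["Person", "Topic", "Vulnerability"])
print(sorted(tree))
```

In this toy graph the three terminal classes are connected through Post, which is not itself a terminal; that is the role a Steiner node plays when the semantic types found in a source have to be linked through intermediate ontology classes.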
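
Slide 42 lists Jaro 0.5, distance 0.2, Jaccard 0.3 and an acceptance threshold of 0.8 for the pairwise-comparison baseline. A minimal sketch of one way to read those numbers, as weights in a combined similarity score, follows. The pairing of each measure with an attribute (Jaro on manufacturer, a relative numeric distance on price, token Jaccard on title), the price formula, and the two records are assumptions; the slide does not spell them out.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def price_similarity(a, b):
    """Assumed numeric measure: 1 minus the relative difference; 0 if missing."""
    if a is None or b is None:
        return 0.0
    largest = max(a, b)
    return 1.0 if largest == 0 else 1.0 - abs(a - b) / largest


# Weights and threshold from slide 42; attribute pairing is an assumption.
WEIGHTS = {"manufacturer": 0.5, "price": 0.2, "title": 0.3}
THRESHOLD = 0.8


def pair_score(x, y):
    return (WEIGHTS["manufacturer"] * jaro(x["manufacturer"].lower(),
                                           y["manufacturer"].lower())
            + WEIGHTS["price"] * price_similarity(x["price"], y["price"])
            + WEIGHTS["title"] * jaccard(x["title"], y["title"]))


# Hypothetical records assembled from labels that appear on the slides.
a = {"title": "Bose Noise Cancelling Headphones", "manufacturer": "Bose", "price": 299}
b = {"title": "Quiet Comfort 25 Noise Cancelling Headphone",
     "manufacturer": "Bose Electronic", "price": 299}

score = pair_score(a, b)
print(round(score, 3), "match" if score >= THRESHOLD else "no match")
```

Under these assumptions the two records fall below the threshold even though they plausibly describe the same product, which is the kind of miss that motivates resolving entities collectively over the graph rather than one attribute pair at a time.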
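
Slide 58 says the data is represented in JSON-LD and bulk loaded into ElasticSearch as documents. A minimal sketch of what one such document and its bulk-load payload could look like is given below. It only builds the newline-delimited body that the Elasticsearch bulk API accepts (an action line followed by the document source) and sends nothing over the network. The index name, the URI, and the field values are hypothetical, and the DIG system's actual document schema is not shown in the transcript.

```python
import json


def to_bulk_body(index_name, documents):
    """Build a newline-delimited bulk payload: one action line per document,
    followed by the document itself. The JSON-LD "@id" is reused as "_id"."""
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["@id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical JSON-LD document using schema.org terms.
page = {
    "@context": "http://schema.org",
    "@type": "WebPage",
    "@id": "http://example.org/page/1",
    "name": "Example listing",
    "datePublished": "2018-03-01",
}

bulk_body = to_bulk_body("webpages", [page])
print(bulk_body, end="")
```

Keeping each entity as a self-contained JSON-LD document is what lets a document store index it directly, in line with the slide's point that "everything is a document".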
