Domain-specific Knowledge Extraction from the Web of Data

Structured data on the Web, frequently referred to as knowledge graphs, consists of a large number of datasets representing diverse domains. Widely used commercial applications such as entity recommendation, search, question answering, and knowledge discovery use these knowledge graphs as their knowledge source. The majority of these applications have a particular domain of interest and hence require only the segment of the Web of data representing that domain (e.g., movie, biomedical, sports). In fact, leveraging the entire Web of data for a domain-specific application is not only computationally intensive; the irrelevant portion also negatively impacts the accuracy of the application. Hence, finding the relevant portion of the Web of data for domain-specific applications has become a paramount issue. Identifying the relevant portion of the Web of data consists of two sub-tasks: 1) find the relevant datasets that contain knowledge on the domain of interest, and 2) extract the subgraph representing the domain of interest from knowledge graphs that represent multiple domains (e.g., DBpedia, YAGO, Freebase). In this talk, I will discuss both data-driven and knowledge-driven approaches to solve these two sub-tasks. The domain-specific subgraphs extracted by our approach were 80% smaller in terms of the number of paths than the original KG and resulted in a more than tenfold reduction in the computational time required for domain-specific tasks, yet produced better accuracy on domain-specific applications. We believe that this work can contribute significantly to the use of knowledge graphs for domain-specific applications, especially given the explosive growth in the creation of knowledge graphs.

Published in: Education

Domain-specific Knowledge Extraction from the Web of Data

  1. 1. 1 Ph.D. Dissertation Defense Domain-specific Knowledge Extraction from the Web of Data Sarasi Lalithsena Kno.e.sis Center Advisor: Dr. Amit P. Sheth Committee members: Dr. T.K. Prasad Dr. Derek Doran Dr. Cory Henson Dr. Saeedeh Shekarpour
  2. 2. Recommending a Movie 2 Steven Spielberg Holocaust World War II Political thriller on Abolitionism in US director director director New movie to watch? director Cold war basedOnbasedOn basedOn basedOn Movies I enjoyed recently Humanistic Issues broader broader broader broader
  3. 3. Question Answering - IBM Watson Q: In 1610 Galileo named the moons of this planet for the Medici brothers. Candidate entities: Telescope, Giovanni Medici, Sidereus Nuncius, Jupiter, Ganymede. Entity types: Telescope - Instrument; Giovanni Medici - Person; Ganymede - Moon; Sidereus Nuncius - Book; Jupiter - Planet. Source: Invited Talk “Semantic Technology in Watson”, Chris Welty, Google Research, IBM (2002 - 2014)
  4. 4. Knowledge Graph in Action 4 Question Answering Entity Linking Recommendation Systems Knowledge Discovery Conversational Agents
  5. 5. Web of Data – Structured Data on the Web 5 IBM Watson uses YAGO Hierarchy to extract the types Movie recommendation algorithms use DBpedia and Linked MovieDB to determine how two movies are semantically relevant
  6. 6. Growth of Web of Data 6 Number of Knowledge graphs Size of a Knowledge graph Domain Coverage 12 datasets (2007) 295 datasets (2011) 1163 datasets (now) Version English Version Entities Triples 2012 3.77M 400M 2013 4.0M 470M 2014 4.5M 583M 2015 6.2M 1B 2016 6.6M 1.7B DBpedia Cross-domain KGs Domain-specific KGs
  7. 7. Relevant Web of Data • Trillions of data points are available on the Web, covering a number of domains. How do we extract the relevant Web of data? • Entity recommendation applications look for knowledge related to entities of interest • Biomedical discovery applications look for knowledge on genes, proteins, chemicals, disorders, and their reactions • Twitris looks for campaign-specific knowledge on topics such as elections and natural disasters
  8. 8. 8 Search for Domain-specific Knowledge Book Recommendation DBTropes OpenLibrary DBpedia Freebase Identify the relevant knowledge graphs for a given domain Book Movie Game Person Country Other Identify the relevant portion of the knowledge graphs for a given domain WI 13 IEEE Big data 16 IEEE Big data 17
  9. 9. Thesis Statement “Applications serving specific domains can be benefitted by identifying the relevant knowledge from rapidly growing structured data on the Web. This can be accomplished by: (a) leveraging existing crowdsourced knowledge bases as a reference schema to automatically determine the domains of knowledge graphs, and (b) exploiting the semantics and structure of entities and relationships with statistical techniques to extract the relevant portions of the knowledge graphs.” 9
  10. 10. Outline • Identify relevant knowledge graphs for a given domain • Identify the relevant portion of the knowledge graphs – Domain-specific subgraph extraction from non-hierarchical knowledge – Domain-specific subgraph extraction from hierarchical knowledge
  11. 11. Outline • Identify relevant knowledge graphs for a given domain • Identify the relevant portion of the knowledge graphs – Domain-specific subgraph extraction from non-hierarchical knowledge – Domain-specific subgraph extraction from hierarchical knowledge
  12. 12. Problem • Web of data is growing fast with the number of knowledge graphs 12 CKAN Fixed set of tags describing higher level domains Can we identify the domains of these knowledge graphs to find relevant knowledge graphs easily? LODStat Semantic Search Engines User provided keywords Designed for entity search
  13. 13. Approach • Leverage the Freebase crowdsourced category system as a vocabulary to assign domains to a knowledge graph (figure labels: domain, Music, Artist)
  14. 14. Approach 14 Reference: Sarasi Lalithsena, Prateek Jain, Pascal Hitzler, Amit Sheth. "Automatic domain identification for linked open data." Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 Instance Identification Category Hierarchy Generation Assign Weighted Score Ignimbrite, Rock Slate, Rock Granite, Rock Entity Entity type Climbdata http://www.freebase.com/m/01tx7r http://www.freebase.com/m/01c_9j http://www.freebase.com/m/03fcm geology Ignimbrite rock type slate granite geography mountain mountain range geology rock type geography mountain geology rock type geology mountain
  15. 15. Evaluation • We use 30 knowledge graphs in LOD for evaluation 15 Evaluation Appropriateness of the identified domains Effectiveness in finding datasets User study User study with existing system Evaluate CKAN and our approach using gold standard More than 50% of the users agreed on 73% of the terms (88 out of 120) Outperforms LODStat and Sigma and performed as effective as CKAN Significant improvement of Recall and F-measure
  16. 16. Outline • Identify relevant knowledge graphs for a given domain • Identify the relevant portion of the knowledge graphs – Domain-specific subgraph extraction from non-hierarchical knowledge – Domain-specific subgraph extraction from hierarchical knowledge
  17. 17. Large Cross-domain Knowledge Graphs 17 570M entities and 18B facts* 44M entities and 2B facts 6.2M entities and 1B facts * Dong, Xin, et al. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
  18. 18. Challenges in Utilizing Cross-domain KGs 18 Scalability Issues Transformers The Terminator Michael Bay James Cameron Action Film Cursed Random Hearts Wes Craven Sydney Pollack Los Angeles director director director death city death city known for known for Semantic Issues • Current methods extract relevant subgraphs by navigating predefined number of hops (2-4) from known domain entities director
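The n-hop expansion that the slide describes as the current method can be sketched as a bounded breadth-first traversal. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the function name `n_hop_subgraph`, the triple layout `(subject, predicate, object)`, the forward-only traversal, and the `country` edge in the toy data are all hypothetical.

```python
from collections import deque

def n_hop_subgraph(triples, seeds, max_hops):
    """Collect every triple reachable within max_hops of the seed entities."""
    # Index outgoing edges by subject; edges are followed in the forward direction only.
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    subgraph = []
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for p, o in outgoing.get(node, []):
            subgraph.append((node, p, o))
            if o not in depth:
                depth[o] = depth[node] + 1
                queue.append(o)
    return subgraph

# Toy graph built from the slide's example entities; the "country" edge is invented.
triples = [
    ("Transformers", "director", "Michael_Bay"),
    ("Michael_Bay", "knownFor", "Action_Film"),
    ("Cursed", "director", "Wes_Craven"),
    ("Wes_Craven", "deathCity", "Los_Angeles"),
    ("Los_Angeles", "country", "United_States"),
]
two_hop = n_hop_subgraph(triples, ["Transformers", "Cursed"], max_hops=2)
```

In this toy run the 2-hop subgraph keeps the `deathCity` edge even though it says little about movies, which is the semantic issue the slide points at: hop-limited expansion filters by distance only, not by relevance to the domain.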
  19. 19. Domain-specific Subgraph Extraction • Extract a domain-specific subgraph from a large KG that reduces the size but preserves the domain-related concepts and relationships. Movie domain example. Before: Franco Zeffirelli, Italian Military Personnel, 20th century military, politics, Italian Film Directors, European Film Directors, Italian Opera Directors, University of Florence, Otello, La_Boheme, Hamlet (relationships: almaMater, director, producer, broader). After: Franco Zeffirelli, Italian Film Directors, European Film Directors, Italian Opera Directors, Otello, La_Boheme, Hamlet (relationships: director, producer, broader)
  20. 20. Key Elements to Determine the Domain-specificity • Semantics of relationships • Semantics of entities • Semantics of the structure. Figure labels: Wes Craven, Los Angeles, Horror Film, genre, deathCity; Films Directed by James Cameron, Kingdoms and countries of Austria-Hungary, Film Director, Monarchy, Country, Movie, Director; American Films, American Action Films, 34 subcategories, 34k entities, 6 subcategories, 632 entities, Film
  21. 21. Nature of the Relationship Matters • Relationships with diverse semantics – non-hierarchical relationships. Example: Franco Zeffirelli, Callas Forever, Jane Eyre, Jesus of Nazareth, Forza Italia, British Army (relationships: director, writer, director, militaryUnit, party) • Relationships with uniform semantics – hierarchical relationships. Example: Franco Zeffirelli, Italian Film Directors, Italian Opera Directors, European Film Directors, Italian Military Personnel, 20th Century Military, Politics, 1923 Births, 1920s Births (relationship: broader)
  22. 22. Outline • Identify relevant knowledge graphs for a given domain • Identify the relevant portion of the knowledge graphs – Domain-specific subgraph extraction from non-hierarchical knowledge – Domain-specific subgraph extraction from hierarchical knowledge
  23. 23. Relationship Semantics is the Key 23 Relevant for movie recommendation Not relevant for movie recommendation Transformers Michael Bay Terminator James Cameron Action Film Cursed Random Hearts Wes Craven Sydney Pollack Los Angeles director director director director knownFor knownFor deathCity deathCity Let the data tell the algorithm which relationships are important to the domain Reference: Sarasi Lalithsena, Pavan Kapanipathi, and Amit Sheth. "Harnessing relationships for domain- specific subgraph extraction: A recommendation use case." Big Data (Big Data), 2016 IEEE Big data International Conference on. IEEE, 2016.
  24. 24. Domain Specificity Measures for Relationships • Association of the relationship with domain entities provides evidence for domain specificity 24 spouse • Relationship director is specific to the movie domain • Relationship country is not specific to the movie domain • Association of the relationship with the domain entities is straightforward with direct relationships such as director and country • However, it is not trivial for other relationships such as award, spouse, and capital type Movie m1
  25. 25. Type-based Domain Specificity Measure • The measure uses the association between entity types (figure labels: spouse, type, Strong association, Weak association, Movie, Director, Country, m1)
      Associativity between directly connected entity types (e.g., Movie, Director):
      $d\_typerel(t_i, t_j) = \frac{edge\_count(t_i, t_j)}{edge\_count(t_i) \cdot edge\_count(t_j)}$
      Associativity between indirectly connected entity types (e.g., Movie, Award):
      $ind\_typerel(t_d, t_{n-1}, n)$, obtained by combining $d\_typerel(t_{k-1}, t_k)$ over $k = 1, \dots, n-1$
      Associativity between entity types and relationships (e.g., award, Director):
      $prop\_rel(p, t) = \frac{edge\_count(p, t)}{edge\_count(p)}$
      Score of a relationship $p$ at hop $n$:
      $prop\_score(p, n)$, obtained by combining $ind\_typerel(t_d, t_{n-1}^{j}, n) \cdot prop\_rel(p, t_{n-1}^{j})$ over all $t_{n-1}^{j} \in C$
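The type-based measure can be sketched in a few lines. This is a hedged illustration, not the authors' code: the transcript does not preserve the operators that aggregate over hops and over candidate types, so the sketch assumes a product along the type chain and a sum over candidate types. All function names, argument names, and counts below are hypothetical.

```python
from math import prod

def d_typerel(pair_edges, type_edges, ti, tj):
    """Direct associativity: edges between ti and tj, normalized by each type's edge count."""
    return pair_edges.get((ti, tj), 0) / (type_edges[ti] * type_edges[tj])

def ind_typerel(pair_edges, type_edges, type_chain):
    """Indirect associativity along a chain of types, starting at the domain type.

    ASSUMPTION: the per-hop direct associativities are multiplied.
    """
    return prod(d_typerel(pair_edges, type_edges, a, b)
                for a, b in zip(type_chain, type_chain[1:]))

def prop_rel(prop_type_edges, prop_edges, p, t):
    """Share of relationship p's edges that touch entities of type t."""
    return prop_type_edges.get((p, t), 0) / prop_edges[p]

def prop_score(p, type_chains, pair_edges, type_edges, prop_type_edges, prop_edges):
    """Score relationship p over all candidate type chains.

    ASSUMPTION: contributions of the candidate chains are summed.
    """
    return sum(ind_typerel(pair_edges, type_edges, chain)
               * prop_rel(prop_type_edges, prop_edges, p, chain[-1])
               for chain in type_chains)

# Hypothetical edge counts for a tiny movie graph.
type_edges = {"Movie": 10, "Director": 5, "Country": 20}
pair_edges = {("Movie", "Director"): 5, ("Movie", "Country"): 2}
prop_edges = {"award": 4}
prop_type_edges = {("award", "Director"): 3, ("award", "Country"): 1}
chains = [["Movie", "Director"], ["Movie", "Country"]]

award_score = prop_score("award", chains, pair_edges, type_edges,
                         prop_type_edges, prop_edges)
```

With these invented counts, the Director chain contributes far more to the score of `award` than the Country chain, which matches the intuition on the slide that Movie is strongly associated with Director and weakly with Country.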
  26. 26. Path-based Domain Specificity Measure • The measure uses the association between intermediate relationships (figure: paths between movies m1 and m2 through intermediate nodes; domain-specific paths made of relationships p1, p2, p3, pn across hops H1, H2, H3, Hn-1 up to the nth hop)
      Domain-specificity of the nth-hop relationships (e.g., p3 given the domain-specific path p1 – p2):
      $PMI(p, dsp_{n-1}) = \log \frac{Prob(p, dsp_{n-1})}{Prob(p) \cdot Prob(dsp_{n-1})}$
      $Prob(p, dsp_{n-1}) = \frac{Path(dsp_{n-1}, p)}{\sum_{p \in P} Path(dsp_{n-1}, p)}$
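A minimal sketch of the PMI score follows. The slide defines the joint probability but not how Prob(p) and Prob(dsp) are estimated, so this sketch assumes all three are relative frequencies over the observed (prefix, relationship) path counts, which may differ from the normalization used in the talk. The function name `pmi_scores` and the counts are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(path_counts):
    """PMI between a domain-specific path prefix and the relationship that extends it.

    path_counts maps (dsp, p) to the number of paths in which prefix dsp is followed by p.
    ASSUMPTION: probabilities are relative frequencies over all observed pairs.
    """
    total = sum(path_counts.values())
    p_count, dsp_count = Counter(), Counter()
    for (dsp, p), count in path_counts.items():
        p_count[p] += count
        dsp_count[dsp] += count

    scores = {}
    for (dsp, p), count in path_counts.items():
        joint = count / total
        independent = (p_count[p] / total) * (dsp_count[dsp] / total)
        scores[(dsp, p)] = math.log(joint / independent)
    return scores

# Invented counts: after a "director" hop, "knownFor" is common and "deathCity" is rare.
counts = {
    ("director", "knownFor"): 8,
    ("director", "deathCity"): 2,
    ("writer", "knownFor"): 2,
    ("writer", "deathCity"): 8,
}
scores = pmi_scores(counts)
```

A positive score means the relationship co-occurs with the domain-specific prefix more often than chance would predict, so ranking nth-hop relationships by this score and keeping the top ones is one way to prune the expansion.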
  27. 27. Evaluation with Recommendation Use Case • Implement an existing recommendation algorithm and run it on the n-hop expansion subgraph (baseline) and the domain-specific subgraph (our approach) • Use two domains, Movie and Book, with the existing datasets MovieLens and DBbook • MovieLens consists of 1,000,209 ratings for 3,883 movies by 6,040 users, and DBbook of 72,372 ratings for 8,170 books by 6,181 users
  28. 28. Evaluation Metrics • Graph reduction – Measure the reduction of the graph with nodes and reachable paths • Impact on accuracy – Precision@n • Impact on run time 28
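Precision@n, the accuracy metric named above, is the fraction of the top n recommendations that the user actually found relevant. A minimal sketch, with hypothetical item identifiers and the assumption that the denominator is always n even when fewer items are recommended:

```python
def precision_at_n(ranked_items, relevant_items, n):
    """Fraction of the top-n ranked items that appear in the relevant set."""
    top_n = ranked_items[:n]
    hits = sum(1 for item in top_n if item in relevant_items)
    return hits / n

# Hypothetical ranking produced by a recommender and the user's relevant items.
ranked = ["m1", "m2", "m3", "m4", "m5"]
relevant = {"m1", "m3", "m9"}

p_at_1 = precision_at_n(ranked, relevant, 1)
p_at_5 = precision_at_n(ranked, relevant, 5)
```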
  29. 29. Evaluation Metrics – Graph Reduction (reduction relative to the 2-hop graph in parentheses)
      Movie domain: 2-hop graphs
                     Path-based                          Type-based
                     Nodes            Paths              Nodes            Paths
      2-hop          1.07M            108.4M             1.07M            108.4M
      DSG2(15,15)    0.08M (92.0%)    5.08M (95.3%)      0.13M (87.6%)    17M (83.9%)
      DSG2(25,25)    0.13M (87.3%)    17.4M (83.8%)      0.63M (40.9%)    61.6M (43.19%)
      DSG2(35,35)    0.64M (40.7%)    61.64M (43.1%)     0.64M (40.7%)    61.62M (43.18%)
      Book domain: 2-hop graphs
                     Path-based                          Type-based
                     Nodes            Paths              Nodes            Paths
      2-hop          1.2M             793.4M             1.2M             793.4M
      DSG2(15,15)    0.09M (92.8%)    159.6M (79.6%)     0.09M (92.8%)    159.7M (80%)
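The percentages in the graph reduction results read as the share of the n-hop graph that was removed. A small worked check, using two path counts from the Movie 2-hop results (the helper name `reduction_pct` is hypothetical, and the counts are the rounded values shown on the slide, so other cells may differ by a few tenths of a percent):

```python
def reduction_pct(original, reduced):
    """Percentage of the original graph removed by the extraction."""
    return (1 - reduced / original) * 100

# Movie domain, path-based measure, path counts in millions.
dsg2_15 = round(reduction_pct(108.4, 5.08), 1)   # DSG2(15,15)
dsg2_35 = round(reduction_pct(108.4, 61.64), 1)  # DSG2(35,35)
```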
  30. 30. Evaluation Metrics – Graph Reduction: 3-hop graphs
                              Path-based                          Type-based
                              Nodes            Paths              Nodes            Paths
      Movie 3-hop             2.86M            4885.3M            2.86M            4885.3M
      Movie DSG3(15,25,15)    0.19M (93.2%)    48.4M (98.9%)      0.26M (90.9%)    105.5M (97.83%)
      Book 3-hop              3.2M             13582.8M           3.2M             13582.8M
      Book DSG3(15,25,15)     0.18M (94.2%)    1082.6M (92.2%)    0.12M (96%)      1062.5M (92.33%)
      On average, the domain-specific subgraph is 80% to 90% smaller than the n-hop expansion subgraph
  31. 31. Evaluation Metrics – Precision@n 31 movies movies Movie Domain Hop 2 Movie Domain Hop 3
  32. 32. Evaluation Metrics – Precision@n 32 Book Domain Hop 2 Book Domain Hop 3 books books
  33. 33. Evaluation Metrics – Run Time Performance
                 Movie                                       Book
                 n-hop expansion   DSG (Path)   DSG (Type)   n-hop expansion   DSG (Path)   DSG (Type)
      2-hop      72s               5s           11.2s        10.15m            1.3m         1.4m
      3-hop      2 h 35 m          76s          3.2 m        7 h 40 m          15.2m        27m
      Configuration: VM core-8, 15G RAM
  34. 34. Conclusion • Rank non-hierarchical relationships for domain specificity using relation semantics and structural semantics • The approach reduced the graph size by more than 80%, which led to a tenfold decrease in computation time for the recommendation algorithm • The accuracy of the algorithm was not compromised; instead, the results were more accurate
  35. 35. Outline • Identify relevant knowledge graphs for a given domain • Identify the relevant portion of the knowledge graphs – Domain-specific subgraph extraction from non-taxonomic knowledge – Domain-specific subgraph extraction from hierarchical knowledge
  36. 36. Recap • Relationships with diverse semantics – non-hierarchical relationships (focused on relationships). Example: Franco Zeffirelli, Callas Forever, Jane Eyre, Jesus of Nazareth, Forza Italia, British Army (relationships: director, writer, director, militaryUnit, party) • Relationships with uniform semantics – hierarchical relationships. Example: Franco Zeffirelli, Italian Film Directors, Italian Opera Directors, European Film Directors, Italian Military Personnel, 20th Century Military, Politics, 1923 Births, 1920s Births (relationship: broader)
  37. 37. Hierarchical Relationships are Important 37 Hierarchical (subject + broader) type name birthplace kingdom office party author product builder
  38. 38. Hierarchical Relationships are Important 38 American LGBT related Films Museums in Popular Culture Blackbird Manhattan Stormbreaker Fair game The Bourne Identity American Spy Films Films Directed by Doug Liman American films by genre Museums Education in Popular Culture EducationEducational Buildings American films by genre Films by American Directors broader – domain irrelevant Categories movies broader – domain relevant
  39. 39. Entity Semantics is the Key 39 Lexical-based SemanticsType-based Semantics Films Directed by James Cameron Kingdoms and countries of Austria-Hungary Film Film Director Monarchy Country Action animation 1997 anime Dance animation 1996 in animation Biographical works Biographical films about entertainers Biographical films about children Reference: Sarasi Lalithsena, Sujan Perera, Pavan Kapanipathi, and Amit Sheth. "Domain-specific hierarchical subgraph extraction: A recommendation use case." Big Data (Big Data), 2017 IEEE Big data International Conference on. IEEE, 2017.
  40. 40. Structural Semantics also Provides Evidence 40 American Action Films …….. American Films 34 subcategories 43K entities 6 subcategories 632 entities
  41. 41. Putting it all together. Evidence can be (a) complementary or (b) contrasting:
      • 1989 anime. Type: Time; Lexical: anime (animation); Structural: 3 subcategories, 0 entities. Relevance labels: Less relevant, Highly relevant, Highly relevant
      • American teen horror films. Type: Horror Film; Lexical: horror films; Structural: 25 subcategories, 13 entities. Relevance labels: Highly relevant, Highly relevant, Less relevant
      • Business in the United States. Type: Business, Country; Lexical: United States; Structural: 18 subcategories, 9 entities. Relevance labels: Less relevant, Less relevant, Less relevant
      • Biographical films about photojournalists. Type: Biographical films; Lexical: Biographical films; Structural: 0 subcategories, 2 entities. Relevance labels: Highly relevant, Highly relevant, Highly relevant
  42. 42. Probabilistic Soft Logic to Combine Evidence • We use probabilistic soft logic (PSL), a statistical relational learning framework, to aggregate complementary and contrasting evidence in a principled way. • PSL is a declarative language for expressing statistical relational learning problems, with three main components: – Predicates, e.g., Friends – Atoms: continuous random variables, e.g., Friends(A, B); an atom can be observed or unobserved – Rules: capture dependencies, each with a weight, e.g., Friends(A, B) ^ Friends(B, C) -> Friends(A, C)
  43. 43. Probabilistic Soft Logic • PSL uses the Łukasiewicz t-norm to relax the logical connectives – P ⋀ Q = max(0, P + Q - 1) – P ⋁ Q = min(1, P + Q) • Given a rule B1(X) ⋀ B2(X) -> H1(X), measure the distance to satisfaction: max((B1(X) ⋀ B2(X)) - H1(X), 0)
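The relaxed connectives and the distance to satisfaction can be written out directly. This is a plain-Python sketch of the formulas on the slide, not a call into any PSL library; the function names and the truth values in the example are hypothetical.

```python
def luk_and(p, q):
    """Lukasiewicz relaxation of conjunction for truth values in [0, 1]."""
    return max(0.0, p + q - 1.0)

def luk_or(p, q):
    """Lukasiewicz relaxation of disjunction for truth values in [0, 1]."""
    return min(1.0, p + q)

def distance_to_satisfaction(body, head):
    """How far a rule body -> head is from being satisfied (0 means satisfied)."""
    return max(0.0, body - head)

# Hypothetical soft truth values for a rule B1(X) AND B2(X) -> H1(X).
b1, b2, h1 = 0.9, 0.8, 0.4
body = luk_and(b1, b2)
distance = distance_to_satisfaction(body, h1)
```

Here the body evaluates to about 0.7 and the head to 0.4, so the rule is violated by about 0.3; raising the head value toward the body value reduces the penalty, which is how evidence in the body pulls an unobserved head atom upward during inference.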
  44. 44. PSL Rules to Domain-specificity of Categories
      • Type Semantics: sem_type(cat, type) ^ sem_sim(type, domain) → domain_specificity(cat, domain)
      • Lexical Semantics: lex_clus_type(cat, clus) ^ lex_clus_sim(clus, domain) → domain_specificity(cat, domain)
      • Structural Semantics: graph_spec(cat) → domain_specificity(cat)
      Figure examples for the Film domain: Films Directed by James Cameron (type: Film Director); 1997 Anime, Dance Animation, Action Animation (lexical cluster: animation); American Action Films, American Films (structure)
  45. 45. Evaluation: Recommendation Use Case • We use the Wikipedia category hierarchy as our test bed, with the same evaluation setting used for non-hierarchical relationships. • We evaluate the performance of the recommendation algorithm on different subgraphs – Baseline subgraph (EXP-DSHGn) – Subgraph from our approach (PSL-DSHGn) – Subgraph from a supervised machine learning approach (SUP-DSHGn) * • We use both the movie and book domains and the evaluation metrics used earlier. *Mirylenka, Daniil, Andrea Passerini, and Luciano Serafini. "Bootstrapping Domain Ontologies from Wikipedia: A Uniform Approach." IJCAI. 2015.
  46. 46. Evaluation Results - Graph Reduction: reduction for Movie hop-2 and hop-3 subgraphs
      Hop-2              Categories      Paths
      EXP-DSHG2          6413            18M
      PSL-DSHG2(3500)    3844 (40%)      1.62M (91%)
      PSL-DSHG2(4000)    4315 (33%)      10.8M (44%)
      PSL-DSHG2(4500)    4782 (25%)      10.26M (43%)
      Hop-3              Categories      Paths
      EXP-DSHG3          12348           320M
      PSL-DSHG3(6500)    6534 (47%)      106M (67%)
      PSL-DSHG3(7500)    7508 (39%)      115M (64%)
      PSL-DSHG3(9000)    9015 (27%)      151M (52%)
  47. 47. Evaluation Results - Graph Reduction: reduction for Book hop-2 and hop-3 subgraphs
      Hop-2              Categories      Paths
      EXP-DSHG2          8603            2.2M
      PSL-DSHG2(5000)    5155 (40%)      1.4M (36%)
      PSL-DSHG2(5500)    5847 (32%)      1.6M (27%)
      PSL-DSHG2(6000)    6297 (27%)      1.8M (18%)
      Hop-3              Categories      Paths
      EXP-DSHG3          18680           22M
      PSL-DSHG3(6500)    6868 (63%)      9M (73%)
      PSL-DSHG3(7500)    7504 (60%)      12M (45%)
      PSL-DSHG3(8500)    8916 (52%)      14M (35%)
  48. 48. Evaluation Results - Precision@n 48 Movie 2-hop subgraph extracted with top-4500 categories shows the best performance with 43% reduction in paths Movie 3-hop subgraph extracted with top-9000 categories shows the best performance with 52% reduction in paths Movie Domain Hop 2 Movie Domain Hop 3 movies movies
  49. 49. Evaluation Results - Precision@n 49 Book 2-hop subgraph extracted with top-5500 categories shows a similar performance with 27% reduction in paths Book 3-hop subgraph extracted with top-7500 categories shows the best performance with 45% reduction in paths Book Domain Hop 2 Book Domain Hop 3 books @n books
  50. 50. Evaluation Results with PSL-DSHGn and SUP-DSHGn • Graph Reduction
                           PSL Approach                   Supervised Approach
                           Categories     Paths           Categories      Paths
      Movie EXP-DSHGn      6413           18M             77033           17M
      Movie *x-DSHG        4782 (25%)     10.2M (43%)     10576 (86%)     16M (6%)
      Book EXP-DSHGn       8603           2.2M            45784           2M
      Book *x-DSHG         5847 (31%)     1.6M (27%)      8521 (81%)      1M (50%)
      Path reduction from the PSL approach is significantly higher in the movie domain but lower in the book domain
  51. 51. Evaluation Results with PSL-DSHGn and SUP-DSHGn • The performance of the recommendation system that utilizes SUP-DSHG deteriorates not only in comparison to PSL-DSHG but also in comparison to its own expansion subgraph. (Plots: Movie Domain, Book Domain; x-axes: movies, books)
  52. 52. DSHG for US Presidential Election Domain • How about a fine-grained domain such as US Presidential Election Campaign? – Covers diverse set of entity types from politician and policies to funding and media 52
  53. 53. PSL Rules • We modify the calculation of axioms in type and lexical PSL rules • Type Semantics: sem_type(cat, type) ^ sem_sim(type, domain) → domain_specificity(cat, domain), where
      $sem\_sim(type, domain) = \frac{\sum_{concept} sem\_sim(type, concept)}{\text{number of concepts}}$
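For a fine-grained domain described by several concepts, the modified similarity is the mean of the per-concept similarities. A minimal sketch, assuming a similarity function is supplied by the caller; the concept list is taken from the election example, while the similarity values and names such as `domain_similarity` and `toy_sims` are invented for illustration.

```python
def domain_similarity(type_name, domain_concepts, concept_sim):
    """Average similarity between an entity type and every concept describing the domain."""
    total = sum(concept_sim(type_name, concept) for concept in domain_concepts)
    return total / len(domain_concepts)

# Hypothetical similarity values; a real system would compute these from a semantic model.
toy_sims = {
    ("Politician", "politician"): 0.9,
    ("Politician", "policy"): 0.5,
    ("Politician", "funding"): 0.3,
    ("Politician", "media"): 0.3,
}
concepts = ["politician", "policy", "funding", "media"]

score = domain_similarity("Politician", concepts,
                          lambda t, c: toy_sims.get((t, c), 0.0))
```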
  54. 54. Evaluation • The 3-hop expansion subgraph consisted of 19371 categories, and we extract the top 5000 categories as our domain-specific subgraph • We create a gold standard of 1000 random categories from our 3-hop expansion subgraph with 3 annotators
                   Accuracy    F1 relevant class    F1 irrelevant class
      SUP-DSHG     0.71        0.532                0.786
      PSL-DSHG     0.8         0.645                0.861
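The three reported metrics can be computed from gold and predicted labels as sketched below. The labels are toy data, not the gold standard from the talk, and the helper names are hypothetical.

```python
def accuracy(gold, predicted):
    """Share of categories whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def f1_for_class(gold, predicted, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy annotations for five categories.
gold = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant"]
pred = ["relevant", "irrelevant", "irrelevant", "irrelevant", "relevant"]

acc = accuracy(gold, pred)
f1_relevant = f1_for_class(gold, pred, "relevant")
f1_irrelevant = f1_for_class(gold, pred, "irrelevant")
```

Reporting F1 per class, as the slide does, matters when the classes are imbalanced: a method can reach high accuracy by favoring the majority class while still scoring poorly on the relevant class.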
  55. 55. Examples of US Presidential Election Domain 55 Donald Trump Political positions of United States presidential candidates, 2016 United States presidential candidates, 2016 Candidates in United States elections, 2016 People associated with the United States presidential election, 2016 2016 elections in the United States 2016 elections in the United States Campaign finance in the United States American political fundraisers The Apprentice (TV series) contestants The Apprentice (TV series) Reality television series franchises Television franchises Television series by genre Business-related television series
  56. 56. Examples of US Presidential Election Domain 56 Groom (profession) Lincoln Chafee United States presidential candidates, 2016 Candidates in United States elections, 2016 People associated with the United States presidential election, 2016 2016 elections in the United States American political people Living People Farriers People working with Animals Horse related professions and professionals American politicians who switched parties in office
  57. 57. Examples of US Presidential Election Domain 57 United States presidential election Japan Music Week Presidential elections in the United States Caucus Political Party Elections Federal elections of the United States Elections in the United States November events Events by months November
  58. 58. Limitation • Extracting temporally relevant subgraphs – Our subgraph contains information from prior elections, dating back as far as 1972. Example categories: Presidential elections in the United States; United States presidential elections by date; United States presidential election, 1972; United States presidential election, 1984; United States presidential election, 1908
  59. 59. Conclusion • We proposed approaches to extract a domain-specific subgraph from large knowledge graphs by exploiting relationship semantics, entity semantics, and structure in both data-driven and knowledge-driven ways • We demonstrated the effectiveness of the domain-specific subgraphs using a recommendation use case. The subgraphs, created with an 80% reduction from non-hierarchical relationships and a 40-50% reduction from hierarchical relationships, improve the accuracy of the recommendations and reduce the runtime more than tenfold.
  60. 60. Summary • Demonstrated the importance of extracting the relevant portion of the Web of data • Proposed and developed a technique to identify the relevant knowledge graphs by automatically assigning domains to knowledge graphs using crowdsourced knowledge sources • Showcased the effectiveness of these domain assignments with an application that finds relevant knowledge graphs • Proposed and developed techniques to identify the domain-specific subgraph from large knowledge graphs with hierarchical and non-hierarchical relationships using statistical techniques • Demonstrated the effectiveness of the domain-specific subgraphs using a recommendation use case for multiple domains.
  61. 61. Publications
      • Conference Papers
        - Sarasi Lalithsena, Pascal Hitzler, Amit Sheth, and Prateek Jain. "Automatic Domain Identification for Linked Open Data." WI 2013
        - Sarasi Lalithsena, Pavan Kapanipathi, and Amit Sheth. "Harnessing relationships for domain-specific subgraph extraction: A recommendation use case." IEEE BigData 2016
        - Sarasi Lalithsena, Sujan Perera, Pavan Kapanipathi, and Amit Sheth. "Domain-specific hierarchical subgraph extraction: A recommendation use case." IEEE BigData 2017
        - Sarasi Lalithsena et al. "Feedback-Driven Radiology Exam Report Retrieval with Semantics." ICHI 2015
        - Nur Aini Rakhmawati, Muhammad Saleem, Sarasi Lalithsena, and Stefan Decker. "QFed: Query Set For Federated SPARQL Query Benchmark." iiWAS2014
      • Articles
        - Kalpa Gunaratne, Sarasi Lalithsena, and Amit Sheth. "Alignment and Dataset Identification of Linked Data in Semantic Web." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2014)
      • Journal Paper
        - Priti P. Parikh, Todd A. Minning, Vinh Nguyen, Sarasi Lalithsena, Amir H. Asiaee, Satya S. Sahoo, Prashant Doshi, Rick Tarleton, and Amit P. Sheth. "A Semantic Problem Solving Environment for Integrative Parasite Research: Identification of Intervention Targets for Trypanosoma cruzi." PLoS 2012
      • Workshop Papers
        - Roopteja Muppalla, Sarasi Lalithsena, Tanvi Banerjee, and Amit Sheth. "A Knowledge Graph Framework for Detecting Traffic Events Using Stationary Cameras." Industrial Knowledge Graphs Workshop (co-located with the 9th International ACM Web Science Conference 2017)
        - Nishita Jaykumar, PavanKalyan Yallamelli, Vinh Nguyen, Sarasi Lalithsena, Krishnaprasad Thirunarayan, Amit Sheth, and Clare Paul. "KnowledgeWiki: An OpenSource Tool for Creating Community-Curated Vocabulary, with a Use Case in Materials Science." LDOW workshop of the 25th International World Wide Web Conference
  62. 62. PhD Journey
      • Knowledge Graph Construction and Consumption for Domain-specific Applications
        - Identifying relevant knowledge graphs (WI'13, Wiley'14)
        - Identifying domain relevant subgraphs (IEEE Big Data'16, IEEE Big Data'17)
        - Contextually relevant patient record retrieval with knowledge graph (ICHI'15)
        - Federated SPARQL query benchmarks (iiWAS2014)
        - Dynamically evolving knowledge graphs
      • Semantic Web Application for Interdisciplinary Areas
        - Semantics and Services enabled Problem Solving Environment for Tcruzi (PLoS'12)
        - Federated Semantic Services Platform for Open Materials Science and Engineering (LDOW-WWW'16)
        - Knowledge graph framework for detecting traffic events using stationary cameras (WebScience'17)
      • Research Internships
        - DERI, Galway, Ireland 2012
        - GE Research, NY 2014
        - Bosch Research, PA 2015
      • Program Committee: FiCloud2017, ICIW2017, DKMP2017
      • External Reviewer: WWW, IJCAI, ESWC, EKAW, ISWC, HT, LDOW, NLDB, IJSWIS, PLoS One
      • Proposals: NSF Hazards SEES, NSF DIBBS, NIH chemogenomics
  63. 63. Acknowledgement 63
  64. 64. Acknowledgement 64
  65. 65. Acknowledgement 65 Funding Agencies Colleagues • NSF • NIH • AFRL
  66. 66. Thank You. Questions? 66
