
PhD Proposal Defense - Prateek Jain


Slides from Prateek Jain's PhD Proposal Defense.

Published in: Technology, Education

  1. About 22 years ago…
  2. 11 years later… (Image from the Scientific American website)
  3. [Image-only slide]
  4. [Image-only slide]
  5. [Image-only slide]
  6. Tim Berners-Lee, 2006
     1. Use URIs as names for things.
     2. Use HTTP URIs so that people can look up those names.
     3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
     4. Include links to other URIs, so that they can discover more things.
  7. In 2006: Web of Data
  8. Linked Open Data
     • Massive collection of instance data
     • Primarily connected via the owl:sameAs relationship
     • Excellent source of information for background knowledge
     • Labeled as mainstream Semantic Web
     (Slide footer date: 7/2/2012)
  9. Is it really mainstream Semantic Web?
     • What is the relationship between the models whose instances are being linked?
     • How can LOD be queried without knowing the individual datasets?
     • How can schema-level reasoning be performed over the LOD cloud?
  10. What can be done?
      • Relationships are at the heart of semantics.
      • LOD primarily consists of owl:sameAs links.
      • LOD captures instance-level relationships, but lacks class-level relationships:
        o Superclass
        o Subclass
        o Equivalence
      • How to find these relationships?
        o Match the LOD ontologies using state-of-the-art ontology matching tools.
  11. Linked Data Alignment and Enrichment
      Proposal Defense, June 11th, 2012
      Prateek Jain, Kno.e.sis Center, Wright State University, Dayton, OH
  12. Agenda
      • Motivation and significance of this research
      • Research questions and proposed solutions
      • State of the current research and planned work
      • Questions and comments
      (Slide footer date: 14th February 2012)
  13. Linked Open Data
      • A set of best practices for publishing and connecting structured data on the Web
      • These practices have been adopted by an increasing number of data providers in the past 5 years
      • Latest count is 295 datasets with over 50 billion triples (approx.)
  14. Linked Open Data 2007 (May)
      Linking Open Data cloud diagram, this and subsequent pages, by Richard Cyganiak and Anja Jentzsch.
  15. Linked Open Data 2007 (Oct)
  16. Linked Open Data 2009
  17. Linked Open Data 2011
  18. Linked Open Data
      Number of datasets:
        Date         Datasets
        2011-09-19   295
        2010-09-22   203
        2009-07-14   95
        2008-09-18   45
        2007-10-08   25
        2007-05-01   12
      Number of triples (Sept 2011): 31,634,213,770, with 503,998,829 out-links
      From
  19. 6 years of existence: how many applications come to your mind?
  20. [Image-only slide]
  21. Reality…
      • “We DID NOT use the entire DBpedia or LOD. The only component of LOD which helped us with Watson was YAGO class hierarchy present in DBpedia. We had strict information gain requirements and other components honestly did not help much.” – Researcher with the Watson Team
  22. Why?
  23. A simple query…
      “Identify congress members, who have voted ‘No’ on pro-environmental legislation in the past four years, with high-pollution industry in their congressional districts.”
      But even with LOD we cannot answer this query.
  24. Example: GovTrack
      [RDF graph figure. Recoverable node labels: Vote:2009-887, Votes:2009-887/+, people/P000197, Bills:h3962. Properties: vote:hasOption, vote:vote, vote:votedBy, vote:hasAction, rdfs:label, dc:title, name. Literals: “Aye”, “Nancy Pelosi”, “H.R. 3962: Affordable Health Care for America Act”, “On Passage: H R 3962 Affordable Health Care for America Act”.]
  25. Example: GeoNames
      rdfs:subClassOf?
  26. Our Approach
      Use knowledge contributed by users to enhance existing approaches to solve these issues:
      • Ontology integration
      • Detection of relationships within and across datasets in the LOD Cloud
      • Querying multiple datasets
  27. Circling Back
      • LOD captures instance-level relationships, but lacks class-level relationships:
        o Superclass
        o Subclass
        o Equivalence
  28. BLOOMS – Bootstrapping …
  29. • BLOOMS: Bootstrapping-based Linked Open Data Ontology Matching System
      • Developed specifically for LOD ontologies
      • Identifies schema-level links between different LOD datasets
      • Aligns ontologies belonging to diverse domains using diverse data sources
  30. Existing Approaches
      “A survey of approaches to automatic schema matching” by Erhard Rahm and Philip A. Bernstein, The VLDB Journal 10:334–350 (2001)
  31. LOD Ontology Alignment
      • Actual results from these techniques:
        o Nation = Menstruation, Confidence = 0.9
      • They perform extremely well on established benchmarks, but typically not in the wild.
      • LOD ontologies are of a very different nature:
        o Created by the community for the community.
        o Emphasis on the number of instances, not the number of meaningful relationships.
        o Require solutions beyond syntactic and structural matching.
  32. Rabbit out of a hat?
      • Traditional auxiliary data sources (WordNet, upper-level ontologies) have limited coverage.
      • Community-generated data is noisy, but:
        o Is rich in content
        o Is rich in structure
        o Has a “self-healing property”
      • Problems like ontology matching have a dimension of context associated with them.
  33. Wikipedia
      • The English version alone has more than 2.9 million articles.
      • Continually expanded by approx. 100,000 active volunteer editors.
      • Multiple points of view are mentioned with proper contexts.
      • Article creation/correction is an ongoing activity.
  34. Ontology Matching using Wikipedia
      • On Wikipedia, categories are used to organize the entire project.
      • Wikipedia's category system consists of overlapping trees.
      • Simple rules for categorization.
  35. BLOOMS Approach – Step 1
      • Pre-process the input ontology:
        o Remove property restrictions
        o Remove individuals and properties
      • Tokenize the class names:
        o Remove underscores, hyphens and other delimiters
        o Break down complex class names, for example: SemanticWeb => Semantic Web
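The tokenization in Step 1 can be illustrated with a minimal Python sketch. The slide does not say how BLOOMS actually splits names, so the function name `tokenize_class_name`, the exact set of delimiters, and the camel-case rules below are all assumptions chosen for illustration.

```python
import re


def tokenize_class_name(name: str) -> str:
    """Turn an ontology class name into space-separated words (illustrative only)."""
    # Replace underscores, hyphens and a few other common delimiters with spaces.
    # The delimiter set is an assumption; the slide only says "other delimiters".
    spaced = re.sub(r"[_\-./:#]+", " ", name)
    # Split camelCase boundaries: a lowercase letter or digit followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    # Split an acronym from a following capitalized word, e.g. "RDFGraph" -> "RDF Graph".
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", spaced)
    # Collapse any repeated whitespace.
    return " ".join(spaced.split())


if __name__ == "__main__":
    for raw in ["SemanticWeb", "Jazz_Musician", "RDFGraph"]:
        print(raw, "=>", tokenize_class_name(raw))
```

The slide's own example, SemanticWeb => Semantic Web, is reproduced by the camel-case rule; the acronym rule is an extra convenience that the slide does not mention.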
  36. BLOOMS Approach – Step 2
      • Identify the articles in Wikipedia corresponding to the concept.
        o Each article related to the concept indicates a sense of the usage of the word.
      • For each article found in the previous step:
        o Identify the Wikipedia categories to which it belongs.
        o For each category found, find its parent categories up to level 4.
      • Once the “BLOOMS tree” for each sense of the source concept (Ts) is created, compare it with the “BLOOMS trees” of the target concept (Tt).
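A rough sketch of the tree construction in Step 2, using small in-memory dictionaries in place of Wikipedia. Everything here is an assumption made for illustration: the function name `build_blooms_tree`, the convention that the article is level 0 and its direct categories are level 1, the rule that a category reachable by several paths is kept at the first level where it is reached, and the toy category data.

```python
from collections import deque


def build_blooms_tree(article, article_categories, parent_categories, max_level=4):
    """Return {node: level} for one sense (article), expanded up to max_level.

    article_categories: maps an article title to its categories (hypothetical data).
    parent_categories:  maps a category to its parent categories (hypothetical data).
    """
    levels = {article: 0}
    queue = deque()
    # Level 1: the categories the article belongs to.
    for category in article_categories.get(article, []):
        if category not in levels:
            levels[category] = 1
            queue.append(category)
    # Levels 2..max_level: parent categories, breadth first.
    while queue:
        node = queue.popleft()
        if levels[node] >= max_level:
            continue  # do not expand beyond the level limit
        for parent in parent_categories.get(node, []):
            if parent not in levels:
                levels[parent] = levels[node] + 1
                queue.append(parent)
    return levels


# Toy data, invented for this example only.
article_categories = {"Semantic Web": ["Web standards"]}
parent_categories = {
    "Web standards": ["World Wide Web"],
    "World Wide Web": ["Internet"],
    "Internet": ["Computer networks"],
    "Computer networks": ["Computing"],
}

tree_levels = build_blooms_tree("Semantic Web", article_categories, parent_categories)
```

With this toy data the expansion stops at "Computer networks" (level 4), so "Computing" is never added. In a real setting the two dictionaries would be replaced by lookups against Wikipedia's category system.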
  37. BLOOMS Approach – Step 3
      • In the tree Ts, remove all nodes whose parent node occurs in Tt, to create Ts′.
        o All leaves of Ts′ are of level 4 or occur in Tt.
        o The pruned nodes do not contribute any additional new knowledge.
      • Compute the overlap Os between the source and target trees:
        o Os = n / (k − 1), where n = |{z : z ∈ Ts′ ∩ Tt}| and k = |{s : s ∈ Ts′}|.
      • The alignment decision is made as follows:
        o If, for Ts ∈ Tc and Tt ∈ Td, we have Ts = Tt, then C = D.
        o If min{o(Ts, Tt), o(Tt, Ts)} ≥ x, then set C rdfs:subClassOf D if o(Ts, Tt) ≤ o(Tt, Ts), and set D rdfs:subClassOf C if o(Ts, Tt) ≥ o(Tt, Ts).
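The pruning, overlap, and decision rules of Step 3 can be sketched in Python as follows. This is one reading of the slide, not the BLOOMS implementation: trees are represented as child-list dictionaries, pruned subtrees are dropped entirely, the overlap is defined as 0 when the pruned tree has only its root, and the threshold `x` is passed in because the slide gives no value. All function names and the toy trees are hypothetical.

```python
def tree_nodes(tree, root):
    """Collect every node reachable from root in a {node: [children]} mapping."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(tree.get(node, []))
    return seen


def prune(tree, root, target_nodes):
    """Build Ts': drop every node whose parent occurs in the target tree."""
    kept, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in kept:
            continue
        kept.add(node)
        # Children of a node that occurs in Tt are removed, so do not descend.
        if node not in target_nodes:
            stack.extend(tree.get(node, []))
    return kept


def overlap(source_tree, source_root, target_tree, target_root):
    """o(Ts, Tt) = n / (k - 1), with n = |Ts' ∩ Tt| and k = |Ts'|."""
    target_nodes = tree_nodes(target_tree, target_root)
    pruned = prune(source_tree, source_root, target_nodes)
    n = len(pruned & target_nodes)
    k = len(pruned)
    return n / (k - 1) if k > 1 else 0.0  # guard for k = 1 is an assumption


def decide(o_st, o_ts, x):
    """Apply the slide's subclass rule; x is the (unspecified) threshold."""
    if min(o_st, o_ts) < x:
        return None
    # When both overlaps are equal the slide's two conditions both hold;
    # this sketch simply reports the first one.
    if o_st <= o_ts:
        return "C rdfs:subClassOf D"
    return "D rdfs:subClassOf C"


# Toy trees, invented for this example only.
ts = {"A": ["X", "Y"], "X": ["P"], "Y": ["Q"]}
tt = {"B": ["X", "Z"], "X": ["P"]}

o_st = overlap(ts, "A", tt, "B")  # Ts' = {A, X, Y, Q}, shared node X
o_ts = overlap(tt, "B", ts, "A")  # Tt' = {B, X, Z}, shared node X
result = decide(o_st, o_ts, x=0.3)
```

The equality case (Ts = Tt implies C = D) is a direct comparison of the two trees and is left out of the sketch.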
  38. Example
  39. Evaluation Objectives
      • To examine BLOOMS as a tool for LOD ontology matching.
      • To examine the ability of BLOOMS to serve as a general-purpose ontology matching system.
  40. Circling Back
      • LOD primarily consists of owl:sameAs links
  41. Part-of Relationship Identification
  42. Partonomy Identification
      • Currently, entities across datasets are linked primarily using the owl:sameAs relationship.
      • Relationships such as partonomy (part-of) and causality can enable even more intelligent applications, such as Watson.
      • Approach: PLATO (Part-Of relation finder on Linked Open DAta Tool)
  43. PLATO Approach
      • PLATO generates all possible partonomically linked pairs between the entities in the dataset.
        o Utilizes “strongly” associated entities
      • Identifies the type of each entity in the pair using WordNet.
        o Uses class names
        o Gives the lexicographer files for the synsets corresponding to these entities
  44. Winston’s Taxonomy
  45. PLATO Approach – Step 2
      • PLATO generates linguistic patterns for each applicable property, based on linguistic cues suggested by Winston.
        o Example: Cell Wall is made of Cellulose
      • Tests the lexical patterns for each entity pair in a corpus-driven manner.
        o Uses the Web as a corpus
      • PLATO counts the total number of web pages that contain the pattern.
        o Parses each page and identifies occurrences of the pattern.
  46. PLATO Approach – Step 3
      • Asserts the partonomy property with the strongest supporting evidence:
        o Cell Wall is made of Cellulose: 48
        o Cellulose is made of Cell Wall: 10
      • PLATO also enriches the schema by generalizing from the instance-level assertions.
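The pattern testing and evidence comparison in Steps 2 and 3 might look roughly like the Python sketch below. The function names, the single cue phrase, and the `count_pages` callable are hypothetical; PLATO's real web search and page parsing are not described on the slides, so a dictionary holding the two counts from the slide stands in for them.

```python
def candidate_patterns(a, b, cue="is made of"):
    """Generate the pattern in both directions for one entity pair and one cue."""
    return [
        (f"{a} {cue} {b}", (a, b)),
        (f"{b} {cue} {a}", (b, a)),
    ]


def strongest_assertion(a, b, count_pages, cue="is made of"):
    """Return (subject, object, count) for the better-supported direction.

    count_pages is any callable mapping a pattern string to a page count;
    it is a stand-in for a web search, not part of any real API.
    Returns None when there is no evidence or the two directions tie.
    """
    scored = [(count_pages(pattern), direction)
              for pattern, direction in candidate_patterns(a, b, cue)]
    (count_1, dir_1), (count_2, dir_2) = scored
    if count_1 == count_2:
        return None
    best_count, best_dir = (count_1, dir_1) if count_1 > count_2 else (count_2, dir_2)
    return (best_dir[0], best_dir[1], best_count)


# Page counts taken from the slide's example; the lookup itself is a stub.
slide_counts = {
    "Cell Wall is made of Cellulose": 48,
    "Cellulose is made of Cell Wall": 10,
}

best = strongest_assertion(
    "Cell Wall", "Cellulose", lambda pattern: slide_counts.get(pattern, 0)
)
```

Treating a tie as "no assertion" is a design choice of this sketch; the slides do not say how ties are handled.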
  47. Evaluation Objectives
      • To examine PLATO as a tool for finding different kinds of part-of relations.
      • To examine PLATO as a tool for finding part-of relations within a dataset.
      • To examine PLATO as a tool for finding part-of relations across datasets.
  48. [Image-only slide]
  49. Columns: BLOOMS, BLOOMS+, PLATO, Others
      2010: 1 paper at ISWC; 1 paper at OM; paper at AAAI SS; paper at GEOS workshop
      2011: 1 paper at ESWC; workshop at ICBO
      2012: 1 paper at ACM Hypertext
      Total of 7 publications covering this research
  50. Research Plan
      • Evaluation of BLOOMS on LOD ontologies
      • Evaluation of PLATO
      • Automatic classification of datasets
      • Property alignment on LOD
  51. Questions?