Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying

The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content", i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud, as we will illustrate, are too shallow to realize many of the benefits promised. If this limitation is left unaddressed, the LOD Cloud will merely be more data that suffers from the same kinds of problems that plague the Web of Documents, and the vision of the Semantic Web will fall short.

This thesis presents a comprehensive solution to the issue of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between classes and properties of ontologies. Between classes, we identify subsumption, equivalence and part-of relationships. Between instances, we identify part-of relationships. Between properties, we establish subsumption and equivalence relationships. By bootstrapping we mean the process of using the information contained within the datasets to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence of the feasibility and applicability of the solution.

  • Thanks to members of the LOD mailing list, especially Dr. Hugh Glaser
  • both as a knowledge source and test bed
  • “If logical membership of one category implies logical membership of a second, then the first category should be made a subcategory” “Pages are not placed directly into every possible category, only into the most specific one in any branch” “Every Wikipedia article should belong to at least one category.”

    1. About 22 years ago...
    2. 11 years later… Image from Scientific American website
    3. [No slide text]
    4. [No slide text]
    5. [No slide text]
    6. Tim Berners-Lee, 2006: 1. Use URIs as names for things. 2. Use HTTP URIs so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL). 4. Include links to other URIs, so that they can discover more things.
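The four principles on slide 6 can be illustrated with a toy, in-memory stand-in for the Web. This is only a sketch: the URIs, the property names, and the `TOY_WEB` dictionary are hypothetical, and a real client would dereference the URIs over HTTP and parse RDF instead of reading a dictionary.

```python
# Hypothetical toy "web": each HTTP URI (principles 1 and 2) maps to the
# triples that would be returned when the URI is looked up (principle 3).
TOY_WEB = {
    "http://example.org/people/alice": [
        ("http://example.org/people/alice", "name", "Alice"),
        ("http://example.org/people/alice", "knows", "http://example.org/people/bob"),
    ],
    "http://example.org/people/bob": [
        ("http://example.org/people/bob", "name", "Bob"),
    ],
}


def look_up(uri):
    """Return the triples describing a URI, or an empty list if it is unknown."""
    return TOY_WEB.get(uri, [])


def discover(start_uri):
    """Follow links to other URIs (principle 4) and return every URI reached."""
    seen = set()
    frontier = [start_uri]
    while frontier:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        for _subject, _predicate, obj in look_up(uri):
            # Objects that are themselves HTTP URIs are links worth following.
            if obj.startswith("http://"):
                frontier.append(obj)
    return seen


reachable = discover("http://example.org/people/alice")
```

Starting from a single URI, `discover` reaches the second resource only because the first description links to it, which is the point of principle 4.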
    7. In 2006: Web of Data
    8. Linked Open Data • Massive collection of instance data • Primarily connected via owl:sameAs relationship • Excellent source of information for background knowledge • Labeled as mainstream Semantic Web 7/30/2012
    9. Is it really mainstream Semantic Web? • What is the relationship between the models whose instances are being linked? • How to do querying on LOD without knowing individual datasets? • How to perform schema level reasoning over LOD cloud?
    10. What can be done? • Relationships are at the heart of Semantics • LOD primarily consists of owl:sameAs links • LOD captures instance level relationships, but lacks class level relationships. o Superclass o Subclass o Equivalence • How to find these relationships? o Perform a matching of the LOD ontologies using state-of-the-art ontology matching tools.
    11. Linked Open Data Alignment and Querying. Dissertation Defense, July 27th, 2012. Prateek Jain, Kno.e.sis Center, Wright State University, Dayton, OH
    12. Agenda • Motivation and Significance of this research • Research questions and proposed solutions • State of the current research and planned work • Questions and comments 14th February 2012
    13. Linked Open Data • A set of best practices for publishing and connecting structured data on the Web • Practices have been adopted by an increasing number of data providers in the past 5 years • Latest count is at 295 datasets with over 50 Billion triples (approx)
    14. Linked Open Data 2007 (May). Linking Open Data cloud diagram, this and subsequent pages, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
    15. Linked Open Data 2007 (Oct)
    16. Linked Open Data 2009
    17. Linked Open Data 2011
    18. Linked Open Data: number of datasets by date. From http://www4.wiwiss.fu-berlin.de/lodcloud/state/
        Date        Number of datasets
        2011-09-19  295
        2010-09-22  203
        2009-07-14  95
        2008-09-18  45
        2007-10-08  25
        2007-05-01  12
        Number of triples (Sept 2011): 31,634,213,770, with 503,998,829 out-links
    19. 6 years of existence: how many applications come to your mind?
    20. I tried to investigate...
    21. Compiled List • BBC Music • Faviki • Application Lifecycle Management at IBM Rational • British Museum
    22. [No slide text]
    23. Reality… • “We DID NOT use the entire DBpedia or LOD. The only component of LOD which helped us with Watson was YAGO class hierarchy present in DBpedia. We had strict information gain requirements and other components honestly did not help much” – Researcher with the Watson Team
    24. Why?
    25. A simple query... “Identify congress members, who have voted “No” on pro environmental legislation in the past four years, with high-pollution industry in their congressional districts.” But even with LOD we cannot answer this query.
    26. Example: GovTrack [Figure: RDF graph of a GovTrack roll-call vote. Node labels: Vote: 2009-887; Votes:2009-887/+; people/P000197; Nancy Pelosi; Aye; Bills:h3962; “H.R. 3962: Affordable Health Care for America Act”; “On Passage: H R 3962 Affordable Health Care for America Act”. Edge labels: vote:hasOption, vote:vote, vote:votedBy, vote:hasAction, rdfs:label, dc:title, name.]
    27. Example: GeoNames rdfs:subClassOf?
    28. Our Approach: Use knowledge contributed by users to enhance existing approaches to solve these issues: • Ontology integration • Detecting relationships within and across datasets in the LOD Cloud • Querying multiple datasets
    29. Circling Back • LOD captures instance level relationships, but lacks class level relationships. o Superclass o Subclass o Equivalence
    30. BLOOMS – Bootstrapping …
    31. • BLOOMS - Bootstrapping-based Linked Open Data Ontology Matching System • Developed specifically for LOD Ontologies • Identifies schema level links between different LOD datasets • Aligns ontologies belonging to diverse domains using diverse data sources
    32. Existing Approaches: “A survey of approaches to automatic schema matching” by Erhard Rahm and Philip A. Bernstein, The VLDB Journal 10:334–350 (2001)
    33. LOD Ontology Alignment • Actual results from these techniques: Nation = Menstruation, Confidence=0.9 • They perform extremely well on established benchmarks, but typically not in the wild. • LOD ontologies are of a very different nature • Created by community for community. • Emphasis on number of instances, not number of meaningful relationships. • Require solutions beyond syntactic and structural matching.
    34. Rabbit out of a hat? • Traditional auxiliary data sources (WordNet, Upper Level Ontologies) have limited coverage. • Community-generated data is noisy, but is rich in • Content • Structure • Has a “self healing property” • Problems like Ontology Matching have a dimension of context associated with them.
    35. Wikipedia • The English version alone has more than 2.9 million articles • Continually expanded by approx. 100,000 active volunteer editors • Multiple points of view are mentioned with proper contexts • Article creation/correction is an ongoing activity
    36. Ontology Matching using Wikipedia • On Wikipedia, categories are used to organize the entire project. • Wikipedia's category system consists of overlapping trees. • Simple rules for categorization
    37. BLOOMS Approach – Step 1 • Pre-process the input ontology o Remove property restrictions o Remove individuals, properties • Tokenize the class names o Remove underscores, hyphens and other delimiters o Break down complex class names • Example: SemanticWeb => Semantic Web
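The class-name tokenization in Step 1 (slide 37) might look roughly like the following. This is a minimal sketch, not the BLOOMS implementation: the function name is hypothetical, and treating every non-alphanumeric character as a delimiter and splitting only at lower-to-upper case boundaries are assumptions.

```python
import re


def tokenize_class_name(name):
    """Split an ontology class name into space-separated words."""
    # Replace underscores, hyphens and other non-alphanumeric delimiters with spaces.
    name = re.sub(r"[^A-Za-z0-9]+", " ", name)
    # Insert a space at each lower-to-upper CamelCase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Collapse repeated spaces and trim the ends.
    return " ".join(name.split())


example = tokenize_class_name("SemanticWeb")
```

For the slide's own example, `tokenize_class_name("SemanticWeb")` yields `"Semantic Web"`.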
    38. BLOOMS Approach – Step 2 • Identify the article in Wikipedia corresponding to the concept. o Each article related to the concept indicates a sense of the usage of the word. • For each article found in the previous step o Identify the Wikipedia category to which it belongs. o For each category found, find its parent categories till level 4. • Once the “BLOOMS tree” for each of the senses of the source concept is created (Ts), utilize it for comparison with the “BLOOMS tree” of the target concepts (Tt).
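Building a “BLOOMS tree” as described in Step 2 (slide 38) can be sketched with a toy category graph. Everything in `PARENTS` is hypothetical stand-in data rather than real Wikipedia content, and the level numbering (article at level 0, its categories at level 1) is an assumption, since the slide only says “till level 4”.

```python
# Hypothetical stand-in for Wikipedia: maps an article to its categories and
# each category to its parent categories.
PARENTS = {
    "Jazz festival": ["Jazz festivals"],
    "Jazz festivals": ["Music festivals", "Jazz events"],
    "Music festivals": ["Festivals"],
    "Jazz events": ["Jazz"],
    "Festivals": ["Events"],
    "Events": ["Main topic classifications"],
}


def build_blooms_tree(article, parents, max_level=4):
    """Return {node: level} for the category tree rooted at an article."""
    levels = {article: 0}
    frontier = [article]
    for level in range(1, max_level + 1):
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                # Keep the first (shallowest) level at which a node is seen.
                if parent not in levels:
                    levels[parent] = level
                    next_frontier.append(parent)
        frontier = next_frontier
    return levels


tree = build_blooms_tree("Jazz festival", PARENTS)
```

With this toy data the walk stops at "Events" (level 4), so "Main topic classifications" is not part of the tree.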
    39. BLOOMS Approach – Step 3 • In the tree Ts, remove all nodes for which the parent node occurs in Tt, to create Ts’. o All leaves of Ts’ are of level 4 or occur in Tt. o The pruned nodes do not contribute any additional new knowledge. • Compute overlap Os between the source and target tree. o Os = n/(k-1), where n = |z|, z ∈ Ts’ ∩ Tt, and k = |s|, s ∈ Ts’ • The decision of alignment is made as follows. o If, for Ts ∈ Tc and Tt ∈ Td, we have Ts = Tt, then C = D. o If min{o(Ts,Tt), o(Tt,Ts)} ≥ x, then set C rdfs:subClassOf D if o(Ts,Tt) ≤ o(Tt,Ts), and set D rdfs:subClassOf C if o(Ts,Tt) ≥ o(Tt,Ts).
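The pruning, overlap, and decision rules of Step 3 (slide 39) can be sketched as follows. This is an illustration under assumptions: trees are represented as a child map plus a set of node names, the function names are hypothetical, returning 0.0 when Ts’ contains only the root is a choice made here to avoid division by zero, and the threshold value x is not given on the slide. The identical-trees case (C = D) is left to the caller.

```python
def prune(source_children, root, target_nodes):
    """Return the node set of Ts': stop descending once a node occurs in Tt."""
    kept = set()
    stack = [root]
    while stack:
        node = stack.pop()
        kept.add(node)
        if node in target_nodes:
            continue  # nodes whose parent occurs in Tt are removed
        stack.extend(source_children.get(node, []))
    return kept


def overlap(source_children, root, target_nodes):
    """Compute Os = n / (k - 1), with n = |Ts' ∩ Tt| and k = |Ts'|."""
    pruned = prune(source_children, root, target_nodes)
    k = len(pruned)
    if k == 1:
        return 0.0  # assumption: a root-only tree gives no evidence
    n = len(pruned & target_nodes)
    return n / (k - 1)


def decide(o_st, o_ts, x):
    """Apply the slide's rule given o(Ts,Tt), o(Tt,Ts) and threshold x."""
    assertions = []
    if min(o_st, o_ts) >= x:
        if o_ts >= o_st:
            assertions.append("C rdfs:subClassOf D")
        if o_st >= o_ts:
            assertions.append("D rdfs:subClassOf C")
    return assertions


# Hypothetical source tree A -> {B -> D, C -> E}; the target tree contains B, E, X.
ts_children = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
tt_nodes = {"B", "E", "X"}
pruned_nodes = prune(ts_children, "A", tt_nodes)
score = overlap(ts_children, "A", tt_nodes)
```

In the toy example, D is pruned because its parent B occurs in the target tree, leaving four nodes of which two are shared, so the overlap is 2/3.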
    40. Example
    41. Evaluation Objectives • To examine BLOOMS as a tool for the purpose of LOD ontology matching. • To examine the ability of BLOOMS to serve as a general purpose ontology matching system.
    42. BLOOMS
    43. BLOOMS
    44. Circling Back • LOD primarily consists of owl:sameAs links
    45. Part-of Relationship Identification
    46. Partonomy Identification • Currently, entities across datasets are linked primarily using the owl:sameAs relationship • Relationships such as partonomy (part-of) and causality can allow creating even more intelligent applications such as Watson • Approach: PLATO (Part-Of relation finder on Linked Open DAta Tool)
    47. PLATO Approach • PLATO generates all possible partonomically linked pairs between the entities in the dataset. o Utilize “strongly” associated entities • Identify the type of each entity in the pair using WordNet. o Use class names o Gives the lexicographer files for the synsets corresponding to these entities
    48. Winston’s Taxonomy
    49. PLATO Approach – Step 2 • PLATO generates linguistic patterns for each applicable property based on linguistic cues suggested by Winston. o Cell Wall is made of Cellulose • Tests the lexical patterns for each entity pair in a corpus-driven manner. o Using the Web as a corpus • PLATO counts the total number of web pages that contain the pattern o Parse the page and identify the occurrence of the pattern.
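Pattern generation and counting in PLATO Step 2 (slide 49) can be sketched against a small in-memory corpus. This is an illustration only: the slide uses the Web as the corpus, whereas `PAGES` below is hypothetical text; only the "is made of" cue comes from the slide, and the second template is an assumed extra.

```python
# Templates for one part-of property. Only "is made of" appears on the slide;
# "is made from" is a hypothetical additional cue.
MADE_OF_TEMPLATES = ["{whole} is made of {part}", "{whole} is made from {part}"]

# Hypothetical stand-in for web pages.
PAGES = [
    "The plant cell wall is made of cellulose and other polymers.",
    "In most plants the cell wall is made of cellulose.",
    "Cellulose is an abundant organic polymer.",
]


def patterns_for(whole, part, templates):
    """Instantiate every template for an ordered entity pair."""
    return [template.format(whole=whole, part=part) for template in templates]


def count_pages(pattern, pages):
    """Count the pages that contain the pattern, ignoring case."""
    needle = pattern.lower()
    return sum(1 for page in pages if needle in page.lower())


def evidence(whole, part, templates, pages):
    """Total page count over all patterns for one ordered pair."""
    return sum(count_pages(p, pages) for p in patterns_for(whole, part, templates))


forward = evidence("Cell Wall", "Cellulose", MADE_OF_TEMPLATES, PAGES)
backward = evidence("Cellulose", "Cell Wall", MADE_OF_TEMPLATES, PAGES)
```

Both orderings of the pair are tested, since the counts for each direction are what the next step compares.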
    50. PLATO Approach – Step 3 • Asserts the partonomy property with strongest supporting evidence o Cell Wall is made of Cellulose, 48 o Cellulose is made of Cell Wall, 10 • PLATO also enriches the schema by generalizing from the instance level assertions.
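Choosing the assertion with the strongest supporting evidence in Step 3 (slide 50) reduces to taking a maximum over page counts. The sketch below uses the two counts shown on the slide; the function name and the rule of asserting nothing when every count is zero are assumptions.

```python
def strongest_assertion(evidence):
    """Pick the (subject, property, object) triple with the highest page count.

    `evidence` maps candidate triples to supporting page counts.
    Returns None when there are no candidates or no supporting pages.
    """
    if not evidence:
        return None
    best = max(evidence, key=evidence.get)
    return best if evidence[best] > 0 else None


# Counts taken from the slide.
candidates = {
    ("Cell Wall", "is made of", "Cellulose"): 48,
    ("Cellulose", "is made of", "Cell Wall"): 10,
}
asserted = strongest_assertion(candidates)
```

With the slide's numbers, the forward direction (48 pages) wins over the reverse direction (10 pages).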
    51. Evaluation Objectives • To examine PLATO as a tool for finding different kinds of part-of relations. • To examine PLATO as a tool for finding part-of relations within a dataset • To examine PLATO as a tool for finding part-of relations across datasets
    52. PLATO Evaluation
    53. [No slide text]
    54. Some other work • Requirement document analysis o Internship at Accenture • Querying of partonomical relationship • Operators for querying spatio-temporal-thematic data • Plug-n-Play system for BLOOMS
    55. BLOOMS | BLOOMS+ | PLATO | Others • 2010: 1 paper at ISWC; 1 paper at OM; paper at AAAI SS; paper at GEOS workshop • 2011: 1 paper at ESWC; workshop at ICBO; 1 patent • 2012: 1 paper at ACM Hypertext • Total of 7 publications covering this research
    56. Potential Applications • Automatic domain identification of datasets o Work currently being pursued by Sarasi • Property alignment on LOD cloud o Work currently being pursued by Kalpa and Sanjaya • Personalization of property and concept matches. o Machine learning and data mining based techniques
    57. Publications • Prateek Jain, Pascal Hitzler, Kunal Verma, Peter Z. Yeh and Amit P. Sheth, “Moving beyond sameAs with PLATO: Partonomy detection for Linked Data”. In Proceedings of the 23rd ACM Hypertext and Social Media conference (HT 2012), Milwaukee, WI, USA, June 25th-28th, 2012 (Acceptance Rate 27.5%) • Amit Krishna Joshi, Prateek Jain, Pascal Hitzler, Peter Yeh, Kunal Verma, Amit Sheth, Mariana Damova, "Alignment-based Querying of Linked Open Data", In Proceedings of the 11th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2012) (To Appear) • Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod Vasquez, Mariana Damova, Pascal Hitzler and Amit P. Sheth, “Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with Proton”. In Grigoris Antoniou, Marko Grobelnik, Elena Simperl, Bijan Parsia, Dimitris Plexousakis, Jeff Pan and Pieter De Leenheer, editors, Proceedings of the 8th Extended Semantic Web Conference 2011, volume 6643 of Lecture Notes in Computer Science, Heidelberg, 2011. Springer Berlin. (Acceptance Rate 23.5%) • Prateek Jain, Pascal Hitzler, Amit P. Sheth, Kunal Verma and Peter Z. Yeh, “Ontology Alignment for Linked Open Data”. In P. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Pan, I. Horrocks, and B. Glimm, editors, Proceedings of the 9th International Semantic Web Conference 2010, Shanghai, China, November 7th-11th, 2010, volume 6496 of Lecture Notes in Computer Science, pages 402-417, Heidelberg, 2010. Springer Berlin. (Acceptance Rate 20%)
    58. Publications • Prateek Jain, Pascal Hitzler and Amit P. Sheth. "Flexible Bootstrapping-Based Ontology Alignment". In Proceedings of the Fifth International Workshop on Ontology Matching (Shanghai, China, November 7th - 11th, 2010). • Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma, and Amit P. Sheth, “Linked Data Is Merely More Data”. In: Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness: Linked Data Meets Artificial Intelligence. Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86. ISBN 978-1-57735-461-1 • Prateek Jain, Peter Yeh, Kunal Verma, Cory Henson, and Amit Sheth. “SPARQL Query Re-writing Using Partonomy Based Transformation Rules”. In K. Janowicz, M. Raubal, and S. Levashkin, editors, Proceedings of the Third International Conference on GeoSpatial Semantics, December 3-4, 2009, Mexico City, Mexico, volume 5892/2009 of Lecture Notes in Computer Science, pages 140–158, Heidelberg, 2009. Springer Berlin. • Prateek Jain, Kunal Verma, Alex Kass, Reymonrod G. Vasquez, “Automated Review of Natural Language Requirements Documents: Generating Useful Warnings with User-extensible Glossaries Driving a Simple State Machine”, In Kiran Deshpande, Pankaj Jalote and Sriram K. Rajamani, editors, Proceedings of the Second India Software Engineering Conference, February 23-26, 2009, Pune, India, ACM, New York, NY, 37-46. DOI= http://doi.acm.org/10.1145/1506216.1506224 (Acceptance Rate 10%).
    59. Publications • Prateek Jain, Peter Z. Yeh, Kunal Verma, Alex Kass, and Amit P. Sheth, 2008. "Enhancing process-adaptation capabilities with web-based corporate radar technologies". In Proceedings of the First international Workshop on ontology-Supported Business intelligence (Karlsruhe, Germany, October 27 - 27, 2008). OBI 08, vol. 308. ACM, New York, NY, 1-6. DOI= http://doi.acm.org/10.1145/1452567.1452569 • Matthew Perry, Amit P. Sheth, Farshad Hakimpour, Prateek Jain. "Supporting Complex Thematic, Spatial and Temporal Queries over Semantic Web Data", In F. T. Fonseca, M. Andrea Rodriguez and S. Levashkin, editors, Proceedings of the Second International Conference on GeoSpatial Semantics, December 3-4, 2009, Mexico City, Mexico, volume 4853/2007 of Lecture Notes in Computer Science, pages 228–246, Heidelberg, 2007. Springer Berlin. • Colin Puri, Karthik Gomadam, Prateek Jain, Peter Z. Yeh, Kunal Verma, “Multiple Ontologies in Healthcare Information Technology: Motivations and Recommendation for Ontology Mapping and Alignment”. In Proceedings of the Workshop on Working with Multiple Biomedical Ontologies (at ICBO), 26 July 2011, Buffalo, NY, USA. • Cory Henson, Amit P. Sheth, Prateek Jain, Josh Pschorr and Terry Rapoch. "Video on the Semantic Sensor Web", W3C Video on the Web Workshop, 12-13 December 2007, San Jose, California and Brussels, Belgium
    60. Patent • Peter Z. Yeh, Prateek Jain, Kunal Verma, Reymonrod G. Vasquez. Title: Information Source Alignment. Filed 4th March 2011. Status: Pending.
    61. Acknowledgement
    62. Acknowledgement • Cory Henson o coffee breaks, research, football, baseball, politics, life... o First person I met while finding my way to LSDIS lab • Kno.e.sis Lab Members & support staff • Folks at Accenture Technology Labs o Amazing group of people to work with/for
    63. Acknowledgement • NSF Award IIS-0842129, titled III-SGER: Spatio-Temporal-Thematic Queries of Semantic Web Data: a Study of Expressivity and Efficiency • NSF Award 1143717 III: EAGER -- Expressive Scalable Querying over Linked Open Data.
    64. Questions?