Semantic Similarity and Selection
of Resources Published According
to Linked Data Best Practice
The position paper aims at discussing the potential of exploiting linked data best practice to provide metadata documenting domain specific resources created through verbose acquisition-processing pipelines. It argues that resource selection, namely the process engaged to choose a set of resources suitable for a given analysis/design purpose, must be supported by a deep comparison of their metadata. The semantic similarity proposed in our previous works is discussed for this purpose and the main issues to make it scale up to the web of data are introduced. Discussed issues contribute beyond the re-engineering of our similarity since they largely apply to every tool which is going to exploit information made available as linked data. A research plan and an exploratory phase facing the presented issues are described remarking the lessons we have learnt so far.

  • Enabling factors for establishing the web of data as the preferred selling point for complex resources are: (i) linked data best practice relies on lightweight ontologies encoded in the Resource Description Framework (RDF), which can be exploited to provide ontology-driven metadata. This kind of metadata takes advantage of the Open World Assumption, enabling the adoption of complex, domain-specialized and independently developed metadata vocabularies, which are pivotal to document resources produced in complex and loosely coupled pipelines; (ii) linked data best practice relies on content negotiation over the standard HTTP protocol and does not propose a brand new platform replacing existing technologies. Rather, it can be placed side by side with domain-specific protocols and standards (e.g., Open Geospatial Consortium specifications for the geographic domain), making metadata available in human- and machine-consumable formats; (iii) technological advances have led to mature prototypes to expose resources as linked data (e.g., D2R and Pubby), to query them with an appropriate query language (i.e., SPARQL), to retrieve their pertaining RDF fragments published around the web (e.g., Sindice), and to reason on, store and manipulate these fragments once they are retrieved (e.g., JENA API).
  • However, even supposing linked data were massively adopted to share the metadata of complex resources, the selection of the most suitable datasets for complex domains like environmental analysis would still be an enervating task. A huge number of resource features and their complex relations must be considered during the selection process. Semantic similarity algorithms supporting a deep comparison of resource features are pivotal to assist in this process.
  • Before engaging in this challenging research plan, we have undertaken an exploratory phase analyzing real web data. The goal is to get first-hand experience of the varieties introduced by data providers publishing metadata. Although publishing metadata according to linked data best practice has a huge potential for documenting resources produced in complex pipelines, it is not yet a common practice in the specialized domains we have mentioned. For this reason, we have been forced to move to a simpler domain, considering the scientific publications exposed as linked data by Semantic Web Dog Food-SWDF (http://data.semanticweb.org/) and DBLP in RDF (http://dblp.l3s.de/d2r).
  • Very simple context!
  • We would expect the similarity, starting from the comparison of two URIs, to take advantage of non-authoritative info in order to give an assessment of entity similarity that is as realistic as possible.

    1. Semantic Similarity and Selection of Resources Published According to Linked Data Best Practice. Riccardo Albertoni, Monica De Martino. CNR-IMATI-GE, Institute of Applied Mathematics and Information Technologies (Dept. of Genoa), Consiglio Nazionale delle Ricerche, Italy. The 6th International Workshop on Ontology Content (OnToContent 2010), Oct 28, 2010, Crete. Part of the OTM (OTM'2010)
    2. Outline • Resource Selection, Semantic Similarity and Linked Data. ▫ Why does Resource Selection matter? ▫ Real example:  Complex metadata to document resources  Linked data paves the way for sharing complex metadata ▫ Semantic Similarity as a base for resource selection  Nice features such as Asymmetry & Context-Dependence • Scaling Semantic Similarity up to the Web of Data ▫ Issues & Research plan/direction ▫ Exploratory phase with real data from the web of data  Are the issues we consider relevant? In which varieties/shapes do issues occur in real data? ▫ Lessons learnt from the exploratory phase
    3. Resource Selection: • Why does it matter? ▫ Effective sharing and reuse of data are still desiderata of many scientific and industrial domains where the selection of tailored and high-quality data is a necessary condition to provide successful and competitive services • Resource selection ▫ In order to select the resources which fit a given problem/task, we rely on an analysis of the metadata documenting the resources
    4. Real Example: Acquisition, Preprocessing, Integration, Models/Analysis, Web server. Sea Trial courtesy of NATO Undersea Research Centre (NURC). Example developed in NURC Research Assistance granted to R. Albertoni (2008). Short-term perspective: data is collected and elaborated for well-planned purposes (aka sea trial experiments)
    5. Potential new "customers" for sea trial data • NATO Agencies/Nations ask for data previously collected • New scientists arriving at NURC ▫ They want access to the data in order to produce models with their own approaches and to compare the results with models already produced at NURC (benchmarking) • Scientists/Agencies investigating how phenomena have changed over a long period ▫ They are interested in data collected in the past • Scientists/Agencies planning a new sea trial ▫ It can be useful to know what has been collected in previous sea trials and how the data have been elaborated. Data reusability: unplanned use of data, long-term perspective
    6. Potential customers' point of view. These customers were not involved in the sea trials; thus, when searching for data, they wonder: • Is the data collected at NURC suitable for the application I have in mind? • Is the data reliable enough? To answer these macro questions: • Users need details about how the data has been acquired, pre-processed, integrated and analyzed, and even who was in charge of which part…
    7. Metadata Complexity in the Real World: linked data helps in sharing complex metadata. [Figure labels: Models/Data, Processes, People, Sensors, Characteristics. Annotated metadata: sensor's responsible party; sensor settings; parameters and choices made during the preprocessing; analysis applied and its parameters. Vocabularies shown: APO, FOAF, ISO19115 Core, Test Plan, Dublin Core, SensorML]
    8. Problem: keep the bar balanced!! On one side, a huge amount of ontology-driven metadata describing complex features as linked data; on the other, semantic similarity as metadata analysis to support the user in comparing the features of candidate resources
    9. Semantic similarity as a metadata analysis tool • Instance similarity is fundamental to support detailed comparison, ranking and selection of resources through their ontology-driven metadata ▫ Albertoni R., De Martino M., Asymmetric and context-dependent semantic similarity among ontology instances, Journal on Data Semantics X, Springer Verlag, (2008). • Explicitly addressing ▫ Context as explicit parameterization of the similarity assessment  Context specifies which features to consider and how ▫ Asymmetry to highlight containment between resources  Sim(A,B) ranges over [0,1] and measures how many features A shares with B out of all of A's features  If features(A) are contained in features(B): sim(A,B)=1 and sim(B,A)<1 • Limitation: it was not designed for linked data, but for a locally stored ontology-driven repository with one well-defined schema
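The asymmetry property on slide 9 can be illustrated with a minimal Python sketch. It assumes that features are plain sets and that similarity is the fraction of A's features also found in B. The measure published in the cited JoDS paper is richer and context-dependent, so the function name, the empty-set convention and the feature labels below are illustrative assumptions only.

```python
def containment_similarity(features_a: set, features_b: set) -> float:
    """Fraction of A's features that B also has; asymmetric, in [0, 1]."""
    if not features_a:
        # Assumption: a resource with no features shares nothing.
        return 0.0
    return len(features_a & features_b) / len(features_a)


# Hypothetical feature sets: features(A) is contained in features(B).
features_a = {"sensor:ctd", "process:calibration"}
features_b = {"sensor:ctd", "process:calibration", "model:acoustic"}

sim_ab = containment_similarity(features_a, features_b)  # all of A is in B
sim_ba = containment_similarity(features_b, features_a)  # B has extra features
```

With these sets, `sim_ab` equals 1 while `sim_ba` is below 1, which is the containment behaviour the slide describes.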
    10. How to make Semantic Similarity scale up to the web of data? (1/2) Identified issues and research plan. Issue: non-authoritative metadata, i.e. metadata published by actors who are neither the resource producers nor the owners. WHEN: metadata documents resources that have been re-elaborated or reviewed by third parties. Research plan: synergies with semantic web indexes (e.g., SINDICE) to retrieve non-authoritative features. Issue: heterogeneous metadata, i.e. metadata provided according to different, sometimes interlinked, more often overlapping metadata vocabularies. WHEN: metadata for a resource is provided by stakeholders with different fields of competency, who may use different vocabularies that are not always independent. Research plan: deploying schema- and entity-level consolidation, using both explicit metadata statements and implicit equivalences mined from co-occurring resource annotations.
    11. How to make Semantic Similarity scale up to the web of data? (2/2) Identified issues and research plan. Issue: non-consistently identified metadata, i.e. metadata occurring when the same resource has different identifiers in distinct metadata sets. WHEN: two actors in the pipeline independently document the same resource at different stages of the pipeline. Research plan: • reasoning techniques to be applied to web datasets, e.g., to smush fragments of distributed metadata • scripts to interlink resources relying on a-priori knowledge about how datasets have been originated. Issue: efficiency and computational issues; in the longer term, an accurate similarity assessment might become computationally prohibitive. WHEN: the number of resources discovered and features considered increases. Research plan: • caching of intermediate comparisons • techniques to prune comparisons according to a specified application context • algorithms for efficient parallelization can be studied
    12. Exploratory phase • Facing the aforementioned issues is a very challenging research plan!!! • Let's get first-hand experience of the varieties introduced by data providers ▫ Requirements:  Real metadata published as linked data  Provided by third parties • Linked data provides huge potential for documenting resources produced in complex pipelines, but it is not yet a common practice ▫ We considered a simpler domain (researchers and their publications)  Semantic Web Dog Food-SWDF (http://data.semanticweb.org/)  DBLP in RDF (http://dblp.l3s.de/d2r).
    13. Instance similarity: redesigned prototype • A test bed for experimenting with and investigating the aforementioned issues • Extension ▫ Extended the notion of context to include namespaces, in order to consider properties from different RDF schemas ▫ Updated the ontology model, moving from ProtegeAPI to an RDF model  JENA Reasoner and SPARQL
    14. Context: we compare researchers by their URIs. Researcher X (URI(X)=A), Researcher Y (URI(Y)=B). <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> ; foaf:made <paperC>; foaf:made <paperD>; foaf:made <paperH>. <B> rdfs:label "B descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <vbn> ; foaf:made <paperC>; foaf:made <paperF>. PREFIX foaf: <http://xmlns.com/foaf/0.1/> [foaf:Person]->{{},{(foaf:made, Count)}} Two researchers are similar to the extent that they have a similar number of publications (A has 3, B has 2). SIM(X,Y)= SIM(A,B)= 2/max(3,2)=2/3 SIM(Y,X)= SIM(B,A)= 3/max(3,2)=1 See R. Albertoni, M. De Martino, JoDS X, 2008 for more complex similarity assessments!!
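The arithmetic on slide 14 can be reproduced with a small Python sketch. It assumes the Count criterion reduces to sim(x, y) = count(y) / max(count(x), count(y)), which is how the two worked values on the slide read. The triples are the toy statements from the slide held in a plain list, not a real RDF store, and the helper names are hypothetical.

```python
from fractions import Fraction

FOAF_MADE = "http://xmlns.com/foaf/0.1/made"

# Toy statements from the slide: A made 3 papers, B made 2.
triples = [
    ("A", FOAF_MADE, "paperC"),
    ("A", FOAF_MADE, "paperD"),
    ("A", FOAF_MADE, "paperH"),
    ("B", FOAF_MADE, "paperC"),
    ("B", FOAF_MADE, "paperF"),
]


def count_values(triples, subject, prop):
    """Number of distinct objects that `subject` has for `prop`."""
    return len({o for s, p, o in triples if s == subject and p == prop})


def count_similarity(triples, x, y, prop=FOAF_MADE):
    """sim(x, y) = count(y) / max(count(x), count(y)), as read off the slide."""
    cx = count_values(triples, x, prop)
    cy = count_values(triples, y, prop)
    if max(cx, cy) == 0:
        # Assumption: two resources without values are treated as dissimilar.
        return Fraction(0)
    return Fraction(cy, max(cx, cy))


sim_ab = count_similarity(triples, "A", "B")  # Fraction(2, 3)
sim_ba = count_similarity(triples, "B", "A")  # Fraction(1, 1)
```

Adding one more `foaf:made` statement for B to the list raises `sim_ab` to 1, which is the effect that non-authoritative metadata can have on the result.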
    15. Non-Authoritative Metadata - Example. URI(Giovanni)=A, URI(Renaud)=B. <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> ; foaf:made <paperC>; foaf:made <paperD>; foaf:made <paperH>. <B> rdfs:label "B descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <vbn> ; foaf:made <paperC>; foaf:made <paperF>. Let's compare Giovanni and Renaud starting from their URIs in DBLP: A= http://dblp.l3s.de/…../Giovanni_Tummarello B= http://dblp.l3s.de/…./Renaud_Delbru But we know that Semantic Web Dog Food (SWDF) might provide more info about Giovanni and Renaud. What if SWDF provides an additional paper for Renaud, a paper which Giovanni is not coauthoring? SIM(Giovanni, Renaud)=1 instead of 2/3….
    16. Non-Authoritative Metadata - SINDICE. IDEA: query SINDICE by the researchers' URIs A, B to get the RDF fragments pertaining to Giovanni and Renaud • URIs, not names, are used as keywords: different people might share the same name, and URIs are in principle more precise. Result: you get RDF fragments from DBLP only!!! None from Semantic Web Dog Food, which we know provides further info. The SWDF researchers' URIs and the DBLP researchers' URIs do not overlap!!! First lesson: non-authoritative metadata and non-consistently identified metadata are tightly inter-related in real practice. To deal effectively with the former issue, we often have to take care of the latter.
    17. How to move on? IDEA: if SWDF added statements like <http://data.semanticweb.org/person/name-[middlename]-[familyname]> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/name_[middle-name]_familyname> (SWDF URI owl:sameAs DBLP URI), the SWDF fragments would have been retrieved by SINDICE. [We are implicitly assuming some reasoning, e.g.: (X owl:sameAs X1) and (X1 rel Z) -> (X rel Z) ]
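The reasoning step assumed on slide 17, (X owl:sameAs X1) and (X1 rel Z) -> (X rel Z), can be sketched in Python over plain triples. This is a single-pass illustration under simplifying assumptions: it does not add the symmetry or transitivity of owl:sameAs, it is not how the Jena reasoner is configured in the prototype, and the `dblp:`/`swdf:` identifiers are made-up placeholders.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_MADE = "http://xmlns.com/foaf/0.1/made"


def apply_same_as(triples):
    """Add (X rel Z) for every (X owl:sameAs X1) and (X1 rel Z); one pass only."""
    original = set(triples)
    inferred = set(original)
    links = [(s, o) for s, p, o in original if p == OWL_SAME_AS]
    for x, x1 in links:
        for s, p, o in original:
            if s == x1 and p != OWL_SAME_AS:
                inferred.add((x, p, o))
    return inferred


# Placeholder identifiers, not real DBLP or SWDF URIs.
facts = [
    ("dblp:A", OWL_SAME_AS, "swdf:A"),
    ("swdf:A", FOAF_MADE, "swdf:paperE"),
]
closure = apply_same_as(facts)
```

After the pass, `closure` also contains `("dblp:A", FOAF_MADE, "swdf:paperE")`, so a similarity assessment that starts from the DBLP identifier can see the statement published about the SWDF one.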
    18. Heterogeneous metadata - Example. RDF for Giovanni in DBLP: <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> ; foaf:made <paperC>; foaf:made <paperD>; foaf:made <paperH>. RDF for Giovanni in SWDF: <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> . <paperE> foaf:maker <A>. <paperB> foaf:maker <A>. Here the problem does not arise from different RDF schemas: both DBLP and SWDF deploy FOAF. However, foaf:made is owl:inverseOf foaf:maker, and you cannot know it if you don't dereference/load the FOAF schema. Second lesson: the ontologies/schemas/properties in the context must be dereferenced as much as the entities' URIs to make the semantics of properties exploitable.
    19. We must be careful when dereferencing • Dereferencing schemata and URIs ▫ is extremely slow ▫ adds many RDF statements which might be useless for the semantic similarity assessment  Info not pertaining to the specified context ▫ ends up with a huge amount of derived RDF statements, which might worsen efficiency and computational problems. Third lesson: specific, context-driven policies to dereference URIs and retrieve RDF fragments should be deployed in order to ease efficiency and computational problems. For example: dereference only the properties mentioned in the context, or consider only the RDF fragments returned by SINDICE with an explicit reference to the schemas mentioned in the context.
    20. How to move on? RDF for Giovanni in DBLP: <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> ; foaf:made <paperC>; foaf:made <paperD>; foaf:made <paperH>. RDF for Giovanni in SWDF: <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc>. <paperE> foaf:maker <A>. <paperB> foaf:maker <A>. Inferred: <A> foaf:made <paperE>. <A> foaf:made <paperB>. This assumes we have dereferenced foaf:maker, or loaded into the reasoner a rule saying (P foaf:maker X) -> (X foaf:made P)
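The rule on slide 20, (P foaf:maker X) -> (X foaf:made P), can be sketched as a small Python function over plain triples. It hard-codes the single inverse pair from the slide rather than reading owl:inverseOf from the dereferenced FOAF schema, so it only illustrates the inference step and is not the prototype's reasoner setup.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
FOAF_MAKER = FOAF + "maker"
FOAF_MADE = FOAF + "made"


def apply_maker_inverse(triples):
    """For every (P foaf:maker X), add the inverse statement (X foaf:made P)."""
    result = set(triples)
    for s, p, o in triples:
        if p == FOAF_MAKER:
            result.add((o, FOAF_MADE, s))
    return result


# SWDF-style statements from the slide, using the same toy identifiers.
swdf = [
    ("paperE", FOAF_MAKER, "A"),
    ("paperB", FOAF_MAKER, "A"),
]
expanded = apply_maker_inverse(swdf)
made_by_a = {o for s, p, o in expanded if s == "A" and p == FOAF_MADE}
```

Here `made_by_a` contains `paperE` and `paperB`, so a context that counts `foaf:made` values now sees the SWDF papers as well.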
    21. Non-consistently identified metadata. What if the same publication is provided by both DBLP and SWDF? E.g., DBLP:paperC and SWDF:paperB are two URIs for the same paper. We count it twice. RDF for Giovanni in DBLP + RDF for Giovanni in SWDF: <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc> ; foaf:made <DBLP:paperC>; foaf:made <DBLP:paperD>; foaf:made <DBLP:paperH>. <A> rdfs:label "A descr" ; dc:license <http://vb.com> ; foaf:primaryTopic <xzc>. <A> foaf:made <SWDF:paperE>. <A> foaf:made <SWDF:paperB>. Fourth lesson: non-consistently identified metadata is a recursive problem. Consolidating researchers without consolidating papers leads to wrong similarity results. We must be sure that the entities and properties in the similarity context have been properly consolidated before applying instance similarity.
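The double counting on slide 21 can be shown with a short Python sketch. It assumes that the equivalence between DBLP:paperC and SWDF:paperB is already known and given as a dictionary; discovering such equivalences is the hard part and is not covered here. The helper names are hypothetical.

```python
FOAF_MADE = "http://xmlns.com/foaf/0.1/made"

# Assumed a-priori knowledge: both URIs identify the same paper.
SAME_PAPER = {"SWDF:paperB": "DBLP:paperC"}

# Merged statements about Giovanni (A) from DBLP and SWDF, as on the slide.
merged = [
    ("A", FOAF_MADE, "DBLP:paperC"),
    ("A", FOAF_MADE, "DBLP:paperD"),
    ("A", FOAF_MADE, "DBLP:paperH"),
    ("A", FOAF_MADE, "SWDF:paperE"),
    ("A", FOAF_MADE, "SWDF:paperB"),
]


def canonical(uri, same_as):
    """Follow the equivalence mapping until a representative URI is reached."""
    seen = set()
    while uri in same_as and uri not in seen:
        seen.add(uri)
        uri = same_as[uri]
    return uri


def count_made(triples, subject, same_as=None):
    """Count distinct papers made by `subject`, optionally after consolidation."""
    mapping = same_as or {}
    return len({canonical(o, mapping)
                for s, p, o in triples
                if s == subject and p == FOAF_MADE})


naive_count = count_made(merged, "A")                     # counts the paper twice
consolidated_count = count_made(merged, "A", SAME_PAPER)  # counts it once
```

Without consolidation the sketch counts five papers; with the mapping it counts four, which is the value the Count criterion should work from.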
    22. Conclusion (I) • Linked data best practice and our semantic similarity ▫ have good potential to support data selection for complex domain resources • But scaling semantic similarity up to the web of data means dealing with ▫ Non-authoritative metadata ▫ Heterogeneous metadata ▫ Non-consistently identified metadata ▫ Efficiency and computational issues
    23. Conclusion (II) • The exploratory phase shows that ▫ All the mentioned issues arise even in a very simple scenario for assessing semantic similarity ▫ It is pivotal to have first-hand experience with real data to discover the shapes the issues might assume • Consideration ▫ The problems we found are not exclusive to similarity assessment  We suspect these issues arise whenever you try to elaborate information published as linked data in order to mine new facts/info from the published data
    24. Do not hesitate to email me (Albertoni@ge.imati.cnr.it) if you have offline questions
