
Profile-based Dataset Recommendation for RDF Data Linking


With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, their schemas, and their data dynamicity (respectively schemas and metadata) over time. To this extent, identifying suitable datasets that meet specific criteria has become an increasingly important, yet challenging, task to support issues such as entity retrieval, semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to the semantic web tradition in dealing with "finding candidate datasets to link to", where data publishers are used to identify target datasets for interlinking.
While an understanding of the nature of the content of specific datasets is a crucial prerequisite for the mentioned issues, we adopt in this dissertation the notion of "dataset profile": a set of features that describe a dataset and allow the comparison of different datasets with regard to their represented characteristics. Our first research direction was to implement a collaborative filtering-like dataset recommendation approach, which exploits both existing dataset topic profiles and traditional dataset connectivity measures, in order to link LOD datasets into a global dataset-topic-graph. This approach relies on the LOD graph in order to learn the connectivity behaviour between LOD datasets. However, experiments have shown that the current topology of the LOD cloud group is far from being complete enough to be considered as a ground truth and consequently as learning data.
Facing the limits of the current topology of LOD (as learning data), our research has led us to break away from the topic profiles representation of the "learn to rank" approach and to adopt a new approach for candidate dataset identification, where the recommendation is based on the intensional profiles overlap between different datasets. By intensional profile, we understand the formal representation of a set of schema concept labels that best describe a dataset and can be potentially enriched


Profile-based Dataset Recommendation for RDF Data Linking

  1. 1. Profile-based Dataset Recommendation for RDF Data Linking. PhD Thesis Defense of Mohamed BEN ELLEFI, LIRMM, Montpellier, France, 19/12/2016. Thesis Supervisors: Zohra BELLAHSENE, Konstantin TODOROV.
  2. 2. This work is financed by Datalyse (http://www.datalyse.fr/), in which 7 industrial and academic partners join forces to discover, test, and implement big open data.
  3. 3. Outline: Context; Vocabulary Recommendation with Datavore; Datasets Recommendation: Problem Statement; Datasets Recommendation: Topic Profile-based Approach; Datasets Recommendation: Intensional Profile-based Approach; Conclusion & Open Issues.
  6. 6. [Figure: example graph whose edges are labeled sameAs, Knows and worksOn.]
  7. 7. Web Evolving: from the Web of Hypertext (documents connected by hyperlinks) to Linked Data. The Linking Open Data cloud diagram (http://lod-cloud.net/): 12 datasets in 2007, 570 datasets in 2014.
  8. 8. Linked Data Life-cycle, from Raw Data to Published Linked Data. Stages: Modeling, Conversion, Publishing, Interlinking, Maintaining.
  9. 9. Linked Data Life-cycle (1/4), Modeling: Vocabulary Search, Vocabulary Terms Selection, Vocabulary Edition.
  10. 10. Linked Data Life-cycle (2/4), Conversion: transforming information from the raw data source to RDF data using the selected vocabulary.
  11. 11. Linked Data Life-cycle (3/4), Publishing: hosting the linked dataset and its metadata publicly and making it accessible.
  12. 12. Linked Data Life-cycle (4/4), Interlinking: Datasets Search, Candidate Selection, Data Linking.
  13. 13. Focus on: Recommending Vocabulary Terms (Modeling stage) and Recommending Candidate Datasets (Interlinking stage).
  15. 15. Focus on: Recommending Vocabulary Terms (Modeling stage).
  17. 17. Motivation: Modeling Linked Data. “... whatever the domain of your vocabulary, someone else has probably done it already.” (Cookbook for translating relational data models to RDF schemas). Reusing existing vocabularies: ontology search engines, e.g., the trusted LOV (http://lov.okfn.org/), with information on more than 500 vocabularies; ontology development tools, e.g., Protégé (http://protege.stanford.edu). Open questions: what keywords to use for the search, how to select vocabularies, and which metadata can help for modeling.
  18. 18. Datavore Ecosystem. Input: a textual dataset. Components: Translator API services (cleaning, translating), keyword-based LOV search, and the LOV (Linked Open Vocabularies) SPARQL endpoint. Processing steps: terms search, terms extraction, metadata extraction, triples extraction. Output: ranked lists of vocabulary terms, the corresponding metadata, and triples suggestions.
  19. 19. Datavore Tool. A GUI desktop application: http://www.lirmm.fr/benellefi/Datavore_ExeFile. A demonstration video: http://www.lirmm.fr/benellefi/Datavore_VideoDemo. Reference: Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. BDA 2015; also (Posters & Demos) ISWC 2015.
  21. 21. Focus on: Recommending Candidate Datasets (Interlinking stage).
  23. 23. Entity Linking Challenges. “A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider.” (Source: https://www.w3.org/TR/void/#dataset). Data reuse and in-links are focused on trusted reference graphs, e.g., DBpedia, Freebase, etc. Few datasets are actually used, leaving a long tail of potentially suitable yet under-recognized datasets.
  24. 24. Candidate Datasets Selection: Problem Statement (1/2). Owner of the source dataset: “How to find candidates to link my lovely dataset?”
  25. 25. Candidate Datasets Selection: Problem Statement (2/2). Dataset recommendation for data linking is the task of computing a rank score for each of a set of target datasets with respect to a source dataset. The rank score indicates the relatedness between the source and the target dataset. Owner of the source dataset, on receiving the candidates: “Thank you for the recommendations.”
  28. 28. Related Work. (1) Nikolov and d'Aquin, 2011: a keyword-based search approach that (i) extracts literals from instances of the source dataset and searches Sig.ma for potentially relevant entities, then (ii) filters out irrelevant datasets by measuring semantic concept similarities (OM). Limitation: Sig.ma is currently down. (2) Mehdi et al., 2014: (a) the input is a set of domain-specific keywords provided manually by an expert; (b) for each keyword, the system runs a comparison against a set of eight queries: {original-case, proper-case, lower-case, upper-case} × {no-lang-tag, @en-tag}; (c) the output is a list of target datasets. Limitation: costly input. (3) Leme et al., 2013: the ranking is based on Bayesian criteria and on the popularity (existing links) of the datasets. Limitation: cold start problem. (4) Lopes et al., 2014: extends (3) by exploring the correlation between different sets of features (properties, classes and vocabularies) and the links, in order to compute new rank score functions; recall of 100%, MAP of 60%. Limitation: efficiency to be improved. Goals of this work: to deal with real-world LOD datasets and to provide a new recommender system with greater efficiency.
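The eight query variants attributed to Mehdi et al. can be enumerated as a Cartesian product of four casings and two language-tag options. The following is only an illustrative sketch: the function name `keyword_variants`, the use of Python's `str.title()` for "proper-case", and the literal syntax are assumptions, not details taken from the slides.

```python
from itertools import product


def keyword_variants(keyword: str) -> list[str]:
    """Return 4 case variants x 2 language-tag variants of a keyword.

    Hypothetical helper: "proper-case" is approximated with str.title().
    """
    cases = [keyword, keyword.title(), keyword.lower(), keyword.upper()]
    tags = ["", "@en"]  # no language tag, English language tag
    # Render each variant as an RDF-style literal, e.g. "drug"@en.
    return [f'"{case}"{tag}' for case, tag in product(cases, tags)]


variants = keyword_variants("open Data")
```

Note that a keyword that is already lower-case yields repeated entries, so a real system would probably deduplicate the list before querying.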
  30. 30. Profile-based Recommendation: Motivation. Homer Simpson vs. Peter Griffin: users with a similar profile (taste) show the same buying behaviour, so an item bought by one can be recommended to the other.
  31. 31. Profile-based Recommendation: Motivation. Hypothesis for Linked Data: if two datasets are strongly similar (profile-based similarity), we can consider that they may have the same connectivity behaviour. Open question: what is a dataset profile going to be?
  32. 32. Semantic Web Datasets Profiling. An RDF dataset profile can be seen as the formal representation of a set of features that describe a dataset and allow the comparison of different datasets with regard to their characteristics. The feature set is dependent on a given application scenario and task (*). Dataset Profile Features: Semantic (Domain/Topic, Context, Index elements, Schema/Instances); Qualitative (Trust, Accessibility, Representation, Context, Degree of connectivity); Statistical (Schema Level, Instance Level); Temporal (Global, Instance-specific, Semantics-specific). (*) Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, Konstantin Todorov. Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies. Major revision status in the Semantic Web Journal.
  33. 33. Topic Dataset Profile (semantic feature: Topic/Domain). [Figure: datasets (Semantic Web Dog Food, l3s-dblp) connected to topics (Semantic Web, Data management, WWW Consortium standards, Information retrieval).] Reference: B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl. A scalable approach for efficiently generating structured dataset topic profiles. In Proceedings of the 11th ESWC 2014.
  34. 34. Topic Profile-based Datasets Recommendation. Steps 1-3: preprocessing/learning step. Step 4: ranking system.
  37. 37. Learning/Preprocessing. [Figure: source datasets, their topics and topic weights, and the connectivity measure Connectivity(Di, Dj) between datasets.]
  38. 38. Learning/Preprocessing: Topics Signatures. A dataset is modeled as a set of topics: a dataset's profile. Inversely, a topic is modeled as a set of datasets assigned to it: a topic's signature. σ is a connectivity behaviour measure.
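The profile/signature duality described on this slide amounts to inverting a dataset-to-topics mapping into a topic-to-datasets mapping. A minimal sketch, with hypothetical dataset and topic names (`D1`, `t1`, ...) and a hypothetical function name; topic weights and the σ measure are deliberately left out:

```python
from collections import defaultdict


def topic_signatures(profiles: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert dataset profiles (dataset -> topics) into signatures (topic -> datasets)."""
    signatures: dict[str, set[str]] = defaultdict(set)
    for dataset, topics in profiles.items():
        for topic in topics:
            signatures[topic].add(dataset)
    return dict(signatures)


# Toy profiles, invented for illustration only.
profiles = {"D1": {"t1", "t2"}, "D2": {"t2", "t3"}}
signatures = topic_signatures(profiles)
```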
  39. 39. Target Datasets Ranking. Let D0 be a new dataset to be linked: (1) extract Profile(D0); (2) constitute a pool of target datasets from the signatures (the result of the learning step); (3) rank the target datasets (the ranking formula shown on the slide is not preserved in this transcript).
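Steps 1 and 2 above can be sketched as a lookup of the topics of D0 in the learned signatures. The ranking in step 3 below is only a stand-in: it orders candidates by the number of topics they share with D0, which is an assumption made for illustration and not the σ-based score of the original approach. All names and data are hypothetical.

```python
def candidate_pool(profile_d0: set[str], signatures: dict[str, set[str]]) -> set[str]:
    """Collect every dataset that appears in the signature of a topic of D0."""
    pool: set[str] = set()
    for topic in profile_d0:
        pool |= signatures.get(topic, set())
    return pool


def rank_by_shared_topics(profile_d0: set[str], signatures: dict[str, set[str]]) -> list[str]:
    """Stand-in ranking (assumption): more shared topics first, ties broken by name."""
    shared: dict[str, int] = {}
    for topic in profile_d0:
        for dataset in signatures.get(topic, set()):
            shared[dataset] = shared.get(dataset, 0) + 1
    return sorted(shared, key=lambda d: (-shared[d], d))


# Toy signatures and profile, invented for illustration only.
toy_signatures = {"t1": {"D1"}, "t2": {"D1", "D2"}, "t3": {"D3"}}
profile_d0 = {"t1", "t2"}
pool = candidate_pool(profile_d0, toy_signatures)
ranking = rank_by_shared_topics(profile_d0, toy_signatures)
```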
  40. 40. Experimental Setup. Training set: the topic profiles graph, from its available SPARQL endpoint http://data-observatory.org/lod-profiles/profile-explorer (76 datasets and 185 392 topics). The evaluation data (ED): the current topology of the LOD cloud, obtained using the datahub2void tool (https://github.com/lod-cloud/datahub2void); we made the ED available at http://www.lirmm.fr/benellefi/void.ttl. Testing set: the source datasets are all the 76 datasets indexed by the topic profiles graph; the target datasets are 258 datasets from the LOD cloud group (http://datahub.io/group/lodcloud).
  41. 41. Evaluation Framework: leave-one-out (5-fold cross-validation). (1) Select a source dataset in the evaluation data. (2) Consider the dataset as unlinked, i.e., ignore its owl:sameAs links to Target 1, Target 2, ..., Target n. (3) Recommend target datasets using our system. (4) Evaluate the recommendation against the actual targets.
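The four evaluation steps can be sketched as a loop that hides one source dataset's links at a time. This is a simplified sketch under assumptions: the `links` dictionary, the `recommend` callback signature and the toy recommender `popular_targets` are all hypothetical and stand in for the evaluation data and the actual system.

```python
def leave_one_out(links, recommend):
    """links: dataset -> set of targets it is actually linked to.
    recommend: callable(source, visible_links) -> iterable of recommended targets.
    Returns dataset -> (precision, recall)."""
    results = {}
    for source, truth in links.items():
        if not truth:
            continue  # nothing to evaluate for an unlinked dataset
        # Steps 1-2: pick a source and treat it as unlinked.
        visible = {d: t for d, t in links.items() if d != source}
        # Step 3: ask the recommender for target datasets.
        recommended = set(recommend(source, visible))
        # Step 4: compare with the hidden links.
        hits = len(recommended & truth)
        precision = hits / len(recommended) if recommended else 0.0
        results[source] = (precision, hits / len(truth))
    return results


def popular_targets(source, visible):
    """Toy recommender (assumption): every target linked by any other dataset."""
    return sorted(set().union(*visible.values()))


toy_links = {"A": {"X", "Y"}, "B": {"X"}, "C": {"Y", "Z"}}
scores = leave_one_out(toy_links, popular_targets)
```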
  42. 42. Evaluation Results (1/3): Recall/Precision/F1-Score over all DS ∈ ED. Average recall: 81%; 59% of the DS have a recall of 100%. Average precision: 19%. Open question: what about the false positive rate?
  43. 43. Evaluation Results (2/3): FP-Rate over all DS ∈ ED. FP overestimation: a small average FP-Rate of 13%.
  44. 44. Evaluation Results (3/3): Search Space Reduction over all DS ∈ ED. The original search space size is 258 datasets; the average search space reduction is up to 86%.
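The measures reported on these three slides can be computed per source dataset from three sets. The sketch below uses the standard definitions of precision, recall, F1 and false positive rate; the search space reduction formula (one minus the fraction of the search space that is recommended) is an assumption, since the slides do not define it. The function name and toy data are hypothetical.

```python
def evaluate(recommended: set, relevant: set, search_space: set) -> dict:
    """Assumes recommended and relevant are both subsets of a non-empty search_space."""
    tp = len(recommended & relevant)
    fp = len(recommended - relevant)
    fn = len(relevant - recommended)
    tn = len(search_space - recommended - relevant)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Assumed definition: share of the search space that was pruned away.
        "reduction": 1 - len(recommended) / len(search_space),
    }


# Toy example: 10 candidate datasets, 4 recommended, 3 truly linked.
metrics = evaluate({0, 1, 2, 3}, {0, 1, 9}, set(range(10)))
```

Averaging these per-dataset values over all source datasets gives figures of the kind reported above; note that an average of per-dataset F1 scores generally differs from the F1 of the averaged precision and recall.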
  45. 45. Baselines & Comparison (1/3). Baselines are available at http://www.lirmm.fr/benellefi/Baselines.rar.
  46. 46. Baselines & Comparison (2/3): recall values of our approach vs. baselines over all DS ∈ ED. Baselines fail to provide any results at all for some datasets. Our approach is more stable and outperforms the baselines in the majority of recommendations.
  48. 48. Baselines & Comparison (3/3). AVG Precision, Recall and F1-score values over all recommendation lists for all source datasets:
      Approach               | AVG Precision | AVG Recall | AVG F1-Score
      Our approach           | 19%           | 81%        | 24%
      Shared Keywords        | 9%            | 41%        | 10%
      Shared Linksets        | 9%            | 11%        | 8%
      Shared Topics Profiles | 3%            | 13%        | 4%
      Note: the baseline approaches have produced better results than our system in a limited number of cases. The shared tags baseline generated an F-Score of 100% on two datasets, because these two datasets are tagged by the same provenance (data.oceandrilling.org) and therefore share the same set of tags.
  50. 50. Discussion: topic profiles-based dataset recommendation approach. Advantages: original search space reduction; average recall of 81% and average precision of 19%; ranking results available at http://www.lirmm.fr/benellefi/results.csv. Challenges: precision needs to be improved; the learning data is not complete. Next step: breaking up with the learning step. Reference: Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. ICWE 2016.
  52. 52. Motivation: coreference resolution. Different datasets may contain different resources that refer to the same real-world entity. Following the LD best practices, these resources are generally represented by the same or similar types. Hypothesis: datasets that share at least one pair of similar concepts are likely to contain at least one potential pair of instances to be linked, i.e., an “owl:sameAs” statement.
  54. 54. Hypothesis: An Example. The resource Montpellier is typed as dbo:Place, dbo:Location, dbo:Settlement, schema:Place, umbel-rc:PopulatedPlace, yago:Commune108541609, … in dbpedia.org; as yago:District, yago:Municipality, yago:Region, yago:Town, yago:Commune, … in yago-knowledge.org; and as lgeod:Place, lgeod:City, … in linkedgeodata.org. Similar concepts across the datasets: City ≈ Town, Place = Place, Settlement ≈ Commune.
  55. 55. Similarity Measures to Use. Known from ontology matching, WordNet-based similarity: Wu Palmer; Lin. UMBC measure [1]: combines semantic distance in WordNet with frequency of occurrence and co-occurrence of terms in a large external corpus (the web). [1] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. of *SEM, Association for Computational Linguistics, 2013.
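For reference, the two WordNet-based measures named on this slide are usually defined as follows. These are the standard textbook formulations, not formulas shown in the slides: lcs(c1, c2) denotes the least common subsumer of the two concepts in the taxonomy, depth is the depth of a concept in the taxonomy, and IC is the information content estimated from a corpus probability p(c).

```latex
\[
\mathrm{sim}_{\mathrm{WuPalmer}}(c_1, c_2) =
  \frac{2 \cdot \mathrm{depth}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
       {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
\]
\[
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) =
  \frac{2 \cdot \mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
       {\mathrm{IC}(c_1) + \mathrm{IC}(c_2)},
\qquad \mathrm{IC}(c) = -\log p(c)
\]
```

Both measures range over [0, 1] and equal 1 when the two concepts coincide, which is what makes a single threshold θ per measure meaningful.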
  56. 56. Intensional Approach to Datasets Recommendation: (1) Preprocessing; (2) Target Datasets Filtering, which produces the CCD (Cluster of Comparable Datasets); (3) Datasets Ranking (cosine).
  57. 57. Intensional Approach to Datasets Recommendation (1/3): Preprocessing (profiling of DS and DT). Dataset Label Profile, PL(DS): a set of n schema concept labels corresponding to DS. Dataset Document Profile, PD(DS): the concatenation of PL(DS) and the textual descriptions of the n schema concepts. Example: Montpellier ∈ DS; <Montpellier, rdf:type, dbo:Town>; PL(DS) = "town", …; PD(DS) = “… Usually, a town is thought of as larger than a village but smaller than a city, though there are exceptions to this rule.”
  58. 58. Intensional Approach to Datasets Recommendation (2/3): Target Datasets Filtering. Two datasets DS and DT are comparable if there exists at least one pair of similar labels between their label profiles, as measured by sim(PL(DS), PL(DT)). We identify CCD(DS), a Cluster of Comparable Datasets related to DS. All the linking candidates for DS are found in its cluster CCD(DS). The next step consists of ranking the DT's in CCD(DS).
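The filtering step can be sketched as a pairwise label comparison with a pluggable similarity function. In this sketch, `toy_sim` and its scores are invented stand-ins for Wu Palmer, Lin or UMBC, and the use of a threshold `theta` with a "greater than or equal" test is an assumption based on the threshold intervals discussed in the evaluation.

```python
from typing import Callable

# Invented similarity scores, for illustration only.
TOY_SCORES = {frozenset(("city", "town")): 0.8}


def toy_sim(a: str, b: str) -> float:
    """Stand-in for a WordNet-based measure: 1.0 for identical labels."""
    return 1.0 if a == b else TOY_SCORES.get(frozenset((a, b)), 0.0)


def comparable(pl_s: set[str], pl_t: set[str],
               sim: Callable[[str, str], float], theta: float) -> bool:
    """True if at least one label pair reaches the similarity threshold."""
    return any(sim(a, b) >= theta for a in pl_s for b in pl_t)


def ccd(source: str, label_profiles: dict[str, set[str]],
        sim: Callable[[str, str], float], theta: float) -> set[str]:
    """Cluster of Comparable Datasets for the given source dataset."""
    return {name for name, pl in label_profiles.items()
            if name != source and comparable(label_profiles[source], pl, sim, theta)}


label_profiles = {"DS": {"city", "place"}, "DT1": {"town"}, "DT2": {"protein"}}
cluster = ccd("DS", label_profiles, toy_sim, theta=0.7)
```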
  59. 59. Intensional Approach to Datasets Recommendation (3/3): Datasets Ranking. (1) Form a corpus from the profile documents: PD(DS) and all PD(DT). (2) Build a vector space model by indexing the documents in the corpus. (3) Compute TF-IDF and the cosine similarity between the document vectors in the corpus. (4) Rank each DT in the cluster CCD(DS) with respect to DS. A mapping between datasets is returned based on their label profiles: PL(DS), PL(DT).
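The four ranking steps can be sketched with a small, dependency-free TF-IDF implementation. The weighting variant used here (term frequency normalised by document length, multiplied by the natural log of N/df), the whitespace tokenisation and the toy documents are all assumptions; the slides do not say which TF-IDF variant or indexing tool was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Steps 1-3a: index the corpus and weight each term by TF-IDF."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(docs)
    df = Counter(term for toks in tokens.values() for term in set(toks))
    return {
        name: {t: (c / len(toks)) * math.log(n_docs / df[t])
               for t, c in Counter(toks).items()}
        for name, toks in tokens.items()
    }


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Step 3b: cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(source: str, cluster: list[str], docs: dict[str, str]) -> list[str]:
    """Step 4: order the datasets of the cluster by similarity to the source."""
    vecs = tfidf_vectors({name: docs[name] for name in [source, *cluster]})
    return sorted(cluster, key=lambda t: cosine(vecs[source], vecs[t]), reverse=True)


# Toy document profiles, invented for illustration only.
toy_docs = {"DS": "town city river", "DT1": "town village", "DT2": "protein gene"}
ordered = rank("DS", ["DT2", "DT1"], toy_docs)
```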
  61. 61. Experimental Setup. Leave-one-out evaluation. Datasets: LOD-Cloud (http://datahub.io/group/lodcloud), 90 responsive datasets. The profile PD: from the LOV (lov.okfn.org). Wu Palmer and Lin's measures: the 2013 WS4J Java API (https://code.google.com/archive/p/ws4j/). The UMBC measure: its available web API (http://swoogle.umbc.edu/SimService). The evaluation data (ED): the current topology of the LOD-Cloud (the ED is available at http://www.lirmm.fr/benellefi/void.ttl).
  62. 62. Evaluation Results (1/3). For each DS ∈ ED, we evaluated the selection of target datasets in the cluster CCD(DS) in terms of recall. Result: the recall value remains 100% in the following threshold θ intervals: Wu Palmer: θ ∈ [0, 0.9]; Lin: θ ∈ [0, 0.8]; UMBC: θ ∈ [0, 0.7]. In the following, the evaluation is restricted to these intervals in order to guarantee a recall of 100% on the ED.
  63. 63. Evaluation Results (2/3): MAP@R over all DS ∈ ED. Parameter tuning: Wu Palmer: θ = 0.9; Lin: θ = 0.8; UMBC: θ = 0.7.
  64. 64. Evaluation Results (3/3): Mean Precision@K over all DS ∈ ED using the three different similarity measures at their best setting:
      Measure             | P@5  | P@10 | P@15 | P@20
      Wu Palmer (θ = 0.9) | 0.56 | 0.52 | 0.53 | 0.51
      Lin (θ = 0.8)       | 0.57 | 0.54 | 0.55 | 0.51
      UMBC (θ = 0.7)      | 0.58 | 0.54 | 0.53 | 0.53
      UMBC at θ = 0.7 is the best setting for our ranking approach.
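The ranking measures used in these results can be sketched as follows. Precision@K and average precision follow their standard definitions; reading the "@R" cut-off as R = the number of relevant (actually linked) datasets for a source is an assumption, since the slides do not define it. The function names and toy ranking are hypothetical.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked: list[str], relevant: set[str]) -> float:
    """Precision at cut-off R, with R assumed to be the number of relevant datasets."""
    return precision_at_k(ranked, relevant, len(relevant))


def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant dataset occurs."""
    hits, total = 0, 0.0
    for rank, dataset in enumerate(ranked, start=1):
        if dataset in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy ranking, invented for illustration only.
toy_ranked = ["a", "b", "c", "d"]
toy_relevant = {"a", "c"}
p_at_2 = precision_at_k(toy_ranked, toy_relevant, 2)
ap = average_precision(toy_ranked, toy_relevant)
```

Mean Precision@K and MAP are then obtained by averaging these per-source values over all source datasets in the evaluation data.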
  65. 65. Baselines & Comparison (1/2). Baseline #1: all datasets are represented by their document profiles (PD). (1) A vector space model is built by indexing the PD. (2) TF-IDF and cosine similarity are computed between the document vectors. No CCD clusters are used. Baseline #2: all datasets are represented by their label profiles (PL). (1) CCD(DS) is built using UMBC at θ = 0.7. (2) AvgUMBC is a ranking function that assigns a score to each DT ∈ CCD(DS) (its formula is not preserved in this transcript). No TF-IDF and cosine similarity are used.
  66. 66. Baselines & Comparison (2/2): P@R over all DS ∈ ED for Baseline #1, Baseline #2 and the proposed approach. Efficiency with an AVG P@R of up to 53%, compared to 49% and 39% for the baselines. Precision of up to 100% for DS from the geographic and governmental domains. Baseline #2 produces better results for a limited number of datasets: the RKB Explorer datasets share a high number of identical labels in their PL.
  67. 67. Result Analysis. A high performance: an average precision of 53% for a recall of 100%; independence of the dataset size or the schema cardinality. False positives overestimation: a fairer evaluation can be given if better ED are used, since the ED is far from being complete as a ground truth. We ran “SILK” on some FP recommendations; the discovered owl:sameAs linksets involve, for example, rkb-explorer-unlocode, yovisto, datos-bcn-cl and datos-bcn-uk. Reference: Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016.
  68. 68. Topic Profiles-based vs. Intensional-based Dataset Recommendation. Topic profiles-based approach: semantic profile feature: topic profiles; learning data dependency; average recall: 81%; mean average precision: 19%; a new topic profiles propagation approach; ranking results available at http://www.lirmm.fr/benellefi/results.csv. Intensional profile-based approach: semantic profile feature: intensional profile; no learning step; average recall: 100%; mean average precision: 53%; a mapping between the schema concepts; ranking results available at http://www.lirmm.fr/benellefi/CCD-CosineRank_Result.csv.
  70. 70. Conclusion. The choice of the profile features is dependent on a given application scenario and task. Learning data dependency: the learning data is not complete, so effectiveness remains to be improved. The learning break-off: much greater efficiency. Better performance with datasets having high-quality intensional profiles: awareness of the richness of dataset schema descriptions; the importance of reusing existing vocabulary terms, e.g., tools such as Datavore can ease the vocabulary reuse task in linked data modeling.
  71. 71. Open Issues. Dataset recommendation: a reliable ground truth (e.g., crowdsourcing-based) and benchmark data; improving the quality of the intensional profiles (the population of the schema elements, the dataset context, multilingualism, …); investigating the effectiveness of ML techniques. Vocabulary recommendation: a new evaluation framework for the linked data modeling process, e.g., in a user-study manner and crowdsourcing-based; examining the vocabulary term ranking strategies based on data structure factors, e.g., tabular sources modeling vs. web pages annotation.
  72. 72. Twitter: @benellefi
  73. 73. Publications. (1) Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016; Crete, Greece. (2) Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. In Proceedings of the 16th ICWE 2016; Lugano, Switzerland. (3) Manel Achichi, Mohamed Ben Ellefi, Danai Symeonidou, Konstantin Todorov. Automatic Key Selection for Data Linking. In Proceedings of the 20th EKAW 2016; Bologna, Italy. (4) Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015; Bethlehem, PA, USA. This paper was also presented at BDA 2015, Île de Porquerolles, France. (5) Mohamed Ben Ellefi, Zohra Bellahsene, François Scharffe, Konstantin Todorov. Towards Semantic Dataset Profiling. In PROFILES@ESWC 2014; Crete, Greece.
