Advertisement
Advertisement

More Related Content

Similar to A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data(20)

Advertisement

A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data

  1. 1 A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data 9/29/2014 Petar Ristoski, Heiko Paulheim
  2. Motivation 9/29/2014 Ristoski, Paulheim 2
  3. Motivation • Many existing applications use LOD as background knowledge in data mining – Explaining data patterns and statistics: unemployment rate, inflation, energy savings, etc … – Content-based book/movies recommendation system – Classifying incident related tweets – Gene classification – Prediction of car fuel consumption 9/29/2014 Ristoski, Paulheim 3
  4. Motivation Local LOD Data link combine cleanse transform analyze 9/29/2014 Ristoski, Paulheim 4
  5. Motivation • Standard data mining algorithms require propositional feature vector representation • Feature space: V={v1,v2,…, vn} • Each instance is represented as an n-dimensional feature vector (v1,v2,…,vn), where for each 1≤ vi ≤n : – vi ∈ {true, false}, or vi ∈ {1,0} – vi ∈ ℝ – vi ∈ S, where S is a finite set of symbols 9/29/2014 Ristoski, Paulheim 5
  6. Motivation Name Person Music Artist Instrument Genre Trent Reznor 1 1 1 0 Wolfgang A. Mozart 1 1 1 1 Barack Obama 1 0 0 0 9/29/2014 Ristoski, Paulheim 6
  7. Related Work • LiDDM (Narasimha et al.) – a framework tool for Linked Data mining that capture data from LOD cloud to extract hidden information • SPARQL-ML (Kiefer et al.) – extension to SPARQL to support data mining tasks for knowledge discovery in the Semantic Web • FeGeLOD (Paulheim et al.) – Unsupervised generation of data mining features from LOD • The resulting features are binary, or numerical aggregates using SPARQL COUNT constructs • No proper evaluation of the used propositionalization strategy 9/29/2014 Ristoski, Paulheim 7
  8. PROPOSED STRATEGIES 9/29/2014 Ristoski, Paulheim 8
  9. Strategies • Strategies for features derived from specific relations – r rdf:type C – r dcterms:subject S • Strategies for features derived from generic relations 9/29/2014 Ristoski, Paulheim 9
  10. Strategies for Features Derived from Specific Relations 1. Binary feature: – vi =1 if C(r) – vi =0 if ⅂C(r) 2. Relative count feature: – vi = 1 푛 , where r has relation to n objects 3. TF-IDF feature: – vi = 1 푛 log 푁 {푟|퐶 푟 } , where N is the total number of resources in the dataset, and {푟|퐶 푟 } denotes the number of resources for which the specific relation to an object C exists 9/29/2014 Ristoski, Paulheim 10
  11. Features Derived from Specific Relations: Binary vs TF-IDF +10 Music Artists dbpedia:Person dbpedia:Artist dbpedia:MusicArtist dbpedia:MilitaryPerson dbpedia:Kris_Kristofferson dbpedia:Elvis_Presley Name Person Artist Music Artist Military Person Elvis Presley 1 1 1 1 Kris Kristofferson 1 1 1 1 Artist X 1 1 1 0 9/29/2014 Ristoski, Paulheim 11
  12. Features Derived from Specific Relations: Binary vs TF-IDF +10 Music Artists dbpedia:Person dbpedia:Artist dbpedia:MusicArtist dbpedia:MilitaryPerson dbpedia:Kris_Kristofferson dbpedia:Elvis_Presley Name Person Artist Music Artist Military Person Elvis Presley 0 0 0 0.672 Kris Kristofferson 0 0 0 0.672 Artist X 0 0 0 0 9/29/2014 Ristoski, Paulheim 12
  13. Strategies • Strategies for features derived from specific relations – r rdf:type C → C(r) – dcterms:subject • Strategies for features derived from relations as such – describe how resource r is related to resource r′ – outgoing relation: p(r, r′ ) – Incoming relation : p(r′ ,r) 9/29/2014 Ristoski, Paulheim 13
  14. Strategies for Features Derived from Generic Relations 1. Binary feature: – vi = 1 if p(r,r′) – vi = 0 if ⅂p(r) 2. Count feature: – vi = n, where r is connected to n resources with relation p 3. Relative count feature: – vi = 푛푝 푃 , where P is the total number of outgoing relations for r, and np is the number of relations of type p for r 4. TF-IDF feature: – vi = 푛푝 푃 log 푁 {푟|∃푟′:푝 푟,푟′ } , where N is the total number of resources in the dataset, and {푟|∃푟′: 푝 푟, 푟′ } denotes the number of resources for which p(r, r′) exists 9/29/2014 Ristoski, Paulheim 14
  15. Features Derived from Generic Relations : Binary vs Relative Count dbpedia:Chester_Bennington dbpedia:Anthony_Kiedis dbpedia:Jules_Verne dbpedia:instrument 8 5 1 dbpedia:author 68 Name instrument author Chester Bennington 1 0 Anthony Kiedis 1 1 Jules Verne 0 1 9/29/2014 Ristoski, Paulheim 15
  16. Features Derived from Generic Relations : Binary vs Relative Count dbpedia:Chester_Bennington dbpedia:Anthony_Kiedis dbpedia:Jules_Verne dbpedia:instrument 8 5 1 dbpedia:author 68 Name instrument author Chester Bennington 1 0 Anthony Kiedis 0.833 0.166 Jules Verne 0 1 9/29/2014 Ristoski, Paulheim 16
  17. EVALUATION 9/29/2014 Ristoski, Paulheim 18
  18. Evaluation • Comparative evaluation of the propositionalization strategies on three data mining tasks – Classification – Regression – Outlier Detection • Evaluated on six datasets, on five feature sets – types – categories – incoming relations – outgoing relations – incoming and outgoing relations 9/29/2014 Ristoski, Paulheim 19
  19. Evaluation: Classification • Datasets: Dataset # instances # types # categories # rel in # rel out # rel in & rel out Sports Tweets 5,054 7,814 14,025 3,574 5,334 8,908 Cities 212 721 999 1,304 1,081 2,385 • Methods: – Naïve Bayes – k-Nearest Neighbors (k=3) – C4.5 decision trees • Metrics for performance evaluation – Accuracy • Results calculated using stratified 10-fold cross validation 9/29/2014 Ristoski, Paulheim 20
  20. Evaluation: Classification Datasets Cities Sports Tweets Features Representation NB k-NN C4.5 Avg. NB k-NN C4.5 Avg. types Binary 55.71 56.17 59.05 56.98 81 82.9 82.95 82.28 Relative Count 57.1 49.61 55.22 53.98 80.96 81.44 81.88 81.43 TF-IDF 57.1 48.7 54.7 53.50 82.13 82.47 82.64 82.41 categories Binary 55.74 49.98 56.17 53.96 82.26 76.56 71.98 76.93 Relative Count 59.52 44.35 58.96 54.28 90.76 84.09 80.86 85.24 TF-IDF 55.74 49.98 57.08 54.27 89.65 81.98 81.68 84.44 rel in Binary 60.41 58.46 60.35 59.74 83.12 83.63 84.65 83.80 Count 56.69 31.1 59.37 49.05 83.25 85.11 85.4 84.59 Relative Count 49.16 38.23 58.55 48.65 69.51 84.63 85.17 79.77 TF-IDF 34.94 38.2 54.26 42.47 72.66 84.65 84.99 80.77 rel out Binary 47.62 60 56.71 54.78 80.67 82.36 84.41 82.48 Count 49.96 55.24 58.59 54.60 79.95 83.35 85.01 82.77 Relative Count 48.07 58.44 56.65 54.39 62.13 84.27 83.51 76.64 TF-IDF 40.15 54.78 58.51 51.15 69.91 84.4 84.16 79.49 r in & out Binary 59.44 58.57 56.47 58.16 86.17 85.14 86.45 85.92 Count 56.13 54.26 60.82 57.07 86.03 86.01 87.16 86.40 Relative Count 57.68 47.14 56.56 53.79 70.01 84.51 87.22 80.58 TF-IDF 40.17 46.21 58.46 48.28 75.15 84.86 86.19 82.07 9/29/2014 Ristoski, Paulheim 21
  21. Evaluation: Regression • Datasets: Dataset # instances # types # categories # rel in # rel out # rel in & rel out Auto MPG 391 264 308 227 370 597 Cities 212 721 999 1,304 1,081 2,385 • Methods: – Linear Regression – M5Rules – k-Nearest Neighbors (k=3) • Metrics for performance evaluation – RMSE • Results calculated using stratified 10-fold cross validation 9/29/2014 Ristoski, Paulheim 22
  22. Evaluation: Regression Datasets Auto MPG Cities Features Representation LR M5 k-NN Avg. LR M5 k-NN Avg. types Binary 3.952 3.056 3.63 3.546 24.303 18.793 22.164 21.753 Relative Count 3.843 2.952 3.571 3.455 18.046 19.696 33.569 23.770 TF-IDF 3.864 2.964 3.571 3.466 17.852 18.773 22.396 19.674 categories Binary 3.698 2.9 3.62 3.409 18.884 22.323 22.677 21.295 Relative Count 3.747 3 3.57 3.430 18.952 19.98 34.489 24.474 TF-IDF 3.782 2.9 3.56 3.416 19.02 22.323 23.189 21.511 rel in Binary 3.849 2.9 3.61 3.444 49.866 19.205 18.532 29.201 Count 3.892 3 4.62 3.824 138.041 19.915 19.27 59.075 Relative Count 3.976 2.9 3.57 3.488 122.365 22.335 18.877 54.526 TF-IDF 4.109 2.8 3.57 3.508 122.921 21.947 18.568 54.479 rel out Binary 3.792 3.1 3.6 3.490 20.008 19.364 20.918 20.097 Count 4.072 3 4.15 3.734 36.317 19.459 23.994 26.590 Relative Count 4.095 2.9 3.57 3.536 43.22 21.961 21.472 28.884 TF-IDF 4.135 3 3.57 3.572 28.845 20.852 22.212 23.970 r in & out Binary 3.991 3.1 3.67 3.572 40.803 18.803 18.211 25.939 Count 3.991 3.1 4.54 3.870 107.259 19.528 18.906 48.564 Relative Count 3.922 3 3.57 3.493 103.102 22.091 19.608 48.267 TF-IDF 3.982 3.01 3.57 3.523 115.373 20.623 19.702 51.899 9/29/2014 Ristoski, Paulheim 23
  23. Evaluation: Outlier Detection • Datasets: Dataset # instances # types # rel in # rel out # rel in & rel out DBpedia-Peel 2,083 39 586 322 908 DBpedia-DBTropes 4,228 128 912 2,155 3,067 • Methods: – k-NN Global Anomaly Score – GAS (k=25) – Local Outlier Factor – LOF (10<k<50) – Local Outlier Probability – LoOP (k=25) • Metrics for performance evaluation – Area under the ROC curve (AUC) • Results calculated on partial gold standard of 100 links 9/29/2014 Ristoski, Paulheim 24
  24. Evaluation: Outlier Detection Datasets Dbpedia-Peel Dbpedia-DBTropes Features Representation GAS LOF LoOP Avg. GAS LOF LoOP Avg. types Binary 0.386 0.487 0.554 0.476 0.503 0.627 0.605 0.578 Relative Count 0.385 0.398 0.595 0.459 0.503 0.385 0.314 0.401 TF-IDF 0.386 0.504 0.602 0.497 0.503 0.672 0.417 0.531 r in Binary 0.169 0.367 0.289 0.275 0.426 0.52 0.45 0.465 Count 0.2 0.285 0.29 0.258 0.503 0.59 0.602 0.565 Relative Count 0.293 0.496 0.452 0.414 0.589 0.555 0.493 0.546 TF-IDF 0.14 0.354 0.317 0.270 0.509 0.519 0.568 0.532 r out Binary 0.25 0.195 0.207 0.217 0.325 0.438 0.432 0.398 Count 0.539 0.455 0.391 0.462 0.547 0.578 0.522 0.549 Relative Count 0.542 0.544 0.391 0.492 0.618 0.601 0.513 0.577 TF-IDF 0.116 0.396 0.24 0.251 0.322 0.629 0.472 0.474 r in & out Binary 0.324 0.431 0.51 0.422 0.352 0.439 0.396 0.396 Count 0.527 0.368 0.454 0.450 0.57 0.563 0.527 0.553 Relative Count 0.603 0.744 0.616 0.654 0.667 0.672 0.657 0.665 TF-IDF 0.202 0.667 0.484 0.451 0.481 0.462 0.5 0.481 9/29/2014 Ristoski, Paulheim 25
  25. Conclusion • The chosen propositionalization strategy matters • No general recommendation for a strategy – What is the data mining task? – What are the characteristics of the dataset? – Which algorithm is going to be used? 9/29/2014 Ristoski, Paulheim 26
  26. Future Work • Conduct further experiments on more feature sets – Qualified incoming and outgoing relations – Combine features from multiple LOD sources • Conduct experiments on more data mining tasks – Clustering, Recommendation Systems etc… • More sophisticated strategies – Combination of statistical and semantic measures – Adaptation of weighting strategies used in text mining to overcome problems with erroneous data • Use the statistical measures for feature selection 9/29/2014 Ristoski, Paulheim 27
  27. RapidMiner LOD Extension • Simple wiring of operators – Importing – Linking – Feature generation – Data consolidation – Feature selection – Visualization 9/29/2014 Ristoski, Paulheim 28
  28. RapidMiner LOD Extension Local LOD Data link combine cleanse transform analyze 9/29/2014 Ristoski, Paulheim 29
  29. RapidMiner LOD Extension Data Enrichment Data Analysis Linking Feature Selection Schema Matching & Data Fusion 9/29/2014 Ristoski, Paulheim 30
  30. RapidMiner LOD Extension • Simple wiring of operators – Importing – Linking – Feature generation – Data fusion – Feature selection – Visualization • Try it out! – find “Linked Open Data” on the marketplace – Google Group: groups.google.com/forum/#!forum/rmlod 9/29/2014 Ristoski, Paulheim 31
  31. 32 A Comparison of Propositionalization Strategies for Creating Features from Linked Open Data 9/29/2014 Petar Ristoski, Heiko Paulheim
Advertisement