TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

TYPifier: Inferring the Type Semantics of Structured Data
Yongtao Ma, Thanh Tran
29th IEEE International Conference on Data Engineering (ICDE2013)


  1. TYPifier: Inferring the Type Semantics of Structured Data
     Yongtao Ma, Thanh Tran
     29th IEEE International Conference on Data Engineering (ICDE2013)
     Institute of Applied Informatics and Formal Description Methods (AIFB)
     KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association
     www.kit.edu
  2. Contents
     Institute of Applied Informatics and Formal Description Methods (AIFB), April 8th, 2013, ICDE2013, Brisbane
       • Introduction
       • TYPification Features
       • TYPification Algorithm
       • Evaluation
       • Conclusion
  3. Problem
     Type information is missing in:
       • Dynamic Web data
       • Heterogeneous enterprise data
  4. Problem
     Type information is missing in:
       • Dynamic Web data
       • Heterogeneous enterprise data

     | ID | Title | Price | Brand | Description |
     | --- | --- | --- | --- | --- |
     | p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W |
     | p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print |
     | p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary... |
     | p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame. |
     | p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device. |
     | p6 | Garmin Maps | 99 | Garmin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex...Coverage includes detailed maps for traveling in Australia. |
     | p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand… |
     | p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language… |
  5. Problem
     Type information is missing in:
       • Dynamic Web data
       • Heterogeneous enterprise data
     Typification: inferring the type semantics of structured data
  6. Contributions
       • We formulate typification as a clustering problem, where the goal is to identify a particular kind of cluster that represents the types of entities.
       • We propose a solution for automatically computing pseudo-schema features from data.
       • We propose TYPifier, a novel clustering algorithm for the typification problem, which:
           - is a divisive hierarchical clustering algorithm
           - is optimized for (pseudo-)schema-based features
           - determines the number of types (clusters) automatically
       • We show that typification helps to improve data integration.
  7. FEATURES FOR TYPIFICATION
  8. Schema Features
     Features characterize a type well if they are:
       • shared by most entities of that type
       • absent from the feature sets of entities that belong to other types
     Schema features: labels of attributes or relations, e.g. Resolution, but also HD and LED Tech for type TV
       • Advantage: better type indicators
       • Problem: often missing or scarce
       • Solution: derive pseudo-schema features
  9. Pseudo-schema Features
     Words in attribute values that act as schema features
       • TF-IDF
           - Measures the importance of a term for a document, relative to the other documents in the corpus
           - Representative for instances rather than types
       • Goal: learn the words in attribute values that are representative for types

     | ID | Title | Price | Brand | Description |
     | --- | --- | --- | --- | --- |
     | p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format : A4, Letter, B5, A5...Energy consumption in operation/stand-by: 285 W/5 W |
     | p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print |
     | p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary... |
     | p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame. |
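To illustrate why TF-IDF tends to favour instance-specific words over type-level words, here is a minimal sketch. The token lists are hand-picked from the four product descriptions and are an assumption, not the authors' tokenization; the weighting is plain term frequency times log inverse document frequency, which may differ from the variant used in the paper.

```python
import math
from collections import Counter

# Hypothetical, hand-tokenized descriptions for p1..p4 (assumption for illustration).
docs = {
    "p1": ["epson", "dpi", "ppm", "a4", "w"],
    "p2": ["hp", "w", "dpi", "ppm", "a4"],
    "p3": ["lg", "w", "hd", "led", "smart", "tv"],
    "p4": ["panasonic", "w", "led", "smart", "tv", "hd"],
}

def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for a small corpus."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

scores = tf_idf(docs)
# The brand name occurs in one document only, so it outranks "dpi",
# which is shared by both printers and is the better type indicator.
top_p1 = max(scores["p1"], key=scores["p1"].get)
print(top_p1, scores["p1"]["dpi"], scores["p1"]["w"])
```

In this toy corpus the highest-scoring term for p1 is the brand name, while a word shared by every instance scores zero, which matches the slide's remark that TF-IDF is representative for instances rather than types.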
  10. Pseudo-schema Features: Feature Co-occurrence Graph
      The feature co-occurrence graph is a weighted directed graph $G = (N, E, L)$ with:
        - $N$: the set of words in the attribute values
        - $E$: edges as ordered vertex pairs $(n_1, n_2)$, indicating that $n_1$ co-occurs with $n_2$ in the description of some instance
        - $L$: edge labels. Let $N_{n_1}$ and $N_{n_2}$ be the sets of instances whose descriptions contain $n_1$ and $n_2$. The label of edge $(n_1, n_2)$ is the conditional co-occurrence probability $p(n_2 \mid n_1) = |N_{n_1} \cap N_{n_2}| / |N_{n_1}|$
  11. Pseudo-schema Features
      Graph nodes shown: dpi, A4, ppm, W, Smart, TV, LED, HD

      | Instance | W | dpi |
      | --- | --- | --- |
      | p1 | X | X |
      | p2 | X | X |
      | p3 | X |   |
      | p4 | X |   |

      $N_W = \{p1, p2, p3, p4\}$, $N_{dpi} = \{p1, p2\}$
      $w(dpi \mid W) = |N_W \cap N_{dpi}| / |N_W| = 0.5$
      $w(W \mid dpi) = |N_W \cap N_{dpi}| / |N_{dpi}| = 1.0$
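The edge weights of the W/dpi example can be reproduced with a short sketch. The function name `cooccurrence_weights` and the dictionary layout are assumptions made for illustration; only the formula (intersection size divided by the size of the conditioning word's instance set) comes from the slides.

```python
def cooccurrence_weights(instances):
    """Return {(n1, n2): p(n2 | n1)} for ordered pairs of distinct co-occurring words."""
    # containing[n] is the set of instances whose description contains word n.
    containing = {}
    for inst, words in instances.items():
        for word in words:
            containing.setdefault(word, set()).add(inst)
    weights = {}
    for n1, s1 in containing.items():
        for n2, s2 in containing.items():
            shared = s1 & s2
            if n1 != n2 and shared:
                weights[(n1, n2)] = len(shared) / len(s1)
    return weights

# The W / dpi occurrence table from the slide.
instances = {"p1": {"W", "dpi"}, "p2": {"W", "dpi"}, "p3": {"W"}, "p4": {"W"}}
weights = cooccurrence_weights(instances)
print(weights[("W", "dpi")])  # p(dpi | W) = 2/4
print(weights[("dpi", "W")])  # p(W | dpi) = 2/2
```

The asymmetry is the point of using a directed graph: every instance mentioning dpi also mentions W, but only half of the instances mentioning W mention dpi.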
  12. Pseudo-schema Features
      Two words $v_1$ and $v_2$ are considered co-occurring if $p(v_2 \mid v_1) > \theta$ and $p(v_1 \mid v_2) > \theta$.
      Example on the slide: $\theta = 0.50$, applied to the graph over dpi, A4, ppm, W, Smart, TV, LED and HD, with the edge weights 0.5 and 1.0 shown.
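A minimal sketch of this mutual-threshold test follows. The helper name `mutual_edges` is hypothetical, and the dpi/ppm weights are an assumption derived from the product table (both words appear in p1 and p2); only the W/dpi weights are taken from the slides.

```python
def mutual_edges(weights, theta):
    """Keep an undirected edge {v1, v2} only if both directed weights exceed theta."""
    edges = set()
    for (v1, v2), p in weights.items():
        # A missing reverse edge counts as probability 0.
        if p > theta and weights.get((v2, v1), 0.0) > theta:
            edges.add(frozenset((v1, v2)))
    return edges

weights = {
    ("W", "dpi"): 0.5, ("dpi", "W"): 1.0,      # from the slide
    ("dpi", "ppm"): 1.0, ("ppm", "dpi"): 1.0,  # assumed from the product table
}
edges = mutual_edges(weights, theta=0.5)
print(edges)
```

Because the comparison is strict, the W/dpi pair is dropped at θ = 0.50 (0.5 is not greater than 0.5) even though the reverse direction has weight 1.0, while dpi/ppm survives.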
  13. Pseudo-schema Features: Maximum Clique
      [Figure: the thresholded co-occurrence graph over W, ppm, dpi, A4, HD, TV, Smart and LED, with maximum cliques marked]
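One way to extract cliques from the thresholded graph is the basic Bron-Kerbosch recursion, sketched below. This is a swapped-in textbook technique that enumerates maximal cliques, not necessarily the procedure the authors use, and the edge set is an assumption: it presumes that the printer words and the TV words each form a fully connected group and that W is left isolated after thresholding.

```python
from itertools import combinations

def build_adjacency(nodes, edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend it, x: already explored vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Assumed thresholded graph: two fully connected word groups plus the isolated word W.
nodes = ["W", "dpi", "ppm", "A4", "HD", "TV", "Smart", "LED"]
edges = list(combinations(["dpi", "ppm", "A4"], 2)) + \
        list(combinations(["HD", "TV", "Smart", "LED"], 2))
cliques = maximal_cliques(build_adjacency(nodes, edges))
for clique in cliques:
    print(sorted(clique))
```

Under these assumptions the printer-related words and the TV-related words each come out as one clique, and W, which appears in every description, stays on its own.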
  14. TYPIFICATION ALGORITHM
  15. Clusters
      A cluster is defined as a tuple $C(F, N, S)$:
        - $F$: the set of (pseudo-)schema features
        - $N$: the set of all entities that have an element in $F$ as a feature
        - $S$: the set of clusters that are either child or descendant nodes of $C$
      Cluster Distance (the formula itself is a figure on the slide and is not part of the transcript), defined in terms of:
        - $count(f_i, f_j)$: the co-occurrence count of features $f_i$ and $f_j$
        - $|NE_f|$: the count of entities having $f$ as a feature
        - $N_i$ ($N_j$): the entity set associated with $C_i$ ($C_j$)
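The cluster tuple can be sketched as a small data structure. The class and helper names (`Cluster`, `make_cluster`) and the toy entity/feature map are assumptions for illustration; the only part taken from the slide is that N is derived from F as the set of entities having at least one feature in F. The cluster distance is not modelled here because its formula is not available in the transcript.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    features: set                                    # F: (pseudo-)schema features
    entities: set = field(default_factory=set)       # N: entities with a feature in F
    subclusters: list = field(default_factory=list)  # S: child or descendant clusters

def make_cluster(features, entity_features):
    """Build a cluster whose entity set N follows from its feature set F."""
    features = set(features)
    entities = {e for e, fs in entity_features.items() if fs & features}
    return Cluster(features=features, entities=entities)

# Hypothetical entity -> feature map, loosely based on the product example.
entity_features = {
    "p1": {"dpi", "ppm", "A4"},
    "p2": {"dpi", "ppm", "A4"},
    "p3": {"HD", "LED"},
}
printers = make_cluster({"dpi"}, entity_features)
tvs = make_cluster({"HD", "LED"}, entity_features)
print(printers.entities, tvs.entities)
```

A list is used for S because dataclass instances with the default equality are not hashable; a real implementation might prefer a tree node type instead.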
  16. Cluster Relation
      Four cluster relations:
        • Ci > (>>) Cj: Ci is a parent (ancestor) of Cj
        • Ci < (<<) Cj: Ci is a child (descendant) of Cj
        • Ci = Cj: Ci and Cj represent the same cluster
        • Ci ≠ Cj: there is no relation between Ci and Cj
      Slide annotations: Evidence; No counter-evidence
  17. Typification
      Worked example. S* holds the children or descendants of the root; C holds the siblings of the root.
      Steps:
        1. Power < Root: add & split clusters
        2. Resolution < Power: add & split clusters
        3. Print Speed = Resolution: merge
        4. Split entities
      States:

      | State | S* | C |
      | --- | --- | --- |
      | 0 | root, Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level, Language | Empty |
      | 1 | Power, Resolution, Print Speed, LED, HD | platform, Media, Coverage, Level, Language |
      | 2 | Resolution, Print Speed | LED, HD |
      | 3 | Resolution, Empty | LED, HD |
      | 4 | Power, LED, HD | platform, Media, Coverage, Level, Language |
  18. EVALUATION
  19. Evaluation
      Baselines:
        • Hierarchical: BIRCH
        • Partitional: K-means++
        • Kernel-based: SVC
        • Density-based: OPTICS
      Datasets:
        • BTC
        • DBpedia (DBP)
        • Product Data (P)
            - PPS: using pseudo-schema features
            - PTFIDF: using TF-IDF features
            - PD: using all words

      | Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features |
      | --- | --- | --- | --- | --- | --- | --- |
      | BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | - |
      | DBP | 3,600 | 49,751 | 146 | 16 | 5 | - |
      | PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136 |
      | PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211 |
      | PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917 |
  20. Efficiency
      [Figure: running time in ms (log scale) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
        • TYPifier, K-means++ and BIRCH are similar in efficiency
        • Pseudo-schema features help to improve efficiency
  21. Effectiveness
      [Figure: F-measure (%) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]
        • TYPifier outperforms the baselines: +33.92% in F-measure compared to the second best
        • Pseudo-schema features outperform the other types of feature: +86.15% in F-measure compared to the second best
  22. Hierarchies
      TYPifier outperforms the baselines.
      [Figure: the original hierarchies next to the hierarchies generated by OPTICS, BIRCH and TYPifier]

      | Method | Tree Edit Distance |
      | --- | --- |
      | TYPifier | 12 |
      | OPTICS | 14 |
      | BIRCH | 24 |
  23. Parameter Sensitivity
        • Precision improves with higher θ, because the pseudo-schema features become more representative
        • Recall improves as θ increases from low values but drops at high values, because fewer and fewer pseudo-schema features can be generated
      [Figure: precision (%) and recall (%) for θ from 0.1 to 0.5, for TYPifier, KMeans++ and BIRCH]
  24. Parameter Sensitivity
        • The sensitivity to ε depends on feature correlations
        • Higher ε leads to better precision and recall
        • Extremely high ε may lead to poor-quality hierarchies
      [Figure: precision (%) and recall (%) for ε from 0.1 to 0.9, on DBP, BTC, P_PS and P_TFIDF]
  25. Conclusion
        • We introduce and formulate typification as a clustering problem
        • We learn pseudo-schema features from data
        • TYPifier is a divisive hierarchical clustering solution for typification
        • TYPifier outperforms the baselines by +33.92% in F-measure
        • Pseudo-schema features are essential for the baselines as well (they outperform other types of feature by +86.15% in F-measure)
        • TYPifier generates not only clusters but also hierarchies that closely match the human conceptualization / ground truth model
  26. Thank you for your attention! Questions?
      Thanh Tran, https://sites.google.com/site/kimducthanh/
