HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization

Presented at IJCAI'16.

Published in: Science

  1. HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization
     Gong Cheng, Cheng Jin, Yuzhong Qu
     National Key Laboratory for Novel Software Technology, Nanjing University, China
     Websoft
  2. Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
  3. Scenario: browsing a dataset in an open data portal
     https://data.europa.eu/euodp/en/data/dataset/dgt-translation-memory
     I need some insight into the contents, not just metadata.
  4. Meeting the challenge with a dataset summary
     i.e., automatically generated small-sized, high-level abstraction of data, to summarize the contents of a dataset for quick inspection.
  5. Expected features of a dataset summary
     • To provide multigranular abstraction of data to be incrementally explored
     • To preserve the structural nature of a dataset
     • To be comprehensible
  6. Constitution of a dataset summary
     • An example
       • A hierarchical grouping of entities
       • Relations connecting sibling groups
       • A property-value pair differentiates a group of entities from sibling groups.
  7. Quality of a dataset summary
     • Coverage of data
     • Height of hierarchy
     • Cohesion within groups
     • Overlap between groups
     • Homogeneity of groups
  8. Quality of a dataset summary
     • Coverage of data
       • large subgroups, frequent relations
     • Height of hierarchy
     • Cohesion within groups
     • Overlap between groups
     • Homogeneity of groups
  9. Quality of a dataset summary
     • Coverage of data
     • Height of hierarchy
       • moderate-sized subgroups
     • Cohesion within groups
     • Overlap between groups
     • Homogeneity of groups
  10. Quality of a dataset summary
      • Coverage of data
      • Height of hierarchy
      • Cohesion within groups
        • informative (i.e., less frequent) property-value pairs
      • Overlap between groups
      • Homogeneity of groups
  11. Quality of a dataset summary
      • Coverage of data
      • Height of hierarchy
      • Cohesion within groups
      • Overlap between groups
        • controllable overlap
      • Homogeneity of groups
  12. Quality of a dataset summary
      • Coverage of data
      • Height of hierarchy
      • Cohesion within groups
      • Overlap between groups
      • Homogeneity of groups
        • different values of the same property
  13. Problem formulation: multidimensional knapsack problem (MKP)
      • maximizing moderateness of each subgroup
      • maximizing cohesion within each subgroup
      • disallowing large overlap between subgroups
      • selecting ≤k subgroups
      • (optionally) disallowing different properties
  14. Problem solution
      • A greedy strategy is used (sorting candidates by ) but its efficient implementation is non-trivial.
  15. Experiments
      • Baseline: LODeX (ISWC’14)
        • flat grouping
        • biased towards coverage (e.g., Type:Person)
        • redundant information (e.g., Type:Person and Type:Chair)
      • Advantages of HIEDS
        • hierarchical grouping
        • trade-off between coverage and cohesion (e.g., Type:Actor)
        • controllable overlap
  16. Details can be found in our poster!
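
Slides 13 and 14 describe selecting at most k subgroups greedily, subject to an overlap limit and an optional same-property restriction. The sketch below illustrates that general idea only; it is not the authors' algorithm. The ranking criterion is not legible in the slides, so the `score` field (a stand-in for some combination of moderateness and cohesion), the `overlap` measure, the `max_overlap` threshold, and the sample data are all hypothetical. The sketch also makes no attempt at the efficient implementation the slides call non-trivial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prop: str            # property, e.g. "Type"
    value: str           # value, e.g. "Actor"
    entities: frozenset  # entities that have this property-value pair
    score: float         # hypothetical moderateness/cohesion score (assumption)


def overlap(a, b):
    """Share of the smaller set that also lies in the other set (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def greedy_select(candidates, k, max_overlap, same_property=False):
    """Pick at most k candidates in descending score order.

    A candidate is skipped if it overlaps too much with an earlier pick or,
    when same_property is set, if it uses a different property than the first pick.
    """
    selected = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(selected) == k:
            break
        if same_property and selected and cand.prop != selected[0].prop:
            continue
        if any(overlap(cand.entities, s.entities) > max_overlap for s in selected):
            continue
        selected.append(cand)
    return selected


# Hypothetical sample data, for illustration only.
candidates = [
    Candidate("Type", "Person", frozenset({"e1", "e2", "e3", "e4"}), 0.5),
    Candidate("Type", "Actor", frozenset({"e1", "e2"}), 0.9),
    Candidate("Type", "Film", frozenset({"e5", "e6"}), 0.7),
    Candidate("Country", "China", frozenset({"e3", "e5"}), 0.6),
]
picked = greedy_select(candidates, k=2, max_overlap=0.5, same_property=True)
print([(c.prop, c.value) for c in picked])
```

Applying the same selection recursively inside each picked subgroup would yield a hierarchy, which is one plausible reading of the hierarchical grouping on slide 6.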
