Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing, as well as other tasks that can be solved through such queries, can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimations can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising efficiency.

  • Queries contain query predicates for both structured and unstructured query constraints, resembling SPARQL queries with a FILTER contains function; unstructured query predicates = string predicates.
  • However, effective estimation of P(Q) is important for query optimizers, which rely on accurate estimates for intermediate query results.
  • A BN is a graphical representation of a set of conditional independencies and expresses a factorization of the joint distribution.
  • Given a template X(α1, …, αn), an entity skeleton of X is defined as E(α1, …, αn) ⊆ E(α1) × … × E(αn), where each E(αi) ⊆ VE specifies all possible entity assignments to αi.
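To make the definition concrete, here is a minimal Python sketch of an entity skeleton for a binary (relation) template; all entity names and edges below are invented for illustration, not taken from the paper's datasets:

```python
from itertools import product

# Hypothetical toy data for a relation template X(alpha1, alpha2),
# e.g. X_directedBy over (movie, person) pairs.
E_alpha1 = {"m1", "m2", "m3"}   # E(alpha1): possible movie entities
E_alpha2 = {"p1", "p2"}         # E(alpha2): possible person entities

# Observed directedBy edges in the data graph (assumed example):
directed_by = {("m1", "p1"), ("m2", "p1"), ("m3", "p2")}

# The entity skeleton E(alpha1, alpha2) is a subset of the cross
# product E(alpha1) x E(alpha2): the pairs that can instantiate the
# relation template.
skeleton = {pair for pair in product(E_alpha1, E_alpha2)
            if pair in directed_by}
```

Here the skeleton simply coincides with the observed edge set, restricted to the cross product of the two argument domains.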
  • In a relational context, data is stored in tables corresponding to relations captured by a conceptual model. Further, relation names are explicitly given in a query – stated in a FROM clause. Correspondingly, previous works [10, 23] employ a PRM to model selection predicates through random variables of the form XR.A, where R is a relational table and A is an attribute. For instance, XPerson.name = “Audrey” is a random variable capturing a selection on table Person where name equals “Audrey”. Analogously, join predicates are modeled as binary random variables that involve two explicitly specified tables. Further, schema information may be queried via class predicates, which are not supported in the relational setting.
  • Inferencing costs are driven by two factors: (1) the dependency structure of a BN, and (2) sample space sizes. Existing works on PRMs have focused on the former, targeting a lightweight, tree-shaped BN structure [23]. The latter aspect, however, is crucial, as CPD sizes are a mere reflection of sample space sizes. Essentially, for supporting string predicates with all possible keywords, Ω(Xa) must capture all words and phrases that occur in a's values. In order to compactly represent Ω, being a large set of strings, we propose the use of string synopses such as Markov tables [4], histograms [13] or n-gram synopses [25].
  • Then, the space Ba is reduced by using a decision criterion that dictates which n-grams ∈ Ba to include in a synopsis sample space Ω(Xa). That is, a synopsis space represents a subset of “important” n-grams. Note that n-gram synopses are the most accurate, as each synopsis element represents exactly one n-gram ∈ Ba – in contrast to, e.g., histograms. Recent work has outlined several such decision criteria [25].
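One such decision criterion, keeping only the k most frequent n-grams, can be sketched as follows; the word-level tokenization, the toy attribute values, and the restriction to 1-grams are simplifying assumptions, not the paper's exact procedure:

```python
from collections import Counter

def top_k_synopsis(values, n=1, k=3):
    """Build a toy top-k n-gram synopsis over a bag of attribute values.

    Counts word-level n-grams and keeps only the k most frequent ones
    (the decision criterion); all other n-grams fall outside the
    synopsis sample space Omega(X_a).
    """
    counts = Counter()
    for value in values:
        tokens = value.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return dict(counts.most_common(k))

# Assumed toy attribute values:
values = ["the dark knight", "the matrix", "dark city"]
synopsis = top_k_synopsis(values, n=1, k=2)
# "the" and "dark" each occur twice -> they form Omega(X_a)
```

Probabilities for the retained n-grams stay exact, while everything outside the synopsis space must be approximated heuristically, which is exactly the trade-off the decision criterion controls.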
  • A similar technique was recently applied for PRMs [6]. We impose that strong correlations among templates only occur if they share some common entities – they need to “talk about the same things” (Def. 2-a). We argue that there is a causal dependence (independence) between a class and an attribute (relation) template (Def. 2-b, -c). In other words, assigning an entity to a given class causally affects the probability of its attribute values, which in turn influences the probability of observing a particular relation.
  • Using a fixed structure allows us to decompose structure learning: first, learn “local” correlations between attribute/class templates; then reduce the network structure to capture only the “most important” correlations via a maximal spanning forest; finally, connect the forest of trees via relational templates.
  • Such a template-based approach has the merit of being compact: the number of templates is far smaller than the number of random variables in a ground BN. Structure and parameters (CPDs) are learned for templates only. At runtime, templates are instantiated with entities to construct a ground BN. For inferencing, a CPD learned for a template is shared among all random variables in the ground BN that instantiate that template.
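The CPD-sharing idea can be illustrated with a small sketch; the template name, entities, and CPD values below are hypothetical:

```python
# One CPD (template factor) is learned per template and shared by
# every ground random variable that instantiates that template.
cpd_title = {"matrix": 0.4, "dark": 0.6}   # assumed factor for X_title

class GroundVariable:
    """A ground random variable: a template bound to a skeleton entity."""
    def __init__(self, template, entity, cpd):
        self.template = template   # e.g. "title"
        self.entity = entity       # the entity this variable is bound to
        self.cpd = cpd             # shared object, not a per-entity copy

# Instantiating the title template for two entities reuses one CPD:
x1 = GroundVariable("title", "m1", cpd_title)
x2 = GroundVariable("title", "m2", cpd_title)
```

Because `x1.cpd` and `x2.cpd` reference the same object, the ground BN stores parameters once per template rather than once per random variable, which is where the compactness comes from.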
  • Two issues arise at estimation time: missing synopsis values and multiple value assignments.
  • DBLP as well as IMDB hold text-rich attributes like name, label or info; however, IMDB contains more text and exhibits strong correlations between textual and structured data. In particular, we noticed strong dependencies during structure learning between values of attributes such as label and info. Our hypothesis is that assuming independence hurts the quality of selectivity estimates on datasets that exhibit such correlations. We also used DBLP, which, on the other hand, shows almost no such correlations; using DBLP data, we expect accuracy differences to be less significant. Our workload includes queries containing [2, 11] predicates in total: [0, 4] relation, [1, 7] string, and [1, 4] class predicates (cf. Tab. 2).
  • The key factor driving overall synopsis size was the employed string synopsis. Experiments were run on a Linux server with two Intel Xeon 5140 CPUs (each with 2 cores at 2.33 GHz), 48 GB RAM (with 16 GB assigned to the JVM), and a RAID-10 with IBM SAS 148 GB 10k rpm disks. Before query execution, all OS caches were cleared.
  • Let sel(Q) and ŝel(Q) denote the exact and estimated selectivity for Q, respectively. Intuitively, me represents the factor by which ŝel(Q) under-/overestimates sel(Q). The best accuracy results were achieved by ind∗ and bn∗ having a size ≥ 20 MByte. Further, the results confirmed our conjecture that the degree of data correlation has a significant impact on the overall accuracy differences between the ind∗ and bn∗ approaches. That is, a high degree of correlation in the IMDB dataset translated to large accuracy differences, while the improvement bn∗ could achieve over the baseline was small for DBLP. For the IMDB dataset, bnsbf could reduce the errors of the indsbf approach by 93 %, while improvements were much smaller given DBLP. We noticed that the error increases with the number of predicates. This effect is expected, as more query predicates (hence more “difficult” queries) lead to an increasingly error-prone probability estimation. An interesting observation is that ind∗ outperformed bn∗ for some queries – see IMDB queries with 5 predicates and DBLP queries with 4 predicates (Fig. 4-b and -f). For instance, given IMDB query Q28, indtop-k achieved 13 % better results than bntop-k. In such cases, string query predicates were translated to multiple values (1-grams) that are assigned to one single random variable.
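The multiplicative error me can be computed as below; this is one common definition of the metric (cf. [17]), shown as a sketch:

```python
def multiplicative_error(estimated, exact):
    """Multiplicative error m_e: the factor by which the estimated
    selectivity under- or overestimates the exact one. Always >= 1
    for nonzero inputs, and symmetric in the error direction.
    """
    return max(estimated / exact, exact / estimated)

# A 2x underestimate and a 2x overestimate score the same factor:
multiplicative_error(50.0, 100.0)   # -> 2.0
multiplicative_error(200.0, 100.0)  # -> 2.0
```

A perfect estimate yields a factor of 1, which makes the metric easy to aggregate and compare across queries of very different absolute selectivities.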
  • For instance, for DBLP queries with string predicates name and label, there are no significant correlations in our BN; thus, the probabilities obtained by bn∗ were almost identical to those of ind∗. However, while ind∗ led to fairly good estimates for the overall query load on DBLP, we could achieve more accurate selectivity computations via bn∗ for specific “correlated” queries. For instance, for DBLP query Q1 we could obtain a 10 % better selectivity estimate.

    1. Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs. Andreas Wagner, Veli Bicer, and Duc Thanh Tran. EDBT/ICDT’13. Institute of Applied Informatics and Formal Description Methods (AIFB), KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association. www.kit.edu
    2. Introduction and Motivation; Selectivity Estimation for Text-Rich Data Graphs; Evaluation Results
    3. INTRODUCTION & MOTIVATION
    4. Text-Rich Data Graphs and Hybrid Queries. Increasing amount of semi-structured, text-rich data: structured data with unstructured texts (e.g., [1]), and text annotated with structured information (e.g., [2]). [1] DBpedia – A Crystallization Point for the Web of Data. [2] http://webdatacommons.org
    5. Text-Rich Data Graphs and Hybrid Queries (2). Focus of our work: conjunctive, hybrid queries, combining structured query predicates (relation, attribute) over the structure with unstructured „string“ (query) predicates („keyword“) over the text.
    6. Problem Definition. Problem: efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q. Decompose the problem using probabilistic models [5]: sel(Q) = R(Q) * P(Q). R(Q): upper-bound cardinality for the result set. P(Q): probability of Q having a non-empty result. Correlations between query predicates (data elements) make approximation of P(Q) hard and make estimations relying on „independence assumptions“ error-prone!
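A toy calculation (with made-up numbers, not from the paper) illustrates the decomposition and why independence assumptions are error-prone under correlated predicates:

```python
# sel(Q) = R(Q) * P(Q), with two query predicates A and B.
R_Q = 1000                 # upper-bound cardinality for the result set

p_a, p_b = 0.1, 0.1        # marginal probabilities of the two predicates
p_ab = 0.08                # assumed joint probability: strongly correlated

sel_independent = R_Q * p_a * p_b  # independence assumption: ~10 results
sel_joint = R_Q * p_ab             # captured dependency:      ~80 results
```

Under independence the estimate is p_a * p_b = 0.01, an 8x underestimate of the true joint probability 0.08, which is exactly the kind of error a dependency-aware model like BN+ avoids.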
    7. Contributions. Previous work focuses either on structured or on unstructured query constraints: join samples [3], graph synopses [4], and PRMs [5,6] on the one hand; fuzzy string matching [7,8] and extraction operators [9,10] on the other. We introduce a uniform model (BN+) for hybrid queries: an instance of a template-based BN well-suited for graph-structured data, extended with string synopses for the estimation of string predicates.
    8. SELECTIVITY ESTIMATION FOR TEXT-RICH DATA GRAPHS
    9. Preliminaries (1) – Data and Query Model. Data graph: class nodes, entity nodes, value nodes (bags of n-grams), relation edges, and attribute edges. Query: relation predicates, and keyword nodes attached via contains forming string predicates.
    10. Preliminaries (2) – Bayesian Networks (1). Recall: sel(Q) = R(Q) * P(Q). A Bayesian Network (BN) provides means for capturing joint probability distributions (e.g., P(Q)). A BN comprises a network structure and parameters: nodes = random variables, edges = dependencies.
    11. Preliminaries (3) – Bayesian Networks (2). A BN comprises a network structure and parameters.
    12. Preliminaries (4) – Bayesian Networks (3). Template-based BNs: templates and template factors [16]. A template is a function Χ(α1,…,αk), where each argument αi is a placeholder to be instantiated to obtain random variables, e.g., Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)} with entity skeleton {p1, p2, p3}. Template factors define probability distributions shared by all instantiated random variables of a given template, e.g., shared by all instantiations of XdirectedBy.
    13. Template-Based BN for Graph-Structured Data. We define a template for each: attribute a, Xa(α1), with entity skeleton: all entities having attribute a; class c, Xc(α1), with entity skeleton: all entities belonging to class c; relation r, Xr(α1,α2), with entity skeleton: all pairs of “source” and “target” entities having relation r. Examples: a template for relation spouse, a template for attribute title, a template for class person. Advantages: dynamic partitioning based on entity skeletons; the template representation is compact.
    14. Integration of String Synopses (1). Problem: large sample space for attribute-based templates (the entire n-gram space as Ω). In order to compactly represent Ω, being a large set of strings, we use string synopses (e.g., [7,8,9,10]). Intuitively, for an attribute-based template a string synopsis does: (a) decide how to “compactly represent” Ω; (b) compute probabilities for strings given its compact space. Some synopses even allow “guessing” probabilities for unknown strings.
    15. Integration of String Synopses (2). In this work, we use n-gram-based synopses [10]. Consider, e.g., the top-k n-gram synopsis [10]: compute n-gram counts and store only the top-k n-grams. Probabilities for known n-grams are exact; omitted n-grams are estimated based on heuristics using known n-grams.
    16. Learning of BN+ (1): Structure (1). We simplify the structure via product approximation using trees [11,12]; a similar technique has recently been applied for “Lightweight PRMs” [6]. Fixed structure assumption: (a) two templates X1 and X2 are conditionally independent given their parents if they do not share a common entity in their skeletons; (b) each class template Xc has no parent; (c) each relation template Xr is independent of any class template Xc, given its parents.
    17. Learning of BN+ (2): Structure (2). Using the fixed structure allows us to decompose structure learning: learn “local” correlations between attribute/class templates (e.g., Xmovie → Xtitle); reduce the network structure to capture only the “most important” correlations via a maximal spanning forest; relation templates connect the different trees. Overall, the network structure is determined by “overlapping” entity skeletons and the fixed structure assumption.
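The reduction to a maximal spanning forest can be sketched with Kruskal's algorithm run on descending correlation weights; the template names and weights below are assumed toy values (in practice the weights would be correlation scores such as mutual information):

```python
def max_spanning_forest(nodes, weighted_edges):
    """Kruskal's algorithm on descending weights -> maximum spanning forest."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    kept = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:               # adding the edge creates no cycle
            parent[ru] = rv
            kept.append((u, v, w))
    return kept

# Assumed correlation scores between three templates:
edges = [("X_movie", "X_title", 0.9),
         ("X_movie", "X_info", 0.7),
         ("X_title", "X_info", 0.4)]   # the weakest correlation is dropped
forest = max_spanning_forest({"X_movie", "X_title", "X_info"}, edges)
```

The forest keeps the two strongest correlations and discards the weakest one, yielding the lightweight tree-shaped structure the fixed structure assumption targets.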
    18. Learning of BN+ (3): Parameters. Based on the learned structure, parameters are learned by collecting sufficient statistics (i.e., frequency counts). Parameter learning is sped up by using queries to obtain sufficient statistics and by caching during structure/parameter learning.
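Collecting sufficient statistics amounts to frequency counting; a minimal sketch over an invented toy sample of (class, title-n-gram) observations:

```python
from collections import Counter

# Assumed toy sample of (class, title-1-gram) co-occurrences:
observations = [("movie", "dark"), ("movie", "dark"),
                ("movie", "matrix"), ("book", "dark")]

# Sufficient statistics: joint and marginal frequency counts.
counts = Counter(observations)
class_counts = Counter(c for c, _ in observations)

# CPD entries as maximum-likelihood relative frequencies,
# P(title-n-gram | class):
cpd = {(c, t): n / class_counts[c] for (c, t), n in counts.items()}
# e.g. P("dark" | movie) = 2/3
```

Because only counts are needed, the statistics can equally be gathered by issuing count queries against the store and cached across structure and parameter learning, as the slide notes.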
    19. Estimating P(Q) using BN+ (1). At runtime, templates are instantiated to construct a query-specific ground BN; each assignment is a string synopsis element.
    20. Estimating P(Q) using BN+ (2). Recall: sel(Q) = R(Q) * P(Q). Given a query-specific ground BN, we use inferencing to obtain the joint probability P(Q), with a “correction” using the string synopsis.
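For a small ground BN with fully observed assignments, inferencing reduces to the chain rule P(Q) = Π_i P(x_i | parents(x_i)); the following sketch uses assumed toy CPDs and a hypothetical three-variable structure, not values learned from data:

```python
# Assumed CPD fragments of a tiny query-specific ground BN:
# X_movie is a root; X_title depends on X_movie; X_directedBy on both.
cpds = {
    "X_movie": {(): 0.5},                       # P(movie)
    "X_title": {("movie",): 0.4},               # P(title="dark" | movie)
    "X_directedBy": {("movie", "dark"): 0.2},   # P(rel | class, title)
}

def joint(variables, parent_keys):
    """Chain rule: multiply each variable's CPD entry for its
    observed parent assignment."""
    p = 1.0
    for var, key in zip(variables, parent_keys):
        p *= cpds[var][key]
    return p

p_q = joint(["X_movie", "X_title", "X_directedBy"],
            [(), ("movie",), ("movie", "dark")])
# 0.5 * 0.4 * 0.2 = 0.04, then sel(Q) = R(Q) * P(Q)
```

In the full system, general BN inference and the string-synopsis correction replace this direct product, but the factorized computation is the same in spirit.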
    21. EVALUATION
    22. Evaluation (1) – Setting. Data: IMDB [14] and DBLP [15]; IMDB featured more correlations than DBLP, and the different results between DBLP and IMDB show the „relative benefit“. Queries: recent keyword search benchmarks [13,14]; we employed 54 DBLP queries and 46 IMDB queries. Systems: we used n-gram-based string synopses [10]: random samples of 1-grams, top-k 1-grams, and stratified bloom filters on 1-grams. String predicates were integrated via (1) an independence (ind) or (2) a conditional independence (bn) assumption.
    23. Evaluation (2) – Setting (2). Synopsis size: the overall synopsis size depends mainly on the string synopsis size; synopsis sizes ∈ {2, 4, 20, 40} MByte of memory. Metrics: efficiency: selectivity estimation time; effectiveness: multiplicative error [17].
    24. Evaluation (3) – Effectiveness – IMDB
    25. Evaluation (4) – Effectiveness – DBLP
    26. Evaluation (5) – Efficiency
    27. CONCLUSION
    28. Conclusion. We tackled the problem of selectivity estimation for conjunctive, hybrid queries. We propose a template-based BN, which is well-suited for graph-structured data; for string predicates, we further propose the integration of string synopses into this model. Experiments showed that: if there are correlations between un-/structured data elements, the accuracy of selectivity estimation can be greatly improved via BN+; the BN caused no overhead in terms of efficiency.
    29. QUESTIONS
    30. REFERENCES
    31. References. [1] Christian Bizer et al.: DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, pages 154–165, 2009. [2] http://webdatacommons.org/ [3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999. [4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006. [5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001. [6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011. [7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004. [8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.
    32. References (2). [9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007. [10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011. [11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968. [12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001. [13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011. [14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010. [15] http://knoesis.org/swetodblp/ [16] D. Koller and N. Friedman. Probabilistic graphical models. MIT press, 2009. [17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.
