Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

356 views

Published on

The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds), has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm, by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.

Published in: Science
  • Be the first to comment

  • Be the first to like this

SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

  1. 1. Value Invention in Data Exchange Patricia Arocena1 Boris Glavic2 Ren´ee J. Miller1 University of Toronto1 DBGroup Illinois Institute of Technology2 DBGroup SIGMOD 2013 - June 25, 2013 - New York, USA
  2. 2. Outline 1 Introduction 2 Linearization 3 Exploiting Source Constraints 4 Experiments 5 Conclusions
  3. 3. The Data Exchange Problem1 Schema Mappings M = (S, T, Σ) • Source Schema S and Target Schema T • High-level specification Σ • models the relationship between S and T Source Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M 1R. Fagin et al., Theor. Comput. Sci. 336 (2005). Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  4. 4. The Data Exchange Problem1 Schema Mappings M = (S, T, Σ) • Source Schema S and Target Schema T • High-level specification Σ • models the relationship between S and T Data Exchange • Given an instance of S Source Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M 1R. Fagin et al., Theor. Comput. Sci. 336 (2005). Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  5. 5. The Data Exchange Problem1 Schema Mappings M = (S, T, Σ) • Source Schema S and Target Schema T • High-level specification Σ • models the relationship between S and T Data Exchange • Given an instance of S • How to materialize a target instance of T? Source Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M 1R. Fagin et al., Theor. Comput. Sci. 336 (2005). Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  6. 6. Example Source Schema S Target Schema T Source Data Target Data MWorksOn(Department,Project,City) Source Schema S Target Schema T Source Data Target Data M Projects(PId, City, ManagerId)Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M IT Web Toronto IT Big Data Chicago Sales Mobile New York NULL Toronto NULL NULL Chicago NULL NULL New York NULL We usually create values to represent incomplete information! Slide 2 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  7. 7. Value Invention Source Schema S Target Schema T Source Data Target Data MWorksOn(Department,Project,City) Source Schema S Target Schema T Source Data Target Data M Projects(PId, City, ManagerId)Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M IT Web Toronto IT Big Data Chicago Sales Mobile New York f(Web) Toronto g(IT) f(Big Data) Chicago g(IT) f(Mobile) New York g(Sales) We usually create values to represent incomplete information! Slide 2 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  8. 8. Value Invention Source Schema S Target Schema T Source Data Target Data MWorksOn(Department,Project,City) Source Schema S Target Schema T Source Data Target Data M Projects(PId, City, ManagerId)Source Schema S Target Schema T Source Data Target Data MSource Schema S Target Schema T Source Data Target Data M Source Schema S Target Schema T Source Data Target Data M IT Web Toronto IT Big Data Chicago Sales Mobile New York f(Web) Toronto g(IT) f(Big Data) Chicago g(IT) f(Mobile) New York g(Sales) We usually create values to represent incomplete information! ∃f ∃g ( WorksOn (d, p, c) → Project (f (p), c, g(d)) ) Slide 2 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  9. 9. Our Goal • Understand when schema mappings specified by SO tgds • Flexible and precise value invention • . . . can be rewritten into nested GLAV mappings • Desirable computational and programatic properties Slide 3 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  10. 10. Skolem Functions • Introduced by Thoralf A. Skolem (1920s) • Widely used in Mathematical Logic and Computer Science Many important uses in Information Integration • to model object identifier (OID) inventiona aR. Hull, M. Yoshikawa, In VLDB (1990). Slide 4 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  11. 11. Skolem Functions • Introduced by Thoralf A. Skolem (1920s) • Widely used in Mathematical Logic and Computer Science Many important uses in Information Integration • to model object identifier (OID) invention • to express correlation semantics (e.g., grouping and data merging)abcd aL. Popa et al., In VLDB (2002). bA. Fuxman et al., In VLDB (2006). cL. Libkin, C. Sirangelo, J. Comput. Syst. Sci. 77 (2011). dB. Alexe et al., VLDB J. 21 (2012). Slide 4 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  12. 12. Skolem Functions • Introduced by Thoralf A. Skolem (1920s) • Widely used in Mathematical Logic and Computer Science Many important uses in Information Integration • to model object identifier (OID) invention • to express correlation semantics (e.g., grouping and data merging) • to provide a precise representation of missing and incomplete informationabc aY. Papakonstantinou et al., In VLDB (1996). bL. Popa et al., In VLDB (2002). cR. Fagin et al., TODS 30 (2005). Slide 4 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  13. 13. Schema Mapping Languages Various logical mapping formalisms • s-t tgds (also known as GLAV)a • Nested s-t tgds (nested GLAV)b • Second-Order (SO) tgdsc aR. Fagin et al., Theor. Comput. Sci. 336 (2005). bA. Fuxman et al., In VLDB (2006). cR. Fagin et al., TODS 30 (2005). Slide 5 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  14. 14. Schema Mapping Languages Various logical mapping formalisms • s-t tgds (also known as GLAV) • Nested s-t tgds (nested GLAV) • Second-Order (SO) tgds Expressiveness • SO tgds permits arbitrary Skolems!a • FO mapping languages have more desirable programmatic and computational propertiesb aR. Fagin et al., TODS 30 (2005). bB. ten Cate, P. Kolaitis, In ICDT (2009). Slide 5 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  15. 15. Characterization of Mapping Languages234 Property GLAV nested GLAV SO tgds Composition Not closed Not closed Closed Value Invention No Linear Fully customized correlation correlation correlation Target Homomorphisms Closed Closed Not closed Model Checking PTIME PTIME NP-Complete 2R. Fagin et al., Theor. Comput. Sci. 336 (2005). 3R. Fagin et al., TODS 30 (2005). 4B. ten Cate, P. Kolaitis, In ICDT (2009). Slide 6 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  16. 16. The Quest for FO Rewritability Rewritability • Many SO tgds are equivalent to FO mappings! • We call this FO/GLAV/nested GLAV rewritable • Some SO tgds are not FO rewritablea • . . . Even testing for FO rewritability is undecidableb aR. Fagin et al., TODS 30 (2005). bI. Feinerer et al., In AMW (2011). Nash, Bernstein and Melnik • First sufficient condition for GLAV rewritabilitya • Tailored to consider SO tgds produced by mapping composition aA. Nash et al., TODS 32 (2007). Slide 7 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  17. 17. Our Contributions 1 Sufficient condition for nested GLAV rewritability of SO tgds 2 Linearize: • PTIME algorithm for rewriting SO tgds 3 Equivalence preserving transformation of SO tgds using source semantics 4 LinearizeFDs: • PTIME algorithm for rewriting SO tgds using source FDs 5 Extensive experimental evaluation • STBenchmark 2.0a • Real-life mapping scenarios aP. C. Arocena et al., “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013). Slide 8 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction
  18. 18. Outline 1 Introduction 2 Linearization 3 Exploiting Source Constraints 4 Experiments 5 Conclusions
  19. 19. Intuition of Rewriting Rewrite SO tgds into nested GLAV • Replace second-order existentials with first-order existentials • ∃f (x) → ∃vf • Apply logical equivalence of Skolemization in reverse direction • May have to reorder universal quantifiers to create ∀x Skolemization Equivalence ∀x∃vf δ(x, vf ) ≡ ∃f ∀x δ(x, vf )[vf ← f (x)] Slide 9 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  20. 20. UnSkolemization Revisited Example: Key Invention Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f (∀d∀p∀b WorksOn (d, p, b) → Project (f (d, p), b)) Slide 10 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  21. 21. UnSkolemization Revisited Example: Key Invention Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f (∀d∀p∀b WorksOn (d, p, b) → Project (f (d, p), b)) We need to introduce ∃vf nested within the scope of d and p Slide 10 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  22. 22. UnSkolemization Revisited Example: Key Invention Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f (∀d∀p∀b WorksOn (d, p, b) → Project (f (d, p), b)) We need to introduce ∃vf nested within the scope of d and p ∀d∀p∃vf ∀b WorksOn (d, p, b) → Project (vf , b) Slide 10 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  23. 23. Sufficient Rewriting Condition Approach • When can Unskolemization be applied to all Skolems of SO tgd? • Adapt notions from SO quantifier elimination methodsa • Consistency: • OK: . . . f (a) . . . f (a) → ∀a∃vf • NOT OK: . . . f (a) . . . f (b) → ∀a∃vf ∀b∃vf • Linearity: • OK: . . . f (a) . . . g(a, b) → ∀a∃vf ∀b∃vg • NOT OK:. . . f (a, b) . . . g(b, c) → ∀a∀b∃vf ∀c∃vg • Partitioning scheme for multi-clause SO tgds aD. Gabbay et al., Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications, (College Publications, 2008). Slide 11 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  24. 24. Sufficient Rewriting Condition Approach • When can Unskolemization be applied to all Skolems of SO tgd? • Adapt notions from SO quantifier elimination methods • Consistency: • OK: . . . f (a) . . . f (a) → ∀a∃vf • NOT OK: . . . f (a) . . . f (b) → ∀a∃vf ∀b∃vf • Linearity: • OK: . . . f (a) . . . g(a, b) → ∀a∃vf ∀b∃vg • NOT OK:. . . f (a, b) . . . g(b, c) → ∀a∀b∃vf ∀c∃vg • Partitioning scheme for multi-clause SO tgds Theorem: Linearity Given an SO tgd θ without equalities between or with Skolem terms • Consistent • Linear ⇒ θ can be rewritten as nested GLAV Slide 11 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  25. 25. Linearize Algorithm Properties of the Algorithm • Rewrites an SO tgd into nested GLAV • PTIME • Size of resulting formula is linear in the size of the input Linearize(θ) 1 Partition θ into independent sub-formulas (maximal partitioning Π) 2 For each partition • Check consistency and linearity 3 If all partitions are linear and consistent then • Rewrite θ into Ω Slide 12 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Linearization
  26. 26. Outline 1 Introduction 2 Linearization 3 Exploiting Source Constraints 4 Experiments 5 Conclusions
  27. 27. A Note on Linearity • Linearity is an syntactic but not a semantic condition • ⇒ There is hope that an equivalent mapping exists that is linear • ⇒ Approach: Find an equivalent mapping that is linear • Modify Skolem arguments? Non-Linear SO tgd θ Linear SO tgd θ nested GLAV Ω Equivalence Preserving Transformation Linearize Slide 13 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  28. 28. Using Source Functional Dependencies So far • Only considered an SO tgd θ • Have not considered additional knowledge that may be available Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p), g(b, a)) Slide 14 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  29. 29. Using Source Functional Dependencies Source constraints • Functional dependencies (FDs) ΣS that hold over the source • Primary keys (and other FDs if available) • FDs imply dependencies between the arguments of Skolem terms Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) WorksOn: Department, Project → BudgetId Audit: BudgetId → Auditor ∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p), g(b, a)) Implied FD1 : d, p → b, a Implied FD2 : b → a Slide 14 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  30. 30. Using Source Functional Dependencies Source constraints • Functional dependencies (FDs) ΣS that hold over the source • Primary keys (and other FDs if available) • FDs imply dependencies between the arguments of Skolem terms • FD x → y be used to augment Skolem arguments: f (x, z) → f (x, z, y) Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) WorksOn: Department, Project → BudgetId Audit: BudgetId → Auditor ∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p, b, a), g(b, a)) Implied FD1 : d, p → b, a Implied FD2 : b → a Slide 14 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  31. 31. Equivalence Preserving Transformation Approach • Augment Skolem arguments using implied FDs (Re-Skolemization) • Result θ that is equivalent as long as the FDs hold. Non-Linear SO tgd θ and source FDs ΣS Linear SO tgd θ and source FDs ΣS nested GLAV Ω Re-Skolemize using implied FDs Linearize Theorem: Re-Skolemization with FDs preserves equivalence Given an implied source FD x → y valid over θ: θ[f (x) ← f (x, y)] ∪ ΣS ≡ θ ∪ ΣS Slide 15 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  32. 32. Why Augmentation? Does Re-Skolemization affect Linearity? • Augmentation (θ[f (x) ← f (x, y)]) • θaug : Result of applying augmentation until no longer possible • Minimization (θ[f (x, y) ← f (x)])a • θmin : Result of applying minimization until no longer possible aB. Marnette et al., PVLDB 3 (2010). Theorem: Only augmentation preserves Linearity Linear(θ) → Linear(θaug ) Linear(θ) → Linear(θmin ) Slide 16 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  33. 33. LinearizeFDs Algorithm Properties of the Algorithm • Rewrites SO tgd into nested GLAV • PTIME • Size of resulting formula is linear in the size of the input LinearizeFDs(θ,ΣS ) 1 Compute implied FDs 2 Augment arguments of each Skolem term based on FDs • Using attribute closure • Result: θaug 3 Return Linearize(θaug ) Slide 17 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Exploiting Source Constraints
  34. 34. Outline 1 Introduction 2 Linearization 3 Exploiting Source Constraints 4 Experiments 5 Conclusions
  35. 35. Mapping Generator and Experiments STBenchmark • Generator for data exchange scenariosa • Schemas, Data and Mappings • Construct complex mappings from simple primitives • e.g., Horizontal Partitioning (HP) • Parameterized and randomized (e.g., join path length) aB. Alexe et al., PVLDB 1 (2008). Extensions • Arbitrary Skolem terms (SO tgds) • New primitives (e.g., Adding and Deleting Attributes, etc.) • Combining primitives into more complex mappings • e.g., simulating composition and complex correlations • Primary Keys (PKs) and Functional dependencies (FDs) Slide 18 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Experiments
  36. 36. Random Scenarios • 12,500,000 randomly generated mapping scenarios • Measure success rate • Compare NBM, Linearize, LinearizeFDs, LinearizeMin • NBM is only rewriting into GLAV! Slide 19 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Experiments
  37. 37. Effect of Primary Keys • Activate/Deactivate source PKs • Vary amount of non-PK FDs 0% 20% 40% 60% 80% 100% No PKs With PKs No PKs With PKs No PKs With PKs SOURCE FDs = 0% SOURCE FDs = 25% SOURCE FDs = 50% SuccessRate Linearize LinearizeFDs Slide 20 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Experiments
  38. 38. Outline 1 Introduction 2 Linearization 3 Exploiting Source Constraints 4 Experiments 5 Conclusions
  39. 39. Conclusions Rewriting SO-tgds → nested GLAV • Linearization • SO tgd is linear → can be rewritten • Equivalence preserving Re-Skolemization • Using source FDs to augment Skolem arguments Experimental and Theoretical Results • Using FDs improves chance to rewrite • 78% increased success rate • Primary keys are most effective • > 75% increased success rate • Augmentation is better than minimization • about 16% increased success rate Slide 21 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Conclusions
  40. 40. Future Work Integrate insights on Re-Skolemization into . . . • Mapping operators such as • Composition • MapMerge • Mapping generation FO Rewritability of SO tgds • Combine our sufficient condition with that of [NBM07]a • we know how to do it! • Exploit Augmentation and Minimization together • to simplify and optimize SO mappings • Use target FDs aA. Nash et al., TODS 32 (2007). Slide 22 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Conclusions
  41. 41. Questions? Slide 23 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Conclusions
  42. 42. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  43. 43. R. Fagin, P. Kolaitis, R. J. Miller, L. Popa, Data Exchange: Semantics and Query Answering. Theor. Comput. Sci. 336 (2005). R. Hull, M. Yoshikawa, ILOG: Declarative Creation and Manipulation of Object Identifiers. In VLDB (1990). L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hern´andez, R. Fagin, Translating Web Data. In VLDB (2002). A. Fuxman et al., Nested Mappings: Schema Mapping Reloaded. In VLDB (2006). L. Libkin, C. Sirangelo, Data Exchange and Schema Mappings in Open and Closed Worlds. J. Comput. Syst. Sci. 77 (2011). B. Alexe, M. A. Hern´andez, L. Popa, W. C. Tan, MapMerge: Correlating Independent Schema Mappings. VLDB J. 21 (2012). Slide 1 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: References
  44. 44. Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina, Object Fusion in Mediator Systems. In VLDB (1996). R. Fagin, P. Kolaitis, L. Popa, W.-C. Tan, Composing Schema Mappings: Second-Order Dependencies to the Rescu TODS 30 (2005). B. ten Cate, P. Kolaitis, Structural Characterizations of Schema-Mapping Languages. In ICDT (2009). I. Feinerer, R. Pichler, E. Sallinger, V. Savenkov, On the Undecidability of the Equivalence of Second-Order Tuple Genera In AMW (2011). A. Nash, P. Bernstein, S. Melnik, Composition of Mappings Given by Embedded Dependencies. TODS 32 (2007). P. C. Arocena, M. D’Angelo, B. Glavic, R. J. Miller, “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013). Slide 2 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: References
  45. 45. D. Gabbay, R. Schmidt, A. Szalas, Second Order Quantifier Elimination: Foundations, Computational Aspe (College Publications, 2008). B. Marnette, G. Mecca, P. Papotti, Scalable Data Exchange with Functional Dependencies. PVLDB 3 (2010). B. Alexe, W. C. Tan, Y. Velegrakis, STBenchmark: Towards a Benchmark for Mapping Systems. PVLDB 1 (2008). Slide 3 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: References
  46. 46. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  47. 47. Notation GLAV (s-t tgds): ∀z, x(φ(z, x) → ∃yψ(x, y)) ∀d ∀p ∀b Works(d, p, b) → ∃y1 ∃y2 Project(y1, b, y2) - nested GLAV: Q(x, y)((φ1(x) → ψ1(x, y)) ∧ . . . ∧ (φn(x) → ψn(x, y))), where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y ∀d ∃y1 ∀p ∃y2 ∀b Works(d, p, b) → Project(y1, b, y2) SO tgds: ∃f( (∀x1(φ1 → ψ1)) ∧ · · · ∧ (∀xn(φn → ψn)) ) Note: we usually omit universal quantifiers ∃f ∃g(Works(d, p, b) → Project(f (p), b, g(d))) Slide 4 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Notation
  48. 48. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  49. 49. Model Checking Complexity • NP-complete for SO tgds vs. P for nested GLAV • Are we only solving the simple cases? Approach • Find an SO tgd for which model checking is hard • But can be rewritten using (implied) source FDs Slide 5 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: A Note on Complexity
  50. 50. Model Checking: 3-colorability Schema Mappings θ = ∀X, Y :E(X, Y ) → C(f (X), g(Y )) V (X, Y ) → S(f (X), g(Y )) Not linear! Instance • The source relations encode an undirected graph G • For each edge (x, y) we create two tuples E(x, y) and E(y, x) • For each vertex x we create a tuple V (x, x) • The target relations represent a coloring of the vertexes of G using three colors r, g, and b • C: (r, g), (r, b), (g, r), (g, b), (b, r), (b, g) - colors of adjacent nodes • S: (r, r), (g, g), (b, b) - colors of vertexes Theorem: Model Checking is 3-colorability G is 3-colorable if θ holds over I Slide 6 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: A Note on Complexity
  51. 51. Model Checking: 3-colorability Schema Mappings θ = ∀X, Y :E(X, Y ) → C(f (X, Y ), g(Y )) V (X, Y ) → S(f (X, Y ), g(Y )) Linear! FD: X → Y Instance • The source relations encode an undirected graph G • For each edge (x, y) we create two tuples E(x, y) and E(y, x) • For each vertex x we create a tuple V (x, x) • The target relations represent a coloring of the vertexes of G using three colors r, g, and b • C: (r, g), (r, b), (g, r), (g, b), (b, r), (b, g) - colors of adjacent nodes • S: (r, r), (g, g), (b, b) - colors of vertexes Theorem: Model Checking is 3-colorability G is 3-colorable if θ holds over I Slide 6 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: A Note on Complexity
  52. 52. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  53. 53. SO tgds are closed conjunction • Every SO tgd θ can be written as a set of clauses (φi → ψi ) • Splitting this set of clauses to form new SO tgds θ1, . . . θn is equivalence preserving • If θi and θj share none Skolems (they are uncorrelated) • then θi and θj can be rewritten independently Slide 7 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Partitioning Scheme
  54. 54. Maximal Partition Maximal Partition • Given an SO tgd θ • Partition clauses into Π = π1, . . . , πn 1 No πi and πj share any skolems 2 There is no Π with more elements than Π that fulfills condition 1) Theorem: Rewritability and Maximal Partitions Rewritable (θ) ⇔ ∀i : Rewritable (πi ) Slide 8 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Partitioning Scheme
  55. 55. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  56. 56. Real-Life Mappings • Three real-life mapping scenarios from the literature • Created SO tgds based on • Semantics of the schemas • Documented data transformations • Compared all rewriting techniques Slide 9 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Experiments
  57. 57. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  58. 58. Towards STBenchmark 2.0 Noteworthy Features • Support for arbitrary Skolem Functions (SO tgds) and various Skolemization modes (e.g., Key, All and Random) • Simulating some cases of composition using Skolem Noise • Reuse of source schema elements using Source Reuse • PKs and random multi-attribute FDs over the source Usability Case? • Thinking about comparing different notions of mapping inverse • Any suggestions? Slide 10 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: STBenchmark 2.0
  59. 59. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  60. 60. Notation GLAV (s-t tgds): ∀z, x(φ(z, x) → ∃yψ(x, y)) ∀d ∀p ∀b Works(d, p, b) → ∃y1 ∃y2 Project(y1, b, y2) nested GLAV: Q(x, y)((φ1(x) → ψ1(x, y)) ∧ . . . ∧ (φn(x) → ψn(x, y))), where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y ∀d ∃y1 ∀p ∃y2 ∀b Works(d, p, b) → Project(y1, b, y2) SO tgds: ∃f( (∀x1(φ1 → ψ1)) ∧ · · · ∧ (∀xn(φn → ψn)) ) Note: we usually omit universal quantifiers ∃f ∃g(Works(d, p, b) → Project(f (p), b, g(d))) Slide 11 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Linearization Examples
  61. 61. Nesting Example: Skolems with Overlapping Arguments Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f ∃g( WorksOn (d, p, b) → Dept (d, f (d), p, g(d, p))) Slide 12 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Linearization Examples
  62. 62. Nesting Example: Skolems with Overlapping Arguments Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f ∃g( WorksOn (d, p, b) → Dept (d, f (d), p, g(d, p))) We need to introduce two ∃ quantifiers without violating the dependencies modeled by f and g Slide 12 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Linearization Examples
  63. 63. Nesting and Linearization Example: Skolems with Overlapping Arguments Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) ∃f ∃g( WorksOn (d, p, b) → Dept (d, f (d), p, g(d, p))) We need to introduce two ∃ quantifiers without violating the dependencies modeled by f and g ∀d∃vf ∀p∃vg ∀b WorksOn (d, p, b) → Dept (d, vf , p, vg ) Slide 12 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Additional Linearization Examples
  64. 64. Appendix 6 References 7 Additional Notation 8 A Note on Complexity 9 Partitioning Scheme 10 Additional Experiments 11 STBenchmark 2.0 12 Additional Linearization Examples 13 Example Augmentation
  65. 65. Augmentation is better than Minimization Source Schema WorksOn (Department, Project, BudgetId) Audit (BudgetId, Auditor) City (Department, City) Target Schema Project (PId, BudgetId) Dept (Dept, Year, Project, NumEmp) Location (Department, DepId, City, State) Budget (Project, Leader, Size) WorksOn: Department, Project → BudgetId Audit: BudgetId → Auditor θ = ∃f ∃g WorksOn (d, p, b) ∧ Audit (b, a) → Budget (p, f (d, p), g(b, a)) θaug = ∃f ∃g . . . → Budget (p, f (d, p, b, a), g(b, a)) θmin = ∃f ∃g . . . → Budget (p, f (d, p), g(b)) FD1 : d, p → b, a FD2 : b → a Slide 13 of 4 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Example Augmentation

×